
2019 CVPR Accepted Papers

Maintained by Matt Deitke
Adapted from Andrej Karpathy
(data from here)
Below every paper are the TOP 100 most-occurring words in that paper; each word is colored according to an LDA topic model with k = 7.
(It looks like 0 = videos, 1 = geometry, 2 = image processing, 3 = neural network pruning, 4 = captioning, 5 = segmentation, 6 = unsupervised learning)
Finding Task-Relevant Features for Few-Shot Learning by Category Traversal
Hongyang Li, David Eigen, Samuel Dodge, Matthew Zeiler, Xiaogang Wang


Few-shot learning is an important area of research. Conceptually, humans are readily able to understand new concepts given just a few examples, while in more pragmatic terms, limited-example training situations are common practice. Recent effective approaches to few-shot learning employ a metric-learning framework to learn a feature similarity comparison between a query (test) example, and the few support (training) examples. However, these approaches treat each support class independently from one another, never looking at the entire task as a whole. Because of this, they are constrained to use a single set of features for all possible test-time tasks, which hinders the ability to distinguish the most relevant dimensions for the task at hand. In this work, we introduce a Category Traversal Module that can be inserted as a plug-and-play module into most metric-learning based few-shot learners. This component traverses across the entire support set at once, identifying task-relevant features based on both intra-class commonality and inter-class uniqueness in the feature space. Incorporating our module improves performance considerably (5%-10% relative) over baseline systems on both miniImageNet and tieredImageNet benchmarks, with overall performance competitive with the most recent state-of-the-art systems.
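
As a rough illustration of the idea (not the paper's exact concentrator/projector architecture; PyTorch assumed), the sketch below builds a task-specific feature mask from the whole support set, up-weighting dimensions with large inter-class spread and small intra-class spread, and uses it to re-weight prototype distances.

import torch

# Illustrative sketch, not the paper's CTM architecture; PyTorch assumed.
def task_relevant_mask(support, labels, eps=1e-6):
    """support: (N, D) support embeddings; labels: (N,) class ids for the current task."""
    classes = labels.unique()
    protos = torch.stack([support[labels == c].mean(0) for c in classes])                 # (C, D)
    intra = torch.stack([support[labels == c].var(0, unbiased=False) for c in classes]).mean(0)
    inter = protos.var(0, unbiased=False)                                                 # (D,)
    # up-weight dimensions that separate classes well and are stable within a class
    return torch.softmax(inter / (intra + eps), dim=0) * support.size(1)

def masked_logits(query, support, labels):
    mask = task_relevant_mask(support, labels)
    classes = labels.unique()
    protos = torch.stack([support[labels == c].mean(0) for c in classes])
    # negative squared distance to each prototype in the re-weighted feature space
    return -(((query.unsqueeze(1) - protos.unsqueeze(0)) ** 2) * mask).sum(-1)

In an actual metric-learning pipeline the mask would be produced by a learned module and applied before the similarity comparison.
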
[incorporate] [note, shape, matching, problem] [based, projector, comparison, color, method, proposed, image, component] [network, net, accuracy, performance, output, neural, table, compared, better, effective, entire, number, design, weight, size, implementation, compare] [query, model, relevant, find, improved, simple, making, arxiv] [feature, module, relation, category, baseline, cnn, mask, backbone, improvement] [support, ctm, learning, set, class, metric, task, concentrator, traversal, training, embeddings, test, prototypical, similarity, dimension, miniimagenet, data, classification, discriminative, large, loss, tieredimagenet, paper, learn, extractor, labeled, trained, distance, traversing, existing, base, commonality]
@InProceedings{Li_2019_CVPR,
  author = {Li, Hongyang and Eigen, David and Dodge, Samuel and Zeiler, Matthew and Wang, Xiaogang},
  title = {Finding Task-Relevant Features for Few-Shot Learning by Category Traversal},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Edge-Labeling Graph Neural Network for Few-Shot Learning
Jongmin Kim, Taesup Kim, Sungwoong Kim, Chang D. Yoo


In this paper, we propose a novel edge-labeling graph neural network (EGNN), which adapts a deep neural network on the edge-labeling graph, for few-shot learning. The previous graph neural network (GNN) approaches in few-shot learning have been based on the node-labeling framework, which implicitly models the intra-cluster similarity and the inter-cluster dissimilarity. In contrast, the proposed EGNN learns to predict the edge-labels rather than the node-labels on the graph, which enables the evolution of an explicit clustering by iteratively updating the edge-labels with direct exploitation of both intra-cluster similarity and inter-cluster dissimilarity. It is also well suited for performing on various numbers of classes without retraining, and can be easily extended to perform transductive inference. The parameters of the EGNN are learned by episodic training with an edge-labeling loss to obtain a well-generalizable model for the unseen low-data problem. On both supervised and semi-supervised few-shot image classification tasks with two benchmark datasets, the proposed EGNN significantly improves performance over the existing GNNs.
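
A minimal sketch of one round of edge-labeling message passing, assuming PyTorch; it simplifies the paper's EGNN blocks. Edge labels are recomputed from pairwise node differences and then used to re-aggregate node features.

import torch
import torch.nn as nn

# Illustrative sketch of one edge-labeling round, not the paper's full EGNN; PyTorch assumed.
class EdgeNodeUpdate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, nodes):
        # nodes: (N, D) features for all support and query examples in the episode
        diff = (nodes.unsqueeze(1) - nodes.unsqueeze(0)).abs()   # (N, N, D) pairwise differences
        edges = torch.sigmoid(self.edge_mlp(diff)).squeeze(-1)   # (N, N) similarity edge labels
        agg = (edges / edges.sum(-1, keepdim=True)) @ nodes      # edge-weighted neighbour average
        nodes = self.node_mlp(torch.cat([nodes, agg], dim=-1))   # fuse own and aggregated features
        return nodes, edges

Stacking a few such rounds and reading off the edges between a query node and the labeled support nodes gives the class prediction.
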
[graph, framework, prediction, previous, iteratively, perform, propagation, consists] [algorithm, problem, alternative, initial, explicit, well, allows] [proposed, figure, image, based, presented, separate, method] [neural, network, number, table, inference, deep, compared, performance, convolutional, layer, aggregation, parameter, process] [node, query, model, attention, richard] [edge, feature] [learning, egnn, support, training, set, classification, labeled, gnn, similarity, transductive, update, task, test, clustering, loss, episodic, data, large, prototypical, setting, miniimagenet, label, supervised, gnns, tieredimagenet, existing, embedding, unlabeled, fewshot, representation, metric, train, learn, reptile]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Jongmin and Kim, Taesup and Kim, Sungwoong and Yoo, Chang D.},
  title = {Edge-Labeling Graph Neural Network for Few-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Generating Classification Weights With GNN Denoising Autoencoders for Few-Shot Learning
Spyros Gidaris, Nikos Komodakis


Given an initial recognition model already trained on a set of base classes, the goal of this work is to develop a meta-model for few-shot learning. The meta-model, given as input some novel classes with few training examples per class, must properly adapt the existing recognition model into a new model that can correctly classify in a unified way both the novel and the base classes. To accomplish this goal it must learn to output the appropriate classification weight vectors for those two types of classes. To build our meta-model we make use of two main innovations: we propose the use of a Denoising Autoencoder network (DAE) that (during training) takes as input a set of classification weights corrupted with Gaussian noise and learns to reconstruct the target-discriminative classification weights. In this case, the injected noise on the classification weights serves the role of regularizing the weight generating meta-model. Furthermore, in order to capture the co-dependencies between different classes in a given task instance of our meta-model, we propose to implement the DAE model as a Graph Neural Network (GNN). In order to verify the efficacy of our approach, we extensively evaluate it on ImageNet based few-shot benchmarks and we report state-of-the-art results.
[recognition, graph, work, perform, recognize, consists, state] [initial, estimate, linear, provide, parametric, formulation, reconstruction] [input, noise, reconstruct, based, denoising, figure, prior, conference, image, ieee] [weight, neural, network, order, validation, deep, architecture, layer, output, gaussian, gradient, performance, aggregation, injected, implement, imagenet, apply] [model, arxiv, preprint, node, generation, vector, implemented, visual, example, goal, message] [feature, instance] [classification, novel, training, dae, set, base, learning, function, gnn, test, class, data, learn, task, loss, miniimagenet, update, trained, autoencoder, extractor, dtr, knowledge, train, classifier]
@InProceedings{Gidaris_2019_CVPR,
  author = {Gidaris, Spyros and Komodakis, Nikos},
  title = {Generating Classification Weights With GNN Denoising Autoencoders for Few-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Kervolutional Neural Networks
Chen Wang, Jianfei Yang, Lihua Xie, Junsong Yuan


Convolutional neural networks (CNNs) have enabled state-of-the-art performance in many computer vision tasks. However, little effort has been devoted to establishing convolution in non-linear space. Existing works mainly leverage the activation layers, which can only provide point-wise non-linearity. To solve this problem, a new operation, kervolution (kernel convolution), is introduced to approximate complex behaviors of human perception systems by leveraging the kernel trick. It generalizes convolution, enhances the model capacity, and captures higher order interactions of features, via patch-wise kernel functions, but without introducing additional parameters. Extensive experiments show that kervolutional neural networks (KNN) achieve higher accuracy and faster convergence than baseline CNNs.
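
A minimal sketch of a polynomial kervolution layer, assuming PyTorch: the patch/filter inner product of an ordinary convolution is replaced by a patch-wise kernel (x·w + c)^p, reusing the same weights. The kernel choice and initialization here are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of a polynomial kervolution layer; PyTorch assumed.
class PolyKervolution2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, p=2, c=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch * k * k) * 0.01)
        self.k, self.p, self.c = k, p, c

    def forward(self, x):
        n, _, h, w = x.shape
        patches = F.unfold(x, self.k, padding=self.k // 2)   # (N, in_ch*k*k, H*W)
        linear = self.weight @ patches                        # the usual convolution response
        out = (linear + self.c) ** self.p                     # polynomial kernel, applied patch-wise
        return out.view(n, -1, h, w)

Setting p = 1 and c = 0 recovers an ordinary convolution, which is why the layer adds no parameters.
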
[time, extract, interesting, complex] [polynomial, computer, linear, vision, pattern, additional, international] [conference, ieee, image, input, figure, chen, method, proposed, based, translation] [kernel, convolution, kervolution, convolutional, kervolutional, accuracy, neural, validation, network, performance, order, layer, deep, higher, learnable, pooling, architecture, table, computational, complexity, applied, best, processing, size, number, gaussian, polynomail, activation, channel, capacity, resnet, extends, lihua, better, increased, achieves, speed, filter, output, max, introducing, achieve, receptive] [model, visual, indicates, machine, simple, cortex] [cnn, feature, faster, improve] [training, knn, learning, convergence, data, existing, hyperparameters, specific, similarity, paper]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Chen and Yang, Jianfei and Xie, Lihua and Yuan, Junsong},
  title = {Kervolutional Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Why ReLU Networks Yield High-Confidence Predictions Far Away From the Training Data and How to Mitigate the Problem
Matthias Hein, Maksym Andriushchenko, Julian Bitterwolf


Classifiers used in the wild, in particular for safety-critical systems, should not only have good generalization properties but also should know when they don't know, in particular make low confidence predictions far away from the training data. We show that ReLU type neural networks which yield a piecewise linear classifier function fail in this regard as they produce almost always high confidence predictions far away from the training data. For bounded domains like images we propose a new robust optimization technique similar to adversarial training which enforces low confidence predictions far away from the training data. We show that this technique is surprisingly effective in reducing the confidence of predictions far away from the training data while maintaining high confidence predictions and test error on the original classification task compared to standard training.
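
A minimal sketch of the training objective described above, assuming PyTorch: alongside the usual cross-entropy, uniform-noise images are adversarially sharpened to maximize the model's confidence, and that confidence is then penalized. Step sizes, the number of ascent steps, and the weight lam are illustrative assumptions.

import torch
import torch.nn.functional as F

# Illustrative sketch of the objective, not the authors' exact ACET procedure; PyTorch assumed.
def low_confidence_far_away_loss(model, x, y, ascent_steps=5, step_size=0.01, lam=1.0):
    ce = F.cross_entropy(model(x), y)                  # standard loss on in-distribution data
    z = torch.rand_like(x)                             # uniform noise, i.e. points far from the data
    z.requires_grad_(True)
    for _ in range(ascent_steps):                      # sharpen the noise toward high confidence
        conf = F.log_softmax(model(z), dim=1).max(1).values.mean()
        grad, = torch.autograd.grad(conf, z)
        z = (z + step_size * grad.sign()).clamp(0, 1).detach().requires_grad_(True)
    out_conf = F.log_softmax(model(z), dim=1).max(1).values.mean()
    return ce + lam * out_conf                         # minimising this pushes noise confidence down
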
[report, dataset, interesting] [confidence, linear, note, affine, theorem, problem, robust, optimization, good, contrast, technique, property, approach] [noise, high, image, produce, figure, result, input, generative, enhancing] [relu, neural, plain, network, deep, low, original, number, standard, max, output, table, compared, rescaling, arbitrarily, filter, performance, order] [adversarial, model, true, generated, pgd, evaluation, fact, robustness] [piecewise, detection, propose] [training, acet, mnist, ceda, data, classifier, uniform, auroc, distribution, set, pout, mmc, test, learning, trained, function, classification, loss, svhn, overconfident, lpout, datasets, softmax, class, paper, efk, lce, uncertainty, temperature]
@InProceedings{Hein_2019_CVPR,
  author = {Hein, Matthias and Andriushchenko, Maksym and Bitterwolf, Julian},
  title = {Why ReLU Networks Yield High-Confidence Predictions Far Away From the Training Data and How to Mitigate the Problem},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
On the Structural Sensitivity of Deep Convolutional Networks to the Directions of Fourier Basis Functions
Yusuke Tsuzuku, Issei Sato


Data-agnostic quasi-imperceptible perturbations on inputs are known to degrade recognition accuracy of deep convolutional networks severely. This phenomenon is considered to be a potential security issue. Moreover, some results on statistical generalization guarantees indicate that the phenomena can be a key to improve the networks' generalization. However, the characteristics of the shared directions of such harmful perturbations remain unknown. Our primal finding is that convolutional networks are sensitive to the directions of Fourier basis functions. We derived the property by specializing a hypothesis of the cause of the sensitivity, known as the linearity of neural networks, to convolutional networks and empirically validated it. As a byproduct of the analysis, we propose an algorithm to create shift-invariant universal adversarial perturbations available in black-box settings.
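
A minimal sketch, assuming NumPy, of a shift-invariant perturbation built from a single 2-D Fourier basis function, in the spirit of the proposed black-box attack; the frequency indices (u, v) and the epsilon budget are illustrative.

import numpy as np

# Illustrative sketch; the frequency indices (u, v) and epsilon are assumptions.
def fourier_basis_perturbation(h, w, u, v, eps=8 / 255):
    """Shift-invariant l_inf perturbation along the (u, v) 2-D Fourier basis direction."""
    freq = np.zeros((h, w), dtype=complex)
    freq[u, v] = 1.0                                   # a single Fourier coefficient
    basis = np.fft.ifft2(freq).real                    # the corresponding spatial wave
    return eps * np.sign(basis)                        # quantised to the epsilon ball

def perturb(images, u=4, v=4, eps=8 / 255):
    """images: (N, H, W, C) in [0, 1]; the same pattern is added to every image and channel."""
    h, w = images.shape[1:3]
    delta = fourier_basis_perturbation(h, w, u, v, eps)[None, :, :, None]
    return np.clip(images + delta, 0.0, 1.0)
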
[hypothesis, work, lie, current] [fourier, algorithm, international, single, linear, analysis, computer, optimization, singular, matrix, vision, proposition, problem, pattern, property] [figure, conference, input, frequency, image, method, change, ieee, transformation, result, prior] [convolutional, size, neural, deep, ratio, layer, output, universality, channel, better, compression, fast] [adversarial, basis, uaps, sensitivity, fool, attack, sensitive, machine, perturbation, characterization, random, created, create, transferability, existence, defense, creation, fgsm, find, evaluation, observed, goodfellow, tendency] [visualization, average] [universal, learning, data, domain, training, mnist, generalization, transferable, combination, test, trained, datasets, tested]
@InProceedings{Tsuzuku_2019_CVPR,
  author = {Tsuzuku, Yusuke and Sato, Issei},
  title = {On the Structural Sensitivity of Deep Convolutional Networks to the Directions of Fourier Basis Functions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Neural Rejuvenation: Improving Deep Network Training by Enhancing Computational Resource Utilization
Siyuan Qiao, Zhe Lin, Jianming Zhang, Alan L. Yuille


In this paper, we study the problem of improving computational resource utilization of neural networks. Deep neural networks are usually over-parameterized for their tasks in order to achieve good performances, thus are likely to have underutilized computational resources. This observation motivates a lot of research topics, e.g. network pruning, architecture search, etc. As models with higher computational costs (e.g. more parameters or more computations) usually have better performances, we study the problem of improving the resource utilization of neural networks so that their potentials can be further realized. To this end, we propose a novel optimization method named Neural Rejuvenation. As its name suggests, our method detects dead neurons and computes resource utilization in real time, rejuvenates dead neurons by resource reallocation and reinitialization, and trains them with new training schemes. By simply replacing standard optimizers with Neural Rejuvenation, we are able to improve the performances of neural networks by a very large margin while using similar training efforts and maintaining their original resource usages. The code is available here: https://github.com/joe-siyuan-qiao/NeuralRejuvenation-CVPR19
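
A minimal sketch of the dead-neuron detection-and-reinitialization step, assuming PyTorch and using near-zero BatchNorm scales as the utilization proxy; the paper's cross-layer resource reallocation and modified training schemes are omitted.

import torch
import torch.nn as nn

# Illustrative sketch of the detection/reinitialisation step only; PyTorch assumed.
@torch.no_grad()
def rejuvenate_dead_channels(model, tau=1e-2):
    revived = 0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            dead = m.weight.abs() < tau                # channels whose BN scale collapsed are "dead"
            m.weight[dead] = 1.0                       # reinitialise the scale ...
            m.bias[dead] = 0.0                         # ... and the shift so the channel can train again
            revived += int(dead.sum())
    return revived
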
[previous, dataset, multiple] [computer, sin, error, vision, denote, optimization, constraint, international, problem, fit, additional, pattern] [method, conference, image, ieee, mixed, remove] [neural, rejuvenation, resource, dead, utilization, network, deep, architecture, layer, convolutional, computational, neuron, number, imagenet, rin, table, capacity, original, pruning, parameter, ratio, search, increase, scaling, batch, output, sout, processing, order, better, cifar, size, add, rout, efficient, cost, rejuvenate, maintaining, small, rate, morphnet, sparsity, liveliness, wrs, achieve] [preprint, arxiv, model, visual, attention, step, find] [propose] [training, learning, set, cross, trained, large, loss, train]
@InProceedings{Qiao_2019_CVPR,
  author = {Qiao, Siyuan and Lin, Zhe and Zhang, Jianming and Yuille, Alan L.},
  title = {Neural Rejuvenation: Improving Deep Network Training by Enhancing Computational Resource Utilization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Hardness-Aware Deep Metric Learning
Wenzhao Zheng, Zhaodong Chen, Jiwen Lu, Jie Zhou


This paper presents a hardness-aware deep metric learning (HDML) framework. Most previous deep metric learning methods employ the hard negative mining strategy to alleviate the lack of informative samples for training. However, this mining strategy only utilizes a subset of training data, which may not be enough to characterize the global geometry of the embedding space comprehensively. To address this problem, we perform linear interpolation on embeddings to adaptively manipulate their hard levels and generate corresponding label-preserving synthetics for recycled training, so that information buried in all samples can be fully exploited and the metric is always challenged with proper difficulty. Our method achieves very competitive performance on the widely used CUB-200-2011, Cars196, and Stanford Online Products datasets.
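
A minimal sketch of the hardness manipulation, assuming PyTorch: a negative embedding is moved linearly toward the anchor, but never closer than the positive, and the synthetic is fed to a standard triplet loss. The interpolation factor alpha and the use of triplet_margin_loss are assumptions; the paper also recycles the synthetics through generator and feature losses not shown here.

import torch
import torch.nn.functional as F

# Illustrative sketch of the hardness manipulation; PyTorch assumed, alpha is an assumption.
def harder_negative(anchor, positive, negative, alpha=0.5):
    """Pull each negative toward its anchor, but never closer than the positive."""
    d_pos = (anchor - positive).norm(dim=-1, keepdim=True)
    d_neg = (anchor - negative).norm(dim=-1, keepdim=True)
    target = d_pos + (1.0 - alpha) * (d_neg - d_pos).clamp(min=0.0)   # desired new anchor distance
    direction = (negative - anchor) / d_neg.clamp(min=1e-8)
    return anchor + target * direction                                 # label-preserving synthetic

def triplet_with_synthetics(anchor, positive, negative, margin=0.2, alpha=0.5):
    hard_neg = harder_negative(anchor, positive, negative, alpha)
    return F.triplet_margin_loss(anchor, positive, hard_neg, margin=margin)
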
[framework, dataset, online, tuple, perform] [augmented, point, linear, latexit, corresponding, reconstruction] [proposed, synthetic, generator, method, figure, image, synthesis, mapping] [deep, original, performance, network, connected, effectiveness, adaptive, fixed, achieve, better, compared] [model, generate, manifold] [feature, fully, person, level, anchor, cnn, map, three] [metric, learning, loss, hard, negative, embedding, triplet, training, space, distance, data, set, hdml, tuples, positive, retrieval, sample, harder, mining, large, label, softmax, embeddings, clustering, jgen, sampling, trained, train, nmi, daml, strategy, informative, stanford, learn, datasets, conventional, contrastive, lifted, exploit, learned, augmentation, jsof, class, hardness, function, pair, objective, close, javg]
@InProceedings{Zheng_2019_CVPR,
  author = {Zheng, Wenzhao and Chen, Zhaodong and Lu, Jiwen and Zhou, Jie},
  title = {Hardness-Aware Deep Metric Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation
Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille, Li Fei-Fei


Recently, Neural Architecture Search (NAS) has successfully identified neural network architectures that exceed human designed ones on large-scale image classification. In this paper, we study NAS for semantic image segmentation. Existing works often focus on searching the repeatable cell structure, while hand-designing the outer network structure that controls the spatial resolution changes. This choice simplifies the search space, but becomes increasingly problematic for dense image prediction which exhibits a lot more network level architectural variations. Therefore, we propose to search the network level structure in addition to the cell level structure, which forms a hierarchical architecture search space. We present a network level search space that includes many popular designs, and develop a formulation that allows efficient gradient-based architecture search (3 P100 GPU days on Cityscapes images). We demonstrate the effectiveness of the proposed method on the challenging Cityscapes, PASCAL VOC 2012, and ADE20K datasets. Auto-DeepLab, our architecture searched specifically for semantic image segmentation, attains state-of-the-art performance without any ImageNet pretraining.
[outperforms, work, report, previous, dataset, recognition] [dense, continuous, relaxation, general, differentiable, discrete, optimization, scene, formulation] [image, resolution, method, input, outer, proposed, high] [search, architecture, network, neural, cell, convolutional, performance, imagenet, best, deep, atrous, convolution, efficient, layer, validation, structure, pooling, rate, attains, output, size, table, number, sep, searching, entire, connection, residual, addition, gpu, higher, efficiently, gridnet] [model, path] [level, semantic, spatial, segmentation, hierarchical, pascal, voc, object, pspnet, pyramid, miou, propose, coarse, context, coco, cnn] [space, set, learning, training, classification, test]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Chenxi and Chen, Liang-Chieh and Schroff, Florian and Adam, Hartwig and Hua, Wei and Yuille, Alan L. and Fei-Fei, Li},
  title = {Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Loss for Active Learning
Donggeun Yoo, In So Kweon


The performance of deep neural networks improves with more annotated data. The problem is that the budget for annotation is limited. One solution to this is active learning, where a model asks a human to annotate data that it perceives as uncertain. A variety of recent methods have been proposed to apply active learning to deep networks, but most of them are either designed specifically for their target tasks or computationally inefficient for large networks. In this paper, we propose a novel active learning method that is simple but task-agnostic, and works efficiently with deep networks. We attach a small parametric module, named "loss prediction module," to a target network, and learn it to predict target losses of unlabeled inputs. Then, this module can suggest data for which the target model is likely to produce a wrong prediction. This method is task-agnostic, as networks are learned from a single loss regardless of target tasks. We rigorously validate our method through image classification, object detection, and human pose estimation, with recent network architectures. The results demonstrate that our method consistently outperforms the previous methods over the tasks.
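
A minimal sketch of a loss prediction module, assuming PyTorch: a few intermediate feature maps are pooled, projected, and regressed to a scalar predicted loss, trained with a pairwise ranking loss so that samples with larger true losses receive larger predictions. Hidden sizes and the pairing scheme are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of a loss prediction module; PyTorch assumed, sizes are assumptions.
class LossPredictionModule(nn.Module):
    def __init__(self, channel_dims, hidden=128):
        super().__init__()
        self.fcs = nn.ModuleList([nn.Linear(c, hidden) for c in channel_dims])
        self.out = nn.Linear(hidden * len(channel_dims), 1)

    def forward(self, feature_maps):
        # feature_maps: list of (N, C_i, H_i, W_i) tensors taken from the target network
        pooled = [F.relu(fc(f.mean(dim=(2, 3)))) for fc, f in zip(self.fcs, feature_maps)]
        return self.out(torch.cat(pooled, dim=1)).squeeze(1)          # (N,) predicted losses

def ranking_loss(pred_loss, true_loss, margin=1.0):
    # random pairs: the sample with the larger true loss should also get the larger prediction
    i, j = torch.chunk(torch.randperm(pred_loss.size(0)), 2)
    n = min(len(i), len(j))
    sign = torch.sign(true_loss[i[:n]] - true_loss[j[:n]])
    return F.relu(margin - sign * (pred_loss[i[:n]] - pred_loss[j[:n]])).mean()

At acquisition time, the unlabeled pool is scored with the module and the samples with the highest predicted losses are sent for annotation.
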
[prediction, human, recognition, dataset, current, multiple] [active, computer, international, pose, approach, initial, vision, pattern, problem, hourglass, body, point, define, estimation, well] [method, conference, image, figure, ieee, mse, proposed, input, real] [deep, performance, network, size, neural, design, accuracy, number, small, scale, processing, efficient, architecture, output] [model, random, machine, expected, choose, simple, requires] [module, object, predicted, feature, detection, pool, average, annotation, regression, european, annotate, bounding, three, including, map, final] [loss, learning, target, data, learn, set, entropy, uncertainty, classification, labeled, training, unlabeled, meanstd, large, class, distribution, task, function, posterior, subset]
@InProceedings{Yoo_2019_CVPR,
  author = {Yoo, Donggeun and So Kweon, In},
  title = {Learning Loss for Active Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Striking the Right Balance With Uncertainty
Salman Khan, Munawar Hayat, Syed Waqas Zamir, Jianbing Shen, Ling Shao


Learning unbiased models on imbalanced datasets is a significant challenge. Rare classes tend to get a concentrated representation in the classification space which hampers the generalization of learned boundaries to new test examples. In this paper, we demonstrate that the Bayesian uncertainty estimates directly correlate with the rarity of classes and the difficulty level of individual samples. Subsequently, we present a novel framework for uncertainty based class imbalance learning that follows two key insights: First, classification boundaries should be extended further away from a more uncertain (rare) class to avoid over-fitting and enhance its generalization. Second, each sample should be modeled as a multi-variate Gaussian distribution with a mean vector and a covariance matrix defined by the sample's uncertainty. The learned boundaries should respect not only the individual samples but also their distribution in the feature space. Our proposed approach efficiently utilizes sample and class uncertainty information to learn robust features and more generalizable classifiers. We systematically study the class imbalance problem and derive a novel loss formulation for max-margin learning based on Bayesian uncertainty measure. The proposed method shows significant performance improvements on six benchmark datasets for face verification, attribute prediction, digit/object classification and skin lesion detection.
[recognition, dataset, prediction, report, individual] [computer, approach, vision, pattern, international, confidence, error, provide, directly, problem, formulation, note] [face, conference, ieee, based, proposed, attribute, skin, hair, method, image, input] [deep, performance, table, neural, bayesian, gaussian, network, standard, better, dropout, higher, number, accuracy, larger, achieve, output] [model, machine, empirical, probability, arxiv, preprint, true] [feature, lesion, cnn, level, challenging] [class, uncertainty, loss, learning, imbalanced, imbalance, training, softmax, margin, classification, data, distribution, set, learned, minority, function, classifier, rare, generalization, datasets, sample, paper, large, uncertain, test, novel, representation, balanced, biased, belonging]
@InProceedings{Khan_2019_CVPR,
  author = {Khan, Salman and Hayat, Munawar and Waqas Zamir, Syed and Shen, Jianbing and Shao, Ling},
  title = {Striking the Right Balance With Uncertainty},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
AutoAugment: Learning Augmentation Strategies From Data
Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, Quoc V. Le


Data augmentation is an effective technique for improving the accuracy of modern image classifiers. However, current data augmentation implementations are manually designed. In this paper, we describe a simple procedure called AutoAugment to automatically search for improved data augmentation policies. In our implementation, we have designed a search space where a policy consists of many sub-policies, one of which is randomly chosen for each image in each mini-batch. A sub-policy consists of two operations, each operation being an image processing function such as translation, rotation, or shearing, and the probabilities and magnitudes with which the functions are applied. We use a search algorithm to find the best policy such that the neural network yields the highest validation accuracy on a target dataset. Our method achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, and ImageNet (without additional data). On ImageNet, we attain a Top-1 accuracy of 83.5%, which is 0.4% better than the previous record of 83.1%. On CIFAR-10, we achieve an error rate of 1.5%, which is 0.6% better than the previous state-of-the-art. Augmentation policies we find are transferable between datasets. The policy learned on ImageNet transfers well to achieve significant improvements on other datasets, such as Oxford Flowers, Caltech-101, Oxford-IIIT Pets, FGVC Aircraft, and Stanford Cars.
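
A minimal sketch of how a learned policy is applied at training time, assuming Pillow and RGB inputs: one sub-policy of two (operation, probability, magnitude) entries is drawn per image, and each operation fires with its probability. The tiny policy below is illustrative, not one of the paper's learned policies.

import random
from PIL import ImageEnhance, ImageOps

# Illustrative sketch; the policy below is made up, not one learned by AutoAugment.
OPS = {
    "rotate":    lambda img, m: img.rotate(m * 3.0),                            # magnitude -> degrees
    "color":     lambda img, m: ImageEnhance.Color(img).enhance(1 + m / 10.0),
    "solarize":  lambda img, m: ImageOps.solarize(img, 256 - int(m * 25.6)),
    "posterize": lambda img, m: ImageOps.posterize(img, max(1, 8 - int(m // 2))),
}

POLICY = [  # each sub-policy: two (operation, probability, magnitude in [0, 10]) entries
    [("rotate", 0.7, 2), ("color", 0.9, 6)],
    [("solarize", 0.5, 5), ("posterize", 0.8, 4)],
]

def autoaugment(img):
    """img: an RGB PIL image; one randomly chosen sub-policy is applied per call."""
    for name, prob, mag in random.choice(POLICY):
        if random.random() < prob:
            img = OPS[name](img, mag)
    return img

The search itself treats validation accuracy of a child model trained with a candidate policy as the reward for a controller; only the application step is sketched here.
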
[dataset, previous, work] [error, computer, vision, algorithm, augmented, international, pattern, approach, well] [image, conference, method, ieee, figure, result, generative, application] [search, autoaugment, best, neural, accuracy, imagenet, rate, reduced, operation, network, validation, table, controller, better, architecture, deep, cutout, applied, processing, number, applying, decay, fgvc, achieves, achieve, convolutional, weight, standard, magnitude, full, size] [policy, arxiv, preprint, model, find, child, random, probability, machine, reinforcement, adversarial] [baseline] [augmentation, data, training, learning, set, trained, train, svhn, learned, datasets, test, randomly, classification, transfer, stanford, generalization]
@InProceedings{Cubuk_2019_CVPR,
  author = {Cubuk, Ekin D. and Zoph, Barret and Mane, Dandelion and Vasudevan, Vijay and Le, Quoc V.},
  title = {AutoAugment: Learning Augmentation Strategies From Data},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SDRSAC: Semidefinite-Based Randomized Approach for Robust Point Cloud Registration Without Correspondences
Huu M. Le, Thanh-Toan Do, Tuan Hoang, Ngai-Man Cheung


This paper presents a novel randomized algorithm for robust point cloud registration without correspondences. Most existing registration approaches require a set of putative correspondences obtained by extracting invariant descriptors. However, such descriptors could become unreliable in noisy and contaminated settings. In these settings, methods that directly handle input point sets are preferable. Without correspondences, however, conventional randomized techniques require a very large number of samples in order to reach satisfactory solutions. In this paper, we propose a novel approach to address this problem. In particular, our work enables the use of randomized methods for point cloud registration without the need of putative correspondences. By considering point cloud alignment as a special instance of graph matching and employing an efficient semi-definite relaxation, we propose a novel sampling mechanism, in which the size of the sampled subsets can be larger-than-minimal. Our tight relaxation scheme enables fast rejection of the outliers in the sampled sets, resulting in high quality hypotheses. We conduct extensive experiments to demonstrate that our approach outperforms other state-of-the-art methods. Importantly, our proposed method serves as a generic framework which can be extended to problems with known correspondences.
[time, graph, work, recognition] [point, robust, computer, registration, cloud, matching, optimal, problem, solution, nsample, algorithm, vision, pattern, international, correspondence, solving, convex, matrix, icp, approach, relaxation, consensus, sdrsac, david, sdp, note, chosen, local, ransac, estimating, globally, inlier, volume, outlier, semidefinite, good, case, constraint, analysis, huu, directly, tight, rigid] [ieee, conference, input, method, proposed, image, synthetic, figure, real, high] [number, best, fast, search, stopping, performance, order, efficient, size, max, criterion] [randomized, sampled, random, machine] [propose] [sampling, large, set, sample, data, subset, novel, alignment, strategy, maximum, randomly]
@InProceedings{Le_2019_CVPR,
  author = {Le, Huu M. and Do, Thanh-Toan and Hoang, Tuan and Cheung, Ngai-Man},
  title = {SDRSAC: Semidefinite-Based Randomized Approach for Robust Point Cloud Registration Without Correspondences},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
BAD SLAM: Bundle Adjusted Direct RGB-D SLAM
Thomas Schops, Torsten Sattler, Marc Pollefeys


A key component of Simultaneous Localization and Mapping (SLAM) systems is the joint optimization of the estimated 3D map and camera trajectory. Bundle adjustment (BA) is the gold standard for this. Due to the large number of variables in dense RGB-D SLAM, previous work has focused on approximating BA. In contrast, in this paper we present a novel, fast direct BA formulation which we implement in a real-time dense RGB-D SLAM algorithm. In addition, we show that direct RGB-D SLAM systems are highly sensitive to rolling shutter, RGB and depth sensor synchronization, and calibration errors. In order to facilitate state-of-the-art research on direct RGB-D SLAM, we propose a novel, well-calibrated benchmark for this task that uses synchronized global shutter RGB and depth cameras. It includes a training set, a test set without public ground truth, and an online evaluation service. We observe that the ranking of methods changes on this dataset compared to existing ones, and our proposed algorithm outperforms all other evaluated SLAM methods. Our benchmark and our open source SLAM algorithm are available at: www.eth3d.net
[keyframe, dataset, keyframes, online, performs, outperforms, graph] [depth, slam, surfel, direct, optimization, camera, surfels, dense, pose, photometric, shutter, loop, geometric, rolling, rmse, ate, thomas, scene, geometry, tum, stereo, approach, point, normal, daniel, bundle, algorithm, reconstruction, david, marc, ground, rgeom, michael, adjustment, robust, corresponding, sch, contrast, optimizing, bundlefusion, odometry, truth, accurate, radius, position, measurement, rphoto] [image, pixel, method, color, figure] [number, optimize, andrew, cost, sparse, cell, standard, better] [model, visual, evaluation, system, common] [benchmark, global, map, merge] [datasets, training, test, set, existing, data, update, alignment, alternating, hard]
@InProceedings{Schops_2019_CVPR,
  author = {Schops, Thomas and Sattler, Torsten and Pollefeys, Marc},
  title = {BAD SLAM: Bundle Adjusted Direct RGB-D SLAM},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Revealing Scenes by Inverting Structure From Motion Reconstructions
Francesco Pittaluga, Sanjeev J. Koppal, Sing Bing Kang, Sudipta N. Sinha


Many 3D vision systems localize cameras within a scene using 3D point clouds. Such point clouds are often obtained using structure from motion (SfM), after which the images are discarded to preserve privacy. In this paper, we show, for the first time, that such point clouds retain enough information to reveal scene appearance and compromise privacy. We present a privacy attack that reconstructs color images of the scene from the point cloud. Our method is based on a cascaded U-Net that takes as input, a 2D multichannel image of the points rendered from a specific viewpoint containing point depth and optionally color and SIFT descriptors and outputs a color image of the scene from that viewpoint. Unlike previous feature inversion methods, we deal with highly sparse and irregular 2D point distributions and inputs where many point attributes are missing, namely keypoint orientation and scale, the descriptor image source and the 3D point visibility. We evaluate our attack algorithm on public datasets and analyze the significance of the point cloud attributes. Finally, we show that novel views can also be generated thereby enabling compelling virtual tours of the underlying scene.
[work, report] [point, isib, sfm, visibility, depth, sift, oarse, scene, inverting, efine, nyu, cloud, camera, estimation, indoor, keypoint, problem, reconstruction, descriptor, approach, occluded, view, viewpoint, associated, single, geometric, rgb, orientation, virtual, megadepth, dense, visible, implicit, visibnet, left, vision] [image, input, figure, color, reconstructed, based, method, preserving, ssim, inversion, perceptual, synthesis, synthesized] [sparse, network, deep, table, neural, sparsity, convolutional, accuracy, original, processing] [privacy, adversarial, model, generated, visual, attack] [feature, map, three, mae, localization] [trained, source, set, unknown, novel, training, data, learning, loss, test, specific]
@InProceedings{Pittaluga_2019_CVPR,
  author = {Pittaluga, Francesco and Koppal, Sanjeev J. and Bing Kang, Sing and Sinha, Sudipta N.},
  title = {Revealing Scenes by Inverting Structure From Motion Reconstructions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Strand-Accurate Multi-View Hair Capture
Giljoo Nam, Chenglei Wu, Min H. Kim, Yaser Sheikh


Hair is one of the most challenging objects to reconstruct due to its micro-scale structure and a large number of repeated strands with heavy occlusions. In this paper, we present the first method to capture high-fidelity hair geometry with strand-level accuracy. Our method takes three stages to achieve this. In the first stage, a new multi-view stereo method with a slanted support line is proposed to solve the hair correspondences between different views. In detail, we contribute a novel cost function consisting of both photo-consistency term and geometric term that reconstructs each hair pixel as a 3D line. By merging all the depth maps, a point cloud, as well as local line directions for each point, is obtained. Thus, in the second stage, we feature a novel strand reconstruction method with the mean-shift to convert the noisy point data to a set of strands. Lastly, we grow the hair strands with multi-view geometric constraints to elongate the short strands and recover the missing strands, thus significantly increasing the reconstruction completeness. We evaluate our method on both synthetic data and real captured data, showing that our method can reconstruct hair strands with sub-millimeter accuracy.
[capture, previous, work, fusion, multiple, modeling, short, current] [point, strand, cloud, direction, reconstruction, growing, view, geometry, computer, orientation, lpmvs, position, stereo, geometric, plane, vision, algorithm, single, local, robust, angle, pattern, curly, straight, linjie, hao, depth, matching, estimate, surface, pdir, nnei, defined, ground, solve] [hair, method, figure, reconstructed, acm, reference, captured, pixel, real, input, conference, reconstruct, traditional, ieee, patchmatch, synthetic, image, difference, photograph] [cost, output, process, achieve, number, structure] [find, random, evaluation, evaluate] [neighboring, map, segment] [data, set, function, sample, novel, large, noisy]
@InProceedings{Nam_2019_CVPR,
  author = {Nam, Giljoo and Wu, Chenglei and Kim, Min H. and Sheikh, Yaser},
  title = {Strand-Accurate Multi-View Hair Capture},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation
Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, Steven Lovegrove


Computer graphics, 3D computer vision and robotics communities have produced multiple approaches to representing 3D geometry for rendering and reconstruction. These provide trade-offs across fidelity, efficiency and compression capabilities. In this work, we introduce DeepSDF, a learned continuous Signed Distance Function (SDF) representation of a class of shapes that enables high quality shape representation, interpolation and completion from partial and noisy 3D input data. DeepSDF, like its classical counterpart, represents a shape's surface by a continuous volumetric field: the magnitude of a point in the field represents the distance to the surface boundary and the sign indicates whether the region is inside (-) or outside (+) of the shape, hence our representation implicitly encodes a shape's boundary as the zero-level-set of the learned function while explicitly representing the classification of space as being part of the shape's interior or not. While classical SDFs, in both analytical and discretized voxel form, typically represent the surface of a single shape, DeepSDF can represent an entire class of shapes. Furthermore, we show state-of-the-art performance for learned 3D shape representation and completion while reducing the model size by an order of magnitude compared with previous work.
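
A minimal sketch of a DeepSDF-style decoder and a clamped L1 loss, assuming PyTorch: an MLP maps a per-shape latent code concatenated with a 3-D query point to a signed distance. Layer sizes, the tanh output, and the clamp value delta follow common choices and are assumptions; the auto-decoder optimization of latent codes is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of a DeepSDF-style decoder; PyTorch assumed, sizes are common choices.
class SDFDecoder(nn.Module):
    def __init__(self, latent_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh())           # signed distance kept in (-1, 1)

    def forward(self, latent, xyz):
        # latent: (N, latent_dim) per-shape code, xyz: (N, 3) query points
        return self.net(torch.cat([latent, xyz], dim=-1)).squeeze(-1)

def clamped_l1(pred_sdf, true_sdf, delta=0.1):
    # only distances near the surface matter, so both sides are clamped to [-delta, delta]
    return F.l1_loss(pred_sdf.clamp(-delta, delta), true_sdf.clamp(-delta, delta))

The surface of a shape is then recovered as the zero level set of the decoder for that shape's latent code, e.g. via marching cubes on a sampled grid.
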
[modeling, complex, represented, work] [shape, sdf, surface, deepsdf, completion, point, continuous, depth, mesh, signed, single, approach, reconstruction, formulation, atlasnet, implicit, voxel, voxels, note, occupancy, volumetric, well, watertight, parameterization, dense, computer] [latent, input, generative, figure, method, comparison, high, quality] [deep, network, neural, table, convolutional, number] [arxiv, vector, preprint, model, represent, representing, partial, decoder, describe, complete, decision, memory, encoder, generate, ability] [object, spatial, boundary, oriented] [learning, training, representation, distance, space, function, code, data, learned, learn, test, loss, trained, set, embedding, sample, unknown, train]
@InProceedings{Park_2019_CVPR,
  author = {Joon Park, Jeong and Florence, Peter and Straub, Julian and Newcombe, Richard and Lovegrove, Steven},
  title = {DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Pushing the Boundaries of View Extrapolation With Multiplane Images
Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, Noah Snavely


We explore the problem of view synthesis from a narrow baseline pair of images, and focus on generating high-quality view extrapolations with plausible disocclusions. Our method builds upon prior work in predicting a multiplane image (MPI), which represents scene content as a set of RGBA planes within a reference view frustum and renders novel views by projecting this content into the target viewpoints. We present a theoretical analysis showing how the range of views that can be rendered from an MPI increases linearly with the MPI disparity sampling frequency, as well as a novel MPI prediction procedure that theoretically enables view extrapolations of up to 4 times the lateral viewpoint movement allowed by prior work. Our method ameliorates two specific issues that limit the range of views renderable by prior methods: 1) We expand the range of novel views that can be rendered without depth discretization artifacts by using a 3D convolutional network architecture along with a randomized-resolution training procedure to allow our model to predict MPIs with increased disparity sampling frequency. 2) We reduce the repeated texture artifacts seen in disocclusions by enforcing a constraint that the appearance of hidden content at any depth must be drawn from visible content at or behind that depth.
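
A minimal sketch of the final MPI rendering step, assuming NumPy: given RGBA planes already warped into the target viewpoint and ordered back-to-front, the novel view is obtained with standard over-compositing. The per-plane homography warping and the prediction network are omitted.

import numpy as np

# Illustrative sketch of the compositing step only; NumPy assumed.
def composite_mpi(rgba_planes):
    """rgba_planes: (D, H, W, 4) planes already in the target view, ordered farthest to nearest."""
    out = np.zeros(rgba_planes.shape[1:3] + (3,))
    for plane in rgba_planes:                          # back-to-front "over" compositing
        rgb, alpha = plane[..., :3], plane[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)
    return out
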
[hidden, prediction, predict, work, predicting] [mpi, view, rendered, scene, disparity, rendering, visible, depth, disoccluded, viewpoint, rgb, fourier, occluded, range, mpis, render, camera, disocclusions, predicts, geometry, initial, single, field, plane, light, rfin, renderable, limited, rinit, problem, discretization, truth, volume, slice, theoretical, linearly, local, zhou] [content, input, image, reference, synthesis, figure, appearance, extrapolation, method, prior, frequency, inpainting, quality, convincing, transmittance, acm, repeated, texture, realistic, color, transform] [network, number, deep, original, lateral, architecture] [model, plausible, adversarial, procedure] [predicted, spatial, cnn, final] [novel, training, sampling, target, representation, set, learning, train]
@InProceedings{Srinivasan_2019_CVPR,
  author = {Srinivasan, Pratul P. and Tucker, Richard and Barron, Jonathan T. and Ramamoorthi, Ravi and Ng, Ren and Snavely, Noah},
  title = {Pushing the Boundaries of View Extrapolation With Multiplane Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
GA-Net: Guided Aggregation Net for End-To-End Stereo Matching
Feihu Zhang, Victor Prisacariu, Ruigang Yang, Philip H.S. Torr


In the stereo matching task, matching cost aggregation is crucial in both traditional methods and deep neural network models in order to accurately estimate disparities. We propose two novel neural net layers, aimed at capturing local and the whole-image cost dependencies respectively. The first is a semi-global aggregation layer which is a differentiable approximation of the semi-global matching, the second is the local guided aggregation layer which follows a traditional cost filtering strategy to refine thin structures. These two layers can be used to replace the widely used 3D convolutional layer which is computationally costly and memory-consuming as it has cubic computational/memory complexity. In the experiments, we show that nets with a two-layer guided aggregation block easily outperform the state-of-the-art GC-Net which has nineteen 3D convolutional layers. We also train a deep guided aggregation network (GA-Net) which gets better accuracies than state-of-the-art methods on both Scene Flow dataset and KITTI benchmarks.
[flow, dataset, recognition] [matching, disparity, stereo, computer, local, error, vision, scene, kitti, sgm, volume, pattern, psmnet, textureless, dmax, international, estimation, ground, hourglass] [conference, ieee, traditional, image, proposed, input, reflective, thin, pixel, row, filtering, figure, based] [cost, aggregation, layer, convolutional, sga, neural, deep, network, lga, table, accuracy, weight, compared, filter, best, order, better, aggregate, effective, replace, block, stacked, number, efficient, qnp, max, rate] [model, correct] [guided, feature, region, object, extraction, guidance, subnet, challenging, three, average, refine, improve] [large, loss, learning, train]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Feihu and Prisacariu, Victor and Yang, Ruigang and Torr, Philip H.S.},
  title = {GA-Net: Guided Aggregation Net for End-To-End Stereo Matching},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Real-Time Self-Adaptive Deep Stereo
Alessio Tonioni, Fabio Tosi, Matteo Poggi, Stefano Mattoccia, Luigi Di Stefano


Deep convolutional neural networks trained end-to-end are the state-of-the-art methods to regress dense disparity maps from stereo pairs. These models, however, suffer from a notable decrease in accuracy when exposed to scenarios significantly different from the training set (e.g., real vs synthetic images, etc.). We argue that it is extremely unlikely to gather enough samples to achieve effective training/tuning in any target domain, thus making this setup impractical for many applications. Instead, we propose to perform unsupervised and continuous online adaptation of a deep stereo network, which allows for preserving its accuracy in any environment. However, this strategy is extremely computationally demanding and thus prevents real-time inference. We address this issue introducing a new lightweight, yet effective, deep stereo architecture, Modularly ADaptive Network(MADNet), and developing a Modular ADaptation (MAD) algorithm, which independently trains sub-portions of the network. By deploying MADNet together with MAD we introduce the first real-time self-adaptive deep stereo system enabling competitive performance on heterogeneous datasets. Our code is publicly available at https://github.com/CVLAB-Unibo/Real-time-self-adaptive-deep-stereo.
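
A minimal sketch of modular online adaptation, assuming PyTorch: each incoming stereo pair yields an unsupervised proxy loss, and only one block of parameters is updated per step to keep the per-frame cost low. Picking the block uniformly at random and the placeholder proxy_loss_fn are simplifications of the paper's MAD algorithm.

import random
import torch

# Illustrative sketch of modular online adaptation, not the paper's MAD scoring; PyTorch assumed.
def adapt_online(model, portions, stream, proxy_loss_fn, lr=1e-4):
    """portions: list of parameter lists, one per independently adaptable block of the network."""
    optimizers = [torch.optim.Adam(p, lr=lr) for p in portions]
    for left, right in stream:                         # stereo pairs arrive one at a time
        disparity = model(left, right)
        loss = proxy_loss_fn(left, right, disparity)   # e.g. a photometric reprojection error
        model.zero_grad(set_to_none=True)
        loss.backward()
        optimizers[random.randrange(len(portions))].step()   # update only one block this step
        yield disparity.detach()
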
[online, frame, recognition, time, perform, dataset, flow, key, sequence, optical, prediction, work] [stereo, madnet, computer, vision, mad, kitti, disparity, pattern, error, dispnetc, depth, matching, estimation, stefano, confidence, epe, international, matteo, left, additional, allows, groundtruth, accurate, compute, fabio, dense, rely] [conference, ieee, proposed, synthetic, resolution, high] [network, full, deep, performance, compared, accuracy, rate, offline, inference, cost, fps, convolutional, architecture, higher, fast, layer, achieve, deployment, speed, better] [model, machine, evaluation, modular] [refinement, module, propose, road] [adaptation, training, learning, unsupervised, data, loss, domain, target, adapt, train, trained, pair, novel]
@InProceedings{Tonioni_2019_CVPR,
  author = {Tonioni, Alessio and Tosi, Fabio and Poggi, Matteo and Mattoccia, Stefano and Di Stefano, Luigi},
  title = {Real-Time Self-Adaptive Deep Stereo},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
LAF-Net: Locally Adaptive Fusion Networks for Stereo Confidence Estimation
Sunok Kim, Seungryong Kim, Dongbo Min, Kwanghoon Sohn


We present a novel method that estimates confidence map of an initial disparity by making full use of tri-modal input, including matching cost, disparity, and color image through deep networks. The proposed network, termed as Locally Adaptive Fusion Networks (LAF-Net), learns locally-varying attention and scale maps to fuse the tri-modal confidence features. The attention inference networks encode the importance of tri-modal confidence features and then concatenate them using the attention maps in an adaptive and dynamic fashion. This enables us to make an optimal fusion of the heterogeneous features, compared to a simple concatenation technique that is commonly used in conventional approaches. In addition, to encode the confidence features with locally-varying receptive fields, the scale inference networks learn the scale map and warp the fused confidence features through convolutional spatial transformer networks. Finally, the confidence map is progressively estimated in the recursive refinement networks to enforce a spatial context and local consistency. Experimental results show that this model outperforms the state-of-the-art methods on various benchmarks.
[fusion, auc, dataset] [confidence, matching, disparity, stereo, mid, pattern, estimation, optimal, kitti, handcrafted, local, estimated, yoon, shaked, estimate, initial, left, haeusler, spyropoulos, ccnn, pbcp, poggi, confnet, single] [ieee, proposed, color, image, input, pixel, park, method, figure, bad, based, study] [scale, conv, inference, recursive, kim, cost, convolutional, relu, performance, network, deep, neural, lfn, adaptive, receptive, concatenation, accuracy, order, output, size, table] [attention, improved, locally, simple] [feature, refinement, map, extraction, spatial, global, ablation] [learning, measure, set, conventional, learn, consist, learned]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Sunok and Kim, Seungryong and Min, Dongbo and Sohn, Kwanghoon},
  title = {LAF-Net: Locally Adaptive Fusion Networks for Stereo Confidence Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
NM-Net: Mining Reliable Neighbors for Robust Feature Correspondences
Chen Zhao, Zhiguo Cao, Chi Li, Xin Li, Jiaqi Yang


Feature correspondence selection is pivotal to many feature-matching based tasks in computer vision. Searching spatially k-nearest neighbors is a common strategy for extracting local information in many previous works. However, there is no guarantee that the spatially k-nearest neighbors of correspondences are consistent because the spatial distribution of false correspondences is often irregular. To address this issue, we present a compatibility-specific mining method to search for consistent neighbors. Moreover, in order to extract and aggregate more reliable features from neighbors, we propose a hierarchical network named NM-Net with a series of graph convolutions that is insensitive to the order of correspondences. Our experimental results have shown the proposed method achieves the state-of-the-art performance on four datasets with various inlier ratios and varying numbers of feature consistencies.
[graph, key, multiple, employed] [local, correspondence, computer, inlier, inliers, matching, international, pattern, consistent, vision, colmap, initial, point, pointnet, reliable, approach, locality, analysis, good, corresponding, journal, cloud, irregular] [spatially, conference, image, ieee, method, consistency, transformation, figure, extracted, proposed, narrow, based, raw] [selection, deep, search, network, convolution, resnet, performance, binary, ratio, higher, table, achieves, structure, standard, number] [correct, visual, represent, generated, finding] [feature, global, segmentation, grouping, multi, propose, hierarchical, spatial] [compatibility, classification, learning, metric, mining, neighbor, datasets, knn, set, compatible, experimental, loss, mine]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Chen and Cao, Zhiguo and Li, Chi and Li, Xin and Yang, Jiaqi},
  title = {NM-Net: Mining Reliable Neighbors for Robust Feature Correspondences},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Coordinate-Free Carlsson-Weinshall Duality and Relative Multi-View Geometry
Matthew Trager, Martial Hebert, Jean Ponce


We present a coordinate-free description of Carlsson-Weinshall duality between scene points and camera pinholes and use it to derive a new characterization of primal/dual multi-view geometry. In the case of three views, a particular set of reduced trilinearities provide a novel parameterization of camera geometry that, unlike existing ones, is subject only to very simple internal constraints. These trilinearities lead to new "quasi-linear" algorithms for primal and dual structure from motion. We include some preliminary experiments with real and synthetic data.
[joint, passing, internal, motion] [point, scene, trilinearities, geometry, associated, corresponding, coordinate, projective, duality, camera, proposition, geometric, form, solution, reconstruction, condition, linear, trifocal, reprojection, relative, matrix, pinhole, cremona, approach, perspective, algebraic, projection, general, supplementary, bundle, algorithm, course, valid, equation, trilinearity, case, parameterization, concurrent, ucker, single, note, material, written] [image, dual, figure, pixel, synthetic, real, arbitrary, reference, result, transformation, method] [reduced, fixed, structure] [primal, vector, visual, basis, simple, system, belongs, random, reasonable] [three] [set, data]
@InProceedings{Trager_2019_CVPR,
  author = {Trager, Matthew and Hebert, Martial and Ponce, Jean},
  title = {Coordinate-Free Carlsson-Weinshall Duality and Relative Multi-View Geometry},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Reinforcement Learning of Volume-Guided Progressive View Inpainting for 3D Point Scene Completion From a Single Depth Image
Xiaoguang Han, Zhaoxuan Zhang, Dong Du, Mingdai Yang, Jingming Yu, Pan Pan, Xin Yang, Ligang Liu, Zixiang Xiong, Shuguang Cui


We present a deep reinforcement learning method of progressive view inpainting for 3D point scene completion under volume guidance, achieving high-quality scene reconstruction from only a single depth image with severe occlusion. Our approach is end-to-end, consisting of three modules: 3D scene volume reconstruction, 2D depth map inpainting, and multi-view selection for completion. Given a single depth image, our method first goes through the 3D volume branch to obtain a volumetric scene reconstruction as a guide to the next view inpainting step, which attempts to make up the missing information; the third step involves projecting the volume under the same view of the input, concatenating them to complete the current view depth, and integrating all depth into the point cloud. Since the occluded areas are unavailable, we resort to a deep Q-Network to glance around and pick the next best view for large hole completion progressively until a scene is adequately reconstructed while guaranteeing validity. All steps are learned jointly to achieve robust and consistent results. We perform qualitative and quantitative evaluations with extensive experiments on the SUNCG data, obtaining better results than the state of the art.
[action, state, current, sequence] [depth, view, completion, point, scene, volume, computer, cloud, single, vision, dqn, shape, volumetric, pattern, reconstruction, approach, sscnet, projection, algorithm, viewpoint, directly, voxel, daniel, problem, camera, define, corresponding, olume] [inpainting, image, conference, input, figure, acm, missing, ieee, method, based, proposed, incomplete, resolution, denoted] [network, deep, best, output, neural, better, iteration, number, structure, convolutional] [reward, reinforcement, path, generated, arxiv, preprint, complete, model, agent, partial, probability] [map, semantic, context, object, global, propose] [learning, set, training, train, space, function, loss, data, existing]
@InProceedings{Han_2019_CVPR,
  author = {Han, Xiaoguang and Zhang, Zhaoxuan and Du, Dong and Yang, Mingdai and Yu, Jingming and Pan, Pan and Yang, Xin and Liu, Ligang and Xiong, Zixiang and Cui, Shuguang},
  title = {Deep Reinforcement Learning of Volume-Guided Progressive View Inpainting for 3D Point Scene Completion From a Single Depth Image},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Video Action Transformer Network
Rohit Girdhar, Joao Carreira, Carl Doersch, Andrew Zisserman


We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally its attention mechanism learns to emphasize hands and faces, which are often crucial to discriminate an action - all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin using only raw RGB frames as input.
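As a rough illustration of the mechanism described above, the sketch below shows a single attention unit in which a person-specific query attends over flattened spatiotemporal context features; the class name, dimensions, and single-head design are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: a person (RoI) feature acts as a query over clip context features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionAttentionUnit(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # projects the person feature to a query
        self.k = nn.Linear(dim, dim)   # keys from the spatiotemporal context
        self.v = nn.Linear(dim, dim)   # values from the spatiotemporal context
        self.out = nn.Linear(dim, dim)

    def forward(self, person_feat, context_feats):
        # person_feat: (B, dim) pooled from the person box
        # context_feats: (B, T*H*W, dim) flattened clip features
        q = self.q(person_feat).unsqueeze(1)                  # (B, 1, dim)
        k, v = self.k(context_feats), self.v(context_feats)   # (B, N, dim)
        attn = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        agg = (attn @ v).squeeze(1)                           # (B, dim)
        return person_feat + self.out(agg)                    # residual update of the query

person = torch.randn(2, 128)
context = torch.randn(2, 8 * 14 * 14, 128)
print(ActionAttentionUnit()(person, context).shape)  # torch.Size([2, 128])
```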
[action, video, people, human, clip, previous, multiple, ava, temporal, dataset, highres, learns, work, time, frame, qpr, challenge, recognizing, recognize, recognition, flow, spatiotemporal, interesting, passed] [note, rgb, linear, corresponding] [figure, input, described] [performance, network, architecture, size, table, convolutional, unit, batch, original, validation, neural, layer, dropout] [transformer, model, query, attention, evaluate, visual, describe, arxiv, attend, find, refer] [head, feature, person, map, rpn, box, context, bounding, object, detection, proposal, trunk, localization, region, roipool, location, art, failure, regression, center] [classification, training, class, base, data, test, embeddings, classify, set, large, representation, learning]
@InProceedings{Girdhar_2019_CVPR,
  author = {Girdhar, Rohit and Carreira, Joao and Doersch, Carl and Zisserman, Andrew},
  title = {Video Action Transformer Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Timeception for Complex Action Recognition
Noureldien Hussein, Efstratios Gavves, Arnold W.M. Smeulders


This paper focuses on the temporal aspect of recognizing human activities in videos, an important visual cue that has long been undervalued. We revisit the conventional definition of activity and restrict it to Complex Action: a set of one-actions with a weak temporal pattern that serves a specific purpose. Related works use spatiotemporal 3D convolutions with fixed kernel size, too rigid to capture the varieties in temporal extents of complex actions, and too short for long-range temporal modeling. In contrast, we use multi-scale temporal convolutions, and we reduce the complexity of 3D convolutions. The outcome is Timeception convolution layers, which reason about minute-long temporal patterns, a factor of 8 longer than the best related works. As a result, Timeception achieves impressive accuracy in recognizing the human activities of Charades, Breakfast Actions and MultiTHUMOS. Further, we demonstrate that Timeception learns long-range temporal dependencies and tolerates the temporal extents of complex actions.
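The multi-scale temporal idea can be sketched as a bank of depthwise temporal convolutions with different kernel sizes whose outputs are concatenated and mixed; this is a minimal, assumption-laden sketch, not the released Timeception code.

```python
# Minimal sketch: parallel channel-wise (depthwise) temporal convolutions at several scales.
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(channels, channels, kernel_size=(k, 1, 1),
                      padding=(k // 2, 0, 0), groups=channels)  # depthwise, over time only
            for k in kernel_sizes
        ])
        # 1x1x1 conv to mix channels after concatenating the temporal scales
        self.mix = nn.Conv3d(channels * len(kernel_sizes), channels, kernel_size=1)

    def forward(self, x):                      # x: (B, C, T, H, W)
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return self.mix(out)

x = torch.randn(2, 64, 32, 7, 7)
print(MultiScaleTemporalConv(64)(x).shape)     # torch.Size([2, 64, 32, 7, 7])
```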
[temporal, timeception, complex, action, video, spatiotemporal, timesteps, recognition, human, long, modeling, dataset, breakfast, tolerate, second, efstratios, learns, focus, tidy] [rgb, decomposition] [figure, input, shuffling, image, method] [kernel, layer, convolution, channel, number, resnet, table, convolutional, deep, conv, fixed, cnns, top, size, dilation, separable, factor, pooling, experiment, complexity, neural, better, grouped, group, performance, operation, efficient, design] [model] [spatial, backbone, map, cnn, feature, baseline, propose, three, cascade] [learn, subspace, learning, close, test, classification, training, observe, learned]
@InProceedings{Hussein_2019_CVPR,
  author = {Hussein, Noureldien and Gavves, Efstratios and Smeulders, Arnold W.M.},
  title = {Timeception for Complex Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
STEP: Spatio-Temporal Progressive Learning for Video Action Detection
Xitong Yang, Xiaodong Yang, Ming-Yu Liu, Fanyi Xiao, Larry S. Davis, Jan Kautz


In this paper, we propose Spatio-TEmporal Progressive (STEP) action detector--a progressive learning framework for spatio-temporal action detection in videos. Starting from a handful of coarse-scale proposal cuboids, our approach progressively refines the proposals towards actions over a few steps. In this way, high-quality proposals (i.e., adhere to action movements) can be gradually obtained at later steps by leveraging the regression outputs from previous steps. At each step, we adaptively extend the proposals in time to incorporate more related temporal context. Compared to the prior work that performs action detection in one run, our progressive learning framework is able to naturally handle the spatial displacement within action tubes and therefore provides a more effective way for spatio-temporal modeling. We extensively evaluate our approach on UCF101 and AVA, and demonstrate superior detection results. Remarkably, we achieve mAP of 75.0% and 18.6% on the two datasets with 3 progressive steps and using respectively only 11 and 34 initial proposals.
[action, temporal, video, fusion, tubelet, clip, early, sequence, longer, framework, work, perform, extend, performs, optical, flow, anticipation, time, displacement, involves, modeling, movement, frame, tube, linking, multiple, spatiotemporal, report] [initial, approach, problem, ground, truth, algorithm, scene, accurate] [figure, input, based, image, method] [progressive, better, network, convolutional, number, performance, output, adaptive, process, overlap] [step, progressively, model, generate, xiaodong] [detection, spatial, proposal, iou, object, context, regression, bounding, refinement, global, localization, illustrated, location] [learning, classification, extension, training, loss, positive, smax, set, negative, distribution]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Xitong and Yang, Xiaodong and Liu, Ming-Yu and Xiao, Fanyi and Davis, Larry S. and Kautz, Jan},
  title = {STEP: Spatio-Temporal Progressive Learning for Video Action Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Relational Action Forecasting
Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, Cordelia Schmid


This paper focuses on multi-person action forecasting in videos. More precisely, given a history of H previous frames, the goal is to detect actors and to predict their future actions for the next T frames. Our approach jointly models temporal and spatial interactions among different actors by constructing a recurrent graph, using actor proposals obtained with Faster R-CNN as nodes. Our method learns to select a subset of discriminative relations without requiring explicit supervision, thus enabling us to tackle challenging visual data. We refer to our model as Discriminative Relational Recurrent Network (DRRN). Evaluation of action prediction on AVA demonstrates the effectiveness of our proposed method compared to simpler baselines. Furthermore, we significantly improve performance on the task of early action classification on J-HMDB, from the previous SOTA of 48% to 60%.
[action, future, prediction, actor, video, graph, time, forecasting, recurrent, ava, frame, human, temporal, predict, gru, dataset, capture, previous, jointly, early, multiple, modeling, rnn, fnode, interaction, predicting, second, activity, current, outperforms, state, work] [approach, single, compute] [figure, method, proposed, input] [network, neural, performance, connected, table, number, apply, standard, top] [model, relational, visual, node, arxiv, preprint, observed, consider, attention, generating] [feature, detection, relation, predicted, localization, object, proposal, bounding, edge, person, three, region, semantic] [learning, task, classification, set, discriminative, label, function]
@InProceedings{Sun_2019_CVPR,
  author = {Sun, Chen and Shrivastava, Abhinav and Vondrick, Carl and Sukthankar, Rahul and Murphy, Kevin and Schmid, Cordelia},
  title = {Relational Action Forecasting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Long-Term Feature Banks for Detailed Video Understanding
Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, Ross Girshick


To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank--supportive information extracted over the entire span of a video--to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades. Code is available online.
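A rough sketch of the feature-bank idea, under the assumption of a simple dot-product attention between short-term clip features and a precomputed long-term bank (the paper's feature bank operator may differ in detail):

```python
# Minimal sketch: short-term features attend over a long-term bank covering the whole video.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureBankOperator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, short_term, bank):
        # short_term: (N, dim) features from the current 2-5s clip
        # bank:       (M, dim) features precomputed over the entire video
        attn = F.softmax(self.q(short_term) @ self.k(bank).t()
                         / bank.shape[-1] ** 0.5, dim=-1)     # (N, M)
        long_term = attn @ self.v(bank)                        # (N, dim)
        # concatenate short- and long-term information for the downstream classifier
        return torch.cat([short_term, long_term], dim=-1)

clip_feats = torch.randn(5, 256)     # e.g. 5 person boxes in the current clip
bank_feats = torch.randn(120, 256)   # e.g. one feature per second of a 2-minute video
print(FeatureBankOperator()(clip_feats, bank_feats).shape)  # torch.Size([5, 512])
```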
[video, lfb, bank, action, temporal, ava, sto, short, dataset, clip, time, window, work, tsn, strg, recognition, fbonl, frame] [rgb, supplementary, approach, compute, linear, material] [input, prior, figure, based] [table, operator, pooling, size, validation, max, convolutional, batch, layer, schedule, better, standard, deep, implementation, entire, block, network] [model, memory, visual, arxiv, preprint] [feature, cnn, object, person, roi, backbone, detection, pool, map, box, propose, context, spatial, detector, average] [training, test, learning, support, large, classification, set, target, representation, task, train]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Chao-Yuan and Feichtenhofer, Christoph and Fan, Haoqi and He, Kaiming and Krahenbuhl, Philipp and Girshick, Ross},
  title = {Long-Term Feature Banks for Detailed Video Understanding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Which Way Are You Going? Imitative Decision Learning for Path Forecasting in Dynamic Scenes
Yuke Li


Path forecasting is a pivotal step toward understanding dynamic scenes and an emerging topic in the computer vi- sion field. This task is challenging due to the multimodal nature of the future, namely, given a partial history, there is more than one plausible prediction. Yet, the state-of-the-art methods seem not fully responsive to this innate variabil- ity. Hence, how to better foresee the forthcoming trajectory in dynamic scenes has to be more thoroughly pursued. To this end, we propose a novel Imitative Decision Learning (IDL) approach. It delves deeper into the key that inher- ently characterizes the multimodality - the latent decision. The proposed IDL first infers the distribution of such latent decisions by learning from moving histories. A policy is then generated by taking the sampled latent decision into account to predict the future. Different plausible upcoming paths corresponds to each sampled latent decision. This ap- proach significantly differs from the mainstream literature that relies on a predefined latent variable to extrapolate di- verse predictions. In order to augment the understanding of the latent decision and resultant mutimodal future, we in- vestigate their connection through mutual information op- timization. Moreover, the proposed IDL integrates spatial and temporal dependencies into one single framework, in contrast to handling them with two-step settings. As a re- sult, our approach enables simultaneous anticipation of the paths of all pedestrians in the scene. We assess our pro- posal on the large-scale SAP, ETH and UCY datasets. The experiments show that IDL introduces considerable margin improvements with respect to recent leading studies.
[idl, future, forecasting, dynamic, social, temporal, human, eth, spatiotemporal, sap, moving, ucy, motion, recognition, framework, prediction, multimodality, work, time, dataset, predefined, imitative, multiple, historical, trajectory, video, henc, ade, alexandre, nature, deterministic, desire, lstm] [computer, vision, pattern, international, single, ground, truth, approach, optimization] [latent, conference, ieee, proposed, based, generative, figure] [convolutional, best, neural, processing, better, deep, order, inference, deeper, process, impact, layer, network] [decision, path, multimodal, example, understanding, gan, plausible, policy, diverse, discriminator, sampled] [spatial, propose, person, fully, pedestrian, semantic] [learning, distribution, mutual, datasets, set]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yuke},
  title = {Which Way Are You Going? Imitative Decision Learning for Path Forecasting in Dynamic Scenes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment
Paritosh Parmar, Brendan Tran Morris


Can performance on the task of action quality assessment (AQA) be improved by exploiting a description of the action and its quality? Current AQA and skills assessment approaches propose to learn features that serve only one task - estimating the final score. In this paper, we propose to learn spatio-temporal features that explain three related tasks - fine-grained action recognition, commentary generation, and estimating the AQA score. A new multitask-AQA dataset, the largest to date, comprising 1412 diving samples, was collected to evaluate our approach (http://rtis.oit.unlv.edu/datasets.html). We show that our MTL approach outperforms the STL approach using two different kinds of architectures: C3D-AVG and MSCADC. The C3D-AVG-MTL approach achieves the new state-of-the-art performance with a rank correlation of 90.44%. Detailed experiments were performed to show that MTL offers better generalization than STL, and that representations from action recognition models are not sufficient for the AQA task and instead should be learned.
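A minimal sketch of the multitask setup described above, with assumed head sizes and the commentary decoder reduced to a single GRU step purely for illustration:

```python
# Minimal sketch: one shared spatio-temporal feature feeds three task heads.
import torch
import torch.nn as nn

class MultitaskAQAHeads(nn.Module):
    def __init__(self, feat_dim=512, num_actions=48, vocab=4000):  # sizes are assumptions
        super().__init__()
        self.action_head = nn.Linear(feat_dim, num_actions)   # fine-grained action recognition
        self.score_head = nn.Linear(feat_dim, 1)              # AQA quality score (regression)
        self.caption_rnn = nn.GRUCell(feat_dim, feat_dim)     # heavily reduced commentary decoder
        self.word_head = nn.Linear(feat_dim, vocab)

    def forward(self, feat):                                   # feat: (B, feat_dim), e.g. from C3D-AVG
        h = self.caption_rnn(feat)
        return self.action_head(feat), self.score_head(feat), self.word_head(h)

cls, score, words = MultitaskAQAHeads()(torch.randn(2, 512))
print(cls.shape, score.shape, words.shape)
```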
[action, aqa, mtl, dive, dataset, diving, recognition, commentary, assessment, video, splash, work, mscadc, athlete, skill, individual, outperforms, surgical] [approach, computer, international, good, single, well, vision, pattern, position, pose, linear, rotation, provide] [quality, conference, ieee, proposed, comparison, input] [performance, better, network, table, multitask, convolutional, aggregation, number, bit, best] [captioning, model, evaluation, arxiv, preprint, description, common] [score, feature, detailed, final, context, propose, help, backbone, three, improve] [learning, task, existing, learn, stl, train, loss, classification, training, auxiliary, trained, data, set, learned, function, generalization]
@InProceedings{Parmar_2019_CVPR,
  author = {Parmar, Paritosh and Tran Morris, Brendan},
  title = {What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MHP-VOS: Multiple Hypotheses Propagation for Video Object Segmentation
Shuangjie Xu, Daizong Liu, Linchao Bao, Wei Liu, Pan Zhou


We address the problem of semi-supervised video object segmentation (VOS), where the masks of objects of interest are given in the first frame of an input video. To deal with challenging cases where objects are occluded or missing, previous work relies on greedy data association strategies that make decisions for each frame individually. In this paper, we propose a novel approach to defer the decision making for a target object in each frame, until a global view can be established with the entire video being taken into consideration. Our approach is in the same spirit as Multiple Hypotheses Tracking (MHT) methods, making several critical adaptations for the VOS problem. We employ the bounding box (bbox) hypothesis for tracking tree formation, and the multiple hypotheses are spawned by propagating the preceding bbox into the detected bbox proposals within a gated region starting from the initial object mask in the first frame. The gated region is determined by a gating scheme which takes into account a more comprehensive motion model rather than the simple Kalman filtering model in traditional MHT. To further design more customized algorithms tailored for VOS, we develop a novel mask propagation score instead of the appearance similarity score that could be brittle due to large deformations. The mask propagation score, together with the motion score, determines the affinity between the hypotheses during tree pruning. Finally, a novel mask merging strategy is employed to handle mask conflicts between objects. Extensive experiments on challenging datasets demonstrate the effectiveness of the proposed method, especially in cases where objects go missing.
[propagation, video, frame, multiple, motion, tracking, hypothesis, gating, track, time, vos, temporal, mht, davis, previous, kalman, current, challenge, work] [computer, pattern, vision, corresponding, algorithm, single, problem, approach] [method, ieee, conference, appearance, proposed, missing, figure, image] [pruning, best, network, performance, achieves, fast, size, denotes, validation, design, larger] [tree, model, node, probability, arxiv, preprint, find, decision, evaluation, van, gated] [object, mask, segmentation, proposal, score, bounding, box, bbox, instance, region, scoring, final, global, challenging, merging, semantic, detection, map] [target, novel, set, strategy, data, maximum, independent, large, similarity]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Shuangjie and Liu, Daizong and Bao, Linchao and Liu, Wei and Zhou, Pan},
  title = {MHP-VOS: Multiple Hypotheses Propagation for Video Object Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
2.5D Visual Sound
Ruohan Gao, Kristen Grauman


Binaural audio provides a listener with 3D sound sensation, allowing a rich perceptual experience of the scene. However, binaural recordings are scarcely available and require nontrivial expertise and equipment to obtain. We propose to convert common monaural audio into binaural audio by leveraging video. The key idea is that visual frames reveal significant spatial cues that, while explicitly lacking in the accompanying single-channel audio, are strongly linked to it. Our multi-modal approach recovers this link from unlabeled video. We devise a deep convolutional neural network that learns to decode the monaural (single-channel) soundtrack into its binaural counterpart by injecting visual information about object and scene configurations. We call the resulting output 2.5D visual sound—the visual stream helps “lift” the flat single channel audio into spatialized sound. In addition to sound generation, we show the self-supervised representation learned by our network benefits audio-visual source separation. Our video results: http://vision.cs.utexas.edu/projects/2.5D_visual_sound/
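The mono-to-binaural conversion can be sketched as a network that predicts the left/right difference spectrogram from the mono spectrogram conditioned on a visual feature; the plain convolutional stack and all shapes below are assumptions standing in for the paper's U-Net:

```python
# Minimal sketch: visual features condition the prediction of the (L - R) spectrogram.
import torch
import torch.nn as nn

class Mono2Binaural(nn.Module):
    def __init__(self, visual_dim=512, spec_channels=2):   # 2 = real/imag parts of the STFT
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(spec_channels + visual_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, spec_channels, 3, padding=1),
        )

    def forward(self, mono_spec, visual_feat):
        # mono_spec: (B, 2, F, T) spectrogram of the single-channel audio
        # visual_feat: (B, visual_dim) pooled frame feature, tiled over the spectrogram
        B, _, Freq, T = mono_spec.shape
        vis = visual_feat[:, :, None, None].expand(B, -1, Freq, T)
        diff = self.net(torch.cat([mono_spec, vis], dim=1))   # predicted (L - R) spectrogram
        left = (mono_spec + diff) / 2
        right = (mono_spec - diff) / 2
        return left, right

l, r = Mono2Binaural()(torch.randn(2, 2, 257, 64), torch.randn(2, 512))
print(l.shape, r.shape)  # torch.Size([2, 2, 257, 64]) twice
```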
[audio, binaural, sound, mono, video, monaural, perform, signal, complex, spectrogram, predict, dataset, accompanying, speech, stft, recorded, ambisonics, music, work, frame, convert, key, multiple, extract, predicting, report, antonio, human, time, listening] [approach, single, left, international, scene, ear] [separation, input, method, difference, mixed, user, based, conference, figure, separate, image, gopro] [network, deep, channel, better, andrew, table] [visual, generate, room, model, creates] [spatial, predicted, feature, mask, object, baseline, street] [source, learning, data, representation, training, distance, unlabeled, datasets, large, transfer, test]
@InProceedings{Gao_2019_CVPR,
  author = {Gao, Ruohan and Grauman, Kristen},
  title = {2.5D Visual Sound},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model
Weining Wang, Yan Huang, Liang Wang


Current studies on action detection in untrimmed videos are mostly designed for action classes, where an action is described at word level such as jumping, tumbling, swing, etc. This paper focuses on a rarely investigated problem of localizing an activity via a sentence query which would be more challenging and practical. Considering that current methods are generally time-consuming due to the dense frame-processing manner, we propose a recurrent neural network based reinforcement learning model which selectively observes a sequence of frames and associates the given sentence with video content in a matching-based manner. However, directly matching sentences with video content performs poorly due to the large visual-semantic discrepancy. Thus, we extend the method to a semantic matching reinforcement learning (SM-RL) model by extracting semantic concepts of videos and then fusing them with global context features. Extensive experiments on three benchmark datasets, TACoS, Charades-STA and DiDeMo, show that our method achieves the state-of-the-art performance with a high detection speed, demonstrating both effectiveness and efficiency of our method.
[video, temporal, action, frame, activity, time, recognition, state, prediction, untrimmed, dataset, current, recurrent, sequence, clip, start, didemo] [vision, computer, matching, observation, pattern, international, ground, directly, problem, truth] [conference, based, method, ieee, content, comparison, high, figure, input, proposed, image] [table, network, performance, output, number, layer, neural, connected, binary, sigmoid] [model, sentence, agent, concept, reinforcement, query, visual, candidate, reward, observed, find, ctrl, step] [semantic, detection, location, regression, iou, faster, global, fully, localization, feature, context, three, person, sliding, proposal, final, propose] [learning, loss, training, classification, set, train, function, class, experimental, trained]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Weining and Huang, Yan and Wang, Liang},
  title = {Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Gaussian Temporal Awareness Networks for Action Localization
Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, Tao Mei


Temporally localizing actions in a video is a fundamental challenge in video understanding. Most existing approaches have often drawn inspiration from image object detection and extended the advances, e.g., SSD and Faster R-CNN, to produce temporal locations of an action in a 1D sequence. Nevertheless, the results can suffer from a robustness problem due to the design of predetermined temporal scales, which overlooks the temporal structure of an action and limits the utility in detecting actions with complex variations. In this paper, we propose to address the problem by introducing Gaussian kernels to dynamically optimize the temporal scale of each action proposal. Specifically, we present Gaussian Temporal Awareness Networks (GTAN): a new architecture that novelly integrates the exploitation of temporal structure into a one-stage action localization framework. Technically, GTAN models the temporal structure through learning a set of Gaussian kernels, each for a cell in the feature maps. Each Gaussian kernel corresponds to a particular interval of an action proposal and a mixture of Gaussian kernels could further characterize action proposals with various lengths. Moreover, the values in each Gaussian curve reflect the contextual contributions to the localization of an action proposal. Extensive experiments are conducted on both THUMOS14 and ActivityNet v1.3 datasets, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, GTAN achieves 1.9% and 1.1% improvements in mAP on the testing sets of the two datasets.
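One way to read the Gaussian-kernel idea is as a learned, per-cell Gaussian weighting over the temporal axis used for context pooling; the sketch below is an illustrative rendering of that reading, not the GTAN implementation:

```python
# Minimal sketch: each temporal cell learns a Gaussian width and pools context with it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianTemporalPooling(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # one positive width (sigma) predicted per temporal cell
        self.sigma_head = nn.Conv1d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):                                     # x: (B, C, T)
        B, C, T = x.shape
        sigma = F.softplus(self.sigma_head(x)) + 1e-3         # (B, 1, T)
        pos = torch.arange(T, dtype=x.dtype, device=x.device)
        # w[b, i, j] = exp(-(i - j)^2 / (2 sigma_i^2)), normalized over j
        dist = (pos.view(1, T, 1) - pos.view(1, 1, T)) ** 2   # (1, T, T)
        w = torch.exp(-dist / (2 * sigma.transpose(1, 2) ** 2))
        w = w / w.sum(dim=-1, keepdim=True)
        return torch.einsum('bcj,bij->bci', x, w)             # pooled features, (B, C, T)

print(GaussianTemporalPooling(64)(torch.randn(2, 64, 100)).shape)  # torch.Size([2, 64, 100])
```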
[action, temporal, gtan, activitynet, video, multiple, long, ssad, pole, vault, predict, challenge, bsn, auc, capture, utilized, dataset, bernard, stream] [single, corresponding, approach] [figure, mixed, based] [gaussian, kernel, network, structure, convolutional, scale, overlap, pooling, layer, table, number, performance, cell, size, validation, dynamically, architecture, interval, width, better, conv, increase, receptive, fixed, design, weighted] [generate, length, evaluation] [proposal, feature, localization, anchor, map, detection, grouping, boundary, iou, average, contextual, center, regression, default, location, awareness, curve, three, score, object, faster, cascaded, final] [set, loss, learning, classification, testing, base, learnt]
@InProceedings{Long_2019_CVPR,
  author = {Long, Fuchen and Yao, Ting and Qiu, Zhaofan and Tian, Xinmei and Luo, Jiebo and Mei, Tao},
  title = {Gaussian Temporal Awareness Networks for Action Localization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Efficient Video Classification Using Fewer Frames
Shweta Bhardwaj, Mukundhan Srinivasan, Mitesh M. Khapra


Recently, there has been a lot of interest in building compact models for video classification which have a small memory footprint (
[video, lrep, serial, frame, dataset, work, time, lpred, focus, workshop, nextvlad, recognition, recurrent, hidden, sequence, expensive] [netvlad, computer, vision, single, computed, matrix, compute] [based, proposed, conference, figure, intermediate, ieee] [network, parallel, neural, performance, number, efficient, table, best, processing, inference, deep, size, compact, fewer, original, experiment, building, learnable, process, reduce, computational, computationally, mentioned, output] [model, memory, understanding, refer, consider] [map, baseline, final, context, predicted, average] [student, teacher, training, classification, representation, loss, gap, train, lce, trained, learning, idea, distillation, large, knowledge, minimize, data, observe, cluster, set, clustering]
@InProceedings{Bhardwaj_2019_CVPR,
  author = {Bhardwaj, Shweta and Srinivasan, Mukundhan and Khapra, Mitesh M.},
  title = {Efficient Video Classification Using Fewer Frames},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Parsing R-CNN for Instance-Level Human Analysis
Lu Yang, Qing Song, Zhihui Wang, Ming Jiang


Instance-level human analysis is common in real-life scenarios and has multiple manifestations, such as human part segmentation, dense pose estimation, human-object interactions, etc. Models need to distinguish different human instances in the image panel and learn rich features to represent the details of each instance. In this paper, we present an end-to-end pipeline for solving instance-level human analysis, named Parsing R-CNN. It processes a set of human instances simultaneously by comprehensively considering the characteristics of region-based approaches and the appearance of a human, thus allowing it to represent the details of each instance. Parsing R-CNN is very flexible and efficient, and is applicable to many issues in human instance analysis. Our approach outperforms all state-of-the-art methods on CIHP (Crowd Instance-level Human Parsing), MHP v2.0 (Multi-Human Parsing) and DensePose-COCO datasets. Based on the proposed Parsing R-CNN, we reach 1st place in the COCO 2018 Challenge DensePose Estimation task. Code and models are publicly available.
[human, app, report, dataset, outperforms] [pose, dense, estimation, analysis, geometric, approach, good, single, pipeline, place, densepose] [proposed, figure, image, resolution, separation, based, study] [convolutional, table, performance, increasing, network, scale, neural, deep, capacity, err, accuracy, speed, efficient, operation, aspp] [encoding, visual, find] [parsing, branch, segmentation, feature, gce, coco, semantic, cihp, mhp, mask, roi, baseline, object, miou, instance, adopt, context, module, improvement, three, improves, ablation, propose, doll, pyramid, improve, rpn, map, detection, enlarging] [strategy, sampling, large, learning, task]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Lu and Song, Qing and Wang, Zhihui and Jiang, Ming},
  title = {Parsing R-CNN for Instance-Level Human Analysis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Large Scale Incremental Learning
Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, Yun Fu


Modern machine learning suffers from catastrophic forgetting when learning new classes incrementally. The performance dramatically degrades due to the missing data of old classes. Incremental learning methods have been proposed to retain the knowledge acquired from the old classes, by using knowledge distilling and keeping a few exemplars from the old classes. However, these methods struggle to scale up to a large number of classes. We believe this is because of the combination of two factors: (a) the data imbalance between the old and new classes, and (b) the increasing number of visually similar classes. Distinguishing between an increasing number of visually similar classes is particularly challenging, when the training data is unbalanced. We propose a simple and effective method to address this data imbalance issue. We found that the last fully connected layer has a strong bias towards the new classes, and this bias can be corrected by a linear model. With two bias parameters, our method performs remarkably well on two large datasets: ImageNet (1000 classes) and MS-Celeb-1M (10000 classes), outperforming the state-of-the-art algorithms by 11.1% and 13.2% respectively.
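The two-parameter bias correction is simple enough to sketch directly; the snippet below is an illustrative rendering of the described linear correction applied to new-class logits (parameter placement follows the abstract, other details are assumed):

```python
# Minimal sketch: old-class logits pass through unchanged, new-class logits are
# rescaled by alpha and shifted by beta, both learned on a small validation set.
import torch
import torch.nn as nn

class BiasCorrection(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, logits, num_old_classes):
        old = logits[:, :num_old_classes]
        new = self.alpha * logits[:, num_old_classes:] + self.beta
        return torch.cat([old, new], dim=1)

logits = torch.randn(4, 100)               # e.g. 80 old classes + 20 new classes
print(BiasCorrection()(logits, 80).shape)  # torch.Size([4, 100])
```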
[dataset, outperforms] [linear, computer, bound, estimate, note] [method, correction, figure, degradation, proposed, conference, image, visually, study, ieee] [number, validation, layer, small, connected, accuracy, performance, scale, neural, deep, table, effective, convolution, network, compared, batch, better, compare, best] [model, random, strong, step, correct, simple, machine] [fully, feature, final, baseline, ablation] [incremental, bias, learning, bic, training, data, icarl, eeil, set, large, loss, split, upper, classification, classifier, knowledge, distilling, imbalance, class, datasets, distillation, learn, valold, lwf, valnew, stored, exemplar, catastrophic, gap, subset, confusion, viewed, trainold]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Yue and Chen, Yinpeng and Wang, Lijuan and Ye, Yuancheng and Liu, Zicheng and Guo, Yandong and Fu, Yun},
  title = {Large Scale Incremental Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
TopNet: Structural Point Cloud Decoder
Lyne P. Tchapmi, Vineet Kosaraju, Hamid Rezatofighi, Ian Reid, Silvio Savarese


3D point cloud generation is of great use for 3D scene modeling and understanding. Real-world 3D object point clouds can be properly described by a collection of low-level and high-level structures such as surfaces, geometric primitives, semantic parts, etc. In fact, there exist many different representations of a 3D object point cloud as a set of point groups. Existing frameworks for point cloud generation either do not consider structure in their proposed solutions, or assume and enforce a specific structure/topology, e.g. a collection of manifolds or surfaces, for the generated point cloud of a 3D object. In this work, we propose a novel decoder that generates a structured point cloud without assuming any specific structure or topology on the underlying point set. Our decoder is softly constrained to generate a point cloud following a hierarchical rooted tree structure. We show that given enough capacity and allowing for redundancies, the proposed decoder is very flexible and able to learn any arbitrary grouping of points, including any topology on the point set. We evaluate our decoder on the task of point cloud generation for 3D point cloud shape completion. Combined with encoders from existing frameworks, we show that our proposed decoder significantly outperforms state-of-the-art 3D point cloud completion methods on the ShapeNet dataset.
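A rough sketch of a rooted-tree decoder in the spirit described above, with assumed feature sizes and branching factors (the released TopNet architecture differs in detail):

```python
# Minimal sketch: an encoder feature is expanded level by level with small MLPs,
# and the leaves are decoded into 3D points, giving the cloud a hierarchical grouping.
import torch
import torch.nn as nn

class TreeDecoder(nn.Module):
    def __init__(self, feat_dim=256, branching=(4, 4, 8)):
        super().__init__()
        self.levels = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, feat_dim * b), nn.ReLU())
            for b in branching
        ])
        self.branching = branching
        self.to_xyz = nn.Linear(feat_dim, 3)

    def forward(self, root):                        # root: (B, feat_dim) from any encoder
        nodes = root.unsqueeze(1)                   # (B, 1, feat_dim)
        for layer, b in zip(self.levels, self.branching):
            B, N, D = nodes.shape
            nodes = layer(nodes).view(B, N * b, D)  # every node spawns b children
        return self.to_xyz(nodes)                   # (B, num_leaves, 3) point cloud

pts = TreeDecoder()(torch.randn(2, 256))
print(pts.shape)  # torch.Size([2, 128, 3]); 4*4*8 leaves
```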
[previous, recognition, leaf, represented, outperforms, work] [point, cloud, topology, shape, completion, rooted, computer, vision, chamfer, pattern, discrete, topological, general, definition, local, finite, defined, proposition, single] [method, proposed, conference, ieee, input, arbitrary, figure, based] [structure, number, structured, network, mlp, performance, neural, design, represents, deep, output, root, table, multilayer, architecture] [decoder, tree, node, generate, encoder, generation, represent, generated, generates, generating, model, collection, representing, embed, evaluate, partial] [object, feature, propose, level, improvement, final] [set, learning, distance, representation, subset, specific, embedding, task, loss, space, existing, enforcing, learned]
@InProceedings{Tchapmi_2019_CVPR,
  author = {Tchapmi, Lyne P. and Kosaraju, Vineet and Rezatofighi, Hamid and Reid, Ian and Savarese, Silvio},
  title = {TopNet: Structural Point Cloud Decoder},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Perceive Where to Focus: Learning Visibility-Aware Part-Level Features for Partial Person Re-Identification
Yifan Sun, Qin Xu, Yali Li, Chi Zhang, Yikang Li, Shengjin Wang, Jian Sun


This paper considers a realistic problem in the person re-identification (re-ID) task, i.e., partial re-ID. Under the partial re-ID scenario, the images may contain a partial observation of a pedestrian. If we directly compare a partial pedestrian image with a holistic one, the extreme spatial misalignment significantly compromises the discriminative ability of the learned representation. We propose a Visibility-aware Part Model (VPM) for partial re-ID, which learns to perceive the visibility of regions through self-supervision. The visibility awareness allows VPM to extract region-level features and compare two images with focus on their shared regions (which are visible in both images). VPM gains a two-fold benefit toward higher accuracy for partial re-ID. On the one hand, compared with learning a global feature, VPM learns region-level features and thus benefits from fine-grained information. On the other hand, with visibility awareness, VPM is capable of estimating the shared regions between two images and thus suppresses the spatial misalignment. Experimental results confirm that our method significantly improves the learned feature representation and the achieved accuracy is on par with the state of the art.
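The visibility-aware comparison can be sketched as a region-wise distance weighted by the product of the two images' visibility scores; this is an illustrative reading with assumed shapes, not the authors' code:

```python
# Minimal sketch: distances accumulate only over regions visible in both images.
import torch

def visibility_aware_distance(feats_a, vis_a, feats_b, vis_b):
    # feats_*: (R, D) one feature per predefined region; vis_*: (R,) visibility in [0, 1]
    region_dist = ((feats_a - feats_b) ** 2).sum(dim=1)   # (R,) squared L2 per region
    weight = vis_a * vis_b                                # shared-region confidence
    return (weight * region_dist).sum() / weight.sum().clamp(min=1e-6)

fa, fb = torch.randn(6, 256), torch.randn(6, 256)
va = torch.tensor([1.0, 1.0, 1.0, 0.1, 0.0, 0.0])   # e.g. lower half of person A occluded
vb = torch.ones(6)
print(visibility_aware_distance(fa, va, fb, vb))
```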
[learns, focus, state, human, formulated] [visibility, visible, computer, corresponding, problem, pose, well, allows, local, respective] [image, input, method, ieee, conference, realistic, comparison, pixel, figure, identity] [convolutional, accuracy, deep, performance, compared, tensor, table, achieved, achieves] [partial, probability, model, generates, visual] [vpm, region, feature, holistic, pedestrian, person, crop, spatial, locator, baseline, map, misalignment, comparing, pcb, three, global, bottom, liang, shengjin, jian, propose, awareness, unshared] [learning, training, loss, triplet, shared, retrieval, set, strategy, trained, distance, extractor, datasets, train, large, discriminative, learn]
@InProceedings{Sun_2019_CVPR,
  author = {Sun, Yifan and Xu, Qin and Li, Yali and Zhang, Chi and Li, Yikang and Wang, Shengjin and Sun, Jian},
  title = {Perceive Where to Focus: Learning Visibility-Aware Part-Level Features for Partial Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Meta-Transfer Learning for Few-Shot Learning
Qianru Sun, Yaoyao Liu, Tat-Seng Chua, Bernt Schiele


Meta-learning has been proposed as a framework to address the challenging few-shot learning setting. The key idea is to leverage a large number of similar few-shot tasks in order to learn how to adapt a base-learner to a new task for which only a few labeled samples are available. As deep neural networks (DNNs) tend to overfit using a few samples only, meta-learning typically uses shallow neural networks (SNNs), thus limiting its effectiveness. In this paper we propose a novel few-shot learning method called meta-transfer learning (MTL) which learns to adapt a deep NN for few shot learning tasks. Specifically, "meta" refers to training multiple tasks, and "transfer" is achieved by learning scaling and shifting functions of DNN weights for each task. In addition, we introduce the hard task (HT) meta-batch scheme as an effective learning curriculum for MTL. We conduct experiments using (5-class, 1-shot) and (5-class, 5-shot) recognition tasks on two challenging few-shot learning benchmarks: miniImageNet and Fewshot-CIFAR100. Extensive comparisons to related works validate that our meta-transfer learning approach trained with the proposed HT meta-batch scheme achieves top performance. An ablation study also shows that both components contribute to fast convergence and high accuracy.
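A minimal sketch of the scaling-and-shifting idea on a single frozen convolution, with assumed shapes (the meta-training loop and hard-task meta-batch scheme are omitted):

```python
# Minimal sketch: the pretrained kernel stays frozen; only a per-channel scale and
# shift on top of it are adapted during meta-training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleShiftConv(nn.Module):
    def __init__(self, pretrained_conv: nn.Conv2d):
        super().__init__()
        self.weight = pretrained_conv.weight.detach()    # frozen pretrained kernel
        self.bias = None if pretrained_conv.bias is None else pretrained_conv.bias.detach()
        out_ch = self.weight.shape[0]
        self.scale = nn.Parameter(torch.ones(out_ch, 1, 1, 1))   # learned scaling
        self.shift = nn.Parameter(torch.zeros(out_ch))           # learned shifting

    def forward(self, x):
        w = self.weight * self.scale
        b = self.shift if self.bias is None else self.bias + self.shift
        # padding=1 assumes a 3x3 pretrained kernel, as in the example below
        return F.conv2d(x, w, b, stride=1, padding=1)

conv = nn.Conv2d(3, 16, 3, padding=1)        # stands in for a pretrained layer
print(ScaleShiftConv(conv)(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```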
[mtl, multiple, recognition, online, learns] [algorithm, note, approach, single] [method, figure, proposed, based, image] [deep, dnn, conv, number, phase, performance, accuracy, table, scaling, optimize, fast, network, neural, effective, achieves, neuron, better, gradient, layer, batch, called, optimized] [shifting, model, episode, evaluate] [feature, object, challenging, three, failure] [learning, task, training, data, hard, transfer, extractor, sample, maml, miniimagenet, learn, test, classification, trained, loss, meta, curriculum, unseen, classifier, large, convergence, learner, tunseen, update, novel]
@InProceedings{Sun_2019_CVPR,
  author = {Sun, Qianru and Liu, Yaoyao and Chua, Tat-Seng and Schiele, Bernt},
  title = {Meta-Transfer Learning for Few-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Structured Binary Neural Networks for Accurate Image Classification and Semantic Segmentation
Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, Ian Reid


In this paper, we propose to train convolutional neural networks (CNNs) with both binarized weights and activations, leading to quantized models specifically for mobile devices with limited power capacity and computation resources. By assuming the same architecture as full-precision networks, previous works on quantizing CNNs seek to preserve the floating-point information using a set of discrete values, which we call value approximation. However, we take a novel "structure approximation" view of quantization: it is very likely that a different architecture may be better suited to achieving the best performance. In particular, we propose a "network decomposition" strategy, named Group-Net, in which we divide the network into groups. In this way, each full-precision group can be effectively reconstructed by aggregating a set of homogeneous binary branches. In addition, we learn effective connections among groups to improve the representational capability. Moreover, the proposed Group-Net shows strong generalization to other tasks. For instance, we extend Group-Net to highly accurate semantic segmentation by embedding rich context into the binary structure. Experiments on both classification and semantic segmentation tasks demonstrate the superior performance of the proposed methods over various popular architectures. In particular, we outperform the previous best binary neural networks in terms of accuracy, with huge computation savings.
[multiple, previous] [decomposition, directly, accurate, approach, note, homogeneous, limited] [ieee, proposed, comparison, image, based, decompose, method] [binary, neural, network, quantization, convolutional, structure, approximate, performance, deep, architecture, group, approximation, complexity, convolution, efficient, design, atrous, residual, table, computational, binarization, bpac, imagenet, output, original, number, conv, better, block, weight, layer, parallel, highly, explore, achieves, binarizations, hin, dilation, binarized, accuracy, designing] [model, arxiv, preprint] [propose, semantic, segmentation, branch, pascal, object] [set, learning, strategy, classification, training, combination, seek, learn, train]
@InProceedings{Zhuang_2019_CVPR,
  author = {Zhuang, Bohan and Shen, Chunhua and Tan, Mingkui and Liu, Lingqiao and Reid, Ian},
  title = {Structured Binary Neural Networks for Accurate Image Classification and Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep RNN Framework for Visual Sequential Applications
Bo Pang, Kaiwen Zha, Hanwen Cao, Chen Shi, Cewu Lu


Extracting temporal and representation features efficiently plays a pivotal role in understanding visual sequence information. To deal with this, we propose a new recurrent neural framework that can be stacked deep effectively. There are two main novel designs in our deep RNN framework: one is a new RNN module called the Context Bridge Module (CBM), which splits the information flowing along the sequence (temporal direction) and along depth (spatial representation direction), making it easier to train when built deep by balancing these two directions; the other is the Overlap Coherence Training Scheme, which reduces the training complexity for long visual sequential tasks given the limitation of computing resources. We provide empirical evidence to show that our deep RNN framework is easy to optimize and can gain accuracy from the increased depth on several visual sequence problems. On these tasks, we evaluate our deep RNN framework with 15 layers, 7x deeper than conventional RNNs, but it is still easy to train. Our deep framework achieves more than 11% relative improvements over shallow RNN models on Kinetics, UCF-101, and HMDB-51 for video classification. For auxiliary annotation, after replacing the shallow RNN part of Polygon-RNN with our 15-layer deep CBM, the performance improves by 14.7%. For video future prediction, our deep RNN improves the state-of-the-art shallow model's performance by 2.4% on PSNR and SSIM.
[rnn, temporal, action, sequence, video, cbm, coherence, flow, framework, recognition, recurrent, long, convlstm, sequential, lstm, bridge, anticipation, dataset, future, stacking, extract, prediction, short] [depth, relative, analysis] [method, image, proposed, psnr, input, ssim, figure, based] [deep, overlap, shallow, neural, convolutional, scheme, rate, unit, original, structure, performance, denotes, computing, accuracy, compared, production, achieves, dropout, table, stacked, design, layer, architecture, deeper, better, process] [model, visual, arxiv, evaluate, length] [spatial, adopt, merge, module, context, annotation, iou, feature] [training, representation, set, loss, learning, conventional, classification, task, train, function]
@InProceedings{Pang_2019_CVPR,
  author = {Pang, Bo and Zha, Kaiwen and Cao, Hanwen and Shi, Chen and Lu, Cewu},
  title = {Deep RNN Framework for Visual Sequential Applications},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Graph-Based Global Reasoning Networks
Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Yan Shuicheng, Jiashi Feng, Yannis Kalantidis


Globally modeling and reasoning over relations between regions can be beneficial for many computer vision tasks on both images and videos. Convolutional Neural Networks (CNNs) excel at modeling local relations by convolution operations, but they are typically inefficient at capturing global relations between distant regions and require stacking multiple convolution layers. In this work, we propose a new approach for reasoning globally in which a set of features are globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be efficiently computed. After reasoning, relation-aware features are distributed back to the original coordinate space for down-stream tasks. We further present a highly efficient instantiation of the proposed approach and introduce the Global Reasoning unit (GloRe unit) that implements the coordinate-interaction space mapping by weighted global pooling and weighted broadcasting, and the relation reasoning via graph convolution on a small graph in interaction space. The proposed GloRe unit is lightweight, end-to-end trainable and can be easily plugged into existing CNNs for a wide range of tasks. Extensive experiments show our GloRe unit can consistently boost the performance of state-of-the-art backbone architectures, including ResNet, ResNeXt, SE-Net and DPN, for both 2D and 3D CNNs, on image classification, semantic segmentation and video action recognition task.
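A rough sketch of the GloRe-style unit: features are softly projected onto a handful of graph nodes, a small graph convolution reasons over them, and the result is distributed back and fused residually; the sizes and the exact graph-convolution form below are assumptions:

```python
# Minimal sketch: coordinate space -> small interaction graph -> back to coordinate space.
import torch
import torch.nn as nn

class GlobalReasoningUnit(nn.Module):
    def __init__(self, channels=256, num_nodes=16, node_dim=64):
        super().__init__()
        self.phi = nn.Conv2d(channels, node_dim, 1)          # reduce channel dimension
        self.theta = nn.Conv2d(channels, num_nodes, 1)       # soft assignment to graph nodes
        self.node_mix = nn.Conv1d(num_nodes, num_nodes, 1)   # message passing across nodes
        self.state_update = nn.Conv1d(node_dim, node_dim, 1) # per-node state update
        self.expand = nn.Conv2d(node_dim, channels, 1)       # back to the original width

    def forward(self, x):                                     # x: (B, C, H, W)
        B, C, H, W = x.shape
        feats = self.phi(x).flatten(2)                        # (B, node_dim, HW)
        assign = self.theta(x).flatten(2)                     # (B, num_nodes, HW)
        nodes = assign @ feats.transpose(1, 2)                # (B, num_nodes, node_dim)
        nodes = nodes + self.node_mix(nodes)                  # reasoning over the small graph
        nodes = self.state_update(nodes.transpose(1, 2)).transpose(1, 2)
        out = assign.transpose(1, 2) @ nodes                  # (B, HW, node_dim), distribute back
        out = out.transpose(1, 2).view(B, -1, H, W)
        return x + self.expand(out)                           # residual fusion

print(GlobalReasoningUnit()(torch.randn(2, 256, 32, 32)).shape)  # torch.Size([2, 256, 32, 32])
```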
[graph, interaction, video, recognition, multiple, action, capture, focus, capturing, stacking, dataset, modeling] [coordinate, projection, vision, single, computer, approach, matrix, globally, directly, optimization] [proposed, figure, method, image, input, distant] [convolution, glore, unit, performance, deep, convolutional, weighted, layer, imagenet, accuracy, neural, table, original, efficient, better, validation, residual, block, deeper, pooling, cnns, network, gain, nonlocal, architecture, output, highly, number, disjoint, denotes] [reasoning, node, step, model, arxiv, adding, preprint, find] [global, feature, extra, backbone, semantic, propose, segmentation, relation, baseline, fcn, object, adopt, cnn] [space, learning, set, training, classification, dimension, trained]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Yunpeng and Rohrbach, Marcus and Yan, Zhicheng and Shuicheng, Yan and Feng, Jiashi and Kalantidis, Yannis},
  title = {Graph-Based Global Reasoning Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SSN: Learning Sparse Switchable Normalization via SparsestMax
Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, Ping Luo


Normalization methods improve both optimization and generalization of ConvNets. To further boost performance, the recently-proposed switchable normalization (SN) provides a new perspective for deep learning: it learns to select different normalizers for different convolution layers of a ConvNet. However, SN uses softmax function to learn importance ratios to combine normalizers, leading to redundant computations compared to a single normalizer. This work addresses this issue by presenting Sparse Switchable Normalization (SSN) where the importance ratios are constrained to be sparse. Unlike l_1 and l_0 constraints that impose difficulties in optimization, we turn this constrained optimization problem into feed-forward computation by proposing SparsestMax, which is a sparse version of softmax. SSN has several appealing properties. (1) It inherits all benefits from SN such as applicability in various tasks and robustness to a wide range of batch sizes. (2) It is guaranteed to select only one normalizer for each normalization layer, avoiding redundant computations. (3) SSN can be transferred to various tasks in an end-to-end manner. Extensive experiments show that SSN outperforms its counterparts on various challenging benchmarks such as ImageNet, Cityscapes, ADE20K, and Kinetics. Code is available at https://github.com/switchablenorms/Sparse_SwitchNorm.
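For context, plain switchable normalization can be sketched as below: instance, layer, and batch statistics are combined with learned importance weights via softmax. SSN's contribution, SparsestMax, replaces that softmax so that exactly one normalizer survives per layer; that piece is not reproduced here, and all shapes are assumptions.

```python
# Minimal sketch of switchable normalization (the SN baseline that SSN sparsifies).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableNorm2d(nn.Module):
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(channels))
        self.bias = nn.Parameter(torch.zeros(channels))
        self.mean_logits = nn.Parameter(torch.zeros(3))   # importance ratios for IN/LN/BN means
        self.var_logits = nn.Parameter(torch.zeros(3))    # importance ratios for variances
        self.eps = eps

    def forward(self, x):                                  # x: (B, C, H, W)
        mean_in, var_in = x.mean((2, 3), keepdim=True), x.var((2, 3), keepdim=True)
        mean_ln, var_ln = x.mean((1, 2, 3), keepdim=True), x.var((1, 2, 3), keepdim=True)
        mean_bn, var_bn = x.mean((0, 2, 3), keepdim=True), x.var((0, 2, 3), keepdim=True)
        wm = F.softmax(self.mean_logits, dim=0)            # SSN replaces this with SparsestMax
        wv = F.softmax(self.var_logits, dim=0)
        mean = wm[0] * mean_in + wm[1] * mean_ln + wm[2] * mean_bn
        var = wv[0] * var_in + wv[1] * var_ln + wv[2] * var_bn
        x = (x - mean) / torch.sqrt(var + self.eps)
        return x * self.weight[None, :, None, None] + self.bias[None, :, None, None]

print(SwitchableNorm2d(16)(torch.randn(4, 16, 8, 8)).shape)  # torch.Size([4, 16, 8, 8])
```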
[kinetics, work, recognition, learns, action, multiple] [computer, completely, vision, direction, optimization, constraint, solution, single, radius, algorithm, pattern, point, problem, denote] [ieee, image, figure, conference] [ssn, sparse, normalization, batch, sparsestmax, deep, sparsemax, size, normalizer, simplex, validation, performance, table, network, switchable, neural, imagenet, convolution, accuracy, inference, selection, increasing, achieves, regularization, output, rate, group, circular, higher, pretrained, layer, sparsity, variance] [arxiv, preprint, model, making, evaluate, constrained] [stage, three, semantic, center, segmentation] [training, learning, softmax, learn, function, train, space, distribution, generalization, set, trained, loss, select]
@InProceedings{Shao_2019_CVPR,
  author = {Shao, Wenqi and Meng, Tianjian and Li, Jingyu and Zhang, Ruimao and Li, Yudian and Wang, Xiaogang and Luo, Ping},
  title = {SSN: Learning Sparse Switchable Normalization via SparsestMax},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Spherical Fractal Convolutional Neural Networks for Point Cloud Recognition
Yongming Rao, Jiwen Lu, Jie Zhou


We present a generic, flexible and 3D rotation invariant framework based on spherical symmetry for point cloud recognition. By introducing a regular icosahedral lattice and its fractals to approximate and discretize the sphere, convolution can be easily implemented to process 3D points. Based on the fractal structure, a hierarchical feature learning framework together with an adaptive sphere projection module is proposed to learn deep features in an end-to-end manner. Our framework not only inherits the strong representation power and generalization capability of convolutional neural networks for image recognition, but also extends CNNs to learn robust features resistant to rotations and perturbations. The proposed model is effective yet robust. A comprehensive experimental study demonstrates that our approach can achieve competitive performance compared to state-of-the-art techniques on both 3D object classification and part segmentation tasks, while outperforming other rotation invariant models on rotated 3D object classification and retrieval tasks by a large margin.
[recognition, framework, previous, discretized] [point, spherical, cloud, projection, local, shape, fractal, rotation, lattice, robust, symmetry, sphere, pointnet, shapenet, vertex, hao, approach, directly, irregular, algorithm, well, volumetric] [method, input, proposed, image, study, figure, based] [convolutional, performance, neural, deep, network, convolution, table, structure, number, processing, architecture, achieve, block, resistant, compared, accuracy, applied, better] [model, robustness, regular, easily, adversarial, arxiv, preprint, strong] [feature, segmentation, cnn, object, neighboring, module, stage, including, hierarchical, rotated, improve] [learning, classification, retrieval, training, invariant, learn, representation, generalization, generalize, data, set, trained, experimental, unseen, nearest]
@InProceedings{Rao_2019_CVPR,
  author = {Rao, Yongming and Lu, Jiwen and Zhou, Jie},
  title = {Spherical Fractal Convolutional Neural Networks for Point Cloud Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Generate Synthetic Data via Compositing
Shashank Tripathi, Siddhartha Chandra, Amit Agrawal, Ambrish Tyagi, James M. Rehg, Visesh Chari


We present a task-specific approach to synthetic data generation. Our framework employs a trainable synthesizer network that is optimized to produce meaningful training samples by assessing the strengths and weaknesses of a 'target' classifier. The synthesizer and target networks are trained in an adversarial manner wherein each network is updated with a goal to outdo the other. Additionally, we ensure the synthesizer generates realistic data by pairing it with a discriminator trained on real-world images. Further, to make the target classifier invariant to blending artefacts, we introduce these artefacts to background regions of the training images so the target does not over-fit to them. We demonstrate the efficacy of our approach by applying it to different target networks including a classification network on AffNIST [46], and two object detection networks (SSD, Faster-RCNN) on different datasets. On the AffNIST benchmark, our approach is able to surpass the baseline results with just half the training examples. On the VOC person detection benchmark, we show improvements of up to 2.7% as a result of our data augmentation. Similarly on the GMU detection benchmark, we report a performance boost of 3.5% in mAP over the baseline method, outperforming the previous state of the art approaches by as much as 7.5% in individual categories.
[dataset, recognition, work, updated, multiple, previous] [approach, computer, vision, pattern, affine, note] [synthetic, image, background, blending, synthesizer, affnist, figure, composite, conference, ieee, real, synthesis, gmu, method, produce, compositing, realistic, demonstrate, paste, comparison, synthesized] [network, performance, accuracy, neural, table, deep, process, convolutional, processing] [discriminator, generated, generate, adversarial, model, generation, generating, adding, example, improved] [object, baseline, foreground, improve, detection, voc, person, ssd, bounding, instance, improves, spatial, feature, map, pascal, context, iou, ross] [data, target, training, learning, hard, trained, augmentation, set, classification, loss, test, classifier, positive, mnist]
@InProceedings{Tripathi_2019_CVPR,
  author = {Tripathi, Shashank and Chandra, Siddhartha and Agrawal, Amit and Tyagi, Ambrish and Rehg, James M. and Chari, Visesh},
  title = {Learning to Generate Synthetic Data via Compositing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Divide and Conquer the Embedding Space for Metric Learning
Artsiom Sanakoyeu, Vadim Tschernezki, Uta Buchler, Bjorn Ommer


Learning the embedding space, where semantically similar objects are located close together and dissimilar objects far apart, is a cornerstone of many computer vision applications. Existing approaches usually learn a single metric in the embedding space for all available data points, which may have a very complex non-uniform distribution with different notions of similarity between objects, e.g. appearance, shape, color or semantic meaning. Approaches for learning a single distance metric often struggle to encode all different types of relationships and do not generalize well. In this work, we propose a novel easy-to-implement divide and conquer approach for deep metric learning, which significantly improves the state-of-the-art performance of metric learning. Our approach utilizes the embedding space more efficiently by jointly splitting the embedding space and data into K smaller sub-problems. It divides both the data and the embedding space into K subsets and learns K separate distance metrics in the non-overlapping subspaces of the embedding space, defined by groups of neurons in the embedding layer of the neural network. The proposed approach increases the convergence speed and improves generalization since the complexity of each sub-problem is reduced compared to the original one. We show that our approach outperforms the state-of-the-art by a large margin in retrieval, clustering and re-identification tasks on CUB200-2011, CARS196, Stanford Online Products, In-shop Clothes and PKU VehicleID datasets. Source code: https://bit.ly/dcesml.
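A minimal sketch of the divide-and-conquer training signal: the embedding dimensions are split into K consecutive slices and learner k trains only its slice on triplets drawn from its data cluster; clustering, re-assignment, and the final fine-tuning stage are omitted, and all sizes are assumptions.

```python
# Minimal sketch: K learners, each owning a slice of the embedding and a data cluster.
import torch
import torch.nn as nn
import torch.nn.functional as F

K, D = 4, 128
embedder = nn.Sequential(nn.Linear(512, D))         # stands in for a CNN embedding head
triplet = nn.TripletMarginLoss(margin=0.2)

def sub_embedding(x, k):
    """Return the slice of the embedding owned by learner k, L2-normalized."""
    e = embedder(x)
    return F.normalize(e[:, k * (D // K):(k + 1) * (D // K)], dim=1)

# one illustrative update for learner k on a triplet drawn from its data cluster
k = 2
anchor, positive, negative = (torch.randn(8, 512) for _ in range(3))
loss = triplet(sub_embedding(anchor, k),
               sub_embedding(positive, k),
               sub_embedding(negative, k))
loss.backward()
print(loss.item())
```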
[online, dataset, individual, recognition, divide, conquer] [computer, approach, vision, pattern, international, single, defined, histogram] [conference, ieee, image, figure, based, splitting, proposed, qualitative] [deep, neural, layer, number, network, entire, performance, complexity, original, denotes, processing, size, epoch, full, table, implementation, smaller] [query] [clothes, feature] [embedding, learning, metric, data, loss, space, training, distance, set, clustering, margin, triplet, cluster, test, stanford, retrieval, negative, similarity, learn, learner, nmi, distribution, large, pku, subspace, embeddings, existing, vehicleid, class, learned, datasets, train, split, sampling, randomly, trained, independent, hdc, bier, dreml]
@InProceedings{Sanakoyeu_2019_CVPR,
  author = {Sanakoyeu, Artsiom and Tschernezki, Vadim and Buchler, Uta and Ommer, Bjorn},
  title = {Divide and Conquer the Embedding Space for Metric Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Latent Space Autoregression for Novelty Detection
Davide Abati, Angelo Porrello, Simone Calderara, Rita Cucchiara


Novelty detection is commonly referred to as the discrimination of observations that do not conform to a learned model of regularity. Despite its importance in different application settings, designing a novelty detector is utterly complex due to the unpredictable nature of novelties and their inaccessibility during the training procedure, factors which expose the unsupervised nature of the problem. In our proposal, we design a general unsupervised framework where we equip a deep autoencoder with a parametric density estimator that learns the probability distribution underlying the latent representations with an autoregressive procedure. We show that a maximum likelihood objective, optimized in conjunction with the reconstruction of normal samples, effectively acts as a regularizer for the task at hand, by minimizing the differential entropy of the distribution spanned by latent vectors. In addition to providing a very general formulation, extensive experiments of our model on publicly available datasets deliver on-par or superior performance compared to state-of-the-art methods in one-class and video anomaly detection settings. Differently from our competitors, we remark that our proposal does not make any assumption about the nature of the novelties, making our work easily applicable to disparate contexts.
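A loose sketch of the training objective described above: reconstruction of normal samples plus maximum likelihood of an autoregressive density on the latent codes. The single masked linear layer here is only a toy stand-in for the authors' autoregressive estimator, and the weight lam is an assumed hyperparameter.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyAutoregressiveDensity(nn.Module):
        # predicts a Gaussian for each latent dimension conditioned only on earlier dimensions
        def __init__(self, d):
            super().__init__()
            self.mu = nn.Linear(d, d)
            self.logvar = nn.Linear(d, d)
            self.register_buffer("mask", torch.tril(torch.ones(d, d), diagonal=-1))

        def forward(self, z):
            mu = F.linear(z, self.mu.weight * self.mask, self.mu.bias)
            logvar = F.linear(z, self.logvar.weight * self.mask, self.logvar.bias)
            return mu, logvar

    def novelty_objective(x, x_recon, z, density, lam=1.0):
        recon = F.mse_loss(x_recon, x)                                  # reconstruct normal samples
        mu, logvar = density(z)
        nll = 0.5 * (logvar + (z - mu).pow(2) / logvar.exp()).mean()    # Gaussian NLL, constants dropped
        return recon + lam * nll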
[anomaly, modeling, surprisal, video, nature, event, recurrent, temporal, abnormal, employed, frame, framework, remembering, despite, modeled] [international, computer, normal, reconstruction, pattern, estimator, vision, differential, estimation, analysis, parametric, underlying, supplementary] [conference, latent, ieee, input, prior, image, figure, generative, proposed, masked, attentional] [density, autoregressive, deep, output, neural, employ, convolution, ucsd, network, order, gaussian, convolutional, stacked, standard, autoregression, low] [model, encoder, machine, probability, van, type, variational] [detection, score, proposal, feature, semantic, roc] [novelty, training, distribution, learning, test, autoencoder, likelihood, entropy, novel, learned, representation, unsupervised, set, vae, mnist, auroc, space, task]
@InProceedings{Abati_2019_CVPR,
  author = {Abati, Davide and Porrello, Angelo and Calderara, Simone and Cucchiara, Rita},
  title = {Latent Space Autoregression for Novelty Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Attending to Discriminative Certainty for Domain Adaptation
Vinod Kumar Kurmi, Shanu Kumar, Vinay P. Namboodiri


In this paper, we aim to solve unsupervised domain adaptation of classifiers, where we have access to label information for the source domain but not for the target domain. While various methods have been proposed for this problem, including adversarial discriminator-based methods, most approaches have focused on adapting the entire image. Within an image, however, some regions can be adapted better than others; for instance, the foreground object may be similar in nature. To identify such regions, we propose methods that consider the probabilistic certainty estimate of various regions and focus specifically on these during classification for adaptation. We observe that just by incorporating the probabilistic certainty of the discriminator while training the classifier, we are able to obtain state-of-the-art results on various datasets compared against all the recent methods. We provide a thorough empirical analysis of the method through ablation studies, statistical significance tests, and visualization of the attention maps and t-SNE embeddings. These evaluations convincingly demonstrate the effectiveness of the proposed approach.
[dataset, state, recognition, consists] [computer, vision, pattern, estimation, analysis, international, defined, problem, provide, solving] [conference, proposed, based, image, ieee, method, figure, statistical, clear, high] [deep, bayesian, neural, table, better, processing, resnet, accuracy, number, kumar, vinay] [discriminator, attention, adversarial, model, machine, visual, empirical] [art, feature, foreground, visualization, var, propose, average, object] [domain, uncertainty, adaptation, certainty, aleatoric, predictive, target, classifier, source, unsupervised, classification, class, training, learning, loss, transfer, adapted, uncertain, trained, discrepancy, data, negative, kate, label, significance, obtaining, reported, mingsheng, jianmin, probabilistic, observe, distance, grl, discriminative, vinod, datasets]
@InProceedings{Kurmi_2019_CVPR,
  author = {Kumar Kurmi, Vinod and Kumar, Shanu and Namboodiri, Vinay P.},
  title = {Attending to Discriminative Certainty for Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Feature Denoising for Improving Adversarial Robustness
Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan L. Yuille, Kaiming He


Adversarial attacks on image classification systems present challenges to convolutional networks and opportunities for understanding them. This study suggests that adversarial perturbations on images lead to noise in the features constructed by these networks. Motivated by this observation, we develop new network architectures that increase adversarial robustness by performing feature denoising. Specifically, our networks contain blocks that denoise the features using non-local means or other filters; the entire networks are trained end-to-end. When combined with adversarial training, our feature denoising networks substantially improve the state-of-the-art in adversarial robustness in both white-box and black-box attack settings. On ImageNet, under 10-iteration PGD white-box attacks where prior art has 27.9% accuracy, our method achieves 55.7%; even under extreme 2000-iteration PGD white-box attacks, our method secures 42.6% accuracy. Our method was ranked first in the Competition on Adversarial Attacks and Defenses (CAAD) 2018 --- it achieved 50.6% classification accuracy on a secret, ImageNet-like test dataset against 48 unknown attackers, surpassing the runner-up approach by 10%. Code is available at https://github.com/facebookresearch/ImageNet-Adversarial-Training.
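A rough PyTorch sketch of one denoising block of the kind described above: a non-local (self-similarity) smoothing of the features followed by a 1x1 convolution and a residual connection. The embedding width and softmax affinity are illustrative choices, not necessarily the released configuration.

    import torch
    import torch.nn as nn

    class NonLocalDenoise(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.embed = nn.Conv2d(channels, channels // 2, kernel_size=1)
            self.out = nn.Conv2d(channels, channels, kernel_size=1)

        def forward(self, x):
            n, c, h, w = x.shape
            theta = self.embed(x).flatten(2)                                  # N x C/2 x HW
            affinity = torch.softmax(theta.transpose(1, 2) @ theta, dim=-1)   # N x HW x HW
            v = x.flatten(2).transpose(1, 2)                                  # N x HW x C
            denoised = (affinity @ v).transpose(1, 2).reshape(n, c, h, w)
            return x + self.out(denoised)                                     # residual connection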
[version, dot, perform, work] [null, total, note] [denoising, figure, image, clean, noise, pixel, based, alp, bilateral, input, study, removing, method] [accuracy, block, gaussian, convolutional, network, residual, imagenet, operation, better, neural, number, bottleneck, table, achieves, counterpart, performance, connection, filter, standard, add, small, design] [adversarial, pgd, attack, robustness, caad, adding, median, attacker, perturbation, defense, model, adversarially, perturbed, consider, find, strong, suggests] [feature, baseline, improve, map, ablation, challenging, improves] [training, trained, maximum, classification, product, unknown]
@InProceedings{Xie_2019_CVPR,
  author = {Xie, Cihang and Wu, Yuxin and van der Maaten, Laurens and Yuille, Alan L. and He, Kaiming},
  title = {Feature Denoising for Improving Adversarial Robustness},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Selective Kernel Networks
Xiang Li, Wenhai Wang, Xiaolin Hu, Jian Yang


In standard Convolutional Neural Networks (CNNs), the receptive fields of artificial neurons in each layer are designed to share the same size. It is well-known in the neuroscience community that the receptive field size of visual cortical neurons is modulated by the stimulus, a fact that has rarely been considered in constructing CNNs. We propose a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. A building block called Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer. Multiple SK units are stacked to form a deep network termed Selective Kernel Networks (SKNets). On the ImageNet and CIFAR benchmarks, we empirically show that SKNet outperforms the existing state-of-the-art architectures with lower model complexity. Detailed analyses show that the neurons in SKNet can capture target objects with different scales, which verifies the capability of neurons for adaptively adjusting their receptive field sizes according to the input. The code and models are available at https://github.com/implus/SKNet.
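The fusion step can be pictured with the sketch below: two branches with 3x3 and 5x5 kernels whose outputs are combined by channel-wise softmax attention computed from their pooled sum. The kernel sizes, reduction ratio, and use of plain (non-dilated) convolutions are assumptions for illustration.

    import torch
    import torch.nn as nn

    class SKUnit(nn.Module):
        def __init__(self, channels, reduction=4):
            super().__init__()
            self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
            self.branch5 = nn.Conv2d(channels, channels, 5, padding=2)
            hidden = max(channels // reduction, 8)
            self.squeeze = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
            self.attend = nn.Linear(hidden, channels * 2)         # one score per channel per branch

        def forward(self, x):
            u3, u5 = self.branch3(x), self.branch5(x)
            s = (u3 + u5).mean(dim=(2, 3))                        # global average pooling -> N x C
            a = self.attend(self.squeeze(s)).view(x.size(0), 2, -1)
            a = torch.softmax(a, dim=1)                           # softmax across the two branches
            return a[:, 0, :, None, None] * u3 + a[:, 1, :, None, None] * u5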
[multiple, dynamic] [note, single, error, corresponding, case] [image, proposed, difference, based] [kernel, number, size, table, convolutional, neural, adaptive, convolution, deep, imagenet, performance, selection, sknets, residual, sknet, larger, receptive, validation, group, adaptively, dilation, science, unit, network, increase, computational, architecture, channel, better, resnext, block, aggregation, lightweight, grouped, compared, dilated, efficient, adjust, operator, compact, depthwise, filter] [attention, arxiv, preprint, visual, model, mechanism] [feature, object, selective, spatial, fuse, average, three, global] [training, target, learning, large, national]
@InProceedings{Li_2019_CVPR,
  author = {Li, Xiang and Wang, Wenhai and Hu, Xiaolin and Yang, Jian},
  title = {Selective Kernel Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
On Implicit Filter Level Sparsity in Convolutional Neural Networks
Dushyant Mehta, Kwang In Kim, Christian Theobalt


We investigate filter level sparsity that emerges in convolutional neural networks (CNNs) which employ Batch Normalization and ReLU activation, and are trained with adaptive gradient descent techniques and L2 regularization or weight decay. We conduct an extensive experimental study casting our initial findings into hypotheses and conclusions about the mechanisms underlying the emergent filter level sparsity. This study allows new insight into the performance gap observed between adaptive and non-adaptive gradient descent methods in practice. Further, analysis of the effect of training strategies and hyperparameters on the sparsity leads to practical suggestions in designing CNN training strategies, enabling us to explore the tradeoffs between feature selectivity, network capacity, and generalization performance. Lastly, we show that the implicit sparsity can be harnessed for neural network speedup at par with or better than explicit sparsification / pruning approaches, with no modifications to the typical training pipeline required.
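One simple way to inspect such emergent filter-level sparsity after training (an assumed proxy, not necessarily the authors' exact criterion) is to count the filters whose Batch Normalization scale has collapsed toward zero:

    import torch
    import torch.nn as nn

    def filter_sparsity(model, threshold=1e-2):
        """Fraction of filters whose BatchNorm scale |gamma| falls below a small threshold."""
        total, collapsed = 0, 0
        for m in model.modules():
            if isinstance(m, nn.BatchNorm2d):
                gamma = m.weight.detach().abs()
                total += gamma.numel()
                collapsed += (gamma < threshold).sum().item()
        return collapsed / max(total, 1)

    # example: sparsity of a (randomly initialized) small CNN
    net = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU(),
                        nn.Conv2d(16, 32, 3), nn.BatchNorm2d(32), nn.ReLU())
    print(filter_sparsity(net))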
[work, moving] [explicit, implicit, disparity, additional, error, note, degree] [figure, study, based] [sparsity, regularization, network, weight, sgd, gradient, table, adaptive, higher, size, neural, pruned, convolutional, basicnet, selectivity, batch, performance, pruning, sparsification, filter, decay, scale, convolution, cifar, emergent, structure, layer, descent, regularizer, low, relu, comparable, emergence, tinyimagenet, rate, number, err, imagenet, deep, impact, increasing, capacity, parameter] [consider, arxiv, preprint, observed, trend, understanding, refer] [feature, selective, extent, level, val] [adam, test, training, learned, learning, generalization, loss, trained, task, train, hyperparameters, learn, class, update, gap]
@InProceedings{Mehta_2019_CVPR,
  author = {Mehta, Dushyant and In Kim, Kwang and Theobalt, Christian},
  title = {On Implicit Filter Level Sparsity in Convolutional Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
FlowNet3D: Learning Scene Flow in 3D Point Clouds
Xingyu Liu, Charles R. Qi, Leonidas J. Guibas


Many applications in robotics and human-computer interaction can benefit from understanding 3D motion of points in a dynamic environment, widely known as scene flow. While most previous methods focus on stereo and RGB-D images as input, few try to estimate scene flow directly from point clouds. In this work, we propose a novel deep neural network named FlowNet3D that learns scene flow from point clouds in an end-to-end fashion. Our network simultaneously learns deep hierarchical features of point clouds and flow embeddings that represent point motions, supported by two newly proposed learning layers for point sets. We evaluate the network on both challenging synthetic data from FlyingThings3D and real Lidar scans from KITTI. Trained on synthetic data only, our network successfully generalizes to real scans, outperforming various baselines and showing competitive results to the prior art. We also demonstrate two applications of our scene flow output (scan registration and motion segmentation) to show its potential wide use cases.
[flow, motion, optical, learns, dataset, frame, multiple, work, propagate, dynamic, consecutive] [point, scene, cloud, estimation, upconv, lidar, icp, estimate, rigid, directly, kitti, ground, dense, stereo, registration, estimated, depth, local, epe, well, disparity, rgb, truth, error] [real, based, method, synthetic, proposed, input, figure, prior] [layer, conv, deep, network, max, table, compared, architecture, design, neural, structure, newly, output] [model, evaluate, vector, variational, sampled] [feature, three, spatial, final, segmentation, hierarchical, object] [set, learning, embedding, novel, embeddings, data, trained, distance, learn, train, function, large]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Xingyu and Qi, Charles R. and Guibas, Leonidas J.},
  title = {FlowNet3D: Learning Scene Flow in 3D Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks
Kuan Fang, Alexander Toshev, Li Fei-Fei, Silvio Savarese


Many robotic applications require the agent to perform long-horizon tasks in partially observable environments. In such applications, decision making at any step can depend on observations received far in the past. Hence, being able to properly memorize and utilize the long-term history is crucial. In this work, we propose a novel memory-based policy, named Scene Memory Transformer (SMT). The proposed policy embeds and adds each observation to a memory and uses the attention mechanism to exploit spatio-temporal dependencies. This model is generic and can be efficiently trained with reinforcement learning over long episodes. On a range of visual navigation tasks, SMT demonstrates superior performance to existing reactive and memory-based policies by a margin.
[time, lstm, long, action, current, state, rnns, updated, recognition, rnn] [scene, observation, international, defined, pose, computer, vision, autonomous, single, robotics, pattern, partially, received] [conference, ieee, image, based, method, figure] [neural, performance, search, network, deep, size, number, structure, complexity, factorization, applied, processing, computational, capacity, pooling, table] [memory, smt, policy, model, attention, agent, navigation, visual, coverage, reward, robot, embedded, reinforcement, reactive, transformer, step, encoder, arxiv, preprint, roaming, vector, environment, att, covered, mechanism, machine, random, observable, common] [object, spatial, semantic, three] [learning, task, trained, training, set, embedding, target, existing, unseen, large]
@InProceedings{Fang_2019_CVPR,
  author = {Fang, Kuan and Toshev, Alexander and Fei-Fei, Li and Savarese, Silvio},
  title = {Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Co-Occurrent Features in Semantic Segmentation
Hang Zhang, Han Zhang, Chenguang Wang, Junyuan Xie


Recent work has achieved great success in utilizing global contextual information for semantic segmentation, including increasing the receptive field and aggregating pyramid feature representations. In this paper, we go beyond global context and explore the fine-grained representation using co-occurrent features by introducing a Co-occurrent Feature Model, which predicts the distribution of co-occurrent features for a given target. To leverage the semantic context in the co-occurrent features, we build an Aggregated Co-occurrent Feature (ACF) Module by aggregating the probability of the co-occurrent feature with the co-occurrent context. The ACF Module learns a fine-grained spatially invariant representation to capture co-occurrent context information across the scene. Our approach significantly improves the segmentation results using FCN and achieves superior performance of 54.0% mIoU on Pascal Context, 87.2% mIoU on Pascal VOC 2012 and 44.89% mIoU on ADE20K datasets. The source code and complete system will be publicly available upon publication.
[work, dataset, capture, build, learns, recognition, outperforms] [computer, scene, vision, pattern, field, international] [image, conference, proposed, prior, ieee, figure, study, input, transformation, based, method] [network, cfnet, pooling, achieves, convolutional, table, deep, neural, convolution, validation, performance, number, atrous, size, featuremap, best, receptive, rate, output] [model, arxiv, preprint, probability, vector, adding] [feature, semantic, context, module, segmentation, global, fcn, acf, pascal, contextual, miou, spatial, object, cooccurrent, aggregated, voc, baseline, coco, parsing, pyramid, improves, utilize, including] [base, target, set, distribution, learning, training, representation, similarity, train, test, existing, strategy]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Hang and Zhang, Han and Wang, Chenguang and Xie, Junyuan},
  title = {Co-Occurrent Features in Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Bag of Tricks for Image Classification with Convolutional Neural Networks
Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, Mu Li


Much of the recent progress made in image classification research can be credited to training procedure refinements, such as changes in data augmentations and optimization methods. In the literature, however, most refinements are either briefly mentioned as implementation details or only visible in source code. In this paper, we will examine a collection of such refinements and empirically evaluate their impact on the final model accuracy through ablation study. We will show that, by combining these refinements together, we are able to improve various CNN models significantly. For example, we raise ResNet-50's top-1 validation accuracy from 75.3% to 79.29% on ImageNet. We will also demonstrate that improvement on image classification accuracy leads to better transfer learning performance in other application domains such as object detection and semantic segmentation.
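Two training refinements commonly associated with this paper are cosine learning-rate decay with warmup and label smoothing (both terms also appear in its keyword list). Hedged sketches of both follow; the epoch budget, warmup length, and smoothing factor are illustrative values only.

    import math
    import torch
    import torch.nn.functional as F

    def lr_at_epoch(epoch, base_lr=0.1, warmup=5, total=120):
        """Linear warmup for the first few epochs, then cosine decay toward zero."""
        if epoch < warmup:
            return base_lr * (epoch + 1) / warmup
        t = (epoch - warmup) / (total - warmup)
        return 0.5 * base_lr * (1 + math.cos(math.pi * t))

    def label_smoothing_loss(logits, target, eps=0.1):
        """Cross-entropy against a smoothed target distribution (1 - eps on the true class)."""
        n_classes = logits.size(1)
        log_probs = F.log_softmax(logits, dim=1)
        smooth = torch.full_like(log_probs, eps / (n_classes - 1))
        smooth.scatter_(1, target.unsqueeze(1), 1.0 - eps)
        return -(smooth * log_probs).sum(dim=1).mean()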
[] [computer, initial, vision, pattern, note, linear] [input, image, figure, ieee, conference, zhang] [rate, batch, accuracy, size, output, validation, table, network, mixup, convolution, conv, convolutional, neural, resnet, decay, stride, imagenet, residual, efficient, layer, number, downsampling, distill, performance, mobilenet, deep, smoothing, gradient, block, architecture, precision, compared, tweak, computational, channel, warmup, implementation] [model, path, arxiv, preprint, evaluate, procedure, empirical, random, probability] [semantic, baseline, improve, object, three, stage, detection] [training, learning, cosine, label, base, trained, large, gap, set, train, loss, classification, transfer, randomly, teacher, distribution, min, distillation, data]
@InProceedings{He_2019_CVPR,
  author = {He, Tong and Zhang, Zhi and Zhang, Hang and Zhang, Zhongyue and Xie, Junyuan and Li, Mu},
  title = {Bag of Tricks for Image Classification with Convolutional Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Channel-Wise Interactions for Binary Convolutional Neural Networks
Ziwei Wang, Jiwen Lu, Chenxin Tao, Jie Zhou, Qi Tian


In this paper, we propose a channel-wise interaction based binary convolutional neural network learning method (CI-BCNN) for efficient inference. Conventional methods apply xnor and bitcount operations in binary convolution with notable quantization error, which usually yields inconsistent signs in binary feature maps compared with their full-precision counterparts and leads to significant information loss. In contrast, our CI-BCNN mines the channel-wise interactions, through which prior knowledge is provided to alleviate the inconsistency of signs in binary feature maps and preserve the information of input samples during inference. Specifically, we mine the channel-wise interactions with a reinforcement learning model, and impose channel-wise priors on the intermediate feature maps through the interacted bitcount function. Extensive experiments on the CIFAR-10 and ImageNet datasets show that our method outperforms state-of-the-art binary convolutional neural networks with less computational and storage cost.
[graph, influence, interaction, transition, state, tracking, action] [matrix, error] [inconsistent, input, comparison, proposed, figure, method, pixel, based, intermediate] [binary, neural, bitcount, convolutional, network, deep, interacted, quantization, xnor, compared, layer, storage, imagenet, efficient, max, represents, performance, density, output, applied, gradient, table, cost, structure, wet, computational, channelwise, original, accuracy, lth, ratio] [existence, policy, reinforcement, arxiv, preprint, model, sign, create, reward] [feature, object, map, edge, mined] [learning, training, loss, set, classification, teacher, space, china, knowledge, existing, function]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Ziwei and Lu, Jiwen and Tao, Chenxin and Zhou, Jie and Tian, Qi},
  title = {Learning Channel-Wise Interactions for Binary Convolutional Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Knowledge Adaptation for Efficient Semantic Segmentation
Tong He, Chunhua Shen, Zhi Tian, Dong Gong, Changming Sun, Youliang Yan


Both accuracy and efficiency are of significant importance to the task of semantic segmentation. Existing deep FCNs suffer from heavy computations due to a series of high-resolution feature maps for preserving the detailed knowledge in dense estimation. Although reducing the feature map resolution (i.e., applying a large overall stride) via subsampling operations (e.g., pooling and convolution striding) can instantly increase the efficiency, it dramatically decreases the estimation accuracy. To tackle this dilemma, we propose a knowledge distillation method tailored for semantic segmentation to improve the performance of compact FCNs with large overall stride. To handle the inconsistency between the features of the student and teacher network, we optimize the feature similarity in a transferred latent domain formulated by utilizing a pre-trained autoencoder. Moreover, an affinity distillation module is proposed to capture the long-range dependency by calculating the non-local interactions across the whole image. To validate the effectiveness of our proposed method, extensive experiments have been conducted on three popular benchmarks: Pascal VOC, Cityscapes and Pascal Context. Built upon a highly competitive baseline, our proposed method can improve the performance of a student network by 2.5% (mIoU improves from 70.2 to 72.7 on the Cityscapes test set) and can train a better compact model with only 8% of the floating-point operations (FLOPs) of a model that achieves comparable performance.
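The affinity distillation part, matching long-range pairwise relations between student and teacher features, can be sketched roughly as below; the channel normalization and squared-error matching are assumptions about the exact form, and both feature maps are assumed to share the same spatial size (otherwise resize one first).

    import torch
    import torch.nn.functional as F

    def affinity(feat):
        """Pairwise affinity over all spatial positions of a feature map (N x C x H x W)."""
        f = F.normalize(feat.flatten(2), dim=1)        # normalize channel vectors per position
        return f.transpose(1, 2) @ f                   # N x HW x HW

    def affinity_distill_loss(student_feat, teacher_feat):
        return F.mse_loss(affinity(student_feat), affinity(teacher_feat))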
[capture, dataset, outperforms] [defined, dense, pattern] [method, proposed, ieee, figure, resolution, image, input, inherent, transferring, comparison] [network, output, performance, convolution, deep, stride, compact, table, effectiveness, atrous, better, small, size, neural, achieved, introducing, lightweight, receptive, computation, process, accuracy, achieves, trainaug, low, format, net] [model, attention, chunhua, rich] [semantic, feature, affinity, segmentation, pascal, val, voc, context, module, propose, spatial, map, help, easier, extra, detailed, adapter, coco, three, ablation] [knowledge, student, teacher, distillation, training, large, loss, set, learning, tailored, test, task, train, trained, learn, fitnet]
@InProceedings{He_2019_CVPR,
  author = {He, Tong and Shen, Chunhua and Tian, Zhi and Gong, Dong and Sun, Changming and Yan, Youliang},
  title = {Knowledge Adaptation for Efficient Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Parametric Noise Injection: Trainable Randomness to Improve Deep Neural Network Robustness Against Adversarial Attack
Zhezhi He, Adnan Siraj Rakin, Deliang Fan


Recent developments in the field of Deep Learning have exposed the underlying vulnerability of Deep Neural Networks (DNNs) to adversarial examples. In image classification, an adversarial example is a carefully modified image that is visually indistinguishable from the original but causes the DNN model to misclassify it. Training the network with Gaussian noise is an effective technique to perform model regularization, thus improving model robustness against input variation. Inspired by this classical method, we explore utilizing the regularization characteristic of noise injection to improve the DNN's robustness against adversarial attack. In this work, we propose Parametric-Noise-Injection (PNI), which involves trainable Gaussian noise injection at each layer on either activations or weights through solving a Min-Max optimization problem, embedded with adversarial training. These noise parameters are trained explicitly to achieve improved robustness. Extensive results show that the proposed PNI technique effectively improves robustness against a variety of powerful white-box and black-box attacks such as PGD, C&W, FGSM, transferable attack, and ZOO attack. Last but not least, PNI improves both clean- and perturbed-data accuracy in comparison to state-of-the-art defense methods, outperforming the current unbroken PGD defense by 1.1% and 6.8% on clean and perturbed test data respectively, using the ResNet-20 architecture.
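A hedged sketch of weight-side noise injection along these lines: Gaussian noise scaled by a trainable coefficient and by the layer's weight statistics, active during training and meant to be combined with adversarial training in the full method. The initialization of the coefficient and the exact noise parameterization are simplifications.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PNIConv2d(nn.Conv2d):
        """Conv layer whose weights are perturbed by trainable-scale Gaussian noise at train time."""
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.alpha = nn.Parameter(torch.tensor(0.25))   # trainable noise coefficient (assumed init)

        def forward(self, x):
            if self.training:
                noise = torch.randn_like(self.weight) * self.weight.detach().std()
                w = self.weight + self.alpha * noise
            else:
                w = self.weight
            return F.conv2d(x, w, self.bias, self.stride, self.padding, self.dilation, self.groups)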
[perform, outperforms, work, term, performs] [optimization, international, technique, robust, parametric, computer] [noise, method, proposed, input, conference, image, comparison, clean, based, ieee] [accuracy, gradient, network, table, neural, vanilla, deep, weight, regularization, coefficient, dnn, performance, scaling, increasing, layer, rate, stochastic, trainable, gaussian, injected, capacity, order, parameter, improving, descent, optimized, output, number] [adversarial, pni, attack, model, pgd, robustness, defense, fgsm, arxiv, example, preprint, injection, success, zoo, perturbed, fpni, machine, generation, substitute, obfuscation, nstep] [improvement, improve, propose] [training, learning, test, data, trained, target, loss, transferable, set, train, ensemble]
@InProceedings{He_2019_CVPR,
  author = {He, Zhezhi and Siraj Rakin, Adnan and Fan, Deliang},
  title = {Parametric Noise Injection: Trainable Randomness to Improve Deep Neural Network Robustness Against Adversarial Attack},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Invariance Matters: Exemplar Memory for Domain Adaptive Person Re-Identification
Zhun Zhong, Liang Zheng, Zhiming Luo, Shaozi Li, Yi Yang


This paper considers the domain adaptive person re-identification (re-ID) problem: learning a re-ID model from a labeled source domain and an unlabeled target domain. Conventional methods mainly aim to reduce the feature distribution gap between the source and target domains. However, these studies largely neglect the intra-domain variations in the target domain, which contain critical factors influencing the testing performance on the target domain. In this work, we comprehensively investigate the intra-domain variations of the target domain and propose to generalize the re-ID model w.r.t. three types of underlying invariance, i.e., exemplar-invariance, camera-invariance and neighborhood-invariance. To achieve this goal, an exemplar memory is introduced to store features of the target domain and accommodate the three invariance properties. The memory allows us to enforce the invariance constraints over the global training batch without significantly increasing the computation cost. Experiments demonstrate that the three invariance properties and the proposed memory are indispensable for an effective domain adaptation system. Results on three re-ID domains show that our domain adaptation accuracy outperforms the state of the art by a large margin. Code is available at: https://github.com/zhunzhong07/ECN
[key, work, outperforms] [approach, camera, enforce, underlying, corresponding, problem, university, well, assumption] [image, method, based, proposed, identity, demonstrates, component] [accuracy, network, deep, table, number, performance, achieve, layer, achieves, compare, experiment] [model, memory, fact] [person, three, liang, feature, map, baseline, propose, module, improve] [target, source, domain, learning, exemplar, invariance, labeled, unsupervised, set, training, data, tested, unlabeled, loss, adaptation, trained, uda, close, learn, positive, zhun, testing, supervised, classification, camstyle, shaozi, representation, transfer, img, temperature, ptgan, investigate, large, open, transferable, sample, hhl]
@InProceedings{Zhong_2019_CVPR,
  author = {Zhong, Zhun and Zheng, Liang and Luo, Zhiming and Li, Shaozi and Yang, Yi},
  title = {Invariance Matters: Exemplar Memory for Domain Adaptive Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dissecting Person Re-Identification From the Viewpoint of Viewpoint
Xiaoxiao Sun, Liang Zheng


Variations in visual factors such as viewpoint, pose, illumination and background are usually viewed as important challenges in person re-identification (re-ID). Although these factors are acknowledged to be influential, quantitative studies of how they affect a re-ID system are still lacking. To derive insights in this scientific campaign, this paper makes an early attempt at studying a particular factor, viewpoint. We narrow the viewpoint problem down to the pedestrian rotation angle to obtain focused conclusions. In this regard, this paper makes two contributions to the community. First, we introduce a large-scale synthetic data engine, PersonX. Composed of hand-crafted 3D person models, the salient characteristic of this engine is that it is controllable. That is, we are able to synthesize pedestrians by setting the visual variables to arbitrary values. Second, on the 3D data engine, we quantitatively analyze the influence of the pedestrian rotation angle on re-ID accuracy. Comprehensively, the person rotation angles are precisely customized from 0 to 360 degrees, allowing us to investigate their effect on the training, query, and gallery sets. These extensive experiments give us a deeper understanding of the fundamental problems in person re-ID. Our research also provides useful insights for dataset building and future practical usage, e.g., a person seen from the side makes a better query.
[influence, dataset, environmental, manually] [viewpoint, match, left, front, illumination, continuous, rotation, angle, pose, consistent, observation, scientific, view, camera, disparity] [control, synthetic, figure, background, image, resolution, missing, study, real, identity, difference] [group, accuracy, performance, impact, design, higher, number, experiment, original, designed, compared] [true, query, model, visual, system, engine, environment, controllable] [person, personx, map, liang, pedestrian, feature, three, pcb, average, indicative, clothes] [training, set, data, experimental, train, trained, gallery, learning, randomly, triplet, affect, datasets, paper, distribution, invariant, somaset, domain]
@InProceedings{Sun_2019_CVPR,
  author = {Sun, Xiaoxiao and Zheng, Liang},
  title = {Dissecting Person Re-Identification From the Viewpoint of Viewpoint},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Reduce Dual-Level Discrepancy for Infrared-Visible Person Re-Identification
Zhixiang Wang, Zheng Wang, Yinqiang Zheng, Yung-Yu Chuang, Shin'ichi Satoh


Infrared-Visible person RE-IDentification (IV-REID) is a rising task. Compared to conventional person re-identification (re-ID), IV-REID concerns the additional modality discrepancy originating from the different imaging processes of spectrum cameras, in addition to the person's appearance discrepancy caused by viewpoint changes, pose variations and deformations present in the conventional re-ID task. The co-existing discrepancies make IV-REID more difficult to solve. Previous methods attempt to reduce the appearance and modality discrepancies simultaneously using feature-level constraints. It is, however, difficult to eliminate the mixed discrepancies using only feature-level constraints. To address the problem, this paper introduces a novel Dual-level Discrepancy Reduction Learning (D^2RL) scheme which handles the two discrepancies separately. To reduce the modality discrepancy, an image-level sub-network is trained to translate an infrared image into its visible counterpart and a visible image into its infrared version. With the image-level sub-network, we can unify the representations of images with different modalities. With the help of the unified multi-spectral images, a feature-level sub-network is trained to reduce the remaining appearance discrepancy through feature embedding. By cascading the two sub-networks and training them jointly, the dual-level reductions cooperatively fulfill their respective responsibilities. Extensive experiments demonstrate that the proposed approach outperforms state-of-the-art methods.
[dataset, joint, framework, consists, zheng, previous] [visible, computer, vision, pattern, initial, international, note, pose, total, camera] [image, infrared, appearance, regdb, figure, method, conference, proposed, mixed, translation, latent, separate, translated, balance, based] [reduction, reduce, reducing, performance, network, table, original, effective, weight, best, optimized] [modality, generate, generated, query, vector, attention, find, model, gan] [person, feature, map, propose, liang, cascaded] [discrepancy, learning, training, set, loss, unified, gap, space, conventional, domain, bdtr, triplet, gallery, trained, embedding, distance, unification, objective]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Zhixiang and Wang, Zheng and Zheng, Yinqiang and Chuang, Yung-Yu and Satoh, Shin'ichi},
  title = {Learning to Reduce Dual-Level Discrepancy for Infrared-Visible Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Progressive Feature Alignment for Unsupervised Domain Adaptation
Chaoqi Chen, Weiping Xie, Wenbing Huang, Yu Rong, Xinghao Ding, Yue Huang, Tingyang Xu, Junzhou Huang


Unsupervised domain adaptation (UDA) transfers knowledge from a label-rich source domain to a fully-unlabeled target domain. To tackle this task, recent approaches resort to discriminative domain transfer by virtue of pseudo-labels to enforce class-level distribution alignment across the source and target domains. These methods, however, are vulnerable to error accumulation and thus incapable of preserving cross-domain category consistency, as the pseudo-labeling accuracy is not explicitly guaranteed. In this paper, we propose the Progressive Feature Alignment Network (PFAN) to align the discriminative features across domains progressively and effectively, by exploiting the intra-class variation in the target domain. To be specific, we first develop an Easy-to-Hard Transfer Strategy (EHTS) and an Adaptive Prototype Alignment (APA) step to train our model iteratively and alternately. Moreover, upon observing that a good domain adaptation usually requires a non-saturated source classifier, we consider a simple yet efficient way to retard the convergence of the source classification loss by incorporating a temperature variable into the soft-max function. Extensive experimental results reveal that the proposed PFAN exceeds the state-of-the-art performance on three UDA datasets.
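The last ingredient, retarding the source classifier's convergence, amounts to dividing the logits by a temperature greater than one before the soft-max; a minimal sketch (the temperature value is a placeholder):

    import torch.nn.functional as F

    def tempered_source_loss(logits, labels, temperature=1.8):
        # T > 1 softens the soft-max and slows saturation of the source classification loss
        return F.cross_entropy(logits / temperature, labels)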
[work, explicitly, hypothesis, joint] [error, approach, computer, international, vision, pattern, reliable, analysis] [conference, proposed, ieee, based, method, figure] [deep, accuracy, network, neural, performance, denotes, table, processing, progressive, speed, number, better, achieve, small] [adversarial, machine, model, expected, consider] [feature, category, three, global, propose] [domain, target, source, pfan, adaptation, alignment, training, unsupervised, transfer, classification, labeled, learning, loss, class, ehts, selected, apa, distribution, discriminative, revgrad, align, prototype, convergence, temperature, uda, discrepancy, similarity, distance, bias, classifier, function, mnist, falsely, alleviate, extractor, set, strategy, learned]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Chaoqi and Xie, Weiping and Huang, Wenbing and Rong, Yu and Ding, Xinghao and Huang, Yue and Xu, Tingyang and Huang, Junzhou},
  title = {Progressive Feature Alignment for Unsupervised Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Feature-Level Frankenstein: Eliminating Variations for Discriminative Recognition
Xiaofeng Liu, Site Li, Lingsheng Kong, Wanqing Xie, Ping Jia, Jane You, B.V.K. Kumar


Recent successes of deep learning-based recognition rely on maintaining the content related to the main-task label. However, how to explicitly dispel the noisy signals for better generalization remains an open issue. We systematically summarize the detrimental factors as task-relevant/irrelevant semantic variations and unspecified latent variation. In this paper, we cast these problems as an adversarial minimax game in the latent space. Specifically, we propose equipping an end-to-end conditional adversarial network with the ability to decompose an input sample into three complementary parts. The discriminative representation inherits the desired invariance property guided by prior knowledge of the task, and is marginally independent of the task-relevant/irrelevant semantic and latent variations. Our proposed framework achieves top performance on a series of tasks, including digit recognition, lighting-, makeup-, and disguise-tolerant face recognition, and facial attribute recognition.
[recognition, dataset, framework, multiple, incorporate, explicitly, dispel] [lighting, note, variable, pattern, volume, computer, vision, property, well] [face, latent, variation, image, attribute, makeup, celeba, proposed, disentangled, ieee, input, method, prior, conference, desired, conditional, style, generative, identity, figure] [accuracy, network, table, deep, better, binary, vgg, performance, original, achieve, number] [adversarial, arxiv, preprint, model, discriminator, gans, choose, expected] [semantic, feature, three, complementary, baseline, propose] [representation, learning, discriminative, training, flf, task, invariant, independent, label, trained, mnist, marginally, main, loss, class, invariance, classification, domain, unsupervised, digit, unspecified, testing, dis, svhn, generalization, sample]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Xiaofeng and Li, Site and Kong, Lingsheng and Xie, Wanqing and Jia, Ping and You, Jane and Kumar, B.V.K.},
  title = {Feature-Level Frankenstein: Eliminating Variations for Discriminative Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning a Deep ConvNet for Multi-Label Classification With Partial Labels
Thibaut Durand, Nazanin Mehrasa, Greg Mori


Deep ConvNets have shown great performance for single-label image classification (e.g. ImageNet), but it is necessary to move beyond the single-label classification task because pictures of everyday life are inherently multi-label. Multi-label classification is a more difficult task than single-label classification because both the input images and output label spaces are more complex. Furthermore, collecting clean multi-label annotations is more difficult to scale-up than single-label annotations. To reduce the annotation cost, we propose to train a model with partial labels i.e. only some labels are known per image. We first empirically compare different labeling strategies to show the potential for using partial labels on multi-label datasets. Then to learn with partial labels, we introduce a new classification loss that exploits the proportion of known labels per example. Our approach allows the use of the same training settings as when learning with all the annotations. We further explore several curriculum learning based strategies to predict missing labels. Experiments are performed on three large-scale multi-label datasets: MS COCO, NUS-WIDE and Open Images.
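One way to read the proposed loss is a binary cross-entropy computed only over the labels that are annotated for each image and rescaled by the proportion of known labels. The sketch below uses a simple 1/proportion rescaling, which is only one of several normalizations one could choose, not necessarily the one adopted by the paper.

    import torch
    import torch.nn.functional as F

    def partial_bce(logits, targets, known_mask):
        """logits, targets, known_mask: float tensors of shape N x C; known_mask is 1 where the
        label is annotated and 0 where it is missing."""
        per_label = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        per_label = per_label * known_mask                     # ignore missing labels
        proportion = known_mask.mean(dim=1).clamp(min=1e-6)    # fraction of known labels per image
        # normalize so each image contributes comparably regardless of how many labels are known
        return (per_label.sum(dim=1) / (proportion * logits.size(1))).mean()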
[predict, recognition, hidden, dataset, graph] [computer, vision, international, pattern, respect, approach, problem, algorithm, allows, partially, note] [conference, missing, image, clean, ieee, figure, based, proposed, method] [neural, deep, normalization, bayesian, better, number, standard, performance, scalable, analyze, table, convnet, binary, processing, correlation, network] [partial, model, introduced, visual, machine, example] [score, bce, map, propose, easy, coco, category, european, labeling] [learning, label, strategy, proportion, function, loss, classification, training, learn, noisy, update, observe, curriculum, gnn, uncertainty, labeled, web, learned, positive, data, collecting, open, datasets, large, hyperparameter, vic, supervised, set, subset, knowledge]
@InProceedings{Durand_2019_CVPR,
  author = {Durand, Thibaut and Mehrasa, Nazanin and Mori, Greg},
  title = {Learning a Deep ConvNet for Multi-Label Classification With Partial Labels},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression
Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, Silvio Savarese


Intersection over Union (IoU) is the most popular evaluation metric used in object detection benchmarks. However, there is a gap between optimizing the commonly used distance losses for regressing the parameters of a bounding box and maximizing this metric value. The optimal objective for a metric is the metric itself. In the case of axis-aligned 2D bounding boxes, it can be shown that IoU can be directly used as a regression loss. However, IoU has a plateau, making it infeasible to optimize in the case of non-overlapping bounding boxes. In this paper, we address this weakness by introducing a generalized version of IoU as both a new loss and a new metric. By incorporating this generalized IoU (GIoU) as a loss into state-of-the-art object detection frameworks, we show a consistent improvement in their performance using both the standard, IoU-based, and new, GIoU-based, performance measures on popular object detection benchmarks such as PASCAL VOC and MS COCO.
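For axis-aligned boxes, GIoU subtracts from IoU the fraction of the smallest enclosing box not covered by the union, and the regression loss is 1 - GIoU. A small sketch for (x1, y1, x2, y2) boxes follows; it is a sketch of the standard formulation, not the authors' released code.

    import torch

    def giou_loss(pred, target):
        """pred, target: N x 4 boxes as (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1."""
        area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
        area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
        # intersection and union
        lt = torch.max(pred[:, :2], target[:, :2])
        rb = torch.min(pred[:, 2:], target[:, 2:])
        wh = (rb - lt).clamp(min=0)
        inter = wh[:, 0] * wh[:, 1]
        union = area_p + area_t - inter
        iou = inter / union.clamp(min=1e-7)
        # smallest enclosing box
        lt_c = torch.min(pred[:, :2], target[:, :2])
        rb_c = torch.max(pred[:, 2:], target[:, 2:])
        wh_c = (rb_c - lt_c).clamp(min=0)
        area_c = (wh_c[:, 0] * wh_c[:, 1]).clamp(min=1e-7)
        giou = iou - (area_c - union) / area_c
        return (1 - giou).mean()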
[incorporating, challenge, dataset] [computer, vision, relative, case, optimizing, smallest, well, pattern, solution, directly, calculating, ground, truth, analytical, optimal, defined] [based, conference, arbitrary, mse, ieee, comparison, figure] [performance, compared, validation, popular, scale, accuracy, table, neural, size, standard, network, better, correlation] [evaluation, choice, provided] [iou, giou, bounding, lgiou, box, object, liou, detection, regression, coco, yolo, faster, pascal, mask, intersection, voc, union, predicted, enclosing, improvement, area, comparing, map, improve, improv, segmentation, aligned, threshold] [loss, metric, training, set, trained, reported, measure, generalized, test, classification, distance, objective, surrogate, representation, similarity, invariant]
@InProceedings{Rezatofighi_2019_CVPR,
  author = {Rezatofighi, Hamid and Tsoi, Nathan and Gwak, JunYoung and Sadeghian, Amir and Reid, Ian and Savarese, Silvio},
  title = {Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Densely Semantically Aligned Person Re-Identification
Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Zhibo Chen


We propose a densely semantically aligned person re-identification (re-ID) framework. It fundamentally addresses the body misalignment problem caused by pose/viewpoint variations, imperfect person detection, occlusion, etc. By leveraging the estimation of the dense semantics of a person image, we construct a set of densely semantically aligned part images (DSAP-images), where the same spatial positions have the same semantics across different person images. We design a two-stream network that consists of a main full image stream (MF-Stream) and a densely semantically-aligned guiding stream (DSAG-Stream). The DSAG-Stream, with the DSAP-images as input, acts as a regulator to guide the MF-Stream to learn densely semantically aligned features from the original image. At inference time, the DSAG-Stream is discarded and only the MF-Stream is needed, which makes the inference system computationally efficient and robust. To the best of our knowledge, we are the first to make use of fine-grained semantics to address misalignment problems in re-ID. Our method achieves a rank-1 accuracy of 78.9% (new protocol) on the CUHK03 dataset, 90.4% on the CUHK01 dataset, and 95.7% on the Market1501 dataset, outperforming state-of-the-art methods.
[human, stream, joint, outperforms, fusion, consists, multiple, work, framework] [dense, body, local, estimation, pose, matching, corresponding] [image, based, input, method, proposed] [network, deep, performance, architecture, table, scheme, design, denotes, achieves, full, original, best, pooling, output, add, computationally, convolutional, neural, wei, inference, efficient] [semantically, arxiv, preprint, model, making, attention] [person, aligned, semantics, global, feature, densely, misalignment, semantic, map, spatial, head, baseline, liang, propose, branch, coarse, fused, detected, fully, leverage, regulator, merging, rui, xiaogang] [learning, loss, alignment, training, representation, learn, exploit, triplet, address, main, space]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Zhizheng and Lan, Cuiling and Zeng, Wenjun and Chen, Zhibo},
  title = {Densely Semantically Aligned Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Generalising Fine-Grained Sketch-Based Image Retrieval
Kaiyue Pang, Ke Li, Yongxin Yang, Honggang Zhang, Timothy M. Hospedales, Tao Xiang, Yi-Zhe Song


Fine-grained sketch-based image retrieval (FG-SBIR) addresses matching a specific photo instance using a free-hand sketch as the query modality. Existing models aim to learn an embedding space in which sketch and photo can be directly compared. While successful, they require instance-level pairing within each coarse-grained category as annotated training data. Since the learned embedding space is domain-specific, these models do not generalise well across categories. This limits the practical applicability of FG-SBIR. In this paper, we identify cross-category generalisation for FG-SBIR as a domain generalisation problem, and propose the first solution. Our key contribution is a novel unsupervised learning approach to model a universal manifold of prototypical visual sketch traits. This manifold can then be used to parameterise the learning of a sketch/photo representation. Model adaptation to novel categories then becomes automatic via embedding the novel sketch in the manifold and updating the representation and retrieval function accordingly. Experiments on the two largest FG-SBIR datasets, Sketchy and QMUL-Shoe-V2, demonstrate the efficacy of our approach in enabling cross-category generalisation of FG-SBIR.
[work, challenge, trn, framework, human] [descriptor, assignment, matching, approach, directly, corresponding, problem, university, practical, perspective] [dictionary, image, photo, proposed, method, latent, based, generative, row, visually] [network, deep, performance, table, dynamically] [visual, model, trait, query, abstract, manifold, find, attention, semantically] [category, feature, assigned, improve, instance] [sketch, embedding, training, domain, generalisation, learning, sketchy, data, unsupervised, novel, trained, train, retrieval, set, timothy, tao, universal, triplet, loss, soft, testing, sbir, specific, learn, adapt, test, representation, vtd, target, embeddings, hard, learned, paramaterise, adaptation]
@InProceedings{Pang_2019_CVPR,
  author = {Pang, Kaiyue and Li, Ke and Yang, Yongxin and Zhang, Honggang and Hospedales, Timothy M. and Xiang, Tao and Song, Yi-Zhe},
  title = {Generalising Fine-Grained Sketch-Based Image Retrieval},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Adapting Object Detectors via Selective Cross-Domain Alignment
Xinge Zhu, Jiangmiao Pang, Ceyuan Yang, Jianping Shi, Dahua Lin


State-of-the-art object detectors are usually trained on public datasets. They often face substantial difficulties when applied to a different domain, where the imaging condition differs significantly and the corresponding annotated data are unavailable (or expensive to acquire). A natural remedy is to adapt the model by aligning the image representations across both domains. This can be achieved, for example, by adversarial learning, and has been shown to be effective in tasks like image classification. However, we found that in object detection, the improvement obtained in this way is quite limited. An important reason is that conventional domain adaptation methods strive to align images as a whole, while object detection, by nature, focuses on local regions that may contain objects of interest. Motivated by this, we propose a novel approach to domain adaptation for object detection to handle the issues of "where to look" and "how to align". Our key idea is to mine the discriminative regions, namely those that are directly pertinent to object detection, and focus on aligning them across both domains. Experiments show that the proposed method performs remarkably better than existing methods, with about 4%-6% improvement under various domain-shift scenarios while keeping good scalability.
[framework, dataset, focus, perform, report, dahua] [estimator, local, note] [image, method, proposed, based, real, figure, synthetic, cover] [table, performance, fixed, number, size, better, effective, denotes, compared, convolutional, designed] [adversarial, model, arxiv, preprint, observed, fake, introduce] [object, region, detection, grouping, instance, semantic, segmentation, map, bounding, feature, faster, location, jianping, ablation, improvement, three, mask, xinge, global, adjusted] [domain, adaptation, target, source, alignment, learning, strategy, weighting, training, loss, data, task, update, trained, existing, gap, unsupervised, mining, set, align, discriminative, cluster, train]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Xinge and Pang, Jiangmiao and Yang, Ceyuan and Shi, Jianping and Lin, Dahua},
  title = {Adapting Object Detectors via Selective Cross-Domain Alignment},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Cyclic Guidance for Weakly Supervised Joint Detection and Segmentation
Yunhang Shen, Rongrong Ji, Yan Wang, Yongjian Wu, Liujuan Cao


Weakly supervised learning has attracted growing research attention due to the significant saving in annotation cost for tasks that require intra-image annotations, such as object detection and semantic segmentation. To this end, existing weakly supervised object detection and semantic segmentation approaches follow an iterative label mining and model training pipeline. However, such a self-enforcement pipeline makes both tasks prone to being trapped in local minima. In this paper, we join the weakly supervised object detection and segmentation tasks with a multi-task learning scheme for the first time, using their respective failure patterns to complement each other's learning. Such cross-task enforcement helps both tasks leap out of their respective local minima. In particular, we present an efficient and effective framework termed Weakly Supervised Joint Detection and Segmentation (WS-JDS). WS-JDS has two branches for the above two tasks, which share the same backbone network. In the learning stage, it uses the same cyclic training paradigm but with a specific loss function such that the two branches benefit each other. Extensive experiments have been conducted on the widely-used Pascal VOC and COCO benchmarks, which demonstrate that our model achieves performance competitive with state-of-the-art algorithms.
[framework, work, benefit, second, multiple, predict] [local, cyclic, apr, problem, additional] [proposed, image, method, based, figure, prior, pixel, demonstrate, intermediate, result, background] [performance, deep, convolutional, network, pooling, neural, output, size] [model, evaluate] [object, segmentation, detection, weakly, semantic, localization, instance, wsod, map, branch, pascal, proposal, fully, voc, bounding, region, feature, frcnn, mask, guidance, refine, illustrated, wsddn, failure, backbone, box, tang, score, category, corloc, coco, detector, cgl, discovery, saliency, mic, apb, propose, spatial, improve] [supervised, learning, training, classification, loss, train, mining, label, learned, china, trained, datasets]
@InProceedings{Shen_2019_CVPR,
  author = {Shen, Yunhang and Ji, Rongrong and Wang, Yan and Wu, Yongjian and Cao, Liujuan},
  title = {Cyclic Guidance for Weakly Supervised Joint Detection and Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Thinking Outside the Pool: Active Training Image Creation for Relative Attributes
Aron Yu, Kristen Grauman


Current wisdom suggests more labeled image data is always better, and obtaining labels is the bottleneck. Yet curating a pool of sufficiently diverse and informative images is itself a challenge. In particular, training image curation is problematic for fine-grained attributes, where the subtle visual differences of interest may be rare within traditional image sources. We propose an active image generation approach to address this issue. The main idea is to jointly learn the attribute ranking task while also learning to generate novel realistic image samples that will benefit that task. We introduce an end-to-end framework that dynamically "imagines" image pairs that would confuse the current model, presents them to human annotators for labeling, then improves the predictive model with the new examples. On two datasets, we show that by thinking outside the pool of real images, our approach gains generalization accuracy on challenging fine-grained attribute comparisons.
[recognition, human, manually, learns, work, report, current, jointly] [active, relative, approach, computer, vision, international, pattern, technical, pose, defined, manual] [image, real, attribute, synthetic, control, ranker, conference, ieee, figure, generator, attic, jitter, semjitter, latent, face, method, generative, synthesis, curation, traditional, masculine, augment, conditional] [network, deep, neural, accuracy, convolutional, output, standard, selection, best] [adversarial, generated, generate, visual, generation, model, query, actively, creation] [module, object, pool, improve] [training, learning, labeled, data, ranking, augmentation, learn, label, sample, existing, function, set, novel, pair, datasets, unlabeled, difficult, loss, task]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Aron and Grauman, Kristen},
  title = {Thinking Outside the Pool: Active Training Image Creation for Relative Attributes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Generalizable Person Re-Identification by Domain-Invariant Mapping Network
Jifei Song, Yongxin Yang, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales


We aim to learn a domain generalizable person re-identification (ReID) model. When such a model is trained on a set of source domains (ReID datasets collected from different camera networks), it can be directly applied to any new unseen dataset for effective ReID without any model updating. Despite its practical value in real-world deployments, generalizable ReID has seldom been studied. In this work, a novel deep ReID model termed Domain-Invariant Mapping Network (DIMN) is proposed. DIMN is designed to learn a mapping between a person image and its identity classifier, i.e., it produces a classifier using a single shot. To make the model domain-invariant, we follow a meta-learning pipeline and sample a subset of source domain training tasks during each training episode. However, the model is significantly different from conventional meta-learning methods in that: (1) no model updating is required for the target domain, (2) different training tasks share a memory bank for maintaining both scalability and discrimination ability, and (3) it can be used to match an arbitrary number of identities in a target domain. Extensive experiments on a newly proposed large-scale ReID domain generalization benchmark show that our DIMN significantly outperforms alternative domain generalization or meta-learning methods.
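A minimal sketch of the single-shot idea described above, under the simplifying assumption that the mapping network reduces to using each (normalised) support feature directly as that identity's classifier weight; the function and variable names below are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def single_shot_classifiers(support_feats: torch.Tensor,
                            gallery_feats: torch.Tensor) -> torch.Tensor:
    """Score every gallery image against classifiers produced from a single
    support image per identity (cosine similarity plays the classifier)."""
    w = F.normalize(support_feats, dim=1)   # (num_identities, d), unit norm
    g = F.normalize(gallery_feats, dim=1)   # (num_gallery, d), unit norm
    return g @ w.t()                        # (num_gallery, num_identities)

# Toy usage: 5 identities, 8 gallery images, 128-d features.
scores = single_shot_classifiers(torch.randn(5, 128), torch.randn(8, 128))
print(scores.argmax(dim=1))  # predicted identity index per gallery image
```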
[dataset, bank, follow] [directly, matching, single, problem, pipeline, corresponding, note] [mapping, image, method, identity, proposed, based, input, result] [network, deep, weight, table, number, running, effective, performance, wei] [model, memory, encoding, vector, type, arxiv, preprint] [person, subnet, feature, average, map, liang, pcb, grid] [domain, reid, training, target, learning, source, dimn, classifier, gallery, generalization, probe, loss, agg, testing, trained, ppa, datasets, learn, label, unsupervised, reptile, split, generalizable, set, sample, updating, existing, supervised, crossgrad, mldg, unseen, data, classification, tao, timothy, conventional, adaptation, align, viper, prid, shaogang]
@InProceedings{Song_2019_CVPR,
  author = {Song, Jifei and Yang, Yongxin and Song, Yi-Zhe and Xiang, Tao and Hospedales, Timothy M.},
  title = {Generalizable Person Re-Identification by Domain-Invariant Mapping Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Visual Attention Consistency Under Image Transforms for Multi-Label Image Classification
Hao Guo, Kang Zheng, Xiaochuan Fan, Hongkai Yu, Song Wang


Human visual perception shows good consistency for many multi-label image classification tasks under certain spatial transforms, such as scaling, rotation, flipping and translation. This has motivated the data augmentation strategy widely used in CNN classifier training -- transformed images are included for training by assuming the same class labels as their original images. In this paper, we further propose the assumption of perceptual consistency of visual attention regions for classification under such transforms, i.e., the attention region for a classification follows the same transform if the input image is spatially transformed. While the attention regions of CNN classifiers can be derived as an attention heatmap in middle layers of the network, we find that their consistency under many transforms is not preserved. To address this problem, we propose a two-branch network with an original image and its transformed image as inputs and introduce a new attention consistency loss that measures the attention heatmap consistency between the two branches. This new loss is then combined with the multi-label image classification loss for network training. Experiments on three datasets verify the superiority of the proposed network by achieving new state-of-the-art classification performance.
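The consistency term lends itself to a compact sketch: apply the known spatial transform (a horizontal flip here) to the first branch's heatmaps and penalise the distance to the second branch's heatmaps. The tensor shapes and the choice of MSE are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def attention_consistency_loss(heatmaps_orig: torch.Tensor,
                               heatmaps_flip: torch.Tensor) -> torch.Tensor:
    """heatmaps_orig / heatmaps_flip: (N, C, H, W) attention maps from the two
    branches, where the second branch saw the horizontally flipped image."""
    expected = torch.flip(heatmaps_orig, dims=[-1])  # transform the heatmap the same way
    return F.mse_loss(expected, heatmaps_flip)

h = torch.rand(4, 20, 14, 14)
print(attention_consistency_loss(h, torch.flip(h, dims=[-1])))  # ~0 when consistent
```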
[human, recognition, considering, middle] [computer, vision, pattern, international, consistent, horizontal, corresponding] [image, proposed, consistency, conference, ieee, method, attribute, input, transform, comparison, figure, perceptual, verify] [network, performance, transforms, original, convolutional, table, deep, neural, scaling, cnns, represents, achieves, layer, better, wei] [attention, model, visual, arxiv, preprint, transformed, perception, considered, relevant, improved] [heatmaps, cnn, spatial, baseline, map, feature, object, presence, heatmap, wider, semantic, european, propose, pedestrian, xiaogang, three] [classification, label, loss, data, training, learning, flipping, set, enforcing, augmentation, multilabel, class, train, strategy, flipped]
@InProceedings{Guo_2019_CVPR,
  author = {Guo, Hao and Zheng, Kang and Fan, Xiaochuan and Yu, Hongkai and Wang, Song},
  title = {Visual Attention Consistency Under Image Transforms for Multi-Label Image Classification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Re-Ranking via Metric Fusion for Object Retrieval and Person Re-Identification
Song Bai, Peng Tang, Philip H.S. Torr, Longin Jan Latecki


This work studies the unsupervised re-ranking procedure for object retrieval and person re-identification with a particular focus on an ensemble of multiple metrics (or similarities). While the re-ranking step is realized by running a diffusion process on the underlying data manifolds, the fusion step can leverage the complementarity of multiple metrics. We give a comprehensive summary of existing fusion-with-diffusion strategies, and systematically analyze their pros and cons. Based on the analysis, we propose a unified yet robust algorithm which inherits their advantages and discards their disadvantages. Hence, we call it Unified Ensemble Diffusion (UED). More interestingly, we derive that the inherited properties indeed stem from a theoretical framework, where the relevant works can be elegantly summarized as special cases of UED by imposing additional constraints on the objective function and varying the solver of similarity propagation. Extensive experiments with 3D shape retrieval, image retrieval and person re-identification demonstrate that the proposed framework outperforms the state of the art, and at the same time suggest that re-ranking via metric fusion is a promising tool to further improve the retrieval performance of existing algorithms.
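As a rough illustration of fusion with diffusion (not the exact UED update), one can average the input similarity matrices, sparsify each row to its k strongest neighbours, and iterate a random-walk style smoothing on the fused graph; the update rule and hyper-parameters below are generic placeholders.

```python
import numpy as np

def fuse_and_diffuse(similarity_mats, alpha=0.9, iters=20, k=10):
    """similarity_mats: list of (n, n) similarity matrices to be fused and diffused."""
    W = np.mean(similarity_mats, axis=0)
    n = W.shape[0]
    # Locality constraint: keep only the k largest entries per row.
    drop = np.argsort(-W, axis=1)[:, k:]
    np.put_along_axis(W, drop, 0.0, axis=1)
    # Row-normalised transition matrix for the random walk.
    S = W / W.sum(axis=1, keepdims=True).clip(min=1e-12)
    A = np.eye(n)
    for _ in range(iters):
        A = alpha * S @ A @ S.T + (1.0 - alpha) * np.eye(n)
    return A  # diffused affinities, to be sorted row-wise for re-ranking

mats = [(m + m.T) / 2 for m in (np.random.rand(30, 30) for _ in range(3))]
print(fuse_and_diffuse(mats).shape)
```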
[fusion, multiple, auc, work, framework, graph, dataset] [shape, matrix, robust, solution, matching, algorithm, defined, optimization] [image, input, proposed, comparison] [performance, weight, table, deep, tensor, best, regularized, simplex, process, equal, achieves, neural, fast, efficient, better, convolutional] [red, step, visual, interplay, consider, simply, model, manifold, query] [person, map, object, affinity, fusing, baseline, three, average, including, cnn] [diffusion, ued, similarity, retrieval, learning, ensemble, metric, unified, tpf, existing, product, function, naive, noisy, summarized, objective, replicator, unsupervised, loss, data, ranking, pairwise, large, learn, rank, target, representation, specific]
@InProceedings{Bai_2019_CVPR,
  author = {Bai, Song and Tang, Peng and Torr, Philip H.S. and Jan Latecki, Longin},
  title = {Re-Ranking via Metric Fusion for Object Retrieval and Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Open Domain Recognition by Semantic Discrepancy Minimization
Junbao Zhuo, Shuhui Wang, Shuhao Cui, Qingming Huang


We address the unsupervised open domain recognition (UODR) problem, where the categories in the labeled source domain S are only a subset of those in the unlabeled target domain T. The task is to correctly classify all samples in T, including known and unknown categories. UODR is challenging due to the domain discrepancy, which becomes even harder to bridge when a large number of unknown categories exist in T. Moreover, the classification rules propagated by graph CNN (GCN) may be distracted by unknown categories and lack generalization capability. To measure the domain discrepancy for the asymmetric label space between S and T, we propose Semantic-Guided Matching Discrepancy (SGMD), which first employs instance matching between S and T, and then measures the discrepancy by a weighted feature distance between matched instances. We further design a limited balance constraint to achieve a more balanced classification output on known and unknown categories. We develop Unsupervised Open Domain Transfer Network (UODTN), which learns both the backbone classification network and the GCN jointly by reducing the SGMD, enforcing the limited balance constraint and minimizing the classification loss on S. UODTN better preserves the semantic structure and enforces the consistency between the learned domain-invariant visual features and the semantic embeddings. Experimental results show the superiority of our method on recognizing images of both known and unknown categories.
[gcn, graph, joint, recognition, propagate, construct, dataset] [matching, constraint, limited, problem, exists] [balance, proposed, figure, method, based] [network, deep, reducing, table, structure, layer, better, weight, weighted, pretrained] [word, model, adversarial, arxiv, preprint, encoded] [semantic, feature, category, matched, backbone, object, propose, including, challenging] [domain, unknown, target, discrepancy, classification, uodtn, source, classifier, learning, unsupervised, knowledge, training, transfer, uodr, adaptation, data, embeddings, open, large, labeled, space, loss, set, distance, generalized, bgcn, unlabeled, label, zsl, minimizing, subset, classify, share, distribution, negative, transductive, trained, bipartite, zgcn, task]
@InProceedings{Zhuo_2019_CVPR,
  author = {Zhuo, Junbao and Wang, Shuhui and Cui, Shuhao and Huang, Qingming},
  title = {Unsupervised Open Domain Recognition by Semantic Discrepancy Minimization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Weakly Supervised Person Re-Identification
Jingke Meng, Sheng Wu, Wei-Shi Zheng


In the conventional person re-id setting, it is assumed that the labeled images are the person images within the bounding box for each individual; this labeling across multiple nonoverlapping camera views from raw video surveillance is costly and time-consuming. To overcome this difficulty, we consider weakly supervised person re-id modeling. The weak setting refers to matching a target person with an untrimmed gallery video where we only know that the identity appears in the video without the requirement of annotating the identity in any frame of the video during the training procedure. Hence, for a video, there could be multiple video-level labels. We cast this weakly supervised person re-id challenge into a multi-instance multi-label learning (MIML) problem. In particular, we develop a Cross-View MIML (CV-MIML) method that is able to explore potential intraclass person images from all the camera views by incorporating the intra-bag alignment and the cross-view bag alignment. Finally, the CV-MIML method is embedded into an existing deep neural network for developing the Deep Cross-View MIML (Deep CV-MIML) model. We have performed extensive experiments to show the feasibility of the proposed weakly supervised setting and verify the effectiveness of our method compared to related methods on four weakly labeled datasets.
[video, multiple, dataset, tagged, current, term, untrimmed, frame, individual] [camera, corresponding, matching, form, problem] [method, raw, figure, proposed, image, identity, comparison, prior] [deep, performance, table, parameter, network, neural, group, accuracy, compared] [model, potential, probability, appears] [person, weakly, bag, miml, map, instance, feature, seed, fully, bounding, weak, baseline, liang] [set, supervised, gallery, alignment, learning, probe, label, distribution, prototype, unsupervised, training, datasets, class, setting, labeled, target, existing, classifier, nonoverlapping, classification, testing, data, conventional, belonging, loss, unknown, specific]
@InProceedings{Meng_2019_CVPR,
  author = {Meng, Jingke and Wu, Sheng and Zheng, Wei-Shi},
  title = {Weakly Supervised Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud
Shaoshuai Shi, Xiaogang Wang, Hongsheng Li


In this paper, we propose PointRCNN for 3D object detection from raw point cloud. The whole framework is composed of two stages: stage-1 for the bottom-up 3D proposal generation and stage-2 for refining proposals in the canonical coordinates to obtain the final detection results. Instead of generating proposals from RGB image or projecting point cloud to bird's view or voxels as previous methods do, our stage-1 sub-network directly generates a small number of high-quality 3D proposals from point cloud in a bottom-up manner via segmenting the point cloud of the whole scene into foreground points and background. The stage-2 sub-network transforms the pooled points of each proposal to canonical coordinates to learn better local spatial features, which is combined with global semantic features of each point learned in stage-1 for accurate box refinement and confidence prediction. Extensive experiments on the 3D detection benchmark of KITTI dataset show that our proposed architecture outperforms state-of-the-art methods with remarkable margins by using only point cloud as input. The code is available at https://github.com/sshaoshuai/PointRCNN.
[previous] [point, cloud, canonical, computer, vision, pattern, view, directly, coordinate, kitti, local, lidar, rgb, robust, confidence, international, scene, accurate, estimation, autonomous, depth, voxels] [conference, ieee, method, proposed, based, image, raw] [network, size, number, pooled, performance, achieves, pooling, deep, small, search, residual] [generation, generate, generating, system, generates, transformed] [box, object, proposal, detection, bounding, foreground, segmentation, recall, refinement, feature, bin, iou, regression, semantic, threshold, pointrcnn, mask, car, val, propose, spatial, center, predicted, refining, region, location, moderate, raquel] [loss, learning, training, split, classification, learn, test, set, hard]
@InProceedings{Shi_2019_CVPR,
  author = {Shi, Shaoshuai and Wang, Xiaogang and Li, Hongsheng},
  title = {PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Automatic Adaptation of Object Detectors to New Domains Using Self-Training
Aruni RoyChowdhury, Prithvijit Chakrabarty, Ashish Singh, SouYoung Jin, Huaizu Jiang, Liangliang Cao, Erik Learned-Miller


This work addresses the unsupervised adaptation of an existing object detector to a new target domain. We assume that a large number of unlabeled videos from this domain are readily available. We automatically obtain labels on the target data by using high-confidence detections from the existing detector, augmented with hard (misclassified) examples acquired by exploiting temporal cues using a tracker. These automatically-obtained labels are then used for re-training the original model. A modified knowledge distillation loss is proposed, and we investigate several ways of assigning soft-labels to the training examples from the target domain. Our approach is empirically evaluated on challenging face and pedestrian detection tasks: a face detector trained on WIDER-Face, which consists of high-quality images crawled from the web, is adapted to a large-scale surveillance data set; a pedestrian detector trained on clear, daytime images from the BDD-100K driving data set is adapted to all other scenarios such as rainy, foggy, night-time. Our results demonstrate the usefulness of incorporating hard examples obtained from tracking, the advantage of using soft-labels via distillation loss versus hard-labels, and show promising performance as a simple method for unsupervised domain adaptation of object detectors, with minimal dependence on hyper-parameters.
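The self-training loop above can be sketched as: run the source detector on target videos, keep high-confidence boxes as pseudo ground truth, and retrain with the detector's scores retained as soft targets. The score threshold, dictionary format and loss form below are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def mine_pseudo_labels(detections, score_thresh=0.8):
    """Keep only high-confidence detections as pseudo ground truth.
    Each detection is assumed to be a dict with 'box' and 'score' keys."""
    return [d for d in detections if d["score"] >= score_thresh]

def soft_label_loss(pred_logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Distillation-style objective: cross-entropy against soft targets
    instead of hard 0/1 labels."""
    log_probs = torch.log_softmax(pred_logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()

dets = [{"box": (10, 10, 50, 80), "score": 0.93},
        {"box": (5, 5, 20, 20), "score": 0.40}]
print(len(mine_pseudo_labels(dets)))  # -> 1 pseudo-labeled box survives
```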
[video, dataset, recognition, work, jin, temporal, tracking, consists, driving] [computer, vision, pattern, approach, volume, confidence] [conference, face, ieee, method, surveillance, based, image, figure] [performance, deep, shift, table, neural, number, automatically, tracker, network] [adversarial, model, arxiv, preprint, discriminator, example, missed, machine, creating] [object, detector, baseline, detection, pedestrian, score, wider, challenging, faster, threshold, improve] [domain, target, training, labeled, data, hard, learning, unlabeled, source, set, adaptation, unsupervised, trained, label, distribution, loss, soft, adapting, large, distillation, sample, knowledge, difficult, selected]
@InProceedings{RoyChowdhury_2019_CVPR,
  author = {RoyChowdhury, Aruni and Chakrabarty, Prithvijit and Singh, Ashish and Jin, SouYoung and Jiang, Huaizu and Cao, Liangliang and Learned-Miller, Erik},
  title = {Automatic Adaptation of Object Detectors to New Domains Using Self-Training},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Sketch-Shape Hashing With Segmented 3D Stochastic Viewing
Jiaxin Chen, Jie Qin, Li Liu, Fan Zhu, Fumin Shen, Jin Xie, Ling Shao


Sketch-based 3D shape retrieval has been extensively studied in recent works, most of which focus on improving the retrieval accuracy, whilst neglecting the efficiency. In this paper, we propose a novel framework for efficient sketch-based 3D shape retrieval, i.e., Deep Sketch-Shape Hashing (DSSH), which tackles the challenging problem from two perspectives. Firstly, we propose an intuitive 3D shape representation method to deal with unaligned shapes with arbitrary poses. Specifically, the proposed Segmented Stochastic-viewing Shape Network models discriminative 3D representations by a set of 2D images rendered from multiple views, which are stochastically selected from non-overlapping spatial segments of a 3D sphere. Secondly, Batch-Hard Binary Coding (BHBC) is developed to learn semantics-preserving compact binary codes by mining the hardest samples. The overall framework is jointly learned by developing an alternating iteration algorithm. Extensive experimental results on three benchmarks show that DSSH improves both the retrieval efficiency and accuracy remarkably, compared to the state-of-the-art methods.
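Batch-hard mining itself is easy to illustrate: for each anchor in a batch, take its farthest positive and closest negative from a pairwise distance matrix and apply a margin. This is a generic batch-hard triplet term standing in for BHBC; the names and margin are chosen purely for illustration.

```python
import torch

def batch_hard_triplet_loss(dist: torch.Tensor, labels: torch.Tensor,
                            margin: float = 0.5) -> torch.Tensor:
    """dist: (B, B) pairwise distances between batch embeddings (or codes)."""
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=dist.device)
    hardest_pos = dist.masked_fill(~same | eye, float("-inf")).amax(dim=1)
    hardest_neg = dist.masked_fill(same, float("inf")).amin(dim=1)
    return torch.relu(hardest_pos - hardest_neg + margin).mean()

emb = torch.randn(8, 64)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(batch_hard_triplet_loss(torch.cdist(emb, emb), labels))
```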
[time, recognition, jointly, learns] [shape, view, rendering, computer, rendered, vision, pattern, international, optimization, horizontal, matching] [based, conference, ieee, proposed, image, method, figure] [binary, deep, network, stochastic, coding, convolutional, performance, efficient, table, whilst, compact, computational, quantization, neural, efficiency, number, fast] [attention, model, indicates, memory, vector, sampled] [segmented, propose, spatial, semantic, average, feature, adopt] [dssh, learning, sampling, hashing, retrieval, set, sketch, learn, discriminative, loss, hash, existing, selected, code, distance, training, learned, large, sample, function, novel, hardest, data, lbc, supervised]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Jiaxin and Qin, Jie and Liu, Li and Zhu, Fan and Shen, Fumin and Xie, Jin and Shao, Ling},
  title = {Deep Sketch-Shape Hashing With Segmented 3D Stochastic Viewing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Generative Dual Adversarial Network for Generalized Zero-Shot Learning
He Huang, Changhu Wang, Philip S. Yu, Chang-Dong Wang


This paper studies the problem of generalized zero-shot learning, which requires the model to train on image-label pairs from some seen classes and test on the task of classifying new images from both seen and unseen classes. We propose a novel model that provides a unified framework for three different approaches: visual->semantic mapping, semantic->visual mapping, and metric learning. Specifically, our proposed model consists of a feature generator that can generate various visual features given class embeddings as input, a regressor that maps each visual feature back to its corresponding class embedding, and a discriminator that learns to evaluate the closeness of an image feature and a class embedding. All three components are trained under the combination of a cyclic consistency loss and a dual adversarial loss. Experimental results show that our model not only preserves higher accuracy in classifying images from seen classes, but also performs better than existing state-of-the-art models in classifying images from unseen classes.
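A hedged sketch of the generator plus cyclic-consistency part of the description above; the layer sizes, noise dimension and MSE form of the cycle term are placeholders rather than the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureGenerator(nn.Module):
    """Class embedding + noise -> synthetic visual feature (illustrative sizes)."""
    def __init__(self, emb_dim=300, noise_dim=100, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + noise_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, feat_dim), nn.ReLU())

    def forward(self, emb, noise):
        return self.net(torch.cat([emb, noise], dim=1))

def cycle_loss(regressor: nn.Module, fake_feat: torch.Tensor,
               emb: torch.Tensor) -> torch.Tensor:
    """Cyclic consistency: the regressor should map the synthetic feature
    back to the class embedding it was generated from."""
    return F.mse_loss(regressor(fake_feat), emb)

gen = FeatureGenerator()
reg = nn.Linear(2048, 300)  # stand-in regressor
emb, noise = torch.randn(4, 300), torch.randn(4, 100)
print(cycle_loss(reg, gen(emb, noise), emb))
```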
[framework] [computer, vision, pattern, latexit, well, problem, cyclic, defined, equation, corresponding] [image, dual, generative, figure, synthetic, conference, real, ieee, proposed, generator, component, consistency, latent, method, mapping] [network, accuracy, deep, neural, performance, table, number, better, achieves, order, output, represents] [model, visual, adversarial, discriminator, cvae, generate, arxiv, preprint, evaluate, fake, gan, variational, ability, machine] [semantic, feature, three, illustrated, propose, map] [learning, unseen, regressor, class, generalized, train, loss, metric, embedding, learn, gdan, training, data, set, apy, sun, space, task, test, epdata, cub, novel, label, relationnet, accy, unified, datasets, harmonic, large, classifying]
@InProceedings{Huang_2019_CVPR,
  author = {Huang, He and Wang, Changhu and Yu, Philip S. and Wang, Chang-Dong},
  title = {Generative Dual Adversarial Network for Generalized Zero-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Query-Guided End-To-End Person Search
Bharti Munjal, Sikandar Amin, Federico Tombari, Fabio Galasso


Person search has recently gained attention as the novel task of finding a person, provided as a cropped sample, from a gallery of non-cropped images, whereby several other people are also visible. We believe that i. person detection and re-identification should be pursued in a joint optimization framework and that ii. the person search should leverage the query image extensively (e.g. emphasizing unique query patterns). However, so far, no prior art realizes this. We introduce a novel query-guided end-to-end person search network (QEEPS) to address both aspects. We leverage a most recent joint detector and re-identification work, OIM [37]. We extend this with i. a query-guided Siamese squeeze-and-excitation network (QSSE-Net) that uses global context from both the query and gallery images, ii. a query-guided region proposal network (QRPN) to produce query-relevant proposals, and iii. a query-guided similarity subnetwork (QSimNet), to learn a query-guided re-identification score. QEEPS is the first end-to-end query-guided detection and re-id network. On both the most recent CUHK-SYSU [37] and PRW [46] datasets, we outperform the previous state-of-the-art by a large margin.
[recognition, joint, dataset, people, jointly, work, report] [computer, vision, pattern, international, approach, note, local] [conference, image, ieee, proposed, figure, separate, method, based] [network, search, siamese, size, performance, best, block, table, deep, standard, number, scale, channel, neural, net, residual] [query, model, attention, introduce, consider, evaluation] [person, oim, qeeps, feature, detection, qrpn, rpn, map, prw, proposal, qsimnet, global, faster, baseline, adopt, detector, npsm, qsse, european, region, object, score, basenet, hxwxc, anchor, bounding] [gallery, loss, learning, similarity, novel, metric, training, specific, set, task, learn, classification, base]
@InProceedings{Munjal_2019_CVPR,
  author = {Munjal, Bharti and Amin, Sikandar and Tombari, Federico and Galasso, Fabio},
  title = {Query-Guided End-To-End Person Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Libra R-CNN: Towards Balanced Learning for Object Detection
Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, Dahua Lin


Compared with model architectures, the training process, which is also crucial to the success of detectors, has received relatively less attention in object detection. In this work, we carefully revisit the standard training practice of detectors, and find that the detection performance is often limited by the imbalance during the training process, which generally exists at three levels - sample level, feature level, and objective level. To mitigate the resulting adverse effects, we propose Libra R-CNN, a simple but effective framework towards balanced learning for object detection. It integrates three novel components: IoU-balanced sampling, balanced feature pyramid, and balanced L1 loss, for reducing the imbalance at the sample, feature, and objective level, respectively. Benefiting from the overall balanced design, Libra R-CNN significantly improves the detection performance. Without bells and whistles, it achieves 2.5 points and 2.0 points higher Average Precision (AP) than FPN Faster R-CNN and RetinaNet respectively on MSCOCO.
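For concreteness, here is a sketch of a balanced L1 loss in the spirit described above, using the commonly quoted defaults alpha=0.5 and gamma=1.5; treat these constants and the exact piecewise form as assumptions and check the paper or reference code before relying on them.

```python
import numpy as np

def balanced_l1_loss(x, alpha=0.5, gamma=1.5):
    """Piecewise regression loss: a log-weighted branch for small errors and a
    linear branch for large ones, joined smoothly at |x| = 1."""
    x = np.abs(x)
    # b is chosen so that the gradient of both pieces matches at |x| = 1.
    b = np.exp(gamma / alpha) - 1.0
    # C makes the piecewise function continuous at |x| = 1.
    C = alpha / b * (b + 1) * np.log(b + 1) - alpha - gamma
    small = alpha / b * (b * x + 1) * np.log(b * x + 1) - alpha * x
    large = gamma * x + C
    return np.where(x < 1.0, small, large)

print(balanced_l1_loss(np.array([0.1, 0.5, 1.0, 2.0])))
```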
[crucial, framework] [computer, pattern, vision, corresponding, pipeline] [conference, ieee, figure, method, denoted, based, aps, proposed] [higher, table, weight, compared, deep, achieves, process, effective, number, equal, effectiveness, neural, performance, convolutional] [model, random, simple, potential, procedure, arxiv] [object, libra, feature, faster, detection, pyramid, semantic, level, fpn, ablation, ross, coco, retinanet, localization, rpn, kaiming, three, involved, iou, average, box, fully, integrated, cascade, propose, easy, mask, brings, pafpn, apm, apl, european, piotr, doll, jianping, wanli] [balanced, loss, sampling, training, hard, imbalance, sample, objective, learning, selected]
@InProceedings{Pang_2019_CVPR,
  author = {Pang, Jiangmiao and Chen, Kai and Shi, Jianping and Feng, Huajun and Ouyang, Wanli and Lin, Dahua},
  title = {Libra R-CNN: Towards Balanced Learning for Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning a Unified Classifier Incrementally via Rebalancing
Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, Dahua Lin


Conventionally, deep neural networks are trained offline, relying on a large dataset prepared in advance. This paradigm is often challenged in real-world applications, e.g. online services that involve continuous streams of incoming data. Recently, incremental learning receives increasing attention, and is considered as a promising solution to the practical challenges mentioned above. However, it has been observed that incremental learning is subject to a fundamental difficulty -- catastrophic forgetting, namely adapting a model to new data often results in severe performance degradation on previous tasks or classes. Our study reveals that the imbalance between previous and new data is a crucial cause to this problem. In this work, we develop a new framework for incrementally learning a unified classifier, e.g. a classifier that treats both old and new classes uniformly. Specifically, we incorporate three components, cosine normalization, less-forget constraint, and inter-class separation, to mitigate the adverse effects of the imbalance. Experiments show that the proposed method can effectively rebalance the training process, thus obtaining superior performance compared to the existing methods. On CIFAR-100 and ImageNet, our method can reduce the classification errors by more than 6% and 13% respectively, under the incremental setting of 10 phases.
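Two of the three components are easy to illustrate: a cosine-normalised classifier (so old-class and new-class logits live on the same scale) and a less-forget style term that keeps new features close, in angle, to those of the frozen old model. This is a simplified sketch; the learnable scale and the exact constraint form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Cosine-similarity classifier with a learnable scale (illustrative)."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = nn.Parameter(torch.tensor(10.0))

    def forward(self, feats):
        feats = F.normalize(feats, dim=1)
        weight = F.normalize(self.weight, dim=1)
        return self.scale * feats @ weight.t()

def less_forget_loss(new_feats, old_feats):
    """Penalise angular drift of features w.r.t. the frozen old model."""
    return 1.0 - F.cosine_similarity(new_feats, old_feats, dim=1).mean()

clf = CosineClassifier(feat_dim=256, num_classes=60)
feats = torch.randn(8, 256)
print(clf(feats).shape, less_forget_loss(feats, feats))  # drift is 0 here
```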
[previous, work, dataset, current, report, framework, finetune, long, multiple] [approach, constraint, computed, well, normalized] [figure, method, proposed, preserve, chen, study, separate, change] [performance, original, neural, number, weight, better, deep, adaptive, normalization, imagenet, compared, reduce, network, superior] [model, introduce, observed, evaluate] [three, propose, cnn, feature, adopted] [incremental, learning, loss, class, imbalance, cosine, distillation, knowledge, icarl, data, training, dis, embeddings, reserved, trained, classifier, classification, imbalanced, set, unified, margin, adverse, setting, ranking, lifelong, large, forgetting, confusion, lce, hard, sample, strategy, base, catastrophic, effectively, negative]
@InProceedings{Hou_2019_CVPR,
  author = {Hou, Saihui and Pan, Xinyu and Change Loy, Chen and Wang, Zilei and Lin, Dahua},
  title = {Learning a Unified Classifier Incrementally via Rebalancing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Feature Selective Anchor-Free Module for Single-Shot Object Detection
Chenchen Zhu, Yihui He, Marios Savvides


We motivate and present feature selective anchor-free (FSAF) module, a simple and effective building block for single-shot object detectors. It can be plugged into single-shot detectors with feature pyramid structure. The FSAF module addresses two limitations brought up by the conventional anchor-based detection: 1) heuristic-guided feature selection; 2) overlap-based anchor sampling. The general concept of the FSAF module is online feature selection applied to the training of multi-level anchor-free branches. Specifically, an anchor-free branch is attached to each level of the feature pyramid, allowing box encoding and decoding in the anchor-free manner at an arbitrary level. During training, we dynamically assign each instance to the most suitable feature level. At the time of inference, the FSAF module can work independently or jointly with anchor-based branches. We instantiate this concept with simple implementations of anchor-free branches and online feature selection strategy. Experimental results on the COCO detection track show that our FSAF module performs better than anchor-based counterparts while being faster. When working jointly with anchor-based branches, the FSAF module robustly improves the baseline RetinaNet by a large margin under various settings, while introducing nearly free inference overhead. And the resulting best model can achieve a state-of-the-art 44.6% mAP, outperforming all existing single-shot detectors on COCO.
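The online feature selection step reduces to a small routine: evaluate the anchor-free loss of each instance at every pyramid level and assign the instance to the level with the smallest loss. The tensor layout below is an assumption made for illustration.

```python
import torch

def select_pyramid_levels(losses_per_level: torch.Tensor) -> torch.Tensor:
    """losses_per_level: (num_levels, num_instances) anchor-free loss of each
    ground-truth instance when assigned to each feature pyramid level.
    Returns the index of the level chosen for each instance."""
    return losses_per_level.argmin(dim=0)

losses = torch.tensor([[1.2, 0.7, 2.0],
                       [0.9, 1.1, 0.6],
                       [1.5, 1.4, 0.8]])  # 3 levels x 3 instances
print(select_pyramid_levels(losses))      # -> tensor([1, 0, 1])
```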
[online, jointly, work, recognition, time] [computer, vision, pattern, focal, additional, total] [conference, figure, ieee, image, based, proposed] [selection, network, scale, effective, table, inference, best, better, deep, convolutional, neural, small, applied, suitable, number, lower, design] [arxiv, preprint, simple, model] [feature, fsaf, module, instance, box, object, level, person, retinanet, pyramid, detection, backbone, anchor, regression, iou, region, coco, map, assigned, location, heuristic, subnet, detector, piotr, doll, ross, kaiming, anchorbased, final, bounding, faster, selective, marios, detecting] [loss, classification, training, selected, class, large, select, learning, testing]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Chenchen and He, Yihui and Savvides, Marios},
  title = {Feature Selective Anchor-Free Module for Single-Shot Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Bottom-Up Object Detection by Grouping Extreme and Center Points
Xingyi Zhou, Jiacheng Zhuo, Philipp Krahenbuhl


With the advent of deep learning, object detection drifted from a bottom-up to a top-down recognition problem. State of the art algorithms enumerate a near-exhaustive list of object locations and classify each into: object or not. In this paper, we show that bottom-up approaches still perform competitively. We detect four extreme points (top-most, left-most, bottom-most, right-most) and one center point of objects using a standard keypoint estimation network. We group the five keypoints into a bounding box if they are geometrically aligned. Object detection is then a purely appearance-based keypoint estimation problem, without region classification or implicit feature learning. The proposed method performs on-par with the state-of-the-art region based detection methods, with a bounding box AP of 43.7% on COCO test-dev. In addition, our estimated extreme points directly span a coarse octagonal mask, with a COCO Mask AP of 18.9%, much better than the Mask AP of vanilla bounding boxes. Extreme point guided segmentation further improves this to 34.6% Mask AP.
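The grouping step admits a compact sketch: four extreme points are accepted as one object only if their geometric center scores highly in the center heatmap, and the points then directly span the bounding box. The point format, threshold and return value below are illustrative assumptions.

```python
import numpy as np

def group_extreme_points(top, left, bottom, right, center_heatmap, thresh=0.1):
    """top/left/bottom/right: (x, y) extreme-point candidates.
    Returns a box (x1, y1, x2, y2, center_score) or None if the center check fails."""
    cx = int(round((left[0] + right[0]) / 2.0))
    cy = int(round((top[1] + bottom[1]) / 2.0))
    h, w = center_heatmap.shape
    if not (0 <= cx < w and 0 <= cy < h):
        return None
    if center_heatmap[cy, cx] < thresh:
        return None
    # The extreme points directly span the bounding box.
    return (left[0], top[1], right[0], bottom[1], float(center_heatmap[cy, cx]))

heat = np.zeros((64, 64)); heat[30, 32] = 0.9
print(group_extreme_points((32, 10), (12, 30), (33, 50), (52, 31), heat))
```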
[human, framework, prediction, work, learns, lie] [point, keypoint, estimation, pose, geometric, directly, ground, algorithm, keypoints, truth, single, error, implicit, form] [image, method, input, figure, high] [network, deep, aggregation, table, convolutional, group, number, top, higher, scale] [simple, model] [extreme, object, center, bounding, box, heatmap, detection, mask, segmentation, region, coco, offset, cornernet, edge, grouping, heatmaps, associative, instance, map, feature, ghost, ross, predicted, score, dextr, kaiming, octagon, extremenet, jian, enumerate, response, fully, peak, semantic, propose, average, faster, proposal, location] [trained, loss, learning, embedding, set, training, class]
@InProceedings{Zhou_2019_CVPR,
  author = {Zhou, Xingyi and Zhuo, Jiacheng and Krahenbuhl, Philipp},
  title = {Bottom-Up Object Detection by Grouping Extreme and Center Points},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Feature Distillation: DNN-Oriented JPEG Compression Against Adversarial Examples
Zihao Liu, Qi Liu, Tao Liu, Nuo Xu, Xue Lin, Yanzhi Wang, Wujie Wen


Image compression-based approaches for defending against adversarial-example attacks, which threaten the safe use of deep neural networks (DNNs), have been investigated recently. However, prior works mainly rely on directly tuning parameters such as the compression rate to blindly reduce image features, thereby lacking guarantees on both defense efficiency (i.e. accuracy on polluted images) and classification accuracy on benign images after applying defense methods. To overcome these limitations, we propose a JPEG-based defensive compression framework, namely "feature distillation", to effectively rectify adversarial examples without impacting classification accuracy on benign data. Our framework significantly escalates the defense efficiency with marginal accuracy reduction using a two-step method: First, we maximize the filtering of malicious features from adversarial input perturbations by developing defensive quantization in the frequency domain of JPEG compression or decompression, guided by a semi-analytical method; Second, we suppress the distortions of benign features to restore classification accuracy through a DNN-oriented quantization refinement process. Our experimental results show that the proposed "feature distillation" can significantly surpass the latest input-transformation-based mitigations such as Quilting and TV Minimization in three aspects, including defense efficiency (improving classification accuracy from 20% to 90% on adversarial examples), accuracy on benign images after defense (<=1% accuracy degradation), and processing time per image (~259x speedup). Moreover, our solution can also provide the best defense efficiency (~60% accuracy) against the latest BPDA attack with the least accuracy reduction (~1%) on benign images among all other input-transformation-based defense methods.
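The frequency-domain idea can be illustrated with a toy defensive quantisation of an 8x8 block: low-frequency DCT coefficients (carrying most benign content) get a finer step than high-frequency ones, where small adversarial perturbations tend to concentrate. The step sizes and cutoff below are placeholders, not the paper's semi-analytically derived values.

```python
import numpy as np
from scipy.fft import dctn, idctn

def defensive_quantize(block: np.ndarray, q_low=20.0, q_high=40.0, cutoff=4):
    """Quantise an 8x8 grayscale block in the DCT domain with a coarser step
    for high frequencies, then reconstruct the block."""
    coeffs = dctn(block, norm="ortho")
    u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
    steps = np.where(u + v < cutoff, q_low, q_high)
    return idctn(np.round(coeffs / steps) * steps, norm="ortho")

block = np.random.rand(8, 8) * 255.0
print(np.abs(defensive_quantize(block) - block).mean())  # distortion introduced
```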
[extract] [crafted, deviation, discrete, analysis] [jpeg, raw, dequantization, image, frequency, component, band, input, mapping] [coefficient, process, quantization, original, accuracy, rate, std, arithmetic, compression, dnn, table, standard] [pass, fgsm, bim, deepfool, cwi, attack, success, decoding, model, encoding, adversarial, example, ascending] [enhanced, average] [data, dct, ranking]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Zihao and Liu, Qi and Liu, Tao and Xu, Nuo and Lin, Xue and Wang, Yanzhi and Wen, Wujie},
  title = {Feature Distillation: DNN-Oriented JPEG Compression Against Adversarial Examples},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SCOPS: Self-Supervised Co-Part Segmentation
Wei-Chih Hung, Varun Jampani, Sifei Liu, Pavlo Molchanov, Ming-Hsuan Yang, Jan Kautz


Parts provide a good intermediate representation of objects that is robust with respect to camera, pose and appearance variations. Existing work on part segmentation is dominated by supervised approaches that rely on large amounts of manual annotations and cannot generalize to unseen object categories. We propose a self-supervised deep learning approach for part segmentation, where we devise several loss functions that aid in predicting part segments that are geometrically concentrated, robust to object variations and semantically consistent across different object instances. Extensive experiments on different types of image collections demonstrate that our approach can produce part segments that adhere to object boundaries and are also more semantically consistent across object instances compared to existing self-supervised techniques.
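One of the loss terms, consistency of part predictions under a known spatial transform (equivariance), can be sketched as: warp the part responses of the original image with the transform and penalise the difference to the responses predicted for the transformed image. The flip transform and MSE below are simplifying assumptions standing in for the paper's formulation.

```python
import torch
import torch.nn.functional as F

def equivariance_loss(parts_orig: torch.Tensor,
                      parts_tf: torch.Tensor) -> torch.Tensor:
    """parts_orig / parts_tf: (N, K, H, W) per-part response maps for an image
    and for its horizontally flipped version."""
    warped = torch.flip(parts_orig.softmax(dim=1), dims=[-1])
    return F.mse_loss(warped, parts_tf.softmax(dim=1))

p = torch.randn(2, 5, 32, 32)
print(equivariance_loss(p, torch.flip(p, dims=[-1])))  # ~0 for a consistent model
```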
[dataset, multiple, predict] [robust, consistent, pose, geometric, constraint, single, estimation, dense, good, camera, technique, respect, approach, well, matrix, matching] [image, landmark, consistency, figure, proposed, appearance, input, transform, unaligned, celeba, background, lsc, method, intermediate] [network, deep, neural, compared, convolutional, table, factorization] [basis, semantically, visual, collection, evaluate, model, common] [object, segmentation, scops, semantic, feature, spatial, dff, equivariance, propose, response, saliency, uld, pascal, detection, map, concentration, challenging, indicate, foreground, cnn, voc, annotated] [loss, learning, unsupervised, train, learn, learned, training, existing, large, test, cub, trained]
@InProceedings{Hung_2019_CVPR,
  author = {Hung, Wei-Chih and Jampani, Varun and Liu, Sifei and Molchanov, Pavlo and Yang, Ming-Hsuan and Kautz, Jan},
  title = {SCOPS: Self-Supervised Co-Part Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Moving Object Detection via Contextual Information Separation
Yanchao Yang, Antonio Loquercio, Davide Scaramuzza, Stefano Soatto


We propose an adversarial contextual model for detecting moving objects in images. A deep neural network is trained to predict the optical flow in a region using information from everywhere else but that region (context), while another network attempts to make such context as uninformative as possible. The result is a model where hypotheses naturally compete with no need for explicit regularization or hyper-parameter tuning. Although our method requires no supervision whatsoever, it outperforms several methods that are pre-trained on large annotated datasets. Our model can be thought of as a generalization of classical variational generative region-based segmentation, but in a way that avoids explicit regularization or solution of partial differential equations at run-time.
[flow, optical, motion, moving, video, inpainter, recognition, time, puin, predict, outperforms, uin, dataset, uout, tracking, arp, nlc, naturally, relates] [vision, computer, approach, pattern, international, require, classical, well, analysis, problem, definition, explicit, functional] [image, ieee, conference, method, generator, background, input, generative, figure] [network, deep, performance, table, neural, best, inference, compare, regularization, complexity, modern, convolutional] [model, variational, machine, adversarial, visual, considered, generated, requires, partial, call] [object, segmentation, region, foreground, inside, detection, contextual, mask, fully, piecewise, annotated, detect, supervision] [unsupervised, function, learning, training, loss, large, supervised, datasets, trained, data, set, trivial, test, learn]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Yanchao and Loquercio, Antonio and Scaramuzza, Davide and Soatto, Stefano},
  title = {Unsupervised Moving Object Detection via Contextual Information Separation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Pose2Seg: Detection Free Human Instance Segmentation
Song-Hai Zhang, Ruilong Li, Xin Dong, Paul Rosin, Zixi Cai, Xi Han, Dingcheng Yang, Haozhi Huang, Shi-Min Hu


The standard approach to image instance segmentation is to perform object detection first, and then segment the object from the detection bounding-box. More recently, deep learning methods like Mask R-CNN perform them jointly. However, little research takes into account the uniqueness of the "human" category, which can be well defined by the pose skeleton. Moreover, the human pose skeleton can be used to better distinguish instances with heavy occlusion than using bounding-boxes. In this paper, we present a brand new pose-based instance segmentation framework for humans which separates instances based on human pose, rather than proposal region detection. We demonstrate that our pose-based framework can achieve better accuracy than the state-of-the-art detection-based approach on the human instance segmentation problem, and can moreover better handle occlusion. Furthermore, there are few public datasets containing many heavily occluded humans along with comprehensive annotations, which makes this a challenging problem seldom noticed by researchers. Therefore, in this paper we introduce a new benchmark, "Occluded Human (OCHuman)", which focuses on occluded humans with comprehensive annotations including bounding-boxes, human poses and instance masks. This dataset contains 8110 human instances with detailed annotations across 4731 images. With an average MaxIoU of 0.67 per person, OCHuman is the most complex and challenging dataset related to human instance segmentation. Through this dataset, we want to emphasize occlusion as a challenging problem for researchers to study.
[human, dataset, skeleton, framework, perform, joint] [pose, computer, vision, occluded, estimation, occlusion, body, pattern, keypoints, valid, keypoint, provide, single, field, international, well, problem, general, matrix] [conference, based, figure, ieee, image, method, comprehensive] [better, performance, network, achieve, table, residual, receptive, validation, called, operation, best, convolutional] [evaluate, represent] [instance, segmentation, detection, object, mask, ochuman, heavily, person, challenging, coco, segmodule, benchmark, maxiou, personlab, jian, ross, category, cocopersons, kaiming, european, heavy, module, map, score, val, bbox, kpt] [alignment, align, large, set, learning, cluster, training, trained]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Song-Hai and Li, Ruilong and Dong, Xin and Rosin, Paul and Cai, Zixi and Han, Xi and Yang, Dingcheng and Huang, Haozhi and Hu, Shi-Min},
  title = {Pose2Seg: Detection Free Human Instance Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DrivingStereo: A Large-Scale Dataset for Stereo Matching in Autonomous Driving Scenarios
Guorun Yang, Xiao Song, Chaoqin Huang, Zhidong Deng, Jianping Shi, Bolei Zhou


Great progress has been made on estimating disparity maps from stereo images. However, with the limited stereo data available in the existing datasets and unstable ranging precision of current stereo methods, industry-level stereo matching in autonomous driving remains challenging. In this paper, we construct a novel large-scale stereo dataset named DrivingStereo. It contains over 180k images covering a diverse set of driving scenarios, which is hundreds of times larger than the KITTI Stereo dataset. High-quality labels of disparity are produced by a model-guided filtering strategy from multi-frame LiDAR points. For better evaluations, we present two new metrics for stereo matching in the driving scenes, i.e. a distance-aware metric and a semantic-aware metric. Extensive experiments show that compared with the models trained on FlyingThings3D or Cityscapes, the models trained on our DrivingStereo achieve higher generalization accuracy in real-world driving scenes, while the proposed metrics better evaluate the stereo methods on all-range distances and across different classes. Our dataset and code are available at https://drivingstereo-dataset.github.io.
[dataset, driving, current, moving, prediction, flow] [stereo, disparity, kitti, matching, drivingstereo, lidar, depth, guidenet, point, valid, epe, error, autonomous, range, cloud, provide, projected, single, define, sgm, scene, virtual, dispnet, edgestereo, compute, calibration] [filtering, image, proposed, synthetic, figure, based] [deep, accuracy, better, convolutional, performance, pretrained, compared, rate, capacity, network, secondary, segstereo] [diverse, evaluation, evaluate, model, correct] [semantic, map, guided, fused, evaluated, baseline, foreground, segmentation, final, spatial] [set, learning, datasets, training, data, strategy, trained, existing, test, metric, large]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Guorun and Song, Xiao and Huang, Chaoqin and Deng, Zhidong and Shi, Jianping and Zhou, Bolei},
  title = {DrivingStereo: A Large-Scale Dataset for Stereo Matching in Autonomous Driving Scenarios},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PartNet: A Large-Scale Benchmark for Fine-Grained and Hierarchical Part-Level 3D Object Understanding
Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, Hao Su


We present PartNet: a consistent, large-scale dataset of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information. Our dataset consists of 573,585 part instances over 26,671 3D models covering 24 object categories. This dataset enables and serves as a catalyst for many tasks such as shape analysis, dynamic 3D scene modeling and simulation, affordance analysis, and others. Using our dataset, we establish three benchmarking tasks for evaluating 3D part recognition: fine-grained semantic segmentation, hierarchical semantic segmentation, and instance segmentation. We benchmark four state-of-the-art 3D deep learning algorithms for fine-grained semantic segmentation and three baseline methods for hierarchical semantic segmentation. We also propose a baseline method for part instance segmentation and demonstrate its superior performance over existing methods.
[dataset, multiple, avg, prediction, work] [shape, point, computer, hao, template, leonidas, vision, mesh, provide, chair, shapenet, pattern, algorithm, defined, cloud, require, consistent, bed, depth] [method, figure, proposed, conference, acm, ieee, cover, collected, row] [table, network, deep, number, small, neural] [understanding, arxiv, preprint, evaluation] [segmentation, semantic, hierarchical, object, instance, partnet, three, annotation, coarse, benchmark, baseline, lamp, propose, average, miou, door, level, semantics, category, iou, mask, annotate, interface] [learning, data, existing, train, training, task]
@InProceedings{Mo_2019_CVPR,
  author = {Mo, Kaichun and Zhu, Shilin and Chang, Angel X. and Yi, Li and Tripathi, Subarna and Guibas, Leonidas J. and Su, Hao},
  title = {PartNet: A Large-Scale Benchmark for Fine-Grained and Hierarchical Part-Level 3D Object Understanding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Dataset and Benchmark for Large-Scale Multi-Modal Face Anti-Spoofing
Shifeng Zhang, Xiaobo Wang, Ajian Liu, Chenxu Zhao, Jun Wan, Sergio Escalera, Hailin Shi, Zezheng Wang, Stan Z. Li


Face anti-spoofing is essential to prevent face recognition systems from security breaches. Much of the progress has been made possible by the availability of face anti-spoofing benchmark datasets in recent years. However, existing face anti-spoofing benchmarks have a limited number of subjects (
[dataset, fusion, recognition, video] [rgb, depth, cut, error, limited, hold] [face, proposed, figure, real, pad, image, spoofing, method, siw, replay, stan, printed, zhen, live, tpr, presentation, photo, acer, based, ebastien, abdenour, apcer, high, liveness, jukka, shifeng] [table, performance, number, rate, binary, deep, network, validation, squeeze, excitation, convolutional, capability, best, effectiveness] [attack, evaluation, fake, model, development, visual] [detection, three, feature, mask, person, aligned, roc] [data, training, testing, datasets, set, generalization, classification, existing, trained, select, learning, protocol, large, train]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Shifeng and Wang, Xiaobo and Liu, Ajian and Zhao, Chenxu and Wan, Jun and Escalera, Sergio and Shi, Hailin and Wang, Zezheng and Li, Stan Z.},
  title = {A Dataset and Benchmark for Large-Scale Multi-Modal Face Anti-Spoofing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Learning of Consensus Maximization for 3D Vision Problems
Thomas Probst, Danda Pani Paudel, Ajad Chhatkuli, Luc Van Gool


Consensus maximization is a key strategy in 3D vision for robust geometric model estimation from measurements with outliers. Generic methods for consensus maximization, such as Random Sample Consensus (RANSAC), have played a tremendous role in the success of 3D vision, in spite of the ubiquity of outliers. However, replicating the same generic behaviour in a deeply learned architecture, using supervised approaches, has proven to be difficult. In that context, unsupervised methods have a huge potential to adapt to any unseen data distribution, and therefore are highly desirable. In this paper, we propose for the first time an unsupervised learning framework for consensus maximization, in the context of solving 3D vision problems. For that purpose, we establish a relationship between inlier measurements, represented by the ideal of the inlier set, and the subspace of polynomials representing the space of target transformations. Using this relationship, we derive a constraint that must be satisfied by the sought inlier set. This constraint can be tested without knowing the transformation parameters, and therefore allows us to efficiently define the geometric model fitting cost. This model fitting cost is used as a supervisory signal for learning consensus maximization, where the learning process seeks the largest measurement set that minimizes the proposed model fitting cost. Using our method, we solve a diverse set of 3D vision problems, including 3D-3D matching, non-rigid 3D shape matching with piece-wise rigidity and image-to-image matching. Despite being unsupervised, our method outperforms RANSAC in all three tasks for several datasets.
[represented, recognition, framework, involves, time, signal, largest] [matrix, consensus, maximization, vision, inlier, fundamental, ideal, vandermonde, problem, point, outlier, computer, estimation, rigid, pattern, geometric, ransac, homography, robust, fitting, shape, body, linear, optimal, singular, polynomial, compute, note, define, approach, pose, inliers, corresponding, rotation, left, constraint, allows, matching] [transformation, method, ieee, figure, synthetic, image, conference, variety] [network, deep, architecture, number, performance, max, increasing] [model, basis, random, relationship, vector, existence, behaviour, representing, maximizing] [global, three] [set, learning, unsupervised, supervised, data, training, loss, classification, learn, sample, space, train, trained, subspace, minimizing, supervisory]
@InProceedings{Probst_2019_CVPR,
  author = {Probst, Thomas and Pani Paudel, Danda and Chhatkuli, Ajad and Van Gool, Luc},
  title = {Unsupervised Learning of Consensus Maximization for 3D Vision Problems},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
VizWiz-Priv: A Dataset for Recognizing the Presence and Purpose of Private Visual Information in Images Taken by Blind People
Danna Gurari, Qing Li, Chi Lin, Yinan Zhao, Anhong Guo, Abigale Stangl, Jeffrey P. Bigham


We introduce the first visual privacy dataset originating from people who are blind in order to better understand their privacy disclosures and to encourage the development of algorithms that can assist in preventing their unintended disclosures. It includes 8,862 regions showing private content across 5,537 images taken by blind people. Of these, 1,403 are paired with questions and 62% of those directly ask about the private content. Experiments demonstrate the utility of this data for predicting whether an image shows private information and whether a question asks about the private content in an image. The dataset is publicly-shared at http://vizwiz.org/data/.
[people, dataset, work, recognition, versus, social, predicting, showing, challenge] [computer, international, vision, well, directly, analysis, problem, algorithm] [image, content, blind, conference, figure, acm, ieee, user, removed, method, prior, frequency, personal] [order, number, sharing, mobile, performance, automatically, size, computing] [private, visual, privacy, question, text, asks, type, common, vispr, uncorrupted, answering, answer, pregnancy, asked, commonly, collection, reflect, finding, introduce, development] [taxonomy, object, annotation, three, public, context, annotated, person] [training, learn, test, observe, data, datasets, shared, share, existing, valuable, avoid, learning]
@InProceedings{Gurari_2019_CVPR,
  author = {Gurari, Danna and Li, Qing and Lin, Chi and Zhao, Yinan and Guo, Anhong and Stangl, Abigale and Bigham, Jeffrey P.},
  title = {VizWiz-Priv: A Dataset for Recognizing the Presence and Purpose of Private Visual Information in Images Taken by Blind People},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Structural Relational Reasoning of Point Clouds
Yueqi Duan, Yu Zheng, Jiwen Lu, Jie Zhou, Qi Tian


The symmetry of the corners of a box, the continuity of the surfaces of a monitor, the linkage between the torso and other body parts --- these suggest that 3D objects may have common and underlying inner relations between local structures, and reasoning about them is a fundamental ability of intelligent species. In this paper, we propose an effective plug-and-play module called the structural relation network (SRN) to reason about the structural dependencies of local regions in 3D point clouds. Existing network architectures on point sets such as PointNet++ capture local structures individually, without considering their inner interactions. Instead, our SRN simultaneously exploits local information by modeling their geometrical and locational relations, which play critical roles in how humans understand 3D objects. The proposed SRN module is simple, interpretable, and does not require any additional supervision signals, and it can be easily plugged into existing networks. Experimental results on benchmark datasets indicate promising boosts on the tasks of 3D point cloud classification and segmentation by capturing structural relations with the SRN module.
[structural, capture, key, dataset, understand, learns] [point, local, cloud, geometrical, scannet, pointnet, varying, leonidas, hao, shapenet, shape, simultaneously, additional, repetitive] [proposed, figure, based, play, input] [network, deep, performance, compared, structure, employ, neural, residual, table, designed, architecture, better, operation, equipped, highly] [relational, reason, reasoning, simple, model, ability, common, represent, understanding] [srn, feature, module, locational, segmentation, object, holistic, relation, final, utilize, cnn, ablation, grouping, visualization, benchmark] [classification, learning, existing, data, experimental, learned, datasets, exploit, test, observe, exploitation, task, china, representation]
@InProceedings{Duan_2019_CVPR,
  author = {Duan, Yueqi and Zheng, Yu and Lu, Jiwen and Zhou, Jie and Tian, Qi},
  title = {Structural Relational Reasoning of Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MVF-Net: Multi-View 3D Face Morphable Model Regression
Fanzi Wu, Linchao Bao, Yajing Chen, Yonggen Ling, Yibing Song, Songnan Li, King Ngi Ngan, Wei Liu


We address the problem of recovering the 3D geometry of a human face from a set of facial images in multiple views. While recent studies have shown impressive progress in 3D Morphable Model (3DMM) based facial reconstruction, the settings are mostly restricted to a single view. There is an inherent drawback in the single-view setting: the lack of reliable 3D constraints can cause unresolvable ambiguities. We in this paper explore 3DMM-based shape recovery in a different setting, where a set of multi-view facial images are given as input. A novel approach is proposed to regress 3DMM parameters from multi-view inputs with an end-to-end trainable Convolutional Neural Network (CNN). Multi-view geometric constraints are incorporated into the network by establishing dense correspondences between different views leveraging a novel self-supervised view alignment loss. The main ingredient of the view alignment loss is a differentiable dense optical flow estimator that can backpropagate the alignment errors between an input view and a synthetic rendering from another input view, which is projected to the target view through the 3D shape to be inferred. Through minimizing the view alignment loss, better 3D shapes can be recovered such that the synthetic projections from one view to another can better align with the observed image. Extensive experiments demonstrate the superiority of the proposed method over other 3DMM methods.
[flow, optical, dataset, work, recognition] [view, rendered, pose, photometric, reconstruction, lighting, note, projection, iab, regress, differentiable, error, compute, visibility, approach, geometric, micc, dense, initial, christian, shape, rendering, underlying, fitting, camera, textured, volume, computer, vision] [face, image, facial, input, proposed, texture, figure, morphable, synthetic, method, mofa, pixel, expression, real] [deep, neural, trainable, convolutional, network, order, effectiveness, table] [model, observed, evaluation, visual] [three, detailed, mask, predicted] [loss, alignment, training, set, trained, supervised, sampling, novel, shared]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Fanzi and Bao, Linchao and Chen, Yajing and Ling, Yonggen and Song, Yibing and Li, Songnan and Ngi Ngan, King and Liu, Wei},
  title = {MVF-Net: Multi-View 3D Face Morphable Model Regression},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Photometric Mesh Optimization for Video-Aligned 3D Object Reconstruction
Chen-Hsuan Lin, Oliver Wang, Bryan C. Russell, Eli Shechtman, Vladimir G. Kim, Matthew Fisher, Simon Lucey


In this paper, we address the problem of 3D object mesh reconstruction from RGB videos. Our approach combines the best of multi-view geometric and data-driven methods for 3D reconstruction by optimizing object meshes for multi-view photometric consistency while constraining mesh deformations with a shape prior. We pose this as a piecewise image alignment problem for each mesh face projection. Our approach allows us to update shape parameters from the photometric error without any depth or mask information. Moreover, we show how to avoid a degeneracy of zero photometric gradients via rasterizing from a virtual viewpoint. We demonstrate 3D object mesh reconstruction results from both synthetic and real-world videos with our photometric mesh optimization, which is unachievable with either naive mesh generation networks or traditional pipelines of surface reconstruction without heavy manual post-processing.
[sequence, prediction, multiple, recognition, explored] [mesh, atlasnet, photometric, shape, reconstruction, rgb, computer, error, optimization, point, vision, depth, pattern, note, coordinate, simon, problem, approach, optimizing, single, geometric, allows, dense, camera, lphoto, virtual, view, triangle, chair, international, surface, accurate, viewpoint, rasterization, plane, projection, geometry, active] [method, image, conference, figure, latent, ieee, noise, consistency, traditional, mapping, input, pixel, transform, reconstruct, prior, appearance, synthetic] [neural, optimize, table, network, deep] [perturbation, model, system, generation] [object, piecewise, mask] [code, learning, alignment, similarity, loss, function, training]
@InProceedings{Lin_2019_CVPR,
  author = {Lin, Chen-Hsuan and Wang, Oliver and Russell, Bryan C. and Shechtman, Eli and Kim, Vladimir G. and Fisher, Matthew and Lucey, Simon},
  title = {Photometric Mesh Optimization for Video-Aligned 3D Object Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Guided Stereo Matching
Matteo Poggi, Davide Pallotti, Fabio Tosi, Stefano Mattoccia


Stereo is a prominent technique to infer dense depth maps from images, and deep learning further pushed forward the state-of-the-art, making end-to-end architectures unrivaled when enough data is available for training. However, deep networks suffer from significant drops in accuracy when dealing with new environments. Therefore, in this paper, we introduce Guided Stereo Matching, a novel paradigm leveraging a small amount of sparse, yet reliable depth measurements retrieved from an external source to ameliorate this weakness. The additional sparse cues required by our method can be obtained with any strategy (e.g., a LiDAR) and are used to enhance features linked to the corresponding disparity hypotheses. Our formulation is general and fully differentiable, thus enabling the exploitation of the additional sparse inputs both in pre-trained deep stereo networks and when training a new instance from scratch. Extensive experiments on three standard datasets and two state-of-the-art deep architectures show that even with a small set of sparse input cues, i) the proposed paradigm enables significant improvements to pre-trained networks, ii) training from scratch notably increases accuracy and robustness to domain shifts, and iii) it is well suited and effective even with traditional stereo algorithms such as SGM.
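The enhancement step can be pictured as follows: for pixels where a reliable hint is available, responses along the disparity dimension are amplified around the hinted value. The NumPy sketch below assumes a simple Gaussian modulation with illustrative constants k and c; it is a schematic of the idea rather than the paper's exact formulation.

```python
# A minimal NumPy sketch of the guided-stereo idea: features along the disparity
# dimension are modulated by a Gaussian centred on an externally provided sparse
# hint, so hypotheses close to the reliable measurement are amplified. The
# constants k and c and the exact modulation are illustrative assumptions.
import numpy as np

def guide_cost_volume(volume, hints, valid, k=10.0, c=1.0):
    # volume: (H, W, D) matching features/costs over D disparity hypotheses
    # hints:  (H, W) sparse disparity measurements (e.g. projected LiDAR)
    # valid:  (H, W) boolean mask, True where a hint is available
    H, W, D = volume.shape
    d = np.arange(D, dtype=np.float32)                      # disparity hypotheses
    gauss = k * np.exp(-(d[None, None, :] - hints[..., None]) ** 2 / (2 * c ** 2))
    modulation = np.where(valid[..., None], gauss, 1.0)     # leave unguided pixels untouched
    return volume * modulation

vol = np.random.rand(8, 8, 64).astype(np.float32)
hints = np.full((8, 8), 20.0, dtype=np.float32)
valid = np.random.rand(8, 8) > 0.9                          # ~10% sparse hints
guided = guide_cost_volume(vol, hints, valid)
```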
[recognition, dataset, fusion] [stereo, depth, vision, computer, disparity, matching, kitti, error, iresnet, psmnet, pattern, middlebury, sgm, confidence, volume, international, accurate, technique, dense, additional, corresponding, sceneflow, stefano, notice, gij, matteo, leveraging] [conference, ieee, synthetic, proposed, enhancement, figure, based, method, input, traditional, guide, image, amount] [deep, sparse, network, cost, accuracy, table, vij, rate, scratch, correlation, small, standard] [evaluation, model, guiding, external] [feature, guided, proposal, average, baseline, carried, leverage, improve] [training, data, learning, trained, strategy, domain, datasets, large, conventional, exploit, test, set, train, experimental, paradigm]
@InProceedings{Poggi_2019_CVPR,
  author = {Poggi, Matteo and Pallotti, Davide and Tosi, Fabio and Mattoccia, Stefano},
  title = {Guided Stereo Matching},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Event-Based Learning of Optical Flow, Depth, and Egomotion
Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, Kostas Daniilidis


In this work, we propose a novel framework for unsupervised learning for event cameras that learns motion information from only the event stream. In particular, we propose an input representation of the events in the form of a discretized volume that maintains the temporal distribution of the events, which we pass through a neural network to predict the motion of the events. This motion is used to attempt to remove any motion blur in the event image. We then propose a loss function applied to the motion compensated event image that measures the motion blur in this image. We train two networks with this framework, one to predict optical flow, and one to predict egomotion and depths, and evaluate these networks on the Multi Vehicle Stereo Event Camera dataset, along with qualitative results from a variety of different scenes.
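A discretized event volume of this kind can be built by scattering each event's polarity into its two nearest temporal bins with linear weights, as in the NumPy sketch below; the bin count and normalization are illustrative assumptions.

```python
# A minimal NumPy sketch of a discretized event volume: each event's polarity is
# spread over the two nearest temporal bins with linear (triangular) weights, so
# the volume keeps the temporal distribution of the event stream. Bin count and
# normalisation are illustrative assumptions.
import numpy as np

def event_volume(xs, ys, ts, ps, H, W, num_bins=9):
    vol = np.zeros((num_bins, H, W), dtype=np.float32)
    # normalise timestamps to [0, num_bins - 1]
    t = (ts - ts.min()) / max(ts.max() - ts.min(), 1e-9) * (num_bins - 1)
    for b in range(num_bins):
        w = np.maximum(0.0, 1.0 - np.abs(b - t))            # temporal bilinear weight
        np.add.at(vol[b], (ys, xs), ps * w)                  # scatter-add per pixel
    return vol

# toy stream of 1000 events on a 64x64 sensor
n = 1000
xs = np.random.randint(0, 64, n); ys = np.random.randint(0, 64, n)
ts = np.sort(np.random.rand(n)); ps = np.random.choice([-1.0, 1.0], n)
vol = event_volume(xs, ys, ts, ps, 64, 64)                   # (9, 64, 64)
```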
[event, flow, motion, optical, egomotion, predict, time, temporal, aee, ecn, discretized, timestamp, tracking, work, flashing, frame, photoconsistency, spatiotemporal, timestamps, prediction, learns] [depth, outdoor, volume, camera, stereo, indoor, outlier, estimation, pose, census, computer, well, left, monodepth, ground, vision, provide, error, relative, pattern, corresponding, night, sfm, computed, truth, rotation] [image, grayscale, blur, input, pixel, ieee, conference, method, figure, quantitative, deblurred, based, traditional] [network, apply, neural, applied, number, scale, structure] [model, generated, evaluation, visual, generate] [average, predicted, propose, multi] [loss, unsupervised, learning, novel, learn, set, representation, function, train, training, generalize, trained, distribution]
@InProceedings{Zhu_2019_CVPR,
  author = {Zihao Zhu, Alex and Yuan, Liangzhe and Chaney, Kenneth and Daniilidis, Kostas},
  title = {Unsupervised Event-Based Learning of Optical Flow, Depth, and Egomotion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Modeling Local Geometric Structure of 3D Point Clouds Using Geo-CNN
Shiyi Lan, Ruichi Yu, Gang Yu, Larry S. Davis


Recent advances in deep convolutional neural networks (CNNs) have motivated researchers to adapt CNNs to directly model points in 3D point clouds. Modeling local structure has been proven to be important for the success of convolutional architectures, and researchers exploited the modeling of local point sets in the feature extraction hierarchy. However, limited attention has been paid to explicitly model the geometric structure amongst points in a local region. To address this problem, we propose Geo-CNN, which applies a generic convolution-like operation dubbed as GeoConv to each point and its local neighborhood. Local geometric relationships among points are captured when extracting edge features between the center and its neighboring points. We first decompose the edge feature extraction process onto three orthogonal bases, and then aggregate the extracted features based on the angles between the edge vector and the bases. This encourages the network to preserve the geometric structure in Euclidean space throughout the feature extraction hierarchy. GeoConv is a generic and efficient operation that can be easily integrated into 3D point cloud analysis pipelines for multiple applications. We evaluate Geo-CNN on ModelNet40 and KITTI and achieve state-of-the-art performance.
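The decomposition-and-aggregation step can be sketched as follows: each edge vector is projected onto six axis-aligned directions, each direction has its own weight matrix, and the per-direction responses are blended by the squared cosine of the angle between the edge and the direction. The PyTorch code below is an illustrative approximation, not the authors' implementation.

```python
# A minimal PyTorch sketch of the basis-decomposition idea behind GeoConv: the
# edge vector from a centre point to each neighbour is projected onto six
# axis-aligned directions (+/-x, +/-y, +/-z), each direction has its own weight
# matrix, and the per-direction responses are blended by the squared cosine of
# the angle between the edge and that direction. Dimensions are illustrative.
import torch
import torch.nn as nn

class GeoConvSketch(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.dir_weights = nn.Parameter(torch.randn(6, in_dim, out_dim) * 0.02)

    def forward(self, neigh_feats, edge_vecs):
        # neigh_feats: (N, K, C) features of K neighbours per centre point
        # edge_vecs:   (N, K, 3) neighbour xyz minus centre xyz
        bases = torch.cat([torch.eye(3), -torch.eye(3)], dim=0)       # (6, 3)
        comp = torch.einsum('nkd,bd->nkb', edge_vecs, bases)          # projections
        comp = comp.clamp(min=0.0)                                    # keep aligned directions only
        norm2 = (edge_vecs ** 2).sum(-1, keepdim=True).clamp(min=1e-9)
        angle_w = comp ** 2 / norm2                                   # squared cosine weights
        per_dir = torch.einsum('nkc,bcd->nkbd', neigh_feats, self.dir_weights)
        out = (angle_w.unsqueeze(-1) * per_dir).sum(dim=2)            # blend directions
        return out.max(dim=1).values                                  # aggregate over neighbours

pts_feats, edges = torch.randn(32, 16, 64), torch.randn(32, 16, 3)
out = GeoConvSketch(64, 128)(pts_feats, edges)                        # (32, 128)
```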
[recognition, modeling, extract, multiple, represented, construct, explicitly] [point, geoconv, geometric, local, computer, vision, cloud, pattern, frustum, directly, pointnet, shape, june, neighborhood, creduc, kitti, analysis, geometry, single, project] [conference, ieee, method, based, input, preserve] [structure, neural, network, layer, aggregate, performance, apply, deep, operation, convolutional, applied, weight, cout, number, cin, increasing, table, reduction, orthogonal, efficient, norm, variance] [model, vector, easily, evaluate] [feature, edge, object, detection, extraction, three, baseline, european, level, september, center, segmentation, global, neighboring, module] [learning, data, classification, learn, set, euclidean, large, augmentation, training, generic]
@InProceedings{Lan_2019_CVPR,
  author = {Lan, Shiyi and Yu, Ruichi and Yu, Gang and Davis, Larry S.},
  title = {Modeling Local Geometric Structure of 3D Point Clouds Using Geo-CNN},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
3D Point Capsule Networks
Yongheng Zhao, Tolga Birdal, Haowen Deng, Federico Tombari


In this paper, we propose 3D point-capsule networks, an auto-encoder designed to process sparse 3D point clouds while preserving spatial arrangements of the input data. 3D capsule networks arise as a direct consequence of our unified formulation of the common 3D auto-encoders. The dynamic routing scheme and the peculiar 2D latent space deployed by our capsule networks bring in improvements for several common point cloud-related tasks, such as object classification, object reconstruction and part segmentation as substantiated by our extensive evaluations. Moreover, it enables new applications such as part interpolation and replacement.
[capsule, dynamic, routing, multiple, dataset, state, specialize] [point, local, shape, computer, cloud, vision, reconstruction, foldingnet, single, pattern, chamfer, international, surface, atlasnet, pointnet, leonidas, geometric, volumetric, well, ppfnet, hotel, tolga, dimensional] [latent, input, conference, ieee, interpolation, proposed, figure, generative, based, reconstruct] [network, deep, better, processing, neural, mlp, convolutional, standard, size, replacement] [arxiv, preprint, vector, common, decoder, evaluation, random, primary, simple, explain] [feature, object, segmentation, grid, art, benchmark, extraction, propose] [learning, training, space, data, unsupervised, set, trained, classification, learn, target, representation, unified, source, transfer]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Yongheng and Birdal, Tolga and Deng, Haowen and Tombari, Federico},
  title = {3D Point Capsule Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
Buyu Li, Wanli Ouyang, Lu Sheng, Xingyu Zeng, Xiaogang Wang


We present an efficient 3D object detection framework based on a single RGB image in the scenario of autonomous driving. Our efforts focus on extracting the underlying 3D information in a 2D image and determining the accurate 3D bounding box of an object without point cloud or stereo data. Leveraging the off-the-shelf 2D object detector, we propose an artful approach to efficiently obtain a coarse cuboid for each predicted 2D box. The coarse cuboid has enough accuracy to guide us in determining the 3D box of the object by refinement. In contrast to previous state-of-the-art methods that only use the features extracted from the 2D bounding box for box refinement, we explore the 3D structure information of the object by employing the visual features of visible surfaces. The new features from surfaces are utilized to eliminate the problem of representation ambiguity brought by only using a 2D bounding box. Moreover, we investigate different methods of 3D box refinement and discover that a classification formulation with a quality-aware loss has much better performance than regression. Evaluated on the KITTI benchmark, our approach outperforms current state-of-the-art methods for single RGB image based 3D object detection.
[recognition, framework, determine, previous, utilized, predict] [surface, computer, vision, orientation, pattern, visible, kitti, monocular, rgb, point, stereo, problem, projection, observation, autonomous, single, cuboid, formulation, projected, coordinate, angle, accurate, camera, cloud, corresponding, approach, reliable, well, rotation] [based, image, method, ieee, conference, quality, figure, extracted, comparison] [better, top, residual, size, accuracy, basic, table, efficiently, called] [model] [box, object, detection, feature, bounding, guidance, center, refinement, bottom, aware, regression, extra, evaluated, cnn, category, moderate, easy, xiaogang, predicted, location, region, subnet] [classification, data, loss, training, class, target, set, hard, label, metric, learning]
@InProceedings{Li_2019_CVPR,
  author = {Li, Buyu and Ouyang, Wanli and Sheng, Lu and Zeng, Xingyu and Wang, Xiaogang},
  title = {GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Single-Image Piece-Wise Planar 3D Reconstruction via Associative Embedding
Zehao Yu, Jia Zheng, Dongze Lian, Zihan Zhou, Shenghua Gao


Single-image piece-wise planar 3D reconstruction aims to simultaneously segment plane instances and recover 3D plane parameters from an image. Most recent approaches leverage convolutional neural networks (CNNs) and achieve promising results. However, these methods are limited to detecting a fixed number of planes with a certain learned order. To tackle this problem, we propose a novel two-stage method based on associative embedding, inspired by its recent success in instance segmentation. In the first stage, we train a CNN to map each pixel to an embedding space where pixels from the same plane instance have similar embeddings. Then, the plane instances are obtained by grouping the embedding vectors in planar regions via an efficient mean shift clustering algorithm. In the second stage, we estimate the parameter for each plane instance by considering both pixel-level and instance-level consistencies. With the proposed method, we are able to detect an arbitrary number of planes. Extensive experiments on public datasets validate the effectiveness and efficiency of our method. Furthermore, our method runs at 30 fps at testing time and could thus facilitate many real-time applications such as visual SLAM and human-robot interaction. Code is available at https://github.com/svip-lab/PlanarReconstruction.
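The grouping half of the first stage, turning per-pixel embeddings into plane instances, can be sketched with an off-the-shelf mean shift implementation; the embedding dimension and bandwidth below are illustrative, and the paper uses a faster custom mean shift variant.

```python
# A minimal sketch of the grouping stage: per-pixel embeddings predicted by a
# CNN are clustered with mean shift so that each cluster becomes one plane
# instance. The 2-D embedding dimension and bandwidth are illustrative
# assumptions; the paper's own clustering is a faster, custom variant.
import numpy as np
from sklearn.cluster import MeanShift

def group_plane_instances(embeddings, planar_mask, bandwidth=0.5):
    # embeddings:  (H, W, E) per-pixel embedding map
    # planar_mask: (H, W) boolean map of pixels predicted as planar
    H, W, E = embeddings.shape
    flat = embeddings[planar_mask]                       # (M, E) planar pixels only
    labels = MeanShift(bandwidth=bandwidth).fit_predict(flat)
    instance_map = np.full((H, W), -1, dtype=np.int32)   # -1 = non-planar
    instance_map[planar_mask] = labels
    return instance_map

emb = np.random.rand(32, 32, 2).astype(np.float32)
mask = np.random.rand(32, 32) > 0.5
instances = group_plane_instances(emb, mask)
```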
[predict, prediction, dataset, second, work, jia, tackle, considering] [plane, depth, planar, algorithm, ground, single, truth, geometric, reconstruction, scannet, planenet, problem, approach, estimate, manhattan, form, surface, directly, indoor, local, estimation, supplementary] [method, pixel, image, figure, based, proposed, input] [parameter, shift, number, network, efficient, group, performance, table, deep, small, better, convolutional, fixed] [model, generate, infer, vector] [instance, segmentation, map, propose, detection, associative, branch, semantic, detect, mask, predicted, segment, detecting, stage, cnn, grouping] [embedding, clustering, loss, embeddings, cluster, set, distance, train, existing, training, novel, space, learning, learn]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Zehao and Zheng, Jia and Lian, Dongze and Zhou, Zihan and Gao, Shenghua},
  title = {Single-Image Piece-Wise Planar 3D Reconstruction via Associative Embedding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
3DN: 3D Deformation Network
Weiyue Wang, Duygu Ceylan, Radomir Mech, Ulrich Neumann


Applications in virtual and augmented reality create a demand for rapid creation and easy access to large sets of 3D models. An effective way to address this demand is to edit or deform existing 3D models based on a reference, e.g., a 2D image which is very easy to acquire. Given such a source 3D model and a target which can be a 2D image, 3D model, or a point cloud acquired as a depth scan, we introduce 3DN, an end-to-end network that deforms the source model to resemble the target. Our method infers per-vertex offset displacements while keeping the mesh connectivity of the source model fixed. We present a training strategy which uses a novel differentiable operation, the mesh sampling operator, to generalize our method across source and target models with varying mesh densities. The mesh sampling operator can be seamlessly integrated into the network to handle meshes with different topologies. Qualitative and quantitative results show that our method generates higher quality results compared to the state-of-the-art learning-based methods for 3D shape generation.
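A common way to supervise such per-vertex offsets (the paper combines several losses) is a symmetric Chamfer distance between the deformed vertices and the target point set, as in the NumPy sketch below; the brute-force distance computation is purely illustrative.

```python
# A minimal NumPy sketch of the deformation objective: the network's per-vertex
# offsets are added to the source mesh vertices while connectivity stays fixed,
# and the deformed vertices are compared to target points with a symmetric
# Chamfer distance. The brute-force distance computation is purely illustrative.
import numpy as np

def chamfer_distance(a, b):
    # a: (N, 3), b: (M, 3) point sets
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def deformation_loss(src_vertices, offsets, target_points):
    deformed = src_vertices + offsets            # mesh topology is untouched
    return chamfer_distance(deformed, target_points)

src = np.random.rand(500, 3); off = 0.01 * np.random.randn(500, 3)
tgt = np.random.rand(800, 3)
loss = deformation_loss(src, off, tgt)
```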
[emd, previous, work, perform] [mesh, point, shape, deformation, cloud, reconstruction, atlasnet, surface, differentiable, single, chamfer, ffd, topology, template, local, laplacian, compute, approach, pointnet, note, chair, shapenet, deform, vertex, predicts, symmetry, permutation, provide, rendered, deforms, ground, truth] [method, figure, image, based, input, quantitative, earth, qualitative, editing, preserve, quality, row, real] [network, deep, output, original, architecture, table, compare, operator, higher, fixed] [model, sampled, decoder, generate, generation, vector, generating, access] [global, offset, feature, object, propose, utilize] [source, target, deformed, set, learning, loss, sampling, sample, large, representation, similarity, distance]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Weiyue and Ceylan, Duygu and Mech, Radomir and Neumann, Ulrich},
  title = {3DN: 3D Deformation Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
HorizonNet: Learning Room Layout With 1D Representation and Pano Stretch Data Augmentation
Cheng Sun, Chi-Wei Hsiao, Min Sun, Hwann-Tzong Chen


We present a new approach to the problem of estimating the 3D room layout from a single panoramic image. We represent the room layout as three 1D vectors that encode, at each image column, the boundary positions of floor-wall and ceiling-wall, and the existence of a wall-wall boundary. The proposed network, HorizonNet, trained for predicting 1D layout, outperforms previous state-of-the-art approaches. The designed post-processing procedure for recovering 3D room layouts from 1D predictions can automatically infer the room shape with low computation cost: it takes less than 20 ms for a panorama image, while prior works might need dozens of seconds. We also propose Pano Stretch Data Augmentation, which can diversify panorama data and be applied to other panorama-related learning tasks. Due to the limited data available for non-cuboid layouts, we relabel 65 general layouts from the current dataset for fine-tuning. Our approach shows good performance on general layouts in both qualitative results and cross-validation.
[rnn, dataset, prediction, predict, work, capture, outperforms, previous] [computer, vision, stretch, pattern, corner, pano, column, single, panoramic, geometric, manhattan, dense, panorama, indoor, approach, layoutnet, cuboid, scene, panocontext, estimation, floor, ceiling, ground, recovering, general, perspective, cfl, camera, occluded, international, problem, estimating] [image, conference, ieee, proposed, method, figure, quantitative, pixel, based, result] [deep, output, network, table, neural, size, efficient, equirectangular, computation, orthogonal] [room, model, procedure, probability, existence] [layout, boundary, three, object, semantic, aligned, map, spatial] [data, augmentation, representation, training, learning, train, trained]
@InProceedings{Sun_2019_CVPR,
  author = {Sun, Cheng and Hsiao, Chi-Wei and Sun, Min and Chen, Hwann-Tzong},
  title = {HorizonNet: Learning Room Layout With 1D Representation and Pano Stretch Data Augmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Fitting Degree Scoring Network for Monocular 3D Object Detection
Lijie Liu, Jiwen Lu, Chunjing Xu, Qi Tian, Jie Zhou


In this paper, we propose to learn a deep fitting degree scoring network for monocular 3D object detection, which aims to conclusively score the fitting degree between 3D proposals and the object. Different from most existing monocular frameworks, which use tight constraints to obtain the 3D location, our approach achieves high-precision localization by measuring the visual fitting degree between the projected 3D proposals and the object. We first regress the dimension and orientation of the object using an anchor-based method so that a suitable 3D proposal can be constructed. We propose FQNet, which can infer the 3D IoU between the 3D proposals and the object solely based on 2D cues. Therefore, during the detection process, we sample a large number of candidates in the 3D space and project these 3D bounding boxes onto the 2D image individually. The best candidate can be picked out by simply exploring the spatial overlap between proposals and the object, in the form of the output 3D IoU score of FQNet. Experiments on the KITTI dataset demonstrate the effectiveness of our framework.
[perform, dataset] [orientation, monocular, estimation, kitti, regress, error, cuboid, fitting, local, degree, tight, autonomous, ground, projected, projection, camera, constraint, problem, dense, approach, pipeline, well, range, confidence, truth, view] [method, image, based, figure, proposed, input, appearance, result, patch] [deep, network, convolutional, table, number, output, overlap, neural, achieve, architecture, validation] [arxiv, preprint, infer, sensitive] [object, detection, iou, regression, location, bounding, anchor, module, box, fqnet, spatial, average, global, vehicle, localization, fully, easy, moderate, propose, proposal, relation, three, car] [dimension, training, learning, set, sampling, loss, large, classification, existing, data, trained, function, hard]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Lijie and Lu, Jiwen and Xu, Chunjing and Tian, Qi and Zhou, Jie},
  title = {Deep Fitting Degree Scoring Network for Monocular 3D Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering
Seungryul Baek, Kwang In Kim, Tae-Kyun Kim


Estimating 3D hand meshes from single RGB images is challenging, due to intrinsic 2D-3D mapping ambiguities and limited training data. We adopt a compact parametric 3D hand model that represents deformable and articulated hand meshes. To achieve the model fitting to RGB images, we investigate and contribute in three ways: 1) Neural rendering: inspired by recent work on human body, our hand mesh estimator (HME) is implemented by a neural network and a differentiable renderer, supervised by 2D segmentation masks and 3D skeletons. HME demonstrates good performance for estimating diverse hand shapes and improves pose estimation accuracies. 2) Iterative testing refinement: Our fitting function is differentiable. We iteratively refine the initial estimate using the gradients, in the spirit of iterative model fitting methods like ICP. The idea is supported by the latest research on human body. 3) Self-data augmentation: collecting sized RGB-mesh (or segmentation mask)-skeleton triplets for training is a big hurdle. Once the model is successfully fitted to input RGB images, its meshes i.e. shapes and articulations, are realistic, and we augment view-points on top of estimated dense hand poses. Experiments using three RGB-based benchmarks show that our framework offers beyond state-of-the-art accuracy in 3D pose estimation, as well as recovers dense 3D hand shapes. Each technical component above meaningfully improves the accuracy in the ablation study.
[skeleton, human, joint, work, framework, dataset, tracking] [hand, pose, mesh, estimation, rgb, shape, depth, dense, estimator, skeletal, estimating, dhpe, algorithm, estimated, single, hme, corresponding, proj, differentiable, estimate, camera, error, initial, recovering, mano, ren, fitting, articulation, well, approach] [input, based, figure, image, proposed, reg, method, intermediate, refiner, mapping, generative] [neural, accuracy, deep, performance, network, parameter, full, output, convolutional] [model, evidence, iterative, generate, renderer] [segmentation, foreground, improves, refinement, feature, mask, three, supervision] [training, learning, data, testing, loss, set, lsh]
@InProceedings{Baek_2019_CVPR,
  author = {Baek, Seungryul and In Kim, Kwang and Kim, Tae-Kyun},
  title = {Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Self-Supervised Learning of 3D Human Pose Using Multi-View Geometry
Muhammed Kocabas, Salih Karagoz, Emre Akbas


Training accurate 3D human pose estimators requires large amount of 3D ground-truth data which is costly to collect. Various weakly or self supervised pose estimation methods have been proposed due to lack of 3D data. Nevertheless, these methods, in addition to 2D ground-truth poses, require either additional supervision in various forms (e.g. unpaired 3D ground truth data, a small subset of labels) or the camera parameters in multiview settings. To address these problems, we present EpipolarPose, a self-supervised learning method for 3D human pose estimation, which does not need any 3D ground-truth data or camera extrinsics. During training, EpipolarPose estimates 2D poses from multi-view images, and then, utilizes epipolar geometry to obtain a 3D pose and camera geometry which are subsequently used to train a 3D pose estimator. We demonstrate the effectiveness of our approach on standard benchmark datasets (i.e. Human3.6M and MPI-INF-3DHP) where we set the new state-of-the-art among weakly/self-supervised methods. Furthermore, we propose a new performance measure Pose Structure Score (PSS) which is a scale invariant, structure aware measure to evaluate the structural plausibility of a pose with respect to its ground truth. Code and pretrained models are available at https://github.com/mkocabas/EpipolarPose
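The core self-supervision signal, obtaining a 3D pose by triangulating 2D detections from two views, can be sketched with the standard DLT construction; the projection matrices below are illustrative placeholders.

```python
# A minimal NumPy sketch of the triangulation step: 2-D joint detections from
# two views plus the corresponding projection matrices give a 3-D joint via the
# standard DLT construction; doing this for every joint yields the 3-D pose that
# then serves as a training signal. Matrices here are illustrative assumptions.
import numpy as np

def triangulate_joint(P1, P2, x1, x2):
    # P1, P2: (3, 4) camera projection matrices; x1, x2: (2,) pixel coordinates
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]                       # de-homogenise

def triangulate_pose(P1, P2, pose2d_view1, pose2d_view2):
    # pose2d_view*: (J, 2) 2-D joints estimated independently in each view
    return np.stack([triangulate_joint(P1, P2, a, b)
                     for a, b in zip(pose2d_view1, pose2d_view2)])

P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])
joints3d = triangulate_pose(P1, P2, np.random.rand(17, 2), np.random.rand(17, 2))
```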
[human, structural, joint, mpii, subject] [pose, estimation, computer, ground, camera, truth, vision, mpjpe, pattern, epipolarpose, rhodin, geometry, pck, triangulation, nmpjpe, international, epipolar, single, pavlakos, keypoints, require, respect, note, body, depth, monocular, pmpjpe, keypoint, pictorial, drover, pipeline, problem, estimator] [conference, image, ieee, method, figure, reference, input, proposed, unpaired, produce] [performance, network, deep, structure, neural, inference, table, better, small, scale, full, denotes, unit] [model, machine] [supervision, score, branch, utilize, refinement, european, detector, fully] [training, supervised, learning, data, trained, set, train, measure, datasets, upper, large, loss]
@InProceedings{Kocabas_2019_CVPR,
  author = {Kocabas, Muhammed and Karagoz, Salih and Akbas, Emre},
  title = {Self-Supervised Learning of 3D Human Pose Using Multi-View Geometry},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation From a Single Image
Tsun-Yi Yang, Yi-Ting Chen, Yen-Yu Lin, Yung-Yu Chuang


This paper proposes a method for head pose estimation from a single image. Previous methods often predict head poses through landmark or depth estimation and would require more computation than necessary. Our method is based on regression and feature aggregation. To obtain a compact model, we employ the soft stagewise regression scheme. Existing feature aggregation methods treat inputs as a bag of features and thus ignore their spatial relationship in a feature map. We propose to learn a fine-grained structure mapping for spatially grouping features before aggregation. The fine-grained structure provides part-based information and pooled values. By utilizing learnable and non-learnable importance over the spatial locations, different model variants can be generated and form a complementary ensemble. Experiments show that our method outperforms the state-of-the-art methods, including both the landmark-free ones and the ones based on landmark or depth estimation. With only a single RGB frame as input, our method even outperforms methods utilizing multi-modality information (RGB-D, RGB-Time) in estimating the yaw angle. Furthermore, the memory overhead of our model is 100 times smaller than that of previous methods.
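A stripped-down version of soft stagewise regression is sketched below: each stage outputs a distribution over a few coarse bins and contributes its expected bin index at a progressively finer granularity. Stage and bin counts are illustrative, and the full scheme additionally learns per-stage shifts and scales.

```python
# A minimal NumPy sketch of expectation-based ("soft") regression in a stagewise
# scheme: each stage outputs a softmax over a few coarse bins, contributes its
# probability-weighted bin index, and later stages refine at a finer
# granularity. Stage widths and bin counts are illustrative; the full scheme
# also learns per-stage shifts/scales to cover the whole range exactly.
import numpy as np

def soft_stagewise_regression(stage_probs, lo=-90.0, hi=90.0):
    # stage_probs: list of (s_k,) softmax vectors, one per stage (coarse to fine)
    width = hi - lo
    y = lo
    for probs in stage_probs:
        s = len(probs)
        width = width / s                              # each stage works at a finer granularity
        y += float(np.dot(probs, np.arange(s))) * width  # soft (expected) bin index
    return y

# two stages of 3 bins each over the yaw range [-90, 90] degrees
stage1 = np.array([0.1, 0.8, 0.1]); stage2 = np.array([0.2, 0.5, 0.3])
yaw = soft_stagewise_regression([stage1, stage2])
```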
[recognition, dataset, yaw, capsule, joint, previous, stagewise, multiple, time, fusion, temporal] [pose, estimation, computer, vision, pattern, international, single, depth, problem, roll, rgb, analysis] [conference, face, method, facial, figure, proposed, biwi, image, mapping, landmark, age, pitch, real, hopenet, row, ieee] [aggregation, structure, size, table, compact, neural, deep, learnable, better] [model, attention, vector, find] [feature, head, regression, map, spatial, scoring, stage, detection, three, module, mae, european, complementary, utilize] [training, representative, set, learning, function, alignment, soft, datasets, classification, testing, protocol, large, trained]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Tsun-Yi and Chen, Yi-Ting and Lin, Yen-Yu and Chuang, Yung-Yu},
  title = {FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation From a Single Image},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dense 3D Face Decoding Over 2500FPS: Joint Texture & Shape Convolutional Mesh Decoders
Yuxiang Zhou, Jiankang Deng, Irene Kotsia, Stefanos Zafeiriou


3D Morphable Models (3DMMs) are statistical models that represent facial texture and shape variations using a set of linear bases and, in particular, Principal Component Analysis (PCA). 3DMMs were used as statistical priors for reconstructing 3D faces from images by solving non-linear least-squares optimization problems. Recently, 3DMMs were used as generative models for training non-linear mappings (i.e., regressors) from image to the parameters of the models via Deep Convolutional Neural Networks (DCNNs). Nevertheless, all of the above methods use either fully connected layers or 2D convolutions on parametric unwrapped UV spaces, leading to large networks with many parameters. In this paper, we present the first, to the best of our knowledge, non-linear 3DMMs by learning joint texture and shape auto-encoders using direct mesh convolutions. We demonstrate how these auto-encoders can be used to train very light-weight models that perform Coloured Mesh Decoding (CMD) in-the-wild at a speed of over 2500 FPS.
[dataset, joint, graph, jointly] [mesh, shape, reconstruction, linear, error, directly, fitting, michael, normalized, geometric, single, differentiable, matrix, position, robust, dense, direct, defined, chebyshev, approach] [face, coloured, texture, method, facial, morphable, image, proposed, prnet, stefanos, pca, vrn, figure, sela, expression, mofa, input, statistical, reconstruct, landmark, florence, comparison, based] [convolutional, size, deep, network, performance, convolution, neural, compared, compare, connected, compact, output, running, comparable, order, better] [model, decoder, decoding, represent, encoder] [regression, map, fully] [learning, trained, embedding, large, training, alignment, space, data]
@InProceedings{Zhou_2019_CVPR,
  author = {Zhou, Yuxiang and Deng, Jiankang and Kotsia, Irene and Zafeiriou, Stefanos},
  title = {Dense 3D Face Decoding Over 2500FPS: Joint Texture & Shape Convolutional Mesh Decoders},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Does Learning Specific Features for Related Parts Help Human Pose Estimation?
Wei Tang, Ying Wu


Human pose estimation (HPE) is inherently a homogeneous multi-task learning problem, with the localization of each body part as a different task. Recent HPE approaches universally learn a shared representation for all parts, from which their locations are linearly regressed. However, our statistical analysis indicates not all parts are related to each other. As a result, such a sharing mechanism can lead to negative transfer and deteriorate the performance. This potential issue drives us to raise an interesting question. Can we identify related parts and learn specific features for them to improve pose estimation? Since unrelated tasks no longer share a high-level representation, we expect to avoid the adverse effect of negative transfer. In addition, more explicit structural knowledge, e.g., ankles and knees are highly related, is incorporated into the model, which helps resolve ambiguities in HPE. To answer this question, we first propose a data-driven approach to group related parts based on how much information they share. Then a part-based branching network (PBN) is introduced to learn representations specific to each part group. We further present a multi-stage version of this network to repeatedly refine intermediate features and pose estimates. Ablation experiments indicate learning specific features significantly improves the localization of occluded parts and thus benefits HPE. Our approach also outperforms all state-of-the-art methods on two benchmark datasets, with an outstanding advantage when occlusion occurs.
[human, mpii, dataset, knee, outperforms, previous, focus, mtl] [pose, body, approach, estimation, heat, left, occluded, hip, lsp, hourglass, linear, normalized, problem, general, provide] [study, input, based, figure, landmark, image, statistical, intermediate, result] [network, deep, group, residual, convolutional, neural, number, validation, sharing, wei, branching, table, better, applied, block] [model, identify, random] [feature, hpe, spatial, localization, ablation, head, map, benchmark, improve, fully, refine, improves, person, location] [specific, learning, learn, shared, representation, mutual, training, set, testing, negative, share, exploit, unrelated, learned]
@InProceedings{Tang_2019_CVPR,
  author = {Tang, Wei and Wu, Ying},
  title = {Does Learning Specific Features for Related Parts Help Human Pose Estimation?},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Linkage Based Face Clustering via Graph Convolution Network
Zhongdao Wang, Liang Zheng, Yali Li, Shengjin Wang


In this paper, we present an accurate and scalable approach to the face clustering task. We aim at grouping a set of faces by their potential identities. We formulate this task as a link prediction problem: a link exists between two faces if they are of the same identity. The key idea is that the local context in the feature space around an instance (face) contains rich information about the linkage relationship between this instance and its neighbors. By constructing sub-graphs around each instance as input data, which depict the local context, we utilize the graph convolution network (GCN) to perform reasoning and infer the likelihood of linkage between pairs in the sub-graphs. Experiments show that our method is more robust to the complex distribution of faces than conventional methods, yields results favorably comparable to state-of-the-art methods on standard face clustering benchmarks, and is scalable to large datasets. Furthermore, we show that the proposed method does not need the number of clusters as a prior, is aware of noise and outliers, and can be extended to a multi-view version for higher clustering accuracy.
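The last step of such a pipeline, turning pairwise linkage scores into clusters, can be sketched as transitive merging of high-probability links with union-find; the threshold and the candidate edges below are illustrative, and in the paper the scores come from the GCN run on per-instance sub-graphs.

```python
# A minimal sketch of the final grouping step in link-prediction-based
# clustering: once a model (here, whatever produced `link_prob`) has scored the
# likelihood that pairs of faces share an identity, links above a threshold are
# merged transitively with union-find, and each connected component becomes one
# cluster. Threshold and inputs are illustrative assumptions.
import numpy as np

def cluster_from_links(num_faces, pairs, link_prob, threshold=0.8):
    parent = list(range(num_faces))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x
    for (i, j), p in zip(pairs, link_prob):
        if p >= threshold:
            parent[find(i)] = find(j)          # merge the two identities
    roots = [find(i) for i in range(num_faces)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels

pairs = [(0, 1), (1, 2), (3, 4), (2, 3)]
probs = [0.95, 0.90, 0.85, 0.10]               # e.g. GCN outputs on candidate k-NN edges
labels = cluster_from_links(5, pairs, probs)   # faces {0,1,2} and {3,4} form two clusters
```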
[graph, link, linked, prediction, gcn, complex, dataset, predicting, perform, construct, audio, key, predict, framework] [local, matrix, approach, corresponding, algorithm, problem, directly] [face, method, based, proposed, spectral, input, comparison, figure, presented, ieee] [convolution, number, performance, network, aggregation, table, accuracy, weighted, neural, parameter, approximate, apply, experiment, scalable] [node, find, subgraph, attention] [instance, pivot, feature, three, propose, context, adopt] [clustering, linkage, data, nmi, likelihood, set, nearest, singleton, ahc, aro, positive, training, negative, distribution, neighbor, large, dbscan, proportion, learn, upper, bcubed, hyperparameters, idea, learning, vary]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Zhongdao and Zheng, Liang and Li, Yali and Wang, Shengjin},
  title = {Linkage Based Face Clustering via Graph Convolution Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Towards High-Fidelity Nonlinear 3D Face Morphable Model
Luan Tran, Feng Liu, Xiaoming Liu


Embedding 3D morphable basis functions into deep neural networks opens great potential for models with better representation power. However, to faithfully learn those models from an image collection, it requires strong regularization to overcome ambiguities involved in the learning process. This critically prevents us from learning high-fidelity face models which are needed to represent face images in high levels of detail. To address this problem, this paper presents a novel approach to learn additional proxies as a means to side-step strong regularizations, as well as leverages to promote detailed shape/albedo. To ease the learning, we also propose to use a dual-pathway network, a carefully-designed architecture that brings a balance between global and local-based models. By improving the nonlinear 3D morphable model in both learning objective and network architecture, we present a model which is superior in capturing higher levels of detail than the linear or its precedent nonlinear counterparts. As a result, our model achieves state-of-the-art performance on 3D face reconstruction by solely optimizing latent representations.
[liu, work, capture, modeling] [reconstruction, shape, albedo, linear, well, local, computer, monocular, fitting, michael, christian, single, geometry, directly, mesh, estimated, surface, error] [face, facial, nonlinear, image, input, proposed, figure, tran, morphable, high, texture, lrec, acm, xiaoming, tewari, based, latent, luan, faithfully, recover, shading, pairing, reconstruct, pablo, fidelity, reconstructed, quality, blanz, difference, suv] [network, better, residual, deep, neural, regularization, power, architecture, small, original] [model, represent, strong, step] [global, level, propose, detailed, feature, pathway, final] [learning, loss, learn, representation, objective, proxy, space, distance, novel, overcome, learned, large, alignment]
@InProceedings{Tran_2019_CVPR,
  author = {Tran, Luan and Liu, Feng and Liu, Xiaoming},
  title = {Towards High-Fidelity Nonlinear 3D Face Morphable Model},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
RegularFace: Deep Face Recognition via Exclusive Regularization
Kai Zhao, Jingyi Xu, Ming-Ming Cheng


We consider the face recognition task where facial images of the same identity (person) are expected to be closer in the representation space, while those of different identities are far apart. Several recent studies encourage intra-class compactness by developing loss functions that penalize the variance of representations of the same identity. In this paper, we propose the `exclusive regularization' that focuses on the other aspect of discriminability, the inter-class separability, which is neglected in many recent approaches. The proposed method, named RegularFace, explicitly distances identities by penalizing the angle between an identity and its nearest neighbor, resulting in discriminative face representations. Our method has an intuitive geometric interpretation and presents unique benefits that are absent in previous works. Quantitative comparisons against prior methods on several open benchmarks demonstrate the superiority of our method. In addition, our method is easy to implement and requires only a few lines of python code on modern deep learning frameworks.
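A minimal sketch of an exclusive-regularization-style term is shown below: the classifier's class weight vectors are L2-normalized and each class is penalized by its largest cosine similarity to any other class. How this term is weighted against the softmax loss is an assumption here.

```python
# A minimal PyTorch sketch of an inter-class separability regulariser in the
# spirit of the paper: class weight vectors of the classification layer are
# L2-normalised and each class is penalised by its largest cosine similarity to
# any other class, pushing nearest identities apart. The weighting of this term
# against the softmax loss is an illustrative assumption.
import torch
import torch.nn.functional as F

def exclusive_regularization(class_weights):
    # class_weights: (num_classes, feat_dim) weights of the last fully connected layer
    w = F.normalize(class_weights, dim=1)
    cos = w @ w.t() - 2.0 * torch.eye(w.size(0))          # push self-similarity below any real cosine
    return cos.max(dim=1).values.mean()                   # mean nearest-neighbour cosine over classes

weights = torch.randn(1000, 512, requires_grad=True)
reg = exclusive_regularization(weights)                   # add lambda * reg to the softmax loss
reg.backward()
```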
[recognition, dataset, term, perform, explicitly, joint] [pattern, angle, sphere] [face, proposed, identity, method, ieee, facial, image, figure, based, input] [regularization, performance, deep, original, weight, convolutional, verification, gradient, neural, represents, separable, compact, small, accuracy, architecture] [model, conf, decision, vector, expected] [center, feature, propose, improve, penalizes, identification] [loss, softmax, exclusive, angular, margin, representation, sphereface, cluster, distance, embedding, comput, learning, compactness, lfw, training, large, discriminative, testing, separability, classification, megaface, euclidean, train, embeddings, mnist, ytf, space, function, regularface, trained, annealing, nearest, belonging, existing, learn, angularly, hypersphere, datasets, set, discriminability]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Kai and Xu, Jingyi and Cheng, Ming-Ming},
  title = {RegularFace: Deep Face Recognition via Exclusive Regularization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
BridgeNet: A Continuity-Aware Probabilistic Network for Age Estimation
Wanhua Li, Jiwen Lu, Jianjiang Feng, Chunjing Xu, Jie Zhou, Qi Tian


Age estimation is an important yet very challenging problem in computer vision. Existing methods for age estimation usually apply a divide-and-conquer strategy to deal with heterogeneous data caused by the non-stationary aging process. However, the facial aging process is also a continuous process, and the continuity relationship between different components has not been effectively exploited. In this paper, we propose BridgeNet for age estimation, which aims to mine the continuous relation between age labels effectively. The proposed BridgeNet consists of local regressors and gating networks. Local regressors partition the data space into multiple overlapping subspaces to tackle heterogeneous data and gating networks learn continuity aware weights for the results of local regressors by employing the proposed bridge-tree structure, which introduces bridge connections into tree models to enforce the similarity between neighbor nodes. Moreover, these two components of BridgeNet can be jointly learned in an end-to-end way. We show experimental results on the MORPH II, FG-NET and Chalearn LAP 2015 datasets and find that BridgeNet outperforms the state-of-the-art methods.
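The gating-plus-local-regressors structure can be sketched as a simple mixture: a gating branch produces normalized weights over K local regressors and the prediction is their weighted sum. The sketch below omits the bridge-tree that smooths neighboring gates; all sizes are illustrative.

```python
# A minimal PyTorch sketch of the mixture idea: a gating branch produces
# normalised weights over several local regressors, each responsible for an
# overlapping range of ages, and the prediction is the weighted sum of their
# outputs. The bridge-tree structure that smooths neighbouring gates is omitted;
# all sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class GatedLocalRegressors(nn.Module):
    def __init__(self, feat_dim=256, num_regressors=8):
        super().__init__()
        self.gate = nn.Linear(feat_dim, num_regressors)          # gating network
        self.regressors = nn.Linear(feat_dim, num_regressors)    # one scalar output per local regressor

    def forward(self, feats):
        weights = torch.softmax(self.gate(feats), dim=1)          # (B, K) continuity-aware weights
        local_preds = self.regressors(feats)                      # (B, K) per-regressor age estimates
        return (weights * local_preds).sum(dim=1)                 # (B,) final age

feats = torch.randn(4, 256)                                       # backbone CNN features
ages = GatedLocalRegressors()(feats)
```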
[gating, regressors, leaf, bridgenet, bridge, dataset, continuity, human, triple, heterogeneous, chalearn, lap, adjacent, multiple, merged, apparent, outperforms, build] [local, estimation, continuous, problem, error, computer, single, depth] [age, morph, method, proposed, based, facial, face, aging, figure, image, result, input] [binary, deep, layer, performance, table, network, neural, number, process, structure, better, connection, grant, dex, architecture] [node, tree, decision, relationship, probability, child, model, random] [regression, mae, cnn, three, edge] [data, learning, training, classification, regressor, function, set, datasets, similarity, softmax, setting, probabilistic, space, experimental, distribution, sample, loss, test, china]
@InProceedings{Li_2019_CVPR,
  author = {Li, Wanhua and Lu, Jiwen and Feng, Jianjiang and Xu, Chunjing and Zhou, Jie and Tian, Qi},
  title = {BridgeNet: A Continuity-Aware Probabilistic Network for Age Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction
Baris Gecer, Stylianos Ploumpis, Irene Kotsia, Stefanos Zafeiriou


In the past few years, a lot of work has been done towards reconstructing the 3D facial structure from single images by capitalizing on the power of Deep Convolutional Neural Networks (DCNNs). In the most recent works, differentiable renderers were employed in order to learn the relationship between the facial identity features and the parameters of a 3D morphable model for shape and texture. The texture features either correspond to components of a linear texture space or are learned by auto-encoders directly from in-the-wild images. In all cases, the quality of the facial texture reconstruction of the state-of-the-art methods is still not capable of modeling textures in high fidelity. In this paper, we take a radically different approach and harness the power of Generative Adversarial Networks (GANs) and DCNNs in order to reconstruct the facial texture and shape from single images. That is, we utilize GANs to train a very powerful generator of facial texture in UV space. Then, we revisit the original 3D Morphable Models (3DMMs) fitting approaches making use of non-linear optimization to find the optimal latent parameters that best reconstruct the test image but under a new perspective. We optimize the parameters with the supervision of pretrained deep identity features through our end-to-end differentiable framework. We demonstrate excellent results in photorealistic and identity preserving 3D face reconstructions and achieve for the first time, to the best of our knowledge, facial texture reconstruction with high-frequency details.
[recognition, capture, work, dataset] [shape, reconstruction, fitting, differentiable, approach, rendered, camera, single, well, geometry, linear, optimization, illumination, albedo, error, problem, lighting, mesh, robust] [texture, face, identity, image, facial, statistical, morphable, input, high, method, expression, landmark, proposed, reconstruct, pixel, stefanos, quality, generator, figure, latent, based, content, comparison, qualitative, reg, color, generative, reconstructed, preserving, acm] [network, deep, cost, order, optimize, scale, formulate, gradient, convolutional, powerful, best, william] [model, gan, adversarial, gans, renderer, excellent] [] [loss, space, alignment, test, learning, trained, representation, function, novel, sample, large, set, distance]
@InProceedings{Gecer_2019_CVPR,
  author = {Gecer, Baris and Ploumpis, Stylianos and Kotsia, Irene and Zafeiriou, Stefanos},
  title = {GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition With Multimodal Training
Mahdi Abavisani, Hamid Reza Vaezi Joze, Vishal M. Patel


We present an efficient approach for leveraging the knowledge from multiple modalities in training unimodal 3D convolutional neural networks (3D-CNNs) for the task of dynamic hand gesture recognition. Instead of explicitly combining multimodal information, which is commonplace in many state-of-the-art methods, we propose a different framework in which we embed the knowledge of multiple modalities in individual networks so that each unimodal network can achieve an improved performance. In particular, we dedicate separate networks per available modality and enforce them to collaborate and learn to develop networks with common semantics and better representations. We introduce a "spatiotemporal semantic alignment" loss (SSA) to align the content of the features from different networks. In addition, we regularize this loss with our proposed "focal regularization parameter" to avoid negative knowledge transfer. Experimental results show that our framework improves the test time recognition accuracy of unimodal networks, and provides the state-of-the-art performance on various dynamic hand gesture recognition datasets.
[recognition, gesture, dataset, unimodal, flow, ssa, dynamic, multiple, optical, fusion, time, spatiotemporal, mtutf, viva, egogesture, video, action, framework, nvgesture, mtut, performer, individual, recurrent, temporal] [hand, rgb, depth, computer, vision, pattern, focal, international, note, approach] [method, conference, ieee, figure, input, proposed, based, image, denoted] [network, performance, table, convolutional, regularization, neural, deep, parameter, better, top, correlation, accuracy] [modality, multimodal, model, system, improved, develop, common, calculated, understanding] [feature, improve, semantics, semantic] [knowledge, training, learning, loss, trained, transfer, test, classification, data, testing, negative, alignment, positive, vishal, share]
@InProceedings{Abavisani_2019_CVPR,
  author = {Abavisani, Mahdi and Reza Vaezi Joze, Hamid and Patel, Vishal M.},
  title = {Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition With Multimodal Training},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Reconstruct People in Clothing From a Single RGB Camera
Thiemo Alldieck, Marcus Magnor, Bharat Lal Bhatnagar, Christian Theobalt, Gerard Pons-Moll


We present Octopus, a learning-based model to infer the personalized 3D shape of people from a few frames (1-8) of a monocular video in which the person is moving with a reconstruction accuracy of 4 to 5mm, while being orders of magnitude faster than previous methods. From semantic segmentation images, our Octopus model reconstructs a 3D shape, including the parameters of SMPL plus clothing and hair in 10 seconds or less. The model achieves fast and accurate predictions based on two key design choices. First, by predicting shape in a canonical T-pose space, the network learns to encode the images of the person into pose-invariant latent codes, where the information is fused. Second, based on the observation that feed-forward predictions are fast but do not always align with the input images, we predict using both, bottom-up and top-down streams (one per view) allowing information to flow in both directions. Learning relies only on synthetic 3D data. Once learned, Octopus can take a variable number of frames as input, and is able to reconstruct shapes even from a single image with an accuracy of 5mm. Results on 3 different datasets demonstrate the efficacy and accuracy of our approach.
[human, video, capture, work, dataset, people, joint, predict, multiple, time, tracking] [shape, pose, computer, body, vision, pattern, optimization, single, christian, smpl, michael, reconstruction, estimation, international, volume, depth, monocular, error, rgb, ground, camera, gerard, octopus, allows, mesh, require, truth, vertex, david, surface, volumetric, lifescans] [ieee, method, input, acm, image, based, figure, latent, synthetic] [network, full, performance, number, convolutional, neural, conv, accuracy, fast] [model, automatic, personalized] [clothing, european, semantic, supervision, person, predicted, segmentation, fully, including] [test, loss, training, set, learning, data, train]
@InProceedings{Alldieck_2019_CVPR,
  author = {Alldieck, Thiemo and Magnor, Marcus and Lal Bhatnagar, Bharat and Theobalt, Christian and Pons-Moll, Gerard},
  title = {Learning to Reconstruct People in Clothing From a Single RGB Camera},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Distilled Person Re-Identification: Towards a More Scalable System
Ancong Wu, Wei-Shi Zheng, Xiaowei Guo, Jian-Huang Lai


Person re-identification (Re-ID), for matching pedestrians across non-overlapping camera views, has made great progress in supervised learning with abundant labelled data. However, the scalability problem is the bottleneck for applications in large-scale systems. We consider the scalability problem of Re-ID from three aspects: (1) low labelling cost by reducing label amount, (2) low extension cost by reusing existing knowledge and (3) low testing computation cost by using lightweight models. The requirements render scalable Re-ID a challenging problem. To solve these problems in a unified system, we propose a Multi-teacher Adaptive Similarity Distillation Framework, which requires only a few labelled identities of target domain to transfer knowledge from multiple teacher models to a user-specified lightweight student model without accessing source domain data. We propose the Log-Euclidean Similarity Distillation Loss for Re-ID and further integrate the Adaptive Knowledge Aggregator to select effective teacher models to transfer target-adaptive knowledge. Extensive evaluations show that our method can extend with high scalability and the performance is comparable to the state-of-the-art unsupervised and semi-supervised Re-ID methods.
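One way to picture similarity-based distillation is sketched below: on each batch, the student is trained so that its pairwise similarity matrix matches the teacher's. The plain MSE here stands in for the paper's Log-Euclidean formulation and the adaptive weighting over multiple teachers.

```python
# A minimal PyTorch sketch of similarity-based distillation: on a batch of
# images the teacher's and student's pairwise similarity matrices are compared,
# so the lightweight student reproduces the teacher's similarity structure
# rather than its exact features. The plain MSE used here stands in for the
# paper's log-Euclidean formulation and multi-teacher weighting.
import torch
import torch.nn.functional as F

def similarity_distillation_loss(student_feats, teacher_feats):
    # *_feats: (B, D) embeddings of the same batch from student and teacher
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1)
    sim_s = s @ s.t()                               # (B, B) student similarity matrix
    sim_t = t @ t.t()                               # (B, B) teacher similarity matrix
    return F.mse_loss(sim_s, sim_t)

student = torch.randn(32, 128, requires_grad=True)  # lightweight model embeddings
teacher = torch.randn(32, 2048)                     # frozen teacher embeddings
loss = similarity_distillation_loss(student, teacher)
loss.backward()
```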
[multiple, joint, framework, fusion, key] [matrix, scene, camera, problem, provide, denote] [method] [adaptive, validation, performance, table, scalable, compared, deep, computation, effective, cost, lightweight, small, fast, denotes, low, reducing, neural, network] [model, empirical, system] [person, feature, dukemtmc, map, propose, aggregated, pool, xiang, liang] [teacher, knowledge, similarity, learning, distillation, student, data, labelled, training, loss, unsupervised, source, transfer, domain, target, risk, distance, aggregator, metric, sample, scalability, shaogang, testing, learn, positive, set, hhl, existing, unlabelled, learned, labelling, adaptation, tao, large, pairwise, camel, soft, reidentification, train]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Ancong and Zheng, Wei-Shi and Guo, Xiaowei and Lai, Jian-Huang},
  title = {Distilled Person Re-Identification: Towards a More Scalable System},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Perceptual Prediction Framework for Self Supervised Event Segmentation
Sathyanarayanan N. Aakur, Sudeep Sarkar


Temporal segmentation of long videos is an important problem that has largely been tackled through supervised learning, often requiring large amounts of annotated training data. In this paper, we tackle the problem of self-supervised temporal segmentation, which alleviates the need for any supervision in the form of labels (full supervision) or temporal ordering (weak supervision). We introduce a self-supervised, predictive learning framework that draws inspiration from cognitive psychology to segment long, visually complex videos into constituent events. Learning involves only a single pass through the training data. We also introduce a new adaptive learning paradigm that helps reduce the effect of catastrophic forgetting in recurrent neural networks. Extensive experiments on three publicly available datasets - Breakfast Actions, 50 Salads, and INRIA Instructional Videos - show the efficacy of the proposed approach. We show that the proposed approach outperforms weakly-supervised and unsupervised baselines by up to 24% and achieves competitive segmentation results compared to fully supervised baselines with only a single pass through the training data. Finally, we show that the proposed self-supervised learning paradigm learns highly discriminating features to improve action recognition.
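To make the self-supervised signal concrete, here is a minimal sketch (an illustration of the idea, not the authors' implementation, with the gating mechanism simplified to a running-statistics test): an event boundary is declared whenever the per-frame prediction error rises well above its recent history.

import numpy as np

def segment_by_prediction_error(errors, window=50, k=1.5):
    # errors: 1-D array of per-frame perceptual prediction errors.
    # A boundary is marked when the current error exceeds the running mean
    # of the previous `window` frames by k running standard deviations.
    boundaries = []
    for t in range(window, len(errors)):
        recent = errors[t - window:t]
        if errors[t] > recent.mean() + k * recent.std():
            boundaries.append(t)
    return boundaries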
[event, prediction, temporal, time, recognition, recurrent, video, lstm, action, dataset, breakfast, signal, gating, current, state, activity, long, instructional, rnn, framework, term, internal, future, frame] [approach, error, computer, vision, pattern, ground, defined, truth, problem, international, form, allow, require] [proposed, perceptual, conference, ieee, input, figure, quality, based, high] [adaptive, performance, network, neural, higher, low, rate, table, processing, convolutional, highly, layer, lower] [model, visual, evaluate, observed, evaluation, pass, memory] [segmentation, feature, predicted, fully, weakly, supervision, boundary, detection, inria, level, propose, semantics] [learning, supervised, training, unsupervised, data, predictor, representation, trained, learn, predictive, set]
@InProceedings{Aakur_2019_CVPR,
  author = {Aakur, Sathyanarayanan N. and Sarkar, Sudeep},
  title = {A Perceptual Prediction Framework for Self Supervised Event Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
COIN: A Large-Scale Dataset for Comprehensive Instructional Video Analysis
Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, Jie Zhou


There are a substantial number of instruction videos on the Internet, which enable us to acquire knowledge for completing various tasks. However, most existing datasets for instruction video analysis have limitations in diversity and scale, which makes them far from many real-world applications where more diverse activities occur. Moreover, it still remains a great challenge to organize and harness such data. To address these problems, we introduce a large-scale dataset called "COIN" for COmprehensive INstruction video analysis. Organized with a hierarchical structure, the COIN dataset contains 11,827 videos of 180 tasks in 12 domains (e.g., vehicles, gadgets, etc.) related to our daily life. With a newly developed toolbox, all the videos are annotated effectively with a series of step descriptions and the corresponding temporal boundaries. Furthermore, we propose a simple yet effective method to capture the dependencies among different steps, which can be easily plugged into conventional proposal-based action detection methods for localizing important steps in instruction videos. In order to provide a benchmark for instruction video analysis, we evaluate plenty of approaches on the COIN dataset under different evaluation criteria. We expect the introduction of the COIN dataset will promote future in-depth research on instruction video analysis for the community.
[video, instructional, action, coin, dataset, temporal, frame, series, localize, time, codebase, human, recognition, second, breakfast, youtube, ordering, activitynet, activity, predict] [corresponding, analysis, provide, associated] [method, figure, based, proposed, developed, comprehensive, great] [order, table, performance, accuracy, number, compared, convolutional, network] [step, model, cooking, evaluation, mode, vector, visual, goal, length] [localization, detection, annotation, three, segment, proposal, segmentation, map, hierarchical, benchmark, semantic, annotated, adopted, score] [task, datasets, existing, domain, label, set, specific, learning, classification, training, address, unsupervised, large]
@InProceedings{Tang_2019_CVPR,
  author = {Tang, Yansong and Ding, Dajun and Rao, Yongming and Zheng, Yu and Zhang, Danyang and Zhao, Lili and Lu, Jiwen and Zhou, Jie},
  title = {COIN: A Large-Scale Dataset for Comprehensive Instructional Video Analysis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Recurrent Attentive Zooming for Joint Crowd Counting and Precise Localization
Chenchen Liu, Xinyu Weng, Yadong Mu


Crowd counting is a new frontier in computer vision with far-reaching applications, particularly in social safety management. A majority of existing works adopt a methodology that first estimates a person-density map and then calculates the integral over this map to obtain the final count. As noticed by several prior investigations, the learned density map can significantly deviate from the true person density even though the final reported count is precise. This implies that the density map is unreliable for localizing the crowd. To address this issue, this work proposes a novel framework that simultaneously solves two inherently related tasks - crowd counting and localization. The contributions are several-fold. First, our formulation is based on a crucial observation that localization tends to be inaccurate at high-density regions, and increasing the resolution is an effective albeit simple solution for improving localization. We thus propose the Recurrent Attentive Zooming Network, which recurrently detects ambiguous image regions and zooms them into high resolution for re-inspection. Second, the two tasks of counting and localization mutually reinforce each other. We propose an adaptive fusion scheme that effectively elevates the performance. Finally, a well-defined evaluation metric is proposed for the rarely-explored localization task. We conduct comprehensive evaluations on several crowd benchmarks, including the newly-developed large-scale UCF-QNRF dataset, and demonstrate superior advantages over state-of-the-art methods.
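For readers unfamiliar with the density-map paradigm the abstract refers to, the counting step itself is just an integral over the predicted map, which is why the count can be accurate even when the map localizes people poorly; a tiny illustrative sketch (hypothetical values, not the paper's code):

import numpy as np

# Hypothetical person-density map (H x W) predicted by a counting CNN.
density_map = np.random.rand(96, 128) * 0.01

# The estimated crowd count is the integral (sum) over the density map.
estimated_count = density_map.sum()

# Localization instead requires finding peaks in the map, which becomes
# unreliable in high-density regions -- hence the recurrent zoom-in for
# re-inspection proposed in this paper.
print(f"estimated count: {estimated_count:.1f}")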
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Chenchen and Weng, Xinyu and Mu, Yadong},
  title = {Recurrent Attentive Zooming for Joint Crowd Counting and Precise Localization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition
Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, Tieniu Tan


Skeleton-based action recognition is an important task that requires the adequate understanding of movement characteristics of a human action from the given skeleton sequence. Recent studies have shown that exploring spatial and temporal features of the skeleton sequence is vital for this task. Nevertheless, how to effectively extract discriminative spatial and temporal features is still a challenging problem. In this paper, we propose a novel Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action recognition from skeleton data. The proposed AGC-LSTM can not only capture discriminative features in spatial configuration and temporal dynamics but also explore the co-occurrence relationship between spatial and temporal domains. We also present a temporal hierarchical architecture to increase temporal receptive fields of the top AGC-LSTM layer, which boosts the ability to learn the high-level semantic representation and significantly reduces the computation cost. Furthermore, to select discriminative spatial information, the attention mechanism is employed to enhance information of key joints in each AGC-LSTM layer. Experimental results on two datasets are provided: NTU RGB+D dataset and Northwestern-UCLA dataset. The comparison results demonstrate the effectiveness of our approach and show that our approach outperforms the state-of-the-art methods on both datasets.
[temporal, graph, action, skeleton, human, lstm, recognition, ntu, dataset, key, joint, capture, agclstm, time, vti, hidden, spatiotemporal, sequence, employed, video, recurrent, focus, state, predict] [position, body, matrix, rgb, vision, local] [proposed, based, comparison, figure, spectral, method] [convolutional, network, neural, layer, architecture, performance, denotes, explore, operator, compared, convolution, configuration, top, effective, pooling, achieves, table, receptive, scale] [attention, model, node, relationship, ability, mechanism] [spatial, feature, hierarchical, enhanced, three, propose, enhance, global, liang, challenging, semantic] [discriminative, learning, learn, set, representation, data, ensemble, confusion, effectively, novel, class]
@InProceedings{Si_2019_CVPR,
  author = {Si, Chenyang and Chen, Wentao and Wang, Wei and Wang, Liang and Tan, Tieniu},
  title = {An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Graph Convolutional Label Noise Cleaner: Train a Plug-And-Play Action Classifier for Anomaly Detection
Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H. Li, Ge Li


Video anomaly detection under weak labels is formulated as a typical multiple-instance learning problem in previous works. In this paper, we provide a new perspective, i.e., a supervised learning task under noisy labels. From this viewpoint, once the label noise is cleaned away, we can directly apply fully supervised action classifiers to weakly supervised anomaly detection and take maximum advantage of these well-developed classifiers. For this purpose, we devise a graph convolutional network to correct noisy labels. Based upon feature similarity and temporal consistency, our network propagates supervisory signals from high-confidence snippets to low-confidence ones. In this manner, the network is capable of providing cleaned supervision for action classifiers. During the test phase, we only need to obtain snippet-wise predictions from the action classifier without any extra post-processing. Extensive experiments on 3 datasets at different scales with 2 types of action classifiers demonstrate the efficacy of our method. Remarkably, we obtain a frame-level AUC score of 82.12% on UCF-Crime.
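The propagation idea can be sketched roughly as follows (a simplified, non-learned stand-in for the paper's graph convolutional cleaner; names and shapes are hypothetical): build a snippet graph from feature similarity and temporal proximity, then smooth the noisy snippet scores over that graph.

import numpy as np

def propagate_snippet_scores(features, scores, sigma=1.0, tau=2.0):
    # features: (T, D) snippet features; scores: (T,) noisy anomaly scores.
    T = len(scores)
    t = np.arange(T)
    # Feature-similarity edges plus temporal-consistency edges.
    feat_dist = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    A = np.exp(-feat_dist / sigma) + np.exp(-np.abs(t[:, None] - t[None, :]) / tau)
    A /= A.sum(axis=1, keepdims=True)   # row-normalize (random-walk style)
    return A @ scores                   # cleaned scores, smoothed over the graph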
[anomaly, action, temporal, graph, video, auc, alternate, cleaner, gcn, abnormal, event, flow, long, previous, work, tsn] [normal, problem, pattern, directly, june, international, ground, analysis, matrix, optimization] [noise, figure, consistency, anomalous, based, conference, method, proposed, input, conduct] [convolutional, performance, deep, table, neural, network, process, sparse, number] [model, machine, correct, procedure] [detection, feature, module, mil, weakly, score, fully, false, shanghaitech, weak, supervision] [learning, training, label, supervised, similarity, classifier, noisy, loss, train, classification, test, data, adjacency, task, cleaned, trained, datasets, set]
@InProceedings{Zhong_2019_CVPR,
  author = {Zhong, Jia-Xing and Li, Nannan and Kong, Weijie and Liu, Shan and Li, Thomas H. and Li, Ge},
  title = {Graph Convolutional Label Noise Cleaner: Train a Plug-And-Play Action Classifier for Anomaly Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment
Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, Larry S. Davis


This research strives for natural language moment retrieval in long, untrimmed video streams. The problem is not trivial especially when a video contains multiple moments of interests and the language describes complex temporal dependencies, which often happens in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. However, existing approaches treat different moments separately and do not explicitly model complex moment-wise temporal relations. In this paper, we present Moment Alignment Network (MAN), a novel framework that unifies the candidate moment encoding and temporal structural reasoning in a single-shot feed-forward network. MAN naturally assigns candidate moment representations aligned with language semantics over different temporal locations and scales. Most importantly, we propose to explicitly model moment-wise temporal relations as a structured graph and devise an iterative graph adjustment network to jointly learn the best structure in an end-to-end manner. We evaluate the proposed approach on two challenging public benchmarks DiDeMo and Charades-STA, where our MAN significantly outperforms the state-of-the-art by a large margin.
[temporal, moment, video, graph, dynamic, igan, action, activity, recognition, complex, didemo, structural, lstm, multiple, outperforms, xiyang, untrimmed, second, time, work, sequence, updated] [computer, vision, pattern, adjustment, matching, international, ground, matrix, directly, truth, single] [conference, ieee, proposed, input, figure, component, study, method, produce] [network, convolutional, structure, cell, neural, table, best, better, deep, performance, residual, compare, number] [language, natural, model, candidate, man, visual, sentence, iterative, node, query, encoding, encoder, reasoning, simple] [feature, semantic, propose, detection, misalignment, sliding, object, ablation, hierarchical] [retrieval, representation, alignment, learning, set, training, adjacency, learn]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Da and Dai, Xiyang and Wang, Xin and Wang, Yuan-Fang and Davis, Larry S.},
  title = {MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Less Is More: Learning Highlight Detection From Video Duration
Bo Xiong, Yannis Kalantidis, Deepti Ghadiyaram, Kristen Grauman


Highlight detection has the potential to significantly ease video browsing, but existing methods often suffer from expensive supervision requirements, where human viewers must manually identify highlights in training videos. We propose a scalable unsupervised solution that exploits video duration as an implicit supervision signal. Our key insight is that video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about the content when capturing shorter videos. Leveraging this insight, we introduce a novel ranking framework that prefers segments from shorter videos, while properly accounting for the inherent noise in the (unlabeled) training data. We use it to train a highlight detector with 10M hashtagged Instagram videos. In experiments on two challenging public video highlight detection benchmarks, our method substantially improves the state-of-the-art for unsupervised highlight detection.
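The duration-based supervision boils down to a ranking constraint; a minimal sketch of one plausible form (my own illustrative formulation, not the released code) is a margin loss that pushes scores of segments from shorter videos above those from longer videos:

import torch
import torch.nn.functional as F

def duration_ranking_loss(score_short, score_long, margin=1.0):
    # score_short / score_long: predicted highlight scores for segments drawn
    # from short (likely highlight-rich) and long videos, respectively.
    return F.relu(margin - score_short + score_long).mean()

# Example with a batch of 8 segment pairs.
loss = duration_ranking_loss(torch.randn(8), torch.randn(8))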
[video, highlight, duration, instagram, summarization, youtube, longer, shorter, long, tvsum, short, dataset, outperforms, manually, kristen, framework, time, second, multiple, surfing, unedited, signal, learns] [approach, valid, wij, variable, total, require] [method, latent, noise, figure, prior, tend] [selection, higher, deep, group, binary, network, table, sharing, neural] [model, visual, introduce, attention] [detection, segment, supervision, public, feature, score, propose, improves, category, predicted, average, challenging, selective, interest, leverage, weakly, detect] [training, ranking, data, learning, unsupervised, supervised, function, novel, train, trained, noisy, web, dog, existing, domain, loss, test, learn, large, set, discriminative, label]
@InProceedings{Xiong_2019_CVPR,
  author = {Xiong, Bo and Kalantidis, Yannis and Ghadiyaram, Deepti and Grauman, Kristen},
  title = {Less Is More: Learning Highlight Detection From Video Duration},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition
Zheng Shou, Xudong Lin, Yannis Kalantidis, Laura Sevilla-Lara, Marcus Rohrbach, Shih-Fu Chang, Zhicheng Yan


Motion has been shown to be useful for video understanding, where motion is typically represented by optical flow. However, computing flow from video frames is very time-consuming. Recent works directly leverage the motion vectors and residuals readily available in the compressed video to represent motion at no cost. While this avoids flow computation, it also hurts accuracy, since the motion vector is noisy and has substantially reduced resolution, which makes it a less discriminative motion representation. To remedy these issues, we propose a lightweight generator network, which reduces noise in motion vectors and captures fine motion details, achieving a more Discriminative Motion Cue (DMC) representation. Since optical flow is a more accurate motion representation, we train the DMC generator to approximate flow using a reconstruction loss and a generative adversarial loss, jointly with the downstream action classification task. Extensive evaluations on three action recognition benchmarks (HMDB-51, UCF-101, and a subset of Kinetics) confirm the effectiveness of our method. Our full system, consisting of the generator and the classifier, is coined DMC-Net; it obtains high accuracy close to that of using flow and runs two orders of magnitude faster than using optical flow at inference time.
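Schematically, the generator's objective combines the three terms named in the abstract; a hedged sketch (weights and tensor shapes are illustrative, not the authors' exact configuration):

import torch
import torch.nn.functional as F

def dmc_generator_loss(dmc, flow_gt, disc_logits_fake, cls_loss,
                       lam_adv=0.1, lam_cls=1.0):
    # dmc: generated discriminative motion cue; flow_gt: optical-flow target;
    # disc_logits_fake: discriminator logits on dmc; cls_loss: downstream
    # action-classification loss. Weights are placeholders.
    rec = F.mse_loss(dmc, flow_gt)                          # approximate the flow
    adv = F.binary_cross_entropy_with_logits(               # fool the discriminator
        disc_logits_fake, torch.ones_like(disc_logits_fake))
    return rec + lam_adv * adv + lam_cls * cls_loss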
[motion, flow, optical, video, action, dmc, coviar, recognition, time, gdmc, zheng, temporal, frame, jointly, decoded, follow, spatiotemporal, capture, prediction, cue, operates, work, modeling] [rgb, computer, pattern, vision, estimation, reconstruction, analysis, supplementary] [generator, based, ieee, figure, conference, proposed, image, input, method, high] [compressed, accuracy, network, inference, table, convolutional, residual, fine, deep, achieve, cnns, full, architecture, magnitude] [adversarial, vector, model, generated, arxiv, generating] [cnn, three, faster, propose] [training, discriminative, loss, representation, classification, learning, classifier, trained, downstream, classify, set, domain, data, noisy]
@InProceedings{Shou_2019_CVPR,
  author = {Shou, Zheng and Lin, Xudong and Kalantidis, Yannis and Sevilla-Lara, Laura and Rohrbach, Marcus and Chang, Shih-Fu and Yan, Zhicheng},
  title = {DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
AdaFrame: Adaptive Frame Selection for Fast Video Recognition
Zuxuan Wu, Caiming Xiong, Chih-Yao Ma, Richard Socher, Larry S. Davis


We present AdaFrame, a framework that adaptively selects relevant frames on a per-input basis for fast video recognition. AdaFrame contains a Long Short-Term Memory network augmented with a global memory that provides context information for searching which frames to use over time. Trained with policy gradient methods, AdaFrame generates a prediction, determines which frame to observe next, and computes the utility, i.e., expected future rewards, of seeing more frames at each time step. At testing time, AdaFrame exploits predicted utilities to achieve adaptive lookahead inference such that the overall computational costs are reduced without incurring a decrease in accuracy. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet. AdaFrame matches the performance of using all frames with only 8.21 and 8.65 frames on FCVID and ActivityNet, respectively. We further qualitatively demonstrate learned frame usage can indicate the difficulty of making classification decisions; easier samples need fewer frames while harder ones require more, both at instance-level within the same class and at class-level among different categories.
[frame, video, adaframe, time, lstm, fcvid, activity, hidden, future, current, prediction, avg, recognition, temporal, action, ooling, work, early, eward] [confidence, derive] [figure, based, downsampled, input, produce, spatially] [network, adaptive, inference, number, computational, computation, selection, usage, fast, neural, performance, small, efficient, better, deep, residual, max, gflops, gradient, achieve, achieves, table] [utility, memory, model, reward, policy, expected, step, introduce, vector, conditioned, strong, decide, relevant] [global, context, average, predicted, detection] [learning, testing, class, classification, function, learn, training, trained, sampling, set, select, observe, learned, large, selected, measure]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Zuxuan and Xiong, Caiming and Ma, Chih-Yao and Socher, Richard and Davis, Larry S.},
  title = {AdaFrame: Adaptive Frame Selection for Fast Video Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Spatio-Temporal Video Re-Localization by Warp LSTM
Yang Feng, Lin Ma, Wei Liu, Jiebo Luo


The need for efficiently finding the video content a user wants is increasing because of the explosion of user-generated videos on the Web. Existing keyword-based or content-based video retrieval methods usually determine what occurs in a video but not when and where. In this paper, we answer the question of when and where by formulating a new task, namely spatio-temporal video re-localization. Specifically, given a query video and a reference video, spatio-temporal video re-localization aims to localize tubelets in the reference video such that the tubelets semantically correspond to the query. To accurately localize the desired tubelets in the reference video, we propose a novel warp LSTM network, which propagates the spatio-temporal information for a long period and thereby captures the corresponding long-term dependencies. Another issue for spatio-temporal video re-localization is the lack of properly labeled video datasets. Therefore, we reorganize the videos in the AVA dataset to form a new dataset for spatio-temporal video re-localization research. Extensive experimental results show that the proposed model achieves superior performance over the designed baselines on the spatio-temporal video re-localization task.
[video, warp, lstm, action, tubelets, combined, dataset, tubelet, temporal, convlstm, previous, stvr, hidden, flow, optical, state, clip, ava, recognition, localize, spatiotemporal, second, warped, long, current, correspond, short, localizing, motion, trajlstm, human, spatio, moving] [corresponding, matching] [reference, figure, proposed, control, image, interpolation, chen, based, content, input] [wei, convolutional, designed, network, fixed, neural, deep, number, size, validation] [query, semantically, model, find, generate, natural, create] [bounding, feature, box, proposal, object, localization, baseline, cnn, lin, detection, map, three, predicted, propose, semantic] [label, labeled, training, learning, classification, retrieval, set, task]
@InProceedings{Feng_2019_CVPR,
  author = {Feng, Yang and Ma, Lin and Liu, Wei and Luo, Jiebo},
  title = {Spatio-Temporal Video Re-Localization by Warp LSTM},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization
Daochang Liu, Tingting Jiang, Yizhou Wang


Temporal action localization is crucial for understanding untrimmed videos. In this work, we first identify two underexplored problems posed by the weak supervision for temporal action localization, namely action completeness modeling and action-context separation. We then address these two problems explicitly through a novel network architecture and its training strategy. Specifically, to model the completeness of actions, we propose a multi-branch neural network in which branches are enforced to discover distinctive action parts. Complete actions can therefore be localized by fusing activations from different branches. To separate action instances from their surrounding context, we generate hard negative data for training using the prior that motionless video clips are unlikely to be actions. Experiments performed on the datasets THUMOS'14 and ActivityNet show that our framework outperforms state-of-the-art methods. In particular, the average mAP on ActivityNet v1.2 is significantly improved from 18.0% to 22.4%. Our code will be released soon.
[action, temporal, recognition, video, multiple, sequence, untrimmednet, untrimmed, avg, modeling, aavg, time, activity, outperforms, static, optical, lmil] [vision, computer, pattern, completeness, international, june, single, localized, ground] [conference, ieee, proposed, method, background, input, based, july, separation] [network, full, validation, table, activation, denotes, convolutional, number, selection, norm, experiment] [model, diversity, attention, generated, common, generation] [branch, localization, feature, context, weak, weakly, average, iou, map, instance, category, module, european, extraction] [hard, negative, supervised, classification, loss, set, training, testing, class, trained, learning, embedding, test, learned]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Daochang and Jiang, Tingting and Wang, Yizhou},
  title = {Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Deep Tracking
Ning Wang, Yibing Song, Chao Ma, Wengang Zhou, Wei Liu, Houqiang Li


We propose an unsupervised visual tracking method in this paper. Different from existing approaches that use extensive annotated data for supervised learning, our CNN model is trained on large-scale unlabeled videos in an unsupervised manner. Our motivation is that a robust tracker should be effective in both the forward and backward predictions (i.e., the tracker can forward-localize the target object in successive frames and backtrack to its initial position in the first frame). We build our framework on a Siamese correlation filter network, which is trained using unlabeled raw videos. Meanwhile, we propose a multiple-frame validation method and a cost-sensitive loss to facilitate unsupervised learning. Without bells and whistles, the proposed unsupervised tracker achieves the baseline accuracy of fully supervised trackers, which require complete and accurate labels during training. Furthermore, the unsupervised framework exhibits potential for leveraging unlabeled or weakly labeled data to further improve tracking accuracy.
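The forward-backward idea can be written as a round-trip consistency loss; the sketch below is conceptual (the tracker interface and names are hypothetical, and the real method uses correlation filters together with the multi-frame validation scheme and cost-sensitive loss described above):

import torch

def forward_backward_loss(tracker, frames, initial_response):
    # tracker(prev_response, frame) -> response map  (hypothetical interface)
    # frames: list of frames in a clip; initial_response: pseudo-label on frames[0].
    response = initial_response
    for frame in frames[1:]:                # track forward through the clip
        response = tracker(response, frame)
    for frame in reversed(frames[:-1]):     # track backward to the first frame
        response = tracker(response, frame)
    # A robust tracker should recover the initial pseudo-label after the round trip.
    return torch.mean((response - initial_response) ** 2)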
[tracking, forward, backward, frame, framework, track, multiple, online, motion, lun, video, jin] [template, initial, computed, error, robust, corresponding, michael] [proposed, patch, method, consistency, based, figure] [network, tracker, correlation, udt, siamese, deep, search, filter, performance, siamfc, precision, cfnet, validation, dsst, comparable, dcf, convolutional, scale, effective, kcf, weight, rate, overlap, chao] [visual, success, model, evaluation, expected] [object, bounding, feature, response, box, cnn, propose, baseline, map, location, cropped, benchmark, threshold] [unsupervised, training, target, learning, unlabeled, loss, supervised, data, representation, learned, label, trained, train, randomly, existing, large, update]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Ning and Song, Yibing and Ma, Chao and Zhou, Wengang and Liu, Wei and Li, Houqiang},
  title = {Unsupervised Deep Tracking},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Tracking by Animation: Unsupervised Learning of Multi-Object Attentive Trackers
Zhen He, Jian Li, Daxue Liu, Hangen He, David Barber


Online Multi-Object Tracking (MOT) from videos is a challenging computer vision task which has been extensively studied for decades. Most of the existing MOT algorithms are based on the Tracking-by-Detection (TBD) paradigm combined with popular machine learning approaches which largely reduce the human effort to tune algorithm parameters. However, the commonly used supervised learning approaches require the labeled data (e.g., bounding boxes), which is expensive for videos. Also, the TBD framework is usually suboptimal since it is not end-to-end, i.e., it considers the task as detection and tracking, but not jointly. To achieve both label-free and end-to-end learning of MOT, we propose a Tracking-by-Animation framework, where a differentiable neural model first tracks objects from input frames and then animates these objects into reconstructed frames. Learning is then driven by the reconstruction error through backpropagation. We further propose a Reprioritized Attentive Tracking to improve the robustness of data association. Experiments conducted on both synthetic and real video datasets show the potential of the proposed model. Our project page is publicly available at: https://github.com/zhen-he/tracking-by-animation
[tracking, tba, online, time, rat, mot, framework, frame, state, tbac, video, multiple, nnupd, rnn, extract, disrupted, tracked, eat, key, people] [occlusion, reconstruction, array, note, algorithm, well, defined, robust] [input, image, figure, reconstructed, appearance, based, generative, ieee, real, background, qualitative] [tracker, neural, number, layer, network, computation, size, output, iteration, parameterized, channel, process, weight, reduce] [model, association, attention, memory, write, consider, vector, generate, visual, robustness, evaluate, renderer, read, van, machine, tbd] [object, feature, associate, bounding, propose, attentive, detection, challenging] [learning, data, set, unsupervised, extractor, update, training, loss, supervised]
@InProceedings{He_2019_CVPR,
  author = {He, Zhen and Li, Jian and Liu, Daxue and He, Hangen and Barber, David},
  title = {Tracking by Animation: Unsupervised Learning of Multi-Object Attentive Trackers},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fast Online Object Tracking and Segmentation: A Unifying Approach
Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, Philip H.S. Torr


In this paper we illustrate how to perform both visual object tracking and semi-supervised video object segmentation, in real-time, with a single simple approach. Our method, dubbed SiamMask, improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task. Once trained, SiamMask solely relies on a single bounding box initialisation and operates online, producing class-agnostic object segmentation masks and rotated bounding boxes at 55 frames per second. Despite its simplicity, versatility and fast speed, our strategy allows us to establish a new state-of-the-art among real-time trackers on VOT-2018, while at the same time demonstrating competitive performance and the best speed for the semi-supervised video object segmentation task on DAVIS-2016 and DAVIS-2017.
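Since the key change is augmenting the tracking losses with a binary segmentation term, the overall objective can be sketched as a simple multi-task sum (loss weights and tensor layouts here are illustrative, not the paper's exact settings):

import torch.nn.functional as F

def siamese_track_and_segment_loss(cls_logits, cls_labels, box_pred, box_target,
                                   mask_logits, mask_target, lam_mask=1.0):
    loss_cls = F.cross_entropy(cls_logits, cls_labels)            # fg/bg score branch
    loss_box = F.smooth_l1_loss(box_pred, box_target)             # box regression branch
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits,   # per-pixel mask branch
                                                   mask_target)
    return loss_cls + loss_box + lam_mask * loss_mask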
[video, tracking, vos, online, frame, despite, time, multiple, mbr, optical, davis] [computer, pattern, vision, approach, require, international, accurate, allows] [conference, ieee, figure, method, produce, image, arbitrary, row, high] [binary, siamese, performance, network, table, speed, correlation, fast, siamfc, siamrpn, accuracy, search, achieve, output, variant, convolutional, eao, fixed, offline, competitive, best, fastest, neural] [visual, simple, simply, requires, evaluation] [object, bounding, box, segmentation, mask, siammask, european, branch, rotated, three, spatial, region, rectangle, response, lmask] [target, representation, learning, strategy, test, task, loss, similarity, oracle]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Qiang and Zhang, Li and Bertinetto, Luca and Hu, Weiming and Torr, Philip H.S.},
  title = {Fast Online Object Tracking and Segmentation: A Unifying Approach},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Object Tracking by Reconstruction With View-Specific Discriminative Correlation Filters
Ugur Kart, Alan Lukezic, Matej Kristan, Joni-Kristian Kamarainen, Jiri Matas


Standard RGB-D trackers treat the target as a 2D structure, which makes modelling appearance changes related even to out-of-plane rotation challenging. This limitation is addressed by the proposed long-term RGB-D tracker called OTR - Object Tracking by Reconstruction. OTR performs online 3D target reconstruction to facilitate robust learning of a set of view-specific discriminative correlation filters (DCFs). The 3D reconstruction supports two performance- enhancing features: (i) generation of an accurate spatial support for constrained DCF learning from its 2D projection and (ii) point-cloud based estimation of 3D pose change for selection and storage of view-specific DCFs which robustly localize the target after out-of-view rotation or heavy occlusion. Extensive evaluation on the Princeton RGB-D tracking and STC Benchmarks shows OTR outperforms the state-of-the-art by a large margin.
[tracking, current, updated, frame, work, outperforms] [depth, position, reconstruction, estimated, computed, rotation, well, approach, robust, estimation, rgb, icp, pose, occlusion, formulation] [appearance, color, proposed, based, ieee, figure, method, change, image, described, study] [filter, tracker, otr, dcf, correlation, performance, dcfs, stc, standard, ptb, scale, princeton, small, ratio, table, oapf, precision, search] [model, success, visual, constrained, evaluation, improved] [object, mask, benchmark, segmentation, region, localization, bounding, aspect, ablation, presence, response, box, location, average, threshold, map, three] [target, set, learning, test, update, discriminative, large, training, maximum, representation]
@InProceedings{Kart_2019_CVPR,
  author = {Kart, Ugur and Lukezic, Alan and Kristan, Matej and Kamarainen, Joni-Kristian and Matas, Jiri},
  title = {Object Tracking by Reconstruction With View-Specific Discriminative Correlation Filters},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints
Amir Sadeghian, Vineet Kosaraju, Ali Sadeghian, Noriaki Hirose, Hamid Rezatofighi, Silvio Savarese


This paper addresses the problem of path prediction for multiple interacting agents in a scene, which is a crucial step for many autonomous platforms such as self-driving cars and social robots. We present SoPhie, an interpretable framework based on Generative Adversarial Networks (GANs), which leverages two sources of information: the path history of all the agents in a scene, and the scene context information, using images of the scene. To predict a future path for an agent, both physical and social information must be leveraged. Previous work has not been successful in jointly modeling physical and social interactions. Our approach blends a social attention mechanism with physical attention that helps the model learn where to look in a large scene and extract the most salient parts of the image relevant to the path. Meanwhile, the social attention component aggregates information across the different agent interactions and extracts the most important trajectory information from the surrounding neighbors. SoPhie also takes advantage of GANs to generate more realistic samples and to capture the uncertain nature of the future paths by modeling their distribution. All these mechanisms enable our approach to predict socially and physically plausible paths for the agents and to achieve state-of-the-art performance on several different trajectory forecasting benchmarks.
[social, future, trajectory, lstm, sophie, prediction, predict, state, multiple, joint, predicting, modeling, hidden, time, dataset, socially, influence, learns, previous, drone, focus, eth] [scene, computer, well, feasible, truth, physically, ground, vision, pattern, approach, single, error, problem, account] [generative, generator, conference, image, based, ieee, input, method] [order, performance, neural, table, better, network] [attention, model, physical, agent, gan, path, decoder, discriminator, visual, generate, arxiv, preprint, adversarial, plausible, applies, generated, encoder, vector] [context, module, feature, pedestrian, attentive, predicted, average] [learn, distribution, stanford, set, trained, sample, datasets]
@InProceedings{Sadeghian_2019_CVPR,
  author = {Sadeghian, Amir and Kosaraju, Vineet and Sadeghian, Ali and Hirose, Noriaki and Rezatofighi, Hamid and Savarese, Silvio},
  title = {SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Leveraging Shape Completion for 3D Siamese Tracking
Silvio Giancola, Jesus Zarzar, Bernard Ghanem


Point clouds are challenging to process due to their sparsity, therefore autonomous vehicles rely more on appearance attributes than pure geometric features. However, 3D LIDAR perception can provide crucial information for urban navigation in challenging light or weather conditions. In this paper, we investigate the versatility of Shape Completion for 3D Object Tracking in LIDAR point clouds. We design a Siamese tracker that encodes model and candidate shapes into a compact latent representation. We regularize the encoding by enforcing the latent representation to decode into an object model shape. We observe that 3D object tracking and 3D shape completion complement each other. Learning a more meaningful latent representation shows better discriminatory capabilities, leading to improved tracking performance. We test our method on the KITTI Tracking set using car 3D bounding boxes. Our model reaches a 76.94% Success rate and 81.38% Precision for 3D Object Tracking, with the shape completion regularization leading to an improvement of 3% in both metrics.
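The interplay between tracking and completion can be sketched as a two-term objective (the encoder/decoder are hypothetical modules and the weighting is illustrative): latent codes of candidate and model shapes are compared with cosine similarity, while the same latent code is regularized to decode back into the model shape.

import torch
import torch.nn.functional as F

def tracking_with_completion_loss(encoder, decoder, candidate_pts, model_pts,
                                  sim_target, lam_comp=1.0):
    z_cand = encoder(candidate_pts)               # (B, D) latent of candidate shape
    z_model = encoder(model_pts)                  # (B, D) latent of model shape
    sim = F.cosine_similarity(z_cand, z_model, dim=1)
    loss_track = F.mse_loss(sim, sim_target)      # regress an overlap-based target
    loss_comp = F.mse_loss(decoder(z_model), model_pts)  # shape-completion regularizer
    return loss_track + lam_comp * loss_comp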
[tracking, frame, fusion, recognition, previous, decoded, time, current, kalman, driving] [shape, point, completion, vision, computer, pattern, ground, cloud, exhaustive, truth, provide, geometric, lidar, reconstruction, autonomous, single, equation, comp, international, kitti, depth, dense, particle, well, june] [latent, ieee, conference, method, figure, based, high, proposed, meaningful] [siamese, network, search, best, tracker, table, better, regularization, performance, gaussian, filter, sparse, size, efficient, max, pooling, concatenating, aggregation] [model, candidate, visual, arxiv, preprint, complete, generate, encoded, encoder] [object, detection, semantic, car, bev] [representation, space, loss, similarity, training, distance, learning, set, trained, cosine]
@InProceedings{Giancola_2019_CVPR,
  author = {Giancola, Silvio and Zarzar, Jesus and Ghanem, Bernard},
  title = {Leveraging Shape Completion for 3D Siamese Tracking},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Target-Aware Deep Tracking
Xin Li, Chao Ma, Baoyuan Wu, Zhenyu He, Ming-Hsuan Yang


Existing deep trackers mainly use convolutional neural networks pre-trained for the generic object recognition task to obtain representations. Despite demonstrated successes for numerous vision tasks, the contributions of using pre-trained deep features for visual tracking are not as significant as they are for object recognition. The key issue is that in visual tracking the targets of interest can be of arbitrary object classes with arbitrary forms. As such, pre-trained deep features are less effective in modeling these targets of arbitrary forms for distinguishing them from the background. In this paper, we propose a novel scheme to learn target-aware features, which can better recognize the targets undergoing significant appearance variations than pre-trained deep features. To this end, we develop a regression loss and a ranking loss to guide the generation of target-active and scale-sensitive features. We identify the importance of each convolutional filter according to the back-propagated gradients and select the target-aware features based on activations for representing the targets. The target-aware features are integrated with a Siamese matching network for visual tracking. Extensive experimental results show that the proposed algorithm performs favorably against the state-of-the-art methods in terms of accuracy and speed.
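The gradient-based channel selection can be outlined in a few lines (an illustrative sketch of the idea, not the authors' code; the paper additionally uses a ranking loss for scale sensitivity):

import torch

def select_target_aware_channels(features, target_loss, k=256):
    # features: (1, C, H, W) pre-trained features of the first frame,
    #           with requires_grad=True; target_loss: scalar loss (e.g. a
    #           regression loss on the target) computed from these features.
    grads, = torch.autograd.grad(target_loss, features)
    importance = grads.abs().mean(dim=(2, 3)).squeeze(0)   # one score per channel
    return torch.topk(importance, k).indices               # keep the top-k channels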
[tracking, online, framework, dataset, auc] [computer, vision, pattern, algorithm, matching, michael, international, regress, computed] [conference, ieee, proposed, based, figure, xin, arbitrary] [deep, siamese, convolutional, correlation, tracker, performance, effective, scale, filter, table, search, number, best, network, speed, staple, achieves, neural, accuracy, bacf, siamfc, overlap, effectiveness, compared, size, tadt, dasiamrpn, traca, cfnet, otb, precision] [visual, model, success, generate, generated, indicates] [object, feature, regression, score, cnn, european, objectness, map, lrank] [target, loss, learning, ranking, training, select, discriminative, experimental, exploit, classification, gap, set]
@InProceedings{Li_2019_CVPR,
  author = {Li, Xin and Ma, Chao and Wu, Baoyuan and He, Zhenyu and Yang, Ming-Hsuan},
  title = {Target-Aware Deep Tracking},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Spatiotemporal CNN for Video Object Segmentation
Kai Xu, Longyin Wen, Guorong Li, Liefeng Bo, Qingming Huang


In this paper, we present a unified, end-to-end trainable spatiotemporal CNN model for VOS, which consists of two branches, i.e., the temporal coherence branch and the spatial segmentation branch. Specifically, the temporal coherence branch, pretrained in an adversarial fashion from unlabeled video data, is designed to capture the dynamic appearance and motion cues of video sequences to guide object segmentation. The spatial segmentation branch focuses on segmenting objects accurately based on the learned appearance and motion cues. To obtain accurate segmentation results, we design a coarse-to-fine process to sequentially apply a designed attention module on multi-scale feature maps, and concatenate them to produce the final prediction. In this way, the spatial segmentation branch is enforced to gradually concentrate on object regions. These two branches are jointly fine-tuned on video segmentation sequences in an end-to-end manner. Several experiments are carried out on three challenging datasets (i.e., DAVIS-2016, DAVIS-2017 and Youtube-Object) to show that our method achieves favorable performance against the state of the art. Code is available at https://github.com/longyin880815/STCNN.
[video, temporal, coherence, stcnn, online, optical, motion, dataset, vos, flow, spatiotemporal, frame, time, capture, dynamic, previous, iteratively, manner, osvos, current, onavos] [algorithm, accurate] [method, based, appearance, proposed, figure, generator, contour, demonstrate, guide, produce] [network, table, performance, layer, design, accuracy, fast, designed, better, pretrained, concatenate, convolutional, running, convolution, batch, number, size, validation, deep, process, sequentially] [attention, generate, model, adversarial, discriminator, alexander, evaluate] [segmentation, object, branch, spatial, mask, region, three, feature, module, cnn, segment, fully, indicate, challenging] [training, learning, set, unsupervised, similarity, data, train, unlabeled]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Kai and Wen, Longyin and Li, Guorong and Bo, Liefeng and Huang, Qingming},
  title = {Spatiotemporal CNN for Video Object Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Towards Rich Feature Discovery With Class Activation Maps Augmentation for Person Re-Identification
Wenjie Yang, Houjing Huang, Zhang Zhang, Xiaotang Chen, Kaiqi Huang, Shu Zhang


The fundamental challenge of small inter-person variation requires Person Re-Identification (Re-ID) models to capture sufficient fine-grained information. This paper proposes to discover diverse discriminative visual cues without extra assistance, e.g., pose estimation, human parsing. Specifically, a Class Activation Maps (CAM) augmentation model is proposed to expand the activation scope of baseline Re-ID model to explore rich visual cues, where the backbone network is extended by a series of ordered branches which share the same input but output complementary CAM. A novel Overlapped Activation Penalty is proposed to force the new branch to pay more attention to the image regions less activated by the old ones, such that spatial diverse visual features can be discovered. The proposed model achieves state-of-the-art results on three person Re-ID benchmarks. Moreover, a visualization approach termed ranking activation map (RAM) is proposed to explicitly interpret the ranking results in the test stage, which gives qualitative validations of the proposed method.
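One plausible way to write the Overlapped Activation Penalty (my own schematic formulation, not the paper's exact equation): the new branch is charged for activating where previously trained branches already fired.

import torch

def overlapped_activation_penalty(new_cam, old_cams):
    # new_cam: (B, H, W) class activation map of the newly added branch.
    # old_cams: list of (B, H, W) maps from earlier branches (treated as fixed).
    prior = torch.stack(old_cams, dim=0).max(dim=0).values.detach()
    return (new_cam.clamp(min=0) * prior.clamp(min=0)).mean()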
[human, multiple, work, force, previous] [approach, body, analysis, local, pose] [proposed, image, based, input, method, figure, lid, qualitative] [activation, deep, number, network, denotes, penalty, small, achieves, top, better, layer, interpret, performance, convolutional, sigmoid, pooling] [model, visual, query, attention, discover, diverse, indicates, arxiv, preprint, green, red] [person, feature, map, cama, baseline, branch, spatial, cam, global, overlapped, cnn, loap, activated, three, adopt, average, mti, visualization, propose, enhance] [discriminative, learning, ranking, class, training, test, gallery, loss, function, set, data, learned, metric, positive, triplet]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Wenjie and Huang, Houjing and Zhang, Zhang and Chen, Xiaotang and Huang, Kaiqi and Zhang, Shu},
  title = {Towards Rich Feature Discovery With Class Activation Maps Augmentation for Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Wide-Context Semantic Image Extrapolation
Yi Wang, Xin Tao, Xiaoyong Shen, Jiaya Jia


This paper studies the fundamental problem of extrapolating visual context using deep generative models, i.e., extending image borders with plausible structure and details. This seemingly easy task actually faces many crucial technical challenges and has its unique properties. The two major issues are size expansion and one-side constraints. We propose a semantic regeneration network with several special contributions and use multiple spatial related losses to address these issues. Our results contain consistent structures and high-quality textures. Extensive experiments are conducted on various possible alternatives and related methods. We also explore the potential of our method for various interesting applications that can benefit research in a variety of fields.
[prediction, time] [relative, reconstruction, local, problem, view, note, directly, defined, body] [image, figure, input, method, texture, extrapolation, generative, inpainting, acm, psnr, jiaya, synthesis, style, mrf, quantitative, ssim, comparison, xin, content, based, deconvolution, consistency, hair] [network, deep, size, variant, normalization, table, structure, convolution, neural, padding, output, vanilla, compared, convolutional, full, design, designed, inference] [visual, adversarial, model, generation, arxiv, preprint, adv, semantically, natural, generate, common] [context, srn, feature, expansion, spatial, semantic, filling, contextual, three, global, module] [loss, unknown, learning, training, task, trained, large]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Yi and Tao, Xin and Shen, Xiaoyong and Jia, Jiaya},
  title = {Wide-Context Semantic Image Extrapolation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
End-To-End Time-Lapse Video Synthesis From a Single Outdoor Image
Seonghyeon Nam, Chongyang Ma, Menglei Chai, William Brendel, Ning Xu, Seon Joo Kim


Time-lapse videos usually contain visually appealing content but are often difficult and costly to create. In this paper, we present an end-to-end solution to synthesize a time-lapse video from a single outdoor image using deep neural networks. Our key idea is to train a conditional generative adversarial network based on existing datasets of time-lapse videos and image sequences. We propose a multi-frame joint conditional generation framework to effectively learn the correlation between the illumination change of an outdoor scene and the time of the day. We further present a multi-domain training scheme for robust training of our generative models from two datasets with different distributions and missing timestamp labels. Compared to alternative time-lapse video synthesis algorithms, our method uses the timestamp as the control variable and does not require a reference video to guide the synthesis of the final output. We conduct ablation studies to validate our algorithm and compare with state-of-the-art techniques both qualitatively and quantitatively.
[video, dataset, timestamp, time, joint, recognition, frame, sequence, framework, temporal, motion] [computer, vision, outdoor, illumination, scene, pattern, single, variable, algorithm, continuous, require, corresponding] [image, input, conditional, color, method, figure, conference, reference, amos, synthesis, tlvdb, ieee, synthesize, generative, based, tone, generator, unconditional, control, translation, real, timelapse, day, visually, change, captured, proposed] [output, neural, network, deep, residual, enc, upsampling] [generate, adversarial, discriminator, generation, model, visual, generated] [propose, semantic, european, baseline] [training, learning, set, train, loss, datasets, domain, learn, data, transfer, effectively, sample, trained, log, existing, task, test]
@InProceedings{Nam_2019_CVPR,
  author = {Nam, Seonghyeon and Ma, Chongyang and Chai, Menglei and Brendel, William and Xu, Ning and Joo Kim, Seon},
  title = {End-To-End Time-Lapse Video Synthesis From a Single Outdoor Image},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
GIF2Video: Color Dequantization and Temporal Interpolation of GIF Images
Yang Wang, Haibin Huang, Chuan Wang, Tong He, Jue Wang, Minh Hoai


Graphics Interchange Format (GIF) is a highly portable graphics format that is ubiquitous on the Internet. Despite their small sizes, GIF images often contain undesirable visual artifacts such as flat color regions, false contours, color shift, and dotted patterns. In this paper, we propose GIF2Video, the first learning-based method for enhancing the visual quality of GIFs in the wild. We focus on the challenging task of GIF restoration by recovering information lost in the three steps of GIF creation: frame sampling, color quantization, and color dithering. We first propose a novel CNN architecture for color dequantization. It is built upon a compositional architecture for multi-step color correction, with a comprehensive loss function designed to handle large quantization errors. We then adapt the SuperSlomo network for temporal interpolation of GIF frames. We introduce two large datasets, namely GIF-Faces and GIF-Moments, for both training and evaluation. Experimental results show that our method can significantly improve the visual quality of GIFs, and outperforms direct baseline and state-of-the-art approaches.
[video, frame, temporal, flow, optical, dataset, current, multiple] [ground, truth, corresponding, algorithm, error, estimation, compute, equation, approach, reconstruction, computed] [color, image, gif, gifs, dequantization, input, ccdnet, figure, palette, proposed, interpolation, method, contour, psnr, dithering, quality, flat, dithered, superslomo, row, ssim, resolution, pixel, separate, ieee, removal, unfolding, ladv] [quantization, network, gradient, performance, deep, original, small, architecture, process, table, designed, output] [visual, adversarial, adv, compositional, model, iterative] [false, propose, detection, dotted, three] [loss, function, training, trained, large, learning, update, train, task, set]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Yang and Huang, Haibin and Wang, Chuan and He, Tong and Wang, Jue and Hoai, Minh},
  title = {GIF2Video: Color Dequantization and Temporal Interpolation of GIF Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis
Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, Ming-Hsuan Yang


Most conditional generation tasks expect diverse outputs given a single conditional context. However, conditional generative adversarial networks (cGANs) often focus on the prior conditional information and ignore the input noise vectors, which contribute to the output variations. Recent attempts to resolve the mode collapse issue for cGANs are usually task-specific and computationally expensive. In this work, we propose a simple yet effective regularization term to address the mode collapse issue for cGANs. The proposed method explicitly maximizes the ratio of the distance between generated images with respect to the corresponding latent codes, thus encouraging the generators to explore more minor modes during training. This mode seeking regularization term is readily applicable to various conditional generation tasks without imposing training overhead or modifying the original network structures. We validate the proposed algorithm on three conditional image synthesis tasks including categorical generation, image-to-image translation, and text-to-image synthesis with different baseline models. Both qualitative and quantitative results demonstrate the effectiveness of the proposed regularization method for improving diversity without loss of quality.
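The regularization term is simple enough to state directly; a sketch assuming a conditional generator G(c, z) (variable names are mine): the ratio of image distance to latent distance is maximized, which in practice can be done by minimizing its inverse.

import torch

def mode_seeking_reg(G, c, z_dim=128, eps=1e-5):
    # Encourage two latent codes to map to visibly different images for the
    # same condition c, mitigating mode collapse. Returned value is minimized
    # together with the usual cGAN losses.
    z1 = torch.randn(c.size(0), z_dim, device=c.device)
    z2 = torch.randn(c.size(0), z_dim, device=c.device)
    img_dist = torch.mean(torch.abs(G(c, z1) - G(c, z2)))
    z_dist = torch.mean(torch.abs(z1 - z2))
    return 1.0 / (img_dist / (z_dist + eps) + eps)   # small when outputs differ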
[term, dataset, focus, multiple] [additional, respect, corresponding, problem] [proposed, latent, image, conditional, generative, method, translation, real, synthesis, cgans, figure, noise, drit, input, quantitative, lpips, demonstrate, generator, qualitative] [regularization, better, applied, table, ratio, original, network, output, effective] [mode, msgan, adversarial, collapse, generated, text, diversity, seeking, conditioned, ndb, diverse, generation, generate, gans, fid, evaluate, jsd, dcgan, simple, easily, visual, lori, model, discriminator, msgans] [propose, baseline, ignore, three, bin] [distance, training, distribution, data, categorical, code, existing, address, minor, issue, space, mapped, objective, loss, alleviate]
@InProceedings{Mao_2019_CVPR,
  author = {Mao, Qi and Lee, Hsin-Ying and Tseng, Hung-Yu and Ma, Siwei and Yang, Ming-Hsuan},
  title = {Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Pluralistic Image Completion
Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai


Most image completion methods produce only one result for each masked input, although there may be many reasonable possibilities. In this paper, we present an approach for pluralistic image completion - the task of generating multiple and diverse plausible solutions for image completion. A major challenge faced by learning-based approaches is that there is usually only one ground truth training instance per label. As such, sampling from conditional VAEs still leads to minimal diversity. To overcome this, we propose a novel and probabilistically principled framework with two parallel paths. One is a reconstructive path that utilizes the single given ground truth to get the prior distribution of missing parts and rebuild the original image from this distribution. The other is a generative path for which the conditional prior is coupled to the distribution obtained in the reconstructive path. Both are supported by GANs. We also introduce a new short+long term attention layer that exploits distant relations among decoder and encoder features, improving appearance consistency. When tested on datasets with buildings (Paris), faces (CelebA-HQ), and natural images (ImageNet), our method not only generated higher-quality completion results, but also produced multiple and diverse plausible outputs.
[term, multiple, framework, hidden, dataset, recognition] [completion, computer, visible, vision, single, pattern, directly, pipeline, ground] [image, conditional, generative, prior, latent, conference, masked, ieee, content, figure, missing, method, input, reconstructive, face, comparison, based, appearance, quantitative, inpainting, pluralistic, proposed, acm, alexei] [original, network, neural, output, lower, layer, full, deep, structure, imagenet, processing] [attention, path, model, encoder, partial, diverse, decoder, generated, generation, generate, adversarial, arxiv, preprint, discriminator, cvae, diversity, sampled, plausible] [feature, contextual, instance, context, european, singapore] [training, loss, sampling, set, distribution, large, trained, learned, learning, sample, function]
@InProceedings{Zheng_2019_CVPR,
  author = {Zheng, Chuanxia and Cham, Tat-Jen and Cai, Jianfei},
  title = {Pluralistic Image Completion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Salient Object Detection With Pyramid Attention and Salient Edges
Wenguan Wang, Shuyang Zhao, Jianbing Shen, Steven C. H. Hoi, Ali Borji


This paper presents a new method for detecting salient objects in images using convolutional neural networks (CNNs). The proposed network, named PAGE-Net, offers two key contributions. The first is the exploitation of an essential pyramid attention structure for salient object detection. This enables the network to concentrate more on salient regions while considering multi-scale saliency information. Such a stacked attention design provides a powerful tool to efficiently improve the representation ability of the corresponding network layer with an enlarged receptive field. The second contribution lies in the emphasis on the importance of salient edges. Salient edge information offers a strong cue to better segment salient objects and refine object boundaries. To this end, our model is equipped with a salient edge detection module, which is learned for precise salient boundary estimation. This encourages better edge-preserving salient object segmentation. Exhaustive experiments confirm that the proposed pyramid attention and salient edges are effective for salient object detection. We show that our deep saliency model outperforms state-of-the-art approaches for several benchmarks with a fast processing speed (25fps on one GPU).
[jianbing, previous, learns] [corresponding, field, ground, truth, explicit, contrast, single] [image, proposed, method, based, ieee, figure, background, identity, quantitative, comparison, enlarged] [deep, network, layer, better, neural, receptive, convolution, convolutional, design, architecture, conv, table, original, performance, scale, stacked, effective, efficient, equipped, illustration, inference, speed] [attention, model, visual, ability] [salient, saliency, edge, object, detection, module, pyramid, feature, map, mae, huchuan, sod, eccsd, wenguan, boundary, hierarchical, detailed, ali, global, readout, xiang, propose, segmentation, spatial, detecting, enhance] [learning, training, representation, loss, essential, novel, set, learned]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Wenguan and Zhao, Shuyang and Shen, Jianbing and Hoi, Steven C. H. and Borji, Ali},
  title = {Salient Object Detection With Pyramid Attention and Salient Edges},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Latent Filter Scaling for Multimodal Unsupervised Image-To-Image Translation
Yazeed Alharbi, Neil Smith, Peter Wonka


In multimodal unsupervised image-to-image translation tasks, the goal is to translate an image from the source domain to many images in the target domain. We present a simple method that produces higher quality images than the current state of the art while maintaining the same amount of multimodal diversity. Previous methods follow the unconditional approach of trying to map the latent code directly to a full-size image. This leads to complicated network architectures with several introduced hyperparameters to tune. By treating the latent code as a modifier of the convolutional filters, we produce multimodal output while maintaining the traditional Generative Adversarial Network (GAN) loss and without additional hyperparameters. The only tuning required by our method controls the tradeoff between variability and quality of generated images. Furthermore, we achieve disentanglement between source domain content and target domain style for free as a by-product of our formulation. We perform qualitative and quantitative experiments showing the advantages of our method compared with the state of the art on multiple benchmark image-to-image translation datasets.
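The core idea of treating the latent code as a filter modifier can be sketched as a per-channel scaling of a convolution's output, which is equivalent to scaling its filters. The module below is a minimal PyTorch sketch under that assumption; layer sizes and names are illustrative only.

import torch
import torch.nn as nn

class LatentFilterScaling(nn.Module):
    # Scales the feature maps of a convolution by per-channel factors
    # predicted from the latent code z (equivalent to scaling the filters).
    def __init__(self, z_dim, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.to_scale = nn.Linear(z_dim, out_ch)

    def forward(self, x, z):
        scale = self.to_scale(z).unsqueeze(-1).unsqueeze(-1)  # (B, out_ch, 1, 1)
        return self.conv(x) * scale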
[previous, multiple, work, follow] [approach, reconstruction, computer, vision, case, directly, problem, additional] [latent, input, image, method, quality, translation, style, bicyclegan, unconditional, figure, produce, generative, disentanglement, mapping, conditional, variability, munit, content, quantitative, variety, conference, disentangled, user, result, high, preferred, ieee, control, variation, change, treating, qualitative, limitation] [output, network, filter, scaling, higher, convolutional, standard, weight, deep, scalar] [gan, multimodal, generated, vector, adversarial, diversity, arxiv, preprint, gans, find, simple, diverse, encoded, generating, adding] [map, feature, semantic] [code, domain, target, source, loss, unsupervised, learning, mapped, training, pair, transfer, aim]
@InProceedings{Alharbi_2019_CVPR,
  author = {Alharbi, Yazeed and Smith, Neil and Wonka, Peter},
  title = {Latent Filter Scaling for Multimodal Unsupervised Image-To-Image Translation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Attention-Aware Multi-Stroke Style Transfer
Yuan Yao, Jianqiang Ren, Xuansong Xie, Weidong Liu, Yong-Jin Liu, Jun Wang


Neural style transfer has drawn considerable attention from both academic and industrial fields. Although visual effect and efficiency have been significantly improved, existing methods are unable to coordinate the spatial distribution of visual attention between the content image and the stylized image, or to render diverse levels of detail via different brush strokes. In this paper, we tackle these limitations by developing an attention-aware multi-stroke style transfer model. We first propose to assemble a self-attention mechanism into a style-agnostic reconstruction autoencoder framework, from which the attention map of a content image can be derived. By performing multi-scale style swap on content features and style features, we produce multiple feature maps reflecting different stroke patterns. A flexible fusion strategy is further presented to incorporate the salient characteristics from the attention map, which allows integrating multiple stroke patterns into different spatial regions of the output image harmoniously. We demonstrate the effectiveness of our method and show that it generates stylized images with multiple stroke patterns that compare favorably against state-of-the-art methods.
[fusion, multiple, work, performing, perform] [reconstruction, field, match, allows, corresponding, position] [style, stroke, content, image, stylized, figure, method, swap, control, consistency, arbitrary, proposed, patch, wct, stylization, adain, result, perceptual, texture, synthesis, detail, intermediate, user] [size, neural, network, number, flexible, effectiveness, convolutional, denotes, residual, smoothing, better, receptive, weight, compared, process, deep, larger, scale] [attention, visual, introduce, mechanism, generate, model, encoder, arxiv, preprint, automatic, evaluate, enables, decoder, selfattention] [feature, map, spatial, module, saliency, salient, three, propose, level, integrate, integrating, guidance] [transfer, autoencoder, loss, strategy, distribution]
@InProceedings{Yao_2019_CVPR,
  author = {Yao, Yuan and Ren, Jianqiang and Xie, Xuansong and Liu, Weidong and Liu, Yong-Jin and Wang, Jun},
  title = {Attention-Aware Multi-Stroke Style Transfer},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Feedback Adversarial Learning: Spatial Feedback for Improving Generative Adversarial Networks
Minyoung Huh, Shao-Hua Sun, Ning Zhang


We propose a feedback adversarial learning (FAL) framework that can improve existing generative adversarial networks by leveraging spatial feedback from the discriminator. We formulate the generation task as a recurrent framework, in which the discriminator's feedback is integrated into the feedforward path of the generation process. Specifically, the generator conditions on the discriminator's spatial output response and its previous generation to improve generation quality over time, allowing the generator to attend to and fix its previous mistakes. To effectively utilize the feedback, we propose an adaptive spatial transform layer, which learns to spatially modulate feature maps from its previous generation and the error signal from the discriminator. We demonstrate that one can easily adapt FAL to existing adversarial learning frameworks on a wide range of tasks, including image generation, image-to-image translation, and voxel generation.
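A minimal sketch of an adaptive spatial transform in PyTorch is given below: the discriminator's per-pixel response is resized to the feature resolution and used to predict a spatial scale and shift for the generator features. The exact parameterization (two small convolutions) is our assumption, not necessarily the paper's.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSpatialTransform(nn.Module):
    # Modulates generator features with the discriminator's spatial response
    # to the previous generation (shape (B, 1, h, w)).
    def __init__(self, feat_ch):
        super().__init__()
        self.gamma = nn.Conv2d(1, feat_ch, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(1, feat_ch, kernel_size=3, padding=1)

    def forward(self, feat, d_response):
        d = F.interpolate(d_response, size=feat.shape[-2:],
                          mode='bilinear', align_corners=False)
        return feat * (1 + self.gamma(d)) + self.beta(d)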
[time, previous, signal, consists, iteratively, recurrent] [computer, international, vision, pattern, voxels, voxel, depth, ground, truth, compute, provide] [feedback, image, generator, conference, generative, input, figure, real, ieee, translation, method, proposed, voxelgan, latent, quality, transform, demonstrate, lpips, synthesis, celeba, conditional] [neural, output, processing, network, adaptive, table, accuracy, deep, modulate] [generation, adversarial, discriminator, generated, model, gan, generate, goal, arxiv, vector, gans, decoder, step, sampled, encoder, encoding, preprint, attend] [spatial, improve, response, map, propose, segmentation, score, feature, european, utilize] [learning, trained, training, train, loss, task, sample, dimension, existing, classification, unsupervised, data, distribution]
@InProceedings{Huh_2019_CVPR,
  author = {Huh, Minyoung and Sun, Shao-Hua and Zhang, Ning},
  title = {Feedback Adversarial Learning: Spatial Feedback for Improving Generative Adversarial Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Pyramid-Context Encoder Network for High-Quality Image Inpainting
Yanhong Zeng, Jianlong Fu, Hongyang Chao, Baining Guo


High-quality image inpainting requires filling missing regions in a damaged image with plausible content. Existing works either fill the regions by copying high-resolution patches or by generating semantically-coherent patches from the region context, while neglecting the fact that both visual and semantic plausibility are highly demanded. In this paper, we propose a Pyramid-context Encoder Network (denoted as PEN-Net) for image inpainting by deep generative models. The proposed PEN-Net is built upon a U-Net structure with three tailored components, i.e., a pyramid-context encoder, a multi-scale decoder, and an adversarial training loss. First, we adopt a U-Net as the backbone, which can encode the context of an image from high-resolution pixels into high-level semantic features and decode the features reversely. Second, we propose a pyramid-context encoder, which progressively learns region affinity by attention from a high-level semantic feature map and transfers the learned attention to its adjacent high-resolution feature map. As the missing content can be filled by attention transfer from deep to shallow in a pyramid fashion, both visual and semantic coherence for image inpainting can be ensured. Third, we further propose a multi-scale decoder with deeply-supervised pyramid losses and an adversarial loss. Such a design not only results in fast convergence in training, but also in more realistic results in testing. Extensive experiments on a broad range of datasets show the superior performance of the proposed network.
[coherence, prediction, work] [] [image, missing, inpainting, proposed, figure, latent, fill, generative, input, atn, pconv, reconstructed, atns, patchmatch, filled, texture, real, masked, resolution, qualitative, damaged, face, synthesis] [network, deep, table, compact, effectiveness, higher, layer, performance, output, full, shallow, skip, dilated] [attention, encoder, decoder, adversarial, generated, generate, model, visual, encode, introduced] [feature, pyramid, semantic, filling, region, propose, context, affinity, mask, map, inside, final, refine, contextual, semantics, level] [transfer, loss, training, viewed, learned, datasets, distance]
@InProceedings{Zeng_2019_CVPR,
  author = {Zeng, Yanhong and Fu, Jianlong and Chao, Hongyang and Guo, Baining},
  title = {Learning Pyramid-Context Encoder Network for High-Quality Image Inpainting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Example-Guided Style-Consistent Image Synthesis From Semantic Labeling
Miao Wang, Guo-Ye Yang, Ruilong Li, Run-Ze Liang, Song-Hai Zhang, Peter M. Hall, Shi-Min Hu


Example-guided image synthesis aims to synthesize an image from a semantic label map and an exemplary image indicating style. We use the term "style" in this problem to refer to implicit characteristics of images, for example: in portraits "style" includes gender, racial identity, age, and hairstyle; in full body pictures it includes clothing; in street scenes it refers to weather, time of day, and the like. A semantic label map in these cases indicates facial expression, full body pose, or scene segmentation. We propose a solution to the example-guided image synthesis problem using conditional generative adversarial networks with style consistency. Our key contributions are (i) a novel style consistency discriminator to determine whether a pair of images are consistent in style; (ii) an adaptive semantic consistency loss; and (iii) a training data sampling strategy for synthesizing style-consistent results to the exemplar. We demonstrate the efficiency of our method on face, dance and street view synthesis tasks.
[video, framework, human, dataset] [scene, view, consistent, pose, corresponding, problem, measured] [image, style, synthesis, input, consistency, synthetic, method, figure, photorealistic, synthesize, translation, generative, dsc, real, dance, face, proposed, munit, conditional, generator, facial, lscadv, lsc, sketchface, alexei, pairedmunit, study, based, result] [adaptive, table, standard, network, deep, vgg, output, full] [adversarial, discriminator, model, arxiv, sampled, preprint, semantically, generated, visual, goal, abstract] [semantic, map, street, feature, parsing, guidance, segmentation, distinguish, baseline] [label, loss, exemplar, transfer, domain, training, pair, data, learning, target, unsupervised, sample, novel, function, sampling]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Miao and Yang, Guo-Ye and Li, Ruilong and Liang, Run-Ze and Zhang, Song-Hai and Hall, Peter M. and Hu, Shi-Min},
  title = {Example-Guided Style-Consistent Image Synthesis From Semantic Labeling},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MirrorGAN: Learning Text-To-Image Generation by Redescription
Tingting Qiao, Jing Zhang, Duanqing Xu, Dacheng Tao


Generating an image from a given text description has two goals: visual realism and semantic consistency. Although significant progress has been made in generating high-quality and visually realistic images using generative adversarial networks, guaranteeing semantic consistency between the text description and visual content remains very challenging. In this paper, we address this problem by proposing a novel global-local attentive and semantic-preserving text-to-image-to-text framework called MirrorGAN. MirrorGAN exploits the idea of learning text-to-image generation by redescription and consists of three modules: a semantic text embedding module (STEM), a global-local collaborative attentive module for cascaded image generation (GLAM), and a semantic text regeneration and alignment module (STREAM). STEM generates word- and sentence-level embeddings. GLAM has a cascaded architecture for generating target images from coarse to fine scales, leveraging both local word attention and global sentence attention to progressively enhance the diversity and semantic consistency of the generated images. STREAM seeks to regenerate the text description from the generated image, which semantically aligns with the given text description. Thorough experiments on two public benchmark datasets demonstrate the superiority of MirrorGAN over other representative state-of-the-art methods.
[recognition, stream, framework, lstm] [computer, vision, pattern, local, consistent, underlying, ground, international, corresponding, truth, reconstruction] [image, conference, ieee, consistency, proposed, generator, realistic, figure, visually, generative, input, realism, collaborative, superiority] [neural, processing, better, table, small, structure] [text, mirrorgan, attention, generated, visual, generation, adversarial, bird, attngan, model, inception, generating, semantically, word, description, sentence, glam, white, sca, captioning, diversity, blue, generate, natural, machine, common, progressively, belly] [semantic, feature, global, coco, score, attentive, stage, module, cascaded, semantics, three, propose, enhance] [cub, learning, embedding, loss, test, training, idea, dimension, alignment]
@InProceedings{Qiao_2019_CVPR,
  author = {Qiao, Tingting and Zhang, Jing and Xu, Duanqing and Tao, Dacheng},
  title = {MirrorGAN: Learning Text-To-Image Generation by Redescription},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Light Field Messaging With Deep Photographic Steganography
Eric Wengrowski, Kristin Dana


We develop Light Field Messaging (LFM), a process of embedding, transmitting, and receiving hidden information in video that is displayed on a screen and captured by a handheld camera. The goal of the system is to minimize perceived visual artifacts of the message embedding, while simultaneously maximizing the accuracy of message recovery on the camera side. LFM requires photographic steganography for embedding messages that can be displayed and camera-captured. Unlike digital steganography, the embedding requirements are significantly more challenging due to the combined effect of the screen's radiometric emittance function, the camera's sensitivity function, and the camera-display relative geometry. We devise and train a network to jointly learn a deep embedding and recovery algorithm that requires no multi-frame synchronization. A key novel component is the camera display transfer function (CDTF) to model the camera-display pipeline. To learn this CDTF we introduce a dataset (Camera-Display 1M) of 1,000,000 camera-captured images collected from 25 camera-display pairs. The result of this work is a high-performance real-time LFM system using consumer-grade displays and smartphone cameras.
[dataset, hidden, work, temporal, early] [camera, international, light, computer, field, algorithm, vision, approach, single, robust, simultaneously, pattern] [image, steganography, lfm, carrier, conference, display, photographic, digital, ber, figure, coded, cdtf, perceptual, ieee, recovery, pixel, radiometric, recovered, method, acer, electronic, prior, quality, frontal, messaging, displayed, based, basler, transform, baluja, proposed, high, predator, lsb] [network, deep, neural, convolutional, architecture, channel, weight, fixed, table, mobile, computing] [message, model, visual, communication, arxiv, preprint, steganalysis, generated, simple, encoded, annual] [spatial, feature] [learning, trained, function, embedding, training, loss, transfer, minimize, learn, domain, objective, existing]
@InProceedings{Wengrowski_2019_CVPR,
  author = {Wengrowski, Eric and Dana, Kristin},
  title = {Light Field Messaging With Deep Photographic Steganography},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Im2Pencil: Controllable Pencil Illustration From Photographs
Yijun Li, Chen Fang, Aaron Hertzmann, Eli Shechtman, Ming-Hsuan Yang


We propose a high-quality photo-to-pencil translation method with fine-grained control over the drawing style. This is a challenging task due to multiple stroke types (e.g., outline and shading), structural complexity of pencil shading (e.g., hatching), and the lack of aligned training data pairs. To address these challenges, we develop a two-branch model that learns separate filters for generating sketchy outlines and tonal shading from a collection of pencil drawings. We create training data pairs by extracting clean outlines and tonal illustrations from original pencil drawings using image filtering techniques, and we manually label the drawing styles. In addition, our model creates different pencil styles (e.g., line sketchiness and shading style) in a user-controllable manner. Experimental results on different types of pencil drawings show that the proposed algorithm performs favorably against existing methods in terms of quality, diversity and user evaluations.
[extract, perform] [well, procedural, algorithm, single, provide, rendering] [pencil, shading, outline, figure, drawing, style, input, tone, image, paired, method, translation, xdog, hatching, control, photo, abstraction, user, stylization, produce, row, tonal, proposed, real, cyclegan, synthesis, result, perceptual, acm, based, texture, gatys, clean, smooth, extracted] [network, neural, filter, selection, deep, applied, output, adjust, filtered, convolutional] [model, generate, adversarial, generated, example] [edge, boundary, branch, detector, three, propose] [training, data, loss, learning, transfer, main, set, existing, select, trained, sketchy, train, draw, sketch]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yijun and Fang, Chen and Hertzmann, Aaron and Shechtman, Eli and Yang, Ming-Hsuan},
  title = {Im2Pencil: Controllable Pencil Illustration From Photographs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
When Color Constancy Goes Wrong: Correcting Improperly White-Balanced Images
Mahmoud Afifi, Brian Price, Scott Cohen, Michael S. Brown


This paper focuses on correcting a camera image that has been improperly white-balanced. This situation occurs when a camera's auto white balance fails or when the wrong manual white-balance setting is used. Even after decades of computational color constancy research, there are no effective solutions to this problem. The challenge lies not in identifying what the correct white balance should have been, but in the fact that the in-camera white-balance procedure is followed by several camera-specific nonlinear color manipulations that make it challenging to correct the image's colors in post-processing. This paper introduces the first method to explicitly address this problem. Our method is enabled by a dataset of over 65,000 pairs of incorrectly white-balanced images and their corresponding correctly white-balanced images. Using this dataset, we introduce a k-nearest neighbor strategy that is able to compute a nonlinear color mapping function to correct the image's colors. We show our method is highly effective and generalizes well to camera models not in the training set.
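The color mapping idea can be illustrated with a least-squares fit over a polynomial expansion of the incorrectly white-balanced colors; the 9-term kernel and helper names below are assumptions made for a compact sketch (the paper's kernel function and k-nearest-neighbor blending of pre-computed mappings are not reproduced).

import numpy as np

def kernel_expand(rgb):
    # Simple polynomial expansion of (N, 3) RGB values.
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    return np.stack([r, g, b, r*g, r*b, g*b, r*r, g*g, b*b], axis=1)

def fit_color_mapping(src_rgb, tgt_rgb):
    # Least-squares mapping M so that kernel_expand(src_rgb) @ M ~= tgt_rgb,
    # fitted from a pair of incorrectly / correctly white-balanced images.
    phi = kernel_expand(src_rgb)
    M, *_ = np.linalg.lstsq(phi, tgt_rgb, rcond=None)
    return M

# usage: corrected = kernel_expand(img.reshape(-1, 3)) @ M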
[dataset, represented, work] [camera, ground, truth, scene, rendered, calibration, chart, illumination, histogram, matrix, corresponding, computer, michael, compute, vision, additional, estimation, computed, analysis, exact, pattern, linear, approach, supplemental, error] [color, image, correction, srgb, diagonal, input, radiometric, balance, method, pca, reference, based, adobe, constancy, figure, user, mapping, ieee, improperly, nonlinear, result, proposed, study, achromatic, igt, correcting, gamma, transform, photoshop] [applied, computational, performed, applying, number, represents, output, effective, standard] [white, correct, generated, incorrect, picture, machine, example, vector, correctly, include, find] [feature, final, average] [training, set, function, data, incorrectly, space, target]
@InProceedings{Afifi_2019_CVPR,
  author = {Afifi, Mahmoud and Price, Brian and Cohen, Scott and Brown, Michael S.},
  title = {When Color Constancy Goes Wrong: Correcting Improperly White-Balanced Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Beyond Volumetric Albedo -- A Surface Optimization Framework for Non-Line-Of-Sight Imaging
Chia-Yin Tsai, Aswin C. Sankaranarayanan, Ioannis Gkioulekas


Non-line-of-sight (NLOS) imaging is the problem of reconstructing properties of scenes occluded from a sensor, using measurements of light that indirectly travels from the occluded scene to the sensor through intermediate diffuse reflections. We introduce an analysis-by-synthesis framework that can reconstruct complex shape and reflectance of an NLOS object. Our framework deviates from prior work on NLOS reconstruction, by directly optimizing for a surface representation of the NLOS object, in place of commonly employed volumetric representations. At the core of our framework is a new rendering formulation that efficiently computes derivatives of radiometric measurements with respect to NLOS geometry and reflectance, while accurately modeling the underlying light transport physics. By coupling this with stochastic optimization and geometry processing techniques, we are able to reconstruct NLOS surface at a level of detail significantly exceeding what is possible with previous volumetric reconstruction methods.
[framework, multiple, perform] [surface, nlos, rendering, mesh, reconstruction, optimization, transient, equation, snlos, reflectance, geometry, light, shape, scene, monte, differentiable, albedo, volume, computer, volumetric, respect, inverse, visible, integral, diffuse, active, pipeline, formation, virtual, visibility, problem, occluded, formulation, triangular, rendered, geometric, additionally, point, initial, algorithm, slos, ramesh, ioannis, reconstructing, sensor] [imaging, figure, image, reconstruct, radiometric, acm, isotropic, based, conference, detail, method, intensity, result, quality] [gradient, stochastic, computational, processing, implementation, descent, efficiently] [model, carlo, path] [object, level, improve] [function, transport, source, loss, setting, minimize, adam, representation]
@InProceedings{Tsai_2019_CVPR,
  author = {Tsai, Chia-Yin and Sankaranarayanan, Aswin C. and Gkioulekas, Ioannis},
  title = {Beyond Volumetric Albedo -- A Surface Optimization Framework for Non-Line-Of-Sight Imaging},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Reflection Removal Using a Dual-Pixel Sensor
Abhijith Punnappurath, Michael S. Brown


Reflection removal is the challenging problem of removing unwanted reflections that occur when imaging a scene that is behind a pane of glass. In this paper, we show that most cameras have an overlooked mechanism that can greatly simplify this task. Specifically, modern DSLR and smartphone cameras use dual pixel (DP) sensors that have two photodiodes per pixel to provide two sub-aperture views of the scene from a single captured image. "Defocus-disparity" cues, which are natural by-products of the DP sensor encoded within these two sub-aperture views, can be used to distinguish between image gradients belonging to the in-focus background and those caused by reflection interference. This gradient information can then be incorporated into an optimization framework to recover the background layer with higher accuracy than currently possible from the single captured image. As part of this work, we provide the first image dataset for reflection removal consisting of the sub-aperture views from the DP sensor.
[dataset, capture, work, motion, focus] [left, scene, single, disparity, sensor, depth, estimated, provide, camera, light, aperture, ground, lens, reflected, field, truth, well, point, approach, compute, position, view, confidence, formation, algorithm, equation] [reflection, background, image, removal, pixel, captured, separation, input, method, intensity, controlled, blur, dual, based, glass, half, proposed, defocus, ieee, figure, imaging, blurred, sharp, recover, difference, glv, grv, competing] [layer, gradient, cost, size, shift, hardware, deep] [observed, model, example, natural, sum] [map, object] [data, distribution, function, learning]
@InProceedings{Punnappurath_2019_CVPR,
  author = {Punnappurath, Abhijith and Brown, Michael S.},
  title = {Reflection Removal Using a Dual-Pixel Sensor},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Practical Coding Function Design for Time-Of-Flight Imaging
Felipe Gutierrez-Barragan, Syed Azer Reza, Andreas Velten, Mohit Gupta


The depth resolution of a continuous-wave time-of-flight (CW-ToF) imaging system is determined by its coding functions. Recently, there has been growing interest in the design of new high-performance CW-ToF coding functions. However, these functions are typically designed in a hardware agnostic manner, i.e., without considering the practical device limitations, such as bandwidth, source power, digital (binary) function generation. Therefore, despite theoretical improvements, practical implementation of these functions remains a challenge. We present a constrained optimization approach for designing practical coding functions that adhere to hardware constraints. The optimization problem is non-convex with a large search space and no known globally optimal solutions. To make the problem tractable, we design an iterative, alternating least-squares algorithm, along with convex relaxation of the constraints. Using this approach, we design high-performance coding functions that can be implemented on existing hardware with minimal modifications. We demonstrate the performance benefits of the resulting functions via extensive simulations and a hardware prototype.
[time, framework] [hamiltonian, demodulation, depth, sinusoid, practical, optimization, range, light, bandwidth, problem, scene, impulse, sensor, convex, limited, pmax, decomposition, error, constraint, approach, tof, total, adhere, laser, square, unwrapping, supplementary, mde, illumination, point, vision, exposure, infinite] [figure, high, imaging, input, frequency, intensity, image, ieee, blind, proposed, resolution, arbitrary] [coding, correlation, modulation, low, binary, hardware, design, performance, power, output, designing, achieve, phase] [system, constrained, consider, model, find, implemented] [mae, peak, fmax, complementary] [function, snr, source, maximum, code, objective, large, space, alternating]
@InProceedings{Gutierrez-Barragan_2019_CVPR,
  author = {Gutierrez-Barragan, Felipe and Azer Reza, Syed and Velten, Andreas and Gupta, Mohit},
  title = {Practical Coding Function Design for Time-Of-Flight Imaging},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Meta-SR: A Magnification-Arbitrary Network for Super-Resolution
Xuecai Hu, Haoyuan Mu, Xiangyu Zhang, Zilei Wang, Tieniu Tan, Jian Sun


Recent research on super-resolution has achieved great success due to the development of deep convolutional neural networks (DCNNs). However, super-resolution of arbitrary scale factor has been ignored for a long time. Most previous researchers regard super-resolution of different scale factors as independent tasks. They train a specific model for each scale factor, which is inefficient in computing, and prior work only takes the super-resolution of several integer scale factors into consideration. In this work, we propose a novel method called Meta-SR to firstly solve super-resolution of arbitrary scale factor (including non-integer scale factors) with a single model. In our Meta-SR, the Meta-Upscale Module is proposed to replace the traditional upscale module. For arbitrary scale factor, the Meta-Upscale Module dynamically predicts the weights of the upscale filters by taking the scale factor as input and uses these weights to generate the HR image of arbitrary size. For any low-resolution image, our Meta-SR can continuously zoom in on it with arbitrary scale factor by only using a single model. We evaluated the proposed method through extensive experiments on widely used benchmark datasets on single image super-resolution. The experimental results show the superiority of our Meta-Upscale.
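The Meta-Upscale idea of predicting filter weights from the scale factor can be sketched as a small fully connected network mapping a per-pixel coordinate offset and the inverse scale to a convolution filter; the input triple, hidden size, and class name below are assumptions made for illustration, not the paper's exact module.

import torch
import torch.nn as nn

class MetaUpscale(nn.Module):
    # Predicts, for each HR output position, the weights of the upscale
    # filter from (dx, dy, 1/scale) instead of using a fixed upscale layer.
    def __init__(self, in_ch, out_ch=3, ksize=3, hidden=256):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, ksize
        self.weight_net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, in_ch * out_ch * ksize * ksize))

    def forward(self, offsets):
        # offsets: (N, 3) rows of (dx, dy, 1/scale) for N HR pixel positions;
        # returns one (out_ch, in_ch, k, k) filter per position.
        w = self.weight_net(offsets)
        return w.view(-1, self.out_ch, self.in_ch, self.k, self.k)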
[prediction, predict, recognition, multiple, time, work] [single, computer, vision, pattern, solve, projection, dense, corresponding, june, practical, coordinate] [image, pixel, arbitrary, proposed, conference, ieee, input, bicubic, sisr, zoom, based, method, interpolation, superresolution, mapping, psnr, ssim] [scale, factor, upscale, network, weight, convolution, residual, deep, rdn, number, convolutional, called, dynamically, better, size, filter, neural, flr, vij, integer, connected, upsampling, science, firstly, metaupscale, upscaled, efficient, compared, typical, performance] [model, generate, generated, introduced] [feature, module, baseline, predicted, map, fully, final, location] [learning, train, training, function, novel, test]
@InProceedings{Hu_2019_CVPR,
  author = {Hu, Xuecai and Mu, Haoyuan and Zhang, Xiangyu and Wang, Zilei and Tan, Tieniu and Sun, Jian},
  title = {Meta-SR: A Magnification-Arbitrary Network for Super-Resolution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multispectral and Hyperspectral Image Fusion by MS/HS Fusion Net
Qi Xie, Minghao Zhou, Qian Zhao, Deyu Meng, Wangmeng Zuo, Zongben Xu


Hyperspectral imaging can help better understand the characteristics of different materials, compared with traditional image systems. However, only high-resolution multispectral (HrMS) and low-resolution hyperspectral (LrHS) images can generally be captured at video rate in practice. In this paper, we propose a model-based deep learning approach for merging HrMS and LrHS images to generate a high-resolution hyperspectral (HrHS) image. Specifically, we construct a novel MS/HS fusion model which takes into consideration the observation models of the low-resolution images and the low-rankness knowledge along the spectral mode of the HrHS image. We then design an iterative algorithm to solve the model by exploiting the proximal gradient method. Then, by unfolding the designed algorithm, we construct a deep network, called MS/HS Fusion Net, with the proximal operators and model parameters learned by convolutional neural networks. Experimental results on simulated and real data substantiate the superiority of our method both visually and quantitatively as compared with state-of-the-art methods along this line of research.
[fusion, current] [matrix, observation, algorithm, computer, vision, supplementary, rgb, ground, truth, journal, pattern, international, approach] [image, hrms, spectral, ieee, lrhs, hrhs, proposed, hyperspectral, remote, multispectral, method, prior, real, based, sensing, comparison, simulated, geoscience, conference, band, resolution, pansharpening, competing, proximal, figure, rhw, cnmf, traditional, input, cave, pnn] [network, deep, performance, resnet, tensor, number, structure, equivalent, size, net, better, operator, original, design, neural, residual, sparse, compared, convolutional, regularization, table, gsa] [model, iterative, easily, generate] [spatial, stage, easy, final, average, fuse] [data, training, learning, set, testing, knowledge, learn, exploiting]
@InProceedings{Xie_2019_CVPR,
  author = {Xie, Qi and Zhou, Minghao and Zhao, Qian and Meng, Deyu and Zuo, Wangmeng and Xu, Zongben},
  title = {Multispectral and Hyperspectral Image Fusion by MS/HS Fusion Net},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Attraction Field Representation for Robust Line Segment Detection
Nan Xue, Song Bai, Fudong Wang, Gui-Song Xia, Tianfu Wu, Liangpei Zhang


This paper presents a region-partition based attraction field dual representation for line segment maps, and thus poses the problem of line segment detection (LSD) as the region coloring problem. The latter is then addressed by learning deep convolutional neural networks (ConvNets) for accuracy, robustness and efficiency. For a 2D line segment map, our dual representation consists of three components: (i) a region-partition map in which every pixel is assigned to one and only one line segment; (ii) an attraction field map in which every pixel in a partition region is encoded by its 2D projection vector w.r.t. the associated line segment; and (iii) a squeeze module which squashes the attraction field to a line segment map that almost perfectly recovers the input one. By leveraging the duality, we learn ConvNets to compute the attraction field maps for raw input images, followed by the squeeze module for LSD, in an end-to-end manner. Our method rigorously addresses several challenges in LSD such as local ambiguity and class imbalance. Our method also harnesses the best practices developed in ConvNets based semantic segmentation methods such as the encoder-decoder architecture and the a-trous convolution. In experiments, our method is tested on the WireFrame dataset and the YorkUrban dataset with state-of-the-art performance obtained. In particular, we advance the performance by 4.5 percentage points on the WireFrame dataset. Our method is also fast, running at 6.6-10.4 FPS and outperforming most existing line segment detectors.
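The attraction field itself is easy to state: every pixel stores the 2D vector to its projection onto the nearest line segment. The numpy sketch below computes such a map for a small set of segments; it is an illustration of the representation, not the paper's implementation.

import numpy as np

def attraction_field(h, w, segments):
    # segments: (N, 4) array of (x1, y1, x2, y2). Returns an (h, w, 2) map
    # holding, per pixel, the vector to its projection on the nearest segment.
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(float)
    best = np.full(len(pts), np.inf)
    field = np.zeros((len(pts), 2))
    for x1, y1, x2, y2 in np.asarray(segments, dtype=float):
        a, b = np.array([x1, y1]), np.array([x2, y2])
        ab = b - a
        t = np.clip((pts - a) @ ab / (ab @ ab + 1e-12), 0.0, 1.0)
        vec = a + t[:, None] * ab - pts
        d = np.linalg.norm(vec, axis=1)
        closer = d < best
        best[closer] = d[closer]
        field[closer] = vec[closer]
    return field.reshape(h, w, 2)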
[dataset] [field, local, problem, projection, point, denote, robust, compute, heat, duality, geometry, ambiguity, estimated, approach, corresponding, form] [proposed, image, method, based, pixel, figure, input, dual, coloring, ieee, raw] [deep, squeeze, performance, network, precision, stride, convnets, residual, convolutional, operator, convolution, size, best, computing, number] [vector, evaluate, simple, represent] [segment, map, attraction, wireframe, detection, lsd, region, edge, module, yorkurban, junction, false, recall, detect, threshold, linelet, segmentation, detected, parser, mcmlsd, semantic, hough, detector, stage, feature] [representation, learning, training, set, partition, existing, learn, class, distance]
@InProceedings{Xue_2019_CVPR,
  author = {Xue, Nan and Bai, Song and Wang, Fudong and Xia, Gui-Song and Wu, Tianfu and Zhang, Liangpei},
  title = {Learning Attraction Field Representation for Robust Line Segment Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Blind Super-Resolution With Iterative Kernel Correction
Jinjin Gu, Hannan Lu, Wangmeng Zuo, Chao Dong


Deep learning based methods have dominated the super-resolution (SR) field due to their remarkable performance in terms of effectiveness and efficiency. Most of these methods assume that the blur kernel during downsampling is predefined/known (e.g., bicubic). However, the blur kernels involved in real applications are complicated and unknown, resulting in a severe performance drop for advanced SR methods. In this paper, we propose an Iterative Kernel Correction (IKC) method for blur kernel estimation in the blind SR problem, where the blur kernels are unknown. We make the observation that kernel mismatch brings regular artifacts (either over-sharpening or over-smoothing), which can be exploited to correct inaccurate blur kernels. Thus we introduce an iterative correction scheme -- IKC -- that achieves better results than direct kernel estimation. We further propose an effective SR network architecture using spatial feature transform (SFT) layers to handle multiple blur kernels, named SFTMD. Extensive experiments on synthetic and real-world images show that the proposed IKC method with SFTMD can provide visually favorable SR results and state-of-the-art performance on the blind SR problem.
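The SFT layer mentioned above can be sketched in a few lines: SR features are scaled and shifted by maps predicted from the spatially tiled blur-kernel code, so a single network can adapt to many kernels. The 1x1 convolutions and names below are our assumptions for a compact illustration.

import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    # Conditions SR features on a blur-kernel code via an affine transform.
    def __init__(self, feat_ch, cond_ch):
        super().__init__()
        self.scale = nn.Conv2d(cond_ch, feat_ch, kernel_size=1)
        self.shift = nn.Conv2d(cond_ch, feat_ch, kernel_size=1)

    def forward(self, feat, kernel_code):
        # kernel_code: (B, cond_ch, H, W), the kernel embedding tiled spatially.
        return feat * self.scale(kernel_code) + self.shift(kernel_code)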
[multiple, previous, recognition, work] [computer, vision, pattern, estimation, single, provide, volume, direct, estimated, sota, international, assume, problem] [image, blur, blind, ikc, conference, method, ieee, figure, proposed, sftmd, carn, real, input, srmd, sft, psnr, correction, zssr, based, corrector, ssim, comparison, bicubic, isotropic, result, sisr, quantitative, pan, synthetic, degradation] [kernel, performance, network, deep, residual, layer, gaussian, width, downsampling, iteration, output, architecture, convolution, size, mismatch, concatenation, chao, factor, convolutional, table, severe, better, block, denotes] [iterative, correct, model, visual, introduce] [propose, feature, global, spatial, cnn] [predictor, learning, training, test, strategy, function, set]
@InProceedings{Gu_2019_CVPR,
  author = {Gu, Jinjin and Lu, Hannan and Zuo, Wangmeng and Dong, Chao},
  title = {Blind Super-Resolution With Iterative Kernel Correction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Video Magnification in the Wild Using Fractional Anisotropy in Temporal Distribution
Shoichiro Takeda, Yasunori Akagi, Kazuki Okami, Megumi Isogai, Hideaki Kimata


Video magnification methods can magnify and reveal subtle changes invisible to the naked eye. However, in such subtle changes, meaningful ones caused by physical and natural phenomena are mixed with non-meaningful ones caused by photographic noise. Therefore, current methods often produce noisy and misleading magnification outputs due to the non-meaningful subtle changes. For detecting only meaningful subtle changes, several methods have been proposed but require human manipulations, additional resources, or input video scene limitations. In this paper, we present a novel method using fractional anisotropy (FA) to detect only meaningful subtle changes without the aforementioned requirements. FA has been used in neuroscience to evaluate anisotropic diffusion of water molecules in the body. On the basis of our observation that temporal distribution of meaningful subtle changes more clearly indicates anisotropic diffusion than that of non-meaningful ones, we used FA to design a fractional anisotropic filter that passes only meaningful subtle changes. Using the filter enables our method to obtain better and more impressive magnification results than those obtained with state-of-the-art methods.
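Fractional anisotropy has a standard closed form given the eigenvalues of a 3x3 diffusion-like tensor; the helper below computes it (how the paper builds such a tensor from the temporal distribution of pixel changes is not reproduced here).

import numpy as np

def fractional_anisotropy(eigvals):
    # FA = sqrt(3/2) * ||lambda - mean(lambda)|| / ||lambda||, in [0, 1];
    # values near 1 indicate strongly anisotropic diffusion.
    l = np.asarray(eigvals, dtype=float)
    num = np.sqrt(((l - l.mean()) ** 2).sum())
    den = np.sqrt((l ** 2).sum()) + 1e-12
    return np.sqrt(1.5) * num / den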
[video, motion, temporal, time, hear, amplitude, recognition, current, flow, visualizing] [computer, approach, vision, pattern, additional, local, hand, require, textured] [subtle, method, meaningful, magnification, jerk, photographic, noise, color, anisotropic, proposed, caused, fractional, figure, background, magnifies, image, magnify, input, psnr, conference, quick, misdetects, produce, nonmeaningful, based, eulerian, pixel, high, magnified, intensity, ieee, anisotropy, pca, face, real, backboard, neuroscience, water, ukulele, magnifying, flat, synthetic, edo, spreading] [filter, phase, acceleration, regularization, gaussian, applied, william, design] [ball, indicates] [pyramid, presence, detect, level, detecting, hierarchical, region, spatial] [diffusion, distribution, novel, large, learning]
@InProceedings{Takeda_2019_CVPR,
  author = {Takeda, Shoichiro and Akagi, Yasunori and Okami, Kazuki and Isogai, Megumi and Kimata, Hideaki},
  title = {Video Magnification in the Wild Using Fractional Anisotropy in Temporal Distribution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Attentive Feedback Network for Boundary-Aware Salient Object Detection
Mengyang Feng, Huchuan Lu, Errui Ding


Recent deep learning based salient object detection methods achieve gratifying performance built upon Fully Convolutional Neural Networks (FCNs). However, most of them have suffered from the boundary challenge. The state-of-the-art methods employ feature aggregation techniques and can precisely locate the salient object, but they often fail to segment out the entire object with fine boundaries, especially raised narrow stripes. So there is still large room for improvement over FCN based models. In this paper, we design Attentive Feedback Modules (AFMs) to better explore the structure of objects. A Boundary-Enhanced Loss (BEL) is further employed for learning exquisite boundaries. Our proposed deep model produces satisfying results on the object boundaries and achieves state-of-the-art performance on five widely tested salient object detection benchmarks. The network is fully convolutional, runs at a speed of 26 FPS, and does not need any post-processing.
[prediction, previous, passing, employed] [computer, vision, pattern, ground, local, truth, international] [conference, ieee, feedback, image, proposed, method, produce, figure, input, based] [network, convolutional, deep, size, better, block, ternary, kernel, performance, neural, dilation, layer, convolution, effectiveness, employ, speed, operation, table, precision, achieve] [decoder, attention, encoder, model, visual, generate, perception, message, evaluate] [saliency, object, salient, global, attentive, map, detection, afm, fully, spatial, module, fmax, boundary, exquisite, semantic, refine, extra, mae, final, coarse, feature, hierarchical, picanet, recall] [loss, learning, afnet, set, large, training]
@InProceedings{Feng_2019_CVPR,
  author = {Feng, Mengyang and Lu, Huchuan and Ding, Errui},
  title = {Attentive Feedback Network for Boundary-Aware Salient Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Heavy Rain Image Restoration: Integrating Physics Model and Conditional Adversarial Learning
Ruoteng Li, Loong-Fah Cheong, Robby T. Tan


Most deraining works focus on rain streak removal but cannot deal adequately with heavy rain images. In heavy rain, streaks are strongly visible, dense rain accumulation or the rain veiling effect significantly washes out the image, more distant scenes are relatively more blurry, etc. In this paper, we propose a novel method to address these problems. We put forth a 2-stage network: a physics-based backbone followed by a depth-guided GAN refinement. The first stage estimates the rain streaks, the transmission, and the atmospheric light governed by the underlying physics. To tease out these components more reliably, a guided filtering framework is used to decompose the image into its low- and high-frequency components. This filtering is guided by a rain-free residue image --- its content is used to set the passbands for the two channels in a spatially-variant manner so that the background details do not get mixed up with the rain streaks. For the second stage, the refinement stage, we put forth a depth-guided GAN to recover the background details that the first stage failed to retrieve, as well as to correct artefacts introduced by that stage. We have evaluated our method against state-of-the-art methods. Extensive experiments show that our method outperforms them on real rain image data, recovering visually clean images with good details.
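The physics model estimated in the first stage can be written compactly; the numpy sketch below synthesizes a heavy-rain image from a clean scene, a streak layer, a transmission map, and the atmospheric light. It is a hedged simplification (the paper may use several streak layers and a different parameterization).

import numpy as np

def synthesize_heavy_rain(J, streaks, T, A):
    # J, streaks: (H, W, 3); T: (H, W) transmission; A: atmospheric light.
    # Rain veiling washes out the scene where transmission is low.
    T3 = T[..., None]
    return T3 * (J + streaks) + (1.0 - T3) * A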
[recognition, state, video, dataset, second] [computer, vision, decomposition, light, depth, pattern, algorithm, estimation, single, international, june, estimated, scene, error, problem, analysis, dense, well, estimating, provide] [rain, image, method, ieee, input, removal, conference, accumulation, atmospheric, background, transmission, generative, veiling, residue, component, streak, real, clean, frequency, deraining, result, figure, comparison, remove, synthetic, based, psnr, reconstructed, dehazing, dehaze, conditional, produced, filtering, proposed, distant] [network, layer, represents, output, channel, low, deep, convolutional] [model, gan, adversarial, strong] [heavy, stage, map, baseline, refinement, guided, art, guidance, propose, presence] [learning, loss, test, training, address, set, function, discriminative, novel, existing]
@InProceedings{Li_2019_CVPR,
  author = {Li, Ruoteng and Cheong, Loong-Fah and Tan, Robby T.},
  title = {Heavy Rain Image Restoration: Integrating Physics Model and Conditional Adversarial Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Calibrate Straight Lines for Fisheye Image Rectification
Zhucun Xue, Nan Xue, Gui-Song Xia, Weiming Shen


This paper presents a new deep-learning based method to simultaneously calibrate the intrinsic parameters of fisheye lens and rectify the distorted images. Assuming that the distorted lines generated by fisheye projection should be straight after rectification, we propose a novel deep neural network to impose explicit geometry constraints onto processes of the fisheye lens calibration and the distorted image rectification. In addition, considering the nonlinearity of distortion distribution in fisheye images, the proposed network fully exploits multi-scale perception to equalize the rectification effects on the whole image. To train and evaluate the proposed model, we also create a new large-scale dataset labeled with corresponding distortion parameters and well-annotated distorted lines. Compared with the state-of-the-art methods, our model achieves the best published rectification quality and the most accurate estimation of distortion parameters on a large set of synthetic and real fisheye images.
[dataset, work] [fisheye, distorted, distortion, rectification, rectified, geometric, calibration, straight, camera, lens, well, ground, geometry, dlp, rectify, corresponding, accurate, local, perspective, truth, projection, estimation, single, lpe, estimated, error, scene, estimate, constraint, curvature, defined, kgt, bukhari, suncg, reprojection, calibrate, explicit, problem, formation] [image, input, proposed, method, real, figure, ssim] [network, deep, output, performance, neural, size, best, convolutional, cnns, structure] [model, perception, visual, collection, evaluation, evaluate, generate] [map, module, global, segment, detection, wireframe, detected, edge, feature, detect, three, propose] [set, training, loss, train, learning, learn, function, learned]
@InProceedings{Xue_2019_CVPR,
  author = {Xue, Zhucun and Xue, Nan and Xia, Gui-Song and Shen, Weiming},
  title = {Learning to Calibrate Straight Lines for Fisheye Image Rectification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Camera Lens Super-Resolution
Chang Chen, Zhiwei Xiong, Xinmei Tian, Zheng-Jun Zha, Feng Wu


Existing methods for single image super-resolution (SR) are typically evaluated with synthetic degradation models such as bicubic or Gaussian downsampling. In this paper, we investigate SR from the perspective of camera lenses, named as CameraSR, which aims to alleviate the intrinsic tradeoff between resolution (R) and field-of-view (V) in realistic imaging systems. Specifically, we view the R-V degradation as a latent model in the SR process and learn to reverse it with realistic low- and high-resolution image pairs. To obtain the paired images, we propose two novel data acquisition strategies for two representative imaging systems (i.e., DSLR and smartphone cameras), respectively. Based on the obtained City100 dataset, we quantitatively analyze the performance of commonly-used synthetic degradation models, and demonstrate the superiority of CameraSR as a practical solution to boost the performance of existing SR methods. Moreover, CameraSR can be readily generalized to different content and devices, which serves as an advanced digital zoom tool in realistic imaging systems.
[modeling, dataset, version, optical, short] [focal, camera, single, ground, truth, lens, reconstruction, analysis, fov, problem, well] [image, degradation, camerasr, realistic, imaging, captured, smartphone, dslr, zoom, resolution, synthetic, bicubicsr, figure, interpolated, drv, based, perceptual, psnr, vdsr, bicubic, digital, quality, comparison, proposed, gaussiansr, color, dgau, iphone, paired, demonstrate, quantitative, dbic, ssim, content, translation, nikon, srgan] [network, deep, performance, downsampling, denotes, gaussian, vgg, tradeoff, convolutional, accuracy, compared, advanced, fixed, size, residual] [model, length, visual] [adopted, adopt, spatial] [trained, data, representative, distance, loss, existing, learn, test, metric, investigate, alleviate, generalization]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Chang and Xiong, Zhiwei and Tian, Xinmei and Zha, Zheng-Jun and Wu, Feng},
  title = {Camera Lens Super-Resolution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Frame-Consistent Recurrent Video Deraining With Dual-Level Flow
Wenhan Yang, Jiaying Liu, Jiashi Feng


In this paper, we address the problem of rain removal from videos by proposing a more comprehensive framework that considers the additional degradation factors in real scenes neglected in previous works. The proposed framework is built upon a two-stage recurrent network with dual-level flow regularizations to perform the inverse recovery process of the rain synthesis model for video deraining. The rain-free frame is estimated from the single rain frame at the first stage. It is then taken as guidance along with previously recovered clean frames to help obtain a more accurate clean frame at the second stage. This two-step architecture is capable of extracting more reliable motion information from the initially estimated rain-free frame at the first stage for better frame alignment and motion modeling at the second stage. Furthermore, to keep the motion consistency between frames that facilitates a frame-consistent deraining model at the second stage, a dual-level flow based regularization is proposed at both coarse flow and fine pixel levels. To better train and evaluate the proposed video deraining network, a novel rain synthesis model is developed to produce more visually authentic paired training and evaluation videos. Extensive experiments on a series of synthetic and real videos verify not only the superiority of the proposed method over state-of-the-art but also the effectiveness of network design and its each component.
[video, flow, frame, motion, recurrent, temporal, joint, second, optical, time, previous, framework] [computer, single, inverse, vision, pattern, estimated, estimation, local, june, constraint, problem] [rain, image, ieee, removal, based, accumulation, proposed, deraining, recovery, method, degradation, fastderain, synthesis, spaccnn, background, input, streak, psnr, clean, pixel, ssim, figure, detail, removing, synthesized, atmospheric, comprehensive, capable, consistency, remove, denoted, dehazing, dual, july] [network, deep, convolutional, best, table, process, better, sparse, fine, layer, residual, architecture] [model, visual] [spatial, detection, module, feature, built, propose] [learning, alignment, novel]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Wenhan and Liu, Jiaying and Feng, Jiashi},
  title = {Frame-Consistent Recurrent Video Deraining With Dual-Level Flow},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Plug-And-Play Super-Resolution for Arbitrary Blur Kernels
Kai Zhang, Wangmeng Zuo, Lei Zhang


While deep neural networks (DNN) based single image super-resolution (SISR) methods are rapidly gaining popularity, they are mainly designed for the widely-used bicubic degradation, and there still remains the fundamental challenge for them to super-resolve low-resolution (LR) images with arbitrary blur kernels. Meanwhile, plug-and-play image restoration has been recognized for its high flexibility due to its modular structure for easy plug-in of denoiser priors. In this paper, we propose a principled formulation and framework by extending bicubic degradation based deep SISR with the help of the plug-and-play framework to handle LR images with arbitrary blur kernels. Specifically, we design a new SISR degradation model so as to take advantage of existing blind deblurring methods for blur kernel estimation. To optimize the new degradation induced energy function, we then derive a plug-and-play algorithm via the variable splitting technique, which allows us to plug any super-resolver prior rather than the denoiser prior as a modular part. Quantitative and qualitative evaluations on synthetic and real LR images demonstrate that the proposed deep plug-and-play super-resolution framework is flexible and effective in dealing with blurry LR images.
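As an illustration of the variable-splitting idea described above, the following sketch alternates a closed-form FFT deconvolution (data step) with a call to an arbitrary pretrained super-resolver acting as the plugged-in prior. The super_resolver interface, the penalty schedule, and the naive decimation back to the LR grid are assumptions of this sketch, not the authors' implementation.

import numpy as np

def data_step(y, k, z_prev, mu):
    # Closed-form FFT solution of  min_z ||y - k*z||^2 + mu*||z - z_prev||^2,
    # i.e. a regularized deconvolution of the LR image by the blur kernel
    # (periodic boundaries assumed).
    K = np.fft.fft2(k, s=y.shape)
    num = np.conj(K) * np.fft.fft2(y) + mu * np.fft.fft2(z_prev)
    den = np.abs(K) ** 2 + mu
    return np.real(np.fft.ifft2(num / den))

def plug_and_play_sr(y, k, super_resolver, scale, n_iter=8):
    # super_resolver is any pretrained SISR model used as a plug-in prior:
    # it maps an LR estimate to an HR estimate (placeholder interface).
    z = y.copy()
    for t in range(n_iter):
        mu = 0.01 * (2.0 ** t)          # increasing penalty weight (illustrative schedule)
        z = data_step(y, k, z, mu)      # deblur in LR space
        x = super_resolver(z, scale)    # prior step: plugged-in super-resolver
        z = x[::scale, ::scale]         # crude decimation back to the LR grid
    return x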
[motion, complex, framework, work, perform, karen] [computer, international, vision, single, pattern, distortion, algorithm, variable, problem, journal, directly, solve, limited, note, michael] [image, blur, degradation, ieee, conference, bicubic, sisr, dpsr, proposed, noise, prior, arbitrary, deblurring, method, blind, zssr, denoiser, handle, disk, based, blurry, comparison, rcan, splitting, srmd, gfn, restoration, vdsr, lei, kai, rise, bicubicly, psnr, denoising, wangmeng] [kernel, deep, gaussian, scale, factor, network, performance, neural, energy, convolutional, order, processing, fast, designed] [model, visual, simple, iterative, deal, adversarial] [level, adopt, improve, european] [existing, large, learning, function, training, noisy, alternating, trained, set]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Kai and Zuo, Wangmeng and Zhang, Lei},
  title = {Deep Plug-And-Play Super-Resolution for Arbitrary Blur Kernels},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Sea-Thru: A Method for Removing Water From Underwater Images
Derya Akkaynak, Tali Treibitz


Robust recovery of lost colors in underwater images remains a challenging problem. We recently showed that this was partly due to the prevalent use of an atmospheric image formation model for underwater images. We proposed a physically accurate model that explicitly showed: 1) the attenuation coefficient of the signal is not uniform across the scene but depends on object range and reflectance, 2) the coefficient governing the increase in backscatter with distance differs from the signal attenuation coefficient. Here, we present a method that recovers color with the revised model using RGBD images. The Sea-thru method first calculates backscatter using the darkest pixels in the image and their known range information. Then, it uses an estimate of the spatially varying illuminant to obtain the range-dependent attenuation coefficient. Using more than 1,100 images from two optically different water bodies, which we make available, we show that our method outperforms those using the atmospheric model. Consistent removal of water will open up large underwater datasets to powerful computer vision and machine learning algorithms, creating exciting opportunities for the future of underwater exploration and conservation.
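The revised formation model described above has the form I_c = J_c e^(-beta_D z) + B_inf (1 - e^(-beta_B z)). Below is a minimal inversion sketch assuming the backscatter and attenuation parameters have already been estimated (from the darkest pixels and the spatially varying illuminant, respectively); in the paper the attenuation coefficient additionally varies with range and reflectance, which this constant-coefficient sketch omits.

import numpy as np

def recover_colors(I, z, beta_D, B_inf, beta_B):
    # I      : observed image, shape (H, W, 3), values in [0, 1]
    # z      : per-pixel range map, shape (H, W), in metres
    # beta_D : per-channel attenuation coefficients (treated as constants here)
    # B_inf, beta_B : per-channel backscatter parameters
    z = z[..., None]
    backscatter = B_inf * (1.0 - np.exp(-beta_B * z))
    direct = np.clip(I - backscatter, 1e-6, None)   # remove backscatter
    J = direct * np.exp(beta_D * z)                 # invert attenuation
    return np.clip(J, 0.0, 1.0)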
[optical, signal, dataset, work, frame, multiple, structural] [range, attenuation, estimate, camera, scene, chart, light, single, estimation, formation, depth, vision, local, direct, ambient, computer, form, depends, rgbd, point, estimated, reflectance, rgb, reconstruction, physically, accurate, measured, constant, stereo] [underwater, image, color, water, ieee, method, backscatter, imaging, based, revised, dark, described, illuminant, figure, gray, marine, spectral, captured, reef, sony, enhancement, recovery, governed, haze, nikon, wavelength, assumed, dcp, bad, recover] [coefficient, channel, better] [model, white, machine, correct, type] [map, average, object, response] [large, datasets, space, coral, distance, learning, function, set]
@InProceedings{Akkaynak_2019_CVPR,
  author = {Akkaynak, Derya and Treibitz, Tali},
  title = {Sea-Thru: A Method for Removing Water From Underwater Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Network Interpolation for Continuous Imagery Effect Transition
Xintao Wang, Ke Yu, Chao Dong, Xiaoou Tang, Chen Change Loy


Deep convolutional neural network has demonstrated its capability of learning a deterministic mapping for the desired imagery effect. However, the large variety of user flavors motivates the possibility of continuous transition among different output effects. Unlike existing methods that require a specific design to achieve one particular transition (e.g., style transfer), we propose a simple yet universal approach to attain a smooth control of diverse imagery effects in many low-level vision tasks, including image restoration, image-to-image translation, and style transfer. Specifically, our method, namely Deep Network Interpolation (DNI), applies linear interpolation in the parameter space of two or more correlated networks. A smooth control of imagery effects can be achieved by tweaking the interpolation coefficients. In addition to DNI and its broad applications, we also investigate the mechanism of network interpolation from the perspective of learned filters.
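The core operation, linear interpolation in the parameter space of two correlated networks, is simple enough to state directly; a minimal sketch (assuming two PyTorch models with identical architectures) is given below. Sweeping alpha from 0 to 1 then traces out the continuous transition between the two imagery effects.

import copy

def interpolate_networks(net_a, net_b, alpha):
    # Deep Network Interpolation: blend all parameters of two correlated
    # networks with one coefficient alpha in [0, 1].
    state_a, state_b = net_a.state_dict(), net_b.state_dict()
    blended = {}
    for k in state_a:
        if state_a[k].is_floating_point():
            blended[k] = alpha * state_a[k] + (1.0 - alpha) * state_b[k]
        else:                      # e.g. integer counters in BatchNorm buffers
            blended[k] = state_a[k]
    net_out = copy.deepcopy(net_a)
    net_out.load_state_dict(blended)
    return net_out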
[transition, multiple] [continuous, vision, night, linear, general] [image, dni, interpolation, style, denoising, smooth, interpolated, noise, translation, pixel, restoration, strength, proposed, mse, figure, control, imagery, chen, change, user, produce, interpolating, stroke, artistic, input, content, day, capable, adjusting, balance, texture, photo, unable, generative] [deep, network, convolutional, filter, correlation, achieve, neural, normalization, parameter, effective, best, applied, chao, order] [model, gan, van, generating, adversarial, controllable, provided, diverse, generate] [instance, feature, level] [learned, transfer, learning, specific, existing, trained, loss, training, task, space]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Xintao and Yu, Ke and Dong, Chao and Tang, Xiaoou and Change Loy, Chen},
  title = {Deep Network Interpolation for Continuous Imagery Effect Transition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Spatially Variant Linear Representation Models for Joint Filtering
Jinshan Pan, Jiangxin Dong, Jimmy S. Ren, Liang Lin, Jinhui Tang, Ming-Hsuan Yang


Joint filtering mainly uses an additional guidance image as a prior and transfers its structures to the target image in the filtering process. Different from existing algorithms that rely on locally linear models or hand-designed objective functions to extract the structural information from the guidance image, we propose a new joint filter based on a spatially variant linear representation model (SVLRM), where the target image is linearly represented by the guidance image. However, the SVLRM leads to a highly ill-posed problem. To estimate the linear representation coefficients, we develop an effective algorithm based on a deep convolutional neural network (CNN). The proposed deep CNN (constrained by the SVLRM) is able to estimate the spatially variant linear representation coefficients which are able to model the structural information of both the guidance and input images. We show that the proposed algorithm can be effectively applied to a variety of applications, including depth/RGB image upsampling and restoration, flash/no-flash image deblurring, natural image denoising, scale-aware filtering, etc. Extensive experimental results demonstrate that the proposed algorithm performs favorably against state-of-the-art methods that have been specially designed for each task.
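A minimal sketch of the spatially variant linear representation: a small CNN predicts per-pixel coefficients (a, b) from the concatenated target and guidance, and the filtered output is a * guidance + b. The layer sizes and the single-channel inputs are illustrative assumptions; the network in the paper is deeper.

import torch
import torch.nn as nn

class SVLRMFilter(nn.Module):
    def __init__(self):
        super().__init__()
        # predicts the two spatially variant coefficient maps a(x) and b(x)
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 2, 3, padding=1),
        )

    def forward(self, target, guidance):
        coeffs = self.net(torch.cat([target, guidance], dim=1))
        a, b = coeffs[:, :1], coeffs[:, 1:]
        return a * guidance + b    # spatially variant linear representation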
[joint, structural, dataset, represented, performs, determine] [algorithm, linear, depth, estimate, estimated, well, local, note, directly, problem] [image, proposed, filtering, input, figure, based, spatially, method, svlrm, djf, noise, denoising, bilateral, acm, deblurring, ieee, jinshan, sharp, jiaya, extraneous, transferred, quantitative, fgt, restoration, prior, flash, application, variety, jbu] [deep, filter, variant, network, table, upsampling, neural, effective, weighted, applied, convolutional, gradient, efficient, better, preserved, cnns, convolution] [model, natural, locally, generate, generates, develop, evaluate] [guidance, cnn, guided, propose, including] [representation, target, learning, training, test, set, noisy, existing, objective, main, transfer, learn]
@InProceedings{Pan_2019_CVPR,
  author = {Pan, Jinshan and Dong, Jiangxin and Ren, Jimmy S. and Lin, Liang and Tang, Jinhui and Yang, Ming-Hsuan},
  title = {Spatially Variant Linear Representation Models for Joint Filtering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Toward Convolutional Blind Denoising of Real Photographs
Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, Lei Zhang


While deep convolutional neural networks (CNNs) have achieved impressive success in image denoising with additive white Gaussian noise (AWGN), their performance remains limited on real-world noisy photographs. The main reason is that their learned models tend to overfit to the simplified AWGN model, which deviates severely from the complicated real-world noise model. In order to improve the generalization ability of deep CNN denoisers, we suggest training a convolutional blind denoising network (CBDNet) with a more realistic noise model and real-world noisy-clean image pairs. On the one hand, both signal-dependent noise and the in-camera signal processing pipeline are considered to synthesize realistic noisy images. On the other hand, real-world noisy photographs and their nearly noise-free counterparts are also included to train our CBDNet. To further provide an interactive strategy to rectify the denoising result conveniently, a noise estimation subnetwork with asymmetric learning to suppress under-estimation of the noise level is embedded into CBDNet. Extensive experimental results on three datasets of real-world noisy photographs clearly demonstrate the superior performance of CBDNet over state-of-the-art methods in terms of quantitative metrics and visual quality. The code has been made available at https://github.com/GuoShi28/CBDNet.
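One way to realize the asymmetric learning mentioned above is an asymmetric squared error on the estimated noise level that weights under-estimation more heavily than over-estimation; the sketch below uses this form with an illustrative weight alpha = 0.3 (the exact loss in the paper may differ).

import torch

def asymmetric_noise_loss(sigma_pred, sigma_gt, alpha=0.3):
    # When sigma_pred < sigma_gt (under-estimation) the squared error is
    # weighted by (1 - alpha); otherwise by alpha, with alpha < 0.5.
    under = (sigma_pred - sigma_gt < 0).float()
    weight = torch.abs(alpha - under)
    return torch.mean(weight * (sigma_pred - sigma_gt) ** 2)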
[modeling, performs] [estimation, computer, vision, pattern, international, well, pipeline, michael, error, estimated] [noise, image, denoising, real, blind, cbdnet, ieee, conference, synthetic, dnd, srgb, input, realistic, mcwnnm, twsc, lei, awgn, clean, cnne, method, figure, denoisers, removal, nam, isp, quantitative, removing, cnnd, jpeg, ffdnet, wnnm, wangmeng, result, sophisticated, introduction, raw] [gaussian, deep, network, convolutional, processing, performance, achieves, achieve, table, neural, better, batch, weighted, scale, gain] [model, ability, considered, visual] [cnn, level, subnetwork, interactive, three, adopt, improve] [noisy, asymmetric, learning, loss, generalization, training, train, trained, datasets]
@InProceedings{Guo_2019_CVPR,
  author = {Guo, Shi and Yan, Zifei and Zhang, Kai and Zuo, Wangmeng and Zhang, Lei},
  title = {Toward Convolutional Blind Denoising of Real Photographs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Towards Real Scene Super-Resolution With Raw Images
Xiangyu Xu, Yongrui Ma, Wenxiu Sun


Most existing super-resolution methods do not perform well in real scenarios due to lack of realistic training data and information loss of the model input. To solve the first problem, we propose a new pipeline to generate realistic training data by simulating the imaging process of digital cameras. And to remedy the information loss of the input, we develop a dual convolutional neural network to exploit the originally captured radiance information in raw images. In addition, we propose to learn a spatially-variant color transformation which helps more effective color corrections. Extensive experiments demonstrate that super-resolution with raw data helps recover fine details and clear structures, and more importantly, the proposed network and data generation pipeline achieve superior results for single image super-resolution in real scenarios.
[work, complex, fusion, second] [well, linear, pipeline, camera, ground, solve, scene, dense, radiance, single, directly, truth, pattern, rgb, algorithm] [color, raw, image, figure, real, proposed, correction, isp, method, input, noise, transformation, captured, realistic, blur, recover, clear, synthetic, digital, imaging, dual, synthesize, bayer, demosaicing, xraw, xlin, reference, superresolution, restoration, degraded, sharper, simulating, resolution, pixel, nonlinear] [network, deep, processing, better, kernel, convolutional, neural, fine, process, gaussian, convolution, size, fixed, downsampling, architecture] [model, generate, generated, generation] [propose, branch, feature, cnn, global] [data, training, learning, learn, exploit, existing, test, loss, function]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Xiangyu and Ma, Yongrui and Sun, Wenxiu},
  title = {Towards Real Scene Super-Resolution With Raw Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ODE-Inspired Network Design for Single Image Super-Resolution
Xiangyu He, Zitao Mo, Peisong Wang, Yang Liu, Mingyuan Yang, Jian Cheng


Single image super-resolution, as a high dimensional structured prediction problem, aims to characterize fine-grain information given a low-resolution sample. Recent advances in convolutional neural networks have been introduced into super-resolution and have pushed progress forward in this field. Current studies have achieved impressive performance by manually designing deep residual neural networks, but this overly relies on practical experience. In this paper, we propose to adopt an ordinary differential equation (ODE)-inspired design scheme for single image super-resolution, which has brought a new understanding of ResNet in classification problems. Not only is it interpretable for super-resolution, but it also provides a reliable guideline for network design. By casting the numerical schemes in ODEs as blueprints, we derive two types of network structures: LF-block and RK-block, which correspond to the Leapfrog method and the Runge-Kutta method in numerical ordinary differential equations. We evaluate our models on benchmark datasets, and the results show that our methods surpass the state of the art while keeping comparable numbers of parameters and operations.
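The named numerical schemes translate naturally into residual-style blocks: a plain residual block is a forward Euler step x_{n+1} = x_n + f(x_n), the Leapfrog scheme gives x_{n+1} = x_{n-1} + 2 f(x_n), and a second-order Runge-Kutta (Heun) step gives x_{n+1} = x_n + (k1 + k2)/2. The convolutional form of f below, and whether its weights are shared between the two RK evaluations, are assumptions of this sketch.

import torch.nn as nn

class ConvUnit(nn.Module):
    # f(x): the residual function used by all schemes (sizes are illustrative)
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return self.body(x)

class LeapfrogBlock(nn.Module):
    # x_{n+1} = x_{n-1} + 2 * f(x_n)
    def __init__(self, ch):
        super().__init__()
        self.f = ConvUnit(ch)

    def forward(self, x_prev, x_curr):
        return x_prev + 2.0 * self.f(x_curr)

class RK2Block(nn.Module):
    # k1 = f(x_n), k2 = f(x_n + k1), x_{n+1} = x_n + (k1 + k2) / 2
    def __init__(self, ch):
        super().__init__()
        self.f = ConvUnit(ch)

    def forward(self, x):
        k1 = self.f(x)
        k2 = self.f(x + k1)
        return x + 0.5 * (k1 + k2)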
[forward, recognition, time] [computer, vision, pattern, single, june, international, differential, euler, corresponding, david] [image, ieee, conference, method, edsr, psnr, july, figure, ssim, sisr, superresolution, input, kyoung, proposed, tend, amount] [deep, residual, network, numerical, neural, design, convolutional, dynamical, conv, performance, addition, computation, order, msrn, table, scheme, resnet, prelu, ode, building, block, leapfrog, apply, better, effectiveness, formula, approximation, xiangyu, comparable, deeper, architecture] [indicates, model, system] [cvpr, cnn, map, jian, propose, benchmark, module] [training, learning, similarity, china, datasets]
@InProceedings{He_2019_CVPR,
  author = {He, Xiangyu and Mo, Zitao and Wang, Peisong and Liu, Yang and Yang, Mingyuan and Cheng, Jian},
  title = {ODE-Inspired Network Design for Single Image Super-Resolution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Blind Image Deblurring With Local Maximum Gradient Prior
Liang Chen, Faming Fang, Tingting Wang, Guixu Zhang


Blind image deblurring aims to recover a sharp image from a blurred one while the blur kernel is unknown. To solve this ill-posed problem, a great number of image priors have been explored and employed in this area. In this paper, we present a blind deblurring method based on a Local Maximum Gradient (LMG) prior. Our work is inspired by the simple and intuitive observation that the maximum value of a local patch gradient will diminish after the blur process, which is proved to be true both mathematically and empirically. This inherent property of the blur process helps us to establish a new energy function. By introducing a linear operator to compute the Local Maximum Gradient, together with an effective optimization scheme, our method can handle various specific scenarios. Extensive experimental results illustrate that our method is able to achieve favorable performance against state-of-the-art algorithms on both synthetic and real-world images.
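The prior itself is easy to compute: at every pixel, take the maximum gradient magnitude within a local window. A small sketch follows; the window size and the grayscale assumption are illustrative.

import numpy as np
from scipy.ndimage import maximum_filter

def local_max_gradient(img, patch=15):
    # LMG(x)(p) = max over a patch around p of the gradient magnitude;
    # blurring shrinks this value, which is what the prior exploits.
    gy, gx = np.gradient(img.astype(np.float64))
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    return maximum_filter(grad_mag, size=patch)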
[dataset, term, work, performs, signal, second] [computer, pattern, vision, local, corresponding, error, solve, optimization, algorithm, form, constraint, estimation, denote, single, property, well, problem, solution, estimated] [image, method, deblurring, conference, blurred, ieee, prior, result, blur, patch, blind, lmg, figure, deconvolution, clear, pan, based, latent, blurring, intermediate, yan, comparison, input, proposed, psnr, blurry, dark, jinshan, cho, krishnan, face, demonstrates] [kernel, gradient, norm, max, channel, energy, size, effectiveness, process, better, effective, sparse, scheme, best] [model, natural, text, evaluate, introduce, generate, generates] [adopt, average, benchmark, map, involved] [maximum, min, datasets, convergence, similarity, function]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Liang and Fang, Faming and Wang, Tingting and Zhang, Guixu},
  title = {Blind Image Deblurring With Local Maximum Gradient Prior},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Attention-Guided Network for Ghost-Free High Dynamic Range Imaging
Qingsen Yan, Dong Gong, Qinfeng Shi, Anton van den Hengel, Chunhua Shen, Ian Reid, Yanning Zhang


Ghosting artifacts caused by moving objects or misalignments are a key challenge in high dynamic range (HDR) imaging for dynamic scenes. Previous methods first register the input low dynamic range (LDR) images using optical flow before merging them, but flow estimation is error-prone and causes ghosts in the results. A very recent work tries to bypass optical flow via a deep network with skip connections; however, it still suffers from ghosting artifacts under severe movement. To avoid ghosting at the source, we propose a novel attention-guided end-to-end deep neural network (AHDRNet) to produce high-quality ghost-free HDR images. Unlike previous methods directly stacking the LDR images or features for merging, we use attention modules to guide the merging according to the reference image. The attention modules automatically suppress undesired components caused by misalignments and saturation and enhance desirable fine details in the non-reference images. In addition to the attention model, we use dilated residual dense blocks (DRDBs) to make full use of the hierarchical features and increase the receptive field for hallucinating the missing details. The proposed AHDRNet is a non-flow-based method, which can also avoid the artifacts generated by optical-flow estimation error. Experiments on different datasets show that the proposed AHDRNet can achieve state-of-the-art quantitative and qualitative results.
[dynamic, flow, optical, motion, recognition, moving, previous, multiple] [ldr, range, computer, dense, vision, pattern, corresponding, single, exposure, approach, ground, estimation] [hdr, image, proposed, figure, ahdrnet, reference, ieee, ghosting, method, high, imaging, input, conference, based, kalantari, quantitative, produce, saturated, color, pixel, tonemapped, tmo, caused, saturation, misaligned, relying] [network, residual, dilated, deep, block, convolution, conv, neural, table, compare, denotes, obtains, apply, better, receptive, stacked] [attention, generate, model, visual, sen, van] [feature, merging, map, global, three, module, propose] [learning, alignment, training, loss, data, testing, datasets]
@InProceedings{Yan_2019_CVPR,
  author = {Yan, Qingsen and Gong, Dong and Shi, Qinfeng and van den Hengel, Anton and Shen, Chunhua and Reid, Ian and Zhang, Yanning},
  title = {Attention-Guided Network for Ghost-Free High Dynamic Range Imaging},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Searching for a Robust Neural Architecture in Four GPU Hours
Xuanyi Dong, Yi Yang


Conventional neural architecture search (NAS) approaches are usually based on reinforcement learning or evolutionary strategy, which take more than 1000 GPU hours to find a good model on CIFAR-10. We propose an efficient NAS approach, which learns the searching approach by gradient descent. Our approach represents the search space as a directed acyclic graph (DAG). This DAG contains thousands of sub-graphs, each of which indicates a kind of neural architecture. To avoid traversing all the possibilities of the sub-graphs, we develop a differentiable sampler over the DAG. This sampler is learnable and optimized by the validation loss after training the sampled architecture. In this way, our approach can be trained in an end-to-end fashion by gradient descent, named Gradient-based search using Differentiable Architecture Sampler (GDAS). In experiments, we can finish one searching procedure in four GPU hours on CIFAR-10, and the discovered model obtains a test error of 2.82% with only 2.5M parameters, which is on par with the state-of-the-art.
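A plausible reading of the differentiable sampler is a Gumbel-softmax style relaxation per DAG edge: the forward pass picks one candidate operation (hard one-hot) while the backward pass flows through the soft probabilities. Whether GDAS uses exactly this straight-through estimator, and the temperature value, are assumptions of this sketch; note also that the actual method would only need to evaluate the selected operation, whereas this sketch evaluates all candidates for clarity.

import torch
import torch.nn.functional as F

def sample_edge_op(logits, ops, x, tau=10.0):
    # logits : (num_ops,) learnable architecture parameters for one DAG edge
    # ops    : list of num_ops candidate operations (callables on x)
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
    probs = F.softmax((logits + gumbel) / tau, dim=-1)
    hard = F.one_hot(probs.argmax(dim=-1), num_classes=logits.numel()).float()
    weights = hard + probs - probs.detach()      # straight-through gradient
    return sum(w * op(x) for w, op in zip(weights, ops))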
[rnn, previous, lstm, recognition, human] [normal, international, approach, computer, vision, differentiable, error, pattern, micro, robust, directly, optimization, good] [conference, ieee, based, image, intermediate, method] [search, neural, gdas, cell, architecture, searching, reduction, discovered, gpu, network, cost, convolutional, block, efficient, validation, computational, imagenet, conv, number, reduce, stride, weight, cutout, gradient, automatically, size, rate, dag, achieve, max, enas, better, compared, nodei, calculate] [model, indicates, discover, procedure, perplexity, node, candidate, find] [cnn, feature] [learning, training, function, set, space, sampling, test, distribution, train, data, loss]
@InProceedings{Dong_2019_CVPR,
  author = {Dong, Xuanyi and Yang, Yi},
  title = {Searching for a Robust Neural Architecture in Four GPU Hours},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Hierarchy Denoising Recursive Autoencoders for 3D Scene Layout Prediction
Yifei Shi, Angel X. Chang, Zhelun Wu, Manolis Savva, Kai Xu


Indoor scenes exhibit rich hierarchical structure in 3D object layouts. Many tasks in 3D scene understanding can benefit from reasoning jointly about the hierarchical context of a scene and the identities of objects. We present a variational denoising recursive autoencoder (VDRAE) that generates and iteratively refines a hierarchical representation of 3D object layouts, interleaving bottom-up encoding for context aggregation and top-down decoding for propagation. We train our VDRAE on large-scale 3D scene datasets to predict both instance-level segmentations and 3D object detections from an over-segmentation of an input point cloud. We show that our VDRAE improves object detection performance on real-world 3D point cloud datasets compared to baselines from prior work.
[work, predict, graph, performs, leaf, benefit, propagation] [point, scene, cloud, approach, initial, computer, indoor, vision, pattern, chair, normalized, optimization, single, leveraging] [input, figure, denoising, conference, prior, method, ieee, image, based, generative] [recursive, network, iteration, neural, table, mlp, aggregation, performance, deep] [node, encoding, decoding, variational, generates, iterative, model, obj, understanding] [object, hierarchy, detection, layout, context, vdrae, segment, hierarchical, bounding, semantic, affinity, segmentation, contextual, box, obb, average, parsing, pointcnn, iou, map, instance, refinement, ross, category, sofa, oriented] [training, set, autoencoder, train, trained, loss, learning, pair, representation, test, datasets, learned]
@InProceedings{Shi_2019_CVPR,
  author = {Shi, Yifei and Chang, Angel X. and Wu, Zhelun and Savva, Manolis and Xu, Kai},
  title = {Hierarchy Denoising Recursive Autoencoders for 3D Scene Layout Prediction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Adaptively Connected Neural Networks
Guangrun Wang, Keze Wang, Liang Lin


This paper presents a novel adaptively connected neural network (ACNet) to improve the traditional convolutional neural networks (CNNs) in two aspects. First, ACNet employs a flexible way to switch between global and local inference in processing the internal feature representations by adaptively determining the connection status among the feature nodes (e.g., pixels of the feature maps). Note that in the computer vision domain, a node refers to a pixel of a feature map, while in the graph domain, a node denotes a graph node. We can show that existing CNNs, the classical multilayer perceptron (MLP), and the recently proposed non-local network (NLN) are all special cases of ACNet. Second, ACNet is also capable of handling non-Euclidean data. Extensive experimental analyses on a variety of benchmarks (i.e., ImageNet-1k classification, COCO 2017 detection and segmentation, CUHK03 person re-identification, CIFAR analysis, and Cora document categorization) demonstrate that ACNet can not only achieve state-of-the-art performance but also overcome the limitations of the conventional MLP and CNN. The code is available at https://github.com/wanggrun/Adaptively-Connected-Neural-Networks.
[graph, cora, internal, work] [local, computer, general, position, international, vision, pattern, problem, wij, form, note, constant, error] [image, conference, proposed, comparison, ieee, pixel] [acnet, inference, table, neural, adaptively, network, layer, cnns, connection, convolutional, deep, convolution, nln, mlp, performance, fixed, vij, connected, guangrun, standard, number, trinet, resnet, denotes, receptive, learnable, compare, imagenet] [arxiv, preprint, node, document, machine, article, model, represent] [global, person, cnn, feature, liang, coco, detection, segmentation, three] [learning, data, training, classification, euclidean, set, large, trained, representation, representative, investigate, existing, experimental]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Guangrun and Wang, Keze and Lin, Liang},
  title = {Adaptively Connected Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
CrDoCo: Pixel-Level Domain Transfer With Cross-Domain Consistency
Yun-Chun Chen, Yen-Yu Lin, Ming-Hsuan Yang, Jia-Bin Huang


Unsupervised domain adaptation algorithms aim to transfer the knowledge learned from one domain to another (e.g., synthetic to real images). The adapted representations often do not capture pixel-level domain shifts that are crucial for dense prediction tasks (e.g., semantic segmentation). In this paper, we present a novel pixel-wise adversarial domain adaptation algorithm. By leveraging image-to-image translation methods for data augmentation, our key insight is that while the translated images between domains may differ in styles, their predictions for the task should be consistent. We exploit this property and introduce a cross-domain consistency loss that enforces our adapted model to produce consistent predictions. Through extensive experimental results, we show that our method compares favorably against the state-of-the-art on a wide variety of unsupervised domain adaptation tasks.
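The cross-domain consistency idea can be written down compactly: the task network should give the same dense prediction on an image and on its style-translated counterpart. The sketch below uses a symmetric KL divergence between the two per-pixel label distributions; the exact divergence and weighting used in the paper may differ.

import torch.nn.functional as F

def cross_domain_consistency_loss(logits_orig, logits_translated):
    # logits_* : (B, C, H, W) dense predictions on an image and its translation
    p = F.log_softmax(logits_orig, dim=1)
    q = F.log_softmax(logits_translated, dim=1)
    kl_pq = F.kl_div(q, p.exp(), reduction='batchmean')   # KL(p || q)
    kl_qp = F.kl_div(p, q.exp(), reduction='batchmean')   # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)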
[prediction, flow, dataset, optical, perform, consists, version] [depth, kitti, dense, estimation, approach, scene, reconstruction, define, consistent, corresponding, error, ground] [consistency, image, method, translation, proposed, translated, synthetic, figure, lrec, input, real, produce, translate] [network, deep, table, performance, convolutional, residual] [adversarial, model, adv, ltask, limg, evaluation] [semantic, segmentation, feature, iou] [domain, loss, task, adaptation, lconsis, source, target, learning, unsupervised, feat, training, img, experimental, set, adaptsegnet, transfer, labeled, data, train, test, unlabeled, main, enforcing, lfeat, cbst, judy, kate, trevor, learned]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Yun-Chun and Lin, Yen-Yu and Yang, Ming-Hsuan and Huang, Jia-Bin},
  title = {CrDoCo: Pixel-Level Domain Transfer With Cross-Domain Consistency},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Temporal Cycle-Consistency Learning
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman


We introduce a self-supervised representation learning method based on the task of temporal alignment between videos. The method trains a network using temporal cycle-consistency (TCC), a differentiable cycle-consistency loss that can be used to find correspondences across time in multiple videos. The resulting per-frame embeddings can be used to align videos by simply matching frames using nearest-neighbors in the learned embedding space. To evaluate the power of the embeddings, we densely label the Pouring and Penn Action video datasets for action phases. We show that (i) the learned embeddings enable few-shot classification of these action phases, significantly reducing the supervised training requirements; and (ii) TCC is complementary to other methods of self-supervised learning in videos, such as Shuffle and Learn and Time-Contrastive Networks. The embeddings are also used for a number of applications based on alignment (dense temporal correspondence) between video pairs, including transfer of metadata of synchronized modalities between videos (sounds, temporal semantic labels), synchronized playback of multiple videos, and anomaly detection. Project webpage: https://sites.google.com/view/temporal-cycle-consistency .
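A minimal sketch of one cycle-consistency check (the cycle-back classification variant; the paper also describes a regression form): embed both videos, take the soft nearest neighbour of each frame of one video in the other, cycle it back, and require it to land on the frame it started from. Using negative Euclidean distance as the similarity is an assumption here.

import torch
import torch.nn.functional as F

def tcc_cycle_back_loss(u, v):
    # u: (N, D) per-frame embeddings of video 1, v: (M, D) of video 2
    sim_uv = -torch.cdist(u, v)                  # similarities u -> v
    alpha = F.softmax(sim_uv, dim=1)             # soft nearest-neighbour weights
    v_tilde = alpha @ v                          # (N, D) soft nearest neighbours
    sim_vu = -torch.cdist(v_tilde, u)            # cycle back to video 1
    labels = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(sim_vu, labels)       # should land on the start frame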
[action, temporal, video, tcc, pouring, tcn, multiple, key, frame, penn, tau, time, recognition, dataset, selfsupervised, sequence] [computer, vision, pattern, international, differentiable, point, consistent] [conference, ieee, cycle, figure, consistency, image, method] [phase, sal, table, number, performance, deep, network, order, compare, scratch, andrew, neural, size] [ball, visual, model, understanding, progress, encoder, arxiv, preprint] [fully, regression] [learning, embedding, supervised, nearest, classification, learn, training, representation, loss, alignment, labeled, space, learned, task, datasets, distance, neighbor, unsupervised, align, set, soft, bottle, measure, pair, embeddings, transfer, metric]
@InProceedings{Dwibedi_2019_CVPR,
  author = {Dwibedi, Debidatta and Aytar, Yusuf and Tompson, Jonathan and Sermanet, Pierre and Zisserman, Andrew},
  title = {Temporal Cycle-Consistency Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Predicting Future Frames Using Retrospective Cycle GAN
Yong-Hoon Kwon, Min-Gyu Park


Recent advances in deep learning have significantly improved the performance of video prediction; however, top-performing algorithms start to generate blurry predictions as they attempt to predict farther future frames. In this paper, we propose a unified generative adversarial network for predicting accurate and temporally consistent future frames over time, even in a challenging environment. The key idea is to train a single generator that can predict both future and past frames while enforcing the consistency of bi-directional prediction using the retrospective cycle constraints. Moreover, we employ two discriminators not only to identify fake frames but also to distinguish sequences containing fake frames from real sequences. The latter discriminator, the sequence discriminator, plays a crucial role in predicting temporally consistent future frames. We experimentally verify the proposed framework against state-of-the-art methods on various real-world videos captured by car-mounted cameras, surveillance cameras, and arbitrary devices.
[frame, future, sequence, prediction, predict, video, motion, retrospective, predicting, prednet, liu, dataset, backward, recognition, forward, consists, contextvp, temporally, time] [vision, computer, international, pattern, single, predicts, reconstruction, denote, consistent, ground, truth, defined, kitti] [generator, input, method, proposed, image, conference, psnr, real, quantitative, cycle, generative, surveillance, ssim, consistency, figure, dual, mse, korea, blurry, captured, based, verify] [network, deep, neural, number, table, processing, performance, convolutional, better, architecture] [discriminator, fake, adversarial, generate, adv, evaluation] [predicted, caltech, pedestrian, spatial, propose, distinguish, role] [loss, learning, training, train, set, function, unsupervised, pair]
@InProceedings{Kwon_2019_CVPR,
  author = {Kwon, Yong-Hoon and Park, Min-Gyu},
  title = {Predicting Future Frames Using Retrospective Cycle GAN},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Density Map Regression Guided Detection Network for RGB-D Crowd Counting and Localization
Dongze Lian, Jing Li, Jia Zheng, Weixin Luo, Shenghua Gao


To simultaneously estimate head counts and localize heads with bounding boxes, a regression guided detection network (RDNet) is proposed for RGB-D crowd counting. Specifically, to improve the robustness of detection-based approaches for small/tiny heads, we leverage the density map to improve head/non-head classification in the detection network, where the density map serves as the probability of a pixel being a head. A depth-adaptive kernel that considers the variances in head sizes is also introduced to generate high-fidelity density maps for more robust density map regression. Further, a depth-aware anchor is designed for better initialization of anchor sizes in the detection framework. Then we use the bounding boxes whose sizes are estimated with depth to train our RDNet. Since existing RGB-D datasets are too small and not suitable for evaluating data-driven approaches, we collect a large-scale RGB-D crowd counting dataset. Experiments on both our RGB-D dataset and the MICC RGB-D counting dataset show that our method achieves the best performance for RGB-D crowd counting and localization. Further, our method can be readily extended to RGB image based crowd counting and achieves comparable performance on the ShanghaiTech Part_B dataset for both counting and localization.
[dataset, people] [depth, computer, estimated, vision, pattern, rgb, micc, international, well, point, estimation, total, estimate, bandwidth, camera, rgbd] [conference, based, method, figure, image, ieee, surveillance, proposed] [density, performance, network, kernel, number, table, csrnet, achieves, effectiveness, gaussian, fixed, size, initialization, deep, neural, small] [evaluation, generate, decoding] [crowd, map, detection, counting, head, bounding, regression, anchor, retinanet, box, propose, localization, shanghaitech, shanghaitechrgbd, idrees, module, guided, detect, rdnet, leverage, object, crowded, count, mcnn, help, feature, european] [set, training, loss, strategy, learning, distance, train, classification, test]
@InProceedings{Lian_2019_CVPR,
  author = {Lian, Dongze and Li, Jing and Zheng, Jia and Luo, Weixin and Gao, Shenghua},
  title = {Density Map Regression Guided Detection Network for RGB-D Crowd Counting and Localization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
TAFE-Net: Task-Aware Feature Embeddings for Low Shot Learning
Xin Wang, Fisher Yu, Ruth Wang, Trevor Darrell, Joseph E. Gonzalez


Learning good feature embeddings for images often requires substantial training data. As a consequence, in settings where training data is limited (e.g., few-shot and zero-shot learning), we are typically forced to use a general feature embedding across prediction tasks. Ideally, we would like to construct feature embeddings that are tuned for the given task and even input image. In this work, we propose Task Aware Feature Embedding Networks (TAFE-Nets) to learn how to adapt the image representation to a new task in a meta learning fashion. Our network is composed of a meta learner and a prediction network, where the meta learner generates parameters for the feature layers in the prediction network based on a task input so that the feature embedding can be accurately adjusted for that task. We show that TAFE-Net is highly effective in generalizing to new tasks or concepts and evaluate the TAFE-Net on a range of benchmarks in zero-shot and few-shot learning. Our model matches or exceeds the state-of-the-art on all tasks. In particular, our approach improves the prediction accuracy of unseen attribute-object pairs by 4 to 15 points on the challenging visual attribute-object composition task.
[prediction, dataset, dynamic, recognition, work] [computer, pattern, vision, additional, approach] [image, conference, ieee, proposed, figure, based, prior, input, generator] [network, weight, layer, neural, number, convolutional, output, table, accuracy, processing, factorization, deep, parameter, cout, size, binary, imagenet] [model, visual, generate, generation, evaluation, encoding, machine, requires] [feature, adopt] [task, learning, embedding, embeddings, training, meta, unseen, loss, data, learn, learner, novel, set, generic, classifier, apy, classification, shared, discriminative, sun, datasets, base, gzsl, mitstates, large, dimension, cub, trevor, representation]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Xin and Yu, Fisher and Wang, Ruth and Darrell, Trevor and Gonzalez, Joseph E.},
  title = {TAFE-Net: Task-Aware Feature Embeddings for Low Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Semantic Segmentation From Synthetic Data: A Geometrically Guided Input-Output Adaptation Approach
Yuhua Chen, Wen Li, Xiaoran Chen, Luc Van Gool


As an alternative to manual pixel-wise annotation, synthetic data has been increasingly used for training semantic segmentation models. Such synthetic images and semantic labels can be easily generated from virtual 3D environments. In this work, we propose an approach to cross-domain semantic segmentation with the auxiliary geometric information, which can also be easily obtained from virtual environments. The geometric information is utilized on two levels for reducing domain shift: on the input level, we augment the standard image translation network with the geometric information to translate synthetic images into realistic style; on the output level, we build a task network which simultaneously performs semantic segmentation and depth estimation. Meanwhile, adversarial training is applied on the joint output space to preserve the correlation between semantics and depth. The proposed approach is validated on two pairs of synthetic to real dataset: from Virtual KITTI to KITTI, and from SYNTHIA to Cityscapes, where we achieve a clear performance gain compared to the baselines and various competing methods, demonstrating the effectiveness of the geometric information for cross-domain semantic segmentation.
[joint, prediction, dataset, luc, work, traffic, build] [depth, geometric, virtual, estimation, approach, kitti, additional, corresponding, scene, geometrically, computer, sky] [synthetic, image, real, transform, input, proposed, translation, wen, figure, study, preserve, cyclegan, method, realistic] [network, output, performance, correlation, deep, table, convolutional, effectiveness, reduce, applied, gain, highly] [adversarial, model, van, discriminator, visual, transformed] [semantic, segmentation, semantics, dimg, miou, guided, map, improves, improvement, urban, road, leverage] [domain, adaptation, task, training, learning, data, space, target, unsupervised, source, label, aligning, adapt, datasets, loss, gimg, synthia, doutput, auxiliary, gtask, trained]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Yuhua and Li, Wen and Chen, Xiaoran and Van Gool, Luc},
  title = {Learning Semantic Segmentation From Synthetic Data: A Geometrically Guided Input-Output Adaptation Approach},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Attentive Single-Tasking of Multiple Tasks
Kevis-Kokitsi Maninis, Ilija Radosavovic, Iasonas Kokkinos


In this work we address task interference in universal networks by considering that a network is trained on multiple tasks, but performs one task at a time, an approach we refer to as "single-tasking multiple tasks". The network thus modifies its behaviour through task-dependent feature adaptation, or task attention. This gives the network the ability to accentuate the features that are adapted to a task, while shunning irrelevant ones. We further reduce task interference by forcing the task gradients to be statistically indistinguishable through adversarial training, ensuring that the common backbone architecture serving all tasks is not dominated by any of the task-specific gradients. Results in three multi-task dense labelling problems consistently show: (i) a large reduction in the number of parameters while preserving, or even improving performance and (ii) a smooth trade-off between computation and multi-task accuracy. We provide our system's code and pre-trained models at http://www.vision.ee.ethz.ch/~kmaninis/astmt/.
[human, multiple, work, dataset] [depth, surface, single, estimation, allows, approach, dense, pose, relative, interference] [image, figure, method, synthetic] [network, performance, modulation, residual, convolutional, gradient, deep, drop, table, architecture, number, computation, layer, order, nyud, neural, increase, compare, better, fsv, computational, block, compared, norm, multitask, standard] [adversarial, visual, discriminator, attention, common, indistinguishable, model, requires] [edge, semantic, segmentation, detection, backbone, feature, baseline, saliency, instance, object, average, pascal, spatial, seg] [task, shared, learning, training, learned, representation, loss, train, domain, large, statistically, learn, observe]
@InProceedings{Maninis_2019_CVPR,
  author = {Maninis, Kevis-Kokitsi and Radosavovic, Ilija and Kokkinos, Iasonas},
  title = {Attentive Single-Tasking of Multiple Tasks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Metric Learning to Rank
Fatih Cakir, Kun He, Xide Xia, Brian Kulis, Stan Sclaroff


We propose a novel deep metric learning method by revisiting the learning to rank approach. Our method, named FastAP, optimizes the rank-based Average Precision measure, using an approximation derived from distance quantization. FastAP has a low complexity compared to existing methods, and is tailored for stochastic gradient descent. To fully exploit the benefits of the ranking formulation, we also propose a new minibatch sampling scheme, as well as a simple heuristic to enable large-batch training. On three few-shot image retrieval datasets, FastAP consistently outperforms competing methods, which often involve complex optimization heuristics or costly model ensembles.
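The distance-quantization idea can be sketched as follows: soft-bin the query-to-batch distances into a histogram, accumulate precision over the bins, and average it over the positive items. Bin placement, the triangular soft assignment, and the (1 - AP) objective below are simplifications relative to the paper.

import torch

def fastap_loss(dist, is_pos, num_bins=10):
    # dist   : (N,) distances from a query to the N items in the batch
    # is_pos : (N,) 1.0 if the item shares the query's class, else 0.0
    edges = torch.linspace(float(dist.min()), float(dist.max()), num_bins)
    delta = (edges[1] - edges[0]).clamp(min=1e-6)
    # triangular soft assignment of each distance to each bin
    assign = (1.0 - (dist[None, :] - edges[:, None]).abs() / delta).clamp(min=0)
    h_pos = (assign * is_pos[None, :]).sum(dim=1)    # positives per bin
    h_all = assign.sum(dim=1)                        # all items per bin
    H_pos, H_all = h_pos.cumsum(0), h_all.cumsum(0)  # cumulative histograms
    precision = H_pos / H_all.clamp(min=1e-6)
    ap = (precision * h_pos).sum() / is_pos.sum().clamp(min=1e-6)
    return 1.0 - ap                                  # minimize 1 - AP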
[recognition, outperforms, online] [histogram, computer, vision, pattern, international, optimization, single, differentiable, problem, solution, optimizing, well, defined] [conference, ieee, method, image, figure, kun] [deep, precision, batch, performance, approximation, number, neural, gpu, size, gradient, standard, stochastic, quantization, table, optimize, efficient] [query, random, machine] [average, propose, clothes, three, heuristic, recall, area, challenging] [learning, metric, retrieval, fastap, distance, training, embedding, sampling, set, ensemble, large, loss, rank, minibatch, ranked, list, binning, class, test, novel, neighbor, triplet, pku, datasets, stanford, margin, probabilistic, strategy, ranking, consistently, main]
@InProceedings{Cakir_2019_CVPR,
  author = {Cakir, Fatih and He, Kun and Xia, Xide and Kulis, Brian and Sclaroff, Stan},
  title = {Deep Metric Learning to Rank},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
End-To-End Multi-Task Learning With Attention
Shikun Liu, Edward Johns, Andrew J. Davison


We propose a novel multi-task learning architecture, which allows learning of task-specific feature-level attention. Our design, the Multi-Task Attention Network (MTAN), consists of a single shared network containing a global feature pool, together with a soft-attention module for each task. These modules allow for learning of task-specific features from the global features, whilst simultaneously allowing for features to be shared across different tasks. The architecture can be trained end-to-end and can be built upon any feed-forward neural network, is simple to implement, and is parameter efficient. We evaluate our approach on a variety of datasets, across both image-to-image predictions and image classification tasks. We show that our architecture is state-of-the-art in multi-task learning compared to existing methods, and is also less sensitive to various weighting schemes in the multi-task loss function. Code is available at https://github.com/lorenmt/mtan.
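A minimal sketch of one task-specific attention module: a small convolutional gate maps the shared features to a soft mask in (0, 1) that selects the features relevant to that task. Channel sizes and the 1x1-convolution gate are illustrative, not the paper's exact configuration.

import torch.nn as nn

class TaskAttention(nn.Module):
    # One module of this form is instantiated per task on top of the shared
    # global feature pool; each produces its own gated copy of the features.
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1), nn.Sigmoid(),
        )

    def forward(self, shared_feat):
        return self.gate(shared_feat) * shared_feat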
[dataset, multiple, learns, prediction, challenge, consists] [depth, computer, vision, single, pattern, approach, surface, international, normal, indoor, dense] [method, image, conference, based, ieee, proposed, figure, input] [network, conv, whilst, mtan, performance, architecture, equal, neural, validation, number, table, convolutional, multitask, batch, segnet, weight, rate, layer, deep, compared, norm, epoch, standard, designed, residual, decathlon, apply, vanilla] [attention, evaluate, visual, machine] [semantic, module, feature, mask, pool, segmentation, three, global] [learning, task, shared, weighting, loss, classification, data, learned, training, datasets, novel, function, transfer]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Shikun and Johns, Edward and Davison, Andrew J.},
  title = {End-To-End Multi-Task Learning With Attention},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Self-Supervised Learning via Conditional Motion Propagation
Xiaohang Zhan, Xingang Pan, Ziwei Liu, Dahua Lin, Chen Change Loy


Intelligent agents naturally learn from motion. Various self-supervised algorithms have leveraged motion cues to learn effective visual representations. The hurdle here is that motion is both ambiguous and complex, so previous works either suffer from degraded learning efficacy or resort to strong assumptions about object motion. In this work, we design a new learning-from-motion paradigm to bridge these gaps. Instead of explicitly modeling the motion probabilities, we design the pretext task as a conditional motion propagation problem. Given an input image and several sparse flow guidance vectors on it, our framework seeks to recover the full-image motion. Compared to other alternatives, our framework has several appealing properties: (1) Using sparse flow guidance during training resolves the inherent motion ambiguity, and thus eases feature learning. (2) Solving the pretext task of conditional motion propagation encourages the emergence of kinematically-sound representations that possess greater expressive power. Extensive experiments demonstrate that our framework learns structural and coherent features, and achieves state-of-the-art self-supervision performance on several downstream tasks including semantic segmentation, instance segmentation and human parsing. Furthermore, our framework is successfully extended to several useful applications such as semi-automatic pixel-level annotation.
[motion, cmp, flow, optical, propagation, human, static, framework, predict, video, pathak, temporal, influence, previous, prediction, second, perform] [dense, walker, left, kinematic, single, ambiguity] [image, figure, method, result, conditional, user, lip, amount, recover] [sparse, number, performance, table, achieve, stride, effective, imagenet, design, convolutional, alexnet] [model, encoder, visual, sampled, create, vector] [guidance, semantic, segmentation, mask, polygon, object, voc, instance, pascal, spatial, feature, parsing, coco, miou, box, context, three] [learning, training, learn, task, representation, target, unsupervised, unlabeled, pretext, large, set, trained, negative, loss, noroozi]
@InProceedings{Zhan_2019_CVPR,
  author = {Zhan, Xiaohang and Pan, Xingang and Liu, Ziwei and Lin, Dahua and Change Loy, Chen},
  title = {Self-Supervised Learning via Conditional Motion Propagation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Bridging Stereo Matching and Optical Flow via Spatiotemporal Correspondence
Hsueh-Ying Lai, Yi-Hsuan Tsai, Wei-Chen Chiu


Stereo matching and flow estimation are two essential tasks for scene understanding, spatially in 3D and temporally in motion. Existing approaches have focused on the unsupervised setting due to the limited resources for obtaining large-scale ground-truth data. To construct a self-learnable objective, co-related tasks are often linked together to form a joint framework. However, prior work usually utilizes independent networks for each task, thus failing to learn shared feature representations across models. In this paper, we propose a single and principled network to jointly learn spatiotemporal correspondence for stereo matching and flow estimation, with a newly designed geometric connection as the unsupervised signal for temporally adjacent stereo pairs. We show that our method performs favorably against several state-of-the-art baselines for both unsupervised depth and flow estimation on the KITTI benchmark dataset.
[flow, optical, recognition, joint, warping, adjacent, framework, temporally, motion, jointly, video, dataset, frame, time, benefit, spatiotemporal, construct] [stereo, depth, estimation, matching, computer, vision, occlusion, correspondence, kitti, geometric, pattern, single, monocular, camera, reconstruction, principled, eigen, ground, godard, well, estimated, derived, occluded, truth] [consistency, proposed, conference, ieee, figure, pixel, image, based, method, reconstructed, lrec] [network, performance, deep, better, table, structure, full, order, apply] [model, introduce, common] [map, feature, improves, propose, including, utilize] [learning, unsupervised, loss, training, train, pair, shared, data, setting, learn, supervised]
@InProceedings{Lai_2019_CVPR,
  author = {Lai, Hsueh-Ying and Tsai, Yi-Hsuan and Chiu, Wei-Chen},
  title = {Bridging Stereo Matching and Optical Flow via Spatiotemporal Correspondence},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
All About Structure: Adapting Structural Information Across Domains for Boosting Semantic Segmentation
Wei-Lun Chang, Hui-Po Wang, Wen-Hsiao Peng, Wei-Chen Chiu


In this paper we tackle the problem of unsupervised domain adaptation for the task of semantic segmentation, where we attempt to transfer the knowledge learned upon synthetic datasets with ground-truth labels to real-world images without any annotation. With the hypothesis that the structural content of images is the most informative and decisive factor to semantic segmentation and can be readily shared across domains, we propose a Domain Invariant Structure Extraction (DISE) framework to disentangle images into domain-invariant structure and domain-specific texture representations, which can further realize image-translation across domains and enable label transfer to improve segmentation performance. Extensive experiments verify the effectiveness of our proposed DISE model and demonstrate its superiority over several state-of-the-art approaches.
[recognition, dataset, tex, prediction, framework] [computer, vision, pattern, reconstruction, international] [texture, image, conference, translation, dise, ltrans, translated, ieee, synthetic, method, figure, component, proposed, perceptual, str, appearance, content, prior, trans, based, tsai, input, lperc, comparison] [structure, output, performance, network, vgg, table, deep, order, layer, convolutional] [common, adversarial, encoder, adv, model, private, decoder] [semantic, segmentation, feature, seg, lseg, extraction, deeplab, european, propose] [domain, adaptation, loss, source, target, learning, label, transfer, training, space, classifier, representation, datasets, invariant, synthia, unsupervised, shared, train, minimizing, conventional]
@InProceedings{Chang_2019_CVPR,
  author = {Chang, Wei-Lun and Wang, Hui-Po and Peng, Wen-Hsiao and Chiu, Wei-Chen},
  title = {All About Structure: Adapting Structural Information Across Domains for Boosting Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Iterative Reorganization With Weak Spatial Constraints: Solving Arbitrary Jigsaw Puzzles for Unsupervised Representation Learning
Chen Wei, Lingxi Xie, Xutong Ren, Yingda Xia, Chi Su, Jiaying Liu, Qi Tian, Alan L. Yuille


Learning visual features from unlabeled image data is an important yet challenging task, which is often achieved by training a model on some annotation-free information. We consider spatial contexts, for which we solve so-called jigsaw puzzles, i.e., each image is cut into grids and then disordered, and the goal is to recover the correct configuration. Existing approaches formulated it as a classification task by defining a fixed mapping from a small subset of configurations to a class set, but these approaches ignore the underlying relationship between different configurations and also limit their applications to more complex scenarios. This paper presents a novel approach which applies to jigsaw puzzles with an arbitrary grid size and dimensionality. We provide a fundamental and generalized principle, that weaker cues are easier to be learned in an unsupervised manner and also transfer better. In the context of puzzle recognition, we use an iterative manner which, instead of solving the puzzle all at once, adjusts the order of the patches in each step until convergence. In each step, we combine both unary and binary features of each patch into a cost function judging the correctness of the current configuration. Our approach, by taking similarity between puzzles into consideration, enjoys a more efficient way of learning visual knowledge. We verify the effectiveness of our approach from two aspects. First, it solves arbitrarily complex puzzles, including high-dimensional puzzles, that prior methods are difficult to handle. Second, it serves as a reliable way of network initialization, which leads to better transfer performance in visual recognition tasks including classification, detection and segmentation.
[recognition, term, complex, work, dataset] [computer, international, approach, vision, pattern, solving, position, solve, well, problem, relative] [image, conference, patch, input, arbitrary, figure, prior, recover, study] [network, binary, unary, deep, alexnet, configuration, accuracy, convolutional, number, performance, plain, table, better, neural, cost, iteration, entire, mirror, order] [visual, model, correct, ability, iterative, relationship, evaluate] [spatial, weak, feature, medical, backbone, segmentation, semantic, european, context, detection] [learning, puzzle, training, transfer, jigsaw, unsupervised, data, representation, classification, set, classifier, difficult, knowledge, unlabeled, function, large, testing, task, supervised, trained, randomly, class]
@InProceedings{Wei_2019_CVPR,
  author = {Wei, Chen and Xie, Lingxi and Ren, Xutong and Xia, Yingda and Su, Chi and Liu, Jiaying and Tian, Qi and Yuille, Alan L.},
  title = {Iterative Reorganization With Weak Spatial Constraints: Solving Arbitrary Jigsaw Puzzles for Unsupervised Representation Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Revisiting Self-Supervised Visual Representation Learning
Alexander Kolesnikov, Xiaohua Zhai, Lucas Beyer


Unsupervised visual representation learning remains a largely unsolved problem in computer vision research. Among a big body of recently proposed approaches for unsupervised learning of visual representations, a class of self-supervised techniques achieves superior performance on many challenging benchmarks. A large number of the pretext tasks for self-supervised learning have been studied, but other important aspects, such as the choice of convolutional neural networks (CNN), have not received equal attention. Therefore, we revisit numerous previously proposed self-supervised models, conduct a thorough large-scale study and, as a result, uncover multiple crucial insights. We challenge a number of common practices in self-supervised visual representation learning and observe that standard recipes for CNN design do not always translate to self-supervised representation learning. As part of our study, we drastically boost the performance of previously proposed techniques and outperform previously published state-of-the-art results by a large margin. We will release the code for reproducing our experiments when the anonymity requirements are lifted.
[recognition, selfsupervised, multiple, work] [vision, computer, linear, rotation, pattern, international, relative, supplementary, solving, single] [conference, image, figure, patch, quality, result, study, proposed, invertible] [imagenet, performance, residual, architecture, neural, accuracy, network, table, validation, size, deep, number, increasing, factor, resnet, layer, best, convolutional, standard, order, processing, decay, design] [model, evaluation, visual, arxiv, preprint, provided] [cnn, regression, context, semantic, official, evaluated] [learning, representation, pretext, task, downstream, training, large, trained, jigsaw, widening, unsupervised, learned, logistic, observe, exemplar, consistently, classification, data, revnet, investigate]
@InProceedings{Kolesnikov_2019_CVPR,
  author = {Kolesnikov, Alexander and Zhai, Xiaohua and Beyer, Lucas},
  title = {Revisiting Self-Supervised Visual Representation Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
It's Not About the Journey; It's About the Destination: Following Soft Paths Under Question-Guidance for Visual Reasoning
Monica Haurilet, Alina Roitberg, Rainer Stiefelhagen


Visual Reasoning remains a challenging task, as it has to deal with long-range and multi-step object relationships in the scene. We present a new model for Visual Reasoning, aimed at capturing the interplay among individual objects in the image represented as a scene graph. As not all graph components are relevant for the query, we introduce the concept of a question-based visual guide, which constrains the potential solution space by learning an optimal traversal scheme, where the final destination nodes alone are used to produce the answer. We show that finding relevant semantic structures facilitates generalization to new tasks by introducing a novel problem of knowledge transfer: training on one question type and answering questions from a different domain without any training data. Furthermore, we report state-of-the-art results for Visual Reasoning on multiple query types and diverse image and video datasets.
[graph, time, prediction, dataset, starting] [computer, scene, vision, pattern, international, approach, case, sphere, problem, direction, shape] [conference, image, ieee, based, guide, figure, produce, synthetic, extracted] [neural, performance, network, number, table, equal, accuracy, connected] [visual, model, question, node, reasoning, path, query, probability, destination, answer, answering, step, compositional, clevr, length, traveler, language, relevant, vqa, existence, attention, diagram, random, cog, type, evaluate] [object, final, counting, module, three, semantic, feature, cnn] [soft, learning, representation, knowledge, training, task, trained, set, traversal, embeddings, data, embedding, conventional, unseen]
@InProceedings{Haurilet_2019_CVPR,
  author = {Haurilet, Monica and Roitberg, Alina and Stiefelhagen, Rainer},
  title = {It's Not About the Journey; It's About the Destination: Following Soft Paths Under Question-Guidance for Visual Reasoning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Actively Seeking and Learning From Live Data
Damien Teney, Anton van den Hengel


One of the key limitations of traditional machine learning methods is their requirement for training data that exemplifies all the information to be learned. This is a particular problem for visual question answering methods, which may be asked questions about virtually anything. The approach we propose is a step toward overcoming this limitation by searching for the information required at test time. The resulting method dynamically utilizes data from an external source, such as a large set of questions/answers or images/captions. Concretely, we learn a set of base weights for a simple VQA model, that are specifically adapted to a given question with the information specifically retrieved for this question. The adaptation process leverages recent advances in gradient-based meta learning and contributions for efficient retrieval and cross-domain adaptation. We surpass the state-of-the-art on the VQA-CP v2 benchmark and demonstrate our approach to be intrinsically more robust to out-of-distribution test data. We demonstrate the use of external non-VQA data using the MS COCO captioning dataset to support the answering process. This approach opens a new avenue for open-domain VQA systems that interface with diverse sources of data.
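A toy sketch of the question-specific adaptation step in the spirit of the gradient-based meta-learning the abstract mentions, not the authors' code: clone the base VQA weights and take a few gradient steps on examples retrieved for the current question. The model, retrieval output, loss, step count, and learning rate are all placeholder assumptions.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def adapt_to_question(base_model, retrieved_x, retrieved_y, steps=3, lr=0.1):
    """Adapt a copy of the base weights using data retrieved for one question."""
    model = copy.deepcopy(base_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(retrieved_x), retrieved_y)
        loss.backward()
        opt.step()
    return model

base = nn.Linear(32, 10)                 # stand-in for a simple VQA model
x = torch.randn(8, 32)                   # retrieved example features (hypothetical)
y = torch.randint(0, 10, (8,))           # retrieved answers (hypothetical)
adapted = adapt_to_question(base, x, y)
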
[dataset, breakfast] [approach, vision, additional, robust, limited, algorithm, projection, season, computer, underlying, classical] [image, method, proposed, input, demonstrate, ieee, conference, figure] [neural, gradient, performance, table, network, fixed, number] [vqa, model, question, visual, external, answering, arxiv, preprint, qas, relevance, evaluation, relevant, captioning, sport, cutting, language, procedure, van, den, reasoning, evaluate, machine, step, simple, vqacp, retrieve, answer] [baseline, propose, final, coco, utilize] [support, data, training, adaptation, learning, set, test, trained, loss, function, meta, existing, learn, retrieval, update, large, distribution, retrieved, source, domain, novel, maml, adapt, task, learned]
@InProceedings{Teney_2019_CVPR,
  author = {Teney, Damien and van den Hengel, Anton},
  title = {Actively Seeking and Learning From Live Data},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing
Xihui Liu, Zihao Wang, Jing Shao, Xiaogang Wang, Hongsheng Li


Referring expression grounding aims at locating certain objects or persons in an image with a referring expression, where the key challenge is to comprehend and align various types of information from visual and textual domain, such as visual attributes, location and interactions with surrounding regions. Although the attention mechanism has been successfully applied for cross-modal alignments, previous attention models focus on only the most dominant features of both modalities, and neglect the fact that there could be multiple comprehensive textual-visual correspondences between images and referring expressions. To tackle this issue, we design a novel cross-modal attention-guided erasing approach, where we discard the most dominant information from either textual or visual domains to generate difficult training samples online, and to drive the model to discover complementary textual-visual correspondences. Extensive experiments demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance on three referring expression grounding datasets.
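A minimal sketch of the visual side of the erasing idea, under the assumption that the image is represented by a set of region features with attention weights (the paper also erases on the textual side); region count, feature size, and attention values are placeholders.

import torch

def erase_most_attended(region_feats, attention):
    """Zero out the most-attended region so the model must rely on
    complementary textual-visual correspondences.
    region_feats: (num_regions, dim), attention: (num_regions,)."""
    erased = region_feats.clone()
    erased[attention.argmax()] = 0.0
    return erased

feats = torch.randn(36, 2048)                      # e.g. 36 detected regions (hypothetical)
attn = torch.softmax(torch.randn(36), dim=0)       # attention over regions (hypothetical)
hard_sample = erase_most_attended(feats, attn)
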
[subject, multiple, previous, horse, drive, second, perform, work, key] [computer, vision, pattern, dominant, matching, approach, corresponding] [expression, image, conference, ieee, proposed, based, figure, comprehensive] [original, network, better, denotes, calculate, design] [attention, erasing, visual, model, referring, query, erased, grounding, sentence, erase, relationship, girl, discover, language, word, candidate, textual, brown, arxiv, preprint, rode, generate, black, adversarial, mattnet, modular, modality] [object, context, region, location, complementary, module, feature, three, proposal, spatial, detection, score, salient, european, xiaogang] [training, learn, loss, embedding, positive]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Xihui and Wang, Zihao and Shao, Jing and Wang, Xiaogang and Li, Hongsheng},
  title = {Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Neighbourhood Watch: Referring Expression Comprehension via Language-Guided Graph Attention Networks
Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, Anton van den Hengel


The task in referring expression comprehension is to localize the object instance in an image described by a referring expression phrased in natural language. As a language-to-vision matching task, the key to this problem is to learn a discriminative object feature that can adapt to the expression used. To avoid ambiguity, the expression normally tends to describe not only the properties of the referent itself, but also its relationships to its neighbourhood. To capture and exploit this important information, we propose a graph-based, language-guided attention mechanism. Being composed of a node attention component and an edge attention component, the proposed graph attention mechanism explicitly represents inter-object relationships and object properties with a flexibility and power impossible with competing approaches. Furthermore, the proposed graph attention mechanism enables the comprehension decision to be visualizable and explainable. Experiments on three referring expression comprehension datasets show the advantage of the proposed approach.
[graph, key, highlight, directed, subject, work] [matching, eij, define] [expression, image, ieee, component, proposed, composed, based, study] [table, denotes, better, neural, network, top] [attention, node, referring, language, relationship, comprehension, visual, relevant, child, woman, lgrans, femb, referent, mechanism, held, represent, model, neighbourhood, attended, ssub, sintra, refcoco, identify, encode, sinter, encoding, aintra, ainter, aobj, obj, encoded, refcocog, testa, testb, van, natural, referred, monolithic, vector, grounding] [object, edge, feature, three, region, spatial, module, comparing, ablation, val] [representation, intra, inter, adapt, learning, loss, set, learn, experimental]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Peng and Wu, Qi and Cao, Jiewei and Shen, Chunhua and Gao, Lianli and van den Hengel, Anton},
  title = {Neighbourhood Watch: Referring Expression Comprehension via Language-Guided Graph Attention Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Scene Graph Generation With External Knowledge and Image Reconstruction
Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, Mingyang Ling


Scene graph generation has received growing attention with the advancements in image understanding tasks such as object detection, attributes and relationship prediction, etc. However, existing datasets are biased in terms of object and relationship labels, or often come with noisy and missing annotations, which makes the development of a reliable scene graph prediction model very challenging. In this paper, we propose a novel scene graph generation algorithm with external knowledge and image reconstruction loss to overcome these dataset issues. In particular, we extract commonsense knowledge from the external knowledge base to refine object and phrase features for improving generalizability in scene graph generation. To address the bias of noisy object annotations, we introduce an auxiliary image reconstruction path to regularize the scene graph generation network. Extensive experiments show that our framework can generate better scene graphs, achieving the state-of-the-art performance on two benchmark datasets: Visual Relationship Detection and Visual Genome datasets.
[graph, dataset, jointly, recognizing, framework, dynamic, predict] [scene, reconstruction, approach, corresponding, well] [image, based, figure, proposed, generator, method] [table, number, output, performance, network, neural, regularize, deep, convolutional, improving, regularizer] [visual, generation, relationship, model, subgraph, external, memory, commonsense, predicate, generate, vector, phrdet, question, phrase, reasoning, introduced, vrd, language, jiuxiang, attention, generated, word, gan, sggen, conceptnet] [object, feature, detection, refinement, layout, supervision, module, relation, bounding, proposal, jianfei, refine, baseline, final, gang, branch, adopt, propose, semantic, context] [knowledge, set, training, auxiliary, update, loss, existing, learning, train, datasets, retrieved, noisy]
@InProceedings{Gu_2019_CVPR,
  author = {Gu, Jiuxiang and Zhao, Handong and Lin, Zhe and Li, Sheng and Cai, Jianfei and Ling, Mingyang},
  title = {Scene Graph Generation With External Knowledge and Image Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
Yale Song, Mohammad Soleymani


Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared space. Unfortunately, injective embedding cannot effectively handle polysemous instances with multiple possible meanings; at best, it would find an average representation of different meanings. This hinders its use in real-world scenarios where individual instances and their cross-modal associations are often ambiguous. In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning. To learn visual-semantic embedding, we tie up two PIE-Nets and optimize them jointly in a multiple instance learning framework. Most existing work on cross-modal retrieval focuses on image-text pairs of data. Here, we also tackle a more challenging case of video-text retrieval. To facilitate further research in video-text retrieval, we release a new dataset of 50K video-sentence pairs collected from social media, dubbed MRW (my reaction when). We demonstrate our approach on both image-text and video-text retrieval scenarios using MS-COCO, TGIF, and our new MRW dataset.
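A sketch of the multiple-instance matching idea described above, assuming each PIE-Net emits K embeddings per instance: a pair is scored by the best similarity over all embedding combinations, and that score is plugged into an ordinary triplet loss. K, the dimension, and the margin are illustrative, and this is not the authors' exact objective.

import torch
import torch.nn.functional as F

def pair_score(img_embs, txt_embs):
    """Multiple-instance score: best cosine similarity over all
    (image embedding, text embedding) combinations. Shapes: (K, d), (K, d)."""
    img = F.normalize(img_embs, dim=-1)
    txt = F.normalize(txt_embs, dim=-1)
    return (img @ txt.t()).max()

def mil_triplet_loss(anchor_img, pos_txt, neg_txt, margin=0.2):
    pos = pair_score(anchor_img, pos_txt)
    neg = pair_score(anchor_img, neg_txt)
    return F.relu(margin - pos + neg)

K, d = 4, 512
loss = mil_triplet_loss(torch.randn(K, d), torch.randn(K, d), torch.randn(K, d))
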
[multiple, dataset, video, report, ldiv, combining] [local, approach, single, compute, constraint, robust] [image, figure, method, mapping, based, collected, described] [residual, performance, deep, table, apply, correlation, optimize, popular, neural, layer, kernel, compared] [mrw, partial, model, ambiguous, injective, sentence, polysemous, visual, reaction, text, pvse, tgif, attention, find, association, diverse, transformer] [instance, global, feature, mil, final, map, context] [embedding, learning, retrieval, loss, embeddings, test, datasets, distance, discrepancy, learn, existing, space, representation, shared, train, data, triplet, negative, function]
@InProceedings{Song_2019_CVPR,
  author = {Song, Yale and Soleymani, Mohammad},
  title = {Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MUREL: Multimodal Relational Reasoning for Visual Question Answering
Remi Cadene, Hedi Ben-younes, Matthieu Cord, Nicolas Thome


Multimodal attentional networks are currently state-of-the-art models for Visual Question Answering (VQA) tasks involving real images. Although attention allows the model to focus on the visual content relevant to the question, this simple mechanism is arguably insufficient to model the complex reasoning required for VQA or other high-level tasks. In this paper, we propose MuRel, a multimodal relational network which is learned end-to-end to reason over real images. Our first contribution is the introduction of the MuRel cell, an atomic reasoning primitive representing interactions between question and image regions by a rich vectorial representation, and modeling region relations with pairwise combinations. Second, we incorporate the cell into a full MuRel network, which progressively refines visual and question interactions, and can be leveraged to define visualization schemes finer than mere attention maps. We validate the relevance of our approach with various ablation studies, and show its superiority to attention-based methods on three datasets: VQA 2.0, VQA-CP v2 and TDIUC. Our final MuRel network is competitive with or outperforms state-of-the-art results in this challenging context. Our code is available: github.com/Cadene/murel.bootstrap.pytorch
[fusion, modeling, recognition, dataset, work, graph, multiple, report, complex, framework, focus] [computer, vision, pattern, international, approach, scene, provide, analysis, define] [conference, image, ieee, figure, based, real, comparison, proposed, attentional] [network, cell, bilinear, neural, process, number, deep, table, accuracy, gain, structure, residual, performance, validation, compare, processing] [visual, murel, question, vqa, reasoning, model, attention, multimodal, relational, step, iterative, answer, answering, tdiuc, contribution, vector, relevant, reason, relationship, rich, vectorial] [region, spatial, semantic, three, module, context, object, score, visualization, final] [pairwise, representation, learning, set, trained, train, validate, embedding, training]
@InProceedings{Cadene_2019_CVPR,
  author = {Cadene, Remi and Ben-younes, Hedi and Cord, Matthieu and Thome, Nicolas},
  title = {MUREL: Multimodal Relational Reasoning for Visual Question Answering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering
Chenyou Fan, Xiaofan Zhang, Shu Zhang, Wensheng Wang, Chi Zhang, Heng Huang


In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory which can effectively learn global context information from appearance and motion features; 2) a redesigned question memory which helps understand the complex semantics of the question and highlights queried subjects; and 3) a new multimodal fusion layer which performs multi-step reasoning by attending to relevant visual and textual hints with self-updated attention. Our VideoQA model first generates global context-aware visual and textual features by interacting the current inputs with the memory contents. It then performs attentional fusion of the multimodal visual and textual representations to infer the correct answer. Multiple cycles of reasoning can be performed to iteratively refine the attention weights of the multimodal data and improve the final representation of the QA pair. Experimental results demonstrate that our approach achieves state-of-the-art performance on four VideoQA benchmark datasets.
[video, motion, hidden, lstm, fusion, state, heterogeneous, complex, temporal, current, dataset, understand, work, multiple, outperforms, time] [define] [appearance, method, content, proposed, image, attentional, figure, based] [table, number, network, designed, size, design, accuracy, better, layer, achieves, compare, best, iteration, experiment] [question, memory, visual, attention, model, multimodal, answer, videoqa, reasoning, relevant, read, write, correct, word, answering, encoded, attend, encoder, textual, generate, external, step, queried, man, vector, encoders, type] [global, feature, final, three, semantics, benchmark, challenging, integrate, module] [learn, learning, representation, existing, task, update]
@InProceedings{Fan_2019_CVPR,
  author = {Fan, Chenyou and Zhang, Xiaofan and Zhang, Shu and Wang, Wensheng and Zhang, Chi and Huang, Heng},
  title = {Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Information Maximizing Visual Question Generation
Ranjay Krishna, Michael Bernstein, Li Fei-Fei


Though image-to-sequence generation models have become overwhelmingly popular in human-computer communications, they suffer from strongly favoring safe generic questions ("What is in this picture?"). Generating uninformative but relevant questions is not sufficient or useful. We argue that a good question is one that has a tightly focused purpose --- one that is aimed at expecting a specific type of response. We build a model that maximizes mutual information between the image, the expected answer and the generated question. To overcome the non-differentiability of discrete natural language tokens, we introduce a variational continuous latent space onto which the expected answers project. We regularize this latent space with a second latent space that ensures clustering of similar answers. Even when we don't know the expected answer, this second latent space can generate goal-driven questions specifically aimed at extracting objects ("what is the person throwing"), attributes, ("What kind of shirt is the person wearing?"), color ("what color is the frisbee?"), material ("What material is the frisbee?"), etc. We quantitatively show that our model is able to retain information about an expected answer category, resulting in more diverse, goal-driven questions. We launch our model on a set of real world images and extract previously unseen visual concepts.
[second, report, time, dataset, recurrent, people] [computer, vision, discrete, shape, pattern, continuous, well, international, optimizing, ground] [image, latent, color, conference, figure, ieee, input, result, real, generative, day, reconstruct, generator] [neural, processing, deep, table, network, low, mlp, number] [answer, question, model, visual, generate, generating, expected, arxiv, preprint, generation, generated, variational, maximizes, language, maximizing, man, relevance, vqa, unique, answering, diversity, maximize, embed, food, machine, relevant, adversarial, evaluation, van, find, uninformative] [category, person, final, cnn] [mutual, space, learning, training, set, specific, trained, learn, task, train, representation, loss, data, test, measure, generic]
@InProceedings{Krishna_2019_CVPR,
  author = {Krishna, Ranjay and Bernstein, Michael and Fei-Fei, Li},
  title = {Information Maximizing Visual Question Generation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Detect Human-Object Interactions With Knowledge
Bingjie Xu, Yongkang Wong, Junnan Li, Qi Zhao, Mohan S. Kankanhalli


The recent advances in instance-level detection tasks lay a strong foundation for automated visual scene understanding. However, the ability to fully comprehend a social scene still eludes us. In this work, we focus on detecting human-object interactions (HOIs) in images, an essential step towards deeper scene understanding. HOI detection aims to localize humans and objects, as well as to identify the complex interactions between them. As is innate to practical problems with a large label space, HOI categories exhibit a long-tail distribution, i.e., there exist some rare categories with very few training samples. Given the key observation that HOIs contain intrinsic semantic regularities despite being visually diverse, we tackle the challenge of long-tail HOI categories by modeling the underlying regularities among verbs and objects in HOIs as well as general relationships. In particular, we construct a knowledge graph based on the ground-truth annotations of the training dataset and an external source. In contrast to direct knowledge incorporation, we address the necessity of dynamic image-specific knowledge retrieval by multi-modal learning, which leads to an enhanced semantic embedding space for HOI comprehension. The proposed method shows improved performance on V-COCO and HICO-DET benchmarks, especially when predicting the rare HOI categories.
[verb, graph, hoi, joint, human, hois, prediction, modeling, dataset, recognition, interaction, xho, gcn, work, interacting] [well, scene, general, associated, problem, approach] [proposed, figure, based, method, image] [neural, network, convolutional, deep, structure, best, configuration] [visual, model, node, word, linguistic, external, compositional, glove, multimodal, vector] [object, semantic, detection, feature, score, spatial, detect, bounding, detected, extra, lsim] [knowledge, learning, embedding, embeddings, pairwise, space, training, label, triplet, set, learn, update, loss, rare, test, learned, pair, cross, entropy, address, distribution, similarity, data]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Bingjie and Wong, Yongkang and Li, Junnan and Zhao, Qi and Kankanhalli, Mohan S.},
  title = {Learning to Detect Human-Object Interactions With Knowledge},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Words by Drawing Images
Didac Suris, Adria Recasens, David Bau, David Harwath, James Glass, Antonio Torralba


We propose a framework for learning through drawing. Our goal is to learn the correspondence between spoken words and abstract visual attributes, from a dataset of spoken descriptions of images. Building upon recent findings that GAN representations can be manipulated to edit semantic concepts in the generated output, we propose a new method to use such GAN-generated images to train a model using a triplet loss. To apply the method, we develop Audio CLEVRGAN, a new dataset of audio descriptions of GAN-generated CLEVR images, and we describe a training procedure that creates a curriculum of GAN-generated images that focuses training on image pairs that differ in a specific, informative way. Training is done without additional supervision beyond the spoken captions and the GAN. We find that training that takes advantage of GAN-generated edited examples results in improvements in the model's ability to learn attributes compared to previous results. Our proposed learning framework also results in models that can associate spoken words with some abstract visual concepts such as color and size.
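A sketch of the triplet objective described above, under the assumption that the anchor is a spoken-caption embedding, the positive is the embedding of the GAN image it describes, and the negative is the embedding of a GAN-edited copy differing in one attribute (e.g. color); the embeddings and margin here are placeholders, not the authors' model.

import torch
import torch.nn.functional as F

def caption_image_triplet(caption_emb, image_emb, edited_emb, margin=1.0):
    """Pull the caption toward the image it describes and push it away from
    a GAN-edited version that differs in one attribute."""
    pos = F.pairwise_distance(caption_emb, image_emb)
    neg = F.pairwise_distance(caption_emb, edited_emb)
    return F.relu(pos - neg + margin).mean()

d = 256
loss = caption_image_triplet(torch.randn(2, d), torch.randn(2, d), torch.randn(2, d))
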
[audio, dataset, human, work, speech, correspond, sound, framework] [computer, vision, pattern] [image, figure, conference, attribute, method, synthetic, generative, edit, ieee, drawing, proposed] [original, network, neural, deep, small, convolutional, apply] [edited, visual, model, gan, random, system, spoken, generate, targeted, edits, clevr, generated, abstract, concept, caption, davenet, language, matchmap, ability, adversarial, question, ablating, gans, compositional, generation, description, ball, ablate, create, procedure] [object, segmentation, semantic, improve, european, propose] [training, learning, learn, negative, learned, randomly, trained, representation, train, curriculum, specific, cluster, hard, set, test, triplet, positive]
@InProceedings{Suris_2019_CVPR,
  author = {Suris, Didac and Recasens, Adria and Bau, David and Harwath, David and Glass, James and Torralba, Antonio},
  title = {Learning Words by Drawing Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Factor Graph Attention
Idan Schwartz, Seunghak Yu, Tamir Hazan, Alexander G. Schwing


Dialog is an effective way to exchange information, but subtle details and nuances are extremely important. While significant progress has paved a path to address visual dialog with algorithms, details and nuances remain a challenge. Attention mechanisms have demonstrated compelling results to extract details in visual question answering and also provide a convincing framework for visual dialog due to their interpretability and effectiveness. However, the many data utilities that accompany visual dialog challenge existing attention techniques. We address this issue and develop a general attention mechanism for visual dialog which operates on any number of data utilities. To this end, we design a factor graph based attention mechanism which combines any number of utility representations. We illustrate the applicability of the proposed approach on the challenging and recently introduced VisDial datasets, outperforming recent state-of-the-art methods by 1.1% for VisDial0.9 and by 2% for VisDial1.0 on MRR. Our ensemble model improved the MRR score on VisDial1.0 by more than 6%.
[graph, lstm, dataset, focus, work, consists, interaction, people, second, previous, capture, framework] [approach, general, local, note, provide] [image, based, proposed, prior, subtle, figure, generative] [factor, number, table, deep, better, convolutional, neural, layer, trainable, performance, group, applied] [attention, question, visual, history, dialog, model, answer, utility, answering, caption, mrr, attended, word, generation, entity, mechanism, visdial, vector, wearing, attend, textual, fga, memory, probability, hat, rdi, develop, embed, hqt, skateboarder, arxiv, captioning, multimodal, evaluate, external, preprint, diverse, introduced] [hierarchical, improve, challenging, score] [representation, set, embedding, discriminative, data, pairwise, observe, learning]
@InProceedings{Schwartz_2019_CVPR,
  author = {Schwartz, Idan and Yu, Seunghak and Hazan, Tamir and Schwing, Alexander G.},
  title = {Factor Graph Attention},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Reducing Uncertainty in Undersampled MRI Reconstruction With Active Acquisition
Zizhao Zhang, Adriana Romero, Matthew J. Muckley, Pascal Vincent, Lin Yang, Michal Drozdzal


The goal of MRI reconstruction is to restore a high-fidelity image from partially observed measurements. This partial view naturally induces reconstruction uncertainty that can only be reduced by acquiring additional measurements. In this paper, we present a novel method for MRI reconstruction that, at inference time, dynamically selects the measurements to take and iteratively refines the prediction in order to best reduce the reconstruction error and, thus, its uncertainty. We validate our method on a large-scale knee MRI dataset, as well as on ImageNet. Results show that (1) our system successfully outperforms active acquisition baselines; (2) our uncertainty estimates correlate with error maps; and (3) our ResNet-based architecture surpasses standard pixel-to-pixel models in the task of MRI reconstruction. The proposed method not only produces high-quality reconstructions but also paves the way towards more applicable solutions for accelerating MRI.
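A toy version of the active acquisition loop: at each step, acquire the k-space column where the current uncertainty estimate is highest, then re-reconstruct. The reconstruction and uncertainty models below are placeholder callables, not the paper's networks, and column-wise Cartesian sampling is an assumption.

import numpy as np

def active_acquisition(kspace_full, n_steps, reconstruct, uncertainty):
    """Iteratively add the k-space column with the highest predicted
    uncertainty to the sampling mask.
    kspace_full: (H, W) fully sampled measurements (used only to simulate acquisition)."""
    H, W = kspace_full.shape
    mask = np.zeros(W, dtype=bool)
    mask[W // 2] = True                      # start from one low-frequency line
    for _ in range(n_steps):
        recon = reconstruct(kspace_full * mask[None, :])
        scores = uncertainty(recon)          # one score per k-space column
        scores[mask] = -np.inf               # never re-acquire a column
        mask[int(np.argmax(scores))] = True
    return mask

# placeholder models: identity "reconstruction" and a simple magnitude-based score
mask = active_acquisition(
    np.random.randn(64, 64), n_steps=8,
    reconstruct=lambda k: k,
    uncertainty=lambda r: np.abs(r).mean(axis=0))
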
[trajectory, dataset, knee, signal, outperforms, work] [reconstruction, active, error, measurement, note, well, additional, corresponding, initial, fourier] [image, mri, acquisition, evaluator, mse, figure, spectral, kma, method, high, magnetic, sensing, proposed, resonance, quality, input, undersampled, fidelity, acquired, ieee, dicom, comparison, generative, imaging] [network, deep, number, compressed, inference, process, output, convolutional, acceleration, low, scale, standard, neural, better, layer, adaptive, binary, reducing, reduce, accelerating, variance] [adversarial, observed, model, unobserved, goal, system, introduce] [score, medical, fully, mask, map, cascaded] [uncertainty, learning, sampling, training, trained, train, data, function, large, select, classification]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Zizhao and Romero, Adriana and Muckley, Matthew J. and Vincent, Pascal and Yang, Lin and Drozdzal, Michal},
  title = {Reducing Uncertainty in Undersampled MRI Reconstruction With Active Acquisition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ESIR: End-To-End Scene Text Recognition via Iterative Image Rectification
Fangneng Zhan, Shijian Lu


Automated recognition of texts in scenes has been a research challenge for years, largely due to the arbitrary text appearance variation in perspective distortion, text line curvature, text styles and different types of imaging artifacts. The recent deep networks are capable of learning robust representations with respect to imaging artifacts and text style changes, but still face various problems while dealing with scene texts with perspective and curvature distortions. This paper presents an end-to-end trainable scene text recognition system (ESIR) that iteratively removes perspective distortion and text line curvature as driven by better scene text recognition performance. An innovative rectification network is developed, where a line-fitting transformation is designed to estimate the pose of text lines in scenes. Additionally, an iterative rectification framework is developed which corrects scene text distortions iteratively towards a fronto-parallel view. The ESIR is also robust to parameter initialization and easy to train, where the training needs only scene text images and word-level annotations as required by most scene text recognition systems. Extensive experiments over a number of public datasets show that the proposed ESIR is capable of rectifying scene text distortions accurately, achieving superior recognition performance for both normal scene text images and those suffering from perspective and curvature distortions.
[recognition, consists, sequence, iteratively, middle, second, dataset, individual] [scene, rectification, perspective, robust, rectified, distorted, curvature, distortion, accurate, approach, estimation, estimate, polynomial, estimated, dealing, pose, greatly] [image, proposed, transformation, capable, amount, based, described, fangneng, figure] [network, number, performance, deep, better, table, neural, parameter, iteration, trainable, initialization, superior, order, residual, andrew] [text, esir, iterative, word, model, attention, svtp, lexicon, shijian, cute, character, synthtext, reading, svt, system] [detection, localization, illustrated, feature, cropped, driven, improve, three] [training, datasets, learning, novel, large, data, sample, suffer, set]
@InProceedings{Zhan_2019_CVPR,
  author = {Zhan, Fangneng and Lu, Shijian},
  title = {ESIR: End-To-End Scene Text Recognition via Iterative Image Rectification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape
Fabian Manhardt, Wadim Kehl, Adrien Gaidon


We present a deep learning method for end-to-end monocular 3D object detection and metric shape retrieval. We propose a novel loss formulation by lifting 2D detection, orientation, and scale estimation into 3D space. Instead of optimizing these quantities separately, the 3D instantiation allows to properly measure the metric misalignment of boxes. We experimentally show that our 10D lifting of sparse 2D Regions of Interests (RoIs) achieves great results both for 6D pose and recovery of the textured metric geometry of instances. This further enables 3D synthetic data augmentation via inpainting recovered meshes directly onto the 2D scenes. We evaluate on KITTI3D against other strong monocular methods and demonstrate that our approach doubles the AP on the 3D pose metrics on the official test set, defining the new state of the art.
[recognition, work, prediction, instantiation, predict] [pose, monocular, depth, computer, shape, lifting, vision, ground, estimation, pattern, truth, view, single, rotation, approach, accurate, camera, projective, allocentric, directly, mesh, analysis, scene, regress, kitti, absolute, estimate] [conference, method, synthetic, ieee, figure, eye, image, translation, input, recover, latent] [deep, network, better, neural, scale, table, validation, full, convolutional, proper] [model, strong, describe, evaluation] [object, detection, bounding, official, map, box, regression, predicted, roi, iou, easy, moderate, propose] [data, loss, metric, learning, training, set, test, space, trained, split, hard, weighting]
@InProceedings{Manhardt_2019_CVPR,
  author = {Manhardt, Fabian and Kehl, Wadim and Gaidon, Adrien},
  title = {ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Collaborative Learning of Semi-Supervised Segmentation and Classification for Medical Images
Yi Zhou, Xiaodong He, Lei Huang, Li Liu, Fan Zhu, Shanshan Cui, Ling Shao


Medical image analysis has two important research areas: disease grading and fine-grained lesion segmentation. Although the former problem often relies on the latter, the two are usually studied separately. Disease severity grading can be treated as a classification problem, which only requires image-level annotations, while the lesion segmentation requires stronger pixel-level annotations. However, pixel-wise data annotation for medical images is highly time-consuming and requires domain experts. In this paper, we propose a collaborative learning method to jointly improve the performance of disease grading and lesion segmentation by semi-supervised learning with an attention mechanism. Given a small set of pixel-level annotated data, a multi-lesion mask generation model first performs the traditional semantic segmentation task. Then, based on initially predicted lesion maps for large quantities of image-level annotated data, a lesion attentive disease grading model is designed to improve the severity classification accuracy. Meanwhile, the lesion attention model can refine the lesion maps using class-specific information to fine-tune the segmentation model in a semi-supervised manner. An adversarial architecture is also integrated for training. With extensive experiments on a representative medical problem called diabetic retinopathy (DR), we validate the effectiveness of our method and achieve consistent improvements over state-of-the-art methods on three public datasets.
[auc, dataset, consists] [june, normal, limited] [image, proposed, method, based, generator, input, high, collaborative] [deep, network, performance, convolutional, convolution, table, effectiveness, neural, architecture, compared, low, xception, basic, small] [model, attention, adversarial, generate, discriminator, evaluation, van, requires] [lesion, segmentation, grading, disease, annotated, attentive, medical, diabetic, detection, semantic, retinopathy, feature, predicted, severity, mask, idrid, improve, final, kappa, roc, initially, fundus, adopt, eyepacs, messidor, september, three, module, spatial, improvement, imagelevel, refine] [learning, classification, data, training, set, pseudo, large, train, soft, loss, learn, hard]
@InProceedings{Zhou_2019_CVPR,
  author = {Zhou, Yi and He, Xiaodong and Huang, Lei and Liu, Li and Zhu, Fan and Cui, Shanshan and Shao, Ling},
  title = {Collaborative Learning of Semi-Supervised Segmentation and Classification for Medical Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Biologically-Constrained Graphs for Global Connectomics Reconstruction
Brian Matejek, Daniel Haehn, Haidong Zhu, Donglai Wei, Toufiq Parag, Hanspeter Pfister


Most current state-of-the-art connectome reconstruction pipelines have two major steps: initial pixel-based segmentation with affinity prediction and watershed transform, and refined segmentation by merging over-segmented regions. These methods rely only on local context and are typically agnostic to the underlying biology. Since a few merge errors can lead to several incorrectly merged neuronal processes, these algorithms are currently tuned towards over-segmentation producing an overburden of costly proofreading. We propose a third step for connectomics reconstruction pipelines to refine an over-segmentation using both local and global context with an emphasis on adhering to the underlying biology. We first extract a graph from an input segmentation where nodes correspond to segment labels and edges indicate potential split errors in the over-segmentation. In order to increase throughput and allow for large-scale reconstruction, we employ biologically inspired geometric constraints based on neuron morphology to reduce the number of nodes and edges. Next, two neural networks learn these neuronal shapes to further aid the graph construction process. Lastly, we reformulate the region merging problem as a graph partitioning one to leverage global context. We demonstrate the performance of our approach on four real-world connectomics datasets with an average variation of information improvement of 21.3%.
[graph, skeleton, dataset, current, multiple, merged] [optimization, error, corresponding, reconstruction, underlying, algorithm, local, volume, geometric, endpoint, initial, approach, voxel, topological, multicut, computer, total] [method, input, proposed, image, figure, variation, produce, based, conference] [small, process, number, neural, network, table, neuron, reduce, larger, employ, decrease] [pni, generation, node, arxiv, preprint, correctly, step, generate, automatic, potential, example, vector] [merge, segmentation, edge, neuronal, segment, three, global, merging, cnn, agglomeration, baseline, connectomics, affinity, kasthuri, context, region, electron, receive, boundary, belong, thinning, watershed] [split, test, learning, strategy, datasets, large, data, learn, partitioning]
@InProceedings{Matejek_2019_CVPR,
  author = {Matejek, Brian and Haehn, Daniel and Zhu, Haidong and Wei, Donglai and Parag, Toufiq and Pfister, Hanspeter},
  title = {Biologically-Constrained Graphs for Global Connectomics Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
P3SGD: Patient Privacy Preserving SGD for Regularizing Deep CNNs in Pathological Image Classification
Bingzhe Wu, Shiwan Zhao, Guangyu Sun, Xiaolu Zhang, Zhong Su, Caihong Zeng, Zhihong Liu


Recently, deep convolutional neural networks (CNNs) have achieved great success in pathological image classification. However, due to the limited number of labeled pathological images, there are still two challenges to be addressed: (1) overfitting: the performance of a CNN model is undermined by overfitting due to its huge number of parameters and the insufficiency of labeled training data; (2) privacy leakage: a model trained using a conventional method may involuntarily reveal the private information of the patients in the training dataset. The smaller the dataset, the worse the privacy leakage. To tackle these two challenges, we introduce a novel stochastic gradient descent (SGD) scheme, named patient privacy preserving SGD (P3SGD), which performs the SGD model update at the patient level via a large-step update built upon each patient's data. Specifically, to protect privacy and regularize the CNN model, we propose to inject well-designed noise into the updates. Moreover, we equip P3SGD with an elaborate strategy to adaptively control the scale of the injected noise. To validate the effectiveness of P3SGD, we perform extensive experiments on a real-world clinical dataset and quantitatively demonstrate the superior ability of P3SGD in reducing the risk of overfitting. We also provide a rigorous analysis of the privacy cost under differential privacy. Additionally, we find that the models trained with P3SGD are resistant to the model-inversion attack compared with those trained using non-private SGD.
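A minimal sketch of a patient-level noisy update in the spirit of DP-SGD, which the abstract's scheme builds on: average the gradient over one patient's images, clip its norm, and perturb it with Gaussian noise before stepping. The clipping bound, noise scale, and model are illustrative assumptions, not the paper's calibrated mechanism.

import torch
import torch.nn as nn
import torch.nn.functional as F

def patient_private_step(model, opt, patient_x, patient_y, clip=1.0, sigma=0.5):
    """One large-step update built on a single patient's data:
    clip the per-patient gradient and add Gaussian noise."""
    opt.zero_grad()
    loss = F.cross_entropy(model(patient_x), patient_y)
    loss.backward()
    total_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
    scale = (clip / (total_norm + 1e-12)).clamp(max=1.0)
    for p in model.parameters():
        p.grad.mul_(scale).add_(torch.randn_like(p.grad) * sigma * clip)
    opt.step()

model = nn.Linear(128, 4)                    # stand-in for a pathology classifier
opt = torch.optim.SGD(model.parameters(), lr=0.01)
patient_private_step(model, opt, torch.randn(16, 128), torch.randint(0, 4, (16,)))
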
[dataset, previous, work, adjacent, perform, outperforms, inject] [algorithm, differential, bound, theorem, provide, analysis, define] [image, noise, method, figure, control, input, demonstrate, traditional, clinical] [deep, sgd, convolutional, scale, dropout, patient, cnns, accuracy, fixed, regularization, accountant, cost, neural, performance, adaptive, modern, original, differentially, standard, number, dropblock, table, weight, gain, achieves, gradient, rigorous, gaussian, drop, batch, effectiveness] [privacy, model, private, introduce, randomized, attack, mechanism, empirical, named, randomness, arxiv, preprint] [cnn, pathological, propose] [training, testing, update, strategy, set, learning, trained, noisy, loss, overfitting, classification, randomly, data, function, gap, setting, risk]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Bingzhe and Zhao, Shiwan and Sun, Guangyu and Zhang, Xiaolu and Su, Zhong and Zeng, Caihong and Liu, Zhihong},
  title = {P3SGD: Patient Privacy Preserving SGD for Regularizing Deep CNNs in Pathological Image Classification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Elastic Boundary Projection for 3D Medical Image Segmentation
Tianwei Ni, Lingxi Xie, Huangjie Zheng, Elliot K. Fishman, Alan L. Yuille


We focus on an important yet challenging problem: using a 2D deep network to deal with 3D segmentation for medical image analysis. Existing approaches either apply multi-view planar (2D) networks or directly use volumetric (3D) networks for this purpose, but neither is ideal: 2D networks cannot capture 3D contexts effectively, and 3D networks are memory-consuming and less stable, arguably due to the lack of pre-trained models. In this paper, we bridge the gap between 2D and 3D using a novel approach named Elastic Boundary Projection (EBP). The key observation is that, although the object is a 3D volume, what we really need in segmentation is to find its boundary, which is a 2D surface. Therefore, we place a number of pivot points in the 3D space, and for each pivot, we determine its distance to the object boundary along a dense set of directions. This creates an elastic shell around each pivot which is initialized as a perfect sphere. We train a 2D deep network to determine whether each ending point falls within the object, and gradually adjust the shell so that it converges to the actual shape of the boundary, thus achieving the goal of segmentation. EBP allows boundary-based segmentation without cutting a 3D volume into slices or patches, which stands out from conventional 2D and 3D approaches. EBP achieves promising accuracy in abdominal organ segmentation. Our code will be released at https://github.com/twni2016/Elastic-Boundary-Projection .
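A toy 2-D illustration of the elastic-shell idea (the paper works on 3-D volumes with a learned 2D network): rays emanate from a pivot, and each radius grows or shrinks depending on whether its endpoint is predicted to lie inside the object. The inside/outside oracle, step size, and ray count below are placeholder assumptions.

import numpy as np

def elastic_boundary(pivot, inside, n_dirs=64, r0=1.0, step=0.5, iters=50):
    """Adjust a shell of radii around `pivot` until each ray endpoint sits near
    the predicted boundary. `inside(points)` stands in for the network that
    decides whether endpoints fall within the object."""
    angles = np.linspace(0, 2 * np.pi, n_dirs, endpoint=False)
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    radii = np.full(n_dirs, r0)
    for _ in range(iters):
        endpoints = pivot + radii[:, None] * dirs
        radii += np.where(inside(endpoints), step, -step)   # grow if inside, shrink if outside
        radii = np.clip(radii, 0.0, None)
    return pivot + radii[:, None] * dirs     # points approximating the boundary

# toy object: a disk of radius 10 centred at the origin
boundary = elastic_boundary(
    pivot=np.array([2.0, -1.0]),
    inside=lambda pts: (pts ** 2).sum(axis=1) < 100.0)
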
[second, dataset, determine] [volume, volumetric, point, international, voxels, direction, computer, reconstruction, pattern, algorithm, approach, radius, vision, directly, coordinate, define, angle, projection] [image, conference, dsc, ieee, figure, based, input, outer] [deep, number, elastic, network, convolutional, neural, accuracy, computing, process, iteration, max, converge, table, unit, apply, density, applied, initialized] [model, deal, indicates, generated, find, step] [ebp, boundary, segmentation, medical, pivot, organ, inner, rstn, located, shell, vnet, pancreas, abdominal, average, object, spleen, alan, stage, predicted, spatial, elliot] [data, training, set, sample, distance, target, testing, randomly, large, idea, convergence, learning, min, trained, gap]
@InProceedings{Ni_2019_CVPR,
  author = {Ni, Tianwei and Xie, Lingxi and Zheng, Huangjie and Fishman, Elliot K. and Yuille, Alan L.},
  title = {Elastic Boundary Projection for 3D Medical Image Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SIXray: A Large-Scale Security Inspection X-Ray Benchmark for Prohibited Item Discovery in Overlapping Images
Caijing Miao, Lingxi Xie, Fang Wan, Chi Su, Hongye Liu, Jianbin Jiao, Qixiang Ye


In this paper, we present a large-scale dataset and establish a baseline for prohibited item discovery in Security Inspection X-ray images. Our dataset, named SIXray, consists of 1,059,231 X-ray images, in which 6 classes of 8,929 prohibited items are manually annotated. It raises a brand-new challenge of overlapping image data, while sharing the same properties as existing datasets, including complex yet meaningless contexts and class imbalance. We propose an approach named class-balanced hierarchical refinement (CHR) to deal with these difficulties. CHR assumes that each input image is sampled from a mixture distribution, and that deep networks require an iterative process to infer image contents accurately. To accelerate this process, we insert reversed connections into different network backbones, delivering high-level visual cues to assist mid-level features. In addition, a class-balanced loss function is designed to maximally alleviate the noise introduced by easy negative samples. We evaluate CHR on SIXray with different ratios of positive/negative samples. Compared to the baselines, CHR enjoys a better ability to discriminate objects, especially using mid-level features, which offers the possibility of using a weakly-supervised approach towards accurate object localization. In particular, the advantage of CHR is more significant in scenarios with fewer positive training samples, which demonstrates its potential application in real-world security inspection.
[dataset, recognition, complex, work, second] [approach, computer, well, formulation] [image, based, inspection, figure, gun] [deep, network, accuracy, ratio, number, convolutional, gain, neural, performance, eqn, densenet, process, fewer, larger, formulate, iteration, small, binary, table] [security, sampled, visual, natural, named, provided, evaluate, deal, model] [object, prohibited, chr, localization, hierarchical, overlapping, sixray, refinement, feature, baggage, weakly, average, detection, baseline, including, wrench, scissors, gdxray, qixiang, knife, indicating, supervision, three] [class, training, loss, classification, negative, data, testing, positive, learning, function, item, supervised, distribution, observe, set]
@InProceedings{Miao_2019_CVPR,
  author = {Miao, Caijing and Xie, Lingxi and Wan, Fang and Su, Chi and Liu, Hongye and Jiao, Jianbin and Ye, Qixiang},
  title = {SIXray: A Large-Scale Security Inspection X-Ray Benchmark for Prohibited Item Discovery in Overlapping Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Noise2Void - Learning Denoising From Single Noisy Images
Alexander Krull, Tim-Oliver Buchholz, Florian Jug


The field of image denoising is currently dominated by discriminative deep learning methods that are trained on pairs of noisy input and clean target images. Recently, it has been shown that such methods can also be trained without clean targets. Instead, independent pairs of noisy images can be used, in an approach known as Noise2Noise (N2N). Here, we introduce Noise2Void (N2V), a training scheme that takes this idea one step further. It requires neither noisy image pairs nor clean target images. Consequently, N2V allows us to train directly on the body of data to be denoised and can therefore be applied when other methods cannot. Especially interesting is the application to biomedical image data, where the acquisition of training targets, clean or noisy, is frequently not possible. We compare the performance of N2V to approaches that have either clean target images and/or noisy image pairs available. Intuitively, N2V cannot be expected to outperform methods that have more information available during training. Still, we observe that the denoising performance of Noise2Void drops only moderately and compares favorably to training-free denoising methods.
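A minimal sketch of a blind-spot training step in the spirit of N2V: mask a handful of pixels, replace their input values with randomly chosen neighbours, and evaluate the loss only at the masked positions so the network cannot simply copy its input. The tiny network, mask count, and neighbour-sampling window are simplified assumptions, not the authors' exact scheme.

import torch
import torch.nn as nn
import torch.nn.functional as F

def n2v_step(net, noisy, n_mask=64):
    """One blind-spot step on a single noisy image of shape (1, 1, H, W)."""
    _, _, H, W = noisy.shape
    ys = torch.randint(0, H, (n_mask,))
    xs = torch.randint(0, W, (n_mask,))
    ny = (ys + torch.randint(-2, 3, (n_mask,))).clamp(0, H - 1)
    nx = (xs + torch.randint(-2, 3, (n_mask,))).clamp(0, W - 1)
    masked = noisy.clone()
    masked[0, 0, ys, xs] = noisy[0, 0, ny, nx]          # blind-spot replacement
    pred = net(masked)
    return F.mse_loss(pred[0, 0, ys, xs], noisy[0, 0, ys, xs])  # loss only at masked pixels

net = nn.Conv2d(1, 1, kernel_size=3, padding=1)         # placeholder denoiser
loss = n2v_step(net, torch.randn(1, 1, 64, 64))
loss.backward()
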
[signal, prediction, predict, work, multiple, internal] [ground, truth, single, field, approach, assume, initial, directly, body, well, corresponding, square] [image, oise, denoising, input, noise, pixel, clean, traditional, figure, patch, microscopy, based, simulated, traditionally, result, fluorescence, psnr, method, restoration, real, denoised, biomedical, quality] [network, scheme, applied, batch, receptive, deep, size, architecture, performance, convolutional, neural, structured, rate, standard, gaussian, compare, residual, structure] [requires, introduce, expected, consider, natural, median, random] [cnn, propose] [training, noisy, trained, data, target, learning, independent, train, task, learn, test, distribution, function, loss, novel]
@InProceedings{Krull_2019_CVPR,
  author = {Krull, Alexander and Buchholz, Tim-Oliver and Jug, Florian},
  title = {Noise2Void - Learning Denoising From Single Noisy Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Joint Discriminative and Generative Learning for Person Re-Identification
Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, Jan Kautz


Person re-identification (re-id) remains challenging due to significant intra-class variations across different cameras. Recently, there has been a growing interest in using generative models to augment training data and enhance the invariance to input changes. The generative pipelines in existing methods, however, stay relatively separate from the discriminative re-id learning stages. Accordingly, re-id models are often trained in a straightforward manner on the generated data. In this paper, we seek to improve learned re-id embeddings by better leveraging the generated data. To this end, we propose a joint learning framework that couples re-id learning and data generation end-to-end. Our model involves a generative module that separately encodes each person into an appearance code and a structure code, and a discriminative module that shares the appearance encoder with the generative module. By switching the appearance or structure codes, the generative module is able to generate high-quality cross-id composed images, which are online fed back to the appearance encoder and used to improve the discriminative module. The proposed joint learning framework renders significant improvement over the baseline without using generated data, leading to the state-of-the-art performance on several benchmark datasets.
[joint, online, human, framework, dynamic, outperforms] [pose, reconstruction, approach, body, additional] [appearance, image, generative, figure, identity, real, input, latent, comparison, based, recon, synthetic, quality, switching, composed, proposed, method, style, lrecon, fprim] [structure, deep, better, table, network, wei, original, fine, apply] [generated, generation, model, generate, primary, encoder, adversarial, gans, gan, introduce, diversity, visual] [person, feature, module, pedestrian, liang, map, identification, three, improve, zhedong, propose, baseline, clothing, leverage] [learning, discriminative, training, loss, code, data, set, existing, learn, space, train, domain, unified, learned, soft, large]
@InProceedings{Zheng_2019_CVPR,
  author = {Zheng, Zhedong and Yang, Xiaodong and Yu, Zhiding and Zheng, Liang and Yang, Yi and Kautz, Jan},
  title = {Joint Discriminative and Generative Learning for Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Person Re-Identification by Soft Multilabel Learning
Hong-Xing Yu, Wei-Shi Zheng, Ancong Wu, Xiaowei Guo, Shaogang Gong, Jian-Huang Lai


Although unsupervised person re-identification (RE-ID) has drawn increasing research attention due to its potential to address the scalability problem of supervised RE-ID models, it is very challenging to learn discriminative information in the absence of pairwise labels across disjoint camera views. To overcome this problem, we propose a deep model for the soft multilabel learning for unsupervised RE-ID. The idea is to learn a soft multilabel (real-valued label likelihood vector) for each unlabeled person by comparing the unlabeled person with a set of known reference persons from an auxiliary domain. We propose the soft multilabel-guided hard negative mining to learn a discriminative embedding for the unlabeled target domain by exploring the similarity consistency of the visual features and the soft multilabels of unlabeled target pairs. Since most target pairs are cross-view pairs, we develop the cross-view consistent soft multilabel learning to achieve the learning goal that the soft multilabels are consistently good across different camera views. To enable efficient soft multilabel learning, we introduce the reference agent learning to represent each reference person by a reference agent in a joint embedding. We evaluate our unified deep model on Market-1501 and DukeMTMC-reID. Our model outperforms the state-of-the-art unsupervised RE-ID methods by clear margins. Code is available at https://github.com/KovenYu/MAR.
[dataset, mar, joint, work, outperforms] [camera, consistent, relative, absolute, problem, corresponding] [reference, figure, based, visually, image, comparison, appearance, high] [deep, effectiveness, number, table, represents, pretrained] [model, agent, visual, potential, introduce, evaluate] [person, feature, comparing, propose, baseline, mined, map] [soft, multilabel, learning, unlabeled, unsupervised, target, discriminative, negative, hard, label, domain, embedding, set, auxiliary, pair, mining, learn, similarity, source, adaptation, lcm, data, training, agreement, multilabels, lal, likelihood, mine, comparative, metric, transfer, distribution, loss, knowledge, pseudo, representation, function, labeled, unlabelled, positive, observe, address, pairwise, unified, learned]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Hong-Xing and Zheng, Wei-Shi and Wu, Ancong and Guo, Xiaowei and Gong, Shaogang and Lai, Jian-Huang},
  title = {Unsupervised Person Re-Identification by Soft Multilabel Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
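A hedged sketch of the soft multilabel computation described above: an unlabeled target feature is compared against a set of learned reference agents and the similarities are turned into a real-valued label-likelihood vector. The helper names, the softmax temperature, and the thresholds in the toy hard-negative test are assumptions, not the released MAR code.

```python
import torch
import torch.nn.functional as F

def soft_multilabel(features, reference_agents, temperature=1.0):
    """features: (N, D) L2-normalised embeddings of unlabeled target persons.
    reference_agents: (K, D) one learned agent per reference (source) identity.
    Returns (N, K) real-valued label-likelihood vectors that sum to 1 per row."""
    logits = features @ reference_agents.t() / temperature
    return F.softmax(logits, dim=1)

def is_hard_negative(feature_similarity, multilabel_agreement,
                     feat_thresh=0.8, agree_thresh=0.5):
    # a pair that looks alike in feature space but whose soft multilabels
    # disagree is treated as a mined hard negative (thresholds are illustrative)
    return feature_similarity > feat_thresh and multilabel_agreement < agree_thresh
```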
Learning Context Graph for Person Search
Yichao Yan, Qiang Zhang, Bingbing Ni, Wendong Zhang, Minghao Xu, Xiaokang Yang


Person re-identification has achieved great progress with deep convolutional neural networks. However, most previous methods focus on learning individual appearance feature embedding, and it is hard for the models to handle difficult situations with different illumination, large pose variance and occlusion. In this work, we take a step further and consider employing context information for person search. For a probe-gallery pair, we first propose a contextual instance expansion module, which employs a relative attention module to search and filter useful context information in the scene. We also build a graph learning framework to effectively employ context pairs to update target similarity. These two modules are built on top of a joint detection and instance feature learning framework, which improves the discriminativeness of the learned features. The proposed framework achieves state-of-the-art performance on two widely used person search datasets.
[graph, framework, dataset, individual, gcn, human, previous, build, joint, second] [relative, matching, scene] [figure, proposed, appearance, image, method, based, real, great] [search, deep, achieves, performance, layer, network, table, employ, convolutional, connected, design, bingbing, structure, neural, number, pooling, size, lower, better] [model, attention, ian, introduce] [person, context, feature, detection, instance, contextual, prw, pedestrian, bounding, cnn, propose, utilize, yichao, object, oim, xiaokang, expansion, liang, illustrated, global, region] [learning, target, similarity, gallery, loss, learned, learn, set, observe, representation, training, softmax, pair, metric, distance, positive, large, discriminative, train]
@InProceedings{Yan_2019_CVPR,
  author = {Yan, Yichao and Zhang, Qiang and Ni, Bingbing and Zhang, Wendong and Xu, Minghao and Yang, Xiaokang},
  title = {Learning Context Graph for Person Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Gradient Matching Generative Networks for Zero-Shot Learning
Mert Bulent Sariyildiz, Ramazan Gokberk Cinbis


Zero-shot learning (ZSL) is one of the most promising problems where substantial progress can potentially be achieved through unsupervised learning, due to distributional differences between supervised and zero-shot classes. For this reason, several works investigate the incorporation of discriminative domain adaptation techniques into ZSL, which, however, lead to modest improvements in ZSL accuracy. In contrast, we propose a generative model that can naturally learn from unsupervised examples, and synthesize training examples for unseen classes purely based on their class embeddings, and therefore, reduce the zero-shot learning problem into a supervised classification task. The proposed approach consists of two important components: (i) a conditional Generative Adversarial Network that learns to produce samples that mimic the characteristics of unsupervised data examples, and (ii) the Gradient Matching (GM) loss that measures the quality of the gradient signal obtained from the synthesized examples. Using our GM loss formulation, we enforce the generator to produce examples from which accurate classifiers can be trained. Experimental results on several ZSL benchmark datasets show that our approach leads to significant improvements over the state of the art in generalized zero-shot classification.
[learns, dataset] [pattern, matching, computer, approach, compute, optimizing] [generative, ieee, generator, conditional, synthesized, synthetic, based, noise, real, proposed, produce, synthesize, unconditional, image, synthesis, separate] [gradient, number, network, deep, accuracy, table, neural, validation, order, better] [model, wgan, discriminator, visual, evaluation, generate, vector, adversarial] [semantic, propose, lcls, feature, final, supervision, object] [training, class, learning, unseen, loss, zsl, lgm, set, classification, train, classifier, data, sun, embedding, function, gmn, learn, embeddings, cwgan, compatibility, awa, cub, observe, generalized, unsupervised, supervised, trained, unlabeled, minimizing, transductive, lwgan, sample, update, dfake, datasets, large, distribution]
@InProceedings{Sariyildiz_2019_CVPR,
  author = {Bulent Sariyildiz, Mert and Gokberk Cinbis, Ramazan},
  title = {Gradient Matching Generative Networks for Zero-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
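A sketch of the gradient-matching idea from the abstract, assuming PyTorch: the generator is rewarded when its synthesized examples induce a classifier gradient similar to the one induced by real examples. The cosine-similarity formulation below is an illustration of that idea, not necessarily the paper's exact GM loss.

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(classifier, real_x, real_y, fake_x, fake_y):
    """Penalise the mismatch between classifier gradients computed on a batch
    of real examples and on a batch of synthesized examples."""
    params = [p for p in classifier.parameters() if p.requires_grad]
    loss_real = F.cross_entropy(classifier(real_x), real_y)
    loss_fake = F.cross_entropy(classifier(fake_x), fake_y)
    # keep the graph so the generator can be updated through this loss
    g_real = torch.autograd.grad(loss_real, params, create_graph=True)
    g_fake = torch.autograd.grad(loss_fake, params, create_graph=True)
    g_real = torch.cat([g.flatten() for g in g_real])
    g_fake = torch.cat([g.flatten() for g in g_fake])
    return 1.0 - F.cosine_similarity(g_real, g_fake, dim=0)
```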
Doodle to Search: Practical Zero-Shot Sketch-Based Image Retrieval
Sounak Dey, Pau Riba, Anjan Dutta, Josep Llados, Yi-Zhe Song


In this paper, we investigate the problem of zero-shot sketch-based image retrieval (ZS-SBIR), where human sketches are used as queries to conduct retrieval of photos from unseen categories. We advance prior art by proposing a novel ZS-SBIR scenario that represents a firm step forward in its practical application. The new setting uniquely recognizes two important yet often neglected challenges of practical ZS-SBIR, (i) the large domain gap between amateur sketch and photo, and (ii) the necessity for moving towards large-scale retrieval. We first contribute to the community a novel ZS-SBIR dataset, QuickDraw-Extended, that consists of 330,000 sketches and 204,000 photos spanning across 110 categories. Highly abstract amateur human sketches are purposefully sourced to maximize the domain gap, instead of the often semi-photorealistic ones included in existing datasets. We then formulate a ZS-SBIR framework to jointly model sketches and photos into a common embedding space. A novel strategy to mine the mutual information among domains is specifically engineered to alleviate the domain gap. External semantic knowledge is further embedded to aid semantic transfer. We show that, rather surprisingly, retrieval performance that significantly outperforms the state of the art on existing datasets can already be achieved using a reduced version of our model. We further demonstrate the superior performance of our full model by comparing with a number of alternatives on the newly proposed dataset. The new dataset, plus all training and testing code of our model, will be publicly released to facilitate future research.
[dataset, framework, human, moving, previous] [provide, practical, defined, problem, corresponding] [image, proposed, comparison, figure, real, based, high, abstraction, input, method] [network, deep, table, order, performance, top, full, number, gradient] [model, visual, attention, encoder, cvae, evaluation, abstract, common, system] [semantic, map, feature, propose, object, category, three, semantics, level, google] [sketch, domain, retrieval, embedding, loss, sbir, gap, datasets, training, learning, existing, test, triplet, space, novel, amateur, ranking, unseen, large, quickdraw, sketchy, set, train, learn, task, negative, trained, class, main, retrieved, data, draw]
@InProceedings{Dey_2019_CVPR,
  author = {Dey, Sounak and Riba, Pau and Dutta, Anjan and Llados, Josep and Song, Yi-Zhe},
  title = {Doodle to Search: Practical Zero-Shot Sketch-Based Image Retrieval},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Zero-Shot Task Transfer
Arghya Pal, Vineeth N Balasubramanian


In this work, we present a novel meta-learning algorithm that regresses model parameters for novel tasks for which no ground truth is available (zero-shot tasks). In order to adapt to novel zero-shot tasks, our meta-learner learns from the model parameters of known tasks (with ground truth) and the correlation of known tasks to zero-shot tasks. Such intuition finds its foothold in cognitive science, where a subject (human baby) can adapt to a novel concept (depth understanding) by correlating it with old concepts (hand movement or self-motion), without receiving explicit supervision. We evaluated our model on the Taskonomy dataset, with four tasks as zero-shot: surface normal, room layout, depth and camera pose estimation. These tasks were chosen based on the data acquisition complexity and the complexity associated with the learning process using a deep network. Our proposed methodology outperforms state-of-the-art models (which use ground truth) on each of our zero-shot tasks, showing promise on zero-shot task transfer. We also conducted extensive experiments to study the various choices of our methodology, and showed how the proposed method can also be used in transfer learning. To the best of our knowledge, this is the first such effort on zero-shot learning in the task space.
[work, multiple, dataset, learns, finetuned] [ground, vision, surface, computer, normal, truth, estimation, depth, pattern, camera, pose, algorithm, regress, supplementary, international, well, optimal] [figure, conference, ieee, method, based, proposed, study, generative] [correlation, neural, number, deep, network, processing, table, process, rate] [model, room, mode, red, arxiv, preprint, decoder, considered, basis, consider, encoder] [supervision, layout, weak, object, predicted] [task, learning, data, ttnet, transfer, novel, domain, labeled, learn, taskonomy, knowledge, source, methodology, wcommon, ttnetls, objective, target, learned, training, meta, adapt, trained, function, loss, set]
@InProceedings{Pal_2019_CVPR,
  author = {Pal, Arghya and N Balasubramanian, Vineeth},
  title = {Zero-Shot Task Transfer},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
C-MIL: Continuation Multiple Instance Learning for Weakly Supervised Object Detection
Fang Wan, Chang Liu, Wei Ke, Xiangyang Ji, Jianbin Jiao, Qixiang Ye


Weakly supervised object detection (WSOD) is a challenging task when provided with image category supervision but required to simultaneously learn object locations and object detectors. Many WSOD approaches adopt multiple instance learning (MIL) and have non-convex loss functions which are prone to getting stuck in local minima (falsely localizing object parts) while missing the full object extent during training. In this paper, we introduce a continuation optimization method into MIL, thereby creating continuation multiple instance learning (C-MIL), with the intention of alleviating the non-convexity problem in a systematic way. We partition instances into spatially related and class related subsets, and approximate the original loss function with a series of smoothed loss functions defined within the subsets. Optimizing smoothed loss functions prevents the training procedure from falling prematurely into local minima and facilitates the discovery of Stable Semantic Extremal Regions (SSERs) which indicate full object extent. On the PASCAL VOC 2007 and 2012 datasets, C-MIL improves the state of the art of weakly supervised object detection and weakly supervised object localization by large margins.
[multiple, series, selector, early] [optimization, defined, stable, problem, pattern, local, smoothed, single, algorithm, solution] [image, ieee, method, comparison, spatially, figure, proposed] [performance, epoch, selection, deep, table, network, full, neural, parameter, activation, smoothing, optimized, denotes, progressive, introducing, gradually, convolutional] [model, procedure, activate] [instance, object, continuation, mil, weakly, localization, detection, bag, detector, voc, extremal, map, wsod, semantic, score, iou, pascal, partitioned, highest, region, melm, weakrpn, category, extent, alleviating, context] [loss, supervised, learning, subset, training, function, classification, learned, class, positive, discriminative, large, partition, minimum, alleviate, selected]
@InProceedings{Wan_2019_CVPR,
  author = {Wan, Fang and Liu, Chang and Ke, Wei and Ji, Xiangyang and Jiao, Jianbin and Ye, Qixiang},
  title = {C-MIL: Continuation Multiple Instance Learning for Weakly Supervised Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Weakly Supervised Learning of Instance Segmentation With Inter-Pixel Relations
Jiwoon Ahn, Sunghyun Cho, Suha Kwak


This paper presents a novel approach for learning instance segmentation with image-level class labels as supervision. Our approach generates pseudo instance segmentation labels of training images, which are used to train a fully supervised model. For generating the pseudo labels, we first identify confident seed areas of object classes from attention maps of an image classification model, and propagate them to discover the entire instance areas with accurate boundaries. To this end, we propose IRNet, which estimates rough areas of individual instances and detects boundaries between different object classes. It thus enables assigning instance labels to the seeds and propagating them within the boundaries so that the entire areas of instances can be estimated accurately. Furthermore, IRNet is trained with inter-pixel relations on the attention maps, thus no extra supervision is required. Our method with IRNet achieves an outstanding performance on the PASCAL VOC 2012 dataset, surpassing not only previous state-of-the-art methods trained with the same level of supervision, but also some previous models relying on stronger supervision.
[displacement, recognition, previous, propagation, predict] [computer, vision, pattern, field, approach, international, estimated] [conference, ieee, image, method, figure, quality, pixel, based, input] [network, performance, table, convolutional, entire, deep, convolution, number] [attention, vector, model, random, generating] [instance, segmentation, semantic, irnet, supervision, object, boundary, map, weakly, pascal, affinity, voc, equivalence, bounding, box, cam, feature, european, refined, affinitynet, fully, weak, score, branch, confident, level] [class, pseudo, learning, supervised, trained, label, training, classification, pairwise, train, pair, centroid, set, dog]
@InProceedings{Ahn_2019_CVPR,
  author = {Ahn, Jiwoon and Cho, Sunghyun and Kwak, Suha},
  title = {Weakly Supervised Learning of Instance Segmentation With Inter-Pixel Relations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Attention-Based Dropout Layer for Weakly Supervised Object Localization
Junsuk Choe, Hyunjung Shim


Weakly Supervised Object Localization (WSOL) techniques learn the object location only using image-level labels, without location annotations. A common limitation of these techniques is that they cover only the most discriminative part of the object, not the entire object. To address this problem, we propose an Attention-based Dropout Layer (ADL), which utilizes the self-attention mechanism to process the feature maps of the model. The proposed method is composed of two key components: 1) hiding the most discriminative part from the model for capturing the integral extent of the object, and 2) highlighting the informative region for improving the recognition power of the model. Based on extensive experiments, we demonstrate that the proposed method is effective in improving the accuracy of WSOL, achieving new state-of-the-art localization accuracy on the CUB-200-2011 dataset. We also show that the proposed method is much more efficient in terms of both parameter and computation overheads than existing techniques.
[dataset, current] [note, additional] [method, proposed, image, input, figure, background, based, produce] [accuracy, drop, adl, convolutional, applied, dropout, neural, layer, parameter, acol, applying, rate, deep, entire, vanilla, improving, efficient, wsol, better, best, computing, channelwise, pooling, lightweight, clas, table, power, effective, computation, compared, network] [model, attention, erase, machine, visual] [map, localization, feature, object, mask, weakly, region, spg, backbone, loc, improve, yunchao, extent, spatial, box, cam, jiashi, segmentation, activated, cnn, heatmap] [discriminative, classification, supervised, existing, learning, learn, observe, classifier, training, target, auxiliary, class, informative, distribution, set]
@InProceedings{Choe_2019_CVPR,
  author = {Choe, Junsuk and Shim, Hyunjung},
  title = {Attention-Based Dropout Layer for Weakly Supervised Object Localization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
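A possible PyTorch reading of the two ADL components described above: a drop mask that hides the most discriminative region and an importance map that highlights informative regions, one of which is applied at random during training. drop_rate and drop_threshold are illustrative defaults, not the paper's tuned values.

```python
import torch
import torch.nn as nn

class ADL(nn.Module):
    """Attention-based dropout layer sketch operating on feature maps."""
    def __init__(self, drop_rate=0.75, drop_threshold=0.9):
        super().__init__()
        self.drop_rate = drop_rate
        self.drop_threshold = drop_threshold

    def forward(self, x):                                   # x: (B, C, H, W)
        if not self.training:
            return x
        attention = x.mean(dim=1, keepdim=True)             # (B, 1, H, W)
        max_val = attention.view(attention.size(0), -1).max(dim=1)[0]
        max_val = max_val.view(-1, 1, 1, 1)
        drop_mask = (attention < self.drop_threshold * max_val).float()
        importance_map = torch.sigmoid(attention)
        if torch.rand(1).item() < self.drop_rate:
            return x * drop_mask          # erase the most discriminative part
        return x * importance_map         # highlight informative regions
```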
Domain Generalization by Solving Jigsaw Puzzles
Fabio M. Carlucci, Antonio D'Innocente, Silvia Bucci, Barbara Caputo, Tatiana Tommasi


Human adaptability relies crucially on the ability to learn and merge knowledge both from supervised and unsupervised learning: the parents point out a few important concepts, but then the children fill in the gaps on their own. This is particularly effective, because supervised learning can never be exhaustive, and thus learning autonomously allows the learner to discover invariances and regularities that help it generalize. In this paper we propose to apply a similar approach to the task of object recognition across domains: our model learns the semantic labels in a supervised fashion, and broadens its understanding of the data by learning from self-supervised signals how to solve a jigsaw puzzle on the same images. This secondary task helps the network to learn the concepts of spatial correlation while acting as a regularizer for the classification task. Multiple experiments on the PACS, VLCS, Office-Home and digits datasets confirm our intuition and show that this simple method outperforms previous domain generalization and adaptation solutions. An ablation study further illustrates the inner workings of our approach.
[recognition, multiple, work, dataset, ordered, previous, second] [computer, vision, pattern, solving, international, problem, single, permutation, approach, define, analysis] [image, conference, patch, based, method, figure, proposed, changing, presented, result, reference] [deep, network, accuracy, weight, table, alexnet, convolutional, original, standard, number, process, performance, neural, better] [model, visual, adversarial, considered, indicates, provided] [object, average, three, feature, european, ablation, grid, fully] [jigsaw, domain, jigen, learning, data, generalization, classification, target, source, unsupervised, task, loss, shuffled, adaptation, training, puzzle, bias, knowledge, specific, set, setting, learn, supervised, classifier, large, experimental, barbara, datasets, vlcs]
@InProceedings{Carlucci_2019_CVPR,
  author = {Carlucci, Fabio M. and D'Innocente, Antonio and Bucci, Silvia and Caputo, Barbara and Tommasi, Tatiana},
  title = {Domain Generalization by Solving Jigsaw Puzzles},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
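A rough sketch of the self-supervised jigsaw signal used alongside the object classifier: tiles of each image are shuffled according to one of a fixed set of permutations, and an auxiliary head must recover the permutation index. The helper below only prepares the shuffled image and its auxiliary label; the permutation set and grid size are assumptions.

```python
import random
import torch

def make_jigsaw(image, permutations, grid=3):
    """image: (C, H, W) tensor. permutations: a pre-built list of tile
    orderings, e.g. permutations[0] = tuple(range(grid * grid)) for the
    unshuffled image. Returns the shuffled image and the permutation index."""
    c, h, w = image.shape
    th, tw = h // grid, w // grid
    tiles = [image[:, r * th:(r + 1) * th, col * tw:(col + 1) * tw]
             for r in range(grid) for col in range(grid)]
    label = random.randrange(len(permutations))
    shuffled = [tiles[i] for i in permutations[label]]
    rows = [torch.cat(shuffled[r * grid:(r + 1) * grid], dim=2)
            for r in range(grid)]
    return torch.cat(rows, dim=1), label
```

During training, the auxiliary cross-entropy on the permutation index would simply be added, with a small weight, to the object-classification loss.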
Transferrable Prototypical Networks for Unsupervised Domain Adaptation
Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah Ngo, Tao Mei


In this paper, we introduce a new idea for unsupervised domain adaptation via a remold of Prototypical Networks, which learn an embedding space and perform classification via the distances to the prototype of each class. Specifically, we present Transferrable Prototypical Networks (TPN) for adaptation such that the prototypes for each class in source and target domains are close in the embedding space and the score distributions predicted by prototypes separately on source and target data are similar. Technically, TPN initially matches each target example to the nearest prototype in the source domain and assigns an example a "pseudo" label. The prototype of each class could then be computed on source-only, target-only and source-target data, respectively. TPN is trained end-to-end by jointly minimizing the distance across the prototypes on the three types of data and the KL-divergence of score distributions output by each pair of prototypes. Extensive experiments are conducted on the transfers across MNIST, USPS and SVHN datasets, and superior results are reported compared with state-of-the-art approaches. More remarkably, we obtain a single-model accuracy of 80.4% on the VisDA 2017 dataset.
[dataset, hypothesis, multiple, joint] [error, computed, directly, measured, problem] [image, jan, based, figure, noise] [accuracy, deep, performance, kernel, iteration] [model, discriminator, adversarial, example] [score, feature, three] [domain, target, adaptation, source, tpn, discrepancy, unsupervised, class, data, learning, transfer, prototypical, embedding, prototype, sample, pseudo, training, classification, confusion, space, loss, minimizing, learnt, labeled, visda, mmd, transferrable, distribution, set, classifier, dkl, distance, representation, mnist, ource, trained, usps, large, unlabeled, invariant, tpngen, tpntask, learn, close, task, function, datasets]
@InProceedings{Pan_2019_CVPR,
  author = {Pan, Yingwei and Yao, Ting and Li, Yehao and Wang, Yu and Ngo, Chong-Wah and Mei, Tao},
  title = {Transferrable Prototypical Networks for Unsupervised Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
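An illustrative sketch of the TPN training signals, assuming PyTorch: target examples are pseudo-labeled by their nearest source prototype, prototypes are computed on source-only, target-only and combined data, and the score distributions they induce are pulled together. This is a simplified reading of the abstract, not the authors' implementation, and it assumes every class appears among both source labels and target pseudo-labels in the batch.

```python
import torch
import torch.nn.functional as F

def prototypes(embeddings, labels, num_classes):
    """Class means in the embedding space (assumes each class is present)."""
    protos = torch.zeros(num_classes, embeddings.size(1),
                         device=embeddings.device)
    for c in range(num_classes):
        protos[c] = embeddings[labels == c].mean(dim=0)
    return protos

def tpn_losses(src_emb, src_y, tgt_emb, num_classes):
    # pseudo-label each target example by its nearest source-domain prototype
    p_src = prototypes(src_emb, src_y, num_classes)
    tgt_y = torch.cdist(tgt_emb, p_src).argmin(dim=1)
    p_tgt = prototypes(tgt_emb, tgt_y, num_classes)
    p_all = prototypes(torch.cat([src_emb, tgt_emb]),
                       torch.cat([src_y, tgt_y]), num_classes)

    def scores(emb, protos):
        # log-probabilities from (negative) distances to the prototypes
        return F.log_softmax(-torch.cdist(emb, protos), dim=1)

    def kl(log_p, log_q):
        # KL(p || q) with log-probability inputs
        return F.kl_div(log_q, log_p.exp(), reduction="batchmean")

    emb = torch.cat([src_emb, tgt_emb])
    s_src, s_tgt, s_all = scores(emb, p_src), scores(emb, p_tgt), scores(emb, p_all)
    align = kl(s_src, s_tgt) + kl(s_src, s_all) + kl(s_tgt, s_all)
    cls = F.nll_loss(scores(src_emb, p_src), src_y)
    return cls, align
```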
Blending-Target Domain Adaptation by Adversarial Meta-Adaptation Networks
Ziliang Chen, Jingyu Zhuang, Xiaodan Liang, Liang Lin


(Unsupervised) Domain Adaptation (DA) seeks to classify target instances when provided only with labeled source examples and unlabeled target examples for training. Learning domain-invariant features helps to achieve this goal, but it presumes that the unlabeled samples are drawn from a single or from multiple explicit target domains (multi-target DA). In this paper, we consider a more realistic transfer scenario: our target domain is composed of multiple sub-targets implicitly blended with each other, so that learners cannot identify which sub-target each unlabeled sample belongs to. This Blending-target Domain Adaptation (BTDA) scenario commonly appears in practice and threatens the validity of existing DA algorithms, due to the presence of domain gaps and categorical misalignments among these hidden sub-targets. To reap the transfer performance gains in this new scenario, we propose the Adversarial Meta-Adaptation Network (AMEAN). AMEAN entails two adversarial transfer learning processes. The first is a conventional adversarial transfer to bridge our source and mixed target domains. To circumvent the intra-target category misalignment, the second process is a form of "learning to adapt": it deploys an unsupervised meta-learner that receives target data and their ongoing feature-learning feedback, to discover target clusters as our "meta-sub-target" domains. These meta-sub-targets auto-design our meta-sub-target adaptation loss, which progressively eliminates the implicit category mismatching in our mixed target. We evaluate AMEAN and a variety of DA algorithms in three benchmarks under the BTDA setup. Empirical results show that BTDA is a quite challenging transfer setup for most existing DA algorithms, yet AMEAN significantly outperforms these state-of-the-art baselines and effectively restrains the negative transfer effects in BTDA.
[multiple, hidden, second] [explicit, international, mismatching, derived, vision, well, problem, optimization] [mixed, conference, drawn, figure, includes] [deep, network, neural, accuracy, best, performance, denotes, gradient, processing, achieve, dynamically] [adversarial, visual, arxiv, preprint, model, provided, evaluate, machine, consider, identify] [feature, category, propose, three] [target, transfer, domain, adaptation, learning, amean, btda, source, dmt, dst, rnt, unsupervised, negative, vmt, accant, unlabeled, mtda, existing, data, classification, learn, trained, loss, set, clustering, update, cluster, vada, sample, viewed, setup, suffer, extractor, classifier, min, log, distribution, training, transferable]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Ziliang and Zhuang, Jingyu and Liang, Xiaodan and Lin, Liang},
  title = {Blending-Target Domain Adaptation by Adversarial Meta-Adaptation Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ELASTIC: Improving CNNs With Dynamic Scaling Policies
Huiyu Wang, Aniruddha Kembhavi, Ali Farhadi, Alan L. Yuille, Mohammad Rastegari


Scale variation has been a challenge from traditional to modern approaches in computer vision. Most solutions to scale issues have a similar theme: a set of intuitive and manually designed policies that are generic and fixed (e.g. SIFT or feature pyramid). We argue that the scaling policy should be learned from data. In this paper, we introduce Elastic, a simple, efficient and yet very effective approach to learn a dynamic scale policy from data. We formulate the scaling policy as a non-linear function inside the network's structure that (a) is learned from data, (b) is instance specific, (c) does not add extra computation, and (d) can be applied on any network architecture. We applied Elastic to several state-of-the-art network architectures and showed consistent improvement without extra (sometimes even lower) computation on ImageNet classification, MSCOCO multi-label classification, and PASCAL VOC semantic segmentation. Our results show major improvement for images with scale challenges. Our code is available here: https://github.com/allenai/elastic
[multiple, recognition] [computer, pattern, vision, single, error] [image, figure, resolution, conference, ieee, high, input, major] [elastic, scale, network, scaling, imagenet, filter, number, layer, original, accuracy, table, deep, higher, applied, computational, convolutional, computation, resnext, best, lower, low, validation, small, dilated, dla, downsampling, block, branching, neural, structure, apply] [policy, model, mscoco, arxiv, preprint] [pyramid, feature, semantic, improvement, cnn, spatial, object, extra, challenging, pascal, voc, improve, improves] [classification, large, loss, trained, learning, learned, base, training, learn]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Huiyu and Kembhavi, Aniruddha and Farhadi, Ali and Yuille, Alan L. and Rastegari, Mohammad},
  title = {ELASTIC: Improving CNNs With Dynamic Scaling Policies},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
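A simplified sketch of the scaling-policy idea: the same block processes its input both at full resolution and at a downsampled resolution, and the network learns how much each path contributes per instance. The channel handling and merge-by-addition below are simplifications of the paper's ResNeXt-based design, not the released Elastic blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticBranch(nn.Module):
    """Two parallel paths over the same input: one at the original scale and
    one at half resolution, upsampled back and merged by addition."""
    def __init__(self, channels):
        super().__init__()
        self.conv_full = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_low = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        high = self.conv_full(x)
        low = F.avg_pool2d(x, 2)                      # downsample
        low = self.conv_low(low)
        low = F.interpolate(low, size=x.shape[2:],    # upsample back
                            mode="bilinear", align_corners=False)
        return high + low
```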
ScratchDet: Training Single-Shot Object Detectors From Scratch
Rui Zhu, Shifeng Zhang, Xiaobo Wang, Longyin Wen, Hailin Shi, Liefeng Bo, Tao Mei


Current state-of-the-art object detectors are fine-tuned from off-the-shelf networks pretrained on the large-scale classification dataset ImageNet, which incurs some additional problems: 1) The classification and detection have different degrees of sensitivity to translation, resulting in a learning objective bias; 2) The architecture is limited by the classification network, leading to the inconvenience of modification. To cope with these problems, training detectors from scratch is a feasible solution. However, the detectors trained from scratch generally perform worse than the pretrained ones, and even suffer from convergence issues in training. In this paper, we explore how to train object detectors from scratch robustly. By analysing the previous work on optimization landscape, we find that one of the overlooked points in current trained-from-scratch detectors is BatchNorm. Resorting to the stable and predictable gradient brought by BatchNorm, detectors can be trained from scratch stably while keeping the favourable performance independent of the network architecture. Taking advantage of this, we are able to explore various types of networks for object detection, without suffering from poor convergence. Through extensive experiments and analyses of the downsampling factor, we propose the Root-ResNet backbone network, which makes full use of the information from original images. Our ScratchDet achieves the state-of-the-art accuracy on PASCAL VOC 2007, 2012 and MS COCO among all the train-from-scratch detectors and even performs better than several one-stage pretrained methods. Code will be made publicly available at https://github.com/KimSoybean/ScratchDet.
[] [optimization, local, stable, analysis] [based, figure, input, study, remove] [batchnorm, pretrained, convolution, network, table, scratchdet, scratch, performance, layer, size, better, small, rate, higher, architecture, downsampling, original, larger, deep, operation, stride, resnet, dsod, batch, gradient, norm, accuracy, kernel, vggnet, convolutional, converge, root, best, impact, block] [landscape, model, critical, improved] [detection, object, voc, ssd, map, backbone, pascal, detector, trainval, head, coco, faster, improves, feature, subnetwork, ross, supervision, improvement] [training, learning, trained, train, base, test, classification, set, large, loss]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Rui and Zhang, Shifeng and Wang, Xiaobo and Wen, Longyin and Shi, Hailin and Bo, Liefeng and Mei, Tao},
  title = {ScratchDet: Training Single-Shot Object Detectors From Scratch},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SFNet: Learning Object-Aware Semantic Correspondence
Junghyup Lee, Dohyung Kim, Jean Ponce, Bumsub Ham


We address the problem of semantic correspondence, that is, establishing a dense flow field between images depicting different instances of the same object or scene category. We propose to use images annotated with binary foreground masks and subjected to synthetic geometric deformations to train a convolutional neural network (CNN) for this task. Using these masks as part of the supervisory signal offers a good compromise between semantic flow methods, where the amount of training data is limited by the cost of manually selecting point correspondences, and semantic alignment ones, where the regression of a single global geometric transformation between images may be sensitive to image-specific details such as background clutter. We propose a new CNN architecture, dubbed SFNet, which implements this idea. It leverages a new and differentiable version of the argmax function for end-to-end training, with a loss that combines mask and flow consistency with smoothness terms. Experimental results demonstrate the effectiveness of our approach, which significantly outperforms the state of the art on standard benchmarks.
[flow, state, dataset, term] [matching, dense, correspondence, field, pck, establishing, compute, smoothness, corresponding, approach, local, scene, problem, differentiable, sift, robust, lack, discrete, depicting, geometric, directly, estimate, keypoint] [image, consistency, background, method, transformation, comparison, ieee, appearance] [network, kernel, performance, table, binary, convolutional, size, number, scale, best] [model, argmax, provided] [semantic, object, foreground, mask, feature, spatial, cnn, art, clutter, average, map, including, pascal, bounding, three] [training, source, target, loss, learning, soft, train, alignment, test, exploit, trained, large, learn, adaptation, set]
@InProceedings{Lee_2019_CVPR,
  author = {Lee, Junghyup and Kim, Dohyung and Ponce, Jean and Ham, Bumsub},
  title = {SFNet: Learning Object-Aware Semantic Correspondence},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
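A sketch of a differentiable argmax of the kind the abstract refers to: the correlation map for a source location is turned into a probability map with a (sharpened) softmax and the expected coordinates are returned. This is a plain soft-argmax for illustration; the paper's kernel soft-argmax additionally weights the map with a Gaussian kernel around the peak.

```python
import torch
import torch.nn.functional as F

def soft_argmax_2d(correlation, beta=100.0):
    """correlation: (B, H, W) matching scores for one source location.
    Returns (B, 2) expected (x, y) coordinates; beta sharpens the softmax."""
    b, h, w = correlation.shape
    prob = F.softmax(beta * correlation.view(b, -1), dim=1).view(b, h, w)
    ys = torch.linspace(0, h - 1, h, device=correlation.device)
    xs = torch.linspace(0, w - 1, w, device=correlation.device)
    # expected coordinates under the softmax distribution
    y = (prob.sum(dim=2) * ys).sum(dim=1)
    x = (prob.sum(dim=1) * xs).sum(dim=1)
    return torch.stack([x, y], dim=1)
```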
Deep Metric Learning Beyond Binary Supervision
Sungyeon Kim, Minkyo Seo, Ivan Laptev, Minsu Cho, Suha Kwak


Metric Learning for visual similarity has mostly adopted binary supervision indicating whether a pair of images are of the same class or not. Such a binary indicator covers only a limited subset of image relations, and is not sufficient to represent semantic similarity between images described by continuous and/or structured labels such as object poses, image captions, and scene graphs. Motivated by this, we present a novel method for deep metric learning using continuous labels. First, we propose a new triplet loss that allows distance ratios in the label space to be preserved in the learned metric space. The proposed loss thus enables our model to learn the degree of similarity rather than just the order. Furthermore, we design a triplet mining strategy adapted to metric learning with continuous labels. We address three different image retrieval tasks with continuous labels in terms of human poses, room layouts and image captions, and demonstrate the superior performance of our approach compared to previous methods.
[recognition, human, outperforms, dataset, previous] [vision, computer, continuous, pose, pattern, dense, tri, international, approach, well, directly] [image, conference, ieee, figure, method, based] [binary, deep, imagenet, neural, performance, pretrained, processing, structured, network, compared, effective, number] [model, room, visual, caption, common, captioning, evaluation, tennis, query] [anchor, layout, feature, cnn, semantic, object, three, illustrated] [learning, loss, metric, triplet, embedding, retrieval, distance, mining, similarity, label, space, negative, training, learned, positive, nearest, rank, strategy, existing, margin, imgnet, learn, minibatch, conventional, set, neighbor, representation, large, euclidean]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Sungyeon and Seo, Minkyo and Laptev, Ivan and Cho, Minsu and Kwak, Suha},
  title = {Deep Metric Learning Beyond Binary Supervision},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
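A hedged sketch of a distance-ratio-preserving triplet loss consistent with the abstract: the log-ratio of embedding distances for a triplet is pushed towards the log-ratio of the corresponding continuous label distances. Variable names and the epsilon stabilizer are assumptions, and the paper's exact formulation may differ in detail.

```python
import torch

def log_ratio_loss(fa, fi, fj, ya, yi, yj, eps=1e-6):
    """fa, fi, fj: (B, D) anchor / neighbour embeddings.
    ya, yi, yj: (B, L) continuous labels (e.g. pose vectors)."""
    d_fi = (fa - fi).pow(2).sum(dim=1) + eps      # squared embedding distances
    d_fj = (fa - fj).pow(2).sum(dim=1) + eps
    d_yi = (ya - yi).pow(2).sum(dim=1) + eps      # squared label distances
    d_yj = (ya - yj).pow(2).sum(dim=1) + eps
    # match the ratio of embedding distances to the ratio of label distances
    return ((torch.log(d_fi / d_fj) - torch.log(d_yi / d_yj)) ** 2).mean()
```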
Learning to Cluster Faces on an Affinity Graph
Lei Yang, Xiaohang Zhan, Dapeng Chen, Junjie Yan, Chen Change Loy, Dahua Lin


Face recognition has seen remarkable progress in recent years, and its performance has reached a very high level. Taking it to the next level requires substantially more data, which would involve prohibitive annotation cost. Hence, exploiting unlabeled data becomes an appealing alternative. Recent works have shown that clustering unlabeled faces is a promising approach, often leading to notable performance gains. Yet, how to effectively cluster, especially on a large-scale (i.e. million-level or above) dataset, remains an open question. A key challenge lies in the complex variations of cluster patterns, which make it difficult for conventional clustering methods to meet the needed accuracy. This work explores a novel approach, namely, learning to cluster instead of relying on hand-crafted criteria. Specifically, we propose a framework based on graph convolutional networks, which combines a detection and a segmentation module to pinpoint face clusters. Experiments show that our method yields significantly more accurate face clusters, which, as a result, also lead to further performance gains in face recognition.
[graph, recognition, gcn, framework, work, gcns, complex, previous, multiple] [algorithm, vertex, approach] [face, method, based, figure, high, proposed, result, cdp, image] [performance, number, convolutional, precision, deep, pooling, max, design, gain, better, larger, approximate, process, size, apply] [model, generated] [proposal, affinity, segmentation, iou, detection, feature, module, threshold, recall, improve, object, benchmark] [cluster, clustering, unlabeled, data, supervised, set, large, training, labeled, iop, smax, learning, megaface, pairwise, label, learn, positive, similarity, trained, train, unsupervised, close, randomly, exploit, select, rank, hac]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Lei and Zhan, Xiaohang and Chen, Dapeng and Yan, Junjie and Change Loy, Chen and Lin, Dahua},
  title = {Learning to Cluster Faces on an Affinity Graph},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
C2AE: Class Conditioned Auto-Encoder for Open-Set Recognition
Poojan Oza, Vishal M. Patel


Models trained for classification often assume that all testing classes are known while training. As a result, when presented with an unknown class during testing, such a closed-set assumption forces the model to classify it as one of the known classes. However, in a real world scenario, classification models are likely to encounter such examples. Hence, identifying those examples as unknown becomes critical to model performance. A potential solution to overcome this problem lies in a class of learning problems known as open-set recognition. It refers to the problem of identifying the unknown classes during testing, while maintaining performance on the known classes. In this paper, we propose an open-set recognition algorithm using class conditioned auto-encoders with novel training and testing methodologies. In this method, the training procedure is divided into two sub-tasks: 1. closed-set classification and 2. open-set identification (i.e. identifying a class as known or unknown). The encoder learns the first task following the closed-set classification training pipeline, whereas the decoder learns the second task by reconstructing inputs conditioned on class identity. Furthermore, we model reconstruction errors using Extreme Value Theory to find the threshold for identifying known/unknown class samples. Experiments performed on multiple image classification datasets show that the proposed method performs significantly better than the state of the art methods. The source code is available at: github.com/otkupjnoz/c2ae.
[recognition, modeling, multiple] [match, reconstruction, condition, computer, vision, problem, error, approach, histogram, algorithm, pattern, analysis, international, corresponding] [proposed, ieee, conference, method, image, described, based, input, conditioning, statistical, traditional, generative, conditional] [deep, performance, neural, better, network, experiment, layer, batch, processing] [model, vector, conditioned, probability, decoder, encoder, find, identifying, arxiv, preprint, referred, introduced, observed] [extreme, threshold, score, identification, detection, ablation] [class, unknown, training, set, classification, function, trained, learning, test, distribution, loss, snm, openness, vishal, softmax, open, testing, operating, evt, strategy, classifier, data, label, novel, sample, train, gpd, terrance]
@InProceedings{Oza_2019_CVPR,
  author = {Oza, Poojan and Patel, Vishal M.},
  title = {C2AE: Class Conditioned Auto-Encoder for Open-Set Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Shapes and Context: In-The-Wild Image Synthesis & Manipulation
Aayush Bansal, Yaser Sheikh, Deva Ramanan


We introduce a data-driven model for interactively synthesizing in-the-wild images from semantic label input masks. Our approach is dramatically different from recent work in this space, in that we make use of no learning. Instead, our approach uses simple but classic tools for matching scene context, shapes, and parts to a stored library of exemplars. Though simple, this approach has several notable advantages over recent work: (1) because nothing is learned, it is not limited to specific training data distributions (such as cityscapes, facades, or faces); (2) it can synthesize arbitrarily high-resolution images, limited only by the resolution of the exemplar library; (3) by appropriately composing shapes and parts, it can generate an exponentially large set of viable candidate output images (that can, say, be interactively searched by a user). We present results on the diverse COCO dataset, significantly outperforming learning-based approaches on standard image synthesis metrics. Finally, we explore user-interaction and user-controllability, demonstrating that our system can be used as a platform for user-driven content creation.
[work, multiple, human, extract, dataset, recognition] [approach, shape, matching, parametric, scene, local, compute, limited, computer, contrast, column, varying, well, problem, rigid] [image, input, synthesis, alexei, figure, pixel, synthesized, consistency, content, user, manipulation, prior, acm, synthesize, generative, quality, composition, study, demonstrate] [output, better, original, performance, number, table, accuracy, william, process] [generate, query, diverse, model, generated, simple, adversarial, fid, enables, controllable, coverage, creation, visual] [global, semantic, mask, coco, object, three, context, score, deva, contextual, instance] [label, training, data, set, large, trained, exemplar, observe, oracle, nearest]
@InProceedings{Bansal_2019_CVPR,
  author = {Bansal, Aayush and Sheikh, Yaser and Ramanan, Deva},
  title = {Shapes and Context: In-The-Wild Image Synthesis & Manipulation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Semantics Disentangling for Text-To-Image Generation
Guojun Yin, Bin Liu, Lu Sheng, Nenghai Yu, Xiaogang Wang, Jing Shao


Synthesizing photo-realistic images from text descriptions is a challenging problem. Previous studies have shown remarkable progress on the visual quality of the generated images. In this paper, we consider semantics from the input text descriptions in helping render photo-realistic images. However, diverse linguistic expressions pose challenges in extracting consistent semantics even when they depict the same thing. To this end, we propose a novel photo-realistic text-to-image generation model that implicitly disentangles semantics to both fulfill the high-level semantic consistency and low-level semantic diversity. To be specific, we design (1) a Siamese mechanism in the discriminator to learn consistent high-level semantics, and (2) a visual-semantic embedding strategy by semantic-conditioned batch normalization to find diverse low-level semantics. Extensive experiments and ablation studies on CUB and MS-COCO datasets demonstrate the superiority of the proposed method in comparison to state-of-the-art methods.
[previous, dataset, human] [groundtruth] [proposed, image, generative, input, based, conditional, consistency, generator, expression, disentangling, method, figure, demonstrate] [siamese, structure, batch, normalization, network, layer, table, stacked, best, performance, effectiveness, neural, small, compare, modulation, architecture] [scbn, generation, text, visual, generated, linguistic, adversarial, attngan, bird, word, model, sentence, inception, indicates, encoder, gans, natural, generate, language, introduced, evaluate, stackgan, vector, diverse, mechanism, discriminator, yellow, distills] [semantic, feature, semantics, score, comparing, adopt, stage, instance] [cub, contrastive, embedding, training, loss, test, set, learning]
@InProceedings{Yin_2019_CVPR,
  author = {Yin, Guojun and Liu, Bin and Sheng, Lu and Yu, Nenghai and Wang, Xiaogang and Shao, Jing},
  title = {Semantics Disentangling for Text-To-Image Generation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Semantic Image Synthesis With Spatially-Adaptive Normalization
Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu


We propose spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the network, forcing the network to memorize the information throughout all the layers. Instead, we propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned affine transformation. Experiments on several challenging datasets demonstrate the superiority of our method compared to existing approaches, regarding both visual fidelity and alignment with input layouts. Finally, our model allows users to easily control the style and content of image synthesis results as well as create multi-modal results. Code is available upon publication.
[recognition, outperforms, dataset, current] [computer, vision, international, pattern, scene, ground, truth] [image, conference, ieee, synthesis, method, input, spade, generative, figure, sims, crn, realistic, proposed, accu, real, conditional, high, generator, comparison, synthesizes, competing, kernelsize] [neural, processing, table, deep, convolutional, normalization, better, architecture, compare, network, size, achieve, performance, output, compact, batch, norm, capacity, number] [adversarial, model, arxiv, machine, visual, fid, find, preprint, diverse, decoder, strong, multimodal, flickr, landscape, generate, random] [semantic, segmentation, miou, layout, leading, european, mask, instance, baseline] [learning, training, label, distribution, trained, large]
@InProceedings{Park_2019_CVPR,
  author = {Park, Taesung and Liu, Ming-Yu and Wang, Ting-Chun and Zhu, Jun-Yan},
  title = {Semantic Image Synthesis With Spatially-Adaptive Normalization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
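A compact sketch of a spatially-adaptive normalization layer as described above: activations are normalized with parameter-free batch statistics, then modulated with per-pixel scale and bias maps predicted from the input semantic layout. The hidden width and kernel sizes below are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADELayer(nn.Module):
    def __init__(self, channels, label_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)   # parameter-free
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden, 3, padding=1),
            nn.ReLU(inplace=True))
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, segmap):
        # resize the one-hot layout to the activation resolution, then predict
        # spatially-varying scale and bias from it
        seg = F.interpolate(segmap, size=x.shape[2:], mode="nearest")
        h = self.shared(seg)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)
```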
Progressive Pose Attention Transfer for Person Image Generation
Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, Xiang Bai


This paper proposes a new generative adversarial network for the problem of pose transfer, i.e., transferring the pose of a given person to a target one. The generator of the network comprises a sequence of Pose-Attentional Transfer Blocks that each transfers certain regions it attends to, generating the person image progressively. Compared with those in previous works, our generated person images possess better appearance consistency and shape consistency with the input images, and thus look significantly more realistic. The efficacy and efficiency of the proposed network are validated both qualitatively and quantitatively on Market-1501 and DeepFashion. Furthermore, the proposed architecture can generate training images for person re-identification, alleviating data insufficiency.
[previous, human, sequence, dataset, portion, updated, work, fed] [pose, condition, shape, depth, body] [image, generator, method, patn, real, appearance, deepfashion, ftp, proposed, pckh, generative, patbs, patb, comparison, consistency, siarohin, quantitative, qualitative, input, figure, conditional, background] [performance, network, denotes, resnet, residual, progressive, better, structure, process, number, conv, convolutional, deep, neural, block, relu, design, computation, normalization] [generated, attention, model, adversarial, generate, generation, generating, discriminator, manifold] [person, final, adopted, pathway, score, challenging, deformable] [transfer, target, training, code, data, set, testing, randomly, large, augmentation, loss, representation]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Zhen and Huang, Tengteng and Shi, Baoguang and Yu, Miao and Wang, Bofei and Bai, Xiang},
  title = {Progressive Pose Attention Transfer for Person Image Generation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Person Image Generation With Semantic Parsing Transformation
Sijie Song, Wei Zhang, Jiaying Liu, Tao Mei


In this paper, we address unsupervised pose-guided person image generation, which is known to be challenging due to non-rigid deformation. Unlike previous methods learning a rock-hard direct mapping between human bodies, we propose a new pathway to decompose the hard mapping into two more accessible subtasks, namely, semantic parsing transformation and appearance generation. Firstly, a semantic generative network is proposed to transform between semantic parsing maps, in order to simplify the non-rigid deformation learning. Secondly, an appearance generative network learns to synthesize semantic-aware textures. Thirdly, we demonstrate that training our framework in an end-to-end manner further refines the semantic maps and final results accordingly. Our method is generalizable to other semantic-aware person image generation tasks, e.g., clothing texture transfer and controlled image manipulation. Experimental results demonstrate the superiority of our method on the DeepFashion and Market-1501 datasets, especially in preserving clothing attributes and producing better body shapes.
[human, work, framework, employed] [pose, computer, body, vision, pattern, shape, corresponding, well, condition, deformation] [image, appearance, generative, conference, ieee, transformation, style, texture, ladv, deepfashion, figure, mapping, method, paired, upis, conditional, reference, proposed, extracted, lsty, transform, controlled, unpaired, input, consistency, synthesis, ltotal, quality, quantitative, ssim, demonstrate] [network, better, neural, output, compared, processing] [generation, model, adversarial, generate, generated, visual, generates, ace] [semantic, map, parsing, person, clothing, sps, predicted, spatial, guided, propose, final, parser] [training, unsupervised, loss, target, transfer, learning, data, train, trained, set, supervised, address]
@InProceedings{Song_2019_CVPR,
  author = {Song, Sijie and Zhang, Wei and Liu, Jiaying and Mei, Tao},
  title = {Unsupervised Person Image Generation With Semantic Parsing Transformation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DeepView: View Synthesis With Learned Gradient Descent
John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, Richard Tucker


We present a novel approach to view synthesis using multiplane images (MPIs). Building on recent advances in learned gradient descent, our algorithm generates an MPI from a set of sparse camera viewpoints. The resulting method incorporates occlusion reasoning, improving performance on challenging scene features such as object boundaries, lighting reflections, thin structures, and scenes with high depth complexity. We show that our method achieves high-quality, state-of-the-art results on two datasets: the Kalantari light field dataset, and a new camera array dataset, Spaces, which we make publicly available.
[dataset, explicitly, lgd, accumulated, iteratively, current, work] [view, mpi, depth, light, scene, field, deepview, visibility, plane, camera, approach, inverse, rig, well, computed, allows, sweep, ground, truth, multiplane, optimization, reconstruction, note, supplemental, rgba, rendered, algorithm] [input, method, image, synthesis, kalantari, based, ssim, color, described, acm, figure, high, transmittance, resolution] [gradient, network, number, descent, deep, table, performance, initialization, sparse, larger, represents, net, convolutional, operator] [model, generate, step] [cnn, feature, baseline, ablation] [learned, loss, training, set, update, learning, representation, difficult, function]
@InProceedings{Flynn_2019_CVPR,
  author = {Flynn, John and Broxton, Michael and Debevec, Paul and DuVall, Matthew and Fyffe, Graham and Overbeck, Ryan and Snavely, Noah and Tucker, Richard},
  title = {DeepView: View Synthesis With Learned Gradient Descent},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Animating Arbitrary Objects via Deep Motion Transfer
Aliaksandr Siarohin, Stephane Lathuiliere, Sergey Tulyakov, Elisa Ricci, Nicu Sebe


This paper introduces a novel deep learning framework for image animation. Given an input image with a target object and a driving video sequence depicting a moving object, our framework generates a video in which the target object is animated according to the driving sequence. This is achieved through a deep architecture that decouples appearance and motion information. Our framework consists of three main modules: (i) a Keypoint Detector trained in an unsupervised manner to extract object keypoints, (ii) a Dense Motion prediction network for generating dense heatmaps from sparse keypoints, in order to better encode motion information and (iii) a Motion Transfer Network, which uses the motion heatmaps and appearance information extracted from the input image to synthesize the output frames. We demonstrate the effectiveness of our method on several benchmark datasets, spanning a wide variety of object appearances, and show that our approach outperforms state-of-the-art image animation and video generation methods.
[video, motion, driving, optical, flow, frame, framework, nemo, human, dataset, fcoarse, fresidual, static, bair, aed, prediction, outperforms, considering, capture] [keypoint, keypoints, approach, dense, problem, well, deformation, ground, truth, supplementary, material, body] [image, input, method, animation, face, generator, proposed, appearance, facial, nicu, conditional, translation, arbitrary, figure, generative, quantitative] [network, deep, order, architecture, employ, output, sparse, convolutional] [model, generated, generate, adversarial, generating, evaluation, gan, discriminator, generates, introduced, generation, visual] [object, feature, three, heatmaps, detector, propose, module] [source, learning, representation, unsupervised, large, loss, transfer, training, task, observe, sergey, trained, specific]
@InProceedings{Siarohin_2019_CVPR,
  author = {Siarohin, Aliaksandr and Lathuiliere, Stephane and Tulyakov, Sergey and Ricci, Elisa and Sebe, Nicu},
  title = {Animating Arbitrary Objects via Deep Motion Transfer},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Textured Neural Avatars
Aliaksandra Shysheya, Egor Zakharov, Kara-Ali Aliev, Renat Bashirov, Egor Burkov, Karim Iskakov, Aleksei Ivakhnenko, Yury Malkov, Igor Pasechnik, Dmitry Ulyanov, Alexander Vakhitov, Victor Lempitsky


We present a system for learning full body neural avatars, i.e. deep networks that produce full body renderings of a person for varying body pose and varying camera pose. Our system takes the middle path between the classical graphics pipeline and the recent deep learning approaches that generate images of humans using image-to-image translation. In particular, our system estimates an explicit two-dimensional texture map of the model surface. At the same time, it abstains from explicit shape modeling in 3D. Instead, at test time, the system uses a fully-convolutional network to directly map the configuration of body feature points w.r.t. the camera to the 2D texture coordinates of individual pixels in the image frame. We show that such a system is capable of learning to generate realistic renderings while being trained on videos annotated with 3D poses and foreground masks. We also demonstrate that maintaining an explicit texture representation helps our system to achieve better generalization compared to systems that use direct image-to-image translation.
[human, video, modeling, motion] [body, pose, approach, single, camera, textured, rendering, direct, computer, surface, shape, explicit, monocular, densepose, michael, classical, estimation, christian, june, varying, multiview, rgb, coordinate, corresponding, gerard, pipeline] [texture, image, input, avatar, stack, figure, translation, acm, method, mapping, appearance, background, pixel, comparison, based, victor, produce] [neural, network, deep, convolutional, andrew, performance, initialized] [system, model, consider, generate, adversarial, arxiv, preprint] [map, person, mask, foreground] [training, learning, test, trained, generalization, data, set, loss, dan, unseen]
@InProceedings{Shysheya_2019_CVPR,
  author = {Shysheya, Aliaksandra and Zakharov, Egor and Aliev, Kara-Ali and Bashirov, Renat and Burkov, Egor and Iskakov, Karim and Ivakhnenko, Aleksei and Malkov, Yury and Pasechnik, Igor and Ulyanov, Dmitry and Vakhitov, Alexander and Lempitsky, Victor},
  title = {Textured Neural Avatars},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
IM-Net for High Resolution Video Frame Interpolation
Tomer Peleg, Pablo Szekely, Doron Sabo, Omry Sendik


Video frame interpolation is a long-studied problem in the video processing field. Recently, deep learning approaches have been applied to this problem, showing impressive results on low-resolution benchmarks. However, these methods do not scale up favorably to high resolutions. Specifically, when the motion exceeds a typical number of pixels, their interpolation quality is degraded. Moreover, their run time renders them impractical for real-time applications. In this paper we propose IM-Net: an interpolated motion neural network. We use an economic structured architecture and end-to-end training with multi-scale tailored losses. In particular, we formulate interpolated motion estimation as classification rather than regression. IM-Net outperforms previous methods by more than 1.3dB (PSNR) on a high resolution version of the recently introduced Vimeo triplet dataset. Moreover, the network runs in less than 33 msec on a single GPU for HD resolution.
[frame, motion, video, middle, previous, vimeo, dataset, vfi, imvf, arp, time, warping, warp, work, current, fhd, second, warped, version] [estimation, occlusion, computer, vision, range, estimated, international, pattern, single, classical, ground] [resolution, input, image, sepconv, high, interpolation, conference, method, ieee, interpolated, toflow, quality, synthesis, extracted, flev, psnr] [network, deep, architecture, low, adaptive, output, full, applied, processing, neural, original, convolutional, table, rate, separable, block, number] [strong, choice, include] [cnn, three, level, map, pyramid] [training, loss, learning, pair, set, large, classification]
@InProceedings{Peleg_2019_CVPR,
  author = {Peleg, Tomer and Szekely, Pablo and Sabo, Doron and Sendik, Omry},
  title = {IM-Net for High Resolution Video Frame Interpolation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Homomorphic Latent Space Interpolation for Unpaired Image-To-Image Translation
Ying-Cong Chen, Xiaogang Xu, Zhuotao Tian, Jiaya Jia


Generative adversarial networks have achieved great success in unpaired image-to-image translation. Cycle consistency allows modeling the relationship between two distinct domains without paired data. In this paper, we propose an alternative framework, as an extension of latent space interpolation, to consider the intermediate region between two domains during translation. It is based on the fact that in a flat and smooth latent space, there exist many paths that connect two sample points. Properly selecting paths makes it possible to change only certain image attributes, which is useful for generating intermediate images between the two domains. We also show that this framework can be applied to multi-domain and multi-modal translation. Extensive experiments demonstrate its generality and applicability to various tasks.
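As a rough illustration of the latent-path idea above, the following Python sketch decodes convex combinations of two latent codes to obtain intermediate images. The encoder/generator names and the use of simple linear interpolation (rather than the paper's learned path selection) are assumptions for illustration only.

import torch

def interpolate_between_domains(encoder, generator, img_a, img_b, steps=5):
    # Encode both images and decode points along the straight line between
    # their latent codes; each decoded point is one intermediate image.
    z_a, z_b = encoder(img_a), encoder(img_b)
    return [generator((1.0 - t) * z_a + t * z_b)
            for t in torch.linspace(0.0, 1.0, steps)]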
[term] [defined, allows, continuous, note, corresponding, well] [attribute, interpolation, latent, translation, image, smiling, stargan, intermediate, interpolated, control, elegant, homomorphic, unpaired, change, method, real, hair, based, flat, smooth, figure, expression, male, interpolator, hom, generative, female, facelet, edit, strength, celeba, reference, input, produce, munit, rafd, color, mouth] [network, grouped, table, original, achieved, deep, vgg, rigorous] [model, encoder, path, vector, natural, generate, adversarial, decoder, edited, generated, serve, wasserstein] [feature, final, guidance, leading, connect] [space, training, domain, loss, update, sample, target, trained, min, test, knowledge, learning, function, train]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Ying-Cong and Xu, Xiaogang and Tian, Zhuotao and Jia, Jiaya},
  title = {Homomorphic Latent Space Interpolation for Unpaired Image-To-Image Translation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Channel Attention Selection GAN With Cascaded Semantic Guidance for Cross-View Image Translation
Hao Tang, Dan Xu, Nicu Sebe, Yanzhi Wang, Jason J. Corso, Yan Yan


Cross-view image translation is challenging because it involves images with drastically different views and severe deformation. In this paper, we propose a novel approach named Multi-Channel Attention SelectionGAN (SelectionGAN) that makes it possible to generate images of natural scenes from arbitrary viewpoints, based on an image of the scene and a novel semantic map. The proposed SelectionGAN explicitly utilizes the semantic information and consists of two stages. In the first stage, the condition image and the target semantic map are fed into a cycled semantic-guided generation network to produce initial coarse results. In the second stage, we refine the initial results by using a multi-channel attention selection mechanism. Moreover, uncertainty maps automatically learned from attentions are used to guide the pixel loss for better network optimization. Extensive experiments on Dayton, CVUSA and Ego2Top datasets show that our model is able to generate significantly better results than the state-of-the-art methods. The source code, data and trained models are available at https://github.com/Ha0Tang/SelectionGAN.
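A minimal sketch of a multi-channel attention selection step of the kind described above, assuming K candidate generations are fused with per-pixel softmax attention maps (tensor shapes and K are assumptions, not the authors' exact design):

import torch
import torch.nn.functional as F

def attention_selection(candidates, attention_logits):
    # candidates: N x K x 3 x H x W intermediate generations,
    # attention_logits: N x K x H x W selection scores.
    att = F.softmax(attention_logits, dim=1).unsqueeze(2)  # normalize over the K channels
    return (att * candidates).sum(dim=1)                   # N x 3 x H x W fused output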
[multiple, dataset, second] [scene, view, optimization, cycled] [image, proposed, translation, selectiongan, synthesis, generative, dayton, input, intermediate, conditional, real, generator, method, pixel, guide, figure, zhai, arbitrary, produce, synthesized, lcgan] [selection, network, deep, better, structure, table, pooling, convolutional, accuracy, size] [attention, generation, adversarial, generate, generated, gan, model, generating, discriminator] [semantic, baseline, stage, spatial, cvusa, propose, map, final, module, coarse, segmentation, cascade, aerial, challenging, feature] [uncertainty, loss, novel, learning, select, training, target, data, datasets, learn, set, trained, task, observe]
@InProceedings{Tang_2019_CVPR,
  author = {Tang, Hao and Xu, Dan and Sebe, Nicu and Wang, Yanzhi and Corso, Jason J. and Yan, Yan},
  title = {Multi-Channel Attention Selection GAN With Cascaded Semantic Guidance for Cross-View Image Translation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Geometry-Consistent Generative Adversarial Networks for One-Sided Unsupervised Domain Mapping
Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, Kun Zhang, Dacheng Tao


Unsupervised domain mapping aims to learn a function GXY to translate domain X to Y in the absence of paired examples. Finding the optimal GXY without paired data is an ill-posed problem, so appropriate constraints are required to obtain reasonable solutions. While some prominent constraints such as cycle consistency and distance preservation successfully constrain the solution space, they overlook the special property of images that simple geometric transformations do not change an image's semantic structure. Based on this special property, we develop a geometry-consistent generative adversarial network (GcGAN), which enables one-sided unsupervised domain mapping. GcGAN takes the original image and its counterpart image transformed by a predefined geometric transformation as inputs and generates two images in the new domain coupled with the corresponding geometry-consistency constraint. The geometry-consistency constraint reduces the space of possible solutions while keeping the correct solutions in the search space. Quantitative and qualitative comparisons with the baseline (GAN alone) and the state-of-the-art methods including CycleGAN [66] and DistanceGAN [5] demonstrate the effectiveness of our method.
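A minimal sketch of a geometry-consistency term in the spirit described above, assuming a 90-degree rotation as the predefined transformation and a single generator shared across the original and transformed inputs (a simplification of the paper's two coupled generators):

import torch
import torch.nn.functional as F

def geometry_consistency_loss(G, x):
    # f(G(x)) should agree with G(f(x)) when f is a simple geometric
    # transformation that preserves semantic structure (here a 90-degree rotation).
    f = lambda t: torch.rot90(t, k=1, dims=(2, 3))
    return F.l1_loss(f(G(x)), G(f(x)))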
[predefined, acc, horse] [constraint, geometric, geometry, scene, ground, truth, respect] [gxy, gcgan, image, generative, mapping, cyclegan, input, translation, consistency, distancegan, figure, lgan, cycle, style, transformation, pixel, qualitative, conditional, quantitative, generator, translated, photo, demonstrate, method, lgeo, lgcgan, synthetic, kun, paired, real, unpaired, produce, day, synthesis, alexei, mingming] [table, original, deep, network, employ, convolutional] [adversarial, gan, discriminator, van, arxiv, preprint, random, correct] [semantic, including, parsing, baseline, feature, aerial] [domain, unsupervised, training, learning, function, distance, loss, trained, svhn, class, trevor, learn, set, learned, transfer]
@InProceedings{Fu_2019_CVPR,
  author = {Fu, Huan and Gong, Mingming and Wang, Chaohui and Batmanghelich, Kayhan and Zhang, Kun and Tao, Dacheng},
  title = {Geometry-Consistent Generative Adversarial Networks for One-Sided Unsupervised Domain Mapping},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DeepVoxels: Learning Persistent 3D Feature Embeddings
Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Niessner, Gordon Wetzstein, Michael Zollhofer


In this work, we address the lack of 3D understanding of generative neural networks by introducing a persistent 3D feature embedding for view synthesis. To this end, we propose DeepVoxels, a learned representation that encodes the view-dependent appearance of a 3D scene without having to explicitly model its geometry. At its core, our approach is based on a Cartesian 3D grid of persistent embedded features that learn to make use of the underlying 3D scene structure. Our approach combines insights from 3D geometric computer vision with recent advances in learning image-to-image mappings based on adversarial loss functions. DeepVoxels is supervised, without requiring a 3D reconstruction of the scene, using a 2D re-rendering loss and enforces perspective and multi-view geometry in a principled manner. We apply our persistent 3D scene representation to the problem of novel view synthesis demonstrating high-quality results for a variety of challenging scenes.
[explicitly, recurrent, learns, video, predict, state] [view, scene, volume, occlusion, voxel, approach, depth, persistent, deepvoxels, rendering, reconstruction, canonical, camera, geometry, single, perspective, point, visibility, geometric, ground, projection, differentiable, well, lifting, explicit] [based, image, proposed, latent, synthesis, generative, high, input, psnr, ssim, acm, figure, synthetic, conditional, resolution] [network, deep, neural, convolutional, architecture, structure, number] [model, adversarial, reasoning, vector, gated, generate, corpus] [feature, grid, spatial, object, baseline, fully, map, center, challenging] [novel, representation, learning, training, target, learned, space, embedding, learn, loss, set, test, source, nearest, lifted, code]
@InProceedings{Sitzmann_2019_CVPR,
  author = {Sitzmann, Vincent and Thies, Justus and Heide, Felix and Niessner, Matthias and Wetzstein, Gordon and Zollhofer, Michael},
  title = {DeepVoxels: Learning Persistent 3D Feature Embeddings},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Inverse Path Tracing for Joint Material and Lighting Estimation
Dejan Azinovic, Tzu-Mao Li, Anton Kaplanyan, Matthias Niessner


Modern computer vision algorithms have brought significant advancement to 3D geometry reconstruction. However, illumination and material reconstruction remain less studied, with current approaches assuming very simplified models for materials and illumination. We introduce Inverse Path Tracing, a novel approach to jointly estimate the material properties of objects and light sources in indoor scenes by using an invertible light transport simulation. We assume a coarse geometry scan, along with corresponding images and camera poses. The key contribution of this work is an accurate and simultaneous retrieval of light sources and physically based material properties (e.g., diffuse reflectance, specular reflectance, roughness, etc.) for the purpose of editing and re-rendering the scene under new conditions. To this end, we introduce a novel optimization method using a differentiable Monte Carlo renderer that computes derivatives with respect to the estimated unknown illumination and material properties. This enables joint optimization for physically correct light transport and material models using a tailored stochastic gradient descent.
[work, joint] [material, light, scene, inverse, lighting, emission, illumination, tracing, rendering, geometry, albedo, monte, optimization, differentiable, computer, estimate, problem, reconstruction, rendered, matthias, vision, reflectance, michael, solve, single, surface, christian, estimation, approach, respect, physically, specular, estimated, ground, truth, direction, brdf, algorithm, indoor, assume, accurate, intrinsic, tracer, spherical] [method, pixel, input, acm, figure, captured, based, real, handle, color, image, synthetic, conference, reconstructed] [stochastic, gradient, variance, number] [path, carlo, model, contribution, evaluation, correct, correctly] [object] [unknown, transport, source, function, loss, set, novel, product, update]
@InProceedings{Azinovic_2019_CVPR,
  author = {Azinovic, Dejan and Li, Tzu-Mao and Kaplanyan, Anton and Niessner, Matthias},
  title = {Inverse Path Tracing for Joint Material and Lighting Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
The Visual Centrifuge: Model-Free Layered Video Representations
Jean-Baptiste Alayrac, Joao Carreira, Andrew Zisserman


True video understanding requires making sense of non-Lambertian scenes where the color of light arriving at the camera sensor encodes information about not just the last object it collided with, but about multiple media -- colored windows, dirty mirrors, smoke or rain. Layered video representations have the potential of accurately modelling realistic scenes but have so far required stringent assumptions on motion, lighting and shape. Here we propose a learning-based approach for multi-layered video representation: we introduce novel uncertainty-capturing 3D convolutional architectures and train them to separate blended videos. We show that these models then generalize to single videos, where they exhibit interesting abilities: color constancy, factoring out shadows and separating reflections. We present quantitative and qualitative results on real world videos.
[video, motion, multiple, frozen, work, kinetics, human, audio, second, frame, moving, temporal, action, version] [single, permutation, problem, normal, well, approach, scene, recovering, reconstruction, note, intrinsic] [image, blended, reflection, separation, separate, color, input, mixed, real, separating, figure, proposed, layered, mixing, removal, reconstruct, composition, blind, method, composing] [output, layer, network, architecture, original, deep, table, validation, process, standard, better, convolutional] [model, simple, natural, diversity, visual, diverse, simply, random] [propose, semantic] [trained, loss, training, learning, predictor, task, invariant, data, train, paper, generalize, domain, set, hard]
@InProceedings{Alayrac_2019_CVPR,
  author = {Alayrac, Jean-Baptiste and Carreira, Joao and Zisserman, Andrew},
  title = {The Visual Centrifuge: Model-Free Layered Video Representations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Label-Noise Robust Generative Adversarial Networks
Takuhiro Kaneko, Yoshitaka Ushiku, Tatsuya Harada


Generative adversarial networks (GANs) are a framework that learns a generative distribution through adversarial training. Recently, their class conditional extensions (e.g., conditional GAN (cGAN) and auxiliary classifier GAN (AC-GAN)) have attracted much attention owing to their ability to learn the disentangled representations and to improve the training stability. However, their training requires the availability of large-scale accurate class-labeled data, which are often laborious or impractical to collect in a real-world scenario. To remedy this, we propose a novel family of GANs called label-noise robust GANs (rGANs), which, by incorporating a noise transition model, can learn a clean label conditional generative distribution even when training labels are noisy. In particular, we propose two variants: rAC-GAN, which is a bridging model between AC-GAN and the label-noise robust classification model, and rcGAN, which is an extension of cGAN and solves this problem with no reliance on any classifier. In addition to providing the theoretical background, we demonstrate the effectiveness of our models through extensive experiments using diverse GAN configurations, various noise settings, and multiple evaluation metrics (in which we tested 402 conditions in total).
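A hedged sketch of the noise-transition idea behind the classifier branch of the rAC-GAN variant described above, assuming a known or estimated transition matrix T with T[i, j] = p(noisy label j | clean label i); names and the smoothing constant are illustrative only:

import torch
import torch.nn.functional as F

def noise_robust_classification_loss(logits, noisy_labels, T):
    # The classifier predicts the clean-label posterior; pushing it through the
    # transition matrix gives the noisy-label posterior, which is then matched
    # against the observed (noisy) labels.
    p_clean = F.softmax(logits, dim=1)        # p(y | x)
    p_noisy = p_clean @ T                     # p(y_noisy | x)
    return F.nll_loss(torch.log(p_noisy + 1e-8), noisy_labels)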
[transition, learns, incorporate] [robust, problem, theorem, theoretical, technique, optimal, note, denote, corresponding] [conditional, noise, generative, image, clean, figure, proposed, real, generator, disentangled] [deep, rate, neural, table, optimized, performance, better, number, effectiveness, dnns] [adversarial, cgan, gans, rcgan, fid, model, generated, gan, improved, conditioned, discriminator, probability, generate, find, indicates, arxiv, preprint, evaluation, goal, studied, robustness, rac] [baseline, propose, indicate] [noisy, data, label, learning, training, loss, classifier, labeled, class, distribution, auxiliary, classification, learn, symmetric, trained, minimizing, tested, log, intra, flipped]
@InProceedings{Kaneko_2019_CVPR,
  author = {Kaneko, Takuhiro and Ushiku, Yoshitaka and Harada, Tatsuya},
  title = {Label-Noise Robust Generative Adversarial Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DLOW: Domain Flow for Adaptation and Generalization
Rui Gong, Wen Li, Yuhua Chen, Luc Van Gool


In this work, we present a domain flow generation(DLOW) model to bridge two different domains by generating a continuous sequence of intermediate domains flowing from one domain to the other. The benefits of our DLOW model are two-fold. First, it is able to transfer source images into different styles in the intermediate domains. The transferred images smoothly bridge the gap between source and target domains, thus easing the domain adaptation task. Second, when multiple target domains are provided for training, our DLOW model is also able to generate new styles of images that are unseen in the training data. We implement our DLOW model based on CycleGAN. A domainness variable is introduced to guide the model to generate the desired intermediate domain images. In the inference phase, a flow of various styles of images can be obtained by varying the domainness variable. We demonstrate the effectiveness of our model for both cross-domain semantic segmentation and the style generalization tasks on benchmark datasets. Our implementation is available at https://github.com/ETHRuiGong/DLOW .
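One possible way the domainness variable could weight the adversarial terms, shown only as a hedged sketch; the function names (G, D_src, D_tgt) and the specific weighting scheme are assumptions, not the authors' exact formulation:

import torch

def domainness_weighted_adv_loss(G, D_src, D_tgt, x_src, z):
    # z in [0, 1]: 0 keeps the generated image source-like, 1 pushes it fully
    # toward the target style; the two discriminator terms are weighted accordingly.
    x_mid = G(x_src, z)
    ls = torch.log(torch.sigmoid(D_src(x_mid)) + 1e-8).mean()
    lt = torch.log(torch.sigmoid(D_tgt(x_mid)) + 1e-8).mean()
    return -((1.0 - z) * ls + z * lt)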
[dataset, flow, multiple, work, luc, focus, second] [variable, direction, david, computer, vision, directly, well] [intermediate, dlow, image, style, domainness, translated, translation, cyclegan, based, gst, synthetic, input, wen, conditional, gogh, proposed, real, figure, translate, conduct, method, ladv, demonstrate, translating, generative] [shift, deep, dong, original, table, performance, network] [model, adversarial, van, generate, vector, ability] [semantic, segmentation, improve, fig, urban, benchmark, helpful, connect] [domain, target, source, adaptation, generalization, training, transfer, learning, distribution, data, unsupervised, loss, existing, unseen, adaptsegnet, datasets, relatedness, train, learn, distance, labeled, address, trained]
@InProceedings{Gong_2019_CVPR,
  author = {Gong, Rui and Li, Wen and Chen, Yuhua and Van Gool, Luc},
  title = {DLOW: Domain Flow for Adaptation and Generalization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
CollaGAN: Collaborative GAN for Missing Image Data Imputation
Dongwook Lee, Junyoung Kim, Won-Jin Moon, Jong Chul Ye


In many applications requiring multiple inputs to obtain a desired output, if any of the input data is missing, it often introduces large amounts of bias. Although many techniques have been developed for imputing missing data, the image imputation is still difficult due to complicated nature of natural images. To address this problem, here we proposed a novel framework for missing image data imputation, called Collaborative Generative Adversarial Network (CollaGAN). CollaGAN convert the image imputation problem to a multi-domain images-to-image translation task so that a single generator and discriminator network can successfully estimate the missing data using the remaining clean data set. We demonstrate that CollaGAN produces the images with a higher visual quality compared to the existing competing approaches in various image imputation tasks.
[multiple, consists] [contrast, illumination, single, computer, estimate, vision, volume, defined, problem, algorithm] [image, input, facial, proposed, generator, missing, imputation, translation, stargan, cyclegan, real, method, expression, generative, reconstructed, collagan, dclsf, ssim, cycle, conference, ieee, collaborative, figure, consistency, clsf, korea, handle, impute, pixel, incomplete, quality, magnetic, resonance, mapping] [network, compared, neural, original, output, number, processing, performance, architecture, brain, process, structure, best] [discriminator, adversarial, gan, arxiv, generate, generated, preprint, vector, natural, fake, required, complete, manifold] [three, mask] [data, loss, domain, set, target, trained, classification, training, learning, large, difficult, datasets, transfer, train]
@InProceedings{Lee_2019_CVPR,
  author = {Lee, Dongwook and Kim, Junyoung and Moon, Won-Jin and Chul Ye, Jong},
  title = {CollaGAN: Collaborative GAN for Missing Image Data Imputation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
d-SNE: Domain Adaptation Using Stochastic Neighborhood Embedding
Xiang Xu, Xiong Zhou, Ragav Venkatesan, Gurumurthy Swaminathan, Orchid Majumder


On the one hand, deep neural networks are effective in learning from large datasets. On the other, they are inefficient with their data usage. They often require copious amounts of labeled data to train their many parameters. Training larger and deeper networks is hard without appropriate regularization, particularly while using a small dataset. Meanwhile, collecting well-annotated data is expensive, time-consuming and often infeasible. A popular way to regularize these networks is to simply train the network with more data from an alternate representative dataset. This can lead to adverse effects if the statistics of the representative dataset are dissimilar to our target. This predicament is due to the problem of domain shift. Data from a shifted domain might not produce well-suited features when a feature extractor from the representative domain is used. Several techniques of domain adaptation have been proposed in the past to solve this problem. In this paper, we propose a new technique (d-SNE) of domain adaptation that cleverly uses stochastic neighborhood embedding techniques and a novel modified-Hausdorff distance. The proposed technique is learnable end-to-end and is therefore ideally suited to train neural networks. Extensive experiments demonstrate that d-SNE outperforms the current state of the art and is robust to the variances in different datasets, even in the one-shot and semi-supervised learning settings. d-SNE also demonstrates the ability to generalize to multiple domains concurrently.
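A hedged sketch of a modified-Hausdorff-style alignment objective in the spirit described above; the squared Euclidean distance, omission of a margin, and batch handling are assumptions rather than the paper's exact loss:

import torch

def dsne_style_loss(feat_src, y_src, feat_tgt, y_tgt):
    # For every target feature, the farthest same-class source feature should
    # still be closer than the nearest different-class source feature.
    d = torch.cdist(feat_tgt, feat_src) ** 2                 # n_tgt x n_src distances
    same = y_tgt.unsqueeze(1) == y_src.unsqueeze(0)
    big = torch.finfo(d.dtype).max
    intra = d.masked_fill(~same, -big).max(dim=1).values     # max same-class distance
    inter = d.masked_fill(same, big).min(dim=1).values       # min other-class distance
    return torch.relu(intra - inter).mean()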
[dataset, outperforms, work] [computer, vision, pattern, international, neighborhood, technique, typically] [conference, proposed, figure, ieee, generative, transformation, image, method] [neural, network, deep, table, stochastic, processing, performance, best, imagenet] [adversarial, model, consider, dec, create, probability, visual] [feature, three, european] [domain, target, source, mnist, adaptation, learning, data, ccsa, unsupervised, datasets, supervised, distance, setting, fada, svhn, training, loss, xdk, usps, unlabeled, embedding, train, sample, label, space, learn, labeled, trained, discriminative, base, distribution, classifier, class, learnt, shared, extractor, novel, knowledge, idea]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Xiang and Zhou, Xiong and Venkatesan, Ragav and Swaminathan, Gurumurthy and Majumder, Orchid},
  title = {d-SNE: Domain Adaptation Using Stochastic Neighborhood Embedding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Taking a Closer Look at Domain Shift: Category-Level Adversaries for Semantics Consistent Domain Adaptation
Yawei Luo, Liang Zheng, Tao Guan, Junqing Yu, Yi Yang


We consider the problem of unsupervised domain adaptation in semantic segmentation. The key to this task is to reduce the domain shift, i.e., enforcing the data distributions of the two domains to be similar. A popular strategy is to align the marginal distribution in the feature space through adversarial learning. However, this global alignment strategy does not consider the local category-level feature distribution. A possible consequence of the global movement is that some categories which are originally well aligned between the source and target may be incorrectly mapped. To address this problem, this paper introduces a category-level adversarial network, aiming to enforce local semantic consistency during the trend of global alignment. Our idea is to take a close look at the category-level data distribution and align each class with an adaptive adversarial loss. Specifically, we reduce the weight of the adversarial loss for category-level aligned features while increasing the adversarial force for those poorly aligned. In this process, we decide how well a feature is category-level aligned between source and target by a co-training approach. In two domain adaptation tasks, i.e., GTA5 -> Cityscapes and SYNTHIA -> Cityscapes, we validate that the proposed method matches the state of the art in segmentation accuracy.
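One plausible way to turn co-trained classifier disagreement into an adaptive adversarial weight, as described above; the cosine-similarity measure and the [1, 2] weight range are assumptions made for this sketch only:

import torch.nn.functional as F

def adaptive_adversarial_weight(pred_a, pred_b):
    # Two co-trained classifiers give pixel-wise predictions (N x C x H x W);
    # where they disagree, the feature is likely poorly aligned, so the
    # adversarial loss at that pixel receives a larger weight.
    pa, pb = F.softmax(pred_a, dim=1), F.softmax(pred_b, dim=1)
    agreement = F.cosine_similarity(pa, pb, dim=1)   # N x H x W, 1 = full agreement
    return 1.0 + (1.0 - agreement)                   # weight in [1, 2]; detach before applying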
[joint, prediction, tan, traffic, recognition, focus] [computer, vision, local, pattern, well, analysis, problem, university] [conference, ieee, proposed, method, traditional, image, pixel, result, figure, input, generative, consistency, ladv] [weight, network, adaptive, deep, output, convolutional, performance, neural, denotes, best] [adversarial, arxiv, preprint, model, diverse, marginal, indicates, discriminator] [semantic, feature, segmentation, global, map, aligned, miou, improvement, center, utilize] [loss, domain, source, clan, target, adaptation, alignment, learning, class, distribution, discrepancy, unsupervised, training, data, distance, adapted, synthia, transfer, space, infrequent, address, classifier, train, learn]
@InProceedings{Luo_2019_CVPR,
  author = {Luo, Yawei and Zheng, Liang and Guan, Tao and Yu, Junqing and Yang, Yi},
  title = {Taking a Closer Look at Domain Shift: Category-Level Adversaries for Semantics Consistent Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation
Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, Patrick Perez


Semantic segmentation is a key problem for many computer vision tasks. While approaches based on convolutional neural networks constantly break new records on different benchmarks, generalizing well to diverse testing environments remains a major challenge. In numerous real-world applications, there is indeed a large gap between data distributions in train and test domains, which results in severe performance loss at run-time. In this work, we address the task of unsupervised domain adaptation in semantic segmentation with losses based on the entropy of the pixel-wise predictions. To this end, we propose two novel, complementary methods using (i) entropy loss and (ii) adversarial loss respectively. We demonstrate state-of-the-art performance in semantic segmentation on two challenging "synthetic-2-real" set-ups and show that the approach can also be used for detection.
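A minimal sketch of the entropy objective described above: the Shannon entropy of the pixel-wise softmax prediction, averaged over the image (the averaging scheme is an assumption for illustration):

import torch
import torch.nn.functional as F

def pixelwise_entropy_loss(logits):
    # logits: N x C x H x W segmentation scores on unlabeled target images.
    p = F.softmax(logits, dim=1)
    log_p = F.log_softmax(logits, dim=1)
    ent = -(p * log_p).sum(dim=1)   # per-pixel entropy, N x H x W
    return ent.mean()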
[prediction] [approach, scene, problem, well, single, direct, computer] [image, figure, generative, high, based, proposed, synthetic, input, method, result, produce] [performance, network, table, weighted, compared, deep, better, convolutional, achieves, top, applied] [adversarial, model, adv, discriminator] [segmentation, semantic, miou, detection, feature, object, lseg, propose, map, urban] [entropy, domain, target, source, training, adaptation, loss, uda, learning, trained, class, minimization, unsupervised, minent, synthia, ent, train, task, advent, distribution, lent, large, data, gap, minimizing, log, min, set, base, specific, classification]
@InProceedings{Vu_2019_CVPR,
  author = {Vu, Tuan-Hung and Jain, Himalaya and Bucher, Maxime and Cord, Matthieu and Perez, Patrick},
  title = {ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ContextDesc: Local Descriptor Augmentation With Cross-Modality Context
Zixin Luo, Tianwei Shen, Lei Zhou, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, Long Quan


Most existing studies on learning local features focus on patch-based descriptions of individual keypoints, while neglecting the spatial relations established from their keypoint locations. In this paper, we go beyond the local detail representation by introducing context awareness to augment off-the-shelf local feature descriptors. Specifically, we propose a unified learning framework that leverages and aggregates the cross-modality contextual information, including (i) visual context from high-level image representation, and (ii) geometric context from 2D keypoint distribution. Moreover, we propose an effective N-pair loss that eschews the empirical hyper-parameter search and improves convergence. The proposed augmentation scheme is lightweight compared with the raw local feature description, yet improves results remarkably on several large-scale benchmarks with diverse scenes, demonstrating both strong practicality and generalization ability in geometric matching applications.
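For reference, a standard N-pair formulation is sketched below; the paper's exact variant may differ, and the use of plain inner products over a batch of matched descriptor pairs is an assumption:

import torch
import torch.nn.functional as F

def n_pair_loss(anchors, positives):
    # anchors, positives: N x D descriptors, where row i of each is a matching pair.
    # Each anchor should be most similar to its own positive among all positives.
    sim = anchors @ positives.t()                          # N x N similarity matrix
    labels = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(sim, labels)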
[framework, time, complex] [local, keypoint, geometric, matching, geodesc, point, sift, keypoints, sfm, descriptor, single, augmented, pointnet, allows, robust, manual, homography, preparation, indoor] [image, proposed, raw, input, patch, figure] [deep, scale, residual, performance, unit, normalization, compared, standard, aggregation, structure, better, original, table, computational, convolutional, sparse, introducing] [visual, evaluation, encoder, model, strong, ability, evaluate] [feature, context, regional, global, matchability, integration, spatial, recall, including, improves, final, contextual, semantic, module, awareness, resort, grid] [learning, loss, augmentation, retrieval, learned, training, representation, generalization, convergence, large, effectively, distribution, invariance, aim, perceptron, base]
@InProceedings{Luo_2019_CVPR,
  author = {Luo, Zixin and Shen, Tianwei and Zhou, Lei and Zhang, Jiahui and Yao, Yao and Li, Shiwei and Fang, Tian and Quan, Long},
  title = {ContextDesc: Local Descriptor Augmentation With Cross-Modality Context},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Large-Scale Long-Tailed Recognition in an Open World
Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, Stella X. Yu


Real world data often have a long-tailed and open-ended distribution. A practical recognition system must classify among majority and minority classes, generalize from a few known instances, and acknowledge novelty upon a never-seen instance. We define Open Long-Tailed Recognition (OLTR) as learning from such naturally distributed data and optimizing the classification accuracy over a balanced test set which includes head, tail, and open classes. OLTR must handle imbalanced classification, few-shot learning, and open-set recognition in one integrated algorithm, whereas existing classification approaches focus only on one aspect and perform poorly over the entire class spectrum. The key challenges are how to share visual knowledge between head and tail classes and how to reduce confusion between tail and open classes. We develop an integrated OLTR algorithm that maps an image to a feature space such that visual concepts can easily relate to each other based on a learned metric that respects the closed-world classification while acknowledging the novelty of the open world. Our so-called dynamic meta-embedding combines a direct image feature and an associated memory feature, with the feature norm indicating the familiarity to known classes. On three large-scale OLTR datasets we curate from object-centric ImageNet, scene-centric Places, and face-centric MS1M data, our method consistently outperforms the state-of-the-art. Our code, datasets, and models enable future OLTR research and are publicly available at https://liuziwei7.github.io/projects/LongTail.html.
[recognition, dynamic, dataset, relates, learns] [direct, approach, range, focal, confidence, algorithm, directly] [image, input, method, figure, face, study, comparison] [performance, accuracy, deep, plain, network, neural, table, number, distributed, layer, effective, standard] [memory, visual, attention, model, arxiv, preprint, concept, robustness, natural] [feature, head, modulated, map, three, spatial, backbone, integrated, detection, final] [open, learning, tail, class, classification, oltr, meta, training, loss, data, set, imbalanced, test, reachability, knowledge, classifier, learned, metric, embedding, distribution, setting, balanced, metaembedding, discrimination, lifted, transfer, sample, classify, novelty, existing, learn, big, discriminative, distance]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Ziwei and Miao, Zhongqi and Zhan, Xiaohang and Wang, Jiayun and Gong, Boqing and Yu, Stella X.},
  title = {Large-Scale Long-Tailed Recognition in an Open World},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations Rather Than Data
Liheng Zhang, Guo-Jun Qi, Liqiang Wang, Jiebo Luo


The success of deep neural networks often relies on a large amount of labeled examples, which can be difficult to obtain in many real scenarios. To address this challenge, unsupervised methods are strongly preferred for training neural networks without using any labeled data. In this paper, we present a novel paradigm of unsupervised representation learning by Auto-Encoding Transformation (AET) in contrast to the conventional Auto-Encoding Data (AED) approach. Given a randomly sampled transformation, AET seeks to predict it merely from the encoded features as accurately as possible at the output end. The idea is the following: as long as the unsupervised features successfully encode the essential information about the visual structures of original and transformed images, the transformation can be well predicted. We will show that this AET paradigm allows us to instantiate a large variety of transformations, from parameterized, to non-parameterized and GAN-induced ones. Our experiments show that AET greatly improves over existing unsupervised approaches, setting new state-of-the-art performance that is much closer to the upper bounds achieved by fully supervised counterparts on the CIFAR-10, ImageNet and Places datasets. Our source codes are available at https://github.com/maple-research-lab/AET.
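A minimal sketch of the Auto-Encoding Transformation idea for a parameterized transformation (e.g., an affine warp): an encoder embeds both the original and the transformed image, and a small decoder regresses the transformation parameters from the feature pair. The module and loss names are assumptions for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AETHead(nn.Module):
    def __init__(self, encoder, feat_dim, n_params=6):
        super().__init__()
        self.encoder = encoder                      # shared feature extractor E
        self.decoder = nn.Linear(2 * feat_dim, n_params)

    def forward(self, x, x_transformed, t_params):
        # Predict the transformation parameters from the concatenated features
        # of the original and transformed images, and regress to the true ones.
        f = torch.cat([self.encoder(x), self.encoder(x_transformed)], dim=1)
        return F.mse_loss(self.decoder(f), t_params)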
[] [computer, international, well, estimate, projective, error, greatly, vision, directly] [transformation, input, image, conference, figure, comparison, proposed, generative, generator, ieee] [neural, output, convolutional, imagenet, compare, network, deep, original, top, table, parameterized, accuracy, compared, performance, alexnet, nin, literature, explore, architecture] [random, transformed, encoder, visual, sampled, model, arxiv, preprint, adversarial, gans, gan] [feature, fully, context] [unsupervised, aet, learning, train, data, trained, training, representation, supervised, labeled, classifier, learn, loss, rotnet, randomly, distribution, learned, knn, paradigm, existing, upper, set, large, jigsaw, surrogate]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Liheng and Qi, Guo-Jun and Wang, Liqiang and Luo, Jiebo},
  title = {AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations Rather Than Data},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SDC - Stacked Dilated Convolution: A Unified Descriptor Network for Dense Matching Tasks
Rene Schuster, Oliver Wasenmuller, Christian Unger, Didier Stricker


Dense pixel matching is important for many computer vision tasks such as disparity and flow estimation. We present a robust, unified descriptor network that considers a large context region with high spatial variance. Our network has a very large receptive field and avoids striding layers to maintain spatial resolution. These properties are achieved by creating a novel neural network layer that consists of multiple, parallel, stacked dilated convolutions (SDC). Several of these layers are combined to form our SDC descriptor network. In our experiments, we show that our SDC features outperform state-of-the-art feature descriptors in terms of accuracy and robustness. In addition, we demonstrate the superior performance of SDC in state-of-the-art stereo matching, optical flow and scene flow algorithms on several well-known public benchmarks.
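A sketch of one such layer as described above: several 3x3 convolutions with different dilation rates run in parallel on the same input and their outputs are concatenated, enlarging the receptive field without striding. Channel counts, dilation rates, and the activation are assumptions:

import torch
import torch.nn as nn

class SDCLayer(nn.Module):
    def __init__(self, in_ch, out_ch_per_branch=32, dilations=(1, 2, 3, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch_per_branch, kernel_size=3,
                      padding=d, dilation=d)        # padding=d keeps spatial size
            for d in dilations
        ])
        self.act = nn.ELU()

    def forward(self, x):
        # Concatenate the parallel dilated responses along the channel dimension.
        return self.act(torch.cat([b(x) for b in self.branches], dim=1))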
[flow, optical, recognition, sintel, previous, second] [vision, computer, matching, stereo, epe, pattern, descriptor, sift, kitti, dense, field, middlebury, scene, noc, robust, single, international, corresponding, christian, didier, cpm, accurate] [conference, image, patch, figure, interpolation, produce] [sdc, network, dilated, convolution, receptive, dilation, deep, parallel, layer, design, size, table, accuracy, neural, architecture, striding, original, pooling, full, compare, performance, density, convolutional, batch, number, filtered, block, small] [robustness, evaluate] [feature, spatial, context, improve, heuristic] [data, large, learning, training, distance, set, loss, test, unified, novel]
@InProceedings{Schuster_2019_CVPR,
  author = {Schuster, Rene and Wasenmuller, Oliver and Unger, Christian and Stricker, Didier},
  title = {SDC - Stacked Dilated Convolution: A Unified Descriptor Network for Dense Matching Tasks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Correspondence From the Cycle-Consistency of Time
Xiaolong Wang, Allan Jabri, Alexei A. Efros


We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as free supervisory signal for learning visual representations from scratch. At training time, our model learns a feature map representation to be useful for performing cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time. We demonstrate the generalizability of the representation -- without finetuning -- across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow. Our approach outperforms previous self-supervised methods and performs competitively with strongly supervised methods.
[flow, optical, video, tracking, frame, propagation, human, time, work, track, performs, propagate, multiple, perform, motion, follow, dataset, jhmdb, temporal] [correspondence, estimation, dense, approach, sift, compute, differentiable, pose, computer, vision, well, allows, note, initial] [image, method, patch, figure, pixel, cycle, input, acquired, colorization, texture] [imagenet, deep, better, table, neural, tracker, architecture, convolutional, network] [visual, model, finding, evaluation, find] [feature, semantic, spatial, object, instance, segmentation, affinity] [learning, representation, unsupervised, training, loss, supervised, trained, learn, space, metric, large, set, test, similarity, alignment, objective, deepcluster]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Xiaolong and Jabri, Allan and Efros, Alexei A.},
  title = {Learning Correspondence From the Cycle-Consistency of Time},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
AE2-Nets: Autoencoder in Autoencoder Networks
Changqing Zhang, Yeqing Liu, Huazhu Fu


Learning on data represented with multiple views (e.g., multiple types of descriptors or modalities) is a rapidly growing direction in machine learning and computer vision. Although effective, most existing algorithms focus on classification or clustering tasks. In contrast, in this paper we focus on unsupervised representation learning and propose a novel framework termed Autoencoder in Autoencoder Networks (AE^2-Nets), which integrates information from heterogeneous sources into an intact representation by the nested autoencoder framework. The proposed method has the following merits: (1) our model jointly performs view-specific representation learning (with the inner autoencoder networks) and multi-view information encoding (with the outer autoencoder networks) in a unified framework; (2) due to the degradation process from the latent representation to each single view, our model flexibly balances the complementarity and consistence among multiple views. The proposed model is efficiently solved by the alternating direction method (ADM), and demonstrates its effectiveness compared with state-of-the-art algorithms.
[multiple, jointly, heterogeneous, learns, focus] [view, single, canonical, algorithm, intrinsic, linear, analysis, optimization, matrix, direction, consistence] [latent, proposed, degradation, based, method, image, outer] [deep, correlation, neural, kernel, automatically, number, effectiveness, nested, compared, original, parameter, table, performance, process, flexibly] [model, common, encoding, multimodal, encoded, maximize, wae, machine] [feature, inner, map] [representation, learning, autoencoder, intact, cca, clustering, data, dccae, dimensionality, update, dcca, mdcr, classification, learned, bae, featconcate, metric, bdg, subspace, space, set, wdg, existing, datasets, objective, function, unsupervised, novel, complementarity]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Changqing and Liu, Yeqing and Fu, Huazhu},
  title = {AE2-Nets: Autoencoder in Autoencoder Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Mitigating Information Leakage in Image Representations: A Maximum Entropy Approach
Proteek Chandan Roy, Vishnu Naresh Boddeti


Image recognition systems have demonstrated tremendous progress over the past few decades thanks, in part, to our ability to learn compact and robust representations of images. As we witness the widespread adoption of these systems, it is imperative to consider the problem of unintended leakage of information from an image representation, which might compromise the privacy of the data owner. This paper investigates the problem of learning an image representation that minimizes such leakage of user information. We formulate the problem as an adversarial non-zero sum game of finding a good embedding function with two competing goals: to retain as much task dependent discriminative image information as possible, while simultaneously minimizing the amount of information, as measured by entropy, about other sensitive attributes of the user. We analyze the stability and convergence dynamics of the proposed formulation using tools from non-linear systems theory and compare to that of the corresponding adversarial zero-sum game formulation that optimizes likelihood as a measure of information content. Numerical experiments on UCI, Extended Yale B, CIFAR-10 and CIFAR-100 datasets indicate that our proposed approach is able to learn image representations that exhibit high task performance while mitigating leakage of predefined sensitive information.
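A hedged sketch of the encoder's two-goal objective described above: classify the target task well while pushing the sensitive-attribute adversary's prediction toward maximum entropy (a uniform distribution). The trade-off weight alpha is an assumption:

import torch
import torch.nn.functional as F

def encoder_objective(task_logits, task_labels, adversary_logits, alpha=1.0):
    # Task term: standard cross-entropy on the target attribute.
    task_loss = F.cross_entropy(task_logits, task_labels)
    # Entropy term: the encoder wants the adversary's posterior over the
    # sensitive attribute to be as uninformative (high-entropy) as possible.
    p_adv = F.softmax(adversary_logits, dim=1)
    entropy_adv = -(p_adv * torch.log(p_adv + 1e-8)).sum(dim=1).mean()
    return task_loss - alpha * entropy_adv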
[equilibrium, dataset, hidden, stationary, recognition, multiple, prediction, work, simultaneous] [problem, formulation, point, linear, optimization, measured, optimizing, solution, illumination, corresponding] [image, attribute, figure, proposed, competing, conference, face] [accuracy, neural, gradient, converge, compare, analyze, numerical, performance, network, descent] [sensitive, adversarial, encoder, adversary, discriminator, game, leakage, vfae, consider, vector, privacy, player, machine, maximize, mlarl, goal, arl] [three] [learning, target, representation, predictor, entropy, data, likelihood, fair, distribution, classification, convergence, learn, embedding, label, maximum, minimizing, task, learned, function, invariant, class, log, setting, setup, trained, min]
@InProceedings{Roy_2019_CVPR,
  author = {Chandan Roy, Proteek and Naresh Boddeti, Vishnu},
  title = {Mitigating Information Leakage in Image Representations: A Maximum Entropy Approach},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Spatial Common Sense With Geometry-Aware Recurrent Networks
Hsiao-Yu Fish Tung, Ricson Cheng, Katerina Fragkiadaki


We integrate two powerful ideas, geometry and deep visual representation learning, into recurrent network architectures for mobile visual scene understanding. The proposed networks learn to "lift" 2D visual features and integrate them over time into latent 3D feature maps of the scene. They are equipped with differentiable geometric operations, such as projection, unprojection, egomotion estimation and stabilization, in order to compute a geometrically-consistent mapping between the world scene and their 3D latent feature space. We train the proposed architectures to predict novel image views given short frame sequences as input. Their predictions strongly generalize to scenes with a novel number of objects, appearances and configurations, and greatly outperform predictions of previous works that do not consider egomotion stabilization or a space-aware latent feature space. We train the proposed architectures to detect and segment objects in 3D, using the latent 3D feature map as input--as opposed to 2D feature maps computed from video frames. The resulting detections are permanent: they continue to exist even when an object gets occluded or leaves the field of view. Our experiments suggest the proposed space-aware latent feature arrangement and egomotion-stabilized convolutions are essential architectural choices for spatial common sense to emerge in artificial embodied visual agents.
[egomotion, prediction, frame, recurrent, predict, time, current, arrangement, gru, state, sense, opposed, human, work, unprojected, motion] [grnns, view, depth, scene, camera, estimation, shapenet, rgb, corresponding, groundtruth, elevation, azimuth, tower, voxel, reconstruction, differentiable, single, estimate, projected, vision, well, error, geometric] [input, latent, figure, image, proposed, mapping, pixel] [tensor, deep, network, neural, number, mobile, table, convolutional] [visual, memory, model, common, query, consider] [feature, object, map, spatial, detection, segmentation, predicted, baseline, location, integrate, bounding, box, detect, grid, average] [test, learning, train, set, novel, learn, trained, training, representation, update, generalization]
@InProceedings{Tung_2019_CVPR,
  author = {Fish Tung, Hsiao-Yu and Cheng, Ricson and Fragkiadaki, Katerina},
  title = {Learning Spatial Common Sense With Geometry-Aware Recurrent Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Structured Knowledge Distillation for Semantic Segmentation
Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, Jingdong Wang


In this paper, we investigate the issue of knowledge distillation for training compact semantic segmentation networks by making use of cumbersome networks. We start from the straightforward scheme, pixel-wise distillation, which applies the distillation scheme originally introduced for image classification and performs knowledge distillation for each pixel separately. We further propose to distill the structured knowledge from cumbersome networks into compact networks, which is motivated by the fact that semantic segmentation is a structured prediction problem. We study two such structured distillation schemes: (i) pair-wise distillation that distills the pairwise similarities, and (ii) holistic distillation that uses adversarial training to distill holistic knowledge. The effectiveness of our knowledge distillation approaches is demonstrated by extensive experiments on three scene parsing datasets: Cityscapes, Camvid and ADE20K.
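Minimal sketches of the first two distillation schemes named above, under common assumptions (per-pixel KL divergence for the pixel-wise scheme, cosine-style similarity maps matched with an L2 loss for the pair-wise scheme; in practice the similarity maps are usually computed on pooled features to limit memory):

import torch
import torch.nn.functional as F

def pixelwise_distillation(student_logits, teacher_logits):
    # KL divergence between teacher and student class distributions at every
    # pixel (logits: N x C x H x W).
    t = F.softmax(teacher_logits, dim=1)
    s = F.log_softmax(student_logits, dim=1)
    return F.kl_div(s, t, reduction='batchmean')

def pairwise_distillation(student_feat, teacher_feat):
    # Match pairwise similarities between spatial locations of the feature maps.
    def sim(f):
        f = F.normalize(f.flatten(2), dim=1)   # N x C x HW, unit-norm channels
        return f.transpose(1, 2) @ f           # N x HW x HW similarity matrix
    return F.mse_loss(sim(student_feat), sim(teacher_feat))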
[dataset] [local, scene] [image, pixel, ieee, produced, figure, method, generative, study, input, enet, real, high, resolution] [network, compact, structured, espnet, deep, neural, convolutional, net, table, structure, effectiveness, efficient, improving, accuracy, higher, performance, imagenet, validation, scheme, distill, dilated, size, camvid, mobile, erfnet] [adversarial, model, discriminator, attention, gan, random] [segmentation, semantic, holistic, cumbersome, feature, map, imn, three, miou, score, pspnet, spatial, ocnet, improve, annotated, iou, parsing, refinenet, improvement, labeling, adopt] [distillation, knowledge, training, student, loss, transfer, class, embedding, teacher, learning, test, similarity, set, unlabeled, classification, function, trained, align]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Yifan and Chen, Ke and Liu, Chris and Qin, Zengchang and Luo, Zhenbo and Wang, Jingdong},
  title = {Structured Knowledge Distillation for Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Scan2CAD: Learning CAD Model Alignment in RGB-D Scans
Armen Avetisyan, Manuel Dahnert, Angela Dai, Manolis Savva, Angel X. Chang, Matthias Niessner


We present Scan2CAD, a novel data-driven method that learns to align clean 3D CAD models from a shape database to the noisy and incomplete geometry of a commodity RGB-D scan. For a 3D reconstruction of an indoor scene, our method takes as input a set of CAD models, and predicts a 9DoF pose that aligns each model to the underlying scan geometry. To tackle this problem, we create a new scan-to-CAD alignment dataset based on 1506 ScanNet scans with 97607 annotated keypoint pairs between 14225 CAD models from ShapeNet and their counterpart objects in the scans. Our method selects a set of representative keypoints in a 3D scan for which we find correspondences to the CAD geometry. To this end, we design a novel 3D CNN architecture that learns a joint embedding between real and synthetic objects, and from this predicts a correspondence heatmap. Based on these correspondence heatmaps, we formulate a variational energy minimization that aligns a given set of CAD models to the reconstruction. We evaluate our approach on our newly introduced Scan2CAD benchmark, where we outperform both handcrafted feature descriptors and state-of-the-art CNN-based methods by 21.39%.
[dataset, prediction, predict, recognition] [cad, scan, keypoint, scene, computer, reconstruction, correspondence, vision, pose, geometric, voxel, approach, shape, shapenet, volumetric, ground, truth, indoor, scannet, matching, surface, international, pattern, optimization, point, geometry, depth, harris, optimal, keypoints] [conference, based, input, ieee, method, real, quality, synthetic, user, acm, figure, transformation, database, incomplete] [scale, network, architecture, output] [model, find, introduce, generate] [object, annotated, heatmap, feature, annotation, cnn, category, benchmark, semantic, heatmaps, propose] [alignment, set, retrieval, data, learning, training, distance, loss, compatibility, align, test, learned, task]
@InProceedings{Avetisyan_2019_CVPR,
  author = {Avetisyan, Armen and Dahnert, Manuel and Dai, Angela and Savva, Manolis and Chang, Angel X. and Niessner, Matthias},
  title = {Scan2CAD: Learning CAD Model Alignment in RGB-D Scans},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Towards Scene Understanding: Unsupervised Monocular Depth Estimation With Semantic-Aware Representation
Po-Yi Chen, Alexander H. Liu, Yen-Cheng Liu, Yu-Chiang Frank Wang


Monocular depth estimation is a challenging task in scene understanding, with the goal of acquiring the geometric properties of 3D space from 2D images. Due to the lack of RGB-depth image pairs, unsupervised learning methods aim at deriving depth information with alternative supervision such as stereo pairs. However, most existing works fail to model the geometric structure of objects, which generally results from considering pixel-level objective functions during training. In this paper, we propose SceneNet to overcome this limitation with the aid of semantic understanding from segmentation. Moreover, our proposed model is able to perform region-aware depth estimation by enforcing semantics consistency between stereo pairs. In our experiments, we qualitatively and quantitatively verify the effectiveness and robustness of our model, which produces favorable results compared to state-of-the-art approaches.
[prediction, recognition, dataset, predict, work, warping, joint, perform, temporal, optical] [depth, disparity, estimation, scenenet, scene, vision, stereo, computer, monocular, pattern, kitti, ground, geometric, truth, single, godard, note, mismatching, problem, allows, eigen, smoothness, view, leveraging, corresponding] [image, conference, proposed, ieee, consistency, input, identity, method, based, verify, figure, amount] [inference, deep, convolutional, performance, network, architecture, output] [model, decoder, understanding, evaluate, encoder] [semantic, segmentation, map, annotated, baseline, ablation] [learning, unsupervised, task, training, data, loss, representation, objective, set, learn, unified, split, supervised, existing, shared, classifier, train]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Po-Yi and Liu, Alexander H. and Liu, Yen-Cheng and Frank Wang, Yu-Chiang},
  title = {Towards Scene Understanding: Unsupervised Monocular Depth Estimation With Semantic-Aware Representation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Tell Me Where I Am: Object-Level Scene Context Prediction
Xiaotian Qiao, Quanlong Zheng, Ying Cao, Rynson W.H. Lau


Contextual information has been shown to be effective in helping solve various image understanding tasks. Previous works have focused on extracting contextual information from an image and using it to infer the properties of some object(s) in the image. In this paper, we consider the inverse problem of hallucinating missing contextual information from the properties of a few standalone objects. We refer to it as scene context prediction. This problem is difficult as it requires an extensive knowledge of complex and diverse relationships among different objects in natural scenes. We propose a convolutional neural network, which takes as input the properties (i.e., category, shape, and position) of a few standalone objects to predict an object-level scene layout that compactly encodes the semantics and structure of the scene context where the given objects are. Our quantitative experiments and user studies show that our model can generate more plausible scene context than the baseline approach. We demonstrate that our model allows for the synthesis of realistic scene images from just partial scene layouts and internally learns useful features for scene recognition.
[recognition, dataset, predict, prediction, predicting, complex] [scene, sky, shape, ground, truth, confidence, problem, indoor, outdoor, predicts, note] [input, image, method, synthesis, generator, figure, realistic, quantitative, user, qualitative, synthesize, produce] [output, standalone, network, full, table, deep, building, neural] [model, generate, generated, discriminator, grass, plausible, tree, partial, visual, diverse, bird, evaluate, pavement] [object, context, layout, person, baseline, bounding, semantic, category, airplane, spatial, region, box, car, foreground, boat, predicted, score, plausibility, road, contextual, feature, snowboard, detection, sand] [representation, learning, learn, learned, training, embedding, train, address, classifier, loss, randomly]
@InProceedings{Qiao_2019_CVPR,
  author = {Qiao, Xiaotian and Zheng, Quanlong and Cao, Ying and Lau, Rynson W.H.},
  title = {Tell Me Where I Am: Object-Level Scene Context Prediction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation
He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, Leonidas J. Guibas


The goal of this paper is to estimate the 6D pose and dimensions of unseen object instances in an RGB-D image. Contrary to "instance-level" 6D pose estimation tasks, our problem assumes that no exact object CAD models are available during either training or testing time. To handle different and unseen object instances in a given category, we introduce a Normalized Object Coordinate Space (NOCS)---a shared canonical representation for all possible object instances within a category. Our region-based neural network is then trained to directly infer the correspondence from observed pixels to this shared object representation (NOCS) along with other object information such as class label and instance mask. These predictions can be combined with the depth map to jointly estimate the metric 6D pose and dimensions of multiple objects in a cluttered scene. To train our network, we present a new context-aware technique to generate large amounts of fully annotated mixed reality data. To further improve our model and evaluate its performance on real data, we also provide a fully annotated real-world dataset with large environment and instance variation. Extensive experiments demonstrate that the proposed method is able to robustly estimate the pose and size of unseen object instances in real environments while also achieving state-of-the-art performance on standard 6D pose estimation benchmarks.
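The last step described above, combining the predicted normalized coordinates with the depth map to recover metric pose and size, boils down to estimating a similarity transform (rotation, translation, uniform scale) between corresponding 3D point sets. A minimal sketch of that estimation using the standard Umeyama algorithm is given below; names are illustrative, and the paper's robust/outlier-handling details are omitted.

import numpy as np

def umeyama_similarity(src, dst):
    """Estimate s, R, t such that dst ~ s * R @ src + t.
    src, dst: (N, 3) corresponding points, e.g. predicted normalized object
    coordinates vs. back-projected depth points. Closed-form Umeyama solution."""
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)              # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # enforce a proper rotation
        S[2, 2] = -1
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src        # uniform scale
    t = mu_dst - s * R @ mu_src
    return s, R, t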
[dataset, multiple, prediction, work, predict, focus, recognition] [pose, estimation, computer, ground, truth, vision, depth, approach, pattern, normalized, reality, estimate, cad, scene, rgb, camera, coordinate, international, point, allows, problem, shape, predicts, symmetry] [conference, real, ieee, figure, method, pixel, synthetic, mixed, image, handle] [size, network, performance, full, deep, neural] [generate, model, making, arxiv, preprint, introduce] [object, map, detection, instance, bounding, cnn, regression, mask, category, coco, challenging, predicted] [data, training, unseen, space, large, test, datasets, loss, learning, class, metric, representation, trained, classification, shared, symmetric, set, testing, train]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, He and Sridhar, Srinath and Huang, Jingwei and Valentin, Julien and Song, Shuran and Guibas, Leonidas J.},
  title = {Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Supervised Fitting of Geometric Primitives to 3D Point Clouds
Lingxiao Li, Minhyuk Sung, Anastasia Dubrovina, Li Yi, Leonidas J. Guibas


Fitting geometric primitives to 3D point cloud data bridges a gap between low-level digitized 3D data and high-level structural information on the underlying 3D shapes. As such, it enables many downstream applications in 3D data processing. For a long time, RANSAC-based methods have been the gold standard for such primitive fitting problems, but they require careful per-input parameter tuning and thus do not scale well for large datasets with diverse shapes. In this work, we introduce Supervised Primitive Fitting Network (SPFN), an end-to-end neural network that can robustly detect a varying number of primitives at different scales without any user control. The network is supervised using ground truth primitive surfaces and primitive membership for the input points. Instead of directly predicting the primitives, our architecture first predicts per-point properties and then uses a differentiable model estimation module to compute the primitive type and parameters. We evaluate our approach on a novel benchmark of ANSI 3D mechanical component models and demonstrate a significant improvement over both the state-of-the-art RANSAC-based methods and the direct neural prediction.
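The idea of predicting per-point properties and then solving for primitive parameters in closed form can be illustrated with a weighted plane fit: given soft membership weights, the plane normal is the eigenvector of the weighted covariance with the smallest eigenvalue. The sketch below is only an illustration of this pattern (hypothetical names); the actual SPFN module covers several primitive types and backpropagates through the solver.

import numpy as np

def weighted_plane_fit(points, weights):
    """Fit a plane n.x = d to (N,3) points using soft membership weights (N,)."""
    w = weights / (weights.sum() + 1e-8)
    centroid = (w[:, None] * points).sum(0)
    centered = points - centroid
    cov = centered.T @ (w[:, None] * centered)   # weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    normal = eigvecs[:, 0]                       # smallest-eigenvalue direction
    d = normal @ centroid
    return normal, d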
[predicting, predict, prediction, work, framework, multiple] [primitive, point, fitting, spfn, ransac, shape, cloud, ground, truth, axis, geometric, differentiable, normal, equation, plane, cad, matrix, predicts, problem, mechanical, matching, directly, surface, leonidas, estimation, approach, geometry, hungarian, squared, lres, varying, ansi, direct, fit, algorithm, scanned, pipeline, estimator, reordering, notice, assume] [input, figure, based, user, component, method] [network, number, neural, efficient, parameter, deep, residual, small, output, weighted, architecture, better] [type, model, coverage, cylinder, consider, step, sum] [predicted, segmentation, threshold, supervision] [loss, supervised, test, learning, training, data, membership, set]
@InProceedings{Li_2019_CVPR,
  author = {Li, Lingxiao and Sung, Minhyuk and Dubrovina, Anastasia and Yi, Li and Guibas, Leonidas J.},
  title = {Supervised Fitting of Geometric Primitives to 3D Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Do Better ImageNet Models Transfer Better?
Simon Kornblith, Jonathon Shlens, Quoc V. Le


Transfer learning is a cornerstone of computer vision, yet little work has been done to evaluate the relationship between architecture and transfer. An implicit hypothesis in modern computer vision research is that models that perform better on ImageNet necessarily perform better on other vision tasks. However, this hypothesis has never been systematically tested. Here, we compare the performance of 16 classification networks on 12 image classification datasets. We find that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy (r = 0.99 and 0.96, respectively). In the former setting, we find that this relationship is very sensitive to the way in which networks are trained on ImageNet; many common forms of regularization slightly improve ImageNet accuracy but yield features that are much worse for transfer learning. Additionally, we find that, on two small fine-grained image classification datasets, pretraining on ImageNet provides minimal benefits, indicating the learned features from ImageNet do not transfer well to fine-grained tasks. Together, our results show that ImageNet architectures generalize well across datasets, but ImageNet features are less general than previously suggested.
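In the fixed-feature setting, the transfer protocol amounts to extracting penultimate-layer activations from a pretrained ImageNet model and fitting a regularized linear classifier on each target dataset. A minimal sketch with scikit-learn follows; the feature extraction itself is omitted and the arrays below are random placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression

# train_feats / test_feats stand in for penultimate-layer features of a
# pretrained network, e.g. shape (n_samples, 2048) for a ResNet-50.
rng = np.random.default_rng(0)
train_feats, train_labels = rng.normal(size=(200, 2048)), rng.integers(0, 10, 200)
test_feats, test_labels = rng.normal(size=(50, 2048)), rng.integers(0, 10, 50)

clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(train_feats, train_labels)
print("transfer accuracy:", clf.score(test_feats, test_labels))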
[recognition, dataset, work] [computer, vision, pattern, international, error] [conference, ieee, image, figure, smooth, aux] [imagenet, accuracy, performance, better, neural, scale, deep, convolutional, fixed, dropout, correlation, penultimate, layer, initialization, best, network, higher, size, andrew, larger, fgvc, architecture, regularization, number, batch, scratch, achieved, processing, small, pretrained, modern] [inception, random, model, appendix, machine, visual, van, find] [regression, object, feature, head, detection, improve, cnn, average, ross, kaiming] [transfer, training, learning, datasets, classification, logistic, trained, label, large, pretraining, stanford, data, learned, aircraft]
@InProceedings{Kornblith_2019_CVPR,
  author = {Kornblith, Simon and Shlens, Jonathon and Le, Quoc V.},
  title = {Do Better ImageNet Models Transfer Better?},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Gotta Adapt 'Em All: Joint Pixel and Feature-Level Domain Adaptation for Recognition in the Wild
Luan Tran, Kihyuk Sohn, Xiang Yu, Xiaoming Liu, Manmohan Chandraker


Recent developments in deep domain adaptation have allowed knowledge transfer from a labeled source domain to an unlabeled target domain at the level of intermediate features or input pixels. We propose that advantages may be derived by combining them, in the form of different insights that lead to a novel design and complementary properties that result in better performance. At the feature level, inspired by insights from semi-supervised learning, we propose a classification-aware domain adversarial neural network that brings target examples into more classifiable regions of the source domain. Next, we posit that computer vision insights are more amenable to injection at the pixel level. In particular, we use 3D geometry and image synthesis based on a generalized appearance flow to preserve identity across pose transformations, while using an attribute-conditioned CycleGAN to translate a single source into multiple target images that differ in lower-level properties such as lighting. Besides the standard UDA benchmark, we validate on a novel and apt problem of car recognition in unlabeled surveillance images using labeled images from the web, handling explicitly specified, nameable factors of variation through pixel-level and implicit, unspecified factors through feature-level adaptation.
[recognition, joint, flow, multiple, framework, individual] [perspective, night, photometric, analysis, rendered, allows, problem, lighting, viewpoint, parameterization, vision, error] [image, cyclegan, pixel, surveillance, appearance, real, day, proposed, transformation, input, translation, figure, based, synthetic, face, xiaoming, conditional] [table, deep, accuracy, network, output, neural, performance, better, standard, convolutional] [adversarial, discriminator, model] [feature, car, propose, object, cnn, semantic, complementary, baseline] [domain, target, learning, adaptation, source, dann, web, classifier, training, unsupervised, uda, labeled, set, afnet, data, test, trained, unlabeled, shared, kfnet, novel, generalization, loss, log, classification, transfer, discrepancy, class]
@InProceedings{Tran_2019_CVPR,
  author = {Tran, Luan and Sohn, Kihyuk and Yu, Xiang and Liu, Xiaoming and Chandraker, Manmohan},
  title = {Gotta Adapt 'Em All: Joint Pixel and Feature-Level Domain Adaptation for Recognition in the Wild},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift
Xiang Li, Shuo Chen, Xiaolin Hu, Jian Yang


This paper first answers the question "why do the two most powerful techniques Dropout and Batch Normalization (BN) often lead to a worse performance when they are combined together in many modern neural networks, but cooperate well sometimes as in Wide ResNet (WRN)?" from both theoretical and empirical aspects. Theoretically, we find that Dropout shifts the variance of a specific neural unit when we transfer the state of that network from training to test. However, BN maintains its statistical variance, which is accumulated from the entire learning procedure, in the test phase. The inconsistency of variances in Dropout and BN (we name this scheme "variance shift") causes unstable numerical behavior in inference that ultimately leads to erroneous predictions. Meanwhile, the large feature dimension in WRN further reduces the "variance shift" to bring benefits to the overall performance. Thorough experiments on representative modern convolutional networks like DenseNet, ResNet, ResNeXt and Wide ResNet confirm our findings. According to the uncovered mechanism, we gain a better understanding of the combination of these two techniques and summarize guidelines for better practice.
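The variance shift can be reproduced numerically: a unit followed by standard inverted dropout has roughly 1/(1-p) times larger variance during training than at test time, and BN's accumulated statistics come from the training phase. A small sketch under that assumption (numbers are illustrative only):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)          # unit activation with unit variance
p = 0.5                                  # dropout rate

mask = rng.random(x.shape) >= p
train_out = x * mask / (1.0 - p)         # inverted dropout, training mode
test_out = x                             # dropout is identity at test time

print(train_out.var())   # ~ var(x) / (1 - p), i.e. about 2.0
print(test_out.var())    # ~ var(x), i.e. about 1.0
# A BN layer placed after this unit accumulates the training-time variance,
# so at test time its normalization no longer matches the incoming statistics.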
[moving, averaging, previous, speech, key, hidden, averaged] [case, error, form, corresponding] [statistical, real, rain, figure, based] [dropout, variance, shift, neural, ratio, densenet, layer, wrn, preresnet, resnext, modern, drop, convolutional, deep, network, bottleneck, wide, performance, vari, table, est, resnet, numerical, batch, normalization, science, scheme, better, weight, entire, channel, rate, lead, inference, denotes, applied, scale, apply, efficient, deduction, larger, cov] [arxiv, preprint, model, mode, vector, find, retain, natural, discover, random, consider] [art, feature, response, average, stage, final] [training, test, data, learning, trained, large, distribution]
@InProceedings{Li_2019_CVPR,
  author = {Li, Xiang and Chen, Shuo and Hu, Xiaolin and Yang, Jian},
  title = {Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Circulant Binary Convolutional Networks: Enhancing the Performance of 1-Bit DCNNs With Circulant Back Propagation
Chunlei Liu, Wenrui Ding, Xin Xia, Baochang Zhang, Jiaxin Gu, Jianzhuang Liu, Rongrong Ji, David Doermann


The rapidly decreasing computation and memory cost has recently driven the success of many applications in the field of deep learning. Practical applications of deep learning in resource-limited hardware, such as embedded devices and smart phones, however, remain challenging. For binary convolutional networks, the reason lies in the degraded representation caused by binarizing full-precision filters. To address this problem, we propose new circulant filters (CiFs) and a circulant binary convolution (CBConv) to enhance the capacity of binarized convolutional features via our circulant back propagation (CBP). The CiFs can be easily incorporated into existing deep convolutional neural networks (DCNNs), which leads to new Circulant Binary Convolutional Networks (CBCNs). Extensive experiments confirm that the performance gap between the 1-bit and full-precision DCNNs is minimized by increasing the filter diversity, which further increases the representational ability in our networks. Our experiments on ImageNet show that CBCNs achieve 61.4% top-1 accuracy with ResNet18. Compared to the state-of-the-art such as XNOR, CBCNs can achieve up to 10% higher top-1 accuracy with more powerful representational ability.
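The circulant filter idea augments each learned filter with rotated copies of itself so that the 1-bit convolution sees several orientations of the same filter. The sketch below only illustrates that idea with rotated, XNOR-style binarized copies of a 3x3 filter; the paper's exact CiF/CBConv construction and its back-propagation (CBP) differ.

import numpy as np

def rotated_binary_filters(w):
    """Return binarized copies of a real 3x3 filter rotated by 0/90/180/270
    degrees, each scaled by the mean absolute weight (XNOR-style)."""
    rotations = [np.rot90(w, k) for k in range(4)]
    alpha = np.abs(w).mean()
    return [alpha * np.sign(r) for r in rotations]

w = np.random.randn(3, 3)
cifs = rotated_binary_filters(w)
print(len(cifs), cifs[0].shape)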
[propagation, dataset, build] [matrix, computer, column, corresponding, rotation, algorithm, vision, initial, single, international] [based, input, conference, figure, proposed, result, degradation] [circulant, binary, cbcn, convolutional, xnor, convolution, network, cbcns, performance, filter, neural, table, binarized, deep, accuracy, cifs, cbconv, representational, dcnns, imagenet, increase, binarization, gradient, cif, achieve, better, layer, cbp, output, gaussian, kernel, number, rate, binarynet, denotes, original, represents, compare, baochang, binarizing] [ability, sign, vector, easily] [feature, improvement, map, stage, center, enhance, backbone] [learned, training, function, learning, set, loss, transfer, classification, testing, representation, gap, large, maximum]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Chunlei and Ding, Wenrui and Xia, Xin and Zhang, Baochang and Gu, Jiaxin and Liu, Jianzhuang and Ji, Rongrong and Doermann, David},
  title = {Circulant Binary Convolutional Networks: Enhancing the Performance of 1-Bit DCNNs With Circulant Back Propagation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DeFusionNET: Defocus Blur Detection via Recurrently Fusing and Refining Multi-Scale Deep Features
Chang Tang, Xinzhong Zhu, Xinwang Liu, Lizhe Wang, Albert Zomaya


Defocus blur detection aims to detect out-of-focus regions from an image. Although attracting more and more attention due to its widespread applications, defocus blur detection still confronts several challenges such as the interference of background clutter, sensitivity to scales and missing boundary details of defocus blur regions. To deal with these issues, we propose a deep neural network which recurrently fuses and refines multi-scale deep features (DeFusionNet) for defocus blur detection. We firstly utilize a fully convolutional network to extract multi-scale deep features. The features from bottom layers are able to capture rich low-level features for detail preservation, while the features from top layers can characterize the semantic information to locate blur regions. These features from different layers are fused as shallow features and semantic features, respectively. After that, the fused shallow features are propagated to top layers for refining the fine details of detected defocus blur regions, and the fused semantic features are propagated to bottom layers to assist in better locating the defocus regions. The feature fusing and refining are carried out in a recurrent manner. Also, we finally fuse the output of each layer at the last recurrent step to obtain the final defocus blur map by considering the sensitivity to scales of the defocus degree. Experiments on two commonly used defocus blur detection benchmark datasets are conducted to demonstrate the superiority of DeFusionNet when compared with 10 other competitors. Code and more results can be found at: http://tangchang.net
[recurrent, extract, dataset, capture, considering] [computer, pattern, vision, well, local, single, focal, degree, analysis, ground] [blur, defocus, image, ieee, conference, proposed, based, figure, input, blurred, method, comparison, background] [deep, defusionnet, network, shallow, layer, convolutional, fshf, neural, fine, output, btbnet, table, lbp, hifst, gradient, convolution, efficient, represents, order, relu, better, effectiveness] [step, generate, model] [detection, feature, semantic, final, fusing, fused, map, refine, refining, mae, fully, region, fsef, recurrently, detected, object, fuse, complementary, chang, propose] [training, learning, set, datasets, shi, testing, discriminative, china]
@InProceedings{Tang_2019_CVPR,
  author = {Tang, Chang and Zhu, Xinzhong and Liu, Xinwang and Wang, Lizhe and Zomaya, Albert},
  title = {DeFusionNET: Defocus Blur Detection via Recurrently Fusing and Refining Multi-Scale Deep Features},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Virtual Networks for Memory Efficient Inference of Multiple Tasks
Eunwoo Kim, Chanho Ahn, Philip H.S. Torr, Songhwai Oh


Deep networks consume a large amount of memory by their nature. A natural question arises: can we reduce that memory requirement whilst maintaining performance? In particular, in this work we address the problem of memory-efficient learning for multiple tasks. To this end, we propose a novel network architecture producing multiple networks of different configurations, termed deep virtual networks (DVNs), for different tasks. Each DVN is specialized for a single task and structured hierarchically. The hierarchical structure, which contains multiple levels of hierarchy corresponding to different numbers of parameters, enables multiple inference for different memory budgets. The building block of a deep virtual network is based on a disjoint collection of parameters of a network, which we call a unit. The lowest level of hierarchy in a deep virtual network is a unit, and higher levels of hierarchy contain lower levels' units and other additional units. Given a budget on the number of parameters, a different level of a deep virtual network can be chosen to perform the task. A unit can be shared by different DVNs, allowing multiple DVNs in a single network. In addition, shared units provide assistance to the target task with additional knowledge learned from other tasks. This cooperative configuration of DVNs makes it possible to handle different tasks in a memory-aware manner. Our experiments show that the proposed method outperforms existing approaches for multiple tasks. Notably, ours is more efficient than others as it allows memory-aware inference for all tasks.
[multiple, perform, sequential, joint, performs, consists] [virtual, single, approach, computer, problem, international, corresponding, additional, vision, respect, pattern, allows, assume] [proposed, conference, based, figure, method, ieee, collected] [network, deep, inference, number, efficient, nestednet, unit, architecture, neural, density, compared, structure, accuracy, performance, dvn, table, parameter, disjoint, dvns, applied, dividing, convolutional, sharing, layer, output, better, lower, performed, order] [memory, enables, machine, arxiv, preprint] [hierarchy, level, feature, hierarchical, baseline, backbone, three] [learning, task, set, scenario, learned, trained, strategy, classification, datasets, function, training, lwf, shared, loss]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Eunwoo and Ahn, Chanho and Torr, Philip H.S. and Oh, Songhwai},
  title = {Deep Virtual Networks for Memory Efficient Inference of Multiple Tasks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Universal Domain Adaptation
Kaichao You, Mingsheng Long, Zhangjie Cao, Jianmin Wang, Michael I. Jordan


Domain adaptation aims to transfer knowledge in the presence of the domain gap. Existing domain adaptation methods rely on rich prior knowledge about the relationship between the label sets of source and target domains, which greatly limits their application in the wild. This paper introduces Universal Domain Adaptation (UDA) that requires no prior knowledge on the label sets. For a given source label set and a target label set, they may contain a common label set and hold a private label set respectively, bringing up an additional category gap. UDA requires a model to either (1) classify the target sample correctly if it is associated with a label in the common label set, or (2) mark it as "unknown" otherwise. More importantly, a UDA model should work stably against a wide spectrum of commonness (the proportion of the common label set over the complete label set) so that it can handle real-world problems with unknown target label sets. To solve the universal domain adaptation problem, we propose Universal Adaptation Network (UAN). It quantifies sample-level transferability to discover the common label set and the label sets private to each domain, thereby promoting the adaptation in the automatically discovered common label set and recognizing the "unknown" samples successfully. A thorough evaluation shows that UAN outperforms the state of the art closed set, partial and open set domain adaptation methods in the novel UDA setting.
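The sample-level transferability idea can be illustrated with a simple weighting that combines prediction entropy and a domain score: low-entropy, source-like target samples are treated as likely members of the common label set, while high-entropy samples lean toward "unknown". The sketch below is a hypothetical illustration only; UAN's actual weighting formula differs in its details.

import numpy as np

def sample_transferability(class_probs, domain_prob):
    """Hypothetical sample-level weight from normalized prediction entropy and
    a domain-discriminator score; higher values mean 'more transferable'."""
    entropy = -(class_probs * np.log(class_probs + 1e-8)).sum(axis=1)
    entropy = entropy / np.log(class_probs.shape[1])   # normalize to [0, 1]
    return (1.0 - entropy) * domain_prob

probs = np.array([[0.9, 0.05, 0.05], [0.34, 0.33, 0.33]])
print(sample_transferability(probs, domain_prob=np.array([0.8, 0.4])))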
[outperforms, prediction, work, perform, dataset] [june, well, practical, exp, note, confidence] [figure, generative, prior, proposed, image] [criterion, network, deep, accuracy, performance, compared, resnet, neural, explore, size] [common, adversarial, transferability, partial, private, model, discriminator, evaluation, relationship, coming, probability, visual] [feature, category, propose] [domain, adaptation, label, target, set, source, uan, data, universal, uda, open, transfer, learning, training, existing, unsupervised, knowledge, closed, class, labeled, large, similarity, setting, gap, classifier, entropy, sample, distribution, iwan, osbp, shared, uncertainty, weighting, alignment, task, exq, dann, classification, log, unknown, discrepancy, exqc, expcs, china]
@InProceedings{You_2019_CVPR,
  author = {You, Kaichao and Long, Mingsheng and Cao, Zhangjie and Wang, Jianmin and Jordan, Michael I.},
  title = {Universal Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Improving Transferability of Adversarial Examples With Input Diversity
Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, Alan L. Yuille


Though CNNs have achieved the state-of-the-art performance on various vision tasks, they are vulnerable to adversarial examples --- crafted by adding human-imperceptible perturbations to clean images. However, most of the existing adversarial attacks only achieve relatively low success rates under the challenging black-box setting, where the attackers have no knowledge of the model structure and parameters. To this end, we propose to improve the transferability of adversarial examples by creating diverse input patterns. Instead of only using the original images to generate adversarial examples, our method applies random transformations to the input images at each iteration. Extensive experiments on ImageNet show that the proposed attack method can generate adversarial examples that transfer much better to different networks than existing baselines. By evaluating our method against top defense solutions and official baselines from NIPS 2017 adversarial competition, the enhanced attack reaches an average success rate of 73.0%, which outperforms the top-1 attack submission in the NIPS competition by a large margin of 6.6%. We hope that our proposed attack strategy can serve as a strong benchmark baseline for evaluating the robustness of networks to adversaries and the effectiveness of different defense methods in the future. Code is available at https://github.com/cihangxie/DI-2-FGSM.
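The core of the diverse-input attack is a random resize-and-pad transformation applied to the image with some probability at every iteration before the gradient step. A PyTorch-style sketch of such a transformation is shown below; the parameter values are illustrative rather than the exact ones used in the paper's released code.

import torch
import torch.nn.functional as F

def input_diversity(x, low=270, high=299, prob=0.7):
    """Randomly resize a (B,3,299,299) batch to [low, high) pixels and pad it
    back to the original resolution; applied with probability `prob`."""
    if torch.rand(1).item() > prob:
        return x
    size = int(torch.randint(low, high, (1,)).item())
    resized = F.interpolate(x, size=(size, size), mode="nearest")
    pad = x.shape[-1] - size
    top = int(torch.randint(0, pad + 1, (1,)).item())
    left = int(torch.randint(0, pad + 1, (1,)).item())
    return F.pad(resized, (left, pad - left, top, pad - top), value=0.0)

# Inside an iterative FGSM loop one would then compute the gradient on the
# transformed input, e.g.
#   grad = torch.autograd.grad(loss_fn(model(input_diversity(x_adv)), y), x_adv)[0]
# and take the usual signed gradient step.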
[walking] [international, total, momentum, computer, vision, single, corresponding] [input, proposed, method, image, conference, transformation, high, clean, figure] [network, iteration, rate, gradient, number, fast, size, table, deep, better, higher, top, neural] [adversarial, success, attack, adv, true, probability, step, transferability, diverse, iterative, sign, random, defense, generated, attacking, leopard, model, fgsm, arxiv, preprint, generate, competition, diversity, adversarially, machine, stick, evaluating, robustness, perturbation, transformed, degrades] [improve, official, average] [learning, ensemble, trained, set, cat, strategy, loss, training, test]
@InProceedings{Xie_2019_CVPR,
  author = {Xie, Cihang and Zhang, Zhishuai and Zhou, Yuyin and Bai, Song and Wang, Jianyu and Ren, Zhou and Yuille, Alan L.},
  title = {Improving Transferability of Adversarial Examples With Input Diversity},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Sequence-To-Sequence Domain Adaptation Network for Robust Text Image Recognition
Yaping Zhang, Shuai Nie, Wenju Liu, Xing Xu, Dongxiang Zhang, Heng Tao Shen


Domain adaptation has shown promising advances for alleviating the domain shift problem. However, recent visual domain adaptation works usually focus on non-sequential object recognition with a global coarse alignment, which is inadequate to transfer effective knowledge for sequence-like text images with variable-length fine-grained character information. In this paper, we develop a Sequence-to-Sequence Domain Adaptation Network (SSDAN) for robust text image recognition, which could exploit unsupervised sequence data by an attention-based sequence encoder-decoder network. In the SSDAN, a gated attention similarity (GAS) unit is introduced to adaptively focus on aligning the distribution of the source and target sequence data in an attended character-level feature space rather than a global coarse alignment. Extensive text recognition experiments show the SSDAN could efficiently transfer sequence knowledge and validate the promising power of the proposed model towards real-world applications in various recognition scenarios, including natural scene text, handwritten text and even mathematical expression recognition.
[recognition, sequence, gru, focus, dataset] [scene, robust, international, computer, pattern, vision, analysis] [image, conference, figure, method, proposed, expression, input, ieee, real] [shift, unit, network, table, deep, gate, adaptively, neural, performance, effective, denotes, size, achieved, reduce, convolutional] [text, model, ssdan, attention, handwritten, character, mathematical, encoder, gas, vector, handwriting, adversarial, decoding, attended, decoder, natural, evaluate, introduced, probability, cer, machine, arxiv, preprint, gated, reading, encoded, ldec] [feature, global, cnn, context, spatial, map] [domain, adaptation, source, target, data, unsupervised, loss, training, test, function, similarity, specific, set, learning, learn, validate, minimizing, knowledge, coral, investigate]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Yaping and Nie, Shuai and Liu, Wenju and Xu, Xing and Zhang, Dongxiang and Tao Shen, Heng},
  title = {Sequence-To-Sequence Domain Adaptation Network for Robust Text Image Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Hybrid-Attention Based Decoupled Metric Learning for Zero-Shot Image Retrieval
Binghui Chen, Weihong Deng


In the zero-shot image retrieval (ZSIR) task, embedding learning becomes more attractive; however, many methods follow the traditional metric learning idea and omit the problems behind zero-shot settings. In this paper, we first emphasize the importance of learning a visually discriminative metric and preventing the partial/selective learning behavior of the learner in ZSIR, and then propose the Decoupled Metric Learning (DeML) framework to achieve these individually. Instead of coarsely optimizing a unified metric, we decouple it into multiple attention-specific parts so as to recurrently induce the discrimination and explicitly enhance the generalization. These are mainly achieved by our object-attention module based on random walk graph propagation and the channel-attention module based on the adversary constraint, respectively. We demonstrate the necessity of addressing the vital problems in ZSIR on the popular benchmarks, outperforming the state-of-the-art methods by a significant margin. Code is available at http://www.bhchen.cn
[recognition, multiple, explicitly, walk, graph, behavior, online, propagation] [computer, vision, corresponding, pattern, international, optimizing, directly, robust] [conference, image, ieee, input, ladv, based, capable] [deep, scale, neural, network, convolutional, weight, performance, order] [attention, diversity, random, adversary, visual, diverse, model, partial, arxiv, preprint, encourage, easily, rich] [feature, improve, proposal, holistic, module, object] [learning, metric, discriminative, discrimination, knowledge, training, deml, unified, generalization, embedding, loss, decoupling, learner, unseen, retrieval, cub, learn, observe, decoupled, testing, representation, ensemble, lact, idea, undiscriminating, set, zsir, stanford, informative]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Binghui and Deng, Weihong},
  title = {Hybrid-Attention Based Decoupled Metric Learning for Zero-Shot Image Retrieval},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Sample
Oren Dovrat, Itai Lang, Shai Avidan


Processing large point clouds is a challenging task. Therefore, the data is often sampled to a size that can be processed more easily. The question is how to sample the data? A popular sampling technique is Farthest Point Sampling (FPS). However, FPS is agnostic to a downstream application (classification, retrieval, etc.). The underlying assumption seems to be that minimizing the farthest point distance, as done by FPS, is a good proxy to other objective functions. We show that it is better to learn how to sample. To do that, we propose a deep network to simplify 3D point clouds. The network, termed S-NET, takes a point cloud and produces a smaller point cloud that is optimized for a particular task. The simplified point cloud is not guaranteed to be a subset of the original point cloud. Therefore, we match it to a subset of the original points in a post-processing step. We contrast our approach with FPS by experimenting on two standard data sets and show significantly better results for a variety of applications. Our code is publicly available.
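The FPS baseline mentioned above and the post-processing step of matching generated points back to a subset of the input are both simple to state. A numpy sketch of farthest point sampling and nearest-neighbor matching follows (helper names are illustrative; S-NET itself is a learned network and not shown).

import numpy as np

def farthest_point_sampling(points, k):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        idx = int(dist.argmax())
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)

def match_to_input(generated, points):
    """Snap each generated point to its nearest input point (post-processing)."""
    d = np.linalg.norm(points[None, :, :] - generated[:, None, :], axis=-1)
    return points[d.argmin(axis=1)]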
[recognition, work, time, performs] [point, cloud, pointnet, reconstruction, progressivenet, shape, computer, vision, matching, approach, pattern, farthest, simplification, error, nre, well, normalized, international, match, defined, alternative] [input, figure, conference, ieee, proposed, method, based] [network, size, fps, accuracy, better, output, ratio, original, progressive, deep, inference, optimized, neural, processing, compare, fixed, number, applied, lower, order, larger, higher, simplified, performance, suggested, regularization] [sampled, generated, complete, random, arxiv, preprint, step, improved] [feature, three, improve, average, evaluated] [sampling, task, sample, trained, set, loss, subset, classification, data, training, train, learning, retrieval, objective, autoencoder, large, distance, test, target, specific, function]
@InProceedings{Dovrat_2019_CVPR,
  author = {Dovrat, Oren and Lang, Itai and Avidan, Shai},
  title = {Learning to Sample},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Few-Shot Learning via Saliency-Guided Hallucination of Samples
Hongguang Zhang, Jing Zhang, Piotr Koniusz


Learning new concepts from a few samples is a standard challenge in computer vision. The main directions to improve the learning ability of few-shot training models include (i) robust similarity learning and (ii) generating or hallucinating additional data from the limited existing samples. In this paper, we follow the latter direction and present a novel data hallucination model. Currently, most datapoint generators contain a specialized network (i.e., GAN) tasked with hallucinating new datapoints, thus requiring large amounts of annotated data for their training in the first place. In this paper, we propose a novel less-costly hallucination method for few-shot learning which utilizes saliency maps. To this end, we employ a saliency network to obtain the foregrounds and backgrounds of available image samples and feed the resulting maps into a two-stream network to hallucinate datapoints directly in the feature space from viable foreground-background combinations. To the best of our knowledge, we are the first to leverage saliency maps for such a task and we demonstrate their usefulness in hallucinating additional datapoints for few-shot learning. Our proposed network achieves the state of the art on publicly available datasets.
[perform, dataset] [additional, approach, form] [figure, background, image, mixing, proposed, real, based, prior, method] [network, net, accuracy, deep, table, performance, pooling, regularization, convolutional, neural, apply, number] [model, query, visual, simple, encoding, giraffe, ability, concept, encoder, generate, datapoint] [saliency, feature, foreground, relation, propose, airplane, detector, mic, piotr, salient, object, spatial, map, rfcn] [learning, hallucination, similarity, sosn, data, training, support, salnet, datapoints, miniimagenet, strategy, open, hallucinated, class, function, bsnw, novel, large, metric, representation, investigate, set, mnl, hongguang, hallucinating, unsupervised, rbd]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Hongguang and Zhang, Jing and Koniusz, Piotr},
  title = {Few-Shot Learning via Saliency-Guided Hallucination of Samples},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Variational Convolutional Neural Network Pruning
Chenglong Zhao, Bingbing Ni, Jian Zhang, Qiwei Zhao, Wenjun Zhang, Qi Tian


We propose a variational Bayesian scheme for pruning convolutional neural networks at the channel level. This idea is motivated by the fact that deterministic value-based pruning methods are inherently improper and unstable. In a nutshell, a variational technique is introduced to estimate the distribution of a newly proposed parameter, called channel saliency; based on this, redundant channels can be removed from the model via a simple criterion. The advantages are two-fold: 1) Our method conducts channel pruning without the need for a re-training stage, thus improving computational efficiency. 2) Our method is implemented as a stand-alone module, called the variational pruning layer, which can be straightforwardly inserted into off-the-shelf deep learning packages, without any special network design. Extensive experimental results demonstrate the effectiveness of our method: For CIFAR-10, we perform channel removal on different CNN models with up to 74% reduction, which results in significant size reduction and computation saving. For ImageNet, about 40% of the channels of ResNet-50 are removed without compromising accuracy.
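The removal criterion itself is straightforward once a per-channel saliency is available: channels whose saliency is negligible are dropped. A rough sketch under that assumption (in the paper the saliency comes from a learned variational distribution, which is not modeled here):

import numpy as np

def prune_by_channel_saliency(weights, saliency, threshold=1e-2):
    """Keep only output channels whose saliency exceeds a threshold.
    weights: (C_out, C_in, k, k) conv weights; saliency: (C_out,) scores."""
    keep = saliency > threshold
    return weights[keep], keep

w = np.random.randn(64, 32, 3, 3)
s = np.abs(np.random.randn(64)) * 0.05
pruned, keep = prune_by_channel_saliency(w, s)
print(pruned.shape, int(keep.sum()), "channels kept")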
[deterministic, perform, dataset, work] [computer, vision, international, pattern, estimate, analysis, special, view, algorithm, corresponding, optimization] [method, conference, based, proposed, ieee, remove, prior, figure, removing, image] [pruning, channel, neural, deep, pruned, convolutional, accuracy, compression, performance, network, parameter, redundant, layer, batch, computation, factor, weight, normalization, scale, table, efficient, processing, called, prune, unimportant, imagenet, inference, rate, bayesian, compact, compress, sparse, bingbing, inserted, size, best, sparsity, storage] [variational, model, introduce, introduced, machine, memory] [saliency, propose, cnn, object, european, stage, semantic] [distribution, learning, base, training, dkl, posterior, function]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Chenglong and Ni, Bingbing and Zhang, Jian and Zhao, Qiwei and Zhang, Wenjun and Tian, Qi},
  title = {Variational Convolutional Neural Network Pruning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Towards Optimal Structured CNN Pruning via Generative Adversarial Learning
Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, David Doermann


Structured pruning of filters or neurons has received increased focus for compressing convolutional neural networks. Most existing methods rely on multi-stage optimizations in a layer-wise manner for iteratively pruning and retraining, which may not be optimal and may be computationally intensive. Besides, these methods are designed for pruning a specific structure, such as filter or block structures, without jointly pruning heterogeneous structures. In this paper, we propose an effective structured pruning approach that jointly prunes filters as well as other structures in an end-to-end manner. To accomplish this, we first introduce a soft mask to scale the output of these structures by defining a new objective function with sparsity regularization to align the output of the baseline and the network with this mask. We then effectively solve the optimization problem by generative adversarial learning (GAL), which learns a sparse soft mask in a label-free and end-to-end manner. By forcing more scale factors in the soft mask to zero, the fast iterative shrinkage-thresholding algorithm (FISTA) can be leveraged to quickly and reliably remove the corresponding structures. Extensive experiments demonstrate the effectiveness of GAL on different datasets, including MNIST, CIFAR-10 and ImageNet ILSVRC 2012. For example, on ImageNet ILSVRC 2012, the pruned ResNet-50 achieves 10.88% Top-5 error and results in a factor of 3.7x speedup. This significantly outperforms state-of-the-art methods.
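The key ingredient that drives soft-mask entries exactly to zero in a FISTA-style update is the soft-thresholding proximal operator of the L1 regularizer. A minimal sketch of that step (learning rate and regularization strength are placeholders; the full GAL objective with the discriminator is not shown):

import numpy as np

def soft_threshold(m, lam):
    """Proximal operator of lam*||m||_1: shrinks entries toward zero and sets
    small ones exactly to zero, enabling reliable structure removal."""
    return np.sign(m) * np.maximum(np.abs(m) - lam, 0.0)

# One ISTA/FISTA-style update of the soft mask given a gradient of the loss:
mask = np.random.randn(64) * 0.1
grad = np.random.randn(64) * 0.01
lr, lam = 0.1, 0.05
mask = soft_threshold(mask - lr * grad, lr * lam)
print((mask == 0).sum(), "structures zeroed out")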
[jointly] [error, optimization, solve, corresponding, approach, algorithm, problem] [remove, proposed, input, generative, generator] [pruning, pruned, network, neural, gal, convolutional, output, achieves, rate, deep, prune, table, filter, efficient, structured, sparse, layer, channel, residual, compared, block, regularization, compression, selection, sparsity, fast, performance, scaling, imagenet, achieve, number, parameter, higher, binary, redundant, better, reliably, speedup, best, fista, sgd, factor] [adversarial, discriminator, model, introduced, random, iterative] [mask, baseline, feature, including, branch, three] [soft, learning, training, knowledge, set, googlenet, train, update, loss, log, classification, effectively, hard]
@InProceedings{Lin_2019_CVPR,
  author = {Lin, Shaohui and Ji, Rongrong and Yan, Chenqian and Zhang, Baochang and Cao, Liujuan and Ye, Qixiang and Huang, Feiyue and Doermann, David},
  title = {Towards Optimal Structured CNN Pruning via Generative Adversarial Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Exploiting Kernel Sparsity and Entropy for Interpretable CNN Compression
Yuchao Li, Shaohui Lin, Baochang Zhang, Jianzhuang Liu, David Doermann, Yongjian Wu, Feiyue Huang, Rongrong Ji


Compressing convolutional neural networks (CNNs) has received ever-increasing research focus. However, most existing CNN compression methods do not interpret their inherent structures to distinguish the implicit redundancy. In this paper, we investigate the problem of CNN compression from a novel interpretable perspective. The relationship between the input feature maps and 2D kernels is revealed in a theoretical framework, based on which a kernel sparsity and entropy (KSE) indicator is proposed to quantitate the feature map importance in a feature-agnostic manner to guide model compression. Kernel clustering is further conducted based on the KSE indicator to accomplish high-precision CNN compression. KSE is capable of simultaneously compressing each layer in an efficient way, which is significantly faster compared to previous data-driven feature map pruning methods. We comprehensively evaluate the compression and speedup of the proposed method on CIFAR-10, SVHN and ImageNet 2012. Our method demonstrates superior performance gains over previous ones. In particular, it achieves 4.7x FLOPs reduction and 2.9x compression on ResNet-50 with only a top-5 accuracy drop of 0.35% on ImageNet 2012, which significantly outperforms state-of-the-art methods.
[recognition, previous] [corresponding, computer, vision, pattern, international, compute, problem, directly, approach, field, heat, david, simultaneously] [input, conference, method, proposed, ieee, based, figure, image] [convolutional, kernel, sparsity, kse, compression, neural, pruning, network, deep, number, indicator, channel, output, layer, compress, compressed, convolution, achieves, receptive, efficient, imagenet, richness, sparse, compared, original, accuracy, reduce, compact, represents, acceleration, table, processing, shaohui, rongrong, cnns, weight, filter, calculate, prune, architecture, achieve, better, unimportant] [relationship, evaluate, interpretable, model] [feature, map, cnn, three, european] [entropy, learning, set, clustering, training, china, novel, cluster]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yuchao and Lin, Shaohui and Zhang, Baochang and Liu, Jianzhuang and Doermann, David and Wu, Yongjian and Huang, Feiyue and Ji, Rongrong},
  title = {Exploiting Kernel Sparsity and Entropy for Interpretable CNN Compression},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fully Quantized Network for Object Detection
Rundong Li, Yan Wang, Feng Liang, Hongwei Qin, Junjie Yan, Rui Fan


Efficient neural network inference is important in a number of practical domains, such as deployment in mobile settings. An effective method for increasing inference efficiency is to use low bitwidth arithmetic, which can subsequently be accelerated using dedicated hardware. However, designing effective quantization schemes while maintaining network accuracy is challenging. In particular, current techniques face difficulty in performing fully end-to-end quantization, making use of aggressively low bitwidth regimes such as 4-bit, and applying quantized networks to complex tasks such as object detection. In this paper, we demonstrate that many of these difficulties arise because of instability during the fine-tuning stage of the quantization process, and propose several novel techniques to overcome these instabilities. We apply our techniques to produce fully quantized 4-bit detectors based on RetinaNet and Faster R-CNN, and show that these achieve state-of-the-art performance for quantized detectors. The mAP loss due to quantization using our methods is more than 3.8x less than the loss from existing methods.
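Low-bitwidth quantization of weights or activations amounts to clamping values to an estimated range and rounding to one of 2^b uniformly spaced levels. A minimal sketch of such a simulated uniform quantizer follows; the paper's contributions concern how the ranges and statistics are stabilized during fine-tuning, which is not shown here.

import numpy as np

def uniform_quantize(x, num_bits=4, lo=None, hi=None):
    """Simulated uniform quantization: clamp to [lo, hi], round to 2**bits levels."""
    lo = x.min() if lo is None else lo
    hi = x.max() if hi is None else hi
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    q = np.round((np.clip(x, lo, hi) - lo) / scale)
    return q * scale + lo          # de-quantized value used to simulate inference

x = np.random.randn(1000)
xq = uniform_quantize(x)
print("unique levels:", len(np.unique(xq)))   # at most 16 for 4-bit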
[complex, current, moving, previous] [computer, point, vision, note, pattern] [conference, ieee, method, based, proposed, figure, input, percentile, row] [quantization, quantized, batch, network, fine, neural, normalization, activation, table, fqn, tuning, precision, bitwidth, weight, deep, low, accuracy, arithmetic, performed, ema, freezing, efficient, full, floating, small, convolutional, number, inference, instability, performance, quantizing, layer, folded] [arxiv, preprint, model, indicates] [object, detection, detector, retinanet, fully, stage, map, average, coco, faster, propose, final, including, ross] [training, loss, set, trained, classification, learning]
@InProceedings{Li_2019_CVPR,
  author = {Li, Rundong and Wang, Yan and Liang, Feng and Qin, Hongwei and Yan, Junjie and Fan, Rui},
  title = {Fully Quantized Network for Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MnasNet: Platform-Aware Neural Architecture Search for Mobile
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, Quoc V. Le


Designing convolutional neural networks (CNN) for mobile devices is challenging because mobile models need to be small and fast, yet still accurate. Although significant efforts have been dedicated to design and improve mobile CNNs on all dimensions, it is very difficult to manually balance these trade-offs when there are so many architectural possibilities to consider. In this paper, we propose an automated mobile neural architecture search (MNAS) approach, which explicitly incorporate model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. Unlike previous work, where latency is considered via another, often inaccurate proxy (e.g., FLOPS), our approach directly measures real-world inference latency by executing the model on mobile phones. To further strike the right balance between flexibility and search space size, we propose a novel factorized hierarchical search space that encourages layer diversity throughout the network. Experimental results show that our approach consistently outperforms state-of-the-art mobile CNN models across multiple vision tasks. On the ImageNet classification task, our MnasNet achieves 75.2% top-1 accuracy with 78ms latency on a Pixel phone, which is 1.8x faster than MobileNetV2 with 0.5% higher accuracy and 2.3x faster than NASNet with 1.2% higher accuracy. Our MnasNet also achieves better mAP quality than MobileNets for COCO object detection. Code is at https://github.com/tensorflow/tpu/tree/master/models/official/mnasnet.
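The latency-aware search objective weighs measured accuracy against measured latency relative to a target; in its soft-constraint form the reward is ACC(m) x [LAT(m)/T]^w with w < 0. A tiny sketch of that reward (treat the specific values as illustrative):

def mnas_reward(accuracy, latency_ms, target_ms=78.0, w=-0.07):
    """Soft latency-constrained reward: ACC(m) * (LAT(m)/T)^w; models slower
    than the target are penalized, faster ones are mildly rewarded."""
    return accuracy * (latency_ms / target_ms) ** w

print(mnas_reward(0.752, 78.0))    # on-target latency: reward equals accuracy
print(mnas_reward(0.760, 156.0))   # 2x slower: reward is discounted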
[previous, outperforms, multiple, work] [approach, manual, depth, equation, single, directly] [figure, input, based, image, balance, pixel, repeated, comparison] [search, latency, mobile, accuracy, architecture, mnasnet, layer, size, neural, inference, imagenet, table, block, number, performance, factorized, filter, denotes, achieves, network, higher, fewer, efficient, scaling, hxwxf, designing, design, automated, better, controller, larger, convolution, convolutional, nasnet, typical, kernel, pareto, batch, conv, top, reducing, depthwise] [model, reward, reinforcement, common, arxiv, preprint, diversity] [cnn, coco, object, hierarchical, faster, map, detection, baseline, propose] [space, target, objective, learning, novel, classification, auto, main, large]
@InProceedings{Tan_2019_CVPR,
  author = {Tan, Mingxing and Chen, Bo and Pang, Ruoming and Vasudevan, Vijay and Sandler, Mark and Howard, Andrew and Le, Quoc V.},
  title = {MnasNet: Platform-Aware Neural Architecture Search for Mobile},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Student Becoming the Master: Knowledge Amalgamation for Joint Scene Parsing, Depth Estimation, and More
Jingwen Ye, Yixin Ji, Xinchao Wang, Kairi Ou, Dapeng Tao, Mingli Song


In this paper, we investigate a novel deep-model reusing task. Our goal is to train a lightweight and versatile student model, without human-labelled annotations, that amalgamates the knowledge and masters the expertise of two pre-trained teacher models working on heterogeneous problems, one on scene parsing and the other on depth estimation. To this end, we propose an innovative training strategy that learns the parameters of the student intertwined with the teachers, achieved by "projecting" its amalgamated features onto each teacher's domain and computing the loss. We also introduce two options to generalize the proposed training strategy to handle three or more tasks simultaneously. The proposed scheme yields very encouraging results. As demonstrated on several benchmarks, the trained student model achieves results even superior to those of the teachers in their own expertise domains and on par with the state-of-the-art fully supervised models relying on human-labelled annotations.
[prediction, work, joint, learns, multiple] [depth, scene, estimation, computer, vision, pattern, surface, normal, denote, approach, international, depthnet, ldepth, rel, corresponding] [proposed, conference, pixel, method, handle, input, figure, described] [block, amalgamation, targetnet, network, deep, neural, segnet, architecture, amalgamated, convolutional, coding, size, processing, expertise, compact, teachernet, number, smaller, replace, performance, pooling] [decoder, model, encoder, working, introduce, step] [parsing, semantic, branch, segmentation, object, final, intertwined, three, lseg, feature, fully, map] [student, knowledge, teacher, training, learning, train, loss, strategy, task, learn, function, distilling, trained, classification, distillation, set]
@InProceedings{Ye_2019_CVPR,
  author = {Ye, Jingwen and Ji, Yixin and Wang, Xinchao and Ou, Kairi and Tao, Dapeng and Song, Mingli},
  title = {Student Becoming the Master: Knowledge Amalgamation for Joint Scene Parsing, Depth Estimation, and More},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
K-Nearest Neighbors Hashing
Xiangyu He, Peisong Wang, Jian Cheng


Hashing-based approximate nearest neighbor search embeds high-dimensional data into compact binary codes, which enables efficient similarity search and storage. However, the non-isometric sign() function makes it hard to project the nearest neighbors in continuous data space into the closest codewords in discrete Hamming space. In this work, we revisit the sign() function from the perspective of space partitioning. Specifically, we bridge the gap between k-nearest neighbors and binary hashing codes with Shannon entropy. We further propose a novel K-Nearest Neighbors Hashing (KNNH) method to learn binary representations from KNN within the subspaces generated by sign(). Theoretical and experimental results show that the KNN relation is of central importance to neighbor-preserving embeddings, and the proposed method outperforms the state-of-the-art on benchmark datasets.
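The sign()/Hamming machinery the abstract refers to is the standard binary hashing pipeline: project real-valued features, binarize with sign(), and retrieve by Hamming distance. The sketch below shows only that generic pipeline; how KNNH learns the codes from k-nearest-neighbor structure is not captured here.

import numpy as np

def binarize(features, projection):
    """Project real-valued features and binarize with sign() into hash codes."""
    return (features @ projection > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query code."""
    dists = (db_codes != query_code).sum(axis=1)
    return np.argsort(dists)

rng = np.random.default_rng(0)
feats, proj = rng.normal(size=(100, 64)), rng.normal(size=(64, 32))
codes = binarize(feats, proj)
print(hamming_rank(codes[0], codes)[:5])   # nearest items in Hamming space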
[time, dataset, term, outperforms, represented] [pattern, computer, vision, june, international, discrete, algorithm, point, dimensional, problem, matrix] [conference, ieee, method, image, based, input, high, preserving, proposed, pca, conditional, transformation, unconstrained] [binary, deep, quantization, table, precision, neural, performance, search, processing, approximate, orthogonal, small, compact, shrinkage, number, computation] [artificial] [feature, map, cvpr, jian, average, relation, propose] [hashing, knnh, knn, data, learning, mnist, nearest, itq, hamming, unsupervised, supervised, class, neighbor, function, space, set, codewords, minimization, randomly, representative, sample, similarity, entropy, min, comparsions, retrieval, training, hash, cluster, mutual, gallery, svi, ranking, imbalanced]
@InProceedings{He_2019_CVPR,
  author = {He, Xiangyu and Wang, Peisong and Cheng, Jian},
  title = {K-Nearest Neighbors Hashing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning RoI Transformer for Oriented Object Detection in Aerial Images
Jian Ding, Nan Xue, Yang Long, Gui-Song Xia, Qikai Lu


Object detection in aerial images is an active yet challenging task in computer vision because of the bird's-eye view perspective, the highly complex backgrounds, and the variant appearances of objects. Especially when detecting densely packed objects in aerial images, methods relying on horizontal proposals for common object detection often introduce mismatches between the Regions of Interest (RoIs) and objects. This leads to the common misalignment between the final object classification confidence and localization accuracy. In this paper, we propose a RoI Transformer to address these problems. The core idea of RoI Transformer is to apply spatial transformations on RoIs and learn the transformation parameters under the supervision of oriented bounding box (OBB) annotations. The RoI Transformer is lightweight and can be easily embedded into detectors for oriented object detection. Simply applying the RoI Transformer to Light-Head R-CNN achieves state-of-the-art performance on two common and challenging aerial datasets, i.e., DOTA and HRSC2016, with a negligible reduction in detection speed. Our RoI Transformer exceeds the deformable Position Sensitive RoI pooling when oriented bounding-box annotations are available. Extensive experiments have also validated the flexibility and effectiveness of our RoI Transformer.
[warping] [horizontal, ground, rotation, relative, position, scene, geometry, matching, accurate, orientation, light, problem, coordinate] [remote, proposed, based, sensing, image, ieee, method, figure] [pooling, connected, design, performance, layer, number, compared, convolutional, neural, better, output, represents, small] [transformer, model, text, sensitive] [roi, rroi, detection, object, feature, aerial, deformable, rotated, oriented, bounding, region, map, fully, dota, rrois, box, regression, baseline, ship, obb, hroi, iou, densely, packed, aspect, module, jian, fpn, misalignment, extraction, rpn, vehicle] [set, learning, classification, large, training, learner, align]
@InProceedings{Ding_2019_CVPR,
  author = {Ding, Jian and Xue, Nan and Long, Yang and Xia, Gui-Song and Lu, Qikai},
  title = {Learning RoI Transformer for Oriented Object Detection in Aerial Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Snapshot Distillation: Teacher-Student Optimization in One Generation
Chenglin Yang, Lingxi Xie, Chi Su, Alan L. Yuille


Optimizing a deep neural network is a fundamental task in computer vision, yet direct training methods often suffer from over-fitting. Teacher-student optimization aims at providing complementary cues from a model trained previously, but these approaches are often considerably slow because a few generations must be trained in sequence, i.e., time complexity is increased by several times. This paper presents snapshot distillation (SD), the first framework which enables teacher-student optimization in one generation. The idea of SD is very simple: instead of borrowing supervision signals from previous generations, we extract such information from earlier epochs in the same generation, while making sure that the difference between teacher and student is sufficiently large so as to prevent under-fitting. To achieve this goal, we implement SD in a cyclic learning rate policy, in which the last snapshot of each cycle is used as the teacher for all iterations in the next cycle, and the teacher signal is smoothed to provide richer information. On standard image classification benchmarks such as CIFAR100 and ILSVRC2012, SD achieves consistent accuracy gains without heavy computational overhead. We also verify that models pre-trained with SD transfer well to object detection and semantic segmentation on the PascalVOC dataset.
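The training scheme described above lends itself to a compact sketch. The following is a hedged PyTorch illustration (our own simplification, not the authors' released code): loader is assumed to be an infinite iterator of (inputs, labels) batches, and the cycle length, temperature T and loss weighting alpha are illustrative choices.

import copy, math
import torch
import torch.nn.functional as F

def cyclic_lr(step, steps_per_cycle, lr_max=0.1):
    """Cosine-annealed learning rate, restarted at every cycle boundary."""
    t = (step % steps_per_cycle) / steps_per_cycle
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * t))

def sd_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Cross-entropy on hard labels plus KL to the teacher's temperature-softened outputs."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    return (1 - alpha) * ce + alpha * kd

def train_one_generation(model, loader, num_cycles=5, steps_per_cycle=1000):
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    teacher, step = None, 0
    for _ in range(num_cycles):
        for _ in range(steps_per_cycle):
            x, y = next(loader)                      # loader: infinite iterator of batches (assumption)
            for g in opt.param_groups:
                g["lr"] = cyclic_lr(step, steps_per_cycle)
            logits = model(x)
            if teacher is None:
                loss = F.cross_entropy(logits, y)    # first cycle has no teacher yet
            else:
                with torch.no_grad():
                    t_logits = teacher(x)
                loss = sd_loss(logits, t_logits, y)
            opt.zero_grad(); loss.backward(); opt.step()
            step += 1
        teacher = copy.deepcopy(model).eval()        # last snapshot of the cycle becomes the teacher
    return model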
[signal, previous, term, dataset, recognition, time] [optimization, computer, vision, pattern, international, error, well, algorithm, solution] [snapshot, image, conference, difference] [network, deep, neural, number, rate, accuracy, process, eqn, standard, achieves, table, performance, higher, convolutional, architecture, gain, best, better, principle, residual, pascalvoc, smaller, gradient, size, achieve] [model, arxiv, preprint, visual, machine, indicates, generation] [baseline, semantic, object, detection, supervision, improve, extra, three, segmentation, final] [teacher, training, learning, student, distillation, classification, knowledge, similarity, set, trained, large, ensemble, testing, data, loss, cosine, annealing, idea, share, temperature, paper]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Chenglin and Xie, Lingxi and Su, Chi and Yuille, Alan L.},
  title = {Snapshot Distillation: Teacher-Student Optimization in One Generation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Geometry-Aware Distillation for Indoor Semantic Segmentation
Jianbo Jiao, Yunchao Wei, Zequn Jie, Honghui Shi, Rynson W.H. Lau, Thomas S. Huang


It has been shown that jointly reasoning about the 2D appearance and 3D information from RGB-D domains is beneficial to indoor scene semantic segmentation. However, most existing approaches require an accurate depth map as input to segment the scene, which severely limits their applicability. In this paper, we propose to jointly infer the semantic and depth information by distilling geometry-aware embedding to eliminate this strong constraint while still exploiting the helpful depth domain information. In addition, we use this learned embedding to improve the quality of semantic segmentation, through a proposed geometry-aware propagation framework followed by several multi-level skip feature fusion blocks. By decoupling the single-task prediction network into two joint tasks of semantic segmentation and geometry embedding learning, together with the proposed information propagation and feature fusion architecture, our method is shown to perform favorably against state-of-the-art methods for semantic segmentation on publicly available challenging indoor datasets.
[fusion, propagation, joint, dataset, prediction, framework, predict, predicting] [depth, rgb, rgbd, geometry, scene, indoor, well, spf, single, approach, corresponding, vision, wij] [proposed, image, figure, method, input, pixel, appearance, result, comparison] [network, performance, table, deep, convolutional, conv, block, better, skip, effectiveness, compared, structure, convolution, neural, batchnorm, output, original, designed] [model, encoder] [semantic, segmentation, feature, backbone, final, map, branch, object, pyramid, improve, propose, supervision, cnn, guidance, miou, challenging, leverage, fully, affinity] [embedding, learning, embeddings, learned, sun, data, training, distilled, loss, function, feat, datasets]
@InProceedings{Jiao_2019_CVPR,
  author = {Jiao, Jianbo and Wei, Yunchao and Jie, Zequn and Shi, Honghui and Lau, Rynson W.H. and Huang, Thomas S.},
  title = {Geometry-Aware Distillation for Indoor Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
LiveSketch: Query Perturbations for Guided Sketch-Based Visual Search
John Collomosse, Tu Bui, Hailin Jin


LiveSketch is a novel algorithm for searching large image collections using hand-sketched queries. LiveSketch tackles the inherent ambiguity of sketch search by creating visual suggestions that augment the query as it is drawn, making query specification an iterative rather than one-shot process that helps disambiguate users' search intent. Our technical contributions are: a triplet convnet architecture that incorporates an RNN based variational autoencoder to search for images using vector (stroke-based) queries; real-time clustering to identify likely search intents (and so, targets within the search embedding); and the use of backpropagation from those targets to perturb the input stroke sequence, so suggesting alterations to the query in order to guide the search. We show improvements in accuracy and time-to-task over contemporary baselines using a 67M image corpus.
[rnn, sequence, dataset, work, time, joint, rasterized] [linear, form, approach, yield, shape] [image, user, interpolation, based, method, stroke, intermediate, content, proposed, figure, input, guide] [search, network, deep, structure, performance, backpropagation, neural, order, architecture, inspired] [query, visual, vector, raster, sketched, livesketch, adversarial, encoded, system, relevance, encoder, model, sketchrnn, sampled, intent, evaluate, identify, perturbation, encoding, bui, disambiguate, enables, common] [object, cnn, interactive, branch, level, map, three] [sketch, embedding, retrieval, triplet, sbir, loss, representation, training, set, large, classification, target, class, learning, learned, clustering, differs, auxiliary, cluster, embeddings]
@InProceedings{Collomosse_2019_CVPR,
  author = {Collomosse, John and Bui, Tu and Jin, Hailin},
  title = {LiveSketch: Query Perturbations for Guided Sketch-Based Visual Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Bounding Box Regression With Uncertainty for Accurate Object Detection
Yihui He, Chenchen Zhu, Jianren Wang, Marios Savvides, Xiangyu Zhang


Large-scale object detection datasets (e.g., MS-COCO) try to define the ground truth bounding boxes as clearly as possible. However, we observe that ambiguities are still introduced when labeling the bounding boxes. In this paper, we propose a novel bounding box regression loss for learning bounding box transformation and localization variance together. Our loss greatly improves the localization accuracy of various architectures with nearly no additional computation. The learned localization variance allows us to merge neighboring bounding boxes during non-maximum suppression (NMS), which further improves the localization performance. On MS-COCO, we boost the Average Precision (AP) of VGG-16 Faster R-CNN from 23.6% to 29.1%. More importantly, for ResNet-50-FPN Mask R-CNN, our method improves the AP and AP90 by 1.8% and 6.2% respectively, which significantly outperforms previous state-of-the-art bounding box refinement methods. Our code and models are available at github.com/yihui-he/KL-Loss
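A minimal PyTorch sketch of a regression loss of the kind described, where the network predicts a log-variance alongside each box offset so that ambiguous boxes incur a smaller penalty (our reading of the abstract, not the released KL-Loss code; tensor shapes and the toy usage are assumptions):

import torch

def kl_box_loss(pred_offset, pred_log_var, gt_offset):
    """pred_offset, pred_log_var, gt_offset: (N, 4) box-regression tensors."""
    diff = torch.abs(gt_offset - pred_offset)
    smooth_l1 = torch.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    # exp(-alpha) * smooth_l1 + alpha / 2, with alpha = log(sigma^2) predicted per coordinate
    return (torch.exp(-pred_log_var) * smooth_l1 + 0.5 * pred_log_var).mean()

# toy usage: offsets and log-variances would normally come from the box head
pred = torch.zeros(8, 4, requires_grad=True)
log_var = torch.zeros(8, 4, requires_grad=True)
gt = torch.randn(8, 4)
kl_box_loss(pred, log_var, gt).backward()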
[] [computer, vision, accurate, confidence, pattern, approach, international, coordinate, single, estimated, deviation] [conference, figure, ieee, based, method] [network, table, variance, standard, fast, lower, better, neural, convolutional, xiangyu, deep, higher, gaussian, gradient] [arxiv, preprint, model, improved, introduced, candidate] [bounding, object, box, detection, var, localization, voting, faster, baseline, improves, head, regression, mask, lreg, ross, improve, location, person, yihui, predicted, jian, kaiming, neighboring, pascal, propose, score, feature, iou, voc, piotr, doll, boundary, map] [loss, learning, learn, classification, learned, training, uncertainty, train, selected, distribution, set, large, function]
@InProceedings{He_2019_CVPR,
  author = {He, Yihui and Zhu, Chenchen and Wang, Jianren and Savvides, Marios and Zhang, Xiangyu},
  title = {Bounding Box Regression With Uncertainty for Accurate Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
OCGAN: One-Class Novelty Detection Using GANs With Constrained Latent Representations
Pramuditha Perera, Ramesh Nallapati, Bing Xiang


We present a novel model called OCGAN for the classical problem of one-class novelty detection, where, given a set of examples from a particular class, the goal is to determine if a query example is from the same class. Our solution is based on learning latent representations of in-class examples using a de-noising auto-encoder network. The key contribution of our work is our proposal to explicitly constrain the latent space to exclusively represent the given class. In order to accomplish this goal, firstly, we force the latent space to have bounded support by introducing a tanh activation in the encoder's output layer. Secondly, using a discriminator in the latent space that is trained adversarially, we ensure that encoded representations of in-class examples resemble uniform random samples drawn from the same bounded space. Thirdly, using a second adversarial discriminator in the input space, we ensure all randomly drawn latent samples generate examples that look real. Finally, we introduce a gradient-descent based sampling technique that explores points in the latent space that generate potential out-of-class examples, which are fed back to the network to further train it to generate in-class examples from those points. The effectiveness of the proposed method is measured across four publicly available datasets using two one-class novelty detection protocols where we achieve state-of-the-art results.
[dataset, work, anomaly, recognition, auc, explicitly] [reconstruction, error, pattern, well, chosen, computer, problem, vision] [latent, image, proposed, figure, method, based, conference, generator, ieee, denoising, input, produce, high, generative, drawn, real] [network, performance, deep, order, table, output, neural, processing] [discriminator, random, adversarial, generated, fake, visual, represent, gan, model, ensure, example, machine, path, generate, expected, consider] [detection, propose, object, visualization] [novelty, space, digit, class, trained, classifier, learning, training, ocgan, protocol, loss, mnist, distribution, learned, mining, lvisual, support, representation, data, train, negative, learn, llatent, set, datasets, strategy, fmnist]
@InProceedings{Perera_2019_CVPR,
  author = {Perera, Pramuditha and Nallapati, Ramesh and Xiang, Bing},
  title = {OCGAN: One-Class Novelty Detection Using GANs With Constrained Latent Representations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Metrics From Teachers: Compact Networks for Image Embedding
Lu Yu, Vacit Oguz Yazici, Xialei Liu, Joost van de Weijer, Yongmei Cheng, Arnau Ramisa


Metric learning networks are used to compute image embeddings, which are widely used in many applications such as image retrieval and face recognition. In this paper, we propose to use network distillation to efficiently compute image embeddings with small networks. Network distillation has been successfully applied to improve image classification, but has hardly been explored for metric learning. To do so, we propose two new loss functions that model the communication of a deep teacher network to a small student network. We evaluate our system in several datasets, including CUB-200-2011, Cars-196, Stanford Online Products and show that embeddings computed using small student networks perform significantly better than those computed using standard networks of similar size. Results on a very compact network (MobileNet-0.25), which can be used on mobile devices, show that the proposed method can greatly improve Recall@1 results from 27.5% to 44.6%. Furthermore, we investigate various aspects of distillation for embeddings, including hint and attention layers, semi-supervised learning and cross quality distillation. (Code is available at https://github.com/yulu0724/EmbeddingDistillation).
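As a hedged sketch of what such distillation losses for embeddings can look like (our own illustration under assumed shapes, not the paper's exact formulation), one can penalise either the student's embedding coordinates directly or only its pairwise-distance structure within a batch:

import torch
import torch.nn.functional as F

def absolute_teacher_loss(student_emb, teacher_emb):
    """L2 between L2-normalised student and teacher embeddings, each of shape (B, D)."""
    return F.mse_loss(F.normalize(student_emb, dim=1), F.normalize(teacher_emb, dim=1))

def relative_teacher_loss(student_emb, teacher_emb):
    """Match the teacher's pairwise-distance structure instead of its coordinates."""
    ds = torch.cdist(student_emb, student_emb)
    dt = torch.cdist(teacher_emb, teacher_emb)
    return F.mse_loss(ds, dt)

The relative variant lets a small student with a different embedding dimensionality still inherit the teacher's metric structure.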
[online, dataset, work, second] [relative, absolute, computer, vision, international, pattern, additional, compute, problem] [image, conference, proposed, based, method, quality, ieee, figure] [network, deep, performance, neural, table, small, conv, number, better, efficient, processing, output, layer, equal, gain, apply, compared, siamese] [attention, consider, sum, evaluate, relevant, introduced, access] [feature, improve, propose, object] [teacher, student, distillation, learning, embedding, knowledge, loss, metric, training, data, hint, embeddings, trained, large, triplet, unlabeled, negative, distance, stanford, space, train, retrieval, cross, set, learn, positive, objective, classification, labeled, class, transfer, learned, mining]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Lu and Oguz Yazici, Vacit and Liu, Xialei and van de Weijer, Joost and Cheng, Yongmei and Ramisa, Arnau},
  title = {Learning Metrics From Teachers: Compact Networks for Image Embedding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Activity Driven Weakly Supervised Object Detection
Zhenheng Yang, Dhruv Mahajan, Deepti Ghadiyaram, Ram Nevatia, Vignesh Ramanathan


Weakly supervised object detection aims at reducing the amount of supervision required to train detection models. Such models are traditionally learned from images/videos labelled only with the object class and not the object bounding box. In our work, we try to leverage not only the object class labels but also the action labels associated with the data. We show that the action depicted in the image/video can provide strong cues about the location of the associated object. We learn a spatial prior for the object dependent on the action (e.g. "ball" is closer to "leg of the person" in "kicking ball"), and incorporate this prior to simultaneously train a joint object detection and action classification model. We conducted experiments on both video datasets and image datasets to evaluate the performance of our weakly supervised object detection model. Our approach outperformed the current state-of-the-art (SOTA) method by more than 6% in mAP on the Charades video dataset.
[action, human, video, recognition, temporal, interaction, hoi, multiple, dataset, modeling, jointly, framework] [computer, vision, pattern, keypoint, normal, approach, international, note, pose, provide] [prior, ieee, conference, method, image, proposed, comparison, appearance, presented, figure] [performance, network, represents, deep, mentioned, layer, weighted] [model, strong, probability, arxiv, preprint] [object, detection, bounding, weakly, spatial, box, person, location, proposal, pcl, supervision, map, three, region, anchor, center, instance, ross] [classification, supervised, training, class, loss, learning, test, train, set, trained, learned, learn, distribution, large, label, datasets, main]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Zhenheng and Mahajan, Dhruv and Ghadiyaram, Deepti and Nevatia, Ram and Ramanathan, Vignesh},
  title = {Activity Driven Weakly Supervised Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Separate to Adapt: Open Set Domain Adaptation via Progressive Separation
Hong Liu, Zhangjie Cao, Mingsheng Long, Jianmin Wang, Qiang Yang


Domain adaptation has become a resounding success in leveraging labeled data from a source domain to learn an accurate classifier for an unlabeled target domain. When deployed in the wild, the target domain usually contains unknown classes that are not observed in the source domain. Such a setting is termed Open Set Domain Adaptation (OSDA). While several methods have been proposed to address OSDA, none of them takes into account the openness of the target domain, which is measured by the proportion of unknown classes among all target classes. Openness is a critical point in open set domain adaptation and exerts a significant impact on performance. In addition, current work aligns the entire target domain with the source domain without excluding unknown samples, which may give rise to negative transfer due to the mismatch between unknown and known classes. To this end, this paper presents Separate to Adapt (STA), an end-to-end approach to open set domain adaptation. The approach adopts a coarse-to-fine weighting mechanism to progressively separate the samples of unknown and known classes, and simultaneously weigh their importance in feature distribution alignment. Our approach allows openness-robust open set domain adaptation, which can be adaptive to a variety of openness in the target domain. We evaluate STA on several benchmark datasets of various openness levels. Results verify that STA significantly outperforms previous methods.
[recognition, previous, work, consists, dataset, perform] [computer, vision, pattern, approach, international, june] [conference, figure, separate, separation, ieee, method, proposed, real] [binary, deep, performance, network, accuracy, neural, table, resnet, processing, progressive] [adversarial, probability, machine, model, step, discriminator] [feature, threshold, three] [domain, target, unknown, set, source, open, adaptation, sta, classifier, data, openness, learning, negative, transfer, training, sample, shared, large, class, train, discrepancy, osbp, adapt, label, similarity, distribution, space, reject, dann, setting, gap, entropy, loss, unsupervised, closed, aligning, extractor, trained, classification, mnist, observe, mingsheng, jianmin, china]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Hong and Cao, Zhangjie and Long, Mingsheng and Wang, Jianmin and Yang, Qiang},
  title = {Separate to Adapt: Open Set Domain Adaptation via Progressive Separation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Layout-Graph Reasoning for Fashion Landmark Detection
Weijiang Yu, Xiaodan Liang, Ke Gong, Chenhan Jiang, Nong Xiao, Liang Lin


Detecting dense landmarks for diverse clothes, as a fundamental technique for clothes analysis, has attracted increasing research attention due to its huge application potential. However, due to the lack of modeling underlying semantic layout constraints among landmarks, prior works often detect ambiguous and structure-inconsistent landmarks of multiple overlapped clothes in one person. In this paper, we propose to seamlessly enforce structural layout relationships among landmarks on the intermediate representations via multiple stacked layout-graph reasoning layers. We define the layout-graph as a hierarchical structure including a root node, body-part nodes (e.g. upper body, lower body), coarse clothes-part nodes (e.g. collar, sleeve) and leaf landmark nodes (e.g. left-collar, right-collar). Each Layout-Graph Reasoning(LGR) layer aims to map feature representations into structural graph nodes via a Map-to-Node module, performs reasoning over structural graph nodes to achieve global layout coherency via a layout-graph reasoning module, and then maps graph nodes back to enhance feature representations via a Node-to-Map module. The layout-graph reasoning module integrates a graph clustering operation to generate representations of intermediate nodes (bottom-up inference) and then a graph deconvolution operation (top-down inference) over the whole graph. Extensive experiments on two public fashion landmark datasets demonstrate the superiority of our model. Furthermore, to advance the fine-grained fashion landmark research for supporting more comprehensive clothes generation and attribute recognition, we contribute the first Fine-grained Fashion Landmark Dataset (FFLD) containing 200k images annotated with at most 32 key-points for 13 clothes types.
[graph, lgr, leaf, human, xleaf, fld, ffld, structural, dataset, pyranet, perform, internal, aleaf, multiple, rnleaf] [left, pose, define, matrix, normal, estimation] [landmark, deconvolution, intermediate, deepfashion, image, figure, proposed] [layer, convolutional, operation, stacked, deep, structure, convolution, compared, performance, better, root, achieve, network, lower, neural, weight, grammar] [reasoning, node, model, external, generate, rich] [clothes, module, feature, detection, hierarchical, layout, clothing, semantic, global, map, enhance, sleeve, collar, xmiddle, annotated, evolved, bottom, propose, three, bounding] [fashion, clustering, learning, knowledge, set, adjacency, upper, datasets, training, function]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Weijiang and Liang, Xiaodan and Gong, Ke and Jiang, Chenhan and Xiao, Nong and Lin, Liang},
  title = {Layout-Graph Reasoning for Fashion Landmark Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DistillHash: Unsupervised Deep Hashing by Distilling Data Pairs
Erkun Yang, Tongliang Liu, Cheng Deng, Wei Liu, Dacheng Tao


Due to storage and search efficiency, hashing has become significantly prevalent for nearest neighbor search. In particular, deep hashing methods have greatly improved the search performance, typically under supervised scenarios. In contrast, unsupervised deep hashing models can hardly achieve satisfactory performance due to the lack of supervisory similarity signals. To address this problem, in this paper, we propose a new deep unsupervised hashing model, called DistillHash, which can learn a distilled data set, where data pairs have confident similarity signals. Specifically, we investigate the relationship between the initial but noisy similarity signals learned from local structures and the semantic similarity labels assigned by the optimal Bayesian classifier. We show that, under a mild assumption, some data pairs, whose labels are consistent with those assigned by the optimal Bayesian classifier, can be potentially distilled. With this understanding, we design a simple but effective method to distill data pairs automatically and further adopt a Bayesian learning framework to learn hashing functions from the distilled data set. Extensive experimental results on three widely used benchmark datasets demonstrate that our method achieves state-of-the-art search performance.
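The pair-distillation step can be pictured with a small sketch. The snippet below is only one plausible reading of that idea (the cosine-similarity signal, the thresholds and the function name are our assumptions, not the paper's procedure): keep only pairs whose initial similarity signal is confidently high or low.

import numpy as np

def distill_pairs(features, pos_thresh=0.9, neg_thresh=0.1):
    """features: (N, D) array. Returns (i, j, label) triples for confident pairs only."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                                   # noisy similarity signal from local structure
    rows, cols = np.triu_indices(len(f), k=1)       # unique unordered pairs
    s = sim[rows, cols]
    pos = [(rows[i], cols[i], 1) for i in np.where(s >= pos_thresh)[0]]
    neg = [(rows[i], cols[i], 0) for i in np.where(s <= neg_thresh)[0]]
    return pos + neg

Ambiguous pairs in the middle band are simply dropped, which is the sense in which the training set is "distilled" before hash functions are learned from it.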
[framework, graph] [local, optimal, initial, discrete, algorithm] [image, ieee, cheng, method, conditional, figure, based, proposed] [deep, precision, binary, performance, neural, rate, wei, number, achieve, bayesian, top, effective, network] [probability, relationship] [map, semantic, three, adopt, recall, assigned] [data, hashing, hash, learning, distillhash, distilled, unsupervised, similarity, label, sij, noisy, training, dacheng, set, hamming, supervised, code, sgh, ssdh, sph, deepbit, itq, pcah, dsh, lsh, flip, retrieval, pair, learn, tongliang, bayes, datasets, distance, select, erkun, learned, loss, pairwise, classification, upper, nuswide, ranked, xianglong]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Erkun and Liu, Tongliang and Deng, Cheng and Liu, Wei and Tao, Dacheng},
  title = {DistillHash: Unsupervised Deep Hashing by Distilling Data Pairs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Mind Your Neighbours: Image Annotation With Metadata Neighbourhood Graph Co-Attention Networks
Junjie Zhang, Qi Wu, Jian Zhang, Chunhua Shen, Jianfeng Lu


As the visual reflections of our daily lives, images are frequently shared on the social network, which generates the abundant 'metadata' that records user interactions with images. Due to the diverse contents and complex styles, some images can be challenging to recognise when neglecting the context. Images with the similar metadata, such as 'relevant topics and textual descriptions', 'common friends of users' and 'nearby locations', form a neighbourhood for each image, which can be used to assist the annotation. In this paper, we propose a Metadata Neighbourhood Graph Co-Attention Network (MangoNet) to model the correlations between each target image and its neighbours. To accurately capture the visual clues from the neighbourhood, a co-attention mechanism is introduced to embed the target image and its neighbours as graph nodes, while the graph edges capture the node pair correlations. By reasoning on the neighbourhood graph, we obtain the graph representation to help annotate the target image. Experimental results on three benchmark datasets indicate that our proposed model achieves the best performance compared to the state-of-the-art methods.
[graph, capture, gcn, dataset, social, framework, extract] [associated, confidence, corresponding] [image, proposed, method, user, based, figure] [network, convolutional, weighted, size, compared, layer, neural, achieves, deep, table, search] [metadata, model, neighbourhood, attention, visual, mechanism, node, mangonet, neighbour, assist, insatten, ncnn, mapc, mapo, introduced, sum, generate, indicates, referring, textual, reasoning, generated, common, measuring, embedded, represent, vector, mirflickr] [annotation, feature, semantic, module, cnn, propose, three, global, ablation, indicate, locate, adopt, backbone, spatial, voting] [target, label, representation, set, learning, training, train, test, classification, knn, trained]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Junjie and Wu, Qi and Zhang, Jian and Shen, Chunhua and Lu, Jianfeng},
  title = {Mind Your Neighbours: Image Annotation With Metadata Neighbourhood Graph Co-Attention Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Region Proposal by Guided Anchoring
Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, Dahua Lin


Region anchors are the cornerstone of modern object detection techniques. State-of-the-art detectors mostly rely on a dense anchoring scheme, where anchors are sampled uniformly over the spatial domain with a predefined set of scales and aspect ratios. In this paper, we revisit this foundational stage. Our study shows that it can be done much more effectively and efficiently. Specifically, we present an alternative scheme, named Guided Anchoring, which leverages semantic features to guide the anchoring. The proposed method jointly predicts the locations where the center of objects of interest are likely to exist as well as the scales and aspect ratios at different locations. On top of predicted anchor shapes, we mitigate the feature inconsistency with a feature adaption module. We also study the use of high-quality proposals to improve detection performance. The anchoring scheme can be seamlessly integrated into proposal methods and detectors. With Guided Anchoring, we achieve 9.1% higher recall on MS COCO with 90% fewer anchors than the RPN baseline. We also adopt Guided Anchoring in Fast R-CNN, Faster R-CNN and RetinaNet, respectively improving the detection mAP by 2.2%, 2.7% and 1.2%. Code is available at https://github.com/open-mmlab/mmdetection.
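A hedged PyTorch sketch of the two prediction branches described above (a simplification of ours, not the open-mmlab implementation; the stride and sigma constants and the exponential shape mapping are illustrative assumptions, and the feature-adaption step is omitted):

import torch
import torch.nn as nn

class GuidedAnchorHead(nn.Module):
    def __init__(self, in_channels=256):
        super().__init__()
        self.loc_conv = nn.Conv2d(in_channels, 1, kernel_size=1)    # probability of an object centre
        self.shape_conv = nn.Conv2d(in_channels, 2, kernel_size=1)  # (dw, dh) per location

    def forward(self, feat, stride=16, sigma=8.0):
        loc_prob = torch.sigmoid(self.loc_conv(feat))                # (B, 1, H, W)
        dwdh = self.shape_conv(feat)
        # map unbounded predictions to anchor sizes relative to the feature stride
        anchor_wh = stride * sigma * torch.exp(dwdh)                 # (B, 2, H, W)
        return loc_prob, anchor_wh

feat = torch.randn(1, 256, 32, 32)
loc, wh = GuidedAnchorHead()(feat)

Anchors would then be placed only where loc exceeds a threshold, with the predicted width/height at each kept location, instead of tiling a dense predefined anchor grid.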
[prediction, window, predict, predefined, multiple] [shape, computer, vision, pattern, corresponding, ground, international, dense, truth] [conference, figure, ieee, method, proposed, based, image, study, high] [scheme, table, scale, higher, convolutional, fewer, performance, design, number, network, convolution, layer] [generation, adaption, probability, model] [anchor, feature, map, object, anchoring, rpn, location, guided, proposal, detection, region, aspect, iou, recall, sliding, bounding, predicted, module, center, faster, branch, adopt, box, refine, threshold, kaiming, improve, fully, ross] [loss, set, large, learning, distribution, training, alignment, conventional, classification, trained]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Jiaqi and Chen, Kai and Yang, Shuo and Change Loy, Chen and Lin, Dahua},
  title = {Region Proposal by Guided Anchoring},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Distant Supervised Centroid Shift: A Simple and Efficient Approach to Visual Domain Adaptation
Jian Liang, Ran He, Zhenan Sun, Tieniu Tan


Conventional domain adaptation methods usually resort to deep neural networks or subspace learning to find invariant representations across domains. However, most deep learning methods highly rely on large-size source domains and are computationally expensive to train, while subspace learning methods always have a quadratic time complexity that suffers from the large domain size. This paper provides a simple and efficient solution, which could be regarded as a well-performing baseline for domain adaptation tasks. Our method is built upon the nearest centroid classifier, seeking a subspace where the centroids in the target domain are moderately shifted from those in the source domain. Specifically, we design a unified objective without accessing the source domain data and adopt an alternating minimization scheme to iteratively discover the pseudo target labels, invariant subspace, and target centroids. Besides its privacy-preserving property (distant supervision), the algorithm is provably convergent and has a promising linear time complexity. In addition, the proposed method can be readily extended to multi-source setting and domain generalization, and it remarkably enhances popular deep adaptation methods by borrowing the learned transferable features. Extensive experiments on several benchmarks including object, digit, and face recognition datasets validate that our methods yield state-of-the-art results in various domain adaptation tasks.
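The alternating scheme described above can be sketched in a few lines of numpy. This is a deliberately simplified illustration under our own assumptions (no learned subspace or weighting; features are used as-is, every source class is assumed non-empty, and the shift factor is arbitrary), not the paper's full objective:

import numpy as np

def centroid_shift(Xs, ys, Xt, num_classes, iters=10):
    """Xs: (Ns, D) source features with labels ys; Xt: (Nt, D) unlabeled target features."""
    centroids = np.stack([Xs[ys == c].mean(axis=0) for c in range(num_classes)])
    pseudo = np.zeros(len(Xt), dtype=int)
    for _ in range(iters):
        d = ((Xt[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)   # (Nt, C) squared distances
        pseudo = d.argmin(axis=1)                                      # pseudo target labels
        for c in range(num_classes):
            if np.any(pseudo == c):
                # moderately shift the centroid toward the mean of its target cluster
                centroids[c] = 0.5 * centroids[c] + 0.5 * Xt[pseudo == c].mean(axis=0)
    return pseudo, centroids

Note that only the source centroids, not the source data, are needed at adaptation time, which is the privacy-preserving "distant supervision" property the abstract highlights.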
[dataset, recognition, previous, joint, follow, performs] [problem, optimal, pattern, algorithm, optimization, note, corresponding, approach] [method, ieee, figure, transformation, proposed, image, based] [deep, accuracy, table, best, shallow, popular, better, performance, achieve, size, adaptive, neural] [visual, adversarial, dice, making, simple, evaluation, model] [feature, object, average, baseline, including] [domain, source, adaptation, target, unsupervised, data, learning, class, objective, subspace, transfer, mcsc, pseudo, datasets, uda, mcsd, scatter, distribution, classifier, min, stp, centroid, generalization, dan, revgrad, invariant, labeled, classification, function, gta, jgsa, large, transferable, training, exploit, loss, jda, setup]
@InProceedings{Liang_2019_CVPR,
  author = {Liang, Jian and He, Ran and Sun, Zhenan and Tan, Tieniu},
  title = {Distant Supervised Centroid Shift: A Simple and Efficient Approach to Visual Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Transfer Examples for Partial Domain Adaptation
Zhangjie Cao, Kaichao You, Mingsheng Long, Jianmin Wang, Qiang Yang


Domain adaptation is critical for learning in new and unseen environments. With domain adversarial training, deep networks can learn disentangled and transferable features that effectively diminish the dataset shift between the source and target domains for knowledge transfer. In the era of Big Data, large-scale labeled datasets are readily available, stimulating the interest in partial domain adaptation (PDA), which transfers a recognizer from a large labeled domain to a small unlabeled domain. It extends standard domain adaptation to the scenario where target labels are only a subset of source labels. Under the condition that target labels are unknown, the key challenges of PDA are how to transfer relevant examples in the shared classes to promote positive transfer and how to ignore irrelevant ones in the source domain to mitigate negative transfer. In this work, we propose a unified approach to PDA, Example Transfer Network (ETN), which jointly learns domain-invariant representations across domains and a progressive weighting scheme to quantify the transferability of source examples. A thorough evaluation on several benchmark datasets shows that ETN consistently achieves state-of-the-art results for various partial domain adaptation tasks.
[dataset, recognition, key, work, performs, framework, previous] [outlier, computer, vision, pattern, international, technical, problem] [conference, ieee, figure, based] [deep, network, neural, san, performance, processing, standard, better, reduce, resnet, shift] [partial, transferability, discriminator, example, adversarial, machine, irrelevant, probability, model, relevant] [feature, distinguish] [domain, source, target, adaptation, transfer, etn, learning, label, auxiliary, shared, classifier, negative, space, iwan, weighting, discriminative, trained, data, set, pda, open, dann, transferable, class, predictor, loss, labeled, quantify, training, large, log, classification, mingsheng, jianmin, positive, distribution, unknown, learned, minimax, china, big, knowledge]
@InProceedings{Cao_2019_CVPR,
  author = {Cao, Zhangjie and You, Kaichao and Long, Mingsheng and Wang, Jianmin and Yang, Qiang},
  title = {Learning to Transfer Examples for Partial Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Generalized Zero-Shot Recognition Based on Visually Semantic Embedding
Pengkai Zhu, Hanxiao Wang, Venkatesh Saligrama


We propose a novel Generalized Zero-Shot learning (GZSL) method that is agnostic to both unseen images and unseen semantic vectors during training. Prior works in this context propose to map high-dimensional visual features to the semantic domain, which we believe contributes to the semantic gap. To bridge the gap, we propose a novel low-dimensional embedding of visual instances that is "visually semantic." Analogous to semantic data that quantifies the existence of an attribute in the presented instance, components of our visual embedding quantify the existence of a prototypical part-type in the presented instance. In parallel, as a thought experiment, we quantify the impact of noisy semantic data by utilizing a novel visual oracle to visually supervise a learner. These factors, namely semantic noise, the visual-semantic gap and label noise, lead us to propose a new graphical model for inference with pairwise interactions between label, semantic data, and inputs. We tabulate results on a number of benchmark datasets demonstrating significant improvement in accuracy over the state of the art under both semantic and visual supervision.
[recognition, work, dataset] [computer, pattern, vision, approach, note, june] [latent, input, proposed, ieee, conference, visually, image, based, attribute, mapping, component, method, competing, presented] [accuracy, table, structured, performance, number, neural] [visual, model, potential, existence, evaluation, provided, common, probability, vector, type] [semantic, feature, supervision, propose, graphical, improvement, map, benchmark] [gzsl, embedding, unseen, learning, oracle, mixture, class, training, representation, learner, existing, loss, novel, prototypical, zsl, cub, data, gap, learn, test, similarity, list, space, knowledge, observe, generalized, label, datasets, large, train, set, lprt, apy, quantify, noisy]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Pengkai and Wang, Hanxiao and Saligrama, Venkatesh},
  title = {Generalized Zero-Shot Recognition Based on Visually Semantic Embedding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Towards Visual Feature Translation
Jie Hu, Rongrong Ji, Hong Liu, Shengchuan Zhang, Cheng Deng, Qi Tian


Most existing visual search systems are deployed with fixed kinds of visual features, which prevents feature reuse across different systems or when upgrading a system with a new type of feature. Such a setting is obviously inflexible and time/memory-consuming, which could be remedied if visual features could be "translated" across systems. In this paper, we make the first attempt towards visual feature translation to break through the barrier of using features across different visual search systems. To this end, we propose a Hybrid Auto-Encoder (HAE) to translate visual features, which learns a mapping by minimizing the translation and reconstruction errors. Based upon HAE, an Undirected Affinity Measurement (UAM) is further designed to quantify the affinity among different types of visual features. Extensive experiments have been conducted on several public datasets with sixteen different types of widely used features in visual search systems. Quantitative results show the encouraging possibilities of feature translation. For the first time, the affinity among widely used features like SIFT and DELF is reported.
[directed, dataset, vtt, consists, work] [local, reconstruction, matrix, measurement, sift, corresponding, error, handcrafted, column, algorithm, normalized] [translation, image, translated, extracted, based, vst, high, translate, proposed, latent, translating, row, difference, hybrid, quantitative, translator, figure] [deep, search, table, andrew] [visual, query, decoder, encoder, spanning, encoding, find, visualize] [feature, affinity, hae, undirected, stage, three, mst, object, map, relation, convertibility] [target, retrieval, source, learning, transfer, domain, set, minimum, training, quantify, datasets, space, shared]
@InProceedings{Hu_2019_CVPR,
  author = {Hu, Jie and Ji, Rongrong and Liu, Hong and Zhang, Shengchuan and Deng, Cheng and Tian, Qi},
  title = {Towards Visual Feature Translation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Amodal Instance Segmentation With KINS Dataset
Lu Qi, Li Jiang, Shu Liu, Xiaoyong Shen, Jiaya Jia


Amodal instance segmentation, a new direction of instance segmentation, aims to segment each object instance including its invisible, occluded parts, imitating the human ability to do so. This task requires reasoning about objects' complex structure. Despite being important and futuristic, this task lacks large-scale data with detailed annotations, due to the difficulty of correctly and consistently labeling invisible parts, which creates a huge barrier to exploring the frontier of visual recognition. In this paper, we augment KITTI with more instance pixel-level annotation for 8 categories, which we call the KITTI INStance dataset (KINS). We propose a network structure to reason about invisible parts via a new multi-task framework with Multi-View Coding (MVC), which combines information at various recognition levels. Extensive experiments show that our MVC effectively improves both amodal and inmodal segmentation. The KINS dataset and our proposed method will be made publicly available.
[dataset, prediction, flow, consists, recognition, framework] [occlusion, occluded, kitti, relative, invisible, scene, shape, visible, autonomous, vision, general, corresponding, local] [figure, image, consistency, proposed, jiaya] [network, order, table, coding, number, convolution, conv, structure, max, layer, performance, small] [visual, perception, model, ability, common, evaluate] [amodal, mask, instance, segmentation, branch, inmodal, annotation, detection, semantic, box, object, global, feature, three, bounding, mlc, annotated, final, extraction, coco, seg, average, including, level, overlapping, kaiming, piotr, ross, segment, improves, panet, category] [classification, data, datasets, specific, learning, large, independent]
@InProceedings{Qi_2019_CVPR,
  author = {Qi, Lu and Jiang, Li and Liu, Shu and Shen, Xiaoyong and Jia, Jiaya},
  title = {Amodal Instance Segmentation With KINS Dataset},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Global Second-Order Pooling Convolutional Networks
Zilin Gao, Jiangtao Xie, Qilong Wang, Peihua Li


Deep Convolutional Networks (ConvNets) are fundamental to large-scale visual recognition and many other vision tasks. As the primary goal of ConvNets is to characterize the complex boundaries of thousands of classes in a high-dimensional space, it is critical to learn higher-order representations for enhancing non-linear modeling capability. Recently, Global Second-order Pooling (GSoP), plugged in at the end of networks, has attracted increasing attention, achieving much better performance than classical, first-order networks in a variety of vision tasks. However, how to effectively introduce higher-order representations in earlier layers for improving the non-linear capability of ConvNets is still an open problem. In this paper, we propose a novel network model introducing GSoP from lower to higher layers for exploiting holistic image information throughout a network. Given an input 3D tensor outputted by some previous convolutional layer, we perform GSoP to obtain a covariance matrix which, after nonlinear transformation, is used for tensor scaling along the channel dimension. Similarly, we can perform GSoP along the spatial dimension for tensor scaling as well. In this way, we can make full use of the second-order statistics of the holistic image throughout a network. The proposed networks are thoroughly evaluated on large-scale ImageNet-1K, and experiments show that they non-trivially outperform their counterparts while achieving state-of-the-art results.
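A hedged PyTorch sketch of a channel-wise second-order pooling block in the spirit described above (our simplified reading: channel reduction, a channel covariance matrix, and sigmoid scaling weights; the reduced width is an arbitrary choice and the authors' row-wise convolution is replaced by a plain linear layer for brevity):

import torch
import torch.nn as nn

class GSoPChannelBlock(nn.Module):
    def __init__(self, channels, reduced=32):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, 1)          # 1x1 channel reduction
        self.embed = nn.Linear(reduced * reduced, channels)    # covariance -> per-channel weights

    def forward(self, x):                        # x: (B, C, H, W)
        b = x.size(0)
        z = self.reduce(x).flatten(2)            # (B, c', H*W)
        z = z - z.mean(dim=2, keepdim=True)
        cov = z @ z.transpose(1, 2) / z.size(2)  # (B, c', c') channel covariance
        w = torch.sigmoid(self.embed(cov.flatten(1)))          # (B, C) scaling weights
        return x * w.view(b, -1, 1, 1)           # second-order-aware channel scaling

out = GSoPChannelBlock(64)(torch.randn(2, 64, 16, 16))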
[modeling, perform, capture, dependency, performs, producing] [matrix, error, single, quadratic, linear, limited] [image, proposed, comparison, figure, input, statistical, nonlinear, intermediate, presented] [gsop, network, block, covariance, pooling, deep, convolutional, tensor, residual, channel, table, convolution, number, performance, size, capability, earlier, plugged, weight, output, vanilla, better, layer, order, conv, neural, scaling, bilinear, secondorder, compared, small, insert, lower, conveniently, resnet] [introduce, model, visual] [global, holistic, spatial, feature, average, stage, pool, improvement, object, module] [representation, dimension, pairwise, learning, exploiting]
@InProceedings{Gao_2019_CVPR,
  author = {Gao, Zilin and Xie, Jiangtao and Wang, Qilong and Li, Peihua},
  title = {Global Second-Order Pooling Convolutional Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Weakly Supervised Complementary Parts Models for Fine-Grained Image Classification From the Bottom Up
Weifeng Ge, Xiangru Lin, Yizhou Yu


Given a training dataset composed of images and corresponding category labels, deep convolutional neural networks show a strong ability in mining discriminative parts for image classification. However, deep convolutional neural networks trained with image-level labels only tend to focus on the most discriminative parts while missing other object parts which could provide complementary information. In this paper, we approach this problem from a different perspective. We build complementary parts models in a weakly supervised manner to retrieve information suppressed by the dominant object parts detected by convolutional neural networks. Given image-level labels only, we first extract rough object instances by performing weakly supervised object detection and instance segmentation using Mask R-CNN and CRF-based segmentation. Then we estimate and search for the best parts model for each object instance under the principle of preserving as much diversity as possible. In the last stage, we build a bi-directional long short-term memory (LSTM) network to fuse and encode the partial information of these complementary parts into a comprehensive feature for image classification. Experimental results indicate that the proposed method not only achieves significant improvement over our baseline models, but also outperforms state-of-the-art algorithms by a large margin (6.7%, 2.8%, 5.2% respectively) on Stanford Dogs 120, Caltech-UCSD Birds 2011-200 and Caltech 256.
[lstm, multiple, recognition, build, second, previous, dataset] [computer, vision, pattern, single, corresponding, international, pipeline] [image, conference, ieee, proposed, method, tanh, input, based] [stacked, network, deep, convolutional, neural, accuracy, performance, number, search, activation, size, achieves, crf, original, order, layer] [model, probability, partial, iterative, generated, rich] [object, complementary, detection, weakly, mask, baseline, segmentation, instance, map, feature, sjft, fig, final, bounding, proposal, cnn, context, suppressed, category, level, fuse, cam, box, semantic] [classification, supervised, training, class, set, discriminative, learning, googlenet, loss, trained, large, softmax, data, experimental, stanford, label]
@InProceedings{Ge_2019_CVPR,
  author = {Ge, Weifeng and Lin, Xiangru and Yu, Yizhou},
  title = {Weakly Supervised Complementary Parts Models for Fine-Grained Image Classification From the Bottom Up},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
NetTailor: Tuning the Architecture, Not Just the Weights
Pedro Morgado, Nuno Vasconcelos


Real-world applications of object recognition often require the solution of multiple tasks in a single platform. Under the standard paradigm of network fine-tuning, an entirely new CNN is learned per task, and the final network size is independent of task complexity. This is wasteful, since simple tasks require smaller networks than more complex tasks, and limits the number of tasks that can be solved simultaneously. To address these problems, we propose a transfer learning procedure, denoted NetTailor, in which layers of a pre-trained CNN are used as universal blocks that can be combined with small task-specific layers to generate new networks. Besides minimizing classification error, the new network is trained to mimic the internal activations of a strong unconstrained CNN, and minimize its complexity by the combination of 1) a soft-attention mechanism over blocks and 2) complexity regularization constraints. In this way, NetTailor can adapt the network architecture, not just its weights, to the target task. Experiments show that networks adapted to simple tasks, such as character or traffic sign recognition, become significantly smaller than those adapted to hard tasks, such as fine-grained recognition. More importantly, due to the modular nature of the procedure, this reduction in network complexity is achieved without compromise of either parameter sharing across tasks, or classification accuracy.
[recognition, dataset, multiple, second, work] [computer, vision, pattern, international, single] [conference, removed, input, figure, image, remove] [network, nettailor, complexity, neural, number, architecture, performance, small, deep, accuracy, block, simpler, order, processing, residual, pruning, layer, low, size, impact, output, table, standard, smaller, reduction, skip, inference, params] [model, arxiv, preprint, machine, simple] [cnn, object, voc, three, final, feature] [learning, target, task, student, source, transfer, classification, teacher, large, training, loss, set, proxy, minimize, adaptation, datasets, domain, svhn, universal, knowledge, learned, adapt, shared]
@InProceedings{Morgado_2019_CVPR,
  author = {Morgado, Pedro and Vasconcelos, Nuno},
  title = {NetTailor: Tuning the Architecture, Not Just the Weights},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning-Based Sampling for Natural Image Matting
Jingwei Tang, Yagiz Aksoy, Cengiz Oztireli, Markus Gross, Tunc Ozan Aydin


The goal of natural image matting is the estimation of opacities of a user-defined foreground object that is essential in creating realistic composite imagery. Natural matting is a challenging process due to the high number of unknowns in the mathematical modeling of the problem, namely the opacities as well as the foreground and background layer colors, while the original image serves as the single observation. In this paper, we propose the estimation of the layer colors through the use of deep neural networks prior to the opacity estimation. The layer color estimation is a better match for the capabilities of neural networks, and the availability of these colors substantially increase the performance of opacity estimation due to the reduced number of unknowns in the compositing equation. A prominent approach to matting in parallel to ours is called sampling-based matting, which involves gathering color samples from known-opacity regions to predict the layer colors. Our approach outperforms not only the previous hand-crafted sampling algorithms, but also current data-driven methods. We hence classify our method as a hybrid sampling- and learning-based approach to matting, and demonstrate the effectiveness of our approach through detailed ablation studies using alternative network architectures.
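The compositing equation the abstract refers to, I = alpha*F + (1 - alpha)*B, makes the benefit of estimating the layer colors first easy to see: with F and B in hand, the per-pixel opacity has a closed-form least-squares solution. The sketch below is our own illustration of that relation, not the paper's network:

import numpy as np

def alpha_from_layers(I, F, B, eps=1e-6):
    """I, F, B: (H, W, 3) float images. Returns the (H, W) alpha matte in [0, 1]."""
    num = ((I - B) * (F - B)).sum(axis=-1)       # projection of (I - B) onto (F - B)
    den = ((F - B) ** 2).sum(axis=-1) + eps      # squared colour distance between layers
    return np.clip(num / den, 0.0, 1.0)

With only the observation I available, the same equation has seven unknowns per pixel, which is why predicting F and B with a network first reduces the difficulty of the subsequent opacity estimation.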
[dataset, previous, work] [estimation, approach, estimate, estimating, additional, well, estimated, directly, corresponding, computer, defined, pattern, typically, problem, consistent, provide] [background, matting, image, alpha, color, input, matte, method, figure, compositing, trimap, inpainting, alphagan, opacity, pixel, comprehensive, ieee, composite, proposed, quality, study, adobe, mse, quantitative, qualitative] [network, deep, neural, layer, number, gradient, original, architecture, table, order, better, best, sparse, process, effectiveness, designed] [natural, observed, model, random, making] [foreground, ablation, predicted, final, spatial, region] [sampling, loss, unknown, training, augmentation, function, data, set, randomly, aim]
@InProceedings{Tang_2019_CVPR,
  author = {Tang, Jingwei and Aksoy, Yagiz and Oztireli, Cengiz and Gross, Markus and Ozan Aydin, Tunc},
  title = {Learning-Based Sampling for Natural Image Matting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Unsupervised Video Object Segmentation Through Visual Attention
Wenguan Wang, Hongmei Song, Shuyang Zhao, Jianbing Shen, Sanyuan Zhao, Steven C. H. Hoi, Haibin Ling


This paper conducts a systematic study on the role of visual attention in Unsupervised Video Object Segmentation (UVOS) tasks. By elaborately annotating three popular video segmentation datasets (DAVIS, Youtube-Objects and SegTrack V2) with dynamic eye-tracking data in the UVOS setting, for the first time, we quantitatively verified the high consistency of visual attention behavior among human observers, and found a strong correlation between human attention and explicit primary object judgements during dynamic, task-driven viewing. Such novel observations provide an in-depth insight into the underlying rationale behind UVOS. Inspired by these findings, we decouple UVOS into two sub-tasks: UVOS-driven Dynamic Visual Attention Prediction (DVAP) in the spatiotemporal domain, and Attention-Guided Object Segmentation (AGOS) in the spatial domain. Our UVOS solution enjoys three major merits: 1) modular training without expensive video segmentation annotations, using instead more affordable dynamic fixation data to train the initial video attention module and existing fixation-segmentation paired static/image data to train the subsequent segmentation module; 2) comprehensive foreground understanding through multi-source learning; and 3) additional interpretability from the biologically-inspired and assessable attention. Experiments on popular benchmarks show that, even without using expensive video object mask annotations, our model achieves compelling performance in comparison with the state of the art.
[video, uvos, human, dvap, dynamic, agos, fixation, jianbing, static, vsod, frame, segtrackv, spatiotemporal, expensive, dataset, convlstm, previous, prediction, motion] [explicit, corresponding, underlying, approach, denote, computer, vision, continuous] [image, ieee, quantitative, based, input, gaze, proposed, consistency, eye, paired, comprehensive, background, intermediate] [neural, deep, correlation, table, performance, convolutional, popular, network, binary, inspired, best] [attention, visual, model, primary, mechanism, evaluation, strong, natural] [object, segmentation, saliency, module, feature, wenguan, three, foreground, spatial, salient, map, detection, fully, annotation] [data, learning, training, test, datasets, train, unsupervised, novel, set, function, existing]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Wenguan and Song, Hongmei and Zhao, Shuyang and Shen, Jianbing and Zhao, Sanyuan and Hoi, Steven C. H. and Ling, Haibin},
  title = {Learning Unsupervised Video Object Segmentation Through Visual Attention},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks
Christopher Choy, JunYoung Gwak, Silvio Savarese


In many robotics and VR/AR applications, 3D-videos are readily-available input sources (a sequence of depth images, or LIDAR scans). However, in many cases, the 3D-videos are processed frame-by-frame either through 2D convnets or 3D perception algorithms. In this work, we propose 4-dimensional convolutional neural networks for spatio-temporal perception that can directly process such 3D-videos using high-dimensional convolutions. For this, we adopt sparse tensors and propose generalized sparse convolutions that encompass all discrete convolutions. To implement the generalized sparse convolution, we create an open-source auto-differentiation library for sparse tensors that provides extensive functions for high-dimensional convolutional neural networks. We create 4D spatio-temporal convolutional neural networks using the library and validate them on various 3D semantic segmentation benchmarks and proposed 4D datasets for 3D-video perception. To overcome challenges in 4D space, we propose the hybrid kernel, a special case of the generalized sparse convolution, and trilateral-stationary conditional random fields that enforce spatio-temporal consistency in the 7D space-time-chroma space. Experimentally, we show that a convolutional neural network with only generalized 3D sparse convolutions can outperform 2D or 2D-3D hybrid methods by a large margin. Also, we show that on 3D-videos, 4D spatio-temporal convolutional neural networks are robust to noise and outperform the 3D convolutional neural network.
[temporal, sequence, time, dataset, multiple, convert, recurrent] [computer, point, vision, algorithm, dense, scannet, define, voxel, pattern, directly, field, indoor, coordinate, supplementary, depth, continuous, well, defined] [input, conference, proposed, hybrid, conditional, figure, ieee, method, image] [sparse, conv, neural, convolution, network, kernel, convolutional, output, tensor, minkowski, number, library, batch, size, process, tesseract, standard, inference, table, performance, efficient, trilateral] [create, perception, random, arxiv, preprint, requires, memory] [semantic, segmentation, propose, spatial, miou, feature] [generalized, space, data, learning, synthia, datasets, representation, set, function, stanford]
@InProceedings{Choy_2019_CVPR,
  author = {Choy, Christopher and Gwak, JunYoung and Savarese, Silvio},
  title = {4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Pyramid Feature Attention Network for Saliency Detection
Ting Zhao, Xiangqian Wu


Saliency detection is one of the basic challenges in computer vision. Recently, CNNs have become the most widely used and powerful technique for saliency detection, in which feature maps from different layers are always integrated without distinction. However, intuitively, the different feature maps of CNNs and the different features in the same maps should play different roles in saliency detection. To address this problem, a novel CNN named pyramid feature attention network (PFAN) is proposed to enhance the high-level context features and the low-level spatial structural features. In the proposed PFAN, a context-aware pyramid feature extraction (CPFE) module is designed for multi-scale high-level feature maps to capture the rich context features. A channel-wise attention (CA) model and a spatial attention (SA) model are respectively applied to the CPFE feature maps and the low-level feature maps, and then fused to detect salient regions. Finally, an edge preservation loss is proposed to obtain accurate boundaries of salient regions. The proposed PFAN is extensively evaluated on five benchmark datasets, and the experimental results demonstrate that the proposed network outperforms state-of-the-art approaches under different evaluation metrics.
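The channel-wise attention (CA) and spatial attention (SA) ideas in the abstract can be sketched in a few lines of PyTorch, as below; the reduction ratio, kernel sizes, and where the modules are applied are assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Squeeze spatially, then re-weight channels (illustrative CA block).
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))          # (N, C) channel weights
        return x * w[:, :, None, None]

class SpatialAttention(nn.Module):
    # Collapse channels, then re-weight spatial positions (illustrative SA block).
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))

high = torch.randn(2, 64, 32, 32)   # high-level (context-aware pyramid) features
low = torch.randn(2, 64, 64, 64)    # low-level features
high = ChannelAttention(64)(high)
low = SpatialAttention(64)(low)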
[capture, focus, recognition, recurrent] [computer, vision, pattern, ground, international, laplace, truth] [conference, ieee, method, image, proposed, preservation, background, figure, based, high, guide] [network, convolutional, conv, effective, deep, best, basic, scale, add, weighted, convolution, atrous, sigmoid, pooling, size] [attention, model, visual, generate, machine, evaluation, generation] [saliency, salient, feature, detection, object, pyramid, spatial, extraction, module, edge, map, propose, spacial, boundary, semantic, adopt, cpfe, global, detailed, context, detect, fully, cnn, level, challenging] [loss, learning, novel, function, refers, learn, set, datasets, select, large]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Ting and Wu, Xiangqian},
  title = {Pyramid Feature Attention Network for Saliency Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Co-Saliency Detection via Mask-Guided Fully Convolutional Networks With Multi-Scale Label Smoothing
Kaihua Zhang, Tengpeng Li, Bo Liu, Qingshan Liu


In the image co-saliency detection problem, one critical issue is how to model the concurrent pattern of the co-salient parts, which appears both within each image and across all the relevant images. In this paper, we propose a hierarchical image co-saliency detection framework as a coarse-to-fine strategy to capture this pattern. We first propose a mask-guided fully convolutional network structure to generate the initial co-saliency detection result. The mask is used for background removal and is learned from the high-level feature response maps of the pre-trained VGG-net output. We then propose a multi-scale label smoothing model to further refine the detection result. The proposed model jointly optimizes the label smoothness of pixels and superpixels. Experimental results on three popular image co-saliency detection benchmark datasets, including iCoseg, MSRC and Cosal2015, demonstrate its remarkable performance compared with state-of-the-art methods.
[framework, term, auc, middle] [algorithm, solution, pattern, optimal, initial, analysis, corresponding, approach] [image, method, proposed, figure, based, background, input, pca, appearance, masked] [deep, convolutional, network, smoothing, group, denotes, layer, performance, compared, achieve, output, wei, number] [model, visual, generate, common] [detection, mask, salient, saliency, feature, fcn, object, masking, map, score, semantic, three, including, junwei, fully, superpixel, pool, cfms, dingwen, propose, codw, hierarchical, benchmark, msrc, region, foreground, detect, cosaliency, huazhu, spatial, superpixels] [learning, set, label, unsupervised, learned, objective, supervised, ranking, function, update, datasets]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Kaihua and Li, Tengpeng and Liu, Bo and Liu, Qingshan},
  title = {Co-Saliency Detection via Mask-Guided Fully Convolutional Networks With Multi-Scale Label Smoothing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SAIL-VOS: Semantic Amodal Instance Level Video Object Segmentation - A Synthetic Dataset and Baselines
Yuan-Ting Hu, Hong-Shuo Chen, Kexin Hui, Jia-Bin Huang, Alexander G. Schwing


We introduce SAIL-VOS (Semantic Amodal Instance Level Video Object Segmentation), a new dataset aiming to stimulate semantic amodal segmentation research. Humans can effortlessly recognize partially occluded objects and reliably estimate their spatial extent beyond the visible. However, few modern computer vision techniques are capable of reasoning about occluded parts of an object. This is partly due to the fact that very few image datasets and no video dataset exist which permit development of such methods. To address this issue, we present a synthetic dataset extracted from the photo-realistic game GTA-V. Each frame is accompanied by densely annotated, pixel-accurate visible and amodal segmentation masks with semantic labels. More than 1.8M objects are annotated, resulting in 100 times more annotations than existing datasets. We demonstrate the challenges of the dataset by quantifying the performance of several baselines. Data and additional material are available at http://sailvos.web.illinois.edu.
[dataset, video, buffer, frame, focus, predicting, time, optical, tracking, report, second, temporal, multiple, driving, pause, record] [depth, occlusion, occluded, pose, scene, compute, visible, visibility, indoor, shape, note, maskrcnn] [image, based, proposed, synthetic, collect, real, figure, variety, contour, database] [performance, validation, convolutional, rate, number, deep] [modal, evaluation, script, game, model, reasoning, random, van, include, visual] [segmentation, object, amodal, semantic, instance, mask, level, stencil, hook, detection, cocoa, maskamodal, annotation, maskjoint, dyce, segmenting, third, annotated, baseline, pascal, toggle] [data, training, learning, class, set, datasets]
@InProceedings{Hu_2019_CVPR,
  author = {Hu, Yuan-Ting and Chen, Hong-Shuo and Hui, Kexin and Huang, Jia-Bin and Schwing, Alexander G.},
  title = {SAIL-VOS: Semantic Amodal Instance Level Video Object Segmentation - A Synthetic Dataset and Baselines},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Instance Activation Maps for Weakly Supervised Instance Segmentation
Yi Zhu, Yanzhao Zhou, Huijuan Xu, Qixiang Ye, David Doermann, Jianbin Jiao


Discriminative region responses residing inside an object instance can be extracted from networks trained with image-level label supervision. However, learning the full extent of pixel-level instance response in a weakly supervised manner remains unexplored. In this work, we tackle this challenging problem by using a novel instance extent filling approach. We first design a process to selectively collect pseudo supervision from noisy segment proposals obtained with previously published techniques. The pseudo supervision is used to learn a differentiable filling module that predicts a class-agnostic activation map for each instance given the image and an incomplete region response. We refer to the above maps as Instance Activation Maps (IAMs), which provide a fine-grained instance-level representation and allow instance masks to be extracted by lightweight CRF. Extensive experiments on the PASCAL VOC12 dataset show that our approach beats the state-of-the-art weakly supervised instance segmentation methods by a significant margin and increases the inference speed by an order of magnitude. Our method also generalizes well across domains and to unseen object categories. Without fine-tuning for the specific tasks, our model trained on VOC12 dataset (20 classes) obtains top performance for weakly supervised object localization on the CUB dataset (200 classes) and achieves competitive results on three widely used salient object detection benchmarks.
[recognition, extract, dataset, highlight] [vision, computer, pattern, approach, well, international] [ieee, conference, image, method, figure, proposed, cover, incomplete] [activation, inference, network, convolutional, process, deep, performance, full, speed, top] [iam, model, sheep, visual, common, random, generate] [object, instance, extent, filling, weakly, segmentation, iams, semantic, proposal, response, map, prm, peak, segment, salient, module, detection, localization, feature, spatial, region, mask, prms, saliency, supervision, iou, predicted, affinity, person, three, box] [supervised, learning, class, pseudo, noisy, classification, learn, knowledge, discriminative, trained, training, unseen, set, dog, large, learned, sampling]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Yi and Zhou, Yanzhao and Xu, Huijuan and Ye, Qixiang and Doermann, David and Jiao, Jianbin},
  title = {Learning Instance Activation Maps for Weakly Supervised Instance Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Aggregation
Zhi Tian, Tong He, Chunhua Shen, Youliang Yan


Recent semantic segmentation methods exploit encoder-decoder architectures to produce the desired pixel-wise segmentation prediction. The last layer of the decoders is typically a bilinear upsampling procedure to recover the final pixel-wise prediction. We empirically show that this overly simple and data-independent bilinear upsampling may lead to sub-optimal results. In this work, we propose a data-dependent upsampling (DUpsampling) to replace bilinear upsampling, which takes advantage of the redundancy in the label space of semantic segmentation and is able to recover the pixel-wise prediction from low-resolution outputs of CNNs. The main advantage of the new upsampling layer is that, with a feature map at a relatively low resolution such as 1/16 or 1/32 of the input size, we can achieve even better segmentation accuracy while significantly reducing computation complexity. This is made possible by 1) the new upsampling layer's much improved reconstruction capability; and, more importantly, 2) the DUpsampling-based decoder's flexibility in leveraging almost arbitrary combinations of the CNN encoders' features. Experiments on PASCAL VOC demonstrate that, with much less computation complexity, our decoder outperforms the state-of-the-art decoder. Finally, without any post-processing, the framework equipped with our proposed decoder achieves new state-of-the-art performance on two datasets: 88.1% mIOU on PASCAL VOC with only 30% of the computation of the previously best model; and 52.5% mIOU on PASCAL Context.
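A minimal sketch of data-dependent upsampling: a learned 1x1 projection expands the low-resolution feature map to C*r*r channels, which are then rearranged into a full-resolution C-channel prediction (pixel-shuffle style). The paper derives its projection weights from a compression of the segmentation label space; the exact rearrangement and initialization are not reproduced here and the numbers below are illustrative assumptions.

import torch
import torch.nn as nn

class DUpsamplingSketch(nn.Module):
    # Learned linear expansion followed by a pixel-shuffle style rearrangement.
    def __init__(self, in_channels, num_classes, ratio):
        super().__init__()
        self.ratio = ratio
        self.proj = nn.Conv2d(in_channels, num_classes * ratio * ratio, kernel_size=1)

    def forward(self, x):
        x = self.proj(x)                                      # (N, C*r*r, h, w)
        return nn.functional.pixel_shuffle(x, self.ratio)     # (N, C, h*r, w*r)

logits = DUpsamplingSketch(in_channels=256, num_classes=21, ratio=16)(torch.randn(1, 256, 32, 32))
# logits.shape == (1, 21, 512, 512)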
[prediction, previous, framework, fusion] [reconstruction, linear, typically, bound] [proposed, image, resolution, ieee, input, method, produce, recover, amount, figure] [bilinear, dupsampling, upsampling, performance, computation, convolutional, vanilla, table, ratio, aggregation, atrous, size, better, stride, denotes, order, deep, downsample, upsample, reducing, convolution, network, output, achieves, best, rate, cnns, batch, flexible, achieve, architecture] [decoder, arxiv, preprint, improved, chunhua, encoder] [feature, semantic, segmentation, pascal, final, miou, voc, val, fused, spatial, backbone, context, cnn, improve, ablation] [training, set, softmax, learning, loss, upper, space, test]
@InProceedings{Tian_2019_CVPR,
  author = {Tian, Zhi and He, Tong and Shen, Chunhua and Yan, Youliang},
  title = {Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Aggregation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Box-Driven Class-Wise Region Masking and Filling Rate Guided Loss for Weakly Supervised Semantic Segmentation
Chunfeng Song, Yan Huang, Wanli Ouyang, Liang Wang


Semantic segmentation has achieved huge progress via adopting deep Fully Convolutional Networks (FCN). However, the performance of FCN-based models relies heavily on the amount of pixel-level annotations, which are expensive and time-consuming to obtain. To address this problem, learning to segment with weak supervision from bounding boxes is an attractive alternative. Making full use of the class-level and region-level supervision from bounding boxes is the critical challenge for this weakly supervised learning task. In this paper, we first introduce a box-driven class-wise masking model (BCM) to remove irrelevant regions of each class. Moreover, based on the pixel-level segment proposals generated from the bounding box supervision, we calculate the mean filling rate of each class to serve as an important prior cue, and propose a filling rate guided adaptive loss (FR-Loss) that helps the model ignore wrongly labeled pixels in proposals. Unlike previous methods that directly train models with fixed individual segment proposals, our method can adjust the model learning with global statistical information. Thus it helps reduce the negative impact of wrongly labeled proposals. We evaluate the proposed method on the challenging PASCAL VOC 2012 benchmark and compare with other methods. Extensive experimental results show that the proposed method is effective and achieves state-of-the-art results.
[work] [corresponding] [proposed, method, image, figure, based, pixel, remove, guide] [rate, convolutional, performance, compare, crf, table, effective, deep, calculate, top, network, adaptive, better, compared, achieves, comparable] [model, attention, evaluate, introduce, generated, irrelevant, arxiv, preprint, obvious, generate] [filling, segmentation, semantic, masking, bounding, weakly, bcm, guided, fully, segment, box, help, wrongly, fcn, supervision, map, global, ignore, score, confident, object, person, foreground, sdi, adopt, feature, three, weak, pascal, mask, spatial] [supervised, class, loss, learn, learning, labeled, training, train, trained, negative, select]
@InProceedings{Song_2019_CVPR,
  author = {Song, Chunfeng and Huang, Yan and Ouyang, Wanli and Wang, Liang},
  title = {Box-Driven Class-Wise Region Masking and Filling Rate Guided Loss for Weakly Supervised Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dual Attention Network for Scene Segmentation
Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, Hanqing Lu


In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of a traditional dilated FCN, which model the semantic interdependencies in the spatial and channel dimensions respectively. The position attention module selectively aggregates the features at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We sum the outputs of the two attention modules to further improve feature representation, which contributes to more precise segmentation results. We achieve new state-of-the-art segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff. In particular, a mean IoU score of 81.5% is achieved on the Cityscapes test set without using coarse data.
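The position attention module is self-attention over spatial locations; a compact PyTorch sketch follows (the channel attention module, which operates analogously over channel maps, is omitted). The reduction factor and the learnable residual scale gamma are assumptions consistent with common implementations, not necessarily the paper's exact settings.

import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    # Every position aggregates features from all positions, weighted by
    # pairwise similarity, then adds back the input via a learnable scale.
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (N, HW, C')
        k = self.key(x).flatten(2)                     # (N, C', HW)
        v = self.value(x).flatten(2)                   # (N, C, HW)
        attn = torch.softmax(q @ k, dim=-1)            # (N, HW, HW) position affinities
        out = (v @ attn.transpose(1, 2)).view(n, c, h, w)
        return self.gamma * out + x

y = PositionAttention(64)(torch.randn(1, 64, 16, 16))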
[capture, dataset, perform, previous, outperforms, work] [scene, position, computer, vision, pattern, matrix, local, corresponding] [image, conference, method, ieee, dual, proposed, figure, study, transpose, result, based] [channel, network, dilated, performance, convolution, table, layer, neural, better, represents, operation, employ, convolutional, aggregate, multiplication, pam, achieves, adaptively, weighted] [attention, model, sum, mechanism, rich, introduce] [feature, semantic, module, segmentation, contextual, spatial, pascal, map, context, reshape, danet, global, stuff, coco, adopt, voc, val, fcn, ablation, improve, iou, enhance, object, improves, final, cam] [set, training, data, representation, testing]
@InProceedings{Fu_2019_CVPR,
  author = {Fu, Jun and Liu, Jing and Tian, Haijie and Li, Yong and Bao, Yongjun and Fang, Zhiwei and Lu, Hanqing},
  title = {Dual Attention Network for Scene Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
InverseRenderNet: Learning Single Image Inverse Rendering
Ye Yu, William A. P. Smith


We show how to train a fully convolutional neural network to perform inverse rendering from a single, uncontrolled image. The network takes an RGB image as input and regresses albedo and normal maps, from which we compute lighting coefficients. Our network is trained using large uncontrolled image collections without ground truth. By incorporating a differentiable renderer, our network can learn from self-supervision. Since the problem is ill-posed, we introduce additional supervision: 1) we learn a statistical natural illumination prior; 2) our key insight is to perform offline multiview stereo (MVS) on images containing rich illumination variation. From the MVS pose and depth maps, we can cross-project between overlapping views such that Siamese training can be used to ensure consistent estimation of photometric invariants. MVS depth also provides direct coarse supervision for normal map estimation. We believe this is the first attempt to use MVS supervision for learning inverse rendering.
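The rendering step that such a network learns to invert can be written as Lambertian shading under low-order spherical-harmonic lighting: the image is approximated by albedo times an SH shading term evaluated at the surface normal. The NumPy sketch below folds the SH normalization constants into the lighting coefficients, which is a common simplification and an assumption here rather than the paper's exact renderer.

import numpy as np

def sh_basis(normals):
    # Order-2 spherical-harmonics basis (9 terms) at unit normals,
    # with normalization constants folded into the lighting coefficients.
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([
        np.ones_like(x),        # constant term
        y, z, x,                # order-1 terms
        x * y, y * z,           # order-2 terms
        3 * z ** 2 - 1,
        x * z,
        x ** 2 - y ** 2,
    ], axis=-1)

def lambertian_render(albedo, normals, lighting):
    # albedo: (..., 3); normals: (..., 3) unit vectors; lighting: (9, 3) SH coefficients per RGB channel.
    shading = sh_basis(normals) @ lighting    # (..., 3)
    return albedo * shading

h, w = 4, 4
albedo = np.random.rand(h, w, 3)
normals = np.random.randn(h, w, 3)
normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
lighting = np.random.randn(9, 3)
image = lambertian_render(albedo, normals, lighting)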
[prediction, work, recognition, perform, dataset, term] [albedo, illumination, normal, depth, inverse, computer, intrinsic, rendering, vision, lighting, single, surface, reflectance, estimated, pattern, multiview, compute, stereo, estimate, estimation, decomposition, spherical, classical, shape, ground, problem, photometric, direct, camera, truth, international, diffuse, uncontrolled, local, differentiable, geometry, monocular, scene, nestmeyer, iiw, noah, volume, additional, outdoor, computed, error] [image, conference, ieee, shading, input, appearance, statistical, prior, synthetic, figure, consistency, real, frontal] [network, deep, neural, convolutional, performance] [model, introduce, natural] [map, supervision, fully, benchmark, european] [learning, data, training, loss, harmonic, cross, train, trained, supervised, large]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Ye and Smith, William A. P.},
  title = {InverseRenderNet: Learning Single Image Inverse Rendering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Variational Auto-Encoder Model for Stochastic Point Processes
Nazanin Mehrasa, Akash Abdu Jyothi, Thibaut Durand, Jiawei He, Leonid Sigal, Greg Mori


We propose a novel probabilistic generative model for action sequences. The model is termed the Action Point Process VAE (APP-VAE), a variational auto-encoder that can capture the distribution over the times and categories of action sequences. Modeling the variety of possible action sequences is a challenge, which we show can be addressed via the APP-VAE's use of latent representations and non-linear functions to parameterize distributions over which event is likely to occur next in a sequence and at what time. We empirically validate the efficacy of APP-VAE for modeling action sequences on the MultiTHUMOS and Breakfast datasets.
[action, time, future, sequence, temporal, activity, prediction, recurrent, video, breakfast, asynchronous, modeling, predict, predicting, recognition, dataset, human, complex, lstm, nll, capture, event, occur, multithumos, early] [point, international, computer, vision, variable, truth, approach, compute, ground, predicts, well] [prior, latent, conference, conditional, generative, proposed, intensity, figure, input, ieee] [process, network, neural, table, accuracy, stochastic, output, architecture, gaussian, fixed, standard] [model, history, variational, probability, step, generate, generated, generation] [category, hierarchical, predicted] [distribution, learning, posterior, function, vae, likelihood, data, learned, loss, probabilistic, code, test, task, representation, novel, sample]
@InProceedings{Mehrasa_2019_CVPR,
  author = {Mehrasa, Nazanin and Abdu Jyothi, Akash and Durand, Thibaut and He, Jiawei and Sigal, Leonid and Mori, Greg},
  title = {A Variational Auto-Encoder Model for Stochastic Point Processes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unifying Heterogeneous Classifiers With Distillation
Jayakorn Vongkulbhisal, Phongtharin Vinayavekhin, Marco Visentini-Scarzanella


In this paper, we study the problem of unifying knowledge from a set of classifiers with different architectures and target classes into a single classifier, given only a generic set of unlabelled data. We call this problem Unifying Heterogeneous Classifiers (UHC). This problem is motivated by scenarios where data is collected from multiple sources, but the sources cannot share their data, e.g., due to privacy concerns, and only privately trained models can be shared. In addition, each source may not be able to gather data to train all classes due to data availability at each source, and may not be able to train the same classification model due to different computational resources. To tackle this problem, we propose a generalisation of knowledge distillation to merge HCs. We derive a probabilistic relation between the outputs of HCs and the probability over all classes. Based on this relation, we propose two classes of methods based on cross-entropy minimisation and matrix factorisation, which allow us to estimate soft labels over all classes from unlabelled samples and use them in lieu of ground truth labels to train a unified classifier. Our extensive experiments on ImageNet, LSUN, and Places365 datasets show that our approaches significantly outperform a naive extension of distillation and can achieve almost the same accuracy as classifiers that are trained in a centralised, supervised manner.
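Once soft labels over the union of classes have been estimated from the heterogeneous classifiers, the unified classifier is trained by distillation on unlabelled data. The sketch below shows only that generic tempered cross-entropy step; the paper's probabilistic relation between heterogeneous outputs and the matrix-factorisation estimation of the soft labels are not reproduced, and the shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, soft_labels, temperature=2.0):
    # Cross-entropy between estimated soft labels and the student's tempered
    # predictions on unlabelled samples (generic distillation step).
    log_p = F.log_softmax(student_logits / temperature, dim=1)
    return -(soft_labels * log_p).sum(dim=1).mean() * temperature ** 2

student_logits = torch.randn(8, 100, requires_grad=True)          # unified classifier outputs
soft_labels = torch.softmax(torch.randn(8, 100), dim=1)           # estimated soft labels over all classes
loss = distillation_loss(student_logits, soft_labels)
loss.backward()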
[perform, heterogeneous, multiple, tackle] [matrix, estimate, approach, problem, single, case, convex, directly, matching, formulation, estimating, analysis, define, note, well, minimisation] [image, based, proposed, method, central, result, missing] [accuracy, neural, imagenet, best, standard, deep, experiment, factorization, better] [probability, random, model, sensitivity, consider, vector, machine, describe] [propose, three, logits, global, relation] [data, set, trained, train, soft, distillation, training, unlabelled, test, class, uhc, logit, knowledge, classify, factorisation, hcs, unified, learning, loss, target, transfer, unifying, supervised, classifier, ensemble, datasets, minimising, temperature, classification, large, main, space]
@InProceedings{Vongkulbhisal_2019_CVPR,
  author = {Vongkulbhisal, Jayakorn and Vinayavekhin, Phongtharin and Visentini-Scarzanella, Marco},
  title = {Unifying Heterogeneous Classifiers With Distillation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Assessment of Faster R-CNN in Man-Machine Collaborative Search
Arturo Deza, Amit Surana, Miguel P. Eckstein


With the advent of modern expert systems driven by deep learning that supplement human experts (e.g. radiologists, dermatologists, surveillance scanners), we analyze how and when such expert systems enhance human performance in a fine-grained, small-target visual search task. We set up a two-session factorial experimental design in which humans visually search for a target with and without a Deep Learning (DL) expert system. We evaluate human changes in target detection performance and eye movements in the presence of the DL system. We find that performance improvements with the DL system (computed via a Faster R-CNN with a VGG16) interact with the observer's perceptual abilities (e.g., sensitivity). The main results include: 1) the DL system reduces the false alarm rate per image on average across observer groups of both high and low sensitivity; 2) only human observers with high sensitivity perform better than the DL system, while the low-sensitivity group does not surpass individual DL system performance, even when aided by the DL system itself; 3) increases in the number of trials and decreases in viewing time were driven by the DL system mainly for the low-sensitivity group; 4) the DL system helps the human observer fixate on a target by the third fixation. These results provide insights into the benefits and limitations of deep learning systems that are collaborative or competitive with humans.
[human, hcv, observer, session, time, recognition, work, second, hit, detectability, perform, influence, aided, current, fixation, deg, performing] [computer, viewing, well, total, vision, pattern, analysis, rendered, condition, error] [image, high, figure, expert, eye, ieee, collaborative] [number, deep, rate, search, low, performance, group, small, experiment, better, neural, design, applied, convolutional, represents] [system, sensitivity, visual, machine, find, arxiv, preprint, adversarial, potential, candidate, blue] [detection, object, false, presence, medical, bounding, faster, location, roc, box, average, person] [target, learning, main, experimental, positive, distance, independent, training, split, trained]
@InProceedings{Deza_2019_CVPR,
  author = {Deza, Arturo and Surana, Amit and Eckstein, Miguel P.},
  title = {Assessment of Faster R-CNN in Man-Machine Collaborative Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, Roozbeh Mottaghi


Visual Question Answering (VQA) in its ideal form lets us study reasoning in the joint space of vision and language and serves as a proxy for the AI task of scene understanding. However, most VQA benchmarks to date are focused on questions such as simple counting, visual attributes, and object detection that do not require reasoning or knowledge beyond what is in the image. In this paper, we address the task of knowledge-based visual question answering and provide a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources. Our new dataset includes more than 14,000 questions that require external knowledge to answer. We show that the performance of the state-of-the-art VQA models degrades drastically in this new setting. Our analysis shows that our knowledge-based VQA task is diverse, difficult, and large compared to previous knowledge-based VQA datasets. We hope that this dataset enables researchers to open up new avenues for research in this domain.
[dataset, hidden, incorporate, people, work, combined] [require, provide, material, weather, scene] [image, figure, method, includes, based, frequency] [number, performance, compared, science, table, best, better, lot, requiring, top] [question, visual, vqa, answer, answering, reasoning, external, articlenet, mutan, language, common, ban, memory, fruit, citrus, teddy, query, cooking, understanding, relevant, attention, asked, milk, devi, dhruv, van, animal, evaluate, wikipedia, requires, model, natural, daquar, orange, everyday, example, mturk, provided] [category, benchmark, coco, highest] [knowledge, datasets, retrieved, open, set, task, training, train, large, test, bias]
@InProceedings{Marino_2019_CVPR,
  author = {Marino, Kenneth and Rastegari, Mohammad and Farhadi, Ali and Mottaghi, Roozbeh},
  title = {OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
NDDR-CNN: Layerwise Feature Fusing in Multi-Task CNNs by Neural Discriminative Dimensionality Reduction
Yuan Gao, Jiayi Ma, Mingbo Zhao, Wei Liu, Alan L. Yuille


In this paper, we propose a novel Convolutional Neural Network (CNN) structure for general-purpose multi-task learning (MTL), which enables automatic feature fusing at every layer from different tasks. This is in contrast with the most widely used MTL CNN structures which empirically or heuristically share features on some specific layers (e.g., share all the features except the last convolutional layer). The proposed layerwise feature fusing scheme is formulated by combining existing CNN components in a novel way, with clear mathematical interpretability as discriminative dimensionality reduction, which is referred to as Neural Discriminative Dimensionality Reduction (NDDR). Specifically, we first concatenate features with the same spatial resolution from different tasks according to their channel dimension. Then, we show that the discriminative dimensionality reduction can be fulfilled by 1x1 Convolution, Batch Normalization, and Weight Decay in one CNN. The use of existing CNN components ensures the end-to-end training and the extensibility of the proposed NDDR layer to various state-of-the-art CNN architectures in a "plug-and-play" manner. The detailed ablation analysis shows that the proposed NDDR layer is easy to train and also robust to different hyperparameters. Experiments on different task sets with various base network architectures demonstrate the promising performance and desirable generalizability of our proposed method. The code of our paper is available at https://github.com/ethanygao/NDDR-CNN.
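The NDDR layer itself is small enough to sketch directly: concatenate same-resolution features from the tasks along the channel dimension and give each task its own 1x1 convolution plus batch normalization, with weight decay applied through the optimizer. The two-task case and the default initialization below are simplifying assumptions; the paper discusses a specific initialization scheme not reproduced here.

import torch
import torch.nn as nn

class NDDRLayer(nn.Module):
    # Layerwise feature fusing for two tasks: channel concatenation followed by
    # per-task 1x1 convolution + batch norm (discriminative dimensionality reduction).
    def __init__(self, channels):
        super().__init__()
        self.task1 = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.BatchNorm2d(channels))
        self.task2 = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.BatchNorm2d(channels))

    def forward(self, f1, f2):
        fused = torch.cat([f1, f2], dim=1)        # same spatial resolution, stacked channels
        return self.task1(fused), self.task2(fused)

a, b = torch.randn(2, 64, 28, 28), torch.randn(2, 64, 28, 28)
a, b = NDDRLayer(64)(a, b)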
[multiple, mtl, prediction, abhinav, perform, dataset, outperforms] [surface, normal, single, analysis, vision] [proposed, method, age, image, gender, input, transformation, face, demonstrate, diagonal] [nddr, network, layer, convolutional, reduction, deep, table, sluice, performance, convolution, channel, initialization, neural, weight, structure, order, output, rate, desirable, number, original, lower, pretrained, wei, batch, size, architecture, layerwise, reduce, shortcut, pacc, concatenate] [enables] [cnn, semantic, feature, segmentation, spatial, alan, object, detection, fusing, propose, ablation, hierarchical, miou, ross] [discriminative, learning, dimensionality, task, learn, novel, training, classification, existing, train, trained, dimension]
@InProceedings{Gao_2019_CVPR,
  author = {Gao, Yuan and Ma, Jiayi and Zhao, Mingbo and Liu, Wei and Yuille, Alan L.},
  title = {NDDR-CNN: Layerwise Feature Fusing in Multi-Task CNNs by Neural Discriminative Dimensionality Reduction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Spectral Metric for Dataset Complexity Assessment
Frederic Branchaud-Charron, Andrew Achkar, Pierre-Marc Jodoin


In this paper, we propose a new measure to gauge the complexity of image classification problems. Given an annotated image dataset, our method computes a complexity measure called the cumulative spectral gradient (CSG) which strongly correlates with the test accuracy of convolutional neural networks (CNN). The CSG measure is derived from the probabilistic divergence between classes in a spectral clustering framework. We show that this metric correlates with the overall separability of the dataset and thus its inherent complexity. As will be shown, our metric can be used for dataset reduction, to assess which classes are more difficult to disentangle, and to approximate the accuracy one could expect to get with a CNN. Results obtained on 11 datasets and three CNN models reveal that our method is more accurate and faster than previous complexity measures.
[dataset, time, complex, previous, framework] [matrix, error, well, compute, total, wij, laplacian, implies, derived, linearly, analysis, require] [method, image, spectral, spectrum, input, proposed, figure, cumulative, raw] [number, complexity, alexnet, table, deep, correlation, overlap, rate, neural, size, processing, accuracy, separable, pearson, best, called, gradient, designed, reduction, weight, better, small] [requires, easily, goal] [cnn, feature, faster, score] [csg, datasets, class, classification, mnist, data, measure, training, similarity, cnnae, test, large, function, metric, embedding, distance, clustering, set, minimum, space, existing, learning, distribution, adjacency, embeddings, train, prohibitively]
@InProceedings{Branchaud-Charron_2019_CVPR,
  author = {Branchaud-Charron, Frederic and Achkar, Andrew and Jodoin, Pierre-Marc},
  title = {Spectral Metric for Dataset Complexity Assessment},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ADCrowdNet: An Attention-Injective Deformable Convolutional Network for Crowd Understanding
Ning Liu, Yongchao Long, Changqing Zou, Qun Niu, Li Pan, Hefeng Wu


We propose an attention-injective deformable convolutional network called ADCrowdNet for crowd understanding that can address the accuracy degradation problem in highly congested noisy scenes. ADCrowdNet contains two concatenated networks. An attention-aware network called Attention Map Generator (AMG) first detects crowd regions in images and computes the congestion degree of these regions. Based on detected crowd regions and congestion priors, a multi-scale deformable network called Density Map Estimator (DME) then generates high-quality density maps. With the attention-aware training scheme and multi-scale deformable convolutional scheme, the proposed ADCrowdNet is more effective at capturing crowd features and more resistant to various kinds of noise. We have evaluated our method on four popular crowd counting datasets (ShanghaiTech, UCF_CC_50, WorldEXPO'10, and UCSD) and an extra vehicle counting dataset, TRANCOS, and our approach beats existing state-of-the-art approaches on all of these datasets.
[dataset, ucf, people, previous] [front, approach, truth, problem, accurate, ground, error, estimated, degree] [figure, image, ieee, input, method, proposed, background, mse, comparison, psnr, based, study, generator] [network, density, dme, adcrowdnet, amg, convolutional, congested, convolution, csrnet, architecture, ucsd, table, scheme, achieved, performance, deep, neural, called, highly, congestion, number, dilated, lower, filter, concatenation, best, accuracy] [attention, understanding, generated, model, visual, example] [crowd, map, counting, deformable, shanghaitech, mae, count, feature, roi, object, vehicle, cnn, module, average, threshold, propose] [training, noisy, datasets, learning, sampling, classification, negative, testing, conducted]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Ning and Long, Yongchao and Zou, Changqing and Niu, Qun and Pan, Li and Wu, Hefeng},
  title = {ADCrowdNet: An Attention-Injective Deformable Convolutional Network for Crowd Understanding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
VERI-Wild: A Large Dataset and a New Method for Vehicle Re-Identification in the Wild
Yihang Lou, Yan Bai, Jun Liu, Shiqi Wang, Lingyu Duan


Vehicle Re-identification (ReID) is of great significance to intelligent transportation and public security. However, many challenging issues of vehicle ReID in real-world scenarios have not been fully investigated, e.g., high viewpoint variations, extreme illumination conditions, complex backgrounds, and different camera sources. To promote research on vehicle ReID in the wild, we collect a new dataset called VERI-Wild with the following distinct features: 1) The vehicle images are captured by a large surveillance system containing 174 cameras covering a large urban district (more than 200 km^2). 2) The camera network continuously captures vehicles 24 hours a day for one month. 3) It is the first vehicle ReID dataset collected under unconstrained conditions. It is also a large dataset containing more than 400 thousand images of 40 thousand vehicle IDs. In this paper, we also propose a new method for vehicle ReID, in which the ReID model is coupled with a Feature Distance Adversarial Network (FDA-Net), and a novel feature distance adversary scheme is designed to generate hard negative samples in feature space to facilitate ReID model training. Comprehensive results show the effectiveness of our method on the proposed dataset and on two other existing datasets.
[dataset, complex, capture, focus] [computer, constraint, camera, vision, pattern, occlusion, illumination, match] [real, proposed, input, image, method, generator, facilitate, subtle, figure, surveillance, unconstrained, captured, collected, ieee, conference, great] [performance, network, scheme, table, capability, compared, deep, designed, regularization, achieves, rate, number] [generated, attention, discriminator, model, generate, adversary, adversarial, gan, query, generation, evaluation, arxiv, att] [vehicle, feature, map, person, challenging, urban, improve] [hard, reid, negative, embedding, training, distance, vehicleid, set, large, learning, similarity, loss, test, representation, discriminative, triplet, existing, sample, cmc, margin, china, datasets, emb, hdc]
@InProceedings{Lou_2019_CVPR,
  author = {Lou, Yihang and Bai, Yan and Liu, Jun and Wang, Shiqi and Duan, Lingyu},
  title = {VERI-Wild: A Large Dataset and a New Method for Vehicle Re-Identification in the Wild},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
3D Local Features for Direct Pairwise Registration
Haowen Deng, Tolga Birdal, Slobodan Ilic


We present a novel, data-driven approach for solving the problem of registering two point cloud scans. Our approach is direct in the sense that a single pair of corresponding local patches already provides the necessary transformation cue for the global registration. To achieve that, we first endow the state-of-the-art PPF-FoldNet auto-encoder (AE) with a pose-variant sibling, where the discrepancy between the two leads to pose-specific descriptors. Based upon this, we introduce RelativeNet, a relative pose estimation network that assigns correspondence-specific orientations to the keypoints, eliminating any local reference frame computations. Finally, we devise a simple yet effective hypothesize-and-verify algorithm to quickly use the predictions and align the two point sets. Our extensive quantitative and qualitative experiments suggest that our approach outperforms the state of the art on challenging real datasets of pairwise registration and that augmenting the keypoints with local pose information leads to better generalization and a dramatic speed-up.
[prediction, recognition, state] [local, pose, computer, point, registration, vision, ransac, relative, pattern, matching, geometric, international, good, fragment, rigid, relativenet, direct, cloud, robust, correspondence, note, hotel, estimation, algorithm, putative, well, estimate, rotation, cgf, problem, closest, canonical, redwood, ppfnet, usac, birdal] [conference, ieee, method, transformation, latent, based, patch, reference, figure, synthetic, real] [network, number, performance, better, best, precision, deep, structure, architecture, mlp, higher] [finding, find, simple] [feature, recall, benchmark, average, global, driven, challenging, european, art, final] [set, pair, data, loss, pairwise, training, learning, learned, train, invariant, sample, generalized, function]
@InProceedings{Deng_2019_CVPR,
  author = {Deng, Haowen and Birdal, Tolga and Ilic, Slobodan},
  title = {3D Local Features for Direct Pairwise Registration},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
HPLFlowNet: Hierarchical Permutohedral Lattice FlowNet for Scene Flow Estimation on Large-Scale Point Clouds
Xiuye Gu, Yijie Wang, Chongruo Wu, Yong Jae Lee, Panqu Wang


We present a novel deep neural network architecture for end-to-end scene flow estimation that directly operates on large-scale 3D point clouds. Inspired by Bilateral Convolutional Layers (BCL), we propose novel DownBCL, UpBCL, and CorrBCL operations that restore structural information from unstructured point clouds, and fuse information from two consecutive point clouds. Operating on discrete and sparse permutohedral lattice points, our architectural design is parsimonious in computational cost. Our model can efficiently process a pair of point cloud frames at once with a maximum of 86K points per frame. Our approach achieves state-of-the-art performance on the FlyingThings3D and KITTI Scene Flow 2015 datasets. Moreover, trained on synthetic data, our approach shows great generalization ability on real-world data and on different point densities without fine-tuning.
[flow, optical, work, motion, previous, time, displacement, dataset, multiple, outperforms, operates] [point, lattice, scene, permutohedral, estimation, cloud, kitti, downbcl, approach, coarser, denote, slicing, ground, relative, directly, well, error, corrbcls, local, finer, upbcl, downbcls, position, splatting, barycentric] [input, patch, filtering, method, image, remove, interpolation, bilateral] [convolutional, bcl, network, deep, normalization, scheme, original, architecture, correlation, convolution, computational, performance, layer, table, neural, sparse, density, cost, better, filtered, corrbcl, process, number, downsampling, simplex, size, hplflownet, output, upsampling] [model, evaluate, step] [propose, hierarchical, fuse] [learning, data, training, novel, generalization, set, large]
@InProceedings{Gu_2019_CVPR,
  author = {Gu, Xiuye and Wang, Yijie and Wu, Chongruo and Jae Lee, Yong and Wang, Panqu},
  title = {HPLFlowNet: Hierarchical Permutohedral Lattice FlowNet for Scene Flow Estimation on Large-Scale Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
GPSfM: Global Projective SFM Using Algebraic Constraints on Multi-View Fundamental Matrices
Yoni Kasten, Amnon Geifman, Meirav Galun, Ronen Basri


This paper addresses the problem of recovering projective camera matrices from collections of fundamental matrices in multiview settings. We make two main contributions. First, given the n-choose-2 fundamental matrices computed for n images, we provide a complete algebraic characterization in the form of conditions that are both necessary and sufficient for the recovery of camera matrices. These conditions are based on arranging the fundamental matrices as blocks in a single matrix, called the n-view fundamental matrix, and characterizing this matrix in terms of the signs of its eigenvalues and its rank structure. Second, we propose a concrete algorithm for projective structure-from-motion that utilizes this characterization. Given a complete or partial collection of measured fundamental matrices, our method seeks camera matrices that minimize a global algebraic error for the measured fundamental matrices. In contrast to existing methods, our optimization, without any initialization, produces a consistent set of fundamental matrices that corresponds to a unique set of cameras (up to a choice of projective frame). Our experiments indicate that our method achieves state-of-the-art performance in both accuracy and running time.
[motion, graph, construct, time, fii, determine] [fundamental, camera, projective, matrix, fij, optimization, consistent, computer, theorem, form, algebraic, point, viewing, denote, algorithm, pattern, bundle, corresponding, problem, multiview, measured, estimated, adjustment, implies, vision, sfm, recovering, error, ground, sengupta, exists, coordinate, projection, solve, pipeline, reconstruction, linear, truth, epipolar] [method, consistency, recovery, image, cover, ieee, recover, produce, noise] [scale, structure, compared, number, called, accuracy, running, efficient, denotes, block] [collection, complete, characterization, partial, include, machine] [global, three, average, location] [rank, set, triplet, distance, pairwise, symmetric, paper, sufficient, subset, corollary, data, update]
@InProceedings{Kasten_2019_CVPR,
  author = {Kasten, Yoni and Geifman, Amnon and Galun, Meirav and Basri, Ronen},
  title = {GPSfM: Global Projective SFM Using Algebraic Constraints on Multi-View Fundamental Matrices},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Group-Wise Correlation Stereo Network
Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang, Hongsheng Li


Stereo matching estimates the disparity between a rectified image pair, which is of great importance to depth sensing, autonomous driving, and other related tasks. Previous works built cost volumes with cross-correlation or concatenation of left and right features across all disparity levels, and then a 2D or 3D convolutional neural network is utilized to regress the disparity maps. In this paper, we propose to construct the cost volume by group-wise correlation. The left features and the right features are divided into groups along the channel dimension, and correlation maps are computed among each group to obtain multiple matching cost proposals, which are then packed into a cost volume. Group-wise correlation provides efficient representations for measuring feature similarities and will not lose too much information like full correlation. It also preserves better performance when reducing parameters compared with previous methods. The 3D stacked hourglass network proposed in previous works is improved to boost the performance and decrease the inference computational cost. Experiment results show that our method outperforms previous methods on Scene Flow, KITTI 2012, and KITTI 2015 datasets.
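A sketch of the group-wise correlation volume construction: split the left and right unary features into groups along the channel dimension and, for each candidate disparity, correlate the shifted feature groups. Feature sizes, the number of groups, and the use of a per-group channel mean are illustrative assumptions, not necessarily the paper's exact settings.

import torch

def groupwise_correlation_volume(left, right, max_disp, num_groups):
    # left, right: (N, C, H, W) unary features. For each disparity d, the right
    # features are shifted by d and correlated per group, giving a
    # (N, num_groups, max_disp, H, W) cost volume.
    n, c, h, w = left.shape
    ch_per_group = c // num_groups
    lg = left.view(n, num_groups, ch_per_group, h, w)
    rg = right.view(n, num_groups, ch_per_group, h, w)
    volume = left.new_zeros(n, num_groups, max_disp, h, w)
    for d in range(max_disp):
        if d > 0:
            corr = (lg[..., d:] * rg[..., :-d]).mean(dim=2)   # left pixel x vs right pixel x-d
        else:
            corr = (lg * rg).mean(dim=2)
        volume[:, :, d, :, d:] = corr
    return volume

left = torch.randn(1, 320, 32, 64)
right = torch.randn(1, 320, 32, 64)
volume = groupwise_correlation_volume(left, right, max_disp=48, num_groups=40)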
[flow, previous, multiple, time, consists] [volume, matching, disparity, stereo, kitti, computer, vision, hourglass, pattern, scene, psmnet, left, depth, computed, error, provide, dispnetc, form, estimation] [proposed, conference, ieee, image, figure, method, traditional, input] [correlation, cost, network, aggregation, concatenation, performance, output, table, stacked, group, computational, better, inference, unary, validation, denotes, neural, efficient, convolution, effectiveness, deep, full, compared, explore, aggregate, number, concat, size, experiment] [model, evaluation, improved] [feature, module, improve, european, context, propose] [training, set, datasets, learning, function, base, test, similarity, loss, learn]
@InProceedings{Guo_2019_CVPR,
  author = {Guo, Xiaoyang and Yang, Kai and Yang, Wukui and Wang, Xiaogang and Li, Hongsheng},
  title = {Group-Wise Correlation Stereo Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Level Context Ultra-Aggregation for Stereo Matching
Guang-Yu Nie, Ming-Ming Cheng, Yun Liu, Zhengfa Liang, Deng-Ping Fan, Yue Liu, Yongtian Wang


Exploiting multi-level context information in the cost volume can improve the performance of learning-based stereo matching methods. In recent years, 3-D Convolutional Neural Networks (3-D CNNs) have shown advantages in regularizing the cost volume but are limited by the unary features learned for matching cost computation. However, existing methods only use features from plain convolution layers or a simple aggregation of multi-level features to compute the cost volume, which is insufficient because stereo matching requires discriminative features to identify corresponding pixels in rectified stereo image pairs. In this paper, we propose a unary feature descriptor using multi-level context ultra-aggregation (MCUA), which encapsulates all convolutional features into a more discriminative representation through intra- and inter-level feature combination. Specifically, a child module that takes low-resolution images as input captures larger context information; this larger-context information from each layer is densely connected to the main branch of the network. MCUA makes good use of multi-level features with richer context and performs the image-to-image prediction holistically. We introduce our MCUA scheme for cost volume calculation and test it on PSM-Net. We also evaluate our method on the Scene Flow and KITTI 2012/2015 stereo datasets. Experimental results show that our method outperforms state-of-the-art methods by a notable margin and effectively improves the accuracy of stereo matching.
[flow, dataset, rnns] [stereo, matching, disparity, volume, scene, pattern, dense, field, error, kitti, initial] [ieee, image, input, proposed, figure, based] [mcua, cost, network, aggregation, residual, densenets, dla, receptive, performance, emcua, scheme, higher, output, order, neural, convolutional, deep, architecture, unary, layer, operation, size, convolution, applying, compare, scale, compared, cnns, introduces, better] [child, model, indicates, enables, generated, introduce, evaluate] [feature, module, context, stage, map, three, calculation, branch, backbone, final] [training, datasets, learning, independent, test, combination, loss, large, set, train]
@InProceedings{Nie_2019_CVPR,
  author = {Nie, Guang-Yu and Cheng, Ming-Ming and Liu, Yun and Liang, Zhengfa and Fan, Deng-Ping and Liu, Yue and Wang, Yongtian},
  title = {Multi-Level Context Ultra-Aggregation for Stereo Matching},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Large-Scale, Metric Structure From Motion for Unordered Light Fields
Sotiris Nousias, Manolis Lourakis, Christos Bergeles


This paper presents a large scale, metric Structure from Motion (SfM) pipeline for generalised cameras with overlapping fields-of-view, and demonstrates it using Light Field (LF) images. We build on recent developments in algorithms for absolute and relative pose recovery for generalised cameras and couple them with multi-view triangulation in a robust framework that advances the state-of-the-art on 3D reconstruction from LFs in several ways. First, our framework can recover the scale of a scene. Second, it is concerned with unordered sets of LF images, meticulously determining the order in which images should be considered. Third, it can scale to datasets with hundreds of LF images. Finally, it recovers 3D scene structure while abstaining from triangulating using very small baselines. Our approach outperforms the state-of-the-art, as demonstrated by real-world experiments with variable size datasets.
[frame, motion, graph, perform, recognition, multiple, work] [pose, computer, vision, camera, error, ransac, relative, triangulation, point, sfm, generalised, estimation, light, ray, reprojection, algorithm, reconstruction, matching, plenoptic, pattern, robust, subaperture, colmap, field, absolute, scene, international, projection, approach, registered, pipeline, initial, correspondence, calibration, unordered, single, bundle, geometric, corresponding, minimal, geometrically, pinhole, adjustment, outlier, view] [image, conference, ieee, central, figure, imaging, translation, reconstructed] [structure, number, scale, efficient, sparse, table, applied, selection] [median, model] [feature, refinement, spatial] [set, pairwise, pair, large, metric, essential]
@InProceedings{Nousias_2019_CVPR,
  author = {Nousias, Sotiris and Lourakis, Manolis and Bergeles, Christos},
  title = {Large-Scale, Metric Structure From Motion for Unordered Light Fields},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Understanding the Limitations of CNN-Based Absolute Camera Pose Regression
Torsten Sattler, Qunjie Zhou, Marc Pollefeys, Laura Leal-Taixe


Visual localization is the task of accurate camera pose estimation in a known scene. It is a key problem in computer vision and robotics, with applications including self-driving cars, Structure-from-Motion, SLAM, and Mixed Reality. Traditionally, the localization problem has been tackled using 3D geometry. Recently, end-to-end approaches based on convolutional neural networks have become popular. These methods learn to directly regress the camera pose from an input image. However, they do not achieve the same level of pose accuracy as 3D structure-based methods. To understand this behavior, we develop a theoretical model for camera pose regression. We use our model to predict failure cases for pose regression techniques and verify our predictions through experiments. We furthermore use our model to show that pose regression is more closely related to pose approximation via image retrieval than to accurate pose estimation via 3D structure. A key result is that current approaches do not consistently outperform a handcrafted image retrieval baseline. This clearly shows that additional research is needed before pose regression algorithms are ready to compete with structure-based methods.
[dataset, predict, current, multiple, work] [pose, camera, absolute, apr, densevlad, posenet, mapnet, relative, torsten, marc, scene, active, position, accurate, estimation, orientation, approach, linear, additional, single, matching, point, rpr, cambridge, tomas, theoretical, clearly, estimate, well, josef, problem, estimated, akihiko, regress] [image, based, input, captured, amount, figure] [search, convolutional, neural, outperform, deep, network, andrew, original] [visual, model, closer] [regression, localization, cnn, stage, feature, predicted] [training, test, retrieval, base, data, set, learning, trained, learn, generalize, large, retrieved, combination, consistently, close, embedding, learned, corresponds, loss, paper]
@InProceedings{Sattler_2019_CVPR,
  author = {Sattler, Torsten and Zhou, Qunjie and Pollefeys, Marc and Leal-Taixe, Laura},
  title = {Understanding the Limitations of CNN-Based Absolute Camera Pose Regression},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DeepLiDAR: Deep Surface Normal Guided Depth Prediction for Outdoor Scene From Sparse LiDAR Data and Single Color Image
Jiaxiong Qiu, Zhaopeng Cui, Yinda Zhang, Xingdi Zhang, Shuaicheng Liu, Bing Zeng, Marc Pollefeys


In this paper, we propose a deep learning architecture that produces accurate dense depth for an outdoor scene from a single color image and a sparse depth map. Inspired by indoor depth completion, our network estimates surface normals as the intermediate representation to produce dense depth, and can be trained end-to-end. With a modified encoder-decoder structure, our network effectively fuses the dense color image and the sparse LiDAR depth. To address outdoor-specific challenges, our network predicts a confidence mask to handle mixed LiDAR signals near foreground boundaries due to occlusion, and combines estimates from the color image and surface normals with learned attention maps to improve depth accuracy, especially for distant areas. Extensive experiments demonstrate that our model improves upon the state-of-the-art performance on the KITTI depth completion benchmark. An ablation study shows the positive impact of each model component on the final performance, and a comprehensive analysis shows that our model generalizes well to inputs with higher sparsity or from indoor scenes.
[recognition, prediction, work, performs] [depth, surface, normal, dense, vision, computer, lidar, completion, indoor, confidence, pattern, single, outdoor, kitti, error, well, accurate, estimated, rgb, estimation, rmse, international, scene, occlusion, directly, local, camera, cspn] [color, image, ieee, input, method, zhang, intermediate, based, bilateral, produce, proposed, figure, inpainting] [sparse, deep, network, performance, neural, better, compared, full, fast, sparsity, output, binary, unit, convolutional, architecture, validation, processing] [model, attention, complete, decoder, evaluation, system] [mask, pathway, map, propose, area, guided, final, feature, integration, european, affinity] [learning, data, learned, learn, representation, set, distance, loss, close, training, test]
@InProceedings{Qiu_2019_CVPR,
  author = {Qiu, Jiaxiong and Cui, Zhaopeng and Zhang, Yinda and Zhang, Xingdi and Liu, Shuaicheng and Zeng, Bing and Pollefeys, Marc},
  title = {DeepLiDAR: Deep Surface Normal Guided Depth Prediction for Outdoor Scene From Sparse LiDAR Data and Single Color Image},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Modeling Point Clouds With Self-Attention and Gumbel Subset Sampling
Jiancheng Yang, Qiang Zhang, Bingbing Ni, Linguo Li, Jinxian Liu, Mengdie Zhou, Qi Tian


Geometric deep learning is increasingly important thanks to the popularity of 3D sensors. Inspired by recent advances in the NLP domain, the self-attention transformer is introduced to consume point clouds. We develop Point Attention Transformers (PATs), using a parameter-efficient Group Shuffle Attention (GSA) to replace the costly Multi-Head Attention. We demonstrate its ability to process size-varying inputs, and prove its permutation equivariance. Besides, prior work relies on heuristics that depend on the input data (e.g., Furthest Point Sampling) to hierarchically select subsets of input points. We therefore propose, for the first time, an end-to-end learnable and task-agnostic sampling operation, named Gumbel Subset Sampling (GSS), to select a representative subset of input points. Equipped with Gumbel-Softmax, it produces a "soft" continuous subset in the training phase, and a "hard" discrete subset in the test phase. By selecting representative subsets in a hierarchical fashion, the networks learn a stronger representation of the input sets with lower computation cost. Experiments on classification and segmentation benchmarks show the effectiveness and efficiency of our methods. Furthermore, we propose a novel application, processing event camera streams as point clouds, and achieve state-of-the-art performance on the DVS128 Gesture Dataset.
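A minimal PyTorch sketch of Gumbel-based subset selection in the spirit of GSS is shown below: soft, differentiable mixtures during training and hard one-hot picks at test time. The scoring layer, shapes, and temperature are assumptions made for illustration, not the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelSubsetSampler(nn.Module):
    def __init__(self, in_dim, m, tau=1.0):
        super().__init__()
        self.m, self.tau = m, tau
        self.score = nn.Linear(in_dim, m)    # one selection slot per output point (assumed design)

    def forward(self, x):                    # x: (batch, n, in_dim)
        logits = self.score(x).transpose(1, 2)                        # (batch, m, n)
        hard = not self.training                                      # "soft" in training, "hard" at test
        weights = F.gumbel_softmax(logits, tau=self.tau, hard=hard, dim=-1)
        return weights @ x                                            # (batch, m, in_dim)

points = torch.randn(2, 1024, 64)
subset = GumbelSubsetSampler(64, m=256)(points)                       # (2, 256, 64)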
[dataset, event, gesture, time, graph, stream] [point, cloud, discrete, linear, position, single, pointnet, relative, camera, absolute, permutation, differentiable] [input, figure, method, study, prior] [group, shuffle, deep, pat, size, performance, gsa, neural, table, mlp, accuracy, arpe, operation, channel, network, layer, number, furthest, effectiveness, cnns, mha, computation, design, structure, output, stochastic, rate, bingbing, inspired, efficiency, achieve] [attention, gumbel, model, reparameterization, introduce, named] [segmentation, area, pointcnn, propose, hierarchical] [sampling, learning, subset, classification, training, loss, test, shared, softmax, set, embedding, data, representative, representation]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Jiancheng and Zhang, Qiang and Ni, Bingbing and Li, Linguo and Liu, Jinxian and Zhou, Mengdie and Tian, Qi},
  title = {Modeling Point Clouds With Self-Attention and Gumbel Subset Sampling},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning With Batch-Wise Optimal Transport Loss for 3D Shape Recognition
Lin Xu, Han Sun, Yuai Liu


Deep metric learning is essential for visual recognition. The widely used pair-wise (or triplet) based loss objectives cannot make full use of semantic information in training samples or give enough attention to hard samples during optimization. Thus, they often suffer from a slow convergence rate and inferior performance. In this paper, we show how to learn an importance-driven distance metric via optimal transport programming from batches of samples. It can automatically emphasize hard examples and lead to significant improvements in convergence. We propose a new batch-wise optimal transport loss and combine it in an end-to-end deep metric learning manner. We use it to learn the distance metric and deep feature representation jointly for recognition. Empirical results on visual retrieval and classification tasks with six benchmark datasets, i.e., MNIST, CIFAR10, SHREC13, SHREC14, ModelNet10, and ModelNet40, demonstrate the superiority of the proposed method. It can accelerate the convergence rate significantly while achieving state-of-the-art recognition performance. For example, in 3D shape recognition experiments, we show that our method can achieve better recognition performance within only 5 epochs than mainstream 3D shape recognition approaches obtain after 200 epochs.
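As background, a generic entropy-regularized optimal transport computation over a batch of embeddings can be sketched with the standard Sinkhorn iterations below. This is the textbook recipe with uniform marginals, not the authors' exact batch-wise loss; the regularization strength and iteration count are assumptions.

import torch

def sinkhorn_plan(cost, eps=0.1, n_iters=50):
    # Entropy-regularized transport plan between two uniform marginals.
    n, m = cost.shape
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    r = torch.full((n,), 1.0 / n)                    # row marginal
    c = torch.full((m,), 1.0 / m)                    # column marginal
    u, v = torch.ones(n) / n, torch.ones(m) / m
    for _ in range(n_iters):
        u = r / (K @ v)
        v = c / (K.t() @ u)
    return torch.diag(u) @ K @ torch.diag(v)         # transport plan

a, b = torch.randn(8, 16), torch.randn(8, 16)        # two batches of embeddings
cost = torch.cdist(a, b)                             # pairwise distance cost matrix
ot_loss = (sinkhorn_plan(cost) * cost).sum()         # batch-wise transport cost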
[recognition] [optimal, shape, computer, ground, tij, pattern, vision, equation, plan, descriptor, international, matrix] [conference, method, ieee, proposed, image, based, figure] [deep, network, number, neural, batch, rate, compared, accuracy, performance, size, precision, table, convolutional, processing, higher, gradient, accelerate, small] [probability, visual, random] [map, feature, benchmark, average, cnn, object] [loss, metric, learning, transport, distance, retrieval, training, objective, mij, transportation, hard, positive, negative, learned, dissimilar, semantical, convergence, representation, classification, embedding, pair, yij, margin, large, learn, mnist, triplet, euclidian, similarity]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Lin and Sun, Han and Liu, Yuai},
  title = {Learning With Batch-Wise Optimal Transport Loss for 3D Shape Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion
Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martin-Martin, Cewu Lu, Li Fei-Fei, Silvio Savarese


A key technical challenge in performing 6D object pose estimation from an RGB-D image is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performance in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating the 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embeddings, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches on two datasets, YCB-Video and LineMOD. We also deploy our proposed method on a real robot to grasp and manipulate objects based on the estimated pose.
[fusion, recognition, dataset, prediction, challenge, work, extract, previous, key, outperforms, performs, perform] [pose, estimation, point, dense, computer, depth, cloud, vision, approach, geometric, occlusion, estimated, pattern, international, estimate, posecnn, linemod, rgb, pointfusion, confidence, robust, orientation, grasp, grasping, pointnet, icp, directly] [image, method, color, ieee, conference, based, input, prior, proposed, pixel, figure] [network, architecture, deep, table, residual, performance, compare] [iterative, model, robot, procedure, arxiv, preprint, transformed, generate] [object, refinement, feature, segmentation, global, detection, heavy, final, box, cluttered, bounding] [learning, data, embedding, symmetric, set, novel, loss, main, training]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Chen and Xu, Danfei and Zhu, Yuke and Martin-Martin, Roberto and Lu, Cewu and Fei-Fei, Li and Savarese, Silvio},
  title = {DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dense Depth Posterior (DDP) From Single Image and Sparse Range
Yanchao Yang, Alex Wong, Stefano Soatto


We present a deep learning system to infer the posterior distribution of a dense depth map associated with an image, by exploiting sparse range measurements, for instance from a lidar. While the lidar may provide a depth value for a small percentage of the pixels, we exploit regularities reflected in the training set to complete the map so as to have a probability over depth for each pixel in the image. We exploit a Conditional Prior Network, that allows associating a probability to each depth value given an image, and combine it with a likelihood term that uses the sparse measurements. Optionally we can also exploit the availability of stereo during training, but in any case only require a single image and a sparse point cloud at run-time. We test our approach on both unsupervised and supervised depth completion using the KITTI benchmark, and improve the state-of-the-art in both.
[term, dataset, optical, performs] [depth, dense, completion, point, single, kitti, vision, computer, stereo, approach, ground, corresponding, lidar, cloud, estimate, pattern, range, scene, note, rmse, truth, photometric, reconstruction, error, wcp, imae, monocular, international, provide, morphological, view] [image, conditional, prior, conference, method, input, ieee, proposed, raw, quantitative, comparison, control, figure, based, study] [sparse, network, deep, better, performance, cpn, convolutional, layer, outperform, applied, norm, achieves] [model, arxiv, preprint, probability, observed, evaluate, random, validity] [map, benchmark, mae, semantics, branch, detailed, european, instance] [unsupervised, learning, posterior, training, likelihood, set, supervised, loss, exploit, data, test, distribution, metric, rank]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Yanchao and Wong, Alex and Soatto, Stefano},
  title = {Dense Depth Posterior (DDP) From Single Image and Sparse Range},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DuLa-Net: A Dual-Projection Network for Estimating Room Layouts From a Single RGB Panorama
Shang-Ta Yang, Fu-En Wang, Chi-Han Peng, Peter Wonka, Min Sun, Hung-Kuo Chu


We present a deep learning framework, called DuLa-Net, to predict Manhattan-world 3D room layouts from a single RGB panorama. To achieve better prediction accuracy, our method leverages two projections of the panorama at once, namely the equirectangular panorama-view and the perspective ceiling-view, each of which contains different clues about the room layout. Our network architecture consists of two encoder-decoder branches for analyzing each of the two views. In addition, a novel feature fusion structure is proposed to connect the two branches, which are then jointly trained to predict the 2D floor plans and layout heights. To learn more complex room layouts, we introduce the Realtor360 dataset, which contains panoramas of Manhattan-world room layouts with different numbers of corners. Experimental results show that our work outperforms the recent state of the art in prediction accuracy and performance, especially in rooms with non-cuboid layouts.
[dataset, fusion, prediction, complex, predict, outperforms, recognition, jointly, consists, work] [floor, plan, computer, layoutnet, single, panorama, indoor, vision, perspective, scene, manhattan, pattern, ceiling, international, rgb, estimated, panocontext, depth, june, panoramic, conversion, position, estimating, note, camera, shape, cuboid, tool, optimization, estimation] [method, image, input, conference, ieee, quantitative, figure] [network, equirectangular, architecture, neural, table, performance, accuracy, process, connected, scheme, output] [room, probability, step, introduce] [layout, iou, map, feature, final, height, propose, polygon, fused, three, global] [training, set, stanford, learning, trained]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Shang-Ta and Wang, Fu-En and Peng, Chi-Han and Wonka, Peter and Sun, Min and Chu, Hung-Kuo},
  title = {DuLa-Net: A Dual-Projection Network for Estimating Room Layouts From a Single RGB Panorama},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Veritatem Dies Aperit - Temporally Consistent Depth Prediction Enabled by a Multi-Task Geometric and Semantic Scene Understanding Approach
Amir Atapour-Abarghouei, Toby P. Breckon


Robust geometric and semantic scene understanding is ever more important in many real-world applications such as autonomous driving and robotic navigation. In this paper, we propose a multi-task learning-based approach capable of jointly performing geometric and semantic scene understanding, namely depth prediction (monocular depth estimation and depth completion) and semantic scene segmentation. Within a single temporally constrained recurrent network, our approach uniquely takes advantage of a complex series of skip connections, adversarial training and the temporal constraint of sequential frame recurrence to produce consistent depth and semantic class labels simultaneously. Extensive experimental evaluation demonstrates the efficacy of our approach compared to other contemporary state-of-the-art techniques.
[temporal, prediction, flow, video, time, optical, work, temporally, recurrent, previous, perform, dataset, amir, frame, despite] [depth, computer, pattern, vision, approach, ground, scene, estimation, monocular, truth, completion, single, colour, stereo, toby, consistent, rgb, well, dense, rmse] [image, ieee, input, figure, synthetic, method, inpainting, based, capable, separate, generative, feedback] [network, deep, output, skip, convolutional, neural, table, accuracy, better, performance, processing, compared, architecture] [model, generated, adversarial, machine, step, understanding, evaluation] [semantic, segmentation, feature, spatial, object, utilize] [trained, data, training, loss, learning, test, class, supervised, domain]
@InProceedings{Atapour-Abarghouei_2019_CVPR,
  author = {Atapour-Abarghouei, Amir and Breckon, Toby P.},
  title = {Veritatem Dies Aperit - Temporally Consistent Depth Prediction Enabled by a Multi-Task Geometric and Semantic Scene Understanding Approach},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Segmentation-Driven 6D Object Pose Estimation
Yinlin Hu, Joachim Hugonot, Pascal Fua, Mathieu Salzmann


The most recent trend in estimating the 6D pose of rigid objects has been to train deep networks to either directly regress the pose from the image or to predict the 2D locations of 3D keypoints, from which the pose can be obtained using a PnP algorithm. In both cases, the object is treated as a global entity, and a single pose estimate is computed. As a consequence, the resulting techniques can be vulnerable to large occlusions. In this paper, we introduce a segmentation-driven 6D pose estimation framework where each visible part of the objects contributes a local pose prediction in the form of 2D keypoint locations. We then use a predicted measure of confidence to combine these pose candidates into a robust set of 3D-to-2D correspondences, from which a reliable pose estimate can be obtained. We outperform the state-of-the-art on the challenging Occluded-LINEMOD and YCB-Video datasets, which is evidence that our approach deals well with multiple poorly-textured objects occluding each other. Furthermore, it relies on a simple enough architecture to achieve real-time performance.
[multiple, stream, report, dataset, human, fusion, predict, perform, outperforms, state, relies, work] [pose, computer, estimation, local, approach, pattern, international, keypoint, robust, vision, pnp, confidence, posecnn, vincent, accurate, single, keypoints, stefan, rigid, corresponding, predicts, cpm, journal, analysis, well, algorithm, typically, note] [conference, image, method, ieee, figure, input, comparison] [table, network, architecture, compare, better, deep, best, output, effective, fast] [model, random, alexander, machine, simple] [object, segmentation, predicted, global, grid, regression, mask, pascal, box, bounding, location, heatmaps, feature, presence, average, spatial, challenging] [large, loss, learning, class, strategy]
@InProceedings{Hu_2019_CVPR,
  author = {Hu, Yinlin and Hugonot, Joachim and Fua, Pascal and Salzmann, Mathieu},
  title = {Segmentation-Driven 6D Object Pose Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Exploiting Temporal Context for 3D Human Pose Estimation in the Wild
Anurag Arnab, Carl Doersch, Andrew Zisserman


We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos. Unlike previous algorithms which operate on single frames, we show that reconstructing a person over an entire sequence gives extra constraints that can resolve ambiguities. This is because videos often give multiple views of a person, yet the overall body shape does not change and 3D positions vary slowly. Our method improves not only on standard mocap-based datasets like Human 3.6M -- where we show quantitative improvements -- but also on challenging in-the-wild datasets such as Kinetics. Building upon our algorithm, we present a new dataset of more than 3 million frames of YouTube videos from Kinetics with automatically generated 3D poses and meshes. We show that retraining a single-frame 3D pose estimator on this data improves accuracy on both real-world and mocap data by evaluating on the 3DPW and HumanEVA datasets.
[human, kinetics, dataset, hmr, temporal, motion, joint, video, people, multiple, mocap, previous, capture, frame, term, youtube, action, initialisation, humaneva, work, jointly, sequence] [pose, bundle, adjustment, smpl, estimation, shape, single, body, error, camera, keypoints, note, ground, reprojection, additional, algorithm, monocular, truth, keypoint, approach, mpjpe, fitting, robust, optimisation, total, accurate, estimator, initial, well, mesh, respect] [method, image, prior, consistency, input] [network, table, neural, original, automatically, structure] [model, arxiv, diversity] [improve, improves, person, detector, supervision] [data, training, trained, datasets, loss, selected, function, learning, objective, optimise, representation]
@InProceedings{Arnab_2019_CVPR,
  author = {Arnab, Anurag and Doersch, Carl and Zisserman, Andrew},
  title = {Exploiting Temporal Context for 3D Human Pose Estimation in the Wild},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
What Do Single-View 3D Reconstruction Networks Learn?
Maxim Tatarchenko, Stephan R. Richter, Rene Ranftl, Zhuwen Li, Vladlen Koltun, Thomas Brox


Convolutional networks for single-view object reconstruction have shown impressive performance and have become a popular subject of research. All existing techniques are united by the idea of having an encoder-decoder network that performs non-trivial reasoning about the 3D structure of the output space. In this work, we set up two alternative approaches that perform image classification and retrieval respectively. These simple baselines yield better results than state-of-the-art methods, both qualitatively and quantitatively. We show that encoder-decoder methods are statistically indistinguishable from these baselines, thus indicating that the current state of the art in single-view object reconstruction does not actually perform reconstruction but image classification. We identify aspects of popular experimental procedures that elicit this behavior and discuss ways to improve the current state of research.
[recognition, perform, dataset, current, prediction, outperforms] [reconstruction, shape, single, depth, ogn, matryoshka, point, approach, atlasnet, surface, voxel, pure, ground, well, shapenet, coordinate, percentage, truth, monocular, problem, geometric, volumetric] [image, figure, input, reconstructed, high, method, based] [convolutional, structure, output, network, number, performance, precision, better, deep] [model, find, evaluation, reasoning, evaluate] [object, iou, baseline, semantic, map, miou, recall, predicted] [retrieval, training, learning, distance, clustering, oracle, test, set, similarity, cluster, measure, existing, representation, trained, data, class, classification, experimental, space, observe]
@InProceedings{Tatarchenko_2019_CVPR,
  author = {Tatarchenko, Maxim and Richter, Stephan R. and Ranftl, Rene and Li, Zhuwen and Koltun, Vladlen and Brox, Thomas},
  title = {What Do Single-View 3D Reconstruction Networks Learn?},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
UniformFace: Learning Deep Equidistributed Representation for Face Recognition
Yueqi Duan, Jiwen Lu, Jie Zhou


In this paper, we propose a new supervision objective named uniform loss to learn deep equidistributed representations for face recognition. Most existing methods aim to learn discriminative face features, encouraging large inter-class distances and small intra-class variations. However, they ignore the distribution of faces in the holistic feature space, which may lead to severe locality and unbalance. With the prior that faces lie on a hypersphere manifold, we impose an equidistributed constraint by uniformly spreading the class centers on the manifold, so that the minimum distance between class centers can be maximized through complete exploitation of the feature space. To this end, we consider the class centers as like charges on the surface of a hypersphere with inter-class repulsion, and minimize the total electric potential energy as the uniform loss. Extensive experimental results on the MegaFace Challenge I, IARPA Janus Benchmark A (IJB-A), YouTube Faces (YTF) and Labeled Faces in the Wild (LFW) datasets show the effectiveness of the proposed uniform loss.
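The "like charges on a hypersphere" idea translates into a simple penalty on inverse pairwise distances between L2-normalized class centers. The sketch below is a hedged approximation (it averages distinct pairs and adds a small epsilon for stability); it is not the released training code.

import torch
import torch.nn.functional as F

def uniform_loss(centers, eps=1e-6):
    # Total "electric potential energy" of class centers on the unit hypersphere:
    # penalizing sum of 1/distance pushes the centers to spread out uniformly.
    c = F.normalize(centers, dim=1)          # (num_classes, dim) on the unit sphere
    d = torch.pdist(c)                       # distances over all distinct center pairs
    return (1.0 / (d + eps)).mean()

centers = torch.randn(10, 128, requires_grad=True)
uniform_loss(centers).backward()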
[recognition, dataset, employed, joint] [varying, uniformly, range, repulsion, local, locality] [face, proposed, comparison, unconstrained, based, figure, method, high] [deep, verification, table, energy, effectiveness, small, performance, employ, network, connected, accuracy, binary, standard, compared, grant] [potential, manifold, consider] [feature, average, supervision, identification, center, holistic, cnn, fully, benchmark] [loss, uniform, class, uniformface, large, learning, training, hypersphere, minimum, distance, learned, equidistributed, lfw, ytf, distribution, megaface, discriminative, learn, softmax, sphereface, angular, representation, set, experimental, existing, gallery, protocol, cosface, china, datasets, data, margin, observe, space, exploit]
@InProceedings{Duan_2019_CVPR,
  author = {Duan, Yueqi and Lu, Jiwen and Zhou, Jie},
  title = {UniformFace: Learning Deep Equidistributed Representation for Face Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Semantic Graph Convolutional Networks for 3D Human Pose Regression
Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, Dimitris N. Metaxas


In this paper, we study the problem of learning Graph Convolutional Networks (GCNs) for regression. Current architectures of GCNs are limited to the small receptive field of convolution filters and a shared transformation matrix for each node. To address these limitations, we propose Semantic Graph Convolutional Networks (SemGCN), a novel neural network architecture that operates on regression tasks with graph-structured data. SemGCN learns to capture semantic information such as local and global node relationships, which is not explicitly represented in the graph. These semantic relationships can be learned through end-to-end training from the ground truth without additional supervision or hand-crafted rules. We further investigate applying SemGCN to 3D human pose regression. Our formulation is intuitive and sufficient since both 2D and 3D human poses can be represented as a structured graph encoding the relationships between joints in the skeleton of a human body. We carry out comprehensive studies to validate our method. The results prove that SemGCN outperforms the state of the art while using 90% fewer parameters.
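One way to picture a graph convolution that learns per-edge weights restricted to the skeleton adjacency is the hedged sketch below; the masking-plus-softmax scheme and the layer interface are illustrative assumptions, not the exact SemGCN layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        self.register_buffer("mask", adj > 0)             # (J, J) skeleton connectivity, incl. self-loops
        self.edge_logits = nn.Parameter(torch.zeros_like(adj, dtype=torch.float))
        self.proj = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x):                                  # x: (batch, J, in_dim)
        logits = self.edge_logits.masked_fill(~self.mask, float("-inf"))
        attn = F.softmax(logits, dim=-1)                   # learned per-edge weights
        return attn @ self.proj(x)                         # aggregate neighbor features

# toy usage on a 3-joint chain
adj = torch.tensor([[1., 1., 0.], [1., 1., 1.], [0., 1., 1.]])
out = SemGraphConv(2, 16, adj)(torch.randn(4, 3, 2))       # (4, 3, 16)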
[human, graph, semgcn, joint, semgconv, framework, gcns, previous, temporal, dataset, state, work, action, represented, long, learns, skeleton] [pose, estimation, matrix, ground, truth, approach, directly, note, local, vision, single, camera, monocular, yichen, problem, limited, formulation, computer] [image, method, proposed, transformation, based, study, input, ieee] [network, convolutional, neural, deep, convolution, relu, performance, batchnorm, configuration, architecture, kernel, fewer, operation, table] [node, evaluation, model, dimitris, visual, system] [semantic, regression, global, feature, backbone, propose, baseline] [learning, training, loss, data, weighting, protocol, novel, shared, function, testing, address]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Long and Peng, Xi and Tian, Yu and Kapadia, Mubbasir and Metaxas, Dimitris N.},
  title = {Semantic Graph Convolutional Networks for 3D Human Pose Regression},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Mask-Guided Portrait Editing With Conditional GANs
Shuyang Gu, Jianmin Bao, Hao Yang, Dong Chen, Fang Wen, Lu Yuan


Portrait editing is a popular subject in photo manipulation. The Generative Adversarial Network (GAN) advances the generation of realistic faces and enables more flexible face editing. In this paper, we identify three issues in existing techniques: diversity, quality, and controllability for portrait synthesis and editing. To address these issues, we propose a novel end-to-end learning framework that leverages conditional GANs guided by provided face masks for generating faces. The framework learns feature embeddings for every face component (e.g., mouth, hair, eye) separately, contributing to better correspondences for image translation and local face editing. With the mask, our network supports many applications, such as mask-driven face synthesis, face Swap+ (which includes hair in the swap), and local manipulation. It can also modestly boost the performance of face parsing when used for data augmentation.
[framework, dataset] [local, computer, vision, pattern, allows, international, volume] [face, image, facial, background, component, generative, figure, conference, editing, input, hair, synthesis, portrait, realistic, ieee, conditional, helen, skin, changing, color, method, based, change, appearance, quality, proposed, style, result, acm] [network, neural, deep, output, tensor, table] [generated, generate, arxiv, adversarial, preprint, diversity, model, gans, diverse, visual, equivariant, generating, controllability, gan] [mask, feature, parsing, foreground, global, three, fusing, propose, region, guided] [target, embedding, source, loss, transfer, training, data, train, learning, embeddings, function, existing, learn]
@InProceedings{Gu_2019_CVPR,
  author = {Gu, Shuyang and Bao, Jianmin and Yang, Hao and Chen, Dong and Wen, Fang and Yuan, Lu},
  title = {Mask-Guided Portrait Editing With Conditional GANs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Group Sampling for Scale Invariant Face Detection
Xiang Ming, Fangyun Wei, Ting Zhang, Dong Chen, Fang Wen


Detectors based on deep learning tend to detect multi-scale faces on a single input image for efficiency. Recent works, such as FPN and SSD, generally use feature maps from multiple layers with different spatial resolutions to detect objects at different scales, e.g., high-resolution feature maps for small objects. However, we find that such multi-layer prediction is not necessary. Faces at all scales can be well detected with features from a single layer of the network. In this paper, we carefully examine the factors affecting face detection across a large range of scales, and conclude that the balance of training samples, including both positive and negative ones, at different scales is the key. We propose a group sampling method which divides the anchors into several groups according to their scale, and ensures that the number of samples for each group is the same during training. Our approach, using only the last layer of FPN as features, is able to advance the state of the art. Comprehensive analysis and extensive experiments have been conducted to show the effectiveness of the proposed method. Our approach, evaluated on face detection benchmarks including the FDDB and WIDER FACE datasets, achieves state-of-the-art results without bells and whistles.
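A minimal sketch of the scale-grouped sampling idea, where anchors are bucketed by scale and an equal number of training samples is drawn from every bucket, might look like this; the quantile-based bucketing and sample budget are assumptions made for illustration.

import numpy as np

def group_sample(anchor_scales, per_group=256, n_groups=4, seed=0):
    # Bucket anchors into scale groups, then draw the same number of training
    # samples from every group so that no single scale dominates the loss.
    rng = np.random.default_rng(seed)
    edges = np.quantile(anchor_scales, np.linspace(0.0, 1.0, n_groups + 1))
    group_id = np.clip(np.searchsorted(edges, anchor_scales, side="right") - 1, 0, n_groups - 1)
    keep = []
    for g in range(n_groups):
        idx = np.where(group_id == g)[0]
        keep.append(rng.choice(idx, size=min(per_group, len(idx)), replace=False))
    return np.concatenate(keep)            # indices of anchors used for training

sampled = group_sample(np.random.uniform(16, 512, size=20000))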
[multiple, key, previous] [computer, vision, pattern, single, international, focal, approach, matching, analysis] [face, conference, ieee, proposed, method, figure, based, image, comparison, handle] [scale, group, performance, number, small, fast, accuracy, network, better, achieves, stride, layer, deep, table, higher, compared, neural, compare, convolutional, architecture] [find, ensure] [feature, detection, anchor, fpn, object, iou, wider, map, ohem, propose, pyramid, improve, cascade, european, fddb, faster, detecting, rpn, adopt] [training, sampling, loss, positive, distribution, data, learning, negative, hard, imbalance, set, large, imbalanced, sample, medium, alignment, issue, randomly]
@InProceedings{Ming_2019_CVPR,
  author = {Ming, Xiang and Wei, Fangyun and Zhang, Ting and Chen, Dong and Wen, Fang},
  title = {Group Sampling for Scale Invariant Face Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Joint Representation and Estimator Learning for Facial Action Unit Intensity Estimation
Yong Zhang, Baoyuan Wu, Weiming Dong, Zhifeng Li, Wei Liu, Bao-Gang Hu, Qiang Ji


Facial action unit (AU) intensity is an index to characterize human expressions. Accurate AU intensity estimation depends on three major elements: image representation, intensity estimator, and supervisory information. Most existing methods learn the intensity estimator with a fixed image representation, and rely on the availability of fully annotated supervisory information. In this paper, a novel general framework for AU intensity estimation is presented, which differs from traditional estimation methods in two aspects. First, rather than keeping the image representation fixed, it simultaneously learns the representation and intensity estimator to achieve an optimal solution. Second, it allows incorporating weak supervisory training signals from human knowledge (e.g. feature smoothness, label smoothness, label ranking, and positive label), which makes our model trainable even when fully annotated information is not available. More specifically, human knowledge is represented as either soft or hard constraints, which are encoded as regularization terms or equality/inequality constraints, respectively. On top of our novel framework, we additionally propose an efficient optimization algorithm based on the Alternating Direction Method of Multipliers (ADMM). Evaluations on two benchmark databases show that our method outperforms competing methods under different ratios of AU intensity annotations, especially for small ratios.
[action, human, fera, joint, kjre, framework, temporal, jointly, frame, key, bormir, dynamic, dsrvm, icc, incorporate, svr, prediction, rvr, sovrim, cor, hssr, osvr, learns] [estimator, estimation, smoothness, ordinal, problem, limited, general, algorithm, provide] [intensity, facial, method, disfa, image, pcc, expression, comparison, database, based] [performance, deep, unit, better, small, table, achieves, denotes] [model, encoded, arg, random, type, vector] [annotated, annotation, weak, feature, mae, propose, weakly, regression, supervision] [representation, learning, label, labeled, knowledge, training, unlabeled, learn, supervised, positive, ranking, min, set, large, testing, neighbor, hard, exploit, soft]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Yong and Wu, Baoyuan and Dong, Weiming and Li, Zhifeng and Liu, Wei and Hu, Bao-Gang and Ji, Qiang},
  title = {Joint Representation and Estimator Learning for Facial Action Unit Intensity Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Semantic Alignment: Finding Semantically Consistent Ground-Truth for Facial Landmark Detection
Zhiwei Liu, Xiangyu Zhu, Guosheng Hu, Haiyun Guo, Ming Tang, Zhen Lei, Neil M. Robertson, Jinqiao Wang


Recently, deep learning based facial landmark detection has achieved great success. Despite this, we notice that semantic ambiguity greatly degrades detection performance. Specifically, semantic ambiguity means that some landmarks (e.g. those evenly distributed along the face contour) do not have a clear and accurate definition, causing inconsistent annotations (random errors) introduced by annotators. Accordingly, these inconsistent annotations, which are usually provided by public databases, commonly serve as the (inaccurate) ground truth to supervise network training, leading to degraded accuracy. To our knowledge, very little research has investigated this problem. In this paper, we propose a novel probabilistic model which introduces a latent variable, i.e. the 'real' ground truth which is semantically consistent, to optimize. This framework couples two parts: (1) training the landmark detection CNN and (2) searching for the 'real' ground truth. These two parts are alternately optimized: the searched 'real' ground truth supervises the CNN training, and the trained CNN assists the searching of the 'real' ground truth. In addition, to correct or recover landmarks predicted with low confidence due to occlusion and low image quality, we propose a global heatmap correction unit (GHCU) to correct outliers by considering the global face shape as a constraint. Extensive experiments on both image-based (300W and AFLW) and video-based (300VW) databases demonstrate that our method effectively improves landmark detection accuracy and achieves state-of-the-art performance.
[human, state, work] [computer, pattern, vision, observation, ambiguity, optimization, hourglass, shape, position, lab, error, local, variable, problem, pose, template, consistent, analysis, confidence] [landmark, face, facial, ghcu, conference, ieee, method, prior, latent, based, aflw, nme, figure, inconsistent, correction, image, pca, conduct] [network, deep, search, table, performance, gaussian, size, searching, accuracy, reduce, achieve, better, full, searched, low, unit, optimized, architecture] [model, random, provided, strong, find, semantically] [semantic, heatmap, detection, predicted, cnn, global, roughly, annotation, category, weak, propose, challenging, art, leading, regression] [training, alignment, test, set, likelihood, loss, probabilistic, learning, train]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Zhiwei and Zhu, Xiangyu and Hu, Guosheng and Guo, Haiyun and Tang, Ming and Lei, Zhen and Robertson, Neil M. and Wang, Jinqiao},
  title = {Semantic Alignment: Finding Semantically Consistent Ground-Truth for Facial Landmark Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
LAEO-Net: Revisiting People Looking at Each Other in Videos
Manuel J. Marin-Jimenez, Vicky Kalogeiton, Pablo Medina-Suarez, Andrew Zisserman


Capturing the 'mutual gaze' of people is essential for understanding and interpreting the social interactions between them. To this end, this paper addresses the problem of detecting people Looking At Each Other (LAEO) in video sequences. For this purpose, we propose LAEO-Net, a new deep CNN for determining LAEO in videos. In contrast to previous works, LAEO-Net takes spatio-temporal tracks as input and reasons about the whole track. It consists of three branches, one for each character's tracked head and one for their relative position. Moreover, we introduce two new LAEO datasets: UCO-LAEO and AVA-LAEO. A thorough experimental evaluation demonstrates the ability of LAEO-Net to successfully determine if two people are LAEO and the temporal window where it happens. Our model achieves state-of-the-art results on the existing TVHID-LAEO video dataset, significantly outperforming previous approaches.
[laeo, people, human, track, video, social, frame, temporal, dataset, consists, tvhid, determining, determine, determines, work, vfoa, report, interaction, mar, previous, window, fusion, ava, yaw, sequence] [pose, relative, position, problem, note, university, scene] [gaze, figure, synthetic, input, eye, facial, study, based, image, face] [layer, performance, conv, network, table, achieves, apply, automatically] [model, visual, introduce, describe, generate, evaluate, red, provided] [head, branch, score, three, detection, level, detecting, person, annotated, average, detector] [training, pair, train, data, shot, negative, datasets, test, hard, learning, loss, testing]
@InProceedings{Marin-Jimenez_2019_CVPR,
  author = {Marin-Jimenez, Manuel J. and Kalogeiton, Vicky and Medina-Suarez, Pablo and Zisserman, Andrew},
  title = {LAEO-Net: Revisiting People Looking at Each Other in Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Robust Facial Landmark Detection via Occlusion-Adaptive Deep Networks
Meilu Zhu, Daming Shi, Mingjie Zheng, Muhammad Sadiq


In this paper, we present a simple and effective framework called Occlusion-adaptive Deep Networks (ODN) with the purpose of solving the occlusion problem for facial landmark detection. In this model, the occlusion probability of each position in the high-level features is inferred by a distillation module that can be learnt automatically in the process of estimating the relationship between facial appearance and facial shape. The occlusion probability serves as an adaptive weight on the high-level features to reduce the impact of occlusion and obtain a clean feature representation. Nevertheless, the clean feature representation cannot represent the holistic face due to the missing semantic features. To obtain an exhaustive and complete feature representation, it is vital that we leverage a low-rank learning module to recover the lost features. Considering that facial geometric characteristics are conducive to the low-rank module recovering lost features, we propose a geometry-aware module to excavate geometric relationships between different facial components. Owing to the synergistic effect of the three modules, the proposed network achieves better performance in comparison to state-of-the-art methods on challenging benchmark datasets.
[dataset, work, consists, structural, capture] [computer, occlusion, pattern, vision, geometric, robust, local, shape, international, problem, matrix, pose, occluded, analysis, active] [face, facial, ieee, conference, proposed, landmark, odn, recover, method, clean, missing, nrmse, comparison, image, appearance, cofw, input, aflw, outer, figure] [deep, performance, table, convolutional, process, neural, residual, layer] [model, probability, relationship, evaluation, lost] [module, feature, regression, three, challenging, map, locate, heatmap, cnn, localization, detection] [learning, distillation, set, alignment, training, representation, rank, testing, existing, experimental, function, data, test, learnt]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Meilu and Shi, Daming and Zheng, Mingjie and Sadiq, Muhammad},
  title = {Robust Facial Landmark Detection via Occlusion-Adaptive Deep Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Individual Styles of Conversational Gesture
Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik


Human speech is often accompanied by hand and arm gestures. We present a method for cross-modal translation from "in-the-wild" monologue speech of a single speaker to their conversational gesture motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system. Our proposed model significantly outperforms baseline methods in a quantitative comparison. To support research toward obtaining a computational understanding of the relationship between gesture and speech, we release a large video dataset of person-specific gestures.
[gesture, speech, motion, audio, video, speaker, predict, temporal, conversational, human, sequence, signal, prediction, dataset, work, recognition, predicting, arm, sound, time, outperforms, spectrogram] [pose, ground, truth, initial, computer, vision, hand, corresponding, pattern, approach, international, analysis, virtual, keypoints, pck, skeletal, supplementary, well] [input, translation, method, figure, proposed, acm, study, quantitative, real, conference, unet] [compare, network, full, neural, table, convolutional] [model, adversarial, discriminator, generate, find, language, median, gan, automatic, sign, multimodal, random] [predicted, regression, person, berkeley, detection, average] [training, learning, data, pseudo, large, task, trained, representation, set, loss, train, learn]
@InProceedings{Ginosar_2019_CVPR,
  author = {Ginosar, Shiry and Bar, Amir and Kohavi, Gefen and Chan, Caroline and Owens, Andrew and Malik, Jitendra},
  title = {Learning Individual Styles of Conversational Gesture},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Face Anti-Spoofing: Model Matters, so Does Data
Xiao Yang, Wenhan Luo, Linchao Bao, Yuan Gao, Dihong Gong, Shibao Zheng, Zhifeng Li, Wei Liu


Face anti-spoofing is an important task in full-stack face applications including face detection, verification, and recognition. Previous approaches build models on datasets which do not simulate real-world data well (e.g., small scale, insignificant variance, etc.). Existing models may rely on auxiliary information, which prevents these anti-spoofing solutions from generalizing well in practice. In this paper, we present a data collection solution along with a data synthesis technique to simulate digital medium-based face spoofing attacks, which can easily help us obtain a large amount of training data that well reflects real-world scenarios. By exploiting a novel Spatio-Temporal Anti-Spoof Network (STASN), we are able to push the performance on public face anti-spoofing datasets over state-of-the-art methods by a large margin. Since the proposed model can automatically attend to discriminative regions, it makes analyzing the behaviors of the network possible. We conduct extensive experiments and show that the proposed model can distinguish spoof faces by extracting features from a variety of regions to seek out subtle evidence such as borders, moire patterns, and reflection artifacts.
[temporal, video, recognition, dataset, second, previous, multiple] [local, international, computer, vision, corresponding, position, initial, practical, error, pattern] [face, spoof, image, conference, live, tasm, method, proposed, ieee, spoofing, biometrics, stasn, collected, antispoofing, conduct, subtle, reflection, based, faceds, jukka, abdenour, amount, figure, real, siw, forensics] [network, performance, best, convolutional, table, rate, better, number, deep, structure, layer] [model, attention, attended, random, attack] [module, region, detection, spatial, cnn, feature, three, ram, including, global, map, score] [data, auxiliary, training, learning, testing, datasets, learn, large, discriminative, cross, positive, negative, set, train, protocol]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Xiao and Luo, Wenhan and Bao, Linchao and Gao, Yuan and Gong, Dihong and Zheng, Shibao and Li, Zhifeng and Liu, Wei},
  title = {Face Anti-Spoofing: Model Matters, so Does Data},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fast Human Pose Estimation
Feng Zhang, Xiatian Zhu, Mao Ye


Existing human pose estimation approaches often only consider how to improve model generalisation performance, while putting aside the significant efficiency problem. This leads to the development of heavy models with poor scalability and cost-effectiveness in practical use. In this work, we investigate the under-studied but practically critical pose model efficiency problem. To this end, we present a new Fast Pose Distillation (FPD) model learning strategy. Specifically, the FPD trains a lightweight pose neural network architecture capable of executing rapidly with low computational cost. This is achieved by effectively transferring the pose structure knowledge of a strong teacher network. Extensive evaluations demonstrate the advantages of our FPD method over a broad range of state-of-the-art pose estimation approaches in terms of model cost-effectiveness on two standard benchmark datasets, MPII Human Pose and Leeds Sports Pose.
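A hedged sketch of a pose-distillation objective in this spirit: the lightweight student is trained to match both the annotated ground-truth confidence maps and the frozen teacher's maps. The mixing weight and plain MSE formulation are assumptions, not the paper's exact loss.

import torch
import torch.nn.functional as F

def pose_distillation_loss(student_maps, teacher_maps, gt_maps, alpha=0.5):
    # Match the strong (frozen) teacher's joint confidence maps ...
    mimic = F.mse_loss(student_maps, teacher_maps.detach())
    # ... while still regressing the annotated ground-truth maps.
    supervise = F.mse_loss(student_maps, gt_maps)
    return alpha * mimic + (1.0 - alpha) * supervise

student = torch.rand(2, 16, 64, 64, requires_grad=True)   # 16 joint heatmaps (assumed)
teacher = torch.rand(2, 16, 64, 64)
gt = torch.rand(2, 16, 64, 64)
pose_distillation_loss(student, teacher, gt).backward()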
[human, joint, mpii, auc, prediction, dataset] [pose, estimation, computer, hourglass, vision, confidence, pattern, lsp, analysis, alternative, provide, error, international, limited, problem] [conference, proposed, ieee, method, based, mse, figure, row, image, high] [fpd, network, deep, table, neural, performance, small, efficiency, lightweight, architecture, accuracy, cost, fast, structure, number, design, compared, efficient, convolutional, deployment, original, computational, inference, whilst, science, building, effective, parameter, highly] [model, strong] [cnn, map, benchmark, object, person, supervision, european, head] [knowledge, training, distillation, teacher, loss, learning, target, function, existing, large, train, test, transfer, student, generalisation, labelled, data]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Feng and Zhu, Xiatian and Ye, Mao},
  title = {Fast Human Pose Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Decorrelated Adversarial Learning for Age-Invariant Face Recognition
Hao Wang, Dihong Gong, Zhifeng Li, Wei Liu


There has been an increasing research interest in age-invariant face recognition. However, matching faces with big age gaps remains a challenging problem, primarily due to the significant discrepancy of face appearance caused by aging. To reduce such discrepancy, in this paper we present a novel algorithm to remove age-related components from features mixed with both identity and age information. Specifically, we factorize a mixed face feature into two uncorrelated components: identity-dependent component and age-dependent component, where the identity-dependent component contains information that is useful for face recognition. To implement this idea, we propose the Decorrelated Adversarial Learning (DAL) algorithm, where a Canonical Mapping Module (CMM) is introduced to find maximum correlation of the paired features generated by the backbone network, while the backbone network and the factorization module are trained to generate features reducing the correlation. Thus, the proposed model learns the decomposed features of age and identity whose correlation is significantly reduced. Simultaneously, the identity-dependent feature and the age-dependent feature are supervised by ID and age preserving signals respectively to ensure they contain the correct information. Extensive experiments have been conducted on the popular public-domain face aging datasets (FG-NET, MORPH Album 2, and CACD-VS) to demonstrate the effectiveness of the proposed approach.
[recognition, dataset, joint, framework, challenge] [vision, computer, canonical, pattern, analysis, algorithm, international, initial, linear] [face, age, dal, conference, xid, method, identity, ieee, xage, proposed, decomposed, morph, decorrelated, aging, album, figure, aifr, based, image, conduct, frequency, mapping, facial, bcca, component, generative] [correlation, deep, small, residual, factorization, neural, table, network, group, performance, factor, convolutional, processing, reduce, compared, regularization] [adversarial, model, evaluation, introduced] [feature, module, identification, backbone, supervision, propose, improve] [learning, training, cosine, large, trained, set, invariant, testing, similarity, maximum, discriminating, loss, conducted, train, discriminative, learned, classification, cca]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Hao and Gong, Dihong and Li, Zhifeng and Liu, Wei},
  title = {Decorrelated Adversarial Learning for Age-Invariant Face Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Cross-Task Weakly Supervised Learning From Instructional Videos
Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic


In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps, instead of strong supervision via temporal annotations. At the heart of our approach is the observation that weakly supervised learning may be easier if a model shares components while learning different steps: "pour egg" should be trained jointly with other tasks involving "pour" and "egg". We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps. Past data does not permit a systematic study of sharing, so we also gather a new dataset aimed at assessing cross-task sharing. Our experiments demonstrate that sharing across tasks improves performance, especially when done at the component level, and that our component model can parse previously unseen tasks by virtue of its compositionality.
[pour, instructional, temporal, dataset, video, action, work, whisk, build, time, shelf, ordered, egg, framework, narration, lemonade, ordering, complex, steak, crosstask] [approach, cut, well, assume, optimization, form, problem] [component, method, figure, proposed] [sharing, performance, number, add, order, table, full, filter, empirically, compare, standard] [model, step, primary, visual, making, parse, milk, goal, evaluate, tomato, strawberry, compositional, diverse] [weakly, average, improves, improve, car, recall, supervision, annotated, semantic] [learning, supervised, task, set, learn, data, classifier, unseen, mixture, training, list, test, large, share, shared, function, unsupervised, train]
@InProceedings{Zhukov_2019_CVPR,
  author = {Zhukov, Dimitri and Alayrac, Jean-Baptiste and Gokberk Cinbis, Ramazan and Fouhey, David and Laptev, Ivan and Sivic, Josef},
  title = {Cross-Task Weakly Supervised Learning From Instructional Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation
Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei, Juan Carlos Niebles


We address weakly supervised action alignment and segmentation in videos, where only the order of occurring actions is available during training. We propose Discriminative Differentiable Dynamic Time Warping (D3TW), the first discriminative model using weak ordering supervision. The key technical challenge for discriminative modeling with weak supervision is that the loss function of the ordering supervision is usually formulated using dynamic programming and is thus not differentiable. We address this challenge with a continuous relaxation of the min-operator in dynamic programming and extend the alignment loss to be differentiable. The proposed D3TW innovatively solves sequence alignment with discriminative modeling and end-to-end training, which substantially improves the performance in weakly supervised action alignment and segmentation tasks. We show that our model is able to bypass the degenerated sequence problem usually encountered in previous work and outperform the current state-of-the-art across three evaluation metrics in two challenging datasets.
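The continuous relaxation of the min-operator that makes the dynamic-programming alignment differentiable is typically a negative log-sum-exp "soft-min". The following is a generic soft dynamic time warping sketch built on that idea (a didactic O(nm) loop with an assumed smoothing parameter, not the authors' D3TW implementation).

import torch

def softmin(values, gamma=1.0):
    # Smooth, differentiable stand-in for min() via negative log-sum-exp.
    return -gamma * torch.logsumexp(-torch.stack(values) / gamma, dim=0)

def soft_dtw(cost, gamma=1.0):
    # cost[i, j]: frame i vs. ordered step j; returns a differentiable alignment loss.
    n, m = cost.shape
    inf = torch.tensor(float("inf"))
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = torch.tensor(0.0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = cost[i - 1, j - 1] + softmin(
                [R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]], gamma)
    return R[n][m]

cost = torch.rand(8, 5, requires_grad=True)
soft_dtw(cost).backward()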
[action, video, transcript, modeling, dynamic, frame, challenge, time, previous, ordering, sequence, breakfast, warping, key, degenerated, work, temporal, prediction, gru, dtw, egg, proposing, hollywood, framework, iod] [differentiable, ground, programming, truth, relaxation, computer, continuous, directly, approach, vision, problem, allows, matrix, pattern] [figure, based, proposed, conference, ieee, input] [best, order, cost, performance, output, optimize, neural, lead] [model, probability, correct, goal, candidate] [weakly, segmentation, supervision, weak, occurring, improves, challenging, false, improve] [alignment, discriminative, supervised, loss, function, test, distance, set, address, learning, label, task, learn, negative, aligning, training, hard, align, positive]
@InProceedings{Chang_2019_CVPR,
  author = {Chang, Chien-Yi and Huang, De-An and Sui, Yanan and Fei-Fei, Li and Carlos Niebles, Juan},
  title = {D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Progressive Teacher-Student Learning for Early Action Prediction
Xionghui Wang, Jian-Fang Hu, Jian-Huang Lai, Jianguo Zhang, Wei-Shi Zheng


The goal of early action prediction is to recognize actions from partially observed videos with incomplete action executions, which is quite different from action recognition. Predicting early actions is very challenging since the partially observed videos do not contain enough action information for recognition. In this paper, we aim at improving early action prediction by proposing a novel teacher-student learning framework. Our framework involves a teacher model for recognizing actions from full videos, a student model for predicting early actions from partial videos, and a teacher-student learning block for distilling progressive knowledge from teacher to student across different tasks. Extensive experiments on three public action datasets show that the proposed progressive teacher-student learning framework can consistently improve the performance of early action prediction models. We also report state-of-the-art performance for early action prediction on all of these datasets.
[action, early, prediction, recognition, video, framework, sysu, ntu, human, lstm, dataset, predicting, employed, deepscn, auc, msrnn, activity, teacherstudent, recognizing, interaction, predict, skeleton] [computer, pattern, observation, vision, analysis, international, rgb, partially, local, corresponding, depth] [conference, ieee, figure, proposed, latent, method, mse, based, kong, extracted] [full, performance, table, progressive, network, ratio, distill, neural, accuracy, deep, convolutional, improving, performed] [model, progress, partial, machine, observed, system, evaluation] [feature, improve, level, global, fully] [student, learning, teacher, knowledge, loss, set, distilling, distillation, mmd, training, learned, minimizing, test, novel, label, learn, discrepancy, distribution]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Xionghui and Hu, Jian-Fang and Lai, Jian-Huang and Zhang, Jianguo and Zheng, Wei-Shi},
  title = {Progressive Teacher-Student Learning for Early Action Prediction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Social Relation Recognition From Videos via Multi-Scale Spatial-Temporal Reasoning
Xinchen Liu, Wu Liu, Meng Zhang, Jingwen Chen, Lianli Gao, Chenggang Yan, Tao Mei


Discovering social relations, e.g., kinship and friendship, from visual content can help machines better interpret the behaviors and emotions of human beings. Existing studies mainly focus on recognizing social relations from still images while neglecting another important medium: video. On the one hand, the actions and storylines in videos provide more important cues for social relation recognition. On the other hand, the key persons may appear at arbitrary spatio-temporal locations, and may not even appear together in the same frame from beginning to end. To overcome these challenges, we propose a Multi-scale Spatial-Temporal Reasoning (MSTR) framework to recognize social relations from videos. For the spatial representation, we not only adopt a temporal segment network to learn global action and scene information, but also design a Triple Graphs model to capture visual relations between persons and objects. For the temporal domain, we propose a Pyramid Graph Convolutional Network to perform temporal reasoning with multi-scale receptive fields, which can capture both long-term and short-term storylines in videos. By this means, MSTR can comprehensively explore multi-scale actions and storylines in spatial-temporal dimensions for social relation reasoning in videos. Extensive experiments on a new large-scale Video Social Relation dataset demonstrate the effectiveness of the proposed framework.
[social, graph, video, temporal, triple, recognition, dataset, gcn, tsn, pgcn, framework, adjacent, capture, visr, action, mstr, recognize, build, clip, frame, key, perform, colleague, fusion, human, recognizing] [matrix, construction, computer] [figure, proposed, based, input, image, appearance] [network, convolutional, convolution, receptive, scale, neural, table, design, group] [reasoning, model, visual, represent, sampled, woman, man, couple] [relation, pyramid, global, spatial, contextual, feature, propose, three, segment, adopt, adopted, bounding, semantic, person, map] [learn, classification, learning, set, domain, existing]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Xinchen and Liu, Wu and Zhang, Meng and Chen, Jingwen and Gao, Lianli and Yan, Chenggang and Mei, Tao},
  title = {Social Relation Recognition From Videos via Multi-Scale Spatial-Temporal Reasoning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation
Yazan Abu Farha, Jurgen Gall


Temporally locating and classifying action segments in long untrimmed videos is of particular interest to many applications like surveillance and robotics. While traditional approaches follow a two-step pipeline, by generating frame-wise probabilities and then feeding them to high-level temporal models, recent approaches use temporal convolutions to directly classify the video frames. In this paper, we introduce a multi-stage architecture for the temporal action segmentation task. Each stage features a set of dilated temporal convolutions to generate an initial prediction that is refined by the next one. This architecture is trained using a combination of a classification loss and a proposed smoothing loss that penalizes over-segmentation errors. Extensive evaluation shows the effectiveness of the proposed model in capturing long-range dependencies and recognizing action segments. Our model achieves state-of-the-art results on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
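A minimal single-stage sketch of the two ingredients described above: a stack of dilated 1-D convolutions producing frame-wise logits, and a truncated-MSE smoothing penalty on the change of log-probabilities between adjacent frames. The layer count, channel widths, and clamp threshold are illustrative guesses rather than the paper's exact configuration, and only one stage is shown (the full model stacks several and refines the previous stage's prediction).

import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedStage(nn.Module):
    # One stage of residual dilated 1-D convolutions over the temporal axis.
    def __init__(self, in_dim, hidden, n_classes, n_layers=6):
        super().__init__()
        self.inp = nn.Conv1d(in_dim, hidden, 1)
        self.layers = nn.ModuleList(
            nn.Conv1d(hidden, hidden, 3, padding=2 ** i, dilation=2 ** i)
            for i in range(n_layers))
        self.out = nn.Conv1d(hidden, n_classes, 1)

    def forward(self, x):                    # x: (batch, in_dim, frames)
        h = self.inp(x)
        for conv in self.layers:
            h = h + F.relu(conv(h))          # residual dilated block
        return self.out(h)                   # frame-wise class logits

def smoothing_loss(logits, tau=4.0):
    # Truncated MSE on the change of log-probabilities between consecutive frames.
    logp = F.log_softmax(logits, dim=1)
    diff = (logp[:, :, 1:] - logp[:, :, :-1].detach()) ** 2
    return diff.clamp(max=tau ** 2).mean()

stage = DilatedStage(in_dim=2048, hidden=64, n_classes=10)
feats = torch.randn(2, 2048, 100)            # e.g. pre-extracted features for 100 frames
logits = stage(feats)
labels = torch.randint(0, 10, (2, 100))
loss = F.cross_entropy(logits, labels) + 0.15 * smoothing_loss(logits)

Doubling the dilation at every layer grows the temporal receptive field exponentially, which is how a single stage can capture long-range dependencies without pooling.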
[temporal, action, video, recognition, acc, breakfast, dataset, previous, tcn, long, segmental, egocentric, time, gtea, prediction, capture, juergen, short, modeling, complex] [computer, vision, pattern, field, approach, additional, initial, single] [conference, proposed, edit, ieee, figure, resolution, input, based, qualitative, quality] [table, number, convolutional, architecture, dilated, impact, layer, convolution, network, smoothing, better, output, receptive, low, accuracy, higher, performance, achieves, pooling, residual, size] [model, probability, evaluation, adding, introduce] [segmentation, stage, three, weakly, penalizes, operate] [loss, datasets, supervised, large, training, train, class, set]
@InProceedings{Farha_2019_CVPR,
  author = {Abu Farha, Yazan and Gall, Jurgen},
  title = {MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Transferable Interactiveness Knowledge for Human-Object Interaction Detection
Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yanfeng Wang, Cewu Lu


Human-Object Interaction (HOI) Detection is an important problem for understanding how humans interact with objects. In this paper, we explore Interactiveness Knowledge which indicates whether human and object interact with each other or not. We found that interactiveness knowledge can be learned across HOI datasets, regardless of HOI category settings. Our core idea is to exploit an Interactiveness Network to learn the general interactiveness knowledge from multiple HOI datasets and perform Non-Interaction Suppression before HOI classification in inference. Because interactiveness generalizes in this way, the interactiveness network is a transferable knowledge learner and can be combined with any HOI detection model to achieve desirable results. We extensively evaluate the proposed method on HICO-DET and V-COCO datasets. Our framework outperforms state-of-the-art HOI detection results by a large margin, verifying its efficacy and flexibility. Code is available at https://github.com/DirtyHarryLYL/Transferable-Interactiveness-Network.
[interactiveness, hoi, human, hois, rpt, stream, graph, multiple, framework, rpd, interaction, cewu, dataset, fcs, joint, perform] [pose, dense] [figure, method, image, proposed, transferred, suppress, input] [network, table, performance, pooling, block, residual, inference, sparse] [visual, model, relationship, indicates, arxiv, preprint, evaluate, mode, transferability] [object, detection, person, map, default, spatial, interactive, three, feature, suppression, hierarchical, score, instance, category, bounding, utilize, detected, cnn, box, sized] [knowledge, learning, pair, set, classification, transfer, datasets, training, train, specific, learn, classify, learned, test, representation]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yong-Lu and Zhou, Siyuan and Huang, Xijie and Xu, Liang and Ma, Ze and Fang, Hao-Shu and Wang, Yanfeng and Lu, Cewu},
  title = {Transferable Interactiveness Knowledge for Human-Object Interaction Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition
Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, Qi Tian


Action recognition with skeleton data has recently attracted much attention in computer vision. Previous studies are mostly based on fixed skeleton graphs, only capturing local physical dependencies among joints, which may miss implicit joint correlations. To capture richer dependencies, we introduce an encoder-decoder structure, called A-link inference module, to capture action-specific latent dependencies, i.e. actional links, directly from actions. We also extend the existing skeleton graphs to represent higher-order dependencies, i.e. structural links. Combining the two types of links into a generalized skeleton graph, we further propose the actional-structural graph convolution network (AS-GCN), which stacks actional-structural graph convolution and temporal convolution as a basic building block, to learn both spatial and temporal features for action recognition. A future pose prediction head is added in parallel to the recognition head to help capture more detailed action patterns through self-supervision. We validate AS-GCN in action recognition using two skeleton data sets, NTU-RGB+D and Kinetics. The proposed AS-GCN achieves consistently large improvements compared to the state-of-the-art methods. As a side product, AS-GCN also shows promising results for future pose prediction.
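A minimal sketch of a single graph convolution over a normalized skeleton adjacency matrix, the basic operation that the actional-structural graph convolution above extends with learned actional links and higher-order structural links. The 5-joint chain skeleton and feature sizes are made up purely for illustration.

import torch
import torch.nn as nn

# Toy skeleton: 5 joints connected in a chain (symmetric adjacency plus self-loops).
A = torch.zeros(5, 5)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
A = A + torch.eye(5)
deg = A.sum(dim=1)
A_norm = A / deg.sqrt().unsqueeze(1) / deg.sqrt().unsqueeze(0)   # D^-1/2 A D^-1/2

class GraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):               # x: (batch, joints, in_dim)
        return torch.relu(adj @ self.weight(x))

x = torch.randn(8, 5, 3)                     # batch of 8 skeletons, (x, y, z) per joint
gc = GraphConv(3, 16)
out = gc(x, A_norm)                          # (8, 5, 16): per-joint features

In the full model, each such spatial aggregation would be interleaved with temporal convolutions over the frame axis, and the fixed adjacency would be augmented with the inferred actional links.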
[action, recognition, joint, skeleton, graph, future, temporal, capture, prediction, actional, link, structural, human, asgc, previous, predict, extract, kinetics, linking, sgc, vwulgh, gcn] [pose, computer, pattern, vision, body, polynomial, inferred, note, international, june] [figure, conference, based, ieee, proposed, input, xin, july] [convolution, block, order, inference, table, deep, plot, convolutional, performance, network, larger, denotes, connected] [model, encoder, introduce, represent] [feature, spatial, head, module, backbone, detailed, predicted, propose, response] [data, set, learning, classification, generalized, learn, large, training, validate]
@InProceedings{Li_2019_CVPR,
  author = {Li, Maosen and Chen, Siheng and Chen, Xu and Zhang, Ya and Wang, Yanfeng and Tian, Qi},
  title = {Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Granularity Generator for Temporal Action Proposal
Yuan Liu, Lin Ma, Yifeng Zhang, Wei Liu, Shih-Fu Chang


Temporal action proposal generation is an important task, aiming to localize the video segments containing human actions in an untrimmed video. In this paper, we propose a multi-granularity generator (MGG) to perform the temporal action proposal from different granularity perspectives, relying on the video visual features equipped with the position embedding information. First, we propose to use a bilinear matching model to exploit the rich local information within the video sequence. Afterwards, two components, namely segment proposal producer (SPP) and frame actionness producer (FAP), are combined to perform the task of temporal action proposal at two distinct granularities. SPP considers the whole video in the form of feature pyramid and generates segment proposals from one coarse perspective, while FAP carries out a finer actionness evaluation for each video frame. Our proposed MGG can be trained in an end-to-end fashion. Through temporally adjusting the segment proposals with fine-grained information based on frame actionness, MGG achieves superior performance over state-of-the-art methods on the public THUMOS-14 and ActivityNet-1.3 datasets. Moreover, we employ existing action classifiers to perform the classification of the proposals generated by MGG, leading to significant improvements compared with the competing methods for the video detection task.
[action, temporal, video, mgg, frame, spp, fap, sequence, actionness, middle, starting, producer, perform, bsn, illustrates, temporally, recognition, predict, ctap, untrimmed, combined, work] [position, matching, corresponding, defined, local, adjustment, finer] [proposed, high, denoted, based, method] [performance, convolutional, architecture, table, number, bilinear, validation, conv, network, lateral, higher, max, effectiveness, fine] [generated, probability, model, generate, visual, rich, turn, arxiv, tag, preprint] [proposal, segment, feature, boundary, pyramid, detection, anchor, recall, propose, stage, tiou, coarse, refined, three, instance, complementary, localization] [set, training, testing, exploit, loss, function, trained, objective, label, embedding, novel]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Yuan and Ma, Lin and Zhang, Yifeng and Liu, Wei and Chang, Shih-Fu},
  title = {Multi-Granularity Generator for Temporal Action Proposal},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Rigid Instance Scene Flow
Wei-Chiu Ma, Shenlong Wang, Rui Hu, Yuwen Xiong, Raquel Urtasun


In this paper we tackle the problem of scene flow estimation in the context of self-driving. We leverage deep learning techniques as well as strong priors, since in our application domain the motion of the scene can be decomposed into the motion of the robot and the 3D motion of the actors in the scene. We formulate the problem as energy minimization in a deep structured model, which can be solved efficiently in the GPU by unrolling a Gaussian-Newton solver. Our experiments on the challenging KITTI scene flow dataset show that we outperform the state-of-the-art by a very large margin, while being 800 times faster.
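A minimal NumPy sketch of one Gauss-Newton update for a generic least-squares energy, the kind of solver step the method above unrolls on the GPU. The toy residual (fitting an exponential curve) is purely illustrative and unrelated to scene flow itself.

import numpy as np

def gauss_newton_step(residual, jacobian, params):
    # One Gauss-Newton update: delta = -(J^T J)^-1 J^T r
    r = residual(params)
    J = jacobian(params)
    delta = np.linalg.solve(J.T @ J, -J.T @ r)
    return params + delta

# Toy problem: fit y = a * exp(b * t) to noisy observations.
t = np.linspace(0, 1, 50)
y = 2.0 * np.exp(-1.5 * t) + 0.01 * np.random.randn(50)

residual = lambda p: p[0] * np.exp(p[1] * t) - y
jacobian = lambda p: np.stack([np.exp(p[1] * t),
                               p[0] * t * np.exp(p[1] * t)], axis=1)

params = np.array([1.0, 0.0])
for _ in range(10):                          # a handful of steps is enough for this toy fit
    params = gauss_newton_step(residual, jacobian, params)
print(params)                                # should approach [2.0, -1.5]

Unrolling a fixed number of such steps and implementing them with differentiable tensor operations is what lets a structured energy be minimized inside a deep network on the GPU.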
[flow, motion, optical, term, dataset, previous, warping, multiple, osf] [scene, rigid, stereo, estimation, error, disparity, runtime, matching, kitti, solver, photometric, robust, estimating, drisf, estimate, problem, well, approach, compute, occlusion, isf, geometry, autonomous, volume, inverse, occluded, thomas, estimated, point, rgb] [based, pixel, background, image, figure, method, prior, handle] [energy, deep, performance, cost, network, inference, convolutional, structured, neural, full, best, ratio, structure, employ, original] [model, visual, describe, variational] [instance, segmentation, object, three, pyramid, mask, module, feature, improve, spatial, foreground, raquel, context, challenging, comparing, regression] [large, exploit, learning, function, set, pair]
@InProceedings{Ma_2019_CVPR,
  author = {Ma, Wei-Chiu and Wang, Shenlong and Hu, Rui and Xiong, Yuwen and Urtasun, Raquel},
  title = {Deep Rigid Instance Scene Flow},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks
Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, Fatih Porikli


We introduce a novel network, called the CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to further improve the state-of-the-art deep learning based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments. The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space. We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to infer the frequently reappearing and salient foreground objects better. We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks show that COSNet outperforms the current alternatives by a large margin. We will publicly release our implementation and models.
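A minimal sketch of the co-attention operation described above: an affinity matrix between the flattened feature maps of two frames is computed through a learnable weight and used so that each frame attends over the locations of the other. The tensor sizes and the single weight matrix are illustrative simplifications of the paper's design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.W = nn.Linear(channels, channels, bias=False)

    def forward(self, fa, fb):
        # fa, fb: (batch, channels, H, W) features from two frames of the same video
        b, c, h, w = fa.shape
        fa_flat = fa.flatten(2).transpose(1, 2)                  # (b, HW, c)
        fb_flat = fb.flatten(2).transpose(1, 2)                  # (b, HW, c)
        affinity = self.W(fa_flat) @ fb_flat.transpose(1, 2)     # (b, HW, HW)
        # Each location in frame A attends over all locations of frame B, and vice versa.
        att_a = F.softmax(affinity, dim=2) @ fb_flat
        att_b = F.softmax(affinity, dim=1).transpose(1, 2) @ fa_flat
        za = torch.cat([fa_flat, att_a], dim=2).transpose(1, 2).reshape(b, 2 * c, h, w)
        zb = torch.cat([fb_flat, att_b], dim=2).transpose(1, 2).reshape(b, 2 * c, h, w)
        return za, zb                                            # fed to segmentation heads

fa, fb = torch.randn(2, 64, 30, 30), torch.randn(2, 64, 30, 30)
za, zb = CoAttention(64)(fa, fb)

Because the affinity is computed between whole frames, objects that reappear across the video reinforce each other, which is the intuition behind using co-attention for unsupervised foreground discovery.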
[video, cosnet, frame, uvos, fbms, motion, temporal, capture, multiple, dataset, considering, jianbing, optical, consists, fusion, focus, static] [matrix, column, local, corresponding] [reference, proposed, based, image, method, appearance, background, figure, quantitative, input, produce] [network, performance, deep, table, neural, better, number, weight, siamese, correlation, vanilla, compared, convolution, convolutional] [primary, attention, visual, model, mechanism, coattention, evaluation, rich, advantage] [object, segmentation, feature, global, module, foreground, region, three, final, fully, wenguan, salient, saliency, improvement, spatial] [learning, embedding, training, testing, symmetric, similarity, train, target, sampling, unsupervised, data, large, strategy, set, task, learned]
@InProceedings{Lu_2019_CVPR,
  author = {Lu, Xiankai and Wang, Wenguan and Ma, Chao and Shen, Jianbing and Shao, Ling and Porikli, Fatih},
  title = {See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Patch-Based Discriminative Feature Learning for Unsupervised Person Re-Identification
Qize Yang, Hong-Xing Yu, Ancong Wu, Wei-Shi Zheng


While discriminative local features have been shown effective in solving the person re-identification problem, they are limited to being trained on fully pairwise labelled data, which is expensive to obtain. In this work, we overcome this problem by proposing a patch-based unsupervised learning framework in order to learn discriminative features from patches instead of the whole images. The patch-based learning leverages similarity between patches to learn a discriminative model. Specifically, we develop a PatchNet to select patches from the feature map and learn discriminative features for these patches. To provide effective guidance for the PatchNet to learn discriminative patch features on unlabeled datasets, we propose an unsupervised patch-based discriminative feature learning loss. In addition, we design an image-level feature learning loss to leverage all the patch features of the same image to serve as an image-level guidance for the PatchNet. Extensive experiments validate the superiority of our method for unsupervised person re-id. Our code is available at https://github.com/QizeYang/PAUL.
[dataset, framework] [local, provide, analysis, compute, cyclic] [image, patch, figure, method, proposed, based, identity, transformation] [table, deep, effective, network, parameter, performance, compared, effectiveness, designed, wei] [model, generate, generation, develop, random, arxiv, preprint] [feature, person, map, guidance, propose, liang, cnn, distinguish, including] [learning, discriminative, unsupervised, unlabeled, patchnet, pedal, learn, negative, surrogate, positive, hard, ipfl, sample, loss, ranking, training, pull, set, datasets, gap, dissimilar, push, pgn, data, metric, pulling, mine, nearest, shaogang, randomly, transfer, target, domain, tao, pairwise, label, observe, large, validate, distance, labeled, supervised]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Qize and Yu, Hong-Xing and Wu, Ancong and Zheng, Wei-Shi},
  title = {Patch-Based Discriminative Feature Learning for Unsupervised Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SPM-Tracker: Series-Parallel Matching for Real-Time Visual Object Tracking
Guangting Wang, Chong Luo, Zhiwei Xiong, Wenjun Zeng


The greatest challenge facing visual object tracking is the simultaneous requirement for robustness and discrimination power. In this paper, we propose a SiamFC-based tracker, named SPM-Tracker, to tackle this challenge. The basic idea is to address the two requirements in two separate matching stages. Robustness is strengthened in the coarse matching (CM) stage through generalized training while discrimination power is enhanced in the fine matching (FM) stage through a distance learning network. The two stages are connected in series as the input proposals of the FM stage are generated by the CM stage. They are also connected in parallel as the matching scores and box location refinements are fused to generate the final results. This innovative series-parallel structure takes advantage of both stages and results in superior performance. The proposed SPM-Tracker, running at 120fps on GPU, achieves an AUC of 0.687 on OTB-100 and an EAO of 0.434 on VOT-16, exceeding other real-time trackers by a notable margin.
[tracking, auc, framework, work, online, frame] [matching, template, single, michael] [image, proposed, figure, appearance, high, input] [network, tracker, deep, siamrpn, structure, siamese, siamfc, number, correlation, dasiamrpn, performance, search, power, achieves, table, layer, best, fine, eao, convolutional, speed, scale, overlap, ratio, size] [visual, candidate, robustness, model, expected, generate] [stage, box, object, feature, score, regression, proposal, region, three, propose, coarse, detection, branch, recall, adopt, roi, regional, relation, fused, final, inertia] [target, learning, training, similarity, positive, discrimination, classification, discriminative, loss, generalized, distance, align]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Guangting and Luo, Chong and Xiong, Zhiwei and Zeng, Wenjun},
  title = {SPM-Tracker: Series-Parallel Matching for Real-Time Visual Object Tracking},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Spatial Fusion GAN for Image Synthesis
Fangneng Zhan, Hongyuan Zhu, Shijian Lu


Recent advances in generative adversarial networks (GANs) have shown great potential in realistic image synthesis, whereas most existing works address synthesis realism in either the appearance space or the geometry space but few in both. This paper presents an innovative Spatial Fusion GAN (SF-GAN) that combines a geometry synthesizer and an appearance synthesizer to achieve synthesis realism in both geometry and appearance spaces. The geometry synthesizer learns contextual geometries of background images and transforms and places foreground objects into the background images unanimously. The appearance synthesizer adjusts the color, brightness and style of the foreground objects and embeds them into background images harmoniously, where a guided filter is incorporated for detail preservation. The two synthesizers are inter-connected as mutual references which can be trained end-to-end with little supervision. The SF-GAN has been evaluated in two tasks: (1) realistic scene text image synthesis for training better recognition models; (2) glass and hat wearing, realistically matching glasses and hats with real portraits. Qualitative and quantitative comparisons with the state-of-the-art demonstrate the superiority of the proposed SF-GAN.
[recognition, consists, dataset, fusion] [geometry, scene, geometric, robust, local] [image, appearance, synthesis, background, synthesizer, real, synthesized, realistic, realism, proposed, composed, transformation, generative, cycle, face, detail, translation, composition, blending, synthesize, study, filtering, mapping, capable, input, fangneng, figure, generator, pixel] [network, achieve, deep, structure, filter, output, better, table, original, designed, achieves, trainable] [text, adversarial, gans, gan, shijian, transformed, hat, discriminator, model, generating] [foreground, guided, object, cropped, spatial, annotation, detection, guidance, highlighted] [training, space, learning, loss, existing, alignment, train, learn, transfer, data, domain]
@InProceedings{Zhan_2019_CVPR,
  author = {Zhan, Fangneng and Zhu, Hongyuan and Lu, Shijian},
  title = {Spatial Fusion GAN for Image Synthesis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Text Guided Person Image Synthesis
Xingran Zhou, Siyu Huang, Bin Li, Yingming Li, Jiachen Li, Zhongfei Zhang


This paper presents a novel method to manipulate the visual appearance (pose and attribute) of a person image according to natural language descriptions. Our method can be boiled down to two stages: 1) text guided pose generation and 2) visual appearance transferred image synthesis. In the first stage, our method infers a reasonable target human pose based on the text. In the second stage, our method synthesizes a realistic and appearance transferred person image according to the text in conjunction with the target pose. Our method extracts sufficient information from the text and establishes a mapping between the image space and the language space, making it possible to generate and edit images corresponding to the description. We conduct extensive experiments to reveal the effectiveness of our method, as well as using the VQA Perceptual Score as a metric for evaluating the method. It shows for the first time that we can automatically edit a person image from natural language descriptions.
[walking, dataset, forward, carrying, human, work] [pose, orientation, corresponding, coordinate, body, problem] [image, attribute, method, appearance, generative, attentional, synthesis, based, conditional, generator, transferred, reference, perceptual, figure, input, manipulate, editing, identity, gray, realistic, edit, proposed, color, result] [upsampling, basic, better, inference, network] [text, adversarial, natural, model, language, black, vqa, wearing, man, visual, generated, white, generation, encoder, attention, description, generate, arxiv, preprint, reasonable, shirt, generates, word, yellow, green, woman, generating, sau, blue] [person, guided, feature, propose, score, module, semantic, regression, predicted, xiaogang, pedestrian] [target, transfer, loss, training, pair, source, learning, novel, existing, tao]
@InProceedings{Zhou_2019_CVPR,
  author = {Zhou, Xingran and Huang, Siyu and Li, Bin and Li, Yingming and Li, Jiachen and Zhang, Zhongfei},
  title = {Text Guided Person Image Synthesis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
STGAN: A Unified Selective Transfer Network for Arbitrary Image Attribute Editing
Ming Liu, Yukang Ding, Min Xia, Xiao Liu, Errui Ding, Wangmeng Zuo, Shilei Wen


Arbitrary attribute editing generally can be tackled by incorporating encoder-decoder and generative adversarial networks. However, the bottleneck layer in the encoder-decoder usually gives rise to blurry and low quality editing results, and adding skip connections improves image quality at the cost of weakened attribute manipulation ability. Moreover, existing methods exploit the target attribute vector to guide the flexible translation to the desired target domain. In this work, we suggest to address these issues from a selective transfer perspective. Considering that a specific editing task is only related to the changed attributes rather than all target attributes, our model selectively takes the difference between target and source attribute vectors as input. Furthermore, selective transfer units are incorporated with the encoder-decoder to adaptively select and modify encoder features for enhanced attribute editing. Experiments show that our method (i.e., STGAN) simultaneously improves attribute manipulation accuracy as well as perceptual quality, and performs favorably against the state of the art in arbitrary face attribute editing and season translation.
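A minimal sketch of the "difference attribute vector" idea described above: the generator is conditioned on target-minus-source attributes, so unchanged attributes contribute zeros and only the edits drive the translation. The tiny fusion module is a stand-in for illustration, not the paper's encoder-decoder with selective transfer units.

import torch
import torch.nn as nn

# Binary facial attributes, e.g. [blond_hair, eyeglasses, mouth_open, young]
src_att = torch.tensor([[0., 1., 0., 1.]])
tgt_att = torch.tensor([[0., 0., 1., 1.]])
att_diff = tgt_att - src_att           # only the edited attributes are non-zero: [0, -1, 1, 0]

class TinyEditor(nn.Module):
    # Stand-in generator: fuses an image code with the attribute difference vector.
    def __init__(self, code_dim=128, n_att=4):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(code_dim + n_att, code_dim), nn.ReLU(),
            nn.Linear(code_dim, code_dim))

    def forward(self, img_code, att_diff):
        return self.fuse(torch.cat([img_code, att_diff], dim=1))

img_code = torch.randn(1, 128)          # latent code from an image encoder
edited = TinyEditor()(img_code, att_diff)

Conditioning on the difference rather than the full target vector means an identity edit (same source and target attributes) reduces to a zero conditioning signal, which is the motivation given in the abstract.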
[hidden, performs, work, dataset] [reconstruction, computer, vision, pattern, season, international] [attribute, image, editing, stgan, attgan, difference, arbitrary, stargan, quality, translation, conference, facial, input, manipulation, attdiff, figure, stu, generative, ieee, hair, latent, dadv, method, result, datt, modify, icgan, fadernet, transform, ftl, mouth, changed, face, limitation, proposed, generator, conditional, comparison, user, study, competing, blond, high] [skip, accuracy, add, convolution, network, table, better, residual, neural] [encoder, vector, model, adversarial, generation, decoder, arxiv, preprint, ability, generate, attt, brown, abstract, adding] [feature, selective, spatial, improves] [target, source, transfer, learning, training, set, train, code, test]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Ming and Ding, Yukang and Xia, Min and Liu, Xiao and Ding, Errui and Zuo, Wangmeng and Wen, Shilei},
  title = {STGAN: A Unified Selective Transfer Network for Arbitrary Image Attribute Editing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Towards Instance-Level Image-To-Image Translation
Zhiqiang Shen, Mingyang Huang, Jianping Shi, Xiangyang Xue, Thomas S. Huang


Unpaired image-to-image translation is an emerging and challenging vision problem that aims to learn a mapping between unaligned image pairs in diverse domains. Recent advances in this field like MUNIT and DRIT mainly focus on disentangling content and style/attribute from a given image first, then directly adopting the global style to guide the model to synthesize new domain images. However, this kind of approach runs into severe contradictions when the target-domain images are content-rich, with multiple discrepant objects. In this paper, we present a simple yet effective instance-aware image-to-image translation approach (INIT), which applies the fine-grained local (instance) and global styles to the target image spatially. The proposed INIT exhibits three important advantages: (1) the instance-level objective loss can help learn a more accurate reconstruction and incorporate diverse attributes of objects; (2) the styles used for the local/global areas of the target domain come from the corresponding spatial regions in the source domain, which is intuitively a more reasonable mapping; (3) the joint training process can benefit both fine and coarse granularity and incorporates instance information to improve the quality of global translation. We also collect a large-scale benchmark for the new instance-level translation task. We observe that our synthetic images can even benefit real-world vision tasks like generic object detection.
[dataset, framework] [vision, reconstruction, problem, scene, computer, night, sunny, pattern] [style, image, translation, content, munit, synthetic, method, consistency, drit, generative, real, figure, cyclegan, input, unpaired, proposed, conditional, init, paired, latent, swapping, background, comparison, mapping, conference, cloudy, lpips] [entire, deep, original, process, table, output, unit, residual] [adversarial, diverse, model, multimodal, inception, generated, generate, encoder, diversity] [object, global, coco, instance, detection, cis, score, bounding, box, segmentation, three, detailed, adopt, average, visualization] [training, domain, learning, target, loss, data, learn, shared, observe, code, randomly, space, unsupervised]
@InProceedings{Shen_2019_CVPR,
  author = {Shen, Zhiqiang and Huang, Mingyang and Shi, Jianping and Xue, Xiangyang and Huang, Thomas S.},
  title = {Towards Instance-Level Image-To-Image Translation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dense Intrinsic Appearance Flow for Human Pose Transfer
Yining Li, Chen Huang, Chen Change Loy


We present a novel approach for the task of human pose transfer, which aims at synthesizing a new image of a person from an input image of that person and a target pose. We address the issues of correspondences being identified only between keypoints, and of pixels that are invisible due to self-occlusion. Unlike existing methods, we propose to estimate dense and intrinsic 3D appearance flow to better guide the transfer of pixels between poses. In particular, we wish to generate the 3D flow from just the reference and target poses. Training a network for this purpose is non-trivial, especially when the annotations for 3D appearance flow are scarce by nature. We address this problem through a flow synthesis stage. This is achieved by fitting a 3D model to the given pose pair and projecting it back to the 2D plane to compute the dense appearance flow for training. The synthesized ground-truths are then used to train a feedforward network for efficient mapping from the input and target skeleton poses to the 3D appearance flow. With the appearance flow, we perform feature warping on the input image and generate a photorealistic image of the target pose. Extensive results on DeepFashion and Market-1501 datasets demonstrate the effectiveness of our approach over existing methods. Our code is available at http://mmlab.ie.cuhk.edu.hk/projects/pose-transfer
[flow, human, warping, warped, perform, predict, work, optical, previous] [pose, visibility, intrinsic, body, dense, corresponding, fitting, invisible, directly, approach, single, groundtruth, well, view] [image, appearance, pixel, reference, method, input, generative, synthesis, proposed, user, quality, figure, deepfashion, study, arbitrary, realistic, chen, change, generator, perceptual, conditional, ssim, fashionis, feedforward, photorealistic, high] [network, convolutional, architecture, better, deep, full, layer, andrew, residual] [model, adversarial, generate, generated, generation, encoder, generating, arxiv, preprint, generates, decoder] [map, module, feature, regression, person, clothing, spatial, final, guided, detailed] [target, transfer, loss, training, pair, train, large, trained, representation, task, existing]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yining and Huang, Chen and Change Loy, Chen},
  title = {Dense Intrinsic Appearance Flow for Human Pose Transfer},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Depth-Aware Video Frame Interpolation
Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, Ming-Hsuan Yang


Video frame interpolation aims to synthesize nonexistent frames in-between the original frames. While significant advances have been made from the recent deep convolutional neural networks, the quality of interpolation is often reduced due to large object motion or occlusion. In this work, we propose a video frame interpolation method which explicitly detects the occlusion by exploring the depth information. Specifically, we develop a depth-aware flow projection layer to synthesize intermediate flows that preferably sample closer objects than farther ones. In addition, we learn hierarchical features to gather contextual information from neighboring pixels. The proposed model then warps the input frames, depth maps, and contextual features based on the optical flow and local interpolation kernels for synthesizing the output frame. Our model is compact, efficient, and fully differentiable. Quantitative and qualitative results demonstrate that the proposed model performs favorably against state-of-the-art frame interpolation methods on a wide variety of datasets. The source code and pre-trained model are available at https://github.com/baowenbo/DAIN.
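A toy, loop-based sketch of the depth-aware flow projection idea described above: forward flow from frame 0 to frame 1 is projected to an intermediate time, and when several source pixels land at the same intermediate location, their contributions are weighted by inverse depth so closer objects dominate. This is a simplified illustration, not the paper's differentiable layer, and the weighting scheme is an assumption based on the abstract.

import numpy as np

def project_flow(flow01, depth0, t=0.5):
    # flow01: (H, W, 2) forward flow from frame 0 to 1; depth0: (H, W) depth at frame 0.
    # Returns an approximate flow from the intermediate time t back to frame 0.
    H, W, _ = flow01.shape
    num = np.zeros((H, W, 2))
    den = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            # Where does pixel (x, y) of frame 0 land at time t?
            xt = int(round(x + t * flow01[y, x, 0]))
            yt = int(round(y + t * flow01[y, x, 1]))
            if 0 <= xt < W and 0 <= yt < H:
                weight = 1.0 / max(depth0[y, x], 1e-6)   # closer pixels get larger weight
                num[yt, xt] += -t * weight * flow01[y, x]
                den[yt, xt] += weight
    den = np.where(den > 0, den, 1.0)
    return num / den[..., None]

flow = np.random.randn(32, 32, 2)
depth = np.random.rand(32, 32) + 0.5
flow_t_to_0 = project_flow(flow, depth)      # used to backward-warp frame 0 to time t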
[flow, frame, video, optical, motion, warping, dataset, dain, warped, warp, nie, jointly, time, key, performs, epicflow, explicitly] [depth, estimation, projection, middlebury, occlusion, estimate, local, projected, single, provide, well, estimated] [interpolation, proposed, method, input, figure, intermediate, synthesis, image, psnr, synthesize, based, ssim, toflow, clear, quantitative, sepconv, interpolated, demonstrate, favorably] [network, layer, adaptive, table, residual, deep, convolutional, output, rate, effective, kernel, neural] [model, generate, generates, closer, visual] [contextual, hierarchical, context, propose, map, average, adopt, extraction, object] [learning, learned, large, set, train, learn, existing]
@InProceedings{Bao_2019_CVPR,
  author = {Bao, Wenbo and Lai, Wei-Sheng and Ma, Chao and Zhang, Xiaoyun and Gao, Zhiyong and Yang, Ming-Hsuan},
  title = {Depth-Aware Video Frame Interpolation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Sliced Wasserstein Generative Models
Jiqing Wu, Zhiwu Huang, Dinesh Acharya, Wen Li, Janine Thoma, Danda Pani Paudel, Luc Van Gool


In generative modeling, the Wasserstein distance (WD) has emerged as a useful metric to measure the discrepancy between generated and real data distributions. Unfortunately, it is challenging to approximate the WD of high-dimensional distributions. In contrast, the sliced Wasserstein distance (SWD) factorizes high-dimensional distributions into their multiple one-dimensional marginal distributions and is thus easier to approximate. In this paper, we introduce novel approximations of the primal and dual SWD. Instead of using a large number of random projections, as it is done by conventional SWD approximation methods, we propose to approximate SWDs with a small number of parameterized orthogonal projections in an end-to-end deep learning fashion. As concrete applications of our SWD approximations, we design two types of differentiable SWD blocks to equip modern generative frameworks---Auto-Encoders (AE) and Generative Adversarial Networks (GAN). In the experiments, we not only show the superiority of the proposed generative models on standard image synthesis benchmarks, but also demonstrate the state-of-the-art performance on challenging high resolution image and video generation in an unsupervised manner.
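A minimal NumPy sketch of the conventional random-projection approximation of the sliced Wasserstein distance that the paper improves upon (the paper replaces the many random projections with a small number of learned, parameterized orthogonal projections). Sample sizes and the number of projections below are arbitrary.

import numpy as np

def sliced_wasserstein(x, y, n_proj=200, p=2):
    # Monte-Carlo estimate of SWD_p between two equally sized point clouds.
    d = x.shape[1]
    dirs = np.random.randn(n_proj, d)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)    # random unit directions
    # Project both sets onto each direction and compare the sorted 1-D marginals.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return (np.abs(px - py) ** p).mean() ** (1.0 / p)

x = np.random.randn(1000, 64)                 # e.g. generated samples
y = np.random.randn(1000, 64) + 0.5           # e.g. real samples, shifted
print(sliced_wasserstein(x, y))

Sorting solves each 1-D optimal transport problem exactly, which is why factorizing the distributions into one-dimensional marginals makes the Wasserstein distance tractable in high dimensions.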
[video, multiple, pdf] [optimal, algorithm, form, compute, constraint, supplementary, matrix, respect] [dual, proposed, generative, image, resolution, high, idt, latent, real, prior, celeba, preference, study] [orthogonal, number, standard, approximation, block, approximate, deep, unit, compared, network, gradient, progressive, computational, batch, apply, neural, small, achieve, stability, max] [swd, primal, wasserstein, swgan, swae, random, gan, sliced, fid, marginal, visual, adversarial, model, swg, wgan, aae, introduce, lsun, stiefel, discriminator, arxiv, preprint, required, probability] [propose, score] [training, distribution, learning, update, distance, data, set, sample, loss, target, objective, unsupervised, source]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Jiqing and Huang, Zhiwu and Acharya, Dinesh and Li, Wen and Thoma, Janine and Pani Paudel, Danda and Van Gool, Luc},
  title = {Sliced Wasserstein Generative Models},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Flow-Guided Video Inpainting
Rui Xu, Xiaoxiao Li, Bolei Zhou, Chen Change Loy


Video inpainting, which aims at filling in missing regions in a video, remains challenging due to the difficulty of preserving the precise spatial and temporal coherence of video contents. In this work we propose a novel flow-guided video inpainting approach. Rather than filling in the RGB pixels of each frame directly, we consider the video inpainting as a pixel propagation problem. We first synthesize a spatially and temporally coherent optical flow field across video frames using a newly designed Deep Flow Completion network, then use the synthesized flow fields to guide the propagation of pixels to fill up the missing regions in the video. Specifically, the Deep Flow Competion network follows a coarse-to-fine refinement strategy to complete the flow fields, while their quality is further improved by hard flow example mining. Following the guide of the completed flow fields, the missing video regions can be filled up precisely. Our method is evaluated on DAVIS and YouTubeVOS datasets qualitatively and quantitatively, achieving the state-of-the-art performance in terms of inpainting quality and speed.
[flow, video, second, propagation, temporal, davis, frame, huang, optical, complex, sequence, inpaint, forward, backward, deepfill, work, motion, consecutive] [approach, completion, field, computer, pattern, estimated, vision, directly, initial, rgb] [inpainting, missing, image, pixel, ieee, conference, figure, fill, method, completed, input, psnr, ssim, quality, comparison, smooth, study, removal, result, based, user, filled, proposed] [deep, table, network, effectiveness, neural, fixed, architecture, designed, compared, performance, convolutional, entire, subnetworks, better] [example, model, complete] [region, subnetwork, object, three, foreground, mask, ablation, filling] [hard, training, mining, learning, rank, unseen, data, large, existing]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Rui and Li, Xiaoxiao and Zhou, Bolei and Change Loy, Chen},
  title = {Deep Flow-Guided Video Inpainting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Video Generation From Single Semantic Label Map
Junting Pan, Chengyu Wang, Xu Jia, Jing Shao, Lu Sheng, Junjie Yan, Xiaogang Wang


This paper proposes the novel task of video generation conditioned on a SINGLE semantic label map, which provides a good balance between flexibility and quality in the generation process. Different from typical end-to-end approaches, which model both scene content and dynamics in a single step, we propose to decompose this difficult task into two sub-problems. As current image generation methods do better than video generation in terms of detail, we synthesize high quality content by only generating the first frame. Then we animate the scene based on its semantic meaning to obtain temporally coherent video, giving us excellent results overall. We employ a cVAE for predicting optical flow as a beneficial intermediate step to generate a video sequence conditioned on the initial single frame. A semantic label map is integrated into the flow prediction module to achieve major improvements in the image-to-video generation process. Extensive experiments on the Cityscapes dataset show that our method outperforms all competing methods.
[video, motion, frame, sequence, flow, future, prediction, optical, work, starting, dataset, static, mocogan, human, predict, multiple, predicting, recognition, bidirectional] [scene, single, computer, occlusion, initial, ground, compute, vision, truth, well, corresponding, variable] [image, method, background, conditional, latent, figure, conference, quality, based, generative, translation, proposed, input, composed, study, consistency, content, unconditional, qualitative] [network, neural, compared, table, better, processing, structure] [generation, model, arxiv, generated, encoder, preprint, conditioned, generate, adversarial, visual, generating, cvae, decoder] [semantic, map, foreground, car, stage, predicted, mask] [label, learning, loss, task, unsupervised, trained, existing, data, training, uncertainty]
@InProceedings{Pan_2019_CVPR,
  author = {Pan, Junting and Wang, Chengyu and Jia, Xu and Shao, Jing and Sheng, Lu and Yan, Junjie and Wang, Xiaogang},
  title = {Video Generation From Single Semantic Label Map},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Polarimetric Camera Calibration Using an LCD Monitor
Zhixiang Wang, Yinqiang Zheng, Yung-Yu Chuang


It is crucial for polarimetric imaging to accurately calibrate the polarizer angles and the camera response function (CRF) of a polarizing camera. When this polarizing camera is used in a setting of multiview geometric imaging, it is often required to calibrate its intrinsic and extrinsic parameters as well, for which Zhang's calibration method is the most widely used with either a physical checker board, or more conveniently a virtual checker pattern displayed on a monitor. In this paper, we propose to jointly calibrate the polarizer angles and the inverse CRF (ICRF) using a slightly adapted checker pattern displayed on a liquid crystal display (LCD) monitor. Thanks to the lighting principles and the industry standards of the LCD monitors, the polarimetric and radiometric calibration can be significantly simplified, when assisted by the extrinsic parameters estimated from the checker pattern. We present a simple linear method for polarizer angle calibration and a convex method for radiometric calibration, both of which can be jointly refined in a process similar to bundle adjustment. Experiments have verified the feasibility and accuracy of the proposed calibration method.
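As background for the polarizer-angle calibration described above, the textbook transmitted-intensity model for an ideal linear polarizer viewing partially linearly polarized light (such as an LCD backlight) is sinusoidal in twice the polarizer angle; the paper's exact parameterization and its coupling with the inverse CRF may differ, so this is only the standard form for orientation:

I(\theta) \;=\; \frac{I_{\max} + I_{\min}}{2} \;+\; \frac{I_{\max} - I_{\min}}{2}\,\cos\!\bigl(2(\theta - \varphi)\bigr)

Here \theta is the polarizer angle, \varphi is the polarization angle of the incident light, and I_{\max}, I_{\min} are the maximum and minimum observable intensities. Given intensity samples under several known screen polarization states, a model of this form is linear in its unknowns once rewritten in terms of \cos 2\theta and \sin 2\theta, which is the kind of structure a simple linear calibration method can exploit.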
[recognition, joint, jointly] [calibration, polarizer, camera, angle, lcd, polarization, icrf, pattern, polarizing, checker, light, computer, linear, geometric, vision, calibrate, surface, rotation, polarimetric, estimate, estimation, ground, international, extrinsic, estimated, irradiance, inverse, bundle, front, imin, screen, pose, truth, liquid, equation, normal, adjustment, observation, sin, journal, intrinsic, stereo, illumination, imax, eizo] [method, radiometric, figure, conference, gamma, proposed, intensity, color, noise, display, transmitted, ieee, imaging, displayed, image, dark, comparison, iphone] [crf, phase, number, standard, table] [monitor, room, environment, system] [response, three] [unknown, function, adapted, crystal, space, data]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Zhixiang and Zheng, Yinqiang and Chuang, Yung-Yu},
  title = {Polarimetric Camera Calibration Using an LCD Monitor},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fully Automatic Video Colorization With Self-Regularization and Diversity
Chenyang Lei, Qifeng Chen


We present a fully automatic approach to video colorization with self-regularization and diversity. Our model contains a colorization network for video frame colorization and a refinement network for spatiotemporal color refinement. Without any labeled data, both networks can be trained with self-regularized losses defined in bilateral and temporal space. The bilateral loss enforces color consistency between neighboring pixels in a bilateral space and the temporal loss imposes constraints between corresponding pixels in two nearby frames. While video colorization is a multi-modal problem, our method uses a perceptual loss with diversity to differentiate various modes in the solution space. Perceptual experiments demonstrate that our approach outperforms state-of-the-art approaches on fully automatic video colorization.
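A minimal sketch of a temporal consistency term of the kind described above: the previous frame's colorization is backward-warped to the current frame with optical flow and compared per pixel. The bilinear warp uses grid_sample, the flow is assumed to come from an external estimator, and occlusion masking (which a practical implementation would need) is omitted.

import torch
import torch.nn.functional as F

def warp(img, flow):
    # Backward-warp img (B, C, H, W) with flow (B, 2, H, W) given in pixels.
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0) + flow
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    grid[:, 0] = 2 * grid[:, 0] / (w - 1) - 1
    grid[:, 1] = 2 * grid[:, 1] / (h - 1) - 1
    return F.grid_sample(img, grid.permute(0, 2, 3, 1), align_corners=True)

def temporal_loss(color_t, color_prev, flow_t_to_prev):
    return F.l1_loss(color_t, warp(color_prev, flow_t_to_prev))

c_prev = torch.rand(1, 3, 64, 64)             # colorized frame t-1
c_t = torch.rand(1, 3, 64, 64)                # colorized frame t
flow = torch.zeros(1, 2, 64, 64)              # flow from frame t to t-1 (identity here)
print(temporal_loss(c_t, c_prev, flow))

Because the target of the loss is the model's own warped prediction rather than a ground-truth label, a term like this can be trained without any labeled data, which matches the self-regularization framing of the abstract.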
[video, frame, temporal, dataset, davis, multiple, work, videvo, temporally, spatiotemporal, film, chrominance, optical, differentiate, classic, propagate, coherence, perform, propagation] [approach, confidence, solution, defined, well, inconsistency] [colorization, image, colorized, color, user, method, consistency, perceptual, figure, input, pixel, grayscale, iizuka, zhang, acm, bilateral, reference, btc, based, study, colorful, blind, colorize, lai, proposed, colorizing, conduct, psnr] [network, deep, output, table, full, apply, performance] [model, diversity, automatic, diverse, generate, generating, evaluate] [fully, refinement, global, propose, interactive] [loss, set, learning, train, space, similarity, function, refers, sample, training, nearest]
@InProceedings{Lei_2019_CVPR,
  author = {Lei, Chenyang and Chen, Qifeng},
  title = {Fully Automatic Video Colorization With Self-Regularization and Diversity},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Zoom to Learn, Learn to Zoom
Xuaner Zhang, Qifeng Chen, Ren Ng, Vladlen Koltun


This paper shows that when applying machine learning to digital zoom, it is beneficial to operate on real, RAW sensor data. Existing learning-based super-resolution methods do not use real sensor data, instead operating on processed RGB images. We show that these approaches forfeit detail and accuracy that can be gained by operating on raw data, particularly when zooming in on distant objects. The key barrier to using real sensor data for training is that ground-truth high-resolution imagery is missing. We show how to obtain such ground-truth data via optical zoom and contribute a dataset, SR-RAW, for real-world computational zoom. We use SR-RAW to train a deep network with a novel contextual bilateral loss that is robust to mild misalignment between input and output images. The trained network achieves state-of-the-art performance in 4X and 8X computational zoom. We also show that synthesizing sensor data by resampling high-resolution RGB images is an oversimplified approximation of real sensor data and noise, resulting in worse image quality.
[optical, dataset, capture, signal, joint, sequence, challenge] [sensor, rgb, ground, focal, truth, camera, lens, perspective, single, approach, matching] [image, zoom, raw, real, synthetic, figure, perceptual, input, noise, color, bayer, processed, bilateral, bicubic, cobi, ieee, distant, digital, mild, proposed, collect, ssim, psnr, lpips, esrgan, zoomed, superresolution, mosaic] [deep, computational, better, network, processing, table, compare, standard, applied, pretrained, output, residual, filter] [model, length, introduce, generate] [contextual, misalignment, feature, baseline, spatial, propose, operate] [data, trained, loss, training, train, target, learning, existing, pair, source, test, learned]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Xuaner and Chen, Qifeng and Ng, Ren and Koltun, Vladlen},
  title = {Zoom to Learn, Learn to Zoom},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Single Image Reflection Removal Beyond Linearity
Qiang Wen, Yinjie Tan, Jing Qin, Wenxi Liu, Guoqiang Han, Shengfeng He


Due to the lack of paired data, the training of image reflection removal relies heavily on synthesizing reflection images. However, existing methods model reflection as a linear combination, which cannot fully simulate real-world scenarios. In this paper, we inject non-linearity into reflection removal from two aspects. First, instead of synthesizing reflection with a fixed combination factor or kernel, we propose to synthesize reflection images by predicting a non-linear alpha blending mask. This enables a free combination of different blur kernels, leading to a controllable and diverse reflection synthesis. Second, we design a cascaded network for reflection removal with three tasks: predicting the transmission layer, reflection layer, and the non-linear alpha blending mask. The former two tasks produce the fundamental outputs, while the latter is a side output of the network. This side output, in turn, closes the training loop, allowing the separated transmission and reflection layers to be recombined for training with a reconstruction loss. Extensive quantitative and qualitative experiments demonstrate that the proposed synthesis and removal approaches outperform state-of-the-art methods on two standard benchmarks, as well as in real-world scenarios.
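A minimal sketch contrasting the fixed-factor linear blending used by earlier synthesis pipelines with the per-pixel, non-linear alpha blending described above; the random alpha map is only there to exercise the function, in practice it would be predicted by a network.

import numpy as np

def blend_linear(transmission, reflection, alpha=0.7):
    # Classic synthesis: a single global mixing factor.
    return alpha * transmission + (1 - alpha) * reflection

def blend_nonlinear(transmission, reflection, alpha_map):
    # Per-pixel alpha map (H, W, 1), e.g. predicted by a network.
    return alpha_map * transmission + (1 - alpha_map) * reflection

T = np.random.rand(128, 128, 3)             # transmission layer (the scene behind the glass)
R = np.random.rand(128, 128, 3)             # reflection layer (often blurred)
alpha = np.random.rand(128, 128, 1)         # spatially varying blending mask
I = blend_nonlinear(T, R, alpha)            # synthesized reflection-contaminated image

Recombining the predicted transmission, reflection, and alpha map with the same blending equation is what makes a reconstruction loss against the input image possible, closing the training loop mentioned in the abstract.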
[focused, predict, second, dataset, predicting] [reconstruction, linear, single, light, ground, defined, corresponding, smoothness, truth, computer] [reflection, transmission, image, blending, removal, alpha, synthesis, proposed, zhang, synthetic, real, method, ghosting, psnr, simulate, ssim, synthesize, user, rmnet, defocused, figure, qualitative, collected, pixel, ceilnet, bdn, synthesized, side, input, remove, quantitative, synnet, technology, paired, recombined] [network, layer, kernel, gradient, deep, best, output, table, denotes, performance, convolutional, structure, science] [model, inception, type, adversarial, diverse, generated, controllable] [three, mask, propose, predicted, guided] [loss, training, data, function, testing, combination, objective, existing, datasets, closed, learning]
@InProceedings{Wen_2019_CVPR,
  author = {Wen, Qiang and Tan, Yinjie and Qin, Jing and Liu, Wenxi and Han, Guoqiang and He, Shengfeng},
  title = {Single Image Reflection Removal Beyond Linearity},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Separate Multiple Illuminants in a Single Image
Zhuo Hui, Ayan Chakrabarti, Kalyan Sunkavalli, Aswin C. Sankaranarayanan


We present a method to separate a single image captured under two illuminants, with different spectra, into the two images corresponding to the appearance of the scene under each individual illuminant. We do this by training a deep neural network to predict the per-pixel reflectance chromaticity of the scene, which we use in conjunction with a previous flash/no-flash image-based separation algorithm to produce the final two output images. We design our reflectance chromaticity network and loss functions by incorporating intuitions from the physics of image formation. We show that this leads to significantly better performance than other single image techniques and even approaches the quality of the two image separation method.
[dataset, perform, multiple, capture] [reflectance, single, light, scene, illumination, intrinsic, hui, well, estimate, ground, truth, lighting, outdoor, corresponding, lit, decomposition, problem, rgb, estimated, additional, estimation, approach, defined, ambient, directly, technique, relative, algorithm, error, kavita] [image, chromaticity, separated, color, illuminant, input, flash, method, separation, shading, figure, quality, produce, kalyan, proposed, photograph, shadingnet, real, separatenet, sylvain, separate, intermediate, chromnet, captured, illuminated, zhuo] [network, deep, output, performance, better, architecture, neural, design, full, computation, table] [model, find, white, evaluate] [final, supervision, global, three, utilize] [loss, training, source, learning, train, trained]
@InProceedings{Hui_2019_CVPR,
  author = {Hui, Zhuo and Chakrabarti, Ayan and Sunkavalli, Kalyan and Sankaranarayanan, Aswin C.},
  title = {Learning to Separate Multiple Illuminants in a Single Image},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Shape Unicode: A Unified Shape Representation
Sanjeev Muralikrishnan, Vladimir G. Kim, Matthew Fisher, Siddhartha Chaudhuri


3D shapes come in varied representations from a set of points to a set of images, each capturing different aspects of the shape. We propose a unified code for 3D shapes, dubbed Shape Unicode, that imbibes shape cues across these representations into a single code, and a novel framework to learn such a code space for any 3D shape dataset. We discuss this framework as a single go-to training model for any input representation, and demonstrate the effectiveness of the learned code space by applying it directly to common shape analysis tasks -- discriminative and generative. In this work, we use three common representations -- voxel grids, point clouds and multi-view projections -- and combine them into a single code. Note that while we use all three representations at training time, the code can be derived from any single representation during testing. We evaluate this code space on shape retrieval, segmentation and correspondence, and show that the unified code performs better than the individual representations themselves. Additionally, this code space compares quite well to the representation-specific state-of-the-art in these tasks. We also qualitatively discuss linear interpolation between points in this space, by synthesizing from intermediate points.
[joint, multiple, time, individual, fusion, framework, jointly] [shape, point, unicode, voxel, single, solo, shapenet, analysis, cloud, voxels, correspondence, ground, truth, descriptor, directly, note, derived, well, dense, surface, geometric, reconstruction, approach, supplementary] [input, figure, translation, method, demonstrate] [network, convolutional, deep, neural, table, output, compare, performance, standard, accuracy, better] [model, common, decoder, encoders, encoding, encoder, generated] [three, segmentation, map, object] [code, training, representation, loss, classification, learning, trained, test, space, unified, embedding, data, train, learned, target, class, retrieval, learn, distance, shared, set, novel, base, informative]
@InProceedings{Muralikrishnan_2019_CVPR,
  author = {Muralikrishnan, Sanjeev and Kim, Vladimir G. and Fisher, Matthew and Chaudhuri, Siddhartha},
  title = {Shape Unicode: A Unified Shape Representation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Robust Video Stabilization by Optimization in CNN Weight Space
Jiyang Yu, Ravi Ramamoorthi


We propose a novel robust video stabilization method. Unlike traditional video stabilization techniques that involve complex motion models, we directly model the appearance change of the frames as the dense optical flow field of consecutive frames. We introduce a new formulation of the video stabilization task based on first principles, which leads to a large scale non-convex problem. This problem is hard to solve, so previous optical flow based approaches have resorted to heuristics. In this paper, we propose a novel optimization routine that transfers this problem into the convolutional neural network parameter domain. While we exploit the general benefits of CNNs, including standard gradient-based optimization techniques, our method is a new approach to using CNNs purely as an optimizer rather than learning from data. Our method trains the CNN from scratch on each specific input example, and intentionally overfits the CNN parameters to produce the best result on the input example. By solving the problem in the CNN weight space rather than directly for image pixels, we make it a viable formulation for video stabilization. Our method produces both visually and quantitatively better results than previous work, and is robust in situations acknowledged as limitations in current state-of-the-art methods.
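A minimal sketch of the "optimize in CNN weight space" idea: instead of solving for the output directly, a small CNN is trained from scratch on a single input example so that its output minimizes the task objective, and the network is deliberately overfitted. The toy fidelity-plus-smoothness energy below is illustrative and is not the paper's stabilization formulation.

import torch
import torch.nn as nn

net = nn.Sequential(                      # a small CNN, re-initialized for each example
    nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 2, 3, padding=1))

flow_in = torch.randn(1, 2, 64, 64)       # the example-specific input (e.g. a raw flow field)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def toy_energy(out, ref):
    fidelity = (out - ref).pow(2).mean()
    smooth = (out[..., 1:] - out[..., :-1]).abs().mean() \
           + (out[..., 1:, :] - out[..., :-1, :]).abs().mean()
    return fidelity + 0.1 * smooth

for step in range(200):                   # intentionally overfit to this one example
    opt.zero_grad()
    loss = toy_energy(net(flow_in), flow_in)
    loss.backward()
    opt.step()

result = net(flow_in).detach()            # the optimized output; no generalization is intended

The CNN here plays the role of a parameterization of the solution, so standard gradient-based training machinery doubles as the non-convex optimizer.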
[video, optical, motion, frame, flow, stabilization, warp, liu, window, complex, previous, stabilize, consecutive, warped, stabilized] [optimization, field, single, problem, note, directly, error, formulation, robust, solve, affine, camera, local, parallax, linear, dense, general, scene, correspondence] [method, input, pixel, image, based, result, figure, transformation, comparison, quality, proposed, ieee, appearance, acm, traditional, quantitative, produce] [network, neural, original, better, number, output, regularization, size, deep, structure, convolutional, processing, compare, scale, best, complexity] [example, model, requires, simple] [feature, cnn, regression, final, score] [large, metric, objective, function, seek, idea, novel, learning]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Jiyang and Ramamoorthi, Ravi},
  title = {Robust Video Stabilization by Optimization in CNN Weight Space},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Linear Transformations for Fast Image and Video Style Transfer
Xueting Li, Sifei Liu, Jan Kautz, Ming-Hsuan Yang


Given a random pair of images, a universal style transfer method extracts the feel from a reference image to synthesize an output based on the look of a content image. Recent algorithms based on second-order statistics, however, are either computationally expensive or prone to generate artifacts due to the trade-off between image quality and runtime performance. In this work, we present an approach for universal style transfer that learns the transformation matrix in a data-driven fashion. Our algorithm is efficient yet flexible to transfer different levels of styles with the same auto-encoder network. It also produces stable video style transfer results due to the preservation of the content affinity. In addition, we propose a linear propagation module to enable a feed-forward network for photo-realistic style transfer. We demonstrate the effectiveness of our approach on three tasks: artistic style, photo-realistic and video style transfer, with comparisons to state-of-the-art methods.
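A speculative sketch of the general idea of predicting and applying a transformation matrix to content features: the module below pools feature statistics and emits a C x C matrix, which is only a simplified stand-in for the paper's data-driven transformation followed by a fixed decoder.

import torch
import torch.nn as nn

# Illustrative sketch: predict a C x C transformation from content/style statistics
# and apply it as a linear map on the content features (sizes are assumptions).
class LinearTransform(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.fc = nn.Linear(2 * channels, channels * channels)

    def forward(self, content_feat, style_feat):
        b, c, h, w = content_feat.shape
        stats = torch.cat([content_feat.mean(dim=(2, 3)),
                           style_feat.mean(dim=(2, 3))], dim=1)   # (B, 2C)
        T = self.fc(stats).view(b, c, c)                           # predicted matrix
        flat = content_feat.view(b, c, h * w)
        out = torch.bmm(T, flat).view(b, c, h, w)                  # linear map per image
        return out + style_feat.mean(dim=(2, 3)).view(b, c, 1, 1)  # add style mean back

content = torch.randn(2, 64, 32, 32)
style = torch.randn(2, 64, 32, 32)
stylized_feat = LinearTransform(64)(content, style)   # would feed a fixed decoder in practice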
[video, propagation, performs, multiple, expensive, work] [matrix, linear, algorithm, reconstruction, single, well, approach, column, compute, directly, optimization, corresponding, stable] [style, content, image, transformation, method, proposed, stylized, based, input, wct, figure, spn, artistic, row, gatys, arbitrary, texture, synthesize, adain, transferred, preserve, stylization, color] [network, covariance, neural, layer, deep, vgg, convolutional, shallower, computationally, efficient, table, fast, efficiently, cov, computational] [model, encoder, generate, transformed, decoder] [feature, module, affinity, cnn, three, propose, faster] [transfer, loss, training, learning, universal, train, pair, learn]
@InProceedings{Li_2019_CVPR,
  author = {Li, Xueting and Liu, Sifei and Kautz, Jan and Yang, Ming-Hsuan},
  title = {Learning Linear Transformations for Fast Image and Video Style Transfer},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Local Detection of Stereo Occlusion Boundaries
Jialiang Wang, Todd Zickler


Stereo occlusion boundaries are one-dimensional structures in the visual field that separate foreground regions of a scene that are visible to both eyes (binocular regions) from background regions of a scene that are visible to only one eye (monocular regions). Stereo occlusion boundaries often coincide with object boundaries, and localizing them is useful for tasks like grasping, manipulation, and navigation. This paper describes the local signatures for stereo occlusion boundaries that exist in a stereo cost volume, and it introduces a local detector for them based on a simple feedforward network with relatively small receptive fields. The local detector produces better boundaries than many other stereo methods, even without incorporating explicit stereo matching, top-down contextual cues, or single-image boundary cues based on texture and intensity.
[work, flow, dataset, optical, sintel, adjacent, localizing] [stereo, occlusion, local, disparity, vision, volume, computer, left, depth, matching, pattern, epipolar, scene, middlebury, monocular, cyclopean, binocular, occluding, point, textured, error, algorithm, international, visible, approach, single, associated, ground, journal] [figure, texture, background, image, intensity, produce, side, based, thin, perceptual, conference, separate, feedforward, input, pixel, resolution] [cost, network, size, output, receptive, siamese, pooling, small] [machine, visual, find, evidence, simple] [detector, boundary, detection, foreground, spatial, map, score, multiscale, object, detect, location, detected, exist] [train, training, space]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Jialiang and Zickler, Todd},
  title = {Local Detection of Stereo Occlusion Boundaries},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Bi-Directional Cascade Network for Perceptual Edge Detection
Jianzhong He, Shiliang Zhang, Ming Yang, Yanhu Shan, Tiejun Huang


Exploiting multi-scale representations is critical to improve edge detection for objects at different scales. To extract edges at dramatically different scales, we propose a Bi-Directional Cascade Network (BDCN) structure, where an individual layer is supervised by labeled edges at its specific scale, rather than directly applying the same supervision to all CNN outputs. Furthermore, to enrich multi-scale representations learned by BDCN, we introduce a Scale Enhancement Module (SEM) which utilizes dilated convolution to generate multi-scale features, instead of using deeper CNNs or explicitly fusing multi-scale edge maps. These new approaches encourage the learning of multi-scale representations in different layers and detect edges that are well delineated by their scales. Learning scale dedicated layers also results in a compact network with a fraction of parameters. We evaluate our method on three datasets, i.e., BSDS500, NYUDv2, and Multicue, and achieve an ODS F-measure of 0.828, 1.3% higher than the current state-of-the-art on BSDS500.
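For illustration, a minimal sketch of a dilated-convolution scale enhancement module in the spirit of SEM; channel counts and dilation rates are assumptions, not the BDCN configuration.

import torch
import torch.nn as nn

# Parallel dilated-convolution branches enlarge the receptive field without deepening
# the network; their outputs are summed into a multi-scale feature (illustrative rates).
class ScaleEnhancementModule(nn.Module):
    def __init__(self, in_ch=64, mid_ch=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, 3, padding=d, dilation=d) for d in dilations])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(sum(b(x) for b in self.branches))

feat = torch.randn(1, 64, 80, 80)
multi_scale = ScaleEnhancementModule()(feat)   # (1, 32, 80, 80)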
[multiple, human, outperforms, previous] [approach, pattern, rgb, groundtruth] [image, method, comparison, based, contour, ieee, figure, input, intermediate, enhancement] [network, deep, scale, bdcn, performance, convolutional, layer, shallow, block, table, conv, multicue, dilated, number, deeper, higher, architecture, convolution, neural, pooling, achieve, structure, dilation, achieves, ced, impact] [natural, introduce] [edge, detection, rcf, sem, cascade, boundary, object, propose, cnn, three, segmentation, feature, ois, final, semantic, hierarchical, improve, supervision, module, detect] [learning, training, set, loss, learned, train, trained, test]
@InProceedings{He_2019_CVPR,
  author = {He, Jianzhong and Zhang, Shiliang and Yang, Ming and Shan, Yanhu and Huang, Tiejun},
  title = {Bi-Directional Cascade Network for Perceptual Edge Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Single Image Deraining: A Comprehensive Benchmark Analysis
Siyuan Li, Iago Breno Araujo, Wenqi Ren, Zhangyang Wang, Eric K. Tokuda, Roberto Hirata Junior, Roberto Cesar-Junior, Jiawan Zhang, Xiaojie Guo, Xiaochun Cao


We present a comprehensive study and evaluation of existing single image deraining algorithms, using a new large-scale benchmark consisting of both synthetic and real-world rainy images. This dataset highlights diverse data sources and image contents, and is divided into three subsets (rain streak, rain drop, rain and mist), each serving different training or evaluation purposes. We further provide a rich variety of criteria for deraining algorithm evaluation, ranging from full-reference metrics, to no-reference metrics, to subjective evaluation and the novel task-driven evaluation. Experiments on the dataset shed light on the comparisons and limitations of state-of-the-art deraining algorithms, and suggest promising future directions.
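As an example of the full-reference criteria mentioned above, a small evaluation loop computing PSNR and SSIM over derained/ground-truth pairs; scikit-image's metrics are used here, the channel_axis argument assumes a recent scikit-image version, and the benchmark's exact criteria are not reproduced.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Illustrative full-reference evaluation loop; image arrays are placeholders.
def evaluate_pairs(derained_images, ground_truths):
    psnrs, ssims = [], []
    for pred, gt in zip(derained_images, ground_truths):
        pred = pred.astype(np.float64)
        gt = gt.astype(np.float64)
        psnrs.append(peak_signal_noise_ratio(gt, pred, data_range=255))
        ssims.append(structural_similarity(gt, pred, data_range=255, channel_axis=-1))
    return float(np.mean(psnrs)), float(np.mean(ssims))

# Example with random stand-in images of shape (H, W, 3) in [0, 255]:
preds = [np.random.randint(0, 256, (128, 128, 3)) for _ in range(4)]
gts = [np.random.randint(0, 256, (128, 128, 3)) for _ in range(4)]
print(evaluate_pairs(preds, gts))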
[human, video, dataset, driving] [computer, single, vision, pattern, international, analysis, scene, gmm, well, algorithm, camera] [rain, image, deraining, rainy, streak, ieee, real, synthetic, conference, mist, removal, raindrop, clean, derained, rid, sseq, deraindrop, method, based, quality, surveillance, subjective, perceptual, mpid, removing, niqe, comparison, jorder, zhangyang, study, background, dehazing, figure, psnr, ssim, comprehensive, atmospheric] [performance, table, network, deep, best, drop, better, neural, small] [evaluation, evaluate, model, cgan, ddn, type, diverse, adversarial, arxiv, preprint, progress, visual, goal] [detection, object, three, score, map] [set, training, existing, domain, testing]
@InProceedings{Li_2019_CVPR,
  author = {Li, Siyuan and Breno Araujo, Iago and Ren, Wenqi and Wang, Zhangyang and Tokuda, Eric K. and Hirata Junior, Roberto and Cesar-Junior, Roberto and Zhang, Jiawan and Guo, Xiaojie and Cao, Xiaochun},
  title = {Single Image Deraining: A Comprehensive Benchmark Analysis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dynamic Scene Deblurring With Parameter Selective Sharing and Nested Skip Connections
Hongyun Gao, Xin Tao, Xiaoyong Shen, Jiaya Jia


Dynamic scene deblurring is a challenging low-level vision task where spatially variant blur is caused by many factors, e.g., camera shake and object motion. Recent studies have made significant progress. Compared with the parameter independence scheme [19] and parameter sharing scheme [33], we develop the general principle for constraining the deblurring network structure by proposing the generic and effective selective sharing scheme. Inside the subnetwork of each scale, we propose a nested skip connection structure for the nonlinear transformation modules to replace stacked convolution layers or residual blocks. Besides, we build a new large dataset of blurred/sharp image pairs towards better restoration quality. Comprehensive experimental results show that our parameter selective sharing scheme, nested skip connection structure, and the new dataset are all significant to set a new state-of-the-art in dynamic scene deblurring.
[dataset, dynamic, motion, perform, work, recurrent, second] [scene, camera, reconstruction, general, dense, ground, truth, column, single] [image, deblurring, blurred, sharp, transformation, nonlinear, proposed, gopro, blur, method, restoration, nah, figure, resblocks, input, quantitative, spatially, latent, caused, mapping] [parameter, sharing, skip, network, residual, nested, deep, structure, scheme, connection, better, scale, convolutional, neural, independence, convolution, table, stacked, compared, layer, kernel, compare, size, variant, effective, fine] [model, encoder, evaluation, progressively] [feature, selective, extraction, object, module, stage, cnn, subnetwork, region, propose] [training, shared, learning, independent, tao, large, set, trained]
@InProceedings{Gao_2019_CVPR,
  author = {Gao, Hongyun and Tao, Xin and Shen, Xiaoyong and Jia, Jiaya},
  title = {Dynamic Scene Deblurring With Parameter Selective Sharing and Nested Skip Connections},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Events-To-Video: Bringing Modern Computer Vision to Event Cameras
Henri Rebecq, Rene Ranftl, Vladlen Koltun, Davide Scaramuzza


Event cameras are novel sensors that report brightness changes in the form of asynchronous "events" instead of intensity frames. They have significant advantages over conventional cameras: high temporal resolution, high dynamic range, and no motion blur. Since the output of event cameras is fundamentally different from conventional cameras, it is commonly accepted that they require the development of specialized algorithms to accommodate the particular nature of events. In this work, we take a different view and propose to apply existing, mature computer vision techniques to videos reconstructed from event data. We propose a novel, recurrent neural network to reconstruct videos from a stream of events and train it on a large amount of simulated event data. Our experiments show that our approach surpasses state-of-the-art reconstruction methods by a large margin (> 20%) in terms of image quality. We further apply off-the-shelf computer vision algorithms to videos reconstructed from event data on tasks such as object classification and visual-inertial odometry, and show that this strategy consistently outperforms algorithms that were specifically designed for event data. We believe that our approach opens the door to bringing the outstanding properties of event cameras to an entirely new range of tasks.
[event, stream, davide, motion, dynamic, dataset, time, sequence, video, ultimateslam, brightness, tracking, flow, henri, outperforms, work, tobi, guillermo, temporal, recurrent, optical, inertial, report, frame] [camera, vision, reconstruction, approach, computer, pattern, error, odometry, directly, range, direct, sensor, contrast, well, allows] [image, ieee, method, intensity, reconstructed, reconstruct, high, simulated, figure, quality, real, based, comparison, prior, translation, amount] [network, compared, latency, deep, output, apply, filter] [visual, natural, machine, fact] [object, feature, integration] [data, classification, conventional, large, learning, datasets, training, train, set]
@InProceedings{Rebecq_2019_CVPR,
  author = {Rebecq, Henri and Ranftl, Rene and Koltun, Vladlen and Scaramuzza, Davide},
  title = {Events-To-Video: Bringing Modern Computer Vision to Event Cameras},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Feedback Network for Image Super-Resolution
Zhen Li, Jinglei Yang, Zheng Liu, Xiaomin Yang, Gwanggil Jeon, Wei Wu


Recent advances in image super-resolution (SR) explored the power of deep learning to achieve better reconstruction performance. However, the feedback mechanism, which commonly exists in the human visual system, has not been fully exploited in existing deep learning based image SR methods. In this paper, we propose an image super-resolution feedback network (SRFBN) to refine low-level representations with high-level information. Specifically, we use hidden states in a recurrent neural network (RNN) with constraints to achieve such a feedback manner. A feedback block is designed to handle the feedback connections and to generate powerful high-level representations. The proposed SRFBN comes with a strong early reconstruction ability and can create the final high-resolution image step by step. In addition, we introduce a curriculum learning strategy to make the network well suitable for more complicated tasks, where the low-resolution images are corrupted by multiple types of degradation. Extensive experimental results demonstrate the superiority of the proposed SRFBN in comparison with the state-of-the-art methods. Code is available at https://github.com/Paper99/SRFBN_CVPR19.
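A minimal sketch of the feedback manner described above: the block's output at one step re-enters as the hidden state at the next step, so high-level information refines low-level features. Layer sizes are illustrative and the upsampling/reconstruction details of SRFBN are omitted.

import torch
import torch.nn as nn

# Illustrative feedback loop: hidden state from step t-1 is fused with the low-level
# features at step t, and each step emits its own SR estimate.
class FeedbackBlock(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, 1)
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())

    def forward(self, low_level, hidden):
        return self.body(self.fuse(torch.cat([low_level, hidden], dim=1)))

block, head = FeedbackBlock(), nn.Conv2d(3, 32, 3, padding=1)
tail = nn.Conv2d(32, 3, 3, padding=1)
lr = torch.randn(1, 3, 48, 48)
feat, hidden = head(lr), torch.zeros(1, 32, 48, 48)
outputs = []
for _ in range(4):                       # T feedback iterations
    hidden = block(feat, hidden)
    outputs.append(tail(hidden))         # upsampling omitted for brevity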
[recurrent, hidden, state, early, multiple, complex, previous, second] [reconstruction, single, dense, projection, well, accurate] [feedback, image, degradation, proposed, input, feedforward, based, figure, bicubic, comparison, ihr, vdsr, psnr, corrupted, edsr] [network, conv, srfbn, iteration, fout, deep, block, fin, skip, residual, scale, output, convolutional, factor, deconv, number, performance, achieve, upsample, neural, structure, better, size, denotes, isr, best, powerful, drrn, downsampling, operation, represents, compared, table] [mechanism, model, visual, generate, represent] [feature, contextual, average] [learning, curriculum, strategy, training, set, loss, experimental]
@InProceedings{Li_2019_CVPR,
  author = {Li, Zhen and Yang, Jinglei and Liu, Zheng and Yang, Xiaomin and Jeon, Gwanggil and Wu, Wei},
  title = {Feedback Network for Image Super-Resolution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Semi-Supervised Transfer Learning for Image Rain Removal
Wei Wei, Deyu Meng, Qian Zhao, Zongben Xu, Ying Wu


Single image rain removal is a typical inverse problem in computer vision. The deep learning technique has been verified to be effective for this task and has achieved state-of-the-art performance. However, previous deep learning methods need to pre-collect a large set of image pairs with/without synthesized rain for training, which tends to make the neural network biased toward learning the specific patterns of the synthesized rain, while being less able to generalize to real test samples whose rain types differ from those in the training data. To address this issue, this paper first proposes a semi-supervised learning paradigm for this task. Different from traditional deep learning methods which only use supervised image pairs with/without synthesized rain, we further put real rainy images, without the need for their clean counterparts, into the network training process. This is realized by elaborately formulating the residual between an input rainy image and its expected network output (clear image without rain) as a concise mixture of Gaussians distribution. The network is therefore trained to adapt to the real rain pattern domain instead of only the synthetic rain domain, and thus both the short-of-training-sample and bias-to-supervised-sample issues can be evidently alleviated. Experiments on synthetic and real data verify the superiority of our model compared to the state-of-the-art methods.
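A hedged sketch of the unsupervised term described above: the residual between a real rainy input and the predicted clean image is scored under a mixture of zero-mean Gaussians and its negative log-likelihood is minimized. The mixture parameters are fixed here for brevity (the paper also updates them during training), and torch.pi assumes PyTorch 1.10 or newer.

import torch

# Negative log-likelihood of per-pixel residuals under a zero-mean Gaussian mixture.
def gmm_negative_log_likelihood(residual, weights, sigmas):
    r = residual.reshape(-1)                                    # per-pixel residuals
    comps = []
    for w, s in zip(weights, sigmas):
        comps.append(torch.log(w) - 0.5 * (r / s) ** 2 - torch.log(s)
                     - 0.5 * torch.log(torch.tensor(2 * torch.pi)))
    log_lik = torch.logsumexp(torch.stack(comps, dim=0), dim=0)
    return -log_lik.mean()

rainy = torch.rand(1, 3, 64, 64)
predicted_clean = torch.rand(1, 3, 64, 64, requires_grad=True)  # stands in for the network output
weights = torch.tensor([0.7, 0.3])                              # illustrative mixture parameters
sigmas = torch.tensor([0.05, 0.2])
loss = gmm_negative_log_likelihood(rainy - predicted_clean, weights, sigmas)
loss.backward()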
[term, video, previous, joint] [computer, vision, single, pattern, corresponding, ground, truth, dense, square, optimization, problem, gmm, international] [rain, image, real, rainy, synthesized, figure, method, removal, ieee, conference, clean, sirr, input, proposed, synthetic, streak, component, remove, prior, imposed, based, background, mreal, capable, restoration, noise] [network, deep, sparse, better, output, convolutional, neural, compared, performance, represents, gaussian, validation, residual, order, design, gradient, layer, parameter] [model, visual, represent, introduced] [cnn] [learning, supervised, data, unsupervised, training, domain, transfer, function, distribution, loss, mixture, set, test, generally, learned, task, specific, trained, testing, main, likelihood, log, objective]
@InProceedings{Wei_2019_CVPR,
  author = {Wei, Wei and Meng, Deyu and Zhao, Qian and Xu, Zongben and Wu, Ying},
  title = {Semi-Supervised Transfer Learning for Image Rain Removal},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
EventNet: Asynchronous Recursive Event Processing
Yusuke Sekikawa, Kosuke Hara, Hideo Saito


Event cameras are bio-inspired vision sensors that mimic retinas to asynchronously report per-pixel intensity changes rather than outputting an actual intensity image at regular intervals. This new paradigm of image sensor offers significant potential advantages; namely, sparse and non-redundant data representation. Unfortunately, most existing artificial neural network architectures, such as CNNs, require dense synchronous input data, and therefore, cannot make use of the sparseness of the data. We propose EventNet, a neural network designed for real-time processing of asynchronous event streams in a recursive and event-wise manner. EventNet models dependence of the output on tens of thousands of causal events recursively using a novel temporal coding scheme. As a result, at inference time, our network operates in an event-wise manner that is realized with very few sum-of-the-product operations---look-up table and temporal feature aggregation---which enables processing of one million or more events per second on a standard CPU. In experiments using real data, we demonstrate the real-time performance and robustness of our framework.
[event, temporal, eventnet, time, viii, stream, window, complex, motion, asynchronous, recursively, sequence, manner, lut, spiking, demand, latest] [pointnet, estimation, camera, single, vision, computed, compute, point, rotation, latexit, permutation, algorithm, error] [input, conference, ieee, processed, intensity, real, change, proposed, application, based, difference] [processing, network, neural, process, max, recursive, rate, coding, architecture, output, deep, computation, batch, table, variant, performance, mlp, sparse, realized, computational, size, standard, structure, inference, efficient, trainable] [model] [feature, global, object, module, semantic, segmentation, faster] [data, learning, function, training, paradigm, randomly]
@InProceedings{Sekikawa_2019_CVPR,
  author = {Sekikawa, Yusuke and Hara, Kosuke and Saito, Hideo},
  title = {EventNet: Asynchronous Recursive Event Processing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Recurrent Back-Projection Network for Video Super-Resolution
Muhammad Haris, Gregory Shakhnarovich, Norimichi Ukita


We propose a novel architecture for the problem of video super-resolution. We integrate spatial and temporal contexts from continuous video frames using a recurrent encoder-decoder module that fuses multi-frame information with the more traditional, single frame super-resolution path for the target frame. In contrast to most prior work where frames are pooled together by stacking or warping, our model, the Recurrent Back-Projection Network (RBPN), treats each context frame as a separate source of information. These sources are combined in an iterative refinement framework inspired by the idea of back-projection in multiple-image super-resolution. This is aided by explicitly representing estimated inter-frame motion with respect to the target, rather than explicitly aligning frames. We propose a new video super-resolution benchmark, allowing evaluation at a larger scale and considering videos in different motion regimes. Experimental results demonstrate that our RBPN is superior to existing methods on several datasets.
[video, temporal, rbpn, frame, motion, recurrent, dbpn, multiple, flow, vsr, misr, netmisr, sequence, previous, netsisr, drdvsr, work, concatenated, iteratively, rnn, spmcs, construct, muhammad] [computer, pattern, vision, approach, projection, single, international, estimated, explicit] [image, ieee, conference, missing, figure, sisr, proposed, produce, input, bicubic, resolution, remove, zoom, sharper, psnr] [deep, network, table, performance, better, convolutional, residual, neural, order, best, fast, concatenation, magnitude, architecture] [evaluation, encoder, visual, consider, red, path, blue, indicates] [feature, context, three, map, aligned] [target, neighbor, alignment, learning, trained]
@InProceedings{Haris_2019_CVPR,
  author = {Haris, Muhammad and Shakhnarovich, Gregory and Ukita, Norimichi},
  title = {Recurrent Back-Projection Network for Video Super-Resolution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Cascaded Partial Decoder for Fast and Accurate Salient Object Detection
Zhe Wu, Li Su, Qingming Huang


Existing state-of-the-art salient object detection networks rely on aggregating multi-level features of pre-trained convolutional neural networks (CNNs). However, compared to high-level features, low-level features contribute less to performance. Meanwhile, they incur more computational cost because of their larger spatial resolutions. In this paper, we propose a novel Cascaded Partial Decoder (CPD) framework for fast and accurate salient object detection. On the one hand, the framework constructs a partial decoder which discards larger-resolution features of shallow layers for acceleration. On the other hand, we observe that integrating features of deep layers yields a relatively precise saliency map. Therefore, we directly utilize the generated saliency map to recurrently optimize the features of deep layers. This strategy efficiently suppresses distractors in the features and significantly improves their representation ability. Experiments conducted on five benchmark datasets show that the proposed model not only achieves state-of-the-art performance but also runs much faster than existing models. Besides, we apply the proposed framework to optimize existing multi-level feature aggregation models and significantly improve their efficiency and accuracy.
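For illustration, a toy sketch of the cascaded partial-decoder idea: shallow high-resolution features are discarded, an initial map is decoded from deep features only, and that map multiplicatively refines the same deep features before a second decoding pass. The decoder modules below are placeholders.

import torch
import torch.nn as nn

# Placeholder decoder; the real framework decodes several deep backbone levels.
class TinyDecoder(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 1))

    def forward(self, x):
        return self.net(x)

deep_feat = torch.randn(1, 64, 22, 22)                    # deep backbone features only
decoder_a, decoder_b = TinyDecoder(), TinyDecoder()

initial_map = torch.sigmoid(decoder_a(deep_feat))         # first (partial) decoding pass
refined_feat = deep_feat * initial_map                    # suppress distractors in the features
final_map = torch.sigmoid(decoder_b(refined_feat))        # second decoding pass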
[framework, complex, work, dataset] [initial, optimization, accurate] [proposed, image, comparison, figure, based, method, traditional] [deep, convolutional, layer, aggregation, original, shadow, performance, network, compared, fast, design, computation, deeper, table, cost, shallower, accuracy, full] [model, attention, decoder, partial, improved, mechanism] [saliency, salient, detection, map, object, feature, maxf, branch, mae, avgf, refine, holistic, cascaded, backbone, improve, bmpm, propose, duts, module, integrate, nldf, three, spatial, integrating, precise, benchmark, fully, semantic, segment, context, global, amulet, cpd, utilize, faster, segmentation] [set, existing, learning, training, large, strategy, novel, train]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Zhe and Su, Li and Huang, Qingming},
  title = {Cascaded Partial Decoder for Fast and Accurate Salient Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Simple Pooling-Based Design for Real-Time Salient Object Detection
Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Jiashi Feng, Jianmin Jiang


We solve the problem of salient object detection by investigating how to expand the role of pooling in convolutional neural networks. Based on the U-shape architecture, we first build a global guidance module (GGM) upon the bottom-up pathway, aiming at providing layers at different feature levels the location information of potential salient objects. We further design a feature aggregation module (FAM) to make the coarse-level semantic information well fused with the fine-level features from the top-down pathway. By adding FAMs after the fusion operations in the top-down pathway, coarse-level features from the GGM can be seamlessly merged with features at various scales. These two pooling-based modules allow the high-level semantic features to be progressively refined, yielding detail-enriched saliency maps. Experimental results show that our proposed approach can more accurately locate the salient objects with sharpened details and hence substantially improve the performance compared to previous state-of-the-art methods. Our approach is fast as well and can run at a speed of more than 30 FPS when processing a 300x400 image. Code can be found at http://mmcheng.net/poolnet/.
[previous, joint, capture, multiple, version, series, dataset, merged] [approach, local, greatly, corresponding, column] [proposed, based, image, dts, input, figure, conduct, produced, row, ieee] [performance, convolutional, network, deep, pooling, table, aggregation, compared, better, architecture, receptive, size, effectiveness, design, experiment, layer] [ppm, visual, model] [salient, feature, saliency, object, detection, edge, ggm, global, module, fams, mae, fpn, guidance, semantic, pyramid, poolnet, maxf, huchuan, context, three, backbone, ggfs, ablation, baseline, xiang, map, role, location, improve, detailed] [training, datasets, learning, large]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Jiang-Jiang and Hou, Qibin and Cheng, Ming-Ming and Feng, Jiashi and Jiang, Jianmin},
  title = {A Simple Pooling-Based Design for Real-Time Salient Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Contrast Prior and Fluid Pyramid Integration for RGBD Salient Object Detection
Jia-Xing Zhao, Yang Cao, Deng-Ping Fan, Ming-Ming Cheng, Xuan-Yi Li, Le Zhang


The large availability of depth sensors provides valuable complementary information for salient object detection (SOD) in RGBD images. However, due to the inherent difference between RGB and depth information, extracting features from the depth channel using ImageNet pre-trained backbone models and fusing them with RGB features directly are sub-optimal. In this paper, we incorporate the contrast prior, which used to be a dominant cue in non-deep-learning-based SOD approaches, into a CNN-based architecture to enhance the depth information. The enhanced depth cues are further integrated with RGB features for SOD, using a novel fluid pyramid integration, which can make better use of multi-scale cross-modal features. Comprehensive experiments on 5 challenging benchmark datasets demonstrate the superiority of the proposed CPFP architecture over 9 state-of-the-art alternative methods.
[fusion, multiple, tier, work, second] [depth, contrast, rgb, rgbd, well, directly] [proposed, based, image, background, ieee, method, prior, figure] [deep, architecture, compared, network, convolutional, original, net, denotes, channel, size, designed, design, performance, connection] [model, visual, simple, attention, introduced] [salient, feature, object, saliency, enhanced, pyramid, map, detection, fluid, integration, foreground, sod, fusing, enhance, mae, backbone, cnn, region, challenging, fuse, utilize, three, predicted, final, meanf, maxf, ali, nlpr, propose, hierarchical, multiscale] [loss, set, learning, distribution, compatibility, log, novel, existing]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Jia-Xing and Cao, Yang and Fan, Deng-Ping and Cheng, Ming-Ming and Li, Xuan-Yi and Zhang, Le},
  title = {Contrast Prior and Fluid Pyramid Integration for RGBD Salient Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Progressive Image Deraining Networks: A Better and Simpler Baseline
Dongwei Ren, Wangmeng Zuo, Qinghua Hu, Pengfei Zhu, Deyu Meng


Along with the deraining performance improvement of deep networks, their structures and learning become more and more complicated and diverse, making it difficult to analyze the contribution of various network modules when developing new deraining networks. To handle this issue, this paper provides a better and simpler baseline deraining network by considering network architecture, input and output, and loss functions. Specifically, by repeatedly unfolding a shallow ResNet, progressive ResNet (PRN) is proposed to take advantage of recursive computation. A recurrent layer is further introduced to exploit the dependencies of deep features across stages, forming our progressive recurrent network (PReNet). Furthermore, intra-stage recursive computation of ResNet can be adopted in PRN and PReNet to notably reduce network parameters with unsubstantial degradation in deraining performance. For network input and output, we take both stage-wise result and original rainy image as input to each ResNet and finally output the prediction of residual image. As for loss functions, single MSE or negative SSIM losses are sufficient to train PRN and PReNet. Experiments show that PRN and PReNet perform favorably on both synthetic and real rainy images. Considering its simplicity, efficiency and effectiveness, our models are expected to serve as a suitable baseline in future deraining research. The source codes are available at https://github.com/csdwren/PReNet.
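A minimal sketch of the progressive unfolding with parameter sharing described above: one shallow residual-style stage is reused across stages, each taking the rainy image concatenated with the current estimate. The recurrent LSTM state and the negative-SSIM loss used in PReNet are omitted, and all sizes are illustrative.

import torch
import torch.nn as nn

# One shared stage, unfolded several times: each pass refines the previous estimate.
class SharedStage(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.fin = nn.Conv2d(6, ch, 3, padding=1)        # concat(rainy, current estimate)
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.fout = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, rainy, estimate):
        x = self.fin(torch.cat([rainy, estimate], dim=1))
        return self.fout(self.body(x))

stage = SharedStage()                       # one module, reused at every stage
rainy = torch.randn(1, 3, 64, 64)
estimate = rainy
for _ in range(6):                          # progressive unfolding with shared parameters
    estimate = stage(rainy, estimate)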
[recurrent, joint, lstm, predicting, video, complex] [single, computer, vision, pattern, international, directly] [deraining, image, prenet, rain, rainy, prn, ssim, ieee, conference, psnr, background, input, prenetr, result, comparison, resblocks, removal, prnr, mse, rescan, fres, real, clean, streak, jorder, figure, synthetic, quality, unfolding, frecurrent, removing, method, based] [network, progressive, recursive, deep, layer, table, residual, performance, better, convolutional, convolution, resnet, fin, simpler, fout, deeper, achieve, computation, reduce, output] [ddn, model, visual, introduced] [stage, baseline, adopted, cnn, heavy, final] [loss, learning, negative, train, training, trained, conventional, learn]
@InProceedings{Ren_2019_CVPR,
  author = {Ren, Dongwei and Zuo, Wangmeng and Hu, Qinghua and Zhu, Pengfei and Meng, Deyu},
  title = {Progressive Image Deraining Networks: A Better and Simpler Baseline},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
GSPN: Generative Shape Proposal Network for 3D Instance Segmentation in Point Cloud
Li Yi, Wang Zhao, He Wang, Minhyuk Sung, Leonidas J. Guibas


We introduce a novel 3D object proposal approach named Generative Shape Proposal Network (GSPN) for instance segmentation in point cloud data. Instead of treating object proposal as a direct bounding box regression problem, we take an analysis-by-synthesis strategy and generate proposals by reconstructing shapes from noisy observations in a scene. We incorporate GSPN into a novel 3D instance segmentation framework named Region-based PointNet (R-PointNet) which allows flexible proposal refinement and instance segmentation generation. We achieve state-of-the-art performance on several 3D instance segmentation tasks. The success of GSPN largely comes from its emphasis on geometric understandings during object proposal, greatly reducing proposals with low objectness.
[framework, learns, prediction, recognition] [point, cloud, computer, shape, approach, vision, pattern, scene, directly, ground, truth, pointnet, indoor, scannet, chair, volume, geometric] [conference, generative, ieee, figure, prior, high, based, input, conditional, image, conduct] [network, deep, table, neural, binary, performance, design, convolutional, architecture, scale] [generation, generated, partial, generates, arxiv, preprint, sensitive, evaluation, natural, model, generate] [object, segmentation, instance, proposal, feature, gspn, semantic, bounding, context, box, sgpn, center, seed, detection, mask, objectness, roi, fsem, iou, ablation, region, roialign, score, final, including, propose, segment] [data, loss, distribution, learning, set, large, metric, training]
@InProceedings{Yi_2019_CVPR,
  author = {Yi, Li and Zhao, Wang and Wang, He and Sung, Minhyuk and Guibas, Leonidas J.},
  title = {GSPN: Generative Shape Proposal Network for 3D Instance Segmentation in Point Cloud},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Attentive Relational Networks for Mapping Images to Scene Graphs
Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, Jiebo Luo


Scene graph generation refers to the task of automatically mapping an image into a semantic structural graph, which requires correctly labeling each extracted object and its interaction relationships. Despite the recent success in object detection using deep learning techniques, inferring complex contextual relationships and structured graph representations from visual data remains a challenging topic. In this study, we propose a novel Attentive Relational Network that consists of two key modules with an object detection backbone to approach this problem. The first module is a semantic transformation module utilized to capture semantic embedded relation features, by translating visual features and linguistic features into a common semantic space. The other module is a graph self-attention module introduced to embed a joint graph representation by assigning various importance weights to neighboring nodes. Finally, accurate scene graphs are produced by the relation inference module to recognize all entities and corresponding relations. We evaluate our proposed method on the widely-adopted Visual Genome Dataset, and the results demonstrate the effectiveness and superiority of our model.
[graph, joint, capture, structural, previous, dataset] [scene, corresponding, ground, truth, denote] [proposed, figure, transformation, image, based, input, method, denoted, mapping] [neural, network, deep, table, inference, better, structure, concatenation, size, convolutional, denotes, weight, output] [visual, model, relationship, entity, node, attention, relational, word, common, predicate, generation, genome, linguistic, embedded, introduce, correctly, message, represent, vector] [semantic, relation, module, object, detection, feature, attentive, bounding, neighboring, adopt, three, global, category, propose, detected] [representation, embedding, learning, set, classification, label, learned, task, loss, knowledge, space, refers, function]
@InProceedings{Qi_2019_CVPR,
  author = {Qi, Mengshi and Li, Weijian and Yang, Zhengyuan and Wang, Yunhong and Luo, Jiebo},
  title = {Attentive Relational Networks for Mapping Images to Scene Graphs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Relational Knowledge Distillation
Wonpyo Park, Dongju Kim, Yan Lu, Minsu Cho


Knowledge distillation aims at transferring knowledge acquired in one model (a teacher) to another model (a student) that is typically smaller. Previous approaches can be expressed as a form of training the student to mimic output activations of individual data examples represented by the teacher. We introduce a novel approach, dubbed relational knowledge distillation (RKD), that transfers mutual relations of data examples instead. For concrete realizations of RKD, we propose distance-wise and angle-wise distillation losses that penalize structural differences in relations. Experiments conducted on different tasks show that the proposed method improves educated student models with a significant margin. In particular for metric learning, it allows students to outperform their teachers' performance, achieving the state of the arts on standard benchmark datasets.
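Following the abstract's description, a small sketch of distance-wise and angle-wise relational losses on a batch of teacher/student embeddings; the normalization and loss weighting below are simplifications rather than the released implementation.

import torch
import torch.nn.functional as F

# Match pairwise distances (distance-wise) and triplet angles (angle-wise) between
# teacher and student, instead of matching outputs example by example.
def pairwise_distances(x):
    return torch.cdist(x, x, p=2)

def distance_wise_loss(student, teacher):
    ds, dt = pairwise_distances(student), pairwise_distances(teacher)
    # Normalize by the mean distance so the scales of the two spaces are comparable.
    ds = ds / (ds.mean() + 1e-8)
    dt = dt / (dt.mean() + 1e-8)
    return F.smooth_l1_loss(ds, dt)

def angle_wise_loss(student, teacher):
    def angles(x):
        diff = x.unsqueeze(0) - x.unsqueeze(1)            # (N, N, D) pairwise differences
        diff = F.normalize(diff, dim=2)
        return torch.einsum('ijd,kjd->ijk', diff, diff)   # cosine of the angle at anchor j
    return F.smooth_l1_loss(angles(student), angles(teacher))

student_emb, teacher_emb = torch.randn(16, 64), torch.randn(16, 128)
loss = distance_wise_loss(student_emb, teacher_emb) + 2.0 * angle_wise_loss(student_emb, teacher_emb)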
[work, individual, recognition, multiple, structural, hidden, combined] [computer, vision, international, pattern, approach, form, single, additional] [conference, proposed, method, ieee, image, figure, difference] [output, deep, neural, network, normalization, table, layer, effective, apply, better, performance, tiny, smaller, imagenet, processing, outperform, accuracy] [model, relational, attention, potential, introduce] [final, object, propose, improves] [teacher, rkd, knowledge, student, distillation, loss, learning, embedding, metric, training, data, triplet, trained, transfer, conventional, set, fitnet, darkrank, function, distance, train, softmax, hkd, stanford, classification, representation, objective, distilling]
@InProceedings{Park_2019_CVPR,
  author = {Park, Wonpyo and Kim, Dongju and Lu, Yan and Cho, Minsu},
  title = {Relational Knowledge Distillation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Compressing Convolutional Neural Networks via Factorized Convolutional Filters
Tuanhui Li, Baoyuan Wu, Yujiu Yang, Yanbo Fan, Yong Zhang, Wei Liu


This work studies model compression for deep convolutional neural networks (CNNs) via filter pruning. The workflow of traditional pruning consists of three sequential stages: pre-training the original model, selecting the pre-trained filters via ranking according to a manually designed criterion (e.g., the norm of filters), and learning the remaining filters via fine-tuning. Most existing works follow this pipeline and focus on designing different ranking criteria for filter selection. However, it is difficult to control the performance due to the separation of filter selection and filter learning. In this work, we propose to conduct filter selection and filter learning simultaneously, in a unified model. To this end, we define a factorized convolutional filter (FCF), consisting of a standard real-valued convolutional filter and a binary scalar, as well as a dot-product operator between them. We train a CNN model with factorized convolutional filters (CNN-FCF) by updating the standard filter using back-propagation, while updating the binary scalar using the alternating direction method of multipliers (ADMM) based optimization method. With this trained CNN-FCF model, we only keep the standard filters corresponding to the 1-valued scalars, while all other filters and all binary scalars are discarded, to obtain a compact CNN model. Extensive experiments on CIFAR-10 and ImageNet demonstrate the superiority of the proposed method over state-of-the-art filter pruning methods.
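An illustrative sketch of a factorized convolutional filter: each output filter is paired with a scalar gate whose value decides whether the filter survives pruning. The gates here are ordinary real-valued parameters for brevity, whereas the paper constrains them to be binary and updates them with ADMM.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Every output filter is scaled by a per-filter scalar gate; pruning keeps only the
# filters whose gates survive (binary constraint and ADMM update omitted here).
class FactorizedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.gate = nn.Parameter(torch.ones(out_ch))        # one scalar per filter

    def forward(self, x):
        w = self.weight * self.gate.view(-1, 1, 1, 1)        # gate scales whole filters
        return F.conv2d(x, w, padding=1)

    def prune(self, threshold=0.5):
        # Keep only filters whose gate survives; this yields the compact model.
        keep = self.gate.abs() > threshold
        return self.weight[keep]

layer = FactorizedConv2d(16, 32)
out = layer(torch.randn(1, 16, 28, 28))     # (1, 32, 28, 28)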
[joint, updated, work] [algorithm, computer, pattern, corresponding, vision, optimization, problem, international, directly, continuous] [method, proposed, conference, ieee, input, distinct, image, conduct, based] [pruning, filter, convolutional, accuracy, neural, deep, standard, ratio, layer, pruned, factorized, output, admm, denotes, compressing, original, binary, selection, nisp, baoyuan, channel, small, compared, sparse, table, wei, performance, resnet, gradient, sfp, compact, imagenet, prune, number, processing, network, descent, higher, lower, efficient, compression] [model, visual, arxiv, preprint] [cnn, including, propose, adopt] [learning, training, ranking, loss, existing, trained, set, train, novel, update, strategy]
@InProceedings{Li_2019_CVPR,
  author = {Li, Tuanhui and Wu, Baoyuan and Yang, Yujiu and Fan, Yanbo and Zhang, Yong and Liu, Wei},
  title = {Compressing Convolutional Neural Networks via Factorized Convolutional Filters},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
On the Intrinsic Dimensionality of Image Representations
Sixue Gong, Vishnu Naresh Boddeti, Anil K. Jain


This paper addresses the following questions pertaining to the intrinsic dimensionality of any given image representation: (i) estimate its intrinsic dimensionality, (ii) develop a deep neural network based non-linear mapping, dubbed DeepMDS, that transforms the ambient representation to the minimal intrinsic space, and (iii) validate the veracity of the mapping through image matching in the intrinsic space. Experiments on benchmark image datasets (LFW, IJB-C and ImageNet-100) reveal that the intrinsic dimensionality of deep neural network representations is significantly lower than the dimensionality of the ambient features. For instance, SphereFace's 512-dim face representation and ResNet's 512-dim image representation have an intrinsic dimensionality of 16 and 19 respectively. Further, the DeepMDS mapping is able to obtain a representation of significantly lower dimensionality while maintaining discriminative ability to a large extent, 59.75% TAR @ 0.1% FAR in 16-dim vs 71.26% TAR in 512-dim on IJB-C and a Top-1 accuracy of 77.0% at 19-dim vs 83.4% at 512-dim on ImageNet-100.
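A rough sketch of a DeepMDS-style mapping under stated assumptions: an MLP projects 512-d ambient features to a low-dimensional space and is trained so pairwise distances in the projection match those in the ambient space (a classical MDS stress-style objective; the paper's actual architecture and training scheme are not reproduced).

import torch
import torch.nn as nn

# Train a small projector to preserve pairwise distances while reducing dimensionality.
projector = nn.Sequential(nn.Linear(512, 128), nn.ReLU(),
                          nn.Linear(128, 16))             # 16-d "intrinsic" space
opt = torch.optim.Adam(projector.parameters(), lr=1e-3)

features = torch.randn(256, 512)                          # stand-in for image/face embeddings
for _ in range(100):
    idx = torch.randperm(features.size(0))[:64]           # minibatch of embeddings
    x = features[idx]
    z = projector(x)
    d_ambient = torch.cdist(x, x)
    d_intrinsic = torch.cdist(z, z)
    loss = ((d_intrinsic - d_ambient) ** 2).mean()        # MDS stress-style objective
    opt.zero_grad()
    loss.backward()
    opt.step()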
[graph, recognition, dataset, multiple, complex, stagewise] [intrinsic, ambient, linear, geodesic, estimating, estimate, computer, analysis, pattern, topological, estimation, induced, international, projection, principal, corresponding, local, vision, approach] [image, mapping, face, based, conference, ieee, figure, method, component, denoising] [reduction, number, verification, deep, neural, network, table, rate, compact, dnn] [manifold, model, machine, ability, embedded] [feature] [dimensionality, space, representation, deepmds, dimension, distance, learning, rmax, data, distribution, discriminative, sphereface, isomap, learn, training, embedding, lfw, paper, similarity, log, datasets, function, euclidean, hypersphere, classification, dae, large, loss, unsupervised, lim]
@InProceedings{Gong_2019_CVPR,
  author = {Gong, Sixue and Naresh Boddeti, Vishnu and Jain, Anil K.},
  title = {On the Intrinsic Dimensionality of Image Representations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Part-Regularized Near-Duplicate Vehicle Re-Identification
Bing He, Jia Li, Yifan Zhao, Yonghong Tian


Vehicle re-identification (Re-ID) has been attracting increasing interest in computer vision owing to its great contributions to urban surveillance and intelligent transportation. With the development of deep learning approaches, vehicle Re-ID still faces a near-duplicate challenge, which is to distinguish different instances with nearly identical appearances. Previous methods simply rely on global visual features to handle this problem. In this paper, we propose a simple but efficient part-regularized discriminative feature-preserving method which enhances the ability to perceive subtle discrepancies. We further develop a novel framework to integrate part constraints with the global Re-ID modules by introducing a detection branch. Our framework is trained end-to-end with combined local and global constraints. Specifically, without the part-regularized local constraints in the inference step, our Re-ID network outperforms the state-of-the-art method by a large margin on large benchmark datasets VehicleID and VeRi-776.
[dataset, window, framework, recognition, license, human, crucial] [local, light, defined, view, approach, project, problem, body, front] [image, proposed, method, result, conduct, figure, extracted, handle, subtle, face, input] [deep, network, convolutional, performance, table, neural, number, precision, grant] [model, query, brand, visual, plate] [vehicle, feature, global, detection, branch, person, three, distinguish, module, map, average, including, extraction, bounding, roi, localization, identification, box, constrains, propose, adopt, backbone] [learning, vehicleid, large, training, metric, distance, similarity, learn, classification, test, list, label, discriminative, probe, gallery, rank, address, loss, rest, embedding, specific]
@InProceedings{He_2019_CVPR,
  author = {He, Bing and Li, Jia and Zhao, Yifan and Tian, Yonghong},
  title = {Part-Regularized Near-Duplicate Vehicle Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, Wei Liu


We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely on a frame-by-frame basis, which are not applicable to many video analytic tasks where spatio-temporal features are prevailing. In this paper we propose a novel self-supervised approach to learn spatio-temporal features for video representation. Inspired by the success of two-stream approaches in video classification, we propose to learn visual features by regressing both motion and appearance statistics along spatial and temporal dimensions, given only the input video data. Specifically, we extract statistical concepts (fast-motion region and the corresponding dominant direction, spatio-temporal color diversity, dominant color, etc.) from simple patterns in both spatial and temporal domains. Unlike prior puzzles that are even hard for humans to solve, the proposed approach is consistent with human inherent visual habits and therefore easy to answer. We conduct extensive experiments with C3D to validate the effectiveness of our proposed approach. The experiments show that our approach can significantly improve the performance of C3D when applied to video classification tasks. Code is available at https://github.com/laura-wang/video_repres_mas.
[motion, video, action, largest, recognition, flow, dataset, human, temporal, frame, optical, predicting, predict, dynamic, aslan, extract, work, selfsupervised, clip] [dominant, pattern, approach, scene, corresponding, problem, angle, orientation, local, compute, smallest] [appearance, proposed, color, statistical, figure, method, based, image, input, described, comparison] [network, performance, block, number, table, design, accuracy, convolutional, wei, compared, powerful, initialization, achieve, magnitude] [visual, model, diversity, random, understanding] [global, spatial, feature, cnn, location, three, supervision, branch] [learning, representation, task, learn, unlabeled, novel, train, split, learned, data, validate, classification, training, similarity, unsupervised]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Jiangliu and Jiao, Jianbo and Bao, Linchao and He, Shengfeng and Liu, Yunhui and Liu, Wei},
  title = {Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Classification-Reconstruction Learning for Open-Set Recognition
Ryota Yoshihashi, Wen Shao, Rei Kawakami, Shaodi You, Makoto Iida, Takeshi Naemura


Open-set classification is a problem of handling 'unknown' classes that are not contained in the training dataset, whereas traditional classifiers assume that only known classes appear in the test environment. Existing open-set classifiers rely on deep networks trained in a supervised manner on known classes in the training set; this causes specialization of learned representations to known classes and makes it hard to distinguish unknowns from knowns. In contrast, we train networks for joint classification and reconstruction of input data. This enhances the learned representation so as to preserve information useful for separating unknowns from knowns, as well as to discriminate classes of knowns. Our novel Classification-Reconstruction learning for Open-Set Recognition (CROSR) utilizes latent representations for reconstruction and enables robust unknown detection without harming the known-class classification accuracy. Extensive experiments reveal that the proposed method outperforms existing deep open-set classifiers in multiple standard datasets and is robust to diverse outliers.
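A minimal sketch of joint classification-reconstruction training; the ladder-style lateral connections and the Weibull/OpenMax unknown detector used at test time are omitted, and all sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Joint objective: classify known classes while reconstructing the input, so the
# learned representation keeps information useful for separating unknowns later.
class ClsReconNet(nn.Module):
    def __init__(self, num_known=10, latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU())
        self.classifier = nn.Linear(256, num_known)
        self.to_latent = nn.Linear(256, latent)
        self.decoder = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                     nn.Linear(256, 28 * 28))

    def forward(self, x):
        h = self.encoder(x)
        logits = self.classifier(h)
        recon = self.decoder(self.to_latent(h)).view_as(x)
        return logits, recon

net = ClsReconNet()
x, y = torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))
logits, recon = net(x)
loss = F.cross_entropy(logits, y) + F.mse_loss(recon, x)   # joint objective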
[recognition, anomaly, prediction, dataset, consists, joint] [reconstruction, outlier, well] [latent, input, figure, generative, method, synthesized, image, based, reconstructive] [deep, network, convolutional, table, number, plain, neural, net, lateral, densenet, compact, larger, denotes, higher, accuracy, pooling] [text, adversarial] [detection, detector, cnn, hierarchical, detect, feature, fully] [unknown, classification, training, learning, crosr, openmax, supervised, data, dhrnet, test, class, softmax, representation, set, mnist, existing, unsupervised, open, laddernet, classifier, trained, datasets, distance, dimensionality, terrance, learned, outperformed, walter, distribution, discriminative, domain, nearest, ladder, weibull]
@InProceedings{Yoshihashi_2019_CVPR,
  author = {Yoshihashi, Ryota and Shao, Wen and Kawakami, Rei and You, Shaodi and Iida, Makoto and Naemura, Takeshi},
  title = {Classification-Reconstruction Learning for Open-Set Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Emotion-Aware Human Attention Prediction
Macario O. Cordel II, Shaojing Fan, Zhiqi Shen, Mohan S. Kankanhalli


Despite the recent success in face recognition and object classification, in the field of human gaze prediction, computer models are still struggling to accurately mimic human attention. One main reason is that visual attention is a complex human behavior influenced by multiple factors, ranging from low-level features (e.g., color, contrast) to high-level human perception (e.g., objects interactions, object sentiment), making it difficult to model computationally. In this work, we investigate the relation between object sentiment and human attention. We first introduce a new evaluation metric (AttI) for measuring human attention that focuses on human fixation consensus. A series of empirical data analyses with AttI indicate that emotion-evoking objects receive attention favor, especially when they co-occur with emotionally-neutral objects, and this favor varies with different image complexity. Based on the empirical analyses, we design a deep neural network for human attention prediction which allows the attention bias on emotion-evoking objects to be encoded in its feature space. Experiments on two benchmark datasets demonstrate its superior performance, especially on metrics that evaluate relative importance of salient regions. This research provides the clearest picture to date on how object sentiments influence human attention, and it makes one of the first attempts to model this phenomenon computationally.
[human, fixation, prediction, predicting, work, affective, recognition, complex, dataset] [computer, relative, consensus, vision, pattern, initial] [image, neutral, ieee, based, difference, proposed, figure, study, input, conference] [complexity, neural, output, better, performance, compared, deep, dnn, higher, increase, design, processing] [attention, model, visual, improved, empirical, generation, evaluation, introduce, advantage, measuring] [saliency, object, sentiment, easal, feature, atti, emotional, map, mask, score, branch, emod, predicted, indicate, propose, subnetwork, detected, semantic, level, salient, prioritization, context, extraction, benchmark] [emotion, data, metric, negative, hcs, datasets, trained, learning, labeled, positive, training]
@InProceedings{II_2019_CVPR,
  author = {Cordel, II, Macario O. and Fan, Shaojing and Shen, Zhiqi and Kankanhalli, Mohan S.},
  title = {Emotion-Aware Human Attention Prediction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Residual Regression With Semantic Prior for Crowd Counting
Jia Wan, Wenhan Luo, Baoyuan Wu, Antoni B. Chan, Wei Liu


Crowd counting is a challenging task due to factors such as large variations in crowdedness and severe occlusions. Although recent deep learning based counting algorithms have achieved great progress, the correlation knowledge among samples and the semantic prior have not yet been fully exploited. In this paper, a residual regression framework is proposed for crowd counting that utilizes the correlation information among samples. By incorporating such information into our network, we discover that more intrinsic characteristics can be learned by the network, which thus generalizes better to unseen scenarios. Besides, we show how to effectively leverage the semantic prior to improve the performance of crowd counting. We also observe that the adversarial loss can be used to improve the quality of predicted density maps, thus leading to an improvement in crowd counting. Experiments on public datasets demonstrate the effectiveness and generalization ability of the proposed method.
[prediction, fusion, predict, people, dataset, ucf] [computer, vision, pattern, approach, ground, estimated, international, single] [appearance, proposed, image, conference, based, ieee, prior, method, input, mse, sanet, quality, difference, developed] [density, residual, network, performance, deep, correlation, table, convolutional, better, number, neural, effective, best, csrnet, achieves, congested, eliminate, denotes, layer] [adversarial, model, ability, evaluation] [crowd, map, semantic, counting, regression, predicted, final, area, mcnn, shanghaitech, improve, count, false, detection, comparing, module, mae, propose, pedestrian, utilize, fusing, spatial, fsmse, challenging] [support, loss, learning, set, generalization, unseen, exemplar, knowledge, training, labeled, experimental, large]
@InProceedings{Wan_2019_CVPR,
  author = {Wan, Jia and Luo, Wenhan and Wu, Baoyuan and Chan, Antoni B. and Liu, Wei},
  title = {Residual Regression With Semantic Prior for Crowd Counting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Context-Reinforced Semantic Segmentation
Yizhou Zhou, Xiaoyan Sun, Zheng-Jun Zha, Wenjun Zeng


Recent efforts have shown the importance of context on deep convolutional neural network based semantic segmentation. Among others, the predicted segmentation map (p-map) itself which encodes rich high-level semantic cues (e.g. objects and layout) can be regarded as a promising source of context. In this paper, we propose a dedicated module, Context Net, to better explore the context information in p-maps. Without introducing any new supervisions, we formulate the context learning problem as a Markov Decision Process and optimize it using reinforcement learning during which the p-map and Context Net are treated as environment and agent, respectively. Through adequate explorations, the Context Net selects the information which has long-term benefit for segmentation inference. By incorporating the Context Net with a baseline segmentation scheme, we then propose a Context-reinforced Semantic Segmentation network (CiSS-Net), which is fully end-to-end trainable. Experimental results show that the learned context brings 3.9% absolute improvement on mIoU over the baseline segmentation method, and the CiSS-Net achieves the state-of-the-art segmentation performance on ADE20K, PASCAL-Context and Cityscapes.
[prediction, benefit] [computer, scene, pattern, vision, well, field, local, international, denote] [image, conference, ieee, input, based, method, figure] [net, performance, table, network, convolutional, deep, achieves, validation, pooling, neural, process, architecture, inference, iteration, explore, atrous] [step, arxiv, reward, preprint, generated, reinforcement, generate, probability, policy, indicates] [context, segmentation, semantic, map, fully, global, propose, feature, baseline, pyramid, segment, miou, object, improve, pascal, improves, spatial, pspnet, module, refinement, region, vfk, annotated, predicted, contextual] [learning, learned, set, function, bias, training, base, domain]
@InProceedings{Zhou_2019_CVPR,
  author = {Zhou, Yizhou and Sun, Xiaoyan and Zha, Zheng-Jun and Zeng, Wenjun},
  title = {Context-Reinforced Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Adversarial Structure Matching for Structured Prediction Tasks
Jyh-Jing Hwang, Tsung-Wei Ke, Jianbo Shi, Stella X. Yu


Pixel-wise losses, e.g., cross-entropy or L2, have been widely used in structured prediction tasks as a spatial extension of generic image classification or regression. However, their i.i.d. assumption neglects the structural regularity present in natural images. Various attempts have been made to incorporate structural reasoning, mostly through structure priors in a cooperative way where co-occurring patterns are encouraged. We, on the other hand, approach this problem from an opposing angle and propose a new framework, Adversarial Structure Matching (ASM), for training such structured prediction networks via an adversarial process, in which we train a structure analyzer that provides the supervisory signals, the ASM loss. The structure analyzer is trained to maximize the ASM loss, i.e., to emphasize recurring multi-scale hard negative structural mistakes, usually among co-occurring patterns. Conversely, the structured prediction network is trained to reduce those mistakes and is thus enabled to distinguish fine-grained structures. As a result, training structured prediction networks with ASM reduces contextual confusion among objects and improves boundary localization. We demonstrate that ASM outperforms its pixel-wise counterpart and the commonly used structure prior, GAN, on three different structured prediction tasks, namely semantic segmentation, monocular depth estimation, and surface normal prediction.
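The min-max structure of ASM can be pictured as a short alternating-update loop. The PyTorch sketch below is only a schematic reading of the abstract: predictor, analyzer, and the plain MSE mismatch are illustrative placeholders, not the paper's actual multi-scale formulation.

import torch
import torch.nn.functional as F

def asm_step(predictor, analyzer, opt_pred, opt_analyzer, images, targets):
    # targets are assumed to be in the same format as predictions (e.g., one-hot maps).
    # Analyzer step: maximize the structural mismatch (gradient ascent via negated loss).
    preds = predictor(images).detach()
    mismatch = F.mse_loss(analyzer(preds), analyzer(targets))
    opt_analyzer.zero_grad()
    (-mismatch).backward()
    opt_analyzer.step()

    # Predictor step: minimize the same structural mismatch measured by the analyzer.
    preds = predictor(images)
    loss = F.mse_loss(analyzer(preds), analyzer(targets))
    opt_pred.zero_grad()
    loss.backward()
    opt_pred.step()
    return loss.item()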
[prediction, structural, dataset, horse, outperforms, capture, framework] [depth, surface, normal, estimation, monocular, ground, truth, note, single, matching, shape, error, scene] [image, figure, proposed, input, demonstrate, method] [structure, structured, asm, analyzer, deep, iid, convolutional, network, neural, rate, regularization, conv, table, regularizer, weizmann, implementation, better] [adversarial, gan, random, visual, evaluate, cgan, arxiv, preprint] [semantic, segmentation, voc, pspnet, boundary, three, pascal, spatial, including, deeplab, miou, propose, improves, affinity, detection, object] [learning, training, loss, set, base, trained, predictor, train, observe, data, hard, negative, autoencoder]
@InProceedings{Hwang_2019_CVPR,
  author = {Hwang, Jyh-Jing and Ke, Tsung-Wei and Shi, Jianbo and Yu, Stella X.},
  title = {Adversarial Structure Matching for Structured Prediction Tasks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Spectral Clustering Using Dual Autoencoder Network
Xu Yang, Cheng Deng, Feng Zheng, Junchi Yan, Wei Liu


Clustering methods have recently attracted ever-increasing attention in learning and vision. Deep clustering combines embedding and clustering to obtain an optimal embedding subspace for clustering, which can be more effective than conventional clustering methods. In this paper, we propose a joint learning framework for discriminative embedding and spectral clustering. We first devise a dual autoencoder network, which enforces the reconstruction constraint for the latent representations and their noisy versions, to embed the inputs into a latent space for clustering. As such, the learned latent representations are more robust to noise. Then, mutual information estimation is utilized to provide more discriminative information from the inputs. Furthermore, a deep spectral clustering method is applied to embed the latent representations into the eigenspace and subsequently cluster them, which fully exploits the relationships between inputs to achieve optimal clustering results. Experimental results on benchmark datasets show that our method can significantly outperform state-of-the-art clustering approaches.
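As a rough, hedged illustration of the "embed, then spectrally cluster" idea (omitting the paper's dual/noisy branch and mutual-information term), here is a tiny PyTorch + scikit-learn sketch; the architecture and hyperparameters are arbitrary assumptions.

import torch
import torch.nn as nn
from sklearn.cluster import SpectralClustering

class AE(nn.Module):
    # Tiny fully connected autoencoder used purely for illustration.
    def __init__(self, dim_in, dim_z=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(), nn.Linear(256, dim_z))
        self.dec = nn.Sequential(nn.Linear(dim_z, 256), nn.ReLU(), nn.Linear(256, dim_in))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

def embed_then_cluster(x, n_clusters, epochs=50):
    # Learn latent codes with a reconstruction loss, then spectrally cluster them.
    model = AE(x.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        recon, _ = model(x)
        loss = nn.functional.mse_loss(recon, x)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        _, z = model(x)
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="nearest_neighbors").fit_predict(z.numpy())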
[acc, dataset, joint, utilized, jointly, framework, graph] [reconstruction, defined, estimation, robust, matrix, pattern, computer, optimal, international, vision, local, approach, relative] [latent, method, spectral, image, dual, cheng, proposed, conference, based, ieee, generative, figure, raw] [deep, network, convolutional, neural, performance, layer, represents, order, output, original, processing, wei, gaussian] [decoder, adversarial, arxiv, preprint, embed, model, variational, encoder] [feature, fully, adopt, benchmark, improve, map] [clustering, autoencoder, mutual, learning, loss, discriminative, nmi, data, unsupervised, embedding, distribution, space, subspace, datasets, representation, training, min, cluster, effectively, objective, divergence, sample, negative, log, noisy, learn, function, set, usps, ytf]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Xu and Deng, Cheng and Zheng, Feng and Yan, Junchi and Liu, Wei},
  title = {Deep Spectral Clustering Using Dual Autoencoder Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Asymmetric Metric Learning via Rich Relationship Mining
Xinyi Xu, Yanhua Yang, Cheng Deng, Feng Zheng


Learning an effective distance metric between data has gained increasing popularity due to its promising performance on various tasks, such as face verification, zero-shot learning, and image retrieval. A major line of research employs hard data mining, which searches for a subset of significant data. However, hard-data-mining-based approaches rely on only a small percentage of the data, which makes them prone to overfitting. This motivates us to propose a novel framework, named deep asymmetric metric learning via rich relationship mining (DAMLRRM), which mines rich relationships with a sufficient sampling size. DAMLRRM constructs two asymmetric data streams that are differently structured and of unequal length. The asymmetric structure enables the two data streams to interlace with each other, which allows for informative comparisons between new data pairs over iterations. To improve the generalization ability, we further relax the constraint on the intra-class relationship. Rather than greedily connecting all possible positive pairs, DAMLRRM builds a minimum-cost spanning tree within each category to ensure the formation of a connected region. As such, there exists at least one direct or indirect path between arbitrary positive pairs to bridge intra-class relevance. Extensive experimental results on three benchmark datasets, including CUB-200-2011, Cars196, and Stanford Online Products, show that DAMLRRM effectively boosts the performance of existing deep metric learning approaches.
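The minimum-cost spanning tree used to connect intra-class positives can be sketched in a few lines with SciPy; this is an illustrative reconstruction of that single step, not the released DAMLRRM code.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def intra_class_mst_edges(embeddings):
    # Edges of a minimum-cost spanning tree over one class's samples. Connecting
    # positives along an MST (instead of all pairs) keeps every sample reachable
    # through at least one path while relaxing the intra-class constraint.
    dists = squareform(pdist(embeddings))       # pairwise Euclidean distances
    mst = minimum_spanning_tree(dists).tocoo()  # sparse MST over the distance graph
    return list(zip(mst.row.tolist(), mst.col.tolist()))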
[stream, graph, dataset, online, build, time, framework, work, employed] [constraint, point, algorithm] [method, figure, proposed, comparison, based, face, image] [deep, connected, batch, table, neural, size, performance, network, compared, iteration, small, structured, structure, convolutional, number, weighted] [tree, relationship, model, spanning, rich, adversarial, query] [feature, propose, pool, category, semantic, undirected, boundary] [learning, metric, data, positive, training, loss, hard, embedding, asymmetric, mining, distance, negative, retrieval, sampling, damlrrm, shuffled, generalization, stanford, contrastive, clustering, neat, nearest, prim, triplet, lifted, set, similarity, margin, subset, relax, conventional, large, function, novel, mine]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Xinyi and Yang, Yanhua and Deng, Cheng and Zheng, Feng},
  title = {Deep Asymmetric Metric Learning via Rich Relationship Mining},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Did It Change? Learning to Detect Point-Of-Interest Changes for Proactive Map Updates
Jerome Revaud, Minhyeok Heo, Rafael S. Rezende, Chanmi You, Seong-Gyun Jeong


Maps are an increasingly important tool in our daily lives, yet their rich semantic content still largely depends on manual input. Motivated by the broad availability of geo-tagged street-view images, we propose a new task aiming to make the map update process more proactive. We focus on automatically detecting changes of Points of Interest (POIs), specifically stores or shops of any kind, based on visual input. Faced with the lack of an appropriate benchmark, we build and release a large dataset, captured in two large shopping centers, that comprises 33K geo-localized images and 578 POIs. We then design a generic approach that compares two image sets captured in the same venue at different times and outputs POI changes as a ranked list of map locations. In contrast to logo or franchise recognition approaches, our system does not depend on an external franchise database. It is instead inspired by recent deep metric learning approaches that learn a similarity function fit to the task at hand. We compare various loss functions to learn a metric aligned with the POI change detection goal, and report promising results.
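A hedged sketch of the metric-learning ingredients such a system could use: a standard triplet margin loss for training the embedding, and a ranking of map locations by the embedding distance between two visits. Function names and the margin value are illustrative assumptions, not the paper's exact choices.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Standard triplet margin loss on L2-normalized embeddings (one of the loss
    # functions such a system could compare).
    a, p, n = (F.normalize(t, dim=-1) for t in (anchor, positive, negative))
    d_ap = (a - p).pow(2).sum(-1)
    d_an = (a - n).pow(2).sum(-1)
    return F.relu(d_ap - d_an + margin).mean()

def rank_locations_by_change(emb_t0, emb_t1):
    # Rank map locations by the distance between their embeddings at two visits:
    # a larger distance suggests a more likely POI change.
    dist = (F.normalize(emb_t0, dim=-1) - F.normalize(emb_t1, dim=-1)).pow(2).sum(-1)
    return torch.argsort(dist, descending=True)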
[dataset, recognition, time, perform, second] [poi, geographical, camera, approach, scene, floor, corresponding, viewpoint, visible, computed, place, problem, compute, geometric, pap] [image, change, ieee, captured, based, figure, content, proposed, database, appearance, acquisition] [deep, performance, max, output, imagenet, process] [goal, visual, system, potential, evaluation] [map, detection, detect, semantic, location, average, final, street, detecting, interest, shopping, object, cnn] [learning, loss, metric, similarity, logo, set, retrieval, function, large, embedding, distance, training, task, triplet, measure, split, learn, gap, positive, update, trained, data, pairwise, negative, train, test, generic]
@InProceedings{Revaud_2019_CVPR,
  author = {Revaud, Jerome and Heo, Minhyeok and Rezende, Rafael S. and You, Chanmi and Jeong, Seong-Gyun},
  title = {Did It Change? Learning to Detect Point-Of-Interest Changes for Proactive Map Updates},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Associatively Segmenting Instances and Semantics in Point Clouds
Xinlong Wang, Shu Liu, Xiaoyong Shen, Chunhua Shen, Jiaya Jia


A 3D point cloud describes the real scene precisely and intuitively. To date, however, how to segment the diverse elements in such an informative 3D scene has rarely been discussed. In this paper, we first introduce a simple and flexible framework to segment instances and semantics in point clouds simultaneously. Then, we propose two approaches that make the two tasks take advantage of each other, leading to a win-win situation. Specifically, we make instance segmentation benefit from semantic segmentation by learning semantic-aware point-level instance embeddings. Meanwhile, semantic features of the points belonging to the same instance are fused together to make more accurate per-point semantic predictions. Our method largely outperforms the state-of-the-art method in 3D instance segmentation, along with a significant improvement in 3D semantic segmentation. Code has been made available at: https://github.com/WXinlong/ASIS.
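One common way to realize point-level instance embeddings is a pull/push discriminative loss over per-point features; the PyTorch sketch below shows that generic formulation (margins and the exact loss form are assumptions, not necessarily the paper's).

import torch

def instance_embedding_loss(embeddings, instance_ids, delta_pull=0.5, delta_push=1.5):
    # embeddings: (N, D) per-point embeddings; instance_ids: (N,) integer labels.
    # Pull each point toward its instance centroid, push different centroids apart.
    centroids, pull_terms = [], []
    for inst in instance_ids.unique():
        mask = instance_ids == inst
        center = embeddings[mask].mean(dim=0)
        centroids.append(center)
        dist = (embeddings[mask] - center).norm(dim=1)
        pull_terms.append(torch.clamp(dist - delta_pull, min=0).pow(2).mean())
    pull = torch.stack(pull_terms).mean()

    centroids = torch.stack(centroids)                    # (K, D)
    K = centroids.shape[0]
    if K > 1:
        diff = centroids[:, None, :] - centroids[None, :, :]
        cdist = diff.norm(dim=2)
        off_diag = cdist[~torch.eye(K, dtype=torch.bool)]
        push = torch.clamp(2 * delta_push - off_diag, min=0).pow(2).mean()
    else:
        push = embeddings.sum() * 0.0                     # keep the graph, zero penalty
    return pull + push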
[dataset, fusion, time, framework, outperforms] [point, cloud, matrix, pointnet, shapenet, shape, scene, ground, single, well, chair, accurate] [method, ieee, figure, based, proposed, real] [table, performance, neural, network, convolutional, deep, better, output, group, flexible, achieve, effective, computation, size, number, max] [simple, introduce, represent] [instance, semantic, segmentation, asis, feature, baseline, miou, segment, fsem, mwcov, backbone, awareness, area, board, object, sgpn, iou, semantics, propose, final, module, wcov, grouping, improvement] [class, embedding, learning, belonging, embeddings, test, training, novel, learn, loss, set, shared, refers]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Xinlong and Liu, Shu and Shen, Xiaoyong and Shen, Chunhua and Jia, Jiaya},
  title = {Associatively Segmenting Instances and Semantics in Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Pattern-Affinitive Propagation Across Depth, Surface Normal and Semantic Segmentation
Zhenyu Zhang, Zhen Cui, Chunyan Xu, Yan Yan, Nicu Sebe, Jian Yang


In this paper, we propose a novel Pattern-Affinitive Propagation (PAP) framework to jointly predict depth, surface normal and semantic segmentation. The motivation behind it comes from the statistical observation that pattern-affinitive pairs recur frequently across different tasks as well as within a task. Thus, we can conduct two types of propagation, cross-task propagation and task-specific propagation, to adaptively diffuse those similar patterns. The former integrates cross-task affinity patterns to adapt to each task through the calculation of non-local relationships. The latter then performs an iterative diffusion in the feature space so that the cross-task affinity patterns can be spread widely within the task. Accordingly, the learning of each task can be regularized and boosted by the complementary task-level affinities. Extensive experiments demonstrate the effectiveness and the superiority of our method on the three joint tasks. Meanwhile, we achieve state-of-the-art or competitive results on the three related datasets, NYUD-v2, SUN-RGBD and KITTI.
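To make the propagation idea concrete, the sketch below computes a non-local affinity matrix from a feature map and iteratively diffuses the features with it. It is a simplified stand-in for the paper's cross-task and task-specific propagation; the softmax normalization and step count are assumptions for illustration.

import torch

def affinity_diffusion(feat, steps=2):
    # feat: (B, C, H, W) feature map from one task branch.
    b, c, h, w = feat.shape
    x = feat.flatten(2)                                        # (B, C, HW)
    affinity = torch.softmax(x.transpose(1, 2) @ x, dim=-1)    # (B, HW, HW) pairwise affinities
    out = x
    for _ in range(steps):
        out = out @ affinity.transpose(1, 2)                   # propagate features along affinities
    return out.view(b, c, h, w)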
[prediction, propagation, jointly, joint, perform, dataset, fusion, influence, abhinav] [depth, normal, surface, pap, matrix, rgbd, rgb, ground, error, estimation, note, corresponding, monocular, initial, single, analysis, kitti, local, pattern, volume, well] [method, image, input, demonstrate, figure, conduct, based, proposed] [network, deep, convolutional, layer, table, upsampling, neural, scale, process, designed, block, effectiveness, firstly] [model, find] [affinity, semantic, segmentation, three, feature, affinitive, boost, utilize, illustrated] [learning, diffusion, task, loss, learn, training, set, learned, train, dissimilar, shared, trained, data]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Zhenyu and Cui, Zhen and Xu, Chunyan and Yan, Yan and Sebe, Nicu and Yang, Jian},
  title = {Pattern-Affinitive Propagation Across Depth, Surface Normal and Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Scene Categorization From Contours: Medial Axis Based Salience Measures
Morteza Rezanejad, Gabriel Downs, John Wilder, Dirk B. Walther, Allan Jepson, Sven Dickinson, Kaleem Siddiqi


The computer vision community has witnessed recent advances in scene categorization from images, with state-of-the-art systems now achieving impressive recognition rates on challenging benchmarks. Such systems have been trained on photographs, which include color, texture and shading cues. The geometry of shapes and surfaces, as conveyed by scene contours, is not explicitly considered for this task. Remarkably, humans can accurately recognize natural scenes from line drawings, which consist solely of contour-based shape cues. Here we report the first computer vision study on scene categorization of line drawings derived from popular databases including an artist scene database, MIT67 and Places365. Specifically, we use off-the-shelf pre-trained Convolutional Neural Networks (CNNs) to perform scene classification given only contour information as input, and find performance levels well above chance. We also show that medial-axis-based contour salience methods can be used to select more informative subsets of contour pixels, and that the variation in CNN classification performance across various choices for these subsets is qualitatively similar to that observed in human performance. Moreover, when the salience measures are used to weight the contours, we find that these weights boost our CNN performance above that for unweighted contour input. That is, the medial axis based salience weights appear to add useful information that is not available when CNNs are trained on contours alone.
[human, recognition, complex, dataset] [scene, salience, medial, axis, ribbon, taper, symmetry, computer, local, vision, aof, radius, pattern, point, shape, computed, flux, organization, gestalt, skeletal, inscribed] [contour, separation, artist, based, figure, image, drawing, perceptual, ieee, conference, disk, outward, input, perceptually, side, half] [performance, top, weighted, neural, computing, connected, interval, network, cnns, deep, increasing, table, convolutional] [visual, generated, introduce, natural, regular, machine, consider] [categorization, cnn, three, object, bottom, boundary, region, average, edge, branch] [measure, function, motivated, set, distance, classification, euclidean]
@InProceedings{Rezanejad_2019_CVPR,
  author = {Rezanejad, Morteza and Downs, Gabriel and Wilder, John and Walther, Dirk B. and Jepson, Allan and Dickinson, Sven and Siddiqi, Kaleem},
  title = {Scene Categorization From Contours: Medial Axis Based Salience Measures},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Image Captioning
Yang Feng, Lin Ma, Wei Liu, Jiebo Luo


Deep neural networks have achieved great success on the image captioning task. However, most existing models depend heavily on paired image-sentence datasets, which are very expensive to acquire. In this paper, we make the first attempt to train an image captioning model in an unsupervised manner. Instead of relying on manually labeled image-sentence pairs, our proposed model merely requires an image set, a sentence corpus, and an existing visual concept detector. The sentence corpus is used to teach the captioning model how to generate plausible sentences. Meanwhile, the knowledge in the visual concept detector is distilled into the captioning model to guide it to recognize the visual concepts in an image. To further encourage the generated captions to be semantically consistent with the image, the image and caption are projected into a common latent space so that they can reconstruct each other. Given that existing sentence corpora are mainly designed for linguistic research and thus bear little reference to image contents, we crawl a large-scale image description corpus of two million natural sentences to facilitate the unsupervised image captioning scenario. Experimental results show that our proposed model is able to produce quite promising results without any caption annotations.
[lstm, recognize, work, dataset] [reconstruction, corresponding] [image, proposed, figure, generator, paired, method, latent, unpaired, input, reconstruct, translation, background, init, real] [neural, initialization, gradient, number, wei, deep, table, order] [captioning, sentence, model, concept, generated, visual, adversarial, adv, caption, corpus, generate, con, language, reward, common, machine, discriminator, description, young, generation, sampled, probability, word, white, man, laptop, plausible, conditioned, text, mscoco, evaluation, encourage] [object, detector, coco, three, feature, semantic, detected] [training, unsupervised, train, data, learning, existing, target, pseudo, space, novel, set, objective, supervised, labeled, experimental, trained, loss, representation, cat, test, knowledge]
@InProceedings{Feng_2019_CVPR,
  author = {Feng, Yang and Ma, Lin and Liu, Wei and Luo, Jiebo},
  title = {Unsupervised Image Captioning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables
Yan Xu, Baoyuan Wu, Fumin Shen, Yanbo Fan, Yong Zhang, Heng Tao Shen, Wei Liu


In this work, we study the robustness of a CNN+RNN based image captioning system subjected to adversarial noise. We propose to fool an image captioning system into generating targeted partial captions for an image polluted by adversarial noise, even when the targeted captions are totally irrelevant to the image content. A partial caption indicates that the words at some locations in the caption are observed, while words at other locations are not restricted. This is the first work to study exact adversarial attacks with targeted partial captions. Due to the sequential dependencies among words in a caption, we formulate the generation of adversarial noise for targeted partial captions as a structured output learning problem with latent variables. Both the generalized expectation maximization algorithm and structural SVMs with latent variables are then adopted to optimize the problem. The proposed methods generate very successful attacks on three popular CNN+RNN based image captioning models. Furthermore, the proposed attack methods are used to understand the inner mechanism of image captioning systems, providing guidance for further improving automatic image captioning systems towards human-level captioning.
[human, structural, work] [computer, vision, pattern, problem, maximization, algorithm, corresponding] [image, latent, conference, based, proposed, ieee, method, rec, noise, produce] [structured, table, neural, max, deep, output, rate, number, norm, gradient, architecture, optimized] [targeted, adversarial, attack, partial, captioning, caption, gem, observed, model, ssvms, complete, green, visual, arg, success, indicates, benign, word, attacked, svms, sitting, prec, obser, frisbee, easily, playing, robustness, generate, studied, marginal, probability, bird, maximizing] [three, including, predicted, object, average, semantic] [learning, set, log, loss, likelihood, margin, training, function, metric, classification, specific]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Yan and Wu, Baoyuan and Shen, Fumin and Fan, Yanbo and Zhang, Yong and Tao Shen, Heng and Liu, Wei},
  title = {Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Cross-Modal Relationship Inference for Grounding Referring Expressions
Sibei Yang, Guanbin Li, Yizhou Yu


Grounding referring expressions is a fundamental yet challenging task facilitating human-machine communication in the physical world. It locates the target object in an image based on the comprehension of the relationships between referring natural language expressions and the image. A feasible solution for grounding referring expressions not only needs to extract all the necessary information (i.e., objects and the relationships among them) in both the image and the referring expressions, but also needs to compute and represent multimodal contexts from the extracted information. Unfortunately, existing work on grounding referring expressions cannot accurately extract multi-order relationships from the referring expressions, and the contexts it obtains are inconsistent with the contexts described by the referring expressions. In this paper, we propose a Cross-Modal Relationship Extractor (CMRE) to adaptively highlight objects and relationships that are related to a given expression using a cross-modal attention mechanism, and to represent the extracted information as a language-guided visual relation graph. In addition, we propose a Gated Graph Convolutional Network (GGCN) to compute multimodal semantic contexts by fusing information from different modes and propagating multimodal information in the structured relation graph. Experiments on various common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of CMRE and GGCN, outperforms all existing state-of-the-art methods.
[graph, recognition, work, modeling, extract, outperforms] [computer, vision, pattern, vertex, matching, defined, compute, normalized, computed] [expression, conference, ieee, image, figure, proposed, includes, appearance, based] [convolutional, network, weighted, inference, neural, deep] [visual, referring, relationship, cmrin, grounding, language, attention, multimodal, gated, word, cmre, ggcn, natural, refcocog, represent, rij, refcoco, basis, indicates, man, model] [context, relation, object, spatial, semantic, proposal, score, feature, global, detection, guanbin, three, faster, european, location, val, accurately, cnn] [target, set, existing, learning, extractor, test, datasets]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Sibei and Li, Guanbin and Yu, Yizhou},
  title = {Cross-Modal Relationship Inference for Grounding Referring Expressions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
What's to Know? Uncertainty as a Guide to Asking Goal-Oriented Questions
Ehsan Abbasnejad, Qi Wu, Qinfeng Shi, Anton van den Hengel


One of the core challenges in visual dialogue problems is asking the question that will provide the most useful information towards achieving the required objective. Encouraging an agent to ask the right questions is difficult because we don't know a priori what information the agent will need to achieve its task, and we don't have an explicit model of what it knows already. We propose a solution to this problem based on a Bayesian model of the uncertainty in the implicit model maintained by the visual dialogue agent, and in the function used to select an appropriate output. By selecting the question that minimizes the predicted regret with respect to this implicit model, the agent actively reduces ambiguity. The Bayesian model of uncertainty also enables a principled method for identifying when enough information has been acquired and an action should be selected. We evaluate our approach on two goal-oriented dialogue datasets, one for a vision-based collaboration task and the other for a negotiation-based task. Our uncertainty-aware information-seeking model outperforms its counterparts on these two challenging problems.
[action, state, human, outperforms, internal, sequence, current] [approach, international, provide, bound, problem, estimate] [conference, image, based, method, proposed, input] [neural, performance, deep, bayesian, better, achieving, network, achieve, best, processing, process, number, dropout, variance, subsequent] [dialogue, model, agent, reinforcement, visual, word, reward, question, deal, goal, guesswhat, generation, decoder, arxiv, required, policy, expected, game, preprint, choose, machine, identifying, identify, book, generated, van, selecting, history, conversation, step, collection, den, regret, evaluate, token, hat, guesser, ball, greedy] [object, baseline, context, propose, round] [learning, uncertainty, function, distribution, select, supervised, posterior, upper, sampling, task]
@InProceedings{Abbasnejad_2019_CVPR,
  author = {Abbasnejad, Ehsan and Wu, Qi and Shi, Qinfeng and van den Hengel, Anton},
  title = {What's to Know? Uncertainty as a Guide to Asking Goal-Oriented Questions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Iterative Alignment Network for Continuous Sign Language Recognition
Junfu Pu, Wengang Zhou, Houqiang Li


In this paper, we propose an alignment network with iterative optimization for weakly supervised continuous sign language recognition. Our framework consists of two modules: a 3D convolutional residual network (3D-ResNet) for feature learning and an encoder-decoder network with connectionist temporal classification (CTC) for sequence modelling. These two modules are optimized in an alternating manner. In the encoder-decoder sequence learning network, two decoders are included, i.e., an LSTM decoder and a CTC decoder. Both decoders are jointly trained with a maximum likelihood criterion under a soft Dynamic Time Warping (soft-DTW) alignment constraint. The warping path, which indicates the possible alignment between input video clips and sign words, provides training labels that are used to fine-tune the 3D-ResNet with a classification loss. After fine-tuning, the improved features are extracted to optimize the encoder-decoder sequence learning network in the next iteration. The proposed algorithm is evaluated on two large-scale continuous sign language recognition benchmarks, i.e., RWTH-PHOENIX-Weather and CSL. Experimental results demonstrate the effectiveness of our proposed method.
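The CTC part of such a pipeline maps directly onto PyTorch's built-in loss; the snippet below shows the expected tensor shapes with a dummy sequence-model output (sizes are illustrative, not the paper's configuration).

import torch
import torch.nn as nn

T, N, C = 80, 4, 1000 + 1                       # time steps, batch size, vocabulary + blank
logits = torch.randn(T, N, C, requires_grad=True)  # stand-in for sequence-model outputs over clips
log_probs = logits.log_softmax(dim=-1)          # CTC expects log-probabilities of shape (T, N, C)

targets = torch.randint(1, C, (N, 12))          # gloss/word label sequences (index 0 reserved for blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()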
[video, sequence, lstm, recognition, temporal, warping, time, clip, dataset, framework, hidden, csl, dynamic, modelling, action, recurrent, jointly, work, long] [continuous, optimization, algorithm, corresponding, error, defined] [input, figure, based, method, proposed, image, translation] [network, neural, deep, convolutional, residual, wengang, performance, conv, better, rate, table, architecture, output, layer] [sign, decoder, language, ctc, slr, word, iterative, sentence, blstm, probability, connectionist, encoder, wer, attention, houqiang, mechanism, machine, path, decoding, length, oscar, model, vocabulary] [feature, inner] [alignment, learning, training, set, classification, representation, target, train, loss, split, experimental, product, extractor, soft, distance]
@InProceedings{Pu_2019_CVPR,
  author = {Pu, Junfu and Zhou, Wengang and Li, Houqiang},
  title = {Iterative Alignment Network for Continuous Sign Language Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Neural Sequential Phrase Grounding (SeqGROUND)
Pelin Dogan, Leonid Sigal, Markus Gross


We propose an end-to-end approach for phrase grounding in images. Unlike prior methods that typically attempt to ground each phrase independently by building an image-text embedding, our architecture formulates grounding of multiple phrases as a sequential and contextual process. Specifically, we encode region proposals and all phrases into two stacks of LSTM cells, along with so-far grounded phrase-region pairs. These LSTM stacks collectively capture context for grounding of the next phrase. The resulting architecture, which we call SeqGROUND, supports many-to-many matching by allowing an image region to be matched to multiple phrases and vice versa. We show competitive performance on the Flickr30K benchmark dataset and, through ablation studies, validate the efficacy of sequential grounding as well as individual design choices in our model architecture.
[lstm, multiple, sequential, sequence, ordering, state, recurrent, time, perform] [ground, matching, single, corresponding, account, note, respect] [image, stack, input, figure, prior, proposed, method, content] [network, neural, performance, accuracy, architecture, order, deep, table, full, connected, top] [phrase, grounding, visual, grounded, decision, seqground, history, model, language, sentence, simple, vector, text, noun, question, consider, referring, textual, linguistic, multimodal, represent, step, encoded] [box, bounding, context, global, spatial, object, contextual, proposal, region, fully, detection, cnn, propose, three] [similarity, learning, embedding, negative, function, space, pair, positive, loss]
@InProceedings{Dogan_2019_CVPR,
  author = {Dogan, Pelin and Sigal, Leonid and Gross, Markus},
  title = {Neural Sequential Phrase Grounding (SeqGROUND)},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
CLEVR-Ref+: Diagnosing Visual Reasoning With Referring Expressions
Runtao Liu, Chenxi Liu, Yutong Bai, Alan L. Yuille


Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process. To address these issues and complement similar efforts in visual question answering, we build CLEVR-Ref+, a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators. In addition to evaluating several state-of-the-art models on CLEVR-Ref+, we also propose IEP-Ref, a module network approach that significantly outperforms other models on our dataset. In particular, we present two interesting and important findings using IEP-Ref: (1) the module trained to transform feature maps into segmentation masks can be attached to any intermediate module to reveal the entire reasoning process step-by-step; (2) even if all training data has at least one object referred, IEP-Ref can correctly predict no-foreground when presented with false-premise referring expressions. To the best of our knowledge, this is the first direct and quantitative proof that neural modules behave in the way they are intended. We will release data and code for CLEVR-Ref+.
[dataset, lstm, second, multiple] [computer, scene, left, visible, shape, material, functional, front, ground, ordinal, truth, analysis, volume, associated] [expression, image, ieee, intermediate, figure, color, synthetic, metal] [filter, size, neural, small, network, table, number, performance, process, accuracy, addition, output] [referring, visual, reasoning, question, red, model, clevr, natural, referred, unique, simply, attention, rubber, mattnet, diagnostic, return, program, cylinder, ability, vqa, understanding, identify, requires, textual] [segmentation, module, object, iou, spatial, detection, bounding, mask, box, relation, rmi, eccv, evaluated, foreground, segment] [large, learning, datasets, bias, trained, lecture, sampling]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Runtao and Liu, Chenxi and Bai, Yutong and Yuille, Alan L.},
  title = {CLEVR-Ref+: Diagnosing Visual Reasoning With Referring Expressions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Describing Like Humans: On Diversity in Image Captioning
Qingzhong Wang, Antoni B. Chan


Recently, the state-of-the-art models for image captioning have overtaken human performance on the most popular metrics, such as BLEU, METEOR, ROUGE and CIDEr. Does this mean we have solved the task of image captioning? The above metrics only measure the similarity of the generated caption to the human annotations, which reflects its accuracy. However, an image contains many concepts and multiple levels of detail, and thus there is a variety of captions that express different concepts and details that might be interesting to different humans. Therefore, evaluating only accuracy is not sufficient for measuring the performance of captioning models: the diversity of the generated captions should also be considered. In this paper, we propose a new metric for measuring the diversity of image captions, which is derived from latent semantic analysis and kernelized to use CIDEr similarity. We conduct extensive experiments to re-evaluate recent captioning models in the context of both diversity and accuracy. We find that there is still a large gap between model and human performance in terms of both accuracy and diversity, and that models optimized for accuracy (CIDEr) have low diversity. We also show that balancing the cross-entropy loss and the CIDEr reward in reinforcement learning during training can effectively control the trade-off between the diversity and accuracy of the generated captions.
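One plausible instantiation of an LSA-style diversity score, not necessarily the paper's exact definition: given the pairwise similarity kernel of the m captions generated for one image (e.g., CIDEr similarity), a flat eigenvalue spectrum indicates diverse captions, while a dominant top eigenvalue indicates near-duplicates.

import numpy as np

def spectral_diversity(sim_matrix):
    # sim_matrix: (m, m) symmetric pairwise similarity kernel of m generated captions.
    # Score is 0 when all captions are identical and approaches 1 when they are
    # mutually dissimilar (kernel close to the identity).
    m = sim_matrix.shape[0]
    eigvals = np.clip(np.linalg.eigvalsh(sim_matrix), 1e-12, None)
    ratio = eigvals.max() / eigvals.sum()
    return -np.log(ratio) / np.log(m)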
[human, considering] [field, singular, matrix, single, analysis, computed, good, well, accurate] [image, figure, based, proposed, method, frequency, latent, cgt, conditional] [accuracy, correlation, higher, kernel, performance, table, better, convolutional, gaussian, larger, weight, improving] [diversity, caption, cider, model, generate, captioning, diverse, generated, reward, zebra, grass, word, evaluation, vocabulary, standing, indicates, lsa, machine, evaluate, bleu, sentence, generating, cgan, rouge, balancing, reinforcement, attention, meteor, common, adversarial, playing, measuring, spice, consider] [semantic, score, average, feature, improve] [set, metric, retrieval, similarity, large, loss, trained, training, learning, gap, distribution, randomly, measure, refers, learned, pairwise]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Qingzhong and Chan, Antoni B.},
  title = {Describing Like Humans: On Diversity in Image Captioning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MSCap: Multi-Style Image Captioning With Unpaired Stylized Text
Longteng Guo, Jing Liu, Peng Yao, Jiangwei Li, Hanqing Lu


In this paper, we propose an adversarial learning network for the task of multi-style image captioning (MSCap) with a standard factual image caption dataset and a multi-stylized language corpus without paired images. How to learn a single model for multi-stylized image captioning with unpaired data is a challenging and necessary task, yet it has rarely been studied in previous works. The proposed framework mainly includes four contributive modules following a typical image encoder. First, a style-dependent caption generator outputs a sentence conditioned on an encoded image and a specified style. Second, a caption discriminator is presented to distinguish whether an input sentence is real or not. The discriminator and the generator are trained in an adversarial manner to enable more natural and human-like captions. Third, a style classifier is employed to discriminate the specific style of the input sentence. In addition, a back-translation module is designed to enforce that the generated stylized captions are visually grounded, following the intuition of cycle consistency between factual and stylized captions. We enable end-to-end optimization of the whole model with a differentiable softmax approximation. Finally, we conduct comprehensive experiments using a combined dataset containing four caption styles to demonstrate the outstanding performance of our proposed method.
[dataset, framework, human] [computer, vision, pattern, corresponding, single, directly] [image, style, stylized, unpaired, generator, input, paired, conference, proposed, figure, generative, ieee, translation, real, visually, desired, content] [neural, table, accuracy, network, performance, gate, design] [caption, model, factual, captioning, adversarial, generated, arxiv, language, preprint, visual, mscap, word, text, generate, sentence, linguistic, corpus, machine, vector, discriminator, natural, generating, mode, encoder, relevancy, generation, fluency, perplexity, man, enable, introduce] [context, module, including, coco, propose, sentiment] [training, learning, loss, target, data, trained, classification, classifier, negative, embedding, log, train, learn, specific, dog, embeddings]
@InProceedings{Guo_2019_CVPR,
  author = {Guo, Longteng and Liu, Jing and Yao, Peng and Li, Jiangwei and Lu, Hanqing},
  title = {MSCap: Multi-Style Image Captioning With Unpaired Stylized Text},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
CRAVES: Controlling Robotic Arm With a Vision-Based Economic System
Yiming Zuo, Weichao Qiu, Lingxi Xie, Fangwei Zhong, Yizhou Wang, Alan L. Yuille


Training a robotic arm to accomplish real-world tasks has been attracting increasing attention in both academia and industry. This work discusses the role of computer vision algorithms in this field. We focus on low-cost arms that are not equipped with any sensors, so that all decisions are made upon visual recognition, e.g., real-time 3D pose estimation. This requires annotating a lot of training data, which is not only time-consuming but also laborious. In this paper, we present an alternative solution, which uses a 3D model to create a large amount of synthetic data, trains a vision model in this virtual domain, and applies it to real-world images after domain adaptation. To this end, we design a semi-supervised approach that fully leverages the geometric constraints among keypoints. We apply an iterative algorithm for optimization. Without any annotations on real images, our algorithm generalizes well and produces satisfying results on 3D pose estimation, which is evaluated on two real-world datasets. We also construct a vision-based control system for task accomplishment, for which we train a reinforcement learning agent in a virtual environment and apply it to the real world. Moreover, our approach, with merely a 3D model being required, has the potential to generalize to other types of multi-rigid-body dynamic systems.
[arm, youtube, human, joint, dataset, prediction, motion, manually, work, second, interesting] [pose, vision, virtual, computer, camera, lab, estimation, keypoint, international, algorithm, keypoints, well, pattern, geometric, directly, shape, peter, fangwei, university] [conference, synthetic, control, real, image, figure, background, input, collected] [deep, accuracy, performance, number, table, apply, applied, network, achieve, design] [model, robotic, system, reinforcement, environment, arxiv, preprint, adversarial, visual, named, goal, evaluate, iterative] [detection, object, refined, yizhou, annotate] [training, domain, data, learning, adaptation, target, trained, train, datasets, task, distribution, unlabeled, trevor, sergey, large, generalize, labeled]
@InProceedings{Zuo_2019_CVPR,
  author = {Zuo, Yiming and Qiu, Weichao and Xie, Lingxi and Zhong, Fangwei and Wang, Yizhou and Yuille, Alan L.},
  title = {CRAVES: Controlling Robotic Arm With a Vision-Based Economic System},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Networks for Joint Affine and Non-Parametric Image Registration
Zhengyang Shen, Xu Han, Zhenlin Xu, Marc Niethammer


We introduce an end-to-end deep-learning framework for 3D medical image registration. In contrast to existing approaches, our framework combines two registration methods: an affine registration and a vector momentum-parameterized stationary velocity field (vSVF) model. Specifically, it consists of three stages. In the first stage, a multi-step affine network predicts affine transform parameters. In the second stage, we use a U-Net-like network to generate a momentum, from which a velocity field can be computed via smoothing. Finally, in the third stage, we employ a self-iterable map-based vSVF component to provide a non-parametric refinement based on the current estimate of the transformation map. Once the model is trained, a registration is completed in one forward pass. To evaluate the performance, we conducted longitudinal and cross-subject experiments on 3D magnetic resonance images (MRI) of the knee of the Osteoarthritis Initiative (OAI) dataset. Results show that our framework achieves comparable performance to state-of-the-art medical image registration approaches, but it is much faster, with a better control of transformation regularity including the ability to produce approximately symmetric transformations, and combining affine as well as non-parametric registration.
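The affine stage of such a pipeline can be applied with PyTorch's built-in resampling operations; the sketch below warps a 3D volume with a predicted (N, 3, 4) affine matrix and illustrates that one stage only (the vSVF non-parametric refinement is not shown).

import torch
import torch.nn.functional as F

def apply_affine_3d(moving, theta):
    # moving: (N, C, D, H, W) volume; theta: (N, 3, 4) affine parameters.
    grid = F.affine_grid(theta, moving.shape, align_corners=False)  # (N, D, H, W, 3) sampling grid
    return F.grid_sample(moving, grid, align_corners=False)

# Example: the identity transform leaves the volume unchanged (up to interpolation).
vol = torch.rand(1, 1, 32, 64, 64)
identity = torch.tensor([[[1., 0., 0., 0.],
                          [0., 1., 0., 0.],
                          [0., 0., 1., 0.]]])
warped = apply_affine_3d(vol, identity)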
[velocity, warped, framework, time, work, stationary, consists, displacement, predict] [registration, affine, vsvf, momentum, longitudinal, avsm, field, optimization, diffeomorphic, approach, initial, symmetry, lncc, deformation, marc, formulation, corresponding, syn, computed, smoothness, good, volume, single] [image, transformation, based, method, figure, component, smooth, resolution] [network, performance, numerical, regularization, deep, standard, size, denotes, output, number, fast, factor, achieves, better, computational] [vector, model, memory, generation, refer, evaluate] [map, spatial, medical, three, average, deformable, regression, predicted] [loss, source, target, similarity, training, large, set, symmetric, refers, learning, train, test]
@InProceedings{Shen_2019_CVPR,
  author = {Shen, Zhengyang and Han, Xu and Xu, Zhenlin and Niethammer, Marc},
  title = {Networks for Joint Affine and Non-Parametric Image Registration},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Shape-Aware Embedding for Scene Text Detection
Zhuotao Tian, Michelle Shu, Pengyuan Lyu, Ruiyu Li, Chao Zhou, Xiaoyong Shen, Jiaya Jia


We address the problem of detecting scene text of arbitrary shapes, which is challenging due to the high variety and complexity of the scene. Specifically, we treat text detection as instance segmentation and propose a segmentation-based framework, which extracts each text instance as an independent connected component. To distinguish different text instances, our method maps pixels onto an embedding space where pixels belonging to the same text instance are encouraged to appear close to each other, and vice versa. In addition, we introduce a Shape-Aware Loss that makes training adaptively accommodate the various aspect ratios of text instances and the tiny gaps among them, and a new post-processing pipeline to yield precise bounding box predictions. Experimental results on three challenging datasets (ICDAR15, MSRA-TD500 and CTW1500) demonstrate the effectiveness of our work.
[long, dataset, longer, marked] [scene, directly, pipeline, single, accurate] [method, figure, pixel, image, proposed, arbitrary, based, input, result, comparison, separate] [full, precision, network, effectiveness, table, ratio, original, convolutional, processing, deep, small, better, tiny] [text, model, cij, natural, generated, generate, arxiv, preprint] [map, center, detection, segmentation, instance, feature, bounding, branch, detecting, aspect, three, recall, object, cfi, propose, curved, east, predicted, detect, oriented, challenging, regression, merging, disc, lyu, box] [embedding, loss, cluster, clustering, space, distance, training, large, close, set, minimum, learning, trained]
@InProceedings{Tian_2019_CVPR,
  author = {Tian, Zhuotao and Shu, Michelle and Lyu, Pengyuan and Li, Ruiyu and Zhou, Chao and Shen, Xiaoyong and Jia, Jiaya},
  title = {Learning Shape-Aware Embedding for Scene Text Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Film From Professional Human Motion Videos
Chong Huang, Chuan-En Lin, Zhenyu Yang, Yan Kong, Peng Chen, Xin Yang, Kwang-Ting Cheng


We investigate the problem of 6 degrees of freedom (DOF) camera planning for filming professional human motion videos with a camera drone. Existing methods either plan motions only for a pan-tilt-zoom (PTZ) camera, or adopt ad-hoc solutions without carefully considering the impact of the video contents and previous camera motions on future camera motions. As a result, they can hardly achieve satisfactory results in our drone cinematography task. In this study, we propose a learning-based framework which incorporates the video contents and previous camera motions to predict the future camera motions that enable the capture of professional videos. Specifically, the inputs of our framework are video contents, represented by subject-related features based on the 2D skeleton and scene-related features extracted from background RGB images, and camera motions, represented by optical flows. The correlation between the inputs and the output future camera motions is learned via a sequence-to-sequence convolutional long short-term memory (Seq2Seq ConvLSTM) network from a large set of video clips. We deploy our approach in a real drone cinematography system by first predicting the future camera motions and then converting them to the drone's control commands via an odometer. Our experimental results on extensive datasets and showcases exhibit significant improvements of our approach over conventional baselines, and our approach can successfully mimic the footage of a professional cameraman.
[motion, drone, optical, subject, video, flow, human, prediction, filming, professional, cinematography, aee, predict, planning, future, temporal, previous, framework, imitate, dataset, moving, dji, trajectory, capture, skeleton, work, huang, frame] [camera, computer, autonomous, international, dense, pose, vision, problem, rgb, approach, directly, matrix] [conference, input, based, method, image, quality, ieee, background, figure, style, captured, control, aesthetic] [convolutional, output, network, deep, neural, design, apply, group, layer] [imitation, model, system, represent, visual, arxiv, preprint, policy, evaluate, generate, length] [feature, three, aerial, including, score, predicted] [learning, training, set, experimental, testing, data]
@InProceedings{Huang_2019_CVPR,
  author = {Huang, Chong and Lin, Chuan-En and Yang, Zhenyu and Kong, Yan and Chen, Peng and Yang, Xin and Cheng, Kwang-Ting},
  title = {Learning to Film From Professional Human Motion Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Pay Attention! - Robustifying a Deep Visuomotor Policy Through Task-Focused Visual Attention
Pooya Abolghasemi, Amir Mazaheri, Mubarak Shah, Ladislau Boloni


Several recent studies have demonstrated the promise of deep visuomotor policies for robot manipulator control. Despite impressive progress, these systems are known to be vulnerable to physical disturbances, such as accidental or adversarial bumps that make them drop the manipulated object. They also tend to be distracted by visual disturbances such as objects moving in the robot's field of view, even if the disturbance does not physically prevent the execution of the task. In this paper, we propose an approach for augmenting a deep visuomotor policy trained through demonstrations with Task Focused visual Attention (TFA). The manipulation task is specified with a natural language text such as "move the red bowl to the left". This allows the visual attention component to concentrate on the current object that the robot needs to manipulate. We show that even in benign environments, the TFA allows the policy to consistently outperform a variant with no attention mechanism. More importantly, the new policy is significantly more robust: it regularly recovers from severe physical disturbances (such as bumps causing it to drop the object) from which the baseline policy, i.e. with no visual attention, almost never recovers. In addition, we show that the proposed policy performs correctly in the presence of a wide class of visual disturbances, exhibiting a behavior reminiscent of human selective visual attention experiments.
[frame, human, video, focused, work, perform, state, current, joint, lstm] [approach, note, field, allows, rgb, notice, vision, manipulated] [input, figure, latent, masked, image, proposed, manipulation, real, generator, reconstructed, ieee, based, reconstruct] [network, deep, architecture, neural, number, rate, output, better, drop] [visual, attention, robot, policy, visuomotor, motor, primary, model, physical, disturbance, encoding, text, tfa, textual, discriminator, red, command, arxiv, preprint, encoder, blue, adversarial, benign, reinforcement, system, fake, vector, ptfa, generated, execution, natural, bowl, pick, relevant, sentence, demonstrated] [object, spatial, baseline] [task, training, loss, learning, data, teacher, trained, train, push, representation]
@InProceedings{Abolghasemi_2019_CVPR,
  author = {Abolghasemi, Pooya and Mazaheri, Amir and Shah, Mubarak and Boloni, Ladislau},
  title = {Pay Attention! - Robustifying a Deep Visuomotor Policy Through Task-Focused Visual Attention},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence
Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon


Blind video decaptioning is the problem of automatically removing text overlays and inpainting the occluded parts in videos without any input masks. While recent deep learning based inpainting methods deal with a single image and mostly assume that the positions of the corrupted pixels are known, we aim at automatic text removal in video sequences without mask information. In this paper, we propose a simple yet effective framework for fast blind video decaptioning. We construct an encoder-decoder model, where the encoder takes multiple source frames that can provide visible pixels revealed by the scene dynamics. These hints are aggregated and fed into the decoder. We apply a residual connection from the input frame to the decoder output to enforce our network to focus on the corrupted regions only. Our proposed model ranked first in the ECCV ChaLearn 2018 LAP Inpainting Competition Track 2: Video Decaptioning. In addition, we further improve this strong model by applying a recurrent feedback. The recurrent feedback not only enforces temporal coherence but also provides strong clues on where the corrupted pixels are. Both qualitative and quantitative experiments demonstrate that our full model produces accurate and temporally consistent video results in real time (50+ fps).
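The residual connection from the corrupted input frame to the decoder output can be written in a few lines; the toy module below is a schematic of that design choice only, with placeholder layer sizes.

import torch
import torch.nn as nn

class ResidualDecaptioner(nn.Module):
    # The decoder predicts a residual that is added back to the corrupted input
    # frame, so the network only has to act on the text-covered regions.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.decoder = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, corrupted_frame):
        residual = self.decoder(self.encoder(corrupted_frame))
        return torch.clamp(corrupted_frame + residual, 0.0, 1.0)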
[video, temporal, frame, recurrence, stream, decaptioning, chalearn, time, lap, temporally, multiple, recurrent, challenge, optical, structural] [directly, occluded, algorithm, computer, additional] [image, inpainting, input, corrupted, proposed, ieee, blind, recover, consistency, ssim, conference, figure, quality, based, feedback, qualitative, hybrid, pixel, texture] [table, network, residual, full, deep, skip, output, design, gradient, best, neural, number, validation, apply, better, performance] [model, encoder, decoder, text, visual, adversarial, evaluate, generation] [feature, neighboring, context, cnn, propose, eccv, ablation, final, center] [loss, learning, target, source, large, set, train]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Dahun and Woo, Sanghyun and Lee, Joon-Young and So Kweon, In},
  title = {Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Video Representations From Correspondence Proposals
Xingyu Liu, Joon-Young Lee, Hailin Jin


Correspondences between frames encode rich information about the dynamic content in videos. However, it is challenging to effectively capture and learn them due to their irregular structure and complex dynamics. In this paper, we propose a novel neural network that learns video representations by aggregating information from potential correspondences. This network, named CPNet, can learn evolving 2D fields with temporal consistency. In particular, it can effectively learn representations for videos by mixing appearance and long-range motion with an RGB-only input. We provide extensive ablation experiments to validate our model. CPNet shows stronger performance than existing methods on Kinetics and achieves state-of-the-art performance on Something-Something and Jester. We also provide an analysis of the behavior of our model and show its robustness to errors in the proposals.
[cpnet, video, motion, kinetics, action, dataset, trn, frame, clip, recognition, positional, recognizing, temporal, artnet, jester, outperforms, flow, published, explored, averaged, previous] [point, correspondence] [proposed, spatially] [table, validation, accuracy, deep, convolutional, number, neural, architecture, network, size, net, compare, rate, output, receptive, experiment, original, fewer, increase, gain, achieved, params, batch, normalization, achieves] [model, arxiv, listed, explanation, correct, length] [module, feature, semantic, ablation, baseline, backbone, highest, three, illustrated, val] [training, learning, learn, datasets, testing, large, classification, test, set, data, softmax, cps]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Xingyu and Lee, Joon-Young and Jin, Hailin},
  title = {Learning Video Representations From Correspondence Proposals},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks
Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, Junjie Yan


Siamese network based trackers formulate tracking as convolutional feature cross-correlation between a target template and a search region. However, Siamese trackers still have an accuracy gap compared with state-of-the-art algorithms, and they cannot take advantage of features from deep networks, such as ResNet-50 or deeper. In this work we prove that the core reason is the lack of strict translation invariance. Through comprehensive theoretical analysis and experimental validation, we break this restriction with a simple yet effective spatially aware sampling strategy and successfully train a ResNet-driven Siamese tracker with a significant performance gain. Moreover, we propose a new model architecture to perform depth-wise and layer-wise aggregations, which not only further improves the accuracy but also reduces the model size. We conduct extensive ablation studies to demonstrate the effectiveness of the proposed tracker, which currently obtains the best results on four large tracking benchmarks, including OTB2015, VOT2018, UAV123, and LaSOT. Our model will be released to facilitate further study of this problem.
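The cross-correlation between template and search features that the abstract describes is commonly implemented depth-wise, with the template acting as a per-channel kernel. A hedged PyTorch sketch follows; shapes and names are illustrative, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat, template_feat):
    """search_feat: (B, C, Hs, Ws), template_feat: (B, C, Ht, Wt)."""
    b, c, hs, ws = search_feat.shape
    # Fold batch into channels so each template acts as a per-channel kernel.
    search = search_feat.reshape(1, b * c, hs, ws)
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    out = F.conv2d(search, kernel, groups=b * c)
    return out.reshape(b, c, out.shape[-2], out.shape[-1])

# Example usage with toy tensors:
resp = depthwise_xcorr(torch.randn(2, 256, 31, 31), torch.randn(2, 256, 7, 7))
print(resp.shape)  # torch.Size([2, 256, 25, 25])
```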
[tracking, challenge, dataset, auc, work, online] [template, single, analysis] [based, translation, figure, proposed, high, comparison] [siamese, network, correlation, tracker, deep, performance, layer, accuracy, convolutional, search, precision, depthwise, resnet, best, conv, channel, siamrpn, lasot, table, dasiamrpn, alexnet, deeper, shift, overlap, running, block, achieve, achieves, compared, architecture, structure, convolution, efficient, neural] [visual, success, model, find] [object, feature, rpn, spatial, three, backbone, score, strict, map, response, average, module, semantic, benchmark, propose, improves, proposal, regression] [learning, cross, target, training, train, large, similarity, classification, restriction, sampling, learned, test, strategy, representation, invariance]
@InProceedings{Li_2019_CVPR,
  author = {Li, Bo and Wu, Wei and Wang, Qiang and Zhang, Fangyi and Xing, Junliang and Yan, Junjie},
  title = {SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Sphere Generative Adversarial Network Based on Geometric Moment Matching
Sung Woo Park, Junseok Kwon


We propose sphere generative adversarial network (GAN), a novel integral probability metric (IPM)-based GAN. Sphere GAN uses the hypersphere to bound IPMs in the objective function. Thus, it can be trained stably. On the hypersphere, sphere GAN exploits the information of higher-order statistics of data using geometric moment matching, thereby providing more accurate results. In the paper, we mathematically prove the good properties of sphere GAN. In experiments, sphere GAN quantitatively and qualitatively surpasses recent state-of-the-art GANs for unsupervised image generation problems with the CIFAR-10, STL-10, and LSUN bedroom datasets. Source code is available at https://github.com/pswkiki/SphereGAN.
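Two ingredients the abstract highlights can be sketched compactly: mapping discriminator features onto a hypersphere via inverse stereographic projection, and comparing higher-order moments of the geodesic distance to a reference point. The numpy sketch below is only a rough illustration of that idea; the moment definitions and objective differ from the paper.

```python
import numpy as np

def inverse_stereographic(v):
    """Map points v in R^d onto the unit sphere S^d in R^(d+1)."""
    sq = np.sum(v ** 2, axis=-1, keepdims=True)
    return np.concatenate([2.0 * v, sq - 1.0], axis=-1) / (sq + 1.0)

def moment_distance(real_feats, fake_feats, max_order=3):
    """Compare the first `max_order` moments of geodesic distances to the pole."""
    pole = np.zeros(real_feats.shape[-1] + 1)
    pole[-1] = 1.0                                   # "north pole" reference point
    def geodesic(v):
        p = inverse_stereographic(v)
        return np.arccos(np.clip(p @ pole, -1.0, 1.0))
    d_real, d_fake = geodesic(real_feats), geodesic(fake_feats)
    return sum(abs(np.mean(d_real ** r) - np.mean(d_fake ** r))
               for r in range(1, max_order + 1))

print(moment_distance(np.random.randn(64, 8), np.random.randn(64, 8)))
```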
[moment, time, term, multiple] [sphere, geometric, defined, additional, matching, riemannian, constraint, stable, denote, inverse, stereographic, projection, point, definition] [generative, image, real, generator, based, proposed, figure, transformation, spectral] [gradient, network, norm, penalty, higher, table, bounded, convolutional, weight, equivalent, fisher, denotes] [gan, gans, adversarial, discriminator, probability, ipms, wesserstein, generation, fid, fake, lsun, wgan, wasserstein, inception, bedroom, generated, lipschitz] [feature] [distance, function, hypersphere, objective, space, data, training, conventional, learning, mmd, euclidean, minimizing, unsupervised, ipm, trained, sample, set, novel, class]
@InProceedings{Park_2019_CVPR,
  author = {Woo Park, Sung and Kwon, Junseok},
  title = {Sphere Generative Adversarial Network Based on Geometric Moment Matching},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Adversarial Attacks Beyond the Image Space
Xiaohui Zeng, Chenxi Liu, Yu-Siang Wang, Weichao Qiu, Lingxi Xie, Yu-Wing Tai, Chi-Keung Tang, Alan L. Yuille


Generating adversarial examples is an intriguing problem and an important way of understanding the working mechanism of deep neural networks. Most existing approaches generate perturbations in the image space, i.e., each pixel can be modified independently. However, in this paper we pay special attention to the subset of adversarial examples that correspond to meaningful changes in 3D physical properties (like rotation and translation, illumination condition, etc.). These adversaries arguably pose a more serious concern, as they demonstrate the possibility of causing neural network failure by easy perturbations of real-world 3D objects and scenes. In the contexts of object classification and visual question answering, we augment state-of-the-art deep neural networks that receive 2D input images with a rendering module (either differentiable or not) in front, so that a 3D scene (in the physical space) is rendered into a 2D image (in the image space), and then mapped to a prediction (in the output space). The adversarial perturbations can now go beyond the image space, and have clear meanings in the 3D physical world. Though image-space adversaries can be interpreted as per-pixel albedo change, we verify that they cannot be well explained along these physically meaningful dimensions, which often have a non-local effect. It is still possible to successfully attack beyond the image space, in the physical space, though this is more difficult than image-space attacks, as reflected in the lower success rates and heavier perturbations required.
[work] [rendering, differentiable, light, illumination, rotation, vision, material, rendered, point, directly, surface, optimization, scene, rgb, algorithm, physically, confidence] [image, figure, color, pixel, method, translation, input, row, denoted] [deep, neural, network, gradient, output, rate, small, number, alexnet, original, table, energy, lower, higher, size] [physical, adversarial, attack, visual, attacking, question, success, conf, perceptibility, find, answering, perturbation, modifying, answer, renderer, attacked, generated, environment, goal, arxiv, preprint, generating] [object, three, module, car, semantic, segmentation] [space, classification, set, learning, large, testing, class, function, dimension, difficult]
@InProceedings{Zeng_2019_CVPR,
  author = {Zeng, Xiaohui and Liu, Chenxi and Wang, Yu-Siang and Qiu, Weichao and Xie, Lingxi and Tai, Yu-Wing and Tang, Chi-Keung and Yuille, Alan L.},
  title = {Adversarial Attacks Beyond the Image Space},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Evading Defenses to Transferable Adversarial Examples by Translation-Invariant Attacks
Yinpeng Dong, Tianyu Pang, Hang Su, Jun Zhu


Deep neural networks are vulnerable to adversarial examples, which can mislead classifiers by adding imperceptible perturbations. An intriguing property of adversarial examples is their good transferability, making black-box attacks feasible in real-world applications. Due to the threat of adversarial attacks, many methods have been proposed to improve the robustness. Several state-of-the-art defenses are shown to be robust against transferable adversarial examples. In this paper, we propose a translation-invariant attack method to generate more transferable adversarial examples against the defense models. By optimizing a perturbation over an ensemble of translated images, the generated adversarial example is less sensitive to the white-box model being attacked and has better transferability. To improve the efficiency of attacks, we further show that our method can be implemented by convolving the gradient at the untranslated image with a pre-defined kernel. Our method is generally applicable to any gradient-based attack method. Extensive experiments on the ImageNet dataset validate the effectiveness of the proposed method. Our best attack fools eight state-of-the-art defenses at an 82% success rate on average based only on the transferability, demonstrating the insecurity of the current defense techniques.
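The kernel trick mentioned in the abstract (convolving the input gradient with a pre-defined kernel instead of explicitly optimizing over many translated copies) can be sketched in a few lines. The Gaussian kernel, its size, and the single-step FGSM variant below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=15, sigma=3.0):
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).expand(3, 1, size, size)      # one kernel per RGB channel

def ti_fgsm_step(model, x, y, eps, kernel):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    # Translation-invariant smoothing of the gradient before the sign step.
    grad = F.conv2d(grad, kernel, padding=kernel.shape[-1] // 2, groups=3)
    return (x + eps * grad.sign()).clamp(0, 1).detach()
```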
[current, dataset] [robust, linear, wij, property, crafted, optimization] [method, proposed, image, based, input, translated, jpeg, jun, figure] [kernel, gradient, deep, neural, gaussian, size, table, calculate, rate, fast, resnet, convolutional, effectiveness, small, basic] [adversarial, attack, success, defense, generated, model, xadv, generate, example, fgsm, transferability, sign, xreal, dim, robustness, iterative, hgd, tvm, perturbation, inception, fool, ian, tianyu, sensitive, untranslated, attention, blackbox] [improve, average, adopt] [loss, transferable, discriminative, trained, ensemble, learning, function, set, uniform, large]
@InProceedings{Dong_2019_CVPR,
  author = {Dong, Yinpeng and Pang, Tianyu and Su, Hang and Zhu, Jun},
  title = {Evading Defenses to Transferable Adversarial Examples by Translation-Invariant Attacks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Decoupling Direction and Norm for Efficient Gradient-Based L2 Adversarial Attacks and Defenses
Jerome Rony, Luiz G. Hafemann, Luiz S. Oliveira, Ismail Ben Ayed, Robert Sabourin, Eric Granger


Research on adversarial examples in computer vision tasks has shown that small, often imperceptible changes to an image can induce misclassification, which has security implications for a wide range of image processing systems. Considering L2 norm distortions, the Carlini and Wagner attack is presently the most effective white-box attack in the literature. However, this method is slow since it performs a line-search for one of the optimization terms, and often requires thousands of iterations. In this paper, an efficient approach is proposed to generate gradient-based attacks that induce misclassifications with low L2 norm, by decoupling the direction and the norm of the adversarial perturbation that is added to the image. Experiments conducted on the MNIST, CIFAR-10 and ImageNet datasets indicate that our attack achieves comparable results to the state-of-the-art (in terms of L2 norm) with considerably fewer iterations (as few as 100 iterations), which opens the possibility of using these attacks for adversarial training. Models trained with our attack achieve state-of-the-art robustness against white-box gradient-based L2 attacks on the MNIST and CIFAR-10 datasets, outperforming the Madry defense when the attacks are limited to a maximum norm.
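A hedged sketch of the decoupling idea: the gradient sets only the perturbation direction, while its L2 norm is an explicitly maintained quantity that shrinks when the example already fools the model and grows otherwise. The step sizes and schedule below are illustrative, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def ddn_like_attack(model, x, y, steps=100, alpha=0.05, gamma=0.05):
    delta = torch.zeros_like(x, requires_grad=True)
    eps = torch.full((x.size(0),), 1.0, device=x.device)     # per-sample L2 budget
    for _ in range(steps):
        logits = model(x + delta)
        loss = F.cross_entropy(logits, y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # Direction: normalized gradient ascent step on the loss.
            g_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
            delta += alpha * grad / g_norm
            # Norm: shrink eps for samples that are already adversarial,
            # grow it otherwise, then rescale delta to have norm exactly eps.
            is_adv = logits.argmax(1) != y
            eps = torch.where(is_adv, eps * (1 - gamma), eps * (1 + gamma))
            d_norm = delta.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
            delta *= eps.view(-1, 1, 1, 1) / d_norm
    return (x + delta).clamp(0, 1).detach()
```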
[report, considering] [direction, vision, algorithm, international, computer, robust, optimizing, optimization, limited, case, problem, approach, constraint] [proposed, image, conference, method, noise, input, figure] [norm, imagenet, original, performance, accuracy, number, gradient, search, low, neural, increase, obtains, compared, table, achieves, size, comparable, achieve, higher] [adversarial, attack, ddn, model, madry, defense, robustness, deepfool, success, perturbation, example, targeted, step, untargeted, consider, generate, evaluation, worst, decision, machine, restricted, constrained] [baseline, average, region] [training, mnist, learning, trained, loss, maximum, sample, scenario, hyperparameters, obtaining, class, min]
@InProceedings{Rony_2019_CVPR,
  author = {Rony, Jerome and Hafemann, Luiz G. and Oliveira, Luiz S. and Ben Ayed, Ismail and Sabourin, Robert and Granger, Eric},
  title = {Decoupling Direction and Norm for Efficient Gradient-Based L2 Adversarial Attacks and Defenses},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A General and Adaptive Robust Loss Function
Jonathan T. Barron


We present a generalization of the Cauchy/Lorentzian, Geman-McClure, Welsch/Leclerc, generalized Charbonnier, Charbonnier/pseudo-Huber/L1-L2, and L2 loss functions. By introducing robustness as a continuous parameter, our loss function allows algorithms built around robust loss minimization to be generalized, which improves performance on basic vision tasks such as registration and clustering. Interpreting our loss as the negative log of a univariate density yields a general probability distribution that includes normal and Cauchy distributions as special cases. This probabilistic interpretation enables the training of neural networks in which the robustness of the loss automatically adapts itself during training, which improves performance on learning-based tasks such as generative image synthesis and unsupervised monocular depth estimation, without requiring any manual parameter tuning.
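For reference, the general loss can be written compactly. The numpy sketch below implements the generic case with shape parameter alpha and scale c; the removable singularities (alpha = 0, alpha = 2, and the limit alpha -> -inf, which recover the Cauchy/log, L2, and Welsch-like losses) need separate handling and are omitted here.

```python
import numpy as np

def general_robust_loss(x, alpha, c):
    # rho(x, alpha, c) = (|alpha - 2| / alpha) * (((x/c)^2 / |alpha - 2| + 1)^(alpha/2) - 1)
    b = abs(alpha - 2.0)
    return (b / alpha) * (((x / c) ** 2 / b + 1.0) ** (alpha / 2.0) - 1.0)

# Example: alpha = 1 gives a smoothed-L1 (Charbonnier-like) loss, sqrt(x^2 + 1) - 1.
print(general_robust_loss(np.linspace(-3, 3, 7), alpha=1.0, c=1.0))
```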
[nll, determine, work, dataset] [general, robust, depth, shape, normal, cauchy, estimation, allows, monocular, optimization, allowing, algorithm, registration, special, respect, vision, single, rgb, continuous, limit, exp, outlier, approach, manual, well] [image, figure, wavelet, demonstrate, pixel, synthesis, generative] [parameter, performance, fixed, scale, adaptive, output, table, neural, replace, better, introducing, automatically, gradient, residual, magnitude, network, small, larger, replacing] [robustness, model, probability, variational, simply, common] [baseline, improve, three] [loss, distribution, function, training, set, learning, generalize, negative, unsupervised, generalized, minimizing, yuv, fgr, sampling, autoencoders, setting, task, annealing, clustering, generalization, log, existing, supplement]
@InProceedings{Barron_2019_CVPR,
  author = {Barron, Jonathan T.},
  title = {A General and Adaptive Robust Loss Function},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration
Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, Yi Yang


Previous works utilized a "smaller-norm-less-important" criterion to prune filters with smaller norm values in a convolutional neural network. In this paper, we analyze this norm-based criterion and point out that its effectiveness depends on two requirements that are not always met: (1) the norm deviation of the filters should be large; (2) the minimum norm of the filters should be small. To solve this problem, we propose a novel filter pruning method, namely Filter Pruning via Geometric Median (FPGM), to compress the model regardless of those two requirements. Unlike previous methods, FPGM compresses CNN models by pruning filters with redundancy, rather than those with "relatively less" importance. When applied to two image classification benchmarks, our method validates its usefulness and strengths. Notably, on CIFAR-10, FPGM reduces more than 52% of FLOPs on ResNet-110 with even a 2.69% relative accuracy improvement. Moreover, on ILSVRC-2012, FPGM reduces more than 42% of FLOPs on ResNet-101 without top-5 accuracy drop, which has advanced the state-of-the-art. Code is publicly available on GitHub: https://github.com/he-y/filter-pruning-geometric-median
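The selection criterion can be sketched in a few lines: flatten the filters of a layer, treat the filter minimizing its summed distance to all others as an approximation of the geometric median, and mark the filters nearest that point as redundant. The pruning ratio below is illustrative.

```python
import torch

def fpgm_style_selection(conv_weight, prune_ratio=0.3):
    """conv_weight: (out_channels, in_channels, k, k) tensor of a conv layer."""
    filters = conv_weight.flatten(1)                    # (out_channels, d)
    dist = torch.cdist(filters, filters, p=2)           # pairwise L2 distances
    redundancy = dist.sum(dim=1)                        # summed distance to all others
    num_prune = int(prune_ratio * filters.size(0))
    # Filters with the smallest summed distance lie near the geometric median
    # and are considered the most replaceable.
    return torch.argsort(redundancy)[:num_prune]

idx = fpgm_style_selection(torch.randn(64, 32, 3, 3))
print(idx)  # indices of filters to prune
```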
[previous, work, time, utilized] [geometric, deviation, theoretical, ideal, computer, analysis, point, vision] [figure, method, realistic, based, comparison, image] [pruning, filter, norm, fpgm, neural, pruned, convolutional, criterion, network, deep, layer, accuracy, performance, acceleration, small, table, sfp, prune, pfec, rate, achieves, weight, efficient, parameter, ith, achieve, interval, smaller, cost, residual, accelerating, computation, batch, scratch, resnet, better, analyze, reduces, number] [model, median, green, blue, arg, arxiv, preprint] [feature, baseline, curve, propose, threshold, mil, cnn] [training, data, minimum, min, large, selected, learning, distribution, select, distance]
@InProceedings{He_2019_CVPR,
  author = {He, Yang and Liu, Ping and Wang, Ziwei and Hu, Zhilan and Yang, Yi},
  title = {Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Quantize Deep Networks by Optimizing Quantization Intervals With Task Loss
Sangil Jung, Changyong Son, Seohyung Lee, Jinwoo Son, Jae-Joon Han, Youngjun Kwak, Sung Ju Hwang, Changkyu Choi


Reducing the bit-widths of the activations and weights of deep networks makes them efficient to compute and store in memory, which is crucial for deployment on resource-limited devices such as mobile phones. However, decreasing bit-widths with quantization generally yields drastically degraded accuracy. To tackle this problem, we propose to learn to quantize activations and weights via a trainable quantizer that transforms and discretizes them. Specifically, we parameterize the quantization intervals and obtain their optimal values by directly minimizing the task loss of the network. This quantization-interval-learning (QIL) allows the quantized networks to maintain the accuracy of the full-precision (32-bit) networks with bit-width as low as 4-bit and to minimize the accuracy degradation with further bit-width reduction (i.e., 3 and 2-bit). Moreover, our quantizer can be trained on a heterogeneous dataset, and thus can be used to quantize pretrained networks without access to their training data. We demonstrate the effectiveness of our trainable quantizer on the ImageNet dataset with various network architectures such as ResNet-18, -34 and AlexNet, on which it outperforms existing methods to achieve state-of-the-art accuracy.
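A hedged sketch of a quantizer with trainable interval parameters, in the spirit of the abstract: a learnable center and width define the clipping interval, and rounding uses a straight-through estimator so the interval parameters receive gradients from the task loss. This simplifies the paper's actual transformer considerably and is not the authors' implementation.

```python
import torch
import torch.nn as nn

class IntervalQuantizer(nn.Module):
    def __init__(self, bits=4, init_center=0.5, init_width=0.5):
        super().__init__()
        self.center = nn.Parameter(torch.tensor(init_center))  # interval center (learned)
        self.width = nn.Parameter(torch.tensor(init_width))    # interval half-width (learned)
        self.levels = 2 ** (bits - 1) - 1                       # levels per sign

    def forward(self, x):
        # Map |x| into [0, 1] inside the learned interval and keep the sign.
        lo, hi = self.center - self.width, self.center + self.width
        t = ((x.abs() - lo) / (hi - lo)).clamp(0.0, 1.0) * x.sign()
        # Straight-through rounding: forward rounds, backward acts as identity,
        # so gradients flow back to center and width.
        q = torch.round(t * self.levels) / self.levels
        return t + (q - t).detach()
```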
[dataset, joint, work, jointly, heterogeneous, performs] [compute, computer, note, good, vision, optimal, bound, respect, accurate] [method, conference, demonstrate, figure, preserve] [quantization, accuracy, network, weight, quantizer, low, trainable, pruning, activation, neural, quantized, interval, convolutional, deep, clipping, quantizers, layer, imagenet, alexnet, quantize, number, fixed, table, reducing, original, progressive, ratio, performance, higher, parameterized, convolution, compared, achieves, gradient, epoch, efficient, layerwise, operation, ternary, resnet, initialization, thp] [model, arxiv, preprint, transformer, consider] [propose] [training, trained, learning, task, loss, minimizing, existing, function, train, distribution, classification, set, upper, minimize, large]
@InProceedings{Jung_2019_CVPR,
  author = {Jung, Sangil and Son, Changyong and Lee, Seohyung and Son, Jinwoo and Han, Jae-Joon and Kwak, Youngjun and Ju Hwang, Sung and Choi, Changkyu},
  title = {Learning to Quantize Deep Networks by Optimizing Quantization Intervals With Task Loss},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Not All Areas Are Equal: Transfer Learning for Semantic Segmentation via Hierarchical Region Selection
Ruoqi Sun, Xinge Zhu, Chongruo Wu, Chen Huang, Jianping Shi, Lizhuang Ma


The success of deep neural networks for semantic segmentation heavily relies on large-scale and well-labeled datasets, which are hard to collect in practice. Synthetic data offers an alternative to obtain ground-truth labels for free. However, models directly trained on synthetic data often struggle to generalize to real images. In this paper, we consider transfer learning for semantic segmentation that aims to mitigate the gap between abundant synthetic data (source domain) and limited real data (target domain). Unlike previous approaches that either learn mappings to the target domain or finetune on target images, our proposed method jointly learns from real images and selectively from realistic pixels in synthetic images to adapt to the target domain. Our key idea is to have weighting networks score how similar the synthetic pixels are to real ones, and to learn such weighting at the pixel, region, and image levels. We jointly learn these hierarchical weighting networks and the segmentation network in an end-to-end manner. Extensive experiments demonstrate that our proposed approach significantly outperforms other existing baselines, and is applicable to scenarios with extremely limited real images.
[joint, dataset, multiple, jointly] [computer, vision, pattern, note, single, direct, international] [synthetic, image, real, ieee, conference, method, insufficient, proposed, generator, pixel, ladv, demonstrate, figure, amount] [network, better, deep, table, performance, achieves, convolutional, abundant, selection, extremely, design, effectiveness, compared, neural, apply] [model, adversarial, encoder, discriminator, machine, arxiv, preprint, indicates] [segmentation, semantic, hierarchical, map, lseg, region, fcn, road, fig, seg, urban, jianping, propose] [weighting, learning, domain, data, source, target, transfer, gtav, loss, training, learn, labeled, adaptation, synthia, knowledge, unsupervised, label, gap, setting, datasets, large, pixelda, trained, similarity, shared, experimental]
@InProceedings{Sun_2019_CVPR,
  author = {Sun, Ruoqi and Zhu, Xinge and Wu, Chongruo and Huang, Chen and Shi, Jianping and Ma, Lizhuang},
  title = {Not All Areas Are Equal: Transfer Learning for Semantic Segmentation via Hierarchical Region Selection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Learning of Dense Shape Correspondence
Oshri Halimi, Or Litany, Emanuele Rodola, Alex M. Bronstein, Ron Kimmel


We introduce the first completely unsupervised correspondence learning approach for deformable 3D shapes. Key to our model is the understanding that natural deformations (such as changes in pose) approximately preserve the metric structure of the surface, yielding a natural criterion to drive the learning process toward distortion-minimizing predictions. On this basis, we overcome the need for annotated data and replace it by a purely geometric criterion. The resulting learning model is class-agnostic, and is able to leverage any type of deformable geometric data for the training phase. In contrast to existing supervised approaches which specialize on the class seen at training time, we demonstrate stronger generalization as well as applicability to a variety of challenging settings. We showcase our method on a wide selection of correspondence benchmarks, where we outperform other methods in terms of accuracy, generalization, and efficiency.
[human, framework, prediction, online] [correspondence, shape, functional, computer, geodesic, geometric, faust, fmnet, descriptor, matrix, corresponding, dense, matching, axiomatic, pmf, well, problem, single, approach, distortion, additional, vision, pattern, linear, ground, truth, volume, lbo, sgmds, form, israel, geometry, deformation, pose, local] [method, figure, synthetic, ieee, based, spectral, proposed, conference, input, reference, demonstrate] [network, deep, kernel, architecture, layer, performance, compare, processing] [partial, model, expected, calculated] [map, deformable] [unsupervised, learning, supervised, training, loss, distance, set, data, pair, test, trained, soft, generalization, learned, shot, train, class, diffusion, pairwise]
@InProceedings{Halimi_2019_CVPR,
  author = {Halimi, Oshri and Litany, Or and Rodola, Emanuele and Bronstein, Alex M. and Kimmel, Ron},
  title = {Unsupervised Learning of Dense Shape Correspondence},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Visual Domain Adaptation: A Deep Max-Margin Gaussian Process Approach
Minyoung Kim, Pritish Sahu, Behnam Gholami, Vladimir Pavlovic


For unsupervised domain adaptation, the target domain error can be provably reduced by having a shared input representation that makes the source and target domains indistinguishable from each other. Very recently it has been shown that it is not only critical to match the marginal input distributions, but also align the output class distributions. The latter can be achieved by minimizing the maximum discrepancy of predictors. In this paper, we take this principle further by proposing a more systematic and effective way to achieve hypothesis consistency using Gaussian processes (GP). The GP allows us to induce a hypothesis space of classifiers from the posterior distribution of the latent random functions, turning the learning into a large-margin posterior separation problem, significantly easier to solve than previous approaches based on adversarial minimax optimization. We formulate a learning objective that effectively influences the posterior to minimize the maximum discrepancy. This is shown to be equivalent to maximizing margins and minimizing uncertainty of the class predictions in the target domain. Empirical results demonstrate that our approach leads to state-of-the-art performance superior to existing methods on several challenging benchmarks for domain adaptation.
[prediction, hypothesis, traffic, joint, recognition, key] [vision, computer, international, approach, pattern, point, optimization, well, error, defined] [conference, ieee, latent, proposed, method, prior, separation, input, high, image] [deep, neural, processing, gaussian, performance, kernel, max, inference, process, bayesian, network, covariance, accuracy, approximate] [model, adversarial, variational, machine, consider, visual, random] [feature, predicted] [domain, target, learning, posterior, source, space, unsupervised, classifier, class, adaptation, training, data, gpda, maximum, classification, shared, discrepancy, mnist, distribution, embedding, labeled, mcda, function, representation, uncertainty, test, mcd, svhn, datasets, digit, minimize, uda, idea, learned, measure, generalization]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Minyoung and Sahu, Pritish and Gholami, Behnam and Pavlovic, Vladimir},
  title = {Unsupervised Visual Domain Adaptation: A Deep Max-Margin Gaussian Process Approach},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Balanced Self-Paced Learning for Generative Adversarial Clustering Network
Kamran Ghasedi, Xiaoqian Wang, Cheng Deng, Heng Huang


Clustering is an important problem in various machine learning applications, but it remains a challenging task when dealing with complex real data. Existing clustering algorithms use either shallow models with insufficient capacity to capture the non-linear nature of the data, or deep models with a large number of parameters prone to overfitting. In this paper, we propose a deep Generative Adversarial Clustering Network (ClusterGAN), which tackles the problem of training deep clustering models in an unsupervised manner. ClusterGAN consists of three networks: a discriminator, a generator, and a clusterer (i.e., a clustering network). We employ an adversarial game between these three players to synthesize realistic samples given discriminative latent variables via the generator, and to learn the inverse mapping from the real samples to the discriminative embedding space via the clusterer. Moreover, we utilize a conditional entropy minimization loss to increase/decrease the similarity of intra/inter cluster samples. Since the ground-truth similarities are unknown in the clustering task, we propose a novel balanced self-paced learning algorithm that gradually includes samples into training from easy to difficult, while considering the diversity of selected samples across all clusters. Our method therefore makes it possible to efficiently train clusterers with large depth by leveraging the proposed adversarial game and the balanced self-paced learning algorithm. According to our experiments, ClusterGAN achieves competitive results compared to state-of-the-art clustering and hashing models on several datasets.
[framework, dataset, joint] [algorithm, computer, pattern, vision, alternative, matrix, international, local, analysis] [generator, conference, real, generative, image, ieee, conditional, spectral, realistic, figure, proposed, diagonal, based, latent, input] [deep, neural, processing, shallow, performance, network, compared, order, regularization, number, block, standard, applied, gradually] [adversarial, model, discriminator, game, machine, gan, random, arxiv, preprint, diversity] [three, easy] [learning, clustering, clustergan, clusterer, loss, unsupervised, data, training, discriminative, balanced, objective, entropy, function, hashing, adjacency, aij, trained, embedding, min, log, distribution, mnist, similarity, selected, retrieval, supervised, large, novel, train, learn, cluster, minimum, set]
@InProceedings{Ghasedi_2019_CVPR,
  author = {Ghasedi, Kamran and Wang, Xiaoqian and Deng, Cheng and Huang, Heng},
  title = {Balanced Self-Paced Learning for Generative Adversarial Clustering Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Style-Based Generator Architecture for Generative Adversarial Networks
Tero Karras, Samuli Laine, Timo Aila


We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
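Two of the ingredients named in the abstract, the mapping network and the scale-specific style injection via adaptive instance normalization (AdaIN), can be sketched as follows. Layer sizes are illustrative and this is not the released implementation.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a latent code z to an intermediate code w through an MLP."""
    def __init__(self, z_dim=512, w_dim=512, depth=8):
        super().__init__()
        layers = []
        for i in range(depth):
            layers += [nn.Linear(z_dim if i == 0 else w_dim, w_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

def adain(x, style_scale, style_bias, eps=1e-5):
    """x: (B, C, H, W); style_scale/style_bias: (B, C), derived from w per layer."""
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True) + eps
    normalized = (x - mean) / std
    # Per-channel style modulation: this is where scale-specific control enters.
    return style_scale[..., None, None] * normalized + style_bias[..., None, None]
```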
[dataset, human, work, interesting] [linear, truncation, corresponding, well, compute, point] [latent, generator, image, noise, input, style, mapping, generative, variation, traditional, mixing, figure, synthesis, interpolation, perceptual, intermediate, adain, ffhq, quality, control, based, disentanglement, face, entangled, fids, hair, disentangled, separation] [network, stochastic, architecture, table, layer, regularization, neural, better, convolution, conv, deep, number, trick, fine] [adversarial, generated, path, length, gan, generate, improved, introduce] [improves, feature, average, baseline, global, propose] [space, training, learned, separability, code, source, transfer, loss, set, learning, unsupervised, trained, sampling, distribution, distance, metric]
@InProceedings{Karras_2019_CVPR,
  author = {Karras, Tero and Laine, Samuli and Aila, Timo},
  title = {A Style-Based Generator Architecture for Generative Adversarial Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Parallel Optimal Transport GAN
Gil Avraham, Yan Zuo, Tom Drummond


Although Generative Adversarial Networks (GANs) are known for their sharp realism in image generation, they often fail to estimate areas of the data density. This leads to low modal diversity and at times distorted generated samples. These problems essentially arise from poor estimation of the distance metric responsible for training these networks. To address these issues, we introduce an additional regularisation term which performs optimal transport in parallel within a low dimensional representation space. We demonstrate that operating in a low dimension representation of the data distribution benefits from convergence rate gains in estimating the Wasserstein distance, resulting in more stable GAN training. We empirically show that our regulariser achieves a stabilising effect which leads to higher quality of generated samples and increased mode coverage of the given data distribution. Our method achieves significant improvements on the CIFAR-10, Oxford Flowers and CUB Birds datasets over several GAN baselines both qualitatively and quantitatively.
[term, framework, dataset, leaf] [optimal, estimating, oxford, dimensional, matching, additional, approach, optimisation, problem, supplementary] [latent, generator, generative, transformation, method, quality, mapping, real, figure, regulariser, noise] [cost, neural, rate, gaussian, low, network, better, size, table, processing, increased, compared, parallel, achieves] [gan, wasserstein, arxiv, preprint, generated, adversarial, model, regularisation, fid, veegan, mode, probability, gans, inception, decision, diversity, variational, manifold, modal, discriminator, procedure, refer, vector] [map, improvement, score] [distribution, distance, data, training, transport, convergence, function, representation, space, sample, learning, cub, transportation, divergence, loss, trained, ltf, dimension]
@InProceedings{Avraham_2019_CVPR,
  author = {Avraham, Gil and Zuo, Yan and Drummond, Tom},
  title = {Parallel Optimal Transport GAN},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans
Ji Hou, Angela Dai, Matthias Niessner


We introduce 3D-SIS, a novel neural network architecture for 3D semantic instance segmentation in commodity RGB-D scans. The core idea of our method is to jointly learn from both geometric and color signals, thus enabling accurate instance predictions. Rather than operating solely on 2D frames, we observe that most computer vision applications have multi-view RGB-D input available, which we leverage to construct an approach for 3D instance segmentation that effectively fuses together these multi-modal inputs. Our network leverages high-resolution RGB input by associating 2D images with the volumetric grid based on the pose alignment of the 3D reconstruction. For each image, we first extract 2D features for each pixel with a series of 2D convolutions; we then backproject the resulting feature vector to the associated voxel in the 3D grid. This combination of 2D and 3D feature learning allows significantly higher-accuracy object detection and instance segmentation than state-of-the-art alternatives. We show results on both synthetic and real-world public benchmarks, achieving an improvement in mAP of over 13 on real-world data.
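The backprojection step that associates 2D image features with the volumetric grid can be sketched as follows, assuming known camera intrinsics and a world-to-camera pose. Coordinate conventions and nearest-pixel sampling are simplifications of the paper's pipeline.

```python
import torch

def backproject_features(feat2d, voxel_centers, world_to_cam, intrinsics):
    """feat2d: (C, H, W) image features; voxel_centers: (N, 3) world coordinates;
    world_to_cam: (4, 4); intrinsics: (3, 3). Returns (N, C) per-voxel features."""
    C, H, W = feat2d.shape
    ones = torch.ones(voxel_centers.size(0), 1, device=voxel_centers.device)
    # Transform voxel centers into the camera frame, then project to pixels.
    cam = (world_to_cam @ torch.cat([voxel_centers, ones], dim=1).t()).t()[:, :3]
    pix = (intrinsics @ cam.t()).t()
    u = (pix[:, 0] / pix[:, 2]).round().long()
    v = (pix[:, 1] / pix[:, 2]).round().long()
    # Keep voxels that land in front of the camera and inside the image.
    valid = (cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = torch.zeros(voxel_centers.size(0), C, device=voxel_centers.device)
    out[valid] = feat2d[:, v[valid], u[valid]].t()
    return out
```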
[jointly, prediction, joint, predict, series] [rgb, computer, geometry, approach, vision, single, voxel, pattern, associated, volumetric, scene, ground, truth, angela, matthias, scan, point, shape, suncg, shuran, geometric, depth] [input, color, conference, ieee, proposed, synthetic, method, based, resolution] [network, deep, architecture, neural, convolutional, pooling, table, layer, order, inference] [arxiv, preprint, evaluate] [instance, object, segmentation, detection, mask, semantic, bounding, feature, box, grid, spatial, map, region, operate, anchor, predicted, final, sgpn, ross, leverage, sliding, backbone] [learning, data, training, class, test, train, classification, learn, representation, loss]
@InProceedings{Hou_2019_CVPR,
  author = {Hou, Ji and Dai, Angela and Niessner, Matthias},
  title = {3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Causes and Corrections for Bimodal Multi-Path Scanning With Structured Light
Yu Zhang, Daniel L. Lau, Ying Yu


Structured light illumination is an active 3D scanning technique based on projecting/capturing a set of striped patterns and measuring the warping of the patterns as they reflect off a target object's surface. As designed, each pixel in the camera sees exactly one pixel from the projector; however, there are multi-path situations when the scanned surface has a complicated geometry with step edges and other discontinuities in depth or where the target surface has specularities that reflect light away from the camera. These situations are generally referred to as multi-path, where a camera pixel sees light from multiple projector positions. In the case of bimodal multi-path, the camera pixel receives light from exactly two positions, which occurs along a step edge where the edge slices through a pixel so that the pixel sees both a foreground and a background surface. In this paper, we present a general mathematical model to address the bimodal multi-path issue in a phase-measuring-profilometry scanner, measuring the constructive and destructive interference between the two light paths and, by taking advantage of this interesting cue, separating the paths to make two decoupled phase measurements. We validate our algorithm with a number of challenging real-world scenarios, outperforming the state-of-the-art method.
[versus, showing, multiple, signal, term] [light, pattern, surface, scan, camera, projected, reconstruction, direct, pmp, scanning, owl, vision, computer, interference, point, problem, illumination, sli, range, shape, depth, single, note, sinusoidal, mesh, daniel, scanned, scanner, algorithm, tof, approach, multipath, unwrapping, calibration] [pixel, frequency, figure, background, projector, proposed, image, ieee, component, raw, conference, separate, traditional, noise, stem, side, high, based, row, transparent, method] [phase, structured, magnitude, small, secondary, number, modulation, represents] [step, bimodal, vector, model, procedure, observed, primary, shifting, unique] [edge, spatial, foreground, global, inside] [target, function]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Yu and Lau, Daniel L. and Yu, Ying},
  title = {Causes and Corrections for Bimodal Multi-Path Scanning With Structured Light},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
TextureNet: Consistent Local Parametrizations for Learning From High-Resolution Signals on Meshes
Jingwei Huang, Haotian Zhang, Li Yi, Thomas Funkhouser, Matthias Niessner, Leonidas J. Guibas


We introduce TextureNet, a neural network architecture designed to extract features from high-resolution signals associated with 3D surface meshes (e.g., color texture maps). The key idea is to utilize a 4-rotational symmetric (4-RoSy) field to define a domain for convolution on a surface. Though 4-RoSy fields have several properties favorable for convolution on surfaces (low distortion, few singularities, consistent parameterization, etc.), orientations are ambiguous up to 4-fold rotation at any sample point. So, we introduce a new convolutional operator invariant to the 4-RoSy ambiguity and use it in a network to extract features from high-resolution signals on geodesic neighborhoods of a surface. In comparison to alternatives, such as PointNet-based methods which lack a notion of orientation, the coherent structure given by these neighborhoods results in significantly stronger features. As an example application, we demonstrate the benefits of our architecture for 3D semantic segmentation of textured 3D meshes. The results show that our method outperforms all existing methods in terms of mean IoU by a significant margin in both geometry-only (6.4%) and RGB+Geometry (6.9-8.2%) settings.
[extract] [surface, geodesic, field, point, computer, orientation, tangent, consistent, local, shape, parameterization, direction, vision, texturenet, scannet, coordinate, pattern, associated, approach, geometry, define, ambiguity, rgb, neighborhood, extrinsic, mesh, international, compute, principal, normal, volume, directly, quadriflow, distortion, scene, splatnet, intrinsic] [figure, texture, patch, ieee, conference, input, color, method, image, face, high, extracted, based] [convolution, convolutional, network, table, neural, deep, operator, architecture, better, aggregation, higher] [sampled, orange, arxiv, preprint] [semantic, segmentation, feature, iou] [sample, learning, set, class, data, cross, euclidean]
@InProceedings{Huang_2019_CVPR,
  author = {Huang, Jingwei and Zhang, Haotian and Yi, Li and Funkhouser, Thomas and Niessner, Matthias and Guibas, Leonidas J.},
  title = {TextureNet: Consistent Local Parametrizations for Learning From High-Resolution Signals on Meshes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PlaneRCNN: 3D Plane Detection and Reconstruction From a Single Image
Chen Liu, Kihwan Kim, Jinwei Gu, Yasutaka Furukawa, Jan Kautz


This paper proposes a deep neural architecture, PlaneRCNN, that detects and reconstructs piecewise planar regions from a single RGB image. PlaneRCNN employs a variant of Mask R-CNN to detect planes with their plane parameters and segmentation masks. PlaneRCNN then refines an arbitrary number of segmentation masks with a novel loss enforcing consistency with a nearby view during training. The paper also presents a new benchmark with more fine-grained plane segmentations in the ground truth, on which PlaneRCNN outperforms existing state-of-the-art methods by significant margins in the plane detection, segmentation, and reconstruction metrics. PlaneRCNN makes an important step towards robust plane extraction, which would have an immediate impact on a wide range of applications including Robotics, Augmented Reality, and Virtual Reality.
[warping, nearby, jointly, dataset, outperforms, current, recognition, joint] [plane, planar, computer, depthmap, depth, scene, vision, pattern, planercnn, reconstruction, view, single, planenet, normal, international, planerecover, indoor, rgb, estimate, scannet, point, camera, geometric, surface, estimation, left, defined] [conference, ieee, image, figure, pixel, input, layered, reconstructed, consistency, arbitrary] [network, neural, deep, number, accuracy, small, convolutional, inference, proposes] [simple, model, generate] [segmentation, detection, refinement, mask, object, piecewise, module, semantic, anchor, three, benchmark, offset, instance, regression, threshold, parsing, detect, bounding, region] [loss, paper, training, unseen, existing, trained, optimizes, learning]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Chen and Kim, Kihwan and Gu, Jinwei and Furukawa, Yasutaka and Kautz, Jan},
  title = {PlaneRCNN: 3D Plane Detection and Reconstruction From a Single Image},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Occupancy Networks: Learning 3D Reconstruction in Function Space
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, Andreas Geiger


With the advent of deep neural networks, learning-based approaches for 3D reconstruction have gained popularity. However, unlike for images, in 3D there is no canonical representation which is both computationally and memory efficient yet allows for representing high-resolution geometry of arbitrary topology. Many of the state-of-the-art learning-based 3D reconstruction approaches can hence only represent very coarse 3D geometry or are limited to a restricted domain. In this paper, we propose Occupancy Networks, a new representation for learning-based 3D reconstruction methods. Occupancy networks implicitly represent the 3D surface as the continuous decision boundary of a deep neural network classifier. In contrast to existing approaches, our representation encodes a description of the 3D output at infinite resolution without excessive memory footprint. We validate that our representation can efficiently encode 3D structure and can be inferred from various kinds of input. Our experiments demonstrate competitive results, both qualitatively and quantitatively, for the challenging tasks of 3D reconstruction from single images, noisy point clouds and coarse discrete voxel grids. We believe that occupancy networks will become a useful tool in a wide variety of learning-based 3D tasks.
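The core representation can be sketched as a network that maps a 3D point plus a conditioning code to an occupancy probability, with the surface defined by the 0.5 decision boundary. The architecture below is a simplified stand-in, and the mesh extraction comment only gestures at the paper's multi-resolution isosurface extraction.

```python
import torch
import torch.nn as nn

class OccupancyNetwork(nn.Module):
    def __init__(self, code_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, points, code):
        """points: (B, N, 3) query locations, code: (B, code_dim) from an image,
        point cloud, or voxel encoder. Returns occupancy probabilities (B, N)."""
        code = code.unsqueeze(1).expand(-1, points.size(1), -1)
        return torch.sigmoid(self.net(torch.cat([points, code], dim=-1))).squeeze(-1)

# To obtain geometry, query a dense grid of points, threshold at 0.5, and run
# marching cubes on the resulting occupancy volume (e.g., skimage.measure.marching_cubes).
```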
[recognition, extract] [occupancy, mesh, computer, vision, point, voxel, reconstruction, pattern, international, surface, single, approach, shape, geometry, normal, ground, truth, well, psgn, shapenet, algorithm, additional, marching, volumetric, volume, atlasnet, limited, continuous, isosurface, oij, note, kitti, contrast] [ieee, input, image, method, resolution, generative, consistency, high, figure, latent, qualitative, comparison, real, based] [network, neural, deep, output, table, processing, number, convolutional, apply] [model, memory, represent, evaluate, generating, generate, introduced] [iou, object, grid, final, coarse] [representation, learning, sampling, distance, training, space, set, function, test, trained, observe, existing]
@InProceedings{Mescheder_2019_CVPR,
  author = {Mescheder, Lars and Oechsle, Michael and Niemeyer, Michael and Nowozin, Sebastian and Geiger, Andreas},
  title = {Occupancy Networks: Learning 3D Reconstruction in Function Space},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
3D Shape Reconstruction From Images in the Frequency Domain
Weichao Shen, Yunde Jia, Yuwei Wu


Reconstructing the high-resolution volumetric 3D shape from images is challenging due to the cubic growth of computational cost. In this paper, we propose a Fourier-based method that reconstructs a 3D shape from images in a 2D space by predicting slices in the frequency domain. According to the Fourier slice projection theorem, we introduce a thickness map to bridge the domain gap between images in the spatial domain and slices in the frequency domain. The thickness map is the 2D spatial projection of the 3D shape, which is easily predicted from the input image by a general convolutional neural network. Each slice in the frequency domain is the Fourier transform of the corresponding thickness map. All slices constitute a 3D descriptor and the 3D shape is the inverse Fourier transform of the descriptor. Using slices in the frequency domain, our method can transfer the 3D shape reconstruction from the 3D space into the 2D space, which significantly reduces the computational cost. The experiment results on the ShapeNet dataset demonstrate that our method achieves competitive reconstruction accuracy and computational efficiency compared with the state-of-the-art reconstruction methods.
[predicting, predict, dataset, fedge, recognition, considering] [shape, reconstruction, thickness, fourier, slice, projection, computer, theorem, silhouette, vision, volumetric, corresponding, inverse, ground, axis, ogn, dense, radon, surface, pattern, shapenet, local, voxel, fsil, single, june, rgb, octree, predicts, accurate] [method, image, frequency, transform, resolution, reconstructed, conference, input, ieee, figure, high, reconstruct, proposed] [network, computational, deep, neural, accuracy, convolutional, table, number, compared, cost, achieves, achieve, output, size, efficient, smaller, architecture, reduce, designed, calculate] [memory, simple, introduce, introduced] [map, spatial, predicted, three, edge, global, final, object, propose, iou] [domain, space, loss, function, gap, set, learning, selected]
@InProceedings{Shen_2019_CVPR,
  author = {Shen, Weichao and Jia, Yunde and Wu, Yuwei},
  title = {3D Shape Reconstruction From Images in the Frequency Domain},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SiCloPe: Silhouette-Based Clothed People
Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, Shigeo Morishima


We introduce a new silhouette-based representation for modeling clothed human bodies using deep generative models. Our method can reconstruct a complete and textured 3D model of a person wearing clothes from a single input picture. Inspired by the visual hull algorithm, our implicit representation uses 2D silhouettes and 3D joints of a body pose to describe the immense shape complexity and variations of clothed people. Given a segmented 2D silhouette of a person and its inferred 3D joints from the input picture, we first synthesize consistent silhouettes from novel view points around the subject. The synthesized silhouettes which are the most consistent with the input segmentation are fed into a deep visual hull algorithm for robust 3D shape prediction. We then infer the texture of the subject's back view using the frontal image and segmentation mask as input to a conditional generative adversarial network. Our experiments demonstrate that our silhouette-based model is an effective representation and the appearance of the back view can be predicted reliably using an image-to-image translation network. While classic methods based on parametric models often fail for single-view images of subjects with challenging clothing, our approach can still produce successful results, which are comparable to those obtained from multi-view input.
[human, predict, joint, huang, capture, dataset, video] [hull, view, pose, silhouette, computer, body, shape, reconstruction, single, vision, pattern, clothed, geometry, estimation, algorithm, camera, textured, approach, well, voxel, mesh, volumetric, additional, corresponding, consistent, parametric, error, angle, international] [input, image, conference, ieee, synthesis, figure, based, method, texture, acm, frontal, reconstructed, conditional, translation, color, generative, reconstruct, ladv] [deep, network, inference, neural, number, output] [visual, model, adversarial, generate, greedy, sampled, complete, candidate, arxiv] [person, propose, european, segmentation, fully] [loss, training, sampling, representation, learning, novel, train, target, data, set, naive]
@InProceedings{Natsume_2019_CVPR,
  author = {Natsume, Ryota and Saito, Shunsuke and Huang, Zeng and Chen, Weikai and Ma, Chongyang and Li, Hao and Morishima, Shigeo},
  title = {SiCloPe: Silhouette-Based Clothed People},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Detailed Human Shape Estimation From a Single Image by Hierarchical Mesh Deformation
Hao Zhu, Xinxin Zuo, Sen Wang, Xun Cao, Ruigang Yang


This paper presents a novel framework to recover detailed human body shapes from a single image. It is a challenging task due to factors such as variations in human shapes, body poses, and viewpoints. Prior methods typically attempt to recover the human body shape using a parametric template that lacks surface details. As such, the resulting body shape appears to be without clothing. In this paper, we propose a novel learning-based framework that combines the robustness of a parametric model with the flexibility of free-form 3D deformation. We use deep neural networks to refine the 3D shape in a Hierarchical Mesh Deformation (HMD) framework, utilizing the constraints from body joints, silhouettes, and per-pixel shading information. We are able to restore detailed human body shapes beyond skinned models. Experiments demonstrate that our method has outperformed previous state-of-the-art approaches, achieving better accuracy in terms of both 2D IoU number and 3D metric distance. The code is available at https://github.com/zhuhao-nju/hmd.git.
[human, joint, dataset, predict, recognition, prediction, hmr, framework, motion] [shape, body, mesh, vision, pose, depth, computer, ground, truth, smpl, single, deformation, silhouette, surface, pattern, error, vertex, view, deform, parametric, hmd, international, volumetric, michael, projected, position, syn, estimation, initial, project, reconstruction, corresponding] [image, conference, ieee, input, method, figure, based, recovered, handle, recover, result, proposed, recovery, wild, side, recon] [network, deep, neural, better, table, full, accuracy, performance, convolutional] [model, vector] [anchor, detailed, predicted, three, iou, map, hierarchical, refinement, comparing] [source, learning, train, datasets, loss, novel, training, large, deformed, set, data]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Hao and Zuo, Xinxin and Wang, Sen and Cao, Xun and Yang, Ruigang},
  title = {Detailed Human Shape Estimation From a Single Image by Hierarchical Mesh Deformation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Convolutional Mesh Regression for Single-Image Human Shape Reconstruction
Nikos Kolotouros, Georgios Pavlakos, Kostas Daniilidis


This paper addresses the problem of 3D human pose and shape estimation from a single image. Previous approaches consider a parametric model of the human body, SMPL, and attempt to regress the model parameters that give rise to a mesh consistent with image evidence. This parameter regression has been a very challenging task, with model-based approaches underperforming compared to nonparametric solutions in terms of pose estimation. In our work, we propose to relax this heavy reliance on the model's parameter space. We still retain the topology of the SMPL template mesh, but instead of predicting model parameters, we directly regress the 3D location of the mesh vertices. This is a heavy task for a typical network, but our key insight is that the regression becomes significantly easier using a Graph-CNN. This architecture allows us to explicitly encode the template mesh structure within the network and leverage the spatial locality the mesh has to offer. Image-based features are attached to the mesh vertices and the Graph-CNN is responsible for processing them on the mesh structure, while the regression target for each vertex is its 3D location. Having recovered the complete 3D geometry of the mesh, if we still require a specific model parametrization, this can be reliably regressed from the vertex locations. We demonstrate the flexibility and the effectiveness of our proposed graph-based mesh regression by attaching different types of features on the mesh vertices. In all cases, we outperform the comparable baselines relying on model parameter regression, while we also achieve state-of-the-art results among model-based pose estimation approaches.
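The graph convolution on the template mesh can be sketched as vertex features mixed through a row-normalized adjacency matrix and a learned linear map, with a final per-vertex regression to a 3D location. The adjacency and feature sizes below are toy stand-ins, not the SMPL template.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        deg = adj.sum(dim=1, keepdim=True).clamp_min(1.0)
        self.register_buffer("adj_norm", adj / deg)      # row-normalized adjacency

    def forward(self, x):                                 # x: (B, V, in_dim)
        return torch.relu(self.linear(self.adj_norm @ x))

V = 512                                                   # toy vertex count (the SMPL template has 6890)
adj = (torch.rand(V, V) < 0.02).float()
adj = ((adj + adj.t()) > 0).float()                       # symmetric toy adjacency
feats = torch.randn(2, V, 259)                            # e.g. an image feature plus 3D coordinate per vertex
hidden = GraphConv(259, 128, adj)(feats)                  # (2, V, 128)
hidden = GraphConv(128, 64, adj)(hidden)                  # (2, V, 64)
vertices = nn.Linear(64, 3)(hidden)                       # (2, V, 3) regressed vertex locations
```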
[human, graph, prediction, multiple, previous, explicitly, focus, joint, work, dataset, series, outperforms] [pose, shape, mesh, smpl, parametric, estimation, regress, approach, vertex, regressed, body, directly, single, rgb, reconstruction, template, ground, michael, truth, georgios, estimating, monocular, regressing, error, nonparametric, form, rotation, smplify, volumetric, kostas, well, defined, direct] [input, image, figure, comparison, demonstrate, proposed, recover] [network, parameter, convolutional, structure, output, table, deep, architecture, typical, original, connected, mlp, compared, processing] [model, evaluation, complete] [regression, cnn, segmentation, feature, propose, semantic, challenging, fully, predicted] [training, representation, learning, task, target, train, data, space]
@InProceedings{Kolotouros_2019_CVPR,
  author = {Kolotouros, Nikos and Pavlakos, Georgios and Daniilidis, Kostas},
  title = {Convolutional Mesh Regression for Single-Image Human Shape Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions
Bugra Tekin, Federica Bogo, Marc Pollefeys


We present a unified framework for understanding 3D hand and object interactions in raw image sequences from egocentric RGB cameras. Given a single RGB image, our model jointly estimates the 3D hand and object poses, models their interactions, and recognizes the object and action classes with a single feed-forward pass through a neural network. We propose a single architecture that does not rely on external detection algorithms but rather is trained end-to-end on single images. We further merge and propagate information in the temporal domain to infer interactions between hand and object trajectories and recognize actions. The complete model takes as input a sequence of frames and outputs per-frame 3D hand and object pose predictions along with the estimates of object and action categories for the entire sequence. We demonstrate state-of-the-art performance of our algorithm even in comparison to the approaches that work on depth data and ground-truth annotations.
[action, recognition, egocentric, interaction, jointly, temporal, joint, activity, dataset, predict, framework, recognize, motion, rnn, work, ose, tracking, human, recognizing, explicitly, modeling, sequence, individual] [hand, pose, estimation, depth, single, confidence, approach, estimate, bject, rgb, problem, predicts, corresponding, rely, simultaneously, rigid] [input, method, image, color, control, demonstrate, proposed, based, figure] [network, cell, neural, accuracy, output, table, performance, full] [model, reasoning, visual, understanding, pass, evaluate] [object, bounding, grid, propose] [unified, learning, class, train, trained, target, distance, data, large, training]
@InProceedings{Tekin_2019_CVPR,
  author = {Tekin, Bugra and Bogo, Federica and Pollefeys, Marc},
  title = {H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning the Depths of Moving People by Watching Frozen People
Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, William T. Freeman


We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving. Existing methods for recovering depth for dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and may only recover sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a new source of data: thousands of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a hand-held camera tours the scene. Since the people are stationary, training data can be created from these videos using multi-view stereo reconstruction. At inference time, our method uses motion parallax cues from the static areas of the scenes, and shows clear improvement over state-of-the-art monocular depth prediction methods. We demonstrate our method on real-world sequences of complex human actions captured by a moving hand-held camera, and show various 3D effects produced using our predicted depth.
[human, people, motion, recognition, dynamic, moving, flow, video, prediction, dataset, optical, static, complex, consists, frame] [depth, camera, vision, computer, pattern, scene, monocular, single, dense, confidence, internet, parallax, initial, estimated, error, indoor, view, rgb, estimation, demon, stereo, computed, rgbd, relative, estimate, accurate, ground, idpp, dorn, mannequinchallenge, approach, sfm, pose, valid] [input, image, method, figure, reference, synthetic, based] [network, dpp, full, inference, performance, deep, accuracy] [model, natural, visual] [map, european, predicted, challenging, mask, object] [learning, data, trained, training, source, set, train, unsupervised]
@InProceedings{Li_2019_CVPR,
  author = {Li, Zhengqi and Dekel, Tali and Cole, Forrester and Tucker, Richard and Snavely, Noah and Liu, Ce and Freeman, William T.},
  title = {Learning the Depths of Moving People by Watching Frozen People},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Extreme Relative Pose Estimation for RGB-D Scans via Scene Completion
Zhenpei Yang, Jeffrey Z. Pan, Linjie Luo, Xiaowei Zhou, Kristen Grauman, Qixing Huang


Estimating the relative rigid pose between two RGB-D scans of the same underlying environment is a fundamental problem in computer vision, robotics, and computer graphics. Most existing approaches allow only limited maximum relative pose changes since they require considerable overlap between the input scans. We introduce a novel approach that extends the scope to extreme relative poses, with little or even no overlap between the input scans. The key idea is to infer more complete scene information about the underlying environment and match on the completed scans. In particular, instead of only performing scene completion from each individual scan, our approach alternates between relative pose estimation and scene completion. This allows us to perform scene completion by utilizing information from both input scans at late iterations, resulting in better results for both scene completion and relative pose estimation. Experimental results on benchmark datasets show that our approach leads to considerable improvements over state-of-the-art approaches for relative pose estimation. In particular, our approach provides encouraging relative pose estimates even between non-overlapping scans.
[recurrent, perform, performing, second, greg, current, extract, combining] [relative, pose, approach, completion, scan, computer, matching, estimation, scene, vision, robust, rotation, rigid, pattern, depth, registration, define, underlying, indoor, international, reconstruction, error, camera, thomas, estimating, problem, note, compute, single, coordinate, fundamental, consistent, geometry, suncg, correspondence, normal, point] [input, conference, ieee, figure, spectral, completed, consistency, method, proposed, image, color] [network, overlap, performance, deep, neural, better, small] [complete, room, transformed] [module, feature, semantic, object, baseline, overlapping, extreme, three] [learning, pairwise, representation, training, experimental, pair, set, existing, train, loss]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Zhenpei and Pan, Jeffrey Z. and Luo, Linjie and Zhou, Xiaowei and Grauman, Kristen and Huang, Qixing},
  title = {Extreme Relative Pose Estimation for RGB-D Scans via Scene Completion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Skeleton-Bridged Deep Learning Approach for Generating Meshes of Complex Topologies From Single RGB Images
Jiapeng Tang, Xiaoguang Han, Junyi Pan, Kui Jia, Xin Tong


This paper focuses on the challenging task of learning 3D object surface reconstructions from single RGB images. Existing methods achieve varying degrees of success by using different geometric representations. However, they all have their own drawbacks, and cannot well reconstruct those surfaces of complex topologies. To this end, we propose in this paper a skeleton-bridged, stage-wise learning approach to address the challenge. Our use of skeleton is due to its nice property of topology preservation, while being of lower complexity to learn. To learn skeleton from an input image, we design a deep architecture whose decoder is based on a novel design of parallel streams respectively for synthesis of curve- and surface-like skeleton points. We use different shape representations of point cloud, volume, and mesh in our stage-wise learning, in order to take their respective advantages. We also propose multi-stage use of the input image to correct prediction errors that are possibly accumulated in each stage. We conduct intensive experiments to investigate the efficacy of our proposed approach. Qualitative and quantitative results on representative object categories of both simple and complex topologies demonstrate the superiority of our approach over existing ones. We will make our ShapeNet-Skeleton dataset publicly available.
[skeleton, complex, dataset, graph, prediction] [mesh, shape, volume, point, approach, surface, skeletal, computer, single, vision, rgb, curskenet, local, pattern, fitting, topology, volumetric, atlasnet, geometric, cloud, directly, pipeline, defined, laplacian, psg, international] [input, image, method, conference, based, synthesis, proposed, ieee, figure, conduct, quantitative] [neural, deep, network, convolutional, inference, design, firstly, output, architecture, complexity, approximate, structure, better, lower] [generate, generating, model, generation, decoder, correct, generated] [object, fig, global, refinement, three, stage, final, cnn, propose, coarse] [learning, base, representation, existing, set, training, learn, trained, loss, train, task]
@InProceedings{Tang_2019_CVPR,
  author = {Tang, Jiapeng and Han, Xiaoguang and Pan, Junyi and Jia, Kui and Tong, Xin},
  title = {A Skeleton-Bridged Deep Learning Approach for Generating Meshes of Complex Topologies From Single RGB Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Structure-And-Motion-Aware Rolling Shutter Correction
Bingbing Zhuang, Quoc-Huy Tran, Pan Ji, Loong-Fah Cheong, Manmohan Chandraker


An exact method of correcting the rolling shutter (RS) effect requires recovering the underlying geometry, i.e. the scene structures and the camera motions between scanlines or between views. However, the multiple-view geometry for RS cameras is much more complicated than its global shutter (GS) counterpart, with various degeneracies. In this paper, we first make a theoretical contribution by showing that RS two-view geometry is degenerate in the case of pure translational camera motion. In view of the complex RS geometry, we then propose a Convolutional Neural Network (CNN)-based method which learns the underlying geometry (camera motion and scene structure) from just a single RS image and performs RS image correction. We call our method structure-and-motion-aware RS correction because it reasons about the concealed motions between the scanlines as well as the scene structure. Our method learns from a large-scale dataset synthesized in a geometrically meaningful way where the RS effect is generated in a manner consistent with the camera motion and scene structure. In extensive experiments, our method achieves superior performance compared to other state-of-the-art methods for single image RS correction and subsequent Structure from Motion (SfM) applications.
[motion, flow, velocity, work, prediction, perform] [camera, depth, scene, undistortion, pure, ground, truth, shutter, geometry, rolling, estimated, rotation, translational, point, scanline, resizing, single, distortion, rectified, exposure, approach, note, error, scanlines, estimation, pose, sfm, corresponding, degeneracy, stereo, kitti, cdf, smarsc, underlying, manhattan, geometric, ambiguity, degenerate, case, constant, supplementary] [image, method, correction, input, figure, translation, real, row, synthetic, pixel, qualitative, synthesized] [network, structure, small, performance, deep, conv, batchnorm, relu, convolutional, achieves, compared] [model, generation, generate, identify, correct, vector] [map, global] [training, data, set, trained, learning]
@InProceedings{Zhuang_2019_CVPR,
  author = {Zhuang, Bingbing and Tran, Quoc-Huy and Ji, Pan and Cheong, Loong-Fah and Chandraker, Manmohan},
  title = {Learning Structure-And-Motion-Aware Rolling Shutter Correction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation
Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, Hujun Bao


This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise vectors pointing to the keypoints and use these vectors to vote for keypoint locations. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation. We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation. The code is available at https://zju3dv.github.io/pvnet/.
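As a rough illustration of the voting idea (not the authors' exact implementation), the sketch below hypothesizes a keypoint by intersecting the rays of two randomly chosen pixels and keeps the hypothesis supported by the most pixels; all names and thresholds are illustrative.

import numpy as np

def vote_keypoint(coords, vectors, n_hyp=128, cos_thresh=0.99, rng=np.random):
    # coords: (N, 2) pixel locations, vectors: (N, 2) predicted unit directions
    # pointing from each pixel towards one keypoint.
    best_hyp, best_votes = None, -1
    for _ in range(n_hyp):
        i, j = rng.choice(len(coords), 2, replace=False)
        A = np.stack([vectors[i], -vectors[j]], axis=1)
        if abs(np.linalg.det(A)) < 1e-6:
            continue                                   # near-parallel rays, skip
        t = np.linalg.solve(A, coords[j] - coords[i])
        hyp = coords[i] + t[0] * vectors[i]            # ray intersection = keypoint hypothesis
        d = hyp[None, :] - coords
        d /= np.linalg.norm(d, axis=1, keepdims=True) + 1e-8
        votes = int(((d * vectors).sum(axis=1) > cos_thresh).sum())
        if votes > best_votes:
            best_hyp, best_votes = hyp, votes
    return best_hyp

The spread of the voting pixels around the winning hypothesis is what provides the keypoint uncertainty mentioned in the abstract.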
[dataset, predicting, considering] [pose, keypoint, keypoints, linemod, estimation, pnp, occlusion, pvnet, projection, approach, truncation, robust, estimated, eggbox, glue, single, rgb, directly, occluded, estimate, rotation, compute, dense, local, algorithm, surface, duck, driller, problem, ape, holepuncher, tekin, column] [method, image, figure, proposed, based, input, pixel, result, vote] [table, network, truncated, fps, performance, cnns, unit, compare] [model, represent, probability, vector, robustness] [object, voting, detection, bounding, feature, detect, semantic, box, cluttered, spatial, final, localization, average] [set, representation, learning, distribution, distance, cat, selected, metric, datasets]
@InProceedings{Peng_2019_CVPR,
  author = {Peng, Sida and Liu, Yuan and Huang, Qixing and Zhou, Xiaowei and Bao, Hujun},
  title = {PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SelFlow: Self-Supervised Learning of Optical Flow
Pengpeng Liu, Michael Lyu, Irwin King, Jia Xu


We present a self-supervised learning approach for optical flow. Our method distills reliable flow estimations from non-occluded pixels, and uses these predictions as ground truth to learn optical flow for hallucinated occlusions. We further design a simple CNN to utilize temporal information from multiple frames for better flow estimation. These two principles lead to an approach that yields the best performance for unsupervised optical flow learning on the challenging benchmarks including MPI Sintel, KITTI 2012 and 2015. More notably, our self-supervised pre-trained model provides an excellent initialization for supervised fine-tuning. Our fine-tuned models achieve state-of-the-art results on all three datasets. At the time of writing, we achieve EPE=4.26 on the Sintel benchmark, outperforming all submitted methods.
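A highly simplified sketch of the distillation step described above: reliable (non-occluded) predictions from a teacher pass act as fixed targets for a student pass whose input contains synthetic occlusions. The masking and loss details here are assumptions for illustration, not the paper's exact formulation.

import torch

def self_supervision_loss(student_flow, teacher_flow, reliable_mask):
    # Supervise the student only where the teacher's flow is considered reliable
    # (non-occluded); teacher predictions are treated as constants.
    diff = (student_flow - teacher_flow.detach()).abs()
    return (diff * reliable_mask).sum() / (reliable_mask.sum() + 1e-8)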
[flow, optical, sintel, forward, backward, multiple, temporal, work, motion, warped] [occlusion, kitti, occluded, estimation, estimate, volume, approach, michael, photometric, ground, truth, accurate, note, initial, computer, thomas, reliable] [image, method, figure, reference, pixel, noise, row, based, synthetic, guide, input, comparison] [cost, accuracy, achieve, network, performance, achieves, table, cnns, better, convolutional] [model, random, visual, simple] [superpixel, map, feature, rectangle, improve, spatial, superpixels, propose, final, three, utilize, including] [learning, unsupervised, supervised, train, loss, training, data, target, datasets, learn, large, unlabeled, labeled, testing]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Pengpeng and Lyu, Michael and King, Irwin and Xu, Jia},
  title = {SelFlow: Self-Supervised Learning of Optical Flow},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Taking a Deeper Look at the Inverse Compositional Algorithm
Zhaoyang Lv, Frank Dellaert, James M. Rehg, Andreas Geiger


In this paper, we provide a modern synthesis of the classic inverse compositional algorithm for dense image alignment. We first discuss the assumptions made by this well-established technique, and subsequently propose to relax these assumptions by incorporating data-driven priors into this model. More specifically, we unroll a robust version of the inverse compositional algorithm and replace multiple components of this algorithm using more expressive models whose parameters we train in an end-to-end fashion from data. Our experiments on several challenging 3D rigid motion estimation tasks demonstrate the advantages of combining optimization with learning-based techniques, outperforming the classic inverse compositional algorithm as well as data-driven image-to-pose regression approaches.
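For reference, one iteration of the classic robust inverse compositional update that the paper unrolls can be written as

\Delta\xi = \big(J^\top W J\big)^{-1} J^\top W \, r(\xi), \qquad \xi \leftarrow \xi \circ \Delta\xi^{-1},

where r(\xi) is the residual between the template and the warped input, J is its Jacobian with respect to the warp parameters (precomputed on the template, which is what makes the method "inverse compositional"), and W is a robust weighting. It is fixed components of this kind (e.g., the weighting and damping) that the paper replaces with learned, more expressive modules.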
[motion, recognition, warp, warping, warped, multiple, work] [vision, computer, algorithm, robust, estimation, optimization, damping, inverse, international, direct, pose, pattern, matrix, camera, classical, icp, well, note, rigid, template, error, depth, provide, dense, problem, approach, relative, tum] [image, ieee, method, pixel] [network, deep, table, convolutional, original, neural, weight, residual, number, iteration, compared] [model, compositional, trust, evaluate, iterative, encoder, implemented] [feature, object, three, european, propose, region, pyramid, spatial, challenging, regression, fully] [learning, large, training, update, function, exploit, objective, min, learned, loss, learn, set, train, task, test]
@InProceedings{Lv_2019_CVPR,
  author = {Lv, Zhaoyang and Dellaert, Frank and Rehg, James M. and Geiger, Andreas},
  title = {Taking a Deeper Look at the Inverse Compositional Algorithm},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deeper and Wider Siamese Networks for Real-Time Visual Tracking
Zhipeng Zhang, Houwen Peng


Siamese networks have drawn great attention in visual tracking because of their balanced accuracy and speed. However, the backbone networks used in Siamese trackers are relatively shallow, such as AlexNet, which does not fully take advantage of the capability of modern deep neural networks. In this paper, we investigate how to leverage deeper and wider convolutional neural networks to enhance tracking robustness and accuracy. We observe that direct replacement of backbones with existing powerful architectures, such as ResNet and Inception, does not bring improvements. The main reasons are that 1) large increases in the receptive field of neurons lead to reduced feature discriminability and localization precision; and 2) the network padding for convolutions induces a positional bias in learning. To address these issues, we propose new residual modules to eliminate the negative impact of padding, and further design new architectures using these modules with controlled receptive field size and network stride. The designed architectures are lightweight and guarantee real-time tracking speed when applied to SiamFC and SiamRPN. Experiments show that solely due to the proposed network architectures, our SiamFC+ and SiamRPN+ obtain up to 9.8%/6.3% (AUC), 23.3%/8.8% (EAO) and 24.4%/25.0% (EAO) relative improvements over the original versions on the OTB-15, VOT-16 and VOT-17 datasets, respectively.
[tracking, work, internal, key] [computer, field, vision, pattern, depth, position, analysis] [image, conference, ieee, proposed, degradation, input, extracted, based, change, remove] [network, siamese, size, residual, receptive, siamfc, stride, relu, deeper, padding, performance, original, cir, conv, unit, designed, output, downsampling, deep, design, siamrpn, convolution, search, alexnet, block, correlation, neural, convolutional, resnet, table, addition, architecture, small, structure, number, tracker, layer, ratio, wide] [visual, model, inception] [feature, object, wider, backbone, spatial, cropping, european, localization, three, ablation] [target, exemplar, learning, large, set, maximum, bias, training]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Zhipeng and Peng, Houwen},
  title = {Deeper and Wider Siamese Networks for Real-Time Visual Tracking},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Self-Supervised Adaptation of High-Fidelity Face Models for Monocular Performance Tracking
Jae Shin Yoon, Takaaki Shiratori, Shoou-I Yu, Hyun Soo Park


Improvements in data-capture and face modeling techniques have enabled us to create high-fidelity realistic face models. However, driving these realistic face models requires special input data, e.g., 3D meshes and unwrapped textures. Also, these face models expect clean input data taken under controlled lab environments, which is very different from data collected in the wild. All these constraints make it challenging to use the high-fidelity models in tracking for commodity cameras. In this paper, we propose a self-supervised domain adaptation approach to enable the animation of high-fidelity face models from a commodity camera. Our approach first circumvents the requirement for special input data by training a new network that can directly drive a face model just from a single 2D image. Then, we overcome the domain mismatch between lab and uncontrolled environments by performing self-supervised domain adaptation based on "consecutive frame texture consistency", which assumes that the appearance of the face is consistent over consecutive frames, avoiding the necessity of modeling the new environment such as lighting or background. Experiments show that we are able to drive a high-fidelity face model to perform complex facial motion from a cellphone camera without requiring any labeled data from the new domain.
[tracking, temporal, consecutive, frame, video, modeling, perform, key, motion] [geometry, reprojection, monocular, pose, lighting, single, reconstruction, assumption, error, lab, directly, approach, consistent, shape, dense, view, regressed, note, camera] [face, facial, texture, method, figure, image, landmark, input, consistency, proposed, based, color, unwrapped, appearance, resolution, lcftc, captured, dam, controlled, realistic, latent, motc, lflrc, imagery, morphable, intermediate] [deep, performance, stability, mismatch, network, accuracy, vgg, neural] [model, environment, modality, requires, enable, visual] [head, predicted, supervision, score, average, ablation] [domain, adaptation, data, loss, training, representation, set, adapt, existing, learned]
@InProceedings{Yoon_2019_CVPR,
  author = {Shin Yoon, Jae and Shiratori, Takaaki and Yu, Shoou-I and Soo Park, Hyun},
  title = {Self-Supervised Adaptation of High-Fidelity Face Models for Monocular Performance Tracking},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Diverse Generation for Multi-Agent Sports Games
Raymond A. Yeh, Alexander G. Schwing, Jonathan Huang, Kevin Murphy


In this paper, we propose a new generative model for multi-agent trajectory data, focusing on the case of multi-player sports games. Our model leverages graph neural networks (GNNs) and variational recurrent neural networks (VRNNs) to achieve a permutation equivariant model suitable for sports. On two challenging datasets (basketball and soccer), we show that we are able to produce more accurate forecasts than previous methods. We assess accuracy using various metrics, such as log-likelihood and "best of N" loss, based on N different samples of the future. We also measure the distribution of statistics of interest, such as player location or velocity, and show that the distribution induced by our generative model better matches the empirical distribution of the test set. Finally, we show that our model can perform conditional prediction, which lets us answer counterfactual questions such as "how will the players move differently if A passes the ball to B instead of C?"
[graph, rnn, trajectory, modeling, basketball, report, recurrent, future, ordering, time, predict, state, dataset, interaction, velocity, previous, perform, forecasting, passing, predicting] [ground, truth, approach, permutation, compute, template, consistent] [generative, conditional, quantitative, latent, figure, based, prior, proposed] [neural, deep, stochastic, network, best, better, standard, outperform, compare, connected] [model, ball, variational, player, equivariant, agent, soccer, vrnn, node, random, offensive, evaluate, distributional, sampled, indicates, marginal, simple, observed, type, conditioned, generated, decoder, tree] [location, edge, average, propose, fully, baseline, predicted] [distribution, learning, test, data, sample, training, representation, observe, sampling, log]
@InProceedings{Yeh_2019_CVPR,
  author = {Yeh, Raymond A. and Schwing, Alexander G. and Huang, Jonathan and Murphy, Kevin},
  title = {Diverse Generation for Multi-Agent Sports Games},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Efficient Online Multi-Person 2D Pose Tracking With Recurrent Spatio-Temporal Affinity Fields
Yaadhav Raaj, Haroon Idrees, Gines Hidalgo, Yaser Sheikh


We present an online approach to efficiently and simultaneously detect and track 2D poses of multiple people in a video sequence. We build upon Part Affinity Field (PAF) representation designed for static images, and propose an architecture that can encode and predict Spatio-Temporal Affinity Fields (STAF) across a video sequence. In particular, we propose a novel temporal topology cross-linked across limbs which can consistently handle body motions of a wide range of magnitudes. Additionally, we make the overall approach recurrent in nature, where the network ingests STAF heatmaps from previous frames and estimates those for the current frame. Our approach uses only online inference and tracking, and is currently the fastest and the most accurate bottom-up approach that is runtime-invariant to the number of people in the scene and accuracy-invariant to input frame rate of camera. Running at ~30 fps on a single GPU at single scale, it achieves highly competitive results on the PoseTrack benchmarks.
[frame, tracking, tafs, video, pafs, previous, people, posetrack, taf, motion, multiple, temporal, human, recurrent, paf, limb, time, consists, online, current, mpii, mota, joint, flowtrack, dataset, work, second, flow] [pose, keypoints, estimation, approach, topology, single, keypoint, body, camera, well, articulated, computed] [figure, image, input, method] [network, number, rate, fps, inference, validation, accuracy, table, vgg, performance, computation, speed, competitive, convolutional, scale, better] [model, mode] [affinity, person, heatmaps, detection, map, spatial, three, coco, stage, module] [training, set, large, data, train]
@InProceedings{Raaj_2019_CVPR,
  author = {Raaj, Yaadhav and Idrees, Haroon and Hidalgo, Gines and Sheikh, Yaser},
  title = {Efficient Online Multi-Person 2D Pose Tracking With Recurrent Spatio-Temporal Affinity Fields},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
GFrames: Gradient-Based Local Reference Frame for 3D Shape Matching
Simone Melzi, Riccardo Spezialetti, Federico Tombari, Michael M. Bronstein, Luigi Di Stefano, Emanuele Rodola


We introduce GFrames, a novel local reference frame (LRF) construction for 3D meshes and point clouds. GFrames are based on the computation of the intrinsic gradient of a scalar field defined on top of the input shape. The resulting tangent vector field defines a repeatable tangent direction of the local frame at each point; importantly, it directly inherits the properties and invariance classes of the underlying scalar function, making it remarkably robust under strong sampling artifacts, vertex noise, as well as non-rigid deformations. Existing local descriptors can directly benefit from our repeatable frames, as we showcase in a selection of 3D vision and shape analysis applications where we demonstrate state-of-the-art performance in a variety of challenging settings.
[frame, human, key, dataset] [point, local, lrf, shape, matching, tangent, lrfs, repeatability, descriptor, error, intrinsic, repeatable, flare, robust, computer, geodesic, geometric, michael, surface, plane, dep, fied, defined, directly, sted, gauss, emanuele, vision, triangle, rigid, mesh, tosca, topological, construction, field, computed, registration, radius, ambiguity, axis, normal, luigi, direction, correspondence, compute, curvature] [figure, reference, based, method, conference, noise, proposed, synthetic] [scalar, gradient, top, deep, performance, denotes, gaussian] [vector, choice, sign, evaluate, manifold, robustness] [average, deformable, challenging, object, baseline] [shot, function, distance, learning, invariance, existing, sampling, task, dog, space, choosing, euclidean]
@InProceedings{Melzi_2019_CVPR,
  author = {Melzi, Simone and Spezialetti, Riccardo and Tombari, Federico and Bronstein, Michael M. and Di Stefano, Luigi and Rodola, Emanuele},
  title = {GFrames: Gradient-Based Local Reference Frame for 3D Shape Matching},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Eliminating Exposure Bias and Metric Mismatch in Multiple Object Tracking
Andrii Maksai, Pascal Fua


Identity Switching remains one of the main difficulties Multiple Object Tracking (MOT) algorithms have to deal with. Many state-of-the-art approaches now use sequence models to solve this problem but their training can be affected by biases that decrease their efficiency. In this paper, we introduce a new training procedure that confronts the algorithm with its own mistakes while explicitly attempting to minimize the number of switches, which results in better training. We propose an iterative scheme of building a rich training set and using it to learn a scoring function that is an explicit proxy for the target tracking metric. Whether using only simple geometric features or more sophisticated ones that also take appearance into account, our approach outperforms the state-of-the-art on several MOT benchmarks.
[tracking, idf, tracklets, multiple, sequence, trajectory, mota, tracklet, online, people, hypothesis, recurrent, social, prediction, longer, individual, current, long, future, track, merged, modeling, motion] [computer, approach, ground, pattern, vision, international, algorithm, exposure, truth, single, supplementary] [conference, appearance, identity, method, image, based, real, figure] [network, neural, number, best, mismatch, performance, deep, better, validation, pruning, batch] [model, procedure, arxiv, candidate, association, preprint, simple, machine, describe] [bounding, score, box, dukemtmc, object, scoring, feature, person, merging, detection] [training, data, set, learning, function, metric, loss, bias, target, prevent]
@InProceedings{Maksai_2019_CVPR,
  author = {Maksai, Andrii and Fua, Pascal},
  title = {Eliminating Exposure Bias and Metric Mismatch in Multiple Object Tracking},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Graph Convolutional Tracking
Junyu Gao, Tianzhu Zhang, Changsheng Xu


Tracking by siamese networks has achieved favorable performance in recent years. However, most existing siamese methods do not take full advantage of spatial-temporal target appearance modeling under different contextual situations. In fact, the spatial-temporal information can provide diverse features to enhance the target representation, and the context information is important for online adaption of target localization. To comprehensively leverage the spatial-temporal structure of historical target exemplars and benefit from the context information, in this work, we present a novel Graph Convolutional Tracking (GCT) method for high-performance visual tracking. Specifically, the GCT jointly incorporates two types of Graph Convolutional Networks (GCNs) into a siamese framework for target appearance modeling. Here, we adopt a spatial-temporal GCN to model the structured representation of historical target exemplars. Furthermore, a context GCN is designed to utilize the context of the current frame to learn adaptive features for target localization. Extensive results on 4 challenging benchmarks show that our GCT method performs favorably against state-of-the-art trackers while running around 50 frames per second.
[tracking, graph, gct, current, modeling, gcn, historical, tianzhu, video, changsheng, online, performs, jointly, temporal, framework, dynamic, auc, recognition] [robust, computer, analysis, pattern, provide] [appearance, image, figure, proposed, method, based, ieee] [siamese, convolutional, search, adaptive, performance, network, deep, correlation, siamfc, neural, tracker, structured, precision, structure, filter, layer, junyu, scale, achieves, ope, compared, eao, favorable] [visual, model, success, attention, adaption, machine] [context, feature, object, spatial, response, map, benchmark, threshold, utilize, localization, propose, score] [target, learning, exemplar, embedding, set, representation]
@InProceedings{Gao_2019_CVPR,
  author = {Gao, Junyu and Zhang, Tianzhu and Xu, Changsheng},
  title = {Graph Convolutional Tracking},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ATOM: Accurate Tracking by Overlap Maximization
Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, Michael Felsberg


While recent years have witnessed astonishing improvements in visual tracking robustness, the advancements in tracking accuracy have been limited. As the focus has been directed towards the development of powerful classifiers, the problem of accurate target state estimation has been largely overlooked. In fact, most trackers resort to a simple multi-scale search in order to estimate the target bounding box. We argue that this approach is fundamentally limited since target estimation is a complex task, requiring high-level knowledge about the object. We address this problem by proposing a novel tracking architecture, consisting of dedicated target estimation and classification components. High level knowledge is incorporated into the target estimation through extensive offline learning. Our target estimation component is trained to predict the overlap between the target object and an estimated bounding box. By carefully integrating target-specific information, our approach achieves previously unseen bounding box accuracy. We further introduce a classification component that is trained online to guarantee high discriminative power in the presence of distractors. Our final tracking framework sets a new state-of-the-art on five challenging benchmarks. On the new large-scale TrackingNet dataset, our tracker ATOM achieves a relative gain of 15% over the previous best approach, while running at over 30 FPS. Code and models are available at https://github.com/visionml/pytracking.
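Although the abstract does not spell out how the overlap predictor is used at test time, a natural way to exploit such a component is to refine a candidate box by gradient ascent on the predicted IoU. The sketch below uses illustrative names only; iou_net is assumed to map image features and a box to a scalar overlap estimate.

import torch

def refine_box(iou_net, features, box, steps=5, step_size=1.0):
    # Gradient ascent on the predicted IoU with respect to the box coordinates.
    box = box.clone().detach().requires_grad_(True)
    for _ in range(steps):
        iou = iou_net(features, box)       # scalar predicted overlap (assumed interface)
        iou.backward()
        with torch.no_grad():
            box += step_size * box.grad
            box.grad.zero_()
    return box.detach()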
[tracking, online, auc, state, prediction, predict, dataset, outperforms, current, perform] [estimation, approach, accurate, problem, estimate, single, algorithm, optimization, confidence, initial, estimated] [image, component, reference, based, figure, method, consisting, comparison, proposed] [network, overlap, gradient, correlation, modulation, architecture, convolutional, siamese, table, atom, offline, achieves, tracker, dasiamrpn, deep, prpool, updt, layer, conv, impact, precision, search, gain, scale, performed, block, size] [visual, success, model, vector, robustness, simple] [bounding, iou, box, object, feature, backbone, module, final, fully, score, branch, challenging] [target, classification, learning, trained, training, test, strategy, set, representation, negative, hard, sample, function, discriminative]
@InProceedings{Danelljan_2019_CVPR,
  author = {Danelljan, Martin and Bhat, Goutam and Shahbaz Khan, Fahad and Felsberg, Michael},
  title = {ATOM: Accurate Tracking by Overlap Maximization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Visual Tracking via Adaptive Spatially-Regularized Correlation Filters
Kenan Dai, Dong Wang, Huchuan Lu, Chong Sun, Jianhua Li


In this work, we propose a novel adaptive spatially-regularized correlation filters (ASRCF) model to simultaneously optimize the filter coefficients and the spatial regularization weight. First, this adaptive spatial regularization scheme could learn an effective spatial weight for a specific object and its appearance variations, and therefore result in more reliable filter coefficients during the tracking process. Second, our ASRCF model can be effectively optimized based on the alternating direction method of multipliers, where each subproblem has a closed-form solution. Third, our tracker applies two kinds of CF models to estimate the location and scale respectively. The location CF model exploits ensembles of shallow and deep features to determine the optimal position accurately. The scale CF model works on multi-scale shallow features to estimate the optimal scale efficiently. Extensive experiments on five recent benchmarks show that our tracker performs favorably against many state-of-the-art algorithms, with real-time performance of 28fps.
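One plausible way to write such a jointly-optimized objective, consistent with the abstract but not necessarily in the paper's exact notation, uses per-channel filters h_k, feature channels x_k, a desired response y, an adaptive spatial weight w regularized towards a reference weight w_r, and circular correlation *:

E(\{h_k\}, w) = \tfrac{1}{2}\Big\| y - \sum_{k} x_k * h_k \Big\|_2^2 + \tfrac{\lambda_1}{2} \sum_{k} \| w \odot h_k \|_2^2 + \tfrac{\lambda_2}{2} \| w - w_r \|_2^2.

As stated in the abstract, an objective of this form splits into subproblems with closed-form solutions and can be solved with ADMM.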
[tracking, term, dataset, tracked] [estimation, equation, optimal, optimization, fourier, solving, michael, reliable, estimate, robust, accurate, matrix] [method, figure, based, background, color] [scale, tracker, correlation, adaptive, regularization, filter, srdcf, bacf, precision, eco, asrcf, kcf, overlap, best, ope, deep, deepsrdcf, mdnet, lsart, staple, rate, shallow, performance, siamfc, cfnet, table, weight, dasiamrpn, siamrpn, speed, compare, efficient, achieves, lasot, dong, original, denotes] [model, success, visual, evaluation] [spatial, object, location, threshold, feature, huchuan, localization, boundary, response, score, baseline] [learning, learned, exploit, objective, effectively, learn, domain, function, update]
@InProceedings{Dai_2019_CVPR,
  author = {Dai, Kenan and Wang, Dong and Lu, Huchuan and Sun, Chong and Li, Jianhua},
  title = {Visual Tracking via Adaptive Spatially-Regularized Correlation Filters},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Tree Learning for Zero-Shot Face Anti-Spoofing
Yaojie Liu, Joel Stehouwer, Amin Jourabloo, Xiaoming Liu


Face anti-spoofing is designed to keep face recognition systems from recognizing fake faces as the genuine users. While advanced face anti-spoofing methods are developed, new types of spoof attacks are also being created and becoming a threat to all existing systems. We define the detection of unknown spoof attacks as Zero-Shot Face Anti-spoofing (ZSFA). Previous works of ZSFA only study 1-2 types of spoof attacks, such as print/replay attacks, which limits the insight of this problem. In this work, we expand the ZSFA problem to a wide range of 13 types of spoof attacks, including print attack, replay attack, 3D mask attacks, and so on. A novel Deep Tree Network (DTN) is proposed to tackle the ZSFA. The tree is learned to partition the spoof samples into semantic sub-groups in an unsupervised fashion. When a data sample arrives, whether it is a known or unknown attack, DTN routes it to the most similar spoof cluster and makes the binary decision. In addition, to enable the study of ZSFA, we introduce the first face anti-spoofing database that contains diverse types of spoof attacks. Experiments show that our proposed method achieves the state of the art on multiple testing protocols of ZSFA.
[routing, leaf, largest, video, recognition, multiple, work] [provide, define, compute, algorithm, general] [spoof, face, live, cru, makeup, print, proposed, replay, tru, method, zsfa, database, based, antispoofing, eye, sfl, dtn, collect, image, funny, acer, study, figure, spoofing, prior, half, silicone] [deep, network, binary, denotes, convolutional, unit, performance, rate, conv, convolution, standard, structure] [tree, node, attack, partial, model, unique, diverse, find, evaluation] [mask, feature, detection, semantic, detect, propose, utilize] [data, unknown, learning, paper, testing, loss, learn, function, set, unsupervised, sample, embedding, supervised, partition]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Yaojie and Stehouwer, Joel and Jourabloo, Amin and Liu, Xiaoming},
  title = {Deep Tree Learning for Zero-Shot Face Anti-Spoofing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ArcFace: Additive Angular Margin Loss for Deep Face Recognition
Jiankang Deng, Jia Guo, Niannan Xue, Stefanos Zafeiriou


One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that can enhance the discriminative power. Centre loss penalises the distance between deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in the angular space and therefore penalises the angles between deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to its exact correspondence to geodesic distance on a hypersphere. We present arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks which includes a new large-scale image database with trillions of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. To facilitate future research, the code has been made available.
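With features and class weights L2-normalized so that the logit for class j is the cosine of the angle θ_j between the feature and that class centre, the additive angular margin loss described above takes the standard form (s is a scale factor, m the angular margin, y_i the ground-truth class of sample i):

L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\,s\cos(\theta_{y_i} + m)}}{e^{\,s\cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{\,s\cos\theta_j}}

The margin m is added directly to the angle, which is what gives the loss its geodesic-distance interpretation on the hypersphere.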
[recognition, dataset, combined, combining] [angle, geodesic, corresponding] [face, figure, proposed, method, casia, image, based, comparison] [performance, deep, table, verification, penalty, network, weight, best, neural, connected, power, number, better, compared, achieves, convolutional] [model] [feature, identification, enhance, fully, roc] [arcface, margin, loss, softmax, training, angular, embedding, cosface, sphereface, target, set, additive, logit, distance, lfw, learning, discriminative, centre, compactness, large, data, trained, test, triplet, discrepancy, datasets, megaface, class, train, classification, function, ytf, cplfw, calfw, refers, representation, cosine, support, probe, positive, space]
@InProceedings{Deng_2019_CVPR,
  author = {Deng, Jiankang and Guo, Jia and Xue, Niannan and Zafeiriou, Stefanos},
  title = {ArcFace: Additive Angular Margin Loss for Deep Face Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Joint Gait Representation via Quintuplet Loss Minimization
Kaihao Zhang, Wenhan Luo, Lin Ma, Wei Liu, Hongdong Li


Gait recognition is an important biometric method popularly used in video surveillance, where the task is to identify people at a distance by their walking patterns from video sequences. Most of the current successful approaches for gait recognition either use a pair of gait images to form a cross-gait representation or rely on a single gait image for unique-gait representation. These two types of representations empirically complement one another. In this paper, we propose a new Joint Unique-gait and Cross-gait Network (JUCNet), to combine the advantages of unique-gait representation with that of cross-gait representation, leading to significantly improved performance. Another key contribution of this paper is a novel quintuplet loss function, which simultaneously increases the inter-class differences by pushing representations extracted from different subjects apart and decreases the intra-class variations by pulling representations extracted from the same subject together. Experiments show that our method achieves the state-of-the-art performance tested on standard benchmark datasets, demonstrating its superiority over existing methods.
[gait, jucnet, quintuplet, recognition, yasushi, learns, joint, dataset, subject, extract, human, second, jointly, concatenated, geis, walking, term, gei, work, multiple, uniquegait, crossgait, carrying] [analysis, single] [based, proposed, method, input, image, figure] [identical, convolutional, performance, deep, network, layer, table, deeper, accuracy, size, effectiveness, order, achieves, powerful, stride, validation, structure, binary] [model] [feature, clothing, three] [loss, learning, representation, set, pair, training, gallery, metric, distance, probe, learn, trained, function, classification, class, log, datasets, classifier, learned, hyperparameters, complement, discrepancy]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Kaihao and Luo, Wenhan and Ma, Lin and Liu, Wei and Li, Hongdong},
  title = {Learning Joint Gait Representation via Quintuplet Loss Minimization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Gait Recognition via Disentangled Representation Learning
Ziyuan Zhang, Luan Tran, Xi Yin, Yousef Atoum, Xiaoming Liu, Jian Wan, Nanxin Wang


Gait, the walking pattern of individuals, is one of the most important biometrics modalities. Most of the existing gait recognition methods take silhouettes or articulated body models as the gait features. These methods suffer from degraded recognition performance when handling confounding variables, such as clothing, carrying and view angle. To remedy this issue, we propose a novel AutoEncoder framework to explicitly disentangle pose and appearance features from RGB imagery and the LSTM-based integration of pose features over time produces the gait feature. In addition, we collect a Frontal-View Gait (FVG) dataset to focus on gait recognition from frontal-view walking, which is a challenging problem since it contains minimal gait cues compared to other views. FVG also includes other important variations, e.g., walking speed, carrying, and clothing. With extensive experiments on CASIA-B, USF and FVG datasets, our method demonstrates superior performance to the state of the art quantitatively, the ability of feature disentanglement qualitatively, and promising computational efficiency.
[gait, recognition, video, walking, lstm, fvg, gaitnet, lxrecon, time, session, carrying, subject, work, usf, human, yasushi, temporal, dataset] [pose, view, pattern, vision, computer, rgb, body, international, estimation, reconstruction, silhouette, approach] [appearance, image, method, disentangle, disentanglement, conference, database, ieee, identity, frontal, based, xiaoming, prior, input, comparison, extracted, biometrics, variation, proposed, face, disentangled, luan] [performance, accuracy, output, network, table, achieves, neural] [model, machine, visual] [feature, three, identification, propose, mask, average, final] [loss, learning, training, probe, similarity, representation, cross, gallery, test, classification, trained, novel, large, set]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Ziyuan and Tran, Luan and Yin, Xi and Atoum, Yousef and Liu, Xiaoming and Wan, Jian and Wang, Nanxin},
  title = {Gait Recognition via Disentangled Representation Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Reversible GANs for Memory-Efficient Image-To-Image Translation
Tycho F.A. van der Ouderaa, Daniel E. Worrall


The pix2pix and CycleGAN losses have vastly improved the qualitative and quantitative visual quality of results in image-to-image translation tasks. We extend this framework by exploring approximately invertible architectures which are well suited to these losses. These architectures are approximately invertible by design and thus partially satisfy cycle-consistency before training even begins. Furthermore, since invertible architectures have constant memory complexity in depth, these models can be built arbitrarily deep. We are able to demonstrate superior quantitative output on the Cityscapes and Maps datasets at near constant memory budget.
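The constant-memory argument rests on invertible blocks whose inputs can be recomputed exactly from their outputs, so intermediate activations need not be stored for backpropagation. A minimal additive-coupling sketch of that idea (not the paper's exact architecture) is:

import torch.nn as nn

class AdditiveCoupling(nn.Module):
    # Invertible by construction: inverse() recovers the input exactly from the output.
    def __init__(self, f, g):
        super().__init__()
        self.f, self.g = f, g      # arbitrary sub-networks acting on half the channels

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

Because such blocks partially satisfy cycle-consistency by design, they pair naturally with the CycleGAN-style losses discussed in the abstract.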
[dataset, perform, backward] [computer, rmse, constant, notice, inverse, depth, vision] [image, revgan, unpaired, paired, invertible, reversible, translation, cyclegan, quality, figure, input, lgan, lcgan, conference, quantitative, mapping, generator, generative, conditional, lcycle, decx, encx, photolabel, qualitative, separate, high] [neural, residual, performance, table, deep, complexity, network, deeper, convolutional, core, brain, standard, higher, parameter, better, low, usage, fixed, layer] [model, memory, adversarial, arxiv, preprint, visual, gans, evaluate, improved] [segmentation] [training, loss, task, train, trained, learning, test, datasets, learn, space, set, additive]
@InProceedings{Ouderaa_2019_CVPR,
  author = {van der Ouderaa, Tycho F.A. and Worrall, Daniel E.},
  title = {Reversible GANs for Memory-Efficient Image-To-Image Translation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Sensitive-Sample Fingerprinting of Deep Neural Networks
Zecheng He, Tianwei Zhang, Ruby Lee


Numerous cloud-based services are provided to help customers develop and deploy deep learning applications. When a customer deploys a deep learning model in the cloud and serves it to end-users, it is important to be able to verify that the deployed model has not been tampered with. In this paper, we propose a novel and practical methodology to verify the integrity of remote deep learning models, with only black-box access to the target models. Specifically, we define Sensitive-Sample fingerprints, which are a small set of human unnoticeable transformed inputs that make the model outputs sensitive to the model's parameters. Even small model changes can be clearly reflected in the model outputs. Experimental results on different types of model integrity attacks show that our proposed approach is both effective and efficient. It can detect model integrity breaches with high accuracy (>99.95%) and guaranteed zero false positives on all evaluated attacks. Meanwhile, it only requires up to 103X fewer model inferences, compared with non-sensitive samples.
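A crude way to search for such a sensitive sample is to ascend a differentiable proxy for output-to-parameter sensitivity with respect to the input. The PyTorch sketch below is illustrative only: it maximizes the squared norm of the parameter gradient of the summed output, which is a simplification of whatever objective the authors actually optimize.

import torch

def sensitivity(model, x):
    # Differentiable proxy: squared norm of d(sum of outputs)/d(parameters).
    out = model(x).sum()
    grads = torch.autograd.grad(out, list(model.parameters()), create_graph=True)
    return sum(g.pow(2).sum() for g in grads)

def make_sensitive_sample(model, x0, steps=200, lr=0.01):
    # Gradient ascent on sensitivity over the input, starting from a natural image x0.
    x = x0.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        loss = -sensitivity(model, x)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)             # keep the fingerprint a valid image
    return x.detach()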
[recognition, traffic] [algorithm, cloud, approach, corresponding, normal, case, international, note] [missing, proposed, figure, input, arbitrary, image, conference, verify, high, cover, real, method] [neural, deep, rate, dnn, output, number, original, network, small, table, compression, accuracy, selection, efficiency, achieve] [model, integrity, attack, poisoning, adversary, generated, sensitive, fingerprint, manc, trojan, machine, uncovered, generation, generate, sensitivity, correct, adversarial, provider, goal, random, transformed, compromised, required, consider, modification, arxiv, customer, natural, maximize, preprint] [detect, detection, activated] [sample, learning, set, data, function, maximum, training, select, target, specific, datasets, selected]
@InProceedings{He_2019_CVPR,
  author = {He, Zecheng and Zhang, Tianwei and Lee, Ruby},
  title = {Sensitive-Sample Fingerprinting of Deep Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Soft Labels for Ordinal Regression
Raul Diaz, Amit Marathe


Ordinal regression attempts to solve classification problems in which categories are not independent, but rather follow a natural order. It is crucial to classify each class correctly while learning adequate interclass ordinal relationships. We present a simple and effective method that constrains these relationships among categories by seamlessly incorporating metric penalties into ground truth label representations. This encoding allows deep neural networks to automatically learn intraclass and interclass relationships without any explicit modification of the network architecture. Our method converts data labels into soft probability distributions that pair well with common categorical loss functions such as cross-entropy. We show that this approach is effective by using off-the-shelf classification and segmentation networks in four wildly different scenarios: image quality ranking, age estimation, horizon line regression, and monocular depth estimation. We demonstrate that our general-purpose method is very competitive with respect to specialized approaches, and adapts well to a variety of different network architectures and metrics.
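The label encoding the abstract describes, turning a hard ordinal label into a soft distribution by passing negative metric penalties through a softmax, can be sketched as follows; the squared-error penalty is just one example of a metric penalty.

import numpy as np

def soft_ordinal_labels(true_rank, ranks, phi=lambda a, b: (a - b) ** 2):
    # Softmax over negative penalties between the true rank and every class rank;
    # the result replaces the one-hot target in a standard cross-entropy loss.
    penalties = np.array([phi(true_rank, r) for r in ranks], dtype=float)
    logits = -penalties
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Example: age estimation with integer ages 0..100
soft_target = soft_ordinal_labels(30, list(range(101)))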
[horizon, dataset, multiple, prediction, predict] [ordinal, depth, computer, vision, ground, truth, estimation, pattern, international, approach, monocular, respect, problem, kitti, equation, error, well, typically, squared, discrete, journal] [image, conference, ieee, age, method, input, figure, difference, quality, rating] [network, output, neural, deep, binary, better, table, convolutional, order, number, layer] [true, vector, argmax, expected, encoding, probability] [regression, semantic, fully, instance, category] [sord, classification, soft, learning, loss, metric, rank, log, label, set, test, class, silog, interclass, function, softmax, data, training, sid, domain, task, randomly, categorical, ranking]
@InProceedings{Diaz_2019_CVPR,
  author = {Diaz, Raul and Marathe, Amit},
  title = {Soft Labels for Ordinal Regression},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Local to Global Learning: Gradually Adding Classes for Training Deep Neural Networks
Hao Cheng, Dongze Lian, Bowen Deng, Shenghua Gao, Tao Tan, Yanlin Geng


We propose a new learning paradigm, Local to Global Learning (LGL), for Deep Neural Networks (DNNs) to improve the performance of classification problems. The core of LGL is to learn a DNN model from fewer categories (local) to more categories (global) gradually within the entire training set. LGL is most related to the Self-Paced Learning (SPL) algorithm but its formulation is different from SPL. SPL trains its data from simple to complex, while LGL from local to global. In this paper, we incorporate the idea of LGL into the learning objective of DNNs and explain why LGL works better from an information-theoretic perspective. Experiments on the toy data, CIFAR-10, CIFAR-100, and ImageNet dataset show that LGL outperforms the baseline and SPL-based algorithms.
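In schematic terms, the training schedule amounts to repeatedly enlarging the set of active classes and continuing training on the corresponding subset of the data. The sketch below uses illustrative names only and is not the authors' exact procedure.

def local_to_global_training(model, dataset_by_class, class_order, stages, train_epochs):
    # dataset_by_class: dict mapping class id -> list of (x, y) samples
    # class_order: order in which classes are introduced (local -> global)
    # stages: how many classes are active at each stage, e.g. [2, 5, 10]
    # train_epochs: caller-supplied routine doing ordinary supervised training on a subset
    for num_classes in stages:
        active = set(class_order[:num_classes])
        subset = [s for c in active for s in dataset_by_class[c]]
        train_epochs(model, subset)
    return model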
[dataset, time, framework, perform, performs] [algorithm, local, initial, computer, vision, stable] [conference, figure, traditional, method] [dnn, selection, neural, deep, table, weight, accuracy, number, gradually, performance, better, add, imagenet, called, network, rate, validation, fewer, denotes, layer, dnns, equal, proposes, convolutional] [model, arxiv, preprint, adding, explain] [baseline, global, three, propose] [learning, lgl, training, spl, loss, trained, data, cluster, strategy, select, classification, untrained, curriculum, function, softmax, transfer, large, train, set, objective, label, spld, inv, learn, knowledge, minimize, suppose, dissimilar]
@InProceedings{Cheng_2019_CVPR,
  author = {Cheng, Hao and Lian, Dongze and Deng, Bowen and Gao, Shenghua and Tan, Tao and Geng, Yanlin},
  title = {Local to Global Learning: Gradually Adding Classes for Training Deep Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
What Does It Mean to Learn in Deep Networks? And, How Does One Detect Adversarial Attacks?
Ciprian A. Corneanu, Meysam Madadi, Sergio Escalera, Aleix M. Martinez


The flexibility and high-accuracy of Deep Neural Networks (DNNs) has transformed computer vision. But, the fact that we do not know when a specific DNN will work and when it will fail has resulted in a lack of trust. A clear example is self-driving cars; people are uncomfortable sitting in a car driven by algorithms that may fail under some unknown, unpredictable conditions. Interpretability and explainability approaches attempt to address this by uncovering what a DNN models, i.e., what each node (cell) in the network represents and what images are most likely to activate it. This can be used to generate, for example, adversarial attacks. But these approaches do not generally allow us to determine where a DNN will succeed or fail and why, i.e., does this learned representation generalize to unseen samples? Here, we derive a novel approach to define what it means to learn in deep networks, and how to use this knowledge to detect adversarial attacks. We show how this defines the ability of a network to generalize to unseen testing samples and, most importantly, why this is the case.
[graph, work, second, complex] [topological, algorithm, computer, approach, define, functional, compute, pattern, vision, note, lack, problem, defined, defining, allows, international, analysis, local, solve, well] [figure, ieee, conference, fail] [network, number, dnn, deep, dnns, binary, neural, accuracy, lower, lenet, epoch, activation, density, larger, add, higher, processing, called, correlation, increase] [adversarial, betti, node, opo, interpretability, chain, permuted, arxiv, preprint, example, identify] [global, boundary, detect, object] [learning, training, learn, testing, set, space, unseen, defines, cav, clique, data, specific, generalize, large, learned, ona, representation, examp]
@InProceedings{Corneanu_2019_CVPR,
  author = {Corneanu, Ciprian A. and Madadi, Meysam and Escalera, Sergio and Martinez, Aleix M.},
  title = {What Does It Mean to Learn in Deep Networks? And, How Does One Detect Adversarial Attacks?},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Handwriting Recognition in Low-Resource Scripts Using Adversarial Learning
Ayan Kumar Bhunia, Abhirup Das, Ankan Kumar Bhunia, Perla Sai Raj Kishore, Partha Pratim Roy


Handwritten Word Recognition and Spotting is a challenging field dealing with handwritten text possessing irregular and complex shapes. The design of deep neural network models makes it necessary to extend training datasets in order to introduce variations and increase the number of samples; word-retrieval is therefore very difficult in low-resource scripts. Much of the existing literature comprises preprocessing strategies which are seldom sufficient to cover all possible variations. We propose an Adversarial Feature Deformation Module (AFDM) that learns ways to elastically warp extracted features in a scalable manner. The AFDM is inserted between intermediate layers and trained alternatively with the original framework, boosting its capability to better learn highly informative features rather than trivial ones. We test our meta-framework, which is built on top of popular word-spotting and word-recognition frameworks and enhanced by AFDM, not only on extensive Latin word datasets but also on sparser Indic scripts. We record results for varying sizes of training data, and observe that our enhanced network generalizes much better in the low-data regime; the overall word-error rates and mAP scores are observed to improve as well.
[recognition, framework, dataset, recurrent, complex, prediction] [deformation, well, pattern, analysis, corresponding, robust, limited, deform] [image, based, generator, proposed, transformation, tps, control, input, intermediate, high, generative] [network, neural, deep, original, convolutional, number, performance, layer, better, popular] [adversarial, word, handwritten, afdm, spotting, model, handwriting, arxiv, text, character, hwr, iam, preprint, observed, hws, phocnet, wer, ltask, crnn, qbs, machine, latin, indic, generate, visual] [feature, map, grid, module, spatial] [data, training, task, learning, augmentation, datasets, loss, learn, large, generalize, hard, train, testing, label, set, function, existing, trained, space, strategy]
@InProceedings{Bhunia_2019_CVPR,
  author = {Kumar Bhunia, Ayan and Das, Abhirup and Kumar Bhunia, Ankan and Sai Raj Kishore, Perla and Pratim Roy, Partha},
  title = {Handwriting Recognition in Low-Resource Scripts Using Adversarial Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Adversarial Defense Through Network Profiling Based Path Extraction
Yuxian Qiu, Jingwen Leng, Cong Guo, Quan Chen, Chao Li, Minyi Guo, Yuhao Zhu


Recently, researchers have started decomposing deep neural network models according to their semantics or functions. Recent work has shown the effectiveness of decomposed functional blocks for defending against adversarial attacks, which add a small perturbation to the input image to fool the DNN models. This work proposes a profiling-based method to decompose DNN models into different functional blocks, which leads to the effective path as a new approach to exploring DNNs' internal organization. Specifically, per-image effective paths can be aggregated into a class-level effective path, through which we observe that adversarial images activate effective paths different from those of normal images. We propose an effective-path-similarity-based method to detect adversarial images with an interpretable model, which achieves better accuracy and broader applicability than the state-of-the-art technique.
[work, auc, extract] [linear, normal, approach, well, degree, computer, equation, international, analysis] [image, based, input, method, study, prior, patch, conference, figure, high, forest] [effective, neural, layer, accuracy, dnn, weight, network, deep, rate, imagenet, neuron, number, alexnet, process, output, density, convolutional, size, impact, larger, achieves, small, processing] [path, adversarial, model, attack, random, bim, deepfool, cdrp, fgsm, jsma, defense, synapse, perturbation, critical, cdrps, profiling, fool, specialization, targeted, activate, find, requires, calculated, indicates, true] [detection, detect, propose, extraction, predicted, false, detector] [similarity, training, set, learning, positive, class, classification]
@InProceedings{Qiu_2019_CVPR,
  author = {Qiu, Yuxian and Leng, Jingwen and Guo, Cong and Chen, Quan and Li, Chao and Guo, Minyi and Zhu, Yuhao},
  title = {Adversarial Defense Through Network Profiling Based Path Extraction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
RENAS: Reinforced Evolutionary Neural Architecture Search
Yukang Chen, Gaofeng Meng, Qian Zhang, Shiming Xiang, Chang Huang, Lisen Mu, Xinggang Wang


Neural Architecture Search (NAS) is an important yet challenging task in network design due to its high computational consumption. To address this issue, we propose the Reinforced Evolutionary Neural Architecture Search (RENAS), which is an evolutionary method with reinforced mutation for NAS. Our method integrates reinforced mutation into an evolution algorithm for neural architecture exploration, in which a mutation controller is introduced to learn the effects of slight modifications and make mutation actions. The reinforced mutation controller guides the model population to evolve efficiently. Furthermore, as child models can inherit parameters from their parents during evolution, our method requires very limited computational resources. In experiments, we conduct the proposed search method on CIFAR-10 and obtain a powerful network architecture, RENASNet. This architecture achieves a competitive result on CIFAR-10. The explored network architecture is transferable to ImageNet and achieves a new state-of-the-art accuracy, i.e., 75.7% top-1 accuracy with 5.36M parameters on mobile ImageNet. We further test its performance on semantic segmentation with DeepLabv3 on the PASCAL VOC. RENASNet outperforms MobileNet-v1, MobileNet-v2 and NASNet. It achieves 75.83% mIOU without being pretrained on COCO.
[framework, outperforms, hidden, consists] [algorithm] [image, input, method, conduct, result] [search, cell, architecture, neural, imagenet, controller, mutation, network, renasnet, evolution, accuracy, block, reinforced, size, population, performance, table, renas, achieves, mobile, pretrained, validation, better, nasnet, efficient, genetic, convolutional, efficiency, searched, layer, standard, cutout, computational, output, operation, number, rate, sep, operator, gpu, process, structure, stride, slight, evolve] [model, evolutionary, child, reinforcement, random] [semantic, segmentation, feature, evaluated, parent, pascal] [learning, training, set, space, learn, large, trained]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Yukang and Meng, Gaofeng and Zhang, Qian and Xiang, Shiming and Huang, Chang and Mu, Lisen and Wang, Xinggang},
  title = {RENAS: Reinforced Evolutionary Neural Architecture Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Co-Occurrence Neural Network
Irina Shevlev, Shai Avidan


Convolutional Neural Networks (CNNs) have become a very popular tool for image analysis. Convolutions are fast to compute and easy to store, but they also have some limitations. First, they are shift-invariant and, as a result, do not adapt to different regions of the image. Second, they have a fixed spatial layout, so small geometric deformations in the layout of a patch will completely change the filter response. For these reasons, we need multiple filters to handle the different parts and variations in the input. We augment the standard convolutional tools used in CNNs with a new filter that addresses both issues raised above. Our filter combines two terms: a spatial filter and a term based on the co-occurrence statistics of input values in the neighborhood. The proposed filter is differentiable and can therefore be packaged as a layer in a CNN and trained using back-propagation. We show how to train the filter as part of the network and report results on several data sets. In particular, we replace a convolutional layer with hundreds of thousands of parameters by a Co-occurrence Layer consisting of only a few hundred parameters, with minimal impact on accuracy.
[term, work, second, dataset, forward, recognition] [matrix, computer, vision, pattern, equation, total, error, defined, june, completely, linear, single, shape] [input, pixel, image, figure, based, stack, conference, bilateral, ieee, proposed, handle] [filter, col, layer, network, number, size, neural, convolutional, deep, table, performance, output, standard, original, pruning, cof, activation, fast, replace, convolution, denotes, compression, ratio, small, architecture] [pass, generated, regular, type] [spatial, third, propose, three, evaluated, layout] [test, function, trained, distribution, learn, prototype, training, learning, set, data]
@InProceedings{Shevlev_2019_CVPR,
  author = {Shevlev, Irina and Avidan, Shai},
  title = {Co-Occurrence Neural Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SpotTune: Transfer Learning Through Adaptive Fine-Tuning
Yunhui Guo, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, Rogerio Feris


Transfer learning, which allows a source task to affect the inductive bias of the target task, is widely used in computer vision. The typical way of conducting transfer learning with deep neural networks is to fine-tune a model pretrained on the source task using data from the target task. In this paper, we propose an adaptive fine-tuning approach, called SpotTune, which finds the optimal fine-tuning strategy per instance for the target data. In SpotTune, given an image from the target task, a policy network is used to make routing decisions on whether to pass the image through the fine-tuned layers or the pre-trained layers. We conduct extensive experiments to demonstrate the effectiveness of the proposed approach. Our method outperforms the traditional fine-tuning approach on 12 out of 14 standard datasets. We also compare SpotTune with other state-of-the-art fine-tuning strategies, showing superior performance. On the Visual Decathlon datasets, our method achieves the highest score across the board without bells and whistles.
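A minimal sketch of the per-instance routing idea, assuming frozen_blocks are the pre-trained layers, tuned_blocks are their fine-tuned copies, and policy_net is a small network (hypothetical here) that emits one logit pair per block; SpotTune trains these discrete decisions with a differentiable relaxation, which is omitted from this forward pass.

import torch

def spottune_forward(x, frozen_blocks, tuned_blocks, policy_net):
    logits = policy_net(x)                        # (batch, num_blocks, 2)
    route = logits.argmax(dim=-1).float()         # 1 -> use the fine-tuned block
    for i, (frozen, tuned) in enumerate(zip(frozen_blocks, tuned_blocks)):
        r = route[:, i].view(-1, 1, 1, 1)         # broadcast over feature maps
        x = r * tuned(x) + (1.0 - r) * frozen(x)  # per-example choice of path
    return x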
[dataset, work, challenge, frozen, routing] [approach, discrete, optimal, allows, computer, note] [proposed, method, image, figure, based, input] [spottune, network, residual, deep, block, number, neural, decathlon, accuracy, standard, imagenet, convolutional, compared, better, output, variant, table, adaptive, freeze, layer, performance, achieves, finetuning, computation, wikiart, parameter, ibm, pretrained] [policy, model, visual, gumbel, arxiv, preprint, random, decision, consider] [feature, score, global, improve, baseline, three, propose] [target, learning, training, transfer, source, task, distribution, softmax, data, test, classification, datasets, set, domain, adaptation, shared, learn, extractor, stanford, strategy]
@InProceedings{Guo_2019_CVPR,
  author = {Guo, Yunhui and Shi, Honghui and Kumar, Abhishek and Grauman, Kristen and Rosing, Tajana and Feris, Rogerio},
  title = {SpotTune: Transfer Learning Through Adaptive Fine-Tuning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Signal-To-Noise Ratio: A Robust Distance Metric for Deep Metric Learning
Tongtong Yuan, Weihong Deng, Jian Tang, Yinan Tang, Binghui Chen


Deep metric learning, which learns discriminative features for image clustering and retrieval tasks, has attracted extensive attention in recent years. A number of deep metric learning methods, which ensure that similar examples are mapped close to each other and dissimilar examples are mapped farther apart, have been proposed to construct effective structures for loss functions and have shown promising results. In this paper, rather than designing a new loss structure, we propose a robust SNR distance metric based on Signal-to-Noise Ratio (SNR) for measuring the similarity of image pairs for deep metric learning. By exploring the properties of our SNR distance metric from the viewpoints of geometry and statistics, we show that it preserves the semantic similarity between image pairs, which justifies its suitability for deep metric learning. Compared with the Euclidean distance metric, our SNR distance metric can further jointly reduce the intra-class distances and enlarge the inter-class distances of learned features. Leveraging our SNR distance metric, we propose Deep SNR-based Metric Learning (DSML) to generate discriminative feature embeddings. Extensive experiments on three widely adopted benchmarks, including CARS196, CUB200-2011 and CIFAR10, show that DSML outperforms other state-of-the-art methods. Additionally, we extend our SNR distance metric to deep hashing learning, and conduct experiments on two benchmarks, CIFAR10 and NUS-WIDE, to demonstrate the effectiveness and generality of our SNR distance metric.
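A hedged sketch of a signal-to-noise-ratio distance for embeddings; the reading assumed here is d(a, p) = Var(a - p) / Var(a), i.e., the variance of the pair difference (the "noise") over the variance of the anchor features, which is one natural formalization of the idea described above.

import torch

def snr_distance(anchor, other, eps=1e-8):
    """anchor, other: (batch, dim) embedding tensors; returns (batch,) distances."""
    noise_var = (anchor - other).var(dim=1, unbiased=False)
    anchor_var = anchor.var(dim=1, unbiased=False)
    return noise_var / (anchor_var + eps)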
[signal, dataset, construct] [measurement, defined, robust] [image, based, proposed, method, figure, preserve, superiority, noise, database, paired, chen] [deep, performance, binary, compared, table, variance, experiment, denotes, size, promising, ratio, correlation, reduce, applied, order, extensive, enlarge, neural] [generate, measuring] [feature, propose, anchor, semantic, improve, including] [metric, distance, learning, snr, euclidean, loss, hashing, similarity, retrieval, embedding, learned, triplet, objective, contrastive, negative, discriminative, clustering, dsml, dissimilar, lifted, training, function, hamming, set, positive, testing, ranking, weihong, mapped, pair, large, dtsh, generality, data, mahalanobis]
@InProceedings{Yuan_2019_CVPR,
  author = {Yuan, Tongtong and Deng, Weihong and Tang, Jian and Tang, Yinan and Chen, Binghui},
  title = {Signal-To-Noise Ratio: A Robust Distance Metric for Deep Metric Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Detection Based Defense Against Adversarial Examples From the Steganalysis Point of View
Jiayang Liu, Weiming Zhang, Yiwei Zhang, Dongdong Hou, Yujia Liu, Hongyue Zha, Nenghai Yu


Deep Neural Networks (DNNs) have recently led to significant improvements in many fields. However, DNNs are vulnerable to adversarial examples: samples with imperceptible perturbations that nevertheless dramatically mislead the DNNs. Moreover, adversarial examples can be used to attack various kinds of DNN-based systems, even if the adversary has no access to the underlying model. Many defense methods have been proposed, such as obfuscating the gradients of the networks or detecting adversarial examples. However, it has been shown that these defense methods are either not effective or cannot resist secondary adversarial attacks. In this paper, we point out that steganalysis can be applied to adversarial example detection, and propose a method to enhance steganalysis features by estimating the probability of modifications caused by adversarial attacks. Experimental results show that the proposed method can accurately detect adversarial examples. Moreover, secondary adversarial attacks are hard to perform directly against our method because it is based not on a neural network but on high-dimensional artificial features and an ensemble of Fisher Linear Discriminants.
[markov, transition, perform, second, adjacent] [normal, international, matrix, estimate, computer, linear, horizontal, prove] [method, image, based, input, pixel, difference, ieee, conference, figure] [neural, network, effective, gradient, rate, deep, secondary, table, performance, imagenet, residual, original, calculate] [adversarial, defense, attack, model, probability, steganalysis, spam, mpm, esrm, example, carlini, generated, untargeted, wagner, adv, espam, generate, modification, arxiv, preprint, robustness, fgsm, perturbation, machine, modified, targeted, deepfool, dependence, rich, fnor] [detection, detecting, detect, propose, srm, spatial, feature, detector, enhanced, enhance] [classification, zij, learning, training, set, dimensionality, classify, class, experimental, classifier, data]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Jiayang and Zhang, Weiming and Zhang, Yiwei and Hou, Dongdong and Liu, Yujia and Zha, Hongyue and Yu, Nenghai},
  title = {Detection Based Defense Against Adversarial Examples From the Steganalysis Point of View},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
HetConv: Heterogeneous Kernel-Based Convolutions for Deep CNNs
Pravendra Singh, Vinay Kumar Verma, Piyush Rai, Vinay P. Namboodiri


We present a novel deep learning architecture in which the convolution operation leverages heterogeneous kernels. The proposed HetConv (Heterogeneous Kernel-Based Convolution) reduces the computation (FLOPs) and the number of parameters compared to the standard convolution operation while still maintaining representational efficiency. To show the effectiveness of our proposed convolution, we present extensive experimental results on standard convolutional neural network (CNN) architectures such as VGG and ResNet. We find that after replacing the standard convolutional filters in these architectures with our proposed HetConv filters, we achieve a 3x to 8x FLOPs-based improvement in speed while still maintaining (and sometimes improving) the accuracy. We also compare our proposed convolution with group-wise and depth-wise convolutions and show that it achieves a greater FLOPs reduction with significantly higher accuracy.
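A hedged approximation of a heterogeneous-kernel convolution using standard PyTorch ops: each output filter gets 3x3 kernels on a 1/P slice of the input channels (grouped convolution) and 1x1 kernels elsewhere (pointwise convolution), and the two paths are summed. This follows the spirit of HetConv rather than the exact per-filter kernel arrangement in the paper.

import torch.nn as nn

class HetConvLike(nn.Module):
    def __init__(self, in_ch, out_ch, p=4):
        super().__init__()
        assert in_ch % p == 0 and out_ch % p == 0
        # 3x3 kernels applied to in_ch // p channels per filter
        self.gwc = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1,
                             groups=p, bias=False)
        # 1x1 kernels covering the remaining channels
        self.pwc = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.gwc(x) + self.pwc(x)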
[heterogeneous, pwc, work] [computer, approach, total, initial, vision, homogeneous] [proposed, input, method, conference, comparison, based, result, figure] [convolutional, convolution, filter, accuracy, pruning, standard, hetconv, architecture, efficient, size, latency, layer, deep, mobilenet, neural, pointwise, compression, increase, compared, better, table, reduction, design, replaced, imagenet, reduce, groupwise, flop, number, compare, depthwise, vinay, cost, experimented, computation, efficiency, output, kernel, operation, increasing, popular, designing, dwc, reduces, network, resnet, higher, performance, deeper, keeping, drop, gwc, computational, reduced, xxx] [model] [feature, jian, spatial] [existing, remaining, learning, loss, training]
@InProceedings{Singh_2019_CVPR,
  author = {Singh, Pravendra and Kumar Verma, Vinay and Rai, Piyush and Namboodiri, Vinay P.},
  title = {HetConv: Heterogeneous Kernel-Based Convolutions for Deep CNNs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Strike (With) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects
Michael A. Alcorn, Qi Li, Zhitao Gong, Chengfei Wang, Long Mai, Wei-Shinn Ku, Anh Nguyen


Despite excellent performance on stationary test sets, deep neural networks (DNNs) can fail to generalize to out-of-distribution (OoD) inputs, including natural, non-adversarial ones, which are common in real-world settings. In this paper, we present a framework for discovering DNN failures that harnesses 3D renderers and 3D models. That is, we estimate the parameters of a 3D renderer that cause a target DNN to misbehave in response to the rendered image. Using our framework and a self-assembled dataset of 3D objects, we investigate the vulnerability of DNNs to OoD poses of well-known objects in ImageNet. For objects that are readily recognized by DNNs in their canonical poses, DNNs incorrectly classify 97% of their pose space. In addition, DNNs are highly sensitive to slight pose perturbations. Importantly, adversarial poses transfer across models and datasets. We find that 99.9% and 99.4% of the poses misclassified by Inception-v3 also transfer to the AlexNet and ResNet-50 image classifiers trained on the same ImageNet dataset, respectively, and 75.5% transfer to the YOLOv3 object detector trained on MS COCO.
[dataset, framework, recognition, work, recorded] [pose, computer, lighting, vision, rotation, pattern, camera, international, rendered, geometry, confidence, error, optimization, corresponding, differentiable, percent] [image, conference, ieee, background, real, change] [dnn, imagenet, dnns, neural, gradient, parameter, deep, descent, highly, alexnet, search, accuracy, table, rate] [adversarial, median, random, renderer, ood, arxiv, misclassified, correctly, calculated, school, generated, preprint, common, procedure, correct, renderers, incorrect, sampled, misclassifications] [object, three] [target, learning, transfer, training, space, set, classifier, test, trained, main, classification, selected, maximum]
@InProceedings{Alcorn_2019_CVPR,
  author = {Alcorn, Michael A. and Li, Qi and Gong, Zhitao and Wang, Chengfei and Mai, Long and Ku, Wei-Shinn and Nguyen, Anh},
  title = {Strike (With) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Blind Geometric Distortion Correction on Images Through Deep Learning
Xiaoyu Li, Bo Zhang, Pedro V. Sander, Jing Liao


We propose the first general framework to automatically correct different types of geometric distortion in a single input image. Our proposed method employs convolutional neural networks (CNNs) trained by using a large synthetic distortion dataset to predict the displacement field between distorted images and corrected images. A model fitting method uses the CNN output to estimate the distortion parameters, achieving a more accurate prediction. The final corrected image is generated based on the predicted flow using an efficient, high-quality resampling method. Experimental results demonstrate that our algorithm outperforms traditional correction methods, and allows for interesting applications such as distortion transfer, distortion exaggeration, and co-occurring distortion correction.
[flow, prediction, dataset, forward] [distortion, geometric, computer, single, distorted, approach, lens, pattern, fitting, vision, camera, geonetm, corrected, estimate, perspective, radial, error, estimation, range, directly, epe, accurate, algorithm, angle, analysis, point, calibration, ground, field, barrel, rotation, corresponding] [image, correction, method, figure, ieee, input, conference, proposed, traditional, based, mapping, pixel, comparison, result] [network, parameter, search, convolutional, architecture, conv, deep, wide, processing, layer, table, output, specialized, residual] [model, type, iterative, correct, machine, generate, automatic, encoder] [detection, propose, feature, map] [resampling, classification, learning, loss, source, trained, division, learn, domain]
@InProceedings{Li_2019_CVPR,
  author = {Li, Xiaoyu and Zhang, Bo and Sander, Pedro V. and Liao, Jing},
  title = {Blind Geometric Distortion Correction on Images Through Deep Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Instance-Level Meta Normalization
Songhao Jia, Ding-Jie Chen, Hwann-Tzong Chen


This paper presents a normalization mechanism called Instance-Level Meta Normalization (ILM Norm) to address a learning-to-normalize problem. ILM Norm learns to predict the normalization parameters via both the feature feed-forward and the gradient back-propagation paths. ILM Norm provides a meta normalization mechanism and has several good properties. It can be easily plugged into existing instance-level normalization schemes such as Instance Normalization, Layer Normalization, or Group Normalization. ILM Norm normalizes each instance individually and therefore maintains high performance even when a small mini-batch is used. The experimental results show that ILM Norm adapts well to different network architectures and tasks, and consistently improves the performance of the original models.
[key, dataset, predict, focus] [error, recovering, additional, underlying, well] [input, figure, image, comparison, method, style, proposed, lpips] [normalization, ilm, batch, rescaling, norm, table, standardization, group, original, size, variance, tensor, rate, number, layer, performance, deep, activation, lower, weight, network, standardized, better, imagenet, neural, validation, increment, best, achieve, gradient, output] [mechanism, model, association, embedded, evaluate, vector] [feature, stage, instance, extraction, map, mask, detection, improves, improve] [learning, training, meta, existing, distribution, classification, transfer, set, learned, strategy, domain, bias, trained, large, experimental]
@InProceedings{Jia_2019_CVPR,
  author = {Jia, Songhao and Chen, Ding-Jie and Chen, Hwann-Tzong},
  title = {Instance-Level Meta Normalization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Iterative Normalization: Beyond Standardization Towards Efficient Whitening
Lei Huang, Yi Zhou, Fan Zhu, Li Liu, Ling Shao


Batch Normalization (BN) is ubiquitously employed for accelerating neural network training and improving the generalization capability by performing standardization within mini-batches. Decorrelated Batch Normalization (DBN) further boosts the above effectiveness by whitening. However, DBN relies heavily on either a large batch size or eigen-decomposition, which suffers from poor efficiency on GPUs. We propose Iterative Normalization (IterNorm), which employs Newton's iterations for much more efficient whitening while simultaneously avoiding the eigen-decomposition. Furthermore, we develop a comprehensive study showing that IterNorm has a better trade-off between optimization and generalization, with theoretical and experimental support. To this end, we introduce Stochastic Normalization Disturbance (SND), which measures the inherent stochastic uncertainty of samples when applied to normalization operations. With the support of SND, we provide natural explanations for several phenomena from the perspective of optimization, e.g., why the group-wise whitening of DBN generally outperforms full whitening and why the accuracy of BN degenerates with reduced batch sizes. We demonstrate the consistently improved performance of IterNorm over BN and DBN with extensive experiments on CIFAR-10 and ImageNet.
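A hedged sketch of whitening with Newton-style iterations instead of an eigen-decomposition, in the spirit of IterNorm: the covariance is normalized by its trace, P is iterated as P <- 0.5 * (3P - P^3 * Sigma_N), and the result is rescaled to approximate Sigma^(-1/2). Group-wise whitening and the running statistics used at inference time follow the paper, not this sketch.

import torch

def iterative_whitening(x, n_iter=5, eps=1e-5):
    """x: (batch, dim) activations; returns approximately whitened features."""
    xc = x - x.mean(dim=0, keepdim=True)
    sigma = xc.t() @ xc / x.shape[0] + eps * torch.eye(x.shape[1])
    trace = sigma.diagonal().sum()
    sigma_n = sigma / trace                        # trace-normalized covariance
    p = torch.eye(x.shape[1])
    for _ in range(n_iter):
        p = 0.5 * (3.0 * p - p @ p @ p @ sigma_n)  # Newton-style iteration
    whitening = p / trace.sqrt()                   # approx. sigma^(-1/2)
    return xc @ whitening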
[] [matrix, optimization, respect, error, normalized, condition, square, associated, analysis, centered, eigenvalue, provide] [figure, based, method, proposed, suffers, conditioning, lei, decorrelated, comparison] [iternorm, batch, normalization, size, whitening, dbn, snd, performance, number, residual, neural, deep, iteration, covariance, network, small, calculate, stochastic, standardization, group, eigenvectors, efficiency, efficient, better, zca, operation, layer, root, rate, compared, reduced, imagenet, output, weight, activation, gradient, normalizing, convolution, explore, wide, table, capability] [find, improved, random, iterative, disturbance] [feature, improve, propose, improves] [training, data, test, learning, generalization, dimension, convergence, sample, large, experimental, observe, uncertainty]
@InProceedings{Huang_2019_CVPR,
  author = {Huang, Lei and Zhou, Yi and Zhu, Fan and Liu, Li and Shao, Ling},
  title = {Iterative Normalization: Beyond Standardization Towards Efficient Whitening},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
On Learning Density Aware Embeddings
Soumyadeep Ghosh, Richa Singh, Mayank Vatsa


Deep metric learning algorithms have been utilized to learn discriminative and generalizable models which are effective for classifying unseen classes. In this paper, a novel noise-tolerant deep metric learning algorithm is proposed. The proposed method, termed Density Aware Metric Learning, enforces the model to learn embeddings that are pulled towards the most dense region of the cluster for each class. This is achieved by iteratively shifting the estimate of the center towards the dense region of the cluster, thereby leading to faster convergence and higher generalizability. In addition, the approach is robust to noisy samples in the training data, which are often present as outliers. Detailed experiments and analysis on two challenging cross-modal face recognition databases and two popular object recognition databases exhibit the efficacy of the proposed approach. It has superior convergence, requires less training time, and yields better accuracies than several popular deep metric learning methods.
[recognition, iteratively, time] [algorithm, dense, estimate, approach, expressed, point, analysis, matching, total, respect] [proposed, face, ieee, image, figure, method, based, database, resolution] [deep, density, shift, vanilla, table, number, performed, better, calculate, popular, kernel, size, compared] [model, selecting] [center, aware, region, object, recall, score, person, anchor, cnn] [loss, triplet, training, learning, metric, class, quadruplet, enclosure, hard, data, noisy, embedding, set, embeddings, cluster, function, negative, discriminative, convergence, distance, centroid, scface, conventional, positive, datl, large, nearest, sample, daql, mining, train, learn, datasets, viewed, ool, facesurv, probe, richa, mayank]
@InProceedings{Ghosh_2019_CVPR,
  author = {Ghosh, Soumyadeep and Singh, Richa and Vatsa, Mayank},
  title = {On Learning Density Aware Embeddings},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Contrastive Adaptation Network for Unsupervised Domain Adaptation
Guoliang Kang, Lu Jiang, Yi Yang, Alexander G. Hauptmann


Unsupervised Domain Adaptation (UDA) makes predictions for the target domain data while manual annotations are only available in the source domain. Previous methods minimize the domain discrepancy while neglecting class information, which may lead to misalignment and poor generalization performance. To address this issue, this paper proposes Contrastive Adaptation Network (CAN), which optimizes a new metric that explicitly models both the intra-class and the inter-class domain discrepancy. We design an alternating update strategy for training CAN in an end-to-end manner. Experiments on two real-world benchmarks, Office-31 and VisDA-2017, demonstrate that CAN performs favorably against the state-of-the-art methods and produces more discriminative features.
[perform, explicitly, work, hypothesis, previous, recognition, performs] [computer, vision, pattern, estimate, optimization, alternative, accurate, underlying, compute, estimation] [method, conference, proposed, ieee, based, jan, input, figure, result] [network, deep, neural, accuracy, table, performance, layer, validation] [model, adversarial, ambiguous, arxiv, preprint, decision, sampled] [feature, propose] [domain, target, discrepancy, data, cdd, adaptation, source, training, class, contrastive, learning, label, labeled, mmd, loss, unsupervised, minimize, update, set, cluster, pseudo, train, sampling, clustering, sample, minimizing, discriminative, maximum, alignment, uda, metric, objective, learn, dan]
@InProceedings{Kang_2019_CVPR,
  author = {Kang, Guoliang and Jiang, Lu and Yang, Yi and Hauptmann, Alexander G.},
  title = {Contrastive Adaptation Network for Unsupervised Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
LP-3DCNN: Unveiling Local Phase in 3D Convolutional Neural Networks
Sudhakar Kumawat, Shanmuganathan Raman


Traditional 3D Convolutional Neural Networks (CNNs) are computationally expensive, memory intensive, prone to overfit, and most importantly, there is a need to improve their feature learning capabilities. To address these issues, we propose Rectified Local Phase Volume (ReLPV) block, an efficient alternative to the standard 3D convolutional layer. The ReLPV block extracts the phase in a 3D local neighborhood (e.g., 3x3x3) of each position of the input map to obtain the feature maps. The phase is extracted by computing 3D Short Term Fourier Transform (STFT) at multiple fixed low frequency points in the 3D local neighborhood of each position. These feature maps at different frequency points are then linearly combined after passing them through an activation function. The ReLPV block provides significant parameter savings of at least, 3^3 to 13^3 times compared to the standard 3D convolutional layer with the filter sizes 3x3x3 to 13x13x13, respectively. We show that the feature learning capabilities of the ReLPV block are significantly better than the standard 3D convolutional layer. Furthermore, it produces consistently better results across different 3D data representations. We achieve state-of-the-art accuracy on the volumetric ModelNet10 and ModelNet40 datasets while utilizing only 11% parameters of the current state-of-the-art. We also improve the state-of-the-art on the UCF-101 split-1 action recognition dataset by 5.68% (when trained from scratch) while using only 15% of the parameters of the state-of-the-art.
[stft, action, recognition, spatiotemporal, video, dataset, version, current, previous] [local, volumetric, volume, corresponding, alternative, neighborhood, fourier, modelnet, voxnet] [input, image, frequency, based, proposed, comparison, traditional] [relpv, layer, network, block, convolutional, phase, standard, cnns, number, size, deep, neural, performance, better, compared, trainable, table, output, architecture, increase, efficient, filter, binarized, binary, activation, low, achieve, relu, channel, connected, order, complexity, separable] [model, memory, arxiv, preprint] [feature, cnn, baseline, map, object, fully, average, improve] [learning, data, training, large, datasets, trained, classification, overfitting, hyperparameters]
@InProceedings{Kumawat_2019_CVPR,
  author = {Kumawat, Sudhakar and Raman, Shanmuganathan},
  title = {LP-3DCNN: Unveiling Local Phase in 3D Convolutional Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Attribute-Driven Feature Disentangling and Temporal Aggregation for Video Person Re-Identification
Yiru Zhao, Xu Shen, Zhongming Jin, Hongtao Lu, Xian-sheng Hua


Video-based person re-identification plays an important role in surveillance video analysis, extending image-based methods by learning features over multiple frames. Most existing methods fuse features by temporal average-pooling, without exploring the different frame weights caused by various viewpoints, poses, and occlusions. In this paper, we propose an attribute-driven method for feature disentangling and frame re-weighting. The features of single frames are disentangled into groups of sub-features, each corresponding to specific semantic attributes. The sub-features are re-weighted by the confidence of attribute recognition and then aggregated along the temporal dimension as the final representation. By means of this strategy, the most informative regions of each frame are enhanced and contribute to a more discriminative sequence representation. Extensive ablation studies demonstrate the effectiveness of feature disentangling as well as temporal re-weighting. The experimental results on the iLIDS-VID, PRID-2011 and MARS datasets demonstrate that our proposed method outperforms existing state-of-the-art approaches.
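A hedged sketch of the confidence-weighted temporal aggregation described above, assuming the disentangling network has already produced per-frame sub-features for G attribute groups and per-frame attribute-recognition confidences; the attribute predictors and the grouping itself are assumptions here.

import torch

def reweighted_temporal_pool(sub_features, attr_confidence):
    """sub_features: (T, G, D) per-frame sub-features for G attribute groups.
    attr_confidence: (T, G) attribute-recognition confidence per frame."""
    w = torch.softmax(attr_confidence, dim=0)          # normalize weights over time
    pooled = (w.unsqueeze(-1) * sub_features).sum(0)   # (G, D) aggregated sub-features
    return pooled.flatten()                            # sequence-level representation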
[temporal, sequence, dataset, video, frame, recognition, prediction, human, joint, optical, flow, utilized, blackhair, concatenated, predict] [local, corresponding, body, provide] [attribute, method, proposed, disentangling, input, face, comparison, demonstrate, image, based, figure, disentangled] [deep, group, binary, network, aggregation, layer, neural, weight, architecture] [model, attention, calculated, common] [person, feature, bce, map, average, semantic, propose, merge, xiaogang, global, level, three, liang] [learning, loss, transfer, set, trained, informative, entropy, training, cross, existing, task, representation, distance, pairwise, similarity, predictor]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Yiru and Shen, Xu and Jin, Zhongming and Lu, Hongtao and Hua, Xian-sheng},
  title = {Attribute-Driven Feature Disentangling and Temporal Aggregation for Video Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Binary Ensemble Neural Network: More Bits per Network or More Networks per Bit?
Shilin Zhu, Xin Dong, Hao Su


Binary neural networks (BNN) have been studied extensively since they run dramatically faster at lower memory and power consumption than floating-point networks, thanks to the efficiency of bit operations. However, contemporary BNNs whose weights and activations are both single bits suffer from severe accuracy degradation. To understand why, we investigate the representation ability, speed and bias/variance of BNNs through extensive experiments. We conclude that the error of BNNs is predominantly caused by intrinsic instability (training time) and non-robustness (train & test time). Inspired by this investigation, we propose the Binary Ensemble Neural Network (BENN) which leverages ensemble methods to improve the performance of BNNs with limited efficiency cost. While ensemble techniques have been broadly believed to be only marginally helpful for strong classifiers such as deep neural networks, our analysis and experiments show that they are naturally a perfect fit to boost BNNs. We find that our BENN, which is faster and more robust than state-of-the-art binary networks, can even surpass the accuracy of the full-precision floating number network with the same architecture.
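A minimal sketch of the ensembling idea: several independently trained binary networks vote by averaging their logits (the bagging-style variant); the boosting-based combinations studied in the paper are not shown.

import torch

def benn_predict(binary_nets, x):
    logits = torch.stack([net(x) for net in binary_nets], dim=0)  # (N, batch, classes)
    return logits.mean(dim=0).argmax(dim=-1)                      # averaged vote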
[work, multiple, recognition, current, people] [single, analysis, error, optimization, international] [input, conference, statistical, real, change, variation, method, ieee] [neural, bnn, benn, network, accuracy, bnns, boosting, deep, binary, bagging, better, best, variance, dnn, table, imagenet, complexity, output, performance, gain, activation, stability, convolutional, processing, achieve, reduce, binarized, weight, speed, compression, computation, quantized, alexnet, compare, severe, number, architecture, compared, quantization, qnn, binarization, standard] [arxiv, preprint, model, machine, memory, robustness, strong] [weak] [training, ensemble, large, learning, classifier, bias, observe, test, train, function, trained, independent, task]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Shilin and Dong, Xin and Su, Hao},
  title = {Binary Ensemble Neural Network: More Bits per Network or More Networks per Bit?},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Distilling Object Detectors With Fine-Grained Feature Imitation
Tao Wang, Li Yuan, Xiaopeng Zhang, Jiashi Feng


State-of-the-art CNN-based recognition models are often computationally prohibitive to deploy on low-end devices. A promising high-level approach to tackling this limitation is knowledge distillation, which lets a small student model mimic a cumbersome teacher model's output to obtain improved generalization. However, related methods mainly focus on the simple task of classification and do not consider complex tasks like object detection. We show that applying vanilla knowledge distillation to a detection model yields only a minor gain. To address the challenge of distilling knowledge in a detection model, we propose a fine-grained feature imitation method exploiting the cross-location discrepancy of feature response. Our intuition is that detectors care more about local regions near objects. Thus the discrepancy of feature response at anchor locations near objects reveals important information about how the teacher model tends to generalize. We design a novel mechanism to estimate those locations and let the student model imitate the teacher on them to get enhanced performance. We first validate the idea on a developed lightweight toy detector, which carries the simplest notion of current state-of-the-art anchor-based detection models, on the challenging KITTI dataset; our method yields up to a 15% boost in mAP for the student model compared to the non-imitated counterpart. We then extensively evaluate the method with Faster R-CNN under various scenarios on the common object detection benchmarks of Pascal VOC and COCO, where imitation alleviates up to 74% of the performance drop of the student model relative to the teacher. Code is released at https://github.com/twangnh/Distilling-Object-Detectors
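A hedged sketch of a fine-grained feature-imitation loss: an imitation mask keeps only feature-map locations near ground-truth objects (the paper derives it from anchor/ground-truth overlap), and the student feature map, passed through a small adaptation layer, regresses the teacher feature map at those locations. The adaptation layer and the mask construction are assumed to exist elsewhere.

import torch

def imitation_loss(student_feat, teacher_feat, mask, adapt_layer):
    """student_feat, teacher_feat: (B, C, H, W); mask: (B, 1, H, W) in {0, 1}."""
    diff = (adapt_layer(student_feat) - teacher_feat) ** 2
    num_pos = mask.sum().clamp(min=1.0)
    return (diff * mask).sum() / (2.0 * num_pos)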
[perform] [computer, ground, truth, directly, vision, analysis, pattern, kitti, corresponding, local] [method, based, proposed, high, conference, background, ieee, figure, input, raw, prior, image] [neural, performance, layer, table, network, deep, full, gain, lightweight, convolutional, pruning, variance, lower, compared, channel, filter, output, shallow] [model, imitation, arxiv, preprint, find, improved, simple] [feature, object, detection, map, anchor, faster, detector, fig, region, mask, level, imitated, iou, response, pascal, bounding, threshold, halved, coco, ross, enhanced, localization, box, including, supervision] [student, teacher, knowledge, distillation, loss, toy, discrepancy, classification, learning, large, distilling, effectively, trained, thresholding]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Tao and Yuan, Li and Zhang, Xiaopeng and Feng, Jiashi},
  title = {Distilling Object Detectors With Fine-Grained Feature Imitation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Centripetal SGD for Pruning Very Deep Convolutional Networks With Complicated Structure
Xiaohan Ding, Guiguang Ding, Yuchen Guo, Jungong Han


Redundancy is widely recognized in Convolutional Neural Networks (CNNs), which makes it possible to remove unimportant filters from convolutional layers so as to slim the network with an acceptable performance drop. Inspired by the linearity of convolution, we seek to make some filters increasingly close and eventually identical for network slimming. To this end, we propose Centripetal SGD (C-SGD), a novel optimization method, which can train several filters to collapse into a single point in the parameter hyperspace. When the training is completed, the removal of the identical filters can trim the network with NO performance loss, thus no finetuning is needed. By doing so, we have partly solved an open problem of constrained filter pruning on CNNs with complicated structure, where some layers must be pruned following the others. Our experimental results on CIFAR-10 and ImageNet have justified the effectiveness of C-SGD-based filter pruning. Moreover, we have provided empirical evidence for the assumption that the redundancy in deep neural networks helps the convergence of training, by showing that a redundant CNN trained using C-SGD outperforms a normally trained counterpart with the equivalent width.
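A hedged sketch of a centripetal update step: filters assigned to the same cluster descend along an averaged gradient and are additionally pulled toward the cluster mean, so they grow closer during training and become identical (hence removable) at the end. The cluster assignment, weight decay, and exact hyperparameters follow the paper, not this sketch.

import torch

def centripetal_step(filters, grads, clusters, lr=0.1, centripetal=3e-3):
    """filters, grads: (num_filters, k) flattened filter weights and their gradients.
    clusters: list of index tensors, one per cluster of filters to be merged."""
    with torch.no_grad():
        for idx in clusters:
            g_mean = grads[idx].mean(dim=0, keepdim=True)    # shared (averaged) gradient
            f_mean = filters[idx].mean(dim=0, keepdim=True)  # cluster centre
            filters[idx] -= lr * g_mean                      # descent along the average
            filters[idx] -= centripetal * (filters[idx] - f_mean)  # pull filters together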
[] [normal, pattern, computer, international, vision, problem, error, corresponding, direction] [conference, ieee, input, complicated, figure, produce, difference] [neural, pruning, filter, deep, layer, convolutional, network, pruned, processing, accuracy, cnns, redundancy, efficient, performance, centripetal, sgd, redundant, original, lasso, number, parameter, finetuning, channel, regularization, weight, rate, kernel, decay, identical, grow, acceleration, convolution, slimming, compact, residual] [arxiv, preprint, model, constrained, machine, making, generate] [cnn, feature] [training, learning, set, trained, close, cluster, base, seek, sampler, open, remaining, objective, loss, lim, train]
@InProceedings{Ding_2019_CVPR,
  author = {Ding, Xiaohan and Ding, Guiguang and Guo, Yuchen and Han, Jungong},
  title = {Centripetal SGD for Pruning Very Deep Convolutional Networks With Complicated Structure},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Knockoff Nets: Stealing Functionality of Black-Box Models
Tribhuvanesh Orekondy, Bernt Schiele, Mario Fritz


Machine Learning (ML) models are increasingly deployed in the wild to perform a wide range of tasks. In this work, we ask to what extent an adversary can steal functionality of such "victim" models based solely on blackbox interactions: image in, predictions out. In contrast to prior work, we study complex victim blackbox models, and an adversary lacking knowledge of the train/test data used by the model, its internals, and the semantics over its outputs. We formulate model functionality stealing as a two-step approach: (i) querying a set of input images to the blackbox model to obtain predictions; and (ii) training a "knockoff" with the queried image-prediction pairs. We make multiple remarkable observations: (a) querying random images from a different distribution than that of the blackbox training data results in a well-performing knockoff; (b) this is possible even when the knockoff is represented using a different architecture; and (c) our reinforcement learning approach additionally improves query sample efficiency in certain settings and provides performance gains. We validate model functionality stealing on a range of datasets and tasks, and show that a reasonable knockoff of an image analysis API could be created for as little as $30.
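A hedged sketch of the two-step functionality-stealing setup: query the victim black box with transfer images and record its output probabilities, then train a knockoff (possibly of a different architecture) to match those soft labels with a cross-entropy-style loss. The victim interface, transfer images, and query budget here are placeholders.

import torch
import torch.nn.functional as F

def build_transfer_set(victim, query_images):
    """Query the black box and keep (image, soft prediction) pairs."""
    with torch.no_grad():
        return [(x, victim(x.unsqueeze(0)).softmax(dim=-1).squeeze(0))
                for x in query_images]

def train_knockoff(knockoff, transfer_set, optimizer, epochs=10):
    """Fit the knockoff to the victim's soft predictions."""
    for _ in range(epochs):
        for x, soft_label in transfer_set:
            optimizer.zero_grad()
            log_probs = F.log_softmax(knockoff(x.unsqueeze(0)), dim=-1)
            loss = -(soft_label * log_probs.squeeze(0)).sum()  # soft-label cross-entropy
            loss.backward()
            optimizer.step()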
[dataset, complex, action, influence, multiple, recognition, prediction] [case, approach, active, construction, good, michael] [image, figure, based, presented, solely, study] [performance, architecture, accuracy, budget, adaptive, ilsvrc, output, deep, neural, table, deployed] [model, blackbox, knockoff, stealing, functionality, adversary, victim, find, reward, random, api, machine, policy, strong, evaluate, querying, query, bird, choice, create, access, probability, consider, relevant] [object, hierarchy] [set, training, transfer, learning, data, test, train, knowledge, distribution, strategy, sample, trained, task, datasets, independent, hyperparameters, label, mario, collecting]
@InProceedings{Orekondy_2019_CVPR,
  author = {Orekondy, Tribhuvanesh and Schiele, Bernt and Fritz, Mario},
  title = {Knockoff Nets: Stealing Functionality of Black-Box Models},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Embedding Learning With Discriminative Sampling Policy
Yueqi Duan, Lei Chen, Jiwen Lu, Jie Zhou


Deep embedding learning aims to learn a distance metric for effective similarity measurement, and has achieved promising performance in various tasks. As the vast majority of training samples produce gradients with magnitudes close to zero, hard example mining is usually employed to improve the effectiveness and efficiency of the training procedure. However, most existing sampling methods are designed by hand, which ignore the dependence between examples and suffer from exhaustive searching. In this paper, we propose a deep embedding with discriminative sampling policy (DE-DSP) learning framework by simultaneously training two models: a deep sampler network that learns effective sampling strategies, and a feature embedding that maps samples to the feature space. Rather than exhaustively calculating the hardness of all the examples for mining through forward-propagation, the deep sampler network exploits the strong prior of relations among samples to learn a discriminative sampling policy in a more efficient manner. Experimental results demonstrate faster convergence and stronger discriminative power of our DE-DSP framework under different embedding objectives.
[dsn, online, framework, dataset, employed, current, perform] [exhaustive, varying] [proposed, figure, based, method, meaningful, input] [deep, effective, network, compared, performance, represents, applied, batch, effectiveness, search, searching, process, number] [policy, candidate, example, strong, selecting, required] [feature, score, anchor, clothes, easy] [embedding, sampling, training, triplet, learning, loss, metric, sampler, sample, negative, hard, selected, discriminative, mining, positive, select, learned, distance, existing, learn, contrastive, train, retrieval, hardness, experimental, probe, suffer, objective, function, tested, stanford, gallery, china, strategy, large, observe, margin, lifted]
@InProceedings{Duan_2019_CVPR,
  author = {Duan, Yueqi and Chen, Lei and Lu, Jiwen and Zhou, Jie},
  title = {Deep Embedding Learning With Discriminative Sampling Policy},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Hybrid Task Cascade for Instance Segmentation
Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, Dahua Lin


Cascade is a classic yet powerful architecture that has boosted performance on various tasks. However, how to introduce cascade to instance segmentation remains an open question. A simple combination of Cascade R-CNN and Mask R-CNN only brings limited gain. In exploring a more effective approach, we find that the key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation. In this work, we propose a new framework, Hybrid Task Cascade (HTC), which differs in two important aspects: (1) instead of performing cascaded refinement on these two tasks separately, it interweaves them for a joint multi-stage processing; (2) it adopts a fully convolutional branch to provide spatial context, which can help distinguishing hard foreground from cluttered background. Overall, this framework can learn more discriminative features progressively while integrating complementary features together in each stage. Without bells and whistles, a single HTC obtains 38.4% and 1.5% improvement over a strong Cascade Mask R-CNN baseline on MSCOCO dataset. Moreover, our overall system achieves 48.6 mask AP on the test-challenge split, ranking 1st in the COCO 2018 Challenge Object Detection Task. Code is available at https://github.com/open-mmlab/mmdetection.
[flow, prediction, multiple, joint, framework] [computer, vision, pattern, direct, international, pipeline, single] [conference, ieee, figure, hybrid, study, aps, image, kai] [conv, convolutional, table, architecture, design, interleaved, achieves, neural, performance, deep, better, higher, processing, compared, network] [execution, path, preceding, simple] [mask, cascade, segmentation, semantic, box, stage, instance, object, branch, feature, detection, bounding, pool, coco, htc, spatial, fully, propose, apm, apl, kaiming, refinement, baseline, bbox, xmask, roi, adopt, ross, piotr, doll, cascaded, contextual, backbone, level, jian, jianping, wanli, adopts, improvement, improve] [task, loss, learning, representation, training]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Kai and Pang, Jiangmiao and Wang, Jiaqi and Xiong, Yu and Li, Xiaoxiao and Sun, Shuyang and Feng, Wansen and Liu, Ziwei and Shi, Jianping and Ouyang, Wanli and Change Loy, Chen and Lin, Dahua},
  title = {Hybrid Task Cascade for Instance Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Task Self-Supervised Object Detection via Recycling of Bounding Box Annotations
Wonhee Lee, Joonil Na, Gunhee Kim


In spite of the recent enormous success of deep convolutional networks in object detection, they require a large number of bounding box annotations, which are often time-consuming and error-prone to obtain. To make better use of given limited labels, we propose a novel object detection approach that takes advantage of both multi-task learning (MTL) and self-supervised learning (SSL). We propose a set of auxiliary tasks that help improve the accuracy of object detection. They create their own labels by recycling the bounding box labels (i.e. annotations of the main task) in an SSL manner, and are jointly trained with the object detection model in an MTL way. Our approach is integrable with any region-proposal-based detection model. We empirically validate that our approach effectively improves detection performance on various architectures and datasets. We test two state-of-the-art region proposal object detectors, Faster R-CNN and R-FCN, with three CNN backbones, ResNet-101, Inception-ResNet-v2, and MobileNet, on two benchmark datasets, PASCAL VOC and COCO.
[mtl, work, prediction, human, jointly, multiple, window] [approach, additional, single] [figure, image, method] [table, performance, number, network, deep, better, convolutional, imagenet, architecture, mobilenet, neural, design, accuracy] [model, visual, create, probability] [object, detection, box, proposal, closeness, labeling, refinement, feature, bounding, region, faster, baseline, foreground, three, voc, average, improve, map, context, backbone, recycling, including, cnn, pascal, coco, mask, detector, surrounding, propose, predicted] [auxiliary, task, learning, main, label, loss, training, soft, class, test, large, set, ssl, classification, train, trained, shared, unsupervised, predictor]
@InProceedings{Lee_2019_CVPR,
  author = {Lee, Wonhee and Na, Joonil and Kim, Gunhee},
  title = {Multi-Task Self-Supervised Object Detection via Recycling of Bounding Box Annotations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ClusterNet: Deep Hierarchical Cluster Network With Rigorously Rotation-Invariant Representation for Point Cloud Analysis
Chao Chen, Guanbin Li, Ruijia Xu, Tianshui Chen, Meng Wang, Liang Lin


Current neural networks for 3D object recognition are vulnerable to 3D rotation. Existing works mostly rely on massive amounts of rotation-augmented data to alleviate the problem, which lacks a solid guarantee of 3D rotation invariance. In this paper, we address the issue by introducing a novel point cloud representation that can be mathematically proved rigorously rotation-invariant, i.e., identical point clouds in different orientations are unified as a unique and consistent representation. Moreover, the proposed representation is conditionally information-lossless, because it retains all necessary information of the point cloud except for orientation information. In addition, the proposed representation is complementary to existing network architectures for point clouds and fundamentally improves their robustness against rotation transformation. Finally, we propose a deep hierarchical cluster network called ClusterNet to better adapt to the proposed representation. We employ hierarchical clustering to explore and exploit the geometric structure of the point cloud, which is embedded in a hierarchical structure tree. Extensive experimental results show that our proposed method greatly outperforms state-of-the-art methods in rotation robustness on rotation-augmented 3D object classification benchmarks.
[graph, extract] [point, rotation, rri, cloud, tpi, pij, clusternet, rigorously, pointnet, dgcnn, local, property, spherical, relative, neighborhood, pattern, euler, computer, geometric, pik, equation, vision, permutation, angle, tik, edgeconv, special, corresponding] [proposed, method, mapping, input, arbitrary, transformation, conference, figure, ieee, based] [neural, network, deep, structure, table, employ, max, operator, aggregation, called, order, design, apply, convolutional, original, rigorous] [robustness, model, tree, unique] [hierarchical, object, feature, propose, improves, spatial, foundation] [representation, set, cluster, data, clustering, invariance, classification, learning, test, learn, neighbor, novel, existing, function, training, partition, augmentation, space]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Chao and Li, Guanbin and Xu, Ruijia and Chen, Tianshui and Wang, Meng and Lin, Liang},
  title = {ClusterNet: Deep Hierarchical Cluster Network With Rigorously Rotation-Invariant Representation for Point Cloud Analysis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Learn Relation for Important People Detection in Still Images
Wei-Hong Li, Fa-Ting Hong, Wei-Shi Zheng


Humans can easily recognize the importance of people in social event images, and they always focus on the most important individuals. However, learning to learn the relation between people in an image, and inferring the most important person based on this relation, remains undeveloped. In this work, we propose a deep imPOrtance relatIon NeTwork (POINT) that combines both relation modeling and feature learning. In particular, we infer two types of interaction modules: the person-person interaction module that learns the interaction between people and the event-person interaction module that learns to describe how a person is involved in the event occurring in an image. We then estimate the importance relations among people from both interactions and encode the relation feature from the importance relations. In this way, POINT automatically learns several types of relation features in parallel, and we aggregate these relation features and the person's feature to form the importance feature for important people classification. Extensive experimental results show that our method is effective for important people detection and verify the efficacy of learning to learn relations for important people detection.
[people, interaction, dataset, event, ncaa, graph, modeling, eji, work, dot, video, fglobal, exterior, fir, interacting, extracting, fio, learns, individual, submodules, personrank, focus] [point, computer, vision, estimating, international, pattern, estimate] [figure, method, image, appearance, conference, proposed, patch, prior] [table, deep, effective, network, automatically, customized, neural] [model, attention, machine, encode, visual, introduce, evaluating, indicates, probability, encoding, evaluation, inferring, infer, describe] [relation, feature, person, module, detection, map, global, detected, object, bounding, box, involved, location, interior, detecting, baseline, improvement, occurring] [learning, classification, learn, representation, function, additive, product, datasets, reported, set, exploit, data]
@InProceedings{Li_2019_CVPR,
  author = {Li, Wei-Hong and Hong, Fa-Ting and Zheng, Wei-Shi},
  title = {Learning to Learn Relation for Important People Detection in Still Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-Grained Image Recognition
Heliang Zheng, Jianlong Fu, Zheng-Jun Zha, Jiebo Luo


Learning subtle yet discriminative features (e.g., beak and eyes for a bird) plays a significant role in fine-grained image recognition. Existing attention-based approaches localize and amplify significant parts to learn fine-grained details, which often suffer from a limited number of parts and heavy computational cost. In this paper, we propose to learn such fine-grained features from hundreds of part proposals by Trilinear Attention Sampling Network (TASN) in an efficient teacher-student manner. Specifically, TASN consists of 1) a trilinear attention module, which generates attention maps by modeling the inter-channel relationships, 2) an attention-based sampler which highlights attended parts with high resolution, and 3) a feature distiller, which distills part features into an object-level feature by weight sharing and feature preserving strategies. Extensive experiments verify that TASN yields the best performance under the same settings as the most competitive approaches on the iNaturalist-2017, CUB-Bird, and Stanford-Cars datasets.
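A minimal sketch of what an inter-channel ("trilinear") attention computation could look like, assuming the feature map is reshaped to channels x locations and re-weighted by softmax-normalized channel relations. This is one interpretation of the abstract, not necessarily the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def trilinear_attention(feat):
    """Sketch of trilinear attention over a conv feature map.

    feat: (C, H, W) tensor. Inter-channel relations softmax(X X^T) are used
    to re-weight the spatial maps, giving one attention map per channel.
    """
    C, H, W = feat.shape
    X = feat.reshape(C, H * W)           # (C, HW)
    rel = F.softmax(X @ X.t(), dim=1)    # (C, C) inter-channel relationships
    att = rel @ X                        # (C, HW) attended maps
    return att.reshape(C, H, W)

maps = trilinear_attention(torch.randn(64, 14, 14))
```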
[multiple, recognition, learns, work, outperforms] [single, note, approach, well] [image, input, proposed, high, figure, resolution, subtle, conduct, result, based] [tasn, trilinear, convolutional, table, channel, accuracy, compared, network, neural, ssn, performance, xxt, better, pooling, normalization, deep, computational, best, effectiveness, weight, extensive, structure] [attention, indicates, visual, model, relationship, generates, mechanism, sampled, attended] [feature, module, map, spatial, improve, backbone, global, three, categorization, finegrained] [sampling, learning, discriminative, sampler, learn, knowledge, large, classification, randomly, distilling, function, soft, training, learned, select, specific]
@InProceedings{Zheng_2019_CVPR,
  author = {Zheng, Heliang and Fu, Jianlong and Zha, Zheng-Jun and Luo, Jiebo},
  title = {Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-Grained Image Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Similarity Loss With General Pair Weighting for Deep Metric Learning
Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, Matthew R. Scott


A family of loss functions built on pair-based computation has been proposed in the literature, providing a myriad of solutions for deep metric learning. In this paper, we provide a general weighting framework for understanding recent pair-based loss functions. Our contributions are three-fold: (1) we establish a General Pair Weighting (GPW) framework, which casts the sampling problem of deep metric learning into a unified view of pair weighting through gradient analysis, providing a powerful tool for understanding recent pair-based loss functions; (2) we show that with GPW, various existing pair-based methods can be compared and discussed comprehensively, with clear differences and key limitations identified; (3) we propose a new loss called multi-similarity loss (MS loss) under the GPW, which is implemented in two iterative steps (i.e., mining and weighting). This allows it to fully consider three similarities for pair weighting, providing a more principled approach for collecting and weighting informative pairs. Finally, the proposed MS loss obtains new state-of-the-art performance on four image retrieval benchmarks, where it outperforms the most recent approaches, such as ABE [14] and HTL [4], by a large margin, e.g., 60.6% to 65.7% on CUB200, and 80.9% to 88.0% on the In-Shop Clothes Retrieval dataset at Recall@1.
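The MS loss can be written, per anchor, as a pair of soft-weighted log-sum-exp terms over positive and negative pair similarities. The sketch below omits the paper's pair-mining step, and the hyperparameter values are illustrative defaults rather than prescribed settings.

```python
import torch

def multi_similarity_loss(emb, labels, alpha=2.0, beta=50.0, lam=1.0):
    """Sketch of the MS loss (mining step omitted); emb is L2-normalized.

    For each anchor i, positives are softly pulled toward the margin lam
    and negatives softly pushed below it, via log-sum-exp weighting.
    """
    sim = emb @ emb.t()                         # cosine similarities
    n = emb.size(0)
    losses = []
    for i in range(n):
        pos = sim[i][(labels == labels[i]) & (torch.arange(n) != i)]
        neg = sim[i][labels != labels[i]]
        if len(pos) == 0 or len(neg) == 0:
            continue
        pos_term = torch.log(1 + torch.exp(-alpha * (pos - lam)).sum()) / alpha
        neg_term = torch.log(1 + torch.exp(beta * (neg - lam)).sum()) / beta
        losses.append(pos_term + neg_term)
    return torch.stack(losses).mean()

emb = torch.nn.functional.normalize(torch.randn(8, 16), dim=1)
loss = multi_similarity_loss(emb, torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))
```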
[key, framework, considering, recognition, focus, outperforms] [relative, computed, wij, general, problem, single, provide, allows] [proposed, based, method, image] [deep, weight, performance, structure, table, scheme, compared, number, gradient, neural, network, best, applied, size, analyze, achieves, obtains] [model, iterative, consider, considers, understanding] [three, anchor, instance, propose, fully, comparing, object, assigned, neighboring, clothes] [loss, pair, negative, weighting, learning, metric, positive, similarity, sij, embedding, mining, triplet, binomial, sampling, lifted, informative, deviance, contrastive, selected, retrieval, training, hard, existing, large, cosine, liftedstruct, function, sample, log, distance, set, gpw, unified]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Xun and Han, Xintong and Huang, Weilin and Dong, Dengke and Scott, Matthew R.},
  title = {Multi-Similarity Loss With General Pair Weighting for Deep Metric Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Domain-Symmetric Networks for Adversarial Domain Adaptation
Yabin Zhang, Hui Tang, Kui Jia, Mingkui Tan


Unsupervised domain adaptation aims to learn a classifier for unlabeled samples on the target domain, given training data of labeled samples on the source domain. Impressive progress has been made recently by learning invariant features via domain-adversarial training of deep networks. In spite of this recent progress, domain adaptation is still limited in achieving invariance of feature distributions at a finer, category level. To this end, we propose in this paper a new domain adaptation method called Domain-Symmetric Networks (SymNets). The proposed SymNet is based on a symmetric design of source and target task classifiers, based on which we also construct an additional classifier that shares its layer neurons with them. To train the SymNet, we propose a novel adversarial learning objective whose key design is based on a two-level domain confusion scheme, where the category-level confusion loss improves over the domain-level one by driving the learning of intermediate network features to be invariant at the corresponding categories of the two domains. Both domain discrimination and domain confusion are implemented based on the constructed additional classifier. Since target samples are unlabeled, we also propose a scheme of cross-domain training to help learn the target classifier. Careful ablation studies show the efficacy of our proposed method. In particular, based on commonly used base networks, our SymNets achieve the new state of the art on three benchmark domain adaptation datasets.
[dataset, joint] [corresponding, computer, well, international, vision, pattern, additional, note] [based, conference, proposed, ieee, figure, method, denoted] [deep, design, network, performance, table, neural, achieve] [adversarial, machine, model] [feature, category, propose, three, ablation, benchmark, improves, help] [domain, target, confusion, source, task, adaptation, classifier, training, learning, symnets, loss, unsupervised, data, learn, pst, extractor, min, labeled, objective, alignment, etask, entropy, existing, minimization, invariant, symnet, novel, discrimination, train, discrepancy, large, classification, transfer, learned, risk, trained, adapted]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Yabin and Tang, Hui and Jia, Kui and Tan, Mingkui},
  title = {Domain-Symmetric Networks for Adversarial Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
End-To-End Supervised Product Quantization for Image Search and Retrieval
Benjamin Klein, Lior Wolf


Product Quantization, a dictionary based hashing method, is one of the leading unsupervised hashing techniques. While it ignores the labels, it harnesses the features to construct look up tables that can approximate the feature space. In recent years, several works have achieved state of the art results on hashing benchmarks by learning binary representations in a supervised manner. This work presents Deep Product Quantization (DPQ), a technique that leads to more accurate retrieval and classification than the latest state of the art methods, while having similar computational complexity and memory footprint as the Product Quantization method. To our knowledge, this is the first work to introduce a dictionary-based representation that is inspired by Product Quantization and which is learned end-to-end, and thus benefits from the supervised signal. DPQ explicitly learns soft and hard representations to enable an efficient and accurate asymmetric search, by using a straight-through estimator. Our method obtains state of the art results on an extensive array of retrieval and classification experiments.
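A minimal sketch of the soft/hard assignment idea with a straight-through estimator for one sub-quantizer. The distance measure, temperature, and codebook handling here are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def soft_hard_quantize(x, codebook, tau=1.0):
    """Sketch of product-quantization-style soft/hard assignment with a
    straight-through estimator, for one sub-space.

    x:        (B, D) feature slice for one sub-quantizer
    codebook: (K, D) learnable codewords
    """
    dist = torch.cdist(x, codebook)          # (B, K) distances to codewords
    probs = F.softmax(-dist / tau, dim=1)    # soft assignment
    soft = probs @ codebook                  # soft representation
    hard_idx = probs.argmax(dim=1)
    hard = codebook[hard_idx]                # hard representation
    # straight-through: forward uses hard codes, gradient flows through soft
    ste = soft + (hard - soft).detach()
    return soft, ste, hard_idx

soft, ste, idx = soft_hard_quantize(torch.randn(4, 32), torch.randn(16, 32))
```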
[state, joint, performing, dataset, work, previous, learns] [technique, computer, cartesian, computed] [image, method, central, ieee, based, comparison] [quantization, deep, performance, search, binary, compressed, number, imagenet, inspired, table, efficient, achieves, normalization, improving, optimized, architecture, layer, approximated, original] [vector, encoding, probability, model, memory, encoded] [art, center, map, category, baseline, feature] [retrieval, dpq, hard, product, soft, representation, asymmetric, supervised, learning, distance, loss, training, subic, hashing, symmetric, embedding, unsupervised, classification, learned, protocol, trained, function, softmax, cluster, sample, hamming, hash, distribution, hot, learn, space]
@InProceedings{Klein_2019_CVPR,
  author = {Klein, Benjamin and Wolf, Lior},
  title = {End-To-End Supervised Product Quantization for Image Search and Retrieval},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Learn From Noisy Labeled Data
Junnan Li, Yongkang Wong, Qi Zhao, Mohan S. Kankanhalli


Despite the success of deep neural networks (DNNs) in image classification tasks, the human-level performance relies on massive training data with high-quality manual annotations, which are expensive and time-consuming to collect. There exist many inexpensive data sources on the web, but they tend to contain inaccurate labels. Training on noisy labeled datasets causes performance degradation because DNNs can easily overfit to the label noise. To overcome this problem, we propose a noise-tolerant training algorithm, where a meta-learning update is performed prior to the conventional gradient update. The proposed meta-learning method simulates actual training by generating synthetic noisy labels, and trains the model such that after one gradient update using each set of synthetic noisy labels, the model does not overfit to the specific noise. We conduct extensive experiments on the noisy CIFAR-10 dataset and the Clothing1M dataset. The results demonstrate the advantageous performance of the proposed method compared to several state-of-the-art baselines.
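The meta-update can be sketched as: synthesize noisy labels, take a virtual gradient step on them, and penalize divergence of the updated model from a teacher. The snippet below is a simplified illustration; unlike the actual method, it does not backpropagate through the inner update, and all names and constants are assumptions.

```python
import copy
import torch
import torch.nn.functional as F

def meta_noise_tolerant_step(model, teacher, x, y, lr_inner=0.1, num_flips=2):
    """Sketch of one noise-simulating meta-learning step.

    Flip a few labels to synthesize noise, take a virtual gradient step on
    the noisy labels, then ask the updated model to stay consistent with a
    teacher's predictions (KL divergence)."""
    y_noisy = y.clone()
    idx = torch.randperm(len(y))[:num_flips]
    y_noisy[idx] = torch.randint(0, int(y.max()) + 1, (num_flips,))

    virtual = copy.deepcopy(model)                      # virtual copy of the model
    loss = F.cross_entropy(virtual(x), y_noisy)
    grads = torch.autograd.grad(loss, list(virtual.parameters()))
    with torch.no_grad():                               # one virtual SGD step
        for p, g in zip(virtual.parameters(), grads):
            p -= lr_inner * g

    # meta objective: the updated model should not drift from the teacher
    meta_loss = F.kl_div(F.log_softmax(virtual(x), dim=1),
                         F.softmax(teacher(x), dim=1), reduction='batchmean')
    return meta_loss
```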
[updated, perform, multiple, joint, previous, forward, expensive] [optimization, consistent, algorithm, robust] [noise, method, synthetic, proposed, consistency, figure, clean, based, image, conduct, correction] [accuracy, network, performance, deep, gradient, neural, compared, table, number, original, size, full, dnns, better, ratio] [model, random, generate, iterative, requires, probability] [propose, three] [training, label, noisy, learning, loss, data, update, teacher, cross, classification, test, entropy, set, learn, conventional, train, trained, labeled, asymmetric, mentor, mlnt, overfit, student, lmeta, symmetric, transfer, cat, meta, datasets, knowledge, distribution, sample, minimize, automobile, truck]
@InProceedings{Li_2019_CVPR,
  author = {Li, Junnan and Wong, Yongkang and Zhao, Qi and Kankanhalli, Mohan S.},
  title = {Learning to Learn From Noisy Labeled Data},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DSFD: Dual Shot Face Detector
Jian Li, Yabiao Wang, Changan Wang, Ying Tai, Jianjun Qian, Jian Yang, Chengjie Wang, Jilin Li, Feiyue Huang


Recently, the Convolutional Neural Network (CNN) has achieved great success in face detection. However, face detection remains a challenging problem for current methods owing to the high degree of variability in scale, pose, occlusion, expression, appearance and illumination. In this paper, we propose a novel detection network named Dual Shot Face Detector (DSFD), which inherits the architecture of SSD and introduces a Feature Enhance Module (FEM) for transferring the original feature maps, extending the single shot detector to a dual shot detector. Specifically, a progressive anchor loss (PAL) computed using two sets of anchors is adopted to effectively facilitate the features. Additionally, we propose an improved anchor matching (IAM) method by integrating novel data augmentation techniques and an anchor design strategy in our DSFD to provide better initialization for the regressor. Extensive experiments on popular benchmarks, WIDER FACE (easy: 0.966, medium: 0.957, hard: 0.904) and FDDB (discontinuous: 0.991, continuous: 0.862), demonstrate the superiority of DSFD over the state-of-the-art face detection methods (e.g., PyramidBox and SRN). Code will be made available upon publication.
[second, recognition, current, wang, work] [computer, matching, vision, pattern, international, single, ground, truth, occlusion] [face, conference, ieee, based, image, figure, proposed, high, dual] [progressive, better, original, number, network, convolutional, table, design, size, performance, smaller, layer, initialization, neural, effectiveness, dilation, cell, scale, popular] [improved, indicates, named] [anchor, feature, detection, dsfd, enhance, module, object, jian, propose, three, wider, detector, fem, region, enhanced, improve, rfb, bounding, pyramid, map, easy, roc, matched, fssd, ross, faster, stage, fpn, hierarchical] [loss, shot, set, hard, data, learning, classification, medium, novel, strategy, assign, training, positive]
@InProceedings{Li_2019_CVPR,
  author = {Li, Jian and Wang, Yabiao and Wang, Changan and Tai, Ying and Qian, Jianjun and Yang, Jian and Wang, Chengjie and Li, Jilin and Huang, Feiyue},
  title = {DSFD: Dual Shot Face Detector},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Label Propagation for Deep Semi-Supervised Learning
Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Ondrej Chum


Semi-supervised learning is becoming increasingly important because it can combine data carefully labeled by humans with abundant unlabeled data to train deep neural networks. Classic methods on semi-supervised learning that have focused on transductive learning have not been fully exploited in the inductive framework followed by modern deep learning. The same holds for the manifold assumption---that similar examples should get the same prediction. In this work, we employ a transductive label propagation method that is based on the manifold assumption to make predictions on the entire dataset and use these predictions to generate pseudo-labels for the unlabeled data and train a deep neural network. At the core of the transductive method lies a nearest neighbor graph of the dataset that we create based on the embeddings of the same network. Therefore our learning process iterates between these two steps. We improve performance on several datasets especially in the few labels regime and show that our work is complementary to current state of the art.
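The transductive step can be sketched with the classic closed-form diffusion Z = (I - alpha * W)^{-1} Y on a kNN affinity graph built from the network's embeddings; the diffused scores then provide pseudo-labels for the unlabeled images. The k, alpha, and graph construction details below are illustrative, and the closed-form solve stands in for the paper's large-scale solver.

```python
import numpy as np

def propagate_labels(features, labels, labeled_mask, k=5, alpha=0.99):
    """Sketch of transductive label propagation on a kNN graph.

    features: (N, D) embeddings; labels: (N,) ints (ignored where unlabeled);
    labeled_mask: (N,) bool. Returns pseudo-labels from the diffused scores.
    """
    N = len(features)
    sim = features @ features.T
    W = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(-sim[i])[1:k + 1]       # k nearest neighbors (skip self)
        W[i, nn] = np.maximum(sim[i, nn], 0)
    W = np.maximum(W, W.T)                      # symmetrize the affinity graph
    d = W.sum(1) + 1e-8
    W = W / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]   # normalized affinity

    C = labels.max() + 1
    Y = np.zeros((N, C))
    Y[labeled_mask, labels[labeled_mask]] = 1           # one-hot seeds
    Z = np.linalg.solve(np.eye(N) - alpha * W, Y)       # closed-form diffusion
    return Z.argmax(1)                                  # pseudo-labels
```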
[propagation, work, graph, term, dataset, prediction, perform, temporal, second, consists] [approach, matrix, error, defined, optimization, normalized] [image, method, figure, proposed, consistency, prior] [network, deep, performance, performed, weight, epoch, rate, neural, entire, process, efficient, standard, table, original, number, applied, output] [example, random, iterative, visual, infer] [fully, affinity, complementary, feature] [learning, labeled, training, loss, label, unlabeled, data, supervised, class, unsupervised, diffusion, set, transductive, train, ssl, nearest, certainty, large, classification, trained, neighbor, test, assign, function, entropy, semisupervised, datasets, classifier, combination, setup, main]
@InProceedings{Iscen_2019_CVPR,
  author = {Iscen, Ahmet and Tolias, Giorgos and Avrithis, Yannis and Chum, Ondrej},
  title = {Label Propagation for Deep Semi-Supervised Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Global Generalized Gaussian Networks
Qilong Wang, Peihua Li, Qinghua Hu, Pengfei Zhu, Wangmeng Zuo


Recently, global covariance pooling (GCP) has shown great advances in improving the classification performance of deep convolutional neural networks (CNNs). However, existing deep GCP networks compute covariance pooling of convolutional activations under the assumption that activations are sampled from Gaussian distributions, which may not hold in practice and fails to fully characterize the statistics of activations. To handle this issue, this paper proposes a novel deep global generalized Gaussian network (3G-Net), whose core is to estimate a global covariance of a generalized Gaussian for modeling the last convolutional activations. Compared with GCP in the Gaussian setting, our 3G-Net assumes the distribution of activations follows a generalized Gaussian, which can capture more precise characteristics of activations. However, there exists no analytic solution for the parameter estimation of a generalized Gaussian, making our 3G-Net challenging. To this end, we first present a novel regularized maximum likelihood estimator for robustly estimating the covariance of a generalized Gaussian, which can be optimized by a modified iterative re-weighted method. Then, to efficiently estimate the covariance of a generalized Gaussian under deep CNN architectures, we approximate this re-weighted method by developing an unrolling re-weighted module and a square root covariance layer. In this way, 3G-Net can be flexibly trained in an end-to-end manner. The experiments are conducted on the large-scale ImageNet-1K and Places365 datasets, and the results demonstrate that our 3G-Net outperforms its counterparts while achieving very competitive performance with the state of the art.
[outperforms] [square, matrix, robust, estimation, compute, estimator, estimate, solution, note, estimating] [proposed, method, based, image, figure, comparison] [covariance, gaussian, deep, pooling, convolutional, unrolling, root, layer, convolution, network, compared, iteration, cnns, effectiveness, block, table, performance, neural, better, achieve, gcp, parameter, approximate, number, original, bilinear, achieves, tensor, employ, conv, compare, plain, superior, regularized, channel] [attention, iterative, modified, visual, summarize] [global, module, cnn, propose, improve, backbone] [generalized, gap, distribution, training, classification, multivariate, function, set]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Qilong and Li, Peihua and Hu, Qinghua and Zhu, Pengfei and Zuo, Wangmeng},
  title = {Deep Global Generalized Gaussian Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Semantically Tied Paired Cycle Consistency for Zero-Shot Sketch-Based Image Retrieval
Anjan Dutta, Zeynep Akata


Zero-shot sketch-based image retrieval (SBIR) is an emerging task in computer vision, allowing the retrieval of natural images relevant to sketch queries that might not have been seen in the training phase. Existing works either require aligned sketch-image pairs or an inefficient memory fusion layer for mapping the visual information to a semantic space. In this work, we propose a semantically aligned paired cycle-consistent generative (SEM-PCYC) model for zero-shot SBIR, where each branch maps the visual information to a common semantic space via adversarial training. Each of these branches maintains a cycle consistency that only requires supervision at the category level, and avoids the need for highly-priced aligned sketch-image pairs. A classification criterion on the generators' outputs ensures the visual-to-semantic space mapping to be discriminating. Furthermore, we propose to combine textual and hierarchical side information via a feature selection auto-encoder that selects discriminating side information within the same end-to-end model. Our results demonstrate a significant boost in zero-shot SBIR performance over the state-of-the-art on the challenging Sketchy and TU-Berlin datasets.
[dataset] [consistent] [image, side, cycle, consistency, based, mapping, generative, paired, figure, proposed, input, ladv] [deep, original, performance, table, output, network, selection, reducing, neural, fim, precision] [model, adversarial, visual, common, semantically, natural, modality, discriminator] [semantic, aligned, hierarchical, feature, branch, propose, category, map, object, supervision] [sketch, sbir, learning, retrieval, space, loss, unseen, sketchy, training, gim, gsk, class, zsl, similarity, classification, learned, embedding, domain, dse, data, set, representation, tao, fsk, dsk, zeynep, discriminative, learn, dimension, timothy, embeddings, gap]
@InProceedings{Dutta_2019_CVPR,
  author = {Dutta, Anjan and Akata, Zeynep},
  title = {Semantically Tied Paired Cycle Consistency for Zero-Shot Sketch-Based Image Retrieval},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Context-Aware Crowd Counting
Weizhe Liu, Mathieu Salzmann, Pascal Fua


State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density. They typically use the same filters over the whole image or over large image patches. Only then do they estimate local scale to compensate for perspective distortion. This is typically achieved by training an auxiliary classifier to select, for predefined image patches, the best kernel size among a limited set of choices. As such, these methods are not end-to-end trainable and restricted in the scope of context they can leverage. In this paper, we introduce an end-to-end trainable deep architecture that combines features obtained using multiple receptive field sizes and learns the importance of each such feature at each image location. In other words, our approach adaptively encodes the scale of the contextual information required to accurately predict crowd density. This yields an algorithm that outperforms state-of-the-art crowd counting methods, especially when perspective effects are strong.
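A rough sketch of the scale-aware context idea: pool the backbone features at several scales, upsample, and form contrast features against the local features. In the actual model the per-pixel combination weights are learned; here the contrast features are simply averaged for illustration, and the scale set is an assumption.

```python
import torch
import torch.nn.functional as F

def context_aware_features(feat, scales=(1, 2, 3, 6)):
    """Sketch of scale-aware context aggregation for crowd counting.

    feat: (B, C, H, W) backbone feature map (e.g. from VGG). For each scale,
    average-pool, upsample back, and take the difference with the local
    features ('contrast'); the contrasts are then averaged and concatenated.
    """
    B, C, H, W = feat.shape
    contrasts = []
    for s in scales:
        pooled = F.adaptive_avg_pool2d(feat, s)
        up = F.interpolate(pooled, size=(H, W), mode='bilinear',
                           align_corners=False)
        contrasts.append(up - feat)
    context = torch.stack(contrasts, 0).mean(0)
    return torch.cat([feat, context], dim=1)

out = context_aware_features(torch.randn(1, 512, 32, 32))
```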
[people, multiple, predict, explicitly, ucf, work] [computer, vision, pattern, perspective, contrast, ground, approach, international, rely, field, account, local, calibration, estimation, estimate, single, corresponding, scene, estimated, compute, note, geometry, range, plane] [image, conference, input, figure] [density, network, scale, conv, deep, convolutional, receptive, size, pooling, table, vgg, architecture, discussed, neural, number, original, compare, adaptively, better, venice, weight, andrew, gaussian, rapid] [model, introduce, encodes] [crowd, counting, map, contextual, feature, context, detection, european, average, three, mcnn, crowded, final] [training, learning, learn, set, loss, comparative, datasets, exploit, data]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Weizhe and Salzmann, Mathieu and Fua, Pascal},
  title = {Context-Aware Crowd Counting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Detect-To-Retrieve: Efficient Regional Aggregation for Image Search
Marvin Teichmann, Andre Araujo, Menglong Zhu, Jack Sim


Retrieving object instances among cluttered scenes efficiently requires compact yet comprehensive regional image representations. Intuitively, object semantics can help build the index that focuses on the most relevant regions. However, due to the lack of bounding-box datasets for objects of interest among retrieval benchmarks, most recent work on regional representations has focused on either uniform or class-agnostic region selection. In this paper, we first fill the void by providing a new dataset of landmark bounding boxes, based on the Google Landmarks dataset, that includes 94k images with manually curated boxes from 15k unique landmarks. Then, we demonstrate how a trained landmark detector, using our new dataset, can be leveraged to index image regions and improve retrieval accuracy while being much more efficient than existing regional methods. In addition, we introduce a novel regional aggregated selective match kernel (R-ASMK) to effectively combine information from detected regions into an improved holistic image representation. R-ASMK boosts image retrieval accuracy substantially with no dimensionality increase, while even outperforming systems that index image regions independently. Our complete image retrieval system improves upon the previous state-of-the-art by significant margins on the Revisited Oxford and Paris datasets. Code and data will be released.
[dataset, work, independently] [local, match, note, matching, single, case] [image, landmark, database, proposed, method, produce, based, figure] [aggregation, search, performance, selection, kernel, number, deep, small, efficient, compared, compact, implementation, codebook, pooling, effective, convolutional, precision] [query, visual, memory, relevant, improved, system, introduce, word, common, evaluation] [regional, map, region, aggregated, object, asmk, google, detection, improve, feature, average, bounding, detected, box, selective, detector, vlad, cnn, roxf, improves, grid, leverage, spatial] [retrieval, similarity, trained, large, set, representation, selected, hard, experimental, function, datasets, main, learning]
@InProceedings{Teichmann_2019_CVPR,
  author = {Teichmann, Marvin and Araujo, Andre and Zhu, Menglong and Sim, Jack},
  title = {Detect-To-Retrieve: Efficient Regional Aggregation for Image Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Towards Accurate One-Stage Object Detection With AP-Loss
Kean Chen, Jianguo Li, Weiyao Lin, John See, Ji Wang, Lingyu Duan, Zhibo Chen, Changwei He, Junni Zou


One-stage object detectors are trained by optimizing classification-loss and localization-loss simultaneously, with the former suffering greatly from the extreme foreground-background class imbalance issue caused by the large number of anchors. This paper alleviates this issue by proposing a novel framework to replace the classification task in one-stage detectors with a ranking task, and adopting the Average-Precision loss (AP-loss) for the ranking problem. Due to its non-differentiability and non-convexity, the AP-loss cannot be optimized directly. For this purpose, we develop a novel optimization algorithm, which seamlessly combines the error-driven update scheme in perceptron learning and the backpropagation algorithm in deep networks. We verify the good convergence property of the proposed algorithm theoretically and empirically. Experimental results demonstrate a notable performance improvement in state-of-the-art one-stage detectors based on AP-loss over different kinds of classification losses on various benchmarks, without changing the network architectures.
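For intuition, AP over ranked anchor scores can be computed as below; because the ranking indicators are non-differentiable, the paper replaces plain backpropagation with an error-driven update, which this sketch does not reproduce. The tie-handling convention here is an illustrative choice.

```python
import torch

def average_precision_loss(scores, labels):
    """Sketch of a ranking-based AP objective for one image.

    scores: (N,) anchor classification scores; labels: (N,) in {0, 1}
    (1 = foreground). Precision is evaluated at each positive's rank and
    averaged; 1 - AP acts as the loss value.
    """
    pos = scores[labels == 1]
    if pos.numel() == 0:
        return scores.sum() * 0.0
    # rank of each positive among all anchors / among positives only
    rank_all = (scores.unsqueeze(0) >= pos.unsqueeze(1)).float().sum(1)
    rank_pos = (pos.unsqueeze(0) >= pos.unsqueeze(1)).float().sum(1)
    ap = (rank_pos / rank_all).mean()
    return 1.0 - ap

l = average_precision_loss(torch.randn(100), (torch.rand(100) > 0.9).long())
```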
[] [algorithm, optimization, focal, xij, directly, linear, good, approach, form, equation] [based, proposed, figure, method, interpolated, image, input, difference, study] [gradient, performance, activation, table, approximate, scheme, better, optimize, descent, structured, replace, weight, original] [step, bowl, model, teddy, bear, evaluation, book] [object, person, detection, detector, lij, box, coco, cup, score, piecewise, pascal, adopt, diningtable, retinanet, sofa, average, heaviside, anchor, voc] [loss, ranking, function, learning, training, classification, task, update, label, perceptron, class, imbalance, set, metric, yij, test, minibatch, bottle, spoon, convergence, positive, large, novel, hinge, objective]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Kean and Li, Jianguo and Lin, Weiyao and See, John and Wang, Ji and Duan, Lingyu and Chen, Zhibo and He, Changwei and Zou, Junni},
  title = {Towards Accurate One-Stage Object Detection With AP-Loss},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
On Exploring Undetermined Relationships for Visual Relationship Detection
Yibing Zhan, Jun Yu, Ting Yu, Dacheng Tao


In visual relationship detection, human-annotated relationships can be regarded as determinate relationships. However, there is still a large amount of unlabeled data, such as object pairs with less significant relationships or even with no relationships. We refer to these unlabeled but potentially useful data as undetermined relationships. Although a vast body of literature exists, few methods exploit these undetermined relationships for visual relationship detection. In this paper, we explore the beneficial effect of undetermined relationships on visual relationship detection. We propose a novel multi-modal feature based undetermined relationship learning network (MF-URLN) and achieve great improvements in relationship detection. In detail, our MF-URLN automatically generates undetermined relationships by comparing object pairs with human-annotated data according to a designed criterion. Then, the MF-URLN extracts and fuses features of object pairs from three complementary modals: visual, spatial, and linguistic modals. Further, the MF-URLN proposes two correlated subnetworks: one subnetwork decides the determinate confidence, and the other predicts the relationships. We evaluate the MF-URLN on two datasets: the Visual Relationship Detection (VRD) and the Visual Genome (VG) datasets. The experimental results compared with state-of-the-art methods verify the significant improvements made by the undetermined relationships, e.g., the top-50 relation detection recall improves from 19.5% to 23.9% on the VRD dataset.
[subject, dataset, internal] [computer, confidence, pattern, sky, vision, international, corresponding, defined, directly] [conference, proposed, ieee, based, image, result, transforming, figure, generator] [network, performance, table, compared, deep, concatenating, best, better] [relationship, visual, linguistic, predicate, modal, vrd, phrase, cabinet, sink, evaluation, model, external, generated, refrigerator, classified] [undetermined, object, detection, determinate, relation, bus, person, subnetwork, car, spatial, feature, ymin, ymax, three, detected, street, faster, detector, xmin, beneficial, improves, xmax, propose, recall, detect] [learning, unlabeled, set, loss, data, training, pair, positive, datasets, novel, strategy]
@InProceedings{Zhan_2019_CVPR,
  author = {Zhan, Yibing and Yu, Jun and Yu, Ting and Tao, Dacheng},
  title = {On Exploring Undetermined Relationships for Visual Relationship Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Without Memorizing
Prithviraj Dhar, Rajat Vikram Singh, Kuan-Chuan Peng, Ziyan Wu, Rama Chellappa


Incremental learning (IL) is an important task aimed at increasing the capability of a trained model, in terms of the number of classes recognizable by the model. The key problem in this task is the requirement of storing data (e.g. images) associated with existing classes, while teaching the classifier to learn new classes. However, this is impractical as it increases the memory requirement at every incremental step, which makes it impossible to implement IL algorithms on edge devices with limited memory. Hence, we propose a novel approach, called 'Learning without Memorizing (LwM)', to preserve the information about existing (base) classes, without storing any of their data, while making the classifier progressively learn the new classes. In LwM, we present an information preserving penalty: Attention Distillation Loss (L_AD), and demonstrate that penalizing the changes in classifiers' attention maps helps to retain information of the base classes, as new classes are added. We show that adding L_AD to the distillation loss, which is an existing information preserving loss, consistently outperforms the state-of-the-art performance on the iILSVRC-small and iCIFAR-100 datasets in terms of the overall accuracy of base and incrementally learned classes.
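A minimal sketch of an attention-distillation penalty in the spirit of L_AD, comparing normalized attention maps from the previous-step (teacher) and current (student) classifiers. The exact map construction (e.g. a Grad-CAM-style map) and the choice of L1 distance are assumptions here.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(att_teacher, att_student):
    """Sketch of an attention-distillation penalty (L_AD-style).

    att_*: (B, H, W) class-specific attention maps from the teacher and
    student models. Each map is vectorized, L2-normalized, and compared
    with an L1 distance, averaged over the batch.
    """
    t = F.normalize(att_teacher.flatten(1), dim=1)
    s = F.normalize(att_student.flatten(1), dim=1)
    return (t - s).abs().sum(dim=1).mean()

loss = attention_distillation_loss(torch.rand(4, 7, 7), torch.rand(4, 7, 7))
```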
[outperforms, work, prediction, perform, dataset, previous] [problem, computer, defined, vision, approach, provide, corresponding, technique, respect, pattern] [figure, preserving, image, proposed, conference, ieee, input] [number, table, accuracy, performance, experiment, explore, convolutional, applied] [attention, model, step, visual, generated, memory, generate, retain, evaluation] [object, feature, baseline, map, propose, region, score] [base, incremental, class, loss, data, lwm, learning, teacher, trained, distillation, incrementally, lad, student, knowledge, classification, learn, training, datasets, storing, classifier, ipp, existing, icarl, forgetting, divergence, learned, belonging, task, target, catastrophic, train]
@InProceedings{Dhar_2019_CVPR,
  author = {Dhar, Prithviraj and Vikram Singh, Rajat and Peng, Kuan-Chuan and Wu, Ziyan and Chellappa, Rama},
  title = {Learning Without Memorizing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dynamic Recursive Neural Network
Qiushan Guo, Zhipeng Yu, Yichao Wu, Ding Liang, Haoyu Qin, Junjie Yan


This paper proposes the dynamic recursive neural network (DRNN), which simplifies the duplicated building blocks in a deep neural network. Different from forwarding through different blocks sequentially as in previous networks, we demonstrate that the DRNN can achieve better performance with fewer blocks by employing blocks recursively. We further add a gate structure to each block, which can adaptively decide the number of loops of the recursive blocks to reduce the computational cost. Since recursive networks are hard to train, we propose the Loopy Variable Batch Normalization (LVBN) to stabilize the volatile gradient. Further, we improve the LVBN to correct the statistical bias caused by the gate structure. Experiments show that the DRNN reduces the parameters and computational cost while consistently outperforming the original model in terms of accuracy on CIFAR-10 and ImageNet-1k. Lastly, we visualize and discuss the relation between image saliency and the number of loop times.
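A toy sketch of a recursive block with a stop gate: the shared block is reused up to a few times and a small learned gate decides whether to loop again. The Loopy Variable Batch Normalization and its bias correction are omitted, and the module layout is illustrative rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """Sketch of a dynamically recursive block: one shared residual block is
    applied repeatedly, and a gate decides after each pass whether to stop."""

    def __init__(self, channels, max_loops=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 1), nn.Sigmoid(),
        )
        self.max_loops = max_loops

    def forward(self, x):
        for _ in range(self.max_loops):
            x = x + self.block(x)             # shared (recursive) weights
            if self.gate(x).mean() < 0.5:     # gate says: stop looping
                break
        return x

y = RecursiveBlock(16)(torch.randn(2, 16, 8, 8))
```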
[dynamic, time, recurrent, recognition, early, outperforms] [loop, computer, vision, pattern, international, discrete] [conference, image, ieee, based, result, proposed, figure, caused] [recursive, network, deep, neural, convolutional, rate, block, gate, lvbn, computation, computational, unit, accuracy, imagenet, gradient, population, efficient, drnn, adaptive, residual, layer, better, inference, reducing, cost, number, structure, size, stochastic, top, reduce, reusing, validation, processing, resnet, convolution, apply, applied, shallower, group, decay, table] [model, arxiv, preprint, executed, improved, mechanism] [feature, easy, object] [learning, training, trained, target, naive, hard, set]
@InProceedings{Guo_2019_CVPR,
  author = {Guo, Qiushan and Yu, Zhipeng and Wu, Yichao and Liang, Ding and Qin, Haoyu and Yan, Junjie},
  title = {Dynamic Recursive Neural Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Destruction and Construction Learning for Fine-Grained Image Recognition
Yue Chen, Yalong Bai, Wei Zhang, Tao Mei


Delicate feature representation of object parts plays a critical role in fine-grained recognition. For example, experts can even distinguish fine-grained objects relying only on object parts according to professional knowledge. In this paper, we propose a novel "Destruction and Construction Learning" (DCL) method to enhance the difficulty of fine-grained recognition and exercise the classification model to acquire expert knowledge. Besides the standard classification backbone network, another "destruction and construction" stream is introduced to carefully "destruct" and then "reconstruct" the input image, for learning discriminative regions and features. More specifically, for "destruction", we first partition the input image into local regions and then shuffle them by a Region Confusion Mechanism (RCM). To correctly recognize these destructed images, the classification network has to pay more attention to discriminative regions for spotting the differences. To compensate for the noise introduced by RCM, an adversarial loss, which distinguishes original images from destructed ones, is applied to reject noisy patterns introduced by RCM. For "construction", a region alignment network, which tries to restore the original spatial layout of local regions, is followed to model the semantic correlation among local regions. By jointly training with parameter sharing, our proposed DCL injects more discriminative local details into the classification network. Experimental results show that our proposed framework achieves state-of-the-art performance on three standard benchmarks. Moreover, our proposed method does not need any external knowledge during training, and there is no computation overhead at inference time except the standard classification network feed-forwarding. Source code: https://github.com/JDAI-CV/DCL.
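A rough sketch of the Region Confusion Mechanism: split the image into an n x n grid and shuffle patches only within a small neighborhood, destroying global layout while keeping local detail. The swap-based shuffling below is a simplification of the paper's neighborhood-constrained random permutation, and it assumes the image size is divisible by n.

```python
import random
import torch

def region_confusion(img, n=7, radius=2):
    """Sketch of RCM-style patch shuffling on a (C, H, W) image tensor.

    Each grid cell is swapped with a randomly chosen cell inside its
    (2*radius+1) x (2*radius+1) neighborhood."""
    C, H, W = img.shape
    ph, pw = H // n, W // n
    order = list(range(n * n))
    for i in range(n * n):
        r, c = divmod(i, n)
        j = random.choice([rr * n + cc
                           for rr in range(max(0, r - radius), min(n, r + radius + 1))
                           for cc in range(max(0, c - radius), min(n, c + radius + 1))])
        order[i], order[j] = order[j], order[i]
    out = img.clone()
    for dst, src in enumerate(order):
        dr, dc = divmod(dst, n)
        sr, sc = divmod(src, n)
        out[:, dr*ph:(dr+1)*ph, dc*pw:(dc+1)*pw] = \
            img[:, sr*ph:(sr+1)*ph, sc*pw:(sc+1)*pw]
    return out

shuffled = region_confusion(torch.randn(3, 224, 224))
```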
[recognition, framework] [local, computer, vision, pattern, construction, june, international] [image, proposed, conference, ieee, method, figure, input, based, ladv, row] [network, original, convolutional, correlation, performance, neural, layer, table, accuracy, structure, deep, standard, pooling, overhead, inference, bilinear, ratio] [visual, model, adversarial, attention, rcm, introduced, mechanism, adv, vector] [region, feature, dcl, object, backbone, destructed, lcls, three, global, destruction, semantic, location, baseline, propose, extra, car, localization, role, finegrained] [learning, discriminative, classification, loss, alignment, set, trained, learn, noisy, label, confusion, training, large, representation]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Yue and Bai, Yalong and Zhang, Wei and Mei, Tao},
  title = {Destruction and Construction Learning for Fine-Grained Image Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Distraction-Aware Shadow Detection
Quanlong Zheng, Xiaotian Qiao, Ying Cao, Rynson W.H. Lau


Shadow detection is an important and challenging task for scene understanding. Despite promising results from recent deep learning based methods, existing works still struggle with ambiguous cases where the visual appearances of shadow and non-shadow regions are similar (referred to as distraction in our context). In this paper, we propose a Distraction-aware Shadow Detection Network (DSDNet) by explicitly learning and integrating the semantics of visual distraction regions in an end-to-end framework. At the core of our framework is a novel standalone, differentiable Distraction-aware Shadow (DS) module, which allows us to learn distraction-aware, discriminative features for robust shadow detection, by explicitly predicting false positives and false negatives. We conduct extensive experiments on three public shadow detection datasets, SBU, UCF and ISTD, to evaluate our method. Experimental results demonstrate that our model can boost shadow detection performance, by effectively suppressing the detection of false positives and false negatives, achieving state-of-the-art results.
[ucf, explicitly, fusion, dataset, extract, capture, multiple, work] [ground, illumination, single, vision, truth] [image, input, figure, method, proposed, based, ber, dsc, produce, color] [shadow, distraction, network, fim, deep, performance, convolutional, sbu, top, block, istd, bdrar, best, table, compare, better, adnet, output, conv, layer, architecture, size] [model, visual, attention, adversarial, evaluation, encoder, generate, ambiguous, evaluate] [detection, module, false, object, context, semantics, region, supervision, map, salient, backbone, three, feature, challenging, detect, final, semantic, score, propose, bottom, segmentation] [training, existing, learn, learning, loss, trained, extractor]
@InProceedings{Zheng_2019_CVPR,
  author = {Zheng, Quanlong and Qiao, Xiaotian and Cao, Ying and Lau, Rynson W.H.},
  title = {Distraction-Aware Shadow Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Label Image Recognition With Graph Convolutional Networks
Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, Yanwen Guo


The task of multi-label image recognition is to predict a set of object labels that are present in an image. As objects normally co-occur in an image, it is desirable to model the label dependencies to improve the recognition performance. To capture and explore such important dependencies, we propose a multi-label classification model based on Graph Convolutional Network (GCN). The model builds a directed graph over the object labels, where each node (label) is represented by the word embeddings of that label, and the GCN is learned to map this label graph into a set of inter-dependent object classifiers. These classifiers are applied to the image descriptors extracted by another sub-net, enabling the whole network to be end-to-end trainable. Furthermore, we propose a novel re-weighting scheme to create an effective label correlation matrix to guide information propagation among the nodes in the GCN. Experiments on two multi-label image recognition datasets show that our approach clearly outperforms other existing state-of-the-art methods. In addition, visualization analyses reveal that the classifiers learned by our model maintain a meaningful semantic topology.
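The core computation can be sketched as a small GCN over the label graph that turns word embeddings into per-class classifiers, which are then applied to the global image feature by a dot product. The layer sizes, activation, and identity adjacency below are placeholders; the paper's re-weighted correlation matrix would replace the identity.

```python
import torch
import torch.nn.functional as F

def gcn_layer(H, A_hat, W):
    """One GCN layer: H' = LeakyReLU(A_hat @ H @ W)."""
    return F.leaky_relu(A_hat @ H @ W, negative_slope=0.2)

def multilabel_scores(img_feat, label_emb, A_hat, W1, W2):
    """Sketch of GCN-based multi-label recognition.

    label_emb: (C, d) word embeddings of the C labels; A_hat: (C, C)
    normalized label correlation matrix. The GCN maps embeddings to per-class
    classifiers of dimension D, applied to the (B, D) global image feature.
    """
    H = gcn_layer(label_emb, A_hat, W1)     # (C, hidden)
    classifiers = A_hat @ H @ W2            # (C, D) final layer, no activation
    return img_feat @ classifiers.t()       # (B, C) label scores

C, d, hidden, D, B = 20, 300, 512, 2048, 4
scores = multilabel_scores(torch.randn(B, D), torch.randn(C, d),
                           torch.eye(C), torch.randn(d, hidden),
                           torch.randn(hidden, D))
```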
[gcn, recognition, graph, capture, report, key, explicitly] [matrix, approach, latexit, problem, phone, corresponding] [image, based, proposed, figure, method, remote, meaningful, mapping] [correlation, convolutional, performance, network, scheme, deep, number, table, binary, denotes, applied, neural, better, vanilla, explore, wei, output, structure, accuracy, resnet, cell, effective, design, order] [model, word, node, baseball, glove, ball, evaluate, query] [object, map, voc, semantic, person, feature, propose, average, visualization, snowboard, global] [label, learned, learning, set, classification, embeddings, learn, classifier, representation, dog, novel, train, backpack, china]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Zhao-Min and Wei, Xiu-Shen and Wang, Peng and Guo, Yanwen},
  title = {Multi-Label Image Recognition With Graph Convolutional Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
High-Level Semantic Feature Detection: A New Perspective for Pedestrian Detection
Wei Liu, Shengcai Liao, Weiqiang Ren, Weidong Hu, Yinan Yu


Object detection generally requires sliding-window classifiers in traditional approaches or anchor-based predictions in modern deep learning approaches. However, either of these approaches requires tedious configuration of windows or anchors. In this paper, taking pedestrian detection as an example, we provide a new perspective in which detecting objects is cast as a high-level semantic feature detection task. Like edges, corners, blobs and other feature detectors, the proposed detector scans for feature points all over the image, for which convolution is naturally suited. However, unlike these traditional low-level features, the proposed detector goes for a higher-level abstraction, that is, we look for central points where there are pedestrians, and modern deep models are already capable of such a high-level semantic abstraction. Besides, like blob detection, we also predict the scales of the pedestrian points, which is also a straightforward convolution. Therefore, in this paper, pedestrian detection is simplified into a straightforward center and scale prediction task through convolutions. In this way, the proposed method enjoys an anchor-free setting. Though structurally simple, it presents competitive accuracy and good speed on challenging pedestrian detection benchmarks, hence leading to a new, attractive pedestrian detector. Code and models will be available at https://github.com/liuwei16/CSP.
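For intuition, decoding a predicted center heatmap and scale map into boxes can be sketched as follows; the stride, threshold, and fixed pedestrian aspect ratio are illustrative constants, and non-maximum suppression is omitted.

```python
import torch

def decode_center_scale(center_map, scale_map, stride=4, thresh=0.5, ratio=0.41):
    """Sketch of decoding center/scale predictions into pedestrian boxes.

    center_map: (H, W) sigmoid confidences; scale_map: (H, W) predicted
    log-heights. Width is derived from the height with a fixed aspect ratio.
    Returns a list of (x1, y1, x2, y2, score) tuples in input-image pixels.
    """
    ys, xs = torch.nonzero(center_map > thresh, as_tuple=True)
    boxes = []
    for y, x in zip(ys.tolist(), xs.tolist()):
        h = float(torch.exp(scale_map[y, x]))
        w = ratio * h
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2,
                      float(center_map[y, x])))
    return boxes

dets = decode_center_scale(torch.rand(64, 64), torch.randn(64, 64))
```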
[prediction, predict, state, work] [computer, vision, pattern, point, ground, single, truth, provide, corresponding, defined, occlusion] [conference, image, proposed, ieee, method, based, traditional, central] [scale, performance, best, network, deep, table, width, convolutional, achieves, size, wei, neural, output, reduce] [arxiv, preprint, machine, generated] [feature, detection, pedestrian, object, center, det, csp, european, offset, height, detector, bounding, caltech, september, semantic, detecting, edge, localization, box, straightforward, tll, citypersons, anchor, faster, final, fcn, location, backbone, adopted, stage, branch, assigned, annotation] [set, learning, test, task, training, generally, loss]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Wei and Liao, Shengcai and Ren, Weiqiang and Hu, Weidong and Yu, Yinan},
  title = {High-Level Semantic Feature Detection: A New Perspective for Pedestrian Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
RepMet: Representative-Based Metric Learning for Classification and Few-Shot Object Detection
Leonid Karlinsky, Joseph Shtok, Sivan Harary, Eli Schwartz, Amit Aides, Rogerio Feris, Raja Giryes, Alex M. Bronstein


Distance metric learning (DML) has been successfully applied to object classification, both in the standard regime of rich training data and in the few-shot scenario, where each category is represented by only a few examples. In this work, we propose a new method for DML that simultaneously learns the backbone network parameters, the embedding space, and the multi-modal distribution of each of the training categories in that space, in a single end-to-end training process. Our approach outperforms state-of-the-art methods for DML-based object classification on a variety of standard fine-grained datasets. Furthermore, we demonstrate the effectiveness of our approach on the problem of few-shot object detection, by incorporating the proposed DML architecture as a classification head into a standard object detection model. We achieve the best results on the ImageNet-LOC dataset compared to strong baselines, when only a few training examples are available. We also offer the community a new episodic benchmark based on the ImageNet dataset for the few-shot object detection task.
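A minimal sketch of a mixture-of-representatives classifier: each class keeps K representative embeddings, and the class score for a query is the best-matching mode under a Gaussian kernel. The sigma value and tensor shapes are assumptions for illustration.

```python
import torch

def mixture_class_scores(emb, representatives, sigma=0.5):
    """Sketch of mixture-representative classification (RepMet-style).

    emb: (B, D) query embeddings; representatives: (C, K, D) with K modes per
    class. Class score = max over modes of exp(-d^2 / (2 * sigma^2)).
    """
    d2 = ((emb[:, None, None, :] - representatives[None]) ** 2).sum(-1)  # (B, C, K)
    return torch.exp(-d2 / (2 * sigma ** 2)).max(dim=2).values           # (B, C)

scores = mixture_class_scores(torch.randn(8, 64), torch.randn(10, 5, 64))
```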
[recognition, jointly, dataset, work, joint, learns] [computer, approach, vision, computed, pattern, simultaneously, single] [proposed, image, figure, conference, method, ieee, background, based] [network, performance, table, imagenet, deep, architecture, layer, compared, standard, neural, batch, number] [model, random, episode, vector, regular] [detection, object, detector, subnet, backbone, category, feature, propose, benchmark, module, ross] [embedding, training, class, dml, classification, learning, set, test, mixture, metric, learned, trained, space, lecture, loss, novel, train, classifier, data, unseen, reported, distance, learn, posterior, distribution, task]
@InProceedings{Karlinsky_2019_CVPR,
  author = {Karlinsky, Leonid and Shtok, Joseph and Harary, Sivan and Schwartz, Eli and Aides, Amit and Feris, Rogerio and Giryes, Raja and Bronstein, Alex M.},
  title = {RepMet: Representative-Based Metric Learning for Classification and Few-Shot Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Ranked List Loss for Deep Metric Learning
Xinshao Wang, Yang Hua, Elyor Kodirov, Guosheng Hu, Romain Garnier, Neil M. Robertson


The objective of deep metric learning (DML) is to learn embeddings that can capture semantic similarity information among data points. Existing pairwise or triplet-wise loss functions used in DML are known to suffer from slow convergence due to a large proportion of trivial pairs or triplets as the model improves. To improve this, ranking-motivated structured losses have been proposed recently to incorporate multiple examples and exploit the structured information among them. They converge faster and achieve state-of-the-art performance. In this work, we present two limitations of existing ranking-motivated structured losses and propose a novel ranked list loss to solve both of them. First, given a query, only a fraction of data points is incorporated to build the similarity structure. Consequently, some useful examples are ignored and the structure is less informative. To address this, we propose to build a set-based similarity structure by exploiting all instances in the gallery. The samples are split into a positive set and a negative set. Our objective is to make the query closer to the positive set than to the negative set by a margin. Second, previous methods aim to pull positive pairs as close as possible in the embedding space. As a result, the intra-class data distribution might be dropped. In contrast, we propose to learn a hypersphere for each class in order to preserve the similarity structure inside it. Our extensive experiments show that the proposed method achieves state-of-the-art performance on three widely used benchmarks.
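A simplified sketch of the ranked-list objective for one mini-batch: positives are pulled inside a hypersphere of radius (alpha - margin), negatives are pushed beyond alpha, and violating negatives are re-weighted by the size of their violation. All hyperparameter values are illustrative, and the batch-level sampling details are omitted.

```python
import torch

def ranked_list_loss(emb, labels, alpha=1.2, margin=0.4, temp=10.0):
    """Sketch of a ranked-list-style metric learning loss.

    emb: (B, D) L2-normalized embeddings (Euclidean distances used);
    labels: (B,) class ids. Violating negatives get exponential weights.
    """
    dist = torch.cdist(emb, emb)
    n = emb.size(0)
    losses = []
    for i in range(n):
        same = (labels == labels[i]) & (torch.arange(n) != i)
        diff = labels != labels[i]
        pos_viol = torch.clamp(dist[i][same] - (alpha - margin), min=0)
        neg_viol = torch.clamp(alpha - dist[i][diff], min=0)
        loss_i = pos_viol.sum()
        if neg_viol.sum() > 0:
            w = torch.exp(temp * neg_viol) * (neg_viol > 0).float()
            loss_i = loss_i + (w * neg_viol).sum() / w.sum()
        losses.append(loss_i)
    return torch.stack(losses).mean()

emb = torch.nn.functional.normalize(torch.randn(8, 16), dim=1)
loss = ranked_list_loss(emb, torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))
```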
[previous] [point, constraint, corresponding, algorithm, respect] [proposed, method, image, based, comparison, figure, presented, spectral, conduct] [deep, structured, table, performance, number, batch, better, size, gradient, structure, smaller, rate, original, connected, impact, achieve, layer] [query, represent, example, closer] [three, propose, anchor, feature, fully] [negative, positive, loss, learning, data, metric, embedding, rll, ranked, set, margin, similarity, class, proxy, list, learn, triplet, distance, struct, embeddings, large, training, clustering, mining, lifted, clust, pairwise, sop, nca, multilevel, pull, informative, discriminative, nmi, mine, weighting, function, dml, lrll, googlenet]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Xinshao and Hua, Yang and Kodirov, Elyor and Hu, Guosheng and Garnier, Romain and Robertson, Neil M.},
  title = {Ranked List Loss for Deep Metric Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
CANet: Class-Agnostic Segmentation Networks With Iterative Refinement and Attentive Few-Shot Learning
Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, Chunhua Shen


Recent progress in semantic segmentation is driven by deep Convolutional Neural Networks and large-scale labeled image datasets. However, data labeling for pixel-wise segmentation is tedious and costly. Moreover, a trained model can only make predictions within a set of pre-defined classes. In this paper, we present CANet, a class-agnostic segmentation network that performs few-shot segmentation on new classes with only a few annotated images available. Our network consists of a two-branch dense comparison module which performs multi-level feature comparison between the support image and the query image, and an iterative optimization module which iteratively refines the predicted results. Furthermore, we introduce an attention mechanism to effectively fuse information from multiple support examples under the setting of k-shot learning. Experiments on PASCAL VOC 2012 show that our method achieves a mean Intersection-over-Union score of 55.4% for 1-shot segmentation and 57.1% for 5-shot segmentation, outperforming state-of-the-art methods by a large margin of 14.6% and 13.2%, respectively.
[previous, dataset, multiple, fusion, performs, outperforms, consists, prediction, iteratively, work] [dense, optimization, iter] [comparison, image, method, result, figure, proposed, background] [network, deep, table, convolutional, block, compare, performance, neural, output, convolution, residual] [query, model, iterative, attention, evaluation, mechanism, generated] [segmentation, feature, module, predicted, semantic, object, bounding, box, average, annotated, canet, fuse, iom, branch, global, mask, meaniou, foreground, spatial, adopt, coco, refinement, segment, pascal, voc, area, propose, score] [support, set, learning, training, test, class, metric, trained, effectively, task]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Chi and Lin, Guosheng and Liu, Fayao and Yao, Rui and Shen, Chunhua},
  title = {CANet: Class-Agnostic Segmentation Networks With Iterative Refinement and Attentive Few-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Precise Detection in Densely Packed Scenes
Eran Goldman, Roei Herzig, Aviv Eisenschtat, Jacob Goldberger, Tal Hassner


Man-made scenes are often densely packed, containing numerous objects, often identical, positioned in close proximity. We show that precise object detection in such scenes remains a challenging frontier even for state-of-the-art object detectors. We propose a novel, deep-learning based method for precise object detection, designed for such challenging settings. Our contributions include: (1) A layer for estimating the Jaccard index as a detection quality score; (2) a novel EM merging unit, which uses our quality scores to resolve detection overlap ambiguities; finally, (3) an extensive, annotated data set, SKU-110K, representing packed retail environments, released for training and testing under such extreme settings. Detection tests on SKU-110K, and counting tests on the CARPK and PUCPR+, show our method to outperform existing state-of-the-art with substantial margins.
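For reference, the quantity that the quality layer in contribution (1) is trained to estimate is the Jaccard index (IoU) between a detection and its object. A plain computation for two axis-aligned boxes is sketched below; this is the target definition, not the paper's estimation layer.

def box_jaccard(a, b):
    # Boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0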
[despite] [pattern, vision, approach, ground, single, truth, tightly, confidence, defined] [image, method, described, high, figure, jaccard, quality, resolve] [designed, number, deep, layer, network, table, accuracy, overlap, standard, store, better, full, applied, fast, original, best] [gaussians, represent, representing] [detection, object, bounding, counting, box, packed, retinanet, objectness, overlapping, densely, detector, iou, european, ross, retail, carpk, predicted, crowded, ciou, benchmark, kaiming, average, region, faster, mae, baseline, piotr, doll, precise, challenging, propose] [base, clustering, data, existing, test, mixture, set, learning, novel, training, item, open]
@InProceedings{Goldman_2019_CVPR,
  author = {Goldman, Eran and Herzig, Roei and Eisenschtat, Aviv and Goldberger, Jacob and Hassner, Tal},
  title = {Precise Detection in Densely Packed Scenes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
KE-GAN: Knowledge Embedded Generative Adversarial Networks for Semi-Supervised Scene Parsing
Mengshi Qi, Yunhong Wang, Jie Qin, Annan Li


In recent years, scene parsing has captured increasing attention in computer vision. Previous works have demonstrated promising performance in this task. However, they mainly utilize holistic features, whilst neglecting the rich semantic knowledge and inter-object relationships in the scene. In addition, these methods usually require a large number of pixel-level annotations, which is too expensive in practice. In this paper, we propose a novel Knowledge Embedded Generative Adversarial Networks, dubbed as KE-GAN, to tackle the challenging problem in a semi-supervised fashion. KE-GAN captures semantic consistencies of different categories by devising a Knowledge Graph from the large-scale text corpus. In addition to readily-available unlabeled data, we generate synthetic images to unveil rich structural information underlying the images. Moreover, a pyramid architecture is incorporated into the discriminator to acquire multi-scale contextual information for better parsing results. Extensive experimental results on four standard benchmarks demonstrate that KE-GAN is capable of improving semantic consistencies and learning better representations for scene parsing, resulting in the state-of-the-art performance.
[graph, dataset, utilized, complex, capture] [scene, confidence, matrix] [image, pixel, figure, based, generator, generative, proposed, real, consistency, conditional, qualitative, noise] [convolutional, deep, performance, network, pooling, table, best, better, employ, achieves, improving, neural, output, process, compared] [adversarial, discriminator, generated, model, random, probability, gans, embedded, generate, relationship, introduced] [semantic, parsing, fully, pyramid, extra, contextual, annotated, segmentation, relation, adopt, module, global, object, feature, miou] [knowledge, training, data, learning, unlabeled, loss, labeled, label, set, distribution, supervised, souly, embedding, similarity, test, novel, experimental, representation]
@InProceedings{Qi_2019_CVPR,
  author = {Qi, Mengshi and Wang, Yunhong and Qin, Jie and Li, Annan},
  title = {KE-GAN: Knowledge Embedded Generative Adversarial Networks for Semi-Supervised Scene Parsing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fast User-Guided Video Object Segmentation by Interaction-And-Propagation Networks
Seoung Wug Oh, Joon-Young Lee, Ning Xu, Seon Joo Kim


We present a deep learning method for interactive video object segmentation. Our method is built upon two core operations, interaction and propagation, and each operation is conducted by a Convolutional Neural Network. The two networks are connected both internally and externally so that the networks are trained jointly and interact with each other to solve the complex video object segmentation problem. We propose a new multi-round training scheme for interactive video object segmentation so that the networks can learn how to understand the user's intention and update incorrect estimations during the training. At testing time, our method produces high-quality results and also runs fast enough to work with users interactively. We evaluated the proposed method quantitatively on the interactive track benchmark at the DAVIS Challenge 2018. We outperformed other competing methods by a significant margin in both speed and accuracy. We also demonstrated that our method works well with real user interactions.
[video, interaction, previous, propagation, frame, multiple, davis, challenge, recognition, time, current, complex, intention, work, consists] [computer, vision, pattern, additional, computed, single, estimation, international, algorithm] [user, method, conference, ieee, image, reference, proposed, real, input, figure, based, acm, feedback] [network, deep, aggregation, output, performance, fast, neural, number, scheme, block, residual, convolutional, connected, designed] [model, encoder, decoder, generates] [object, segmentation, mask, interactive, feature, round, module, map, roi, propose, false, propagated, refine, fully, foreground] [training, trained, testing, learning, target, data, scenario, positive, update, negative, unsupervised]
@InProceedings{Oh_2019_CVPR,
  author = {Wug Oh, Seoung and Lee, Joon-Young and Xu, Ning and Joo Kim, Seon},
  title = {Fast User-Guided Video Object Segmentation by Interaction-And-Propagation Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fast Interactive Object Annotation With Curve-GCN
Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, Sanja Fidler


Manually labeling objects by tracing their boundaries is a laborious process. In Polygon-RNN++, the authors proposed Polygon-RNN that produces polygonal annotations in a recurrent manner using a CNN-RNN architecture, allowing interactive correction via humans-in-the-loop. We propose a new framework that alleviates the sequential nature of Polygon-RNN, by predicting all vertices simultaneously using a Graph Convolutional Network (GCN). Our model is trained end-to-end, and runs in real time. It supports object annotation by either polygons or splines, facilitating labeling efficiency for both line-based and curved objects. We show that Curve-GCN outperforms all existing approaches in automatic mode, including the powerful DeepLab, and is significantly more efficient in interactive mode than Polygon-RNN++. Our model runs at 29.3ms in automatic, and 2.6ms in interactive mode, making it 10x and 100x faster than Polygon-RNN++.
[graph, gcn, prediction, dataset, cpi, follow, perform, recurrent, work, predict, avg, human, iteratively, propagation] [point, additional, matching, predicts, form, differentiable, corresponding, single, note, approach] [image, control, figure, user, proposed, based, correction, contour] [table, inference, order, number, accuracy, convolutional, compare, deep, network, efficient, connected] [model, automatic, mode, iterative, node, worst, refer, evaluate, provided, requires, correct] [object, interactive, annotator, annotation, polygon, predicted, boundary, spline, dextr, feature, box, extreme, segmentation, iou, instance, cnn, faster, level, crop, location] [loss, training, train, set, trained, data, learning]
@InProceedings{Ling_2019_CVPR,
  author = {Ling, Huan and Gao, Jun and Kar, Amlan and Chen, Wenzheng and Fidler, Sanja},
  title = {Fast Interactive Object Annotation With Curve-GCN},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
FickleNet: Weakly and Semi-Supervised Semantic Image Segmentation Using Stochastic Inference
Jungbeom Lee, Eunji Kim, Sungmin Lee, Jangho Lee, Sungroh Yoon


The main obstacle to weakly supervised semantic image segmentation is the difficulty of obtaining pixel-level information from coarse image-level annotations. Most methods based on image-level annotations use localization maps obtained from the classifier, but these only focus on the small discriminative parts of objects and do not capture precise boundaries. FickleNet explores diverse combinations of locations on feature maps created by generic deep neural networks. It selects hidden units randomly and then uses them to obtain activation scores for image classification. FickleNet implicitly learns the coherence of each location in the feature maps, resulting in a localization map which identifies both discriminative and other parts of objects. The ensemble effects are obtained from a single network by selecting random hidden unit pairs, which means that a variety of localization maps are generated from a single image. Our approach does not require any additional training steps and only adds a simple layer to a standard convolutional neural network; nevertheless it outperforms recent comparable techniques on the Pascal VOC 2012 benchmark in both weakly and semi-supervised settings.
[hidden, window, deterministic] [single, computer, vision, pattern, additional, general] [image, figure, method, ieee, conference, produced, based] [dropout, network, convolution, selection, stochastic, unit, table, validation, deep, rate, inference, convolutional, dilated, gpu, size, neural, process, number, kernel, small, standard, dilation, implementation] [random, arxiv, preprint, expand, requires, strong] [localization, segmentation, map, semantic, ficklenet, feature, weakly, object, dsrg, sliding, voc, pascal, miou, cvpr, cam, score, activated, fully, expansion, region, expanded, spatial] [training, supervised, classification, discriminative, class, learning, trained, test, target, classifier, selected, function]
@InProceedings{Lee_2019_CVPR,
  author = {Lee, Jungbeom and Kim, Eunji and Lee, Sungmin and Lee, Jangho and Yoon, Sungroh},
  title = {FickleNet: Weakly and Semi-Supervised Semantic Image Segmentation Using Stochastic Inference},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
RVOS: End-To-End Recurrent Network for Video Object Segmentation
Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques, Xavier Giro-i-Nieto


Multiple-object video object segmentation is a challenging task, especially for the zero-shot case, when no object mask is given at the initial frame and the model has to find the objects to be segmented along the sequence. In our work, we propose a Recurrent network for multiple-object Video Object Segmentation (RVOS) that is fully end-to-end trainable. Our model incorporates recurrence on two different domains: (i) the spatial domain, which allows it to discover the different object instances within a frame, and (ii) the temporal domain, which allows it to keep the coherence of the segmented objects along time. We train RVOS for zero-shot video object segmentation and are the first to report quantitative results on the DAVIS-2017 and YouTube-VOS benchmarks. Further, we adapt RVOS for one-shot video object segmentation by using the masks obtained in previous time steps as inputs to be processed by the recurrent module. Our model reaches results comparable to state-of-the-art techniques on the YouTube-VOS benchmark and outperforms all previous video object segmentation methods not using online learning on the DAVIS-2017 benchmark. Moreover, our model achieves faster inference runtimes than previous methods, reaching 44 ms/frame on a P100 GPU.
[video, frame, vos, temporal, online, lstm, previous, recognition, recurrence, recurrent, multiple, sequence, rnn, consists, time, motion, outperforms, optical, flow, forward, state, hidden, osmn] [computer, vision, pattern, single, initial, allows, ground, truth, note, analysis] [conference, figure, ieee, qualitative, proposed, based, input, image, contour] [conv, architecture, table, validation, network, output, performance, convolutional, best, pretrained, neural, better, number, performed] [model, decoder, considered, depending, encoder] [object, segmentation, spatial, mask, instance, predicted, annotated, segment, fully, benchmark, segmented, region, european, propose] [learning, training, trained, set]
@InProceedings{Ventura_2019_CVPR,
  author = {Ventura, Carles and Bellver, Miriam and Girbau, Andreu and Salvador, Amaia and Marques, Ferran and Giro-i-Nieto, Xavier},
  title = {RVOS: End-To-End Recurrent Network for Video Object Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DeepFlux for Skeletons in the Wild
Yukang Wang, Yongchao Xu, Stavros Tsogkas, Xiang Bai, Sven Dickinson, Kaleem Siddiqi


Computing object skeletons in natural images is challenging, owing to large variations in object appearance and scale, and the complexity of handling background clutter. Many recent methods frame object skeleton detection as a binary pixel classification problem, which is similar in spirit to learning-based edge detection, as well as to semantic segmentation methods. In the present article, we depart from this strategy by training a CNN to predict a two-dimensional vector field, which maps each scene point to a candidate skeleton pixel, in the spirit of flux-based skeletonization algorithms. This "image context flux" representation has two major advantages over previous approaches. First, it explicitly encodes the relative position of skeletal pixels to semantically meaningful entities, such as the image points in their spatial context, and hence also the implied object boundaries. Second, since the skeleton detection context is a region-based vector field, it is better able to cope with object parts of large width. We evaluate the proposed method on three benchmark datasets for skeleton detection and two for symmetry detection, achieving consistently superior performance over state-of-the-art methods.
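A hedged sketch of the two-dimensional flux target described above, assuming SciPy's Euclidean distance transform: each pixel stores the unit vector pointing toward its nearest skeleton pixel. This is an illustrative reading of the "image context flux" representation, not necessarily the authors' exact construction.

import numpy as np
from scipy.ndimage import distance_transform_edt

def context_flux(skeleton_mask):
    # skeleton_mask: boolean (H, W) array, True at skeleton pixels.
    _, nearest = distance_transform_edt(~skeleton_mask, return_indices=True)
    rows, cols = np.indices(skeleton_mask.shape)
    dy = nearest[0] - rows
    dx = nearest[1] - cols
    norm = np.maximum(np.sqrt(dy ** 2 + dx ** 2), 1e-12)
    # Unit vector field pointing from each pixel toward its nearest skeleton pixel
    # (zero at skeleton pixels themselves).
    return np.stack([dy / norm, dx / norm], axis=0)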
[skeleton, dataset, recognition, work, previous, capture] [flux, deepflux, symmetry, field, medial, direction, skeletal, local, computer, journal, associated, international, approach, shape, ground, directly, accurate, neighborhood, kaleem, university] [image, method, figure, pixel, proposed, sven, recover, based, study, side, ieee, background] [binary, network, convolutional, deep, receptive, aspp, wei, performance, magnitude, table, larger, dilated, conv] [vector, natural] [object, context, detection, spatial, edge, segmentation, feature, skeletonization, localization, map, xiang, accurately, propose, srn, detect, predicted, module, mil, backbone, fsds, cnn] [learning, representation, training, learned, train, large, classification]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Yukang and Xu, Yongchao and Tsogkas, Stavros and Bai, Xiang and Dickinson, Sven and Siddiqi, Kaleem},
  title = {DeepFlux for Skeletons in the Wild},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Interactive Image Segmentation via Backpropagating Refinement Scheme
Won-Dong Jang, Chang-Su Kim


An interactive image segmentation algorithm, which accepts user-annotations about a target object and the background, is proposed in this work. We convert user-annotations into interaction maps by measuring distances of each pixel to the annotated locations. Then, we perform the forward pass in a convolutional neural network, which outputs an initial segmentation map. However, the user-annotated locations can be mislabeled in the initial result. Therefore, we develop the backpropagating refinement scheme (BRS), which corrects the mislabeled pixels. Experimental results demonstrate that the proposed algorithm outperforms the conventional algorithms on four challenging datasets. Furthermore, we demonstrate the generality and applicability of BRS in other computer vision tasks, by transforming existing convolutional neural networks into user-interactive ones.
[interaction, dataset, forward, davis, perform, inertial, time, accepts, outperforms] [algorithm, noc, initial, accurate, yield, note] [image, proposed, background, figure, user, input, pixel, deconvolution, based] [network, number, convolutional, deep, energy, backpropagation, neural, block, fine, performance, convolution, architecture, table, scheme, accuracy, achieve, layer, skip, output] [correct, decoder, pass, probability, random, develop, encoder] [segmentation, interactive, object, foreground, grabcut, iou, saliency, berkeley, sbd, semantic, average, annotated, map, baseline, score, mask, corrective, clicked, backpropagating, refinement, bounding, coarse, click, mislabeled, box] [target, set, training, learning, conventional, train]
@InProceedings{Jang_2019_CVPR,
  author = {Jang, Won-Dong and Kim, Chang-Su},
  title = {Interactive Image Segmentation via Backpropagating Refinement Scheme},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Scene Parsing via Integrated Classification Model and Variance-Based Regularization
Hengcan Shi, Hongliang Li, Qingbo Wu, Zichen Song


Scene parsing is a challenging task in computer vision, which can be formulated as a pixel-wise classification problem. Existing deep-learning-based methods usually use one general classifier to recognize all object categories. However, the general classifier easily makes mistakes when dealing with confusing categories that share similar appearances or semantics. In this paper, we propose an integrated classification model and a variance-based regularization to achieve more accurate classifications. On the one hand, the integrated classification model contains multiple classifiers: not only the general classifier but also a refinement classifier to distinguish the confusing categories. On the other hand, the variance-based regularization separates the scores of all categories as much as possible to reduce misclassifications. Specifically, the integrated classification model includes three steps. The first is to extract the features of each pixel. Based on the features, the second step is to classify each pixel across all categories to generate a preliminary classification result. In the third step, we leverage a refinement classifier to refine the classification result, focusing on differentiating the high-preliminary-score categories. An integrated loss with the variance-based regularization is used to train the model. Extensive experiments on three common scene parsing datasets demonstrate the effectiveness of the proposed method.
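One hedged reading of the variance-based regularization: a penalty that rewards per-pixel class score vectors whose entries are spread apart. The exact form below (negative mean variance, to be added to the classification loss) is an assumption for illustration, not the paper's formula.

import numpy as np

def variance_regularizer(class_scores):
    # class_scores: (num_pixels, num_classes) array of per-pixel scores.
    # Larger variance across classes means the scores are more separated,
    # so the negative mean variance acts as a penalty to be minimized.
    return -np.mean(np.var(class_scores, axis=1))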
[multiple, dataset, outperforms, formulated, second] [scene, general, computer, pattern, vision, ground, depth, floor, international, nyu, predicts, problem, dense, analysis] [proposed, conference, image, ieee, method, input, pixel, comparison, based, high, background, figure] [network, binary, regularization, table, accuracy, deep, neural, convolutional, performance, achieve, reduce, effectiveness] [model, probability, incorrect, correct, observed] [parsing, integrated, object, refinement, feature, semantic, multinomial, score, map, context, confusing, category, baseline, preliminary, extraction, vbr, deeplab, segmentation, three, lvbr, pspnet, european, including, leverage] [classification, classifier, learning, loss, distribution, trained, train, training]
@InProceedings{Shi_2019_CVPR,
  author = {Shi, Hengcan and Li, Hongliang and Wu, Qingbo and Song, Zichen},
  title = {Scene Parsing via Integrated Classification Model and Variance-Based Regularization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
RAVEN: A Dataset for Relational and Analogical Visual REasoNing
Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, Song-Chun Zhu


Dramatic progress has been witnessed in basic vision tasks involving low-level perception, such as object recognition, detection, and tracking. Unfortunately, there is still enormous performance gap between artificial vision systems and human intelligence in terms of higher-level vision problems, especially ones involving reasoning. Earlier attempts in equipping machines with high-level reasoning have hovered around Visual Question Answering (VQA), one typical task associating vision and language understanding. In this work, we propose a new dataset, built in the context of Raven's Progressive Matrices (RPM) and aimed at lifting machine intelligence by associating vision with structural, relational, and analogical reasoning in a hierarchical representation. Unlike previous works in measuring abstract reasoning using RPM, we establish a semantic link between vision and reasoning by providing structure representation. This addition enables a new type of abstract reasoning by jointly operating on the structure representation. Machine reasoning ability using modern computer vision is evaluated in this newly proposed dataset. Additionally, we also provide human performance as a reference. Finally, we show consistent improvement across all models by incorporating a simple neural module that combines visual understanding and structure reasoning.
[human, dataset, work, cognitive, lstm, recognition, previous, early, joint, multiple] [vision, computer, problem, international, solving, pattern, note, solver, matrix] [image, figure, conference, proposed, component, ieee, noise] [performance, structure, neural, progressive, table, structured, grammar, accuracy, number, denotes, apply, configuration, residual, processing, computational] [reasoning, model, visual, rpm, raven, intelligence, rule, answer, tree, simple, artificial, ability, compositional, wren, question, correct, drt, relational, analogical, machine, generate, understanding, abstract, candidate, attention, sentence, attributed, generation, program, symbolic] [center, feature, cnn, module, improve, semantic, level, three] [test, representation, set, learning, training, generalization, trained]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Chi and Gao, Feng and Jia, Baoxiong and Zhu, Yixin and Zhu, Song-Chun},
  title = {RAVEN: A Dataset for Relational and Analogical Visual REasoNing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Surface Reconstruction From Normals: A Robust DGP-Based Discontinuity Preservation Approach
Wuyuan Xie, Miaohui Wang, Mingqiang Wei, Jianmin Jiang, Jing Qin


In 3D surface reconstruction from normals, discontinuity preservation is an important but challenging task. However, existing studies fail to handle discontinuous normal maps because they enforce surface integrability in the continuous domain. This paper introduces a robust approach to preserve surface discontinuity in a discrete-geometry manner. Firstly, we design two representative normal incompatibility features and propose an efficient discontinuity detection scheme to determine the splitting pattern for a discrete mesh. Secondly, we model the discontinuity preservation problem as a light-weight energy optimization framework by jointly considering the discontinuity detection and the overall reconstruction error. Lastly, we further shrink the feasible solution space to reduce the complexity based on the prior knowledge. Experiments show that the proposed method achieves the best performance on an extensive 3D dataset compared with the state of the art, in terms of mean angular error and computational complexity.
[time, adjacent] [normal, surface, discontinuity, reconstruction, depth, error, range, approach, orientation, discontinuous, computer, discrete, optimal, shape, total, vertex, integrability, distortion, continuous, mumford, ist, tio, quadratic, fourier, direction, shaping, bunny, histogram, geometry, incompatibility, occlusion, compute, mesh, shah, robust, pattern, feasible, solution, vision, photometric] [reconstructed, figure, noise, proposed, method, face, difference, based, preservation, input, splitting, prior] [gaussian, variance, energy, number, add, original, rate, cost, gradient, connected, group] [model, introduce, basis, step, example, variational] [map, detection, feature, mae, detected] [noisy, angular, dgp, space, address, enforcing, split, measure, function]
@InProceedings{Xie_2019_CVPR,
  author = {Xie, Wuyuan and Wang, Miaohui and Wei, Mingqiang and Jiang, Jianmin and Qin, Jing},
  title = {Surface Reconstruction From Normals: A Robust DGP-Based Discontinuity Preservation Approach},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images
Yuying Ge, Ruimao Zhang, Xiaogang Wang, Xiaoou Tang, Ping Luo


Understanding fashion images has been advanced by benchmarks with rich annotations such as DeepFashion, whose labels include clothing categories, landmarks, and consumer-commercial image pairs. However, DeepFashion has non-negligible issues such as a single clothing item per image, sparse landmarks (4~8 only), and no per-pixel masks, leaving a significant gap from real-world scenarios. We fill in the gap by presenting DeepFashion2 to address these issues. It is a versatile benchmark of four tasks including clothes detection, pose estimation, segmentation, and retrieval. It has 801K clothing items, where each item has rich annotations such as style, scale, viewpoint, occlusion, bounding box, dense landmarks (e.g., 39 for 'long sleeve outwear' and 15 for 'vest'), and masks. There are also 873K commercial-consumer clothes pairs. The annotations of DeepFashion2 are much larger than those of its counterparts, e.g., 8x those of the FashionAI Global Challenge. A strong baseline is proposed, called Match R-CNN, which builds upon Mask R-CNN to solve the above four tasks in an end-to-end manner. Extensive evaluations are conducted with different criteria on DeepFashion2. The DeepFashion2 dataset will be released at: https://github.com/switchablenorms/DeepFashion2
[human, long, work, previous, people] [pose, match, viewpoint, occlusion, estimation, occluded, single, dense, defined, well] [image, landmark, deepfashion, side, frontal, identity, figure] [table, network, small, performance, validation, accuracy, top, scale, slight, number] [evaluation, rich, making, asked, wear] [clothes, clothing, mask, including, bounding, detection, box, sleeve, three, commercial, category, benchmark, heavy, detected, segmentation, moderate, outwear, instance, feature, ping, vest, shopping, dress, region] [fashion, item, large, retrieval, medium, labeled, task, loss, learned, learning, data, difficult]
@InProceedings{Ge_2019_CVPR,
  author = {Ge, Yuying and Zhang, Ruimao and Wang, Xiaogang and Tang, Xiaoou and Luo, Ping},
  title = {DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Jumping Manifolds: Geometry Aware Dense Non-Rigid Structure From Motion
Suryansh Kumar


Given dense image feature correspondences of a non-rigidly moving object across multiple frames, this paper proposes an algorithm to estimate its 3D shape for each frame. To solve this problem accurately, the recent state-of-the-art algorithm reduces the task to a set of local linear subspace reconstruction and clustering problems using a Grassmann manifold representation [34]. Unfortunately, their method misses some of the critical issues associated with the modeling of surface deformations, e.g., the dependence of a local surface deformation on its neighbors. Furthermore, their representation for grouping high-dimensional data points inevitably introduces the drawbacks of categorizing samples on the high-dimensional Grassmann manifold [32, 31]. Hence, to deal with such limitations of [34], we propose an algorithm that jointly exploits the benefit of the high-dimensional Grassmann manifold to perform reconstruction, and its equivalent lower-dimensional representation to infer suitable clusters. To accomplish this, we project each Grassmannian onto a lower-dimensional Grassmann manifold which preserves and respects the deformation of the structure w.r.t. its neighbors. These Grassmann points in the lower dimension then act as representatives for the selection of high-dimensional Grassmann samples to perform each local reconstruction. In practice, our algorithm provides a geometrically efficient way to solve dense NRSfM by switching between manifolds based on their benefit and usage. Experimental results show that the proposed algorithm is very effective in handling noise, with reconstruction accuracy as good as or better than that of competing methods.
[sequence, motion, work, modeling, dataset, complex, actor, framework, time, term, perform, previous, graph, ordering] [grassmann, dense, reconstruction, nrsfm, computer, matrix, algorithm, grassmannians, local, vision, shape, problem, pattern, surface, point, optimization, solve, solution, grassmannian, singular, deforming, approach, international, estimate, deformation, dimensional, constraint, linear, rotation, reliable, suryansh, corresponding, provide, error, yuchao, camera, formulation, definition, wij, hongdong] [conference, face, ieee, method, based, proposed, image, noise, figure, composed, high, facial] [structure, efficient, number, kumar, better, performance, suitable, denotes, processing] [manifold, introduce, represent] [feature, object, neighboring, grouping] [representation, subspace, minimize, function, set, space, minimization, paper]
@InProceedings{Kumar_2019_CVPR,
  author = {Kumar, Suryansh},
  title = {Jumping Manifolds: Geometry Aware Dense Non-Rigid Structure From Motion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
LVIS: A Dataset for Large Vocabulary Instance Segmentation
Agrim Gupta, Piotr Dollar, Ross Girshick


Progress on object detection is enabled by datasets that focus the research community's attention on open challenges. This process led us from simple images to complex scenes and from bounding boxes to segmentation masks. In this work, we introduce LVIS (pronounced 'el-vis'): a new dataset for Large Vocabulary Instance Segmentation. We plan to collect 2.2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164k images. Due to the Zipfian distribution of categories in natural images, LVIS naturally has a long tail of categories with few training samples. Given that state-of-the-art deep learning methods for object detection perform poorly in the low-sample regime, we believe that our dataset poses an important and exciting new scientific challenge. LVIS is available at http://www.lvisdataset.org.
[dataset, perform, long, recognition, marked] [pipeline, single, algorithm, exhaustive, analysis, total, well] [image, quality, figure, expert, consistency] [number, design, max, process, output, smaller, full, size, small] [evaluation, vocabulary, spotting, mark, asked, describe, visual, iterative, constituent, deer, common, frequent] [category, object, segmentation, coco, lvis, stage, mask, instance, annotation, annotated, detection, benchmark, federated, exhaustively, boundary, iou, piotr, doll, average, annotator, spotted, ross, bounding, recall] [set, large, positive, open, training, labeled, datasets, learning, negative, task, tail, data, test, distribution, subset, pietro, label]
@InProceedings{Gupta_2019_CVPR,
  author = {Gupta, Agrim and Dollar, Piotr and Girshick, Ross},
  title = {LVIS: A Dataset for Large Vocabulary Instance Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fast Object Class Labelling via Speech
Michael Gygli, Vittorio Ferrari


Object class labelling is the task of annotating images with labels on the presence or absence of objects from a given class vocabulary. Simply asking one yes-no question per class, however, has a cost that is linear in the vocabulary size and is thus inefficient for large vocabularies. Modern approaches rely on a hierarchical organization of the vocabulary to reduce annotation time, but remain expensive (several minutes per image for the 200 classes in ILSVRC). Instead, we propose a new interface where classes are annotated via speech. Speaking is fast and allows for direct access to the class name, without searching through a list or hierarchy. As additional advantages, annotators can simultaneously speak and scan the image for objects, the interface can be kept extremely simple, and using it requires less mouse movement. As annotators using our interface should only say words from a given class vocabulary, we propose a dedicated task to train them to do so. Through experiments on COCO and ILSVRC, we show our method yields high-quality annotations at 2.3x -14.9x less annotation time than existing methods.
[time, speech, mouse, dataset, speaking, audio, despite, naturally, recognition, work, previous, people] [approach, point, allows, additional, analysis, provide] [image, figure, high, method, input, feedback] [ilsvrc, accuracy, efficient, compare, search, precision, fast, order, speed, number, symbol, lower, size, extremely] [vocabulary, visual, transcription, spoken, spent, automatic, requires, asked, create, correct, find, phrase, simply] [object, interface, annotation, hierarchical, coco, click, annotating, annotated, recall, location, hierarchy, annotate, segmentation, semantic, faster, annotator, transcribe, analyse, final, presence, speak] [class, training, task, label, labelling, large, main, list]
@InProceedings{Gygli_2019_CVPR,
  author = {Gygli, Michael and Ferrari, Vittorio},
  title = {Fast Object Class Labelling via Speech},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking
Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, Haibin Ling


In this paper, we present LaSOT, a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box, making LaSOT the largest, to the best of our knowledge, densely annotated tracking benchmark. The average video length of LaSOT is more than 2,500 frames, and each sequence comprises various challenges deriving from the wild where target objects may disappear and re-appear again in the view. By releasing LaSOT, we expect to provide the community with a large-scale dedicated benchmark with high quality for both the training of deep trackers and the veritable evaluation of tracking algorithms. Moreover, considering the close connections of visual appearance and natural language, we enrich LaSOT by providing additional language specification, aiming at encouraging the exploration of natural linguistic feature for tracking. A thorough experimental evaluation of 35 tracking algorithms on LaSOT is presented with detailed analysis, and the results demonstrate that there is still a big room for improvements.
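For reference, the standard one-pass-evaluation style success and precision curves used to score trackers on such benchmarks are sketched below (fraction of frames whose overlap exceeds, or whose center location error falls below, each threshold). This is a generic sketch with illustrative threshold grids, not code from the LaSOT toolkit.

import numpy as np

def success_curve(ious, thresholds=None):
    # ious: per-frame IoU between predicted and ground-truth boxes.
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)
    ious = np.asarray(ious)
    return np.array([(ious > t).mean() for t in thresholds])

def precision_curve(center_errors, thresholds=None):
    # center_errors: per-frame distance (pixels) between box centers.
    if thresholds is None:
        thresholds = np.arange(0, 51)
    errors = np.asarray(center_errors)
    return np.array([(errors <= t).mean() for t in thresholds])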
[tracking, video, sequence, dataset, consists, motion, frame, manually] [dense, provide, normalized, robust] [figure, appearance] [lasot, precision, deep, siamfc, mdnet, vital, tracker, eco, staple, best, performance, ptav, meem, scale, fps, traca, cfnet, structsiam, dsiam, sint, correlation, dsst, csk, kcf, fdsst, samf, stc, hcft, bacf, srdcf, csrdcf, strcf, lct, tld, asla, ivt, struck, ope, sparse] [visual, evaluation, success, natural, language, length] [object, bounding, benchmark, box, annotation, category, mil, annotated, threshold, average, lingual, feature] [target, training, existing, set, protocol, large, min, learning, testing, labeled]
@InProceedings{Fan_2019_CVPR,
  author = {Fan, Heng and Lin, Liting and Yang, Fan and Chu, Peng and Deng, Ge and Yu, Sijia and Bai, Hexin and Xu, Yong and Liao, Chunyuan and Ling, Haibin},
  title = {LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Creative Flow+ Dataset
Maria Shugrina, Ziheng Liang, Amlan Kar, Jiaman Li, Angad Singh, Karan Singh, Sanja Fidler


We present the Creative Flow+ Dataset, the first diverse multi-style artistic video dataset richly labeled with per-pixel optical flow, occlusions, correspondences, segmentation labels, normals, and depth. Our dataset includes 3000 animated sequences rendered using styles randomly selected from 40 textured line styles and 38 shading styles, spanning the range between flat cartoon fill and wildly sketchy shading. Our dataset includes 124K+ train set frames and 10K test set frames rendered at 1500x1500 resolution, far surpassing the largest available optical flow datasets in size. While modern techniques for tasks such as optical flow estimation achieve impressive performance on realistic images and video, today there is no way to gauge their performance on non-photorealistic images. Creative Flow+ poses a new challenge to generalize real-world Computer Vision to messy stylized content. We show that learning-based optical flow methods fail to generalize to this data and struggle to compete with classical approaches, and invite new research in this area. Our dataset and a new optical flow benchmark will be publicly available at: www.cs.toronto.edu/creativeflow/. We further release the complete dataset creation pipeline, allowing the community to generate and stylize their own data on demand.
[flow, optical, dataset, sintel, stylit, motion, frame, tracking] [computer, vision, ground, rendering, truth, rendered, blender, well, shapenet, pattern, mpi, classical, composited, range, analysis, correspondence, shape, general] [style, stylized, creative, animated, shading, image, acm, conference, content, color, ieee, figure, synthetic, artistic, includes, background, mixamo, flat, real, stylization, outline, cartoon, animation] [performance, deep, number] [unique, visual, include, held, diverse, find, flying, enable] [object, benchmark, foreground, including] [datasets, train, test, set, learning, data, existing, trained, large, sketch, generalize, specific, split, domain]
@InProceedings{Shugrina_2019_CVPR,
  author = {Shugrina, Maria and Liang, Ziheng and Kar, Amlan and Li, Jiaman and Singh, Angad and Singh, Karan and Fidler, Sanja},
  title = {Creative Flow+ Dataset},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Weakly Supervised Open-Set Domain Adaptation by Dual-Domain Collaboration
Shuhan Tan, Jiening Jiao, Wei-Shi Zheng


In conventional domain adaptation, a critical assumption is that there exists a fully labeled domain (source) that contains the same label space as another unlabeled or scarcely labeled domain (target). However, in the real world, there often exist application scenarios in which both domains are partially labeled and not all classes are shared between these two domains. Thus, it is meaningful to let partially labeled domains learn from each other to classify all the unlabeled samples in each domain under an open-set setting. We consider this problem as weakly supervised open-set domain adaptation. To address this practical setting, we propose the Collaborative Distribution Alignment (CDA) method, which performs knowledge transfer bilaterally and works collaboratively to classify unlabeled data and identify outlier samples. Extensive experiments on the Office benchmark and an application on person reidentification show that our method achieves state-of-the-art performance.
[recognition, explicitly, dataset, joint] [computer, vision, camera, pattern, international, partially, problem, outlier, well, single, analysis, total, view] [conference, ieee, proposed, method, dual, figure, mapping, latent, image, collaborative, separation] [deep, compared, number, performance, table, standard, neural, process] [machine, find] [person, weakly, feature, propose, fully, detect] [domain, labeled, adaptation, learning, class, set, label, cda, unlabeled, unknown, sample, supervised, data, learn, large, source, distribution, shared, target, transfer, alignment, setting, unsupervised, align, classifier, distc, loss, space, datasets, discrepancy, tca, distance, randomly, mingsheng, jianmin]
@InProceedings{Tan_2019_CVPR,
  author = {Tan, Shuhan and Jiao, Jiening and Zheng, Wei-Shi},
  title = {Weakly Supervised Open-Set Domain Adaptation by Dual-Domain Collaboration},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Neurobiological Evaluation Metric for Neural Network Model Search
Nathaniel Blanchard, Jeffery Kinnison, Brandon RichardWebster, Pouya Bashivan, Walter J. Scheirer


Neuroscience theory posits that the brain's visual system coarsely identifies broad object categories via neural activation patterns, with similar objects producing similar neural responses. Artificial neural networks also have internal activation behavior in response to stimuli. We hypothesize that networks exhibiting brain-like activation behavior will demonstrate brain-like characteristics, e.g., stronger generalization capabilities. In this paper we introduce a human-model similarity (HMS) metric, which quantifies the similarity of human fMRI and network activation behavior. To calculate HMS, representational dissimilarity matrices (RDMs) are created as abstractions of activation behavior, measured by the correlations of activations to stimulus pairs. HMS is then the correlation between the fMRI RDM and the neural network RDM across all stimulus pairs. We test the metric on unsupervised predictive coding networks, which specifically model visual perception, and assess the metric for statistical significance over a large range of hyperparameters. Our experiments show that networks with increased human-model similarity are correlated with better performance on two computer vision tasks: next frame prediction and object matching accuracy. Further, HMS identifies networks with high performance on both tasks. An unexpected secondary finding is that the metric can be employed during training as an early-stopping mechanism.
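A minimal sketch of the two steps described above: building a representational dissimilarity matrix (RDM) from activations, then correlating the upper triangles of a model RDM and an fMRI RDM. Pearson correlation is used for both steps here as a simplifying assumption; the paper's exact correlation choices may differ.

import numpy as np

def rdm(activations):
    # activations: (num_stimuli, num_features); entry (i, j) of the RDM is
    # one minus the correlation between the responses to stimuli i and j.
    return 1.0 - np.corrcoef(activations)

def human_model_similarity(rdm_model, rdm_fmri):
    # Correlate the upper triangles so each stimulus pair is counted once.
    iu = np.triu_indices_from(rdm_model, k=1)
    return np.corrcoef(rdm_model[iu], rdm_fmri[iu])[0, 1]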
[behavior, fmri, internal, human, frame, biological, prediction, rdm, early, prednet, rdms, time, video, perform, recognition, behavioral, dataset, correlated] [matching, vision, computer, error, measured, range, well, analysis] [high, statistical, fidelity, study, described, proposed, mse, figure] [network, neural, performance, accuracy, activation, coding, brain, correlation, search, deep, computational, stopping, stimulus, higher, architecture, table, representational, compared, standard, science, process, better] [model, visual, evaluation, machine, arxiv, preprint, artificial, consider, exhibit] [object, response, score, average, threshold] [similarity, metric, predictive, data, training, set, learning, trained, generalization, sample, dissimilarity, unsupervised, specific]
@InProceedings{Blanchard_2019_CVPR,
  author = {Blanchard, Nathaniel and Kinnison, Jeffery and RichardWebster, Brandon and Bashivan, Pouya and Scheirer, Walter J.},
  title = {A Neurobiological Evaluation Metric for Neural Network Model Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Iterative Projection and Matching: Finding Structure-Preserving Representatives and Its Application to Computer Vision
Alireza Zaeemzadeh, Mohsen Joneidi, Nazanin Rahnavard, Mubarak Shah


The goal of data selection is to capture the most structural information from a set of data. This paper presents a fast and accurate data selection method, in which the selected samples are optimized to span the subspace of all data. We propose a new selection algorithm, referred to as iterative projection and matching (IPM), with linear complexity w.r.t. the number of data points, and without any parameter to be tuned. In our algorithm, at each iteration, the maximum information from the structure of the data is captured by one selected sample, and the captured information is neglected in subsequent iterations by projection onto the null space of previously selected samples. The computational efficiency and the selection accuracy of the proposed algorithm outperform those of conventional methods. Furthermore, the superiority of the proposed algorithm is shown on active learning for video action recognition on UCF-101; learning using representatives on ImageNet; training a generative adversarial network (GAN) to generate multi-view images from a single-view input on the CMU Multi-PIE dataset; and video summarization on the UTE Egocentric dataset.
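A hedged NumPy sketch of the greedy loop described above: pick the sample best aligned with the dominant direction of the residual data, then project that sample's direction out before the next pick. The matching criterion and names are illustrative assumptions; the paper's exact selection rule may differ.

import numpy as np

def iterative_projection_matching(X, k):
    # X: (n, d) data matrix; returns indices of k selected representatives.
    residual = X.astype(float).copy()
    selected = []
    for _ in range(k):
        # Dominant direction of the residual data (top right singular vector).
        _, _, vt = np.linalg.svd(residual, full_matrices=False)
        direction = vt[0]
        # Matching step: sample whose normalized residual best aligns with it.
        norms = np.linalg.norm(residual, axis=1) + 1e-12
        idx = int(np.argmax(np.abs(residual @ direction) / norms))
        selected.append(idx)
        # Projection step: remove the chosen sample's direction from all data.
        u = residual[idx] / (np.linalg.norm(residual[idx]) + 1e-12)
        residual = residual - np.outer(residual @ u, u)
    return selected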
[dataset, video, summarization, recognition, action, subject] [algorithm, problem, projection, computer, active, matrix, singular, vision, pattern, convex, matching, volume, error, null, normalized, international, approach] [proposed, ieee, conference, based, figure, method, row] [selection, accuracy, performance, number, network, table, best, complexity, reduced, sparse, unit, compared, structure, cost, implementation, correlation, size, better] [random, selects, generated, iterative, finding, selecting, model, text, goal, referred, generate, vector, find, decision] [] [data, selected, ipm, learning, training, sample, set, subset, trained, space, representative, uncertainty, classifier, function, select, class, supervised, large, unsupervised, maximum, testing, sampling, classification, metric]
@InProceedings{Zaeemzadeh_2019_CVPR,
  author = {Zaeemzadeh, Alireza and Joneidi, Mohsen and Rahnavard, Nazanin and Shah, Mubarak},
  title = {Iterative Projection and Matching: Finding Structure-Preserving Representatives and Its Application to Computer Vision},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Efficient Multi-Domain Learning by Covariance Normalization
Yunsheng Li, Nuno Vasconcelos


The problem of multi-domain learning of deep networks is considered. An adaptive layer is induced per target domain and a novel procedure, denoted covariance normalization (CovNorm), is proposed to reduce its parameters. CovNorm is a data-driven method of fairly simple implementation, requiring two principal component analyses (PCAs) and fine-tuning of a mini-adaptation layer. Nevertheless, it is shown, both theoretically and experimentally, to have several advantages over previous approaches, such as batch normalization or geometric matrix approximations. Furthermore, CovNorm can be deployed whether target datasets are available sequentially or simultaneously. Experiments show that, in both cases, it has performance comparable to a fully fine-tuned network, using as few as 0.13% of the corresponding parameters per target domain.
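A hedged sketch of the covariance-driven compression idea: project the per-domain adaptation layer's weight matrix onto the leading principal directions of its input and output covariances, yielding a small "mini-adaptation" matrix. The factorization below is an illustrative reading under those assumptions, not the paper's exact procedure or rank-selection rule.

import numpy as np

def top_eigvecs(cov, r):
    # Leading r eigenvectors of a symmetric covariance matrix.
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:r]
    return vecs[:, order]

def covnorm_style_compress(W, cov_in, cov_out, r):
    # W: (d_out, d_in) adaptation-layer weights.
    P_in = top_eigvecs(cov_in, r)      # (d_in, r)
    P_out = top_eigvecs(cov_out, r)    # (d_out, r)
    W_mini = P_out.T @ W @ P_in        # (r, r) mini-adaptation weights
    W_approx = P_out @ W_mini @ P_in.T # low-rank reconstruction of W
    return W_mini, W_approx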
[recognition, dataset, joint, multiple, previous] [matrix, singular, geometric, computer, problem, pattern, match, vision, single, solution, note, volume, require, total] [figure, pca, input, conference, ieee, proposed, based] [layer, covnorm, network, deep, mdl, normalization, covariance, parameter, output, number, pcas, small, residual, approximation, batch, performance, fixed, full, size, convolutional, smaller, variance, svd, neural, efficient, larger, architecture, effective] [model, arxiv, preprint, implemented, adversarial, visual, procedure] [feature] [adaptation, learning, target, task, source, domain, large, specific, transfer, training, independent, datasets, set, fta, data, trained, classification]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yunsheng and Vasconcelos, Nuno},
  title = {Efficient Multi-Domain Learning by Covariance Normalization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Predicting Visible Image Differences Under Varying Display Brightness and Viewing Distance
Nanyang Ye, Krzysztof Wolski, Rafal K. Mantiuk


Numerous applications require a robust metric that can predict whether image differences are visible or not. However, the accuracy of existing white-box visibility metrics, such as HDR-VDP, is often not good enough. CNN-based black-box visibility metrics have proven to be more accurate, but they cannot account for differences in viewing conditions, such as display brightness and viewing distance. In this paper, we propose a CNN-based visibility metric, which maintains the accuracy of deep network solutions and accounts for viewing conditions. To achieve this, we extend the existing dataset of locally visible differences (LocVis) with a new set of measurements, collected considering aforementioned viewing conditions. Then, we develop a hybrid model that combines white-box processing stages for modeling the effects of luminance masking and contrast sensitivity, with a black-box deep neural network. We demonstrate that the novel hybrid model can handle the change of viewing conditions correctly and outperforms state-of-the-art metrics.
[] [] [] [] [] [] []
@InProceedings{Ye_2019_CVPR,
  author = {Ye, Nanyang and Wolski, Krzysztof and Mantiuk, Rafal K.},
  title = {Predicting Visible Image Differences Under Varying Display Brightness and Viewing Distance},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Bayesian Perspective on the Deep Image Prior
Zezhou Cheng, Matheus Gadelha, Subhransu Maji, Daniel Sheldon


The deep image prior was recently introduced as a prior for natural images. It represents images as the output of a convolutional network with random inputs. For "inference", gradient descent is performed to adjust network parameters to make the output match observations. This approach yields good performance on a range of image reconstruction tasks. We show that the deep image prior is asymptotically equivalent to a stationary Gaussian process prior in the limit as the number of channels in each layer of the network goes to infinity, and derive the corresponding kernel. This informs a Bayesian approach to inference. We show that by conducting posterior inference using stochastic gradient Langevin dynamics we avoid the need for early stopping, which is a drawback of the current approach, and improve results for denoising and inpainting tasks. We illustrate these intuitions on a number of 1D and 2D signal reconstruction tasks.
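The posterior-inference step mentioned above is stochastic gradient Langevin dynamics; a generic SGLD update in one common parameterization (Gaussian noise with variance 2*lr added to each gradient step) is sketched below, independently of the deep-image-prior architecture itself.

import numpy as np

def sgld_step(params, grad, lr, rng):
    # One SGLD update: a gradient step plus Gaussian noise with std sqrt(2*lr),
    # so the iterates sample from the posterior instead of collapsing to a
    # point estimate (which is what removes the need for early stopping).
    noise = rng.normal(scale=np.sqrt(2.0 * lr), size=params.shape)
    return params - lr * grad + noise

# Usage sketch: rng = np.random.default_rng(0); params = sgld_step(params, grad, 1e-4, rng)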
[stationary, early, work, signal, avg, averaging] [reconstruction, derive, international, approach, langevin, limit, induced, well, supplementary, respect, estimate, theorem] [image, prior, input, figure, dip, denoising, noise, inpainting, conference, psnr, mse, drawn, comparison, method, gps, described] [deep, gaussian, sgld, convolutional, covariance, inference, sgd, network, bayesian, neural, number, output, gradient, kernel, layer, converges, process, standard, upsampling, architecture, stopping, stochastic, limiting, filter, variance, performance, iteration, downsampling, conv, williams, relu, weight, processing] [random, adding, machine, natural, consider] [spatial, improves, average] [posterior, function, distribution, learning, avoid, transfer, set, overfitting, noisy, mcmc]
@InProceedings{Cheng_2019_CVPR,
  author = {Cheng, Zezhou and Gadelha, Matheus and Maji, Subhransu and Sheldon, Daniel},
  title = {A Bayesian Perspective on the Deep Image Prior},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ApolloCar3D: A Large 3D Car Instance Understanding Benchmark for Autonomous Driving
Xibin Song, Peng Wang, Dingfu Zhou, Rui Zhu, Chenye Guan, Yuchao Dai, Hao Su, Hongdong Li, Ruigang Yang


Autonomous driving has attracted remarkable attention from both industry and academia. An important task is to estimate 3D properties (e.g. translation, rotation and shape) of a moving or parked vehicle on the road. This task, while critical, is still under-researched in the computer vision community - partially owing to the lack of large scale and fully-annotated 3D car database suitable for autonomous driving research. In this paper, we contribute the first large scale database suitable for 3D car instance understanding - ApolloCar3D. The dataset contains 5,277 driving images and over 60K car instances, where each car is fitted with an industry-grade 3D CAD model with absolute model size and semantically labelled keypoints. This dataset is above 20x larger than PASCAL3D+ and KITTI, the current state-of-the-art. To enable efficient labelling in 3D, we build a pipeline by considering 2D-3D keypoint correspondences for a single instance and 3D relationship among multiple instances. Equipped with such dataset, we build various baseline algorithms with the state-of-the-art deep convolutional neural networks. Specifically, we first segment each car with a pre-trained Mask R-CNN, and then regress towards its 3D pose and shape based on a deformable 3D car model with or without using semantic keypoints. We show that using keypoints significantly improves fitting performance. Finally, we develop a new 3D metric jointly considering 3D pose and 3D shape, allowing for comprehensive evaluation and ablation study.
[dataset, driving, human, multiple, key, future, build, jointly] [pose, shape, estimation, keypoint, keypoints, depth, autonomous, vision, computer, single, pattern, ground, direct, reconstruction, absolute, kitti, provide, occlusion, accurate, viewpoint, error, point, relative, rotation, cad, range, camera] [ieee, image, based, conference, method, figure] [deep, convolutional, pooling, network, performance, scale, neural] [model, understanding, evaluation, develop, arxiv, preprint, attention] [car, object, gtmaks, instance, mask, baseline, semantic, deepmanta, detection, fully, box, offset, regression, premaks, bounding, propose, apolloscape, benchmark, average] [labelled, learning, datasets, training, set, large, metric, existing, task, data, distance, labeled]
@InProceedings{Song_2019_CVPR,
  author = {Song, Xibin and Wang, Peng and Zhou, Dingfu and Zhu, Rui and Guan, Chenye and Dai, Yuchao and Su, Hao and Li, Hongdong and Yang, Ruigang},
  title = {ApolloCar3D: A Large 3D Car Instance Understanding Benchmark for Autonomous Driving},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Compressing Unknown Images With Product Quantizer for Efficient Zero-Shot Classification
Jin Li, Xuguang Lan, Yang Liu, Le Wang, Nanning Zheng


For Zero-Shot Learning (ZSL), Nearest Neighbor (NN) search is generally conducted for classification, which may cause unacceptable computational complexity for large-scale datasets. Compressing zero-shot classes with a quantizer trained on seen classes tends to induce large quantization error, because the distributions of seen and unseen classes are different. However, as semantic attributes of classes are available in ZSL, both seen and unseen classes have the same distribution for one specific property, e.g., whether animals have spots. Based on this intuition, a Product Quantization Zero-Shot Learning (PQZSL) method is proposed to learn embeddings as well as quantizers that compress visual features into compact codes for Approximate NN (ANN) search. In particular, visual features are projected into an orthogonal semantic space, and then Product Quantization (PQ) is utilized to quantize individual properties. Experimental results on five benchmark datasets demonstrate that unseen classes are represented by the Cartesian product of quantized properties with little quantization error. As classes in the orthogonal common space are more discriminative, classification based on PQZSL achieves state-of-the-art performance in the Generalized Zero-Shot Learning (GZSL) task; meanwhile, ANN search is 10-100 times faster than traditional NN search.
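A minimal sketch of the product quantization encoding and the asymmetric distance computation (ADC) used for the search step; codebook learning (e.g., per-subspace k-means) and the orthogonal semantic projection are omitted, and names are illustrative.

import numpy as np

def pq_encode(x, codebooks):
    # codebooks: list of M arrays, each (K, d/M); x: vector of length d (d divisible by M).
    subvectors = np.split(x, len(codebooks))
    return [int(np.argmin(np.linalg.norm(cb - s, axis=1)))
            for cb, s in zip(codebooks, subvectors)]

def adc_distance(query, codes, codebooks):
    # Asymmetric distance: raw query sub-vectors vs. quantized database codes.
    subvectors = np.split(query, len(codebooks))
    return sum(float(np.sum((s - cb[c]) ** 2))
               for s, cb, c in zip(subvectors, codebooks, codes))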
[dataset, time] [projected, defined, error, shape, constraint, matrix, analysis, cartesian] [figure, method, proposed, color, ieee, based, image] [quantization, search, number, accuracy, codebook, orthogonal, quantizer, quantize, complexity, imagenet, compact, compressed, size, efficient, computational, achieves, speed, deep, table, consumption, ratio, approximate, performance] [visual, common, vector, represent, machine, required, requires] [semantic] [unseen, learning, product, ann, space, training, class, embedding, classification, learned, set, function, nearest, embeddings, distance, loss, test, train, objective, update, zys, neighbor, trained, generalized, testing, codewords, adc, learn, datasets, pqzsl, main, zsl, similarity, independent, generally, large, data]
@InProceedings{Li_2019_CVPR,
  author = {Li, Jin and Lan, Xuguang and Liu, Yang and Wang, Le and Zheng, Nanning},
  title = {Compressing Unknown Images With Product Quantizer for Efficient Zero-Shot Classification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
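As background for the PQZSL entry above: the compression and ANN search it builds on is standard product quantization, where a feature vector is split into sub-vectors, each sub-vector is replaced by the index of its nearest codebook centroid, and queries are answered with asymmetric distance computation (ADC) via per-sub-space lookups. The numpy sketch below illustrates only this generic building block; codebook sizes and dimensions are arbitrary placeholders, not the paper's configuration.

import numpy as np

def pq_encode(x, codebooks):
    """Encode a vector x (D,) as one centroid index per sub-space.
    codebooks: (M, K, D//M) array of K centroids for each of M sub-spaces."""
    M, K, d = codebooks.shape
    codes = np.empty(M, dtype=np.int64)
    for m in range(M):
        sub = x[m * d:(m + 1) * d]
        codes[m] = np.argmin(((codebooks[m] - sub) ** 2).sum(axis=1))
    return codes

def adc_distance(query, codes, codebooks):
    """Approximate squared distance between an uncompressed query and a PQ code:
    sum of per-sub-space distances to the stored centroids."""
    M, K, d = codebooks.shape
    dist = 0.0
    for m in range(M):
        sub = query[m * d:(m + 1) * d]
        dist += ((codebooks[m, codes[m]] - sub) ** 2).sum()
    return dist

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(4, 16, 8))                     # M=4 sub-spaces, K=16 centroids each
db_codes = [pq_encode(v, codebooks) for v in rng.normal(size=(100, 32))]
query = rng.normal(size=32)
nearest = int(np.argmin([adc_distance(query, c, codebooks) for c in db_codes]))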
Self-Supervised Convolutional Subspace Clustering Network
Junjian Zhang, Chun-Guang Li, Chong You, Xianbiao Qi, Honggang Zhang, Jun Guo, Zhouchen Lin


Subspace clustering methods based on data self-expression have become very popular for learning from data that lie in a union of low-dimensional linear subspaces. However, the applicability of subspace clustering has been limited because practical visual data in raw form do not necessarily lie in such linear subspaces. On the other hand, while Convolutional Neural Network (ConvNet) has been demonstrated to be a powerful tool for extracting discriminative features from visual data, training such a ConvNet usually requires a large amount of labeled data, which are unavailable in subspace clustering applications. To achieve simultaneous feature learning and subspace clustering, we propose an end-to-end trainable framework, called Self-Supervised Convolutional Subspace Clustering Network (S^2ConvSCN), that combines a ConvNet module (for feature learning), a self-expression module (for subspace clustering) and a spectral clustering module (for self-supervision) into a joint optimization framework. Particularly, we introduce a dual self-supervision that exploits the output of spectral clustering to supervise the training of the feature learning module (via a classification loss) and the self-expression module (via a spectral clustering loss). Our experiments on four benchmark datasets show the effectiveness of the dual self-supervision and demonstrate superior performance of our proposed approach.
[joint, term, lie, framework, second] [linear, matrix, computer, pattern, international, vision, analysis, robust, form, error, optimization] [spectral, ieee, conference, image, proposed, face, dual, latent, based, result, supervise] [convolutional, network, stacked, sparse, table, output, kernel, deep, cost, neural, norm, tradeoff, tmax, size, trainable, ssc] [median, machine, model, vector, find, introduce] [feature, module, segmentation, extraction, affinity, union] [clustering, subspace, data, learning, training, set, loss, convscn, extended, yale, representation, function, train, space, classification, class, learn, combination, learned, update, experimental]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Junjian and Li, Chun-Guang and You, Chong and Qi, Xianbiao and Zhang, Honggang and Guo, Jun and Lin, Zhouchen},
  title = {Self-Supervised Convolutional Subspace Clustering Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
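The self-expression module in the S^2ConvSCN entry above rests on the classical subspace-clustering idea that each sample is reconstructed as a combination of the other samples, and the coefficient matrix then defines the affinity fed to spectral clustering. The numpy sketch below solves a simplified Frobenius-regularized version in closed form; the paper instead learns the coefficients end-to-end together with ConvNet features and a spectral-clustering-based self-supervision, so treat this only as a toy illustration.

import numpy as np

def self_expression(Z, lam=1e-2):
    """Solve min_C ||Z - C Z||_F^2 + lam ||C||_F^2 for row-wise features Z (N, D)."""
    G = Z @ Z.T                                         # (N, N) Gram matrix
    C = np.linalg.solve(G + lam * np.eye(len(Z)), G)    # closed-form coefficients
    np.fill_diagonal(C, 0.0)                            # discourage trivial self-reconstruction
    return C

def affinity(C):
    """Symmetric affinity matrix for spectral clustering."""
    A = np.abs(C)
    return 0.5 * (A + A.T)

rng = np.random.default_rng(0)
# two groups of 20 samples, each group drawn from its own 5-dimensional subspace of R^32
Z = np.vstack([rng.normal(size=(20, 5)) @ rng.normal(size=(5, 32)) for _ in range(2)])
A = affinity(self_expression(Z))     # affinities tend to concentrate within each group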
Multi-Scale Geometric Consistency Guided Multi-View Stereo
Qingshan Xu, Wenbing Tao


In this paper, we propose an efficient multi-scale geometric consistency guided multi-view stereo method for accurate and complete depth map estimation. We first present our basic multi-view stereo method with Adaptive Checkerboard sampling and Multi-Hypothesis joint view selection (ACMH). It leverages structured region information to sample better candidate hypotheses for propagation and infer the aggregation view subset at each pixel. For the depth estimation of low-textured areas, we further propose to combine ACMH with multi-scale geometric consistency guidance (ACMM) to obtain the reliable depth estimates for low-textured areas at coarser scales and guarantee that they can be propagated to finer scales. To correct the erroneous estimates propagated from the coarser scales, we present a novel detail restorer. Experiments on extensive datasets show our method achieves state-of-the-art performance, recovering the depth estimation not only in low-textured areas but also in details.
[propagation, current, joint, hypothesis, dataset, sequential, perform, previous, key] [depth, view, acmh, geometric, matching, stereo, acmm, computer, photometric, pattern, colmap, estimation, checkerboard, reliable, vision, error, point, coarser, good, finer, multiview, restorer, absolute, corresponding, dwta, gipuma, robust, estimate, strecha, initial, ground, truth, international, lowtextured, surface, coarsest] [consistency, image, ieee, method, conference, detail, figure, based, pixel, difference, patchmatch, patch, reference] [selection, cost, scale, better, structured, adaptive, table, basic] [machine, correct, infer] [map, region, guidance, challenging, propose, benchmark] [datasets, sampling]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Qingshan and Tao, Wenbing},
  title = {Multi-Scale Geometric Consistency Guided Multi-View Stereo},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Privacy Preserving Image-Based Localization
Pablo Speciale, Johannes L. Schonberger, Sing Bing Kang, Sudipta N. Sinha, Marc Pollefeys


Image-based localization is a core component of many augmented/mixed reality (AR/MR) and autonomous robotic systems. Current localization systems rely on the persistent storage of 3D point clouds of the scene to enable camera pose estimation, but such data reveals potentially sensitive scene information. This gives rise to significant privacy risks, especially as for many applications 3D mapping is a background process that the user might not be fully aware of. We pose the following question: How can we avoid disclosing confidential information about the captured 3D scene, and yet allow reliable camera pose estimation? This paper proposes the first solution to what we call privacy preserving image-based localization. The key idea of our approach is to lift the map representation from a 3D point cloud to a 3D line cloud. This novel representation obfuscates the underlying scene geometry while providing sufficient geometric constraints to enable robust and accurate 6-DOF camera pose estimation. Extensive experiments on several datasets and localization scenarios underline the high practical relevance of our proposed approach.
[recognition, multiple, work] [pose, point, camera, computer, vision, cloud, problem, international, scene, minimal, estimation, approach, pattern, geometric, estimate, solution, single, geometry, relative, absolute, typically, direction, general, error, confidential, accurate, corresponding, local, matching, solve, case, ransac, rotation, robotics, constraint, analysis, journal, reality, robust, practical, solving, pinhole, sfm] [conference, image, preserving, traditional, based, proposed, figure, transformation, mapping, high] [efficient, scale, number, table, structure, mobile] [privacy, visual, random, refer, query, model, enable, secret] [localization, map, three, location, european] [generalized, representation, data, learning]
@InProceedings{Speciale_2019_CVPR,
  author = {Speciale, Pablo and Schonberger, Johannes L. and Bing Kang, Sing and Sinha, Sudipta N. and Pollefeys, Marc},
  title = {Privacy Preserving Image-Based Localization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
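To make the obfuscation idea in the entry above concrete: each confidential 3D map point is replaced by a 3D line through it with a random direction, so only lines are stored, yet an observed 2D keypoint still constrains the camera pose because its back-projected viewing ray must intersect the corresponding line. The numpy sketch below is purely illustrative and not the authors' implementation.

import numpy as np

def lift_points_to_lines(points, rng):
    """points: (N, 3) map points. Returns unit line directions (N, 3)."""
    directions = rng.normal(size=points.shape)
    return directions / np.linalg.norm(directions, axis=1, keepdims=True)

def point_line_distance(x, p, d):
    """Distance from 3D point x to the line through p with unit direction d."""
    return np.linalg.norm(np.cross(x - p, d))

rng = np.random.default_rng(0)
pts = rng.uniform(-5.0, 5.0, size=(1000, 3))     # original (confidential) point cloud
dirs = lift_points_to_lines(pts, rng)
# A privacy-preserving map would store, per feature, only a line (e.g. some point on it
# plus dirs[i]) rather than pts[i] itself, hiding where on the line the scene point lies.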
SimulCap : Single-View Human Performance Capture With Cloth Simulation
Tao Yu, Zerong Zheng, Yuan Zhong, Jianhui Zhao, Qionghai Dai, Gerard Pons-Moll, Yebin Liu


This paper proposes a new method for live free-viewpoint human performance capture with dynamic details (e.g., cloth wrinkles) using a single RGBD camera. Our main contributions are: (i) a multi-layer representation of garments and body, and (ii) a physics-based performance capture procedure. We first digitize the performer using a multi-layer surface representation, which includes the undressed body surface and separate clothing meshes. For performance capture, we perform skeleton tracking, cloth simulation, and iterative depth fitting sequentially for each incoming frame. By incorporating cloth simulation into the performance capture pipeline, we can simulate plausible cloth dynamics and cloth-body interactions even in occluded regions, which was not possible in previous capture methods. Moreover, by formulating depth fitting as a physical process, our system produces cloth tracking results consistent with the depth observation while still maintaining physical constraints. Results and evaluations show the effectiveness of our method. Our method also enables new types of applications such as cloth retargeting, free-viewpoint video rendering and animations.
[mar] [] [] [] [] [] []
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Tao and Zheng, Zerong and Zhong, Yuan and Zhao, Jianhui and Dai, Qionghai and Pons-Moll, Gerard and Liu, Yebin},
  title = {SimulCap : Single-View Human Performance Capture With Cloth Simulation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Hierarchical Deep Stereo Matching on High-Resolution Images
Gengshan Yang, Joshua Manela, Michael Happold, Deva Ramanan


We explore the problem of real-time stereo matching on high-res imagery. Many state-of-the-art (SOTA) methods struggle to process high-res imagery because of memory constraints or speed limitations. To address this issue, we propose an end-to-end framework that searches for correspondences incrementally over a coarse-to-fine hierarchy. Because high-res stereo datasets are relatively rare, we introduce a dataset with high-res stereo pairs for both training and evaluation. Our approach achieved SOTA performance on Middlebury-v3 and KITTI-15 while running significantly faster than its competitors. The hierarchical design also naturally allows for anytime on-demand reports of disparity by capping intermediate coarse results, allowing us to accurately predict disparity for near-range structures with low latency (30ms). We demonstrate that the performance-vs-speed tradeoff afforded by on-demand hierarchies may address sensing needs for time-critical applications such as autonomous driving.
[driving, flow, dataset, time, optical] [stereo, matching, disparity, volume, depth, error, sota, calibration, range, autonomous, pattern, scene, accurate, camera, avgerr, iresnet, analysis] [image, resolution, figure, method, synthetic, sensing, collected, input, ieee] [network, cost, table, running, deep, efficient, increase, output, performance, group, scale, architecture, search, apply, number, size, design] [model, rob, machine, making, memory] [feature, pyramid, hierarchical, coarse, benchmark, faster, stage, propose, global] [training, datasets, asymmetric, augmentation, data, set, train, target, test, learning, large, loss, address]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Gengshan and Manela, Joshua and Happold, Michael and Ramanan, Deva},
  title = {Hierarchical Deep Stereo Matching on High-Resolution Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Recurrent MVSNet for High-Resolution Multi-View Stereo Depth Inference
Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, Long Quan


Deep learning has recently demonstrated its excellent performance for multi-view stereo (MVS). However, one major limitation of current learned MVS approaches is scalability: the memory-consuming cost volume regularization makes learned MVS hard to apply to high-resolution scenes. In this paper, we introduce a scalable multi-view stereo framework based on a recurrent neural network. Instead of regularizing the entire 3D cost volume in one go, the proposed Recurrent Multi-view Stereo Network (R-MVSNet) sequentially regularizes the 2D cost maps along the depth direction via a gated recurrent unit (GRU). This dramatically reduces memory consumption and makes high-resolution reconstruction feasible. We first show the state-of-the-art performance achieved by the proposed R-MVSNet on the recent MVS benchmarks. Then, we further demonstrate the scalability of the proposed method on several large-scale scenarios, where previous learned approaches often fail due to the memory constraint. Code is available at https://github.com/YoYo000/MVSNet.
[gru, dataset, recurrent, previous, sequential, fusion, recognition, current] [depth, mvsnet, volume, reconstruction, stereo, vision, dtu, computer, point, pattern, camera, range, dmax, cloud, ground, truth, international, approach, multiview, error, analysis] [image, proposed, reference, input, method, filtering, conference, resolution, figure, based] [cost, regularization, cnns, network, convolutional, deep, size, regularized, performance, architecture, number, table, neural, apply, processing, scalable, sequentially, regularize, requirement, output] [memory, model, probability, machine, evaluation, variational, introduce] [map, spatial, refinement, context, score] [learned, learning, set, training, scalability, sample, loss, large, source]
@InProceedings{Yao_2019_CVPR,
  author = {Yao, Yao and Luo, Zixin and Li, Shiwei and Shen, Tianwei and Fang, Tian and Quan, Long},
  title = {Recurrent MVSNet for High-Resolution Multi-View Stereo Depth Inference},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Synthesizing 3D Shapes From Silhouette Image Collections Using Multi-Projection Generative Adversarial Networks
Xiao Li, Yue Dong, Pieter Peers, Xin Tong


We present a new weakly supervised learning-based method for generating novel category-specific 3D shapes from unoccluded image collections. Our method is weakly supervised and only requires silhouette annotations from unoccluded, category-specific objects. Our method does not require access to the object's 3D shape, multiple observations per object from different views, intra-image pixel correspondences, or any view annotations. Key to our method is a novel multi-projection generative adversarial network (MP-GAN) that trains a 3D shape generator to be consistent with multiple 2D projections of the 3D shapes, and without direct access to these 3D shapes. This is achieved through multiple discriminators that encode the distribution of 2D projections of the 3D shapes seen from different views. Additionally, to determine the view information for each silhouette image, we also train a view prediction network on visualizations of 3D shapes synthesized by the generator. We iteratively alternate between training the generator and training the view prediction network. We validate our multi-projection GAN on both synthetic and real image datasets. Furthermore, we also show that multi-projection GANs can aid in learning other high-dimensional distributions from lower dimensional training datasets, such as material-class specific spatially varying reflectance properties from images.
[prediction, multiple, dataset, joint, unoccluded, key, learns] [view, silhouette, shape, viewpoint, single, projection, voxel, dimensional, corresponding, chair, reconstruction, material, require, varying, reflectance, multiprojection, directly, property, svbrdf, note] [generator, image, method, figure, latent, generative, synthetic, reference, high, demonstrate, quality, proposed, input] [network, number, accuracy, table, distributed, low] [gan, discriminator, generated, fid, model, adversarial, gans, probability, bird, generating, access, sampled, vector] [object, score, weakly, three, unannotated] [training, distribution, trained, learning, data, novel, train, set, loss, class, datasets, learn, sample, classifier, supervised, large]
@InProceedings{Li_2019_CVPR,
  author = {Li, Xiao and Dong, Yue and Peers, Pieter and Tong, Xin},
  title = {Synthesizing 3D Shapes From Silhouette Image Collections Using Multi-Projection Generative Adversarial Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
The Perfect Match: 3D Point Cloud Matching With Smoothed Densities
Zan Gojcic, Caifa Zhou, Jan D. Wegner, Andreas Wieser


We propose 3DSmoothNet, a full workflow to match 3D point clouds with a siamese deep learning architecture and fully convolutional layers using a voxelized smoothed density value (SDV) representation. The latter is computed per interest point and aligned to the local reference frame (LRF) to achieve rotation invariance. Our compact, learned, rotation invariant 3D point cloud descriptor achieves 94.9% average recall on the 3DMatch benchmark data set, outperforming the state-of-the-art by more than 20 percentage points with only 32 output dimensions. This very low output dimension allows for near realtime correspondence search with 0.1 ms per feature point on a standard PC. Our approach is sensor- and scene-agnostic because of SDV, LRF and learning highly descriptive features with fully convolutional layers. We show that 3DSmoothNet trained only on RGB-D indoor scenes of buildings achieves 79.0% average recall on laser scans of outdoor vegetation, more than double the performance of our closest learning-based competitors. Code, data and pre-trained models are available online at https://github.com/zgojcic/3DSmoothNet.
[recognition, eth, outperforms, extract] [point, local, cloud, rotation, computer, vision, sdv, voxel, international, approach, lrf, descriptor, pattern, scene, percent, registration, matching, correspondence, outdoor, neighborhood, spherical, indoor, laser, geometric, ppfnet, matrix, supplementary, allows, corresponding, canonical, inlier] [conference, ieee, input, based, method, transformation, figure, raw] [performance, convolutional, deep, network, output, neural, search, processing, smoothing, denotes, number, batch, ratio] [vector, evaluation, find, machine] [feature, recall, interest, average, grid, overlapping, fully, object] [data, learning, set, representation, invariance, invariant, trained, training, function, learned, distance, shot, negative]
@InProceedings{Gojcic_2019_CVPR,
  author = {Gojcic, Zan and Zhou, Caifa and Wegner, Jan D. and Wieser, Andreas},
  title = {The Perfect Match: 3D Point Cloud Matching With Smoothed Densities},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
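A rough numpy sketch of a smoothed-density-value (SDV) voxelization like the one named above: the neighborhood of an interest point (already expressed in its local reference frame) is rasterized into a voxel grid whose cells hold Gaussian-smoothed point densities, giving the input to the descriptor network. Grid size, radius and smoothing width below are illustrative guesses, not the paper's settings.

import numpy as np

def sdv_grid(neighbors, voxels_per_dim=16, radius=0.3, sigma=0.05):
    """neighbors: (N, 3) points in the interest point's local reference frame."""
    centers = np.linspace(-radius, radius, voxels_per_dim)
    gx, gy, gz = np.meshgrid(centers, centers, centers, indexing="ij")
    grid_centers = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)        # (V, 3)
    d2 = ((grid_centers[:, None, :] - neighbors[None, :, :]) ** 2).sum(-1)
    sdv = np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1)                   # smoothed density per voxel
    sdv = sdv / (sdv.max() + 1e-12)                                      # simple normalization
    return sdv.reshape(voxels_per_dim, voxels_per_dim, voxels_per_dim)

rng = np.random.default_rng(0)
local_pts = rng.uniform(-0.3, 0.3, size=(500, 3))
grid = sdv_grid(local_pts)            # (16, 16, 16) volume fed to the descriptor CNN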
Recurrent Neural Network for (Un-)Supervised Learning of Monocular Video Visual Odometry and Depth
Rui Wang, Stephen M. Pizer, Jan-Michael Frahm


Deep learning-based, single-view depth estimation methods have recently shown highly promising results. However, such methods ignore one of the most important features for determining depth in the human vision system, which is motion. We propose a learning-based, multi-view dense depth map and odometry estimation method that uses Recurrent Neural Networks (RNN) and trains utilizing multi-view image reprojection and forward-backward flow-consistency losses. Our model can be trained in a supervised or even unsupervised mode. It is designed for depth and visual odometry estimation from video where the input frames are temporally correlated. However, it also generalizes to single-view depth estimation. Our method produces superior results to the state-of-the-art approaches for single-view and multi-view learning-based depth estimation on the KITTI driving dataset.
[lstm, flow, sequence, recurrent, previous, consecutive, temporal, current, video, recognition, motion, multiple, optical, dataset, backward, frame, hidden] [depth, estimation, odometry, reprojection, computer, vision, monocular, view, camera, single, pattern, pose, constraint, stereo, kitti, dense, estimated, geometric, reconstruction, groundtruth, equation, rmse, rel, multiview, relative, differentiable, june, consistent, volume, eigen] [image, conference, method, ieee, figure, consistency, input, comparison, proposed, row, arbitrary] [network, deep, convolutional, table, architecture, full, neural, output, seq, scale] [visual, length, encoder, evaluation] [map, utilize, ablation] [unsupervised, loss, learning, supervised, training, trained, dgm, data]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Rui and Pizer, Stephen M. and Frahm, Jan-Michael},
  title = {Recurrent Neural Network for (Un-)Supervised Learning of Monocular Video Visual Odometry and Depth},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
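The multi-view image reprojection loss mentioned in the entry above is the standard view-synthesis objective behind unsupervised depth and odometry learning: back-project target pixels with the predicted depth, move them by the predicted relative pose, project into the source frame, and compare colours. The numpy sketch below uses nearest-neighbour sampling for brevity (real implementations use differentiable bilinear sampling) and is not the authors' code.

import numpy as np

def reprojection_loss(target, source, depth, K, R, t):
    """target/source: (H, W, 3) images, depth: (H, W), K: intrinsics, (R, t): relative pose."""
    H, W = target.shape[:2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T    # (3, H*W) homogeneous pixels
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                  # back-project with depth
    cam2 = R @ cam + t[:, None]                                          # move into the source frame
    proj = K @ cam2                                                      # project into source image
    us = np.round(proj[0] / proj[2]).astype(int)
    vs = np.round(proj[1] / proj[2]).astype(int)
    valid = (us >= 0) & (us < W) & (vs >= 0) & (vs < H) & (proj[2] > 0)
    diff = np.abs(target.reshape(-1, 3)[valid] - source[vs[valid], us[valid]])
    return diff.mean()

H, W = 8, 12
K = np.array([[10.0, 0.0, W / 2], [0.0, 10.0, H / 2], [0.0, 0.0, 1.0]])
img = np.random.default_rng(0).random((H, W, 3))
loss = reprojection_loss(img, img, np.full((H, W), 2.0), K, np.eye(3), np.zeros(3))
# identity pose and identical images -> loss is exactly zero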
PointWeb: Enhancing Local Neighborhood Features for Point Cloud Processing
Hengshuang Zhao, Li Jiang, Chi-Wing Fu, Jiaya Jia


This paper presents PointWeb, a new approach to extract contextual features from the local neighborhood in a point cloud. Unlike previous work, we densely connect each point with every other point in its local neighborhood, aiming to specify the feature of each point based on the local region characteristics so as to better represent the region. A novel module, the Adaptive Feature Adjustment (AFA) module, is presented to capture the interaction between points. For each local region, an impact map carrying element-wise impact between point pairs is applied to the feature difference map. Each feature is then pulled or pushed by other features in the same region according to the adaptively learned impact indicators. The adjusted features are well encoded with region information, and thus benefit the point cloud recognition tasks, such as point cloud segmentation and classification. Experimental results show that our model outperforms the state of the art on both semantic segmentation and shape classification datasets.
[framework, dataset, graph, recognition, previous, interaction, work] [point, local, cloud, adjustment, voxel, neighborhood, approach, shape, pointnet, scannet, directly, scene, indoor, well, form, wij, wall] [difference, method, image, input, figure, paired] [impact, deep, max, adaptive, table, network, output, mlp, neural, convolution, performance, convolutional, better, compared, operation, size, accuracy, macc, adaptively, pooling, number] [model, evaluation] [feature, region, semantic, pointweb, module, segmentation, afa, map, fimp, context, area, frel, miou, object, pointcnn, fmod, clutter, center, relation, baseline, visualization, labeling] [learning, function, classification, data, set, representation, space, learned, shared, pair, web, training, novel]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Hengshuang and Jiang, Li and Fu, Chi-Wing and Jia, Jiaya},
  title = {PointWeb: Enhancing Local Neighborhood Features for Point Cloud Processing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
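A rough numpy sketch of the feature-adjustment idea described in the PointWeb entry above: within one local region, every pair of point features yields a difference vector, a small MLP maps each difference to an element-wise impact, and each feature is then pushed or pulled by the impact-weighted differences. The MLP weights here are random placeholders standing in for the learned impact function.

import numpy as np

def adjust_region_features(F, W1, b1, W2, b2):
    """F: (n, c) features of the points in one local region."""
    diff = F[:, None, :] - F[None, :, :]              # (n, n, c) pairwise feature differences
    hidden = np.maximum(diff @ W1 + b1, 0.0)          # ReLU layer applied to every difference
    impact = hidden @ W2 + b2                         # (n, n, c) element-wise impact map
    return F + (impact * diff).sum(axis=1)            # adjusted (pushed/pulled) features

rng = np.random.default_rng(0)
n, c, h = 8, 16, 32
F = rng.normal(size=(n, c))
W1, b1 = 0.1 * rng.normal(size=(c, h)), np.zeros(h)
W2, b2 = 0.1 * rng.normal(size=(h, c)), np.zeros(c)
F_adjusted = adjust_region_features(F, W1, b1, W2, b2)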
Scan2Mesh: From Unstructured Range Scans to 3D Meshes
Angela Dai, Matthias Niessner


We introduce Scan2Mesh, a novel data-driven generative approach which transforms an unstructured and potentially incomplete range scan into a structured 3D mesh representation. The main contribution of this work is a generative neural network architecture whose input is a range scan of a 3D object and whose output is an indexed face set conditioned on the input scan. In order to generate a 3D mesh as a set of vertices and face indices, the generative model builds on a series of proxy losses for vertices, edges, and faces. At each stage, we realize a one-to-one discrete mapping between the predicted and ground truth data points with a combination of convolutional- and graph neural network architectures. This enables our algorithm to predict a compact mesh representation similar to those created through manual artist effort using 3D modeling software. Our generated mesh results thus produce sharper, cleaner meshes with a fundamentally different structure from those generated through implicit functions, a first step in bridging the gap towards artist-created CAD models.
[graph, predict, prediction, recognition, work, predicting] [mesh, dist, nsim, vertex, ground, truth, volumetric, scan, approach, surface, shape, computer, completion, shapenet, range, vision, well, point, tsdf, pattern, normal, reconstruction, indexed, implicit, depth, chamfer, cad, formulation, respective, handcrafted] [input, face, dual, generative, mapping, conference, figure, method, ieee, poisson, image, quality] [neural, network, structure, order, deep, output, architecture, table, convolutional] [generate, model, arxiv, preprint, generation, partial, sampled, regular, evaluate, generated, potential, greedy, generating] [predicted, object, edge, final, feature, propose, grid] [distance, set, learning, train, training, representation, data, target, measure]
@InProceedings{Dai_2019_CVPR,
  author = {Dai, Angela and Niessner, Matthias},
  title = {Scan2Mesh: From Unstructured Range Scans to 3D Meshes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Domain Adaptation for ToF Data Denoising With Adversarial Learning
Gianluca Agresti, Henrik Schaefer, Piergiorgio Sartor, Pietro Zanuttigh


Time-of-Flight data is typically affected by a high level of noise and by artifacts due to Multi-Path Interference (MPI). While various traditional approaches for ToF data improvement have been proposed, machine learning techniques have seldom been applied to this task, mostly due to the limited availability of real world training data with depth ground truth. In this paper, we avoid relying on labeled real data in the learning framework. A Coarse-Fine CNN, able to exploit multi-frequency ToF data for MPI correction, is trained on synthetic data with ground truth in a supervised way. In parallel, an adversarial learning strategy, based on the Generative Adversarial Networks (GAN) framework, is used to perform an unsupervised pixel-level domain adaptation from synthetic to real world data, exploiting unlabeled real world acquisitions. Experimental results demonstrate that the proposed approach is able to effectively denoise real world data and to outperform state-of-the-art techniques.
[dataset, perform] [depth, tof, error, ground, truth, mpi, approach, scene, vision, computer, light, note, interference, allows, dgt, multipath, agresti, pattern, estimate, range, geometry] [real, synthetic, proposed, method, input, generator, noise, conference, denoising, based, frequency, acquired, figure, ieee, remove, amount, correction, produced, composed] [network, output, order, relu, conv, deep, validation, architecture, convolutional, performance, better, layer, compared, fine, modulation, reduce] [adversarial, discriminator, evaluation, machine, introduced] [map, coarse, cnn, mae] [data, learning, domain, training, adaptation, supervised, trained, unsupervised, loss, egt, set, strategy, unlabeled, noisy, augmentation, train, novel, avoid, datasets, large, idea]
@InProceedings{Agresti_2019_CVPR,
  author = {Agresti, Gianluca and Schaefer, Henrik and Sartor, Piergiorgio and Zanuttigh, Pietro},
  title = {Unsupervised Domain Adaptation for ToF Data Denoising With Adversarial Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Independent Object Motion From Unlabelled Stereoscopic Videos
Zhe Cao, Abhishek Kar, Christian Hane, Jitendra Malik


We present a system for learning motion maps of independently moving objects from stereo videos. The only annotations used in our system are 2D object bounding boxes which introduce the notion of objects in our system. Unlike prior learning based approaches which have focused on predicting dense optical flow fields and/or depth maps for images, we propose to predict instance specific 3D scene flow maps and instance masks from which we derive a factored 3D motion map for each object instance. Our network takes the 3D geometry of the problem into account which allows it to correlate the input images and distinguish moving objects from static ones. We present experiments evaluating the accuracy of our 3D flow vectors, as well as depth maps and projected 2D optical flow where our jointly learned system outperforms earlier approaches trained for each task independently.
[flow, moving, motion, optical, time, prediction, frame, predict, independently, dataset, work, static, multiple, warped, jointly, learns, key] [scene, depth, stereo, camera, disparity, dense, geometry, single, view, kitti, photometric, error, binocular, ground, direction, additional, godard, geometric, estimation, truth, classical, left, christian, allows, plane] [image, method, consistency, based, figure, reference, input, pixel, produce, raw, captured] [network, full, denotes, table, structure, architecture, speed, convolutional, factored] [system, evaluation, model] [object, mask, map, roi, instance, grid, predicted, bounding, iou, feature, jitendra, supervision, box, cnn, average] [learning, loss, training, set, unsupervised, representation, trained, test, data, train]
@InProceedings{Cao_2019_CVPR,
  author = {Cao, Zhe and Kar, Abhishek and Hane, Christian and Malik, Jitendra},
  title = {Learning Independent Object Motion From Unlabelled Stereoscopic Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Single-Image Depth From Videos Using Quality Assessment Networks
Weifeng Chen, Shengyi Qian, Jia Deng


Depth estimation from a single image in the wild remains a challenging problem. One main obstacle is the lack of high-quality training data for images in the wild. In this paper we propose a method to automatically generate such data through Structure-from-Motion (SfM) on Internet videos. The core of this method is a Quality Assessment Network that identifies high-quality reconstructions obtained from SfM. Using this method, we collect single-view depth training data from a large number of YouTube videos and construct a new dataset called YouTube3D. Experiments show that YouTube3D is useful in training depth estimation networks and advances the state of the art of single-view depth estimation in the wild.
[dataset, assessment, work, video, construct, perform, consists, outperforms] [depth, computer, sfm, qanet, relative, vision, diw, nyu, pattern, point, internet, ground, estimation, truth, single, monocular, focal, approach, reconstruction, stereo, error, ytu, thomas, manual, camera, indoor, note, encdecresnet, redweb, good] [quality, conference, method, ieee, input, image, reconstructed, collected, prior, figure, based, collect, arbitrary] [network, better, imagenet, automatically, called, deep, performance, compare, standalone, rate, convolutional, number] [arxiv, preprint, random, generate, length, diverse] [feature, score, average, european] [training, data, train, set, learning, trained, large, datasets, ranking, existing, metric, test, similarity]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Weifeng and Qian, Shengyi and Deng, Jia},
  title = {Learning Single-Image Depth From Videos Using Quality Assessment Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning 3D Human Dynamics From Video
Angjoo Kanazawa, Jason Y. Zhang, Panna Felsen, Jitendra Malik


From an image of a person in action, we can easily guess the 3D motion of the person in the immediate past and future. This is because we have a mental model of 3D human dynamics that we have acquired from observing visual sequences of humans in motion. We present a framework that can similarly learn a representation of 3D dynamics of humans from video via a simple but effective temporal encoding of image features. At test time, from video, the learned temporal representation gives rise to smooth 3D mesh predictions. From a single image, our model can recover the current 3D mesh as well as its 3D past and future motion. Our approach is designed so it can learn from videos with 2D pose annotations in a semi-supervised manner. Though annotated data is always limited, there are millions of videos uploaded daily on the Internet. In this work, we harvest this Internet-scale source of unlabeled data by training our model on unlabeled video with pseudo-ground truth 2D pose obtained from an off-the-shelf 2D pose detector. Our experiments show that adding more videos with pseudo-ground truth 2D pose monotonically improves 3D prediction performance. We evaluate our model on the recent challenging dataset of 3D Poses in the Wild and obtain state-of-the-art performance on the 3D prediction task without any fine-tuning. The project website with video can be found at https://akanazawa.github.io/human_dynamics/.
[human, temporal, video, prediction, motion, predict, current, static, future, dataset, frame, learns, hallucinator, framework, predicting, work, time, sequence, capture, joint, temporally, consists, recurrent, penn, nearby, people, fmovie] [pose, truth, shape, body, ground, single, approach, mesh, estimation, well, monocular, reprojection, predicts, error, keypoints, strip, local, openpose] [image, smooth, figure, prior, input, change, method] [table, convolutional, network, performance, deep, acceleration, output] [model, encoder, evaluate, movie] [predicted, context, propose, improves, feature, fully] [representation, learning, train, training, learn, trained, datasets, data, loss, unlabeled, test, learned, hallucinate]
@InProceedings{Kanazawa_2019_CVPR,
  author = {Kanazawa, Angjoo and Zhang, Jason Y. and Felsen, Panna and Malik, Jitendra},
  title = {Learning 3D Human Dynamics From Video},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Lending Orientation to Neural Networks for Cross-View Geo-Localization
Liu Liu, Hongdong Li


This paper studies the image-based geo-localization (IBL) problem using ground-to-aerial cross-view matching. The goal is to predict the spatial location of a ground-level query image by matching it to a large geotagged aerial image database (e.g., satellite imagery). This is a challenging task due to the drastic differences in their viewpoints and visual appearances. Existing deep learning methods for this problem have focused on maximizing feature similarity between spatially close-by image pairs, while minimizing it for image pairs that are far apart. They do so by deep feature embedding based on visual appearance in those ground-and-aerial images. However, in everyday life, humans commonly use orientation information as an important cue for the task of spatial localization. Inspired by this insight, this paper proposes a novel method which endows deep neural networks with the 'commonsense' of orientation. Given a ground-level spherical panoramic image as query input (and a large georeferenced satellite image database), we design a Siamese network which explicitly encodes the orientation (i.e., spherical directions) of each pixel of the images. Our method significantly boosts the discriminative power of the learned deep features, leading to a much higher recall and precision, outperforming all previous methods. Our network is also more compact, using only 1/5th the number of parameters of the previously best-performing network. To evaluate the generalization of our method, we also created a large-scale cross-view localization benchmark containing 100K geotagged ground-aerial pairs covering a city. Our codes and datasets are available at https://github.com/Liumouliu/OriCNN.
[dataset, previous, liu, incorporate, complex, recognition] [orientation, vision, computer, matching, panorama, pattern, rgb, spherical, directional, view, direction, covering, ground, relative, international, hongdong, problem, azimuth, error] [image, figure, method, ieee, conference, based, input, pixel, comparison, database] [deep, network, neural, siamese, top, convolutional, performance, net, architecture] [query, simple, visual, true, hope, represent] [satellite, localization, feature, map, cnn, aerial, cvusa, recall, cvact, location, north, baseline, google, street, geographic, semantic, lending, benchmark] [learning, paper, triplet, loss, metric, task, embedding, training, learned, idea, set, large]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Liu and Li, Hongdong},
  title = {Lending Orientation to Neural Networks for Cross-View Geo-Localization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
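One way to read the entry above is that each pixel of the ground-level panorama (and of the aerial image) is given explicit orientation coordinates as extra input channels before the Siamese CNN. The numpy sketch below shows that encoding for an equirectangular panorama, where columns map to azimuth and rows to elevation; the normalisation to [-1, 1] is an illustrative choice, not necessarily the paper's.

import numpy as np

def add_orientation_channels(panorama):
    """panorama: (H, W, 3) image. Returns (H, W, 5) with azimuth/elevation channels appended."""
    H, W = panorama.shape[:2]
    azimuth = np.linspace(-1.0, 1.0, W)[None, :].repeat(H, axis=0)      # varies left to right
    elevation = np.linspace(-1.0, 1.0, H)[:, None].repeat(W, axis=1)    # varies top to bottom
    return np.concatenate([panorama, azimuth[..., None], elevation[..., None]], axis=-1)

pano = np.zeros((128, 512, 3), dtype=np.float32)
pano5 = add_orientation_channels(pano)     # (128, 512, 5) input to the ground branch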
Visual Localization by Learning Objects-Of-Interest Dense Match Regression
Philippe Weinzaepfel, Gabriela Csurka, Yohann Cabon, Martin Humenberger


We introduce a novel CNN-based approach for visual localization from a single RGB image that relies on densely matching a set of Objects-of-Interest (OOIs). In this paper, we focus on planar objects which are highly descriptive in an environment, such as paintings in museums or logos and storefronts in malls or airports. For each OOI, we define a reference image for which 3D world coordinates are available. Given a query image, our CNN model detects the OOIs, segments them and finds a dense set of 2D-2D matches between each detected OOI and its corresponding reference image. Given these 2D-2D matches, together with the 3D world coordinates of each reference image, we obtain a set of 2D-3D matches from which solving a Perspective-n-Point problem gives a pose estimate. We show that 2D-3D matches for reference images, as well as OOI annotations can be obtained for all training images from a single instance annotation per OOI by leveraging Structure-from-Motion reconstruction. We introduce a novel synthetic dataset, VirtualGallery, which targets challenges such as varying lighting conditions and different occlusion levels. Our results show that our method achieves high precision and is robust to these challenges. We also experiment using the Baidu localization dataset captured in a shopping mall. Our approach is the first deep regression-based method to scale to such a larger environment.
[dataset, relies, human, multiple, consists] [lighting, ooi, oois, pose, dense, camera, error, approach, localized, homography, scene, position, accurate, matching, baidu, posenet, coordinate, colmap, torsten, planar, loop, varying, corresponding, problem, regressing, virtualgallery, point] [image, reference, figure, method, captured, based, color, mapping, study, amount] [performance, deep, standard, impact, number] [visual, query, robustness, model, blue, generate] [localization, mask, cnn, segmentation, box, threshold, instance, detection, detected, annotation, regression, shopping, object, segment] [training, data, test, set, learning, large, loss, augmentation, trained, train, class]
@InProceedings{Weinzaepfel_2019_CVPR,
  author = {Weinzaepfel, Philippe and Csurka, Gabriela and Cabon, Yohann and Humenberger, Martin},
  title = {Visual Localization by Learning Objects-Of-Interest Dense Match Regression},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Bilateral Cyclic Constraint and Adaptive Regularization for Unsupervised Monocular Depth Prediction
Alex Wong, Stefano Soatto


Supervised learning methods to infer (hypothesize) depth of a scene from a single image require costly per-pixel ground-truth. We follow a geometric approach that exploits abundant stereo imagery to learn a model to hypothesize scene structure without direct supervision. Although we train a network with stereo pairs, we only require a single image at test time to hypothesize disparity or depth. We propose a novel objective function that exploits the bilateral cyclic relationship between the left and right disparities and we introduce an adaptive regularization scheme that allows the network to handle both the co-visible and occluded regions in a stereo pair. This process ultimately produces a model to generate hypotheses for the 3-dimensional structure of the scene as viewed in a single image. When used to generate a single (most probable) estimate of depth, our method outperforms state-of-the-art unsupervised monocular depth prediction methods on the KITTI benchmarks. We show that our method generalizes well by applying our models trained on KITTI to the Make3d dataset.
[prediction, term, time, video, outperforms, dataset] [depth, disparity, computer, kitti, stereo, monocular, vision, single, pattern, scene, zxy, cyclic, left, dxy, regularity, local, eigen, approach, reconstruction, solution, reprojection, error, international, constraint, estimation, godard] [image, conference, ieee, method, proposed, fidelity, consistency, bilateral, high, based, qualitative] [adaptive, regularization, network, residual, deep, neural, scheme, applying, convolutional, performance, accuracy, table, structure, reduce, conv] [model, decoder, arxiv, preprint, generate] [propose, global, european, branch, fully, spatial] [data, unsupervised, learning, training, loss, function, weighting, trained, generic, split, supervised, learn, metric, pair, minimize]
@InProceedings{Wong_2019_CVPR,
  author = {Wong, Alex and Soatto, Stefano},
  title = {Bilateral Cyclic Constraint and Adaptive Regularization for Unsupervised Monocular Depth Prediction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
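For intuition about the bilateral cyclic relationship exploited above, the numpy sketch below implements only the simpler, standard left-right consistency check between the two disparity maps: the right-view disparities are resampled at the positions implied by the left-view disparities, and at co-visible pixels the two should agree. The paper's actual constraint is cyclic and is combined with adaptive regularization to handle occluded regions; nearest-neighbour sampling is used here for brevity.

import numpy as np

def lr_consistency(disp_left, disp_right):
    """disp_left, disp_right: (H, W) disparities in pixels. Returns per-pixel inconsistency."""
    H, W = disp_left.shape
    u = np.arange(W)[None, :].repeat(H, axis=0)
    u_right = np.clip(np.round(u - disp_left).astype(int), 0, W - 1)    # matching column in the right view
    resampled = np.take_along_axis(disp_right, u_right, axis=1)
    return np.abs(disp_left - resampled)

disp = np.full((4, 16), 3.0)          # constant-disparity (fronto-parallel) toy scene
err = lr_consistency(disp, disp)      # zero everywhere: the two views are consistent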
Face Parsing With RoI Tanh-Warping
Jinpeng Lin, Hao Yang, Dong Chen, Ming Zeng, Fang Wen, Lu Yuan


Face parsing computes pixel-wise label maps for different semantic components (e.g., hair, mouth, eyes) from face images. The existing face parsing literature has demonstrated significant advantages from focusing on individual regions of interest (RoIs) for faces and facial components. However, the traditional crop-and-resize focusing mechanism ignores all contextual area outside the RoIs, and thus is not suitable when the component area is unpredictable, e.g. hair. Inspired by the physiological vision system of humans, we propose a novel RoI Tanh-warping operator that combines the central vision and the peripheral vision together. It addresses the dilemma between a limited-size RoI for focusing and an unpredictable area of surrounding context for peripheral information. To this end, we propose a novel hybrid convolutional neural network for face parsing. It uses a hierarchical local-based method for inner facial components and global methods for outer facial components. The whole framework is simple and principled, and can be trained end-to-end. To facilitate future research on face parsing, we also manually relabel the training data of the HELEN dataset and will make it public. Experiments on both HELEN and LFW-PL benchmarks demonstrate that our method surpasses state-of-the-art methods.
[warped, prediction, predict, previous, individual, dataset, extract, liu] [vision, computer, coordinate, pattern, local, directly, equation, left, good] [face, facial, component, hybrid, image, hair, figure, input, conference, method, helen, proposed, outer, ieee, central, comparison, skin, mouth, handle] [network, structure, convolutional, neural, output, table, accuracy, padding, operator, size, original, deep, apply] [model, system] [segmentation, inner, parsing, roi, feature, mask, semantic, module, area, region, peripheral, fcn, propose, rectangle, bounding, focusing, surrounding, fully, spatial, cropping, global, baseline, box, cnn] [label, loss, trained, training, align, learning, existing, novel]
@InProceedings{Lin_2019_CVPR,
  author = {Lin, Jinpeng and Yang, Hao and Chen, Dong and Zeng, Ming and Wen, Fang and Yuan, Lu},
  title = {Face Parsing With RoI Tanh-Warping},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
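To give a flavour of the operator described above, the toy numpy sketch below builds a tanh-based warp of image coordinates: output pixels in the centre sample the face RoI almost linearly (high resolution), while the rest of the image is squeezed into the periphery of a fixed-size output. The exact operator and parameterisation in the paper may differ; this only illustrates the central-plus-peripheral idea.

import numpy as np

def tanh_warp_coords(out_size, roi_center, roi_half_extent, img_size):
    """Return, for every output pixel, the source (x, y) location it samples from.
    roi_center and roi_half_extent are in source pixels; img_size is (H, W)."""
    H, W = out_size
    ys, xs = np.meshgrid(np.linspace(-0.999, 0.999, H),
                         np.linspace(-0.999, 0.999, W), indexing="ij")
    src = np.arctanh(np.stack([xs, ys], axis=-1))        # invert tanh: output -> normalised source
    src = src * roi_half_extent + roi_center             # back to source pixel coordinates
    upper = np.array([img_size[1] - 1, img_size[0] - 1], dtype=float)
    return np.clip(src, 0.0, upper)

coords = tanh_warp_coords((64, 64),
                          roi_center=np.array([120.0, 90.0]),
                          roi_half_extent=np.array([40.0, 50.0]),
                          img_size=(256, 256))           # (64, 64, 2) sampling locations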
Multi-Person Articulated Tracking With Spatial and Temporal Embeddings
Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian


We propose a unified framework for multi-person pose estimation and tracking. Our framework consists of two main components, i.e. SpatialNet and TemporalNet. The SpatialNet accomplishes body part detection and part-level data association in a single frame, while the TemporalNet groups human instances in consecutive frames into trajectories. Specifically, besides body part detection heatmaps, SpatialNet also predicts the Keypoint Embedding (KE) and Spatial Instance Embedding (SIE) for body part association. We model the grouping procedure into a differentiable Pose-Guided Grouping (PGG) module to make the whole part detection and grouping pipeline fully end-to-end trainable. TemporalNet extends the spatial grouping of keypoints to temporal grouping of human instances. Given human proposals from two consecutive frames, TemporalNet exploits both appearance features encoded in Human Embedding (HE) and temporally consistent geometric features embodied in Temporal Instance Embedding (TIE) for robust tracking. Extensive experiments demonstrate the effectiveness of our proposed model. Remarkably, we demonstrate substantial improvements over the state-of-the-art pose tracking method from 65.4% to 71.8% Multi-Object Tracking Accuracy (MOTA) on the ICCV'17 PoseTrack Dataset.
[human, tracking, temporal, sie, spatialnet, tie, pgg, temporalnet, mota, posetrack, frame, recognition, extend, temporally, graph, online, framework, movement, previous] [pose, estimation, body, geometric, computer, vision, keypoint, single, robust, pattern, articulated, camera, algorithm, confidence, accurate, total, differentiable] [appearance, conference, ieee, based, figure, method] [table, number, group, performance, accuracy, compact, fast] [arxiv, preprint, model, memory, encoded, generate, vector] [grouping, spatial, instance, person, detection, feature, center, propose, mask, branch, module, predicted, object, fully, faster, improves] [embedding, auxiliary, learning, embeddings, loss, training, similarity, pairwise, set, learn, train]
@InProceedings{Jin_2019_CVPR,
  author = {Jin, Sheng and Liu, Wentao and Ouyang, Wanli and Qian, Chen},
  title = {Multi-Person Articulated Tracking With Spatial and Temporal Embeddings},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Person Pose Estimation With Enhanced Channel-Wise and Spatial Information
Kai Su, Dongdong Yu, Zhenqi Xu, Xin Geng, Changhu Wang


Multi-person pose estimation is an important but challenging problem in computer vision. Although current approaches have achieved significant progress by fusing the multi-scale feature maps, they pay little attention to enhancing the channel-wise and spatial information of the feature maps. In this paper, we propose two novel modules to perform the enhancement of the information for the multi-person pose estimation. First, a Channel Shuffle Module (CSM) is proposed to adopt the channel shuffle operation on the feature maps with different levels, promoting cross-channel information communication among the pyramid feature maps. Second, a Spatial, Channel-wise Attention Residual Bottleneck (SCARB) is designed to boost the original residual unit with attention mechanism, adaptively highlighting the information of the feature maps both in the spatial and channel-wise context. The effectiveness of our proposed modules is evaluated on the COCO keypoint benchmark, and experimental results show that our approach achieves the state-of-the-art results.
[human, dataset, work] [pose, estimation, computer, vision, pattern, keypoints, keypoint, hourglass, corresponding] [input, proposed, conference, method, ieee, image, figure, denoted] [channel, residual, shuffle, cpn, bottleneck, network, minival, table, operation, adaptively, convolutional, conv, scarb, achieve, size, original, achieves, sigmoid, neural, denotes, compared, performance, performed, concat, resnet, channelwise, deep] [attention, model, simple, mechanism, visual, communication] [feature, spatial, coco, pyramid, module, detection, backbone, ablation, enhance, enhanced, fused, propose, adopt, final, cascaded, baseline] [shuffled, dimension, loss]
@InProceedings{Su_2019_CVPR,
  author = {Su, Kai and Yu, Dongdong and Xu, Zhenqi and Geng, Xin and Wang, Changhu},
  title = {Multi-Person Pose Estimation With Enhanced Channel-Wise and Spatial Information},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
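As a reference point for the Channel Shuffle Module named above, the generic channel shuffle operation (as popularised by ShuffleNet) interleaves the channels of g groups so that information mixes across groups; the paper applies this kind of shuffling across pyramid-level feature maps. A minimal numpy sketch with illustrative shapes:

import numpy as np

def channel_shuffle(x, groups):
    """x: (N, C, H, W) feature map with C divisible by `groups`."""
    n, c, h, w = x.shape
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)        # swap the group axis and the per-group channel axis
    return x.reshape(n, c, h, w)

feat = np.arange(2 * 8 * 1 * 1).reshape(2, 8, 1, 1)
shuffled = channel_shuffle(feat, groups=2)
# channel order of the first item becomes [0, 4, 1, 5, 2, 6, 3, 7]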
A Compact Embedding for Facial Expression Similarity
Raviteja Vemulapalli, Aseem Agarwala


Most of the existing work on automatic facial expression analysis focuses on discrete emotion recognition, or facial action unit detection. However, facial expressions do not always fall neatly into pre-defined semantic categories. Also, the similarity between expressions measured in the action unit space need not correspond to how humans perceive expression similarity. Different from previous work, our goal is to describe facial expressions in a continuous fashion using a compact embedding space that mimics human visual preferences. To achieve this goal, we collect a large-scale faces-in-the-wild dataset with human annotations in the form: Expressions A and B are visually more similar when compared to expression C, and use this dataset to train a neural network that produces a compact (16-dimensional) expression embedding. We experimentally demonstrate that the learned embedding can be successfully used for various applications such as expression retrieval, photo album summarization, and emotion recognition. We also show that the embedding learned using the proposed dataset performs better than several other embeddings learned using existing emotion or action unit datasets.
[dataset, action, human, prediction, recognition, performs] [analysis, vision] [expression, facial, figure, image, proposed, face, based, comparison, visually, photo, album, database, ten] [accuracy, deep, network, unit, validation, layer, number, better, performance, compared, table, compact, neural, best, densenet] [visual, strong, automatic, generated, median] [category, three, average, google, third] [embedding, triplet, emotion, training, learning, fec, set, existing, learned, space, trained, test, classification, distance, loss, affectnet, embeddings, rater, similarity, pair, metric, retrieval, fecnet, raters, datasets, happiness, function, retrieved, label, train, surprise]
@InProceedings{Vemulapalli_2019_CVPR,
  author = {Vemulapalli, Raviteja and Agarwala, Aseem},
  title = {A Compact Embedding for Facial Expression Similarity},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
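The embedding above is trained from odd-one-out triplet annotations; the standard triplet objective for such data pushes the odd expression C farther from A and from B than A and B are from each other, by a margin. A minimal numpy sketch of that generic loss (margin and dimensions are illustrative, not the paper's values):

import numpy as np

def triplet_loss(a, b, c, margin=0.2):
    """a and b were judged more similar to each other than either is to c."""
    d_ab = np.sum((a - b) ** 2)
    d_ac = np.sum((a - c) ** 2)
    d_bc = np.sum((b - c) ** 2)
    return max(0.0, d_ab - d_ac + margin) + max(0.0, d_ab - d_bc + margin)

rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 16))     # 16-dimensional embeddings of the three faces
loss = triplet_loss(a, b, c)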
Deep High-Resolution Representation Learning for Human Pose Estimation
Ke Sun, Bin Xiao, Dong Liu, Jingdong Wang


In this paper, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. In addition, we show the superiority of our network in pose tracking on the PoseTrack dataset. The code and models are publicly available at https://github.com/leoxiaobin/deep-high-resolution-net.pytorch.
[human, dataset, mpii, joint, tracking, posetrack, consists, fusion] [pose, estimation, keypoint, approach, keypoints, hourglass, depth, computer] [resolution, input, figure, study, repeated, intermediate, high, method] [network, subnetworks, convolutional, size, deep, neural, simplebaseline, process, parallel, table, strided, gflops, unit, dilated, convolution, performance, efficient, wei, highresolution, residual, small, achieves, compared, gain] [model] [exchange, coco, detection, heatmap, person, feature, jingdong, subnetwork, heatmaps, box, predicted, benchmark, response, object, backbone, semantic, wanli, xiaogang] [learning, training, data, existing, classification, big]
@InProceedings{Sun_2019_CVPR,
  author = {Sun, Ke and Xiao, Bin and Liu, Dong and Wang, Jingdong},
  title = {Deep High-Resolution Representation Learning for Human Pose Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
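The repeated multi-scale fusion described above can be pictured as every parallel branch receiving the sum of all branches after resampling them to its own resolution. The toy numpy sketch below uses nearest-neighbour resizing and assumes all branches share a channel count; the actual exchange units use strided and 1x1 convolutions and change channel widths, so this is only a schematic.

import numpy as np

def resize_nearest(x, out_hw):
    """x: (C, H, W) -> (C, out_h, out_w) by nearest-neighbour sampling."""
    c, h, w = x.shape
    oh, ow = out_hw
    ys = np.arange(oh) * h // oh
    xs = np.arange(ow) * w // ow
    return x[:, ys][:, :, xs]

def fuse_branches(branches):
    """branches: list of (C, H_i, W_i) maps at different resolutions."""
    fused = []
    for target in branches:
        out = np.zeros_like(target)
        for src in branches:
            out += resize_nearest(src, target.shape[1:])
        fused.append(out)
    return fused

rng = np.random.default_rng(0)
b_high = rng.random((8, 64, 64))       # high-resolution branch
b_low = rng.random((8, 32, 32))        # low-resolution branch
f_high, f_low = fuse_branches([b_high, b_low])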
Feature Transfer Learning for Face Recognition With Under-Represented Data
Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, Manmohan Chandraker


Despite the large volume of face recognition datasets, there is a significant portion of subjects whose samples are insufficient and which are thus under-represented. Ignoring such a significant portion results in insufficient training data. Training with under-represented data leads to biased classifiers in conventionally-trained deep networks. In this paper, we propose a center-based feature transfer framework to augment the feature space of under-represented subjects from the regular subjects that have sufficiently diverse samples. A Gaussian prior on the variance is assumed across all subjects, and the variance from regular subjects is transferred to the under-represented ones. This encourages the under-represented distribution to be closer to the regular distribution. Further, an alternating training regimen is proposed to simultaneously achieve less biased classifiers and a more discriminative feature representation. We conduct an ablative study that mimics under-represented datasets by varying the portion of under-represented classes on the MS-Celeb-1M dataset. Advantageous results on LFW, IJB-A and MS-Celeb-1M demonstrate the effectiveness of our feature transfer and training strategy, compared to both general baselines and state-of-the-art methods. Moreover, our feature transfer successfully presents smooth visual interpolation, which performs disentanglement to preserve the identity of a class while augmenting its feature space with non-identity variations such as pose and lighting.
[recognition, framework, portion, dataset, work, consists] [estimation, problem] [face, figure, method, proposed, ftl, image, transferred, identity, study, transferring, based, generative] [deep, variance, regularization, performance, norm, weight, better, network, batch, table, achieve, compared, number, enc, compact, larger, accuracy, effective, applied, layer] [regular, decision, model, rich, evaluate, generate, goal] [feature, center, propose, boundary, stage, improves] [transfer, training, learning, class, data, classifier, loss, representation, space, distribution, softmax, train, sfmx, discriminative, novel, lfw, set, metric, large, biased, alternating, trained, bias, sampling, imbalanced, classification, learn, base, shared, test]
@InProceedings{Yin_2019_CVPR,
  author = {Yin, Xi and Yu, Xiang and Sohn, Kihyuk and Liu, Xiaoming and Chandraker, Manmohan},
  title = {Feature Transfer Learning for Face Recognition With Under-Represented Data},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
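The center-based transfer described above can be pictured very simply: the intra-class variation of a well-sampled (regular) class, expressed as offsets from its class center, is re-applied around the center of an under-represented class to synthesise extra feature samples for it. The numpy sketch below shows only that arithmetic with random placeholder features; the paper additionally assumes a shared Gaussian variance prior and alternates training of the feature extractor and classifier.

import numpy as np

def transfer_features(regular_feats, regular_center, ur_center):
    """regular_feats: (N, D) features of a regular class; returns N synthetic samples."""
    offsets = regular_feats - regular_center      # intra-class variation to transfer
    return ur_center + offsets                    # variation re-applied to the under-represented center

rng = np.random.default_rng(0)
regular = rng.normal(size=(50, 128)) + 2.0        # well-sampled class
regular_center = regular.mean(axis=0)
ur_center = rng.normal(size=128) - 2.0            # center of an under-represented class
augmented = transfer_features(regular, regular_center, ur_center)   # 50 extra synthetic samples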
Unsupervised 3D Pose Estimation With Geometric Self-Supervision
Ching-Hang Chen, Ambrish Tyagi, Amit Agrawal, Dylan Drover, Rohith MV, Stefan Stojanov, James M. Rehg


We present an unsupervised learning approach to recover 3D human pose from 2D skeletal joints extracted from a single image. Our method does not require any multi-view image data, 3D skeletons, correspondences between 2D-3D points, or use previously learned 3D priors during training. A lifting network accepts 2D landmarks as inputs and generates a corresponding 3D skeleton estimate. During training, the recovered 3D skeleton is reprojected on random camera viewpoints to generate new 'synthetic' 2D poses. By lifting the synthetic 2D poses back to 3D and re-projecting them in the original camera view, we can define self-consistency loss both in 3D and in 2D. The training can thus be self supervised by exploiting the geometric self-consistency of the lift-reproject-lift process. We show that self-consistency alone is not sufficient to generate realistic skeletons, however adding a 2D pose discriminator enables the lifter to output valid 3D poses. Additionally, to learn from 2D poses 'in the wild', we train an unsupervised 2D domain adapter network to allow for an expansion of 2D data. This improves results and demonstrates the usefulness of 2D pose data for unsupervised 3D lifting. Results on Human3.6M dataset for 3D human pose estimation demonstrate that our approach improves upon the previous unsupervised methods by 30% and outperforms many weakly supervised approaches that explicitly use 3D data.
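A minimal PyTorch sketch of the lift-reproject-lift self-consistency idea, assuming a simple fully-connected lifter, orthographic projection, and equal loss weights; all of these choices (and the names) are stand-ins rather than the authors' implementation.

import torch
import torch.nn as nn

class Lifter(nn.Module):
    """Maps 2D joints to per-joint depths, producing a 3D skeleton estimate."""
    def __init__(self, num_joints=17, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints))

    def forward(self, pose2d):                           # pose2d: (B, J, 2)
        b, j, _ = pose2d.shape
        depth = self.net(pose2d.reshape(b, -1)).view(b, j, 1)
        return torch.cat([pose2d, depth], dim=-1)        # (B, J, 3)

def project(pose3d, rot):
    """Rotate a 3D skeleton and orthographically project onto the x-y plane."""
    return torch.einsum('bij,bkj->bki', rot, pose3d)[..., :2]

def self_consistency_loss(lifter, pose2d, rand_rot):
    """rand_rot: (B, 3, 3) random rotations simulating new camera viewpoints."""
    pose3d = lifter(pose2d)                               # lift
    synth2d = project(pose3d, rand_rot)                   # reproject to random views
    pose3d_back = lifter(synth2d)                         # lift the synthetic 2D poses
    # bring the second lift back into the original camera frame
    pose3d_back = torch.einsum('bij,bkj->bki', rand_rot.transpose(1, 2), pose3d_back)
    loss_3d = (pose3d_back - pose3d).pow(2).mean()        # 3D self-consistency
    loss_2d = (pose3d_back[..., :2] - pose2d).pow(2).mean()  # 2D self-consistency
    return loss_2d + loss_3d
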
[human, skeleton, joint, dataset, video, recognition, motion, temporal, kinetics, previous, capture] [pose, estimation, computer, lifting, vision, approach, pattern, single, international, lifter, camera, depth, monocular, geometric, corresponding, projection, additional, ground, truth, error, note, require, body, rhodin, projected, analysis, form, angle] [conference, ieee, method, figure, image, consistency, input, real, july, amount, extracted, synthetic] [network, table, neural, convolutional, deep] [adversarial, discriminator, random, generate, generated, machine] [supervision, weakly, improve, european, pascal] [data, unsupervised, learning, supervised, loss, training, learn, domain, train, datasets, randomly, distribution, trained, learned]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Ching-Hang and Tyagi, Ambrish and Agrawal, Amit and Drover, Dylan and MV, Rohith and Stojanov, Stefan and Rehg, James M.},
  title = {Unsupervised 3D Pose Estimation With Geometric Self-Supervision},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Peeking Into the Future: Predicting Future Person Activities and Locations in Videos
Junwei Liang, Lu Jiang, Juan Carlos Niebles, Alexander G. Hauptmann, Li Fei-Fei


Deciphering human behaviors to predict their future paths/trajectories and what they would do from videos is important in many applications. Motivated by this idea, this paper studies predicting a pedestrian's future path jointly with future activities. We propose an end-to-end, multi-task learning system utilizing rich visual features about human behavioral information and interaction with their surroundings. To facilitate the training, the network is learned with an auxiliary task of predicting future location in which the activity will happen. Experimental results demonstrate our state-of-the-art performance over two public benchmarks on future trajectory prediction. Moreover, our method is able to produce meaningful future activity prediction in addition to the path. The result provides the first empirical evidence that joint modeling of paths and activities benefits future path prediction.
[activity, future, prediction, trajectory, time, lstm, human, predict, video, work, predicting, joint, hidden, social, tobs, interaction, eth, behavior, extract, previous, tracking, tpred, ucy, capture, state] [scene, focal, predicts, single, manhattan, ground, truth, pattern, analysis, computer, body, geometric, keypoint, vision] [method, ieee, proposed, figure] [network, size, block, table, convolution, performance, number, neural, deep] [model, path, visual, attention, rich, green, common, yellow, encode, correct, understanding] [person, location, grid, feature, object, module, final, detection, semantic, average, pedestrian, bounding, regression, public, utilize] [learning, loss, set, label, classification, task, training, auxiliary]
@InProceedings{Liang_2019_CVPR,
  author = {Liang, Junwei and Jiang, Lu and Carlos Niebles, Juan and Hauptmann, Alexander G. and Fei-Fei, Li},
  title = {Peeking Into the Future: Predicting Future Person Activities and Locations in Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Re-Identification With Consistent Attentive Siamese Networks
Meng Zheng, Srikrishna Karanam, Ziyan Wu, Richard J. Radke


We propose a new deep architecture for person re-identification (re-id). While re-id has seen much recent progress, spatial localization and view-invariant representation learning for robust cross-view matching remain key, unsolved problems. We address these questions by means of a new attention-driven Siamese learning architecture, called the Consistent Attentive Siamese Network. Our key innovations compared to existing, competing methods include (a) a flexible framework design that produces attention with only identity labels as supervision, (b) explicit mechanisms to enforce attention consistency among images of the same person, and (c) a new Siamese framework that integrates attention and attention consistency, producing principled supervisory signals as well as the first mechanism that can explain the reasoning behind the Siamese framework's predictions. We conduct extensive evaluations on the CUHK03-NP, DukeMTMC-ReID, and Market-1501 datasets and report competitive performance.
[prediction, framework, key, report, providing, jointly, work] [consistent, explicit, compute, robust, enforce, principled, match, provide, note, problem, camera, typically] [figure, image, input, proposed, consistency, identity, method] [siamese, deep, architecture, network, design, performance, convolutional, compared, table, neural, discussed] [attention, model, mechanism, vector, reasoning, consider, query, explain] [person, identification, feature, spatial, ide, module, bce, attentive, baseline, map, casn, localization, pcb, liang, xiaogang, interest, supervision, integrates, leading] [learning, loss, classifier, training, representation, invariant, learn, pair, objective, classification, gallery, set, retrieved, supervisory, trained]
@InProceedings{Zheng_2019_CVPR,
  author = {Zheng, Meng and Karanam, Srikrishna and Wu, Ziyan and Radke, Richard J.},
  title = {Re-Identification With Consistent Attentive Siamese Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
On the Continuity of Rotation Representations in Neural Networks
Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, Hao Li


In neural networks, it is often desirable to work with various representations of the same space. For example, 3D rotations can be represented with quaternions or Euler angles. In this paper, we advance a definition of a continuous representation, which can be helpful for training deep neural networks. We relate this to topological concepts such as homeomorphism and embedding. We then investigate what are continuous and discontinuous representations for 2D, 3D, and n-dimensional rotations. We demonstrate that for 3D rotations, all representations are discontinuous in the real Euclidean spaces of four or fewer dimensions. Thus, widely used representations such as quaternions and Euler angles are discontinuous and difficult for neural networks to learn. We show that the 3D rotations have continuous representations in 5D and 6D, which are more suitable for learning. We also present continuous representations for the general case of the n-dimensional rotation group SO(n). While our main focus is on rotations, we also show that our constructions apply to other groups such as the orthogonal group and similarity transforms. We finally present empirical results, which show that our continuous rotation representations outperform discontinuous ones for several practical problems in graphics and vision, including a simple autoencoder sanity test, a rotation estimator for 3D point clouds, and an inverse kinematics solver for 3D human poses.
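For concreteness, here is a short sketch of the 6D rotation representation discussed above: two 3D vectors are mapped to a rotation matrix by Gram-Schmidt orthogonalization, so any network output becomes a valid rotation. The function name and tensor layout are my own, not taken from the authors' code.

import torch
import torch.nn.functional as F

def rotation_from_6d(x):
    """x: (..., 6) tensor -> rotation matrices of shape (..., 3, 3)."""
    a1, a2 = x[..., :3], x[..., 3:]
    b1 = F.normalize(a1, dim=-1)                                   # first column
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)                               # right-handed third column
    return torch.stack([b1, b2, b3], dim=-1)                       # columns are b1, b2, b3

# usage: a network regresses 6 numbers per rotation and this map makes them valid
R = rotation_from_6d(torch.randn(4, 6))                            # (4, 3, 3) rotation matrices
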
[continuity, human, motion, perform, forward, work, previous] [rotation, continuous, point, discontinuous, matrix, euler, pose, projection, case, define, inverse, note, defined, definition, equation, dimensional, kinematics, angle, column, estimation, topology, fgs, stereographic, sanity, orthogonalization, vision, topological, homeomorphism, error, geodesic, cloud, computer, require] [mapping, figure, input, real, produce, ieee, conference, transform] [neural, network, original, process, group, approximation, higher, deep, orthogonal, output, suitable, number, better, size, outperform, lower] [vector, basis, empirical, represent] [object, regression] [representation, space, test, training, set, learning, function, euclidean, dimension, similarity, product]
@InProceedings{Zhou_2019_CVPR,
  author = {Zhou, Yi and Barnes, Connelly and Lu, Jingwan and Yang, Jimei and Li, Hao},
  title = {On the Continuity of Rotation Representations in Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Iterative Residual Refinement for Joint Optical Flow and Occlusion Estimation
Junhwa Hur, Stefan Roth


Deep learning approaches to optical flow estimation have seen rapid progress over the recent years. One common trait of many networks is that they refine an initial flow estimate either through multiple stages or across the levels of a coarse-to-fine representation. While leading to more accurate results, the downside of this is an increased number of parameters. Taking inspiration from both classical energy minimization approaches as well as residual networks, we propose an iterative residual refinement (IRR) scheme based on weight sharing that can be combined with several backbone networks. It reduces the number of parameters, improves the accuracy, or even achieves both. Moreover, we show that integrating occlusion prediction and bi-directional flow estimation into our IRR scheme can further boost the accuracy. Our full network achieves state-of-the-art results for both optical flow and occlusion estimation across several standard datasets.
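The weight-sharing idea is easy to express as a loop. Below is a hedged sketch of iterative residual refinement with a single shared decoder; the decoder interface is hypothetical and stands in for a backbone such as PWC-Net or FlowNet, and the number of steps is an assumed value.

import torch

def iterative_residual_refinement(decoder, feat1, feat2, num_steps=5):
    """decoder: one weight-shared module mapping (feat1, feat2, flow) -> flow residual."""
    b, _, h, w = feat1.shape
    flow = torch.zeros(b, 2, h, w, device=feat1.device)       # start from zero flow
    estimates = []
    for _ in range(num_steps):
        flow = flow + decoder(feat1, feat2, flow)              # same decoder every iteration
        estimates.append(flow)
    return estimates                                           # intermediate estimates can be supervised
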
[flow, optical, sintel, irr, flownet, multiple, dataset, joint, video, motion, liteflownet, flownets, stacking, forward, eddy, deqing, combined, previous, backward, flyingchairsocc] [occlusion, estimation, volume, michael, thomas, classical, kitti, daniel, accurate, ground, single, error, stefan, robust, dense] [clean, bilateral, image, based, method, proposed, study, comparison] [accuracy, number, network, deep, table, residual, convolutional, better, upsampling, full, layer, scheme, size, fewer, neural, cnns, iteration, validation, original] [model, iterative, alexey] [refinement, final, average, object, public, including, segmentation, semantic, ablation, improvement, baseline] [training, learning, unsupervised, loss, trained, generalization, large, supervised, minibatch, test]
@InProceedings{Hur_2019_CVPR,
  author = {Hur, Junhwa and Roth, Stefan},
  title = {Iterative Residual Refinement for Joint Optical Flow and Occlusion Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Inverse Discriminative Networks for Handwritten Signature Verification
Ping Wei, Huan Li, Ping Hu


Handwritten signature verification is an important technique for many financial, commercial, and forensic applications. In this paper, we propose an inverse discriminative network (IDN) for writer-independent handwritten signature verification, which aims to determine whether a test signature is genuine or forged compared to the reference signature. The IDN model contains four weight-shared neural network streams, of which two receiving the original signature images are the discriminative streams and the other two addressing the gray-inverted images form the inverse streams. Multiple paths of attention modules connect the discriminative streams and the inverse streams to propagate messages. With the inverse streams and the multi-path attention modules, the IDN model intensifies the effective information of signature verification. Since there was no proper Chinese signature dataset in the community, we collected a large-scale Chinese signature dataset with approximately 29,000 images of 749 individuals' signatures. We test our method on the Chinese signature dataset and other three signature datasets of different languages: CEDAR, BHSig-B, and BHSig-H. Experiments prove the strength and potential of our method.
[dataset, stream, focus, extract, merged, correlated] [inverse, single, international, pattern] [image, reference, forged, method, gray, double, based, conference, stroke, collected, figure, comparison, ieee, strength, proposed] [signature, verification, convolutional, idn, chinese, genuine, network, offline, neural, table, cedar, performance, output, layer, deep, number, frr, effective, skilled, compared, proper, sparse, writer, compare] [model, attention, handwritten, decision, mechanism, simple, indicates, system] [three, feature, module, supervision, propose, cascaded, average, roc] [test, discriminative, pair, training, large, independent, datasets, train, learning]
@InProceedings{Wei_2019_CVPR,
  author = {Wei, Ping and Li, Huan and Hu, Ping},
  title = {Inverse Discriminative Networks for Handwritten Signature Verification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Led3D: A Lightweight and Efficient Deep Approach to Recognizing Low-Quality 3D Faces
Guodong Mu, Di Huang, Guosheng Hu, Jia Sun, Yunhong Wang


Due to the intrinsic invariance to pose and illumination changes, 3D Face Recognition (FR) has a promising potential in the real world. 3D FR using high-quality faces, which are of high resolutions and with smooth surfaces, has been widely studied. However, research on 3D FR with low-quality input is limited, although it involves more applications. In this paper, we focus on 3D FR using low-quality data, targeting an efficient and accurate deep learning solution. To achieve this, we work on two aspects: (1) designing a lightweight yet powerful CNN; (2) generating finer and bigger training data. For (1), we propose a Multi-Scale Feature Fusion (MSFF) module and a Spatial Attention Vectorization (SAV) module to build a compact and discriminative CNN. For (2), we propose a data processing system including point-cloud recovery, surface refinement, and data augmentation (with newly proposed shape jittering and shape scaling). We conduct extensive experiments on Lock3DFace and achieve state-of-the-art results, outperforming many heavy CNNs such as VGG-16 and ResNet-34. In addition, our model can operate at a very high speed (136 fps) on Jetson TX2, and the promising accuracy and efficiency reached show its great applicability on edge/mobile devices.
[recognition, work, video, fusion, kinect, dataset] [depth, pose, normal, well, finer, vectorization, shape, virtual, corresponding] [face, figure, sav, based, proposed, facial, high, database, msff, image, frgc, expression, method, bosphorus, collected, real, lowquality] [network, deep, efficient, accuracy, performance, block, lightweight, cnns, table, convolutional, size, layer, jetson, applied, pooling, maxpooling, convolution, achieve, speed, architecture, compared, number] [model, generate, attention, inception, generated, generation] [feature, propose, spatial, module, global, baseline, cnn, average, leading, map, bigger, including, public] [data, training, augmentation, discriminative, learning, test, set, gallery]
@InProceedings{Mu_2019_CVPR,
  author = {Mu, Guodong and Huang, Di and Hu, Guosheng and Sun, Jia and Wang, Yunhong},
  title = {Led3D: A Lightweight and Efficient Deep Approach to Recognizing Low-Quality 3D Faces},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ROI Pooled Correlation Filters for Visual Tracking
Yuxuan Sun, Chong Sun, Dong Wang, You He, Huchuan Lu


The ROI (region-of-interest) based pooling method performs pooling operations on the cropped ROI regions for various samples and has shown great success in the object detection methods. It compresses the model size while preserving the localization accuracy, thus it is useful in the visual tracking field. Though effective, the ROI-based pooling operation is not yet considered in the correlation filter formula. In this paper, we propose a novel ROI pooled correlation filter (RPCF) algorithm for robust visual tracking. Through mathematical derivations, we show that the ROI-based pooling can be equivalently achieved by enforcing additional constraints on the learned filter weights, which makes the ROI-based pooling feasible on the virtual circular samples. Besides, we develop an efficient joint training formula for the proposed correlation filter algorithm, and derive the Fourier solvers for efficient model training. Finally, we evaluate our RPCF tracker on OTB-2013, OTB-2015 and VOT-2017 benchmark datasets. Experimental results show that our tracker performs favourably against other state-of-the-art trackers.
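For context, the closed-form Fourier-domain correlation filter that RPCF builds on can be sketched in a few lines. This is a plain single-channel (MOSSE-style) filter, not the ROI-pooled formulation itself; names and the regularization value are illustrative.

import numpy as np

def train_filter(feature, target_response, lam=1e-2):
    """feature, target_response: 2-D arrays of equal size (the target is usually a
    Gaussian-shaped peak). Returns the closed-form filter in the Fourier domain."""
    F = np.fft.fft2(feature)
    Y = np.fft.fft2(target_response)
    return (np.conj(F) * Y) / (np.conj(F) * F + lam)

def detect(filter_hat, feature):
    """Response map on a new search patch; its peak gives the target location."""
    return np.real(np.fft.ifft2(filter_hat * np.fft.fft2(feature)))
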
[tracking, performs, auc, dataset] [algorithm, robust, fourier, error, problem, compute, denote, corresponding, computed, optimization] [method, based, figure, proposed, input, image, equality] [pooling, filter, correlation, precision, tracker, rpcf, operation, eco, ccot, performance, staple, kcf, ope, lsart, meem, pooled, deep, rate, convolution, denotes, size, equivalently, efficient, gradient, kernel, weight, overlap, accuracy] [visual, success, model, introduce, vector, evaluate, candidate, conjugate] [feature, roi, threshold, baseline, map, region, location, object, localization, propose, boundary, three, improves, response] [target, training, learned, learning, set, update, paper, sample, learn]
@InProceedings{Sun_2019_CVPR,
  author = {Sun, Yuxuan and Sun, Chong and Wang, Dong and He, You and Lu, Huchuan},
  title = {ROI Pooled Correlation Filters for Visual Tracking},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Video Inpainting
Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon


Video inpainting aims to fill spatio-temporal holes with plausible content in a video. Despite tremendous progress of deep neural networks for image inpainting, it is challenging to extend these methods to the video domain due to the additional time dimension. In this work, we propose a novel deep network architecture for fast video inpainting. Built upon an image-based encoder-decoder model, our framework is designed to collect and refine information from neighbor frames and synthesize still-unknown regions. At the same time, the output is enforced to be temporally consistent by a recurrent feedback and a temporal memory module. Compared with the state-of-the-art image inpainting algorithm, our method produces videos that are much more semantically correct and temporally smooth. In contrast to the prior video completion method which relies on time-consuming optimization, our method runs in near real-time while generating competitive video results. Finally, we applied our framework to video retargeting task, and obtain visually pleasing results.
[video, temporal, flow, frame, recurrent, warping, dataset, motion, davis, time, optical, temporally, previous, complex, current, framework, learns, vinet, extend] [computer, optimization, vision, consistent, completion, denote, error, pattern, volume] [inpainting, method, image, reference, feedback, input, conference, content, ieee, quality, figure, consistency, based, fill, retargeting, synthesized, conditional, removal, missing, arbitrary, inpainted, user, synthesize] [network, deep, output, layer, original, designed, full] [model, memory, visual, evaluate, arxiv, preprint] [feature, mask, object, propose, global, cnn, segmentation] [source, learning, loss, neighbor, target, large, training, train, novel]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Dahun and Woo, Sanghyun and Lee, Joon-Young and So Kweon, In},
  title = {Deep Video Inpainting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-To-Image Synthesis
Minfeng Zhu, Pingbo Pan, Wei Chen, Yi Yang


In this paper, we focus on generating realistic images from text descriptions. Current methods first generate an initial image with rough shape and color, and then refine the initial image to a high-resolution one. Most existing text-to-image synthesis methods have two main problems. (1) These methods depend heavily on the quality of the initial images. If the initial image is not well initialized, the following processes can hardly refine the image to a satisfactory quality. (2) Each word contributes a different level of importance when depicting different image contents; however, unchanged text representation is used in existing image refinement processes. In this paper, we propose the Dynamic Memory Generative Adversarial Network (DM-GAN) to generate high-quality images. The proposed method introduces a dynamic memory module to refine fuzzy image contents when the initial images are not well generated. A memory writing gate is designed to select the important text information based on the initial image content, which enables our method to accurately generate images from the text description. We also utilize a response gate to adaptively fuse the information read from the memories and the image features. We evaluate the DM-GAN model on the Caltech-UCSD Birds 200 dataset and the Microsoft Common Objects in Context dataset. Experimental results demonstrate that our DM-GAN model performs favorably against the state-of-the-art approaches.
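A schematic PyTorch sketch of the gating idea: a writing gate weighs each word against the initial image content before it is written into memory. Layer shapes, names, and the pooling of the initial image are assumptions for illustration, not the paper's exact module.

import torch
import torch.nn as nn

class MemoryWriting(nn.Module):
    """One memory slot per word, with a writing gate conditioned on the initial image."""
    def __init__(self, word_dim, img_dim, mem_dim):
        super().__init__()
        self.gate = nn.Linear(word_dim + img_dim, 1)
        self.word_proj = nn.Linear(word_dim, mem_dim)
        self.img_proj = nn.Linear(img_dim, mem_dim)

    def forward(self, word_feat, img_feat):
        # word_feat: (B, T, word_dim); img_feat: (B, img_dim) pooled from the initial image
        img = img_feat.unsqueeze(1).expand(-1, word_feat.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([word_feat, img], dim=-1)))   # writing gate
        return g * self.word_proj(word_feat) + (1 - g) * self.img_proj(img)
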
[dynamic, key, dataset] [initial, well, defined] [image, generative, figure, synthesis, based, proposed, input, conditional, conditioning, quality, method, real, synthetic] [gate, network, better, architecture, process, table, performance, output, dynamically, deep] [memory, text, model, bird, word, adversarial, generated, white, writing, generate, attngan, relevant, fid, sentence, generation, inception, yellow, conditioned, evaluate, visual, step, generates, stackgan, attention, description, black, read, crown, blue, brown, natural, vector, upblock, red, arxiv, preprint] [coco, refinement, response, refine, feature, module, stage, propose, utilize, fuse] [cub, loss, select, addressing, set, representation, distance, feat, training, learning, test]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Minfeng and Pan, Pingbo and Chen, Wei and Yang, Yi},
  title = {DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-To-Image Synthesis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Non-Adversarial Image Synthesis With Generative Latent Nearest Neighbors
Yedid Hoshen, Ke Li, Jitendra Malik


Unconditional image generation has recently been dominated by generative adversarial networks (GANs). GAN methods train a generator which regresses images from random noise vectors, as well as a discriminator that attempts to differentiate between the generated images and a training set of real images. GANs have shown amazing results at generating realistic looking images. Despite their success, GANs suffer from critical drawbacks including: unstable training and mode-dropping. The weaknesses in GANs have motivated research into alternatives including: variational auto-encoders (VAEs), latent embedding learning methods (e.g. GLO) and nearest-neighbor based implicit maximum likelihood estimation (IMLE). Unfortunately at the moment, GANs still significantly outperform the alternative methods for image generation. In this work, we present a novel method - Generative Latent Nearest Neighbors (GLANN) - for training generative models without adversarial training. GLANN combines the strengths of IMLE and GLO in a way that overcomes the main drawbacks of each method. Consequently, GLANN generates images that are far better than GLO and IMLE. Our method does not suffer from mode collapse which plagues GAN training and is much more stable. Qualitative results show that GLANN outperforms a baseline consisting of 800 GANs and VAEs on commonly used datasets. Our models are also shown to be effective for training truly non-adversarial unsupervised image translation.
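The IMLE-style step that makes the learned latent space samplable can be sketched as a nearest-neighbor loss. This is a rough illustration under assumed shapes and a squared-error distance; the mapping network interface is hypothetical.

import torch

def imle_step(mapping_net, latents, noise):
    """latents: (N, d) learned per-image codes; noise: (M, k) random samples, M >= N.
    Returns a loss that pulls, for every latent, its nearest mapped noise sample closer."""
    mapped = mapping_net(noise)                        # (M, d)
    dists = torch.cdist(latents, mapped)               # (N, M) pairwise distances
    nearest = mapped[dists.argmin(dim=1)]               # nearest mapped sample per latent code
    return (latents.detach() - nearest).pow(2).mean()    # optimized w.r.t. mapping_net only
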
[dataset, modeling, work] [good, typically, well, fitting, yield, computed] [method, image, latent, generative, perceptual, generator, noise, mapping, high, unconditional, celeba, presented, resolution, real, figure, quality, translation] [better, performance, number, deep, gaussian, best, precision, vgg, effective, competitive, network, architecture, achieved] [glo, gans, imle, gan, adversarial, generated, model, glann, arxiv, preprint, introduced, generation, evaluation, fid, lucic, random, evaluate, inception, arg, generate, visual, vector, sampled, sajjadi, hoshen] [evaluated] [training, trained, space, loss, learning, distribution, unsupervised, sampling, mnist, nearest, code, set, fashion, sample, metric, hyperparameter, train, suffer, embedding, function, neighbor, mapped, vae, min]
@InProceedings{Hoshen_2019_CVPR,
  author = {Hoshen, Yedid and Li, Ke and Malik, Jitendra},
  title = {Non-Adversarial Image Synthesis With Generative Latent Nearest Neighbors},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Mixture Density Generative Adversarial Networks
Hamid Eghbal-zadeh, Werner Zellinger, Gerhard Widmer


Generative Adversarial Networks have a surprising ability to generate sharp and realistic images, but they are known to suffer from the so-called mode collapse problem. In this paper, we propose a new GAN variant called Mixture Density GAN that overcomes this problem by encouraging the Discriminator to form clusters in its embedding space, which in turn leads the Generator to exploit these and discover different modes in the data. This is achieved by positioning Gaussian density functions in the corners of a simplex, using the resulting Gaussian mixture as a likelihood function over discriminator embeddings, and formulating an objective function for GAN training that is based on these likelihoods. We show how formation of these clusters changes the probability landscape of the discriminator and improves the mode discovery of the GAN. We also show that the optimum of our training objective is attained if and only if the generated and the real distribution match exactly. We support our theoretical results with empirical evaluations on three mode discovery benchmark datasets (Stacked-MNIST, Ring of Gaussians and Grid of Gaussians), and four image datasets (CIFAR-10, CelebA, MNIST, and Fashion-MNIST). Furthermore, we demonstrate (1) the ability to avoid mode collapse and discover all the modes and (2) superior quality of the generated images (as measured by the Frechet Inception Distance (FID)), achieving the lowest FID compared to all baselines.
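A hedged sketch of the likelihood machinery described above: the discriminator produces an embedding, and a Gaussian mixture with components at simplex corners scores real versus generated samples. The one-hot simplex construction, fixed variance, and the exact sign convention of the loss are simplifying assumptions, not the paper's precise objective.

import torch

def simplex_centers(num_modes, dim):
    """A simple stand-in: one-hot corners of the unit simplex as mixture centers."""
    return torch.eye(dim)[:num_modes]                        # (K, dim), requires dim >= K

def mixture_log_likelihood(emb, centers, sigma=1.0):
    """emb: (B, dim) discriminator embeddings; centers: (K, dim)."""
    sq_dist = torch.cdist(emb, centers).pow(2)               # (B, K)
    return torch.logsumexp(-0.5 * sq_dist / sigma ** 2, dim=1)

def discriminator_loss(real_emb, fake_emb, centers):
    # real embeddings should be likely under the mixture, generated ones unlikely
    return -(mixture_log_likelihood(real_emb, centers).mean()
             - mixture_log_likelihood(fake_emb, centers).mean())
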
[dataset, multiple] [international, optimal, thomas, form] [real, generator, image, figure, generative, quality, noise, high, conference, input, proposed, infogan, component, method, realistic] [gaussian, density, vanilla, number, neural, achieved, processing, network, stacked, architecture, table, martin, lowest, gradient] [discriminator, gan, mode, generated, probability, collapse, adversarial, generate, fake, pdata, pgen, arxiv, preprint, landscape, fid, wasserstein, discover, gans, veegan, model, ability, gaussians, generating, provided, discovering, dcgan, create, mdgan] [grid, center, benchmark, discovery] [data, mixture, embedding, training, distribution, likelihood, embeddings, space, function, learning, objective, mnist, datasets, trained, cluster, distance, min, optimum, uniform]
@InProceedings{Eghbal-zadeh_2019_CVPR,
  author = {Eghbal-zadeh, Hamid and Zellinger, Werner and Widmer, Gerhard},
  title = {Mixture Density Generative Adversarial Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SketchGAN: Joint Sketch Completion and Recognition With Generative Adversarial Network
Fang Liu, Xiaoming Deng, Yu-Kun Lai, Yong-Jin Liu, Cuixia Ma, Hongan Wang


Hand-drawn sketch recognition is a fundamental problem in computer vision, widely used in sketch-based image and video retrieval, editing, and reorganization. Previous methods often assume that a complete sketch is used as input; however, hand-drawn sketches in common application scenarios are often incomplete, which makes sketch recognition a challenging problem. In this paper, we propose SketchGAN, a new generative adversarial network (GAN) based approach that jointly completes and recognizes a sketch, boosting the performance of both tasks. Specifically, we use a cascade Encode-Decoder network to complete the input sketch in an iterative manner, and employ an auxiliary sketch recognition task to recognize the completed sketch. Experiments on the Sketchy database benchmark demonstrate that our joint learning approach achieves competitive sketch completion and recognition performance compared with the state-of-the-art methods. Further experiments using several sketch-based applications also validate the performance of our method.
[recognition, joint, previous, multiple, fed, key, work] [completion, computer, vision, pattern, international, problem, approach, well, good, local, closure, consistent] [image, conference, ieee, method, incomplete, contour, corrupted, generative, completed, figure, conditional, conduct, input, sketchgan, acm, proposed, missing, generator, user, intermediate, inpainting, based, database, demonstrate, comparison, real] [network, performance, original, output, table, ratio, deep, better, architecture, accuracy, convolutional, neural, applied] [adversarial, model, gan, discriminator, natural, random, gans, evaluate, visual] [cascade, stage, propose, object, category, improve] [sketch, auxiliary, sketchy, learning, task, loss, data, classification, training]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Fang and Deng, Xiaoming and Lai, Yu-Kun and Liu, Yong-Jin and Ma, Cuixia and Wang, Hongan},
  title = {SketchGAN: Joint Sketch Completion and Recognition With Generative Adversarial Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Foreground-Aware Image Inpainting
Wei Xiong, Jiahui Yu, Zhe Lin, Jimei Yang, Xin Lu, Connelly Barnes, Jiebo Luo


Existing image inpainting methods typically fill holes by borrowing information from surrounding pixels. They often produce unsatisfactory results when the holes overlap with or touch foreground objects due to lack of information about the actual extent of foreground and background regions within the holes. These scenarios, however, are very important in practice, especially for applications such as distracting object removal. To address the problem, we propose a foreground-aware image inpainting system that explicitly disentangles structure inference and content completion. Specifically, our model learns to predict the foreground contour first, and then inpaints the missing region using the predicted contour as guidance. We show that by such disentanglement, the contour completion model predicts reasonable contours of objects, and further substantially improves the performance of image inpainting. Experiments show that our method significantly outperforms existing methods and achieves superior inpainting results on challenging cases with complex compositions.
[predict, explicitly, dataset, work] [completion, computer, pattern, accurate, vision, typically] [contour, image, inpainting, completed, incomplete, missing, input, hole, content, pixel, ieee, cgt, method, gatedconv, based, partialconv, background, result, conference, fill, corrupted, patchmatch, guide, real, generator, produce, generative] [network, structure, overlap, deep, neural, architecture, small, output, inference] [model, adversarial, generated, generate, discriminator, system, infer, complete, natural, con, arxiv] [foreground, module, mask, coarse, saliency, guidance, map, propose, object, refinement, segmentation, salient, detection, final, predicted, challenging] [loss, training, train, learning, existing, large, address, randomly, trained, knowledge]
@InProceedings{Xiong_2019_CVPR,
  author = {Xiong, Wei and Yu, Jiahui and Lin, Zhe and Yang, Jimei and Lu, Xin and Barnes, Connelly and Luo, Jiebo},
  title = {Foreground-Aware Image Inpainting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Art2Real: Unfolding the Reality of Artworks via Semantically-Aware Image-To-Image Translation
Matteo Tomei, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara


The applicability of computer vision to real paintings and artworks has been rarely investigated, even though a vast heritage would greatly benefit from techniques which can understand and process data from the artistic domain. This is partially due to the small amount of annotated artistic data, which is not even comparable to that of natural images captured by cameras. In this paper, we propose a semantic-aware architecture which can translate artworks to photo-realistic visualizations, thus reducing the gap between visual features of artistic and realistic data. Our architecture can generate natural images by retrieving and learning details from real photos through a similarity matching strategy which leverages a weakly-supervised semantic understanding of the scene. Experimental results show that the proposed technique leads to increased realism and to a reduction in domain shift, which improves the performance of pre-trained architectures for classification, detection, and segmentation. Code is publicly available at: https://github.com/aimagelab/art2real.
[state, bank] [computer, vision, pattern, international, computed, approach, respect, matrix] [conference, image, real, realistic, style, ieee, translation, artistic, patch, realism, generative, method, drit, painting, proposed, extracted, figure, content, quality, unpaired, preserving, user, synthesis] [original, neural, number, unit, architecture, performance, reduce, network, deep, shift, table, processing] [generated, memory, adversarial, model, visual, natural, landscape, inception, evaluate, machine, generate] [semantic, segmentation, european, feature, art, affinity, contextual, propose, detection] [class, domain, training, learning, set, transfer, specific, loss, label, data, belonging, entropy, similarity, distance, gap]
@InProceedings{Tomei_2019_CVPR,
  author = {Tomei, Matteo and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  title = {Art2Real: Unfolding the Reality of Artworks via Semantically-Aware Image-To-Image Translation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Structure-Preserving Stereoscopic View Synthesis With Multi-Scale Adversarial Correlation Matching
Yu Zhang, Dongqing Zou, Jimmy S. Ren, Zhe Jiang, Xiaohao Chen


This paper addresses stereoscopic view synthesis from a single image. Various recent works solve this task by reorganizing pixels from the input view to reconstruct the target one in a stereo setup. However, purely depending on such a photometric-based reconstruction process, the network may produce structurally inconsistent results. Regarding this issue, this work proposes Multi-Scale Adversarial Correlation Matching (MS-ACM), a novel learning framework for structure-aware view synthesis. The proposed framework does not assume any costly supervision signal of scene structures such as depth. Instead, it models structures as self-correlation coefficients extracted from multi-scale feature maps in transformed spaces. In training, the feature space attempts to push the correlation distances between the synthesized and target images far apart, thus amplifying inconsistent structures. At the same time, the view synthesis network minimizes such correlation distances by fixing mistakes it makes. With such adversarial training, structural errors of different scales and levels are iteratively discovered and reduced, preserving both global layouts and fine-grained details. Extensive experiments on the KITTI benchmark show that MS-ACM improves both visual quality and the metrics over existing methods when plugged into recent view synthesis architectures.
[recognition, structural, stereoscopic, video, framework, window] [view, vision, computer, scene, pattern, local, matching, groundtruth, approach, kitti, single, reconstruction, stereo, international, form, photometric, directly] [synthesis, conference, ieee, proposed, image, input, ssim, synthesized, acm, based, figure, pixel, bad, sepconv, noise, psnr] [structure, network, better, table, correlation, deep, process, scale, small, larger, neural, best, regularization, output, denotes] [adversarial, critic, model, visual] [feature, spatial, benchmark, improves, adopted, average, object] [training, novel, learning, loss, trained, learned, target, existing, idea, set, gap, representation, learn, predictor]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Yu and Zou, Dongqing and Ren, Jimmy S. and Jiang, Zhe and Chen, Xiaohao},
  title = {Structure-Preserving Stereoscopic View Synthesis With Multi-Scale Adversarial Correlation Matching},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DynTypo: Example-Based Dynamic Text Effects Transfer
Yifang Men, Zhouhui Lian, Yingmin Tang, Jianguo Xiao


In this paper, we present a novel approach for dynamic text effects transfer by using example-based texture synthesis. In contrast to previous works that require an input video of the target to provide motion guidance, we aim to animate a still image of the target text by transferring the desired dynamic effects from an observed exemplar. Due to the simplicity of target guidance and complexity of realistic effects, it is prone to producing temporal artifacts such as flickers and pulsations. To address the problem, our core idea is to find a common Nearest-neighbor Field (NNF) that would optimize the textural coherence across all keyframes simultaneously. With the static NNF for video sequences, we implicitly transfer motion properties from source to target. We also introduce a guided NNF search by employing the distance-based weight map and Simulated Annealing (SA) for deep direction-guided propagation to allow intense dynamic effects to be completely transferred with no semantic guidance provided. Experimental results demonstrate the effectiveness and superiority of our method in dynamic text effects transfer through extensive comparisons with state-of-the-art algorithms. We also show the potentiality of our method via multiple experiments for various application domains.
[dynamic, temporal, video, coherence, motion, propagation, keyframes, flow, frame, term, static, previous, time, keyframe, consecutive] [computer, compute, field, optimization, single, direction, algorithm, total, pattern, approach, correspondence, daniel, vision] [texture, image, stylized, nnf, figure, method, synthesis, animation, patch, style, conference, proposed, ttext, acm, typography, ieee, synthesized, pixel, simulated, complicated, synthesize, consistency, stext, ssty, background, eli, textural, tsty, based, input] [weight, deep, search, convolutional, neural, optimize] [text, common, find, introduce, easily, generate, system] [map, guided, semantic, spatial, guidance, fluid] [target, transfer, source, distance, extended, exemplar, annealing, similarity, distribution]
@InProceedings{Men_2019_CVPR,
  author = {Men, Yifang and Lian, Zhouhui and Tang, Yingmin and Xiao, Jianguo},
  title = {DynTypo: Example-Based Dynamic Text Effects Transfer},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Arbitrary Style Transfer With Style-Attentional Networks
Dae Young Park, Kwang Hee Lee


Arbitrary style transfer aims to synthesize a content image with the style of an image to create a third image that has never been seen before. Recent arbitrary style transfer algorithms find it challenging to balance the content structure and the style patterns. Moreover, simultaneously maintaining the global and local style patterns is difficult due to the patch-based mechanism. In this paper, we introduce a novel style-attentional network (SANet) that efficiently and flexibly integrates the local style patterns according to the semantic spatial distribution of the content image. A new identity loss function and multi-level feature embeddings enable our SANet and decoder to preserve the content structure as much as possible while enriching the style patterns. Experimental results demonstrate that our algorithm synthesizes stylized images in real-time that are higher in quality than those produced by the state-of-the-art algorithms.
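The style-attention step can be written as a small non-local attention block in which normalized content features attend over normalized style features. The sketch below assumes 1x1 convolutions f, g, h as the query/key/value projections (f and g sharing an output channel count, h preserving the content channel count); it illustrates the mechanism, not the authors' exact SANet module.

import torch
import torch.nn.functional as F

def style_attention(content, style, f, g, h):
    """content, style: (B, C, H, W) feature maps; f, g, h: 1x1 conv layers."""
    b, c, hgt, wid = content.shape
    q = f(F.instance_norm(content)).flatten(2)                        # (B, C', Hc*Wc)
    k = g(F.instance_norm(style)).flatten(2)                          # (B, C', Hs*Ws)
    v = h(style).flatten(2)                                           # (B, C,  Hs*Ws)
    attn = torch.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)     # content attends to style
    out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, hgt, wid)
    return content + out                                              # residual fusion
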
[time, work, fcs] [local, algorithm, match, normalized, position, runtime] [style, content, image, identity, stylized, arbitrary, figure, proposed, method, sanet, fcsc, adain, texture, sanets, wct, input, synthesize, feedforward, result, synthesizes, color, synthesis, preserve, gatys, based, contentstyle, mapping, lidentity, synthesized, fixing, korea] [structure, network, relu, neural, output, vgg, denotes, maintaining, fixed, increasing, convolutional, flexibly, learnable, fps] [arxiv, preprint, decoder, encoder, represent, semantically, simply, model] [feature, global, map, semantic, spatial, module, combine, detailed] [loss, transfer, maintain, embedding, training, distribution, function, experimental, learned, trained]
@InProceedings{Park_2019_CVPR,
  author = {Young Park, Dae and Hee Lee, Kwang},
  title = {Arbitrary Style Transfer With Style-Attentional Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Typography With Decor: Intelligent Text Style Transfer
Wenjing Wang, Jiaying Liu, Shuai Yang, Zongming Guo


Text effects transfer can dramatically make the text visually pleasing. In this paper, we present a novel framework to stylize the text with exquisite decor, which is ignored by the previous text stylization methods. Decorative elements pose a challenge to spontaneously handle basal text effects and decor, which are two different styles. To address this issue, our key idea is to learn to separate, transfer and recombine the decors and the basal text effect. A novel text effect transfer network is proposed to infer the styled version of the target text. The stylized text is finally embellished with decor where the placement of the decor is carefully determined by a novel structure-aware strategy. Furthermore, we propose a domain adaptation strategy for decor detection and a one-shot training strategy for text effects transfer, which greatly enhance the robustness of our network to new styles. We base our experiments on our collected typography dataset including 59,000 professionally styled text and demonstrate the superiority of our method over other state-of-the-art style transfer methods.
[framework, dataset] [computer, ground, horizontal, pattern, vision, corresponding, truth, vertical, single] [decorative, styled, style, artistic, proposed, image, figure, decor, raw, conference, basal, result, real, based, mhor, method, transferring, synthetic, translation, input, perceptual, doodle, ieee, jiaying, collected, fails, collect, stargan, netseg, generator, ghor, mguide, insignificant] [network, neural, element, structure, segnet, designed] [text, generate, arxiv, adversarial, preprint, model, red, discriminator] [segmentation, propose, map, illustrated, extra, semantic, mask, spatial, edge, guidance] [transfer, training, domain, target, strategy, distribution, loss, adaptation, data, novel, unseen, randomly, adapt, trained, source, distance]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Wenjing and Liu, Jiaying and Yang, Shuai and Guo, Zongming},
  title = {Typography With Decor: Intelligent Text Style Transfer},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
RL-GAN-Net: A Reinforcement Learning Agent Controlled GAN Network for Real-Time Point Cloud Shape Completion
Muhammad Sarmad, Hyunjoo Jenny Lee, Young Min Kim


We present RL-GAN-Net, where a reinforcement learning (RL) agent provides fast and robust control of a generative adversarial network (GAN). Our framework is applied to point cloud shape completion that converts noisy, partial point cloud data into a high-fidelity completed shape by controlling the GAN. While a GAN is unstable and hard to train, we circumvent the problem by (1) training the GAN on the latent space representation whose dimension is reduced compared to the raw point cloud input and (2) using an RL agent to find the correct input to the GAN to generate the latent space representation of the shape that best fits the current incomplete point cloud input. The suggested pipeline robustly completes point clouds with large missing regions. To the best of our knowledge, this is the first attempt to train an RL agent to control the GAN, which effectively learns the highly nonlinear mapping from the input noise of the GAN to the latent space of point clouds. The RL agent replaces the need for complex optimization and consequently makes our technique real time. Additionally, we demonstrate that our pipelines can be used to enhance the classification accuracy of point clouds with missing data.
[action, framework, work, time, complex, state, previous, decoded] [point, shape, cloud, completion, chamfer, approach, computer, leonidas, pipeline, vision, volume, robust, limited, voxel, optimization] [input, missing, latent, completed, control, incomplete, real, generator, hybrid, generative, figure, conference, raw, image, ieee, demonstrate, based] [network, deep, output, best, performance, neural, compared, original, processing, vanilla, suggested, accuracy] [agent, gan, gfv, discriminator, complete, correct, reward, reinforcement, pin, adversarial, decoder, find, generated, environment, policy, encoder, partial, generate, selects, generation, pass, encoded] [semantic, seed, object] [data, learning, loss, distance, training, representation, trained, classification, space, train, function, large, noisy, combination]
@InProceedings{Sarmad_2019_CVPR,
  author = {Sarmad, Muhammad and Jenny Lee, Hyunjoo and Min Kim, Young},
  title = {RL-GAN-Net: A Reinforcement Learning Agent Controlled GAN Network for Real-Time Point Cloud Shape Completion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Photo Wake-Up: 3D Character Animation From a Single Photo
Chung-Yi Weng, Brian Curless, Ira Kemelmacher-Shlizerman


We present a method and application for animating a human subject from a single photo. E.g., the character can walk out, run, sit, or jump in 3D. The key contributions of this paper are: 1) an application of viewing and animating humans in single photos in 3D, 2) a novel 2D warping method to deform a posable template body model to fit the person's complex silhouette to create an animatable mesh, and 3) a method for handling partial self occlusions. We compare to state-of-the-art related methods and evaluate results with human studies. Further, we present an interactive interface that allows re-posing the person in 3D, and an augmented reality setup where the animated 3D person can emerge from the photo into the real world. We demonstrate the method on photos, posters, and art. The project page is at https://grail.cs.washington.edu/projects/wakeup/.
[human, motion, video, subject, warping, warp, construct, work, illustrates] [body, smpl, mesh, pose, single, silhouette, shape, skinning, computer, linit, front, depth, normal, occlusion, vision, ssmpl, volume, reality, fit, reconstruction, projected, solve, psmpl, corresponding, ocl, rigged, international, fitting, problem, estimation, view, correspondence, initial, occluded, lsmpl, bsmpl, pattern, animatable, augmented, handling] [photo, method, user, animation, input, image, acm, figure, conference, result, recover, texture, ieee, handle, comparison, animating, change, reconstruct, application] [apply, compare] [model, create, arxiv, preprint, character, abstract, automatic] [map, person, head, boundary, mask, final, fully, interactive] [label, learning, set]
@InProceedings{Weng_2019_CVPR,
  author = {Weng, Chung-Yi and Curless, Brian and Kemelmacher-Shlizerman, Ira},
  title = {Photo Wake-Up: 3D Character Animation From a Single Photo},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DeepLight: Learning Illumination for Unconstrained Mobile Mixed Reality
Chloe LeGendre, Wan-Chun Ma, Graham Fyffe, John Flynn, Laurent Charbonnel, Jay Busch, Paul Debevec


We present a learning-based method to infer plausible high dynamic range (HDR), omnidirectional illumination given an unconstrained, low dynamic range (LDR) image from a mobile phone camera with a limited field of view (FOV). For training data, we collect videos of various reflective spheres placed within the camera's FOV, leaving most of the background unoccluded, leveraging that materials with diverse reflectance functions reveal different lighting cues in a single exposure. We train a deep neural network to regress from the LDR background image to HDR lighting by matching the LDR ground truth sphere images to those rendered with the predicted illumination using image-based relighting, which is differentiable. Our inference runs at interactive frame rates on a mobile device, enabling realistic rendering of virtual objects into real scenes for mobile mixed reality. Training on automatically exposed and white-balanced videos, we improve the realism of rendered objects compared to the state-of-the art methods for both indoor and outdoor scenes.
[dynamic, video, frame, capture, recognition] [lighting, outdoor, computer, ground, ldr, illumination, truth, indoor, scene, reflectance, sphere, light, diffuse, rendered, rendering, single, vision, international, range, pattern, camera, field, volume, rgb, material, virtual, render, linear, phone, estimation, reality, relighting, silver, brdf, measured, exposure, panorama, omnidirectional, lit, direction, estimate] [hdr, image, conference, input, ieee, real, figure, mixed, method, matte, captured, acm, lrec, background, color, pixel, high, produced, reveal] [mobile, network, mirror, inference, deep, neural] [ball, model, adversarial, plausible, basis] [object, predicted, interactive, three] [training, loss, data, learning, unseen, test, train]
@InProceedings{LeGendre_2019_CVPR,
  author = {LeGendre, Chloe and Ma, Wan-Chun and Fyffe, Graham and Flynn, John and Charbonnel, Laurent and Busch, Jay and Debevec, Paul},
  title = {DeepLight: Learning Illumination for Unconstrained Mobile Mixed Reality},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Iterative Residual CNNs for Burst Photography Applications
Filippos Kokkinos, Stamatis Lefkimmiatis


Modern inexpensive imaging sensors suffer from inherent hardware constraints which often result in captured images of poor quality. Among the most common ways to deal with such limitations is to rely on burst photography, which nowadays acts as the backbone of all modern smartphone imaging applications. In this work, we focus on the fact that every frame of a burst sequence can be accurately described by a forward (physical) model. This, in turn, allows us to restore a single image of higher quality from a sequence of low-quality images as the solution of an optimization problem. Inspired by an extension of the gradient descent method that can handle non-smooth functions, namely the proximal gradient descent, and modern deep learning techniques, we propose a convolutional iterative network with a transparent architecture. Our network uses a burst of low-quality image frames and is able to produce an output of higher image quality recovering fine details which are not distinguishable in any of the original burst frames. We focus both on the burst photography pipeline as a whole, i.e., burst demosaicking and denoising, as well as on the traditional Gaussian denoising task. The developed method demonstrates consistent state-of-the-art performance across the two tasks and as opposed to other recent deep learning approaches does not have any inherent restrictions either to the number of frames or their ordering.
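A schematic sketch of the unrolled proximal-gradient idea: a gradient step on the data-fidelity term of the forward model is followed by a learned proximal step (a residual CNN acting as the regularizer). The forward operator A, its adjoint At, and the denoiser interface are stand-ins, not the paper's actual components.

import torch

def unrolled_burst_restore(frames, A, At, prox_cnn, steps=5, alpha=0.5):
    """frames: list of observed burst frames y_k; A(x, k)/At(r, k): forward model and its
    adjoint for frame k (alignment + mosaicking); prox_cnn: residual CNN used as a
    learned proximal operator."""
    x = At(frames[0], 0)                                       # crude initialization
    for _ in range(steps):
        grad = sum(At(A(x, k) - y, k) for k, y in enumerate(frames))
        x = x - alpha * grad                                   # gradient step on data fidelity
        x = prox_cnn(x)                                        # learned proximal / regularization step
    return x
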
[frame, dataset, joint, term, warping, focus, sequence] [computer, solution, single, optimization, vision, reconstruction, pattern, approach, affine, case, algorithm, observation, estimate, linear, matrix, inverse, allows, well, camera] [image, burst, denoising, proximal, noise, method, demosaicking, ieee, photography, conference, resdnet, quality, input, result, transformation, proposed, inn, imaging, described, based, kokkinos, inherent, restoration, hsi, color, interpolation, stamatios, smartphone] [network, deep, performance, gaussian, neural, gradient, size, order, number, convolutional, processing, regularizer, residual, computational, bilinear, standard, hardware, descent, computation, design, increase] [iterative, model, step] [map] [learning, training, trained, noisy, corresponds, data, alignment, set, specific]
@InProceedings{Kokkinos_2019_CVPR,
  author = {Kokkinos, Filippos and Lefkimmiatis, Stamatis},
  title = {Iterative Residual CNNs for Burst Photography Applications},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Implicit Fields for Generative Shape Modeling
Zhiqin Chen, Hao Zhang


We advocate the use of implicit fields for learning generative models of shapes and introduce an implicit field decoder, called IM-NET, for shape generation, aimed at improving the visual quality of the generated shapes. An implicit field assigns a value to each point in 3D space, so that a shape can be extracted as an iso-surface. IM-NET is trained to perform this assignment by means of a binary classifier. Specifically, it takes a point coordinate, along with a feature vector encoding a shape, and outputs a value which indicates whether the point is outside the shape or not. By replacing conventional decoders by our implicit decoder for representation learning (via IM-AE) and shape generation (via IM-GAN), we demonstrate superior results for tasks such as generative shape modeling, interpolation, and single-view 3D reconstruction, particularly in terms of visual quality. Code and supplementary material are available at https://github.com/czq142857/implicit-decoder.
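A minimal PyTorch sketch of an implicit decoder in the spirit described above: an MLP receives a 3D point concatenated with a shape feature vector and predicts an inside/outside value. Layer widths, depths, and activations are placeholders, not the paper's exact architecture.

import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Predicts an inside/outside value for a 3D point conditioned on a shape code."""
    def __init__(self, feat_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.02),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, points, shape_code):
        # points: (B, N, 3); shape_code: (B, feat_dim), broadcast to every query point
        code = shape_code.unsqueeze(1).expand(-1, points.size(1), -1)
        return self.net(torch.cat([points, code], dim=-1)).squeeze(-1)    # (B, N) in [0, 1]
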
[recognition, cleaner, dataset] [shape, point, implicit, computer, vision, surface, field, voxel, marching, international, reconstruction, supplementary, mesh, lfd, pattern, atlasnet, well, single, approach] [conference, generative, figure, resolution, image, ieee, quantitative, quality, interpolation, latent, method, based] [deep, network, table, neural, output, better, structure, progressive, convolutional, processing, compared, compare, number] [decoder, visual, model, sampled, evaluation, generated, encoder, generation, adversarial, generate] [cnn, feature, european, iou, object] [learning, trained, training, set, sampling, testing, representation, distance, autoencoder, loss, train, data]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Zhiqin and Zhang, Hao},
  title = {Learning Implicit Fields for Generative Shape Modeling},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Reliable and Efficient Image Cropping: A Grid Anchor Based Approach
Hui Zeng, Lida Li, Zisheng Cao, Lei Zhang


Image cropping aims to improve the composition as well as aesthetic quality of an image by removing extraneous content from it. Existing image cropping databases provide only one or several human-annotated bounding boxes as the ground truth, which cannot reflect the non-uniqueness and flexibility of image cropping in practice. The employed evaluation metrics such as intersection-over-union cannot reliably reflect the real performance of cropping models, either. This work revisits the problem of image cropping, and presents a grid anchor based formulation by considering the special properties and requirements (e.g., local redundancy, content preservation, aspect ratio) of image cropping. Our formulation reduces the search space of candidate crops from millions to less than one hundred. Consequently, a grid anchor based cropping benchmark is constructed, where all crops of each image are annotated and more reliable evaluation metrics are defined. We also design an effective and lightweight network module, which simultaneously considers the region of interest and region of discard for more accurate image cropping. Our model can stably output visually pleasing crops for images of different scenes and run at a speed of 125 FPS.
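The sketch below illustrates what a grid-anchor candidate enumeration can look like: crop corners snap to a small set of anchors near the image borders, and candidates are filtered by area and aspect-ratio constraints. The grid size, area threshold, and ratio bounds here are made-up values for illustration, not the settings used in the paper.

def grid_anchor_crops(w, h, m=4, n=4, min_area=0.5, ratios=(0.5, 2.0)):
    # Anchors for the top-left corner lie in the left/top third of the image;
    # anchors for the bottom-right corner are mirrored into the right/bottom third.
    xs = [round(i * w / (3 * m)) for i in range(m)]
    ys = [round(j * h / (3 * n)) for j in range(n)]
    xe = [w - x for x in xs]
    ye = [h - y for y in ys]
    crops = []
    for x0 in xs:
        for y0 in ys:
            for x1 in xe:
                for y1 in ye:
                    cw, ch = x1 - x0, y1 - y0
                    if cw <= 0 or ch <= 0:
                        continue
                    if cw * ch < min_area * w * h:                 # content preservation
                        continue
                    if not (ratios[0] <= cw / ch <= ratios[1]):    # aspect-ratio requirement
                        continue
                    crops.append((x0, y0, x1, y1))
    return crops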
[employed, considering] [good, reliable, formulation, defined, define, field, provide, problem, special, local] [image, based, aesthetic, quality, composition, resolution, photo, proposed, comparison, content, figure, database, vpn, ieee, major, acm] [performance, table, number, efficient, channel, output, srcc, network, small, receptive, ratio, standard, effective, speed, size, fixed] [model, candidate, evaluate, acceptable, automatic, attention, evaluation, visual] [cropping, feature, grid, aspect, module, crop, anchor, annotated, roi, baseline, extraction, annotation, spatial, region, map, iou, ven, gaic, improve, detection, froi, cnn, rod, vfn] [source, set, training, learning, ranking, dimension, existing, trained]
@InProceedings{Zeng_2019_CVPR,
  author = {Zeng, Hui and Li, Lida and Cao, Zisheng and Zhang, Lei},
  title = {Reliable and Efficient Image Cropping: A Grid Anchor Based Approach},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Patch-Based Progressive 3D Point Set Upsampling
Wang Yifan, Shihao Wu, Hui Huang, Daniel Cohen-Or, Olga Sorkine-Hornung


We present a detail-driven deep neural network for point set upsampling. A high-resolution point set is essential for point-based rendering and surface reconstruction. Inspired by the recent success of neural image super-resolution techniques, we progressively train a cascade of patch-based upsampling networks on different levels of detail end-to-end. We propose a series of architectural design contributions that lead to a substantial performance boost. The effect of each technical contribution is demonstrated in an ablation study. Qualitative and quantitative experiments show that our method significantly outperforms the state-of-the-art learning-based and optimization-based approaches, both in terms of handling low-resolution inputs and revealing high-fidelity details.
[previous, multiple, sequence] [point, computer, dense, pattern, vision, local, surface, cloud, ground, geometric, shape, field, well, computed] [input, patch, figure, ieee, image, noise, method, acm, detail, extracted, quantitative, interpolation] [upsampling, network, deep, unit, neural, progressive, sparse, convolutional, receptive, number, table, architecture, apply, sketchfab, performance, size, processing, applying, adaptive, entire] [model, arxiv, preprint, progressively, step] [feature, extraction, level, expansion, multiscale, propose, spatial, ablation] [set, training, learning, train, distance, trained, data, test, code, large]
@InProceedings{Yifan_2019_CVPR,
  author = {Yifan, Wang and Wu, Shihao and Huang, Hui and Cohen-Or, Daniel and Sorkine-Hornung, Olga},
  title = {Patch-Based Progressive 3D Point Set Upsampling},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
An Iterative and Cooperative Top-Down and Bottom-Up Inference Network for Salient Object Detection
Wenguan Wang, Jianbing Shen, Ming-Ming Cheng, Ling Shao


This paper presents a salient object detection method that integrates both top-down and bottom-up saliency inference in an iterative and cooperative manner. The top-down process is used for coarse-to-fine saliency estimation, where high-level saliency is gradually integrated with finer lower-layer features to obtain a fine-grained result. The bottom-up process infers the high-level, but rough, saliency by gradually using upper-layer, semantically richer features. These two processes are performed alternately: the bottom-up process uses the fine-grained saliency obtained from the top-down process to yield an enhanced high-level saliency estimate, and the top-down process, in turn, further benefits from the improved high-level information. The network layers in the bottom-up/top-down processes are equipped with recurrent mechanisms for layer-wise, step-by-step optimization. Thus, saliency information is effectively encouraged to flow in a bottom-up, top-down and intra-layer manner. We show that most other saliency models based on fully convolutional networks (FCNs) are essentially variants of our model. Extensive experiments on several popular benchmarks clearly demonstrate the superior performance, good generalization, and powerful learning ability of our proposed saliency inference framework.
[rnn, previous, recurrent, jianbing, human, perform, iteratively, work] [estimation, estimate, accurate, corresponding] [based, proposed, image, input, study, demonstrate, guide, quantitative] [inference, network, deep, convolutional, table, layer, process, performance, deeper, iteration, conv, number, gradually, better, vggnet, performed, efficient, neural, basic] [model, iterative, visual, cooperative, improved, step, attention, consider, perception] [saliency, sod, salient, object, mae, finest, detection, detailed, feature, huchuan, supervision, topdown, fully, spatial, dutste, ali, ablation, wenguan, refine, xiang, leverage, final] [learning, update, strategy, trained]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Wenguan and Shen, Jianbing and Cheng, Ming-Ming and Shao, Ling},
  title = {An Iterative and Cooperative Top-Down and Bottom-Up Inference Network for Salient Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Stacked Hierarchical Multi-Patch Network for Image Deblurring
Hongguang Zhang, Yuchao Dai, Hongdong Li, Piotr Koniusz


Although deep end-to-end learning methods have shown their superiority in removing non-uniform motion blur, there still exist major challenges with the current multi-scale and scale-recurrent models: 1) Deconvolution/upsampling operations in the coarse-to-fine scheme result in expensive runtime; 2) Simply increasing the model depth with finer-scale levels cannot improve the quality of deblurring. To tackle the above problems, we present a deep hierarchical multi-patch network inspired by Spatial Pyramid Matching to deal with blurry images via a fine-to-coarse hierarchical representation. To deal with the performance saturation w.r.t. depth, we propose a stacked version of our multi-patch model. Our proposed basic multi-patch model achieves the state-of-the-art performance on the GoPro dataset while enjoying a 40x faster runtime compared to current multi-scale methods. With 30ms to process an image at 1280x720 resolution, it is the first real-time deep motion deblurring model for 720p images at 30fps. For stacked networks, significant improvements (over 1.2dB) are achieved on the GoPro dataset by increasing the network depth. Moreover, by varying the depth of the stacked model, one can adapt the performance and runtime of the same network for different application scenarios.
[motion, dataset, stacking, video, recurrent, expensive, multiple, current, employed, consists] [runtime, depth, finer, note, scene, coarser, matching] [deblurring, image, dmphn, blur, ieee, gopro, psnr, input, figure, mse, blurry, proposed, sharp, result, method, ssim, australian, blind, nah, comparison, deblurred, competing] [network, deep, performance, output, stacked, convolutional, architecture, neural, residual, size, compared, best, fast, table, number, top, achieves, process, weight, variant, videodeblurring] [model, encoder, decoder, evaluate] [level, hierarchical, cnn, improve, spatial, feature, propose, bottom, pyramid] [loss, learning, training, conventional, investigate]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Hongguang and Dai, Yuchao and Li, Hongdong and Koniusz, Piotr},
  title = {Deep Stacked Hierarchical Multi-Patch Network for Image Deblurring},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Turn a Silicon Camera Into an InGaAs Camera
Feifan Lv, Yinqiang Zheng, Bohan Zhang, Feng Lu


Short-wave infrared (SWIR) imaging has a wide range of industrial and civilian applications. However, the InGaAs sensors commonly used for SWIR imaging suffer from a variety of drawbacks, including high price, low resolution, unstable quality, and so on. In this paper, we propose a novel solution for SWIR imaging using a common Silicon sensor, which is cheaper and offers higher resolution and better technical maturity than the specialized InGaAs sensor. Our key idea is to approximate the response of the InGaAs sensor by exploiting the largely ignored sensitivity of a Silicon sensor, weak as it is, in the SWIR range. To this end, we build a multi-channel optical system to collect a new SWIR dataset and present a physically meaningful three-stage image processing algorithm on the basis of CNN. Both qualitative and quantitative experiments show promising experimental results, which demonstrate the effectiveness of the proposed method.
[dataset, signal, optical, work, video, dynamic, consists] [sensor, camera, solution, light, inf, range, visible, corresponding, reconstruction, international, algorithm, array, directly, pipeline, simulation, problem] [silicon, swir, ingaas, image, imaging, bandpass, figure, wavelength, compressive, infrared, longpass, high, proposed, quantum, resolution, quality, ieee, input, sensing, ssim, conference, collect, real, schematic, based, method, simulate] [design, filter, network, processing, hardware, low, efficiency, conv, typical, effectiveness, deep, table, higher, represents, architecture, process, order] [system, sensitivity, diagram, sensitive] [three, feature, propose] [set, loss, novel, specific, learning, existing, alignment]
@InProceedings{Lv_2019_CVPR,
  author = {Lv, Feifan and Zheng, Yinqiang and Zhang, Bohan and Lu, Feng},
  title = {Turn a Silicon Camera Into an InGaAs Camera},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Low-Rank Tensor Completion With a New Tensor Nuclear Norm Induced by Invertible Linear Transforms
Canyi Lu, Xi Peng, Yunchao Wei


This work studies the low-rank tensor completion problem, which aims to exactly recover a low-rank tensor from partially observed entries. Our model is inspired by the recently proposed tensor-tensor product (t-product) based on any invertible linear transforms. When the linear transforms satisfy certain conditions, we deduce the new tensor tubal rank, tensor spectral norm, and tensor nuclear norm. Equipped with the tensor nuclear norm, we then solve the tensor completion problem by solving a convex program and provide the theoretical bound for the exact recovery under certain tensor incoherence conditions. The achieved sampling complexity is order-wise optimal. Our model and result greatly extend existing results in low-rank matrix and tensor completion. Numerical experiments verify our results, and the application to image recovery demonstrates the superiority of our method.
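For intuition, the NumPy sketch below computes the transform-induced tensor nuclear norm described above: an invertible linear transform is applied along the third mode, and the matrix nuclear norms of the resulting frontal slices are summed. Any normalization constant associated with a particular transform (such as the factor used in the DFT-based t-SVD case) is omitted, so the exact scaling should be treated as an assumption.

import numpy as np

def tensor_nuclear_norm(X, L):
    # X: (n1, n2, n3) tensor; L: (n3, n3) invertible transform applied along the tube (third) mode.
    Xt = np.einsum('ijk,lk->ijl', X, L)
    return sum(np.linalg.norm(Xt[:, :, k], ord='nuc') for k in range(Xt.shape[2]))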
[work, interesting, algebra] [linear, matrix, nuclear, completion, convex, tubal, definition, problem, defined, discrete, exact, theorem, induced, denote, rankt, guarantee, tnn, define, singular, slice, compute, solving, incoherence, fourier, general, exactly, solve, computer, optimal, robust, bound] [transform, recovery, based, invertible, image, figure, proposed, spectral, result, frontal, ieee, real, color, denoted, recover, diagonal, application] [tensor, norm, low, block, transforms, satisfies, performance, denotes, operator, orthogonal, size, numerical, applied, equivalent, best] [model, observed, random, program, consider, correct, machine] [] [rank, data, product, existing, sampling, main, min, set, minimization, cosine]
@InProceedings{Lu_2019_CVPR,
  author = {Lu, Canyi and Peng, Xi and Wei, Yunchao},
  title = {Low-Rank Tensor Completion With a New Tensor Nuclear Norm Induced by Invertible Linear Transforms},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Joint Representative Selection and Feature Learning: A Semi-Supervised Approach
Suchen Wang, Jingjing Meng, Junsong Yuan, Yap-Peng Tan


In this paper, we propose a semi-supervised approach for representative selection, which finds a small set of representatives that can well summarize a large data collection. Given labeled source data and big unlabeled target data, we aim to find representatives in the target data, which can not only represent and associate data points belonging to each labeled category, but also discover novel categories in the target data, if any. To leverage labeled source data, we guide representative selection from labeled source to unlabeled target. We propose a joint optimization framework which alternately optimizes (1) representative selection in the target data and (2) discriminative feature learning from both the source and the target for better representative selection. Experiments on image and video datasets demonstrate that our proposed approach not only finds better representatives, but also can discover novel categories in the target data that are not in the source.
[video, joint, summarization, key, represented, second, dataset, breakfast, framework, previous, term] [problem, optimization, algorithm, approach, well, local, point, formulation] [figure, proposed, based, image, glass] [selection, number, accuracy, cost, search, squeeze, better, sparse, performance, table, entire] [find, orange, represent, finding, iterative, sil, discover, evaluate, juice, appear] [feature, recall, location, category, propose, leverage] [source, target, representative, data, set, discriminative, learning, zij, subset, item, labeled, dij, facility, training, novel, objective, hard, narrated, update, serving, zkj, select, selected, representation, lopen, loss, mnist, experimental, determinantal, strategy, learn, datasets, clustering, min, dkj]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Suchen and Meng, Jingjing and Yuan, Junsong and Tan, Yap-Peng},
  title = {Joint Representative Selection and Feature Learning: A Semi-Supervised Approach},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
The Domain Transform Solver
Akash Bapat, Jan-Michael Frahm


We present a novel framework for edge-aware optimization that is an order of magnitude faster than the state of the art while maintaining comparable results. Our key insight is that the optimization can be formulated by leveraging properties of the domain transform, a method for edge-aware filtering that defines a distance-preserving 1D mapping of the input space. This enables our method to improve performance for a wide variety of problems including stereo, depth super-resolution, render from defocus, colorization, and especially high-resolution depth filtering, while keeping the computational complexity linear in the number of pixels. Our method is highly parallelizable and adaptable, and it has demonstrable linear scalability with respect to image resolution. We provide a comprehensive evaluation of our method w.r.t. speed and accuracy for a variety of tasks.
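As background for the filtering primitive mentioned above, the sketch below is a 1D domain transform filter in the style of Gastal and Oliveira: samples separated by strong guide-image edges are pushed far apart in the transformed domain, so the recursive filter does not blur across them. The sigma values and the single forward/backward pass are simplifying assumptions, and this is not the paper's solver itself.

import numpy as np

def domain_transform_filter_1d(signal, guide, sigma_s=32.0, sigma_r=0.1):
    dI = np.abs(np.diff(guide, prepend=guide[:1]))
    dt = 1.0 + (sigma_s / sigma_r) * dI            # distance-preserving warp of the 1D domain
    a = np.exp(-np.sqrt(2.0) / sigma_s)            # feedback coefficient of the recursive filter
    w = a ** dt                                    # per-sample blending weights
    out = np.asarray(signal, dtype=float).copy()
    for i in range(1, len(out)):                   # left-to-right pass
        out[i] = (1 - w[i]) * out[i] + w[i] * out[i - 1]
    for i in range(len(out) - 2, -1, -1):          # right-to-left pass
        out[i] = (1 - w[i + 1]) * out[i] + w[i + 1] * out[i + 1]
    return out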
[time, framework, dataset, term] [optimization, depth, stereo, computer, disparity, vision, well, pattern, confidence, middlebury, solver, volume, approach, compute, matching, international, algorithm, accurate, estimate, problem, rmse, hfbs, linear] [image, color, bilateral, method, transform, filtering, conference, ieee, fbs, pixel, blur, barron, dts, result, acm, variety, figure, defocus, high, colorization, proposed, input, poole, resolution, synthetic, edgeaware] [fast, filter, performance, kernel, gradient, number, size, parallel, efficient, higher, table, original, processing, complexity, accuracy] [evaluation, machine] [faster, map, grid, including, semantic, segmentation] [domain, target, space, function, learning, training, distance, task, large]
@InProceedings{Bapat_2019_CVPR,
  author = {Bapat, Akash and Frahm, Jan-Michael},
  title = {The Domain Transform Solver},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detection
Lu Zhang, Jianming Zhang, Zhe Lin, Huchuan Lu, You He


Detecting salient objects in cluttered scenes is a big challenge. To address this problem, we argue that the model needs to learn discriminative semantic features for salient objects. To this end, we propose to leverage captioning as an auxiliary semantic task to boost salient object detection in complex scenarios. Specifically, we develop a CapSal model which consists of two sub-networks, the Image Captioning Network (ICN) and the Local-Global Perception Network (LGPN). ICN encodes the embedding of a generated caption to capture the semantic information of major objects in the scene, while LGPN incorporates the captioning embedding with local-global visual contexts for predicting the saliency map. ICN and LGPN are jointly trained to model high-level semantics as well as visual saliency. Extensive experiments demonstrate the effectiveness of image captioning in boosting the performance of salient object detection. In particular, our model performs significantly better than the state-of-the-art methods on several challenging datasets of complex scenarios.
[dataset, ucf, capture, jointly, hidden] [computer, vision, pattern, local, ground, corresponding, truth] [image, conference, ieee, input, proposed, figure, comparison] [network, deep, performance, convolutional, effectiveness, precision] [caption, captioning, model, perception, visual, vector, attention, embedded, generated, generate, mscoco, word] [salient, saliency, object, capsal, global, feature, map, propose, semantic, icn, detection, lgpn, gpm, bmpm, dgrl, utilize, final, module, lpm, mask, mae, backbone, three, fully, bounding, recall, mdf, rfcn, dcl, dhs, nldf, amulet, boost, semantics, baseline, context, leverage] [training, learning, shared, embedding, datasets, trained, knowledge, exploit, loss, task]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Lu and Zhang, Jianming and Lin, Zhe and Lu, Huchuan and He, You},
  title = {CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Phase-Only Image Based Kernel Estimation for Single Image Blind Deblurring
Liyuan Pan, Richard Hartley, Miaomiao Liu, Yuchao Dai


The image motion blurring process is generally modelled as the convolution of a blur kernel with a latent image. Therefore, the estimation of the blur kernel is essentially important for blind image deblurring. Unlike existing approaches which focus on approaching the problem by enforcing various priors on the blur kernel and the latent image, we aim at obtaining a high-quality blur kernel directly by studying the problem in the frequency domain. We show that the auto-correlation of the absolute phase-only image can provide faithful information about the motion that caused the blur (e.g., the motion direction and magnitude; we call this the motion pattern in this paper), leading to a new and efficient blur kernel estimation approach. The blur kernel is then refined and the sharp image is estimated by solving an optimization problem that enforces a regularization on the blur kernel and the latent image. We further extend our approach to handle non-uniform blur, which involves spatially varying blur kernels. Our approach is evaluated extensively on synthetic and real data and shows good results compared to the state-of-the-art deblurring approaches.
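To make the central quantity concrete, the NumPy sketch below computes the absolute phase-only image and its autocorrelation via the Fourier transform. The small eps and the fftshift are incidental implementation choices, and the subsequent kernel refinement described in the abstract is not shown.

import numpy as np

def phase_only_autocorrelation(img, eps=1e-8):
    F = np.fft.fft2(img)
    phase_only = np.abs(np.fft.ifft2(F / (np.abs(F) + eps)))   # absolute phase-only image
    P = np.fft.fft2(phase_only)
    autocorr = np.fft.ifft2(np.abs(P) ** 2).real               # autocorrelation via Wiener-Khinchin
    return phase_only, np.fft.fftshift(autocorr)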
[motion, dataset, video, flow, term, multiple, jointly, framework, dynamic] [approach, estimation, fourier, problem, absolute, camera, linear, single, depth, scene, estimated, estimate, vision, directly, estimating, stereo, direction, optimization, solution, note, computer, principal, yuchao, modelled, pattern, solving, rotation] [image, blur, deblurring, blurry, ieee, latent, proposed, sharp, blind, autocorrelation, result, based, prior, input, handle, figure, jiaya, real, pan, blurring, method, transform, quantitative, nah, yan, comparison, conference, jue, jinshan, liyuan, miaomiao, phaseonly, synthetic] [kernel, deep, phase, convolution, gradient, neural, sparsity, layer, better, regularization] [model, simple, example] [refined, map] [uniform, learning, function, enforcing, learned]
@InProceedings{Pan_2019_CVPR,
  author = {Pan, Liyuan and Hartley, Richard and Liu, Miaomiao and Dai, Yuchao},
  title = {Phase-Only Image Based Kernel Estimation for Single Image Blind Deblurring},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Hierarchical Discrete Distribution Decomposition for Match Density Estimation
Zhichao Yin, Trevor Darrell, Fisher Yu


Explicit representations of the global match distributions of pixel-wise correspondences between pairs of images are desirable for uncertainty estimation and downstream applications. However, the computation of the match density for each pixel may be prohibitively expensive due to the large number of candidates. In this paper, we propose Hierarchical Discrete Distribution Decomposition (HD^3), a framework suitable for learning probabilistic pixel correspondences in both optical flow and stereo matching. We decompose the full match density into multiple scales hierarchically, and estimate the local matching distributions at each scale conditioned on the matching and warping at coarser scales. The local distributions can then be composed together to form the global match density. Despite its simplicity, our probabilistic method achieves state-of-the-art results for both optical flow and stereo matching on established benchmarks. We also find the estimated uncertainty is a good indication of the reliability of the predicted correspondences.
[flow, optical, motion, prediction, sintel, dataset, multiple, framework, time, joint] [match, stereo, kitti, matching, estimation, confidence, local, discrete, estimate, error, scene, correspondence, decomposition, mpi, wij, dense, point, denote, classical, estimating, general, fij] [method, image, decomposed, figure, pixel, based, recover] [density, network, deep, size, full, batch, computation, correlation, lowest, achieve, convolutional, cost, residual, performance, entire, rate, competitive] [model, vector, decoder, probability] [feature, level, pyramid, map, adopt, final, hierarchical, propose, predicted, context] [uncertainty, training, learning, distribution, probabilistic, large, test, data, train, set, support, loss, trained, embedding]
@InProceedings{Yin_2019_CVPR,
  author = {Yin, Zhichao and Darrell, Trevor and Yu, Fisher},
  title = {Hierarchical Discrete Distribution Decomposition for Match Density Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
FOCNet: A Fractional Optimal Control Network for Image Denoising
Xixi Jia, Sanyang Liu, Xiangchu Feng, Lei Zhang


Deep convolutional neural networks (DCNN) have been successfully used in many low-level vision problems such as image denoising. Recent studies on the mathematical foundation of DCNN have revealed that the forward propagation of DCNN corresponds to a dynamic system, which can be described by an ordinary differential equation (ODE) and solved by the optimal control method. However, most of these methods employ integer-order differential equations, which have only local connectivity in time and cannot describe the long-term memory of the system. Inspired by the fact that fractional-order differential equations have long-term memory, in this paper we develop an advanced image denoising network, namely FOCNet, by solving a fractional optimal control (FOC) problem. Specifically, the network structure is designed based on the discretization of a fractional-order differential equation, which enjoys long-term memory in both forward and backward passes. Besides, multi-scale feature interactions are introduced into FOCNet to strengthen the control of the dynamic system. Extensive experiments demonstrate the leading performance of the proposed FOCNet on image denoising. Code will be made available.
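The long-term memory property can be seen directly in the Grunwald-Letnikov discretization of a fractional derivative, sketched below in NumPy: every past sample contributes to the current value, whereas an integer-order difference only touches the most recent samples. This is background intuition for the abstract, not the FOCNet architecture; the step size h and the recursive coefficient formula follow the standard textbook form.

import numpy as np

def gl_fractional_diff(x, alpha, h=1.0):
    # Approximate the alpha-order (0 < alpha < 1) derivative of a 1D sequence x.
    x = np.asarray(x, dtype=float)
    n = len(x)
    c = np.empty(n)
    c[0] = 1.0
    for j in range(1, n):
        c[j] = c[j - 1] * (1.0 - (alpha + 1.0) / j)   # (-1)^j * binomial(alpha, j)
    out = np.empty(n)
    for k in range(n):
        out[k] = np.dot(c[:k + 1], x[k::-1]) / h ** alpha   # weighted sum over ALL past samples
    return out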
[time, forward, state, dynamic, current, propagation, dataset, previous, backward] [optimal, differential, computer, vision, pattern, problem, equation, discretization, corresponding, continuous, derivative, international, analysis] [image, focnet, denoising, control, ieee, psnr, noise, based, conference, nonlinear, figure, fractional, dncnn, ffdnet, wnnm, clean, proposed, competing, described, input] [scale, network, neural, dcnn, deep, fode, memnet, process, table, residual, net, tnrd, convolutional, constructed, layer, architecture, strengthen, evolution, iode, output, convolution] [memory, system, red, mathematical] [feature, level, average] [function, learning, loss, set, noisy, maximum, setting, space, specific]
@InProceedings{Jia_2019_CVPR,
  author = {Jia, Xixi and Liu, Sanyang and Feng, Xiangchu and Zhang, Lei},
  title = {FOCNet: A Fractional Optimal Control Network for Image Denoising},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Orthogonal Decomposition Network for Pixel-Wise Binary Classification
Chang Liu, Fang Wan, Wei Ke, Zhuowei Xiao, Yuan Yao, Xiaosong Zhang, Qixiang Ye


The weight sharing scheme and spatial pooling operations in Convolutional Neural Networks (CNNs) introduce semantic correlation to neighboring pixels on feature maps and therefore deteriorate their pixel-wise classification performance. In this paper, we implement an Orthogonal Decomposition Unit (ODU) that transforms a convolutional feature map into orthogonal bases aimed at de-correlating neighboring pixels on convolutional features. In theory, complete orthogonal decomposition produces orthogonal bases which can perfectly reconstruct any binary mask (ground-truth). In practice, we further design an incomplete orthogonal decomposition focusing on de-correlating local patches, which balances reconstruction performance and computational cost. Fully Convolutional Networks (FCNs) implemented with ODUs, referred to as Orthogonal Decomposition Networks (ODNs), learn de-correlated and complementary convolutional features and fuse such features in a pixel-wise selective manner. On pixel-wise binary classification tasks for two-dimensional image processing, specifically skeleton detection, edge detection, and saliency detection, and on one-dimensional keypoint detection, specifically S-wave arrival time detection for earthquake localization, ODNs consistently improve on the state of the art by significant margins.
[skeleton, time, arrival, multiple, dataset, fusion, outperforms, human] [decomposition, computer, pattern, reconstruction, vision, international, local, span, single, linear, keypoint] [ieee, conference, input, incomplete, image, figure, patch, comparison, reconstruct, based, decomposed] [orthogonal, convolutional, binary, performance, odu, table, output, deep, network, correlation, size, weight, unit, wei, earthquake, capability, layer, filter, residual, sharing] [complete] [detection, feature, map, srn, object, saliency, semantic, complementary, edge, neighboring, refinement, spatial, including, salient, odus, mae, perfectly, mask, fully, odns, segmentation, atop] [learning, classification, learn, positive, aij, training]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Chang and Wan, Fang and Ke, Wei and Xiao, Zhuowei and Yao, Yuan and Zhang, Xiaosong and Ye, Qixiang},
  title = {Orthogonal Decomposition Network for Pixel-Wise Binary Classification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Source Weak Supervision for Saliency Detection
Yu Zeng, Yunzhi Zhuge, Huchuan Lu, Lihe Zhang, Mingyang Qian, Yizhou Yu


The high cost of pixel-level annotations makes it appealing to train saliency detection models with weak supervision. However, a single weak supervision source usually does not contain enough information to train a well-performing model. To address this, we propose a unified framework to train saliency detection models with diverse weak supervision sources. In this paper, we use category labels, captions, and unlabelled data for training, yet other supervision sources can also be plugged into this flexible framework. We design a classification network (CNet) and a caption generation network (PNet), which learn to predict object categories and generate captions, respectively, while highlighting the most important regions for the corresponding tasks. An attention transfer loss is designed to transmit the supervision signal between networks, such that the network designed to be trained with one supervision source can benefit from another. An attention coherence loss is defined on unlabelled data to encourage the networks to detect generally salient regions instead of task-specific regions. We use CNet and PNet to generate pixel-level pseudo labels to train a saliency prediction network (SNet). During the testing phase, we only need SNet to predict saliency maps. Experiments demonstrate that the performance of our method compares favourably with unsupervised and weakly supervised methods, and even with some supervised methods.
[coherence, dataset, framework, predict, prediction, multiple, jointly] [computer, pattern, vision, defined] [image, ieee, conference, method, figure, input, proposed, denoted, result, row] [network, deep, connected, table, performance, convolutional, layer, max, better, weight, designed, neural, pooling, number, lat, design] [attention, caption, generate, generated, attended, visual, generation] [saliency, supervision, salient, cnet, feature, category, pnet, map, detection, weak, fully, weakly, localization, object, detect, coarse, global, mae, module, lac, huchuan, snet, region, spatial, score, propose] [loss, training, unlabelled, supervised, train, trained, data, transfer, classification, pseudo, learning, set, unsupervised, log, generally, dog, source]
@InProceedings{Zeng_2019_CVPR,
  author = {Zeng, Yu and Zhuge, Yunzhi and Lu, Huchuan and Zhang, Lihe and Qian, Mingyang and Yu, Yizhou},
  title = {Multi-Source Weak Supervision for Saliency Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ComDefend: An Efficient Image Compression Model to Defend Adversarial Examples
Xiaojun Jia, Xingxing Wei, Xiaochun Cao, Hassan Foroosh


Deep neural networks (DNNs) have been demonstrated to be vulnerable to adversarial examples. Specifically, adding imperceptible perturbations to clean images can fool well-trained deep neural networks. In this paper, we propose an end-to-end image compression model to defend adversarial examples: ComDefend. The proposed model consists of a compression convolutional neural network (ComCNN) and a reconstruction convolutional neural network (ResCNN). The ComCNN is used to maintain the structural information of the original image and purify adversarial perturbations, while the ResCNN is used to reconstruct the original image with high quality. In other words, ComDefend can transform the adversarial image to its clean version, which is then fed to the trained classifier. Our method is a pre-processing module, and does not modify the classifier's structure during the whole process. Therefore it can be combined with other model-specific defense models to jointly improve the classifier's robustness. A series of experiments conducted on MNIST, CIFAR10 and ImageNet show that the proposed method outperforms the state-of-the-art defense methods, and is consistently effective in protecting classifiers against adversarial attacks.
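A compact PyTorch sketch of the two-network idea follows: a compression CNN produces a compact code intended to discard imperceptible perturbations, and a reconstruction CNN restores a clean image that is then passed to the unchanged classifier. The channel counts, depths, and the test-time binarization threshold are assumptions for illustration, not the exact ComDefend configuration.

import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ELU())

class ComDefendSketch(nn.Module):
    def __init__(self, code_channels=12):
        super().__init__()
        self.comcnn = nn.Sequential(conv_block(3, 32), conv_block(32, 32),
                                    nn.Conv2d(32, code_channels, 3, padding=1))
        self.rescnn = nn.Sequential(conv_block(code_channels, 32), conv_block(32, 32),
                                    nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, x):
        code = torch.sigmoid(self.comcnn(x))   # compact code in [0, 1]
        code = (code > 0.5).float()            # binarize at test time to purify perturbations
        return self.rescnn(code)               # reconstructed (hopefully clean) image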
[time, fed, dataset, consists, second] [normal, computer, vision] [image, method, proposed, clean, input, noise, reconstruct, figure, pixel, reconstructed, result, comparison, ieee, conference, high] [compression, original, layer, size, table, accuracy, neural, deep, network, structure, gaussian, parameter, performance, represents, output, compact, order, compressed, achieve, imagenet, compared, smoothing, selection] [adversarial, comcnn, model, reccnn, attack, comdefend, imperceptible, defensive, deepfool, defend, arxiv, preprint, hgd, bim, defense, perturbation, fgsm, strongest, random, adding, reconstructs, resist] [propose, map, improve] [training, test, classification, learning, main, trained, representation, loss, idea, unified, classifier]
@InProceedings{Jia_2019_CVPR,
  author = {Jia, Xiaojun and Wei, Xingxing and Cao, Xiaochun and Foroosh, Hassan},
  title = {ComDefend: An Efficient Image Compression Model to Defend Adversarial Examples},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Combinatorial Persistency Criteria for Multicut and Max-Cut
Jan-Hendrik Lange, Bjoern Andres, Paul Swoboda


In combinatorial optimization, partial variable assignments are called persistent if they agree with some optimal solution. We propose persistency criteria for the multicut and max-cut problem as well as fast combinatorial routines to verify them. The criteria that we derive are based on mappings that improve feasible multicuts and cuts, respectively. Our elementary criteria can be checked enumeratively. The more advanced ones rely on fast algorithms for upper and lower bounds for the respective cut problems and on max-flow techniques for auxiliary min-cut problems. Our methods can be used as a preprocessing technique for reducing problem sizes or for computing partial optimality guarantees for solutions output by heuristic solvers. We show the efficacy of our methods on instances of both problems from computer vision, biomedical image analysis and statistical physics.
[graph, work] [cut, persistency, multicut, problem, optimal, solution, persistent, elementary, subgraphs, qpbo, pmc, optimization, defined, theorem, torus, variable, optimality, compute, general, lemma, triangle, well, computer, algorithm, discrete, linear, note, case, pcut, assumption, bjoern, feasible, quadratic, special, ising, checked] [mapping, method, image, figure, dual, side, based] [max, improving, connected, combinatorial, table, size, order, efficient, computing, reduced, original, criterion, fast, binary, compare, cost, weighted, applying] [subgraph, partial, find, candidate, finding, alexander] [edge, heuristic, instance, segmentation, average, propose] [set, min, data, clustering, minimum, symmetric]
@InProceedings{Lange_2019_CVPR,
  author = {Lange, Jan-Hendrik and Andres, Bjoern and Swoboda, Paul},
  title = {Combinatorial Persistency Criteria for Multicut and Max-Cut},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
S4Net: Single Stage Salient-Instance Segmentation
Ruochen Fan, Ming-Ming Cheng, Qibin Hou, Tai-Jiang Mu, Jingdong Wang, Shi-Min Hu


We consider an interesting problem---salient instance segmentation. In addition to producing approximate bounding boxes, our network also outputs high-quality instance-level segments. Taking into account the category-independent property of each target, we design a single stage salient instance segmentation framework, with a novel segmentation branch. Our new branch considers not only the local context inside each detection window but also its surrounding context, enabling us to distinguish instances in the same scope even under obstruction. Our network is end-to-end trainable and runs at a fast speed (40 fps when processing an image with resolution 320 x 320). We evaluate our approach on a publicly available benchmark and show that it outperforms other alternative solutions. We also provide a thorough analysis of the design choices to help readers better understand the functions of each part of our network. The source code can be found at https://github.com/RuochenFan/S4Net.
[explicitly, dataset, interesting, outperforms] [contrast, account, computer, vision, scene] [image, proposed, background, separation, ieee, color, input, based, figure, method] [binary, ternary, convolutional, network, better, performance, size, layer, table, deep, number, design, neural, receptive, experiment, explore, compared] [model, visual, ability, orange, consider] [segmentation, salient, object, feature, roimasking, detection, instance, semantic, branch, region, roialign, mask, bounding, roipool, inside, proposal, saliency, context, rectangle, map, masking, surrounding, interest, box, segment, detector, msrnet, distinguish, detecting, foreground, global] [set, training, base, target, negative, loss, learning, positive]
@InProceedings{Fan_2019_CVPR,
  author = {Fan, Ruochen and Cheng, Ming-Ming and Hou, Qibin and Mu, Tai-Jiang and Wang, Jingdong and Hu, Shi-Min},
  title = {S4Net: Single Stage Salient-Instance Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Decomposition Algorithm for the Sparse Generalized Eigenvalue Problem
Ganzhao Yuan, Li Shen, Wei-Shi Zheng


The sparse generalized eigenvalue problem arises in a number of standard and modern statistical learning models, including sparse principal component analysis, sparse Fisher discriminant analysis, and sparse canonical correlation analysis. However, this problem is difficult to solve since it is NP-hard. In this paper, we consider a new effective decomposition method to tackle this problem. Specifically, we use random and/or swapping strategies to find a working set and perform a global combinatorial search over the small subset of variables. We consider a bisection search method and a coordinate descent method for solving the quadratic fractional programming subproblem. In addition, we provide some theoretical analysis for the proposed method. Our experiments on synthetic data and real-world data have shown that our method significantly and consistently outperforms existing solutions in terms of accuracy.
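The NumPy sketch below is not the paper's decomposition algorithm; it is a simple truncated power-iteration baseline for the same problem, max_x (x'Ax)/(x'Bx) subject to ||x||_0 <= k, included only to make the sparse generalized eigenvalue problem concrete. It assumes B is positive definite.

import numpy as np

def truncated_power_sparse_gev(A, B, k, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    for _ in range(n_iters):
        x = np.linalg.solve(B, A @ x)            # power step for B^{-1} A
        x[np.argsort(np.abs(x))[:-k]] = 0.0      # keep only the k largest-magnitude entries
        x = x / np.sqrt(x @ B @ x)               # normalize so that x'Bx = 1
    return x, (x @ A @ x) / (x @ B @ x)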
[work, stationary, iteratively, recognition] [problem, optimal, solution, algorithm, coordinate, eigenvalue, trf, quadratic, programming, solving, matrix, decomposition, solve, optimality, subproblem, analysis, condition, principal, bisection, optimization, cwa, journal, constraint, bound, international, convex, semidefinite, miny, computer, case, theorem, tpm, vision, general, define, lemma, guaranteed, variable] [method, based, fractional, conference, component, figure, pca, ieee, proposed, swapping, control] [sparse, sparsity, descent, search, denotes, accuracy, gradient, parameter, truncated, size, converge, processing] [find, working, machine, arg, consider, vector, unique, random] [global, propose] [objective, min, generalized, data, set, function, strategy, learning, convergence, positive, hard, cca, china]
@InProceedings{Yuan_2019_CVPR,
  author = {Yuan, Ganzhao and Shen, Li and Zheng, Wei-Shi},
  title = {A Decomposition Algorithm for the Sparse Generalized Eigenvalue Problem},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Polynomial Representation for Persistence Diagram
Zhichao Wang, Qian Li, Gang Li, Guandong Xu


Persistence diagram (PD) has been considered as a compact descriptor for topological data analysis (TDA). Unfortunately, PD cannot be directly used in machine learning methods since it is a multiset of points. Recent efforts have been devoted to transforming PDs into vectors to accommodate machine learning methods. However, they share one common shortcoming: the mapping of PDs to a feature representation depends on a pre-defined polynomial. To address this limitation, this paper proposes an algebraic representation for PDs, i.e., polynomial representation. In this work, we discover a set of general polynomials that vanish on vectorized PDs and extract the task-adapted feature representation from these polynomials. We also prove two attractive properties of the proposed polynomial representation, i.e., stability and linear separability. Experiments also show that our method compares favorably with state-of-the-art TDA methods.
[recognition, construct, extract, time, work, second] [persistence, polynomial, topological, vanishing, stable, algebraic, analysis, linearly, protein, vectorization, point, international, defined, singular, tda, persistent, algorithm, compute, matrix, prove, hilbert, computer, respect, relaxed, journal, directly, linear, form, topology, proved, definition, theorem, null, taut, pattern, general, vanish] [proposed, method, texture, conference, mapping, image, diagonal, figure, ieee, statistical] [kernel, stability, separable, parameter, standard, computational, scalar, accuracy, weighted, density, neural, flexible] [vector, machine, diagram, represent, considered, common] [feature, map] [representation, space, function, data, learning, set, distance, dimension, euclidean, class, discriminative, classification, specific]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Zhichao and Li, Qian and Li, Gang and Xu, Guandong},
  title = {Polynomial Representation for Persistence Diagram},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Crowd Counting and Density Estimation by Trellis Encoder-Decoder Networks
Xiaolong Jiang, Zehao Xiao, Baochang Zhang, Xiantong Zhen, Xianbin Cao, David Doermann, Ling Shao


Crowd counting has recently attracted increasing interest in computer vision but remains a challenging problem. In this paper, we propose a trellis encoder-decoder network (TEDnet) for crowd counting, which focuses on generating high-quality density estimation maps. The major contributions are four-fold. First, we develop a new trellis architecture that incorporates multiple decoding paths to hierarchically aggregate features at different encoding stages, which improves the representative capability of convolutional features for large variations in objects. Second, we employ dense skip connections interleaved across paths to facilitate sufficient multi-scale feature fusions, which also helps TEDnet to absorb the supervision information. Third, we propose a new combinatorial loss to enforce similarities in local coherence and spatial correlation between maps. By distributedly imposing this combinatorial loss on intermediate outputs, TEDnet can improve the back-propagation process and alleviate the gradient vanishing problem. Finally, on four widely-used benchmarks, our TEDnet achieves the best overall performance in terms of both density map quality and counting accuracy, with an improvement of up to 14% in the MAE metric. These results validate the effectiveness of TEDnet for crowd counting.
[multiple, second, fusion, dataset] [computer, estimation, vision, pattern, hourglass, dense, ground, international, estimated, corresponding, truth, accurate, local, approach] [conference, ieee, figure, mse, quality, based, abstraction, image, psnr, proposed, intermediate, ssim, method] [density, tednet, convolutional, trellis, performance, network, distributed, combinatorial, best, table, skip, block, correlation, sal, deep, conv, pooling, neural, accuracy, architecture, max, est, stride, number] [decoding, encoding, decoder, generate, encoder, arxiv, preprint, path, implemented] [feature, crowd, counting, spatial, map, mae, semantic, localization, supervision, shanghaitech, hierarchy, scl, detection, mcnn, improve, regression, propose] [loss, learning]
@InProceedings{Jiang_2019_CVPR,
  author = {Jiang, Xiaolong and Xiao, Zehao and Zhang, Baochang and Zhen, Xiantong and Cao, Xianbin and Doermann, David and Shao, Ling},
  title = {Crowd Counting and Density Estimation by Trellis Encoder-Decoder Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Cross-Atlas Convolution for Parameterization Invariant Learning on Textured Mesh Surface
Shiwei Li, Zixin Luo, Mingmin Zhen, Yao Yao, Tianwei Shen, Tian Fang, Long Quan


We present a convolutional network architecture for direct feature learning on mesh surfaces through their atlases of texture maps. The texture map encodes the parameterization from the 3D to the 2D domain, rendering not only RGB values but also rasterized geometric features if necessary. Since the parameterization of the texture map is not pre-determined and depends on the surface topologies, we introduce a novel cross-atlas convolution to recover the original mesh geodesic neighborhood, so as to achieve invariance to arbitrary parameterizations. The proposed module is integrated into classification and segmentation architectures, which take the input texture map of a mesh and infer the output predictions. Our method not only shows competitive performance on public classification and segmentation benchmarks, but also paves the way for broader learning on mesh surfaces.
[dataset, multiple] [mesh, atlas, geodesic, computer, point, vision, parameterization, textured, geometric, field, pattern, shape, corresponding, neighborhood, surface, cloud, angle, meshmnist, volume, distortion, rotation, pacross, roffset, projection, geometry, irregular, problem, modelnet, local] [texture, pixel, conference, ieee, method, image, input, resolution, spectral, figure, deconvolution] [convolution, network, standard, convolutional, receptive, deep, original, pooling, neural, kernel, accuracy, small, apply, applying, output, better, applied, size] [natural, regular] [map, feature, offset, segmentation, semantic, spatial, global, european, three, fully] [learning, classification, invariant, testing, training, data, mnist, task]
@InProceedings{Li_2019_CVPR,
  author = {Li, Shiwei and Luo, Zixin and Zhen, Mingmin and Yao, Yao and Shen, Tianwei and Fang, Tian and Quan, Long},
  title = {Cross-Atlas Convolution for Parameterization Invariant Learning on Textured Mesh Surface},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Surface Normal Estimation With Hierarchical RGB-D Fusion
Jin Zeng, Yanfeng Tong, Yunmu Huang, Qiong Yan, Wenxiu Sun, Jing Chen, Yongtian Wang


The growing availability of commodity RGB-D cameras has boosted applications in the field of scene understanding. However, as a fundamental scene understanding task, surface normal estimation from RGB-D data lacks thorough investigation. In this paper, a hierarchical fusion network with adaptive feature re-weighting is proposed for surface normal estimation from a single RGB-D image. Specifically, the features from the color image and depth are successively integrated at multiple scales to ensure global surface smoothness while preserving visually salient details. Meanwhile, the depth features are re-weighted with a confidence map estimated from depth before merging into the color branch, to avoid artifacts caused by input depth corruption. Additionally, a hybrid multi-scale loss function is designed to learn accurate normal estimation given a noisy ground-truth dataset. Extensive experimental results validate the effectiveness of the fusion strategy and the loss design, outperforming state-of-the-art normal estimation schemes.
[fusion, recognition, prediction, early, dataset, late] [depth, normal, rgb, surface, estimation, confidence, computer, vision, pattern, scannet, single, geometric, accurate, reconstruction, sensor, planar] [conference, image, input, proposed, hybrid, ieee, result, based, method, color, figure, inpainting, missing, smooth, sharp, comparison, detail, colorization, visually, noise, resolution, consistency, difference] [network, structure, convolutional, table, deep, scale, binary, designed, performance, design, output, scheme, small, convolution, neural] [median, generate, provided] [map, hierarchical, branch, object, module, merge, adopt, indicating, global, semantic, ablation] [loss, function, training, data, datasets]
@InProceedings{Zeng_2019_CVPR,
  author = {Zeng, Jin and Tong, Yanfeng and Huang, Yunmu and Yan, Qiong and Sun, Wenxiu and Chen, Jing and Wang, Yongtian},
  title = {Deep Surface Normal Estimation With Hierarchical RGB-D Fusion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Knowledge-Embedded Routing Network for Scene Graph Generation
Tianshui Chen, Weihao Yu, Riquan Chen, Liang Lin


Understanding a scene in depth not only involves locating/recognizing individual objects, but also requires inferring the relationships and interactions among them. However, since the distribution of real-world relationships is seriously unbalanced, existing methods perform quite poorly for the less frequent relationships. In this work, we find that the statistical correlations between object pairs and their relationships can effectively regularize the semantic space and make prediction less ambiguous, thus helping to address the unbalanced distribution issue. To achieve this, we incorporate these statistical correlations into deep neural networks to facilitate scene graph generation by developing a Knowledge-Embedded Routing Network. More specifically, we show that the statistical correlations between objects appearing in images and their relationships can be explicitly represented by a structured knowledge graph, and a routing mechanism is learned to propagate messages through the graph to explore their interactions. Extensive experiments on the large-scale Visual Genome dataset demonstrate the superiority of the proposed method over current state-of-the-art competitors.
[graph, dataset, state, hidden, explicitly, routing, previous, subject, prediction, predict, propagate, propagation, performs, walking] [scene, constraint, well, corresponding, predicts, compute] [statistical, method, image, figure, proposed, prior, comparison, based] [network, neural, performance, structured, deep, regularize, table, explore, compared, achieves] [relationship, model, visual, node, smn, frequent, genome, generation, hic, message, correlate, gated, mechanism, interplay, implemented, mcc] [object, improvement, semantic, feature, region, three, detection, faster, spatial, rcnn, predicted, bounding, category] [distribution, knowledge, set, pair, existing, label, representation, gnn, class, learning, training, classification, target, metric, address, learned, task, learn, proportion]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Tianshui and Yu, Weihao and Chen, Riquan and Lin, Liang},
  title = {Knowledge-Embedded Routing Network for Scene Graph Generation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
An End-To-End Network for Panoptic Segmentation
Huanyu Liu, Chao Peng, Changqian Yu, Jingbo Wang, Xu Liu, Gang Yu, Wei Jiang


Panoptic segmentation, which needs to assign a category label to each pixel and segment each object instance simultaneously, is a challenging topic. Traditionally, the existing approaches utilize two independent models without sharing features, which makes the pipeline inefficient to implement. In addition, a heuristic method is usually employed to merge the results. However, the overlapping relationship between object instances is difficult to determine without sufficient context information during the merging process. To address the problems, we propose a novel end-to-end Occlusion Aware Network (OANet) for panoptic segmentation, which can efficiently and effectively predict both the instance and stuff segmentation in a single network. Moreover, we introduce a novel spatial ranking module to deal with the occlusion problem between the predicted instances. Extensive experiments have been done to validate the performance of our proposed method and promising results have been achieved on the COCO Panoptic benchmark.
[predict, prediction, dataset, tie] [algorithm, ground, problem, scene, occlusion, equation, column, single, corresponding, field, truth, pipeline] [figure, image, method, pixel, proposed, input, balance, result, quality, conduct, separate] [network, table, sharing, convolutional, deep, convolution, represents, better, layer, gradient, performance, rate] [model, introduce] [segmentation, instance, stuff, spatial, object, panoptic, semantic, feature, score, module, branch, backbone, detection, mask, overlapping, context, coco, pqth, propose, pyramid, proposal, rpn, map, person, pqst, category, three, head, supervision, bounding, ablation, heuristic, merge, merging] [ranking, learning, training, share, loss, set, task, large, label, novel, classification, train]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Huanyu and Peng, Chao and Yu, Changqian and Wang, Jingbo and Liu, Xu and Yu, Gang and Jiang, Wei},
  title = {An End-To-End Network for Panoptic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fast and Flexible Indoor Scene Synthesis via Deep Convolutional Generative Models
Daniel Ritchie, Kai Wang, Yu-An Lin


We present a new, fast and flexible pipeline for indoor scene synthesis that is based on deep convolutional generative models. Our method operates on a top-down image-based representation, and inserts objects iteratively into the scene by predicting their category, location, orientation and size with separate neural network modules. Our pipeline naturally supports automatic completion of partial scenes, as well as synthesis of complete scenes, without any modifications. Our method is significantly faster than the previous image-based method, and generates results that outperform it and other state-of-the-art deep generative scene models in terms of faithfulness to training data and perceived visual quality.
[work, multiple, predict, predicting, perform, time] [scene, indoor, orientation, virtual, well, single, completion, vision, pipeline, predicts, computer, empty, allow, sin, bed] [generative, method, synthesis, figure, image, input, prior, separate, produce, synthesize, based] [deep, network, table, convolutional, neural, design, add, size, addition, output, highly, supported] [model, generate, room, probability, living, generated, partial, automatic, complete, visual, nightstand, understanding, wardrobe, ability, evaluate, adversarial, generating] [object, category, module, predicted, location, cnn, interior, spatial, bounding] [distribution, training, data, representation, train, learning, large, office, classifier, dimension]
@InProceedings{Ritchie_2019_CVPR,
  author = {Ritchie, Daniel and Wang, Kai and Lin, Yu-An},
  title = {Fast and Flexible Indoor Scene Synthesis via Deep Convolutional Generative Models},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Marginalized Latent Semantic Encoder for Zero-Shot Learning
Zhengming Ding, Hongfu Liu


Zero-shot learning has been well explored to precisely identify new unobserved classes through a visual-semantic function obtained from the existing objects. However, there exist two challenging obstacles: one is that the human-annotated semantics are insufficient to fully describe the visual samples; the other is the domain shift across existing and new classes. In this paper, we attempt to exploit the intrinsic relationship in the semantic manifold when the given semantics are not enough to describe the visual objects, and to enhance the generalization ability of the visual-semantic function with a marginalized strategy. Specifically, we design a Marginalized Latent Semantic Encoder (MLSE), which is learned on the augmented seen visual features and the latent semantic representation. Meanwhile, latent semantics are discovered under an adaptive graph reconstruction scheme based on the provided semantics. Consequently, our proposed algorithm can enrich visual characteristics from seen classes and generalize well to unobserved classes. Experimental results on zero-shot benchmarks demonstrate that the proposed model delivers superior performance over the state-of-the-art zero-shot learning approaches.
[graph, recognition] [algorithm, constraint, intrinsic, reconstruction, optimization, approach, matrix, well, notice, corresponding] [latent, proposed, figure, mapping, attribute, generative, based, image, cover] [adaptive, performance, explore, effective, better, table, accuracy, achieve, original, structure, shift] [visual, encoder, model, evaluation, describe, provided, observed, manifold, ability, embedded, seeking] [semantic, semantics, feature, category, adopt] [unseen, learning, zsl, marginalized, data, function, learn, representation, learned, cub, space, generalized, training, test, knowledge, generic, sun, seek, strategy, label, class, discriminative, min, generalization, update, conventional, existing, domain, novel, embedding, compatibility, objective, convergence, ale, exploit, experimental, unknown, distribution]
@InProceedings{Ding_2019_CVPR,
  author = {Ding, Zhengming and Liu, Hongfu},
  title = {Marginalized Latent Semantic Encoder for Zero-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Scale-Adaptive Neural Dense Features: Learning via Hierarchical Context Aggregation
Jaime Spencer, Richard Bowden, Simon Hadfield


How do computers and intelligent agents view the world around them? Feature extraction and representation constitute one of the basic building blocks towards answering this question. Traditionally, this has been done with carefully engineered hand-crafted techniques such as HOG, SIFT or ORB. However, there is no "one size fits all" approach that satisfies all requirements. In recent years, the rising popularity of deep learning has resulted in a myriad of end-to-end solutions to many computer vision problems. These approaches, while successful, tend to lack scalability and can't easily exploit information learned by other systems. Instead, we propose SAND features, a dedicated deep learning solution to feature extraction capable of providing hierarchical context information. This is achieved by employing sparse relative labels indicating relationships of similarity or dissimilarity between image locations. The nature of these labels results in an almost infinite set of dissimilar examples to choose from. We demonstrate how the selection of negative examples during training can be used to modify the feature space and vary its learned properties. To demonstrate the generality of this approach, we apply the proposed features to a multitude of tasks, each requiring different properties. This includes disparity estimation, semantic segmentation, self-localisation and SLAM. In all cases, we show how incorporating SAND features yields better or comparable results to the baseline, whilst requiring little to no additional training.
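A minimal sketch of a pixel-level contrastive objective driven by sparse relative labels, in the spirit of the SAND features above; the margin value, pair format, and distance choice are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def sparse_contrastive_loss(features, pairs, labels, margin=1.0):
    """features: (B, C, H, W) dense feature map.
    pairs: (N, 2, 3) long tensor of (batch, y, x) indices for each pixel pair.
    labels: (N,) float tensor, 1 if the pair should be similar, 0 otherwise."""
    b1, y1, x1 = pairs[:, 0, 0], pairs[:, 0, 1], pairs[:, 0, 2]
    b2, y2, x2 = pairs[:, 1, 0], pairs[:, 1, 1], pairs[:, 1, 2]
    f1 = features[b1, :, y1, x1]                       # (N, C) anchor embeddings
    f2 = features[b2, :, y2, x2]                       # (N, C) paired embeddings
    d = F.pairwise_distance(f1, f2)
    pos = labels * d.pow(2)                            # pull similar pairs together
    neg = (1 - labels) * F.relu(margin - d).pow(2)     # push dissimilar pairs apart
    return (pos + neg).mean()
```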
[tracking, work, combined] [computer, dense, vision, disparity, local, stereo, volume, matching, international, additional, odometry, ground, estimation, truth, kitti, provide, approach, pose, slam, descriptor, pattern, corresponding, keypoint, orb, absolute, error] [proposed, ieee, image, conference, method, figure, demonstrate, input, produce, based, tend] [represents, network, deep, order, cost, table, sparse, requiring, number, better] [visual, requires, system] [feature, baseline, global, sand, semantic, extraction, hierarchical, segmentation, final, context, object, detection, contextual] [learning, negative, mining, learned, loss, representation, contrastive, training, trained, positive, distance, set, data, specific, combination]
@InProceedings{Spencer_2019_CVPR,
  author = {Spencer, Jaime and Bowden, Richard and Hadfield, Simon},
  title = {Scale-Adaptive Neural Dense Features: Learning via Hierarchical Context Aggregation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Embedding Learning via Invariant and Spreading Instance Feature
Mang Ye, Xu Zhang, Pong C. Yuen, Shih-Fu Chang


This paper studies the unsupervised embedding learning problem, which requires an effective similarity measurement between samples in low-dimensional embedding space. Motivated by the positive concentrated and negative separated properties observed from category-wise supervised learning, we propose to utilize the instance-wise supervision to approximate these properties, which aims at learning data augmentation invariant and instance spread-out features. To achieve this goal, we propose a novel instance based softmax embedding method, which directly optimizes the `real' instance features on top of the softmax function. It achieves significantly faster learning speed and higher accuracy than all existing methods. The proposed method performs well for both seen and unseen testing categories with cosine similarity. It also achieves competitive performance even without pre-trained network over samples from fine-grained categories.
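A minimal sketch of the instance-wise softmax idea described above: each augmented view must select its own (real) instance feature among all instances in the batch, which enforces augmentation invariance while spreading instances apart. The temperature and the use of in-batch instances as negatives are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def instance_softmax_loss(f_real, f_aug, temperature=0.1):
    """f_real, f_aug: (N, D) embeddings of N instances and their augmented views."""
    f_real = F.normalize(f_real, dim=1)
    f_aug = F.normalize(f_aug, dim=1)
    logits = f_aug @ f_real.t() / temperature          # (N, N) cosine similarities
    targets = torch.arange(f_real.size(0), device=f_real.device)
    # each augmented view is classified as its own instance
    return F.cross_entropy(logits, targets)
```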
[dataset, performs] [directly, well, linear, augmented, corresponding, problem, general] [proposed, image, method, based, figure, input] [performance, network, deep, table, accuracy, achieves, efficiency, batch, full, siamese, achieve, top] [visual, probability, requires, recognized, memory, arxiv, preprint] [feature, instance, propose, cnn, category, improve] [learning, embedding, unsupervised, training, negative, data, positive, testing, classifier, similarity, softmax, learned, augmentation, knn, set, nce, exemplar, invariant, label, cosine, randomly, classification, supervised, hard, sample, metric, representation, product, deepcluster, memorized, minimizing, triplet, setting, mom, discriminative, existing, unlabelled, distribution, mining, share, unseen, task]
@InProceedings{Ye_2019_CVPR,
  author = {Ye, Mang and Zhang, Xu and Yuen, Pong C. and Chang, Shih-Fu},
  title = {Unsupervised Embedding Learning via Invariant and Spreading Instance Feature},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
AOGNets: Compositional Grammatical Architectures for Deep Learning
Xilai Li, Xi Song, Tianfu Wu


Neural architectures are the foundation for improving performance of deep neural networks (DNNs). This paper presents deep compositional grammatical architectures which harness the best of two worlds: grammar models and DNNs. The proposed architectures integrate compositionality and reconfigurability of the former and the capability of learning rich features of the latter in a principled way. We utilize AND-OR Grammar (AOG) as the network generator in this paper and call the resulting networks AOGNets. An AOGNet consists of a number of stages, each of which is composed of a number of AOG building blocks. An AOG building block splits its input feature map into N groups along feature channels and then treats it as a sentence of N words. It then jointly realizes a phrase structure grammar and a dependency grammar in bottom-up parsing the "sentence" for better feature exploration and reuse. It provides a unified framework for the best practices developed in state-of-the-art DNNs. In experiments, AOGNet is tested in the ImageNet-1K classification benchmark and the MS-COCO object detection and segmentation benchmark. In ImageNet-1K, AOGNet obtains better performance than ResNet and most of its variants, ResNeXt and its attention based variants such as SENet, DenseNet and DualPathNet. AOGNet also obtains the best model interpretability score using network dissection. AOGNet further shows better potential in adversarial defense. In MS-COCO, AOGNet obtains better performance than the ResNet and ResNeXt backbones in Mask R-CNN.
[consists, dependency, graph] [computer, vision, pattern, david] [input, ieee, conference, proposed, method, based, image, figure] [building, aog, block, network, deep, grammar, aognet, lateral, neural, structure, better, aognets, best, performance, number, obtains, fin, fout, design, resnet, table, operation, popular, resnets, effective, add, resnext, search, simplified, dnns, filter, group, convolutional, deeper, stochastic, architecture] [node, model, compositional, child, adversarial, arxiv, interpretability, potential, machine, preprint, phrase, going, simple] [feature, object, detection, map, three, segmentation, parsing] [learning, space, set, classification, training, data, representation, existing]
@InProceedings{Li_2019_CVPR,
  author = {Li, Xilai and Song, Xi and Wu, Tianfu},
  title = {AOGNets: Compositional Grammatical Architectures for Deep Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Robust Local Spectral Descriptor for Matching Non-Rigid Shapes With Incompatible Shape Structures
Yiqun Wang, Jianwei Guo, Dong-Ming Yan, Kai Wang, Xiaopeng Zhang


Constructing a robust and discriminative local descriptor for 3D shape is a key component of many computer vision applications. Although existing learning-based approaches can achieve good performance in some specific benchmarks, they usually fail to learn enough information from shapes with different shape types and structures (e.g., spatial resolution, connectivity, transformations, etc.). Focusing on this issue, in this paper, we present a more discriminative local descriptor for deformable 3D shapes with incompatible structures. Based on the spectral embedding using the Laplace-Beltrami framework on the surface, we first construct a novel local spectral feature which shows great resilience to changes in mesh resolution, triangulation, and transformation. Then the multi-scale local spectral features around each vertex are encoded into a `geometry image', called a vertex spectral image, in a very compact way. Such vertex spectral images can then be used to efficiently learn local descriptors with a triplet neural network. Finally, for training and evaluation, we present a new benchmark dataset by extending the widely used FAUST dataset. We utilize a remeshing approach to generate modified shapes with different structures. We evaluate the proposed approach thoroughly and make an extensive comparison to demonstrate that our approach outperforms recent state-of-the-art methods on this benchmark.
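A rough sketch of training local descriptors from vertex spectral images with a triplet objective, as the abstract describes. The small CNN, embedding size, and margin are placeholder assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralDescriptorNet(nn.Module):
    """Maps a single-channel vertex spectral image to an L2-normalised descriptor."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, out_dim)

    def forward(self, x):                              # x: (B, 1, H, W)
        return F.normalize(self.fc(self.conv(x).flatten(1)), dim=1)

def descriptor_triplet_loss(net, anchor, positive, negative, margin=0.2):
    # anchor/positive come from corresponding vertices, negative from a different one
    return F.triplet_margin_loss(net(anchor), net(positive), net(negative),
                                 margin=margin)
```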
[dataset, recognition, construct] [local, shape, vertex, descriptor, computer, matching, rotation, geodesic, vision, correspondence, mesh, laplacebeltrami, surface, point, robust, approach, ldgi, faust, dirichlet, hks, equation, geometric, discrete, michael, incompatible, rops, pattern, geometry, intrinsic, radius, osd, triangulation, dense, analysis, monet] [spectral, proposed, resolution, ieee, image, figure, method, based, demonstrate, patch, quality, conference, presented, frequency] [scale, deep, network, original, performance, neural, number, signature, convolutional, energy, cot] [generate, evaluation, model, correct] [feature, spatial, three] [learning, function, shot, training, learn, invariant, discriminative, novel, learned, domain, large]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Yiqun and Guo, Jianwei and Yan, Dong-Ming and Wang, Kai and Zhang, Xiaopeng},
  title = {A Robust Local Spectral Descriptor for Matching Non-Rigid Shapes With Incompatible Shape Structures},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Context and Attribute Grounded Dense Captioning
Guojun Yin, Lu Sheng, Bin Liu, Nenghai Yu, Xiaogang Wang, Jing Shao


Dense captioning aims at simultaneously localizing semantic regions and describing these regions-of-interest (ROIs) with short phrases or sentences in natural language. Previous studies have shown remarkable progress, but they are often vulnerable to the aperture problem that a caption generated by the features inside one ROI lacks contextual coherence with its surrounding context in the input image. In this work, we investigate contextual reasoning based on multi-scale message propagations from the neighboring contents to the target ROIs. To this end, we design a novel end-to-end context and attribute grounded dense captioning framework consisting of 1) a contextual visual mining module and 2) a multi-level attribute grounded description generation module. Knowing that captions often co-occur with the linguistic attributes (such as who, what and where), we also incorporate an auxiliary supervision from hierarchical linguistic attributes to augment the distinctiveness of the learned captions. Extensive experiments and ablation studies on the Visual Genome dataset demonstrate the superiority of the proposed model in comparison to state-of-the-art methods.
[cue, lstm, graph, individual, previous, passing, sequential] [dense, local, ground, truth, accurate] [attribute, image, proposed, based, input, generator, figure] [compared, structure, table, network, adaptive] [captioning, linguistic, visual, grounded, description, generated, generation, caption, message, young, model, attention, black, white, language, standing, meteor, natural, generate, rich, evaluation] [contextual, neighboring, feature, global, region, context, semantic, map, object, integration, stage, hierarchical, coarse, rpn, localization, predicted, surfer, supervision, refined, person, roi, module, bounding, improve, detection, final, cci] [target, similarity, extractor, loss, auxiliary, learning, training, novel]
@InProceedings{Yin_2019_CVPR,
  author = {Yin, Guojun and Sheng, Lu and Liu, Bin and Yu, Nenghai and Wang, Xiaogang and Shao, Jing},
  title = {Context and Attribute Grounded Dense Captioning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Spot and Learn: A Maximum-Entropy Patch Sampler for Few-Shot Image Classification
Wen-Hsuan Chu, Yu-Jhe Li, Jing-Cheng Chang, Yu-Chiang Frank Wang


Few-shot learning (FSL) requires one to learn from object categories with a small amount of training data (as novel classes), while the remaining categories (as base classes) contain a sufficient amount of data for training. It is often desirable to transfer knowledge from the base classes and derive dominant features efficiently for the novel samples. In this work, we propose a sampling method that de-correlates an image based on maximum entropy reinforcement learning, and extracts varying sequences of patches on every forward-pass with discriminative information observed. This can be viewed as a form of "learned" data augmentation in the sense that we search for different sequences of patches within an image and perform classification with aggregation of the extracted features, resulting in improved FSL performance. In addition, our positive and negative sampling policies along with a newly defined reward function would favorably improve the effectiveness of our model. Our experiments on two benchmark datasets confirm the effectiveness of our framework and its superiority over recent FSL approaches.
[action, state, recognition, framework, sequence, current, trajectory] [note, international, computer, form, vision, pattern, algorithm] [image, input, patch, conference, proposed, ieee, based, extracted, variety, amount, method, produce] [neural, deep, performance, denotes, gain, increase, output, connected, number, processing] [policy, model, reinforcement, encoder, attention, visual, sampled, reward, evaluation, correct, machine] [feature, context, predicted, voting, fully] [learning, sampling, entropy, training, maximum, label, novel, negative, soft, sample, sampler, data, positive, classifier, cosine, objective, class, classification, base, hard, function, datasets, labeled, extractor, select, miniimagenet, similarity, fsl, qsof, learn]
@InProceedings{Chu_2019_CVPR,
  author = {Chu, Wen-Hsuan and Li, Yu-Jhe and Chang, Jing-Cheng and Frank Wang, Yu-Chiang},
  title = {Spot and Learn: A Maximum-Entropy Patch Sampler for Few-Shot Image Classification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Interpreting CNNs via Decision Trees
Quanshi Zhang, Yu Yang, Haotian Ma, Ying Nian Wu


This paper aims to quantitatively explain the rationales of each prediction that is made by a pre-trained convolutional neural network (CNN). We propose to learn a decision tree, which clarifies the specific reason for each prediction made by the CNN at the semantic level. I.e., the decision tree decomposes feature representations in high conv-layers of the CNN into elementary concepts of object parts. In this way, the decision tree tells people which object parts activate which filters for the prediction and how much each object part contributes to the prediction score. Such semantic and quantitative explanations for CNN predictions have specific values beyond the traditional pixel-level analysis of CNNs. More specifically, our method mines all potential decision modes of the CNN, where each mode represents a typical case of how the CNN uses object parts for prediction. The decision tree organizes all potential decision modes in a coarse-to-fine manner to explain CNN predictions at different fine-grained levels. Experiments have demonstrated the effectiveness of the proposed method.
[prediction, dataset, second] [estimated, compute, equation, error] [input, image, based, disentangled, method, quantitatively, quantitative, zhang, figure, proposed] [filter, neural, top, layer, deep, denotes, network, cnns, activation, accuracy, convolutional, represents, receptive, ilsvrc, root, order, output] [decision, tree, contribution, explain, rationale, node, mode, represent, interpretable, visual, parse, interpreting, semantically, model, summarize] [cnn, object, feature, semantic, voc, visualization, contribute, inside, average, activated] [learning, specific, classification, learn, loss, knowledge, learned, distribution, positive, set, log, generic, training]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Quanshi and Yang, Yu and Ma, Haotian and Nian Wu, Ying},
  title = {Interpreting CNNs via Decision Trees},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning
Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, In So Kweon


Our goal in this work is to train an image captioning model that generates more dense and informative captions. We introduce "relational captioning," a novel image captioning task which aims to generate multiple captions with respect to relational information between objects in an image. Relational captioning is a framework that is advantageous in both diversity and amount of information, leading to image understanding based on relationships. Part-of-speech (POS, i.e. subject-object-predicate categories) tags can be assigned to every English word. We leverage the POS as a prior to guide the correct sequence of words in a caption. To this end, we propose a multi-task triple-stream network (MTTSNet) which consists of three recurrent units for the respective POS and jointly performs POS prediction and captioning. We demonstrate more diverse and richer representations generated by the proposed model against several baselines and competing methods.
[lstm, graph, recognition, subject, dataset, work, framework, recurrent, fusion] [dense, scene, vision, computer, pattern, international, single, provide, matching] [image, conference, ieee, proposed, based, input, traditional, figure, comparison] [number, neural, performance, network, order, table, compared, compare] [relational, captioning, visual, relationship, captioner, man, model, caption, generate, diverse, generation, generated, wearing, language, word, black, richer, vrd, understanding, white, red, diversity, natural, pred, densecap, subj, obj, sentence, generates] [region, object, union, feature, detection, three, bounding, context, module, box, holistic, localization, fully, average, score] [learning, task, representation, loss, label, dog, classification, triplet, class]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Dong-Jin and Choi, Jinsoo and Oh, Tae-Hyun and So Kweon, In},
  title = {Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Modular Co-Attention Networks for Visual Question Answering
Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, Qi Tian


Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective `co-attention' model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the question-guided-attention of images jointly using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state-of-the-art. Our best single model delivers 70.63% overall accuracy on the test-dev set.
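Both building blocks of an MCA layer can be expressed with standard scaled dot-product attention: a self-attention unit attends within one modality, while the question-guided unit uses image features as queries and question features as keys/values. The sketch below is illustrative; the hidden size, head count, and residual/norm details are assumptions:

```python
import torch
import torch.nn as nn

class AttentionUnit(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, context=None):
        # self-attention when context is None, guided attention otherwise
        kv = x if context is None else context
        out, _ = self.attn(x, kv, kv)
        return self.norm(x + out)

# usage sketch:
#   q = AttentionUnit()(question_feats)                  # SA over question words
#   v = AttentionUnit()(image_feats, question_feats)     # question-guided attention
```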
[stacking, fusion, key, recognition, consists, dataset, outperforms, previous, lstm, modeling] [vision, computer, dense, international, zhou, pattern, depth] [image, figure, input, conference, ieee, proposed, jun, based] [deep, number, layer, unit, output, neural, processing, network, performance, table, compared, grant, basic, best] [question, attention, visual, model, mca, multimodal, attended, modular, vqa, mcan, word, mcaned, arxiv, preprint, answer, answering, understanding, language, coattention, textual] [feature, three, region, cascaded, object, ablation, val, improve] [learning, learned, learn, set, representation, embeddings]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Zhou and Yu, Jun and Cui, Yuhao and Tao, Dacheng and Tian, Qi},
  title = {Deep Modular Co-Attention Networks for Visual Question Answering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Synthesizing Environment-Aware Activities via Activity Sketches
Yuan-Hong Liao, Xavier Puig, Marko Boben, Antonio Torralba, Sanja Fidler


In order to learn to perform activities from demonstrations or descriptions, agents need to distill what the essence of the given activity is, and how it can be adapted to new environments. In this work, we address the problem: environment-aware program generation. Given a visual demonstration or a description of an activity, we generate program sketches representing the essential instructions and propose a model to flesh these into full programs representing the actions needed to perform the activity under the presented environmental constraints. To this end, we build upon VirtualHome, to create a new dataset VirtualHome-Env, where we collect program sketches to represent activities and match programs with environments that can afford them. Furthermore, we construct a knowledge base to sample realistic environments and another knowledge base to seek out the programs under the sampled environments. Finally, we propose RNN-ResActGraph, a network that generates a program from a given sketch and an environment graph and tracks the changes in the environment induced by the program.
[graph, activity, perform, dataset, state, time, hidden, gru, predict, sit, work, extract, prediction, sequence, previous, propagation, action] [ground, truth, note, induced, predicts, computer] [conference, figure, change, collect, proposed] [table, neural, order, inspired, number, original] [program, environment, model, generated, agent, resactgraph, description, generation, visual, demonstration, food, generate, lcs, watch, machine, node, describe, arxiv, preprint, virtualhome, room, dec, instruction, simulator, washing, bedroom, living, goal, reason, execute, language] [object, sofa, inside, propose, predicted, semantic] [sketch, knowledge, task, set, learning, learn, closed, distribution, address]
@InProceedings{Liao_2019_CVPR,
  author = {Liao, Yuan-Hong and Puig, Xavier and Boben, Marko and Torralba, Antonio and Fidler, Sanja},
  title = {Synthesizing Environment-Aware Activities via Activity Sketches},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Self-Critical N-Step Training for Image Captioning
Junlong Gao, Shiqi Wang, Shanshe Wang, Siwei Ma, Wen Gao


Existing methods for image captioning are usually trained by cross entropy loss, which leads to exposure bias and the inconsistency between the optimizing function and evaluation metrics. Recently it has been shown that these two issues can be addressed by incorporating techniques from reinforcement learning, where one of the popular techniques is the advantage actor-critic algorithm that calculates per-token advantage by estimating state value with a parametrized estimator at the cost of introducing estimation bias. In this paper, we estimate state value without using a parametrized value estimator. With the properties of image captioning, namely, the deterministic state transition function and the sparse reward, state value is equivalent to its preceding state-action value, and we reformulate advantage function by simply replacing the former with the latter. Moreover, the reformulated advantage is extended to n-step, which can generally increase the absolute value of the mean of reformulated advantage while lowering variance. Then two kinds of rollout are adopted to estimate state-action value, which we call self-critical n-step training. Empirically we find that our method can obtain better performance compared to the state-of-the-art methods that use the sequence level advantage and parametrized estimator respectively on the widely used MSCOCO benchmark.
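The key step, estimating the per-token advantage without a parametrized value network, can be sketched as a REINFORCE-style surrogate whose baseline comes from rollout-estimated state-action values rather than a learned critic. The reward computation and the rollout procedure are placeholders here:

```python
import torch

def self_critical_loss(log_probs, sampled_reward, rollout_baselines):
    """log_probs: (T,) log-probabilities of the sampled caption tokens.
    sampled_reward: scalar metric reward (e.g. CIDEr) of the sampled caption.
    rollout_baselines: (T,) per-step baselines estimated by Monte Carlo rollouts."""
    advantage = sampled_reward - rollout_baselines       # per-token advantage
    return -(advantage.detach() * log_probs).sum()       # policy-gradient surrogate
```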
[sequence, state, action, transition, time, trajectory, predict, rnn, deterministic] [monte, estimate, computer, absolute, algorithm, pattern, estimation, vision, estimator, exposure, special, estimating] [image, proposed, conference, method, ieee] [better, performance, variance, neural, gradient, compared, increase, larger, table] [advantage, reformulated, model, carlo, token, reward, captioning, rollouts, reinforcement, rollout, evaluation, preceding, xent, parametrized, generated, attention, cider, policy, machine, visual, step, scst, generating, reformulate, lowering, mscoco, introduced, probability, agent, expected] [level, multinomial, baseline, adopted] [function, training, trained, cross, entropy, loss, learning, large, set, strategy, bias]
@InProceedings{Gao_2019_CVPR,
  author = {Gao, Junlong and Wang, Shiqi and Wang, Shanshe and Ma, Siwei and Gao, Wen},
  title = {Self-Critical N-Step Training for Image Captioning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Target Embodied Question Answering
Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra


Embodied Question Answering (EQA) is a relatively new task where an agent is asked to answer questions about its environment from egocentric perception. EQA as introduced in [8] makes the fundamental assumption that every question, e.g., "what color is the car?", has exactly one target ("car") being inquired about. This assumption puts a direct limitation on the abilities of the agent. We present a generalization of EQA -- Multi-Target EQA (MT-EQA). Specifically, we study questions that have multiple targets in them, such as "Is the dresser in the bedroom bigger than the oven in the kitchen?", where the agent has to navigate to multiple locations ("dresser in bedroom", "oven in kitchen") and perform comparative reasoning ("dresser" bigger than "oven") before it can answer a question. Such questions require the development of entirely new modules or components in the agent. To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module. The program generator converts the given question into sequential executable sub-programs; the navigator guides the agent to multiple locations pertinent to the navigation-related sub-programs; and the controller learns to select relevant observations along its path. These observations are then fed to the VQA module to predict the answer. We perform detailed analysis for each of the model components and show that our joint model can outperform previous methods and strong baselines by a significant margin.
[multiple, lstm, action, predict, dataset, georgia, consists, work, hidden, perform, sequence] [compute, single] [color, row, generator, figure, comparison, attribute, based] [compare, controller, size, table, accuracy, full, neural, performance, better, order, equal, deep, ratio, original] [question, vqa, nav, navigation, room, model, eqa, agent, program, embodied, visual, answer, navigator, modular, ctrl, answering, sink, environment, query, inroom, xroom, type, shortest, reasoning, bedroom, dressing, find, requires, compositional, unique, path, bathtub, arxiv, preprint, consider, closer] [object, cnn, module, iou, final, bigger, location, comparing] [target, select, task, learning, training, comparative]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Licheng and Chen, Xinlei and Gkioxari, Georgia and Bansal, Mohit and Berg, Tamara L. and Batra, Dhruv},
  title = {Multi-Target Embodied Question Answering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Visual Question Answering as Reading Comprehension
Hui Li, Peng Wang, Chunhua Shen, Anton van den Hengel


Visual question answering (VQA) demands simultaneous comprehension of both the image visual content and natural language questions. In some cases, the reasoning needs the help of common sense or general knowledge which usually appear in the form of text. Current methods jointly embed both the visual information and the textual feature into the same space. Nevertheless, how to model the complex interactions between the two different modalities is not an easy task. Rather than struggling with multimodal feature fusion, in this paper, we propose to unify all the input information by natural language so as to convert VQA into a machine reading comprehension problem. With this transformation, our method not only can tackle VQA datasets that focus on observation based questions, but can also be naturally extended to handle knowledge-based VQA which requires exploring a large-scale external knowledge base. It is a step towards being able to exploit large volumes of text and natural language processing techniques to address the VQA problem. Two types of models are proposed to deal with open-ended VQA and multiple-choice VQA respectively. We evaluate our models on three VQA benchmarks. The comparable performance with the state-of-the-art demonstrates the effectiveness of the proposed method.
[work, dataset, peng, joint, predict] [dense, qanet, general] [image, method, based, proposed, figure, ieee, input] [performance, layer, accuracy, higher, output, better, block, table, neural, bilinear, pooling, normalization, applying] [vqa, answer, visual, question, model, text, natural, language, attention, answering, textual, multimodal, supporting, reading, fvqa, external, belongs, tennis, van, memory, description, candidate, caption, comprehension, anton, den, reasoning, paragraph, infer, encoder, ball, chunhua, probability, modelencoder, correct, white, man, evaluate] [semantic, region, context, predicted, feature, category, propose, three, extra, level] [knowledge, embedding, training, learning, test, cat, base]
@InProceedings{Li_2019_CVPR,
  author = {Li, Hui and Wang, Peng and Shen, Chunhua and van den Hengel, Anton},
  title = {Visual Question Answering as Reading Comprehension},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
StoryGAN: A Sequential Conditional GAN for Story Visualization
Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, Jianfeng Gao


In this work, we propose a new task called Story Visualization. Given a multi-sentence paragraph, the story is visualized by generating a sequence of images, one for each sentence. In contrast to video generation, story visualization focuses less on the continuity in generated images (frames), but more on the global consistency across dynamic scenes and characters -- a challenge that has not been addressed by any single-image or video generation methods. Therefore, we propose a new story-to-image-sequence generation model, StoryGAN, based on the sequential conditional GAN framework. Our model is unique in that it consists of a deep Context Encoder that dynamically tracks the story flow, and two discriminators at the story and image levels, to enhance the image quality and the consistency of the generated sequences. To evaluate the model, we modified existing datasets to create the CLEVR-SV and Pororo-SV datasets. Empirically, StoryGAN outperformed state-of-the-art models in image quality, contextual consistency metrics, and human evaluation.
[video, sequence, gru, dataset, human, coherent, motion, time, hidden, sequential, dynamic, fish, second, static, recurrent, current, eddy, work] [consistent, single, scene, ground, truth, initial, cube, local] [image, figure, input, generator, consistency, real, conditional, generative, quality, proposed, based] [output, cell, filter, deep, standard, table, neural, compared, layer, network, full, structure] [story, storygan, generation, generated, encoder, sentence, pororo, discriminator, text, vector, crong, arxiv, preprint, imagegan, fishing, generate, svc, visual, encoded, gan, adversarial, svfn, model, step, description, encodes, generating, character, ensure] [context, visualization, global, contextual, feature] [task, training, existing, update, learning]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yitong and Gan, Zhe and Shen, Yelong and Liu, Jingjing and Cheng, Yu and Wu, Yuexin and Carin, Lawrence and Carlson, David and Gao, Jianfeng},
  title = {StoryGAN: A Sequential Conditional GAN for Story Visualization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Noise-Aware Unsupervised Deep Lidar-Stereo Fusion
Xuelian Cheng, Yiran Zhong, Yuchao Dai, Pan Ji, Hongdong Li


In this paper, we present LidarStereoNet, the first unsupervised Lidar-stereo fusion network, which can be trained in an end-to-end manner without the need for ground truth depth maps. By introducing a novel "Feedback Loop" to connect the network input with output, LidarStereoNet can tackle both noisy Lidar points and misalignment between sensors, which have been ignored in existing Lidar-stereo fusion work. In addition, we propose to incorporate the piecewise planar model into the network learning to further constrain depths to conform to the underlying 3D geometry. Extensive quantitative and qualitative evaluations on both real and synthetic datasets demonstrate the superiority of our method, which outperforms state-of-the-art stereo matching, depth completion and Lidar-Stereo fusion approaches significantly.
[fusion, dataset, outperforms, warping, moving] [lidar, stereo, depth, disparity, matching, kitti, plane, error, dense, problem, international, lidarstereonet, accurate, fitting, loop, ground, left, truth, defined, photometric, corresponding, completion, robotics, dealing, sinet, colour, note] [method, input, image, ieee, feedback, proposed, handle, conference, raw, based, figure, consistency, noise, quantitative] [sparse, network, performance, deep, core, architecture, sparsity, convolutional, convolution, compared, compare, regularization, highly, better] [model, type] [feature, misalignment, extraction, cnn, ablation, propose] [loss, training, unsupervised, noisy, cleaned, existing, learning, novel, large, probabilistic, function, data, trained]
@InProceedings{Cheng_2019_CVPR,
  author = {Cheng, Xuelian and Zhong, Yiran and Dai, Yuchao and Ji, Pan and Li, Hongdong},
  title = {Noise-Aware Unsupervised Deep Lidar-Stereo Fusion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Versatile Multiple Choice Learning and Its Application to Vision Computing
Kai Tian, Yi Xu, Shuigeng Zhou, Jihong Guan


Most existing ensemble methods aim to train the underlying embedded models independently and simply aggregate their final outputs via averaging or weighted voting. As many prediction tasks contain uncertainty, most of these ensemble methods just reduce variance of the predictions without considering the collaborations among the ensembles. Different from these ensemble methods, multiple choice learning (MCL) methods exploit the cooperation among all the embedded models to generate multiple diverse hypotheses. In this paper, a new MCL method, called vMCL (the abbreviation of versatile Multiple Choice Learning), is developed to extend the application scenarios of MCL methods by ensembling deep neural networks. Our vMCL method keeps the advantage of existing MCL methods while overcoming their major drawback, and thus achieves better performance. The novelty of our vMCL lies in three aspects: (1) a choice network is designed to learn the confidence level of each specialist, which can provide the best prediction based on multiple hypotheses; (2) a hinge loss is introduced to alleviate the overconfidence issue in MCL settings; (3) it is easy to implement and can be trained in an end-to-end manner, which is a very attractive feature for many real-world applications. Experiments on image classification and image segmentation tasks show that vMCL outperforms the existing state-of-the-art MCL methods.
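The "oracle" assignment that MCL-style ensembles build on is simple to state in code: for every example, only the specialist with the lowest loss receives the classification gradient. The sketch below shows only this baseline step; the choice network and hinge term that distinguish vMCL are omitted:

```python
import torch
import torch.nn.functional as F

def mcl_oracle_loss(logits_per_model, targets):
    """logits_per_model: list of (N, C) predictions, one tensor per ensemble member."""
    losses = torch.stack([F.cross_entropy(l, targets, reduction='none')
                          for l in logits_per_model])    # (M, N) per-model losses
    best = losses.min(dim=0).values                      # best specialist per sample
    return best.mean()                                   # gradient flows only to winners
```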
[multiple, prediction, outperforms, dataset, performs] [error, computer, algorithm, confidence, vision, problem, provide, accurate] [image, high, method, conference, proposed, ieee, shanghai, figure] [network, rate, neural, deep, better, best, size, performance, accuracy, achieves, output, convolutional, aggregate, stochastic, architecture, number] [choice, model, diverse, diversity, generate, probability, machine, example, indicates, embedded, evaluate, specialization] [confident, final, segmentation, three] [vmcl, oracle, mcl, learning, smcl, loss, ensemble, training, classification, cmcl, popt, specialist, overconfidence, existing, set, test, distribution, svhn, train, hinge, data, shared, predictive, uniform, trained, overfitting, large, min]
@InProceedings{Tian_2019_CVPR,
  author = {Tian, Kai and Xu, Yi and Zhou, Shuigeng and Guan, Jihong},
  title = {Versatile Multiple Choice Learning and Its Application to Vision Computing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
EV-Gait: Event-Based Robust Gait Recognition Using Dynamic Vision Sensors
Yanxiang Wang, Bowen Du, Yiran Shen, Kai Wu, Guangrong Zhao, Jianguo Sun, Hongkai Wen


In this paper, we introduce a new type of sensing modality, the Dynamic Vision Sensors (Event Cameras), for the task of gait recognition. Compared with the traditional RGB sensors, the event cameras have many unique advantages such as ultra-low resource consumption, high temporal resolution and much larger dynamic range. However, those cameras only produce noisy and asynchronous events of intensity changes rather than frames, so conventional vision-based gait recognition algorithms can't be directly applied. To address this, we propose a new Event-based Gait Recognition (EV-Gait) approach, which exploits motion consistency to effectively remove noise, and uses a deep neural network to recognise gait from the event streams. To evaluate the performance of EV-Gait, we collect two event-based gait datasets, one from real-world experiments and the other by converting the publicly available RGB gait recognition benchmark CASIA-B. Extensive experiments show that EV-Gait achieves nearly 96% recognition accuracy in the real-world settings, while on the CASIA-B benchmark it achieves comparable performance with state-of-the-art RGB-based gait recognition approaches.
[event, gait, recognition, dvs, cancellation, motion, dynamic, stream, moving, work, liu, visualisation, asynchronous, led, temporal, walking, time, human, padala, dataset, dot, static, khoda, tobi] [rgb, vision, sensor, lighting, approach, international, pattern, camera, analysis, viewing, computer, plane, technique] [noise, ieee, proposed, based, figure, conference, captured, image, ftl, intensity, pixel, competing, caused, resblock, consistency, background, collected] [accuracy, performance, deep, neural, network, convolutional, number, relu, processing, achieve, low] [machine, unique, considered, converted] [object, benchmark, spatial] [data, training, set, effectively, classification, representation, learning]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Yanxiang and Du, Bowen and Shen, Yiran and Wu, Kai and Zhao, Guangrong and Sun, Jianguo and Wen, Hongkai},
  title = {EV-Gait: Event-Based Robust Gait Recognition Using Dynamic Vision Sensors},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ToothNet: Automatic Tooth Instance Segmentation and Identification From Cone Beam CT Images
Zhiming Cui, Changjian Li, Wenping Wang


This paper proposes a method that uses deep convolutional neural networks to achieve automatic and accurate tooth instance segmentation and identification from CBCT (cone beam CT) images for digital dentistry. The core of our method is a two-stage network. In the first stage, an edge map is extracted from the input CBCT image to enhance image contrast along shape boundaries. Then this edge map and the input images are passed to the second stage. In the second stage, we build our network upon the 3D region proposal network (RPN) with a novel learned-similarity matrix to help efficiently remove redundant proposals, speed up training and save GPU memory. To resolve the ambiguity in the identification task, we encode teeth spatial relationships as an additional feature input in the identification task, which helps to remarkably improve the identification accuracy. Our evaluation, comparison and comprehensive ablation studies demonstrate that our method produces accurate instance segmentation and identification results automatically and outperforms the state-of-the-art approaches. To the best of our knowledge, our method is the first to use neural networks to achieve automatic tooth segmentation and identification from CBCT images.
[second, dataset, individual] [matrix, computer, accurate, shape, ground, truth, vision, pattern, defined, journal, international, contrast, volumetric, condition, volume] [method, image, comparison, conference, ieee, component, based, row, input, figure, proposed, digital] [network, deep, accuracy, gpu, convolutional, neural, lower, performance, higher] [memory, automatic, beam] [tooth, segmentation, teeth, cbct, identification, edge, map, spatial, object, detection, relation, instance, medical, roi, three, box, region, proposal, feature, bounding, bite, propose, piou, segment, boundary, rpn, mask, fully, level, iou, kaiming, ross] [similarity, training, learning, set, data, sij, classification, loss, supervised, train, base, novel, label, open]
@InProceedings{Cui_2019_CVPR,
  author = {Cui, Zhiming and Li, Changjian and Wang, Wenping},
  title = {ToothNet: Automatic Tooth Instance Segmentation and Identification From Cone Beam CT Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Modularized Textual Grounding for Counterfactual Resilience
Zhiyuan Fang, Shu Kong, Charless Fowlkes, Yezhou Yang


Computer Vision applications often require a textual grounding module with precision, interpretability, and resilience to counterfactual inputs/queries. To achieve high grounding precision, current textual grounding methods heavily rely on large-scale training data with manual annotations at the pixel level. Such annotations are expensive to obtain and thus severely narrow the model's scope of real-world applications. Moreover, most of these methods sacrifice interpretability, generalizability, and they neglect the importance of being resilient to counterfactual inputs. To address these issues, we propose a visual grounding system which is 1) end-to-end trainable in a weakly supervised fashion with only image-level annotations, and 2) counterfactually resilient owing to the modular design. Specifically, we decompose textual descriptions into three levels: entity, semantic attribute, color information, and perform compositional grounding progressively. We validate our model through a series of experiments and demonstrate its improvement over the state-of-the-art methods. In particular, our model's performance not only surpasses other weakly/un-supervised methods and even approaches the strongly supervised ones, but also is interpretable for decision making and performs much better in face of counterfactual classes than all the others.
[dataset, video, current, perform, work] [computer, vision, pattern, international, note, practical, manual] [conference, image, ieee, color, attribute, figure, based, dictionary, method, result] [better, neural, design, pooling, performance, bilinear, processing, search, convolutional] [grounding, textual, system, counterfactual, modular, entity, model, visual, attention, captioning, natural, language, resilience, interpretability, man, arxiv, preprint, resilient, phrase, progressively, referring, find] [module, semantic, object, weakly, final, feature, three, person, propose, bounding, fully, european, scoring, segmentation, region, adopt] [supervised, training, learning, train, data, classification, novel, function, unsupervised, trained, testing, existing]
@InProceedings{Fang_2019_CVPR,
  author = {Fang, Zhiyuan and Kong, Shu and Fowlkes, Charless and Yang, Yezhou},
  title = {Modularized Textual Grounding for Counterfactual Resilience},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
L3-Net: Towards Learning Based LiDAR Localization for Autonomous Driving
Weixin Lu, Yao Zhou, Guowei Wan, Shenhua Hou, Shiyu Song


We present L3-Net - a novel learning-based LiDAR localization system that achieves centimeter-level localization accuracy, comparable to prior state-of-the-art systems with hand-crafted pipelines. Rather than relying on these hand-crafted modules, we innovatively implement the use of various deep neural network structures to establish a learning-based approach. L3-Net learns local descriptors specifically optimized for matching in different real-world driving scenarios. 3D convolutions over a cost volume built in the solution space significantly boosts the localization accuracy. RNNs are demonstrated to be effective in modeling the vehicle's dynamics, yielding better temporal smoothness and accuracy. We comprehensively validate the effectiveness of our approach using freshly collected datasets. Multiple trials of repetitive data collection over the same road and areas make our dataset ideal for testing localization systems. The SunnyvaleBigLoop sequences, with a year's time interval between the collected mapping and testing data, made it quite challenging, but the low localization error of our method in these datasets demonstrates its maturity for real industrial implementation.
[driving, online, rnns, dataset, recognition, temporal, multiple, lstm, time, extract, yaw] [point, lidar, volume, matching, cloud, autonomous, keypoint, vision, local, computer, pattern, international, solution, keypoints, descriptor, robotics, wan, levinson, handcrafted, geometric, registration, rms, pose, pointnet, ground, automation, june, robust, accurate, corresponding, yielding, estimated, note] [ieee, conference, based, proposed, intensity, method, mapping, figure, input, traditional] [cost, cnns, deep, network, performance, table, neural, structure, better, compared, accuracy, regularization] [probability, system] [localization, offset, feature, map, vehicle, predicted, road, urban, global, including] [learning, data, training, testing, space, distance, set]
@InProceedings{Lu_2019_CVPR,
  author = {Lu, Weixin and Zhou, Yao and Wan, Guowei and Hou, Shenhua and Song, Shiyu},
  title = {L3-Net: Towards Learning Based LiDAR Localization for Autonomous Driving},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Panoptic Feature Pyramid Networks
Alexander Kirillov, Ross Girshick, Kaiming He, Piotr Dollar


The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, top-performing method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.
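The semantic segmentation branch described above can be approximated by upsampling every FPN level to a common 1/4-scale resolution, summing them, and predicting per-pixel class logits. In the sketch below a single shared 3x3 conv stands in for the paper's per-level conv/upsample stages, and channel counts are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticFPNBranch(nn.Module):
    def __init__(self, in_channels=256, mid_channels=128, num_classes=54):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
        self.predict = nn.Conv2d(mid_channels, num_classes, 1)

    def forward(self, fpn_levels, target_size):
        # fpn_levels: list of feature maps (e.g. P2..P5) at decreasing resolution
        merged = 0
        for feat in fpn_levels:
            x = F.relu(self.reduce(feat))
            x = F.interpolate(x, size=target_size, mode='bilinear',
                              align_corners=False)
            merged = merged + x                          # element-wise sum of levels
        return self.predict(merged)                      # (B, num_classes, H/4, W/4)
```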
[joint, perform, start, work, predict, recognition, outperforms, version] [single, approach, scene] [figure, image, separate, resolution, method, demonstrate, high] [network, table, accuracy, compare, convolutional, output, dilated, performance, architecture, design, effective, top, deep, convolution, original, scale, popular, lightweight, standard] [simple, adding, decoder, alexander, serve, goal, model, strong, memory] [semantic, segmentation, fpn, panoptic, instance, branch, mask, coco, stuff, baseline, feature, object, backbone, pyramid, thing, miou, ross, kaiming, piotr, pqst, pqth, doll, fully, detection, roughly, combine, foundation, including] [task, training, loss, class, learning, train, shared]
@InProceedings{Kirillov_2019_CVPR,
  author = {Kirillov, Alexander and Girshick, Ross and He, Kaiming and Dollar, Piotr},
  title = {Panoptic Feature Pyramid Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Mask Scoring R-CNN
Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, Xinggang Wang


Letting a deep network be aware of the quality of its own predictions is an interesting yet important problem. In the task of instance segmentation, the confidence of instance classification is used as mask quality score in most instance segmentation frameworks. However, the mask quality, quantified as the IoU between the instance mask and its ground truth, is usually not well correlated with classification score. In this paper, we study this problem and propose Mask Scoring R-CNN which contains a network block to learn the quality of the predicted instance masks. The proposed network block takes the instance feature and the corresponding predicted mask together to regress the mask IoU. The mask scoring strategy calibrates the misalignment between mask quality and mask score, and improves instance segmentation performance by prioritizing more accurate mask predictions during COCO AP evaluation. By extensive evaluations on the COCO dataset, Mask Scoring R-CNN brings consistent and noticeable gain with different models and outperforms the state-of-the-art Mask R-CNN. We hope our simple and effective approach will provide a new direction for improving instance segmentation. The source code of our method is available at https://github.com/zjhuang22/maskscoring_rcnn.
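The calibration itself is a one-liner: the score used to rank masks is the classification confidence multiplied by the IoU predicted by the MaskIoU head (the head architecture is omitted in this sketch):

```python
def calibrated_mask_score(cls_score: float, predicted_mask_iou: float) -> float:
    """Rank instances by classification confidence weighted by predicted mask quality."""
    return cls_score * predicted_mask_iou
```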
[prediction, work, report, framework, predict, hypothesis, follow] [ground, computer, truth, pattern, vision, corresponding, confidence, accurate, good, regressing] [conference, method, quality, ieee, high, input, proposed, based, image, study, figure] [network, table, performance, deep, validation, convolutional, low, design, neural, larger, denotes] [arxiv, preprint] [mask, maskiou, instance, score, segmentation, predicted, head, detection, roi, object, scoring, iou, fpn, backbone, feature, coco, box, semantic, bounding, proposal, propose, apm, faster, final, apb, fully, regression, map, category, aware, including, localization, baseline] [classification, target, training, learning, setting, positive, class, learn, set]
@InProceedings{Huang_2019_CVPR,
  author = {Huang, Zhaojin and Huang, Lichao and Gong, Yongchao and Huang, Chang and Wang, Xinggang},
  title = {Mask Scoring R-CNN},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Reasoning-RCNN: Unifying Adaptive Global Reasoning Into Large-Scale Object Detection
Hang Xu, Chenhan Jiang, Xiaodan Liang, Liang Lin, Zhenguo Li


In this paper, we address the large-scale object detection problem with thousands of categories, which poses severe challenges due to long-tail data distributions, heavy occlusions, and class ambiguities. However, the dominant object detection paradigm is limited by treating each object region separately without considering crucial semantic dependencies among objects. In this work, we introduce a novel Reasoning-RCNN to endow any detection networks the capability of adaptive global reasoning over all object regions by exploiting diverse human commonsense knowledge. Instead of only propagating the visual features on the image directly, we evolve the high-level semantic representations of all categories globally to avoid distracted or poor visual features in the image. Specifically, built on feature representations of basic detection network, the proposed network first generates a global semantic pool by collecting the weights of previous classification layer for each category, and then adaptively enhances each object features via attending different semantic contexts in the global semantic pool. Rather than propagating information from all semantic information that may be noisy, our adaptive global reasoning automatically discovers most relative categories for feature evolving. Our Reasoning-RCNN is light-weight and flexible enough to enhance any detection backbone networks, and extensible for integrating any knowledge resources. Solid experiments on object detection benchmarks show the superiority of our Reasoning-RCNN, e.g. achieving around 16% improvement on VisualGenome, 37% on ADE in terms of mAP and 15% improvement on COCO.
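One way to picture the adaptive global reasoning step: the previous classifier's weight matrix acts as a global semantic pool, each region feature attends over its per-category vectors, and the attended context is fused back into the region feature. The fusion by concatenation and a linear layer below is an assumption, not the exact paper design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalReasoning(nn.Module):
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.query = nn.Linear(feat_dim, feat_dim)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, region_feats, classifier_weight):
        # region_feats: (R, D) per-region features
        # classifier_weight: (C, D) global semantic pool from the previous classifier
        attn = F.softmax(self.query(region_feats) @ classifier_weight.t(), dim=-1)
        context = attn @ classifier_weight               # (R, D) attended semantics
        return F.relu(self.fuse(torch.cat([region_feats, context], dim=-1)))
```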
[graph, human, previous, ade, recognition, propagate, current] [problem, note, globally, relative] [method, image, figure, proposed, prior, mapping, attribute, comparison] [adaptive, performance, layer, building, table, network, original, performed, capability] [reasoning, visual, attention, relationship, model, commonsense, common, consider, frequent, memory] [global, semantic, object, detection, feature, pool, enhanced, region, car, category, pascal, relation, baseline, voc, fpn, map, person, road, spatial, heavy, propagating, average, faster, coco, improve, proposal, backbone] [knowledge, classification, learning, classifier, class, training, base, task, existing, shared, data, rare, trained]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Hang and Jiang, Chenhan and Liang, Xiaodan and Lin, Liang and Li, Zhenguo},
  title = {Reasoning-RCNN: Unifying Adaptive Global Reasoning Into Large-Scale Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Cross-Modality Personalization for Retrieval
Nils Murrugarra-Llerena, Adriana Kovashka


Existing captioning and gaze prediction approaches do not consider the multiple facets of personality that affect how a viewer extracts meaning from an image. While there are methods that consider personalized captioning, they do not consider personalized perception across modalities, i.e. how a person's way of looking at an image (gaze) affects the way they describe it (captioning). In this work, we propose a model for cross-modality personalized retrieval. In addition to modeling gaze and captions, we also explicitly model the personality of the users providing these samples. We incorporate constraints that encourage gaze and caption samples on the same image to be close in a learned space; we refer to this as content modeling. We also model style: we encourage samples provided by the same user to be close in a separate embedding space, regardless of the image on which they were provided. To leverage the complementary information that content and style constraints provide, we combine the embeddings from both networks. We show that our combined embeddings achieve better performance than existing approaches for cross-modal retrieval.
[recognition, joint, second, privileged, dataset, multiple, work, time, modeling, predict, avg, combined] [vision, computer, pattern, approach, denote, international, single] [image, gaze, content, user, style, conference, method, result, separate, described] [accuracy, network, table, performance, best, achieve] [personality, text, provided, visual, consider, describe, viewer, model, buy, modality, personalized, caption, captioning, relationship, ensure, attention, tyle, meaning, perception, family, retrieve, unique, machine, adriana, perceive] [three, average, map, feature, car, person] [learning, base, learn, embedding, rank, task, retrieval, training, embeddings, data, test, observe]
@InProceedings{Murrugarra-Llerena_2019_CVPR,
  author = {Murrugarra-Llerena, Nils and Kovashka, Adriana},
  title = {Cross-Modality Personalization for Retrieval},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Composing Text and Image for Image Retrieval - an Empirical Odyssey
Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays


In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image. For example, we may present an image of the Eiffel Tower, and ask the system to find images which are visually similar, but are modified in small ways, such as being taken at nighttime instead of during the day. To tackle this task, we embed the query (reference image plus modification text) and the target (images). The encoding function of the image text query learns a representation such that the similarity with the target image representation is high iff it is a "positive match". We propose a new way to combine image and text through a residual connection that is designed for this retrieval task. We show this outperforms existing approaches on 3 different datasets, namely Fashion-200k, MIT-States and a new synthetic dataset we create based on CLEVR. We also show that our approach can be used to perform image classification with compositionally novel labels, and we outperform previous methods on MIT-States on this task.
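A minimal sketch of the kind of gated-residual composition of image and text features the abstract describes, assuming both features share one dimension; layer names and sizes are illustrative rather than the authors' exact architecture.

import torch
import torch.nn as nn

class GatedResidualComposer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.residual = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))

    def forward(self, img_feat, text_feat):
        x = torch.cat([img_feat, text_feat], dim=-1)
        # Keep a gated copy of the reference image feature and add a learned
        # residual encoding the requested modification.
        return self.gate(x) * img_feat + self.residual(x)

The composed query would then be scored against candidate target images with a similarity such as the cosine, trained with a triplet or softmax cross-entropy loss as is common for retrieval.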
[dataset, perform, work, outperforms, gating, state, film, previous] [approach, problem, case, single, vision] [image, composition, method, attribute, proposed, study, input, figure, difference, modify, based, reference, user] [residual, layer, deep, performance, table, search, applied, neural, network, compare, size, convolutional, concatenation, parameter, best] [text, query, modification, visual, question, tirg, create, simple, compositional, relationship, generate, example, gated, attention, multimodal, vqa, model, red, arxiv, preprint, system, find, concept, answering, vector] [feature, object, combine, three, propose] [retrieval, learning, target, product, set, classification, loss, training, task, existing, test, similarity, combination, metric, learn, fashion, embedding, representation, triplet, hashing]
@InProceedings{Vo_2019_CVPR,
  author = {Vo, Nam and Jiang, Lu and Sun, Chen and Murphy, Kevin and Li, Li-Jia and Fei-Fei, Li and Hays, James},
  title = {Composing Text and Image for Image Retrieval - an Empirical Odyssey},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Arbitrary Shape Scene Text Detection With Adaptive Text Region Representation
Xiaobing Wang, Yingying Jiang, Zhenbo Luo, Cheng-Lin Liu, Hyunsoo Choi, Sungjin Kim


Scene text detection attracts much attention in computer vision, because it can be widely used in many applications such as real-time text translation, automatic information entry, blind person assistance and robot sensing. Though many methods have been proposed for horizontal and oriented texts, detecting irregularly shaped texts such as curved texts is still a challenging problem. To solve the problem, we propose a robust scene text detection method with adaptive text region representation. Given an input image, a text region proposal network is first used for extracting text proposals. Then, these proposals are verified and refined with a refinement network. Here, a recurrent neural network based adaptive text region representation is proposed for text region refinement, where a pair of boundary points is predicted at each time step until no new points are found. In this way, text regions of arbitrary shapes are detected and represented with an adaptive number of boundary points. This gives a more accurate description of text regions. Experimental results on five benchmarks, namely CTW1500, TotalText, ICDAR2013, ICDAR2015 and MSRA-TD500, show that the proposed method achieves state-of-the-art performance in scene text detection.
[dataset, represented, time, long, rnn] [scene, shape, computer, pattern, vision, point, horizontal, robust, international, single, angle] [proposed, method, conference, arbitrary, based, figure, input, ieee, image] [adaptive, table, number, network, performance, deep, needed, precision, fixed, scale, neural, achieves, better, wei] [text, step, represent, natural, deal] [region, detection, regression, bounding, boundary, box, proposal, icdar, curved, mask, oriented, detected, recall, hmean, refinement, xiang, predicted, textsnake, detecting, refined, textspotter, object, faster, backbone, cong, verified, feature, including, polygon, seglink, challenging] [representation, labeled, pairwise, loss, training, learning, label, classification, test, experimental]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Xiaobing and Jiang, Yingying and Luo, Zhenbo and Liu, Cheng-Lin and Choi, Hyunsoo and Kim, Sungjin},
  title = {Arbitrary Shape Scene Text Detection With Adaptive Text Region Representation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Adaptive NMS: Refining Pedestrian Detection in a Crowd
Songtao Liu, Di Huang, Yunhong Wang


Pedestrian detection in a crowd is a very challenging problem. This paper addresses it with a novel Non-Maximum Suppression (NMS) algorithm that better refines the bounding boxes given by detectors. The contributions are threefold: (1) we propose adaptive-NMS, which applies a dynamic suppression threshold to an instance, according to the target density; (2) we design an efficient subnetwork to learn density scores, which can be conveniently embedded into both single-stage and two-stage detectors; and (3) we achieve state-of-the-art results on the CityPersons and CrowdHuman benchmarks.
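A minimal sketch of greedy NMS with the per-instance dynamic threshold described in (1): in dense regions the suppression threshold is raised so that heavily overlapping neighbours of a crowded pedestrian are less likely to be removed. The densities would come from the learned density subnetwork; here they are just an input array, and details such as the density definition follow the paper, not this sketch.

import numpy as np

def iou(box, boxes):
    # box: [x1, y1, x2, y2]; boxes: (N, 4)
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def adaptive_nms(boxes, scores, densities, base_thresh=0.5):
    """Greedy NMS where each kept box suppresses neighbours using
    max(base_thresh, its predicted density) as the IoU threshold."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        thresh = max(base_thresh, densities[i])   # dynamic suppression threshold
        ious = iou(boxes[i], boxes[order[1:]])
        order = order[1:][ious <= thresh]
    return keep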
[dataset, state, prediction, framework, follow, predict] [ground, occlusion, general, algorithm] [based, method, image, figure, high, proposed, input, comparison] [density, deep, net, higher, performance, validation, highly, network, table, better, rate, overlap, adaptive, process, convolutional, scale, design, increasing] [true, evaluation, reasonable, greedy] [detection, pedestrian, object, threshold, crowd, false, faster, crowded, rfb, citypersons, bounding, overlapped, feature, region, crowdhuman, detector, suppression, box, cnn, bernt, final, neighboring, localization, subnet, backbone, kaiming, jian, propose, detecting, average, proposal, score, adaptivenms] [set, learning, function, loss, learn, soft]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Songtao and Huang, Di and Wang, Yunhong},
  title = {Adaptive NMS: Refining Pedestrian Detection in a Crowd},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Point in, Box Out: Beyond Counting Persons in Crowds
Yuting Liu, Miaojing Shi, Qijun Zhao, Xiaofang Wang


Modern crowd counting methods usually employ deep neural networks (DNN) to estimate crowd counts via density regression. Despite their significant improvements, the regression-based methods are incapable of providing the detection of individuals in crowds. The detection-based methods, on the other hand, have not been largely explored in recent trends of crowd counting due to the need for expensive bounding box annotations. In this work, we instead propose a new deep detection network with only point supervision required. It can simultaneously detect the size and location of human heads and count them in crowds. We first mine useful person size information from point-level annotations and initialize the pseudo ground truth bounding boxes. An online updating scheme is introduced to refine the pseudo ground truth during training, while a locally-constrained regression loss is designed to provide additional constraints on the size of the predicted boxes in a local neighborhood. Finally, we propose a curriculum learning strategy to train the network from images of relatively accurate and easy pseudo ground truth first. Extensive experiments are conducted on both detection and counting tasks on several standard benchmarks, e.g. ShanghaiTech, UCF_CC_50, WiderFace, and TRANCOS datasets, and the results show the superiority of our method over the state-of-the-art.
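One plausible, hedged reading of the pseudo ground-truth initialization is sketched below: each annotated head point becomes a square box whose side length is tied to the distance to its nearest annotated neighbour. The exact initialization rule, the online updating scheme, and the locally-constrained regression loss in the paper are more involved than this.

import numpy as np

def init_pseudo_boxes(points, scale=1.0):
    """points: (N, 2) annotated head locations (x, y).
    Returns (N, 4) pseudo boxes [x1, y1, x2, y2], sized by nearest-neighbour distance."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # ignore self-distance
    side = scale * d.min(axis=1)                # size prior from the closest neighbour
    return np.stack([pts[:, 0] - side / 2, pts[:, 1] - side / 2,
                     pts[:, 0] + side / 2, pts[:, 1] + side / 2], axis=1)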
[ucf, dataset, work, online, people, recognition] [ground, truth, point, local, dense, accurate, good, denote, perspective, notice] [image, result, method, face, based, proposed, real, produce] [size, density, network, table, deep, neural, width, convolutional, smaller, compare, scheme, initialization, small, initialized, denotes] [evaluate, model, nng] [bounding, crowd, detection, counting, psddn, box, person, head, regression, object, sha, localization, center, propose, mae, supervision, shanghaitech, faster, trancos, map, annotated, shb, anchor, predicted, segmentation, widerface, average] [pseudo, training, loss, learning, train, set, test, curriculum, strategy, updating, conducted, distance, hard, datasets]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Yuting and Shi, Miaojing and Zhao, Qijun and Wang, Xiaofang},
  title = {Point in, Box Out: Beyond Counting Persons in Crowds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Locating Objects Without Bounding Boxes
Javier Ribera, David Guera, Yuhao Chen, Edward J. Delp


Recent advances in convolutional neural networks (CNN) have achieved remarkable results in locating objects in images. In these networks, the training procedure usually requires providing bounding boxes or the maximum number of expected objects. In this paper, we address the task of estimating object locations without annotated bounding boxes which are typically hand-drawn and time consuming to label. We propose a loss function that can be used in any fully convolutional network (FCN) to estimate object locations. This loss function is a modification of the average Hausdorff distance between two unordered sets of points. The proposed method has no notion of bounding boxes, region proposals, or sliding windows. We evaluate our method with three datasets designed to locate people's heads, pupil centers and plant centers. We outperform state-of-the-art generic object detectors and methods fine-tuned for pupil tracking.
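For reference, the quantity the proposed loss modifies, the symmetric average Hausdorff distance between two point sets, can be written in a few lines (the paper's weighted variant is defined over the network's probability map so that it stays differentiable; that version is not reproduced here):

import numpy as np

def average_hausdorff(points_a, points_b):
    """Symmetric average Hausdorff distance between two 2-D point sets."""
    a = np.asarray(points_a, dtype=float)   # (N, 2)
    b = np.asarray(points_b, dtype=float)   # (M, 2)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # (N, M) pairwise distances
    # Average distance from each point in A to its nearest point in B, and vice versa.
    return d.min(axis=1).mean() + d.min(axis=0).mean()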
[dataset, second, human, recognition, term, multiple] [computer, pattern, estimated, estimate, vision, point, note, single, equation, june, international, approach, ground, yield, truth, analysis, require, field, respect] [image, ieee, conference, figure, method, july, input, pixel, high] [network, number, convolutional, precision, size, output, neural, deep, density, architecture, weighted, highly] [true, evaluate] [object, hausdorff, bounding, location, plant, pupil, average, crowd, map, counting, crop, ahd, region, detection, faster, locating, localization, recall, box, center, three, locate, cnn, third, mask, medical, whd] [distance, loss, function, learning, training, generic, set, task, metric, minimum]
@InProceedings{Ribera_2019_CVPR,
  author = {Ribera, Javier and Guera, David and Chen, Yuhao and Delp, Edward J.},
  title = {Locating Objects Without Bounding Boxes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
FineGAN: Unsupervised Hierarchical Disentanglement for Fine-Grained Object Generation and Discovery
Krishna Kumar Singh, Utkarsh Ojha, Yong Jae Lee


We propose FineGAN, a novel unsupervised GAN framework, which disentangles the background, object shape, and object appearance to hierarchically generate images of fine-grained object categories. To disentangle the factors without supervision, our key idea is to use information theory to associate each factor to a latent code, and to condition the relationships between the codes in a specific way to induce the desired hierarchy. Through extensive experiments, we show that FineGAN achieves the desired disentanglement to generate realistic and diverse images belonging to fine-grained classes of birds, dogs, and cars. Using FineGAN's automatically learned features, we also cluster real images as a first attempt at solving the novel problem of unsupervised fine-grained object category discovery. Our code/models/demo can be found at https://github.com/kkanshul/finegan
[capture, work, focus, recognition, capturing] [shape, varying, inf] [image, background, real, latent, appearance, disentanglement, generative, infogan, based, disentangle, figure, variation, realistic, high, color, disentangled, control] [number, hierarchically, process, deep, convolutional, factor, table, group, grouped] [child, generate, generation, generates, adversarial, model, visual, conditioned, generated, adv, inception, generating, evaluate, true] [parent, object, finegan, stage, hierarchical, foreground, discovery, category, final, mask, lbg, three, feature, finegrained, supervision, daux, associate, depict] [code, unsupervised, learning, clustering, learned, representation, set, train, loss, cluster, stanford, share, data, datasets, novel, training, learn, distribution]
@InProceedings{Singh_2019_CVPR,
  author = {Kumar Singh, Krishna and Ojha, Utkarsh and Jae Lee, Yong},
  title = {FineGAN: Unsupervised Hierarchical Disentanglement for Fine-Grained Object Generation and Discovery},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Mutual Learning of Complementary Networks via Residual Correction for Improving Semi-Supervised Classification
Si Wu, Jichang Li, Cheng Liu, Zhiwen Yu, Hau-San Wong


Deep mutual learning jointly trains multiple essential networks having similar properties to improve semi-supervised classification. However, the commonly used consistency regularization between the outputs of the networks may not fully leverage the difference between them. In this paper, we explore how to capture the complementary information to enhance mutual learning. For this purpose, we propose a complementary correction network (CCN), built on top of the essential networks, to learn the mapping from the output of one essential network to the ground truth label, conditioned on the features learnt by another. To make the second essential network increasingly complementary to the first one, this network is supervised by the corrected predictions. As a result, minimizing the prediction divergence between the two complementary networks can lead to significant performance gains in semi-supervised learning. Our experimental results demonstrate that the proposed approach clearly improves mutual learning between essential networks, and achieves state-of-the-art results on multiple semi-supervised classification benchmarks. In particular, the test error rates are reduced from previous 21.23% and 14.65% to 12.05% and 10.37% on CIFAR-10 with 1000 and 2000 labels, respectively.
[second, prediction, previous, temporal, multiple, learns, term, capture, work] [international, approach, ground, truth, error, accurate, virtual, corresponding, computer] [proposed, figure, raw, conference, generative, image, comparison, correction, difference, produce, ieee, mapping, separate, method] [network, deep, output, neural, denotes, performance, processing, effectiveness, improving, table, better, residual, convolutional, connected] [model, adversarial, conditioned, probability, evaluate, decision] [complementary, enhanced, fully, three, improve, instance, baseline, leverage, propose, improves] [essential, learning, ccn, mutual, training, classification, unlabeled, learnt, test, data, divergence, class, semisupervised, label, learn, labeled, knowledge, svhn, minimizing, supervised, train, loss, ensemble, classifier]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Si and Li, Jichang and Liu, Cheng and Yu, Zhiwen and Wong, Hau-San},
  title = {Mutual Learning of Complementary Networks via Residual Correction for Improving Semi-Supervised Classification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Sampling Techniques for Large-Scale Object Detection From Sparsely Annotated Objects
Yusuke Niitani, Takuya Akiba, Tommi Kerola, Toru Ogawa, Shotaro Sano, Shuji Suzuki


Efficient and reliable methods for training of object detectors are in higher demand than ever, and more and more data relevant to the field is becoming available. However, large datasets like Open Images Dataset v4 (OID) are sparsely annotated, and some measure must be taken in order to ensure the training of a reliable detector. In order to take the incompleteness of these datasets into account, one possibility is to use pretrained models to detect the presence of the unverified objects. However, the performance of such a strategy depends largely on the power of the pretrained model. In this study, we propose part-aware sampling, a method that uses human intuition for the hierarchical relation between objects. In terse terms, our method works by making assumptions like "a bounding box for a car should contain a bounding box for a tire". We demonstrate the power of our method on OID and compare the performance against a method based on a pretrained model. Our method also won the first and second place on the public and private test sets of the Google AI Open Images Competition 2018.
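The "a bounding box for a car should contain a bounding box for a tire" intuition can be sketched as a rule for ignoring losses on sparsely annotated images: a proposal is not counted as a negative for a part category if it lies inside an annotated box of a matching subject category. The part_of relation and helpers below are illustrative assumptions, not the authors' code.

def inside(inner, outer):
    # Both boxes are [x1, y1, x2, y2]; True if `inner` is fully contained in `outer`.
    return (inner[0] >= outer[0] and inner[1] >= outer[1] and
            inner[2] <= outer[2] and inner[3] <= outer[3])

def ignore_part_negative(proposal_box, part_class, annotated_boxes, part_of):
    """annotated_boxes: list of (subject_class, box); part_of: set of (part, subject) pairs.
    Returns True if the classification loss for `part_class` on this proposal
    should be ignored instead of treated as a negative."""
    for subject_class, subject_box in annotated_boxes:
        if (part_class, subject_class) in part_of and inside(proposal_box, subject_box):
            return True
    return False

# Example relation (hypothetical entries): {("tire", "car"), ("human_face", "person")}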
[dataset, subject, human, work, prediction] [ground, approach, truth] [method, based, missing, image, figure, high, study] [sparse, table, network, pretrained, performance, number, better, precision, rate, order, deep, validation, positively] [model, private, created, sparsely, making, competition] [object, verified, bounding, category, detection, annotated, annotation, score, coco, roi, included, baseline, box, ignore, threshold, oid, detector, ignored, presence, unverified, public, false, proposal, detect, average, weakly, unannotated, improvement] [pseudo, sampling, training, set, loss, open, learning, classification, trained, oracle, negative, subset, large, positive, data, test, train, supervised, datasets, soft]
@InProceedings{Niitani_2019_CVPR,
  author = {Niitani, Yusuke and Akiba, Takuya and Kerola, Tommi and Ogawa, Toru and Sano, Shotaro and Suzuki, Shuji},
  title = {Sampling Techniques for Large-Scale Object Detection From Sparsely Annotated Objects},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Curls & Whey: Boosting Black-Box Adversarial Attacks
Yucheng Shi, Siyu Wang, Yahong Han


Image classifiers based on deep neural networks are vulnerable to adversarial examples. Two defects exist in black-box iterative attacks that generate adversarial examples by incrementally adjusting the noise-adding direction for each step. On the one hand, existing iterative attacks add noise monotonically along the direction of gradient ascent, resulting in a lack of diversity and adaptability of the generated iterative trajectories. On the other hand, it is trivial to perform an adversarial attack by adding excessive noise, but currently there is no refinement mechanism to squeeze out redundant noise. In this work, we propose the Curls & Whey black-box attack to fix the above two defects. During Curls iteration, by combining gradient ascent and descent, we 'curl' up iterative trajectories to integrate more diversity and transferability into adversarial examples. Curls iteration also alleviates the diminishing marginal effect in existing iterative attacks. The Whey optimization further squeezes the 'whey' of noise by exploiting the robustness of adversarial perturbation. Extensive experiments on ImageNet and Tiny-ImageNet demonstrate that our approach achieves an impressive decrease in noise magnitude in the l2 norm. The Curls & Whey attack also shows promising transferability against ensemble models as well as adversarially trained models. In addition, we extend our attack to targeted misclassification, effectively reducing the difficulty of targeted attacks under the black-box condition.
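A simplified sketch of the noise-squeezing idea behind the Whey step: once an adversarial example is found, greedily shrink components of the perturbation for as long as the black-box model still misclassifies the input. The model_predict query function is an assumption, and the paper's actual schedule (which works over groups of pixels and multiple shrink levels) is not reproduced.

import numpy as np

def squeeze_noise(x_clean, x_adv, true_label, model_predict, shrink=0.5, rounds=2):
    """Shrink redundant perturbation while keeping the example adversarial."""
    noise = x_adv - x_clean
    for _ in range(rounds):
        # Try the largest-magnitude components first.
        for idx in np.argsort(-np.abs(noise), axis=None):
            trial = noise.copy()
            trial.flat[idx] *= shrink
            if model_predict(x_clean + trial) != true_label:  # still misclassified?
                noise = trial
    return x_clean + noise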
[trajectory, perform] [direction, optimization, local, computer] [image, noise, method, pixel, interpolation, figure, based] [gradient, original, iteration, number, magnitude, binary, search, squeeze, table, imagenet, smaller, gaussian, deep, redundant, represents, size, variance, nasnet, neural, norm, reduce] [adversarial, iterative, attack, model, substitute, step, whey, example, median, transferability, targeted, fgsm, jsub, ascent, decision, generated, marginal, diminishing, perturbation, arxiv, preprint, inception, nicolas, diversity, simply, query, untargeted, whc, ian, patrick, robustness, downhill] [average, category, boundary, three, propose, mask] [target, loss, distance, set, ensemble, update, existing, cross, function, classification, learning]
@InProceedings{Shi_2019_CVPR,
  author = {Shi, Yucheng and Wang, Siyu and Han, Yahong},
  title = {Curls & Whey: Boosting Black-Box Adversarial Attacks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Barrage of Random Transforms for Adversarially Robust Defense
Edward Raff, Jared Sylvester, Steven Forsyth, Mark McLean


Defenses against adversarial examples, when using the ImageNet dataset, are historically easy to defeat. The common understanding is that a combination of simple image transformations and other various defenses is insufficient to provide the necessary protection when the obfuscated gradient is taken into account. In this paper, we explore the idea of stochastically combining a large number of individually weak defenses into a single barrage of randomized transformations to build a strong defense against adversarial attacks. We show that, even after accounting for obfuscated gradients, the Barrage of Random Transforms (BaRT) is a resilient defense against even the most difficult attacks, such as PGD. BaRT achieves up to a 24x improvement in accuracy compared to previous work, and has even extended effectiveness out to a previously untested maximum adversarial perturbation of ε = 32.
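The core mechanism reduces to sampling a random subset of weak transforms and applying them in a random order before classification. The transforms below are trivial placeholders standing in for the much larger pool of parameterized image transformations used by the defense.

import random
import numpy as np

def quantize(x):                    # placeholder weak defense: coarse quantization
    return np.round(x * 16) / 16

def add_noise(x):                   # placeholder weak defense: small Gaussian noise
    return np.clip(x + np.random.normal(0.0, 0.02, x.shape), 0.0, 1.0)

def horizontal_shift(x):            # placeholder weak defense: random column shift
    return np.roll(x, shift=random.randint(-3, 3), axis=1)

TRANSFORMS = [quantize, add_noise, horizontal_shift]

def barrage(x, k=2):
    """Apply a randomly selected subset of k transforms in a random order."""
    chosen = random.sample(TRANSFORMS, k)
    random.shuffle(chosen)
    for transform in chosen:
        x = transform(x)
    return x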
[work, perform, combined] [international, provide, computer, approach, single, vision, compute, rgb] [image, transform, conference, transformation, prior, figure, input, color, noise, produce] [accuracy, transforms, number, gradient, group, imagenet, applied, apply, deep, neural, scale, reduced, impact, compared, larger, stochastic, network] [adversarial, attack, adversary, defense, bart, model, random, obfuscated, success, pgd, eot, machine, selecting, strong, robustness, include, consider, find, strongest, attacker, bpda, medoid, barrage, simple, stronger, ensembling, create, threat, making, defeat, randomness, choose, targeted] [fully, weak, three] [learning, training, randomly, selected, ensemble, test, large, set, space, function]
@InProceedings{Raff_2019_CVPR,
  author = {Raff, Edward and Sylvester, Jared and Forsyth, Steven and McLean, Mark},
  title = {Barrage of Random Transforms for Adversarially Robust Defense},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Aggregation Cross-Entropy for Sequence Recognition
Zecheng Xie, Yaoxiong Huang, Yuanzhi Zhu, Lianwen Jin, Yuliang Liu, Lele Xie


In this paper, we propose a novel method, aggregation cross-entropy (ACE), for sequence recognition from a brand new perspective. The ACE loss function exhibits competitive performance to CTC and the attention mechanism, with much quicker implementation (as it involves only four fundamental formulas), faster inference/back-propagation (approximately O(1) in parallel), less storage requirement (no parameter and negligible runtime memory), and convenient employment (by replacing CTC with ACE). Furthermore, the proposed ACE loss function exhibits two noteworthy properties: (1) it can be directly applied for 2D prediction by flattening the 2D prediction into 1D prediction as the input, and (2) it requires only the characters and their numbers in the sequence annotation for supervision, which allows it to advance beyond sequence recognition, e.g., to counting problems. The code is publicly available at https://github.com/summerlvsong/Aggregation-Cross-Entropy.
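The aggregation idea can be sketched in a few lines: per-position class probabilities are summed over all T positions and compared, via cross-entropy, with the normalized character counts of the annotation. Class 0 plays the role of the blank class here; this is an illustration of the formulas rather than the released implementation.

import numpy as np

def ace_loss(probs, char_counts):
    """probs: (T, C) per-position class probabilities (each row sums to 1).
    char_counts: (C,) occurrence count of each character class in the label;
    entry 0 (blank) is recomputed as T minus the number of labelled characters."""
    T = probs.shape[0]
    counts = np.asarray(char_counts, dtype=float).copy()
    counts[0] = T - counts[1:].sum()
    y_bar = probs.sum(axis=0) / T          # aggregated prediction per class
    target = counts / T                    # normalized character counts
    return -(target * np.log(y_bar + 1e-12)).sum()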
[recognition, prediction, sequence, previous, dataset, speech, time, online, recurrent] [scene, problem, irregular, require, directly, general, shape, runtime, note, normalized] [proposed, image, based, input, result, method, comparison, figure, synthetic, real] [network, number, neural, table, chinese, performance, gradient, computation, order, implementation, complexity, aggregation, applied, deep, offline, convolutional, parameter, original, highly, represents, size, higher, rate] [text, ace, attention, ctc, model, character, handwritten, probability, mechanism, everyday, requires, memory, natural, regular, hctr, accumulative, cik] [counting, annotation, icdar, object, cropped, spatial, challenging] [loss, function, training, class, large, trained, datasets, learning, generally, set, log, data]
@InProceedings{Xie_2019_CVPR,
  author = {Xie, Zecheng and Huang, Yaoxiong and Zhu, Yuanzhi and Jin, Lianwen and Liu, Yuliang and Xie, Lele},
  title = {Aggregation Cross-Entropy for Sequence Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
LaSO: Label-Set Operations Networks for Multi-Label Few-Shot Learning
Amit Alfassy, Leonid Karlinsky, Amit Aides, Joseph Shtok, Sivan Harary, Rogerio Feris, Raja Giryes, Alex M. Bronstein


Example synthesis is one of the leading methods to tackle the problem of few-shot learning, where only a small number of samples per class are available. However, current synthesis approaches only address the scenario of a single category label per image. In this work, we propose a novel technique for synthesizing samples with multiple labels for the (yet unhandled) multi-label few-shot classification scenario. We propose to combine pairs of given examples in feature space, so that the resulting synthesized feature vectors will correspond to examples whose label sets are obtained through certain set operations on the label sets of the corresponding input pairs. Thus, our method is capable of producing a sample containing the intersection, union or set-difference of labels present in two input samples. As we show, these set operations generalize to labels unseen during training. This enables performing augmentation on examples of novel categories, thus, facilitating multi-label few-shot classifier learning. We conduct numerous experiments showing promising results for the label-set manipulation capabilities of the proposed approach, both directly (using the classification and retrieval metrics), and in the context of performing data augmentation for multi-label few-shot learning. We propose a benchmark for this new and challenging task and show that our method compares favorably to all the common baselines.
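A hedged sketch of label-set operations in feature space: three small networks map a pair of image feature vectors to synthetic features whose label sets are supervised to be the intersection, union, and set-difference of the inputs' label sets through a shared multi-label classifier. Layer shapes and the plain BCE supervision are assumptions for illustration, not the exact LaSO architecture or losses.

import torch
import torch.nn as nn

def mlp(dim):
    return nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

class LabelSetOps(nn.Module):
    def __init__(self, dim, num_labels):
        super().__init__()
        self.f_int, self.f_uni, self.f_sub = mlp(dim), mlp(dim), mlp(dim)
        self.classifier = nn.Linear(dim, num_labels)   # shared multi-label head

    def forward(self, fx, fy):
        z = torch.cat([fx, fy], dim=-1)
        return self.f_int(z), self.f_uni(z), self.f_sub(z)

def label_set_loss(model, fx, fy, lx, ly):
    """lx, ly: (B, num_labels) binary multi-label vectors of the two input images."""
    bce = nn.BCEWithLogitsLoss()
    z_int, z_uni, z_sub = model(fx, fy)
    return (bce(model.classifier(z_int), lx * ly) +                        # intersection
            bce(model.classifier(z_uni), torch.clamp(lx + ly, max=1.0)) +  # union
            bce(model.classifier(z_sub), torch.clamp(lx - ly, min=0.0)))   # set-difference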
[dataset, performing, work, interesting, multiple, recognition] [approach, computer, analytic, corresponding, vision, international, computed] [conference, manipulation, image, synthesized, celeba, synthesis, proposed, ieee, figure, input, method, generative, based] [table, original, validation, performed, deep, performance, neural, network, small] [visual, vector, expected, giraffe] [feature, person, union, intersection, coco, semantic, propose, category, backbone, map, car, object] [laso, set, label, learning, classification, training, unseen, space, classifier, trained, augmentation, dog, mint, retrieval, data, subtraction, rloss, learned, test, task, muni, extractor, generalize, train, learn, zint, loss, retrieved, scenario, novel]
@InProceedings{Alfassy_2019_CVPR,
  author = {Alfassy, Amit and Karlinsky, Leonid and Aides, Amit and Shtok, Joseph and Harary, Sivan and Feris, Rogerio and Giryes, Raja and Bronstein, Alex M.},
  title = {LaSO: Label-Set Operations Networks for Multi-Label Few-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Few-Shot Learning With Localization in Realistic Settings
Davis Wertheimer, Bharath Hariharan


Traditional recognition methods typically require large, artificially-balanced training classes, while few-shot learning methods are tested on artificially small ones. In contrast to both extremes, real world recognition problems exhibit heavy-tailed class distributions, with cluttered scenes and a mix of coarse and fine-grained class distinctions. We show that prior methods designed for few-shot learning do not work out of the box in these challenging conditions, based on a new "meta-iNat" benchmark. We introduce three parameter-free improvements: (a) better training procedures based on adapting cross-validation to meta-learning, (b) novel architectures that localize objects using limited bounding box annotations before classification, and (c) simple parameter-free expansions of the feature space based on bilinear pooling. Together, these improvements double the accuracy of state-of-the-art models on meta-iNat while generalizing to prior benchmarks, complex neural architectures, and settings with substantial domain shift.
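Of the three improvements, the parameter-free bilinear pooling expansion is the easiest to sketch: average the outer product of a feature map with itself over spatial positions, yielding a covariance-like representation without adding any weights. This is a generic illustration, not necessarily the exact pooling variant used in the paper.

import torch

def bilinear_pool(feat):
    """feat: (B, C, H, W) convolutional features -> (B, C*C) bilinear features."""
    b, c, h, w = feat.shape
    x = feat.reshape(b, c, h * w)
    pooled = torch.bmm(x, x.transpose(1, 2)) / (h * w)   # (B, C, C) averaged outer products
    return pooled.reshape(b, c * c)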
[recognition, work, dataset] [computer, vision, pattern, international, approach, problem, well, june] [reference, conference, image, ieee, background, prior, based, figure, real] [accuracy, batch, small, covariance, network, number, pooling, neural, table, bilinear, performance, deep, convolutional, processing] [evaluation, query, model, machine, relevant, common] [localization, feature, bounding, foreground, box, improves, object, benchmark, interest] [learning, training, class, prototypical, learner, set, representation, large, trained, transfer, folding, data, classification, localizer, test, unsupervised, space, rare, softmax, labeled, learn, learned, fewshot, generalize]
@InProceedings{Wertheimer_2019_CVPR,
  author = {Wertheimer, Davis and Hariharan, Bharath},
  title = {Few-Shot Learning With Localization in Realistic Settings},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
AdaGraph: Unifying Predictive and Continuous Domain Adaptation Through Graphs
Massimiliano Mancini, Samuel Rota Bulo, Barbara Caputo, Elisa Ricci


The ability to categorize is a cornerstone of visual intelligence, and a key functionality for artificial, autonomous visual machines. This problem will never be solved without algorithms able to adapt and generalize across visual domains. Within the context of domain adaptation and generalization, this paper focuses on the predictive domain adaptation scenario, namely the case where no target data are available and the system has to learn to generalize from annotated source images plus unlabeled samples with associated metadata from auxiliary domains. Our contribution is the first deep architecture that tackles predictive domain adaptation, able to leverage the information brought by the auxiliary domains through a graph. Moreover, we present a simple yet effective strategy that allows us to take advantage of the incoming target data at test time, in a continuous domain adaptation scenario. Experiments on three benchmark databases support the value of our approach.
[graph, considering, state, dataset, previous, prediction, work, forward, incoming, time, multiple] [continuous, approach, associated, allows, corresponding, problem, view, directly, define, case] [method, proposed, image, figure, mapping] [deep, network, scale, performance, order, accuracy, layer, table, architecture, standard, batch, employ] [model, metadata, visual, node, probability, consider, pass] [refinement, baseline, propose, art, edge, samuel] [domain, target, data, source, adaptation, set, adagraph, strategy, training, test, auxiliary, learning, pda, exploiting, bias, predictive, exploit, classification, unsupervised, gbn, barbara, learn, labeled, scenario, update, elisa, unlabeled, task, loss, novel, upper, rota]
@InProceedings{Mancini_2019_CVPR,
  author = {Mancini, Massimiliano and Rota Bulo, Samuel and Caputo, Barbara and Ricci, Elisa},
  title = {AdaGraph: Unifying Predictive and Continuous Domain Adaptation Through Graphs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Grounded Video Description
Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach


Video description is one of the most challenging problems in vision and language understanding due to the large variability both on the video and language side. Models, hence, typically shortcut the difficulty in recognition and generate plausible sentences that are based on priors but are not necessarily grounded in the video. In this work, we explicitly link the sentence to the evidence in the video by annotating each noun phrase in a sentence with the corresponding bounding box in one of the frames of a video. Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase. This allows training video description models with this data, and importantly, evaluating how grounded or "true" such models are to the video they describe. To generate grounded captions, we propose a novel video description model which is able to exploit these bounding box annotations. We demonstrate the effectiveness of our model on our dataset, but also show how it can be applied to image description on the Flickr30k Entities dataset. We achieve state-of-the-art performance on video description, video paragraph description, and image description and demonstrate our generated sentences are better grounded in the video.
[video, dataset, temporal, activitynet, human, frame, recognition, people, work, framework] [computer, vision, pattern, international, corresponding, well, dense] [conference, image, ieee, method, based] [better, neural, accuracy, group, number, best, validation] [attention, grounding, description, model, language, visual, sentence, grounded, generated, noun, captioning, generation, man, word, encoding, generating, evaluation, gvd, generate, refer, attend, jason, evaluate, paragraph, indicates] [object, region, module, box, bounding, feature, supervision, annotated, annotate, localization, context, val, three, segment] [set, test, classification, loss, class, supervised, embedding, unsupervised]
@InProceedings{Zhou_2019_CVPR,
  author = {Zhou, Luowei and Kalantidis, Yannis and Chen, Xinlei and Corso, Jason J. and Rohrbach, Marcus},
  title = {Grounded Video Description},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Streamlined Dense Video Captioning
Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, Bohyung Han


Dense video captioning is an extremely challenging task since accurate and coherent description of events in a video requires holistic understanding of video contents as well as contextual reasoning of individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then captioning on a subset of the proposals. As a result, the generated sentences are prone to be redundant or inconsistent since they fail to consider temporal dependency between events. To tackle this challenge, we propose a novel dense video captioning framework, which models temporal dependency across events in a video explicitly and leverages visual and linguistic context from prior events for coherent storytelling. This objective is achieved by 1) integrating an event sequence generation network to select a sequence of event proposals adaptively, and 2) feeding the sequence of event proposals to our sequential video captioning network, which is trained by reinforcement learning with two-level rewards---at both event and episode levels---for better context modeling. The proposed technique achieves outstanding performance on the ActivityNet Captions dataset in most metrics.
[event, video, sequence, temporal, activitynet, sequential, rnn, esgn, coherent, recurrent, dependency, time, sdvc, dataset, framework, action, gru, hidden, individual, rnnptr, ting, work] [dense, algorithm, note, camera, single] [proposed, based, image, method, input, quality] [network, neural, number, performance, table, validation, sequentially, deep] [captioning, generation, caption, visual, generated, episode, man, candidate, reinforcement, generate, model, generates, reward, meteor, description, linguistic, sampled, attention, mft, playing, describing, rnnenc, rnne, understanding, describe, conditioned, paragraph] [proposal, context, detection, detected, hierarchical, feature, adopt, average, detecting, propose] [learning, set, existing, log, selected, trained, training, representation, select]
@InProceedings{Mun_2019_CVPR,
  author = {Mun, Jonghwan and Yang, Linjie and Ren, Zhou and Xu, Ning and Han, Bohyung},
  title = {Streamlined Dense Video Captioning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Adversarial Inference for Multi-Sentence Video Description
Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach


While significant progress has been made in the image captioning task, video description is still in its infancy due to the complex nature of video data. Generating multi-sentence descriptions for long videos is even more challenging. Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video. Recently, reinforcement and adversarial learning based methods have been explored to improve the image captioning models; however, both types of methods suffer from a number of issues, e.g. poor readability and high redundancy for RL and stability issues for GANs. In this work, we instead propose to apply adversarial techniques during inference, designing a discriminator which encourages better multi-sentence video description. In addition, we find that a multi-discriminator "hybrid" design, where each discriminator targets one aspect of a description, leads to the best results. Specifically, we decouple the discriminator to evaluate on three criteria: 1) visual relevance to the video, 2) language diversity and fluency, and 3) coherence across sentences. Our approach results in more accurate, diverse, and coherent multi-sentence video descriptions, as shown by automatic as well as human evaluation on the popular ActivityNet Captions dataset.
[video, recognition, human, multiple, previous, sequence, work, lstm, coherent, activitynet, clip, temporal, coherence, jointly, dataset] [computer, vision, pattern, approach, international, well, rely] [conference, ieee, image, based, generator, figure, hybrid, prior, generative, method] [inference, processing, best, compare, neural, number, better, standard, table, wei] [adversarial, discriminator, sentence, visual, description, language, mle, captioning, gan, natural, generated, reinforcement, diversity, automatic, paragraph, evaluation, generate, model, caption, marcus, generation, hybriddis, diverse, repetition, include, scst, word, anna] [propose, score, three, improve, object, baseline, person] [learning, training, trained, representation, task, aim, sampling, trevor]
@InProceedings{Park_2019_CVPR,
  author = {Sung Park, Jae and Rohrbach, Marcus and Darrell, Trevor and Rohrbach, Anna},
  title = {Adversarial Inference for Multi-Sentence Video Description},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations
Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma


We propose the Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representation and textual semantics. The model unifies the embeddings of concepts at different levels: objects, attributes, relations, and full scenes. We view the sentential semantics as a combination of different semantic components such as objects and relations; their embeddings are aligned with different image regions. A contrastive learning approach is proposed for the effective learning of this fine-grained alignment from only image-caption pairs. We also present a simple yet effective approach that enforces the coverage of caption embeddings on the semantic components that appear in the sentence. We demonstrate that the Unified VSE outperforms baselines on cross-modal retrieval tasks; the enforcement of the semantic coverage improves the model's robustness in defending text-domain adversarial attacks. Moreover, our model empowers the use of visual cues to accurately resolve word dependencies in novel sentences.
[recognition, joint, dataset, eat, multiple] [vision, computer, pattern, wall, approach, international, local, matching, scene] [image, conference, ieee, figure, based] [table, neural, performance, full, deep, structured, factorized] [caption, adversarial, visual, model, vse, sentence, relational, word, clock, textual, language, white, ucomp, encoder, univse, coverage, usent, wooden, encoding, multimodal, relevance, attack, sweater, enforcement, robustness, example] [semantic, object, parsing, map, semantics, region, global] [embedding, learning, unified, retrieval, embeddings, contrastive, alignment, negative, space, representation, set, training, learned, combination, loss, randomly, ranking, pair, task, hard]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Hao and Mao, Jiayuan and Zhang, Yufeng and Jiang, Yuning and Li, Lei and Sun, Weiwei and Ma, Wei-Ying},
  title = {Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Compose Dynamic Tree Structures for Visual Contexts
Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, Wei Liu


We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A. Our visual context tree model, dubbed VCTree, has two key advantages over existing structured object representations including chains and fully-connected graphs: 1) the efficient and expressive binary tree encodes the inherent parallel/hierarchical relationships among objects, e.g., "clothes" and "pants" usually co-occur and belong to "person"; 2) the dynamic structure varies from image to image and task to task, allowing more content-/task-specific message passing among objects. To construct a VCTree, we design a score function that calculates the task-dependent validity between each object pair, and the tree is the binary version of the maximum spanning tree from the score matrix. Then, visual contexts are encoded by a bidirectional TreeLSTM and decoded by task-specific models. We develop a hybrid learning procedure which integrates end-task supervised learning and tree-structure reinforcement learning, where the former's evaluation result serves as a self-critic for the latter's structure exploration. Experimental results on two benchmarks, which require reasoning over contexts: Visual Genome for scene graph generation and VQA2.0 for visual Q&A, show that VCTree outperforms state-of-the-art results while discovering interpretable visual context structures.
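The tree construction reduces to a maximum spanning tree over the pairwise validity scores (which is then binarized). A naive Prim-style sketch of the maximum spanning tree step is shown below; the score function itself, the binarization, and the TreeLSTM encoding follow the paper and are not reproduced here.

import numpy as np

def maximum_spanning_tree(score):
    """score: (N, N) symmetric pairwise validity scores between objects.
    Returns a list of (parent, child) edges forming a maximum spanning tree."""
    n = score.shape[0]
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or score[i, j] > score[best]):
                    best = (i, j)
        edges.append(best)
        in_tree.add(best[1])
    return edges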
[graph, dynamic, prediction, construct] [scene, construction, left, matrix, vision] [image, figure, proposed, hybrid, input] [structure, table, validation, binary, neural, constructed, parallel, number, popular, compared] [visual, vct, tree, question, ree, model, attention, sgg, generation, relationship, predicate, vqa, node, encoded, man, compose, reasoning, treelstm, reinforcement, encoding, encode, multimodal, hanwang, chain, indicates, message, spanning, genome, answer] [context, object, feature, bounding, score, detection, contextual, relation, box, indicate, hierarchical, adopt, layout, three] [learning, supervised, set, task, balanced, bias, pair, pairwise, classification, training, maximum, distribution]
@InProceedings{Tang_2019_CVPR,
  author = {Tang, Kaihua and Zhang, Hanwang and Wu, Baoyuan and Luo, Wenhan and Liu, Wei},
  title = {Learning to Compose Dynamic Tree Structures for Visual Contexts},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang


Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problems. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide an intrinsic reward to encourage global matching between instructions and trajectories, and a reasoning navigator is employed to perform cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves the new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method to explore unseen environments by imitating its own past, good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which tremendously minimizes the success rate performance gap between seen and unseen environments (from 30.7% to 11.7%).
[trajectory, action, previous, recognition, learns, dataset, work, state] [matching, computer, intrinsic, vision, pattern, extrinsic, local, good, approach, panoramic, note] [conference, ieee, figure, method, image, prior, row, real, proposed, replay] [performance, better, rate, neural, explore, validation, table, processing, reinforced, effectiveness, search, best] [visual, reward, agent, language, rcm, navigation, navigator, sil, instruction, vln, path, critic, exploration, natural, model, arxiv, preprint, reasoning, policy, imitation, reinforcement, success, attention, textual, grounding, beam, introduce, length, embodied, history, evaluation, turn, step, generalizability, understanding] [context, global, propose] [learning, unseen, set, test, target, training, task, supervised, spl, lifelong, testing]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Xin and Huang, Qiuyuan and Celikyilmaz, Asli and Gao, Jianfeng and Shen, Dinghan and Wang, Yuan-Fang and Yang Wang, William and Zhang, Lei},
  title = {Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering
Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven C. H. Hoi, Xiaogang Wang, Hongsheng Li


Learning effective fusion of multi-modality features is at the heart of visual question answering. We propose a novel method to dynamically fuse multi-modal features with intra- and inter-modality information flow, which alternately passes dynamic information between and across the visual and language modalities. It can robustly capture the high-level interactions between the language and vision domains, thus significantly improving the performance of visual question answering. We also show that the proposed dynamic intra-modality attention flow, conditioned on the other modality, can dynamically modulate the intra-modality attention of the current modality, which is vital for multimodal feature fusion. Experimental evaluations on the VQA 2.0 dataset show that the proposed method achieves the state-of-the-art VQA performance. Extensive ablation studies are carried out for the comprehensive analysis of the proposed method.
[flow, dynamic, fusion, key, framework, previous, utilized, updated, dataset, passing, capture, gru] [vision, computer, pattern, international, position, dense] [image, proposed, conference, ieee, figure, based, input, denoted] [performance, bilinear, better, neural, network, dynamically, deep, pooling, validation, parallel, table, processing, weight] [visual, attention, question, word, vqa, language, modality, dfaf, dyintramaf, intermaf, conditioned, query, transformed, sentinel, model, intramaf, arxiv, preprint, answering, natural, captioning, mechanism, intermafre, dim, man, pointing, generate] [region, feature, module, object, ablation, rcnn, faster, average] [learning, dimension, update, product, embedding]
@InProceedings{Gao_2019_CVPR,
  author = {Gao, Peng and Jiang, Zhengkai and You, Haoxuan and Lu, Pan and Hoi, Steven C. H. and Wang, Xiaogang and Li, Hongsheng},
  title = {Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Cycle-Consistency for Robust Visual Question Answering
Meet Shah, Xinlei Chen, Marcus Rohrbach, Devi Parikh


Despite significant progress in Visual Question Answering over the years, the robustness of today's VQA models leaves much to be desired. We introduce a new evaluation protocol and associated dataset (VQA-Rephrasings) and show that state-of-the-art VQA models are notoriously brittle to linguistic variations in questions. VQA-Rephrasings contains 3 human-provided rephrasings for 40k question-image pairs from the VQA v2.0 validation dataset. As a step towards improving the robustness of VQA models, we propose a model-agnostic framework that exploits cycle consistency. Specifically, we train a model to not only answer a question, but also generate a question conditioned on the answer, such that the answer predicted for the generated question is the same as the ground truth answer to the original question. Without the use of additional supervision, we show that our approach is significantly more robust to linguistic variations than state-of-the-art VQA models when evaluated on the VQA-Rephrasings dataset. In addition, our approach also outperforms state-of-the-art approaches on the standard VQA and Visual Question Generation tasks on the challenging VQA v2.0 dataset. Code and models will be made publicly available.
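The training objective can be sketched as three terms: the usual VQA loss on the original question, a generation loss for the question produced from the image and the ground-truth answer, and a consistency loss requiring the generated question to be answered with the same ground-truth answer. Everything below (vqa, vqg, the two loss functions, the weights) is an assumed interface for illustration, not the authors' API.

def cycle_consistent_loss(vqa, vqg, image, question, answer_gt,
                          answer_loss, question_loss, lambda_g=1.0, lambda_c=0.5):
    # 1) Standard VQA loss on the original question.
    loss = answer_loss(vqa(image, question), answer_gt)
    # 2) Generate a rephrased question conditioned on the image and the answer.
    question_gen = vqg(image, answer_gt)
    loss = loss + lambda_g * question_loss(question_gen, question)
    # 3) Cycle consistency: the generated question should yield the same answer.
    loss = loss + lambda_c * answer_loss(vqa(image, question_gen), answer_gt)
    return loss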
[dataset, framework, gating, prediction, passing] [robust, computer, vision, consensus, pattern, ground, approach, associated, additional, consistent, implies] [image, consistency, conference, ieee, proposed, row, cycle, component, collected] [original, performance, validation, table, better, processing, accuracy, represents, neural] [vqa, question, model, answer, visual, generation, vqg, robustness, generated, pythia, linguistic, attention, generate, answering, language, yellow, rephrasing, natural, ban, devi, evaluation, generating, butd, conditioned, correct, mutan, enables, semantically, modality, mechanism, dhruv, arxiv] [module, predicted, propose, score, failure] [trained, training, learning, loss, base, train, task, split, existing, measure, data]
@InProceedings{Shah_2019_CVPR,
  author = {Shah, Meet and Chen, Xinlei and Rohrbach, Marcus and Parikh, Devi},
  title = {Cycle-Consistency for Robust Visual Question Answering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Embodied Question Answering in Photorealistic Environments With Point Cloud Perception
Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, Dhruv Batra


To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception, we instantiate a large-scale navigation task -- Embodied Question Answering [1] -- in photo-realistic environments (Matterport 3D). We thoroughly study navigation policies that utilize 3D point clouds, RGB images, or their combination. Our analysis of these models reveals several key findings. We find that two seemingly naive navigation baselines, forward-only and random, are strong navigators and challenging to outperform, due to the specific choice of the evaluation setting presented by [1]. We find a novel loss-weighting scheme we call Inflection Weighting to be important when training recurrent models for navigation with behavior cloning, and are able to outperform the baselines with this technique. We find that point clouds provide a richer signal than RGB images for learning obstacle avoidance, motivating the use (and continued study) of 3D deep learning models for embodied navigation.
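Inflection Weighting can be sketched as a re-weighted behavior-cloning loss in which timesteps where the ground-truth action changes are up-weighted; in the paper the weight is tied to how rare such inflections are, while here it is simply a parameter, so this is a hedged illustration rather than the exact scheme.

import numpy as np

def inflection_weighted_nll(log_probs, actions, inflection_weight):
    """log_probs: (T, A) log action probabilities; actions: (T,) ground-truth action ids.
    Timesteps whose action differs from the previous one get weight `inflection_weight`."""
    actions = np.asarray(actions)
    weights = np.ones(len(actions))
    weights[1:][actions[1:] != actions[:-1]] = inflection_weight   # mark inflection points
    nll = -log_probs[np.arange(len(actions)), actions]
    return (weights * nll).sum() / weights.sum()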
[behavior, action, dataset, work, time, predict] [point, vision, rgb, cloud, provide, ground, analysis, truth, indoor, depth, view, directly, single, daniel] [color, proposed, based, input, real, figure, conference] [deep, neural, number, best, accuracy, order, network] [navigation, question, agent, embodied, perception, visual, answering, memory, inflection, model, environment, find, answer, iout, room, encoder, reactive, strong, shortest, step, navigator, arxiv, navigate, random, path, evaluation, obstacle, embodiedqa, language, attention, preprint] [semantic, object, feature, utilize, three, segmentation] [learning, set, weighting, trained, target, loss, task, novel, learn, train, representation, training, distance]
@InProceedings{Wijmans_2019_CVPR,
  author = {Wijmans, Erik and Datta, Samyak and Maksymets, Oleksandr and Das, Abhishek and Gkioxari, Georgia and Lee, Stefan and Essa, Irfan and Parikh, Devi and Batra, Dhruv},
  title = {Embodied Question Answering in Photorealistic Environments With Point Cloud Perception},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Reasoning Visual Dialogs With Structural and Partial Observations
Zilong Zheng, Wenguan Wang, Siyuan Qi, Song-Chun Zhu


We propose a novel model to address the task of Visual Dialog which exhibits complex dialog structures. To obtain a reasonable answer based on the current question and the dialog history, the underlying semantic dependencies between dialog entities are essential. In this paper, we explicitly formalize this task as inference in a graphical model with partially observed nodes and unknown graph structures (relations in dialog). The given dialog entities are viewed as the observed nodes. The answer to a given question is represented by a node with a missing value. We first introduce an Expectation Maximization algorithm to infer both the underlying dialog structures and the missing node values (desired answers). Based on this, we proceed to propose a differentiable graph neural network (GNN) solution that approximates this process. Experimental results on the VisDial and VisDial-Q datasets show that our model outperforms comparative methods. It is also observed that our method can infer the underlying dialog structure for better dialog reasoning.
[graph, hidden, current, dataset, state, represented, recurrent, joint, passing, human, outperforms] [algorithm, underlying, wij, approach, loop, inferred, compute] [image, based, missing, quantitative, proposed, method] [neural, structure, network, inference, table, convolutional, performance, deep, process, represents, higher] [dialog, visual, node, question, model, answer, visdial, message, unobserved, observed, caption, language, queried, infer, evaluation, attention, van, history, captioning, answering, hvj, partial, represent, belief, dhruv, devi, reasoning, natural, potential, anton, den, goal] [edge, semantic, feature, round, fully] [learning, embedding, update, task, data, discriminative, training, log, function, distribution, gnn]
@InProceedings{Zheng_2019_CVPR,
  author = {Zheng, Zilong and Wang, Wenguan and Qi, Siyuan and Zhu, Song-Chun},
  title = {Reasoning Visual Dialogs With Structural and Partial Observations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Recursive Visual Attention in Visual Dialog
Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, Ji-Rong Wen


Visual dialog is a challenging vision-language task, which requires the agent to answer multi-round questions about an image. It typically needs to address two major problems: (1) How to answer visually-grounded questions, which is the core challenge in visual question answering (VQA); (2) How to infer the co-reference between questions and the dialog history. An example of visual co-reference is: pronouns (e.g., "they") in the question (e.g., "Are they on or off?") are linked with nouns (e.g., "lamps") appearing in the dialog history (e.g., "How many lamps are there?") and the object grounded in the image. In this work, to resolve the visual co-reference for visual dialog, we propose a novel attention mechanism called Recursive Visual Attention (RvA). Specifically, our dialog agent browses the dialog history until the agent has sufficient confidence in the visual co-reference resolution, and refines the visual attention recursively. The quantitative and qualitative experimental results on the large-scale VisDial v0.9 and v1.0 datasets demonstrate that the proposed RvA not only outperforms the state-of-the-art methods, but also achieves reasonable recursion and interpretable attention maps without additional annotations. The code is available at https://github.com/yuleiniu/rva.
[dataset, recursively, current, second] [computer, vision, pattern, discrete, algorithm, continuous] [conference, image, ieee, figure, proposed, qualitative, resolution, based, resolve, generative, paired, color, method, transformation] [recursive, neural, represents, table, validation, performance, replaced] [visual, attention, question, dialog, agent, history, visdial, language, model, rva, recursion, word, answer, ambiguous, termination, nfer, mechanism, att, reasonable, review, natural, arxiv, preprint, answering, represent, return, hanwang, referring, grounding, caption, coatt, corefnmn, example] [feature, module, region, round, refine, illustrated, european] [pair, function, set, discriminative, retrieval, softmax, training, trained, representation, rank, datasets]
@InProceedings{Niu_2019_CVPR,
  author = {Niu, Yulei and Zhang, Hanwang and Zhang, Manli and Zhang, Jianhong and Lu, Zhiwu and Wen, Ji-Rong},
  title = {Recursive Visual Attention in Visual Dialog},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Two Body Problem: Collaborative Visual Task Completion
Unnat Jain, Luca Weihs, Eric Kolve, Mohammad Rastegari, Svetlana Lazebnik, Ali Farhadi, Alexander G. Schwing, Aniruddha Kembhavi


Collaboration is a necessary skill to perform tasks that are beyond one agent's capabilities. Addressed extensively in both conventional and modern AI, multi-agent collaboration has often been studied in the context of simple grid worlds. We argue that there are inherently visual aspects to collaboration which should be studied in visually rich environments. A key element in collaboration is communication that can be either explicit, through messages, or implicit, through perception of the other agents and the visual world. Learning to collaborate in a visual environment entails learning (1) to perform the task, (2) when and what to communicate, and (3) how to act based on these communications and the perception of the visual world. In this paper we study the problem of learning to collaborate directly from pixels in AI2-THOR and demonstrate the benefits of explicit and implicit modes of communication to perform visual tasks. Refer to our project page for more details: https://prior.allenai.org/projects/two-body-problem
[action, collaboration, perform, work, time, multiple, motion, planning, jointly, joint] [implicit, explicit, single, well, autonomous, relative, indoor, approach, note, position, analysis, robotics] [expert, mapping, ieee, figure, collaborative, study, based, unconstrained, prior] [deep, weight, accuracy, number, better, mobile] [communication, visual, agent, reinforcement, episode, pick, pickup, environment, policy, constrained, failed, navigation, belief, language, perception, consider, message, navigate, vocabulary, model, communicate, cooperative, arxiv, preprint, multiagent, rotate, attempt] [object, map, round, refinement] [learning, task, unseen, setting, large, learned, target, train, learn, distance, loss, cross, entropy, minimum, oracle]
@InProceedings{Jain_2019_CVPR,
  author = {Jain, Unnat and Weihs, Luca and Kolve, Eric and Rastegari, Mohammad and Lazebnik, Svetlana and Farhadi, Ali and Schwing, Alexander G. and Kembhavi, Aniruddha},
  title = {Two Body Problem: Collaborative Visual Task Completion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
Drew A. Hudson, Christopher D. Manning


We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets. We have developed a strong and robust question engine that leverages Visual Genome scene graph structures to create 22M diverse reasoning questions, which all come with functional programs that represent their semantics. We use the programs to gain tight control over the answer distribution and present a new tunable smoothing technique to mitigate question biases. Accompanying the dataset is a suite of new metrics that evaluate essential qualities such as consistency, grounding and plausibility. A careful analysis is performed for baselines as well as state-of-the-art models, providing fine-grained results for different question types and topologies. Whereas a blind LSTM obtains a mere 42.1%, and strong VQA models achieve 54.1%, human performance tops at 89.3%, offering ample opportunity for new research to explore. We hope GQA will provide an enabling resource for the next generation of models with enhanced robustness, improved consistency, and deeper semantic understanding of vision and language.
[dataset, recognition, graph, human, structural, multiple, lstm] [scene, computer, vision, pattern, functional, left, analysis, provide, well, good, associated] [image, conference, ieee, figure, color, based, control] [performance, accuracy, achieve, binary] [question, visual, answer, gqa, vqa, reasoning, compositional, model, answering, understanding, red, apple, generation, arxiv, type, natural, logical, entailment, preprint, linguistic, balancing, attention, engine, diverse, grounding, educated, relevant, making, refer, white, mac, genome, create, evaluate, green, grounded, involve, development, serve] [semantic, object, relation] [distribution, set, metric, datasets, test, open, pair, representation, knowledge, task, measure]
@InProceedings{Hudson_2019_CVPR,
  author = {Hudson, Drew A. and Manning, Christopher D.},
  title = {GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Text2Scene: Generating Compositional Scenes From Textual Descriptions
Fuwen Tan, Song Feng, Vicente Ordonez


In this paper, we propose Text2Scene, a model that generates various forms of compositional scene representations from natural language descriptions. Unlike recent works, our method does NOT use Generative Adversarial Networks (GANs). Text2Scene instead learns to sequentially generate objects and their attributes (location, size, appearance, etc.) at every time step by attending to different parts of the input text and the current status of the generated scene. We show that under minor modifications, the proposed framework can handle the generation of different forms of scene representations, including cartoon-like scenes, object layouts corresponding to real images, and synthetic images. Our method is not only competitive with state-of-the-art GAN-based methods on automatic metrics and superior according to human judgments, but also has the advantage of producing interpretable results.
[recognition, human, work, current, dataset, time, framework, predicting, state, learns] [scene, computer, vision, corresponding, pattern, international, predicts, defined] [image, conference, synthetic, input, attribute, ieee, method, figure, patch, generative, synthesis, real, reference, proposed, qualitative] [table, neural, network, processing, convolutional] [model, generation, abstract, text, generated, evaluation, language, generate, generating, attention, adversarial, vector, visual, mike, holding, jenny, textual, automatic, true, natural, canvas, decoder, attngan, vicente, encoder, caption, lawrence, compositional, generates] [object, semantic, spatial, three, coco, layout, predicted, location, module, foreground, context] [learning, set, embedding, task, test, retrieval, representation]
@InProceedings{Tan_2019_CVPR,
  author = {Tan, Fuwen and Feng, Song and Ordonez, Vicente},
  title = {Text2Scene: Generating Compositional Scenes From Textual Descriptions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
From Recognition to Cognition: Visual Commonsense Reasoning
Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi


Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (~65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
[dataset, recognition, work, human, multiple, perform, video, challenge, lstm, antonio, social, overview] [computer, vision, pattern, well, international, provide, additional, scene] [conference, ieee, image, figure] [performance, neural, entire, deep, computational, processing, inference] [visual, language, answer, model, question, natural, reasoning, commonsense, query, understanding, correct, rationale, answering, adversarial, vcr, bert, machine, requires, grounding, vqa, pointing, movie, referring, text, empirical, choice, justification, attention, devi, christopher, arxiv, preprint, telling, meaning, grounded, word] [response, object, context, european, detection, three] [task, learning, representation, trevor, test, avoid, set]
@InProceedings{Zellers_2019_CVPR,
  author = {Zellers, Rowan and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
  title = {From Recognition to Cognition: Visual Commonsense Reasoning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation
Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, Zsolt Kira


As deep learning continues to make progress for challenging perception tasks, there is increased interest in combining vision, language, and decision-making. Specifically, the Vision and Language Navigation (VLN) task involves navigating to a goal purely from language instructions and visual information without explicit knowledge of the goal. Recent successful approaches have made inroads toward achieving good success rates for this task but rely on beam search, which thoroughly explores a large number of trajectories and is unrealistic for applications such as robotics. In this paper, inspired by the intuition of viewing the problem as search on a navigation graph, we propose to use a progress monitor developed in prior work as a learnable heuristic for search. We then propose two modules incorporated into an end-to-end architecture: 1) A learned mechanism to perform backtracking, which decides whether to continue moving forward or roll back to a previous state (Regret Module) and 2) A mechanism to help the agent decide which direction to go next by showing directions that are visited and their associated progress estimate (Progress Marker). Combined, the proposed approach significantly outperforms current state-of-the-art methods using greedy action selection, with 5% absolute improvement on the test server in success rates, and more importantly 8% on success rates normalized by the path length.
[action, previous, forward, current, time, perform, work, state, graph, trajectory] [direction, estimated, roll, vision, note, estimate, computer, robotics, allow] [marker, proposed, figure, synthetic, conference, method, prior, difference, real] [search, selection, rate, table, performance, number, better, explore, neural] [agent, progress, navigation, monitor, rollback, regretful, visual, goal, regret, visited, navigable, step, success, decide, language, mechanism, textual, grounding, grounded, ppm, osr, arxiv, preprint, beam, greedy, decides, instruction, path, embodied, attention, vector, robot] [location, module, feature, heuristic, improvement, propose] [learning, learned, spl, task, learn, data, existing, unseen, training, trained, set, loss, test]
@InProceedings{Ma_2019_CVPR,
  author = {Ma, Chih-Yao and Wu, Zuxuan and AlRegib, Ghassan and Xiong, Caiming and Kira, Zsolt},
  title = {The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Tactical Rewind: Self-Correction via Backtracking in Vision-And-Language Navigation
Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, Siddhartha Srinivasa


We present the Frontier Aware Search with backTracking (FAST) Navigator, a general framework for action decoding, that achieves state-of-the-art results on the 2018 Room-to-Room (R2R) Vision-and-Language navigation challenge. Given a natural language instruction and photo-realistic image views of a previously unseen environment, the agent was tasked with navigating from source to target location as quickly as possible. While all current approaches make local action decisions or score entire trajectories using beam search, ours balances local and global signals when exploring an unobserved environment. Importantly, this lets us act greedily but use global signals to backtrack when necessary. Applying the FAST framework to existing state-of-the-art models achieved a 17% relative gain and an absolute 6% gain on Success rate weighted by Path Length.
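The frontier-with-backtracking idea can be illustrated with a generic best-first loop: a priority queue of partial trajectories is ranked by a global score, children are proposed by a local signal, and popping the queue may jump back to an earlier frontier node. The sketch below is schematic only; the scoring functions and graph interface are placeholders, not the FAST model, in which the local signal would come from the policy's action distribution and the global score from a trained progress/scoring model.

import heapq
import itertools

def frontier_search(start, neighbors, local_keep, global_score, is_goal, max_expansions=100):
    """Greedily expand the best-scoring partial path, allowing backtracking to
    any previously stored frontier entry when the global score prefers it."""
    counter = itertools.count()                       # tie-breaker so paths are never compared
    frontier = [(-global_score([start]), next(counter), [start])]
    visited = {start}
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, path = heapq.heappop(frontier)          # may jump back to an older frontier node
        if is_goal(path[-1]):
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in visited and local_keep(path[-1], nxt):
                visited.add(nxt)
                heapq.heappush(frontier, (-global_score(path + [nxt]), next(counter), path + [nxt]))
    return None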
[action, trajectory, work, current, previous, framework, sequence, frontier, fusion, time, speaker, future, long, hidden] [local, approach, vision, computer, international, compute, algorithm, panoramic] [conference, figure, method, completed] [fast, search, best, table, neural, performance, compare, rate, explore, validation, top] [agent, partial, beam, model, navigation, progress, visual, success, greedy, language, instruction, candidate, backtrack, node, vln, queue, goal, smna, natural, monitor, visited, length, attention, peaker, backtracking, ollower, sum, path, navigator, navigate, step, ppm, complete, simple] [global, score, scoring, three, logits, final, location] [learning, spl, existing, unseen, logit, trained, log, set]
@InProceedings{Ke_2019_CVPR,
  author = {Ke, Liyiming and Li, Xiujun and Bisk, Yonatan and Holtzman, Ari and Gan, Zhe and Liu, Jingjing and Gao, Jianfeng and Choi, Yejin and Srinivasa, Siddhartha},
  title = {Tactical Rewind: Self-Correction via Backtracking in Vision-And-Language Navigation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Learn How to Learn: Self-Adaptive Visual Navigation Using Meta-Learning
Mitchell Wortsman, Kiana Ehsani, Mohammad Rastegari, Ali Farhadi, Roozbeh Mottaghi


Learning is an inherently continuous phenomenon. When humans learn a new task there is no explicit distinction between training and inference. As we learn a task, we keep learning about it while performing the task. What we learn and how we learn it varies during different stages of learning. Learning how to learn and adapt is a key property that enables us to generalize effortlessly to new settings. This is in contrast with conventional settings in machine learning where a trained model is frozen during inference. In this paper we study the problem of learning to learn at both training and test time in the context of visual navigation. A fundamental challenge in navigation is generalization to unseen scenes. In this paper we propose a self-adaptive visual navigation method (SAVN) which learns to adapt to new environments without any explicit supervision. Our solution is a meta-reinforcement learning approach where an agent learns a self-supervised interaction loss that encourages effective navigation. Our experiments, performed in the AI2-THOR framework, show major improvements in both success rate and SPL for visual navigation in novel scenes. Our code and data are available at: https://github.com/allenai/savn.
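The test-time self-adaptation described above can be illustrated with a single MAML-style inner gradient step on a self-supervised loss before acting. Everything in the sketch (the stand-in policy, the placeholder interaction loss, the single adaptation step, the step size) is an assumption for illustration, not the SAVN implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Linear(16, 4)        # stand-in policy: observation features -> action logits
alpha = 0.1                      # assumed inner-loop (adaptation) step size

def interaction_loss(params, obs):
    # Placeholder for the learned self-supervised loss; here just an L2 penalty on the logits.
    return (F.linear(obs, params["weight"], params["bias"]) ** 2).mean()

def adapted_forward(obs):
    """Take one gradient step on the self-supervised loss, then act with the adapted weights."""
    params = dict(policy.named_parameters())
    grads = torch.autograd.grad(interaction_loss(params, obs), list(params.values()),
                                create_graph=True)
    adapted = {k: v - alpha * g for (k, v), g in zip(params.items(), grads)}
    return F.linear(obs, adapted["weight"], adapted["bias"])

print(adapted_forward(torch.randn(5, 16)).shape)   # adapted action logits for 5 observations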
[action, interaction, learns, time, state, perform, prediction, trajectory, outperforms, work, abhinav] [approach, scene, denote, algorithm, explicit, equation, respect, problem, note, optimal] [method, figure, image, based, traditional] [network, inference, deep, gradient, rate, sgd, number, experiment] [agent, navigation, visual, success, int, model, lnav, navigate, consider, lint, policy, reinforcement, exploration, episode, ttrain, termination, pieter, environment, goal, length, access, step, roozbeh] [object, baseline, propose, semantic, context, supervision] [learning, learn, loss, training, adapt, objective, target, task, spl, class, adaptation, learned, testing, savn, maml, set, test, sergey, function, supervised, domain, minimizing, large, unseen]
@InProceedings{Wortsman_2019_CVPR,
  author = {Wortsman, Mitchell and Ehsani, Kiana and Rastegari, Mohammad and Farhadi, Ali and Mottaghi, Roozbeh},
  title = {Learning to Learn How to Learn: Self-Adaptive Visual Navigation Using Meta-Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
High Flux Passive Imaging With Single-Photon Sensors
Atul Ingle, Andreas Velten, Mohit Gupta


Single-photon avalanche diodes (SPADs) are an emerging technology with a unique capability of capturing individual photons with high timing precision. SPADs are being used in several active imaging systems (e.g., fluorescence lifetime microscopy and LiDAR), albeit mostly limited to low photon flux settings. We propose passive free-running SPAD (PF-SPAD) imaging, an imaging modality that uses SPADs for capturing 2D intensity images with unprecedented dynamic range under ambient lighting, without any active light source. Our key observation is that the precise inter-photon timing measured by a SPAD can be used for estimating scene brightness under ambient lighting conditions, even for very bright scenes. We develop a theoretical model for PF-SPAD imaging, and derive a scene brightness estimator based on the average time of darkness between successive photons detected by a PF-SPAD pixel. Our key insight is that due to the stochastic nature of photon arrivals, this estimator does not suffer from a hard saturation limit. Coupled with high sensitivity at low flux, this enables a PF-SPAD pixel to measure a wide range of scene brightnesses, from very low to very high, thereby achieving extreme dynamic range. We demonstrate an improvement of over 2 orders of magnitude over conventional sensors by imaging scenes spanning a dynamic range of 10^6:1.
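The estimator idea, recovering flux from the average time of darkness between detections, can be illustrated with a toy calculation. The sketch below ignores quantum efficiency, dark counts, and quantization, so it is only a simplified illustration, not the paper's exact estimator.

def estimate_flux(num_detections, exposure_time, dead_time):
    """Photon flux (photons/s) from the mean darkness time between detections."""
    darkness_total = exposure_time - num_detections * dead_time   # time the pixel was not dead
    return num_detections / darkness_total

# Example: 50,000 detections during a 10 ms exposure with a 150 ns dead time.
print(estimate_flux(5e4, 10e-3, 150e-9))   # roughly 2e7 photons/s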
[time, dynamic, brightness, capture, nature, key] [flux, photon, spad, range, incident, sensor, exposure, scene, estimator, supplementary, light, note, single, estimated, passive, limit, theoretical, spads, well, asymptotic, vision, equation, additional, simultaneously, cmos, camera, active, derive, monotonically, pfspad, total, constant, afterpulsing] [noise, high, imaging, image, pixel, figure, saturation, dark, bright, ieee, hdr, poisson, resolution, method, captured] [dead, quantization, variance, low, full, fixed, number, hardware, wide, capacity, adaptive, increasing] [machine, model, text] [response, curve, detection, detected, extreme, average, count, counting] [conventional, snr, shot, soft, function, large, experimental, prototype, hard]
@InProceedings{Ingle_2019_CVPR,
  author = {Ingle, Atul and Velten, Andreas and Gupta, Mohit},
  title = {High Flux Passive Imaging With Single-Photon Sensors},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Photon-Flooded Single-Photon 3D Cameras
Anant Gupta, Atul Ingle, Andreas Velten, Mohit Gupta


Single-photon avalanche diodes (SPADs) are starting to play a pivotal role in the development of photon-efficient, long-range LiDAR systems. However, due to non-linearities in their image formation model, a high photon flux (e.g., due to strong sunlight) leads to distortion of the incident temporal waveform, and potentially, large depth errors. Operating SPADs in low flux regimes can mitigate these distortions, but, often requires attenuating the signal and thus, results in low signal-to-noise ratio. In this paper, we address the following basic question: what is the optimal photon flux that a SPAD-based LiDAR should be operated in? We derive a closed form expression for the optimal flux, which is quasi-depth-invariant, and depends on the ambient light strength. The optimal flux is lower than what a SPAD typically measures in real world scenarios, but surprisingly, considerably higher than what is conventionally suggested for avoiding distortions. We propose a simple, adaptive approach for achieving the optimal flux by attenuating incident flux based on an estimate of ambient light strength. Using extensive simulations and a hardware prototype, we show that the optimal flux criterion holds for several depth estimators, under a wide range of illumination conditions.
[signal, time, work, arrival, version] [depth, flux, optimal, ambient, attenuation, photon, incident, light, laser, histogram, spad, bkg, scene, lidar, sig, range, brc, error, waveform, total, pulse, formation, estimation, reconstruction, corresponding, avalanche, distortion, single, receptivity, opt, spads, considerably, estimate, measured, theoretical, estimated, illumination, pulsed, sensor, analysis, problem, optimality, attenuating, derive, approach, timing, shape, active, vision] [high, imaging, figure, based, background, acquired, ieee, image, proposed, correction, fluorescence] [low, number, factor, higher, computational, ith, criterion, wide, achieves, hardware, fixed, magnitude, precision, adaptive, dead] [model, strong, probability] [bin, extreme, level, average] [source, large, uniform]
@InProceedings{Gupta_2019_CVPR,
  author = {Gupta, Anant and Ingle, Atul and Velten, Andreas and Gupta, Mohit},
  title = {Photon-Flooded Single-Photon 3D Cameras},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Acoustic Non-Line-Of-Sight Imaging
David B. Lindell, Gordon Wetzstein, Vladlen Koltun


Non-line-of-sight (NLOS) imaging enables unprecedented capabilities in a wide range of applications, including robotic and machine vision, remote sensing, autonomous vehicle navigation, and medical imaging. Recent approaches to solving this challenging problem employ optical time-of-flight imaging systems with highly sensitive time-resolved photodetectors and ultra-fast pulsed lasers. However, despite recent successes in NLOS imaging using these systems, widespread implementation and adoption of the technology remains a challenge because of the requirement for specialized, expensive hardware. We introduce acoustic NLOS imaging, which is orders of magnitude less expensive than most optical systems and captures hidden 3D geometry at longer ranges with shorter acquisition times compared to state-of-the-art optical methods. Inspired by hardware setups used in radar and algorithmic approaches to model and invert wave-based image formation models developed in the seismic imaging community, we demonstrate a new approach to seeing around corners.
[signal, hidden, optical, sound, time, capture, microphone, radar, speaker, tracking, audio] [acoustic, reconstruction, nlos, scene, confocal, array, scattering, diffuse, transmit, corner, surface, measurement, specular, falloff, volume, dmo, wall, position, seismic, linear, reflector, moveout, midpoint, range, geometry, shape, fourier, additional, normal, nmo, letter, approach, bandwidth, supplementary, light, limited, khz, direction, fmcw, chirp, formation, wave, squared] [imaging, resolution, correction, captured, frequency, image, figure, flat, reconstructed, acquisition, demonstrate, reconstruct] [hardware, lateral, magnitude, lct, compared] [system, model, iterative] [object, receive, response, offset, spatial, roughly] [distance, measure, setup, dimension]
@InProceedings{Lindell_2019_CVPR,
  author = {Lindell, David B. and Wetzstein, Gordon and Koltun, Vladlen},
  title = {Acoustic Non-Line-Of-Sight Imaging},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Steady-State Non-Line-Of-Sight Imaging
Wenzheng Chen, Simon Daneau, Fahim Mannan, Felix Heide


Conventional intensity cameras recover objects in the direct line-of-sight of the camera, whereas occluded scene parts are considered lost in this process. Non-line-of-sight imaging (NLOS) aims at recovering these occluded objects by analyzing their indirect reflections on visible scene surfaces. Existing NLOS methods temporally probe the indirect light transport to unmix light paths based on their travel time, which mandates specialized instrumentation that suffers from low photon efficiency, high cost, and mechanical scanning. We depart from temporal probing and demonstrate steady-state NLOS imaging using conventional intensity sensors and continuous illumination. Instead of assuming perfectly isotropic scattering, the proposed method exploits directionality in the hidden surface reflectance, resulting in (small) spatial variation of their indirect reflections for varying illumination. To tackle the shape-dependence of these variations, we propose a trainable architecture which learns to map diffuse indirect reflections to scene reflectance using only synthetic training data. Relying on consumer color image sensors, with high fill factor, high quantum efficiency and low read-out noise, we demonstrate high-fidelity color NLOS imaging for scene configurations tackled before with picosecond time resolution.
[temporal, hidden, time, temporally, work, amplitude] [light, indirect, scene, nlos, diffuse, wall, computer, specular, direct, transient, single, photon, visible, surface, reflectance, corresponding, vision, geometry, illumination, planar, position, direction, reconstruction, spad, volume, pattern, recovering, sensor, allows, shape, albedo, point, plane, normal, view, international, occluded, assuming, monocular, measurement, practical, contrast, case] [imaging, image, proposed, method, conference, high, acm, ieee, intensity, recover, demonstrate, resolution, color, based, coded, reflection, fill, acquire] [low, network, architecture, efficiency, deep, identical] [model] [object, spatial, propose, map] [transport, conventional, setup, training, function, existing, sample]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Wenzheng and Daneau, Simon and Mannan, Fahim and Heide, Felix},
  title = {Steady-State Non-Line-Of-Sight Imaging},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Theory of Fermat Paths for Non-Line-Of-Sight Shape Reconstruction
Shumian Xin, Sotiris Nousias, Kiriakos N. Kutulakos, Aswin C. Sankaranarayanan, Srinivasa G. Narasimhan, Ioannis Gkioulekas


We present a novel theory of Fermat paths of light between a known visible scene and an unknown object not in the line of sight of a transient camera. These light paths either obey specular reflection or are reflected by the object's boundary, and hence encode the shape of the hidden object. We prove that Fermat paths correspond to discontinuities in the transient measurements. We then derive a novel constraint that relates the spatial derivatives of the path lengths at these discontinuities to the surface normal. Based on this theory, we present an algorithm, called Fermat Flow, to estimate the shape of the non-line-of-sight object. Our method allows, for the first time, accurate shape recovery of complex objects, ranging from diffuse to specular, that are hidden around the corner as well as hidden behind a diffuser. Finally, our approach is agnostic to the particular technology used for transient imaging. As such, we demonstrate mm-scale shape recovery from pico-second scale transients using a SPAD and ultrafast laser, as well as micron-scale reconstruction from femto-second scale transients using interferometry. We believe our work is a significant advance over the state-of-the-art in non-line-of-sight imaging.
[hidden, temporal, coin, flow, work, perform] [transient, fermat, nlos, surface, visible, point, reconstruction, specular, pathlength, proposition, light, shape, theory, local, scene, equation, discontinuity, corresponding, additionally, pathlengths, computer, depth, normal, tangent, measured, diffuse, confocal, sphere, brdf, ramesh, scanning, assume, note, provide, matthew, sight, corner, photon, geometric, case, implies, proof, andreas, gordon, well, spad, ultrafast, active, reconstructing, backprojection] [imaging, figure, based, reconstruct, acm, intensity, reconstructed, james, produce] [gradient, called, fast] [path, vector, system, consider, procedure] [object, boundary, detector, branch, including] [source, function, sph, set, distance, specific]
@InProceedings{Xin_2019_CVPR,
  author = {Xin, Shumian and Nousias, Sotiris and Kutulakos, Kiriakos N. and Sankaranarayanan, Aswin C. and Narasimhan, Srinivasa G. and Gkioulekas, Ioannis},
  title = {A Theory of Fermat Paths for Non-Line-Of-Sight Shape Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
End-To-End Projector Photometric Compensation
Bingyao Huang, Haibin Ling


Projector photometric compensation aims to modify a projector input image such that it can compensate for disturbance from the appearance of the projection surface. In this paper, for the first time, we formulate the compensation problem as an end-to-end learning problem and propose a convolutional neural network, named CompenNet, to implicitly learn the complex compensation function. CompenNet consists of a UNet-like backbone network and an autoencoder subnet. Such an architecture encourages rich multi-level interactions between the camera-captured projection surface image and the input image, and thus captures both photometric and environment information of the projection surface. In addition, the visual details and interaction information are carried to deeper layers along the multi-level skip convolution layers. The architecture is of particular importance for the projector compensation task, for which only a small training dataset is allowed in practice. Another contribution we make is a novel evaluation benchmark, which is independent of system setup and thus quantitatively verifiable. Such a benchmark was not previously available, to the best of our knowledge, because conventional evaluation requires the hardware system to actually project the final results. Our key idea, motivated by our end-to-end problem formulation, is to use a reasonable surrogate to avoid such a projection process so as to be setup-independent. Our method is evaluated carefully on the benchmark, and the results show that our end-to-end learning solution outperforms the state of the art both qualitatively and quantitatively by a significant margin.
[complex, capture, consists, outperforms, work, time] [surface, projection, camera, textured, problem, photometric, solution, projected, formulation, rmse, reflectance, volume, corresponding] [image, compensation, projector, compennet, input, method, ssim, tps, proposed, color, radiometric, captured, uncompensated, pixel, comparison, psnr, mapping, clear, spectral, ieee, quantitatively, compensated, grundh, quantitative, compensate, figure] [skip, convolution, table, network, deep, process, size, convolutional, output, number, neural, architecture, conv] [model, evaluation, system, visual, find, named] [benchmark, global, feature, propose] [training, sampling, loss, set, function, data, learning, transfer, trained, train, setup, learn, surrogate, existing]
@InProceedings{Huang_2019_CVPR,
  author = {Huang, Bingyao and Ling, Haibin},
  title = {End-To-End Projector Photometric Compensation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Bringing a Blurry Frame Alive at High Frame-Rate With an Event Camera
Liyuan Pan, Cedric Scheerlinck, Xin Yu, Richard Hartley, Miaomiao Liu, Yuchao Dai


Event-based cameras can measure intensity changes (called 'events') with microsecond accuracy under high-speed motion and challenging lighting conditions. With the active pixel sensor (APS), the event camera allows simultaneous output of the intensity frames. However, the output images are captured at a relatively low frame-rate and often suffer from motion blur. A blurry image can be regarded as the integral of a sequence of latent images, while the events indicate the changes between the latent images. Therefore, we are able to model the blur-generation process by associating event data to a latent image. In this paper, we propose a simple and effective approach, the Event-based Double Integral (EDI) model, to reconstruct a high frame-rate, sharp video from a single blurry frame and its event data. The video generation is based on solving a simple non-convex optimization problem in a single scalar variable. Experimental results on both synthetic and real images demonstrate the superiority of our EDI model and optimization method in comparison to the state-of-the-art.
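The stated relationship, a blurry frame as the average of latent frames whose intensity ratios are given by exponentiated event sums, suggests a simple closed-form reconstruction of the reference latent frame. The discretization and the contrast-threshold value in the sketch below are assumptions, so this is an illustration of the idea rather than the paper's EDI solver.

import numpy as np

def latent_from_blur(blurry, accumulated_events, c=0.2):
    """blurry: HxW frame (average over the exposure); accumulated_events: list of
    HxW signed event sums from the reference time to each sampled time."""
    ratios = np.stack([np.exp(c * e) for e in accumulated_events])   # latent_i / latent_ref
    return blurry / ratios.mean(axis=0)                              # solve blur ~ latent_ref * mean(ratios)

H, W = 4, 4
events = [np.zeros((H, W)), np.ones((H, W)), 2 * np.ones((H, W))]
print(latent_from_blur(np.full((H, W), 0.5), events))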
[event, video, motion, time, sequence, dataset, dynamic, frame, jin, edi, flow, simultaneous, davis, optical, guillermo, davide] [reconstruction, single, camera, pattern, integral, vision, lighting, optimization, scene, exposure, estimation, estimate, chosen, problem, contrast, corresponding, approach, june, solving, provide] [image, intensity, deblurring, blurry, reconstructed, ieee, sharp, high, method, latent, blur, result, based, real, reconstruct, input, scheerlinck, figure, pixel, captured, double, recover, synthetic, psnr, proposed, lxy, deblurred] [low, deep, output, rate, best, neural, restore, performance, automatically, table] [model, generate, visual, simple, generation] [propose, baseline, edge, threshold] [data, learning, tao, log, suffer]
@InProceedings{Pan_2019_CVPR,
  author = {Pan, Liyuan and Scheerlinck, Cedric and Yu, Xin and Hartley, Richard and Liu, Miaomiao and Dai, Yuchao},
  title = {Bringing a Blurry Frame Alive at High Frame-Rate With an Event Camera},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Bringing Alive Blurred Moments
Kuldeep Purohit, Anshul Shah, A. N. Rajagopalan


We present a solution for the goal of extracting a video from a single motion blurred image to sequentially reconstruct the clear views of a scene as beheld by the camera during the time of exposure. We first learn motion representation from sharp videos in an unsupervised manner through training of a convolutional recurrent video autoencoder network that performs a surrogate task of video reconstruction. Once trained, it is employed for guided training of a motion encoder for blurred images. This network extracts embedded motion information from the blurred image to generate a sharp video in conjunction with the trained recurrent video decoder. As an intermediate step, we also design an efficient architecture that enables real-time single image deblurring and outperforms competing methods across all factors: accuracy, speed, and compactness. Experiments on real scenes and standard datasets demonstrate the superiority of our framework over the state-of-the-art and its ability to generate a plausible sequence of temporally consistent sharp frames.
[motion, video, frame, flow, recurrent, rvd, optical, dataset, extract, sequence, recognition, fed, convlstm, time, extracting, work, perform, rve] [computer, vision, single, pattern, scene, camera, estimated, estimate, corresponding, estimation, approach, exposure, well, ambiguity, reconstruction, problem] [blurred, image, sharp, deblurring, conference, ieee, bie, method, blur, proposed, figure, central, reconstruct, deblurred, transformation, deconvolution, blind, real, high] [network, convolutional, architecture, neural, layer, deep, design, performance, output, residual, cell, stride, block, processing] [encoder, decoder, generate, generation, generated, provided] [extraction, spatial, predicted, feature, challenging] [training, trained, learning, task, representation, loss, large, test, datasets]
@InProceedings{Purohit_2019_CVPR,
  author = {Purohit, Kuldeep and Shah, Anshul and Rajagopalan, A. N.},
  title = {Bringing Alive Blurred Moments},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Synthesize Motion Blur
Tim Brooks, Jonathan T. Barron


We present a technique for synthesizing a motion blurred image from a pair of unblurred images captured in succession. To build this system we motivate and design a differentiable "line prediction" layer to be used as part of a neural network architecture, with which we can learn a system to regress from image pairs to motion blurred images that span the capture time of the input image pair. Training this model requires an abundance of data, and so we design and execute a strategy for using frame interpolation techniques to generate a large-scale synthetic dataset of motion blurred images and their respective inputs. We additionally capture a high quality test set of real motion blurred images, synthesized from slow motion videos, with which we evaluate our model against several baseline techniques that can be used to synthesize motion blur. Our model produces higher accuracy output than our baselines, and is several orders of magnitude faster than baselines with competitive accuracy.
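The data-generation strategy, synthesizing blur by averaging frames interpolated between an image pair, can be sketched directly. The linear cross-fade below stands in for a real frame-interpolation method and is only a placeholder; the number of samples is also an assumption.

import numpy as np

def synthesize_blur(frame0, frame1, interpolate, n=16):
    """Average n interpolated frames between the pair to approximate motion blur."""
    ts = np.linspace(0.0, 1.0, n)
    return np.mean([interpolate(frame0, frame1, t) for t in ts], axis=0)

# Placeholder interpolator (a real pipeline would use an optical-flow-based one).
linear_fade = lambda a, b, t: (1 - t) * a + t * b

f0, f1 = np.zeros((8, 8)), np.ones((8, 8))
print(synthesize_blur(f0, f1, linear_fade).mean())   # ~0.5 for this toy pair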
[motion, frame, video, flow, optical, prediction, dataset, adjacent, sequence, averaged, work, duration, second, temporal, subject, predict, unblurred, capture, time] [technique, single, scene, camera, algorithm, linear, corresponding, well, estimation, require] [image, blur, blurred, input, interpolation, synthetic, figure, synthesize, real, produce, pixel, synthesizing, synthesized, synthesis, high, deblurring, tend, blurring, resolution, described, captured, sharp] [network, kernel, neural, layer, deep, output, magnitude, architecture, separable, andrew, convolutional, performance, compare, speed, fast] [model, generate, system, evaluate] [baseline, three, jonathan, predicted, average] [training, data, learning, triplet, test, set, learned, train, task, sampling, large, trained]
@InProceedings{Brooks_2019_CVPR,
  author = {Brooks, Tim and Barron, Jonathan T.},
  title = {Learning to Synthesize Motion Blur},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Underexposed Photo Enhancement Using Deep Illumination Estimation
Ruixing Wang, Qing Zhang, Chi-Wing Fu, Xiaoyong Shen, Wei-Shi Zheng, Jiaya Jia


This paper presents a new neural network for enhancing underexposed photos. Instead of directly learning an image-to-image mapping as in previous work, we introduce intermediate illumination in our network to associate the input with the expected enhancement result, which augments the network's capability to learn complex photographic adjustment from expert-retouched input/output image pairs. Based on this model, we formulate a loss function that adopts constraints and priors on the illumination, prepare a new dataset of 3,000 underexposed image pairs, and train the network to effectively learn a rich variety of adjustments for diverse lighting conditions. By these means, our network is able to recover clear details, distinct contrast, and natural color in the enhancement results. We perform extensive experiments on the benchmark MIT-Adobe FiveK dataset and our new dataset, and show that our network effectively handles images that were previously challenging.
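Using a predicted illumination map as the intermediate typically amounts to a Retinex-style division of the input by the illumination. The sketch below assumes that formulation; the clipping and epsilon are added only for numerical safety, and in the actual system the illumination map is predicted by a network rather than given.

import numpy as np

def enhance(image, illumination, eps=1e-4):
    """Brighten an underexposed image by dividing by a per-pixel illumination map."""
    illumination = np.clip(illumination, eps, 1.0)   # dark regions correspond to small illumination
    return np.clip(image / illumination, 0.0, 1.0)

underexposed = np.full((2, 2, 3), 0.1)
predicted_illum = np.full((2, 2, 3), 0.25)
print(enhance(underexposed, predicted_illum))        # brightened to ~0.4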
[dataset, work] [illumination, contrast, adjustment, smoothness, reconstruction, lighting, local, reflectance, approach, corresponding, directly, note, well] [image, underexposed, color, figure, enhancement, method, photo, input, result, fivek, hdrnet, dpe, jiep, comparison, photographic, enhancing, mapping, recover, user, bilateral, study, based, clear, psnr, prepare, variety, vivid, produce, prior, ssim, rating, ieee, acm, fail] [network, deep, design, capability, effective] [natural, visual, model, rich, diverse] [enhanced, global, challenging, enhance, feature, three, benchmark, map] [loss, learning, learn, function, effectively, existing, test, train]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Ruixing and Zhang, Qing and Fu, Chi-Wing and Shen, Xiaoyong and Zheng, Wei-Shi and Jia, Jiaya},
  title = {Underexposed Photo Enhancement Using Deep Illumination Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Blind Visual Motif Removal From a Single Image
Amir Hertz, Sharon Fogel, Rana Hanocka, Raja Giryes, Daniel Cohen-Or


Many images shared over the web include overlaid objects, or visual motifs, such as text, symbols or drawings, which add a description or decoration to the image. For example, decorative text that specifies where the image was taken repeatedly appears across a variety of different images. Often, the recurring visual motif is semantically similar, yet differs in location, style and content (e.g. text placement, font and letters). This work proposes a deep learning based technique for blind removal of such objects. In the blind setting, the location and exact geometry of the motif are unknown. Our approach simultaneously estimates which pixels contain the visual motif, and synthesizes the underlying latent image. It is applied to a single input image, without any user assistance in specifying the location of the motif, achieving state-of-the-art results for blind removal of both opaque and semi-transparent visual motifs.
[dataset, work, previous] [single, reconstruction, computer, supplementary, constant, estimating, international, vision, estimated, ground, truth] [image, motif, removal, figure, watermark, method, corrupted, blind, ieee, latent, background, inpainting, conference, input, reconstructed, reflection, psnr, remove, opacity, ssim, user, removing, separate, matte, pix, study, emojis, overlaid, reconstruct, separation, gray, blending] [network, size, deep, table, convolutional, original, neural, addition, applied, binary, output, architecture] [visual, embedded, decoder, random, text, font, encoder] [mask, location, three, baseline, final, branch, spatial, ablation, lmask] [test, shared, unseen, training, loss, trained, lim, randomly, set, learning, train]
@InProceedings{Hertz_2019_CVPR,
  author = {Hertz, Amir and Fogel, Sharon and Hanocka, Rana and Giryes, Raja and Cohen-Or, Daniel},
  title = {Blind Visual Motif Removal From a Single Image},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Non-Local Meets Global: An Integrated Paradigm for Hyperspectral Denoising
Wei He, Quanming Yao, Chao Li, Naoto Yokoya, Qibin Zhao


Non-local low-rank tensor approximation has been developed as a state-of-the-art method for hyperspectral image (HSI) denoising. Unfortunately, while the denoising performance of these methods benefits little from additional spectral bands, their running time increases significantly. In this paper, we claim that the HSI lies in a global spectral low-rank subspace, and the spectral subspaces of each full-band patch group should lie in this global low-rank subspace. This motivates us to propose a unified spatial-spectral paradigm for HSI denoising. As the new model is hard to optimize, an efficient algorithm motivated by alternating minimization is developed. This is done by first learning a low-dimensional orthogonal basis and the related reduced image from the noisy HSI. Then, non-local low-rank denoising and iterative regularization are developed to refine the reduced image and orthogonal basis, respectively. Finally, experiments on both synthetic and real datasets demonstrate the superiority of the proposed method over state-of-the-art HSI denoising approaches.
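The first stage, learning an orthogonal spectral basis and a reduced image, can be sketched with a plain SVD projection of the band-unfolded cube. The rank, the SVD-based basis estimate, and the placeholder denoiser below are assumptions; the paper's non-local low-rank denoising and iterative regularization are omitted.

import numpy as np

def subspace_project(hsi, rank=8, denoise=lambda z: z):
    """Project an HxWxB cube onto a low-dimensional spectral subspace and back."""
    h, w, bands = hsi.shape
    Y = hsi.reshape(-1, bands)                    # (pixels, bands)
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    E = Vt[:rank].T                               # orthogonal spectral basis (bands, rank)
    Z = denoise(Y @ E)                            # reduced image; a real denoiser goes here
    return (Z @ E.T).reshape(h, w, bands)

cube = np.random.rand(16, 16, 31)
out = subspace_project(cube)
print(out.shape, np.abs(out - cube).mean())       # rank-8 spectral approximation error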
@InProceedings{He_2019_CVPR,
  author = {He, Wei and Yao, Quanming and Li, Chao and Yokoya, Naoto and Zhao, Qibin},
  title = {Non-Local Meets Global: An Integrated Paradigm for Hyperspectral Denoising},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Neural Rerendering in the Wild
Moustafa Meshry, Dan B. Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, Ricardo Martin-Brualla


We explore total scene capture --- recording, modeling, and rerendering a scene under varying appearance such as season and time of day. Starting from Internet photos of a tourist landmark, we apply traditional 3D reconstruction to register the photos and approximate the scene as a point cloud. For each photo, we render the scene points into a deep framebuffer, and train a deep neural network to learn the mapping of these initial renderings to the actual photos. This rerendering network also takes as input a latent appearance vector and a semantic mask indicating the location of transient objects like pedestrians. The model is evaluated on several datasets of publicly available images spanning a broad range of illumination conditions. We create short videos that demonstrate realistic manipulation of the image viewpoint, appearance, and semantic labels. We also compare results to prior work on scene reconstruction from Internet photos.
[capture, buffer, modeling, frame, work, dataset, jointly, complex, framework] [scene, rendering, reconstruction, transient, viewpoint, lighting, ground, point, approach, night, total, internet, geometry, truth, illumination, allows, supplementary, note, dense, rendered] [appearance, image, input, rerendering, figure, photo, staged, latent, realistic, translation, conditioning, synthesis, captured, method, style, generative, interpolating, smoothly, color, day, trevi, rerender, reconstructed] [network, neural, deep, output, better, san, vgg] [model, multimodal, encoder, generate, adversarial, vector, system, create, generated, simple] [semantic, baseline, segmentation, labeling, location, mask] [training, proxy, loss, train, representation, trained, learn, datasets, large, space, transfer]
@InProceedings{Meshry_2019_CVPR,
  author = {Meshry, Moustafa and Goldman, Dan B. and Khamis, Sameh and Hoppe, Hugues and Pandey, Rohit and Snavely, Noah and Martin-Brualla, Ricardo},
  title = {Neural Rerendering in the Wild},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
GeoNet: Deep Geodesic Networks for Point Cloud Analysis
Tong He, Haibin Huang, Li Yi, Yuqian Zhou, Chihao Wu, Jue Wang, Stefano Soatto


Surface-based geodesic topology provides strong cues for object semantic analysis and geometric modeling. However, such connectivity information is lost in point clouds. Thus we introduce GeoNet, the first deep learning architecture trained to model the intrinsic structure of surfaces represented as point clouds. To demonstrate the applicability of learned geodesic-aware representations, we propose fusion schemes which use GeoNet in conjunction with other baseline or backbone networks, such as PU-Net and PointNet++, for down-stream point cloud analysis. Our method improves the state-of-the-art on multiple representative tasks that can benefit from understandings of the underlying surface topology, including point upsampling, normal estimation, mesh reconstruction and non-rigid shape classification.
[fusion, multiple] [point, geodesic, cloud, normal, estimation, mesh, surface, shape, neighborhood, geonet, reconstruction, computer, pof, intrinsic, ground, truth, puf, shapenet, underlying, vision, topology, topological, well, leftout, pattern, geometric, approach, dense, depth, estimated, uniformly, heldout, volume, leveraging, applicability] [method, input, based, figure, ieee, acm, demonstrate, conference, latent, conduct, noise] [deep, upsampling, distributed, network, layer, sparse, structure, better, table, accuracy, kernel, neural] [shortest, red, path] [feature, propose, baseline, backbone, object, semantic] [set, learning, distance, learned, euclidean, training, large, loss, learn, data, function, classification, trained]
@InProceedings{He_2019_CVPR,
  author = {He, Tong and Huang, Haibin and Yi, Li and Zhou, Yuqian and Wu, Chihao and Wang, Jue and Soatto, Stefano},
  title = {GeoNet: Deep Geodesic Networks for Point Cloud Analysis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MeshAdv: Adversarial Meshes for Visual Recognition
Chaowei Xiao, Dawei Yang, Bo Li, Jia Deng, Mingyan Liu


Highly expressive models such as deep neural networks (DNNs) have been widely applied to various applications. However, recent studies show that DNNs are vulnerable to adversarial examples, which are carefully crafted inputs aiming to mislead the predictions. Currently, the majority of these studies have focused on perturbation added to image pixels, while such manipulation is not physically realistic. Some works have tried to overcome this limitation by attaching printable 2D patches or painting patterns onto surfaces, but can be potentially defended because 3D shape features are intact. In this paper, we propose meshAdv to generate "adversarial 3D meshes" from objects that have rich shape features but minimal textural variation. To manipulate the shape or texture of the objects, we make use of a differentiable renderer to compute accurate shading on the shape and propagate the gradient. Extensive experiments show that the generated 3D meshes are effective in attacking both classifiers and object detectors. We evaluate the attack under different viewpoints. In addition, we design a pipeline to perform black-box attack on a photorealistic renderer with unknown rendering parameters.
[human, flow, recognition, perform] [shape, rendering, mesh, computer, differentiable, lighting, rendered, vision, camera, pattern, scene, directly, indoor, vertex, estimate, international, optimization, mitsuba, physically, pipeline, view, robust, case, accurate] [texture, image, based, conference, figure, ieee, photorealistic, manipulating, manipulation, method, pristine, perceptual, pixel, manipulate] [deep, neural, rate, table, achieve, densenet, applied, magnitude] [adv, adversarial, perturbation, attack, renderer, meshadv, generated, model, generate, success, transferability, victim, fool, arxiv, preprint, goal, robustness, vulnerable, machine, targeted, mislead] [object, detection, average, propose] [target, loss, learning, class, unknown, distance, test, classification, label]
@InProceedings{Xiao_2019_CVPR,
  author = {Xiao, Chaowei and Yang, Dawei and Li, Bo and Deng, Jia and Liu, Mingyan},
  title = {MeshAdv: Adversarial Meshes for Visual Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fast Spatially-Varying Indoor Lighting Estimation
Mathieu Garon, Kalyan Sunkavalli, Sunil Hadap, Nathan Carr, Jean-Francois Lalonde


We propose a real-time method to estimate spatially-varying indoor lighting from a single RGB image. Given an image and a 2D location in that image, our CNN estimates a 5th order spherical harmonic representation of the lighting at the given location in less than 20ms on a laptop mobile graphics card. While existing approaches estimate a single, global lighting representation or require depth as input, our method reasons about local lighting without requiring any geometry information. We demonstrate, through quantitative experiments including a user study, that our results achieve lower lighting estimation errors and are preferred by users over the state-of-the-art. Our approach can be used directly for augmented reality applications, where a virtual object is relit realistically at any position in the scene in real-time.
[dataset, work, recognition, predict] [lighting, local, light, scene, estimate, single, depth, computer, indoor, ground, vision, approach, reflectance, rendering, illumination, camera, pattern, error, truth, international, rgb, geometry, estimation, estimating, additional, coordinate, albedo, intrinsic, note, outdoor, render, spherical, require, virtual, relighting, rendered, degree, ambient] [image, method, real, conference, synthetic, ieee, user, input, acm, shading, patch, figure, hdr, barron, captured, high, study] [network, order, deep, table, full, neural, mobile, layer, lower] [environment, path, vector, model] [global, center, object, improves] [probe, learning, domain, adaptation, loss, training, trained, close, randomly, test]
@InProceedings{Garon_2019_CVPR,
  author = {Garon, Mathieu and Sunkavalli, Kalyan and Hadap, Sunil and Carr, Nathan and Lalonde, Jean-Francois},
  title = {Fast Spatially-Varying Indoor Lighting Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
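For reference, a 5th-order spherical harmonic lighting representation such as the one regressed above has (5 + 1)^2 = 36 coefficients per color channel, i.e. 108 values for RGB. A minimal sketch of such a regression head, assuming a hypothetical 512-dimensional feature from the local patch encoder:

import torch.nn as nn

SH_ORDER = 5
N_SH = (SH_ORDER + 1) ** 2          # 36 spherical harmonic coefficients per channel

# Hypothetical regression head placed on top of any image/patch feature extractor.
sh_head = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 3 * N_SH),       # 108 outputs: RGB x 36 SH coefficients
)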
Neural Illumination: Lighting Prediction for Indoor Environments
Shuran Song, Thomas Funkhouser


This paper addresses the task of estimating the light arriving from all directions to a 3D point observed at a selected pixel in an RGB image. This task is challenging because it requires predicting a mapping from a partial RGB observation by a camera to a complete illumination map for a different 3D point, which depends on the 3D location of the selected pixel, the distribution of unobserved light sources, the occlusions by scene geometry, etc. Previous methods attempt to learn this complex mapping directly using a single black-box neural network which often fails to estimate high-frequency lighting details for scenes with complicated 3D geometry. Instead, we propose "Neural Illumination," a new approach that decomposes illumination prediction into several simpler differentiable sub-tasks: 1) geometry estimation, 2) scene completion, and 3) LDR-to-HDR estimation. The advantage of this approach is that the sub-tasks are relatively easy to learn and can be trained with direct supervision, while the whole pipeline is fully differentiable and can be fine-tuned with end-to-end supervision. Experiments show that our approach performs significantly better quantitatively and qualitatively than prior work.
[prediction, dataset, warping, work, warp, dynamic, warped, predicting] [illumination, ldr, scene, ground, estimation, geometry, lighting, single, truth, light, surface, diffuse, gardner, depth, computer, directly, vision, locale, panoramic, algorithm, indoor, accurate, rgb, range, problem, geometric, direct, spherical, direction, panorama, observation, estimate, approach, dense, error, estimating, arriving] [image, input, hdr, pixel, figure, conference, high, intensity, produce, ieee, mapping, prior, intermediate, acm] [network, output, neural, convolutional, table, performance] [model, observed, generate, adversarial, unobserved] [map, module, supervision, location, propose, final] [loss, target, training, selected, train, data, set, trained, function, learning, task]
@InProceedings{Song_2019_CVPR,
  author = {Song, Shuran and Funkhouser, Thomas},
  title = {Neural Illumination: Lighting Prediction for Indoor Environments},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Sky Modeling for Single Image Outdoor Lighting Estimation
Yannick Hold-Geoffroy, Akshaya Athawale, Jean-Francois Lalonde


We propose a data-driven learned sky model, which we use for outdoor lighting estimation from a single image. As no large-scale dataset of images and their corresponding ground truth illumination is readily available, we use complementary datasets to train our approach, combining the vast diversity of illumination conditions of SUN360 with the radiometrically calibrated and physically accurate Laval HDR sky database. Our key contribution is to provide a holistic view of both lighting modeling and estimation, solving both problems end-to-end. From a test image, our method can directly estimate an HDR environment map of the lighting without relying on analytical lighting models. We demonstrate the versatility and expressivity of our learned sky model and show that it can be used to recover plausible illumination, leading to visually pleasant virtual object insertions. To further evaluate our method, we capture a dataset of HDR 360° panoramas and show through extensive validation that we significantly outperform previous state-of-the-art.
[dataset, previous, modeling, dynamic, capture] [sky, lighting, outdoor, illumination, estimate, panorama, single, estimation, computer, ground, truth, laval, skynet, estimated, azimuth, pattern, approach, scene, reflectance, range, position, international, view, overcast, ldr, error, vision, accurate, provide, limited, field, technique, reconstruction, rmse, relighting, corresponding, directly, virtual] [hdr, image, method, conference, ieee, figure, database, variety, proposed, input, captured, radiometric, quantitative] [deep, residual, network, architecture, full, performance] [model, encoder, plausible, step, evaluation] [propose, object, improvement] [sun, training, learning, train, autoencoder, learn, set, loss, learned, distribution, trained, large, space, test]
@InProceedings{Hold-Geoffroy_2019_CVPR,
  author = {Hold-Geoffroy, Yannick and Athawale, Akshaya and Lalonde, Jean-Francois},
  title = {Deep Sky Modeling for Single Image Outdoor Lighting Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Bidirectional Learning for Domain Adaptation of Semantic Segmentation
Yunsheng Li, Lu Yuan, Nuno Vasconcelos


Domain adaptation for semantic image segmentation is necessary because manually labeling large datasets with pixel-level labels is expensive and time-consuming. Existing domain adaptation techniques either work on limited datasets or yield performance well below that of supervised learning. In this paper, we propose a novel bidirectional learning framework for domain adaptation of segmentation. Using bidirectional learning, the image translation model and the segmentation adaptation model can be learned alternately and promote each other. Furthermore, we propose a self-supervised learning algorithm to learn a better segmentation adaptation model and, in return, improve the image translation model. Experiments show that our method is superior to state-of-the-art methods in domain adaptation of segmentation by a large margin. The source code is available at https://github.com/liyunsheng13/BDL
[bidirectional, work, dataset, prediction, backward, forward] [computer, algorithm, confidence, equation, defined, direction, ground, truth, vision, pattern, good] [translation, image, figure, real, method, synthetic, translated, result, proposed, recon, conference, perceptual, pixel, ieee, high, based] [performance, network, table, better, process, neural, ratio, compared, deep, number, iteration, shift, processing] [model, adversarial, arxiv, preprint, adv, gan, probability, choose, visual, find, introduce] [segmentation, semantic, threshold, miou, seg, propose, feature, aligned, improve, deeplab] [learning, domain, adaptation, loss, source, target, data, training, ssl, trained, learn, gap, train, set, unsupervised, function, tssl, large, datasets, classification, synthia]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yunsheng and Yuan, Lu and Vasconcelos, Nuno},
  title = {Bidirectional Learning for Domain Adaptation of Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
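A rough sketch of the alternating scheme described above, with hypothetical translator / segmenter objects standing in for the image translation and segmentation adaptation models; the actual losses (adversarial, perceptual, segmentation) are abstracted behind their fit methods.

def bidirectional_training(translator, segmenter, source_data, target_images,
                           n_rounds=2, conf_thresh=0.9):
    for _ in range(n_rounds):
        # Forward direction: translate labeled source images toward the target
        # style and train the segmentation adaptation model on them.
        translated = [(translator(img), label) for img, label in source_data]
        segmenter.fit(translated)

        # Self-supervised step: keep only confident target predictions as pseudo-labels.
        pseudo_labeled = []
        for img in target_images:
            probs = segmenter.predict_proba(img)            # (classes, H, W)
            confident = probs.max(axis=0) > conf_thresh
            pseudo_labeled.append((img, probs.argmax(axis=0), confident))
        segmenter.fit(pseudo_labeled)

        # Backward direction: the improved segmenter provides a stronger
        # consistency signal for re-training the translation model.
        translator.fit(source_data, target_images, segmenter)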
Enhanced Bayesian Compression via Deep Reinforcement Learning
Xin Yuan, Liangliang Ren, Jiwen Lu, Jie Zhou


In this paper, we propose an Enhanced Bayesian Compression method to flexibly compress the deep networks via reinforcement learning. Unlike the existing Bayesian compression method, which cannot explicitly enforce quantized weights during training, our method learns flexible codebooks in each layer for an optimal network quantization. To dynamically adjust the state of codebooks, we employ an Actor-Critic network to collaborate with the original deep network. Different from most existing network quantization methods, our EBC does not require re-training procedures after quantization. Experimental results show that our method obtains low-bit precision with an acceptable accuracy drop on MNIST, CIFAR and ImageNet.
[actor, state, dataset, collaborate, action, explicitly] [optimal, directly, error, problem] [method, proposed, figure, image, high, input, comparison] [network, deep, ebc, bayesian, compression, original, neural, accuracy, layer, codebook, quantization, convolutional, precision, quantized, flexible, efficient, epoch, bit, table, bnn, codebooks, variance, pruning, compressed, sparse, reduce, output, applied, gradient, imagenet, pretrained, convolution, fout, batch] [model, reinforcement, variational, agent, policy, critic, step, length] [assigned, feature, enhanced] [learning, test, training, trained, loss, distribution, update, function, data, refers, classification, posterior, learn, train, set, existing]
@InProceedings{Yuan_2019_CVPR,
  author = {Yuan, Xin and Ren, Liangliang and Lu, Jiwen and Zhou, Jie},
  title = {Enhanced Bayesian Compression via Deep Reinforcement Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
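The codebook quantization at the core of the method can be sketched as a nearest-entry assignment per layer; in the paper the codebook values themselves are adjusted during training by an actor-critic agent, which is not shown here.

import torch

def quantize_to_codebook(weights, codebook):
    # Assign each weight to its nearest codebook entry (one codebook per layer).
    flat = weights.reshape(-1, 1)                        # (N, 1)
    distances = (flat - codebook.reshape(1, -1)).abs()   # (N, K)
    nearest = distances.argmin(dim=1)
    return codebook[nearest].reshape(weights.shape)

# Example with a hypothetical 5-entry codebook for one convolutional layer.
w = torch.randn(64, 3, 3, 3)
codebook = torch.tensor([-0.5, -0.1, 0.0, 0.1, 0.5])
w_quantized = quantize_to_codebook(w, codebook)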
Strong-Weak Distribution Alignment for Adaptive Object Detection
Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, Kate Saenko


We propose an approach for unsupervised adaptation of object detectors from label-rich to label-poor domains which can significantly reduce annotation costs associated with detection. Recently, approaches that align distributions of source and target images using an adversarial loss have been proven effective for adapting object classifiers. However, for object detection, fully matching the entire distributions of source and target images to each other at the global image level may fail, as domains could have distinct scene layouts and different combinations of objects. On the other hand, strong matching of local features such as texture and color makes sense, as it does not change category level semantics. This motivates us to propose a novel method for detector adaptation based on strong local alignment and weak global alignment. Our key contribution is the weak alignment model, which focuses the adversarial alignment loss on images that are globally similar and puts less emphasis on aligning images that are globally dissimilar. Additionally, we design the strong domain alignment model to only look at local receptive fields of the feature map. We empirically verify the effectiveness of our method on four datasets comprising both large and small domain shifts. Our code is available at https://github.com/VisionLearningGroup/DA_Detection.
[dataset] [local, focal, matching, scene, globally, partially, match] [method, proposed, image, based, real, synthetic, figure] [performance, table, deep, network, adaptive, scale, designed, number, effective, effectiveness] [strong, model, adversarial, vector, evidence, improved, indicates, visual] [feature, object, detection, global, weak, pascal, baseline, car, propose, faster, region, semantic, context, map, clipart, bounding, voc, evaluated, cityscape, instance, aligned, rcnn, proposal, frcnn, segmentation, detector] [domain, alignment, source, target, loss, adaptation, classifier, training, align, classification, trained, unsupervised, extractor, function, set, objective, aligning, large, learning, train, strictly, dissimilar, distribution, class, motivated, hurt]
@InProceedings{Saito_2019_CVPR,
  author = {Saito, Kuniaki and Ushiku, Yoshitaka and Harada, Tatsuya and Saenko, Kate},
  title = {Strong-Weak Distribution Alignment for Adaptive Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
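The weak global alignment idea can be sketched as a focal-style modulation of the domain-classifier loss, so that globally easy-to-distinguish (dissimilar) images contribute little to the adversarial signal; an illustrative simplification, not the exact formulation in the paper.

import torch

def weak_global_alignment_loss(domain_logit, is_target, gamma=5.0):
    # domain_logit is a scalar logit from a global domain classifier.
    p_target = torch.sigmoid(domain_logit)
    p_correct = p_target if is_target else 1.0 - p_target
    # Easy examples (p_correct close to 1) are down-weighted by the focal factor.
    return -((1.0 - p_correct) ** gamma) * torch.log(p_correct + 1e-8)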
MFAS: Multimodal Fusion Architecture Search
Juan-Manuel Perez-Rua, Valentin Vielzeuf, Stephane Pateux, Moez Baccouche, Frederic Jurie


We tackle the problem of finding good architectures for multimodal classification problems. We propose a novel and generic search space that spans a large number of possible fusion architectures. In order to find an optimal architecture for a given dataset in the proposed search space, we leverage an efficient sequential model-based exploration approach that is tailored for the problem. We demonstrate the value of posing multimodal fusion as a neural architecture search problem by extensive experimentation on a toy dataset and two other real multimodal datasets. We discover fusion architectures that exhibit state-of-the-art performance for problems with different domain and dataset size, including the NTU dataset, the largest multimodal action recognition dataset available.
[fusion, dataset, video, unimodal, action, recognition, work, sequential, late, early, multiple, second, audio, human, start] [problem, pose, rgb, algorithm, optimal, approach, single, defined, analysis, well] [image, method, proposed, input] [search, neural, architecture, deep, best, table, number, network, validation, layer, convolutional, accuracy, top, progressive, complexity, relu, scheme, structure, order, efficient, performance, output, sigm, nist, better, weight] [multimodal, sampled, exploration, model, modality, movie, text, finding, visual, attention] [feature, three, final, propose, including, fuse] [space, learning, classification, training, surrogate, data, function, large, trained, reported, set, sampling, paper, loss, observe, sample, temperature, shared, task]
@InProceedings{Perez-Rua_2019_CVPR,
  author = {Perez-Rua, Juan-Manuel and Vielzeuf, Valentin and Pateux, Stephane and Baccouche, Moez and Jurie, Frederic},
  title = {MFAS: Multimodal Fusion Architecture Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Disentangling Adversarial Robustness and Generalization
David Stutz, Matthias Hein, Bernt Schiele


Obtaining deep networks that are robust against adversarial examples and generalize well is an open problem. A recent hypothesis even states that both robust and accurate models are impossible, i.e., adversarial robustness and generalization are conflicting goals. In an effort to clarify the relationship between robustness and generalization, we assume an underlying, low-dimensional data manifold and show that: 1. regular adversarial examples leave the manifold; 2. adversarial examples constrained to the manifold, i.e., on-manifold adversarial examples, exist; 3. on-manifold adversarial examples are generalization errors, and on-manifold adversarial training boosts generalization; 4. regular robustness and generalization are not necessarily contradicting goals. These findings imply that both robust and accurate models are possible. However, different models (architectures, training strategies etc.) can exhibit different robustness and generalization characteristics. To confirm our claims, we present extensive experiments on synthetic data (with known manifold) as well as on EMNIST, Fashion-MNIST and CelebA.
[work, considering, benefit, hypothesis] [error, robust, well, david, contrast, normal, accurate, corresponding, problem, supplementary, optimization, michael, computed] [image, generative, celeba, latent, figure, transformation, synthetic, difference, change, unconstrained] [rate, deep, neural, max, original, better, andrew, fixed, approximated, network, order, approximation] [adversarial, robustness, regular, manifold, success, emnist, example, true, leave, constrained, ian, perturbation, attack, relationship, machine, nicholas, onmanifold, model, exhibit, perturbed, simple, random, consider, defense, carlini, dawn, aleksander, contradicting, essentially, considered] [boost, illustrated] [training, generalization, test, data, learning, distance, augmentation, class, learned, label, distribution, experimental]
@InProceedings{Stutz_2019_CVPR,
  author = {Stutz, David and Hein, Matthias and Schiele, Bernt},
  title = {Disentangling Adversarial Robustness and Generalization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ShieldNets: Defending Against Adversarial Attacks Using Probabilistic Adversarial Robustness
Rajkumar Theagarajan, Ming Chen, Bir Bhanu, Jing Zhang


Defending against adversarial attacks is a critical step towards reliable deployment of deep learning empowered solutions for industrial applications. Probabilistic adversarial robustness (PAR), as a theoretical framework, is introduced to neutralize adversarial attacks by concentrating sample probability to adversarial-free zones. Distinct from most existing defense mechanisms, which require modifying the architecture/training of the target classifier and are thus not feasible in real-world scenarios, e.g., when a model has already been deployed, PAR is designed from the outset to provide proactive protection to an existing, fixed model. ShieldNet is implemented as a demonstration of PAR in this work by using PixelCNN. Experimental results show that this approach is generalizable, robust against adversarial transferability and resistant to a wide variety of attacks on the Fashion-MNIST and CIFAR10 datasets, respectively.
[dataset, work] [approach, theoretical, corresponding, provide, robust] [image, proposed, input, pixel, method, comparison, generative, transformation, conference] [resnet, table, vgg, neural, original, accuracy, deep, performance, pixelcnn, lower, network, architecture, designed, number, compared, smoothing, implementation, small, sgd, achieves, convolutional] [adversarial, attack, model, fgsm, shieldnet, arxiv, par, deepfool, bim, preprint, defense, transferability, robustness, pixeldefend, perturbation, defend, defending, probability, modifying, perturbed, attacking, introduce, decision, adv, evaluation, machine, neutralize] [cnn] [training, testing, probabilistic, mnist, fashion, distribution, loss, learning, target, classification, existing, classifier, space, function, label, datasets, train, sample, experimental]
@InProceedings{Theagarajan_2019_CVPR,
  author = {Theagarajan, Rajkumar and Chen, Ming and Bhanu, Bir and Zhang, Jing},
  title = {ShieldNets: Defending Against Adversarial Attacks Using Probabilistic Adversarial Robustness},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deeply-Supervised Knowledge Synergy
Dawei Sun, Anbang Yao, Aojun Zhou, Hao Zhao


Convolutional Neural Networks (CNNs) have become deeper and more complicated compared with the pioneering AlexNet. However, the prevailing training scheme still follows the established practice of adding supervision only to the last layer of the network and propagating error information back layer-by-layer. In this paper, we propose Deeply-supervised Knowledge Synergy (DKS), a new method aiming to train CNNs with improved generalization ability for image classification tasks without introducing extra computational cost during inference. Inspired by the deeply-supervised learning scheme, we first append auxiliary supervision branches on top of certain intermediate network layers. While properly using auxiliary supervision can improve model accuracy to some degree, we go one step further to explore the possibility of utilizing the probabilistic knowledge dynamically learnt by the classifiers connected to the backbone network as a new regularization to improve the training. A novel synergy loss, which considers pairwise knowledge matching among all supervision branches, is presented. Intriguingly, it enables dense pairwise knowledge matching operations in both top-down and bottom-up directions at each training iteration, resembling a dynamic synergy process for the same task. We evaluate DKS on image classification datasets using state-of-the-art CNN architectures, and show that the models trained with it are consistently better than the corresponding counterparts. For instance, on the ImageNet classification benchmark, our ResNet-152 model outperforms the baseline model by a 1.47% margin in Top-1 accuracy. Code is available at https://github.com/sundw2014/DKS.
[hidden, dataset, early, time, showing, complex] [matching, error, optimization, dense, defined, problem] [method, image, intermediate, proposed, comparison] [dks, network, accuracy, deep, connected, synergy, layer, neural, imagenet, cnns, convolutional, table, scheme, performance, number, best, block, gain, add, designed, compared, top, resnet, densenet, building, architecture, denotes, regularization, better, modern, standard, achieves, compare, batch, rate] [model, simple, adding, improved] [baseline, supervision, backbone, cnn, improve, three, propose, feature] [auxiliary, training, knowledge, learning, classification, pairwise, loss, data, classifier, trained, set, distillation, teacher, test, train, strategy, learnt, class, convergence, transfer]
@InProceedings{Sun_2019_CVPR,
  author = {Sun, Dawei and Yao, Anbang and Zhou, Aojun and Zhao, Hao},
  title = {Deeply-Supervised Knowledge Synergy},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
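The dense pairwise knowledge matching can be sketched as a sum of KL terms between the softened outputs of every pair of supervision branches, each branch acting as teacher for the others; the temperature and weighting below are placeholders, not the paper's exact settings.

import torch.nn.functional as F

def synergy_loss(branch_logits, temperature=1.0):
    # branch_logits: list of logits from the final classifier and all auxiliary
    # supervision branches, each of shape (batch, classes).
    loss = 0.0
    for i, teacher in enumerate(branch_logits):
        for j, student in enumerate(branch_logits):
            if i == j:
                continue
            p = F.softmax(teacher.detach() / temperature, dim=1)
            log_q = F.log_softmax(student / temperature, dim=1)
            loss = loss + F.kl_div(log_q, p, reduction="batchmean")
    return loss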
Dual Residual Networks Leveraging the Potential of Paired Operations for Image Restoration
Xing Liu, Masanori Suganuma, Zhun Sun, Takayuki Okatani


In this paper, we study the design of deep neural networks for image restoration tasks. We propose a novel style of residual connections dubbed "dual residual connection", which exploits the potential of paired operations, e.g., up- and down-sampling or convolution with large- and small-size kernels. We design a modular block implementing this connection style; it is equipped with two containers into which arbitrary paired operations are inserted. Adopting the "unraveled" view of the residual networks proposed by Veit et al., we point out that a stack of the proposed modular blocks allows the first operation in a block to interact with the second operation in any subsequent block. Specifying the two operations in each of the stacked blocks, we build a complete network for each individual task of image restoration. We experimentally evaluate the proposed approach on five image restoration tasks using nine datasets. The results show that the proposed networks with properly chosen paired operations outperform previous methods on almost all of the tasks and datasets.
[motion, dataset, previous, consists] [computer, international, pattern, vision, single, ground, estimate, truth, supplementary, approach] [image, proposed, conference, removal, noise, paired, figure, method, input, blur, raindrop, restoration, haze, dual, sharp, result, hazy, rain, durbs, deblurgan, dcpdn, ieee, clear, ssim] [network, residual, table, design, convolutional, neural, deep, connection, block, gaussian, better, entire, layer, structure, processing, number, output, normalization, convolution, size, employ] [attention, adversarial, choose, machine, potential] [three, detection, map, cnn, object, european, jian] [training, task, trained, test, noisy, tested]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Xing and Suganuma, Masanori and Sun, Zhun and Okatani, Takayuki},
  title = {Dual Residual Networks Leveraging the Potential of Paired Operations for Image Restoration},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Probabilistic End-To-End Noise Correction for Learning With Noisy Labels
Kun Yi, Jianxin Wu


Deep learning has achieved excellent performance in various computer vision tasks, but requires a lot of training examples with clean labels. It is easy to collect a dataset with noisy labels, but such noise causes networks to overfit severely and accuracy to drop dramatically. To address this problem, we propose an end-to-end framework called PENCIL, which can update both network parameters and label estimations as label distributions. PENCIL is independent of the backbone network structure and does not need an auxiliary clean dataset or prior information about noise, thus it is more general and robust than existing methods and is easy to apply. PENCIL outperformed previous state-of-the-art methods by large margins on both synthetic and real-world datasets with different noise types and noise rates. Experiments show that PENCIL is robust on clean datasets, too.
[dataset, framework, forward, updated, prediction, transition, second] [robust, handling, additional, estimate, estimated] [noise, pencil, method, image, clean, proposed, high, prior, row, correction] [network, rate, deep, accuracy, table, neural, best, epoch, fixed, small, validation, achieved, original, higher, inspired, better, gradient, number] [correct, robustness, probability, random, step, model] [backbone, average, easy, propose, three] [label, noisy, learning, loss, training, distribution, symmetric, test, entropy, large, asymmetric, cross, set, classification, yij, data, update, datasets, function, tested, hyperparameters, tanaka, probabilistic, existing, updating, dldl, log, fair, subset]
@InProceedings{Yi_2019_CVPR,
  author = {Yi, Kun and Wu, Jianxin},
  title = {Probabilistic End-To-End Noise Correction for Learning With Noisy Labels},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
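A loose sketch of a PENCIL-style objective: the per-sample label distributions are kept as free parameters (label_logits) and receive gradients along with the network; the term names and constants below are illustrative rather than the paper's exact values.

import torch
import torch.nn.functional as F

def pencil_objective(net_logits, label_logits, noisy_onehot, alpha=0.1, beta=0.4):
    corrected = F.softmax(label_logits, dim=1)       # current corrected-label estimate
    pred = F.softmax(net_logits, dim=1)

    # Classification term: network predictions and corrected labels should agree.
    compat_net = F.kl_div(torch.log(corrected + 1e-8), pred, reduction="batchmean")
    # Compatibility term: corrected labels should stay close to the given noisy labels.
    compat_noisy = -(noisy_onehot * torch.log(corrected + 1e-8)).sum(dim=1).mean()
    # Entropy term: keep network predictions confident.
    entropy = -(pred * torch.log(pred + 1e-8)).sum(dim=1).mean()
    return compat_net + alpha * compat_noisy + beta * entropy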
Attention-Guided Unified Network for Panoptic Segmentation
Yanwei Li, Xinze Chen, Zheng Zhu, Lingxi Xie, Guan Huang, Dalong Du, Xingang Wang


This paper studies panoptic segmentation, a recently proposed task which segments foreground (FG) objects at the instance level as well as background (BG) contents at the semantic level. Existing methods mostly deal with these two problems separately, but in this paper, we reveal the underlying relationship between them; in particular, FG objects provide complementary cues to assist BG understanding. Our approach, named the Attention-guided Unified Network (AUNet), is a unified framework with two branches for FG and BG segmentation simultaneously. Two sources of attentions are added to the BG branch, namely, RPN and FG segmentation mask, to provide object-level and pixel-level attentions, respectively. Our approach generalizes to different backbones with consistent accuracy gains in both FG and BG segmentation, and also sets new state-of-the-art results on both the MS-COCO (46.5% PQ) and Cityscapes (59.0% PQ) benchmarks.
[framework, dataset] [scene, well, finer, absolute, additional, inverse] [background, proposed, figure, image, method, input, result, quality, based, presented] [network, performance, denotes, pam, conv, designed, table, deep, bilinear, top, scale, convolutional] [attention, generated, model, relationship, named, understanding, visual] [segmentation, panoptic, semantic, mask, instance, feature, stuff, branch, module, map, foreground, aunet, rpn, object, backbone, proposal, complementary, contextual, roiupsample, pqst, kaiming, mam, adopt, adopted, extra, pqth, ross, global, reweight, spatial, activated, including, roialign, valuex, valuey, val, piotr] [unified, training, shared, function, task, data, learning, set]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yanwei and Chen, Xinze and Zhu, Zheng and Xie, Lingxi and Huang, Guan and Du, Dalong and Wang, Xingang},
  title = {Attention-Guided Unified Network for Panoptic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection
Golnaz Ghiasi, Tsung-Yi Lin, Quoc V. Le


Current state-of-the-art convolutional architectures for object detection are manually designed. Here we aim to learn a better architecture of feature pyramid network for object detection. We adopt Neural Architecture Search and discover a new feature pyramid architecture in a novel scalable search space covering all cross-scale connections. The discovered architecture, named NAS-FPN, consists of a combination of top-down and bottom-up connections to fuse features across scales. NAS-FPN, combined with various backbone models in the RetinaNet framework, achieves better accuracy and latency tradeoff compared to state-of-the-art object detection models. NAS-FPN improves mobile detection accuracy by 2 AP compared to state-of-the-art SSDLite with MobileNetV2 model in [32] and achieves 48.3 AP which surpasses Mask R-CNN [10] detection accuracy with less computation time.
[time, rnn, stacking, combined] [accurate, dense] [image, figure, input, resolution, high, method] [architecture, search, network, neural, number, output, accuracy, controller, layer, discovered, pyramidal, design, better, dropblock, scalable, inference, designing, deep, size, performance, ssdlite, computation, fast, stacked, cell, convolutional, compared, speed, binary, capacity, achieves, tradeoff, mobile, higher, identical, applied] [model, generate, step, sampled, sum, discover, reinforcement] [feature, pyramid, object, backbone, detection, fpn, merging, retinanet, combine, improves, multiscale, propose, mask] [learning, dimension, space, proxy, training, trained, task, classification, loss]
@InProceedings{Ghiasi_2019_CVPR,
  author = {Ghiasi, Golnaz and Lin, Tsung-Yi and Le, Quoc V.},
  title = {NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
OICSR: Out-In-Channel Sparsity Regularization for Compact Deep Neural Networks
Jiashi Li, Qi Qi, Jingyu Wang, Ce Ge, Yujian Li, Zhangzhang Yue, Haifeng Sun


Channel pruning can significantly accelerate and compress deep neural networks. Many channel pruning works utilize structured sparsity regularization to zero out all the weights in some channels and automatically obtain structure-sparse network in training stage. However, these methods apply structured sparsity regularization on each layer separately where the correlations between consecutive layers are omitted. In this paper, we first combine one out-channel in current layer and the corresponding in-channel in next layer as a regularization group, namely out-in-channel. Our proposed Out-In-Channel Sparsity Regularization (OICSR) considers correlations between successive layers to further retain predictive power of the compact network. Training with OICSR thoroughly transfers discriminative features into a fraction of out-in-channels. Correspondingly, OICSR measures channel importance based on statistics computed from two consecutive layers, not individual layer. Finally, a global greedy pruning algorithm is designed to remove redundant out-in-channels in an iterative way. Our method is comprehensively evaluated with various CNN architectures including CifarNet, AlexNet, ResNet, DenseNet and PreActSeNet on CIFAR-10, CIFAR-100 and ImageNet-1K datasets. Notably, on ImageNet-1K, we reduce 37.2% FLOPs on ResNet-50 while outperforming the original model by 0.22% top-1 accuracy.
[consecutive, acc, dataset, current, work, huang] [computer, international, corresponding, algorithm, vision, pattern, simultaneously, form, ocl] [separated, conference, method, based, figure, proposed] [accuracy, regularization, pruning, channel, structured, layer, oicsr, neural, pruned, sparsity, deep, network, redundant, convolutional, group, compared, ratio, cifarnet, lasso, cnns, achieves, compact, ith, higher, alexnet, energy, table, automatically, compression, denotes, prune, structure, imagenet, selection, filter, scaling, regularizes] [greedy, iterative, model, incorrect, relevant] [global] [training, loss, learning, existing]
@InProceedings{Li_2019_CVPR,
  author = {Li, Jiashi and Qi, Qi and Wang, Jingyu and Ge, Ce and Li, Yujian and Yue, Zhangzhang and Sun, Haifeng},
  title = {OICSR: Out-In-Channel Sparsity Regularization for Compact Deep Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
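The out-in-channel group can be sketched as coupling the c-th output filter of one conv layer with the c-th input channel of the next, and penalizing each group's l2 energy; a simplified illustration of the regularizer only, not the full greedy pruning procedure.

import torch

def out_in_channel_penalty(w_curr, w_next, eps=1e-12):
    # w_curr: (C_out, C_in, k, k) weights of layer l
    # w_next: (C_next, C_out, k, k) weights of layer l+1
    out_energy = w_curr.pow(2).sum(dim=(1, 2, 3))   # energy of each output filter of layer l
    in_energy = w_next.pow(2).sum(dim=(0, 2, 3))    # energy of the matching input channel of layer l+1
    # Group-lasso style penalty over each out-in-channel group.
    return torch.sqrt(out_energy + in_energy + eps).sum()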
Semantically Aligned Bias Reducing Zero Shot Learning
Akanksha Paul, Narayanan C. Krishnan, Prateek Munjal


Zero shot learning (ZSL) aims to recognize unseen classes by exploiting semantic relationships between seen and unseen classes. Two major problems faced by ZSL algorithms are the hubness problem and the bias towards the seen classes. Existing ZSL methods focus on only one of these problems in the conventional and generalized ZSL setting. In this work, we propose a novel approach, Semantically Aligned Bias Reducing (SABR) ZSL, which focuses on solving both the problems. It overcomes the hubness problem by learning a latent space that preserves the semantic relationship between the labels while encoding the discriminating information about the classes. Further, we also propose ways to reduce bias of the seen classes through a simple cross-validation process in the inductive setting and a novel weak transfer constraint in the transductive setting. Extensive experiments on three benchmark datasets suggest that the proposed model significantly outperforms existing state-of-the-art algorithms by 1.5-9% in the conventional ZSL setting and by 2-14% in the generalized ZSL for both the inductive and transductive settings.
[early, work, outperforms] [computer, pattern, vision, defined, approach, optimal, problem] [latent, generator, conditional, conference, ieee, proposed, figure, synthetic, prior] [performance, best, reducing, deep, network, number, reduce, output, layer, accuracy] [model, visual, semantically, marginal, relationship, true, machine, wasserstein] [semantic, propose, aligned, instance, average] [class, unseen, zsl, space, learning, bias, transductive, label, data, inductive, training, embedding, learn, unlabeled, setting, embeddings, learned, sun, hubness, datasets, set, generalized, conventional, loss, cub, transfer, distribution, classifier, labeled, classification, train, novel, similarity, function, gzsl, cau, shot, test, discriminative, xian, regressor]
@InProceedings{Paul_2019_CVPR,
  author = {Paul, Akanksha and Krishnan, Narayanan C. and Munjal, Prateek},
  title = {Semantically Aligned Bias Reducing Zero Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Feature Space Perturbations Yield More Transferable Adversarial Examples
Nathan Inkawhich, Wei Wen, Hai (Helen) Li, Yiran Chen


Many recent works have shown that deep learning models are vulnerable to quasi-imperceptible input perturbations, yet practitioners cannot fully explain this behavior. This work describes a transfer-based blackbox targeted adversarial attack on deep feature space representations that also provides insights into cross-model class representations of deep CNNs. The attack is explicitly designed for transferability and drives the feature space representation of a source image at layer L towards the representation of a target image at L. The attack yields highly transferable targeted examples, which outperform competition-winning methods by over 30% in targeted attack metrics. We also show that the choice of L to generate examples from is important, that transferability characteristics are blackbox-model agnostic, and that well-trained deep models have similar highly-abstract representations.
[work] [well, error, momentum, depth, computer, direction, analysis, principal] [image, method, ieee, figure, dadv, separated, produce, transferring, amount, component, conference] [layer, deep, rate, best, original, activation, imagenet, highly, powerful, gradient, table, performance, better] [attack, model, adversarial, blackbox, targeted, whitebox, transferability, perturbed, decision, perturbation, tsuc, ttr, generated, untargeted, ytrue, example, closer, epsilon, fool, choose, success, jaa, successful, iterative, find] [feature] [class, space, target, transferable, trained, transfer, source, test, data, measure, learning, function, distance, classification, learned, loss, representation, large, expect]
@InProceedings{Inkawhich_2019_CVPR,
  author = {Inkawhich, Nathan and Wen, Wei and (Helen) Li, Hai and Chen, Yiran},
  title = {Feature Space Perturbations Yield More Transferable Adversarial Examples},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
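A sketch of the feature-space attack loop described above: iteratively push the layer-L activation of the source image toward that of a target-class image. layer_L is a hypothetical callable returning the activation at the chosen layer; the step sizes are illustrative.

import torch

def feature_space_attack(layer_L, x_source, x_target, steps=10, eps=8 / 255, alpha=2 / 255):
    target_feat = layer_L(x_target).detach()
    x_adv = x_source.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        feat_dist = (layer_L(x_adv) - target_feat).pow(2).mean()
        grad, = torch.autograd.grad(feat_dist, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()                     # descend the feature distance
            x_adv = x_source + (x_adv - x_source).clamp(-eps, eps)  # stay within the budget
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()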
IGE-Net: Inverse Graphics Energy Networks for Human Pose Estimation and Single-View Reconstruction
Dominic Jack, Frederic Maire, Sareh Shirazi, Anders Eriksson


Inferring 3D scene information from 2D observations is an open problem in computer vision. We propose using a deep-learning based energy minimization framework to learn a consistency measure between 2D observations and a proposed world model, and demonstrate that this framework can be trained end-to-end to produce consistent and realistic inferences. We evaluate the framework on human pose estimation and voxel-based object reconstruction benchmarks and show competitive results can be achieved with relatively shallow networks with drastically fewer learned parameters and floating point operations than conventional deep-learning approaches.
[human, dataset, framework, joint, varied] [pose, computer, optimization, vision, single, reconstruction, voxel, pattern, estimation, shape, initial, frustum, approach, volume, well, ige, problem, view, international, point, volumetric, continuous, dense, camera, consistent, associated, feasibility, additional] [image, conference, ieee, proposed, resolution, based, figure, method, consistency, high] [energy, network, number, table, deep, neural, convolutional, inference, output, standard, size, layer, scale, scaling, residual, performance, optimizer] [model, consider, inferring, adversarial, infer] [object, inner, grid, iou, feature] [trained, learning, loss, function, learned, base, protocol, training, learn]
@InProceedings{Jack_2019_CVPR,
  author = {Jack, Dominic and Maire, Frederic and Shirazi, Sareh and Eriksson, Anders},
  title = {IGE-Net: Inverse Graphics Energy Networks for Human Pose Estimation and Single-View Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Accelerating Convolutional Neural Networks via Activation Map Compression
Georgios Georgiadis


The deep learning revolution brought us an extensive array of neural network architectures that achieve state-of-the-art performance in a wide variety of Computer Vision tasks including among others, classification, detection and segmentation. In parallel, we have also been observing an unprecedented demand in computational and memory requirements, rendering the efficient use of neural networks in low-powered devices virtually unattainable. Towards this end, we propose a three-stage compression and acceleration pipeline that sparsifies, quantizes and entropy encodes activation maps of Convolutional Neural Networks. Sparsification increases the representational power of activation maps leading to both acceleration of inference and higher model accuracy. Inception-V3 and MobileNet-V1 can be accelerated by as much as 1.6x with an increase in accuracy of 0.38% and 0.54% on the ImageNet and CIFAR-10 datasets respectively. Quantizing and entropy coding the sparser activation maps lead to higher compression over the baseline, reducing the memory cost of the network execution. Inception-V3 and MobileNet-V1 activation maps, quantized to 16 bits, are compressed by as much as 6x with an increase in accuracy of 0.36% and 0.55% respectively.
[report, dataset] [computer, algorithm, international, vision, pipeline, percentage, optimal, pattern] [conference, method, ieee, high, change, input, prior, image, variety] [sparse, activation, compression, neural, network, accuracy, deep, sparsity, convolutional, layer, table, coding, number, acceleration, compressed, size, gain, quantization, pruning, increase, imagenet, computational, efficient, weight, regularization, achieve, sparsification, cost, reduce, compressing, effective, parameter, power, higher, andrew, accelerating, lead] [arxiv, preprint, model, memory, fact, attempt, machine] [baseline, seg, map, leading] [training, learning, entropy, selected, data, function, set, mnist, corresponds, effectively]
@InProceedings{Georgiadis_2019_CVPR,
  author = {Georgiadis, Georgios},
  title = {Accelerating Convolutional Neural Networks via Activation Map Compression},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
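A crude NumPy sketch of the three stages, sparsify, quantize, entropy-code, using hard thresholding and uniform quantization as stand-ins for the learned sparsification and coding described in the paper; the entropy of the symbol stream serves as an idealized coded size.

import numpy as np

def compress_activation_map(act, threshold=0.05, n_bits=16):
    # 1) Sparsify: zero out small activations.
    sparse = np.where(np.abs(act) < threshold, 0.0, act)

    # 2) Quantize: uniform quantization to 2**n_bits levels.
    scale = (np.abs(sparse).max() + 1e-12) / (2 ** (n_bits - 1))
    symbols = np.round(sparse / scale).astype(np.int64)

    # 3) Entropy code: ideal coded size from the empirical symbol distribution.
    _, counts = np.unique(symbols, return_counts=True)
    probs = counts / counts.sum()
    bits_per_symbol = float(-(probs * np.log2(probs)).sum())
    return symbols, bits_per_symbol * symbols.size / 8.0   # approx. bytes after coding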
Knowledge Distillation via Instance Relationship Graph
Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, Yunqiang Duan


The key challenge of knowledge distillation is to extract general, moderate and sufficient knowledge from a teacher network to guide a student network. In this paper, a novel Instance Relationship Graph (IRG) is proposed for knowledge distillation. It models three kinds of knowledge, including instance features, instance relationships and feature space transformation, while the latter two kinds of knowledge are neglected by previous methods. Firstly, the IRG is constructed to model the distilled knowledge of one network layer, by considering instance features and instance relationships as vertices and edges respectively. Secondly, an IRG transformation is proposed to model the feature space transformation across layers. It is more moderate than directly mimicking the features at intermediate layers. Finally, hint loss functions are designed to force a student's IRGs to mimic the structures of a teacher's IRGs. The proposed method effectively captures the knowledge along the whole network via IRGs, and thus shows stable convergence and strong robustness to different network architectures. In addition, the proposed method shows superior performance over existing methods on datasets of various scales.
[extract, outperforms, time, graph, previous, framework] [vertex, computer, general, vision, robust] [transformation, proposed, method, figure, conference, difference, competing, guide, intermediate, based, ieee, supervise] [network, performance, neural, layer, deep, size, table, batch, best, accuracy, imagenet, represents, effectiveness, compared, convolutional, called, pruning, processing, larger] [model, mode, relationship, arxiv, preprint, attention, type] [instance, feature, three, edge, baseline, moderate, logits, including] [knowledge, teacher, lirg, student, irg, space, loss, rocket, training, distillation, distilled, fsp, trained, sufficient, learning, irgs, datasets, set, hint, existing]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Yufan and Cao, Jiajiong and Li, Bing and Yuan, Chunfeng and Hu, Weiming and Li, Yangxi and Duan, Yunqiang},
  title = {Knowledge Distillation via Instance Relationship Graph},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
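The edge part of the instance relationship graph can be sketched as matching the pairwise distance matrices computed over a batch of teacher and student features; the vertex and transformation terms from the paper are omitted here.

import torch
import torch.nn.functional as F

def irg_edge_loss(teacher_feat, student_feat):
    # teacher_feat / student_feat: (batch, ...) features from matching layers.
    def edges(f):
        f = f.flatten(1)                     # (batch, dim)
        return torch.cdist(f, f, p=2) ** 2   # (batch, batch) instance relationships

    return F.mse_loss(edges(student_feat), edges(teacher_feat).detach())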
PPGNet: Learning Point-Pair Graph for Line Segment Detection
Ziheng Zhang, Zhengxin Li, Ning Bi, Jia Zheng, Jinlei Wang, Kun Huang, Weixin Luo, Yanyu Xu, Shenghua Gao


In this paper, we present a novel framework to detect line segments in man-made environments. Specifically, we propose to describe junctions, line segments and relationships between them with a simple graph, which is more structured and informative than the end-point representation used in existing line segment detection methods. In order to extract a line segment graph from an image, we further introduce PPGNet, a convolutional neural network that directly infers a graph from an image. We evaluate our method on published benchmarks including York Urban and Wireframe datasets. The results demonstrate that our method achieves satisfactory performance and generalizes well on all the benchmarks. The source code of our work is available at https://github.com/svip-lab/PPGNet.
[graph, dataset, framework, prediction, predict] [computer, pattern, vision, matrix, indoor, ground, directly, local, manhattan, truth, scene, international, problem, robust, general, outdoor, analysis, matching, single, well, corresponding, endpoint] [image, ieee, conference, method, figure, proposed, capable, based, input] [performance, network, deep, size, convolutional, neural, order, achieves, small, processing, precision, rate, parameterized] [introduce, evaluation, simple, room] [segment, junction, detection, feature, wireframe, lsam, jdm, ppgnet, detect, backbone, connectivity, module, amim, semantic, threshold, kaiming, fully, recall, european, propose, infers, urban, edge, annotated] [adjacency, sampling, datasets, representation, set, learning, training, pair, large, existing]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Ziheng and Li, Zhengxin and Bi, Ning and Zheng, Jia and Wang, Jinlei and Huang, Kun and Luo, Weixin and Xu, Yanyu and Gao, Shenghua},
  title = {PPGNet: Learning Point-Pair Graph for Line Segment Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Building Detail-Sensitive Semantic Segmentation Networks With Polynomial Pooling
Zhen Wei, Jingyi Zhang, Li Liu, Fan Zhu, Fumin Shen, Yi Zhou, Si Liu, Yao Sun, Ling Shao


Semantic segmentation is an important computer vision task, which aims to allocate a semantic label to each pixel in an image. When training a segmentation model, it is common to fine-tune a classification network pre-trained on a large-scale dataset. However, as an intrinsic property of the classification model, invariance to spatial perturbation resulting from the loss of detail-sensitivity prevents segmentation networks from achieving high performance. The use of standard poolings is one of the key factors for this invariance. The most common standard poolings are max and average pooling. Max pooling can increase both the invariance to spatial perturbations and the non-linearity of the networks. Average pooling, on the other hand, is sensitive to spatial perturbations, but is a linear function. For semantic segmentation, we prefer both the preservation of detailed cues within a local feature region and non-linearity that increases a network's functional complexity. In this work, we propose a polynomial pooling (P-pooling) function that finds an intermediate form between max and average pooling to provide an optimally balanced and self-adjusted pooling strategy for semantic segmentation. The P-pooling is differentiable and can be applied to a variety of pre-trained networks. Extensive studies on the PASCAL VOC, Cityscapes and ADE20k datasets demonstrate the superiority of P-pooling over other poolings. Experiments on various network architectures and state-of-the-art training strategies also show that models with P-pooling layers consistently outperform those directly fine-tuned using pre-trained classification models.
[dataset, backward] [polynomial, scene, computer, provide, local, form, differentiable, directly, corresponding] [input, figure, image, proposed, high, method, quantitative, intermediate] [pooling, max, network, convolutional, deep, table, standard, layer, compared, dpp, strided, vgg, poolings, stride, channel, resnet, small, output, order, convolution, kernel, original, effectiveness, validation, science, higher, norm] [model, gated, common, evaluation, visual] [segmentation, semantic, average, feature, voc, spatial, detailed, pascal, map, evaluated, baseline, region, propose] [function, training, classification, set, learning, trained, data, large, specific]
@InProceedings{Wei_2019_CVPR,
  author = {Wei, Zhen and Zhang, Jingyi and Liu, Li and Zhu, Fan and Shen, Fumin and Zhou, Yi and Liu, Si and Sun, Yao and Shao, Ling},
  title = {Building Detail-Sensitive Semantic Segmentation Networks With Polynomial Pooling},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
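One simple way to realize a pooling operator that sits between average and max, in the spirit of P-pooling, is a power-weighted mean over each window: p = 0 recovers average pooling and large p approaches max pooling for non-negative (post-ReLU) activations. A sketch, not the paper's exact layer:

import torch
import torch.nn.functional as F

def polynomial_pool2d(x, p=2.0, kernel_size=2, eps=1e-6):
    x = x.clamp(min=0)                                  # assume post-ReLU activations
    numerator = F.avg_pool2d(x.pow(p + 1), kernel_size)
    denominator = F.avg_pool2d(x.pow(p), kernel_size) + eps
    return numerator / denominator                      # per-window sum(x^(p+1)) / sum(x^p)

x = torch.rand(1, 8, 32, 32)
y = polynomial_pool2d(x, p=3.0)                         # shape (1, 8, 16, 16)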
Variational Bayesian Dropout With a Hierarchical Prior
Yuhang Liu, Wenyong Dong, Lei Zhang, Dong Gong, Qinfeng Shi


Variational dropout (VD) is a generalization of Gaussian dropout, which aims at inferring the posterior of network weights based on a log-uniform prior on them to learn these weights as well as dropout rate simultaneously. The log-uniform prior not only interprets the regularization capacity of Gaussian dropout in network training, but also underpins the inference of such posterior. However, the log-uniform prior is an improper prior (i.e., its integral is infinite), which causes the inference of posterior to be ill-posed, thus restricting the regularization performance of VD. To address this problem, we present a new generalization of Gaussian dropout, termed variational Bayesian dropout (VBD), which turns to exploit a hierarchical prior on the network weights and infer a new joint posterior. Specifically, we implement the hierarchical prior as a zero-mean Gaussian distribution with variance sampled from a uniform hyper-prior. Then, we incorporate such a prior into inferring the joint posterior over network weights and the variance in the hierarchical prior, with which both the network training and dropout rate estimation can be cast into a joint optimization problem. More importantly, the hierarchical prior is a proper prior which enables the inference of posterior to be well-posed. In addition, we further show that the proposed VBD can be seamlessly applied to network compression. Experiments on classification and network compression demonstrate the superior performance of the proposed VBD in regularizing network training.
[term, framework, joint, follow] [error, well, computer, vision, proposition, international, consistent, general, theoretical] [prior, proposed, method, noise, conference, input, traditional, based, figure] [dropout, network, gaussian, bayesian, neural, vbd, compression, regularization, rate, inference, deep, improper, performance, compressing, bernoulli, preventing, concrete, max, table, weight, structured, convolutional, variance, proper, fixed, processing, number, denotes, compared, capacity, layer, sparse, pruning, sparsity, architecture, better, sbp, approximate] [variational, machine, model, adversarial, arxiv, preprint, inferring, sampled, enables, natural] [hierarchical, propose] [posterior, learning, test, distribution, training, generalization, objective, uniform, classification, learn, divergence, interpretation, address, exploit, conventional, likelihood]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Yuhang and Dong, Wenyong and Zhang, Lei and Gong, Dong and Shi, Qinfeng},
  title = {Variational Bayesian Dropout With a Hierarchical Prior},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
AANet: Attribute Attention Network for Person Re-Identifications
Chiat-Pin Tay, Sharmili Roy, Kim-Hui Yap


This paper proposes Attribute Attention Network (AANet), a new architecture that integrates person attributes and attribute attention maps into a classification framework to solve the person re-identification (re-ID) problem. Many person re-ID models typically employ semantic cues such as body parts or human pose to improve the re-ID performance. Attribute information, however, is often not utilized. The proposed AANet leverages a baseline model that uses body parts and integrates the key attribute information in a unified learning framework. The AANet consists of a global person ID task, a part detection task and a crucial attribute detection task. By estimating the class responses of individual attributes and combining them to form the attribute attention map (AAM), a very strong discriminatory representation is constructed. The proposed AANet outperforms the best state-of-the-art method [??] using ResNet-50 by 3.36% in mAP and 3.12% in Rank-1 accuracy on DukeMTMC-reID dataset. On Market1501 dataset, AANet achieves 92.38% mAP and 95.10% Rank-1 accuracy with re-ranking, outperforming [??], another state-of-the-art method using ResNet-152, by 1.42% in mAP and 0.47% in Rank-1 accuracy. In addition, AANet can perform person attribute prediction (e.g., gender, hair length, clothing length etc.), and localize the attributes in the query image.
[second, human, key, perform, performs, multiple, individual] [computer, body, vision, pattern, pose, good, international, form, problem] [attribute, proposed, figure, image, conference, ieee, based, identity, method, color, comparison, input] [network, activation, accuracy, deep, best, performance, lower, table, architecture, output, denotes, layer, convolutional, search, top] [query, attention, model, generated] [person, map, feature, aanet, clothing, global, aam, three, backbone, discriminatory, afn, semantic, spreid, pfn, heatmap, liang, detection, challenging] [learning, classification, loss, task, training, class, retrieval, uncertainty, upper, gallery, classifier, viewed, learn, gap, rank, triplet]
@InProceedings{Tay_2019_CVPR,
  author = {Tay, Chiat-Pin and Roy, Sharmili and Yap, Kim-Hui},
  title = {AANet: Attribute Attention Network for Person Re-Identifications},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Overcoming Limitations of Mixture Density Networks: A Sampling and Fitting Framework for Multimodal Future Prediction
Osama Makansi, Eddy Ilg, Ozgun Cicek, Thomas Brox


Future prediction is a fundamental principle of intelligence that helps plan actions and avoid possible dangers. As the future is uncertain to a large extent, modeling the uncertainty and multimodality of the future states is of great relevance. Existing approaches are rather limited in this regard and mostly yield a single hypothesis of the future or, at best, strongly constrained mixture components that suffer from instabilities in training and mode collapse. In this work, we present an approach that involves the prediction of several samples of the future with a winner-takes-all loss and iterative grouping of samples to multiple modes. Moreover, we discuss how to evaluate predicted multimodal distributions, including the common real scenario where only a single sample from the ground-truth distribution is available for evaluation. We show on synthetic and real data that the proposed approach yields good estimates of multimodal distributions and avoids mode collapse.
[future, prediction, wta, hypothesis, multiple, dataset, ewta, driving, multimodality, predict, video, mdns, nll, work, predicting, rwta, cpi, framework, time, emd, second, motion, semd] [ground, truth, single, computer, vision, approach, fitting, international, evolving, estimation, pattern, estimate, linear] [proposed, figure, conference, ieee, mdn, image, synthetic, based, input, real] [network, density, deep, dropout, gaussian, neural, convolutional, best, number, better, table] [multimodal, mode, arxiv, preprint, diverse, model, evaluation, evaluate] [predicted, stage, car, object, pedestrian, location] [distribution, mixture, loss, learning, training, sample, sampling, uncertainty, set, distance, data, learned, metric, oracle]
@InProceedings{Makansi_2019_CVPR,
  author = {Makansi, Osama and Ilg, Eddy and Cicek, Ozgun and Brox, Thomas},
  title = {Overcoming Limitations of Mixture Density Networks: A Sampling and Fitting Framework for Multimodal Future Prediction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
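The winner-takes-all idea can be sketched in a few lines: of the K predicted hypotheses, only the one closest to the ground truth receives a gradient. The evolving variant in the paper relaxes this by letting the top-k winners share the loss during training; only the plain version is shown.

import torch

def wta_loss(hypotheses, target):
    # hypotheses: (K, D) predicted future states, target: (D,) ground truth.
    sq_dists = (hypotheses - target.unsqueeze(0)).pow(2).sum(dim=1)   # (K,)
    return sq_dists.min()          # gradient flows only into the winning hypothesis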
A Main/Subsidiary Network Framework for Simplifying Binary Neural Networks
Yinghao Xu, Xin Dong, Yudian Li, Hao Su


To reduce memory footprint and run-time latency, techniques such as neural network pruning and binarization have been explored separately. However, it is unclear how to combine the best of the two worlds to get extremely small and efficient models. In this paper, we, for the first time, define the filter-level pruning problem for binary neural networks, which cannot be solved by simply migrating existing structural pruning methods for full-precision models. A novel learning-based approach is proposed to prune filters in our main/subsidiary network framework, where the main network is responsible for learning representative features to optimize the prediction performance, and the subsidiary component works as a filter selector on the main network. To avoid gradient mismatch when training the subsidiary component, we propose a layer-wise and bottom-up scheme. We also provide a theoretical and experimental comparison between our learning-based and greedy rule-based methods. Finally, we empirically demonstrate the effectiveness of our approach applied to several binary models, including binarized NIN, VGG-11, and ResNet-18, on various image classification datasets. For binary ResNet-18 on ImageNet, we use 78.6% of the filters but achieve a slightly better test error of 49.87% (50.02%-0.15%) than the original model.
[previous, work] [error, problem, optimal, international, reconstruction] [component, method, input, figure, conference, proposed, based, remove] [neural, pruning, network, binary, subsidiary, layer, deep, activation, gradient, quantization, pruned, rate, filter, mismatch, output, number, accuracy, efficient, prune, original, weight, quantized, retrain, processing, achieve, scheme, nin, better, convolution, imagenet, acceleration, compression, convolutional, operation, compared, ratio, dnns, sparse, computation, saving, larger, smaller] [memory, sign, model, goal, mac, represent, arxiv, preprint, greedy] [feature, bin, final, curve, propose] [learning, main, training, loss, function, train, set, distillation, experimental]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Yinghao and Dong, Xin and Li, Yudian and Su, Hao},
  title = {A Main/Subsidiary Network Framework for Simplifying Binary Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PointNetLK: Robust & Efficient Point Cloud Registration Using PointNet
Yasuhiro Aoki, Hunter Goforth, Rangaprasad Arun Srivatsan, Simon Lucey


PointNet has revolutionized how we think about representing point clouds. For classification and segmentation tasks, the approach and its subsequent variants/extensions are considered state-of-the-art. To date, the successful application of PointNet to point cloud registration has remained elusive. In this paper we argue that PointNet itself can be thought of as a learnable "imaging" function. As a consequence, classical vision algorithms for image alignment can be brought to bear on the problem -- namely the Lucas & Kanade (LK) algorithm. Our central innovations stem from: (i) how to modify the LK algorithm to accommodate the PointNet imaging function, and (ii) unrolling PointNet and the LK algorithm into a single trainable recurrent deep neural network. We describe the architecture, and compare its performance against state-of-the-art in several common registration scenarios. The architecture offers some remarkable properties including: generalization across shape categories and computational efficiency -- opening up new paths of exploration for the application of deep learning to point cloud registration. Code and videos are available at https://github.com/hmgoforth/PointNetLK.
[time, warp, work] [point, pointnetlk, pointnet, icp, cloud, registration, template, computer, approach, pose, visible, international, algorithm, vision, estimation, initial, pattern, twist, pnlk, modelnet, partially, optimization, jacobian, error, sensor, estimate, local, optimal, inverse, gest, volume, robust, classical, well, globally, rotation] [conference, ieee, image, figure, transform, noise, application, translation, input] [network, computation, pooling, deep, cost, neural, performance, number, gradient, max, efficient, computational, order, equal, convolution, gaussian] [model, find, iterative, arxiv, preprint] [object, global, feature] [source, test, alignment, training, data, representation, function, learning, trained, set, unseen, classification, symmetric, noisy, testing, train, minimum, stanford, loss]
@InProceedings{Aoki_2019_CVPR,
  author = {Aoki, Yasuhiro and Goforth, Hunter and Arun Srivatsan, Rangaprasad and Lucey, Simon},
  title = {PointNetLK: Robust & Efficient Point Cloud Registration Using PointNet},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Few-Shot Adaptive Faster R-CNN
Tao Wang, Xiaopeng Zhang, Li Yuan, Jiashi Feng


To mitigate the detection performance drop caused by domain shift, we aim to develop a novel few-shot adaptation approach that requires only a few target domain images with limited bounding box annotations. To this end, we first observe several significant challenges. First, the target domain data is highly insufficient, making most existing domain adaptation methods ineffective. Second, object detection involves simultaneous localization and classification, further complicating the model adaptation process. Third, the model suffers from over-adaptation (similar to overfitting when training with a few data examples) and instability risk that may lead to degraded detection performance in the target domain. To address these challenges, we first introduce a pairing mechanism over source and target features to alleviate the issue of insufficient target domain samples. We then propose a bi-level module to adapt the source trained detector to the target domain: 1) the split pooling based image level adaptation module uniformly extracts and aligns paired local patch features over locations, with different scale and aspect ratio; 2) the instance level adaptation module semantically aligns paired object features while avoiding inter-class confusion. Meanwhile, a source model feature regularization (SMFR) is applied to stabilize the adaptation process of the two modules. Combining these contributions gives a novel few-shot adaptive Faster-RCNN framework, termed FAFRCNN, which effectively adapts to the target domain with a few labeled samples. Experiments with multiple datasets show that our model achieves new state-of-the-art performance under both the few-shot domain adaptation (FDA) and unsupervised domain adaptation (UDA) settings.
[framework] [computer, vision, limited, pattern, approach, local, international] [image, proposed, method, pairing, conference, based, ieee, result, paired, input, real, figure, amount] [pooling, number, shift, performance, table, regularization, scale, deep, adaptive, small, neural] [model, adversarial, arxiv, preprint, discriminator] [object, feature, detection, level, instance, faster, module, box, car, annotated, sps, bounding, detector, roi, foreground, annotation, grid, spm, frcnn] [domain, target, adaptation, source, data, learning, training, uda, split, setting, unsupervised, large, sample, fda, trained, smfr, transfer, objective, set, adda, adapted, sampling, address, class, spl, novel, adapt, aligns, datasets, foggy]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Tao and Zhang, Xiaopeng and Yuan, Li and Feng, Jiashi},
  title = {Few-Shot Adaptive Faster R-CNN},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
VRSTC: Occlusion-Free Video Person Re-Identification
Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, Xilin Chen


Video person re-identification (re-ID) plays an important role in surveillance video analysis. However, the performance of video re-ID degenerates severely under partial occlusion. In this paper, we propose a novel network, called Spatio-Temporal Completion network (STCnet), to explicitly handle the partial occlusion problem. Different from most previous works that discard the occluded frames, STCnet can recover the appearance of the occluded parts. For one thing, the spatial structure of a pedestrian frame can be used to predict the occluded body parts from the unoccluded body parts of this frame. For another, the temporal patterns of a pedestrian sequence provide important clues to generate the contents of occluded parts. With the spatio-temporal information, STCnet can recover the appearance of the occluded parts, which can then be leveraged together with the unoccluded parts for more accurate video re-ID. By combining a re-ID network with STCnet, a video re-ID framework robust to partial occlusion (VRSTC) is proposed. Experiments on three challenging video re-ID databases demonstrate that the proposed approach outperforms the state-of-the-art methods.
[temporal, video, stcnet, frame, guider, adjacent, unoccluded, dataset, framework, consists, predict, current, explicitly, outperforms] [occluded, local, occlusion, body, completion, problem, reconstruction, approach, visible] [generator, input, proposed, appearance, figure, extracted, image, recover, patch, demonstrate] [network, structure, performance, convolutional, table, layer, deep, pooling, output, original, lower] [attention, generated, partial, adversarial, discriminator, generate, mechanism, model, encoder, arxiv, preprint] [spatial, person, feature, global, pedestrian, region, scoring, three, average, map, baseline, propose] [loss, similarity, set, learning, trained, representation, training, train, discriminative, china]
@InProceedings{Hou_2019_CVPR,
  author = {Hou, Ruibing and Ma, Bingpeng and Chang, Hong and Gu, Xinqian and Shan, Shiguang and Chen, Xilin},
  title = {VRSTC: Occlusion-Free Video Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Compact Feature Learning for Multi-Domain Image Classification
Yajing Liu, Xinmei Tian, Ya Li, Zhiwei Xiong, Feng Wu


The goal of multi-domain learning is to improve the performance over multiple domains by making full use of all training data from them. However, variations of feature distributions across different domains make multi-domain learning a non-trivial problem. The state-of-the-art work on multi-domain classification aims to extract domain-invariant features and domain-specific features independently. However, it views the distributions of features from different classes as one general distribution and tries to match these distributions across domains, which leads to the mixing of features from different classes across domains and degrades the classification performance. Additionally, existing works only force the shared features among domains to be orthogonal to the features in the domain-specific network. However, redundant features between the domain-specific networks still remain, which may shrink the discriminative ability of domain-specific features. Therefore, we propose an end-to-end network to obtain more optimal features, which we call compact features. We propose to extract the domain-invariant features by matching the joint distributions of different domains, which have distinct boundaries between different classes. Moreover, we add an orthogonal constraint between the private features across domains to ensure the discriminative ability of the domain-specific space. The proposed method is validated on three landmark datasets, and the results demonstrate the effectiveness of our method.
[joint, dataset, previous, extract, formulated] [matching, optimal, computer, general, match, well, vision, international] [figure, image, proposed, conditional, conference, extracted, ieee, transformation] [network, orthogonal, performance, applied, regularization, neural, compact, redundant, layer, apply, architecture, full, deep, sharing, parameter, identical, gradient, table, convolution] [private, adversarial, discriminator, marginal, ensure, man, arxiv, preprint] [feature, improve, connect, three, improves, propose, category] [shared, learning, domain, classification, training, distribution, classifier, space, restriction, mnist, loss, cross, learn, log, learned, indiv, knowledge, update, jarn, domainspecific, uniqueness, datasets, joarn, pair, discriminative, address]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Yajing and Tian, Xinmei and Li, Ya and Xiong, Zhiwei and Wu, Feng},
  title = {Compact Feature Learning for Multi-Domain Image Classification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
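A minimal sketch of the orthogonality constraint mentioned in the abstract above, assuming batch-level feature matrices in PyTorch. The joint-distribution matching part of the method is not shown, and the weighting of the terms in the usage comment is hypothetical.

import torch

def orthogonality_loss(f_a, f_b):
    # f_a, f_b: (batch, dim) feature matrices from two branches.
    # Penalizing the squared Frobenius norm of their cross-correlation pushes
    # the two feature spaces towards orthogonality, i.e., less redundancy.
    return torch.norm(f_a.t() @ f_b, p='fro') ** 2

# hypothetical usage with private features of two domains and the shared branch:
# loss = cls_loss + lam * (orthogonality_loss(private_d1, private_d2)
#                          + orthogonality_loss(shared, private_d1))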
Adaptive Transfer Network for Cross-Domain Person Re-Identification
Jiawei Liu, Zheng-Jun Zha, Di Chen, Richang Hong, Meng Wang


Recent deep learning based person re-identification approaches have steadily improved the performance on benchmarks; however, they often fail to generalize well from one domain to another. In this work, we propose a novel adaptive transfer network (ATNet) for effective cross-domain person re-identification. ATNet looks into the essential causes of the domain gap and addresses it following the principle of "divide-and-conquer". It decomposes the complicated cross-domain transfer into a set of factor-wise sub-transfers, each of which concentrates on style transfer with respect to a certain imaging factor, e.g., illumination, resolution, and camera view. An adaptive ensemble strategy is proposed to fuse factor-wise transfers by perceiving the effect magnitudes of various factors on images. Such a "decomposition-and-ensemble" strategy gives ATNet the capability of precise style transfer at the factor level and eventually effective transfer across domains. In particular, ATNet consists of a transfer network composed of multiple factor-wise CycleGANs and an ensemble CycleGAN, as well as a selection network that infers the effects of different factors on transferring each image. Extensive experimental results on three widely-used datasets, i.e., Market-1501, DukeMTMC-reID and PRID2011, have demonstrated the effectiveness of the proposed ATNet with significant performance improvements over state-of-the-art methods.
[dataset, recognition, multiple, work] [computer, camera, vision, pattern, illumination, international, well, constraint, matching] [conference, proposed, ieee, style, image, resolution, method, translated, figure, imaging, based, transferring, comparison, cyclegan, complicated, translation, conduct] [network, factor, adaptive, performance, selection, rate, weight, deep, effectiveness, effective, table, denotes, achieves, architecture, layer, original] [gan, gans, model, adversarial, generated, visual, observed, evaluation] [person, three, feature, map, pedestrian, propose] [domain, atnet, transfer, target, learning, ensemble, source, unsupervised, adaptation, set, emsemble, gap, loss, training, googlenet, uda, datasets, data, sample, large]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Jiawei and Zha, Zheng-Jun and Chen, Di and Hong, Richang and Wang, Meng},
  title = {Adaptive Transfer Network for Cross-Domain Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Large-Scale Few-Shot Learning: Knowledge Transfer With Class Hierarchy
Aoxue Li, Tiange Luo, Zhiwu Lu, Tao Xiang, Liwei Wang


Recently, large-scale few-shot learning (FSL) becomes topical. It is discovered that, for a large-scale FSL problem with 1,000 classes in the source domain, a strong baseline emerges, that is, simply training a deep feature embedding model using the aggregated source classes and performing nearest neighbor (NN) search using the learned features on the target classes. The state-of-the-art large-scale FSL methods struggle to beat this baseline, indicating intrinsic limitations on scalability. To overcome the challenge, we propose a novel large-scale FSL model by learning transferable visual features with the class hierarchy which encodes the semantic relations between source and target classes. Extensive experiments show that the proposed model significantly outperforms not only the NN baseline but also the state-of-the-art alternatives. Furthermore, we show that the proposed model can be easily extended to the large-scale zero-shot learning (ZSL) problem and also achieves the state-of-the-art results.
[prediction, dataset, recognition, second] [corresponding, problem, sgm] [figure, proposed, based, image] [layer, deep, number, network, accuracy, table, search, imagenet, net, denotes, best, achieves, performance] [model, visual, step, word, strong, evaluation, easily, beat, simple] [feature, hierarchy, semantic, hierarchical, baseline, object, box, cnn, propose, lsd] [class, target, source, learning, superclass, fsl, transferable, zsl, training, embedding, knowledge, comparative, imnet, learned, set, test, transfer, nearest, neighbor, label, classification, novel, learn, bottle, clustering, existing, space, labeled, experimental, trained, tao, largescale, extended, ppa]
@InProceedings{Li_2019_CVPR,
  author = {Li, Aoxue and Luo, Tiange and Lu, Zhiwu and Xiang, Tao and Wang, Liwei},
  title = {Large-Scale Few-Shot Learning: Knowledge Transfer With Class Hierarchy},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
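A minimal sketch of the nearest-neighbor (NN) baseline that the abstract above describes as hard to beat: embed all images with a network trained on the aggregated source classes, then label each target-class query by its nearest support example. Cosine similarity is an assumption; the paper's hierarchy-based model itself is not sketched here.

import numpy as np

def nn_baseline_predict(support_feats, support_labels, query_feats):
    # support_feats: (S, D) embeddings of the few labeled target-class examples
    # support_labels: (S,) their class labels; query_feats: (Q, D) test embeddings
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = q @ s.T                                   # (Q, S) cosine similarities
    return support_labels[np.argmax(sims, axis=1)]   # label of the nearest support example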
Moving Object Detection Under Discontinuous Change in Illumination Using Tensor Low-Rank and Invariant Sparse Decomposition
Moein Shakeri, Hong Zhang


Although low-rank and sparse decomposition based methods have been successfully applied to the problem of moving object detection using structured sparsity-inducing norms, they are still vulnerable to significant illumination changes that arise in certain applications. We are interested in moving object detection in applications involving time-lapse image sequences, for which current methods mistakenly group moving objects and illumination changes into the foreground. Our method relies on the multilinear (tensor) data low-rank and sparse decomposition framework to address the weaknesses of existing methods. The key to our proposed method is to first create a set of prior maps that can characterize the changes in the image sequence due to illumination. We show that they can be detected by a k-support norm. To deal with these two types of concurrent changes, we employ two regularization terms, one for detecting moving objects and the other for accounting for illumination changes, in the tensor low-rank and sparse decomposition formulation. Through comprehensive experiments using challenging datasets, we show that our method demonstrates a remarkable ability to detect moving objects under discontinuous change in illumination, and outperforms the state-of-the-art solutions to this challenging problem.
[moving, sequence, term, time, second, multiple, dataset, capture, framework, video] [illumination, tlisd, computer, vision, discontinuous, problem, matrix, direction, ilisd, decomposition, robust, slice, pattern, dominant, international, wildlife, analysis, error, formulation] [image, method, proposed, prior, ieee, background, conference, change, based, frontal, real, figure, separate, comparison, captured, industrial, surveillance] [tensor, norm, sparse, group, accuracy, best, number, regularization, processing, performance, sparsity] [monitoring, evaluation, evaluate, model] [object, detection, foreground, detected, average, detect] [invariant, rank, data, sample, representation, independent, existing, subtraction, min, set]
@InProceedings{Shakeri_2019_CVPR,
  author = {Shakeri, Moein and Zhang, Hong},
  title = {Moving Object Detection Under Discontinuous Change in Illumination Using Tensor Low-Rank and Invariant Sparse Decomposition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
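A schematic form of a low-rank plus sparse decomposition with an extra illumination term, written in LaTeX for orientation only; the paper's actual objective is a multilinear (tensor) formulation with prior maps and a k-support norm, so the symbols and penalties below are simplifications.

\min_{\mathcal{L},\,\mathcal{S},\,\mathcal{E}} \; \|\mathcal{L}\|_{*} + \lambda_1 \|\mathcal{S}\|_{1} + \lambda_2\, \Omega(\mathcal{E}) \quad \text{s.t.} \quad \mathcal{X} = \mathcal{L} + \mathcal{S} + \mathcal{E}

Here \mathcal{X} stacks the image sequence, \mathcal{L} is the low-rank background, \mathcal{S} the sparse moving-object foreground, and \mathcal{E} the illumination-change component whose regularizer \Omega would be guided by the prior maps.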
Pedestrian Detection With Autoregressive Network Phases
Garrick Brazil, Xiaoming Liu


We present an autoregressive pedestrian detection framework with cascaded phases designed to progressively improve precision. The proposed framework utilizes a novel lightweight stackable decoder-encoder module which uses convolutional re-sampling layers to improve features while maintaining efficient memory and runtime cost. Unlike previous cascaded detection systems, our proposed framework is designed within a region proposal network and thus retains greater context of nearby detections compared to independently processed RoI systems. We explicitly encourage increasing levels of precision by assigning strict labeling policies to each consecutive phase such that early phases develop features primarily focused on achieving high recall and later phases on accurate precision. In consequence, the final feature maps form more peaky radial gradients emanating from the centroids of unique pedestrians. Using our proposed autoregressive framework leads to new state-of-the-art performance on the reasonable and occlusion settings of the Caltech pedestrian dataset, and achieves competitive state-of-the-art performance on the KITTI dataset.
[previous, recurrent, framework, prediction, iteratively, dataset] [runtime, single, occlusion, kitti, form, denote] [proposed, high, method, produce, image] [phase, network, autoregressive, convolutional, layer, stride, channel, performance, table, deep, convolution, design, compared, bilinear, width, peaky, competitive, efficiency, achieve, efficient] [policy, memory, reasonable, evaluate] [pedestrian, feature, labeling, detection, box, caltech, module, object, proposal, suppression, rpn, bounding, strict, final, pathway, regression, iou, challenging, backbone, pfe, stackable, recall, score, localization, cascaded] [classification, ensemble, loss, target, incremental, set, learn, hard, training, setting, train, label]
@InProceedings{Brazil_2019_CVPR,
  author = {Brazil, Garrick and Liu, Xiaoming},
  title = {Pedestrian Detection With Autoregressive Network Phases},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
All You Need Is a Few Shifts: Designing Efficient Convolutional Neural Networks for Image Classification
Weijie Chen, Di Xie, Yuan Zhang, Shiliang Pu


Shift operation is an efficient alternative to depthwise separable convolution. However, it is still bottlenecked by its implementation manner, namely memory movement. To push this direction further, a novel basic component named Sparse Shift Layer (SSL) is introduced in this paper to construct efficient convolutional neural networks. In this family of architectures, the basic block is composed only of 1x1 convolutional layers, with only a few shift operations applied to the intermediate feature maps. To make this idea feasible, we introduce a shift operation penalty during optimization and further propose a quantization-aware shift learning method to make the learned displacements more friendly for inference. Extensive ablation studies indicate that only a few shift operations are sufficient to provide spatial information communication. Furthermore, to maximize the role of SSL, we redesign an improved network architecture to Fully Exploit the limited capacity of neural Network (FE-Net). Equipped with SSL, this network can achieve 75.0% top-1 accuracy on ImageNet with only 563M M-Adds. It surpasses other counterparts constructed with depthwise separable convolution and the networks searched by NAS in terms of accuracy and practical speed.
[displacement, build, time, formulated] [provide, practical, runtime, alternative, accurate, limited] [input, method, image, component, composed, study, figure, interpolation] [shift, neural, network, operation, layer, depthwise, separable, basic, convolutional, convolution, efficient, accuracy, sparsity, computational, compact, conv, architecture, deep, performance, channel, sparse, denotes, inference, equipped, imagenet, computation, kernel, unit, table, size, pruning, inverted, shiftresnet, block, achieve, design, lightweight, occupies, redundant, small, integer, output, number, unimportant, shiftnet, order, gpu, group] [memory, find] [feature, spatial, module, ablation, illustrated, propose, fully] [ssl, learning, training, classification, loss, data]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Weijie and Xie, Di and Zhang, Yuan and Pu, Shiliang},
  title = {All You Need Is a Few Shifts: Designing Efficient Convolutional Neural Networks for Image Classification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
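A minimal sketch of the shift operation behind the Sparse Shift Layer described above, in PyTorch. It uses torch.roll for brevity, so pixels wrap around instead of being zero-padded, and the per-channel integer displacements are assumed to be already quantized; the penalty-driven learning of those displacements is not shown.

import torch

def shift_features(x, dx, dy):
    # x: (B, C, H, W) feature map; dx, dy: length-C integer displacements.
    # In a *sparse* shift layer most entries of dx/dy are zero, so only a few
    # channels actually move before the following 1x1 convolution mixes them.
    out = torch.empty_like(x)
    for c in range(x.shape[1]):
        out[:, c] = torch.roll(x[:, c], shifts=(int(dy[c]), int(dx[c])), dims=(-2, -1))
    return out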
Stochastic Class-Based Hard Example Mining for Deep Metric Learning
Yumin Suh, Bohyung Han, Wonsik Kim, Kyoung Mu Lee


Performance of deep metric learning depends heavily on the capability of mining hard negative examples during training. However, many metric learning algorithms often require intractable computational cost due to frequent feature computations and nearest neighbor searches in a large-scale dataset. As a result, existing approaches often suffer from a trade-off between training speed and prediction accuracy. To alleviate this limitation, we propose a stochastic hard negative mining method. Our key idea is to adopt class signatures that keep track of feature embeddings online with minor additional cost during training, and to identify hard negative example candidates using the signatures. Given an anchor instance, our algorithm first selects a few hard negative classes based on the class-to-sample distances and then performs a refined instance-level search only within the selected classes. As most of the classes are discarded at the first step, it is much more efficient than exhaustive search while effectively mining a large number of hard examples. Our experiments show that the proposed technique improves image retrieval accuracy substantially; it achieves state-of-the-art performance on several standard benchmark datasets.
[dataset, online, perform] [approach, algorithm, note, technique, respect] [proposed, based, method, image, figure, input, comparison, face] [stochastic, deep, accuracy, search, number, computational, compared, table, iteration, signature, cost, batch, small, performance, approximate, increase, pooling, standard, reduce] [example, random, sampled, inception, identify, find, identified, adversarial] [feature, anchor, baseline, improves, person, average] [class, hard, mining, metric, loss, triplet, negative, training, learning, embedding, nearest, sample, retrieval, minibatch, distance, neighbor, similarity, set, randomly, existing, selected, extractor, large, pair, label, bca, embeddings, datasets, stanford, sampling]
@InProceedings{Suh_2019_CVPR,
  author = {Suh, Yumin and Han, Bohyung and Kim, Wonsik and Mu Lee, Kyoung},
  title = {Stochastic Class-Based Hard Example Mining for Deep Metric Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
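A minimal sketch of the two-step hard-negative search described in the abstract above, in PyTorch. The class signatures are assumed to be maintained elsewhere (e.g., as running means of embeddings), and the number of selected hard classes and the distance metric are illustrative choices, not the authors' exact design.

import torch

def mine_hard_negative(anchor, anchor_label, signatures, feats, labels, n_hard_classes=5):
    # anchor: (D,) embedding; signatures: (C, D) per-class signatures;
    # feats: (N, D) candidate embeddings with integer labels: (N,)
    class_d = torch.cdist(anchor[None], signatures)[0]       # (C,) class-to-sample distances
    class_d[anchor_label] = float('inf')                     # never pick the anchor's own class
    hard_classes = torch.topk(class_d, n_hard_classes, largest=False).indices
    mask = torch.isin(labels, hard_classes)                  # refined, instance-level search
    candidates = feats[mask]
    inst_d = torch.cdist(anchor[None], candidates)[0]
    return candidates[torch.argmin(inst_d)]                  # hardest negative instance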
Revisiting Local Descriptor Based Image-To-Class Measure for Few-Shot Learning
Wenbin Li, Lei Wang, Jinglin Xu, Jing Huo, Yang Gao, Jiebo Luo


Few-shot learning in image classification aims to learn a classifier to classify images when only a few training examples are available for each class. Recent work has achieved promising classification performance, where an image-level feature based measure is usually used. In this paper, we argue that a measure at such a level may not be effective enough in light of the scarcity of examples in few-shot learning. Instead, we think a local descriptor based image-to-class measure should be taken, inspired by its surprising success in the heydays of local invariant features. Specifically, building upon the recent episodic training mechanism, we propose a Deep Nearest Neighbor Neural Network (DN4 in short) and train it in an end-to-end manner. Its key difference from the literature is the replacement of the image-level feature based measure in the final layer by a local descriptor based image-to-class measure. This measure is conducted online via a k-nearest neighbor search over the deep local descriptors of convolutional feature maps. The proposed DN4 not only learns the optimal deep local descriptors for the image-to-class measure, but also utilizes the higher efficiency of such a measure in the case of example scarcity, thanks to the exchangeability of visual patterns across the images in the same class. Our work leads to a simple, effective, and computationally efficient framework for few-shot learning. Experimental study on benchmark datasets consistently shows its superiority over the related state-of-the-art, with the largest absolute improvement of 17% over the next best. The source code is available at https://github.com/WenbinLee/DN4.git.
[work, perform, framework, dataset, key] [local, matching, descriptor, approach, note] [based, image, proposed, method, difference] [deep, network, neural, number, table, convolutional, better, accuracy, performance, layer, best, effective, literature] [model, query, visual, memory, type] [module, global, feature, final, relation, three] [training, learning, nbnn, classification, measure, class, embedding, similarity, set, episodic, support, learn, task, metric, prototypical, train, setting, test, stanford, neighbor, datasets, nearest, representation, trained, gnn, snail, meta, conducted, exchangeability, miniimagenet, invariant]
@InProceedings{Li_2019_CVPR,
  author = {Li, Wenbin and Wang, Lei and Xu, Jinglin and Huo, Jing and Gao, Yang and Luo, Jiebo},
  title = {Revisiting Local Descriptor Based Image-To-Class Measure for Few-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
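A minimal sketch of the local-descriptor image-to-class measure described above, in PyTorch. The descriptor shapes and the cosine similarity are assumptions consistent with the abstract; the episodic training loop around it is omitted.

import torch
import torch.nn.functional as F

def image_to_class_score(query_desc, class_desc, k=3):
    # query_desc: (M, D) local descriptors of the query image (one per spatial
    # position of the conv feature map); class_desc: (N, D) descriptors pooled
    # from all support images of one class.
    q = F.normalize(query_desc, dim=1)
    c = F.normalize(class_desc, dim=1)
    sims = q @ c.t()                              # (M, N) cosine similarities
    topk = torch.topk(sims, k, dim=1).values      # k nearest descriptors per query descriptor
    return topk.sum()                             # higher score = closer to this class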
Towards Robust Curve Text Detection With Conditional Spatial Expansion
Zichuan Liu, Guosheng Lin, Sheng Yang, Fayao Liu, Weisi Lin, Wang Ling Goh


It is challenging to detect curve texts due to their irregular shapes and varying sizes. In this paper, we first investigate the deficiencies of existing curve detection methods and then propose a novel Conditional Spatial Expansion (CSE) mechanism to improve the performance of curve detection. Instead of regarding curve text detection as a polygon regression or a segmentation problem, we formulate it as a sequence prediction on the spatial domain. CSE starts with a seed arbitrarily chosen within a text region and progressively merges neighborhood regions based on local features extracted by a CNN and contextual information of the merged regions. The CSE is highly parameterized and can be seamlessly integrated into existing object detection frameworks. Enhanced by the data-dependent CSE mechanism, our curve text detection system provides robust instance-level text region extraction with minimal post-processing. The analysis experiments show that our CSE can handle texts with various shapes, sizes, and orientations, and can effectively suppress the false positives coming from text-like textures or unexpected texts included in the same RoI. Compared with existing curve text detection algorithms, our method is more robust and enjoys a simpler processing flow. It also sets a new state-of-the-art on curve text benchmarks with an F-measure of up to 78.4%.
[previous, state, extract, recognition, current, modeling, adjacent, transition] [computer, scene, robust, corresponding, vision, local, pattern, ambiguity, direction, neighborhood, well] [method, based, conference, proposed, conditional, ieee, figure, image, arbitrary, produce, produced] [performance, neural, highly, flexible, output, precision, initialized, experiment] [text, node, arxiv, preprint, robustness, targeted, candidate, demonstrated] [cse, detection, region, seed, object, curve, spatial, regression, expansion, polygon, segmentation, box, rcnn, merging, expanding, grid, baseline, feature, unexpected, mask, detecting, detector, included, location, illustrated, instance, recall, proposal, seeding, indicated, faster, xiang] [existing, training, positive, sampling, set, rest, datasets]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Zichuan and Lin, Guosheng and Yang, Sheng and Liu, Fayao and Lin, Weisi and Ling Goh, Wang},
  title = {Towards Robust Curve Text Detection With Conditional Spatial Expansion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Revisiting Perspective Information for Efficient Crowd Counting
Miaojing Shi, Zhaohui Yang, Chao Xu, Qijun Chen


Crowd counting is the task of estimating people numbers in crowd images. Modern crowd counting methods employ deep neural networks to estimate crowd counts via crowd density regressions. A major challenge of this task lies in the perspective distortion, which results in drastic person scale change in an image. Density regression on the small person area is in general very hard. In this work, we propose a perspective-aware convolutional neural network (PACNN) for efficient crowd counting, which integrates the perspective information into density regression to provide additional knowledge of the person scale change in an image. Ground truth perspective maps are firstly generated for training; PACNN is then specifically designed to predict multi-scale perspective maps and encode them as perspective-aware weighting layers in the network to adaptively combine the outputs of multi-scale density maps. The weights are learned at every pixel of the maps such that the final density combination is robust to the perspective distortion. We conduct extensive experiments on the ShanghaiTech, WorldExpo'10, UCF_CC_50, and UCSD datasets, and demonstrate the effectiveness and efficiency of PACNN over the state-of-the-art.
[people, ucf, dataset] [perspective, ground, truth, local, camera, dense, general, estimated, additional, scene, varying, fit, corresponding, linear, denote, estimate] [image, mse, change, proposed, method, figure, traditional, nonlinear, resolution, pixel] [density, pacnn, network, scale, size, convolutional, table, deep, neural, conv, inference, small, ucsd, output, architecture, standard, modern, employ, adaptively, number, lowest, efficient] [sampled, generate] [crowd, counting, map, person, regression, mae, head, sha, pedestrian, count, detection, shanghaitech, combine, three, final, average, height, cnn, propose] [loss, weighting, function, training, task, test, distance, learning, combination]
@InProceedings{Shi_2019_CVPR,
  author = {Shi, Miaojing and Yang, Zhaohui and Xu, Chao and Chen, Qijun},
  title = {Revisiting Perspective Information for Efficient Crowd Counting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Towards Universal Object Detection by Domain Attention
Xudong Wang, Zhaowei Cai, Dashan Gao, Nuno Vasconcelos


Despite increasing efforts on universal representations for visual recognition, few have addressed object detection. In this paper, we develop an effective and efficient universal object detection system that is capable of working on various image domains, from human faces and traffic signs to medical CT images. Unlike multi-domain models, this universal model does not require prior knowledge of the domain of interest. This is achieved by the introduction of a new family of adaptation layers, based on the principles of squeeze and excitation, and a new domain-attention mechanism. In the proposed universal detector, all parameters and computations are shared across domains, and a single network processes all domains all the time. Experiments, on a newly established universal object detection benchmark of 11 diverse datasets, show that the proposed detector outperforms a bank of individual detectors, a multi-domain detector, and a baseline universal detector, with a 1.3x parameter increase over a single-domain baseline detector. The code and benchmark are available at http://www.svcl.ucsd.edu/projects/universal-detection/.
[bank, multiple, dataset, kitchen, traffic, outperforms] [single, kitti, solution, problem] [figure, proposed, image, comparison, ieee] [table, performance, network, number, adaptive, pooling, residual, convolutional, deep, output, relu, channel, neural, parameter, best, sharing] [attention, visual, lisa, model, adding, mechanism, evaluation, arxiv, preprint, common, diverse, machine] [detector, object, detection, adapter, voc, module, faster, feature, baseline, widerface, comic, deeplesion, benchmark, coco, global, watercolor, clipart, ross, pascal, dota, backbone, average, kaiming, propose, rpn] [domain, universal, learning, trained, adaptation, set, datasets, shared, tested, training, test, knowledge, extractor]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Xudong and Cai, Zhaowei and Gao, Dashan and Vasconcelos, Nuno},
  title = {Towards Universal Object Detection by Domain Attention},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
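A minimal sketch of a squeeze-and-excitation adapter bank combined by a domain-attention softmax, following the description in the abstract above. The layer sizes, reduction ratio, and where the module is inserted in the detector are assumptions, not the authors' exact design.

import torch
import torch.nn as nn

class DomainAttentionSE(nn.Module):
    def __init__(self, channels, num_adapters, reduction=16):
        super().__init__()
        # a bank of squeeze-and-excitation branches shared by all domains
        self.adapters = nn.ModuleList([
            nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                          nn.Linear(channels // reduction, channels))
            for _ in range(num_adapters)])
        self.attn = nn.Linear(channels, num_adapters)   # soft, per-image domain assignment

    def forward(self, x):                               # x: (B, C, H, W)
        z = x.mean(dim=(2, 3))                          # squeeze: global average pooling
        w = torch.softmax(self.attn(z), dim=1)          # (B, A) domain attention
        exc = torch.stack([a(z) for a in self.adapters], dim=1)    # (B, A, C)
        gate = torch.sigmoid((w.unsqueeze(-1) * exc).sum(dim=1))   # (B, C)
        return x * gate[:, :, None, None]               # excite: channel re-weighting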
Ensemble Deep Manifold Similarity Learning Using Hard Proxies
Nicolas Aziere, Sinisa Todorovic


This paper is about learning deep representations of images such that images belonging to the same class have more similar representations than those belonging to different classes. For this goal, prior work typically uses the triplet or N-pair loss, specified in terms of either l2-distances or dot-products between deep features. However, such formulations seem poorly suited to the highly non-Euclidean deep feature space. Our first contribution is in specifying the N-pair loss in terms of manifold similarities between deep features. We introduce a new time- and memory-efficient method for estimating the manifold similarities by using a closed-form convergence solution of the Random Walk algorithm. Our efficiency comes, in part, from following the recent work that randomly partitions the deep feature space, and expresses image distances via representatives of the resulting subspaces, a.k.a. proxies. Our second contribution is aimed at reducing overfitting by estimating hard proxies that are as close to one another as possible, but remain in their respective subspaces. Our evaluation demonstrates that we outperform the state of the art in both image retrieval and clustering on the benchmark CUB-200-2011, Cars196, and Stanford Online Products datasets.
[work, walk, second, state, online] [estimating, estimate, optimal, approach, computer, compute, bound, normalized, intrinsic, initial, vision, estimation, defined, respective, algorithm, geodesic, relative, respect] [image, desired, prior, conference, figure] [deep, number, complexity, computing, cnns, performance, represents, small, efficiently, optimize, size] [manifold, random, contribution, query] [feature, cnn, illustrated, contextual, anchor] [loss, training, edms, ensemble, similarity, learning, proxy, set, hard, metric, space, distance, data, retrieval, clustering, randomly, class, partitioning, upper, googlenet, test, positive, lcxt, belonging, partition, minimizing, negative, large, log, close, stanford, triplet, function, angular]
@InProceedings{Aziere_2019_CVPR,
  author = {Aziere, Nicolas and Todorovic, Sinisa},
  title = {Ensemble Deep Manifold Similarity Learning Using Hard Proxies},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
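A minimal sketch of a closed-form random-walk similarity on a batch affinity graph, in the spirit of the abstract above; the k-NN sparsification, the proxy-based partitioning of the feature space, and the exact normalization used in the paper are omitted, and alpha is an assumed restart parameter.

import torch

def manifold_similarity(feats, alpha=0.9):
    # feats: (N, D) L2-normalized deep features of one batch / subspace
    aff = torch.clamp(feats @ feats.t(), min=0)                  # non-negative affinities
    aff.fill_diagonal_(0)
    w = aff / aff.sum(dim=1, keepdim=True).clamp(min=1e-8)       # row-stochastic transitions
    n = feats.shape[0]
    # closed-form convergence of the random walk instead of iterating it
    return (1 - alpha) * torch.linalg.inv(torch.eye(n, device=feats.device) - alpha * w)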
Quantization Networks
Jiwei Yang, Xu Shen, Jun Xing, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, Xian-sheng Hua


Although deep neural networks are highly effective, their high computational and memory costs severely hinder their applications to portable devices. As a consequence, low-bit quantization, which converts a full-precision neural network into a low-bitwidth integer version, has been an active and promising research topic. Existing methods formulate the low-bit quantization of networks as an approximation or optimization problem. Approximation-based methods confront the gradient mismatch problem, while optimization-based methods are only suitable for quantizing weights and can introduce high computational cost during the training stage. In this paper, we provide a simple and uniform way to quantize both weights and activations by formulating quantization as a differentiable non-linear function. The quantization function is represented as a linear combination of several Sigmoid functions with learnable biases and scales that can be learned in a lossless and end-to-end manner via continuous relaxation of the steepness of the Sigmoid functions. Extensive experiments on image classification and object detection tasks show that our quantization networks outperform state-of-the-art methods. We believe that the proposed method will shed new light on the interpretation of neural network quantization.
[forward, backward, work, manner] [linear, differentiable, optimization, continuous, form, ideal, relaxation, directly] [method, based, proposed, input, high, image, figure] [quantization, neural, network, deep, binary, sigmoid, weight, activation, table, convolutional, quantized, performance, imagenet, gradient, ternary, formulate, process, inference, unit, quantizing, scale, computational, mismatch, operation, gradually, epoch, layer, alexnet, integer, approximation, lossless, efficient, increased, low, suitable] [model, step, arxiv, preprint, simple] [object, detection, module, final, maxout, propose] [function, training, learning, set, temperature, learned, classification, soft, trained, train, combination, gap]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Jiwei and Shen, Xu and Xing, Jun and Tian, Xinmei and Li, Houqiang and Deng, Bing and Huang, Jianqiang and Hua, Xian-sheng},
  title = {Quantization Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
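A minimal sketch of a quantization function built as a linear combination of sigmoids, as described in the abstract above. The number of steps, their placement, and the temperature value are illustrative; the paper learns the biases and scales end to end and anneals the steepness during training.

import torch

def soft_quantize(x, biases, scales, temperature):
    # x: weights or activations (any shape); biases/scales: (n,) learnable step
    # positions and heights. A larger temperature makes the smooth function
    # approach an n-step staircase, i.e., a hard quantizer.
    steps = torch.sigmoid(temperature * (x.unsqueeze(-1) - biases))   # (..., n)
    return (steps * scales).sum(dim=-1)

# e.g. three unit steps roughly emulating a low-bit quantizer (hypothetical values):
# y = soft_quantize(w, torch.tensor([-0.5, 0.0, 0.5]), torch.ones(3), temperature=10.0)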
RES-PCA: A Scalable Approach to Recovering Low-Rank Matrices
Chong Peng, Chenglizhao Chen, Zhao Kang, Jianbo Li, Qiang Cheng


Robust principal component analysis (RPCA) has drawn significant attention due to its powerful capability in recovering low-rank matrices as well as its successful applications to various real world problems. The current state-of-the-art algorithms usually need to solve the singular value decomposition of large matrices, which generally has at least quadratic or even cubic complexity. This drawback has limited the application of RPCA to real world problems. To combat this drawback, in this paper we propose a new type of RPCA method, RES-PCA, which is linearly efficient and scalable in both data size and dimension. For comparison, AltProj, an existing scalable approach to RPCA, requires precise knowledge of the true rank; otherwise, it may fail to recover low-rank matrices. By contrast, our method works with or without knowing the true rank; even when both methods work, our method is faster. Extensive experiments have been performed and testify to the effectiveness of the proposed method both quantitatively and in visual quality, which suggests that our method is suitable to be employed as a light-weight, scalable component for RPCA in application pipelines.
[subject, current, time, video, qiang, term] [altproj, robust, rpca, pcp, matrix, nsa, solve, truth, ground, vbrpca, ialm, principal, convex, linear, algorithm, light, singular, optimization, approach, computer, analysis, linearly, corresponding, journal, well, nuclear, john, international, decomposition, quadratic] [method, proposed, component, recover, pca, figure, face, ieee, separation, nonlinear, background, desired, real, based] [norm, complexity, original, scalable, sparse, parameter, table, efficient, size, number, chong, zhao, effectiveness, group, shadow, low, top, science, suitable] [balancing, model, visual, consider, observed, type] [propose, precise] [data, rank, min, set, large, paper, subspace, generally]
@InProceedings{Peng_2019_CVPR,
  author = {Peng, Chong and Chen, Chenglizhao and Kang, Zhao and Li, Jianbo and Cheng, Qiang},
  title = {RES-PCA: A Scalable Approach to Recovering Low-Rank Matrices},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Occlusion-Net: 2D/3D Occluded Keypoint Localization Using Graph Networks
N. Dinesh Reddy, Minh Vo, Srinivasa G. Narasimhan


We present Occlusion-Net, a framework to predict 2D and 3D locations of occluded keypoints for objects, in a largely self-supervised manner. We use an off-the-shelf detector as input (like MaskRCNN) that is trained only on visible key point annotations. This is the only supervision used in this work. A graph encoder network then explicitly classifies invisible edges and a graph decoder network corrects the occluded keypoint locations from the initial detector. Central to this work is a trifocal tensor loss that provides indirect self-supervision for occluded keypoint locations that are visible in other views of the object. The 2D keypoints are then passed into a 3D graph network that estimates the 3D shape and camera pose using the self-supervised re-projection loss. At test time, our approach successfully localizes keypoints in a single view under a diverse set of severe occlusion settings. We demonstrate and evaluate our approach on synthetic CAD data as well as a large image set capturing vehicles at many busy city intersections. As an interesting aside, we compare the accuracy of human labels of invisible keypoints against those obtained from geometric trifocal-tensor loss.
[graph, predict, human, multiple, dataset, explicitly] [occluded, keypoints, keypoint, visible, invisible, occlusion, maskrcnn, computer, pck, shape, camera, trifocal, approach, vision, respect, computed, carfusion, error, pose, confidence, ground, pattern, point, single, truth, reprojection, cad, reconstruction, initial, view, well, eij] [figure, alpha, conference, method, image, ieee, input, based, synthetic] [network, number, accuracy, neural, tensor, compared, output, convolutional, deep, denotes, represents] [model, encoder, decoder, arxiv, preprint] [object, baseline, localization, predicted, detection, supervision, location, annotated, detector, edge] [observe, trained, loss, training, labeled, learning, large, train, set, data]
@InProceedings{Reddy_2019_CVPR,
  author = {Dinesh Reddy, N. and Vo, Minh and Narasimhan, Srinivasa G.},
  title = {Occlusion-Net: 2D/3D Occluded Keypoint Localization Using Graph Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Efficient Featurized Image Pyramid Network for Single Shot Detector
Yanwei Pang, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, Ling Shao


Single-stage object detectors have recently gained popularity due to their combined advantage of high detection accuracy and real-time speed. However, while promising results have been achieved by these detectors on standard-sized objects, their performance on small objects is far from satisfactory. To detect very small/large objects, classical pyramid representation can be exploited, where an image pyramid is used to build a feature pyramid (featurized image pyramid), enabling detection across a range of scales. Existing single-stage detectors avoid such a featurized image pyramid representation due to its memory and time complexity. In this paper, we introduce a light-weight architecture to efficiently produce featurized image pyramid in a single-stage detection framework. The resulting multi-scale features are then injected into the prediction layers of the detector using an attention module. The performance of our detector is validated on two benchmarks: PASCAL VOC and MS COCO. For a 300x300 input, our detector operates at 111 frames per second (FPS) on a Titan X GPU, providing state-of-the-art detection accuracy on PASCAL VOC 2007 testset. On the MS COCO testset, our detector achieves state-of-the-art results surpassing all existing single-stage methods in the case of single-scale inference.
[prediction, fusion, forward, current, passed, dataset, combined] [single, approach, accurate] [image, input, comparison, proposed, figure] [conv, convolutional, standard, shallow, performance, small, network, block, featurized, layer, accuracy, lfip, compared, size, deep, speed, downsampling, pooling, architecture, scale, achieves, progressive, number, relu, table, impact, increase, precision] [introduce, attention, generated] [feature, detection, ssd, pyramid, object, detector, voc, pascal, baseline, map, coco, module, modulated, refinedet, extraction, cnn, average, backbone, sized, dssd, faster] [large, set, representation, existing, test, shot]
@InProceedings{Pang_2019_CVPR,
  author = {Pang, Yanwei and Wang, Tiancai and Muhammad Anwer, Rao and Shahbaz Khan, Fahad and Shao, Ling},
  title = {Efficient Featurized Image Pyramid Network for Single Shot Detector},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Task Multi-Sensor Fusion for 3D Object Detection
Ming Liang, Bin Yang, Yun Chen, Rui Hu, Raquel Urtasun


In this paper we propose to exploit multiple related tasks for accurate multi-sensor 3D object detection. Towards this goal we present an end-to-end learnable architecture that reasons about 2D and 3D object detection as well as ground estimation and depth completion. Our experiments show that all these tasks are complementary and help the network learn better representations by fusing information at various levels. Importantly, our approach leads the KITTI benchmark on 2D, 3D and bird's eye view object detection, while being real-time.
[fusion, long, online, dataset, multiple] [lidar, depth, dense, ground, point, estimation, completion, kitti, provide, single, geometric, approach, camera, accurate, cloud, orientation, continuous, range, rgb, autonomous, monocular, correspondence] [image, figure, proposed, based, pixel, input] [network, sparse, better, convolutional, gain, architecture, table, apply, residual] [model, find] [feature, object, detection, bev, detector, roi, backbone, map, box, raquel, benchmark, fuse, height, refinement, bin, propose, improve, fused, module, precise, oriented, moderate, ablation, stage, fully, extraction, easy] [learning, exploit, representation, loss, task, training, hard, learn, pseudo, auxiliary]
@InProceedings{Liang_2019_CVPR,
  author = {Liang, Ming and Yang, Bin and Chen, Yun and Hu, Rui and Urtasun, Raquel},
  title = {Multi-Task Multi-Sensor Fusion for 3D Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Domain-Specific Batch Normalization for Unsupervised Domain Adaptation
Woong-Gi Chang, Tackgeun You, Seonguk Seo, Suha Kwak, Bohyung Han


We propose a novel unsupervised domain adaptation framework based on domain-specific batch normalization in deep neural networks. We aim to adapt to both domains by specializing batch normalization layers in convolutional neural networks while allowing them to share all other model parameters, which is realized by a two-stage algorithm. In the first stage, we estimate pseudo-labels for the examples in the target domain using an external unsupervised domain adaptation algorithm---for example, MSTN or CPUA---integrating the proposed domain-specific batch normalization. The second stage learns the final models using a multi-task classification loss for the source and target domains. Note that the two domains have separate batch normalization layers in both stages. Our framework can be easily incorporated into domain adaptation techniques based on deep neural networks with batch normalization layers. We also show that our approach can be extended to the problem with multiple source domains. The proposed algorithm is evaluated on multiple benchmark datasets and achieves state-of-the-art accuracy in the standard setting and the multi-source domain adaptation scenario.
[framework, second, dataset, multiple, learns, prediction, merged, consists] [initial, note, approach, estimate, algorithm, matching, michael, single] [based, method, separate, figure, proposed] [network, batch, deep, normalization, table, neural, accuracy, performance, variance, better, standard, layer, shift] [adversarial, model, common, visual, indicates] [stage, semantic, lcls, baseline, benchmark, improve] [domain, dsbn, adaptation, unsupervised, target, source, learning, training, loss, mstn, classification, pseudo, class, learn, data, cpua, existing, trained, function, datasets, alignment, kate, unlabeled, transfer, learned, maximum, discrepancy, extended, representation]
@InProceedings{Chang_2019_CVPR,
  author = {Chang, Woong-Gi and You, Tackgeun and Seo, Seonguk and Kwak, Suha and Han, Bohyung},
  title = {Domain-Specific Batch Normalization for Unsupervised Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
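A minimal sketch of domain-specific batch normalization in PyTorch: one BatchNorm2d per domain while all other parameters are shared, as the abstract above describes. The two-stage pseudo-labeling pipeline around it is not shown.

import torch.nn as nn

class DomainSpecificBatchNorm2d(nn.Module):
    def __init__(self, num_features, num_domains=2):
        super().__init__()
        # one set of BN statistics and affine parameters per domain
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_domains)])

    def forward(self, x, domain):
        # route the batch through its own domain's normalization layer
        return self.bns[domain](x)

# usage: replace each BatchNorm2d of the backbone with this module and pass
# domain=0 for source batches and domain=1 for target batches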
Grid R-CNN
Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, Junjie Yan


This paper proposes a novel object detection framework named Grid R-CNN, which adopts a grid guided localization mechanism for accurate object detection. Different from traditional regression based methods, Grid R-CNN captures the spatial information explicitly and enjoys the position-sensitive property of fully convolutional architectures. Instead of using only two independent points, we design a multi-point supervision formulation to encode more clues in order to reduce the impact of inaccurate predictions of specific points. To take full advantage of the correlation of points in a grid, we propose a two-stage information fusion strategy to fuse the feature maps of neighboring grid points. The grid guided localization approach can easily be extended to different state-of-the-art detection frameworks. Grid R-CNN leads to high quality object localization, and experiments demonstrate that it achieves a 4.1% AP gain at IoU=0.8 and a 10.0% AP gain at IoU=0.9 on the COCO benchmark compared to Faster R-CNN with a Res50 backbone and FPN architecture.
[fusion, prediction, second, framework, perform] [computer, point, vision, corresponding, pattern, approach, accurate, left, international, corner, ground, truth, directly] [conference, based, figure, ieee, method, image, quality, proposed, mapping, high, comparison, traditional] [table, convolutional, order, performance, network, gain, top, design, output, deep, accuracy, original, compared, achieves, neural, achieve] [mechanism] [grid, feature, object, region, box, localization, fpn, bounding, detection, faster, map, proposal, spatial, regression, fully, guided, iou, branch, coco, location, supervision, improvement, backbone, locate, roi, heatmap, pascal, doll, european, propose, offset, improve, voc] [extended, classification, large, positive, upper, learning, set]
@InProceedings{Lu_2019_CVPR,
  author = {Lu, Xin and Li, Buyu and Yue, Yuxin and Li, Quanquan and Yan, Junjie},
  title = {Grid R-CNN},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MetaCleaner: Learning to Hallucinate Clean Representations for Noisy-Labeled Visual Recognition
Weihe Zhang, Yali Wang, Yu Qiao


Deep Neural Networks (DNNs) have achieved remarkable successes in large-scale visual recognition. However, they often suffer from overfitting under noisy labels. To alleviate this problem, we propose a conceptually simple but effective MetaCleaner, which can learn to hallucinate a clean representation of an object category, according to a small noisy subset from the same category. Specifically, MetaCleaner consists of two flexible submodules. The first submodule, namely Noisy Weighting, can estimate the confidence scores of all the images in the noisy subset, by analyzing their deep features jointly. The second submodule, namely Clean Hallucinating, can generate a clean representation from the noisy subset, by summarizing the noisy images with their confidence scores. Via MetaCleaner, DNNs can strengthen their robustness to noisy labels, as well as enhance their generalization capacity with richer data diversity. Moreover, MetaCleaner can be easily integrated into the standard training procedure of DNNs, which promotes its value for real-life applications. We conduct extensive experiments on two popular benchmarks in noisy-labeled recognition, i.e., Food-101N and Clothing1M. For both datasets, our MetaCleaner significantly outperforms baselines, and achieves the state-of-the-art performance.
[outperforms, recognition, perform, illustrates, consists] [confidence, robust, vision, computer, estimate, corresponding, allows, international, pattern, well, solution] [clean, image, noise, input, conference, ieee, method, figure, proposed] [size, deep, batch, table, neural, dnns, small, performance, network, number, layer, standard, verification, capacity, achieve] [visual, generate, robustness, procedure, requires, simple, richer, evaluate] [score, extra, level, feature, baseline, propose, object, improve, cnn, category] [metacleaner, noisy, learning, training, subset, data, representation, label, weighting, set, hallucinate, loss, classifier, softmax, hallucination, pclean, investigate, cleannet, hallucinating, meta, confusion, prototypical, generalize, sample, train, learn, noisylabeled, depress]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Weihe and Wang, Yali and Qiao, Yu},
  title = {MetaCleaner: Learning to Hallucinate Clean Representations for Noisy-Labeled Visual Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Mapping, Localization and Path Planning for Image-Based Navigation Using Visual Features and Map
Janine Thoma, Danda Pani Paudel, Ajad Chhatkuli, Thomas Probst, Luc Van Gool


Building on progress in feature representations for image retrieval, image-based localization has seen a surge of research interest. Image-based localization has the advantage of being inexpensive and efficient, often avoiding the use of 3D metric maps altogether. That said, the need to maintain a large number of reference images as an effective support for localization in a scene nonetheless calls for them to be organized in a map structure of some kind. The problem of localization often arises as part of a navigation process. We are, therefore, interested in summarizing the reference images as a set of landmarks, which meet the requirements for image-based navigation. A contribution of this paper is to formulate such a set of requirements for the two sub-tasks involved: compact map construction and accurate self-localization. These requirements are then exploited for compact map representation and accurate self-localization, using the framework of a network flow problem. During this process, we formulate the map construction and self-localization problems as convex quadratic and second-order cone programs, respectively. We evaluate our methods on publicly available indoor and outdoor datasets, where they outperform existing methods significantly.
[flow, sequence, graph, directed, multiple, second, planning] [matching, geometric, problem, eij, uij, oxford, robotcar, seqslam, construction, vertex, uniformly, define, absolute, algorithm, convex, solve, initial, netvlad, night, quadratic, cone, camera, provide, programming, solving, solved] [image, reference, landmark, method, based, figure, desired] [network, cost, rate, capacity, accuracy, number, selection, process, order] [query, visual, navigation, path, rule, cij, sensitivity, consider, ensure] [localization, map, feature, location, three, edge, matched, threshold, anchor] [distance, set, yij, representation, source, target, min, bipartite, summarized, task, selected, uniform]
@InProceedings{Thoma_2019_CVPR,
  author = {Thoma, Janine and Pani Paudel, Danda and Chhatkuli, Ajad and Probst, Thomas and Van Gool, Luc},
  title = {Mapping, Localization and Path Planning for Image-Based Navigation Using Visual Features and Map},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Triply Supervised Decoder Networks for Joint Detection and Segmentation
Jiale Cao, Yanwei Pang, Xuelong Li


Joint object detection and semantic segmentation is essential in many fields such as self-driving cars. An initial attempt towards this goal is to simply share a single network for multi-task learning. We argue that it does not make full use of the fact that detection and segmentation are mutually beneficial. In this paper, we propose a framework called TripleNet to deeply boost these two tasks. On the one hand, to deeply join the two tasks at different scales, triple supervisions including detection-oriented supervision and class-aware/agnostic segmentation supervisions are imposed on each layer of the decoder. Class-agnostic segmentation provides an objectness prior for detection and segmentation. On the other hand, to further intercross the two tasks and refine the features at each scale, two light-weight modules (i.e., the inner-connected module and the attention skip-layer fusion) are incorporated. Because segmentation supervision on each decoder layer is not performed at the test stage and the two added modules are light-weight, the proposed TripleNet can run at a real-time speed (16 fps). Experiments on the VOC 2007/2012 and COCO datasets show that TripleNet outperforms all the other one-stage methods on both tasks (e.g., 81.9% mAP and 83.3% mIoU on VOC 2012, and 37.1% mAP and 59.6% mIoU on COCO) by a single network.
[joint, outperforms, fusion, concatenated, dataset] [computer, vision, pattern, international, single, simultaneously] [conference, ieee, proposed, image, input, based, method, figure] [layer, table, convolutional, performance, network, deep, deeply, compared, speed, size, output, rate, neural, achieves] [decoder, encoder, generate, attention, improved] [detection, segmentation, semantic, object, feature, triplenet, voc, pairnet, seg, map, supervision, det, coco, miou, module, pyramid, branch, fully, detect, three, blitznet, context, backbone, join, improve, spatial, logits, doll, european] [set, learning, training, test, data, datasets]
@InProceedings{Cao_2019_CVPR,
  author = {Cao, Jiale and Pang, Yanwei and Li, Xuelong},
  title = {Triply Supervised Decoder Networks for Joint Detection and Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Leveraging the Invariant Side of Generative Zero-Shot Learning
Jingjing Li, Mengmeng Jing, Ke Lu, Zhengming Ding, Lei Zhu, Zi Huang


Conventional zero-shot learning (ZSL) methods generally learn an embedding, e.g., a visual-semantic mapping, to handle the unseen visual samples in an indirect manner. In this paper, we take advantage of generative adversarial networks (GANs) and propose a novel method, named leveraging invariant side GAN (LisGAN), which can directly generate the unseen features from random noises which are conditioned by the semantic descriptions. Specifically, we train a conditional Wasserstein GAN in which the generator synthesizes fake unseen features from noises and the discriminator distinguishes the fake from real via a minimax game. Considering that one semantic description can correspond to various synthesized visual samples, and the semantic description, figuratively, is the soul of the generated features, we introduce soul samples as the invariant side of generative zero-shot learning in this paper. A soul sample is the meta-representation of one class. It visualizes the most semantically-meaningful aspects of each sample in the same category. We regularize that each generated sample (the varying side of generative ZSL) should be close to at least one soul sample (the invariant side) which has the same class label as it. At the zero-shot recognition stage, we propose to use two classifiers, which are deployed in a cascaded manner, to achieve a coarse-to-fine result. Experiments on five popular benchmarks verify that our proposed approach can outperform state-of-the-art methods with significant improvements.
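The soul-sample constraint lends itself to a short sketch: each generated feature is pulled toward its nearest soul sample of the same class. This is a hedged reading of the regularizer described above, not the authors' implementation; the dictionary layout and the squared Euclidean distance are assumptions.

import torch

def soul_regularizer(fake_feats, labels, soul_samples):
    # fake_feats: (n, d) generated features, labels: (n,) class ids,
    # soul_samples: dict mapping class id -> (m, d) soul samples of that class
    losses = []
    for x, y in zip(fake_feats, labels):
        souls = soul_samples[int(y)]
        dists = ((souls - x) ** 2).sum(dim=1)   # squared distance to every soul sample
        losses.append(dists.min())              # stay close to at least one of them
    return torch.stack(losses).mean()

The term would be added to the conditional Wasserstein GAN objective with a weighting coefficient.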
[report, previous, dataset, work, recognize, considering, multiple] [approach, guarantee, confidence, directly, corresponding] [generative, real, method, conditional, synthesized, generator, side, proposed, image, synthesize, handle, figure, attribute, lei, input] [table, accuracy, number, regularize, compared, deep, highly, regularization, output] [visual, generated, fake, gan, model, adversarial, gans, discriminator, generate, introduce, random, jingjing] [semantic, propose, category, cascade, leverage, object] [unseen, learning, sample, soul, classification, generalized, apay, class, training, loss, classifier, learn, entropy, supervised, deploy, awa, invariant, train, close, reported, label, address, embedding, set, harmonic, generally, trained, zsl, lisgan, space, domain]
@InProceedings{Li_2019_CVPR,
  author = {Li, Jingjing and Jing, Mengmeng and Lu, Ke and Ding, Zhengming and Zhu, Lei and Huang, Zi},
  title = {Leveraging the Invariant Side of Generative Zero-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Exploring the Bounds of the Utility of Context for Object Detection
Ehud Barnea, Ohad Ben-Shahar


The recurring context in which objects appear holds valuable information that can be employed to predict their existence. This intuitive observation indeed led many researchers to endow appearance-based detectors with explicit reasoning about context. The underlying thesis suggests that stronger contextual relations would facilitate greater improvements in detection capacity. In practice, however, the observed improvement in many cases is modest at best, and often only marginal. In this work we seek to improve our understanding of this phenomenon, in part by pursuing an opposite approach. Instead of attempting to improve detection scores by employing context, we treat the utility of context as an optimization problem: to what extent can detection scores be improved by considering context or any other kind of additional information? With this approach we explore the bounds on improvement by using contextual relations between objects and provide a tool for identifying the most helpful ones. We show that simple co-occurrence relations can often provide large gains, while in other cases a significant improvement is simply impossible or impractical with either co-occurrence or more precise spatial relations. To better understand these results we then analyze the ability of context to handle different types of false detections, revealing that the tested contextual information cannot ameliorate localization errors, severely limiting its gains. These and additional insights further our understanding of where and why utilization of context for object detection succeeds and fails.
[work, largest, dataset, employed] [confidence, additional, provide, bound, defined, general, accurate, define, analysis, case, note, choi] [method, based, image, figure, background, presented, change, handle] [number, precision, best, equal, suggested, binary, lower, capacity, experiment, larger, better, analyze, neural] [true, observed, random, blue, utility, understanding, simply, provided, maximize, expected, represent] [context, object, detection, improvement, contextual, localization, false, maximal, recall, bin, relation, pascal, category, employing, spatial, calculation, average, ctxi, improve, curve, iou, confident, role, assigned, detector] [ranking, function, upper, base, confusion, large, set, corresponds, classification, space]
@InProceedings{Barnea_2019_CVPR,
  author = {Barnea, Ehud and Ben-Shahar, Ohad},
  title = {Exploring the Bounds of the Utility of Context for Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A-CNN: Annularly Convolutional Neural Networks on Point Clouds
Artem Komarichev, Zichun Zhong, Jing Hua


Analyzing the geometric and semantic properties of 3D point clouds through the deep networks is still challenging due to the irregularity and sparsity of samplings of their geometric structures. This paper presents a new method to define and compute convolution directly on 3D point clouds by the proposed annular convolution. This new convolution operator can better capture the local neighborhood geometry of each point by specifying the (regular and dilated) ring-shaped structures and directions in the computation. It can adapt to the geometric variability and scalability at the signal processing level. We apply it to the developed hierarchical neural networks for object classification, part segmentation, and semantic segmentation in large-scale scenes. The extensive experiments and comparisons demonstrate that our approach outperforms the state-of-the-art methods on a variety of standard benchmark datasets (e.g., ModelNet10, ModelNet40, ShapeNet-part, S3DIS, and ScanNet).
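The ring-shaped neighborhood at the heart of annular convolution can be illustrated in a few lines; the function below only selects the points of one ring and ignores the ordering, projection, and kernel weighting steps of the full method. The radius names are illustrative.

import numpy as np

def ring_neighborhood(points, center, r_in, r_out):
    # points: (n, 3) point cloud, center: (3,) query point
    d = np.linalg.norm(points - center, axis=1)
    return points[(d >= r_in) & (d < r_out)]    # points whose distance lies inside the ring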
[ordering, ordered, framework, capture] [point, local, annular, computer, shape, vision, cloud, pattern, geometric, pointnet, ring, normal, define, tangent, counterclockwise, approach, volumetric, indoor, international, neighborhood, supplementary] [proposed, conference, ieee, based, method, input, figure] [convolution, deep, dilated, neural, convolutional, better, network, order, kernel, apply, number, original, pooling, applying, size, operator, processing, performance, aggregate, scheme, achieves] [model, regular, query, encoder, evaluate] [segmentation, neighboring, semantic, feature, hierarchical, object, spatial, cnn, propose, region, pointcnn] [learning, classification, learn, data, representation, set, training, euclidean]
@InProceedings{Komarichev_2019_CVPR,
  author = {Komarichev, Artem and Zhong, Zichun and Hua, Jing},
  title = {A-CNN: Annularly Convolutional Neural Networks on Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DARNet: Deep Active Ray Network for Building Segmentation
Dominic Cheng, Renjie Liao, Sanja Fidler, Raquel Urtasun


In this paper, we propose a Deep Active Ray Network (DARNet) for automatic building segmentation. Taking an image as input, it first exploits a deep convolutional neural network (CNN) as the backbone to predict energy maps, which are further utilized to construct an energy function. A polygon-based contour is then evolved by minimizing the energy function, whose minimum defines the final segmentation. Instead of parameterizing the contour using Euclidean coordinates, we adopt polar coordinates, i.e., rays, which not only prevent self-intersection but also simplify the design of the energy function. Moreover, we propose a loss function that directly encourages the contours to match building boundaries. Our DARNet is trained end-to-end by back-propagating through the energy minimization and the backbone CNN, which makes the CNN adapt to the dynamics of the contour evolution. Experiments on three building instance segmentation datasets demonstrate that our DARNet achieves state-of-the-art or comparable performance relative to other competitors.
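A polar (ray-based) contour is easy to picture in code: one radius per equally spaced angle around a center, which by construction cannot self-intersect. This shows only the parameterization described above, not the energy minimization; the fixed angular spacing is an assumption.

import numpy as np

def rays_to_contour(center, radii):
    # center: (2,) contour origin, radii: (n,) predicted ray lengths
    angles = np.linspace(0.0, 2.0 * np.pi, len(radii), endpoint=False)
    xs = center[0] + radii * np.cos(angles)
    ys = center[1] + radii * np.sin(angles)
    return np.stack([xs, ys], axis=1)           # (n, 2) contour vertices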
[term, multiple, predict, dataset] [active, point, dsac, ground, truth, balloon, parameterization, ray, darnet, match, approach, defined, curvature, note, polar, define, initial, angle, eballoon, polysim, boundf, single, directly] [contour, method, figure, based, image, reference, input, demonstrate, proposed, transform] [energy, building, deep, network, output, initialization, performance, better, convolutional, structured, compared, inference, validation] [vector, model, partial, introduce, improved] [cnn, segmentation, instance, backbone, predicted, propose, bing, map, object, vaihingen, semantic, polygon, final, boundary, three, feature, area, torontocity] [loss, learning, set, representation, data, encourages, alignment, function, datasets, distance, training]
@InProceedings{Cheng_2019_CVPR,
  author = {Cheng, Dominic and Liao, Renjie and Fidler, Sanja and Urtasun, Raquel},
  title = {DARNet: Deep Active Ray Network for Building Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Point Cloud Oversegmentation With Graph-Structured Deep Metric Learning
Loic Landrieu, Mohamed Boussaha


We propose a new supervised learning framework for oversegmenting 3D point clouds into superpoints. We cast this problem as learning deep embeddings of the local geometry and radiometry of 3D points, such that the borders of objects present high contrast. The embeddings are computed using a lightweight neural network operating on the points' local neighborhood. Finally, we formulate point cloud oversegmentation as a graph partition problem with respect to the learned embeddings. This new approach allows us to set a new state-of-the-art in point cloud oversegmentation by a significant margin, on a dense indoor dataset (S3DIS) and a sparse outdoor one (vKITTI). Our best solution requires over five times fewer superpoints than previously published methods to reach similar performance on S3DIS. Furthermore, we show that our framework can be used to improve superpoint-based semantic segmentation algorithms, setting a new state-of-the-art for this task as well.
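The border-contrast objective described above can be sketched as a contrastive loss over adjacency edges: embeddings of adjacent points from the same object are pulled together, and embeddings across an object border are pushed apart by a margin. The margin value, the squared penalties, and the tensor layout are assumptions, not the paper's exact loss.

import torch

def border_contrastive_loss(emb, edges, same_object, margin=1.0):
    # emb: (n, d) point embeddings, edges: (m, 2) long tensor of adjacent point indices,
    # same_object: (m,) boolean mask, True if the edge stays inside one object
    d = (emb[edges[:, 0]] - emb[edges[:, 1]]).norm(dim=1)
    pull = d[same_object].pow(2).mean()                                 # intra-object: small distance
    push = torch.clamp(margin - d[~same_object], min=0).pow(2).mean()   # border: at least the margin
    return pull + push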
[graph, framework, dataset, long] [point, superpoints, cloud, oversegmentation, local, einter, approach, superpoint, problem, define, vkitti, computed, ooa, radiometry, respect, defined, pointnet, geometric, position, lpe, vccs, international, allows, computer, lidar, neighborhood, note, purity, supervized, geometry, analysis, algorithm] [method, figure, input, quality, high, image] [deep, neural, performance, network, table, structure, small, accuracy, better, best] [introduce] [semantic, segmentation, object, propose, border, improve, superpixel, spatial, superpixels, edge, lin] [embeddings, learning, loss, embedding, set, metric, partition, function, ssp, adjacency, task, clustering, weighting, contrastive, learned]
@InProceedings{Landrieu_2019_CVPR,
  author = {Landrieu, Loic and Boussaha, Mohamed},
  title = {Point Cloud Oversegmentation With Graph-Structured Deep Metric Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Graphonomy: Universal Human Parsing via Graph Transfer Learning
Ke Gong, Yiming Gao, Xiaodan Liang, Xiaohui Shen, Meng Wang, Liang Lin


Prior highly-tuned human parsing models tend to fit towards each dataset in a specific domain or with discrepant label granularity, and can hardly be adapted to other human parsing tasks without extensive re-training. In this paper, we aim to learn a single universal human parsing model that can tackle all kinds of human parsing needs by unifying label annotations from different domains or at various levels of granularity. This poses many fundamental learning challenges, e.g. discovering underlying semantic structures among different label granularity, performing proper transfer learning across different image domains, and identifying and utilizing label redundancies across related tasks. To address these challenges, we propose a new universal human parsing agent, named "Graphonomy", which incorporates hierarchical graph transfer learning upon the conventional parsing network to encode the underlying label semantic structures and propagate relevant semantic information. In particular, Graphonomy first learns and propagates compact high-level graph representation among the labels within one dataset via Intra-Graph Reasoning, and then transfers semantic information across multiple datasets via Inter-Graph Transfer. Various graph transfer dependencies (e.g., similarity, linguistic knowledge) between different datasets are analyzed and encoded to enhance graph transfer capability. By distilling universal semantic graph representation to each specific task, Graphonomy is able to predict all levels of parsing labels in one system without piling up the complexity. Experimental results show Graphonomy effectively achieves the state-of-the-art results on three human parsing benchmarks as well as advantageous universal human parsing performance.
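One intra-graph reasoning step can be pictured as a standard graph convolution over the label graph: project the per-label node features and propagate them along a normalized adjacency. This is a generic sketch under that assumption; the adjacency construction and the inter-graph transfer between datasets are not shown.

import torch
import torch.nn as nn

class IntraGraphStep(nn.Module):
    def __init__(self, dim, adjacency):
        super().__init__()
        self.register_buffer("adj", adjacency)   # (n, n) normalized label-graph adjacency
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes):                    # nodes: (n, dim) per-label features
        return torch.relu(self.adj @ self.proj(nodes))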
[human, graph, dataset, multiple, tackle, explicitly, correlated, incorporating] [matrix, body, june, pose] [image, method, hair, figure] [network, table, deep, performance, structure, weight, achieve, convolution, number, capability, convolutional, distill] [reasoning, model, node, visual, introduce, arxiv, preprint, encode] [parsing, semantic, graphonomy, feature, hierarchical, enhance, xiaodan, liang, global, xiaohui, cihp, object, atr, shuicheng, three, segmentation, deeplab, jian, head, annotated] [transfer, learning, label, universal, representation, knowledge, training, similarity, data, datasets, target, source, discrepancy, set, task, specific, conventional, alleviate, adjacency]
@InProceedings{Gong_2019_CVPR,
  author = {Gong, Ke and Gao, Yiming and Liang, Xiaodan and Shen, Xiaohui and Wang, Meng and Lin, Liang},
  title = {Graphonomy: Universal Human Parsing via Graph Transfer Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fitting Multiple Heterogeneous Models by Multi-Class Cascaded T-Linkage
Luca Magri, Andrea Fusiello


This paper addresses the problem of fitting multiple models in the general context where the sought structures can be described by a mixture of heterogeneous parametric models drawn from different classes. To this end, we conceive a multi-model selection framework that extends T-linkage to cope with different nested classes of models. Our method, called MCT, compares favourably with the state-of-the-art on publicly available data-sets for various fitting problems: lines and conics, homographies and fundamental matrices, planes and cylinders.
[multiple, motion, framework, extract, circle, heterogeneous, work] [fitting, fundamental, geometric, computer, general, mct, problem, point, homographies, homography, pattern, vision, inlier, matrix, degenerate, robust, algorithm, gric, consensus, parametric, case, approach, single, conic, colour, international, tlinkage, planar, geometry, underlying, form, plane, assignment, cube, varying, instantiated, well] [image, ieee, described, conference, preference, figure, recovery, method, result] [selection, nested, number, simpler, structure, criterion] [model, type, step, provided] [segmentation, refinement, cascaded, threshold, detected] [data, set, class, sampling, function, interpretation, clustering, compatible, specific, idea, partition, main, exploiting, belonging]
@InProceedings{Magri_2019_CVPR,
  author = {Magri, Luca and Fusiello, Andrea},
  title = {Fitting Multiple Heterogeneous Models by Multi-Class Cascaded T-Linkage},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Late Fusion CNN for Digital Matting
Yunke Zhang, Lixue Gong, Lubin Fan, Peiran Ren, Qixing Huang, Hujun Bao, Weiwei Xu


This paper studies the structure of a deep convolutional neural network to predict the foreground alpha matte by taking a single RGB image as input. Our network is fully convolutional with two decoder branches for the foreground and background classification respectively. Then a fusion branch is used to integrate the two classification results which gives rise to alpha values as the soft segmentation result. This design provides more degrees of freedom than a single decoder branch for the network to obtain better alpha values during training. The network can implicitly produce trimaps without user interaction, which is easy to use for novices without expertise in digital matting. Experimental results demonstrate that our network can achieve high-quality alpha mattes for various types of objects and outperform the state-of-the-art CNN-based image matting methods on the human image matting task.
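A plausible reading of the fusion step is a per-pixel convex blend of the foreground and background classification maps, with the blending weight predicted by the fusion branch. The exact formula in the paper may differ; this is only an illustrative assumption.

import numpy as np

def fuse_alpha(fg_prob, bg_prob, blend_w):
    # fg_prob, bg_prob, blend_w: (h, w) maps with values in [0, 1]
    alpha = blend_w * fg_prob + (1.0 - blend_w) * (1.0 - bg_prob)
    return np.clip(alpha, 0.0, 1.0)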
[fusion, transition, human, dataset, recognition, late, predict, consists, joint] [computer, vision, pattern, single, approach, internet, compute, well, international, volume, rgb] [image, matting, alpha, background, ieee, conference, matte, trimap, input, digital, based, figure, blending, trimaps, pixel, method, color, spectral, portrait] [network, deep, convolutional, design, neural, structure, better, block, size, output, weight, convolution] [dim, decoder, probability, natural, encoder, generated, automatic] [foreground, segmentation, branch, predicted, object, cnn, fully, region, three, semantic, final, feature, refinement] [training, loss, testing, learning, classification, soft, trained, set, learn, train]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Yunke and Gong, Lixue and Fan, Lubin and Ren, Peiran and Huang, Qixing and Bao, Hujun and Xu, Weiwei},
  title = {A Late Fusion CNN for Digital Matting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
BASNet: Boundary-Aware Salient Object Detection
Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, Martin Jagersand


Deep Convolutional Neural Networks have been adopted for salient object detection and achieved the state-of-the-art performance. Most of the previous works however focus on region accuracy but not on the boundary quality. In this paper, we propose a predict-refine architecture, BASNet, and a new hybrid loss for Boundary-Aware Salient object detection. Specifically, the architecture is composed of a densely supervised Encoder-Decoder network and a residual refinement module, which are respectively in charge of saliency prediction and saliency map refinement. The hybrid loss guides the network to learn the transformation between the input image and the ground truth in a three-level hierarchy -- pixel-, patch- and map- level -- by fusing Binary Cross Entropy (BCE), Structural SIMilarity (SSIM) and Intersection-over-Union (IoU) losses. Equipped with the hybrid loss, the proposed predict-refine architecture is able to effectively segment the salient object regions and accurately predict the fine structures with clear boundaries. Experimental results on six public datasets show that our method outperforms the state-of-the-art methods both in terms of regional and boundary evaluation measures. Our method runs at over 25 fps on a single GPU. The code is available at: https://github.com/NathanUA/BASNet.
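The hybrid loss is concrete enough for a short sketch: pixel-level binary cross-entropy, a structural-similarity term, and a map-level IoU term, summed. The SSIM below is computed once over the whole map for brevity, whereas the paper uses local windows (patch level); the constants follow the usual SSIM defaults.

import torch
import torch.nn.functional as F

def hybrid_loss(pred, gt, eps=1e-7):
    # pred, gt: saliency maps with values in [0, 1] and identical shape
    bce = F.binary_cross_entropy(pred, gt)
    mu_p, mu_g = pred.mean(), gt.mean()
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_g ** 2 + c1) * (pred.var() + gt.var() + c2))
    inter = (pred * gt).sum()
    iou = 1.0 - (inter + eps) / (pred.sum() + gt.sum() - inter + eps)
    return bce + (1.0 - ssim) + iou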
[prediction, recurrent, capture, predict, structural, second] [computer, vision, pattern, ground, truth, international, accurate, local, relaxed, defined] [ieee, conference, image, proposed, input, method, ssim, hybrid, based, figure, clear, comparison, pixel, background] [deep, network, convolutional, architecture, residual, neural, output, layer, pooling, convolution, size, fine, designed, precision, martin, binary] [probability, attention, model, visual] [saliency, salient, object, boundary, refinement, detection, module, map, predicted, bce, iou, stage, rrm, relaxf, global, coarse, huchuan, foreground, mae, three, basnet, maxf, adopted, feature, refine, ablation, recall, level, picanetr] [loss, training, learning, novel, supervised]
@InProceedings{Qin_2019_CVPR,
  author = {Qin, Xuebin and Zhang, Zichen and Huang, Chenyang and Gao, Chao and Dehghan, Masood and Jagersand, Martin},
  title = {BASNet: Boundary-Aware Salient Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ZigZagNet: Fusing Top-Down and Bottom-Up Context for Object Segmentation
Di Lin, Dingguo Shen, Siting Shen, Yuanfeng Ji, Dani Lischinski, Daniel Cohen-Or, Hui Huang


Multi-scale context information has proven to be essential for object segmentation tasks. Recent works construct the multi-scale context by aggregating convolutional feature maps extracted by different levels of a deep neural network. This is typically done by propagating and fusing features in a one-directional, top-down and bottom-up, manner. In this work, we introduce ZigZagNet, which aggregates a richer multi-context feature map by using not only dense top-down and bottom-up propagation, but also by introducing pathways crossing between different levels of the top-down and the bottom-up hierarchies, in a zig-zag fashion. Furthermore, the context information is exchanged and aggregated over multiple stages, where the fused feature maps from one stage are fed into the next one, yielding a more comprehensive context for improved segmentation performance. Our extensive evaluation on the public benchmarks demonstrates that ZigZagNet surpasses the state-of-the-art accuracy for both semantic segmentation and instance segmentation tasks.
[dataset, propagation, propagate, adjacent, bidirectional, multiple, previous, fusion] [dense, compute, yielding, approach, single] [figure, image, method, produced, input, chen, produce, successive] [network, convolutional, validation, deep, table, performance, achieve, neural, convolution, accuracy, atrous, compared, employ, pooling] [encoding, richer, evaluate, represent, model] [feature, context, segmentation, map, semantic, pascal, object, person, instance, voc, subregions, coco, zigzagnet, global, backbone, region, mask, stage, spatial, fused, pyramid, miou, fusing, rce, three, lin, subregion, exchange, score, exchanging, fuse] [set, learning, test, training, large, reported, representation, loss]
@InProceedings{Lin_2019_CVPR,
  author = {Lin, Di and Shen, Dingguo and Shen, Siting and Ji, Yuanfeng and Lischinski, Dani and Cohen-Or, Daniel and Huang, Hui},
  title = {ZigZagNet: Fusing Top-Down and Bottom-Up Context for Object Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Object Instance Annotation With Deep Extreme Level Set Evolution
Zian Wang, David Acuna, Huan Ling, Amlan Kar, Sanja Fidler


In this paper, we tackle the task of interactive object segmentation. We revive the old idea of level set segmentation, which framed object annotation as curve evolution. Carefully designed energy functions ensured that the curve was well aligned with image boundaries, and generally "well behaved". The Level Set Method can handle objects with complex shapes and topological changes such as merging and splitting, and is thus able to deal with occluded objects and objects with holes. We propose Deep Extreme Level Set Evolution that combines powerful CNN models with level set optimization in an end-to-end fashion. Our method learns to predict evolution parameters conditioned on the image and evolves the predicted initial contour to produce the final result. We make our model interactive by incorporating user clicks on the extreme boundary points, following DEXTR. We show that our approach significantly outperforms DEXTR on the static Cityscapes dataset and the video segmentation benchmark DAVIS, and performs on par on PASCAL and SBD.
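A generic level-set update helps picture the evolution the network drives: the implicit contour moves along its normal with a predicted speed map. This is textbook curve evolution under stated assumptions, not the paper's energy or its differentiable implementation.

import numpy as np

def evolve_level_set(phi, speed, dt=0.1, steps=10):
    # phi: (h, w) signed level-set function, speed: (h, w) motion term predicted elsewhere
    for _ in range(steps):
        gy, gx = np.gradient(phi)
        grad_norm = np.sqrt(gx ** 2 + gy ** 2) + 1e-8
        phi = phi + dt * speed * grad_norm       # advect the zero level set along its normal
    return phi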
[motion, term, predict, dataset, davis, follow, prediction, work, time, human, learns] [initial, active, direction, point, approach, denote, curvature, field, predicts, additional, signed, well] [image, input, contour, method, proposed, figure, based, qualitative] [evolution, deep, modulation, performance, table, energy, convolutional, regularization, architecture, network, output, employ, neural] [model, vector, evaluation, evaluate, find] [level, object, extreme, boundary, segmentation, curve, interactive, branch, lsf, delse, predicted, cnn, pascal, annotation, dextr, threshold, semantic, map, final, regression, miou, box, instance] [set, function, training, distance, loss, update, learning, train, trained, exploit, task]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Zian and Acuna, David and Ling, Huan and Kar, Amlan and Fidler, Sanja},
  title = {Object Instance Annotation With Deep Extreme Level Set Evolution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Leveraging Crowdsourced GPS Data for Road Extraction From Aerial Imagery
Tao Sun, Zonglin Di, Pengyu Che, Chun Liu, Yin Wang


Deep learning is revolutionizing the mapping industry. Under lightweight human curation, computers have generated almost half of the roads in Thailand on OpenStreetMap (OSM) using high resolution aerial imagery. Bing Maps displays 125 million computer-generated building polygons in the U.S. While this is tremendously more efficient than manual mapping, one cannot map out everything from the air. Especially for roads, a small prediction gap caused by image occlusion renders the entire road useless for routing. Misconnections can be more dangerous. Therefore computer-based mapping often requires local verification, which is still labor intensive. In this paper, we propose to leverage crowdsourced GPS data to improve and support road extraction from aerial imagery. Through novel data augmentation, GPS rendering, and 1D transpose convolution techniques, we show almost 5% improvements over previous competition-winning models, and much better robustness when predicting new areas without any new training data or domain adaptation.
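GPS rendering can be pictured as rasterizing the crowdsourced points into an extra input channel for the segmentation network. The count-map-plus-log rendering below is a simple assumption; the paper's rendering and augmentation pipeline are more involved.

import numpy as np

def render_gps(points, height, width):
    # points: iterable of (x, y) already projected to pixel coordinates
    canvas = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        xi, yi = int(x), int(y)
        if 0 <= yi < height and 0 <= xi < width:
            canvas[yi, xi] += 1.0
    return np.log1p(canvas)                      # compress heavy-tailed visit counts

The rendered channel would be concatenated with the RGB aerial image before the first convolution.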
[dataset, prediction, recognition, challenge, work, early] [rendering, computer, international, vision, render, pattern, quantity, local, total, rgb] [gps, image, input, figure, beijing, transpose, crowdsourced, imagery, resolution, method, shanghai, conference, ieee, noise, linknet, remote, high, osm, mapping, based, taxi, pixel] [performance, number, deep, convolution, kernel, better, neural, interval, convolutional, size, table, gaussian, best, building, scale, speed, low, larger, top, layer, network, net, verification] [model, decoder, random, machine] [road, aerial, segmentation, extraction, satellite, map, iou, area, semantic, deeplab, deepglobe] [data, datasets, augmentation, sampling, learning, sample, trained]
@InProceedings{Sun_2019_CVPR,
  author = {Sun, Tao and Di, Zonglin and Che, Pengyu and Liu, Chun and Wang, Yin},
  title = {Leveraging Crowdsourced GPS Data for Road Extraction From Aerial Imagery},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Adaptive Pyramid Context Network for Semantic Segmentation
Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, Yu Qiao


Recent studies have shown that context features can significantly improve the performance of deep semantic segmentation networks. Current context-based segmentation methods differ from each other in how they construct context features and perform differently in practice. This paper first introduces three desirable properties of context features for the segmentation task. Specifically, we find that Global-guided Local Affinity (GLA) can play a vital role in constructing effective context features, while this property has been largely ignored in previous works. Based on this analysis, this paper proposes Adaptive Pyramid Context Network (APCNet) for semantic segmentation. APCNet adaptively constructs multi-scale contextual representations with multiple well-designed Adaptive Context Modules (ACMs). Specifically, each ACM leverages a global image representation as guidance to estimate the local affinity coefficients for each sub-region, and then calculates a context vector with these affinities. We empirically evaluate our APCNet on three semantic segmentation and scene parsing datasets, including the PASCAL VOC 2012, Pascal-Context, and ADE20K datasets. Experimental results show that APCNet achieves state-of-the-art performance on all three benchmarks, and obtains a new record of 84.2% on the PASCAL VOC 2012 test set without MS COCO pre-training or any post-processing.
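A single global-guided local affinity step might look as follows: affinity weights for the pooled sub-regions are predicted from the local feature concatenated with a global image vector, and the adaptive context vector is their weighted sum. The linear affinity head and the per-position formulation are assumptions for illustration, not the paper's exact module.

import torch
import torch.nn as nn

class GLASketch(nn.Module):
    def __init__(self, dim, num_regions):
        super().__init__()
        self.affinity = nn.Linear(2 * dim, num_regions)

    def forward(self, local_feat, global_feat, region_feats):
        # local_feat, global_feat: (d,)  region_feats: (s, d) pooled sub-region features
        a = torch.softmax(self.affinity(torch.cat([local_feat, global_feat])), dim=0)
        return a @ region_feats                  # (d,) adaptive context vector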
[dataset, construct, previous, capture, key] [local, computer, scene, vision, pattern, estimate, property, position, international, problem] [image, conference, method, ieee, based, input, figure, proposed] [adaptive, table, performance, convolutional, network, deep, scale, validation, calculate, size, adaptively, achieves, neural, convolution, original, pooling] [vector, arxiv, preprint, model, evaluation, describe, attention] [context, semantic, segmentation, global, apcnet, pyramid, pascal, affinity, voc, feature, gla, backbone, pspnet, three, parsing, spatial, fcn, map, average, danet, object, baseline, contextual, coco, parsenet, psanet, ocnet] [set, training, learning, representation, large, label, test]
@InProceedings{He_2019_CVPR,
  author = {He, Junjun and Deng, Zhongying and Zhou, Lei and Wang, Yali and Qiao, Yu},
  title = {Adaptive Pyramid Context Network for Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Isospectralization, or How to Hear Shape, Style, and Correspondence
Luca Cosmo, Mikhail Panine, Arianna Rampini, Maks Ovsjanikov, Michael M. Bronstein, Emanuele Rodola


The question whether one can recover the shape of a geometric object from its Laplacian spectrum ('hear the shape of the drum') is a classical problem in spectral geometry with a broad range of implications and applications. While theoretically the answer to this question is negative (there exist examples of iso-spectral but non-isometric manifolds), little is known about the practical possibility of using the spectrum for shape reconstruction and optimization. In this paper, we introduce a numerical procedure called isospectralization, consisting of deforming one shape to make its Laplacian spectrum match that of another. We implement the isospectralization procedure using modern differentiable programming techniques and exemplify its applications in some of the classical and notoriously hard problems in geometry processing, computer vision, and graphics such as shape reconstruction, pose and style transfer, and dense deformable correspondence.
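The quantity the optimization drives to zero is easy to state in code: the gap between the leading Laplacian eigenvalues of the two shapes. The sketch below omits the mesh deformation, the differentiable solver, and all regularizers; k is an arbitrary truncation.

import numpy as np

def spectrum_loss(L_a, L_b, k=30):
    # L_a, L_b: dense symmetric Laplacian matrices of the two shapes
    ev_a = np.linalg.eigvalsh(L_a)[:k]           # eigvalsh returns eigenvalues in ascending order
    ev_b = np.linalg.eigvalsh(L_b)[:k]
    return np.sum((ev_a - ev_b) ** 2)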
[possibility, sequence, term, hear, starting] [shape, laplacian, problem, geometric, computer, reconstruction, isospectralization, mesh, functional, intrinsic, approach, optimization, matching, correspondence, algorithm, inverse, discrete, eij, error, michael, pose, total, riemannian, matrix, triangle, geodesic, classical, practical, reconstructing, additional, eigenvalue, assumption, vertex, defined, volume, david, university, geometry, recovering, isospectral, isometric, initial, theoretical, solving, general] [figure, spectrum, spectral, style, input, result, flat, recover, proposed] [iteration, numerical, regularizer, number, top] [procedure, question, finding, consider] [map, boundary, edge, connectivity, area, deformable] [embedding, target, alignment, source, metric, transfer, embeddings, aligning, align]
@InProceedings{Cosmo_2019_CVPR,
  author = {Cosmo, Luca and Panine, Mikhail and Rampini, Arianna and Ovsjanikov, Maks and Bronstein, Michael M. and Rodola, Emanuele},
  title = {Isospectralization, or How to Hear Shape, Style, and Correspondence},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Speech2Face: Learning the Face Behind a Voice
Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Wojciech Matusik


How much can we infer about a person's looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/Youtube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. We evaluate and numerically quantify how--and in what manner--our Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.
[speech, audio, recognition, short, capture, video, sound, spectrogram, dataset, speaker, predicting, duration, people, predict] [computer, vision, computed, directly, reconstruction, international, pattern, canonical, matching, additional, approach, well, corresponding, reconstructing] [face, image, facial, conference, input, reconstructed, figure, ieee, avspeech, reconstruct, age, gender, extracted, craniofacial, nose, based, lip, acm, database] [network, correlation, relu, original, neural, conv, table, deep, layer, batch, performance] [model, voice, true, decoder, visual, natural, encoder, random, machine, goal] [feature, person, predicted, segment, european] [learning, test, training, trained, loss, train, retrieval, task, classification]
@InProceedings{Oh_2019_CVPR,
  author = {Oh, Tae-Hyun and Dekel, Tali and Kim, Changil and Mosseri, Inbar and Freeman, William T. and Rubinstein, Michael and Matusik, Wojciech},
  title = {Speech2Face: Learning the Face Behind a Voice},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Joint Manifold Diffusion for Combining Predictions on Decoupled Observations
Kwang In Kim, Hyung Jin Chang


We present a new predictor combination algorithm that improves a given task predictor based on potentially relevant reference predictors. Existing approaches are limited in that, to discover the underlying task dependence, they either require known parametric forms of all predictors or access to a single fixed dataset on which all predictors are jointly evaluated. To overcome these limitations, we design a new non-parametric task dependence estimation procedure that automatically aligns evaluations of heterogeneous predictors across disjoint feature sets. Our algorithm is instantiated as a robust manifold diffusion process that jointly refines the estimated predictor alignments and the corresponding task dependence. We apply this algorithm to the relative attributes ranking problem and demonstrate that it not only broadens the application range of predictor combination approaches but also outperforms existing methods even when applied to classical predictor combination settings.
[joint, dataset, multiple, bridge, graph, combining, heterogeneous, mtl, explicitly] [algorithm, corresponding, matrix, approach, relative, parametric, require, problem, solution, single, initial, linear, underlying, respective, estimate, estimation, laplacian] [based, reference, attribute, application, gram, figure, coupling, coupled] [accuracy, parameter, constructed, kernel, process, automatically, validation, fixed, disjoint, applied, number, deep] [manifold, dependence, goal, relevant] [feature, baseline, improves, score] [predictor, task, combination, diffusion, metric, data, learning, rank, bkf, main, set, ranking, decoupled, existing, bgf, tpc, training, function, alignment, distribution, sample, update, hyperparameters, setting]
@InProceedings{Kim_2019_CVPR,
  author = {In Kim, Kwang and Jin Chang, Hyung},
  title = {Joint Manifold Diffusion for Combining Predictions on Decoupled Observations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Audio Visual Scene-Aware Dialog
Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh


We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.
[video, audio, dataset, temporal, human, amt, summary, people, conversational, work, recognition, middle, short, dynamic, sequence, providing] [computer, ground, vision, pattern, scene, truth, well, provide] [conference, input, figure, ieee, image, visually] [table, network, order, performance, original, best] [dialog, question, model, answer, visual, avsd, history, natural, candidate, evaluate, goal, answering, questioner, language, conversation, understanding, evaluation, devi, introduce, find, grounded, perception, dhruv, introduced, agent, script, encode, simply] [three, benchmark, person, final] [task, rank, set, trained, embedding, test, training]
@InProceedings{Alamri_2019_CVPR,
  author = {Alamri, Huda and Cartillier, Vincent and Das, Abhishek and Wang, Jue and Cherian, Anoop and Essa, Irfan and Batra, Dhruv and Marks, Tim K. and Hori, Chiori and Anderson, Peter and Lee, Stefan and Parikh, Devi},
  title = {Audio Visual Scene-Aware Dialog},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Minify Photometric Stereo
Junxuan Li, Antonio Robles-Kelly, Shaodi You, Yasuyuki Matsushita


Photometric stereo estimates the surface normal given a set of images acquired under different illumination conditions. To deal with diverse factors involved in the image formation process, recent photometric stereo methods demand a large number of images as input. We propose a method that can dramatically decrease the demands on the number of images by learning the most informative ones under different illumination conditions. To this end, we use a deep learning framework to automatically learn the critical illumination conditions required at input. Furthermore, we present an occlusion layer that can synthesize cast shadows, which effectively improves the estimation accuracy. We assess our method on challenging real-world conditions, where we outperform techniques elsewhere in the literature with a significantly reduced number of light conditions.
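For context, the classic Lambertian least-squares baseline that such methods build on fits albedo-scaled normals from intensities observed under known light directions; the paper's contribution, learning which illumination conditions to keep and modeling cast shadows, is not shown here.

import numpy as np

def lambertian_normals(I, L):
    # I: (m, p) per-pixel intensities under m lights, L: (m, 3) light directions
    G = np.linalg.lstsq(L, I, rcond=None)[0]     # (3, p) albedo-scaled normals
    albedo = np.linalg.norm(G, axis=0) + 1e-8
    return G / albedo, albedo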
[dataset, predict, outperforms, recognition, second] [surface, occlusion, observation, photometric, normal, cast, light, stereo, diligent, note, pattern, illumination, direction, computer, problem, brdf, estimation, optimal, robust, consistent, assume, vision, approach, lambertian, estimate, shadowed, reflectance, respect, shape] [method, input, illuminant, figure, conference, ieee, proposed, synthetic, row, comparison] [table, connection, layer, number, sparse, network, performance, deep, selection, shadow, neural, yielded, regularization, zeroing, employ, size, regularizer] [random, relevant, system, making, model] [map, object, propose, feature, benchmark] [training, function, randomly, set, loss, learning, select, rank, source, effectively, angular, trained, selected]
@InProceedings{Li_2019_CVPR,
  author = {Li, Junxuan and Robles-Kelly, Antonio and You, Shaodi and Matsushita, Yasuyuki},
  title = {Learning to Minify Photometric Stereo},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Reflective and Fluorescent Separation Under Narrow-Band Illumination
Koji Koyamatsu, Daichi Hidaka, Takahiro Okabe, Hendrik P. A. Lensch


In this paper, we address the separation of reflective and fluorescent components in RGB images taken under narrow-band light sources such as LEDs. First, we show that the fluorescent color per pixel can be estimated from at least two images under different light source colors, because the observed color at a surface point is represented by a convex combination of the light source color and the illumination-invariant fluorescent color. Second, we propose a method for robustly estimating the fluorescent color via MAP estimation by taking the prior knowledge with respect to fluorescent colors into consideration. We conducted a number of experiments by using both synthetic and real images, and confirmed that our proposed method works better than the closely related state-of-the-art method and enables us to separate reflective and fluorescent components even from a single image. Furthermore, we demonstrate that our method is effective for applications such as image-based material editing and relighting.
[second, work, represented, bidirectional, outgoing, incoming, dataset] [light, single, illumination, material, camera, estimated, linear, estimation, reflectance, rgb, respect, emission, convex, surface, point, closely, plane, assume, computed, ground, constraint, incident, pure, robust, varying, corresponding] [fluorescent, color, figure, method, reflective, spectral, pixel, proposed, prior, component, separation, input, image, comparison, real, chromaticity, described, ica, ieee, based, fluorescence, wavelength, qualitative, reflection, editing, separating, bispectral, water, absorption, synthetic, stokes] [table, better, number, gaussian, effective, density] [observed, sensitivity, model, probability, consider] [map, intersection, propose, object] [source, space, knowledge, independent, distribution, combination, spanned]
@InProceedings{Koyamatsu_2019_CVPR,
  author = {Koyamatsu, Koji and Hidaka, Daichi and Okabe, Takahiro and Lensch, Hendrik P. A.},
  title = {Reflective and Fluorescent Separation Under Narrow-Band Illumination},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Depth From a Polarisation + RGB Stereo Pair
Dizhong Zhu, William A. P. Smith


In this paper, we propose a hybrid depth imaging system in which a polarisation camera is augmented by a second image from a standard digital camera. For this modest increase in equipment complexity over conventional shape-from-polarisation, we obtain a number of benefits that enable us to overcome longstanding problems with the polarisation shape cue. The stereo cue provides a depth map which, although coarse, is metrically accurate. This is used as a guide surface for disambiguation of the polarisation surface normal estimates using a higher order graphical model. In turn, these are used to estimate diffuse albedo. By extending a previous shape-from-polarisation method to the perspective case, we show how to compute dense, detailed maps of absolute depth, while retaining a linear formulation. We show that our hybrid method is able to recover dense 3D geometry that is superior to state-of-the-art shape-from-polarisation or two view stereo alone.
[second, work, cue, previous, term] [polarisation, depth, surface, specular, normal, stereo, diffuse, albedo, estimated, angle, linear, shape, camera, estimation, dominant, estimate, perspective, approach, degree, light, reflectance, refractive, direction, polarization, rgb, compute, assume, single, iun, unpolarised, constraint, viewing, disambiguation, orientation, multiview, atkinson, note, measured, edwin, smith, dense, view, initial, photometric, denote, sinusoid, form] [image, method, figure, pixel, guide, intensity, proposed, difference, input, shading, synthetic, frequency, recovery] [phase, order, gradient, higher, cost, low, compare, standard, number, sparse] [model, system, consider, encourage] [map, graphical, coarse, mask, height] [source, unknown, setup, set, function]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Dizhong and Smith, William A. P.},
  title = {Depth From a Polarisation + RGB Stereo Pair},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Rethinking the Evaluation of Video Summaries
Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkila


Video summarization is a technique to create a short skim of the original video while preserving the main stories/content. There is substantial interest in automating this process due to the rapid growth of the available material. The recent progress has been facilitated by public benchmark datasets, which enable easy and fair comparison of methods. Currently the established evaluation protocol is to compare the generated summary with respect to a set of reference summaries provided by the dataset. In this paper, we provide an in-depth assessment of this pipeline using two popular benchmark datasets. Surprisingly, we observe that randomly generated summaries achieve performance comparable to or better than the state-of-the-art. In some cases, the random summaries outperform even the human-generated summaries in leave-one-out experiments. Moreover, it turns out that the video segmentation, which is often considered a fixed pre-processing step, has the most significant impact on the performance measure. Based on our observations, we propose alternative approaches for assessing the importance scores as well as an intuitive visualization of the correlation between the estimated scoring and human annotations.
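The surprising random-summary result is easy to reproduce in outline: score a random frame selection against the reference summaries with the standard keyshot F-score. The 15% length budget and the frame-level (rather than shot-level) selection are simplifying assumptions.

import numpy as np

def fscore(pred, ref, eps=1e-8):
    # pred, ref: 0/1 frame-selection vectors of equal length
    overlap = float(np.sum(pred * ref))
    precision = overlap / (pred.sum() + eps)
    recall = overlap / (ref.sum() + eps)
    return 2 * precision * recall / (precision + recall + eps)

def random_summary(n_frames, budget=0.15, seed=0):
    rng = np.random.default_rng(seed)
    pred = np.zeros(n_frames)
    pred[rng.choice(n_frames, int(budget * n_frames), replace=False)] = 1.0
    return pred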
[video, human, summarization, summary, tvsum, summe, framework, frame, current, dataset, recognition, dpplstm, long, short, averaging, highlight, illustrates] [computer, vision, approach, pattern, well, corresponding, respect, analysis, completely, compute] [reference, figure, conference, based, ieee, method, content, proposed, produced, quality, produce, result, high, user] [correlation, original, performance, selection, number, compare, table, better, deep, order] [random, evaluation, generated, visual, randomized, blue, length, randomization, provided, commonly, evaluate] [score, segmentation, segment, average, annotated, predicted, level, final, curve, benchmark, scoring, propose] [test, subset, uniform, selected, ranking, datasets, set, main, distribution, maximum]
@InProceedings{Otani_2019_CVPR,
  author = {Otani, Mayu and Nakashima, Yuta and Rahtu, Esa and Heikkila, Janne},
  title = {Rethinking the Evaluation of Video Summaries},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
What Object Should I Use? - Task Driven Object Detection
Johann Sawatzky, Yaser Souri, Christian Grund, Jurgen Gall


When humans have to solve everyday tasks, they simply pick the objects that are most suitable. While the question of which object to use for a specific task sounds trivial for humans, it is very difficult for robots or other autonomous systems to answer. This issue, however, is not addressed by current benchmarks for object detection that focus on detecting object categories. We therefore introduce the COCO-Tasks dataset which comprises about 40,000 images where the most suitable objects for 14 tasks have been annotated. We furthermore propose an approach that detects the most suitable objects for a given task. The approach builds on a Gated Graph Neural Network to exploit the appearance of each object as well as the global context of all objects present in the scene. In our experiments, we show that the proposed approach outperforms other approaches evaluated on the dataset, such as classification- or ranking-based approaches.
[graph, dataset, hidden, fusion, joint] [scene, ground, truth, confidence, approach, chosen, single, column, supplementary, well] [image, proposed, method, preferred, figure, based, input, glass] [number, neural, table, network, convolutional, deep, resnet, top, standard, layer, weighted, performance, suitable, best, order, connected, add] [visual, model, question, probability, node, requires, gated, affordances, step, pick, choose] [object, detection, bounding, coco, final, driven, ggnn, context, baseline, detector, wine, category, annotated, ablation, detected, average, detecting, propose, instance, global] [task, class, selected, train, learning, classification, test, training, classifier, set, learned, loss]
@InProceedings{Sawatzky_2019_CVPR,
  author = {Sawatzky, Johann and Souri, Yaser and Grund, Christian and Gall, Jurgen},
  title = {What Object Should I Use? - Task Driven Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Triangulation Learning Network: From Monocular to Stereo 3D Object Detection
Zengyi Qin, Jinglu Wang, Yan Lu


In this paper, we study the problem of 3D object detection from stereo images, in which the key challenge is how to effectively utilize stereo information. Different from previous methods using pixel-level depth maps, we propose to employ 3D anchors to explicitly construct object-level correspondences between the regions of interest in stereo images, from which the deep neural network learns to detect and triangulate the targeted object in 3D space. We also introduce a cost-efficient channel reweighting strategy that enhances representational features and weakens noisy signals to facilitate the learning process. All of these are flexibly integrated into a solid baseline detector that takes monocular images as input. We demonstrate that both the monocular baseline and the stereo triangulation learning network outperform the prior state of the art in 3D object detection and localization on the challenging KITTI dataset.
[frame, coherence] [left, confidence, camera, monocular, stereo, triangulation] [reference, input, image] [conv, network, add] [potential] [roi, reweight, roialign, regression, anchor, score, baseline, bounding, rpn, objectness, map, object, box] [classification, learning, target, pairwise]
@InProceedings{Qin_2019_CVPR,
  author = {Qin, Zengyi and Wang, Jinglu and Lu, Yan},
  title = {Triangulation Learning Network: From Monocular to Stereo 3D Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Connecting the Dots: Learning Representations for Active Monocular Depth Estimation
Gernot Riegler, Yiyi Liao, Simon Donne, Vladlen Koltun, Andreas Geiger


We propose a technique for depth estimation with a monocular structured-light camera, i.e., a calibrated stereo set-up with one camera and one laser projector. Instead of formulating the depth estimation via a correspondence search problem, we show that a simple convolutional architecture is sufficient for high-quality disparity estimates in this setting. As accurate ground-truth is hard to obtain, we train our model in a self-supervised fashion with a combination of photometric and geometric losses. Further, we demonstrate that the projected pattern of the structured light sensor can be reliably separated from the ambient information. This can then be used to improve depth boundaries in a weakly supervised fashion by modeling the joint statistics of image and depth edges. The model trained in this fashion compares favorably to the state-of-the-art on challenging synthetic and real-world datasets. In addition, we contribute a novel simulator, which allows benchmarking active depth prediction algorithms under controlled conditions.
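As an illustration of the photometric self-supervision mentioned in the abstract, the sketch below warps the second rectified view into the reference view using a predicted disparity map and penalizes the photometric difference. This is a generic PyTorch sketch under the assumption of a rectified pair; the function names are hypothetical and the paper's full objective also includes geometric and edge-based terms.

import torch
import torch.nn.functional as F

def warp_right_to_left(right, disparity):
    # right: (B, C, H, W) image; disparity: (B, 1, H, W) left-view disparities in pixels.
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=right.dtype),
                            torch.arange(w, dtype=right.dtype), indexing="ij")
    xs = xs.to(right.device).expand(b, h, w)
    ys = ys.to(right.device).expand(b, h, w)
    x_src = xs - disparity.squeeze(1)                      # shift horizontally by the predicted disparity
    grid = torch.stack((2.0 * x_src / (w - 1) - 1.0,
                        2.0 * ys / (h - 1) - 1.0), dim=-1)  # normalize coordinates to [-1, 1]
    return F.grid_sample(right, grid, align_corners=True)

def photometric_loss(left, right, disparity):
    # Simple L1 photometric term; the paper combines photometric and geometric losses.
    return (left - warp_right_to_left(right, disparity)).abs().mean()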
[dataset, recognition, dot, prediction, predicting, optical] [disparity, depth, pattern, ambient, light, vision, stereo, computer, camera, active, geometric, point, photometric, estimation, scene, monocular, single, matching, laser, estimated, contrast, local, geometry, international, accurate, hyperdepth, sensor, virtual, plane, denote, assume, projected, directly, note, view, fastmrf] [image, input, ieee, reference, projector, method, synthetic, based, demonstrate, figure, real, qualitative] [network, structured, architecture, accuracy, table, deep, block, gradient, convolutional, denotes, small] [model, decoder, random, simple, provided, requires, evaluate, evaluation, machine] [edge, map, object, propose, spatial, location] [loss, learning, training, train, exploit, distribution, supervised, large, set, trained, data, distance]
@InProceedings{Riegler_2019_CVPR,
  author = {Riegler, Gernot and Liao, Yiyi and Donne, Simon and Koltun, Vladlen and Geiger, Andreas},
  title = {Connecting the Dots: Learning Representations for Active Monocular Depth Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Non-Volumetric Depth Fusion Using Successive Reprojections
Simon Donne, Andreas Geiger


Given a set of input views, multi-view stereopsis techniques estimate depth maps to represent the 3D reconstruction of the scene; these are fused into a single, consistent reconstruction -- most often a point cloud. In this work we propose to learn an auto-regressive depth refinement directly from data. While deep learning has improved the accuracy and speed of depth estimation significantly, learned MVS techniques remain limited to the plane-sweeping paradigm. We refine a set of input depth maps by successively reprojecting information from neighbouring views to leverage multi-view constraints. Compared to learning-based volumetric fusion techniques, an image-based representation allows significantly more detailed reconstructions; compared to traditional point-based techniques, our method learns noise suppression and surface completion in a data-driven fashion. Due to the limited availability of high-quality reconstruction datasets with ground truth, we introduce two novel synthetic datasets to (pre-)train our network. Our approach is able to improve both the output depth maps and the reconstructed point cloud, for both learned and traditional depth estimation front-ends, on both synthetic and real data.
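The reprojection step described above can be pictured as unprojecting a neighbouring depth map to 3D, transforming it into the reference camera, and splatting the resulting depths onto the reference grid. A minimal numpy sketch under the assumption of known pinhole intrinsics K and relative pose (R, t); a nearest-pixel z-buffer stands in for the paper's learned refinement network.

import numpy as np

def reproject_depth(depth_nbr, K, R, t, out_shape):
    # Unproject the neighbouring depth map to 3D, transform into the reference frame,
    # and splat the resulting depths onto the reference image grid (nearest pixel).
    h, w = depth_nbr.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x N homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                          # back-projected rays
    pts3d = rays * depth_nbr.reshape(1, -1)                                # scale rays by depth
    pts_ref = R @ pts3d + t.reshape(3, 1)                                  # neighbour -> reference frame
    proj = K @ pts_ref
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    z = pts_ref[2]
    out = np.full(out_shape, np.inf)
    ok = (z > 0) & (u >= 0) & (u < out_shape[1]) & (v >= 0) & (v < out_shape[0])
    for ui, vi, zi in zip(u[ok], v[ok], z[ok]):
        out[vi, ui] = min(out[vi, ui], zi)   # keep the closest surface (z-buffer)
    return out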
[fusion, dataset, recognition, multiple, work] [depth, confidence, point, vision, view, computer, colmap, estimate, reprojected, neighbouring, surface, mvsnet, estimation, approach, ground, cloud, bound, reprojection, reconstruction, dtu, stereo, international, truth, completeness, pattern, volumetric, culled, well, yield, initial, stereopsis, limited, single, scene, note] [figure, input, image, ieee, reconstructed, synthetic, pixel, method, result, proposed, reference] [network, accuracy, output, deep, table, lower, better, residual, iteration, scale] [neighbour, step, evaluation] [refinement, center, map, three, refined, final, spatial, threshold, refine, fusing] [learning, learned, set, datasets, training, classification, large, min]
@InProceedings{Donne_2019_CVPR,
  author = {Donne, Simon and Geiger, Andreas},
  title = {Learning Non-Volumetric Depth Fusion Using Successive Reprojections},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Stereo R-CNN Based 3D Object Detection for Autonomous Driving
Peiliang Li, Xiaozhi Chen, Shaojie Shen


We propose a 3D object detection method for autonomous driving by fully exploiting the sparse and dense, semantic and geometry information in stereo imagery. Our method, called Stereo R-CNN, extends Faster R-CNN for stereo inputs to simultaneously detect and associate objects in left and right images. We add extra branches after the stereo Region Proposal Network (RPN) to predict sparse keypoints, viewpoints, and object dimensions, which are combined with 2D left-right boxes to calculate a coarse 3D object bounding box. We then recover the accurate 3D bounding box by a region-based photometric alignment using left and right RoIs. Our method does not require depth input or 3D position supervision, yet it outperforms all existing fully supervised image-based methods. Experiments on the challenging KITTI dataset show that our method outperforms the state-of-the-art stereo-based method by around 30% AP on both 3D detection and 3D localization tasks. Code will be made publicly available.
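Because left and right boxes are detected and associated jointly, a coarse object depth already follows from the horizontal offset of the associated box centers. The toy sketch below assumes a rectified pair with the focal length in pixels and the baseline in meters; the paper's full 3D box additionally uses keypoints, viewpoints, dimensions and the photometric alignment.

def coarse_depth_from_boxes(left_box, right_box, focal_px, baseline_m):
    # Boxes as (x1, y1, x2, y2); disparity of the associated box centers gives z = f * b / d.
    cx_left = 0.5 * (left_box[0] + left_box[2])
    cx_right = 0.5 * (right_box[0] + right_box[2])
    disparity = cx_left - cx_right
    if disparity <= 0:
        return None  # degenerate association
    return focal_px * baseline_m / disparity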
[predict, multiple] [stereo, left, depth, keypoint, dense, accurate, vision, computer, kitti, pattern, keypoints, viewpoint, point, perspective, estimation, apbv, corresponding, lidar, disparity, autonomous, error, geometry, projection, single, camera, provide, monocular, view, orientation, angle, projected, regress, cloud, note, solved, matching, simultaneously] [method, image, ieee, conference, based, input, pixel, figure, raw] [network, performance, sparse, output, table, validation, deep] [mode, evaluate] [object, box, detection, roi, easy, localization, semantic, bounding, feature, rpn, iou, boundary, center, comparing, coarse, aligned, average, fully, faster, region, proposal, regression, evaluated, detect] [hard, alignment, training, learning, loss, flip, set, distance]
@InProceedings{Li_2019_CVPR,
  author = {Li, Peiliang and Chen, Xiaozhi and Shen, Shaojie},
  title = {Stereo R-CNN Based 3D Object Detection for Autonomous Driving},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Hybrid Scene Compression for Visual Localization
Federico Camposeco, Andrea Cohen, Marc Pollefeys, Torsten Sattler


Localizing an image w.r.t. a 3D scene model represents a core task for many computer vision applications. An increasing number of real-world applications of visual localization on mobile devices, e.g., Augmented Reality or autonomous robots such as drones or self-driving cars, demand localization approaches to minimize storage and bandwidth requirements. Compressing the 3D models used for localization thus becomes a practical necessity. In this work, we introduce a new hybrid compression algorithm that uses a given memory limit in a more effective way. Rather than treating all 3D points equally, it represents a small set of points with full appearance information and an additional, larger set of points with compressed information. This enables our approach to obtain a more complete scene representation without increasing the memory requirements, leading to a superior performance compared to previous compression schemes. As part of our contribution, we show how to handle ambiguous matches arising from point compression during RANSAC. Besides outperforming previous compression techniques in terms of pose accuracy under the same memory constraints, our compression scheme itself is also more efficient. Furthermore, the localization rates and accuracy obtained with our approach are comparable to state-of-the-art feature-based methods, while using a small fraction of the memory.
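To make the two-tier point representation concrete, the sketch below greedily splits a memory budget between a small set of fully described points and a larger set of compressed points, ranking points by a per-point importance score (for instance, how many database images observe them). The byte sizes and the split fraction are hypothetical placeholders; the paper's actual selection and quantization are more involved.

def hybrid_select(points, bytes_full=160, bytes_compressed=16, budget=10_000_000, full_fraction=0.2):
    # points: list of dicts with a per-point 'score' (e.g. visibility count in the database).
    # Spend a fraction of the budget on fully described points, the rest on compressed ones.
    ranked = sorted(points, key=lambda p: p["score"], reverse=True)
    n_full = int(full_fraction * budget // bytes_full)
    full_set = ranked[:n_full]
    remaining = budget - len(full_set) * bytes_full
    n_comp = int(remaining // bytes_compressed)
    compressed_set = ranked[n_full:n_full + n_comp]
    return full_set, compressed_set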
[second, previous, dataset, work, graph, localize] [scene, pose, point, descriptor, ransac, approach, camera, minimal, torsten, matching, estimation, problem, marc, posenet, good, match, registration, visibility, single, sift, noah, algorithm, localized, robust] [image, method, database, cover, hybrid, high, comparison] [compression, compressed, number, full, accuracy, order, consumption, rate, small, larger, performance, dubrovnik, compared, store, best, compress, lower, reduce, search, better, variant, selection, compressing, scheme] [visual, memory, query, model, word, coverage, unique, median, selecting, modified] [localization, regression, feature] [set, selected, subset, select, sampling, sample, distribution, large, retrieval, datasets]
@InProceedings{Camposeco_2019_CVPR,
  author = {Camposeco, Federico and Cohen, Andrea and Pollefeys, Marc and Sattler, Torsten},
  title = {Hybrid Scene Compression for Visual Localization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MMFace: A Multi-Metric Regression Network for Unconstrained Face Reconstruction
Hongwei Yi, Chen Li, Qiong Cao, Xiaoyong Shen, Sheng Li, Guoping Wang, Yu-Wing Tai


We propose to address face reconstruction in the wild by using a multi-metric regression network, MMFace, to align a 3D face morphable model (3DMM) to an input image. The key idea is to utilize a volumetric sub-network to estimate an intermediate geometry representation, and a parametric sub-network to regress the 3DMM parameters. Our parametric sub-network consists of identity loss, expression loss, and pose loss, which greatly improves the aligned geometry details by incorporating high-level loss functions directly defined in the 3DMM parametric spaces. Our high-quality reconstruction is robust under large variations of expressions, poses, illumination conditions, and even with large partial occlusions. We evaluate our method by comparing its performance with state-of-the-art approaches on the latest 3D face datasets LS3D-W and Florence. We achieve significant improvements both quantitatively and qualitatively. Due to our high-quality reconstruction, our method can be easily extended to generate high-quality geometry sequences for video inputs.
[video, frame, dataset, consists, framework] [pose, geometry, reconstruction, volumetric, computer, ground, parametric, truth, exp, pattern, corresponding, vision, accurate, directly, single, estimation, estimated, estimate, perspective, icp, volume, illumination, regress, robust, coordinate, orientation, supplementary, well] [face, facial, identity, expression, method, image, input, prn, conference, landmark, morphable, mmface, disface, figure, florence, proposed, handle, eid, ieee, unconstrained, chen, animoji, based, eexp, jvcr, comparison, denoted, intermediate] [network, parameter, entire, better, performance, achieves] [model, partial, attention, evaluate, evaluation, easily] [regression, three, head, european, propose] [loss, alignment, large, representation, training, space, align]
@InProceedings{Yi_2019_CVPR,
  author = {Yi, Hongwei and Li, Chen and Cao, Qiong and Shen, Xiaoyong and Li, Sheng and Wang, Guoping and Tai, Yu-Wing},
  title = {MMFace: A Multi-Metric Regression Network for Unconstrained Face Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
3D Motion Decomposition for RGBD Future Dynamic Scene Synthesis
Xiaojuan Qi, Zhengzhe Liu, Qifeng Chen, Jiaya Jia


A future video is the 2D projection of a 3D scene with predicted camera and object motion. Accurate future video prediction inherently requires understanding of the 3D motion and geometry of a scene. In this paper, we propose an RGBD scene forecasting model with 3D motion decomposition. We predict ego-motion and foreground motion that are combined to generate a future 3D dynamic scene, which is then projected into a 2D image plane to synthesize future motion, RGB images and depth maps. Optional semantic maps can be integrated. Experimental results on the KITTI and Driving datasets show that our model outperforms other state-of-the-art methods in forecasting future RGBD dynamic scenes.
[motion, future, prediction, frame, flow, optical, video, dataset, predict, framework, driving, predicting, forecasting, prednet, work, previous, static, multiple, mcnet, outperforms, utilized, moving, dynamic] [depth, scene, camera, kitti, rgbd, approach, decomposition, rgb, directly, error, point, field, accurate, estimate, estimated, projected, estimation, predicts, projection, geometry] [image, figure, method, input, color, synthesis, background, pixel, produce, qualitative, proposed, produced, synthesize] [network, table, deep, neural, convolutional, compared, better] [model, generate, evaluate, evaluation, copy, visual, system] [semantic, segmentation, foreground, refinement, map, module, predicted, object, baseline] [learning, unsupervised, training, loss, train]
@InProceedings{Qi_2019_CVPR,
  author = {Qi, Xiaojuan and Liu, Zhengzhe and Chen, Qifeng and Jia, Jiaya},
  title = {3D Motion Decomposition for RGBD Future Dynamic Scene Synthesis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Single Image Depth Estimation Trained via Depth From Defocus Cues
Shir Gur, Lior Wolf


Estimating depth from a single RGB image is a fundamental task in computer vision, which is most directly solved using supervised deep learning. In the field of unsupervised learning of depth from a single RGB image, depth is not given explicitly. Existing work in the field receives either a stereo pair, a monocular video, or multiple views, and, using losses that are based on structure-from-motion, trains a depth estimation network. In this work, instead of relying on different views, we rely on depth-from-focus cues. Learning is based on a novel Point Spread Function convolutional layer, which applies location-specific kernels that arise from the Circle-Of-Confusion at each image location. We evaluate our method on data derived from five common datasets for depth estimation and lightfield images, and present results that are on par with supervised methods on the KITTI and Make3D datasets and outperform unsupervised learning approaches. Since the phenomenon of depth from defocus is not dataset specific, we hypothesize that learning based on it would overfit less to the specific content in each dataset. Our experiments show that this is indeed the case, and that an estimator learned on one dataset using our method provides better results on other datasets than the directly supervised methods.
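The location-specific kernels follow from the Circle-Of-Confusion of a thin-lens camera: a point at depth s imaged by a lens focused at distance s_f is blurred to a disc whose diameter grows with |s - s_f|. A hedged numpy sketch of that geometry; focal length, f-number, focus distance and pixel pitch below are illustrative values, not the paper's calibration.

import numpy as np

def circle_of_confusion(depth_m, focal_m=0.05, f_number=2.0, focus_m=2.0, pixel_pitch_m=5e-6):
    # Thin-lens CoC diameter: c = (f / N) * |s - s_f| / s * f / (s_f - f), converted to pixels.
    depth_m = np.maximum(depth_m, 1e-6)          # guard against zero depth
    aperture = focal_m / f_number
    coc_m = aperture * np.abs(depth_m - focus_m) / depth_m * focal_m / (focus_m - focal_m)
    return coc_m / pixel_pitch_m                 # per-pixel blur radius driving the PSF layer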
[focused, focus, dataset, work, consists] [depth, dof, kitti, single, estimation, monocular, camera, dense, rmse, computer, ground, lightfield, truth, dorn, psf, lens, rendered, nyu, focal, rendering, algorithm, scene, light, pattern, directly, field, stereo, estimated, plane, rel, aperture, denote, point] [image, method, based, blur, defocus, input, ieee, coc, comparison, ssim, conference, figure, psnr, quantitative] [aspp, deep, layer, kernel, output, size, convolutional, network, employ, convolution, atrous, performance, table, outperform, compare, neural] [model, evaluate, consider, arxiv] [supervision, regression, predicted, three, object] [learning, training, supervised, unsupervised, distance, loss, domain, function, trained, data, datasets, set, test]
@InProceedings{Gur_2019_CVPR,
  author = {Gur, Shir and Wolf, Lior},
  title = {Single Image Depth Estimation Trained via Depth From Defocus Cues},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
RGBD Based Dimensional Decomposition Residual Network for 3D Semantic Scene Completion
Jie Li, Yu Liu, Dong Gong, Qinfeng Shi, Xia Yuan, Chunxia Zhao, Ian Reid


RGB images differ from depth maps in that they carry more detail about color and texture, which can be utilized as a vital complement to depth for boosting the performance of 3D semantic scene completion (SSC). SSC is composed of 3D shape completion (SC) and semantic scene labeling, while most existing approaches use depth as the sole input, which causes a performance bottleneck. Moreover, the state-of-the-art methods employ 3D CNNs which have cumbersome networks and tremendous parameters. We introduce a light-weight Dimensional Decomposition Residual network (DDR) for 3D dense prediction tasks. The novel factorized convolution layer is effective for reducing the network parameters, and the proposed multi-scale fusion mechanism for depth and color image can improve the completion and segmentation accuracy simultaneously. Our method demonstrates excellent performance on two public datasets. Compared with the latest method SSCNet, we achieve 5.9% gains in SC-IoU and 5.7% gains in SSC-IoU, albeit with only 21% of the network parameters and 16.6% of the FLOPs of SSCNet.
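The factorized convolution at the heart of DDR can be read as replacing a dense k x k x k 3D convolution with three consecutive 1D convolutions, one per axis, which cuts the per-channel-pair parameter count from k^3 to 3k. A minimal PyTorch sketch under that reading of the abstract; the released DDR block and its multi-scale fusion may differ in detail.

import torch.nn as nn

class DDRBlock(nn.Module):
    # Factorize a 3x3x3 convolution into 1x1x3, 1x3x1 and 3x1x1 convolutions applied in sequence.
    def __init__(self, channels, k=3):
        super().__init__()
        p = k // 2
        self.conv_w = nn.Conv3d(channels, channels, (1, 1, k), padding=(0, 0, p))
        self.conv_h = nn.Conv3d(channels, channels, (1, k, 1), padding=(0, p, 0))
        self.conv_d = nn.Conv3d(channels, channels, (k, 1, 1), padding=(p, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv_w(x))
        out = self.relu(self.conv_h(out))
        out = self.conv_d(out)
        return self.relu(out + x)   # residual connection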
[fusion, dataset] [scene, depth, completion, sscnet, dimensional, rgbd, decomposition, rgb, shape, indoor, well, voxel, projection, voxels, nyu, esscnet, geometry, wall, ground, point, corresponding, volume, floor] [color, method, image, proposed, figure, input, based, texture, mapping] [ddr, network, performance, table, residual, convolution, ssc, block, relu, compared, deep, aspp, layer, convolutional, group, nyucad, conv, deeper, reduce, design, achieve, output] [understanding] [feature, semantic, object, labeling, segmentation, three, spatial, improve, category, propose, cnn, detection, map] [learning, extractor, representation, training, novel, task, loss, complement, space, effectively]
@InProceedings{Li_2019_CVPR,
  author = {Li, Jie and Liu, Yu and Gong, Dong and Shi, Qinfeng and Yuan, Xia and Zhao, Chunxia and Reid, Ian},
  title = {RGBD Based Dimensional Decomposition Residual Network for 3D Semantic Scene Completion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Neural Scene Decomposition for Multi-Person Motion Capture
Helge Rhodin, Victor Constantin, Isinsu Katircioglu, Mathieu Salzmann, Pascal Fua


Learning general image representations has proven key to the success of many computer vision tasks. For example, many approaches to image understanding problems rely on deep networks that were initially trained on ImageNet, mostly because the learned features are a valuable starting point to learn from limited labeled data. However, when it comes to 3D motion capture of multiple people, these features are only of limited use. In this paper, we therefore propose an approach to learning features that are useful for this purpose. To this end, we introduce a self-supervised approach to learning what we call a neural scene decomposition (NSD) that can be exploited for 3D pose estimation. NSD comprises three layers of abstraction to represent human subjects: spatial layout in terms of bounding-boxes and relative depth; a 2D shape representation in terms of an instance segmentation mask; and subject-specific appearance and 3D pose information. By exploiting self-supervision coming from multiview data, our NSD model can be trained end-to-end without any 2D or 3D supervision. In contrast to previous approaches, it works for multiple persons and full-frame images. Because it encodes 3D geometry, NSD can then be effectively leveraged to train a 3D pose estimation network from small amounts of annotated data.
[human, multiple, subject, people, motion, time, joint, bidirectional] [pose, computer, view, vision, pattern, nsd, depth, single, scene, estimation, international, shape, reconstruction, decomposition, approach, monocular, visibility, body, require, position, corresponding, matrix, additional, limited, multiview, boxing] [conference, image, figure, input, appearance, latent, method, background, abstraction] [neural, deep, network, processing, scale, convolutional, accuracy, number] [arxiv, model, introduce, encoding, transformer, requires, adversarial] [bounding, segmentation, detection, spatial, person, object, instance, annotated, european, box, detected, three, supervision] [training, representation, learning, novel, data, train, unsupervised, test, learn, trained, supervised, learned, target]
@InProceedings{Rhodin_2019_CVPR,
  author = {Rhodin, Helge and Constantin, Victor and Katircioglu, Isinsu and Salzmann, Mathieu and Fua, Pascal},
  title = {Neural Scene Decomposition for Multi-Person Motion Capture},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Efficient Decision-Based Black-Box Adversarial Attacks on Face Recognition
Yinpeng Dong, Hang Su, Baoyuan Wu, Zhifeng Li, Wei Liu, Tong Zhang, Jun Zhu


Face recognition has obtained remarkable progress in recent years due to the great improvement of deep convolutional neural networks (CNNs). However, deep CNNs are vulnerable to adversarial examples, which can cause fateful consequences in real-world face recognition applications with security-sensitive purposes. Adversarial attacks are widely studied as they can identify the vulnerability of the models before they are deployed. In this paper, we evaluate the robustness of state-of-the-art face recognition models in the decision-based black-box attack setting, where the attackers have no access to the model parameters and gradients, but can only acquire hard-label predictions by sending queries to the target model. This attack setting is more practical in real-world face recognition systems. To improve the efficiency of previous methods, we propose an evolutionary attack algorithm, which can model the local geometry of the search directions and reduce the dimension of the search space. Extensive experiments demonstrate the effectiveness of the proposed method that induces a minimum perturbation to an input face image with fewer queries. We also apply the proposed method to attack a real-world face recognition system successfully.
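A heavily simplified decision-based loop conveys the setting: the attacker only observes hard labels, starts from an image already classified as the target identity, and gradually shrinks the perturbation while staying adversarial. The sketch below omits the paper's key contributions (the evolving covariance model over search directions and the dimensionality reduction of the search space); query_label is a hypothetical hard-label oracle.

import numpy as np

def decision_attack(query_label, x, target_label, x_init, steps=1000, sigma=0.01):
    # x_init must already be classified as target_label; shrink the perturbation while
    # staying inside the (hard-label) region that the model assigns to target_label.
    adv = x_init.copy()
    for _ in range(steps):
        candidate = adv + sigma * np.random.randn(*x.shape)
        candidate = candidate + 0.05 * (x - candidate)   # bias the move toward the clean image
        candidate = np.clip(candidate, 0.0, 1.0)
        if query_label(candidate) == target_label:
            adv = candidate                               # accept only queries that stay adversarial
    return adv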
[recognition, perform] [distortion, optimization, matrix, coordinate, problem, local, algorithm, geometry] [face, method, image, proposed, based, input, diagonal, study, demonstrate, figure, identity] [search, deep, original, covariance, verification, number, table, cnns, gradient, stochastic, neural, better, selection, effectiveness, performance, evolution, criterion] [adversarial, attack, evolutionary, model, impersonation, dodging, generated, probability, random, generate, robustness, vector, step, success, recognized, ccov] [boundary, average, predicted] [space, dimension, arcface, set, distance, pair, sphereface, cosface, update, objective, large, loss, select, target, setting, lfw, minimum, function, sample, distribution, proportional, learning]
@InProceedings{Dong_2019_CVPR,
  author = {Dong, Yinpeng and Su, Hang and Wu, Baoyuan and Li, Zhifeng and Liu, Wei and Zhang, Tong and Zhu, Jun},
  title = {Efficient Decision-Based Black-Box Adversarial Attacks on Face Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
FA-RPN: Floating Region Proposals for Face Detection
Mahyar Najibi, Bharat Singh, Larry S. Davis


We propose a novel approach for generating region proposals for performing face detection. Instead of classifying anchor boxes using features from a pixel in the convolutional feature map, we adopt a pooling-based approach for generating region proposals. However, pooling hundreds of thousands of anchors which are evaluated for generating proposals becomes a computational bottleneck during inference. To this end, an efficient anchor placement strategy for reducing the number of anchor-boxes is proposed. We then show that proposals generated by our network (Floating Anchor Region Proposal Network, FA-RPN) are better than RPN for generating region proposals for face detection. We discuss several beneficial features of FA-RPN proposals (which can be enabled without re-training) like iterative refinement, placement of fractional anchors and changing size/shape of anchors. Our face detector based on FA-RPN obtains 89.4% mAP with a ResNet-50 backbone on the WIDER dataset.
[perform, dataset, performing, multiple] [computer, vision, pattern, single, initial, well] [face, image, conference, ieee, based, figure, high, proposed] [pooling, network, convolutional, small, inference, precision, performance, stride, size, deep, performed, scale, efficient, number, improving, overlap, better, neural, top, validation, apply] [generating, generated, iterative, model, generate, arxiv, preprint] [anchor, region, wider, detection, object, proposal, rpn, ssh, placement, recall, feature, detector, aspect, refinement, detect, improve, pascal, localization, final, afw, snip] [training, set, classification, train, learning, datasets, hard]
@InProceedings{Najibi_2019_CVPR,
  author = {Najibi, Mahyar and Singh, Bharat and Davis, Larry S.},
  title = {FA-RPN: Floating Region Proposals for Face Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Bayesian Hierarchical Dynamic Model for Human Action Recognition
Rui Zhao, Wanru Xu, Hui Su, Qiang Ji


Human action recognition remains a challenging task, partially due to the presence of large variations in the execution of actions. To address this issue, we propose a probabilistic model called the Hierarchical Dynamic Model (HDM). Leveraging a Bayesian framework, the model parameters are allowed to vary across different sequences of data, which increases the capacity of the model to adapt to intra-class variations in both the spatial and temporal extent of actions. Meanwhile, the generative learning process allows the model to preserve the distinctive dynamic pattern for each action class. Through Bayesian inference, we are able to quantify the uncertainty of the classification, providing insight during the decision process. Compared to state-of-the-art methods, our method not only achieves competitive recognition performance within individual datasets but also shows better generalization capability across different datasets. Experiments conducted on data with missing values also show the robustness of the proposed method.
[action, recognition, temporal, human, hidden, hdm, modeling, dataset, skeleton, hsmm, duration, combined, motion, state, dynamic, individual, time, hmm, transition, perform, sequence, joint, utd] [pose, estimate, estimation, compute, allows, approach, total, algorithm, computed, hand] [variation, figure, based, missing, proposed, method, prior, handle, high, generative, demonstrate] [bayesian, inference, better, accuracy, compared, performance, deep, capability, table, process, speed, number, max, covariance] [model, random, probability, provided, arg, variational] [spatial, hierarchical, feature, score, msra] [learning, data, uncertainty, distribution, classification, hyperparameters, datasets, likelihood, class, log, large, training, set, sampling, generalization, representation, learned]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Rui and Xu, Wanru and Su, Hui and Ji, Qiang},
  title = {Bayesian Hierarchical Dynamic Model for Human Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Mixed Effects Neural Networks (MeNets) With Applications to Gaze Estimation
Yunyang Xiong, Hyunwoo J. Kim, Vikas Singh


There is much interest in computer vision to utilize commodity hardware for gaze estimation. A number of papers have shown that algorithms based on deep convolutional architectures are approaching accuracies where streaming data from mass-market devices can offer good gaze tracking performance, although a gap still remains between what is possible and the performance users will expect in real deployments. We observe that one obvious avenue for improvement relates to a gap between some basic technical assumptions behind most existing approaches and the statistical properties of the data used for training. Specifically, most training datasets involve tens of users with a few hundreds (or more) repeated acquisitions per user. The non i.i.d. nature of this data suggests better estimation may be possible if the model explicitly made use of such "repeated measurements" from each user as is commonly done in classical statistical analysis using so-called mixed effects models. The goal of this paper is to adapt these "mixed effects" ideas from statistics within a deep neural network architecture for gaze estimation, based on eye images. Such a formulation seeks to specifically utilize information regarding the hierarchical structure of the training data -- each node in the hierarchy is a user who provides tens or hundreds of repeated samples. This modification yields an architecture that offers state of the art performance on various publicly available datasets improving results by 10-20%.
[dataset, subject, tracking, work, predict, video, prediction, recurrent, recognition, outperforms] [estimation, linear, computer, vision, error, algorithm, pattern, multiview, international, problem, calibration, direction, corresponding, good, analysis, accurate, estimate, approach, variable] [gaze, eye, mixed, based, mpiigaze, statistical, ieee, image, repeated, conference, method, menets, yusuke, appearance, real, input, appearancebased] [neural, fixed, accuracy, deep, network, resnet, convolutional, menet, performance, architecture, number, better, layer] [random, model, offer, common, machine, simple, vector] [regression, head, utilize, art] [data, training, datasets, test, specific, unknown, learning, trained, independent, function]
@InProceedings{Xiong_2019_CVPR,
  author = {Xiong, Yunyang and Kim, Hyunwoo J. and Singh, Vikas},
  title = {Mixed Effects Neural Networks (MeNets) With Applications to Gaze Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training
Dario Pavllo, Christoph Feichtenhofer, David Grangier, Michael Auli


In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses and finally back-project to the input 2D keypoints. In the supervised setting, our fully-convolutional model outperforms the previous best result from the literature by 6 mm mean per-joint position error on Human3.6M, corresponding to an error reduction of 11%, and the model also shows significant improvements on HumanEva-I. Moreover, experiments with back-projection show that it comfortably outperforms previous state-of-the-art results in semi-supervised settings where labeled data is scarce. Code and models are available at https://github.com/facebookresearch/VideoPose3D
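The backbone operates on sequences of 2D keypoints as 1D signals whose channels are the flattened joint coordinates, with dilation growing across layers so that the temporal receptive field expands exponentially. A minimal PyTorch sketch with illustrative layer sizes; the released model additionally uses residual blocks, batch normalization and dropout.

import torch.nn as nn

class TemporalBackbone(nn.Module):
    def __init__(self, joints=17, channels=1024):
        super().__init__()
        in_ch = joints * 2                          # (x, y) per joint, flattened into channels
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, channels, kernel_size=3, dilation=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=9), nn.ReLU(),
            nn.Conv1d(channels, joints * 3, kernel_size=1),          # 3D joints for the valid frames
        )

    def forward(self, keypoints_2d):
        # keypoints_2d: (batch, 2 * joints, frames); output: (batch, 3 * joints, valid frames)
        return self.net(keypoints_2d)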
[human, temporal, recognition, video, work, sequence, joint, previous, trajectory, dataset, motion, predicting, bone, time, frame, prediction] [pose, vision, computer, estimation, international, error, pattern, keypoint, approach, keypoints, well, camera, single, field, mpjpe, position, accurate] [conference, figure, input, method, based, translation] [convolutional, dilated, neural, dropout, receptive, batch, deep, table, architecture, network, output, batchnorm, relu, best, number, full, size, better, complexity, efficient] [model, machine, length, evaluate, simple] [predicted, mask, european, baseline, bounding, detector] [data, learning, training, labeled, protocol, unlabeled, set, supervised, train, loss, large, trained, setup]
@InProceedings{Pavllo_2019_CVPR,
  author = {Pavllo, Dario and Feichtenhofer, Christoph and Grangier, David and Auli, Michael},
  title = {3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Regress 3D Face Shape and Expression From an Image Without 3D Supervision
Soubhik Sanyal, Timo Bolkart, Haiwen Feng, Michael J. Black


The estimation of 3D face shape from a single image must be robust to variations in lighting, head pose, expression, facial hair, makeup, and occlusions. Robustness requires a large training set of in-the-wild images, which by construction, lack ground truth 3D shape. To train a network without any 2D-to-3D supervision, we present RingNet, which learns to compute 3D face shape from a single image. Our key observation is that an individual's face shape is constant across images, regardless of expression, pose, lighting, etc. RingNet leverages multiple images of a person and automatically detected 2D face features. It uses a novel loss that encourages the face shape to be similar when the identity is the same and different for different people. We achieve invariance to expression by representing the face using the FLAME model. Once trained, our method takes a single image and outputs the parameters of FLAME, which can be readily animated. Additionally we create a new database of faces "not quite in-the-wild" (NoW) with 3D head scans and high-resolution images of the subjects in a wide variety of conditions. We evaluate publicly available methods and find that RingNet is more accurate than methods that use 3D supervision. The dataset, model, and results are available for research purposes at http://ringnet.is.tuebingen.mpg.de.
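The key constraint, that shape should stay constant across images of the same person, can be written as a margin loss over predicted shape codes: codes from the same subject are pulled together while a code from a different subject is pushed away. A hedged PyTorch sketch of that idea; variable names and the exact loss form are illustrative rather than the paper's ring formulation.

import torch
import torch.nn.functional as F

def shape_consistency_loss(shapes_same, shape_other, margin=0.5):
    # shapes_same: (K, D) shape codes from K images of one subject;
    # shape_other: (D,) shape code from a different subject.
    anchor = shapes_same[0]
    pos = torch.stack([(anchor - s).pow(2).sum() for s in shapes_same[1:]]).mean()
    neg = (anchor - shape_other).pow(2).sum()
    return F.relu(pos - neg + margin)   # same-identity shapes pulled together, others pushed apart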
[dataset, subject, multiple, key, consists, previous] [shape, single, reconstruction, ground, mesh, pose, truth, error, provide, approach, monocular, corresponding, computer, ring, compute, scan, accurate, estimate, vision, estimation, tightly, directly, note, regress, constant] [face, ringnet, image, flame, facial, expression, figure, method, identity, mapping, consistency, morphable, produce, input, neutral, feng, synthetic, quantitative, real, reconstruct, conference] [network, deep, standard, table, wide] [model, robustness, evaluate, evaluation, median] [head, person, supervision, cropped, regression, benchmark, feature, predicted, region] [training, learning, loss, data, distance, learn, train, space, large, set, alignment]
@InProceedings{Sanyal_2019_CVPR,
  author = {Sanyal, Soubhik and Bolkart, Timo and Feng, Haiwen and Black, Michael J.},
  title = {Learning to Regress 3D Face Shape and Expression From an Image Without 3D Supervision},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PoseFix: Model-Agnostic General Human Pose Refinement Network
Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee


Multi-person pose estimation from a 2D image is an essential technique for human behavior understanding. In this paper, we propose a human pose refinement network that estimates a refined pose from a tuple of an input image and input pose. The pose refinement was performed mainly through an end-to-end trainable multi-stage architecture in previous methods. However, they are highly dependent on pose estimation models and require careful model design. By contrast, we propose a model-agnostic pose refinement method. According to a recent study, state-of-the-art 2D human pose estimation methods have similar error distributions. We use these error statistics as prior information to generate synthetic poses and use the synthesized poses to train our model. In the testing stage, pose estimation results of any other method can be input to the proposed method. Moreover, the proposed model does not require code or knowledge about other methods, which allows it to be easily used in the post-processing step. We show that the proposed approach achieves better performance than the conventional multi-stage refinement models and consistently improves the performance of various state-of-the-art pose estimation methods on the commonly used benchmark. The code is available in (https://github.com/mks0601/PoseFix_RELEASE).
[human, joint, displacement, fed, learns] [pose, posefix, estimation, error, keypoint, groundtruth, keypoints, require, estimated, pipeline, form, finer, body, position, ronchi, good, defined, estimate] [input, proposed, figure, image, synthesized, method, based, result, described, jitter, frequency] [network, performance, applied, architecture, deep, design, gaussian, cpn, output, validation, table, trainable, achieves, convolutional, upsampling, size] [model, generate, vector, simple, calculated, generated, type] [refinement, refined, heatmap, coarse, refine, heatmaps, mask, backbone, improves, detection, person, stage, module, bounding, feature] [training, testing, trained, conventional, learning, train, loss, code, set, knowledge, consistently, representation]
@InProceedings{Moon_2019_CVPR,
  author = {Moon, Gyeongsik and Yong Chang, Ju and Mu Lee, Kyoung},
  title = {PoseFix: Model-Agnostic General Human Pose Refinement Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation
Bastian Wandt, Bodo Rosenhahn


This paper addresses the problem of 3D human pose estimation from single images. While for a long time human skeletons were parameterized and fitted to the observation by satisfying a reprojection error, nowadays researchers directly use neural networks to infer the 3D pose from the observations. However, most of these approaches ignore the fact that a reprojection constraint has to be satisfied and are sensitive to overfitting. We tackle the overfitting problem by ignoring 2D to 3D correspondences. This efficiently avoids a simple memorization of the training data and allows for a weakly supervised training. One part of the proposed reprojection network (RepNet) learns a mapping from a distribution of 2D poses to a distribution of 3D poses using an adversarial training approach. Another part of the network estimates the camera. This allows for the definition of a network layer that performs the reprojection of the estimated 3D pose back to 2D which results in a reprojection loss function. Our experiments show that RepNet generalizes well to unknown data and outperforms state-of-the-art methods when applied to unseen data. Moreover, our implementation runs in real-time on a standard desktop PC.
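The reprojection layer itself amounts to a batched matrix product of the estimated camera with the estimated 3D pose, compared against the observed 2D pose. A minimal PyTorch sketch assuming a weak-perspective 2x3 camera per sample; shapes and the L1 form of the penalty are illustrative.

import torch

def reprojection_loss(pose_3d, camera, pose_2d):
    # pose_3d: (B, 3, J) estimated joints, camera: (B, 2, 3) estimated projection,
    # pose_2d: (B, 2, J) observed joints; project and penalize the 2D discrepancy.
    reprojected = torch.bmm(camera, pose_3d)
    return (reprojected - pose_2d).abs().mean()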
[human, joint, dataset, motion, recognition, bone, outperforms, learns] [pose, estimation, camera, computer, vision, matrix, reprojection, pattern, error, single, monocular, symmetry, international, repnet, ground, kinematic, analysis, reconstruction, truth, directly, additional, body, note, allows, volume, estimated, well] [conference, method, ieee, image, input, proposed, reconstructed, generator, noise, row, mapping, figure, generative] [network, neural, layer, standard, table, scale, output, connected, best, performance, structure, calculate] [critic, adversarial, discriminator, plausible, evaluation, machine, vector, wasserstein, chain] [weakly, fully, propose] [training, supervised, trained, data, distribution, set, learning, loss, datasets, unknown]
@InProceedings{Wandt_2019_CVPR,
  author = {Wandt, Bastian and Rosenhahn, Bodo},
  title = {RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fast and Robust Multi-Person 3D Pose Estimation From Multiple Views
Junting Dong, Wen Jiang, Qixing Huang, Hujun Bao, Xiaowei Zhou


This paper addresses the problem of 3D pose estimation for multiple people in a few calibrated camera views. The main challenge of this problem is to find the cross-view correspondences among noisy and incomplete 2D pose predictions. Most previous methods address this challenge by directly reasoning in 3D using a pictorial structure model, which is inefficient due to the huge state space. We propose a fast and robust approach to solve this problem. Our key idea is to use a multi-way matching algorithm to cluster the detected 2D poses in all views. Each resulting cluster encodes 2D poses of the same person across different views and consistent correspondences across the keypoints, from which the 3D pose of each person can be effectively inferred. The proposed convex optimization based multi-way matching algorithm is efficient and robust against missing and false detections, without knowing the number of people in the scene. Moreover, we propose to combine geometric and appearance cues for cross-view matching. The proposed approach achieves significant performance gains from the state-of-the-art (96.3% vs. 90.6% and 96.9% vs. 88% on the Campus and Shelf datasets, respectively), while being efficient for real-time applications.
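Once 2D detections of the same person have been clustered across views, each joint can be lifted to 3D by standard triangulation. The numpy sketch below shows the two-view direct linear transform for a single joint, assuming known 3x4 projection matrices; the paper handles arbitrary numbers of views and noisy or missing detections.

import numpy as np

def triangulate(P1, P2, x1, x2):
    # P1, P2: 3x4 camera projection matrices; x1, x2: (u, v) detections of the same joint.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]          # homogeneous -> Euclidean 3D point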
[human, multiple, people, joint, actor, state, shelf, previous, dataset, work, motion] [pose, matching, approach, estimation, algorithm, camera, problem, geometry, geometric, campus, body, consistent, constraint, matrix, view, robust, solve, corresponding, belagiannis, pictorial, single, optimization, keypoints, epipolar, associated, estimated, pij, variable, triangulation, calibrated, denote] [appearance, proposed, figure, consistency, reconstruct, based, cycle, image, result, inconsistent] [denotes, number, performance, network, table, fast, efficient] [model, find, correctly] [bounding, detected, person, affinity, propose, false, matched, detector, panoptic, location, box] [pair, space, cluster, aij, datasets, learning, novel, set]
@InProceedings{Dong_2019_CVPR,
  author = {Dong, Junting and Jiang, Wen and Huang, Qixing and Bao, Hujun and Zhou, Xiaowei},
  title = {Fast and Robust Multi-Person 3D Pose Estimation From Multiple Views},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Face-Focused Cross-Stream Network for Deception Detection in Videos
Mingyu Ding, An Zhao, Zhiwu Lu, Tao Xiang, Ji-Rong Wen


Automated deception detection (ADD) from real-life videos is a challenging task. It specifically needs to address two problems: (1) Both face and body contain useful cues regarding whether a subject is deceptive. How to effectively fuse the two is thus key to the effectiveness of an ADD model. (2) Real-life deceptive samples are hard to collect; learning with limited training data thus challenges most deep learning based ADD models. In this work, both problems are addressed. Specifically, for face-body multimodal learning, a novel face-focused cross-stream network (FFCSN) is proposed. It differs significantly from the popular two-stream networks in that: (a) face detection is added into the spatial stream to capture the facial expressions explicitly, and (b) correlation learning is performed across the spatial and temporal streams for joint deep feature learning across both face and body. To address the training data scarcity problem, our FFCSN model is trained with both meta learning and adversarial learning. Extensive experiments show that our FFCSN model achieves state-of-the-art results. Further, the proposed FFCSN model as well as its robust training strategy are shown to be generally applicable to other human-centric video analysis tasks such as emotion recognition from user-generated videos.
[deception, ffcsn, video, temporal, deceptive, recognition, dataset, motion, action, fusion, stream, human, zhiwu, cope, consists, acc, scarcity, submodule, auc, affective, frame] [body, analysis, clearly, approach, international, defined, problem, note, consensus] [face, figure, facial, comparison, based, proposed, ieee, conference, expression] [network, deep, add, correlation, size, full, neural, number, convolutional] [model, adversarial, multimodal, visual, vector] [detection, feature, spatial, segment, module, branch, detecting, including, benchmark, faster, illustrated] [learning, training, data, meta, emotion, loss, base, pairwise, set, comparative, sample, task, novel, trained, existing]
@InProceedings{Ding_2019_CVPR,
  author = {Ding, Mingyu and Zhao, An and Lu, Zhiwu and Xiang, Tao and Wen, Ji-Rong},
  title = {Face-Focused Cross-Stream Network for Deception Detection in Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unequal-Training for Deep Face Recognition With Long-Tailed Noisy Data
Yaoyao Zhong, Weihong Deng, Mei Wang, Jiani Hu, Jianteng Peng, Xunqiang Tao, Yaohai Huang


Large-scale face datasets usually exhibit a massive number of classes, a long-tailed distribution, and severe label noise, which undoubtedly aggravate the difficulty of training. In this paper, we propose a training strategy that treats the head data and the tail data in an unequal way, accompanied by noise-robust loss functions, to take full advantage of their respective characteristics. Specifically, the unequal-training framework provides two training data streams: the first stream applies the head data to learn a discriminative face representation supervised by the Noise Resistance loss; the second stream applies the tail data to learn auxiliary information by gradually mining the stable discriminative information from confusing tail classes. Consequently, both training streams offer complementary information to deep feature learning. Extensive experiments have demonstrated the effectiveness of the new unequal-training framework and loss functions. Better yet, our method could save a significant amount of GPU memory. With our method, we achieve the best result on MegaFace Challenge 2 (MF2) given a large-scale noisy training data set.
[dataset, recognition, second, framework, challenge, joint] [stable, problem, corresponding, approach] [face, noise, identity, method, figure, based, image, result, proposed] [deep, table, number, gradually, extremely, experiment, performance, full, gpu, convolutional, size, original] [model, type, probability, candidate] [head, feature, three, propose, center, level, third, bag] [training, data, tail, loss, noisy, large, learning, softmax, representation, trained, discriminative, set, resistance, label, lfw, webface, datasets, longtailed, train, hard, mining, space, refers, learn, supervised, megaface, imbalanced, arcface, class, labeled, hypothetical, margin, dirty, distribution, ytf, accepted, learned, base, entirely, pyip]
@InProceedings{Zhong_2019_CVPR,
  author = {Zhong, Yaoyao and Deng, Weihong and Wang, Mei and Hu, Jiani and Peng, Jianteng and Tao, Xunqiang and Huang, Yaohai},
  title = {Unequal-Training for Deep Face Recognition With Long-Tailed Noisy Data},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
T-Net: Parametrizing Fully Convolutional Nets With a Single High-Order Tensor
Jean Kossaifi, Adrian Bulat, Georgios Tzimiropoulos, Maja Pantic


Recent findings indicate that over-parametrization, while crucial for successfully training deep neural networks, also introduces large amounts of redundancy. Tensor methods have the potential to efficiently parametrize over-complete representations by leveraging this redundancy. In this paper, we propose to fully parametrize Convolutional Neural Networks (CNNs) with a single high-order, low-rank tensor. Previous works on network tensorization have focused on parametrizing individual layers (convolutional or fully connected) only, and perform the tensorization layer-by-layer separately. In contrast, we propose to jointly capture the full structure of a neural network by parametrizing it with a single high-order tensor, the modes of which represent each of the architectural design parameters of the network (e.g. number of convolutional blocks, depth, number of stacks, input features, etc.). This parametrization allows us to regularize the whole network and drastically reduce the number of parameters. Our model is end-to-end trainable and the low-rank structure imposed on the weight tensor acts as an implicit regularization. We study the case of networks with rich structure, namely Fully Convolutional Networks (FCNs), which we propose to parametrize with a single 8th-order tensor. We show that our approach can achieve superior performance with small compression rates, and attain high compression rates with a negligible drop in accuracy for the challenging task of human pose estimation.
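The underlying mechanism is a Tucker-style factorization: the full weight tensor is never stored, only a small core plus one factor matrix per mode, from which the weights are reconstructed on the fly. A 3-mode numpy sketch of that reconstruction (the paper uses a single 8th-order tensor whose modes span the whole architecture):

import numpy as np

def tucker_reconstruct(core, factors):
    # core: (r1, r2, r3) low-rank core; factors: [(d1, r1), (d2, r2), (d3, r3)] mode matrices.
    # Contract each mode of the core with its factor to recover the (d1, d2, d3) weight tensor.
    w = np.einsum('abc,ia->ibc', core, factors[0])
    w = np.einsum('ibc,jb->ijc', w, factors[1])
    w = np.einsum('ijc,kc->ijk', w, factors[2])
    return w

Storage drops from d1*d2*d3 entries to r1*r2*r3 plus the small factor matrices, which is where the compression comes from.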
[human, work, individual, outperforms, trimmed, report] [pose, decomposition, single, estimation, approach, parametrization, allows, corresponding, form, depth, jean, georgios] [method, facial, input, high, proposed, comparison, figure] [tensor, convolutional, number, neural, network, compression, tucker, accuracy, deep, order, weight, structure, performance, fin, output, original, uncompressed, residual, architecture, ratio, parametrizing, redundancy, table, tensorization, block, efficient, parametrize, architectural, layer, tensorizing, size, compressed, design, proposes, parameter, compared, higher, power, reducing, stacked] [model] [baseline, fully, propose, semantic, challenging, segmentation] [large, learning, task, training, rank, existing]
@InProceedings{Kossaifi_2019_CVPR,
  author = {Kossaifi, Jean and Bulat, Adrian and Tzimiropoulos, Georgios and Pantic, Maja},
  title = {T-Net: Parametrizing Fully Convolutional Nets With a Single High-Order Tensor},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss
Lele Chen, Ross K. Maddox, Zhiyao Duan, Chenliang Xu


We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on the landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. We, humans, are sensitive to temporal discontinuities and subtle artifacts in video. To avoid those pixel jittering problems and to enforce the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate a sharper image with well-synchronized facial movements, we propose a novel regression-based discriminator structure, which considers sequence-level information along with frame-level information. Thoughtful experiments on several datasets and real-world samples demonstrate significantly better results obtained by our method than the state-of-the-art methods in both quantitative and qualitative comparisons.
[audio, video, speech, frame, time, dataset, temporal, motion, signal, lstm, dynamic, work, mfcc, sequence, observes] [ground, truth, computer, vision, synchronization, pose, international] [image, face, facial, talking, lip, conference, pixel, synthesized, lrw, chung, input, based, figure, atvgnet, adjustable, method, generator, ieee, jittering, proposed, voxceleb, realistic, pca, quality] [network, structure, table, dynamically, better, neural] [generation, attention, model, generate, example, generated, discriminator, visual, generating, evaluate, mechanism, conditioned, machine, gan, green, find] [propose, feature, score, grid, cascade, september, hierarchical, head, final, mask] [loss, novel, training, learning, trained, datasets]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Lele and Maddox, Ross K. and Duan, Zhiyao and Xu, Chenliang},
  title = {Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Object-Centric Auto-Encoders and Dummy Anomalies for Abnormal Event Detection in Video
Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, Ling Shao


Abnormal event detection in video is a challenging vision problem. Most existing approaches formulate abnormal event detection as an outlier detection task, due to the scarcity of anomalous data during training. Because of the lack of prior information regarding abnormal events, these methods are not fully-equipped to differentiate between normal and abnormal events. In this work, we formalize abnormal event detection as a one-versus-rest binary classification problem. Our contribution is two-fold. First, we introduce an unsupervised feature learning framework based on object-centric convolutional auto-encoders to encode both motion and appearance information. Second, we propose a supervised classification approach based on clustering the training samples into normality clusters. A one-versus-rest abnormal event classifier is then employed to separate each normality cluster from the rest. For the purpose of training the classifier, the other clusters act as dummy anomalies. During inference, an object is labeled as abnormal if the highest classification score assigned by the one-versus-rest classifiers is negative. Comprehensive experiments are performed on four benchmarks: Avenue, ShanghaiTech, UCSD and UMN. Our approach provides superior results on all four data sets. On the large-scale ShanghaiTech data set, our method provides an absolute gain of 8.4% in terms of frame-level AUC compared to the state-of-the-art method.
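At inference time the decision rule described above is simple: every one-versus-rest classifier (one per normality cluster) scores the object-centric latent code, and the object is declared abnormal when even the best-matching cluster assigns a negative score. A sketch of that rule with a hypothetical linear-classifier interface:

import numpy as np

def abnormality_score(latent_code, classifiers):
    # classifiers: list of (w, b) linear one-versus-rest classifiers, one per normality cluster.
    scores = [np.dot(w, latent_code) + b for w, b in classifiers]
    best = max(scores)
    return best < 0, -best   # abnormal if the highest score is negative; -best as an anomaly score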
[abnormal, event, auc, anomaly, frame, video, motion, avenue, cae, umn, framework, normality, hinami, people, second, employed, work] [approach, normal, note, respective] [based, figure, appearance, method, latent, input, presented] [convolutional, ucsd, order, deep, performance, better, layer, parameter, higher, binary, employ, top] [provided, model, representing] [detection, object, score, feature, shanghaitech, detector, false, pedestrian, bounding, person, propose, three, highest, final, detecting] [data, training, test, set, learning, positive, svm, classification, train, sample, learn, task, unsupervised, cluster, reported, clustering, supervised, classifier]
@InProceedings{Ionescu_2019_CVPR,
  author = {Tudor Ionescu, Radu and Shahbaz Khan, Fahad and Georgescu, Mariana-Iuliana and Shao, Ling},
  title = {Object-Centric Auto-Encoders and Dummy Anomalies for Abnormal Event Detection in Video},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DDLSTM: Dual-Domain LSTM for Cross-Dataset Action Recognition
Toby Perrett, Dima Damen


Domain alignment in convolutional networks aims to learn the degree of layer-specific feature alignment beneficial to the joint learning of source and target datasets. While increasingly popular in convolutional networks, there have been no previous attempts to achieve domain alignment in recurrent networks. Similar to spatial features, both source and target domains are likely to exhibit temporal dependencies that can be jointly learnt and aligned. In this paper we introduce Dual-Domain LSTM (DDLSTM), an architecture that is able to learn temporal dependencies from two domains concurrently. It performs cross-contaminated batch normalisation on both input-to-hidden and hidden-to-hidden weights, and learns the parameters for cross-contamination, for both single-layer and multi-layer LSTM architectures. We evaluate DDLSTM on frame-level action recognition using three datasets, taking a pair at a time, and report an average increase in accuracy of 3.5%. The proposed DDLSTM architecture outperforms standard, fine-tuned, and batch-normalised LSTMs.
[lstm, ddlstm, mpii, breakfast, action, online, joint, normalisation, temporal, jointly, bnlstm, dataset, recognition, recurrent, multiple, ddbn, video, second, epic, future, lstms, thumos, human, outperforms, benefit] [computer, vision, pattern, single, international, note, corresponding, well] [conference, input, figure, method, proposed, background] [batch, table, standard, architecture, offline, neural, accuracy, network, number, deep, cell, convolutional, layer, increase] [cooking, contribution, evaluate, model] [feature, three, average] [domain, training, datasets, trained, classification, learning, adaptation, learn, sample, test, alignment, shared, comparative, source, target, data, class]
@InProceedings{Perrett_2019_CVPR,
  author = {Perrett, Toby and Damen, Dima},
  title = {DDLSTM: Dual-Domain LSTM for Cross-Dataset Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
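One plausible reading of the cross-contaminated batch normalisation described above is that each domain's activations are normalised with a learned mixture of both domains' batch statistics. The sketch below follows that assumption and is not the authors' implementation; the mixing weight alpha is a placeholder for the learned cross-contamination parameters.

import numpy as np

def cross_contaminated_bn(x_src, x_tgt, alpha, eps=1e-5):
    # Normalise each domain with a convex mixture of the two domains' batch
    # statistics; alpha in [0, 1] is the learned cross-contamination weight.
    # (A plausible reading of DDLSTM's normalisation, not the official code.)
    mu_s, var_s = x_src.mean(axis=0), x_src.var(axis=0)
    mu_t, var_t = x_tgt.mean(axis=0), x_tgt.var(axis=0)
    mu_mix_s = alpha * mu_s + (1 - alpha) * mu_t
    var_mix_s = alpha * var_s + (1 - alpha) * var_t
    mu_mix_t = alpha * mu_t + (1 - alpha) * mu_s
    var_mix_t = alpha * var_t + (1 - alpha) * var_s
    xs = (x_src - mu_mix_s) / np.sqrt(var_mix_s + eps)
    xt = (x_tgt - mu_mix_t) / np.sqrt(var_mix_t + eps)
    return xs, xt

# toy batches of input-to-hidden pre-activations from the two datasets
xs, xt = cross_contaminated_bn(np.random.randn(32, 128), np.random.randn(32, 128), alpha=0.7)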
The Pros and Cons: Rank-Aware Temporal Attention for Skill Determination in Long Videos
Hazel Doughty, Walterio Mayol-Cuevas, Dima Damen


We present a new model to determine relative skill from long videos, through learnable temporal attention modules. Skill determination is formulated as a ranking problem, making it suitable for common and generic tasks. However, for long videos, parts of the video are irrelevant for assessing skill, and there may be variability in the skill exhibited throughout a video. We therefore propose a method which assesses the relative overall level of skill in a long video by attending to its skill-relevant parts. Our approach trains temporal attention modules, learned with only video-level supervision, using a novel rank-aware loss function. In addition to attending to task-relevant video parts, our proposed loss jointly trains two attention modules to separately attend to video parts which are indicative of higher (pros) and lower (cons) skill. We evaluate our approach on the EPIC-Skills dataset and additionally annotate a larger dataset from YouTube videos for skill determination with five previously unexplored tasks. Our method outperforms previous approaches and classic softmax attention on both datasets by over 4% pairwise accuracy, and as much as 12% on individual tasks. We also demonstrate our model's ability to attend to rank-aware parts of the video.
[skill, video, temporal, dataset, tie, determination, assessment, recognition, long, previous, surgical, perform, work, scramble, action, surgery, motion] [computer, vision, pattern, international, single, well, disparity, approach] [conference, method, high, ieee, figure, proposed, drawing] [low, network, best, better, performance, filter, higher, number, automated, relu, accuracy, optimize] [attention, attend, model, relevant, attends, diversity, common] [module, score, branch, final, improvement, propose, annotate, segment, level] [ranking, loss, uniform, softmax, task, weighting, training, learn, pairwise, learning, pair, test, rank, informative, set, datasets, encourages, learned]
@InProceedings{Doughty_2019_CVPR,
  author = {Doughty, Hazel and Mayol-Cuevas, Walterio and Damen, Dima},
  title = {The Pros and Cons: Rank-Aware Temporal Attention for Skill Determination in Long Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
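The ranking formulation above lends itself to a short sketch: attention-pooled video features are scored, and a pairwise margin loss pushes the higher-skill video above the lower-skill one. The module names, dimensions and margin are illustrative, and the paper's full rank-aware loss additionally trains separate "pros" and "cons" attention modules, which this sketch omits.

import torch
import torch.nn.functional as F

def skill_ranking_loss(feats_hi, feats_lo, attention, scorer, margin=1.0):
    # Pairwise ranking on attention-pooled features: the video judged to show
    # higher skill should score at least `margin` above the other one.
    # feats_*: (T, D) per-segment features of the two videos in a pair.
    w_hi = torch.softmax(attention(feats_hi).squeeze(-1), dim=0)   # (T,)
    w_lo = torch.softmax(attention(feats_lo).squeeze(-1), dim=0)
    pooled_hi = (w_hi.unsqueeze(-1) * feats_hi).sum(dim=0)         # (D,)
    pooled_lo = (w_lo.unsqueeze(-1) * feats_lo).sum(dim=0)
    s_hi, s_lo = scorer(pooled_hi), scorer(pooled_lo)
    return F.relu(margin - (s_hi - s_lo))

# toy modules standing in for the learned attention and ranking functions
attention = torch.nn.Linear(1024, 1)
scorer = torch.nn.Linear(1024, 1)
loss = skill_ranking_loss(torch.randn(50, 1024), torch.randn(50, 1024), attention, scorer)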
Collaborative Spatiotemporal Feature Learning for Video Action Recognition
Chao Li, Qiaoyong Zhong, Di Xie, Shiliang Pu


Spatiotemporal feature learning is of central importance for action recognition in videos. Existing deep neural network models either learn spatial and temporal features independently (C2D) or jointly with unconstrained parameters (C3D). In this paper, we propose a novel neural operation which encodes spatiotemporal features collaboratively by imposing a weight-sharing constraint on the learnable parameters. In particular, we perform 2D convolution along three orthogonal views of volumetric video data, which learns spatial appearance and temporal motion cues respectively. By sharing the convolution kernels of different views, spatial and temporal features are collaboratively learned and thus benefit from each other. The complementary features are subsequently fused by a weighted summation whose coefficients are learned end-to-end. Our approach achieves state-of-the-art performance on large-scale benchmarks and won the 1st place in the Moments in Time Challenge 2018. Moreover, based on the learned coefficients of different views, we are able to quantify the contributions of spatial and temporal features. This analysis sheds light on the interpretability of the model and may also guide the future design of algorithms for video recognition.
[temporal, video, spatiotemporal, action, time, recognition, motion, kinetics, dataset, learns, collaboratively, multiple, jointly, perform, frame, optical, work] [view, volumetric, field] [figure, proposed, based, input, comparison, collaborative, image, method] [cost, convolution, network, table, deep, weight, convolutional, neural, architecture, size, number, residual, performance, coefficient, imagenet, validation, accuracy, sharing, output, top, filter, unit, receptive, operation, computational, effectiveness, applied] [model] [feature, spatial, three, average, propose, cnn] [learning, learned, learn, set, dimension, representation, train, share, shared, classification, training, sample]
@InProceedings{Li_2019_CVPR,
  author = {Li, Chao and Zhong, Qiaoyong and Xie, Di and Pu, Shiliang},
  title = {Collaborative Spatiotemporal Feature Learning for Video Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
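The core operation above, a 2D convolution applied along three orthogonal views of the video volume with a shared kernel and a learned weighted fusion of the responses, can be sketched as follows; the tensor layout and softmax fusion style are illustrative choices, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CollaborativeSTConv(nn.Module):
    # Sketch of collaborative spatiotemporal convolution: one shared 2D kernel
    # applied along the H-W, T-W and T-H views of a video volume, with learned
    # softmax-weighted fusion of the three responses.
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)  # shared kernel
        self.coeff = nn.Parameter(torch.zeros(3))                # fusion weights

    def forward(self, x):                      # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        hw = self.conv(x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w))
        hw = hw.reshape(n, t, -1, h, w).permute(0, 2, 1, 3, 4)
        tw = self.conv(x.permute(0, 3, 1, 2, 4).reshape(n * h, c, t, w))
        tw = tw.reshape(n, h, -1, t, w).permute(0, 2, 3, 1, 4)
        th = self.conv(x.permute(0, 4, 1, 2, 3).reshape(n * w, c, t, h))
        th = th.reshape(n, w, -1, t, h).permute(0, 2, 3, 4, 1)
        a = F.softmax(self.coeff, dim=0)
        return a[0] * hw + a[1] * tw + a[2] * th

out = CollaborativeSTConv(3, 64)(torch.randn(2, 3, 8, 56, 56))  # -> (2, 64, 8, 56, 56)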
MARS: Motion-Augmented RGB Stream for Action Recognition
Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, Cordelia Schmid


Most state-of-the-art methods for action recognition consist of a two-stream architecture with 3D convolutions: an appearance stream for RGB frames and a motion stream for optical flow frames. Although combining flow with RGB improves the performance, the cost of computing accurate optical flow is high, and increases action recognition latency. This limits the usage of two-stream approaches in real-world applications requiring low latency. In this paper, we introduce two learning approaches to train a standard 3D CNN, operating on RGB frames, that mimics the motion stream, and as a result avoids flow computation at test time. First, by minimizing a feature-based loss compared to the Flow stream, we show that the network reproduces the motion stream with high fidelity. Second, to leverage both appearance and motion information effectively, we train with a linear combination of the feature-based loss and the standard cross-entropy loss for action recognition. We denote the stream trained using this combined loss as Motion-Augmented RGB Stream (MARS). As a single stream, MARS performs better than RGB or Flow alone, for instance with 72.7% accuracy on Kinetics compared to 72.0% and 65.6% with RGB and Flow streams respectively.
[flow, stream, motion, action, optical, mers, video, minikinetics, recognition, time, kinetics, combining, performs, privileged, temporal, mpegflow, work] [rgb, approach, denote, explicit] [figure, appearance, based, difference, mse, input, proposed, method, high] [accuracy, performance, network, standard, convolutional, table, computation, better, layer, validation, compared, cnns, impact, imagenet, architecture, lower, deep, neural, computational, freeze] [mimic, model, step] [three, feature, leverage, average, cnn] [loss, train, training, learning, test, entropy, trained, distillation, cross, class, strategy, observe, datasets, knowledge, large, classification, classify, effectively]
@InProceedings{Crasto_2019_CVPR,
  author = {Crasto, Nieves and Weinzaepfel, Philippe and Alahari, Karteek and Schmid, Cordelia},
  title = {MARS: Motion-Augmented RGB Stream for Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
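The training objective described above, cross-entropy for action recognition plus a feature-based loss that pulls the RGB stream's features toward those of a frozen flow stream, is easy to sketch; the loss weight and feature dimensions below are placeholders.

import torch
import torch.nn.functional as F

def mars_loss(rgb_feats, rgb_logits, flow_feats, labels, lam=50.0):
    # Combined objective sketch: standard cross-entropy on the RGB stream plus an
    # MSE term matching its features to the pre-trained, frozen flow stream
    # (the flow features carry no gradient). `lam` is an illustrative weight.
    ce = F.cross_entropy(rgb_logits, labels)
    mimic = F.mse_loss(rgb_feats, flow_feats.detach())
    return ce + lam * mimic

loss = mars_loss(torch.randn(8, 2048, requires_grad=True),
                 torch.randn(8, 400, requires_grad=True),
                 torch.randn(8, 2048),
                 torch.randint(0, 400, (8,)))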
Convolutional Relational Machine for Group Activity Recognition
Sina Mokhtarzadeh Azar, Mina Ghadimi Atigh, Ahmad Nickabadi, Alexandre Alahi


We present an end-to-end deep Convolutional Neural Network called Convolutional Relational Machine (CRM) for recognizing group activities that utilizes the information in spatial relations between individual persons in image or video. It learns to produce an intermediate spatial representation (activity map) based on individual and group activities. A multi-stage refinement component is responsible for decreasing the incorrect predictions in the activity map. Finally, an aggregation component uses the refined information to recognize group activities. Experimental results demonstrate the constructive contribution of the information extracted and represented in the form of the activity map. CRM shows advantages over state-of-the-art models on Volleyball and Collective Activity datasets.
[activity, individual, recognition, action, crm, collective, extract, optical, flow, volleyball, temporal, frame, multiple, video, previous, considering, recurrent, recognizing, joint, dataset, greg] [computer, vision, pattern, single, rgb, form, initial, scene, ground, truth] [input, conference, based, ieee, produce, component, proposed, figure, method, image] [group, convolutional, aggregation, accuracy, better, table, neural, network, performance, conv, layer, number, deep, pooling] [model, relational, machine, consider, generated] [map, feature, stage, spatial, refinement, final, cnn, refined, bounding, person, baseline] [training, representation, learning, loss, class, set]
@InProceedings{Azar_2019_CVPR,
  author = {Mokhtarzadeh Azar, Sina and Ghadimi Atigh, Mina and Nickabadi, Ahmad and Alahi, Alexandre},
  title = {Convolutional Relational Machine for Group Activity Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Video Summarization by Learning From Unpaired Data
Mrigank Rochan, Yang Wang


We consider the problem of video summarization. Given an input raw video, the goal is to select a small subset of key frames from the input video to create a shorter summary video that best describes the content of the original video. Most of the current state-of-the-art video summarization approaches use supervised learning and require labeled training data. Each training instance consists of a raw input video and its ground truth summary video curated by human annotators. However, it is very expensive and difficult to create such labeled training examples. To address this limitation, we propose a novel formulation to learn video summarization from unpaired data. We present an approach that learns to generate optimal video summaries using a set of raw videos (V) and a set of summary videos (S), where there exists no correspondence between V and S. We argue that this type of data is much easier to collect. Our model aims to learn a mapping function F : V -> S such that the distribution of resultant summary videos from F(V) is similar to the distribution of S with the help of an adversarial objective. In addition, we enforce a diversity constraint on F(V) to ensure that the generated video summaries are visually diverse. Experimental results on two benchmark datasets indicate that our proposed approach significantly outperforms other alternative methods.
[video, summary, key, summarization, frame, summe, tvsum, selector, dataset, unpairedvsn, work, fcsn, temporal, youtube, ldiv, consists, unpairedvsnadv, kristen, human, learns] [computer, vision, ground, pattern, truth, formulation, international, corresponding, reconstruction, additional, approach, correspondence, define, michael] [unpaired, conference, input, raw, ieee, method, real, proposed, paired, produced, image] [network, table, performance, output, precision, small, selection, number] [model, adversarial, discriminator, diversity, create, goal, generated, partial, evaluation, generate] [feature, supervision, european, baseline, fully, final, propose, recall] [data, learning, training, supervised, set, learn, unsupervised, loss, datasets, subset, function, select, selected, objective, distribution, web, observe, setting, labeled]
@InProceedings{Rochan_2019_CVPR,
  author = {Rochan, Mrigank and Wang, Yang},
  title = {Video Summarization by Learning From Unpaired Data},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Skeleton-Based Action Recognition With Directed Graph Neural Networks
Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu


The skeleton data have been widely used for the action recognition tasks since they can robustly accommodate dynamic circumstances and complex backgrounds. In existing methods, both the joint and bone information in skeleton data have been proved to be of great help for action recognition tasks. However, how to incorporate these two types of data to best take advantage of the relationship between joints and bones remains a problem to be solved. In this work, we represent the skeleton data as a directed acyclic graph based on the kinematic dependency between the joints and bones in the natural human body. A novel directed graph neural network is designed specially to extract the information of joints, bones and their relations and make prediction based on the extracted features. In addition, to better fit the action recognition task, the topological structure of the graph is made adaptive based on the training process, which brings notable improvement. Moreover, the motion information of the skeleton sequence is exploited and combined with the spatial information to further enhance the performance in a two-stream framework. Our final model is tested on two large-scale datasets, NTU-RGBD and Skeleton-Kinetics, and exceeds state-of-the-art performance on both of them.
[graph, action, recognition, skeleton, human, directed, motion, joint, bone, dataset, temporal, dgn, incoming, incidence, video, extract, sequence, updated, outgoing, acyclic, represented, work, dgnn, fed] [vertex, matrix, pattern, computer, vision, body, directly, corresponding, approach, international, problem, pose] [conference, ieee, based, method, figure, extracted] [neural, performance, network, structure, convolutional, block, table, original, adaptive, process, number, layer, accuracy, better, deep, aggregation, designed, connected, denotes, root, initialized, parameter] [model, represent] [edge, spatial, final] [data, training, set, learning, source, target, function, conventional, updating, adjacency]
@InProceedings{Shi_2019_CVPR,
  author = {Shi, Lei and Zhang, Yifan and Cheng, Jian and Lu, Hanqing},
  title = {Skeleton-Based Action Recognition With Directed Graph Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PA3D: Pose-Action 3D Machine for Video Recognition
An Yan, Yali Wang, Zhifeng Li, Yu Qiao


Recent studies have witnessed the successes of using 3D CNNs for video action recognition. However, most 3D models are built upon RGB and optical flow streams, which may not fully exploit pose dynamics, i.e., an important cue of modeling human actions. To fill this gap, we propose a concise Pose-Action 3D Machine (PA3D), which can effectively encode multiple pose modalities within a unified 3D framework, and consequently learn spatio-temporal pose representations for action recognition. More specifically, we introduce a novel temporal pose convolution to aggregate spatial poses over frames. Unlike the classical temporal convolution, our operation can explicitly learn the pose motions that are discriminative to recognize human actions. Extensive experiments on three popular benchmarks (i.e., JHMDB, HMDB, and Charades) show that, PA3D outperforms the recent pose-based approaches. Furthermore, PA3D is highly complementary to the recent 3D CNNs, e.g., I3D. Multi-stream fusion achieves the state-of-the-art performance on all evaluated data sets.
[action, temporal, video, human, recognition, joint, hmdb, perform, jhmdb, tempposeconv, fusion, complex, potion, prediction, spatiotemporal, consists, modeling, outperforms, optical, cue, recognize, frame, motion, explicitly, predefined, stream, framework] [pose, local, estimation, rgb] [feed, input, traditional] [convolution, convolutional, table, number, output, dilation, stride, cnns, deep, effective, performance, layer, achieve] [encode, encoding, modality, sampled, model, generate, evaluate, machine] [cnn, semantic, spatial, heatmaps, affinity, feature, propose, three, complementary, score, fuse, average, built, segment] [representation, learn, discriminative, learning, novel, training, effectively]
@InProceedings{Yan_2019_CVPR,
  author = {Yan, An and Wang, Yali and Li, Zhifeng and Qiao, Yu},
  title = {PA3D: Pose-Action 3D Machine for Video Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Dual Relation Modeling for Egocentric Interaction Recognition
Haoxin Li, Yijun Cai, Wei-Shi Zheng


Egocentric interaction recognition aims to recognize the camera wearer's interactions with the interactor who faces the camera wearer in egocentric videos. In such a human-human interaction analysis problem, it is crucial to explore the relations between the camera wearer and the interactor. However, most existing works directly model the interactions as a whole and do not model the relations between the two interacting persons. To exploit the strong relations for egocentric interaction recognition, we introduce a dual relation modeling framework which learns to model the relations between the camera wearer and the interactor based on the individual action representations of the two persons. Specifically, we develop a novel interactive LSTM module, the key component of our framework, to explicitly model the relations between the two interacting persons based on their individual action representations, which are collaboratively learned with an interactor attention module and a global-local motion module. Experimental results on three egocentric interaction datasets show the effectiveness of our method and its advantages over state-of-the-art approaches.
[motion, interaction, recognition, action, interactor, egocentric, individual, lstm, wearer, modeling, interacting, explicitly, frame, pev, video, symmetrical, framework, human, pov, dataset, jointly, ego, recognize, localize, exo, learns, time, activity] [camera, local, computer, vision, pattern, international, dense, explicit, analysis, equation] [conference, ieee, appearance, figure, dual, method, based, comparison, paired] [accuracy, network, block, deep, table, performance, effectiveness, concrete, convolutional, concatenation, neural] [attention, model, sampled, step, introduce] [module, global, relation, feature, interactive, mask, segmentation, three, person] [learning, learn, learned, loss, representation, existing, china, classification]
@InProceedings{Li_2019_CVPR,
  author = {Li, Haoxin and Cai, Yijun and Zheng, Wei-Shi},
  title = {Deep Dual Relation Modeling for Egocentric Interaction Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MOTS: Multi-Object Tracking and Segmentation
Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, Bastian Leibe


This paper extends the popular task of multi-object tracking to multi-object tracking and segmentation (MOTS). Towards this goal, we create dense pixel-level annotations for two existing tracking datasets using a semi-automatic annotation procedure. Our new annotations comprise 65,213 pixel masks for 977 distinct objects (cars and pedestrians) in 10,870 video frames. For evaluation, we extend existing multi-object tracking metrics to this new task. Moreover, we propose a new baseline method which jointly addresses detection, tracking, and segmentation with a single convolutional network. We demonstrate the value of our datasets by achieving improvements in performance when training on MOTS annotations. We believe that our datasets, metrics and baseline will become a valuable resource towards developing multi-object tracking approaches that go beyond 2D bounding boxes. We make our annotations, code, and models available at https://www.vision.rwth-aachen.de/page/mots.
[tracking, video, dataset, time, temporal, multiple, mot, track, frame, extend, jointly, manually, lstm, optical] [kitti, ground, truth, provide, well] [method, based, image, input, pixel, proposed, appearance] [convolutional, network, table, original, number, order, denotes, performance, accuracy, validation] [association, arxiv, preprint, evaluation, evaluate, model, vector] [segmentation, mask, object, bounding, box, instance, head, annotation, detection, car, motsa, smotsa, baseline, annotated, iou, benchmark, detector, motschallenge, ped, propose, feature, region, camot, refinement, motsp, person] [training, datasets, set, task, existing, learning, trained, loss, data]
@InProceedings{Voigtlaender_2019_CVPR,
  author = {Voigtlaender, Paul and Krause, Michael and Osep, Aljosa and Luiten, Jonathon and Balachandar Gnana Sekar, Berin and Geiger, Andreas and Leibe, Bastian},
  title = {MOTS: Multi-Object Tracking and Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
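Since the new task evaluates pixel masks rather than boxes, predictions are matched to ground truth by mask overlap; the minimal mask-IoU and matching sketch below only illustrates that idea and is not the released evaluation code (the official metrics assume non-overlapping masks, which makes the assignment unique rather than greedy).

import numpy as np

def mask_iou(pred, gt):
    # Intersection-over-union of two binary masks (H, W) -> float in [0, 1].
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return np.logical_and(pred, gt).sum() / union

def match_masks(preds, gts, thresh=0.5):
    # Greedy matching of predicted to ground-truth masks by mask IoU; MOTS-style
    # metrics build on such mask overlaps instead of bounding-box overlaps.
    matches, used = [], set()
    for i, p in enumerate(preds):
        ious = [mask_iou(p, g) if j not in used else 0.0 for j, g in enumerate(gts)]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= thresh:
            matches.append((i, j, ious[j]))
            used.add(j)
    return matches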
Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking
Heng Fan, Haibin Ling


Recently, the region proposal network (RPN) has been combined with the Siamese network for tracking, and shown excellent accuracy with high efficiency. Nevertheless, previously proposed one-stage Siamese-RPN trackers degenerate in the presence of similar distractors and large scale variation. Addressing these issues, we propose a multi-stage tracking framework, Siamese Cascaded RPN (C-RPN), which consists of a sequence of RPNs cascaded from deep high-level to shallow low-level layers in a Siamese network. Compared to previous solutions, C-RPN has several advantages: (1) Each RPN is trained using the outputs of the RPN in the previous stage. Such a process stimulates hard negative sampling, resulting in more balanced training samples. Consequently, the RPNs are sequentially more discriminative in distinguishing difficult background (i.e., similar distractors). (2) Multi-level features are fully leveraged through a novel feature transfer block (FTB) for each RPN, further improving the discriminability of C-RPN using both high-level semantic and low-level spatial information. (3) With multiple steps of regression, C-RPN progressively refines the location and shape of the target in each RPN with anchor boxes adjusted in the previous stage, which makes localization more accurate. C-RPN is trained end-to-end with the multi-task loss function. At inference, C-RPN is deployed as it is, without any temporal adaptation, for real-time tracking. In extensive experiments on OTB-2013, OTB-2015, VOT-2016, VOT-2017, LaSOT and TrackingNet, C-RPN consistently achieves state-of-the-art results and runs in real-time.
[tracking, multiple, consists, second, extract, previous] [single, approach, accurate] [figure, high, background, image] [siamese, network, best, lasot, siamrpn, eao, deep, performance, ftb, correlation, suc, scale, convolutional, block, filter, tracker, better, siamfc, staple, rpns, compared, achieves, overlap, layer, neural, rpnl, accuracy, crpn, bacf, distractors] [visual, success, model] [feature, rpn, object, anchor, stage, regression, semantic, cascaded, propose, localization, detection, region, proposal, fully, improve, threshold, cascade, score, three, response] [negative, training, classification, target, discriminative, loss, learning, transfer, set, large, viewed, function, hard, learn]
@InProceedings{Fan_2019_CVPR,
  author = {Fan, Heng and Ling, Haibin},
  title = {Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PointFlowNet: Learning Representations for Rigid Motion Estimation From Point Clouds
Aseem Behl, Despoina Paschalidou, Simon Donne, Andreas Geiger


Despite significant progress in image-based 3D scene flow estimation, the performance of such approaches has not yet reached the fidelity required by many applications. Simultaneously, these applications are often not restricted to image-based estimation: laser scanners provide a popular alternative to traditional cameras, for example in the context of self-driving cars, as they directly yield a 3D point cloud. In this paper, we propose to estimate 3D motion from such unstructured point clouds using a deep neural network. In a single forward pass, our model jointly predicts 3D scene flow as well as the 3D bounding box and rigid body motion of objects in the scene. While the prospect of estimating 3D scene flow from unstructured point clouds is promising, it is also a challenging task. We show that the traditional global representation of rigid body motion prohibits inference by CNNs, and propose a translation equivariant representation to circumvent this problem. For training our deep network, a large dataset is required. Because of this, we augment real scans from KITTI with virtual objects, realistically modeling occlusions and simulating sensor noise. A thorough comparison with classic and learning-based techniques highlights the robustness of the proposed approach.
[flow, motion, recognition, dataset, tracking, moving, optical] [scene, rigid, point, vision, computer, local, estimation, lidar, body, kitti, international, ground, pattern, voxel, coordinate, estimate, rotation, augmented, stereo, cloud, truth, approach, robotics, error, dense, autonomous, directly, dewan, voxels, automation, intelligent, provide, unstructured, well, denote, origin, estimating, virtual, geometry] [ieee, method, translation, comparison, proposed, figure, real, based, image] [convolutional, deep, neural, network, performance, original, sparse] [system, model] [object, feature, detection, global, context, bounding, proposal, propose, map, location, detected, predicted, box, illustrated] [learning, loss, training, representation, set, positive]
@InProceedings{Behl_2019_CVPR,
  author = {Behl, Aseem and Paschalidou, Despoina and Donne, Simon and Geiger, Andreas},
  title = {PointFlowNet: Learning Representations for Rigid Motion Estimation From Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Listen to the Image
Di Hu, Dong Wang, Xuelong Li, Feiping Nie, Qi Wang


Visual-to-auditory sensory substitution devices can assist the blind in sensing the visual environment by translating the visual information into a sound pattern. To improve the translation quality, the task performance of the blind is usually employed to evaluate different encoding schemes. In contrast to this toilsome human-based assessment, we argue that a machine model can also be developed for evaluation, and more efficiently. To this end, we first propose two distinct cross-modal perception models w.r.t. the late-blind and congenitally-blind cases, which aim to generate concrete visual contents based on the translated sound. To validate the functionality of the proposed models, two novel optimization strategies w.r.t. the primary encoding scheme are presented. Further, we conduct sets of human-based experiments to evaluate the encoding schemes and compare the outcomes with the machine-based assessments on the cross-modal generation task. The highly consistent results across different encoding schemes indicate that using a machine model to accelerate the evaluation and reduce experimental cost is feasible to some extent, which could dramatically speed up the improvement of encoding schemes and thus help the blind improve their visual perception ability.
[sound, audio, human, assessment, people, current, employed, time, signal, dataset] [optimization, corresponding, provide, vision] [blind, image, proposed, translated, translation, based, figure, content, generative, generator, quality, frequency, input, color, conditional] [scheme, effective, performance, concrete, firstly, best, device, neural] [visual, encoding, model, perception, generated, evaluation, voice, sensory, machine, generation, primary, substitution, evaluate, generate, modified, cortex, plasticity, auditory, adversarial, arxiv, encoded, imagine, preprint, discriminator, modality, experience, simple] [object, improve, propose, help, adopted, stage] [training, learning, novel, function, embeddings, mnist, difficult, knowledge, effectively, task, testing, digit]
@InProceedings{Hu_2019_CVPR,
  author = {Hu, Di and Wang, Dong and Li, Xuelong and Nie, Feiping and Wang, Qi},
  title = {Listen to the Image},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Image Super-Resolution by Neural Texture Transfer
Zhifei Zhang, Zhaowen Wang, Zhe Lin, Hairong Qi


Due to the significant information loss in low-resolution (LR) images, it has become extremely challenging to further advance the state-of-the-art of single image super-resolution (SISR). Reference-based super-resolution (RefSR), on the other hand, has proven to be promising in recovering high-resolution (HR) details when a reference (Ref) image with similar content as that of the LR input is given. However, the quality of RefSR can degrade severely when Ref is less similar. This paper aims to unleash the potential of RefSR by leveraging more texture details from Ref images with stronger robustness even when irrelevant Ref images are provided. Inspired by the recent work on image stylization, we formulate the RefSR problem as neural texture transfer. We design an end-to-end deep model which enriches HR details by adaptively transferring the texture from Ref images according to their textural similarity. Instead of matching content in the raw pixel space as done by previous methods, our key contribution is a multi-level matching conducted in the neural space. This matching scheme facilitates multi-scale neural transfer that allows the model to benefit more from those semantically related Ref patches, and gracefully degrade to SISR performance on the least relevant Ref inputs. We build a benchmark dataset for the general research of RefSR, which contains Ref images paired with LR inputs with varying levels of similarity. Both quantitative and qualitative evaluations demonstrate the superiority of our method over state-of-the-art.
[recognition, dataset, optical, multiple, video, flow] [computer, vision, pattern, matching, single, local, problem, international, recovering] [texture, image, srntt, conference, refsr, ieee, sisr, proposed, reference, crossnet, content, based, quality, psnr, srgan, patch, input, perceptual, swapping, figure, bicubic, adaptiveness, style, swapped, mdsr, enet, quantitative, qualitative, demonstrate, method, facilitate] [deep, neural, performance, compared, network, convolutional, denotes, table, layer, effectiveness, original, output, structure, residual, achieve] [visual, adversarial, model, evaluation, external] [feature, map, european, adopt, final] [transfer, loss, similarity, learning, training, testing, space, existing, conducted, large]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Zhifei and Wang, Zhaowen and Lin, Zhe and Qi, Hairong},
  title = {Image Super-Resolution by Neural Texture Transfer},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Conditional Adversarial Generative Flow for Controllable Image Synthesis
Rui Liu, Yu Liu, Xinyu Gong, Xiaogang Wang, Hongsheng Li


Flow-based generative models show great potential in image synthesis due to their reversible pipeline and exact log-likelihood target, yet they suffer from weak ability for conditional image synthesis, especially for multi-label or unaware conditions. This is because the potential distribution of image conditions is hard to measure precisely from the latent variable z. In this paper, based on modeling a joint probabilistic density of an image and its conditions, we propose a novel flow-based generative model named conditional adversarial generative flow (CAGlow). Instead of disentangling attributes from the latent space, we blaze a new trail by learning an encoder to estimate the mapping from condition space to latent space in an adversarial manner. Given a specific condition c, CAGlow can encode it to a sampled z, and then enable robust conditional image synthesis in complex situations like combining person identity with multiple attributes. The proposed CAGlow can be implemented in both supervised and unsupervised manners, and thus can synthesize images with conditional information like categories, attributes, and even some unknown properties. Extensive experiments show that CAGlow ensures the independence of different conditions and outperforms regular Glow to a significant extent.
[flow, work, multiple, dataset, forward] [approach, condition, bound, exp, inferred, continuous, well] [latent, generative, conditional, image, synthesis, caglow, figure, real, glow, proposed, reversible, cglow, face, attribute, ezp, mapping, celeba, prior, change, hair, based, identity, bijective, input] [lower, network, deep, block, accuracy, better, number, table, mentioned, output] [adversarial, model, encoder, generated, discriminator, arxiv, preprint, step, variational, decoder, regular, gans, sampled, generate, gan, fake, natural, interpretable, maximizing] [supervision, det, map, three, feature, xiaogang, propose] [distribution, space, unsupervised, loss, log, classifier, objective, training, learning, supervised, unknown, representation, specific, mnist, vaes, data]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Rui and Liu, Yu and Gong, Xinyu and Wang, Xiaogang and Li, Hongsheng},
  title = {Conditional Adversarial Generative Flow for Controllable Image Synthesis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
How to Make a Pizza: Learning a Compositional Layer-Based GAN Model
Dim P. Papadopoulos, Youssef Tamaazousti, Ferda Ofli, Ingmar Weber, Antonio Torralba


A food recipe is an ordered set of instructions for preparing a particular dish. From a visual perspective, every instruction step can be seen as a way to change the visual appearance of the dish by adding extra objects (e.g., adding an ingredient) or changing the appearance of the existing ones (e.g., cooking the dish). In this paper, we aim to teach a machine how to make a pizza by building a generative model that mirrors this step-by-step procedure. To do so, we learn composable module operations which are able to either add or remove a particular ingredient. Each operator is designed as a Generative Adversarial Network (GAN). Given only weak image-level supervision, the operators are trained to generate a visual layer that needs to be added to or removed from the existing image. The proposed model is able to decompose an image into an ordered sequence of layers by applying sequentially in the right order the corresponding removing modules. Experimental results on synthetic and real pizza images demonstrate that our proposed model is able to: (1) segment pizza toppings in a weakly- supervised fashion, (2) remove them by revealing what is occluded underneath them (i.e., inpainting), and (3) infer the ordering of the toppings without any depth ordering supervision. Code, data, and models are available online.
[ordering, predict, sequence, ordered, antonio] [occluded, depth, corresponding, approach, invisible] [image, removing, remove, input, synthetic, generative, proposed, real, appearance, figure, generator, translation, consistency, layered, composite, realistic, cycle, removed] [layer, top, add, achieves, sequentially, apply, order, performance, applying, achieve, residual, output] [model, pizza, adversarial, adding, generated, discriminator, generate, green, gan, generating, food, visual, underneath, gans, pepperoni, pizzagan, arxiv, preprint, cooking, composable, infer, procedure, making, ingredient, generation] [object, segmentation, module, semantic, mask] [class, loss, trained, classification, test, learning, set, training, learn, task]
@InProceedings{Papadopoulos_2019_CVPR,
  author = {Papadopoulos, Dim P. and Tamaazousti, Youssef and Ofli, Ferda and Weber, Ingmar and Torralba, Antonio},
  title = {How to Make a Pizza: Learning a Compositional Layer-Based GAN Model},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
TransGaGa: Geometry-Aware Unsupervised Image-To-Image Translation
Wayne Wu, Kaidi Cao, Cheng Li, Chen Qian, Chen Change Loy


Unsupervised image-to-image translation aims at learning a mapping between two visual domains. However, learning a translation across large geometry variations always ends up with failure. In this work, we present a novel disentangle-and-translate framework to tackle the complex objects image-to-image translation task. Instead of learning the mapping on the image space directly, we disentangle image space into a Cartesian product of the appearance and the geometry latent spaces. Specifically, we first introduce a geometry prior loss and a conditional VAE loss to encourage the network to learn independent but complementary representations. The translation is then built on appearance and geometry space separately. Extensive experiments demonstrate the superior performance of our method over other state-of-the-art approaches, especially in the challenging near-rigid and non-rigid objects translation tasks. In addition, by taking different exemplars as the appearance references, our method also supports multimodal translation. Project page: https://wywu.github.io/projects/TGaGa/TGaGa.html
[human, framework, work, perform, horse, complex, structural] [geometry, shape, problem, approach, directly] [appearance, image, translation, latent, face, method, conditional, munit, drit, landmark, quality, figure, disentangled, cyclegan, prior, style, disentanglement, mapping, quantitative, perceptual, consistency, alexei, chen, synthesis, input, pca, pixel, generative, based, qualitative] [structure, network, deep, better, unit, andrew, performance] [multimodal, adversarial, generated, visual, model, diversity, transformer, fid, introduce, generation, encoder, giraffe, encourage, variational, diverse] [object, ablation] [unsupervised, space, loss, learning, representation, training, large, cat, domain, datasets, code, learn, transfer, labeled, novel, vae]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Wayne and Cao, Kaidi and Li, Cheng and Qian, Chen and Change Loy, Chen},
  title = {TransGaGa: Geometry-Aware Unsupervised Image-To-Image Translation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Depth-Attentional Features for Single-Image Rain Removal
Xiaowei Hu, Chi-Wing Fu, Lei Zhu, Pheng-Ann Heng


Rain is a common weather phenomenon, where object visibility varies with depth from the camera, and faraway objects are visually blocked more by fog than by rain streaks. Existing methods and datasets for rain removal, however, ignore these physical properties, thereby limiting the rain removal efficiency on real photos. In this work, we first analyze the visual effects of rain subject to scene depth and formulate a rain imaging model that collectively accounts for rain streaks and fog; based on this model, we prepare a new dataset called RainCityscapes with rain streaks and fog on real outdoor photos. Furthermore, we design an end-to-end deep neural network, which learns depth-attentional features via a depth-guided attention mechanism and regresses a residual map to produce the rain-free output image. We performed various experiments to visually and quantitatively compare our method with several state-of-the-art methods, demonstrating its superiority over the others.
[dataset] [depth, scene, single, computer, camera, regress, vision, visibility, ground, formulation, note] [rain, image, removal, fog, real, figure, remove, input, method, produce, streak, intensity, based, ieee, raincityscapes, rescan, comparison, removing, haze, prepare, produced, psnr, photo, garg, imaging, ssim, synthesize, realistic, pixel] [network, deep, residual, convolutional, neural, table, layer, formulate, compared, denotes, conv, rate, process, output, group, add, design] [attention, model, visual, decoder, arxiv, preprint] [map, feature, branch, three, comparing, supervision] [training, set, existing, learn, learning, datasets, function]
@InProceedings{Hu_2019_CVPR,
  author = {Hu, Xiaowei and Fu, Chi-Wing and Zhu, Lei and Heng, Pheng-Ann},
  title = {Depth-Attentional Features for Single-Image Rain Removal},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
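The abstract's rain imaging model couples streaks with depth-dependent fog; below is a hedged sketch of such a composite, using the standard atmospheric-scattering fog term plus an additive streak layer. The paper's exact formulation may differ, and beta, the airlight and the streak layer are illustrative.

import numpy as np

def synthesize_rain(clean, depth, streaks, beta=0.01, airlight=0.9):
    # Composite a clean image with depth-dependent fog and additive rain streaks.
    # clean: (H, W, 3) in [0, 1]; depth: (H, W) in metres; streaks: (H, W) in [0, 1].
    # Fog follows the standard scattering model t = exp(-beta * d), so faraway
    # objects are attenuated more by fog than occluded by streaks.
    t = np.exp(-beta * depth)[..., None]             # transmission, (H, W, 1)
    foggy = clean * t + airlight * (1.0 - t)         # atmospheric scattering term
    s = streaks[..., None]
    return np.clip(foggy * (1.0 - s) + s, 0.0, 1.0)  # overlay bright streaks

img = synthesize_rain(np.random.rand(480, 640, 3), np.random.rand(480, 640) * 50,
                      (np.random.rand(480, 640) > 0.98).astype(float))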
Hyperspectral Image Reconstruction Using a Deep Spatial-Spectral Prior
Lizhi Wang, Chen Sun, Ying Fu, Min H. Kim, Hua Huang


Regularization is a fundamental technique to solve an ill-posed optimization problem robustly and is essential to reconstruct compressive hyperspectral images. Various hand-crafted priors have been employed as a regularizer but are often insufficient to handle the wide variety of spectra of natural hyperspectral images, resulting in poor reconstruction quality. Moreover, the prior-regularized optimization requires manual tweaking of its weight parameters to achieve a balance between the spatial and spectral fidelity of result images. In this paper, we present a novel hyperspectral image reconstruction algorithm that substitutes the traditional hand-crafted prior with a data-driven prior, based on an optimization-inspired network. Our method consists of two main parts: First, we learn a novel data-driven prior that regularizes the optimization problem with a goal to boost the spatial-spectral fidelity. Our data-driven prior learns both local coherence and dynamic characteristics of natural hyperspectral images. Second, we combine our regularizer with an optimization-inspired network to overcome the heavy computation problem in the traditional iterative optimization methods. We learn the complete parameters in the network through end-to-end training, enabling robust performance with high accuracy. Extensive simulation and hardware experiments validate the superior performance of our method over the state-of-the-art methods.
[modeling, work, term] [optimization, reconstruction, problem, observation, pattern, vision, computer, aperture, solve, solution, underlying, linear, twist, technique, solving, cube] [image, hyperspectral, prior, spectral, compressive, proposed, method, figure, ieee, cassi, imaging, coded, conference, patch, psnr, sam, ssim, result, based, snapshot, real, oblique, parallelepiped, proposedd, pixel, sensing, mapping, gpsr, hscnn, proposedi, reconstructed, high, proximal] [network, deep, performance, neural, sparsity, compared, number, regularization, hardware, designed, convolutional, layer, superior, sparse, computational] [natural, model, system, iterative, arg, machine] [spatial, stage, propose] [learn, learning, autoencoder, novel, set, training, data, function, exploit, selected]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Lizhi and Sun, Chen and Fu, Ying and Kim, Min H. and Huang, Hua},
  title = {Hyperspectral Image Reconstruction Using a Deep Spatial-Spectral Prior},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
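A generic sketch of one stage of an optimization-inspired network for compressive reconstruction follows: a gradient step on the data-fidelity term, then a learned spatial-spectral prior acting as a proximal operator. The operators, step size and toy stand-ins below are placeholders; this shows the general unrolling pattern the abstract describes, not the paper's exact architecture.

import numpy as np

def unrolled_stage(x, y, forward_op, adjoint_op, prior_net, step=0.5):
    # One unrolled iteration: data-fidelity gradient step, then a learned prior.
    # x: current hyperspectral estimate (H, W, B); y: coded snapshot measurement;
    # forward_op / adjoint_op: the (assumed known) sensing operator and its adjoint;
    # prior_net: a learned denoiser playing the role of the proximal operator.
    grad = adjoint_op(forward_op(x) - y)   # gradient of 0.5 * ||forward_op(x) - y||^2
    return prior_net(x - step * grad)

# toy stand-ins so the sketch runs end to end
forward_op = lambda x: x.sum(axis=-1)                        # fake sensing: spectral sum
adjoint_op = lambda r: np.repeat(r[..., None], 31, axis=-1)  # its adjoint
prior_net = lambda x: np.clip(x, 0.0, 1.0)                   # placeholder "learned" prior
x = unrolled_stage(np.random.rand(64, 64, 31), np.random.rand(64, 64),
                   forward_op, adjoint_op, prior_net)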
LiFF: Light Field Features in Scale and Depth
Donald G. Dansereau, Bernd Girod, Gordon Wetzstein


Feature detectors and descriptors are key low-level vision tools that many higher-level tasks build on. Unfortunately these fail in the presence of challenging light transport effects including partial occlusion, low contrast, and reflective or refractive surfaces. Building on spatio-angular imaging modalities offered by emerging light field cameras, we introduce a new and computationally efficient 4D light field feature detector and descriptor: LiFF. LiFF is scale invariant and utilizes the full 4D light field to detect features that are robust to changes in perspective. This is particularly useful for structure from motion (SfM) and other tasks that match features across viewpoints of a scene. We demonstrate significantly improved 3D reconstructions via SfM when using LiFF instead of the leading 2D or 4D features, and show that LiFF runs an order of magnitude faster than the leading 4D approach. Finally, LiFF inherently estimates depth for each feature, opening a path for future research in light field-based SfM.
[work, recognition, key, dataset, motion] [liff, sift, light, field, computer, vision, depth, slope, focal, spurious, scene, pattern, range, descriptor, subimages, approach, sfm, matching, colmap, putative, inlier, robust, note, occlusion, point, plenoptic, match, directly, well, reconstruction, direct, repeating, good, yielding] [repeated, noise, image, stack, high, conference, imaging, method, comparison, row] [performance, scale, implementation, higher, computational, speed, compared, number, identical, cost, computationally, structure, rate, filter, low, applied] [partial, identify] [feature, detection, including, detected, challenging, detector, edge, object, peak, detect, leading, threshold] [set, space, dog, large, proportion]
@InProceedings{Dansereau_2019_CVPR,
  author = {Dansereau, Donald G. and Girod, Bernd and Wetzstein, Gordon},
  title = {LiFF: Light Field Features in Scale and Depth},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Exemplar-Based Video Colorization
Bo Zhang, Mingming He, Jing Liao, Pedro V. Sander, Lu Yuan, Amine Bermak, Dong Chen


This paper presents the first end-to-end network for exemplar-based video colorization. The main challenge is to achieve temporal consistency while remaining faithful to the reference style. To address this issue, we introduce a recurrent framework that unifies the semantic correspondence and color propagation steps. Both steps allow a provided reference image to guide the colorization of every frame, thus reducing accumulated propagation errors. Video frames are colorized in sequence based on the colorization history, and its coherency is further enforced by the temporal consistency loss. All of these components, learned end-to-end, help produce realistic videos with good temporal stability. Experiments show our result is superior to the state-of-the-art methods both quantitatively and qualitatively.
[video, temporal, frame, propagation, flow, previous, work, optical, propagate, warped, consists] [correspondence, computer, lab, ground, international, matching, local, truth, vision, pattern, defined, allows] [colorization, color, image, reference, method, figure, conference, based, colorized, ieee, acm, comparison, input, colorize, proposed, consistency, result, grayscale, user, produce, realistic, stn, vpn, quantitative, hong, kong] [network, deep, output, top, achieve, layer, imagenet, compare, convolutional, order, neural, apply, table] [arxiv, preprint, automatic, generate, adversarial, discriminator, visual, natural] [feature, semantic, map, subnet, contextual, three, propose] [loss, learning, training, similarity, measure, set, test, transfer, trained]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Bo and He, Mingming and Liao, Jing and Sander, Pedro V. and Yuan, Lu and Bermak, Amine and Chen, Dong},
  title = {Deep Exemplar-Based Video Colorization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
On Finding Gray Pixels
Yanlin Qian, Joni-Kristian Kamarainen, Jarno Nikkanen, Jiri Matas


We propose a novel grayness index for finding gray pixels and demonstrate its effectiveness and efficiency in illumination estimation. The grayness index, GI in short, is derived using the Dichromatic Reflection Model and is learning-free. GI allows estimating one or multiple illumination sources in color-biased images. On standard single-illumination and multiple-illumination estimation benchmarks, GI outperforms state-of-the-art statistical methods and many recent deep methods. GI is simple and fast, written in a few dozen lines of code, processing a 1080p image in 0.4 seconds with non-optimized Matlab code.
[dataset, work, outperforms, multiple] [illumination, dichromatic, estimation, scene, local, camera, surface, problem, light, estimate, corresponding, contrast, estimated, corrected, assumption, vision, single, error, percentage, colour, linear] [color, gray, constancy, image, method, reflection, pixel, grayness, ffcc, statistical, figure, cheng, chakrabarti, based, imaging, clear, trimean, illuminant, captured, proposed, chroma, patch] [table, best, performance, standard, deep, convolutional, compared, process, computational] [median, model, finding, white, worst, van, simple, physical, natural] [global, spatial, box, map] [setting, training, angular, testing, test, set, datasets, tested, trained, share, summarized, novel, learning]
@InProceedings{Qian_2019_CVPR,
  author = {Qian, Yanlin and Kamarainen, Joni-Kristian and Nikkanen, Jarno and Matas, Jiri},
  title = {On Finding Gray Pixels},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
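The overall grey-pixel pipeline is short enough to sketch: score every pixel by a grayness measure, keep the grayest small fraction, and average their RGB values as the illuminant estimate. The grayness measure below (consistency of local contrast across log channels) is a common choice in this line of work and only approximates the paper's GI; the window size and fraction are illustrative.

import numpy as np
from scipy.ndimage import uniform_filter

def estimate_illuminant(img, top_fraction=0.001, eps=1e-6):
    # Grey-pixel style illuminant estimation (sketch): pixels whose local
    # log-contrast is nearly identical in R, G and B are treated as gray, and
    # the grayest small fraction of them votes for the illuminant colour.
    log_img = np.log(img + eps)                               # (H, W, 3)
    contrast = np.abs(log_img - uniform_filter(log_img, size=(3, 3, 1)))
    grayness = contrast.std(axis=-1)                          # small value -> gray pixel
    flat = grayness.reshape(-1)
    n = max(1, int(top_fraction * flat.size))
    idx = np.argsort(flat)[:n]                                # the n grayest pixels
    votes = img.reshape(-1, 3)[idx]
    illum = votes.mean(axis=0)
    return illum / (np.linalg.norm(illum) + eps)

rgb_illum = estimate_illuminant(np.random.rand(480, 640, 3))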
UnOS: Unified Unsupervised Optical-Flow and Stereo-Depth Estimation by Watching Videos
Yang Wang, Peng Wang, Zhenheng Yang, Chenxu Luo, Yi Yang, Wei Xu


In this paper, we propose UnOS, a unified system for unsupervised optical flow and stereo depth estimation using a convolutional neural network (CNN), taking advantage of their inherent geometrical consistency under the rigid-scene assumption. UnOS significantly outperforms other state-of-the-art (SOTA) unsupervised approaches that treat the two tasks independently. Specifically, given two consecutive stereo image pairs from a video, UnOS estimates per-pixel stereo depth images, camera ego-motion and optical flow with three parallel CNNs. Based on these quantities, UnOS computes rigid optical flow and compares it against the optical flow estimated from FlowNet, yielding pixels satisfying the rigid-scene assumption. Then, we encourage geometrical consistency between the two estimated flows within rigid regions, from which we derive a rigid-aware direct visual odometry (RDVO) module. We also propose rigid-aware and occlusion-aware flow-consistency losses for the learning of UnOS. We evaluated our results on the popular KITTI dataset over four related tasks, i.e., stereo depth, optical flow, visual odometry and motion segmentation.
[flow, optical, motion, moving, video, consecutive, jointly, static, joint, flownet, previous, sequence, recognition] [stereo, depth, rigid, unos, computer, matching, camera, estimation, vision, kitti, odometry, scene, pattern, monocular, rdvo, error, sota, corresponding, geometrical, direct, volume, pose, computed, estimated, yielding, pwcnet, geometry, international, left, view, occluded, michael] [consistency, conference, pixel, ieee, image, method, based, proposed, figure] [better, deep, network, performance, represents, convolutional, worse] [visual, arxiv, preprint, evaluation, potential, evaluate] [mask, object, segmentation, module, map, propose, motionnet, spatial] [learning, unsupervised, training, loss, train, supervised, set, target, task, source]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Yang and Wang, Peng and Yang, Zhenheng and Luo, Chenxu and Yang, Yi and Xu, Wei},
  title = {UnOS: Unified Unsupervised Optical-Flow and Stereo-Depth Estimation by Watching Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
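The geometric consistency at the heart of UnOS compares the flow predicted by FlowNet with the "rigid flow" induced by the estimated depth and camera motion; computing that rigid flow takes only a few lines. The intrinsics and pose convention below are assumed for illustration.

import numpy as np

def rigid_flow(depth, K, R, t):
    # Optical flow induced by camera motion over a static (rigid) scene.
    # depth: (H, W); K: 3x3 intrinsics; R, t: relative camera rotation/translation.
    # Pixels are back-projected with the depth, moved by (R, t) and re-projected;
    # the displacement of their image coordinates is the rigid flow.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)    # back-projected 3D points
    cam2 = R @ cam + t.reshape(3, 1)                       # apply relative camera motion
    proj = K @ cam2
    proj = proj[:2] / np.clip(proj[2:], 1e-6, None)        # re-projected pixel coordinates
    flow = proj - pix[:2]
    return flow.reshape(2, h, w)

K = np.array([[718.0, 0, 607.0], [0, 718.0, 185.0], [0, 0, 1.0]])  # KITTI-like intrinsics (illustrative)
f = rigid_flow(np.full((370, 1226), 20.0), K, np.eye(3), np.array([0.0, 0.0, 1.0]))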
Learning Transformation Synchronization
Xiangru Huang, Zhenxiao Liang, Xiaowei Zhou, Yao Xie, Leonidas J. Guibas, Qixing Huang


Reconstructing the 3D model of a physical object typically requires us to align the depth scans obtained from different camera poses into the same coordinate system. Solutions to this global alignment problem usually proceed in two steps. The first step estimates relative transformations between pairs of scans using an off-the-shelf technique. Due to limited information presented between pairs of scans, the resulting relative transformations are generally noisy. The second step then jointly optimizes the relative transformations among all input depth scans. A natural constraint used in this step is the cycle-consistency constraint, which allows us to prune incorrect relative transformations by detecting inconsistent cycles. The performance of such approaches, however, heavily relies on the quality of the input relative transformations. Instead of merely using the relative transformations as the input to perform transformation synchronization, we propose to use a neural network to learn the weights associated with each relative transformation. Our approach alternates between transformation synchronization using weighted relative transformations and predicting new weights of the input relative transformations using a neural network. We demonstrate the usefulness of this approach across a wide range of datasets.
[recurrent, graph, perform, second] [synchronization, approach, relative, tij, rotation, computer, wij, robust, problem, associated, vision, redwood, scannet, matrix, pattern, status, registration, synchronized, note, andrea, qixing, reweighted, pose, respect, consistent, matching, scan, compute, fastgr, error, leonidas, depth, corresponding, underlying, initial, supplementary, international, federica, beatrice, camera, coordinate] [transformation, input, translation, ieee, figure, conference, noise, spectral, image, recovery, proposed, study] [network, output, neural, weight, layer] [rij, vector] [module, global, baseline, map, object] [weighting, sij, function, learning, loss, pairwise, experimental, data, alignment, pair]
@InProceedings{Huang_2019_CVPR,
  author = {Huang, Xiangru and Liang, Zhenxiao and Zhou, Xiaowei and Xie, Yao and Guibas, Leonidas J. and Huang, Qixing},
  title = {Learning Transformation Synchronization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
D2-Net: A Trainable CNN for Joint Description and Detection of Local Features
Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, Torsten Sattler


In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions. We propose an approach where a single convolutional neural network plays a dual role: It is simultaneously a dense feature descriptor and a feature detector. By postponing the detection to a later stage, the obtained keypoints are more stable than their traditional counterparts based on early detection of low-level structures. We show that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotations. The proposed method obtains state-of-the-art performance on both the difficult Aachen Day-Night localization dataset and the InLoc indoor localization benchmark, as well as competitive performance on other benchmarks for image matching and 3D reconstruction.
[dataset, work, performs, report, perform, joint, extract] [approach, local, dense, matching, descriptor, keypoints, illumination, viewpoint, rootsift, pose, marc, torsten, keypoint, josef, single, localized, robust, indoor, camera, point, tomas, akihiko, reconstruction, match] [image, method, proposed, ieee, resolution, pixel, based, extracted] [sparse, number, performance, order, convolutional, better, accuracy, neural, higher, deep] [evaluation, visual, description, strong, query] [feature, detection, detector, cnn, localization, challenging, extraction, stage, object, score, inloc, detected, map, threshold, propose, densely, level] [learning, trained, margin, loss, set, dij, training, task, soft, nearest, retrieval]
@InProceedings{Dusmanu_2019_CVPR,
  author = {Dusmanu, Mihai and Rocco, Ignacio and Pajdla, Tomas and Pollefeys, Marc and Sivic, Josef and Torii, Akihiko and Sattler, Torsten},
  title = {D2-Net: A Trainable CNN for Joint Description and Detection of Local Features},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
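The "detect-later" idea above can be illustrated with a hard, non-differentiable simplification: a location in a dense feature map is kept as a keypoint if its strongest channel response is also a spatial local maximum. The paper uses a soft, trainable relaxation of such a rule; the sketch below is only the hard version, with an assumed threshold.

import numpy as np
from scipy.ndimage import maximum_filter

def detect_from_features(fmap, thresh=0.0):
    # Hard simplification of joint description-and-detection: take, per location,
    # the maximum response over channels, and keep locations that are spatial
    # local maxima of that map and exceed a threshold. fmap: (C, H, W).
    channel_max = fmap.max(axis=0)                              # (H, W)
    local_max = channel_max == maximum_filter(channel_max, size=3)
    keypoints = np.argwhere(local_max & (channel_max > thresh)) # (N, 2) as (row, col)
    descriptors = fmap[:, keypoints[:, 0], keypoints[:, 1]].T   # (N, C), one per keypoint
    descriptors /= np.linalg.norm(descriptors, axis=1, keepdims=True) + 1e-8
    return keypoints, descriptors

kpts, desc = detect_from_features(np.random.rand(512, 60, 80))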
Recurrent Neural Networks With Intra-Frame Iterations for Video Deblurring
Seungjun Nah, Sanghyun Son, Kyoung Mu Lee


Recurrent neural networks (RNNs) are widely used for sequential data processing. Recent state-of-the-art video deblurring methods bank on convolutional recurrent neural network architectures to exploit the temporal relationship between neighboring frames. In this work, we aim to improve the accuracy of recurrent models by adapting the hidden states transferred from past frames to the frame being processed so that the relations between video frames could be better used. We iteratively update the hidden state via re-using RNN cell parameters before predicting an output deblurred frame. Since we use existing parameters to update the hidden state, our method improves accuracy without additional modules. As the architecture remains the same regardless of iteration number, fewer iteration models can be considered as a partial computational path of the models with more iterations. To take advantage of this property, we employ a stochastic method to optimize our iterative models better. At training time, we randomly choose the iteration number on the fly and apply a regularization loss that favors less computation unless there are considerable reconstruction gains. We show that our method exhibits state-of-the-art video deblurring performance while operating in real-time speed.
[video, hidden, state, rnn, frame, recurrent, temporal, time, dataset, dynamic, motion, ovd, term, fbt, optical, recurrence, previous, current, flow, gating] [single, michael, tae, camera, estimation, note, scene, algorithm] [deblurring, method, blur, proposed, blurry, image, sharp, kyoung, latent, burst, gopro, hyun, figure, deblurred, input, dual, restored, blurred, handle, deconvolution] [iteration, stochastic, neural, network, cell, regularization, number, accuracy, deep, better, architecture, computation, performance, adaptive, convolutional, rdn, residual, best, regularized] [model, refer, arxiv, preprint, iterative, describe] [improve, baseline, feature, average] [training, loss, update, train, learning, target, set]
@InProceedings{Nah_2019_CVPR,
  author = {Nah, Seungjun and Son, Sanghyun and Mu Lee, Kyoung},
  title = {Recurrent Neural Networks With Intra-Frame Iterations for Video Deblurring},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Extract Flawless Slow Motion From Blurry Videos
Meiguang Jin, Zhe Hu, Paolo Favaro


In this paper, we introduce the task of generating a sharp slow-motion video given a low frame rate blurry video. We propose a data-driven approach, where the training data is captured with a high frame rate camera and blurry images are simulated through an averaging process. While it is possible to train a neural network to recover the sharp frames from their average, there is no guarantee of temporal smoothness for the resulting video, as the frames are estimated independently. To address the temporal smoothness requirement, we propose a system with two networks: one, DeblurNet, to predict sharp keyframes, and a second, InterpNet, to predict intermediate frames between the generated keyframes. A smooth transition is ensured by interpolating between consecutive keyframes using InterpNet. Moreover, the proposed scheme enables a further increase in frame rate without retraining the network, by applying InterpNet recursively between pairs of sharp frames. We evaluate the proposed method on several datasets, including a novel dataset captured with a Sony RX V camera. We also demonstrate its ability to increase the frame rate by up to 20 times on real blurry videos.
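The recursive frame-rate increase can be pictured with the toy routine below (a hedged sketch; interpnet here stands for any trained network mapping two sharp frames to their middle frame): each pass roughly doubles the frame rate without retraining.

def upsample_frame_rate(frames, interpnet, levels=2):
    # frames: list of sharp keyframes; each level inserts a middle frame
    # between every consecutive pair, roughly doubling the frame rate.
    for _ in range(levels):
        out = []
        for a, b in zip(frames[:-1], frames[1:]):
            out += [a, interpnet(a, b)]
        out.append(frames[-1])
        frames = out
    return frames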
[frame, video, flow, motion, temporal, tntt, recognition, slow, keyframes, dataset, optical, work, consecutive, second] [vision, computer, approach, camera, pattern, single, smoothness, problem, ground, scene, international, alternative, corresponding, estimation, error, truth] [blurry, deblurring, sharp, input, interpolation, conference, image, ieee, interpnet, method, real, captured, sony, ive, intermediate, deblurnet, proposed, nah, kupyn, figure, high, blur, jiang, interpolated, acm, interpolating, realistic, gopro] [rate, network, output, neural, performance, convolutional, increase, number, low, fps, better, table, achieve, deep, architecture, full, subsequent] [generate, model, evaluate, generates, introduce, generated] [three] [training, task, address, loss, learning, gap]
@InProceedings{Jin_2019_CVPR,
  author = {Jin, Meiguang and Hu, Zhe and Favaro, Paolo},
  title = {Learning to Extract Flawless Slow Motion From Blurry Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Natural and Realistic Single Image Super-Resolution With Explicit Natural Manifold Discrimination
Jae Woong Soh, Gu Yong Park, Junho Jo, Nam Ik Cho


Recently, many convolutional neural networks for single image super-resolution (SISR) have been proposed, which focus on reconstructing the high-resolution images in terms of objective distortion measures. However, the networks trained with objective loss functions generally fail to reconstruct the realistic fine textures and details that are essential for better perceptual quality. Recovering the realistic details remains a challenging problem, and only a few works have been proposed which aim at increasing the perceptual quality by generating enhanced textures. However, the generated fake details often make undesirable artifacts and the overall image looks somewhat unnatural. Therefore, in this paper, we present a new approach to reconstructing realistic super-resolved images with high perceptual quality, while maintaining the naturalness of the result. In particular, we focus on the domain prior properties of SISR problem. Specifically, we define the naturalness prior in the low-level domain and constrain the output image in the natural manifold, which eventually generates more natural and realistic images. Our results show better naturalness compared to the recent super-resolution algorithms including perception-oriented ones.
[] [computer, vision, pattern, single, fractal, approach, problem, defined, dense, international] [image, figure, conference, ieee, natsr, perceptual, sisr, srgan, realistic, nmd, proposed, quality, generative, bicubic, naturalness, blurry, method, ilr, ihr, input, rdblock, edsr, texture, unnatural, frsr, enet, niqe, nqsr, result, prior, enhancenet] [residual, conv, network, better, output, deep, neural, best, convolutional, denotes, table, original, compared, sigmoid, architecture, validation, higher, low] [natural, model, manifold, discriminator, adversarial, generates] [average, adopt, feature, score, including, global] [loss, training, space, learning, domain, set, target, noisy, train]
@InProceedings{Soh_2019_CVPR,
  author = {Woong Soh, Jae and Yong Park, Gu and Jo, Junho and Ik Cho, Nam},
  title = {Natural and Realistic Single Image Super-Resolution With Explicit Natural Manifold Discrimination},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
RF-Net: An End-To-End Image Matching Network Based on Receptive Field
Xuelun Shen, Cheng Wang, Xin Li, Zenglei Yu, Jonathan Li, Chenglu Wen, Ming Cheng, Zijian He


This paper proposes a new end-to-end trainable matching network based on receptive fields, RF-Net, to compute sparse correspondences between images. Building an end-to-end trainable matching framework is desirable and challenging. The very recent approach, LF-Net, successfully embeds the entire feature extraction pipeline into a jointly trainable pipeline, and produces state-of-the-art matching results. This paper introduces two modifications to the structure of LF-Net. First, we propose to construct receptive feature maps, which lead to more effective keypoint detection. Second, we introduce a general loss function term, the neighbor mask, to facilitate training patch selection. This results in improved stability in descriptor training. We trained RF-Net on the open dataset HPatches, and compared it with other methods on multiple benchmark datasets. Experiments show that RF-Net outperforms existing state-of-the-art methods.
[dataset, consists, outperforms, second] [descriptor, matching, match, keypoints, hpatches, local, orientation, illumination, viewpoint, pipeline, keypoint, sift, field, approach, nnt, nnr, corresponding, ground, truth, closest, quantity] [image, patch, produce, based, figure, extracted, zhang, method, surf] [network, receptive, performance, scale, effective, convolution, represents, fast, structure, apply, deep, top, sparse, design, output, better, convolutional, ratio] [evaluation, abstract, description, correct] [feature, response, score, detector, three, map, resize, mask, detection, integrated, average, propose, merge, interest, threshold] [loss, training, learning, neighbor, trained, distance, nearest, learned, function, train, existing, protocol, learn]
@InProceedings{Shen_2019_CVPR,
  author = {Shen, Xuelun and Wang, Cheng and Li, Xin and Yu, Zenglei and Li, Jonathan and Wen, Chenglu and Cheng, Ming and He, Zijian},
  title = {RF-Net: An End-To-End Image Matching Network Based on Receptive Field},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fast Single Image Reflection Suppression via Convex Optimization
Yang Yang, Wenye Ma, Yin Zheng, Jian-Feng Cai, Weiyu Xu


Removing undesired reflections from images taken through the glass is of great importance in computer vision. It serves as a means to enhance the image quality for aesthetic purposes as well as to preprocess images in machine learning and pattern recognition applications. We propose a convex model to suppress the reflection from a single input image. Our model implies a partial differential equation with gradient thresholding, which is solved efficiently using Discrete Cosine Transform. Extensive experiments on synthetic and real-world images demonstrate that our approach achieves desirable reflection suppression results and dramatically reduces the execution time compared to the state of the art.
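A minimal sketch of the overall recipe (assuming a grayscale image in [0, 1]; eps and thresh are illustrative values, and the divergence and boundary handling are simplified relative to the paper): weak gradients, assumed to belong to the blurry reflection, are zeroed, and the image is rebuilt by solving a screened Poisson equation in the DCT domain.

import numpy as np
from scipy.fft import dctn, idctn

def suppress_reflection(img, eps=0.03, thresh=0.08):
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1, append=img[:, -1:])           # forward differences
    gy = np.diff(img, axis=0, append=img[-1:, :])
    mag = np.hypot(gx, gy)
    gx[mag < thresh] = 0.0                                   # drop weak (reflection) gradients
    gy[mag < thresh] = 0.0
    div = np.zeros_like(img)                                 # divergence of the thresholded field
    div[:, 0] = gx[:, 0]
    div[:, 1:] += gx[:, 1:] - gx[:, :-1]
    div[0, :] += gy[0, :]
    div[1:, :] += gy[1:, :] - gy[:-1, :]
    h, w = img.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    lam = 2 * np.cos(np.pi * xx / w) + 2 * np.cos(np.pi * yy / h) - 4   # Laplacian eigenvalues
    rhs = dctn(eps * img - div, norm="ortho")                # solve (eps - Laplacian) T = eps*I - div(g)
    return np.clip(idctn(rhs / (eps - lam), norm="ortho"), 0.0, 1.0)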
[time, multiple, term, work] [single, computer, approach, pattern, solution, vision, solving, equation, wan, letter, convex, assumption, problem, column, international, solved] [reflection, image, proposed, input, transmission, method, removal, ieee, conference, synthetic, arvanitopoulos, figure, fidelity, sharp, ssim, quality, dereflected, result, remove, separation, comparison, psnr, glass, separating, dark, smartphone, dereflection, prior, portable, row, separate, blended] [layer, gradient, parameter, best, desirable, performance, achieves, number, efficiency, compared, increasing, table, original, better, sparsity, size] [model, execution, brown, visual, external, retains, wooden] [suppression, propose, edge, threshold] [data, large, viewed]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Yang and Ma, Wenye and Zheng, Yin and Cai, Jian-Feng and Xu, Weiyu},
  title = {Fast Single Image Reflection Suppression via Convex Optimization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Mutual Learning Method for Salient Object Detection With Intertwined Multi-Supervision
Runmin Wu, Mengyang Feng, Wenlong Guan, Dong Wang, Huchuan Lu, Errui Ding


Though deep learning techniques have made great progress in salient object detection recently, the predicted saliency maps still suffer from incomplete predictions due to the internal complexity of objects, and from inaccurate boundaries caused by strides in convolution and pooling operations. To alleviate these issues, we propose to train saliency detection networks by exploiting supervision from not only salient object detection, but also foreground contour detection and edge detection. First, we leverage the salient object detection and foreground contour detection tasks in an intertwined manner to generate saliency maps with uniform highlighting. Second, the foreground contour and edge detection tasks guide each other simultaneously, thereby leading to more precise foreground contour prediction and reducing local noise in edge prediction. In addition, we develop a novel mutual learning module (MLM) which serves as the building block of our method. Each MLM consists of multiple network branches trained in a mutual learning manner, which improves the performance by a large margin. Extensive experiments on seven challenging datasets demonstrate that the proposed method delivers state-of-the-art results in both salient object detection and edge detection.
[dataset, recurrent, internal, multiple, time, work] [computer, vision, pattern, accurate, local, international, volume, ground] [contour, ieee, conference, image, method, proposed, figure, comparison, input, based, guide] [network, deep, table, performance, convolution, architecture, cost, entire, better, compare, pooling] [visual, model, evaluate, generate, encoder, decoder] [detection, salient, edge, saliency, object, foreground, intertwined, three, supervision, mae, propose, module, backbone, map, including, hierarchical, mlms, duts, predicted, mlm, challenging, amulet, semantic, srm, ecssd, ablation, ashp, leverage] [learning, training, mutual, train, set, test, uniform, datasets, strategy, novel, task, suffer]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Runmin and Feng, Mengyang and Guan, Wenlong and Wang, Dong and Lu, Huchuan and Ding, Errui},
  title = {A Mutual Learning Method for Salient Object Detection With Intertwined Multi-Supervision},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Enhanced Pix2pix Dehazing Network
Yanyun Qu, Yizi Chen, Jingying Huang, Yuan Xie


In this paper, we reduce the image dehazing problem to an image-to-image translation problem, and propose Enhanced Pix2pix Dehazing Network (EPDN), which generates a haze-free image without relying on the physical scattering model. EPDN is embedded by a generative adversarial network, which is followed by a well-designed enhancer. Inspired by visual perception global-first theory, the discriminator guides the generator to create a pseudo realistic image on a coarse scale, while the enhancer following the generator is required to produce a realistic dehazing image on the fine scale. The enhancer contains two enhancing blocks based on the receptive field model, which reinforces the dehazing effect in both color and details. The embedded GAN is jointly trained with the enhancer. Extensive experiment results on synthetic datasets and real-world datasets show that the proposed EPDN is superior to the state-of-the-art methods in terms of PSNR, SSIM, PI, and subjective visual effect.
[dataset, recognition, second, formulated, work] [computer, single, vision, scattering, pattern, outdoor, light, matching, indoor, estimate, directly] [image, dehazing, generator, enhancing, epdn, hazy, proposed, transmission, perceptual, method, color, haze, ieee, atmospheric, gfn, psnr, conference, realistic, enhancer, dcpdn, comparison, ssim, produce, based, figure, dcp, input, translation, prior, quality, synthesis, real, study, generative, synthetic, includes] [output, original, network, best, performance, achieves, architecture, block, convolutional, compared, neural, convolution, skip, channel, deep] [gan, visual, adversarial, discriminator, embedded, physical, model] [feature, map, three, ablation, global, module] [loss, training, function, set, china, learning]
@InProceedings{Qu_2019_CVPR,
  author = {Qu, Yanyun and Chen, Yizi and Huang, Jingying and Xie, Yuan},
  title = {Enhanced Pix2pix Dehazing Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Assessing Personally Perceived Image Quality via Image Features and Collaborative Filtering
Jari Korhonen


During the past few years, different methods for optimizing camera settings and post-processing techniques to improve the subjective quality of consumer photos have been studied extensively. However, most of the research in the prior art has focused on finding the optimal method for an average user. Since there is a large deviation in personal opinions and aesthetic standards, the next challenge is to find the settings and post-processing techniques that fit individual users' personal taste. In this study, we aim to predict the personally perceived image quality by combining classical image feature analysis with the collaborative filtering approach known from recommendation systems. The experimental results for the proposed method are promising. As a practical application, our work can be used for personalizing camera settings or post-processing parameters for different users and images.
[dataset, assessment, predicting, predict, individual, work, prediction, combining] [rmse, contrast, relative, approach, matrix, camera, well] [image, quality, user, proposed, collaborative, filtering, method, subjective, scc, rating, pcc, enhancement, personal, based, preferred, latent, ieee, figure, database, prior, randimfeats, personally, content, comparison, forest, papi, aesthetic, study, perceived, application, extracted] [validation, scheme, performance, table, gradient, neural, group, best, boosting, accuracy, deep, initialized] [model, vector, random, evaluation, find] [feature, regression, average, baseline, improvement, extraction] [training, datasets, item, test, recommendation, set, scenario, source, randomly, conventional]
@InProceedings{Korhonen_2019_CVPR,
  author = {Korhonen, Jari},
  title = {Assessing Personally Perceived Image Quality via Image Features and Collaborative Filtering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Single Image Reflection Removal Exploiting Misaligned Training Data and Network Enhancements
Kaixuan Wei, Jiaolong Yang, Ying Fu, David Wipf, Hua Huang


Removing undesirable reflections from a single image captured through a glass window is of practical importance to visual computing systems. Although state-of-the-art methods can obtain decent results in certain situations, performance declines significantly when tackling more general real-world cases. These failures stem from the intrinsic difficulty of single image reflection removal -- the fundamental ill-posedness of the problem, and the insufficiency of densely-labeled training data needed for resolving this ambiguity within learning-based neural network pipelines. In this paper, we address these issues by exploiting targeted network enhancements and the novel use of misaligned data. For the former, we augment a baseline network architecture by embedding context encoding modules that are capable of leveraging high-level contextual clues to reduce indeterminacy within areas containing strong reflections. For the latter, we introduce an alignment-invariant loss function that facilitates exploiting misaligned real-world training data that is much easier to collect. Experimental results collectively show that our method outperforms the state-of-the-art with aligned data, and that significant improvements are possible when using additional misaligned data.
[recognition, dataset, human, finetuned, prediction] [vision, computer, pattern, single, international, june, additional, approach, problem] [image, reflection, unaligned, conference, ieee, removal, real, errnet, figure, input, linv, misaligned, method, ssim, background, synthetic, psnr, lcx, july, separation, comparison, result, ncc, lmse, collected, lpixel, prior, transmitted, based, ladv] [network, layer, deep, channel, table, convolutional, performance, better, gradient, pooling, neural, architecture, residual, batch, pretrained] [visual, adversarial, encoding, model, attention, machine] [feature, aligned, context, contextual, basenet, spatial, global, pyramid, module, european] [loss, training, data, function, learning, large, set, train, testing, trained]
@InProceedings{Wei_2019_CVPR,
  author = {Wei, Kaixuan and Yang, Jiaolong and Fu, Ying and Wipf, David and Huang, Hua},
  title = {Single Image Reflection Removal Exploiting Misaligned Training Data and Network Enhancements},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Exploring Context and Visual Pattern of Relationship for Scene Graph Generation
Wenbin Wang, Ruiping Wang, Shiguang Shan, Xilin Chen


Relationships are the core of a scene graph, but their prediction remains far from satisfactory because of their complex visual diversity. To alleviate this problem, we treat a relationship as an abstract object, exploring not only its significant visual patterns but also its contextual information, the two key aspects considered in object recognition. Our observation of current datasets reveals an intimate association among relationships. Therefore, inspired by the successful application of context to object-oriented tasks, we construct a context for relationships in which all of them are gathered, so that recognition can benefit from their association. Moreover, accurate recognition requires discriminative visual patterns for objects, and the same holds for relationships. To discover effective patterns for relationships, traditional relationship feature extraction methods, such as using the union region or combining subject-object feature pairs, are replaced with our proposed intersection region, which focuses on the more essential parts. We therefore present our Relationship Context - InterSeCtion Region (CISC) method. Experiments on scene graph generation on the Visual Genome dataset and on visual relationship prediction on the VRD dataset indicate that both the relationship context and the intersection region improve performance and fulfil their anticipated roles.
[graph, recognition, passing, dataset, prediction, key, construct, walking, extract, framework] [scene, vision, computer, pattern, international, volume, ground, left, truth, case, initial, ear] [conference, ieee, image, figure, proposed, based, glass] [basic, table, neural, number, processing, order, better, vij, denotes, firstly, mentioned] [relationship, visual, model, memory, message, generation, wearing, reasoning, repetition, predicate, mem, association, vrd, evaluation, reimplemented, man, evaluate, imp] [object, context, region, intersection, union, feature, detection, european, predicted, head, dotted, spatial, detected, contextual, module, roi, detector] [update, classification, cat]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Wenbin and Wang, Ruiping and Shan, Shiguang and Chen, Xilin},
  title = {Exploring Context and Visual Pattern of Relationship for Scene Graph Generation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning From Synthetic Data for Crowd Counting in the Wild
Qi Wang, Junyu Gao, Wei Lin, Yuan Yuan


Counting the number of people in crowd scenes has recently become a hot topic because of its widespread applications (e.g. video surveillance, public security). It is a difficult task in the wild: changeable environments and widely varying crowd sizes prevent current methods from working well. In addition, due to scarce data, many methods suffer from over-fitting to varying extents. To remedy these two problems, we first develop a data collector and labeler, which can generate synthetic crowd scenes and annotate them automatically, without any manpower. Based on it, we build a large-scale, diverse synthetic dataset. Secondly, we propose two schemes that exploit the synthetic data to boost the performance of crowd counting in the wild: 1) pretrain a crowd counter on the synthetic data, then finetune it using the real data, which significantly improves the model's performance on real data; 2) propose a crowd counting method via domain adaptation, which can free humans from heavy data annotations. Extensive experiments show that the first method achieves state-of-the-art performance on four real datasets, and the second outperforms our baselines. The dataset and source code are available at https://gjy3035.github.io/GCC-CL/.
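The first scheme can be summarised by the sketch below (PyTorch; the loader names, the pixel-wise MSE loss, the learning rate and the epoch counts are placeholders, not the paper's settings): the same counter is first trained on synthetic (image, density map) pairs and then finetuned on the much smaller real dataset.

import torch

def pretrain_then_finetune(model, synthetic_loader, real_loader,
                           epochs_syn=50, epochs_real=20, lr=1e-5):
    criterion = torch.nn.MSELoss()                       # pixel-wise density map loss
    for loader, epochs in [(synthetic_loader, epochs_syn), (real_loader, epochs_real)]:
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for images, density_maps in loader:
                optimizer.zero_grad()
                loss = criterion(model(images), density_maps)
                loss.backward()
                optimizer.step()
    return model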
[dataset, people, ucf, construct, time, video] [computer, vision, pattern, scene, weather, analysis, international, local] [synthetic, cycle, real, gcc, image, ssim, ieee, conference, sfcn, proposed, method, grs, based, sht, figure, translated, mse, psnr, collector, counter, shanghai, tech, produce, gsr, remedy] [density, performance, congested, table, convolutional, number, original, deep, network, pretrained, effective, compared, neural, science] [gan, model, random, arxiv, diverse, game, preprint, develop, find] [crowd, counting, propose, spatial, map, person, improve, three, mae, fully] [data, domain, adaptation, learning, training, embedding, loss, datasets, set, exploit, supervised, label, effectively, existing, learn]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Qi and Gao, Junyu and Lin, Wei and Yuan, Yuan},
  title = {Learning From Synthetic Data for Crowd Counting in the Wild},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Local Block Coordinate Descent Algorithm for the CSC Model
Ev Zisselman, Jeremias Sulam, Michael Elad


The Convolutional Sparse Coding (CSC) model has recently gained considerable traction in the signal and image processing communities. By providing a global, yet tractable, model that operates on the whole image, the CSC was shown to overcome several limitations of the patch-based sparse model while achieving superior performance in various applications. Contemporary methods for pursuit and learning the CSC dictionary often rely on the Alternating Direction Method of Multipliers (ADMM) in the Fourier domain for the computational convenience of convolutions, while ignoring the local characterizations of the image. In this work we propose a new and simple approach that adopts a localized strategy, based on the Block Coordinate Descent algorithm. The proposed method, termed Local Block Coordinate Descent (LoBCoD), operates locally on image patches. Furthermore, we introduce a novel stochastic gradient descent version of LoBCoD for training the convolutional filters. This Stochastic-LoBCoD leverages the benefits of online learning, while being applicable even to a single training image. We demonstrate the advantages of the proposed algorithms for image inpainting and multi-focus image fusion, achieving state-of-the-art results.
[signal, online, fusion, dataset, work, time, showing, version, previous, operates] [algorithm, local, approach, international, problem, coordinate, solving, equation, corresponding, computer, supplementary, direction, fourier, single, solution, pattern] [image, dictionary, proposed, ieee, csc, figure, pursuit, method, conference, inpainting, reconstructed, comparison, based, lobcod, sbdl, corrupted, presented, component, patch, needle, described, ybk, blur, blurred] [sparse, convolutional, descent, gradient, stochastic, block, number, processing, original, coding, compared, better, residual, batch, small, parameter, performance, size, computation, performed, table, applying] [model, vector, step, rule] [global, feature, boundary] [update, learning, min, set, training, representation, task, data, test]
@InProceedings{Zisselman_2019_CVPR,
  author = {Zisselman, Ev and Sulam, Jeremias and Elad, Michael},
  title = {A Local Block Coordinate Descent Algorithm for the CSC Model},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Not Using the Car to See the Sidewalk -- Quantifying and Controlling the Effects of Context in Classification and Segmentation
Rakshith Shetty, Bernt Schiele, Mario Fritz


Importance of visual context in scene understanding tasks is well recognized in the computer vision community. However, to what extent the computer vision models are dependent on the context to make their predictions is unclear. A model overly relying on context will fail when encountering objects in different contexts than in training data and hence it is important to identify these dependencies before we can deploy the models in the real-world. We propose a method to quantify the sensitivity of black-box vision models to visual context by editing images to remove selected objects and measuring the response of the target models. We apply this methodology on two tasks, image classification and semantic segmentation, and discover undesirable dependency between objects and context, for example that "sidewalk" segmentation is very sensitive to the presence of "cars" in the image. We propose an object removal based data augmentation solution to mitigate this dependency and increase the robustness of classification and segmentation models to contextual variations. Our experiments show that the proposed data augmentation helps these models improve the performance in out-of-context scenarios, while preserving the performance on regular data.
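The probing procedure can be outlined as follows (a hypothetical helper; the edited image with the object removed and inpainted would come from the paper's removal pipeline, and an image classifier is assumed): the drop in the target-class confidence after removing an unrelated object measures how strongly the black-box model depends on that context object.

import torch

def context_sensitivity(model, image, image_without_object, target_class):
    # image / image_without_object: (3, H, W) tensors, the latter with a
    # non-target object removed and inpainted.
    model.eval()
    with torch.no_grad():
        p_full = torch.softmax(model(image.unsqueeze(0)), dim=1)[0, target_class]
        p_edit = torch.softmax(model(image_without_object.unsqueeze(0)), dim=1)[0, target_class]
    return (p_full - p_edit).item()          # large positive value = strong context dependence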
[recognition, dataset, perform, dependency] [vision, computer, single, pattern, robust, augmented, well, scene, analysis, good] [image, removal, conference, removed, ieee, based, figure, remove, proposed, removing, quantifying, real] [performance, original, full, table, network, higher, better, neural, deep, convolutional] [model, robustness, edited, visual, sensitivity, regular, example, measuring, dependence] [context, object, segmentation, upernet, contextual, baseline, sidewalk, semantic, road, car, improve, keyboard, coco, sci, detection, three, unrel, map, presence, segment] [data, classification, classifier, training, augmentation, trained, class, test, set, split, min, quantify, train, measure, loss, negative, target]
@InProceedings{Shetty_2019_CVPR,
  author = {Shetty, Rakshith and Schiele, Bernt and Fritz, Mario},
  title = {Not Using the Car to See the Sidewalk -- Quantifying and Controlling the Effects of Context in Classification and Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Discovering Fair Representations in the Data Domain
Novi Quadrianto, Viktoriia Sharmanska, Oliver Thomas


Interpretability and fairness are critical in computer vision and machine learning applications, in particular when dealing with human outcomes, e.g. inviting or not inviting for a job interview based on application materials that may include photographs. One promising direction to achieve fairness is by learning data representations that remove the semantics of protected characteristics, and are therefore able to mitigate unfair outcomes. All available models however learn latent embeddings which comes at the cost of being uninterpretable. We propose to cast this problem as data-to-data translation, i.e. learning a mapping from an input domain to a fair target domain, where a fairness definition is being enforced. Here the data domain can be images, or any tabular data representation. This task would be straightforward if we had fair target data available, but this is not the case. To overcome this, we learn a highly unconstrained mapping by exploiting statistics of residuals -- the difference between input data and its translated version -- and the protected characteristics. When applied to the CelebA dataset of face images with gender attribute as the protected characteristic, our model enforces equality of opportunity by adjusting the eyes and lips regions. Intriguingly, on the same dataset we arrive at similar conclusions when using semantic attribute representations of images for translation. On face images of the recent DiF dataset, with the same gender attribute, our method adjusts nose regions. In the Adult income dataset, also with protected gender attribute, our model achieves equality of opportunity by, among others, obfuscating the wife and husband relationship. Analyzing those systematic changes will allow us to scrutinize the interplay of fairness criterion, chosen protected characteristics, and prediction performance.
[dataset, prediction, work, recognition, focus, perform, multiple] [computer, decomposition, vision, international, approach, enforce, pattern, well] [fairness, image, protected, input, equality, attribute, opportunity, conference, mapping, translated, method, translation, figure, celeba, gender, tpr, face, age, characteristic, latent, statistical, style, job, tabular, female, male, conditional, translate] [neural, network, original, group, residual, binary, layer, kernel, deep, output, accuracy, performance, table, criterion, convolutional] [machine, transformer, interpretability, model, adversarial, dependence, interpretable, diversity] [feature, semantic] [fair, learning, representation, data, training, learn, classifier, domain, loss, target, learned, positive, svm, trained, set, transfer, measure, classification]
@InProceedings{Quadrianto_2019_CVPR,
  author = {Quadrianto, Novi and Sharmanska, Viktoriia and Thomas, Oliver},
  title = {Discovering Fair Representations in the Data Domain},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Actor-Critic Instance Segmentation
Nikita Araslanov, Constantin A. Rothkopf, Stefan Roth


Most approaches to visual scene analysis have emphasised parallel processing of the image elements. However, one area in which the sequential nature of vision is apparent, is that of segmenting multiple, potentially similar and partially occluded objects in a scene. In this work, we revisit the recurrent formulation of this challenging problem in the context of reinforcement learning. Motivated by the limitations of the global max-matching assignment of the ground-truth segments to the recurrent states, we develop an actor-critic approach in which the actor recurrently predicts one instance mask at a time and utilises the gradient from a concurrently trained critic network. We formulate the state, action, and the reward such as to let the critic model long-term effects of the current prediction and in- corporate this information into the gradient signal. Furthermore, to enable effective exploration in the inherently high-dimensional action space of instance masks, we learn a compact representation using a conditional variational auto-encoder. We show that our actor-critic model consistently provides accuracy benefits over the recurrent baseline on standard instance segmentation benchmarks.
[state, action, prediction, actor, recurrent, lstm, future, previous, sequential, time, current, work, buffer, dataset, temporal, timestep, hidden, sequence] [assignment, note, algorithm, approach, well, kitti, problem, corresponding, ground, computed, compute, vision] [image, latent, method, input, quality] [network, order, gradient, deep, accuracy, table, number, validation, standard, max, processing] [model, critic, reward, dice, exploration, reinforcement, decoder, policy, cvppp, episode, variational, encoder, bptt] [instance, segmentation, mask, context, object, predicted, score, baseline, counting, segmented, spatial, improve, bounding, box, raquel, semantic] [loss, set, training, trained, function, learning, representation, space, train, test, learn]
@InProceedings{Araslanov_2019_CVPR,
  author = {Araslanov, Nikita and Rothkopf, Constantin A. and Roth, Stefan},
  title = {Actor-Critic Instance Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders
Edgar Schonfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, Zeynep Akata


Many approaches in generalized zero-shot learning rely on cross-modal mapping between the image feature space and the class embedding space. As labeled images are expensive, one direction is to augment the dataset by generating either images or image features. However, the former misses fine-grained details and the latter requires learning a mapping associated with class embeddings. In this work, we take feature generation one step further and propose a model where a shared latent space of image features and class embeddings is learned by modality-specific aligned variational autoencoders. This leaves us with the required discriminative information about the image and classes in the latent features, on which we train a softmax classifier. The key to our approach is that we align the distributions learned from images and from side-information to construct latent features that contain the essential multi-modal information associated with unseen classes. We evaluate our learned latent features on several benchmark datasets, i.e. CUB, SUN, AWA1 and AWA2, and establish a new state of the art on generalized zero-shot as well as on few-shot learning. Moreover, our results on ImageNet with various zero-shot splits show that our latent features generalize well in large-scale settings.
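The two alignment terms can be sketched as below (PyTorch; the encoder outputs, the decoders dec_img/dec_att and the variable names are assumptions, and only the alignment losses are shown, not the full VAE objective): cross-reconstruction decodes each modality from the other modality's latent code, and distribution alignment penalises the closed-form 2-Wasserstein distance between the two diagonal latent Gaussians.

import torch

def alignment_losses(z_img_mu, z_img_logvar, z_att_mu, z_att_logvar,
                     img_feat, att, dec_img, dec_att):
    def reparam(mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    z_img, z_att = reparam(z_img_mu, z_img_logvar), reparam(z_att_mu, z_att_logvar)
    # cross-reconstruction: decode image features from the attribute latent and vice versa
    cross_recon = (torch.abs(dec_img(z_att) - img_feat).mean()
                   + torch.abs(dec_att(z_img) - att).mean())
    # squared 2-Wasserstein distance between the two diagonal Gaussians
    w2 = (((z_img_mu - z_att_mu) ** 2).sum(dim=1)
          + ((torch.exp(0.5 * z_img_logvar) - torch.exp(0.5 * z_att_logvar)) ** 2).sum(dim=1))
    return cross_recon, w2.sqrt().mean()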
[dataset, state, hidden, framework] [approach, well] [latent, image, figure, side, conditional, method, proposed, attribute] [accuracy, number, performance, imagenet, size, achieved, compare, better, epoch] [model, visual, variational, generating, generate, evaluate, generated, sentence, decoder, provided, wasserstein, common, encoder] [feature, aligned, semantic, benchmark, cnn] [class, unseen, learning, space, training, distribution, embeddings, alignment, generalized, gzsl, learn, data, vae, embedding, revise, learned, vaes, harmonic, set, cub, classifier, train, loss, minimizing, shared, distance, discriminative, task, datasets, acch, autoencoders, transfer, representation, test]
@InProceedings{Schonfeld_2019_CVPR,
  author = {Schonfeld, Edgar and Ebrahimi, Sayna and Sinha, Samarth and Darrell, Trevor and Akata, Zeynep},
  title = {Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Semantic Projection Network for Zero- and Few-Label Semantic Segmentation
Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, Zeynep Akata


Semantic segmentation is one of the most fundamental problems in computer vision and pixel-level labelling in this context is particularly expensive. Hence, there have been several attempts to reduce the annotation effort, such as learning from image level labels and bounding box annotations. In this paper we take this one step further and focus on the challenging task of zero- and few-shot learning of semantic segmentation. We define this task as image segmentation by assigning a label to every pixel even though either no labeled sample of that class was present during training, i.e. zero-label semantic segmentation, or only a few labeled samples were present, i.e. few-label semantic segmentation. Our goal is to transfer the knowledge from previously seen classes to novel classes. Our proposed semantic projection network (SPNet) achieves this goal by incorporating class-level semantic information into any network designed for semantic segmentation, in an end-to-end manner. We also propose a benchmark for this task on the challenging COCO-Stuff and PASCAL VOC12 datasets. Our model is effective not only in segmenting novel classes, i.e. alleviating expensive dense annotations, but also in adapting to novel classes without forgetting its prior knowledge, i.e. generalized zero- and few-label semantic segmentation.
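The projection idea admits a very small sketch (PyTorch; the backbone producing per-pixel features and the source of the fixed word embeddings are assumptions): pixel features are scored against class embeddings by a dot product, so unseen classes can be handled at test time simply by appending their embeddings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticProjectionHead(nn.Module):
    def __init__(self, class_embeddings):               # (num_classes, d) fixed word embeddings
        super().__init__()
        self.register_buffer("emb", class_embeddings)

    def forward(self, pixel_feats):                      # (B, d, H, W) from any segmentation backbone
        logits = torch.einsum("bdhw,cd->bchw", pixel_feats, self.emb)
        return F.log_softmax(logits, dim=1)              # trained with NLL over seen classes only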
[predict, prediction, report, multiple] [projection, dense, require, initial] [image, figure, pixel, prior] [network, deep, layer, table, convolutional, achieves, standard, imagenet, number, output, better] [model, word, visual, evaluation] [semantic, segmentation, iou, feature, segment, object, miou, propose, cnn, pascal, baseline, challenging, fully, context, segmenting, mask, voc] [unseen, embedding, spnet, class, training, learning, novel, labeled, test, space, set, knowledge, task, zlss, embeddings, classification, loss, yij, generalized, trained, label, fasttext, data, transfer, harmonic, aim, learn, base, target, hinge, similarity, learned, classifier]
@InProceedings{Xian_2019_CVPR,
  author = {Xian, Yongqin and Choudhury, Subhabrata and He, Yang and Schiele, Bernt and Akata, Zeynep},
  title = {Semantic Projection Network for Zero- and Few-Label Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
GCAN: Graph Convolutional Adversarial Network for Unsupervised Domain Adaptation
Xinhong Ma, Tianzhu Zhang, Changsheng Xu


To bridge source and target domains for domain adaptation, there are three important types of information including data structure, domain label, and class label. Most existing domain adaptation approaches exploit only one or two types of this information and cannot make them complement and enhance each other. Different from existing methods, we propose an end-to-end Graph Convolutional Adversarial Network (GCAN) for unsupervised domain adaptation by jointly modeling data structure, domain label, and class label in a unified deep framework. The proposed GCAN model enjoys several merits. First, to the best of our knowledge, this is the first work to model the three kinds of information jointly in a deep model for unsupervised domain adaptation. Second, the proposed model has designed three effective alignment mechanisms including structure-aware alignment, domain alignment, and class centroid alignment, which can learn domain-invariant and semantic representations effectively to reduce the domain discrepancy for domain adaptation. Extensive experimental results on five standard benchmarks demonstrate that the proposed GCAN algorithm performs favorably against state-of-the-art unsupervised domain adaptation methods.
[graph, jointly, modeling, dataset, gcn, work] [analysis, matrix, journal] [proposed, ieee, method, image, figure, spectral, jan, comparison] [deep, structure, network, convolutional, neural, reduce, accuracy, alexnet, standard, shift, analyzer, designed, performance, applied, table] [adversarial, model, arxiv, preprint, machine, visual] [three, feature, semantic, including, category, cnn, instance, propose] [domain, adaptation, class, data, target, source, alignment, label, unsupervised, learning, transfer, gcan, centroid, learn, loss, revgrad, trained, mstn, discrepancy, labeled, training, classification, unified, classifier, discriminative, knowledge, existing, distribution, mingsheng, jianmin, unlabeled, mapped, space, set, gakt, judy]
@InProceedings{Ma_2019_CVPR,
  author = {Ma, Xinhong and Zhang, Tianzhu and Xu, Changsheng},
  title = {GCAN: Graph Convolutional Adversarial Network for Unsupervised Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Seamless Scene Segmentation
Lorenzo Porzi, Samuel Rota Bulo, Aleksander Colovic, Peter Kontschieder


In this work we introduce a novel, CNN-based architecture that can be trained end-to-end to deliver seamless scene segmentation results. Our goal is to predict consistent semantic segmentation and detection results by means of a panoptic output format, going beyond the simple combination of independently trained segmentation and detection models. The proposed architecture takes advantage of a novel segmentation head that seamlessly integrates multi-scale features generated by a Feature Pyramid Network with contextual information conveyed by a light-weight DeepLab-like module. As additional contribution we review the panoptic metric and propose an alternative that overcomes its limitations when evaluating non-instance categories. Our proposed network architecture yields state-of-the-art results on three challenging street-level datasets, i.e. Cityscapes, Indian Driving Dataset and Mapillary Vistas.
[work, combined, recognition, dataset, driving, individual, independently, joint, fed, jointly, prediction, time] [scene, ground, truth, single, provide, corresponding, denote, body] [image, proposed, input, resolution, based, pixel] [output, network, architecture, convolution, convolutional, deep, computational, layer, pooling, operation] [model, provided] [segmentation, semantic, instance, panoptic, stuff, mask, object, bounding, head, region, predicted, detection, final, box, proposal, branch, seamless, mapillary, backbone, score, kaiming, fully, fpn, iou, ross, feature, pyramid, thing, segment, piotr, doll, module, anchor, nclasses] [class, metric, set, training, trained, learning, test, reported, large, rota]
@InProceedings{Porzi_2019_CVPR,
  author = {Porzi, Lorenzo and Rota Bulo, Samuel and Colovic, Aleksander and Kontschieder, Peter},
  title = {Seamless Scene Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Image Matching and Object Discovery as Optimization
Huy V. Vo, Francis Bach, Minsu Cho, Kai Han, Yann LeCun, Patrick Perez, Jean Ponce


Learning with complete or partial supervision is powerful but relies on ever-growing human annotation efforts. As a way to mitigate this serious problem, as well as to serve specific applications, unsupervised learning has emerged as an important field of research. In computer vision, unsupervised learning comes in various guises. We focus here on the unsupervised discovery and matching of object categories among images in a collection, following the work of Cho et al. [12]. We show that the original approach can be reformulated and solved as a proper optimization problem. Experiments on several benchmarks establish the merit of our approach.
[graph, largest, video] [algorithm, problem, optimization, continuous, note, corresponding, matching, associated, normalized, defined, denote, approach, solution, form, eij, computed, define, cubic, single, matrix, match, case] [image, cho, method, figure, separate, mixed, dual] [number, deep, performance, table, better, vij, small, search, fixed, max, element, structure] [visual, ascent, greedy, model, primal, potential] [object, voc, region, discovery, score, standout, proposal, skl, selective, fully, supermodular, unnormalized] [unsupervised, sij, set, learning, supervised, similarity, function, maximum, setting, large, ensemble, class, clustering, discriminative]
@InProceedings{Vo_2019_CVPR,
  author = {Vo, Huy V. and Bach, Francis and Cho, Minsu and Han, Kai and LeCun, Yann and Perez, Patrick and Ponce, Jean},
  title = {Unsupervised Image Matching and Object Discovery as Optimization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Wide-Area Crowd Counting via Ground-Plane Density Maps and Multi-View Fusion CNNs
Qi Zhang, Antoni B. Chan


Crowd counting in single-view images has achieved outstanding performance on existing counting datasets. However, single-view counting is not applicable to large and wide scenes (e.g., public parks, long subway platforms, or event spaces) because a single camera cannot capture the whole scene in adequate detail for counting, e.g., when the scene is too large to fit into the field-of-view of the camera, too long so that the resolution is too low on faraway crowds, or when there are too many large objects that occlude large portions of the crowd. Therefore, to solve the wide-area counting task requires multiple cameras with overlapping fields-of-view. In this paper, we propose a deep neural network framework for multi-view crowd counting, which fuses information from multiple camera views to predict a scene-level density map on the ground-plane of the 3D world. We consider 3 versions of the fusion framework: the late fusion model fuses camera-view density map; the naive early fusion model fuses camera-view feature maps; and the multi-view multi-scale early fusion model favors that features aligned to the same ground-plane point have consistent scales. We test our 3 fusion models on 3 multi-view counting datasets, PETS2009, DukeMTMC, and a newly collected multi-view counting dataset containing a crowded street intersection. Our methods achieve state-of-the-art results compared to other multi-view counting baselines.
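The late-fusion variant can be outlined as below (a hedged sketch; the single-view counter, the precomputed ground-plane sampling grids, one per camera, and the small fusion CNN are all placeholders): per-view density maps are warped onto the common ground plane and fused into one scene-level map.

import torch
import torch.nn.functional as F

def late_fusion_density(view_images, single_view_counter, ground_plane_grids, fusion_cnn):
    # view_images: list of (B, 3, H, W) tensors, one per camera;
    # ground_plane_grids: list of (B, Hg, Wg, 2) sampling grids mapping
    # ground-plane locations back into each camera view.
    warped = [F.grid_sample(single_view_counter(img), grid, align_corners=False)
              for img, grid in zip(view_images, ground_plane_grids)]
    return fusion_cnn(torch.cat(warped, dim=1))          # (B, 1, Hg, Wg) scene-level density map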
[fusion, early, people, late, dataset, multiple, video, framework] [camera, scene, computer, projection, view, vision, projected, pattern, city, single, estimate, mvms, international, consistent, plane, multiview, corresponding, scenelevel, point] [image, ieee, conference, ive, proposed, based, reference, method, figure, extracted] [density, scale, selection, normalization, neural, performance, fixed, learnable, conv, network, table, antoni, achieve, better, processing, deep, compared] [model, sum] [counting, crowd, map, feature, detection, count, street, propose, dukemtmc, module, fused, person, pyramid, fusing, extraction] [large, distance, representation, set, training, learning, test, existing]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Qi and Chan, Antoni B.},
  title = {Wide-Area Crowd Counting via Ground-Plane Density Maps and Multi-View Fusion CNNs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions
Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara


Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable and explainable from the exterior. As an image can be described in infinite ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability. Given a control signal in the form of a sequence or set of image regions, we generate the corresponding caption through a recurrent architecture which predicts textual chunks explicitly grounded on regions, following the constraints of the given control. Experiments are conducted on Flickr30k Entities and on COCO Entities, an extended version of COCO in which we add grounding annotations collected in a semi-automatic manner. Results demonstrate that our method achieves state of the art performances on controllable image captioning, in terms of caption quality and diversity. Code and annotations are publicly available at: https://github.com/aimagelab/show-control-and-tell.
[sequence, lstm, state, signal, explicitly, version, multiple, recurrent, work, dependency] [computer, vision, pattern, approach, international, single, corresponding, respect, compute, defined, permutation] [image, conference, control, method, ieee, input, quality, figure, proposed, extracted] [neural, gate, table, output, employ, network, adaptive, number] [visual, captioning, model, caption, controllable, language, noun, sentinel, controllability, attention, word, sentence, vector, natural, generating, generate, chunk, diverse, generation, reward, grounding, textual, grounded, girl, machine, sampled, evaluation, baby, generated, probability, considers, evaluate, mechanism, talk] [region, coco, attentive, score, context, feature] [set, alignment, test, training, learning, distribution, sample]
@InProceedings{Cornia_2019_CVPR,
  author = {Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  title = {Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Towards VQA Models That Can Read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, Marcus Rohrbach


Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models can not read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.
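The copy idea over OCR tokens can be sketched as follows (PyTorch; the fused question/image representation, the OCR token features and the vocabulary classifier are assumed inputs, not the paper's exact modules): fixed-vocabulary logits and one logit per detected OCR token share a single answer space, so the model can either classify into the vocabulary or copy a string it has read.

import torch

def answer_logits(fused_query, vocab_classifier, ocr_feats):
    # fused_query: (B, d); ocr_feats: (B, num_tokens, d); vocab_classifier: nn.Linear(d, |vocab|)
    vocab_logits = vocab_classifier(fused_query)                              # (B, |vocab|)
    copy_logits = torch.bmm(ocr_feats, fused_query.unsqueeze(2)).squeeze(2)   # (B, num_tokens)
    return torch.cat([vocab_logits, copy_logits], dim=1)                      # joint answer space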
[dataset, recognition, predicting, predict, outperforms] [computer, vision, require, note, international, approach, predicts, provide, pattern] [image, conference, based, component, figure, high, ieee, visually] [accuracy, number, validation, neural, fixed, top, network, architecture, computational] [answer, vqa, ocr, text, question, model, visual, textvqa, answering, reading, lorra, vizwiz, reason, reasoning, attention, pythia, read, majority, word, brand, length, common, mechanism, token, copy, random, introduce, vocabulary, language, dhruv, devi, machine, correct, system, natural, ban, marcus] [module, detected, average, occurring, predicted, spatial] [space, learning, set, test, datasets, existing, training, data, open, embedding, embeddings, trained, maximum]
@InProceedings{Singh_2019_CVPR,
  author = {Singh, Amanpreet and Natarajan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Batra, Dhruv and Parikh, Devi and Rohrbach, Marcus},
  title = {Towards VQA Models That Can Read},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Object-Aware Aggregation With Bidirectional Temporal Graph for Video Captioning
Junchao Zhang, Yuxin Peng


Video captioning aims to automatically generate natural language descriptions of video content, and has drawn a lot of attention in recent years. Generating accurate and fine-grained captions requires not only understanding the global content of the video, but also capturing detailed object information. Meanwhile, video representations have a great impact on the quality of the generated captions. Thus, it is important for video captioning to capture salient objects with their detailed temporal dynamics, and to represent them using discriminative spatio-temporal representations. In this paper, we propose a new video captioning approach based on object-aware aggregation with bidirectional temporal graph (OA-BTG), which captures detailed temporal dynamics for salient objects in video, and learns discriminative spatio-temporal representations by performing object-aware local feature aggregation on detected object regions. The main novelties and advantages are: (1) Bidirectional temporal graph: A bidirectional temporal graph is constructed along and reversely along the temporal order, which provides complementary ways to capture the temporal trajectories for each salient object. (2) Object-aware aggregation: Learnable VLAD (Vector of Locally Aggregated Descriptors) models are constructed on object temporal trajectories and the global frame sequence, which perform object-aware aggregation to learn discriminative representations. A hierarchical attention mechanism is also developed to distinguish the different contributions of multiple objects. Experiments on two widely-used datasets demonstrate that our OA-BTG achieves state-of-the-art performance in terms of BLEU@4, METEOR and CIDEr metrics.
[temporal, video, bidirectional, graph, forward, backward, frame, capture, reversely, time, dataset, hidden, extract, seqvlad, construct, capturing, state, watt, microsoft] [local, approach, well, accurate, corresponding] [proposed, based, content, figure, method] [aggregation, learnable, constructed, denotes, fine, order, layer, neural, applied, size, effectiveness] [captioning, attention, generate, cider, msvd, language, generated, model, man, visual, evaluation, meteor, word, mechanism, sentence, encoding, description, woman, natural, generating, decoder, indicates] [object, global, vlad, detailed, salient, feature, region, hierarchical, spatial, distinguish, baseline, propose] [discriminative, learning, learn, cluster, set, training, tao]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Junchao and Peng, Yuxin},
  title = {Object-Aware Aggregation With Bidirectional Temporal Graph for Video Captioning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Progressive Attention Memory Network for Movie Story Question Answering
Junyeong Kim, Minuk Ma, Kyungsu Kim, Sungjin Kim, Chang D. Yoo


This paper proposes the progressive attention memory network (PAMN) for movie story question answering (QA). Movie story QA is challenging compared to VQA in two aspects: (1) pinpointing the temporal parts relevant to answering the question is difficult, as the movies are typically longer than an hour, and (2) it has both video and subtitles, where different questions require different modalities to infer the answer. To overcome these challenges, PAMN involves three main features: (1) a progressive attention mechanism that utilizes cues from both question and answer to progressively prune out irrelevant temporal parts in memory, (2) dynamic modality fusion that adaptively determines the contribution of each modality for answering the current question, and (3) a belief correction answering scheme that successively corrects the prediction score on each candidate answer. Experiments on publicly available benchmark datasets, MovieQA and TVQA, demonstrate that each feature contributes to our movie story QA architecture, PAMN, and improves performance to achieve the state-of-the-art result. Qualitative analysis by visualizing the inference mechanism of PAMN is also provided.
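Of the three features, dynamic modality fusion is the easiest to picture; the sketch below (PyTorch; a simplified stand-in, not the paper's exact formulation) uses a gate computed from the question to decide, per example, how much the video memory and the subtitle memory contribute to the fused representation.

import torch
import torch.nn as nn

class DynamicModalityFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 2), nn.Softmax(dim=1))

    def forward(self, question_emb, video_mem, subtitle_mem):
        w = self.gate(question_emb)                       # (B, 2) modality weights per question
        return w[:, :1] * video_mem + w[:, 1:] * subtitle_mem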
[video, temporal, dynamic, fusion, multiple, recognition, prediction, updated, utilized, observing] [vision, computer, pattern, require, international] [correction, dual, conference, ieee, proposed, image, qualitative] [progressive, network, performance, bilinear, validation, scheme, denotes, accuracy, neural, table, compared, adaptively, tucker, layer, size, stacked, number] [attention, pamn, question, memory, modality, answering, movie, belief, subtitle, answer, story, visual, movieqa, mechanism, multimodal, relevant, candidate, tvqa, infer, step, correct, dmf, contribution, textual, embedded, indicates, pinpointing, successively, corrects, word, reasoning, understanding, observed] [utilizes, benchmark, feature, three, score] [embedding, test, representation, set, paper, learning, difficult, main, large]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Junyeong and Ma, Minuk and Kim, Kyungsu and Kim, Sungjin and Yoo, Chang D.},
  title = {Progressive Attention Memory Network for Movie Story Question Answering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Memory-Attended Recurrent Network for Video Captioning
Wenjie Pei, Jiyuan Zhang, Xiangrong Wang, Lei Ke, Xiaoyong Shen, Yu-Wing Tai


Typical techniques for video captioning follow the encoder-decoder framework, which can only focus on the one source video being processed. A potential disadvantage of such a design is that it cannot capture the multiple visual contexts of a word that appears in more than one relevant video in the training data. To tackle this limitation, we propose the Memory-Attended Recurrent Network (MARN) for video captioning, in which a memory structure is designed to explore the full-spectrum correspondence between a word and its various similar visual contexts across videos in the training data. Thus, our model is able to achieve a more comprehensive understanding of each word and yield higher captioning quality. Furthermore, the built memory structure enables our method to model the compatibility between adjacent words explicitly instead of asking the model to learn it implicitly, as most existing models do. Extensive validation on two real-world datasets demonstrates that our MARN consistently outperforms state-of-the-art methods.
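A minimal sketch of the word-to-visual-context memory idea: for every vocabulary word, accumulate an aggregate of the visual features of the videos in which it appears, which the decoder can later attend to. The mean aggregation and the data layout are assumptions for illustration; the released MARN builds richer per-word memories.

import torch

def build_word_memory(captions, video_feats, vocab_size, feat_dim):
    # captions: list of (video_id, [word_ids]); video_feats: dict video_id -> (feat_dim,) tensor
    memory = torch.zeros(vocab_size, feat_dim)
    counts = torch.zeros(vocab_size, 1)
    for vid, word_ids in captions:
        f = video_feats[vid]
        for w in word_ids:
            memory[w] += f                    # accumulate visual context for this word
            counts[w] += 1
    return memory / counts.clamp(min=1)       # mean visual context per vocabulary word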
[video, recurrent, extract, lstm, capture, multiple, temporal, focus, dataset, hidden, human, framework, sequence, modeling, gru, adjacent, assistant, time] [corresponding, equation] [based, image, figure, proposed, comprehensive] [performance, network, table, neural, designed, best, structure, deep, validation, cnns, constructed, compare] [memory, visual, decoder, word, captioning, model, basis, attention, attended, marn, decoding, candidate, caption, encoder, evaluation, msvd, mechanism, encoding, generate, vocabulary, machine, step, system, relevant, description, cider] [context, hierarchical, feature, enhance, module, including, three, predicted, ablation] [source, embedding, training, learning, auxiliary, compatibility, datasets, loss]
@InProceedings{Pei_2019_CVPR,
  author = {Pei, Wenjie and Zhang, Jiyuan and Wang, Xiangrong and Ke, Lei and Shen, Xiaoyong and Tai, Yu-Wing},
  title = {Memory-Attended Recurrent Network for Video Captioning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Visual Query Answering by Entity-Attribute Graph Matching and Reasoning
Peixi Xiong, Huayi Zhan, Xin Wang, Baivab Sinha, Ying Wu


Visual Query Answering (VQA) is of great significance in offering people convenience: one can raise a question about details of objects, or about high-level understanding of the scene, over an image. This paper proposes a novel method to address the VQA problem. In contrast to prior works, our method, which targets single-scene VQA, relies on graph-based techniques and involves reasoning. In a nutshell, our approach is centered on three graphs. The first graph, referred to as the inference graph G_I, is constructed via learning over labeled data. The other two graphs, referred to as the query graph Q and the entity-attribute graph EAG, are generated from the natural language query NLQ and the image Img, respectively, both issued by users. As EAG often does not contain sufficient information to answer Q, we develop techniques to infer the missing information of EAG with G_I. Based on EAG and Q, we provide techniques to find matches of Q in EAG as the answer to NLQ over Img. Unlike commonly used VQA methods that are based on end-to-end neural networks, our graph-based method shows well-designed reasoning capability, and thus is highly interpretable. We also create a dataset on soccer matches (Soccer-VQA) with rich annotations. The experimental results show that our approach outperforms the state-of-the-art method and has high potential for future investigation.
[graph, dataset, work, hidden, represented] [approach, pattern, matching, computer, field, variable, scene, vision, associated, status, direction, well, problem, note, inferred, defined, provide, match] [image, figure, attribute, based, method, missing, ieee, conference, conditional, incomplete, color] [inference, accuracy, neural, network, table, structured, bayesian, constructed, better, number] [query, visual, question, vqa, gea, answer, entity, answering, node, eag, soccer, obvious, reasoning, infer, team, natural, language, find, attention, qnl, type, defending, understanding] [person, module, object, role, three, edge] [set, hard, data, knowledge, labeled, uniform, function, domain, medium, learning]
@InProceedings{Xiong_2019_CVPR,
  author = {Xiong, Peixi and Zhan, Huayi and Wang, Xin and Sinha, Baivab and Wu, Ying},
  title = {Visual Query Answering by Entity-Attribute Graph Matching and Reasoning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Look Back and Predict Forward in Image Captioning
Yu Qin, Jiajun Du, Yonghua Zhang, Hongtao Lu


Most existing attention-based methods for image captioning focus on the current word and visual information at one time step to generate the next word, without considering visual and linguistic coherence. We propose the Look Back (LB) method to embed visual information from the past and the Predict Forward (PF) approach to look into the future. The LB method introduces the attention value from the previous time step into the current attention generation, matching the visual coherence of human perception. The PF model predicts the next two words in one time step and jointly employs their probabilities for inference. The two approaches are then combined as LBPF to further integrate visual information from the past and linguistic information from the future to improve image captioning performance. All three methods are applied on a classic base decoder and show remarkable improvements on the MSCOCO dataset with small increases in parameter counts. Our LBPF model achieves BLEU-4 / CIDEr / SPICE scores of 37.4 / 116.4 / 21.2 with cross-entropy loss and 38.3 / 127.6 / 22.0 with CIDEr optimization. Our three proposed methods can be easily applied to most attention-based encoder-decoder models for image captioning.
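A minimal sketch of the Look Back idea: the attended context from the previous step is fed back as an extra input to the current attention computation. The layer names, shapes, and additive attention form are assumptions, not the authors' exact LBPF decoder.

import torch
import torch.nn as nn

class LookBackAttention(nn.Module):
    def __init__(self, feat_dim=2048, hid_dim=512):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, hid_dim)
        self.w_hid = nn.Linear(hid_dim, hid_dim)
        self.w_prev = nn.Linear(feat_dim, hid_dim)     # previous step's attended context
        self.v = nn.Linear(hid_dim, 1)

    def forward(self, feats, h, prev_ctx):
        # feats: (B, R, feat_dim) region features, h: (B, hid_dim), prev_ctx: (B, feat_dim)
        e = self.v(torch.tanh(self.w_feat(feats)
                              + self.w_hid(h).unsqueeze(1)
                              + self.w_prev(prev_ctx).unsqueeze(1)))   # (B, R, 1)
        alpha = torch.softmax(e, dim=1)
        ctx = (alpha * feats).sum(dim=1)                # (B, feat_dim); becomes next prev_ctx
        return ctx, alpha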
[predict, time, previous, forward, current, hidden, state, sequence, dataset, accumulated, future, employed] [approach, error] [image, method, proposed, input, result, comparison] [employ, network, best, performance, table, parameter, neural, inference, validation, size, mirror] [attention, model, visual, cider, lbpf, word, caption, generation, generate, language, captioning, mscoco, spice, generated, attt, evaluation, pizza, man, step, embed, fatt, game, att, decoder, vector, probability, remarkable, simply, karpathy, meteor, beam, picture, playing] [module, three, feature, final, score, predicted, detected, combine, baseline, propose] [loss, training, base, logit, embedding, test, learning, trained, set, dog, existing]
@InProceedings{Qin_2019_CVPR,
  author = {Qin, Yu and Du, Jiajun and Zhang, Yonghua and Lu, Hongtao},
  title = {Look Back and Predict Forward in Image Captioning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Explainable and Explicit Visual Reasoning Over Scene Graphs
Jiaxin Shi, Hanwang Zhang, Juanzi Li


We aim to dismantle the prevalent black-box neural architectures used in complex visual reasoning tasks, into the proposed eXplainable and eXplicit Neural Modules (XNMs), which advance beyond existing neural module networks towards using scene graphs --- objects as nodes and the pairwise relationships as edges --- for explainable and explicit reasoning with structured knowledge. XNMs allow us to pay more attention to teach machines how to "think", regardless of what they "look". As we will show in the paper, by using scene graphs as an inductive bias, 1) we can design XNMs in a concise and flexible fashion, i.e., XNMs merely consist of 4 meta-types, which significantly reduce the number of parameters by 10 to 100 times, and 2) we can explicitly trace the reasoning-flow in terms of graph attentions. XNMs are so generic that they support a wide range of scene graph implementations with various qualities. For example, when the graphs are detected perfectly, XNMs achieve 100% accuracy on both CLEVR and CLEVR CoGenT, establishing an empirical performance upper-bound for visual reasoning; when the graphs are noisily detected from real-world images, XNMs are still robust to achieve a competitive 67.5% accuracy on VQAv2.0, surpassing the popular bag-of-objects attention models without graph structures.
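A minimal sketch of two scene-graph meta-modules in the spirit of the XNM meta-types: AttendNode soft-selects nodes matching a query vector, and Transfer moves node attention along attended edges. Shapes and the exact normalization are assumptions for illustration, not the authors' implementation.

import torch

def attend_node(node_feats, query):
    # node_feats: (N, D) scene-graph node features, query: (D,) -> attention over nodes (N,)
    return torch.softmax(node_feats @ query, dim=0)

def transfer(node_attention, edge_attention):
    # edge_attention: (N, N) weights on edges; move attention to attended neighbors
    out = edge_attention.t() @ node_attention
    return out / (out.sum() + 1e-8)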
[graph, dataset, complex, represented] [scene, condition, corresponding, explicit, vision, well, matrix] [figure, attribute, image, input, composite, metal, intermediate, proposed, demonstrate] [neural, filter, accuracy, table, achieve, performance, design, output, deep, number, implementation, weight, represents, validation] [reasoning, visual, attention, xnms, node, program, question, clevr, explainable, model, answer, query, language, find, red, logical, cylinder, parse, totally, vector, stacknmn, brown, nmns, perfect] [module, object, edge, detected, feature, detection, layout, count, roi, attentive, det] [training, set, supervised, large, trained, test, setting, label, existing, learning, generic, novel, specific, embeddings]
@InProceedings{Shi_2019_CVPR,
  author = {Shi, Jiaxin and Zhang, Hanwang and Li, Juanzi},
  title = {Explainable and Explicit Visual Reasoning Over Scene Graphs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Transfer Learning via Unsupervised Task Discovery for Visual Question Answering
Hyeonwoo Noh, Taehoon Kim, Jonghwan Mun, Bohyung Han


We study how to leverage off-the-shelf visual and linguistic data to cope with out-of-vocabulary answers in the visual question answering task. Existing large-scale visual datasets with annotations such as image class labels, bounding boxes and region descriptions are good sources for learning rich and diverse visual concepts. However, it is not straightforward how these visual concepts can be captured and transferred to visual question answering models, due to the missing link between question-dependent answering models and visual data without questions. We tackle this problem in two steps: 1) learning a task conditional visual classifier, which is capable of solving diverse question-specific visual recognition tasks, based on unsupervised task discovery, and 2) transferring the task conditional visual classifier to visual question answering models. Specifically, we employ linguistic knowledge sources such as a structured lexical database (e.g. WordNet) and visual descriptions for unsupervised task discovery, and transfer a learned task conditional visual classifier as an answering unit in a visual question answering model. We empirically show that the proposed algorithm generalizes to out-of-vocabulary answers successfully using the knowledge transferred from the visual dataset.
[dataset, recognition, joint, human, construct, framework] [problem, approach, vision, defined] [conditional, proposed, figure, image, based, transferred, mapping] [group, pretrained, pre, performance, neural, standard, parameter, table, separable, deep, constructed] [visual, vqa, question, word, answer, external, model, specification, answering, diverse, linguistic, wordnet, description, evaluation, type, captioning, language, attention, van, lexical, genome] [feature, discovery, object, bounding, illustrated, region, weakly, box] [task, learning, classifier, data, set, training, knowledge, transfer, unsupervised, learned, distribution, learn, test, exploiting, datasets, main, novel, setting, classification, train, split, trained, pretraining, supervised, embedding, class, paper, generalization]
@InProceedings{Noh_2019_CVPR,
  author = {Noh, Hyeonwoo and Kim, Taehoon and Mun, Jonghwan and Han, Bohyung},
  title = {Transfer Learning via Unsupervised Task Discovery for Visual Question Answering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Intention Oriented Image Captions With Guiding Objects
Yue Zheng, Yali Li, Shengjin Wang


Although existing image caption models can produce promising results using recurrent neural networks (RNNs), it is difficult to guarantee that an object we care about is contained in the generated descriptions, for example when the object is inconspicuous in the image. The problem becomes even harder when these objects did not appear in the training stage. In this paper, we propose a novel approach for generating image captions with guiding objects (CGO). CGO constrains the model to involve a human-concerned object when that object is in the image, ensuring the object appears in the generated description while maintaining fluency. Instead of generating the sequence from left to right, we start the description with a selected object and generate the other parts of the sequence based on this object. To achieve this, we design a novel framework combining two LSTMs in opposite directions. We demonstrate the characteristics of our method on MSCOCO, where we generate descriptions for each detected object in the images. With CGO, we can extend descriptions to objects neglected in image caption labels and provide a set of more comprehensive and diverse descriptions for an image. CGO also shows advantages when applied to the task of describing novel objects. We show experimental results on both the MSCOCO and ImageNet datasets. Evaluations show that our method outperforms the state-of-the-art models on this task with an average F1 of 75.8, leading to better descriptions in terms of both content accuracy and fluency.
[sequence, time, lstm] [computer, ground, truth, vision, pattern, left, approach, international, form] [image, conference, input, ieee, method, figure] [denotes, neural, process, number, table, deep, applied, imagenet, accuracy] [cgo, model, caption, guiding, generated, language, word, generating, generate, captioning, diverse, visual, description, mscoco, generates, partial, appear, meteor, attention, arxiv, preprint, probability, step, subsection, contained, describing, sentence, beam, describe, alexander, fluent, improved] [object, average, detection, score, recall] [novel, selected, training, set, base, trained, embedding, label, test, classifier, existing, task, learning, specific]
@InProceedings{Zheng_2019_CVPR,
  author = {Zheng, Yue and Li, Yali and Wang, Shengjin},
  title = {Intention Oriented Image Captions With Guiding Objects},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Uncertainty Guided Multi-Scale Residual Learning-Using a Cycle Spinning CNN for Single Image De-Raining
Rajeev Yasarla, Vishal M. Patel


Single image de-raining is an extremely challenging problem since the rainy image may contain rain streaks which vary in size, direction and density. Previous approaches have attempted to address this problem by leveraging prior information to remove rain streaks from a single image. One of the major limitations of these approaches is that they do not consider the location information of rain drops in the image. The proposed Uncertainty guided Multi-scale Residual Learning (UMRL) network attempts to address this issue by learning the rain content at different scales and using it to estimate the final de-rained output. In addition, we introduce a technique which guides the network to learn the network weights based on a confidence measure about the estimate. Furthermore, we introduce a new training and testing procedure based on the notion of cycle spinning to improve the final de-raining performance. Extensive experiments on synthetic and real datasets demonstrate that the proposed method achieves significant improvements over recent state-of-the-art methods.
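A minimal sketch of cycle spinning at test time: run the de-raining network on circularly shifted copies of the input, undo each shift, and average the results. The shift set and the `derain_net` callable are placeholders; the paper also uses the idea during training.

import torch

def cycle_spinning(derain_net, image, shifts=((0, 0), (0, 16), (16, 0), (16, 16))):
    # image: (B, C, H, W) rainy input
    outputs = []
    for dy, dx in shifts:
        shifted = torch.roll(image, shifts=(dy, dx), dims=(2, 3))
        derained = derain_net(shifted)
        outputs.append(torch.roll(derained, shifts=(-dy, -dx), dims=(2, 3)))  # undo the shift
    return torch.stack(outputs, dim=0).mean(dim=0)    # average over all spins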
[recognition, recurrent, dataset, work, joint] [confidence, computer, single, corresponding, estimate, vision, pattern, clearly, approach, estimating, inf, international, direction, estimated] [image, rain, umrl, figure, cycle, spinning, method, rainy, proposed, ieee, based, input, conference, remove, streak, synthetic, removal, removing, zhang, content, perceptual, fourth, high, clean, generative, cyclically] [network, residual, performance, deep, convolutional, subsequent, layer, scale, table, size, compared, density, block, neural] [ddn, introduced, introduce] [map, final, improve, guided, location, feature] [loss, uncertainty, learning, base, shifted, address, training, learn, testing, datasets, observe]
@InProceedings{Yasarla_2019_CVPR,
  author = {Yasarla, Rajeev and Patel, Vishal M.},
  title = {Uncertainty Guided Multi-Scale Residual Learning-Using a Cycle Spinning CNN for Single Image De-Raining},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Toward Realistic Image Compositing With Adversarial Learning
Bor-Chun Chen, Andrew Kae


Compositing a realistic image is a challenging task and usually requires considerable human supervision using professional image editing software. In this work we propose a generative adversarial network (GAN) architecture for automatic image compositing. The proposed model consists of four sub-networks: a transformation network that improves the geometric and color consistency of the composite image, a refinement network that polishes the boundary of the composite image, and a pair of discriminator and segmentation networks for adversarial learning. Experimental results on both synthesized and real images show that our model, Geometrically and Color Consistent GANs (GCC-GANs), can automatically generate realistic composite images compared to several state-of-the-art methods, and does not require any manual effort.
[human, learns, work, second, framework, dataset] [geometric, computer, vision, pattern, geometry, column, consistent, geometrically, note, directly, single, well, initial] [image, composite, color, background, compositing, realistic, transformation, figure, conference, generative, ieee, composition, input, manipulation, real, alpha, proposed, synthesized, acm, consistency, appearance, correction, method, described, realism, poisson, blending, harmonization, perceptual, soda] [network, compared, deep, original, better, process, table, automatically, apply, order, size] [model, adversarial, generate, discriminator, generated, transformer, create, fool, plausible] [foreground, object, segmentation, mask, refinement, boundary, spatial, semantic, combine, baseline, annotator, detection] [training, learning, loss, data, select, learn, train, auxiliary, trained]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Bor-Chun and Kae, Andrew},
  title = {Toward Realistic Image Compositing With Adversarial Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Cross-Classification Clustering: An Efficient Multi-Object Tracking Technique for 3-D Instance Segmentation in Connectomics
Yaron Meirovitch, Lu Mi, Hayk Saribekyan, Alexander Matveev, David Rolnick, Nir Shavit


Pixel-accurate tracking of objects is a key element in many computer vision applications, often solved by iterated individual object tracking or instance segmentation followed by object matching. Here we introduce cross-classification clustering (3C), a technique that simultaneously tracks complex, interrelated objects in an image stack. The key idea in cross-classification is to efficiently turn a clustering problem into a classification problem by running a logarithmic number of independent classifications per image, letting the cross-labeling of these classifications uniquely classify each pixel to the object labels. We apply the 3C mechanism to achieve state-of-the-art accuracy in connectomics -- the nanoscale mapping of neural tissue from electron microscopy volumes. Our reconstruction system increases scalability by an order of magnitude over existing single-object tracking methods (such as flood-filling networks). This scalability is important for the deployment of connectomics pipelines, since currently the best performing techniques require computing infrastructures that are beyond the reach of most laboratories. Our algorithm may offer benefits in other domains that require pixel-accurate tracking of multiple objects, such as segmentation of videos and medical imagery.
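A minimal sketch of the cross-classification encoding: object ids are written in binary and each bit becomes an independent per-pixel binary classification, so ceil(log2(N)) classifications jointly recover every id. Network training and agglomeration are omitted; the helper names are hypothetical.

import numpy as np

def ids_to_bit_labels(id_map, num_ids):
    # id_map: (H, W) integer object ids in [0, num_ids)
    n_bits = int(np.ceil(np.log2(max(num_ids, 2))))
    return [(id_map >> b) & 1 for b in range(n_bits)]          # one binary mask per classifier

def bit_predictions_to_ids(bit_maps):
    # bit_maps: list of (H, W) binary predictions, one per classifier
    ids = np.zeros_like(bit_maps[0], dtype=np.int64)
    for b, bits in enumerate(bit_maps):
        ids |= (bits.astype(np.int64) << b)                    # reassemble the id at each pixel
    return ids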
[tracking, dataset, multiple, framework, nature, human] [computer, reconstruction, approach, pattern, vision, problem, volume, algorithm, single, technique, david, simultaneously, analysis] [image, figure, ieee, conference, microscopy, based, pixel, input, imaging, traditional] [neural, number, accuracy, convolutional, network, jeff, order, ratio, efficient, entire, compared, output, original, architecture, deep] [arxiv, preprint, system, machine] [segmentation, object, connectomics, instance, seed, ffn, medical, fcn, border, electron, hierarchical, agglomeration, hanspeter, maskextend, neuronal, detection, moritz, winfried, srinivas, sebastian, viren, toufiq] [classification, learning, clustering, data, large, source, set, label, target, labeled, datasets]
@InProceedings{Meirovitch_2019_CVPR,
  author = {Meirovitch, Yaron and Mi, Lu and Saribekyan, Hayk and Matveev, Alexander and Rolnick, David and Shavit, Nir},
  title = {Cross-Classification Clustering: An Efficient Multi-Object Tracking Technique for 3-D Instance Segmentation in Connectomics},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep ChArUco: Dark ChArUco Marker Pose Estimation
Danying Hu, Daniel DeTone, Tomasz Malisiewicz


ChArUco boards are used for camera calibration, monocular pose estimation, and pose verification in both robotics and augmented reality. Such fiducials are detectable via traditional computer vision methods (as found in OpenCV) in well-lit environments, but classical methods fail when the lighting is poor or when the image undergoes extreme motion blur. We present Deep ChArUco, a real-time pose estimation system which combines two custom deep networks, ChArUcoNet and RefineNet, with the Perspective-n-Point (PnP) algorithm to estimate the marker's 6DoF pose. ChArUcoNet is a two-headed marker-specific convolutional neural network (CNN) which jointly outputs ID-specific classifiers and 2D point locations. The 2D point locations are further refined into subpixel coordinates using RefineNet. Our networks are trained using a combination of auto-labeled videos of the target marker, synthetic subpixel corner data, and extreme data augmentation. We evaluate Deep ChArUco in challenging low-light, high-motion, high-blur scenarios and demonstrate that our approach is superior to a traditional OpenCV-based method for ChArUco marker detection and pose estimation.
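A minimal sketch of the final pose step: once per-corner 2D locations and ids are predicted by the networks, the 6DoF board pose follows from standard PnP. This uses OpenCV's solvePnP; the board corner coordinates, intrinsics K, and distortion coefficients are assumed inputs.

import cv2
import numpy as np

def estimate_board_pose(corner_ids, image_points, board_points_3d, K, dist):
    # image_points: (N, 2) refined sub-pixel corners; corner_ids: their ChArUco ids
    obj = np.float32([board_points_3d[i] for i in corner_ids])   # (N, 3) board-frame points
    img = np.float32(image_points)
    ok, rvec, tvec = cv2.solvePnP(obj, img, K, dist, flags=cv2.SOLVEPNP_ITERATIVE)
    return ok, rvec, tvec    # rotation (Rodrigues vector) and translation of the board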
[motion, dataset, video, custom] [pose, estimation, point, corner, lighting, computer, reprojection, superpoint, error, camera, pattern, single, vision, keypoint, augmented, classical, total, algorithm, good, approach] [charuco, figure, opencv, charuconet, marker, image, synthetic, blur, subpixel, traditional, conference, aruco, chessboard, method, based, pixel, input, fiducial, ieee] [deep, accuracy, network, neural, convolutional, number, output, factor, performance, table, compare, increasing, architecture, applied, shadow] [evaluation, system, unique, evaluate, random] [detection, detector, refinenet, detected, extreme, detect, head, board, object, average, refinement, three, feature, region, location] [training, data, classification, set, test]
@InProceedings{Hu_2019_CVPR,
  author = {Hu, Danying and DeTone, Daniel and Malisiewicz, Tomasz},
  title = {Deep ChArUco: Dark ChArUco Marker Pose Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Pseudo-LiDAR From Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving
Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger


3D object detection is an essential task in autonomous driving. Recent techniques excel with highly accurate detection rates, provided the 3D input data is obtained from precise but expensive LiDAR technology. Approaches based on cheaper monocular or stereo imagery data have, until now, resulted in drastically lower accuracies --- a gap that is commonly attributed to poor image-based depth estimation. However, in this paper we argue that it is not the quality of the data but its representation that accounts for the majority of the difference. Taking the inner workings of convolutional neural networks into consideration, we propose to convert image-based depth maps to pseudo-LiDAR representations --- essentially mimicking the LiDAR signal. With this representation we can apply different existing LiDAR-based detection algorithms. On the popular KITTI benchmark, our approach achieves impressive improvements over the existing state-of-the-art in image-based performance --- raising the detection accuracy of objects within the 30m range from the previous state-of-the-art of 22% to an unprecedented 74%. At the time of submission our algorithm holds the highest entry on the KITTI 3D object detection leaderboard for stereo-image-based approaches.
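A minimal sketch of the pseudo-LiDAR conversion: back-project every pixel of an estimated depth map into a 3D point cloud using the pinhole intrinsics, after which existing LiDAR-based detectors can be applied. This is the standard back-projection the paper builds on, not its full pipeline; the camera-to-LiDAR axis convention change is omitted.

import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    # depth: (H, W) metric depth in the camera frame
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)    # (H*W, 3) pseudo-LiDAR points
    return points[points[:, 2] > 0]                         # drop invalid / zero-depth pixels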
[mono, signal, work, report, key, future, fusion] [depth, lidar, stereo, estimation, kitti, point, monocular, psmn, approach, disparity, avod, autonomous, oint, accurate, view, estimated, additional, apbev, scene, pseudolidar, provide, frustum, pointnet, note, algorithm, sensor, single, dense] [isp, image, pixel, frontal, input, based, comparison, figure] [table, convolutional, validation, performance, accuracy, network, deep, apply, compare, highly, neural, best] [visual] [object, detection, map, iou, car, benchmark, box, bev, bounding, category, easy, moderate, pedestrian, cyclist] [representation, gap, data, training, existing, hard, paper, train]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Yan and Chao, Wei-Lun and Garg, Divyansh and Hariharan, Bharath and Campbell, Mark and Weinberger, Kilian Q.},
  title = {Pseudo-LiDAR From Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Rules of the Road: Predicting Driving Behavior With a Convolutional Model of Semantic Interactions
Joey Hong, Benjamin Sapp, James Philbin


We focus on the problem of predicting future states of entities in complex, real-world driving scenarios. Previous research has approached this problem via low-level signals to predict short time horizons, and has not addressed how to leverage key assets relied upon heavily by industry self-driving systems: (1) large 3D perception efforts which provide highly accurate 3D states of agents with rich attributes, and (2) detailed and accurate semantic maps of the environment (lanes, traffic lights, crosswalks, etc). We present a unified representation which encodes such high-level semantic information in a spatial grid, allowing the use of deep convolutional models to fuse complex scene context. This enables learning entity-entity and entity-environment interactions with simple, feed-forward computations in each timestep within an overall temporal model of an agent's behavior. We propose different ways of modelling the future as a distribution over future states using standard supervised learning. We introduce a novel dataset providing industry-grade rich perception and semantic inputs, and empirically show we can effectively learn fundamentals of driving behavior.
[future, state, prediction, time, work, trajectory, dataset, dynamic, predict, traffic, modeling, driving, behavior, predicting, static, forecasting, industry, temporal, timestep, tracking, planning, motion] [scene, single, additional, problem, sensor, well, autonomous, note, occupancy, point, regress, kitti, light] [method, input, latent, figure, described] [output, gaussian, network, convolutional, deep, standard, tensor, larger, table, better, number, size] [entity, model, perception, multimodal, diverse, sampled, robot, requires] [road, grid, map, semantic, vehicle, regression, spatial, context, including, detection, bounding] [learning, distribution, set, representation, uncertainty, target, data, training, mixture, large, sampling, space]
@InProceedings{Hong_2019_CVPR,
  author = {Hong, Joey and Sapp, Benjamin and Philbin, James},
  title = {Rules of the Road: Predicting Driving Behavior With a Convolutional Model of Semantic Interactions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Metric Learning for Image Registration
Marc Niethammer, Roland Kwitt, Francois-Xavier Vialard


Image registration is a key technique in medical image analysis to estimate deformations between image pairs. A good deformation model is important for high-quality estimates. However, most existing approaches use ad-hoc deformation models chosen for mathematical convenience rather than to capture observed data variation. Recent deep learning approaches learn deformation models directly from data. However, they provide limited control over the spatial regularity of transformations. Instead of learning the entire registration approach, we learn a spatially-adaptive regularizer within a registration model. This allows controlling the desired level of regularity and preserving structural properties of a registration model. For example, diffeomorphic transformations can be attained. Our approach is a radical departure from existing deep learning approaches to image registration by embedding a deep learning model in an optimization-based registration algorithm to parameterize and data-adapt the registration model itself.
[velocity, optical, capture, work, flow, displacement, largest, individual] [registration, omt, local, approach, deformation, optimization, momentum, diffeomorphic, regularity, initial, directly, field, deviation, allows, vsvf, lddmm, ground, good, jacobian, truth, estimated, well, localized, parameterizations, parameterization, estimation] [image, based, figure, transformation, real, control, desired, synthetic, proposed] [standard, regularizer, deep, regularization, gaussian, penalty, fast, optimize, brain, kernel, best, overlap, desirable, convolutional, network, denotes, weight] [model, vector, regular] [global, spatial, regression, cnn, deformable, medical] [learning, metric, large, learn, source, target, training, data, set, trained, function, shared]
@InProceedings{Niethammer_2019_CVPR,
  author = {Niethammer, Marc and Kwitt, Roland and Vialard, Francois-Xavier},
  title = {Metric Learning for Image Registration},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
LO-Net: Deep Real-Time Lidar Odometry
Qing Li, Shaoyang Chen, Cheng Wang, Xin Li, Chenglu Wen, Ming Cheng, Jonathan Li


We present a novel deep convolutional network pipeline, LO-Net, for real-time lidar odometry estimation. Unlike most existing lidar odometry (LO) methods that go through an individually designed feature selection, feature matching, and pose estimation pipeline, LO-Net can be trained in an end-to-end manner. With a new mask-weighted geometric constraint loss, LO-Net can effectively learn feature representations for LO estimation, and can implicitly exploit the sequential dependencies and dynamics in the data. We also design a scan-to-map module, which uses the geometric and semantic information learned in LO-Net, to improve the estimation accuracy. Experiments on benchmark datasets demonstrate that LO-Net outperforms existing learning-based approaches and has accuracy similar to that of the state-of-the-art geometry-based approach, LOAM.
[dataset, trel, consecutive, dynamic, prediction, moving, framework] [lidar, point, odometry, normal, kitti, estimation, pose, scan, international, cloud, ground, icp, vision, geometric, rrel, computer, robotics, ford, matrix, corresponding, relative, range, truth, volume, translational, rotational, autonomous, accurate, camera, approach, initial, loam, velodyne, automation, pattern] [figure, conference, ieee, mapping, method, transformation, based, input, consistency, smooth] [network, deep, better, table, processing, performance, architecture, achieves, fast, number, convolutional] [model, arxiv, visual, evaluation, preprint] [mask, feature, map, average, regression, extraction, module, predicted] [data, learning, set, loss, learned, trained, training]
@InProceedings{Li_2019_CVPR,
  author = {Li, Qing and Chen, Shaoyang and Wang, Cheng and Li, Xin and Wen, Chenglu and Cheng, Ming and Li, Jonathan},
  title = {LO-Net: Deep Real-Time Lidar Odometry},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic Using Weighted Interactions
Rohan Chandra, Uttaran Bhattacharya, Aniket Bera, Dinesh Manocha


We present a new algorithm for predicting the near-term trajectories of road agents in dense traffic videos. Our approach is designed for heterogeneous traffic, where the road agents may correspond to buses, cars, scooters, bicycles, or pedestrians. We model the interactions between different road agents using a novel LSTM-CNN hybrid network for trajectory prediction. In particular, we take into account heterogeneous interactions that implicitly account for the varying shapes, dynamics, and behaviors of different road agents. In addition, we model horizon-based interactions which are used to implicitly model the driving behavior of each road agent. We evaluate the performance of our prediction algorithm, TraPHic, on standard datasets and also introduce a new dense, heterogeneous traffic dataset corresponding to urban Asian videos and agent trajectories. We outperform state-of-the-art methods on dense traffic datasets by 30%.
[traffic, prediction, trajectory, heterogeneous, dataset, horizon, traf, interaction, traphic, predict, ngsim, driver, dynamic, social, hidden, consists, work, velocity, lstms, motion, ego, implicitly, sequence, passed, predicting, heterogeneity] [dense, approach, algorithm, rmse, computer, neighborhood, account, relative, vision, corresponding, error, ground, truth, well, homogeneous, compute, international, autonomous] [prior, based, figure, hybrid, conference, input, ieee] [weighted, network, layer, number, neural, sparse, deep, density, performance, standard, compare, table, reduces] [agent, model, arxiv, include, evaluation, preprint] [road, spatial, car, predicted, average, region] [datasets, learning, neighbor, corresponds, novel]
@InProceedings{Chandra_2019_CVPR,
  author = {Chandra, Rohan and Bhattacharya, Uttaran and Bera, Aniket and Manocha, Dinesh},
  title = {TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic Using Weighted Interactions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
World From Blur
Jiayan Qiu, Xinchao Wang, Stephen J. Maybank, Dacheng Tao


What can we tell from a single motion-blurred image? We show in this paper that a 3D scene can be revealed. Unlike prior methods that focus on producing a deblurred image, we propose to estimate and take advantage of the hidden message of a blurred image, the relative motion trajectory, to restore the 3D scene that collapsed during the exposure process. To this end, we train a deep network that jointly predicts the motion trajectory, the deblurred image, and the depth map, all of which in turn form a collaborative and self-supervised cycle that supervises one another to reproduce the input blurred image, enabling plausible 3D scene reconstruction from a single blurred image. We test the proposed model on several large-scale datasets we constructed based on benchmarks, as well as on real-world blurred images, and show that it yields very encouraging quantitative and qualitative results.
[motion, sequence, video, frame, lstm, focus, work, second, fed, term, jointly, extract, flow, dynamic] [depth, computer, vision, pattern, scene, estimation, single, camera, international, reconstruction, monocular, volume, estimate, relative, stereo, well, coordinate, respect, linear, exposure, ground, approach, denote, nyu, rmse, analysis, form] [blurred, image, conference, ieee, deblurring, deblurred, cycle, input, clean, reference, proposed, blind, blur, figure, reproduce, supervise, collaborative, based, conduct] [network, deep, neural, convolutional, process, architecture, processing, structure, table] [model, visual, arxiv, preprint] [module, three, map, european, propose, branch] [learning, training, datasets, unsupervised, trained, train]
@InProceedings{Qiu_2019_CVPR,
  author = {Qiu, Jiayan and Wang, Xinchao and Maybank, Stephen J. and Tao, Dacheng},
  title = {World From Blur},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Topology Reconstruction of Tree-Like Structure in Images via Structural Similarity Measure and Dominant Set Clustering
Jianyang Xie, Yitian Zhao, Yonghuai Liu, Pan Su, Yifan Zhao, Jun Cheng, Yalin Zheng, Jiang Liu


The reconstruction and analysis of tree-like topological structures in biomedical images is crucial for biologists and surgeons to understand biomedical conditions and plan surgical procedures. The underlying tree-structure topology reveals how different curvilinear components are anatomically connected to each other. Existing automated topology reconstruction methods have great difficulty in identifying the connectivity when two or more curvilinear components cross or bifurcate, due to projection ambiguity, imaging noise and low contrast. In this paper, we propose a novel curvilinear structural similarity measure to guide a dominant-set clustering approach to address this indispensable issue. The novel similarity measure takes into account both intensity and geometric properties in representing the curvilinear structure locally and globally, and groups curvilinear objects at crossover points into different connected branches by dominant-set clustering. The proposed method is applicable to different imaging modalities, and quantitative and qualitative results on retinal vessel, plant root, and neuronal network datasets show that our methodology is capable of advancing the current state-of-the-art techniques.
[structural, graph, dataset, capture] [topology, reconstruction, dominant, approach, local, estimation, vessel, contrast, shape, topological, defined, pattern, diadem, geometric, computer, centerline, crossover, accurate, problem, manual, single, tracing, relative, respect, algorithm, measurement, accii, owa, vision] [method, proposed, ieee, image, pixel, biomedical, imaging, figure, intensity, high, conference] [structure, number, wide, automated, weighted, table, science, root, weight] [tree, rice, identified, indicates, identifying, correctly, node] [curvilinear, retinal, neuronal, connectivity, level, global, junction, plant, branch, three, region, segment, edge, propose, score] [similarity, measure, set, datasets, clustering, novel, distance]
@InProceedings{Xie_2019_CVPR,
  author = {Xie, Jianyang and Zhao, Yitian and Liu, Yonghuai and Su, Pan and Zhao, Yifan and Cheng, Jun and Zheng, Yalin and Liu, Jiang},
  title = {Topology Reconstruction of Tree-Like Structure in Images via Structural Similarity Measure and Dominant Set Clustering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Pyramidal Person Re-IDentification via Multi-Loss Dynamic Training
Feng Zheng, Cheng Deng, Xing Sun, Xinyang Jiang, Xiaowei Guo, Zongqiao Yu, Feiyue Huang, Rongrong Ji


Most existing Re-IDentification (Re-ID) methods are highly dependent on precise bounding boxes that enable images to be aligned with each other. However, in challenging practical scenarios, current detection models often produce inaccurate bounding boxes, which inevitably degrade the performance of existing Re-ID algorithms. In this paper, we propose a novel coarse-to-fine pyramid model that relaxes the need for precise bounding boxes; it not only incorporates local and global information, but also integrates the gradual cues between them. The pyramid model is able to match at different scales and then search for the correct image of the same identity, even when the image pairs are not aligned. In addition, to learn discriminative identity representations, we explore a dynamic training scheme to seamlessly unify two losses and extract the appropriate shared information between them. Experimental results clearly demonstrate that the proposed method achieves state-of-the-art results on three datasets. In particular, our approach exceeds the current best method by 9.5% on the most challenging CUHK03 dataset.
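A minimal sketch of coarse-to-fine pyramidal part pooling: pool every contiguous group of horizontal stripes of the feature map, so that fine local parts, the global image, and all the gradual groupings in between are represented. Stripe count and pooling choice are assumptions, not the authors' exact branch design.

import torch
import torch.nn.functional as F

def pyramidal_features(feat_map, num_stripes=6):
    # feat_map: (B, C, H, W) backbone output
    stripes = F.adaptive_avg_pool2d(feat_map, (num_stripes, 1)).squeeze(-1)   # (B, C, S)
    branches = []
    for level in range(1, num_stripes + 1):            # level = number of stripes per part
        for start in range(0, num_stripes - level + 1):
            branches.append(stripes[:, :, start:start + level].mean(dim=2))   # (B, C)
    return branches                                     # part features, from fine to global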
[dataset, dynamic, current, multiple] [local, matching] [method, proposed, identity, based, image, comparison, figure] [performance, number, deep, network, basic, convolutional, pyramidal, achieve, achieves, best, weight, parameter, layer, neural, science, scheme] [model, appropriate, query, random, vector] [person, pyramid, global, feature, map, branch, bounding, detection, identification, level, pedestrian, challenging, propose, backbone, improve, including, liang, three, pcb, detected, average, mgn] [loss, rank, triplet, learning, training, sampling, set, learn, task, hard, discriminative, embedding, existing, novel, strategy, china, metric, gallery, setting, protocol, shared]
@InProceedings{Zheng_2019_CVPR,
  author = {Zheng, Feng and Deng, Cheng and Sun, Xing and Jiang, Xinyang and Guo, Xiaowei and Yu, Zongqiao and Huang, Feiyue and Ji, Rongrong},
  title = {Pyramidal Person Re-IDentification via Multi-Loss Dynamic Training},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Holistic and Comprehensive Annotation of Clinically Significant Findings on Diverse CT Images: Learning From Radiology Reports and Label Ontology
Ke Yan, Yifan Peng, Veit Sandfort, Mohammadhadi Bagheri, Zhiyong Lu, Ronald M. Summers


In radiologists' routine work, one major task is to read a medical image, e.g., a CT scan, find significant lesions, and describe them in the radiology report. In this paper, we study the lesion description or annotation problem. Given a lesion image, our aim is to predict a comprehensive set of relevant labels, such as the lesion's body part, type, and attributes, which may assist downstream fine-grained diagnosis. To address this task, we first design a deep learning module to extract relevant semantic labels from the radiology reports associated with the lesion images. With the images and text-mined labels, we propose a lesion annotation network (LesaNet) based on a multilabel convolutional neural network (CNN) to learn all labels holistically. Hierarchical relations and mutually exclusive relations between the labels are leveraged to improve the label prediction accuracy. The relations are utilized in a label expansion strategy and a reliable hard example mining algorithm. We also attach a simple score propagation layer on LesaNet to enhance recall and explore implicit relation between labels. Multilabel metric learning is combined with classification to enable interpretable prediction. We evaluated LesaNet on the public DeepLesion dataset, which contains over 32K diverse lesion images. Experiments show that LesaNet can precisely annotate the lesions using an ontology of 171 fine-grained labels with an average AUC of 0.9344.
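A minimal sketch of the label-expansion idea: when a fine-grained label is text-mined as positive, propagate the positive flag to all of its ancestors in the label ontology (e.g. a specific nodule label implies its parent body-part and type labels). The parent-map representation is an assumption for illustration.

def expand_labels(positive_labels, parents):
    # positive_labels: set of label ids; parents: dict label -> list of parent labels
    expanded = set(positive_labels)
    stack = list(positive_labels)
    while stack:
        label = stack.pop()
        for p in parents.get(label, []):
            if p not in expanded:
                expanded.add(p)        # an ancestor of a positive label is also positive
                stack.append(p)
    return expanded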
[predict, propagation, auc, multiple, framework, prediction, yifan, dataset] [body, left, associated, mid] [image, ronald, based, missing, study, patch, figure, ieee, comprehensive] [deep, network, layer, lower, accuracy, convolutional, neural, number] [relevant, example, sentence, find, describe, include, model, node] [lesion, annotation, medical, lesanet, score, radiology, lung, hierarchical, nodule, liver, lymph, predicted, metastasis, semantic, module, recall, deeplesion, chest, expansion, multiscale, hemangioma, zhiyong, improve] [label, test, loss, learning, training, set, multilabel, hard, exclusive, positive, classification, negative, learn, ontology, mining, triplet, spl, embedding, sample, learned, metric, large, data, noisy, rare, retrieval, national]
@InProceedings{Yan_2019_CVPR,
  author = {Yan, Ke and Peng, Yifan and Sandfort, Veit and Bagheri, Mohammadhadi and Lu, Zhiyong and Summers, Ronald M.},
  title = {Holistic and Comprehensive Annotation of Clinically Significant Findings on Diverse CT Images: Learning From Radiology Reports and Label Ontology},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Robust Histopathology Image Analysis: To Label or to Synthesize?
Le Hou, Ayush Agarwal, Dimitris Samaras, Tahsin M. Kurc, Rajarsi R. Gupta, Joel H. Saltz


Detection, segmentation and classification of nuclei are fundamental analysis operations in digital pathology. Existing state-of-the-art approaches demand extensive amounts of supervised training data from pathologists and may still perform poorly on images from unseen tissue types. We propose an unsupervised approach for histopathology image segmentation that synthesizes heterogeneous sets of training image patches of every tissue type. Although our synthetic patches are not always of high quality, we harness the motley crew of generated samples through a generally applicable importance sampling method. This proposed approach, for the first time, re-weighs the training loss over synthetic data so that the ideal (unbiased) generalization loss over the true data distribution is minimized. This enables us to use a random polygon generator to synthesize approximate cellular structures (i.e., nuclear masks) for which no real examples are given in many tissue types, and hence GAN-based methods are not suited. In addition, we propose a hybrid synthesis pipeline that utilizes textures in real histopathology patches and GAN models to tackle heterogeneity in tissue textures. Compared with existing state-of-the-art supervised models, our approach generalizes significantly better on cancer types without training data. Even in cancer types with training data, our approach achieves the same performance without supervision cost. We release code and segmentation results on over 5000 Whole Slide Images (WSI) in The Cancer Genome Atlas (TCGA) repository, a dataset that would be orders of magnitude larger than what is available today.
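A minimal sketch of the importance-weighted training idea: each synthetic sample's loss is scaled by an estimate of the density ratio p_real(x) / p_synthetic(x), so the expected loss approximates the loss under the real data distribution. How the ratio is estimated is left out; the function names are hypothetical.

import torch

def importance_weighted_loss(per_sample_loss, importance_weights):
    # per_sample_loss: (B,) unreduced losses on synthetic patches
    # importance_weights: (B,) estimated density ratios p_real / p_synth
    w = importance_weights / importance_weights.sum()   # normalize within the batch
    return (w * per_sample_loss).sum()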
[dataset, human] [nuclear, initial, approach, ground, truth, analysis, well, robust, defined, ideal] [synthetic, image, real, patch, cancer, nucleus, method, reference, histopathology, synthesis, input, realistic, proposed, texture, synthesize, figure, based, tcga, refiner, extracted, generative, lymphocyte] [convolutional, network, deep, neural, larger, weight, output, performance] [type, adversarial, gan, discriminator, generate, model, evaluation, classified, sampled] [segmentation, tissue, cnn, detection, refined, slide, pathology, mask, segmented, pathologist, foreground, propose] [training, supervised, loss, data, sampling, trained, generalization, set, distribution, unsupervised, universal, learning, randomly, unbiased, sample, classification, existing, datasets, minimize, source, large]
@InProceedings{Hou_2019_CVPR,
  author = {Hou, Le and Agarwal, Ayush and Samaras, Dimitris and Kurc, Tahsin M. and Gupta, Rajarsi R. and Saltz, Joel H.},
  title = {Robust Histopathology Image Analysis: To Label or to Synthesize?},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Data Augmentation Using Learned Transformations for One-Shot Medical Image Segmentation
Amy Zhao, Guha Balakrishnan, Fredo Durand, John V. Guttag, Adrian V. Dalca


Image segmentation is an important task in many medical applications. Methods based on convolutional neural networks attain state-of-the-art accuracy; however, they typically rely on supervised training with large labeled datasets. Labeling medical images requires significant expertise and time, and typical hand-tuned approaches for data augmentation fail to capture the complex variations in such images. We present an automated data augmentation method for synthesizing labeled medical images. We demonstrate our method on the task of segmenting magnetic resonance imaging (MRI) brain scans. Our method requires only a single segmented scan, and leverages other unlabeled scans in a semi-supervised approach. We learn a model of transformations from the images, and use the model along with the labeled example to synthesize additional labeled examples. Each transformation is comprised of a spatial deformation field and an intensity change, enabling the synthesis of complex effects such as variations in anatomy and image acquisition procedures. We show that training a supervised segmenter with these new examples provides significant improvements over state-of-the-art methods for one-shot biomedical image segmentation.
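A minimal sketch of the augmentation step: sample a spatial deformation and an intensity change, apply both to the single labeled atlas image, and warp its label map with the same deformation to obtain a new labeled example. The 2D grid_sample-based warp and the order of the two transforms are assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def synthesize_labeled_example(atlas_img, atlas_seg, flow, intensity_delta):
    # atlas_img: (1, 1, H, W); atlas_seg: (1, 1, H, W) integer labels
    # flow: (1, 2, H, W) displacement field in pixels; intensity_delta: (1, 1, H, W)
    _, _, h, w = atlas_img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).unsqueeze(0).float()       # identity grid (1, 2, H, W)
    coords = base + flow
    coords[:, 0] = 2 * coords[:, 0] / (w - 1) - 1                   # normalize x to [-1, 1]
    coords[:, 1] = 2 * coords[:, 1] / (h - 1) - 1                   # normalize y to [-1, 1]
    grid = coords.permute(0, 2, 3, 1)                               # (1, H, W, 2) for grid_sample
    warped_img = F.grid_sample(atlas_img + intensity_delta, grid, align_corners=True)
    warped_seg = F.grid_sample(atlas_seg.float(), grid, mode="nearest", align_corners=True)
    return warped_img, warped_seg.long()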
[focus, capture, framework] [atlas, registration, volume, international, deformation, computer, pattern, field, single, active, computed, additional, vision] [image, appearance, method, transformation, anatomical, transform, ieee, intensity, conference, synthesize, mri, imaging, synthesized, biomedical, figure, realistic, synthetic, described] [brain, convolutional, neural, network, computing, deep, performance] [model, example, dice, random, sampled, arxiv, preprint, natural, automatic] [segmentation, spatial, medical, semantic, score, segmented, leverage, improvement, fully] [training, labeled, data, learning, augmentation, supervised, unlabeled, train, test, set, learn, target, similarity, loss, learned, large, sample, label, sampling, unsupervised, function]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Amy and Balakrishnan, Guha and Durand, Fredo and Guttag, John V. and Dalca, Adrian V.},
  title = {Data Augmentation Using Learned Transformations for One-Shot Medical Image Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Shifting More Attention to Video Salient Object Detection
Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, Jianbing Shen


The last decade has witnessed a growing interest in video salient object detection (VSOD). However, the research community has long lacked a well-established VSOD dataset representative of real dynamic scenes with high-quality annotations. To address this issue, we elaborately collected a visual-attention-consistent Densely Annotated VSOD (DAVSOD) dataset, which contains 226 videos with 23,938 frames that cover diverse realistic scenes, objects, instances and motions. With corresponding real human eye-fixation data, we obtain precise ground-truths. This is the first work that explicitly emphasizes the challenge of saliency shift, i.e., that the video salient object(s) may dynamically change. To further contribute a complete benchmark to the community, we systematically assess 17 representative VSOD algorithms over seven existing VSOD datasets and our DAVSOD, totaling 84K frames (the largest scale to date). Using three widely-used metrics, we then present a comprehensive and insightful performance analysis. Furthermore, we propose a baseline model equipped with a saliency-shift-aware convLSTM, which can efficiently capture video saliency dynamics through learning human attention-shift behavior. Extensive experiments open up promising future directions for model development and comparison.
[video, vsod, davsod, human, ssav, dynamic, spatiotemporal, dataset, convlstm, frame, jianbing, fixation, previous, motion, visal, static, temporal, fbms, davis, uvsd, vos, pdbm, pdc] [corresponding, provide, explicit, implicit, contrast, geodesic] [ieee, proposed, image, based, traditional, real, comprehensive, figure, eye] [deep, shift, table, performance, dilated, max, neural, better, conv, convolutional, promising, network] [attention, model, visual, indicates, diverse, evaluation, random] [object, salient, saliency, detection, feature, benchmark, segmentation, wenguan, annotated, ali, module, annotation, tcsvt, cvpr, baseline, selective, guided, spatial, fully] [learning, datasets, existing, set, test, representative, mcl, training]
@InProceedings{Fan_2019_CVPR,
  author = {Fan, Deng-Ping and Wang, Wenguan and Cheng, Ming-Ming and Shen, Jianbing},
  title = {Shifting More Attention to Video Salient Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Neural Task Graphs: Generalizing to Unseen Tasks From a Single Video Demonstration
De-An Huang, Suraj Nair, Danfei Xu, Yuke Zhu, Animesh Garg, Li Fei-Fei, Silvio Savarese, Juan Carlos Niebles


Our goal is to generate a policy to complete an unseen task given just a single video demonstration of the task in a given domain. We hypothesize that to successfully generalize to unseen complex tasks from a single video demonstration, it is necessary to explicitly incorporate the compositional structure of the tasks into the model. To this end, we propose Neural Task Graph (NTG) Networks, which use conjugate task graph as the intermediate representation to modularize both the video demonstration and the derived policy. We empirically show NTG achieves inter-task generalization on two complex tasks: Block Stacking in BulletPhysics and Object Collection in AI2-THOR. NTG improves data efficiency with visual input as well as achieve strong generalization without the need for dense hierarchical supervision. We further show that similar performance trends hold when applied to real-world data. We show that NTG can effectively predict task structure on the JIGSAWS surgical dataset and generalize to unseen tasks.
[graph, video, action, state, stacking, planning, gcn, explicitly, surgical, previous, capture, complex, work, challenge, sequence, start] [single, observation, completion, directly] [figure, flat, generator, needle, input, intermediate] [block, structure, neural, order, full, better, efficiency] [visual, ntg, policy, demonstration, execution, node, model, imitation, conjugate, goal, demo, complete, engine, ntp, collection, strong, evaluate, compositionality, agent, tseen, path, pieter, generate, step, observed, potato, yuke] [edge, object, hierarchical, supervision, challenging, ablation, propose] [task, learning, unseen, generalize, classifier, learn, data, representation, novel, generalization, training, embedding, target, sergey, train, set, space, localizer]
@InProceedings{Huang_2019_CVPR,
  author = {Huang, De-An and Nair, Suraj and Xu, Danfei and Zhu, Yuke and Garg, Animesh and Fei-Fei, Li and Savarese, Silvio and Carlos Niebles, Juan},
  title = {Neural Task Graphs: Generalizing to Unseen Tasks From a Single Video Demonstration},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Beyond Tracking: Selecting Memory and Refining Poses for Deep Visual Odometry
Fei Xue, Xin Wang, Shunkai Li, Qiuyuan Wang, Junqiu Wang, Hongbin Zha


Most previous learning-based visual odometry (VO) methods take VO as a pure tracking problem. In contrast, we present a VO framework by incorporating two additional components called Memory and Refining. The Memory component preserves global information by employing an adaptive and efficient selection strategy. The Refining component ameliorates previous results with the contexts stored in the Memory by adopting a spatial-temporal attention mechanism for feature distilling. Experiments on the KITTI and TUM-RGBD benchmark datasets demonstrate that our method outperforms state-of-the-art learning-based methods by a large margin and produces competitive results against classic monocular VO approaches. In particular, our model achieves outstanding performance in challenging scenarios such as texture-less regions and abrupt motions, where classic VO algorithms tend to fail.
[hidden, previous, recurrent, tracking, time, current, classic, trel, temporal, accumulated, consecutive, motion, deepvo, dataset, ntex, sequence, state, framework, sfmlearner, flow, lstm, outperforms, abrupt, long, historical, optical, undeepvo] [monocular, depth, pose, relative, odometry, absolute, rrel, kitti, estimation, error, geometric, camera, view, rotation, direct, stereo] [method, image, component, figure, translation, accumulation] [deep, convolutional, performance, neural, seq, table, scale, denotes, channel, output, efficient] [memory, model, visual, attention, encoder, mechanism] [refining, global, feature, module, spatial, challenging, benchmark] [learning, unsupervised, stored, data]
@InProceedings{Xue_2019_CVPR,
  author = {Xue, Fei and Wang, Xin and Li, Shunkai and Wang, Qiuyuan and Wang, Junqiu and Zha, Hongbin},
  title = {Beyond Tracking: Selecting Memory and Refining Poses for Deep Visual Odometry},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Image Generation From Layout
Bo Zhao, Lili Meng, Weidong Yin, Leonid Sigal


Despite significant recent progress on generative models, controlled generation of images depicting multiple and complex object layouts is still a difficult problem. Among the core challenges are the diversity of appearance a given object may possess and, as a result, the exponentially large set of images consistent with a specified layout. To address these challenges, we propose a novel approach for layout-based image generation; we call it Layout2Im. Given the coarse spatial layout (bounding boxes + object categories), our model can generate a set of realistic images which have the correct objects in the desired locations. The representation of each object is disentangled into a specified/certain part (category) and an unspecified/uncertain part (appearance). The category is encoded using a word embedding and the appearance is distilled into a low-dimensional vector sampled from a normal distribution. Individual object representations are composed together using a convolutional LSTM, to obtain an encoding of the complete layout, and then decoded to an image. Several loss terms are introduced to encourage accurate and diverse generation. The proposed Layout2Im model significantly outperforms the previous state of the art, boosting the best reported inception score by 24.66% and 28.57% on the very challenging COCO-Stuff and Visual Genome datasets, respectively. Extensive experiments also demonstrate our method's ability to generate complex and diverse images with multiple objects.
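As a rough illustration of the disentangled object representation described above, the following PyTorch sketch concatenates a category word embedding with an appearance vector sampled from a normal distribution and paints the result into a coarse layout grid at the object's bounding box. All names, dimensions, and the grid size are hypothetical, not taken from the paper, and the actual model additionally fuses multiple objects with a convolutional LSTM and decodes the result to an image:

import torch
import torch.nn as nn

# Illustrative sizes only; the paper's dimensions may differ.
num_categories, embed_dim, appearance_dim, grid = 171, 64, 32, 8
category_embedding = nn.Embedding(num_categories, embed_dim)

def object_code(category_id):
    cat = category_embedding(torch.tensor([category_id]))  # specified part: category
    app = torch.randn(1, appearance_dim)                    # unspecified part: appearance ~ N(0, I)
    return torch.cat([cat, app], dim=1)

def place_on_layout(code, box):
    # box = (x0, y0, x1, y1) in normalized coordinates; broadcast the code over the box region.
    canvas = torch.zeros(1, code.shape[1], grid, grid)
    x0, y0, x1, y1 = [int(round(v * grid)) for v in box]
    canvas[:, :, y0:max(y1, y0 + 1), x0:max(x1, x0 + 1)] = code[:, :, None, None]
    return canvas

layout_feature = place_on_layout(object_code(3), (0.1, 0.2, 0.5, 0.9))
print(layout_feature.shape)  # torch.Size([1, 96, 8, 8])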
[complex, multiple, hidden] [sky, latexit, scene, ground, estimator, truth, normal, corresponding, approach, consistent] [image, latent, input, figure, real, method, disentangled, proposed, generative, appearance, conditional, desired, based, realistic] [deep, convolutional, accuracy, table] [generated, model, generate, generation, inception, tree, grass, adversarial, diversity, sampled, visual, generating, diverse, genome, limg, giraffe, mage, recognizable, snow, vector, man, decoder, word, ability, zsi, lobj, sheep, encoder, fuser] [object, layout, feature, person, bounding, score, map, category, box, cropped, coarse, instance, coco] [loss, code, classification, set, representation, trained, learning, training, embedding, datasets, sample, train, distribution, difficult]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Bo and Meng, Lili and Yin, Weidong and Sigal, Leonid},
  title = {Image Generation From Layout},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multimodal Explanations by Predicting Counterfactuality in Videos
Atsushi Kanehira, Kentaro Takemoto, Sho Inayoshi, Tatsuya Harada


This study addresses generating counterfactual explanations with multimodal information. Our goal is not only to classify a video into a specific category, but also to provide explanations on why it is not categorized into a specific class with combinations of visual-linguistic information. Requirements that the expected output should satisfy are referred to as counterfactuality in this paper: (1) Compatibility of visual-linguistic explanations, and (2) Positiveness/negativeness for the specific positive/negative class. Exploiting a spatio-temporal region (tube) and an attribute as visual and linguistic explanations respectively, the explanation model is trained to predict the counterfactuality for possible combinations of multimodal information in a post-hoc manner. The optimization problem, which appears during training/inference, can be efficiently solved by inserting a novel neural network layer, namely the maximum subpath layer. We demonstrate the effectiveness of this method by comparison with a baseline on action recognition datasets extended for this task. Moreover, we provide information-theoretical insight into the proposed method.
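The maximum subpath layer mentioned above selects the highest-scoring contiguous spatio-temporal tube. A much-simplified 1-D analogue of that selection, a Kadane-style maximum-subarray scan over per-frame scores, is sketched below; this is purely illustrative and is not the paper's layer, which also handles the spatial dimension and gradient back-propagation:

def max_subpath_1d(scores):
    """Kadane-style scan: best contiguous span of per-frame scores.

    Simplified 1-D analogue of a maximum-subpath computation; the paper's
    layer works on spatio-temporal tubes, which this sketch does not model.
    """
    best_sum, best_span = float("-inf"), (0, 0)
    running, start = 0.0, 0
    for t, s in enumerate(scores):
        if running <= 0:
            running, start = s, t   # start a new span at frame t
        else:
            running += s            # extend the current span
        if running > best_sum:
            best_sum, best_span = running, (start, t)
    return best_sum, best_span

print(max_subpath_1d([-1.0, 2.0, 3.0, -4.0, 2.0]))  # (5.0, (1, 2))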
[prediction, work, dataset, video, time, action, framework] [corresponding, algorithm, predicts] [method, proposed, attribute, input, conditional, quantitative, figure, based] [output, accuracy, element, convolutional, number, efficiently, deep, layer, network, max, applied, neural, ratio, applying, summation, pooling] [explanation, model, cneg, cpos, visual, counterfactuality, mcpos, linguistic, subsection, reason, evaluation, ycpos, counterfactual, classified, considered, calculated, path, concept, multimodal, expected, system, negativeness, generating, subpath] [region, category, propose, score, baseline, feature, assigned] [class, negative, specific, target, classifier, learning, olympic, sample, set, classification, pair, function, maximum, datasets, positive, train, loss, novel, existing, training, compatible, mutual, exploiting]
@InProceedings{Kanehira_2019_CVPR,
  author = {Kanehira, Atsushi and Takemoto, Kentaro and Inayoshi, Sho and Harada, Tatsuya},
  title = {Multimodal Explanations by Predicting Counterfactuality in Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Explain With Complemental Examples
Atsushi Kanehira, Tatsuya Harada


This paper addresses the generation of explanations with visual examples. Given an input sample, we build a system that not only classifies it to a specific category, but also outputs linguistic explanations and a set of visual examples that render the decision interpretable. Focusing especially on the complementarity of the multimodal information, i.e., linguistic and visual examples, we attempt to achieve it by maximizing the interaction information, which provides a natural definition of complementarity from an information theoretical viewpoint. We propose a novel framework to generate complemental explanations, on which the joint distribution of the variables to explain, and those to be explained is parameterized by three different neural networks: predictor, linguistic explainer, and example selector. Explanation models are trained collaboratively to maximize the interaction information to ensure the generated explanation are complemental to each other for the target. The results of experiments conducted on several datasets demonstrate the effectiveness of the proposed method.
[selector, dataset, interaction, work, framework, considering, joint, represented, utilized, prediction] [variable, optimization, bound] [attribute, method, proposed, input, figure, consistency, image, fidelity] [output, accuracy, network, neural, number, lower, process, optimized, deep, weight, performed, element] [explanation, linguistic, visual, reasoner, complemental, model, aadb, explainer, subsection, vector, probability, type, example, generated, sampled, system, explain, generating, representing, random, reason, expectation, decision, multimodal, generate, consider, generates, variational, reparameterization, correct, machine] [category, three, assigned, baseline] [target, set, sample, distribution, predictor, cub, function, trained, sampling, discriminative, objective, selected, softmax, learning, complementarity, auxiliary, novel, conducted, class, categorical, specific, conclusion]
@InProceedings{Kanehira_2019_CVPR,
  author = {Kanehira, Atsushi and Harada, Tatsuya},
  title = {Learning to Explain With Complemental Examples},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
HAQ: Hardware-Aware Automated Quantization With Mixed Precision
Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, Song Han


Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency, which raises a great challenge to find the optimal bitwidth for each layer: it requires domain experts to explore the vast design space trading off among accuracy, latency, energy, and model size, which is both time-consuming and sub-optimal. There is plenty of specialized hardware for neural networks, but little research has been done on specializing neural network optimization for a particular hardware architecture. Conventional quantization algorithms ignore the different hardware architectures and quantize all the layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework which leverages reinforcement learning to automatically determine the quantization policy, and we take the hardware accelerator's feedback in the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback signals (latency and energy) to the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures. Our framework effectively reduced the latency by 1.4-1.95x and the energy consumption by 1.9x with negligible loss of accuracy compared with the fixed bitwidth (8 bits) quantization. Our framework reveals that the optimal policies on different hardware architectures (i.e., edge and cloud architectures) under different resource constraints (i.e., latency, energy and model size) are drastically different. We interpreted the implication of different quantization policies, which offer insights for both neural network architecture design and hardware architecture design.
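To make the search loop concrete, here is a heavily simplified Python sketch of the idea: explore per-layer bitwidth assignments under a resource budget and keep the best-scoring one. The real framework uses a DDPG reinforcement-learning agent, fine-tunes the quantized network to obtain its reward, and queries a hardware simulator for latency/energy; the random search, the toy cost model, and every number below are stand-ins introduced for this sketch:

import random

# Hypothetical per-layer multiply-accumulate counts; the real framework queries
# a hardware simulator for latency/energy instead of this toy cost model.
layer_macs = [1.2e8, 3.4e8, 2.1e8, 0.9e8]
latency_budget = 4.0e8  # arbitrary units

def toy_latency(bitwidths):
    # Crude proxy: cost scales with bits; HAQ uses direct simulator feedback.
    return sum(m * b / 8.0 for m, b in zip(layer_macs, bitwidths))

def toy_accuracy(bitwidths):
    # Stand-in for the accuracy of the finetuned quantized model: more bits -> higher score.
    return sum(bitwidths) - 0.1 * max(bitwidths)

best = None
for _ in range(200):  # random search stands in for the paper's RL agent
    policy = [random.randint(1, 8) for _ in layer_macs]
    if toy_latency(policy) > latency_budget:
        continue  # the paper instead decreases bitwidths until the constraint holds
    score = toy_accuracy(policy)
    if best is None or score > best[0]:
        best = (score, policy)

print("best bitwidth policy under budget:", best)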
[framework, action, determine, multiple, temporal, time, hidden] [cloud, optimal, han, direct, directly] [figure, feedback, mixed, proposed] [quantization, hardware, neural, network, bitwidth, layer, deep, latency, architecture, depthwise, flexible, convolution, accuracy, number, efficient, design, energy, bit, search, pact, table, size, quantized, computation, specialized, automated, precision, fixed, performance, compared, bitfusion, accelerator, song, quantize, better, bismo, reduce, original, pointwise, resource, low, inference, fewer, compression, convolutional, consumption, mobilenets, mobile, best, batch, bitwidths, activation, andrew, explore] [model, policy, agent, memory, reinforcement] [edge, spatial, assigned] [learning, space, proxy, conventional, loss, domain]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Kuan and Liu, Zhijian and Lin, Yujun and Lin, Ji and Han, Song},
  title = {HAQ: Hardware-Aware Automated Quantization With Mixed Precision},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Content Authentication for Neural Imaging Pipelines: End-To-End Optimization of Photo Provenance in Complex Distribution Channels
Pawel Korus, Nasir Memon


Forensic analysis of digital photo provenance relies on intrinsic traces left in the photograph at the time of its acquisition. Such analysis becomes unreliable after heavy post-processing, such as down-sampling and re-compression applied upon distribution on the Web. This paper explores end-to-end optimization of the entire image acquisition and distribution workflow to facilitate reliable forensic analysis at the end of the distribution channel. We demonstrate that neural imaging pipelines can be trained to replace the internals of digital cameras, and jointly optimized for high-fidelity photo development and reliable provenance analysis. In our experiments, the proposed approach increased image manipulation detection accuracy from 45% to over 90%. The findings encourage further research towards building more reliable imaging pipelines with explicit provenance-guaranteeing properties.
[joint, complex, work] [analysis, optimization, pipeline, camera, rgb, reliable, sensor, approach] [image, imaging, fan, photo, digital, forensic, color, manipulation, forensics, unet, ieee, acquisition, jpeg, demosaicing, djpeg, nikon, inet, provenance, adoption, raw, figure, developed, denoising, content, psnr, rggb, jpg, facilitate, native, rounding, quality, patch, libjpeg, nef, canon, proposed, successive] [neural, network, accuracy, entire, channel, standard, deep, compression, convolutional, validation, better, processing, layer, optimize, design, table, typical, gaussian, output] [model, nip, eos, implemented, example, considered, develop, multimedia, visual, history] [detection] [distribution, learning, trained, training, classification, authentication]
@InProceedings{Korus_2019_CVPR,
  author = {Korus, Pawel and Memon, Nasir},
  title = {Content Authentication for Neural Imaging Pipelines: End-To-End Optimization of Photo Provenance in Complex Distribution Channels},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Inverse Procedural Modeling of Knitwear
Elena Trunz, Sebastian Merzbach, Jonathan Klein, Thomas Schulze, Michael Weinmann, Reinhard Klein


The analysis and modeling of cloth has received a lot of attention in recent years. While recent approaches are focused on woven cloth, we present a novel practical approach for the inference of more complex knitwear structures as well as the respective knitting instructions from only a single image without attached annotations. Knitwear is produced by repeating instances of the same pattern, consisting of grid-like arrangements of a small set of basic stitch types. Our framework addresses the identification and localization of the occurring stitch types, which is challenging due to huge appearance variations. The resulting coarsely localized stitch types are used to infer the underlying grid structure as well as for the extraction of the knitting instruction of pattern repeats, taking into account principles of Gestalt theory. Finally, the derived instructions allow the reproduction of the knitting structures, either as renderings or by actual knitting, as demonstrated in several examples.
[individual, modeling, determine, framework] [stitch, knitting, pattern, template, procedural, matching, corresponding, inverse, respective, underlying, point, computer, approach, optimal, well, single, error, repeat, account, knitted, position, woven, knit, knitwear, repeating, allow, robust, minimal, vision, derived, derivation, yarn, purl, optimization, compute, symmetry, technique, additional, problem] [image, input, based, ieee, appearance, acm, user, figure, conference, correction] [size, structure, number, inference, order, width, basic, search, best, law] [type, step, correct, find, infer] [grid, localization, height, region, clothing, center, including] [likelihood, set, similarity, distance]
@InProceedings{Trunz_2019_CVPR,
  author = {Trunz, Elena and Merzbach, Sebastian and Klein, Jonathan and Schulze, Thomas and Weinmann, Michael and Klein, Reinhard},
  title = {Inverse Procedural Modeling of Knitwear},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video
Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic


In this paper, we introduce a method to automatically reconstruct the 3D motion of a person interacting with an object from a single RGB video. Our method estimates the 3D poses of the person and the object, contact positions, and forces and torques actuated by the human limbs. The main contributions of this work are three-fold. First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of their interactions. This is cast as a large-scale trajectory optimization problem. Second, we develop a method to automatically recognize from the input video the position and timing of contacts between the person and the object or the ground, thereby significantly simplifying the complexity of the optimization. Third, we validate our approach on a recent MoCap dataset with ground truth contact forces and demonstrate its performance on a new dataset of Internet videos showing people manipulating a variety of tools in unconstrained environments.
[human, joint, motion, video, trajectory, force, dataset, modeling, recognition, work, torque, people, state, sequence, sole, frame, parkour, instructional, hammer, interaction, handtool, multiple, manually, term] [contact, pose, estimation, position, body, point, ground, estimated, optimization, problem, single, approach, estimate, case, endpoint, estimating, truth, robotics, rigid, manipulated, optimal, linear, rgb, constraint, relative, error, note, tool] [input, image, method, control, manipulating, figure, manipulation, mapping, recovered] [configuration, table, output, deep] [model, vector, sum, consider, system] [object, person, mask, including, challenging, spatial, annotated] [data, learning]
@InProceedings{Li_2019_CVPR,
  author = {Li, Zongmian and Sedlar, Jiri and Carpentier, Justin and Laptev, Ivan and Mansard, Nicolas and Sivic, Josef},
  title = {Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DeepMapping: Unsupervised Map Estimation From Multiple Point Clouds
Li Ding, Chen Feng


We propose DeepMapping, a novel registration framework using deep neural networks (DNNs) as auxiliary functions to align multiple point clouds from scratch to a globally consistent frame. We use DNNs to model the highly non-convex mapping process that traditionally involves hand-crafted data association, sensor pose initialization, and global refinement. Our key novelty is that "training" these DNNs with properly defined unsupervised losses is equivalent to solving the underlying registration problem, but less sensitive to good initialization than ICP. Our framework contains two DNNs: a localization network that estimates the poses for input point clouds, and a map network that models the scene structure by estimating the occupancy status of global coordinates. This allows us to convert the registration problem to a binary occupancy classification, which can be solved efficiently using gradient-based optimization. We further show that DeepMapping can be readily extended to address the problem of Lidar SLAM by imposing geometric constraints between consecutive point clouds. Experiments are conducted on both simulated and real datasets. Qualitative and quantitative comparisons demonstrate that DeepMapping often enables more robust and accurate global registration of multiple point clouds than existing techniques. Our code is available at https://ai4ce.github.io/DeepMapping/.
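The core of the formulation, converting registration into a binary occupancy classification problem trained with an unsupervised BCE loss, can be sketched in a toy 2-D form as follows. This is a PyTorch illustration only: the network sizes, the crude free-space sampling, the three-scan setup, and the absence of the paper's geometric Lidar-SLAM constraints are all simplifications of what DeepMapping actually does:

import torch
import torch.nn as nn

# Toy 2-D sketch (not the authors' architecture): a pose per scan and an occupancy
# MLP are optimized jointly with a BCE loss, labelling transformed scan points as
# occupied and points sampled toward the sensor as free space.
occupancy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
poses = nn.Parameter(torch.zeros(3, 3))  # (tx, ty, theta) for each of 3 scans
bce = nn.BCEWithLogitsLoss()

def transform(points, pose):
    tx, ty, th = pose
    rot = torch.stack([torch.stack([torch.cos(th), -torch.sin(th)]),
                       torch.stack([torch.sin(th), torch.cos(th)])])
    return points @ rot.T + torch.stack([tx, ty])

scans = [torch.rand(100, 2) for _ in range(3)]   # toy local point clouds
opt = torch.optim.Adam([poses, *occupancy.parameters()], lr=1e-3)
for _ in range(10):
    loss = 0.0
    for scan, pose in zip(scans, poses):
        world = transform(scan, pose)
        free = 0.5 * world                        # crude free-space samples toward the origin
        logits = occupancy(torch.cat([world, free]))
        labels = torch.cat([torch.ones(len(world), 1), torch.zeros(len(free), 1)])
        loss = loss + bce(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()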
[multiple, dataset, trajectory, temporal] [point, registration, deepmapping, occupancy, vision, cloud, sensor, pattern, pose, direct, camera, local, optimization, lidar, chamfer, icp, unoccupied, geometric, depth, laser, problem, defined, corresponding, ground, ate, scene, slam, robust, coordinate, estimated, truth, active, deepm, multiway, volume, globally] [ieee, figure, input, simulated, captured, image, method] [network, dnns, deep, binary, neural, best] [model, environment, observed, sampled] [global, map, feature, localization, semantic, spatial, bce, average, propose] [distance, loss, learning, unsupervised, function, set, training, data, viewed, cross, sample, pairwise, space, label, auxiliary, supervised]
@InProceedings{Ding_2019_CVPR,
  author = {Ding, Li and Feng, Chen},
  title = {DeepMapping: Unsupervised Map Estimation From Multiple Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
End-To-End Interpretable Neural Motion Planner
Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, Raquel Urtasun


In this paper, we propose a neural motion planner for learning to drive autonomously in complex urban scenarios that include traffic-light handling, yielding, and interactions with multiple road-users. Towards this goal, we design a holistic model that takes as input raw LIDAR data and a HD map and produces interpretable intermediate representations in the form of 3D detections and their future trajectories, as well as a cost volume defining the goodness of each position that the self-driving car can take within the planning horizon. We then sample a set of diverse physically possible trajectories and choose the one with the minimum learned cost. Importantly, our cost volume is able to naturally capture multi-modality. We demonstrate the effectiveness of our approach in real-world driving data captured in several cities in North America. Our experiments show that the learned cost volume can generate safer planning than all the baselines.
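The trajectory-selection step can be illustrated with a small NumPy sketch: score each sampled trajectory by the sum of the learned cost-volume values it passes through and keep the cheapest one. The cost volume here is random, the straight-line sampler is hypothetical, and the paper instead samples physically feasible trajectories and learns the cost volume end-to-end with the detection and forecasting heads:

import numpy as np

# Stand-in cost volume over (time, space); in the paper it is produced by the network.
T, H, W = 10, 50, 50
cost_volume = np.random.rand(T, H, W)

def sample_trajectory():
    # Hypothetical sampler: a straight line with a random lateral offset.
    col = np.random.randint(0, W)
    rows = np.linspace(H - 1, 0, T).astype(int)
    return list(zip(rows, [col] * T))

def trajectory_cost(traj):
    return sum(cost_volume[t, r, c] for t, (r, c) in enumerate(traj))

candidates = [sample_trajectory() for _ in range(100)]
best = min(candidates, key=trajectory_cost)
print("min-cost trajectory cost:", trajectory_cost(best))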
[trajectory, motion, planning, driving, forecasting, future, planner, clothoid, time, traffic, velocity, timesteps, steering, prediction, collision, human, lane, follow, wenjie, drive] [volume, sdv, lidar, approach, point, note, autonomous, well, form, position, angle, international, good, optimization] [input, conference, ieee, raw, intermediate, control, real] [cost, neural, output, network, designed, size, filter, deep, represents, lower] [model, interpretable, perception, path, arxiv, preprint, interpretability, imitation] [map, detection, vehicle, regression, final, urban, anchor, propose, utilize, backbone, bounding, location, curve, raquel] [learning, loss, data, sample, learned, negative, set, distance, sampling, minimum, uncertainty, learn, classification, metric]
@InProceedings{Zeng_2019_CVPR,
  author = {Zeng, Wenyuan and Luo, Wenjie and Suo, Simon and Sadat, Abbas and Yang, Bin and Casas, Sergio and Urtasun, Raquel},
  title = {End-To-End Interpretable Neural Motion Planner},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Divergence Triangle for Joint Training of Generator Model, Energy-Based Model, and Inferential Model
Tian Han, Erik Nijkamp, Xiaolin Fang, Mitch Hill, Song-Chun Zhu, Ying Nian Wu


This paper proposes the divergence triangle as a framework for joint training of a generator model, energy-based model and inference model. The divergence triangle is a compact and symmetric (anti-symmetric) objective function that seamlessly integrates variational learning, adversarial learning, wake-sleep algorithm, and contrastive divergence in a unified probabilistic formulation. This unification makes the processes of sampling, inference, and energy evaluation readily available without the need for costly Markov chain Monte Carlo methods. Our experiments demonstrate that the divergence triangle is capable of learning (1) an energy-based model with well-formed energy landscape, (2) direct sampling in the form of a generator network, and (3) feed-forward inference that faithfully reconstructs observed as well as synthesized data.
[joint, current, framework, gibbs, version, jointly] [triangle, algorithm, international, explicit, reconstruction, langevin, computer, monte, well, respect] [generator, figure, generative, latent, conference, celeba, image, based, method, mapping, synthesized] [inference, energy, neural, deep, gradient, network, convolutional, processing, compact, approximated] [model, adversarial, variational, arxiv, preprint, generated, machine, observed, probability, true, requires, expectation, gan, alice, inception, generation] [three, ali] [learning, divergence, training, min, data, function, distribution, maximum, likelihood, learned, qdata, contrastive, mcmc, sampling, defines, objective, vae, surrogate, test, probabilistic, learn, posterior, sample, seek, minimizing, close, symmetric, update, unsupervised, trained, energybased]
@InProceedings{Han_2019_CVPR,
  author = {Han, Tian and Nijkamp, Erik and Fang, Xiaolin and Hill, Mitch and Zhu, Song-Chun and Nian Wu, Ying},
  title = {Divergence Triangle for Joint Training of Generator Model, Energy-Based Model, and Inferential Model},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Image Deformation Meta-Networks for One-Shot Learning
Zitian Chen, Yanwei Fu, Yu-Xiong Wang, Lin Ma, Wei Liu, Martial Hebert


Humans can robustly learn novel visual concepts even when images undergo various deformations and lose certain information. Mimicking the same behavior and synthesizing deformed instances of new concepts may help visual recognition systems perform better one-shot learning, i.e., learning concepts from one or few examples. Our key insight is that, while the deformed images may not be visually realistic, they still maintain critical semantic information and contribute significantly to formulating classifier decision boundaries. Inspired by the recent progress of meta-learning, we combine a meta-learner with an image deformation sub-network that produces additional training examples, and optimize both models in an end-to-end manner. The deformation sub-network learns to deform images by fusing a pair of images: a probe image that keeps the visual content and a gallery image that diversifies the deformations. We demonstrate results on the widely used one-shot learning benchmarks (miniImageNet and ImageNet 1K Challenge datasets), which significantly outperform state-of-the-art approaches.
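The probe-gallery fusion can be pictured with a small NumPy sketch that blends corresponding grid patches of two images with per-patch weights. In the paper those weights are predicted by the deformation sub-network and trained end-to-end with the meta-learner; here they are random, and the grid size and image shapes are arbitrary choices for this illustration:

import numpy as np

def deform(probe, gallery, grid=3, seed=0):
    # Blend probe and gallery patch-by-patch; weights stand in for learned fusion weights.
    assert probe.shape == gallery.shape
    rng = np.random.default_rng(seed)
    h, w = probe.shape[:2]
    out = probe.astype(float).copy()
    ph, pw = h // grid, w // grid
    for i in range(grid):
        for j in range(grid):
            wgt = rng.uniform(0.0, 1.0)
            ys, xs = slice(i * ph, (i + 1) * ph), slice(j * pw, (j + 1) * pw)
            out[ys, xs] = wgt * probe[ys, xs] + (1.0 - wgt) * gallery[ys, xs]
    return out

probe = np.zeros((84, 84, 3))    # keeps the visual content
gallery = np.ones((84, 84, 3))   # diversifies the deformation
print(deform(probe, gallery).mean())  # a value between 0 and 1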
[challenge, work, recognition, learns, key] [deformation, additional, approach, matching, directly, note, augmented, deform] [image, figure, synthesized, noise, method, visually, produce, patch, augment, synthesize] [network, performance, imagenet, gaussian, accuracy, table, deep, entire, neural, number, best, achieves, weight] [visual, sampled, model, femb, generate, introduce, generated] [feature, subnetwork, relation] [learning, deformed, training, gallery, set, embedding, classifier, novel, probe, prototype, sample, loss, base, randomly, train, class, dbase, support, igallery, data, learn, classification, yprobe, division, trained, combination, dnovel, prototypical, iprobe, naug, update, miniimagenet, large, labeled, effectively, augmentation]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Zitian and Fu, Yanwei and Wang, Yu-Xiong and Ma, Lin and Liu, Wei and Hebert, Martial},
  title = {Image Deformation Meta-Networks for One-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Online High Rank Matrix Completion
Jicong Fan, Madeleine Udell


Recent advances in matrix completion enable data imputation in full-rank matrices by exploiting low dimensional (nonlinear) latent structure. In this paper, we develop a new model for high rank matrix completion (HRMC), together with batch and online methods to fit the model and an out-of-sample extension to complete new data. The method works by (implicitly) mapping the data into a high dimensional polynomial feature space using the kernel trick; importantly, the data occupies a low dimensional subspace in this feature space, even when the original data matrix is of full rank. The online method can handle streaming or sequential data and adapt to non-stationary latent structure, and enjoys much lower space and time complexity than previous methods for HRMC. For example, the time complexity is reduced from O(n^3) to O(r^3), where n is the number of data points, r is the matrix rank in the feature space, and r ≪ n.
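The premise that full-rank data can become rank-deficient after a polynomial feature map can be checked with a tiny NumPy example. This is illustrative only; it is not the HRMC algorithm itself, which works implicitly through the kernel trick and handles missing entries:

import numpy as np

# Points on a quadratic curve span all of R^2 (full-rank data matrix), but after an
# explicit degree-2 polynomial feature map they lie in a lower-dimensional subspace.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=200)
X = np.stack([x1, x1 ** 2])                      # 2 x 200 data matrix, rank 2

def poly2_features(X):
    a, b = X
    return np.stack([np.ones_like(a), a, b, a * a, a * b, b * b])  # 6 x 200

print(np.linalg.matrix_rank(X))                  # 2
print(np.linalg.matrix_rank(poly2_features(X)))  # < 6, since b - a*a = 0 identically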
[online, time, motion, subject, consists] [matrix, completion, polynomial, kfmc, lemma, algorithm, compute, vmc, lrmc, kxd, optimization, define, kdd, hrmc, problem, xij, nlmc, generically, supplementary, error, pattern, international, solve, rmn, kxx] [high, method, missing, proposed, recover, incomplete, figure, ieee, conference, nonlinear, recovery, drawn, latent, synthetic, recovered] [kernel, low, complexity, denotes, sparse, offline, factorization, computational, lower, efficient, cost, rate, gradient, number, full] [rbf, model, complete, probability, consider, provided, write, generate, machine] [feature, propose, union] [rank, data, update, minimize, space, subspace, sampling, clustering, extension, sample, function, learning, randomly, large, suppose, mij]
@InProceedings{Fan_2019_CVPR,
  author = {Fan, Jicong and Udell, Madeleine},
  title = {Online High Rank Matrix Completion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multispectral Imaging for Fine-Grained Recognition of Powders on Complex Backgrounds
Tiancheng Zhi, Bernardo R. Pires, Martial Hebert, Srinivasa G. Narasimhan


Hundreds of materials, such as drugs, explosives, makeup, food additives, are in the form of powder. Recognizing such powders is important for security checks, criminal identification, drug control, and quality assessment. However, powder recognition has drawn little attention in the computer vision community. Powders are hard to distinguish: they are amorphous, appear matte, have little color or texture variation and blend with surfaces they are deposited on in complex ways. To address these challenges, we present the first comprehensive dataset and approach for powder recognition using multi-spectral imaging. By using Shortwave Infrared (SWIR) multi-spectral imaging together with visible light (RGB) and Near Infrared (NIR), powders can be discriminated with reasonable accuracy. We present a method to select discriminative spectral bands to significantly reduce acquisition time while improving recognition accuracy. We propose a blending model to synthesize images of powders of various thickness deposited on a wide range of surfaces. Incorporating band selection and image synthesis, we conduct fine-grained recognition of 100 powders on complex backgrounds, and achieve 60%-70% accuracy on recognition with known powder location, and over 40% mean IoU without known location.
[recognition, dataset, time] [computer, vision, rgb, truth, light, ground, pattern, material, thickness, rendered, camera, equation, compute, render, supplementary, algorithm, blend, approach] [powder, swir, band, background, image, blending, figure, thick, thin, ieee, conference, nncv, method, patch, alpha, hyperspectral, inpainting, spectral, pixel, imaging, real, acquisition, rgbn, intensity, rough, based, shading, nir, remote, mvpca, geoscience, captured, imaged, transmittance, synthetic, infrared, row] [selection, table, accuracy, number] [model, common, include, white] [segmentation, mask, detection, iou, location, semantic, grid] [set, data, nearest, neighbor, selected, class, training, hard]
@InProceedings{Zhi_2019_CVPR,
  author = {Zhi, Tiancheng and Pires, Bernardo R. and Hebert, Martial and Narasimhan, Srinivasa G.},
  title = {Multispectral Imaging for Fine-Grained Recognition of Powders on Complex Backgrounds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ContactDB: Analyzing and Predicting Grasp Contact via Thermal Imaging
Samarth Brahmbhatt, Cusuh Ham, Charles C. Kemp, James Hays


Grasping and manipulating objects is an important human skill. Since hand-object contact is fundamental to grasping, capturing it can lead to important insights. However, observing contact through external sensors is challenging because of occlusion and the complexity of the human hand. We present ContactDB, a novel dataset of contact maps for household objects that captures the rich hand-object contact that occurs during grasping, enabled by use of a thermal camera. Participants in our study grasped 3D printed objects with a post-grasp functional intent. ContactDB includes 3750 3D meshes of 50 household objects textured with contact maps and 375K frames of synchronized RGB-D+thermal images. To the best of our knowledge, this is the first large-scale dataset that records detailed contact maps for human grasps. Analysis of this data shows the influence of functional intent and object size on grasping, the tendency to touch/avoid 'active areas', and the high frequency of palm and proximal finger contact. Finally, we train state-of-the-art image translation and 3D convolution algorithms to predict diverse contact patterns from object shape. Data, code and models are available at https://contactdb.cc.gatech.edu.
[human, multiple, dataset, predict, prediction, work, influence, capture, tactile, predicting, previous, recording] [contact, hand, thermal, grasp, grasping, functional, computer, pointnet, pose, international, robotics, shape, vision, pattern, voxnet, analysis, camera, point, single, heat, turntable, contactdb, household, depth, bimanual, diversenet, material, estimation, voxel, michael, palm, surface, mesh, dominant, handoff, closest] [figure, ieee, conference, printed, input, translation] [size, table, network, processing, full, deep, neural] [diverse, intent, model, robotic, observed, collection] [object, map, area, wine] [data, learning, soft, unseen, training, large, smcl, upper, loss, observe, representation]
@InProceedings{Brahmbhatt_2019_CVPR,
  author = {Brahmbhatt, Samarth and Ham, Cusuh and Kemp, Charles C. and Hays, James},
  title = {ContactDB: Analyzing and Predicting Grasp Contact via Thermal Imaging},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Robust Subspace Clustering With Independent and Piecewise Identically Distributed Noise Modeling
Yuanman Li, Jiantao Zhou, Xianwei Zheng, Jinyu Tian, Yuan Yan Tang


Most of the existing subspace clustering (SC) frameworks assume that the noise contaminating the data is generated by an independent and identically distributed (i.i.d.) source, where the Gaussianity is often imposed. Though these assumptions greatly simplify the underlying problems, they do not hold in many real-world applications. For instance, in face clustering, the noise is usually caused by random occlusions, local variations and unconstrained illuminations, which is essentially structural and hence satisfies neither the i.i.d. property nor the Gaussianity. In this work, we propose an independent and piecewise identically distributed (i.p.i.d.) noise model, where the i.i.d. property only holds locally. We demonstrate that the i.p.i.d. model better characterizes the noise encountered in practical scenarios, and accommodates the traditional i.i.d. model as a special case. Assisted by this generalized noise model, we design an information theoretic learning (ITL) framework for robust SC through a novel minimum weighted error entropy (MWEE) criterion. Extensive experimental results show that our proposed SC scheme significantly outperforms the state-of-the-art competing algorithms.
[signal, motion, video, structural, framework, moving, sequence, complex, recognition] [robust, algorithm, problem, pattern, matrix, linear, error, argmin, underlying, local, assumption, definition, occlusion, practical, optimization, well, purely, characterize, affine, property, define] [noise, ieee, image, proposed, spectral, face, traditional, based, figure, method, statistical, fidelity, mse, competing, difference] [number, criterion, performance, density, pre, sparse, gaussian, better, weighted, approximate, accuracy, distributed, design, regularization] [generated, model, example, probability, random] [segmentation, feature, affinity, union, adopted, piecewise] [clustering, subspace, data, entropy, function, source, representation, learning, independent, gaussianity, extended, existing, yale, theoretic, minimum, itl]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yuanman and Zhou, Jiantao and Zheng, Xianwei and Tian, Jinyu and Yan Tang, Yuan},
  title = {Robust Subspace Clustering With Independent and Piecewise Identically Distributed Noise Modeling},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
What Correspondences Reveal About Unknown Camera and Motion Models?
Thomas Probst, Ajad Chhatkuli, Danda Pani Paudel, Luc Van Gool


In two-view geometry, camera models and motion types are used as key knowledge along with the image point correspondences in order to solve several key problems of 3D vision. Problems such as Structure-from-Motion (SfM) and camera self-calibration are tackled under the assumptions of a specific camera projection model and motion type. However, these key assumptions may not always be justified, i.e., we may often know neither the camera model nor the motion type beforehand. In that context, one can extract only the point correspondences between images. From such correspondences, recovering the two-view relationship (expressed by the unknown camera model and motion type) remains an unsolved problem. In this paper, we tackle this problem in two steps. First, we propose a method that computes the correct two-view relationship in the presence of noise and outliers. Later, we study different possibilities to disambiguate the obtained relationships into camera model and motion type. By extensive experiments on both synthetic and real data, we verify our theory and assumptions in practical settings.
[motion, key, sequence] [camera, rotation, matrix, problem, perspective, point, affine, homography, fundamental, orthographic, pure, uncalibrated, case, calibrated, outlier, property, vandermonde, ransac, computer, vision, inlier, singular, polynomial, ideal, disambiguated, solve, fitting, ambiguity, solution, degree, disambiguation, proof, varying, pattern, recovering, sfm, consensus, provide, approach, compute, constraint, relative, oxford] [method, translation, image, figure, noise, real, synthetic, variety, row] [sparse, full, order, table, discussed, search, better, rate] [model, basis, type, relationship, correct, consider, find] [global, detection] [essential, set, unknown, space, sample, metric]
@InProceedings{Probst_2019_CVPR,
  author = {Probst, Thomas and Chhatkuli, Ajad and Pani Paudel, Danda and Van Gool, Luc},
  title = {What Correspondences Reveal About Unknown Camera and Motion Models?},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Self-Calibrating Deep Photometric Stereo Networks
Guanying Chen, Kai Han, Boxin Shi, Yasuyuki Matsushita, Kwan-Yee K. Wong


This paper proposes an uncalibrated photometric stereo method for non-Lambertian scenes based on deep learning. Unlike previous approaches that heavily rely on assumptions of specific reflectances and light source distributions, our method is able to determine both shape and light directions of a scene with unknown arbitrary reflectances observed under unknown varying light directions. To achieve this goal, we propose a two-stage deep learning architecture, called SDPS-Net, which can effectively take advantage of intermediate supervision, resulting in reduced learning difficulty compared to a single-stage model. Experiments on both synthetic and real datasets show that our proposed approach significantly outperforms previous uncalibrated photometric stereo methods.
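To see why the two-stage split helps, note that once light directions are known (the role of the first stage), the classic Lambertian case reduces to a per-pixel least-squares solve. The NumPy sketch below is an idealized, noise-free illustration of that classic solve and is not the paper's network-based second stage, which handles unknown, non-Lambertian reflectance:

import numpy as np

# Recover a single pixel's normal and albedo from intensities under known lights
# (ideal Lambertian model, shadowed observations discarded).
rng = np.random.default_rng(0)
L = rng.normal(size=(16, 3))
L /= np.linalg.norm(L, axis=1, keepdims=True)          # 16 unit light directions
true_n = np.array([0.2, -0.3, 0.93])
true_n = true_n / np.linalg.norm(true_n)
shading = L @ true_n
vis = shading > 0                                       # keep unshadowed observations
intensities = 0.7 * shading[vis]                        # albedo = 0.7
g, *_ = np.linalg.lstsq(L[vis], intensities, rcond=None)   # g = albedo * normal
print(round(float(np.linalg.norm(g)), 3), np.round(g / np.linalg.norm(g), 3))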
[dataset, framework, multiple, previous] [light, lighting, photometric, normal, stereo, estimation, lcnet, direction, surface, uncalibrated, estimated, nenet, reflectance, error, calibrated, estimate, range, problem, lambertian, discretization, lcnetreg, shape, rendered, elevation, diligent, general, azimuth, merltest, local, unny, yasuyuki, varying, directly, brdfs] [based, input, method, intensity, proposed, denoted, image, real, figure, synthetic, handle, arbitrary, isotropic] [network, number, deep, table, performance, phere] [model, introduced] [object, feature, stage, map, mask, average, baseline, improve] [learning, datasets, unknown, classification, data, test, loss, trained, source, distribution, uniform, angular, existing]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Guanying and Han, Kai and Shi, Boxin and Matsushita, Yasuyuki and Wong, Kwan-Yee K.},
  title = {Self-Calibrating Deep Photometric Stereo Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Argoverse: 3D Tracking and Forecasting With Rich Maps
Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, James Hays


We present Argoverse, a dataset designed to support autonomous vehicle perception tasks including 3D tracking and motion forecasting. Argoverse includes sensor data collected by a fleet of autonomous vehicles in Pittsburgh and Miami as well as 3D tracking annotations, 300k extracted interesting vehicle trajectories, and rich semantic maps. The sensor data consists of 360 degree images from 7 cameras with overlapping fields of view, forward-facing stereo imagery, 3D point clouds from long range LiDAR, and 6-DOF pose. Our 290km of mapped lanes contain rich geometric and semantic metadata which are not currently available in any public dataset. All data is released under a Creative Commons license at Argoverse.org. In baseline experiments, we use map information such as lane direction, driveable area, and ground height to improve the accuracy of 3D object tracking. We use 3D object tracking to mine for more than 300k interesting vehicle trajectories to create a trajectory forecasting benchmark. Motion forecasting experiments ranging in complexity from classical methods (k-NN) to LSTMs demonstrate that using detailed vector maps with lane-level information substantially reduces prediction error. Our tracking and forecasting experiments represent only a superficial exploration of the potential of rich maps in robotic perception. We hope that Argoverse will enable the research community to explore these problems in greater depth.
[tracking, lane, forecasting, dataset, lstm, trajectory, social, driveable, motion, driving, argoverse, prediction, track, tracked, predict, tobs, fleet, dynamic, rasterized, future, capture, follow, sequence, jesse] [ground, lidar, autonomous, david, vision, sensor, computer, coordinate, point, approach, international, kitti, provide, publicly, nuscenes, field, cloud, direction, pattern] [conference, figure, ieee, input] [table, number, tracker] [vector, model, multimodal, rich, system] [map, object, vehicle, area, semantic, baseline, context, road, spatial, annotated, height, interest, car, detection, segment, region, benchmark, mined, three, urban, raquel, sebastian] [data, datasets, large, observe, set, learning]
@InProceedings{Chang_2019_CVPR,
  author = {Chang, Ming-Fang and Lambert, John and Sangkloy, Patsorn and Singh, Jagjeet and Bak, Slawomir and Hartnett, Andrew and Wang, De and Carr, Peter and Lucey, Simon and Ramanan, Deva and Hays, James},
  title = {Argoverse: 3D Tracking and Forecasting With Rich Maps},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Side Window Filtering
Hui Yin, Yuanhao Gong, Guoping Qiu


Local windows are routinely used in computer vision and almost without exception the center of the window is aligned with the pixels being processed. We show that this conventional wisdom is not universally applicable. When a pixel is on an edge, placing the center of the window on the pixel is one of the fundamental reasons that cause many filtering algorithms to blur the edges. Based on this insight, we propose a new Side Window Filtering (SWF) technique which aligns the window's side or corner with the pixel being processed. The SWF technique is surprisingly simple yet theoretically rooted and very effective in practice. We show that many traditional linear and nonlinear filters can be easily implemented under the SWF framework. Extensive analysis and experiments show that implementing the SWF principle can significantly improve their edge preserving capabilities and achieve state of the art performances in applications such as image smoothing, denoising, enhancement, structure-preserving texture-removing, mutual-structure extraction, and HDR tone mapping. In addition to image filtering, we further show that the SWF principle can be extended to other applications involving the use of a local window. Using colorization by optimization as an example, we demonstrate that implementing the SWF principle can effectively prevent artifacts such as color leakage associated with the conventional implementation. Given the ubiquity of window based operations in computer vision, the new SWF technique is likely to benefit many more applications.
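A minimal NumPy sketch of the side-window principle applied to a box (mean) filter is given below. It is a direct, unoptimized illustration only; the paper covers efficient implementations and many other filters, and the edge-padding boundary handling here is an assumption of this sketch:

import numpy as np

def side_window_box_filter(img, r=2):
    # For each pixel, compute the mean over each of the 8 side windows (pixel on
    # the window's side or corner) and keep the mean closest to the original value.
    padded = np.pad(img, r, mode="edge")
    h, w = img.shape
    # (row_lo, row_hi, col_lo, col_hi) offsets: up, down, left, right, NW, NE, SW, SE.
    sides = [(-r, 0, -r, r), (0, r, -r, r), (-r, r, -r, 0), (-r, r, 0, r),
             (-r, 0, -r, 0), (-r, 0, 0, r), (0, r, -r, 0), (0, r, 0, r)]
    out = np.empty_like(img, dtype=float)
    for y in range(h):
        for x in range(w):
            py, px = y + r, x + r
            means = [padded[py + r0:py + r1 + 1, px + c0:px + c1 + 1].mean()
                     for r0, r1, c0, c1 in sides]
            out[y, x] = min(means, key=lambda m: abs(m - img[y, x]))
    return out

step = np.repeat([0.0, 1.0], 8)[None, :].repeat(8, axis=0)       # sharp vertical edge
noisy = step + 0.05 * np.random.default_rng(0).normal(size=step.shape)
print(np.abs(side_window_box_filter(noisy) - step).mean())       # small: the edge is not blurred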
[window, framework, signal, version, state] [technique, linear, local, algorithm, computer, analysis, optimization, property, corner, left, vision, total, neighborhood, form, horizontal] [side, image, input, swf, filtering, pixel, based, preserve, zoomed, ieee, bil, gui, figure, result, traditional, bilateral, implementing, colorization, color, patch, method, preserving, roof, gau, tone, enhancement, hdr, ramp, real, application, acm, blur] [filter, output, original, operation, processing, principle, gaussian, weighted, table, kernel, better, computational, approximation, weight, iteration] [median, easily, potential] [box, edge, guided, center, improve, propose] [target, combination]
@InProceedings{Yin_2019_CVPR,
  author = {Yin, Hui and Gong, Yuanhao and Qiu, Guoping},
  title = {Side Window Filtering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Defense Against Adversarial Images Using Web-Scale Nearest-Neighbor Search
Abhimanyu Dubey, Laurens van der Maaten, Zeki Yalniz, Yixuan Li, Dhruv Mahajan


A plethora of recent work has shown that convolutional networks are not robust to adversarial images: images that are created by perturbing a sample from the data distribution so as to maximize the loss on the perturbed example. In this work, we hypothesize that adversarial perturbations move the image away from the image manifold in the sense that there exists no physical process that could have produced the adversarial image. This hypothesis suggests that a successful defense mechanism against adversarial images should aim to project the images back onto the image manifold. We study such defense mechanisms, which approximate the projection onto the unknown image manifold by a nearest-neighbor search against a web-scale image database containing tens of billions of images. Empirical evaluations of this defense strategy on ImageNet suggest that it is very effective in attack settings in which the adversary does not have access to the image database. We also propose two novel attack methods to break nearest-neighbor defense settings and show conditions under which nearest-neighbor defense fails. We perform a series of ablation experiments, which suggest that there is a trade-off between robustness and accuracy as we use features from deeper in the network, that a large index size (hundreds of millions) is crucial to get good performance, and that careful construction of the database is crucial for robustness against nearest-neighbor attacks.
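A toy sketch of the nearest-neighbor defense idea follows: replace the classifier's decision on a possibly-adversarial input with a weighted vote over its K nearest neighbors in feature space. The paper searches a web-scale database of billions of images with weak labels; the small random feature database, the label count, and the distance weighting below are stand-ins for this illustration:

import numpy as np

rng = np.random.default_rng(0)
db_features = rng.normal(size=(10_000, 128))   # stand-in database features
db_labels = rng.integers(0, 10, size=10_000)   # stand-in (weak) labels, 10 classes

def knn_defended_prediction(query_feature, k=50):
    # Project the query "back onto the manifold" by voting over its nearest neighbors.
    d = np.linalg.norm(db_features - query_feature, axis=1)
    nn = np.argsort(d)[:k]
    weights = 1.0 / (d[nn] + 1e-8)              # closer neighbors count more
    votes = np.bincount(db_labels[nn], weights=weights, minlength=10)
    return int(votes.argmax())

print(knn_defended_prediction(rng.normal(size=128)))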
[work, prediction, perform, hypothesis, hashtags] [normalized, robust, projection, associated] [image, database, figure, strength, clean, based, study, input, method] [accuracy, imagenet, search, conv, effectiveness, size, table, effective, deep, convolutional, neural, network, norm, gradient, number, approximate, small, performance, better] [adversarial, defense, attack, arxiv, preprint, model, robustness, adversary, manifold, access, pgd, perturbation, attacker, example, true, swan, goose, sign] [feature, three, ablation] [classification, nearest, set, training, function, strategy, loss, nearestneighbor, weighting, knn, sample, similarity, setting, distance, softmax]
@InProceedings{Dubey_2019_CVPR,
  author = {Dubey, Abhimanyu and van der Maaten, Laurens and Yalniz, Zeki and Li, Yixuan and Mahajan, Dhruv},
  title = {Defense Against Adversarial Images Using Web-Scale Nearest-Neighbor Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Incremental Object Learning From Contiguous Views
Stefan Stojanov, Samarth Mishra, Ngoc Anh Thai, Nikhil Dhanda, Ahmad Humayun, Chen Yu, Linda B. Smith, James M. Rehg


In this work, we present CRIB (Continual Recognition Inspired by Babies), a synthetic incremental object learning environment that can produce data that models visual imagery produced by object exploration in early infancy. CRIB is coupled with a new 3D object dataset, Toys-200, that contains 200 unique toy-like object instances, and is also compatible with existing 3D datasets. Through extensive empirical evaluation of state-of-the-art incremental learning algorithms, we find the novel empirical result that repetition can significantly ameliorate the effects of catastrophic forgetting. Furthermore, we find that in certain cases repetition allows for performance approaching that of batch learning algorithms. Finally, we propose an unsupervised incremental learning task with intriguing baseline results.
[recognition, work, sequence, dataset, early, current] [exposure, computer, vision, pattern, single, rendering, algorithm, total, shapenet] [conference, repeated, figure, ieee, prior, image, play, background, synthetic, based] [performance, number, accuracy, deep, batch, neural, standard, extensive, processing] [visual, environment, unique, generate, random, repetition, generated, length] [object, instance, category, three, baseline, foreground, european] [learning, incremental, crib, data, catastrophic, set, task, distillation, forgetting, novel, exemplar, test, loss, open, training, contiguous, existing, class, unlimited, learner, testing, uos, unsupervised]
@InProceedings{Stojanov_2019_CVPR,
  author = {Stojanov, Stefan and Mishra, Samarth and Anh Thai, Ngoc and Dhanda, Nikhil and Humayun, Ahmad and Yu, Chen and Smith, Linda B. and Rehg, James M.},
  title = {Incremental Object Learning From Contiguous Views},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
IP102: A Large-Scale Benchmark Dataset for Insect Pest Recognition
Xiaoping Wu, Chi Zhan, Yu-Kun Lai, Ming-Ming Cheng, Jufeng Yang


Insect pests are one of the main factors affecting agricultural product yield. Accurate recognition of insect pests facilitates timely preventive measures to avoid economic losses. However, the existing datasets for the visual classification task mainly focus on common objects, e.g., flowers and dogs. This limits the application of powerful deep learning technology on specific domains like the agricultural field. In this paper, we collect a large-scale dataset named IP102 for insect pest recognition. Specifically, it contains more than 75,000 images belonging to 102 categories, which exhibit a natural long-tailed distribution. In addition, we annotate about 19,000 images with bounding boxes for object detection. The IP102 has a hierarchical taxonomy and the insect pests which mainly affect one specific agricultural product are grouped into the same upper-level category. Furthermore, we perform several baseline experiments on the IP102 dataset, including handcrafted and deep feature based classification methods. Experimental results show that this dataset has the challenges of inter- and intra-class variance and data imbalance. We believe our IP102 will facilitate future research on practical insect pest control, fine-grained visual classification, and imbalanced learning fields. We make the dataset and pre-trained models publicly available at https://github.com/xpwu95/IP102.
[dataset, recognition, professional, previous, work] [handcrafted, field, journal, computer, international, corresponding] [image, based, collect, figure, real, background, color, expert] [deep, performance, table, number, resnet, neural, accuracy, convolutional, imagenet, compared, structure, achieves, alexnet] [insect, pest, agricultural, system, common, machine, rice, evaluate, paddy, taxonomic, economic, visual, indicates, evaluation, mauc] [feature, detection, object, including, category, crop, hierarchical, annotate, average, annotation, bounding, detailed, utilize, illustrated] [classification, learning, class, datasets, data, imbalanced, svm, large, training, knn, existing, split, distribution, classifier, representation, label, hard, sample, set]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Xiaoping and Zhan, Chi and Lai, Yu-Kun and Cheng, Ming-Ming and Yang, Jufeng},
  title = {IP102: A Large-Scale Benchmark Dataset for Insect Pest Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
CityFlow: A City-Scale Benchmark for Multi-Target Multi-Camera Vehicle Tracking and Re-Identification
Zheng Tang, Milind Naphade, Ming-Yu Liu, Xiaodong Yang, Stan Birchfield, Shuo Wang, Ratnesh Kumar, David Anastasiu, Jenq-Neng Hwang


Urban traffic optimization using traffic cameras as sensors is driving the need to advance state-of-the-art multi-target multi-camera (MTMC) tracking. This work introduces CityFlow, a city-scale traffic camera dataset consisting of more than 3 hours of synchronized HD videos from 40 cameras across 10 intersections, with the longest distance between two simultaneous cameras being 2.5 km. To the best of our knowledge, CityFlow is the largest-scale dataset in terms of spatial coverage and the number of cameras/videos in an urban environment. The dataset contains more than 200K annotated bounding boxes covering a wide range of scenes, viewing angles, vehicle models, and urban traffic flow conditions. Camera geometry and calibration information are provided to aid spatio-temporal analysis. In addition, a subset of the benchmark is made available for the task of image-based vehicle re-identification (ReID). We conducted an extensive experimental evaluation of baselines/state-of-the-art approaches in MTMC tracking, multi-target single-camera (MTSC) tracking, object detection, and image-based ReID on this dataset, analyzing the impact of different network architectures, loss functions, spatio-temporal models and their combinations on task effectiveness. An evaluation server is launched with the release of our benchmark at the 2019 AI City Challenge (https://www.aicitychallenge.org/) that allows researchers to compare the performance of their newest techniques. We expect this dataset to catalyze research in this field, propel the state-of-the-art forward, and lead to deployed traffic optimization(s) in the real world.
[tracking, traffic, dataset, multiple, online, fvs, video, manually, zheng, flow, time, deepsort, moana, largest] [camera, city, calibration, provide, problem, note] [method, image, based, figure, proposed, appearance] [performance, number, table, network, deep, compared, wei, accuracy, original, batch, best, precision, top] [evaluation, association, model, include] [vehicle, person, benchmark, mtmc, mtsc, object, cityflow, bounding, detection, map, false, spatial, urban, dukemtmc, residential, average, public] [reid, learning, loss, metric, triplet, data, distance, datasets, hard, existing, subset, task, combination, test, domain, distribution]
@InProceedings{Tang_2019_CVPR,
  author = {Tang, Zheng and Naphade, Milind and Liu, Ming-Yu and Yang, Xiaodong and Birchfield, Stan and Wang, Shuo and Kumar, Ratnesh and Anastasiu, David and Hwang, Jenq-Neng},
  title = {CityFlow: A City-Scale Benchmark for Multi-Target Multi-Camera Vehicle Tracking and Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence
Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, Louis-Philippe Morency


As intelligent systems increasingly blend into our everyday life, artificial social intelligence becomes a prominent area of research. Intelligent systems must be socially intelligent in order to comprehend human intents and maintain a rich level of interaction with humans. Human language offers a unique unconstrained approach to probe through questions and reason through answers about social situations. This unconstrained approach extends previous attempts to model social intelligence through numeric supervision (e.g. sentiment and emotion labels). In this paper, we introduce Social-IQ, an unconstrained benchmark specifically designed to train and evaluate socially intelligent technologies. By providing a rich source of open-ended questions and answers, Social-IQ opens the door to explainable social intelligence. The dataset contains rigorously annotated and validated videos, questions and answers, as well as annotations for the complexity level of each question and answer. Social-IQ contains 1,250 natural in-the-wild social situations, 7,500 questions and 52,500 correct and incorrect answers. Although humans can reason about social situations with very high accuracy (95.08%), existing state-of-the-art computational models struggle on this task. As a result, Social-IQ brings novel challenges that will spark future research in social intelligence modeling, visual reasoning, and multimodal question answering (QA).
[social, dataset, video, human, multiple, recognition, fusion, state, performing, amir, future, joint, people] [computer, pattern, vision, analysis, intelligent, june, require, total, paul, well, journal] [ieee, demonstrates, conference, based, figure, high, unconstrained, image] [performance, complexity, validation, processing, computational, neural, number, binary, network] [question, intelligence, correct, multimodal, answering, man, answer, understanding, incorrect, language, woman, memory, visual, machine, creation, example, length, attention, model, evaluation, bert, artificial, men, natural, movieqa, simple, association] [level, baseline, annotation, stage, sentiment, art, average, benchmark] [set, distribution, training, learning, embeddings, bias, datasets]
@InProceedings{Zadeh_2019_CVPR,
  author = {Zadeh, Amir and Chan, Michael and Pu Liang, Paul and Tong, Edmund and Morency, Louis-Philippe},
  title = {Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
UPSNet: A Unified Panoptic Segmentation Network
Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, Raquel Urtasun


In this paper, we propose a unified panoptic segmentation network (UPSNet) for tackling the newly proposed panoptic segmentation task. On top of a single backbone residual network, we first design a deformable convolution based semantic segmentation head and a Mask R-CNN style instance segmentation head which solve these two subtasks simultaneously. More importantly, we introduce a parameter-free panoptic head which solves the panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then expands the representation to enable prediction of an extra unknown class, which helps better resolve the conflicts between semantic and instance segmentation. In addition, it handles the challenge caused by the varying number of instances and permits back propagation to the bottom modules in an end-to-end manner. Extensive experimental results on Cityscapes, COCO and our internal dataset demonstrate that our UPSNet achieves state-of-the-art performance with much faster inference. Code has been made available at: https://github.com/uber-research/UPSNet
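As a rough illustration of the parameter-free panoptic head described above, the following numpy sketch combines semantic-head logits with instance mask logits through pixel-wise classification. It is a minimal sketch, not the authors' implementation: the extra unknown-class channel and the RoI/box handling from the paper are omitted, and all shapes and values below are toy assumptions.

```python
import numpy as np

def panoptic_logits(sem_logits, inst_mask_logits, inst_classes, n_stuff):
    """Sketch of a parameter-free panoptic head: stuff channels are taken
    directly from the semantic head; each instance channel is the sum of its
    mask logits and the semantic logits of its predicted class.  The panoptic
    prediction is then a per-pixel argmax over all channels."""
    stuff = sem_logits[:n_stuff]                                   # (n_stuff, H, W)
    things = [inst_mask_logits[i] + sem_logits[c]
              for i, c in enumerate(inst_classes)]                 # one channel per instance
    logits = np.concatenate([stuff, np.stack(things)], axis=0)
    return logits.argmax(axis=0)                                   # (H, W) panoptic labels

sem = np.random.randn(8, 4, 4)       # 8 semantic classes, first 3 treated as stuff
masks = np.random.randn(2, 4, 4)     # mask logits for 2 detected instances of classes 5 and 6
print(panoptic_logits(sem, masks, inst_classes=[5, 6], n_stuff=3))
```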
[combined, dataset, predict, prediction, consists] [single, ground, truth, scene, provide] [image, method, based, pixel, proposed, quality] [table, network, convolutional, better, deep, design, convolution, number, inference, size, channel, scale, top, achieves, performance, architecture, order, compare, larger, apply] [model, arxiv, preprint, visual, belongs] [segmentation, semantic, panoptic, instance, head, mask, backbone, logits, stuff, feature, upsnet, thing, box, deformable, coco, object, roi, predicted, pqth, fully, pqst, miou, ablation, kaiming, jian, ross, extra, faster, parsing, pyramid, fpn, pspnet, xmaski] [class, loss, unknown, learning, unified, training, set, train]
@InProceedings{Xiong_2019_CVPR,
  author = {Xiong, Yuwen and Liao, Renjie and Zhao, Hengshuang and Hu, Rui and Bai, Min and Yumer, Ersin and Urtasun, Raquel},
  title = {UPSNet: A Unified Panoptic Segmentation Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
JSIS3D: Joint Semantic-Instance Segmentation of 3D Point Clouds With Multi-Task Pointwise Networks and Multi-Value Conditional Random Fields
Quang-Hieu Pham, Thanh Nguyen, Binh-Son Hua, Gemma Roig, Sai-Kit Yeung


Deep learning techniques have become the go-to models for most vision-related tasks on 2D images. However, their power has not been fully realised on several tasks in 3D space, e.g., 3D scene understanding. In this work, we jointly address the problems of semantic and instance segmentation of 3D point clouds. Specifically, we develop a multi-task pointwise network that simultaneously performs two tasks: predicting the semantic classes of 3D points and embedding the points into high-dimensional vectors so that points of the same object instance are represented by similar embeddings. We then propose a multi-value conditional random field model to incorporate the semantic and instance labels and formulate the problem of semantic and instance segmentation as jointly optimising labels in the field model. The proposed method is thoroughly evaluated and compared with existing methods on different indoor scene datasets including S3DIS and SceneNN. Experimental results showed the robustness of the proposed joint semantic-instance segmentation scheme over its single components. Our method also achieved state-of-the-art performance on semantic segmentation.
[recognition, joint, window, jointly, term, work] [point, vision, computer, scene, pattern, cloud, field, indoor, international, defined, optimisation, pointnet, directly, pipeline, approach, thanh, problem, volumetric, exp, scenenn] [conference, ieee, method, proposed, conditional, figure, input, based, comparison] [network, deep, neural, table, performance, pointwise, inference, accuracy, compared, mlp, number, energy] [random, variational, model, understanding, potential] [semantic, instance, segmentation, object, european, feature, duc, fully, assigned, crfs] [class, embeddings, embedding, set, learning, label, loss, existing, classification, experimental, log, data, representation]
@InProceedings{Pham_2019_CVPR,
  author = {Pham, Quang-Hieu and Nguyen, Thanh and Hua, Binh-Son and Roig, Gemma and Yeung, Sai-Kit},
  title = {JSIS3D: Joint Semantic-Instance Segmentation of 3D Point Clouds With Multi-Task Pointwise Networks and Multi-Value Conditional Random Fields},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth
Davy Neven, Bert De Brabandere, Marc Proesmans, Luc Van Gool


Current state-of-the-art instance segmentation methods are not suited for real-time applications like autonomous driving, which require fast execution times at high accuracy. Although the currently dominant proposal-based methods have high accuracy, they are slow and generate masks at a fixed and low resolution. Proposal-free methods, by contrast, can generate masks at high resolution and are often faster, but fail to reach the same accuracy as the proposal-based methods. In this work we propose a new clustering loss function for proposal-free instance segmentation. The loss function pulls the spatial embeddings of pixels belonging to the same instance together and jointly learns an instance-specific clustering bandwidth, maximizing the intersection-over-union of the resulting instance mask. When combined with a fast architecture, the network can perform instance segmentation in real-time while maintaining a high accuracy. We evaluate our method on the challenging Cityscapes benchmark and achieve top results (5% improvement over Mask R-CNN) at more than 10 fps on 2MP images.
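A minimal numpy sketch of the idea above: each pixel's spatial embedding is scored against an instance centre with a Gaussian whose bandwidth sigma is instance-specific, and the resulting soft mask is compared with the ground-truth mask. The Lovasz hinge used by proposal-free methods of this kind is replaced by a plain soft IoU here, and all values are toy data rather than the paper's setup.

```python
import numpy as np

def instance_probability(spatial_emb, center, sigma):
    """Gaussian score of each pixel belonging to an instance whose
    centre and clustering bandwidth (sigma) are both learned."""
    d2 = np.sum((spatial_emb - center) ** 2, axis=1)   # spatial_emb: (N, 2) pixel coords + offsets
    return np.exp(-d2 / (2.0 * sigma ** 2))

def soft_iou_loss(prob, mask):
    """Soft IoU between the Gaussian response and the ground-truth mask."""
    inter = np.sum(prob * mask)
    union = np.sum(prob + mask - prob * mask)
    return 1.0 - inter / (union + 1e-8)

# toy example: 4 pixels, the first 2 belong to the instance centred at the origin
emb = np.array([[0.1, 0.0], [0.0, 0.1], [3.0, 3.0], [4.0, 4.0]])
mask = np.array([1.0, 1.0, 0.0, 0.0])
print(soft_iou_loss(instance_probability(emb, np.array([0.0, 0.0]), 0.5), mask))
```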
[dataset, time, work, current] [computer, point, vision, directly, notice, optimal, coordinate, pattern] [method, pixel, high, conference, ieee, based, figure, resolution, difference, background] [network, learnable, fine, achieve, small, gaussian, fixed, accuracy, performance, low, standard, output, table, top, binary, scalar, compare, processing, inference, optimized] [arxiv, preprint, evaluate, pointing, vector] [instance, segmentation, sigma, center, object, mask, seed, semantic, offset, region, detection, map, branch, regression, bigger, propose, location, car, person, coarse, score, panet, lay] [loss, function, clustering, margin, centroid, embedding, learning, embeddings, set, train, learn, test, belonging, big, specific, learned]
@InProceedings{Neven_2019_CVPR,
  author = {Neven, Davy and De Brabandere, Bert and Proesmans, Marc and Van Gool, Luc},
  title = {Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DeepCO3: Deep Instance Co-Segmentation by Co-Peak Search and Co-Saliency Detection
Kuang-Jui Hsu, Yen-Yu Lin, Yung-Yu Chuang


In this paper, we address a new task called instance co-segmentation. Given a set of images jointly covering object instances of a specific category, instance co-segmentation aims to identify all of these instances and segment each of them, i.e. generating one mask for each instance. This task is important since instance-level segmentation is preferable for humans and many vision applications. It is also challenging because no pixel-wise annotated training data are available and the number of instances in each image is unknown. We solve this task by dividing it into two sub-tasks, co-peak search and instance mask segmentation. In the former sub-task, we develop a CNN-based network to detect the co-peaks as well as co-saliency maps for a pair of images. A co-peak has two endpoints, one in each image, that are local maxima in the response maps and similar to each other. Thereby, the two endpoints are potentially covered by a pair of instances of the same category. In the latter subtask, we design a ranking function that takes the detected co-peaks and co-saliency maps as inputs and can select the object proposals to produce the final results. Our method for instance co-segmentation and its variant for object colocalization are evaluated on four datasets, and achieve favorable performance against the state-of-the-art methods. The source codes and the collected datasets are available at https://github.com/KuangJuiHsu/DeepCO3/
[dataset, multiple, term, joint] [matching, form, additional, defined] [method, image, figure, competing, proposed, collected, input, background, high] [performance, deep, number, search, network, table, design, convolutional, group, called, better, compared, applied, correlation] [model, common] [instance, object, segmentation, saliency, feature, mask, three, salient, bounding, map, detection, including, proposal, segment, detected, category, tnm, semantic, affinity, cvpr, box, soc, pascal, nldf, adopt, voc, clrw, ddt, prm, weakly, cosegmentation] [loss, training, set, task, data, ranking, function, learning, unsupervised, discriminative, trained, pair, datasets, supervised]
@InProceedings{Hsu_2019_CVPR,
  author = {Hsu, Kuang-Jui and Lin, Yen-Yu and Chuang, Yung-Yu},
  title = {DeepCO3: Deep Instance Co-Segmentation by Co-Peak Search and Co-Saliency Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Improving Semantic Segmentation via Video Propagation and Label Relaxation
Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, Bryan Catanzaro


Semantic segmentation requires large amounts of pixel-wise annotations to learn accurate models. In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks. We exploit video prediction models' ability to predict future frames in order to also predict future labels. A joint propagation strategy is also proposed to alleviate mis-alignments in synthesized samples. We demonstrate that training segmentation models on datasets augmented by the synthesized samples leads to significant improvements in accuracy. Furthermore, we introduce a novel boundary label relaxation technique that makes training robust to annotation noise and propagation artifacts along object boundaries. Our proposed methods achieve state-of-the-art mIoUs of 83.5% on Cityscapes and 82.9% on CamVid. Our single model, without model ensembles, achieves 72.8% mIoU on the KITTI semantic segmentation test set, which surpasses the winning entry of the ROB challenge 2018.
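The boundary label relaxation can be pictured with a short sketch: along object borders the network is only asked to put probability mass on the union of classes present in the local window, rather than on a single (possibly noisy or mis-propagated) label. This is a minimal sketch of that relaxed loss; the class indices and probabilities below are made up.

```python
import numpy as np

def relaxed_boundary_loss(probs, border_classes):
    """Boundary label relaxation (sketch): at a border pixel the loss is the
    negative log of the summed probability of all classes present in the
    surrounding window, so either of two touching classes is acceptable."""
    return -np.log(sum(probs[c] for c in border_classes) + 1e-8)

probs = np.full(19, 0.01)
probs[0], probs[13] = 0.45, 0.40          # e.g. 'road' and 'car' meeting at a boundary
probs /= probs.sum()
print(relaxed_boundary_loss(probs, {0, 13}))   # ~0.18, vs. ~0.82 for ordinary cross-entropy on class 0
```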
[video, propagation, prediction, joint, future, motion, recognition, optical, frame, perform, dataset, flow, predict, propagate, multiple] [vision, computer, relaxation, pattern, reconstruction, augmented, approach, international, kitti, note, accurate, dense, robust, estimation] [conference, ieee, proposed, image, demonstrate, pixel, synthesized, method, figure, input, handle, comparison, quality] [better, table, standard, achieve, performance, scale, deep, accuracy, imagenet, effective] [model, indicates, create, visual, generate, random] [semantic, segmentation, boundary, miou, propagated, propose, object, baseline, mapillary, european, annotated, predicted, annotation] [label, training, class, data, test, learned, large, learning, set, sampling, datasets, strategy, uniform, augmentation]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Yi and Sapra, Karan and Reda, Fitsum A. and Shih, Kevin J. and Newsam, Shawn and Tao, Andrew and Catanzaro, Bryan},
  title = {Improving Semantic Segmentation via Video Propagation and Label Relaxation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Accel: A Corrective Fusion Network for Efficient Semantic Segmentation on Video
Samvit Jain, Xin Wang, Joseph E. Gonzalez


We present Accel, a novel semantic video segmentation system that achieves high accuracy at low inference cost by combining the predictions of two network branches: (1) a reference branch that extracts high-detail features on a reference keyframe, and warps these features forward using frame-to-frame optical flow estimates, and (2) an update branch that computes features of adjustable quality on the current frame, performing a temporal update at each video frame. The modularity of the update branch, where feature subnetworks of varying layer depth can be inserted (e.g. ResNet-18 to ResNet-101), enables operation over a new, state-of-the-art accuracy-throughput trade-off spectrum. Over this curve, Accel models achieve both higher accuracy and faster inference times than the closest comparable single-frame segmentation networks. In general, Accel significantly outperforms previous work on efficient semantic video segmentation, correcting warping-related error that compounds on datasets with complex dynamics. Accel is end-to-end trainable and highly modular: the reference network, the optical flow network, and the update network can each be selected independently, depending on application requirements, and then jointly fine-tuned. The result is a robust, general system for fast, high-accuracy semantic segmentation on video.
[video, accel, keyframe, fusion, frame, eat, ntask, flow, optical, warping, current, work, time, temporal, consists, recognition, motion, previous, forward, jointly, nufeat, warped, action] [accurate, error, range, scene, problem, note] [reference, image, high, based, intermediate, input, figure] [network, accuracy, inference, convolutional, deep, interval, table, higher, efficient, layer, architecture, block, channel, cost, operation, full, achieve, output, computation, entire, standard] [model, evaluate, evaluation, system, executed, execute] [segmentation, semantic, feature, branch, deeplab, score, object, fully, dff, miou, faster, three] [update, learning, task, datasets, softmax, train]
@InProceedings{Jain_2019_CVPR,
  author = {Jain, Samvit and Wang, Xin and Gonzalez, Joseph E.},
  title = {Accel: A Corrective Fusion Network for Efficient Semantic Segmentation on Video},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Shape2Motion: Joint Analysis of Motion Parts and Attributes From 3D Shapes
Xiaogang Wang, Bin Zhou, Yahao Shi, Xiaowu Chen, Qinping Zhao, Kai Xu


For the task of mobility analysis of 3D shapes, we propose joint analysis for simultaneous motion part segmentation and motion attribute estimation, taking a single 3D model as input. The problem is significantly different from those tackled in existing works, which assume the availability of either a pre-existing shape segmentation or multiple 3D models in different motion states. To that end, we develop Shape2Motion, which takes a single 3D point cloud as input and jointly computes a mobility-oriented segmentation and the associated motion attributes. Shape2Motion comprises two deep neural networks designed for mobility proposal generation and mobility optimization, respectively. The key contribution of these networks lies in the novel motion-driven features and losses used in both motion part segmentation and motion attribute estimation. This is based on the observation that the movement of a functional part preserves the shape structure. We evaluate Shape2Motion with a newly proposed benchmark for mobility analysis of 3D shapes. Results demonstrate that our method achieves state-of-the-art performance in terms of both motion part segmentation and motion attribute estimation.
[motion, sim, static, multiple, dynamic, moved, displacement, joint, state, key, perform, work, prediction] [shape, point, analysis, cloud, orientation, matching, optimization, corresponding, axis, single, functional, matrix, confidence, computer, problem, associated, mon, supplemental, approach, tool] [method, attribute, figure, based, input, acm, proposed, quality] [network, neural, residual, table, rate, deep, achieves, performance, mpn, architecture] [type, model, vector, evaluate, conf] [mobility, segmentation, proposal, score, module, object, iou, three, regression, anchor, propose, benchmark, annotation, labeling, predicted, final, semantic] [training, similarity, loss, set, learning, classification, existing, train, measure, selected, testing]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Xiaogang and Zhou, Bin and Shi, Yahao and Chen, Xiaowu and Zhao, Qinping and Xu, Kai},
  title = {Shape2Motion: Joint Analysis of Motion Parts and Attributes From 3D Shapes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Semantic Correlation Promoted Shape-Variant Context for Segmentation
Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, Gang Wang


Context is essential for semantic segmentation. Due to the diverse shapes of objects and their complex layout in various scene images, the spatial scales and shapes of contexts for different objects have very large variation. It is thus ineffective or inefficient to aggregate various context information from a predefined fixed region. In this work, we propose to generate a scale- and shape-variant semantic mask for each pixel to confine its contextual region. To this end, we first propose a novel paired convolution to infer the semantic correlation of each pixel pair and, based on that, to generate a shape mask. Using the inferred spatial scope of the contextual region, we propose a shape-variant convolution, of which the receptive field is controlled by the shape mask that varies with the appearance of the input. In this way, the proposed network aggregates the context information of a pixel from its semantically correlated region instead of a predefined fixed region. Furthermore, this work also proposes a labeling denoising model to reduce wrong predictions caused by the noisy low-level features. Without bells and whistles, the proposed segmentation network achieves new state-of-the-art results consistently on six public segmentation datasets.
[predefined, correlated, recurrent, previous, second] [shape, scene, position, approach, inferred, local, dense, single, robust, error, corresponding] [proposed, pixel, image, figure, paired, input, denoising, based, noise] [convolution, network, correlation, convolutional, table, kernel, deep, neural, performance, better, size, aggregate, penalty, fixed, receptive, higher, learnable, layer] [model, diverse, svc, generate] [semantic, context, segmentation, mask, propose, spatial, region, labeling, svcnet, parsing, feature, surrounding, object, level, score, location, gang, global, fully, contextual, bing, layout] [training, testing, large, learn, target, learning, noisy, discriminative, label, confusion]
@InProceedings{Ding_2019_CVPR,
  author = {Ding, Henghui and Jiang, Xudong and Shuai, Bing and Qun Liu, Ai and Wang, Gang},
  title = {Semantic Correlation Promoted Shape-Variant Context for Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Relation-Shape Convolutional Neural Network for Point Cloud Analysis
Yongcheng Liu, Bin Fan, Shiming Xiang, Chunhong Pan


Point cloud analysis is very challenging, as the shape implied in irregular points is difficult to capture. In this paper, we propose RS-CNN, namely, Relation-Shape Convolutional Neural Network, which extends regular grid CNN to irregular configuration for point cloud analysis. The key to RS-CNN is learning from relation, i.e., the geometric topology constraint among points. Specifically, the convolutional weight for a local point set is forced to learn a high-level relation expression from predefined geometric priors, between a sampled point from this point set and the others. In this way, an inductive local representation with explicit reasoning about the spatial layout of points can be obtained, which leads to much shape awareness and robustness. With this convolution as a basic operator, a hierarchical architecture, RS-CNN, can be developed to achieve contextual shape-aware learning for point cloud analysis. Extensive experiments on challenging benchmarks across three tasks verify that RS-CNN achieves state-of-the-art performance.
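The core operator can be sketched as follows: the aggregation weight for each neighbour of a sampled point is produced by a function of a low-level geometric relation vector (here the coordinate difference and distance) rather than by a fixed kernel. The paper maps the relation to channel-wise weights with a shared MLP; the scalar stand-in `mlp` below is a purely illustrative assumption.

```python
import numpy as np

def rs_conv(points, feats, center_idx, neighbor_idx, mlp):
    """Relation-shape style convolution (sketch): the weight applied to each
    neighbour's feature is predicted from a geometric relation vector between
    the centre point and that neighbour, then the weighted features are pooled."""
    c = points[center_idx]
    out = np.zeros_like(feats[0])
    for j in neighbor_idx:
        diff = points[j] - c
        rel = np.concatenate([diff, [np.linalg.norm(diff)]])   # (dx, dy, dz, distance)
        out += mlp(rel) * feats[j]                              # learned weight from the relation
    return out / len(neighbor_idx)

pts, fts = np.random.rand(6, 3), np.random.rand(6, 4)
mlp = lambda r: 1.0 / (1.0 + r[-1])          # hypothetical stand-in for the shared MLP
print(rs_conv(pts, fts, 0, [1, 2, 3], mlp))
```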
[learns, key, classic, graph, perform, predefined] [point, shape, cloud, geometric, local, xyz, pointnet, irregular, normal, analysis, underlying, rigid, psub, permutation, hij, well, pcnn, robust, rscnn, explicit] [input, mapping, image, method, figure, translation, expression, transformation, decent] [convolutional, neural, convolution, deep, table, network, applied, number, mlp, max, weight, accuracy, achieve, achieves, aggregation, layer, pooling, configuration] [model, robustness, regular, sampled, encode, random] [relation, cnn, grid, feature, spatial, three, segmentation, layout, contextual, hierarchical] [learning, function, representation, shared, learn, classification, set, learned, euclidean, data, distance, inductive, discriminative, symmetric]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Yongcheng and Fan, Bin and Xiang, Shiming and Pan, Chunhong},
  title = {Relation-Shape Convolutional Neural Network for Point Cloud Analysis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Enhancing Diversity of Defocus Blur Detectors via Cross-Ensemble Network
Wenda Zhao, Bowen Zheng, Qiuhua Lin, Huchuan Lu


Defocus blur detection (DBD) is a fundamental yet challenging topic, since the homogeneous region is obscure and the transition from the focused area to the unfocused region is gradual. Recent DBD methods make progress by exploring deeper or wider networks at the expense of high memory and computation. In this paper, we propose a novel learning strategy that breaks the DBD problem into multiple smaller defocus blur detectors so that estimation errors can cancel each other out. Our focus is on diversity enhancement via a cross-ensemble network. Specifically, we design an end-to-end network composed of two logical parts: a feature extractor network (FENet) and a defocus blur detector cross-ensemble network (DBD-CENet). FENet is constructed to extract low-level features. The features are then fed into DBD-CENet, which contains two parallel branches for learning two groups of defocus blur detectors. For each individual detector, we design cross-negative and self-negative correlations and an error function to enhance ensemble diversity and balance individual accuracy. Finally, the multiple defocus blur detectors are combined with a uniformly weighted average to obtain the final DBD map. Experimental results indicate the superiority of our method in terms of accuracy and speed when compared with several state-of-the-art methods.
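The diversity-enhancing idea can be illustrated with classic negative-correlation learning, which the cross-/self-negative correlation terms above resemble: each detector fits the ground truth while being penalised for agreeing with the ensemble mean, so errors tend to cancel in the average. This is a hedged stand-in for the paper's formulation, not its exact loss; shapes and the weight `lam` are assumptions.

```python
import numpy as np

def negative_correlation_losses(preds, target, lam=0.5):
    """Negative-correlation learning (sketch): per-detector squared error plus a
    penalty that rewards deviating from the ensemble mean, encouraging the
    individual defocus blur detectors to make complementary errors."""
    mean = preds.mean(axis=0)
    losses = []
    for i, p in enumerate(preds):
        err = np.mean((p - target) ** 2)
        div = np.mean((p - mean) * (preds.sum(axis=0) - p - (len(preds) - 1) * mean))
        losses.append(err + lam * div)
    return losses

preds = np.random.rand(4, 16, 16)     # 4 detectors, toy 16x16 DBD maps
target = np.random.rand(16, 16)
print(negative_correlation_losses(preds, target))
```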
[focused, dataset, multiple, second, focus, individual, current, fed] [computer, vision, pattern, error, ground, uniformly, single, equation, homogeneous] [blur, defocus, ieee, figure, method, cuhk, conference, image, comparison, proposed, input, row] [dbd, cenet, convnet, network, convolutional, group, correlation, menet, fenet, deep, number, neural, senet, table, design, achieve, layer, computation, weighted, implement, performance, dbdf, dhcf, hifst, lbp, btbnet, deeper, accuracy, gradient] [diversity, model] [detector, dut, enhance, mae, detection, adopt, propose, average, fully, recall, map, region, area, wider, feature] [ensemble, learning, strategy, training, learn, train, large, function, measure]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Wenda and Zheng, Bowen and Lin, Qiuhua and Lu, Huchuan},
  title = {Enhancing Diversity of Defocus Blur Detectors via Cross-Ensemble Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames
Brent A. Griffin, Jason J. Corso


Semi-supervised video object segmentation has made significant progress on real and challenging videos in recent years. The current paradigm for segmentation methods and benchmark datasets is to segment objects in video provided a single annotation in the first frame. However, we find that segmentation performance across the entire video varies dramatically when selecting an alternative frame for annotation. This paper addresses the problem of learning to suggest the single best frame across the video for user annotation; this is, in fact, never the first frame of the video. We achieve this by introducing BubbleNets, a novel deep sorting network that learns to select frames using a performance-based loss function that enables the conversion of expansive amounts of training examples from already existing datasets. Using BubbleNets, we are able to achieve an 11% relative improvement in segmentation performance on the DAVIS benchmark without any changes to the underlying method of segmentation.
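The frame-selection step can be pictured as a bubble-sort-style pass with a learned comparator: a network predicts, for a pair of frames, which one would give better downstream segmentation if annotated, and bubbling through the video keeps the current best candidate. The comparator below is a hypothetical stand-in for that network, not the authors' model.

```python
def bubble_select(frames, prefer):
    """BubbleNets-style selection (sketch): `prefer(a, b)` says whether frame b
    would be a better annotation frame than a; one pass keeps the best candidate."""
    best = frames[0]
    for f in frames[1:]:
        if prefer(best, f):     # in the paper this comparison is made by a deep network
            best = f
    return best

# toy comparator: prefer the frame closer to the middle of the video
frames = list(range(10))
print(bubble_select(frames, lambda a, b: abs(b - 4.5) < abs(a - 4.5)))   # -> 4
```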
[frame, video, davis, middle, bnlf, vos, bubblenets, sorting, recognition, bubble, framework, sort, bnnifi, prediction, work, predicting, challenge, current, greatest, multiple, osvos, time] [computer, relative, vision, pattern, single, international, active, normalized, problem, provide, initial] [conference, ieee, figure, reference, user, input, comparison, image] [performance, selection, best, network, deep, table, number, validation, neural, resnet, increasing, batch, entire, better, size, implementation, architecture] [selecting, simple, find, random, worst, machine] [segmentation, object, annotation, annotated, predicted, improve, benchmark, segment] [training, learning, datasets, loss, data, train, set, select, function, selected, learn, labeled]
@InProceedings{Griffin_2019_CVPR,
  author = {Griffin, Brent A. and Corso, Jason J.},
  title = {BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Collaborative Global-Local Networks for Memory-Efficient Segmentation of Ultra-High Resolution Images
Wuyang Chen, Ziyu Jiang, Zhangyang Wang, Kexin Cui, Xiaoning Qian


Segmentation of ultra-high resolution images is increasingly demanded, yet poses significant challenges for algorithm efficiency, in particular considering the (GPU) memory limits. Current approaches either downsample an ultra-high resolution image, or crop it into small patches for separate processing. Either way, the loss of local fine details or global contextual information results in limited segmentation accuracy. We propose collaborative Global-Local Networks (GLNet) to effectively preserve both global and local information in a highly memory-efficient manner. GLNet is composed of a global branch and a local branch, taking the downsampled entire image and its cropped local patches as respective inputs. For segmentation, GLNet deeply fuses feature maps from two branches, capturing both the high-resolution fine structures from zoomed-in local patches and the contextual dependency from the downsampled input. To further resolve the potential class imbalance problem between background and foreground regions, we present a coarse-to-fine variant of GLNet, also being memory-efficient. Extensive experiments and analyses have been performed on three real-world ultra-high aerial and medical image datasets (resolution up to 30 million pixels). With a single 1080Ti GPU and less than 2GB of memory used, our GLNet yields high-quality segmentation results, and achieves much more competitive accuracy-memory usage trade-offs compared to the state of the art.
[dataset, bidirectional] [local, computer, vision, pattern, international, ground, truth, problem] [resolution, image, conference, ieee, downsampled, figure, high, collaborative, comparison, patch, proposed, background, skin] [deep, sharing, gpu, table, size, usage, aggregation, inference, accuracy, fine, convolutional, regularization, performance, layer, best, achieve, shallow, icnet] [memory, model, arxiv, preprint] [global, feature, segmentation, branch, glnet, map, semantic, miou, deepglobe, cropped, isic, aerial, foreground, bounding, box, context, inria, pyramid, object, spatial, adopted, crop, contextual, three, refinement, european] [training, trained, class, loss, test, learning, datasets, large, source, imbalance]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Wuyang and Jiang, Ziyu and Wang, Zhangyang and Cui, Kexin and Qian, Xiaoning},
  title = {Collaborative Global-Local Networks for Memory-Efficient Segmentation of Ultra-High Resolution Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Efficient Parameter-Free Clustering Using First Neighbor Relations
Saquib Sarfraz, Vivek Sharma, Rainer Stiefelhagen


We present a new clustering method in the form of a single clustering equation that is able to directly discover groupings in the data. The main proposition is that the first neighbor of each sample is all one needs to discover large chains and find the groups in the data. In contrast to most existing clustering algorithms, our method does not require any hyper-parameters, distance thresholds and/or the need to specify the number of clusters. The proposed algorithm belongs to the family of hierarchical agglomerative methods. The technique has a very low computational overhead, is easily scalable and applicable to large practical problems. Evaluation on well-known datasets from different domains ranging between 1077 and 8.1 million samples shows substantial performance gains when compared to the existing clustering techniques.
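The core linking rule is compact enough to sketch directly: connect every sample to its first nearest neighbour and take connected components of the resulting graph as clusters. The full algorithm then recurses on the cluster representatives to build a hierarchy; only one round is shown here, as a minimal sketch using scipy for the components rather than the authors' code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def first_neighbor_step(x):
    """One round of first-neighbour clustering (sketch): link every sample to its
    first nearest neighbour and take connected components as clusters; samples
    sharing the same first neighbour end up in the same component automatically."""
    d = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    first_nn = d.argmin(axis=1)
    n = len(x)
    adj = csr_matrix((np.ones(n), (np.arange(n), first_nn)), shape=(n, n))
    adj = adj + adj.T                    # symmetric: i~j if either is the other's 1-NN
    _, labels = connected_components(adj, directed=False)
    return labels

x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [5.0, 5.2]])
print(first_neighbor_step(x))            # -> [0 0 1 1 1]
```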
[graph, time, dataset] [algorithm, equation, matrix, well, directly, compute, require, form, corresponding, total] [spectral, proposed, method, based, figure, comparison, high, meaningful, input, quality] [number, deep, table, scale, full, better, compare, performance, automatically, sparse, computational, integer, discovered, efficient, approximate, computing, small, structure] [step, true, requires, discovering, simple] [hierarchical, feature, merge, average, merges] [clustering, finch, data, partition, mnist, neighbor, distance, large, cluster, similarity, learning, unsupervised, datasets, adjacency, nearest, set, sample, existing, agglomerative, pairwise, kmeans, objective, linkage, hac, subspace, train, training]
@InProceedings{Sarfraz_2019_CVPR,
  author = {Sarfraz, Saquib and Sharma, Vivek and Stiefelhagen, Rainer},
  title = {Efficient Parameter-Free Clustering Using First Neighbor Relations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Personalized Modular Network Guided by Structured Knowledge
Xiaodan Liang


The dominant deep learning approaches use a "one-size-fits-all" paradigm with the hope that underlying characteristics of diverse inputs can be captured via a fixed structure. They also overlook the importance of explicitly modeling feature hierarchy. However, complex real-world tasks often require discovering diverse reasoning paths for different inputs to achieve satisfying predictions, especially for challenging large-scale recognition tasks with complex label relations. In this paper, we treat the structured commonsense knowledge (e.g. concept hierarchy) as the guidance for customizing more powerful and explainable network structures for distinct inputs, leading to dynamic and individualized inference paths. Given an off-the-shelf large network configuration, the proposed Personalized Modular Network (PMN) is learned by selectively activating a sequence of network modules where each of them is designated to recognize particular levels of structured knowledge. Learning semantic configurations and activation of modules to align well with structured knowledge can be regarded as a decision-making procedure, which is solved by a new graph-based reinforcement learning algorithm. Experiments on three semantic segmentation tasks and classification tasks show that our PMN can achieve superior performance with a reduced number of network modules while discovering personalized and explainable module configurations for each input.
[graph, action, prediction, dynamic, recurrent, recognize, previous, early, dataset] [] [image, figure, based, proposed, input, comparison, method, distinct, conditional] [network, pmn, structured, deep, neural, computation, layer, table, search, number, searching, residual, structure, accuracy, inference, performance, usage, convolutional, superior, selection, higher] [policy, reinforcement, personalized, arxiv, modular, preprint, concept, reward, indicates, visual, random] [module, semantic, activated, final, feature, segmentation, object, level, parent, three, guided, hierarchy, easy, person] [knowledge, learning, set, selected, function, classification, specific, learned, space, training, test]
@InProceedings{Liang_2019_CVPR,
  author = {Liang, Xiaodan},
  title = {Learning Personalized Modular Network Guided by Structured Knowledge},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Generative Appearance Model for End-To-End Video Object Segmentation
Joakim Johnander, Martin Danelljan, Emil Brissman, Fahad Shahbaz Khan, Michael Felsberg


One of the fundamental challenges in video object segmentation is to find an effective representation of the target and background appearance. The best performing approaches resort to extensive fine-tuning of a convolutional neural network for this purpose. Besides being prohibitively expensive, this strategy cannot be truly trained end-to-end since the online fine-tuning procedure is not integrated into the offline training of the network. To address these issues, we propose a network architecture that learns a powerful representation of the target and background appearance in a single forward pass. The introduced appearance module learns a probabilistic generative model of target and background feature distributions. Given a new image, it predicts the posterior class probabilities, providing a highly discriminative cue, which is processed in later network modules. Both the learning and prediction stages of our appearance module are fully differentiable, enabling true end-to-end training of the entire segmentation pipeline. Comprehensive experiments demonstrate the effectiveness of the proposed approach on three video object segmentation benchmarks. We close the gap to approaches based on online fine-tuning on DAVIS17, while operating at 15 FPS on a single GPU. Furthermore, our method outperforms all published approaches on the large-scale YouTube-VOS dataset.
[video, online, frame, rgmp, previous, fusion, prediction, propagation, recurrent, favos, forward, causal] [approach, computer, vision, single, pattern, initial, assignment, compute, problem] [appearance, generative, background, component, proposed, conference, based, image, ieee, method, input, comparison, extracted] [network, neural, convolutional, performance, table, output, entire, architecture, deep, obtains, best, extensive, upsampling, inference, gaussian, compare, compared] [model] [segmentation, object, module, feature, mask, foreground, final, coarse, three, map, fully, accurately, segment] [target, training, learning, mixture, set, trained, class, unseen, update, soft, base, train]
@InProceedings{Johnander_2019_CVPR,
  author = {Johnander, Joakim and Danelljan, Martin and Brissman, Emil and Shahbaz Khan, Fahad and Felsberg, Michael},
  title = {A Generative Appearance Model for End-To-End Video Object Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Flexible Convolutional Solver for Fast Style Transfers
Gilles Puy, Patrick Perez


We propose a new flexible deep convolutional neural network (convnet) to perform fast neural style transfers. Our network is trained to solve approximately, but rapidly, the artistic style transfer problem of [Gatys et al.] for arbitrary styles. While solutions already exist, our network is uniquely flexible by design: it can be manipulated at runtime to enforce new constraints on the final output. As examples, we show that it can be modified to perform tasks such as fast photorealistic style transfer, or fast video style transfer with short-term consistency, with no retraining. This flexibility stems from the proposed architecture, which is obtained by unrolling the gradient descent algorithm used in [Gatys et al.]. Regularisations added to [Gatys et al.] to solve a new task can be reported on-the-fly in our network, even after training.
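The unrolling idea itself is generic and easy to sketch: the network's layers mirror K gradient-descent steps on the style-transfer objective, so a regulariser added to the objective at test time only changes the gradient that each "layer" applies, with no retraining. The toy quadratic objective below is a stand-in for the Gatys et al. loss, and the step count and learning rate are arbitrary assumptions.

```python
import numpy as np

def unrolled_solver(x0, grad_fn, steps, lr):
    """Unrolled gradient descent (sketch): each iteration corresponds to one
    'layer' of the architecture; swapping grad_fn swaps the objective at runtime."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad_fn(x)
    return x

# toy objective ||x - t||^2 standing in for the style-transfer loss
t = np.array([1.0, 2.0])
print(unrolled_solver(np.zeros(2), lambda x: 2 * (x - t), steps=50, lr=0.1))   # converges to t
```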
[video, signal, recognition, perform, term, second, temporal, graph] [computer, vision, international, problem, note, pattern, algorithm, solve, inverse, runtime, solution, matrix, laplacian, minimisation, linear, compute] [style, image, conference, method, ieee, artistic, content, proximal, texture, ltv, photo, stylised, control, consistency, proposed, gram, minimise, xlow, regulariser, flickering, user, figure, filtering, xref, arbitrary] [network, fast, deep, neural, gradient, architecture, convolutional, processing, descent, original, flexible, flexibility, number, layer, structure, better, unrolling, computation, filter] [provided, machine, iterative, step, partial] [propose, final, spatial, european, feature] [transfer, loss, trained, training, learning, learned, unsupervised]
@InProceedings{Puy_2019_CVPR,
  author = {Puy, Gilles and Perez, Patrick},
  title = {A Flexible Convolutional Solver for Fast Style Transfers},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Cross Domain Model Compression by Structurally Weight Sharing
Shangqian Gao, Cheng Deng, Heng Huang


Regular model compression methods focus on RGB input. Cross domain tasks demand more DNN models, since each domain often needs its own model. Consequently, for such tasks, the storage cost, memory footprint and computation cost increase dramatically compared to single RGB input. Moreover, the distinct appearance and special structure in cross domain tasks make it difficult to directly apply regular compression methods to them. In this paper, we thus propose a new robust cross domain model compression method. Specifically, the proposed method compresses cross domain models by structural weight sharing, which is achieved by regularizing the models with graph embedding at training time. Owing to the channel-wise weight sharing, the proposed method can reduce computation cost without a specially designed algorithm. In the experiments, the proposed method achieves state-of-the-art results on two diverse tasks: action recognition and RGB-D scene recognition.
[action, recognition, graph, dataset, optical, flow] [rgb, computer, single, scene, vision, pattern, induced, matrix, constraint, depth] [method, input, conference, proposed, figure, spectral, ieee, based, intermediate, image] [compression, weight, layer, neural, pruning, sharing, group, rate, performance, achieve, network, growl, deep, computation, regularization, sparsity, number, table, size, cost, convolutional, compared, channel, correlation, original, processing, rspectral, efficient, reduce, structured, applied, prune, convolution, popular, better, reduced] [model, arxiv, preprint] [feature, map, fully] [domain, cross, learning, similarity, training, embedding, clustering, classification, set, shared, share, hyperparameter, trained, large]
@InProceedings{Gao_2019_CVPR,
  author = {Gao, Shangqian and Deng, Cheng and Huang, Heng},
  title = {Cross Domain Model Compression by Structurally Weight Sharing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
TraVeLGAN: Image-To-Image Translation by Transformation Vector Learning
Matthew Amodio, Smita Krishnaswamy


Interest in image-to-image translation has grown substantially in recent years with the success of unsupervised models based on the cycle-consistency assumption. The achievements of these models have been limited to a particular subset of domains where this assumption yields good results, namely homogeneous domains that are characterized by style or texture differences. We tackle the challenging problem of image-to-image translation where the domains are defined by high-level shapes and contexts, as well as including significant clutter and heterogeneity. For this purpose, we introduce a novel GAN based on preserving intra-domain vector transformations in a latent space learned by a siamese network. The traditional GAN system introduced a discriminator network to guide the generator into generating images in the target domain. To this two-network system we add a third: a siamese network that guides the generator so that each original image shares semantics with its generated version. With this new three-network system, we no longer need to constrain the generators with the ubiquitous cycle-consistency restraint or any other autoencoding regularization. As a result, the generators can learn mappings between more complex domains that differ from each other by more than just style or texture. We demonstrate our model by mapping between high-resolution, arbitrarily chosen classes from the Imagenet dataset completely without pre-processing such as cropping, centering, or filtering unrepresentative images.
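The transformation-vector constraint can be sketched as follows: for a pair of source images, the displacement between their siamese embeddings should match the displacement between the embeddings of their translations, in both direction and magnitude. This is a minimal sketch consistent with the abstract; the equal weighting of the two terms is an assumption, not the paper's exact loss.

```python
import numpy as np

def travel_loss(s_x1, s_x2, s_g1, s_g2):
    """Transformation vector learning (sketch): preserve the vector between two
    source embeddings in the embeddings of their translated versions, keeping
    both its length (L2 term) and its direction (cosine term)."""
    v_src, v_gen = s_x1 - s_x2, s_g1 - s_g2
    l2 = np.sum((v_src - v_gen) ** 2)
    cos = np.dot(v_src, v_gen) / (np.linalg.norm(v_src) * np.linalg.norm(v_gen) + 1e-8)
    return l2 + (1.0 - cos)

# toy siamese embeddings: the generator roughly preserved the pairwise transformation
print(travel_loss(np.array([1.0, 0.0]), np.array([0.0, 0.0]),
                  np.array([2.1, 1.0]), np.array([1.0, 1.0])))
```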
[previous, dataset, learns, work] [well, computer, pattern, property, shape, inverse, point, vision] [travelgan, image, mapping, latent, generator, crossword, figure, transformation, gxy, abacus, pixel, real, translation, style, preserving, realistic, change, conference, generative, rav, ieee, traditional, cycleconsistency, cycle, arbitrary, paired, preserve, forcing] [original, network, siamese, output, neural, imagenet, regularization, standard, table, processing] [generated, arxiv, vector, preprint, discriminator, adversarial, black, gan, relationship, white, van, inception, generates] [semantic, map, semantics, fully] [domain, space, unsupervised, learning, learn, loss, learned, task, transfer, pairwise, target, distance, function, specific, datasets]
@InProceedings{Amodio_2019_CVPR,
  author = {Amodio, Matthew and Krishnaswamy, Smita},
  title = {TraVeLGAN: Image-To-Image Translation by Transformation Vector Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Robust Subjective Visual Property Prediction in Crowdsourcing
Qianqian Xu, Zhiyong Yang, Yangbangyan Jiang, Xiaochun Cao, Qingming Huang, Yuan Yao


The problem of estimating subjective visual properties (SVP) of images (e.g., shoe A is more comfortable than shoe B) is gaining increasing attention. Due to its highly subjective nature, different annotators often exhibit different interpretations of scales when adopting absolute value tests. Therefore, recent investigations turn to collecting pairwise comparisons via crowdsourcing platforms. However, crowdsourcing data usually contains outliers. It is therefore desirable to develop a robust model for learning SVP from crowdsourced noisy annotations. In this paper, we construct a deep SVP prediction model which not only leads to better detection of annotation outliers but also enables learning with extremely sparse annotations. Specifically, we construct a comparison multi-graph based on the collected annotations, where different labeling results correspond to edges with different directions between two vertices. Then, we propose a generalized deep probabilistic framework which consists of an SVP prediction module and an outlier modeling module that work collaboratively and are optimized jointly. Extensive experiments on various benchmark datasets demonstrate that our new approach yields promising results.
[prediction, dataset, work, human, framework, multiple, graph] [outlier, robust, computer, pattern, property, relative, vision, problem, international, reliable, corresponding, ground, algorithm, denote] [age, image, conference, method, ieee, comparison, proposed, subjective, figure, based, crowdsourced, quality, contaminated, traditional, caused, face] [deep, network, better, neural, table, performance, compare, gradient, indicator, sparse, power] [model, visual, majority, goal, step] [detection, adopt, propose, detect, regression, score] [learning, svp, set, noisy, pairwise, data, probabilistic, pair, min, ranking, learn, distribution, crowdsourcing, rank, training, function, yij, specific, china, representation, posterior]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Qianqian and Yang, Zhiyong and Jiang, Yangbangyan and Cao, Xiaochun and Huang, Qingming and Yao, Yuan},
  title = {Deep Robust Subjective Visual Property Prediction in Crowdsourcing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Transferable AutoML by Model Sharing Over Grouped Datasets
Chao Xue, Junchi Yan, Rong Yan, Stephen M. Chu, Yonggang Hu, Yonghua Lin


Automated Machine Learning (AutoML) is an active area of research on the design of deep neural networks for specific tasks and datasets. Given the complexity of discovering new network designs, methods for speeding up the search procedure are becoming important. This paper presents a so-called transferable AutoML approach that leverages previously trained models to speed up the search process for new tasks and datasets. Our approach involves a novel meta-feature extraction technique based on the performance of benchmark models, and a dynamic dataset clustering algorithm based on a Markov process and statistical hypothesis testing. As such, multiple models can share a common structure while having different learned parameters. The transferable AutoML can be applied to search from scratch, search from predesigned models, or transfer from basic cells, according to the difficulty of the given datasets. The experimental results on image classification show notable speedup in overall search time for multiple datasets with negligible loss in accuracy.
[dataset, time, hypothesis, markov, state, sequential, predefined, online, multiple, combined] [algorithm, optimization, well, error, analysis, total, approach, empty, relative] [method, based, proposed, statistical, image, raw] [search, automl, number, performance, metaqnn, bayesian, basic, hyperband, architecture, neural, gaussian, enas, searched, table, deep, network, standalone, sharing, process, compare, searching, accuracy, overhead] [model, random, type, arxiv, preprint, machine, evaluation, consider, probability, reinforcement] [benchmark, feature, assigned, grouping] [datasets, cluster, learning, set, test, clustering, transferable, share, representation, classification, transfer, existing, paper, space, svhn, specific, nonempty, mnist]
@InProceedings{Xue_2019_CVPR,
  author = {Xue, Chao and Yan, Junchi and Yan, Rong and Chu, Stephen M. and Hu, Yonggang and Lin, Yonghua},
  title = {Transferable AutoML by Model Sharing Over Grouped Datasets},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Not to Learn: Training Deep Neural Networks With Biased Data
Byungju Kim, Hyunwoo Kim, Kyungsu Kim, Sungjin Kim, Junmo Kim


We propose a novel regularization algorithm to train deep neural networks in which the data at training time is severely biased. Since a neural network efficiently learns the data distribution, it is likely to learn the bias information in order to categorize the input data. This leads to poor performance at test time if the bias is, in fact, irrelevant to the categorization. In this paper, we formulate a regularization loss based on the mutual information between the feature embedding and the bias. Based on the idea of minimizing this mutual information, we propose an iterative algorithm to unlearn the bias information. We employ an additional network to predict the bias distribution and train the network adversarially against the feature embedding network. At the end of training, the bias prediction network is not able to predict the bias not because it is poorly trained, but because the feature embedding network has successfully unlearned the bias information. We also present quantitative and qualitative experimental results which show that our algorithm effectively removes the bias information from the feature embedding.
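A loose sketch of the joint objective described above: the main network minimises classification cross-entropy plus a term that pushes the bias head (fed with the same features) towards an uninformative, near-uniform output, while the bias head itself is trained separately and adversarially to predict the bias. The entropy-based surrogate for the mutual-information term and the weight `lam` are assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def feature_network_loss(class_logits, y, bias_logits, lam=1.0):
    """Sketch: classify correctly while making the bias prediction computed from
    the shared features as uninformative (close to uniform) as possible."""
    p_cls, p_bias = softmax(class_logits), softmax(bias_logits)
    ce = -np.log(p_cls[y] + 1e-8)
    neg_entropy = np.sum(p_bias * np.log(p_bias + 1e-8))   # minimal when p_bias is uniform
    return ce + lam * neg_entropy

print(feature_network_loss(np.array([2.0, 0.1, -1.0]), 0, np.array([0.2, 0.1, 0.15])))
```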
[predict, prediction, dataset, learns] [algorithm, problem, computer, vision, corresponding, additional, shape] [color, figure, age, gender, proposed, face, image, female, male, based, high, removing, bright, dark, grayscale, denoted, conference, ieee, qualitative, removal, remove] [network, neural, deep, performance, regularization, represents, denotes, table, best] [model, adversarial, provided, evaluation, machine, sampled, decision] [feature, baseline, propose, evaluated] [bias, trained, training, test, data, biased, target, train, colored, learning, set, mutual, unknown, digit, imdb, learn, independent, confusion, mnist, cat, embedding, minimize, unlearn, distribution, label, dog, loss, classifier, oracle, categorize, categorized, classification, class, novel, minimizing]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Byungju and Kim, Hyunwoo and Kim, Kyungsu and Kim, Sungjin and Kim, Junmo},
  title = {Learning Not to Learn: Training Deep Neural Networks With Biased Data},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
IRLAS: Inverse Reinforcement Learning for Architecture Search
Minghao Guo, Zhao Zhong, Wei Wu, Dahua Lin, Junjie Yan


In this paper, we propose an inverse reinforcement learning method for architecture search (IRLAS), which trains an agent to learn to search for network structures that are topologically inspired by human-designed networks. Most existing architecture search approaches totally neglect the topological characteristics of architectures, which results in complicated architectures with high inference latency. Motivated by the fact that human-designed networks are elegant in topology with a fast inference speed, we propose a mirror stimuli function inspired by biological cognition theory to extract the abstract topological knowledge of an expert human-designed network (ResNeXt). To avoid imposing too strong a prior over the search space, we introduce inverse reinforcement learning to train the mirror stimuli function and exploit it as a heuristic guidance for architecture search, easily generalized to different architecture search algorithms. On CIFAR-10, the best architecture searched by our proposed IRLAS achieves a 2.60% error rate. For the ImageNet mobile setting, our model achieves a state-of-the-art top-1 accuracy of 75.28%, while being 2-4x faster than most auto-generated architectures. A fast version of this model is 10% faster than MobileNetV2, while maintaining higher accuracy.
[state, extract] [inverse, topological, topology, problem, algorithm, equation, optimization] [expert, figure, method, proposed, based, image, input, high, elegant] [architecture, search, mirror, network, neural, searching, searched, inference, accuracy, irlas, number, block, size, achieves, design, output, topologically, imagenet, process, andrew, latency, structure, denotes, explore, operation, ftopology, efficient, residual, mobile, higher, compared, effective, weight, original, table, rate, odif, wei, convolutional, resnet] [agent, reinforcement, policy, reward, arxiv, preprint, model, sampled, observed, choose, introduce, abstract, strong, easily] [feature, guidance, count, propose, faster, including] [function, learning, space, knowledge, training, set, train, generalized, trained, existing]
@InProceedings{Guo_2019_CVPR,
  author = {Guo, Minghao and Zhong, Zhao and Wu, Wei and Lin, Dahua and Yan, Junjie},
  title = {IRLAS: Inverse Reinforcement Learning for Architecture Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning for Single-Shot Confidence Calibration in Deep Neural Networks Through Stochastic Inferences
Seonguk Seo, Paul Hongsuck Seo, Bohyung Han


We propose a generic framework to calibrate the accuracy and confidence of a prediction in deep neural networks through stochastic inferences. We interpret stochastic regularization using a Bayesian model, and analyze the relation between the predictive uncertainty of networks and the variance of the prediction scores obtained by stochastic inferences for a single example. Our empirical study shows that the accuracy and the score of a prediction are highly correlated with the variance of multiple stochastic inferences given by stochastic depth or dropout. Motivated by this observation, we design a novel variance-weighted confidence-integrated loss function that is composed of two cross-entropy loss terms with respect to the ground truth and the uniform distribution, which are balanced by the variance of the stochastic prediction scores. The proposed loss function enables us to learn deep neural networks that predict confidence-calibrated scores using a single inference. Our algorithm presents outstanding confidence calibration performance and improves classification accuracy when combined with two popular stochastic regularization techniques, stochastic depth and dropout, in multiple models and datasets; it significantly alleviates the overconfidence issue in deep neural networks by training networks to achieve prediction accuracy proportional to the confidence of the prediction.
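A rough numpy sketch of the variance-weighted loss: T stochastic passes (e.g. with dropout active) yield a per-example variance that interpolates between ordinary cross-entropy against the label and cross-entropy against the uniform distribution. The crude normalisation of the variance into [0, 1] below is an assumption and differs from the paper's exact definition.

```python
import numpy as np

def vwci_loss(stochastic_probs, y):
    """Variance-weighted confidence-integrated loss (sketch): high prediction
    variance shifts weight from the ground-truth term to the uniform term,
    discouraging overconfident outputs on uncertain examples."""
    mean_p = stochastic_probs.mean(axis=0)
    alpha = np.clip(stochastic_probs[:, y].var() / 0.25, 0.0, 1.0)   # crude normalisation
    ce_label = -np.log(mean_p[y] + 1e-8)
    ce_uniform = -np.log(mean_p + 1e-8).mean()        # cross-entropy to the uniform distribution
    return (1.0 - alpha) * ce_label + alpha * ce_uniform

probs = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.6, 0.3, 0.1]])   # 3 stochastic passes
print(vwci_loss(probs, y=0))
```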
[prediction, multiple, framework, term, dataset, forward, nll] [confidence, calibration, estimate, normalized, note, single, depth, respect, algorithm, estimation, calibrated, technique, calibrate, approach] [proposed, noise, based, figure, method, high] [stochastic, accuracy, neural, deep, variance, regularization, bayesian, network, performance, tiny, imagenet, dropout, number, inference, table, approximate] [coverage, example, model, empirical, random] [score, baseline, average, predicted, propose] [vwci, loss, uncertainty, training, function, trained, sample, ece, log, distribution, set, classification, learning, dkl, brier, predictive, uniform, test, generic, learn, interpretation, label, mce, novel]
@InProceedings{Seo_2019_CVPR,
  author = {Seo, Seonguk and Hongsuck Seo, Paul and Han, Bohyung},
  title = {Learning for Single-Shot Confidence Calibration in Deep Neural Networks Through Stochastic Inferences},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
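A minimal PyTorch-style sketch of the variance-weighted confidence-integrated loss described in the abstract above: two cross-entropy terms, one against the ground truth and one against the uniform distribution, balanced per example by the variance of the stochastic prediction scores. The number of stochastic passes T and the scaling of the variance term into [0, 1] are illustrative assumptions, not the paper's exact settings.

import torch
import torch.nn.functional as F

def vwci_loss(stochastic_logits, targets):
    # stochastic_logits: (T, B, C) logits from T stochastic forward passes
    # (e.g. with dropout or stochastic depth kept active at training time).
    probs = torch.softmax(stochastic_logits, dim=-1)      # (T, B, C)
    alpha = probs.var(dim=0).mean(dim=-1)                 # per-example variance of prediction scores
    alpha = alpha / (alpha.max() + 1e-8)                  # assumed scaling into [0, 1]
    mean_logits = stochastic_logits.mean(dim=0)           # (B, C)
    ce_gt = F.cross_entropy(mean_logits, targets, reduction="none")
    log_p = F.log_softmax(mean_logits, dim=-1)
    ce_uniform = -log_p.mean(dim=-1)                      # cross-entropy against the uniform distribution
    return ((1.0 - alpha) * ce_gt + alpha * ce_uniform).mean()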
Attention-Based Adaptive Selection of Operations for Image Restoration in the Presence of Unknown Combined Distortions
Masanori Suganuma, Xing Liu, Takayuki Okatani


Many studies have been conducted so far on image restoration, the problem of restoring a clean image from its distorted version. There are many different types of distortion affecting image quality. Previous studies have focused on single types of distortion, proposing methods for removing them. However, image quality degrades due to multiple factors in the real world. Thus, depending on applications, e.g., vision for autonomous cars or surveillance cameras, we need to be able to deal with multiple combined distortions with unknown mixture ratios. For this purpose, we propose a simple yet effective layer architecture of neural networks. It performs multiple operations in parallel, which are weighted by an attention mechanism to enable selection of proper operations depending on the input. The layer can be stacked to form a deep network, which is differentiable and thus can be trained in an end-to-end fashion by gradient descent. The experimental results show that the proposed method works better than previous methods by a good margin on tasks of restoring images with multiple combined distortions.
[combined, multiple, previous, motion, dataset, work, consists, employed] [single, distortion, distorted, vision, range, chair] [image, proposed, method, input, jpeg, ssim, psnr, restoration, figure, restored, blur, raindrop, noise, study, dncnn, degradation, clean, quality, based, generative, stack] [layer, network, deep, operation, output, convolutional, gaussian, table, convolution, compression, performance, residual, neural, standard, applied, block, architecture, number, gradient, better, group, cnns, effectiveness, channel] [attention, mechanism, type, depending, adversarial, model, evaluate, generated, indicates] [feature, extraction, map, person, detection, three] [set, trained, training, learning, test, randomly, novel]
@InProceedings{Suganuma_2019_CVPR,
  author = {Suganuma, Masanori and Liu, Xing and Okatani, Takayuki},
  title = {Attention-Based Adaptive Selection of Operations for Image Restoration in the Presence of Unknown Combined Distortions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
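As a rough illustration of the layer described in the abstract above, the sketch below runs several candidate operations in parallel and mixes their outputs with input-dependent attention weights. The particular operations (plain convolutions of different kernel sizes) and the attention head are illustrative assumptions rather than the paper's exact design.

import torch
import torch.nn as nn

class AttentiveOperationLayer(nn.Module):
    # Runs num_ops operations in parallel and blends them with attention weights.
    def __init__(self, channels, num_ops=4):
        super().__init__()
        self.ops = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5, 7)[:num_ops]]
        )
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, num_ops, kernel_size=1),
            nn.Softmax(dim=1),
        )

    def forward(self, x):
        weights = self.attention(x)                                # (B, num_ops, 1, 1)
        outputs = torch.stack([op(x) for op in self.ops], dim=1)   # (B, num_ops, C, H, W)
        return (weights.unsqueeze(2) * outputs).sum(dim=1)         # weighted sum over operations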
Fully Learnable Group Convolution for Acceleration of Deep Neural Networks
Xijun Wang, Meina Kan, Shiguang Shan, Xilin Chen


Benefiting from its great success on many tasks, deep learning is increasingly used on low-computational-cost devices, e.g. smartphones, embedded devices, etc. To reduce the high computational and memory cost, in this work we propose a fully learnable group convolution module (FLGC for short) which is quite efficient and can be embedded into any deep neural network for acceleration. Specifically, our proposed method automatically learns the group structure in the training stage in a fully end-to-end manner, leading to a better structure than the existing pre-defined, two-step, or iterative strategies. Moreover, our method can be further combined with depthwise separable convolution, resulting in a 5x acceleration over the vanilla ResNet-50 on a single CPU. An additional advantage is that in our FLGC the number of groups can be set to any value, not necessarily 2^k as in most existing methods, allowing a better trade-off between accuracy and speed. As evaluated in our experiments, our method achieves better performance than existing learnable group convolution and standard group convolution when using the same number of groups.
[recognition, time, considering, dataset] [vision, computer, pattern, matrix, international] [input, conference, ieee, proposed, method, face, figure, image, comparison] [group, convolution, deep, neural, flgc, network, efficient, layer, standard, number, selection, convolutional, structure, accuracy, acceleration, pruning, better, learnable, cost, achieves, channel, binary, filter, separable, architecture, condensenet, optimized, ith, table, processing, computational, output, inference, replace, denotes, automatically, design, madds, applying, verification, fewer, sparse, connection, speed, dynamically] [model, arxiv, preprint, indicates, embedded, iterative, easily, simply] [fully, grouping, including] [learning, existing, classification, training, objective, loss, meta, large, representation, hard, function, softmax, china]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Xijun and Kan, Meina and Shan, Shiguang and Chen, Xilin},
  title = {Fully Learnable Group Convolution for Acceleration of Deep Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
EIGEN: Ecologically-Inspired GENetic Approach for Neural Network Structure Searching From Scratch
Jian Ren, Zhe Li, Jianchao Yang, Ning Xu, Tianbao Yang, David J. Foran


Designing the structure of neural networks is considered one of the most challenging tasks in deep learning, especially when there is little prior knowledge about the task domain. In this paper, we propose an Ecologically-Inspired GENetic (EIGEN) approach that uses the concepts of succession, extinction, mimicry, and gene duplication to search neural network structures from scratch with poorly initialized simple networks and few constraints imposed during the evolution, as we assume no prior knowledge about the task domain. Specifically, we first use primary succession to rapidly evolve a population of poorly initialized neural network structures into a more diverse population, followed by a secondary succession stage for fine-grained searching based on the networks from the primary succession. Extinction is applied in both stages to reduce computational cost. Mimicry is employed during the entire evolution process to help the inferior networks imitate the behavior of a superior network, and gene duplication is utilized to duplicate the learned blocks of novel structures, both of which help to find better network structures. Experimental results show that our proposed approach can achieve similar or better performance compared to the existing genetic approaches with dramatically reduced computation cost. For example, the network discovered by our approach on the CIFAR-100 dataset achieves 78.1% test accuracy under 120 GPU hours, compared to 77.0% test accuracy in more than 65,536 GPU hours in [35].
[work, dataset, individual] [approach, eigen, algorithm, computer, vision, optimal, pattern, international, limited, denote, analysis] [figure, conference, based, proposed, image, ieee, comparison, includes] [network, neural, searching, succession, genetic, search, mimicry, mutation, duplication, structure, performance, accuracy, architecture, population, gene, secondary, better, fitness, computation, discovered, cost, compared, rapid, computational, evolution, convolutional, number, block, deep, efficient, inferior, best, layer, table, superior, operation, ecological, gpuh, size, scratch, reduce, achieve, pooling, automatically, order, extinction] [primary, arxiv, preprint, generation, evolutionary, machine] [propose, score, parent] [test, training, learning, randomly, knowledge, experimental, learned, strategy, space, datasets]
@InProceedings{Ren_2019_CVPR,
  author = {Ren, Jian and Li, Zhe and Yang, Jianchao and Xu, Ning and Yang, Tianbao and Foran, David J.},
  title = {EIGEN: Ecologically-Inspired GENetic Approach for Neural Network Structure Searching From Scratch},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Incremental Hashing Network for Efficient Image Retrieval
Dayan Wu, Qi Dai, Jing Liu, Bo Li, Weiping Wang


Hashing has shown great potential in large-scale image retrieval due to its storage and computation efficiency, especially the recent deep supervised hashing methods. To achieve promising performance, deep supervised hashing methods require a large amount of training data from different classes. However, when images of new categories emerge, existing deep hashing methods have to retrain the CNN model and regenerate hash codes for all the database images, which is impractical for a large-scale retrieval system. In this paper, we propose a novel deep hashing framework, called Deep Incremental Hashing Network (DIHN), for learning hash codes in an incremental manner. DIHN learns the hash codes for new images directly, while keeping the old ones unchanged. Simultaneously, a deep hash function for the query set is learned by preserving the similarities between training points. Extensive experiments on two widely used image retrieval benchmarks demonstrate that the proposed DIHN framework can significantly decrease the training time while keeping state-of-the-art retrieval accuracy.
[time, learns, framework, work, ting, previous, utilized] [note, directly, denote, algorithm, matrix, discrete, associated, approach] [database, image, proposed, figure, tanh, traditional, method, comparison, row] [deep, original, binary, network, complexity, performance, number, neural, table, denotes, achieve, convolutional, computational, efficient, top, extensive, best] [query, model, length, sign, sampled, generate, calculated] [map, cnn, semantic, adopt, three, feature] [hashing, hash, incremental, dihn, learning, training, set, retrieval, function, supervised, adsh, asymmetric, loss, learn, data, similarity, ksij, pairwise, code, learned, class, tao, existing, representation, min, update, sample, large, novel]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Dayan and Dai, Qi and Liu, Jing and Li, Bo and Wang, Weiping},
  title = {Deep Incremental Hashing Network for Efficient Image Retrieval},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Robustness via Curvature Regularization, and Vice Versa
Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, Pascal Frossard


State-of-the-art classifiers have been shown to be largely vulnerable to adversarial perturbations. One of the most effective strategies to improve robustness is adversarial training. In this paper, we investigate the effect of adversarial training on the geometry of the classification landscape and decision boundaries. We show in particular that adversarial training leads to a significant decrease in the curvature of the loss surface with respect to inputs, leading to a drastically more "linear" behaviour of the network. Using a locally quadratic approximation, we provide theoretical evidence on the existence of a strong relation between large robustness and small curvature. To further show the importance of reduced curvature for improving the robustness, we propose a new regularizer that directly minimizes curvature of the loss surface, and leads to adversarial robustness that is on par with adversarial training. Besides being a more efficient and principled alternative to adversarial training, the proposed regularizer confirms our claims on the importance of exhibiting quasi-linear behavior in the vicinity of data points in order to achieve robustness.
[consists, second] [curvature, note, respect, robust, analysis, international, normal, computed, surface, geometric, approach, point, provide, denote, finite] [proposed, conference, figure, profile, high, input, result, difference] [network, small, original, gradient, regularizer, decrease, regularization, neural, deep, accuracy, efficient, lower, increasing, hessian, norm, regularized, order, lead, denotes, improving, comparable] [adversarial, robustness, decision, cure, adversarially, strong, random, existence, vicinity, pgd, xadv, attack, spsa, arxiv, preprint] [improve, boundary] [loss, training, trained, function, learning, data, large, corresponds, set, upper, svhn, class, test, classifier, observe, reported]
@InProceedings{Moosavi-Dezfooli_2019_CVPR,
  author = {Moosavi-Dezfooli, Seyed-Mohsen and Fawzi, Alhussein and Uesato, Jonathan and Frossard, Pascal},
  title = {Robustness via Curvature Regularization, and Vice Versa},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
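One way to realize the curvature regularizer described above is to penalize how much the input-gradient of the loss changes along a probe direction near each data point, a finite-difference proxy for a Hessian-vector product. The direction choice, step size h, and reduction below are placeholders for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def curvature_penalty(model, x, y, h=1.5):
    # Gradient of the loss with respect to the input at the data point.
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x), y), x, create_graph=True)[0]
    # Probe direction: normalized sign of the gradient (an assumed choice).
    z = torch.sign(grad).detach()
    z = z / (z.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1))) + 1e-8)
    # Finite-difference estimate of the gradient change along z.
    x_h = (x.detach() + h * z).requires_grad_(True)
    grad_h = torch.autograd.grad(F.cross_entropy(model(x_h), y), x_h, create_graph=True)[0]
    return (grad_h - grad).flatten(1).norm(dim=1).pow(2).mean()

# Usage sketch: total = F.cross_entropy(model(x), y) + lam * curvature_penalty(model, x, y)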
SparseFool: A Few Pixels Make a Big Difference
Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard


Deep Neural Networks have achieved extraordinary results on image classification tasks, but have been shown to be vulnerable to attacks with carefully crafted perturbations of the input data. Although most attacks usually change the values of many of an image's pixels, it has been shown that deep networks are also vulnerable to sparse alterations of the input. However, no computationally efficient method has been proposed to compute sparse perturbations. In this paper, we exploit the low mean curvature of the decision boundary, and propose SparseFool, a geometry-inspired sparse attack that controls the sparsity of the perturbations. Extensive evaluations show that our approach computes sparse perturbations very fast, and scales efficiently to high-dimensional data. We further analyze the transferability and the visual effects of the perturbations, and show the existence of shared semantic information across the images and the networks. Finally, we show that adversarial training can only slightly improve the robustness against sparse additive perturbations computed with SparseFool.
[recognition, time, dataset, dynamic] [algorithm, problem, computer, computed, pattern, vision, optimization, solving, point, international, linear, dimensional, provide, corresponding, projection, solution, compute, geometry, minimal, normal] [image, ieee, conference, method, high, proposed, figure, control, noise, input] [sparse, deep, neural, sparsity, approximated, imagenet, computing, network, rate, efficient, number, table, performance, magnitude, iteration, efficiently] [adversarial, sparsefool, fooling, perturbed, perturbation, decision, attack, jsma, robustness, computes, datapoint, deer, perceptibility, validity, linearized, iterate, median, execution, simple] [boundary, average, semantic, improve, propose] [learning, mnist, observe, classification, shared, corresponds, class, dog, cat, exploit, training, label, classifier]
@InProceedings{Modas_2019_CVPR,
  author = {Modas, Apostolos and Moosavi-Dezfooli, Seyed-Mohsen and Frossard, Pascal},
  title = {SparseFool: A Few Pixels Make a Big Difference},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Interpretable and Fine-Grained Visual Explanations for Convolutional Neural Networks
Jorg Wagner, Jan Mathias Kohler, Tobias Gindele, Leon Hetzel, Jakob Thaddaus Wiedemer, Sven Behnke


To verify and validate networks, it is essential to gain insight into their decisions, limitations as well as possible shortcomings of training data. In this work, we propose a post-hoc, optimization based visual explanation method, which highlights the evidence in the input image for a specific prediction. Our approach is based on a novel technique to defend against adversarial evidence (i.e. faulty evidence due to artefacts) by filtering gradients during optimization. The defense does not depend on human-tuned parameters. It enables explanations which are both fine-grained and preserve the characteristics of images, such as edges and colors. The explanations are interpretable, suited for visualizing detailed evidence and can be tested as they are valid model inputs. We qualitatively and quantitatively evaluate our approach on a multitude of models and datasets.
[version, prediction, recognition] [optimization, international, computed, vision, computer, approach, additional, mct, well, compute, pattern, technique, directly, suited, valid, additionally] [image, conference, based, method, color, input, ieee, proposed, preservation, produce, figure, reference, visually, verify, quantitatively] [deep, neural, original, convolutional, imagenet, gradient, activation, validation, compared, output, compare] [explanation, adversarial, model, evidence, deletion, defense, game, visual, fgvis, generate, arxiv, evaluate, faithfulness, perturbed, generation, introduced, interpretable, generated, visualize, correctly, school, machine] [mask, diabetic, retinopathy, fundus, propose, medical, retinal] [class, learning, target, metric, data, training, classification, bias, similarity, discriminative, novel]
@InProceedings{Wagner_2019_CVPR,
  author = {Wagner, Jorg and Mathias Kohler, Jan and Gindele, Tobias and Hetzel, Leon and Thaddaus Wiedemer, Jakob and Behnke, Sven},
  title = {Interpretable and Fine-Grained Visual Explanations for Convolutional Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Structured Pruning of Neural Networks With Budget-Aware Regularization
Carl Lemaire, Andrew Achkar, Pierre-Marc Jodoin


Pruning methods have been shown to be effective at reducing the size of deep neural networks while keeping accuracy almost intact. Among the most effective methods are those that prune a network while training it with a sparsity prior loss and learnable dropout parameters. A shortcoming of these approaches, however, is that neither the size nor the inference speed of the pruned network can be controlled directly; yet this is a key feature for targeting deployment of CNNs on low-power hardware. To overcome this, we introduce a budgeted regularized pruning framework for deep CNNs. Our approach naturally fits into traditional neural network training as it consists of a learnable masking layer, a novel budget-aware objective function, and the use of knowledge distillation. We also provide insights on how to prune a residual network and how this can lead to new architectures. Experimental results reveal that CNNs pruned with our method are more accurate and less compute-hungry than state-of-the-art methods. Also, our approach is more effective at preventing accuracy collapse in case of severe pruning; this allows pruning factors of up to 16x without a significant accuracy drop.
[signal, transition, work, framework] [volume, approach, compute, constraint, optimization, well, allow, problem, computed, differentiable] [method, figure, prior, removing, proposed] [pruning, network, neural, pruned, residual, budget, number, barrier, proc, sparsity, block, effective, deep, accuracy, layer, output, size, dropout, factor, structured, convolutional, better, parameter, prune, budgeted, convolution, typical, width, architecture, unpruned, flop, mentioned, efficient, reduction, severe, respecting, reduce, approximation, weight, morphnet, dsl, regularization, canada, reducing] [variational, arxiv, preprint, requires, random] [feature, cnn, map, connectivity, three] [training, function, learning, loss, knowledge, test, metric, novel, main, distillation, distribution, objective]
@InProceedings{Lemaire_2019_CVPR,
  author = {Lemaire, Carl and Achkar, Andrew and Jodoin, Pierre-Marc},
  title = {Structured Pruning of Neural Networks With Budget-Aware Regularization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MBS: Macroblock Scaling for CNN Model Reduction
Yu-Hsun Lin, Chun-Nan Chou, Edward Y. Chang


In this paper we propose the macroblock scaling (MBS) algorithm, which can be applied to various CNN architectures to reduce their model size. MBS adaptively reduces each CNN macroblock depending on its information redundancy measured by our proposed effective flops. Empirical studies conducted with ImageNet and CIFAR-10 attest that MBS can reduce the model size of some already compact CNN models, e.g., MobileNetV2 (25.03% further reduction) and ShuffleNet (20.74%), and even ultra-deep ones such as ResNet-101 (51.67%) and ResNet-1202 (72.71%) with negligible accuracy degradation. MBS also performs better reduction at a much lower cost than the state-of-the-art optimization-based methods do. MBS's simplicity and efficiency, its flexibility to work with any CNN model, and its scalability to work with models of any depth make it an attractive choice for CNN model size reduction.
[prediction, work, consists, early] [field, algorithm, total, define, estimate, defined, additional] [input, proposed, enhancement, figure, image, ieee, method, based] [convolution, reduction, macroblock, size, effective, accuracy, layer, receptive, channel, scaling, deep, neural, table, number, convolutional, compact, filter, complexity, density, output, sizej, ecj, ratio, factor, imagenet, shufflenet, mobilenet, efficient, network, width, etotal, ebase, reduce, redundancy, pruning, binary, resnet, relu, cnns, widthc, reduces, computation, divided, neuron, bitrate, applied, negligible, redundant, achieves, architecture, tensor, widthj, larger] [model, step] [cnn, feature] [training, base, set, setting, learning, trained, large]
@InProceedings{Lin_2019_CVPR,
  author = {Lin, Yu-Hsun and Chou, Chun-Nan and Chang, Edward Y.},
  title = {MBS: Macroblock Scaling for CNN Model Reduction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fast Neural Architecture Search of Compact Semantic Segmentation Models via Auxiliary Cells
Vladimir Nekrasov, Hao Chen, Chunhua Shen, Ian Reid


Automated design of neural network architectures tailored for a specific task is an extremely promising, albeit inherently difficult, avenue to explore. While most results in this domain have been achieved on image classification and language modelling problems, here we concentrate on dense per-pixel tasks, in particular, semantic image segmentation using fully convolutional networks. In contrast to the aforementioned areas, the design choices of a fully convolutional network require several changes, ranging from the sort of operations that need to be used - e.g., dilated convolutions - to a solving of a more difficult optimisation problem. In this work, we are particularly interested in searching for high-performance compact segmentation architectures, able to run in real-time using limited resources. To achieve that, we intentionally over-parameterise the architecture during the training time via a set of auxiliary cells that provide an intermediate supervisory signal and can be omitted during the evaluation phase. The design of the auxiliary cell is emitted by a controller, a neural network with the fixed structure trained using reinforcement learning. More crucially, we demonstrate how to efficiently search for these architectures within limited time and computational budgets. In particular, we rely on a progressive strategy that terminates non-promising architectures from being further trained, and on Polyak averaging coupled with knowledge distillation to speed-up the convergence. Quantitatively, in 8 GPU-days our approach discovers a set of architectures performing on-par with state-of-the-art among compact models on the semantic segmentation, pose estimation and depth prediction tasks. Code will be made available here: https://github.com/drsleep/nas-segm-pytorch
[work, second, averaging, multiple, time, prediction, human] [optimisation, depth, provide, pose, single, estimate, dense, estimation, rely, approach, emitted] [image, ieee, intermediate, figure, based, method] [search, cell, architecture, neural, network, output, controller, number, performance, compact, design, structure, block, convolutional, better, rate, achieve, conv, searching, polyak, efficient, process, applied, validation, computational, best, deep, apply, full, table, progressive] [decoder, sampled, encoder, reward, model, reinforcement, consider] [semantic, segmentation, stage, supervision, fully, pascal, coco] [training, auxiliary, learning, knowledge, select, set, train, task, classification, trained, distillation, strategy, sample, randomly, sampling]
@InProceedings{Nekrasov_2019_CVPR,
  author = {Nekrasov, Vladimir and Chen, Hao and Shen, Chunhua and Reid, Ian},
  title = {Fast Neural Architecture Search of Compact Semantic Segmentation Models via Auxiliary Cells},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Generating 3D Adversarial Point Clouds
Chong Xiang, Charles R. Qi, Bo Li


Deep neural networks are known to be vulnerable to adversarial examples which are carefully crafted instances to cause the models to make wrong predictions. While adversarial examples for 2D images and CNNs have been extensively studied, less attention has been paid to 3D data such as point clouds. Given many safety-critical 3D applications such as autonomous driving, it is important to study how adversarial point clouds could affect current deep 3D models. In this work, we propose several novel algorithms to craft adversarial point clouds against PointNet, a widely used deep neural network for point cloud processing. Our algorithms work in two ways: adversarial point perturbation and adversarial point generation. For point perturbation, we shift existing points negligibly. For point generation, we generate either a set of independent and scattered points or a small number (1-3) of point clusters with meaningful shapes such as balls and airplanes which could be hidden in the human psyche. In addition, we formulate six perturbation measurement metrics tailored to the attacks in point clouds and conduct extensive experiments to evaluate the proposed algorithms on the ModelNet40 3D shape classification dataset. Overall, our attack algorithms achieve a success rate higher than 99% for all targeted attacks.
[focus] [point, cloud, pointnet, chamfer, equation, optimization, measurement, well, algorithm, constraint, initial, case, computer, shape, robust, provide, farthest, vision] [proposed, figure, study, input, meaningful, conference] [number, original, rate, performance, deep, neural, table, small, add, search, achieve, structure, norm, optimize, max, parameter] [adversarial, attack, perturbation, arxiv, preprint, success, critical, generate, attacking, robustness, adding, generating, model, vulnerable, find, targeted, generation, choose, defense, adversary, evaluation, unnoticeable, generated] [object, hausdorff, visualization, propose, three] [data, set, distance, learning, loss, independent, target, cluster, metric, existing, class, space, min, large, selected]
@InProceedings{Xiang_2019_CVPR,
  author = {Xiang, Chong and Qi, Charles R. and Li, Bo},
  title = {Generating 3D Adversarial Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Partial Order Pruning: For Best Speed/Accuracy Trade-Off in Neural Architecture Search
Xin Li, Yiming Zhou, Zheng Pan, Jiashi Feng


Achieving good speed and accuracy trade-off on a target platform is very important in deploying deep neural networks in real world scenarios. However, most existing automatic architecture search approaches only concentrate on high performance. In this work, we propose an algorithm that can offer better speed/accuracy trade-off of searched networks, which is termed "Partial Order Pruning". It prunes the architecture search space with a partial order assumption to automatically search for the architectures with the best speed and accuracy trade-off. Our algorithm explicitly takes profile information about the inference speed on the target platform into consideration. With the proposed algorithm, we present several Dongfeng (DF) networks that provide high accuracy and fast inference speed on various application GPU platforms. By further searching decoder architectures, our DF-Seg real-time segmentation networks yield state-of-the-art speed/accuracy trade-off on both the target embedded device and the high-end GPU.
[time, explicitly] [algorithm, provide, assumption, depth, good] [figure, resolution, image, proposed, comparison, high, conduct] [architecture, search, network, inference, latency, order, accuracy, speed, neural, searching, efficient, better, fps, table, higher, convolutional, block, platform, deep, convolution, achieves, width, lower, number, best, pruning, achieve, miouclass, mobile, layer, icnet, design, building, process, comparable, pruned, searched, fast, compared, imagenet, gtx, residual, shufflenet, employ, tensor, actual] [partial, decoder, find, embedded] [segmentation, backbone, semantic, boundary, stage, cnn, spatial] [target, set, space, train, training, trained, learning]
@InProceedings{Li_2019_CVPR,
  author = {Li, Xin and Zhou, Yiming and Pan, Zheng and Feng, Jiashi},
  title = {Partial Order Pruning: For Best Speed/Accuracy Trade-Off in Neural Architecture Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Memory in Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity From Spatiotemporal Dynamics
Yunbo Wang, Jianjin Zhang, Hongyu Zhu, Mingsheng Long, Jianmin Wang, Philip S. Yu


Natural spatiotemporal processes can be highly non-stationary in many ways, e.g. the low-level non-stationarity such as spatial correlations or temporal dependencies of local pixel values; and the high-level variations such as the accumulation, deformation or dissipation of radar echoes in precipitation forecasting. From Cramer's Decomposition, any non-stationary process can be decomposed into deterministic, time-variant polynomials, plus a zero-mean stochastic term. By applying differencing operations appropriately, we may turn time-variant polynomials into a constant, making the deterministic component predictable. However, most previous recurrent neural networks for spatiotemporal prediction do not use the differential signals effectively, and their relatively simple state transition functions prevent them from learning too complicated variations in spacetime. We propose the Memory In Memory (MIM) networks and corresponding recurrent blocks for this purpose. The MIM blocks exploit the differential signals between adjacent recurrent states to model the non-stationary and approximately stationary properties in spatiotemporal dynamics with two cascaded, self-renewed memory modules. By stacking multiple MIM blocks, we could potentially handle higher-order non-stationarity. The MIM networks achieve the state-of-the-art results on four spatiotemporal prediction tasks across both synthetic and real-world datasets. We believe that the general idea of this work can be potentially applied to other time-series forecasting tasks.
[mim, spatiotemporal, prediction, recurrent, video, lstm, stationary, predrnn, radar, causal, modeling, forget, sequence, temporal, differencing, state, frame, transition, future, hidden, moving, previous, work, time, precipitation, forecasting, traffic, capture, rnns, frnn, deterministic, stacking, dataset, flow, mims, csi, multiple, series, long] [differential, ground, truth, local] [figure, mse, pixel, input, proposed, complicated, synthetic] [network, table, neural, block, layer, process, stochastic, gate, deep, higher, better, capability, convolutional, standard] [memory, model, generated, natural] [module, predicted, cascaded, neighboring] [learning, predictive, idea]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Yunbo and Zhang, Jianjin and Zhu, Hongyu and Long, Mingsheng and Wang, Jianmin and Yu, Philip S.},
  title = {Memory in Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity From Spatiotemporal Dynamics},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Variational Information Distillation for Knowledge Transfer
Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D. Lawrence, Zhenwen Dai


Transferring knowledge from a teacher neural network pretrained on the same or a similar task to a student neural network can significantly improve the performance of the student neural network. Existing knowledge transfer approaches match the activations or the corresponding hand-crafted features of the teacher and the student networks. We propose an information-theoretic framework for knowledge transfer which formulates knowledge transfer as maximizing the mutual information between the teacher and the student networks. We compare our method with existing knowledge transfer methods on both knowledge distillation and transfer learning tasks and show that our method consistently outperforms existing methods. We further demonstrate the strength of our method on knowledge transfer across heterogeneous network architectures by transferring knowledge from a convolutional neural network (CNN) to a multi-layer perceptron (MLP) on CIFAR-10. The resulting MLP significantly outperforms the state-of-the-art methods and achieves similar performance to the CNN with a single convolutional layer.
[framework, outperforms, dataset] [computer, linear, matching, equation, corresponding, single, maximization, match, vision, note, squared, varying] [intermediate, proposed, input, image, based, method, transferring, high, conference, ieee, demonstrate, amount, figure] [network, layer, neural, performance, convolutional, size, mlp, deep, compare, output, table, variance, number, regularization, small, unit, residual, designed, implementation] [variational, evaluate, arxiv, preprint, maximizing, attention, consider, choice] [spatial, cnn, propose] [knowledge, transfer, student, teacher, learning, distribution, mutual, training, data, distillation, task, target, fitnet, log, vid, lwf, trained, existing, source, dimension, nst, classification, set, corresponds, class, large, loss, selected, function, logit]
@InProceedings{Ahn_2019_CVPR,
  author = {Ahn, Sungsoo and Xu Hu, Shell and Damianou, Andreas and Lawrence, Neil D. and Dai, Zhenwen},
  title = {Variational Information Distillation for Knowledge Transfer},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
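A minimal sketch of one way to instantiate the variational bound above: model the teacher activation with a Gaussian whose mean is regressed from the student activation and whose per-channel variance is learned, then minimize the negative log-likelihood. The 1x1-conv regressor and softplus parameterization are assumptions, and matching the spatial sizes of the two feature maps is left to the caller.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VIDLoss(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
        self.scale = nn.Parameter(torch.zeros(teacher_channels))  # per-channel variance parameter

    def forward(self, student_feat, teacher_feat):
        mu = self.regressor(student_feat)                       # predicted mean of teacher activations
        var = F.softplus(self.scale).view(1, -1, 1, 1) + 1e-6   # keep variance positive
        nll = 0.5 * (torch.log(var) + (teacher_feat - mu) ** 2 / var)
        return nll.mean()                                       # -log q(teacher | student), up to constants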
You Look Twice: GaterNet for Dynamic Filter Selection in CNNs
Zhourong Chen, Yang Li, Samy Bengio, Si Si


The concept of conditional computation for deep nets has been proposed previously to improve model performance by selectively using only parts of the model conditioned on the sample it is processing. In this paper, we investigate input-dependent dynamic filter selection in deep convolutional neural networks (CNNs). The problem is interesting because the idea of forcing different parts of the model to learn from different types of samples may help us acquire better filters in CNNs, improve the model generalization performance and potentially increase the interpretability of model behavior. We propose a novel yet simple framework called GaterNet, which involves a backbone and a gater network. The backbone network is a regular CNN that performs the major computation needed for making a prediction, while a global gater network is introduced to generate binary gates for selectively activating filters in the backbone network based on each input. Extensive experiments on CIFAR and ImageNet datasets show that our models consistently outperform the original models with a large margin. On CIFAR-10, our model also improves upon state-of-the-art results.
[gating, dynamic, prediction, multiple, performing] [error, note, discrete, additional, defined, equation, good] [method, conditional, figure, input, proposed, image] [network, gater, number, deep, binary, computation, resnet, neural, better, layer, filter, original, dynamically, table, residual, selection, convolutional, imagenet, performance, gate, size, shallow, called, sparse, small, apply, gaternet, needed, cifar, bottleneck, denotes] [model, vector, generate, gated, improved, introduce] [backbone, feature, cnn, improves, baseline, map] [training, learning, test, large, classification, learn, function, consistently, subset, distribution, sample, select, set, data, main]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Zhourong and Li, Yang and Bengio, Samy and Si, Si},
  title = {You Look Twice: GaterNet for Dynamic Filter Selection in CNNs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
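A rough sketch of the input-dependent filter gating described above: a small gater network maps the input to (approximately) binary per-channel gates that switch backbone filters on or off. The gater architecture and the straight-through trick used for binarization are illustrative assumptions.

import torch
import torch.nn as nn

class Gater(nn.Module):
    # Maps an input image to per-filter gates for one backbone layer.
    def __init__(self, in_channels, num_filters):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_filters),
        )

    def forward(self, x):
        soft = torch.sigmoid(self.features(x))
        hard = (soft > 0.5).float()
        # Straight-through estimator: forward pass uses hard gates, backward uses soft ones.
        return hard + soft - soft.detach()

# gates = Gater(3, 64)(images)                        # (B, 64) near-binary gates
# gated = backbone_feature * gates[:, :, None, None]  # zero out de-selected filters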
SpherePHD: Applying CNNs on a Spherical PolyHeDron Representation of 360deg Images
Yeonkun Lee, Jaeseok Jeong, Jongseob Yun, Wonjune Cho, Kuk-Jin Yoon


Omni-directional cameras have many advantages over conventional cameras in that they have a much wider field-of-view (FOV). Accordingly, several approaches have been proposed recently to apply convolutional neural networks (CNNs) to omni-directional images for various visual tasks. However, most of them use image representations defined in the Euclidean space after transforming the omni-directional views originally formed in the non-Euclidean space. This transformation leads to shape distortion due to nonuniform spatial resolving power and the loss of continuity. These effects make existing convolution kernels experience difficulties in extracting meaningful information. This paper presents a novel method to resolve such problems of applying CNNs to omni-directional images. The proposed method utilizes a spherical polyhedron to represent omni-directional views. This method minimizes the variance of the spatial resolving power on the sphere surface, and includes new convolution and pooling methods for the proposed representation. The proposed method can also be adopted by any existing CNN-based methods. The feasibility of the proposed method is demonstrated through classification, detection, and semantic segmentation tasks with synthetic and real datasets.
[dataset] [cube, erp, spherical, spherephd, irregularity, polyhedron, resolving, subdivision, distortion, projection, nonuniform, omnidirectional, define, computer, sphere, discontinuity, subdivided, vision, feasibility, scene, orientation] [image, proposed, method, pixel, figure, conference, input, result, ieee] [kernel, convolution, pooling, power, variance, accuracy, number, layer, cnns, apply, neural, performance, applied, effective, size, convolutional, smaller, table, network, compare, original, padding, connected, implementation, compared] [represent, regular, random, visual, making] [map, spatial, semantic, detection, segmentation, object, average, area, three] [representation, synthia, mnist, conventional, space, classification, euclidean, class, domain, uniform]
@InProceedings{Lee_2019_CVPR,
  author = {Lee, Yeonkun and Jeong, Jaeseok and Yun, Jongseob and Cho, Wonjune and Yoon, Kuk-Jin},
  title = {SpherePHD: Applying CNNs on a Spherical PolyHeDron Representation of 360deg Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ESPNetv2: A Light-Weight, Power Efficient, and General Purpose Convolutional Neural Network
Sachin Mehta, Mohammad Rastegari, Linda Shapiro, Hannaneh Hajishirzi


We introduce a light-weight, power efficient, and general purpose convolutional neural network, ESPNetv2, for modeling visual and sequential data. Our network uses group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters. The performance of our network is evaluated on four different tasks: (1) object classification, (2) semantic segmentation, (3) object detection, and (4) language modeling. Experiments on these tasks, including image classification on the ImageNet and language modeling on the PenTree bank dataset, demonstrate the superior performance of our method over the state-of-the-art methods. Our network outperforms ESPNet by 4-5% and has 2-4x fewer FLOPs on the PASCAL VOC and the Cityscapes dataset. Compared to YOLOv2 on the MS-COCO object detection, ESPNetv2 delivers 4.4% higher accuracy with 6x fewer FLOPs. Our experiments show that ESPNetv2 is much more power efficient than existing state-of-the-art efficient methods including ShuffleNets and MobileNets. Our code is open-source and available at https://github.com/sacmehta/ESPNetv2.
[dataset, modeling, outperforms] [field, cyclic, linear, single] [image, input, figure, comparison] [network, efficient, performance, neural, dilated, convolutional, convolution, eesp, separable, unit, rate, group, add, effective, computational, deep, receptive, table, power, fewer, architecture, imagenet, better, standard, accuracy, hff, size, espnet, number, esp, delivers, shortcut, inference, channel, output, dilation, strided, efficiency, operation, low, fixed] [language, model, arxiv, preprint, evaluate, visual] [semantic, object, pascal, feature, voc, spatial, segmentation, edge, including] [learning, learn, large, training, classification, set, data, existing]
@InProceedings{Mehta_2019_CVPR,
  author = {Mehta, Sachin and Rastegari, Mohammad and Shapiro, Linda and Hajishirzi, Hannaneh},
  title = {ESPNetv2: A Light-Weight, Power Efficient, and General Purpose Convolutional Neural Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Assisted Excitation of Activations: A Learning Technique to Improve Object Detectors
Mohammad Mahdi Derakhshani, Saeed Masoudnia, Amir Hossein Shaker, Omid Mersa, Mohammad Amin Sadeghi, Mohammad Rastegari, Babak N. Araabi


We present a simple yet effective learning technique that significantly improves mAP of YOLO object detectors without compromising their speed. During network training, we carefully feed in localization information. We excite certain activations in order to help the network learn to better localize (Figure 2). In the later stages of training, we gradually reduce our assisted excitation to zero. We reached a new state-of-the-art in the speed-accuracy trade-off (Figure 1). Our technique improves the mAP of YOLOv2 by 3.8% and mAP of YOLOv3 by 2.2% on MSCOCO dataset. This technique is inspired from curriculum learning. It is simple and effective and it is applicable to most single-stage object detectors.
[second, manually] [computer, technique, vision, pattern, international, initial] [conference, figure, proposed, ieee, method, comparison, based, input, image, change] [excitation, table, network, best, achieved, number, activation, compared, better, layer, performance, search, fast, gradually, convolutional, accuracy, architecture, compare, original, inspired, deep, identical, neural] [model, mscoco] [object, detection, stage, assisted, yolo, localization, segmentation, excite, faster, map, semantic, improve, feature, module, redmon, bbox, ross, easy, pascal, voc, bounding, kaiming, three] [learning, curriculum, loss, strategy, negative, training, test, auxiliary, trained]
@InProceedings{Derakhshani_2019_CVPR,
  author = {Mahdi Derakhshani, Mohammad and Masoudnia, Saeed and Hossein Shaker, Amir and Mersa, Omid and Amin Sadeghi, Mohammad and Rastegari, Mohammad and Araabi, Babak N.},
  title = {Assisted Excitation of Activations: A Learning Technique to Improve Object Detectors},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Exploiting Edge Features for Graph Neural Networks
Liyu Gong, Qiang Cheng


Edge features contain important information about graphs. However, current state-of-the-art neural network models designed for graph learning, e.g., graph convolutional networks (GCN) and graph attention networks (GAT), inadequately utilize edge features, especially multi-dimensional edge features. In this paper, we build a new framework for a family of new graph neural network models that can more sufficiently exploit edge features, including those of undirected or multi-dimensional edges. The proposed framework can consolidate current graph neural network models, e.g., GCN and GAT. The proposed framework and new models have the following novelties: First, we propose to use doubly stochastic normalization of graph edge features instead of the commonly used row or symmetric normalization approaches used in current graph neural networks. Second, we construct new formulas for the operations in each individual layer so that they can handle multi-dimensional edge features. Third, for the proposed new framework, edge features are adaptive across network layers. As a result, our proposed new framework and new models are able to exploit a rich source of graph edge information. We apply our new models to graph node classification on several citation networks, whole graph classification, and regression on several molecular datasets. Compared with the current state-of-the-art methods, i.e., GCNs and GAT, our models obtain better performance, which testify to the importance of exploiting edge features in graph neural networks.
[graph, gat, doubly, citation, cora, gcn, current, framework, gcns, directed, eijp, pubmed, multidimensional, time, incorporate, citeseer, dataset, complex, structural] [matrix, international, dense, linear, dimensional, defined, associated, problem, denote] [conference, proposed, based, splitting, method, handle, real, spectral] [network, neural, layer, stochastic, normalization, convolution, molecular, deep, applied, table, sparse, performance, architecture, operation, output, weighted, validation, convolutional, compared, original, binary, tensor] [node, attention, machine, vector, model, mechanism, represent, random] [edge, three, feature, propose, regression, fully] [learning, egnn, classification, datasets, class, gnn, training, adjacency, function, loss, test, exploit, embedding]
@InProceedings{Gong_2019_CVPR,
  author = {Gong, Liyu and Cheng, Qiang},
  title = {Exploiting Edge Features for Graph Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
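For context on the doubly stochastic normalization mentioned above: one standard way to turn nonnegative edge weights into a matrix whose rows and columns both sum to one is Sinkhorn iteration. The sketch below shows that generic procedure, not necessarily the paper's exact construction.

import torch

def sinkhorn_normalize(E, num_iters=20, eps=1e-8):
    # E: (n, n) nonnegative edge-weight matrix; returns an approximately
    # doubly stochastic matrix by alternating row and column normalization.
    for _ in range(num_iters):
        E = E / (E.sum(dim=1, keepdim=True) + eps)  # rows sum to 1
        E = E / (E.sum(dim=0, keepdim=True) + eps)  # columns sum to 1
    return E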
Propagation Mechanism for Deep and Wide Neural Networks
Dejiang Xu, Mong Li Lee, Wynne Hsu


Recent deep neural networks (DNNs) utilize identity mappings, propagated through either element-wise addition or channel-wise concatenation. In this paper, we propose a new propagation mechanism called channel-wise addition (cAdd) to deal with the vanishing gradients problem without sacrificing the complexity of the learned features. Unlike channel-wise concatenation, cAdd is able to eliminate the need to store feature maps, thus reducing the memory requirement. The proposed cAdd mechanism can deepen and widen existing neural architectures with fewer parameters compared to channel-wise concatenation and element-wise addition. We incorporate cAdd into state-of-the-art architectures such as ResNet, WideResNet, and CondenseNet and carry out extensive experiments on CIFAR10, CIFAR100, SVHN and ImageNet to demonstrate that cAdd-based architectures are able to achieve much higher accuracy with fewer parameters compared to their corresponding base architectures.
[propagation] [error, depth, corresponding, vanishing] [input, figure, identity, proposed, based, image, conference] [neural, cadd, number, unit, network, width, eadd, rate, deep, architecture, size, ccon, fewer, addition, table, output, basic, resnet, compared, higher, gradient, convolutional, cwrn, condensenet, performance, batch, concatenation, imagenet, residual, layer, efficient, wrn, cresnet, accuracy, small, increase, convolution, hwu, ccondensenet, deeper, best, deepen, achieve, growth, designed, bottleneck, compare, params, gao] [mechanism, memory, machine] [feature, three, wider] [learning, svhn, training, learned, observe, train]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Dejiang and Li Lee, Mong and Hsu, Wynne},
  title = {Propagation Mechanism for Deep and Wide Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Catastrophic Child's Play: Easy to Perform, Hard to Defend Adversarial Attacks
Chih-Hui Ho, Brandon Leung, Erik Sandstrom, Yen Chang, Nuno Vasconcelos


The problem of adversarial CNN attacks is considered, with an emphasis on attacks that are trivial to perform but difficult to defend. A framework for the study of such attacks is proposed, using real world object manipulations. Unlike most works in the past, this framework supports the design of attacks based on both small and large image perturbations, implemented by camera shake and pose variation. A setup is proposed for the collection of such perturbations and determination of their perceptibility. It is argued that perceptibility depends on context, and a distinction is made between imperceptible and semantically imperceptible perturbations. While the former survives image comparisons, the latter are perceptible but have no impact on human object recognition. A procedure is proposed to determine the perceptibility of perturbations using Turk experiments, and a dataset of both perturbation classes which enables replicable studies of object manipulation attacks, is assembled. Experiments using defenses based on many datasets, CNN models, and algorithms from the literature elucidate the difficulty of defending these attacks -- in fact, none of the existing defenses is found effective against them. Better results are achieved with real world data augmentation, but even this is not foolproof. These results confirm the hypothesis that current CNNs are vulnerable to attacks implementable even by a child, and that such attacks may prove difficult to defend.
[recognition, dataset, work, drone] [pose, camera, computer, problem, vision, affine, algorithm] [image, based, variation, real, figure, frontal, study, method, produce, digital, transformation, proposed, manipulation] [shake, small, gradient, imagenet, cnns, rate, difficulty, neural, table, deep, effective, performance, standard, better, network, number, compare] [attack, indistinguishable, adversarial, defense, perturbation, semantically, true, imperceptible, fool, procedure, defend, turk, simple, successful, consider, random, collection, replicable, contribution, easily, validity, natural, asked, ian, implemented, perceptible, depending, simply] [object, easy, cnn, interest] [large, positive, difficult, data, training, set, learning, setup, hard, randomly, space]
@InProceedings{Ho_2019_CVPR,
  author = {Ho, Chih-Hui and Leung, Brandon and Sandstrom, Erik and Chang, Yen and Vasconcelos, Nuno},
  title = {Catastrophic Child's Play: Easy to Perform, Hard to Defend Adversarial Attacks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Embedding Complementary Deep Networks for Image Classification
Qiuyu Chen, Wei Zhang, Jun Yu, Jianping Fan


In this paper, a deep embedding algorithm is developed to achieve higher accuracy rates on large-scale image classification. By adapting the importance of the object classes to their error rates, our deep embedding algorithm can train multiple complementary deep networks sequentially, where each of them focuses on achieving higher accuracy rates for different subsets of object classes in an easy-to-hard way. By integrating such complementary deep networks to generate an ensemble network, our deep embedding algorithm can improve the accuracy rates for the hard object classes (which initially have higher error rates) to a certain degree while effectively preserving high accuracy rates for the easy object classes. Our deep embedding algorithm has achieved higher overall accuracy rates on large-scale image classification.
[multiple, combining, dataset, joint] [error, algorithm, computer, international, vision, pattern, optimal, defined] [image, conference, ieee, comparison, traditional, developed, preserving, high] [deep, network, accuracy, higher, boosting, neural, rate, iteration, achieve, lth, convolutional, larger, number, processing, achieving, small, pay, weighted, epoch, low, sequentially, process] [visual, generate, attention, easily, demonstrated, probability] [object, complementary, easy, improve, three, weak, average] [hard, embedding, training, learning, class, set, ensemble, train, tth, min, sample, distribution, discriminative, learn, function, objective, updating, test, observe, weighting, effectively, log, mnist, classification, existing, experimental]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Qiuyu and Zhang, Wei and Yu, Jun and Fan, Jianping},
  title = {Embedding Complementary Deep Networks for Image Classification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Multimodal Clustering for Unsupervised Audiovisual Learning
Di Hu, Feiping Nie, Xuelong Li


Birds that we see twitter, and running cars are accompanied by noise, etc. These natural audiovisual correspondences provide the possibility to explore and understand the outside world. However, the mixture of multiple objects and sounds makes it intractable to perform efficient matching in an unconstrained environment. To settle this problem, we propose to adequately excavate audio and visual components and perform elaborate correspondence learning among them. Concretely, a novel unsupervised audiovisual learning model is proposed, named Deep Multimodal Clustering (DMC), which synchronously performs sets of clustering with multimodal vectors of convolutional maps in different shared spaces for capturing multiple audiovisual correspondences. Such an integrated multimodal clustering network can be effectively trained with a max-margin loss in an end-to-end fashion. Extensive experiments on feature evaluation and audiovisual tasks are performed. The results demonstrate that DMC can learn effective unimodal representations, with which the classifier can even outperform humans. Further, DMC shows noticeable performance in sound localization, multisource detection, and audiovisual understanding.
[audiovisual, sound, audio, dmc, dataset, unimodal, complex, multiple, perform, human, signal, multisource, barking, current, previous, event, recognition, vggish, elaborate, temporal] [corresponding, correspondence, assignment, acoustic, scene, provide, international, single, vision, computer] [image, input, separation, proposed, conference, ieee, unconstrained, noticeable] [network, deep, neural, effective, table, net, convolutional, performance, processing, andrew, concrete] [visual, model, multimodal, arxiv, preprint, modality, evaluation, attention, simple, correlate] [feature, center, propose, supervision, object, detection, threshold] [learning, clustering, source, representation, specific, dij, sij, unsupervised, trained, difficult, training, classification, cluster, distance, soft, learn, train, set, data, update]
@InProceedings{Hu_2019_CVPR,
  author = {Hu, Di and Nie, Feiping and Li, Xuelong},
  title = {Deep Multimodal Clustering for Unsupervised Audiovisual Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dense Classification and Implanting for Few-Shot Learning
Yann Lifchitz, Yannis Avrithis, Sylvaine Picard, Andrei Bursuc


Few-shot learning for deep neural networks is a highly challenging and key problem in many computer vision tasks. In this context, we are targeting knowledge transfer from a set with abundant data to other sets with few available examples. We propose two simple and effective solutions: (i) dense classification over feature maps, which for the first time studies local activations in the domain of few-shot learning, and (ii) implanting, that is, attaching new neurons to a previously trained network to learn new, task-specific features. Implanting enables training of multiple layers in the few-shot regime, departing from most related methods derived from metric learning that train only the final layer. Both contributions show consistent gains when used individually or jointly and we report state of the art performance on few-shot classification on miniImageNet.
[multiple, work] [dense, single, parametric, denote, defined, approach] [image, input, figure, face] [network, pooling, neural, convolutional, number, top, deep, activation, table, layer, best, accuracy, performance, weight, standard, cost, residual, parameter, block] [vector, simple, query, model, collection, generate, call, visual, choice] [stage, spatial, average, global, feature, semantic, rrd, bottom] [learning, novel, classification, base, class, embedding, training, support, set, data, function, gap, classifier, learn, task, large, trained, implanting, prototypical, similarity, miniimagenet, loss, tadam, train, cosine, implant, softmax, learned, gmp, metric, representation]
@InProceedings{Lifchitz_2019_CVPR,
  author = {Lifchitz, Yann and Avrithis, Yannis and Picard, Sylvaine and Bursuc, Andrei},
  title = {Dense Classification and Implanting for Few-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Class-Balanced Loss Based on Effective Number of Samples
Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, Serge Belongie


With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula (1 - b^n)/(1 - b), where n is the number of samples and b ∈ [0, 1) is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.
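The abstract states the effective number of samples explicitly, so the re-weighting scheme can be sketched directly; normalising the weights to sum to the number of classes is an assumption here, and the paper also evaluates sigmoid and focal loss variants that this sketch omits.

import numpy as np

def class_balanced_weights(samples_per_class, b=0.999):
    # effective number of samples per class: (1 - b^n) / (1 - b)
    n = np.asarray(samples_per_class, dtype=np.float64)
    effective_num = (1.0 - np.power(b, n)) / (1.0 - b)
    weights = 1.0 / effective_num
    # normalise so the weights sum to the number of classes (an assumption)
    return weights * len(n) / weights.sum()

# e.g. a long-tailed 3-class problem: the rarest class gets the largest weight
print(class_balanced_weights([10000, 500, 20], b=0.999))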
[term, recognition, framework, dataset, work, version] [focal, volume, error, problem, inverse, denote, assume, theoretical, case, single] [proposed, figure, based, frequency, major] [number, effective, sigmoid, deep, performance, ilsvrc, rate, neural, factor, cifar, larger, overlap, smaller, table, imagenet, size, original, newly, small, convolutional, better, scale] [sampled, visual, model, probability, random, commonly, expected] [yang, feature, serge, propose] [loss, data, class, training, softmax, sample, learning, datasets, inaturalist, imbalance, set, classification, trained, function, imbalanced, minor, large, lim, inversely, sampling, weighting, log, novel]
@InProceedings{Cui_2019_CVPR,
  author = {Cui, Yin and Jia, Menglin and Lin, Tsung-Yi and Song, Yang and Belongie, Serge},
  title = {Class-Balanced Loss Based on Effective Number of Samples},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Discovering Visual Patterns in Art Collections With Spatially-Consistent Feature Learning
Xi Shen, Alexei A. Efros, Mathieu Aubry


Our goal in this paper is to discover near duplicate patterns in large collections of artworks. This is harder than standard instance mining due to differences in the artistic media (oil, pastel, drawing, etc), and imperfections inherent in the copying process. Our key technical insight is to adapt a standard deep feature to this task by fine-tuning it on the specific art collection using self-supervised learning. More specifically, spatial consistency between neighbouring feature matches is used as supervisory fine-tuning signal. The adapted feature leads to more accurate style invariant matching, and can be used with a standard discovery approach, based on geometric verification, to identify duplicate patterns in the dataset. The approach is evaluated on several different datasets and shows surprisingly good qualitative discovery results. For quantitative evaluation of the method, we annotated 273 near duplicate details in a dataset of 1587 artworks attributed to Jan Brueghel and his workshop. Beyond artworks, we also demonstrate improvement on localization on the Oxford5K photo dataset as well as on historical photograph localization on the Large Time Lags Location (LTLL) dataset.
[dataset, time, recognition, focus, extract, work] [computer, approach, matching, vision, pattern, well, corresponding, geometric, correspondence, consistent, match, international, analysis] [image, figure, conference, repeated, ieee, consistency, artistic, method, based, demonstrate, artwork] [deep, verification, imagenet, standard, discovered, top, modern] [visual, candidate, query, find, arxiv, procedure, example, preprint, discover, collection, evaluate, discovering] [feature, discovery, brueghel, region, object, art, detection, ltll, score, spatial, annotated, proposal, context, map, duplicate, instance, location, visualized, matched, improvement, localization] [learning, positive, large, task, training, trained, similarity, retrieval, specific, cosine, datasets, data, set]
@InProceedings{Shen_2019_CVPR,
  author = {Shen, Xi and Efros, Alexei A. and Aubry, Mathieu},
  title = {Discovering Visual Patterns in Art Collections With Spatially-Consistent Feature Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Min-Max Statistical Alignment for Transfer Learning
Samitha Herath, Mehrtash Harandi, Basura Fernando, Richard Nock


A profound idea in learning invariant features for transfer learning is to align statistical properties of the domains. In practice, this is achieved by minimizing the disparity between the domains, usually measured in terms of their statistical properties. We question the capability of this school of thought and propose to minimize the maximum disparity between domains. Furthermore, we develop an end-to-end learning scheme that enables us to benefit from the proposed min-max strategy in training deep models. We show that the min-max solution can outperform the existing statistical alignment solutions, and can compete with state-of-the-art solutions on two challenging learning tasks, namely, Unsupervised Domain Adaptation (UDA) and Zero-Shot Learning (ZSL).
[recognition, dataset, second, report, outperforms] [solution, case, computer, disparity, vision, pattern, denote, supplementary, computed] [statistical, proposed, conference, ieee, accumulation, method, input, attribute, comparison, generative, real] [network, deep, neural, performance, gradient, layer, cifar, output, table] [model, adversarial, machine, represent, generated] [feature, propose, baseline, aligned, semantic] [alignment, domain, training, learning, confusion, uda, zsl, loss, unseen, source, min, minimizing, learn, target, trained, adaptation, minimization, class, train, stl, invariant, unsupervised, space, softmax, data, shared, classifier, svhn, idea, align, discriminative, dkl, dimensionality, objective, classification, labeled, mmd, sun, protocol, observe]
@InProceedings{Herath_2019_CVPR,
  author = {Herath, Samitha and Harandi, Mehrtash and Fernando, Basura and Nock, Richard},
  title = {Min-Max Statistical Alignment for Transfer Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Spatial-Aware Graph Relation Network for Large-Scale Object Detection
Hang Xu, Chenhan Jiang, Xiaodan Liang, Zhenguo Li


How can high-order object relations be properly encoded in a detection system without any external knowledge? How can the co-occurrence and locations of objects be leveraged for better reasoning? These questions are key challenges for large-scale object detection systems that aim to recognize thousands of objects entangled with complex spatial and semantic relationships. Distilling the key relations that affect object recognition is crucially important, since treating each region separately leads to a large performance drop under heavy long-tailed data distributions and many confusing categories. Recent works try to encode relations by constructing graphs, e.g., using handcrafted linguistic knowledge between classes or implicitly learning a fully-connected graph between regions. However, handcrafted linguistic knowledge cannot be individualized for each image due to the semantic gap between linguistic and visual context, while the fully-connected graph is inefficient and noisy because it incorporates redundant and distracting relations/edges from irrelevant objects and backgrounds. In this work, we introduce a Spatial-aware Graph Relation Network (SGRN) to adaptively discover and incorporate key semantic and spatial relationships for reasoning over each object. Our method considers relative location layouts and interactions, and can be easily injected into any detection pipeline to boost performance. Specifically, our SGRN integrates a graph learner module that learns an interpretable sparse graph structure to encode relevant contextual regions, and a spatial graph reasoning module with learnable spatial Gaussian kernels to perform graph inference with spatial awareness. Extensive experiments verify the effectiveness of our method, e.g., achieving around 32% improvement on VG (3000 classes) and 28% on ADE in terms of mAP.
[graph, ade, dataset, recognition, learns, previous, propagate, propagation, concatenated] [matrix, note, directly] [method, image, figure, proposed, input] [gaussian, sparse, network, table, number, structure, performance, connected, deep, original, convolutional, neural, learnable, convolution, layer, output, redundant, accuracy, kernel] [visual, reasoning, model, node, linguistic, relationship, relevant, encode, transferability] [sgrn, object, relation, detection, spatial, fpn, module, region, semantic, feature, baseline, coco, improve, bbox, handcraft, context, proposal, regional, regression, faster, map, backbone, rnr] [classification, learning, learner, embedding, learned, knowledge, training, pairwise, trained, domain, adjacency, embeddings]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Hang and Jiang, Chenhan and Liang, Xiaodan and Li, Zhenguo},
  title = {Spatial-Aware Graph Relation Network for Large-Scale Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deformable ConvNets V2: More Deformable, Better Results
Xizhou Zhu, Han Hu, Stephen Lin, Jifeng Dai


The superior performance of Deformable Convolutional Networks arises from its ability to adapt to the geometric variations of objects. Through an examination of its adaptive behavior, we observe that while the spatial support for its neural features conforms more closely than regular ConvNets to object structure, this support may nevertheless extend well beyond the region of interest, causing features to be influenced by irrelevant image content. To address this problem, we present a reformulation of Deformable ConvNets that improves its ability to focus on pertinent image regions, through increased modeling power and stronger training. The modeling power is enhanced through a more comprehensive integration of deformable convolution within the network, and by introducing a modulation mechanism that expands the scope of deformation modeling. To effectively harness this enriched modeling capability, we guide network training via a proposed feature mimicking scheme that helps the network to learn features that reflect the object focus and classification power of R-CNN features. With the proposed contributions, this new version of Deformable ConvNets yields significant performance gains over the original model and produces leading results on the COCO benchmark for object detection and instance segmentation.
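Recent versions of torchvision expose modulated deformable convolution, so a DCNv2-style layer, in which an extra convolution predicts both the sampling offsets and sigmoid-gated modulation scalars, can be sketched as below; the layer sizes, initialisation, and single offset group are assumptions, not the authors' configuration.

import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ModulatedDeformConv(nn.Module):
    # DCNv2-style layer: offsets and modulation are predicted from the input itself.
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # 2*k*k offset channels followed by k*k modulation channels
        self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, k, padding=padding)
        nn.init.zeros_(self.offset_mask.weight)   # start out as a regular conv
        nn.init.zeros_(self.offset_mask.bias)
        self.k, self.padding = k, padding

    def forward(self, x):
        om = self.offset_mask(x)
        offset = om[:, :2 * self.k * self.k]
        mask = torch.sigmoid(om[:, 2 * self.k * self.k:])   # modulation in [0, 1]
        return deform_conv2d(x, offset, self.weight, self.bias,
                             padding=self.padding, mask=mask)

x = torch.randn(1, 16, 32, 32)
print(ModulatedDeformConv(16, 32)(x).shape)   # torch.Size([1, 32, 32, 32])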
[modeling, work, focus] [geometric, deformation, field, computed, additional] [image, input, figure, content] [network, convnets, effective, convolutional, convolution, conv, receptive, neural, modulation, layer, deep, power, better, applied, replaced, table, original, accuracy, convnet, learnable, output] [regular, model, mimic, node, enriched, ability] [deformable, feature, object, spatial, faster, roipooling, mimicking, region, apbbox, detection, mask, coco, aligned, modulated, saliency, offset, bin, stage, apmask, baseline, visualized, module, three, foreground, roi, box, context] [support, sampling, classification, learning, training, loss, learned, trained, set, representation, adapt]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Xizhou and Hu, Han and Lin, Stephen and Dai, Jifeng},
  title = {Deformable ConvNets V2: More Deformable, Better Results},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Interaction-And-Aggregation Network for Person Re-Identification
Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, Xilin Chen


Person re-identification (reID) benefits greatly from deep convolutional neural networks (CNNs), which learn robust feature embeddings. However, CNNs are inherently limited in modeling the large variations in person pose and scale due to their fixed geometric structures. In this paper, we propose a novel network structure, Interaction-and-Aggregation (IA), to enhance the feature representation capability of CNNs. Firstly, a Spatial IA (SIA) module is introduced. It models the interdependencies between spatial features and then aggregates the correlated features corresponding to the same body parts. Unlike CNNs, which extract features from fixed rectangular regions, SIA can adaptively determine the receptive fields according to the input person pose and scale. Secondly, we introduce a Channel IA (CIA) module, which selectively aggregates channel features to enhance the feature representation, especially for small-scale visual cues. Further, an IA network can be constructed by inserting IA blocks into CNNs at any depth. We validate the effectiveness of our model for person reID by demonstrating its superiority over state-of-the-art methods on three benchmark datasets.
[interaction, localize, multiple, explicitly, modeling, fusion, outperforms, extract, human] [body, pose, corresponding, geometric] [input, appearance, proposed, method, figure, image, comparison] [sia, cia, block, network, channel, deep, scale, convolutional, adaptively, receptive, aggregation, cnns, small, fixed, neural, effectiveness, operation, number, conv, performance, convolution, aggregate, structure, standard, layer] [attention, model, arxiv, preprint, visual, generate, evaluation] [person, feature, map, spatial, relation, semantic, location, ianet, enhance, detection, pedestrian, context, dukemtmc, three, module, propose] [learning, reid, large, representation, similarity, learn, metric]
@InProceedings{Hou_2019_CVPR,
  author = {Hou, Ruibing and Ma, Bingpeng and Chang, Hong and Gu, Xinqian and Shan, Shiguang and Chen, Xilin},
  title = {Interaction-And-Aggregation Network for Person Re-Identification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Rare Event Detection Using Disentangled Representation Learning
Ryuhei Hamaguchi, Ken Sakurada, Ryosuke Nakamura


This paper presents a novel method for rare event detection from an image pair with class-imbalanced datasets. A straightforward approach for event detection tasks is to train a detection network on a large-scale dataset in an end-to-end manner. However, in many applications such as building change detection on satellite images, few positive samples are available for training. Moreover, an image pair of scenes contains many trivial events, such as illumination changes or background motion. These many trivial events and the class imbalance problem lead to false alarms in rare event detection. To overcome these difficulties, we propose a novel method to learn disentangled representations from only low-cost negative samples. The proposed method disentangles the different aspects in a pair of observations: variant and invariant factors that represent trivial events and image contents, respectively. The effectiveness of the proposed approach is verified by quantitative evaluations on four change detection datasets, and the qualitative analysis shows that the proposed method can acquire representations that disentangle rare events from trivial ones.
[event, dataset, hidden, work, anomaly, learns] [augmented, problem, analysis] [image, method, proposed, change, latent, figure, disentangled, input, background, comparison, generative, result, extracted] [deep, network, activation, table, order, neural, number, architecture, sparsity, convolutional, standard, parameter, unit, represents] [common, model, encoder, represent, variational, decoder, requires] [detection, feature, detecting, mask, detector, hierarchical, lsim] [learning, representation, trivial, specific, loss, rare, similarity, negative, pair, distance, positive, invariant, training, mnist, learned, vae, distribution, class, learn, trained, pcd, wdc, data, abcd, function, digit, novel, target, classifier, domain, posterior]
@InProceedings{Hamaguchi_2019_CVPR,
  author = {Hamaguchi, Ryuhei and Sakurada, Ken and Nakamura, Ryosuke},
  title = {Rare Event Detection Using Disentangled Representation Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Shape Robust Text Detection With Progressive Scale Expansion Network
Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, Shuai Shao


Scene text detection has witnessed rapid progress, especially with the recent development of convolutional neural networks. However, there still exist two challenges that keep these algorithms from industrial application. On the one hand, most state-of-the-art algorithms require quadrangular bounding boxes, which are inaccurate for locating text with arbitrary shapes. On the other hand, two text instances that lie close to each other may lead to a false detection that covers both instances. Traditionally, segmentation-based approaches can relieve the first problem but usually fail to solve the second. To address both challenges, in this paper we propose a novel Progressive Scale Expansion Network (PSENet), which can precisely detect text instances with arbitrary shapes. More specifically, PSENet generates kernels of different scales for each text instance, and gradually expands the minimal-scale kernel to the text instance with its complete shape. Because there are large geometric margins between the minimal-scale kernels, our method effectively splits text instances that lie close together, making it easier for segmentation-based methods to detect arbitrarily shaped text instances. Extensive experiments on CTW1500, Total-Text, ICDAR 2015 and ICDAR 2017 MLT validate the effectiveness of PSENet. Notably, on CTW1500, a dataset full of long curved texts, PSENet achieves an F-measure of 74.3% at 27 FPS, and our best F-measure (82.2%) outperforms state-of-the-art algorithms by 6.6%. The code will be released in the future.
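A simplified sketch of the progressive scale expansion step: instances are seeded from the smallest kernel and grown, breadth-first and first-come-first-served, through each larger kernel mask. The 4-connectivity and the use of scipy for the initial labelling are implementation choices made here, not necessarily the authors'.

from collections import deque
import numpy as np
from scipy.ndimage import label

def progressive_scale_expansion(kernels):
    # kernels: list of binary (H, W) masks ordered from smallest to largest scale
    labels, _ = label(kernels[0])             # seed instances from the smallest kernel
    h, w = labels.shape
    for k in kernels[1:]:
        queue = deque(zip(*np.nonzero(labels)))
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and k[ny, nx] and labels[ny, nx] == 0:
                    labels[ny, nx] = labels[y, x]   # first-come-first-served expansion
                    queue.append((ny, nx))
    return labels                              # (H, W) instance labels, 0 = background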
[multiple, dataset, long, time, formulated] [scene, minimal, algorithm, note, shape, ground, robust] [method, proposed, based, result, arbitrary, input, image, figure, separate] [scale, progressive, kernel, table, performance, network, original, deep, convolutional, neural, output, science, connected, fps, achieves, best] [text, complete, represent, natural, arxiv, preprint, find, evaluate, indicates, external] [psenet, expansion, detection, curve, segmentation, icdar, instance, feature, map, xiang, final, region, polygon, detecting, detect, oriented, textsnake, mask, cong, locate, adopt, object, east, adopted, recall, propose] [test, training, function, large, learning, close, set]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Wenhai and Xie, Enze and Li, Xiang and Hou, Wenbo and Lu, Tong and Yu, Gang and Shao, Shuai},
  title = {Shape Robust Text Detection With Progressive Scale Expansion Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dual Encoding for Zero-Example Video Retrieval
Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, Xun Wang


This paper attacks the challenging problem of zero-example video retrieval. In such a retrieval paradigm, an end user searches for unlabeled videos with ad-hoc queries described in natural language text and no visual example provided. Given videos as sequences of frames and queries as sequences of words, effective sequence-to-sequence cross-modal matching is required. The majority of existing methods are concept based, extracting relevant concepts from queries and videos and accordingly establishing associations between the two modalities. In contrast, this paper takes a concept-free approach, proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Dual encoding is conceptually simple, practically effective and end-to-end. Experiments on three benchmarks, i.e., MSR-VTT and the TRECVID 2016 and 2017 Ad-hoc Video Search tasks, show that the proposed solution establishes a new state of the art for zero-example video retrieval.
[video, trecvid, gru, bigru, joint, event, work, dataset, infap, recurrent, forward, backward, renmin, temporal, multiple, bidirectional, performs, sequence] [university, ground, note] [dual, method, proposed, content, based, figure, distinct] [network, performance, table, pooling, neural, effective, convolutional, deep, top, output, size, search, best] [encoding, common, sentence, model, query, visual, vector, text, concept, relevant, med, natural, sum, msvd, language, describing, word, improved] [feature, level, three, map, cnn] [retrieval, learning, space, test, data, training, paper, trained, specific, set, train, loss]
@InProceedings{Dong_2019_CVPR,
  author = {Dong, Jianfeng and Li, Xirong and Xu, Chaoxi and Ji, Shouling and He, Yuan and Yang, Gang and Wang, Xun},
  title = {Dual Encoding for Zero-Example Video Retrieval},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MaxpoolNMS: Getting Rid of NMS Bottlenecks in Two-Stage Object Detectors
Lile Cai, Bin Zhao, Zhe Wang, Jie Lin, Chuan Sheng Foo, Mohamed Sabry Aly, Vijay Chandrasekhar


Modern convolutional object detectors have improved the detection accuracy significantly, which in turn inspired the development of dedicated hardware accelerators to achieve real-time performance by exploiting inherent parallelism in the algorithm. Non-maximum suppression (NMS) is an indispensable operation in object detection. In stark contrast to most operations, the commonly-adopted GreedyNMS algorithm does not foster parallelism, which can be a major performance bottleneck. In this paper, we introduce MaxpoolNMS, a parallelizable alternative to the NMS algorithm, which is based on max-pooling classification score maps. By employing a novel multi-scale multi-channel max-pooling strategy, our method is 20x faster than GreedyNMS while simultaneously achieves comparable accuracy, when quantified across various benchmarking datasets, i.e., MS COCO, KITTI and PASCAL VOC. Furthermore, our method is better suited for hardware-based acceleration than GreedyNMS.
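The core observation, that NMS on dense score maps can be replaced by max-pooling, can be illustrated in a few lines: a location survives if its score equals the max-pooled score of its neighbourhood. This single-scale sketch omits the paper's multi-scale, multi-channel strategy; the kernel size and threshold are placeholder values.

import torch
import torch.nn.functional as F

def maxpool_nms(score_map, kernel_size=3, threshold=0.5):
    # score_map: (H, W) objectness scores; returns (y, x) coordinates of kept peaks
    s = score_map[None, None]                                 # (1, 1, H, W)
    pooled = F.max_pool2d(s, kernel_size, stride=1, padding=kernel_size // 2)
    keep = (s == pooled) & (s > threshold)                    # local maxima only
    return torch.nonzero(keep[0, 0])

scores = torch.rand(64, 64)
print(maxpool_nms(scores).shape)   # (num_kept, 2)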
[time, second, perform, prediction] [kitti, computer, dense, algorithm, matching, vision, corresponding, pattern] [figure, high, conference, input, image, method, remove, ieee] [network, table, pooling, number, size, kernel, accuracy, max, convolutional, achieve, precision, performance, gpu, scale, comparable, convolution, deep, sparse, small, neural] [arxiv, preprint, execution, procedure] [object, score, greedynms, detection, faster, region, proposal, maxpoolnms, map, anchor, neighboring, aspect, stage, objectness, final, nscl, nar, feature, iou, coco, ross, kaiming, benchmarking, response] [selected, set, trained, learning, large, select]
@InProceedings{Cai_2019_CVPR,
  author = {Cai, Lile and Zhao, Bin and Wang, Zhe and Lin, Jie and Sheng Foo, Chuan and Sabry Aly, Mohamed and Chandrasekhar, Vijay},
  title = {MaxpoolNMS: Getting Rid of NMS Bottlenecks in Two-Stage Object Detectors},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Character Region Awareness for Text Detection
Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, Hwalsuk Lee


Scene text detection methods based on neural networks have emerged recently and have shown promising results. Previous methods trained with rigid word-level bounding boxes exhibit limitations in representing the text region in an arbitrary shape. In this paper, we propose a new scene text detection method to effectively detect text area by exploring each character and affinity between characters. To overcome the lack of individual character level annotations, our proposed framework exploits both the given character-level annotations for synthetic images and the estimated character-level ground-truths for real images acquired by the learned interim model. In order to estimate affinity between characters, the network is trained with the newly proposed representation for affinity. Extensive experiments on six benchmarks, including the TotalText and CTW-1500 datasets which contain highly curved texts in natural images, demonstrate that our character-level text detection significantly outperforms the state-of-the-art detectors. According to the results, our proposed method guarantees high flexibility in detecting complicated scene text images, such as arbitrarily-oriented, curved, or deformed texts.
[dataset, individual, recognition] [scene, ground, truth, robust, local, single, confidence, upconv] [image, proposed, method, synthetic, real, figure, based, arbitrary, pixel] [convolutional, network, deep, gaussian, table, performed, block, conv, number, neural, original, epoch] [text, character, word, model, natural, generation, generate, provided, sconf, arxiv, preprint, spotting, procedure] [region, score, affinity, detection, box, bounding, map, icdar, craft, center, totaltext, polygon, detecting, characterlevel, detect, detector, regression, segmentation, level, curved, mask, textspotter, annotation, fully, annotated, oriented, area] [training, datasets, trained, train, learning, representation, set, testing]
@InProceedings{Baek_2019_CVPR,
  author = {Baek, Youngmin and Lee, Bado and Han, Dongyoon and Yun, Sangdoo and Lee, Hwalsuk},
  title = {Character Region Awareness for Text Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Effective Aesthetics Prediction With Multi-Level Spatially Pooled Features
Vlad Hosu, Bastian Goldlucke, Dietmar Saupe


We propose an effective deep learning approach to aesthetics quality assessment that relies on a new type of pre-trained features, and apply it to the AVA data set, the currently largest aesthetics database. While previous approaches miss some of the information in the original images, due to taking small crops, down-scaling or warping the originals during training, we propose the first method that efficiently supports full resolution images as an input, and can be trained on variable input sizes. This allows us to significantly improve upon the state of the art, increasing the Spearman rank-order correlation coefficient (SRCC) of ground-truth mean opinion scores (MOS) from the existing best reported of 0.612 to 0.756. To achieve this performance, we extract multi-level spatially pooled (MLSP) features from all convolutional blocks of a pre-trained InceptionResNet-v2 network, and train a custom shallow Convolutional Neural Network (CNN) architecture on these new features.
[ava, aqa, assessment, previous, work, perform, multiple, recognition, prediction, extract] [approach, computer, vision, pattern, well, international] [image, quality, resolution, high, aesthetic, extracted, narrow, conference, input, based, figure, perceptual, proposed, photo, spatially] [performance, network, architecture, mlsp, best, accuracy, deep, srcc, original, correlation, low, binary, small, table, batch, wide, pooled, rescaled, better, pooling, talebi, number, entire, pretrained, neural, fixed, dnn] [model, random, inception] [feature, average, head, score, propose] [learning, training, classification, trained, existing, set, test, reported, base, data]
@InProceedings{Hosu_2019_CVPR,
  author = {Hosu, Vlad and Goldlucke, Bastian and Saupe, Dietmar},
  title = {Effective Aesthetics Prediction With Multi-Level Spatially Pooled Features},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Attentive Region Embedding Network for Zero-Shot Learning
Guo-Sen Xie, Li Liu, Xiaobo Jin, Fan Zhu, Zheng Zhang, Jie Qin, Yazhou Yao, Ling Shao


Zero-shot learning (ZSL) aims to classify images from unseen categories by merely utilizing seen-class images as the training data. Existing works on ZSL mainly leverage global features or learn global regions, from which the embeddings into the semantic space are constructed. However, few of them study the discriminative power implied in local image regions (parts), which, in some sense, correspond to semantic attributes, carry stronger discrimination than the attributes themselves, and can thus assist the semantic transfer between seen and unseen classes. In this paper, to discover (semantic) regions, we propose the attentive region embedding network (AREN), which is tailored to advance the ZSL task. Specifically, AREN is end-to-end trainable and consists of two network branches, i.e., the attentive region embedding (ARE) stream and the attentive compressed second-order embedding (ACSE) stream. ARE is capable of discovering multiple part regions under the guidance of the attention and the compatibility loss. Moreover, a novel adaptive thresholding mechanism is proposed for suppressing redundant (such as background) attention regions. To further guarantee more stable semantic transfer from the perspective of second-order collaboration, ACSE is incorporated into the AREN. In comprehensive evaluations on four benchmarks, our models achieve state-of-the-art performance under the ZSL setting, and compelling results under the generalized ZSL setting.
[recognition, prediction] [well, projected] [image, proposed, based, input, attribute, figure, latent] [network, max, compressed, best, achieve, convolutional, deep, number, table, adaptive, pooling, coefficient] [attention, visual, vector, mechanism, leopard, model, evaluation, generating, indicates] [semantic, attentive, feature, region, global, map, backbone, ware, leading, mask, cnn] [embedding, zsl, acse, unseen, learning, space, aca, representation, class, compatibility, set, loss, vare, training, label, trained, test, function, cub, transfer, discriminative, zcps, maximum, wacse, vacse, setting, inductive, tiger, gzsl, transductive, testing, ncps, embeddings, thresholding, incorporated, bobcat, data, cps, suppose, sun]
@InProceedings{Xie_2019_CVPR,
  author = {Xie, Guo-Sen and Liu, Li and Jin, Xiaobo and Zhu, Fan and Zhang, Zheng and Qin, Jie and Yao, Yazhou and Shao, Ling},
  title = {Attentive Region Embedding Network for Zero-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Explicit Spatial Encoding for Deep Local Descriptors
Arun Mukundan, Giorgos Tolias, Ondrej Chum


We propose a kernelized deep local-patch descriptor based on efficient match kernels of neural network activations. The response of each receptive field is encoded together with its spatial location using explicit feature maps. Two location parametrizations, Cartesian and polar, are used to provide robustness to different types of canonical patch misalignment. Additionally, we analyze how the conventional architecture, i.e. a fully connected layer attached after the convolutional part, encodes responses in a spatially variant way. In contrast, explicit spatial encoding is used in our descriptor, whose potential applications are not limited to local patches. We evaluate the descriptor on standard benchmarks. Both versions, encoding 32x32 or 64x64 patches, consistently outperform all other methods on all benchmarks. The number of parameters of the model is independent of the input patch resolution.
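As a hedged illustration of the match-kernel idea, the sketch below lifts each local activation to the tensor product of the response with an explicit (Fourier) position encoding and sums over the patch; the Cartesian parametrisation, the specific frequencies, and the normalisation are assumptions, not the paper's kernels.

import numpy as np

def position_encoding(xy, freqs):
    # explicit feature map for a 2-D position: [cos(w x), sin(w x), cos(w y), sin(w y)]
    x, y = xy
    return np.concatenate([np.cos(freqs * x), np.sin(freqs * x),
                           np.cos(freqs * y), np.sin(freqs * y)])

def spatially_encoded_descriptor(responses, freqs):
    # responses: (H, W, C) activations; returns a descriptor of size C * 4 * len(freqs)
    h, w, c = responses.shape
    desc = np.zeros(c * 4 * len(freqs))
    for i in range(h):
        for j in range(w):
            phi = position_encoding(((i + 0.5) / h, (j + 0.5) / w), freqs)
            desc += np.kron(responses[i, j], phi)   # response (x) position encoding
    return desc / np.linalg.norm(desc)

feat = np.random.rand(8, 8, 32)
print(spatially_encoded_descriptor(feat, np.pi * np.array([1.0, 2.0, 4.0])).shape)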
[work, explicitly, report, combined, cordelia] [local, descriptor, mxy, position, ard, explicit, match, provide, case, sift, matrix, coordinate, cartesian, matching, geometric, total, krystian, field, affine, form, phototourism] [patch, image, input, proposed, comparison, translation, gxy, ieee, figure, pixel] [convolutional, deep, performance, kernel, number, efficient, variant, size, standard, performed, layer, architecture, equal, neural, network, tensor, andrew, receptive, impact, pooling, implementation] [encoding, sum, encodes, evaluate, common, visual, encode] [feature, spatial, map, propose, cnn, giorgos, object, final] [learning, training, invariant, similarity, learned, set, conventional, function, dimensionality, train, cat, product, retrieval, large, loss, hardnet]
@InProceedings{Mukundan_2019_CVPR,
  author = {Mukundan, Arun and Tolias, Giorgos and Chum, Ondrej},
  title = {Explicit Spatial Encoding for Deep Local Descriptors},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Panoptic Segmentation
Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, Piotr Dollar


We propose and study a task we name panoptic segmentation (PS). Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance). The proposed task requires generating a coherent scene segmentation that is rich and complete, an important step toward real-world vision systems. While early work in computer vision addressed related image/scene parsing tasks, these are not currently popular, possibly due to lack of appropriate metrics or associated recognition challenges. To address this, we propose a novel panoptic quality (PQ) metric that captures performance for all classes (stuff and things) in an interpretable and unified manner. Using the proposed metric, we perform a rigorous study of both human and machine performance for PS on three existing datasets, revealing interesting insights about the task. The aim of our work is to revive the interest of the community in a more unified view of image segmentation. For more analysis and up-to-date results, please check the arXiv version of the paper: https://arxiv.org/abs/1801.00868.
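The panoptic quality metric for a single class can be sketched directly from its definition: predicted and ground-truth segments of that class match when their IoU exceeds 0.5 (which makes matches unique), and PQ is the sum of matched IoUs divided by TP + 0.5 FP + 0.5 FN. The sketch below ignores void regions and the averaging over classes done in the full protocol.

import numpy as np

def panoptic_quality(pred_masks, gt_masks, iou_thresh=0.5):
    # pred_masks, gt_masks: lists of boolean (H, W) arrays, one per segment
    matched_gt, iou_sum, tp = set(), 0.0, 0
    for p in pred_masks:
        best_iou, best_j = 0.0, None
        for j, g in enumerate(gt_masks):
            if j in matched_gt:
                continue
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou > iou_thresh:          # IoU > 0.5 guarantees a unique match
            matched_gt.add(best_j)
            iou_sum += best_iou
            tp += 1
    fp = len(pred_masks) - tp
    fn = len(gt_masks) - len(matched_gt)
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0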
[human, recognition, joint, work, challenge, dataset, future, drive] [ground, truth, scene, note, confidence, matching, general, vision, algorithm, single] [image, quality, consistency, pixel, proposed, figure, study] [table, performance, small, convolutional, output, format] [machine, simple, requires, visual, hope, evaluation] [segmentation, instance, panoptic, semantic, stuff, object, thing, iou, segment, predicted, pqst, pqth, detection, three, annotated, doll, parsing, coco, void, person, including, mapillary, matched, ross, piotr, fully, assigned, heuristic, overlapping, false] [task, datasets, metric, class, label, unified, existing, learning, uniform, large, gap, measure, set]
@InProceedings{Kirillov_2019_CVPR,
  author = {Kirillov, Alexander and He, Kaiming and Girshick, Ross and Rother, Carsten and Dollar, Piotr},
  title = {Panoptic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
You Reap What You Sow: Using Videos to Generate High Precision Object Proposals for Weakly-Supervised Object Detection
Krishna Kumar Singh, Yong Jae Lee


We propose a novel way of using videos to obtain high precision object proposals for weakly-supervised object detection. Existing weakly-supervised detection approaches use off-the-shelf proposal methods like edge boxes or selective search to obtain candidate boxes. These methods provide high recall but at the expense of thousands of noisy proposals. Thus, the entire burden of finding the few relevant object regions is left to the ensuing object mining step. To mitigate this issue, we focus instead on improving the precision of the initial candidate object proposals. Since we cannot rely on localization annotations, we turn to video and leverage motion cues to automatically estimate the extent of objects to train a Weakly-supervised Region Proposal Network (W-RPN). We use the W-RPN to generate high precision object proposals, which are in turn used to re-rank high recall proposals like edge boxes or selective search according to their spatial overlap. Our W-RPN proposals lead to significant improvement in performance for state-of-the-art weakly-supervised object detection approaches on PASCAL VOC 2007 and 2012.
[motion, video, dataset, focus, prediction, perform, work] [approach, outlier, compute, initial] [high, image, method, figure, based, produce] [network, precision, search, performance, overlap, lead, higher, low, table, compared, deep] [candidate, generate, relevant, step, evaluate, turn] [object, detection, edge, box, proposal, wsddn, score, oicr, region, bounding, selective, pascal, segmentation, priority, recall, voc, iou, weaklysupervised, detector, boost, weakly, spatial, lrank, improvement, scoring, final, localization, map] [class, learning, training, train, existing, noisy, set, rank, mining, trained, loss, pseudo, supervised, discriminative, measure]
@InProceedings{Singh_2019_CVPR,
  author = {Kumar Singh, Krishna and Jae Lee, Yong},
  title = {You Reap What You Sow: Using Videos to Generate High Precision Object Proposals for Weakly-Supervised Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Explore-Exploit Graph Traversal for Image Retrieval
Cheng Chang, Guangwei Yu, Chundi Liu, Maksims Volkovs


We propose a novel graph-based approach for image retrieval. Given a nearest neighbor graph produced by the global descriptor model, we traverse it by alternating between exploit and explore steps. The exploit step maximally utilizes the immediate neighborhood of each vertex, while the explore step traverses vertices that are farther away in the descriptor space. By combining these two steps we can better capture the underlying image manifold, and successfully retrieve relevant images that are visually dissimilar to the query. Our traversal algorithm is conceptually simple, has few tunable parameters and can be implemented with basic data structures. This enables fast real-time inference for previously unseen queries with minimal memory overhead. Despite relative simplicity, we show highly competitive results on multiple public benchmarks, including the largest image retrieval dataset that is currently publicly available. Full code for this work is available here: https://github.com/layer6ai-labs/egt.
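A heavily simplified, hedged sketch of traversing a precomputed nearest-neighbour graph for retrieval: start from the query's neighbours, repeatedly pop the strongest available edge (exploit) and push the popped vertex's own neighbours (explore). The paper's exact alternation of the two steps and its tie-breaking rules are not reproduced here.

import heapq

def graph_retrieve(query_neighbors, knn_graph, budget=100):
    # query_neighbors: list of (similarity, vertex) pairs for the query
    # knn_graph:       dict mapping vertex -> list of (similarity, neighbour) pairs
    heap = [(-s, v) for s, v in query_neighbors]
    heapq.heapify(heap)
    retrieved, seen = [], set()
    while heap and len(retrieved) < budget:
        neg_s, v = heapq.heappop(heap)          # take the strongest available edge
        if v in seen:
            continue
        seen.add(v)
        retrieved.append(v)
        for s, u in knn_graph.get(v, []):       # expand the new vertex's neighbourhood
            if u not in seen:
                heapq.heappush(heap, (-s, u))
    return retrieved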
[graph, online, work, dataset, challenge, time, farther, multiple, drift] [approach, descriptor, algorithm, ransac, inlier, geodesic, typically, analysis, computer, neighborhood] [image, figure, proposed, landmark, based] [explore, performance, table, highly, weight, verification, complexity, top, efficient, inference, deep, offline, phase, applied, achieved, compact, reduce] [query, relevant, retrieve, step, topic, model] [edge, map, global, spatial, leading, propose, threshold, object, improve, inner, expansion] [retrieval, retrieved, egt, traversal, hard, similarity, roxford, regt, nnk, rparis, set, diffusion, exploit, large, dfs, distance, medium, nearest, neighbor, effectively, list, popped, datasets, novel, unseen]
@InProceedings{Chang_2019_CVPR,
  author = {Chang, Cheng and Yu, Guangwei and Liu, Chundi and Volkovs, Maksims},
  title = {Explore-Exploit Graph Traversal for Image Retrieval},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dissimilarity Coefficient Based Weakly Supervised Object Detection
Aditya Arun, C.V. Jawahar, M. Pawan Kumar


We consider the problem of weakly supervised object detection, where the training samples are annotated using only image-level labels that indicate the presence or absence of an object category. In order to model the uncertainty in the location of the objects, we employ a dissimilarity coefficient based probabilistic learning objective. The learning objective minimizes the difference between an annotation agnostic prediction distribution and an annotation aware conditional distribution. The main computational challenge is the complex nature of the conditional distribution, which consists of terms over hundreds or thousands of variables. The complexity of the conditional distribution rules out the possibility of explicitly modeling it. Instead, we exploit the fact that deep learning frameworks rely on stochastic optimization. This allows us to use a state of the art discrete generative model that can provide annotation consistent samples from the conditional distribution. Extensive experiments on PASCAL VOC 2007 and 2012 data sets demonstrate the efficacy of our proposed approach.
[prediction, modeling, explicitly, framework, multiple, complex] [discrete, single, compute, optimization, algorithm, define, supplementary] [conditional, image, based, figure, method, noise, high, input, proposed, difference] [net, network, coefficient, deep, order, output, table, represents, standard, stochastic, convolutional, factorized, neural] [model, diversity, div, probability, consider] [bounding, object, box, weakly, detection, voc, prp, prc, annotation, pascal, category, isco, score, localization, fully, detector, challenging, instance, wsod, mil, pawan] [distribution, supervised, learning, dissimilarity, objective, data, loss, training, function, uncertainty, task, test, set, probabilistic, train, sample, specific, bottle]
@InProceedings{Arun_2019_CVPR,
  author = {Arun, Aditya and Jawahar, C.V. and Pawan Kumar, M.},
  title = {Dissimilarity Coefficient Based Weakly Supervised Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Kernel Transformer Networks for Compact Spherical Convolution
Yu-Chuan Su, Kristen Grauman


Ideally, 360° imagery could inherit the deep convolutional neural networks (CNNs) already trained with great success on perspective projection images. However, existing methods to transfer CNNs from perspective to spherical images introduce significant computational costs and/or degradations in accuracy. We present the Kernel Transformer Network (KTN) to efficiently transfer convolution kernels from perspective images to the equirectangular projection of 360° images. Given a source CNN for perspective images as input, the KTN produces a function parameterized by a polar angle and kernel as output. Given a novel 360° image, that function in turn can compute convolutions for arbitrary layers and kernels as would the source CNN on the corresponding tangent plane projections. Distinct from all existing methods, KTNs allow model transfer: the same model can be applied to different source CNNs with the same base architecture. This enables application to multiple recognition tasks without re-training the KTN. Validating our approach with multiple source CNNs and datasets, we show that KTNs improve the state of the art for spherical convolution. KTNs successfully preserve the source CNN's accuracy, while offering transferability, scalability to typical image resolutions, and, in many cases, a substantially lower memory footprint.
[video, multiple, learns, work, perform, recognition] [projection, spherical, perspective, tangent, distortion, approach, plane, require, sphere, polar, account, single, accurate, directly, defined, note] [image, resolution, input, transformation, row, reproduce, figure] [ktn, convolution, kernel, cnns, equirectangular, size, apply, accuracy, pherical, network, sphconv, applied, phere, ktns, architecture, overhead, onv, ordinary, layer, deep, vgg, depthwise, compared, applying, entire, number, ubemap, convolutional, dependent, higher, output] [model, memory, transformer, visual, generate] [cnn, feature, pascal, object, faster, annotated, detection, spatial, map] [source, trained, training, learning, existing, function, train, transfer, data, target, learn, objective]
@InProceedings{Su_2019_CVPR,
  author = {Su, Yu-Chuan and Grauman, Kristen},
  title = {Kernel Transformer Networks for Compact Spherical Convolution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Object Detection With Location-Aware Deformable Convolution and Backward Attention Filtering
Chen Zhang, Joohee Kim


Multi-class and multi-scale object detection for autonomous driving is challenging because of the high variation in object scales and the cluttered background in complex street scenes. Context information and high-resolution features are the keys to achieve a good performance in multi-scale object detection. However, context information is typically unevenly distributed, and the high-resolution feature map also contains distractive low-level features. In this paper, we propose a location-aware deformable convolution and a backward attention filtering to improve the detection performance. The location-aware deformable convolution extracts the unevenly distributed context features by sampling the input from where informative context exists. Different from the original deformable convolution, the proposed method applies an individual convolutional layer on each input sampling grid location to obtain a wide and unique receptive field for a better offset estimation. Meanwhile, the backward attention filtering module filters the high-resolution feature map by highlighting the informative features and suppressing the distractive features using the semantic features from the deep layers. Extensive experiments are conducted on the KITTI object detection and PASCAL VOC 2007 datasets. The proposed method shows an average 6% performance improvement over the Faster R-CNN baseline, and it has the top-3 performance on the KITTI leaderboard with the fastest processing speed.
[backward, dataset, driving] [computer, vision, kitti, pattern, field, estimation, autonomous, good, international] [proposed, filtering, method, input, ieee, conference, based, figure, comparison, image] [convolution, convolutional, network, performance, layer, neural, deep, standard, dilation, size, table, output, filtered, number, receptive, pooling, original, residual, validation, small, pooled, processing] [attention, generate, machine] [feature, object, detection, deformable, context, map, module, offset, box, bounding, faster, roi, semantic, pascal, voc, three, regression, grid, location, car, spatial, backbone, improvement, carried, pedestrian, distractive, propose, improve] [sampling, embedding, classification, set, training, setup, informative, learning]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Chen and Kim, Joohee},
  title = {Object Detection With Location-Aware Deformable Convolution and Backward Attention Filtering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Variational Prototyping-Encoder: One-Shot Learning With Prototypical Images
Junsik Kim, Tae-Hyun Oh, Seokju Lee, Fei Pan, In So Kweon


In daily life, graphic symbols such as traffic signs and brand logos are ubiquitous around us due to their intuitive expressiveness beyond language boundaries. We tackle an open-set graphic symbol recognition problem by one-shot classification with prototypical images as the single training example for each novel class. We take an approach that learns a generalizable embedding space for novel tasks. We propose a new approach called the variational prototyping-encoder (VPE) that learns the image translation task from real-world input images to their corresponding prototypical images as a meta-task. As a result, VPE learns image similarity as well as prototypical concepts, which differs from widely used metric-learning-based approaches. Our experiments with diverse datasets demonstrate that the proposed VPE performs favorably against competing metric-learning-based one-shot methods. Also, our qualitative analyses show that our meta-task induces an effective embedding space suitable for unseen data representation.
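A minimal sketch of a VAE-style prototyping loss consistent with the described meta-task: the decoder output is compared against the clean prototype image rather than the input, plus a KL regulariser on the latent; the Bernoulli reconstruction term and the beta weighting are assumptions.

import torch
import torch.nn.functional as F

def vpe_style_loss(decoder_logits, prototype_img, mu, logvar, beta=1.0):
    # decoder_logits, prototype_img: (B, C, H, W), prototype pixel values in [0, 1]
    # mu, logvar: (B, Z) latent statistics from the encoder
    recon = F.binary_cross_entropy_with_logits(
        decoder_logits, prototype_img, reduction='sum') / mu.size(0)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / mu.size(0)
    return recon + beta * kl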
[traffic, dataset, learns, recognition] [single, corresponding, international, approach, reconstruction, computer, graphic, well] [real, image, input, latent, based, proposed, method, conference, prior, translation, generative, figure, appearance, ieee, competing] [neural, performance, deep, table, network, size, best, experiment, convolution, layer, symbol, siamese, accuracy, better, original] [sign, variational, encoder, visual, model, understanding, decoder, simple, machine, query] [feature, average] [learning, vpe, prototype, training, data, test, gtsrb, metric, unseen, set, classification, embedding, space, class, support, datasets, logo, distribution, learn, domain, loss, novel, similarity, distance, prototypical, retrieval, log, learned, siamnet, matchnet, trained, vae, train, quadnet, task]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Junsik and Oh, Tae-Hyun and Lee, Seokju and Pan, Fei and So Kweon, In},
  title = {Variational Prototyping-Encoder: One-Shot Learning With Prototypical Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Domain Adaptation Using Feature-Whitening and Consensus Loss
Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Samuel Rota Bulo, Nicu Sebe, Elisa Ricci


A classifier trained on a dataset seldom works on other datasets obtained under different conditions due to domain shift. This problem is commonly addressed by domain adaptation methods. In this work we introduce a novel deep learning framework which unifies different paradigms in unsupervised domain adaptation. Specifically, we propose domain alignment layers which implement feature whitening for the purpose of matching source and target feature distributions. Additionally, we leverage the unlabeled target data by proposing the Min-Entropy Consensus loss, which regularizes training while avoiding the adoption of many user-defined hyper-parameters. We report results on publicly available datasets, considering both digit classification and object recognition tasks. We show that, in most of our experiments, our approach improves upon previous methods, setting new state-of-the-art performances.
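One common reading of a min-entropy consensus objective is that two perturbed views of the same unlabeled target sample should agree on a confident class; the sketch below takes, per sample, the class maximising the summed log-probabilities of both views. This is only an interpretation of the abstract, and the paper's exact formulation may differ.

import torch
import torch.nn.functional as F

def min_entropy_consensus_loss(logits_a, logits_b):
    # logits_a, logits_b: (B, C) classifier outputs for two views of the same batch
    log_pa = F.log_softmax(logits_a, dim=1)
    log_pb = F.log_softmax(logits_b, dim=1)
    joint = log_pa + log_pb                     # (B, C) joint view log-probabilities
    return -0.5 * joint.max(dim=1).values.mean()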
[dataset, previous, second, work] [approach, case, matrix, consensus, problem, respect, corresponding, directly, computed, project] [proposed, consistency, method, based, transform, image, comparison, figure, generative] [deep, whitening, network, covariance, batch, layer, shift, accuracy, order, correlation, better, compare, replace, size, number, group, introducing] [consider, adversarial, common, perturbed] [feature, propose, three, adopted] [target, domain, loss, source, dwt, mec, data, learning, unsupervised, alignment, mnist, adaptation, training, uda, unlabeled, entropy, augmentation, paradigm, sample, svhn, usps, trained, datasets, function, distribution, reported, paper, set, log, stl]
@InProceedings{Roy_2019_CVPR,
  author = {Roy, Subhankar and Siarohin, Aliaksandr and Sangineto, Enver and Rota Bulo, Samuel and Sebe, Nicu and Ricci, Elisa},
  title = {Unsupervised Domain Adaptation Using Feature-Whitening and Consensus Loss},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
FEELVOS: Fast End-To-End Embedding Learning for Video Object Segmentation
Paul Voigtlaender, Yuning Chai, Florian Schroff, Hartwig Adam, Bastian Leibe, Liang-Chieh Chen


Many of the recent successful methods for video object segmentation (VOS) are overly complicated, heavily rely on fine-tuning on the first frame, and/or are slow, and are hence of limited practical use. In this work, we propose FEELVOS as a simple and fast method which does not rely on fine-tuning. In order to segment a video, for each frame FEELVOS uses a semantic pixel-wise embedding together with a global and a local matching mechanism to transfer information from the first frame and from the previous frame of the video to the current frame. In contrast to previous work, our embedding is only used as an internal guidance of a convolutional network. Our novel dynamic segmentation head allows us to train the network, including the embedding, end-to-end for the multiple object segmentation task with a cross entropy loss. We achieve a new state of the art in video object segmentation without fine-tuning with a J&F measure of 71.5% on the DAVIS 2017 validation set. We make our code and models available at https://github.com/tensorflow/models/tree/master/research/feelvos.
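The global matching step can be sketched as a per-pixel nearest-neighbour distance between current-frame embeddings and the first-frame pixels of each object, which is then fed (per object) to the segmentation head. The sketch below omits the windowed local matching to the previous frame and the paper's distance normalisation.

import torch

def global_match_distance(cur_emb, ref_emb, ref_mask):
    # cur_emb: (H, W, D) current-frame embeddings
    # ref_emb: (H, W, D) first-frame embeddings; ref_mask: (H, W) bool object mask
    h, w, d = cur_emb.shape
    cur = cur_emb.reshape(-1, d)
    ref = ref_emb[ref_mask]                      # embeddings of the object's pixels
    if ref.numel() == 0:
        return torch.full((h, w), float('inf'))
    dist = torch.cdist(cur, ref)                 # (H*W, N_obj) pairwise distances
    return dist.min(dim=1).values.reshape(h, w)  # nearest-neighbour distance map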
[frame, video, feelvos, previous, davis, rgmp, current, dynamic, time, challenge, pml, multiple, onavos, premvos, videomatch, osmn, vos, favos, extract] [matching, local, ground, note, truth, practical, single, directly, rely, runtime] [method, pixel, proposed, image, produce, result, based] [validation, design, network, convolutional, conv, fast, table, achieves, denotes, deep, neural, apply, achieve, better, number, practice] [strong, arxiv, preprint, evaluation] [segmentation, object, global, head, map, semantic, backbone, instance, propose, score, mask, final, refinement, belong, logits] [embedding, distance, training, set, learning, nearest, data, neighbor, space, learned, transfer, loss, soft]
@InProceedings{Voigtlaender_2019_CVPR,
  author = {Voigtlaender, Paul and Chai, Yuning and Schroff, Florian and Adam, Hartwig and Leibe, Bastian and Chen, Liang-Chieh},
  title = {FEELVOS: Fast End-To-End Embedding Learning for Video Object Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PartNet: A Recursive Part Decomposition Network for Fine-Grained and Hierarchical Shape Segmentation
Fenggen Yu, Kun Liu, Yan Zhang, Chenyang Zhu, Kai Xu


Deep learning approaches to 3D shape segmentation are typically formulated as a multi-class labeling problem. These models are trained for a fixed set of labels, which greatly limits their flexibility and adaptivity. We opt for top-down recursive decomposition and develop the first deep learning model for hierarchical segmentation of 3D shapes, based on recursive neural networks. Starting from a full shape represented as a point cloud, our model performs recursive binary decomposition, where the decomposition network at all nodes in the hierarchy share weights. At each node, a node classifier is trained to determine the type (adjacency or symmetry) and stopping criteria of its decomposition. The features extracted in higher level nodes are recursively propagated to lower level ones. Thus, the meaningful decompositions in higher levels provide strong contextual cues constraining the segmentations in lower levels. Meanwhile, to increase the segmentation accuracy at each node, we enhance the recursive contextual feature with the shape feature extracted for the corresponding part. Our method segments a 3D shape in point cloud into an arbitrary number of parts, depending on the shape complexity, showing strong generality and flexibility. It achieves the state-of-the-art performance, both for fine-grained and semantic segmentation, on the public benchmark and a new benchmark of fine-grained segmentation proposed in this work. We also demonstrate its application for fine-grained part refinements in image-to-shape reconstruction.
[current, work, performs, leaf] [point, shape, decomposition, cloud, symmetry, pointnet, chair, computer, supplemental, corresponding, shapenet, reconstruction, targeting, analysis, left, consistent, vision] [figure, method, based, acm, comparison, input, conference, ieee] [recursive, network, deep, neural, table, full, number, higher, structure, fixed, binary, lower, convolutional, flexibility] [node, model, visual, child, decoding, depending, type] [segmentation, feature, hierarchical, semantic, partnet, hierarchy, module, benchmark, segment, context, iou, labeling, sgpn, three, instance, contextual, level, average, sofa] [learning, trained, training, classification, label, loss, set, classifier, existing]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Fenggen and Liu, Kun and Zhang, Yan and Zhu, Chenyang and Xu, Kai},
  title = {PartNet: A Recursive Part Decomposition Network for Fine-Grained and Hierarchical Shape Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Multi-Class Segmentations From Single-Class Datasets
Konstantin Dmitriev, Arie E. Kaufman


Multi-class segmentation has recently achieved significant performance in natural images and videos. This achievement is due primarily to the public availability of large multi-class datasets. However, there are certain domains, such as biomedical images, where obtaining sufficient multi-class annotations is a laborious and often impossible task and only single-class datasets are available. While existing segmentation research in such domains uses private multi-class datasets or focuses on single-class segmentations, we propose a unified, highly efficient framework for robust simultaneous learning of multi-class segmentations by combining single-class datasets and utilizing a novel way of conditioning a convolutional network for the purpose of segmentation. We demonstrate various ways of incorporating the conditional information, perform an extensive evaluation, and show compelling multi-class segmentation performance on biomedical images, which outperforms current state-of-the-art solutions (up to 2.7%). Unlike current solutions, which are meticulously tailored for particular single-class datasets, we utilize datasets from a variety of sources. Furthermore, we also show the applicability of our method to natural images and evaluate it on the Cityscapes dataset. We further discuss other possible applications of our proposed framework.
[dataset, framework, recognition, work, current, multiple] [computer, vision, pattern, single, international, additional, approach, applicability, ground, truth] [image, conditioning, conference, proposed, ieee, conditional, method, imaging, input, separate, biomedical, demonstrate, figure, described] [convolutional, performance, layer, network, neural, convnet, binary, experiment, size, table, processing, number, deep, computing, efficient, lookup, compare] [model, natural, conditioned, decoder, describe, evaluate, intervention, private, encoder, common, visual] [segmentation, pancreas, medical, semantic, liver, abdominal, spleen, illustrated, spatial, urban, propose] [datasets, learning, training, class, set, trained, target, base, test, purpose, novel]
@InProceedings{Dmitriev_2019_CVPR,
  author = {Dmitriev, Konstantin and Kaufman, Arie E.},
  title = {Learning Multi-Class Segmentations From Single-Class Datasets},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Convolutional Recurrent Network for Road Boundary Extraction
Justin Liang, Namdar Homayounfar, Wei-Chiu Ma, Shenlong Wang, Raquel Urtasun


Creating high-definition maps that contain precise information of static elements of the scene is of utmost importance for enabling self-driving cars to drive safely. In this paper, we tackle the problem of drivable road boundary extraction from LiDAR and camera imagery. Towards this goal, we design a structured model where a fully convolutional network obtains deep features encoding the location and direction of road boundaries and then a convolutional recurrent network outputs a polyline representation for each one of them. Importantly, our method is fully automatic and does not require a user in the loop. We showcase the effectiveness of our method on a large North American city where we obtain perfect topology of road boundaries 99.3% of the time at high precision and recall.
[recurrent, predict, drivable, work, perform, prediction, amortized, driving, lane, static] [direction, ground, truth, lidar, computer, camera, vision, international, topology, corresponding, vertex, dense, elevation, intelligent, pattern, field, note, david, journal, definition, problem, sensor] [conference, figure, ieee, input, image, imagery, high, method, transform, difference, based] [network, deep, convolutional, structured, output, number, precision, process, neural] [model, automatic, visual, encoder, vector, encoding] [road, boundary, map, predicted, aerial, raquel, polyline, location, feature, polylines, detection, semantic, score, segmentation, csnake, extraction, shenlong, recall, fully, spatial, precise, satellite, connectivity, rotated, sanja] [learning, distance, trained, representation, transportation]
@InProceedings{Liang_2019_CVPR,
  author = {Liang, Justin and Homayounfar, Namdar and Ma, Wei-Chiu and Wang, Shenlong and Urtasun, Raquel},
  title = {Convolutional Recurrent Network for Road Boundary Extraction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation
Hanchao Li, Pengfei Xiong, Haoqiang Fan, Jian Sun


This paper introduces an extremely efficient CNN architecture named DFANet for semantic segmentation under resource constraints. Our proposed network starts from a single lightweight backbone and aggregates discriminative features through sub-network and sub-stage cascade respectively. Based on the multi-scale feature propagation, DFANet substantially reduces the number of parameters, but still obtains sufficient receptive field and enhances the model learning ability, which strikes a balance between speed and segmentation performance. Experiments on Cityscapes and CamVid datasets demonstrate the superior performance of DFANet, which requires 8x fewer FLOPs and runs 2x faster than existing state-of-the-art real-time semantic segmentation methods while providing comparable accuracy. Specifically, it achieves 70.3% Mean IOU on the Cityscapes test dataset with only 1.7 GFLOPs and a speed of 160 FPS on one NVIDIA Titan X card, and 71.3% Mean IOU with 3.4 GFLOPs while inferring on a higher resolution image.
[previous, prediction, dataset] [computer, pattern, field, single, vision, analysis, depth] [proposed, image, resolution, input, figure, based, ieee, method, conference, composed, high] [aggregation, network, xception, deep, dfanet, speed, performance, accuracy, structure, receptive, architecture, convolution, output, table, implement, lightweight, inference, fps, computation, better, aspp, number, size, neural, aggregate, convolutional, camvid, achieves, small, depthwise, separable, operation, upsampling, layer] [model, arxiv, decoder, preprint, attention, encoder] [backbone, semantic, feature, segmentation, spatial, module, stage, context, pyramid, enhance, miou, combine, final, three, fuse] [learning, test, training, classification, set]
@InProceedings{Li_2019_CVPR,
  author = {Li, Hanchao and Xiong, Pengfei and Fan, Haoqiang and Sun, Jian},
  title = {DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Cross-Season Correspondence Dataset for Robust Semantic Segmentation
Mans Larsson, Erik Stenborg, Lars Hammarstrand, Marc Pollefeys, Torsten Sattler, Fredrik Kahl


In this paper, we present a method to utilize 2D-2D point matches between images taken during different image conditions to train a convolutional neural network for semantic segmentation. Enforcing label consistency across the matches makes the final segmentation algorithm robust to seasonal changes. We describe how these 2D-2D matches can be generated with little human interaction by geometrically matching points from 3D models built from images. Two cross-season correspondence datasets are created providing 2D-2D matches across seasonal changes as well as from day to night. The datasets are made publicly available to facilitate further research. We show that adding the correspondences as extra supervision during training improves the segmentation performance of the convolutional neural network, making it more robust to seasonal changes and weather conditions.
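One simple way to turn such 2D-2D matches into a training signal is a consistency loss between the network's predictions at matched pixel locations. The numpy sketch below uses a symmetric cross-entropy between softmax outputs; this is only one plausible choice (the paper also considers hinge-style variants), and the function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def correspondence_consistency_loss(logits_a, logits_b, matches):
    """Cross-entropy between class distributions at matched pixels.

    logits_a, logits_b: (H, W, C) segmentation logits of the two images
    matches: list of ((ya, xa), (yb, xb)) 2D-2D point matches
    """
    loss = 0.0
    for (ya, xa), (yb, xb) in matches:
        pa = softmax(logits_a[ya, xa])
        pb = softmax(logits_b[yb, xb])
        # symmetric cross-entropy; pushes the two predictions to agree
        loss += -(pb * np.log(pa + 1e-8)).sum() - (pa * np.log(pb + 1e-8)).sum()
    return loss / max(len(matches), 1)

# toy usage with random logits and two matches
rng = np.random.default_rng(0)
la, lb = rng.normal(size=(16, 16, 19)), rng.normal(size=(16, 16, 19))
print(correspondence_consistency_loss(la, lb, [((3, 4), (5, 6)), ((10, 2), (9, 1))]))
```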
[dataset, work] [correspondence, cmu, robotcar, point, camera, well, foliage, marc, weather, dense, scene, reconstruction, depth, lidar, hingef, robust, matching, corresponding, torsten, seasonal, pose, constraint, note] [image, reference, pixel, presented, row, based, wilddash, method, consistency] [performance, network, convolutional, table, neural, number, validation, compared, parameter] [adding, visual, adversarial, arxiv, preprint, step, snow] [semantic, segmentation, final, localization, annotated, feature, extra, cnn, fully, included] [loss, training, set, trained, domain, datasets, target, test, learning, data, adaptation, hinge, class, large, supervised, distance, trevor, traversal]
@InProceedings{Larsson_2019_CVPR,
  author = {Larsson, Mans and Stenborg, Erik and Hammarstrand, Lars and Pollefeys, Marc and Sattler, Torsten and Kahl, Fredrik},
  title = {A Cross-Season Correspondence Dataset for Robust Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ManTra-Net: Manipulation Tracing Network for Detection and Localization of Image Forgeries With Anomalous Features
Yue Wu, Wael AbdAlmageed, Premkumar Natarajan


To fight against real-life image forgery, which commonly involves different types and combined manipulations, we propose a unified deep neural architecture called ManTra-Net. Unlike many existing solutions, ManTra-Net is an end-to-end network that performs both detection and localization without extra preprocessing and postprocessing. ManTra-Net is a fully convolutional network and handles images of arbitrary sizes and many known forgery types such as splicing, copy-move, removal, enhancement, and even unknown types. This paper has three salient contributions. We design a simple yet effective self-supervised learning task to learn robust image manipulation traces from classifying 385 image manipulation types. Further, we formulate the forgery localization problem as a local anomaly detection problem, design a Z-score feature to capture local anomalies, and propose a novel long short-term memory solution to assess local anomalies. Finally, we carefully conduct ablation experiments to systematically optimize the proposed network design. Our extensive experimental results demonstrate the generalizability, robustness and superiority of ManTra-Net, not only in single types of manipulations/forgeries, but also in their complicated combinations.
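The Z-score feature mentioned above compares a pixel's manipulation-trace feature with the statistics of its surroundings, so that forged pixels stand out as local anomalies. The sketch below is a simplified variant that standardizes against image-level statistics (the paper computes reference statistics over local windows at several scales); names and numbers are illustrative.

```python
import numpy as np

def zscore_feature(feat, eps=1e-6):
    """Standardize each pixel's feature against image-level statistics.

    feat: (H, W, D) manipulation-trace features from a backbone.
    Returns (H, W, D) z-scores; large magnitudes flag local anomalies.
    """
    mu = feat.mean(axis=(0, 1), keepdims=True)
    sigma = feat.std(axis=(0, 1), keepdims=True)
    return (feat - mu) / (sigma + eps)

rng = np.random.default_rng(0)
f = rng.normal(size=(32, 32, 8))
f[10:14, 10:14] += 4.0                         # a "forged" region with deviating traces
anomaly = np.abs(zscore_feature(f)).mean(-1)   # per-pixel anomaly score
print(anomaly[12, 12] > anomaly[0, 0])         # True: the forged region stands out
```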
[anomaly, dataset, signal, combined] [local, computer, international, pattern, camera, vision, dominant, solution, manipulated] [image, manipulation, forgery, ieee, conference, forged, splicing, imc, ifld, trace, anomalous, study, half, proposed, kcmi, forensics, pristine, noise, face, patch, method, difference, based, figure, pixel, color, reference, synthesized, inpainting, casia, columbia] [table, performance, relu, network, deep, convolutional, dnn, neural, architecture, size, validation, small, number, best, nist] [model, random, evaluation, decision, evaluate, multimedia] [feature, detection, localization, region, hierarchy, three, level, fully, cnn] [learning, testing, classification, novel, sample, training, experimental]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Yue and AbdAlmageed, Wael and Natarajan, Premkumar},
  title = {ManTra-Net: Manipulation Tracing Network for Detection and Localization of Image Forgeries With Anomalous Features},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
On Zero-Shot Recognition of Generic Objects
Tristan Hascoet, Yasuo Ariki, Tetsuya Takiguchi


Many recent advances in computer vision are the results of a healthy competition among researchers on high-quality, task-specific benchmarks. After a decade of active research, the accuracy of zero-shot learning (ZSL) models on the ImageNet benchmark remains far too low to be considered for practical object recognition applications. In this paper, we argue that the main reason behind this apparent lack of progress is the poor quality of this benchmark. We highlight major structural flaws of the current benchmark and analyze different factors impacting the accuracy of ZSL models. We show that the actual classification accuracy of existing ZSL models is significantly higher than previously thought once we account for these flaws. We then introduce the notion of structural bias specific to ZSL datasets. We discuss how the presence of this new form of bias allows for a trivial solution to the standard benchmark and conclude on the need for a new benchmark. We then detail the semi-automated construction of a new benchmark to address these flaws.
[structural, recognition, dataset, gcn, consists] [solution, computer, vision, linear, practical, analysis, respect, define, pattern, supplementary, local, construction] [image, figure, quality, based, proposed, conference, ieee, high] [accuracy, standard, imagenet, population, ratio, low, impact, distributed, secondary, filter] [visual, word, model, wordnet, example, ambiguous, evaluate, appendix, meaning, identify, classified, primary, ability, evaluation, consider, fact] [benchmark, semantic, object, baseline, argue] [test, zsl, training, set, trivial, class, bias, sample, embeddings, classification, split, reported, toy, generic, label, existing, cte, learning, negative, rare, notion, specific, distance, similarity, main]
@InProceedings{Hascoet_2019_CVPR,
  author = {Hascoet, Tristan and Ariki, Yasuo and Takiguchi, Tetsuya},
  title = {On Zero-Shot Recognition of Generic Objects},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Explicit Bias Discovery in Visual Question Answering Models
Varun Manjunatha, Nirat Saini, Larry S. Davis


Researchers have observed that Visual Question Answering (VQA) models tend to answer questions by learning statistical biases in the data. For example, their answer to the question "What is the color of the grass?" is usually "Green", whereas a question like "What is the title of the book?" cannot be answered by inferring statistical biases. It is of interest to the community to explicitly discover such biases, both for understanding the behavior of such models and towards debugging them. Our work addresses this problem. In a database, we store the words of the question, the answer, and the visual words corresponding to regions of interest in attention maps. By running simple rule mining algorithms on this database, we discover human-interpretable rules which give us unique insight into the behavior of such models. Our results also show examples of unusual behaviors learned by models in attempting VQA tasks.
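The rule-mining step can be illustrated with a toy transaction database: each transaction holds question words, visual words, and the model's answer, and frequent itemsets with high-confidence consequents become human-interpretable rules. The sketch below does exhaustive support counting rather than a proper Apriori implementation, and the data and thresholds are made up.

```python
from collections import Counter
from itertools import combinations

# Each "transaction" holds question words, detected visual words, and the answer
# the model produced; the data is made up purely for illustration.
transactions = [
    ({"what", "color", "grass", "vis:grass"}, "green"),
    ({"what", "color", "grass", "vis:field"}, "green"),
    ({"what", "color", "banana", "vis:banana"}, "yellow"),
    ({"what", "color", "banana", "vis:table"}, "yellow"),
    ({"what", "title", "book", "vis:book"}, "unknown"),
]

def mine_rules(transactions, min_support=2, min_conf=0.9, max_len=2):
    """Return human-interpretable rules (itemset -> answer) with enough support and confidence."""
    itemset_counts, rule_counts = Counter(), Counter()
    for items, answer in transactions:
        for k in range(1, max_len + 1):
            for subset in combinations(sorted(items), k):
                itemset_counts[subset] += 1
                rule_counts[(subset, answer)] += 1
    rules = []
    for (subset, answer), c in rule_counts.items():
        conf = c / itemset_counts[subset]
        if c >= min_support and conf >= min_conf:
            rules.append((subset, answer, c, conf))
    return rules

for antecedent, answer, support, conf in mine_rules(transactions):
    print(f"{set(antecedent)} -> {answer}  (support={support}, conf={conf:.2f})")
```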
[dataset, work, behavior, time, explicitly, follow, second, people, focus] [problem, provide, confidence, corresponding, algorithm, note, vision] [image, figure, method, statistical, based, gender, database] [validation, deep, neural, network, number] [visual, vqa, question, model, answer, attention, rule, antecedant, answering, language, word, discover, machine, frequent, itemset, answered, indicates, consequents, observed, provided, apriori, correctly, understanding, simple, plausible, choose, miner, mscoco, common, debugging, unique, insight] [baseline, presence, bounding, box, predicted, cropping, discovery, interest] [learned, learning, set, mining, support, data, bias, large, training, classification, corresponds, task]
@InProceedings{Manjunatha_2019_CVPR,
  author = {Manjunatha, Varun and Saini, Nirat and Davis, Larry S.},
  title = {Explicit Bias Discovery in Visual Question Answering Models},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
REPAIR: Removing Representation Bias by Dataset Resampling
Yi Li, Nuno Vasconcelos


Modern machine learning datasets can have biases for certain representations that are leveraged by algorithms to achieve high performance without learning to solve the underlying task. This problem is referred to as "representation bias". The question of how to reduce the representation biases of a dataset is investigated and a new dataset REPresentAtion bIas Removal (REPAIR) procedure is proposed. This formulates bias minimization as an optimization problem, seeking a weight distribution that penalizes examples easy for a classifier built on a given feature representation. Bias reduction is then equated to maximizing the ratio between the classification loss on the reweighted dataset and the uncertainty of the ground-truth class labels. This is a minimax problem that REPAIR solves by alternatingly updating classifier parameters and dataset resampling weights, using stochastic gradient descent. An experimental set-up is also introduced to measure the bias of any dataset for a given representation, and the impact of this bias on the performance of recognition models. Experiments with synthetic and action recognition data show that dataset REPAIR can significantly reduce representation bias, and lead to improved generalization of models trained on REPAIRed datasets. The tools used for characterizing representation bias, and the proposed dataset REPAIR algorithm, are available at https://github.com/JerryYLi/Dataset-REPAIR/.
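The minimax formulation above can be sketched in a few lines of PyTorch: a classifier is trained on the reweighted examples while per-example weight logits are updated by gradient ascent on the ratio of the weighted classification loss to the entropy of the reweighted labels. This is a toy, full-batch illustration with made-up data and step sizes, not the released implementation.

```python
import torch

torch.manual_seed(0)
X = torch.randn(200, 16)                      # feature representation of the dataset
y = torch.randint(0, 4, (200,))               # class labels
clf = torch.nn.Linear(16, 4)                  # classifier built on the representation
s = torch.zeros(200, requires_grad=True)      # weight logits, w_i = sigmoid(s_i)
opt_clf = torch.optim.SGD(clf.parameters(), lr=0.1)
opt_w = torch.optim.SGD([s], lr=0.1)

def ratio_objective():
    p = torch.sigmoid(s)
    p = p / p.sum()                                            # resampling distribution
    losses = torch.nn.functional.cross_entropy(clf(X), y, reduction="none")
    weighted_loss = (p * losses).sum()
    # entropy of the class labels under the resampling weights
    class_prob = torch.stack([p[y == c].sum() for c in range(4)])
    entropy = -(class_prob * torch.log(class_prob + 1e-8)).sum()
    return weighted_loss / entropy

for _ in range(200):
    opt_clf.zero_grad(); ratio_objective().backward(); opt_clf.step()    # classifier minimizes
    opt_w.zero_grad(); (-ratio_objective()).backward(); opt_w.step()     # weights maximize
```

After the alternation converges, resampling the dataset according to the normalized weights would down-weight examples that are easy for the chosen representation, which is the intended debiasing effect.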
[dataset, recognition, action, static, temporal, video, kinetics, human, dependency] [computer, vision, international, pattern, solve, optimization, case, problem, well] [color, conference, figure, fairness] [performance, deep, neural, original, accuracy, weight, reduce, table, reduction, structure, size, processing, performed, convolutional, larger] [model, random, procedure, ability, machine, example, enables, visual] [cnn, feature, three, european, evaluated, contextual] [bias, representation, learning, resampling, repair, generalization, training, test, datasets, colored, set, biased, learned, mnist, class, log, repaired, data, trained, classifier, function, sampling, large, classification, measure, resampled, digit, distribution, loss, unbiased, min]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yi and Vasconcelos, Nuno},
  title = {REPAIR: Removing Representation Bias by Dataset Resampling},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Label Efficient Semi-Supervised Learning via Graph Filtering
Qimai Li, Xiao-Ming Wu, Han Liu, Xiaotong Zhang, Zhichao Guan


Graph-based methods have been demonstrated as one of the most effective approaches for semi-supervised learning, as they can exploit the connectivity patterns between labeled and unlabeled data samples to improve learning performance. However, existing graph-based methods either are limited in their ability to jointly model graph structures and data features, such as the classical label propagation methods, or require a considerable amount of labeled data for training and validation due to high model complexity, such as the recent neural-network-based methods. In this paper, we address label efficient semi-supervised learning from a graph filtering perspective. Specifically, we propose a graph filtering framework that injects graph similarity into data features by taking them as signals on the graph and applying a low-pass graph filter to extract useful data representations for classification, where label efficiency can be achieved by conveniently adjusting the strength of the graph filter. Interestingly, this framework unifies two seemingly very different methods -- label propagation and graph convolutional networks. Revisiting them under the graph filtering framework leads to new insights that improve their modeling capabilities and reduce model complexity. Experiments on various semi-supervised classification tasks on four citation networks and one knowledge graph and one semi-supervised regression task for zero-shot image recognition validate our findings and proposals.
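A minimal instance of the graph-filtering view is to smooth node features with a few applications of the renormalized adjacency matrix (a simple low-pass filter) and then train an ordinary classifier on the filtered features of the labeled nodes. The sketch below shows only the filtering step on a made-up toy graph; the specific filters and strengths studied in the paper differ.

```python
import numpy as np

def low_pass_filter_features(A, X, k=2):
    """Smooth node features X with k applications of the symmetric renormalized
    adjacency of graph A, a simple low-pass graph filter."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d = A_hat.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A_hat @ D_inv_sqrt            # renormalized adjacency
    for _ in range(k):
        X = S @ X
    return X

# toy graph: two 3-node cliques joined by a single edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
X = np.random.default_rng(0).normal(size=(6, 4))
X_smooth = low_pass_filter_features(A, X, k=2)
# a simple classifier (e.g. logistic regression) would then be fit on X_smooth
# for the few labeled nodes and applied to the unlabeled ones
```

Increasing k strengthens the filter, which is the knob referred to above for trading off label efficiency against over-smoothing.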
[graph, gcn, signal, propagation, citation, cora, recognition, framework] [matrix, international, laplacian, vertex, normalized, analysis, computer, classical] [conference, filtering, image, strength, smooth, proposed, frequency, produce, spectral, input] [filter, convolutional, neural, processing, deep, network, performance, layer, design, table, efficiency, trick, outperform, regularization] [model, machine, vector, par, basis, artificial, arxiv, preprint, manifold, entity, document] [feature, including, response, propose, improve, regression, three] [label, learning, data, igcn, glp, classification, labeled, unlabeled, large, function, class, rnm, training, classifier, set, knowledge, test, renormalization, embedding, semisupervised, classify, train, supervised]
@InProceedings{Li_2019_CVPR,
  author = {Li, Qimai and Wu, Xiao-Ming and Liu, Han and Zhang, Xiaotong and Guan, Zhichao},
  title = {Label Efficient Semi-Supervised Learning via Graph Filtering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MVTec AD -- A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection
Paul Bergmann, Michael Fauser, David Sattlegger, Carsten Steger


The detection of anomalous structures in natural image data is of utmost importance for numerous tasks in the field of computer vision. The development of methods for unsupervised anomaly detection requires data on which to train and evaluate new approaches and ideas. We introduce the MVTec Anomaly Detection (MVTec AD) dataset containing 5354 high-resolution color images of different object and texture categories. It contains normal, i.e., defect-free, images intended for training and images with anomalies intended for testing. The anomalies manifest themselves in the form of over 70 different types of defects such as scratches, dents, contaminations, and various structural changes. In addition, we provide pixel-precise ground truth regions for all anomalies. We also conduct a thorough evaluation of current state-of-the-art unsupervised anomaly detection methods based on deep architectures such as convolutional autoencoders, generative adversarial networks, and feature descriptors using pre-trained convolutional neural networks, as well as classical computer vision methods. This initial benchmark indicates that there is considerable room for improvement. To the best of our knowledge, this is the first comprehensive, multi-object, multi-defect dataset for anomaly detection that provides pixel-accurate ground truth regions and focuses on real-world applications.
[anomaly, dataset, structural] [provide, well, international, computer, ground, vision, truth, initial, outlier, good] [anomalous, image, texture, inspection, method, figure, industrial, conference, mvtec, input, variation, generative, pixel, anogan, dictionary, based, proposed, ten, metal, ssim, solely, latent, row, comprehensive, differ] [size, deep, defect, performance, convolutional, neural, applied, performed, best] [model, evaluation, machine, generated, natural, adversarial, evaluate, example] [detection, object, segmentation, feature, evaluated, category, cnn, detect, threshold, benchmark] [training, unsupervised, learning, classification, data, test, datasets, set, autoencoder, trained, autoencoders, train, novel]
@InProceedings{Bergmann_2019_CVPR,
  author = {Bergmann, Paul and Fauser, Michael and Sattlegger, David and Steger, Carsten},
  title = {MVTec AD -- A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ABC: A Big CAD Model Dataset for Geometric Deep Learning
Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, Daniele Panozzo


We introduce ABC-Dataset, a collection of one million Computer-Aided Design (CAD) models for research of geometric deep learning methods and applications. Each model is a collection of explicitly parametrized curves and surfaces, providing ground truth for differential quantities, patch segmentation, geometric feature detection, and shape reconstruction. Sampling the parametric descriptions of surfaces and curves allows generating data in different formats and resolutions, enabling fair comparisons for a wide range of geometric learning algorithms. As a use case for our dataset, we perform a large-scale benchmark for estimation of surface normals, comparing existing data driven methods and evaluating their performance against both the ground truth and traditional normal estimation methods.
[dataset, graph, recognition, overview] [point, surface, geometric, cad, normal, estimation, ground, computer, shape, truth, geometry, vision, differential, local, pattern, analytic, mesh, robust, cloud, triangle, angle, discrete, well, deviation, rost, university, onshape, compute, pipeline, allows, case, analysis, vertex, osculating, cgal] [ieee, conference, patch, sharp, figure, resolution, based, spectral, acm, input] [deep, neural, processing, convolutional, full, number, size] [model, collection, step] [feature, benchmark, connectivity, cnn] [learning, data, large, representation, existing, datasets, set, sampling, uniform, training, loss, extension]
@InProceedings{Koch_2019_CVPR,
  author = {Koch, Sebastian and Matveev, Albert and Jiang, Zhongshi and Williams, Francis and Artemov, Alexey and Burnaev, Evgeny and Alexa, Marc and Zorin, Denis and Panozzo, Daniele},
  title = {ABC: A Big CAD Model Dataset for Geometric Deep Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Tightness-Aware Evaluation Protocol for Scene Text Detection
Yuliang Liu, Lianwen Jin, Zecheng Xie, Canjie Luo, Shuaitao Zhang, Lele Xie


Evaluation protocols play a key role in the developmental progress of text detection methods. There are strict requirements to ensure that the evaluation methods are fair, objective and reasonable. However, existing metrics exhibit some obvious drawbacks: 1) They are not goal-oriented; 2) they cannot recognize the tightness of detection methods; 3) existing one-to-many and many-to-one solutions involve inherent loopholes and deficiencies. Therefore, this paper proposes a novel evaluation protocol called the Tightness-aware Intersect-over-Union (TIoU) metric that can quantify the completeness of ground truth, the compactness of detection, and the tightness of the matching degree. Specifically, instead of merely using the IoU value, two common detection behaviors are properly considered, and the TIoU score is used directly to recognize tightness. In addition, we further propose a straightforward method to address the annotation granularity issue, which can fairly evaluate word and text-line detections simultaneously. By adopting the detection results from published methods and general object detection frameworks, comprehensive experiments on ICDAR 2013 and ICDAR 2015 datasets are conducted to compare recent metrics and the proposed TIoU metric. The comparison demonstrates some promising new prospects, e.g., determining the methods and frameworks for which the detection is tighter and more beneficial to recognize. Our method is extremely simple; its novelty lies in the fact that simple but reasonable refinements lead to many interesting and insightful prospects and resolve most of the issues of the previous metrics. The code is publicly available at https://github.com/Yuliang-Liu/TIoU-metric.
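With axis-aligned boxes, the ingredients of such a metric are easy to compute: the standard IoU, a completeness ratio (how much of the ground truth the detection covers), and a compactness ratio (how much of the detection lies inside the ground truth). The sketch below only computes these three quantities on made-up boxes; how TIoU actually weights and combines them follows the paper's definitions, which are not reproduced here.

```python
import numpy as np

def box_area(b):
    return max(b[2] - b[0], 0.0) * max(b[3] - b[1], 0.0)

def iou_components(det, gt):
    """det, gt: axis-aligned boxes (x1, y1, x2, y2).
    Returns (iou, completeness, compactness)."""
    ix1, iy1 = max(det[0], gt[0]), max(det[1], gt[1])
    ix2, iy2 = min(det[2], gt[2]), min(det[3], gt[3])
    inter = box_area((ix1, iy1, ix2, iy2))
    union = box_area(det) + box_area(gt) - inter
    iou = inter / union if union > 0 else 0.0
    completeness = inter / box_area(gt) if box_area(gt) > 0 else 0.0   # ground-truth coverage
    compactness = inter / box_area(det) if box_area(det) > 0 else 0.0  # detection tightness
    return iou, completeness, compactness

# a loose detection around a text line: IoU alone hides how loose it is
print(iou_components((0, 0, 120, 40), (10, 5, 100, 35)))
```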
[previous, recognition, recognize, dataset] [matching, scene, computer, pattern, ground, solution, vision, truth, robust, international, directly, general, defined, inconsistency, analysis, completeness] [conference, ieee, proposed, figure, method, based, image] [precision, table, calculate, performance, neural, higher] [text, evaluation, evaluate, perfect, considered, cutting, consider, evaluating, arxiv, preprint, primary] [detection, tiou, iou, icdar, recall, bounding, annotation, object, ned, tightness, matched, detecting, pixellink, three, east, score, box, false, bestm, gts, evaluated, tighter, adopted, region, faster, fairly, threshold] [metric, target, set, avoid, quantify, address]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Yuliang and Jin, Lianwen and Xie, Zecheng and Luo, Canjie and Zhang, Shuaitao and Xie, Lele},
  title = {Tightness-Aware Evaluation Protocol for Scene Text Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PointConv: Deep Convolutional Networks on 3D Point Clouds
Wenxuan Wu, Zhongang Qi, Li Fuxin


Unlike images which are represented in regular dense grids, 3D point clouds are irregular and unordered, hence applying convolution on them can be difficult. In this paper, we extend the dynamic filter to a new convolution operation, named PointConv. PointConv can be applied on point clouds to build deep convolutional networks. We treat convolution kernels as nonlinear functions of the local coordinates of 3D points comprised of weight and density functions. With respect to a given point, the weight functions are learned with multi-layer perceptron networks and the density functions through kernel density estimation. A novel reformulation is proposed for efficiently computing the weight functions, which allowed us to dramatically scale up the network and significantly improve its performance. The learned convolution kernel can be used to compute translation-invariant and permutation-invariant convolution on any point set in the 3D space. Besides, PointConv can also be used as deconvolution operators to propagate features from a subsampled point cloud back to its original resolution. Experiments on ModelNet40, ShapeNet, and ScanNet show that deep convolutional neural networks built on PointConv are able to achieve state-of-the-art on challenging semantic segmentation benchmarks on 3D point clouds. Besides, our experiments converting CIFAR-10 into a point cloud showed that networks built on PointConv can match the performance of convolutional networks in 2D images of a similar structure.
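The core operation can be sketched for a single center point and a single output channel: an MLP maps each neighbor's local coordinates to per-channel weights, the weights are rescaled by inverse density, and the weighted neighbor features are summed. The numpy sketch below uses random weights in place of the learned MLP and a constant density in place of kernel density estimation; it ignores the efficient reformulation and the multi-channel output of the real layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def tiny_mlp(rel_xyz, W1, W2):
    """Weight function: maps relative coordinates (k, 3) to per-neighbor weights (k, C_in)."""
    h = np.maximum(rel_xyz @ W1, 0.0)        # ReLU hidden layer
    return h @ W2

def pointconv_at(center, neighbors, feats, density, W1, W2):
    """One PointConv-style output value at `center`.

    neighbors: (k, 3) neighbor coordinates, feats: (k, C_in) their features,
    density:   (k,) estimated local density (e.g. from kernel density estimation).
    """
    rel = neighbors - center                  # local coordinates
    w = tiny_mlp(rel, W1, W2)                 # (k, C_in) learned weights
    inv_density = 1.0 / (density + 1e-8)      # density compensation
    return (w * feats * inv_density[:, None]).sum()

# toy usage: 8 neighbors, 4 input channels; MLP weights are random stand-ins
k, c_in, hidden = 8, 4, 16
neighbors = rng.normal(size=(k, 3))
feats = rng.normal(size=(k, c_in))
density = np.full(k, 0.5)
W1, W2 = rng.normal(size=(3, hidden)), rng.normal(size=(hidden, c_in))
print(pointconv_at(np.zeros(3), neighbors, feats, density, W1, W2))
```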
[work, version, recognition, perform] [point, local, cloud, computer, vision, continuous, pattern, scannet, inverse, approach, compute, scene, shape, volumetric, directly, indoor] [input, figure, ieee, conference, image, proposed, deconvolution, based, nonlinear] [pointconv, convolution, density, weight, cin, convolutional, deep, scale, network, cmid, layer, neural, mlp, efficient, table, number, order, cout, output, operation, approximation, approximate, structure, kernel, achieve, performance, filter, size, better, applied, original, apply] [memory, arxiv, preprint, evaluate, sampled] [segmentation, feature, semantic, region, propose, cnn, object, miou] [learning, learned, classification, function, set, novel, data, conventional, viewed, training]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Wenxuan and Qi, Zhongang and Fuxin, Li},
  title = {PointConv: Deep Convolutional Networks on 3D Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Octree Guided CNN With Spherical Kernels for 3D Point Clouds
Huan Lei, Naveed Akhtar, Ajmal Mian


We propose an octree guided neural network architecture and spherical convolutional kernel for machine learning from arbitrary 3D point clouds. The network architecture capitalizes on the sparse nature of irregular point clouds, and hierarchically coarsens the data representation with space partitioning. At the same time, the proposed spherical kernels systematically quantize point neighborhoods to identify local geometric structures in the data, while maintaining the properties of translation-invariance and asymmetry. We specify spherical kernels with the help of network neurons that in turn are associated with spatial locations. We exploit this association to avoid dynamic kernel generation during network training, which enables efficient learning with high-resolution point clouds. The effectiveness of the proposed technique is established on the benchmark tasks of 3D object classification and segmentation, achieving competitive performance on ShapeNet and RueMonge2014 datasets.
[graph, dynamic, dataset, perform, leaf, time] [point, spherical, octree, cloud, computer, vision, pattern, local, geometric, neighborhood, range, international, associated, radius, irregular, volumetric, directly, compute, pointnet, computed, sphere, shape, depth] [conference, input, proposed, ieee, spectral, raw, method, figure, resolution, based] [network, convolutional, neural, kernel, convolution, number, processing, octnet, performance, layer, table, process, standard, computational, applied, weight, architecture, deep, achieve, search, activation, structure, size, performed] [tree, node] [segmentation, spatial, object, feature, semantic, guided, bin] [data, classification, learning, space, representation, large, exploit, partitioning, training, existing, maximum, train]
@InProceedings{Lei_2019_CVPR,
  author = {Lei, Huan and Akhtar, Naveed and Mian, Ajmal},
  title = {Octree Guided CNN With Spherical Kernels for 3D Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
VITAMIN-E: VIsual Tracking and MappINg With Extremely Dense Feature Points
Masashi Yokozuka, Shuji Oishi, Simon Thompson, Atsuhiko Banno


In this paper, we propose a novel indirect monocular simultaneous localization and mapping (SLAM) algorithm called "VITAMIN-E," which is highly accurate and robust as a result of tracking extremely dense feature points. Typical indirect methods have difficulty in reconstructing dense geometry because of their careful feature point selection for accurate matching. Unlike conventional methods, the proposed method processes an enormous number of feature points by tracking local extrema of curvature based on dominant flow estimation. Because this may lead to high computational cost during bundle adjustment, we propose a novel optimization technique called the "subspace Newton's method" that significantly improves the computational efficiency of bundle adjustment by partially updating the variables. We concurrently generate meshes from the reconstructed points and merge them into an entire three-dimensional (3D) model. Experimental results on the SLAM benchmark EuRoC demonstrated that the proposed method outperformed state-of-the-art SLAM methods such as DSO, ORB-SLAM, and LSD-SLAM, both in terms of accuracy and robustness in trajectory estimation. As a result of the dense feature points, the proposed method simultaneously generated highly detailed 3D geometry in real time using only a CPU.
[frame, tracking, flow, time, multiple, motion] [dense, point, international, camera, monocular, slam, bundle, indirect, curvature, adjustment, direct, matrix, hcp, euroc, robust, geometry, gaussnewton, reconstruction, hpp, robotics, dominant, vision, tsdf, daniel, odometry, computer, stereo, automation, michael, accurate, local, optimization, mesh, position, note, equation, hcc, loop, pattern, pure, error, allows, denote, inverse] [method, proposed, conference, image, mapping, real, reconstructed, figure, result, high, ieee, change] [number, accuracy, initialization, size, fast, table, highly, process, computational, cost, performed] [visual, success] [feature, localization, detailed, map, average, easy] [large, function, novel, conventional, updating, subspace, experimental, difficult]
@InProceedings{Yokozuka_2019_CVPR,
  author = {Yokozuka, Masashi and Oishi, Shuji and Thompson, Simon and Banno, Atsuhiko},
  title = {VITAMIN-E: VIsual Tracking and MappINg With Extremely Dense Feature Points},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Conditional Single-View Shape Generation for Multi-View Stereo Reconstruction
Yi Wei, Shaohui Liu, Wang Zhao, Jiwen Lu


In this paper, we present a new perspective towards image-based shape generation. Most existing deep learning based shape reconstruction methods employ a single-view deterministic model which is sometimes insufficient to determine a single groundtruth shape because the back part is occluded. In this work, we first introduce a conditional generative network to model the uncertainty for single-view reconstruction. Then, we formulate the task of multi-view reconstruction as taking the intersection of the predicted shape spaces on each single image. We design new differentiable guidance including the front constraint, the diversity constraint, and the consistency loss to enable effective single-view conditional generation and multi-view synthesis. Experimental results and ablation studies show that our proposed approach outperforms state-of-the-art methods on 3D reconstruction test error and demonstrates its generalization ability on real world data.
[deterministic, multiple, online, formulated, outperforms, overview] [shape, reconstruction, point, front, single, constraint, cloud, groundtruth, view, lossconsis, differentiable, partially, psgn, rendered, approach, depth, shapenet, volume, error, span, ront, lossdiv] [conditional, generative, input, consistency, image, figure, method, proposed, based, real, latent, demonstrate] [network, table, denotes, deep, structure, inference, performed, better] [model, diversity, random, generation, adversarial, reasonable, sampled, conditioned, introduce, enables, generating] [object, predicted, propose, including, ablation, lossf] [loss, training, learning, distance, supervised, learn, existing, test, space, task, metric, sampling, set, specific, uncertainty, large, trained, conducted]
@InProceedings{Wei_2019_CVPR,
  author = {Wei, Yi and Liu, Shaohui and Zhao, Wang and Lu, Jiwen},
  title = {Conditional Single-View Shape Generation for Multi-View Stereo Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Adapt for Stereo
Alessio Tonioni, Oscar Rahnama, Thomas Joy, Luigi Di Stefano, Thalaiyasingam Ajanthan, Philip H.S. Torr


Real-world applications of stereo depth estimation require models that are robust to dynamic variations in the environment. Even though deep learning based stereo methods are successful, they often fail to generalize to unseen variations in the environment, making them less suitable for practical applications such as autonomous driving. In this work, we introduce a "learning-to-adapt" framework that enables deep stereo methods to continuously adapt to new target domains in an unsupervised manner. Specifically, our approach incorporates the adaptation procedure into the learning objective to obtain a base set of parameters that are better suited for unsupervised online adaptation. To further improve the quality of the adaptation, we learn a confidence measure that effectively masks the errors introduced during the unsupervised adaptation. We evaluate our method on synthetic and real-world stereo datasets, and our experiments show that learning-to-adapt is indeed beneficial for online adaptation on vastly different domains.
[online, video, framework, frame, sequence, dataset, perform, recognition] [stereo, confidence, vision, computer, disparity, international, kitti, error, estimation, optimization, left, single, pattern, initial, dispnet, problem, ground, continuous, depth, suited, algorithm, dense, truth, thomas, autonomous] [synthetic, conference, real, ieee, method, proposed, based, input] [network, performance, deep, weighted, gradient, descent, neural, fine, process, weight] [model, evaluation, evaluate, machine] [map, propose, predicted, cnn, average] [adaptation, learning, loss, unsupervised, training, set, adapt, function, learn, supervised, trained, data, base, carla, target, test, synthia, objective, measure, adapted, unseen, effectively, update, train, domain, weighting, continuously]
@InProceedings{Tonioni_2019_CVPR,
  author = {Tonioni, Alessio and Rahnama, Oscar and Joy, Thomas and Di Stefano, Luigi and Ajanthan, Thalaiyasingam and Torr, Philip H.S.},
  title = {Learning to Adapt for Stereo},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
3D Appearance Super-Resolution With Deep Learning
Yawei Li, Vagia Tsiminaki, Radu Timofte, Marc Pollefeys, Luc Van Gool


We tackle the problem of retrieving high-resolution (HR) texture maps of objects that are captured from multiple viewpoints. In the multi-view case, model-based super-resolution (SR) methods have recently been shown to recover high-quality texture maps. On the other hand, the advent of deep learning-based methods has already had a significant impact on the problem of video and image SR. Yet, a deep learning-based approach to super-resolve the appearance of 3D objects is still missing. The main limitation of exploiting the power of deep learning techniques in the multi-view case is the lack of data. We introduce a 3D appearance SR (3DASR) dataset based on the existing ETH3D [42], SyB3R [31], MiddleBury, and our Collection of 3D scenes from TUM [21], Fountain [51] and Relief [53]. We provide the high- and low-resolution texture maps, the 3D geometric model, images and projection matrices. We exploit the power of 2D learning-based SR methods and design networks suitable for the 3D multi-view case. We incorporate the geometric information by introducing normal maps and further improve the learning process. Experimental results demonstrate that our proposed networks successfully incorporate the 3D geometric information and super-resolve the texture maps.
[dataset, incorporate, multiple, capture, video] [normal, geometric, projection, case, single, geometry, compute, point, volume, projected, mesh, problem, view, approach, scene, camera, computer, provide, optimization, reconstruction, multiview, formation, corresponding, intrinsic, matrix, middlebury, algorithm, relief, define] [texture, image, appearance, edsr, nlr, resolution, high, based, figure, psnr, ieee, nhr, mapping, method, quality, synthetic, captured, superresolution, real, color, hrst, acm, recover] [deep, network, scaling, upsampling, layer, convolutional] [model, introduce, collection, provided, visual, step] [map, feature] [learning, training, subset, space, domain, data, function, set, exploit, trained]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yawei and Tsiminaki, Vagia and Timofte, Radu and Pollefeys, Marc and Van Gool, Luc},
  title = {3D Appearance Super-Resolution With Deep Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Radial Distortion Triangulation
Zuzana Kukelova, Viktor Larsson


This paper presents the first optimal, maximum-likelihood solution to the triangulation problem for radially distorted cameras. The proposed solution to the two-view triangulation problem minimizes the L2-norm of the reprojection error in the distorted image space. We cast the problem as the search for corrected distorted image points, and we use a Lagrange multiplier formulation to impose the epipolar constraint for undistorted points. For the one-parameter division model, this formulation leads to a system of five quartic polynomial equations in five unknowns, which can be exactly solved using the Groebner basis method. While the proposed Groebner basis solution is provably optimal, it is too slow for practical applications. Therefore, we developed a fast iterative solver for this problem. Extensive empirical tests show that the iterative algorithm delivers the optimal solution virtually every time, thus making it an L2-optimal algorithm de facto. It is iterative in nature, yet in practice, it converges in no more than five iterations. We thoroughly evaluate the proposed method on both synthetic and real-world data, and we show the benefits of performing the triangulation in the distorted space in the presence of radial distortion.
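The one-parameter division model referred to above maps a distorted image point to its undistorted position by dividing by (1 + lambda * r^2). A small numpy sketch of the model and its inverse (which is what one needs to measure reprojection error in the distorted image space) is given below, with an illustrative distortion coefficient; the triangulation solver itself is not reproduced.

```python
import numpy as np

def undistort(p_d, lam):
    """One-parameter division model: undistorted = distorted / (1 + lambda * r_d^2)."""
    r2 = np.sum(p_d ** 2)
    return p_d / (1.0 + lam * r2)

def distort(p_u, lam):
    """Inverse mapping, obtained by solving the quadratic in the distorted radius."""
    r_u = np.linalg.norm(p_u)
    if r_u == 0:
        return p_u.copy()
    # lam * r_u * r_d^2 - r_d + r_u = 0; take the root that tends to r_u as lam -> 0
    r_d = (1.0 - np.sqrt(1.0 - 4.0 * lam * r_u ** 2)) / (2.0 * lam * r_u)
    return p_u * (r_d / r_u)

lam = -0.2                                   # illustrative barrel-distortion coefficient
p_d = np.array([0.3, 0.4])                   # a measured (distorted) image point
p_u = undistort(p_d, lam)
print(np.allclose(distort(p_u, lam), p_d))   # True: the round trip recovers the measurement
```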
[recognition, performing, slow] [triangulation, error, distortion, radial, problem, solution, distorted, solver, vision, computer, optimal, reprojection, camera, polynomial, undistorted, algorithm, point, linear, equation, note, pattern, lagrange, obner, optimization, solving, xdi, runtime, corrected, undistortion, calibration, checkerboard, zuzana, minimizes, quartic, approach, matrix, projection, form, triangulated, dlt, pose, international, kalle, viktor, practical, bundle, adjustment, local] [image, method, figure, noise, proposed, conference, gopro, comparison, based, synthetic] [wide, table, original, cost, experiment, martin, parameter, fast, converges, number, larger] [iterative, median, basis, model, system, richard, making] [border, presence] [function, medium, division, corresponds, minimization]
@InProceedings{Kukelova_2019_CVPR,
  author = {Kukelova, Zuzana and Larsson, Viktor},
  title = {Radial Distortion Triangulation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Robust Point Cloud Based Reconstruction of Large-Scale Outdoor Scenes
Ziquan Lan, Zi Jian Yew, Gim Hee Lee


Outlier feature matches and loop-closures that survive front-end data association can lead to catastrophic failures in the back-end optimization of large-scale point cloud based 3D reconstruction. To alleviate this problem, we propose a probabilistic approach for robust back-end optimization in the presence of outliers. More specifically, we model the problem as a Bayesian network and solve it using the Expectation-Maximization algorithm. Our approach leverages a long-tail Cauchy distribution to suppress outlier feature matches in the odometry constraints, and a Cauchy-Uniform mixture model with a set of binary latent variables to simultaneously suppress outlier loop-closure constraints and outlier feature matches in the inlier loop-closure constraints. Furthermore, we show that by using a Gaussian-Uniform mixture model, our approach degenerates to the formulation of a state-of-the-art approach for robust indoor reconstruction. Experimental results demonstrate that our approach has comparable performance with the state-of-the-art on a benchmark indoor dataset, and outperforms it on a large-scale outdoor dataset. Our source code can be found on the project website.
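The role of the long-tail Cauchy distribution can be illustrated without the full EM machinery: in an iteratively reweighted least-squares view, a Cauchy kernel turns residuals into weights that fall off quickly for large residuals, so likely outlier matches contribute almost nothing. The residual and scale values below are made up.

```python
import numpy as np

def cauchy_weights(residuals, scale):
    """Per-match weights induced by a long-tail Cauchy kernel: large residuals
    (likely outlier feature matches) receive weights close to zero."""
    return 1.0 / (1.0 + (residuals / scale) ** 2)

residuals = np.array([0.01, 0.02, 0.015, 0.5, 2.0])   # the last two look like outliers
print(cauchy_weights(residuals, scale=0.05))
# -> roughly [0.96, 0.86, 0.92, 0.01, 0.0006]: outlier constraints are suppressed
```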
[trajectory, term, route, consecutive, dataset] [odometry, outlier, point, reconstruction, cloud, robust, inlier, approach, indoor, fragment, outdoor, choi, problem, constraint, maximization, cauchy, optimization, error, pij, solve, formulation, scene, algorithm, geometric, variable, dense, registration, note, assignment, denote, surface, simultaneously] [based, method, latent, figure, result, suppress, reconstructed, mapping] [accuracy, top, covariance, number, scale, bij, bayesian, comparable, lead, network] [model, expectation, probability, step, living, room, association] [feature, global, baseline, propose, detection, recall, average, detected] [distribution, set, posterior, data, yij, multivariate, mixture, uniform, zij, probabilistic, existing]
@InProceedings{Lan_2019_CVPR,
  author = {Lan, Ziquan and Jian Yew, Zi and Hee Lee, Gim},
  title = {Robust Point Cloud Based Reconstruction of Large-Scale Outdoor Scenes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Minimal Solvers for Mini-Loop Closures in 3D Multi-Scan Alignment
Pedro Miraldo, Surojit Saha, Srikumar Ramalingam


3D scan registration is a classical, yet highly useful, problem in the context of 3D sensors such as Kinect and Velodyne. While there are several existing methods, the techniques are usually incremental: adjacent scans are registered first to obtain the initial poses, followed by motion averaging and bundle-adjustment refinement. In this paper, we take a different approach and develop minimal solvers for jointly computing the initial poses of cameras in small loops such as 3-, 4-, and 5-cycles. Note that the classical registration of 2 scans can be done using a minimum of 3 point matches to compute 6 degrees of relative motion. On the other hand, to jointly compute the 3D registrations in n-cycles, we take 2 point matches between the first n-1 consecutive pairs (i.e., Scan 1 & Scan 2, ... , and Scan n-1 & Scan n) and 1 or 2 point matches between Scan 1 and Scan n. Overall, we use 5, 7, and 10 point matches for 3-, 4-, and 5-cycles, and recover 12, 18, and 24 degrees of transformation variables, respectively. Using simulations and real data, we show that 3D registration using mini n-cycles is computationally efficient, and can provide alternate and better initial poses compared to standard pairwise methods.
[predefined, graph, recognition, averaging, motion, jointly, previous, considering] [point, pose, minimal, registration, vision, computer, relative, rotation, compute, problem, scan, solution, pattern, solve, case, estimation, solver, camera, correspondence, note, coordinate, freedom, marc, robust, denote, defined, daniel, david, approach, computed, ransac, solving, planar, single, analysis, hongdong, initial, algebraic, geometric, rgb] [ieee, method, transformation, proposed, translation, figure] [number, table, standard, efficient, deep, computation, better] [consider, find, simple] [three, european, edge] [pairwise, remaining, set, data, minimum, mini]
@InProceedings{Miraldo_2019_CVPR,
  author = {Miraldo, Pedro and Saha, Surojit and Ramalingam, Srikumar},
  title = {Minimal Solvers for Mini-Loop Closures in 3D Multi-Scan Alignment},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Volumetric Capture of Humans With a Single RGBD Camera via Semi-Parametric Learning
Rohit Pandey, Anastasia Tkach, Shuoran Yang, Pavel Pidlypenskyi, Jonathan Taylor, Ricardo Martin-Brualla, Andrea Tagliasacchi, George Papandreou, Philip Davidson, Cem Keskin, Shahram Izadi, Sean Fanello


Volumetric (4D) performance capture is fundamental for AR/VR content generation. Whereas previous work in 4D performance capture has shown impressive results in studio settings, the technology is still far from being accessible to a typical consumer who, at best, might own a single RGBD sensor. Thus, in this work, we propose a method to synthesize free viewpoint renderings using a single RGBD camera. The key insight is to leverage previously seen "calibration" images of a given user to extrapolate what should be rendered in a novel viewpoint from the data available in the sensor. Given these past observations from multiple viewpoints, and the current RGBD image from a fixed view, we propose an end-to-end framework that fuses both these data sources to generate novel renderings of the performer. We demonstrate that the method can produce high-fidelity images, and handle extreme changes in subject pose and camera viewpoints. We also show that the system generalizes to performers not seen in the training data. We run exhaustive experiments demonstrating the effectiveness of the proposed semi-parametric model (i.e., calibration images available to the neural network) compared to other state-of-the-art machine-learned solutions. Further, we compare the method with more traditional pipelines that employ multi-view capture. We show that our framework is able to achieve compelling results, with substantially less infrastructure than previously required.
[capture, multiple, warp, warped, framework, state, current, work, human] [calibration, calib, pose, rgbd, viewpoint, camera, volumetric, view, iwarp, keypoints, single, icloud, warper, depth, reconstruction, rgb, notice, note, blender, compute, silhouette, supplementary, well, rendering, confidence, groundtruth, additional, point, coordinate, normal, compelling, infrastructure] [image, figure, method, high, user, proposed, input, quality, igt, background, traditional, arbitrary, desired, transformation, acm, synthesize, produce] [network, neural, output, performance, compare] [system, machine, generate, gan, required, observed, collection] [final, mask, score, art, fully, foreground, map, stage] [novel, loss, training, learning, target, unseen, selected, data, similarity]
@InProceedings{Pandey_2019_CVPR,
  author = {Pandey, Rohit and Tkach, Anastasia and Yang, Shuoran and Pidlypenskyi, Pavel and Taylor, Jonathan and Martin-Brualla, Ricardo and Tagliasacchi, Andrea and Papandreou, George and Davidson, Philip and Keskin, Cem and Izadi, Shahram and Fanello, Sean},
  title = {Volumetric Capture of Humans With a Single RGBD Camera via Semi-Parametric Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Joint Face Detection and Facial Motion Retargeting for Multiple Faces
Bindita Chaudhuri, Noranart Vesdapunt, Baoyuan Wang


Facial motion retargeting is an important problem in both computer graphics and vision, which involves capturing the performance of a human face and transferring it to another 3D character. Learning 3D morphable model (3DMM) parameters from 2D face images using convolutional neural networks is common in 2D face alignment, 3D face reconstruction etc. However, existing methods either require an additional face detection step before retargeting or use a cascade of separate networks to perform detection followed by retargeting in a sequence. In this paper, we present a single end-to-end network to jointly predict the bounding box locations and 3DMM parameters for multiple faces. First, we design a novel multitask learning framework that learns a disentangled representation of 3DMM parameters for a single face. Then, we leverage the trained single face model to generate ground truth 3DMM parameters for multiple faces to train another network that performs joint face detection and motion retargeting for images with multiple faces. Experimental results show that our joint detection and retargeting network has high face detection accuracy and is robust to extreme expressions and poses while being faster than state-of-the-art methods.
[joint, multiple, recognition, tracking, motion, dataset, predict, performs, capture, human, perform] [single, computer, vision, pose, ground, truth, pattern, reconstruction, approach, require, shape, international, regressing, regress, accurate, realtime, fitting] [face, facial, retargeting, expression, conference, ieee, landmark, image, sfn, mfn, identity, change, separate, input, figure, wexp, eye, ijk, chen, morphable, method, disentangled] [network, performance, scale, table, deep, design, accuracy, architecture, size, multitask, denotes, layer] [model, generate, arxiv, preprint] [detection, bounding, box, branch, regression, object, localization, global, european, cascade] [alignment, learning, training, test, representation, set, large, trained, loss]
@InProceedings{Chaudhuri_2019_CVPR,
  author = {Chaudhuri, Bindita and Vesdapunt, Noranart and Wang, Baoyuan},
  title = {Joint Face Detection and Facial Motion Retargeting for Multiple Faces},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Monocular Depth Estimation Using Relative Depth Maps
Jae-Han Lee, Chang-Su Kim


We propose a novel algorithm for monocular depth estimation using relative depth maps. First, using a convolutional neural network, we estimate relative depths between pairs of regions, as well as ordinary depths, at various scales. Second, we restore relative depth maps from selectively estimated data based on the rank-1 property of pairwise comparison matrices. Third, we decompose ordinary and relative depth maps into components and recombine them optimally to reconstruct a final depth map. Experimental results show that the proposed algorithm provides state-of-the-art depth estimation performance.
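The rank-1 property mentioned in the second step is easy to see: if M[i, j] = d_i / d_j, then M is an outer product of the depth vector and its elementwise inverse, and the depths can be recovered up to a global scale, for example by row-wise geometric means. The sketch below illustrates this on a complete, noise-free matrix with made-up depths; the paper additionally handles selectively estimated, noisy entries.

```python
import numpy as np

def depths_from_ratio_matrix(M):
    """Recover per-region depths (up to a global scale) from a pairwise ratio
    matrix M with M[i, j] ~ d_i / d_j.  A noise-free M is exactly rank-1; the
    row-wise geometric mean gives a simple least-squares estimate in log space."""
    log_d = np.log(M).mean(axis=1)
    d = np.exp(log_d)
    return d / d[0]                       # fix the global scale via the first region

d_true = np.array([1.0, 2.0, 4.0, 0.5])
M = d_true[:, None] / d_true[None, :]     # exact pairwise comparison matrix
print(depths_from_ratio_matrix(M))        # -> [1., 2., 4., 0.5] up to scale
```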
[multiple, joint, prediction] [depth, relative, estimation, algorithm, dense, wsm, monocular, estimate, single, estimated, ordinal, reconstruction, optimal, matrix, decomposition, rmse, geometric, scene, laina, corresponding, note, pattern, yield, estimating, denote, proposition, well, indoor, surface, normal, eigen, supplemental] [proposed, image, figure, comparison, reconstruct, pixel, ieee, based, method, resolution, ten, decomposed, input, component, korea] [ordinary, network, deep, convolutional, neural, table, performance, scale, block, number, better, size, structure, crf, fine, best] [decoder, encoder, vector, evaluation] [map, regression, semantic, cnn, predicted] [learning, log, data, pairwise, training, train, set, conventional, novel]
@InProceedings{Lee_2019_CVPR,
  author = {Lee, Jae-Han and Kim, Chang-Su},
  title = {Monocular Depth Estimation Using Relative Depth Maps},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Primitive Discovery for Improved 3D Generative Modeling
Salman H. Khan, Yulan Guo, Munawar Hayat, Nick Barnes


3D shape generation is a challenging problem due to the high-dimensional output space and complex part configurations of real-world objects. As a result, existing algorithms experience difficulties in accurate generative modeling of 3D shapes. Here, we propose a novel factorized generative model for 3D shape generation that sequentially transitions from coarse to fine scale shape generation. To this end, we introduce an unsupervised primitive discovery algorithm based on a higher-order conditional random field model. Using the primitive parts for shapes as attributes, a parameterized 3D representation is modeled in the first stage. This representation is further refined in the next stage by adding fine scale details to shape. Our results demonstrate improved representation ability of the generative model and better quality samples of newly generated 3D shapes. Further, our primitive generation approach can accurately parse common objects into a simplified representation.
[modeling, dataset, consists, recognition] [shape, primitive, computer, approach, vision, pattern, volumetric, single, point, defined, form, convex, uij, scene, cuboid, denote, volume, note, parametric, problem] [generative, proposed, ieee, conference, based, image, generator, figure, input, method, quality, coc] [cost, convolution, number, network, deep, table, denotes, neural, better, crf, original, unary, weight, performance, output, conv, convolutional, compared] [model, gan, generation, generate, discriminator, improved, generated, represent, adversarial, common, complete, par, arxiv, preprint, potential, inception, random] [object, discovery, box, propose, three, recall] [set, representation, unsupervised, learning, distribution, space, training, learned, trained, data, measure, sample, generic]
@InProceedings{Khan_2019_CVPR,
  author = {Khan, Salman H. and Guo, Yulan and Hayat, Munawar and Barnes, Nick},
  title = {Unsupervised Primitive Discovery for Improved 3D Generative Modeling},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Explore Intrinsic Saliency for Stereoscopic Video
Qiudan Zhang, Xu Wang, Shiqi Wang, Shikai Li, Sam Kwong, Jianmin Jiang


The human visual system excels at biasing stereoscopic visual signals through attention mechanisms. Traditional methods relying on low-level features and depth-relevant information for stereoscopic video saliency prediction have fundamental limitations. For example, it is cumbersome to model the interactions between multiple visual cues, including spatial, temporal, and depth information, owing to their sophistication. In this paper, we argue that the high-level features are crucial and resort to the deep learning framework to learn the saliency map of stereoscopic videos. Driven by spatio-temporal coherence from consecutive frames, the model first imitates the mechanism of saliency by taking advantage of the 3D convolutional neural network. Subsequently, the saliency originating from the intrinsic depth is derived based on the correlations between left and right views in a data-driven manner. Finally, a Convolutional Long Short-Term Memory (Conv-LSTM) based fusion network is developed to model the instantaneous interactions between spatio-temporal and depth attributes, such that the ultimate stereoscopic saliency maps over time are produced. Moreover, we establish a new large-scale stereoscopic video saliency dataset (SVS) including 175 stereoscopic video sequences and their fixation density annotations, aiming to comprehensively study the intrinsic attributes for stereoscopic video saliency detection. Extensive experiments show that our proposed model can achieve superior performance compared to the state-of-the-art methods on the newly built dataset for stereoscopic videos.
[stereoscopic, video, prediction, fixation, temporal, dataset, human, fusion, motion, svs, dynamic, coherence, consecutive, static, frame, multiple, spatiotemporal, optical, itti, fed, sst, hidden] [depth, left, intrinsic, view, estimation, contrast] [based, proposed, image, eye, method, ieee, deconvolution, gaze, traditional, input, figure, produce] [convolution, deep, network, convolutional, performance, kernel, layer, neural, better, residual, density, architecture, size, compared, designed, correlation, output] [model, visual, attention, natural, evaluation, system, vector] [saliency, feature, map, detection, spatial, pyramid, semantic, final, including, built, challenging, adopted] [learning, training, set, data]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Qiudan and Wang, Xu and Wang, Shiqi and Li, Shikai and Kwong, Sam and Jiang, Jianmin},
  title = {Learning to Explore Intrinsic Saliency for Stereoscopic Video},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Spherical Regression: Learning Viewpoints, Surface Normals and 3D Rotations on N-Spheres
Shuai Liao, Efstratios Gavves, Cees G. M. Snoek


Many computer vision challenges require continuous outputs, but tend to be solved by discrete classification. The reason is classification's natural containment within a probability n-simplex, as defined by the popular softmax activation function. Regular regression lacks such a closed geometry, leading to unstable training and convergence to suboptimal local minima. Starting from this insight we revisit regression in convolutional neural networks. We observe many continuous output problems in computer vision are naturally contained in closed geometrical manifolds, like the Euler angles in viewpoint estimation or the normals in surface normal estimation. A natural framework for posing such continuous output problems are n-spheres, which are naturally closed geometric manifolds defined in the R^(n+1) space. By introducing a spherical exponential mapping on n-spheres at the regression output, we obtain well-behaved gradients, leading to stable training. We show how our spherical regression can be utilized for several computer vision challenges, specifically viewpoint estimation, surface normal estimation and 3D rotation estimation. For all these problems our experiments demonstrate the benefit of spherical regression. All paper resources are available at https://github.com/leoshine/Spherical_Regression.
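One way to picture the spherical exponential mapping is as an output activation that keeps predictions on the unit n-sphere. The sketch below is an illustrative reading of the abstract, not the authors' implementation; in particular, exp(.) is positive, so the sign of each coordinate would have to be handled separately (e.g. by a classification branch).

# Map raw regression outputs onto the unit n-sphere (e.g. quaternions or surface normals).
import torch

def spherical_exp(o: torch.Tensor) -> torch.Tensor:
    e = torch.exp(o)
    return e / e.norm(dim=-1, keepdim=True)

raw = torch.randn(4, 4)          # a batch of raw 4-vector (quaternion-like) outputs
q = spherical_exp(raw)
print(q.norm(dim=-1))            # every row lies on the unit 3-sphere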
[prediction, framework, naturally, work, predict, predicting] [spherical, surface, rotation, estimation, viewpoint, normal, euler, continuous, sexp, exponential, pose, computer, vision, matrix, directly, defined, ground, geometric, stable, general, truth, derivative, angle, quaternion, david, typically, problem, regressing] [image, raw, based, mapping, input, latent, proposed] [output, activation, network, gradient, neural, norm, convolutional, deep, layer, alexnet, unit, normalization, lat, equivalent, better, accuracy, table] [constrained, partial, sign, sum, probability, model, evaluate] [regression, object, branch, propose, head, cnn, conclude, leading] [classification, loss, training, function, learning, representation, closed, learn, observe, embedding, set, softmax, space, base]
@InProceedings{Liao_2019_CVPR,
  author = {Liao, Shuai and Gavves, Efstratios and Snoek, Cees G. M.},
  title = {Spherical Regression: Learning Viewpoints, Surface Normals and 3D Rotations on N-Spheres},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Refine and Distill: Exploiting Cycle-Inconsistency and Knowledge Distillation for Unsupervised Monocular Depth Estimation
Andrea Pilzer, Stephane Lathuiliere, Nicu Sebe, Elisa Ricci


Nowadays, the majority of state-of-the-art monocular depth estimation techniques are based on supervised deep learning models. However, collecting RGB images with associated depth maps is a very time-consuming procedure. Therefore, recent works have proposed deep architectures for addressing the monocular depth prediction task as a reconstruction problem, thus avoiding the need to collect ground-truth depth. Following these works, we propose a novel self-supervised deep model for estimating depth maps. Our framework exploits two main strategies: refinement via cycle-inconsistency and distillation. Specifically, first a student network is trained to predict a disparity map so as to recover, from a frame in one camera view, the associated image in the opposite view. Then, a backward cycle network is applied to the generated image to re-synthesize back the input image, estimating the opposite disparity. A third network exploits the inconsistency between the original and the reconstructed input frame in order to output a refined depth map. Finally, knowledge distillation is exploited so as to transfer information from the refinement network to the student. Our extensive experimental evaluation demonstrates the effectiveness of the proposed framework, which outperforms state-of-the-art unsupervised methods on the KITTI benchmark.
[predict, backward, second, prediction, warping, dataset, framework, opposite, work, forward] [depth, disparity, monocular, estimation, left, stereo, kitti, disp, reconstruction, inconsistency, estimated, view, single, approach, rel, camera, predicts, note, eigen, estimate, corresponding, godard, error, scene, associated] [image, proposed, cycle, input, method, synthesized, reconstructed, resolution, study] [network, deep, order, better, performance, employ, table, architecture, output, low, higher, original, convolutional, structure, compared] [model] [map, ablation, improve, propose, predicted, yang, refined, supervision, feature, improves] [distillation, unsupervised, training, learning, loss, student, teacher, knowledge, trained, testing, test, feat, supervised, exploiting, large, exploit, split, discrepancy]
@InProceedings{Pilzer_2019_CVPR,
  author = {Pilzer, Andrea and Lathuiliere, Stephane and Sebe, Nicu and Ricci, Elisa},
  title = {Refine and Distill: Exploiting Cycle-Inconsistency and Knowledge Distillation for Unsupervised Monocular Depth Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning View Priors for Single-View 3D Reconstruction
Hiroharu Kato, Tatsuya Harada


There is some ambiguity in the 3D shape of an object when the number of observed views is small. Because of this ambiguity, although a 3D object reconstructor can be trained using a single view or a few views per object, reconstructed shapes only fit the observed views and appear incorrect from the unobserved viewpoints. To reconstruct shapes that look reasonable from any viewpoint, we propose to train a discriminator that learns prior knowledge regarding possible views. The discriminator is trained to distinguish the reconstructed views of the observed viewpoints from those of the unobserved viewpoints. The reconstructor is trained to correct unobserved views by fooling the discriminator. Our method outperforms current state-of-the-art methods on both synthetic and natural image datasets; this validates the effectiveness of our method.
[dataset, internal, multiple, prediction, outperforms, work] [view, reconstruction, shape, reconstructor, viewpoint, estimated, single, pressure, vpl, approach, mesh, silhouette, ground, truth, shapenet, point, voxels, reconstructors, chair, case, ambiguity, problem, corresponding, require, differentiable] [proposed, method, figure, prior, image, reconstructed, texture, comparison, synthetic, described, difference, half] [table, gradient, number, architecture, accuracy, performance] [discriminator, observed, unobserved, model, adversarial, incorrect, correct, encoder, decoder, visual, natural, random, reasonable, requires, generation] [object, baseline, iou, pascal, predicted, propose, distinguish] [training, loss, learning, knowledge, class, train, viewed, trained, function, learn, reversal, discrimination, difficult, conducted]
@InProceedings{Kato_2019_CVPR,
  author = {Kato, Hiroharu and Harada, Tatsuya},
  title = {Learning View Priors for Single-View 3D Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Geometry-Aware Symmetric Domain Adaptation for Monocular Depth Estimation
Shanshan Zhao, Huan Fu, Mingming Gong, Dacheng Tao


Supervised depth estimation has achieved high accuracy due to the advanced deep network architectures. Since the groundtruth depth labels are hard to obtain, recent methods try to learn depth estimation networks in an unsupervised way by exploring unsupervised cues, which are effective but less reliable than true labels. An emerging way to resolve this dilemma is to transfer knowledge from synthetic images with ground truth depth via domain adaptation techniques. However, these approaches overlook specific geometric structure of the natural images in the target domain (i.e., real data), which is important for high-performing depth prediction. Motivated by the observation, we propose a geometry-aware symmetric domain adaptation framework (GASDA) to explore the labels in the synthetic data and epipolar geometry in the real data jointly. Moreover, by training two image style translators and depth estimators symmetrically in an end-to-end network, our model achieves better image style transfer and generates high-quality depth maps. The experimental results demonstrate the effectiveness of our proposed method and comparable performance against the state-of-the-art.
[dataset, prediction, framework, previous] [depth, estimation, computer, monocular, vision, gasda, pattern, geometry, stereo, kitti, ground, truth, geometric, epipolar, single, approach, rel, eigen, error, kundu, rmse, well, scene, mde, volume, international] [image, conference, synthetic, real, ieee, style, consistency, figure, translation, proposed, method, cyclegan, mingming] [network, deep, neural, table, convolutional, structure, accuracy, performance, compared, processing] [model, adversarial, arxiv, preprint, machine] [semantic, map, predicted, european] [domain, adaptation, loss, data, unsupervised, learning, training, trained, target, supervised, transfer, symmetric, train, set, learn, main, updating, source, test, dacheng]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Shanshan and Fu, Huan and Gong, Mingming and Tao, Dacheng},
  title = {Geometry-Aware Symmetric Domain Adaptation for Monocular Depth Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Monocular Depth Estimation Infusing Traditional Stereo Knowledge
Fabio Tosi, Filippo Aleotti, Matteo Poggi, Stefano Mattoccia


Depth estimation from a single image represents a fascinating, yet challenging problem with countless applications. Recent works proved that this task could be learned without direct supervision from ground truth labels leveraging image synthesis on sequences or stereo pairs. Focusing on this second case, in this paper we leverage stereo matching in order to improve monocular depth estimation. To this aim we propose monoResMatch, a novel deep architecture designed to infer depth from a single input image by synthesizing features from a different point of view, horizontally aligned with the input image, performing stereo matching between the two cues. In contrast to previous works sharing this rationale, our network is the first trained end-to-end from scratch. Moreover, we show how obtaining proxy ground truth annotation through traditional stereo algorithms, such as Semi-Global Matching, enables more accurate monocular depth estimation still countering the need for expensive depth labels by keeping a self-supervised approach. Exhaustive experimental results prove how the synergy between i) the proposed monoResMatch architecture and ii) proxy-supervision attains state-of-the-art for self-supervised monocular depth estimation. The code is publicly available at https://github.com/fabiotosi92/monoResMatch-Tensorflow.
[recognition, dataset, framework, second] [stereo, depth, disparity, monocular, computer, vision, monoresmatch, estimation, pattern, ground, matching, single, truth, kitti, left, initial, matteo, stefano, accurate, sgm, fabio, view, international, virtual, leveraging, volume, confidence, eigen, rmse, dispnetc, godard, error, luo] [conference, image, ieee, input, proposed, figure, synthetic, resolution, traditional] [deep, network, architecture, convolutional, table, better, residual, accuracy, neural, cost, size, order, compared, best, layer, effective] [infer, enables] [supervision, refinement, map, feature, european, aligned, final, module] [learning, proxy, training, trained, unsupervised, set, strategy, train, loss, supervised]
@InProceedings{Tosi_2019_CVPR,
  author = {Tosi, Fabio and Aleotti, Filippo and Poggi, Matteo and Mattoccia, Stefano},
  title = {Learning Monocular Depth Estimation Infusing Traditional Stereo Knowledge},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SIGNet: Semantic Instance Aided Unsupervised 3D Geometry Perception
Yue Meng, Yongxi Lu, Aman Raj, Samuel Sunarjo, Rui Guo, Tara Javidi, Gaurav Bansal, Dinesh Bharadia


Unsupervised learning for geometric perception (depth, optical flow, etc.) is of great interest to autonomous systems. Recent works on unsupervised learning have made considerable progress on perceiving geometry; however, they usually ignore the coherence of objects and perform poorly under scenarios with dark and noisy environments. In contrast, supervised learning algorithms, which are robust, require large labeled geometric datasets. This paper introduces SIGNet, a novel framework that provides robust geometry perception without requiring geometrically informative labels. Specifically, SIGNet integrates semantic information to make depth and flow predictions consistent with objects and robust to low lighting conditions. SIGNet is shown to improve upon the state-of-the-art unsupervised learning for depth prediction by 30% (in squared relative error). In particular, SIGNet improves the dynamic object class performance by 39% in depth prediction and 29% in flow prediction. Our code will be made available at https://github.com/mengyuest/SIGNet
[flow, prediction, optical, recognition, work, motion, framework, dynamic, frame, dataset, predict] [depth, computer, vision, geometry, pattern, rgb, monocular, geometric, reconstruction, scene, error, dense, kitti, estimation, yin, relative, posenet, rel, rsme, international, camera, provide, single, june, robust, approach] [conference, ieee, image, input, method, figure, proposed] [deep, performance, network, table, channel, convolutional, neural, compared] [model, encoding, perception, visual, evaluation, arxiv, preprint] [semantic, instance, object, segmentation, feature, map, improve, edge, signet, fig, spatial, predicted, propose] [unsupervised, learning, loss, class, training, supervised, transfer, function, learn, augmentation, set]
@InProceedings{Meng_2019_CVPR,
  author = {Meng, Yue and Lu, Yongxi and Raj, Aman and Sunarjo, Samuel and Guo, Rui and Javidi, Tara and Bansal, Gaurav and Bharadia, Dinesh},
  title = {SIGNet: Semantic Instance Aided Unsupervised 3D Geometry Perception},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
3D Guided Fine-Grained Face Manipulation
Zhenglin Geng, Chen Cao, Sergey Tulyakov


We present a method for fine-grained face manipulation. Given a face image with an arbitrary expression, our method can synthesize another arbitrary expression by the same person. This is achieved by first fitting a 3D face model and then disentangling the face into a texture and a shape. We then learn different networks in these two spaces. In the texture space, we use a conditional generative network to change the appearance, and carefully design input formats and loss functions to achieve the best results. In the shape space, we use a fully connected network to predict the accurate shapes and use the available depth data for supervision. Both networks are conditioned on expression coefficients rather than discrete labels, allowing us to generate an unlimited amount of expressions. We show the superiority of this disentangling approach through both quantitative and qualitative studies. In a user study, our method is preferred in 85% of cases when compared to the most recent work. When compared to the ground truth, annotators cannot reliably distinguish between our synthesized images and real images, preferring our method in 53% of the cases.
[work, action, capture, perform] [shape, approach, ground, computer, depth, pattern, fitting, geometry, vision, note, mesh, problem, directly, truth, deformation, linear, camera] [face, texture, expression, image, input, method, facial, generative, synthesized, real, conference, ieee, figure, manipulation, proposed, identity, qualitative, presented, mouth, appearance, desired, user, fitted, arbitrary, facewarehouse, translation, synthesis, acm, realistic, morphable, separate, preserve, generator, perceptual] [network, neural, deep, compared, better, number, best, convolutional, output] [model, generated, generate, fake, adversarial, evaluate] [branch, global, fully, distinguish] [loss, target, set, learn, representation, train, training, source, data, transfer, distance, function]
@InProceedings{Geng_2019_CVPR,
  author = {Geng, Zhenglin and Cao, Chen and Tulyakov, Sergey},
  title = {3D Guided Fine-Grained Face Manipulation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Neuro-Inspired Eye Tracking With Eye Movement Dynamics
Kang Wang, Hui Su, Qiang Ji


Generalizing eye tracking to new subjects/environments remains challenging for existing appearance-based methods. To address this issue, we propose to leverage eye movement dynamics, inspired by neurological studies. Studies show that there exist several common eye movement types, independent of viewing contents and subjects, such as fixation, saccade, and smooth pursuits. Incorporating generic eye movement dynamics can therefore improve the generalization capabilities. In particular, we propose a novel Dynamic Gaze Transition Network (DGTN) to capture the underlying eye movement dynamics and serve as the top-down gaze prior. Combined with the bottom-up gaze measurements from the deep convolutional neural network, our method achieves better performance for both within-dataset and cross-dataset evaluations compared to the state-of-the-art. In addition, a new DynamicGaze dataset is also constructed to study eye movement dynamics and eye gaze estimation.
[movement, static, tracking, state, dynamic, saccade, video, transition, fixation, duration, online, time, amplitude, dgtn, dataset, moving, construct, focus, work, prediction, perform, frame, watching, avg, wang] [estimation, error, computer, pattern, international, vision, point, underlying, horizontal, vertical, analysis, estimated, estimate, well, pose, angle, algorithm, groundtruth, symposium] [gaze, eye, smooth, conference, ieee, pursuit, proposed, method, figure, pixel, face, image, input, user] [network, better, table, neural, performance, compare, max, gaussian] [model, true, random, arg, natural] [refinement, propose, improve, head, help, spatial, curve] [data, generalize, log, web, learning]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Kang and Su, Hui and Ji, Qiang},
  title = {Neuro-Inspired Eye Tracking With Eye Movement Dynamics},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Facial Emotion Distribution Learning by Exploiting Low-Rank Label Correlations Locally
Xiuyi Jia, Xiang Zheng, Weiwei Li, Changqing Zhang, Zechao Li


Emotion recognition from facial expressions is an interesting and challenging problem and has attracted much attention in recent years. Substantial previous research has only been able to address the ambiguity of "what describes the expression", which assumes that each facial expression is associated with one or more predefined affective labels while ignoring the fact that multiple emotions always have different intensities in a single picture. Therefore, to depict facial expressions more accurately, this paper adopts a label distribution learning approach for emotion recognition that can address the ambiguity of "how to describe the expression" and proposes an emotion distribution learning method that exploits label correlations locally. Moreover, a local low-rank structure is employed to capture the local label correlations implicitly. Experiments on benchmark facial expression datasets demonstrate that our method can better address the emotion distribution recognition problem than state-of-the-art methods.
[recognition, employed, previous, influence, predefined, affective, capture, work, dataset] [local, problem, algorithm, solve, international, optimization, matrix, pattern, direction, denote, analysis, ambiguity] [facial, expression, method, based, conference, ieee, image, proposed, xin, figure, face, presented] [structure, performance, number, sbu, basic, correlation, table, science, divided, admm] [description, vector, indicates, automatic, machine, artificial, locally] [global, three, predicted, intersection] [label, distribution, emotion, learning, set, training, ldl, data, function, datasets, min, edl, exploiting, china, paper, dis, exploit, exploited, objective, experimental, similarity, cluster, cosine, ldllc, address]
@InProceedings{Jia_2019_CVPR,
  author = {Jia, Xiuyi and Zheng, Xiang and Li, Weiwei and Zhang, Changqing and Li, Zechao},
  title = {Facial Emotion Distribution Learning by Exploiting Low-Rank Label Correlations Locally},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Face Normalization With Extreme Pose and Expression in the Wild
Yichen Qian, Weihong Deng, Jiani Hu


Face recognition achieves great success thanks to the emergence of deep learning. However, many contemporary face recognition models still have limited invariance to strong intra-personal variations such as large pose changes. Face normalization provides an effective and cheap way to distil face identity and dispel face variances for recognition. We focus on face generation in the wild with unpaired data. To this end, we propose a Face Normalization Model (FNM) to generate a frontal, neutral-expression, photorealistic face image for face recognition. FNM is a well-designed Generative Adversarial Network (GAN) with three distinct novelties. First, a face expert network is introduced to construct the generator and provide the ability to retain face identity. Second, with the reconstruction of the normal face, a pixel-wise loss is applied to stabilize the optimization process. Third, we present a series of face attention discriminators to refine local textures. FNM can recover canonical-view, expression-free images and directly improve the performance of face recognition models. Extensive qualitative and quantitative experiments on both controlled and in-the-wild databases demonstrate the superiority of our face normalization method.
[recognition, framework, complex, dataset, previous] [normal, normalized, pose, light, computer, vision, local, view, pattern, optimization, corresponding, directly] [face, fnm, identity, image, generator, unconstrained, expert, input, preserving, controlled, method, genc, expression, unpaired, synthesis, photorealistic, generative, proposed, database, ieee, neutral, synthesize, frontal, facial, prior, gdec, lip, conference, figure, ladv, great, qualitative, demonstrate, frontalization] [network, normalization, performance, deep, applied, effective, employ, table, fixed] [attention, model, adversarial, gan, introduce, discriminator, environment, mechanism, arxiv, preprint, generate] [cnn, feature, propose, extreme, crop] [loss, set, training, data, target, large, learning, representation, function, novel, incorporated, distribution]
@InProceedings{Qian_2019_CVPR,
  author = {Qian, Yichen and Deng, Weihong and Hu, Jiani},
  title = {Unsupervised Face Normalization With Extreme Pose and Expression in the Wild},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Semantic Component Decomposition for Face Attribute Manipulation
Ying-Cong Chen, Xiaohui Shen, Zhe Lin, Xin Lu, I-Ming Pao, Jiaya Jia


Deep neural network-based methods have been proposed for face attribute manipulation. There still exist, however, two major issues, i.e., insufficient visual quality (or resolution) of the results and lack of user control. They limit the applicability of existing methods, since users may have different editing preferences for facial attributes. In this paper, we address these issues by proposing a semantic component model. The model decomposes a facial attribute into multiple semantic components, each of which corresponds to a specific face region. This not only allows for user control of edit strength on different parts based on their preference, but also makes it effective to remove unwanted edit effects. Further, each semantic component is composed of two fundamental elements, which determine the edit effect and region respectively. This property provides fine interactive control. As shown in experiments, our model not only produces high-quality results, but also allows effective user interaction.
[fusion, work, time, multiple] [allows, note, corresponding, approach, general, initial, view, decomposition] [attribute, edit, face, facial, component, image, vsi, changing, strength, hair, based, result, dfi, editing, manipulation, unwanted, method, generative, produce, figure, cyclegan, resgan, user, decomposes, change, input, younger, facelet, control, expression, older, quality, removed, manipulating, rhwc, removing, comparison, proposed] [network, deep, original, vgg, denotes, architecture, net, neural, effective, adjust, compared] [model, attention, painter, visual, adversarial, gan, edited, kind, simple, system] [semantic, region, interactive, feature, spatial, final, map] [training, learning, learn, target, large, specific, train, setting, set, existing]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Ying-Cong and Shen, Xiaohui and Lin, Zhe and Lu, Xin and Pao, I-Ming and Jia, Jiaya},
  title = {Semantic Component Decomposition for Face Attribute Manipulation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
R3 Adversarial Network for Cross Model Face Recognition
Ken Chen, Yichao Wu, Haoyu Qin, Ding Liang, Xuebo Liu, Junjie Yan


In this paper, we raise a new problem, namely cross model face recognition (CMFR), which has considerable economic and social significance. The core of this problem is to make features extracted from different models comparable. However, the diversity, mainly caused by different application scenarios, frequent version updating, and all sorts of service platforms, obstructs interaction among different models and poses a great challenge. To solve this problem, from the perspective of Bayesian modelling, we propose R3 Adversarial Network (R3AN) which consists of three paths: Reconstruction, Representation and Regression. We also introduce adversarial learning into the reconstruction path for better performance. Comprehensive experiments on public datasets demonstrate the feasibility of interaction among different models with the proposed framework. When updating the gallery, R3AN conducts the feature transformation nearly 10 times faster than ResNet-101. Meanwhile, the transformed feature distribution is very close to that of target model, and its error rate is incredibly reduced by approximately 75% compared with a naive transformation model. Furthermore, we show that face feature can be deciphered into original face image roughly by the reconstruction path, which may give valuable hints for improving the original face recognition models.
[recognition, interaction, time, consists] [reconstruction, problem, feasibility, left, practical, solve] [face, transformation, cmfr, input, real, prior, extracted, proposed, transform, generator, image, based, generative, poly, application, mapping, latent, recover, src, tgt, figure, polynete, demonstrate] [original, accuracy, architecture, rate, deep, network, performance, output, bottleneck, table, typical, better, order, bayesian] [model, adversarial, path, generated, evaluate, system, querying, find] [feature, regression, three, identification, propose, map, adopt] [target, representation, learning, distribution, source, training, loss, gallery, naive, transfer, knowledge, cross, probe, datasets, updating, learned, learn, extractor, close, set, data, domain, train, refers]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Ken and Wu, Yichao and Qin, Haoyu and Liang, Ding and Liu, Xuebo and Yan, Junjie},
  title = {R3 Adversarial Network for Cross Model Face Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Disentangling Latent Hands for Image Synthesis and Pose Estimation
Linlin Yang, Angela Yao


Hand image synthesis and pose estimation from RGB images are both highly challenging tasks due to the large discrepancy between factors of variation ranging from image background content to camera viewpoint. To better analyze these factors of variation, we propose the use of disentangled representations and a disentangled variational autoencoder (dVAE) that allows for specific sampling and inference of these factors. The derived objective from the variational lower bound as well as the proposed training strategy are highly flexible, allowing us to handle crossmodal encoders and decoders as well as semi-supervised learning scenarios. Experiments show that our dVAE can synthesize highly realistic images of the hand specifiable by both pose and image background content and also estimate 3D hand poses from RGB images with accuracy competitive with state-of-the-art on two public benchmarks.
[multiple, joint, dataset, construct] [hand, pose, rgb, estimation, depth, dvae, cpose, single, viewpoint, additional, estimate, rhd, monocular, epe, stb, well, bound, associated, accurate, pck, thomas, variable] [latent, image, disentangled, content, generative, disentangling, background, figure, disentangle, synthesized, input, synthesis, variation, proposed, synthesize, decode, method, control, row] [deep, highly, lower, full, convolutional, fixed] [model, variational, evidence, tag, evaluate, encoding, decoding, observed, encode] [fully, weak] [learning, log, space, learn, vae, training, representation, distribution, data, independent, vaes, embedding, dkl, update, strategy, difficult, datasets, transfer]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Linlin and Yao, Angela},
  title = {Disentangling Latent Hands for Image Synthesis and Pose Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Generating Multiple Hypotheses for 3D Human Pose Estimation With Mixture Density Network
Chen Li, Gim Hee Lee


3D human pose estimation from a monocular image or 2D joints is an ill-posed problem because of depth ambiguity and occluded joints. We argue that 3D human pose estimation from a monocular input is an inverse problem where multiple feasible solutions can exist. In this paper, we propose a novel approach to generate multiple feasible hypotheses of the 3D pose from 2D joints. In contrast to existing deep learning approaches, which minimize a mean square error based on an unimodal Gaussian distribution, our method is able to generate multiple feasible hypotheses of 3D pose based on a multimodal mixture density network. Our experiments show that the 3D poses estimated by our approach from an input of 2D joints are consistent in 2D reprojections, which supports our argument that multiple solutions exist for the 2D-to-3D inverse problem. Furthermore, we show state-of-the-art performance on the Human3.6M dataset in both best hypothesis and multi-view settings, and we demonstrate the generalization capacity of our model by testing on the MPII and MPI-INF-3DHP datasets. Our code is available at the project website.
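A mixture density head of the kind described can be sketched in a few lines. This is a simplified illustration rather than the paper's architecture; the feature size, joint count, and number of mixture components are assumptions.

# A toy mixture-density head: M pose hypotheses (means), per-hypothesis scales, mixing weights.
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    def __init__(self, feat_dim=1024, n_joints=17, n_mix=5):
        super().__init__()
        self.n_joints, self.n_mix = n_joints, n_mix
        self.fc = nn.Linear(feat_dim, n_mix * (3 * n_joints + 2))   # mu, sigma, pi per mixture

    def forward(self, x):
        b = x.shape[0]
        y = self.fc(x).view(b, self.n_mix, 3 * self.n_joints + 2)
        mu = y[..., :3 * self.n_joints].view(b, self.n_mix, self.n_joints, 3)
        sigma = torch.exp(y[..., -2])              # positive per-hypothesis scale
        pi = torch.softmax(y[..., -1], dim=-1)     # mixing coefficients over hypotheses
        return mu, sigma, pi

mu, sigma, pi = MDNHead()(torch.randn(8, 1024))
print(mu.shape, sigma.shape, pi.shape)             # (8, 5, 17, 3) (8, 5) (8, 5)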
[human, multiple, dataset, joint, hypothesis, mpii] [pose, estimation, computer, approach, single, problem, vision, pattern, linear, lprior, ambiguity, consistent, hourglass, dirichlet, feasible, estimated, reprojections, monocular, depth, inverse, indoor, jahangiri, outdoor, note, ground, truth, smoke, error, view, estimate, martinez] [conference, input, ieee, prior, based, mixing, mdn, image, figure, method, result] [network, gaussian, deep, table, density, number, performance, best, stacked, kernel, learnable, compare, standard] [model, generated, generate, generating, simple, generates] [european, feature, detected, baseline, three] [mixture, set, training, datasets, learning, train, function, distribution, generalization, data, loss, test, protocol]
@InProceedings{Li_2019_CVPR,
  author = {Li, Chen and Hee Lee, Gim},
  title = {Generating Multiple Hypotheses for 3D Human Pose Estimation With Mixture Density Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
CrossInfoNet: Multi-Task Information Sharing Based Hand Pose Estimation
Kuo Du, Xiangbo Lin, Yi Sun, Xiaohong Ma


This paper focuses on the topic of vision-based hand pose estimation from a single depth map using a convolutional neural network (CNN). Our main contributions lie in designing a new pose regression network architecture named CrossInfoNet. The proposed CrossInfoNet decomposes hand pose estimation into a palm pose estimation sub-task and a finger pose estimation sub-task, and adopts a two-branch cross-connection structure to share the beneficial complementary information between the sub-tasks. Our work is inspired by the multi-task information sharing mechanism, which has rarely been discussed in previous publications on hand pose estimation using depth data. In addition, we propose a heat-map guided feature extraction structure to get better feature maps, and train the complete network end-to-end. The effectiveness of the proposed CrossInfoNet is evaluated with extensive self-comparison experiments and in comparison with state-of-the-art methods on four public hand pose datasets. The code is available.
[joint, human, dataset, recognition, work, challenge] [pose, hand, palm, estimation, finger, computer, vision, depth, pattern, single, initial, nyu, estimated, error, international, densereg, handpointnet, regressing, heat, deepprior, volume, june] [conference, ieee, based, method, proposed, icvl, input, figure, image] [network, sharing, residual, better, deep, output, architecture, performance, compared, size, convolutional, design, layer, block, structure] [model] [feature, regression, extraction, branch, module, refinement, baseline, cnn, three, msra, guided, hierarchical, final, map, region] [learning, set, distance, training, task, loss, data]
@InProceedings{Du_2019_CVPR,
  author = {Du, Kuo and Lin, Xiangbo and Sun, Yi and Ma, Xiaohong},
  title = {CrossInfoNet: Multi-Task Information Sharing Based Hand Pose Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
P2SGrad: Refined Gradients for Optimizing Deep Face Models
Xiao Zhang, Rui Zhao, Junjie Yan, Mengya Gao, Yu Qiao, Xiaogang Wang, Hongsheng Li


Cosine-based softmax losses significantly improve the performance of deep face recognition networks. However, these losses always include sensitive hyper-parameters which can make the training process unstable, and it is very tricky to set suitable hyper-parameters for a specific dataset. This paper addresses this challenge by directly designing the gradients for training in an adaptive manner. We first investigate and unify previous cosine softmax losses from the perspective of gradients. This unified view inspires us to propose a novel gradient called P2SGrad (Probability-to-Similarity Gradient), which leverages a cosine similarity instead of classification probability to control the gradients for updating neural network parameters. P2SGrad is adaptive and hyper-parameter free, which makes the training process more efficient and faster. We evaluate our P2SGrad on three face recognition benchmarks, LFW, MegaFace, and IJB-C. The results show that P2SGrad is stable in training, robust to noise, and achieves state-of-the-art performance on all three benchmarks.
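The core idea can be sketched as a surrogate loss whose gradient with respect to each cosine logit is (cos_theta_j - 1[j == y]) rather than (softmax probability - 1[j == y]). This is a simplified reading of the abstract, not the released implementation.

# Toy P2SGrad-style surrogate: the detached coefficient fixes the gradient w.r.t. the cosines.
import torch
import torch.nn.functional as F

def p2sgrad_like_loss(feats, class_weights, labels):
    cos = F.normalize(feats, dim=1) @ F.normalize(class_weights, dim=1).t()
    onehot = F.one_hot(labels, cos.shape[1]).float()
    coeff = (cos - onehot).detach()        # desired, hyper-parameter-free gradient per class
    return (coeff * cos).sum() / feats.shape[0]

feats = torch.randn(32, 512, requires_grad=True)
W = torch.randn(1000, 512, requires_grad=True)
labels = torch.randint(0, 1000, (32,))
p2sgrad_like_loss(feats, W, labels).backward()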
[recognition, backward, forward, dataset] [computer, vision, pattern, direction, angle, perspective, formulation, range, corresponding] [face, conference, proposed, ieee, figure, based, change, produce, method, facial] [gradient, deep, number, neural, performance, factor, compared, best, network, iteration, small, verification, table, process, weight, larger] [length, probability, arxiv, preprint, vector, model, evaluation] [feature, average, curve, calculation, xiaogang, assigned] [softmax, cosine, loss, training, class, classification, margin, angular, testing, learning, function, megaface, hyperparameter, updating, lce, cosface, arcface, similarity, logit, trained, conventional, viewed, metric, large, test, hyperparameters]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Xiao and Zhao, Rui and Yan, Junjie and Gao, Mengya and Qiao, Yu and Wang, Xiaogang and Li, Hongsheng},
  title = {P2SGrad: Refined Gradients for Optimizing Deep Face Models},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Action Recognition From Single Timestamp Supervision in Untrimmed Videos
Davide Moltisanti, Sanja Fidler, Dima Damen


Recognising actions in videos relies on labelled supervision during training, typically the start and end times of each action instance. This supervision is not only subjective, but also expensive to acquire. Weak video-level supervision has been successfully exploited for recognition in untrimmed videos; however, it is challenged when the number of different actions in training videos increases. We propose a method that is supervised by single timestamps located around each action instance, in untrimmed videos. We replace expensive action bounds with sampling distributions initialised from these timestamps. We then use the classifier's response to iteratively update the sampling distributions. We demonstrate that these distributions converge to the location and extent of discriminative action segments. We evaluate our method on three datasets for fine-grained recognition, with an increasing number of different actions per video, and show that single timestamps offer a reasonable compromise between recognition performance and labelling effort, performing comparably to full temporal supervision. Our update method improves top-1 test accuracy by up to 5.4% across the evaluated datasets.
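The sampling-distribution idea can be pictured as below. The plateau shape, its parameters, and the update rule are illustrative assumptions, not the paper's exact formulation.

# Sample training frames from a soft temporal distribution around a single timestamp,
# then drift its centre towards frames the classifier responds to.
import numpy as np

def plateau(t, center, width, decay=0.1):
    """Roughly flat over ~width frames around center, decaying outside."""
    return 1.0 / (np.exp(decay * (np.abs(t - center) - width / 2)) + 1.0)

def sample_frames(n_frames, center, width, k=8, rng=np.random.default_rng(0)):
    t = np.arange(n_frames)
    p = plateau(t, center, width)
    return rng.choice(t, size=k, p=p / p.sum())

def update_center(center, frame_times, class_scores, lr=0.5):
    w = np.asarray(class_scores) / (np.sum(class_scores) + 1e-8)
    return (1 - lr) * center + lr * float(np.dot(w, frame_times))

frames = sample_frames(n_frames=300, center=120, width=40)
new_center = update_center(120, frames, class_scores=np.random.rand(len(frames)))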
[action, temporal, video, timestamp, untrimmed, timestamps, thumos, epic, recognition, beoid, plateau, multiple, iteratively, jar, narration, ivan, start, frame] [single, initial, corresponding, point, approach, confidence, note, equation] [figure, method, based, background] [number, full, accuracy, table, net, best, converge, increasing, comparable] [model, relevant, sampled, refer, step] [supervision, average, three, weakly, propose, annotation, weak, location, extent, aligned, annotated, cup] [sampling, update, training, distribution, class, labelled, classifier, supervised, select, softmax, train, learning, set, datasets, updating, selected, test, sufficient, function, base, curriculum, close, sample]
@InProceedings{Moltisanti_2019_CVPR,
  author = {Moltisanti, Davide and Fidler, Sanja and Damen, Dima},
  title = {Action Recognition From Single Timestamp Supervision in Untrimmed Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Time-Conditioned Action Anticipation in One Shot
Qiuhong Ke, Mario Fritz, Bernt Schiele


The goal of human action anticipation is to predict future actions. Ideally, in real-world applications such as video surveillance and self-driving systems, future actions should not only be predicted with high accuracy but also at arbitrary and variable time-horizons ranging from short- to long-term predictions. Current work mostly focuses on predicting the next action, and thus long-term prediction is achieved by recursive prediction of each next action, which is inefficient and accumulates errors. In this paper, we propose a novel time-conditioned method for efficient and effective long-term action anticipation. There are two key ingredients to our approach. First, explicitly conditioning our anticipation network on time allows it to efficiently anticipate long-term actions as well. Second, we propose an attended temporal feature and a time-conditioned skip connection to extract relevant and useful information from observations for effective anticipation. We conduct extensive experiments on the large-scale Epic-Kitchen and the 50Salads Datasets. Experimental results show that the proposed method is capable of anticipating future actions at both short-term and long-term horizons, and achieves state-of-the-art performance.
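Conditioning the anticipation network on the time horizon can be sketched as below; the dimensions, fusion scheme, and action count are illustrative assumptions rather than the paper's exact design.

# One model, arbitrary horizons: embed the anticipation time and fuse it with the
# observed video feature before classification.
import torch
import torch.nn as nn

class TimeConditionedAnticipator(nn.Module):
    def __init__(self, feat_dim=1024, n_actions=48):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, 128), nn.ReLU())
        self.fuse = nn.Sequential(nn.Linear(feat_dim + 128, 512), nn.ReLU())
        self.classifier = nn.Linear(512, n_actions)

    def forward(self, obs_feat, horizon_sec):
        t = self.time_embed(horizon_sec.view(-1, 1))
        h = self.fuse(torch.cat([obs_feat, t], dim=1))
        return self.classifier(h)          # scores for the action expected after horizon_sec

scores = TimeConditionedAnticipator()(torch.randn(4, 1024), torch.tensor([1.0, 5.0, 10.0, 30.0]))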
[action, anticipation, temporal, future, time, tos, anticipate, anticipating, dataset, wash, prediction, early, rnn, recognition, human, video, sequence, rdc, tsc, incorporate, benefit, mix, predicting, tsn, outperforms] [observation, cut, computer, pattern, initial, vision, international, analysis, dense] [method, proposed, conference, figure, ieee, remove, conduct, based] [skip, connection, performance, table, network, scale, number, accuracy, parameter, compared, denotes, order, convolution, achieved, deep] [attended, iterative, generate, observed, attention, introduced, step, model, machine, introduce] [feature, baseline, cnn, final, average, european] [set, representation, unseen, label, learning, training, class]
@InProceedings{Ke_2019_CVPR,
  author = {Ke, Qiuhong and Fritz, Mario and Schiele, Bernt},
  title = {Time-Conditioned Action Anticipation in One Shot},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dance With Flow: Two-In-One Stream Action Detection
Jiaojiao Zhao, Cees G. M. Snoek


The goal of this paper is to detect the spatio-temporal extent of an action. The two-stream detection network based on RGB and flow provides state-of-the-art accuracy at the expense of a large model-size and heavy computation. We propose to embed RGB and optical-flow into a single two-in-one stream network with new layers. A motion condition layer extracts motion information from flow images, which is leveraged by the motion modulation layer to generate transformation parameters for modulating the low-level RGB features. The method is easily embedded in existing appearance- or two-stream action detection networks, and trained end-to-end. Experiments demonstrate that leveraging the motion condition to modulate RGB features improves detection accuracy. With only half the computation and parameters of the state-of-the-art two-stream methods, our two-in-one stream still achieves impressive results on UCF101-24, UCFSports and J-HMDB.
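The motion condition / motion modulation pair can be pictured as a FiLM-style layer: the flow image produces per-channel scale and shift parameters that modulate low-level RGB features. Layer sizes and the exact form here are assumptions for illustration, not the paper's layers.

# Modulate RGB features with parameters predicted from the flow image.
import torch
import torch.nn as nn

class MotionModulation(nn.Module):
    def __init__(self, flow_ch=2, feat_ch=64):
        super().__init__()
        self.condition = nn.Sequential(                 # motion condition: summarise the flow
            nn.Conv2d(flow_ch, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.to_gamma = nn.Linear(32, feat_ch)          # motion modulation: per-channel scale
        self.to_beta = nn.Linear(32, feat_ch)           # motion modulation: per-channel shift

    def forward(self, rgb_feat, flow_img):
        c = self.condition(flow_img)
        gamma = self.to_gamma(c)[..., None, None]
        beta = self.to_beta(c)[..., None, None]
        return rgb_feat * (1 + gamma) + beta            # modulated low-level RGB features

out = MotionModulation()(torch.randn(2, 64, 56, 56), torch.randn(2, 2, 224, 224))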
[action, motion, stream, flow, video, multiple, cordelia, singh, outperforms, broxflow, frame, spatiotemporal, time, optical, human, kalogeiton, follow, actor, realtimeflow, mubarak, ucfsports, peng] [rgb, condition, single, corresponding] [figure, appearance, image, method, based, half, high, prior, transformation, input, proposed] [layer, network, modulation, accuracy, better, convolutional, conv, computation, best, modulate, deep, table, add, efficiency, efficient, order] [generate, model, embed, visual] [detection, feature, detector, three, propose, spatial, localization, average, box, leading, modulated] [loss, trained, learning, classification, training, learned]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Jiaojiao and Snoek, Cees G. M.},
  title = {Dance With Flow: Two-In-One Stream Action Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Representation Flow for Action Recognition
AJ Piergiovanni, Michael S. Ryoo


In this paper, we propose a convolutional layer inspired by optical flow algorithms to learn motion representations. Our representation flow layer is a fully-differentiable layer designed to capture the `flow' of any representation channel within a convolutional neural network for action recognition. Its parameters for iterative flow optimization are learned in an end-to-end fashion together with the other CNN model parameters, maximizing the action recognition performance. Furthermore, we newly introduce the concept of learning `flow of flow' representations by stacking multiple representation flow layers. We conducted extensive experimental evaluations, confirming its advantages over previous recognition models using traditional optical flows in both computational speed and performance. The code is publicly available.
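A flow-like iteration inside a layer can be sketched with a Horn-Schunck-style update whose smoothness parameter is learnable; this toy variant is my own simplification, not the paper's TV-L1-inspired formulation.

# A toy differentiable "flow of representations" layer over single-channel feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepFlowLayer(nn.Module):
    def __init__(self, n_iter=10):
        super().__init__()
        self.n_iter = n_iter
        self.lam = nn.Parameter(torch.tensor(1.0))      # learnable smoothness weight
        sobel = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]) / 8
        self.register_buffer('kx', sobel.view(1, 1, 3, 3))
        self.register_buffer('ky', sobel.t().contiguous().view(1, 1, 3, 3))

    def forward(self, f1, f2):                          # f1, f2: (N, 1, H, W) consecutive features
        Ix = F.conv2d(f1, self.kx, padding=1)
        Iy = F.conv2d(f1, self.ky, padding=1)
        It = f2 - f1
        u = torch.zeros_like(f1)
        v = torch.zeros_like(f1)
        for _ in range(self.n_iter):                    # Horn-Schunck-style iterations
            ubar = F.avg_pool2d(u, 3, stride=1, padding=1)
            vbar = F.avg_pool2d(v, 3, stride=1, padding=1)
            t = (Ix * ubar + Iy * vbar + It) / (self.lam + Ix ** 2 + Iy ** 2)
            u, v = ubar - Ix * t, vbar - Iy * t
        return torch.cat([u, v], dim=1)                 # per-position (u, v) of the representation

flow = RepFlowLayer()(torch.randn(8, 1, 56, 56), torch.randn(8, 1, 56, 56))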
[flow, optical, motion, recognition, action, video, capture, temporal, multiple, activity, kinetics, fusion, sequential, capturing, hmdb, confirm, performs, previous] [rgb, computer, vision, compute, pattern, optimization, algorithm, single, note, allows] [conference, ieee, input, method, based, figure, image, intermediate, appearance] [layer, convolutional, table, computing, cnns, block, performance, compare, number, residual, applied, standard, neural, network, compared, computation, better, inspired, designed] [model, iterative, find] [cnn, feature, spatial, including] [representation, learning, learned, learn, divergence, classification, trained, large, existing]
@InProceedings{Piergiovanni_2019_CVPR,
  author = {Piergiovanni, AJ and Ryoo, Michael S.},
  title = {Representation Flow for Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
LSTA: Long Short-Term Attention for Egocentric Action Recognition
Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz


Egocentric activity recognition is one of the most challenging tasks in video analysis. It requires a fine-grained discrimination of small objects and their manipulation. While some methods rely on strong supervision and attention mechanisms, they are either annotation-consuming or do not take spatio-temporal patterns into account. In this paper we propose LSTA as a mechanism to focus on features from spatially relevant parts while attention is being tracked smoothly across the video sequence. We demonstrate the effectiveness of LSTA on egocentric activity recognition with an end-to-end trainable two-stream architecture, achieving state-of-the-art performance on four standard benchmarks.
[stream, action, lsta, recognition, video, activity, egocentric, fusion, motion, flow, state, frame, lstm, temporal, recurrent, gating, gtea, tracking, elegatt, long, sequence, focus, dataset, second, prediction, convlstm, previous] [analysis, rgb, view] [input, proposed, appearance, image, method, control, gaze, based] [network, output, pooling, deep, convolutional, performance, standard, neural, rate, applied, weight, tensor, accuracy, better, design, fixed, residual] [attention, memory, relevant, model, generated, mechanism, encoding, enables, strong, visual, adding, generates] [map, object, person, baseline, spatial, feature, ablation, pool, improvement, level, detailed] [learning, trained, bias, classification, discriminative, training]
@InProceedings{Sudhakaran_2019_CVPR,
  author = {Sudhakaran, Swathikiran and Escalera, Sergio and Lanz, Oswald},
  title = {LSTA: Long Short-Term Attention for Egocentric Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Actor Relation Graphs for Group Activity Recognition
Jianchao Wu, Limin Wang, Li Wang, Jie Guo, Gangshan Wu


Modeling relations between actors is important for recognizing group activity in a multi-person scene. This paper aims at learning discriminative relations between actors efficiently using deep models. To this end, we propose to build a flexible and efficient Actor Relation Graph (ARG) to simultaneously capture the appearance and position relations between actors. Thanks to the Graph Convolutional Network, the connections in ARG can be automatically learned from group activity videos in an end-to-end manner, and the inference on ARG can be efficiently performed with standard matrix operations. Furthermore, in practice, we come up with two variants to sparsify ARG for more effective modeling in videos: a spatially localized ARG and a temporally randomized ARG. We perform extensive experiments on two standard group activity recognition datasets: the Volleyball dataset and the Collective Activity dataset, where state-of-the-art performance is achieved on both datasets. We also visualize the learned actor graphs and relation features, which demonstrate that the proposed ARG is able to capture the discriminative relation information for group activity recognition.
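The appearance-plus-position relation graph and a graph-conv update can be sketched as below; the relation function, the locality threshold, and the residual update are simplified assumptions, not the paper's exact formulation.

# Build a row-normalised actor relation graph and update actor features with G @ X @ W.
import torch
import torch.nn as nn
import torch.nn.functional as F

def relation_graph(app_feat, centers, radius=0.3):
    """app_feat: (N, D) actor appearance features; centers: (N, 2) box centres in [0, 1]."""
    sim = app_feat @ app_feat.t() / app_feat.shape[1] ** 0.5      # appearance relation
    dist = torch.cdist(centers, centers)                          # position relation
    sim = sim.masked_fill(dist > radius, float('-inf'))           # spatially localized variant
    return F.softmax(sim, dim=1)

class ActorGCNLayer(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)

    def forward(self, G, X):
        return X + F.relu(self.W(G @ X))     # residual relational feature update

X, centers = torch.randn(12, 1024), torch.rand(12, 2)
out = ActorGCNLayer()(relation_graph(X, centers), X)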
[activity, graph, actor, temporal, recognition, modeling, individual, action, video, multiple, collective, gcn, volleyball, dataset, perform, capture, fusion, build, frame, explicitly, late, greg, state, key, recognizing, recurrent, people, framework] [position, scene, single, compute] [appearance, proposed, figure, method, comparison, based] [group, network, deep, performance, neural, convolutional, accuracy, table, efficient, building, inference, original, number, sparse, weight, flexible, denotes] [model, relational, arg, reasoning, visual, represent, understanding, node, embedded] [relation, feature, bounding, visualization, spatial, hierarchical, context, object, adopt, fused] [learning, distance, learned, representation, set, classification, sampling, training, strategy]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Jianchao and Wang, Limin and Wang, Li and Guo, Jie and Wu, Gangshan},
  title = {Learning Actor Relation Graphs for Group Activity Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Structured Model for Action Detection
Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid


A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand. While this is an obviously attractive approach, it is not applicable in all scenarios. We claim that action detection is one such challenging problem - the models that need to be trained are large, and labeled data is expensive to obtain. To address this limitation, we propose to incorporate domain knowledge into the structure of the model, simplifying optimization. In particular, we augment a standard I3D network with a tracking module to aggregate long-term motion patterns, and use a graph convolutional network to reason about interactions between actors and objects. Evaluated on the challenging AVA dataset, the proposed approach improves over the I3D baseline by 5.5% mAP and over the state-of-the-art by 4.8% mAP.
[action, actor, graph, video, temporal, human, tubelets, tracking, interaction, modeling, explicitly, dataset, frame, cordelia, tubelet, motion, ava, recognition, multiple, represented, work, sequence, time] [approach, ground, truth, analysis] [appearance, manipulation, proposed, method, based, figure] [convolutional, performance, network, architecture, table, neural, achieves, validation, aggregate, deep] [model, visual, association, node, relational] [object, detection, relation, feature, baseline, propose, bounding, person, map, module, integrate, spatial, localization, box, region, ross, interest, edge, challenging, detect] [learning, representation, similarity, training, hard, soft, trained, train, learn, dimension, large]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Yubo and Tokmakov, Pavel and Hebert, Martial and Schmid, Cordelia},
  title = {A Structured Model for Action Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Out-Of-Distribution Detection for Generalized Zero-Shot Action Recognition
Devraj Mandal, Sanath Narayan, Sai Kumar Dwivedi, Vikram Gupta, Shuaib Ahmed, Fahad Shahbaz Khan, Ling Shao


Generalized zero-shot action recognition is a challenging problem, where the task is to recognize new action categories that are unavailable during the training stage, in addition to the seen action categories. Existing approaches suffer from the inherent bias of the learned classifier towards the seen action categories. As a consequence, unseen category samples are incorrectly classified as belonging to one of the seen action categories. In this paper, we set out to tackle this issue by arguing for a separate treatment of seen and unseen action categories in generalized zero-shot action recognition. We introduce an out-of-distribution detector that determines whether the video features belong to a seen or unseen action category. To train our out-of-distribution detector, video features for unseen action categories are synthesized using generative adversarial networks trained on seen action category features. To the best of our knowledge, we are the first to propose an out-of-distribution detector based GZSL framework for action recognition in videos. Experiments are performed on three action recognition datasets: Olympic Sports, HMDB51 and UCF101. For generalized zero-shot action recognition, our proposed approach outperforms the baseline with absolute gains (in classification accuracy) of 7.0%, 3.4%, and 4.9%, respectively, on these datasets.
[action, video, recognition, framework, dataset, outperforms] [approach, manual, problem, denote, corresponding] [proposed, real, comparison, synthesize, generative, conditional, synthesized, based, image, generator, input] [performance, output, size, best, network, achieves, better, equal, number, accuracy, higher] [gan, generated, visual, adversarial, wgan, random, generating] [detector, feature, baseline, category, three] [unseen, class, gzsl, embedding, zsl, bias, learning, generalized, classifier, data, training, trained, test, classification, cewgan, entropy, task, datasets, learned, olympic, loss, existing, fod, learn, clswgan, belonging, set, train, transductive, unlabelled, distribution, combination]
@InProceedings{Mandal_2019_CVPR,
  author = {Mandal, Devraj and Narayan, Sanath and Kumar Dwivedi, Sai and Gupta, Vikram and Ahmed, Shuaib and Shahbaz Khan, Fahad and Shao, Ling},
  title = {Out-Of-Distribution Detection for Generalized Zero-Shot Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
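The generalized zero-shot pipeline above hinges on a gating step: an out-of-distribution detector is fit on real seen-class features versus GAN-synthesized unseen-class features, and at test time routes each video feature to either the seen-class classifier or the zero-shot classifier. The sketch below assumes the GAN and both classifiers are already given, and uses a plain logistic-regression detector as a simplified stand-in.

# Hedged sketch of the OOD gating idea (detector choice is illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ood_detector(real_seen_feats, synth_unseen_feats):
    X = np.vstack([real_seen_feats, synth_unseen_feats])
    y = np.concatenate([np.zeros(len(real_seen_feats)),     # 0 = seen / in-distribution
                        np.ones(len(synth_unseen_feats))])   # 1 = unseen / out-of-distribution
    return LogisticRegression(max_iter=1000).fit(X, y)

def gzsl_predict(x, detector, seen_clf, unseen_clf):
    # route the test feature to the appropriate classifier
    is_unseen = detector.predict(x[None])[0] == 1
    return (unseen_clf if is_unseen else seen_clf)(x)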
Object Discovery in Videos as Foreground Motion Clustering
Christopher Xie, Yu Xiang, Zaid Harchaoui, Dieter Fox


We consider the problem of providing dense segmentation masks for object discovery in videos. We formulate the object discovery problem as foreground motion clustering, where the goal is to cluster foreground pixels in videos into different objects. We introduce a novel pixel-trajectory recurrent neural network that learns feature embeddings of foreground pixel trajectories linked across time. By clustering the pixel trajectories using the learned feature embeddings, our method establishes correspondences between foreground object masks across video frames. To demonstrate the effectiveness of our framework for object discovery, we conduct experiments on commonly used datasets for motion segmentation, where we achieve state-of-the-art performance.
[motion, trajectory, video, flow, optical, fusion, recognition, time, fbms, recurrent, learns, dataset, ccg, linked, moving, tracking, work, frame, rnn, forward, multiple, lvo] [computer, vision, pattern, international, rgb, note, problem, analysis, compute, dense, denote] [pixel, conference, ieee, method, figure, produce, image, background, based] [network, neural, architecture, performance, table, unit, conv, best, convolutional] [discover, model, obj, machine, introduce] [foreground, object, segmentation, feature, mask, instance, segment, discovery] [embeddings, clustering, learning, set, cluster, training, loss, embedding, learn, function, novel, train, intra, inter]
@InProceedings{Xie_2019_CVPR,
  author = {Xie, Christopher and Xiang, Yu and Harchaoui, Zaid and Fox, Dieter},
  title = {Object Discovery in Videos as Foreground Motion Clustering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Towards Natural and Accurate Future Motion Prediction of Humans and Animals
Zhenguang Liu, Shuang Wu, Shuyuan Jin, Qi Liu, Shijian Lu, Roger Zimmermann, Li Cheng


Anticipating the future motions of 3D articulate objects is challenging due to their non-linear and highly stochastic nature. Current approaches typically represent the skeleton of an articulate object as a set of 3D joints, which unfortunately ignores the relationship between joints and fails to encode fine-grained anatomical constraints. Moreover, conventional recurrent neural networks, such as LSTM and GRU, are employed to model motion contexts, which inherently have difficulties in capturing long-term dependencies. To address these problems, we propose to explicitly encode anatomical constraints by modeling their skeletons with a Lie algebra representation. Importantly, a hierarchical recurrent network structure is developed to simultaneously encode local contexts of individual frames and global contexts of the sequence. We proceed to explore the applications of our approach to several distinct subjects, including human, fish, and mouse. Extensive experiments show that our approach achieves more natural and accurate predictions over state-of-the-art methods.
[motion, state, recurrent, human, hmr, lie, mouse, prediction, fish, hidden, joint, algebra, lstm, future, sequence, erd, articulate, bone, forget, frame, dataset, modeling, forecasting, action, represented, window, skeleton, explicitly, motionless, time] [pose, kinematic, body, local, skeletal, rigid, accurate, approach, well, coordinate, relative, estimation, depth] [proposed, input, tanh, based, anatomical] [network, number, cell, table, gate, size, process, performance, neural, structure, deep] [step, model, decoder, machine, encoder, natural, chain, animal] [global, hierarchical, context, object, baseline, predicted, neighboring] [existing, loss, representation, datasets, learning, update, function, training, conventional]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Zhenguang and Wu, Shuang and Jin, Shuyuan and Liu, Qi and Lu, Shijian and Zimmermann, Roger and Cheng, Li},
  title = {Towards Natural and Accurate Future Motion Prediction of Humans and Animals},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
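The motion-prediction paper above replaces raw 3D joint positions with a Lie-algebra parameterization of bone rotations. A minimal sketch of that representation is the so(3) exp/log pair via Rodrigues' formula, shown below; this only illustrates the parameterization, not the paper's hierarchical recurrent network.

# so(3) exp/log maps (axis-angle <-> rotation matrix), a minimal sketch of the
# Lie-algebra bone representation.
import numpy as np

def hat(w):
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])

def so3_exp(w):
    # axis-angle vector w in R^3 -> rotation matrix in SO(3)
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3)
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def so3_log(R):
    # rotation matrix -> axis-angle vector (the Lie-algebra coordinates)
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-8:
        return np.zeros(3)
    w_hat = (R - R.T) * (theta / (2 * np.sin(theta)))
    return np.array([w_hat[2, 1], w_hat[0, 2], w_hat[1, 0]])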
Automatic Face Aging in Videos via Deep Reinforcement Learning
Chi Nhan Duong, Khoa Luu, Kha Gia Quach, Nghia Nguyen, Eric Patterson, Tien D. Bui, Ngan Le


This paper presents a novel approach for automatically synthesizing age-progressed facial images in video sequences using Deep Reinforcement Learning. The proposed method models facial structures and the longitudinal face-aging process of given subjects coherently across video frames. The approach is optimized using a long-term reward Reinforcement Learning function with deep feature extraction from a Deep Convolutional Neural Network. Unlike previous age-progression methods that are only able to synthesize an aged likeness of a face from a single input image, the proposed approach is capable of age-progressing facial likenesses in videos with consistently synthesized facial features across frames. In addition, the deep reinforcement learning method guarantees preservation of the visual identity of input faces after age-progression. Results on videos of our newly collected aging face AGFW-v2 database demonstrate the advantages of the proposed solution in terms of the quality of age-progressed faces, temporal smoothness, and cross-age face verification.
[video, temporal, frame, current, previous, recognition, action, framework, state, employed] [approach, computer, matching, longitudinal, estimation, single] [age, aging, face, facial, image, proposed, input, progression, consistency, based, synthesized, synthesis, synthesize, method, database, collected, presented, conditional, figure, khoa, aged, produce, appearance, chi, nhan, kha, gia, quality, synthesizing] [deep, process, network, neural, table, processing, structure, number, original] [model, policy, young, reward, automatic, reinforcement, relationship, agent, selecting, adversarial, step] [feature] [neighbor, learning, representation, set, function, embedding, nearest, consistently]
@InProceedings{Duong_2019_CVPR,
  author = {Nhan Duong, Chi and Luu, Khoa and Gia Quach, Kha and Nguyen, Nghia and Patterson, Eric and Bui, Tien D. and Le, Ngan},
  title = {Automatic Face Aging in Videos via Deep Reinforcement Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Adversarial Discriminative Deep Domain Generalization for Face Presentation Attack Detection
Rui Shao, Xiangyuan Lan, Jiawei Li, Pong C. Yuen


Face presentation attacks have become an increasingly critical issue in the face recognition community. Many face anti-spoofing methods have been proposed, but they cannot generalize well on "unseen" attacks. This work focuses on improving the generalization ability of face anti-spoofing methods from the perspective of the domain generalization. We propose to learn a generalized feature space via a novel multi-adversarial discriminative deep domain generalization framework. In this framework, a multi-adversarial deep domain generalization is performed under a dual-force triplet-mining constraint. This ensures that the learned feature space is discriminative and shared by multiple source domains, and thus is more generalized to new face presentation attacks. An auxiliary face depth supervision is incorporated to further enhance the generalization ability. Extensive experiments on four public datasets validate the effectiveness of the proposed method.
[multiple, video, fusion, extract, recognition, work, learns, subject] [depth, constraint, well] [face, proposed, method, generator, presentation, real, image, comparison, color, photo, based, texture, figure, print, replay, input, antispoofing, casia, printed] [deep, table, network, rate, layer, automatically, binary, adaptively, denotes] [adversarial, ability, fake, model, common, attack] [feature, false, cnn, supervision, detection] [domain, source, space, generalization, generalized, discriminative, training, learning, learned, shared, learn, testing, auxiliary, differentiation, data, unseen, distribution, train, loss, datasets, trained, target, unsupervised, idiap, paper, exploit, positive, lbptop, maddg, pong, incorporated, adaptation, aligning, ida, xiangyuan]
@InProceedings{Shao_2019_CVPR,
  author = {Shao, Rui and Lan, Xiangyuan and Li, Jiawei and Yuen, Pong C.},
  title = {Multi-Adversarial Discriminative Deep Domain Generalization for Face Presentation Attack Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
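The anti-spoofing paper above relies on a dual-force triplet-mining constraint so that the learned feature space is both discriminative and shared across source domains. As a simplified stand-in, the sketch below implements a cross-domain triplet loss where each anchor is pulled toward a same-label sample from a different source domain and pushed away from a different-label sample; the mining strategy and margin are illustrative, not the paper's exact formulation.

# Simplified cross-domain triplet loss (illustrative stand-in).
import torch
import torch.nn.functional as F

def cross_domain_triplet_loss(feats, labels, domains, margin=0.5):
    loss, count = feats.new_zeros(()), 0
    for i in range(len(feats)):
        pos = (labels == labels[i]) & (domains != domains[i])   # same class, other domain
        neg = (labels != labels[i])                              # different class
        if pos.any() and neg.any():
            d_pos = torch.cdist(feats[i:i + 1], feats[pos]).max()   # hardest positive
            d_neg = torch.cdist(feats[i:i + 1], feats[neg]).min()   # hardest negative
            loss = loss + F.relu(d_pos - d_neg + margin)
            count += 1
    return loss / max(count, 1)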
A Content Transformation Block for Image Style Transfer
Dmytro Kotovenko, Artsiom Sanakoyeu, Pingchuan Ma, Sabine Lang, Bjorn Ommer


Style transfer has recently received a lot of attention, since it allows us to study fundamental challenges in image understanding and synthesis. Recent work has significantly improved the representation of color and texture, computational speed, and image resolution. The explicit transformation of image content has, however, been mostly neglected: while artistic style affects formal characteristics of an image, such as color, shape or texture, it also deforms, adds or removes content details. This paper explicitly focuses on a content- and style-aware stylization of a content image. Therefore, we introduce a content transformation module between the encoder and decoder. Moreover, we utilize similar content appearing in photographs and style samples to learn how style alters content details, and we generalize this to other class details. Additionally, this work presents a novel normalization layer critical for high resolution image synthesis. The robustness and speed of our model enable video stylization in real-time and high definition. We perform extensive qualitative and quantitative evaluations to demonstrate the validity of our approach.
[deception, dataset, work, perform, human] [computer, approach, local, vision, pattern, compute, column] [content, style, image, stylized, stylization, real, transformation, input, figure, artistic, method, quality, conference, texture, artist, ast, synthesis, ieee, alters, resolution, qualitative, control, quantitative, stylize, alter, expert, rsscd] [neural, block, normalization, network, layer, deep, convolutional, rate, pretrained, fast, tensor, accuracy, table, number] [model, encoder, generated, van, arxiv, preprint, decoder, visual, adversarial, discriminator] [feature, score, utilize, object, spatial, art, instance, average] [transfer, class, representation, loss, learn, specific, training, classification, positive, target, distance, generalize, learned]
@InProceedings{Kotovenko_2019_CVPR,
  author = {Kotovenko, Dmytro and Sanakoyeu, Artsiom and Ma, Pingchuan and Lang, Sabine and Ommer, Bjorn},
  title = {A Content Transformation Block for Image Style Transfer},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
BeautyGlow: On-Demand Makeup Transfer Framework With Reversible Generative Network
Hung-Jen Chen, Ka-Ming Hui, Szu-Yu Wang, Li-Wu Tsao, Hong-Han Shuai, Wen-Huang Cheng


As makeup has been widely adopted for beautification, finding suitable makeup through virtual makeup applications has become popular. Therefore, a recent line of studies proposes to transfer the makeup from a given reference makeup image to a source non-makeup one. However, this is still challenging due to the massive number of makeup combinations. To facilitate on-demand makeup transfer, in this work we propose BeautyGlow, which decomposes the latent vectors of face images derived from the Glow model into makeup and non-makeup latent vectors. Since there is no paired dataset, we formulate a new loss function to guide the decomposition. Afterward, the non-makeup latent vector of a source image and the makeup latent vector of a reference image are effectively combined and reverted back to the image domain to derive the result. Experimental results show that the transfer quality of BeautyGlow is comparable to state-of-the-art methods, while the unique ability to manipulate latent vectors allows BeautyGlow to realize on-demand makeup transfer.
[framework, recognition, extract] [vision, computer, matrix, pattern, general, international, derived] [makeup, latent, image, style, reference, conference, beautyglow, facial, transformation, face, glow, ieee, perceptual, figure, color, cycle, consistency, generative, input, denoted, realistic, cyclegan, based, manipulating, generator, proposed, qualitative, analogy, method, supposed, decompose, paired, guide, manipulate, adjusting, user, invertible, quantitative, comparison, manipulation, lintra] [layer, number, proposes, formulate, comparable, architecture, compared, deep] [vector, model, generate, find, adversarial, introduce, ability] [average, propose, heavy] [transfer, loss, source, space, function, domain, training, train, unsupervised, specific, close, centroid, learning]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Hung-Jen and Hui, Ka-Ming and Wang, Szu-Yu and Tsao, Li-Wu and Shuai, Hong-Han and Cheng, Wen-Huang},
  title = {BeautyGlow: On-Demand Makeup Transfer Framework With Reversible Generative Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
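The BeautyGlow idea above reduces to latent arithmetic in an invertible model: map both faces to latent space, keep the non-makeup component of the source, add the makeup component of the reference, and invert. In the sketch below, `glow` is assumed to be an invertible model exposing forward()/inverse(), and W a learned matrix projecting a latent vector onto its non-makeup part; both are placeholders, not the authors' released code.

# Hedged sketch of the BeautyGlow recombination step (placeholders throughout).
def transfer_makeup(glow, W, source_img, reference_img):
    z_src = glow.forward(source_img)           # latent of the non-makeup face
    z_ref = glow.forward(reference_img)        # latent of the makeup face
    z_src_face = z_src @ W                     # non-makeup component of the source
    z_ref_makeup = z_ref - z_ref @ W           # makeup component of the reference
    return glow.inverse(z_src_face + z_ref_makeup)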
Style Transfer by Relaxed Optimal Transport and Self-Similarity
Nicholas Kolkin, Jason Salavon, Gregory Shakhnarovich


The goal of style transfer algorithms is to render the content of one image using the style of another. We propose Style Transfer by Relaxed Optimal Transport and Self-Similarity (STROTSS), a new optimization-based style transfer algorithm. We extend our method to allow user specified point-to-point or region-to-region control over visual similarity between the style image and the output. Such guidance can be used to either achieve a particular visual effect or correct errors made by unconstrained style transfer. In order to quantitatively compare our method to prior work, we conduct a large-scale user study designed to assess the style-content tradeoff across settings in style transfer algorithms. Our results indicate that for any desired level of content preservation, our method provides higher quality stylization than prior work.
[work, emd, term, human, multiple, recognition] [computer, relaxed, relative, algorithm, vision, pattern, tij, define, defined, matching, match] [style, content, image, method, figure, extracted, conference, prior, user, control, quality, proposed, earth, based, arbitrary, texture, ieee, color, stylization, stylized, gatys, resolution, unconstrained, study, palette] [output, neural, order, weight, deep, compare, network, convolutional, higher, computing] [cij, visual, evaluation, arxiv, preprint, example] [feature, spatial, propose, guidance, default, level] [transfer, loss, distance, set, distribution, min, cosine, test, transport, similarity, measure, defines, pairwise, dij]
@InProceedings{Kolkin_2019_CVPR,
  author = {Kolkin, Nicholas and Salavon, Jason and Shakhnarovich, Gregory},
  title = {Style Transfer by Relaxed Optimal Transport and Self-Similarity},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
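For the STROTSS paper above, a hedged sketch of the relaxed earth mover's distance used as the style term: with a cosine cost matrix between output and style features, the relaxed distance takes the larger of the two one-sided average nearest-neighbor costs. This is a simplification of the full objective, which also includes a self-similarity content term.

# Relaxed EMD between two feature sets with a cosine cost (hedged sketch).
import torch

def relaxed_emd(A, B, eps=1e-8):
    # A: (n, d) output features, B: (m, d) style features
    A_n = A / (A.norm(dim=1, keepdim=True) + eps)
    B_n = B / (B.norm(dim=1, keepdim=True) + eps)
    C = 1.0 - A_n @ B_n.T                      # cosine distance cost matrix
    r_a = C.min(dim=1).values.mean()           # each output feature -> nearest style feature
    r_b = C.min(dim=0).values.mean()           # each style feature -> nearest output feature
    return torch.max(r_a, r_b)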
Inserting Videos Into Videos
Donghoon Lee, Tomas Pfister, Ming-Hsuan Yang


In this paper, we introduce a new problem of manipulating a given video by inserting other videos into it. Our main task is, given an object video and a scene video, to insert the object video at a user-specified location in the scene video so that the resulting video looks realistic. We aim to handle different object motions and complex backgrounds without expensive segmentation annotations. As it is difficult to collect training pairs for this problem, we synthesize fake training pairs that can provide helpful supervisory signals when training a neural network with unpaired real data. The proposed network architecture can take both real and fake pairs as input and perform both supervised and unsupervised training in an adversarial learning scheme. To synthesize a realistic video, the network renders each frame based on the current input and previous frames. Within this framework, we observe that injecting noise into previous frames while generating the current frame stabilizes training. We conduct experiments on real-world videos in object tracking and person re-identification benchmark datasets. Experimental results demonstrate that the proposed algorithm is able to synthesize long sequences of realistic videos with a given object video inserted.
[video, previous, frame, current, sequence, motion, learns, long] [algorithm, computer, problem, scene, international, vision, approach, render, shape, pattern] [image, input, figure, conference, proposed, synthesize, real, inserting, realistic, based, content, method, synthesized, conditional, eub, ieee, blending, blended, generator, unpaired, background, noise, translation] [network, inserted, insert, neural, processing, number, convolutional] [fake, insertion, adversarial, generated, random, generating, vector] [object, semantic, segmentation, location, baseline, recall, detector, predicted, challenging, mask, bounding, map, dukemtmc, surrounding, pedestrian, person, detection] [training, learning, data, unsupervised, objective, trained, existing, learn, loss, target, pair, address, source, function, main, task, supervised, observe]
@InProceedings{Lee_2019_CVPR,
  author = {Lee, Donghoon and Pfister, Tomas and Yang, Ming-Hsuan},
  title = {Inserting Videos Into Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Image and Video Compression Through Spatial-Temporal Energy Compaction
Zhengxue Cheng, Heming Sun, Masaru Takeuchi, Jiro Katto


Compression has been an important research topic for many decades, with a significant impact on data transmission and storage. Recent advances have shown the great potential of learning-based image and video compression. Inspired by related work, in this paper we present an image compression architecture using a convolutional autoencoder, and then generalize image compression to video compression by adding an interpolation loop into both the encoder and decoder sides. Our basic idea is to realize spatial-temporal energy compaction in learning-based image and video compression. Thereby, we propose to add a spatial energy compaction-based penalty into the loss function to achieve higher image compression performance. Furthermore, based on the temporal energy distribution, we propose to select the number of frames in one interpolation loop, adapting to the motion characteristics of video contents. Experimental results demonstrate that our proposed image compression outperforms the latest image compression standard in terms of the MS-SSIM quality metric, and provides higher performance compared with state-of-the-art learning-based compression methods at high bit rates, benefiting from our spatial energy compaction approach. Meanwhile, our proposed video compression approach with temporal energy compaction significantly outperforms MPEG-4 and is competitive with the commonly used H.264. Both our image and video compression produce more visually pleasant results than traditional standards.
[video, temporal, motion, work, outperforms, frame, recurrent, dataset] [approach, computer, june, loop, error, analysis, vision, reconstruction, software, pattern] [image, interpolation, based, reconstructed, proposed, high, figure, method, ieee, comparison, synthesis, traditional, kodak, vtl, jpeg, produce, quality, visually, pleasant, transform, resolution] [compression, energy, coding, bit, performance, quantization, compaction, convolutional, neural, penalty, low, compressed, achieve, higher, network, rate, architecture, adaptive, optimized, add, better, bpg, efficiency, represents, standard, block, quantized] [model, encoder, observed, commonly, decoder, random, system] [propose, spatial, neighboring] [learning, entropy, data, loss, distribution, autoencoder, learned, function, select, conventional, set, test]
@InProceedings{Cheng_2019_CVPR,
  author = {Cheng, Zhengxue and Sun, Heming and Takeuchi, Masaru and Katto, Jiro},
  title = {Learning Image and Video Compression Through Spatial-Temporal Energy Compaction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Event-Based High Dynamic Range Image and Very High Frame Rate Video Generation Using Conditional Generative Adversarial Networks
Lin Wang, S. I. Mohammad Mostafavi, Yo-Sung Ho, Kuk-Jin Yoon


Event cameras have many advantages over traditional cameras, such as low latency, high temporal resolution, and high dynamic range. However, since the outputs of event cameras are sequences of asynchronous events over time rather than actual intensity images, existing algorithms cannot be directly applied. Therefore, generating intensity images from events is in demand for other tasks. In this paper, we unlock the potential of event camera-based conditional generative adversarial networks to create images/videos from an adjustable portion of the event data stream. The stacks of space-time coordinates of events are used as inputs and the network is trained to reproduce images based on the spatio-temporal intensity changes. The usefulness of event cameras to generate high dynamic range (HDR) images even in extreme illumination conditions, and also non-blurred images under rapid motion, is also shown. In addition, the possibility of generating very high frame rate videos is demonstrated, theoretically up to 1 million frames per second (FPS), since the temporal resolution of event cameras is about 1 microsecond. The proposed methods are evaluated by comparing the results with the intensity images captured on the same pixel grid-line of events, using online available real datasets and synthetic datasets produced by the event camera simulator.
[event, frame, time, dynamic, temporal, video, motion, stacking, sbe, dataset, brisque, asynchronous, stream, framework, davis] [reconstruction, camera, vision, ground, computer, truth, illumination, left, range, estimation, corresponding, international, normal] [image, aps, intensity, high, method, based, hdr, proposed, figure, stack, real, input, conference, translation, ieee, quality, reconstructed, sbt, generative, reconstruct, simulated, resolution, cgans, generator, blur, color, pixel, synthetic] [rate, network, number, better, table, output, deep, low, fast, applied] [generate, generated, arxiv, adversarial, create, discriminator, preprint, visual, potential] [extreme] [data, datasets, training, learning, loss, similarity]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Lin and Mohammad Mostafavi, S. I. and Ho, Yo-Sung and Yoon, Kuk-Jin},
  title = {Event-Based High Dynamic Range Image and Very High Frame Rate Video Generation Using Conditional Generative Adversarial Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
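The event-camera paper above feeds stacks of events to a conditional GAN. A hedged sketch of the preprocessing is shown below: asynchronous events (x, y, t, polarity) are binned into a fixed number of temporal channels, in the spirit of the paper's time-based stacking; the bin count and signed polarity accumulation are illustrative choices.

# Bin an asynchronous event stream into temporal channels (hedged sketch).
import numpy as np

def stack_events_by_time(x, y, t, p, H, W, n_bins=8):
    # x, y: pixel coordinates, t: timestamps, p: polarity in {-1, +1}
    stack = np.zeros((n_bins, H, W), dtype=np.float32)
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    bins = np.minimum((t_norm * n_bins).astype(int), n_bins - 1)
    np.add.at(stack, (bins, y.astype(int), x.astype(int)), p)   # signed accumulation per bin and pixel
    return stack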
Enhancing TripleGAN for Semi-Supervised Conditional Instance Synthesis and Classification
Si Wu, Guangchang Deng, Jichang Li, Rui Li, Zhiwen Yu, Hau-San Wong


Learning class-conditional data distributions is crucial for Generative Adversarial Networks (GAN) in semi-supervised learning. To improve both instance synthesis and classification in this setting, we propose an enhanced TripleGAN (EnhancedTGAN) model in this work. We follow the adversarial training scheme of the original TripleGAN, but completely re-design the training targets of the generator and classifier. Specifically, we adopt feature-semantics matching to enhance the generator in learning class-conditional distributions from both the aspects of statistics in the latent space and semantics consistency with respect to the generator and classifier. Since a limited amount of labeled data is not sufficient to determine satisfactory decision boundaries, we include two classifiers, and incorporate collaborative learning into our model to provide better guidance for generator training. The synthesized high-fidelity data can in turn be used for improving classifier training. In the experiments, the superior performance of our approach on multiple benchmark datasets demonstrates the effectiveness of the mutual reinforcement between the generator and classifiers in facilitating semi-supervised instance synthesis and classification.
[term, human] [matching, international, accurate, provide, approach, corresponding, error, computer, denote, match, respect] [synthesized, proposed, generator, conference, generative, synthesis, real, figure, conditional, consistency, collaborative, method, image, ieee, latent, face] [network, deep, neural, better, processing, denotes, regularization, improving, effectiveness, number, rate, table, scheme] [model, adversarial, discriminator, probability, random, machine] [instance, improve, predicted, feature, including, enhanced, adopt, semantics] [data, learning, triplegan, training, enhancedtgan, unlabeled, classification, classifier, class, labeled, distribution, learn, semisupervised, loss, unsupervised, function, smoreg, adam, conreg, test, label, divergence, log, svhn, space]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Si and Deng, Guangchang and Li, Jichang and Li, Rui and Yu, Zhiwen and Wong, Hau-San},
  title = {Enhancing TripleGAN for Semi-Supervised Conditional Instance Synthesis and Classification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Capture, Learning, and Synthesis of 3D Speaking Styles
Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, Michael J. Black


Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation) takes any speech signal as input--even speech in languages other than English--and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de.
[speech, audio, motion, speaking, subject, capture, video, dataset, signal, static, dynamic, sequence, speaker] [shape, computer, template, range, allows, international, linear, pose, expressive, mesh, vertex, supplementary, analysis, pattern] [facial, face, voca, animation, animate, identity, figure, deepspeech, expression, conference, realistic, flame, database, input, method, mouth, style, reference, karras, vocaset, lip, talking, captured, synthesize, user] [deep, output, neural, layer, performance, network, wide, convolutional, connected, standard, relu] [model, character, visual, spoken, turkers, automatic, conditioned] [head, fully, three, driven] [training, data, learning, trained, generic, test, train, generalization, learn, large]
@InProceedings{Cudeiro_2019_CVPR,
  author = {Cudeiro, Daniel and Bolkart, Timo and Laidlaw, Cassidy and Ranjan, Anurag and Black, Michael J.},
  title = {Capture, Learning, and Synthesis of 3D Speaking Styles},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Nesti-Net: Normal Estimation for Unstructured 3D Point Clouds Using Convolutional Neural Networks
Yizhak Ben-Shabat, Michael Lindenbaum, Anath Fischer


In this paper, we propose a normal estimation method for unstructured 3D point clouds. This method, called Nesti-Net, builds on a new local point cloud representation which consists of multi-scale point statistics (MuPS), estimated on a local coarse Gaussian grid. This representation is a suitable input to a CNN architecture. The normals are estimated using a mixture-of-experts (MoE) architecture, which relies on a data-driven approach for selecting the optimal scale around each point and encourages sub-network specialization. Interesting insights into the network's resource distribution are provided. The scale prediction significantly improves robustness to different noise levels, point density variations and different levels of detail. We achieve state-of-the-art results on a benchmark synthetic dataset and present qualitative results on real scanned scenes.
[consists, prediction, dataset] [point, normal, estimation, local, computer, cloud, mups, surface, unstructured, associated, error, estimate, depth, scanned, geometric, estimated, approach, optimal, additional, manager, note, vision, reconstruction, estimating, neighborhood, pcpnet, geometry, chosen] [noise, expert, figure, method, proposed, input, maxpool, conference, component, pca, color, image, ieee, raw, based, detail, sharp] [scale, architecture, deep, density, gaussian, number, performance, network, small, convolutional, neural, fine, selection, best, nestinet] [vector, robustness, selecting] [cnn, propose, coarse, predicted, three, grid, assigned, average, improves] [representation, set, data, learning, large, training, mixture, trained, classification, medium, noisy]
@InProceedings{Ben-Shabat_2019_CVPR,
  author = {Ben-Shabat, Yizhak and Lindenbaum, Michael and Fischer, Anath},
  title = {Nesti-Net: Normal Estimation for Unstructured 3D Point Clouds Using Convolutional Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Ray-Space Projection Model for Light Field Camera
Qi Zhang, Jinbo Ling, Qing Wang, Jingyi Yu


Light field essentially represents the collection of rays in space. The rays captured by multiple light field cameras form subsets of the full rays in 3D space and can be transformed to each other. However, most previous approaches model the projection from an arbitrary point in 3D space to the corresponding pixel on the sensor. Few models describe the ray sampling and transformation among multiple light field cameras. In this paper, we propose a novel ray-space projection model to transform sets of rays captured by multiple light field cameras in terms of Plücker coordinates. We first derive a 6x6 ray-space intrinsic matrix based on the multi-projection-center (MPC) model. A homogeneous ray-space projection matrix and a fundamental matrix are then proposed to establish ray-ray correspondences among multiple light fields. Finally, based on the ray-space projection matrix, a novel camera calibration method is proposed to verify the proposed model. A linear constraint and a ray-ray cost function are established for the linear initial solution and non-linear optimization, respectively. Experimental results on both synthetic and real light field data have verified the effectiveness and robustness of the proposed model.
[multiple, motion] [light, field, camera, ray, projection, matrix, intrinsic, calibration, ucker, relative, distortion, linear, dpw, pose, point, mpc, kij, solution, view, bjw, qing, estimation, geometric, geometry, extrinsic, checkerboard, plane, corresponding, homogeneous, initial, rotation, error, constraint, optimization, rsim, estimate, projective, scene, jingyi, fundamental, depth, epipolar] [proposed, ieee, method, captured, transformation, image, noise, pixel, based, raw, verify, real, figure] [number, performance, order, cost, effectiveness, effective, represents, denotes, compared] [model, vector, describe, physical, relationship] [propose] [data, space, sampling, generalized, function, novel, datasets]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Qi and Ling, Jinbo and Wang, Qing and Yu, Jingyi},
  title = {Ray-Space Projection Model for Light Field Camera},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
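Since the ray-space model above works in Plücker coordinates, a small helper sketch may clarify the representation: a ray through two points is the 6-vector (direction, moment), and two rays are coplanar (i.e. a ray-ray correspondence is possible) exactly when their reciprocal product vanishes. The paper's 6x6 ray-space intrinsic matrix itself is specific to the method and not reproduced here.

# Plücker coordinates of a ray and the reciprocal product (coplanarity test).
import numpy as np

def plucker(p, q):
    d = q - p                      # direction
    m = np.cross(p, q)             # moment
    return np.concatenate([d, m])

def reciprocal_product(L1, L2):
    d1, m1 = L1[:3], L1[3:]
    d2, m2 = L2[:3], L2[3:]
    return np.dot(d1, m2) + np.dot(d2, m1)   # zero iff the two rays are coplanar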
Deep Geometric Prior for Surface Reconstruction
Francis Williams, Teseo Schneider, Claudio Silva, Denis Zorin, Joan Bruna, Daniele Panozzo


The reconstruction of a discrete surface from a point cloud is a fundamental geometry processing problem that has been studied for decades, with many methods developed. We propose the use of a deep neural network as a geometric prior for surface reconstruction. Specifically, we overfit a neural network representing a local chart parameterization to part of an input point cloud using the Wasserstein distance as a measure of approximation. By jointly fitting many such networks to overlapping parts of the point cloud, while enforcing a consistency condition, we compute a manifold atlas. By sampling this atlas, we can produce a dense reconstruction of the surface approximating the input cloud. The entire procedure does not require any training data or explicit regularization, yet, we show that it is able to perform remarkably well: not introducing typical overfitting artifacts, and approximating sharp features closely at the same time. We experimentally show that this geometric prior produces good results for both man-made objects containing sharp features and smoother organic objects, as well as noisy inputs. We compare our method with a number of well-known reconstruction methods on a standard surface reconstruction benchmark.
[work, transition] [surface, point, reconstruction, local, chart, cloud, fitting, atlasnet, fit, shape, ear, geometric, define, parametric, implicit, error, compute, consistent, scattered, geometry, optimization, approach, parametrization, fourier, sinkhorn, dinprec, drecgt, approximating, explicit, remarkably, volumetric, atlas, robust, range, completion, mpu, single, corresponding, nsf] [figure, input, method, poisson, consistency, sharp, patch, image, fitted, acm, prior, produce, frequency, noise, result, described, separate, wavelet] [neural, deep, network, gradient, number, architecture, descent, compare, relu] [wasserstein, arxiv, preprint, model, consider, ball] [overlapping, global, benchmark] [set, function, learning, loss, distance, noisy, data, partition, min, training, observe]
@InProceedings{Williams_2019_CVPR,
  author = {Williams, Francis and Schneider, Teseo and Silva, Claudio and Zorin, Denis and Bruna, Joan and Panozzo, Daniele},
  title = {Deep Geometric Prior for Surface Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
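The deep geometric prior above overfits one small network per local chart of the point cloud. The sketch below fits a single chart: an MLP maps 2D parameters to 3D and is optimized against one point-cloud patch. A symmetric Chamfer distance stands in for the paper's Sinkhorn-based Wasserstein approximation, so this is an illustration of the idea, not the exact method.

# Fit one local chart to a point-cloud patch (hedged sketch; patch is a float tensor).
import torch
import torch.nn as nn

def fit_chart(patch, n_samples=512, steps=500):
    # patch: (N, 3) points of one overlapping neighborhood of the cloud
    mlp = nn.Sequential(nn.Linear(2, 128), nn.ReLU(),
                        nn.Linear(128, 128), nn.ReLU(),
                        nn.Linear(128, 3))
    opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
    for _ in range(steps):
        uv = torch.rand(n_samples, 2)                 # sample the chart domain
        pred = mlp(uv)
        d = torch.cdist(pred, patch)                  # (n_samples, N) pairwise distances
        loss = d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mlp                                        # sample it to densify the patch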
Analysis of Feature Visibility in Non-Line-Of-Sight Measurements
Xiaochun Liu, Sebastian Bauer, Andreas Velten


We formulate an equation describing a general Non-line-of-sight (NLOS) imaging measurement and analyze the properties of the measurement in the Fourier domain regarding the spatial frequencies of the scene it encodes. We conclude that for a relay wall with finite size, certain scene configurations and features are not detectable in an NLOS measurement. We then provide experimental examples of invisible scene features and their reconstructions, as well as a set of example scenes that lead to an ill-posed NLOS imaging problem.
[time, elliptical, work, hidden, version] [measurement, nlos, fourier, scene, reconstruction, aperture, limited, local, cone, inverse, wall, radon, integral, planar, angle, pattern, relay, confocal, projection, illumination, provide, visible, surface, problem, computed, rotation, point, equation, light, position, origin, invisible, well, backprojection, linear, analysis, normal, completely, ellipsoid, slice, column, computer, vision, visibility] [imaging, figure, transform, patch, missing, spectrum, frequency, high, ieee, based, conference, intensity, mtf] [represents, analyze, approximate, higher, lead] [model, simple, represent, simply, consider, example, vector] [spatial, detection, three, feature] [function, domain, sampling, unknown, space, set, corresponds, target]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Xiaochun and Bauer, Sebastian and Velten, Andreas},
  title = {Analysis of Feature Visibility in Non-Line-Of-Sight Measurements},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Hyperspectral Imaging With Random Printed Mask
Yuanyuan Zhao, Hui Guo, Zhan Ma, Xun Cao, Tao Yue, Xuemei Hu


Hyperspectral images can provide rich clues for various computer vision tasks. However, the requirement of professional and expensive hardware for capturing hyperspectral images impedes their wide application. In this paper, based on the simple but not widely noticed phenomenon that a color printer can print color masks with a large number of independent spectral transmission responses, we propose a simple and low-budget scheme to capture hyperspectral images with a random mask printed by a consumer-level color printer. Specifically, we notice that printed dots with different colors are stacked together, forming multiplicative, instead of additive, spectral transmission responses. Therefore, new spectral transmission responses uncorrelated with those of the original printer dyes are generated. With the random printed color mask, hyperspectral images can be captured in a snapshot manner. A convolutional neural network (CNN) based method is developed to reconstruct the hyperspectral images from the captured image. The effectiveness and accuracy of the proposed system are verified on both synthetic and real captured images.
[capture, marked, term, work] [rgb, reconstruction, point, camera, sensor, single, light, matrix, column, analysis, calibration, estimated, pattern] [spectral, hyperspectral, transmission, imaging, color, printed, method, image, based, ink, print, captured, printing, coded, uncorrelated, spectrum, high, noise, proposed, figure, synthetic, recover, recovered, snapshot, real, quality, reconstructed, demonstrate, resbottle, printer, input, monochromatic, multispectral] [number, network, conv, coding, filter, density, effectiveness, denotes, upsampling, compact, higher, layer, compare, output, correlation, deep, downsampling, scheme] [model, random, system, simple, encoded, choose, physical, develop] [mask, response, propose, three, spatial] [rank, randomly, setting, data, prototype, set, large, training]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Yuanyuan and Guo, Hui and Ma, Zhan and Cao, Xun and Yue, Tao and Hu, Xuemei},
  title = {Hyperspectral Imaging With Random Printed Mask},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
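A tiny numerical illustration of the key observation in the printed-mask paper above: stacked dyes combine multiplicatively in their spectral transmission, producing new spectral responses, and a monochrome pixel then measures the wavelength integral of scene radiance times mask transmission. The spectra below are toy data, not calibrated ink measurements.

# Multiplicative transmission of stacked dyes and a snapshot pixel measurement (toy data).
import numpy as np

wavelengths = np.linspace(400, 700, 31)                # nm, 10 nm steps
t_cyan   = 0.2 + 0.8 * (wavelengths > 550)             # toy dye transmissions
t_yellow = 0.2 + 0.8 * (wavelengths < 600)
t_stacked = t_cyan * t_yellow                           # multiplicative, not additive

def pixel_measurement(scene_spectrum, transmission, response=None):
    response = np.ones_like(scene_spectrum) if response is None else response
    return np.sum(scene_spectrum * transmission * response)   # discrete integral over lambda

measurement = pixel_measurement(np.random.rand(31), t_stacked)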
All-Weather Deep Outdoor Lighting Estimation
Jinsong Zhang, Kalyan Sunkavalli, Yannick Hold-Geoffroy, Sunil Hadap, Jonathan Eisenman, Jean-Francois Lalonde


We present a neural network that predicts HDR outdoor illumination from a single LDR image. At the heart of our work is a method to accurately learn HDR lighting from LDR panoramas under any weather condition. We achieve this by training another CNN (on a combination of synthetic and real images) to take as input an LDR panorama and regress the parameters of the Lalonde-Mathews outdoor illumination model. This model is trained such that it a) reconstructs the appearance of the sky, and b) renders the appearance of objects lit by this illumination. We use this network to label a large-scale dataset of LDR panoramas with lighting parameters and use them to train our single-image outdoor lighting estimation network. We demonstrate, via extensive experiments, that both our panorama and single-image networks outperform the state of the art, and unlike prior work, are able to handle weather conditions ranging from fully sunny to overcast skies.
[dataset, work, predict] [sky, lighting, outdoor, ldr, single, panorama, illumination, panonet, cropnet, estimate, approach, render, ground, scene, position, computer, truth, softness, estimation, weather, estimated, estimating, error, international, sunny, overcast, wsky, ranging, fit, parametric, reflectance, rmse, pattern, regress, range, rendering, limited, view, wsun, tfsky, tfsun, vision] [hdr, image, method, conference, figure, clear, proposed, comparison, ieee, synthetic, real, input, appearance, qualitative, sharp, extracted] [network, deep, better, shadow, employ, architecture, compare, wide, neural] [model, lsun, environment] [cnn, map] [sun, loss, train, training, learning, learn, set, domain, label]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Jinsong and Sunkavalli, Kalyan and Hold-Geoffroy, Yannick and Hadap, Sunil and Eisenman, Jonathan and Lalonde, Jean-Francois},
  title = {All-Weather Deep Outdoor Lighting Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Variational EM Framework With Adaptive Edge Selection for Blind Motion Deblurring
Liuge Yang, Hui Ji


Blind motion deblurring is an important problem that has received enduring attention over the last decade. Based on the observation that a good intermediate estimate of the latent image for estimating the motion-blur kernel is not necessarily the one closest to the latent image, edge selection has proven itself a very powerful technique for achieving state-of-the-art performance in blind deblurring. This paper presents an interpretation of edge selection/reweighting in terms of variational Bayes inference, and accordingly develops a novel variational expectation maximization (VEM) algorithm with built-in adaptive edge selection for blind deblurring. Together with a restart strategy for avoiding undesired local convergence, the proposed VEM method not only has a solid mathematical foundation but also noticeably outperforms state-of-the-art methods on benchmark datasets.
[motion, framework, dataset] [estimation, estimate, variable, algorithm, problem, optimization, estimating, local, estimated, estimator, camera, approach, single, solution, constant, define, good, denote, truth, robust, technique, feasible, suboptimal] [image, blind, deblurring, method, proposed, vem, latent, based, blur, prior, blurring, intermediate, clear, input, restarting, deconvolution, blurred, comparison, cho, levin, psnr, ieee, presented, sharp] [kernel, denotes, gradient, selection, restart, better, regularization, deep, illustration, adaptive, sparse, number, performance, bayesian, gaussian, covariance, output, table, small] [variational, natural, mathematical, model, iterative, procedure] [edge, map, salient, inner, average] [set, log, large, strategy, uniform, update, paper, trivial, maximum, distribution, existing, learning]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Liuge and Ji, Hui},
  title = {A Variational EM Framework With Adaptive Edge Selection for Blind Motion Deblurring},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Viewport Proposal CNN for 360deg Video Quality Assessment
Chen Li, Mai Xu, Lai Jiang, Shanyi Zhang, Xiaoming Tao


Recent years have witnessed the growing interest in visual quality assessment (VQA) for 360deg video. Unfortunately, the existing VQA approaches do not consider the facts that: 1) Observers only see viewports of 360deg video, rather than patches or whole 360deg frames. 2) Within the viewport, only salient regions can be perceived by observers with high resolution. Thus, this paper proposes a viewport-based convolutional neural network (V-CNN) approach for VQA on 360deg video, considering both auxiliary tasks of viewport proposal and viewport saliency prediction. Our V-CNN approach is composed of two stages, i.e., viewport proposal and VQA. In the first stage, the viewport proposal network (VP-net) is developed to yield several potential viewports, seen as the first auxiliary task. In the second stage, a viewport quality network (VQ-net) is designed to rate the VQA score for each proposed viewport, in which the saliency map of the viewport is predicted and then utilized in VQA score rating. Consequently, another auxiliary task of viewport saliency prediction can be achieved. More importantly, the main task of VQA on 360deg video can be accomplished via integrating the VQA scores of all viewports. The experiments validate the effectiveness of our V-CNN approach in significantly advancing the state-of-the-art performance of VQA on 360deg video. In addition, our approach achieves comparable performance in two auxiliary tasks. The code of our V-CNN approach is available at https://github.com/Archer-Tatsu/V-CNN.
[video, prediction, frame, predicting, assessment, sequence, watching, framework, second, predict, human, performs] [approach, corresponding, spherical, international, error, ground, note, truth, computer, range, pattern, omnidirectional, rmse] [proposed, quality, ieee, conference, figure, input, image, developed, based, subjective, pixel, psnr, content] [performance, deep, table, coefficient, weight, correlation, convolutional, processing, network, neural, rate] [vqa, impaired, visual, model, softer, potential, evaluation, multimedia] [viewport, saliency, score, proposal, viewports, map, dmos, predicted, stage, anchor, location, cnn, ablation, propose, mai, threshold] [auxiliary, task, objective, training, learning, loss, set, test, main, alignment]
@InProceedings{Li_2019_CVPR,
  author = {Li, Chen and Xu, Mai and Jiang, Lai and Zhang, Shanyi and Tao, Xiaoming},
  title = {Viewport Proposal CNN for 360deg Video Quality Assessment},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Beyond Gradient Descent for Regularized Segmentation Losses
Dmitrii Marin, Meng Tang, Ismail Ben Ayed, Yuri Boykov


The simplicity of gradient descent (GD) made it the default method for training ever-deeper and complex neural networks. Both loss functions and architectures are often explicitly tuned to be amenable to this basic local optimization. In the context of weakly-supervised CNN segmentation, we demonstrate a well-motivated loss function where an alternative optimizer (ADM) achieves the state-of-the-art while GD performs poorly. Interestingly, GD obtains its best result for a "smoother" tuning of the loss function. The results are consistent across different network architectures. Our loss is motivated by well-understood MRF/CRF regularization models in "shallow" segmentation and their known global solvers. Our work suggests that network design/training should pay more attention to optimization methods.
[work, graph, report, term] [dense, optimization, approach, computer, pattern, vision, discrete, local, international, compute, good, journal, robust, general, analysis, bandwidth, direction, optimizing] [method, ieee, image, conference, based, splitting, quality, figure, latent] [crf, adm, regularized, network, gradient, regularization, descent, better, potts, inference, efficient, accuracy, neural, standard, deep, shallow, compare, yuri, full, kernel, energy, achieves, binary, powerful, gaussian, lower] [model, machine, common, partial] [grid, segmentation, cnn, weakly, supervision, boundary, miou, semantic, context, improves, global, weak, pascal] [loss, training, supervised, minimization, set, pairwise, train, minimize, minimizing]
@InProceedings{Marin_2019_CVPR,
  author = {Marin, Dmitrii and Tang, Meng and Ben Ayed, Ismail and Boykov, Yuri},
  title = {Beyond Gradient Descent for Regularized Segmentation Losses},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MAGSAC: Marginalizing Sample Consensus
Daniel Barath, Jiri Matas, Jana Noskova


A method called sigma-consensus is proposed to eliminate the need for a user-defined inlier-outlier threshold in RANSAC. Instead of estimating the noise sigma, it is marginalized over a range of noise scales. The optimized model is obtained by weighted least-squares fitting, where the weights come from the marginalization over sigma of the point likelihoods of being inliers. A new quality function is proposed that does not require sigma and, thus, a set of inliers to determine the model quality. Also, a new termination criterion for RANSAC is built on the proposed marginalization approach. Applying sigma-consensus, MAGSAC is proposed, which requires no user-defined sigma and significantly improves the accuracy of robust estimation. It is superior to the state of the art in terms of geometric accuracy on publicly available real-world datasets for epipolar geometry (F and E) and homography estimation. In addition, applying sigma-consensus only once as a post-processing step to the RANSAC output always improved the model quality on a wide range of vision problems without noticeable deterioration in processing time, adding only a few milliseconds.
[dataset, manually, time, competitor] [magsac, ransac, inlier, inliers, robust, point, estimation, msc, computer, rsc, fitting, geometric, outlier, error, range, confidence, homography, vision, algorithm, minimal, fundamental, pattern, eavg, international, epipolar, estimated, note, accurate, marginalizing, deterioration, stereo, homogr, ground, truth, exp, analysis, geometry, plane] [image, proposed, quality, noise, method, input, fails, conference, noticeable] [max, number, processing, applied, applying, weighted, wide, accuracy, standard, iteration, ratio, criterion, size] [model, step, random, machine, required, calculated, termination, improved] [threshold, three, baseline] [set, function, sample, essential, marginalized, likelihood, data, selected]
@InProceedings{Barath_2019_CVPR,
  author = {Barath, Daniel and Matas, Jiri and Noskova, Jana},
  title = {MAGSAC: Marginalizing Sample Consensus},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
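A simplified illustration of the sigma-marginalization idea in MAGSAC above: rather than thresholding residuals at one user-chosen sigma, each point's inlier likelihood is averaged over a range of sigmas and used as a least-squares weight for refitting the model. MAGSAC's actual weights come from a chi-squared residual model and its quality function is more involved; the Gaussian version and the toy line model below only convey the principle.

# Sigma-marginalized point weights and a weighted refit (simplified, illustrative).
import numpy as np

def marginalized_weights(residuals, sigma_max=2.0, n_sigmas=50):
    sigmas = np.linspace(sigma_max / n_sigmas, sigma_max, n_sigmas)
    w = np.zeros_like(residuals)
    for s in sigmas:
        w += np.exp(-0.5 * (residuals / s) ** 2)        # inlier likelihood at scale s
    return w / n_sigmas

def weighted_line_fit(points, weights):
    # weighted least-squares refit of y = a*x + b as a stand-in for the model
    x, y = points[:, 0], points[:, 1]
    sw = np.sqrt(weights)
    A = np.stack([x, np.ones_like(x)], axis=1) * sw[:, None]
    return np.linalg.lstsq(A, y * sw, rcond=None)[0]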
Understanding and Visualizing Deep Visual Saliency Models
Sen He, Hamed R. Tavakoli, Ali Borji, Yang Mi, Nicolas Pugeault


Recently, data-driven deep saliency models have achieved high performance and have outperformed classical saliency models, as demonstrated by results on datasets such as MIT300 and SALICON. Yet, there remains a large gap between the performance of these models and the inter-human baseline. Some outstanding questions include what these models have learned, how and where they fail, and how they can be improved. This article attempts to answer these questions by analyzing the representations learned by individual neurons located at the intermediate layers of deep saliency models. To this end, we follow the steps of existing deep saliency models, that is, borrowing a pre-trained model of object recognition to encode the visual features and learning a decoder to infer the saliency. We consider two cases: when the encoder is used as a fixed feature extractor and when it is fine-tuned, and compare the inner representations of the network. To study how the learned representations depend on the task, we fine-tune the same network using the same image set but for two different tasks: saliency prediction versus scene classification. Our analyses reveal that: 1) some visual regions (e.g. head, text, symbol, vehicle) are already encoded within various layers of the network pre-trained for object recognition, 2) using modern datasets, we find that fine-tuning pre-trained models for saliency prediction makes them favor some categories (e.g. head) over others (e.g. text), 3) although deep models of saliency outperform classical models on natural images, the converse is true for synthetic stimuli (e.g. pop-out search arrays), evidence of a significant difference between human and data-driven saliency models, and 4) we confirm that, after fine-tuning, the change in inner representations is mostly due to the task and not to the domain shift in the data.
[prediction, recognition, human, dataset, fixation] [scene, computer, classical, pattern, analysis, vision, compute, ground, truth] [image, synthetic, ieee, input, figure, conference, difference, study, based, proposed, high, change, gaze] [deep, activation, network, layer, table, performance, convolutional, top, neural, output, number, vgg, compare, search, ith, original, convolution, finetuning] [model, visual, animal, attention, understanding, text, example, observed] [saliency, salient, inner, map, category, person, head, object, cnn, annotated, region, score, ali, threshold, feature, salicon, mask, vehicle, overlapped] [representation, learned, task, data, large, learn]
@InProceedings{He_2019_CVPR,
  author = {He, Sen and Tavakoli, Hamed R. and Borji, Ali and Mi, Yang and Pugeault, Nicolas},
  title = {Understanding and Visualizing Deep Visual Saliency Models},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Divergence Prior and Vessel-Tree Reconstruction
Zhongwen Zhang, Dmitrii Marin, Egor Chesakov, Marc Moreno Maza, Maria Drangova, Yuri Boykov


We propose a new geometric regularization principle for reconstructing vector fields based on prior knowledge about their divergence. As one important example of this general idea, we focus on vector fields modelling blood flow pattern that should be divergent in arteries and convergent in veins. We show that this previously ignored regularization constraint can significantly improve the quality of vessel tree reconstruction particularly around bifurcations where non-zero divergence is concentrated. Our divergence prior is critical for resolving (binary) sign ambiguity in flow orientations produced by standard vessel filters, e.g. Frangi. Our vessel tree centerline reconstruction combines divergence constraints with robust curvature regularization. Our unsupervised method can reconstruct complete vessel trees with near-capillary details on synthetic and real 3D volumes.
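As a rough illustration (not the authors' implementation; the function names and the hard artery/vein mask are hypothetical, and the actual method also resolves the sign ambiguity of Frangi orientations and adds curvature regularization), the divergence of a 3D vector field and a sign-based prior penalty could be sketched in Python as:

    import numpy as np

    def divergence_3d(v, spacing=1.0):
        # v: vector field of shape (3, D, H, W); returns div(v) of shape (D, H, W).
        dvx = np.gradient(v[0], spacing, axis=0)
        dvy = np.gradient(v[1], spacing, axis=1)
        dvz = np.gradient(v[2], spacing, axis=2)
        return dvx + dvy + dvz

    def divergence_prior_penalty(v, artery_mask):
        # Illustrative prior: encourage positive divergence inside arteries
        # (divergent flow) and negative divergence elsewhere (convergent veins).
        div = divergence_3d(v)
        penalty = np.where(artery_mask, np.maximum(0.0, -div), np.maximum(0.0, div))
        return penalty.mean()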
[flow, work, directed, focus] [vessel, curvature, centerline, tangent, reconstruction, pattern, estimation, point, field, estimate, constraint, ground, truth, optimization, compute, frangi, local, fpq, divergent, convergent, surface, consistent, angle, volume, computer, analysis, defined, total, volumetric, directly, vesselness, assuming, unoriented, bifurcation, orientation, facet, estimating, voxel, geometric, ambiguity, disambiguating, estimated] [figure, prior, method, real, synthetic, image, based, thin, produced] [energy, regularization, binary, unit, size, standard, filter, number, apply] [tree, vector, model, example, sign, evaluation] [oriented, propose, segmentation, detection, global, blood, edge, threshold, roc, mask] [divergence, data, large, knowledge, minimum]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Zhongwen and Marin, Dmitrii and Chesakov, Egor and Moreno Maza, Marc and Drangova, Maria and Boykov, Yuri},
  title = {Divergence Prior and Vessel-Tree Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Domain-Specific Deblurring via Disentangled Representations
Boyu Lu, Jun-Cheng Chen, Rama Chellappa


Image deblurring aims to restore the latent sharp images from the corresponding blurred ones. In this paper, we present an unsupervised method for domain-specific, single-image deblurring based on disentangled representations. The disentanglement is achieved by splitting the content and blur features in a blurred image using content encoders and blur encoders. We enforce a KL divergence loss to regularize the distribution range of extracted blur attributes such that little content information is contained. Meanwhile, to handle the unpaired training data, a blurring branch and the cycle-consistency loss are added to guarantee that the content structures of the deblurred results match the original images. We also add an adversarial loss on deblurred results to generate visually realistic images and a perceptual loss to further mitigate the artifacts. We perform extensive experiments on the tasks of face and text deblurring using both synthetic datasets and real images, and achieve improved results compared to recent state-of-the-art deblurring methods.
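A minimal sketch of the loss structure described above, assuming the blur encoder outputs a mean and log-variance; the weights and the cycle/adversarial/perceptual terms are placeholders rather than the paper's exact formulation:

    import torch

    def kl_to_standard_normal(mu, logvar):
        # KL( N(mu, sigma^2) || N(0, I) ), summed over code dims, averaged over the batch.
        return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()

    def deblur_loss(kl, cycle, adv, perc, w_kl=0.01, w_cycle=10.0, w_adv=1.0, w_perc=0.1):
        # Illustrative weighted combination of the loss terms named in the abstract;
        # the actual weights and exact terms are defined in the paper.
        return w_kl * kl + w_cycle * cycle + w_adv * adv + w_perc * perc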
[recognition, dataset, perform, motion] [vision, computer, pattern, well, good, approach, international, single] [image, face, blur, blurred, deblurring, content, conference, method, sharp, perceptual, ieee, deblurred, proposed, cyclegan, blurring, real, pan, celeba, disentangled, blind, bmvc, rama, figure, disentangle, nah, recover, unpaired, prior, generator, identity, psnr, ssim, study, quantitative] [performance, compared, add, deep, table, original, verification, achieve, neural, network, best, kernel, layer] [text, visual, generate, encoder, adversarial, encoders, find, natural, encode, model, generated, ocr, adding] [branch, semantic, feature, european] [loss, unsupervised, training, divergence, set, learning, conventional, generic, distance, distribution, test]
@InProceedings{Lu_2019_CVPR,
  author = {Lu, Boyu and Chen, Jun-Cheng and Chellappa, Rama},
  title = {Unsupervised Domain-Specific Deblurring via Disentangled Representations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Douglas-Rachford Networks: Learning Both the Image Prior and Data Fidelity Terms for Blind Image Deconvolution
Raied Aljadaany, Dipan K. Pal, Marios Savvides


Blind deconvolution problems are heavily ill-posed where the specific blurring kernel is not known. Recovering these images typically requires estimates of the kernel. In this paper, we present a method called Dr-Net, which does not require any such estimate and is further able to invert the effects of the blurring in blind image recovery tasks. These image recovery problems typically have two terms, the data fidelity term (for faithful reconstruction) and the image prior (for realistic looking reconstructions). We use the Douglas-Rachford iterations to solve this problem since it is a more generally applicable optimization procedure than methods such as the proximal gradient descent algorithm. Two proximal operators originate from these iterations, one from the data fidelity term and the second from the image prior. It is non-trivial to design a hand-crafted function to represent these proximal operators for the data fidelity and the image prior terms which would work with real-world image distributions. We therefore approximate both these proximal operators using deep networks. This provides a sound motivation for the final architecture for Dr-Net which we find outperforms the state-of-the-art on two mainstream blind deconvolution benchmarks. We also find that Dr-Net is one of the fastest algorithms according to wall-clock times while doing so.
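For reference, the classical Douglas-Rachford splitting with its two proximal operators replaced by learned networks can be sketched as follows; the ProxNet architecture and iteration count are placeholders, not the Dr-Net design:

    import torch
    import torch.nn as nn

    class ProxNet(nn.Module):
        # Placeholder proximal operator: a small residual CNN (not the paper's architecture).
        def __init__(self, channels=3):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, channels, 3, padding=1))
        def forward(self, x):
            return x + self.body(x)

    def douglas_rachford(y, prox_fidelity, prox_prior, iters=10):
        # Classical DR splitting for min_x f(x) + g(x), with learned proximal maps.
        z = y.clone()
        for _ in range(iters):
            x = prox_fidelity(z)          # proximal step for the data-fidelity term
            v = prox_prior(2 * x - z)     # reflected proximal step for the image prior
            z = z + v - x                 # DR update of the auxiliary variable
        return x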
[term, motion, dataset, signal, previous, modelling, framework] [optimization, computer, problem, approach, vision, pattern, solve, estimate, assume, condition, inverse, good, algorithm, single, solution, point] [image, prior, proximal, fidelity, proposed, blind, ieee, blurring, conference, recovered, method, firmly, deconvolution, psnr, blur, blurry, gopro, recovery, figure, kupyn, corrupted, deblurring, based, splitting, recover] [network, deep, kernel, neural, operator, performance, processing, architecture, convolutional, applied, convolution, number, sparse, skip, layer, gradient, obtains, size, compared] [find, model, gan, iterative, generated] [ablation] [data, learning, loss, function, test, distribution, set, training, space, learn]
@InProceedings{Aljadaany_2019_CVPR,
  author = {Aljadaany, Raied and Pal, Dipan K. and Savvides, Marios},
  title = {Douglas-Rachford Networks: Learning Both the Image Prior and Data Fidelity Terms for Blind Image Deconvolution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Speed Invariant Time Surface for Learning to Detect Corner Points With Event-Based Cameras
Jacques Manderscheid, Amos Sironi, Nicolas Bourdis, Davide Migliore, Vincent Lepetit


We propose a learning approach to corner detection for event-based cameras that is stable even under fast and abrupt motions. Event-based cameras offer high temporal resolution, power efficiency, and high dynamic range. However, the properties of event-based data are very different compared to standard intensity images, and simple extensions of corner detection methods designed for these images do not perform well on event-based data. We first introduce an efficient way to compute a time surface that is invariant to the speed of the objects. We then show that we can train a Random Forest to recognize events generated by a moving corner from our time surface. Random Forests are also extremely efficient, and therefore a good choice to deal with the high capture frequency of event-based cameras ---our implementation processes up to 1.6Mev/s on a single CPU. Thanks to our time surface formulation and this learning approach, our method is significantly more robust to abrupt changes of direction of the corners compared to previous ones. Our method also naturally assigns a confidence score for the corners, which can be useful for postprocessing. Moreover, we introduce a high-resolution dataset suitable for quantitative evaluation and comparison of corner detection methods for event-based cameras. We call our approach SILC, for Speed Invariant Learned Corners, and compare it to the state-of-the-art with extensive experiments, showing better performance.
[time, event, dataset, tracking, asynchronous, dynamic, hvga, temporal, evfast, atis, moving, evharris, eventbased, previous, graylevel, arc] [corner, surface, vision, local, approach, stable, error, contrast, pattern, reprojection, computer, algorithm, international, formulation, camera, sensor, well, compute, robust, harris, case, point, single] [method, ieee, high, input, pixel, intensity, forest, conference, resolution, figure, comparison] [speed, standard, neural, fast, efficient, applied] [random, visual, simple, decision, introduce, generated, evaluation, machine, tree] [feature, detection, detect, detector, detected, location, object] [invariant, learning, data, training, classifier, trained, classification, set, train, large, nearest, neighbor, representation]
@InProceedings{Manderscheid_2019_CVPR,
  author = {Manderscheid, Jacques and Sironi, Amos and Bourdis, Nicolas and Migliore, Davide and Lepetit, Vincent},
  title = {Speed Invariant Time Surface for Learning to Detect Corner Points With Event-Based Cameras},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Training Deep Learning Based Image Denoisers From Undersampled Measurements Without Ground Truth and Without Image Prior
Magauiya Zhussip, Shakarim Soltanayev, Se Young Chun


Compressive sensing is a method to recover the original image from undersampled measurements. In order to overcome the ill-posedness of this inverse problem, image priors are used such as sparsity, minimal total-variation, or self-similarity of images. Recently, deep learning based compressive image recovery methods have been proposed and have yielded state-of-the-art performances. They used data-driven approaches instead of hand-crafted image priors to regularize ill-posed inverse problems with undersampled data. Ironically, training deep neural networks (DNNs) for them requires "clean" ground truth images, but obtaining the best quality images from undersampled data requires well-trained DNNs. To resolve this dilemma, we propose novel methods based on two well-grounded theories: denoiser-approximate message passing (D-AMP) and Stein's unbiased risk estimator (SURE). Our proposed methods were able to train deep learning based image denoisers from undersampled measurements without ground truth images and without additional image priors, and to recover images with state-of-the-art qualities from undersampled data. We evaluated our methods for various compressive sensing recovery problems with Gaussian random, coded diffraction pattern, and compressive sensing MRI measurement matrices. Our proposed methods yielded state-of-the-art performances for all cases without ground truth images. Our methods also yielded comparable performances to the methods with ground truth data.
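A hedged sketch of a generic Monte-Carlo SURE loss of the kind this line of work builds on (Gaussian noise only; the paper's combination with D-AMP for compressive measurements is not shown):

    import torch

    def mc_sure_loss(denoiser, y, sigma, eps=1e-3):
        # Monte-Carlo SURE: unbiased estimate of the MSE to the unknown clean image,
        # computed from the noisy input y alone (Gaussian noise with std sigma).
        n = y.numel()
        f_y = denoiser(y)
        b = torch.randn_like(y)                       # random probe vector
        f_yp = denoiser(y + eps * b)
        div = (b * (f_yp - f_y)).sum() / eps          # MC estimate of the divergence
        return ((f_y - y) ** 2).sum() / n - sigma ** 2 + (2 * sigma ** 2 / n) * div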
[time, signal, dataset, recognition, utilized] [ground, truth, measurement, algorithm, computer, vision, estimation, reconstruction, pattern, matrix, international, inverse, michael, estimator, deviation, linear, optimization, single] [image, ldamp, method, based, proposed, noise, dncnn, denoisers, recovery, denoiser, ieee, compressive, undersampled, sensing, conference, denoising, psnr, imaging, prior, figure, coded, mri, cdp, diffraction, investigated, hyperspectral, reconstructed, contaminated] [deep, gaussian, network, yielded, neural, performance, standard, sparse, dnn, residual, table, compressed, highly, processing, accuracy] [requires, true] [level] [learning, training, data, sampling, train, trained, test, conventional, set]
@InProceedings{Zhussip_2019_CVPR,
  author = {Zhussip, Magauiya and Soltanayev, Shakarim and Young Chun, Se},
  title = {Training Deep Learning Based Image Denoisers From Undersampled Measurements Without Ground Truth and Without Image Prior},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Variational Pan-Sharpening With Local Gradient Constraints
Xueyang Fu, Zihuang Lin, Yue Huang, Xinghao Ding


Pan-sharpening aims at fusing spectral and spatial information, which are respectively contained in the multispectral (MS) image and panchromatic (PAN) image, to produce a high resolution multi-spectral (HRMS) image. In this paper, a new variational model based on a local gradient constraint for pan-sharpening is proposed. Different from previous methods that only use global constraints to preserve spatial information, we first consider the gradient difference of PAN and HRMS images in different local patches and bands. Then a more accurate spatial preservation based on local gradient constraints is incorporated into the objective to fully utilize spatial information contained in the PAN image. The objective is formulated as a convex optimization problem which minimizes two least-squares terms and is thus very simple and easy to implement. A fast algorithm is also designed to improve efficiency. Experiments show that our method outperforms previous variational algorithms and achieves better generalization than recent deep learning methods.
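As an illustrative simplification (a global rather than patch- and band-wise local gradient constraint, and no fast solver), the two least-squares terms could look like:

    import numpy as np

    def grad2d(img):
        # Forward-difference gradients of a 2-D array.
        gx = np.diff(img, axis=1, append=img[:, -1:])
        gy = np.diff(img, axis=0, append=img[-1:, :])
        return gx, gy

    def pansharpen_objective(hrms, lrms_up, pan, lam=1.0):
        # hrms: (B, H, W) candidate HRMS bands; lrms_up: upsampled low-res MS bands;
        # pan: (H, W) panchromatic image. Returns spectral + spatial least-squares terms.
        spectral = ((hrms - lrms_up) ** 2).mean()
        px, py = grad2d(pan)
        spatial = 0.0
        for b in range(hrms.shape[0]):
            gx, gy = grad2d(hrms[b])
            spatial += ((gx - px) ** 2 + (gy - py) ** 2).mean()
        return spectral + lam * spatial / hrms.shape[0]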
[fusion, previous, modeling, structural] [local, equation, algorithm, distortion, linear, ground, optimization, truth, ideal, corresponding, analysis, constraint, problem, laplacian] [image, spectral, pan, figure, hrms, proposed, based, method, remote, ieee, pnn, pannet, phlp, lrms, sirf, bdsd, awlp, quality, geoscience, sensing, resolution, pracs, multispectral, pansharpening, indusion, comparison, panchromatic, intensity, high, difference, transform, wavelet, qnr, preservation] [gradient, deep, original, scale, table, better, fast, best, size, science, performance] [model, variational, ability, simple, relationship, visual, introduce, find] [spatial, global, satellite, fused, easy] [learning, objective, generalization, data, function, universal, trained, china, training]
@InProceedings{Fu_2019_CVPR,
  author = {Fu, Xueyang and Lin, Zihuang and Huang, Yue and Ding, Xinghao},
  title = {A Variational Pan-Sharpening With Local Gradient Constraints},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
F-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning
Yongqin Xian, Saurabh Sharma, Bernt Schiele, Zeynep Akata


When labeled training data is scarce, a promising data augmentation approach is to generate visual features of unknown classes using their attributes. To learn the class conditional distribution of CNN features, these models rely on pairs of image features and class attributes. Hence, they can not make use of the abundance of unlabeled data samples. In this paper, we tackle any-shot learning problems i.e. zero-shot and few-shot, in a unified feature generating framework that operates in both inductive and transductive learning settings. We develop a conditional generative model that combines the strength of VAE and GANs and in addition, via an unconditional discriminator, learns the marginal feature distribution of unlabeled images. We empirically show that our model learns highly discriminative CNN features for five datasets, i.e. CUB, SUN, AWA and ImageNet, and establish a new state-of-the-art in any-shot learning, i.e. inductive and transductive (generalized) zero- and few-shot learning settings. We also demonstrate that our learned features are interpretable: we visualize them by inverting them back to the pixel space and we explain them by generating textual arguments of why they are associated with a certain label.
[framework, learns] [well, approach, inverting] [image, real, generative, synthetic, generator, figure, conditional, input, attribute, latent] [accuracy, imagenet, deep, number, network, neural, convolutional] [model, generating, generated, gan, visual, discriminator, yellow, adversarial, generate, white, red, gans, generates, generation, sentence] [feature, semantic, cnn, center] [class, learning, data, unseen, flower, transductive, novel, embedding, training, labeled, vae, unlabeled, inductive, embeddings, softmax, zsl, distribution, generalized, learn, label, set, cub, discriminative, setting, sun, gzsl, learned, space, classifier, flo, fsl, stamen, train, loss, gfsl, large, awa, trained, sample]
@InProceedings{Xian_2019_CVPR,
  author = {Xian, Yongqin and Sharma, Saurabh and Schiele, Bernt and Akata, Zeynep},
  title = {F-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Sliced Wasserstein Discrepancy for Unsupervised Domain Adaptation
Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, Daniel Ulbricht


In this work, we connect two distinct concepts for unsupervised domain adaptation: feature distribution alignment between domains by utilizing the task-specific decision boundary and the Wasserstein metric. Our proposed sliced Wasserstein discrepancy (SWD) is designed to capture the natural notion of dissimilarity between the outputs of task-specific classifiers. It provides a geometrically meaningful guidance to detect target samples that are far from the support of the source and enables efficient distribution alignment in an end-to-end trainable fashion. In the experiments, we validate the effectiveness and genericness of our method on digit and sign recognition, image classification, semantic segmentation, and object detection.
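A minimal sketch of a sliced 1-Wasserstein discrepancy between the outputs of the two task-specific classifiers; the number of random projections is an arbitrary choice:

    import torch

    def sliced_wasserstein(p1, p2, n_proj=128):
        # p1, p2: (N, C) outputs of the two task-specific classifiers.
        # Project onto random unit directions, sort, and compare the 1-D distributions.
        c = p1.shape[1]
        theta = torch.randn(c, n_proj, device=p1.device)
        theta = theta / theta.norm(dim=0, keepdim=True)
        proj1, _ = torch.sort(p1 @ theta, dim=0)
        proj2, _ = torch.sort(p2 @ theta, dim=0)
        return (proj1 - proj2).abs().mean()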
[framework, dataset, outperforms, perform] [optimal, radial, direct, matching, geometrically, ground, linear] [method, image, proposed, generator, synthetic, input, figure, meaningful, generative, based, real] [shift, deep, output, table, standard, network, number, neural, experiment, conv, performance, comparable] [wasserstein, sliced, swd, adversarial, probability, sign, model, step, decision, variational, generated] [feature, semantic, object, segmentation, detection] [domain, source, discrepancy, adaptation, target, learning, unsupervised, set, training, distribution, data, distance, mcd, transport, space, measure, train, mnist, synthia, trained, alignment, metric, loss, transfer, support, min, svhn, task, classification, deepjdot, label, ldis, large]
@InProceedings{Lee_2019_CVPR,
  author = {Lee, Chen-Yu and Batra, Tanmay and Haris Baig, Mohammad and Ulbricht, Daniel},
  title = {Sliced Wasserstein Discrepancy for Unsupervised Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Graph Attention Convolution for Point Cloud Semantic Segmentation
Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, Jie Shan


Standard convolution is inherently limited for semantic segmentation of point clouds due to its isotropy about features. It neglects the structure of an object, resulting in poor object delineation and small spurious regions in the segmentation result. This paper proposes a novel graph attention convolution (GAC), whose kernels can be dynamically carved into specific shapes to adapt to the structure of an object. Specifically, by assigning proper attentional weights to different neighboring points, GAC is designed to selectively focus on the most relevant part of them according to their dynamically learned features. The shape of the convolution kernel is then determined by the learned distribution of the attentional weights. Though simple, GAC can capture the structured features of point clouds for fine-grained segmentation and avoid feature contamination between objects. Theoretically, we provide a thorough analysis of the expressive capabilities of GAC to show how it can learn about the features of point clouds. Empirically, we evaluated the proposed GAC on challenging indoor and outdoor datasets and achieved state-of-the-art results in both scenarios.
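An illustrative attention-weighted neighborhood aggregation in the spirit of GAC (the attention MLP, feature dimensions, and neighborhood construction are placeholders, not the paper's exact parameterization):

    import torch
    import torch.nn as nn

    class GraphAttentionConv(nn.Module):
        # Illustrative attention-weighted aggregation over point neighborhoods.
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.lin = nn.Linear(in_dim, out_dim)
            self.att = nn.Sequential(nn.Linear(3 + out_dim, out_dim), nn.ReLU(),
                                     nn.Linear(out_dim, out_dim))

        def forward(self, xyz, feat, neighbor_idx):
            # xyz: (N, 3), feat: (N, in_dim), neighbor_idx: (N, K) long indices of neighbors.
            h = self.lin(feat)                                   # (N, out_dim)
            h_n = h[neighbor_idx]                                # (N, K, out_dim)
            d_xyz = xyz[neighbor_idx] - xyz.unsqueeze(1)         # (N, K, 3) spatial offsets
            d_feat = h_n - h.unsqueeze(1)                        # (N, K, out_dim) feature diffs
            a = torch.softmax(self.att(torch.cat([d_xyz, d_feat], dim=-1)), dim=1)
            return (a * h_n).sum(dim=1)                          # (N, out_dim)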
[graph, dataset, terrain, capture, work, key, structural] [point, cloud, vertex, shape, local, initial, analysis, field, directly, pointnet, indoor, corresponding] [proposed, attentional, figure, spectral, input, method, image] [gac, convolution, deep, gacnet, crf, max, neural, network, standard, table, convolutional, operator, scale, kernel, applied, structure, dynamically, pooling, layer, designed, number, weight, output, proper, constructed, sharing, accuracy, effectiveness] [attention, mechanism, provided, model, random] [feature, spatial, segmentation, semantic, neighboring, object, pyramid, area, cnn, three] [learning, set, learned, training, classification, adapt, learn, testing, function, novel, specific, label, idea, representation]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Lei and Huang, Yuchun and Hou, Yaolin and Zhang, Shenman and Shan, Jie},
  title = {Graph Attention Convolution for Point Cloud Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Normalized Diversification
Shaohui Liu, Xiao Zhang, Jianqiao Wangni, Jianbo Shi


Generating diverse yet specific data is the goal of the generative adversarial network (GAN), but it suffers from the problem of mode collapse. We introduce the concept of normalized diversity which forces the model to preserve the normalized pairwise distance between the sparse samples from a latent parametric distribution and their corresponding high-dimensional outputs. The normalized diversification aims to unfold the manifold of unknown topology and non-uniform distribution, which leads to safe interpolation between valid latent variables. By alternating the maximization over the pairwise distance and updating the total distance (normalizer), we encourage the model to actively explore in the high-dimensional output space. We demonstrate that by combining the normalized diversity loss and the adversarial loss, we generate diverse data without suffering from mode collapse. Experimental results show that our method achieves consistent improvement on unsupervised image generation, conditional image generation and hand pose estimation over strong baselines.
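One possible reading of the normalized diversity loss, as a hinge between normalized pairwise distance matrices in latent and output space; the paper's normalizer update schedule and exact margin may differ:

    import torch

    def normalized_pairwise(x):
        # x: (N, D). Pairwise distances divided by their total, making the matrix scale-free.
        d = torch.cdist(x, x, p=2)
        return d / (d.sum() + 1e-8)

    def ndiv_loss(z, y, alpha=0.8):
        # z: latent samples (N, Dz); y: corresponding flattened outputs (N, Dy).
        # Penalize output pairs whose normalized distance falls below alpha times
        # the normalized latent distance (an illustrative hinge).
        dz = normalized_pairwise(z)
        dy = normalized_pairwise(y)
        hinge = torch.clamp(alpha * dz - dy, min=0.0)
        off_diag = 1.0 - torch.eye(z.shape[0], device=z.device)
        return (hinge * off_diag).sum() / off_diag.sum()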
[multiple, joint] [normalized, hand, pose, problem, valid, fit, visible, variable, corresponding, topology, matrix, vision, compute, estimation, dense] [image, latent, generative, conditional, method, figure, input, mapping, interpolation, extrapolation, proposed, bicyclegan, quality, generator, demonstrate, comparison, synthetic] [output, table, better, sparse, achieves, normalization, gaussian, deep] [model, diversity, generated, adversarial, gan, mode, diversification, generate, bourgan, safe, generation, manifold, generates, parametrized, simple, multimodal, diverse, variational, pdata] [] [pairwise, distance, space, data, training, function, distribution, loss, metric, learning, dij, unsupervised, learn, learned, objective, target, domain, pair, sample, vae, update, setting, illustrate, existing, measure]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Shaohui and Zhang, Xiao and Wangni, Jianqiao and Shi, Jianbo},
  title = {Normalized Diversification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Localize Through Compressed Binary Maps
Xinkai Wei, Ioan Andrei Barsan, Shenlong Wang, Julieta Martinez, Raquel Urtasun


One of the main difficulties of scaling current localization systems to large environments is the on-board storage required for the maps. In this paper we propose to learn to compress the map representation such that it is optimal for the localization task. As a consequence, higher compression rates can be achieved without loss of localization accuracy when compared to standard coding schemes that optimize for reconstruction, thus ignoring the end task. Our experiments show that it is possible to learn a task-specific compression which reduces storage requirements by two orders of magnitude over general-purpose codecs such as WebP without sacrificing performance.
[online, driving, previous, highway, current, report, localize, onboard] [matching, lidar, error, reconstruction, pose, defined, computed, optimal, dense, point, position] [method, intensity, image, pixel, proposed, gps, based, high, input, competing, figure] [compression, network, deep, storage, binary, performance, accuracy, rate, convolutional, compress, full, lossless, table, compressed, neural, compared, standard, bit, magnitude, huffman, architecture, inference, higher, coding, reduces, webp, png, lower, scheme, lightweight, offline] [encoding, decoder, model, probability, diverse, visual] [map, localization, module, vehicle, urban, feature, failure, fully, road, score] [embedding, loss, learning, code, large, learn, representation, training, entropy, train, paper, learned, probabilistic]
@InProceedings{Wei_2019_CVPR,
  author = {Wei, Xinkai and Andrei Barsan, Ioan and Wang, Shenlong and Martinez, Julieta and Urtasun, Raquel},
  title = {Learning to Localize Through Compressed Binary Maps},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Parametric Top-View Representation of Complex Road Scenes
Ziyan Wang, Buyu Liu, Samuel Schulter, Manmohan Chandraker


In this paper, we address the problem of inferring the layout of complex road scenes given a single camera as input. To achieve that, we first propose a novel parameterized model of road layouts in a top-view representation, which is not only intuitive for human visualization but also provides an interpretable interface for higher-level decision making. Moreover, the design of our top-view scene model allows for efficient sampling and thus generation of large-scale simulated data, which we leverage to train a deep neural network to infer our scene model's parameters. Specifically, our proposed training procedure uses supervised domain-adaptation techniques to incorporate both simulated as well as manually annotated data. Finally, we design a Conditional Random Field (CRF) that enforces coherent predictions for a single frame and encourages temporal smoothness among video frames. Experiments on two public data sets show that: (1) Our parametric top-view model is representative enough to describe complex road scenes; (2) The proposed method outperforms baselines trained on manually-annotated or simulated data only, thus getting the best of both; (3) Our CRF is able to generate temporally smoothed while semantically meaningful results.
[complex, temporal, work, prediction, coherent, driving, framework, modeling] [scene, single, defined, ground, rgb, depth, perspective, kitti, truth, nuscenes, corresponding, continuous, approach, manual, define, parametric, consistent, camera, allows, well, rendering] [simulated, real, proposed, image, hybrid, consistency, figure, input, attribute, qualitative, side, method] [neural, network, number, design, binary, deep, output, efficient, crf, width, parameterized] [model, understanding, potential, infer, rich, enables, existence, complete, interpretable, describe, goal, requires] [semantic, road, graphical, propose, leverage, layout, feature, supervision, annotation] [data, domain, representation, training, set, learning, adaptation, sampling, main, function, loss]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Ziyan and Liu, Buyu and Schulter, Samuel and Chandraker, Manmohan},
  title = {A Parametric Top-View Representation of Complex Road Scenes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction
Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, Yueting Zhuang


We propose a self-supervised spatiotemporal learning technique which leverages the chronological order of videos. Our method can learn the spatiotemporal representation of the video by predicting the order of shuffled clips from the video. The category of the video is not required, which gives our technique the potential to take advantage of infinite unannotated videos. There exist related works which use frames, while compared to frames, clips are more consistent with the video dynamics. Clips can help to reduce the uncertainty of orders and are more appropriate to learn a video representation. The 3D convolutional neural networks are utilized to extract features for clips, and these features are processed to predict the actual order. The learned representations are evaluated via nearest neighbor retrieval experiments. We also use the learned networks as the pre-trained models and finetune them on the action recognition task. Three types of 3D convolutional neural networks are tested in experiments, and we gain large improvements compared to existing self-supervised methods.
[video, action, clip, prediction, spatiotemporal, recognition, temporal, extract, framework, tuple, dataset, human, finetuned, selfsupervised] [computer, vision, pattern, international] [conference, ieee, image, figure, method, extracted, input, balance, proposed, based] [cnns, order, convolution, network, neural, convolutional, number, table, accuracy, actual, deep, architecture, pooling, top, conv, compared, imagenet, better, convnets, layer, output, size, kernel] [model, query, beam, sampled, visual] [feature, three, spatial, leverage, object] [learning, trained, task, representation, retrieval, training, shuffled, learned, datasets, classification, learn, nearest, large, set, test, data, sample, neighbor]
@InProceedings{Xu_2019_CVPR,
  author = {Xu, Dejing and Xiao, Jun and Zhao, Zhou and Shao, Jian and Xie, Di and Zhuang, Yueting},
  title = {Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Superquadrics Revisited: Learning 3D Shape Parsing Beyond Cuboids
Despoina Paschalidou, Ali Osman Ulusoy, Andreas Geiger


Abstracting complex 3D shapes with parsimonious part-based representations has been a long-standing goal in computer vision. This paper presents a learning-based solution to this problem which goes beyond the traditional 3D cuboid representation by exploiting superquadrics as atomic elements. We demonstrate that superquadrics lead to more expressive 3D scene parses while being easier to learn than 3D cuboid representations. Moreover, we provide an analytical solution to the Chamfer loss which avoids the need for computationally expensive reinforcement learning or iterative prediction. Our model learns to parse 3D objects into consistent superquadric representations without supervision. Results on various ShapeNet categories as well as the SURREAL human body dataset demonstrate the flexibility of our model in capturing fine details and complex poses that could not have been modelled using cuboids.
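For context, the classical superquadric inside-outside function that such primitive-fitting methods rely on (pose and translation handling omitted; this is not the paper's full loss):

    import numpy as np

    def superquadric_implicit(points, size, eps):
        # points: (N, 3) in the primitive's local frame; size = (a1, a2, a3);
        # eps = (eps1, eps2) shape exponents. F < 1 inside, F = 1 on the surface, F > 1 outside.
        a1, a2, a3 = size
        e1, e2 = eps
        x = np.abs(points[:, 0]) / a1
        y = np.abs(points[:, 1]) / a2
        z = np.abs(points[:, 2]) / a3
        return (x ** (2.0 / e2) + y ** (2.0 / e2)) ** (e2 / e1) + z ** (2.0 / e1)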
[capture, recognition, work, human, modeling, complex, dataset] [shape, superquadrics, primitive, computer, point, vision, superquadric, note, cloud, pattern, cuboid, reconstruction, require, surface, well, allows, tulsiani, scene, body, approach, chamfer, shapenet, volumetric, single, analytical, contrast, international, parsimonious, provide, additional, andreas, solution, surreal, allow, linear] [ieee, figure, input, demonstrate, image, proposed] [number, network, neural, deep, evolution, variance, structure, gradient, size, iteration, fine] [model, existence, simple, goal, reinforcement, parse, visual, represent] [object, predicted, parsing, propose, utilize, faster] [loss, learning, distance, set, training, target, representation, function, unsupervised, sampling, observe, learn, data]
@InProceedings{Paschalidou_2019_CVPR,
  author = {Paschalidou, Despoina and Osman Ulusoy, Ali and Geiger, Andreas},
  title = {Superquadrics Revisited: Learning 3D Shape Parsing Beyond Cuboids},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Disentangling of Appearance and Geometry by Deformable Generator Network
Xianglei Xing, Tian Han, Ruiqi Gao, Song-Chun Zhu, Ying Nian Wu


We present a deformable generator model to disentangle the appearance and geometric information in a purely unsupervised manner. The appearance generator models the appearance-related information, including color, illumination, identity or category, of an image, while the geometric generator performs geometry-related warping, such as rotation and stretching, by generating displacements of the coordinates of each pixel to obtain the final image. The two generators act upon independent latent factors to extract disentangled appearance and geometric information from an image. The proposed scheme is general and can be easily integrated into different generative models. An extensive set of qualitative and quantitative experiments shows that the appearance and geometric information can be well disentangled, and the learned geometric generator can be conveniently transferred to other image datasets to facilitate knowledge transfer tasks.
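A minimal sketch of composing an appearance output with a generated displacement field via differentiable warping; the generator architectures are omitted and the displacement convention (offsets in normalized coordinates) is an assumption:

    import torch
    import torch.nn.functional as F

    def warp_by_displacement(appearance, displacement):
        # appearance: (B, C, H, W) output of the appearance generator;
        # displacement: (B, 2, H, W) per-pixel (dx, dy) offsets in normalized [-1, 1]
        # coordinates, produced by the geometric generator.
        B, _, H, W = appearance.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                                indexing='ij')
        base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(B, H, W, 2)
        grid = base.to(appearance.device) + displacement.permute(0, 2, 3, 1)
        return F.grid_sample(appearance, grid, align_corners=True)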
[warping, dataset, optical, second, largest, flow, displacement, work] [geometric, shape, left, viewing, computer, algorithm, view, langevin, corresponding, illumination, reconstruction, rotation, well, differentiable, varying, pattern, vision, estimation] [appearance, generator, latent, figure, face, proposed, image, disentangled, generative, interpolation, identity, row, celeba, disentangling, color, change, transferring, ieee, fixing, conference, disentangle, transferred, method, variation, expression, transformation] [inference, covariance, deep, network, gradient, varies, operation, stochastic] [model, vector, generated, variational, step, adversarial, generate, encodes, arxiv, preprint] [deformable, third, final] [learning, learned, dimension, representation, log, transfer, learn, training, unsupervised, independent, unseen, sample, vae, set, function, posterior, sampling]
@InProceedings{Xing_2019_CVPR,
  author = {Xing, Xianglei and Han, Tian and Gao, Ruiqi and Zhu, Song-Chun and Nian Wu, Ying},
  title = {Unsupervised Disentangling of Appearance and Geometry by Deformable Generator Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Self-Supervised Representation Learning by Rotation Feature Decoupling
Zeyu Feng, Chang Xu, Dacheng Tao


We introduce a self-supervised learning method that focuses on beneficial properties of representation and their abilities in generalizing to real-world tasks. The method incorporates rotation invariance into the feature learning framework, one of many good and well-studied properties of visual representation, which is rarely appreciated or exploited by previous deep convolutional neural network based self-supervised representation learning methods. Specifically, our model learns a split representation that contains both rotation related and unrelated parts. We train neural networks by jointly predicting image rotations and discriminating individual instances. In particular, our model decouples the rotation discrimination from instance discrimination, which allows us to improve the rotation prediction by mitigating the influence of rotation label noise, as well as discriminate instances without regard to image rotations. The resulting feature has a better generalization ability for more various tasks. Experimental results show that our model outperforms current state-of-the-art methods on standard self-supervised feature learning benchmarks.
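An illustrative split of a backbone feature into rotation-related and rotation-unrelated parts with the two heads the abstract mentions; the feature layout and the batch grouping of the four rotations are assumptions:

    import torch
    import torch.nn as nn

    class DecoupledHeads(nn.Module):
        # Illustrative decoupling: one half of the feature predicts the rotation,
        # the other half (averaged over rotations) is used for instance discrimination.
        def __init__(self, feat_dim=512, n_instances=10000):
            super().__init__()
            half = feat_dim // 2
            self.rot_head = nn.Linear(half, 4)             # 0/90/180/270 degree classes
            self.inst_head = nn.Linear(half, n_instances)  # instance discrimination

        def forward(self, feat):
            # feat: (4B, feat_dim), assumed ordered as four rotated copies of B images.
            f_rot, f_inst = feat.chunk(2, dim=1)
            rot_logits = self.rot_head(f_rot)
            f_inst = f_inst.view(4, -1, f_inst.shape[1]).mean(dim=0)  # average over rotations
            inst_logits = self.inst_head(f_inst)
            return rot_logits, inst_logits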
[recognition, predicting, dataset, prediction, influence, previous] [rotation, computer, vision, international, pattern, linear, orientation, approach, june, well] [image, conference, ieee, method, figure, based, zhang, input] [neural, network, convolutional, imagenet, deep, performance, table, standard, max, best, achieve] [model, visual, machine, semantically, vector, ability] [feature, instance, rotated, object, pascal, agnostic, default, detection, semantic, segmentation, voc, eccv, fully] [learning, classification, unrelated, rotnet, unsupervised, representation, learned, noroozi, unlabeled, set, trained, discriminative, noisy, positive, data, training, task, loss, ahenb, springer, decoupling, train, decoupled, favaro, discrimination, pretext]
@InProceedings{Feng_2019_CVPR,
  author = {Feng, Zeyu and Xu, Chang and Tao, Dacheng},
  title = {Self-Supervised Representation Learning by Rotation Feature Decoupling},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Weakly Supervised Deep Image Hashing Through Tag Embeddings
Vijetha Gattupalli, Yaoxin Zhuo, Baoxin Li


Many approaches to semantic image hashing have been formulated as supervised learning problems that utilize images and label information to learn the binary hash codes. However, large-scale labelled image data is expensive to obtain, thus imposing a restriction on the usage of such algorithms. On the other hand, unlabelled image data is abundant due to the existence of many Web image repositories. Such Web images may often come with image tags that contain useful information, although raw tags in general do not readily lead to semantic labels. Motivated by this scenario, we formulate the problem of semantic image hashing as a weakly-supervised learning problem. We utilize the information contained in the user-generated tags associated with the images to learn the hash codes. More specifically, we extract the word2vec semantic embeddings of the tags and use the information contained in them for constraining the learning. Accordingly, we name our model Weakly Supervised Deep Hashing using Tag Embeddings (WDHT). WDHT is tested for the task of semantic image retrieval and is compared against several state-of-the-art models. Results show that our approach sets a new state of the art in the area of weakly supervised image hashing.
[work, dataset] [associated, approach, problem, pattern, international, ground, truth, computer, algorithm, vision, computed] [image, conference, ieee, raw, method, acm, component, based, proposed, figure] [binary, deep, neural, represents, output, performance, table, quantization, network, compared, processing, aggregation, alexnet, better, kernel] [tag, model, vector, word, semantically, attempted, query] [semantic, weakly, map, feature, three, aggregated, area] [hash, loss, hashing, learning, supervised, set, similarity, space, label, training, learn, sample, data, embeddings, web, unsupervised, function, embedding, representation, code, ranking, hinge, margin, task, retrieval, cosine, close]
@InProceedings{Gattupalli_2019_CVPR,
  author = {Gattupalli, Vijetha and Zhuo, Yaoxin and Li, Baoxin},
  title = {Weakly Supervised Deep Image Hashing Through Tag Embeddings},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Improved Road Connectivity by Joint Learning of Orientation and Segmentation
Anil Batra, Suriya Singh, Guan Pang, Saikat Basu, C.V. Jawahar, Manohar Paluri


Road network extraction from satellite images often produces fragmented road segments, leading to road maps unfit for real applications. Pixel-wise classification fails to predict topologically correct and connected road masks due to the absence of connectivity supervision and difficulty in enforcing topological constraints. In this paper, we propose a connectivity task called Orientation Learning, motivated by the human behavior of annotating roads by tracing them at a specific orientation. We also develop a stacked multi-branch convolutional module to effectively utilize the mutual information between orientation learning and segmentation tasks. These contributions ensure that the model predicts topologically correct and connected road masks. We also propose a Connectivity Refinement approach to further enhance the estimated road networks. The refinement model is pre-trained to connect and refine the corrupted ground-truth masks and later fine-tuned to enhance the predicted road masks. We demonstrate the advantages of our approach on two diverse road extraction datasets, SpaceNet and DeepGlobe. Our approach improves over the state-of-the-art techniques by 9% and 7.5% in the road topology metric on SpaceNet and DeepGlobe, respectively.
[fusion, joint, graph, dataset, flow, human, predict, complex, predicting, perform] [orientation, topology, approach, estimated, groundtruth, accurate, tracing, well, point] [figure, proposed, based, image, corrupted, pixel, missing, stack, produce, remove] [network, stacked, connected, table, deep, structure, number, topologically, convolutional, performance] [model, path, automatic, correct, shortest, iterative] [road, connectivity, segmentation, refinement, module, iou, extraction, improve, connect, spacenet, deepglobe, false, feature, improves, satellite, propose, cnn, refine, improvement, aerial, supervision, enhance, attyus] [learning, task, loss, learn, shared, classification, training, similarity, function]
@InProceedings{Batra_2019_CVPR,
  author = {Batra, Anil and Singh, Suriya and Pang, Guan and Basu, Saikat and Jawahar, C.V. and Paluri, Manohar},
  title = {Improved Road Connectivity by Joint Learning of Orientation and Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Supervised Cross-Modal Retrieval
Liangli Zhen, Peng Hu, Xu Wang, Dezhong Peng


Cross-modal retrieval aims to enable flexible retrieval across different modalities. The core of cross-modal retrieval is how to measure the content similarity between different types of data. In this paper, we present a novel cross-modal retrieval method, called Deep Supervised Cross-modal Retrieval (DSCMR). It aims to find a common representation space, in which the samples from different modalities can be compared directly. Specifically, DSCMR minimises the discrimination loss in both the label space and the common representation space to supervise the model learning discriminative features. Furthermore, it simultaneously minimises the modality invariance loss and uses a weight sharing strategy to eliminate the cross-modal discrepancy of multimedia data in the common representation space to learn modality-invariant features. Comprehensive experimental results on four widely-used benchmark datasets demonstrate that the proposed method is effective in cross-modal learning and significantly outperforms the state-of-the-art cross-modal retrieval methods.
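A hedged sketch of the three loss components named above (label-space discrimination, common-space discrimination, and modality invariance); cross-entropy and the pairwise similarity term are illustrative stand-ins for the paper's exact objectives:

    import torch
    import torch.nn.functional as F

    def dscmr_style_loss(u, v, labels, classifier, lam1=1.0, lam2=1.0):
        # u: (N, D) image representations; v: (N, D) text representations in the
        # common space; labels: (N,) class indices; classifier: a shared nn.Linear.
        label_space = F.cross_entropy(classifier(u), labels) + \
                      F.cross_entropy(classifier(v), labels)
        invariance = (u - v).pow(2).sum(dim=1).sqrt().mean()   # modality invariance
        # Common-space discrimination: pull same-class pairs together across modalities.
        sim = u @ v.t()
        same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
        common_space = F.binary_cross_entropy_with_logits(sim, same)
        return label_space + lam1 * common_space + lam2 * invariance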
[dataset, outperforms, heterogeneous] [linear, international, analysis, matrix, equation, pattern, computer] [image, proposed, method, conference, ieee, figure, comparison, acm, denoted, traditional] [deep, correlation, performance, number, original, table, neural, compared, network, convolutional] [common, text, modality, sentence, machine, multimedia, model, vector, find, minimises] [semantic, map, cnn, feature, score, category, pascal, average, propose, benchmark, highest] [representation, learning, learn, retrieval, dscmr, space, data, loss, function, discrimination, label, discriminative, objective, similarity, supervised, invariance, classification, learned, measure, datasets, classifier, cca, metric, training, jrl, cmdn, ccl, dcca]
@InProceedings{Zhen_2019_CVPR,
  author = {Zhen, Liangli and Hu, Peng and Wang, Xu and Peng, Dezhong},
  title = {Deep Supervised Cross-Modal Retrieval},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Theoretically Sound Upper Bound on the Triplet Loss for Improving the Efficiency of Deep Distance Metric Learning
Thanh-Toan Do, Toan Tran, Ian Reid, Vijay Kumar, Tuan Hoang, Gustavo Carneiro


We propose a method that substantially improves the efficiency of deep distance metric learning based on the optimization of the triplet loss function. One epoch of such a training process based on a naïve optimization of the triplet loss function has a run-time complexity O(N^3), where N is the number of training samples. Such optimization scales poorly, and the most common approach proposed to address this high complexity issue is based on sub-sampling the set of triplets needed for the training process. Another approach explored in the field relies on an ad-hoc linearization (in terms of N) of the triplet loss that introduces class centroids, which must be optimized using the whole training set for each mini-batch - this means that a naïve implementation of this approach has run-time complexity O(N^2). This complexity issue is usually mitigated with poor, but computationally cheap, approximate centroid optimization methods. In this paper, we first propose a solid theory on the linearization of the triplet loss with the use of class centroids, where the main conclusion is that our new linear loss represents a tight upper-bound to the triplet loss. Furthermore, based on the theory above, we propose a training algorithm that no longer requires the centroid optimization step, which means that our approach is the first in the field with a guaranteed linear run-time complexity. We show that the training of deep distance metric learning methods using the proposed upper-bound is substantially faster than triplet-based methods, while producing competitive retrieval accuracy results on benchmark datasets (CUB-200-2011 and CAR196).
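To make the complexity argument concrete, here is the naive O(N^3) triplet loss next to a centroid-based, linear-time surrogate of the general kind discussed above; the paper's actual upper bound is not reproduced:

    import torch

    def triplet_loss_all(x, y, margin=0.2):
        # Naive loss over all (anchor, positive, negative) triplets: O(N^3) terms.
        # x: (N, D) embeddings; y: (N,) long class labels.
        d = torch.cdist(x, x)
        loss, count = x.new_zeros(()), 0
        for a in range(len(x)):
            for p in range(len(x)):
                if p == a or y[p] != y[a]:
                    continue
                for n in range(len(x)):
                    if y[n] == y[a]:
                        continue
                    loss = loss + torch.clamp(d[a, p] - d[a, n] + margin, min=0)
                    count += 1
        return loss / max(count, 1)

    def centroid_surrogate(x, y, centroids, margin=0.2):
        # Linear-time surrogate: compare each sample's distance to its own class
        # centroid against its distance to the nearest other centroid.
        d = torch.cdist(x, centroids)                       # (N, C)
        own = d.gather(1, y.view(-1, 1))                    # distance to own centroid
        mask = torch.ones_like(d).scatter_(1, y.view(-1, 1), 0.0)
        other = (d * mask + 1e9 * (1 - mask)).min(dim=1, keepdim=True).values
        return torch.clamp(own - other + margin, min=0).mean()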
[second, dataset, explored, term, work] [approach, optimization, bound, note, linear, point, surface, field, lemma, university, defined] [proposed, based, method, image, ive] [complexity, number, deep, table, layer, network, performance, process, accuracy, unit, max, standard, convolutional, order, computationally, represents] [requires, model, generation, generate] [feature, propose, global, fully, improves, faster] [loss, training, triplet, discriminative, learning, class, centroid, distance, metric, dml, mining, embedding, classification, clustering, upper, set, negative, softmax, embeddings, positive, function, min, issue, datasets, pairwise, cyi, retrieval, trained, linearization, hard, smart, large]
@InProceedings{Do_2019_CVPR,
  author = {Do, Thanh-Toan and Tran, Toan and Reid, Ian and Kumar, Vijay and Hoang, Tuan and Carneiro, Gustavo},
  title = {A Theoretically Sound Upper Bound on the Triplet Loss for Improving the Efficiency of Deep Distance Metric Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Data Representation and Learning With Graph Diffusion-Embedding Networks
Bo Jiang, Doudou Lin, Jin Tang, Bin Luo


Recently, graph convolutional neural networks have been widely studied for graph-structured data representation and learning. In this paper, we present Graph Diffusion-Embedding networks (GDENs), a new model for graph-structured data representation and learning. GDENs are motivated by our development of graph based feature diffusion. GDENs integrate both feature diffusion and graph node (low-dimensional) embedding simultaneously into a unified network by employing a novel diffusion-embedding architecture. GDENs have two main advantages. First, the equilibrium representation of the diffusion-embedding operation in GDENs can be obtained via a simple closed-form solution, which thus guarantees the compactivity and efficiency of GDENs. Second, the proposed GDENs can be naturally extended to address the data with multiple graph structures. Experiments on various semi-supervised learning tasks on several benchmark datasets demonstrate that the proposed GDENs significantly outperform traditional graph convolutional networks.
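One common closed-form graph feature diffusion (Laplacian-regularized smoothing) that illustrates the kind of equilibrium solution referred to above; the paper's particular diffusion-embedding operator may differ:

    import numpy as np

    def diffused_features(A, X, lam=1.0):
        # A: (N, N) symmetric adjacency matrix; X: (N, D) node features.
        # Solves min_H ||H - X||_F^2 + lam * tr(H^T L H), whose closed form is
        # H* = (I + lam * L)^{-1} X, with L the (unnormalized) graph Laplacian.
        L = np.diag(A.sum(axis=1)) - A
        return np.linalg.solve(np.eye(A.shape[0]) + lam * L, X)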
[graph, gdens, gcn, multiple, hidden, equilibrium, gden, propagation, work, gat, naturally, walk] [matrix, laplacian, optimal, note, international, computer, vision, general, provide] [based, proposed, conference, ieee, input, figure, image, row, conduct, comparison, spectral] [neural, network, convolutional, number, operation, denotes, table, better, performance, architecture, layer, output, weight, processing, outperform, structure, convolution, regularization, accuracy, explore, activation] [node, model, random, attention, kind, machine] [feature, contextual, propose, final, three, map, employing] [diffusion, representation, data, learning, classification, svhn, address, datasets, label, labeled, generally, loss, set, embedding, main, experimental, aij]
@InProceedings{Jiang_2019_CVPR,
  author = {Jiang, Bo and Lin, Doudou and Tang, Jin and Luo, Bin},
  title = {Data Representation and Learning With Graph Diffusion-Embedding Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Video Relationship Reasoning Using Gated Spatio-Temporal Energy Graph
Yao-Hung Hubert Tsai, Santosh Divvala, Louis-Philippe Morency, Ruslan Salakhutdinov, Ali Farhadi


Visual relationship reasoning is a crucial yet challenging task for understanding rich interactions across visual concepts. For example, a relationship ⟨man, open, door⟩ involves a complex relation ⟨open⟩ between concrete entities ⟨man, door⟩. While much of the existing work has studied this problem in the context of still images, understanding visual relationships in videos has received limited attention. Due to their temporal nature, videos enable us to model and reason about a more comprehensive set of visual relationships, such as those requiring multiple (temporal) observations (e.g., ⟨man, lift up, box⟩ vs. ⟨man, put down, box⟩), as well as relationships that are often correlated through time (e.g., ⟨woman, pay, money⟩ followed by ⟨woman, buy, coffee⟩). In this paper, we construct a Conditional Random Field on a fully-connected spatio-temporal graph that exploits the statistical dependency between relational entities spatially and temporally. We introduce a novel gated energy function parametrization that learns adaptive relations conditioned on visual observations. Our model optimization is computationally efficient, and its space computation complexity is significantly amortized through our proposed parameterization. Experimental results on benchmark video datasets (ImageNet Video and Charades) demonstrate state-of-the-art performance across three standard relationship reasoning tasks: Detection, Tagging, and Recognition.
[video, graph, temporal, ueg, recognition, vidvrd, subject, steg, gsteg, ytk, tagging, work, time, dependency, activity, multiple, tagged] [computer, vision, pattern, scene, parameterization, defined, matrix, well, analysis, ground, truth, volume] [conference, input, statistical, method, ieee, proposed, conditional, spatially, image] [energy, imagenet, deep, neural, structure, represents, performance, network, number, compared, correlation] [relationship, visual, model, reasoning, gated, evaluation, random, consider, conditioned, arxiv, preprint, language] [relation, object, detection, seg, three, instance, segment, feature, bounding] [pairwise, function, learning, set, training, task, triplet, label]
@InProceedings{Tsai_2019_CVPR,
  author = {Hubert Tsai, Yao-Hung and Divvala, Santosh and Morency, Louis-Philippe and Salakhutdinov, Ruslan and Farhadi, Ali},
  title = {Video Relationship Reasoning Using Gated Spatio-Temporal Energy Graph},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Image-Question-Answer Synergistic Network for Visual Dialog
Dalu Guo, Chang Xu, Dacheng Tao


The image, question (combined with the history for de-referencing), and the corresponding answer are three vital components of visual dialog. Classical visual dialog systems integrate the image, question, and history to search for or generate the best matched answer, and so, this approach significantly ignores the role of the answer. In this paper, we devise a novel image-question-answer synergistic network to value the role of the answer for precise visual dialog. We extend the traditional one-stage solution to a two-stage solution. In the first stage, candidate answers are coarsely scored according to their relevance to the image and question pair. Afterward, in the second stage, answers with high probability of being correct are re-ranked by synergizing with image and question. On the Visual Dialog v1.0 dataset, the proposed synergistic network boosts the discriminative visual dialog model to achieve a new state-of-the-art of 57.88% normalized discounted cumulative gain. A generative visual dialog model equipped with the proposed technique also shows promising improvements.
[lstm, fusion, dataset, current, state, second, sequence] [computer, vision, problem, pattern, single, volume] [image, generative, figure, based, conference, color, method, ieee, proposed, high] [network, bilinear, performance, top, higher, neural, pooling, table, better, number, processing, best] [answer, visual, primary, synergistic, question, model, candidate, dialog, vector, history, correct, attention, probability, word, wearing, caption, black, arxiv, preprint, grey, blue, mrr, grass, visor, generate, white, encoder, decoder, common, ndcg, scored, green, memory, attended, answering, turn] [stage, score, easy, improves, three, highest, feature, cnn] [discriminative, learning, loss, set, learn, ranked, representation, selected, pair]
@InProceedings{Guo_2019_CVPR,
  author = {Guo, Dalu and Xu, Chang and Tao, Dacheng},
  title = {Image-Question-Answer Synergistic Network for Visual Dialog},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Not All Frames Are Equal: Weakly-Supervised Video Grounding With Contextual Similarity and Visual Clustering Losses
Jing Shi, Jia Xu, Boqing Gong, Chenliang Xu


We investigate the problem of weakly-supervised video grounding, where only video-level sentences are provided. This is a challenging task, and previous Multi-Instance Learning (MIL) based image grounding methods tend to fail in the video domain. Recent work attempts to decompose the video-level MIL into frame-level MIL by applying a weighted sentence-frame ranking loss over frames, but it is not robust and does not exploit the rich temporal information in videos. In this work, we address these issues by extending frame-level MIL with a false positive frame-bag constraint and modeling the visual feature consistency in the video. Specifically, we design a contextual similarity between semantic and visual features to deal with sparse object associations across frames. Furthermore, we leverage temporal coherence by strengthening the clustering effect of similar features in the visual space. We conduct an extensive evaluation on the YouCookII and RoboWatch datasets, and demonstrate our method significantly outperforms prior state-of-the-art methods.
[video, frame, youcookii, temporal, dataset, robowatch, work, possibility, ict] [zhou, defined, problem, denote] [method, image, consistency, figure, conduct, proposed, based, face] [accuracy, better, higher, full, performance, compared, max, deep, size, number, compare, lower] [visual, grounding, query, model, grounded, sentence, word, language, fct, textual, generalizability, potato, common, description, queried, dvsa, evaluation, correctly, dvsafrm] [mil, contextual, object, region, feature, bag, box, false, segment, score, propose, localization, val] [similarity, loss, positive, clustering, learning, ranking, set, test, training, embedding, class, label, rank, discriminative, experimental, supervised, weighting, upper]
@InProceedings{Shi_2019_CVPR,
  author = {Shi, Jing and Xu, Jia and Gong, Boqing and Xu, Chenliang},
  title = {Not All Frames Are Equal: Weakly-Supervised Video Grounding With Contextual Similarity and Visual Clustering Losses},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Inverse Cooking: Recipe Generation From Food Images
Amaia Salvador, Michal Drozdzal, Xavier Giro-i-Nieto, Adriana Romero


People enjoy food photography because they appreciate food. Behind each meal there is a story described in a complex recipe and, unfortunately, by simply looking at a food image we do not have access to its preparation process. Therefore, in this paper we introduce an inverse cooking system that recreates cooking recipes given food images. Our system predicts ingredients as sets by means of a novel architecture, modeling their dependencies without imposing any order, and then generates cooking instructions by attending to both image and its inferred ingredients simultaneously. We extensively evaluate the whole system on the large-scale Recipe1M dataset and show that (1) we improve performance w.r.t. previous baselines for ingredient prediction; (2) we are able to obtain high quality recipes by leveraging both image and ingredients; (3) our system is able to produce more compelling recipes than retrieval-based approaches according to human judgment. We make code and models publicly available.
[prediction, dataset, sequence, predict, modeling, recognition, human, joint, forward, concatenated, outperforms] [well, inverse, ground, directly] [image, figure, proposed, user, feed, based, study, method, conditional] [neural, add, table, order, deep, compare, performance, binary, best, convolutional, norm, layer, output] [food, ingredient, recipe, cooking, model, attention, generation, system, transformer, instruction, generated, visual, decoder, cheese, cardinality, conditioned, generating, text, consider, onion, tfset, butter, salt, encoder, subsection, appear, vocabulary, generates, cream] [predicted, iou, average, baseline] [set, retrieval, embeddings, learning, distribution, list, train, retrieved, loss, target, test, classification, embedding, large, label, independent, task]
@InProceedings{Salvador_2019_CVPR,
  author = {Salvador, Amaia and Drozdzal, Michal and Giro-i-Nieto, Xavier and Romero, Adriana},
  title = {Inverse Cooking: Recipe Generation From Food Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
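A heavily simplified sketch of predicting ingredients as an unordered set: score every vocabulary entry with a sigmoid and keep those above a threshold. The paper's architecture additionally models ingredient dependencies, which this stand-in omits; all names below are illustrative.

```python
import torch
import torch.nn as nn

class IngredientSetPredictor(nn.Module):
    """Simplified set prediction: one sigmoid score per ingredient in the
    vocabulary, so no ordering is imposed on the predicted ingredients."""
    def __init__(self, feat_dim, n_ingredients):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, n_ingredients)

    def forward(self, img_feat, threshold=0.5):
        # img_feat: (B, feat_dim) image embedding from a visual encoder
        probs = torch.sigmoid(self.classifier(img_feat))   # (B, n_ingredients)
        return probs, probs > threshold                     # scores and predicted set
```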
Adversarial Semantic Alignment for Improved Image Captions
Pierre Dognin, Igor Melnyk, Youssef Mroueh, Jerret Ross, Tom Sercu


In this paper, we study image captioning as conditional GAN training, proposing both a context-aware LSTM captioner and a co-attentive discriminator, which enforces semantic alignment between images and captions. We empirically focus on the viability of two training methods: Self-critical Sequence Training (SCST) and Gumbel Straight-Through (ST), and demonstrate that SCST shows more stable gradient behavior and improved results over Gumbel ST, even without accessing discriminator gradients directly. We also address the problem of automatic evaluation for captioning models and introduce a new semantic score, showing its correlation with human judgement. As an evaluation paradigm, we argue that an important criterion for a captioner is the ability to generalize to compositions of objects that do not usually co-occur together. To this end, we introduce a small captioned Out of Context (OOC) test set. The OOC set, combined with our semantic score, is proposed as a new diagnostic tool for the captioning community. When evaluated on the OOC and MS-COCO benchmarks, we show that SCST-based training has strong performance in both semantic score and human evaluation, promising to be a valuable new approach for efficient discrete GAN training.
[human, lstm, sequence] [discrete, well, linear, approach, ground, computed, truth, optimization] [image, figure, proposed, based, generator, side] [gradient, table, better, best, compared, correlation, higher] [gan, scst, ooc, discriminator, caption, gumbel, vocabulary, coverage, captioning, cider, gans, attention, visual, evaluation, adversarial, captioner, sign, model, automatic, text, sentence, generation, sentinel, embed, parked, policy, language, step, word, ensce, improved, introduce, generated, diagnostic, call, reward, appendix, meteor, ensrl, standing] [semantic, score, coco, context, street, aware, propose] [training, test, set, log, alignment, embedding, trained, main, ensemble, softmax]
@InProceedings{Dognin_2019_CVPR,
  author = {Dognin, Pierre and Melnyk, Igor and Mroueh, Youssef and Ross, Jerret and Sercu, Tom},
  title = {Adversarial Semantic Alignment for Improved Image Captions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
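For reference, the self-critical (SCST) policy-gradient update contrasted with Gumbel ST above can be sketched as follows; `sampled_reward` and `greedy_reward` stand for whatever reward is used (CIDEr, the discriminator, or the semantic score) and are assumptions of this sketch.

```python
import torch

def scst_loss(log_probs, sampled_reward, greedy_reward):
    """Self-critical policy-gradient loss: the greedy decode acts as a baseline,
    so sampled captions that beat it are reinforced and worse ones suppressed.
    log_probs:      (B,) summed log-probability of each sampled caption
    sampled_reward: (B,) reward of the sampled captions
    greedy_reward:  (B,) reward of the greedy/test-time captions
    """
    advantage = (sampled_reward - greedy_reward).detach()
    return -(advantage * log_probs).mean()
```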
Answer Them All! Toward Universal Visual Question Answering Models
Robik Shrestha, Kushal Kafle, Christopher Kanan


Visual Question Answering (VQA) research is split into two camps: the first focuses on VQA datasets that require natural image understanding and the second focuses on synthetic datasets that test reasoning. A good VQA algorithm should be capable of both, but only a few VQA algorithms are tested in this manner. We compare five state-of-the-art VQA algorithms across eight VQA datasets covering both domains. To make the comparison fair, all of the models are standardized as much as possible, e.g., they use the same visual features, answer vocabularies, etc. We find that methods do not generalize across the two domains. To address this problem, we propose a new VQA algorithm that rivals or exceeds the state-of-the-art for both domains.
[perform, fusion, early, gru, dataset, multiple, human, work, report, graph, performing, recurrent] [well, algorithm, dimensional, require] [synthetic, image, comparison, attribute, capable] [performance, table, network, accuracy, aggregation, number, validation, best, bilinear, drop, neural] [visual, question, vqa, natural, ramen, clevr, model, reasoning, answering, compositional, attention, ban, language, mac, answer, cvqa, qcg, tdiuc, concept, evaluate, updn, bimodal, simple, understanding, hat, ange, mpt, query] [region, evaluated, counting, faster, spatial, feature, propose] [datasets, test, generalization, train, learning, set, training, generalize, embeddings, split, tested, trained, large]
@InProceedings{Shrestha_2019_CVPR,
  author = {Shrestha, Robik and Kafle, Kushal and Kanan, Christopher},
  title = {Answer Them All! Toward Universal Visual Question Answering Models},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Multi-Modal Neural Machine Translation
Yuanhang Su, Kai Fan, Nguyen Bach, C.-C. Jay Kuo, Fei Huang


Unsupervised neural machine translation (UNMT) has recently achieved remarkable results with only large monolingual corpora in each language. However, the uncertainty of associating target with source sentences makes UNMT theoretically an ill-posed problem. This work investigates the possibility of utilizing images for disambiguation to improve the performance of UNMT. Our assumption rests on the invariant property of images, i.e., descriptions of the same visual content in different languages should be approximately similar. We propose an unsupervised multi-modal machine translation (UMNMT) framework based on a language translation cycle consistency loss conditioned on the image, aiming to learn the bidirectional multi-modal translation simultaneously. Through alternate training between multi-modal and uni-modal, our inference model can translate with or without the image. On the widely used Multi30K dataset, the experimental results of our approach are significantly better than those of the text-only UNMT on the 2016 test dataset.
[dataset, framework] [problem, computer, vision, pattern, international, corresponding, additional] [image, translation, conference, figure, input, paired, ieee, based, proposed] [neural, inference, table, performance, better, layer, resnet, validation, processing] [model, text, machine, attention, transformer, encoder, decoder, language, orange, visual, hat, chapeau, encz, man, bleu, corpus, french, caption, word, monolingual, avec, english, multimodal, multilingual, controllable, umnmt, cider, arxiv, preprint, homme, sce, natural, umt, sentence, adversarial, mechanism, twig] [map, extra] [training, unsupervised, data, learning, source, loss, task, train, trained, target, supervised, shared, softmax, learn, large, set, testing]
@InProceedings{Su_2019_CVPR,
  author = {Su, Yuanhang and Fan, Kai and Bach, Nguyen and Jay Kuo, C.-C. and Huang, Fei},
  title = {Unsupervised Multi-Modal Neural Machine Translation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Task Learning of Hierarchical Vision-Language Representation
Duy-Kien Nguyen, Takayuki Okatani


It is still challenging to build an AI system that can perform tasks involving vision and language at human level. So far, researchers have tackled individual tasks separately, designing a network for each and training it on a dedicated dataset. Although this approach has seen a certain degree of success, it comes with difficulties in understanding relations among different tasks and in transferring the knowledge learned for one task to others. We propose a multi-task learning approach that learns a vision-language representation shared by many tasks from their diverse datasets. The representation is hierarchical, and the prediction for each task is computed from the representation at its corresponding level of the hierarchy. We show through experiments that our method consistently outperforms previous single-task-learning methods on image-caption retrieval, visual question answering, and visual grounding. We also analyze the learned hierarchical representation by visualizing attention maps generated by our network.
[joint, multiple, recognition, summary, previous, individual, dataset, consists, prediction] [vision, computer, international, pattern, single, compute, dense, corresponding, approach, matching] [image, conference, method, input, proposed, feedforward] [network, layer, table, number, employ, deep, output, effectiveness, accuracy] [visual, question, attention, vqa, caption, icr, decoder, encoder, riding, sentence, phrase, language, grounding, answering, word, understanding, natural, multimodal, vector, generated] [three, val, region, feature, hierarchical, score, european, level] [learning, task, training, train, shared, representation, test, retrieval, set, trained, learned, pair, learn]
@InProceedings{Nguyen_2019_CVPR,
  author = {Nguyen, Duy-Kien and Okatani, Takayuki},
  title = {Multi-Task Learning of Hierarchical Vision-Language Representation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Cross-Modal Self-Attention Network for Referring Image Segmentation
Linwei Ye, Mrigank Rochan, Zhi Liu, Yang Wang


We consider the problem of referring image segmentation. Given an input image and a natural language expression, the goal is to segment the object referred by the language expression in the image. Existing works in this area treat the language expression and the input image separately in their representations. They do not sufficiently capture long-range correlations between these two modalities. In this paper, we propose a cross-modal self-attention (CMSA) module that effectively captures the long-range dependencies between linguistic and visual features. Our model can adaptively focus on informative words in the referring expression and important regions in the input image. In addition, we propose a gated multi-level fusion module to selectively integrate self-attentive cross-modal features corresponding to different levels in the image. This module controls the information flow of features at different levels. We validate the proposed approach on four evaluation datasets. Our proposed approach consistently outperforms existing state-of-the-art methods.
[fusion, capture, lstm, focus, previous, dataset, individual, interaction, outperforms, work] [computer, vision, pattern, corresponding, technique, coordinate, denote, approach, relative, linear] [image, conference, proposed, expression, method, ieee, input, produce, figure, based] [network, convolutional, neural, gate, table, performance, order, better, layer, output, entire, process] [referring, language, multimodal, visual, word, attention, model, gated, natural, linguistic, vector, referred, query, machine, unc, encode, evaluation, crossmodal] [feature, segmentation, spatial, module, object, map, fpn, level, semantic, location, cnn, propose, final, three, mask, fully, failure, detailed] [learning, representation, dimension, effectively, specific, datasets]
@InProceedings{Ye_2019_CVPR,
  author = {Ye, Linwei and Rochan, Mrigank and Liu, Zhi and Wang, Yang},
  title = {Cross-Modal Self-Attention Network for Referring Image Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
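A minimal sketch of self-attention computed jointly over visual and word tokens, which is the core of the cross-modal dependency modeling described above; the projections and shapes are illustrative, and the paper's gated multi-level fusion is not reproduced.

```python
import torch
import torch.nn as nn

class CrossModalSelfAttention(nn.Module):
    """Self-attention over joint visual-linguistic tokens: every image position
    can attend to every word and vice versa, capturing long-range cross-modal
    dependencies."""
    def __init__(self, vis_dim, word_dim, dim):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, dim)
        self.word_proj = nn.Linear(word_dim, dim)
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, vis_feats, word_feats):
        # vis_feats: (B, HW, vis_dim) flattened image features
        # word_feats: (B, T, word_dim) word embeddings of the expression
        tokens = torch.cat([self.vis_proj(vis_feats),
                            self.word_proj(word_feats)], dim=1)   # (B, HW+T, dim)
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        out = attn @ v                                            # (B, HW+T, dim)
        return out[:, :vis_feats.shape[1]]                        # keep visual tokens
```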
DuDoNet: Dual Domain Network for CT Metal Artifact Reduction
Wei-An Lin, Haofu Liao, Cheng Peng, Xiaohang Sun, Jingdan Zhang, Jiebo Luo, Rama Chellappa, Shaohua Kevin Zhou


Computed tomography (CT) is an imaging modality widely used for medical diagnosis and treatment. CT images are often corrupted by undesirable artifacts when metallic implants are carried by patients, which creates the problem of metal artifact reduction (MAR). Existing methods for reducing the artifacts due to metallic implants are inadequate for two main reasons. First, metal artifacts are structured and non-local, so simple image-domain enhancement approaches do not suffice. Second, MAR approaches that attempt to reduce metal artifacts in the X-ray projection (sinogram) domain inevitably lead to severe secondary artifacts due to sinogram inconsistency. To overcome these difficulties, we propose an end-to-end trainable Dual Domain Network (DuDoNet) to simultaneously restore sinogram consistency and enhance CT images. The linkage between the sinogram and image domains is a novel Radon inversion layer that allows gradients to back-propagate from the image domain to the sinogram domain during training. Extensive experiments show that our method achieves significant improvements over other single-domain MAR approaches. To the best of our knowledge, this is the first end-to-end dual-domain network for MAR.
[mar, time, recognition] [radon, projection, reconstruction, computer, ground, truth, pattern, vision, computed, single, dense, limited, algorithm] [metal, image, sinogram, artifact, enhancement, figure, proposed, ieee, dual, metallic, trace, sinograms, nmar, consistency, cnnmar, conference, inversion, dudonet, ril, reconstructed, xli, method, filtering, based, recover, inpainting, yli, corrupted, intense, recovers, interpolated, zhang] [network, deep, secondary, table, reduction, size, reduce, layer, residual, neural, performance, small, full, restore, convolutional, reduced, fine, architecture] [model, visual, consider, iterative] [medical, mask, propose, pyramid] [domain, learning, loss, data, effectively, existing]
@InProceedings{Lin_2019_CVPR,
  author = {Lin, Wei-An and Liao, Haofu and Peng, Cheng and Sun, Xiaohang and Zhang, Jingdan and Luo, Jiebo and Chellappa, Rama and Kevin Zhou, Shaohua},
  title = {DuDoNet: Dual Domain Network for CT Metal Artifact Reduction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fast Spatio-Temporal Residual Network for Video Super-Resolution
Sheng Li, Fengxiang He, Bo Du, Lefei Zhang, Yonghao Xu, Dacheng Tao


Recently, deep learning based video super-resolution (SR) methods have achieved promising performance. Employing 3-dimensional (3D) convolutions is a natural approach to simultaneously exploiting the spatial and temporal information of videos. However, directly using 3D convolutions may lead to excessively high computational complexity, which restricts the depth of video SR models and thus undermines performance. In this paper, we present a novel fast spatio-temporal residual network (FSTRN) that adopts 3D convolutions for the video SR task in order to enhance performance while maintaining a low computational load. Specifically, we propose a fast spatio-temporal residual block (FRB) that divides each 3D filter into the product of two 3D filters of considerably lower dimension. Furthermore, we design a cross-space residual learning that directly links the low-resolution space and the high-resolution space, which greatly relieves the computational burden on the feature fusion and up-scaling parts. Extensive evaluations and comparisons on benchmark datasets validate the strengths of the proposed approach and demonstrate that the proposed network significantly outperforms the current state-of-the-art methods.
[video, temporal, fusion, spatiotemporal, motion, frame, extract] [single, bound, computer, directly, greatly, approach, analysis, additional] [proposed, image, input, psnr, mapping, ieee, ssim, figure, method, based, high, interpolation, comparison, bicubic, result, deconvolution] [residual, network, convolutional, deep, frb, computational, fast, output, neural, block, fstrn, frbs, crl, performance, prelu, connection, original, conv, fsr, size, filter, applied, weight, layer, number, andrew, wei, addition, flrl, dropout] [natural, model, memory, ability] [feature, propose, global, spatial, enhance, benchmark, extraction, improve] [learning, space, generalization, function, lrl, training, set, novel, existing, test, loss, suppose]
@InProceedings{Li_2019_CVPR,
  author = {Li, Sheng and He, Fengxiang and Du, Bo and Zhang, Lefei and Xu, Yonghao and Tao, Dacheng},
  title = {Fast Spatio-Temporal Residual Network for Video Super-Resolution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
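The FRB factorization can be sketched as a residual block that replaces one k x k x k convolution with a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal convolution; this is one common reading of the factorization and may differ in detail from the paper's exact block.

```python
import torch
import torch.nn as nn

class FastResidualBlock3D(nn.Module):
    """Replaces a k x k x k 3D conv with a 1 x k x k (spatial) conv followed by
    a k x 1 x 1 (temporal) conv, cutting parameters and FLOPs while keeping a
    3D receptive field. Input/output: (B, C, T, H, W)."""
    def __init__(self, channels, k=3):
        super().__init__()
        p = k // 2
        self.spatial = nn.Conv3d(channels, channels, (1, k, k), padding=(0, p, p))
        self.temporal = nn.Conv3d(channels, channels, (k, 1, 1), padding=(p, 0, 0))
        self.act = nn.PReLU()

    def forward(self, x):
        return x + self.temporal(self.act(self.spatial(self.act(x))))
```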
Complete the Look: Scene-Based Complementary Product Recommendation
Wang-Cheng Kang, Eric Kim, Jure Leskovec, Charles Rosenberg, Julian McAuley


Modeling fashion compatibility is challenging due to its complexity and subjectivity. Existing work focuses on predicting compatibility between product images (e.g. an image containing a t-shirt and an image containing a pair of jeans). However, these approaches ignore real-world 'scene' images (e.g. selfies); such images are hard to deal with due to their complexity, clutter, variations in lighting and pose (etc.) but on the other hand could potentially provide key context (e.g. the user's body type, or the season) for making more accurate recommendations. In this work, we propose a new task called 'Complete the Look', which seeks to recommend visually compatible products based on scene images. We design an approach to extract training data for this task, and propose a novel way to learn the scene-product compatibility from fashion or interior design images. Our approach measures compatibility both globally and locally via CNNs and attention mechanisms. Extensive experiments show that our method achieves significant performance gains over alternative systems. Human evaluation and qualitative analysis are also conducted to further understand model behavior. We hope this work could lead to useful applications which link large corpora of real-world scenes with shoppable products.
[human, dataset, work, key, predicting, extract, online] [scene, approach, local, typically, note, column, well] [image, style, based, figure, method] [performance, deep, table, siamese, design, better, accuracy, full, achieves, addition, network] [attention, visual, model, complete, query, relevant, consider, random, retrieving] [clothing, feature, complementary, category, map, bounding, recommend, cropped, region, box, global, area, context, adopted] [fashion, product, compatibility, compatible, data, datasets, ctl, learn, learning, stl, embeddings, seek, embedding, recommendation, existing, task, measure, ibr, shop, similarity, distance, test, deepsaliency, large, item, notion, unified, pair, training, labeled, ranking]
@InProceedings{Kang_2019_CVPR,
  author = {Kang, Wang-Cheng and Kim, Eric and Leskovec, Jure and Rosenberg, Charles and McAuley, Julian},
  title = {Complete the Look: Scene-Based Complementary Product Recommendation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Selective Sensor Fusion for Neural Visual-Inertial Odometry
Changhao Chen, Stefano Rosa, Yishu Miao, Chris Xiaoxuan Lu, Wei Wu, Andrew Markham, Niki Trigoni


Deep learning approaches for Visual-Inertial Odometry (VIO) have proven successful, but they rarely focus on incorporating robust fusion strategies for dealing with imperfect input sensory data. We propose a novel end-to-end selective sensor fusion framework for monocular VIO, which fuses monocular images and inertial measurements in order to estimate the trajectory whilst improving robustness to real-life issues such as missing and corrupted data or bad sensor synchronization. In particular, we propose two fusion modalities based on different masking strategies: deterministic soft fusion and stochastic hard fusion, and we compare against previously proposed direct fusion baselines. During testing, the network is able to selectively process the features of the available sensor modalities and produce a trajectory at scale. We present a thorough investigation of performance on three public datasets covering autonomous driving, Micro Aerial Vehicle (MAV) flight, and hand-held VIO. The results demonstrate the effectiveness of the fusion strategies, which offer better performance than direct fusion, particularly in the presence of corrupted data. In addition, we study the interpretability of the fusion networks by visualising the masking layers in different scenarios and with varying data corruption, revealing interesting correlations between the fusion networks and imperfect sensory input data.
[fusion, inertial, vio, imu, temporal, dataset, framework, modelling, deterministic, lstm, time, motion, occur, explicitly, velocity, concatenated] [sensor, direct, odometry, pose, vision, international, robotics, monocular, robust, camera, normal, journal, computer, error, kitti, euroc, micro, occlusion] [degradation, figure, proposed, missing, noise, input, conference, corrupted, ieee, based, image, filtering, blur, translation, study] [neural, selection, deep, stochastic, table, network, compare, process, ratio, order, convolutional, full] [visual, model, encoder, probability, random, sensory, multimodal] [feature, mask, selective, presence, propose, misalignment, aerial, regression, vehicle] [hard, soft, data, learning, function, selected, trained]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Changhao and Rosa, Stefano and Miao, Yishu and Xiaoxuan Lu, Chris and Wu, Wei and Markham, Andrew and Trigoni, Niki},
  title = {Selective Sensor Fusion for Neural Visual-Inertial Odometry},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
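A rough sketch of the two masking strategies: a deterministic sigmoid (soft) mask versus a stochastic binary (hard) mask sampled with Gumbel-softmax, both applied to the concatenated visual and inertial features. Module names and shapes are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveFusion(nn.Module):
    """Fuse visual and inertial features through a learned mask.
    mode='soft': deterministic sigmoid weights in [0, 1].
    mode='hard': stochastic binary mask sampled with Gumbel-softmax."""
    def __init__(self, vis_dim, imu_dim, mode="soft"):
        super().__init__()
        self.mode = mode
        self.gate = nn.Linear(vis_dim + imu_dim, vis_dim + imu_dim)

    def forward(self, vis, imu, tau=1.0):
        feats = torch.cat([vis, imu], dim=-1)              # (B, vis_dim + imu_dim)
        logits = self.gate(feats)
        if self.mode == "soft":
            mask = torch.sigmoid(logits)
        else:
            # two-class Gumbel-softmax per feature channel: keep vs. drop
            two = torch.stack([logits, torch.zeros_like(logits)], dim=-1)
            mask = F.gumbel_softmax(two, tau=tau, hard=True)[..., 0]
        return feats * mask
```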
Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes
Chengquan Zhang, Borong Liang, Zuming Huang, Mengyi En, Junyu Han, Errui Ding, Xinghao Ding


Previous scene text detection methods have progressed substantially over the past years. However, limited by the receptive field of CNNs and the simple representations, such as rectangular bounding boxes or quadrangles, adopted to describe text, previous methods may fall short when dealing with more challenging text instances, such as extremely long text and arbitrarily shaped text. To address these two problems, we present a novel text detector, LOMO, which localizes text progressively multiple times (in other words, it LOoks More than Once). LOMO consists of a direct regressor (DR), an iterative refinement module (IRM) and a shape expression module (SEM). First, text proposals in the form of quadrangles are generated by the DR branch. Next, the IRM progressively perceives entire long text instances by iterative refinement based on the extracted feature blocks of preliminary proposals. Finally, the SEM is introduced to reconstruct a more precise representation of irregular text by considering the geometric properties of the text instance, including the text region, text center line and border offsets. State-of-the-art results on several public benchmarks, including ICDAR2017-RCTW, SCUT-CTW1500, Total-Text, ICDAR2015 and ICDAR17-MLT, confirm the striking robustness and effectiveness of LOMO.
[long, longer, consists, dataset, extract] [scene, corner, shape, direct, geometry, corresponding, robust, field, irregular] [based, expression, method, image, arbitrary, input, side, cover, figure, proposed] [performance, receptive, convolutional, deep, size, number, table, achieves, network, extremely, inference] [text, attention, iterative, arxiv, preprint, natural, reading, generate] [lomo, detection, irm, sem, border, center, detecting, module, feature, regression, quadrangle, refinement, hmean, curved, including, polygon, aspect, icdar, three, region, map, branch, detector, proposal, mask, roi, east, instance, oriented, box, propose] [set, training, loss, shared, testing, learning, sample, regressor, existing, datasets]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Chengquan and Liang, Borong and Huang, Zuming and En, Mengyi and Han, Junyu and Ding, Errui and Ding, Xinghao},
  title = {Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Binary Code for Personalized Fashion Recommendation
Zhi Lu, Yang Hu, Yunchao Jiang, Yan Chen, Bing Zeng


With the rapid growth of fashion-focused social networks and online shopping, intelligent fashion recommendation is now in great need. Recommending fashion outfits, each of which is composed of multiple interacting clothing items and accessories, is relatively new to the field. The problem becomes even more interesting and challenging when considering users' personalized fashion styles. Another challenge in a large-scale fashion outfit recommendation system is the efficiency of item/outfit search and storage. In this paper, we propose to learn binary codes for efficient personalized fashion outfit recommendation. Our system consists of three components: a feature network for content extraction, a set of type-dependent hashing modules to learn binary codes, and a matching block that conducts pairwise matching. The whole framework is trained in an end-to-end manner. We collect outfit data together with user label information from a fashion-focused social website for the personalized recommendation task. Extensive experiments on our datasets show that the proposed framework outperforms state-of-the-art methods significantly, even with a simple backbone.
[work, auc, term, dataset, social, multiple, framework] [problem, matching, computed, technique] [user, preference, method, proposed, composition, based, image, collaborative, latent, comparison, composed] [binary, number, network, performance, table, weighted, deep, efficient, neural, better, efficiency, block] [personalized, model, textual, evaluate, visual, ndcg, ure, created, sign, encoder, system] [feature, score, propose, clothing, three, shuicheng] [fashion, outfit, hashing, learning, learn, fhn, set, compatibility, recommendation, code, negative, hash, embedding, ranking, hard, pairwise, positive, datasets, similarity, compu, retrieval, item, fash, fferen, nary, training, trained, data, large, train, fea, polyvore, fitb]
@InProceedings{Lu_2019_CVPR,
  author = {Lu, Zhi and Hu, Yang and Jiang, Yunchao and Chen, Yan and Zeng, Bing},
  title = {Learning Binary Code for Personalized Fashion Recommendation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
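One standard way to keep a sign-based hashing module trainable end-to-end is a straight-through gradient, sketched below; the paper's type-dependent hashing modules and weighted matching block are not reproduced, so treat this as a generic illustration.

```python
import torch
import torch.nn as nn

class SignSTE(torch.autograd.Function):
    """sign() in the forward pass, identity ('straight-through') gradient in the
    backward pass, so the hashing layer stays trainable."""
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

class HashModule(nn.Module):
    def __init__(self, in_dim, code_bits=64):
        super().__init__()
        self.proj = nn.Linear(in_dim, code_bits)

    def forward(self, feat):
        return SignSTE.apply(self.proj(feat))   # binary codes in {-1, +1}

def compatibility(code_a, code_b):
    """Compatibility of two items as the inner product of their binary codes."""
    return (code_a * code_b).sum(dim=-1)
```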
Attention Based Glaucoma Detection: A Large-Scale Database and CNN Model
Liu Li, Mai Xu, Xiaofei Wang, Lai Jiang, Hanruo Liu


Recently, the attention mechanism has been successfully applied in convolutional neural networks (CNNs), significantly boosting the performance of many computer vision tasks. Unfortunately, few medical image recognition approaches incorporate the attention mechanism in the CNNs. In particular, there exists high redundancy in fundus images for glaucoma detection, such that the attention mechanism has potential in improving the performance of CNN-based glaucoma detection. This paper proposes an attention-based CNN for glaucoma detection (AG-CNN). Specifically, we first establish a large-scale attention based glaucoma (LAG) database, which includes 5,824 fundus images labeled with either positive glaucoma (2,392) or negative glaucoma (3,432). The attention maps of the ophthalmologists are also collected in LAG database through a simulated eye-tracking experiment. Then, a new structure of AG-CNN is designed, including an attention prediction subnet, a pathological area localization subnet and a glaucoma classification subnet. Different from other attention-based CNN methods, the features are also visualized as the localized pathological area, which can advance the performance of glaucoma detection. Finally, the experiment results show that the proposed AG-CNN approach significantly advances state-of-the-art glaucoma detection.
[prediction, human, recognition, work] [optic, international, computer, vision, note] [figure, method, based, image, database, conference, ieee, input, proposed, extracted] [deep, performance, convolutional, table, neural, structure, building, redundancy, binary, order, experiment, network, small, processing, layer, better, applied] [attention, model, sensitivity, finding, mechanism, visual, arxiv, preprint] [glaucoma, fundus, pathological, area, localization, subnet, detection, lag, medical, predicted, cnn, roi, map, feature, visualization, ophthalmologist, disc, located, retinal, region, cup, specificity, score, diagnosis, visualized, cleared, locate, disease, fully, diabetic, retinopathy] [classification, learning, loss, set, positive, negative, training, label]
@InProceedings{Li_2019_CVPR,
  author = {Li, Liu and Xu, Mai and Wang, Xiaofei and Jiang, Lai and Liu, Hanruo},
  title = {Attention Based Glaucoma Detection: A Large-Scale Database and CNN Model},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Privacy Protection in Street-View Panoramas Using Depth and Multi-View Imagery
Ries Uittenbogaard, Clint Sebastian, Julien Vijverberg, Bas Boom, Dariu M. Gavrila, Peter H.N. de With


The current paradigm for privacy protection in street-view images is to detect and blur sensitive information. In this paper, we propose a framework that is an alternative to blurring: it automatically removes and inpaints moving objects (e.g. pedestrians, vehicles) in street-view imagery. We propose a novel moving object segmentation algorithm exploiting consistencies in depth across multiple street-view images, which are later combined with the results of a segmentation network. The detected moving objects are removed and inpainted with information from other views to obtain a realistic output image in which the moving object is no longer visible. We evaluate our results on a dataset of 1000 images, obtaining a peak signal-to-noise ratio (PSNR) of 27.2 dB and an L1 loss of 2.5%. To assess overall quality, we also report the results of a survey of 35 professionals, who were asked to visually inspect the images and judge whether object removal and inpainting had taken place. The inpainting dataset will be made publicly available for scientific benchmarking purposes at https://research.cyclomedia.com/.
[moving, dataset, static, framework, inpaint, time, recognition] [depth, international, computer, confidence, pattern, view, lidar, alternative, algorithm, point, reconstruction, local, rgb, vision, note] [image, inpainting, conference, inpainted, ieee, removed, proposed, input, method, based, protection, blurring, figure, extracted, removal, remove, produce, quality, row, survey] [network, deep, neural, convolutional, convolution, better, layer, processing, output, performance, best, applied] [privacy, adversarial, gan, generate] [object, segmentation, detection, mask, final, false, interest, street, context, semantic, threshold, average] [loss, learning, set, trained, observe, training, learn, poor]
@InProceedings{Uittenbogaard_2019_CVPR,
  author = {Uittenbogaard, Ries and Sebastian, Clint and Vijverberg, Julien and Boom, Bas and Gavrila, Dariu M. and de With, Peter H.N.},
  title = {Privacy Protection in Street-View Panoramas Using Depth and Multi-View Imagery},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Grounding Human-To-Vehicle Advice for Self-Driving Vehicles
Jinkyu Kim, Teruhisa Misu, Yi-Ting Chen, Ashish Tawari, John Canny


Recent success suggests that deep neural control networks are likely to be a key component of self-driving vehicles. These networks are trained on large datasets to imitate human actions, but they lack semantic understanding of image contents. This makes them brittle and potentially unsafe in situations that do not match training data. Here, we propose to address this issue by augmenting training data with natural language advice from a human. Advice includes guidance about what to do and where to attend. We present the first step toward advice giving, where we train an end-to-end vehicle controller that accepts advice. The controller adapts the way it attends to the scene (visual attention) and the control (steering and speed). Attention mechanisms tie controller behavior to salient objects in the advice. We evaluate our model on a novel advisable driving dataset with manually annotated human-to-vehicle advice called Honda Research Institute-Advice Dataset (HAD). We show that taking advice improves the performance of the end-to-end network, while the network cues on a variety of visual features that are provided by advice. The dataset is available at https://usa.honda-ri.com/HAD.
[advice, driving, steering, dataset, advisable, human, wheel, state, prediction, driver, video, visualizing, traffic, honda, work, action, lane, recognition, lstm] [angle, computer, vision, heat, ground, straight, left, international, provide, note, error, derivative] [control, input, image, figure, conference, latent, corr, ieee, raw, collected] [controller, neural, performance, speed, deep, table, network, convolutional] [model, attention, visual, encoder, language, natural, vector, median, textual, visualize, turn, generated, evaluate, provided] [vehicle, feature, salient, pedestrian, three, cnn, propose] [learning, trained, training, loss, task, testing, function, proportional, pulling, train]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Jinkyu and Misu, Teruhisa and Chen, Yi-Ting and Tawari, Ashish and Canny, John},
  title = {Grounding Human-To-Vehicle Advice for Self-Driving Vehicles},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Step Prediction of Occupancy Grid Maps With Recurrent Neural Networks
Nima Mohajerin, Mohsen Rohani


We investigate the multi-step prediction of the drivable space, represented by Occupancy Grid Maps (OGMs), for autonomous vehicles. Our motivation is that accurate multi-step prediction of the drivable space can efficiently improve path planning and navigation resulting in safe, comfortable and optimum paths in autonomous driving. We train a variety of Recurrent Neural Network (RNN) based architectures on the OGM sequences from the KITTI dataset. The results demonstrate significant improvement of the prediction accuracy using our proposed difference learning method, incorporating motion related features, over the state of the art. We remove the egomotion from the OGM sequences by transforming them into a common frame. Although in the transformed sequences the KITTI dataset is heavily biased toward static objects, by learning the difference between consecutive OGMs, our proposed method provides accurate prediction over both the static and moving objects.
[ogm, prediction, ogms, dynamic, drivable, predict, planning, state, frame, static, rnn, moving, video, employed, recurrent, consecutive, future, classic, occupied, tracking, correspond, motion, sequence, shaded, determine] [occupancy, approach, autonomous, matrix, international, kitti, journal, robotics, accurate, sensor, provide, algorithm] [difference, figure, based, compensation, input, proposed, conference, result, method] [network, output, neural, architecture, deep, cell, accuracy, size, number, employ, best, performance] [model, path, observed, red, environment, arxiv, preprint, common, represent] [predicted, module, area, grid, object, false, detection, indicate, illustrated] [learning, base, space, learn, classifier, positive, data, corresponds]
@InProceedings{Mohajerin_2019_CVPR,
  author = {Mohajerin, Nima and Rohani, Mohsen},
  title = {Multi-Step Prediction of Occupancy Grid Maps With Recurrent Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
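The difference-learning idea can be illustrated by predicting only the change between consecutive OGMs and adding it back to the last observed map. The small convolutional predictor below is a stand-in for the paper's recurrent architectures; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class OGMDifferencePredictor(nn.Module):
    """Predict the next occupancy grid as last_ogm + predicted_delta, so the
    network only has to model what changes (moving objects) rather than
    re-generating the mostly static scene at every step."""
    def __init__(self, history=5, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(history, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1))

    def forward(self, past_ogms):
        # past_ogms: (B, history, H, W) occupancy probabilities in [0, 1]
        delta = self.net(past_ogms)                         # predicted change
        next_ogm = (past_ogms[:, -1:] + delta).clamp(0, 1)  # add back last OGM
        return next_ogm
```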
Connecting Touch and Vision via Cross-Modal Prediction
Yunzhu Li, Jun-Yan Zhu, Russ Tedrake, Antonio Torralba


Humans perceive the world using multi-modal sensory inputs such as vision, audition, and touch. In this work, we investigate the cross-modal connection between vision and touch. The main challenge in this cross-domain modeling task lies in the significant scale discrepancy between the two: while our eyes perceive an entire visual scene at once, humans can only feel a small region of an object at any given moment. To connect vision and touch, we introduce new tasks of synthesizing plausible tactile signals from visual inputs as well as imagining how we interact with objects given tactile data as input. To accomplish our goals, we first equip robots with both visual and tactile sensors and collect a large-scale dataset of corresponding vision and tactile image sequences. To close the scale gap, we present a new conditional adversarial model that incorporates the scale and location information of the touch. Human perceptual studies demonstrate that our model can produce realistic visual images from tactile data and vice versa. Finally, we present both qualitative and quantitative experimental results regarding different system designs, as well as visualizing the learned representations of our model.
[touch, tactile, temporal, gelsight, prediction, predict, moment, dataset, arm, work, touching, rebalancing, wenzhen, human, antonio, feel, predicting, signal, sequence, amt, sense] [vision, ground, truth, sensor, well, deformation, contact, scene, surface, position, corresponding] [image, reference, figure, method, input, conditional, edward, produce, real, generator, collect, qualitative, sensing, perceptual, realistic] [scale, output, table, deep, standard, convolutional, andrew, entire] [model, visual, adversarial, robotic, robot, gans, system, evaluate, discriminator, turkers] [object, predicted, location, help, improve] [data, learning, training, set, unseen, learned, trained, train, objective, loss, labeled, sample]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yunzhu and Zhu, Jun-Yan and Tedrake, Russ and Torralba, Antonio},
  title = {Connecting Touch and Vision via Cross-Modal Prediction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
X2CT-GAN: Reconstructing CT From Biplanar X-Rays With Generative Adversarial Networks
Xingde Ying, Heng Guo, Kai Ma, Jian Wu, Zhengxin Weng, Yefeng Zheng


Computed tomography (CT) can provide a 3D view of a patient's internal organs, facilitating disease diagnosis, but it exposes the patient to a higher radiation dose, and a CT scanner is far more costly than an X-ray machine. Traditional CT reconstruction methods require hundreds of X-ray projections through a full rotational scan of the body, which cannot be performed on a typical X-ray machine. In this work, we propose to reconstruct CT from two orthogonal X-rays using the generative adversarial network (GAN) framework. A specially designed generator network is exploited to increase the data dimension from 2D (X-rays) to 3D (CT), which is not addressed in previous research on GANs. A novel feature fusion method is proposed to combine information from the two X-rays. The mean squared error (MSE) loss and adversarial loss are combined to train the generator, resulting in a high-quality CT volume both visually and quantitatively. Extensive experiments on a publicly available chest CT dataset demonstrate the effectiveness of the proposed method. It could be a useful enhancement of a low-cost X-ray machine, providing physicians with a CT-like 3D volume in several niche applications.
[internal, previous, fusion, dataset, human] [reconstruction, computer, dense, projection, volume, single, view, vision, pattern, corresponding, shape, additional, scan] [biplanar, image, input, proposed, real, generative, reconstructed, reconstruct, generator, synthesized, ieee, method, conditional, figure, imaging, psnr, captured, mapping, paired, based, clinical, synthetic, anatomical, quality, ssim, radiation] [network, deep, orthogonal, relu, output, architecture, table, lateral, connection, process, increase] [adversarial, model, gan, encoder, discriminator, decoder, evaluation, arxiv, preprint, machine] [feature, medical, propose, improvement, chest] [loss, data, training, learning, transfer, train, learn, set, trained, novel, large]
@InProceedings{Ying_2019_CVPR,
  author = {Ying, Xingde and Guo, Heng and Ma, Kai and Wu, Jian and Weng, Zhengxin and Zheng, Yefeng},
  title = {X2CT-GAN: Reconstructing CT From Biplanar X-Rays With Generative Adversarial Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Practical Full Resolution Learned Lossless Image Compression
Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, Luc Van Gool


We propose the first practical learned lossless image compression system, L3C, and show that it outperforms the popular engineered codecs, PNG, WebP and JPEG 2000. At the core of our method is a fully parallelizable hierarchical probabilistic model for adaptive entropy coding which is optimized end-to-end for the compression task. In contrast to recent autoregressive discrete probabilistic models such as PixelCNN, our method i) models the image distribution jointly with learned auxiliary representations instead of exclusively modeling the image distribution in RGB space, and ii) only requires three forward-passes to predict all pixel probabilities instead of one for each pixel. As a result, L3C obtains over two orders of magnitude speedups when sampling compared to the fastest PixelCNN variant (Multiscale-PixelCNN). Furthermore, we find that learning the auxiliary representation is crucial and outperforms predefined auxiliary representations such as an RGB pyramid significantly.
[joint, outperforms, modeling, previous, forward, stream, dataset, predict] [rgb, discrete, note, single, classical, practical, contrast] [image, method, generative, based, pixel] [compression, pixelcnn, lossless, coding, arithmetic, webp, adaptive, table, number, autoregressive, flif, parallel, cuv, neural, network, symbol, quantization, batch, magnitude, architecture, deep, lossy, png, size, compare, full, compared, scale] [model, decoding, encode, probability, encoding, van] [feature, hierarchical, pyramid, faster, baseline, fully, three] [learned, auxiliary, distribution, learning, predictor, sampling, mixture, entropy, log, train, representation, learn, code, likelihood, logistic, training, shared, probabilistic]
@InProceedings{Mentzer_2019_CVPR,
  author = {Mentzer, Fabian and Agustsson, Eirikur and Tschannen, Michael and Timofte, Radu and Van Gool, Luc},
  title = {Practical Full Resolution Learned Lossless Image Compression},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Image-To-Image Translation via Group-Wise Deep Whitening-And-Coloring Transformation
Wonwoong Cho, Sungha Choi, David Keetae Park, Inkyu Shin, Jaegul Choo


Recently, unsupervised exemplar-based image-to-image translation, conditioned on a given exemplar without paired data, has accomplished substantial advancements. To transfer the information from an exemplar to an input image, existing methods often use a normalization technique, e.g., adaptive instance normalization, that controls the channel-wise statistics of an input activation map at a particular layer, such as the mean and the variance. Meanwhile, style transfer approaches, which tackle a task similar in nature to image translation, have demonstrated superior performance by using higher-order statistics such as the covariance among channels to represent a style. In detail, this works via whitening (given a zero-mean input feature, transforming its covariance matrix into the identity), followed by coloring (changing the covariance matrix of the whitened feature to that of the style feature). However, applying this approach to image translation is computationally intensive and error-prone due to its expensive time complexity and non-trivial backpropagation. In response, this paper proposes an end-to-end approach tailored for image translation that efficiently approximates this transformation with our novel regularization methods. We further extend our approach to a group-wise form for memory and time efficiency as well as image quality. Extensive qualitative and quantitative experiments demonstrate that our proposed method is fast, both in training and inference, and highly effective in reflecting the style of an exemplar.
[time, work, multiple, dataset, expensive] [matrix, single, approach, well, column, corresponding, david, matching] [style, image, content, translation, transformation, drit, gdwct, munit, coloring, input, whitened, wct, proposed, figure, diagonal, method, generative, sct, user, demonstrate, female, smile, comparison, jaegul, qualitative, based, color, cab, ccw, celeba, male, texture, consistency, translate] [deep, covariance, mlp, whitening, number, regularization, applying, neural, table, process, original, order, superior, performance, output] [model, adversarial, diverse, generated, indicates, vector, generate, multimodal] [feature, baseline, propose, detailed, final] [transfer, target, unsupervised, exemplar, domain, loss, novel, learning, dwt, classification, existing]
@InProceedings{Cho_2019_CVPR,
  author = {Cho, Wonwoong and Choi, Sungha and Keetae Park, David and Shin, Inkyu and Choo, Jaegul},
  title = {Image-To-Image Translation via Group-Wise Deep Whitening-And-Coloring Transformation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
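For reference, the classical whitening-and-coloring transform that the paper approximates (and speeds up with a group-wise formulation) looks like this eigendecomposition-based sketch on flattened feature maps.

```python
import torch

def whitening_coloring(content, style, eps=1e-5):
    """content, style: (C, N) feature maps flattened over spatial positions.
    Whitens the content covariance to the identity, then colors it with the
    style covariance, so the output matches the style's channel statistics."""
    c_mean = content.mean(dim=1, keepdim=True)
    s_mean = style.mean(dim=1, keepdim=True)
    content = content - c_mean
    style = style - s_mean

    def cov_eig(x):
        c = x @ x.t() / (x.shape[1] - 1) + eps * torch.eye(x.shape[0], device=x.device)
        evals, evecs = torch.linalg.eigh(c)
        return evals.clamp(min=eps), evecs

    ec, vc = cov_eig(content)
    whitened = vc @ torch.diag(ec.rsqrt()) @ vc.t() @ content   # covariance -> identity
    es, vs = cov_eig(style)
    colored = vs @ torch.diag(es.sqrt()) @ vs.t() @ whitened    # covariance -> style cov
    return colored + s_mean                                      # re-center on the style
```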
Max-Sliced Wasserstein Distance and Its Use for GANs
Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, Alexander G. Schwing


Generative adversarial nets (GANs) and variational auto-encoders have significantly improved our distribution modeling capabilities, showing promise for dataset augmentation, image-to-image translation and feature learning. However, to model high-dimensional distributions, sequential training and stacked architectures are common, increasing the number of tunable hyper-parameters as well as the training time. Nonetheless, the sample complexity of the distance metrics remains one of the factors affecting GAN training. We first show that the recently proposed sliced Wasserstein distance has compelling sample complexity properties when compared to the Wasserstein distance. To further improve the sliced Wasserstein distance we then analyze its `projection complexity' and develop the max-sliced Wasserstein distance which enjoys compelling sample complexity while reducing projection complexity, albeit necessitating a max estimation. We finally illustrate that the proposed distance trains GANs on high-dimensional images up to a resolution of 256x256 easily.
[modeling, dataset, work] [projection, direction, defined, polynomial, projected, well, dimensional, prove, supplementary, compute, induced] [generative, translation, transform, generator, high, method, proposed, image, based, real, figure, resolution] [complexity, number, deep, gaussian, better, unit, max, analyze, gradient, layer, compared, process] [wasserstein, sliced, gan, adversarial, discriminator, gans, random, arxiv, preprint, word, generated, improved, simple, enjoys, claim, argmax, parametrized, empirical, requires, lsun, program, consider, variational, probability, intuition] [feature] [distance, sample, training, learning, distribution, data, trained, space, unsupervised, train, set, surrogate, function, learn, divergence, learnt, illustrate, randomly]
@InProceedings{Deshpande_2019_CVPR,
  author = {Deshpande, Ishan and Hu, Yuan-Ting and Sun, Ruoyu and Pyrros, Ayis and Siddiqui, Nasir and Koyejo, Sanmi and Zhao, Zhizhen and Forsyth, David and Schwing, Alexander G.},
  title = {Max-Sliced Wasserstein Distance and Its Use for GANs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
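A small sketch of the sliced Wasserstein distance between two sample batches: project onto random unit directions, sort the 1-D projections, and compare. Setting reduce="max" keeps only the worst sampled direction, which only approximates the max-sliced distance (the paper optimizes the direction rather than sampling it).

```python
import torch

def sliced_wasserstein(x, y, n_proj=128, p=2, reduce="mean"):
    """x, y: (N, D) samples from the two distributions (equal N for simplicity).
    Projects both sets onto random unit directions, sorts the 1-D projections,
    and averages (or maximizes) the per-direction Wasserstein-p cost."""
    d = x.shape[1]
    dirs = torch.randn(n_proj, d, device=x.device)
    dirs = dirs / dirs.norm(dim=1, keepdim=True)           # unit projection directions
    xp = (x @ dirs.t()).sort(dim=0).values                  # (N, n_proj)
    yp = (y @ dirs.t()).sort(dim=0).values
    per_dir = (xp - yp).abs().pow(p).mean(dim=0)            # 1-D W_p^p per direction
    cost = per_dir.max() if reduce == "max" else per_dir.mean()
    return cost.pow(1.0 / p)
```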
Meta-Learning With Differentiable Convex Optimization
Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, Stefano Soatto


Many meta-learning approaches for few-shot learning rely on simple base learners such as nearest-neighbor classifiers. However, even in the few-shot regime, discriminatively trained linear predictors can offer better generalization. We propose to use these predictors as base learners to learn representations for few-shot learning and show they offer better tradeoffs between feature size and performance across a range of few-shot recognition benchmarks. Our objective is to learn feature embeddings that generalize well under a linear classification rule for novel categories. To efficiently solve the objective, we exploit two properties of linear classifiers: implicit differentiation of the optimality conditions of the convex problem and the dual formulation of the optimization problem. This allows us to use high-dimensional embeddings with improved generalization at a modest increase in computational overhead. Our approach, named MetaOptNet, achieves state-of-the-art performance on miniImageNet, tieredImageNet, CIFAR-FS, and FC100 few-shot learning benchmarks.
[work, time, dataset] [optimization, linear, convex, solver, well, error, solution, allows, equation, implicit, approach, compute, differentiable, formulation, theorem, problem] [dual, figure, method, based, prior] [network, regularization, accuracy, table, better, number, performance, size, parameter, standard, convolutional, computational, achieves, denotes] [model, vector, simple] [feature, regression] [base, test, learning, embedding, set, training, learner, miniimagenet, classification, svm, ridge, objective, class, learn, generalization, prototypical, data, shot, train, task, tieredimagenet, nearest, embeddings, observe, fewshot, classifier, dimension, function]
@InProceedings{Lee_2019_CVPR,
  author = {Lee, Kwonjoon and Maji, Subhransu and Ravichandran, Avinash and Soatto, Stefano},
  title = {Meta-Learning With Differentiable Convex Optimization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
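One of the convex base learners discussed above, ridge regression, has a closed-form solution that is easy to differentiate through; a simplified sketch follows (the paper's headline results use an SVM solved with a QP solver, which is not shown here).

```python
import torch
import torch.nn.functional as F

def ridge_regression_head(support, support_labels, query, n_way, lam=1.0):
    """support: (Ns, D) embeddings, support_labels: (Ns,) int64 labels in [0, n_way),
    query: (Nq, D). Solves W = (X^T X + lam I)^{-1} X^T Y in closed form and
    returns query logits; gradients flow back into the embedding network."""
    y = F.one_hot(support_labels, n_way).float()            # (Ns, n_way) targets
    d = support.shape[1]
    gram = support.t() @ support + lam * torch.eye(d, device=support.device)
    w = torch.linalg.solve(gram, support.t() @ y)           # (D, n_way) weights
    return query @ w                                         # (Nq, n_way) logits
```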
RePr: Improved Training of Convolutional Filters
Aaditya Prakash, James Storer, Dinei Florencio, Cha Zhang


A well-trained Convolutional Neural Network can easily be pruned without significant loss of performance. This is because of unnecessary overlap in the features captured by the network's filters. Innovations in network architecture such as skip/dense connections and inception units have mitigated this problem to some extent, but these improvements come with increased computation and memory requirements at run-time. We attempt to address this problem from another angle - not by changing the network structure but by altering the training method. We show that by temporarily pruning and then restoring a subset of the model's filters, and repeating this process cyclically, overlap in the learned features is reduced, producing improved generalization. We show that the existing model-pruning criteria are not optimal for selecting filters to prune in this context, and introduce inter-filter orthogonality as the ranking criteria to determine under-expressive filters. Our method is applicable both to vanilla convolutional networks and more complex modern architectures, and improves the performance across a variety of tasks, especially when applied to smaller networks.
[individual, time, second, term] [computer, vision, denote, single, error, compute, linear, pattern, international, percentage, analysis] [figure, method, ieee, conference, comparison] [repr, filter, accuracy, convolutional, network, standard, neural, correlation, scheme, orthogonality, deep, performance, layer, pruning, vanilla, orthogonal, rate, better, number, pruned, regularization, table, convnets, dropping, ortho, compared, efficient, lower, epoch, achieve, higher, overlap, process, convnet, weight, taylor, activation, sgd, full] [model, greedy, improved, inception, question] [improvement, three, feature, improve, improves] [training, test, learning, ranking, metric, oracle, trained, rank, train, loss, set]
@InProceedings{Prakash_2019_CVPR,
  author = {Prakash, Aaditya and Storer, James and Florencio, Dinei and Zhang, Cha},
  title = {RePr: Improved Training of Convolutional Filters},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
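The inter-filter orthogonality criterion can be sketched by normalizing each filter, measuring how far the layer's Gram matrix is from the identity, and ranking filters by their share of the overlap; this is a simplified reading of the criterion, and the cyclic prune-and-restore schedule itself is not shown.

```python
import torch

def filter_orthogonality_ranking(weight):
    """weight: (out_channels, in_channels, kH, kW) conv weights.
    Returns filter indices sorted by overlap with the other filters: higher
    overlap means more redundancy, i.e. a better candidate for temporary pruning."""
    w = weight.flatten(1)                                   # (F, in*kH*kW)
    w = w / (w.norm(dim=1, keepdim=True) + 1e-8)            # unit-norm rows
    gram = w @ w.t()                                        # (F, F) cosine overlaps
    off_diag = gram - torch.eye(w.shape[0], device=w.device)
    scores = off_diag.abs().sum(dim=1)                      # per-filter overlap
    return scores.argsort(descending=True)                  # most redundant first
```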
Tangent-Normal Adversarial Regularization for Semi-Supervised Learning
Bing Yu, Jingfeng Wu, Jinwen Ma, Zhanxing Zhu


Compared with standard supervised learning, the key difficulty in semi-supervised learning is how to make full use of the unlabeled data. A recently proposed method, virtual adversarial training (VAT), smartly performs adversarial training without label information to impose local smoothness on the classifier, which is especially beneficial to semi-supervised learning. In this work, we propose tangent-normal adversarial regularization (TNAR) as an extension of VAT that takes the data manifold into consideration. The proposed TNAR is composed of two complementary parts, the tangent adversarial regularization (TAR) and the normal adversarial regularization (NAR). In TAR, VAT is applied along the tangent space of the data manifold, aiming to enforce local invariance of the classifier on the manifold, while in NAR, VAT is performed in the normal space orthogonal to the tangent space, intending to impose robustness on the classifier against noise that causes the observed data to deviate from the underlying data manifold. As demonstrated by experiments on both artificial and practical datasets, the proposed TAR and NAR complement each other and jointly outperform other state-of-the-art methods for semi-supervised learning.
[term, second, outperforms] [tangent, underlying, virtual, normal, localized, smoothness, coordinate, local, jacobian, assumption, corresponding, optimal, note, chart, impose, enforce, point, defined, solution, smallest] [proposed, generative, based, method, noise, figure, generator, denoted] [regularization, neural, compared, table, processing, norm, performance, orthogonal, max, gradient, better, lower, effectiveness, power, iteration] [manifold, adversarial, gan, perturbation, observed, arxiv, preprint, model, encoder, artificial, robustness, decoder, improved, example, demonstrated] [propose] [data, tnar, vat, space, classifier, learning, training, labeled, distance, loss, svhn, ssl, supervised, unlabeled, classification, vae, invariance, exd, rtangent, euclidean, rmanifold, dul, datasets, fashionmnist, rnormal, entropy]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Bing and Wu, Jingfeng and Ma, Jinwen and Zhu, Zhanxing},
  title = {Tangent-Normal Adversarial Regularization for Semi-Supervised Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Auto-Encoding Scene Graphs for Image Captioning
Xu Yang, Kaihua Tang, Hanwang Zhang, Jianfei Cai


We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions. Intuitively, we humans use the inductive bias to compose collocations and contextual inference in discourse. For example, when we see the relation "person on bike", it is natural to replace "on" with "ride" and infer "person riding bike on a road" even if the "road" is not evident. Therefore, exploiting such bias as a language prior is expected to make conventional encoder-decoder models less likely to overfit to the dataset bias and more focused on reasoning. Specifically, we use the scene graph --- a directed graph (G) where an object node is connected by adjective nodes and relationship nodes --- to represent the complex structural layout of both image (I) and sentence (S). In the textual domain, we use SGAE to learn a dictionary (D) that helps to reconstruct sentences in the S -> G -> D -> S pipeline, where D encodes the desired language prior; in the vision-language domain, we use the shared D to guide the encoder-decoder in the I -> G -> D -> S pipeline. Thanks to the scene graph representation and shared dictionary, the inductive bias is transferred across domains in principle. We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark, e.g., our SGAE-based single-model achieves a new state-of-the-art 127.8 CIDEr-D on the Karpathy split, and a competitive 125.5 CIDEr-D (c40) on the official server even compared to other ensemble models. Code has been made available at: https://github.com/yangxuntu/SGAE.
[graph, framework, human, walking, dynamic, gcn, dataset, directed] [scene, computer, vision, pattern] [image, figure, conference, dictionary, ieee, attribute, proposed, side] [compared, neural, network, computational, better, table, convolutional, size, inference] [language, visual, sgae, relationship, captioning, node, sentence, decoder, generated, generation, arxiv, association, preprint, memory, model, karpathy, caption, reward, black, green, generating, natural, machine, attention, rij, corpus, sitting, umbrella, textual] [object, road, propose, feature, street, semantic, comparing] [inductive, bias, training, learning, set, embedding, web, shared, learn, knowledge, base, train, split, loss, representation]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Xu and Tang, Kaihua and Zhang, Hanwang and Cai, Jianfei},
  title = {Auto-Encoding Scene Graphs for Image Captioning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech
Aditya Deshpande, Jyoti Aneja, Liwei Wang, Alexander G. Schwing, David Forsyth


Image captioning is an ambiguous problem, with many suitable captions for an image. To address ambiguity, beam search is the de facto method for sampling multiple captions. However, beam search is computationally expensive and known to produce generic captions. To address this concern, some variational auto-encoder (VAE) and generative adversarial net (GAN) based methods have been proposed. Though diverse, GAN and VAE are less accurate. In this paper, we first predict a meaningful summary of the image, then generate the caption based on that summary. We use part-of-speech as summaries, since our summary should drive caption generation. We achieve the trifecta: (1) High accuracy for the diverse captions as evaluated by standard captioning metrics and user studies; (2) Faster computation of diverse captions compared to beam search and diverse beam search; and (3) High diversity as evaluated by counting novel sentences, distinct n-grams and mutual overlap (i.e., mBleu-4) scores.
[sequence, people, human, joint, recurrent, predict, lstm] [consensus, approach, accurate, classical] [image, method, based, produce, user, high, conference, latent, figure, conditional, side] [search, accuracy, network, neural, compare, convolutional, number, quantized, higher, better, best, top, overlap, computational, complexity, standard, table, inference] [beam, captioning, diverse, caption, tag, diversity, generated, word, generate, spice, sampled, standing, bird, language, cider, gan, model, expand, evaluation, sitting, adversarial, find, machine, gumbel] [score, faster, object] [training, sampling, set, novel, sample, test, learning, log, oracle, train, classification, ranked, posterior, space]
@InProceedings{Deshpande_2019_CVPR,
  author = {Deshpande, Aditya and Aneja, Jyoti and Wang, Liwei and Schwing, Alexander G. and Forsyth, David},
  title = {Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Attention Branch Network: Learning of Attention Mechanism for Visual Explanation
Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi


Visual explanation enables humans to understand the decision making of a deep convolutional neural network (CNN), but on its own it does not contribute to improving CNN performance. In this paper, we focus on the attention map for visual explanation, which marks attention locations with high response values in image recognition. Such an attention region can significantly improve the performance of a CNN when used in an attention mechanism that focuses on a specific region in an image. In this work, we propose Attention Branch Network (ABN), which extends a response-based visual explanation model by introducing a branch structure with an attention mechanism. ABN is applicable to several image recognition tasks by introducing a branch for the attention mechanism and is trainable for visual explanation and image recognition in an end-to-end manner. We evaluate ABN on several image recognition tasks such as image classification, fine-grained recognition, and multiple facial attribute recognition. Experimental results indicate that ABN outperforms the baseline models on these image recognition tasks while generating an attention map for visual explanation. Our code is available.
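A minimal sketch of the attention-branch idea (module names and layer sizes are illustrative, not the authors' exact architecture): a small convolutional head produces a single-channel attention map from the feature map, which is then used to re-weight the features in a residual style before the perception (classification) branch.

```python
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    """Produce a single-channel attention map from features and use it to
    re-weight them before the downstream classification branch."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.att_conv = nn.Sequential(
            nn.Conv2d(channels, num_classes, kernel_size=1),
            nn.Conv2d(num_classes, 1, kernel_size=1),
            nn.Sigmoid(),                      # attention map in [0, 1]
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        att = self.att_conv(feat)              # (N, 1, H, W)
        return feat * (1.0 + att)              # residual-style re-weighting

refined = AttentionBranch(256, 10)(torch.randn(2, 256, 14, 14))
```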
[recognition, multiple, consists, forward, dataset, visualizing, highlight] [computer, vision, pattern, international] [image, conference, facial, input, figure, attribute, feed, comparison] [abn, convolution, network, layer, resnet, residual, neural, accuracy, deep, introducing, performance, structure, imagenet, table, compare, convolutional, number, original, applied, block, interpret, vggnet, weight, processing, senet, constructed] [attention, visual, model, mechanism, explanation, perception, evaluate, probability, decision, making, visualize, maker, machine] [map, feature, branch, cam, baseline, cnn, car, response, location, region, score] [learning, training, specific, class, conventional, task, extractor, classification, loss, gap, testing, function]
@InProceedings{Fukui_2019_CVPR,
  author = {Fukui, Hiroshi and Hirakawa, Tsubasa and Yamashita, Takayoshi and Fujiyoshi, Hironobu},
  title = {Attention Branch Network: Learning of Attention Mechanism for Visual Explanation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Cascaded Projection: End-To-End Network Compression and Acceleration
Breton Minnehan, Andreas Savakis


We propose a data-driven approach for deep convolutional neural network compression that achieves high accuracy with high throughput and low memory requirements. Current network compression methods either find a low-rank factorization of the features that requires more memory or select only a subset of features by pruning entire filter channels. We propose the Cascaded Projection (CaP) compression method that projects the output and input filter channels of successive layers to a unified low dimensional space based on a low-rank projection. We optimize the projection to minimize classification loss and the difference between the next layer's features in the compressed and uncompressed networks. To solve this non-convex optimization problem we propose a new optimization method of a proxy matrix using backpropagation and Stochastic Gradient Descent (SGD) with geometric constraints. Our cascaded projection approach leads to improvements in all critical areas of network compression: high accuracy, low memory consumption, low parameter count and high processing speed. The proposed CaP method demonstrates state-of-the-art results compressing VGG16 and ResNet networks with over 4X reduction in the number of computations and excellent performance in top-5 accuracy on the ImageNet dataset before and after fine-tuning.
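The sketch below illustrates, under simplifying assumptions, what projecting the shared channel dimension of two successive convolution layers could look like; here the projection is taken from an SVD of the first layer's filters, whereas the paper learns the projection with SGD under geometric constraints. All names are hypothetical.

```python
import torch

def project_pair(w1: torch.Tensor, w2: torch.Tensor, k: int):
    """Compress the interface between two consecutive conv layers with a shared
    rank-k projection of the channel dimension they have in common.

    w1: (C_out, C_in, kh, kw)   -- earlier layer
    w2: (C_next, C_out, kh, kw) -- later layer, consumes w1's C_out channels
    """
    u, _, _ = torch.linalg.svd(w1.flatten(1), full_matrices=False)
    p = u[:, :k]                                        # (C_out, k) projection
    w1_small = torch.einsum('oihw,ok->kihw', w1, p)     # (k, C_in, kh, kw)
    w2_small = torch.einsum('nohw,ok->nkhw', w2, p)     # (C_next, k, kh, kw)
    return w1_small, w2_small, p

w1s, w2s, p = project_pair(torch.randn(64, 32, 3, 3), torch.randn(128, 64, 3, 3), k=32)
```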
[work, recognition, perform] [projection, optimization, reconstruction, vision, computer, matrix, error, international, dimensional, linear, approach, single, pattern, problem, require, additional, optimal, provide] [method, conference, based, ieee, figure, input, proposed, high, amount, demonstrate] [compression, network, cap, layer, pruning, deep, neural, accuracy, factorization, convolutional, compressed, residual, low, number, kernel, backpropagation, compress, performed, table, filter, parameter, efficient, reduce, standard, compressing, convolution, original, channel, compare, optimize, processing, resnet, performance, impact] [memory, arxiv, preprint, machine] [feature, cascaded, propose, baseline] [classification, loss, learning, training, trained, proxy, representation, large, space]
@InProceedings{Minnehan_2019_CVPR,
  author = {Minnehan, Breton and Savakis, Andreas},
  title = {Cascaded Projection: End-To-End Network Compression and Acceleration},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DeepCaps: Going Deeper With Capsule Networks
Jathushan Rajasegaran, Vinoj Jayasundara, Sandaru Jayasekara, Hirunima Jayasekara, Suranga Seneviratne, Ranga Rodrigo


Capsule Network is a promising concept in deep learning, yet its true potential has not been fully realized thus far, providing sub-par performance on several key benchmark datasets with complex data. Drawing intuition from the success achieved by Convolutional Neural Networks (CNNs) by going deeper, we introduce DeepCaps, a deep capsule network architecture which uses a novel 3D convolution based dynamic routing algorithm. With DeepCaps, we surpass state-of-the-art results for capsule networks on CIFAR10, SVHN and Fashion MNIST, while achieving a 68% reduction in the number of parameters. Further, we propose a class independent decoder network, which strengthens the use of reconstruction loss as a regularization term. This leads to an interesting property of the decoder, which allows us to identify and control the physical attributes of the images represented by the instantiation parameters.
[capsule, routing, instantiation, dynamic, deepcaps, activity, complex, convcaps, sabour, stacking, capsnet, spqr, represented, work] [localized, algorithm, rotation, single, vertical, depth, corresponding] [proposed, image, figure, input, comparison, reconstructed, high, half] [layer, network, parameter, number, deep, performance, convolutional, architecture, connected, deeper, convolution, skip, variance, higher, better, tensor, regularization, neural, lower, order, kernel, cell, gradient, achieve, output, low, weighted, causing] [decoder, model, going, true, physical, vector] [propose, level, fully, feature, spatial] [class, novel, data, mnist, learning, datasets, existing, svhn, training, loss, specific, agreement, function]
@InProceedings{Rajasegaran_2019_CVPR,
  author = {Rajasegaran, Jathushan and Jayasundara, Vinoj and Jayasekara, Sandaru and Jayasekara, Hirunima and Seneviratne, Suranga and Rodrigo, Ranga},
  title = {DeepCaps: Going Deeper With Capsule Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, Kurt Keutzer


Designing accurate and efficient ConvNets for mobile devices is challenging because the design space is combinatorially large. Due to this, previous neural architecture search (NAS) methods are computationally expensive. ConvNet architecture optimality depends on factors such as input resolution and target devices. However, existing approaches are too resource demanding for case-by-case redesigns. Also, previous work focuses primarily on reducing FLOPs, but FLOP count does not always reflect actual latency. To address these, we propose a differentiable neural architecture search (DNAS) framework that uses gradient-based methods to optimize ConvNet architectures, avoiding enumerating and training individual architectures separately as in previous methods. FBNets (Facebook-Berkeley-Nets), a family of models discovered by DNAS surpass state-of-the-art models both designed manually and generated automatically. FBNet-B achieves 74.1% top-1 accuracy on ImageNet with 295M FLOPs and 23.1 ms latency on a Samsung S8 phone, 2.4x smaller and 1.5x faster than MobileNetV2-1.3 with similar accuracy. Despite higher accuracy and lower latency than MnasNet, we estimate FBNet-B's search cost is 420x smaller than MnasNet's, at only 216 GPU-hours. Searched for different resolutions and channel sizes, FBNets achieve 1.5% to 6.4% higher accuracy than MobileNetV2. The smallest FBNet achieves 50.2% accuracy and 2.9 ms latency (345 frames per second) on a Samsung S8. Over a Samsung-optimized FBNet, the iPhone-X-optimized model achieves a 1.4x speedup on an iPhone X. FBNet models are open-sourced at https://github.com/facebookresearch/mobile-vision.
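A hedged sketch of the differentiable search idea: candidate blocks in one searchable layer are mixed with Gumbel-softmax weights over architecture parameters, and an expected latency is computed from a per-block lookup table so it can enter the training objective. The candidate ops, latency numbers, and class names below are placeholders, not the released FBNet code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableLayer(nn.Module):
    """One DNAS 'super layer': candidate blocks are mixed with Gumbel-softmax
    weights so the architecture parameters receive gradients."""
    def __init__(self, candidates, latencies_ms):
        super().__init__()
        self.ops = nn.ModuleList(candidates)
        self.register_buffer('latency', torch.tensor(latencies_ms))
        self.theta = nn.Parameter(torch.zeros(len(candidates)))  # architecture params

    def forward(self, x, temperature: float = 5.0):
        mask = F.gumbel_softmax(self.theta, tau=temperature, hard=False)
        out = sum(m * op(x) for m, op in zip(mask, self.ops))
        expected_latency = (mask * self.latency).sum()            # differentiable
        return out, expected_latency

layer = SearchableLayer([nn.Conv2d(16, 16, 3, padding=1),
                         nn.Conv2d(16, 16, 5, padding=2),
                         nn.Identity()],
                        latencies_ms=[1.8, 3.1, 0.0])
y, lat = layer(torch.randn(2, 16, 32, 32))
# latency-aware objective, roughly CE * alpha * log(latency)^beta in the paper
```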
[previous, dataset] [differentiable, problem, manual, optimal, runtime, computer, vision, respect, estimate] [input, super, figure, iphone, resolution, based, conference] [search, latency, architecture, accuracy, convnet, neural, size, cost, block, efficient, design, table, channel, flop, samsung, achieves, parameter, denotes, smaller, stochastic, searched, better, network, actual, convnets, mobile, imagenet, achieve, layer, reduce, lower, operator, net, efficiency, cell, automatically, computational, kernel, structure, mnasnet, convolution, optimize, fbnets, hardware, compared, depthwise, designed, fbnet] [model, arxiv, preprint, choose, probability, gumbel, reinforcement, candidate] [faster, count, three] [space, target, training, loss, trained, train, learning, sampling, function, distribution]
@InProceedings{Wu_2019_CVPR,
  author = {Wu, Bichen and Dai, Xiaoliang and Zhang, Peizhao and Wang, Yanghan and Sun, Fei and Wu, Yiming and Tian, Yuandong and Vajda, Peter and Jia, Yangqing and Keutzer, Kurt},
  title = {FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
APDrawingGAN: Generating Artistic Portrait Drawings From Face Photos With Hierarchical GANs
Ran Yi, Yong-Jin Liu, Yu-Kun Lai, Paul L. Rosin


Significant progress has been made with image stylization using deep learning, especially with generative adversarial networks (GANs). However, existing methods fail to produce high quality artistic portrait drawings. Such drawings have a highly abstract style, containing a sparse set of continuous graphical elements such as lines, and so small artifacts are much more exposed than for painting styles. Moreover, artists tend to use different strategies to draw different facial features and the lines drawn are only loosely related to obvious image features. To address these challenges, we propose APDrawingGAN, a novel GAN based architecture that builds upon hierarchical generators and discriminators combining both a global network (for images as a whole) and local networks (for individual facial regions). This allows dedicated drawing strategies to be learned for different facial features. Since artists' drawings may not have lines perfectly aligned with image features, we develop a novel loss to measure similarity between generated and artists' drawings based on distance transforms, leading to improved strokes in portrait drawing. To train APDrawingGAN, we construct an artistic drawing dataset containing high-resolution portrait photos and corresponding professional artistic drawings. Extensive experiments, including a user study, show that APDrawingGAN produces significantly better artistic drawings than state-of-the-art methods.
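To illustrate the distance-transform idea, here is a minimal sketch (not the authors' exact loss) that scores a generated line drawing by how far its strokes fall from the nearest artist stroke and vice versa, which tolerates small stroke misalignments better than a per-pixel loss; the threshold and function name are assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dt_line_loss(generated: np.ndarray, artist: np.ndarray, thresh: float = 0.5) -> float:
    """Both inputs are grayscale drawings in [0, 1] where dark pixels are lines."""
    gen_lines = generated < thresh
    art_lines = artist < thresh
    # Distance from every pixel to the nearest line pixel of the *other* drawing.
    dist_to_artist = distance_transform_edt(~art_lines)
    dist_to_generated = distance_transform_edt(~gen_lines)
    loss = dist_to_artist[gen_lines].mean() + dist_to_generated[art_lines].mean()
    return float(loss)

# toy example with random "drawings"
print(dt_line_loss(np.random.rand(256, 256), np.random.rand(256, 256)))
```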
[capture, work, dataset] [local, computer, vision, good, rendering, paul, ground, truth] [style, image, portrait, facial, face, apdrawing, method, input, drawing, artistic, hair, apdrawinggan, cyclegan, apdrawings, content, real, idt, proposed, photo, generator, analogy, based, pixel, eye, headshot, npr, conference, cnnmrf, delicate, synthesized, ldt, ieee, gatys, figure, drawn, synthesis, comparison, stylization, generative, transform, texture, study, nose] [deep, output, neural, network, convolutional, architecture, dedicated, designed] [model, discriminator, gan, adversarial, generates, white, generate, generated] [global, hierarchical, region, propose] [loss, transfer, training, set, learn, distance, function, test, novel, data, target]
@InProceedings{Yi_2019_CVPR,
  author = {Yi, Ran and Liu, Yong-Jin and Lai, Yu-Kun and Rosin, Paul L.},
  title = {APDrawingGAN: Generating Artistic Portrait Drawings From Face Photos With Hierarchical GANs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Constrained Generative Adversarial Networks for Interactive Image Generation
Eric Heim


Generative Adversarial Networks (GANs) have received a great deal of attention due in part to recent success in generating original, high-quality samples from visual domains. However, most current methods only allow for users to guide this image generation process through limited interactions. In this work we develop a novel GAN framework that allows humans to be "in-the-loop" of the image generation process. Our technique iteratively accepts relative constraints of the form "Generate an image more like image A than image B". After each constraint is given, the user is presented with new outputs from the GAN, informing the next round of feedback. This feedback is used to constrain the output of the GAN with respect to an underlying semantic space that can be designed to model a variety of different notions of similarity (e.g. classes, attributes, object relationships, color, etc.). In our experiments, we show that our GAN framework is able to generate images that are of comparable quality to equivalent unsupervised GANs while satisfying a large number of the constraints provided by users, effectively changing a GAN into one that allows users interactive control over image generation without sacrificing image quality.
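One simple way such a relative constraint ("more like A than B") could be penalised is a triplet-style hinge in the underlying semantic embedding space, as sketched below; the margin, embeddings, and function name are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def relative_constraint_loss(gen_emb, emb_a, emb_b, margin: float = 1.0):
    """Hinge penalty that is zero once the generated sample's embedding is
    closer to A than to B by at least the margin."""
    d_a = F.pairwise_distance(gen_emb, emb_a)
    d_b = F.pairwise_distance(gen_emb, emb_b)
    return F.relu(margin + d_a - d_b).mean()

loss = relative_constraint_loss(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
```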
[work, framework, lstm] [constraint, relative, allows, single, form, feasible, error, algorithm] [image, generator, input, generative, produce, user, method, figure, produced, celeba, guide, noise, imagery, color, result, transpose, variety, quality, attribute, conditional] [network, output, order, convolutional, size, layer, process, processing, deep, neural, table, satisfies] [congan, wgan, discriminator, generation, gan, generated, model, adversarial, write, visual, gans, read, constrained, generate, vector, critic, chose, wasserstein, van, arxiv, preprint, attention, provided, evaluation] [three, semantic, interactive] [set, training, data, trained, train, mnist, space, learning, learn, loss, class, representation, novel, satisfy, positive, test, similarity, accept]
@InProceedings{Heim_2019_CVPR,
  author = {Heim, Eric},
  title = {Constrained Generative Adversarial Networks for Interactive Image Generation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
WarpGAN: Automatic Caricature Generation
Yichun Shi, Debayan Deb, Anil K. Jain


We propose, WarpGAN, a fully automatic network that can generate caricatures given an input face photo. Besides transferring rich texture styles, WarpGAN learns to automatically predict a set of control points that can warp the photo into a caricature, while preserving identity. We introduce an identity-preserving adversarial loss that aids the discriminator to distinguish between different subjects. Moreover, WarpGAN allows customization of the generated caricatures by controlling the exaggeration extent and the visual styles. Experimental results on a public domain dataset, WebCaricature, show that WarpGAN is capable of generating caricatures that not only preserve the identities but also outputs a diverse set of caricatures for each input photo. Five caricature experts suggest that caricatures generated by WarpGAN are visually similar to hand-drawn ones and only prominent facial features are exaggerated.
[warping, warp, recognition, capture] [shape, deformation, geometric, exp, rendering, local] [image, caricature, style, texture, warpgan, face, figure, exaggeration, control, facial, real, identity, generator, photo, input, method, proposed, perceptual, content, quality, generative, translation, latent, transferring, study, cyclegan, patch, exaggerated, exaggerate, comparison, mapping, synthesized, visually] [network, neural, table, automatically, compared, convolutional, number, accuracy] [generated, adversarial, discriminator, automatic, generation, visual, encoder, decoder, system, generating, example, generate] [three, spatial, module, deformable, feature] [transfer, loss, set, trained, learning, space, domain, large, train, unsupervised, testing]
@InProceedings{Shi_2019_CVPR,
  author = {Shi, Yichun and Deb, Debayan and Jain, Anil K.},
  title = {WarpGAN: Automatic Caricature Generation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Explainability Methods for Graph Convolutional Neural Networks
Phillip E. Pope, Soheil Kolouri, Mohammad Rostami, Charles E. Martin, Heiko Hoffmann


With the growing use of graph convolutional neural networks (GCNNs) comes the need for explainability. In this paper, we introduce explainability methods for GCNNs. We develop the graph analogues of three prominent explainability methods for convolutional neural networks: contrastive gradient-based (CG) saliency maps, Class Activation Mapping (CAM), and Excitation Back-Propagation (EB) and their variants, gradient-weighted CAM (Grad-CAM) and contrastive EB (c-EB). We show a proof-of-concept of these methods on classification problems in two application domains: visual scene graphs and molecular graphs. To compare the methods, we identify three desirable properties of explanations: (1) their importance to classification, as measured by the impact of occlusions, (2) their contrastivity with respect to different classes, and (3) their sparseness on a graph. We call the corresponding quantitative metrics fidelity, contrastivity, and sparsity and evaluate them for each method. Lastly, we analyze the salient subgraphs obtained from explanations and report frequently occurring patterns.
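As a rough sketch of the graph analogue of Grad-CAM described here: channel weights are obtained by averaging the class-score gradient over nodes, and node importances are the ReLU of the weighted sum of that layer's node features. The toy graph layer below is a stand-in for a real GCNN.

```python
import torch
import torch.nn.functional as F

def graph_grad_cam(node_feats: torch.Tensor, class_score: torch.Tensor) -> torch.Tensor:
    """Grad-CAM-style node importances for a graph network.

    node_feats: (num_nodes, num_feats) intermediate activations in the autograd graph.
    class_score: scalar logit of the class being explained.
    """
    grads, = torch.autograd.grad(class_score, node_feats, retain_graph=True)
    alpha = grads.mean(dim=0)                       # one weight per feature channel
    return F.relu(node_feats @ alpha)               # (num_nodes,) node heat values

# toy example: a single "graph conv"-like layer followed by mean pooling
x = torch.randn(6, 8, requires_grad=True)
feats = torch.tanh(x @ torch.randn(8, 16))
logit = feats.mean(dim=0) @ torch.randn(16)
print(graph_grad_cam(feats, logit))
```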
[graph, dataset, report, work, signal, prediction] [scene, computer, vision, defined, functional, define, pattern, analysis, matrix, respect] [input, method, figure, ieee, image, conference, spectral, fidelity, application, mapping] [convolutional, explainability, neural, gcnns, layer, deep, molecular, contrastivity, cnns, gradient, sparsity, molecule, activation, relu, processing, table, excitation, designed, number, network, chemical, hrl, llc, top] [visual, explanation, identified, model, calculated, arxiv, identify, preprint] [three, saliency, salient, feature, cam, bounding, highest, including, map] [class, classification, learning, data, softmax, contrastive, gap, large, kipf, paper, training, positive]
@InProceedings{Pope_2019_CVPR,
  author = {Pope, Phillip E. and Kolouri, Soheil and Rostami, Mohammad and Martin, Charles E. and Hoffmann, Heiko},
  title = {Explainability Methods for Graph Convolutional Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Generative Adversarial Density Estimator
M. Ehsan Abbasnejad, Qinfeng Shi, Anton van den Hengel, Lingqiao Liu


Density estimation is a challenging unsupervised learning problem. Current maximum likelihood approaches for density estimation are either restrictive or incapable of producing high-quality samples. On the other hand, likelihood-free models such as generative adversarial networks produce sharp samples without a density model. The lack of a density estimate limits the applications to which the sampled data can be put, however. We propose a Generative Adversarial Density Estimator, a density estimation approach that bridges the gap between the two. Allowing for a prior on the parameters of the model, we extend our density estimator to a Bayesian model where we can leverage the predictive variance to measure our confidence in the likelihood. Our experiments on challenging applications such as visual dialog, where the density and the confidence in predictions are crucial, show the effectiveness of our approach.
[dataset] [approach, international, equation, estimate, estimation, estimator, jacobian, algorithm, confidence, computer, compute, local, problem, respect, note, lemma, well, monte, matrix] [generative, conference, generator, image, figure, transformation, prior, quality, real, invertible, noise, based, drawn] [density, bayesian, deep, parameter, variance, neural, output, table, network, gradient, higher, stochastic, employ, rate, better, efficient, normalizer, inference] [adversarial, generated, model, dialog, visual, machine, discriminator, true, generate, mode, variational, ability, maximizing, van, den, gans, expected, answer, carlo] [easy] [distribution, function, learning, likelihood, space, sample, log, entropy, maximum, training, predictive, data, convergence, posterior, train, uncertainty, update]
@InProceedings{Abbasnejad_2019_CVPR,
  author = {Ehsan Abbasnejad, M. and Shi, Qinfeng and van den Hengel, Anton and Liu, Lingqiao},
  title = {A Generative Adversarial Density Estimator},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SoDeep: A Sorting Deep Net to Learn Ranking Loss Surrogates
Martin Engilberge, Louis Chevallier, Patrick Perez, Matthieu Cord


Several tasks in machine learning are evaluated using non-differentiable metrics such as mean average precision or Spearman correlation. However, their non-differentiability prevents them from being used as objective functions in a learning framework. Surrogate and relaxation methods exist but tend to be specific to a given metric. In the present work, we introduce a new method to learn approximations of such non-differentiable objective functions. Our approach is based on a deep architecture that approximates the sorting of arbitrary sets of scores. It is trained virtually for free using synthetic data. This sorting deep (SoDeep) net can then be combined in a plug-and-play manner with existing deep architectures. We demonstrate the value of our approach on three different tasks that require ranking: Cross-modal text-image retrieval, multi-label image classification and visual memorability ranking. Our approach yields very competitive results on these three tasks, which validates the merit and the flexibility of SoDeep as a proxy for the sorting operation in ranking-based losses.
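A minimal sketch of the SoDeep idea under simplifying assumptions: a tiny network is pre-trained on synthetic score vectors to output their normalized ranks, after which it can serve as a differentiable stand-in for the sorting operation inside a ranking loss. The architecture and training budget below are illustrative, not the paper's sorter.

```python
import torch
import torch.nn as nn

class TinySorter(nn.Module):
    """Toy differentiable 'sorter': maps a vector of n scores to approximate
    normalized ranks, trained for free on random synthetic vectors."""
    def __init__(self, n: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, n))

    def forward(self, scores):
        return self.net(scores)

n = 16
sorter = TinySorter(n)
opt = torch.optim.Adam(sorter.parameters(), lr=1e-3)
for _ in range(2000):                                   # synthetic pre-training
    s = torch.rand(64, n)
    true_ranks = s.argsort(dim=1).argsort(dim=1).float() / (n - 1)
    loss = nn.functional.l1_loss(sorter(s), true_ranks)
    opt.zero_grad(); loss.backward(); opt.step()

# Frozen, the sorter acts as a differentiable proxy: push predicted scores so
# their approximate ranks match the ground-truth ranks (Spearman-like loss).
```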
[sorting, prediction, lstm, combined, recurrent, video, dataset, recognition] [differentiable, approach, defined, handcrafted, expressed, optimization, local] [image, based, proposed, synthetic, figure, method, raw, input] [deep, spearman, network, precision, architecture, convolutional, neural, correlation, performance, output, standard, number, design] [model, vector, evaluation, evaluate, machine, visual, arxiv, preprint] [average, score, cnn, recall, evaluated, predicted, object, map, baseline, three, propose] [loss, training, rank, sorter, function, learning, memorability, ranking, sodeep, trained, retrieval, surrogate, learn, task, classification, set, embedding, objective, proxy, metric, data, margin, learned, pairwise]
@InProceedings{Engilberge_2019_CVPR,
  author = {Engilberge, Martin and Chevallier, Louis and Perez, Patrick and Cord, Matthieu},
  title = {SoDeep: A Sorting Deep Net to Learn Ranking Loss Surrogates},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
High-Quality Face Capture Using Anatomical Muscles
Michael Bao, Matthew Cong, Stephane Grabli, Ronald Fedkiw


Muscle-based systems have the potential to provide both anatomical accuracy and semantic interpretability as compared to blendshape models; however, a lack of expressivity and differentiability has limited their impact. Thus, we propose modifying a recently developed rather expressive muscle-based system in order to make it fully-differentiable; in fact, our proposed modifications allow this physically robust and anatomically accurate muscle model to conveniently be driven by an underlying blendshape basis. Our formulation is intuitive, natural, as well as monolithically and fully coupled such that one can differentiate the model from end to end, which makes it viable for both optimization and learning-based approaches for a variety of applications. We illustrate this with a number of examples including both shape matching of three-dimensional geometry as well as the automatic determination of a three-dimensional facial pose from a single two-dimensional RGB image without using markers or depth information.
[capture, joint, force] [simulation, computer, equation, mesh, geometry, volume, solve, pose, linear, surface, vertex, rgb, deformation, optimization, skinning, form, tetrahedral, vision, well, shape, michael, blend, single, volumetric, symposium, international, matthew, solving, differentiable, targeting, matrix, note, problem, finite] [muscle, blendshape, face, facial, figure, acm, neutral, method, conference, ieee, anatomical, ronald, flesh, blendshapes, image, animation, quasistatic, ffvm, poisson, demonstrate, animator, tetrahedralized, morph, jaw] [activation, performance, full] [model, system, write, physical, create, ability] [driven, semantic, curve, boundary, efficacy] [target, set, data, independent]
@InProceedings{Bao_2019_CVPR,
  author = {Bao, Michael and Cong, Matthew and Grabli, Stephane and Fedkiw, Ronald},
  title = {High-Quality Face Capture Using Anatomical Muscles},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
FML: Face Model Learning From Videos
Ayush Tewari, Florian Bernard, Pablo Garrido, Gaurav Bharaj, Mohamed Elgharib, Hans-Peter Seidel, Patrick Perez, Michael Zollhofer, Christian Theobalt


Monocular image-based 3D reconstruction of faces is a long-standing problem in computer vision. Since image data is a 2D projection of a 3D face, the resulting depth ambiguity makes the problem ill-posed. Most existing methods rely on data-driven priors that are built from limited 3D face scans. In contrast, we propose multi-frame video-based self-supervised training of a deep network that (i) learns a face identity model both in shape and appearance while (ii) jointly learning to reconstruct 3D faces. Our face model is learned using only corpora of in-the-wild video clips collected from the Internet. This virtually endless source of training data enables learning of a highly general 3D face model. In order to achieve this, we propose a novel multi-frame consistency loss that ensures consistent shape and appearance across multiple frames of a subject's face, thus minimizing depth ambiguity. At test time we can use an arbitrary number of frames, so that we can perform both monocular as well as multi-frame reconstruction.
[video, learns, graph, multiple, dataset, jointly, recognition, perform] [reconstruction, shape, monocular, approach, computer, deformation, geometry, mesh, illumination, pattern, vision, reflectance, well, parametric, single, pose, depth, albedo, note, differentiable, vertex, camera, linear, rigid, defined, estimation, scene, problem, rely] [face, facial, identity, appearance, based, ieee, acm, expression, image, morphable, zollh, conference, figure, consistency, blendshape, tran, landmark, input, quality] [network, better, employ, deep, number] [model, basis, represent, enables] [head, propose, feature, coarse] [learning, training, data, learned, learn, set, test, large, train, space, existing, shared, novel, loss]
@InProceedings{Tewari_2019_CVPR,
  author = {Tewari, Ayush and Bernard, Florian and Garrido, Pablo and Bharaj, Gaurav and Elgharib, Mohamed and Seidel, Hans-Peter and Perez, Patrick and Zollhofer, Michael and Theobalt, Christian},
  title = {FML: Face Model Learning From Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations
Xiao Zhang, Rui Zhao, Yu Qiao, Xiaogang Wang, Hongsheng Li


The cosine-based softmax losses and their variants achieve great success in deep learning based face recognition. However, hyperparameter settings in these losses have significant influences on the optimization path as well as the final recognition performance. Manually tuning those hyperparameters heavily relies on user experience and requires many training tricks. In this paper, we investigate in depth the effects of two important hyperparameters of cosine-based softmax losses, the scale parameter and angular margin parameter, by analyzing how they modulate the predicted classification probability. Based on this analysis, we propose a novel cosine-based softmax loss, AdaCos, which is hyperparameter-free and leverages an adaptive scale parameter to automatically strengthen the training supervisions during the training process. We apply the proposed AdaCos loss to large-scale face verification and identification datasets, including LFW, MegaFace, and IJB-C 1:1 Verification. Our results show that training deep neural networks with the AdaCos loss is stable and able to achieve high face recognition accuracy. Our method outperforms state-of-the-art softmax losses on all the three datasets.
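The fixed variant of the scale is reported in the paper as s = sqrt(2) * log(C - 1) for C classes; the rest of the sketch below (cosine logits, batch sizes, and the comment on the dynamic variant) is an illustrative reconstruction rather than the authors' code.

```python
import math
import torch
import torch.nn.functional as F

def cosine_logits(features: torch.Tensor, class_weights: torch.Tensor) -> torch.Tensor:
    """Cosine similarities between L2-normalised features and class centres."""
    return F.normalize(features, dim=1) @ F.normalize(class_weights, dim=1).t()

def adacos_fixed_scale(num_classes: int) -> float:
    """Hyperparameter-free fixed scale for cosine softmax losses."""
    return math.sqrt(2.0) * math.log(num_classes - 1)

num_classes = 10_000
s = adacos_fixed_scale(num_classes)
feats, weights = torch.randn(32, 512), torch.randn(num_classes, 512)
labels = torch.randint(0, num_classes, (32,))
loss = F.cross_entropy(s * cosine_logits(feats, weights), labels)
# The dynamic variant re-estimates s each iteration from the median angle to the
# ground-truth class and the batch-averaged logits of the other classes.
```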
[recognition, dynamic, dataset] [computer, range, angle, corresponding, pattern, vision, international, optimization] [face, proposed, conference, ieee, based, facial, method, change, figure] [parameter, scale, deep, fixed, adaptive, neural, compared, performance, network, table, automatically, gradually, small, scaling, verification, size, number, iteration] [probability, arxiv, preprint, med, median, machine] [feature, average, predicted, logits, xiaogang, final, propose] [loss, training, adacos, softmax, margin, learning, arcface, cosface, class, classification, cosine, large, function, hyperparameters, megaface, lfw, angular, convergence, log, cleaned, hyperparameter, data, set, webface, trained, vggface, facenet, metric, sample]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Xiao and Zhao, Rui and Qiao, Yu and Wang, Xiaogang and Li, Hongsheng},
  title = {AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
3D Hand Shape and Pose Estimation From a Single RGB Image
Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, Junsong Yuan


This work addresses a novel and challenging problem of estimating the full 3D hand shape and pose from a single RGB image. Most current methods in 3D hand analysis from monocular RGB images only focus on estimating the 3D locations of hand keypoints, which cannot fully express the 3D shape of the hand. In contrast, we propose a Graph Convolutional Neural Network (Graph CNN) based method to reconstruct a full 3D mesh of the hand surface that contains richer information of both 3D hand shape and pose. To train networks with full supervision, we create a large-scale synthetic dataset containing both ground truth 3D meshes and 3D poses. When fine-tuning the networks on real-world datasets without 3D ground truth, we propose a weakly-supervised approach by leveraging the depth map as a weak supervision in training. Through extensive evaluations on our proposed new datasets and two public datasets, we show that our proposed method can produce accurate and reasonable 3D hand meshes, and can achieve superior 3D hand pose estimation accuracy when compared with state-of-the-art methods.
[dataset, graph, joint, human, tracking, manner, work] [hand, pose, mesh, depth, shape, estimation, rgb, truth, ground, single, error, stb, estimate, estimated, pck, liuhao, surface, rhd, junsong, analysis, monocular, estimating, well, camera, reconstruction, michael, mano, hourglass, direct, accurate, directly, regress] [method, synthetic, image, proposed, input, figure, latent, ieee, reference] [full, network, convolutional, neural, deep, table, output, performance, accuracy, better] [model, generated, create, evaluate] [map, baseline, feature, cnn, propose, weak, edge, supervision] [loss, training, datasets, set, learning, train, task, data, trained]
@InProceedings{Ge_2019_CVPR,
  author = {Ge, Liuhao and Ren, Zhou and Li, Yuncheng and Xue, Zehao and Wang, Yingying and Cai, Jianfei and Yuan, Junsong},
  title = {3D Hand Shape and Pose Estimation From a Single RGB Image},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
3D Hand Shape and Pose From Images in the Wild
Adnane Boukhayma, Rodrigo de Bem, Philip H.S. Torr


We present in this work the first end-to-end deep learning based method that predicts both 3D hand shape and pose from RGB images in the wild. Our network consists of the concatenation of a deep convolutional encoder, and a fixed model-based decoder. Given an input image, and optionally 2D joint detections obtained from an independent CNN, the encoder predicts a set of hand and view parameters. The decoder has two components: A pre-computed articulated mesh deformation hand model that generates a 3D mesh from the hand parameters, and a re-projection module controlled by the view parameters that projects the generated hand into the image domain. We show that using the shape and pose prior knowledge encoded in the hand model within a deep learning framework yields state-of-the-art performance in 3D pose prediction from images on standard benchmarks, and produces geometrically valid and plausible 3D reconstructions. Additionally, we show that training with weak supervision in the form of 2D joint annotations on datasets of images in the wild, in conjunction with full supervision in the form of 3D joint annotations on limited available datasets allows for good generalization to 3D shape and pose predictions on images in the wild.
[joint, dataset, work, tracking, human, egocentric, predict, skeleton, interaction, multiple] [hand, pose, shape, mesh, estimation, depth, rgb, pck, view, pii, camera, single, blend, exter, error, reconstruction, linear, fit, monocular, optimization, articulated, problem, pipeline, mano, tereo, spurr, fitting, predicts, deformation, computer, skinning, perspective] [figure, method, input, image, synthetic, based, acm, wild, prior, real, proposed, generative] [deep, table, convolutional, network, performance, size, low, efficient] [model, encoder, plausible, vector, decoder, generates, evaluate] [annotated, challenging, propose, weak, average] [training, learning, datasets, loss, set, zsl, distance, data, trained, combination, train]
@InProceedings{Boukhayma_2019_CVPR,
  author = {Boukhayma, Adnane and de Bem, Rodrigo and Torr, Philip H.S.},
  title = {3D Hand Shape and Pose From Images in the Wild},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Self-Supervised 3D Hand Pose Estimation Through Training by Fitting
Chengde Wan, Thomas Probst, Luc Van Gool, Angela Yao


We present a self-supervision method for 3D hand pose estimation from depth maps. We begin with a neural network initialized with synthesized data and fine-tune it on real but unlabelled depth maps by minimizing a set of data-fitting terms. By approximating the hand surface with a set of spheres, we design a differentiable hand renderer to align estimates by comparing the rendered and input depth maps. In addition, we place a set of priors including a data-driven term to further regulate the estimate's kinematic feasibility. Our method makes highly accurate estimates comparable to current supervised methods which require large amounts of labelled training samples, thereby advancing the state of the art in unsupervised learning for hand pose estimation.
[joint, term, tracking, current, human, multiple, bone] [hand, pose, depth, error, estimation, estimated, accurate, sphere, single, percentage, differentiable, view, rendered, surface, robust, fitting, kinematic, approach, nyu, manual, directly, hdepth, lvae, corresponding] [synthesized, method, figure, real, input, prior, consistency, proposed, synthetic, based, latent] [network, accuracy, deep, table, neural, highly, apply] [model, length, successful, evaluate] [map, supervision, average, improve] [training, data, set, trained, learning, loss, large, train, distance, test, unsupervised, vae, testing, domain, discriminative, conventional, unlabelled, supervised, labelled]
@InProceedings{Wan_2019_CVPR,
  author = {Wan, Chengde and Probst, Thomas and Van Gool, Luc and Yao, Angela},
  title = {Self-Supervised 3D Hand Pose Estimation Through Training by Fitting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark
Jiefeng Li, Can Wang, Hao Zhu, Yihuan Mao, Hao-Shu Fang, Cewu Lu


Multi-person pose estimation is fundamental to many computer vision tasks and has made significant progress in recent years. However, few previous methods have explored the problem of pose estimation in crowded scenes, even though such scenes remain challenging and unavoidable in many scenarios. Moreover, current benchmarks cannot provide an appropriate evaluation for such cases. In this paper, we propose a novel and efficient method to tackle the problem of pose estimation in the crowd and a new dataset to better evaluate algorithms. Our model consists of two key components: joint-candidate single person pose estimation (SPPE) and global maximum joints association. With multi-peak prediction for each joint and global association using the graph model, our method is robust to inevitable interference in crowded scenes and very efficient in inference. The proposed method surpasses the state-of-the-art methods on the CrowdPose dataset by 5.2 mAP and results on the MSCOCO dataset demonstrate the generalization ability of our method.
[human, joint, dataset, sppe, crowdpose, graph, cewu, current, predict, tackle, previous, alphapose, crowding, uncrowded, build, mar] [pose, estimation, problem, interference, algorithm, computer, vision, single] [method, figure, proposed, input, based, result] [performance, number, table, ith, network, efficient, connection, group, better, achieve, redundant] [association, mscoco, evaluate, node, candidate, model, arxiv, preprint, greedy] [crowded, crowd, person, global, map, bounding, detection, response, propose, heatmap, proposal, mask, average, detector, benchmark, public, final, level, box, three] [loss, set, target, distribution, test, conventional, novel, training, function, datasets]
@InProceedings{Li_2019_CVPR,
  author = {Li, Jiefeng and Wang, Can and Zhu, Hao and Mao, Yihuan and Fang, Hao-Shu and Lu, Cewu},
  title = {CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Towards Social Artificial Intelligence: Nonverbal Social Signal Prediction in a Triadic Interaction
Hanbyul Joo, Tomas Simon, Mina Cikara, Yaser Sheikh


We present a new research task and a dataset to understand human social interactions via computational methods, to ultimately endow machines with the ability to encode and decode a broad channel of social signals humans use. This research direction is essential to make a machine that genuinely communicates with humans, which we call Social Artificial Intelligence. We first formulate the "social signal prediction" problem as a way to model the dynamics of social signals exchanged among interacting individuals in a data-driven way. We then present a new 3D motion capture dataset to explore this problem, where the broad spectrum of social signals (3D body, face, and hand motions) are captured in a triadic social interaction scenario. Baseline approaches to predict speaking status, social formation, and body gestures of interacting individuals are presented in the defined social prediction framework.
[social, motion, prediction, signal, human, speaking, nonverbal, interaction, dataset, behavior, capture, predicting, individual, haggling, interacting, predict, behavioral, time, people, triadic, affective, multiple, conversational, broad, work, trajectory, previous, fpb, modeling, verbal, gesture, focus, proxemics] [body, formation, pose, problem, orientation, yaser, direction, hand, approach, position, supplementary, defined, vision] [facial, face, input, study, expression, method, result, figure, spectrum] [table, correlation, output, performance, network, neural, original] [communication, model, natural, system, strong, game, machine, language, attention, diverse, situation, visual] [person, location, including, baseline, exist] [target, data, function, datasets, learning, training, specific, predictive]
@InProceedings{Joo_2019_CVPR,
  author = {Joo, Hanbyul and Simon, Tomas and Cikara, Mina and Sheikh, Yaser},
  title = {Towards Social Artificial Intelligence: Nonverbal Social Signal Prediction in a Triadic Interaction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
HoloPose: Holistic 3D Human Reconstruction In-The-Wild
Riza Alp Guler, Iasonas Kokkinos


We introduce HoloPose, a method for holistic monocular 3D human body reconstruction. We first introduce a part-based model for 3D model parameter regression that allows our method to operate in-the-wild, gracefully handling severe occlusions and large pose variation. We further train a multi-task network comprising 2D, 3D and Dense Pose estimation to drive the 3D reconstruction task. For this we introduce an iterative refinement method that aligns the model-based 3D estimates of 2D/3D joint positions and DensePose with their image-based counterparts delivered by CNNs, achieving both model-based, global consistency and high spatial accuracy thanks to the bottom-up CNN processing. We validate our contributions on challenging benchmarks, showing that our method allows us to get both accurate joint and 3D surface estimates while operating at more than 10fps in-the-wild. More information about our approach, including videos and demos is available at http://arielai.com/holopose.
[human, joint, recognition, multiple, prediction, work, perform] [pose, computer, reconstruction, shape, vision, pattern, body, estimation, surface, densepose, monocular, dense, estimate, angle, june, international, michael, linear, allows, single, keypoint, correspondence, keypoints, fitting, volume, mesh, error, estimated, rotation, position, geometric, javier, accurate, smpl, differentiable, reprojection, kinematic, approach, geodesic] [ieee, conference, image, prior, method, figure, based, qualitative] [network, accuracy, layer, performance, deep, process, better, parameter, architecture, convolutional] [model, system, introduce, iterative, simple, synergistic] [refinement, regression, object, localization, head, holistic, detection, jitendra, cnn] [learning, loss, alignment, trained, space, train, training, function]
@InProceedings{Guler_2019_CVPR,
  author = {Alp Guler, Riza and Kokkinos, Iasonas},
  title = {HoloPose: Holistic 3D Human Reconstruction In-The-Wild},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation
Xipeng Chen, Kwan-Yee Lin, Wentao Liu, Chen Qian, Liang Lin


Recent studies have shown remarkable advances in 3D human pose estimation from monocular images, with the help of large-scale indoor 3D datasets and sophisticated network architectures. However, the generalizability to different environments remains an elusive goal. In this work, we propose a geometry-aware 3D representation for the human pose to address this limitation by using multiple views in a simple auto-encoder model at the training stage and only 2D keypoint information as supervision. A view synthesis framework is proposed to learn the shared 3D representation between viewpoints by synthesizing the human pose from one viewpoint to the other one. Instead of performing a direct transfer at the raw image level, we propose a skeleton-based encoder-decoder mechanism to distil only pose-related representation in the latent space. A learning-based representation consistency constraint is further introduced to facilitate the robustness of latent 3D representation. Since the learnt representation encodes 3D geometry information, mapping it to 3D pose will be much easier than conventional frameworks that use an image or 2D coordinates as the input of 3D pose estimator. We demonstrate our approach on the task of 3D human pose estimation. Comprehensive experiments on three popular benchmarks show that our model can significantly improve the performance of state-of-the-art methods with simply injecting the representation as a robust 3D prior.
[human, skeleton, framework, dataset, mpii, capture, multiple, prediction] [pose, geometry, estimation, view, constraint, camera, robust, monocular, body, estimator, shape, relative, error, approach, directly, viewpoint] [latent, consistency, image, figure, synthesis, proposed, mapping, input, chen, raw, demonstrate, amount, sophisticated, facilitate, constrain] [network, structure, effectiveness, usage, denotes, performance, deep] [model, robustness, evaluation, mechanism, referred, simple, simply, encoder] [baseline, annotation, three, feature, propose, annotated, map, object, regression] [representation, learnt, learning, training, learn, target, space, source, loss, train, datasets, refers, regarded, domain, existing, data, sun, novel, shared, conventional, large]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Xipeng and Lin, Kwan-Yee and Liu, Wentao and Qian, Chen and Lin, Liang},
  title = {Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations
Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Gerard Pons-Moll, Christian Theobalt


Convolutional Neural Network based approaches for monocular 3D human pose estimation usually require a large amount of training images with 3D pose annotations. While it is feasible to provide 2D joint annotations for large corpora of in-the-wild images with humans, providing accurate 3D annotations to such in-the-wild corpora is hardly feasible in practice. Most existing 3D labelled data sets are either synthetically created or feature in-studio images. 3D pose estimation algorithms trained on such data often have limited ability to generalize to real world scene diversity. We therefore propose a new deep learning based method for monocular 3D human pose estimation that shows high accuracy and generalizes better to in-the-wild scenes. It has a network architecture that comprises a new disentangled hidden space encoding of explicit 2D and 3D features, and uses supervision by a new learned projection model from predicted 3D pose. Our algorithm can be jointly trained on image data with 3D labels and image data with only 2D labels. It achieves state-of-the-art accuracy on challenging in-the-wild data.
[human, prediction, joint, bone, recognition, predict, capture, motion, work, predicting] [pose, estimation, computer, vision, body, ground, projection, pattern, depth, truth, additional, monocular, approach, explicit, pck, outdoor, general, studio, lifting, well, vectorized, accurate, single, keypoints, camera, keypoint, frgb, mpjpe, lsp, scaled, international] [method, conference, ieee, image, proposed, latent, figure, input] [network, neural, deep, convolutional, accuracy, performance, architecture, achieve, table, achieves, better, design] [evaluation, model] [supervision, feature, predicted, heatmap, weak, benchmark, challenging, baseline] [training, data, trained, learning, loss, representation, learned, train, learn, datasets, test, space]
@InProceedings{Habibie_2019_CVPR,
  author = {Habibie, Ikhsanul and Xu, Weipeng and Mehta, Dushyant and Pons-Moll, Gerard and Theobalt, Christian},
  title = {In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Slim DensePose: Thrifty Learning From Sparse Annotations and Motion Cues
Natalia Neverova, James Thewlis, Riza Alp Guler, Iasonas Kokkinos, Andrea Vedaldi


DensePose supersedes traditional landmark detectors by densely mapping image pixels to body surface coordinates. This power, however, comes at a greatly increased annotation cost, as supervising the model requires manually labelling hundreds of points per pose instance. In this work, we thus seek methods to significantly slim down the DensePose annotations, proposing more efficient data collection strategies. In particular, we demonstrate that if annotations are collected in video frames, their efficacy can be multiplied for free by using motion cues. To explore this idea, we introduce DensePose-Track, a dataset of videos where selected frames are annotated in the traditional DensePose manner. Then, building on geometric properties of the DensePose mapping, we use the video dynamics to propagate ground-truth annotations in time as well as to learn from Siamese equivariance constraints. Having performed an exhaustive empirical evaluation of various data annotation and learning strategies, we demonstrate that doing so can deliver significantly improved pose estimation results over strong baselines. However, despite what is suggested by some recent works, we show that merely synthesizing motion patterns by applying geometric transformations to isolated frames is significantly less effective, and that motion cues help much more when they are extracted from videos.
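As a toy illustration of multiplying annotation efficacy with motion cues, the sketch below carries sparse point annotations from an annotated frame to the next frame by following a precomputed optical-flow field at each annotated pixel; real propagation needs occlusion handling and sub-pixel care, and all names here are hypothetical.

```python
import numpy as np

def propagate_points(points_xy: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Carry sparse point annotations from an annotated frame to the next frame
    by following the optical flow at each annotated pixel.

    points_xy: (N, 2) integer pixel coordinates (x, y) in the annotated frame.
    flow:      (H, W, 2) forward flow from the annotated frame to the next one.
    """
    x, y = points_xy[:, 0], points_xy[:, 1]
    shifted = points_xy + flow[y, x]            # add the per-point displacement
    h, w = flow.shape[:2]
    return np.clip(shifted, 0, [w - 1, h - 1])  # keep points inside the image

pts = np.array([[10, 20], [55, 40]])
new_pts = propagate_points(pts, np.random.randn(64, 96, 2).astype(np.float32))
```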
[flow, dataset, motion, human, optical, video, frame, posetrack, propagation, work, temporal, manually, complex, cheaper, time] [dense, densepose, pose, estimation, hourglass, body, correspondence, keypoints, geometric, field, surface, point, manual, constraint, approach, chart, shape] [image, synthetic, real, landmark, based, figure, mapping, collect, traditional, collected, amount, pixel] [table, number, performance, network, reduced, best, order, output, sparse, full, cost, better, applying] [model, strong] [equivariance, object, annotation, baseline, supervision, coco, semantic, segmentation, person, iasonas, stage] [training, learning, data, train, subset, trained, loss, unsupervised, learned, collecting]
@InProceedings{Neverova_2019_CVPR,
  author = {Neverova, Natalia and Thewlis, James and Alp Guler, Riza and Kokkinos, Iasonas and Vedaldi, Andrea},
  title = {Slim DensePose: Thrifty Learning From Sparse Annotations and Motion Cues},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Self-Supervised Representation Learning From Videos for Facial Action Unit Detection
Yong Li, Jiabei Zeng, Shiguang Shan, Xilin Chen


In this paper, we aim to learn discriminative representations for facial action unit (AU) detection from a large amount of videos without manual annotations. Inspired by the fact that facial actions are the movements of facial muscles, we depict the movements as the transformation between two face images in different frames and use it as the self-supervisory signal to learn the representations. However, under uncontrolled conditions, the transformation is caused by both facial actions and head motions. To remove the influence of head motions, we propose a Twin-Cycle Autoencoder (TCAE) that can disentangle the facial action related movements and the head motion related ones. Specifically, TCAE is trained to respectively change the facial actions and head poses of the source face to those of the target face. Our experiments validate TCAE's capability of decoupling the movements. Experimental results also demonstrate that the learned representation is discriminative for AU detection, where TCAE outperforms or is comparable with the state-of-the-art self-supervised learning methods and supervised AU detection methods.
[action, selfsupervised, dataset, outperforms, motion, video, frame] [pose, descriptor, local, shape, reconstruction, denote] [face, facial, tcae, image, cycle, change, proposed, disfa, txy, gft, jeffrey, pixel, caused, disentangle, splitbrain, changed, changing, ave, figure, disentangled, disentangling] [unit, relu, deep, layer, convolutional, original, comparable, size, compared] [generated, decoder, encoder, attention] [head, detection, feature, average, propose, location] [source, target, learning, embeddings, representation, supervised, learn, emotion, discriminative, trained, sampling, autoencoder, training, loss, china, learned, supervisory, labelled, embedding]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yong and Zeng, Jiabei and Shan, Shiguang and Chen, Xilin},
  title = {Self-Supervised Representation Learning From Videos for Facial Action Unit Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Combining 3D Morphable Models: A Large Scale Face-And-Head Model
Stylianos Ploumpis, Haoyang Wang, Nick Pears, William A. P. Smith, Stefanos Zafeiriou


Three-dimensional Morphable Models (3DMMs) are powerful statistical tools for representing the 3D surfaces of an object class. In this context, we identify an interesting question that has previously not received research attention: is it possible to combine two or more 3DMMs that (a) are built using different templates that perhaps only partly overlap, (b) have different representation capabilities and (c) are built from different datasets that may not be publicly-available? In answering this question, we make two contributions. First, we propose two methods for solving this problem: i. use a regressor to complete missing parts of one model using the other, ii. use the Gaussian Process framework to blend covariance matrices from multiple models. Second, as an example application of our approach, we build a new head and face model that combines the variability and facial detail of the LSFM with the full head modelling of the LYHM. The resulting combined model achieves state-of-the-art performance and outperforms existing head models by a large margin. Finally, as an application experiment, we reconstruct full head representations from single, unconstrained images by utilizing our proposed large-scale model in conjunction with the Face-Warehouse blendshapes for handling expressions.
[human, combined, recognition, work, build] [shape, matrix, computer, mesh, point, registration, error, template, international, single, ear, corresponding, pattern, scan, principal, compute, deformation, thomas, registered, vision, reconstruction, analysis] [face, facial, morphable, lyhm, conference, statistical, lsfm, based, ieee, figure, bespoke, proposed, pca, method, utilizing, nicp, raw, latent, image, reference, cfhm, fitted, age, detail, reconstruct] [full, covariance, gaussian, process, order, william, entire, original, compared] [model] [head, regression, final, built, combine, region, area, object] [combination, generalization, posterior, large, methodology, distance, space, set, test, representation, data]
@InProceedings{Ploumpis_2019_CVPR,
  author = {Ploumpis, Stylianos and Wang, Haoyang and Pears, Nick and Smith, William A. P. and Zafeiriou, Stefanos},
  title = {Combining 3D Morphable Models: A Large Scale Face-And-Head Model},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Boosting Local Shape Matching for Dense 3D Face Correspondence
Zhenfeng Fan, Xiyuan Hu, Chen Chen, Silong Peng


Dense 3D face correspondence is a fundamental and challenging issue in the literature of 3D face analysis. Correspondence between two 3D faces can be viewed as a non-rigid registration problem that one deforms into the other, which is commonly guided by a few facial landmarks in many existing works. However, the current works seldom consider the problem of incoherent deformation caused by landmarks. In this paper, we explicitly formulate the deformation as locally rigid motions guided by some seed points, and the formulated deformation satisfies coherent local motions everywhere on a face. The seed points are initialized by a few landmarks, and are then augmented to boost shape matching between the template and the target face step by step, to finally achieve dense correspondence. In each step, we employ a hierarchical scheme for local shape registration, together with a Gaussian reweighting strategy for accurate matching of local features around the seed points. In our experiments, we evaluate the proposed method extensively on several datasets, including two publicly available ones: FRGC v2.0 and BU-3DFE. The experimental results demonstrate that our method can achieve accurate feature correspondence, coherent local shape motion, and compact data representation. These merits actually settle some important issues for practical applications, such as expressions, noise, and partial data.
[coherent, considering, work, motion] [correspondence, point, registration, shape, dense, computer, local, template, deformation, pattern, algorithm, corresponded, mesh, vision, international, accurate, matching, rigid, problem, analysis, surface, journal, manual, icp, compute, condition, augmented, finally, optimal, affine, note, fundamental] [face, ieee, facial, conference, method, based, landmark, proposed, morphable, database, acm, expression, nicp, missing, frgc, statistical, control, result] [process, achieve, weighted, number, gaussian, compared, boosting, initialized] [model, machine, step, automatic, locally, common] [seed, propose, guided, boost, feature, three, detection, global] [data, large, target, alignment, existing, strategy, distance, selected, set, viewed]
@InProceedings{Fan_2019_CVPR,
  author = {Fan, Zhenfeng and Hu, Xiyuan and Chen, Chen and Peng, Silong},
  title = {Boosting Local Shape Matching for Dense 3D Face Correspondence},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Part-Based Disentangling of Object Shape and Appearance
Dominik Lorenz, Leonard Bereska, Timo Milbich, Bjorn Ommer


Large intra-class variation is the result of changes in multiple object characteristics. Images, however, only show the superposition of different variable factors such as appearance or shape. Therefore, learning to disentangle and represent these different characteristics poses a great challenge, especially in the unsupervised case. Moreover, large object articulation calls for a flexible part-based model. We present an unsupervised approach for disentangling appearance and shape by learning parts consistently over all instances of a category. Our model for learning an object representation is trained by simultaneously exploiting invariance and equivariance constraints between synthetically transformed images. Since no part annotation or prior information on an object class is required, the approach is applicable to arbitrary classes. We evaluate our approach on a wide range of object categories and diverse tasks including pose prediction, disentangled image synthesis, and video-to-video translation. The approach outperforms the state-of-the-art on unsupervised keypoint prediction and compares favorably even against supervised approaches on the task of shape and appearance transfer.
[human, video, penn, dataset, action, multiple, prediction] [shape, pose, approach, local, note, body, reconstruction, error, articulation, estimation, rigid, keypoints, contrast, groundtruth, project] [appearance, image, landmark, disentangled, disentangling, figure, background, generative, bbc, disentangle, change, variation, color, reconstruct, transformation, conditional, synthesis] [deep, activation, network, table, performance, structure, flexible] [model, evaluate, visual, arxiv, diverse, decoder, provided, preprint] [object, equivariance, person, spatial, head, semantic, feature, mafl, supervision, detection] [unsupervised, representation, learning, invariance, supervised, test, learn, cat, large, task, training, set, target, loss, consistently, learned, datasets, trained]
@InProceedings{Lorenz_2019_CVPR,
  author = {Lorenz, Dominik and Bereska, Leonard and Milbich, Timo and Ommer, Bjorn},
  title = {Unsupervised Part-Based Disentangling of Object Shape and Appearance},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Monocular Total Capture: Posing Face, Body, and Hands in the Wild
Donglai Xiang, Hanbyul Joo, Yaser Sheikh


We present the first method to capture the 3D total motion of a target person from a monocular view input. Given an image or a monocular video, our method reconstructs the motion from body, face, and fingers represented by a 3D deformable mesh model. We use an efficient representation called 3D Part Orientation Fields (POFs), to encode the 3D orientations of all body parts in the common 2D image space. POFs are predicted by a Fully Convolutional Network, along with the joint confidence maps. To train our network, we collect a new 3D human motion dataset capturing diverse total body motion of 40 subjects in a multiview system. We leverage a 3D deformable human model to reconstruct total body pose from the CNN outputs with the aid of the pose and shape prior in the model. We also present a texture-based tracking method to obtain temporally coherent motion capture output. We perform thorough quantitative evaluations including comparison with the existing body-specific and hand-specific methods, and performance analysis on camera viewpoint and human pose changes. Finally, we demonstrate the results of our total body motion capture on various challenging in-the-wild videos.
[human, motion, joint, capture, tracking, dataset, previous, work, frame, flow, optical, follow, capturing, hanbyul] [pose, body, hand, estimation, mesh, total, monocular, orientation, single, depth, error, christian, confidence, shape, michael, fitting, camera, pof, defined, keypoint, yaser, pofs, optimization, ground, view, fit, photometric, keypoints, compute, constraint, estimate, rgb, fpof, note, mpjpe, javier, srinath, multiview] [image, method, input, texture, prior, face, result, consistency, figure, facial, reconstruct, demonstrate, based, captured, expression, comparison] [network, performance, output, convolutional, table] [model, evaluation] [deformable, cnn, person, fully, stage] [function, target, training, adam, objective, train, data, representation]
@InProceedings{Xiang_2019_CVPR,
  author = {Xiang, Donglai and Joo, Hanbyul and Sheikh, Yaser},
  title = {Monocular Total Capture: Posing Face, Body, and Hands in the Wild},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Expressive Body Capture: 3D Hands, Face, and Body From a Single Image
Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, Michael J. Black


To facilitate the analysis of human actions, interactions and emotions, we compute a 3D model of human body pose, hand pose, and facial expression from a single monocular image. To achieve this, we use thousands of 3D scans to train a new, unified, 3D model of the human body, SMPL-X, that extends SMPL with fully articulated hands and an expressive face. Learning to regress the parameters of SMPL-X directly from images is challenging without paired images and 3D ground truth. Consequently, we follow the approach of SMPLify, which estimates 2D features and then optimizes model parameters to fit the features. We improve on SMPLify in several significant ways: (1) we detect 2D features corresponding to the face, hands, and feet and fit the full SMPL-X model to these; (2) we train a new neural network pose prior using a large MoCap dataset; (3) we define a new interpenetration penalty that is both fast and accurate; (4) we automatically detect gender and the appropriate body models (male, female, or neutral); (5) our PyTorch implementation achieves a speedup of more than 8x over Chumpy. We use the new method, SMPLify-X, to fit SMPL-X to both controlled images and images in the wild. We evaluate 3D accuracy on a new curated dataset comprising 100 images with pseudo ground-truth. This is a step towards automatic expressive human capture from monocular RGB data. The models, code, and data are available for research purposes at https://smpl-x.is.tue.mpg.de.
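The general optimization-based fitting strategy (minimize a confidence-weighted 2D reprojection error plus priors over pose and shape) can be sketched as below; body_model and camera are hypothetical callables, and the quadratic prior is a stand-in for the learned pose prior and interpenetration penalty used in the paper.

import torch

def fit_body_model(body_model, camera, keypoints_2d, conf, num_iters=100):
    # body_model(pose, shape) -> (J, 3) joints; camera(joints_3d) -> (J, 2) projections.
    # keypoints_2d: (J, 2) detections; conf: (J,) detection confidences.
    pose = torch.zeros(72, requires_grad=True)
    shape = torch.zeros(10, requires_grad=True)
    optimizer = torch.optim.LBFGS([pose, shape], max_iter=num_iters)

    def closure():
        optimizer.zero_grad()
        joints_3d = body_model(pose, shape)
        proj = camera(joints_3d)
        reproj = (conf.unsqueeze(-1) * (proj - keypoints_2d) ** 2).sum()
        prior = (pose ** 2).sum() + (shape ** 2).sum()   # stand-in for learned priors
        loss = reproj + 1e-2 * prior
        loss.backward()
        return loss

    optimizer.step(closure)
    return pose.detach(), shape.detach()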
[human, capture, dataset, joint, motion, term, modeling, capturing, collision, work, perform, tracking, follow] [pose, body, shape, hand, fit, expressive, single, michael, smpl, rgb, blend, error, approach, frank, estimate, smplify, estimation, optimization, fitting, surface, keypoints, linear, javier, monocular, articulated, corresponding, openpose, camera, computer, matthew, directly, pytorch, well] [face, facial, gender, image, acm, prior, expression, figure, method, based, major, flame, pca, latent, qualitative] [full, employ, table, penalty, implementation] [model, natural, appropriate] [holistic, head] [data, space, learning, training, learn, train, large, function, trained, learned, pseudo, datasets]
@InProceedings{Pavlakos_2019_CVPR,
  author = {Pavlakos, Georgios and Choutas, Vasileios and Ghorbani, Nima and Bolkart, Timo and Osman, Ahmed A. A. and Tzionas, Dimitrios and Black, Michael J.},
  title = {Expressive Body Capture: 3D Hands, Face, and Body From a Single Image},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Neural RGB(r)D Sensing: Depth and Uncertainty From a Video Camera
Chao Liu, Jinwei Gu, Kihwan Kim, Srinivasa G. Narasimhan, Jan Kautz


Depth sensing is crucial for 3D reconstruction and scene understanding. Active depth sensors provide dense metric measurements, but often suffer from limitations such as restricted operating ranges, low spatial resolution, sensor interference, and high power consumption. In this paper, we propose a deep learning (DL) method to estimate per-pixel depth and its uncertainty continuously from a monocular video stream, with the goal of effectively turning an RGB camera into an RGB-D camera. Unlike prior DL-based methods, we estimate a depth probability distribution for each pixel rather than a single depth value, leading to an estimate of a 3D depth probability volume for each input frame. These depth probability volumes are accumulated over time under a Bayesian filtering framework as more incoming frames are processed sequentially, which effectively reduces depth uncertainty and improves accuracy, robustness, and temporal stability. Compared to prior work, the proposed approach achieves more accurate and stable results, and generalizes better to new datasets. Experimental results also show the output of our approach can be directly fed into classical RGB-D based 3D scanning methods for 3D scene reconstruction.
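A minimal sketch of the Bayesian accumulation idea, assuming the previous depth probability volume has already been warped into the current view; the damping factor, shapes, and helper names are assumptions.

import torch

def accumulate_depth_probability(prev_log_dpv, new_log_dpv, damping=0.8):
    # Both volumes are (D, H, W) log-probabilities over D depth hypotheses per pixel;
    # damping < 1 discounts old evidence so the filter can adapt to scene changes.
    fused = damping * prev_log_dpv + new_log_dpv
    # renormalize so each pixel's distribution over depth sums to one
    return fused - torch.logsumexp(fused, dim=0, keepdim=True)

def depth_and_uncertainty(log_dpv, depth_values):
    # point estimate (expected depth) and per-pixel variance from the fused volume
    prob = log_dpv.exp()                                   # (D, H, W)
    depth = (prob * depth_values.view(-1, 1, 1)).sum(0)
    var = (prob * (depth_values.view(-1, 1, 1) - depth) ** 2).sum(0)
    return depth, var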
[frame, time, recognition, video, window, temporal, prediction, work, predict, motion, second] [depth, vision, camera, computer, confidence, dpv, estimation, view, pattern, dorn, monocular, local, pose, reconstruction, dense, volume, demon, kitti, estimate, directly, international, estimated, stereo, relative, scene, sensor, indoor, active, single, approach, scanning, dpvs, compute, virtual] [conference, method, ieee, input, figure, filtering, sensing, high, reference, image, based, statistical, comparison] [bayesian, deep, scale, network, low, better, table, compared, compare, accuracy, performance] [probability, implemented, robustness, correct] [map, improve, european, integrate, global, spatial] [learning, uncertainty, distribution, datasets, trained, update, metric]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Chao and Gu, Jinwei and Kim, Kihwan and Narasimhan, Srinivasa G. and Kautz, Jan},
  title = {Neural RGB(r)D Sensing: Depth and Uncertainty From a Video Camera},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DAVANet: Stereo Deblurring With View Aggregation
Shangchen Zhou, Jiawei Zhang, Wangmeng Zuo, Haozhe Xie, Jinshan Pan, Jimmy S. Ren


Nowadays, stereo cameras are commonly adopted in emerging devices such as dual-lens smartphones and unmanned aerial vehicles. However, they also suffer from blurry images in dynamic scenes, which lead to visual discomfort and hamper further image processing. Previous works have succeeded in monocular deblurring, yet there are few studies on deblurring for stereoscopic images. By exploiting the two-view nature of stereo images, we propose a novel stereo image deblurring network with Depth Awareness and View Aggregation, named DAVANet. In our proposed network, 3D scene cues from the depth and varying information from two views are incorporated, which help to remove complex spatially-varying blur in dynamic scenes. Specifically, with our proposed fusion network, we integrate bidirectional disparity estimation and deblurring into a unified framework. Moreover, we present a large-scale multi-scene dataset for stereo deblurring, containing 20,637 blurry-sharp stereo image pairs from 135 diverse sequences and their corresponding bidirectional disparities. The experimental results on our dataset demonstrate that DAVANet outperforms state-of-the-art methods in terms of accuracy, speed, and model size.
[dataset, dynamic, motion, video, bidirectional, frame, flow, consists, fusion, time, stereoscopic, outperforms] [stereo, depth, view, scene, disparity, camera, single, estimation, left, relative, corresponding, estimate, estimated, truth, well, note, ground, monocular, varying] [image, deblurring, blur, proposed, figure, deblurnet, blurry, dispbinet, method, davanet, remove, sharp, fusionnet, psnr, input, spatially, removal, jinshan, prior, gopro, ssim] [network, convolutional, deep, neural, residual, variant, size, table, effectiveness, rate, compare, aggregation] [generate, evaluate, model, natural, decoder, diverse] [context, propose, awareness, module, object, three, feature, map, cnn, help] [large, loss, train, training, learning]
@InProceedings{Zhou_2019_CVPR,
  author = {Zhou, Shangchen and Zhang, Jiawei and Zuo, Wangmeng and Xie, Haozhe and Pan, Jinshan and Ren, Jimmy S.},
  title = {DAVANet: Stereo Deblurring With View Aggregation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
DVC: An End-To-End Deep Video Compression Framework
Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, Zhiyong Gao


Conventional video compression approaches use the predictive coding architecture and encode the corresponding motion information and residual information. In this paper, taking advantage of both classical architecture in the conventional video compression method and the powerful non-linear representation ability of neural networks, we propose the first end-to-end video compression deep model that jointly optimizes all the components for video compression. Specifically, learning based optical flow estimation is utilized to obtain the motion information and reconstruct the current frames. Then we employ two auto-encoder style neural networks to compress the corresponding motion and residual information. All the modules are jointly learned through a single loss function, in which they collaborate with each other by considering the trade-off between reducing the number of compression bits and improving quality of the decoded video. Experimental results show that the proposed approach can outperform the widely used video coding standard H.264 in terms of PSNR and be even on par with the latest standard H.265 in terms of MS-SSIM. Code is released at https://github.com/GuoLusjtu/DVC.
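The joint objective can be summarized as a standard rate-distortion loss; the sketch below is illustrative, with the bit estimates assumed to come from the entropy models of the two auto-encoders and the trade-off weight chosen arbitrarily.

import torch

def rate_distortion_loss(x, x_hat, motion_bits, residual_bits, lam=256.0):
    # x, x_hat: original and reconstructed frames, shape (B, C, H, W);
    # *_bits:   estimated bit counts for the compressed motion and residual;
    # lam:      trades off distortion against rate, as in R + lambda * D.
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    distortion = torch.mean((x - x_hat) ** 2)             # MSE distortion
    rate = (motion_bits + residual_bits) / num_pixels     # bits per pixel
    return rate + lam * distortion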
[motion, video, frame, flow, optical, bpp, framework, previous, hevc, current, warped, dataset, work, jointly, joint, codec, uvg] [estimation, corresponding, approach, provide, computer, distortion, estimate, measured, directly, estimated, vision] [based, proposed, image, reconstructed, method, psnr, traditional, compensation, transform, denoted, quality, figure] [compression, network, residual, deep, bit, neural, number, rate, quantized, coding, performance, original, better, compress, block, optimized, quantization, compared, magnitude, standard, higher, order, lot, represents] [step, encoding, model, encoder, decoder, arxiv, preprint, required, system, provided, vector, probability, encode, generate] [map, module, predicted, improve] [learning, training, representation, class, experimental, entropy, strategy, loss, distribution]
@InProceedings{Lu_2019_CVPR,
  author = {Lu, Guo and Ouyang, Wanli and Xu, Dong and Zhang, Xiaoyun and Cai, Chunlei and Gao, Zhiyong},
  title = {DVC: An End-To-End Deep Video Compression Framework},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SOSNet: Second Order Similarity Regularization for Local Descriptor Learning
Yurun Tian, Xin Yu, Bin Fan, Fuchao Wu, Huub Heijnen, Vassileios Balntas


Despite the fact that Second Order Similarity (SOS) has been used with significant success in tasks such as graph matching and clustering, it has not been exploited for learning local descriptors. In this work, we explore the potential of SOS in the field of descriptor learning by building upon the intuition that a positive pair of matching points should exhibit similar distances with respect to other points in the embedding space. Thus, we propose a novel regularization term, named Second Order Similarity Regularization (SOSR), that follows this principle. By incorporating SOSR into training, our learned descriptor achieves state-of-the-art performance on several challenging benchmarks containing distinct tasks ranging from local patch retrieval to structure from motion. Furthermore, by designing a von Mises-Fisher distribution based evaluation method, we link the utilization of the descriptor space to the matching performance, thus demonstrating the effectiveness of our proposed SOSR. Extensive experimental results, empirical evidence, and in-depth analysis are provided, indicating that SOSR can significantly boost the matching performance of the learned descriptor.
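The core intuition, that matching descriptor pairs should exhibit similar distances to the other pairs in a batch, can be written as a regularization term as follows; normalization details and the combination with the first-order triplet loss are omitted, so this is a sketch rather than the released SOSNet code.

import torch

def second_order_similarity_reg(anchors, positives):
    # anchors, positives: (N, D) L2-normalized descriptors; row i of each is a match.
    # For every pair, its distances to all other pairs in the batch should agree.
    d_aa = torch.cdist(anchors, anchors)        # (N, N) anchor-to-anchor distances
    d_pp = torch.cdist(positives, positives)    # (N, N) positive-to-positive distances
    n = anchors.shape[0]
    off_diag = (~torch.eye(n, dtype=torch.bool, device=anchors.device)).float()
    diff = ((d_aa - d_pp) ** 2) * off_diag      # ignore the zero diagonal
    return torch.sqrt(diff.sum(dim=1) + 1e-8).mean()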
[second, recognition, graph, term, dataset] [descriptor, local, matching, vision, computer, pattern, sift, geodesc, hpatches, note, liberty, krystian, analysis] [ieee, conference, patch, image, method, based, proposed] [performance, order, unit, regularization, network, number, achieves, convolutional, utilization, impact, ith, rate, best, neural, batch, employ, structure] [evaluation, indicates, introduce, fact] [feature, propose, three, average, indicating] [learning, training, sosnet, sosr, similarity, learned, space, distribution, loss, positive, triplet, trained, hardnet, doap, qht, adam, nearest, hard, retrieval, von, set, clustering, discriminative, rintra, rinter, ubc, hypersphere, tfeat, function, hinge]
@InProceedings{Tian_2019_CVPR,
  author = {Tian, Yurun and Yu, Xin and Fan, Bin and Wu, Fuchao and Heijnen, Huub and Balntas, Vassileios},
  title = {SOSNet: Second Order Similarity Regularization for Local Descriptor Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
"Double-DIP": Unsupervised Image Decomposition via Coupled Deep-Image-Priors
Yosef Gandelsman, Assaf Shocher, Michal Irani


Many seemingly unrelated computer vision tasks can be viewed as a special case of image decomposition into separate layers. For example, image segmentation (separation into foreground and background layers); transparent layer separation (into reflection and transmission layers); image dehazing (separation into a clear image and a haze map), and more. In this paper, we propose a unified framework for unsupervised layer decomposition of a single image, based on coupled "Deep-image-Prior" (DIP) networks. It was shown [Ulyanov et al.] that the structure of a single DIP generator network is sufficient to capture the low-level statistics of a single image. We show that coupling multiple such DIPs provides a powerful tool for decomposing images into their basic components, for a wide variety of applications. This capability stems from the fact that the internal statistics of a mixture of layers is more complex than the statistics of each of its individual components. We show the power of this approach for Image-Dehazing, Fg/Bg Segmentation, Watermark-Removal, Transparency Separation in images and video, and more. These capabilities are achieved in a totally unsupervised way, with no training examples other than the input image/video itself.
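A minimal sketch of one optimization step of such a coupled decomposition for foreground/background separation, where two DIP-style generators and a mask network are fit to a single image; the networks, noise codes, and the mask regularizer are hypothetical stand-ins.

import torch

def double_dip_step(fg_net, bg_net, mask_net, z_fg, z_bg, z_m, image, optimizer):
    # Each *_net maps a fixed noise code to an image-sized output; mask_net ends in a
    # sigmoid. The only training signal is reconstructing the input image itself.
    optimizer.zero_grad()
    fg, bg, mask = fg_net(z_fg), bg_net(z_bg), mask_net(z_m)
    recon = mask * fg + (1.0 - mask) * bg                  # compose the two layers
    loss = torch.mean((recon - image) ** 2)
    loss = loss + 0.01 * torch.mean(mask * (1.0 - mask))   # push mask towards binary
    loss.backward()
    optimizer.step()
    return loss.item()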
[video, internal, individual, framework, multiple, frame, recognition, graph, behavior] [single, computer, vision, project, pattern, ambiguity, decomposition, case, approach, pure, initial, varying, underlying, form] [image, dip, dehazing, ieee, airlight, input, conference, patch, mixed, figure, separation, transparent, watermark, transparency, separate, hazy, noise, reconstruct, reflection, transmission, color, resolved, based, decompose, texture, recovered, pixel, coupled, decomposing, haze, variety] [layer, network, output, small, deep, smaller] [natural, generated, fact, random, empirical, example, generate, strong] [mask, segmentation, inside, foreground] [uniform, unsupervised, loss, distribution, unified, learned, trained, share, training, task, entropy, train, shared, mixture, similarity]
@InProceedings{Gandelsman_2019_CVPR,
  author = {Gandelsman, Yosef and Shocher, Assaf and Irani, Michal},
  title = {"Double-DIP": Unsupervised Image Decomposition via Coupled Deep-Image-Priors},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unprocessing Images for Learned Raw Denoising
Tim Brooks, Ben Mildenhall, Tianfan Xue, Jiawen Chen, Dillon Sharlet, Jonathan T. Barron


Machine learning techniques work best when the data used for training resembles the data used for evaluation. This holds true for learned single-image denoising algorithms, which are applied to real raw camera sensor readings but, due to practical constraints, are often trained on synthetic image data. Though it is understood that generalizing from synthetic to real images requires careful consideration of the noise properties of camera sensors, the other aspects of an image processing pipeline (such as gain, color correction, and tone mapping) are often overlooked, despite their significant effect on how raw measurements are transformed into finished images. To address this, we present a technique to "unprocess" images by inverting each step of an image processing pipeline, thereby allowing us to synthesize realistic raw sensor measurements from commonly available Internet photos. We additionally model the relevant components of an image processing pipeline when evaluating our loss function, which allows training to be aware of all relevant photometric processing that will occur after denoising. By unprocessing and processing training data and model outputs in this way, we are able to train a simple convolutional neural network that has 14%-38% lower error rates and is 9x-18x faster than the previous state of the art on the Darmstadt Noise Dataset, and generalizes to sensors outside of that dataset as well.
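Two of the standard pipeline steps, the sRGB tone curve and per-channel white-balance gains, can be inverted as in the NumPy sketch below; the gains shown are arbitrary example values (they would normally come from camera metadata or sampled priors), and the full unprocessing chain in the paper includes further stages such as color correction and re-mosaicing.

import numpy as np

def srgb_to_linear(srgb):
    # invert the standard sRGB tone curve (values assumed in [0, 1])
    low = srgb / 12.92
    high = ((srgb + 0.055) / 1.055) ** 2.4
    return np.where(srgb <= 0.04045, low, high)

def invert_white_balance(linear_rgb, gains=(2.0, 1.0, 1.7)):
    # undo per-channel white-balance gains; gains here are illustrative only
    return linear_rgb / np.asarray(gains).reshape(1, 1, 3)

def unprocess(srgb_image):
    # a (very) partial unprocessing chain: sRGB image -> approximate raw-like intensities
    return invert_white_balance(srgb_to_linear(srgb_image))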
[dataset, work, modeling, report] [camera, pipeline, sensor, inverse, error, algorithm, runtime, inverting, technique, approach, single, relative, exposure, rgb, range, internet, additional] [image, raw, noise, synthetic, srgb, denoising, darmstadt, real, color, figure, input, unprocessing, tone, digital, gamma, realistic, psnr, balance, mapping, bayer, intensity, lei, vst, denoised, ssim, paired, pixel] [processing, network, output, gain, neural, gaussian, performance, applying, apply, runtimes, deep, variance, compared, standard, residual, best, process] [model, read, white, indicates, procedure, step, simple, random] [evaluated, faster, level] [training, data, noisy, shot, learning, loss, sample, train, large, function, log, learned, datasets]
@InProceedings{Brooks_2019_CVPR,
  author = {Brooks, Tim and Mildenhall, Ben and Xue, Tianfan and Chen, Jiawen and Sharlet, Dillon and Barron, Jonathan T.},
  title = {Unprocessing Images for Learned Raw Denoising},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Residual Networks for Light Field Image Super-Resolution
Shuo Zhang, Youfang Lin, Hao Sheng


Light field cameras are considered to have many potential applications since angular and spatial information is captured simultaneously. However, the limited spatial resolution has brought many difficulties to developing related applications and has become the main bottleneck of light field cameras. In this paper, a learning-based method using residual convolutional networks is proposed to reconstruct light fields with higher spatial resolution. The view images in one light field are first grouped into different image stacks with consistent sub-pixel offsets and fed into different network branches to implicitly learn inherent corresponding relations. The residual information in different spatial directions is then calculated from each branch and further integrated to supplement high-frequency details for the view image. Finally, a flexible solution is proposed to super-resolve entire light field images with various angular resolutions. Experimental results on synthetic and real-world datasets demonstrate that the proposed method outperforms other state-of-the-art methods by a large margin in both visual and numerical evaluations. Furthermore, the proposed method performs well in preserving the inherent epipolar property in light field images.
[recognition, combined, capture, dataset, implicitly, outperforms] [view, light, field, computer, disparity, epipolar, vision, corresponding, pattern, single, solution, property, estimation, horizontal, lytro, analysis, occlusion] [image, proposed, psnr, ieee, central, ssim, conference, method, based, stack, resolution, resblock, edsr, reslf, lfnet, synthetic, figure, input, bicubic, inherent, superresolution, preserve, quality, reconstruct] [network, residual, convolutional, structure, table, conv, layer, compared, deep, better, higher, neural, output, flexible, entire] [visual] [spatial, global, surrounding, final, border, branch] [angular, specific, train, learn, training, domain, learning]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Shuo and Lin, Youfang and Sheng, Hao},
  title = {Residual Networks for Light Field Image Super-Resolution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Modulating Image Restoration With Continual Levels via Adaptive Feature Modification Layers
Jingwen He, Chao Dong, Yu Qiao


In image restoration tasks, such as denoising and super-resolution, continual modulation of restoration levels is of great importance for real-world applications, but is beyond the capability of most existing deep learning based image restoration methods. Learning from discrete and fixed restoration levels, deep models cannot be easily generalized to data of continuous and unseen levels. This topic is rarely touched in the literature, due to the difficulty of modulating well-trained models with certain hyper-parameters. We make a step forward by proposing a unified CNN framework that adds only a few parameters to a single-level model yet can handle arbitrary restoration levels between a start and an end level. The additional module, namely the AdaFM layer, performs channel-wise feature modification and can adapt a model to another restoration level with high accuracy. By simply tweaking an interpolation coefficient, the intermediate model, AdaFM-Net, can generate smooth and continuous restoration effects without artifacts. Extensive experiments on three image restoration tasks demonstrate the effectiveness of both model training and modulation testing. In addition, we carefully investigate the properties of AdaFM layers, providing detailed guidance on the usage of the proposed method.
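A simplified sketch of a channel-wise feature modification layer with a continuous interpolation coefficient is given below; the identity-plus-filter parameterization and kernel size are illustrative assumptions rather than the exact AdaFM formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaFMLike(nn.Module):
    # Each channel gets its own small filter and bias; alpha=0 recovers the identity
    # (start level) and alpha=1 applies the fully adapted filter (end level).
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.filter = nn.Parameter(torch.zeros(channels, 1, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(channels))
        identity = torch.zeros(channels, 1, kernel_size, kernel_size)
        identity[:, 0, kernel_size // 2, kernel_size // 2] = 1.0
        self.register_buffer("identity", identity)

    def forward(self, x, alpha=1.0):
        weight = self.identity + alpha * self.filter     # interpolate towards adapted filter
        bias = alpha * self.bias
        return F.conv2d(x, weight, bias, padding=self.filter.shape[-1] // 2,
                        groups=x.shape[1])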
[start, middle] [range, computer, vision, corresponding, single, additional, fitting, observation] [restoration, image, psnr, figure, super, denoising, resolution, change, proposed, degradation, conference, arbitrary, conditional, method, input, based, jpeg, style, ieee, handle, interpolation] [adafm, filter, basic, modulation, layer, size, dejpeg, normalization, deep, network, convolution, output, table, residual, better, compression, performance, coefficient, batch, conv, number, adaptive, compare, adabn, modulating, achieve, achieves, nbas, smaller, modulate, activation] [model, find, modification] [level, feature, cnn, three, instance] [adaptation, task, training, test, distance, trained, train, continual, learning, large, domain, continuously, set, gap, function]
@InProceedings{He_2019_CVPR,
  author = {He, Jingwen and Dong, Chao and Qiao, Yu},
  title = {Modulating Image Restoration With Continual Levels via Adaptive Feature Modification Layers},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Second-Order Attention Network for Single Image Super-Resolution
Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, Lei Zhang


Recently, deep convolutional neural networks (CNNs) have been widely explored in single image super-resolution (SISR) and obtained remarkable performance. However, most of the existing CNN-based SISR methods mainly focus on wider or deeper architecture design, neglecting to explore the feature correlations of intermediate layers, hence hindering the representational power of CNNs. To address this issue, in this paper, we propose a second-order attention network (SAN) for more powerful feature expression and feature correlation learning. Specifically, a novel trainable second-order channel attention (SOCA) module is developed to adaptively rescale the channel-wise features by using second-order feature statistics for more discriminative representations. Furthermore, we present a non-locally enhanced residual group (NLRG) structure, which not only incorporates non-local operations to capture long-distance spatial contextual information, but also contains repeated local-source residual attention groups (LSRAG) to learn increasingly abstract feature representations. Experimental results demonstrate the superiority of our SAN network over state-of-the-art SISR methods in terms of both quantitative metrics and visual quality.
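The idea of driving channel attention with second-order statistics can be sketched as below; the covariance normalization used in the paper is omitted, so this is only an illustrative simplification with assumed layer sizes.

import torch
import torch.nn as nn

class SecondOrderChannelAttention(nn.Module):
    # rescale channels using statistics of the channel covariance matrix
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        feat = x.view(b, c, h * w)
        feat = feat - feat.mean(dim=2, keepdim=True)
        cov = torch.bmm(feat, feat.transpose(1, 2)) / (h * w - 1)   # (b, c, c)
        stats = cov.mean(dim=2)         # summarize each channel by its covariance row
        weights = self.fc(stats).view(b, c, 1, 1)
        return x * weights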
[focus, dbpn, capture] [single, matrix] [image, rcan, based, proposed, ssim, bicubic, edsr, input, method, lei, vdsr, srmd, degradation, sisr] [residual, deep, san, network, channel, covariance, group, performance, rdn, skip, convolutional, better, soca, convolution, fsrcnn, nlrg, structure, abundant, neural, lapsrn, compared, table, srcnn, normalization, memnet, upscale, powerful, obtains, nlrn, lsrag, size, higher, shallow, apply, effectiveness] [attention, visual, model] [feature, propose, module, enhanced, spatial, global, contextual, map, improve] [discriminative, training, learning, learn, function, exploiting, exploit, main, set, loss]
@InProceedings{Dai_2019_CVPR,
  author = {Dai, Tao and Cai, Jianrui and Zhang, Yongbing and Xia, Shu-Tao and Zhang, Lei},
  title = {Second-Order Attention Network for Single Image Super-Resolution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Devil Is in the Edges: Learning Semantic Boundaries From Noisy Annotations
David Acuna, Amlan Kar, Sanja Fidler


We tackle the problem of semantic boundary prediction, which aims to identify pixels that belong to object (class) boundaries. We notice that relevant datasets contain a significant level of label noise, reflecting the fact that precise annotations are laborious to obtain and thus annotators trade off quality for efficiency. We aim to learn sharp and precise semantic boundaries by explicitly reasoning about annotation noise during training. We propose a simple new layer and loss that can be used with existing learning-based boundary detectors. Our layer/loss forces the detector to predict a maximum response along the normal direction at an edge, while also regularizing its direction. We further reason about true object boundaries during training using a level set formulation, which allows the network to learn from misaligned labels in an end-to-end fashion. Experiments show that we improve over the CASENet backbone network by more than 4% in terms of MF(ODS) and 18.61% in terms of AP, outperforming all current state-of-the-art methods including those that deal with alignment. Furthermore, we show that our learned network can be used to significantly improve coarse segmentation labels, lending itself as an efficient way to label new data.
[dataset, current, work, term] [active, approach, normal, error, direction, compute, well, computed, optimization] [image, method, quality, figure, proposed, qualitative, high, pixel, comparison, misaligned, contour, real] [layer, table, performance, network, top, deep, standard, evolution, number, original] [evaluation, model, true, evaluate, introduced] [boundary, semantic, object, coarse, segmentation, edge, level, casenet, sbd, annotated, detection, annotation, thinning, seal, refine, coarsely, iou, precise, predicted, deeplab, steal, extra, propose, backbone, refining, refined] [set, loss, train, test, alignment, learning, training, noisy, label, data, existing, function, trained, log, datasets, learn, learned, task, aim, maximum]
@InProceedings{Acuna_2019_CVPR,
  author = {Acuna, David and Kar, Amlan and Fidler, Sanja},
  title = {Devil Is in the Edges: Learning Semantic Boundaries From Noisy Annotations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Path-Invariant Map Networks
Zaiwei Zhang, Zhenxiao Liang, Lemeng Wu, Xiaowei Zhou, Qixing Huang


Optimizing a network of maps among a collection of objects/domains (or map synchronization) is a central problem across computer vision and many other relevant fields. Compared to optimizing pairwise maps in isolation, the benefit of map synchronization is that there are natural constraints among a map network that can improve the quality of individual maps. While such self-supervision constraints are well-understood for undirected map networks (e.g., the cycle-consistency constraint), they are under-explored for directed map networks, which naturally arise when maps are given by parametric maps (e.g., a feed-forward neural network). In this paper, we study a natural self-supervision constraint for directed map networks called path-invariance, which enforces that composite maps along different paths between a fixed pair of source and target domains are identical. We introduce path-invariance bases for efficient encoding of the path-invariance constraint and present an algorithm that outputs a path-invariance basis with polynomial time and space complexities. We demonstrate the effectiveness of our formulation on optimizing object correspondences, estimating dense image maps via neural networks, and 3D scene segmentation via map networks of diverse 3D representations. In particular, our approach only requires 8% labeled data from ScanNet to achieve the same performance as training a single 3D semantic segmentation network with 30% to 100% labeled data.
[directed, graph, joint, individual, second] [approach, dense, computer, constraint, shape, matching, optimizing, optimization, define, algorithm, consistent, point, leonidas, pci, pathinvariance, definition, gdag, pattern, qixing, problem, vision, associated, fij, parametric, note, bdag, cloud, volumetric, eij, synchronization, pcii, enforce, stitch] [image, ieee, input, figure, conference, acm, translation, described, cycle] [network, neural, performance, computing, size, dag, output, regularization, small, operation] [basis, path, consider, collection] [map, segmentation, three, semantic, edge, baseline, undirected, feature, category] [labeled, set, pair, data, domain, label, task, enforcing, unlabeled, dsp, space, experimental, representation]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Zaiwei and Liang, Zhenxiao and Wu, Lemeng and Zhou, Xiaowei and Huang, Qixing},
  title = {Path-Invariant Map Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
FilterReg: Robust and Efficient Probabilistic Point-Set Registration Using Gaussian Filter and Twist Parameterization
Wei Gao, Russ Tedrake


Probabilistic point-set registration methods have been gaining more attention for their robustness to noise, outliers and occlusions. However, these methods tend to be much slower than the popular iterative closest point (ICP) algorithms, which severely limits their usability. In this paper, we contribute a novel probabilistic registration method that achieves state-of-the-art robustness as well as substantially faster computational performance than modern ICP implementations. This is achieved using a rigorous yet computationally-efficient probabilistic formulation. Point-set registration is cast as a maximum likelihood estimation and solved using the EM algorithm. We show that with a simple augmentation, the E step can be formulated as a filtering problem, allowing us to leverage advances in efficient Gaussian filtering methods. We also propose a customized permutohedral filter to improve its performance while retaining sufficient accuracy for our task. Additionally, we present a simple and efficient twist parameterization that generalizes our method to the registration of articulated and deformable objects. For articulated objects, the complexity of our method is almost independent of the Degrees Of Freedom (DOFs), which makes it highly efficient even for high DOF systems. The results demonstrate the proposed method consistently outperforms many competitive baselines on a variety of registration tasks.
[motion, tracking, joint] [point, registration, kinematic, articulated, algorithm, twist, robust, observation, rigid, icp, rxi, geometric, computer, xyz, parameterization, permutohedral, formulation, estimation, pose, gmm, lattice, general, tricp, vision, supplemental, correspondence, error, international, dense, compute, local, note, bodyj, pattern, cloud, solve, assume] [method, proposed, conference, filtering, reference, ieee, figure, acm, input, image, statistical, transformation] [efficient, gaussian, performance, fast, fixed, accuracy, filter, parameter, achieves, computational, efficiency, cpu] [model, step, robustness, simple, example, improved, review] [cpd, deformable, faster, feature, final, propose, global, spatial] [probabilistic, alignment, set, distribution, independent, data, source]
@InProceedings{Gao_2019_CVPR,
  author = {Gao, Wei and Tedrake, Russ},
  title = {FilterReg: Robust and Efficient Probabilistic Point-Set Registration Using Gaussian Filter and Twist Parameterization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Probabilistic Permutation Synchronization Using the Riemannian Structure of the Birkhoff Polytope
Tolga Birdal, Umut Simsekli


We present an entirely new geometric and probabilistic approach to synchronization of correspondences across multiple sets of objects or images. In particular, we present two algorithms: (1) Birkhoff-Riemannian L-BFGS for optimizing the relaxed version of the combinatorially intractable cycle consistency loss in a principled manner, (2) Birkhoff-Riemannian Langevin Monte Carlo for generating samples on the Birkhoff Polytope and estimating the confidence of the found solutions. To this end, we first introduce the very recently developed Riemannian geometry of the Birkhoff Polytope. Next, we introduce a new probabilistic synchronization model in the form of a Markov Random Field (MRF). Finally, based on the first order retraction operators, we formulate our problem as simulating a stochastic differential equation and devise new integrators. We show on both synthetic and real datasets that we achieve high quality multi-graph matching results with faster convergence and reliable confidence/uncertainty estimates.
[multiple, graph, wang, term, joint, markov] [computer, permutation, birkhoff, problem, riemannian, synchronization, matching, international, polytope, vision, pattern, algorithm, note, optimization, confidence, xij, monte, retraction, langevin, solution, consistent, geometric, geodesic, defined, journal, geometry, initial, absolute, convex, definition, matrix, pij, analysis, tolga, tangent, approach, well, estimation, umut, local, correspondence, integrator, semidefinite] [conference, ieee, image, consistency, method, proposed, spectral, latent, developed, high, figure] [stochastic, denotes, neural, processing, structure, gradient, number, called] [manifold, random, machine, model, partial, carlo] [map, global, object] [set, probabilistic, posterior, space, distribution, pairwise, learning, mcmc, hypersphere]
@InProceedings{Birdal_2019_CVPR,
  author = {Birdal, Tolga and Simsekli, Umut},
  title = {Probabilistic Permutation Synchronization Using the Riemannian Structure of the Birkhoff Polytope},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Lifting Vectorial Variational Problems: A Natural Formulation Based on Geometric Measure Theory and Discrete Exterior Calculus
Thomas Mollenhoff, Daniel Cremers


Numerous tasks in imaging and vision can be formulated as variational problems over vector-valued maps. We approach the relaxation and convexification of such vectorial variational problems via a lifting to the space of currents. To that end, we recall that functionals with polyconvex Lagrangians can be reparametrized as convex one-homogeneous functionals on the graph of the function. This leads to an equivalent shape optimization problem over oriented surfaces in the product space of domain and codomain. A convex formulation is then obtained by relaxing the search space from oriented surfaces to more general currents. We propose a discretization of the resulting infinite-dimensional optimization problem using Whitney forms, which also generalizes recent "sublabel-accurate" multilabeling approaches.
[graph, current, exterior, recognition, work] [convex, vision, computer, discrete, relaxation, continuous, geometric, polyconvex, problem, finite, discretization, denote, differential, pattern, calculus, optimization, international, optimal, mass, lifting, shape, definition, general, nonconvex, linear, solution, functional, defined, spt, formulation, approach, matching, case, total, define, form, surface, constraint, cubical, elementary, theory, functionals, whitney, mesh, local] [conference, dual, variation, based, smooth, imaging] [energy, norm, original, cost, efficient, larger, element, search] [variational, simple, vectorial, consider, example, introduce, refer, vector, manifold] [boundary, oriented, map, global, european] [space, product, set, function, measure, extension, setting, support, representation, notion]
@InProceedings{Mollenhoff_2019_CVPR,
  author = {Mollenhoff, Thomas and Cremers, Daniel},
  title = {Lifting Vectorial Variational Problems: A Natural Formulation Based on Geometric Measure Theory and Discrete Exterior Calculus},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Sufficient Condition for Convergences of Adam and RMSProp
Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu


Adam and RMSProp are two of the most influential adaptive stochastic algorithms for training deep neural networks, yet both have been shown to diverge even in the convex setting via a few simple counterexamples. Many remedies, such as decreasing an adaptive learning rate, adopting a big batch size, incorporating a temporal decorrelation technique, and seeking an analogous surrogate, have been tried to make Adam/RMSProp-type algorithms converge. In contrast with existing approaches, we introduce an alternative easy-to-check sufficient condition, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam/RMSProp for solving large-scale non-convex stochastic optimization. Moreover, we show that the convergence of several variants of Adam, such as AdamNC and AdaEMA, can be directly implied via the proposed sufficient condition in the non-convex setting. In addition, we illustrate that Adam is essentially a specifically weighted AdaGrad with exponential moving average momentum, which provides a novel perspective for understanding Adam and RMSProp. This observation, coupled with the sufficient condition, gives a much deeper interpretation of their divergences. Finally, we validate the sufficient condition by applying Adam and RMSProp to tackle a certain counterexample and to train deep neural networks. Numerical results are exactly in accord with our theoretical analysis.
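For reference, the generic Adam update that the analysis concerns, in a minimal NumPy sketch; this is the standard form of the algorithm, while the sufficient condition itself (on the base learning rate and the combination of historical second-order moments) is stated in the paper and not reproduced here.

import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # one step of the standard Adam update
    m = beta1 * m + (1 - beta1) * grad                 # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2            # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                       # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v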
[moving, historical, work] [condition, constant, momentum, convex, convergent, theorem, exponential, guarantee, algorithm, bound, directly, case, theoretical, optimization, note, perspective, exactly] [proposed, based, result, conference] [rate, stochastic, weighted, gradient, parameter, adaptive, original, performance, neural, deep, lenet, remark, accuracy, batch, iteration, sgd, scheme] [machine, arxiv, preprint, probability] [average, global] [adam, convergence, learning, generic, rmsprop, gadam, adagrad, sufficient, corollary, training, amsgrad, setting, adaema, test, loss, base, counterexample, divergence, set, positive, satisfy, limt, mnist, illustrate, definiteness, nosadam, iters, zou, adamnc]
@InProceedings{Zou_2019_CVPR,
  author = {Zou, Fangyu and Shen, Li and Jie, Zequn and Zhang, Weizhong and Liu, Wei},
  title = {A Sufficient Condition for Convergences of Adam and RMSProp},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Guaranteed Matrix Completion Under Multiple Linear Transformations
Chao Li, Wei He, Longhao Yuan, Zhun Sun, Qibin Zhao


Low-rank matrix completion (LRMC) is a classical model in both computer vision (CV) and machine learning, and has been successfully applied to various real applications. In recent CV tasks, completion is usually applied to variants of the data, such as "non-local" or filtered versions, rather than to their original form. As a result, the theoretical analysis of conventional LRMC is no longer suitable for these applications. To tackle this problem, we propose a more general framework for LRMC, in which linear transformations of the data are taken into account. We rigorously prove the identifiability of the proposed model and show an upper bound on the reconstruction error. Furthermore, we derive an efficient completion algorithm by using augmented Lagrangian multipliers and the sketching trick. In the experiments, we apply the proposed method to the classical image inpainting problem and achieve state-of-the-art results.
[multiple, work, framework] [matrix, linear, completion, mcmt, nnm, pattern, assumption, nuclear, bound, denote, theoretical, algorithm, error, reconstruction, lrmc, decomposition, solution, single, computer, observation, singular, vision, sketching, optimal, impose, lagrangian, additional, exact, certificate, assume, augmented, problem, convex, case, projection, lemma, null, theorem, guaranteed, solving, theoretically, condition, implies] [missing, image, ieee, conference, dual, proposed, method, inpainting, transformation, sensing, based, recover, recovery] [tensor, performance, norm, original, structure, denotes, efficient, approximation, number, gaussian, ratio] [model, find, random, consider, perturbation, machine] [] [function, rank, upper, space, set, min, conventional, data, objective, update, existing]
@InProceedings{Li_2019_CVPR,
  author = {Li, Chao and He, Wei and Yuan, Longhao and Sun, Zhun and Zhao, Qibin},
  title = {Guaranteed Matrix Completion Under Multiple Linear Transformations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
MAP Inference via Block-Coordinate Frank-Wolfe Algorithm
Paul Swoboda, Vladimir Kolmogorov


We present a new proximal bundle method for Maximum-A-Posteriori (MAP) inference in structured energy minimization problems. The method optimizes a Lagrangean relaxation of the original energy minimization problem using a multi-plane block-coordinate Frank-Wolfe method that takes advantage of the specific structure of the Lagrangean decomposition. We show empirically that our method outperforms state-of-the-art Lagrangean decomposition based algorithms on some challenging Markov Random Field, multi-label discrete tomography and graph matching problems.
[graph, passing, dataset, work, term, time, recognition, markov, version] [problem, decomposition, discrete, lagrangean, matching, algorithm, bundle, relaxation, tomography, subproblems, subgradient, pattern, computer, solution, solver, bound, vision, international, denote, optimal, linear, optimization, exact, approach, solving, general, compute, convex, duality, feasible, fwmap, good, typically, local, mrfs] [proximal, method, dual, based, conference, mrf, ieee, figure, proposed, image] [inference, energy, lower, number, approximate, efficient, max, original, optimize, called, fast, efficiently] [message, vector, machine, random, primal, arg, step, pass, tree] [map, three] [min, set, function, objective, minimization, learning, gap, large, minimizing]
@InProceedings{Swoboda_2019_CVPR,
  author = {Swoboda, Paul and Kolmogorov, Vladimir},
  title = {MAP Inference via Block-Coordinate Frank-Wolfe Algorithm},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Convex Relaxation for Multi-Graph Matching
Paul Swoboda, Dagmar Kainmüller, Ashkan Mokarian, Christian Theobalt, Florian Bernard


We present a convex relaxation for the multi-graph matching problem. Our formulation allows for partial pairwise matchings, guarantees cycle consistency, and our objective incorporates both linear and quadratic costs. Moreover, we also present an extension to higher-order costs. To solve the convex relaxation, we employ a message passing algorithm that optimizes the dual problem. We experimentally compare our algorithm on established benchmark problems from computer vision, as well as on large problems from biological image analysis, the size of which exceeds that of previously investigated multi-graph matching instances.
[graph, passing, work, time, framework, individual, multiple, considering] [matching, problem, subproblems, quadratic, algorithm, decomposition, mgm, convex, approach, subproblem, relaxation, optimization, xst, linear, matrix, solving, define, assignment, formulation, general, feasible, lagrange, elementary, outlier, computer, correspondence, matchings, permutation, plane, note, bound, pattern, corresponding, programming, variable, constraint, minj, hotel, christian, solve, vision] [cycle, consistency, dual, based, synthetic, image, figure, proposed, method] [cost, order, number, precision, table, lower, add] [message, consider, refer, cutting, write, partial, primal, describe, house, florian] [recall, propose, three] [pairwise, min, set, triplet, mini, viewed]
@InProceedings{Swoboda_2019_CVPR,
  author = {Swoboda, Paul and Kainm{\"u}ller, Dagmar and Mokarian, Ashkan and Theobalt, Christian and Bernard, Florian},
  title = {A Convex Relaxation for Multi-Graph Matching},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Pixel-Adaptive Convolutional Neural Networks
Hang Su, Varun Jampani, Deqing Sun, Orazio Gallo, Erik Learned-Miller, Jan Kautz


Convolutions are the fundamental building blocks of CNNs. The fact that their weights are spatially shared is one of the main reasons for their widespread use, but it is also a major limitation, as it makes convolutions content-agnostic. We propose a pixel-adaptive convolution (PAC) operation, a simple yet effective modification of standard convolutions, in which the filter weights are multiplied with a spatially varying kernel that depends on learnable, local pixel features. PAC is a generalization of several popular filtering techniques and thus can be used for a wide range of use cases. Specifically, we demonstrate state-of-the-art performance when PAC is used for deep joint image upsampling. PAC also offers an effective alternative to fully-connected CRF (Full-CRF), called PAC-CRF, which performs competitively compared to Full-CRF, while being considerably faster. In addition, we also demonstrate that PAC can be used as a drop-in replacement for convolution layers in pre-trained networks, resulting in consistent performance improvements.
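A minimal sketch of the pixel-adaptive convolution idea, assuming a fixed Gaussian kernel on guidance-feature differences and an unfold-based implementation; shapes and names are illustrative, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def pixel_adaptive_conv(x, guide, weight, kernel_size=3):
    """Standard convolution whose weights are modulated per pixel by a
    Gaussian kernel on local guidance-feature differences."""
    # x: (B, Cin, H, W), guide: (B, Cg, H, W), weight: (Cout, Cin, k, k)
    B, Cin, H, W = x.shape
    k = kernel_size
    pad = k // 2
    x_un = F.unfold(x, k, padding=pad).view(B, Cin, k * k, H * W)
    g_un = F.unfold(guide, k, padding=pad).view(B, guide.shape[1], k * k, H * W)
    g_c = guide.view(B, guide.shape[1], 1, H * W)                 # center guidance features f_i
    # Adaptive kernel K(f_i, f_j) = exp(-0.5 * ||f_i - f_j||^2)
    adapt = torch.exp(-0.5 * ((g_un - g_c) ** 2).sum(dim=1, keepdim=True))
    w = weight.view(weight.shape[0], Cin, k * k)
    out = torch.einsum('oin,binp->bop', w, x_un * adapt)          # shared weights on adapted neighborhoods
    return out.view(B, weight.shape[0], H, W)
```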
[joint, flow, work, optical, prediction, dynamic] [depth, computer, vision, allows, rgb, varying, local, dense, additional, lattice] [image, filtering, pixel, spatially, bilateral, input, djf, demonstrate, figure] [pac, convolution, standard, network, upsampling, filter, kernel, convolutional, deep, crf, conv, layer, inference, number, gaussian, pooling, efficient, neural, performance, output, validation, compared, operation, learnable, dilation, better, replacement] [simple, modification, visual] [semantic, spatial, segmentation, cnn, guidance, propose, fcn, guided, three, miou, feature] [adapting, pairwise, learning, training, existing, learn, trained, invariant, learned, observe, test, generalization]
@InProceedings{Su_2019_CVPR,
  author = {Su, Hang and Jampani, Varun and Sun, Deqing and Gallo, Orazio and Learned-Miller, Erik and Kautz, Jan},
  title = {Pixel-Adaptive Convolutional Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Single-Frame Regularization for Temporally Stable CNNs
Gabriel Eilertsen, Rafal K. Mantiuk, Jonas Unger


Convolutional neural networks (CNNs) can model complicated non-linear relations between images. However, they are notoriously sensitive to small changes in the input. Most CNNs trained to describe image-to-image mappings generate temporally unstable results when applied to video sequences, leading to flickering artifacts and other inconsistencies over time. In order to use CNNs for video material, previous methods have relied on estimating dense frame-to-frame motion information (optical flow) in the training and/or the inference phase, or on exploring recurrent learning structures. We take a different approach to the problem, posing temporal stability as a regularization of the cost function. The regularization is formulated to account for different types of motion that can occur between frames, so that temporally stable CNNs can be trained without the need for video material or expensive motion estimation. The training can be performed as a fine-tuning operation, without architectural modifications of the CNN. Our evaluation shows that the training strategy leads to large improvements in temporal smoothness. Moreover, for small datasets the regularization can help in boosting the generalization performance to a much larger extent than what is possible with naive augmentation strategies.
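One way to read the regularizer is as an equivariance penalty under small synthetic motion. A minimal sketch under that reading; the random translation and MSE penalty are assumptions, the paper considers several transformation and motion types:

```python
import torch
import torch.nn.functional as F

def stability_regularizer(net, x, max_shift=2):
    """Penalize the network for responding to a small synthetic 'motion'
    differently than the correspondingly shifted output."""
    dx = int(torch.randint(-max_shift, max_shift + 1, ()))
    dy = int(torch.randint(-max_shift, max_shift + 1, ()))
    x_shift = torch.roll(x, shifts=(dy, dx), dims=(2, 3))        # simulated next frame
    y = net(x)
    y_shift = net(x_shift)
    return F.mse_loss(y_shift, torch.roll(y, shifts=(dy, dx), dims=(2, 3)))

# total_loss = task_loss + alpha * stability_regularizer(net, x)
```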
[video, temporal, frame, motion, term, temporally, recognition, consecutive, dataset, previous, prediction] [reconstruction, jacobian, smoothness, ground, computer, truth, international, vision, range, consistent, pattern, local, error] [transform, hdr, input, conference, image, figure, psnr, colorization, ieee, consistency, transformation, high, proposed, pixel, noise, strength, described, saturated, based, application] [regularization, sparse, stability, neural, cnns, performance, order, small, deep, network, table, applied, better, output, processing, convolutional] [transformed, adversarial, example, arxiv, preprint, model, sensitivity, robustness, evaluation] [cnn, baseline, improve] [training, invariance, loss, data, learning, function, augmentation, large, trained, test, train, selected, measure]
@InProceedings{Eilertsen_2019_CVPR,
  author = {Eilertsen, Gabriel and Mantiuk, Rafal K. and Unger, Jonas},
  title = {Single-Frame Regularization for Temporally Stable CNNs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
An End-To-End Network for Generating Social Relationship Graphs
Arushi Goel, Keng Teck Ma, Cheston Tan


Socially-intelligent agents are of growing interest in artificial intelligence. To this end, we need systems that can understand social relationships in diverse social contexts. Inferring the social context in a given visual scene not only involves recognizing objects, but also demands a more in-depth understanding of the relationships and attributes of the people involved. To achieve this, one computational approach for representing human relationships and attributes is to use an explicit knowledge graph, which allows for high-level reasoning. We introduce a novel end-to-end-trainable neural network that is capable of generating a Social Relationship Graph - a structured, unified representation of social relationships and attributes - from a given input image. Our Social Relationship Graph Generation Network (SRG-GN) is the first to use memory cells like Gated Recurrent Units (GRUs) to iteratively update the social relationship states in a graph using scene and attribute context. The neural network exploits the recurrent connections among the GRUs to implement message passing between nodes and edges in the graph, and results in significant improvement over previous methods for social relationship recognition.
[social, graph, people, dataset, state, recognition, pisc, grus, gru, passing, previous, work, activity, predict, human, rship, updated, framework, ppair, recurrent, predicting, hidden, adjacent, professional] [scene, computer, vision, pattern, international, predicts, single] [image, gender, conference, figure, age, attribute, ieee, input] [network, neural, accuracy, table, inference, convnet, structured, pooling, gate, performance] [relationship, model, visual, generating, generation, message, understanding, memory, gated, node, family] [person, context, module, edge, final, semantic, contextual] [task, domain, knowledge, update, learning, novel, loss]
@InProceedings{Goel_2019_CVPR,
  author = {Goel, Arushi and Teck Ma, Keng and Tan, Cheston},
  title = {An End-To-End Network for Generating Social Relationship Graphs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Meta-Learning Convolutional Neural Architectures for Multi-Target Concrete Defect Classification With the COncrete DEfect BRidge IMage Dataset
Martin Mundt, Sagnik Majumder, Sreenivas Murali, Panagiotis Panetsos, Visvanathan Ramesh


Recognition of defects in concrete infrastructure, especially in bridges, is a costly and time-consuming, yet crucial, first step in the assessment of the structural integrity. Large variation in appearance of the concrete material, changing illumination and weather conditions, a variety of possible surface markings as well as the possibility for different types of defects to overlap, make it a challenging real-world task. In this work we introduce the novel COncrete DEfect BRidge IMage dataset (CODEBRIM) for multi-target classification of five commonly appearing concrete defects. We investigate and compare two reinforcement learning based meta-learning approaches, MetaQNN and efficient neural architecture search, to find suitable convolutional neural network architectures for this challenging multi-class multi-target task. We show that learned architectures have fewer overall parameters in addition to yielding better multi-target accuracy in comparison to popular neural architectures from the literature evaluated in the context of our application.
[dataset, recognition, work, individual, bridge] [vision, computer, international, contrast, pattern, material, corresponding, supplementary, surface] [based, patch, conference, image, proposed, figure, comparison, texture, amount] [defect, neural, validation, architecture, accuracy, size, concrete, deep, best, batch, convolutional, metaqnn, search, crack, enas, literature, top, exposed, rate, codebrim, efflorescence, number, network, imagenet, suitable, design, corrosion, alexnet, larger, warm, process] [reinforcement, machine, reward, step] [bounding, cnn, feature, box, final, overlapping, annotation, detection, challenging, object] [learning, test, classification, set, training, task, class, specific, large, train, datasets]
@InProceedings{Mundt_2019_CVPR,
  author = {Mundt, Martin and Majumder, Sagnik and Murali, Sreenivas and Panetsos, Panagiotis and Ramesh, Visvanathan},
  title = {Meta-Learning Convolutional Neural Architectures for Multi-Target Concrete Defect Classification With the COncrete DEfect BRidge IMage Dataset},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ECC: Platform-Independent Energy-Constrained Deep Neural Network Compression via a Bilinear Regression Model
Haichuan Yang, Yuhao Zhu, Ji Liu


Many DNN-enabled vision applications constantly operate under severe energy constraints such as unmanned aerial vehicles, Augmented Reality headsets, and smartphones. Designing DNNs that can meet a stringent energy budget is becoming increasingly important. This paper proposes ECC, a framework that compresses DNNs to meet a given energy constraint while minimizing accuracy loss. The key idea of ECC is to model the DNN energy consumption via a novel bilinear regression function. The energy estimate model allows us to formulate DNN compression as a constrained optimization that minimizes the DNN loss function over the energy constraint. The optimization problem, however, has nontrivial constraints. Therefore, existing deep learning solvers do not apply directly. We propose an optimization algorithm that combines the essence of the Alternating Direction Method of Multipliers (ADMM) framework with gradient-based learning algorithms. The algorithm decomposes the original constrained optimization into several subproblems that are solved iteratively and efficiently. ECC is also portable across different hardware platforms without requiring hardware knowledge. Experiments show that ECC achieves higher accuracy under the same or lower energy budget compared to state-of-the-art resource-constrained DNN compression techniques.
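A minimal sketch of how a per-layer bilinear energy model could be fit to hardware measurements; the exact regressor form is assumed from the abstract, not taken from the ECC implementation:

```python
import numpy as np

def fit_bilinear_layer_energy(s_prev, s_cur, energy):
    """Least-squares fit of E ~ a * s_prev * s_cur + b * s_cur + c for one layer,
    from measured (sparsity, energy) samples."""
    X = np.stack([s_prev * s_cur, s_cur, np.ones_like(s_cur)], axis=1)
    coeffs, *_ = np.linalg.lstsq(X, energy, rcond=None)
    return coeffs                                   # (a, b, c)

def predict_layer_energy(coeffs, s_prev, s_cur):
    a, b, c = coeffs
    return a * s_prev * s_cur + b * s_cur + c       # used inside the constrained optimization
```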
[framework, prediction, key] [optimization, vision, dense, algorithm, computer, constraint, problem, directly, optimal, error, equation, variable, relative, solve] [figure, dual, conference, method, based, proximal, ieee] [energy, dnn, compression, ecc, sparsity, layer, neural, hardware, accuracy, network, deep, netadapt, latency, amc, consumption, bilinear, pruning, mobilenet, gtx, number, platform, jetson, mobile, efficient, achieves, meet, gradient, cost, convolutional, budget, search, processing, dnns, compared, compressed, compressing, weight, stochastic, imagenet, alexnet, table] [model, constrained, arxiv, preprint, find, primal] [propose, semantic] [learning, set, function, loss, test, target, min, update, updating, training, classification]
@InProceedings{Yang_2019_CVPR,
  author = {Yang, Haichuan and Zhu, Yuhao and Liu, Ji},
  title = {ECC: Platform-Independent Energy-Constrained Deep Neural Network Compression via a Bilinear Regression Model},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity Through Low-Bit Quantization
Shijie Cao, Lingxiao Ma, Wencong Xiao, Chen Zhang, Yunxin Liu, Lintao Zhang, Lanshun Nie, Zhi Yang


In this paper we present a novel and general method to accelerate convolutional neural network (CNN) inference by taking advantage of feature map sparsity. We experimentally demonstrate that a highly quantized version of the original network is sufficient in predicting the output sparsity accurately, and verify that leveraging such sparsity in inference incurs negligible accuracy drop compared with the original network. To accelerate inference, for each convolution layer our approach first obtains a binary sparsity mask of the output feature maps by running inference on a quantized version of the original network layer, and then conducts a full-precision sparse convolution to find out the precise values of the non-zero outputs. Compared with existing work, our approach avoids the overhead of training additional auxiliary networks, while still being applicable to general CNN networks without being limited to certain application domains.
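A minimal sketch of the two-pass idea, with a simple uniform quantizer standing in for the paper's low-bit scheme; for clarity the full-precision convolution is computed densely and masked afterwards rather than with a sparse kernel:

```python
import torch
import torch.nn.functional as F

def quantize(t, bits=4):
    # Simple symmetric uniform quantizer; only used by the cheap predictor pass.
    scale = t.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
    return torch.round(t / scale) * scale

def seer_conv_relu(x, weight, bias=None, bits=4):
    # 1) Low-bit pass predicts which post-ReLU outputs are non-zero.
    y_q = F.conv2d(quantize(x, bits), quantize(weight, bits), bias, padding=1)
    mask = (y_q > 0).float()
    # 2) Full-precision conv; an efficient kernel would compute only the
    #    masked-on outputs, here it is dense and masked afterwards.
    y = F.conv2d(x, weight, bias, padding=1)
    return F.relu(y) * mask
```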
[prediction, predict, work, time, previous, online] [computer, vision, pattern, approach, error, additional, corresponding, equation] [figure, input, conference, method, ieee, demonstrate, proposed, high] [sparsity, quantized, convolution, accuracy, quantization, sparse, output, computation, neural, inference, layer, seernet, speedup, network, conv, relu, deep, efficient, convolutional, drop, quantizing, original, batch, normalization, table, compared, overhead, activation, small, low, negligible, weight, precision, popular, achieves, cpu, rate, integer, accelerate, highly, binary, running, skipping, full, featuremap, achieve, arithmetic] [model, arxiv, preprint] [cnn, feature, mask, map, fused, propose, average] [data, training]
@InProceedings{Cao_2019_CVPR,
  author = {Cao, Shijie and Ma, Lingxiao and Xiao, Wencong and Zhang, Chen and Liu, Yunxin and Zhang, Lintao and Nie, Lanshun and Yang, Zhi},
  title = {SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity Through Low-Bit Quantization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Defending Against Adversarial Attacks by Randomized Diversification
Olga Taran, Shideh Rezaeifar, Taras Holotyak, Slava Voloshynovskiy


The vulnerability of machine learning systems to adversarial attacks questions their usage in many applications. In this paper, we propose a randomized diversification as a defense strategy. We introduce a multi-channel architecture in a gray-box scenario, which assumes that the architecture of the classifier and the training data set are known to the attacker. The attacker does not have access to the secret key or to the internal states of the system at test time. The defender processes an input in multiple channels. Each channel introduces its own randomization in a special transform domain based on a secret key shared between the training and testing stages. Such a transform based randomization with a shared key preserves the gradients in key-defined sub-spaces for the defender but it prevents gradient back propagation and the creation of various bypass systems for the attacker. An additional benefit of multi-channel randomization is the aggregation that fuses soft-outputs from all channels, thus increasing the reliability of the final score. The sharing of a secret key creates an information advantage to the defender. Experimental evaluation demonstrates an increased robustness of the proposed method to a number of known state-of-the-art attacks.
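A minimal sketch of a single defense channel, assuming a DCT transform domain and key-seeded sign flipping; the paper's concrete choice of transform and randomization may differ:

```python
import numpy as np
from scipy.fft import dctn, idctn

def keyed_channel(x, key):
    """One defense channel: key-seeded sign flipping in the DCT domain."""
    # x: (H, W) grayscale image; key: integer secret shared at train and test time.
    rng = np.random.default_rng(key)
    flips = rng.choice([-1.0, 1.0], size=x.shape)       # key-defined sub-space
    return idctn(dctn(x, norm='ortho') * flips, norm='ortho')

# Each channel uses its own key and classifier; their soft outputs are
# aggregated (e.g. averaged) to produce the final prediction.
```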
[key, represented, consists, propagation] [permutation, error, local, direct, case, corresponding, classical, single, international, general, bypass] [figure, based, proposed, image, input, transform, noise, conference] [original, architecture, aggregation, table, number, channel, gradient, performance, processing, group, operator, accuracy, kji, numerical, usage] [adversarial, defense, sign, secret, randomization, attacker, randomized, system, defender, machine, example, random, diversification, access, consider, considered, creates, advantage, robustness] [global, illustrated, level] [data, dct, classification, classifier, domain, training, pji, independent, set, learning, flipping, mnist, corresponds, investigate, strategy, test, class, main, shared, target, trained, testing]
@InProceedings{Taran_2019_CVPR,
  author = {Taran, Olga and Rezaeifar, Shideh and Holotyak, Taras and Voloshynovskiy, Slava},
  title = {Defending Against Adversarial Attacks by Randomized Diversification},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Rob-GAN: Generator, Discriminator, and Adversarial Attacker
Xuanqing Liu, Cho-Jui Hsieh


We study two important concepts in adversarial deep learning---adversarial training and generative adversarial network (GAN). Adversarial training is the technique used to improve the robustness of the discriminator by combining an adversarial attacker and the discriminator in the training phase. GAN is commonly used for image generation by jointly optimizing discriminator and generator. We show these two concepts are indeed closely related and can be used to strengthen each other---adding a generator to the adversarial training procedure can improve the robustness of discriminators, and adding an adversarial attack to GAN training can improve the convergence speed and lead to better generators. Combining these two insights, we develop a framework called Rob-GAN to jointly optimize generator and discriminator in the presence of adversarial attacks---the generator generates fake images to fool the discriminator; the adversarial attacker perturbs real images to fool the discriminator, and the discriminator wants to minimize loss under fake and adversarial images. Through this end-to-end training procedure, we are able to simultaneously improve the convergence speed of GAN training, the quality of synthetic images, and the robustness of the discriminator under strong adversarial attacks. Experimental results demonstrate that the obtained classifier is more robust than the state-of-the-art adversarial training approach (Madry 2017), and the generator outperforms SN-GAN on ImageNet-143.
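A heavily simplified sketch of one discriminator update: the attacker is reduced to a single FGSM step and the discriminator is collapsed into one classifier with an extra "fake" class, which only approximates Rob-GAN's discriminator with separate classification and real/fake heads:

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, G, x_real, y_real, z, opt_d, eps=0.03):
    # Attacker: a single FGSM step on the classification loss of real images.
    x_adv = x_real.clone().detach().requires_grad_(True)
    grad, = torch.autograd.grad(F.cross_entropy(D(x_adv), y_real), x_adv)
    x_adv = (x_adv + eps * grad.sign()).clamp(0, 1).detach()

    x_fake = G(z).detach()
    logits_adv, logits_fake = D(x_adv), D(x_fake)
    # D is treated here as a (C+1)-way classifier whose last class means "fake".
    fake_label = torch.full_like(y_real, logits_adv.shape[1] - 1)
    loss = F.cross_entropy(logits_adv, y_real) + F.cross_entropy(logits_fake, fake_label)
    opt_d.zero_grad()
    loss.backward()
    opt_d.step()
    return loss.item()
```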
[framework] [robust, international, algorithm, ground, truth, note] [generator, generative, image, real, conference, figure, conditional, proposed, quality, based, method, high, strength] [max, accuracy, network, deep, small, better, neural, original, imagenet, gradient, standard, compare, performance, processing, called, optimize] [adversarial, discriminator, gan, robustness, arxiv, preprint, fake, model, attack, attacker, defense, xadv, pdata, lipschitz, llv, find, empirical, improved, example, choose, fool] [improve, improves, comparing, recall] [training, loss, data, learning, trained, set, function, gap, classification, classifier, distribution, augmentation, test, convergence, ntr, generalization, testing, min, objective, large, train, idea, subset, minimize]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Xuanqing and Hsieh, Cho-Jui},
  title = {Rob-GAN: Generator, Discriminator, and Adversarial Attacker},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning From Noisy Labels by Regularized Estimation of Annotator Confusion
Ryutaro Tanno, Ardavan Saeedi, Swami Sankaranarayanan, Daniel C. Alexander, Nathan Silberman


The predictive performance of supervised learning algorithms depends on the quality of labels. In a typical label collection process, multiple annotators provide subjective noisy estimates of the "truth" under the influence of their varying skill-levels and biases. Blindly treating these noisy labels as the ground truth limits the accuracy of learning algorithms in the presence of strong disagreement. This problem is critical for applications in domains such as medical imaging where both the annotation cost and inter-observer variability are high. In this work, we present a method for simultaneously learning the individual annotator model and the underlying true label distribution, using only noisy observations. Each annotator is modeled by a confusion matrix that is jointly estimated along with the classifier predictions. We propose to add a regularization term to the loss function that encourages convergence to the true annotator confusion matrix. We provide a theoretical argument as to how the regularization is essential to our approach both for the case of single annotator and multiple annotators. Despite the simplicity of the idea, experiments on image classification tasks with both simulated and real labels show that our method either outperforms or performs on par with the state-of-the-art methods and is capable of estimating the skills of annotators even with a single label available per image.
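A minimal sketch of the joint loss under assumed shapes: the classifier's predicted distribution is pushed through each annotator's confusion matrix, and the trace of the confusion matrices is added as the regularizer:

```python
import torch
import torch.nn.functional as F

def annotator_confusion_loss(logits, noisy_labels, confusions, lam=0.01):
    # logits: (B, C) classifier outputs; noisy_labels: (R, B) labels from R annotators;
    # confusions: (R, C, C) row-stochastic matrices A_r with A_r[i, j] = P(label j | true i).
    p = F.softmax(logits, dim=-1)                       # estimated true label distribution
    loss = 0.0
    for r in range(confusions.shape[0]):
        noisy_p = p @ confusions[r]                     # predicted distribution of annotator r's labels
        loss = loss + F.nll_loss(torch.log(noisy_p + 1e-8), noisy_labels[r])
    # Trace regularizer encouraging convergence toward the true confusion matrices.
    return loss + lam * confusions.diagonal(dim1=-2, dim2=-1).sum()
```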
[multiple, work, individual, skill, modelling] [estimated, estimation, ground, truth, matrix, single, view, robust, error, computer, international, approach, algorithm, pattern, varying, vision, range, estimate, optimization, supplementary] [method, image, trace, noise, conference, proposed, ieee, diagonal, quality, capable, real, high, figure, variability, simulated, input] [accuracy, number, neural, performance, deep, validation, regularization, compare, gradient, norm, group, vanilla] [model, true, probability, correct, machine, robustness, example] [annotator, average, cardiac, presence, medical, annotation, level, improves, cnn] [label, noisy, learning, classification, training, loss, distribution, data, set, confusion, mnist, trained, mbem, generalized, minimizing, class, diagonally, linda, classifier, convergence]
@InProceedings{Tanno_2019_CVPR,
  author = {Tanno, Ryutaro and Saeedi, Ardavan and Sankaranarayanan, Swami and Alexander, Daniel C. and Silberman, Nathan},
  title = {Learning From Noisy Labels by Regularized Estimation of Annotator Confusion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Task-Free Continual Learning
Rahaf Aljundi, Klaas Kelchtermans, Tinne Tuytelaars


Methods proposed in the literature towards continual deep learning typically operate in a task-based sequential learning setup. A sequence of tasks is learned, one at a time, with all data of current task available but not of previous or future tasks. Task boundaries and identities are known at all times. This setup, however, is rarely encountered in practical applications. Therefore we investigate how to transform continual learning to an online setup. We develop a system that keeps on learning over time in a streaming fashion, with data distributions gradually changing and without the notion of separate tasks. To this end, we build on the work on Memory Aware Synapses, and show how this method can be made online by providing a protocol to decide i) when to update the importance weights, ii) which data to use to update them, and iii) how to accumulate the importance weights at each update step. Experimental results show the validity of the approach in the context of two applications: (self-)supervised learning of a face recognition model by watching soap series and teaching a robot to avoid collisions.
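A minimal sketch of the Memory Aware Synapses-style building blocks the online protocol decides when to invoke; the accumulation schedule and data selection, which are the paper's actual contributions, are omitted:

```python
import torch

def accumulate_importance(model, x, importance):
    """Accumulate per-parameter importance as the sensitivity of the squared output norm."""
    model.zero_grad()
    model(x).pow(2).sum().backward()
    for n, p in model.named_parameters():
        importance[n] += p.grad.detach().abs()

def mas_penalty(model, importance, anchors, lam=1.0):
    """Quadratic penalty keeping important parameters close to their anchored values."""
    pen = 0.0
    for n, p in model.named_parameters():
        pen = pen + (importance[n] * (p - anchors[n]) ** 2).sum()
    return lam * pen

# importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
# anchors    = {n: p.detach().clone()  for n, p in model.named_parameters()}
```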
[online, previous, buffer, time, sequential, joint, work, sequence, recognition, streaming, series, consecutive, window, watching, actor, current, second, collision, start] [estimate, initial, case, estimated, clearly, interference] [figure, method, based, face, input, change, conference, proposed, replay] [accuracy, deep, neural, performance, weight, network, better, parameter, small, number, gradient, offline, output] [model, system, memory, arxiv, preprint, step, machine, episode, reinforcement, robot] [baseline, detected, aware] [learning, continual, data, training, task, loss, update, test, hard, knowledge, learned, distribution, catastrophic, lifelong, trained, incremental, plt, updating, forgetting, learn, accumulating, setting, large]
@InProceedings{Aljundi_2019_CVPR,
  author = {Aljundi, Rahaf and Kelchtermans, Klaas and Tuytelaars, Tinne},
  title = {Task-Free Continual Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Importance Estimation for Neural Network Pruning
Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, Jan Kautz


Structural pruning of neural network parameters reduces computational, energy, and memory transfer costs during inference. We propose a novel method that estimates the contribution of a neuron (filter) to the final loss and iteratively removes those with smaller scores. We describe two variations of our method using the first and second-order Taylor expansions to approximate a filter's contribution. Both methods scale consistently across any network layer without requiring per-layer sensitivity analysis and can be applied to any kind of layer, including skip connections. For modern networks trained on ImageNet, we measured experimentally a high (>93%) correlation between the contribution computed by our methods and a reliable estimate of the true importance. Pruning with the proposed methods led to an improvement over state-of-the-art in terms of accuracy, FLOPs, and parameter reduction. On ResNet-101, we achieve a 40% FLOPS reduction by removing 30% of the parameters, with a loss of 0.02% in the top-1 accuracy on ImageNet.
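A minimal sketch of the first-order variant of the criterion, assuming a per-filter score formed from weights and their gradients after a backward pass:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def taylor_filter_importance(conv: nn.Conv2d):
    """First-order Taylor score per output filter: (sum of weight * grad)^2."""
    w, g = conv.weight, conv.weight.grad            # both (Cout, Cin, k, k)
    return (w * g).sum(dim=(1, 2, 3)) ** 2          # one score per filter

# Usage: call after loss.backward() on a mini-batch, accumulate scores over
# several batches, then remove the filters with the smallest scores.
```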
[averaged, individual] [computed, compute, error, approach, linear, estimate, squared, measured] [method, proposed, removing, remove, based, input] [pruning, network, taylor, number, correlation, skip, neural, gate, convolutional, neuron, layer, deep, weight, imagenet, criterion, scale, magnitude, filter, better, residual, smaller, applied, accuracy, computational, gradient, hessian, full, prune, pruned, parameter, combinatorial, best, table, size, larger, spearman, output, obd, scaling, sparse, compare, small, approximation, batch] [greedy, contribution, arxiv, preprint, requires, memory, sensitivity] [expansion, score, final, including, improvement] [loss, oracle, observe, trained, learning, training, set]
@InProceedings{Molchanov_2019_CVPR,
  author = {Molchanov, Pavlo and Mallya, Arun and Tyree, Stephen and Frosio, Iuri and Kautz, Jan},
  title = {Importance Estimation for Neural Network Pruning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Detecting Overfitting of Deep Generative Networks via Latent Recovery
Ryan Webster, Julien Rabin, Loic Simon, Frederic Jurie


State of the art deep generative networks have achieved such realism that they can be suspected of memorizing training images. This is why it is not uncommon to include visualizations of training set nearest neighbors, to suggest generated images are not simply memorized. We argue this is not sufficient, motivating a closer study of overfitting in deep generators. We address this question by i) showing how simple losses are highly effective at reconstructing images for deep generators ii) analyzing the statistics of reconstruction errors for training versus validation images. Using this methodology, we show that pure GAN models appear to generalize well, in contrast with those using hybrid adversarial losses, which are amongst the most widely applied generative methods. We also show that standard GAN evaluation metrics fail to capture memorization for some deep generators. Finally, we note the ramifications of memorization on data privacy. Considering the already widespread application of generative networks, we provide a step in the right direction towards the important yet incomplete picture of generative overfitting.
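A minimal sketch of latent recovery by direct optimization; the optimizer, loss, and step count are assumptions, and the paper discusses which simple losses work well:

```python
import torch

def recover_latent(G, x, z_dim=128, steps=500, lr=0.05):
    """Optimize a latent code so that G(z) reconstructs x; return code and error."""
    z = torch.zeros(x.shape[0], z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = ((G(z) - x) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach(), loss.item()

# Memorization check: compare the distribution of recovery errors on training
# images against held-out images; a large gap suggests memorization.
```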
[work, dataset] [international, optimization, error, computer, reconstruction, analysis, vision, well, recovering] [generative, latent, recovery, image, conference, face, generator, figure, variety, ieee, inpainting, difference, statistical, proposed, study, recover, quality, verbatim, perceptual, noise] [deep, validation, neural, network, table, small, processing, progressive, gaussian] [gan, glo, gans, generated, adversarial, evaluation, memorization, nnd, fid, machine, visual, aegan, pggan, model, arxiv, median, dcgan, nng, consider, privacy, preprint, simple, refer, random, considered, mesch, fact] [detect, threshold] [training, overfitting, train, set, learning, test, distribution, loss, datasets, target, large, distance, trained, nearest, space, euclidean, min]
@InProceedings{Webster_2019_CVPR,
  author = {Webster, Ryan and Rabin, Julien and Simon, Loic and Jurie, Frederic},
  title = {Detecting Overfitting of Deep Generative Networks via Latent Recovery},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Coloring With Limited Data: Few-Shot Colorization via Memory Augmented Networks
Seungjoo Yoo, Hyojin Bahng, Sunghyo Chung, Junsoo Lee, Jaehyuk Chang, Jaegul Choo


Despite recent advancements, deep learning-based automatic colorization methods are still limited when it comes to few-shot learning. Existing models require a significant amount of training data. To tackle this issue, we present a novel memory-augmented colorization model MemoPainter that can produce high-quality colorization with limited data. In particular, our model is able to capture rare instances and successfully colorize them. Also, we propose a novel threshold triplet loss that enables unsupervised training of memory networks without the need for class labels. Experiments show that our model has superior quality in both few-shot and one-shot colorization tasks.
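A minimal sketch of a threshold triplet loss under assumed definitions: with no class labels, memory slots within a color-distance threshold of the query are treated as positives and the rest as negatives:

```python
import torch
import torch.nn.functional as F

def threshold_triplet_loss(q, mem_keys, mem_colors, q_color, delta=0.5, margin=0.2):
    # q: (D,) query embedding; mem_keys: (M, D) memory keys;
    # mem_colors, q_color: color feature vectors used in place of class labels.
    color_dist = (mem_colors - q_color).norm(dim=1)
    pos, neg = color_dist < delta, color_dist >= delta
    if pos.any() and neg.any():
        d = (mem_keys - q).norm(dim=1)
        # Pull the nearest "similar-color" slot, push the nearest "different-color" slot.
        return F.relu(d[pos].min() - d[neg].min() + margin)
    return q.new_zeros(())
```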
[key, recognition, producing, perform, consists, remembering] [computer, vision, limited, pattern, dominant, compute, corresponding, analysis, condition, single, rgb, international, allows] [color, colorization, image, conference, coloring, memopainter, generative, produce, input, ieee, figure, conditional, lpips, proposed, denoted, style, generator, grayscale, extracted, age, animation, cartoon] [deep, neural, compare, superior] [memory, model, query, adversarial, diverse, discriminator, character, retrieve, slot, vibrant] [threshold, feature] [class, triplet, rare, training, loss, data, learning, distance, existing, unsupervised, colored, test, label, learn, classification, main, trained, flower, novel]
@InProceedings{Yoo_2019_CVPR,
  author = {Yoo, Seungjoo and Bahng, Hyojin and Chung, Sunghyo and Lee, Junsoo and Chang, Jaehyuk and Choo, Jaegul},
  title = {Coloring With Limited Data: Few-Shot Colorization via Memory Augmented Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Characterizing and Avoiding Negative Transfer
Zirui Wang, Zihang Dai, Barnabas Poczos, Jaime Carbonell


When labeled data is scarce for a specific target task, transfer learning often offers an effective solution by utilizing data from a related source task. However, when transferring knowledge from a less related source, it may inversely hurt the target performance, a phenomenon known as negative transfer. Despite its pervasiveness, negative transfer is usually described in an informal manner, lacking rigorous definition, careful analysis, or systematic treatment. This paper proposes a formal definition of negative transfer and analyzes three important aspects thereof. Stemming from this analysis, a novel technique is proposed to circumvent negative transfer by filtering out unrelated source data. Based on adversarial networks, the technique is highly generic and can be applied to a wide range of transfer learning algorithms. The proposed approach is evaluated on six state-of-the-art deep transfer methods via experiments on four benchmark datasets with varying levels of difficulty. Empirically, the proposed method consistently improves the performance of all baseline methods and largely avoids negative transfer, even when the source data is degenerate.
[joint, work, rpt, recognition] [algorithm, computer, vision, matching, international, problem, pattern, underlying, assumption, exists] [method, conference, figure, proposed, study, based, ieee, amount, generative, input] [deep, performance, neural, better, table, gate, compare, compared, network, accuracy, processing, density, ratio, impact, achieve, standard] [discriminator, adversarial, perturbation, machine, marginal, random, perturbed, ntg, simple, model] [feature, three, benchmark, baseline] [transfer, negative, target, source, data, learning, domain, labeled, dann, danngate, set, divergence, distribution, knowledge, training, observe, classification, large, objective, adaptation, gap, datasets, task, unlabeled, label, dannt, unrelated, specific, exploit, avoid, sample, unsupervised]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Zirui and Dai, Zihang and Poczos, Barnabas and Carbonell, Jaime},
  title = {Characterizing and Avoiding Negative Transfer},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Building Efficient Deep Neural Networks With Unitary Group Convolutions
Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Christopher De Sa, Zhiru Zhang


We propose unitary group convolutions (UGConvs), a building block for CNNs which compose a group convolution with unitary transforms in feature space to learn a richer set of representations than group convolution alone. UGConvs generalize two disparate ideas in CNN architecture, channel shuffling (i.e. ShuffleNet) and block-circulant networks (i.e. CirCNN), and provide unifying insights that lead to a deeper understanding of each technique. We experimentally demonstrate that dense unitary transforms can outperform channel shuffling in DNN accuracy. On the other hand, different dense transforms exhibit comparable accuracy performance. Based on these observations we propose HadaNet, a UGConv network using Hadamard transforms. HadaNets achieve similar accuracy to circulant networks with lower computation complexity, and better accuracy than ShuffleNets with the same number of parameters and floating-point multiplies.
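A minimal sketch of a UGConv-style block with a Hadamard transform as the dense unitary channel mixing; layer sizes and placement are illustrative, and the channel count must be a power of two here:

```python
import torch
import torch.nn as nn

def hadamard(n):
    """Normalized Hadamard matrix (Sylvester construction); n must be a power of two."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H / H.shape[0] ** 0.5

class UGConvBlock(nn.Module):
    def __init__(self, channels, groups):
        super().__init__()
        self.gconv = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.register_buffer('H', hadamard(channels))   # dense unitary channel mixing

    def forward(self, x):
        x = self.gconv(x)
        return torch.einsum('oc,bchw->bohw', self.H, x)
```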
[work, perform] [dense, matrix, convs, computer, vision, error, fourier, note, equation, additional] [transform, figure, input, shuffling, proposed, image, diagonal] [group, transforms, hadamard, weight, conv, circulant, layer, unitary, shufflenet, channel, ugconv, deep, neural, network, block, accuracy, efficient, number, shuffle, dft, convolution, structure, table, convolutional, outperform, depthwise, ugconvs, output, size, orthogonal, tensor, sparsity, hadanet, sparse, structured, applied, architecture, achieve, parameter, hardware, performance, building, cnns, circnn, shufflenets, separable, filter, computational, small] [arxiv, random, requires] [feature, cnn, propose, spatial, improve] [learning, test, log, mnist, large]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Ritchie and Hu, Yuwei and Dotzel, Jordan and De Sa, Christopher and Zhang, Zhiru},
  title = {Building Efficient Deep Neural Networks With Unitary Group Convolutions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Semi-Supervised Learning With Graph Learning-Convolutional Networks
Bo Jiang, Ziyan Zhang, Doudou Lin, Jin Tang, Bin Luo


Graph Convolutional Neural Networks (graph CNNs) have been widely used for graph data representation and semi-supervised learning tasks. However, existing graph CNNs generally use a fixed graph which may not be optimal for semi-supervised learning tasks. In this paper, we propose a novel Graph Learning-Convolutional Network (GLCN) for graph data representation and semi-supervised learning. The aim of GLCN is to learn an optimal graph structure that best serves graph CNNs for semi-supervised learning by integrating both graph learning and graph convolution in a unified network architecture. The main advantage is that in GLCN both given labels and the estimated labels are incorporated and thus can provide useful 'weakly' supervised information to refine (or learn) the graph construction and also to facilitate the graph convolution operation for unknown label estimation. Experimental results on seven benchmarks demonstrate that GLCN significantly outperforms the state-of-the-art traditional fixed structure based graph CNNs.
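A minimal sketch of a graph-learning layer in this spirit, with an assumed similarity parameterization (a learned vector over absolute feature differences) followed by one graph convolution step:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphLearnConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.a = nn.Parameter(torch.randn(in_dim))       # similarity vector
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # graph convolution weights

    def forward(self, X):
        # X: (N, in_dim) node features for N samples.
        diff = (X.unsqueeze(1) - X.unsqueeze(0)).abs()   # (N, N, in_dim)
        S = F.softmax(F.relu(diff @ self.a), dim=1)      # learned, row-normalized graph S
        return F.relu(self.W(S @ X)), S                  # propagate features over S
```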
[graph, glcn, gcn, dataset, hidden, work, social, human, deepwalk, gat, outperforms] [optimal, note, international, provide, computer, pattern, vision, matrix, descriptor, estimated] [based, figure, proposed, image, conference, spectral, demonstrates, comparison, method, traditional] [network, convolutional, convolution, neural, architecture, number, denotes, layer, table, cnns, operation, performance, structure, best, parameter, better, output, weight, deep, adaptive, science] [model, node, natural, attention] [propose, feature, final, cnn] [learning, data, sij, loss, representation, label, mnist, unified, function, classification, learn, learned, lgl, generally, semisupervised, set, aij, dimension, datasets, select, supervised, experimental, labeled, perceptron, novel, main, incorporated]
@InProceedings{Jiang_2019_CVPR,
  author = {Jiang, Bo and Zhang, Ziyan and Lin, Doudou and Tang, Jin and Luo, Bin},
  title = {Semi-Supervised Learning With Graph Learning-Convolutional Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Remember: A Synaptic Plasticity Driven Framework for Continual Learning
Oleksiy Ostapenko, Mihai Puscas, Tassilo Klein, Patrick Jahnichen, Moin Nabi


Models trained in the context of continual learning (CL) should be able to learn from a stream of data over an undefined period of time. The main challenges herein are: 1) maintaining old knowledge while simultaneously benefiting from it when learning new tasks, and 2) guaranteeing model scalability with a growing amount of data to learn from. In order to tackle these challenges, we introduce Dynamic Generative Memory (DGM), a synaptic plasticity driven framework for continual learning. DGM relies on conditional generative adversarial networks with learnable connection plasticity realized with neural masking. Specifically, we evaluate two variants of neural masking: applied to (i) layer activations and (ii) to connection weights directly. Furthermore, we propose a dynamic network expansion mechanism that ensures sufficient model capacity to accommodate continually incoming tasks. The amount of added capacity is determined dynamically from the learned binary mask. We evaluate DGM in the continual class-incremental setup on visual classification tasks.
[previous, dynamic, dataset, time] [single, growing, total, problem, directly] [generative, real, generator, method, replay, free, proposed, amount, synthesized, figure, comparison] [network, number, layer, binary, parameter, neural, performance, deep, size, output, growth, efficient, capacity, accuracy, ratio, applied, newly, weight, epoch] [memory, model, adversarial, arxiv, observed, preprint, plasticity, evaluation, attention, generated] [mask, propose, expansion, feature] [learning, task, training, dgmw, dgm, data, continual, catastrophic, incremental, learn, forgetting, dgma, mnist, icarl, classification, learned, storing, knowledge, space, class, svhn, replayed, smax, setup, classifier, base, set, stored, incrementally, blocked, function]
@InProceedings{Ostapenko_2019_CVPR,
  author = {Ostapenko, Oleksiy and Puscas, Mihai and Klein, Tassilo and Jahnichen, Patrick and Nabi, Moin},
  title = {Learning to Remember: A Synaptic Plasticity Driven Framework for Continual Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
AIRD: Adversarial Learning Framework for Image Repurposing Detection
Ayush Jaiswal, Yue Wu, Wael AbdAlmageed, Iacopo Masi, Premkumar Natarajan


Image repurposing is a commonly used method for spreading misinformation on social media and online forums, which involves publishing untampered images with modified metadata to create rumors and further propaganda. While manual verification is possible, given vast amounts of verified knowledge available on the internet, the increasing prevalence and ease of this form of semantic manipulation call for the development of robust automatic ways of assessing the semantic integrity of multimedia data. In this paper, we present a novel method for image repurposing detection that is based on the real-world adversarial interplay between a bad actor who repurposes images with counterfeit metadata and a watchdog who verifies the semantic consistency between images and their accompanying metadata, where both players have access to a reference dataset of verified content, which they can use to achieve their goals. The proposed method exhibits state-of-the-art performance on location-identity, subject-identity and painting-artist verification, showing its efficacy across a diverse set of scenarios.
[dataset, framework, previous, recognition, social] [computer, international, vision, additional, form, case, pattern, equation] [image, reference, proposed, face, conference, described, real, consistency, figure, method, ieee, based] [neural, performance, order, deep, structured, network, top, verification] [metadata, query, fake, repurposing, adversarial, aird, counterfeiter, integrity, painter, model, wael, evaluation, news, encoders, system, evidence, misinformation, multimedia, interplay, entity, encoding, indexing, encoder, ayush, premkumar, call, text] [detection, semantic, detector, google, evaluated, detecting, three] [retrieval, datasets, training, trained, retrieved, data, learning, yue, similarity, novel, combination, knowledge]
@InProceedings{Jaiswal_2019_CVPR,
  author = {Jaiswal, Ayush and Wu, Yue and AbdAlmageed, Wael and Masi, Iacopo and Natarajan, Premkumar},
  title = {AIRD: Adversarial Learning Framework for Image Repurposing Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Kernelized Manifold Mapping to Diminish the Effect of Adversarial Perturbations
Saeid Asgari Taghanaki, Kumar Abhishek, Shekoofeh Azizi, Ghassan Hamarneh


The linear and non-flexible nature of deep convolutional models makes them vulnerable to carefully crafted adversarial perturbations. To tackle this problem, we propose a non-linear radial basis convolutional feature mapping by learning a Mahalanobis-like distance function. Our method then maps the convolutional features onto a linearly well-separated manifold, which prevents small adversarial perturbations from forcing a sample to cross the decision boundary. We test the proposed method on three publicly available image classification and segmentation datasets namely, MNIST, ISBI ISIC 2017 skin lesion segmentation, and NIH Chest X-Ray-14. We evaluate the robustness of our method to different gradient (targeted and untargeted) and non-gradient based attacks and compare it to several non-gradient masking defense strategies. Our results demonstrate that the proposed method can increase the resilience of deep convolutional neural networks to adversarial perturbations without accuracy drop on clean data.
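A minimal sketch of a radial-basis feature mapping with a learned diagonal (Mahalanobis-like) metric; where the layer sits in the network and the exact metric parameterization are assumptions:

```python
import torch
import torch.nn as nn

class RBFMapping(nn.Module):
    """Map features to RBF activations using a learned diagonal (Mahalanobis-like) metric."""
    def __init__(self, in_dim, n_centers):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_centers, in_dim))
        self.log_scales = nn.Parameter(torch.zeros(n_centers, in_dim))

    def forward(self, x):
        # x: (B, in_dim) flattened convolutional features.
        d2 = ((x.unsqueeze(1) - self.centers) ** 2 * self.log_scales.exp()).sum(dim=-1)
        return torch.exp(-d2)                            # (B, n_centers)
```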
[dataset] [radial, matrix, international, robust, computer, linear, analysis, linearly] [method, proposed, mapping, based, figure, clean, image, skin, input, conference, transformation, high, ieee, noise] [accuracy, layer, table, deep, convolutional, gradient, original, network, applied, neural, output, gaussian, performance, activation, prop, size, binary, standard, neuron, number, higher, compare] [adversarial, arxiv, preprint, rbf, defense, basis, robustness, manifold, model, attack, vector, orig, fsm, machine] [feature, segmentation, cnn, propose, chest, lesion, masking] [classification, learning, function, distance, training, mnist, test, data, mahalanobis, positive, reported]
@InProceedings{Taghanaki_2019_CVPR,
  author = {Asgari Taghanaki, Saeid and Abhishek, Kumar and Azizi, Shekoofeh and Hamarneh, Ghassan},
  title = {A Kernelized Manifold Mapping to Diminish the Effect of Adversarial Perturbations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Trust Region Based Adversarial Attack on Neural Networks
Zhewei Yao, Amir Gholami, Peng Xu, Kurt Keutzer, Michael W. Mahoney


Deep Neural Networks are quite vulnerable to adversarial perturbations. Current state-of-the-art adversarial attack methods typically require very time-consuming hyper-parameter tuning, or require many iterations to solve an optimization based adversarial attack. To address this problem, we present a new family of trust region based adversarial attacks, with the goal of computing adversarial perturbations efficiently. We propose several attacks based on variants of the trust region optimization method. We test the proposed methods on Cifar-10 and ImageNet datasets using several different models including AlexNet, ResNet-50, VGG-16, and DenseNet-121 models. Our methods achieve comparable results with the Carlini-Wagner (CW) attack, but with a significant speedup of up to 37x, for the VGG-16 model on a Titan Xp GPU. For the case of ResNet-50 on ImageNet, we can bring down its classification accuracy to less than 0.1% with at most 1.5% relative L_infinity (or L_2) perturbation requiring only 1.02 seconds as compared to 27.04 seconds for the CW attack. We have open sourced our method which can be accessed at [??].
[second, time, work, multiple] [case, optimization, radius, problem, compute, note, require, solve, robust, pattern, computed, form, solving] [method, input, image, based, ieee, conference, proposed] [neural, order, activation, compared, alexnet, smaller, hessian, table, imagenet, accuracy, max, deep, speed, approximation, magnitude, gradient, better, performance, achieve, dnn, adaptive, network, needed, achieves, lower, mlp] [attack, adversarial, perturbation, model, deepfool, fool, worst, arxiv, preprint, arg, trust, adap, decision, finding, find, stronger, fgsm, consider, swish, simple, defense, alexlike] [average, region, boundary, faster, including] [training, function, class, test, set, target, softmax, min, reported, hardest]
@InProceedings{Yao_2019_CVPR,
  author = {Yao, Zhewei and Gholami, Amir and Xu, Peng and Keutzer, Kurt and Mahoney, Michael W.},
  title = {Trust Region Based Adversarial Attack on Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PEPSI : Fast Image Inpainting With Parallel Decoding Network
Min-cheol Sagong, Yong-goo Shin, Seung-wook Kim, Seung Park, Sung-jea Ko


Recently, a generative adversarial network (GAN)-based method employing the coarse-to-fine network with the contextual attention module (CAM) has shown outstanding results in image inpainting. However, this method requires numerous computational resources due to its two-stage process for feature encoding. To solve this problem, in this paper, we present a novel network structure, called PEPSI: parallel extended-decoder path for semantic inpainting. PEPSI can reduce the number of convolution operations by adopting a structure consisting of a single shared encoding network and a parallel decoding network with coarse and inpainting paths. The coarse path produces a preliminary inpainting result with which the encoding network is trained to predict features for the CAM. At the same time, the inpainting path creates a higher-quality inpainting result using refined features reconstructed by the CAM. PEPSI not only reduces the number of convolution operations by almost half as compared to the conventional coarse-to-fine networks but also exhibits superior performance to other models in terms of testing time and qualitative scores.
[time] [local, square, computer, reconstruction, pattern, single, well] [image, inpainting, pepsi, hole, method, figure, gatedconv, background, proposed, generative, result, input, conference, masked, missing, real, ieee, psnr, ssim, patch, comparison, reconstructed, completed] [network, convolution, performance, table, neural, parallel, deep, layer, truncated, computational, compared, architecture, employ, called, number, structure, operation, convolutional, imagenet, original, processing] [path, decoding, adversarial, encoding, discriminator, modified, generate, arxiv, preprint, visual, complete, encoded, gan] [coarse, feature, region, mask, cam, global, foreground, roughly, semantic, relation, refinement] [conventional, similarity, distance, loss, trained, cosine, learning, training, novel, euclidean]
@InProceedings{Sagong_2019_CVPR,
  author = {Sagong, Min-cheol and Shin, Yong-goo and Kim, Seung-wook and Park, Seung and Ko, Sung-jea},
  title = {PEPSI : Fast Image Inpainting With Parallel Decoding Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Model-Blind Video Denoising via Frame-To-Frame Training
Thibaud Ehret, Axel Davy, Jean-Michel Morel, Gabriele Facciolo, Pablo Arias


Modeling the processing chain that has produced a video is a difficult reverse engineering task, even when the camera is available. This makes model-based video processing a still more complex task. In this paper we propose a fully blind video denoising method, with two versions, off-line and on-line. This is achieved by fine-tuning a pre-trained AWGN denoising network to the video with a novel frame-to-frame training strategy. Our denoiser can be used without knowledge of the origin of the video or burst, or of the post-processing steps applied to the camera sensor output. The on-line process only requires a couple of frames before achieving visually pleasing results for a wide range of perturbations. It nonetheless reaches state-of-the-art performance for standard Gaussian noise, and can be used off-line with still better performance.
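A minimal sketch of one frame-to-frame fine-tuning step, assuming an externally computed optical flow and an L1 noise2noise-style loss; occlusion masking, which a practical version would need, is omitted:

```python
import torch
import torch.nn.functional as F

def frame2frame_step(denoiser, opt, frame_t, frame_prev, flow):
    # frame_t, frame_prev: (B, C, H, W) noisy frames;
    # flow: (B, 2, H, W) optical flow from frame t to frame t-1, channels (dx, dy) in pixels.
    B, _, H, W = frame_t.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=frame_t.dtype),
                            torch.arange(W, dtype=frame_t.dtype), indexing='ij')
    coords = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow    # absolute sampling positions
    gx = 2 * coords[:, 0] / (W - 1) - 1                          # normalize for grid_sample
    gy = 2 * coords[:, 1] / (H - 1) - 1
    warped_prev = F.grid_sample(frame_prev, torch.stack((gx, gy), dim=-1), align_corners=True)
    loss = F.l1_loss(denoiser(frame_t), warped_prev)             # noise2noise-style target
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```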
[video, frame, optical, online, flow, signal, dataset, second, starting, framework] [single, case, deviation, optimization, ground, approach, require] [noise, image, denoising, dncnn, proposed, clean, psnr, figure, blind, method, awgn, denoise, clinic, jpeg, based, result, corrupted, real, denoiser, imaging] [network, gaussian, processing, neural, batch, standard, compressed, process, better, deep, convolutional, number, applied, performance, variance, compare, principle, parameter] [type, example, random, model, salt, pepper, simple, requires] [object, segmentation, cnn] [training, trained, loss, noisy, learning, data, large]
@InProceedings{Ehret_2019_CVPR,
  author = {Ehret, Thibaud and Davy, Axel and Morel, Jean-Michel and Facciolo, Gabriele and Arias, Pablo},
  title = {Model-Blind Video Denoising via Frame-To-Frame Training},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
End-To-End Efficient Representation Learning via Cascading Combinatorial Optimization
Yeonwoo Jeong, Yoonsung Kim, Hyun Oh Song


We develop hierarchically quantized efficient embedding representations for similarity-based search and show that this representation provides not only state-of-the-art search accuracy but also several orders of magnitude speedup during inference. The idea is to hierarchically quantize the representation so that the quantization granularity is greatly increased while maintaining the accuracy and keeping the computational complexity low. We also show that the problem of finding the optimal sparse compound hash code respecting the hierarchical structure can be optimized in polynomial time via minimum cost flow in an equivalent flow network. This allows us to train the method end-to-end in a mini-batch stochastic gradient descent setting. Our experiments on the Cifar100 and ImageNet datasets show state-of-the-art search accuracy while providing several orders of magnitude search speedup over exhaustive linear search over the dataset.
[flow, term, dataset, state, time, second, previous] [equation, corresponding, optimization, problem, solution, optimal, linear, compute, total, depth, vertex, bucket, define] [method, based, figure, input, image] [network, cost, search, deep, speedup, table, imagenet, number, accuracy, quantization, binary, efficient, neural, activation, rate, sparsity, size, output, performance, structure, layer, capacity, hierarchically, gradient] [query, vector, finding] [level, hierarchical, art, baseline, val, highest, granularity] [hash, learning, embedding, code, set, representation, metric, train, npairs, triplet, minimum, data, base, class, dissimilar, update, test, compound, loss, function, hashing, suf, ctest, seek, suppose, objective, emb, datasets]
@InProceedings{Jeong_2019_CVPR,
  author = {Jeong, Yeonwoo and Kim, Yoonsung and Oh Song, Hyun},
  title = {End-To-End Efficient Representation Learning via Cascading Combinatorial Optimization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Sim-Real Joint Reinforcement Transfer for 3D Indoor Navigation
Fengda Zhu, Linchao Zhu, Yi Yang


There has been an increasing interest in 3D indoor navigation, where a robot in an environment moves to a target according to an instruction. To deploy a robot for navigation in the physical world, lots of training data is required to learn an effective policy. It is quite labour intensive to obtain sufficient real environment data for training robots, while synthetic data is much easier to construct by rendering. Though it is promising to utilize synthetic environments to facilitate navigation training in the real world, real environments differ from synthetic ones in two aspects. First, the visual representations of the two environments have significant variances. Second, the houseplans of these two environments are quite different. Therefore, two types of information, i.e. visual representation and policy behavior, need to be adapted in the reinforcement model. The learning procedures of visual representation and of policy behavior are presumably reciprocal. We propose to jointly adapt visual representation and policy behavior to leverage the mutual impacts of environment and policy. Specifically, our method employs an adversarial feature adaptation model for visual representation transfer and a policy mimic strategy for policy behavior imitation. Experiments show that our method outperforms the baseline by 19.47% without any additional human annotations.
[action, joint, work, behavior, framework, recognition, predict, focus, lstm] [indoor, computer, vision, international, suncg, rgb, pattern, analysis, problem, approach, syn] [real, synthetic, conference, method, figure, identity, ieee, image, based, mapping] [performance, weight, rate, deep, table, neural, compared, scale] [policy, model, environment, adversarial, mimic, visual, reinforcement, navigation, unreal, success, arxiv, preprint, robot, agent, mapp, reinforce, mage, adaption, house, step, goal, rea, func, lidt] [feature, baseline, semantic, ablation, propose, adopt] [training, adaptation, loss, learning, transfer, trained, domain, function, data, knowledge, target, learn, space, large, distribution, test, representation, train, embedding, student]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Fengda and Zhu, Linchao and Yang, Yi},
  title = {Sim-Real Joint Reinforcement Transfer for 3D Indoor Navigation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
ChamNet: Towards Efficient Network Design Through Platform-Aware Model Adaptation
Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, Peter Vajda, Matt Uyttendaele, Niraj K. Jha


This paper proposes an efficient neural network (NN) architecture design methodology called Chameleon that honors given resource constraints. Instead of developing new building blocks or using computationally-intensive reinforcement learning algorithms, our approach leverages existing efficient network building blocks and focuses on exploiting hardware traits and adapting computation resources to fit target latency and/or energy constraints. We formulate platform-aware NN architecture search in an optimization framework and propose a novel algorithm to search for optimal architectures aided by efficient accuracy and resource (latency and/or energy) predictors. At the core of our algorithm lies an accuracy predictor built atop Gaussian Process with Bayesian optimization for iterative sampling. With a one-time building cost for the predictors, our algorithm produces state-of-the-art model architectures on different platforms under given constraints in just minutes. Our results show that adapting computation resources to building blocks is critical to model performance. Without the addition of any special features, our models achieve significant accuracy improvements relative to state-of-the-art handcrafted and automatically designed architectures. We achieve 73.8% and 75.3% top-1 accuracy on ImageNet at 20ms latency on a mobile CPU and DSP. At reduced latency, our models achieve up to 8.2% (4.8%) and 6.7% (9.3%) absolute top-1 accuracy improvements compared to MobileNetV2 and MnasNet, respectively, on a mobile CPU (DSP), and 2.7% (4.6%) and 5.6% (2.6%) accuracy gains over ResNet-101 and ResNet-152, respectively, on an Nvidia GPU (Intel CPU).
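A minimal sketch of the accuracy-predictor idea, assuming architectures can be encoded as fixed-length feature vectors: fit a Gaussian Process on (architecture, measured accuracy) pairs and choose the next architecture to evaluate with an upper-confidence-bound rule. The encoding and acquisition rule are assumptions for illustration, not the paper's exact procedure.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def fit_accuracy_predictor(arch_features, measured_acc):
    # arch_features: (n, d) numeric encodings of already-trained architectures
    gp = GaussianProcessRegressor(normalize_y=True)
    gp.fit(np.asarray(arch_features), np.asarray(measured_acc))
    return gp

def pick_next_architecture(gp, candidate_features, kappa=1.0):
    # Upper-confidence-bound acquisition: trade off predicted accuracy
    # against model uncertainty when choosing the next sample to train.
    mean, std = gp.predict(np.asarray(candidate_features), return_std=True)
    return int(np.argmax(mean + kappa * std))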
[prediction, framework, build, lut] [optimization, algorithm, manual, computer, direct, measurement, handcrafted, constraint] [based, figure, input, image, real, ieee] [latency, accuracy, architecture, search, energy, neural, efficient, network, mobile, cpu, chameleon, building, resource, bottleneck, performance, number, computation, snapdragon, hardware, bayesian, chamnet, design, size, table, cost, compared, speed, operator, mnasnet, compare, processing, process, gpu, platform, compression, deep, automatically, nvidia, deployment, hexagon, device, wide, pruning, achieves, andrew] [model, arxiv, preprint, refer] [regression] [adaptation, predictor, sample, space, training, learning, target, set, existing, adapted]
@InProceedings{Dai_2019_CVPR,
  author = {Dai, Xiaoliang and Zhang, Peizhao and Wu, Bichen and Yin, Hongxu and Sun, Fei and Wang, Yanghan and Dukhan, Marat and Hu, Yunqing and Wu, Yiming and Jia, Yangqing and Vajda, Peter and Uyttendaele, Matt and Jha, Niraj K.},
  title = {ChamNet: Towards Efficient Network Design Through Platform-Aware Model Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Regularizing Activation Distribution for Training Binarized Deep Networks
Ruizhou Ding, Ting-Wu Chin, Zeye Liu, Diana Marculescu


Binarized Neural Networks (BNNs) can significantly reduce the inference latency and energy consumption in resource-constrained devices due to their pure-logical computation and fewer memory accesses. However, training BNNs is difficult since the activation flow encounters degeneration, saturation, and gradient mismatch problems. Prior work alleviates these issues by increasing activation bits and adding floating-point scaling factors, thereby sacrificing BNN's energy efficiency. In this paper, we propose to use distribution loss to explicitly regularize the activation flow, and develop a framework to systematically formulate the loss. Our experiments show that the distribution loss can consistently improve the accuracy of BNNs without losing their energy benefits. Moreover, equipped with the proposed regularization, BNN training is shown to be robust to the selection of hyper-parameters including optimizer and learning rate.
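The snippet below is a loose sketch of what a distribution regularizer on pre-binarization activations could look like, penalizing batches whose activations collapse toward zero (degeneration) or drift far past the binarization threshold (saturation); the thresholds and exact form are assumptions, not the paper's formulation.

import torch

def activation_distribution_penalty(pre_bin_acts, min_spread=0.1, sat_margin=1.0):
    # pre_bin_acts: activations right before the sign/binarization function
    spread = pre_bin_acts.std(dim=0)                       # spread across the batch
    degeneration = torch.relu(min_spread - spread).mean()  # too concentrated at 0
    saturation = torch.relu(pre_bin_acts.abs() - sat_margin).mean()
    return degeneration + saturation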
[work, framework, flow, benefit] [international, computer, vision, deviation, robust, problem, pattern] [prior, conference, proposed, ieee, saturation, figure] [activation, bnn, accuracy, neural, gradient, network, binarized, energy, deep, table, convolutional, layer, batch, cost, bnns, scaling, regularization, binary, normalization, mismatch, standard, inference, difficulty, rate, selection, number, computation, formulate, degeneration, structure, hardware, convolution, small, larger, channel, higher, residual, processing, weight, compact, regularize, dnns, lead, approximate, imagenet, consumption] [sign, arxiv, preprint, model, memory, indicates, robustness] [baseline, improve, propose] [distribution, loss, training, learning, function, consistently, svhn, trained, testing, positive]
@InProceedings{Ding_2019_CVPR,
  author = {Ding, Ruizhou and Chin, Ting-Wu and Liu, Zeye and Marculescu, Diana},
  title = {Regularizing Activation Distribution for Training Binarized Deep Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Robustness Verification of Classification Deep Neural Networks via Linear Programming
Wang Lin, Zhengfeng Yang, Xin Chen, Qingye Zhao, Xiangkun Li, Zhiming Liu, Jifeng He


There is a pressing need to verify the robustness of classification deep neural networks (CDNNs) as they are embedded in many safety-critical applications. Existing robustness verification approaches rely on computing an over-approximation of the output set, and can hardly scale up to practical CDNNs, as a result of the error accumulation that accompanies approximation. In this paper, we develop a novel method for robustness verification of CDNNs with sigmoid activation functions. It converts the robustness verification problem into an equivalent problem of inspecting the most suspected point in the input region, which constitutes a nonlinear optimization problem. To make it amenable, the nonlinear constraints are relaxed into linear inclusions, further refining it into a linear programming problem. We conduct comparison experiments on several CDNNs trained to classify images from state-of-the-art benchmarks, showing advantages in precision and scalability that enable effective verification of practical CDNNs.
[recognition, hidden] [linear, problem, optimization, robust, tool, programming, range, point, international, property, practical, respect, university, bound, radius, corresponding, analysis, approach, defined, theorem, computer, error, solver, relaxed, condition, definition] [input, nonlinear, method, image, verify, conference, figure, based, ieee, comparison] [neural, verification, output, cdnns, activation, deep, layer, cdnn, network, performance, approximation, robustverifier, equivalent, interval, sigmoid, number, precision, relu, size, block, original, denotes, table, computing, unrobust, verifying, neuron] [robustness, vector, disturbance, perturbed, arg, machine, safety, generated] [region, three] [set, classification, function, label, learning, optimum, suppose, novel, bias, mnist, trained]
@InProceedings{Lin_2019_CVPR,
  author = {Lin, Wang and Yang, Zhengfeng and Chen, Xin and Zhao, Qingye and Li, Xiangkun and Liu, Zhiming and He, Jifeng},
  title = {Robustness Verification of Classification Deep Neural Networks via Linear Programming},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Additive Adversarial Learning for Unbiased Authentication
Jian Liang, Yuren Cao, Chenbin Zhang, Shiyu Chang, Kun Bai, Zenglin Xu


Authentication is a task aiming to confirm the truth between data instances and personal identities. Typical authentication applications include face recognition, person re-identification, authentication based on mobile devices and so on. The recently-emerging data-driven authentication process may encounter undesired biases, i.e., the models are often trained in one domain (e.g., for people wearing spring outfits) while required to apply in other domains (e.g., they change the clothes to summer outfits). To address this issue, we propose a novel two-stage method that disentangles the class/identity from domain-differences, and we consider multiple types of domain-difference. In the first stage, we learn disentangled representations by a one-versus-rest disentangle learning (OVRDL) mechanism. In the second stage, we improve the disentanglement by an additive adversarial learning (AAL) mechanism. Moreover, we discuss the necessity to avoid a learning dilemma due to disentangling causally related types of domain-difference. Comprehensive evaluation results demonstrate the effectiveness and superiority of the proposed method.
[recognition, hidden, multiple, outperforms, second] [problem, associated, vision, computer, denote, defined, direct] [attribute, method, disentangle, proposed, background, based, disentangling, demonstrate, face, disentangled, conference, ieee, dilemma, fairness, color, image] [table, network, group, effectiveness, optimize, achieve, mobile] [adversarial, arxiv, preprint, vector, mechanism, type, model, generated, consider] [stage, feature, improve, person, foreground, propose] [learning, data, domain, training, set, class, authentication, additive, adaptation, transfer, aal, train, learn, testing, aauc, test, unseen, digit, loss, unbiased, share, fml, independent, avoid, generalized, discriminative, min, cdrd, task]
@InProceedings{Liang_2019_CVPR,
  author = {Liang, Jian and Cao, Yuren and Zhang, Chenbin and Chang, Shiyu and Bai, Kun and Xu, Zenglin},
  title = {Additive Adversarial Learning for Unbiased Authentication},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network Using Truncated Gaussian Approximation
Zhezhi He, Deliang Fan


In the past years, Deep convolution neural network has achieved great success in many artificial intelligence applications. However, its enormous model size and massive computation cost have become the main obstacle for deployment of such powerful algorithm in the low power and resource-limited mobile systems. As the countermeasure to this problem, deep neural networks with ternarized weights (i.e. -1, 0, +1) have been widely explored to greatly reduce model size and computational cost, with limited accuracy degradation. In this work, we propose a novel ternarized neural network training method which simultaneously optimizes both weights and quantizer during training, differentiating from prior works. Instead of fixed and uniform weight ternarization, we are the first to incorporate the thresholds of weight ternarization into a closed-form representation using truncated Gaussian approximation, enabling simultaneous optimization of weights and quantizer through back-propagation training. With both of the first and last layer ternarized, the experiments on the ImageNet classification task show that our ternarized ResNet-18/34/50 only has 3.9/2.52/2.16% accuracy degradation in comparison to the full-precision counterparts.
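For illustration, a basic threshold-based ternarization step is sketched below; the paper instead learns the threshold jointly with the weights through a truncated-Gaussian parameterization, so the fixed threshold ratio here is purely an assumption.

import torch

def ternarize(weights, threshold_ratio=0.7):
    # Map full-precision weights to {-1, 0, +1} using a per-tensor threshold.
    threshold = threshold_ratio * weights.abs().mean()
    ternary = torch.zeros_like(weights)
    ternary[weights > threshold] = 1.0
    ternary[weights < -threshold] = -1.0
    return ternary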
[forward, updated, work, backward, incorporate, perform] [computer, initial, assumption, vision, technique, error, pattern, optimization] [method, proposed, input, conference, figure, ieee, degradation, comparison] [weight, neural, network, gradient, ternarization, deep, accuracy, quantization, ternarized, gaussian, dnn, layer, imagenet, inference, quantizer, trainable, scaling, size, ste, binary, quantized, computation, rate, order, sgd, ternary, convolution, compression, convolutional, vanilla, tern, truncated, low, reduce, pruning, processing, better, speed, performed, scale, iteration, design, epoch, approximation] [model, arxiv, preprint, correctness, initialize] [propose, curve, threshold] [training, function, distribution, update, learning, test, classification, convergence, large, loss]
@InProceedings{He_2019_CVPR,
  author = {He, Zhezhi and Fan, Deliang},
  title = {Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network Using Truncated Gaussian Approximation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Adversarial Defense by Stratified Convolutional Sparse Coding
Bo Sun, Nian-Hsuan Tsai, Fangchen Liu, Ronald Yu, Hao Su


We propose an adversarial defense method that achieves state-of-the-art performance among attack-agnostic adversarial defense methods while also maintaining robustness to input resolution, scale of adversarial perturbation, and scale of dataset size. Based on convolutional sparse coding, we construct a stratified low-dimensional quasi-natural image space that faithfully approximates the natural image space while also removing adversarial perturbations. We introduce a novel Sparse Transformation Layer (STL) in between the input image and the first layer of the neural network to efficiently project images into our quasi-natural image space. Our experiments show state-of-the-art performance of our method compared to other attack-agnostic adversarial defense methods in various adversarial settings.
[] [problem, projection, algorithm, project, local, corresponding, well, optimization, reconstruction, robust] [image, method, clean, input, dictionary, transformation, figure, resolution, remove, comparison, high, generative, pixel, quality, based, denoising, reconstruct, conference, ieee] [sparse, convolutional, table, network, accuracy, small, neural, layer, deep, vanilla, achieve, number, achieves, scale, filter, coding, performance, compared, gradient] [adversarial, defense, robustness, natural, attack, deepfool, model, perturbation, bim, arxiv, attacked, xadv, preprint, fgsm, transformed, introduce, defend, defensive, machine] [feature, propose, map] [space, stl, learning, training, classifier, classification, trained, large, set, learn, data, novel, existing]
@InProceedings{Sun_2019_CVPR,
  author = {Sun, Bo and Tsai, Nian-Hsuan and Liu, Fangchen and Yu, Ronald and Su, Hao},
  title = {Adversarial Defense by Stratified Convolutional Sparse Coding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Exploring Object Relation in Mean Teacher for Cross-Domain Detection
Qi Cai, Yingwei Pan, Chong-Wah Ngo, Xinmei Tian, Lingyu Duan, Ting Yao


Rendering synthetic data (e.g., 3D CAD-rendered images) to generate annotations for learning deep models in vision tasks has attracted increasing attention in recent years. However, simply applying the models learnt on synthetic images may lead to high generalization error on real images due to domain shift. To address this issue, recent progress in cross-domain recognition has featured the Mean Teacher, which directly simulates unsupervised domain adaptation as semi-supervised learning. The domain gap is thus naturally bridged with consistency regularization in a teacher-student scheme. In this work, we advance this Mean Teacher paradigm to be applicable for cross-domain detection. Specifically, we present Mean Teacher with Object Relations (MTOR) that novelly remolds Mean Teacher under the backbone of Faster R-CNN by integrating the object relations into the measure of consistency cost between teacher and student modules. Technically, MTOR firstly learns relational graphs that capture similarities between pairs of regions for teacher and student respectively. The whole architecture is then optimized with three consistency regularizations: 1) region-level consistency to align the region-level predictions between teacher and student, 2) inter-graph consistency for matching the graph structures between teacher and student, and 3) intra-graph consistency to enhance the similarity between regions of same class within the graph of student. Extensive experiments are conducted on the transfers across Cityscapes, Foggy Cityscapes, and SIM10k, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, we obtain a new record of single model: 22.8% of mAP on Syn2Real detection dataset.
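The two generic Mean Teacher ingredients the paper builds on can be sketched as follows: an exponential-moving-average teacher and a consistency penalty between teacher and student region-level predictions. MTOR's graph-based inter-graph and intra-graph consistency terms are omitted here, and all names are illustrative.

import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the student.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

def region_consistency_loss(student_probs, teacher_probs):
    # Mean squared error between class distributions of matched regions.
    return ((student_probs - teacher_probs.detach()) ** 2).mean()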
[graph, dataset, bicycle, ting, recognition, work] [error, matrix, directly, note] [consistency, image, figure, synthetic, real] [performance, deep, validation, regularization, network, parameter, firstly, table, better, constructed, coefficient] [model, relational, perturbed, random] [detection, region, object, map, faster, feature, three, car, proposal, semantic, person, affinity, backbone, enhance, foreground, average, rcnn, mtorre] [teacher, domain, student, target, mtor, set, learning, adaptation, source, labeled, unlabeled, data, foggy, similarity, loss, training, unsupervised, trained, transfer, paradigm, class, supervised, mtorr, sample, testing, align, gxt]
@InProceedings{Cai_2019_CVPR,
  author = {Cai, Qi and Pan, Yingwei and Ngo, Chong-Wah and Tian, Xinmei and Duan, Lingyu and Yao, Ting},
  title = {Exploring Object Relation in Mean Teacher for Cross-Domain Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Hierarchical Disentanglement of Discriminative Latent Features for Zero-Shot Learning
Bin Tong, Chao Wang, Martin Klinkigt, Yoshiyuki Kobayashi, Yuuichi Nonaka


Most studies in zero-shot learning model the relationship, in the form of a classifier or mapping, between features from images of seen classes and their attributes. Therefore, the degree of a model's generalization ability for recognizing unseen images is highly constrained by that of image features and attributes. In this paper, we discuss two seldom-discussed questions about generalization. Are image features trained with samples of seen classes expressive enough to capture the discriminative information for both seen and unseen classes? Is the relationship learned from seen image features and attributes sufficiently generalized to recognize unseen classes? To answer these two questions, we propose a model to learn discriminative and generalizable representations from image features under an auto-encoder framework. The discriminative latent features are learned through a group-wise disentanglement over feature groups with a hierarchical structure. On popular benchmark data sets, a significant improvement over state-of-the-art methods in tasks of typical and generalized zero-shot learning verifies the generalization ability of latent features for recognizing unseen images.
[second, recognizing, work, capture, recognition, framework, term] [denote, degree] [latent, image, figure, disentanglement, attribute, disentangled, mapping, generative, disentangling, transformation] [group, number, denotes, deep, variance, structure, layer, performance, variant, ratio, inference] [model, visual, adversarial, calculated, generated, variational, arxiv] [feature, semantic, hierarchical, three, average] [learning, unseen, data, discriminative, embedding, discrimination, loss, generalized, class, function, learn, dimension, representation, dlfzrl, apy, trained, set, cub, mutual, learned, space, sun, softmax, distance, classifier, ranked, suppose, label, independent, encourages, ldisentangle, relatedness, zeynep]
@InProceedings{Tong_2019_CVPR,
  author = {Tong, Bin and Wang, Chao and Klinkigt, Martin and Kobayashi, Yoshiyuki and Nonaka, Yuuichi},
  title = {Hierarchical Disentanglement of Discriminative Latent Features for Zero-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
R2GAN: Cross-Modal Recipe Retrieval With Generative Adversarial Network
Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Yanbin Hao


Representing procedure text such as recipes for cross-modal retrieval is inherently a difficult problem, let alone generating images from recipes for visualization. This paper studies a new version of GAN, named Recipe Retrieval Generative Adversarial Network (R2GAN), to explore the feasibility of generating images from procedure text for the retrieval problem. The motivation for using a GAN is twofold: learning compatible cross-modal features in an adversarial way, and explaining search results by showing the images generated from recipes. The novelty of R2GAN comes from its architecture design: specifically, a GAN with one generator and dual discriminators is used, which makes generating an image from a recipe feasible. Furthermore, empowered by the generated images, a two-level ranking loss in both embedding and image spaces is considered. These add-ons not only result in excellent retrieval performance, but also generate close-to-realistic food images useful for explaining the ranking of recipes. On the Recipe1M dataset, R2GAN demonstrates high scalability to data size, outperforms all existing approaches, and generates images that help humans intuitively interpret the search results.
[lstm] [international, computer, problem, vision, reconstruction] [image, figure, conference, generator, generative, real, ieee, reconstructed, acm, dual, proposed, based, result] [performance, search, architecture, original, design, deep, wide, layer, network, neural] [gan, recipe, food, adversarial, generated, cooking, model, discriminator, medr, text, arxiv, preprint, procedure, fake, adamine, vreal, multimedia, visual, common, query, rich, modality, jingjing, crossmodal, generate, generating, generation] [semantic, module, feature, three] [learning, retrieval, embedding, loss, ranking, embeddings, set, similarity, compatible, paper, function, large, trained, ranked, learn, learnt, pair, positive, training]
@InProceedings{Zhu_2019_CVPR,
  author = {Zhu, Bin and Ngo, Chong-Wah and Chen, Jingjing and Hao, Yanbin},
  title = {R2GAN: Cross-Modal Recipe Retrieval With Generative Adversarial Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Rethinking Knowledge Graph Propagation for Zero-Shot Learning
Michael Kampffmeyer, Yinbo Chen, Xiaodan Liang, Hao Wang, Yujia Zhang, Eric P. Xing


Graph convolutional neural networks have recently shown great potential for the task of zero-shot learning. These models are highly sample efficient as related concepts in the graph structure share statistical strength allowing generalization to new classes when faced with a lack of data. However, multi-layer architectures, which are required to propagate knowledge to distant nodes in the graph, dilute the knowledge by performing extensive Laplacian smoothing at each layer and thereby consequently decrease performance. In order to still enjoy the benefit brought by the graph structure while preventing dilution of knowledge from distant nodes, we propose a Dense Graph Propagation (DGP) module with carefully designed direct links among distant nodes. DGP allows us to exploit the hierarchical graph structure of the knowledge graph through additional connections. These connections are added based on a node's relationship to its ancestors and descendants. A weighting scheme is further used to weigh their contribution depending on the distance to the node to improve information propagation in the graph. Combined with finetuning of the representations in a two-stage training approach our method outperforms state-of-the-art zero-shot learning approaches.
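One propagation step of the dense-propagation idea might be sketched as below, assuming a row-normalized adjacency in which distant ancestors and descendants have already been down-weighted by the learned distance weights; the paper's two-phase ancestor/descendant scheme is not reproduced, and the names are illustrative.

import torch

def dense_propagation_step(node_feats, weighted_adj, weight):
    # node_feats: (N, D_in); weighted_adj: (N, N) row-normalized, with
    # distance-dependent weights folded in; weight: (D_in, D_out) learnable.
    return torch.relu(weighted_adj @ node_feats @ weight)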
[graph, propagation, gcn, predict, perform, previous, dataset, correspond] [approach, pattern, computer, dense, vision, analysis, additional, matrix, allows, form, allow] [conference, proposed, ieee, image, based, distant, sea, difference] [neural, layer, number, table, convolutional, imagenet, performance, structure, order, finetuning, scheme, denotes, accuracy, network, output, weight, processing] [model, node, word, visual, vector] [semantic, cnn, feature, predicted, propose, regression, connectivity] [dgp, knowledge, learning, sgcn, class, training, unseen, weighting, cat, set, gcnz, trained, classification, conse, adjacency, task, classifier, learned, test, distance, ancestor, embedding, descendant, observe]
@InProceedings{Kampffmeyer_2019_CVPR,
  author = {Kampffmeyer, Michael and Chen, Yinbo and Liang, Xiaodan and Wang, Hao and Zhang, Yujia and Xing, Eric P.},
  title = {Rethinking Knowledge Graph Propagation for Zero-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning to Learn Image Classifiers With Visual Analogy
Linjun Zhou, Peng Cui, Shiqiang Yang, Wenwu Zhu, Qi Tian


Humans are far better learners than machines, able to learn a new concept very quickly from only a few samples. A plausible explanation for this difference lies in two fundamental learning mechanisms: learning to learn and learning by analogy. In this paper, we attempt to investigate a new human-like learning method by organically combining these two mechanisms. In particular, we study how to generalize the classification parameters from previously learned concepts to a new concept. We first propose a novel Visual Analogy Graph Embedded Regression (VAGER) model to jointly learn a low-dimensional embedding space and a linear mapping function from the embedding space to classification parameters for base classes. We then propose an out-of-sample embedding method to learn the embedding of a new class, represented by a few samples, through its visual analogy with base classes, and derive the classification parameters for the new class. We conduct extensive experiments on the ImageNet dataset, and the results show that our method consistently and significantly outperforms state-of-the-art baselines.
[graph, auc, term, performs, recognition, previous, dataset] [computer, equation, pattern, problem, vision, algorithm, linear, international, matrix, note, well] [method, image, analogy, conference, ieee, mapping, figure, transferred, proposed, result] [deep, parameter, performance, neural, network, process, binary, ratio, imagenet, best, table, layer, number, original] [visual, model, mechanism, concept, evaluate] [regression, average, propose, feature] [classification, novel, learning, base, class, learn, embedding, training, similarity, classifier, learned, space, function, generalization, set, vager, embeddings, randomly, representation, loss, vnew, select, transfer, generalize, knowledge, data, setting, mrn, consistently]
@InProceedings{Zhou_2019_CVPR,
  author = {Zhou, Linjun and Cui, Peng and Yang, Shiqiang and Zhu, Wenwu and Tian, Qi},
  title = {Learning to Learn Image Classifiers With Visual Analogy},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Where's Wally Now? Deep Generative and Discriminative Embeddings for Novelty Detection
Philippe Burlina, Neil Joshi, I-Jeng Wang


We develop a framework for novelty detection (ND) methods relying on deep embeddings, either discriminative or generative, and also propose a novel framework for assessing their performance. While much progress was made recently in these approaches, it has been accompanied by certain limitations: most methods were tested on relatively simple problems (low resolution images / small number of classes) or involved non-public data; comparative performance has often proven inconclusive because of a lack of statistical significance; and evaluation has generally been done on non-canonical problem sets of differing complexity, making apples-to-apples comparative performance evaluation difficult. This has led to a relatively confusing state of affairs. We address these challenges via the following contributions: We propose a novel framework to measure the performance of novelty detection methods using a trade-space demonstrating performance (measured by ROCAUC) as a function of problem complexity. We also make several proposals to formally characterize problem complexity. We conduct experiments with problems of higher complexity (higher image resolution / number of classes). To this end, we design several canonical datasets built from CIFAR-10 and ImageNet (IN-125) which we make available to perform future benchmarks for novelty detection as well as other related tasks including semantic zero/adaptive shot and unsupervised learning. Finally, we demonstrate, as one of the methods in our ND framework, a generative novelty detection method whose performance exceeds that of all recent best-in-class generative ND methods.
[framework, anomaly, auc, work, perform, dataset, future, consists] [problem, outlier, inlier, approach, error, characterize, computed, local, inliers, international, single, reconstruction] [generative, image, method, latent, based, prior, figure, conference, ieee, resolution, high, denoted, proposed] [complexity, performance, deep, network, number, applied, convolutional, compared, density, best, neural, computing, imagenet] [vector, arxiv, preprint, adversarial, discriminator, gan, gans, machine] [detection, score, semantic, feature, average, propose, including] [novelty, discriminative, class, training, learning, embedding, set, data, embeddings, function, datasets, test, measure, support, xgan, novel, representation, gap, space, classification, main, lof]
@InProceedings{Burlina_2019_CVPR,
  author = {Burlina, Philippe and Joshi, Neil and Wang, I-Jeng},
  title = {Where's Wally Now? Deep Generative and Discriminative Embeddings for Novelty Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Weakly Supervised Image Classification Through Noise Regularization
Mengying Hu, Hu Han, Shiguang Shan, Xilin Chen


Weakly supervised learning is an essential problem in computer vision tasks, such as image classification, object recognition, etc., because it is expected to work in the scenarios where a large dataset with clean labels is not available. While there are a number of studies on weakly supervised image classification, they are usually limited to either single-label or multi-label scenarios. In this work, we propose an effective approach for weakly supervised image classification utilizing massive noisy labeled data with only a small set of clean labels (e.g., 5%). The proposed approach consists of a clean net and a residual net, which aim to learn a mapping from feature space to clean label space and a residual mapping from feature space to the residual between clean labels and noisy labels, respectively, in a multi-task learning manner. Thus, the residual net works as a regularization term to improve the clean net training. We evaluate the proposed approach on two multi-label datasets (OpenImage and MS COCO2014) and a single-label dataset (Clothing1M). Experimental results show that the proposed approach outperforms the state-of-the-art methods, and generalizes well to both single-label and multi-label scenarios.
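A rough sketch of the clean-net / residual-net split for multi-label data follows: the clean branch is supervised on the small trusted subset, while the sum of the clean prediction and the residual branch's output is supervised by the noisy labels. Loss choices and variable names are assumptions for illustration, not the paper's exact setup.

import torch
import torch.nn.functional as F

def clean_residual_losses(clean_logits, residual_out, noisy_targets,
                          clean_targets, has_clean):
    # Noisy labels supervise clean prediction + residual correction.
    noisy_pred = torch.sigmoid(clean_logits) + residual_out
    loss_noisy = F.mse_loss(noisy_pred, noisy_targets)
    # Only the small trusted subset carries clean supervision
    # (has_clean is a boolean mask over the batch; assumed non-empty here).
    loss_clean = F.binary_cross_entropy_with_logits(
        clean_logits[has_clean], clean_targets[has_clean])
    return loss_clean + loss_noisy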
[dataset, work, term, prediction, influence, forward] [approach, denote, robust, reliable, linear, provide, practical, well, corresponding, single] [clean, image, proposed, method, ieee, noise, mapping, veit, figure] [net, residual, network, performance, small, table, sigmoid, regularization, rate, size, imagenet, process, compared, activation, deep, layer, better, best, achieves, number, entire] [model, activate, evaluate] [backbone, coco, feature, weakly, map, baseline, improve, relation] [noisy, set, label, learning, data, classification, training, space, supervised, openimage, massive, classifier, train, labeled, learn, function, apall, unreliable, positive, china, datasets, overfitting, trained, shared, loss]
@InProceedings{Hu_2019_CVPR,
  author = {Hu, Mengying and Han, Hu and Shan, Shiguang and Chen, Xilin},
  title = {Weakly Supervised Image Classification Through Noise Regularization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Data-Driven Neuron Allocation for Scale Aggregation Networks
Yi Li, Zhanghui Kuang, Yimin Chen, Wayne Zhang


Successful visual recognition networks benefit from aggregating information spanning from a wide range of scales. Previous research has investigated information fusion of connected layers or multiple branches in a block, seeking to strengthen the power of multi-scale representations. Despite their great successes, existing practices often allocate the neurons for each scale manually, and keep the same ratio in all aggregation blocks of an entire network, rendering suboptimal performance. In this paper, we propose to learn the neuron allocation for aggregating multi-scale information in different building blocks of a deep network. The most informative output neurons in each block are preserved while others are discarded, and thus neurons for multiple scales are competitively and adaptively allocated. Our scale aggregation network (ScaleNet) is constructed by repeating a scale aggregation (SA) block that concatenates feature maps at a wide range of scales. Feature maps for each scale are generated by a stack of downsampling, convolution and upsampling operations. The data-driven neuron allocation and SA block achieve strong representational power at the cost of considerably low computational complexity. The proposed ScaleNet, by replacing all 3x3 convolutions in ResNet with our SA blocks, achieves better performance than ResNet and its outstanding variants like ResNeXt and SE-ResNet, in the same computational complexity. On ImageNet classification, ScaleNets absolutely reduce the top-1 error rate of ResNets by 1.12 (101 layers) and 1.82 (50 layers). On COCO object detection, ScaleNets absolutely improve the mAP with backbone of ResNets by 3.6 and 4.6 on Faster-RCNN, respectively. Code and models are released on https://github.com/Eli-YiLi/ScaleNet.
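A compact PyTorch sketch of a scale-aggregation block is given below: each branch downsamples, convolves, upsamples back to the input resolution, and the branch outputs are concatenated. Here the channel budget per scale is fixed, whereas the paper learns the neuron allocation; all layer sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAggregationBlock(nn.Module):
    def __init__(self, in_ch, branch_ch, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
             for _ in scales])

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = []
        for scale, conv in zip(self.scales, self.convs):
            y = F.avg_pool2d(x, kernel_size=scale) if scale > 1 else x
            y = F.relu(conv(y))
            if scale > 1:
                y = F.interpolate(y, size=(h, w), mode="nearest")
            outs.append(y)
        return torch.cat(outs, dim=1)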
[multiple, previous] [range, error, field, corresponding] [proposed, figure, image, method] [scale, block, neuron, scalenets, allocation, network, aggregation, imagenet, conv, max, table, output, resnets, computational, deep, receptive, stride, rate, downsampling, number, convolutional, neural, layer, residual, top, channel, complexity, convolution, absolutely, learnable, shortcut, pruning, size, wide, replace, connected, power, resnet, concatenate, efficient, architecture, validation] [indicates] [object, feature, detection, pool, map, context, backbone, faster, module, baseline, coco] [set, proportion, classification, learning, effectively, trained, setting, training]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yi and Kuang, Zhanghui and Chen, Yimin and Zhang, Wayne},
  title = {Data-Driven Neuron Allocation for Scale Aggregation Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Graphical Contrastive Losses for Scene Graph Parsing
Ji Zhang, Kevin J. Shih, Ahmed Elgammal, Andrew Tao, Bryan Catanzaro


Most scene graph parsers use a two-stage pipeline to detect visual relationships: the first stage detects entities, and the second predicts the predicate for each entity pair using a softmax distribution. We find that such pipelines, trained with only a cross entropy loss over predicate classes, suffer from two common errors. The first, Entity Instance Confusion, occurs when the model confuses multiple instances of the same type of entity (e.g. multiple cups). The second, Proximal Relationship Ambiguity, arises when multiple subject-predicate-object triplets appear in close proximity with the same predicate, and the model struggles to infer the correct subject-object pairings (e.g. mis-pairing musicians and their instruments). We propose a set of contrastive loss formulations that specifically target these types of errors within the scene graph parsing problem, collectively termed the Graphical Contrastive Losses. These losses explicitly force the model to disambiguate related and unrelated instances through margin constraints specific to each type of confusion. We further construct a relationship detector, called RelDN, using the aforementioned pipeline to demonstrate the efficacy of our proposed losses. Our model outperforms the winning method of the OpenImages Relationship Detection Challenge by 4.7% (16.5% relatively) on the test set. We also show improved results over the best previous methods on the Visual Genome and Visual Relationship Detection datasets.
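Generically, the margin-based contrastive terms described above can be sketched as follows: the score of the correct subject-predicate-object pairing must exceed the scores of confusable negatives by a margin. The entity-instance and proximal-relationship negative sampling specific to the paper is not reproduced, and the names are illustrative.

import torch
import torch.nn.functional as F

def margin_contrastive(pos_score, neg_scores, margin=0.2):
    # pos_score: score of the correct pairing (scalar tensor);
    # neg_scores: scores of confusable negative pairings, shape (K,).
    return F.relu(margin + neg_scores - pos_score).mean()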
[graph, subject, second, previous, work, multiple] [scene, body, define, pipeline, ground, note] [figure, proximal, aforementioned, proposed, image] [table, better, best, network, top, max, full] [predicate, entity, visual, relationship, model, reldn, correct, openimages, adding, phrase, referring, evaluation, language, find, genome] [object, detection, instance, three, feature, detector, semantic, spatial, module, box, graphical, cnn, aware, parsing, wine, predicted, score, scorewtd, agnostic, bounding, ablation] [class, loss, contrastive, set, positive, confusion, negative, train, sample, pair, margin, classification, embeddings, min, softmax, suffer, test]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Ji and Shih, Kevin J. and Elgammal, Ahmed and Tao, Andrew and Catanzaro, Bryan},
  title = {Graphical Contrastive Losses for Scene Graph Parsing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Transfer Learning for Multiple Class Novelty Detection
Pramuditha Perera, Vishal M. Patel


We propose a transfer learning-based solution for the problem of multiple class novelty detection. In particular, we propose an end-to-end deep-learning based approach in which we investigate how the knowledge contained in an external, out-of-distributional dataset can be used to improve the performance of a deep network for visual novelty detection. Our solution differs from the standard deep classification networks on two accounts. First, we use a novel loss function, membership loss, in addition to the classical cross-entropy loss for training networks. Secondly, we use the knowledge from the external dataset more effectively to learn globally negative filters, filters that respond to generic objects outside the known class set. We show that thresholding the maximal activation of the proposed network can be used to identify novel objects effectively. Extensive experiments on four publicly available novelty detection datasets show that the proposed method achieves significant improvements over the state-of-the-art methods.
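The membership loss described above treats each class activation as an independent membership score; one plausible minimal form, used alongside the usual cross-entropy, is sketched below with an assumed squared-error formulation and weighting.

import torch
import torch.nn.functional as F

def membership_loss(logits, targets):
    # Push the sigmoid "membership" of the true class toward 1 and of all
    # other classes toward 0, independently per class.
    scores = torch.sigmoid(logits)
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    return ((scores - one_hot) ** 2).mean()

def training_loss(logits, targets, lam=1.0):
    return F.cross_entropy(logits, targets) + lam * membership_loss(logits, targets)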
[dataset, recognition, perform] [globally, pattern, computer, problem, vision, single, local, note, associated] [proposed, method, reference, image, figure, based, ieee, conference, produce, high] [activation, deep, network, performance, top, neural, layer, convolutional, filter, alexnet, table, addition, kernel, compared] [considered, playing, vector, observed, correct, model, visual, introduce, generate, evidence] [detection, score, baseline, object, cnn, final, propose, feature] [novelty, class, negative, loss, positive, novel, classification, membership, training, datasets, calculator, data, knfst, learn, set, trained, learning, sample, test, standford, conventional, enrolled, knowledge, thresholding]
@InProceedings{Perera_2019_CVPR,
  author = {Perera, Pramuditha and Patel, Vishal M.},
  title = {Deep Transfer Learning for Multiple Class Novelty Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
QATM: Quality-Aware Template Matching for Deep Learning
Jiaxin Cheng, Yue Wu, Wael AbdAlmageed, Premkumar Natarajan


Finding a template in a search image is one of the core problems in many computer vision applications, such as template matching, image semantic alignment, image-to-GPS verification, etc. In this paper, we propose a novel quality-aware template matching method, which not only serves as a standalone template matching algorithm, but is also a trainable layer that can be easily plugged into any deep neural network. Specifically, we assess the quality of a matching pair as its soft-ranking among all matching pairs, and thus different matching scenarios like 1-to-1, 1-to-many, and many-to-many will all be reflected in different values. Our extensive studies on the classic template matching problem and deep learning tasks demonstrate the effectiveness of QATM: it not only outperforms SOTA template matching methods when used alone, but also largely improves existing DNN solutions when used in a DNN.
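The soft-ranking view of matching quality can be sketched as follows: given a similarity matrix between search-image patches and template patches, a pair's quality is the product of its softmax weight along each axis, so 1-to-1 matches score high while 1-to-many and many-to-many matches are discounted. The temperature value is an assumed hyperparameter, and this is only an illustrative reading of the abstract.

import torch

def matching_quality(similarity, alpha=25.0):
    # similarity: (num_search_patches, num_template_patches), e.g. cosine.
    soft_over_templates = torch.softmax(alpha * similarity, dim=1)
    soft_over_search = torch.softmax(alpha * similarity, dim=0)
    return soft_over_templates * soft_over_search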
[classic, dataset, outperforms, relies, assessment] [matching, template, computer, pattern, vision, international, ideal, qtam, analysis, problem, compute] [image, proposed, quality, ieee, conference, method, qualitative, figure, high, patch, reference, raw, based] [performance, search, deep, table, dnn, layer, verification, network, otb, standard, neural, size, low, best, efficient, max] [easily, evaluation, common, indicates, find, visual, machine, wael, introduced, potential] [qatm, matched, baseline, feature, score, semantic, region, unmatched, localization, object, response, wikimedia, bump, detection, deformable] [similarity, learning, target, negative, alignment, positive, cosine, softmax, pair, existing, function, distribution, task, yue, measure, likelihood]
@InProceedings{Cheng_2019_CVPR,
  author = {Cheng, Jiaxin and Wu, Yue and AbdAlmageed, Wael and Natarajan, Premkumar},
  title = {QATM: Quality-Aware Template Matching for Deep Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Retrieval-Augmented Convolutional Neural Networks Against Adversarial Examples
Jake Zhao (Junbo), Kyunghyun Cho


We propose a retrieval-augmented convolutional network (RaCNN) and propose to train it with local mixup, a novel variant of the recently proposed mixup algorithm. The proposed hybrid architecture, combining a convolutional network and an off-the-shelf retrieval engine, was designed to mitigate the adverse effect of off-manifold adversarial examples, while the proposed local mixup addresses on-manifold ones by explicitly encouraging the classifier to behave linearly in local neighbourhoods of the data manifold. Our evaluation of the proposed approach against seven readily available adversarial attacks on three datasets (CIFAR-10, SVHN, and ImageNet) demonstrates improved robustness compared to a vanilla convolutional network, and comparable performance with the state-of-the-art reactive defense approaches.
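For reference, a plain mixup step is sketched below; the paper's local mixup additionally restricts mixing to retrieved neighbours so that interpolation stays near the data manifold, which this sketch does not reproduce.

import torch

def mixup_batch(x, y_onehot, alpha=0.2):
    # Sample a mixing coefficient and blend both inputs and one-hot labels.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix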
[work] [local, convex, approach, robust, hull, projection, linearly, algorithm, normalized, varying, uniformly, case, computer] [proposed, input, image, clean, based, conference, strength, figure] [convolutional, neural, network, deep, mixup, vanilla, imagenet, original, table, gradient, number, accuracy, variant, performance, entire] [adversarial, racnn, attack, engine, robustness, arxiv, preprint, ifgsm, defense, deepfool, fgsm, machine, manifold, example, candidate, random, observed, behave, consider, indicates, reactive, mechanism, evaluate] [feature, baseline, final, propose, boundary, three, improve] [retrieval, classifier, training, learning, data, set, svhn, test, extractor, retrieved, observe, classification, train, pair, corresponds, scenario, novel, trained]
@InProceedings{(Junbo)_2019_CVPR,
  author = {Zhao (Junbo), Jake and Cho, Kyunghyun},
  title = {Retrieval-Augmented Convolutional Neural Networks Against Adversarial Examples},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Cross-Modal Embeddings With Adversarial Networks for Cooking Recipes and Food Images
Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-peng Lim, Steven C. H. Hoi


Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and healthy lifestyle. An important task under the food-computing umbrella is retrieval, which is particularly helpful for health related applications, where we are interested in retrieving important information about food (e.g., ingredients, nutrition, etc.). In this paper, we investigate an open research task of cross-modal retrieval between cooking recipes and food images, and propose a novel framework Adversarial Cross-Modal Embedding (ACME) to resolve the cross-modal retrieval task in food domains. Specifically, the goal is to learn a common embedding feature space between the two modalities, in which our approach consists of several novel ideas: (i) learning by using a new triplet loss scheme together with an effective sampling strategy, (ii) imposing modality alignment using an adversarial learning strategy, and (iii) imposing cross-modal translation consistency such that the embedding of one modality is able to recover some important information of corresponding instances in the other modality. ACME achieves the state-of-the-art performance on the benchmark Recipe1M dataset, validating the efficacy of the proposed technique.
[framework, predict, recognition, work, heterogeneous, lstm, joint] [international, corresponding, ground, analysis] [image, translation, conference, proposed, figure, consistency, component, real, acm, ieee] [performance, deep, top, add, achieve, neural, processing] [food, recipe, adversarial, chicken, acme, cooking, query, modality, generated, common, sugar, generate, evaluate, baking, relevant, visual, goal, garlic, salt, model, retrieve, bring, true, stir, multimedia, health, lemon, calorie, appropriate, consider] [feature, propose, combine, semantic, instance, predicted] [embedding, retrieval, loss, learning, triplet, retrieved, alignment, embeddings, large, representation, task, set, objective, align, hard, sample, learn, space, rank, test, novel, pairwise]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Hao and Sahoo, Doyen and Liu, Chenghao and Lim, Ee-peng and Hoi, Steven C. H.},
  title = {Learning Cross-Modal Embeddings With Adversarial Networks for Cooking Recipes and Food Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
FastDraw: Addressing the Long Tail of Lane Detection by Adapting a Sequential Prediction Network
Jonah Philion


The search for predictive models that generalize to the long tail of sensor inputs is the central difficulty when developing data-driven models for autonomous vehicles. In this paper, we use lane detection to study modeling and training techniques that yield better performance on real world test drives. On the modeling side, we introduce a novel fully convolutional model of lane detection that learns to decode lane structures instead of delegating structure inference to post-processing. In contrast to previous works, our convolutional decoder is able to represent an arbitrary number of lanes per image, preserves the polyline representation of lanes without reducing lanes to polynomials, and draws lanes iteratively without requiring the computational and temporal complexity of recurrent neural networks. Because our model includes an estimate of the joint distribution of neighboring pixels belonging to the same lane, our formulation includes a natural and computationally cheap definition of uncertainty. On the training side, we demonstrate a simple yet effective approach to adapt the model to new environments using unsupervised style transfer. By training FastDraw to make predictions of lane structure that are invariant to low-level stylistic differences between images, we achieve strong performance at test time in weather and lighting conditions that deviate substantially from those of the annotated datasets that are publicly available. We quantitatively evaluate our approach on the CVPR 2017 Tusimple lane marking challenge, difficult CULane datasets [29], and a small labeled dataset of our own and achieve competitive accuracy while running at 90 FPS.
[lane, fastdraw, tusimple, dataset, culane, human, perform, work, sequence, long, previous, recurrent, marking, follow, exclusively, driving] [initial, approach, weather, well, ground, definition, robust, point, truth, publicly, general, shape, algorithm] [pixel, figure, style, image, munit, translation, decode, high, input, row, demonstrate, conditioning] [network, performance, convolutional, neural, table, achieve, number, competitive, accuracy, size, output, top, standard] [model, simple, find, decoding, decoder, evaluate, represent, adversarial, machine] [detection, segmentation, annotated, annotation, object, semantic, public, mask] [distribution, training, train, unsupervised, learning, test, datasets, trained, representation, transfer, adapt, categorical, loss, difficult, likelihood, data, distance, draw, learned]
@InProceedings{Philion_2019_CVPR,
  author = {Philion, Jonah},
  title = {FastDraw: Addressing the Long Tail of Lane Detection by Adapting a Sequential Prediction Network},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Weakly Supervised Video Moment Retrieval From Text Queries
Niluthpol Chowdhury Mithun, Sujoy Paul, Amit K. Roy-Chowdhury


There have been a few recent methods proposed in text to video moment retrieval using natural language queries, but requiring full supervision during training. However, acquiring a large number of training videos with temporal boundary annotations for each text description is extremely time-consuming and often not scalable. In order to cope with this issue, in this work, we introduce the problem of learning from weak labels for the task of text to video moment retrieval. The weak nature of the supervision is because, during training, we only have access to the video-text pairs rather than the temporal extent of the video to which different text descriptions relate. We propose a joint visual-semantic embedding based framework that learns the notion of relevant segments from video using only video-level sentence descriptions. Specifically, our main idea is to utilize latent alignment between video frames and sentence descriptions using Text-Guided Attention (TGA). TGA is then used during the test phase to retrieve relevant moments. Experiments on two benchmark datasets demonstrate that our method achieves comparable performance to state-of-the-art fully supervised approaches.
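A minimal sketch of text-guided attention over frames, assuming frame features and the sentence embedding already live in a shared space: the sentence scores each frame, and the softmax-weighted sum gives a sentence-conditioned video representation. Dot-product scoring is an assumption for illustration.

import torch
import torch.nn.functional as F

def text_guided_attention(frame_feats, sent_feat):
    # frame_feats: (num_frames, dim); sent_feat: (dim,)
    scores = frame_feats @ sent_feat          # per-frame relevance to the text
    weights = F.softmax(scores, dim=0)
    return (weights.unsqueeze(1) * frame_feats).sum(dim=0)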
[video, temporal, joint, moment, dataset, activity, didemo, gru, report, framework, multiple, efrc, time, focus, mcn, prediction, portion, localizing, performs, action] [approach, corresponding, problem, case, matching] [proposed, based, method, image] [performance, network, neural, table, deep, convolutional, compared] [text, sentence, attention, relevant, description, evaluation, natural, language, model, vector, access, visual, arxiv, preprint, retrieve, consider] [feature, weakly, fully, localization, person, utilize] [embedding, supervised, retrieval, learning, training, similarity, task, set, learn, space, alignment, test, observe, datasets, loss, representation, embeddings, softmax, cosine]
@InProceedings{Mithun_2019_CVPR,
  author = {Chowdhury Mithun, Niluthpol and Paul, Sujoy and Roy-Chowdhury, Amit K.},
  title = {Weakly Supervised Video Moment Retrieval From Text Queries},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Content-Aware Multi-Level Guidance for Interactive Instance Segmentation
Soumajit Majumder, Angela Yao


In interactive instance segmentation, users give feedback to iteratively refine segmentation masks. The user-provided clicks are transformed into guidance maps which provide the network with necessary cues on the whereabouts of the object of interest. Guidance maps used in current systems are purely distance-based and are either too localized or non-informative. We propose a novel transformation of user clicks to generate content-aware guidance maps that leverage the hierarchical structural information present in an image. Using our guidance maps, even the most basic FCNs are able to outperform existing approaches that require state-of-the-art segmentation networks pre-trained on large scale segmentation datasets. We demonstrate the effectiveness of our proposed transformation strategy through comprehensive experimentation in which we significantly raise state-of-the-art on four standard interactive segmentation benchmarks.
[current, previous, prediction, graph, second, dataset] [single, ground, truth, initial, michael, geodesic, approach, estimate] [image, based, user, pixel, transform, input, proposed, figure, pablo, feedback] [number, network, scale, deep, performance, impact, convolutional, small, table, andrew, achieve] [required, generate, generated, provided, generating, consider, sampled, evaluation] [guidance, object, segmentation, interactive, map, superpixels, instance, pascal, superpixel, voc, improvement, grabcut, average, click, semantic, segmenting, foreground, slic, fully, berkeley, mcg, miou, val, bounding, mask] [positive, distance, negative, euclidean, set, training, existing, learning, base, min, randomly]
@InProceedings{Majumder_2019_CVPR,
  author = {Majumder, Soumajit and Yao, Angela},
  title = {Content-Aware Multi-Level Guidance for Interactive Instance Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Greedy Structure Learning of Hierarchical Compositional Models
Adam Kortylewski, Aleksander Wieczorek, Mario Wieser, Clemens Blumer, Sonali Parbhoo, Andreas Morel-Forster, Volker Roth, Thomas Vetter


In this work, we consider the problem of learning a hierarchical generative model of an object from a set of images which show examples of the object in the presence of variable background clutter. Existing approaches to this problem are limited by making strong a priori assumptions about the object's geometric structure and require segmented training data for learning. In this paper, we propose a novel framework for learning hierarchical compositional models (HCMs) which do not suffer from the mentioned limitations. We present a generalized formulation of HCMs and describe a greedy structure learning framework that consists of two phases: bottom-up part learning and top-down model composition. Our framework integrates the foreground-background segmentation problem into the structure learning task via a background model. As a result, we can jointly optimize for the number of layers in the hierarchy, the number of parts per layer and a foreground-background segmentation based on class labels only. We show that the learned HCMs are semantically meaningful and achieve competitive results when compared to other generative object models at object classification on a standard transfer learning dataset.
[framework, dependency, recognition, graph, human, modeling] [computer, approach, vision, active, pattern, analysis, matching, note, shape, international, algorithm, journal, problem, well] [background, generative, image, ieee, figure, proposed, conference, composed, meaningful, dictionary, prior] [structure, number, layer, process, highly, scheme, compared, performance, full] [model, compositional, basis, greedy, hcms, natural, habm, visual, cabm, strong, semantically, hcm, dependence, random] [object, hierarchical, segmented, propose, detailed, holistic, deformable, segmentation, feature, hierarchy, detection] [learning, training, learned, learn, data, domain, set, unsupervised, generalized, classification, generalization, clustering, probabilistic, gabor, task, adaptation, class, posterior]
@InProceedings{Kortylewski_2019_CVPR,
  author = {Kortylewski, Adam and Wieczorek, Aleksander and Wieser, Mario and Blumer, Clemens and Parbhoo, Sonali and Morel-Forster, Andreas and Roth, Volker and Vetter, Thomas},
  title = {Greedy Structure Learning of Hierarchical Compositional Models},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Interactive Full Image Segmentation by Considering All Regions Jointly
Eirikur Agustsson, Jasper R. R. Uijlings, Vittorio Ferrari


We address interactive full image annotation, where the goal is to accurately segment all object and stuff regions in an image. We propose an interactive, scribble-based annotation framework which operates on the whole image to produce segmentations for all regions. This enables sharing scribble corrections across regions, and allows the annotator to focus on the largest errors made by the machine across the whole image. To realize this, we adapt Mask-RCNN [22] into a fast interactive segmentation framework and introduce an instance-aware loss measured at the pixel-level in the full image canvas, which lets predictions for nearby regions properly compete for space. Finally, we compare to interactive single object segmentation on the COCO panoptic dataset [11, 27, 34]. We demonstrate that our interactive full image segmentation approach leads to a 5% IoU gain, reaching 90% IoU at a budget of four extreme clicks and four corrective scribbles per region.
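As a rough illustration of such an instance-aware pixel-level loss, the sketch below (names and tensor layout are assumptions, not the authors' code) applies a per-pixel softmax over region logits rendered into one shared canvas, so that nearby regions compete for every pixel:

import torch
import torch.nn.functional as F

def full_canvas_instance_loss(region_logits, target):
    # region_logits: (R+1, H, W) logits for R object/stuff regions plus background,
    # all rendered into the shared full-image canvas.
    # target: (H, W) long tensor giving the ground-truth region index per pixel.
    # The softmax inside cross_entropy runs across the region axis, so predictions
    # for nearby regions compete for each pixel.
    return F.cross_entropy(region_logits.unsqueeze(0), target.unsqueeze(0))

# toy usage: 3 regions + background on an 8x8 canvas
logits = torch.randn(4, 8, 8)
target = torch.randint(0, 4, (8, 8))
print(full_canvas_instance_loss(logits, target))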
[framework, work, prediction, dataset, multiple, nearby, human] [single, error, maskrcnn, rgb, point] [image, input, based, demonstrate, simulate, figure, pixel] [full, network, size, sharing, convolutional, deep, number, architecture, binary, compare, efficient, neural, inference, allocation] [model, machine, provided, common, create, requires, generate, enables] [segmentation, region, interactive, extreme, scribble, object, annotator, annotation, iou, stuff, map, box, semantic, roi, coco, corrective, backbone, dextr, mask, predicted, panoptic, feature, compete, fully, final, bounding, head] [loss, training, class, negative, learning, trained, logit, positive, task, train, set]
@InProceedings{Agustsson_2019_CVPR,
  author = {Agustsson, Eirikur and Uijlings, Jasper R. R. and Ferrari, Vittorio},
  title = {Interactive Full Image Segmentation by Considering All Regions Jointly},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Active Contour Models for Medical Image Segmentation
Xu Chen, Bryan M. Williams, Srinivasa R. Vallabhaneni, Gabriela Czanner, Rachel Williams, Yalin Zheng


Image segmentation is an important step in medical image processing and has been widely studied and developed for refinement of clinical analysis and applications. New models based on deep learning have improved results but are restricted to pixel-wise fitting of the segmentation map. Our aim was to tackle this limitation by developing a new deep learning model which takes into account the area both inside and outside the region of interest, as well as the size of the boundary, during learning. Specifically, we propose a new loss function which incorporates area and size information and integrates this into a dense deep learning model. We evaluated our approach on a dataset of more than 2,000 cardiac MRI scans. Our results show that the proposed loss function outperforms the mainstream cross-entropy loss on two common segmentation networks, and remains robust across different values of the hyperparameter lambda.
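A minimal sketch of an active-contour-style loss in this spirit, combining a boundary-length term with region terms inside and outside the predicted mask (the exact weighting and formulation in the paper may differ):

import torch

def active_contour_loss(pred, target, lam=1.0, eps=1e-8):
    # pred: (N, 1, H, W) soft segmentation in [0, 1]; target: (N, 1, H, W) binary mask.
    # Boundary-length term: total variation of the predicted mask.
    dx = pred[:, :, 1:, :] - pred[:, :, :-1, :]
    dy = pred[:, :, :, 1:] - pred[:, :, :, :-1]
    length = torch.sqrt(dx[:, :, :, 1:] ** 2 + dy[:, :, 1:, :] ** 2 + eps).mean()
    # Region terms: disagreement inside and outside the region of interest
    # (Chan-Vese style, with the region constants fixed to 1 and 0).
    region_in = (pred * (target - 1.0) ** 2).mean()
    region_out = ((1.0 - pred) * target ** 2).mean()
    return length + lam * (region_in + region_out)

# toy usage
pred = torch.rand(2, 1, 32, 32)
target = (torch.rand(2, 1, 32, 32) > 0.5).float()
print(active_contour_loss(pred, target))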
[work, dataset, time, workshop, prediction] [active, computer, international, left, dense, vision, ground, defined, problem, truth, pattern, limited, analysis, well, approach, good] [image, proposed, contour, based, figure, conference, ieee, method, clinical, biomedical, cmr, pixel, limitation, high, result] [performance, deep, convolutional, neural, network, energy, order, computational, compared, fast, layer, applied, cnns] [model, length, dice, evaluate, commonly, introduced, arxiv, preprint, improved] [segmentation, ventricle, cardiac, region, cnn, medical, hausdorff, acwe, acw, myocardium, global, disease, score, area, inside, evaluated] [loss, function, learning, minimization, classification, training, measure, conventional]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Xu and Williams, Bryan M. and Vallabhaneni, Srinivasa R. and Czanner, Gabriela and Williams, Rachel and Zheng, Yalin},
  title = {Learning Active Contour Models for Medical Image Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Customizable Architecture Search for Semantic Segmentation
Yiheng Zhang, Zhaofan Qiu, Jingen Liu, Ting Yao, Dong Liu, Tao Mei


In this paper, we propose a Customizable Architecture Search (CAS) approach to automatically generate a network architecture for semantic image segmentation. The generated network consists of a sequence of stacked computation cells. A computation cell is represented as a directed acyclic graph, in which each node is a hidden representation (i.e., feature map) and each edge is associated with an operation (e.g., convolution and pooling), which transforms data to a new layer. During the training, the CAS algorithm explores the search space for an optimized computation cell to build a network. The cells of the same type share one architecture but with different weights. In real applications, however, an optimization may need to be conducted under some constraints such as GPU time and model size. To this end, a cost corresponding to the constraint will be assigned to each operation. When an operation is selected during the search, its associated cost will be added to the objective. As a result, our CAS is able to search an optimized architecture with customized constraints. The approach has been thoroughly evaluated on Cityscapes and CamVid datasets, and demonstrates superior performance over several state-of-the-art techniques. More remarkably, our CAS achieves 72.3% mIoU on the Cityscapes dataset at a speed of 108 FPS on an Nvidia TitanXp GPU.
[time, work, directed, dataset, consists, represented, manually] [normal, fitting, constraint, optimization, associated] [image, figure, identity, proposed, resolution, input, real] [architecture, network, cell, search, conv, performance, operation, gpu, cost, inference, optimized, convolutional, computation, reduction, pooled, customizable, gradient, convolution, speed, cpu, automatically, validation, pooling, number, designed, computational, lval, residual, deep, fps, params, dilated, separable, camvid, design, achieve, tradeoff, lightweight, ltrain, best, searching, efficient] [generate, procedure, candidate, mac] [semantic, segmentation, miou, spatial, feature, backbone, curve, edge, map, propose] [set, learning, test, training, cat, loss]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Yiheng and Qiu, Zhaofan and Liu, Jingen and Yao, Ting and Liu, Dong and Mei, Tao},
  title = {Customizable Architecture Search for Semantic Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Local Features and Visual Words Emerge in Activations
Oriane Simeoni, Yannis Avrithis, Ondrej Chum


We propose a novel method of deep spatial matching (DSM) for image retrieval. Initial ranking is based on image descriptors extracted from convolutional neural network activations by global pooling, as in recent state-of-the-art work. However, the same sparse 3D activation tensor is also approximated by a collection of local features. These local features are then robustly matched to approximate the optimal alignment of the tensors. This happens without any network modification, additional layers or training. No local feature detection happens on the original image. No local feature descriptors and no visual vocabulary are needed throughout the whole process. We experimentally show that the proposed method achieves the state-of-the-art performance on standard benchmarks across different network architectures and different global pooling methods. The highest gain in performance is achieved when diffusion on the nearest-neighbor graph of global descriptors is initiated from spatially verified images.
[work, graph] [local, matching, geometric, inliers, linear, descriptor, single, josef, june, initial, pattern] [image, based, figure, method, proposed, spatially, transformation, database, denoted, input, extracted] [convolutional, deep, network, activation, pooling, channel, number, tensor, apply, entire, whitening, table, performance, verification, efficient, sparse, scale, vgg, neural, fast, size] [visual, query, collection, gem, mac, vocabulary] [spatial, feature, map, global, roxf, rpar, cnn, object, giorgos, yannis, ond, rej, detection, detected, response, verified, herv] [retrieval, diffusion, representation, similarity, nearest, neighbor, hard, negative, medium, conventional, learning, set, positive, ranking, large]
@InProceedings{Simeoni_2019_CVPR,
  author = {Simeoni, Oriane and Avrithis, Yannis and Chum, Ondrej},
  title = {Local Features and Visual Words Emerge in Activations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Hyperspectral Image Super-Resolution With Optimized RGB Guidance
Ying Fu, Tao Zhang, Yinqiang Zheng, Debing Zhang, Hua Huang


To overcome the limitations of existing hyperspectral cameras on spatial/temporal resolution, fusing a low resolution hyperspectral image (HSI) with a high resolution RGB (or multispectral) image into a high resolution HSI has been prevalent. Previous methods for this fusion task usually employ hand-crafted priors to model the underlying structure of the latent high resolution HSI, and the effect of the camera spectral response (CSR) of the RGB camera on super-resolution accuracy has rarely been investigated. In this paper, we first present a simple and efficient convolutional neural network (CNN) based method for HSI super-resolution in an unsupervised way, without any prior training. We then append a CSR optimization layer onto the HSI super-resolution network, either to automatically select the best CSR in a given CSR dataset, or to design the optimal CSR under some physical restrictions. Experimental results show that our method outperforms state-of-the-art methods, and the CSR optimization can further boost the accuracy of HSI super-resolution.
[dataset, recognition] [rgb, optimal, camera, computer, vision, optimization, corresponding, rmse, international, ground, error, truth, pattern, june, matrix, constraint, scene] [hsi, csr, image, resolution, method, high, hyperspectral, spectral, conference, figure, ieee, icvl, hybrid, based, restored, input, sam, prior, remote, harvard, superresolution, real, nssr, cave, ssim, latent, nonlinear, mapping, recovery, csrs, hsis, ergas, coupled] [low, best, sparse, network, layer, selection, optimize, design, deep, convolution, tensor, convolutional, factorization, designed, better, table, accuracy, bayesian] [worst, model, evaluate] [spatial, three, cnn, guidance] [training, learning, learned, set, selected, unsupervised, representation, learn, existing, select, function, effectively, datasets]
@InProceedings{Fu_2019_CVPR,
  author = {Fu, Ying and Zhang, Tao and Zheng, Yinqiang and Zhang, Debing and Huang, Hua},
  title = {Hyperspectral Image Super-Resolution With Optimized RGB Guidance},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Adaptive Confidence Smoothing for Generalized Zero-Shot Learning
Yuval Atzmon, Gal Chechik


Generalized zero-shot learning (GZSL) is the problem of learning a classifier where some classes have samples and others are learned from side information, like semantic attributes or text description, in a zero-shot learning fashion (ZSL). Training a single model that operates in these two regimes simultaneously is challenging. Here we describe a probabilistic approach that breaks the model into three modular components, and then combines them in a consistent way. Specifically, our model consists of three classifiers: a "gating" model that makes soft decisions on whether a sample is from a "seen" class, and two experts: a ZSL expert, and an expert model for seen classes. We address two main difficulties in this approach: how to provide an accurate estimate of the gating probability without any training samples for unseen classes; and how to use an expert's predictions when it observes samples outside of its domain. The key insight to our approach is to pass information between the three models to improve each one's accuracy, while maintaining the modular structure. We test our approach, adaptive confidence smoothing (COSMO), on four standard GZSL benchmark datasets and find that it largely outperforms state-of-the-art GZSL models. COSMO is also the first model that closes the gap and surpasses the performance of generative models for GZSL, even though it is a lightweight model that is much easier to train and tune.
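A minimal sketch of the gating idea (variable names are illustrative; the actual COSMO model additionally smooths the gating probability and passes information between the three modules):

import numpy as np

def combine_experts(p_seen, p_unseen, p_gate_seen):
    # p_seen: probabilities over seen classes from the seen-class expert.
    # p_unseen: probabilities over unseen classes from the ZSL expert.
    # p_gate_seen: soft gating probability that the sample belongs to a seen class.
    # Returns a single distribution over the union of seen and unseen classes.
    return np.concatenate([p_gate_seen * p_seen, (1.0 - p_gate_seen) * p_unseen])

# toy usage with 3 seen and 2 unseen classes
p_seen = np.array([0.7, 0.2, 0.1])
p_unseen = np.array([0.6, 0.4])
print(combine_experts(p_seen, p_unseen, p_gate_seen=0.8))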
[gating, combined, auc, prediction, combining, work] [approach, confidence, estimate, constant, denote, provide] [expert, generative, image, prior, figure, tend, input, based, study] [accuracy, smoothing, adaptive, performance, compared, network, output, table, neural, validation, deep, better] [model, provided, evaluation, decision, probability, ood] [curve, three, baseline, semantic, detection, comparing, ablation, improve] [unseen, learning, cosmo, gzsl, training, class, test, acch, zsl, sun, generalized, lago, softmax, soft, trained, set, sample, cub, classifier, distribution, pzs, train, hard, learn, acctr, accts, probabilistic, classify, temperature, cbg, awa, ating, main, classification, data]
@InProceedings{Atzmon_2019_CVPR,
  author = {Atzmon, Yuval and Chechik, Gal},
  title = {Adaptive Confidence Smoothing for Generalized Zero-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PMS-Net: Robust Haze Removal Based on Patch Map for Single Images
Wei-Ting Chen, Jian-Jiun Ding, Sy-Yen Kuo


In this paper, we propose a novel haze removal algorithm based on a new feature called the patch map. Conventional patch-based haze removal algorithms (e.g. the Dark Channel prior) usually perform dehazing with a fixed patch size. However, a fixed patch size may produce several problems in the recovered results, such as oversaturation and color distortion. Therefore, in this paper, we design an adaptive and automatic patch size selection model called the Patch Map Selection Network (PMS-Net) to select the patch size corresponding to each pixel. This network is based on a convolutional neural network (CNN) and generates the patch map directly from the input image. Experimental results on both synthesized and real-world hazy images show that, with the combination of the proposed PMS-Net, the performance in haze removal is much better than that of other state-of-the-art algorithms, and the problems caused by a fixed patch size are addressed.
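For context, the sketch below computes the classic dark channel prior with a fixed patch size; PMS-Net replaces this fixed size with a per-pixel size predicted by the network (the patch map). This is a simple reference implementation of the conventional prior, not of PMS-Net itself:

import numpy as np

def dark_channel(img, patch_size=15):
    # img: H x W x 3 float array in [0, 1].
    # The dark channel is the minimum over color channels, followed by a
    # minimum filter over a local patch of the given (fixed) size.
    h, w, _ = img.shape
    pad = patch_size // 2
    min_rgb = img.min(axis=2)
    padded = np.pad(min_rgb, pad, mode='edge')
    dark = np.empty((h, w), dtype=img.dtype)
    for i in range(h):
        for j in range(w):
            dark[i, j] = padded[i:i + patch_size, j:j + patch_size].min()
    return dark

# toy usage
print(dark_channel(np.random.rand(32, 32, 3)).shape)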
[dataset, perform] [single, algorithm, error, light, well, notice, determined, technique, sky, estimate] [patch, proposed, image, haze, transmission, dehazing, removal, dark, dcp, method, color, based, hazy, recovered, input, style, atmospheric, figure, prior, bright, mse, traditional, taiwan, bad, demonstrate, oversaturation, psnr, ssim, recover, quality, pixel] [size, performance, network, channel, fixed, relu, conv, applied, design, convolutional, apply, selection, higher, output, connected, table, better, residual, called, lead, adaptively, best, order] [white] [map, multiscale, pyramid, feature, comparing, module, densely, enhance, adopt] [min, test, learning, training, conventional, function, novel, select, experimental]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Wei-Ting and Ding, Jian-Jiun and Kuo, Sy-Yen},
  title = {PMS-Net: Robust Haze Removal Based on Patch Map for Single Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Spherical Quantization for Image Search
Sepehr Eghbali, Ladan Tahvildari


Hashing methods, which encode high-dimensional images with compact discrete codes, have been widely applied to enhance large-scale image retrieval. In this paper, we put forward Deep Spherical Quantization (DSQ), a novel method to make deep convolutional neural networks generate supervised and compact binary codes for efficient image search. Our approach simultaneously learns a mapping that transforms the input images into a low-dimensional discriminative space, and quantizes the transformed data points using multi-codebook quantization. To eliminate the negative effect of norm variance on codebook learning, we force the network to L_2 normalize the extracted features and then quantize the resulting vectors using a new supervised quantization technique specifically designed for points lying on a unit hypersphere. Furthermore, we introduce an easy-to-implement extension of our quantization technique that enforces sparsity on the codebooks. Extensive experiments demonstrate that DSQ and its sparse variant can generate semantically separable compact binary codes outperforming many state-of-the-art image retrieval methods on three benchmarks.
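A minimal sketch of the normalize-then-quantize step (codebook learning and the supervised objective are omitted; the greedy per-codebook assignment below is only an illustration of multi-codebook quantization, not the paper's optimization):

import torch
import torch.nn.functional as F

def spherical_quantize(features, codebooks):
    # features: (N, D) deep features; codebooks: (M, K, D), M codebooks of K codewords.
    # L2-normalize the features so codebook learning is not affected by norm variance,
    # then assign each feature to its nearest codeword in every codebook.
    feats = F.normalize(features, dim=1)
    codes = []
    for m in range(codebooks.shape[0]):
        dist = torch.cdist(feats, codebooks[m])   # (N, K) Euclidean distances
        codes.append(dist.argmin(dim=1))
    return torch.stack(codes, dim=1)              # (N, M) compact code

# toy usage: 5 features, 2 codebooks of 16 codewords in 8-D
print(spherical_quantize(torch.randn(5, 8), torch.randn(2, 16, 8)))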
[time, term] [problem, optimization, constraint, error, local, point, algorithm, dqn, approach, technique] [image, input, figure, proposed, ieee, method, preserving] [quantization, deep, sparse, binary, performance, search, dsq, compact, norm, network, coding, layer, codebook, sparsity, fast, codebooks, bij, imagenet, variance, cost, unit, achieve, lookup, efficient] [query, model, encoding, finding] [feature, center, map, three, inner, average, propose] [supervised, loss, mcq, distance, hashing, set, learning, training, product, function, discriminative, retrieval, class, data, nearest, neighbor, similarity, codewords, softmax, objective, space, unsupervised, representation, extension, hash, hamming, large, subic, updating]
@InProceedings{Eghbali_2019_CVPR,
  author = {Eghbali, Sepehr and Tahvildari, Ladan},
  title = {Deep Spherical Quantization for Image Search},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Large-Scale Interactive Object Segmentation With Human Annotators
Rodrigo Benenson, Stefan Popov, Vittorio Ferrari


Manually annotating object segmentation masks is very time consuming. Interactive object segmentation methods offer a more efficient alternative where a human annotator and a machine segmentation model collaborate. In this paper we make several contributions to interactive segmentation: (1) we systematically explore in simulation the design space of deep interactive segmentation models and report new insights and caveats; (2) we execute a large-scale annotation campaign with real human annotators, producing masks for 2.5M instances on the OpenImages dataset. We released this data publicly, forming the largest existing dataset for instance segmentation. Moreover, by re-annotating part of the COCO dataset, we show that we can produce instance masks 3x faster than traditional polygon drawing tools while also providing better quality. (3) We present a technique for automatically estimating the quality of the produced masks which exploits indirect signals from the annotation process.
[time, human, dataset, previous, work, multiple, current, report, starting] [error, simulation, well, manual, total] [quality, figure, image, input, noise, drawing, handle] [number, design, small, deep, better, binary, larger, smaller, compared, higher, explore] [model, generated, openimages, considered, policy, reach] [segmentation, corrective, annotation, instance, click, interactive, coco, object, annotator, region, mask, round, bounding, average, boundary, box, three, miou, annotate, iou, semantic, polygon, area, adetrain, faster, clicked, campaign] [training, train, class, large, ranking, set, observe, data, distance, trained, supervised, existing]
@InProceedings{Benenson_2019_CVPR,
  author = {Benenson, Rodrigo and Popov, Stefan and Ferrari, Vittorio},
  title = {Large-Scale Interactive Object Segmentation With Human Annotators},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Poisson-Gaussian Denoising Dataset With Real Fluorescence Microscopy Images
Yide Zhang, Yinhao Zhu, Evan Nichols, Qingfei Wang, Siyuan Zhang, Cody Smith, Scott Howard


Fluorescence microscopy has enabled a dramatic development in modern biology. Due to its inherently weak signal, fluorescence microscopy is not only much noisier than photography, but also presented with Poisson-Gaussian noise where Poisson noise, or shot noise, is the dominating noise source. To get clean fluorescence microscopy images, it is highly desirable to have effective denoising algorithms and datasets that are specifically designed to denoise fluorescence microscopy images. While such algorithms exist, no such datasets are available. In this paper, we fill this gap by constructing a dataset - the Fluorescence Microscopy Denoising (FMD) dataset - that is dedicated to Poisson-Gaussian denoising. The dataset consists of 12,000 real fluorescence microscopy images obtained with commercial confocal, two-photon, and wide-field microscopes and representative biological samples such as cells, zebrafish, and mouse brain tissues. We use image averaging to effectively obtain ground truth images and 60,000 noisy images with different noise levels. We use this dataset to benchmark 10 representative denoising algorithms and find that deep learning methods have the best performance. To our knowledge, this is the first real microscopy image dataset for Poisson-Gaussian denoising purposes and it could be an important tool for high-quality, real-time denoising applications in biomedical research.
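A minimal sketch of the Poisson-Gaussian noise model the dataset targets (the photon count and read-noise level below are illustrative, not calibrated to any of the microscopes used):

import numpy as np

def add_poisson_gaussian_noise(clean, photons=100.0, read_noise_std=0.01, rng=None):
    # clean: float array in [0, 1]. Shot noise is Poisson in the photon counts,
    # and Gaussian read noise is added after rescaling back to [0, 1].
    rng = np.random.default_rng() if rng is None else rng
    shot = rng.poisson(clean * photons) / photons
    return shot + rng.normal(0.0, read_noise_std, size=clean.shape)

# toy usage
noisy = add_poisson_gaussian_noise(np.full((4, 4), 0.5))
print(noisy.round(3))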
[dataset, averaging, time, biological, mouse, signal, work, frame, sequence] [ground, truth, estimated, estimation, confocal, fov, inverse, estimate, well, registration, fovs, algorithm, exact] [noise, denoising, image, microscopy, fluorescence, raw, imaging, poisson, real, bpae, figure, fmd, psnr, dncnn, ieee, mixed, clean, based, vst, pixel, dominated, high, biomedical, blind, zebrafish, quality, presented, denoise, method, transformation] [gaussian, deep, number, table, brain, low, designed, power, excitation, residual, fixed, full, better, variance] [model, evaluate] [benchmark, including, three, commercial] [noisy, learning, test, training, representative, set, unbiased, large]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Yide and Zhu, Yinhao and Nichols, Evan and Wang, Qingfei and Zhang, Siyuan and Smith, Cody and Howard, Scott},
  title = {A Poisson-Gaussian Denoising Dataset With Real Fluorescence Microscopy Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Task Agnostic Meta-Learning for Few-Shot Learning
Muhammad Abdullah Jamal, Guo-Jun Qi


Meta-learning approaches have been proposed to tackle the few-shot learning problem. Typically, a meta-learner is trained on a variety of tasks in the hopes of being generalizable to new tasks. However, a meta-learner's generalizability to new tasks could be fragile when it is over-trained on existing tasks during the meta-training phase. In other words, the initial model of a meta-learner could be too biased towards existing tasks to adapt to new tasks, especially when only very few examples are available to update the model. To avoid a biased meta-learner and improve its generalizability, we propose a novel paradigm of Task-Agnostic Meta-Learning (TAML) algorithms. Specifically, we present an entropy-based approach that meta-learns an unbiased initial model with the largest uncertainty over the output labels by preventing it from over-performing in classification tasks. Alternatively, a more general inequality-minimization TAML is presented for more ubiquitous scenarios by directly minimizing the inequality of initial losses beyond the classification tasks wherever a suitable loss can be defined. Experiments on benchmark datasets demonstrate that the proposed approaches outperform the compared meta-learning algorithms in both few-shot classification and reinforcement learning tasks.
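A minimal sketch of the entropy term used by the entropy-based variant: the prediction entropy of the initial (pre-update) model on a sampled task is encouraged to be large so the initial model does not over-perform on any particular task (the weighting and how it enters the meta-objective are omitted):

import torch
import torch.nn.functional as F

def initial_prediction_entropy(logits):
    # logits: (N, C) outputs of the initial, not-yet-adapted model on a sampled task.
    # Returns the mean prediction entropy; maximizing it keeps the initial model
    # unbiased toward any of the training tasks.
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-12)).sum(dim=1).mean()

# toy usage: 4 examples, 5 classes
print(initial_prediction_entropy(torch.randn(4, 5)))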
[lstm, dataset] [initial, approach, defined, algorithm, problem, matching, discrete] [based, proposed, variety, prior] [gradient, neural, network, table, outperform, compared, performance, batch, parameter, convolutional, better, lower, conv, accuracy, output, descent, initialization] [model, reinforcement, sampled, sensitive, policy, goal, evaluate, step] [propose] [taml, learning, inequality, classification, entropy, task, maml, lti, loss, training, update, learn, sample, trained, unbiased, measure, learner, distribution, shot, omniglot, updating, train, thiel, fair, reported, existing, biased, minimizing, set, unseen, idea, hti, generalizable, adapted, test, function, randomly]
@InProceedings{Jamal_2019_CVPR,
  author = {Abdullah Jamal, Muhammad and Qi, Guo-Jun},
  title = {Task Agnostic Meta-Learning for Few-Shot Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Progressive Ensemble Networks for Zero-Shot Recognition
Meng Ye, Yuhong Guo


Despite the advancement of supervised image recognition algorithms, their dependence on the availability of labeled data and the rapid expansion of image categories raise the significant challenge of zero-shot learning. Zero-shot learning (ZSL) aims to transfer knowledge from labeled classes into unlabeled classes to reduce human labeling effort. In this paper, we propose a novel progressive ensemble network model with multiple projected label embeddings to address zero-shot image recognition. The ensemble network is built by learning multiple image classification functions with a shared feature extraction network but different label embedding representations, which enhance the diversity of the classifiers and facilitate information transfer to unlabeled classes. A progressive training framework is then deployed to gradually label the most confident images in each unlabeled class with predicted pseudo-labels and update the ensemble network with the training data augmented by the pseudo-labels. The proposed model performs training on both labeled and unlabeled data. It can naturally bridge the domain shift problem in visual appearances and be extended to the generalized zero-shot learning scenario. We conduct experiments on multiple ZSL datasets and the empirical results demonstrate the efficacy of the proposed model.
[multiple, prediction, dataset, work, recognition, framework, previous, perform] [matrix, projection, problem, projected, approach] [proposed, image, attribute, comparison, facilitate] [network, progressive, performance, accuracy, deep, shift, denotes, standard, original, table, number, neural] [model, visual, evaluation, empirical] [semantic, feature, predicted, refine, extraction, instance, average, baseline] [unseen, label, ensemble, zsl, embedding, training, class, test, data, learning, unlabeled, embeddings, labeled, classification, domain, set, transductive, pren, transfer, function, reported, generalized, knowledge, datasets, space, conventional, selected, loss, dtrain, cub, awa, split, gzsl, shared, subset, novel, address, existing]
@InProceedings{Ye_2019_CVPR,
  author = {Ye, Meng and Guo, Yuhong},
  title = {Progressive Ensemble Networks for Zero-Shot Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Direct Object Recognition Without Line-Of-Sight Using Optical Coherence
Xin Lei, Liangyu He, Yixuan Tan, Ken Xingze Wang, Xinggang Wang, Yihan Du, Shanhui Fan, Zongfu Yu


Visual object recognition under situations in which the direct line-of-sight is blocked, such as when it is occluded around the corner, is of practical importance in a wide range of applications. With coherent illumination, the light scattered from diffusive walls forms speckle patterns that contain information of the hidden object. It is possible to realize non-line-of-sight (NLOS) recognition with these speckle patterns. We introduce a novel approach based on speckle pattern recognition with deep neural network, which is simpler and more robust than other NLOS recognition methods. Simulations and experiments are performed to verify the feasibility and performance of this approach.
[recognition, human, coherent, dataset, complex, hidden, perform, multiple, modeled, optical] [speckle, light, wall, direct, pattern, scattering, body, posture, simulation, camera, laser, computer, vision, approach, diffusive, interference, point, surface, june, scattered, nlos, range, incoherent, plane, illumination, practical, feasibility, tof, require, holographic, lcd, occluded, corner, greatly, form, allows, measurement] [figure, imaging, image, method, based, intensity, ieee, demonstrate, conference, captured, result, side, traditional, acm, verify, digital] [deep, accuracy, experiment, phase, network, performed, neural] [random, visual, situation, system, potential, beam] [object] [mnist, source, classification, data, training, experimental, effectively]
@InProceedings{Lei_2019_CVPR,
  author = {Lei, Xin and He, Liangyu and Tan, Yixuan and Xingze Wang, Ken and Wang, Xinggang and Du, Yihan and Fan, Shanhui and Yu, Zongfu},
  title = {Direct Object Recognition Without Line-Of-Sight Using Optical Coherence},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Atlas of Digital Pathology: A Generalized Hierarchical Histological Tissue Type-Annotated Database for Deep Learning
Mahdi S. Hosseini, Lyndon Chan, Gabriel Tse, Michael Tang, Jun Deng, Sajad Norouzi, Corwyn Rowsell, Konstantinos N. Plataniotis, Savvas Damaskinos


In recent years, computer vision techniques have made large advances in image recognition and been applied to aid radiological diagnosis. Computational pathology aims to develop similar tools for aiding pathologists in diagnosing digitized histopathological slides, which would improve diagnostic accuracy and productivity amidst increasing workloads. However, there is a lack of publicly-available databases of (1) localized patch-level images annotated with (2) a large range of Histological Tissue Types (HTT). As a result, computational pathology research is constrained to diagnosing specific diseases or classifying tissues from specific organs, and cannot be readily generalized to handle unexpected diseases and organs. In this paper, we propose a new digital pathology database, the "Atlas of Digital Pathology" (or ADP), which comprises 17,668 patch images extracted from 100 slides annotated with up to 57 hierarchical HTTs. Our data is generalized to different tissue types across different organs and aims to provide training data for supervised multi-label learning of patch-level HTT in a digitized whole slide image. We demonstrate the quality of our image labels through pathologist consultation and by training three state-of-the-art neural networks on tissue type classification. Quantitative results support the visual consistency of our data and we demonstrate a tissue type-based visual attention aid as a sample tool that could be developed from our database.
[recognition, human] [computer, atlas, analysis, vision, confidence, journal, corresponding, pattern, michael, field] [image, digital, patch, database, figure, ieee, proposed, cancer, conference, quality] [computational, neural, rate, size, table, deep, binary, network, convolutional, performance, original, applied, validation] [type, visual, rule, node, appears, provided, simple, diagnostic, association] [tissue, pathology, slide, hierarchical, three, annotated, level, histological, labeling, epithelial, pathologist, medical, histopathological, gland, taxonomy, assigned, disease, wsi, connective, histology, predicted, segmentation] [learning, label, training, large, data, classification, trained, specific, set, retrieval, class, positive, generalized, supervised]
@InProceedings{Hosseini_2019_CVPR,
  author = {Hosseini, Mahdi S. and Chan, Lyndon and Tse, Gabriel and Tang, Michael and Deng, Jun and Norouzi, Sajad and Rowsell, Corwyn and Plataniotis, Konstantinos N. and Damaskinos, Savvas},
  title = {Atlas of Digital Pathology: A Generalized Hierarchical Histological Tissue Type-Annotated Database for Deep Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Perturbation Analysis of the 8-Point Algorithm: A Case Study for Wide FoV Cameras
Thiago L. T. da Silveira, Claudio R. Jung


This paper presents a perturbation analysis for the estimate of epipolar matrices using the 8-Point Algorithm (8-PA). Our approach explores existing bounds for singular subspaces and relates them to the 8-PA, without assuming any kind of error distribution for the matched features. In particular, if we use unit vectors as homogeneous image coordinates, we show that having a wide spatial distribution of matched features in both views tends to generate lower error bounds for the epipolar matrix error. Our experimental validation indicates that the bounds and the effective errors tend to decrease as the camera Field of View (FoV) increases, and that using the 8-PA for spherical images (that present a 360°×180° FoV) leads to accurate essential matrices. As an additional contribution, we present bounds for the direction of the translation vector extracted from the essential matrix based on singular subspace analysis.
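For reference, a minimal sketch of the 8-Point Algorithm with the unit-vector normalization the analysis assumes (no RANSAC and no conditioning beyond the unit norm):

import numpy as np

def eight_point(x1, x2):
    # x1, x2: (N, 3) matched homogeneous image coordinates, N >= 8.
    # Normalize each coordinate to a unit vector, as suggested by the analysis.
    x1 = x1 / np.linalg.norm(x1, axis=1, keepdims=True)
    x2 = x2 / np.linalg.norm(x2, axis=1, keepdims=True)
    # Each correspondence contributes one row of A f = 0, where f is the
    # row-major flattening of the epipolar matrix F (from x2^T F x1 = 0).
    A = np.stack([np.kron(p2, p1) for p1, p2 in zip(x1, x2)])
    _, _, vt = np.linalg.svd(A)
    F = vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint on F.
    u, s, v = np.linalg.svd(F)
    s[-1] = 0.0
    return u @ np.diag(s) @ v

# toy usage with 8 random correspondences (shape check only)
print(eight_point(np.random.randn(8, 3), np.random.randn(8, 3)))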
[second, motion] [matrix, error, epipolar, singular, bound, spherical, matching, analysis, fov, computer, camera, pose, international, estimation, rotation, sin, vfov, left, fovs, hfov, vision, estimating, estimated, journal, pattern, estimate, fundamental, correspondence, calibrated, algorithm, homogeneous, view, scene, single, sine, case] [translation, conference, image, noise, ieee, based, figure, synthetic, extracted] [number, unit, structure, actual, smaller, table, full, impact, svd, wide] [perturbation, vector, consider, model, provided] [feature, matched, wider, average, spatial, leading, tighter] [essential, distribution, gap, angular]
@InProceedings{Silveira_2019_CVPR,
  author = {da Silveira, Thiago L. T. and Jung, Claudio R.},
  title = {Perturbation Analysis of the 8-Point Algorithm: A Case Study for Wide FoV Cameras},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Robustness of 3D Deep Learning in an Adversarial Setting
Matthew Wicker, Marta Kwiatkowska


Understanding the spatial arrangement and nature of real-world objects is of paramount importance to many complex engineering tasks, including autonomous navigation. Deep learning has revolutionized state-of-the-art performance for tasks in 3D environments; however, relatively little is known about the robustness of these approaches in an adversarial setting. The lack of comprehensive analysis makes it difficult to justify deployment of 3D deep learning models in real-world, safety-critical applications. In this work, we develop an algorithm for analysis of pointwise robustness of neural networks that operate on 3D data. We show that current approaches presented for understanding the resilience of state-of-the-art models vastly overestimate their robustness. We then use our algorithm to evaluate an array of state-of-the-art models in order to demonstrate their vulnerability to occlusion attacks. We show that, in the worst case, these networks can be reduced to 0% classification accuracy after the occlusion of at most 6.5% of the occupied input space.
[work, recognition, current, despite] [point, algorithm, occlusion, volumetric, analysis, pointnet, voxnet, case, confidence, exists, cloud, crafting, salience, approach, autonomous, vision, problem, directly, pointnets] [input, latent, figure, translation, change, method, based, conference, presented, removed] [deep, network, neural, order, accuracy, verification, performance, max, standard, pooling, convolution, output, architecture, represents, original, number, convolutional, achieve] [adversarial, critical, robustness, iso, random, model, example, find, cardinality, safety, iterative, attack, vector, evaluating, probability, understanding] [object, saliency, cvpr, detection] [learning, set, classification, data, representation, function, test, testing]
@InProceedings{Wicker_2019_CVPR,
  author = {Wicker, Matthew and Kwiatkowska, Marta},
  title = {Robustness of 3D Deep Learning in an Adversarial Setting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SceneCode: Monocular Dense Semantic Reconstruction Using Learned Encoded Scene Representations
Shuaifeng Zhi, Michael Bloesch, Stefan Leutenegger, Andrew J. Davison


Systems which incrementally create 3D semantic maps from image sequences must store and update representations of both geometry and semantic entities. However, while there has been much work on the correct formulation for geometrical estimation, state-of-the-art systems usually rely on simple semantic representations which store and update independent label estimates for each surface element (depth pixels, surfels, or voxels). Spatial correlation is discarded, and fused label maps are incoherent and noisy. We introduce a new compact and optimisable semantic representation by training a variational auto-encoder that is conditioned on a colour image. Using this learned latent space, we can tackle semantic label fusion by jointly optimising the low-dimensional codes associated with each of a set of overlapping images, producing consistent fused label maps which preserve spatial correlation. We also show how this approach can be used within a monocular keyframe based semantic mapping system where a similar code approach is used for geometry. The probabilistic formulation allows a flexible approach where we can jointly estimate motion, geometry and semantics in a unified optimisation.
[fusion, jointly, recognition, motion, prediction, dataset, keyframe, joint, build] [dense, depth, geometry, computer, vision, monocular, scenenet, scene, international, reconstruction, slam, optimisation, pattern, camera, robotics, colour, jacobians, error, stefan, consistent, geometric, linear, relative, view, indoor, michael, allows] [conference, image, figure, ieee, latent, mapping, input, based, conditional, prior, method] [network, compact, size, deep, table, andrew, structure, neural] [encoded, variational, system, ambiguous, conditioned, encoding, ian] [semantic, semantics, refinenet, segmentation, object, spatial, predicted, refinement] [code, label, learned, learning, entropy, representation, training, set, test, optimised, large, softmax, data, probabilistic, stanford]
@InProceedings{Zhi_2019_CVPR,
  author = {Zhi, Shuaifeng and Bloesch, Michael and Leutenegger, Stefan and Davison, Andrew J.},
  title = {SceneCode: Monocular Dense Semantic Reconstruction Using Learned Encoded Scene Representations},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
StereoDRNet: Dilated Residual StereoNet
Rohan Chabra, Julian Straub, Christopher Sweeney, Richard Newcombe, Henry Fuchs


We propose a system that uses a convolutional neural network (CNN) to estimate depth from a stereo pair, followed by volumetric fusion of the predicted depth maps to produce a 3D reconstruction of a scene. Our proposed depth refinement architecture predicts view-consistent disparity and occlusion maps that help the fusion system to produce geometrically consistent reconstructions. We utilize 3D dilated convolutions in our proposed cost filtering network, which yields better filtering while almost halving the computational cost in comparison to state-of-the-art cost filtering architectures. For feature extraction we use the Vortex Pooling architecture. The proposed method achieves state-of-the-art results on the KITTI 2012, KITTI 2015 and ETH 3D stereo benchmarks. Finally, we demonstrate that our system is able to produce high fidelity 3D scene reconstructions that outperform the state-of-the-art stereo system.
[dataset, state, fusion, work] [stereo, disparity, depth, error, kitti, reconstruction, vortex, ground, stereodrnet, psmnet, left, light, truth, estimation, occlusion, computer, matching, vision, sceneflow, dense, scene, volume, view, pattern, geometric, total, local, passive, consistent, indoor, homogeneous, approach, gmac, international, reality] [filtering, image, proposed, method, figure, ieee, input, described, conference, demonstrate, comparison, produce, sharp, consistency, thin, quality, resolution, reflective] [network, cost, pooling, table, residual, architecture, dilated, convolution, better, structured, deep, order, neural, process] [system, richard, arxiv, preprint] [refinement, map, art, feature, spatial, pyramid, extraction, global, refined] [training, data, loss, learning]
@InProceedings{Chabra_2019_CVPR,
  author = {Chabra, Rohan and Straub, Julian and Sweeney, Christopher and Newcombe, Richard and Fuchs, Henry},
  title = {StereoDRNet: Dilated Residual StereoNet},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
The Alignment of the Spheres: Globally-Optimal Spherical Mixture Alignment for Camera Pose Estimation
Dylan Campbell, Lars Petersson, Laurent Kneip, Hongdong Li, Stephen Gould


Determining the position and orientation of a calibrated camera from a single image with respect to a 3D model is an essential task for many applications. When 2D-3D correspondences can be obtained reliably, perspective-n-point solvers can be used to recover the camera pose. However, without the pose it is non-trivial to find cross-modality correspondences between 2D images and 3D models, particularly when the latter only contains geometric information. Consequently, the problem becomes one of estimating pose and correspondences jointly. Since outliers and local optima are so prevalent, robust objective functions and global search strategies are desirable. Hence, we cast the problem as a 2D-3D mixture model alignment task and propose the first globally-optimal solution to this formulation under the robust L2 distance between mixture distributions. We derive novel bounds on this objective function and employ branch-and-bound to search the 6D space of camera poses, guaranteeing global optimality without requiring a pose estimate. To accelerate convergence, we integrate local optimization, implement GPU bound computations, and provide an intuitive way to incorporate side information such as semantic labels. The algorithm is evaluated on challenging synthetic and real datasets, outperforming existing approaches and reliably converging to the global optimum.
[] [pose, camera, computer, algorithm, pattern, problem, rotation, gosma, international, robust, bound, point, vision, local, june, registration, optimization, geometric, ransac, gopac, spherical, normal, runtime, vmf, error, journal, bearing, projected, analysis, estimation, solution, correspondence, inlier, qpn, respect, relative, projection, approach, volume, require, misesfisher] [conference, image, translation, figure, ieee, real, transformation, synthetic] [density, gaussian, search, lower, number] [model, probability, machine, find, random, correct, vector, success] [global, semantic, object] [mixture, function, set, alignment, distribution, objective, domain, distance, data, space, class, upper, novel, von, large, angular, minimum]
@InProceedings{Campbell_2019_CVPR,
  author = {Campbell, Dylan and Petersson, Lars and Kneip, Laurent and Li, Hongdong and Gould, Stephen},
  title = {The Alignment of the Spheres: Globally-Optimal Spherical Mixture Alignment for Camera Pose Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Joint Reconstruction of Hands and Manipulated Objects
Yana Hasson, Gul Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, Cordelia Schmid


Estimating hand-object manipulations is essential for interpreting and imitating human actions. Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation. Yet, reconstructing hands and objects during manipulation is a more challenging task due to significant occlusions of both the hand and object. While presenting challenges, manipulations may also simplify the problem since the physics of contact restricts the space of valid hand-object configurations. For example, during manipulation, the hand and object should be in contact but not interpenetrate. In this work, we regularize the joint reconstruction of hands and objects with manipulation constraints. We present an end-to-end learnable model that exploits a novel contact loss that favors physically plausible hand-object constellations. Our approach improves grasp quality metrics over baselines, using RGB images as input. To train and evaluate the model, we also propose a new large-scale synthetic dataset, ObMan, with hand-object manipulations. We demonstrate the transferability of ObMan-trained models to real data.
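A heavily simplified sketch of a contact-style loss: attraction pulls nearby hand vertices onto the object surface while repulsion penalizes interpenetration. Here the object is approximated by a sphere purely for illustration; the paper operates on full meshes and its exact terms differ:

import torch

def contact_loss(hand_verts, obj_center, obj_radius, attract_thresh=0.02):
    # hand_verts: (V, 3) hand mesh vertices (meters); the object is a sphere here.
    dist_to_surface = torch.norm(hand_verts - obj_center, dim=1) - obj_radius
    repulsion = torch.clamp(-dist_to_surface, min=0.0).sum()    # penalize penetration
    near = (dist_to_surface > 0) & (dist_to_surface < attract_thresh)
    attraction = dist_to_surface[near].sum()                    # pull close vertices to contact
    return repulsion + attraction

# toy usage
verts = torch.randn(100, 3) * 0.05
print(contact_loss(verts, obj_center=torch.zeros(3), obj_radius=0.04))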
[dataset, tracking, joint, human, work, displacement, action, focus, predict, interacting] [hand, pose, contact, shape, reconstruction, estimation, mesh, penetration, obman, grasp, depth, mano, rgb, repulsion, atlasnet, single, ground, simulation, truth, body, error, fhbc, articulated, surface, differentiable, well, render, normalized, fhb, computer, point, vertex, define] [synthetic, real, figure, image, quality, input] [network, scale, deep, table, neural, full] [model, physical, find, appendix, sampled, visual] [object, attraction, predicted, supervision] [training, loss, learning, set, distance, datasets, split, data, trained, measure]
@InProceedings{Hasson_2019_CVPR,
  author = {Hasson, Yana and Varol, Gul and Tzionas, Dimitrios and Kalevatykh, Igor and Black, Michael J. and Laptev, Ivan and Schmid, Cordelia},
  title = {Learning Joint Reconstruction of Hands and Manipulated Objects},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Single Image Camera Calibration With Radial Distortion
Manuel Lopez, Roger Mari, Pau Gargallo, Yubin Kuang, Javier Gonzalez-Jimenez, Gloria Haro


Single image calibration is the problem of predicting the camera parameters from one image. This problem is of importance when dealing with images collected in uncontrolled conditions by non-calibrated cameras, as in crowd-sourced applications. In this work we propose a method to predict extrinsic (tilt and roll) and intrinsic (focal length and radial distortion) parameters from a single image. We propose a parameterization for radial distortion that is better suited for learning than directly predicting the distortion parameters. Moreover, predicting additional heterogeneous variables exacerbates the problem of loss balancing. We propose a new loss function based on point projections to avoid having to balance heterogeneous loss terms. Our method is, to our knowledge, the first to jointly estimate the tilt, roll, focal length, and radial distortion parameters from a single image. We thoroughly analyze the performance of the proposed method and the impact of the improvements and compare with previous approaches for single image radial distortion correction.
[horizon, predict, prediction, apparent, predicting, work, previous, perform, dataset] [distortion, camera, radial, focal, single, calibration, computer, distorted, error, tilt, bearing, roll, ground, truth, parameterization, vision, directly, problem, undistorted, well, normalized, intrinsic, straight, projection, view, approach, extrinsic, point, field, vertical, rpx, international, rely, alternative, respect, panorama, projected, angle, huber] [image, figure, method, proposed, based, conference, ieee, real] [network, parameter, unit, compare, coefficient, convolutional, better, neural] [length, model, generate, balancing] [offset, predicted, propose, regression] [loss, set, training, function, distribution, learning, trained, test, proxy, large, learned, train]
@InProceedings{Lopez_2019_CVPR,
  author = {Lopez, Manuel and Mari, Roger and Gargallo, Pau and Kuang, Yubin and Gonzalez-Jimenez, Javier and Haro, Gloria},
  title = {Deep Single Image Camera Calibration With Radial Distortion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
CAM-Convs: Camera-Aware Multi-Scale Convolutions for Single-View Depth
Jose M. Facil, Benjamin Ummenhofer, Huizhong Zhou, Luis Montesano, Thomas Brox, Javier Civera


Single-view depth estimation suffers from the problem that a network trained on images from one camera does not generalize to images taken with a different camera model. Thus, changing the camera model requires collecting an entirely new training dataset. In this work, we propose a new type of convolution that can take the camera parameters into account, thus allowing neural networks to learn calibration-aware patterns. Experiments confirm that this improves the generalization capabilities of depth prediction networks considerably, and clearly outperforms the state of the art when the train and test images are acquired with different cameras.
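A rough sketch of the underlying idea: derive per-pixel channels from the camera intrinsics and concatenate them to the feature map before an ordinary convolution. The specific maps below (centered, focal-normalized coordinates and viewing angles) are my own approximation of the camera-aware inputs, not the paper's exact definition:

import torch

def camera_aware_channels(h, w, fx, fy, cx, cy):
    # Per-pixel channels derived from the camera intrinsics: centered,
    # focal-normalized coordinates and the corresponding viewing angles.
    ys = torch.arange(h, dtype=torch.float32).view(h, 1).expand(h, w)
    xs = torch.arange(w, dtype=torch.float32).view(1, w).expand(h, w)
    cc_x = (xs - cx) / fx
    cc_y = (ys - cy) / fy
    fov_x = torch.atan(cc_x)
    fov_y = torch.atan(cc_y)
    return torch.stack([cc_x, cc_y, fov_x, fov_y], dim=0)   # (4, H, W)

# usage: concatenate to a feature map before a standard convolution
feats = torch.randn(1, 16, 48, 64)
extra = camera_aware_channels(48, 64, fx=350.0, fy=350.0, cx=32.0, cy=24.0).unsqueeze(0)
out = torch.nn.Conv2d(16 + 4, 32, 3, padding=1)(torch.cat([feats, extra], dim=1))
print(out.shape)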
[prediction, dataset, work, recognition, multiple] [depth, camera, focal, sensor, computer, vision, pattern, international, single, point, resizing, rgb, estimation, ground, view, principal, field, depend, truth, notice, kitti, normalized, inverse, smallest, allows, intrinsics, confidence, error] [conference, figure, image, ieee, input, pixel] [network, size, table, deep, performance, neural, normalization, convolutional, better, architecture, weight] [length, model, visual, evaluate, arxiv] [context, map, feature] [test, training, trained, set, train, learning, generalization, data, generalize, datasets, distribution, loss, learn]
@InProceedings{Facil_2019_CVPR,
  author = {Facil, Jose M. and Ummenhofer, Benjamin and Zhou, Huizhong and Montesano, Luis and Brox, Thomas and Civera, Javier},
  title = {CAM-Convs: Camera-Aware Multi-Scale Convolutions for Single-View Depth},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Translate-to-Recognize Networks for RGB-D Scene Recognition
Dapeng Du, Limin Wang, Huiling Wang, Kai Zhao, Gangshan Wu


Cross-modal transfer is helpful to enhance modality-specific discriminative power for scene recognition. To this end, this paper presents a unified framework to integrate the tasks of cross-modal translation and modality-specific recognition, termed the Translate-to-Recognize Network (TRecgNet). Specifically, both translation and recognition tasks share the same encoder network, which allows us to explicitly regularize the training of the recognition task with the help of translation, and thus improve its final generalization ability. For the translation task, we place a decoder module on top of the encoder network and optimize it with a new layer-wise semantic loss, while for the recognition task, we use a linear classifier based on the feature embedding from the encoder, whose training is guided by the standard cross-entropy loss. In addition, our TRecgNet allows us to exploit large amounts of unlabeled RGB-D data to train the translation task and thus improve the representation power of the encoder network. Empirically, we verify that this new semi-supervised setting is able to further enhance the performance of the recognition network. We perform experiments on two RGB-D scene recognition benchmarks: NYU Depth v2 and SUN RGB-D, demonstrating that TRecgNet achieves superior performance to the existing state-of-the-art methods, especially for recognition solely based on a single modality.
[recognition, dataset, fusion, wang] [depth, trecgnet, rgb, scene, trecg, indoor, modalityspecific, directly, vision, trecgnets] [translation, figure, image, content, proposed, paired, based, quality, aug, input, perceptual, method, study] [network, net, layer, residual, deep, imagenet, performance, number, accuracy, effectiveness, power, convolutional, table, effective, fine, basic, better, process, upsample, neural, operation] [generated, modality, encoding, encoder, model, gan] [semantic, feature, supervision, object, enhance, cnn, segmentation, branch, improve] [data, training, classification, sun, learning, test, loss, set, unlabeled, learn, train, similarity, transfer, representation, discriminative, task, datasets, specific]
@InProceedings{Du_2019_CVPR,
  author = {Du, Dapeng and Wang, Limin and Wang, Huiling and Zhao, Kai and Wu, Gangshan},
  title = {Translate-to-Recognize Networks for RGB-D Scene Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Re-Identification Supervised Texture Generation
Jian Wang, Yunshan Zhong, Yachun Li, Chi Zhang, Yichen Wei


The estimation of 3D human body pose and shape from a single image has been extensively studied in recent years. However, the texture generation problem has not been fully discussed. In this paper, we propose an end-to-end learning strategy to generate textures of human bodies under the supervision of person re-identification. We render the synthetic images with textures extracted from the inputs and maximize the similarity between the rendered and input images by using the re-identification network as the perceptual metrics. Experiment results on pedestrian images show that our model can generate the texture from a single image and demonstrate that our textures are of higher quality than those generated by other available methods. Furthermore, we extend the application scope to other categories and explore the possible utilization of our generated textures.
[human, dataset, influence, extract, hmr, performs, work, action] [body, pose, computer, rendered, shape, smpl, vision, pattern, single, mesh, rendering, michael, international, surreal, reconstruction, directly, estimation, render, approach, linear, differentiable, scanned] [texture, image, method, conference, input, ieee, perceptual, face, quality, result, ssim, proposed, figure, synthetic, extracted, user, acm, translation, qualitative] [network, deep, higher, pretrained, experiment, order, neural, process, better, performance, accuracy, epoch] [generated, model, generation, generate, generating, ace, arxiv, preprint, diversity] [person, feature, score, eccv, supervision, object] [loss, training, learning, trained, distance, reid, reidentification, function, similarity, task]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Jian and Zhong, Yunshan and Li, Yachun and Zhang, Chi and Wei, Yichen},
  title = {Re-Identification Supervised Texture Generation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Action4D: Online Action Recognition in the Crowd and Clutter
Quanzeng You, Hao Jiang


Recognizing every person's action in a crowded and cluttered environment is a challenging task in computer vision. We propose to tackle this challenging problem using a holistic 4D "scan" of a cluttered scene to include every detail about the people and environment. This leads to a new problem, i.e., recognizing multiple people's actions in the cluttered 4D representation. As the first step, we propose a new method to track people in 4D, which can reliably detect and follow each person in real time. Then, we build a new deep neural network, the Action4DNet, to recognize the action of each tracked person. Such a model gives reliable and accurate results in real-world settings. We also design an adaptive 3D convolution layer and a novel discriminative temporal feature learning objective to further improve the performance of our model. Our method is invariant to camera view angles, resistant to clutter and able to handle crowds. The experimental results show that the proposed method is fast, reliable and accurate. Our method paves the way for action recognition in real-world applications and is ready to be deployed to enable smart homes, smart factories and smart stores.
[action, recognition, people, multiple, temporal, tracking, skeleton, time, video, recognize, kinect, trajectory, acc, subject, dataset, previous, recognizing, track, human, prediction] [volume, computer, vision, pattern, shape, camera, point, ground, truth, pointnet, scene, view, depth, single, body, phone, rgb, voxel, local] [method, proposed, figure, ieee, conference, background, input, based] [table, neural, performance, deep, adaptive, convolution, layer, better, network, convolutional, accuracy, number, achieve] [model, attention, include, candidate, evaluate] [person, three, feature, cluttered, propose, context, box, clutter, detection] [test, data, learning, discriminative, training, loss, label, train, testing, representation, set]
@InProceedings{You_2019_CVPR,
  author = {You, Quanzeng and Jiang, Hao},
  title = {Action4D: Online Action Recognition in the Crowd and Clutter},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction
Jason Ku, Alex D. Pon, Steven L. Waslander


We present MonoPSR, a monocular 3D object detection method that leverages proposals and shape reconstruction. First, using the fundamental relations of a pinhole camera model, detections from a mature 2D object detector are used to generate a 3D proposal per object in a scene. The 3D location of these proposals proves to be quite accurate, which greatly reduces the difficulty of regressing the final 3D bounding box detection. Simultaneously, a point cloud is predicted in an object centered coordinate system to learn local scale and shape information. However, the key challenge is how to exploit shape information to guide 3D localization. As such, we devise aggregate losses, including a novel projection alignment loss, to jointly optimize these tasks in the neural network to improve 3D localization accuracy. We validate our method on the KITTI benchmark where we set new state-of-the-art results among published monocular methods, including the harder pedestrian and cyclist classes, while maintaining efficient run-time.
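As a worked illustration of the first stage, a 2D detection can be lifted to a coarse 3D centroid with the standard pinhole relations: depth is roughly focal length times real object height over pixel height, and the box centre is back-projected at that depth. The class heights and intrinsics below are illustrative assumptions, not values from the paper.

import numpy as np

# assumed per-class mean heights in metres (illustrative priors, not the paper's)
CLASS_HEIGHT = {"car": 1.53, "pedestrian": 1.76, "cyclist": 1.74}

def proposal_from_2d_box(box, cls, fx, fy, cx, cy):
    """Lift a 2D box (x1, y1, x2, y2) to a coarse 3D centroid via the pinhole model."""
    x1, y1, x2, y2 = box
    h_pix = y2 - y1
    z = fy * CLASS_HEIGHT[cls] / h_pix          # depth from apparent height
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0     # box centre in pixels
    x = (u - cx) * z / fx                       # back-project to the camera frame
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# example with KITTI-like intrinsics (approximate, for illustration only)
print(proposal_from_2d_box((600, 180, 680, 300), "pedestrian",
                           fx=721.5, fy=721.5, cx=609.6, cy=172.9))
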
[frame, jointly, previous] [point, depth, cloud, shape, reconstruction, monocular, coordinate, kitti, camera, projection, local, estimation, projected, estimated, accurate, error, scene, computer, lidar, ground, single, orientation, estimate, horizontal, angle, cad, viewing, formulation, corresponding, truth, june, regressing] [image, method, consistency, conference, based] [network, deep, full, neural, table, output, channel, processing] [generate, model, generated, calculated] [object, proposal, instance, box, detection, bounding, regression, feature, map, predicted, localization, height, moderate, benchmark, cyclist, easy, location, pedestrian, iou, module, three, segmentation, final] [learning, loss, centroid, hard, space, alignment, trained, task, data, training, test]
@InProceedings{Ku_2019_CVPR,
  author = {Ku, Jason and Pon, Alex D. and Waslander, Steven L.},
  title = {Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Attribute-Aware Face Aging With Wavelet-Based Generative Adversarial Networks
Yunfan Liu, Qi Li, Zhenan Sun


Since it is difficult to collect face images of the same subject over a long range of age span, most existing face aging methods resort to unpaired datasets to learn age mappings. However, the matching ambiguity between young and aged face images inherent to unpaired training data may lead to unnatural changes of facial attributes during the aging process, which could not be solved by only enforcing identity consistency like most existing studies do. In this paper, we propose an attribute-aware face aging model with wavelet based Generative Adversarial Networks (GANs) to address the above issues. To be specific, we embed facial attribute vectors into both the generator and discriminator of the model to encourage each synthesized elderly face image to be faithful to the attribute of its corresponding input. In addition, a wavelet packet transform (WPT) module is incorporated to improve the visual fidelity of generated images by capturing age-related texture details at multiple scales in the frequency space. Qualitative results demonstrate the ability of our model in synthesizing visually plausible face images, and extensive quantitative evaluation results show that the proposed method achieves state-of-the-art performance on existing datasets.
[recognition, work, subject, dataset] [computer, vision, pattern, corresponding, estimated, international, matching, ambiguity, well] [face, aging, facial, age, attribute, image, identity, method, proposed, input, morph, ieee, wavelet, conference, unpaired, generator, preservation, elderly, female, conditional, cacd, generative, packet, consistency, transform, wpt, aged, synthetic, male, figure, synthesize, clear] [verification, table, performance, group, output, rate, deep, compared, convolutional] [model, adversarial, young, discriminator, generated, generation, visual, white, black, intelligence, mismatched, considered, vector] [semantic, level, three, adopted] [training, loss, generic, test, data, conducted, sample, embedding]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Yunfan and Li, Qi and Sun, Zhenan},
  title = {Attribute-Aware Face Aging With Wavelet-Based Generative Adversarial Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Noise-Tolerant Paradigm for Training Face Recognition CNNs
Wei Hu, Yangyu Huang, Fan Zhang, Ruirui Li


Benefiting from large-scale training datasets, deep Convolutional Neural Networks (CNNs) have achieved impressive results in face recognition (FR). However, the tremendous scale of these datasets inevitably leads to noisy data, which obviously reduces the performance of the trained CNN models. Kicking out wrong labels from large-scale FR datasets is still very expensive, although some cleaning approaches have been proposed. According to the analysis of the whole process of training CNN models supervised by angular margin based loss (AM-Loss) functions, we find that the distribution of training samples implicitly reflects their probability of being clean. Thus, we propose a novel training paradigm that employs the idea of weighting samples based on the above probability. Without any prior knowledge of noise, we can train high-performance CNN models with large-scale FR datasets. Experiments demonstrate the effectiveness of our training paradigm. The codes are available at https://github.com/huangyangyu/NoiseFace.
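The core of the paradigm is to scale each sample's loss by an estimate of how likely its label is clean. Below is a minimal numpy sketch of such sample weighting applied to a plain softmax cross-entropy; the per-sample clean probabilities are hypothetical inputs here, whereas the paper derives them from the training distribution itself.

import numpy as np

def weighted_softmax_loss(logits, labels, clean_prob):
    """Cross-entropy where each sample is scaled by its estimated probability of being clean."""
    z = logits - logits.max(axis=1, keepdims=True)      # for numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_p[np.arange(len(labels)), labels]
    return np.mean(clean_prob * nll)

logits = np.random.randn(4, 10)
labels = np.array([3, 1, 7, 0])
clean_prob = np.array([0.9, 0.2, 0.8, 1.0])   # hypothetical per-sample clean probabilities
print(weighted_softmax_loss(logits, labels, clean_prob))
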
[dataset, recognition, employed, key] [computer, compute, international, pattern, corresponding, approach, left, vision] [clean, face, figure, noise, ieee, conference, method, based, proposed, demonstrate, image, prior] [deep, weight, performance, rate, accuracy, cnns, neural, larger, original, small, better, applied, convolutional, effectiveness, achieve] [find, machine, model] [cnn, final, propose, feature, refined] [training, noisy, loss, trained, learning, sample, datasets, lfw, histclean, paradigm, weighting, train, data, histall, histnoisy, distribution, arcface, label, classification, margin, strategy, supervised, set, softmax, angular, idea, knowledge, learn, hard]
@InProceedings{Hu_2019_CVPR,
  author = {Hu, Wei and Huang, Yangyu and Zhang, Fan and Li, Ruirui},
  title = {Noise-Tolerant Paradigm for Training Face Recognition CNNs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Low-Rank Laplacian-Uniform Mixed Model for Robust Face Recognition
Jiayu Dong, Huicheng Zheng, Lina Lian


Sparse representation based methods have successfully put forward a general framework for robust face recognition through linear reconstruction and sparsity constraints. However, residual modeling in existing works is not yet robust enough when dealing with dense noise. In this paper, we aim at recognizing identities from faces with varying levels of noises of various forms such as occlusion, pixel corruption, or disguise, and take improving the fitting ability of the error model as the key to addressing this problem. To fully capture the characteristics of different noises, we propose a mixed model combining robust sparsity constraint and low-rank constraint, which can deal with random errors and structured errors simultaneously. For random noises such as pixel corruption, we adopt a Laplacian-uniform mixed function for fitting the error distribution. For structured errors like continuous occlusion or disguise, we utilize robust nuclear norm to constrain the rank of the error matrix. An effective iterative reweighted algorithm is then developed to solve the proposed model. Comprehensive experiments were conducted on several benchmark databases for robust face recognition, and the overall results demonstrate that our model is most robust against various kinds of noises, when compared with state-of-the-art methods.
[recognition, session, complex] [robust, pattern, error, occlusion, analysis, algorithm, single, computer, percentage, reconstruction, nuclear, illumination, matrix, problem, vision, optimization, linear] [face, proposed, ieee, mixed, method, pixel, image, cesr, based, corruption, conference, clean, real, figure, lum, eyb, disguise, comparison] [sparse, structured, accuracy, performance, deep, coding, norm, compared, block, sparsity, penalty, achieved, best] [model, random, machine, robustness, arg, evaluation, ability, iterative, evaluate] [multi, regression] [training, sample, function, min, test, distribution, objective, update, lfw, representation, learning, experimental, maximum, set, rank, minimization]
@InProceedings{Dong_2019_CVPR,
  author = {Dong, Jiayu and Zheng, Huicheng and Lian, Lina},
  title = {Low-Rank Laplacian-Uniform Mixed Model for Robust Face Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Generalizing Eye Tracking With Bayesian Adversarial Learning
Kang Wang, Rui Zhao, Hui Su, Qiang Ji


Existing appearance-based gaze estimation approaches with CNNs have poor generalization performance. By systematically studying this issue, we identify three major factors: 1) appearance variations; 2) head pose variations; and 3) the over-fitting issue with point estimation. To improve the generalization performance, we propose to incorporate adversarial learning and Bayesian inference into a unified framework. In particular, we first add an adversarial component into traditional CNN-based gaze estimators so that we can learn features that are gaze-responsive but can generalize to appearance and pose variations. Next, we extend the point-estimation based deterministic model to a Bayesian framework so that gaze estimation can be performed using all parameters instead of only one set of parameters. Besides improved performance on several benchmark datasets, the proposed method also enables online adaptation of the model to new subjects/environments, demonstrating the potential usage for practical real-time eye tracking applications.
[tracking, perform, work, subject, incorporate, online, dataset, term, follow] [pose, estimation, computer, international, point, pattern, estimator, error, vision, estimate, geometric, good] [gaze, eye, appearance, proposed, conference, ieee, image, eyediap, method, figure, study, prior, conditional, based, face, mpiigaze, columbia] [bayesian, inference, neural, better, parameter, network, output, performance, deep, represents] [model, adversarial, arg, evaluation, system, introduce, probability] [head, baseline, map, improvement, three, propose, feature, cnn, improve] [domain, data, learning, source, target, posterior, large, generalization, classifier, adaptation, labeled, sample, issue, learn, generalize, set, draw, adapt, min]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Kang and Zhao, Rui and Su, Hui and Ji, Qiang},
  title = {Generalizing Eye Tracking With Bayesian Adversarial Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Local Relationship Learning With Person-Specific Shape Regularization for Facial Action Unit Detection
Xuesong Niu, Hu Han, Songfan Yang, Yan Huang, Shiguang Shan


Encoding individual facial expressions via action units (AUs) coded by the Facial Action Coding System (FACS) has been found to be an effective approach in resolving the ambiguity issue among different expressions. While a number of methods have been proposed for AU detection, robust AU detection in the wild remains a challenging problem because of the diverse baseline AU intensities across individual subjects, and the weakness of appearance signal of AUs. To resolve these issues, in this work, we propose a novel AU detection method by utilizing local information and the relationship of individual local face regions. Through such a local relationship learning, we expect to utilize rich local information to improve the AU detection robustness against the potential perceptual inconsistency of individual local regions. In addition, considering the diversity in the baseline AU intensities of individual subjects, we further regularize local relationship learning via person-specific face shape information, i.e., reducing the influence of person-specific shape information, and obtaining more AU discriminative features. The proposed approach outperforms the state-of-the-art methods on two widely used AU detection datasets in the public domain (BP4D and DISFA).
[action, individual, outperforms, recognition, term, prediction, modeling, influence, multiple] [local, shape, robust, approach, analysis, defined, single] [facial, face, ieee, stem, proposed, expression, based, method, spontaneous, jeffrey, appearance, image, disfa, yan, figure, traditional] [regularization, network, unit, deep, performance, convolutional, table, effective, applied, occurrence, best, layer, activation, reduce, better, achieve, effectiveness, coding] [relationship, model, generated, automatic, improved, system, probability, diverse] [detection, feature, module, average, baseline, predicted, selective, score, region, final, global, propose] [learning, data, loss, discriminative, representation, china, learn]
@InProceedings{Niu_2019_CVPR,
  author = {Niu, Xuesong and Han, Hu and Yang, Songfan and Huang, Yan and Shan, Shiguang},
  title = {Local Relationship Learning With Person-Specific Shape Regularization for Facial Action Unit Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Point-To-Pose Voting Based Hand Pose Estimation Using Residual Permutation Equivariant Layer
Shile Li, Dongheui Lee


Recently, 3D input data based hand pose estimation methods have shown state-of-the-art performance, because 3D data capture more spatial information than the depth image. Whereas 3D voxel-based methods need a large amount of memory, PointNet based methods need tedious preprocessing steps such as K-nearest neighbour search for each point. In this paper, we present a novel deep learning hand pose estimation method for an unordered point cloud. Our method takes 1024 3D points as input and does not require additional information. We use the Permutation Equivariant Layer (PEL) as the basic element, where a residual network version of PEL is proposed for the hand pose estimation task. Furthermore, we propose a voting-based scheme to merge information from individual points to the final pose output. In addition to the pose estimation task, the voting-based scheme can also provide a point cloud segmentation result without ground-truth for segmentation. We evaluate our method on both the NYU dataset and the Hands2017Challenge dataset, where our method outperforms recent state-of-the-art methods.
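For readers unfamiliar with permutation equivariant layers: in their common (Deep Sets style) form, every point is transformed with shared weights plus a term computed from a symmetric pooling over all points, so reordering the input points simply reorders the output. The numpy sketch below shows that generic form; the residual and voting components of the paper are not reproduced.

import numpy as np

def permutation_equivariant_layer(X, W_point, W_pool, b):
    """X: (N, d_in) unordered point features. Output: (N, d_out).
    Shared weights for every point, plus a max-pooled global term, so permuting
    the rows of X permutes the output rows identically (equivariance)."""
    pooled = X.max(axis=0, keepdims=True)               # (1, d_in) symmetric pooling
    return np.maximum(X @ W_point + pooled @ W_pool + b, 0.0)   # ReLU

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))                          # 1024 input points
out = permutation_equivariant_layer(X, rng.normal(size=(3, 64)),
                                    rng.normal(size=(3, 64)), np.zeros(64))
print(out.shape)   # (1024, 64)
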
[dataset, version, joint, term, recognition, human, individual, outperforms] [pose, hand, point, pel, estimation, depth, computer, view, vision, pattern, nyu, pointnet, cloud, error, permutation, unordered, require, estimate, single, local, junsong, camera] [method, input, based, conference, ieee, image, comparison, result, figure, proposed] [deep, residual, network, output, layer, batchnorm, compared, table, performance, neural, normalization, number, scheme, weighted, order, convolutional, structure] [equivariant, requires, model, memory] [segmentation, detection, regression, cnn, feature, voting, final, three, propose, global, object] [data, learning, set, training, test, trained, loss, distribution, testing, large, invariant]
@InProceedings{Li_2019_CVPR,
  author = {Li, Shile and Lee, Dongheui},
  title = {Point-To-Pose Voting Based Hand Pose Estimation Using Residual Permutation Equivariant Layer},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Improving Few-Shot User-Specific Gaze Adaptation via Gaze Redirection Synthesis
Yu Yu, Gang Liu, Jean-Marc Odobez


As an indicator of human attention, gaze is a subtle behavioral cue which can be exploited in many applications. However, inferring 3D gaze direction is challenging even for deep neural networks, given the lack of large amounts of data (groundtruthing gaze is expensive and existing datasets use different setups) and the inherent presence of gaze biases due to person-specific differences. In this work, we address the problem of person-specific gaze model adaptation from only a few reference training samples. The main and novel idea is to improve gaze adaptation by generating additional training samples through the synthesis of gaze-redirected eye images from existing reference samples. In doing so, our contributions are threefold: (i) we design our gaze redirection framework from synthetic data, allowing us to benefit from aligned training sample pairs to predict accurate inverse mapping fields; (ii) we propose a self-supervised approach for domain adaptation; (iii) we exploit the gaze redirection to improve the performance of person-specific gaze estimation. Extensive experiments on two public datasets demonstrate the validity of our gaze retargeting and gaze estimation framework.
[warping, dataset, tracking, time, recognition] [vision, estimation, computer, estimator, approach, groundtruth, note, pattern, error, inverse, range, pose, international, andreas, analysis, well, angle, defined] [gaze, redirection, eye, image, redirected, conference, reference, synthetic, real, ieee, cycle, columbiagaze, based, mpiigaze, method, proposed, amount, input, appearance, iris, realistic, diffnet, acm, yusuke, figure, user, redftadap, synthesis, difference, consistency] [network, performance, original, output, deep, fine, number, compared, best] [model, machine, generate, visual, generated, find] [head, improve, semantic, propose, segmentation, european, aligned, three, person, map, annotated, predicted] [adaptation, domain, generic, data, training, loss, large, sample, learn, learning]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Yu and Liu, Gang and Odobez, Jean-Marc},
  title = {Improving Few-Shot User-Specific Gaze Adaptation via Gaze Redirection Synthesis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
AdaptiveFace: Adaptive Margin and Sampling for Face Recognition
Hao Liu, Xiangyu Zhu, Zhen Lei, Stan Z. Li


Training on large-scale unbalanced data is a central topic in face recognition. In the past two years, face recognition has achieved remarkable improvements due to the introduction of margin based Softmax losses. However, these methods have an implicit assumption that all the classes possess sufficient samples to describe their distributions, so that a manually set margin is enough to equally squeeze the intra-class variations of every class. However, real face datasets are highly unbalanced, which means the classes have tremendously different numbers of samples. In this paper, we argue that the margin should be adapted to different classes. We propose the Adaptive Margin Softmax to adjust the margins for different classes adaptively. In addition to the imbalance challenge, face data always consists of large-scale classes and samples. Smartly selecting valuable classes and samples to participate in the training makes the training more effective and efficient. To this end, we also make the sampling process adaptive in two respects: first, we propose Hard Prototype Mining to adaptively select a small number of hard classes to participate in classification; second, for data sampling, we introduce Adaptive Data Sampling to find valuable samples for training adaptively. We combine these three parts together as AdaptiveFace. Extensive analysis and experiments on LFW, LFW BLUFR and MegaFace show that our method performs better than state-of-the-art methods using the same network architecture and training dataset. Code is available at https://github.com/haoliu1994/AdaptiveFace.
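A common way to realize a class-adaptive margin is to subtract a per-class margin from the target-class cosine similarity before the scaled softmax, instead of using one global margin. The numpy sketch below illustrates that generic additive-margin form under this assumption; it is not the exact AdaptiveFace loss, and the per-class margins here are placeholders rather than learned values.

import numpy as np

def adaptive_margin_softmax_loss(cosine, labels, class_margin, scale=30.0):
    """cosine: (B, C) cosine similarities between features and class weights.
    class_margin: (C,) per-class additive margins (adapted per class rather than global)."""
    logits = scale * cosine.copy()
    rows = np.arange(len(labels))
    logits[rows, labels] = scale * (cosine[rows, labels] - class_margin[labels])
    z = logits - logits.max(axis=1, keepdims=True)      # numerically stable log-softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[rows, labels].mean()

cosine = np.random.uniform(-1, 1, size=(4, 100))
labels = np.array([5, 42, 7, 99])
margins = np.full(100, 0.35)        # uniform here; in practice adapted per class
print(adaptive_margin_softmax_loss(cosine, labels, margins))
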
[recognition, dataset] [computer, vision, pattern, corresponding, international, approach, underlying] [face, conference, ieee, real, figure, based, method, proposed, feedback, zhen] [adaptive, deep, number, performance, table, adaptively, small, verification, larger, network, layer, better, squeeze, scale, compact] [example, find, rich, indicates, decision, blue, introduce, observed, arxiv, preprint] [feature, improve, propose, area, three, boundary] [margin, loss, softmax, hard, class, data, training, large, mining, lfw, sampling, poor, learning, prototype, blufr, space, cosface, sample, classification, megaface, set, distribution, valuable, adaptiveface, metric, select, angular, arcface, cosine, hpm, existing, similarity]
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Hao and Zhu, Xiangyu and Lei, Zhen and Li, Stan Z.},
  title = {AdaptiveFace: Adaptive Margin and Sampling for Face Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Disentangled Representation Learning for 3D Face Shape
Zi-Hang Jiang, Qianyi Wu, Keyu Chen, Juyong Zhang


In this paper, we present a novel strategy to design a disentangled 3D face shape representation. Specifically, a given 3D face shape is decomposed into an identity part and an expression part, which are both encoded and decoded in a nonlinear way. To solve this problem, we propose an attribute decomposition framework for 3D face meshes. To better represent face shapes, which are usually nonlinearly deformed with respect to each other, the face shapes are represented by a vertex-based deformation representation rather than Euclidean coordinates. The experimental results demonstrate that our method has better performance than existing methods in decomposing the identity and expression parts. Moreover, more natural expression transfer results can be achieved with our method than with existing methods.
[graph, fusion, dataset, recognition, framework] [mesh, shape, vertex, deformation, decomposition, computer, vision, error, linear, defined, pattern, reconstruction, michael, local, exp, parametric, lexp] [expression, face, identity, method, based, latent, conference, facial, ieee, proposed, spectral, disentangled, facewarehouse, nonlinear, kld, morphable, flame, attribute, input, reconstructed, acm, decomposed, disentangling, includes, mexp, result, dexp, neutral, difference, lid] [convolution, original, deep, network, better, bilinear, convolutional, achieve, neural, structure] [model, represent, ability, natural, vector] [feature, propose, edge, adopt, branch, module, average, improve] [representation, learning, training, set, transfer, space, data, euclidean, novel, loss, deformed, augmentation, target]
@InProceedings{Jiang_2019_CVPR,
  author = {Jiang, Zi-Hang and Wu, Qianyi and Chen, Keyu and Zhang, Juyong},
  title = {Disentangled Representation Learning for 3D Face Shape},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
LBS Autoencoder: Self-Supervised Fitting of Articulated Meshes to Point Clouds
Chun-Liang Li, Tomas Simon, Jason Saragih, Barnabas Poczos, Yaser Sheikh


We present LBS-AE, a self-supervised autoencoding algorithm for fitting articulated mesh models to point clouds. As input, we take a sequence of point clouds to be registered as well as an artist-rigged mesh, i.e. a template mesh equipped with a linear-blend skinning (LBS) deformation space parameterized by a skeleton hierarchy. As output, we learn an LBS-based autoencoder that produces registered meshes from the input point clouds. To bridge the gap between the artist-defined geometry and the captured point clouds, our autoencoder models pose-dependent deviations from the template geometry. During training, instead of using explicit correspondences, such as key points or pose supervision, our method leverages LBS deformations to bootstrap the learning process. To avoid poor local minima from erroneous point-to-point correspondences, we utilize a structured Chamfer distance based on part-segmentations, which are learned concurrently using self-supervision. We demonstrate qualitative results on real captured hands, and report quantitative evaluations on the FAUST benchmark for body registration. Our method achieves performance that is superior to other unsupervised approaches and comparable to methods using supervised examples.
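The structured Chamfer distance mentioned above builds on the standard symmetric Chamfer distance by restricting nearest-neighbour matches to corresponding body parts. The sketch below shows only the vanilla, unstructured Chamfer distance in numpy as a reference point; the part-conditioned version from the paper is not reproduced.

import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between two point clouds P (N, 3) and Q (M, 3)."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)   # (N, M) squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

P = np.random.rand(512, 3)     # e.g. deformed template vertices
Q = np.random.rand(1024, 3)    # e.g. captured point cloud
print(chamfer_distance(P, Q))
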
[joint, human, key, version] [point, template, deformation, mesh, correspondence, pose, body, shape, fitting, hand, smpl, inferred, algorithm, chamfer, local, skinning, well, fit, cloud, angle, finger, reconstruction, defined, estimation, additional, note, articulated, faust, registration, surreal, geometry] [figure, input, proposed, based, synthetic, prior, real, difference, captured, generative, study] [network, deep, neural, better, search, structured] [model, generate, sampled, infer, consider, true] [segmentation, propose, improve] [data, training, learning, learn, nearest, distance, neighbor, unsupervised, train, distribution, uniform, deformed, supervised, trained, testing, function, base, space, gap, knowledge, loss]
@InProceedings{Li_2019_CVPR,
  author = {Li, Chun-Liang and Simon, Tomas and Saragih, Jason and Poczos, Barnabas and Sheikh, Yaser},
  title = {LBS Autoencoder: Self-Supervised Fitting of Articulated Meshes to Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PifPaf: Composite Fields for Human Pose Estimation
Sven Kreiss, Lorenzo Bertoni, Alexandre Alahi


We propose a new bottom-up method for multi-person 2D human pose estimation that is particularly well suited for urban mobility such as self-driving cars and delivery robots. The new method, PifPaf, uses a Part Intensity Field (PIF) to localize body parts and a Part Association Field (PAF) to associate body parts with each other to form full human poses. Our method outperforms previous methods at low resolution and in crowded, cluttered and occluded scenes thanks to (i) our new composite field PAF encoding fine-grained information and (ii) the choice of Laplace loss for regressions which incorporates a notion of uncertainty. Our architecture is based on a fully convolutional, single-shot, box-free design. We perform on par with the existing state-of-the-art bottom-up method on the standard COCO keypoint task and produce state-of-the-art results on a modified COCO keypoint task for the transportation domain.
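The Laplace loss mentioned for the regressions is, in one common realization, the negative log-likelihood of a Laplace distribution whose scale b is predicted alongside the value, so uncertain predictions are automatically down-weighted. The numpy sketch below shows that generic form; the exact parameterization used in PifPaf may differ.

import numpy as np

def laplace_nll(pred, target, log_b):
    """Negative log-likelihood of a Laplace distribution with location `pred`
    and scale b = exp(log_b); a large predicted b down-weights uncertain regressions."""
    b = np.exp(log_b)
    return (np.abs(pred - target) / b + log_b + np.log(2.0)).mean()

pred   = np.array([0.4, 1.2, -0.3])   # predicted keypoint offsets
target = np.array([0.5, 1.0, -0.2])
log_b  = np.array([-1.0, 0.0, -2.0])  # per-prediction log-scale (uncertainty)
print(laplace_nll(pred, target, log_b))
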
[human, joint, paf, time, multiple] [pose, estimation, field, computer, confidence, keypoint, vision, body, pattern, international, openpose, laplace, left, well, form, partially, predicts, point, multiperson] [conference, resolution, figure, image, ieee, method, high, composite, component, based, input, proposed, intensity] [network, neural, low, output, deep, table, scale, size, small, performance, convolutional, outperform] [association, vector, model, decoding] [person, bounding, mask, coco, map, pif, pifpaf, location, head, box, feature, occlude, crowded, detection, european] [loss, learning, training, task, data, set]
@InProceedings{Kreiss_2019_CVPR,
  author = {Kreiss, Sven and Bertoni, Lorenzo and Alahi, Alexandre},
  title = {PifPaf: Composite Fields for Human Pose Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
TACNet: Transition-Aware Context Network for Spatio-Temporal Action Detection
Lin Song, Shiwei Zhang, Gang Yu, Hongbin Sun


Current state-of-the-art approaches for spatio-temporal action detection have achieved impressive results but remain unsatisfactory for temporal extent detection. The main reason is that there are some ambiguous states similar to the real actions, which may be treated as target actions even by a well-trained network. In this paper, we define these ambiguous samples as "transitional states", and propose a Transition-Aware Context Network (TACNet) to distinguish transitional states. The proposed TACNet includes two main components, i.e., a temporal context detector and a transition-aware classifier. The temporal context detector can extract long-term context information with constant time complexity by constructing a recurrent network. The transition-aware classifier can further distinguish transitional states by classifying action and transitional states simultaneously. Therefore, the proposed TACNet can substantially improve the performance of spatio-temporal action detection. We extensively evaluate the proposed TACNet on the UCF101-24 and J-HMDB datasets. The experimental results demonstrate that TACNet obtains competitive performance on J-HMDB and significantly outperforms the state-of-the-art methods on the untrimmed UCF101-24 in terms of both frame-mAP and video-mAP.
[action, temporal, transitional, tacnet, state, extract, untrimmed, dataset, video, predict, construct, time, framework, perform, multiple, considering, prediction, recurrent, outperforms] [differential, analysis, algorithm, define, corresponding] [proposed, figure, based, method, demonstrate, background] [performance, apply, table, network, deep, standard, scheme, number] [evaluate, find, arxiv, preprint, probability, ambiguous, critical, common, mode, model] [context, detection, detector, category, ssd, distinguish, improve, spatial, propose, weakly, score, iou, improvement, regression, predicted, fully, detect, improves] [classifier, training, classification, set, supervised, learning, target, sample, train, experimental]
@InProceedings{Song_2019_CVPR,
  author = {Song, Lin and Zhang, Shiwei and Yu, Gang and Sun, Hongbin},
  title = {TACNet: Transition-Aware Context Network for Spatio-Temporal Action Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Regularity in Skeleton Trajectories for Anomaly Detection in Videos
Romero Morais, Vuong Le, Truyen Tran, Budhaditya Saha, Moussa Mansour, Svetha Venkatesh


Appearance features have been widely used in video anomaly detection even though they contain complex entangled factors. We propose a new method to model the normal patterns of human movements in surveillance video for anomaly detection using dynamic skeleton features. We decompose the skeletal movements into two sub-components: global body movement and local body posture. We model the dynamics and interaction of the coupled features in our novel Message-Passing Encoder-Decoder Recurrent Network. We observed that the decoupled features collaboratively interact in our spatio-temporal model to accurately identify human-related irregular events from surveillance video sequences. Compared to traditional appearance-based models, our method achieves superior outlier detection performance. Our model also offers "open-box" examination and decision explanation made possible by the semantically understandable features and a network architecture supporting interpretability.
[skeleton, anomaly, video, human, motion, abnormal, frame, movement, recurrent, prediction, dataset, event, rnn, trajectory, gru, dynamic, current, consists, action, time, liu, temporal] [local, normal, computer, vision, body, pattern, international, scene, error, pose, decomposition] [conference, input, ieee, method, figure, image, surveillance, anomalous, based, component, appearance, proposed] [network, performance, neural, architecture, structure, original, compared, deep] [model, observed, message, visual] [global, detection, person, bounding, score, shanghaitech, segment, feature, propose, branch, three, detect, detected, semantic, detecting] [learning, loss, training, data, set, unsupervised, trained]
@InProceedings{Morais_2019_CVPR,
  author = {Morais, Romero and Le, Vuong and Tran, Truyen and Saha, Budhaditya and Mansour, Moussa and Venkatesh, Svetha},
  title = {Learning Regularity in Skeleton Trajectories for Anomaly Detection in Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Local Temporal Bilinear Pooling for Fine-Grained Action Parsing
Yan Zhang, Siyu Tang, Krikamol Muandet, Christian Jarvers, Heiko Neumann


Fine-grained temporal action parsing is important in many applications, such as daily activity understanding, human motion analysis, surgical robotics and others requiring subtle and precise operations over a long-term period. In this paper we propose a novel bilinear pooling operation, which is used in intermediate layers of a temporal convolutional encoder-decoder net. In contrast to previous work, our proposed bilinear pooling is learnable and hence can capture more complex local statistics than the conventional counterpart. In addition, we introduce exact lower-dimensional representations of our bilinear forms, so that the dimensionality is reduced without suffering from information loss or requiring extra computation. We perform extensive experiments to quantitatively analyze our model and show superior performance compared to other state-of-the-art pooling methods on various datasets.
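For context, conventional (non-learnable) bilinear pooling over a local temporal window averages the outer products of frame features inside that window; this is the baseline form the learnable variant improves on. The numpy sketch below shows only this conventional form; the learnable pooling and the exact low-dimensional projections from the paper are not reproduced.

import numpy as np

def local_bilinear_pool(X, window):
    """X: (T, d) per-frame features. Returns (T, d*d): for each frame, the average
    outer product of features inside a centred temporal window (conventional bilinear pooling)."""
    T, d = X.shape
    half = window // 2
    out = np.zeros((T, d * d))
    for t in range(T):
        seg = X[max(0, t - half): t + half + 1]            # local temporal neighbourhood
        out[t] = np.mean([np.outer(f, f) for f in seg], axis=0).ravel()
    return out

X = np.random.randn(100, 16)                  # 100 frames, 16-d features
print(local_bilinear_pool(X, window=9).shape) # (100, 256)
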
[action, temporal, recognition, capture, perform, sequence, outperforms, spatiotemporal, work, video, frame, dataset, time] [computer, vision, pattern, local, form, matrix, international, ground, defined, volume, analysis, neighborhood] [conference, ieee, method, proposed, comparison, component, coupled, result] [bilinear, pooling, convolutional, neural, max, normalization, learnable, convolution, kernel, deep, net, compact, activation, reduction, vec, tcedbd, proposes, network, power, table, tcedbc, rojection, equivalent, layer, tced, operation, computational, tensor, denotes] [visual, vector, arxiv, preprint, model] [feature, cnn, score, spatial, parsing, segmentation, european] [dimension, decoupled, conventional, set, data, loss, consistently]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Yan and Tang, Siyu and Muandet, Krikamol and Jarvers, Christian and Neumann, Heiko},
  title = {Local Temporal Bilinear Pooling for Fine-Grained Action Parsing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Improving Action Localization by Progressive Cross-Stream Cooperation
Rui Su, Wanli Ouyang, Luping Zhou, Dong Xu


Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification, and temporal segmentation. In this work, we propose a new Progressive Cross-stream Cooperation (PCSC) framework to iteratively improve action localization results and generate better bounding boxes for one stream (i.e., Flow/RGB) by leveraging both region proposals and features from the other stream (i.e., RGB/Flow) in an iterative fashion. Specifically, we first generate a larger set of region proposals by combining the latest region proposals from both streams, from which we can readily obtain a larger set of labelled training samples to help learn better action detection models. Second, we also propose a new message passing approach to pass information from one stream to the other in order to learn better representations, which also leads to better action detection models. As a result, our iterative framework progressively improves action localization results at the frame level. To improve action localization results at the video level, we additionally propose a new strategy to train class-specific actionness detectors for better temporal segmentation, which can be readily learnt by using the training samples around temporal boundaries. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.
[action, flow, cooperation, stream, temporal, pcsc, actionness, passing, motion, dataset, tube, human, frame, recognition, video, work, framework, trgb] [rgb, computer, vision, approach, pattern] [method, conference, based, ieee, appearance, proposed] [table, better, performance, conv, network, progressive, order, convolutional, overlap, neural] [message, improved, model, progressively] [detection, feature, region, stage, bounding, improve, proposal, help, localization, head, iou, refinement, box, roi, propose, level, threshold, module, spatial, complementary, three, faster, pathway, object] [training, set, extractor, learning, learn, class, testing, strategy, train, existing, exploit]
@InProceedings{Su_2019_CVPR,
  author = {Su, Rui and Ouyang, Wanli and Zhou, Luping and Xu, Dong},
  title = {Improving Action Localization by Progressive Cross-Stream Cooperation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition
Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu


In skeleton-based action recognition, graph convolutional networks (GCNs), which model the human body skeletons as spatiotemporal graphs, have achieved remarkable performance. However, in existing GCN-based methods, the topology of the graph is set manually, and it is fixed over all layers and input samples. This may not be optimal for the hierarchical GCN and diverse samples in action recognition tasks. In addition, the second-order information (the lengths and directions of bones) of the skeleton data, which is naturally more informative and discriminative for action recognition, is rarely investigated in existing methods. In this work, we propose a novel two-stream adaptive graph convolutional network (2s-AGCN) for skeleton-based action recognition. The topology of the graph in our model can be either uniformly or individually learned by the BP algorithm in an end-to-end manner. This data-driven method increases the flexibility of the model for graph construction and brings more generality to adapt to various data samples. Moreover, a two-stream framework is proposed to model both the first-order and the second-order information simultaneously, which brings a notable improvement in recognition accuracy. Extensive experiments on two large-scale datasets, NTU-RGBD and Kinetics-Skeleton, demonstrate that the performance of our model exceeds the state-of-the-art by a significant margin.
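One common reading of an adaptive graph convolution of this kind is that the effective adjacency is a sum of the fixed skeleton adjacency, a freely learned global matrix, and a data-dependent term computed from embedded feature similarity. The numpy sketch below illustrates that generic composition for a single layer; it simplifies away the subset partitioning and temporal convolution used in the actual 2s-AGCN block.

import numpy as np

def adaptive_graph_conv(X, A_skeleton, B_learned, W, W_theta, W_phi):
    """X: (N, d) joint features. Adjacency = fixed skeleton graph + learned global
    matrix + data-dependent term from embedded feature similarity."""
    theta, phi = X @ W_theta, X @ W_phi                        # (N, e) embeddings
    sim = theta @ phi.T
    C = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)   # softmax-normalised similarity
    A = A_skeleton + B_learned + C
    return np.maximum(A @ X @ W, 0.0)                          # aggregate, transform, ReLU

rng = np.random.default_rng(1)
N, d, e, out_dim = 25, 8, 4, 16                                # 25 skeleton joints
X = rng.normal(size=(N, d))
A_skel = np.eye(N)                                             # placeholder skeleton adjacency
y = adaptive_graph_conv(X, A_skel, rng.normal(size=(N, N)) * 0.01,
                        rng.normal(size=(d, out_dim)),
                        rng.normal(size=(d, e)), rng.normal(size=(d, e)))
print(y.shape)   # (25, 16)
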
[graph, action, skeleton, recognition, human, joint, temporal, bone, dataset, spatiotemporal, perform, gcn, framework, sequence, work, second, video, naturally] [matrix, pattern, vertex, computer, vision, body, corresponding, topology, perspective, normalized, left] [ieee, based, conference, input, proposed, figure] [convolutional, adaptive, network, convolution, structure, represents, neural, number, layer, denotes, original, performance, deep, size, connection, residual, validation, designed, weight, accuracy, flexibility, output, processing, fixed] [model, vector, physical, attention, unique] [spatial, feature, final, three, center, propose, map, visualization, hierarchical] [data, set, learned, adjacency, learning, training, function, target, subset]
@InProceedings{Shi_2019_CVPR,
  author = {Shi, Lei and Zhang, Yifan and Cheng, Jian and Lu, Hanqing},
  title = {Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Neural Network Based on SPD Manifold Learning for Skeleton-Based Hand Gesture Recognition
Xuan Son Nguyen, Luc Brun, Olivier Lezoray, Sebastien Bougleux


This paper proposes a new neural network based on SPD manifold learning for skeleton-based hand gesture recognition. Given the stream of the hand's joint positions, our approach combines two aggregation processes on the spatial and temporal domains, respectively. The pipeline of our network architecture consists of three main stages. The first stage is based on a convolutional layer to increase the discriminative power of learned features. The second stage relies on different architectures for spatial and temporal Gaussian aggregation of joint features. The third stage learns a final SPD matrix from skeletal data. A new type of layer is proposed for the third stage, based on a variant of stochastic gradient descent on Stiefel manifolds. The proposed network is validated on two challenging datasets and shows state-of-the-art accuracies on both datasets.
[spd, recognition, action, gesture, gaussagg, skeleton, temporal, logeig, reeig, dhg, vecmat, dataset, joint, human, frame, fpha, graph, dout, sequence, capture, lie, second, modeling] [hand, matrix, skeletal, depth, riemannian, pose, approach, defined, body, estimated, geometric, computed] [method, based, proposed, comparison, image, figure, input, mapping, diagonal] [network, layer, convolutional, deep, neural, accuracy, gaussian, output, covariance, aggregation, table, performance, performed, best, architecture] [manifold, referred, physical, node] [feature, spatial, grid, three] [learning, set, data, classification, experimental, representation, euclidean]
@InProceedings{Nguyen_2019_CVPR,
  author = {Son Nguyen, Xuan and Brun, Luc and Lezoray, Olivier and Bougleux, Sebastien},
  title = {A Neural Network Based on SPD Manifold Learning for Skeleton-Based Hand Gesture Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition
Deepti Ghadiyaram, Du Tran, Dhruv Mahajan


Current fully-supervised video datasets consist of only a few hundred thousand videos and fewer than a thousand domain-specific labels. This hinders the progress towards advanced video architectures. This paper presents an in-depth study of using large volumes of web videos for pre-training video models for the task of action recognition. Our primary empirical finding is that pre-training at a very large scale (over 65 million videos), despite on noisy social-media videos and hashtags, substantially improves the state-of-the-art on three challenging public action recognition datasets. Further, we examine three questions in the construction of weakly-supervised video action datasets. First, given that actions involve interactions with objects, how should one construct a verb-object pre-training label space to benefit transfer learning the most? Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient or is pre-training for spatio-temporal features valuable for optimal transfer learning? Finally, actions are generally less well-localized in long videos vs. short videos; since action labels are provided at a video level, how should one choose video clips for best performance, given some fixed budget of number or minutes of videos?
[video, temporal, kinetics, action, short, dataset, construct, long, hashtags, report, recognition, human, longer, clip, despite] [rgb, approach, total, associated, constructing, note] [image, input, study, noise] [number, performance, accuracy, compared, table, better, fixed, convolutional, deep, fewer, imagenet, validation, best, budget, overlap] [model, arxiv, preprint, visual, random, diverse, diversity, consider] [object, weakly, improvement, seed, boost, localization, improves] [label, training, learning, datasets, data, target, large, space, source, pretraining, transfer, observe, web, supervised, sampling, distribution, test, noisy]
@InProceedings{Ghadiyaram_2019_CVPR,
  author = {Ghadiyaram, Deepti and Tran, Du and Mahajan, Dhruv},
  title = {Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Spatio-Temporal Representation With Local and Global Diffusion
Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xinmei Tian, Tao Mei


Convolutional Neural Networks (CNNs) have been regarded as a powerful class of models for visual recognition problems. Nevertheless, the convolutional filters in these networks are local operations that ignore long-range dependencies. This drawback becomes even worse for video recognition, since video is an information-intensive medium with complex temporal variations. In this paper, we present a novel framework to boost spatio-temporal representation learning by Local and Global Diffusion (LGD). Specifically, we construct a novel neural network architecture that learns the local and global representations in parallel. The architecture is composed of LGD blocks, where each block updates local and global features by modeling the diffusions between these two representations. The diffusions effectively exchange information between the two aspects, i.e., localized and holistic, yielding a more powerful way of learning representations. Furthermore, a kernelized classifier is introduced to combine the representations from the two aspects for video recognition. Our LGD networks achieve clear improvements on the large-scale Kinetics-400 and Kinetics-600 video classification datasets over the best competitors by 3.5% and 0.7%. We further examine the generalization of both the global and local representations produced by our pre-trained LGD networks on four different benchmarks for video action recognition and spatio-temporal action detection tasks. Superior performance over several state-of-the-art techniques on these benchmarks is reported.
[video, lgd, action, temporal, flow, recognition, spatiotemporal, consists, dataset, ting, optical, prediction, challenge, cordelia, kernelized] [local, rgb, projection] [proposed, figure, input, based, method, image, transformation] [performance, network, convolutional, block, table, neural, conv, architecture, residual, convolution, deep, validation, output, imagenet, kernel, accuracy, achieves, lower, size] [path, model, visual] [global, feature, cnn, backbone, proposal, detection, final, holistic, segment] [representation, learning, diffusion, training, set, classifier, test, learn, function, combination, gap, loss, learnt, tao, datasets, extended]
@InProceedings{Qiu_2019_CVPR,
  author = {Qiu, Zhaofan and Yao, Ting and Ngo, Chong-Wah and Tian, Xinmei and Mei, Tao},
  title = {Learning Spatio-Temporal Representation With Local and Global Diffusion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Learning of Action Classes With Continuous Temporal Embedding
Anna Kukleva, Hilde Kuehne, Fadime Sener, Jurgen Gall


The task of temporally detecting and segmenting actions in untrimmed videos has seen increased attention recently. One problem in this context arises from the need to define and label action boundaries to create annotations for training, which is very time- and cost-intensive. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. To this end, we use a continuous temporal embedding of framewise features to benefit from the sequential nature of activities. Based on the latent space created by the embedding, we identify clusters of temporal segments across all videos that correspond to semantically meaningful action classes. The approach is evaluated on three challenging datasets, namely the Breakfast dataset, YouTube Instructions, and the 50Salads dataset. While previous works assumed that the videos contain the same high-level activity, we furthermore show that the proposed approach can also be applied to a more general setting where the content of the videos is unknown.
[video, temporal, action, activity, mof, dataset, frame, breakfast, mallow, time, youtube, viterbi, subactions, subaction, ordering, recognition, ordered, report, hilde, juergen, untrimmed, previous, work, complex, sequence, xmn, human, outperforms, occur, clustered, influence, long, build] [approach, respect, continuous, case, well, note, completely, additional, problem, relative] [proposed, background, based, method, described, comparison, high] [table, accuracy, compare, order, impact, ratio, number, full] [model, evaluation, decoding, evaluate, embedded, visual] [weakly, iou, segmentation, propose, fully] [learning, unsupervised, embedding, cluster, supervised, learn, clustering, combination, set, label, representation, training, reported, task, space, datasets, protocol]
@InProceedings{Kukleva_2019_CVPR,
  author = {Kukleva, Anna and Kuehne, Hilde and Sener, Fadime and Gall, Jurgen},
  title = {Unsupervised Learning of Action Classes With Continuous Temporal Embedding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Double Nuclear Norm Based Low Rank Representation on Grassmann Manifolds for Clustering
Xinglin Piao, Yongli Hu, Junbin Gao, Yanfeng Sun, Baocai Yin


Unsupervised clustering for high-dimensional data (such as image sets or videos) is a hard problem in data processing and data mining, since these data often lie on a manifold (such as a Grassmann manifold). Inspired by low-rank representation theory, researchers have proposed a series of effective clustering methods for high-dimensional data with non-linear metrics. However, most of these methods adopt the traditional single nuclear norm as the relaxation of the rank function, which would lead to a suboptimal solution that deviates from the original one. In this paper, we propose a new low-rank model for high-dimensional data clustering on Grassmann manifolds, based on the Double Nuclear norm, which better approximates the rank minimization of a matrix. Further, to account for the intrinsic geometry and structure of the data space, we integrate adaptive Laplacian regularization to model the local relationships among data samples. The proposed models have been assessed on several public datasets for image-set clustering. The experimental results show that the proposed models outperform state-of-the-art clustering methods.
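For background, nuclear-norm-based low-rank models are usually solved with a proximal step that soft-thresholds singular values (singular value thresholding). The numpy sketch below shows that standard operator only; the Double Nuclear norm, the Grassmann-manifold embedding, and the Laplacian term from the paper are not reproduced.

import numpy as np

def singular_value_threshold(M, tau):
    """Proximal operator of tau * ||X||_* : soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_shrunk) @ Vt

M = np.random.randn(50, 20) @ np.random.randn(20, 40) + 0.1 * np.random.randn(50, 40)
X = singular_value_threshold(M, tau=1.0)
print(np.linalg.matrix_rank(X, tol=1e-6))   # typically lower rank than M
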
[dataset, construct, video] [grassmann, matrix, laplacian, nuclear, solution, analysis, pattern, singular, wij, local, defined, international, algorithm, optimization, respect, robust, linear, solve, computer, column, university, geometry] [based, proposed, ieee, method, conference, double, image, traditional, fixing, face, raw] [low, norm, represents, sparse, adaptive, complexity, number, formulate, regularizer, ssc, original, structure, regularized] [manifold, model, machine, introduce] [adopt, affinity] [data, clustering, rank, representation, min, update, subspace, extended, function, xji, imageset, yale, minimization, sample, lrr, procceedings, yongli, junbin, set, learning, yanfeng, baocai, ballet]
@InProceedings{Piao_2019_CVPR,
  author = {Piao, Xinglin and Hu, Yongli and Gao, Junbin and Sun, Yanfeng and Yin, Baocai},
  title = {Double Nuclear Norm Based Low Rank Representation on Grassmann Manifolds for Clustering},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
SR-LSTM: State Refinement for LSTM Towards Pedestrian Trajectory Prediction
Pu Zhang, Wanli Ouyang, Pengfei Zhang, Jianru Xue, Nanning Zheng


In crowd scenarios, reliable trajectory prediction of pedestrians requires insightful understanding of their social behaviors. These behaviors have been investigated in plenty of studies, yet they are hard to express fully with hand-crafted rules. Recent studies based on LSTM networks have shown great ability to learn social behaviors. However, many of these methods rely on previous neighboring hidden states but ignore the important current intentions of the neighbors. In order to address this issue, we propose a data-driven state refinement module for LSTM networks (SR-LSTM), which makes use of the current intentions of neighbors, and jointly and iteratively refines the current states of all participants in the crowd through a message passing mechanism. To effectively extract the social effect of neighbors, we further introduce a social-aware information selection mechanism consisting of an element-wise motion gate and a pedestrian-wise attention to select useful messages from neighboring pedestrians. Experimental results on two public datasets, i.e. ETH and UCY, demonstrate the effectiveness of our proposed SR-LSTM and we achieve state-of-the-art results.
[trajectory, lstm, motion, current, time, social, hidden, prediction, passing, walking, interaction, graph, previous, human, state, predicting, multiple, future, predict, behavior, intention, jointly, stationary, recurrent, frame, outperforms] [neighborhood, relative, scene, position, ground, volume] [based, input, row, figure, proposed, method] [gate, size, selection, denotes, neural, cell, table, group, performance, adaptively, output, implementation, layer] [attention, message, model, simple, arxiv, preprint, step, potential, game, consider, red] [pedestrian, refinement, crowd, neighboring, feature, module, location, refine, predicted, spatial, utilize, including, object] [pairwise, data, function, selected, select, neighbor, training, transportation]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Pu and Ouyang, Wanli and Zhang, Pengfei and Xue, Jianru and Zheng, Nanning},
  title = {SR-LSTM: State Refinement for LSTM Towards Pedestrian Trajectory Prediction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Unsupervised Deep Epipolar Flow for Stationary or Dynamic Scenes
Yiran Zhong, Pan Ji, Jianyuan Wang, Yuchao Dai, Hongdong Li


Unsupervised deep learning for optical flow computation has achieved promising results. Most existing deep-net based methods rely on image brightness consistency and local smoothness constraints to train the networks. Their performance degrades in regions where repetitive textures or occlusions occur. In this paper, we propose Deep Epipolar Flow, an unsupervised optical flow method which incorporates global geometric constraints into network learning. In particular, we investigate multiple ways of enforcing the epipolar constraint in flow estimation. To alleviate a "chicken-and-egg" type of problem encountered in dynamic scenes where multiple motions may be present, we propose a low-rank constraint as well as a union-of-subspaces constraint for training. Experimental results on various benchmarking datasets show that our method achieves competitive performance compared with supervised methods and outperforms state-of-the-art unsupervised deep-learning methods.
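The epipolar constraint referred to above says that a pixel p and its flow-displaced correspondence p' should satisfy p'ᵀ F p ≈ 0 for the fundamental matrix F between the two frames. The numpy sketch below penalizes this algebraic error as an extra unsupervised loss term; it uses a placeholder F and omits the paper's low-rank and union-of-subspaces handling of dynamic scenes.

import numpy as np

def epipolar_flow_loss(pts, flow, F):
    """pts: (N, 2) pixel coordinates, flow: (N, 2) predicted flow, F: (3, 3) fundamental matrix.
    Penalises the algebraic epipolar error |p2^T F p1| for each correspondence."""
    ones = np.ones((len(pts), 1))
    p1 = np.hstack([pts, ones])                # homogeneous coordinates in frame 1
    p2 = np.hstack([pts + flow, ones])         # flow-displaced points in frame 2
    return np.abs(np.einsum('ni,ij,nj->n', p2, F, p1)).mean()

pts  = np.random.rand(100, 2) * [1242, 375]    # e.g. KITTI-sized image
flow = np.random.randn(100, 2)
F    = np.random.randn(3, 3)                   # placeholder fundamental matrix
print(epipolar_flow_loss(pts, flow, F))
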
[flow, optical, sintel, dataset, motion, dynamic, warping, multiple, stationary, work, term, frame] [kitti, epipolar, constraint, fundamental, estimation, matrix, error, scene, estimated, smoothness, photometric, geometric, problem, occlusion, geometry, compute, rigid, mpi, computer, note, ground, truth, camera, nuclear, michael, handling, wrong, corresponding, optimization] [image, ieee, method, based, figure, input, clean, conference, handle] [deep, network, performance, compared, norm, regularization, number, table, achieves, best, applied] [model, variational] [final, three, global, segmentation, propose] [loss, unsupervised, learning, training, subspace, train, conventional, supervised, data, trained, hard, rank, test, soft]
@InProceedings{Zhong_2019_CVPR,
  author = {Zhong, Yiran and Ji, Pan and Wang, Jianyuan and Dai, Yuchao and Li, Hongdong},
  title = {Unsupervised Deep Epipolar Flow for Stationary or Dynamic Scenes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
An Efficient Schmidt-EKF for 3D Visual-Inertial SLAM
Patrick Geneva, James Maley, Guoquan Huang


Enabling centimeter-accuracy positioning for mobile and wearable sensor systems holds great implications for practical applications. In this paper, we propose a novel, high-precision, efficient visual-inertial (VI)-SLAM algorithm, termed Schmidt-EKF VI-SLAM (SEVIS), which optimally fuses IMU measurements and monocular images in a tightly-coupled manner to provide 3D motion tracking with bounded error. In particular, we adapt the Schmidt Kalman filter formulation to selectively include informative features in the state vector while treating them as nuisance parameters (or Schmidt states) once they become mature. This change in modeling allows for significant computational savings by no longer needing to constantly update the Schmidt states (or their covariance), while still allowing the EKF to correctly account for their cross-correlations with the active states. As a result, we achieve linear computational complexity in terms of map size, instead of quadratic as in standard SLAM systems. In order to fully exploit the map information to bound navigation drift, we advocate efficient keyframe-aided 2D-to-2D feature matching to find reliable correspondences between current 2D visual measurements and 3D map features. The proposed SEVIS is extensively validated in both simulations and experiments.
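The computational saving comes from the Schmidt (consider-state) update: at measurement time, the gain rows for the nuisance/Schmidt states are zeroed, so those states are not corrected, while their cross-correlations with the active states are still accounted for. The numpy sketch below shows that partial update for a toy linear measurement; the state layout, dimensions, and measurement model are illustrative, not SEVIS's.

import numpy as np

def schmidt_kf_update(x, P, z, H, R, n_active):
    """Schmidt-style EKF update: only the first n_active states receive a correction,
    nuisance (Schmidt) states keep their value, but cross-covariances are still tracked.
    Joseph form is used so the suboptimal (zeroed) gain still yields a valid covariance."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)       # optimal Kalman gain
    K[n_active:, :] = 0.0                # zero the gain rows of the nuisance states
    r = z - H @ x                        # measurement residual
    x_new = x + K @ r
    I_KH = np.eye(len(x)) - K @ H
    P_new = I_KH @ P @ I_KH.T + K @ R @ K.T
    return x_new, P_new

n = 6                                    # 3 active + 3 nuisance states (illustrative)
x = np.zeros(n)
P = np.eye(n)
H = np.hstack([np.eye(3), 0.1 * np.eye(3)])   # measurement touches both blocks
x2, P2 = schmidt_kf_update(x, P, np.array([1.0, -0.5, 0.2]), H, 0.01 * np.eye(3), n_active=3)
print(x2)                                # the last three entries remain unchanged
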
[state, sevis, current, vio, imu, trajectory, perform, inertial, keyframe, ekf, kalman, time, msckf, drift, tracked, frame, window, inclusion, nuisance, performing, motion, keyframes, pask] [slam, active, robotics, international, schmidt, error, matching, loop, algorithm, matrix, sensor, monocular, estimator, automation, analysis, measurement, allowing, bound, closure, camera, provide, allows, linear, period, estimation, relative] [proposed, ieee, conference, based, figure, noise, image] [computational, covariance, filter, number, complexity, performance, efficient, full, accuracy, size, mobile, standard, order] [navigation, visual, system, find] [map, baseline, feature, localization, sliding, global, three] [update, data]
@InProceedings{Geneva_2019_CVPR,
  author = {Geneva, Patrick and Maley, James and Huang, Guoquan},
  title = {An Efficient Schmidt-EKF for 3D Visual-Inertial SLAM},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Neural Temporal Model for Human Motion Prediction
Anand Gopalakrishnan, Ankur Mali, Dan Kifer, Lee Giles, Alexander G. Ororbia


We propose novel neural temporal models for predicting and synthesizing human motion, achieving state-of-the-art in modeling long-term motion trajectories while being competitive with prior work in short-term prediction and requiring significantly less computation. Key aspects of our proposed system include: 1) a novel, two-level processing architecture that aids in generating planned trajectories, 2) a simple set of easily computable features that integrate derivative information, and 3) a novel multi-objective loss function that helps the model to slowly progress from simple next-step prediction to the harder task of multi-step, closed-loop prediction. Our results demonstrate that these innovations improve the modeling of long-term motion trajectories. Finally, we propose a novel metric, called Normalized Power Spectrum Similarity (NPSS), to evaluate the long-term predictive ability of motion synthesis models, complementing the popular mean-squared error (MSE) measure of Euler joint angles over time. We conduct a user study to determine if the proposed NPSS correlates with human evaluation of long-term motion more strongly than MSE and find that it indeed does. We release code and additional results (visualizations) for this paper at: https://github.com/cr7anand/neural_temporal_models
[motion, prediction, human, sequence, walking, joint, action, work, rnn, recurrent, time, long, capture, state, modeling, fed] [well, computer, derivative, compute, international, computed] [mse, user, proposed, synthesis, input, noise, study, spectrum, conference, prior, based] [neural, table, power, processing, architecture, process, better, output, discussion, order, network] [model, evaluation, simple, generating] [predicted] [test, metric, loss, data, training, novel, learning, function, set, similarity, trained]
@InProceedings{Gopalakrishnan_2019_CVPR,
  author = {Gopalakrishnan, Anand and Mali, Ankur and Kifer, Dan and Giles, Lee and Ororbia, Alexander G.},
  title = {A Neural Temporal Model for Human Motion Prediction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Agent Tensor Fusion for Contextual Trajectory Prediction
Tianyang Zhao, Yifei Xu, Mathew Monfort, Wongun Choi, Chris Baker, Yibiao Zhao, Yizhou Wang, Ying Nian Wu


Accurate prediction of others' trajectories is essential for autonomous driving. Trajectory prediction is challenging because it requires reasoning about agents' past movements, social interactions among varying numbers and kinds of agents, constraints from the scene context, and the stochasticity of human behavior. Our approach models these interactions and constraints jointly within a novel Multi-Agent Tensor Fusion (MATF) network. Specifically, the model encodes multiple agents' past trajectories and the scene context into a Multi-Agent Tensor, then applies convolutional fusion to capture multiagent interactions while retaining the spatial structure of agents and the scene context. The model decodes recurrently to multiple agents' future trajectories, using adversarial loss to learn stochastic predictions. Experiments on both highway driving and pedestrian crowd datasets show that the model achieves state-of-the-art prediction accuracy.
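A hedged sketch of the multi-agent tensor construction (placeholder names, not the authors' implementation): per-agent trajectory encodings are scattered onto a spatial grid aligned with the scene-context features, so that subsequent convolutions can fuse agent-agent and agent-scene interactions while retaining spatial structure.

```python
import numpy as np

def build_multi_agent_tensor(agent_feats, agent_xy, scene_feat, extent):
    """Hypothetical sketch of the multi-agent tensor: scatter each agent's
    trajectory encoding onto the spatial grid of the scene-context feature map,
    then concatenate the two so convolutions can fuse them.
    agent_feats: (N, C) per-agent trajectory encodings
    agent_xy:    (N, 2) agent positions in scene coordinates
    scene_feat:  (Cs, H, W) encoded scene context
    extent:      ((xmin, xmax), (ymin, ymax)) spatial extent of the grid"""
    Cs, H, W = scene_feat.shape
    C = agent_feats.shape[1]
    agent_tensor = np.zeros((C, H, W), dtype=np.float32)
    (xmin, xmax), (ymin, ymax) = extent
    for f, (x, y) in zip(agent_feats, agent_xy):
        col = int((x - xmin) / (xmax - xmin) * (W - 1))
        row = int((y - ymin) / (ymax - ymin) * (H - 1))
        agent_tensor[:, row, col] += f        # agents keep their spatial placement
    return np.concatenate([agent_tensor, scene_feat], axis=0)
```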
[social, prediction, trajectory, future, matf, deterministic, multiple, dataset, lstm, driving, work, ade, ngsim, interaction, human, static, yit, report, behavior, maneuver, drone, predicting, modeling, xit] [scene, computer, vision, approach, international, pattern, ground, directly, reconstruction] [generative, ieee, conference, quantitative, presented, qualitative, conditional, based] [stochastic, structure, architecture, tensor, table, best, pooling, convolutional, number, computational, performance, max] [model, agent, adversarial, gan, attention, encoding, ablative, reasoning, generated, introduced] [spatial, multi, context, pedestrian, final, predicted, crowd, baseline, fused, fully, feature] [datasets, learning, stanford, training, distribution, set, learn, trained, loss]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Tianyang and Xu, Yifei and Monfort, Mathew and Choi, Wongun and Baker, Chris and Zhao, Yibiao and Wang, Yizhou and Nian Wu, Ying},
  title = {Multi-Agent Tensor Fusion for Contextual Trajectory Prediction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Coordinate-Based Texture Inpainting for Pose-Guided Human Image Generation
Artur Grigorev, Artem Sevastopolsky, Alexander Vakhitov, Victor Lempitsky


We present a new deep learning approach to pose-guided resynthesis of human photographs. At the heart of the new approach is the estimation of the complete body surface texture based on a single photograph. Since the input photograph always observes only a part of the surface, we suggest a new inpainting method that completes the texture of the human body. Rather than working directly with colors of texture elements, the inpainting network estimates an appropriate source location in the input image for each element of the body surface. This correspondence field between the input image and the texture is then further warped into the target image coordinate frame based on the desired pose, effectively establishing the correspondence between the source and the target view even when the pose change is drastic. The final convolutional network then uses the established correspondence and all other available information to synthesize the output image. A fully-convolutional architecture with deformable skip connections guided by the estimated correspondence field is used. We show state-of-the-art result for pose-guided image synthesis. Additionally, we demonstrate the performance of our system for garment transfer and pose-guided face resynthesis.
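Conceptually, the coordinate-based inpainting step can be sketched as predicting a source coordinate per texture element and gathering colors at those coordinates (hypothetical code; the actual network would use a differentiable bilinear sampler rather than the nearest-neighbour lookup below).

```python
import numpy as np

def warp_by_coordinates(source_image, coord_field):
    """Hypothetical sketch: the inpainting network predicts, for every texture
    element, a source location (x, y) in the input photograph; colors are then
    gathered at those locations (nearest sampling here for brevity).
    source_image: (H, W, 3) input photograph
    coord_field:  (Ht, Wt, 2) predicted pixel coordinates into source_image"""
    H, W, _ = source_image.shape
    xs = np.clip(np.round(coord_field[..., 0]).astype(int), 0, W - 1)
    ys = np.clip(np.round(coord_field[..., 1]).astype(int), 0, H - 1)
    return source_image[ys, xs]               # (Ht, Wt, 3) completed texture
```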
[human, frame, warping, warped, recognition, video, work, second, dataset, perform] [pose, view, body, approach, computer, vision, pipeline, correspondence, coordinate, densepose, pattern, well, estimation, single, estimated, estimate, additionally, ground, truth, rgb, june, surface] [texture, image, inpainting, input, face, conference, method, color, based, ieee, garment, resynthesis, mapping, result, figure, user, incomplete, appearance, inpainted] [network, convolutional, skip, deep, output, architecture, full, neural, compare, number, order] [complete, encoder, generation] [person, map, deformable, final, aligned, three, predicted, ablation, location] [source, target, transfer, test, loss, set, training, trained, learning, effectively, task, unknown]
@InProceedings{Grigorev_2019_CVPR,
  author = {Grigorev, Artur and Sevastopolsky, Artem and Vakhitov, Alexander and Lempitsky, Victor},
  title = {Coordinate-Based Texture Inpainting for Pose-Guided Human Image Generation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
On Stabilizing Generative Adversarial Training With Noise
Simon Jenni, Paolo Favaro


We present a novel method and analysis to train generative adversarial networks (GAN) in a stable manner. As shown in recent analysis, training is often undermined by the probability distribution of the data being zero on neighborhoods of the data space. We notice that the distributions of real and generated data should match even when they undergo the same filtering. Therefore, to address the limited support problem we propose to train GANs by using different filtered versions of the real and generated data distributions. In this way, filtering does not prevent the exact matching of the data distribution, while helping training by extending the support of both distributions. As filtering we consider adding samples from an arbitrary distribution to the data, which corresponds to a convolution of the data distribution with the arbitrary one. We also propose to learn the generation of these samples so as to challenge the discriminator in the adversarial training. We show that our approach results in a stable and well-behaved training of even the original minimax GAN formulation. Moreover, our technique can be incorporated in most modern GAN formulations and leads to a consistent improvement on several common datasets.
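A minimal sketch of the filtering idea under the assumption of additive Gaussian noise as the filter (names are placeholders, not the authors' code):

```python
import numpy as np

def filtered_batches(real, fake, noise_std=0.2, rng=None):
    """Hypothetical sketch: add samples from the same noise distribution to both
    the real and the generated batch before the discriminator sees them, i.e.
    convolve both data distributions with the noise distribution, extending
    their support without changing the matching target."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return (real + noise_std * rng.standard_normal(real.shape),
            fake + noise_std * rng.standard_normal(fake.shape))
```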
[work] [problem, formulation, international, stable, optimization, matching, analysis, approach, technique] [noise, generator, real, method, generative, proposed, figure, image, separate, quality, conference, celeba, filtering, arbitrary] [original, normalization, table, standard, network, gradient, density, batch, number, gaussian, filtered, better, penalty, architecture, reduced, performance, relu, conv, martin] [gan, discriminator, adversarial, probability, fake, generated, model, arxiv, preprint, fid, adding, robustness, gans, mode, common, unstable, introduced, wasserstein, dcgan, random, true, consider] [improvement, propose] [training, data, support, distribution, trained, sample, set, noisy, learning, adam, train, min, function, lrelu, minimax, dfgan]
@InProceedings{Jenni_2019_CVPR,
  author = {Jenni, Simon and Favaro, Paolo},
  title = {On Stabilizing Generative Adversarial Training With Noise},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Self-Supervised GANs via Auxiliary Rotation Loss
Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, Neil Houlsby


Conditional GANs are at the forefront of natural image synthesis. The main drawback of such models is the necessity for labeled data. In this work we exploit two popular unsupervised learning techniques, adversarial training and self-supervision, and take a step towards bridging the gap between conditional and unconditional GANs. In particular, we allow the networks to collaborate on the task of representation learning, while being adversarial with respect to the classic GAN game. The role of self-supervision is to encourage the discriminator to learn meaningful feature representations which are not forgotten during training. We test empirically both the quality of the learned image representations, and the quality of the synthesized images. Under the same conditions, the self-supervised GAN attains a similar performance to state-of-the-art conditional counterparts. Finally, we show that this approach to fully unsupervised learning can be scaled to attain an FID of 23.4 on unconditional ImageNet generation.
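A hedged sketch of the auxiliary rotation task (assuming a discriminator that returns both a realness logit and rotation logits; not the authors' code):

```python
import torch
import torch.nn.functional as F

def rotation_ssl_loss(discriminator, images):
    """Hypothetical sketch of the auxiliary rotation task: rotate each image by
    0/90/180/270 degrees and ask the discriminator's rotation head to predict
    which rotation was applied. `discriminator(x)` is assumed (here) to return
    (realness_logit, rotation_logits)."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k,
                                 dtype=torch.long, device=images.device))
    _, rot_logits = discriminator(torch.cat(rotated))
    return F.cross_entropy(rot_logits, torch.cat(labels))
```

This loss would be added to the usual adversarial objectives for both networks, weighted so that representation learning helps rather than dominates the GAN game.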
[work, previous, predict, performs] [rotation, international, computer, vision, approach, respect] [image, conditional, unconditional, generator, conference, generative, quality, figure, proposed, real, high, method, celeba] [performance, imagenet, best, accuracy, cifar, neural, table, processing, batch, gradient, original, brain, add] [discriminator, gan, adversarial, model, gans, fid, true, random, fake, miyato, arxiv, preprint, generated, considered] [three, context, google, feature, rotated] [training, learning, representation, unsupervised, task, loss, classification, data, classifier, forgetting, trained, learn, test, distribution, main, labeled, catastrophic, setting, train, class, continual, observe, set, mario]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Ting and Zhai, Xiaohua and Ritter, Marvin and Lucic, Mario and Houlsby, Neil},
  title = {Self-Supervised GANs via Auxiliary Rotation Loss},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Texture Mixer: A Network for Controllable Synthesis and Interpolation of Texture
Ning Yu, Connelly Barnes, Eli Shechtman, Sohrab Amirghodsi, Michal Lukac


This paper addresses the problem of interpolating visual textures. We formulate this problem by requiring (1) by-example controllability and (2) realistic and smooth interpolation among an arbitrary number of texture samples. To solve it we propose a neural network trained simultaneously on a reconstruction task and a generation task, which can project texture examples onto a latent space where they can be linearly interpolated and projected back onto the image domain, thus ensuring both intuitive control and realistic results. We show our method outperforms a number of baselines according to a comprehensive suite of metrics as well as a user study. We further show several applications based on our technique, which include texture brush, texture dissolve, and animal hybridization.
[spd, complex] [computer, reconstruction, local, supplementary, vision, single, pattern, volume] [texture, interpolation, image, latent, user, synthesis, method, figure, shuffling, spatially, input, conference, interpolated, study, gram, side, acm, based, blending, cgd, ccd, cswd, tiling, style, ladv, appearance, background, generative, psgan, generator, real, brush, ieee, interpolating, realism, interpolate, inpainting] [network, neural, size, tensor, operation, processing, factor] [random, animal, adversarial, discriminator, arxiv, preprint, call, model, evaluation] [center, spatial, crop, cis, three] [training, source, loss, task, measure, trained, testing, space, datasets, domain, set, randomly, distance, classifier]
@InProceedings{Yu_2019_CVPR,
  author = {Yu, Ning and Barnes, Connelly and Shechtman, Eli and Amirghodsi, Sohrab and Lukac, Michal},
  title = {Texture Mixer: A Network for Controllable Synthesis and Interpolation of Texture},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Object-Driven Text-To-Image Synthesis via Adversarial Training
Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, Jianfeng Gao


In this paper, we propose Object-driven Attentive Generative Adversarial Networks (Obj-GANs) that allow attention-driven, multi-stage refinement for synthesizing complex images from text descriptions. With a novel object-driven attentive generative network, the Obj-GAN can synthesize salient objects by paying attention to their most relevant words in the text descriptions and their pre-generated class label. In addition, a novel object-wise discriminator based on the Fast R-CNN model is proposed to provide rich object-wise discrimination signals on whether the synthesized object matches the text description and the pre-generated class label. The proposed Obj-GAN significantly outperforms the previous state of the art in various metrics on the large-scale MS-COCO benchmark, increasing the inception score by 27% and decreasing the FID score by 11%. A thorough comparison between the classic grid attention and the new object-driven attention is provided through analyzing their mechanisms and visualizing their attention layers, showing insights into how the proposed model generates complex scenes in high quality.
[complex, previous, outperforms] [shape] [image, generative, generator, proposed, synthesis, based, conditional, figure, realistic, patch, comparison, traditional, quantitative, synthesizing, quality, synthesize, synthesized] [pat, table, process, higher, compare, fast] [attention, generation, text, generated, adversarial, vector, cobj, model, sentence, word, discriminator, obj, generating, inception, gan, mechanism, attngan, conditioned, lyt, generates, objectdriven, generate, evaluation, evaluate, visual, cpat, relevant, description, clab, damsm] [context, bounding, attentive, grid, layout, semantic, box, object, region, feature, score, coco, global, illustrated, propose, map] [loss, class, label, novel, retrieval, learning]
@InProceedings{Li_2019_CVPR,
  author = {Li, Wenbo and Zhang, Pengchuan and Zhang, Lei and Huang, Qiuyuan and He, Xiaodong and Lyu, Siwei and Gao, Jianfeng},
  title = {Object-Driven Text-To-Image Synthesis via Adversarial Training},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Zoom-In-To-Check: Boosting Video Interpolation via Instance-Level Discrimination
Liangzhe Yuan, Yibo Chen, Hantian Liu, Tao Kong, Jianbo Shi


We propose a light-weight video frame interpolation algorithm. Our key innovation is an instance-level supervision that allows information to be learned from the high-resolution version of similar objects. Our experiments show that the proposed method can generate state-of-the-art results across different datasets, with a fraction of the computation resources (time and memory) of competing methods. Given two image frames, a cascade network creates an intermediate frame with 1) a flow-warping module that computes coarse bi-directional optical flow and creates an interpolated image via flow-based warping, followed by 2) an image synthesis module to make fine-scale corrections. In the learning stage, object detection proposals are generated on the interpolated image. Lower-resolution objects are zoomed into, and an adversarial loss trained on high-resolution objects guides the system towards instance-level refinement, correcting details of object shape and boundaries.
[video, flow, optical, frame, motion, focus, complex, performs, work, prediction, structural, oursbaseline] [estimation, computer, vision, algorithm, international, pattern, estimate, scene, occlusion, corresponding] [image, interpolation, proposed, synthesis, method, conference, oursroigan, figure, ssim, ieee, interpolated, resolution, sepconv, perceptual, superslomo, synthesized, blending, based, quality, high, preserve, dvf] [network, size, better, kernel, deep, achieves, best, table] [adversarial, model, discriminator, evaluation, generate, generated, arxiv, preprint, system] [object, module, mask, region, improve, level, semantic, coarse, instance, crop] [training, loss, learning, large, trained, train, discrimination, learn]
@InProceedings{Yuan_2019_CVPR,
  author = {Yuan, Liangzhe and Chen, Yibo and Liu, Hantian and Kong, Tao and Shi, Jianbo},
  title = {Zoom-In-To-Check: Boosting Video Interpolation via Instance-Level Discrimination},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Disentangling Latent Space for VAE by Label Relevant/Irrelevant Dimensions
Zhilin Zheng, Li Sun


VAE requires the standard Gaussian distribution as a prior in the latent space. Since all codes tend to follow the same prior, it often suffers from the so-called "posterior collapse". To avoid this, this paper introduces a class-specific distribution for the latent code. Unlike CVAE, however, we present a method for disentangling the latent space into the label-relevant and label-irrelevant dimensions, zs and zu, for a single input. We apply two separate encoders to map the input into zs and zu respectively, and then give the concatenated code to the decoder to reconstruct the input. The label-irrelevant code zu represents the common characteristics of all inputs; hence it is constrained by the standard Gaussian, and its encoder is trained by amortized variational inference, as in VAE. zs, in contrast, is assumed to follow a Gaussian mixture distribution in which each component corresponds to a particular class. The parameters of the Gaussian components in the zs encoder are optimized with label supervision in a global, stochastic way. In theory, we show that our method is actually equivalent to adding a KL divergence term on the joint distribution of zs and the class label c, and it can directly increase the mutual information between zs and the label c. Our model can also be extended to a GAN by adding a discriminator in the pixel domain so that it produces high-quality and diverse images.
[term, follow, capture, concatenated] [computer, vision, single, reconstruction, algorithm, pattern, defined, corresponding, note, varying, international] [image, latent, conference, generative, figure, method, input, ladv, prior, face, ieee, lrec, row, proposed, corrupted, reconstruct, quality, based, disentangling, high, conditional, disentangle, disentangled, llkd, real, fixing] [gaussian, neural, processing, original, covariance, inference, fixed, standard] [adversarial, generated, decoder, irrelevant, encoders, relevant, model, encoder, discriminator, variational, generation, gan, diverse, generate, arxiv, cvae, adv, preprint, cgan, lkl] [] [label, distribution, vae, class, data, training, code, loss, log, space, trained, mixture, lgm, learning, learned, encoderu, classifier, set, sample, facescrub, divergence, representation]
@InProceedings{Zheng_2019_CVPR,
  author = {Zheng, Zhilin and Sun, Li},
  title = {Disentangling Latent Space for VAE by Label Relevant/Irrelevant Dimensions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Spectral Reconstruction From Dispersive Blur: A Novel Light Efficient Spectral Imager
Yuanyuan Zhao, Xuemei Hu, Hui Guo, Zhan Ma, Tao Yue, Xun Cao


Developing high light efficiency imaging techniques to retrieve high-dimensional optical signals is a long-term goal in computational photography. Multispectral imaging, which captures images at different wavelengths and boosts the ability to reveal scene properties, has developed rapidly over the last few decades. From scanning methods to snapshot imaging, the limit of light-collection efficiency has kept being pushed, enabling wider applications, especially in light-starved scenes. In this work, we propose a novel multispectral imaging technique that captures multispectral images with high light efficiency. Through investigating the dispersive blur caused by spectral dispersers and introducing the difference of blur (DoB) constraints, we propose a basic theory for capturing multispectral information from a single dispersive-blurred image and an additional spectrum of an arbitrary point in the scene. Based on the theory, we design a prototype system and develop an optimization algorithm to realize snapshot multispectral imaging. The effectiveness of the proposed method is verified on both synthetic data and real captured images.
[graph, capture, throughput, optical, adjacent, capturing] [light, single, reconstruction, rgb, corresponding, matrix, solution, point, projection, additional, aperture, exactly, surface, lemma, algorithm, sensor, scene, theory, lens, constraint, derivative, technique, camera, prove, problem, directly] [multispectral, spectral, imaging, image, proposed, dispersive, method, dob, spectrum, snapshot, based, hyperspectral, sharp, captured, gray, blurred, difference, high, row, noise, side, qcs, real, coded, blur, figure, spectrometer, reconstructed, removed, hybrid, synthetic, reconstruct, pixel, shading, imager, prism] [filter, full, denotes, introducing, number, sparse, achieve] [model, introduce, system, path, tree, spanning] [edge, spatial, propose, mask] [rank, set, prototype, space, novel]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Yuanyuan and Hu, Xuemei and Guo, Hui and Ma, Zhan and Yue, Tao and Cao, Xun},
  title = {Spectral Reconstruction From Dispersive Blur: A Novel Light Efficient Spectral Imager},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Quasi-Unsupervised Color Constancy
Simone Bianco, Claudio Cusano


We present here a method for computational color constancy in which a deep convolutional neural network is trained to detect achromatic pixels in color images after they have been converted to grayscale. The method does not require any information about the illuminant in the scene and relies on the weak assumption, fulfilled by almost all images available on the web, that training images have been approximately balanced. Because of this requirement we define our method as quasi-unsupervised. After training, unbalanced images can be processed thanks to the preliminary conversion to grayscale of the input to the neural network. The results of extensive experimentation demonstrate that the proposed method is able to outperform the other unsupervised methods in the state of the art while, at the same time, being flexible enough to be fine-tuned with supervision to reach performance comparable to that of the best supervised methods.
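Schematically (hypothetical code, not the authors' pipeline), once a network predicts a per-pixel achromatic probability from the grayscale input, the illuminant can be read off as a weighted mean of the corresponding colors:

```python
import numpy as np

def estimate_illuminant(rgb_image, achromatic_prob):
    """Hypothetical sketch: given the per-pixel probability of being achromatic
    (predicted by the network from the grayscale image), estimate the illuminant
    as the probability-weighted mean of those pixels' RGB values.
    rgb_image: (H, W, 3), achromatic_prob: (H, W) in [0, 1]"""
    w = achromatic_prob.reshape(-1, 1)
    rgb = rgb_image.reshape(-1, 3).astype(np.float64)
    ill = (w * rgb).sum(axis=0) / (w.sum() + 1e-8)
    return ill / (np.linalg.norm(ill) + 1e-8)    # unit-norm illuminant estimate
```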
[dataset, state, recognition, work, consists, version] [computer, vision, pattern, estimate, ground, truth, estimation, error, scene, illumination, international, form, assumption, respect, algorithm, estimated, light, parametric, journal, analysis] [color, illuminant, method, constancy, image, input, ieee, proposed, grayscale, conference, gray, achromatic, figure, unbalanced, processed, based, society, raw, imaging] [network, neural, deep, computational, convolutional, performance, fine, gradient, tuning, table, outperform, best, output, comparable] [median, machine, model, visual, kind] [three, annotated, art, preliminary, object, average] [training, learning, datasets, supervised, trained, large, unsupervised, angular, balanced, set, test, data, loss, main, train]
@InProceedings{Bianco_2019_CVPR,
  author = {Bianco, Simone and Cusano, Claudio},
  title = {Quasi-Unsupervised Color Constancy},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Deep Defocus Map Estimation Using Domain Adaptation
Junyong Lee, Sungkil Lee, Sunghyun Cho, Seungyong Lee


In this paper, we propose the first end-to-end convolutional neural network (CNN) architecture, Defocus Map Estimation Network (DMENet), for spatially varying defocus map estimation. To train the network, we produce a novel depth-of-field (DOF) dataset, SYNDOF, where each image is synthetically blurred with a ground-truth depth map. Due to the synthetic nature of SYNDOF, the feature characteristics of images in SYNDOF can differ from those of real defocused photos. To address this gap, we use domain adaptation that transfers the features of real defocused photos into those of synthetically blurred ones. Our DMENet consists of four subnetworks: blur estimation, domain adaptation, content preservation, and sharpness calibration networks. The subnetworks are connected to each other and jointly trained with their corresponding supervisions in an end-to-end manner. Our method is evaluated on publicly available blur detection and blur estimation datasets and the results show the state-of-the-art performance.
[dataset, previous, consists, framework] [estimation, depth, estimated, estimate, calibration, homogeneous, defined, ground, corresponding, scene, truth, approach, accurate, field, error] [blur, defocus, defocused, real, image, synthetic, dmenet, sharpness, pixel, syndof, content, blurred, cuhk, figure, preservation, bdcs, method, sharp, coc, input, difference, ladv, aux, laux, deblurring, cocs, result, karaali, park] [network, size, layer, binary, convolutional, accuracy, gaussian, receptive, number] [adversarial, generate, model, discriminator, evaluation, generated] [map, object, feature, detection, predicted, propose, cnn] [domain, loss, adaptation, training, train, labeled, auxiliary, trained, learning, set, datasets, mixture, shi, classify, synthia]
@InProceedings{Lee_2019_CVPR,
  author = {Lee, Junyong and Lee, Sungkil and Cho, Sunghyun and Lee, Seungyong},
  title = {Deep Defocus Map Estimation Using Domain Adaptation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Using Unknown Occluders to Recover Hidden Scenes
Adam B. Yedidia, Manel Baradad, Christos Thrampoulidis, William T. Freeman, Gregory W. Wornell


We consider the challenging problem of inferring a hidden moving scene from faint shadows cast on a diffuse surface. Recent work in passive non-line-of-sight (NLoS) imaging has shown that the presence of occluding objects in between the scene and the diffuse surface significantly improves the conditioning of the problem. However, that work assumes that the shape of the occluder is known a priori. In this paper, we relax this often impractical assumption, extending the range of applications for passive occluder-based NLoS imaging systems. We formulate the task of jointly recovering the unknown scene and unknown occluder as a blind deconvolution problem, for which we propose a simple but effective two-step algorithm. At the first step, the algorithm exploits motion in the scene in order to obtain an estimate of the occluder. In particular, it exploits the fact that motion in realistic scenes is typically sparse. The second step is more standard: using regularization, we deconvolve by the occluder estimate to solve for the hidden scene. We demonstrate the effectiveness of our method with simulations and experiments in a variety of settings.
[work, video, hidden, moving, frame, motion, signal, second] [scene, observation, occluder, algorithm, light, plane, estimate, problem, computer, single, reconstruction, assume, passive, surface, international, shape, nlos, active, relative, vision, pattern, occluders, diffuse, visible, reconstructing, denote, corresponding, approach, recovering] [blind, difference, figure, image, imaging, deconvolution, ieee, conference, method, deblurring, recover, variety, presented, color, result, acm, realistic] [size, convolution, computational, full, sparse, structure, shadow, fixed] [model, observed, simple, fact, step, example, describe, consider, making, contribution, true] [three, area, perfectly] [unknown, experimental, source, scenario, set]
@InProceedings{Yedidia_2019_CVPR,
  author = {Yedidia, Adam B. and Baradad, Manel and Thrampoulidis, Christos and Freeman, William T. and Wornell, Gregory W.},
  title = {Using Unknown Occluders to Recover Hidden Scenes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation
Anurag Ranjan, Varun Jampani, Lukas Balles, Kihwan Kim, Deqing Sun, Jonas Wulff, Michael J. Black


We address the unsupervised learning of several interconnected problems in low-level vision: single view depth prediction, camera motion estimation, optical flow, and segmentation of a video into the static scene and moving regions. Our key insight is that these four fundamental vision problems are coupled through geometric constraints. Consequently, learning to solve them together simplifies the problem because the solutions can reinforce each other. We go beyond previous work by exploiting geometry more explicitly and segmenting the scene into static and moving regions. To that end, we introduce Competitive Collaboration, a framework that facilitates the coordinated training of multiple specialized neural networks to solve complex problems. Competitive Collaboration works much like expectation-maximization, but with neural networks that act as both competitors to explain pixels that correspond to static or moving regions, and as collaborators through a moderator that assigns pixels to be either static or independently moving. Our novel method integrates all these problems in a common framework and simultaneously reasons about the segmentation of the scene into moving objects and the static background, the camera motion, depth of the static scene structure, and the optical flow of moving objects. Our model is trained without any supervision and achieves state-of-the-art performance among joint unsupervised methods on all sub-problems.
[flow, optical, motion, static, moving, joint, collaboration, moderator, jointly, framework, work, independently, dataset, recognition, multiple, second, collaborate, term, key, previous] [depth, camera, scene, computer, vision, estimation, pattern, consensus, single, error, kitti, ground, truth, photometric, reconstructor, estimate, geometric, monocular, dispnet, volume, view, solve] [conference, ieee, image, method, ssim, figure, reference, real, competing, coupled] [network, performance, competitive, table, neural, convolutional, deep, better, basic, net, applied, larger, residual] [reason, introduce, competition, appendix, arxiv, preprint] [segmentation, mask, segment, car] [training, learning, unsupervised, data, loss, train, learn, trained, target, update]
@InProceedings{Ranjan_2019_CVPR,
  author = {Ranjan, Anurag and Jampani, Varun and Balles, Lukas and Kim, Kihwan and Sun, Deqing and Wulff, Jonas and Black, Michael J.},
  title = {Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Learning Parallax Attention for Stereo Image Super-Resolution
Longguang Wang, Yingqian Wang, Zhengfa Liang, Zaiping Lin, Jungang Yang, Wei An, Yulan Guo


Stereo image pairs can be used to improve the performance of super-resolution (SR) since additional information is provided from a second viewpoint. However, it is challenging to incorporate this information for SR since disparities between stereo images vary significantly. In this paper, we propose a parallax-attention stereo superresolution network (PASSRnet) to integrate the information from a stereo image pair for SR. Specifically, we introduce a parallax-attention mechanism with a global receptive field along the epipolar line to handle different stereo images with large disparity variations. We also propose a new and the largest dataset for stereo image SR (namely, Flickr1024). Extensive experiments demonstrate that the parallax-attention mechanism can capture correspondence between stereo images to improve SR performance with a small computational and memory cost. Comparative results show that our PASSRnet achieves the state-of-the-art performance on the Middlebury, KITTI 2012 and KITTI 2015 datasets.
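A minimal sketch of a parallax-attention step for rectified stereo (placeholder code, not PASSRnet itself): each position in the left feature map attends over the whole corresponding row of the right feature map, giving a global receptive field along the epipolar line.

```python
import torch

def parallax_attention(feat_left, feat_right):
    """Hypothetical sketch of a parallax-attention step for rectified stereo:
    every left-image position attends over all positions in the same row of the
    right image. feat_left, feat_right: (B, C, H, W)."""
    B, C, H, W = feat_left.shape
    q = feat_left.permute(0, 2, 3, 1)                   # (B, H, W, C)
    k = feat_right.permute(0, 2, 1, 3)                  # (B, H, C, W)
    attn = torch.softmax(torch.matmul(q, k), dim=-1)    # (B, H, W, W) row-wise weights
    v = feat_right.permute(0, 2, 3, 1)                  # (B, H, W, C)
    warped = torch.matmul(attn, v).permute(0, 3, 1, 2)  # right features aligned to left view
    return warped, attn
```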
[dataset, dependency, capture, video, fed, transition, flow] [stereo, correspondence, left, kitti, single, disparity, volume, valid, accurate, note, groundtruth, epipolar, reliable, estimation, matching] [image, psnr, input, cycle, comparison, consistency, handle, proposed, superresolution, demonstrate, resolution, figure, high, method] [passrnet, network, residual, pam, cost, performance, stereosr, aspp, deep, conv, table, block, achieved, convolutional, receptive, compared, computational, achieves, mrightleft, mleftright, atrous, neural, effectiveness, decreased, srcnn, lapsrn] [generate, attention, mechanism, model, observed, introduce, introduced, demonstrated, generated] [feature, module, improve, propose, spatial, global] [loss, learning, large, training, comparative, test, shared]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Longguang and Wang, Yingqian and Liang, Zhengfa and Lin, Zaiping and Yang, Jungang and An, Wei and Guo, Yulan},
  title = {Learning Parallax Attention for Stereo Image Super-Resolution},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Knowing When to Stop: Evaluation and Verification of Conformity to Output-Size Specifications
Chenglong Wang, Rudy Bunel, Krishnamurthy Dvijotham, Po-Sen Huang, Edward Grefenstette, Pushmeet Kohli


Neural architectures able to generate variable-length outputs are extremely effective for applications like Machine Translation and Image Captioning. In this paper, we study the vulnerability of these models to attacks aimed at changing the output-size that can have undesirable consequences including increased computation and inducing faults in downstream modules that expect outputs of a certain length. We show the existence and construction of such attacks with two key contributions. First, to overcome the difficulties of discrete search space and the non-differentiable adversarial objective function, we develop an easy-to-compute differentiable proxy objective that can be used with gradient-based algorithms to find output-lengthening inputs. Second, we develop a verification approach to formally prove that the network cannot produce outputs greater than a certain length. Experimental results on Machine Translation and Image Captioning models show that our adversarial output-lengthening approach can produce outputs that are 50 times longer than the input, while our verification approach can, given a model and input domain, prove that the output length is below a certain size.
[sequence, longer, dataset, long, recurrent, rnn, state] [approach, algorithm, variable, computer, discrete, problem, corresponding, vision, pattern, continuous, radius, differentiable, linear, prove, robust, directly, initial] [image, input, translation, produce, conference, study, figure, ieee, method] [output, neural, verification, search, modulation, network, size, number, computational, gradient, small, max, denotes] [model, adversarial, length, attack, pgd, perturbation, machine, random, find, captioning, decoding, specification, token, eos, generate, greedy, arxiv, develop, preprint, finding, language, nmt, formally, decoder, robustness, consider, probability, example] [] [space, distribution, embedding, training, objective, set, proxy, maximum, function, mnist, target, test]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Chenglong and Bunel, Rudy and Dvijotham, Krishnamurthy and Huang, Po-Sen and Grefenstette, Edward and Kohli, Pushmeet},
  title = {Knowing When to Stop: Evaluation and Verification of Conformity to Output-Size Specifications},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Spatial Attentive Single-Image Deraining With a High Quality Real Rain Dataset
Tianyu Wang, Xin Yang, Ke Xu, Shaozhe Chen, Qiang Zhang, Rynson W.H. Lau


Removing rain streaks from a single image has been drawing considerable attention as rain streaks can severely degrade the image quality and affect the performance of existing outdoor vision tasks. While recent CNN-based derainers have reported promising performances, deraining remains an open problem for two reasons. First, existing synthesized rain datasets have only limited realism, in terms of modeling real rain characteristics such as rain shape, direction and intensity. Second, there are no public benchmarks for quantitative comparisons on real rain images, which makes the current evaluation less objective. The core challenge is that real world rain/clean image pairs cannot be captured at the same time. In this paper, we address the single image rain removal problem in two ways. First, we propose a semi-automatic method that incorporates temporal priors and human supervision to generate a high-quality clean image from each input sequence of real rain images. Using this method, we construct a large-scale dataset of 29.5K rain/rain-free image pairs that covers a wide range of natural rain scenes. Second, to better cover the stochastic distribution of real rain streaks, we propose a novel SPatial Attentive Network (SPANet) to remove rain streaks in a local-to-global manner. Extensive experiments demonstrate that our network performs favorably against the state-of-the-art deraining methods.
[sequence, dataset, video, temporal, recurrent, human, modeling, joint, time] [single, range, lack, problem, scene, well] [rain, image, real, figure, deraining, spanet, pixel, clean, proposed, background, input, derainers, removal, method, percentile, remove, based, irnn, sam, intensity, high, streak, jorder, removing, rescan, comparison, psnr, cover, synthetic, derained, tend, synthesized] [network, performance, table, residual, deep, output, architecture, highly] [attention, generate, covered, model, mode, visual, identify, ddn, evaluation] [spatial, attentive, propose, map, feature, contextual, three, detection, global, improve] [existing, training, trained, loss, address, test, datasets, novel]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Tianyu and Yang, Xin and Xu, Ke and Chen, Shaozhe and Zhang, Qiang and Lau, Rynson W.H.},
  title = {Spatial Attentive Single-Image Deraining With a High Quality Real Rain Dataset},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Focus Is All You Need: Loss Functions for Event-Based Vision
Guillermo Gallego, Mathias Gehrig, Davide Scaramuzza


Event cameras are novel vision sensors that output pixel-level brightness changes ("events") instead of traditional video frames. These asynchronous sensors offer several advantages over traditional cameras, such as high temporal resolution, very high dynamic range, and no motion blur. To unlock the potential of such sensors, motion compensation methods have been recently proposed. We present a collection and taxonomy of twenty-two objective functions to analyze event alignment in motion compensation approaches. We call them focus loss functions since they have strong connections with functions used in traditional shape-from-focus applications. The proposed loss functions allow bringing mature computer vision tools to the realm of event cameras. We compare the accuracy and runtime performance of all loss functions on a publicly available dataset, and conclude that the variance, the gradient and the Laplacian magnitudes are among the best loss functions. The applicability of the loss functions is shown on multiple tasks: rotational motion, depth and optical flow estimation. The proposed focus loss functions allow unlocking the outstanding properties of event cameras.
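As a concrete example of one of the better-performing objectives (the variance of the image of warped events), a hedged sketch with placeholder names:

```python
import numpy as np

def variance_focus_loss(events, motion, t_ref, shape):
    """Hypothetical sketch of one focus loss: warp each event (x, y, t, p) along
    a candidate flow (vx, vy) to the reference time, accumulate an image of
    warped events, and score the candidate by the image variance.
    events: (N, 4) array of (x, y, t, polarity), motion: (vx, vy), shape: (H, W)"""
    H, W = shape
    x, y, t, p = events.T
    xw = np.clip(np.round(x + (t_ref - t) * motion[0]).astype(int), 0, W - 1)
    yw = np.clip(np.round(y + (t_ref - t) * motion[1]).astype(int), 0, H - 1)
    iwe = np.zeros((H, W))
    np.add.at(iwe, (yw, xw), p)    # image of warped events
    return iwe.var()               # to be maximized over motion candidates
```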
[event, focus, iwe, motion, optical, flow, warped, davide, guillermo, tracking, polarity, velocity, dispersion, henri, time, brightness, tobi, ryad, dynamic, autofocus, mav, consists, contiguity] [local, depth, laplacian, absolute, vision, pattern, point, square, range, camera, contrast, estimation, derivative, iter, fourier, stereo, squared, depend, estimate, accurate, focal] [image, ieee, statistical, high, pixel, based, compensation, proposed, figure, difference, traditional, produce] [variance, magnitude, max, gradient, table, accuracy, best, ratio, processing, energy, neural, compared] [visual, maximizing, maximize] [spatial, area, edge, global] [loss, function, alignment, measure, support, entropy, data, angular, learning]
@InProceedings{Gallego_2019_CVPR,
  author = {Gallego, Guillermo and Gehrig, Mathias and Scaramuzza, Davide},
  title = {Focus Is All You Need: Loss Functions for Event-Based Vision},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Scalable Convolutional Neural Network for Image Compressed Sensing
Wuzhen Shi, Feng Jiang, Shaohui Liu, Debin Zhao


Recently, deep learning based image Compressed Sensing (CS) methods have been proposed and demonstrated superior reconstruction quality with low computational complexity. However, the existing deep learning based image CS methods need to train different models for different sampling ratios, which increases the complexity of the encoder and decoder. In this paper, we propose a scalable convolutional neural network (dubbed SCSNet) to achieve scalable sampling and scalable reconstruction with only one model. Specifically, SCSNet provides both coarse and fine granular scalability. For coarse granular scalability, SCSNet is designed as a single sampling matrix plus a hierarchical reconstruction network that contains a base layer plus multiple enhancement layers. The base layer provides the basic reconstruction quality, while the enhancement layers reference the lower reconstruction layers and gradually improve the reconstruction quality. For fine granular scalability, SCSNet achieves sampling and reconstruction at any sampling ratio by using a greedy method to select the measurement bases. Compared with the existing deep learning based image CS methods, SCSNet achieves scalable sampling and quality scalable reconstruction at any sampling ratio with only one model. Experimental results demonstrate that SCSNet achieves state-of-the-art performance while maintaining running speed comparable to existing deep learning based image CS methods.
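A generic sketch of scalable block-based CS sampling under the single-sampling-matrix idea (hypothetical code; SCSNet additionally orders the measurement bases with a greedy selection and reconstructs hierarchically):

```python
import numpy as np

def block_cs_sample(image, Phi, n_meas):
    """Hypothetical sketch of scalable block-based CS sampling: one sampling
    matrix Phi of shape (max_meas, B*B) is shared across ratios, and a given
    sampling ratio simply uses its first n_meas rows.
    image: (H, W) with H and W divisible by the block size B."""
    B = int(np.sqrt(Phi.shape[1]))
    H, W = image.shape
    blocks = (image.reshape(H // B, B, W // B, B)
                   .transpose(0, 2, 1, 3)
                   .reshape(-1, B * B))       # one row per B x B block
    return blocks @ Phi[:n_meas].T            # (num_blocks, n_meas) measurements
```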
[signal, multiple, video, represented] [reconstruction, measurement, initial, matrix, international, algorithm, computer, good] [image, based, psnr, ieee, figure, sensing, quality, conference, method, traditional, proposed, ssim, comparison, compressive, reconstructed, recovery, amount, imaging, blocking, removing, bcs] [scsnet, deep, scalable, network, ratio, granular, compared, fine, layer, compressed, running, convolution, csnet, size, ith, table, convolutional, lower, neural, computational, wireless, block, implement, conv, residual, achieves, sparse, coding, reconnet, basic, speed, group, higher] [greedy, visual, random, iterative] [average, coarse, improve, hierarchical, feature, final] [sampling, learning, existing, base, scalability, set, data, select, test, function]
@InProceedings{Shi_2019_CVPR,
  author = {Shi, Wuzhen and Jiang, Feng and Liu, Shaohui and Zhao, Debin},
  title = {Scalable Convolutional Neural Network for Image Compressed Sensing},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Event Cameras, Contrast Maximization and Reward Functions: An Analysis
Timo Stoffregen, Lindsay Kleeman


Event cameras asynchronously report timestamped changes in pixel intensity and offer advantages over conventional raster scan cameras in terms of low-latency, low-redundancy sensing and high dynamic range. In recent years, much of the research in event-based vision has focused on tasks such as optic flow estimation, moving object segmentation, feature tracking, camera rotation estimation and more, through contrast maximization. In contrast maximization, events are warped along motion trajectories whose parameters depend on the quantity being estimated, to some time t_ref. The parameters are then scored by some reward function of the accumulated events at t_ref. The versatility of this approach has led to a flurry of research in recent years, but no in-depth study of the reward chosen during optimization has yet been made. In this work we examine the choice of reward used in contrast maximization, propose a classification of different rewards and show how a reward can be constructed that is more robust to noise and aperture uncertainty. We validate our work experimentally by predicting optical flow and comparing to ground-truth data.
[event, flow, rsos, optical, sequence, rsosa, rsoe, risoa, rmoa, moving, circle, warped, motion, tref, trajectory, stream, dynamic, time, iwe, warping, accumulated, work, perform, temporal, velocity] [contrast, camera, optic, estimate, error, vision, scene, aperture, plane, maximization, point, estimation, ground, respect, problem, truth, optimization, local, linear, well, case] [image, noise, intensity, figure, pixel, ieee, high, real, based, hybrid] [magnitude, number, better, original, structure, best] [reward, generated, generate, sum, red, vector] [segment, average, object, operate, extreme] [data, set, large, function, experimental, office, angular, conventional, tested]
@InProceedings{Stoffregen_2019_CVPR,
  author = {Stoffregen, Timo and Kleeman, Lindsay},
  title = {Event Cameras, Contrast Maximization and Reward Functions: An Analysis},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Convolutional Neural Networks Can Be Deceived by Visual Illusions
Alexander Gomez-Villa, Adrian Martin, Javier Vazquez-Corral, Marcelo Bertalmio


Visual illusions teach us that what we see is not always what is represented in the physical world. Their special nature makes them a fascinating tool to test and validate any new vision model proposed. In general, current vision models are based on the concatenation of linear and non-linear operations. The similarity of this structure to the operations present in Convolutional Neural Networks (CNNs) has motivated us to study whether CNNs trained for low-level visual tasks are deceived by visual illusions. In particular, we show that CNNs trained for image denoising, image deblurring, and computational color constancy are able to replicate the human response to visual illusions, and that the extent of this replication varies with respect to variation in architecture and spatial pattern size. These results suggest that in order to obtain CNNs that better replicate human behaviour, we may need to start aiming for them to better replicate visual illusions.
[replication, dungeon, human, replicate, illusion, assimilation, replicates, deceived, work, second, opposite, replicating, hidden] [vision, case, contrast, field, pattern, left, computer, classical] [color, image, grayscale, figure, row, denoising, reproduce, frequency, based, study, input, variation, chevreul, constancy] [cnns, size, scale, science, output, receptive, neural, convolutional, architecture, kernel, order, layer, residual, pooling, processing, larger, better, replicated, deeper, original, increase] [visual, perception, red, green, observed, simple, common, consider, white, model, adding] [cnn, spatial, three, response] [target, base, trained, selected, paper, tested, test, representation]
@InProceedings{Gomez-Villa_2019_CVPR,
  author = {Gomez-Villa, Alexander and Martin, Adrian and Vazquez-Corral, Javier and Bertalmio, Marcelo},
  title = {Convolutional Neural Networks Can Be Deceived by Visual Illusions},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PDE Acceleration for Active Contours
Anthony Yezzi, Ganesh Sundaramoorthi, Minas Benyamin


Following the seminal work of Nesterov, accelerated optimization methods have been used to powerfully boost the performance of first-order, gradient-based parameter estimation in scenarios where second-order optimization strategies are either inapplicable or impractical. Accelerated gradient descent converges faster and performs a more robust local search of the parameter space by initially overshooting and then oscillating back into minimizers whose basin of attraction is large enough to contain the overshoot. Recent work has demonstrated how a broad class of accelerated schemes can be cast in a variational framework leading to continuum-limit ODEs. We extend their formulation to the PDE framework, specifically for the infinite-dimensional manifold of continuous curves, to introduce acceleration, and its added robustness, into the broad range of PDE-based active contours.
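Schematically (with placeholder coefficients, not the paper's exact functional), the extension replaces the first-order gradient flow on a contour energy E(C) by a damped second-order evolution:

```latex
% First-order gradient flow on a contour energy E(C):
\frac{\partial C}{\partial t} = -\nabla E(C)
% Accelerated (damped second-order) counterpart; \rho and \beta are placeholder
% mass and friction coefficients, not the paper's exact functional:
\rho\,\frac{\partial^{2} C}{\partial t^{2}} + \beta\,\frac{\partial C}{\partial t} = -\nabla E(C)
```

Here \rho plays the role of a mass density and \beta of a friction coefficient; in the heavily damped limit the evolution reduces back to ordinary gradient descent on the contour.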
[framework, time, term, action, work, velocity, force, flow] [active, accelerated, pde, optimization, local, evolving, mass, finite, sobolev, momentum, convex, geometric, dimensional, additional, case, computer, normal, continuous, bregman, functional, kinetic, minimizers, explicit, regularity, general, vision, constant, initial, arclength, friction, pattern, continuum, limit, infinite, shape, directly, field, integral] [contour, based, method, image, coupled, conference, ieee, noise, difference] [gradient, energy, evolution, descent, acceleration, computational, unit, parameter, process, represents, order, stochastic, processing, standard, numerical, denotes, deep, distributed, cost, regularization] [system, variational, model, potential, machine, robustness, step] [level, segmentation, curve, global] [diffusion, set, large, class, generalized, function, learning]
@InProceedings{Yezzi_2019_CVPR,
  author = {Yezzi, Anthony and Sundaramoorthi, Ganesh and Benyamin, Minas},
  title = {PDE Acceleration for Active Contours},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dichromatic Model Based Temporal Color Constancy for AC Light Sources
Jun-Sang Yoo, Jong-Ok Kim


Existing dichromatic color constancy approaches commonly require a number of spatial pixels with high specularity. In this paper, we propose a novel approach to estimating the illuminant chromaticity of an AC light source using a high-speed camera. We found that the temporal observations of an image pixel at a fixed location lie on the same dichromatic plane. Instead of spatial pixels with high specularity, multiple temporal samples of a pixel are exploited to determine AC pixels, whose intensities vary sinusoidally, for dichromatic plane estimation. A dichromatic plane is calculated for each AC pixel, and the illuminant chromaticity is determined by the intersection of the dichromatic planes. From multiple dichromatic planes, an optimal illuminant is estimated with a novel MAP framework. It is shown that the proposed method outperforms both existing dichromatic-based methods and temporal color constancy methods, irrespective of the amount of specularity.
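A least-squares sketch of the plane-intersection step (hypothetical code; the paper replaces this simple intersection with a MAP estimate over the planes):

```python
import numpy as np

def illuminant_from_dichromatic_planes(temporal_rgb_per_pixel):
    """Hypothetical sketch: fit a dichromatic plane (through the origin) to the
    temporal RGB samples of each selected AC pixel, then estimate the illuminant
    direction as the direction most orthogonal to all plane normals.
    temporal_rgb_per_pixel: list of (T, 3) arrays, one per AC pixel (T >= 3)."""
    normals = []
    for samples in temporal_rgb_per_pixel:
        _, _, vt = np.linalg.svd(samples, full_matrices=False)
        normals.append(vt[-1])                 # plane normal: least-varying direction
    _, _, vt = np.linalg.svd(np.stack(normals), full_matrices=False)
    d = np.abs(vt[-1])                         # illuminant direction (non-negative)
    return d / d.sum()                         # normalized chromaticity
```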
[temporal, multiple, video, frame, time, determine, capture] [dichromatic, plane, specular, estimation, light, error, estimated, diffuse, estimate, optimal, computer, note, sinusoidal, vision, varying, single, rgb, pattern, well, require, accurate, determined, irrespective, surface, planckian, locus, analysis, international, colour, approach, sinusoidally, assumption, scene] [color, illuminant, proposed, method, pixel, image, constancy, intensity, ieee, based, chromaticity, high, figure, conference, prior, reflection, frequency, gamut, gray] [number, order, low, accuracy, performance, identical, fast, rate, table, weight, computational] [model, commonly, candidate, easily, grey, calculated, vector, probability, machine] [object, spatial, map, propose, intersection] [angular, existing, novel, source, exploit]
@InProceedings{Yoo_2019_CVPR,
  author = {Yoo, Jun-Sang and Kim, Jong-Ok},
  title = {Dichromatic Model Based Temporal Color Constancy for AC Light Sources},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Semantic Attribute Matching Networks
Seungryong Kim, Dongbo Min, Somi Jeong, Sunok Kim, Sangryul Jeon, Kwanghoon Sohn


We present semantic attribute matching networks (SAM-Net) for jointly establishing correspondences and transferring attributes across semantically similar images, which intelligently weaves the advantages of the two tasks while overcoming their limitations. SAM-Net accomplishes this through an iterative process of establishing reliable correspondences by reducing the attribute discrepancy between the images and synthesizing attribute transferred images using the learned correspondences. To learn the networks using weak supervisions in the form of image pairs, we present a semantic attribute matching loss based on the matching similarity between an attribute transferred source feature and a warped target feature. With SAM-Net, the state-of-the-art performance is attained on several benchmarks for semantic matching and attribute transfer.
[dataset, warped, work, recurrent, current, formulated] [matching, correspondence, limited, dense, estimate, geometric, local, field, defined, affine, parametric, note, confidence, establishing, reliable, form] [attribute, image, style, transformation, figure, method, transferred, stylized, tss, dia, proposed, patch, synthesize, photorealistic, content, ieee, qualitative, texture, based, reconstruct, patchmatch, blending, gatys, rtns, fis, dctm, transferring] [deep, neural, accuracy, convolutional, regularization, performance, network, skip, fast, highly, formulate, table] [semantically, decoder, iterative, intuition, consider] [semantic, feature, extraction, benchmark, weak, object, evaluated, spatial, including] [transfer, source, target, loss, learned, existing, set, training, learn, neighbor, discrepancy]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Seungryong and Min, Dongbo and Jeong, Somi and Kim, Sunok and Jeon, Sangryul and Sohn, Kwanghoon},
  title = {Semantic Attribute Matching Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Skin-Based Identification From Multispectral Image Data Using CNNs
Takeshi Uemori, Atsushi Ito, Yusuke Moriuchi, Alexander Gatto, Jun Murayama


User identification from hand images alone is still a challenging task. In this paper, we propose a new biometric identification system based solely on a skin patch from a multispectral image. The system utilizes a novel modified 3D CNN architecture that takes advantage of multispectral data. We demonstrate the application of our system on the example of human identification from multispectral images of hands. To the best of our knowledge, this paper is the first to describe a pose-invariant, overlap-robust, real-time human identification system using hands. Additionally, we provide a framework to optimize the required spectral bands for given spatial resolution limitations.
[dataset, framework, human, subject, work, video] [hand, rgb, sensor, camera, approach, cube, equation, case, illumination, tabletop, vision, pipeline, reflectance, computer] [spectral, multispectral, skin, image, figure, input, user, noise, band, imaging, synthetic, based, resolution, proposed, color, amount, ieee, real, described, comparison, result, patch, conference, acquisition, high, hyperspectral, sony, wavelength] [performance, accuracy, number, block, residual, network, architecture, standard, actual, ratio, experiment, cnns, deep] [system, model, relevant, generating, machine, evaluation, fingerprint] [identification, spatial, cnn, feature] [data, classification, datasets, training, novel, experimental, paper, source, gap]
@InProceedings{Uemori_2019_CVPR,
  author = {Uemori, Takeshi and Ito, Atsushi and Moriuchi, Yusuke and Gatto, Alexander and Murayama, Jun},
  title = {Skin-Based Identification From Multispectral Image Data Using CNNs},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks
Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, Satoshi Matsuoka


Large-scale distributed training of deep neural networks suffers from the generalization gap caused by the increase in the effective mini-batch size. Previous approaches try to solve this problem by varying the learning rate and batch size over epochs and layers, or some ad hoc modification of the batch normalization. We propose an alternative approach using a second order optimization method that shows similar generalization capability to first order methods, but converges faster and can handle larger mini-batches. To test our method on a benchmark where highly optimized first order methods are available as references, we train ResNet-50 on ImageNet-1K. We converged to 75% Top-1 validation accuracy in 35 epochs for mini-batch sizes under 16,384, and achieved 75% even with a mini-batch size of 131,072, which took only 978 iterations.
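As a rough illustration of the second-order update behind these results, the sketch below shows Kronecker-factored preconditioning for a single fully connected layer, assuming NumPy arrays for the layer inputs and output gradients: the Fisher block is approximated as the Kronecker product of the input-activation and output-gradient covariances, and the gradient is preconditioned by their damped inverses. The damping value and function names are illustrative; the distributed computation of the factors, which is the paper's focus, is not shown.

    import numpy as np

    def kfac_preconditioned_grad(grad_W, acts, grads_out, damping=1e-3):
        # grad_W:    (out, in) gradient of the loss w.r.t. the weight matrix
        # acts:      (batch, in) layer inputs a
        # grads_out: (batch, out) gradients of the loss w.r.t. the layer outputs g
        # The Fisher block is approximated as A (x) G with A = E[a a^T], G = E[g g^T];
        # the preconditioned gradient is (G + lambda I)^-1 grad_W (A + lambda I)^-1.
        n = acts.shape[0]
        A = acts.T @ acts / n                       # (in, in) Kronecker factor
        G = grads_out.T @ grads_out / n             # (out, out) Kronecker factor
        A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
        G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
        return G_inv @ grad_W @ A_inv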
[work, time, previous] [matrix, approach, optimization, curvature, damping, compute, inverse] [diagonal, figure, method, input] [size, accuracy, imagenet, kronecker, fim, distributed, rate, deep, neural, batch, number, validation, gradient, fisher, layer, sgd, reduce, approximate, normalization, computation, compared, overhead, iteration, achieve, tesla, gpus, epoch, extremely, convolutional, achieved, table, increase, ratio, network, parameter, chainer, preconditioned, mixup, scale, process, gpu, design] [arxiv, preprint, communication, erasing, model, memory] [stage, faster] [training, large, learning, train, min, hyperparameters, generalization, data, loss, update]
@InProceedings{Osawa_2019_CVPR,
  author = {Osawa, Kazuki and Tsuji, Yohei and Ueno, Yuichiro and Naruse, Akira and Yokota, Rio and Matsuoka, Satoshi},
  title = {Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments
Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz


Affordance modeling plays an important role in visual understanding. In this paper, we aim to predict affordances of 3D indoor scenes, specifically what human poses are afforded by a given indoor environment, such as sitting on a chair or standing on the floor. In order to predict valid affordances and learn possible 3D human poses in indoor scenes, we need to understand the semantic and geometric structure of a scene as well as its potential interactions with a human. To learn such a model, a large-scale dataset of 3D indoor affordances is required. In this work, we build a fully automatic 3D pose synthesizer that fuses semantic knowledge from a large number of 2D poses extracted from TV shows as well as 3D geometric knowledge from voxel representations of indoor scenes. With the data created by the synthesizer, we introduce a 3D pose generative model to predict semantically plausible and physically feasible human poses within a given scene (provided as a single RGB, RGB-D, or depth image). We demonstrate that our human affordance prediction method consistently outperforms existing state-of-the-art methods.
[human, prediction, joint, dataset, predict, modeling, follow, understand, work, jointly, predicting] [pose, scene, depth, geometry, affordance, indoor, well, voxel, single, constraint, feasible, pelvis, rgb, voxels, heat, ground, truth, suncg, physically, geometric, camera, corresponding, sitcom] [method, image, synthesizer, generative, user, input, described, free, figure, study, proposed, collect] [table, correlation, adjust, network] [model, generated, generate, plausible, affordances, sitting, discriminator, adversarial, physical, natural, generating, sampled, represent] [location, semantic, map, baseline, score, object, module, predicted, propose, context] [train, space, training, support, learning, distribution, learn, knowledge, trained, data, class, positive]
@InProceedings{Li_2019_CVPR,
  author = {Li, Xueting and Liu, Sifei and Kim, Kihwan and Wang, Xiaolong and Yang, Ming-Hsuan and Kautz, Jan},
  title = {Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PIEs: Pose Invariant Embeddings
Chih-Hui Ho, Pedro Morgado, Amir Persekian, Nuno Vasconcelos


The role of pose invariance in image recognition and retrieval is studied. A taxonomic classification of embeddings, according to their level of invariance, is introduced and used to clarify connections between existing embeddings, identify missing approaches, and propose invariant generalizations. This leads to a new family of pose invariant embeddings (PIEs), derived from existing approaches by a combination of two models, which follow from the interpretation of CNNs as estimators of class posterior probabilities: a view-to-object model and an object-to-class model. The new pose-invariant models are shown to have interesting properties, both theoretically and through experiments, where they outperform existing multiview approaches. Most notably, they achieve good performance for both 1) classification and retrieval, and 2) single and multiview inference. These are important properties for the design of real vision systems, where universal embeddings are preferable to task specific ones, and multiple images are usually not available at inference time. Finally, a new multiview dataset of real objects, imaged in the wild against complex backgrounds, is introduced. We believe that this is a much needed complement to the synthetic datasets in wide use and will contribute to the advancement of multiview recognition and retrieval.
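As a toy illustration of the view-to-object combination implied by treating CNN outputs as class posteriors, the sketch below averages per-view posteriors (in probability or in log space) into an object-level posterior. This is only a simple instance of that kind of factorization; the paper's derivation of the pose-invariant embeddings themselves is not reproduced here.

    import numpy as np

    def object_posterior(view_posteriors, mode="mean"):
        # view_posteriors: (n_views, n_classes), each row a per-view class posterior.
        if mode == "mean":
            p = view_posteriors.mean(axis=0)                      # arithmetic mean of views
        else:
            p = np.exp(np.log(view_posteriors + 1e-12).mean(axis=0))  # geometric mean
        return p / p.sum()                                        # renormalized object posterior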
[dataset, recognition, individual, multiple, work, perform, consists] [multiview, single, view, pose, shape, vision, computer, descriptor, pattern, modelnet, good, additional, problem, well] [figure, conference, based, image, proposed, ieee, real, produce, imaged, synthetic, method, realistic] [number, neural, performance, inference, convolutional, network, layer, better, deep, cnns] [model, refer] [object, center, feature, three, level, multi, cnn] [classification, invariant, embeddings, embedding, learning, retrieval, metric, proxy, triplet, loss, class, existing, datasets, distance, training, clustering, invariance, logistic, set, classifier, softmax, address]
@InProceedings{Ho_2019_CVPR,
  author = {Ho, Chih-Hui and Morgado, Pedro and Persekian, Amir and Vasconcelos, Nuno},
  title = {PIEs: Pose Invariant Embeddings},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Representation Similarity Analysis for Efficient Task Taxonomy & Transfer Learning
Kshitij Dwivedi, Gemma Roig


Transfer learning is widely used in deep neural network models when there are few labeled examples available. The common approach is to take a pre-trained network in a similar task and finetune the model parameters. This is usually done blindly without a pre-selection from a set of pre-trained models, or by finetuning a set of models trained on different tasks and selecting the best performing one by cross-validation. We address this problem by proposing an approach to assess the relationship between visual tasks and their task-specific models. Our method uses Representation Similarity Analysis (RSA), which is commonly used to find a correlation between neuronal responses from brain data and models. With RSA we obtain a similarity score among tasks by computing correlations between models trained on different tasks. Our method is efficient as it requires only pre-trained models, and a few images with no further training. We demonstrate the effectiveness and efficiency of our method to generating task taxonomy on Taskonomy dataset. We next evaluate the relationship of RSA with the transfer learning performance on Taskonomy tasks and a new task: Pascal VOC semantic segmentation. Our results reveal that models trained on tasks with higher similarity score show higher transfer learning performance. Surprisingly, the best transfer learning result for Pascal VOC semantic segmentation is not obtained from the pre-trained model on semantic segmentation, probably due to the domain differences, and our method successfully selects the high performing models.
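The core RSA computation is small enough to sketch directly, assuming each model's features for a shared probe image set are stacked into a (n_images, n_dims) array: build each model's representational dissimilarity matrix and compare the two with Spearman correlation of their upper triangles. The dissimilarity measure (1 - Pearson correlation) is a common RSA choice, used here for illustration.

    import numpy as np
    from scipy.stats import spearmanr

    def rdm(features):
        # Representational dissimilarity matrix: 1 - Pearson correlation
        # between the feature vectors of every pair of probe images.
        return 1.0 - np.corrcoef(features)

    def rsa_similarity(feats_a, feats_b):
        # Task similarity as the Spearman correlation between the upper
        # triangles of the two models' RDMs on the same probe images.
        iu = np.triu_indices(feats_a.shape[0], k=1)
        return spearmanr(rdm(feats_a)[iu], rdm(feats_b)[iu]).correlation

    # usage: rsa_similarity(model_a_features, model_b_features) for each model pair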
[dataset, consists, rdms, performing, perform, report] [vision, approach, compute, computer, matrix, analysis, pattern, scene, note, computed] [figure, based, image, conference, method, ieee, comparison, high, application] [performance, small, computational, deep, neural, correlation, brain, compare, output, network, computing, best, selection, initialized, convolution, compressed, convolutional, dnn, imagenet, compared, dnns, size, table, lower, initialization] [model, encoder, relationship, visual, find, evaluate, refer] [semantic, pascal, score, voc, segmentation, object, taxonomy, hierarchical, fully, comparing] [similarity, task, transfer, taskonomy, learning, rsa, trained, representation, select, training, ranking, selected, dissimilarity, clustering, investigate, obtaining, subset, set, data]
@InProceedings{Dwivedi_2019_CVPR,
  author = {Dwivedi, Kshitij and Roig, Gemma},
  title = {Representation Similarity Analysis for Efficient Task Taxonomy & Transfer Learning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Object Counting and Instance Segmentation With Image-Level Supervision
Hisham Cholakkal, Guolei Sun, Fahad Shahbaz Khan, Ling Shao


Common object counting in a natural scene is a challenging problem in computer vision with numerous real-world applications. Existing image-level supervised common object counting approaches only predict the global object count and rely on additional instance-level supervision to also determine object locations. We propose an image-level supervised approach that provides both the global object count and the spatial distribution of object instances by constructing an object category density map. Motivated by psychological studies, we further reduce image-level supervision using a limited object count information (up to four). To the best of our knowledge, we are the first to propose image-level supervised density map estimation for common object counting and demonstrate its effectiveness in image-level supervised instance segmentation. Comprehensive experiments are performed on the PASCAL VOC and COCO datasets. Our approach outperforms existing methods, including those using instance-level supervision, on both datasets for common object counting. Moreover, our approach improves state-of-the-art image-level supervised instance segmentation with a relative gain of 17.8% in terms of average best overlap, on the PASCAL VOC 2012 dataset.
[term, predict, outperforms, multiple] [approach, range, computed, local, estimation, corresponding, error, predicts, absolute, relative, rmse] [image, proposed, method, figure] [density, best, network, gain, compared, number, convolutional, performance] [common, sum, natural, requires, indicates, evaluate, van] [object, count, map, spatial, counting, subitizing, instance, global, person, category, pascal, branch, voc, ilc, segmentation, supervision, peak, predicted, coco, mrmse, mask, lspatial, crowd, prm, score, false, accurately, scoring, proposal, average, three, presence, region, penalizes, lrank] [supervised, loss, distribution, classification, set, training, function, pseudo, ranking, existing, metric, learning, class, positive, trained, train]
@InProceedings{Cholakkal_2019_CVPR,
  author = {Cholakkal, Hisham and Sun, Guolei and Shahbaz Khan, Fahad and Shao, Ling},
  title = {Object Counting and Instance Segmentation With Image-Level Supervision},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Variational Autoencoders Pursue PCA Directions (by Accident)
Michal Rolinek, Dominik Zietlow, Georg Martius


The Variational Autoencoder (VAE) is a powerful architecture capable of representation learning and generative modeling. When it comes to learning interpretable (disentangled) representations, VAE and its variants show unparalleled performance. However, the reasons for this are unclear, since a very particular alignment of the latent embedding is needed but the design of the VAE does not encourage it in any explicit way. We address this matter and offer the following explanation: the diagonal approximation in the encoder together with the inherent stochasticity force local orthogonality of the decoder. The local behavior of promoting both reconstruction and orthogonality matches closely how the PCA embedding is chosen. Alongside providing an intuitive understanding, we justify the statement with full theoretical analysis as well as with experiments.
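For reference, a minimal PyTorch sketch of the VAE objective under the diagonal-Gaussian encoder discussed above: the approximate posterior covariance is restricted to a diagonal, a stochastic latent sample is drawn via the reparameterization trick, and reconstruction and KL terms are summed. The squared-error reconstruction term is one common choice for illustration, not necessarily the exact form analysed in the paper.

    import torch

    def vae_loss(x, decoder, mu, log_var):
        # q(z|x) = N(mu, diag(exp(log_var))) -- the diagonal approximation --
        # with a reparameterized stochastic sample z.
        std = torch.exp(0.5 * log_var)
        z = mu + std * torch.randn_like(std)              # stochastic sample
        recon = decoder(z)
        rec_loss = torch.sum((recon - x) ** 2, dim=-1)    # reconstruction term
        kl = 0.5 * torch.sum(mu ** 2 + log_var.exp() - 1.0 - log_var, dim=-1)
        return (rec_loss + kl).mean()                     # ELBO (negated)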
[term, dataset] [matrix, local, reconstruction, linear, analysis, well, problem, optimization, polarized, theoretical, idealized, theorem, volume, case, square, singular, defined, rotation, proposition, rotationally, variable, good, decomposition, permutation, regime, column, international, intuitive, coordinate] [latent, disentanglement, generative, pca, figure, transformation, conference, disentangled, prior, diagonal, image, lrec] [orthogonality, deep, orthogonal, stochastic, enc, fixed, svd, precision, higher, neural, full, optimize, design] [variational, arxiv, decoder, dec, encoder, dto, lkl, example, choice, interpretable, reinforcement, mechanism, machine, adversarial] [global] [learning, objective, vae, loss, representation, log, vaes, training, autoencoders, embedding, product, autoencoder, alignment, set, main, min]
@InProceedings{Rolinek_2019_CVPR,
  author = {Rolinek, Michal and Zietlow, Dominik and Martius, Georg},
  title = {Variational Autoencoders Pursue PCA Directions (by Accident)},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Relation-Augmented Fully Convolutional Network for Semantic Segmentation in Aerial Scenes
Lichao Mou, Yuansheng Hua, Xiao Xiang Zhu


Most current semantic segmentation approaches fall back on deep convolutional neural networks (CNNs). However, their use of convolution operations with local receptive fields causes failures in modeling contextual spatial relations. Prior works have sought to address this issue by using graphical models or spatial propagation modules in networks. But such models often fail to capture long-range spatial relationships between entities, which leads to spatially fragmented predictions. Moreover, recent works have demonstrated that channel-wise information also plays a pivotal part in CNNs. In this work, we introduce two simple yet effective network units, the spatial relation module and the channel relation module, to learn and reason about global relationships between any two spatial positions or feature maps, and then produce relation-augmented feature representations. The spatial and channel relation modules are general and extensible, and can be used in a plug-and-play fashion with the existing fully convolutional network (FCN) framework. We evaluate relation module-equipped networks on semantic segmentation tasks using two aerial image datasets, which fundamentally depend on long-range spatial relational reasoning. The networks achieve very competitive results, bringing significant improvements over baselines.
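A minimal PyTorch sketch in the spirit of the spatial relation unit described above: every spatial position computes a relation to every other position via a dot product, and the aggregated responses augment the input feature map. The layer shapes and the residual combination are assumptions; the paper's exact parameterization (and its channel relation counterpart) may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialRelation(nn.Module):
        def __init__(self, channels, inter_channels=None):
            super().__init__()
            inter = inter_channels or channels // 2
            self.query = nn.Conv2d(channels, inter, 1)
            self.key = nn.Conv2d(channels, inter, 1)
            self.value = nn.Conv2d(channels, channels, 1)

        def forward(self, x):                                # x: (B, C, H, W)
            b, c, h, w = x.shape
            q = self.query(x).flatten(2).transpose(1, 2)     # (B, HW, C')
            k = self.key(x).flatten(2)                       # (B, C', HW)
            v = self.value(x).flatten(2).transpose(1, 2)     # (B, HW, C)
            relation = F.softmax(q @ k, dim=-1)              # (B, HW, HW) pairwise relations
            out = (relation @ v).transpose(1, 2).reshape(b, c, h, w)
            return x + out                                   # relation-augmented features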
[recognition, propagation, modeling, work, dataset, capture, serial] [computer, vision, international, pattern, local, well, matrix, scene] [image, ieee, conference, remote, figure, resolution, input, proposed, comprehensive, geoscience, produce, capable, study] [channel, convolutional, network, deep, neural, performance, table, receptive, cnns, effectiveness, size, low, processing] [model, relational, indicates, visual, reasoning, generate] [spatial, relation, semantic, segmentation, feature, module, aerial, global, fcn, fully, object, affinity, integration, vaihingen, potsdam, contextual, graphical, ablation, cnn, score, isprs, scnn, impervious, propose] [learning, set, classification, learned, training, similarity, compatibility]
@InProceedings{Mou_2019_CVPR,
  author = {Mou, Lichao and Hua, Yuansheng and Xiang Zhu, Xiao},
  title = {A Relation-Augmented Fully Convolutional Network for Semantic Segmentation in Aerial Scenes},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Temporal Transformer Networks: Joint Learning of Invariant and Discriminative Time Warping
Suhas Lohit, Qiao Wang, Pavan Turaga


Many time-series classification problems involve developing metrics that are invariant to temporal misalignment. In human activity analysis, temporal misalignment arises due to various reasons including differing initial phase, sensor sampling rates, and elastic time-warps due to subject-specific biomechanics. Past work in this area has only looked at reducing intra-class variability by elastic temporal alignment. In this paper, we propose a hybrid model-based and data-driven approach to learn warping functions that not just reduce intra-class variability, but also increase inter-class separation. We call this a temporal transformer network (TTN). TTN is an interpretable differentiable module, which can be easily integrated at the front end of a classification network. The module is capable of reducing intra-class variance by generating input-dependent warping functions which lead to rate-robust representations. At the same time, it increases inter-class variance by learning warping functions that are more discriminative. We show improvements over strong baselines in 3D action recognition on challenging datasets using the proposed framework. The improvements are especially pronounced when training sets are smaller.
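A minimal sketch of the rate-robust resampling step, assuming a predictor network (not shown) emits one raw increment per time step for a given input sequence: a softmax makes the increments positive and sum to one, their cumulative sum gives a monotonic warp of the time axis, and the sequence is resampled at the warped positions by linear interpolation. This illustrates the warping mechanism only; the discriminative training of the warping predictor is what the paper adds.

    import torch
    import torch.nn.functional as F

    def apply_learned_warp(sequence, increments):
        # sequence:   (T, D) input time series
        # increments: (T,) raw predictor outputs for this sequence
        T = sequence.shape[0]
        warp = torch.cumsum(F.softmax(increments, dim=0), dim=0)   # monotone, ends at 1
        pos = warp * (T - 1)                                       # warped time stamps
        lo = pos.floor().long().clamp(max=T - 1)
        hi = (lo + 1).clamp(max=T - 1)
        frac = (pos - lo.float()).unsqueeze(1)
        return (1 - frac) * sequence[lo] + frac * sequence[hi]     # resampled sequence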
[ttn, warping, action, temporal, recognition, time, dataset, human, sequence, frame, work, skeleton, lstm, series, activity, tcn, joint, performs] [pattern, computer, vision, pose, case, international, well, corresponding, hand, analysis, note, canonical, linear, column, geometric] [conference, input, ieee, figure, proposed] [network, rate, neural, deep, performance, convolutional, better, design, addition, transforms, performed, elastic, applied, architecture] [transformer, machine, improved, length] [module, spatial, including, integrated, equivalence, baseline] [classification, learning, function, discriminative, test, data, class, invariant, training, alignment, set, learn, classifier, paper, space]
@InProceedings{Lohit_2019_CVPR,
  author = {Lohit, Suhas and Wang, Qiao and Turaga, Pavan},
  title = {Temporal Transformer Networks: Joint Learning of Invariant and Discriminative Time Warping},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PCAN: 3D Attention Map Learning Using Contextual Information for Point Cloud Based Retrieval
Wenxiao Zhang, Chunxia Xiao


Point cloud based retrieval for place recognition is an emerging problem in the vision field. The main challenge is how to find an efficient way to encode the local features into a discriminative global descriptor. In this paper, we propose a Point Contextual Attention Network (PCAN), which can predict the significance of each local point feature based on point context. Our network makes it possible to pay more attention to the task-relevant features when aggregating local features. Experiments on various benchmark datasets show that the proposed network outperforms current state-of-the-art approaches.
[recognition, dataset, learns, outperforms, extracting, extract, focus, work] [point, local, cloud, computer, pcan, vision, oxford, pattern, pointnet, netvlad, sag, submaps, international, radius, place, problem, directly, scene] [conference, ieee, based, image, input, proposed, figure, denoted] [network, layer, top, size, output, table, deep, structure, aggregate, architecture, chunxia, efficient, apply] [attention, query, ball, introduced, ability] [feature, map, global, contextual, pointnetvlad, recall, final, score, grouping, object, refined, localization, context, average, baseline, european, location, vlad, fully, extraction] [retrieval, training, datasets, retrieved, discriminative, learning, data, set, learned, sampling, significance]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Wenxiao and Xiao, Chunxia},
  title = {PCAN: 3D Attention Map Learning Using Contextual Information for Point Cloud Based Retrieval},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Depth Coefficients for Depth Completion
Saif Imran, Yunfei Long, Xiaoming Liu, Daniel Morris


Depth completion involves estimating a dense depth image from sparse depth measurements, often guided by a color image. While linear upsampling is straightforward, it results in depth pixels being interpolated in empty space across discontinuities between objects. Current methods use deep networks to maintain gaps between objects. Nevertheless, depth smearing remains a challenge. We propose a new representation for depth called Depth Coefficients (DC) to address this problem. It enables convolutions to more easily avoid inter-object depth mixing. We also show that the standard Mean Squared Error (MSE) loss function can promote depth mixing, and so we propose instead to use cross-entropy loss for DC. Both quantitative and qualitative evaluations are conducted on benchmarks, and we show that switching out sparse depth input and MSE loss functions with our DC representation and loss is a simple way to improve performance, reduce pixel depth mixing, and improve object detection.
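A small sketch of the representation and loss change described above, assuming depth values are discretised over a fixed set of bin centres: each ground-truth depth becomes a soft target over its two nearest bins (the paper spreads mass over three neighbouring bins), and the network's per-pixel distribution over bins is trained with cross-entropy rather than MSE. Function names and the two-bin simplification are illustrative.

    import torch
    import torch.nn.functional as F

    def depth_to_coefficients(depth, bin_centers):
        # depth: (N,) metric depths; bin_centers: (K,) sorted 1-D tensor.
        # Each depth is represented by weights on its two nearest bins, so the
        # expectation over bin centres recovers the original depth.
        idx = torch.bucketize(depth, bin_centers).clamp(1, len(bin_centers) - 1)
        lo, hi = bin_centers[idx - 1], bin_centers[idx]
        w_hi = ((depth - lo) / (hi - lo)).clamp(0, 1)
        target = torch.zeros(depth.shape[0], len(bin_centers))
        target.scatter_(1, (idx - 1).unsqueeze(1), (1 - w_hi).unsqueeze(1))
        target.scatter_(1, idx.unsqueeze(1), w_hi.unsqueeze(1))
        return target

    def dc_loss(logits, target_coeffs):
        # Cross-entropy between the predicted per-pixel bin distribution and the
        # soft depth-coefficient target, replacing the depth-mixing-prone MSE loss.
        return -(target_coeffs * F.log_softmax(logits, dim=1)).sum(dim=1).mean()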
@InProceedings{Imran_2019_CVPR,
  author = {Imran, Saif and Long, Yunfei and Liu, Xiaoming and Morris, Daniel},
  title = {Depth Coefficients for Depth Completion},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Diversify and Match: A Domain Adaptive Representation Learning Paradigm for Object Detection
Taekyung Kim, Minki Jeong, Seunghyeon Kim, Seokeon Choi, Changick Kim


We introduce a novel unsupervised domain adaptation approach for object detection. We aim to alleviate the imperfect translation problem of pixel-level adaptations, and the source-biased discriminativity problem of feature-level adaptations simultaneously. Our approach is composed of two stages, i.e., Domain Diversification (DD) and Multi-domain-invariant Representation Learning (MRL). At the DD stage, we diversify the distribution of the labeled data by generating various distinctive shifted domains from the source domain. At the MRL stage, we apply adversarial learning with a multi-domain discriminator to encourage features to be indistinguishable among the domains. DD addresses the source-biased discriminativity, while MRL mitigates the imperfect image translation. We construct a structured domain adaptation framework for our learning paradigm and introduce a practical way of DD for implementation. Our method outperforms the state-of-the-art methods by a large margin of 3%-11% in terms of mean average precision (mAP) on various datasets.
[framework, recognition, dataset] [computer, vision, international, pattern, volume, constraint, denote, problem, june] [image, conference, method, ieee, translation, figure, translated, artistic, generative, daf, based, conduct] [performance, network, shift, table, neural, deep, convolutional, denotes, number, effectiveness, fast, validation, adaptive, inference] [adversarial, machine, diversification, model, correct, discriminator] [object, detection, feature, semantic, pascal, voc, imperfect, urban, faster, iou, ross] [domain, adaptation, learning, source, target, shifted, mrl, unsupervised, paradigm, loss, representation, classification, train, datasets, set, discriminativity, test, data, issue, address, shifter, class, existing, lmrl, adapting, distribution, labeled, discriminative]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Taekyung and Jeong, Minki and Kim, Seunghyeon and Choi, Seokeon and Kim, Changick},
  title = {Diversify and Match: A Domain Adaptive Representation Learning Paradigm for Object Detection},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Good News, Everyone! Context Driven Entity-Aware Captioning for News Images
Ali Furkan Biten, Lluis Gomez, Marcal Rusinol, Dimosthenis Karatzas


Current image captioning systems perform at a merely descriptive level, essentially enumerating the objects in the scene and their relations. Humans, on the contrary, interpret images by integrating several sources of prior knowledge of the world. In this work, we aim to take a step closer to producing captions that offer a plausible interpretation of the scene, by integrating such contextual information into the captioning pipeline. For this we focus on the captioning of images used to illustrate news articles. We propose a novel captioning method that is able to leverage contextual information provided by the text of news articles associated with an image. Our model is able to selectively draw information from the article guided by visual cues, and to dynamically extend the output dictionary to out-of-vocabulary named entities that appear in the context source. Furthermore we introduce "GoodNews", the largest news image captioning dataset in the literature and demonstrate state-of-the-art results.
[dataset, human, current, lstm, timestep, producing, performing, time, state] [template, computer, pattern, vision, provide, associated, ground, match, truth, ford, case, city, analysis] [image, conference, ieee, produced, input, method, figure, proposed, produce] [table, performance, best, standard, order, better, deep] [named, captioning, news, entity, article, insertion, caption, attention, visual, model, sentence, goodnews, text, evaluation, attend, encoding, word, ctxins, generation, automatic, mechanism, language, textual, attins, machine, glove, natural, breakingnews, attended, generating, meteor, xiaodong, chris, arxiv] [contextual, context, level, average, semantic, recall, guided, art, baseline] [task, datasets, learning, training, knowledge, interpretation]
@InProceedings{Biten_2019_CVPR,
  author = {Furkan Biten, Ali and Gomez, Lluis and Rusinol, Marcal and Karatzas, Dimosthenis},
  title = {Good News, Everyone! Context Driven Entity-Aware Captioning for News Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multi-Level Multimodal Common Semantic Space for Image-Phrase Grounding
Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, Shih-Fu Chang


We address the problem of phrase grounding by learning a multi-level common semantic space shared by the textual and visual modalities. We exploit multiple levels of feature maps of a Deep Convolutional Neural Network, as well as contextualized word and sentence embeddings extracted from a character-based language model. Following dedicated non-linear mappings for visual features at each level, word, and sentence embeddings, we obtain multiple instantiations of our common semantic space in which comparisons between any target text and the visual content are performed with cosine similarity. We guide the model by a multi-level multimodal attention mechanism which outputs attended visual features at each level. The best level is chosen to be compared with text content for maximizing the pertinence scores of image-sentence pairs of the ground truth. Experiments conducted on three publicly available datasets show significant performance gains (20%-60% relative) over the state-of-the-art in phrase localization and set a new performance record on those datasets. We provide a detailed ablation study to show the contribution of each element of our approach and release our code on GitHub.
[multiple] [computer, vision, pattern, international, approach, linear] [image, conference, mapping, ieee, figure, method, study, based] [deep, neural, performance, selection, convolutional, cell, network, table, vgg, calculate, accuracy] [visual, attention, word, common, model, textual, sentence, language, text, grounding, multimodal, phrase, attended, pertinence, pointing, contextualized, question, mechanism, game, natural, choose, mscoco, elmo, query] [feature, level, semantic, map, bounding, score, localization, heatmap, heatmaps, european, ablation, location, fully, region, spatial, three] [space, representation, train, datasets, test, embeddings, similarity, loss, set, training, softmax, learning, cosine, combination, split, task]
@InProceedings{Akbari_2019_CVPR,
  author = {Akbari, Hassan and Karaman, Svebor and Bhargava, Surabhi and Chen, Brian and Vondrick, Carl and Chang, Shih-Fu},
  title = {Multi-Level Multimodal Common Semantic Space for Image-Phrase Grounding},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning
Nayyer Aafaq, Naveed Akhtar, Wei Liu, Syed Zulqarnain Gilani, Ajmal Mian


Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying Short Fourier Transform to CNN features of the whole video. It additionally derives high level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish a new state-of-the-art on the MSVD and MSR-VTT datasets for METEOR and ROUGE_L metrics.
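A rough NumPy sketch of embedding temporal dynamics into the visual features, assuming per-frame CNN features are stacked into a (T, D) array: the sequence is split into progressively shorter segments and the magnitudes of a few low-frequency Fourier coefficients of each segment are concatenated. The number of levels and retained coefficients are illustrative parameters, not the paper's settings.

    import numpy as np

    def hierarchical_fourier_encoding(frame_feats, levels=3, keep=4):
        # frame_feats: (T, D) per-frame CNN features of one video.
        T, D = frame_feats.shape
        encoding = []
        for level in range(levels):
            for seg in np.array_split(frame_feats, 2 ** level, axis=0):
                # magnitude of the first few Fourier coefficients, per feature dim
                spectrum = np.abs(np.fft.rfft(seg, axis=0))[:keep]
                if spectrum.shape[0] < keep:                        # pad short segments
                    spectrum = np.pad(spectrum, ((0, keep - spectrum.shape[0]), (0, 0)))
                encoding.append(spectrum.ravel())
        return np.concatenate(encoding)                             # fixed-length video encoding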
[video, dataset, temporal, sequence, multiple, action, human, gru, recurrent, short, modelling, perform, performing] [computer, fourier, compute, vision, technique, pattern, international, single, approach] [ieee, proposed, method, based, conference, transform, high, transformation] [layer, output, neural, deep, pooling, table, network, performed, denotes, best, activation, performance, convolutional, processing, cnns, gain] [visual, language, captioning, encoding, model, evaluation, msvd, description, generate, sentence, meteor, rich, rougel, natural, machine, enriched, describing] [object, cnn, semantic, extraction, hierarchical, semantics, detector, final, level] [learning, representation, existing, training, embedding, datasets]
@InProceedings{Aafaq_2019_CVPR,
  author = {Aafaq, Nayyer and Akhtar, Naveed and Liu, Wei and Zulqarnain Gilani, Syed and Mian, Ajmal},
  title = {Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Pointing Novel Objects in Image Captioning
Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, Tao Mei


Image captioning has received significant attention with remarkable improvements in recent advances. Nevertheless, images in the wild encapsulate rich knowledge and cannot be sufficiently described with models built on image-caption pairs containing only in-domain objects. In this paper, we propose to address the problem by augmenting standard deep captioning architectures with object learners. Specifically, we present Long Short-Term Memory with Pointing (LSTM-P) --- a new architecture that facilitates vocabulary expansion and produces novel objects via pointing mechanism. Technically, object learners are initially pre-trained on available object recognition data. Pointing in LSTM-P then balances the probability between generating a word through LSTM and copying a word from the recognized objects at each time step in decoder stage. Furthermore, our captioning encourages global coverage of objects in the sentence. Extensive experiments are conducted on both held-out COCO image captioning and ImageNet datasets for describing novel objects, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, we obtain an average of 60.9% in F1 score on held-out COCO dataset.
[lstm, recognition, time, sequential, ting, sequence, long] [directly, point] [image, figure, input, paired, proposed, balance] [output, imagenet, weight, deep, standard, architecture, performance, tradeoff, layer, neural, dynamically, firstly, parameter, accuracy] [captioning, word, sentence, pointing, copying, mechanism, coverage, probability, model, visual, generated, recognized, generation, vocabulary, language, attention, copy, generating, describing, yingwei, decoder, copied, decoding, regular, caption, describe, textual, prt, evaluation, yehao, memory, step, external, lrcn] [object, coco, cnn, score, semantic, predicted, feature, detected, global] [novel, training, loss, data, distribution, set, learning, target, log, tao, learnt, address]
@InProceedings{Li_2019_CVPR,
  author = {Li, Yehao and Yao, Ting and Pan, Yingwei and Chao, Hongyang and Mei, Tao},
  title = {Pointing Novel Objects in Image Captioning},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Informative Object Annotations: Tell Me Something I Don't Know
Lior Bracha, Gal Chechik


Capturing the interesting components of an image is a key aspect of image understanding. When a speaker annotates an image, selecting labels that are informative greatly depends on the prior knowledge of a prospective listener. Motivated by cognitive theories of categorization and communication, we present a new unsupervised approach to model this prior knowledge and quantify the informativeness of a description. Specifically, we compute how knowledge of a label reduces uncertainty over the space of labels and use this uncertainty reduction to rank candidate labels for describing an image. While the full estimation problem is intractable, we describe an efficient algorithm to approximate entropy reduction using a tree-structured graphical model. We evaluate our approach on the open-images dataset using a new evaluation set of 10K ground-truth ratings and find that it achieves over 65% agreement with human raters, close to the upper bound of inter-rater agreement and largely outperforming other unsupervised baseline approaches.
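A brute-force sketch of the ranking criterion, assuming the prior label distribution and, for each candidate label, the posterior over labels given that the candidate is present (e.g. estimated from co-occurrence counts) are available as dictionaries: each candidate is scored by the entropy reduction it induces and candidates are ranked by that gain. This ignores the tree-structured approximation the paper uses to make the full computation tractable.

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def rank_by_informativeness(candidate_labels, prior, cond):
        # prior: dict label -> p(y); cond: dict label -> dict of p(y | label present).
        h_prior = entropy(np.array(list(prior.values())))
        gains = {l: h_prior - entropy(np.array(list(cond[l].values())))
                 for l in candidate_labels}
        return sorted(gains, key=gains.get, reverse=True)   # most informative first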
[joint, speaker, listener, second, dataset, human, interesting, term, key, multiple, focus] [compute, approach, confidence, computed, computer, single, problem, assume, algorithm] [image, prior, figure, ieee, conditional, clean] [precision, reduction, full, compared, table, approximate, achieves, number, efficiently] [model, tree, evaluation, visual, describe, random, selecting, natural, true, probability, correct, node, find, animal, provided] [scoring, object, graphical, recall, three, annotated, semantic, car, vehicle, oid, predicted] [label, entropy, distribution, informative, uncertainty, select, set, dog, measure, agreement, raters, transmitting, ranking, knowledge, probabilistic, setup, singleton, ranked, rank, noisy, classifier, quantify, learning]
@InProceedings{Bracha_2019_CVPR,
  author = {Bracha, Lior and Chechik, Gal},
  title = {Informative Object Annotations: Tell Me Something I Don't Know},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Engaging Image Captioning via Personality
Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, Jason Weston


Standard image captioning tasks such as COCO and Flickr30k are factual, neutral in tone and (to a human) state the obvious (e.g., "a man playing a guitar"). While such tasks are useful to verify that a machine understands the content of an image, they are not engaging to humans as captions. With this in mind, we define a new task, PERSONALITY-CAPTIONS, where the goal is to be as engaging to humans as possible by incorporating controllable style and personality traits. We collect and release a large dataset of 241,858 such captions conditioned on 215 possible traits. We build models that combine existing work from (i) sentence representations [36] with Transformers trained on 1.7 billion dialogue examples; and (ii) image representations [32] with ResNets trained on 3.5 billion social media images. We obtain state-of-the-art performance on Flickr30k and COCO, and strong performance on our new task. Finally, online evaluations validate that our task and models are engaging to humans, with our best model close to human performance.
[human, work, dataset, perform, build, time, social, understand] [computer, vision, pattern, international, well, provide] [image, conference, ieee, generative, input, neutral, traditional, content, described, figure, produce] [table, performance, best, standard, neural, compared, number, compare, architecture, full, deep] [personality, caption, model, captioning, engaging, ell, word, aptions, transresnet, evaluation, transformer, encoder, text, howatt, visual, dialogue, trait, arxiv, automatic, attention, preprint, machine, conditioned, consider, engagingness, generation, billion, encoders, vector, win, multimodal, sentence, write, sweet] [coco, score, bag] [retrieval, trained, test, large, train, set, training, task, pretraining, datasets, data]
@InProceedings{Shuster_2019_CVPR,
  author = {Shuster, Kurt and Humeau, Samuel and Hu, Hexiang and Bordes, Antoine and Weston, Jason},
  title = {Engaging Image Captioning via Personality},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Vision-Based Navigation With Language-Based Assistance via Imitation Learning With Indirect Intervention
Khanh Nguyen, Debadeepta Dey, Chris Brockett, Bill Dolan


We present Vision-based Navigation with Language-based Assistance (VNLA), a grounded vision-language task where an agent with visual perception is guided via language to find objects in photorealistic indoor environments. The task emulates a real-world scenario in that (a) the requester may not know how to navigate to the target objects and thus makes requests by only specifying high-level end-goals, and (b) the agent is capable of sensing when it is lost and querying an advisor, who is more qualified at the task, to obtain language subgoals to make progress. To model language-based assistance, we develop a general framework termed Imitation Learning with Indirect Intervention (I3L), and propose a solution that is effective on the VNLA task. Empirical results show that this approach significantly improves the success rate of the learning agent over other baselines on both seen and unseen environments. Our code and data are publicly available at https://github.com/debadeepta/vnla .
[action, current, forward, behavior, time, dataset, start] [international, indirect, viewpoint, indoor, direct, vision, computer, optimal, pattern] [conference, ieee, input] [rate, est, budget, neural, number, table, computational] [agent, navigation, advisor, language, policy, nav, imitation, environment, success, room, natural, intervention, subgoals, visual, goal, subgoal, requesting, request, association, path, anav, grounded, find, vnla, executing, machine, requested, reinforcement, cloning, tentative, pnav, curr, arxiv, preprint, artificial, assistance, shortest, provided, aask, earned, requester] [help, final, location] [learning, teacher, test, main, task, training, unseen, learned, data, distribution, train, trained, set]
@InProceedings{Nguyen_2019_CVPR,
  author = {Nguyen, Khanh and Dey, Debadeepta and Brockett, Chris and Dolan, Bill},
  title = {Vision-Based Navigation With Language-Based Assistance via Imitation Learning With Indirect Intervention},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, Yoav Artzi


We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent must first follow navigation instructions in a Street View environment to a goal position, and then guess a location in its observed environment described in natural language to find a hidden object. The data contains 9326 examples of English instructions and spatial descriptions paired with demonstrations. We perform qualitative linguistic analysis, and show that the data displays a rich use of spatial reasoning. Empirical analysis shows the data presents an open challenge to existing methods.
[follow, multiple, hidden, illustrates, state] [left, panorama, view, computer, vision, position, analysis, require, compute, place, pattern, light, exact] [conference, image, figure, includes, ieee, input, described] [top, table, number, performance, architecture, process] [navigation, touchdown, sdr, language, visual, instruction, agent, goal, worker, reasoning, environment, requires, natural, ouchdown, turn, find, example, heading, text, ing, description, writing, development, correct, question, collection, follower, complete, unique, evaluation, length, identify, consider, observed, describe] [spatial, location, street, three, average, feature, interactive, neighboring, map, including, google, edge, predicted] [task, data, learning, target, set, distribution, distance, training, test]
@InProceedings{Chen_2019_CVPR,
  author = {Chen, Howard and Suhr, Alane and Misra, Dipendra and Snavely, Noah and Artzi, Yoav},
  title = {TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
A Simple Baseline for Audio-Visual Scene-Aware Dialog
Idan Schwartz, Alexander G. Schwing, Tamir Hazan


The recently proposed audio-visual scene-aware dialog task paves the way to a more data-driven way of learning virtual assistants, smart speakers and car navigation systems. However, very little is known to date about how to effectively extract meaningful information from a plethora of sensors that pound the computational engine of those devices. Therefore, in this paper, we provide and carefully analyze a simple baseline for audio-visual scene-aware dialog which is trained end-to-end. Our method differentiates in a data-driven manner useful signals from distracting ones using an attention mechanism. We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that permit to outperform the current state-of-the-art by more than 20% on CIDEr.
[video, audio, temporal, lstm, frame, dataset, action, state, recurrent, hidden, recognition, consists, version, current, watching, work, capture] [approach, good, additional, local] [image, input, extracted, proposed, based, figure] [neural, deep, performance, layer, convolutional, batch, number, reduce, basic] [question, attention, visual, dialog, model, answer, history, attended, generation, multimodal, word, answering, textual, generated, probability, vector, find, captioning, avf, evaluate, sentence, system, diverse, avsd, wearing, evaluation, simple, vqa, caption, embed, sampled, beam, holding, arxiv, preprint] [spatial, baseline, feature] [data, representation, embedding, learning, dimension, set, training, classification, trained, observe]
@InProceedings{Schwartz_2019_CVPR,
  author = {Schwartz, Idan and Schwing, Alexander G. and Hazan, Tamir},
  title = {A Simple Baseline for Audio-Visual Scene-Aware Dialog},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
End-To-End Learned Random Walker for Seeded Image Segmentation
Lorenzo Cerrone, Alexander Zeilmann, Fred A. Hamprecht


We present an end-to-end learned algorithm for seeded segmentation. Our method is based on the Random Walker algorithm, where we predict the edge weights of the underlying graph using a convolutional neural network. This can be interpreted as learning context-dependent diffusivities for a linear diffusion process. After calculating the exact gradient for optimizing these diffusivities, we propose simplifications that sparsely sample the gradient while still maintaining competitive results. The proposed method achieves the currently best results on the seeded CREMI neuron segmentation challenge.
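A forward-pass sketch of the seeded Random Walker on a 4-connected grid, assuming the horizontal and vertical edge weights have already been predicted (in the paper, by a CNN): the graph Laplacian is assembled and the standard linear system for the unlabeled nodes is solved with SciPy. The end-to-end differentiation of this solve, which is the paper's contribution, is not shown.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import spsolve

    def random_walker_probabilities(weights_h, weights_v, seeds):
        # weights_h: (H, W-1) horizontal edge weights; weights_v: (H-1, W) vertical.
        # seeds: (H, W) ints, 0 = unlabeled, k > 0 = seed of class k.
        H, W = seeds.shape
        idx = np.arange(H * W).reshape(H, W)
        rows, cols, vals = [], [], []
        def add(a, b, w):
            rows.extend([a, b]); cols.extend([b, a]); vals.extend([-w, -w])
        for i in range(H):
            for j in range(W - 1):
                add(idx[i, j], idx[i, j + 1], weights_h[i, j])
        for i in range(H - 1):
            for j in range(W):
                add(idx[i, j], idx[i + 1, j], weights_v[i, j])
        L = sp.coo_matrix((vals, (rows, cols)), shape=(H * W, H * W)).tocsr()
        L = L - sp.diags(np.asarray(L.sum(axis=1)).ravel())     # graph Laplacian D - W
        labels = seeds.ravel()
        seeded, free = np.flatnonzero(labels > 0), np.flatnonzero(labels == 0)
        n_cls = labels.max()
        x_s = np.eye(n_cls + 1)[labels[seeded]][:, 1:]          # one-hot seed labels
        B = -L[free][:, seeded] @ x_s                           # right-hand side
        L_uu = L[free][:, free].tocsc()
        probs = np.zeros((H * W, n_cls))
        probs[seeded] = x_s
        for k in range(n_cls):                                  # solve L_uu x = B per class
            probs[free, k] = spsolve(L_uu, B[:, k])
        return probs.reshape(H, W, n_cls)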
[graph, work, marked, predict] [walker, seeded, linear, cremi, algorithm, computer, matrix, ground, international, pattern, diffusivities, truth, approach, derivative, assignment, solving, respect, volume, problem, reconstruction, optimization, unmarked, horizontal, arand, analysis, vision] [image, ieee, conference, figure, proposed, method, lrw, based, conditional, comparison, microscopy, qualitative, quantitative] [network, gradient, sparse, neural, structured, size, inference, approximation, tensor, backpropagation, order, gaussian, deep, descent, standard, convolutional, best] [random, machine, system, probability, step, kind, choose] [segmentation, edge, watershed, boundary, cnn, map, seed, semantic] [learned, learning, training, loss, label, sampling, diffusion, train, entropy, large, uncertainty, extended]
@InProceedings{Cerrone_2019_CVPR,
  author = {Cerrone, Lorenzo and Zeilmann, Alexander and Hamprecht, Fred A.},
  title = {End-To-End Learned Random Walker for Seeded Image Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Efficient Neural Network Compression
Hyeji Kim, Muhammad Umar Karim Khan, Chong-Min Kyung


Network compression reduces the computational complexity and memory consumption of deep neural networks by reducing the number of parameters. In SVD-based network compression the right rank needs to be decided for every layer of the network. In this paper we propose an efficient method for obtaining the rank configuration of the whole network. Unlike previous methods which consider each layer separately, our method considers the whole network to choose the right rank configuration. We propose novel accuracy metrics to represent the accuracy and complexity relationship for a given neural network. We use these metrics in a non-iterative fashion to obtain the right rank configuration which satisfies the constraints on FLOPs and memory while maintaining sufficient accuracy. Experiments show that our method provides better compromise between accuracy and computational complexity/memory consumption while performing compression at much higher speed. For VGG-16 our network can reduce the FLOPs by 25% and improve accuracy by 0.7% compared to the baseline, while requiring only 3 minutes on a CPU to search for the right rank configuration. Previously, similar results were achieved in 4 hours with 8 GPUs. The proposed method can be used for lossless compression of a neural network as well. The better accuracy and complexity compromise, as well as the extremely fast speed of our method make it suitable for neural network compression.
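A minimal illustration of SVD-based compression for one fully connected layer, to make the rank/complexity trade-off concrete: the weight matrix is replaced by two thin factors of a chosen rank, and the parameter count changes from in*out to rank*(in+out). The whole-network rank-selection metrics, which are the paper's contribution, are not reproduced here.

    import numpy as np

    def compress_layer(W, rank):
        # Truncated-SVD compression: W (out, in) becomes two thin factors so the
        # layer is replaced by two smaller layers of the given rank.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        W1 = Vt[:rank]                       # (rank, in)
        W2 = U[:, :rank] * s[:rank]          # (out, rank)
        return W1, W2

    def params_after(W, rank):
        out_dim, in_dim = W.shape
        return rank * (in_dim + out_dim)     # vs. in_dim * out_dim originally

    # e.g. pick, per layer, the smallest rank whose reconstruction error
    # np.linalg.norm(W - W2 @ W1) stays within a budget; the paper instead uses
    # network-level accuracy metrics under FLOPs/memory constraints.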
[extract] [defined, decomposition, normalized, error, note, single, optimal, define, constraint, international, computer] [method, pca, based, proposed, mapping, conference, comparison, figure, ieee, image] [accuracy, complexity, neural, network, compression, layer, configuration, convolutional, search, deep, number, performance, rmin, table, energy, compared, reduction, pruning, lower, validation, latency, efficient, higher, ratio, alexnet, better, reduce, lossless, filter, kernel, original, computational, cpu, fast] [model, candidate, arxiv, preprint, choose, memory, represent, find] [propose, baseline] [rank, metric, space, rmax, learning, set, target, function, paper]
@InProceedings{Kim_2019_CVPR,
  author = {Kim, Hyeji and Umar Karim Khan, Muhammad and Kyung, Chong-Min},
  title = {Efficient Neural Network Compression},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Cascaded Generative and Discriminative Learning for Microcalcification Detection in Breast Mammograms
Fandong Zhang, Ling Luo, Xinwei Sun, Zhen Zhou, Xiuli Li, Yizhou Yu, Yizhou Wang


Accurate microcalcification (mC) detection is of great importance due to its high proportion in early breast cancers. Most of the previous mC detection methods belong to discriminative models, where classifiers are exploited to distinguish mCs from other backgrounds. However, it is still challenging for these methods to tell mCs from large amounts of normal tissue because they are too tiny (at most 14 pixels). Generative methods can precisely model the normal tissues and regard the abnormal ones as outliers, while they fail to further distinguish the mCs from other anomalies, i.e., vessel calcifications. In this paper, we propose a hybrid approach by taking advantage of both generative and discriminative models. Firstly, a generative model named Anomaly Separation Network (ASN) is used to generate candidate mCs. ASN contains two major components. A deep convolutional encoder-decoder network is built to learn the image reconstruction mapping and a t-test loss function is designed to separate the distributions of the reconstruction residuals of mCs from normal tissues. Secondly, a discriminative model is cascaded to tell the mCs from the false positives. Finally, to verify the effectiveness of our method, we conduct experiments on both public and in-house datasets, which demonstrate that our approach outperforms previous state-of-the-art methods.
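A rough PyTorch sketch of a t-test style separation term like the one described for ASN, assuming the reconstruction residuals of normal-tissue pixels and mC pixels are available as two tensors: the loss maximises a two-sample t statistic between the two residual populations (implemented as its negative). The exact formulation and weighting in the paper may differ.

    import torch

    def t_test_separation_loss(residual_normal, residual_mc):
        # Push apart the residual distributions of normal tissue and mCs by
        # maximising a two-sample t statistic (minimise its negative).
        m1, m2 = residual_normal.mean(), residual_mc.mean()
        v1, v2 = residual_normal.var(), residual_mc.var()
        n1, n2 = residual_normal.numel(), residual_mc.numel()
        t = (m2 - m1) / torch.sqrt(v1 / n1 + v2 / n2 + 1e-8)
        return -t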
[dataset, anomaly, abnormal, extract, hypothesis, previous] [reconstruction, normal, outlier, robust, computer, vessel, pattern, international, problem] [image, proposed, generative, inbreast, method, based, conference, separation, cancer, figure, statistical, ieee, pixel] [network, deep, residual, small, applied, parameter, table, tiny, convolutional, designed, effectiveness, achieve, lead, design, implement, reduction, sparse] [model] [detection, false, fpn, breast, propose, threshold, predicted, cascaded, public, final, calcification, proposal, distinguish, challenging, recall] [loss, positive, asn, mammogram, negative, discriminative, learning, data, fpr, large, training, test, trained, microcalcification, function, train, hard, learn, suffer, distribution, novel, supervised, set]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Fandong and Luo, Ling and Sun, Xinwei and Zhou, Zhen and Li, Xiuli and Yu, Yizhou and Wang, Yizhou},
  title = {Cascaded Generative and Discriminative Learning for Microcalcification Detection in Breast Mammograms},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
C3AE: Exploring the Limits of Compact Model for Age Estimation
Chao Zhang, Shuaicheng Liu, Xun Xu, Ce Zhu


Age estimation is a classic learning problem in computer vision. Many larger and deeper CNNs have been proposed with promising performance, such as AlexNet, VggNet, GoogLeNet and ResNet. However, these models are not practical for embedded/mobile devices. Recently, MobileNets and ShuffleNets have been proposed to reduce the number of parameters, yielding lightweight models. However, their representation has been weakened because of the adoption of depth-wise separable convolution. In this work, we investigate the limits of compact models for small-scale images and propose an extremely Compact yet efficient Cascade Context-based Age Estimation model (C3AE). This model possesses only 1/9 and 1/2000 of the parameters of MobileNets/ShuffleNets and VggNet, respectively, while achieving competitive performance. In particular, we re-define the age estimation problem with a two-points representation, which is implemented by a cascade model. Moreover, to fully utilize the facial context information, a multi-branch CNN network is proposed to aggregate multi-scale context. Experiments are carried out on three age estimation datasets. State-of-the-art performance among compact models has been achieved with a relatively large margin.
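A small NumPy sketch of the two-points representation, assuming a fixed set of anchor ages: each age is encoded as a distribution with mass only on its two neighbouring anchors, so the dot product with the anchors recovers the age exactly. The cascade in the paper predicts this distribution and then regresses the age from it; the anchor spacing below is an illustrative choice.

    import numpy as np

    def age_to_two_points(age, anchors):
        # anchors: sorted 1-D array of anchor ages (e.g. every 10 years).
        hi = int(np.clip(np.searchsorted(anchors, age), 1, len(anchors) - 1))
        lo = hi - 1
        w_hi = (age - anchors[lo]) / (anchors[hi] - anchors[lo])
        dist = np.zeros(len(anchors))
        dist[lo], dist[hi] = 1 - w_hi, w_hi
        return dist

    # e.g. anchors = np.arange(0, 101, 10); age_to_two_points(23, anchors) puts
    # 0.7 on anchor 20 and 0.3 on anchor 30, and dist @ anchors == 23.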
[represented, second] [estimation, analysis, discrete, problem] [age, image, facial, morph, ssr, study, result, proposed, comparison, input, face, based, ieee, figure, resolution] [convolution, plain, compact, standard, deep, size, performance, layer, number, small, residual, convolutional, output, full, compared, bulky, competitive, computation, better, separable, efficient, cost, channel, suitable, dex, factor, pretrained, reduce, network, achieved, connected, table, process, mobilenets, achieves, low, neural] [model, arxiv, preprint, machine] [cascade, regression, module, context, ablation, three, feature, predicted, cnn, mae, fully, semantic, utilize] [distribution, representation, loss, training, label, learning, large, set, classification]
@InProceedings{Zhang_2019_CVPR,
  author = {Zhang, Chao and Liu, Shuaicheng and Xu, Xun and Zhu, Ce},
  title = {C3AE: Exploring the Limits of Compact Model for Age Estimation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Adaptive Weighting Multi-Field-Of-View CNN for Semantic Segmentation in Pathology
Hiroki Tokunaga, Yuki Teramoto, Akihiko Yoshizawa, Ryoma Bise


Automated digital histopathology image segmentation is an important task to help pathologists diagnose tumors and cancer subtypes. For pathological diagnosis of cancer subtypes, pathologists usually change the magnification of whole-slide image (WSI) viewers. A key assumption is that the importance of each magnification depends on the characteristics of the input image, such as cancer subtypes. In this paper, we propose a novel semantic segmentation method, called Adaptive-Weighting-Multi-Field-of-View-CNN (AWMF-CNN), that can adaptively use features from images at different magnifications to segment multiple cancer subtype regions in the input image. The proposed method aggregates several expert CNNs for images at different magnifications by adaptively changing the weight of each expert depending on the input image. It thereby leverages information from different magnifications that may be useful for identifying the subtypes. It outperformed other state-of-the-art methods in our experiments.
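As a rough illustration of the adaptive-weighting idea, the aggregation can be written as a weighted sum of expert outputs with per-image weights predicted from the input. The PyTorch module below is a simplified assumption: it feeds the same tensor to every expert and uses a trivial pooling-based weighting head, whereas the paper uses one expert per magnification.

    import torch
    import torch.nn as nn

    class AdaptiveExpertAggregation(nn.Module):
        """Weighted aggregation of expert segmentation networks, with the
        weights predicted adaptively from the input image (simplified sketch)."""
        def __init__(self, experts, in_channels):
            super().__init__()
            self.experts = nn.ModuleList(experts)
            self.weighting = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(in_channels, len(experts)),
                nn.Softmax(dim=1),
            )

        def forward(self, x):
            w = self.weighting(x)                                       # (B, K)
            logits = torch.stack([e(x) for e in self.experts], dim=1)   # (B, K, C, H, W)
            return (w[:, :, None, None, None] * logits).sum(dim=1)      # (B, C, H, W)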
[individual, multiple] [field, view, single, estimated, estimate, normal, algorithm, corresponding, scene, check] [image, expert, input, figure, method, magnification, cancer, proposed, patch, resolution, changing] [network, cnns, adaptively, convolutional, deep, weight, size, neural, output, segnet, compared, number, wide, aggregating, accuracy, specialized, performance, architecture, small] [indicates, depending, correct] [segmentation, cnn, semantic, union, tumor, subtype, lung, region, three, segment, wsi, spatial, contextual, pathology, predicted, pathological, subtypes, context] [training, weighting, trained, learning, data, classification, target, train, set, class, large, task]
@InProceedings{Tokunaga_2019_CVPR,
  author = {Tokunaga, Hiroki and Teramoto, Yuki and Yoshizawa, Akihiko and Bise, Ryoma},
  title = {Adaptive Weighting Multi-Field-Of-View CNN for Semantic Segmentation in Pathology},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
In Defense of Pre-Trained ImageNet Architectures for Real-Time Semantic Segmentation of Road-Driving Images
Marin Orsic, Ivan Kreso, Petra Bevandic, Sinisa Segvic


Recent success of semantic segmentation approaches on demanding road-driving datasets has spurred interest in many related application fields. Many of these applications involve real-time prediction on mobile platforms such as cars, drones and various kinds of robots. The real-time setting is challenging due to the extraordinary computational complexity involved. Many previous works address the challenge with custom lightweight architectures which decrease computational complexity by reducing depth, width and layer capacity with respect to general-purpose architectures. We propose an alternative approach which achieves significantly better performance across a wide range of computing budgets. First, we rely on a lightweight general-purpose architecture as the main recognition engine. Then, we leverage lightweight upsampling with lateral connections as the most cost-effective solution to restore the prediction resolution. Finally, we propose to enlarge the receptive field by fusing shared features at multiple resolutions in a novel fashion. Experiments on several road-driving datasets show a substantial advantage of the proposed approach, whether using ImageNet pre-trained parameters or learning from scratch. Our Cityscapes test submission, entitled SwiftNetRN-18, delivers 75.5% MIoU and achieves 39.9 Hz on 1024x2048 images on a GTX 1080Ti.
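A minimal sketch of lightweight upsampling with lateral connections, assuming an FPN-style decoder step; channel counts and the blending scheme below are assumptions, not the published SwiftNet configuration.

    import torch.nn as nn
    import torch.nn.functional as F

    class LateralUpsample(nn.Module):
        """Upsample a coarse feature map and fuse it with a lateral (skip)
        feature map from the encoder using cheap 1x1 and 3x3 convolutions."""
        def __init__(self, coarse_ch, lateral_ch, out_ch=128):
            super().__init__()
            self.reduce = nn.Conv2d(coarse_ch, out_ch, kernel_size=1)
            self.project = nn.Conv2d(lateral_ch, out_ch, kernel_size=1)
            self.blend = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

        def forward(self, coarse, lateral):
            up = F.interpolate(self.reduce(coarse), size=lateral.shape[-2:],
                               mode='bilinear', align_corners=False)
            return self.blend(up + self.project(lateral))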
[recognition, time, spp, prediction] [single, field, vision, pattern, computer, approach, note] [resolution, image, input, figure, conference, ieee, proposed, based] [upsampling, scale, imagenet, lateral, receptive, convolutional, number, convolution, table, computational, processing, residual, accuracy, order, efficient, deep, complexity, validation, dilated, fps, camvid, titanx, lightweight, pooling, neural, mobilenet, output, gpu, achieve, speed, achieves, best, aiming, inference, low, smaller] [model, encoder, decoder, encoders, visual] [semantic, pyramid, segmentation, spatial, propose, miou, val, feature] [test, training, large, learning, train, trained, metric, shared]
@InProceedings{Orsic_2019_CVPR,
  author = {Orsic, Marin and Kreso, Ivan and Bevandic, Petra and Segvic, Sinisa},
  title = {In Defense of Pre-Trained ImageNet Architectures for Real-Time Semantic Segmentation of Road-Driving Images},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Context-Aware Visual Compatibility Prediction
Guillem Cucurull, Perouz Taslakian, David Vazquez


How do we determine whether two or more clothing items are compatible or visually appealing? Part of the answer lies in an understanding of visual aesthetics, and is biased by personal preferences shaped by social attitudes, time, and place. In this work we propose a method that predicts compatibility between two items based on their visual features as well as their context. We define context as the products that are known to be compatible with each of these items. Our model contrasts with other metric-learning approaches that rely on pairwise comparisons between item features alone. We address the compatibility prediction problem using a graph neural network that learns to generate product embeddings conditioned on their context. We present results for two prediction tasks (fill-in-the-blank and outfit compatibility) tested on two fashion datasets, Polyvore and Fashion-Gen, and on a subset of the Amazon dataset; we achieve state-of-the-art results when using context information and show how test performance improves as more context is used.
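To make the idea of embeddings conditioned on context concrete, here is a small NumPy sketch of a single graph-convolution-style update followed by a dot-product compatibility score. It is an illustrative simplification (one layer, mean aggregation, random weights), not the architecture used in the paper.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def context_embeddings(x, adj, w_self, w_ctx):
        """One update step: each item embedding is conditioned on the mean of
        its neighbours, i.e. the items it is already known to be compatible with."""
        deg = adj.sum(axis=1, keepdims=True).clip(min=1.0)
        ctx = adj @ x / deg                              # mean neighbour features
        return np.maximum(0.0, x @ w_self.T + ctx @ w_ctx.T)

    def compatibility(z_i, z_j):
        """Score a pair of context-conditioned embeddings."""
        return sigmoid(float(z_i @ z_j))

    # toy usage: 4 items with 8-d features and a small compatibility graph
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    adj = np.array([[0, 1, 1, 0],
                    [1, 0, 0, 1],
                    [1, 0, 0, 0],
                    [0, 1, 0, 0]], dtype=float)
    w_self, w_ctx = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
    z = context_embeddings(x, adj, w_self, w_ctx)
    print(compatibility(z[0], z[1]))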
[graph, prediction, dataset, work, predict, previous, perform, consists] [matrix, defined, international, problem, approach, well, computed, computer, vision] [figure, method, conference, based, style, input, proposed] [neural, better, network, table, deep, convolutional, original, applied, siamese, accuracy, apply, structure, performance] [model, decoder, arxiv, preprint, visual, encoder, neighbourhood, men, relational, node, amazon, probability, evaluate, represent, computes, example, making] [context, clothing, edge, score, improves, improve] [compatibility, fashion, outfit, learning, task, item, fitb, set, polyvore, function, metric, test, product, embeddings, trained, embedding, adjacency, randomly, resampled, pair, training, bought, compatible, tested]
@InProceedings{Cucurull_2019_CVPR,
  author = {Cucurull, Guillem and Taslakian, Perouz and Vazquez, David},
  title = {Context-Aware Visual Compatibility Prediction},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Sim-To-Real via Sim-To-Sim: Data-Efficient Robotic Grasping via Randomized-To-Canonical Adaptation Networks
Stephen James, Paul Wohlhart, Mrinal Kalakrishnan, Dmitry Kalashnikov, Alex Irpan, Julian Ibarz, Sergey Levine, Raia Hadsell, Konstantinos Bousmalis


Real-world data, especially in the domain of robotics, is notoriously costly to collect. One way to circumvent this is to leverage the power of simulation to produce large amounts of labelled data. However, models trained on simulated images do not readily transfer to real-world ones. Using domain adaptation methods to cross this "reality gap" requires a large amount of unlabelled real-world data, whilst domain randomization alone can waste modeling power. In this paper, we present Randomized-to-Canonical Adaptation Networks (RCANs), a novel approach to crossing the visual reality gap that uses no real-world data. Our method learns to translate randomized rendered images into their equivalent non-randomized, canonical versions. This in turn allows real images to also be translated into canonical sim images. We demonstrate the effectiveness of this sim-to-real approach by training a vision-based closed-loop grasping reinforcement learning agent in simulation, and then transferring it to the real world to attain 70% zero-shot grasp success on unseen objects, a result that almost doubles the success of learning the same task directly with domain randomization alone. Additionally, by joint finetuning in the real world with only 5,000 real-world grasps, our method achieves 91%, attaining comparable performance to a state-of-the-art system trained with 580,000 real-world grasps and resulting in a reduction of real-world data by more than 99%.
[joint, work, sim, version, arm, action] [grasping, canonical, simulation, grasp, robotics, directly, depth, computer, algorithm, realworld, approach, vision, allows, rgb, reality, scene, additional, pattern, journal] [real, conference, image, ieee, method, rcan, generator, figure, simulated, amount, input, translation, comparison, generative] [performance, deep, finetuning, network, neural, number, order, alex, achieve] [randomization, robot, robotic, randomized, reinforcement, policy, visual, agent, adversarial, machine, success, model, environment, tray, system, kalashnikov, stephen, generated] [object, segmentation] [domain, learning, adaptation, training, data, trained, transfer, learn, source, target, task, train, sergey, unsupervised, large, function, kate, unseen, adapted]
@InProceedings{James_2019_CVPR,
  author = {James, Stephen and Wohlhart, Paul and Kalakrishnan, Mrinal and Kalashnikov, Dmitry and Irpan, Alex and Ibarz, Julian and Levine, Sergey and Hadsell, Raia and Bousmalis, Konstantinos},
  title = {Sim-To-Real via Sim-To-Sim: Data-Efficient Robotic Grasping via Randomized-To-Canonical Adaptation Networks},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Multiview 2D/3D Rigid Registration via a Point-Of-Interest Network for Tracking and Triangulation
Haofu Liao, Wei-An Lin, Jiarui Zhang, Jingdan Zhang, Jiebo Luo, S. Kevin Zhou


We propose to tackle the problem of multiview 2D/3D rigid registration for intervention via a Point-Of-Interest Network for Tracking and Triangulation (POINT^2). POINT^2 learns to establish 2D point-to-point correspondences between the pre- and intra-intervention images by tracking a set of random POIs. The 3D pose of the pre-intervention volume is then estimated through a triangulation layer. In POINT^2, the unified framework of the POI tracker and the triangulation layer enables learning informative 2D features and estimating the 3D pose jointly. In contrast to existing approaches, POINT^2 requires only a single forward pass to achieve a reliable 2D/3D registration. As the POI tracker is shift-invariant, POINT^2 is more robust to the initial pose of the 3D pre-intervention image. Extensive experiments on a large-scale clinical cone-beam CT (CBCT) dataset show that the proposed POINT^2 method outperforms the existing learning-based method in terms of accuracy, robustness and running time. Furthermore, when used as an initial pose estimator, our method also improves the robustness and speed of state-of-the-art optimization-based approaches tenfold.
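For readers unfamiliar with the operation, the core computation of a triangulation layer corresponds to standard linear (DLT) triangulation of a 3D point from its 2D observations across views. The NumPy function below shows only that textbook operation, not the paper's differentiable layer or its handling of correspondences.

    import numpy as np

    def triangulate(projections, points_2d):
        """Linear (DLT) triangulation of a single 3D point.

        projections: iterable of 3x4 camera projection matrices P_i
        points_2d:   iterable of (x, y) observations of the same point
        """
        rows = []
        for P, (x, y) in zip(projections, points_2d):
            rows.append(x * P[2] - P[0])
            rows.append(y * P[2] - P[1])
        A = np.stack(rows)
        _, _, vt = np.linalg.svd(A)
        X = vt[-1]
        return X[:3] / X[3]          # homogeneous -> inhomogeneous coordinates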
[tracking, framework, dataset, tracked, work, time, multiple, performs] [registration, pose, drr, poi, point, approach, triangulation, initial, volume, view, drrs, robust, rigid, denote, equation, directly, corresponding, computer, problem, correspondence, matrix, international, multiview, projected, good, projection, rotation] [method, proposed, figure, image, ieee, imaging, conference, clinical, high, reconstructed, captured] [network, patient, kernel, layer, convolution, better, searching, size, number, performance, speed] [iterative, find, requires, robustness, generation, evaluation, sensitive] [feature, medical, cbct, misalignment, faster, offset, location, detector, map] [similarity, set, learning, data, training, source, sampling]
@InProceedings{Liao_2019_CVPR,
  author = {Liao, Haofu and Lin, Wei-An and Zhang, Jiarui and Zhang, Jingdan and Luo, Jiebo and Kevin Zhou, S.},
  title = {Multiview 2D/3D Rigid Registration via a Point-Of-Interest Network for Tracking and Triangulation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Context-Aware Spatio-Recurrent Curvilinear Structure Segmentation
Feigege Wang, Yue Gu, Wenxi Liu, Yuanlong Yu, Shengfeng He, Jia Pan


Curvilinear structures are frequently observed in various forms across different images, such as blood vessels or neuronal boundaries in biomedical images. In this paper, we propose a novel curvilinear structure segmentation approach using context-aware spatio-recurrent networks. Instead of directly segmenting the whole image or densely segmenting fixed-size local patches, our method recurrently samples patches at varied scales from the target image with a learned policy and processes them locally; this is similar to the changing retinal fixations of the human visual system and is beneficial for capturing the multi-scale or hierarchical modality of complex curvilinear structures. Specifically, the policy for choosing local patches is learned attentively from the contextual information of the image and the historical sampling experience. In this way, as more patches are sampled and refined, the segmentation of the whole image is progressively improved. To validate our approach, we conduct comparison experiments on different types of image data and illustrate the sampling procedures for exemplar images. We demonstrate that our method achieves state-of-the-art performance on public datasets.
[action, drive, historical, previous, recurrent, sequential] [local, approach, vessel, manual, algorithm, topology, computer, vision] [image, patch, based, biomedical, comparison, figure, proposed, method, extracted] [deep, structure, network, convolutional, performance, sequentially, applied, small, original, neural, size, process] [policy, model, reinforcement, attention, agent, sampled, visual, progressively, red, step] [segmentation, curvilinear, feature, global, mask, blood, retinal, semantic, holistic, medical, driu, stare, attentive, contextual, object, module, propose, segmenting, annotation, extraction, fully, three, crfs, neuronal, context] [training, learning, sampling, target, set, trained, learned, data]
@InProceedings{Wang_2019_CVPR,
  author = {Wang, Feigege and Gu, Yue and Liu, Wenxi and Yu, Yuanlong and He, Shengfeng and Pan, Jia},
  title = {Context-Aware Spatio-Recurrent Curvilinear Structure Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
An Alternative Deep Feature Approach to Line Level Keyword Spotting
George Retsinas, Georgios Louloudis, Nikolaos Stamatopoulos, Giorgos Sfikas, Basilis Gatos


Keyword spotting (KWS) is defined as the problem of detecting all instances of a given word, provided by the user either as a query word image (Query-by-Example, QbE) or a query word string (Query-by-String, QbS), in a body of digitized documents. Keyword detection is typically preceded by a preprocessing step in which the text is segmented into text lines (line-level KWS). Methods following this paradigm are dominated by handwritten text recognition (HTR)-based approaches that are computationally expensive at test time; furthermore, they typically cannot handle image queries (QbE). In this work, we propose a time- and storage-efficient, deep feature-based approach that enables both image and textual search options. Three distinct components, all modeled as neural networks, are combined: normalization, feature extraction, and representation of image and textual input in a common space. These components, even though designed on word-level image representations, collaborate to achieve an efficient line-level keyword spotting system. The experimental results indicate that the proposed system is on par with state-of-the-art KWS methods.
[recognition, work, consists, sequence] [matching, estimation, well, approach, international, analysis, problem, single, estimator] [image, proposed, input, method, figure, extracted, based, conference, component, produced] [width, convolutional, pooling, network, deep, neural, number, size, order, performance, normalization, entire, performed, connected, efficient, max] [word, query, character, phoc, spotting, qbs, vector, keyword, text, common, system, qbe, document, step, encoding, handwritten, procedure, string, generated, enables, iam] [feature, extraction, roi, map, level, average, segmentation, fully, three, final] [set, training, distance, space, representation, experimental, function, main, specific, reported, scenario, task, embedding, trained]
@InProceedings{Retsinas_2019_CVPR,
  author = {Retsinas, George and Louloudis, Georgios and Stamatopoulos, Nikolaos and Sfikas, Giorgos and Gatos, Basilis},
  title = {An Alternative Deep Feature Approach to Line Level Keyword Spotting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Dynamics Are Important for the Recognition of Equine Pain in Video
Sofia Broome, Karina Bech Gleerup, Pia Haubro Andersen, Hedvig Kjellstrom


A prerequisite to successfully alleviate pain in animals is to recognize it, which is a great challenge in non-verbal species. Furthermore, prey animals such as horses tend to hide their pain. In this study, we propose a deep recurrent two-stream architecture for the task of distinguishing pain from non-pain in videos of horses. Different models are evaluated on a unique dataset showing horses under controlled trials with moderate pain induction, which has been presented in earlier work. Sequential models are experimentally compared to single-frame models, showing the importance of the temporal dimension of the data, and are benchmarked against a veterinary expert classification of the data. We additionally perform baseline comparisons with generalized versions of state-of-the-art human pain recognition methods. While equine pain detection in machine learning is a novel field, our results surpass veterinary expert performance and outperform pain detection results reported for other larger non-human species.
[pain, equine, recognition, action, temporal, video, flow, dataset, lstm, optical, recurrent, veterinary, horse, sequential, fusion, stream, subject, human, behavior, motion, perform, work, sequence, time, assessment] [rgb, single, computer, dense, pattern] [facial, method, presented, input, image, expert, figure, result, ieee, background] [convolutional, deep, layer, better, table, best, network, architecture, neural, standard, compared, performance, applied, larger, top, scale, pooling, performed] [model, automatic, attention, animal, example] [detection, spatial, feature, three, average, extraction, evaluated, level] [data, classification, trained, training, task, augmentation, test, class, learning]
@InProceedings{Broome_2019_CVPR,
  author = {Broome, Sofia and Bech Gleerup, Karina and Haubro Andersen, Pia and Kjellstrom, Hedvig},
  title = {Dynamics Are Important for the Recognition of Equine Pain in Video},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
LaserNet: An Efficient Probabilistic 3D Object Detector for Autonomous Driving
Gregory P. Meyer, Ankit Laddha, Eric Kee, Carlos Vallespi-Gonzalez, Carl K. Wellington


In this paper, we present LaserNet, a computationally efficient method for 3D object detection from LiDAR data for autonomous driving. The efficiency results from processing LiDAR data in the native range view of the sensor, where the input data is naturally compact. Operating in the range view involves well known challenges for learning, including occlusion and scale variation, but it also provides contextual information based on how the sensor data was captured. Our approach uses a fully convolutional network to predict a multimodal distribution over 3D boxes for each point and then it efficiently fuses these distributions to generate a prediction for each object. Experiments show that modeling each detection as a distribution rather than a single deterministic box leads to better overall detection performance. Benchmark results show that this approach has significantly lower runtime than other recent detectors and that it achieves state-of-the-art performance when compared on a large dataset that has enough data to overcome the challenges of training on the range view.
[predict, recognition, predicting, multiple, perform, unimodal] [range, lidar, view, point, computer, vision, approach, kitti, autonomous, sensor, pattern, single, laser, dense, ground, runtime, horizontal, directly, corresponding, predicts, total, international] [image, conference, method, ieee, figure, input, proposed, based, produce, eye, resolution, study] [network, performance, shift, deep, variance, table, processing, small, adaptive, size, number, efficient, convolutional, sparse] [probability, model, multimodal, evaluation] [object, bounding, box, detection, predicted, iou, feature, vehicle, raquel, benchmark] [distribution, data, learning, set, class, loss, training, mixture, representation, hard, train, operating, large, uncertainty, learn, probabilistic]
@InProceedings{Meyer_2019_CVPR,
  author = {Meyer, Gregory P. and Laddha, Ankit and Kee, Eric and Vallespi-Gonzalez, Carlos and Wellington, Carl K.},
  title = {LaserNet: An Efficient Probabilistic 3D Object Detector for Autonomous Driving},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Machine Vision Guided 3D Medical Image Compression for Efficient Transmission and Accurate Segmentation in the Clouds
Zihao Liu, Xiaowei Xu, Tao Liu, Qi Liu, Yanzhi Wang, Yiyu Shi, Wujie Wen, Meiping Huang, Haiyun Yuan, Jian Zhuang


Cloud-based medical image analysis has become popular recently due to the high computational complexity of various deep neural network (DNN) based frameworks and the increasingly large volume of medical images that need to be processed. It has been demonstrated that, for medical images, transmission from local sites to the cloud is much more expensive than the computation in the cloud itself. To address this, 3D image compression techniques have been widely applied to reduce the data traffic. However, most existing image compression techniques are developed around human vision, i.e., they are designed to minimize distortions that can be perceived by human eyes. In this paper, we use deep learning based medical image segmentation as a vehicle and demonstrate that, interestingly, machines and humans view compression quality differently. Medical images compressed with good quality w.r.t. human vision may result in inferior segmentation accuracy. We then design a machine-vision-oriented 3D image compression framework tailored for segmentation using DNNs. Our method automatically extracts and retains the image features that are most important to the segmentation. Comprehensive experiments on widely adopted segmentation frameworks with the HVSMR 2016 challenge dataset show that our method can achieve significantly higher segmentation accuracy at the same compression rate, or a much better compression rate under the same segmentation accuracy, compared with the existing JPEG 2000 method. To the best of the authors' knowledge, this is the first machine vision guided medical image compression framework for segmentation in the clouds.
@InProceedings{Liu_2019_CVPR,
  author = {Liu, Zihao and Xu, Xiaowei and Liu, Tao and Liu, Qi and Wang, Yanzhi and Shi, Yiyu and Wen, Wujie and Huang, Meiping and Yuan, Haiyun and Zhuang, Jian},
  title = {Machine Vision Guided 3D Medical Image Compression for Efficient Transmission and Accurate Segmentation in the Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
PointPillars: Fast Encoders for Object Detection From Point Clouds
Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, Oscar Beijbom


Object detection in point clouds is an important aspect of many robotics applications such as autonomous driving. In this paper, we consider the problem of encoding a point cloud into a format appropriate for a downstream detection pipeline. Recent literature suggests two types of encoders; fixed encoders tend to be fast but sacrifice accuracy, while encoders that are learned from data are more accurate, but slower. In this work, we propose PointPillars, a novel encoder which utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars). While the encoded features can be used with any standard 2D convolutional detection architecture, we further propose a lean downstream network. Extensive experimentation shows that PointPillars outperforms previous encoders with respect to both speed and accuracy by a large margin. Despite only using lidar, our full detection pipeline significantly outperforms the state of the art, even among fusion methods, with respect to both the 3D and bird's eye view KITTI benchmarks. This detection performance is achieved while running at 62 Hz: a 2 - 4 fold runtime improvement. A faster version of our method matches the state of the art at 105 Hz. These benchmarks suggest that PointPillars is an appropriate encoding for object detection in point clouds.
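A bare-bones illustration of the pillar organization step (grouping lidar points into vertical columns on an x-y grid) is given below. The grid extents, resolution and per-pillar point cap are assumed values in the spirit of common KITTI settings, and the learned PointNet that turns each pillar into a feature vector is omitted.

    import numpy as np

    def pillarize(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
                  resolution=0.16, max_points=32):
        """Group a point cloud of shape (N, 4) = (x, y, z, reflectance) into
        vertical pillars on a regular x-y grid."""
        nx = int((x_range[1] - x_range[0]) / resolution)
        ny = int((y_range[1] - y_range[0]) / resolution)
        ix = ((points[:, 0] - x_range[0]) / resolution).astype(int)
        iy = ((points[:, 1] - y_range[0]) / resolution).astype(int)
        keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
        pillars = {}
        for p, i, j in zip(points[keep], ix[keep], iy[keep]):
            bucket = pillars.setdefault((int(i), int(j)), [])
            if len(bucket) < max_points:         # cap the points kept per pillar
                bucket.append(p)
        return pillars                           # {(i, j): list of points}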
[second, outperforms, state, fusion] [point, lidar, cloud, kitti, ground, truth, runtime, view, pointnet, single, orientation, autonomous, pointnets, frustum, range, match, vertical] [image, figure, method, proposed, eye, drawn, based] [performance, network, convolutional, speed, table, inference, fixed, deep, stride, block, applied, design, standard, number, output, original, fast, full, sparse] [encoder, encoders, encoding, create] [detection, object, pointpillars, pillar, voxelnet, bev, backbone, car, feature, box, pedestrian, easy, faster, center, cyclist, head, benchmark, anchor, map, art, height, localization, average] [learning, loss, set, hard, data, training, test, augmentation, learned, learn, classification]
@InProceedings{Lang_2019_CVPR,
  author = {Lang, Alex H. and Vora, Sourabh and Caesar, Holger and Zhou, Lubing and Yang, Jiong and Beijbom, Oscar},
  title = {PointPillars: Fast Encoders for Object Detection From Point Clouds},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Motion Estimation of Non-Holonomic Ground Vehicles From a Single Feature Correspondence Measured Over N Views
Kun Huang, Yifu Wang, Laurent Kneip


The planar motion of ground vehicles is often non-holonomic, which enables a solution of the two-view relative pose problem from a single point feature correspondence. Man-made environments such as underground parking lots are, however, dominated by line features. Inspired by the planar tri-focal tensor and its ability to handle lines, we establish an n-linear constraint on the locally circular motion of non-holonomic vehicles that is able to handle an arbitrarily large and dense window of views. We prove that this remains a uni-variate problem under the assumption of locally constant vehicle speed, and that it can transparently handle both point and vertical line correspondences. In particular, we prove that an application of Viete's formulas for expressing trigonometric functions of angle multiples, together with the Weierstrass substitution, casts the problem as one that merely seeks the roots of a uni-variate polynomial. We present the complete theory of this novel solver and test it on both simulated and real data. Our results show that it successfully handles a variety of relevant scenarios, eventually outperforming the 1-point two-view solver.
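For reference, the Weierstrass (tangent half-angle) substitution invoked above replaces the trigonometric functions of the unknown angle by rational functions of a single variable, which is what turns the constraint into a uni-variate polynomial root-finding problem:

    t = \tan\frac{\theta}{2}, \qquad
    \sin\theta = \frac{2t}{1+t^{2}}, \qquad
    \cos\theta = \frac{1-t^{2}}{1+t^{2}}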
[motion, work, window, frame, online, recognition, multiple] [ground, camera, relative, latexit, problem, planar, pose, case, ackermann, single, solution, rotation, point, vertical, international, angle, solver, form, computer, sin, remains, vision, prove, bearing, determinant, error, constraint, calibrated, minimisation, matrix, smallest, pattern, slam, robotics, measured, assumption, monocular, algorithm, horizontal, finally, polynomial, estimation, university, depends, odometry, tangential, stereo] [method, conference, ieee, real, figure, image, handle, result] [tensor, number, circular, scale, accuracy, approximated, original, represents] [model, visual, fact, random, validity, length] [vehicle, feature, indicated] [rank, objective, large]
@InProceedings{Huang_2019_CVPR,
  author = {Huang, Kun and Wang, Yifu and Kneip, Laurent},
  title = {Motion Estimation of Non-Holonomic Ground Vehicles From a Single Feature Correspondence Measured Over N Views},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
From Coarse to Fine: Robust Hierarchical Localization at Large Scale
Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, Marcin Dymczyk


Robust and accurate visual localization is a fundamental capability for numerous applications, such as autonomous driving, mobile robotics, or augmented reality. It remains, however, a challenging task, particularly for large-scale environments and in presence of significant appearance changes. State-of-the-art methods not only struggle with such scenarios, but are often too resource intensive for certain real-time applications. In this paper we propose HF-Net, a hierarchical localization approach based on a monolithic CNN that simultaneously predicts local features and global descriptors for accurate 6-DoF localization. We exploit the coarse-to-fine localization paradigm: we first perform a global retrieval to obtain location hypotheses and only later match local features within those candidate places. This hierarchical approach incurs significant runtime savings and makes our system suitable for real-time operation. By leveraging learned descriptors, our method achieves remarkable localization robustness across large variations of appearance and sets a new state-of-the-art on two challenging benchmarks for large-scale localization.
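A schematic NumPy sketch of the coarse-to-fine paradigm described above: global descriptors shortlist candidate database images, and local descriptors are matched only within that shortlist. Descriptor extraction and the final PnP + RANSAC pose solve are deliberately left out; the function names and the mutual-nearest-neighbour matching rule are illustrative assumptions.

    import numpy as np

    def localize(query_global, query_local, db_global, db_local, num_candidates=5):
        """Coarse retrieval with global descriptors, then local matching
        restricted to the retrieved candidates.

        query_global: (D,) global descriptor of the query image
        query_local:  (M, d) local descriptors of the query image
        db_global:    (N, D) global descriptors of the mapped images
        db_local:     list of (K_i, d) local descriptor arrays, one per map image
        """
        # coarse step: nearest neighbours in global-descriptor space
        candidates = np.argsort(-(db_global @ query_global))[:num_candidates]

        # fine step: mutual-nearest-neighbour matching against each candidate
        matches = {}
        for idx in candidates:
            d = db_local[idx]
            nn_q2d = np.argmax(query_local @ d.T, axis=1)
            nn_d2q = np.argmax(d @ query_local.T, axis=1)
            matches[int(idx)] = [(q, m) for q, m in enumerate(nn_q2d)
                                 if nn_d2q[m] == q]
        return candidates, matches   # matches would feed a PnP + RANSAC solver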
[dataset, perform, report, performs] [local, matching, superpoint, sfm, pose, keypoints, night, robotcar, dense, robust, keypoint, netvlad, aachen, compute, approach, runtime, camera, descriptor, ground, truth, torsten, accurate, direct, single, hpatches, limited, well, harris] [image, based, method, database, figure, reference, composed, day] [network, scale, mobile, efficient, performance, table, number, efficiency, search, neural, deep, computational, competitive, multitask, architecture] [model, visual, robustness, query, evaluation, evaluate] [localization, global, hierarchical, challenging, feature, cnn, map, three, improve, detection, semantic, urban] [learning, retrieval, learned, large, distillation, training, data, teacher, trained, train, retrieved, doap]
@InProceedings{Sarlin_2019_CVPR,
  author = {Sarlin, Paul-Edouard and Cadena, Cesar and Siegwart, Roland and Dymczyk, Marcin},
  title = {From Coarse to Fine: Robust Hierarchical Localization at Large Scale},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Large Scale High-Resolution Land Cover Mapping With Multi-Resolution Data
Caleb Robinson, Le Hou, Kolya Malkin, Rachel Soobitsky, Jacob Czawlytko, Bistra Dilkina, Nebojsa Jojic


In this paper we propose multi-resolution data fusion methods for deep learning-based high-resolution land cover mapping from aerial imagery. The land cover mapping problem, at country-level scales, is challenging for common deep learning methods due to the scarcity of high-resolution labels, as well as variation in geography and quality of input images. On the other hand, multiple satellite imagery and low-resolution ground truth label sources are widely available, and can be used to improve model training efforts. Our methods include: introducing low-resolution satellite data to smooth quality differences in high-resolution input, exploiting low-resolution labels with a dual loss function, and pairing scarce high-resolution labels with inputs from several points in time. We train models that are able to generalize from a portion of the Northeast United States, where we have high-resolution land cover labels, to the rest of the US. With these models, we produce the first high-resolution (1-meter) land cover map of the contiguous US, consisting of over 8 trillion pixels. We demonstrate the robustness and potential applications of this data in a case study with domain experts and develop a web application to share our results. This work is practically useful, and can be applied to other locations over the earth as high-resolution imagery becomes more widely available even as high-resolution labeled land cover data remains sparse.
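The dual loss is only named in the abstract; one plausible shape for such an objective, shown purely as an assumption, combines per-pixel cross-entropy against high-resolution labels (where they exist) with a term pulling the block-averaged predicted class distribution towards the low-resolution label distribution. The block size, weighting and KL form below are hypothetical choices.

    import torch
    import torch.nn.functional as F

    def dual_loss(logits, hr_labels, lr_label_dist, block=30, lam=0.25):
        """logits:        (B, C, H, W) per-pixel class scores
        hr_labels:     (B, H, W) high-res labels, -1 where unavailable
        lr_label_dist: (B, C, H//block, W//block) class distributions derived
                       from the low-resolution label source"""
        ce = F.cross_entropy(logits, hr_labels, ignore_index=-1)
        probs = F.softmax(logits, dim=1)
        block_probs = F.avg_pool2d(probs, kernel_size=block)   # aggregate to coarse grid
        kl = F.kl_div(torch.log(block_probs.clamp_min(1e-8)),
                      lr_label_dist, reduction='batchmean')
        return ce + lam * kl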
[fusion, time, dataset, state, recognition, multiple] [computer, single, vision, additional, pattern, international, ground] [land, cover, chesapeake, imagery, color, image, nlcd, mapping, input, bay, resolution, figure, naip, iowa, maryland, jaccard, remote, conference, landsat, method, forest, pixel, high, south, conservancy, northeast, quality, produce, earth] [entire, accuracy, performance, network, deep, best, cost, convolutional, neural, layer] [model, identify, random, potential] [satellite, map, aerial, segmentation, area, spatial, region, three, north, watershed] [data, training, set, label, trained, class, large, classification, test, generalize, domain, distribution, learning, loss, labeled, existing, augmentation, train, contiguous, testing]
@InProceedings{Robinson_2019_CVPR,
  author = {Robinson, Caleb and Hou, Le and Malkin, Kolya and Soobitsky, Rachel and Czawlytko, Jacob and Dilkina, Bistra and Jojic, Nebojsa},
  title = {Large Scale High-Resolution Land Cover Mapping With Multi-Resolution Data},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}
Leveraging Heterogeneous Auxiliary Tasks to Assist Crowd Counting
Muming Zhao, Jian Zhang, Chongyang Zhang, Wenjun Zhang


Crowd counting is a challenging task in the presence of drastic scale variations, cluttered backgrounds, and severe occlusions. Existing CNN-based counting methods tackle these challenges mainly by fusing either multi-scale or multi-context features to generate robust representations. In this paper, we propose to address these issues by leveraging the heterogeneous attributes compounded in the density map. We identify three geometric/semantic/numeric attributes essential to density estimation and demonstrate how to effectively utilize these heterogeneous attributes to assist crowd counting by formulating them into multiple auxiliary tasks. With the multi-fold regularization effects induced by the auxiliary tasks, the backbone CNN model is driven to embed the desired properties explicitly and thus gains robust representations for more accurate density estimation. Extensive experiments on three challenging crowd counting datasets demonstrate the effectiveness of the proposed approach.
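As a generic illustration of formulating auxiliary tasks alongside the main density objective, the multi-task loss below combines a density-map MSE with two example auxiliary terms (foreground segmentation and global count regression). The specific auxiliary attributes and weights used in the paper differ; everything here is an assumed stand-in.

    import torch.nn.functional as F

    def counting_multitask_loss(pred_density, gt_density, pred_seg, gt_seg,
                                pred_count, weights=(1.0, 0.1, 0.01)):
        """pred_density, gt_density: (B, 1, H, W) density maps
        pred_seg (logits), gt_seg (float 0/1 mask): (B, 1, H, W)
        pred_count: (B,) predicted global counts"""
        w_den, w_seg, w_cnt = weights
        density_loss = F.mse_loss(pred_density, gt_density)
        seg_loss = F.binary_cross_entropy_with_logits(pred_seg, gt_seg)
        count_loss = F.l1_loss(pred_count, gt_density.sum(dim=(1, 2, 3)))
        return w_den * density_loss + w_seg * seg_loss + w_cnt * count_loss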
[prediction, dataset, heterogeneous, work] [depth, estimation, computer, pattern, vision, international, approach, accurate, robust] [proposed, image, conference, method, ieee, input, figure, attribute, comparison, zhang, surveillance, mse, desired, background] [density, deep, scale, convolutional, neural, table, network, conv, effectiveness, size, optimize, csrnet, compared, stride] [model, decoder, generate, encoder, introduce, observed] [crowd, cnn, counting, count, map, three, segmentation, feature, mae, mall, segment, clutter, challenging, backbone, object, regression, propose, improve, global, semantic, leverage, module, attentive] [auxiliary, base, learning, task, main, loss, training, existing, datasets]
@InProceedings{Zhao_2019_CVPR,
  author = {Zhao, Muming and Zhang, Jian and Zhang, Chongyang and Zhang, Wenjun},
  title = {Leveraging Heterogeneous Auxiliary Tasks to Assist Crowd Counting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}