2020 CVPR Accepted Papers

Maintained by Matt Deitke
Adapted from Andrej Karpathy
Below each paper are the top 100 most frequently occurring words in that paper; each word is colored according to an LDA topic model with k = 7.
(It looks like 0 = attention, 1 = segmentation, 2 = adversarial attacks, 3 = image reconstruction, 4 = GANs, 5 = network pruning, 6 = 3D)
Unsupervised Learning of Probably Symmetric Deformable 3D Objects From Images in the Wild
Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi


We propose a method to learn 3D deformable object categories from raw single-view images, without external supervision. The method is based on an autoencoder that factors each input image into depth, albedo, viewpoint and illumination. In order to disentangle these components without supervision, we use the fact that many object categories have, at least in principle, a symmetric structure. We show that reasoning about illumination allows us to exploit the underlying object symmetry even if the appearance is not symmetric due to shading. Furthermore, we model objects that are probably, but not certainly, symmetric by predicting a symmetry probability map, learned end-to-end with the other components of the model. Our experiments show that this method can recover very accurately the 3D shape of human faces, cat faces and cars from single-view images, without any supervision or a prior shape model. On benchmarks, we demonstrate superior accuracy compared to another method that uses supervision at the level of 2D image correspondences.
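A minimal sketch of the kind of confidence-weighted reconstruction loss the abstract describes (editor's illustration, not the authors' code): the input is reconstructed both from the predicted components and from their horizontally flipped counterparts, and a per-pixel uncertainty map (the symmetry probability) downweights pixels where the symmetry assumption fails. The Laplacian-style weighting and all tensor shapes are assumptions.

import torch

def sym_recon_loss(recon, recon_flip, target, conf, conf_flip, eps=1e-6):
    # recon, recon_flip, target: (B, 3, H, W); conf, conf_flip: (B, 1, H, W) positive uncertainty maps
    l1 = (recon - target).abs().mean(dim=1, keepdim=True)
    l1_flip = (recon_flip - target).abs().mean(dim=1, keepdim=True)
    # larger uncertainty -> the error at that pixel counts less; the log term keeps uncertainty from growing unbounded
    nll = l1 / (conf + eps) + torch.log(conf + eps)
    nll_flip = l1_flip / (conf_flip + eps) + torch.log(conf_flip + eps)
    return (nll + nll_flip).mean()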
[multiple, work, order, dataset] [object, confidence, map, table, predicted, supervision, instance, side, extreme, category] [model, face, input, trained, flip, quality, facial, adversarial] [method, prior, deformable, illumination, pixel, raw, figure, output, light, perceptual, based, comparison] [image, loss, asymmetric, unsupervised, learn, cat, encoder, row, texture, synthetic, celeba, autoencoder] [learning, function, deep, training, baseline, consider, network, compared, test] [depth, reconstruction, shape, symmetry, symmetric, canonical, albedo, viewpoint, human, single, view, keypoint, shading, reconstruct, compare, michael, monocular, ground, lighting, truth, structure, david, andrea, geometry, second, allows, pose, bfm, predicts, modelling, require]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Shangzhe and Rupprecht, Christian and Vedaldi, Andrea},
  title = {Unsupervised Learning of Probably Symmetric Deformable 3D Objects From Images in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Footprints and Free Space From a Single Color Image
Jamie Watson, Michael Firman, Aron Monszpart, Gabriel J. Brostow


Understanding the shape of a scene from a single color image is a formidable computer vision task. However, most methods aim to predict the geometry of surfaces that are visible to the camera, which is of limited use when planning paths for robots or augmented reality agents. Such agents can only move when grounded on a traversable surface, which we define as the set of classes which humans can also walk over, such as grass, footpaths and pavement. Models which predict beyond the line of sight often parameterize the scene with voxels or meshes, which can be expensive to use in machine learning frameworks. We introduce a model to predict the geometry of both visible and occluded traversable surfaces, given a single RGB image as input. We learn from stereo video sequences, using camera poses, per-frame depth and semantic segmentation to form training data, which is used to supervise an image-to-image network. We train models from the KITTI driving dataset, the indoor Matterport dataset, and from our own casually captured stereo footage. We find that a surprisingly low bar for spatial coverage of training scenes is required. We validate our algorithm against a range of strong baselines, and include an assessment of our predictions for a path-planning task.
[hidden, predict, moving, planning, prediction, evaluation, frame, video, static, work, multiple, predicting] [segmentation, object, bounding, map, box, amodal, mask, detection, semantic, table, iou] [model, input, masking] [method, figure, pixel, color, captured, flow, lightweight] [image, loss, learn, representation, target, source, unsupervised, layout, train, introduce] [training, path, learning, data, set, space, test, evaluate, network, best] [depth, geometry, ground, visible, traversable, scene, single, camera, surface, estimate, stereo, freespace, estimation, footprint, floor, monocular, indoor, walkable, truth, estimated, straversable, suntraversable, kitti, matterport, approach, convex, human, view, voxel, hull, shape, voxels, capture]
@InProceedings{Watson_2020_CVPR,
  author = {Watson, Jamie and Firman, Michael and Monszpart, Aron and Brostow, Gabriel J.},
  title = {Footprints and Free Space From a Single Color Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dynamic Fluid Surface Reconstruction Using Deep Neural Network
Simron Thapa, Nianyi Li, Jinwei Ye


Recovering the dynamic fluid surface is a long-standing challenging problem in computer vision. Most existing image-based methods require multiple views or a dedicated imaging system. Here we present a learning-based single-image approach for 3D fluid surface reconstruction. Specifically, we design a deep neural network that estimates the depth and normal maps of a fluid surface by analyzing the refractive distortion of a reference background image. Due to the dynamic nature of fluid surfaces, our network uses recurrent layers that carry temporal information from previous frames to achieve spatio-temporally consistent reconstruction given a video input. Due to the lack of fluid data, we synthesize a large fluid dataset using physics-based fluid modeling and rendering techniques for network training and validation. Through experiments on simulated and real captured fluid images, we demonstrate that our proposed deep neural network trained on our fluid dataset can recover dynamic 3D fluid surfaces with high accuracy.
[temporal, dataset, water, recurrent, multiple, three, sequence, time, work, recognition] [map, predicted, background, object] [trained, distortion, difference, highly, input, model, original] [pattern, dynamic, reference, convolutional, recover, figure, light, subnet, existing, comparison, captured, classical, imaging] [image, loss, real, consistency, train, synthetic, perform] [network, neural, deep, training, learning, large, set, accuracy, data, metric, machine, problem] [fluid, depth, surface, normal, refraction, reconstruction, computer, conference, refractive, estimation, single, camera, vision, international, ground, fsrn, truth, compare, approach, error, accurate, assume, term, supplementary, shape, transparent, lambertian, enforcing]
@InProceedings{Thapa_2020_CVPR,
  author = {Thapa, Simron and Li, Nianyi and Ye, Jinwei},
  title = {Dynamic Fluid Surface Reconstruction Using Deep Neural Network},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CvxNet: Learnable Convex Decomposition
Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, Andrea Tagliasacchi


Any solid object can be decomposed into a collection of convex polytopes (in short, convexes). When a small number of convexes are used, such a decomposition can be thought of as a piece-wise approximation of the geometry. This decomposition is fundamental in computer graphics, where it provides one of the most common ways to approximate geometry, for example, in real-time physics simulation. A convex object also has the property of being simultaneously an explicit and implicit representation: one can interpret it explicitly as a mesh derived by computing the vertices of a convex hull, or implicitly as the collection of half-space constraints or support functions. Their implicit representation makes them particularly well suited for neural network training, as they abstract away from the topology of the geometry they need to represent. However, at testing time, convexes can also generate explicit representations - polygonal meshes - which can then be used in any downstream application. We introduce a network architecture to represent a low dimensional family of convexes. This family is automatically derived via an auto-encoding process. We investigate the applications of this architecture including automatic convex decomposition, image to 3D reconstruction, and part-based shape retrieval.
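As a rough illustration of the implicit half-space view described above (editor's sketch, not the released CvxNet code), one convex can be represented by H learned hyperplanes n_i . x + d_i <= 0 and queried differentiably with a LogSumExp relaxation of the intersection; the names, shapes and sharpness constant are assumptions.

import torch

def convex_indicator(x, normals, offsets, sharpness=75.0):
    # x: (N, 3) query points; normals: (H, 3); offsets: (H,)
    h = x @ normals.t() + offsets                                    # (N, H) signed half-space values
    soft_max_h = torch.logsumexp(sharpness * h, dim=1) / sharpness   # smooth max over half-spaces
    return torch.sigmoid(-sharpness * soft_max_h)                    # ~1 inside the convex, ~0 outside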
[recognition, decoder, represent, explicit, modeling] [object, employ, table] [input, model, hyperplane, google] [figure, pattern, ieee, method, solid, resolution, convolutional, output] [representation, loss, image, encoder, train, latent] [learning, function, note, number, approximate, neural, indicator, deep, set, training, network, architecture, data, performance, arxiv, preprint, efficient] [convex, computer, shape, vision, conference, surface, reconstruction, sif, point, decomposition, convexes, collection, geometry, occnet, distance, single, implicit, mesh, acm, volumetric, signed, polygonal, international, hao, hyperplanes, leonidas, michael, geometric, differentiable, atlasnet, david, andrea, cvxnet, smooth, partnet, siggraph, thomas, european, andreas, well]
@InProceedings{Deng_2020_CVPR,
  author = {Deng, Boyang and Genova, Kyle and Yazdani, Soroosh and Bouaziz, Sofien and Hinton, Geoffrey and Tagliasacchi, Andrea},
  title = {CvxNet: Learnable Convex Decomposition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
BSP-Net: Generating Compact Meshes via Binary Space Partitioning
Zhiqin Chen, Andrea Tagliasacchi, Hao Zhang


Polygonal meshes are ubiquitous in the digital 3D domain, yet they have only played a minor role in the deep learning revolution. Leading methods for learning generative models of shapes rely on implicit functions, and generate meshes only after expensive iso-surfacing routines. To overcome these challenges, we are inspired by a classical spatial data structure from computer graphics, Binary Space Partitioning (BSP), to facilitate 3D learning. The core ingredient of BSP is an operation for recursive subdivision of space to obtain convex sets. By exploiting this property, we devise BSP-Net, a network that learns to represent a 3D shape via convex decomposition. Importantly, BSP-Net is unsupervised since no convex shape decompositions are needed for training. The network is trained to reconstruct a shape using a set of convexes obtained from a BSP-tree built on a set of planes. The convexes inferred by BSP-Net can be easily extracted to form a polygon mesh, without any need for iso-surfacing. The generated meshes are compact (i.e., low-poly) and well suited to represent sharp geometry; they are guaranteed to be watertight and can be easily parameterized. We also show that the reconstruction quality by BSP-Net is competitive with state-of-the-art methods while using much fewer primitives. Code is available at https://github.com/czq142857/BSP-NET-original.
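The grouping structure in the abstract (planes grouped into convexes, convexes unioned into a shape) can be written down compactly; the sketch below is the editor's illustration of that composition at query time, with a binary plane-to-convex matrix T standing in for the learned BSP grouping (all names and shapes are assumptions, and the released code differs).

import numpy as np

def bsp_inside(points, planes, T):
    # points: (N, 3); planes: (P, 4) rows [a, b, c, d] of a*x + b*y + c*z + d; T: (P, C) binary grouping
    homog = np.concatenate([points, np.ones((len(points), 1))], axis=1)   # (N, 4)
    d = homog @ planes.T                                                  # (N, P) signed plane values
    violation = np.maximum(d, 0.0) @ T   # (N, C): zero only if every plane of convex c is satisfied
    return violation.min(axis=1) == 0.0  # union of convexes: inside the shape if any convex contains the point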
[represent, structured, visual, hierarchical, work] [segmentation, table, semantic, stage, feature, object, edge, inside, focus] [model, trained, input] [figure, sharp, method, output, convolutional, tree, field] [generative, generate, representation, encoder, train, generating, corresponding, learns, generated] [network, training, learning, deep, layer, note, number, binary, set, neural, space, size, compact, data, operation, approximate, learned] [shape, convex, point, reconstruction, hao, surface, implicit, convexes, single, distance, leonidas, polygonal, mesh, structure, well, plane, acm, geometry, collection, chamfer, volumetric, atlasnet, continuous, primitive, daniel, reconstruct, watertight, smooth, shapenet]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Zhiqin and Tagliasacchi, Andrea and Zhang, Hao},
  title = {BSP-Net: Generating Compact Meshes via Binary Space Partitioning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes From a Single Image
Yinyu Nie, Xiaoguang Han, Shihui Guo, Yujian Zheng, Jian Chang, Jian Jun Zhang


Semantic reconstruction of indoor scenes refers to both scene understanding and object reconstruction. Existing works either address one part of this problem or focus on independent objects. In this paper, we bridge the gap between understanding and reconstruction, and propose an end-to-end solution to jointly reconstruct room layout, object bounding boxes and meshes from a single image. Instead of separately resolving scene understanding and object reconstruction, our method builds upon a holistic scene context and proposes a coarse-to-fine hierarchy with three components: 1. room layout with camera pose; 2. 3D object bounding boxes; 3. object meshes. We argue that understanding the context of each component can assist the task of parsing the others, which enables joint understanding and reconstruction. The experiments on the SUN RGB-D and Pix3D datasets demonstrate that our method consistently outperforms existing methods in indoor layout estimation, 3D object detection and mesh reconstruction.
[understanding, relational, predict, len, work, context, relation] [object, detection, bounding, sun, table, box, threshold, improves, center, feature, siyuan, parsing, predicted] [improve, input] [method, ieee, pattern, figure, spatial, existing] [layout, loss, image, generation, target, modification] [learning, network, training, average, deep, evaluate, arxiv, preprint, observe, better, performance, density] [mesh, scene, reconstruction, computer, conference, joint, camera, vision, pose, indoor, single, shape, distance, point, topology, estimation, room, international, full, approach, cloud, local, compare, reconstructed, mgn, reconstruct, voxel, implicit, odn, tmn, lco, european, rgbd, directly, geometry]
@InProceedings{Nie_2020_CVPR,
  author = {Nie, Yinyu and Han, Xiaoguang and Guo, Shihui and Zheng, Yujian and Chang, Jian and Zhang, Jian Jun},
  title = {Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes From a Single Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Generating and Exploiting Probabilistic Monocular Depth Estimates
Zhihao Xia, Patrick Sullivan, Ayan Chakrabarti


Beyond depth estimation from a single image, the monocular cue is useful in a broader range of depth inference applications and settings---such as when one can leverage other available depth cues for improved accuracy. Currently, different applications, with different inference tasks and combinations of depth cues, are solved via different specialized networks---trained separately for each application. Instead, we propose a versatile task-agnostic monocular model that outputs a probability distribution over scene depth given an input color image, as a sample approximation of outputs from a patch-wise conditional VAE. We show that this distributional output can be used to enable a variety of inference tasks in different settings, without needing to retrain for each application. Across a diverse set of applications (depth completion, user guided estimation, etc.), our common model yields results with high accuracy---comparable to or surpassing that of state-of-the-art methods dependent on application-specific networks.
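To make the "distributional output reused across tasks" idea concrete, here is an editor's toy example (not the paper's method) of using K depth samples for sparse-to-dense completion: each sample is scored against whatever sparse measurements are available and the samples are fused with those scores, with no retraining of the depth network.

import numpy as np

def complete_from_samples(depth_samples, sparse_depth, sparse_mask, temperature=1.0):
    # depth_samples: (K, H, W) samples from the monocular depth distribution
    # sparse_depth: (H, W) measured depths; sparse_mask: (H, W) bool, True where a measurement exists
    errs = np.array([np.abs(d[sparse_mask] - sparse_depth[sparse_mask]).mean() for d in depth_samples])
    weights = np.exp(-errs / temperature)
    weights /= weights.sum()
    return np.tensordot(weights, depth_samples, axes=1)   # (H, W) fused dense estimate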
[multiple, predict, describe] [map, global, wang, table, guided] [model, input, trained] [method, output, color, patch, levin, based, simply, spatial, figure] [image, conditional, generating, user, common, vae, train, generate, diverse, latent, separate, corresponding, plausible] [network, distribution, inference, distributional, set, learning, number, deep, accuracy, min, sample, note, random, evaluate, probabilistic, neural, better, performance, report, improved, standard, ordinal, sampling, test, find, consider] [depth, monocular, estimation, estimate, sparse, single, approach, scene, cost, additional, completion, form, accurate, dense, kpi, variety, well, overlapping, application, ambiguity, joint, define]
@InProceedings{Xia_2020_CVPR,
  author = {Xia, Zhihao and Sullivan, Patrick and Chakrabarti, Ayan},
  title = {Generating and Exploiting Probabilistic Monocular Depth Estimates},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Cages for Detail-Preserving 3D Deformations
Wang Yifan, Noam Aigerman, Vladimir G. Kim, Siddhartha Chaudhuri, Olga Sorkine-Hornung


We propose a novel learnable representation for detail preserving shape deformation. The goal of our method is to warp a source shape to match the general structure of a target shape, while preserving the surface details of the source. Our method extends a traditional cage-based deformation technique, where the source shape is enclosed by a coarse control mesh termed cage, and translations prescribed on the cage vertices are interpolated to any point on the source mesh via special weight functions. The use of this sparse cage scaffolding enables preserving surface details regardless of the shape's intricacy and topology. Our key contribution is a novel neural network architecture for predicting deformations by controlling the cage. We incorporate a differentiable cage-based deformation module in our architecture, and train our network end-to-end. Our method can be trained with common collections of 3D models in an unsupervised fashion, without any cage-specific annotations. We demonstrate the utility of our method for synthesizing shape variations and deformation transfer.
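The cage-based deformation module that the network wraps reduces to a weighted interpolation of cage-vertex offsets; the snippet below is an editor's sketch of that step, assuming the interpolation weights (e.g. mean value coordinates) have already been computed for the source mesh.

import numpy as np

def deform_with_cage(weights, cage_vertices, cage_offsets):
    # weights: (N, C) interpolation weights of N source vertices w.r.t. C cage vertices (rows sum to 1)
    # cage_vertices, cage_offsets: (C, 3); the network predicts cage_offsets
    deformed_cage = cage_vertices + cage_offsets
    return weights @ deformed_cage   # (N, 3) deformed source vertices, surface detail carried along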
[static, predict, prediction, dataset, work] [template, table, module, predicted] [model, fig, trained, distortion, example, quality, technique] [method, figure, detail, output, traditional] [source, target, preserving, alignment, loss, preservation, train, transfer, lshape, learn, corresponding, fine, generate, generative] [network, learning, space, training, optimization, optimize, weight, neural, alternative, better, set] [deformation, shape, cage, deformed, novel, distance, acm, mesh, pose, local, geometric, single, deform, surface, coarse, approach, chamfer, match, point, deforming, human, cbd, lnormal, compare, hao, sparse, correspondence, geometry, well, vladimir, enclosed, mvc]
@InProceedings{Yifan_2020_CVPR,
  author = {Yifan, Wang and Aigerman, Noam and Kim, Vladimir G. and Chaudhuri, Siddhartha and Sorkine-Hornung, Olga},
  title = {Neural Cages for Detail-Preserving 3D Deformations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization
Shunsuke Saito, Tomas Simon, Jason Saragih, Hanbyul Joo


Recent advances in image-based 3D human shape estimation have been driven by the significant improvement in representation power afforded by deep neural networks. Although current approaches have demonstrated the potential in real world settings, they still fail to produce reconstructions with the level of detail often present in the input images. We argue that this limitation stems primarily from two conflicting requirements: accurate predictions require large context, but precise predictions require high resolution. Due to memory limitations in current hardware, previous approaches tend to take low resolution images as input to cover large spatial context, and produce less precise (or low resolution) 3D estimates as a result. We address this limitation by formulating a multi-level architecture that is end-to-end trainable. A coarse level observes the whole image at lower resolution and focuses on holistic reasoning. This provides context to a fine level which estimates highly detailed geometry by observing higher-resolution images. We demonstrate that our approach significantly outperforms existing state-of-the-art techniques on single image human shape reconstruction by fully leveraging 1k-resolution input images.
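A hedged sketch of a pixel-aligned implicit query in the spirit of the coarse/fine description above (editor's illustration; the feature extractors, the MLP and all shapes are placeholders, not the released PIFuHD model):

import torch
import torch.nn.functional as F

def query_occupancy(points_2d, z, feat_coarse, feat_fine, mlp):
    # points_2d: (B, N, 2) projected 3D points in [-1, 1]; z: (B, N, 1) depth values
    # feat_coarse / feat_fine: (B, C, H, W) feature maps from the low- and high-resolution branches
    grid = points_2d.unsqueeze(2)                                                            # (B, N, 1, 2)
    f_c = F.grid_sample(feat_coarse, grid, align_corners=True).squeeze(-1).transpose(1, 2)   # (B, N, C)
    f_f = F.grid_sample(feat_fine, grid, align_corners=True).squeeze(-1).transpose(1, 2)     # (B, N, C)
    return torch.sigmoid(mlp(torch.cat([f_c, f_f, z], dim=-1)))                              # (B, N, 1) occupancy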
[embedding, context, work, people, reasoning, predict] [feature, level, module, global, holistic, semantic, precise, fully, final] [input, model, quality] [resolution, ieee, high, pattern, method, figure, output, detail, window] [image, fine, representation, produce, loss, texture] [function, network, higher, space, learning, training, sampling, large, memory, inference, neural, problem, set, size, deep] [human, computer, conference, geometry, pifu, implicit, vision, shape, single, reconstruction, surface, normal, coarse, occupancy, depth, approach, clothed, backside, mlp, international, estimation, limited, acm, detailed, body, additional, parametric, volume, renderpeople, accurate]
@InProceedings{Saito_2020_CVPR,
  author = {Saito, Shunsuke and Simon, Tomas and Saragih, Jason and Joo, Hanbyul},
  title = {PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Lighting-Invariant Point Processor for Shading
Kathryn Heal, Jialiang Wang, Steven J. Gortler, Todd Zickler


Under the conventional diffuse shading model with unknown directional lighting, the set of quadratic surface shapes that are consistent with the spatial derivatives of intensity at a single image point is a two-dimensional algebraic variety embedded in the five-dimensional space of quadratic shapes. We describe the geometry of this variety, and we introduce a concise feedforward model that computes an explicit, differentiable approximation of the variety from the intensity and its derivatives at any single image point. The result is a parallelizable processor that operates at each image point and produces a lighting-invariant descriptor of the continuous set of compatible surface shapes at the point. We describe two applications of this processor: two-shot uncalibrated photometric stereo and quadratic-surface shape from shading.
[three, pair, explicit, order] [positive, map] [input, model, compatible, true] [figure, light, intensity, likelihood, output, spatial, analysis, intermediate, pattern] [image, representation, extended, real, domain, unknown] [set, neural, network, quadratic, function, space, consider, training, sample, efficient, subset, knowledge, parametrized, choice, group, entire, simple] [shape, fyy, surface, local, fxy, point, fxx, shading, lighting, variety, processor, measurement, single, orientation, consistent, algebraic, iyy, ixx, ixy, intersection, computer, continuous, photometric, vision, contained, assume, uncalibrated, second, form, additional, direction, stereo, polynomial, well, represented, approach, ambiguity, david]
@InProceedings{Heal_2020_CVPR,
  author = {Heal, Kathryn and Wang, Jialiang and Gortler, Steven J. and Zickler, Todd},
  title = {A Lighting-Invariant Point Processor for Shading},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ActiveMoCap: Optimized Viewpoint Selection for Active Human Motion Capture
Sena Kiciroglu, Helge Rhodin, Sudipta N. Sinha, Mathieu Salzmann, Pascal Fua


The accuracy of monocular 3D human pose estimation depends on the viewpoint from which the image is captured. While freely moving cameras, such as on drones, provide control over this viewpoint, automatically positioning them at the location which will yield the highest accuracy remains an open problem. This is the problem that we address in this paper. Specifically, given a short video sequence, we introduce an algorithm that predicts which viewpoints should be chosen to capture future frames so as to maximize 3D human pose estimation accuracy. The key idea underlying our approach is a method to estimate the uncertainty of the 3D body pose estimates. We integrate several sources of uncertainty, originating from deep learning based regressors and temporal smoothness. Our motion planner yields improved 3D body pose estimates and outperforms or matches existing ones that are based on person following and orbiting.
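The selection rule itself is simple once an uncertainty estimate is available; the two-liner below is the editor's distillation of the abstract (estimate_uncertainty is a placeholder for the paper's combined regressor and temporal-smoothness uncertainty):

def pick_next_viewpoint(candidate_viewpoints, estimate_uncertainty):
    # choose the viewpoint whose simulated future measurement yields the lowest predicted 3D pose uncertainty
    return min(candidate_viewpoints, key=estimate_uncertainty)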
[future, moving, multiple, trajectory, time, current, placement, video, predict, work, temporal, planner] [predicted, location, aerial, autonomous] [constant, subject, model, noise, input] [motion, figure, pattern, based, noisy] [image, control, real, person, realistic] [active, set, algorithm, optimize, candidate, energy, accuracy, best, random, optimal, function, distribution, deep, learning, acceleration, average, maximize, posterior, variance, find, optimization, evaluate] [pose, drone, human, estimation, camera, uncertainty, conference, computer, view, capture, estimate, rotation, epose, monocular, approach, vision, error, eproj, international, reconstruction, position, flight, bone, projection, ground, viewpoint, relative, truth, acm, body, estimating, distance, simulated, single, robotics, joint]
@InProceedings{Kiciroglu_2020_CVPR,
  author = {Kiciroglu, Sena and Rhodin, Helge and Sinha, Sudipta N. and Salzmann, Mathieu and Fua, Pascal},
  title = {ActiveMoCap: Optimized Viewpoint Selection for Active Human Motion Capture},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Peek-a-Boo: Occlusion Reasoning in Indoor Scenes With Plane Representations
Ziyu Jiang, Buyu Liu, Samuel Schulter, Zhangyang Wang, Manmohan Chandraker


We address the challenging task of occlusion-aware indoor 3D scene understanding. We represent scenes by a set of planes, where each one is defined by its normal, offset and two masks outlining (i) the extent of the visible part and (ii) the full region that consists of both visible and occluded parts of the plane. We infer these planes from a single input image with a novel neural network architecture. It consists of a two-branch category-specific module that aims to predict layout and objects of the scene separately so that different types of planes can be handled better. We also introduce a novel loss function based on plane warping that can leverage multiple views at training time for improved occlusion-aware reasoning. In order to train and evaluate our occlusion-reasoning model, we use the ScanNet dataset and propose (i) a strategy to automatically extract ground truth for both visible and hidden regions and (ii) a new evaluation metric that specifically focuses on the prediction in hidden regions. We empirically demonstrate that our proposed approach can achieve higher accuracy for occlusion reasoning compared to competitive baselines on the ScanNet dataset, e.g. 42.65% relative improvement on hidden regions.
[prediction, hidden, reasoning, dataset, predict, evaluation, multiple, work, predicting, described] [mask, object, occlusion, semantic, occluded, region, foreground, offset, propose, module, branch, merging, segmentation, aph, detection, predicted, main, table, improves, background] [model, input] [proposed, warping, ieee, based, method, figure, pattern, column, extend] [layout, representation, image, loss, address, learn, introduce] [training, network, set, metric, problem, better, performance, data, compared, architecture, average, precision] [plane, visible, complete, depth, scene, computer, conference, novel, ground, single, dualrpn, indoor, full, vision, truth, normal, scannet, floor, demonstrate, relative, view, camera, pgj, approach, predicts, layered, well]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Ziyu and Liu, Buyu and Schulter, Samuel and Wang, Zhangyang and Chandraker, Manmohan},
  title = {Peek-a-Boo: Occlusion Reasoning in Indoor Scenes With Plane Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Modal Domain Adaptation for Fine-Grained Action Recognition
Jonathan Munro, Dima Damen


Fine-grained action recognition datasets exhibit environmental bias, where multiple video sequences are captured from a limited number of environments. Training a model in one environment and deploying in another results in a drop in performance due to an unavoidable domain shift. Unsupervised Domain Adaptation (UDA) approaches have frequently utilised adversarial training between the source and target domains. However, these approaches have not explored the multi-modal nature of video within each domain. In this work we exploit the correspondence of modalities as a self-supervised alignment approach for UDA in addition to adversarial alignment (Fig. 1). We test our approach on three kitchens from the large-scale EPIC-Kitchens dataset, using two modalities commonly employed for action recognition: RGB and Optical Flow. We show that multi-modal self-supervision alone improves the performance over source-only training by 2.4% on average. We then combine adversarial training with multi-modal self-supervision, showing that our approach outperforms other UDA methods by 3%.
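An editor's sketch of the multi-modal self-supervision signal described above (shapes and module names are assumptions): a small classifier predicts whether an RGB feature and a Flow feature come from the same clip, an objective that can be applied to unlabelled target-domain clips as well as labelled source clips.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrespondenceHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.classifier = nn.Linear(2 * dim, 2)    # same-clip vs. mismatched pair

    def forward(self, rgb_feat, flow_feat):        # both: (B, dim)
        return self.classifier(torch.cat([rgb_feat, flow_feat], dim=-1))

def correspondence_loss(head, rgb_feat, flow_feat):
    B = rgb_feat.size(0)
    pos = head(rgb_feat, flow_feat)                            # aligned RGB/Flow pairs
    neg = head(rgb_feat, flow_feat.roll(shifts=1, dims=0))     # rolled flow features -> mismatched pairs
    logits = torch.cat([pos, neg], dim=0)
    labels = torch.cat([torch.ones(B, dtype=torch.long), torch.zeros(B, dtype=torch.long)])
    return F.cross_entropy(logits, labels)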
[action, recognition, modality, video, multiple, temporal, outperforms, work, three, late] [feature, table, labelled, ablation] [adversarial, trained, model, input, datasets, robust] [pattern, flow, proposed, figure, method, convolutional, fusion] [domain, target, source, adaptation, alignment, uda, discrepancy, unsupervised, supervised, optimised, train, loss, discriminative, separate, discriminator, corresponding, representation, aligning] [training, learning, performance, deep, unlabelled, data, neural, machine, compared, number, distribution, classifier, classification, accuracy, test, average, task, andrew, applied, layer, evaluate, space, network] [vision, computer, conference, rgb, international, approach, european, correspondence, single, well]
@InProceedings{Munro_2020_CVPR,
  author = {Munro, Jonathan and Damen, Dima},
  title = {Multi-Modal Domain Adaptation for Fine-Grained Action Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Evolving Losses for Unsupervised Video Representation Learning
AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo


We present a new method to learn video representations from large-scale unlabeled video data. Ideally, this representation will be generic and transferable, directly usable for new tasks such as action recognition and zero or few-shot learning. We formulate unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are shared across different modalities via distillation. Further, we introduce the concept of loss function evolution by using an evolutionary search algorithm to automatically find optimal combination of loss functions capturing many (self-supervised) tasks and modalities. Thirdly, we propose an unsupervised representation evaluation metric using distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law. This unsupervised constraint, which is not guided by any labeling, produces similar results to weakly-supervised, task-specific ones. The proposed unsupervised representation learning results in a single RGB network and outperforms previous methods. Notably, it is also more effective than several label-based methods (e.g., ImageNet), with the exception of large, fully labeled video datasets.
[video, recognition, kinetics, temporal, fitness, multiple, modality, audio, automatically, frame, hmdb, action, visual, future, previous, zipf, dataset, temporally, activity, order] [final, feature, table, main] [trained, datasets, model] [ieee, method, advantage, pattern, prior, based, figure, flow] [unsupervised, loss, representation, train, supervised, learn] [learning, distillation, unlabeled, data, task, network, function, labeled, evolution, number, find, clustering, training, evolutionary, search, distribution, outperform, random, arxiv, preprint, large, evolved, set, performance, classification, randomly, shuffle, baseline, andrew, entire, neural, power] [conference, vision, computer, rgb, matching, international, human, compare, european, law]
@InProceedings{Piergiovanni_2020_CVPR,
  author = {Piergiovanni, AJ and Angelova, Anelia and Ryoo, Michael S.},
  title = {Evolving Losses for Unsupervised Video Representation Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition
Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, Wanli Ouyang


Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.
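The "disentangled multi-scale" aggregation mentioned above has a compact form; the function below is the editor's reading of it (not the released code): the k-th scale keeps only node pairs whose shortest-path distance is exactly k, instead of using raw adjacency powers that are biased toward nearby nodes.

import numpy as np

def k_adjacency(A, k):
    # A: (V, V) binary skeleton adjacency without self-loops
    if k == 0:
        return np.eye(len(A))
    I = np.eye(len(A))
    reach_k = np.linalg.matrix_power(A + I, k) > 0        # reachable within k hops
    reach_km1 = np.linalg.matrix_power(A + I, k - 1) > 0  # reachable within k-1 hops
    return (reach_k & ~reach_km1).astype(float)           # exactly k hops apart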
[graph, skeleton, action, temporal, recognition, ntu, adjacency, node, factorized, stgc, modeling, pathway, powerful, kinetics, work, time, artificial, three] [aggregation, feature, table, unified, biased, extra, mask, module] [model, robust] [spatial, convolutional, ieee, proposed, conv, pattern, residual, window, existing, block, dilation, flow, receptive] [disentangled, jun] [larger, learning, neural, set, weighting, number, matrix, scheme, performance, higher, accuracy, arxiv, preprint, deep, closer, network, large, layer, observe] [conference, human, joint, computer, vision, international, capture, complex, connectivity, body, neighborhood, dense]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Ziyu and Zhang, Hongwen and Chen, Zhenghao and Wang, Zhiyong and Ouyang, Wanli},
  title = {Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Multigrid Method for Efficiently Training Video Models
Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, Philipp Krahenbuhl


Training competitive deep video models is an order of magnitude slower than training their counterpart image models. Slow training causes long research cycles, which hinders progress in video understanding research. Following standard practice for training image models, video model training has used a fixed mini-batch shape: a specific number of clips, frames, and spatial size. However, what is the optimal shape? High resolution models perform well, but train slowly. Low resolution models train faster, but are less accurate. Inspired by multigrid methods in numerical optimization, we propose to use variable mini-batch shapes with different spatial-temporal resolutions that are varied according to a schedule. The different shapes arise from resampling the training data on multiple sampling grids. Training is accelerated by scaling up the mini-batch size and learning rate when shrinking the other dimensions. We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU). As an illustrative example, the proposed multigrid method trains a ResNet-50 SlowFast network 4.5x faster (wall-clock time, same hardware) while also improving accuracy (+0.8% absolute) on Kinetics-400 compared to baseline training. Code is available online.
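A sketch of the coupled scaling the abstract describes (editor's illustration; the concrete clip shapes, stage lengths and multipliers below are made up, not the paper's schedule): when the clip shape shrinks, the mini-batch size grows by roughly the same factor so the cost per iteration stays constant, and the learning rate is scaled with the batch size.

long_cycle = [            # (frames, height, width, batch multiplier)
    (4, 112, 112, 8),
    (8, 112, 112, 4),
    (8, 224, 224, 2),
    (16, 224, 224, 1),    # full-size shape used at the end of training
]

def settings_for_stage(stage, base_batch, base_lr):
    frames, h, w, mult = long_cycle[stage % len(long_cycle)]
    return frames, (h, w), base_batch * mult, base_lr * mult   # linear learning-rate scaling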
[video, short, long, time, temporal, action, kinetics, slowfast, work, understanding, dataset, recipe, span] [table, kaiming, faster, stride, obtains] [model, input, generalization, constant] [spatial, method, convolutional] [cycle, train, variable, common, image, source] [training, multigrid, baseline, learning, size, data, number, sampling, rate, schedule, speedup, default, accuracy, deep, large, performance, design, neural, random, gpu, scaling, better, efficient, larger, network, typically, smaller, iteration, small, observe, gain, set, implementation, standard, gpus, compared, hardware, lower, computation, fewer, note] [grid, shape, single, consistent, well]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Chao-Yuan and Girshick, Ross and He, Kaiming and Feichtenhofer, Christoph and Krahenbuhl, Philipp},
  title = {A Multigrid Method for Efficiently Training Video Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Ego-Topo: Environment Affordances From Egocentric Video
Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman


First-person video naturally brings the use of a physical environment to the forefront, since it shows the camera wearer interacting fluidly in a space based on his intentions. However, current methods largely separate the observed actions from the persistent space itself. We introduce a model for environment affordances that is learned directly from egocentric video. The main idea is to gain a human-centric model of a physical space (such as a kitchen) that captures (1) the primary spatial zones of interaction and (2) the likely activities they support. Our approach decomposes a space into a topological map derived from first-person activity, organizing an ego-video into a series of visits to the different zones. Further, we show how to link zones across multiple related environments (e.g., from videos of multiple kitchens) to obtain a consolidated representation of environment functionality. On EPIC-Kitchens and EGTEA+, we demonstrate our approach for learning scene affordances and anticipating future actions in long-form video.
[video, action, egocentric, environment, affordance, graph, node, long, anticipation, affordances, activity, visual, future, link, multiple, zone, kitchen, temporal, epic, work, interaction, time, people, frame, understanding, opo, predict, outperforms, consolidated, visit, build, anticipate, encoding, clip, observed, anticipating, structured] [map, object, localization, linking, feature] [topological, model, physical, create, persistent] [based, spatial, figure, range] [representation, person, image] [learning, network, space, set, similarity, training, performance, classifier, binary] [scene, slam, human, approach, single, camera, allows, leverage, grid, capture, coherent, uniformly, view]
@InProceedings{Nagarajan_2020_CVPR,
  author = {Nagarajan, Tushar and Li, Yanghao and Feichtenhofer, Christoph and Grauman, Kristen},
  title = {Ego-Topo: Environment Affordances From Egocentric Video},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Generative Hybrid Representations for Activity Forecasting With No-Regret Learning
Jiaqi Guan, Ye Yuan, Kris M. Kitani, Nicholas Rhinehart


Automatically reasoning about future human behaviors is a difficult problem but has significant practical applications to assistive systems. Part of this difficulty stems from learning systems' inability to represent all kinds of behaviors. Some behaviors, such as motion, are best described with continuous representations, whereas others, such as picking up a cup, are best described with discrete representations. Furthermore, human behavior is generally not fixed: people can change their habits and routines. This suggests these systems must be able to learn and adapt continuously. In this work, we develop an efficient deep generative model to jointly forecast a person's future discrete actions and continuous motions. On a large-scale egocentric dataset, EPIC-KITCHENS, we observe our method generates high-quality and diverse samples while exhibiting better generalization than related generative models. Finally, we propose a variant to continually learn our model from streaming data, observe its practical effectiveness, and theoretically justify its learning efficiency.
[action, trajectory, future, forecasting, activity, egocentric, regret, kris, work, policy, time, imitation, context, modeling, behavior, forecast, simulator, sequence, forecasted, predict, outperforms] [recall, denotes, propose] [model, streaming, noise, quality, help] [ieee, method, pattern, reverse, invertible, prior, figure, exact, proposed] [generative, cross, discriminative, learn, generate, loss, variational, conditioned, diverse] [learning, online, distribution, data, entropy, arxiv, preprint, sample, discrete, log, forward, evaluate, function, average, set, deep, training, calculate, better, batch, algorithm, top, performance] [conference, computer, vision, joint, human, continuous, international, hybrid, european, jointly, approach]
@InProceedings{Guan_2020_CVPR,
  author = {Guan, Jiaqi and Yuan, Ye and Kitani, Kris M. and Rhinehart, Nicholas},
  title = {Generative Hybrid Representations for Activity Forecasting With No-Regret Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Skeleton-Based Action Recognition With Shift Graph Convolutional Network
Ke Cheng, Yifan Zhang, Xiangyu He, Weihan Chen, Jian Cheng, Hanqing Lu


Action recognition with skeleton data is attracting more attention in computer vision. Recently, graph convolutional networks (GCNs), which model the human body skeletons as spatiotemporal graphs, have obtained remarkable performance. However, the computational complexity of GCN-based methods is quite heavy, typically over 15 GFLOPs for one action sample. Recent works even reach about 100 GFLOPs. Another shortcoming is that the receptive fields of both spatial graph and temporal graph are inflexible. Although some works enhance the expressiveness of spatial graph by introducing incremental adaptive modules, their performance is still limited by regular GCN structures. In this paper, we propose a novel shift graph convolutional network (Shift-GCN) to overcome both shortcomings. Instead of using heavy regular graph convolutions, our Shift-GCN is composed of novel shift graph operations and lightweight point-wise convolutions, where the shift graph operations provide flexible receptive fields for both spatial graph and temporal graph. On three datasets for skeleton-based action recognition, the proposed Shift-GCN notably exceeds the state-of-the-art methods with more than 10 times less computational complexity.
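To illustrate the shift-plus-pointwise pattern (editor's sketch only; the paper shifts features across both graph neighbours and time, while this toy module shows just a temporal shift followed by a 1x1 convolution):

import torch
import torch.nn as nn

class TemporalShiftPointwise(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)   # the only learned mixing

    def forward(self, x):                      # x: (B, C, T, V) = batch, channels, frames, joints
        c = x.size(1) // 4
        shifted = x.clone()
        shifted[:, :c, 1:] = x[:, :c, :-1]             # a quarter of the channels look one frame back
        shifted[:, c:2 * c, :-1] = x[:, c:2 * c, 1:]   # another quarter look one frame ahead
        return self.pointwise(shifted)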
[shift, graph, temporal, regular, action, skeleton, gcn, recognition, node, three, ntu, spatiotemporal, naive, dataset, gcns, outperforms, video, attention, combining, current] [table, feature, propose, shifted, ablation, effectiveness] [model, datasets, exhaustive, verify] [convolution, spatial, receptive, adaptive, ieee, field, convolutional, pattern, adjacent, kernel, cnns, proposed, gflops, lightweight, fusion, partition, figure, channel, learnable] [diverse] [operation, computational, data, network, set, size, top, neural, accuracy, efficient, denote, computation, learning, baseline, performance, larger, compared, typically, matrix] [conference, computer, vision, local, human, neighbor, body]
@InProceedings{Cheng_2020_CVPR,
  author = {Cheng, Ke and Zhang, Yifan and He, Xiangyu and Chen, Weihan and Cheng, Jian and Lu, Hanqing},
  title = {Skeleton-Based Action Recognition With Shift Graph Convolutional Network},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Predicting Goal-Directed Human Attention Using Inverse Reinforcement Learning
Zhibo Yang, Lihan Huang, Yupei Chen, Zijun Wei, Seoyoung Ahn, Gregory Zelinsky, Dimitris Samaras, Minh Hoai


Human gaze behavior prediction is important for behavioral vision and for computer vision applications. Most models mainly focus on predicting free-viewing behavior using saliency maps, but do not generalize to goal-directed behavior, such as when a person searches for a visual target object. We propose the first inverse reinforcement learning (IRL) model to learn the internal reward function and policy used by humans during visual search. We modeled the viewer's internal belief states as dynamic contextual belief maps of object locations. These maps were learned and then used to predict behavioral scanpaths for multiple target categories. To train and evaluate our IRL model we created COCO-Search18, which is now the largest dataset of high-quality search fixations in existence. COCO-Search18 has 10 participants searching for each of 18 target-object categories in 6202 images, making about 300,000 goal-directed fixations. When trained and evaluated on COCO-Search18, the IRL model outperformed baseline models in predicting search fixation scanpaths, both in terms of similarity to human search behavior and search efficiency. Finally, reward maps recovered by the IRL model reveal distinctive target-dependent patterns of object prioritization, which we interpret as a learned object context.
[visual, irl, fixation, state, scanpath, reward, attention, policy, context, behavior, belief, prediction, predict, predicting, reinforcement, dcb, dataset, scanpaths, sequence, people, multiple, behavioral] [object, category, saliency, map, contextual, mask, table, location] [model, input, eye, adversarial, gaze, trained] [ieee, based, inverse, figure, spatial, pattern, guidance, convolutional, journal, gregory, dynamic, comparison] [target, image, discriminator, representation, generated, train, generator, learn] [search, network, probability, training, learning, function, searching, data, test, task, deep, performance, number, learned, algorithm, ratio, neural, processing, class] [human, conference, computer, vision, scene, international]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Zhibo and Huang, Lihan and Chen, Yupei and Wei, Zijun and Ahn, Seoyoung and Zelinsky, Gregory and Samaras, Dimitris and Hoai, Minh},
  title = {Predicting Goal-Directed Human Attention Using Inverse Reinforcement Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
X3D: Expanding Architectures for Efficient Video Recognition
Christoph Feichtenhofer


This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that good accuracy to complexity trade-off is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. X3D achieves state-of-the-art performance while requiring 4.8x and 5.5x fewer multiply-adds and parameters for similar accuracy as previous work. Our most surprising finding is that networks with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters. We report competitive accuracy at unprecedented efficiency on video classification and detection benchmarks. Code is available at: https://github.com/facebookresearch/SlowFast.
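The stepwise expansion procedure reads naturally as a greedy loop; the sketch below is the editor's paraphrase (train_and_eval, the axes and the expansion factors are placeholders, and the actual X3D recipe also includes a backward contraction step):

def expand(config, axes, steps, train_and_eval):
    # config: dict axis -> current value; axes: dict axis -> multiplicative expansion factor
    for _ in range(steps):
        candidates = []
        for axis, factor in axes.items():      # e.g. frames, resolution, width, depth
            trial = dict(config)
            trial[axis] = trial[axis] * factor
            candidates.append((train_and_eval(trial), trial))   # short training run; higher score is better
        candidates.sort(key=lambda c: c[0])
        config = candidates[-1][1]             # keep only the best single-axis expansion
    return config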
[video, temporal, action, slowfast, spatiotemporal, clip, step, previous, recognition, frame, pathway, duration] [table, feature, detection, kaiming, resnet] [model, input, expanded, testing, tiny] [expansion, spatial, resolution, convolutional, residual, comparison, separable, channel, fast, convolution] [image, progressive, target, train] [network, fewer, width, accuracy, classification, architecture, efficient, expanding, design, number, neural, andrew, complexity, lower, inference, learning, set, comparable, expand, performance, computational, sampling, arxiv, preprint, test, expands, report, bottleneck, training, requires, better, increasing, data, comparing, deep] [depth, cost, single, axis, christoph]
@InProceedings{Feichtenhofer_2020_CVPR,
  author = {Feichtenhofer, Christoph},
  title = {X3D: Expanding Architectures for Efficient Video Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dynamic Multiscale Graph Neural Networks for 3D Skeleton Based Human Motion Prediction
Maosen Li, Siheng Chen, Yangheng Zhao, Ya Zhang, Yanfeng Wang, Qi Tian


We propose novel dynamic multiscale graph neural networks (DMGNN) to predict 3D skeleton-based human motions. The core idea of DMGNN is to use a multiscale graph to comprehensively model the internal relations of a human body for motion feature learning. This multiscale graph is adaptive during training and dynamic across network layers. Based on this graph, we propose a multiscale graph computational unit (MGCU) to extract features at individual scales and fuse features across scales. The entire model is action-category-agnostic and follows an encoder-decoder framework. The encoder consists of a sequence of MGCUs to learn motion features. The decoder uses a proposed graph-based gate recurrent unit to generate future poses. Extensive experiments show that the proposed DMGNN outperforms state-of-the-art methods in both short and long-term predictions on the datasets of Human 3.6M and CMU Mocap. We further investigate the learned multiscale graphs for the interpretability. The codes could be downloaded from https://github.com/limaosen0/DMGNN.
[graph, dmgnn, prediction, time, recognition, maes, future, extract, action, temporal, predict, gru, decoder, short, mgcu, three, mgcus, recurrent, cmu, multiple, trainable, december, sequence, outperforms, state] [table, propose, feature, fuse] [model, difference] [motion, multiscale, ieee, figure, pattern, dynamic, june, proposed, convolution, based, fusion, scale, running, block, convolutional, adaptively] [encoder, learn, representation] [learning, neural, average, computational, processing, learned, large, fixed, deep] [human, conference, computer, vision, pose, body, international, csm, relative]
@InProceedings{Li_2020_CVPR,
  author = {Li, Maosen and Chen, Siheng and Zhao, Yangheng and Zhang, Ya and Wang, Yanfeng and Tian, Qi},
  title = {Dynamic Multiscale Graph Neural Networks for 3D Skeleton Based Human Motion Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects
Kiana Ehsani, Shubham Tulsiani, Saurabh Gupta, Ali Farhadi, Abhinav Gupta


When we humans look at a video of human-object interaction, we can not only infer what is happening but we can even extract actionable information and imitate those interactions. On the other hand, current recognition or geometric approaches lack the physicality of action representation. In this paper, we take a step towards more physical understanding of actions. We address the problem of inferring contact points and the physical forces from videos of humans interacting with objects. One of the main challenges in tackling this problem is obtaining ground-truth labels for forces. We sidestep this problem by instead using a physics simulator for supervision. Specifically, we use a simulator to predict effects, and enforce that estimated forces must lead to same effect as depicted in the video. Our quantitative and qualitative results show that (a) we can predict meaningful forces from videos whose effects lead to accurate imitation of the motions observed, (b) by jointly optimizing for contact point and force prediction, we can improve the performance on both tasks in comparison to independent training, and (c) we can learn a representation from this model that generalizes to novel objects using few shot examples.
[prediction, understanding, interaction, predict, state, observed, infer, predicting, video, dataset, work, simulator, goal, visual, frame] [object, predicted, table, semantic] [model, physical, input, trained, noise, magnitude, collect] [figure, pattern, motion, applying] [image, train, translation, generalize, corresponding, representation, loss, qualitative, meaningful, learn, common] [training, learning, applied, set, problem, performance, objective, optimizing, network, better, measure, test, lead, shot, note] [contact, point, error, lkeypoint, force, pose, conference, initial, computer, keypoint, simulation, vision, human, projection, lcp, rotation, joint, truth, ground, hand, keypoints, simulated, jointly, novel, estimation, geometric, approach, plane, interacting, estimating, camera]
@InProceedings{Ehsani_2020_CVPR,
  author = {Ehsani, Kiana and Tulsiani, Shubham and Gupta, Saurabh and Farhadi, Ali and Gupta, Abhinav},
  title = {Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DaST: Data-Free Substitute Training for Adversarial Attacks
Mingyi Zhou, Jing Wu, Yipeng Liu, Shuaicheng Liu, Ce Zhu


Machine learning models are vulnerable to adversarial examples. For the black-box setting, current substitute attacks need pre-trained models to generate adversarial examples. However, pre-trained models are hard to obtain in real-world tasks. In this paper, we propose a data-free substitute training method (DaST) to obtain substitute models for adversarial black-box attacks without the requirement of any real data. To achieve this, DaST utilizes specially designed generative adversarial networks (GANs) to train the substitute models. In particular, we design a multi-branch architecture and label-control loss for the generative model to deal with the uneven distribution of synthetic samples. The substitute model is then trained by the synthetic samples generated by the generative model, which are labeled by the attacked model subsequently. The experiments demonstrate the substitute models produced by DaST can achieve competitive performance compared with the baseline models which are trained by the same train set with attacked models. Additionally, to evaluate the practicability of the proposed method on the real-world task, we attack an online machine learning model on the Microsoft Azure platform. The remote model misclassifies 98.35% of the adversarial examples crafted by our method. To the best of our knowledge, we are the first to train a substitute model for adversarial attacks without any real data.
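An editor's sketch of the training loop the abstract implies (module names, hyper-parameters and the multi-branch/label-control details are assumptions, not the authors' code): the generator produces synthetic inputs, the attacked black-box model labels them, the substitute imitates those labels, and the generator is pushed toward samples where the two models still disagree.

import torch
import torch.nn.functional as F

def dast_step(generator, substitute, black_box, opt_sub, opt_gen, batch=64, z_dim=100):
    z = torch.randn(batch, z_dim)
    x = generator(z)
    with torch.no_grad():
        labels = black_box(x).argmax(dim=1)                    # only query access to the victim is needed
    loss_sub = F.cross_entropy(substitute(x.detach()), labels) # substitute imitates the victim
    opt_sub.zero_grad()
    loss_sub.backward()
    opt_sub.step()
    loss_gen = -F.cross_entropy(substitute(generator(z)), labels)  # generator seeks disagreement
    opt_gen.zero_grad()
    loss_gen.backward()
    opt_gen.step()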
[current, evaluation, microsoft] [table, denotes, hard] [adversarial, model, substitute, attack, attacked, dast, success, fgsm, bim, pgd, trained, scenario, targeted, mnist, azure, access, transferability, input, medium, perturbation, ian, patrick] [method, proposed, output, figure, based, convolutional] [generated, train, generate, synthetic, generative, real, loss, introduce, produced, utilize] [learning, training, machine, network, performance, data, evaluate, rate, gradient, arxiv, preprint, deep, large, set, architecture, online, achieve, compared, number, label, small, design, objective, better, distribution] [conference, international, distance]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Mingyi and Wu, Jing and Liu, Yipeng and Liu, Shuaicheng and Zhu, Ce},
  title = {DaST: Data-Free Substitute Training for Adversarial Attacks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Verifying Robustness of Neural Networks Against A Family of Semantic Perturbations
Jeet Mohapatra, Tsui-Wei Weng, Pin-Yu Chen, Sijia Liu, Luca Daniel


Verifying the robustness of neural networks given a specified threat model is a fundamental yet challenging task. While current verification methods mainly focus on the l_p-norm threat model of the input instances, robustness verification against semantic adversarial attacks that induce large l_p-norm perturbations, such as color shifting and lighting adjustment, is beyond their capacity. To bridge this gap, we propose Semantify-NN, a model-agnostic and generic robustness verification approach against semantic perturbations for neural networks. By simply inserting our proposed semantic perturbation layers (SP-layers) into the input layer of any given model, Semantify-NN is model-agnostic, and any l_p-norm based verification tools can be used to verify the model's robustness against semantic perturbations. We illustrate the principles of designing the SP-layers and provide examples of semantic perturbations for image classification in the space of hue, saturation, lightness, brightness, contrast and rotation. In addition, an efficient refinement technique is proposed to further improve the semantic certificate significantly. Experiments on various network architectures and different datasets demonstrate the superior verification performance of Semantify-NN over l_p-norm-based verification frameworks that naively convert semantic perturbations to l_p-norm. The results show that Semantify-NN can support robustness verification against a wide range of semantic perturbations.
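The core construction can be pictured with a toy SP-layer (an assumed brightness example; the paper covers several semantic dimensions and adds a refinement scheme, and expresses its layers with verifier-friendly operations): the layer maps a low-dimensional semantic parameter to a perturbed copy of a fixed image, so a verifier applied to the composed model certifies an interval of that parameter rather than a pixel-space ball.

import torch
import torch.nn as nn

class BrightnessSPLayer(nn.Module):
    def __init__(self, x0):
        super().__init__()
        self.register_buffer("x0", x0)      # fixed image being certified, shape (1, C, H, W)

    def forward(self, theta):
        # theta: (batch, 1) brightness offsets -> perturbed copies of x0
        shift = theta.view(-1, 1, 1, 1)
        return torch.clamp(self.x0 + shift, 0.0, 1.0)

def semantify(classifier, x0):
    # composed model: semantic parameter -> perturbed image -> logits; an
    # l_p / interval verifier can now bound the output over a range of shifts
    return nn.Sequential(BrightnessSPLayer(x0), classifier)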
[explicit, spl, work] [semantic, refinement, cnn, propose, occlusion, table, including, apply, framework, denotes] [verification, robustness, input, attack, threat, adversarial, perturbation, norm, splitting, model, certification, original, certificate, trained, perturbed, certified, bounded, continuously, saturation, hue, lightness, verifying, technique, certifying, discretely, enumeration] [based, proposed, color, pixel, brightness, contrast, range, figure] [image, translation, idea] [space, neural, network, parameterized, linear, bound, min, layer, efficient, consider, general, experiment, find, lower, problem, activation, function, number, large, data, computation, upper, performance, deep, set] [mlp, rotation, implicit, rgb, convex, transformation, form]
@InProceedings{Mohapatra_2020_CVPR,
  author = {Mohapatra, Jeet and Weng, Tsui-Wei and Chen, Pin-Yu and Liu, Sijia and Daniel, Luca},
  title = {Towards Verifying Robustness of Neural Networks Against A Family of Semantic Perturbations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks
Yuheng Zhang, Ruoxi Jia, Hengzhi Pei, Wenxiao Wang, Bo Li, Dawn Song


This paper studies model-inversion attacks, in which access to a model is abused to infer information about the training data. Since their first introduction by [??], such attacks have raised serious concerns given that training data usually contain privacy-sensitive information. Thus far, successful model-inversion attacks have only been demonstrated on simple models, such as linear regression and logistic regression. Previous attempts to invert neural networks, even ones with simple architectures, have failed to produce convincing results. Here we present a novel attack method, termed the generative model-inversion attack, which can invert deep neural networks with high success rates. Rather than reconstructing private training data from scratch, we leverage partial public information, which can be very generic, to learn a distributional prior via generative adversarial networks (GANs) and use it to guide the inversion process. Moreover, we theoretically prove that a model's predictive power and its vulnerability to inversion attacks are indeed two sides of the same coin: highly predictive models are able to establish a strong correlation between features and labels, which coincides exactly with what an adversary exploits to mount the attacks. Our extensive experiments demonstrate that the proposed attack improves identification accuracy over the existing work by about 75% for reconstructing face images from a state-of-the-art face recognition classifier. We also show that differential privacy, in its canonical form, is of little avail to defend against our attacks.
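A minimal sketch of the inversion step this abstract describes, assuming a generator and discriminator pretrained on public data and white-box access to the attacked classifier (the weighting lam, learning rate, and step count are illustrative, not the paper's settings):

import torch
import torch.nn.functional as F

def invert(generator, discriminator, classifier, target_class,
           latent_dim=100, steps=1500, lr=0.02, lam=100.0):
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x = generator(z)
        prior = -discriminator(x).mean()                       # stay on the public-data manifold
        ident = F.cross_entropy(classifier(x), torch.tensor([target_class]))
        loss = prior + lam * ident                             # prior + identity loss
        opt.zero_grad(); loss.backward(); opt.step()
    return generator(z).detach()                               # reconstructed sensitive image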
[evaluation, recognition, work, dataset] [feature, table, mask, center] [attack, private, gmi, face, public, model, emi, pii, xns, adversary, sensitive, privacy, auxiliary, datasets, identity, access, dist, attacker, differential, protect, trained, vulnerability, effective, corrupted, knn, acc, security] [proposed, prior, output, existing, figure, method, ieee, likelihood, based, blurred, comparison, high, psnr] [target, image, inversion, loss, specific, generative, aim, generator, latent, corresponding, train, missing, inpainting, diversity] [training, data, knowledge, predictive, network, performance, deep, accuracy, learning, power, set, neural, distribution, label, classifier, optimization, algorithm, adapted, size, connection, log, feat, arxiv, preprint] [computer, reconstructed, conference, reconstruct, approach, distance]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Yuheng and Jia, Ruoxi and Pei, Hengzhi and Wang, Wenxiao and Li, Bo and Song, Dawn},
  title = {The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Self-supervised Approach for Adversarial Robustness
Muzammal Naseer, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Fatih Porikli


Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems, e.g., for classification, segmentation and object detection. The vulnerability of DNNs to such attacks can prove a major roadblock towards their real-world deployment. The transferability of adversarial examples demands generalizable defenses that can provide cross-task protection. Adversarial training, which enhances robustness by modifying the target model's parameters, lacks such generalizability. On the other hand, different input-processing-based defenses fall short in the face of continuously evolving attacks. In this paper, we take the first step to combine the benefits of both approaches and propose a self-supervised adversarial training mechanism in the input space. By design, our defense is a generalizable approach and provides significant robustness against unseen adversarial attacks (e.g., reducing the success rate of the translation-invariant ensemble attack from 82.6% to 31.9% compared to the previous state of the art). It can be deployed as a plug-and-play solution to protect a variety of vision systems, as we demonstrate for the cases of classification, segmentation and detection.
[] [feature, table, object, propose, segmentation, detection, extractor] [adversarial, nrp, defense, input, attack, perturbation, trained, jpeg, ssp, transferability, model, purifier, distortion, adversarially, robustness, tvm, clean, perturbed, original, strong, ensemble, fooling, vgg, dim, noise, cda, dimt, fgsm, bpda] [proposed, ieee, pattern, based, perceptual, convolutional, figure, pixel, method, remove] [loss, image, representation, train, transferable, independent, discriminator, generator] [training, network, deep, neural, learning, rate, space, arxiv, preprint, processing, data, accuracy, imagenet, algorithm, label, architecture, compared, objective, random, machine] [conference, vision, computer, approach, international, distance]
@InProceedings{Naseer_2020_CVPR,
  author = {Naseer, Muzammal and Khan, Salman and Hayat, Munawar and Khan, Fahad Shahbaz and Porikli, Fatih},
  title = {A Self-supervised Approach for Adversarial Robustness},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adversarial Vertex Mixup: Toward Better Adversarially Robust Generalization
Saehyung Lee, Hyungyu Lee, Sungroh Yoon


Adversarial examples cause neural networks to produce incorrect outputs with high confidence. Although adversarial training is one of the most effective forms of defense against adversarial examples, unfortunately, a large gap exists between test accuracy and training accuracy in adversarial training. In this paper, we identify Adversarial Feature Overfitting (AFO), which may cause poor adversarially robust generalization, and we show that adversarial training can overshoot the optimal point in terms of robust generalization, leading to AFO in our simple Gaussian model. Considering these theoretical results, we present soft labeling as a solution to the AFO problem. Furthermore, we propose Adversarial Vertex mixup (AVmixup), a soft-labeled data augmentation approach for improving adversarially robust generalization. We complement our theoretical analysis with experiments on CIFAR10, CIFAR100, SVHN, and Tiny ImageNet, and show that AVmixup significantly improves the robust generalization performance and that it reduces the trade-off between standard accuracy and adversarial robustness.
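A compact sketch of the AVmixup batch construction described above, assuming a helper adv_delta(model, x, y) that returns an adversarial perturbation (e.g., from PGD); gamma and the label-smoothing factors lam1/lam2 follow the paper's notation, but the values here are illustrative:

import torch
import torch.nn.functional as F

def avmixup_batch(model, x, y, n_classes, gamma=2.0, lam1=1.0, lam2=0.1):
    delta = adv_delta(model, x, y)                         # assumed PGD-style helper
    x_av = x + gamma * delta                               # adversarial vertex
    # soft labels: smooth toward the uniform distribution with strength lam1 / lam2
    y_hot = F.one_hot(y, n_classes).float()
    y_clean = lam1 * y_hot + (1 - lam1) / n_classes
    y_av = lam2 * y_hot + (1 - lam2) / n_classes
    t = torch.rand(x.size(0), 1, 1, 1, device=x.device)    # per-sample interpolation weight
    x_mix = t * x + (1 - t) * x_av
    y_mix = t.view(-1, 1) * y_clean + (1 - t.view(-1, 1)) * y_av
    return x_mix, y_mix                                     # train with soft-label cross-entropy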
[dataset] [feature, table, apply] [adversarial, robust, avmixup, model, generalization, pgd, robustness, defense, example, input, tiny, trained, scatter, clean, attack, definition, afo, datasets, ian, adversarially, improve, fgsm, effective] [method, proposed, ieee, combination, figure, pattern] [train, factor] [training, accuracy, data, standard, neural, arxiv, preprint, deep, classification, distribution, set, classifier, learning, variance, linear, sample, vector, performance, svhn, simple, theoretical, weight, large, soft, mixup, machine, empirical, function, number, evaluate, overfitting, small, upper, theorem, imagenet, compared, andrew] [defined, vertex, computer, conference, approach, error, virtual, supplementary, vision]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Saehyung and Lee, Hyungyu and Yoon, Sungroh},
  title = {Adversarial Vertex Mixup: Toward Better Adversarially Robust Generalization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
How Does Noise Help Robustness? Explanation and Exploration under the Neural SDE Framework
Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, Cho-Jui Hsieh


Neural Ordinary Differential Equation (Neural ODE) has been proposed as a continuous approximation to the ResNet architecture. Some commonly used regularization mechanisms in discrete neural networks (e.g., dropout, Gaussian noise) are missing in current Neural ODE networks. In this paper, we propose a new continuous neural network framework called Neural Stochastic Differential Equation (Neural SDE), which naturally incorporates various commonly used regularization mechanisms based on random noise injection. For regularization purposes, our framework includes multiple types of noise patterns, such as dropout, additive, and multiplicative noise, which are common in plain neural networks. We provide some theoretical analyses explaining the improved robustness of our models against input perturbations. Furthermore, we demonstrate that the Neural SDE network can achieve better generalization than the Neural ODE and is more resistant to adversarial and non-adversarial input perturbations.
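The continuous formulation can be pictured with a small Euler-Maruyama integration of a hidden state; the additive diffusion term below is just one of the noise patterns the paper discusses, and the drift network is an arbitrary illustrative choice:

import torch
import torch.nn as nn

class NeuralSDEBlock(nn.Module):
    # dh = f(h) dt + sigma dW, integrated with Euler-Maruyama
    def __init__(self, dim, steps=10, t1=1.0, sigma=0.1):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        self.steps, self.dt, self.sigma = steps, t1 / steps, sigma

    def forward(self, h):
        for _ in range(self.steps):
            drift = self.f(h) * self.dt
            noise = self.sigma * torch.randn_like(h) * self.dt ** 0.5   # Brownian increment
            h = h + drift + noise
        return h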
[hidden, time, work] [framework, resnet, propagation, level, add] [noise, ode, diffusion, model, sde, adversarial, robustness, differential, jump, multiplicative, perturbation, testing, adding, brownian, stability, pdrop, input, het, generalization, original, randomness, resistant, improve, dbt, toy, example, poisson, explanation, google] [gaussian, residual, figure, proposed, based, motion, block, result, output] [] [neural, dropout, stochastic, network, random, training, deep, layer, arxiv, preprint, accuracy, regularization, bernoulli, equation, better, randomly, algorithm, observe, test, set, deterministic, small, additive, learning, sample, initialization, discrete, theoretical, drift, consider, exponentially, note] [term, solution, depth, continuous, error, numerical]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Xuanqing and Xiao, Tesi and Si, Si and Cao, Qin and Kumar, Sanjiv and Hsieh, Cho-Jui},
  title = {How Does Noise Help Robustness? Explanation and Exploration under the Neural SDE Framework},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unpaired Image Super-Resolution Using Pseudo-Supervision
Shunta Maeda


In most studies on learning-based image super-resolution (SR), the paired training dataset is created by downscaling high-resolution (HR) images with a predetermined operation (e.g., bicubic). However, these methods fail to super-resolve real-world low-resolution (LR) images, for which the degradation process is much more complicated and unknown. In this paper, we propose an unpaired SR method using a generative adversarial network that does not require a paired/aligned training dataset. Our network consists of an unpaired kernel/noise correction network and a pseudo-paired SR network. The correction network removes noise and adjusts the kernel of the inputted LR image; then, the corrected clean LR image is upscaled by the SR network. In the training phase, the correction network also produces a pseudo-clean LR image from the inputted HR image, and then a mapping from the pseudo-clean LR image to the inputted HR image is learned by the SR network in a paired manner. Because our SR network is independent of the correction network, well-studied existing network architectures and pixel-wise loss functions can be integrated with the proposed framework. Experiments on diverse datasets show that the proposed method is superior to existing solutions to the unpaired SR problem.
[dataset, attention] [aerial, table, challenge] [trained, adversarial, face, input, correction, noise, model, clean, original] [method, blind, proposed, degradation, comparison, figure, zssr, residual, bulat, perceptual, bicubic, ntire, degraded, inputted, result, dota, upscaling, dbpn, rcan, psnr, downscaling, deblurring, ssim, predetermined, kernel, enhanced] [image, loss, unpaired, paired, mapping, domain, generated, real, discriminator, cyclegan, lgeo, generative, learn, source, train, generator, ladv, diverse, translation, cycle, corresponding, target] [network, training, deep, set, validation, compared, better, learning, distribution, learned, problem, note, hyperparameters, performance, data, test] [single, geometric, provided]
@InProceedings{Maeda_2020_CVPR,
  author = {Maeda, Shunta},
  title = {Unpaired Image Super-Resolution Using Pseudo-Supervision},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Universal Litmus Patterns: Revealing Backdoor Attacks in CNNs
Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash, Heiko Hoffmann


The unprecedented success of deep neural networks in many applications has made these networks a prime target for adversarial exploitation. In this paper, we introduce a benchmark technique for detecting backdoor attacks (aka Trojan attacks) on deep convolutional neural networks (CNNs). We introduce the concept of Universal Litmus Patterns (ULPs), which enable one to reveal backdoor attacks by feeding these universal patterns to the network and analyzing the output (i.e., classifying the network as `clean' or `corrupted'). This detection is fast because it requires only a few forward passes through a CNN. We demonstrate the effectiveness of ULPs for detecting backdoor attacks on thousands of networks with different architectures trained on four benchmark datasets, namely the German Traffic Sign Recognition Benchmark (GTSRB), MNIST, CIFAR10, and Tiny-ImageNet. The codes and train/test models for this paper can be found here: https://umbcvision.github.io/Universal-Litmus-Patterns/.
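A toy sketch of the litmus-test idea (shapes and optimizer settings are illustrative, not the released code): a handful of learnable input patterns are pushed through each candidate network, and a small binary head on the concatenated logits decides clean vs. poisoned; patterns and head are trained jointly on a pool of models whose status is known.

import torch
import torch.nn as nn

M, C, H, W, n_classes = 5, 3, 32, 32, 10
ulps = nn.Parameter(torch.rand(M, C, H, W))          # the learnable litmus patterns
meta = nn.Linear(M * n_classes, 1)                   # clean (0) vs poisoned (1) head
opt = torch.optim.Adam([ulps, *meta.parameters()], lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def litmus_score(model):
    logits = model(ulps)                             # only M forward passes per model
    return meta(logits.reshape(1, -1))

def train_step(models, labels):                      # labels: 1 = poisoned, 0 = clean
    loss = sum(bce(litmus_score(m), torch.tensor([[float(y)]]))
               for m, y in zip(models, labels)) / len(models)
    opt.zero_grad(); loss.backward(); opt.step()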
[dataset, traffic, sign, recognition] [detection, detect, benchmark, pooling] [poisoned, backdoor, ulps, clean, trained, model, trigger, input, universal, attack, targeted, gtsrb, poisoning, infected, litmus, detecting, access, mnist, vgg, adversarial, generalizability, noise, perturbation, ulp, mutually, security, adversary, testing] [figure, proposed, method, cnns, convolutional, output, presented] [source, target, train, image, introduce, user] [training, neural, set, network, test, data, random, learning, class, deep, machine, architecture, arxiv, preprint, performance, accuracy, number, fixed, classifier, presence, baseline, ratio, requires, forward, classification, consider, small, optimization, binary, randomly] [approach, rely, outlier]
@InProceedings{Kolouri_2020_CVPR,
  author = {Kolouri, Soheil and Saha, Aniruddha and Pirsiavash, Hamed and Hoffmann, Heiko},
  title = {Universal Litmus Patterns: Revealing Backdoor Attacks in CNNs},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Robustness Guarantees for Deep Neural Networks on Videos
Min Wu, Marta Kwiatkowska


The widespread adoption of deep learning models places demands on their robustness. In this paper, we consider the robustness of deep neural networks on videos, which comprise both the spatial features of individual frames extracted by a convolutional neural network and the temporal dynamics between adjacent frames captured by a recurrent neural network. To measure robustness, we study the maximum safe radius problem, which computes the minimum distance from the optical flow sequence obtained from a given input to that of an adversarial example in the neighbourhood of the input. We demonstrate that, under the assumption of Lipschitz continuity, the problem can be approximated using finite optimisation via discretising the optical flow space, and the approximation has provable guarantees. We then show that the finite optimisation problem can be solved by utilising a two-player turn-based game in a cooperative setting, where the first player selects the optical flows and the second player determines the dimensions to be manipulated in the chosen flow. We employ an anytime approach to solve the game, in the sense of approximating the value of the game by monotonically improving its upper and lower bounds. We exploit a gradient-based search algorithm to compute the upper bounds, and the admissible A* algorithm to update the lower bounds. Finally, we evaluate our framework on the UCF101 video dataset.
[video, frame, sequence, state, recurrent, reward, temporal, msr, atomic, work, individual] [framework, denotes] [safe, input, robustness, adversarial, player, game, finite, original, definition, manipulation, lipschitz, verification, admissible, example, magnitude, fmsr, hammerthrow, perturbation, norm, perturbed, manipulated, ball, athf, study] [optical, flow, figure, spatial, based, brightness, convolutional, pixel, neighbourhood] [extracted, corresponding, image] [neural, lower, maximum, upper, bound, network, algorithm, deep, set, problem, min, learning, approximation, gradient, denote, classification, respect, function, number, strategy, note, computation, sampled, consider, search, evaluate, machine, linear] [distance, radius, compute, conference, international, define, computer, approach]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Min and Kwiatkowska, Marta},
  title = {Robustness Guarantees for Deep Neural Networks on Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Benchmarking Adversarial Robustness on Image Classification
Yinpeng Dong, Qi-An Fu, Xiao Yang, Tianyu Pang, Hang Su, Zihao Xiao, Jun Zhu


Deep neural networks are vulnerable to adversarial examples, which has become one of the most important research problems in the development of deep learning. While a lot of effort has been made in recent years, it is of great significance to perform correct and complete evaluations of adversarial attack and defense algorithms. In this paper, we establish a comprehensive, rigorous, and coherent benchmark to evaluate adversarial robustness on image classification tasks. After briefly reviewing a large number of representative attack and defense methods, we perform large-scale experiments with two robustness curves as fair-minded evaluation criteria to fully understand the performance of these methods. Based on the evaluation results, we draw several important findings that can provide insights for future research, including: 1) the relative robustness between models can change across different attack configurations, so it is encouraged to adopt robustness curves to evaluate adversarial robustness; 2) as one of the most effective defense techniques, adversarial training can generalize across different threat models; 3) randomization-based defenses are more robust to query-based black-box attacks.
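An accuracy-vs-budget robustness curve of the kind advocated above can be sketched as a simple sweep over perturbation budgets (pgd_attack is an assumed attack helper, not part of the benchmark's actual API):

import torch

def robustness_curve(model, loader, eps_grid=(0, 2/255, 4/255, 8/255, 16/255)):
    curve = []
    for eps in eps_grid:
        correct, total = 0, 0
        for x, y in loader:
            x_adv = x if eps == 0 else pgd_attack(model, x, y, eps)   # assumed helper
            with torch.no_grad():
                correct += (model(x_adv).argmax(1) == y).sum().item()
            total += y.numel()
        curve.append((eps, correct / total))
    return curve   # list of (perturbation budget, accuracy) points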
[evaluation, recognition, future, understanding, previous] [benchmark, including, table, adopt] [adversarial, attack, robustness, defense, perturbation, untargeted, robust, budget, threat, model, input, norm, strength, example, yinpeng, constrained, tianyu, trained, bim, adversary, substitute, ensemble, success, ian, improve, targeted, jpeg, improving, nicolas, hang, craft, transferability, certified] [based, method, figure, ieee, pattern, proposed, optimized] [jun, image, target, transfer, perform, loss] [accuracy, learning, training, imagenet, neural, gradient, deep, random, machine, number, appendix, classifier, arxiv, preprint, evaluate, minimum, performance, small, find, processing, knowledge, better] [conference, international, computer, vision, distance]
@InProceedings{Dong_2020_CVPR,
  author = {Dong, Yinpeng and Fu, Qi-An and Yang, Xiao and Pang, Tianyu and Su, Hang and Xiao, Zihao and Zhu, Jun},
  title = {Benchmarking Adversarial Robustness on Image Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
What It Thinks Is Important Is Important: Robustness Transfers Through Input Gradients
Alvin Chan, Yi Tay, Yew-Soon Ong


Adversarial perturbations are imperceptible changes to input pixels that can change the prediction of deep learning models. Learned weights of models robust to such perturbations have previously been found to be transferable across different tasks, but this applies only if the model architecture for the source and target tasks is the same. Input gradients characterize how small changes at each input pixel affect the model output. Using only natural images, we show here that training a student model's input gradients to match those of a robust teacher model can gain robustness close to a strong baseline that is robustly trained from scratch. Through experiments on MNIST, CIFAR-10, CIFAR-100 and Tiny-ImageNet, we show that our proposed method, input gradient adversarial matching, can transfer robustness across different tasks and even across different model architectures. This demonstrates that directly targeting the semantics of input gradients is a feasible way towards adversarial robustness.
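A simplified sketch of the gradient-matching objective: the paper matches input gradients adversarially through a discriminator, whereas the stand-in below uses a plain L2 match between the student's and the robust teacher's input gradients plus the usual classification loss.

import torch
import torch.nn.functional as F

def gradient_matching_loss(student, teacher, x, y, lam=10.0):
    x = x.clone().requires_grad_(True)

    # input gradient of the robust teacher (treated as a fixed target)
    t_loss = F.cross_entropy(teacher(x), y)
    g_teacher = torch.autograd.grad(t_loss, x)[0].detach()

    # student's input gradient, kept in the graph so it can be trained
    s_loss = F.cross_entropy(student(x), y)
    g_student = torch.autograd.grad(s_loss, x, create_graph=True)[0]

    return s_loss + lam * F.mse_loss(g_student, g_teacher)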
[natural, dataset, work, prediction] [final, feature, table] [input, adversarial, model, robustness, robust, trained, igam, clean, logit, finetuned, mnist, fdisc, dsrc, adversarially, resizing, difference, strong, medium, adv, ldiff, dtar, effective, upwards, display, perturbation, pgd] [figure, pixel, proposed, output] [transfer, image, target, loss, discriminator, train, ladv, source, transferred, diff, transferring] [teacher, training, student, gradient, task, arxiv, preprint, standard, finetuning, test, neural, accuracy, layer, baseline, dimension, architecture, classification, log, deep, learning, small, regularization, bound, lower, average, top, outperform, classifier] [matching, term, match, transformation, conference, computer, vision]
@InProceedings{Chan_2020_CVPR,
  author = {Chan, Alvin and Tay, Yi and Ong, Yew-Soon},
  title = {What It Thinks Is Important Is Important: Robustness Transfers Through Input Gradients},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Transferable, Controllable, and Inconspicuous Adversarial Attacks on Person Re-identification With Deep Mis-Ranking
Hongjun Wang, Guangrun Wang, Ya Li, Dongyu Zhang, Liang Lin


The success of DNNs has driven the extensive application of person re-identification (ReID) into a new era. However, whether ReID inherits the vulnerability of DNNs remains unexplored. Examining the robustness of ReID systems is important because their insecurity may cause severe losses, e.g., criminals may use adversarial perturbations to cheat CCTV systems. In this work, we examine the insecurity of current best-performing ReID models by proposing a learning-to-mis-rank formulation to perturb the ranking of the system output. As cross-dataset transferability is crucial in the ReID domain, we also perform a black-box attack by developing a novel multi-stage network architecture that pyramids the features of different levels to extract general and transferable features for the adversarial perturbations. Our method can control the number of malicious pixels by using differentiable multi-shot sampling. To guarantee the inconspicuousness of the attack, we also propose a new perception loss to achieve better visual quality. Extensive experiments on four of the largest ReID benchmarks (i.e., Market1501, CUHK03, DukeMTMC, and MSMT17) not only show the effectiveness of our method, but also provide directions for future improvement of the robustness of ReID systems. For example, the accuracy of one of the best-performing ReID systems drops sharply from 91.8% to 1.4% after being attacked by our method. Some attack results are shown in Fig. 1. The code is available at: https://github.com/whj363636/Adversarial-attack-on-Person-ReID-With-Deep-Mis-Ranking.
[visual, perception, future, pair, three, extract] [map, table, liang, effectiveness, propose, feature, pcb, improvement, denotes] [adversarial, attack, noise, attacker, attacked, robustness, examine, attacking, pgd, success, improve, misclassification, model, alignedreid, quality, transferability, insecurity, dnns, mudeep, cheat, perturb, inconspicuousness] [method, existing, based, comparison, output] [reid, person, loss, image, target, gap, discriminator, perform, representation, gan, transferable, control] [number, deep, learning, network, ranking, training, accuracy, data, learned, performance, neural, architecture, general, classification, random, better, set] [system, distance, conference, formulation, computer, vision]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Hongjun and Wang, Guangrun and Li, Ya and Zhang, Dongyu and Lin, Liang},
  title = {Transferable, Controllable, and Inconspicuous Adversarial Attacks on Person Re-identification With Deep Mis-Ranking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Video Modeling With Correlation Networks
Heng Wang, Du Tran, Lorenzo Torresani, Matt Feiszli


Motion is a salient cue to recognize actions in video. Modern action recognition models leverage motion information either explicitly by using optical flow as input or implicitly by means of 3D convolutional filters that simultaneously capture appearance and motion information. This paper proposes an alternative approach based on a learnable correlation operator that can be used to establish frame-to-frame matches over convolutional feature maps in the different layers of the network. The proposed architecture enables the fusion of this explicit temporal matching information with traditional appearance cues captured by 2D convolution. Our correlation network compares favorably with widely-used 3D CNNs for video modeling, and achieves competitive results over the prominent two-stream network while being much faster to train. We empirically demonstrate that correlation networks produce strong results on a variety of video datasets, and outperform the state of the art on four popular benchmarks for action recognition: Kinetics, Something-Something, Diving48, and Sports1M.
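The matching signal at the heart of the architecture can be sketched as a plain (non-learnable) correlation between feature maps of consecutive frames; the paper's operator additionally makes this learnable and groupwise, and the normalization below is an illustrative choice.

import torch
import torch.nn.functional as F

def correlation(fa, fb, k=7):
    # fa, fb: (B, C, H, W) features of consecutive frames.
    # Returns (B, k*k, H, W): similarity of each position in fa to a k x k
    # neighborhood around the same position in fb.
    B, C, H, W = fa.shape
    pad = k // 2
    # unfold k x k neighborhoods of fb: (B, C*k*k, H*W) -> (B, C, k*k, H, W)
    patches = F.unfold(fb, kernel_size=k, padding=pad).view(B, C, k * k, H, W)
    return (fa.unsqueeze(2) * patches).sum(1) / C ** 0.5   # scaled dot products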
[video, action, temporal, kinetics, recognition, outperforms, state, stream, spatiotemporal, frame, work, dataset, lorenzo] [correlation, table, feature, propose, backbone, improvement] [model, input, datasets] [operator, optical, motion, flow, convolutional, convolution, figure, block, spatial, based, learnable, output, cnns, cin, corrnet, proposed, patch, designed, groupwise, introduced, adjacent, channel, cout] [image, appearance, learn, representation] [network, number, imagenet, learning, filter, deep, accuracy, size, design, better, neural, learned, set, reduce, training, paper, architecture, popular, classification, layer, computational, larger, large, comparing, andrew] [matching, rgb, compare, human, displacement, cost]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Heng and Tran, Du and Torresani, Lorenzo and Feiszli, Matt},
  title = {Video Modeling With Correlation Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Projection & Probability-Driven Black-Box Attack
Jie Li, Rongrong Ji, Hong Liu, Jianzhuang Liu, Bineng Zhong, Cheng Deng, Qi Tian


Generating adversarial examples in a black-box setting remains a significant challenge with vast practical application prospects. In particular, existing black-box attacks suffer from the need for excessive queries, as it is non-trivial to find an appropriate direction to optimize in the high-dimensional space. In this paper, we propose Projection & Probability-driven Black-box Attack (PPBA) to tackle this problem by reducing the solution space and providing better optimization. To reduce the solution space, we first model the adversarial perturbation optimization problem as a process of recovering frequency-sparse perturbations with compressed sensing, under the setting that random noise in the low-frequency space is more likely to be adversarial. We then propose a simple method to construct a low-frequency constrained sensing matrix, which works as a plug-and-play projection matrix to reduce the dimensionality. Such a sensing matrix is shown to be flexible enough to be integrated into existing methods like NES and Bandits_TD. For better optimization, we perform a random walk with a probability-driven strategy, which utilizes all queries over the whole progress to make full use of the sensing matrix under a smaller query budget. Extensive experiments show that our method requires at most 24% fewer queries with a higher attack success rate compared with state-of-the-art approaches. Finally, the attack method is evaluated on a real-world online service, i.e., the Google Cloud Vision API, which further demonstrates its practical potential.
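A self-contained sketch of the low-frequency search space: a sensing matrix built from the lowest 2-D DCT basis images, explored by a random walk that keeps coefficient moves which increase an attack loss. The probability-driven coordinate selection of the paper is simplified here to plain accept/reject, and the example is grayscale, numpy only; loss_fn stands in for the black-box query.

import numpy as np

def dct_basis(n, k):
    # the k*k lowest-frequency n x n 2-D DCT basis images, flattened into rows
    basis = []
    for u in range(k):
        for v in range(k):
            xs = np.cos(np.pi * (np.arange(n) + 0.5) * u / n)
            ys = np.cos(np.pi * (np.arange(n) + 0.5) * v / n)
            basis.append(np.outer(xs, ys).ravel())
    return np.stack(basis)                            # sensing matrix, shape (k*k, n*n)

def perturb(coeff, A, n, eps):
    delta = (coeff @ A).reshape(n, n)
    return np.clip(delta, -eps, eps)                  # l_inf-bounded perturbation

def random_walk_attack(loss_fn, x, n=28, k=8, eps=0.1, steps=2000, step=0.02):
    A = dct_basis(n, k)
    coeff = np.zeros(A.shape[0])
    best = loss_fn(np.clip(x + perturb(coeff, A, n, eps), 0, 1))
    for _ in range(steps):
        proposal = coeff.copy()
        i = np.random.randint(A.shape[0])             # move one low-frequency coordinate
        proposal[i] += np.random.choice([-step, step])
        val = loss_fn(np.clip(x + perturb(proposal, A, n, eps), 0, 1))
        if val > best:                                # keep the move only if it helps
            coeff, best = proposal, val
    return np.clip(x + perturb(coeff, A, n, eps), 0, 1)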
[step, walk] [propose] [success, attack, adversarial, ppba, banditst, perturbation, model, google, original, prba, asr, query, input, banditstd, norm, effective, versus, auc, victim] [sensing, method, proposed, ieee, frequency, pattern, compressed, low, high, based, existing] [image, inception, perform, corresponding, target] [rate, matrix, space, number, optimization, random, dimension, average, reduce, neural, evaluate, function, set, vector, learning, find, deep, performance, large, maximum, compared, iteration, practical, higher, randomly, setting, reducing, better, fewer] [conference, vision, solution, projection, computer, international, cloud, measurement]
@InProceedings{Li_2020_CVPR,
  author = {Li, Jie and Ji, Rongrong and Liu, Hong and Liu, Jianzhuang and Zhong, Bineng and Deng, Cheng and Tian, Qi},
  title = {Projection & Probability-Driven Black-Box Attack},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Auxiliary Training: Towards Accurate and Robust Models
Linfeng Zhang, Muzhou Yu, Tong Chen, Zuoqiang Shi, Chenglong Bao, Kaisheng Ma


The training process is crucial for deploying networks in applications that have strict requirements on both accuracy and robustness. However, most existing approaches face a dilemma: model accuracy and robustness form an awkward trade-off, where improving one leads to a drop in the other. The challenge remains when we try to improve accuracy and robustness simultaneously. In this paper, we propose a novel training method that introduces auxiliary classifiers trained on corrupted samples, while clean samples are trained normally with the primary classifier. In the training stage, a novel distillation method named input-aware self distillation is proposed to help the primary classifier learn robust information from the auxiliary classifiers. Along with it, a new normalization method, selective batch normalization, is proposed to protect the model from the negative influence of corrupted images. At the end of the training period, an L2-norm penalty is applied to the weights of the primary and auxiliary classifiers so that their weights become asymptotically identical. At inference time, only the primary classifier is used, so no extra computation or storage is needed. Extensive experiments on CIFAR10, CIFAR100 and ImageNet show that noticeable improvements on both accuracy and robustness can be observed with the proposed auxiliary training. On average, auxiliary training achieves 2.21% accuracy and 21.64% robustness (measured by corruption error) improvements over traditional training methods on CIFAR100. Code has been released on GitHub.
[observed, selective, attention, three, connected, privileged] [table, feature, framework, propose, named] [auxiliary, robustness, model, corruption, trained, primary, corrupted, clean, adversarial, perturbation, improve, increment, attack, drop, robust, study] [proposed, frequency, convolutional, method, utilized, comparison, gaussian, figure, formulated, traditional] [image, common, loss, learn] [training, accuracy, neural, batch, classifier, learning, data, normalization, augmentation, wide, distillation, deep, indicates, network, rate, imagenet, knowledge, alexnet, baseline, set, sample, conducted, applied] [error, computer, facilitate, vision, approach, consistent, normal, measured, conference]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Linfeng and Yu, Muzhou and Chen, Tong and Shi, Zuoqiang and Bao, Chenglong and Ma, Kaisheng},
  title = {Auxiliary Training: Towards Accurate and Robust Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PaStaNet: Toward Human Activity Knowledge Engine
Yong-Lu Li, Liang Xu, Xinpeng Liu, Xijie Huang, Yue Xu, Shiyi Wang, Hao-Shu Fang, Ze Ma, Mingyang Chen, Cewu Lu


Existing image-based activity understanding methods mainly adopt a direct mapping, i.e., from image to activity concepts, which may encounter a performance bottleneck due to the huge gap. In light of this, we propose a new path: infer human part states first and then reason out the activities based on part-level semantics. Human Body Part States (PaSta) are fine-grained action semantic tokens which can compose the activities and help us step toward a human activity knowledge engine. To fully utilize the power of PaSta, we build a large-scale knowledge base, PaStaNet, which contains 7M+ PaSta annotations. Two corresponding models are proposed: first, we design a model named Activity2Vec to extract PaSta features, which aim to be general representations for various activities; second, we use a PaSta-based Reasoning method to infer activities. Promoted by PaStaNet, our method achieves significant improvements, e.g., 6.4 and 13.9 mAP on the full and one-shot sets of HICO in supervised learning, and 3.2 and 4.2 mAP on V-COCO and image-based AVA in transfer learning. Code and data are available at http://hake-mvig.cn/.
[pasta, activity, asta, action, pastanet, recognition, graph, hierarchical, visual, language, node, cewu, understanding, late, tin, state, infer, extract, reasoning, video, step, interacted, multiple, fbert, ava, semantics, construct, interaction] [object, map, instance, feature, semantic, box, adopt, head, hico, detection, hoi, achieves, ross] [model, input, datasets] [method, based, fusion, existing, convolutional] [image, representation, transfer, person, common, corresponding, utilize, unseen] [knowledge, learning, data, performance, pairwise, arxiv, preprint, deep, find, decay, fei, general, achieve, vector, set, neural] [human, body, pose, well, estimation, hand]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yong-Lu and Xu, Liang and Liu, Xinpeng and Huang, Xijie and Xu, Yue and Wang, Shiyi and Fang, Hao-Shu and Ma, Ze and Chen, Mingyang and Lu, Cewu},
  title = {PaStaNet: Toward Human Activity Knowledge Engine},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Hierarchical Graph Network for 3D Object Detection on Point Clouds
Jintai Chen, Biwen Lei, Qingyu Song, Haochao Ying, Danny Z. Chen, Jian Wu


3D object detection on point clouds finds many applications. However, most known point cloud object detection methods do not adequately accommodate the characteristics (e.g., sparsity) of point clouds, and thus some key semantic information (e.g., shape information) is not well captured. In this paper, we propose a new graph convolution (GConv) based hierarchical graph network (HGNet) for 3D object detection, which processes raw point clouds directly to predict 3D bounding boxes. HGNet effectively captures the relationships between points and utilizes multi-level semantics for object detection. Specifically, we propose a novel shape-attentive GConv (SA-GConv) to capture local shape features by modelling the relative geometric positions of points to describe object shapes. An SA-GConv based U-shape network captures multi-level features, which are mapped into an identical feature space by an improved voting module and then further utilized to generate proposals. Next, a new GConv based Proposal Reasoning Module reasons on the proposals considering the global scene semantics, and the bounding boxes are then predicted. Consequently, our new framework outperforms state-of-the-art methods on two large-scale point cloud datasets, by 4% mean average precision (mAP) on SUN RGB-D and by 3% mAP on ScanNet-V2.
[graph, semantics, hierarchical, reasoning, previous, three, predict, attention] [object, feature, detection, module, proposal, hgnet, voting, sun, prore, bounding, gconv, table, votenet, pyramid, global, propose, framework, map, key, aggregated, aggregate, predicted] [input, detecting, identical] [based, proposed, convolutional, convolution, method, figure, utilized, neighboring] [generate, generator, image] [network, neural, learning, operation, improved, performance, deep, average, indicates, data, process, set, space, precision, better, sampling] [point, geometric, shape, local, relative, xyz, novel, cloud, scene, rgb, pointnet, compare, capture, modelling, well, charles]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Jintai and Lei, Biwen and Song, Qingyu and Ying, Haochao and Chen, Danny Z. and Wu, Jian},
  title = {A Hierarchical Graph Network for 3D Object Detection on Point Clouds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Generative Models of Shape Handles
Matheus Gadelha, Giorgio Gori, Duygu Ceylan, Radomir Mech, Nathan Carr, Tamy Boubekeur, Rui Wang, Subhransu Maji


We present a generative model to synthesize 3D shapes as sets of handles -- lightweight proxies that approximate the original 3D shape -- for applications in interactive editing, shape parsing, and building compact 3D representations. Our model can generate handle sets with varying cardinality and different types of handles. Key to our approach is a deep architecture that predicts both the parameters and existence of shape handles and a novel similarity measure that can easily accommodate different types of handles, such as cuboids or sphere-meshes. We leverage the recent advances in semantic 3D annotation as well as automatic shape summarization techniques to supervise our approach. We show that the resulting shape representations are not only intuitive, but achieve superior quality than previous state-of-the-art. Finally, we demonstrate how our method can be used in applications such as interactive shape editing and completion, leveraging the latent space learned by our model to guide these tasks.
[multiple, describe, described, prediction, previous, represent] [branch, parsing, annotated, stage, rui] [model, existence, trained, original, input, easily] [method, figure, pattern, ieee, capable] [generative, generate, generating, latent, generated, representation, editing, loss, generation, generates, train, minimizing] [set, training, similarity, learning, number, procedure, space, network, data, metric, min, equation, computation, approximate, consider, probability, alternating] [shape, handle, point, conference, compute, computer, vision, distance, acm, cuboid, mesh, accurate, computed, hao, single, surface, leonidas, international, geometric, reconstruction, triangle, siggraph, varying, compare, human, term, defined, coverage, matheus, subhransu]
@InProceedings{Gadelha_2020_CVPR,
  author = {Gadelha, Matheus and Gori, Giorgio and Ceylan, Duygu and Mech, Radomir and Carr, Nathan and Boubekeur, Tamy and Wang, Rui and Maji, Subhransu},
  title = {Learning Generative Models of Shape Handles},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
One Man's Trash Is Another Man's Treasure: Resisting Adversarial Examples by Adversarial Examples
Chang Xiao, Changxi Zheng


Modern image classification systems are often built on deep neural networks, which suffer from adversarial examples--images with deliberately crafted, imperceptible noise to mislead the network's classification. To defend against adversarial examples, a plausible idea is to obfuscate the network's gradient with respect to the input image. This general idea has inspired a long line of defense methods. Yet, almost all of them have proven vulnerable. We revisit this seemingly flawed idea from a radically different perspective. We embrace the omnipresence of adversarial examples and the numerical procedure of crafting them, and turn this harmful attacking process into a useful defense mechanism. Our defense method is conceptually simple: before feeding an input image for classification, transform it by finding an adversarial example on a pre-trained external model. We evaluate our method against a wide range of possible attacks. On both CIFAR-10 and Tiny ImageNet datasets, our method is significantly more robust than state-of-the-art methods. Particularly, in comparison to adversarial training, our method offers lower training cost as well as stronger robustness.
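The defense reduces to a preprocessing step that can be sketched as a short PGD run against an external, independently trained model before classification (an untargeted variant is shown; epsilon and step counts are illustrative, not the paper's settings):

import torch
import torch.nn.functional as F

def purify_by_attack(external, x, eps=8/255, alpha=2/255, steps=10):
    # transform x by finding an adversarial example on the *external* model
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = external(x_adv)
        loss = F.cross_entropy(logits, logits.argmax(1))   # move away from its current prediction
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv

def defended_predict(classifier, external, x):
    return classifier(purify_by_attack(external, x))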
[recognition, natural, work, red] [table, including] [adversarial, defense, input, model, attack, robustness, bpda, robust, perturbation, eot, trained, attacking, noise, pgd, example, stronger, adversarially, crafting, adversary, difference, identity, ian, defend, crafted, circumvented] [method, ieee, range, pattern, figure, prior, june] [image, transfer, idea, transformed] [gradient, training, network, learning, accuracy, number, neural, process, random, standard, classification, backward, pass, deep, evaluate, simple, machine, find, set, approximation, small, increasing, large, arxiv, preprint, function, appendix, size, respect, finding, expectation, optimization] [transformation, conference, international, computer, vision, differentiable, estimated, estimate]
@InProceedings{Xiao_2020_CVPR,
  author = {Xiao, Chang and Zheng, Changxi},
  title = {One Man's Trash Is Another Man's Treasure: Resisting Adversarial Examples by Adversarial Examples},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Toward a Universal Model for Shape From Texture
Dor Verbin, Todd Zickler


We consider the shape from texture problem, where the input is a single image of a curved, textured surface, and the texture and shape are both a priori unknown. We formulate this task as a three-player game between a shape process, a texture process, and a discriminator. The discriminator adapts a set of non-linear filters to try to distinguish image patches created by the texture process from those created by the shape process, while the shape and texture processes try to create image patches that are indistinguishable from those of the other. An equilibrium of this game yields two things: an estimate of the 2.5D surface from the shape process, and a stochastic texture synthesis model from the texture process. Experiments show that this approach is robust to common non-idealities such as shading, gloss, and clutter. We also find that it succeeds for a wide variety of texture types, including both periodic textures and those composed of isolated textons, which have previously required distinct and specialized processing.
[multiple, work, isolated, inspired, visual, previous] [map, including, global, cropped] [input, model, game, true, create] [patch, figure, output, spatial, periodic, field, warp, cyclostationary, pixel, affine, deconvolution, method, multiscale, created] [texture, generator, image, discriminator, synthetic, latent, learn, synthesis, appearance, textons] [distribution, size, find, process, stochastic, vector, set, network, learned, accuracy, training, evaluate, matrix, large, class, denote, equation, updating, architecture] [shape, surface, local, unwarper, tangent, normal, flat, well, approach, system, variety, computer, term, smoothness, single, avoid, shading, international, david, estimate, smooth, correspond, continuous, assume, square]
@InProceedings{Verbin_2020_CVPR,
  author = {Verbin, Dor and Zickler, Todd},
  title = {Toward a Universal Model for Shape From Texture},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
HybridPose: 6D Object Pose Estimation Under Hybrid Representations
Chen Song, Jiaru Song, Qixing Huang


We introduce HybridPose, a novel 6D object pose estimation approach. HybridPose utilizes a hybrid intermediate representation to express different geometric information in the input image, including keypoints, edge vectors, and symmetry correspondences. Compared to a unitary representation, our hybrid representation allows pose regression to exploit more and diverse features when one type of predicted representation is inaccurate (e.g., because of occlusion). Different intermediate representations used by HybridPose can all be predicted by the same simple neural network, and outliers in predicted intermediate representations are filtered by a robust regression module. Compared to state-of-the-art pose estimation approaches, HybridPose is comparable in running time and is significantly more accurate. For example, on Occlusion Linemod dataset, our method achieves a prediction speed of 30 fps with a mean ADD(-S) accuracy of 79.2%, representing a 67.4% improvement from the current state-of-the-art approach.
[prediction, predict, provide, three, current, multiple, outperforms] [object, predicted, edge, regression, occlusion, refinement, module, detection, utilizes, mask, table, achieves] [input, robust, model] [intermediate, ieee, pattern, reflection, output, tensor, method, figure] [image, representation, translation, train, loss, underlying, consists, introduce] [network, set, training, performance, learning, initialization, accuracy, deep, denote, optimization, number, optimal, neural, compared] [pose, symmetry, hybridpose, keypoints, computer, estimation, conference, vision, keypoint, approach, linemod, rotation, rgb, international, symmetric, hybrid, second, dpod, depth, european, relative, point, pvnet, error, dense, accurate, single, tgt, qixing]
@InProceedings{Song_2020_CVPR,
  author = {Song, Chen and Song, Jiaru and Huang, Qixing},
  title = {HybridPose: 6D Object Pose Estimation Under Hybrid Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Boundary-Aware 3D Building Reconstruction From a Single Overhead Image
Jisan Mahmud, True Price, Akash Bapat, Jan-Michael Frahm


We propose a boundary-aware multi-task deep-learning-based framework for fast 3D building modeling from a single overhead image. Unlike most existing techniques which rely on multiple images for 3D scene modeling, we seek to model the buildings in the scene from a single overhead image by jointly learning a modified signed distance function (SDF) from the building boundaries, a dense heightmap of the scene, and scene semantics. To jointly train for these tasks, we leverage pixel-wise semantic segmentation and normalized digital surface maps (nDSM) as supervision, in addition to labeled building outlines. At test time, buildings in the scene are automatically modeled in 3D using only an input overhead image. We demonstrate an increase in building modeling performance using a multi-feature network architecture that improves building outline detection by considering network features learned for the other jointly learned tasks. We also introduce a novel mechanism for robustly refining instance-specific building outlines using the learned modified SDF. We verify the effectiveness of our method on multiple large-scale satellite and aerial imagery datasets, where we obtain state-of-the-art performance in the 3D building reconstruction task.
[overhead, prediction, modeling, provide, predict, recognition, dataset] [building, semantic, detection, bpsh, height, outline, mask, aerial, segmentation, object, imagery, feature, proposal, remote, overlap, framework, final, propose, instance, wang, mou, boundary, spacenet, frahm, region, refinement, ndsm, wbpsh, grss, table] [model] [ieee, method, pattern, sensing, high, proposed, convolutional, mae, fast] [loss, image, generate, learn, modified] [network, learning, performance, task, set, function, learned, deep, data, higher] [satellite, conference, computer, vision, reconstruction, distance, single, international, scene, signed, approach, jointly, estimation, ground, stereo, surface, demonstrate, median, rmse]
@InProceedings{Mahmud_2020_CVPR,
  author = {Mahmud, Jisan and Price, True and Bapat, Akash and Frahm, Jan-Michael},
  title = {Boundary-Aware 3D Building Reconstruction From a Single Overhead Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Articulation-Aware Canonical Surface Mapping
Nilesh Kulkarni, Abhinav Gupta, David F. Fouhey, Shubham Tulsiani


We tackle the tasks of: 1) predicting a Canonical Surface Mapping (CSM) that indicates the mapping from 2D pixels to corresponding points on a canonical template shape, and 2) inferring the articulation and pose of the template corresponding to the input image. While previous approaches rely on keypoint supervision for learning, we present an approach that can learn without such annotations. Our key insight is that these tasks are geometrically related, and we can obtain supervisory signal via enforcing consistency among the predictions. We present results across a diverse set of animal object categories, showing that our method can learn articulation and CSM prediction from image collections using only foreground mask labels for training. We empirically show that allowing articulation helps learn more accurate CSM prediction, and that enforcing the consistency with predicted CSM is similarly critical for learning meaningful articulation.
[prediction, predicting, work, goal] [template, supervision, object, mask, predicted, foreground, annotated, map, annotation, table, tackle, global] [model, input, animal, sheep] [figure, pixel, based, signal] [image, learn, mapping, consistency, corresponding, transfer, source, loss, cycle, supervisory, meaningful, target, requiring] [learning, set, report, observe, training, note, predictor, accuracy, imagenet, deep, space, vector, find, objective] [articulation, surface, keypoint, shape, approach, csm, point, canonical, camera, pose, keypoints, articulated, dense, enforcing, geometric, mesh, accurate, allows, shubham, leverage, correspondence, reprojection, rigid, reconstruction, michael, jitendra, rely, allowing, direct, form, enables, define, lgcc]
@InProceedings{Kulkarni_2020_CVPR,
  author = {Kulkarni, Nilesh and Gupta, Abhinav and Fouhey, David F. and Tulsiani, Shubham},
  title = {Articulation-Aware Canonical Surface Mapping},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
BiFuse: Monocular 360 Depth Estimation via Bi-Projection Fusion
Fu-En Wang, Yu-Hsuan Yeh, Min Sun, Wei-Chen Chiu, Yi-Hsuan Tsai


Depth estimation from a monocular 360 image is an emerging problem that gains popularity due to the availability of consumer-level 360 cameras and the complete surrounding sensing capability. While the standard of 360 imaging is under rapid development, we propose to predict the depth map of a monocular 360 image by mimicking both peripheral and foveal vision of the human eye. To this end, we adopt a two-branch neural network leveraging two common projections: equirectangular and cubemap projections. In particular, equirectangular projection incorporates a complete field-of-view but introduces distortion, whereas cubemap projection avoids distortion but introduces discontinuity at the boundary of the cube. Thus we propose a bi-projection fusion scheme along with learnable masks to balance the feature map from the two projections. Moreover, for the cubemap projection, we propose a spherical padding procedure which mitigates discontinuity at the boundary of each face. We apply our method to four panorama datasets and show favorable results against the existing state-of-the-art methods.
[recognition, dataset, length, prediction, predict] [feature, propose, map, boundary, branch, module, table, adopt, apply, area, introduces, center, surrounding] [face, model, distortion, study, inconsistency] [padding, cube, cubemap, fusion, equirectangular, proposed, ieee, pattern, method, convolution, omnidepth, figure, foveal, peripheral, fcrn, panosuncg, applying, convolutional, mae, equi] [image, qualitative, unsupervised, corresponding, mapping] [learning, network, training, deep, neural, layer, procedure, scheme, min, balance, size, indicates] [depth, spherical, vision, conference, computer, monocular, estimation, projection, single, panorama, camera, fov, human, ground, truth, rmse, international, indoor, geometric, compare, complete, joint, estimate, discontinuity]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Fu-En and Yeh, Yu-Hsuan and Sun, Min and Chiu, Wei-Chen and Tsai, Yi-Hsuan},
  title = {BiFuse: Monocular 360 Depth Estimation via Bi-Projection Fusion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Transformation GAN for Unsupervised Image Synthesis and Representation Learning
Jiayu Wang, Wengang Zhou, Guo-Jun Qi, Zhongqian Fu, Qi Tian, Houqiang Li


Generative Adversarial Networks (GAN) have shown promising performance in image synthesis and unsupervised learning (USL). In most cases, however, the representations extracted from unsupervised GAN are usually unsatisfactory in other computer vision tasks. By using conditional GAN (CGAN), this problem could be solved to some extent, but the main drawback of such models is the necessity for labeled data. To improve both image synthesis quality and representation learning performance under the unsupervised setting, in this paper, we propose a simple yet effective Transformation Generative Adversarial Networks (TrGAN). In our approach, instead of capturing the joint distribution of image-label pairs p(x,y) as in conditional GAN, we try to estimate the joint distribution of transformed image t(x) and transformation t. Specifically, given a randomly sampled transformation t, we train the discriminator to give an estimate of input transformation, while following the adversarial training scheme of the original GAN. In addition, intermediate feature matching as well as feature-transform matching methods are introduced to strengthen the regularization on the generated features. To evaluate the quality of both generated samples and extracted representations, extensive experiments are conducted on four public datasets. The experimental results on the quality of both the synthesized images and the extracted representations demonstrate the effectiveness of our method.
[visual, recognition, predict, work, hit] [feature, table, global, main, propose, supervision] [adversarial, trained, model, quality, input, original, internal, experimental] [intermediate, figure, block, proposed, comparison, ieee, pattern, method, based, output, stacked] [trgan, image, gan, unsupervised, generated, discriminator, conditional, generator, representation, generative, encoder, real, extracted, loss, fid, train, corresponding, generation, transformed, mapping, ifm, aet, synthesis, introduce] [learning, training, distribution, labeled, data, baseline, regularization, neural, better, deep, applied, accuracy, imagenet, large, sample, function, performance] [transformation, conference, computer, vision, international, matching, fit, well, joint, directly, estimate, capture]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Jiayu and Zhou, Wengang and Qi, Guo-Jun and Fu, Zhongqian and Tian, Qi and Li, Houqiang},
  title = {Transformation GAN for Unsupervised Image Synthesis and Representation Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PPDM: Parallel Point Detection and Matching for Real-Time Human-Object Interaction Detection
Yue Liao, Si Liu, Fei Wang, Yanjie Chen, Chen Qian, Jiashi Feng


We propose a single-stage Human-Object Interaction (HOI) detection method that outperforms all existing methods on the HICO-DET dataset while running at 37 fps on a single Titan XP GPU; it is the first real-time HOI detection method. Conventional HOI detection methods comprise two stages, i.e., human-object proposal generation and proposal classification, and their effectiveness and efficiency are limited by this sequential and separate architecture. In this paper, we propose a Parallel Point Detection and Matching (PPDM) HOI detection framework. In PPDM, an HOI is defined as a point triplet <human point, interaction point, object point>. The human and object points are the centers of the detection boxes, and the interaction point is the midpoint of the human and object points. PPDM contains two parallel branches, namely a point detection branch and a point matching branch. The point detection branch predicts the three points. Simultaneously, the point matching branch predicts two displacements from the interaction point to its corresponding human and object points. A human point and an object point originating from the same interaction point are considered a matched pair. In our novel parallel architecture, the interaction points implicitly provide context and regularization for human and object detection; isolated detection boxes that are unlikely to form meaningful HOI triplets are suppressed, which increases the precision of HOI detection. Moreover, matching between human and object detection boxes is applied only around a limited number of filtered candidate interaction points, which saves much computational cost. Additionally, we build a new application-oriented database named HOI-A, which serves as a good supplement to the existing datasets.
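The matching step can be illustrated with a small NumPy sketch (function and variable names are hypothetical): each interaction point is shifted by its two predicted displacements and paired with the nearest detected human and object centers.

import numpy as np

def match_points(interaction_pts, disp_to_human, disp_to_object,
                 human_centers, object_centers):
    # interaction_pts, disp_*: (N, 2); human_centers: (M, 2); object_centers: (K, 2)
    triplets = []
    for p, dh, do in zip(interaction_pts, disp_to_human, disp_to_object):
        h_guess = p + dh   # rough location of the matching human point
        o_guess = p + do   # rough location of the matching object point
        h_idx = int(np.argmin(np.linalg.norm(human_centers - h_guess, axis=1)))
        o_idx = int(np.argmin(np.linalg.norm(object_centers - o_guess, axis=1)))
        triplets.append((h_idx, o_idx))
    return triplets

inter = np.array([[50.0, 60.0]])
dh = np.array([[-10.0, 0.0]])
do = np.array([[12.0, 3.0]])
humans = np.array([[38.0, 61.0], [90.0, 20.0]])
objects = np.array([[63.0, 62.0]])
print(match_points(inter, dh, do, humans, objects))   # [(0, 0)]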
[interaction, dataset, action, three, visual, attention, time, predict, composed, context, reasoning, sit] [object, detection, hoi, center, ppdm, box, feature, branch, table, framework, offset, heatmap, ican, proposal, faster, stage, including, confidence, midpoint, map, predicted, annotated, location, global, propose, matched] [subject, model, heatmaps] [figure, based, parallel, convolutional, existing, method, proposed] [person, corresponding, loss, image, train, row] [size, triplet, negative, considered, set, network, performance, number, general, practical, inference, class, training, learning] [point, human, displacement, matching, second, local, pose, defined, form, novel, limited]
@InProceedings{Liao_2020_CVPR,
  author = {Liao, Yue and Liu, Si and Wang, Fei and Chen, Yanjie and Qian, Chen and Feng, Jiashi},
  title = {PPDM: Parallel Point Detection and Matching for Real-Time Human-Object Interaction Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Height and Uprightness Invariance for 3D Prediction From a Single View
Manel Baradad, Antonio Torralba


Current state-of-the-art methods that predict 3D from single images ignore the fact that the height of objects and their upright orientation are invariant to the camera pose and intrinsic parameters. To account for this, we propose a system that directly regresses 3D world coordinates for each pixel. First, our system predicts the camera position with respect to the ground plane and its intrinsic parameters. It then predicts the 3D position of each pixel along the rays spanned by the camera. The predicted 3D coordinates and normals are invariant to a change in the camera position or its model, and we can directly impose a regression loss on these world coordinates. Our approach yields competitive results for depth and camera pose estimation (while not being explicitly trained to predict either of these) and improves cross-dataset generalization performance over existing state-of-the-art methods.
[prediction, predict, dataset, recognition, state, time, frame] [predicted, height, semantic, table, regression, art, object] [model, trained, generalization, testing, datasets] [method, pattern, figure, reference, ieee, pixel, convolutional, june, output, based, range, field] [image, loss, invariant, produce, generalize] [performance, better, set, learning, neural, data, training, network, simple, distribution, parameter, metric, large, note, test] [depth, camera, conference, computer, vision, scannet, single, ground, structure, fov, system, point, plane, extrinsics, scene, truth, intrinsic, planar, monocular, view, estimated, full, estimate, floor, predicts, position, estimation, intrinsics, regress, indoor, international, normal, allows, cloud]
@InProceedings{Baradad_2020_CVPR,
  author = {Baradad, Manel and Torralba, Antonio},
  title = {Height and Uprightness Invariance for 3D Prediction From a Single View},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation
Mohsen Fayyaz, Jurgen Gall


Temporal action segmentation is a topic of increasing interest; however, annotating each frame in a video is cumbersome and costly. Weakly supervised approaches therefore aim to learn temporal action segmentation from videos that are only weakly labeled. In this work, we assume that for each training video only the list of actions occurring in the video is given, but not when, how often, or in which order they occur. To address this task, we propose an approach that can be trained end-to-end on such data. The approach divides the video into smaller temporal regions and predicts, for each region, the action label and its length. In addition, the network estimates the action label for each frame. By measuring how consistent the frame-wise predictions are with the temporal regions and the annotated action labels, the network learns to divide a video into class-consistent regions. We evaluate our approach on three datasets where it achieves state-of-the-art results.
[temporal, action, video, length, dataset, juergen, breakfast, order, cooking, mof, three, long, embedding, tcbs, frame, sequence, hilde, transformer] [region, segmentation, weakly, supervision, pooling, table, predicted, apply, improves, annotated, achieves, fully, weak] [model, input, adding, constrained, mohsen, trained] [method, proposed, figure, convolution, kernel, high, based, result, ieee, convolutional] [loss, supervised, train, representation, learn, corresponding, encourages] [set, network, max, learning, size, training, evaluate, top, class, mentioned, large, problem, regularizer, function, respect, number, better, regularizers, dimension, probability] [approach, differentiable, human, predicts, conference, computer, directly, vision]
@InProceedings{Fayyaz_2020_CVPR,
  author = {Fayyaz, Mohsen and Gall, Jurgen},
  title = {SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
3DV: 3D Dynamic Voxel for Action Recognition in Depth Video
Yancheng Wang, Yang Xiao, Fu Xiong, Wenxiang Jiang, Zhiguo Cao, Joey Tianyi Zhou, Junsong Yuan


For depth-based 3D action recognition, one essential issue is representing 3D motion patterns effectively and efficiently. To this end, we propose the 3D dynamic voxel (3DV) as a novel 3D motion representation. With 3D space voxelization, the key idea of 3DV is to compactly encode the 3D motion information within a depth video into a regular voxel set (i.e., 3DV) via temporal rank pooling. Each available 3DV voxel intrinsically carries both 3D spatial and motion features for 3D action description. 3DV is then abstracted as a point set and fed into PointNet++ for 3D action recognition in an end-to-end learning manner. The intuition for converting 3DV into point-set form is that PointNet++ is lightweight and effective for deep feature learning on point sets. Since 3DV may lose appearance cues, a multi-stream 3D action recognition scheme is also proposed to learn motion and appearance features jointly. To extract richer temporal order information, we additionally split the depth video into temporal segments and encode each segment into 3DV. Extensive experiments on well-established benchmarks (e.g., NTU RGB+D 120 and NTU RGB+D 60) demonstrate the superiority of our proposal. Notably, we achieve accuracies of 82.4% and 93.5% on NTU RGB+D 120 under the cross-subject and cross-setup test settings, respectively. 3DV's code is available at https://github.com/3huo/3DV-Action.
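For intuition, temporal rank pooling can be approximated by a fixed linear weighting of the per-frame occupancy grids; the sketch below uses the common linear approximation with weights alpha_t = 2t - T - 1, which may differ from the paper's exact formulation.

import numpy as np

def approx_rank_pool(voxel_frames):
    # voxel_frames: (T, X, Y, Z) per-frame occupancy grids.
    # Returns a single (X, Y, Z) grid whose values encode temporal ordering.
    T = voxel_frames.shape[0]
    t = np.arange(1, T + 1, dtype=np.float64)
    alpha = 2.0 * t - T - 1.0                      # linear approximate rank-pooling weights
    return np.tensordot(alpha, voxel_frames, axes=(0, 0))

frames = (np.random.rand(4, 8, 8, 8) > 0.5).astype(np.float64)   # toy 4-frame occupancy
dyn_voxel = approx_rank_pool(frames)                             # 3D-dynamic-voxel-like grid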
[action, recognition, temporal, ntu, video, stream, dataset, time, lstm, executed, attention, regular, microsoft] [feature, table, pooling, effectiveness, split, cnn, proposal, main] [model, input, reveal] [motion, pattern, ieee, proposed, comparison, dynamic, convolutional, spatial, figure, based, extraction, analysis, generally, method, listed] [appearance, image, manner, learn, representation, jun, corresponding] [learning, performance, set, rank, deep, accuracy, machine, number, network, neural, ranking, test, binary, applied, higher, better, size, sampling, vector] [point, depth, voxel, computer, conference, vision, human, local, kinect, pose, single, well, novel, estimation]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yancheng and Xiao, Yang and Xiong, Fu and Jiang, Wenxiang and Cao, Zhiguo and Zhou, Joey Tianyi and Yuan, Junsong},
  title = {3DV: 3D Dynamic Voxel for Action Recognition in Depth Video},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adaptive Interaction Modeling via Graph Operations Search
Haoxin Li, Wei-Shi Zheng, Yu Tao, Haifeng Hu, Jian-Huang Lai


Interaction modeling is important for video action analysis. Recently, several works have designed specific structures to model interactions in videos. However, these structures are manually designed and non-adaptive, which requires structure design effort and, more importantly, cannot model interactions adaptively. In this paper, we automate the process of structure design to learn adaptive structures for interaction modeling. We propose to search the network structures with a differentiable architecture search mechanism, which learns to construct adaptive structures for different videos to facilitate adaptive interaction modeling. To this end, we first design the search space with several basic graph operations that explicitly capture different relations in videos. We experimentally demonstrate that our architecture search framework learns to construct adaptive interaction modeling structures, which provides more insight into the relationship between structures and interaction characteristics, and also removes the need for manual structure design. Additionally, we show that the designed basic graph operations in the search space are able to model different interactions in videos. Experiments on two interaction datasets show that our method achieves performance competitive with the state of the art.
[interaction, graph, node, temporal, recognition, video, attention, modeling, action, explicitly, construct, spatiotemporal, automatically, reasoning, relation, superedge, incorporation, supernode, xti] [feature, background, framework, aggregation, propagation, pooling, propose, employ, backbone, table, affinity] [model, difference, input, effective] [adaptive, figure, ieee, convolution, pattern, convolutional, analysis, cell, comparison, method, proposed, indicate] [learn, corresponding, specific, learns, mismatch, representation] [search, architecture, network, design, searched, basic, learning, operation, classification, space, performance, neural, computation, training, candidate, machine, fixed, accuracy, selected, reduces, equation, set, proportion] [conference, computer, vision, structure, international, differentiable, capture, match]
@InProceedings{Li_2020_CVPR,
  author = {Li, Haoxin and Zheng, Wei-Shi and Tao, Yu and Hu, Haifeng and Lai, Jian-Huang},
  title = {Adaptive Interaction Modeling via Graph Operations Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Front2Back: Single View 3D Shape Reconstruction via Front to Back Prediction
Yuan Yao, Nico Schertler, Enrique Rosales, Helge Rhodin, Leonid Sigal, Alla Sheffer


Reconstruction of a 3D shape from a single 2D image is a classical computer vision problem, whose difficulty stems from the inherent ambiguity of recovering occluded or only partially observed surfaces. Recent methods address this challenge through the use of largely unstructured neural networks that effectively distill conditional mapping and priors over 3D shape. In this work, we induce structure and geometric constraints by leveraging three core observations: (1) the surface of most everyday objects is often almost entirely exposed from pairs of typical opposite views; (2) everyday objects often exhibit global reflective symmetries which can be accurately predicted from single views; (3) opposite orthographic views of a 3D shape share consistent silhouettes. Following these observations, we first predict orthographic 2.5D visible surface maps (depth, normal and silhouette) from perspective 2D images, and detect global reflective symmetries in this data; second, we predict the back-facing depth and normal maps using as input the front maps and, when available, the symmetric reflections of these maps; and finally, we reconstruct a 3D mesh from the union of these maps using a surface reconstruction method best suited for this data. Our experiments demonstrate that our framework outperforms state-of-the-art approaches for 3D shape reconstruction from 2D and 2.5D data in terms of input fidelity and detail preservation. Specifically, we achieve 12% better performance on average on the ShapeNet benchmark dataset, and up to 19% for certain classes of objects (e.g., chairs and vessels).
[prediction, recognition, predict, visual] [map, predicted, object, occluded, oriented, table, detection, global] [input, model, adversarial, visibility] [ieee, pattern, method, figure, intermediate, based, output, reflection] [image, translation, loss, corresponding, produce, representation, expect, real] [learning, training, neural, processing, average, set, data, network, impact, closer, deep] [front, reconstruction, surface, view, computer, depth, symmetry, vision, conference, shape, reflected, normal, visible, plane, single, point, core, distance, ground, reflective, orthographic, truth, symmetric, initial, opposite, geometry, perspective, complete, human, provided, european, facing, shapenet, silhouette, approach, directly, well, cloud, atlasnet]
@InProceedings{Yao_2020_CVPR,
  author = {Yao, Yuan and Schertler, Nico and Rosales, Enrique and Rhodin, Helge and Sigal, Leonid and Sheffer, Alla},
  title = {Front2Back: Single View 3D Shape Reconstruction via Front to Back Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SDC-Depth: Semantic Divide-and-Conquer Network for Monocular Depth Estimation
Lijun Wang, Jianming Zhang, Oliver Wang, Zhe Lin, Huchuan Lu


Monocular depth estimation is an ill-posed problem, and as such critically relies on scene priors and semantics. Due to its complexity, we propose a deep neural network model based on a semantic divide-and-conquer approach. Our model decomposes a scene into semantic segments, such as object instances and background stuff classes, and then predicts a scale and shift invariant depth map for each semantic segment in a canonical space. Semantic segments of the same category share the same depth decoder, so the global depth prediction task is decomposed into a series of category-specific ones, which are simpler to learn and easier to generalize to new scene types. Finally, our model stitches each local depth segment by predicting its scale and shift based on the global context of the image. The model is trained end-to-end using a multi-task loss for panoptic segmentation and depth prediction, and is therefore able to leverage large-scale panoptic segmentation datasets to boost its semantic understanding. We validate the effectiveness of our approach and show state-of-the-art performance on three benchmark datasets.
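The stitching step can be sketched as follows (PyTorch, hypothetical names), assuming the network outputs a canonical depth map per segment together with soft segment masks and per-segment scale/shift predictions:

import torch

def stitch_depth(canonical_depths, masks, scales, shifts):
    # canonical_depths: (K, H, W)  scale/shift-invariant depth per segment
    # masks:            (K, H, W)  soft segment masks (sum to 1 per pixel)
    # scales, shifts:   (K,)       per-segment scale and shift
    depths = scales[:, None, None] * canonical_depths + shifts[:, None, None]
    return (masks * depths).sum(dim=0)              # (H, W) global depth map

K, H, W = 3, 4, 5
canon = torch.rand(K, H, W)
masks = torch.softmax(torch.rand(K, H, W), dim=0)   # soft per-pixel segment assignment
depth = stitch_depth(canon, masks,
                     torch.tensor([2.0, 1.5, 3.0]), torch.tensor([0.1, 0.0, 0.5]))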
[prediction, dataset, decoding, three, context, predict] [segmentation, semantic, instance, category, map, object, global, branch, coco, mask, fully, module, segment, panoptic, backbone, aggregation, feature, table, final, adopt, dci, propose, pyramid, region] [input, trained, model, datasets, improve] [ieee, pattern, method, figure, convolutional, based, proposed, scale, net, zhang, output] [image, loss, train, corresponding] [network, learning, training, performance, set, deep, neural, accuracy, compared, data, best, normalization, test] [depth, estimation, conference, computer, vision, monocular, canonical, single, diw, predicts, local, scene, relative, transformation, rmse, error, mcii, european, sparse]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Lijun and Zhang, Jianming and Wang, Oliver and Lin, Zhe and Lu, Huchuan},
  title = {SDC-Depth: Semantic Divide-and-Conquer Network for Monocular Depth Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Single-View View Synthesis With Multiplane Images
Richard Tucker, Noah Snavely


A recent strand of work in view synthesis uses deep learning to generate multiplane images (a camera-centric, layered 3D representation) given two or more input images at known viewpoints. We apply this representation to single-view view synthesis, a problem which is more challenging but has potentially much wider application. Our method learns to predict a multiplane image directly from a single image input, and we introduce scale-invariant view synthesis for supervision, enabling us to train on online video. We show this approach is applicable to several different datasets, that it additionally generates reasonable depth maps, and that it learns to fill in content behind the edges of foreground objects in background layers.
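For reference, rendering a multiplane image reduces to back-to-front alpha compositing of its layers; the sketch below shows only the compositing step (the paper's renderer also warps each plane into the target view), with layer ordering and names assumed:

import torch

def composite_mpi(rgb, alpha):
    # rgb:   (D, 3, H, W) per-plane colors, index 0 = farthest plane
    # alpha: (D, 1, H, W) per-plane opacities in [0, 1]
    out = rgb[0] * alpha[0]
    for d in range(1, rgb.shape[0]):
        out = rgb[d] * alpha[d] + out * (1.0 - alpha[d])   # "over" compositing
    return out                                             # (3, H, W) rendered image

D, H, W = 8, 16, 16
image = composite_mpi(torch.rand(D, 3, H, W), torch.rand(D, 1, H, W))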
[predict, recognition, work, dataset, prediction, multiple, predicting, video] [background, predicted, apply, table, supervision, foreground] [input, model, quality] [method, scale, light, ieee, disparity, pattern, field, output, figure, color, psnr, ssim, performs, convolutional, interpolation] [image, synthesis, loss, source, target, representation, train, factor, learns, content, learn] [set, learning, layer, network, training, data, deep, online] [depth, view, single, computer, conference, full, mpi, vision, multiplane, point, richard, camera, smoothness, rendered, noah, layered, srinivasan, rendering, ground, stereo, sparse, compute, tulsiani, acm, approach, kitti, nobackground, novel, truth, scene, visible]
@InProceedings{Tucker_2020_CVPR,
  author = {Tucker, Richard and Snavely, Noah},
  title = {Single-View View Synthesis With Multiplane Images},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Parametric Shape Predictions Using Distance Fields
Dmitriy Smirnov, Matthew Fisher, Vladimir G. Kim, Richard Zhang, Justin Solomon


Many tasks in graphics and vision demand machinery for converting shapes into consistent representations with sparse sets of parameters; these representations facilitate rendering, editing, and storage. When the source data is noisy or ambiguous, however, artists and engineers often manually construct such representations, a tedious and potentially time-consuming process. While advances in deep learning have been successfully applied to noisy geometric data, the task of generating parametric shapes has so far been difficult for these methods. Hence, we propose a new framework for predicting parametric shape primitives using deep learning. We use distance fields to transition between shape parameters like control points and input data on a pixel grid. We demonstrate efficacy on 2D and 3D tasks, including font vectorization and surface abstraction.
[exploration, recognition] [predicted, template, propose, apply] [input, curve, model] [figure, method, ieee, field, pattern, output, noisy, comparison, resolution] [loss, font, representation, control, target, image, raster, common, train, adobe, source, generate] [learning, network, deep, set, simple, vector, data, training, sampling, number, test, neural, achieve, space, general, sampled] [distance, shape, chamfer, parametric, geometric, glyph, structure, computer, surface, vision, atlasnet, sparse, vectorization, grid, acm, geometry, cuboid, point, full, conference, demonstrate, single, approach, decorative, leonidas, consistent, eulerian, defined, nearest, rounded, volume, vladimir]
@InProceedings{Smirnov_2020_CVPR,
  author = {Smirnov, Dmitriy and Fisher, Matthew and Kim, Vladimir G. and Zhang, Richard and Solomon, Justin},
  title = {Deep Parametric Shape Predictions Using Distance Fields},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction
Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, Cordelia Schmid


Modeling hand-object manipulations is essential for understanding how humans interact with their environment. While of practical importance, estimating the pose of hands and objects during interactions is challenging due to the large mutual occlusions that occur during manipulation. Recent efforts have been directed towards fully-supervised methods that require large amounts of labeled training samples. Collecting 3D ground-truth data for hand-object interactions, however, is costly, tedious, and error-prone. To overcome this challenge we present a method to leverage photometric consistency across time when annotations are only available for a sparse subset of frames in a video. Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses. Given our estimated reconstructions, we differentiably render the optical flow between pairs of adjacent images and use it within the network to warp one frame to another. We then apply a self-supervised photometric loss that relies on the visual consistency between nearby images. We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach allows us to improve the pose estimation accuracy by leveraging information from neighboring frames in low-data regimes.
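A generic version of the warping-based photometric term looks roughly like the PyTorch sketch below (the flow here is an arbitrary dense field; in the paper it is rendered differentiably from the estimated hand-object reconstructions):

import torch
import torch.nn.functional as F

def photometric_loss(ref_img, nearby_img, flow):
    # ref_img, nearby_img: (B, 3, H, W); flow: (B, 2, H, W) in pixels (x, y).
    B, _, H, W = ref_img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = grid + flow                                       # where to sample nearby_img
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0              # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)    # (B, H, W, 2)
    warped = F.grid_sample(nearby_img, sample_grid, align_corners=True)
    return (ref_img - warped).abs().mean()                     # L1 photometric difference

loss = photometric_loss(torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32),
                        torch.zeros(1, 2, 32, 32))             # zero flow: compares pixels directly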
[recognition, frame, provide, skeleton, predict, action, work, temporal, dataset, understanding] [object, supervision, annotated, fully, challenging, unified] [model, datasets] [ieee, method, pattern, flow, tref, optical, motion, figure, reference, color, pixel] [consistency, loss, image, supervised] [training, data, learning, large, accuracy, average, observe, report, subset, neural, network] [pose, hand, computer, vision, estimation, conference, error, photometric, joint, reconstruction, shape, rgb, human, single, international, monocular, vertex, vtref, mano, fphab, approach, camera, sparsely, dense, depth, mesh, compare, full, itref, additional, estimated, regress, distance, sparse, body, estimating, leverage, allows]
@InProceedings{Hasson_2020_CVPR,
  author = {Hasson, Yana and Tekin, Bugra and Bogo, Federica and Laptev, Ivan and Pollefeys, Marc and Schmid, Cordelia},
  title = {Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Ensemble Generative Cleaning With Feedback Loops for Defending Adversarial Attacks
Jianhe Yuan, Zhihai He


Effective defense of deep neural networks against adversarial attacks remains a challenging problem, especially under powerful white-box attacks. In this paper, we develop a new method called ensemble generative cleaning with feedback loops (EGC-FL) for effective defense of deep neural networks. The proposed EGC-FL method is based on two central ideas. First, we introduce a transformed deadzone layer into the defense network, which consists of an orthonormal transform and a deadzone-based activation function, to destroy the sophisticated noise patterns of adversarial attacks. Second, by constructing a generative cleaning network with a feedback loop, we are able to generate an ensemble of diverse estimates of the original clean image. We then learn a network to fuse this set of diverse estimates to restore the original image. Our extensive experimental results demonstrate that our approach improves the state of the art by large margins in both white-box and black-box attacks. It significantly improves the classification accuracy under white-box PGD attacks over the second-best method, by more than 29% on the SVHN dataset and more than 39% on the challenging CIFAR-10 dataset.
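The deadzone idea by itself is easy to illustrate: small-magnitude responses (where adversarial noise tends to live) are zeroed and larger ones are shrunk, as in soft thresholding. The sketch below shows only this activation; the paper pairs it with an orthonormal transform and learns the layer end to end, and the threshold value here is arbitrary.

import torch

def deadzone(x, delta=0.5):
    # Zero out responses with |x| < delta, shrink the rest toward zero by delta.
    return torch.sign(x) * torch.clamp(x.abs() - delta, min=0.0)

x = torch.tensor([-1.2, -0.3, 0.0, 0.4, 2.0])
print(deadzone(x))   # roughly [-0.7, 0.0, 0.0, 0.0, 1.5]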
[dataset, powerful, outperforms] [feature, table, improves, including, map, challenging, fuse] [adversarial, attack, defense, original, feedback, pgd, noise, cleaning, deadzone, bpda, ensemble, defending, clean, input, fgs, accumulative, iterative, destroy, experimental, defend, ian, sophisticated, magnitude, robust, attacked, effective] [method, proposed, figure, transform, fusion, pattern, output, recover, existing, called, based, remove] [image, generative, transformed, generate, target, content, diverse, loss, row, consists] [network, accuracy, deep, performance, neural, learning, layer, activation, classification, training, algorithm, arxiv, preprint, set, large, svhn, number, gradient, energy, small, classifier, function, best, process] [conference, loop, international, approach, second, david, computer]
@InProceedings{Yuan_2020_CVPR,
  author = {Yuan, Jianhe and He, Zhihai},
  title = {Ensemble Generative Cleaning With Feedback Loops for Defending Adversarial Attacks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Temporal Pyramid Network for Action Recognition
Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, Bolei Zhou


Visual tempo characterizes the dynamics and the temporal scale of an action. Modeling such visual tempos of different actions facilitates their recognition. Previous works often capture the visual tempo through sampling raw videos at multiple rates and constructing an input-level frame pyramid, which usually requires a costly multi-branch network to handle. In this work we propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner. Two essential components of TPN, the source of features and the fusion of features, form a feature hierarchy for the backbone so that it can capture action instances at various tempos. TPN also shows consistent improvements over other challenging baselines on several action recognition datasets. Specifically, when equipped with TPN, the 3D ResNet-50 with dense sampling obtains a 2% gain on the validation set of Kinetics-400. A further analysis also reveals that TPN gains most of its improvements on action classes that have large variances in their visual tempos, validating the effectiveness of TPN.
[tpn, visual, action, temporal, video, tempo, modulation, recognition, frame, tsn, multiple, semantics, spatiotemporal, modeling, work, bring, hierarchical] [backbone, feature, pyramid, table, semantic, instance, final, level, ablation, aggregation, stride, module, propose, improvement, object] [input, original, model, study, testing, auxiliary] [spatial, flow, figure, convolutional, proposed, parallel, output, receptive] [source] [network, variance, performance, set, sampling, validation, rate, size, large, accuracy, applied, better, sampled, neural, training, learning, gain, deep, sample, note, classification, increase] [single, capture, consistent, conference]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Ceyuan and Xu, Yinghao and Shi, Jianping and Dai, Bo and Zhou, Bolei},
  title = {Temporal Pyramid Network for Action Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FaceScape: A Large-Scale High Quality 3D Face Dataset and Detailed Riggable 3D Face Prediction
Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, Xun Cao


In this paper, we present FaceScape, a large-scale detailed 3D face dataset, and propose a novel algorithm that is able to predict elaborate riggable 3D face models from a single image input. The FaceScape dataset provides 18,760 textured 3D faces, captured from 938 subjects, each with 20 specific expressions. The 3D models contain pore-level facial geometry and are processed to be topologically uniform. These fine 3D facial models can be represented as a 3D morphable model for rough shapes and displacement maps for detailed geometry. Taking advantage of the large-scale and high-accuracy dataset, a novel algorithm is further proposed to learn the expression-specific dynamic details using a deep neural network. The learned relationship serves as the foundation of our 3D face prediction system from a single image input. Unlike previous methods, our predicted 3D models are riggable with highly detailed geometry under different expressions. The unprecedented dataset and code will be released to the public for research purposes.
[predict, previous, bilinear, dataset, prediction, static, represent, three, build, multiple] [map, predicted, table] [model, face, facial, expression, riggable, rigged, facescape, morphable, identity, blendshape, quality, datasets, topologically, database, uniformed, james, stefanos, blendshapes, input] [dynamic, figure, detail, comparison, high, method, based, raw, proposed] [image, source, generated, corresponding, representation, xun, specific, texture, consists] [base, learning, deep, neural, network, data, weight, accuracy, space, large, number, parameter, activation] [detailed, displacement, single, shape, geometry, reconstruction, fitting, hao, mesh, rough, recovered, error, deforming, system, capture, parametric, volume, depth, thomas, camera, pipeline, michael]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Haotian and Zhu, Hao and Wang, Yanru and Huang, Mingkai and Shen, Qiu and Yang, Ruigang and Cao, Xun},
  title = {FaceScape: A Large-Scale High Quality 3D Face Dataset and Detailed Riggable 3D Face Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Structure-Guided Ranking Loss for Single Image Depth Prediction
Ke Xian, Jianming Zhang, Oliver Wang, Long Mai, Zhe Lin, Zhiguo Cao


Single image depth prediction is a challenging task due to its ill-posed nature and the difficulty of capturing ground truth for supervision. Large-scale disparity data generated from stereo photos and 3D videos is a promising source of supervision; however, such disparity data can only approximate the inverse ground-truth depth up to an affine transformation. To more effectively learn from such pseudo-depth data, we propose to use a simple pair-wise ranking loss with a novel sampling strategy. Instead of randomly sampling point pairs, we guide the sampling to better characterize the structure of important regions based on low-level edge maps and high-level object instance masks. We show that the pair-wise ranking loss, combined with our structure-guided sampling strategies, can significantly improve the quality of depth map prediction. In addition, we introduce a new relative depth dataset of about 21K diverse high-resolution web stereo photos to enhance the generalization ability of our model. In experiments, we conduct cross-dataset evaluation on six benchmark datasets and show that our method consistently improves over the baselines, leading to superior quantitative and qualitative results.
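A generic pair-wise ranking loss of this kind can be written in a few lines of PyTorch (sketch only; the paper's contribution lies mainly in how the point pairs are sampled from edges and instance masks, which is not shown here):

import torch

def pairwise_ranking_loss(pred_depth, pts_a, pts_b, relation):
    # pred_depth: (H, W); pts_a, pts_b: (N, 2) integer (y, x) coordinates.
    # relation: (N,) in {-1, 0, +1}; +1 means point a is farther than b, 0 means similar depth.
    da = pred_depth[pts_a[:, 0], pts_a[:, 1]]
    db = pred_depth[pts_b[:, 0], pts_b[:, 1]]
    diff = da - db
    ordered = relation != 0
    loss_ordered = torch.log1p(torch.exp(-relation[ordered] * diff[ordered]))  # ranking term
    loss_equal = diff[~ordered] ** 2                                           # tie term
    return torch.cat([loss_ordered, loss_equal]).mean()

pred = torch.rand(8, 8, requires_grad=True)
a = torch.tensor([[0, 0], [3, 4]]); b = torch.tensor([[7, 7], [3, 5]])
loss = pairwise_ranking_loss(pred, a, b, torch.tensor([1.0, 0.0]))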
[recognition, dataset, prediction, evaluation, pair, work, three] [edge, web, propose, object, instance, map, segmentation, supervision, salient, achieves, mask] [model, trained, datasets, generalization] [pattern, disparity, proposed, sharp, based, analysis] [loss, image, source, qualitative, generated, learn, consistency] [sampling, ranking, ordinal, data, gradient, set, learning, sampled, sample, training, random, network, compared, baseline, deep, metric, accuracy, randomly, small, evaluate, neural] [depth, vision, point, computer, stereo, ground, single, monocular, truth, indoor, local, accurate, outdoor, matching, error, monodepth, diw, midas, estimation, structure, ibims]
@InProceedings{Xian_2020_CVPR,
  author = {Xian, Ke and Zhang, Jianming and Wang, Oliver and Mai, Long and Lin, Zhe and Cao, Zhiguo},
  title = {Structure-Guided Ranking Loss for Single Image Depth Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
In Perfect Shape: Certifiably Optimal 3D Shape Reconstruction From 2D Landmarks
Heng Yang, Luca Carlone


We study the problem of 3D shape reconstruction from 2D landmarks extracted in a single image. We adopt the 3D deformable shape model and formulate the reconstruction as a joint optimization of the camera pose and the linear shape parameters. Our first contribution is to apply Lasserre's hierarchy of convex Sums-of-Squares (SOS) relaxations to solve the shape reconstruction problem and show that the SOS relaxation of minimum order 2 empirically solves the original non-convex problem exactly. Our second contribution is to exploit the structure of the polynomial in the objective function and find a reduced set of basis monomials for the SOS relaxation that significantly decreases the size of the resulting semidefinite program (SDP) without compromising its accuracy. These two contributions, to the best of our knowledge, lead to the first certifiably optimal solver for 3D shape reconstruction, that we name Shape*. Our third contribution is to add an outlier rejection layer to Shape* using a truncated least squares (TLS) robust cost function and leveraging graduated non-convexity to solve TLS without initialization. The result is a robust reconstruction algorithm, named Shape#, that tolerates a large amount of outlier measurements. We evaluate the performance of Shape* and Shape# in both simulated and real experiments, showing that Shape* outperforms local optimization and previous convex relaxation techniques, while Shape# achieves state-of-the-art performance and is robust against 70% outliers in the FG3DCar dataset.
[order, recognition, bki, contribution] [global, luca, car, apply, achieves, object, propose, positive] [model, robust, original, truncated, degree, face] [ieee, pattern, method, graduated] [gap, image, translation] [problem, optimization, relaxation, optimal, reduction, set, hierarchy, linear, function, minimum, objective, algorithm, theorem, size, feasible, proposition, large, performance, quadratic, written, equivalent, heng, matrix, regularization, denote] [shape, basis, reconstruction, estimation, pose, convex, polynomial, computer, sdp, vision, solution, error, camera, outlier, rotation, single, semidefinite, certifiably, human, duality, solver, local, fitting, sparse, hhi, solve, solving, relative, solves, monomials, perspective, computed, supplementary, initial]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Heng and Carlone, Luca},
  title = {In Perfect Shape: Certifiably Optimal 3D Shape Reconstruction From 2D Landmarks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
When NAS Meets Robustness: In Search of Robust Architectures Against Adversarial Attacks
Minghao Guo, Yuzhe Yang, Rui Xu, Ziwei Liu, Dahua Lin


Recent advances in adversarial attacks uncover the intrinsic vulnerability of modern deep neural networks. Since then, extensive efforts have been devoted to enhancing the robustness of deep networks via specialized learning algorithms and loss functions. In this work, we take an architectural perspective and investigate the patterns of network architectures that are resilient to adversarial attacks. To obtain the large number of networks needed for this study, we adopt one-shot neural architecture search, training a large network once and then finetuning the sub-networks sampled therefrom. The sampled architectures together with the accuracies they achieve provide a rich basis for our study. Our "robust architecture Odyssey" reveals several valuable observations: 1) densely connected patterns result in improved robustness; 2) under a computational budget, adding convolution operations to direct connection edges is effective; 3) the flow of solution procedure (FSP) matrix is a good indicator of network robustness. Based on these observations, we discover a family of robust architectures (RobNets). On various datasets, including CIFAR, SVHN, Tiny-ImageNet, and ImageNet, RobNets exhibit superior robustness performance to other widely used architectures. Notably, RobNets substantially improve the robust accuracy (about 5% absolute gains) under both white-box and black-box attacks, even with fewer parameters. Code is available at https://github.com/gmh14/RobNets.
[connected, natural, three, previous, provide] [feature, correlation, table, resnet] [adversarial, robust, robustness, robnet, model, fsp, attack, pgd, improve, clean, trained, study, input, representative, adding, robnets, ian, strong, effective, datasets, reveals] [cell, convolution, densely, figure, flow, proposed, analysis, intermediate, based, denoising] [loss, train, extensive, image, generate] [network, architecture, search, accuracy, number, training, neural, computational, matrix, space, parameter, candidate, family, data, performance, supernet, total, arxiv, preprint, deep, learning, large, evaluate, set, larger, sampled, procedure, density, small, finetuning, deeper] [direct, distance, refer]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Minghao and Yang, Yuzhe and Xu, Rui and Liu, Ziwei and Lin, Dahua},
  title = {When NAS Meets Robustness: In Search of Robust Architectures Against Adversarial Attacks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Transferable Targeted Attack
Maosen Li, Cheng Deng, Tengjiao Li, Junchi Yan, Xinbo Gao, Heng Huang


An intriguing property of adversarial examples is their transferability, which suggests that black-box attacks are feasible in real-world applications. Previous works mostly study transferability in the non-targeted setting. However, recent studies show that targeted adversarial examples are more difficult to transfer than non-targeted ones. In this paper, we find that two defects lead to the difficulty of generating transferable examples. First, the magnitude of the gradient decreases during iterative attacks, causing excessive consistency between successive noises in the momentum accumulation, which we term noise curing. Second, it is not enough for targeted adversarial examples to merely get close to the target class without moving away from the true class. To overcome these problems, we propose a novel targeted attack approach to effectively generate more transferable adversarial examples. Specifically, we first introduce the Poincare distance as the similarity metric to make the gradient magnitude self-adaptive during iterative attacks and alleviate noise curing. Furthermore, we regularize the targeted attack process with metric learning to push adversarial examples away from the true label and obtain more transferable targeted adversarial examples. Experiments on ImageNet validate the superiority of our approach, which achieves an attack success rate 8% higher on average than other state-of-the-art methods in black-box targeted attacks.
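The Poincare distance named in the abstract is the standard hyperbolic distance on the unit ball; the small PyTorch sketch below is shown only to make the metric concrete, with the attack's triplet and momentum terms omitted:

import torch

def poincare_distance(u, v, eps=1e-5):
    # d(u, v) = arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    sq = ((u - v) ** 2).sum(dim=-1)
    denom = (1 - (u ** 2).sum(dim=-1)).clamp(min=eps) * \
            (1 - (v ** 2).sum(dim=-1)).clamp(min=eps)
    return torch.acosh(1 + 2 * sq / denom)

u = torch.tensor([[0.1, 0.2]])
v = torch.tensor([[0.4, -0.3]])
print(poincare_distance(u, v))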
[correct, successive, sign, outperforms] [table, key] [adversarial, targeted, attack, noise, ensemble, trained, true, input, success, original, ball, adversarially, potrip, transferability, iterative, example, model, xadv, ytar, untar, magnitude, clean, classified, tar, study, effectively, curing, fgsm, ian, accumulation] [method, proposed, output, figure, existing, based, high] [target, loss, transferable, generated, transfer, cross, generate, corresponding, generating] [gradient, metric, triplet, learning, class, label, probability, deep, logits, momentum, find, close, neural, problem, space, entropy, function, set, similarity, imagenet, softmax, data, iteration, algorithm, network] [distance, point, direction, surface, avoid]
@InProceedings{Li_2020_CVPR,
  author = {Li, Maosen and Deng, Cheng and Li, Tengjiao and Yan, Junchi and Gao, Xinbo and Huang, Heng},
  title = {Towards Transferable Targeted Attack},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Supervised Human Depth Estimation From Monocular Videos
Feitong Tan, Hao Zhu, Zhaopeng Cui, Siyu Zhu, Marc Pollefeys, Ping Tan


Previous methods for estimating detailed human depth often require supervised training with 'ground truth' depth data. This paper presents a self-supervised method that can be trained on YouTube videos without known depth, which makes training data collection simple and improves the generalization of the learned network. The self-supervised learning is achieved by minimizing a photo-consistency loss, which is evaluated between a video frame and its neighboring frames warped according to the estimated depth and the 3D non-rigid motion of the human body. To solve for this non-rigid motion, we first estimate a rough SMPL model at each video frame and compute the non-rigid body motion accordingly, which enables self-supervised learning of the shape details. Experiments demonstrate that our method enjoys better generalization and performs much better on data in the wild.
[frame, video, skeleton, dataset, youtube, represent] [map, final, table, improves, occlusion] [model, trained, robust, generalization, input] [motion, pattern, method, reference, figure, neighboring, detail, recover, result, comparison, proposed, captured, residual] [image, target, loss, train] [learning, base, data, network, training, baseline, accuracy, better, neural, deep, set, function, finetune, simple] [human, smpl, shape, depth, computer, vision, body, single, estimation, pose, conference, tang, compute, tracknet, reconnet, estimate, camera, international, view, estimated, estimating, error, ssimcs, hmd, michael, european, hao, detailed, undressed, photometric, transformation, monocular, left, stereo, volumetric, capture, mesh, georgios]
@InProceedings{Tan_2020_CVPR,
  author = {Tan, Feitong and Zhu, Hao and Cui, Zhaopeng and Zhu, Siyu and Pollefeys, Marc and Tan, Ping},
  title = {Self-Supervised Human Depth Estimation From Monocular Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Recursive Social Behavior Graph for Trajectory Prediction
Jianhua Sun, Qinhong Jiang, Cewu Lu


Social interaction is an important topic in human trajectory prediction for generating plausible paths. In this paper, we present a novel group-based social interaction model to explore relationships among pedestrians. We recursively extract social representations supervised by group-based annotations and formulate them into a social behavior graph, called the Recursive Social Behavior Graph. Our recursive mechanism greatly expands the representational power. A graph convolutional neural network is then used to propagate social interaction information in such a graph. With the guidance of the Recursive Social Behavior Graph, we surpass state-of-the-art methods on the ETH and UCY datasets by 11.1% in ADE and 10.8% in FDE on average, and successfully predict complex social behaviors.
[social, trajectory, prediction, graph, interaction, behavior, three, individual, relational, context, fde, lstm, attention, rsbg, relationship, historical, ade, people, time, previous, cewu, predict, eth, forecasting, red, gcns, bilstm, mechanism, future, integrate, destination, timestep, yit, stgat, alexandre] [feature, pedestrian, represents, propagate, pooling, key, tracking, predicted] [model, strong] [ieee, recursive, pattern, method, dynamic, based, convolutional, figure, proposed, modeled, crowd, spatial] [representation, person, loss, row, generate, common, introduce, target] [neural, network, learning, group, deep, exponential, performance, set, comparing, path] [conference, computer, human, vision, international, distance, approach, novel, scene, ground, handle, truth]
@InProceedings{Sun_2020_CVPR,
  author = {Sun, Jianhua and Jiang, Qinhong and Lu, Cewu},
  title = {Recursive Social Behavior Graph for Trajectory Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Context-Aware and Scale-Insensitive Temporal Repetition Counting
Huaidong Zhang, Xuemiao Xu, Guoqiang Han, Shengfeng He


Temporal repetition counting aims to estimate the number of cycles of a given repetitive action. Existing deep learning methods assume repetitive actions are performed at a fixed time-scale, which is invalid for the complex repetitive actions in real life. In this paper, we tailor a context-aware and scale-insensitive framework to tackle the challenges in repetition counting caused by the unknown and diverse cycle lengths. Our approach combines two key insights: (1) Cycle lengths of different actions are unpredictable and require large-scale searching, but once a coarse cycle length is determined, the variation between repetitions can be handled by regression. (2) Determining the cycle length cannot rely only on a short video fragment but requires contextual understanding. The first insight is implemented by a coarse-to-fine cycle refinement method. It avoids the heavy computation of exhaustively searching all possible cycle lengths in the video and instead propagates the coarse prediction for further refinement in a hierarchical manner. We then propose a bidirectional cycle length estimation method for context-aware prediction. It is a regression network that takes two consecutive coarse cycles as input and predicts the locations of the previous and next repetitive cycles. To support training and evaluation for temporal repetition counting, we construct a new, largest-to-date benchmark containing 526 videos with diverse repetitive actions. Extensive experiments show that the proposed network trained on a single dataset outperforms state-of-the-art methods on several benchmarks, indicating that the proposed framework is general enough to capture repetition patterns across domains. Code and data are available at https://github.com/Xiaodomgdomg/Deep-Temporal-Repetition-Counting.
[video, temporal, previous, action, dataset, length, prediction, frame, varied, bidirectional, work, context, time, future, extract, sequence] [regression, stage, refinement, benchmark, propose, table, refine, framework, detection, tnr, final, contextual, detect, refined, key] [variation, original, exhaustive, input, detecting, trained, model] [repetition, proposed, method, repetitive, figure, counting, motion, quva, ucfrep, periodic, existing, consecutive, range, ytsegments, performed, scale, designed, double, based] [cycle, diverse, aim] [network, data, number, search, large, sampled, fixed, training, performance, classification, problem, set, initialization, validation, count, process, compared, deep, searching, entire] [estimation, position, initial, human, uniformly]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Huaidong and Xu, Xuemiao and Han, Guoqiang and He, Shengfeng},
  title = {Context-Aware and Scale-Insensitive Temporal Repetition Counting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
OASIS: A Large-Scale Dataset for Single Image 3D in the Wild
Weifeng Chen, Shengyi Qian, David Fan, Noriyuki Kojima, Max Hamilton, Jia Deng


Single-view 3D is the task of recovering 3D properties such as depth and surface normals from a single image. We hypothesize that a major obstacle to single-image 3D is data. We address this issue by presenting Open Annotations of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images. We train and evaluate leading models on a variety of single-image 3D tasks. We expect OASIS to be a useful resource for 3D vision research. Project site: https://pvl.cs.princeton.edu/OASIS.
[dataset, work, worker, length, order, evaluation] [occlusion, boundary, predicted, detection, instance, leading, segmentation, jia, annotate, annotation, object, annotated, side] [trained, hourglass, wild, pixelwise, difference, datasets, quality] [ieee, pattern, prior, scale, pixel, sensor, figure, method, parallel] [image, train, perform] [metric, training, network, evaluate, scaling, task, large, imagenet, data, performance, orthogonal, set, distribution, deep, learning] [depth, surface, normal, computer, human, oasis, conference, vision, ground, focal, fold, relative, truth, single, shape, point, nyu, estimation, planar, reconstruction, error, scene, indoor, continuous, european, international, geometry, singleimage, dense, angle, scannet]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Weifeng and Qian, Shengyi and Fan, David and Kojima, Noriyuki and Hamilton, Max and Deng, Jia},
  title = {OASIS: A Large-Scale Dataset for Single Image 3D in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
VPLNet: Deep Single View Normal Estimation With Vanishing Points and Lines
Rui Wang, David Geraghty, Kevin Matzen, Richard Szeliski, Jan-Michael Frahm


We present a novel single-view surface normal estimation method that combines traditional line and vanishing point analysis with a deep learning approach. Starting from a color image and a Manhattan line map, we use a deep neural network to regress a dense normal map and a dense Manhattan label map that identifies planar regions aligned with the Manhattan directions. We fuse the normal map and label map in a fully differentiable manner to produce a refined normal map as the final output. To do so, we softly decompose the output into a Manhattan part and a non-Manhattan part. The Manhattan part is treated by discrete classification and vanishing points, while the non-Manhattan part is learned by direct supervision. Our method achieves state-of-the-art results on standard single-view normal estimation benchmarks. More importantly, we show that by using vanishing points and lines, our method has better generalization ability than existing works. In addition, we demonstrate how our surface normal network can improve the performance of depth estimation networks, both quantitatively and qualitatively, in particular in 3D reconstructions of walls and other flat surfaces.
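The soft decomposition can be sketched as a probability-weighted blend between the six Manhattan directions and the raw regressed normals (PyTorch; tensor names and the 6+1 class layout are assumptions, and the paper's learned fusion is richer than this):

import torch
import torch.nn.functional as F

def fuse_normals(raw_normals, manhattan_logits, vanishing_dirs):
    # raw_normals:      (3, H, W) regressed unit normals
    # manhattan_logits: (7, H, W) logits for 6 Manhattan directions + 1 "other" class
    # vanishing_dirs:   (6, 3) unit vectors from vanishing-point analysis
    probs = torch.softmax(manhattan_logits, dim=0)
    manhattan = torch.einsum("khw,kc->chw", probs[:6], vanishing_dirs)  # Manhattan part
    fused = probs[6:7] * raw_normals + manhattan                        # plus non-Manhattan part
    return F.normalize(fused, dim=0)

raw = F.normalize(torch.rand(3, 8, 8), dim=0)
logits = torch.rand(7, 8, 8)
dirs = torch.tensor([[1.0, 0, 0], [-1.0, 0, 0], [0, 1.0, 0],
                     [0, -1.0, 0], [0, 0, 1.0], [0, 0, -1.0]])
normals = fuse_normals(raw, logits, dirs)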
[prediction, dataset, visual, evaluation] [map, table, predicted, final, detection, semantic, segmentation, achieves] [model, generalization, input, trained] [figure, ieee, pattern, proposed, method, output, raw, convolutional, traditional, based, color] [image, produce, ability, pretrained, aligned] [label, network, deep, baseline, metric, neural, training, accuracy, better, learning, performance, large, orthogonal, nout, compared] [normal, manhattan, estimation, vanishing, depth, computer, conference, surface, vision, point, single, dominant, scannet, error, ground, truth, replica, nraw, ncomb, computed, direction, international, dense, rgb, analytically, camera, demonstrate, plane, framenet, directly, median, view, direct, approach, geometry, well, geometric, coordinate]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Rui and Geraghty, David and Matzen, Kevin and Szeliski, Richard and Frahm, Jan-Michael},
  title = {VPLNet: Deep Single View Normal Estimation With Vanishing Points and Lines},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning
Tianlong Chen, Sijia Liu, Shiyu Chang, Yu Cheng, Lisa Amini, Zhangyang Wang


Pretrained models from self-supervision are widely used to make fine-tuning on downstream tasks faster or more accurate. However, gaining robustness from pretraining has been left unexplored. We introduce adversarial training into self-supervision to provide general-purpose robust pretrained models for the first time. We find these robust pretrained models can benefit the subsequent fine-tuning in two ways: i) boosting final model robustness; ii) saving computation cost if proceeding to adversarial fine-tuning. We conduct extensive experiments to demonstrate that the proposed framework achieves large performance margins (e.g., 3.83% on robust accuracy and 1.3% on standard accuracy on the CIFAR-10 dataset), compared with the conventional end-to-end adversarial training baseline. Moreover, we find that different self-supervised pretrained models have diverse adversarial vulnerability. This inspires us to ensemble several pretraining tasks, which boosts robustness further. Our ensemble strategy contributes to a further improvement of 3.59% on robust accuracy, while maintaining a slightly higher standard accuracy on CIFAR-10. Our codes are available at https://github.com/TAMU-VITA/Adv-SS-Pretraining.
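The adversarial examples used in such training are typically generated with L-infinity PGD; the sketch below is a generic PGD routine (not the paper's exact recipe), where the model could be a self-supervised pretraining head or the downstream classifier:

import torch
import torch.nn as nn
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=5):
    # Standard L-infinity PGD: random start, signed-gradient steps, projection to the eps-ball.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()

toy_head = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4))   # e.g., a 4-way rotation head
x = torch.rand(2, 3, 32, 32)
y = torch.randint(0, 4, (2,))
x_adv = pgd_attack(toy_head, x, y)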
[prediction, dataset, three, downstream, evaluation, provide] [table, final, improvement, represents, ablation, denotes, feature, framework] [adversarial, robust, model, ensemble, robustness, selfie, adversarially, pgd, defense, successful, input, study, attack, scenario, testing, asr] [figure, proposed, ieee, method, column, comparison, pattern] [jigsaw, pretrained, image, supervised, transfer, loss, learnt, diverse, diversity, representation, unsupervised] [pretraining, training, standard, learning, accuracy, task, arxiv, preprint, data, deep, classification, unlabeled, neural, performance, best, better, size, baseline, classifier, set, follow, problem, finetuning, network, unforeseen, sample, find, higher] [rotation, full, conference, approach, partial, computer, demonstrate, consistent]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Tianlong and Liu, Sijia and Chang, Shiyu and Cheng, Yu and Amini, Lisa and Wang, Zhangyang},
  title = {Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Defending Against Universal Attacks Through Selective Feature Regeneration
Tejas Borkar, Felix Heide, Lina Karam


Deep neural network (DNN) predictions have been shown to be vulnerable to carefully crafted adversarial perturbations. Specifically, image-agnostic (universal adversarial) perturbations added to any image can fool a target network into making erroneous predictions. Departing from existing defense strategies that work mostly in the image domain, we present a novel defense which operates in the DNN feature domain and effectively defends against such universal perturbations. Our approach identifies pre-trained convolutional features that are most vulnerable to adversarial noise and deploys trainable feature regeneration units which transform these DNN filter activations into resilient features that are robust to universal perturbations. Regenerating only the top 50% adversarially susceptible activations in at most 6 DNN layers and leaving all remaining DNN activations unchanged, we outperform existing defense strategies across different network architectures by more than 10% in restored accuracy. We show that without any additional modification, our defense trained on ImageNet with one type of universal attack examples effectively defends against other types of unseen universal attacks.
[unit, making, work] [feature, table] [adversarial, universal, defense, attack, dnn, regeneration, perturbation, robustness, uap, trained, noise, perturbed, input, effectively, adversarially, fooling, googlenet, dnns, clean, caffenet, spgd, ranked, defends, prn, susceptible, nag, vsyn, ian, vulnerable, robust, hgd, model, regenerated, masking] [convolutional, figure, proposed, ieee, existing, pattern, restoration, method, denoising, output] [image, target, unseen, gap, loss, synthetic, generate] [accuracy, baseline, filter, training, set, deep, network, ratio, learning, layer, neural, imagenet, evaluate, small, classification] [conference, computer, vision, international, additional, computed]
@InProceedings{Borkar_2020_CVPR,
  author = {Borkar, Tejas and Heide, Felix and Karam, Lina},
  title = {Defending Against Universal Attacks Through Selective Feature Regeneration},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
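The central component of this defense is a small trainable unit that regenerates the most adversarially susceptible filter activations while passing the remaining channels through unchanged. The sketch below illustrates that idea in PyTorch; the unit architecture, the susceptibility ranking, and how the unit is spliced into the backbone are simplified assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class FeatureRegenerationUnit(nn.Module):
    """Illustrative regeneration unit: transforms the channels ranked as most
    susceptible to adversarial noise and leaves all other channels unchanged."""
    def __init__(self, channels, susceptible_idx):
        super().__init__()
        k = len(susceptible_idx)
        self.idx = susceptible_idx                 # indices of vulnerable channels
        self.regen = nn.Sequential(                # small trainable transform
            nn.Conv2d(k, k, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(k, k, 3, padding=1))

    def forward(self, feat):
        out = feat.clone()
        out[:, self.idx] = self.regen(feat[:, self.idx])
        return out
```

In practice such a unit would be inserted after a handful of the most vulnerable convolutional layers (the paper regenerates at most the top 50% susceptible activations in at most 6 layers), with the backbone weights kept frozen.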
Universal Physical Camouflage Attacks on Object Detectors
Lifeng Huang, Chengying Gao, Yuyin Zhou, Cihang Xie, Alan L. Yuille, Changqing Zou, Ning Liu


In this paper, we study physical adversarial attacks on object detectors in the wild. Previous works mostly craft instance-dependent perturbations only for rigid or planar objects. In contrast, we propose to learn an adversarial pattern to effectively attack all instances belonging to the same object category, referred to as Universal Physical Camouflage Attack (UPC). Concretely, UPC crafts camouflage by jointly fooling the region proposal network, as well as misleading the classifier and the regressor to output errors. In order to make UPC effective for non-rigid or non-planar objects, we introduce a set of transformations for mimicking deformable properties. We additionally impose an optimization constraint to make generated patterns look natural to human observers. To fairly evaluate the effectiveness of different physical-world attacks, we present the first standardized virtual database, AttackScenes, which simulates the real 3D world in a controllable and reproducible environment. Extensive experiments suggest the superiority of our proposed UPC compared with existing physical adversarial attackers not only in virtual environments (AttackScenes), but also in real-world physical environments.
[natural, walking, dataset] [object, detection, detected, table, semantic, faster, car, rpn, head, propose, threshold, region, proposal, category] [physical, adversarial, attack, camouflage, attacking, universal, upc, fooling, original, digital, fool, transferability, perturbed, drop, external, internal, study, effectively, standardized, robust, strength] [pattern, ieee, proposed, figure, method, output] [generated, real, image, target, generate, person] [arxiv, preprint, set, network, deep, classification, training, rate, label, precision, classifier, neural, denote, performance, average, evaluate] [computer, conference, virtual, human, vision, constraint, well, scene, complex, transformation]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Lifeng and Gao, Chengying and Zhou, Yuyin and Xie, Cihang and Yuille, Alan L. and Zou, Changqing and Liu, Ning},
  title = {Universal Physical Camouflage Attacks on Object Detectors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Intra- and Inter-Action Understanding via Temporal Action Parsing
Dian Shao, Yue Zhao, Bo Dai, Dahua Lin


Current methods for action recognition primarily rely on deep convolutional networks to derive feature embeddings of visual and motion features. While these methods have demonstrated remarkable performance on standard benchmarks, we are still in need of a better understanding as to how the videos, in particular their internal structures, relate to high-level semantics, which may lead to benefits in multiple aspects, e.g. interpretable predictions and even new methods that can take recognition performance to the next level. Towards this goal, we construct TAPOS, a new dataset developed on sport videos with manual annotations of sub-actions, and conduct a study on temporal action parsing on top of it. Our study shows that a sport activity usually consists of multiple sub-actions and that the awareness of such temporal structures is beneficial to action recognition. We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing their labels. On the constructed TAPOS, the proposed method is shown to reveal intra-action information, i.e. how action instances are made of sub-actions, and inter-action information, i.e. that one specific sub-action may commonly appear in various actions.
[action, temporal, video, dataset, transparser, frame, sps, tapos, recognition, activity, provide, olympics, dahua, sport, visual, three, attention, unit, understanding, ivan, yue] [parsing, feature, segmentation, instance, table, focus, global, boundary, recall, semantic, annotated] [internal, jump, datasets, tap, ctm] [figure, pattern, proposed, convolutional, motion, analysis, output, method] [miner, loss, discriminative] [number, class, network, set, average, performance, compared, deep, classification, learning, improved, knowing, large, start, process, sampling, precision, arxiv, preprint, top] [human, computer, distance, conference, local, european, second, relative, approach]
@InProceedings{Shao_2020_CVPR,
  author = {Shao, Dian and Zhao, Yue and Dai, Bo and Lin, Dahua},
  title = {Intra- and Inter-Action Understanding via Temporal Action Parsing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Lightweight Photometric Stereo for Facial Details Recovery
Xueying Wang, Yudong Guo, Bailin Deng, Juyong Zhang


Recently, 3D face reconstruction from a single image has achieved great success with the help of deep learning and shape prior knowledge, but such methods often fail to produce accurate geometry details. On the other hand, photometric stereo methods can recover reliable geometry details, but require dense inputs and need to solve a complex optimization problem. In this paper, we present a lightweight strategy that only requires sparse inputs or even a single image to recover high-fidelity face shapes with images captured under near-field lights. To this end, we construct a dataset containing 84 different subjects with 29 expressions under 3 different lights. Data augmentation is applied to enrich the data in terms of diversity in identity, lighting, expression, etc. With this constructed dataset, we propose a novel neural network specially designed for photometric stereo based 3D face reconstruction. Extensive experiments and comparisons demonstrate that our method can generate high-quality reconstruction results with one to three facial images captured under near-field lights. Our full framework is available at https://github.com/Juyong/FacePSNet.
[dataset, three, recognition, construct, constructed] [map, stage, height, propose] [face, input, model, facial] [method, light, ieee, captured, pattern, recover, based, proposed, prior, figure, lightweight, convolutional, pixel] [image, real, corresponding, synthetic, train, fine] [network, proxy, deep, set, data, problem, learning, augmentation, neural, training, updated, optimization, test, number, better, requires] [photometric, normal, stereo, reconstruction, conference, point, computer, lighting, vision, single, estimation, geometry, coarse, sparse, shape, accurate, reconstruct, surface, international, parametric, recovered, solve, triangle, estimated, albedo, iij, vertex, ground, dense, geometric, uncalibrated, position, human, term, truth]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Xueying and Guo, Yudong and Deng, Bailin and Zhang, Juyong},
  title = {Lightweight Photometric Stereo for Facial Details Recovery},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
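For background, the photometric stereo setting the paper builds on can be summarized by the classical Lambertian least-squares solve under known distant lights; the paper itself handles near-field lights with a dedicated network rather than this closed-form solution. A hedged NumPy sketch of the classical baseline:

```python
import numpy as np

def lambertian_photometric_stereo(images, light_dirs):
    """Classical least-squares photometric stereo under distant lights.

    images:     (k, h, w) grayscale observations under k lights
    light_dirs: (k, 3) unit light directions
    Returns per-pixel unit normals (h, w, 3) and albedo (h, w).
    """
    k, h, w = images.shape
    I = images.reshape(k, -1)                               # (k, h*w)
    # Solve L @ (albedo * n) = I for every pixel in one batched least squares.
    G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)      # (3, h*w)
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)
    return normals.T.reshape(h, w, 3), albedo.reshape(h, w)
```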
Bundle Pooling for Polygonal Architecture Segmentation Problem
Huayi Zeng, Kevin Joseph, Adam Vest, Yasutaka Furukawa


This paper introduces a polygonal architecture segmentation problem, proposes bundle-pooling modules for line structure reasoning, and demonstrates a virtual remodeling application that produces production quality results. Given a photograph of a house with a few vanishing point candidates, we decompose the house into a set of architectural components, each of which is represented as a simple geometric primitive. A bundle-pooling module pools convolutional features along a bundle of line segments (e.g., a family of vanishing lines) and fuses the bundle of features to determine polygonal boundaries or assign a corresponding vanishing point. Qualitative and quantitative evaluations demonstrate significant improvements over the existing techniques based on our metric and benchmark dataset. We will share the code and data for further research.
[house, three, current, bipartite] [pooling, architectural, segmentation, bounding, boundary, table, remodeling, feature, box, building, detection, mask, object, ross, yasutaka, assign, segment, assignment, faster, determines, map, global, apply] [roof, type, dnn, photograph] [ieee, figure, pattern, range, based, combination] [image, corresponding, qualitative, layout, residential, representation, proposes] [architecture, simple, number, top, problem, set, neural, max, standard, paper, learning, network] [bundle, vanishing, polygonal, computer, conference, vision, polygon, point, system, primitive, international, virtual, approach, left, wall, facade, determine, ransac, single, form, keypoint, estimate, shape, reconstruction, vertical]
@InProceedings{Zeng_2020_CVPR,
  author = {Zeng, Huayi and Joseph, Kevin and Vest, Adam and Furukawa, Yasutaka},
  title = {Bundle Pooling for Polygonal Architecture Segmentation Problem},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
AvatarMe: Realistically Renderable 3D Facial Reconstruction "In-the-Wild"
Alexandros Lattas, Stylianos Moschoglou, Baris Gecer, Stylianos Ploumpis, Vasileios Triantafyllou, Abhijeet Ghosh, Stefanos Zafeiriou


Over the last few years, with the advent of Generative Adversarial Networks (GANs), many face analysis tasks have accomplished astounding performance, with applications including, but not limited to, face generation and 3D face reconstruction from a single "in-the-wild" image. Nevertheless, to the best of our knowledge, there is no method which can produce high-resolution photorealistic 3D faces from "in-the-wild" images, and this can be attributed to (a) the scarcity of available data for training, and (b) the lack of robust methodologies that can successfully be applied to very high-resolution data. In this paper, we introduce AvatarMe, the first method that is able to reconstruct photorealistic 3D faces from a single "in-the-wild" image with an increasing level of detail. To achieve this, we capture a large dataset of facial shape and reflectance and build on a state-of-the-art 3D texture and shape reconstruction method and successively refine its results, while generating the per-pixel diffuse and specular components that are required for realistic rendering. As we demonstrate in a series of qualitative and quantitative experiments, AvatarMe outperforms the existing state of the art by a significant margin and reconstructs authentic, 4K by 6K-resolution 3D faces from a single low-resolution image that, for the first time, bridges the uncanny valley.
[infer, environment, dataset, order] [head, map, including, employ] [facial, face, input, model, quality, stefanos, stylianos, skin, ganfit, adversarial, identity, original, morphable, led, subject] [method, ieee, pattern, illumination, high, figure, resolution, proposed, acquisition, analysis] [texture, translation, image, train, produce, photorealistic, arbitrary, grayscale] [network, data, deep, learning, training, space, gradient, neural, large, algorithm, inference] [diffuse, specular, albedo, reflectance, computer, conference, shape, reconstruction, geometry, vision, capture, single, reconstructed, acquired, acm, reconstruct, estimation, spherical, rendered, well, baked, human, polarized, rendering, paul, fitting, tangent]
@InProceedings{Lattas_2020_CVPR,
  author = {Lattas, Alexandros and Moschoglou, Stylianos and Gecer, Baris and Ploumpis, Stylianos and Triantafyllou, Vasileios and Ghosh, Abhijeet and Zafeiriou, Stefanos},
  title = {AvatarMe: Realistically Renderable 3D Facial Reconstruction "In-the-Wild"},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Defending Against Model Stealing Attacks With Adaptive Misinformation
Sanjay Kariyappa, Moinuddin K. Qureshi


Deep Neural Networks (DNNs) are susceptible to model stealing attacks, which allows a data-limited adversary with no knowledge of the training dataset to clone the functionality of a target model, just by using black-box query access. Such attacks are typically carried out by querying the target model using inputs that are synthetically generated or sampled from a surrogate dataset to construct a labeled dataset. The adversary can use this labeled dataset to train a clone model, which achieves a classification accuracy comparable to that of the target model. We propose "Adaptive Misinformation" to defend against such model stealing attacks. We identify that all existing model stealing attacks invariably query the target model with Out-Of-Distribution (OOD) inputs. By selectively sending incorrect predictions for OOD queries, our defense substantially degrades the accuracy of the attacker's clone model (by up to 40%), while minimally impacting the accuracy (<0.5%) for benign users. Compared to existing defenses, our defense has a significantly better security vs accuracy trade-off and incurs minimal computational overhead.
[dataset, prediction, selectively, work] [detection, detector, achieves, propose] [model, clone, stealing, adversary, defense, defender, security, misinformation, query, input, perturbation, serviced, original, incorrect, perturbed, attack, benign, adversarial, trained, knockoffnets, access, msp, service, attacker, true, functionality, representative, indicating, poisoning, curve] [existing, adaptive, high, based, output, figure, low, proposed] [train, target, surrogate, synthetic, produce, perform, user, loss] [accuracy, ood, classification, data, compared, better, distribution, training, labeled, learning, lower, deep, large, number, probability, function, amount, problem, augmentation, class, computational, small, set, requires, objective, neural, comparable] [allows]
@InProceedings{Kariyappa_2020_CVPR,
  author = {Kariyappa, Sanjay and Qureshi, Moinuddin K.},
  title = {Defending Against Model Stealing Attacks With Adaptive Misinformation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
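The defense hinges on flagging out-of-distribution queries and answering them with incorrect predictions while serving benign queries normally. Below is a hedged sketch of that serving logic; the max-softmax confidence score, the threshold value, and the stand-in misinformation model are illustrative assumptions rather than the paper's exact detector and misinformation function.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def serve_query(model, misinfo_model, x, tau=0.5):
    """Return benign predictions for in-distribution queries and
    misinformation for queries flagged as out-of-distribution (OOD).

    tau: confidence threshold (illustrative value).
    misinfo_model: an auxiliary network producing incorrect but
    plausible-looking probability vectors (a stand-in here).
    """
    probs = F.softmax(model(x), dim=1)
    conf, _ = probs.max(dim=1)                 # max-softmax score as OOD proxy
    ood = conf < tau
    out = probs.clone()
    if ood.any():
        out[ood] = F.softmax(misinfo_model(x[ood]), dim=1)
    return out
```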
Learning to Generate 3D Training Data Through Hybrid Gradient
Dawei Yang, Jia Deng


Synthetic images rendered by graphics engines are a promising source for training deep networks. However, it is challenging to ensure that they can help train a network to perform well on real images, because a graphics-based generation pipeline requires numerous design decisions such as the selection of 3D shapes and the placement of the camera. In this work, we propose a new method that optimizes the generation of 3D training data based on what we call "hybrid gradient". We parametrize the design decisions as a real vector, and combine the approximate gradient and the analytical gradient to obtain the hybrid gradient of the network performance with respect to this vector. We evaluate our approach on the task of estimating surface normal, depth or intrinsic decomposition from a single image. Experiments on standard benchmarks show that our approach can outperform the prior state of the art on optimizing the generation of 3D training data, particularly in terms of computational efficiency.
[recognition, dataset] [deng, table, including] [trained, face, model, generalization, difference, original, finite] [ieee, pattern, method, june] [synthetic, generation, real, train, image, generated, yang, loss, generating, texture, generate] [training, gradient, network, random, optimization, performance, validation, data, learning, deep, analytical, evaluate, set, sample, search, function, design, algorithm, neural, approximate, vector, test, respect, update, optimize, basic, task, simple, sampled, better, sampling, optimizing, fixed, probabilistic, hyperparameter] [computer, hybrid, vision, conference, approach, depth, scene, international, rendering, shape, error, intrinsic, surface, single, differentiable, estimation, pipeline, ground, computed, pose, thomas, decomposition, truth, compute, indoor, suncg, nyu, david, rendered, well]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Dawei and Deng, Jia},
  title = {Learning to Generate 3D Training Data Through Hybrid Gradient},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
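The idea of a hybrid gradient, combining an approximate gradient through the non-differentiable rendering-and-training pipeline with an analytical gradient through the differentiable parts, can be sketched as follows. This is a toy illustration under simplifying assumptions (a Gaussian-smoothing approximate gradient and a plain sum as the combination rule), not the authors' implementation.

```python
import numpy as np

def hybrid_gradient(theta, render_and_eval, analytic_grad, sigma=0.1, n_samples=8):
    """Toy hybrid gradient for parameters `theta` of a 3D data-generation pipeline.

    render_and_eval(theta) -> scalar validation loss (non-differentiable:
        involves rendering data and training a small network on it).
    analytic_grad(theta)   -> gradient of the differentiable part w.r.t. theta.
    """
    base = render_and_eval(theta)
    approx = np.zeros_like(theta)
    for _ in range(n_samples):
        eps = np.random.randn(*theta.shape)
        # Finite-difference / evolution-strategies style estimate.
        approx += (render_and_eval(theta + sigma * eps) - base) / sigma * eps
    approx /= n_samples
    return approx + analytic_grad(theta)
```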
Cascaded Refinement Network for Point Cloud Completion
Xiaogang Wang, Marcelo H. Ang Jr. , Gim Hee Lee


Point clouds are often sparse and incomplete. Existing shape completion methods are incapable of generating details of objects or learning the complex point distributions. To this end, we propose a cascaded refinement network together with a coarse-to-fine strategy to synthesize the detailed object shapes. Considering the local details of partial input with the global shape information together, we can preserve the existing details in the incomplete point set and generate the missing parts with high fidelity. We also design a patch discriminator that guarantees every local area has the same pattern with the ground truth to learn the complicated point distribution. Quantitative and qualitative experiments on different datasets show that our method achieves superior results compared to existing state-of-the-art approaches on the 3D point cloud completion task. Our source code is available at https://github.com/xiaogangw/cascaded-point-completion.git.
[dataset, hierarchical, graph, evaluation] [feature, object, table, refinement, global, module, represents, propose, refine, adopt, mirror] [adversarial, input, model, testing] [ieee, pattern, method, figure, output, quantitative, existing, resolution, proposed, comparison, cascaded, patch, upsampling, extraction] [generate, discriminator, loss, generated, qualitative, generative, missing, generator, generating, synthesize, preserve, fine, latent] [training, learning, network, set, data, size, neural, arxiv, preprint, deep, compared, design] [point, shape, conference, computer, completion, cloud, pcn, topnet, vision, partial, complete, reconstruction, dense, coarse, mlps, distance, lifting, local, ground, single, international, voxel, accurate, mesh, structure, chair]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Xiaogang and , Marcelo H. Ang Jr. and Lee, Gim Hee},
  title = {Cascaded Refinement Network for Point Cloud Completion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
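Point cloud completion networks such as this one are commonly trained and evaluated with the Chamfer distance between the completed and ground-truth clouds (the paper additionally uses a patch discriminator). A minimal PyTorch sketch of the metric:

```python
import torch

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (n, 3) and q (m, 3)."""
    d = torch.cdist(p, q)                 # (n, m) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```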
Enhancing Intrinsic Adversarial Robustness via Feature Pyramid Decoder
Guanlin Li, Shuya Ding, Jun Luo, Chang Liu


Whereas adversarial training is employed as the main defence strategy against specific adversarial samples, it has limited generalization capability and incurs excessive time complexity. In this paper, we propose an attack-agnostic defence framework to enhance the intrinsic robustness of neural networks, without jeopardizing the ability to generalize on clean samples. Our Feature Pyramid Decoder (FPD) framework applies to all block-based convolutional neural networks (CNNs). It implants denoising and image restoration modules into a targeted CNN, and it also constrains the Lipschitz constant of the classification layer. Moreover, we propose a two-phase strategy to train the FPD-enhanced CNN, utilizing ε-neighbourhood noisy images with multi-task and self-supervised learning. Evaluated against a variety of white-box and black-box attacks, we demonstrate that FPD-enhanced CNNs gain sufficient robustness against general adversarial samples on MNIST, SVHN and CALTECH. In addition, if we further conduct adversarial training, the FPD-enhanced CNNs perform better than their non-enhanced versions.
[time, step, exploration, decoder] [cnn, module, table, feature, framework, pyramid, propose] [adversarial, acc, robustness, attack, fpd, fpgd, clean, opgd, lipschitz, constant, xclean, pgd, noise, implanted, middle, original, fpdfd, trained, fgsm, fpdr, defending, fpdbd, depicted, thwarting, ian, shandong, conduct, robust, input, improve, defence] [denoising, enhanced, restoration, figure, residual, phase, output, enhance, proposed, noisy, block, comparison, high, elu] [image, loss, target, abstract, train] [training, layer, inner, network, performance, classification, deep, activation, strategy, neural, bottleneck, selection, function, learning, set, accuracy, average] [front, intrinsic, variety]
@InProceedings{Li_2020_CVPR,
  author = {Li, Guanlin and Ding, Shuya and Luo, Jun and Liu, Chang},
  title = {Enhancing Intrinsic Adversarial Robustness via Feature Pyramid Decoder},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Discriminate Information for Online Action Detection
Hyunjun Eun, Jinyoung Moon, Jongyoul Park, Chanho Jung, Changick Kim


From a streaming video, online action detection aims to identify actions in the present. For this task, previous methods use recurrent networks to model the temporal sequence of current action frames. However, these methods overlook the fact that an input image sequence includes background and irrelevant actions as well as the action of interest. For online action detection, in this paper, we propose a novel recurrent unit to explicitly discriminate the information relevant to an ongoing action from others. Our unit, named Information Discrimination Unit (IDU), decides whether to accumulate input information based on its relevance to the current action. This enables our recurrent network with IDU to learn a more discriminative representation for identifying ongoing actions. In experiments on two benchmark datasets, TVSeries and THUMOS-14, the proposed method outperforms state-of-the-art methods by a significant margin. Moreover, we demonstrate the effectiveness of our recurrent unit by conducting comprehensive ablation studies.
[action, current, idn, idu, embedding, relevant, gru, reset, recurrent, ongoing, tvseries, temporal, recognition, hidden, time, mcap, relevance, previous, unit, trn, state, xet, video, dataset, mechanism, untrimmed, wxt, explicitly, relation, lstm, tsn, red, sequence] [module, detection, feature, table, effectiveness, achieves, background, ablation, map, including, discriminate] [input, model, effectively, offline, streaming] [ieee, proposed, pattern, based, comparison, method, figure] [discrimination, irrelevant, loss, learn] [update, online, early, network, gate, performance, learning, precision, set, average, note, number] [conference, vision, computer, enables, computed, international, compare, novel]
@InProceedings{Eun_2020_CVPR,
  author = {Eun, Hyunjun and Moon, Jinyoung and Park, Jongyoul and Jung, Chanho and Kim, Changick},
  title = {Learning to Discriminate Information for Online Action Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
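The core mechanism is a recurrent unit that decides, at every step, how much of the current input to accumulate into the hidden state based on its relevance to the ongoing action. The cell below is an illustrative simplification of that gating idea, not the exact IDU equations.

```python
import torch
import torch.nn as nn

class RelevanceGatedCell(nn.Module):
    """Illustrative recurrent cell that accumulates an input only in proportion
    to its estimated relevance to the ongoing action (in the spirit of the
    Information Discrimination Unit)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.relevance = nn.Linear(in_dim + hid_dim, 1)   # gate from input + state
        self.update = nn.Linear(in_dim, hid_dim)

    def forward(self, x_t, h):
        g = torch.sigmoid(self.relevance(torch.cat([x_t, h], dim=-1)))
        # Irrelevant frames (g close to 0) leave the hidden state untouched.
        return (1 - g) * h + g * torch.tanh(self.update(x_t))
```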
Adversarial Examples Improve Image Recognition
Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L. Yuille, Quoc V. Le


Adversarial examples are commonly viewed as a threat to ConvNets. Here we present an opposite perspective: adversarial examples can be used to improve image recognition models if harnessed in the right manner. We propose AdvProp, an enhanced adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. Key to our method is the usage of a separate auxiliary batch norm for adversarial examples, as they have different underlying distributions to normal examples. We show that AdvProp improves a wide range of models on various image recognition tasks and performs better when the models are bigger. For instance, by applying AdvProp to the latest EfficientNet-B7 [28] on ImageNet, we achieve significant improvements on ImageNet (+0.7%), ImageNet-C (+6.5%), ImageNet-A (+7.0%), Stylized-ImageNet (+4.8%). With an enhanced EfficientNet-B8, our method achieves the state-of-the-art 85.5% ImageNet top-1 accuracy without extra data. This result even surpasses the best model in [20] which is trained with 3.5B Instagram images ( 3000X more than ImageNet) and 9.4X more parameters. Code and models will be made publicly available.
[previous, recognition, outperforms, step] [improves, extra, improvement, achieves, propose, table, main] [adversarial, advprop, clean, auxiliary, model, trained, improve, bns, attacker, robustness, perturbation, stronger, generalization, adversarially, autoaugment, attack, adv, distorted, pgd, norm, mce, effectively, robust, improving] [result, convolutional, traditional, method, proposed, comparison] [image, train, disentangled, mismatch, learn, ability, corresponding, loss] [training, performance, accuracy, imagenet, vanilla, learning, better, arxiv, preprint, network, data, larger, distribution, size, large, compared, baseline, set, best, normalization, mixture, batch, small, neural, deep, augmentation, quoc, setting, standard, report] []
@InProceedings{Xie_2020_CVPR,
  author = {Xie, Cihang and Tan, Mingxing and Gong, Boqing and Wang, Jiang and Yuille, Alan L. and Le, Quoc V.},
  title = {Adversarial Examples Improve Image Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
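The key ingredient of AdvProp is routing clean and adversarial mini-batches through separate batch-normalization statistics while all other weights are shared. A hedged sketch of such a dual-BN layer:

```python
import torch.nn as nn

class DualBatchNorm2d(nn.Module):
    """BatchNorm layer with separate statistics/affine parameters for clean
    and adversarial mini-batches (the auxiliary-BN idea behind AdvProp)."""
    def __init__(self, channels):
        super().__init__()
        self.bn_clean = nn.BatchNorm2d(channels)
        self.bn_adv = nn.BatchNorm2d(channels)

    def forward(self, x, adv=False):
        # Route the batch through the matching set of BN statistics.
        return self.bn_adv(x) if adv else self.bn_clean(x)
```

During training, each mini-batch would be forwarded twice, clean images through `bn_clean` and adversarial examples through `bn_adv`, with the two losses summed; at test time only the clean branch is used.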
PQ-NET: A Generative Part Seq2Seq Network for 3D Shapes
Rundi Wu, Yixin Zhuang, Kai Xu, Hao Zhang, Baoquan Chen


We introduce PQ-NET, a deep neural network which represents and generates 3D shapes via sequential part assembly. The input to our network is a 3D shape segmented into parts, where each part is first encoded into a feature representation using a part autoencoder. The core component of PQ-NET is a sequence-to-sequence or Seq2Seq autoencoder which encodes a sequence of part features into a latent vector of fixed size, and the decoder reconstructs the 3D shape, one part at a time, resulting in a sequential assembly. The latent space formed by the Seq2Seq encoder encodes both part structure and fine part geometry. The decoder can be adapted to perform several generative tasks including shape autoencoding, interpolation, novel shape generation, and single-view 3D reconstruction, where the generated shapes are all composed of meaningful parts.
[sequence, sequential, order, decoder, rnn, gru, work, three] [feature, table, object, box, segmented, represents, including, propose] [model, input, trained] [figure, method, output, stacked, based, comparison, ieee, quantitative, assembly] [generative, latent, generation, encoder, image, generated, representation, autoencoder, structural, generates, learns, train, learn, loss] [network, learning, deep, neural, vector, space, training, set, random, number, better, arxiv, preprint, linear, sampled] [shape, geometry, structure, reconstruction, single, computer, depth, point, conference, ground, rgb, vision, implicit, distance, structurenet, truth, acm, view, surface, mesh, compare, international, approach, partnet]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Rundi and Zhuang, Yixin and Xu, Kai and Zhang, Hao and Chen, Baoquan},
  title = {PQ-NET: A Generative Part Seq2Seq Network for 3D Shapes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Actor-Transformers for Group Activity Recognition
Kirill Gavrilyuk, Ryan Sanford, Mehrsan Javan, Cees G. M. Snoek


This paper strives to recognize individual actions and group activities from videos. While existing solutions for this challenging problem explicitly model spatial and temporal relationships based on location of individual actors, we propose an actor-transformer model able to learn and selectively extract information relevant for group activity recognition. We feed the transformer with rich actor-specific static and dynamic representations expressed by features from a 2D pose network and 3D CNN, respectively. We empirically study different ways to combine these representations and show their complementary benefits. Experiments show what is important to transform and how it should be transformed. What is more, actor-transformers achieve state-of-the-art results on two publicly available benchmarks for group activity recognition, outperforming the previous best published results by a considerable margin
[activity, action, transformer, static, actor, recognition, attention, dataset, individual, volleyball, collective, temporal, video, frame, recurrent, graph, mechanism, modeling, explore, late, language, encoding, recognize, positional, three, ibrahim] [bounding, branch, cnn, table, backbone, key, hrnet, feature, box, ablation] [model, input, study] [dynamic, fusion, flow, optical, motion, convolutional, spatial, figure, based, method, proposed, analysis] [representation, encoder, person, image] [group, network, neural, set, training, achieve, learning, better, layer, early, accuracy, best, deep, machine, performance, number, size] [pose, rgb, capture, approach, combine, human, left, single, position, compare]
@InProceedings{Gavrilyuk_2020_CVPR,
  author = {Gavrilyuk, Kirill and Sanford, Ryan and Javan, Mehrsan and Snoek, Cees G. M.},
  title = {Actor-Transformers for Group Activity Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SG-NN: Sparse Generative Neural Networks for Self-Supervised Scene Completion of RGB-D Scans
Angela Dai, Christian Diller, Matthias Niessner


We present a novel approach that converts partial and noisy RGB-D scans into high-quality 3D scene reconstructions by inferring unobserved scene geometry. Our approach is fully self-supervised and can hence be trained solely on incomplete, real-world scans. To achieve self-supervision, we remove frames from a given (incomplete) 3D scan in order to make it even more incomplete; self-supervision is then formulated by correlating the two levels of partialness of the same scan while masking out regions that have never been observed. Through generalization across a large training set, we can then predict 3D scene completions even without seeing any 3D scan of entirely complete geometry. Combined with a new 3D sparse generative convolutional neural network architecture, our method is able to predict highly detailed surfaces in a coarse-to-fine hierarchical fashion, outperforming existing state-of-the-art methods by a significant margin in terms of reconstruction quality.
[predict, order, prediction, predicting] [table, predicted, semantic, final, level, propose, fully] [input, model, masking, trained, poisson] [ieee, figure, pattern, cvpr, method, resolution, output, convolutional, june, july] [target, generative, generate, synthetic, representation, train, generating, loss, produce] [data, training, learning, neural, network, set, deep, evaluate, starget, remains, entire, architecture] [scan, scene, sparse, completion, complete, approach, conference, geometry, computer, surface, incomplete, reconstruction, unobserved, tsdf, vision, distance, error, ground, truth, enables, volumetric, completeness, voxel, international, october, angela, matthias, dense, shape, single, scancomplete, partial, well, occupancy, thomas, enabling, geometric]
@InProceedings{Dai_2020_CVPR,
  author = {Dai, Angela and Diller, Christian and Niessner, Matthias},
  title = {SG-NN: Sparse Generative Neural Networks for Self-Supervised Scene Completion of RGB-D Scans},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Geometry-Aware Satellite-to-Ground Image Synthesis for Urban Areas
Xiaohu Lu, Zuoyue Li, Zhaopeng Cui, Martin R. Oswald, Marc Pollefeys, Rongjun Qin


We present a novel method for generating panoramic street-view images which are geometrically consistent with a given satellite image. Different from existing approaches that completely rely on a deep learning architecture to generalize cross-view image distributions, our approach explicitly loops in the geometric configuration of the ground objects based on the satellite views, such that the produced ground view synthesis preserves the geometric shape and the semantics of the scene. In particular, we propose a neural network with a geo-transformation layer that turns predicted ground-height values from the satellite view to a ground view while retaining the physical satellite-to-ground relation. Our results show that the synthesized image retains well-articulated and authentic geometric shapes, as well as texture richness of the street-view in various scenarios. Both qualitative and quantitative results demonstrate that our method compares favorably to other state-of-the-art approaches that lack geometric consistency.
[panoramic, semantics, work, three, road, urban, evaluation] [semantic, stage, predicted, aerial, height, building, feature, propose] [input, quality, adversarial] [method, ieee, proposed, pattern, quantitative, figure, utilized, pixel, patch] [image, generated, loss, transformed, generate, regmi, corresponding, latent, texture, realistic, learn, street, generative, generation, bicyclegan, synthesis, utilize, encoder, conditional, layout, iprj, synthesized, extracted] [network, training, weighted, learning, layer, vector, deep, architecture, neural, learned, problem] [satellite, depth, rgb, ground, view, geometric, transformation, conference, computer, panorama, vision, scene, truth, matching, pipeline, well, directly, grid, international, novel, geometrically, single, voxel]
@InProceedings{Lu_2020_CVPR,
  author = {Lu, Xiaohu and Li, Zuoyue and Cui, Zhaopeng and Oswald, Martin R. and Pollefeys, Marc and Qin, Rongjun},
  title = {Geometry-Aware Satellite-to-Ground Image Synthesis for Urban Areas},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Action Modifiers: Learning From Adverbs in Instructional Videos
Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, Dima Damen


We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations. Key to our method is the fact that the visual representation of the adverb is highly dependent on the action to which it applies, although the same adverb will modify multiple actions in a similar way. For instance, while 'spread quickly' and 'mix quickly' will look dissimilar, we can learn a common representation that allows us to recognize both, among other actions. We formulate this as an embedding problem, and use scaled dot product attention to learn from weakly-supervised video narrations. We jointly learn adverbs as invertible transformations which operate on the embedding space, so as to add or remove the effect of the adverb. As there is no prior work on weakly supervised learning from adverbs, we gather paired action-adverb annotations from a subset of the HowTo100M dataset, for 6 adverbs: quickly/slowly, finely/coarsely and partially/completely. Our method outperforms all baselines for video-to-adverb retrieval with a performance of 0.719 mAP. We also demonstrate our model's ability to attend to the relevant video parts in order to determine the adverb for a given action.
[action, video, embedding, adverb, attention, recognition, relevant, instructional, visual, retrieval, work, antonym, temporal, narrated, text, josef, ivan, dataset, embeddings, embed, glove, accompanying, outperforms, modeling, modifier, dima, multiple, attend, question] [weakly, object, weak, table, supervision, localization, map, predicted, key, ablation] [model, query, modify] [ieee, pattern, method, figure, timestamp, prior] [learn, representation, supervised, perform, loss] [learning, learned, scaled, space, report, test, rank, performance, task, classifier, consider, set, linear, evaluate, training, better] [conference, vision, computer, international, joint, transformation, european, jointly, demonstrate, well]
@InProceedings{Doughty_2020_CVPR,
  author = {Doughty, Hazel and Laptev, Ivan and Mayol-Cuevas, Walterio and Damen, Dima},
  title = {Action Modifiers: Learning From Adverbs in Instructional Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ZSTAD: Zero-Shot Temporal Activity Detection
Lingling Zhang, Xiaojun Chang, Jun Liu, Minnan Luo, Sen Wang, Zongyuan Ge, Alexander Hauptmann


An integral part of video analysis and surveillance is temporal activity detection, which aims to simultaneously recognize and localize activities in long untrimmed videos. Currently, the most effective methods of temporal activity detection are based on deep learning, and they typically perform very well with large-scale annotated videos for training. However, these methods are limited in real applications due to the unavailability of videos for certain activity classes and the time-consuming data annotation. To solve this challenging problem, we propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected. We design an end-to-end deep network based on R-C3D as the architecture for this solution. The proposed network is optimized with an innovative loss function that considers the embeddings of activity labels and their super-classes while learning the common semantics of seen and unseen activities. Experiments on both the THUMOS'14 and the Charades datasets show promising performance in terms of detecting unseen activities.
[activity, temporal, embeddings, embedding, action, tpn, video, recognition, zsdn, long, untrimmed, zstad, dataset, semantics] [detection, semantic, module, map, background, proposal, regression, feature, table, boundary, iou, sliding, object, threshold, china] [input, testing, model] [ieee, subnet, prior, pattern, figure, proposed, convolutional, based, output, analysis, method, event, called] [unseen, loss, xiaojun, common, zsl, image] [label, classification, deep, network, class, learning, clustering, training, improved, set, number, processing, performance, note, problem, vector, data, setting, architecture, design, space, neural, function, learned, convnet] [conference, computer, vision, novel, international]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Lingling and Chang, Xiaojun and Liu, Jun and Luo, Minnan and Wang, Sen and Ge, Zongyuan and Hauptmann, Alexander},
  title = {ZSTAD: Zero-Shot Temporal Activity Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Geometric Structure Based and Regularized Depth Estimation From 360 Indoor Imagery
Lei Jin, Yanyu Xu, Jia Zheng, Junfei Zhang, Rui Tang, Shugong Xu, Jingyi Yu, Shenghua Gao


Motivated by the correlation between the depth and the geometric structure of a 360 indoor image, we propose a novel learning-based depth estimation framework that leverages the geometric structure of a scene to conduct depth estimation. Specifically, we represent the geometric structure of an indoor scene as a collection of corners, boundaries and planes. On the one hand, once a depth map is estimated, this geometric structure can be inferred from the estimated depth map; thus, the geometric structure functions as a regularizer for depth estimation. On the other hand, this estimation also benefits from the geometric structure of a scene estimated from an image, where the structure functions as a prior. However, furniture in indoor scenes makes it challenging to infer geometric structure from depth or image data. An attention map is inferred to facilitate both depth estimation from features of the geometric structure and also geometric inferences from the estimated depth. To validate the effectiveness of each component in our framework under controlled conditions, we render a synthetic dataset, the Shanghaitech-Kujiale Indoor 360 dataset, with 3,550 360° indoor images. Extensive experiments on popular datasets validate the effectiveness of our solution. We also demonstrate that our method can be applied to counterfactual depth.
[attention, dataset, panoramic, prediction, three, infer, order, evaluation, work] [map, module, propose, object, table, effectiveness, branch, framework, segmentation, boundary, mask, semantic, predicted, regression] [counterfactual, input, datasets] [ieee, prior, convolution, pattern, method, proposed, convolutional, remove, block, based, figure, validate, cnns, existing] [image, synthetic, layout, corresponding, representation, structural, train] [fur, learning, network, stanford, performance, subset, regularizer, data, evaluate, denote, regularized, deep] [depth, structure, geometric, estimation, conference, indoor, furniture, computer, room, spherical, vision, scene, reconstruction, plane, international, demonstrate, omnidirectional, european, estimated, single, panorama, perspective, leveraging, rgb, ground, planar, empty]
@InProceedings{Jin_2020_CVPR,
  author = {Jin, Lei and Xu, Yanyu and Zheng, Jia and Zhang, Junfei and Tang, Rui and Xu, Shugong and Yu, Jingyi and Gao, Shenghua},
  title = {Geometric Structure Based and Regularized Depth Estimation From 360 Indoor Imagery},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Kinematics Analysis for Monocular 3D Human Pose Estimation
Jingwei Xu, Zhenbo Yu, Bingbing Ni, Jiancheng Yang, Xiaokang Yang, Wenjun Zhang


For monocular 3D pose estimation conditioned on 2D detection, noisy/unreliable input is a key obstacle in this task. Simple structure constraints attempting to tackle this problem, e.g., symmetry loss and joint angle limit, could only provide marginal improvements and are commonly treated as auxiliary losses in previous research. Thus it remains challenging to effectively utilize the power of human prior knowledge for this task. In this paper, we propose to address the above issue in a systematic way. Firstly, we show that optimizing the kinematics structure of noisy 2D inputs is critical to obtain accurate 3D estimations. Secondly, based on corrected 2D joints, we further explicitly decompose articulated motion with human topology, which leads to a more compact 3D static structure that is easier to estimate. Finally, temporal refinement emphasizing the validity of 3D dynamic structure is naturally developed to pursue more accurate results. The above three steps are seamlessly integrated into deep neural models, which form a deep kinematics analysis pipeline concurrently considering the static/dynamic structure of 2D inputs and 3D outputs. Extensive experiments show that the proposed framework achieves state-of-the-art performance on two widely used 3D human action datasets. Meanwhile, a targeted ablation study shows that each former step is critical for the latter one to obtain promising results.
[length, trajectory, temporal, video, three, action, time, dataset, work, cai, evaluation, previous, step, sequence, stamp, skeleton, critical, explicitly, unreliable, contribution, corresponds, prediction] [refine, adopt, refined, propose, achieves, refinement, table, ablation] [model, correction, protocol, input, refers, trained, study] [based, analysis, motion, proposed, figure, prior, output, noisy, decompose] [loss, bingbing, decomposed, corresponding, utilize] [deep, accuracy, training, learning, neural, network, performance, better, simple, compact, reliable, large] [pose, estimation, human, joint, estimated, kinematics, structure, monocular, projection, pavllo, direction, single, body, completion, correspondence, ground, truth, perspective, estimate, articulated, detailed, point, focal, error, october, angle, accurate]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Jingwei and Yu, Zhenbo and Ni, Bingbing and Yang, Jiancheng and Yang, Xiaokang and Zhang, Wenjun},
  title = {Deep Kinematics Analysis for Monocular 3D Human Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TEA: Temporal Excitation and Aggregation for Action Recognition
Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, Limin Wang


Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short- and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features. It then utilizes the differences to excite the motion-sensitive channels of the features. The long-range temporal aggregations in previous works are typically achieved by stacking a large number of local temporal convolutions. Each convolution processes a local temporal window at a time. In contrast, the MTA module proposes to deform the local convolution to a group of sub-convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-convolutions, and each frame could complete multiple temporal aggregations with neighborhoods. The final equivalent receptive field of temporal dimension is accordingly enlarged, which is capable of modeling the long-range temporal relationship over distant frames. The two components of the TEA block are complementary in temporal modeling. Finally, our approach achieves impressive results at low FLOPs on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB51, and UCF101, which confirms its effectiveness and efficiency.
[temporal, action, multiple, video, recognition, spatiotemporal, modeling, frame, hierarchical, considering] [module, aggregation, feature, resnet, cnn, table, propose, adopt, introducing, final, pooling] [model, input, original, protocol] [motion, conv, proposed, tea, excitation, convolution, residual, method, block, spatial, mta, utilized, convolutional, receptive, optical, stm, imgnet, based, figure, senet, field, existing, flow, enhance, pattern, stacking, adjacent, comparison, excite, cnns] [utilize, image, corresponding] [learning, deep, training, performance, baseline, large, efficient, group, architecture, simple, network, validation, equivalent, dimension, layer, indicates, accuracy, standard, strategy, applied, inference] [local, approach, additional, compare, structure]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yan and Ji, Bin and Shi, Xintian and Zhang, Jianguo and Kang, Bin and Wang, Limin},
  title = {TEA: Temporal Excitation and Aggregation for Action Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
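The motion excitation (ME) idea, computing feature-level temporal differences and using them to excite motion-sensitive channels, can be sketched as below. The reduction ratio, the zero-padding of the last time step, and the residual formulation are illustrative choices and may differ from the paper's exact module.

```python
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    """Illustrative motion-excitation block: feature-level temporal differences
    are squeezed into channel attention that excites motion-sensitive channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        r = max(channels // reduction, 1)
        self.squeeze = nn.Conv2d(channels, r, 1)
        self.expand = nn.Conv2d(r, channels, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        # x: (batch, time, channels, h, w)
        b, t, c, h, w = x.shape
        feat = self.squeeze(x.reshape(b * t, c, h, w)).reshape(b, t, -1, h, w)
        diff = feat[:, 1:] - feat[:, :-1]                       # temporal differences
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        attn = self.expand(self.pool(diff.reshape(b * t, -1, h, w)))
        attn = torch.sigmoid(attn).reshape(b, t, c, 1, 1)
        return x + x * attn                                     # residual excitation
```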
Oops! Predicting Unintentional Action in Video
Dave Epstein, Boyuan Chen, Carl Vondrick


From just a short glance at a video, we can often tell whether a person's action is intentional or not. Can we train a model to recognize this? We introduce a dataset of in-the-wild videos of unintentional action, as well as a suite of tasks for recognizing, localizing, and anticipating its onset. We train a supervised neural network as a baseline and analyze its performance compared to human consistency on the tasks. We also investigate self-supervised representations that leverage natural signals in our dataset, and show the effectiveness of an approach that uses the intrinsic speed of video to perform competitively with highly-supervised pretraining. However, a significant gap between machine and human performance remains.
[video, action, unintentional, dataset, speed, visual, intentional, temporal, recognition, kinetics, failure, predicting, predict, clip, intentionality, three, time, recognize, prediction, clue, order, work, recognizing, context, frame] [supervision, feature, localization, table] [model, analyze, deviation, input] [ieee, figure, pattern, motion, convolutional, perceptual] [representation, learn, train, person, gap, supervised, unsupervised, diverse, image] [learning, unlabeled, set, arxiv, preprint, performance, classification, labeled, task, network, large, neural, label, evaluate, training, linear, carl, learned, best, number, classifier, data, deep, predictive, processing, accuracy, andrew] [computer, conference, human, vision, international, limited, scene, european, nearest, variety, compare]
@InProceedings{Epstein_2020_CVPR,
  author = {Epstein, Dave and Chen, Boyuan and Vondrick, Carl},
  title = {Oops! Predicting Unintentional Action in Video},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Scene Recomposition by Learning-Based ICP
Hamid Izadinia, Steven M. Seitz


By moving a depth sensor around a room, we compute a 3D CAD model of the environment, capturing the room shape and contents such as chairs, desks, sofas, and tables. Rather than reconstructing geometry, we match, place, and align each object in the scene to thousands of CAD models of objects. In addition to the fully automatic system, the key technical contribution is a novel approach for aligning CAD models to 3D scans, based on deep reinforcement learning. This approach, which we call Learning-based ICP, outperforms prior ICP methods in the literature, by learning the best points to match and conditioning on object viewpoint. LICP learns to align using only synthetic data and does not require ground truth annotation of object pose or keypoint pair matching in real scene scans. While LICP is trained on synthetic data and without 3D real scene annotations, it outperforms both learned local deep feature matching and geometric based alignment methods in real scenes. The proposed method is evaluated on real scenes datasets of SceneNN and ScanNet as well as synthetic scenes of SUNCG. High quality results are demonstrated on a range of real world scenes, with robustness to clutter, viewpoint, and occlusion.
[policy, reward, action, reinforcement, automatic, prediction] [object, feature, detection, semantic, fully, recall, global, segmentation] [model, input, query, trained, robust] [reference, figure, method, based, proposed, high, prior] [real, alignment, synthetic, loss, learn, aligning, image, representation, generate, align] [learning, network, deep, function, data, set, training, learned, large, problem, amount, test, performance] [cad, scene, point, licp, icp, shape, surface, pose, scan, cloud, recomposition, transformation, depth, geometry, camera, ground, matching, scanned, distance, local, error, rotation, approach, furniture, voxel, compare, rgbd, compute, truth, keypoint, recomposed, registration, single, reconstruction, visible, chair, indoor, room, estimation]
@InProceedings{Izadinia_2020_CVPR,
  author = {Izadinia, Hamid and Seitz, Steven M.},
  title = {Scene Recomposition by Learning-Based ICP},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
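For reference, the classical point-to-point ICP iteration that learning-based ICP improves on (by learning which points to match and conditioning on object viewpoint) pairs each source point with its nearest neighbour and solves for the optimal rigid transform via SVD. A NumPy sketch with brute-force matching, suitable only for small clouds:

```python
import numpy as np

def icp_step(src, dst):
    """One classical point-to-point ICP iteration.

    src, dst: (n, 3) and (m, 3) point clouds. Returns the rotation R and
    translation t that move `src` toward `dst`.
    """
    # Brute-force nearest neighbours (fine for small clouds).
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
    matched = dst[d.argmin(axis=1)]
    # Kabsch: optimal rigid transform between the matched pairs.
    mu_s, mu_d = src.mean(0), matched.mean(0)
    H = (src - mu_s).T @ (matched - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:              # fix an improper rotation (reflection)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t
```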
Enhancing Cross-Task Black-Box Transferability of Adversarial Examples With Dispersion Reduction
Yantao Lu, Yunhan Jia, Jianyu Wang, Bai Li, Weiheng Chai, Lawrence Carin, Senem Velipasalar


Neural networks are known to be vulnerable to carefully crafted adversarial examples, and these malicious samples often transfer, i.e., they remain adversarial even against other models. Although significant effort has been devoted to the transferability across models, surprisingly little attention has been paid to cross-task transferability, which represents the real-world cybercriminal's situation, where an ensemble of different defense/detection mechanisms need to be evaded all at once. We investigate the transferability of adversarial examples across a wide range of real-world computer vision tasks, including image classification, object detection, semantic segmentation, explicit content detection, and text detection. Our proposed attack minimizes the "dispersion" of the internal feature map, overcoming the limitations of existing attacks, that require task-specific loss functions and/or probing a target model. We conduct evaluation on open-source detection and segmentation models, as well as four different computer vision tasks provided by Google Cloud Vision (GCV) APIs. We demonstrate that our approach outperforms existing attacks by degrading performance of multiple CV tasks by a large margin with only modest perturbations.
[evaluation, text, recognition, explicit] [detection, map, feature, semantic, object, segmentation, achieves, miou, voc, table, coco] [adversarial, attack, model, dispersion, dim, onv, transferability, pgd, drop, original, gcv, attacking, google, deployed, std, input, aes, perturbation, middle, safesearch, ensemble, internal] [proposed, method, figure, ieee, based, performs] [target, image, source, content, loss, transfer, generated] [performance, compared, baseline, best, learning, accuracy, reduction, average, arxiv, preprint, machine, layer, gradient, set, validation, larger, momentum, deep, reducing, training, neural] [vision, computer, cloud, conference, compare]
@InProceedings{Lu_2020_CVPR,
  author = {Lu, Yantao and Jia, Yunhan and Wang, Jianyu and Li, Bai and Chai, Weiheng and Carin, Lawrence and Velipasalar, Senem},
  title = {Enhancing Cross-Task Black-Box Transferability of Adversarial Examples With Dispersion Reduction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
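The attack's objective is task-agnostic: it simply shrinks the dispersion (taken here as the standard deviation) of an intermediate feature map. A hedged PGD-style sketch, where the choice of intermediate layer, epsilon and step size are assumptions:

```python
import torch

def dispersion_reduction_attack(feature_extractor, x, eps=16/255, alpha=2/255, steps=20):
    """Sketch of a dispersion-reduction attack: perturb the image so that the
    standard deviation of an internal feature map shrinks, without any
    task-specific loss. `feature_extractor` maps images to the chosen
    intermediate activation."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = feature_extractor(x_adv).std()          # dispersion of the feature map
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() - alpha * grad.sign()   # descend to reduce dispersion
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv
```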
Single-Step Adversarial Training With Dropout Scheduling
Vivek B.S., R. Venkatesh Babu


Deep learning models have shown impressive performance across a spectrum of computer vision applications including medical diagnosis and autonomous driving. One of the major concerns that these models face is their susceptibility to adversarial attacks. Realizing the importance of this issue, more researchers are working towards developing robust models that are less affected by adversarial attacks. The adversarial training method shows promising results in this direction. In the adversarial training regime, models are trained with mini-batches augmented with adversarial samples. Fast and simple methods (e.g., single-step gradient ascent) are used for generating adversarial samples, in order to reduce computational complexity. It has been shown that models trained using a single-step adversarial training method (where adversarial samples are generated using a non-iterative method) are pseudo-robust. Further, this pseudo robustness of models is attributed to the gradient masking effect. However, existing works fail to explain when and why the gradient masking effect occurs during single-step adversarial training. In this work, (i) we show that models trained using single-step adversarial training methods learn to prevent the generation of single-step adversaries, and this is due to over-fitting of the model during the initial stages of training, and (ii) to mitigate this effect, we propose a single-step adversarial training method with dropout scheduling. Unlike models trained using existing single-step adversarial training methods, models trained using the proposed single-step adversarial training method are robust against both single-step and multi-step adversarial attacks, and the performance is on par with models trained using computationally expensive multi-step adversarial training methods, in white-box and black-box settings.
[observed, order, time, dataset] [table, plot, propose] [adversarial, trained, model, pgd, robust, attack, perturbation, sads, mnist, pat, fgsm, robustness, versus, typical, success, norm, fat, masking, demonstrated, clean, deepfool, datasets, iterative, ian, venkatesh, susceptible, adversarially, ifgsm] [method, proposed, based, fast, pattern] [loss, generated, generating, learn, generation, image, perform] [training, dropout, learning, layer, performance, gradient, setting, probability, set, validation, machine, accuracy, network, deep, prevent, note, neural, large, increase, iteration, sample, size, arxiv, preprint, test] [conference, international, vision, normal, computer, supplementary]
@InProceedings{B.S._2020_CVPR,
  author = {B.S., Vivek and Babu, R. Venkatesh},
  title = {Single-Step Adversarial Training With Dropout Scheduling},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Non-Line-of-Sight Reconstruction
Javier Grau Chopite, Matthias B. Hullin, Michael Wand, Julian Iseringhausen


The recent years have seen a surge of interest in methods for imaging beyond the direct line of sight. The most prominent techniques rely on time-resolved optical impulse responses, obtained by illuminating a diffuse wall with an ultrashort light pulse and observing multi-bounce indirect reflections with an ultrafast time-resolved imager. Reconstruction of geometry from such data, however, is a complex non-linear inverse problem that comes with substantial computational demands. In this paper, we employ convolutional feed-forward networks for solving the reconstruction problem efficiently while maintaining good reconstruction quality. Specifically, we devise a tailored autoencoder architecture, trained end-to-end, that maps transient images directly to a depth-map representation. Training is done using a recent, very efficient transient renderer for three-bounce indirect light transport that enables the quick generation of large amounts of training data for the network. We examine the performance of our method on a variety of synthetic and experimental datasets and its dependency on the choice of training data and augmentation strategies, as well as architectural features. We demonstrate that our feed-forward network, even if trained solely on synthetic data, is able to obtain results competitive with previous, model-based optimization methods, while being orders of magnitude faster.
[time, three, hidden, temporal, dataset, evaluation, work] [map, object, background, global] [model, input, trained, datasets, poisson, experimental, university, case] [figure, convolutional, sensor, light, photon, output, resolution, pixel, ieee, spatial, imaging, pattern, intensity, proposed] [synthetic, image, target, real, representation, generative, perform, generation, train, generate, consists] [data, deep, learning, network, training, performance, test, problem, neural, large, forward, efficient, better] [depth, reconstruction, transient, nlos, spad, shapenet, diffuse, computer, conference, vision, geometry, well, scene, approach, acm, regressor, retroreflective, rendering, david, matthias, indirect, renderer, full, dense]
@InProceedings{Chopite_2020_CVPR,
  author = {Chopite, Javier Grau and Hullin, Matthias B. and Wand, Michael and Iseringhausen, Julian},
  title = {Deep Non-Line-of-Sight Reconstruction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SSRNet: Scalable 3D Surface Reconstruction Network
Zhenxing Mi, Yiming Luo, Wenbing Tao


Existing learning-based surface reconstruction methods from point clouds still face challenges in terms of scalability and preservation of detail on large-scale point clouds. In this paper, we propose SSRNet, a novel scalable learning-based method for surface reconstruction. The proposed SSRNet constructs local geometry-aware features for octree vertices and designs a scalable reconstruction pipeline, which not only greatly enhances the prediction accuracy of the relative position between the vertices and the implicit surface, improving the surface reconstruction quality, but also allows dividing the point cloud and octree vertices and processing different parts in parallel, for superior scalability on large-scale point clouds with millions of points. Moreover, SSRNet demonstrates outstanding generalization capability and only needs a small amount of surface data for training, much less than other learning-based reconstruction methods, which effectively avoids overfitting. The SSRNet model trained on one dataset can be directly used on other datasets with superior performance. Finally, the time consumption of SSRNet on a large-scale point cloud is acceptable and competitive. To our knowledge, the proposed SSRNet is the first to offer a convincing solution to the scalability issue of learning-based surface reconstruction methods, and it is an important step towards making learning-based methods competitive with geometry processing methods on real-world and challenging data. Experiments show that our method achieves a breakthrough in scalability and quality compared with state-of-the-art learning-based methods.
[time, evaluation, order, dataset, represent, recognition, extract] [table, global, feature, bounding] [input, generalization, datasets, quality, trained, testing] [method, figure, convolution, ieee, pattern, capability, high, scale, output] [latent] [network, classification, data, training, accuracy, set, stanford, processing, learning, function, deep, vector, good, stl, scalable, evaluate, worth, better] [point, octree, surface, reconstruction, local, onet, tangent, dtu, ssrnet, implicit, computer, geometry, psr, conference, cloud, vertex, neighbor, vision, reconstruct, shapenet, scalability, grid, geometric, shape, normal, directly, distance, depth, accurate, dividing, marching, reconstructed, complex, capture, projection]
@InProceedings{Mi_2020_CVPR,
  author = {Mi, Zhenxing and Luo, Yiming and Tao, Wenbing},
  title = {SSRNet: Scalable 3D Surface Reconstruction Network},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Progressive Relation Learning for Group Activity Recognition
Guyue Hu, Bo Cui, Yuan He, Shan Yu


Group activities usually involve spatio-temporal dynamics among many interactive individuals, while only a few participants at several key frames essentially define the activity. Therefore, effectively modeling the group-relevant and suppressing the irrelevant actions (and interactions) are vital for group activity recognition. In this paper, we propose a novel method based on deep reinforcement learning to progressively refine the low-level features and high-level relations of group activities. Firstly, we construct a semantic relation graph (SRG) to explicitly model the relations among persons. Then, two agents adopting policy according to two Markov decision processes are applied to progressively refine the SRG. Specifically, one feature-distilling (FD) agent in the discrete action space refines the low-level spatio-temporal features by distilling the most informative frames. Another relation-gating (RG) agent in continuous action space adjusts the high-level semantic graph to pay more attention to group-relevant relations. The SRG, FD agent, and RG agent are optimized alternately to mutually boost the performance of each other. Extensive experiments on two widely used benchmarks demonstrate the effectiveness and superiority of the proposed approach.
[agent, graph, activity, relation, action, reinforcement, spatiotemporal, individual, standing, state, srg, policy, prl, recognition, passing, structured, reward, explicitly, visual, volleyball, moving, attention, temporal, heij, three, frame, talking, video, connected, lstm, dataset, collective, interaction, node] [semantic, feature, global, key, framework, denotes, refine, edge, table] [model, trained, input] [proposed, convolutional, method, optical, flow, figure, asynchronous] [progressively, person, progressive, corresponding] [group, learning, network, neural, function, deep, updating, number, baseline, distilled, sparsity, informative, updated, applied, discrete, probability, algorithm, class, matrix, queue, training] [left, local, continuous, cad]
@InProceedings{Hu_2020_CVPR,
  author = {Hu, Guyue and Cui, Bo and He, Yuan and Yu, Shan},
  title = {Progressive Relation Learning for Group Activity Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cooling-Shrinking Attack: Blinding the Tracker With Imperceptible Noises
Bin Yan, Dong Wang, Huchuan Lu, Xiaoyun Yang


Adversarial attacks on CNNs aim at deceiving models into misbehaving by adding imperceptible perturbations to images. Studying such attacks helps in understanding neural networks more deeply and in improving the robustness of deep learning models. Although several works have focused on attacking image classifiers and object detectors, an effective and efficient method for attacking single-object trackers of any target in a model-free way is still lacking. In this paper, a cooling-shrinking attack method is proposed to deceive state-of-the-art SiameseRPN-based trackers. An effective and efficient perturbation generator is trained with a carefully designed adversarial loss, which can simultaneously cool hot regions where the target exists on the heatmaps and force the predicted bounding box to shrink, making the tracked target invisible to trackers. Numerous experiments on the OTB100, VOT2018, and LaSOT datasets show that our method can effectively fool the state-of-the-art SiameseRPN++ tracker by adding small perturbations to the template or the search regions. Besides, our method has good transferability and is able to deceive other top-performing trackers such as DaSiamRPN, DaSiamRPN-UpdateNet, and DiMP. The source code is available at https://github.com/MasterBin-IIAU/CSA.
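A rough sketch of what a cooling-shrinking adversarial loss for the perturbation generator could look like, assuming access to the tracker's response heatmap and its predicted box width/height; the margin, weights, and tensor names are assumptions, not the paper's exact formulation.

import torch

def cooling_shrinking_loss(heatmap, wh_pred, clean_wh, margin=0.0,
                           cool_weight=1.0, shrink_weight=1.0):
    """Adversarial loss for training the perturbation generator.

    heatmap:  tracker response map on the perturbed search region (N, H, W)
    wh_pred:  predicted box width/height on the perturbed input (N, 2)
    clean_wh: width/height predicted on the clean input (N, 2)
    """
    # "Cooling": push down responses in hot regions so the target fades
    # from the heatmap.
    cooling = torch.clamp(heatmap - margin, min=0).mean()
    # "Shrinking": penalize boxes that are not smaller than the clean prediction.
    shrinking = torch.clamp(wh_pred - clean_wh, min=0).mean()
    return cool_weight * cooling + shrink_weight * shrinking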
[visual, three, dataset, making] [tracking, object, template, tracker, lasot, region, siamese, regression, represents, table, dasiamrpn, bounding, threshold, tracked, dimp, overlap, detection, location, siamrpn, framework, ope, huchuan, box, faster, rpn] [adversarial, attacking, attack, clean, original, shrinking, heatmaps, cooling, drop, adding, deceive, imperceptible, siamrpnpp, success, fool, robust, model, dong, perturbation, trained, experimental] [method, figure, designed, based, column, high, proposed] [loss, target, generator, image, discriminative, train, discriminator, proposes] [search, deep, performance, classification, algorithm, learning, arxiv, preprint, efficient, training, network, size, neural, precision, online] [single, detailed, initial, accurate, novel]
@InProceedings{Yan_2020_CVPR,
  author = {Yan, Bin and Wang, Dong and Lu, Huchuan and Yang, Xiaoyun},
  title = {Cooling-Shrinking Attack: Blinding the Tracker With Imperceptible Noises},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adversarial Camouflage: Hiding Physical-World Attacks With Natural Styles
Ranjie Duan, Xingjun Ma, Yisen Wang, James Bailey, A. K. Qin, Yun Yang


Deep neural networks (DNNs) are known to be vulnerable to adversarial examples. Existing works have mostly focused on either digital adversarial examples created via small and imperceptible perturbations, or physical-world adversarial examples created with large and less realistic distortions that are easily identified by human observers. In this paper, we propose a novel approach, called Adversarial Camouflage (AdvCam), to craft and camouflage physical-world adversarial examples into natural styles that appear legitimate to human observers. Specifically, AdvCam transfers large adversarial perturbations into customized styles, which are then "hidden" on-target object or off-target background. Experimental evaluation shows that, in both digital and physical-world scenarios, adversarial examples crafted by AdvCam are well camouflaged and highly stealthy, while remaining effective in fooling state-of-the-art DNN image classifiers. Hence, AdvCam is a flexible approach that can help craft stealthy attacks to evaluate the robustness of DNNs.
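A hedged sketch of the kind of objective such camouflaged attacks optimize: an adversarial term plus neural style-transfer terms that hide the perturbation in a chosen style. The Gram-matrix style loss, content loss, total-variation smoothness term, and all weights are standard style-transfer ingredients assumed here for illustration; they are not claimed to be the authors' exact losses.

import torch
import torch.nn.functional as F

def gram(feat):                          # (N, C, H, W) -> (N, C, C)
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def camouflage_attack_loss(logits_adv, y_target, feats_adv, feats_style,
                           feats_content, x_adv,
                           w_adv=1.0, w_style=1.0, w_content=1.0, w_tv=1e-4):
    # Adversarial term: drive the classifier towards the chosen target class.
    l_adv = F.cross_entropy(logits_adv, y_target)
    # Style term: match Gram statistics of the customized style reference.
    l_style = sum(F.mse_loss(gram(a), gram(b)) for a, b in zip(feats_adv, feats_style))
    # Content term: stay close to the original image content.
    l_content = sum(F.mse_loss(a, b) for a, b in zip(feats_adv, feats_content))
    # Total-variation smoothness keeps the crafted image natural-looking.
    l_tv = (x_adv[..., 1:, :] - x_adv[..., :-1, :]).abs().mean() + \
           (x_adv[..., :, 1:] - x_adv[..., :, :-1]).abs().mean()
    return w_adv * l_adv + w_style * l_style + w_content * l_content + w_tv * l_tv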
[natural, three, perception, sign, visual] [region, camouflaged, object, semantic, represents, area, ablation, propose] [adversarial, attack, advcam, camouflage, digital, crafted, craft, perturbation, physical, stealthiness, targeted, advpatch, example, original, dnn, stealthy, attacking, yisen, success, attacker, pgd, xingjun, strength, james, physicalworld, robustness, model, untargeted, help, fool, customized, highly, effective, unrestricted, robust, study, university, experimental] [figure, proposed, existing, flexible, high, created] [image, style, target, loss, content, transfer, texture, ladv, generate, generated, source, perform] [large, deep, neural, small, learning, test, achieve, network, selected, classifier, class, note, size, rate] [human, smoothness, approach, directly, second, defined, shape]
@InProceedings{Duan_2020_CVPR,
  author = {Duan, Ranjie and Ma, Xingjun and Wang, Yisen and Bailey, James and Qin, A. K. and Yang, Yun},
  title = {Adversarial Camouflage: Hiding Physical-World Attacks With Natural Styles},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Weakly-Supervised Action Localization by Generative Attention Modeling
Baifeng Shi, Qi Dai, Yadong Mu, Jingdong Wang


Weakly-supervised temporal action localization is the problem of learning an action localization model with only video-level action labels available. The general framework largely relies on the classification activation, which employs an attention model to identify the action-related frames and then categorizes them into different classes. Such a method results in the action-context confusion issue: context frames near action clips tend to be recognized as action frames themselves, since they are closely related to the specific classes. To solve this problem, in this paper we propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE). With the observation that the context exhibits notable differences from the action at the representation level, a probabilistic model, i.e., a conditional VAE, is learned to model the likelihood of each frame given the attention. By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated. Experiments on THUMOS14 and ActivityNet1.2 demonstrate the advantage of our method and its effectiveness in handling the action-context confusion problem. Code is now available on GitHub.
[action, attention, temporal, recognition, context, video, frame, modeling, dgam, evaluation, lcv, lguide, observation, long] [feature, localization, weak, table, module, foreground, framework, detection, map, background, weakly] [model, quality, difference] [ieee, method, pattern, figure, prior, high, based, convolutional, modeled, flow] [generative, loss, discriminative, latent, representation, conditional, variational, lre, confusion, cvae, discrepancy, conditioned, produce, train, ting, tao, supervised] [classification, log, learning, distribution, set, network, indicates, neural, problem, performance, note, learned, data, training, processing, class, better, optimization] [conference, vision, computer, international, reconstruction, full, european, directly, truth]
@InProceedings{Shi_2020_CVPR,
  author = {Shi, Baifeng and Dai, Qi and Mu, Yadong and Wang, Jingdong},
  title = {Weakly-Supervised Action Localization by Generative Attention Modeling},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Achieving Adversarial Robustness by Enforcing Feature Consistency Across Bit Planes
Sravanti Addepalli, Vivek B.S., Arya Baburaj, Gaurang Sriramanan, R. Venkatesh Babu


As humans, we inherently perceive images based on their predominant features and ignore noise embedded within the lower bit planes. In contrast, deep neural networks are known to confidently misclassify images corrupted with meticulously crafted perturbations that are nearly imperceptible to the human eye. In this work, we attempt to address this problem by training networks to form coarse impressions based on the information in the higher bit planes, and to use the lower bit planes only to refine their prediction. We demonstrate that, by imposing consistency on the representations learned across differently quantized images, the adversarial robustness of networks improves significantly compared to a normally trained model. Present state-of-the-art defenses against adversarial attacks require the networks to be explicitly trained using adversarial samples that are computationally expensive to generate. While such methods that use adversarial training continue to achieve the best results, this work paves the way towards achieving robustness without having to explicitly train on adversarial samples. The proposed approach is therefore faster, and also closer to the natural learning process in humans.
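A minimal sketch of bit-plane-based consistency training: form a coarse image by keeping only the higher bit planes and regularize the prediction on it towards the prediction on the full image. The number of retained bit planes, the MSE consistency term, and the weight are assumptions for illustration.

import torch
import torch.nn.functional as F

def quantize_higher_bits(x, keep_bits=5):
    """Keep only the `keep_bits` most-significant bit planes of an image in [0, 1]."""
    step = 256 // (2 ** keep_bits)
    return torch.floor(x * 255 / step) * step / 255

def bit_plane_consistency_loss(model, x, y, weight=1.0, keep_bits=5):
    logits_full = model(x)
    logits_coarse = model(quantize_higher_bits(x, keep_bits))
    # Standard classification loss on the original image plus a consistency
    # term tying predictions on the coarse (higher-bit-plane) image to those
    # on the full image.
    return F.cross_entropy(logits_full, y) + weight * F.mse_loss(logits_coarse, logits_full)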
[recognition, work, step] [feature] [adversarial, robustness, pgd, noise, attack, bpfc, mnist, trained, clean, robust, model, input, stronger, datasets, fgsm, targeted, ian, deepfool] [proposed, method, pixel, based, low, range, pattern, ieee] [image, loss, corresponding, consistency, fine] [training, bit, accuracy, learning, random, deep, quantization, consider, gradient, set, achieve, compared, quantized, arxiv, preprint, network, better, lower, regularizer, rate, data, neural, higher, performance, mixup, function, sample, test, process] [conference, local, approach, coarse, computer, plane, international, human, vision, term]
@InProceedings{Addepalli_2020_CVPR,
  author = {Addepalli, Sravanti and B.S., Vivek and Baburaj, Arya and Sriramanan, Gaurang and Babu, R. Venkatesh},
  title = {Towards Achieving Adversarial Robustness by Enforcing Feature Consistency Across Bit Planes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Polishing Decision-Based Adversarial Noise With a Customized Sampling
Yucheng Shi, Yahong Han, Qi Tian


As an effective family of black-box adversarial attacks, decision-based methods polish adversarial noise by querying the target model. Among them, the boundary attack is widely applied due to its powerful noise compression capability, especially when combined with transfer-based methods. The boundary attack splits the noise compression into several independent sampling processes, repeating each query with a constant sampling setting. In this paper, we demonstrate the advantage of using the current noise and historical queries to customize the variance and mean of the sampling in the boundary attack to polish adversarial noise. We further reveal the relationship between the initial noise and the compressed noise in the boundary attack. We propose the Customized Adversarial Boundary (CAB) attack, which uses the current noise to model the sensitivity of each pixel and polishes the adversarial noise of each image with a customized sampling setting. On the one hand, CAB uses the current noise as a prior belief to customize the multivariate normal distribution. On the other hand, CAB keeps new samples away from historical failed queries to avoid similar mistakes. Experimental results measured on several image classification datasets emphasize the validity of our method.
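A highly simplified sketch of one customized sampling step in a decision-based boundary attack: the per-pixel sampling scale is derived from the current noise, and candidates are repelled from the mean of previously failed queries. The step sizes, repulsion weight, and update rule are assumptions for illustration, not the paper's exact procedure.

import numpy as np

def customized_boundary_step(x_orig, x_adv, failed, is_adversarial, step=0.01, repel=0.5):
    """One customized sampling step of a decision-based boundary attack.

    x_orig: original image; x_adv: current adversarial image (same shape, in [0, 1])
    failed: list of previously failed candidate perturbations
    is_adversarial: callable querying the target model's decision (black-box)
    """
    noise = x_adv - x_orig
    # Customize the per-pixel sampling scale from the current noise magnitude:
    # pixels already carrying large noise are treated as more sensitive.
    scale = np.abs(noise) / (np.abs(noise).mean() + 1e-12)
    candidate = np.random.normal(loc=0.0, scale=step * scale, size=x_adv.shape)
    # Keep the new sample away from the mean of historical failed queries.
    if failed:
        candidate -= repel * np.mean(failed, axis=0)
    # Move slightly towards the original image (noise compression) plus the sample.
    x_new = np.clip(x_adv + candidate - step * noise, 0.0, 1.0)
    if is_adversarial(x_new):
        return x_new, failed
    failed.append(candidate)
    return x_adv, failed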
[current, historical, order, step, represent] [boundary, table, biased, bba, represents] [noise, adversarial, attack, cab, magnitude, model, query, stepsize, failed, example, original, customize, whey, customized, decision, misclassification, input, substitute, polish, sensitivity, blackbox, zeroth, monotonicity, satisfies, evo, polishing, constant, robustness, customization, mnist] [based, compression, senet, method, pixel] [target, image, source, generated, generate] [sampling, distribution, variance, number, evolutionary, space, random, imagenet, set, reduction, nasnet, optimization, efficiency, probability, rate, deep, neural, learning, sample, strategy] [initial, normal, direction, median, spherical, distance, dense, absolute]
@InProceedings{Shi_2020_CVPR,
  author = {Shi, Yucheng and Han, Yahong and Tian, Qi},
  title = {Polishing Decision-Based Adversarial Noise With a Customized Sampling},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Large Yet Imperceptible Adversarial Image Perturbations With Perceptual Color Distance
Zhengyu Zhao, Zhuoran Liu, Martha Larson


The success of image perturbations that are designed to fool image classifiers is assessed in terms of both adversarial effect and visual imperceptibility. The conventional assumption on imperceptibility is that perturbations should strive for tight Lp-norm bounds in RGB space. In this work, we drop this assumption by pursuing an approach that exploits human color perception and, more specifically, minimizes perturbation size with respect to perceptual color distance. Our first approach, Perceptual Color distance C&W (PerC-C&W), extends the widely-used C&W approach and produces larger RGB perturbations. PerC-C&W is able to maintain adversarial strength while contributing to imperceptibility. Our second approach, Perceptual Color distance Alternating Loss (PerC-AL), achieves the same outcome, but does so more efficiently by alternating between the classification loss and the perceptual color difference when updating perturbations. Experimental evaluation shows that PerC approaches outperform conventional Lp approaches in terms of robustness and transferability, and also demonstrates that the PerC distance can provide added value on top of existing structure-based methods for creating image perturbations.
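A simplified, batch-level sketch of the alternating-loss idea: while the input is still classified correctly, ascend the classification loss; once it is adversarial, descend a perceptual color distance to polish the perturbation. The helper `perceptual_color_distance` (e.g., a differentiable CIEDE2000) is hypothetical and not defined here; step sizes and iteration count are illustrative.

import torch
import torch.nn.functional as F

def alternating_color_attack(model, x, y, perceptual_color_distance,
                             steps=100, alpha_cls=1.0, alpha_col=0.1):
    """Alternate between growing adversarial strength and shrinking the
    perceptual color distance to the original image."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        if logits.argmax(1).eq(y).any():
            # Not adversarial yet: increase the classification loss.
            loss = F.cross_entropy(logits, y)
            grad, = torch.autograd.grad(loss, x_adv)
            update = alpha_cls * grad / (grad.norm() + 1e-12)
        else:
            # Already adversarial: shrink the perceptual color distance.
            loss = perceptual_color_distance(x_adv, x)
            grad, = torch.autograd.grad(loss, x_adv)
            update = -alpha_col * grad / (grad.norm() + 1e-12)
        x_adv = (x_adv + update).clamp(0, 1).detach()
    return x_adv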
[work, visual, order, step, evaluation, three] [confidence, table] [adversarial, perc, norm, original, perturbation, difference, imperceptibility, success, perturbed, security, imperceptible, robustness, untargeted, create, budget, privacy, fool, experimental, accumulated, ian, successful, model, improve, strength, perturbs, adding] [color, perceptual, ddn, method, based, high, existing, achieved, ieee, designed, conventional, figure, assumption, commonly, proposed] [image, loss, structural, creating, generated, inception, transferable] [respect, size, deep, space, learning, large, classification, search, updating, optimization, neural, similarity, find, gradient, achieve, alternating, small, efficient, note, investigate, larger] [distance, rgb, approach, human, directly, smooth, direction, joint, computer, michael]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Zhengyu and Liu, Zhuoran and Larson, Martha},
  title = {Towards Large Yet Imperceptible Adversarial Image Perturbations With Perceptual Color Distance},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Something-Else: Compositional Action Recognition With Spatial-Temporal Interaction Networks
Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, Trevor Darrell


Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training demonstrations. In this paper, we study the compositionality of action by looking into the dynamics of subject-object interactions. We propose a novel model which can explicitly reason about the geometric relations between constituent objects and an agent performing an action. To train our model, we collect dense object box annotations on the Something-Something dataset. We propose a novel compositional action recognition task where the training combinations of verbs and nouns do not overlap with the test set. The novel aspects of our model are applicable to activities with prominent object interaction dynamics and to objects which can be tracked using state-of-the-art approaches; for activities without clearly defined spatial object-agent interactions, we rely on baseline scene-level spatio-temporal representations. We show the effectiveness of our approach not only on the proposed compositional action recognition task but also in a few-shot compositional setting which requires the model to generalize across both object appearance and action category.
[action, compositional, stin, interaction, recognition, oie, video, smth, reasoning, temporal, graph, dataset, visual, combining, time, recognize, explicitly, constituent, current, relational, abhinav, agent, verb] [object, feature, table, bounding, box, split, detection, improvement, tracking, propose, ross, kaiming, category, module] [model, trained, testing, identity] [spatial, figure, convolutional, based, combination] [perform, appearance, generalize, train, trevor, unseen] [training, learning, set, setting, network, neural, base, deep, baseline, simple, classification, validation, performance, randomly, arxiv, preprint, large] [novel, human, david, geometric, computer, coordinate]
@InProceedings{Materzynska_2020_CVPR,
  author = {Materzynska, Joanna and Xiao, Tete and Herzig, Roei and Xu, Huijuan and Wang, Xiaolong and Darrell, Trevor},
  title = {Something-Else: Compositional Action Recognition With Spatial-Temporal Interaction Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Unsupervised Hierarchical Part Decomposition of 3D Objects From a Single RGB Image
Despoina Paschalidou, Luc Van Gool, Andreas Geiger


Humans perceive the 3D world as a set of distinct objects that are characterized by various low-level (geometry, reflectance) and high-level (connectivity, adjacency, symmetry) properties. Recent methods based on convolutional neural networks (CNNs) have demonstrated impressive progress in 3D reconstruction, even when using a single 2D image as input. However, the majority of these methods focus on recovering the local 3D geometry of an object without considering its part-based decomposition or relations between parts. We address this challenging problem by proposing a novel formulation that allows us to jointly recover the geometry of a 3D object as a set of primitives as well as their latent hierarchical structure, without part-level supervision. Our model recovers the higher-level structural decomposition of various objects in the form of a binary tree of primitives, where simple parts are represented with fewer primitives and more complex parts are modeled with more components. Our experiments on the ShapeNet and D-FAUST datasets demonstrate that considering the organization of parts indeed facilitates reasoning about 3D geometry.
[hierarchical, recognition, predict, node, prediction] [object, predicted, feature, level, iou, supervision, semantic, centroid, employ] [model, input, quality] [ieee, pattern, tree, recover, figure, partition, contrast] [target, representation, image, loss, learn, unsupervised, learns] [learning, network, neural, set, note, function, binary, hierarchy, learned, maximum, deep, processing, max, task, consider, unbalanced, number] [shape, primitive, reconstruction, computer, vision, depth, geometry, structure, international, single, pdk, point, xkd, hdk, decomposition, human, mesh, implicit, hao, rgb, qkd, sqs, leonidas, andreas, well, represented, surface, recovers, cdk, occupancy, form, require, representing]
@InProceedings{Paschalidou_2020_CVPR,
  author = {Paschalidou, Despoina and Gool, Luc Van and Geiger, Andreas},
  title = {Learning Unsupervised Hierarchical Part Decomposition of 3D Objects From a Single RGB Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Focus on Defocus: Bridging the Synthetic to Real Domain Gap for Depth Estimation
Maxim Maximov, Kevin Galim, Laura Leal-Taixe


Data-driven depth estimation methods struggle to generalize outside their training scenes due to the immense variability of real-world scenes. This problem can be partially addressed by using synthetically generated images, but closing the synthetic-real domain gap is far from trivial. In this paper, we tackle this issue by using domain-invariant defocus blur as direct supervision. We leverage defocus cues by using a permutation-invariant convolutional neural network that encourages the network to learn from the differences between images with a different point of focus. Our proposed network uses the defocus map as an intermediate supervisory signal. We are able to train our model completely on synthetic data and directly apply it to a wide range of real-world images. We evaluate our model on synthetic and real datasets, showing compelling generalization results and state-of-the-art depth prediction. The dataset and code are available at https://github.com/dvl-tum/defocus-net.
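A small sketch of how a permutation-invariant network can process a focal stack: a shared 2D encoder runs on every focus slice, a symmetric max-pool over the stack dimension makes the result order-invariant, and a defocus-map head provides the intermediate supervision. The layer sizes and heads here are assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class FocalStackNet(nn.Module):
    """Shared per-slice encoder + order-invariant pooling over the focal stack."""
    def __init__(self, feat=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.defocus_head = nn.Conv2d(feat, 1, 3, padding=1)  # per-slice defocus map
        self.depth_head = nn.Conv2d(feat, 1, 3, padding=1)    # depth from pooled features

    def forward(self, stack):              # stack: (N, S, 3, H, W), S = #focus slices
        n, s, c, h, w = stack.shape
        feats = self.encoder(stack.view(n * s, c, h, w))
        defocus = self.defocus_head(feats).view(n, s, 1, h, w)   # intermediate signal
        pooled = feats.view(n, s, -1, h, w).max(dim=1).values    # permutation-invariant
        depth = self.depth_head(pooled)
        return depth, defocus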
[dataset, prediction, regular, work, decoder] [focus, map, table, global, main, object, pooling, propose] [input, model, trained, generalization, datasets, effective] [stack, ieee, blur, range, method, pattern, cvpr, dynamic, june, figure, based, output, september, proposed] [image, synthetic, real, domain, train, invariant, row, generalize, perform, autoencoder, appearance] [network, training, test, data, number, learning, better, architecture, random, set, neural, wide, deep, mobile, problem] [depth, defocus, focal, estimation, camera, computer, vision, conference, single, estimate, scene, rgb, monocular, compare, stereo, dof, defocusnet, direct, compute, rely, distance, coc, approach, allows, nyu, directly, shape]
@InProceedings{Maximov_2020_CVPR,
  author = {Maximov, Maxim and Galim, Kevin and Leal-Taixe, Laura},
  title = {Focus on Defocus: Bridging the Synthetic to Real Domain Gap for Depth Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Active Vision for Early Recognition of Human Actions
Boyu Wang, Lihan Huang, Minh Hoai


We propose a method for early recognition of human actions, one that can take advantages of multiple cameras while satisfying the constraints due to limited communication bandwidth and processing power. Our method considers multiple cameras, and at each time step, it will decide the best camera to use so that a confident recognition decision can be reached as soon as possible. We formulate the camera selection problem as a sequential decision process, and learn a view selection policy based on reinforcement learning. We also develop a novel recurrent neural network architecture to account for the unobserved video frames and the irregular intervals between the observed frames. Experiments on three datasets demonstrate the effectiveness of our approach for early recognition of human actions.
[action, recognition, time, policy, multiple, video, mfrnn, recurrent, rnn, frame, state, three, observed, dataset, belief, observation, elapsed, indrnn, reinforcement, activity, ntu, agent, integrating, integrate, sequence, work, step, ixmas] [framework, table, feature, propose] [input, decision, model, scenario, analyze, trained] [ieee, pattern, based, method, output, proposed, handling] [missing, learn, image, cycle] [selection, early, performance, learned, select, network, data, test, training, learning, vector, accuracy, function, probability, neural, architecture, random, active, processing, average, consider, classifier, classification, rate, set, class, augmented] [view, camera, human, conference, computer, vision, international, unobserved, novel]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Boyu and Huang, Lihan and Hoai, Minh},
  title = {Active Vision for Early Recognition of Human Actions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SmallBigNet: Integrating Core and Contextual Views for Video Classification
Xianhang Li, Yali Wang, Zhipeng Zhou, Yu Qiao


Temporal convolution has been widely used for video classification. However, it is performed on spatio-temporal contexts in a limited view, which often weakens its capacity of learning video representation. To alleviate this problem, we propose a concise and novel SmallBig network, with the cooperation of small and big views. For the current time step, the small view branch is used to learn the core semantics, while the big view branch is used to capture the contextual semantics. Unlike traditional temporal convolution, the big view branch can provide the small view branch with the most activated video features from a broader 3D receptive field. Via aggregating such big-view contexts, the small view branch can learn more robust and discriminative spatio-temporal representations for video classification. Furthermore, we propose to share convolution in the small and big view branch, which improves model compactness as well as alleviates overfitting. As a result, our SmallBigNet achieves a comparable model size like 2D CNNs, while boosting accuracy like 3D CNNs. We conduct extensive experiments on the large-scale video benchmarks, e.g., Kinetics400, Something-Something V1 and V2. Our SmallBig network outperforms a number of recent state-of-the-art approaches, in terms of accuracy and/or efficiency. The codes and models will be available on https://github.com/xhl-video/SmallBigNet.
[video, temporal, unit, outperforms, action, spatiotemporal, yellow, cooperation, aggregating, attention, visual, recognition] [table, branch, key, global, extra, pooling, apply, feature, contextual, propose, achieves, tube, box, enlarge] [model, input, shenzhen] [smallbig, big, convolution, receptive, block, nonlocal, residual, broader, field, gflops, preferable, cnns, convolutional, concise] [learn, activated, discriminative, perform, progressively, discover, representation, gradually] [small, network, learning, share, design, accuracy, max, size, reduce, parameter, operation, classification, number, sharing, adapt, best, efficient, deep, set, better] [view, core, local, limited, well, full, directly, novel]
@InProceedings{Li_2020_CVPR,
  author = {Li, Xianhang and Wang, Yali and Zhou, Zhipeng and Qiao, Yu},
  title = {SmallBigNet: Integrating Core and Contextual Views for Video Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Gate-Shift Networks for Video Action Recognition
Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz


Deep 3D CNNs for video action recognition are designed to learn powerful representations in the joint spatio-temporal feature space. In practice, however, because of the large number of parameters and computations involved, they may underperform when sufficiently large datasets are not available to train them at scale. In this paper we introduce spatial gating into the spatial-temporal decomposition of 3D kernels. We implement this concept with the Gate-Shift Module (GSM). GSM is lightweight and turns a 2D CNN into a highly efficient spatio-temporal feature extractor. With GSM plugged in, a 2D CNN learns to adaptively route features through time and combine them, at almost no additional parameter and computational overhead. We perform an extensive evaluation of the proposed module to study its effectiveness for video action recognition, achieving state-of-the-art results on the Something-Something V1 and Diving48 datasets, and obtaining competitive results on EPIC-Kitchens with far lower model complexity.
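A rough sketch of a gate-shift style block on top of 2D-CNN features: a lightweight spatial gate decides which features are routed through a temporal shift, and the result is fused back with the ungated features. The depthwise gating convolution, the half-forward/half-backward circular shift, and the residual fusion are assumptions for illustration, not the exact GSM design.

import torch
import torch.nn as nn

class GateShiftBlock(nn.Module):
    """Spatial gating + temporal shift for features from a 2D CNN."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)

    def forward(self, x, num_frames):
        # x: (N*T, C, H, W) per-frame features, T = num_frames
        nt, c, h, w = x.shape
        g = torch.tanh(self.gate(x))                     # spatial gate in [-1, 1]
        gated = (g * x).view(-1, num_frames, c, h, w)
        # Shift half of the gated channels forward in time, half backward
        # (circular shift used here for brevity).
        fwd = torch.roll(gated[:, :, : c // 2], shifts=1, dims=1)
        bwd = torch.roll(gated[:, :, c // 2 :], shifts=-1, dims=1)
        shifted = torch.cat([fwd, bwd], dim=2).view(nt, c, h, w)
        # Route gated features through the shift; keep the rest on the 2D path.
        return x - gated.view(nt, c, h, w) + shifted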
[gsm, action, video, temporal, recognition, tsn, spatiotemporal, kinetics, dataset, gst, shift, tanh, time, modeling, three, tsm, noun] [feature, module, cnn, branch, pooling, table, inside, ablation, apply, backbone] [model, trained, input, datasets] [spatial, convolution, motion, block, flow, residual, convolutional, channel, optical, cnns, performs, kernel, figure, proposed, existing, lightweight, plugged] [discriminative, inception, image, representation, learn] [imagenet, performance, gating, number, learning, network, accuracy, architecture, efficient, improved, computational, compared, applied, layer, group, set, dimension, standard, deep, large] [additional, approach, single, decomposition, rgb]
@InProceedings{Sudhakaran_2020_CVPR,
  author = {Sudhakaran, Swathikiran and Escalera, Sergio and Lanz, Oswald},
  title = {Gate-Shift Networks for Video Action Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition
Pengfei Zhang, Cuiling Lan, Wenjun Zeng, Junliang Xing, Jianru Xue, Nanning Zheng


Skeleton-based human action recognition has attracted great interest thanks to the easy accessibility of human skeleton data. Recently, there has been a trend of using very deep feedforward neural networks to model the 3D coordinates of joints without considering computational efficiency. In this paper, we propose a simple yet effective semantics-guided neural network (SGN) for skeleton-based action recognition. We explicitly introduce the high-level semantics of joints (joint type and frame index) into the network to enhance the feature representation capability. In addition, we exploit the relationships of joints hierarchically through two modules, i.e., a joint-level module for modeling the correlations of joints in the same frame and a frame-level module for modeling the dependencies of frames by taking the joints in the same frame as a whole. A strong baseline is proposed to facilitate the study of this field. With an order of magnitude smaller model size than most previous works, SGN achieves state-of-the-art performance on the NTU60, NTU120, and SYSU datasets.
[action, frame, semantics, skeleton, recognition, graph, temporal, sequence, sgn, gcn, three, maxpooling, order, lstm, dataset, passing, message, cuiling, wenjun, previous, explicitly, exploit, modeling, recurrent, foot, attention, junliang, smp] [module, denotes, table, cnn, feature, effectiveness, achieves, level, propose, gang, sysu] [type, model, strong, effective] [convolutional, spatial, high, proposed, adaptive, based, method, kernel, convolution] [representation, learn, jun] [neural, network, learning, performance, layer, accuracy, number, deep, size, data, set, baseline, training, denote, large, smaller, dimension] [joint, human, position, body, pose, velocity]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Pengfei and Lan, Cuiling and Zeng, Wenjun and Xing, Junliang and Xue, Jianru and Zheng, Nanning},
  title = {Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Exploiting Joint Robustness to Adversarial Perturbations
Ali Dabouei, Sobhan Soleymani, Fariborz Taherkhani, Jeremy Dawson, Nasser M. Nasrabadi


Recently, ensemble models have demonstrated empirical capabilities to alleviate the adversarial vulnerability. In this paper, we exploit first-order interactions within ensembles to formalize a reliable and practical defense. We introduce a scenario of interactions that certifiably improves the robustness according to the size of the ensemble, the diversity of the gradient directions, and the balance of the member's contribution to the robustness. We present a joint gradient phase and magnitude regularization (GPMR) as a vigorous approach to impose the desired scenario of interactions among members of the ensemble. Through extensive experiments, including gradient-based and gradient-free evaluations on several datasets and network architectures, we validate the practical effectiveness of the proposed approach compared to the previous methods. Furthermore, we demonstrate that GPMR is orthogonal to other defense strategies developed for single classifiers and their combination can further improve the robustness of ensembles.
[prediction, natural, provide, previous, multiple] [effectiveness, table, improves, framework, predicted, equalization] [robustness, adversarial, ensemble, gpmr, magnitude, input, gal, defense, fool, diversifying, model, attack, adp, improve, perturbation, case, change, mnist, pgd, transferability, scenario, dnns] [figure, net, proposed, ieee, method, based] [diversity, loss] [gradient, training, classification, set, performance, neural, deep, classifier, bound, arxiv, preprint, learning, number, accuracy, optimal, similarity, maximum, lower, equal, size, regularization, equation, theorem, evaluate, network, sample, rate, cosine, compared, orthogonal, data] [joint, approach, conference, single, computer, error]
@InProceedings{Dabouei_2020_CVPR,
  author = {Dabouei, Ali and Soleymani, Sobhan and Taherkhani, Fariborz and Dawson, Jeremy and Nasrabadi, Nasser M.},
  title = {Exploiting Joint Robustness to Adversarial Perturbations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
From Image Collections to Point Clouds With Self-Supervised Shape and Pose Networks
K L Navaneet, Ansu Mathew, Shashank Kashyap, Wei-Chih Hung, Varun Jampani, R. Venkatesh Babu


Reconstructing 3D models from 2D images is one of the fundamental problems in computer vision. In this work, we propose a deep learning technique for 3D object reconstruction from a single image. Contrary to recent works that either use 3D supervision or multi-view supervision, we use only single view images with no pose information during training as well. This makes our approach more practical requiring only an image collection of an object category and the corresponding silhouettes. We learn both 3D point cloud reconstruction and pose estimation networks in a self-supervised manner, making use of differentiable point cloud renderer to train with 2D supervision. A key novelty of the proposed technique is to impose 3D geometric reasoning into predicted 3D point clouds by rotating them with randomly sampled poses and then enforcing cycle consistency on both 3D reconstructions and poses. In addition, using single-view supervision allows us to do test-time optimization on a given test image. Experiments on the synthetic ShapeNet and real-world Pix3D datasets demonstrate that our approach, despite using less supervision, can achieve competitive performance compared to pose-supervised and multi-view supervised approaches.
[prediction, multiple, dataset] [object, supervision, predicted, car, propose, mask, table] [input, model, trained] [based, proposed, color, figure] [image, consistency, loss, supervised, corresponding, cycle, train, utilize, learn] [network, training, learning, set, performance, observe, randomly, optimization, better, sampled, compared, utilizing, note, number, evaluate] [pose, reconstruction, point, approach, single, cloud, differ, shape, ground, truth, nearest, geometric, chair, ulsp, shapenet, differentiable, projection, additional, view, reconstructed, enforce, viewpoint, correspondence, chamfer, symmetry, conference, computer, camera, distance, projected, provided, aero, degenerate, neighbour]
@InProceedings{Navaneet_2020_CVPR,
  author = {Navaneet, K L and Mathew, Ansu and Kashyap, Shashank and Hung, Wei-Chih and Jampani, Varun and Babu, R. Venkatesh},
  title = {From Image Collections to Point Clouds With Self-Supervised Shape and Pose Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Searching for Actions on the Hyperbole
Teng Long, Pascal Mettes, Heng Tao Shen, Cees G. M. Snoek


In this paper, we introduce hierarchical action search. Starting from the observation that hierarchies are mostly ignored in the action literature, we retrieve not only individual actions but also relevant and related actions, given an action name or a video example as input. We propose a hyperbolic action network, which is centered around a hyperbolic space shared by action hierarchies and videos. Our discriminative hyperbolic embedding projects actions onto the shared space while jointly optimizing hypernym-hyponym relations between action pairs and a large-margin separation between all actions. The projected actions serve as hyperbolic prototypes that we match with projected video representations. The result is a learned space where videos are positioned in entailment cones formed by different subtrees. To perform search in this space, we start from a query and increasingly enlarge its entailment cone to retrieve hierarchically relevant action videos. Experiments on three action datasets with new hierarchy annotations show the effectiveness of our approach for hierarchical action search by name and by video example, regardless of whether the queried actions have been seen during training. Our implementation is available at https://github.com/Tenglon/hyperbolic_action
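The hyperbolic space is commonly realized as the Poincare ball, where the standard geodesic distance can be used to match projected video embeddings with action prototypes. The helper below implements that standard distance; its use here is only a sketch of the matching step, not the authors' code.

import torch

def poincare_distance(u, v, eps=1e-5):
    """Geodesic distance between points inside the unit Poincare ball.

    u, v: (..., d) tensors with norms strictly below 1.
    """
    uu = u.pow(2).sum(-1)
    vv = v.pow(2).sum(-1)
    uv = (u - v).pow(2).sum(-1)
    x = 1 + 2 * uv / ((1 - uu).clamp_min(eps) * (1 - vv).clamp_min(eps))
    return torch.acosh(x.clamp_min(1 + eps))

# Retrieval then ranks action prototypes by their Poincare distance to the
# projected video embedding (smaller distance = more relevant).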
[action, hierarchical, video, embedding, embeddings, three, recognition, entailment, retrieval, retrieve, mettes, hyperbole, hypernym, activitynet, cees, relevant, word, provide] [map, table, propose, sibling, denotes, semantic] [query, example, datasets, model, university] [figure, high, separation, tree, based] [shared, discriminative, loss, perform, image, unseen, tao, common, project, representation] [hyperbolic, search, space, hierarchy, dnc, learning, barz, standard, large, paper, class, denzler, network, margin, function, set, setup, training, searching, softmax, report, optimization, akin, equation, riemannian, dimensionality] [approach, matching, euclidean, projected, distance, cone, match, compare]
@InProceedings{Long_2020_CVPR,
  author = {Long, Teng and Mettes, Pascal and Shen, Heng Tao and Snoek, Cees G. M.},
  title = {Searching for Actions on the Hyperbole},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ColorFool: Semantic Adversarial Colorization
Ali Shahin Shamsabadi, Ricardo Sanchez-Matilla, Andrea Cavallaro


Adversarial attacks that generate small Lp norm perturbations to mislead classifiers have limited success in black-box settings and with unseen classifiers. These attacks are also not robust to defenses that use denoising filters and to adversarial training procedures. Instead, adversarial attacks that generate unrestricted perturbations are more robust to defenses, are generally more successful in black-box settings and are more transferable to unseen classifiers. However, unrestricted perturbations may be noticeable to humans. In this paper, we propose a content-based black-box adversarial attack that generates unrestricted perturbations by exploiting image semantics to selectively modify colors within chosen ranges that are perceived as natural by humans. We show that the proposed approach, ColorFool, outperforms in terms of success rate, robustness to defense frameworks and transferability, five state-of-the-art adversarial attacks on two different tasks, scene and object classification, when attacking three state-of-the-art deep neural networks using three standard datasets. The source code is available at https://github.com/smartcameras/ColorFool.
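A simplified, single-region sketch of the semantic colorization idea: convert the image to a perceptual color space, randomly shift the chromatic channels inside a region whose color may plausibly vary, and query the black-box classifier until it is fooled. The region mask, chroma ranges, trial budget, and gradual widening are assumptions for illustration; scikit-image is used for the Lab conversion.

import numpy as np
from skimage import color

def semantic_color_attack(image_rgb, region_mask, predict, true_label,
                          max_trials=100, a_range=30.0, b_range=30.0):
    """Query-based semantic color attack (simplified, one region).

    image_rgb:   float image in [0, 1], shape (H, W, 3)
    region_mask: boolean (H, W) mask of a region whose colors may vary freely
    predict:     callable mapping an RGB image to a class label (black-box)
    """
    lab = color.rgb2lab(image_rgb)
    for trial in range(1, max_trials + 1):
        perturbed = lab.copy()
        scale = trial / max_trials      # gradually widen the allowed color range
        perturbed[..., 1][region_mask] += np.random.uniform(-a_range, a_range) * scale
        perturbed[..., 2][region_mask] += np.random.uniform(-b_range, b_range) * scale
        candidate = np.clip(color.lab2rgb(perturbed), 0.0, 1.0)
        if predict(candidate) != true_label:
            return candidate            # natural-looking adversarial image
    return None                         # attack failed within the query budget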
[recognition, three, considering] [semantic, object, region] [adversarial, colorfool, bim, clean, restricted, success, unrestricted, sparsefool, robustness, perturbation, deepfool, attack, semanticadv, jpeg, robust, sensitive, datasets, attacking, misleading, transferability, quality, chosen, iterative, trained, decision] [color, ieee, pattern, june, pixel, proposed, based, method, filtering, compression, range, figure] [image, generated, generate, unseen, colorization, generates, transferable] [classifier, number, rate, imagenet, training, deep, neural, higher, random, learning, basic, alexnet, maximum, classification, class, space, function] [conference, vision, computer, scene, median, human]
@InProceedings{Shamsabadi_2020_CVPR,
  author = {Shamsabadi, Ali Shahin and Sanchez-Matilla, Ricardo and Cavallaro, Andrea},
  title = {ColorFool: Semantic Adversarial Colorization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Boosting the Transferability of Adversarial Samples via Attention
Weibin Wu, Yuxin Su, Xixian Chen, Shenglin Zhao, Irwin King, Michael R. Lyu, Yu-Wing Tai


The widespread deployment of deep models necessitates the assessment of model vulnerability in practice, especially for safety- and security-sensitive domains such as autonomous driving and medical diagnosis. Transfer-based attacks against image classifiers thus elicit mounting interest, where attackers are required to craft adversarial images based on local proxy models without the feedback information from remote target ones. However, under such a challenging but practical setup, the synthesized adversarial samples often achieve limited success due to overfitting to the local model employed. In this work, we propose a novel mechanism to alleviate the overfitting issue. It computes model attention over extracted features to regularize the search of adversarial examples, which prioritizes the corruption of critical features that are likely to be adopted by diverse architectures. Consequently, it can promote the transferability of resultant adversarial instances. Extensive experiments on ImageNet classifiers confirm the effectiveness of our strategy and its superiority to state-of-the-art benchmarks in both white-box and black-box settings.
[attention, critical, exploit, prediction] [feature, resnet, map, benchmark, object, employ, remote, adopt, table] [adversarial, model, attack, tap, transferability, success, bim, ata, fgsm, input, deceptive, clean, perturbation, jsma, malicious, noise, trained, ian, victim, defended, ensemble, undefended, dnn, threat, query, technique, adversarially] [figure, ieee, proposed, method, generally] [image, source, inception, target, diverse, cat, loss, transferable, extracted, corresponding, synthesized] [deep, learning, strategy, function, regularization, search, optimization, training, algorithm, overfitting, neural, machine, accuracy, gradient, performance, layer, note, imagenet, classifier] [conference, international, computer, term, vision, local, resultant, limited]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Weibin and Su, Yuxin and Chen, Xixian and Zhao, Shenglin and King, Irwin and Lyu, Michael R. and Tai, Yu-Wing},
  title = {Boosting the Transferability of Adversarial Samples via Attention},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ActionBytes: Learning From Trimmed Videos to Localize Actions
Mihir Jain, Amir Ghodrati, Cees G. M. Snoek


This paper tackles the problem of localizing actions in long untrimmed videos. Different from existing works, which all use annotated untrimmed videos during training, we learn only from short trimmed videos. This enables learning from large-scale datasets originally designed for action classification. We propose a method to train an action localization network that segments a video into interpretable fragments, which we call ActionBytes. Our method jointly learns to cluster ActionBytes and trains the localization network using the cluster assignments as pseudo-labels. By doing so, we train on short trimmed videos that are effectively untrimmed with respect to ActionBytes. In isolation, or when merged, ActionBytes also serve as effective action proposals. Experiments demonstrate that our boundary-guided training generalizes to unknown action classes and localizes actions in the long videos of Thumos14, MultiThumos, and ActivityNet1.2. Furthermore, we show the advantage of ActionBytes for zero-shot localization as well as for traditional weakly supervised localization, which trains on long videos, achieving state-of-the-art results.
[action, actionbytes, actionbyte, video, temporal, long, untrimmed, short, trimmed, localize, dataset, multithumos, length, time, cshort, localizing, multiple, extract, recognition, word, clong] [localization, table, weakly, map, score, effectiveness, supervision, feature, segment] [model, trained, improve, datasets, iterative] [figure, method, proposed] [latent, train, learn, transfer, loss, generate, supervised, common, extracted, interpretable, cluster, unseen] [set, training, learning, class, baseline, test, performance, number, mining, label, evaluate, classification, network, average, compared, validation, knowledge, deep, problem, clustering, layer, activation, task, vector] [single, projection, approach, pipeline, second]
@InProceedings{Jain_2020_CVPR,
  author = {Jain, Mihir and Ghodrati, Amir and Snoek, Cees G. M.},
  title = {ActionBytes: Learning From Trimmed Videos to Localize Actions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Efficient Adversarial Training With Transferable Adversarial Examples
Haizhong Zheng, Ziqi Zhang, Juncheng Gu, Honglak Lee, Atul Prakash


Adversarial training is an effective defense method to protect classification models against adversarial attacks. However, one limitation of this approach is that it can require orders of magnitude more training time due to the high cost of generating strong adversarial examples during training. In this paper, we first show that there is high transferability between models from neighboring epochs in the same training process, i.e., adversarial examples from one epoch continue to be adversarial in subsequent epochs. Leveraging this property, we propose a novel method, Adversarial Training with Transferable Adversarial Examples (ATTA), that can enhance the robustness of trained models and greatly improve the training efficiency by accumulating adversarial perturbations through epochs. Compared to state-of-the-art adversarial training methods, ATTA enhances adversarial accuracy by up to 7.2% on CIFAR10 and requires 12-14x less training time on the MNIST and CIFAR10 datasets with comparable model robustness.
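A minimal sketch of accumulating perturbations across epochs: a per-sample buffer stores the perturbation found in the previous epoch and serves as the starting point of a cheap single-step attack in the current one. The loader is assumed to yield sample indices, the buffer handling ignores data-augmentation alignment, and all hyperparameters are illustrative.

import torch
import torch.nn.functional as F

def accumulated_adv_training_epoch(model, loader, optimizer, pert_buffer,
                                   eps=8/255, alpha=2/255):
    """One epoch of adversarial training with perturbations carried over epochs.

    pert_buffer: dict mapping sample index -> stored perturbation tensor.
    """
    model.train()
    for idx, x, y in loader:                       # loader yields (indices, images, labels)
        delta = torch.stack([pert_buffer.get(int(i), torch.zeros_like(x[0]))
                             for i in idx])
        delta.requires_grad_(True)
        loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        # Single cheap attack step, then project back to the L_inf ball.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
        for i, d in zip(idx, delta):               # accumulate across epochs
            pert_buffer[int(i)] = d
        optimizer.zero_grad()
        F.cross_entropy(model((x + delta).clamp(0, 1)), y).backward()
        optimizer.step()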
[time, natural, previous] [table, achieves, faster, improves] [adversarial, attack, model, transferability, atta, robustness, perturbation, trained, mnist, defense, strength, mat, example, accumulative, robust, input, ian, strong, improve, iterative, bounded, accumulated, targeted, adversarially, effective, stronger, insight, pgd, yopo] [method, high, figure, traditional, inverse, based, neighboring, result] [loss, generate, image, train, transferable, generated, source, free] [training, data, epoch, learning, accuracy, higher, compared, achieve, augmentation, efficiency, comparable, better, number, evaluate, reusing, equation, machine, find, function, gradient, performance, fewer, connection, neural, reuse] [conference, international, computer]
@InProceedings{Zheng_2020_CVPR,
  author = {Zheng, Haizhong and Zhang, Ziqi and Gu, Juncheng and Lee, Honglak and Prakash, Atul},
  title = {Efficient Adversarial Training With Transferable Adversarial Examples},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Alleviation of Gradient Exploding in GANs: Fake Can Be Real
Song Tao, Jia Wang


In order to alleviate the notorious mode collapse phenomenon in generative adversarial networks (GANs), we propose a novel training method for GANs in which certain fake samples are considered as real ones during the training process. This strategy can reduce the gradient value that the generator receives in the region where gradient exploding happens. We show how an unbalanced generation process and a vicious-circle issue result from gradient exploding in practical training, which explains the instability of GANs. We also theoretically prove that gradient exploding can be alleviated by penalizing the difference between discriminator outputs and by the fake-as-real consideration for very close real and fake samples. Accordingly, Fake-As-Real GAN (FARGAN) is proposed, with a more stable training process and a more faithful generated distribution. Experiments on different datasets verify our theoretical analysis.
[dataset, considering, pair, multiple, work] [resnet, including, region] [adversarial, generalization, norm, improve, difference, finite, serious, original, datasets] [method, faithful, figure, based, proposed, achieved] [discriminator, real, fake, generated, generative, exploding, gan, generator, fargan, mode, gans, generation, vicious, collapse, issue, alleviation, nsgan, target, corresponding, fid] [gradient, training, close, learning, process, distribution, set, penalty, unbalanced, neural, minibatch, considered, theoretical, large, architecture, overfitting, prevent, achieve, log, imagenet, better, number, appendix, weight, lower, arxiv, preprint, processing, circle, consideration, capacity, data, equilibrium, practice, penalization, proposition, empirical, larger, note, stable] [conference, international, local, term]
@InProceedings{Tao_2020_CVPR,
  author = {Tao, Song and Wang, Jia},
  title = {Alleviation of Gradient Exploding in GANs: Fake Can Be Real},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
On Isometry Robustness of Deep 3D Point Cloud Models Under Adversarial Attacks
Yue Zhao, Yuwei Wu, Caihua Chen, Andrew Lim


While deep learning in the 3D domain has achieved revolutionary performance in many tasks, the robustness of these models has not been sufficiently studied or explored. Regarding 3D adversarial samples, most existing works focus on manipulation of local points, which may fail to invoke global geometry properties, such as robustness under a linear projection that preserves the Euclidean distance, i.e., an isometry. In this work, we show that existing state-of-the-art deep 3D models are extremely vulnerable to isometry transformations. Armed with Thompson Sampling, we develop a black-box attack with a success rate over 95% on the ModelNet40 data set. Incorporating the Restricted Isometry Property, we propose a novel framework of white-box attacks on top of spectral norm based perturbation. In contrast to previous works, our adversarial samples are experimentally shown to be strongly transferable. Evaluated on a sequence of prevailing 3D models, our white-box attack achieves success rates from 98.88% to 100%. It maintains a successful attack rate over 95% even within an imperceptible rotation range of [±2.81°].
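The black-box variant reduces to searching over small rigid rotations (distance-preserving maps) until the classifier's label flips. The sketch below illustrates that search with plain uniform sampling around one axis; the paper drives the search with Thompson Sampling and considers general isometries, so treat the function and its parameters as illustrative.

import numpy as np

def rotation_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def random_rotation_attack(classify, points, label, max_deg=2.81, trials=200, seed=0):
    # points: (N, 3) point cloud; classify: callable returning a class id.
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
        rotated = points @ rotation_z(theta).T   # an isometry: pairwise distances preserved
        if classify(rotated) != label:
            return rotated, np.rad2deg(theta)    # imperceptible rotation that fools the model
    return None, None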
[work, current, order] [table, object, global, propose, framework, achieves] [attack, isometry, adversarial, model, success, tsi, ctri, robustness, restricted, norm, thompson, rip, defense, victim, transferability, easily, successful] [ieee, figure, based, pattern, spectral, range, analysis, proposed] [image, generated, corresponding] [data, deep, matrix, learning, sampling, neural, augmentation, probability, accuracy, linear, rate, random, efficient, classification, sample, arxiv, preprint, network, evaluate, small, training, distribution, maximum, performance, computational, function] [point, rotation, cloud, conference, vision, computer, pointnet, property, international, geometry, well, approach, shape, transformation, acm, local, novel, directly]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Yue and Wu, Yuwei and Chen, Caihua and Lim, Andrew},
  title = {On Isometry Robustness of Deep 3D Point Cloud Models Under Adversarial Attacks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Achieving Robustness in the Wild via Adversarial Mixing With Disentangled Representations
Sven Gowal, Chongli Qin, Po-Sen Huang, Taylan Cemgil, Krishnamurthy Dvijotham, Timothy Mann, Pushmeet Kohli


Recent research has made the surprising finding that state-of-the-art deep learning models sometimes fail to generalize to small variations of the input. Adversarial training has been shown to be an effective approach to overcome this problem. However, its application has been limited to enforcing invariance to analytically defined transformations like lp-norm bounded perturbations. Such perturbations do not necessarily cover plausible real-world variations that preserve the semantics of the input (such as a change in lighting conditions). In this paper, we propose a novel approach to express and formalize robustness to these kinds of real-world transformations of the input. The two key ideas underlying our formulation are (1) leveraging disentangled representations of the input to define different factors of variations, and (2) generating new input images by adversarially composing the representations of different images. We use a StyleGAN model to demonstrate the efficacy of this framework. Specifically, we leverage the disentangled latent representations computed by a StyleGAN model to generate perturbations of an image that are similar to real-world variations (like adding make-up, or changing the skin-tone of a person) and train models to be invariant to these perturbations. Extensive experiments show that our method improves generalization and reduces the effect of spurious correlations (reducing the error rate of a "smile" detector by 21% for example).
[dataset, dec, work] [semantic, biased, table] [adversarial, advmix, model, latents, input, risk, nist, robustness, trained, robust, original, randmix, nominal, bounded, spurious, systematically, eleba] [figure, method, ieee, affect, color] [image, disentangled, stylegan, latent, generative, mixing, loss, train, generate, invariant, style, generating, representation, plausible, semantically, corresponding, independent] [training, data, arxiv, preprint, set, neural, accuracy, augmentation, space, deep, learning, mixup, classifier, find, classification, equation, random, test, label, algorithm, finding, small, network, bias, note] [conference, computer, defined, define, demonstrate, well, additional, finer, formulation]
@InProceedings{Gowal_2020_CVPR,
  author = {Gowal, Sven and Qin, Chongli and Huang, Po-Sen and Cemgil, Taylan and Dvijotham, Krishnamurthy and Mann, Timothy and Kohli, Pushmeet},
  title = {Achieving Robustness in the Wild via Adversarial Mixing With Disentangled Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
QEBA: Query-Efficient Boundary-Based Blackbox Attack
Huichen Li, Xiaojun Xu, Xiaolu Zhang, Shuang Yang, Bo Li


Machine learning (ML) models, especially deep neural networks (DNNs), have been widely used in various applications, including several safety-critical ones (e.g. autonomous driving). As a result, recent research on adversarial examples has raised great concerns. Such adversarial attacks can be achieved by adding a small magnitude of perturbation to the input to mislead model prediction. While several whitebox attacks, which assume that the attackers have full access to the machine learning models, have demonstrated their effectiveness, blackbox attacks are more realistic in practice. In this paper, we propose a Query-Efficient Boundary-based blackbox Attack (QEBA) based only on the model's final prediction labels. We theoretically show why previous boundary-based attacks with gradient estimation on the whole gradient space are not efficient in terms of query numbers, and provide an optimality analysis for our dimension-reduction-based gradient estimation. On the other hand, we conduct extensive experiments on ImageNet and CelebA datasets to evaluate QEBA. We show that compared with the state-of-the-art blackbox attacks, QEBA is able to use a smaller number of queries to achieve a lower magnitude of perturbation with a 100% attack success rate. We also show case studies of attacks on real-world APIs including MEGVII Face++ and Microsoft Azure.
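With label-only access, the gradient at a boundary point has to be estimated from queries; QEBA's key change over prior boundary attacks is to draw the probing directions from a low-dimensional subspace (spatial, frequency, or learned) instead of the full pixel space. A NumPy sketch of that Monte-Carlo estimate follows, with is_adversarial, basis, and all parameters being illustrative stand-ins rather than the authors' implementation.

import numpy as np

def estimate_boundary_gradient(is_adversarial, x_boundary, basis, n_queries=100, delta=0.01, seed=0):
    # basis: (d, D) rows spanning a low-dimensional subspace of the D-dim image space.
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(x_boundary)
    for _ in range(n_queries):
        coeff = rng.standard_normal(basis.shape[0])
        direction = (coeff @ basis).reshape(x_boundary.shape)
        direction /= np.linalg.norm(direction) + 1e-12
        # Label-only query: does a tiny step in this direction stay adversarial?
        sign = 1.0 if is_adversarial(x_boundary + delta * direction) else -1.0
        grad += sign * direction
    return grad / (np.linalg.norm(grad) + 1e-12)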
[three, step, order, prediction, provide, work] [boundary, final, including, propose] [attack, adversarial, model, blackbox, query, decision, perturbation, qeba, apis, representative, face, success, original, xadv, api, access, move, attacking, xtgt, hsja, magnitude, example] [based, figure, proposed, frequency, ieee, mse, spatial, low, analysis, method] [image, component, perform, project, transformed, celeba] [gradient, subspace, number, arxiv, preprint, learning, dimension, imagenet, sample, space, data, reduction, sampling, rate, random, deep, efficient, online, set, machine, lower, large, cosine, compared, optimization] [estimation, intrinsic, pca, conference, estimate, estimated, basis, computer]
@InProceedings{Li_2020_CVPR,
  author = {Li, Huichen and Xu, Xiaojun and Zhang, Xiaolu and Yang, Shuang and Li, Bo},
  title = {QEBA: Query-Efficient Boundary-Based Blackbox Attack},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Simulate Dynamic Environments With GameGAN
Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, Sanja Fidler


Simulation is a crucial component of any robotic system. In order to simulate correctly, we need to write complex rules of the environment: how dynamic agents behave, and how the actions of each of the agents affect the behavior of others. In this paper, we aim to learn a simulator by simply watching an agent interact with an environment. We focus on graphics games as a proxy of the real environment. We introduce GameGAN, a generative model that learns to visually imitate a desired game by ingesting screenplay and keyboard actions during training. Given a key pressed by the agent, GameGAN "renders" the next screen using a carefully designed generative adversarial network. Our approach offers key advantages over existing work: we design a memory module that builds an internal map of the environment, allowing for the agent to return to previously visited locations with high visual consistency. In addition, GameGAN is able to disentangle static and dynamic components within an image making the behavior of the model more interpretable, and relevant for downstream tasks that require explicit reasoning over dynamic elements. This enables many interesting applications such as swapping different components of the game to build new games that do not exist. We will release the code and trained model, enabling human players to play generated games and their variations with our GameGAN.
[gamegan, engine, agent, environment, static, pacman, time, action, vizdoom, video, state, simulator, future, shift, hidden, reinforcement, temporal, behavior, work, frame, visual, playing] [module, location, map, challenging, fed, propose] [model, adversarial, trained, game, access] [dynamic, figure, simply, version, simulate] [image, real, generated, consistency, learns, produce, loss, learn, generative, introduce, disentangle, gan, realistic, generating, discriminator, cycle, swapping, generate, conditional, content] [memory, learning, training, neural, vector, learned, number, stochastic, task, arxiv, preprint, random, note, test, higher, forward, processing] [rendering, conference, initial, international, consistent, simulated, left, complex, require]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Seung Wook and Zhou, Yuhao and Philion, Jonah and Torralba, Antonio and Fidler, Sanja},
  title = {Learning to Simulate Dynamic Environments With GameGAN},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learn2Perturb: An End-to-End Feature Perturbation Learning to Improve Adversarial Robustness
Ahmadreza Jeddi, Mohammad Javad Shafiee, Michelle Karg, Christian Scharfenberger, Alexander Wong


While deep neural networks have been achieving state-of-the-art performance across a wide variety of applications, their vulnerability to adversarial attacks limits their widespread deployment for safety-critical applications. Alongside other adversarial defense approaches being investigated, there has been a very recent interest in improving adversarial robustness in deep neural networks through the introduction of perturbations during the training process. However, such methods leverage fixed, pre-defined perturbations and require significant hyper-parameter tuning that makes them very difficult to leverage in a general fashion. In this study, we introduce Learn2Perturb, an end-to-end feature perturbation learning approach for improving the adversarial robustness of deep neural networks. More specifically, we introduce novel perturbation-injection modules that are incorporated at each layer to perturb the feature space and increase uncertainty in the network. This feature perturbation is performed at both the training and the inference stages. Furthermore, inspired by the Expectation-Maximization, an alternating back-propagation training algorithm is introduced to train the network and noise parameters consecutively. Experimental results on CIFAR-10 and CIFAR-100 datasets show that the proposed Learn2Perturb method can result in deep neural networks which are 4-7% more robust on l_inf FGSM and PGD adversarial attacks and significantly outperforms the state-of-the-art against l_2 C&W attack and a wide range of well-known black-box attacks.
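The perturbation-injection idea amounts to adding noise with a learnable scale to intermediate feature maps, kept active at inference as well and trained alternately with the network weights. A hedged PyTorch sketch of one such module is below; the Gaussian distribution, per-channel parameterization, and initialization are assumptions, not the paper's exact design.

import math
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    # Adds zero-mean Gaussian noise with a learnable per-channel std to a feature map.
    # Kept active at inference time as well, following the feature-perturbation idea.
    def __init__(self, channels, init_std=0.1):
        super().__init__()
        self.log_std = nn.Parameter(torch.full((1, channels, 1, 1), math.log(init_std)))

    def forward(self, x):
        return x + torch.exp(self.log_std) * torch.randn_like(x)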
[provide, outperforms, step, illustrated] [feature, table, framework, effectiveness, module] [adversarial, robustness, noise, model, pgd, perturbation, attack, input, fgsm, trained, robust, improve, injection, defense, pni, experimental, clean, improving, technique, evaluating] [proposed, method, based, output, introduced, ieee, utilized, figure, performed] [loss, competing, randomization, learn, surrogate] [network, training, neural, deep, learning, function, alternating, distribution, arxiv, preprint, algorithm, gradient, updated, compared, data, random, process, number, regularization, set, performance, increase, inference, descent, architecture, regularizer, evaluate, layer] [approach, conference, computer, uncertainty]
@InProceedings{Jeddi_2020_CVPR,
  author = {Jeddi, Ahmadreza and Shafiee, Mohammad Javad and Karg, Michelle and Scharfenberger, Christian and Wong, Alexander},
  title = {Learn2Perturb: An End-to-End Feature Perturbation Learning to Improve Adversarial Robustness},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SDFDiff: Differentiable Rendering of Signed Distance Fields for 3D Shape Optimization
Yue Jiang, Dantong Ji, Zhizhong Han, Matthias Zwicker


We propose SDFDiff, a novel approach for image-based shape optimization using differentiable rendering of 3D shapes represented by signed distance functions (SDFs). Compared to other representations, SDFs have the advantage that they can represent shapes with arbitrary topology, and that they guarantee watertight surfaces. We apply our approach to the problem of multi-view 3D reconstruction, where we achieve high reconstruction quality and can capture complex topology of 3D objects. In addition, we employ a multi-resolution strategy to obtain a robust optimization algorithm. We further demonstrate that our SDF-based differentiable renderer can be integrated with deep learning models, which opens up options for learning approaches on 3D objects without 3D supervision. In particular, we apply our method to single-view 3D reconstruction and achieve state-of-the-art results.
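A differentiable SDF renderer boils down to ray marching (sphere tracing), where every step queries the signed distance, so gradients can flow from pixel errors back to the SDF values. The toy sketch below uses an analytic sphere instead of the paper's grid-based SDF and omits shading; names and step counts are illustrative.

import torch

def sphere_sdf(p, radius=0.5):
    # Signed distance to a sphere at the origin; stands in for a learned/voxelized SDF.
    return torch.linalg.norm(p, dim=-1) - radius

def sphere_trace(sdf, origins, dirs, n_steps=64, eps=1e-4):
    # origins, dirs: (N, 3) rays; march each ray by the queried distance.
    t = torch.zeros(origins.shape[0])
    for _ in range(n_steps):
        d = sdf(origins + t[:, None] * dirs)
        t = t + torch.clamp(d, min=0.0)
    hits = origins + t[:, None] * dirs
    return hits, sdf(hits).abs() < eps   # surface points and a convergence mask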
[represent, step, current, recognition, automatic] [object, level, apply, framework] [ray, input, model, quality] [figure, resolution, ieee, pattern, based, method, inverse, pixel, proposed, comparison] [loss, image, target, representation, perform, arbitrary] [learning, optimization, neural, deep, function, set, network, gradient, processing] [differentiable, sdf, reconstruction, point, shape, computer, surface, distance, conference, rendering, camera, intersection, vision, sdfs, approach, renderer, signed, sphere, local, view, continuous, reconstruct, michael, geometry, tracing, watertight, topology, implicit, single, smvs, initial, voxel, rendered, casting, compute, trilinear, international, novel, complex, scene, mesh, limg, shading, grid, bunny]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Yue and Ji, Dantong and Han, Zhizhong and Zwicker, Matthias},
  title = {SDFDiff: Differentiable Rendering of Signed Distance Fields for 3D Shape Optimization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Through the Looking Glass: Neural 3D Reconstruction of Transparent Shapes
Zhengqin Li, Yu-Ying Yeh, Manmohan Chandraker


Recovering the 3D shape of transparent objects using a small number of unconstrained natural images is an ill-posed problem. Complex light paths induced by refraction and reflection have prevented both traditional and deep multiview stereo from solving this challenge. We propose a physically-based network to recover 3D shape of transparent objects using a few images acquired with a mobile phone camera, under a known but arbitrary environment map. Our novel contributions include a normal representation that enables the network to model complex light transport through local computation, a rendering layer that models refractions and reflections, a cost volume specifically designed for normal refinement of transparent shapes and a feature mapping based on predicted normals for 3D point cloud reconstruction. We render a synthetic dataset to encourage the model to learn refractive light transport across different views. Our experiments show successful recovery of high-quality 3D geometry for complex transparent shapes using as few as 5-12 natural images. Code and data will be publicly released.
[visual, environment, natural, build, phone] [map, feature, propose, predicted, represents, object, table, mask] [input, ray, model, controlled, unconstrained, internal] [figure, light, reflection, method, june, based, formation] [loss, image, real, synthetic, arbitrary, mapping, latent, matting] [network, layer, number, deep, total, sampled, set, compared, small, basic, function, mobile, better, learning, training] [reconstruction, transparent, normal, shape, hull, rendering, point, cloud, view, volume, refractive, rendered, error, cost, reconstruct, reconstructed, acm, novel, surface, compute, supplementary, nearest, complex, differentiable, render, multiview, geometry, full, initial, second, chamfer, refraction, stereo, estimation, uniformly, distance]
@InProceedings{Li_2020_CVPR,
  author = {Li, Zhengqin and Yeh, Yu-Ying and Chandraker, Manmohan},
  title = {Through the Looking Glass: Neural 3D Reconstruction of Transparent Shapes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TextureFusion: High-Quality Texture Acquisition for Real-Time RGB-D Scanning
Joo Ho Lee, Hyunho Ha, Yue Dong, Xin Tong, Min H. Kim


Real-time RGB-D scanning techniques have become widely used to progressively scan objects with a hand-held sensor. Existing online methods restore color information per voxel, and thus their quality is often limited by the tradeoff between spatial resolution and time performance. Also, such methods often suffer from blurred artifacts in the captured texture. Traditional offline texture mapping methods with non-rigid warping assume that the reconstructed geometry and all input views are obtained in advance, and the optimization takes a long time to compute mesh parameterization and warp parameters, which prevents them from being used in real-time applications. In this work, we propose a progressive texture-fusion method specially designed for real-time RGB-D scanning. To this end, we first devise a novel texture-tile voxel grid, where texture tiles are embedded in the voxel grid of the signed distance function, allowing for high-resolution texture mapping on the low-resolution geometry volume. Instead of using expensive mesh parameterization, we associate vertices of implicit geometry directly with texture coordinates. Second, we introduce real-time texture warping that applies a spatially-varying perspective mapping to input images so that texture warping efficiently mitigates the mismatch between the intermediate geometry and the current input view. It allows us to enhance the quality of texture over time while updating the geometry in real-time. The results demonstrate that the quality of our real-time texture mapping is highly competitive with that of exhaustive offline texture warping methods. Our method is also capable of being integrated into existing RGB-D scanning frameworks.
[current, time, frame, unit, three] [global, map, main, interactive] [input, quality, offline, blending, zhou, model] [method, color, warping, figure, warp, resolution, existing, motion, spatial, cell, affine, fusion, ieee, proposed, pixel, comparison, flow, field, window, captured, intermediate, integrated] [texture, image, mapping, representation, mismatch, progressively, atlas, consists] [optimization, online, update, space, size, achieve, efficient, tradeoff, performance, weight, parameterization] [geometry, voxel, camera, perspective, local, surface, grid, scanning, depth, tile, reconstruction, correspondence, acm, mesh, pose, estimate, view, canonical, distance, point, scene, novel, allows, register, registration, compute, implicit, structure, transformation, computer]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Joo Ho and Ha, Hyunho and Dong, Yue and Tong, Xin and Kim, Min H.},
  title = {TextureFusion: High-Quality Texture Acquisition for Real-Time RGB-D Scanning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry
Nan Yang, Lukas von Stumberg, Rui Wang, Daniel Cremers


We propose D3VO as a novel framework for monocular visual odometry that exploits deep networks on three levels -- deep depth, pose and uncertainty estimation. We first propose a novel self-supervised monocular depth estimation network trained on stereo videos without any external supervision. In particular, it aligns the training image pairs into similar lighting condition with predictive brightness transformation parameters. Besides, we model the photometric uncertainties of pixels on the input images, which improves the depth estimation accuracy and provides a learned weighting function for the photometric residuals in direct (feature-less) visual odometry. Evaluation results show that the proposed network outperforms state-of-the-art self-supervised depth estimation networks. D3VO tightly incorporates the predicted depth, pose and uncertainty into a direct visual odometry method to boost both the front-end tracking as well as the back-end non-linear optimization. We evaluate D3VO in terms of monocular visual odometry on both the KITTI odometry benchmark and the EuRoC MAV dataset. The results show that D3VO outperforms state-of-the-art traditional monocular VO methods by a large margin. It also achieves comparable results to state-of-the-art stereo/LiDAR odometry on KITTI and to the state-of-the-art visual-inertial odometry on EuRoC MAV, while using only a single camera.
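The self-supervised depth network is trained with a photometric residual in which predicted affine brightness parameters align exposure between the training pair, and a predicted per-pixel uncertainty down-weights pixels that violate brightness constancy. A hedged sketch of such a loss (standard aleatoric-style weighting, not necessarily the paper's exact form) follows.

import torch

def weighted_photometric_loss(target, warped, a, b, sigma):
    # target, warped: images; a, b: predicted affine brightness parameters;
    # sigma: predicted per-pixel uncertainty (> 0).
    residual = torch.abs(target - (a * warped + b))
    # Large sigma down-weights unreliable pixels; log(sigma) keeps sigma from growing freely.
    return (residual / sigma + torch.log(sigma)).mean()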
[visual, evaluation, temporal, dataset] [predicted, table, propose, tracking, map, achieves, framework, benchmark] [difficult, trained, model] [ieee, brightness, proposed, pattern, method, motion, traditional, illumination, comparison, scale, constancy, deliver, figure, high, based] [image, unsupervised, factor] [deep, network, learning, performance, neural, training, set, arxiv, preprint, learned, comparable, predictive, function, better, note] [depth, monocular, conference, estimation, odometry, stereo, uncertainty, computer, pose, photometric, direct, international, euroc, vision, kitti, transformation, mav, camera, robotics, daniel, well, automation, dso, thomas, single, point, sparse, predicts, term, dvso, estimated, relative, error, virtual, vio]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Nan and Stumberg, Lukas von and Wang, Rui and Cremers, Daniel},
  title = {D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Implicit Volume Compression
Danhang Tang, Saurabh Singh, Philip A. Chou, Christian Hane, Mingsong Dou, Sean Fanello, Jonathan Taylor, Philip Davidson, Onur G. Guleryuz, Yinda Zhang, Shahram Izadi, Andrea Tagliasacchi, Sofien Bouaziz, Cem Keskin


We describe a novel approach for compressing truncated signed distance fields (TSDF) stored in 3D voxel grids, and their corresponding textures. To compress the TSDF, our method relies on a block-based neural network architecture trained end-to-end, achieving a state-of-the-art rate-distortion trade-off. To prevent topological errors, we losslessly compress the signs of the TSDF, which also upper-bounds the reconstruction error by the voxel size. To compress the corresponding texture, we designed a fast block-based UV parameterization, generating coherent texture maps that can be effectively compressed using existing video compression algorithms. We demonstrate the performance of our algorithms on two 4D performance capture datasets, reducing bitrate by 66% for the same distortion, or alternatively reducing the distortion by 50% for the same bitrate, compared to the state-of-the-art.
[video, dataset, sign, sequence, order] [philip, propose] [model, distortion, quality] [compression, figure, ieee, block, method, compressed, compress, morton, mpeg, losslessly, bitrate, draco, prior, coding, high, receiver, raw, spatial, chart, sender] [image, texture, encoder, conditional, representation, sean, corresponding, mapping] [rate, size, better, distribution, entropy, data, neural, learning, learned, deep, network, performance, training, number, higher, probability, average, compressing, architecture, processing] [point, geometry, tsdf, computer, cloud, volumetric, volume, reconstruction, surface, mesh, acm, distance, error, parametrization, tang, ground, truth, voxel, novel, topology, vision, marching, conference, implicit, mingsong, approach, reconstructed, supplementary, david, shahram]
@InProceedings{Tang_2020_CVPR,
  author = {Tang, Danhang and Singh, Saurabh and Chou, Philip A. and Hane, Christian and Dou, Mingsong and Fanello, Sean and Taylor, Jonathan and Davidson, Philip and Guleryuz, Onur G. and Zhang, Yinda and Izadi, Shahram and Tagliasacchi, Andrea and Bouaziz, Sofien and Keskin, Cem},
  title = {Deep Implicit Volume Compression},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MAGSAC++, a Fast, Reliable and Accurate Robust Estimator
Daniel Barath, Jana Noskova, Maksym Ivashechkin, Jiri Matas


We propose MAGSAC++ and the Progressive NAPSAC sampler, P-NAPSAC in short. In MAGSAC++, we replace the model quality and polishing functions of the original method by an iteratively re-weighted least-squares fitting with weights determined via marginalizing over the noise scale. MAGSAC++ is fast -- often an order of magnitude faster -- and more geometrically accurate than MAGSAC. P-NAPSAC merges the advantages of local and global sampling by drawing samples from gradually growing neighborhoods. Exploiting the fact that nearby points are more likely to originate from the same geometric model, P-NAPSAC finds local structures earlier than global samplers. We show that the progressive spatial sampling in P-NAPSAC can be integrated with PROSAC sampling, which is applied to the first, location-defining, point. The methods are tested on homography and fundamental matrix fitting on six publicly available datasets. MAGSAC++ combined with the P-NAPSAC sampler is superior to state-of-the-art robust estimators in terms of speed, accuracy and failure rate.
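The quality/polishing step is an iteratively re-weighted least-squares (IRLS) fit: fit a model, compute residuals, re-weight, and repeat. The sketch below shows that loop for a 2D line with a simple Cauchy-style weight; the actual MAGSAC++ weights come from marginalizing over the noise scale and are not reproduced here.

import numpy as np

def irls_line_fit(points, n_iters=10, scale=1.0):
    # points: (N, 2); fit y = m * x + c by iteratively re-weighted least squares.
    x, y = points[:, 0], points[:, 1]
    A = np.stack([x, np.ones_like(x)], axis=1)
    w = np.ones_like(x)
    for _ in range(n_iters):
        m, c = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)[0]
        residuals = np.abs(y - (m * x + c))
        w = 1.0 / (1.0 + (residuals / scale) ** 2)   # Cauchy-style stand-in weights
    return m, c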
[failure, dataset, time, ith] [threshold, global, propose, faster] [model, robust, quality, datasets, tested, noise, input] [proposed, homography, method, fast, pattern, figure] [image, progressive, drawn, corresponding] [function, sample, number, sampling, set, matrix, selected, data, algorithm, random, calculated, probability, sampler, accuracy, parameter, processing, weight, size, best, sgd, machine, ratio] [point, local, fundamental, ransac, inlier, error, computer, magsac, fitting, napsac, vision, accurate, estimation, prosac, localized, minimal, msac, inliers, conference, international, lmeds, termination, geometric, estimated, marginalizing, nearest, kitti, median, iteratively, estimating, evd, rmse]
@InProceedings{Barath_2020_CVPR,
  author = {Barath, Daniel and Noskova, Jana and Ivashechkin, Maksym and Matas, Jiri},
  title = {MAGSAC++, a Fast, Reliable and Accurate Robust Estimator},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
OctSqueeze: Octree-Structured Entropy Model for LiDAR Compression
Lila Huang, Shenlong Wang, Kelvin Wong, Jerry Liu, Raquel Urtasun


We present a novel deep compression algorithm to reduce the memory footprint of LiDAR point clouds. Our method exploits the sparsity and structural redundancy between points to reduce the bitrate. Towards this goal, we first encode the point cloud into an octree, a data-efficient structure suitable for sparse point clouds. We then design a tree-structured conditional entropy model that can be directly applied to octree structures to predict the probability of a symbol's occurrence. We validate the effectiveness of our method over two large-scale datasets. The results demonstrate that our approach reduces the bitrate by 10-20% at the same reconstruction quality, compared to the previous state-of-the-art. Importantly, we also show that for the same bitrate, our approach outperforms other compression algorithms when performing downstream 3D segmentation and detection tasks using compressed representations. This helps advance the feasibility of using point cloud compression to reduce the onboard and offboard storage for safety-critical applications such as self-driving cars, where a single vehicle captures 84 billion points per day.
[node, context, encode, video, downstream, encoding, long, perception, hidden] [lidar, object, anchor, iou, feature, semantic, segmentation, raquel, detection, parent, van, final] [model, input, quality] [compression, ieee, bitrate, range, draco, pattern, psnr, cvpr, method, mpeg, june, coding, convolutional, proposed, symbol, prior, tree, raw, serialized, northamerica, salt, lake, sensor, figure] [image, representation, conditional, train] [entropy, deep, data, neural, learning, reduce, evaluate, probability, lower, note, performance] [point, computer, octree, conference, cloud, vision, structure, occupancy, international, reconstruction, october, approach, well, depth, sparse, geometry, represented, full, kitti]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Lila and Wang, Shenlong and Wong, Kelvin and Liu, Jerry and Urtasun, Raquel},
  title = {OctSqueeze: Octree-Structured Entropy Model for LiDAR Compression},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
4D Association Graph for Realtime Multi-Person Motion Capture Using Multiple Video Cameras
Yuxiang Zhang, Liang An, Tao Yu, Xiu Li, Kun Li, Yebin Liu


This paper contributes a novel realtime multi-person motion capture algorithm using multiview video inputs. Due to the heavy occlusions and closely interacting motions in each view, joint optimization over the multiview images and multiple temporal frames is indispensable, which brings up the essential challenge of realtime efficiency. To this end, for the first time, we unify per-view parsing, cross-view matching, and temporal tracking into a single optimization framework, i.e., a 4D association graph in which each dimension (image space, viewpoint and time) can be treated equally and simultaneously. To solve the 4D association graph efficiently, we further contribute the idea of 4D limb bundle parsing based on heuristic searching, followed by limb bundle assembling using a proposed bundle Kruskal's algorithm. Our method enables a realtime motion capture system running at 30 fps using 5 cameras on a 5-person scene. Benefiting from the unified parsing, matching and tracking constraints, our method is robust to noisy detections due to severe occlusions and closely interacting motions, and achieves high-quality online pose reconstruction. The proposed method outperforms state-of-the-art methods quantitatively without using high-level appearance information.
[graph, temporal, skeleton, multiple, dataset, frame, connecting, current, correct, step, previous, video, contribute, three, sequential] [association, tracking, parsing, edge, bernt, false, unified, cnn, crowded, panoptic, detected] [clique, robust, input, quality, model] [motion, method, based, high, figure, comparison, proposed, severe] [person, image, real] [optimization, function, objective, close, note, problem, data, share, algorithm, performance] [limb, pose, human, joint, capture, realtime, view, estimation, bundle, matching, single, body, multiview, assembling, system, solving, shelf, interacting, markerless, defined, mykhaylo, ground, truth, enables, scene, form, pictorial, reconstructed]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Yuxiang and An, Liang and Yu, Tao and Li, Xiu and Li, Kun and Liu, Yebin},
  title = {4D Association Graph for Realtime Multi-Person Motion Capture Using Multiple Video Cameras},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Upgrading Optical Flow to 3D Scene Flow Through Optical Expansion
Gengshan Yang, Deva Ramanan


We describe an approach for upgrading 2D optical flow to 3D scene flow. Our key insight is that dense optical expansion - which can be reliably inferred from monocular frame pairs - reveals changes in depth of scene elements, e.g., things moving closer will get bigger. When integrated with camera intrinsics, optical expansion can be converted into normalized 3D scene flow vectors that provide meaningful directions of 3D movement, but not their magnitude (due to an underlying scale ambiguity). Normalized scene flow can be further "upgraded" to the true 3D scene flow given the depth in one frame. We show that dense optical expansion between two views can be learned from annotated optical flow maps or unlabeled video sequences, and applied to a variety of dynamic 3D perception tasks including optical scene flow, LiDAR scene flow, time-to-collision estimation and depth estimation, often demonstrating significant improvement over the prior art.
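Once dense optical expansion is known, upgrading 2D flow to a 3D displacement is just geometry: the expansion gives the ratio of depths between frames, and the camera intrinsics back-project the two pixel positions. A per-pixel NumPy sketch follows; the relation Z2 = Z1 / expansion and all names are an illustrative reading of the abstract, not the authors' exact formulation.

import numpy as np

def scene_flow_from_expansion(x, y, u, v, expansion, depth1, K):
    # (x, y): pixel position in frame 1; (u, v): optical flow; K: 3x3 intrinsics.
    K_inv = np.linalg.inv(K)
    p1 = depth1 * (K_inv @ np.array([x, y, 1.0]))
    depth2 = depth1 / expansion           # objects moving closer (expansion > 1) get bigger
    p2 = depth2 * (K_inv @ np.array([x + u, y + v, 1.0]))
    return p2 - p1                        # 3D displacement of the scene point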
[frame, perception, work, visual, prediction, predict, relationship, extract, time, moving] [lidar, object, propose, focus, feature, table] [change, true, strong, input, model] [optical, flow, expansion, motion, method, scale, affine, reference, prior, dynamic, pixel, prsm, disparity, figure, osf, avoidance] [image, train, target, notice] [normalized, learning, network, validation, scaled, architecture, better, large, size, training, set, deep, indicates] [scene, depth, estimation, monocular, camera, stereo, point, local, error, dense, rigid, directly, relative, kitti, correspondence, projection, michael, estimating, reconstruction, plane, orthographic, compute, computed, geometry, matching, initial, triangulation, approach, collision, estimate, sparse, derive]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Gengshan and Ramanan, Deva},
  title = {Upgrading Optical Flow to 3D Scene Flow Through Optical Expansion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Robust 3D Self-Portraits in Seconds
Zhe Li, Tao Yu, Chuanyu Pan, Zerong Zheng, Yebin Liu


In this paper, we propose an efficient method for robust 3D self-portraits using a single RGBD camera. Benefiting from the proposed PIFusion and lightweight bundle adjustment algorithm, our method can generate detailed 3D self-portraits in seconds and shows the ability to handle subjects wearing extremely loose clothes. To achieve highly efficient and robust reconstruction, we propose PIFusion, which combines learning-based 3D recovery with volumetric non-rigid fusion to generate accurate sparse partial scans of the subject. Moreover, a non-rigid volumetric deformation method is proposed to continuously refine the learned shape prior. Finally, a lightweight bundle adjustment algorithm is proposed to guarantee that all the partial scans can not only "loop" with each other but also remain consistent with the selected live key observations. The results and experiments show that the proposed method achieves more robust and efficient 3D self-portraits compared with state-of-the-art methods.
[frame, recognition, sequence] [key, tracking, mask, propose, map, fuse, improves] [live, model, input, robust, subject] [adjustment, method, reference, ieee, fusion, proposed, warp, lightweight, fused, pattern, figure, warped, color, comparison, based] [generate, image, misalignment, portrait] [inner, accuracy, function, large, performance, efficient, energy, optimization, number, layer, algorithm] [partial, bundle, depth, accurate, human, computer, vision, single, body, reconstruction, term, conference, deformation, volumetric, shape, rgbd, silhouette, pifusion, surface, loop, mesh, error, joint, detailed, point, implicit, capture, defined, vertex, scanning, geometry, system, scan, esmooth, tsdf, normal, international, rgb, smooth]
@InProceedings{Li_2020_CVPR,
  author = {Li, Zhe and Yu, Tao and Pan, Chuanyu and Zheng, Zerong and Liu, Yebin},
  title = {Robust 3D Self-Portraits in Seconds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FastDVDnet: Towards Real-Time Deep Video Denoising Without Flow Estimation
Matias Tassano, Julie Delon, Thomas Veit


In this paper, we propose a state-of-the-art video denoising algorithm based on a convolutional neural network architecture. Until recently, video denoising with neural networks had been a largely underexplored domain, and existing methods could not compete with the performance of the best patch-based methods. The approach we introduce in this paper, called FastDVDnet, shows similar or better performance than other state-of-the-art competitors with significantly lower computing times. In contrast to other existing neural network denoisers, our algorithm exhibits several desirable properties such as fast runtimes, and the ability to handle a wide range of noise levels with a single network model. The characteristics of its architecture make it possible to avoid using a costly motion compensation stage while achieving excellent performance. The combination between its denoising performance and lower computational load makes this algorithm attractive for practical denoising applications. We compare our method with different state-of-the-art algorithms, both visually and with respect to objective quality metrics.
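The key architectural property is a two-stage cascade over a five-frame window: a shared denoising block first processes three overlapping frame triplets, and a second block fuses the three intermediate outputs, implicitly handling motion without any flow estimation. Below is a heavily simplified PyTorch sketch in which the tiny convolutional DenoiseBlock stands in for the paper's U-Net-style blocks.

import torch
import torch.nn as nn

class DenoiseBlock(nn.Module):
    # Tiny stand-in for a U-Net-style block: three frames + a noise map in, one frame out.
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * channels + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, f0, f1, f2, noise_map):
        return self.net(torch.cat([f0, f1, f2, noise_map], dim=1))

class TwoStageVideoDenoiser(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.stage1 = DenoiseBlock(channels)   # shared across the three frame triplets
        self.stage2 = DenoiseBlock(channels)

    def forward(self, frames, noise_map):
        # frames: list of 5 tensors (B, C, H, W); no optical flow is ever computed.
        d0 = self.stage1(frames[0], frames[1], frames[2], noise_map)
        d1 = self.stage1(frames[1], frames[2], frames[3], noise_map)
        d2 = self.stage1(frames[2], frames[3], frames[4], noise_map)
        return self.stage2(d0, d1, d2, noise_map)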
[video, temporal, frame, recognition, composed, three, observed, step, order, den] [faster, map, cnn, davis, feature] [noise, denoise, trained, quality, model, input, digital, clean, magnitude] [denoising, motion, fastdvdnet, ieee, dvdnet, flow, pattern, convolutional, block, figure, fast, proposed, vnlnet, residual, comparison, based, compensation, method, vnlb, cascaded, output, restoration, high, running, videnn, optical, erroneous, gaussian, color, clipped] [image, modified, jan] [algorithm, neural, deep, architecture, training, performance, network, learning, best, processing, respect, number, set, reduction, paper, size] [computer, conference, vision, estimation, thomas, handle, supplementary]
@InProceedings{Tassano_2020_CVPR,
  author = {Tassano, Matias and Delon, Julie and Veit, Thomas},
  title = {FastDVDnet: Towards Real-Time Deep Video Denoising Without Flow Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Have an Ear for Face Super-Resolution
Givi Meishvili, Simon Jenni, Paolo Favaro


We propose a novel method to use both audio and a low-resolution image to perform extreme face super-resolution (a 16x increase of the input size). When the resolution of the input image is very low (e.g., 8x8 pixels), the loss of information is so dire that important details of the original identity have been lost and audio can aid the recovery of a plausible high-resolution image. In fact, audio carries information about facial attributes, such as gender and age. To combine the aural and visual modalities, we propose a method to first build the latent representations of a face from the lone audio track and then from the lone low-resolution image. We then train a network to fuse these two representations. We show experimentally that audio can assist in recovering attributes such as the gender, the age and the identity, and thus improve the correctness of the high-resolution image reconstruction process. Our procedure does not make use of human annotation and thus can be easily trained with existing video datasets. Moreover, we show that our model builds a factorized representation of images and audio as it allows one to mix low-resolution images and audio from different videos and to generate realistic faces with semantically meaningful combinations.
[audio, recognition, visual, dataset, video, attention] [ablation, table, track, propose, extreme] [face, model, trained, identity, age, input, adversarial, facial, acc] [ieee, fusion, pattern, june, resolution, high, september, method, xli, figure, residual, reference, based, output, column, lowresolution, highresolution, signal, psnr, ssim, aural, superresolution] [image, encoder, latent, generator, gender, corresponding, train, encoders, learn, representation, generative, loss, perform, factor, mapping] [training, network, set, learning, performance, deep, accuracy, space, fixed, problem, classifier, test, open, neural] [vision, computer, conference, single, international, european, october, reconstruction, second, limited]
@InProceedings{Meishvili_2020_CVPR,
  author = {Meishvili, Givi and Jenni, Simon and Favaro, Paolo},
  title = {Learning to Have an Ear for Face Super-Resolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Optics for Single-Shot High-Dynamic-Range Imaging
Christopher A. Metzler, Hayato Ikoma, Yifan Peng, Gordon Wetzstein


High-dynamic-range (HDR) imaging is crucial for many applications. Yet, acquiring HDR images with a single shot remains a challenging problem. Whereas modern deep learning approaches are successful at hallucinating plausible HDR content from a single low-dynamic-range (LDR) image, saturated scene details often cannot be faithfully recovered. Inspired by recent deep optical imaging approaches, we interpret this problem as jointly training an optical encoder and electronic decoder where the encoder is parameterized by the point spread function (PSF) of the lens, the bottleneck is the sensor with a limited dynamic range, and the decoder is a convolutional neural network (CNN). The lens surface is then jointly optimized with the CNN in a training phase; we fabricate this optimized optical element and attach it as a hardware add-on to a conventional camera during inference. In extensive simulations and with a physical prototype, we demonstrate that this end-to-end deep optical imaging approach to single-shot HDR imaging outperforms both purely CNN-based approaches and other PSF engineering approaches.
[element, work, multiple, video] [cnn, challenging] [model, profile, trained, example, creates] [hdr, optical, psf, imaging, ldr, dynamic, range, sensor, lens, optimized, high, figure, doe, captured, saturated, convolutional, light, proposed, spread, conventional, pixel, coded, psnr, wolfgang, optically, ieee, photography, diffractive, exposure, phase, field, color, aperture, recover, motion, fabricated, spatially] [image, loss] [deep, training, function, network, computational, neural, filter, learning, processing, problem, set, scaled] [acm, camera, single, approach, depth, point, reconstruction, scene, system, surface, varying, simulated, jointly, well, directly, gordon, limited, wave, demonstrate, capture]
@InProceedings{Metzler_2020_CVPR,
  author = {Metzler, Christopher A. and Ikoma, Hayato and Peng, Yifan and Wetzstein, Gordon},
  title = {Deep Optics for Single-Shot High-Dynamic-Range Imaging},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Rank-1 Diffractive Optics for Single-Shot High Dynamic Range Imaging
Qilin Sun, Ethan Tseng, Qiang Fu, Wolfgang Heidrich, Felix Heide


High-dynamic range (HDR) imaging is an essential imaging modality for a wide range of applications in uncontrolled environments, including autonomous driving, robotics, and mobile phone cameras. However, existing HDR techniques in commodity devices struggle with dynamic scenes due to multi-shot acquisition and post-processing time, e.g. mobile phone burst photography, making such approaches unsuitable for real-time applications. In this work, we propose a method for snapshot HDR imaging by learning an optical HDR encoding in a single image which maps saturated highlights into neighboring unsaturated areas using a diffractive optical element (DOE). We propose a novel rank-1 parameterization of the proposed DOE which avoids a vast number of trainable parameters and preserves the encoding of high frequencies compared with conventional end-to-end design methods. We further propose a reconstruction network tailored to this rank-1 parametrization for recovery of clipped information from the encoded measurements. The proposed end-to-end framework is validated through simulation and real-world experiments and improves the PSNR by more than 7 dB over state-of-the-art end-to-end designs.
[encoding, prediction, time, work, encode] [height, map, feature, final, propose] [model, splitting] [hdr, range, dynamic, high, doe, optical, psf, imaging, unsaturated, saturated, sensor, residual, ieee, figure, ldr, method, phase, proposed, pixel, exposure, wolfgang, captured, convolution, snapshot, light, tone, felix, low, inverse, reference, lens, diffractive, night, existing, star, glare, field, optimized, formation] [image, loss, content, source] [network, deep, design, learning, layer, computational, processing, optimization] [reconstruction, single, conference, acm, computer, point, depth, international, capture, approach, camera, supplemental, vision, measurement, local, allows, allow, varying, refer]
@InProceedings{Sun_2020_CVPR,
  author = {Sun, Qilin and Tseng, Ethan and Fu, Qiang and Heidrich, Wolfgang and Heide, Felix},
  title = {Learning Rank-1 Diffractive Optics for Single-Shot High Dynamic Range Imaging},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep White-Balance Editing
Mahmoud Afifi, Michael S. Brown


We introduce a deep learning approach to realistically edit an sRGB image's white balance. Cameras capture sensor images that are rendered by their integrated signal processor (ISP) to a standard RGB (sRGB) color space encoding. The ISP rendering begins with a white-balance procedure that is used to remove the color cast of the scene's illumination. The ISP then applies a series of nonlinear color manipulations to enhance the visual quality of the final sRGB image. Recent work by [3] showed that sRGB images that were rendered with the incorrect white balance cannot be easily corrected due to the ISP's nonlinear rendering. The work in [3] proposed a k-nearest neighbor (KNN) solution based on tens of thousands of image pairs. We propose to solve this problem with a deep neural network (DNN) architecture trained in an end-to-end manner to learn the correct white balance. Our DNN maps an input image to two additional white-balance settings corresponding to indoor and outdoor illuminations. Our solution not only is more accurate than the KNN approach in terms of correcting a wrong white-balance setting but also provides the user the freedom to edit the white balance in the sRGB image to other illumination settings.
[dataset, correct, work, three, decoder, goal, provide] [framework, final, table] [input, correction, model, manipulation, datasets, trained, dnn, white, appear, incorrect, nonlinear, original, testing] [color, srgb, result, method, incandescent, awb, proposed, output, version, captured, figure, illumination, convolutional, isp, based, range, comparison, emulator, ieee, mahmoud, performed] [image, shade, target, editing, corresponding, produce, encoder, mapping, fivek, edit, user, consists, qualitative, photo, unprocessed] [set, setting, training, deep, function, applied, architecture, process, learning, selected, network, task, space] [rendered, camera, ground, truth, additional, michael, polynomial, error, rendering, single, estimation, allows, scene]
@InProceedings{Afifi_2020_CVPR,
  author = {Afifi, Mahmoud and Brown, Michael S.},
  title = {Deep White-Balance Editing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Non-Line-of-Sight Surface Reconstruction Using the Directional Light-Cone Transform
Sean I. Young, David B. Lindell, Bernd Girod, David Taubman, Gordon Wetzstein


We propose a joint albedo-normal approach to non-line-of-sight (NLOS) surface reconstruction using the directional light-cone transform (D-LCT). While current NLOS imaging methods reconstruct either the albedo or surface normals of the hidden scene, the two quantities provide complementary information of the scene, so an efficient method to estimate both simultaneously is desirable. We formulate the recovery of the two quantities as a vector deconvolution problem, and solve it via Cholesky-Wiener decomposition. We demonstrate that surfaces fitted non-parametrically using our recovered normals are more accurate than those produced with NLOS surface reconstruction methods recently proposed, and are 1,000 times faster to compute than using inverse rendering.
[hidden] [object, background, faster] [model, cholesky, poisson, university] [imaging, figure, inverse, light, method, transform, fourier, deconvolution, based, recover, spatial, intensity, captured, mae, sensing] [produce] [problem, computational, stanford, vector, complexity, matrix, minimize, cosine, linear, typically] [surface, albedo, lct, directional, reconstruction, nlos, normal, scene, transient, depth, david, approach, reconstruct, solution, solve, gordon, visible, volume, matthew, recovered, rendering, volumetric, geometry, error, acm, estimate, compute, confocal, recovering, fermat, recovery, system, phasor, diffuse, rmse, tsai, trans, joint, laser, sight, well, allowing]
@InProceedings{Young_2020_CVPR,
  author = {Young, Sean I. and Lindell, David B. and Girod, Bernd and Taubman, David and Wetzstein, Gordon},
  title = {Non-Line-of-Sight Surface Reconstruction Using the Directional Light-Cone Transform},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Seeing the World in a Bag of Chips
Jeong Joon Park, Aleksander Holynski, Steven M. Seitz


We address the dual problems of novel view synthesis and environment reconstruction from hand-held RGBD sensors. Our contributions include 1) modeling highly specular objects, 2) modeling inter-reflections and Fresnel effects, and 3) enabling surface light field reconstruction with the same input needed to reconstruct shape alone. In cases where scene surface has a strong mirror-like material component, we generate highly detailed environment images, revealing room composition, objects, people, buildings, and trees visible through windows. Our approach yields state of the art view synthesis techniques, operates on low dynamic range imagery, and is robust to geometric and calibration errors.
[environment, video, modeling, multiple, work] [map, object] [input, model, ray, robust, quality, blending, adversarial, highly] [light, ieee, pattern, method, field, figure, reflection, perceptual, illumination, high, recover, range] [image, synthesis, loss, texture, appearance, train] [neural, network, deep, learning, arxiv, preprint, problem, test, note, set] [surface, specular, view, computer, scene, conference, material, rendering, diffuse, acm, vision, lighting, novel, reflectance, geometry, srms, ground, truth, single, estimation, rgbd, fresnel, approach, reconstruct, point, international, michael, reconstruction, recovered, srm, capture, camera, shape, david, richard, detailed, reconstructed, ravi, estimated, visible, depth, viewpoint, accurate]
@InProceedings{Park_2020_CVPR,
  author = {Park, Jeong Joon and Holynski, Aleksander and Seitz, Steven M.},
  title = {Seeing the World in a Bag of Chips},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Correction Filter for Single Image Super-Resolution: Robustifying Off-the-Shelf Deep Super-Resolvers
Shady Abu Hussein, Tom Tirer, Raja Giryes


The single image super-resolution task is one of the most examined inverse problems in the past decade. In the recent years, Deep Neural Networks (DNNs) have shown superior performance over alternative methods when the acquisition process uses a fixed known downscaling kernel---typically a bicubic kernel. However, several recent works have shown that in practical scenarios, where the test data mismatch the training data (e.g. when the downscaling kernel is not the bicubic kernel or is not available at training), the leading DNN methods suffer from a huge performance drop. Inspired by the literature on generalized sampling, in this work we propose a method for improving the performance of DNNs that have been trained with a fixed kernel on observations acquired by other kernels. For a known kernel, we design a closed-form correction filter that modifies the low-resolution image to match one which is obtained by another kernel (e.g. bicubic), and thus improves the results of existing pre-trained DNNs. For an unknown kernel, we extend this idea and propose an algorithm for blind estimation of the required correction filter. We show that our approach outperforms other super-resolution methods, which are designed for general downscaling kernels.
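Conceptually, the correction filter modifies the observed low-resolution image so that it looks as if it had been produced by the bicubic kernel the pretrained super-resolver expects. The frequency-domain sketch below is a simplified Wiener-style stand-in for that idea, not the paper's exact closed-form filter; it ignores the subsampling step and assumes the kernels are padded to the image size.

import numpy as np

def frequency_correction(observed_lr, k_actual, k_bicubic, eps=1e-3):
    # Regularized H = B / K: map content produced by k_actual toward what the
    # bicubic kernel k_bicubic would have produced.
    K = np.fft.fft2(k_actual, s=observed_lr.shape)
    B = np.fft.fft2(k_bicubic, s=observed_lr.shape)
    H = B * np.conj(K) / (np.abs(K) ** 2 + eps)
    return np.fft.ifft2(np.fft.fft2(observed_lr) * H).real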
[work, composed, inspired] [propose, cnn, table, box, improves, leading] [correction, std, dnn, dnns, trained, raja, literature] [kernel, downscaling, gaussian, ieee, dbpn, bicubic, scale, blind, operator, rcan, assumption, prosr, kbicub, pattern, sisr, proposed, figure, zssr, signal, method, srmd, presented, upsampling, downsampling, comparison, acquisition, based, prior, psnr, ssim, tom, inverse, convolution, ybicub, cell, kernelgan, tirer] [image, factor, generalized, latent] [filter, deep, performance, training, sampling, linear, setting, learning, test, algorithm, processing, note, width, fixed, theorem, requires, task, neural] [approach, conference, computer, vision, single, estimation, reconstruction, estimated, term, estimator, estimate, international]
@InProceedings{Hussein_2020_CVPR,
  author = {Hussein, Shady Abu and Tirer, Tom and Giryes, Raja},
  title = {Correction Filter for Single Image Super-Resolution: Robustifying Off-the-Shelf Deep Super-Resolvers},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Retina-Like Visual Image Reconstruction via Spiking Neural Model
Lin Zhu, Siwei Dong, Jianing Li, Tiejun Huang, Yonghong Tian


The high-sensitivity vision of primates, including humans, is mediated by a small retinal region called the fovea. As a novel bio-inspired vision sensor, the spike camera mimics the fovea to record natural scenes with continuous-time spikes instead of in a frame-based manner. However, reconstructing visual images from the spikes remains a challenge. In this paper, we design a retina-like visual image reconstruction framework, which is flexible in reconstructing the full texture of natural scenes from this new type of spike data. Specifically, the proposed architecture consists of a motion local excitation layer, a spike refining layer and a visual reconstruction layer motivated by bio-realistic leaky integrate-and-fire (LIF) neurons and synapse connections with spike-timing-dependent plasticity (STDP) rules. This approach may represent a major shift from conventional frame-based vision to continuous-time retina-like vision, owing to the advantages of high temporal resolution and low power consumption. To test the performance, a spike dataset is constructed which is recorded by the spike camera. The experimental results show that the proposed approach is extremely effective in reconstructing the visual image in both normal and high speed scenes, while achieving high dynamic range and high image quality.
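At the heart of the architecture are leaky integrate-and-fire (LIF) neurons, whose membrane potential integrates incoming input, decays over time, and emits a spike on crossing a threshold. A textbook-style sketch of those dynamics is below; the time constants and reset rule are generic, not the paper's tuned values.

import numpy as np

def lif_spikes(input_current, dt=1.0, tau=20.0, v_threshold=1.0, v_reset=0.0):
    # Membrane potential leaks toward rest, integrates input, fires above threshold.
    v, spikes = 0.0, []
    for i_t in input_current:
        v += (dt / tau) * (-v + i_t)   # leaky integration
        if v >= v_threshold:
            spikes.append(1)
            v = v_reset                # reset after firing
        else:
            spikes.append(0)
    return np.array(spikes)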
[visual, time, static, moment, temporal, state, dataset, mechanism, graph, current, represent] [threshold, propose, confidence, denotes, including, region] [model, input, fig, isi, noise, quality, digital] [spike, dynamic, motion, ieee, method, mij, high, proposed, tfi, excitation, pixel, based, synaptic, analysis, signal, figure, extraction, biological, stdp, refining, output, photon, gaussian, tfa, synapse, intensity, asynchronous, refractory, period, conventional, sensor, luminance, contrast, blur] [image, train, texture, distinguish] [neuron, firing, neural, layer, spiking, class, distribution, data, set, learning, membrane, potential, matrix, sampling, process, higher, record, architecture, plasticity, binary] [vision, reconstruction, camera, local, reconstruct, reconstructed, conference, human]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Lin and Dong, Siwei and Li, Jianing and Huang, Tiejun and Tian, Yonghong},
  title = {Retina-Like Visual Image Reconstruction via Spiking Neural Model},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Plug-and-Play Algorithms for Large-Scale Snapshot Compressive Imaging
Xin Yuan, Yang Liu, Jinli Suo, Qionghai Dai


Snapshot compressive imaging (SCI) aims to capture high-dimensional (usually 3D) images using a 2D sensor (detector) in a single snapshot. Though enjoying the advantages of low bandwidth, low power and low cost, applying SCI to large-scale problems (HD or UHD videos) in our daily life is still challenging. The bottleneck lies in the reconstruction algorithms; they are either too slow (iterative optimization algorithms) or not flexible to the encoding process (deep learning based end-to-end networks). In this paper, we develop fast and flexible algorithms for SCI based on the plug-and-play (PnP) framework. In addition to the widely used PnP-ADMM method, we further propose the PnP-GAP (generalized alternating projection) algorithm with a lower computational workload and prove the global convergence of PnP-GAP under the SCI hardware constraints. By employing deep denoising priors, we show for the first time that PnP can recover a UHD color video (3840x1644x48 with PSNR above 30dB) from a snapshot 2D measurement. Extensive results on both simulation and real datasets verify the superiority of our proposed algorithm.
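A compact sketch of one plug-and-play, GAP-style iteration for a vectorized measurement model y = Phi x is given below. A median filter stands in for the deep denoising prior, and the explicit matrix form, iteration count and toy data are assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy.ndimage import median_filter

def pnp_gap(y, Phi, shape, n_iter=50, denoise_size=3):
    """Plug-and-play GAP sketch for y = Phi @ x. The projection step uses the
    diagonal of Phi Phi^T; any denoiser can be plugged into the second step."""
    PhiPhiT_diag = np.sum(Phi ** 2, axis=1)           # diag(Phi Phi^T), assumed > 0
    x = Phi.T @ (y / PhiPhiT_diag)                    # crude initial estimate
    for _ in range(n_iter):
        # projection: enforce consistency with the measurement y
        x = x + Phi.T @ ((y - Phi @ x) / PhiPhiT_diag)
        # denoising: plug in a prior on the image-shaped estimate
        x = median_filter(x.reshape(shape), size=denoise_size).ravel()
    return x.reshape(shape)

# toy usage: 16x16 image, 64 random binary measurements
rng = np.random.default_rng(0)
x_true = rng.random((16, 16))
Phi = rng.integers(0, 2, size=(64, 256)).astype(float)
x_hat = pnp_gap(Phi @ x_true.ravel(), Phi, (16, 16))
```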
[video, lawrence, provide, time, speed, yuan, temporal] [global, framework, mask, denotes] [bounded, model, patrick, true] [sci, compressive, xin, sensing, ieee, imaging, snapshot, rmin, denoising, signal, desci, based, admm, color, proposed, pnp, spectral, ffdnet, rnx, denoiser, denoisers, assumption, rmax, flexible, gaussian, figure, psnr, result, noisy, coded, optical, compressed, adaptive, wnnm, proved, pattern] [image, real, gap] [deep, computational, convergence, algorithm, learning, matrix, max, data, consider, fixed, minimization, converge, update, theorem, better, alternating, hardware, network, best, diagonal, satisfying, processing] [david, reconstruction, conference, measurement, solution, international, computer, single, vision, simulation, solved]
@InProceedings{Yuan_2020_CVPR,
  author = {Yuan, Xin and Liu, Yang and Suo, Jinli and Dai, Qionghai},
  title = {Plug-and-Play Algorithms for Large-Scale Snapshot Compressive Imaging},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Network Pruning With Residual-Connections and Limited-Data
Jian-Hao Luo, Jianxin Wu


Filter level pruning is an effective method to accelerate the inference speed of deep CNN models. Although numerous pruning algorithms have been proposed, there are still two open issues. The first problem is how to prune residual connections. We propose to prune both channels inside and outside the residual connections via a KL-divergence based criterion. The second issue is pruning with limited data. We observe an interesting phenomenon: directly pruning on a small dataset is usually worse than fine-tuning a small model which is pruned or trained from scratch on the large dataset. Knowledge distillation is an effective approach to compensate for the weakness of limited data. However, the logits of a teacher model may be noisy. In order to avoid the influence of label noise, we propose a label refinement approach to solve this problem. Experiments have demonstrated the effectiveness of our method (CURL, Compression Using Residual-connections and Limited-data). CURL significantly outperforms previous state-of-the-art methods on ImageNet. More importantly, when pruning on small datasets, CURL achieves comparable or much better performance than fine-tuning a pretrained small model.
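The KL-divergence based importance criterion described above can be sketched as follows; `forward_probs` is a hypothetical hook returning softmax outputs with an optional channel masked out, and the paper's handling of residual connections and label refinement is not reproduced.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    p = np.clip(p, eps, 1.0); q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def channel_importance(forward_probs, images, n_channels):
    """Score each channel by how much masking it shifts the output distribution.
    forward_probs(images, masked_channel): hypothetical hook returning softmax
    outputs; masked_channel=None means no masking. Prune low-score channels."""
    base = forward_probs(images, masked_channel=None)
    scores = np.zeros(n_channels)
    for c in range(n_channels):
        masked = forward_probs(images, masked_channel=c)
        scores[c] = np.mean([kl_divergence(p, q) for p, q in zip(base, masked)])
    return scores
```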
[dataset, previous, order, speed, current] [propose, inside, table, refinement, level] [model, trained, original, datasets, hourglass, influence] [residual, method, output, block, relu, proposed, channel, compression, based, remove, convolutional, scale] [image, target, loss] [pruning, small, pruned, large, training, prune, network, filter, knowledge, label, accuracy, learning, curl, better, neural, deep, distillation, imagenet, data, criterion, mixup, layer, logits, performance, teacher, achieve, evaluate, rate, bottleneck, slimmable, soft, strategy, set, inference, wallet, connection, impact, worse, scratch, comparable, number, shortcut] [limited, directly, structure, approach, novel, second, accurate]
@InProceedings{Luo_2020_CVPR,
  author = {Luo, Jian-Hao and Wu, Jianxin},
  title = {Neural Network Pruning With Residual-Connections and Limited-Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
AdderNet: Do We Really Need Multiplications in Deep Learning?
Hanting Chen, Yunhe Wang, Chunjing Xu, Boxin Shi, Chao Xu, Qi Tian, Chang Xu


Compared with the cheap addition operation, multiplication is of much higher computational complexity. The widely used convolutions in deep neural networks are exactly cross-correlations that measure the similarity between input features and convolution filters, which involves massive multiplications between float values. In this paper, we present adder networks (AdderNets) to trade these massive multiplications in deep neural networks, especially convolutional neural networks (CNNs), for much cheaper additions to reduce computation costs. In AdderNets, we take the L1-norm distance between filters and input features as the output response. The influence of this new similarity measure on the optimization of neural networks has been thoroughly analyzed. To achieve a better performance, we develop a special back-propagation approach for AdderNets by investigating the full-precision gradient. We then propose an adaptive learning rate strategy to enhance the training procedure of AdderNets according to the magnitude of each neuron's gradient. As a result, the proposed AdderNets can achieve 74.9% Top-1 accuracy and 91.7% Top-5 accuracy using ResNet-50 on the ImageNet dataset without any multiplication in the convolutional layers. The code is publicly available at: (https://github.com/huaweinoah/AdderNet).
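The change of similarity measure is easy to see in code. The sketch below contrasts a plain cross-correlation response with the negative L1-distance response used by adder layers (single channel, no stride or padding, no special back-propagation), purely as an illustration of the forward computation.

```python
import numpy as np

def conv2d_response(x, w):
    """Standard cross-correlation of filter w over image x (valid mode)."""
    kh, kw = w.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * w)          # multiplications
    return out

def adder_response(x, w):
    """AdderNet-style response: negative L1 distance between patch and filter,
    which needs only additions and subtractions in the inner loop."""
    kh, kw = w.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = -np.sum(np.abs(x[i:i+kh, j:j+kw] - w))
    return out

x, w = np.random.rand(8, 8), np.random.rand(3, 3)
print(conv2d_response(x, w).shape, adder_response(x, w).shape)   # (6, 6) (6, 6)
```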
[dataset, sign, extract] [visualization, cnn, feature, table, addition, achieves, propose, template] [input, mnist, model, magnitude] [proposed, convolutional, convolution, cnns, output, figure, adaptive, cin, conventional, gaussian, low] [image, train] [neural, addernets, learning, gradient, deep, accuracy, layer, rate, adder, network, achieve, replace, batch, computational, addernet, measure, similarity, training, efficient, update, normalization, arxiv, preprint, filter, variance, set, operation, classification, binary, bnn, distribution, multiplication, imagenet, number, performance, lower, computation, better, investigate, calculate, larger, chang, compared, mobile, small, replacing, weight, energy, teacher] [distance, demonstrate, cost, computer]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Hanting and Wang, Yunhe and Xu, Chunjing and Shi, Boxin and Xu, Chao and Tian, Qi and Xu, Chang},
  title = {AdderNet: Do We Really Need Multiplications in Deep Learning?},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
NeuralScale: Efficient Scaling of Neurons for Resource-Constrained Deep Neural Networks
Eugene Lee, Chen-Yi Lee


Deciding the number of neurons during the design of a deep neural network to maximize performance is not intuitive. In this work, we attempt to search for the neuron (filter) configuration of a fixed network architecture that maximizes accuracy. Using iterative pruning methods as a proxy, we parametrize the change of the neuron (filter) number of each layer with respect to the change in parameters, allowing us to efficiently scale an architecture across arbitrary sizes. We also introduce architecture descent, which iteratively refines the parametrized function used for model scaling. The combination of both proposed methods is coined as NeuralScale. To prove the efficiency of NeuralScale in terms of parameters, we show empirical simulations on VGG11, MobileNetV2 and ResNet18 using CIFAR10, CIFAR100 and TinyImageNet as benchmark datasets. Our results show an increase in accuracy of 3.04%, 8.56% and 3.41% for VGG11, MobileNetV2 and ResNet18 on CIFAR10, CIFAR100 and TinyImageNet respectively under a parameter-constrained setting (output neurons (filters) of default configuration with scaling factor of 0.25).
[structured, corresponds, observed, work] [table] [iterative, dnn, change, model, magnitude, trained] [scale, figure, method, convolutional, pattern, ieee, proposed, comparison, output, residual, based, expansion] [factor] [architecture, network, pruning, layer, neural, accuracy, neuralscale, parameter, descent, efficient, search, arxiv, preprint, uniform, scaling, deep, morphnet, number, learning, tinyimagenet, configuration, set, latency, algorithm, total, iteration, filter, scaled, gain, rate, gradient, size, width, optimal, training, pruned, quoc, design, function, efficiency, prune, resource, barret, processing, andrew, performance, neuron, default, required, ratio] [approach, conference, computer, vision, single, international, parameterize]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Eugene and Lee, Chen-Yi},
  title = {NeuralScale: Efficient Scaling of Neurons for Resource-Constrained Deep Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Training Quantized Neural Networks With a Full-Precision Auxiliary Module
Bohan Zhuang, Lingqiao Liu, Mingkui Tan, Chunhua Shen, Ian Reid


In this paper, we seek to tackle a challenge in training low-precision networks: the notorious difficulty in propagating gradients through a low-precision network due to the non-differentiable quantization function. We propose a solution by training the low-precision network with a full-precision auxiliary module. Specifically, during training, we construct a mix-precision network by augmenting the original low-precision network with the full-precision auxiliary module. Then the augmented mix-precision network and the low-precision network are jointly optimized. This strategy creates additional full-precision routes to update the parameters of the low-precision model, thus making the gradient back-propagate more easily. At inference time, we discard the auxiliary module without introducing any computational complexity to the low-precision network. We evaluate the proposed method on image classification and object detection over various quantization approaches and show consistent performance increases. In particular, we achieve near-lossless performance relative to the full-precision model by using a 4-bit detector, which is of great practical value.
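A minimal PyTorch sketch of the training-time pairing is given below, assuming a crude straight-through sign() quantizer and a single auxiliary head; the paper's quantizer, network structure and the placement of the auxiliary routes differ.

```python
import torch
import torch.nn as nn

class QuantWithAuxiliary(nn.Module):
    """Low-precision path plus a full-precision auxiliary path sharing features.
    The sign() quantizer with a straight-through estimator is a stand-in."""
    def __init__(self, dim=64, n_classes=10):
        super().__init__()
        self.feature = nn.Linear(dim, dim)
        self.head_lp = nn.Linear(dim, n_classes)     # low-precision branch
        self.head_aux = nn.Linear(dim, n_classes)    # full-precision auxiliary branch

    def forward(self, x):
        f = torch.relu(self.feature(x))
        f_q = f + (torch.sign(f) - f).detach()       # straight-through "quantization"
        return self.head_lp(f_q), self.head_aux(f)   # aux path stays full precision

model = QuantWithAuxiliary()
x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
logits_lp, logits_aux = model(x)
loss = nn.functional.cross_entropy(logits_lp, y) + nn.functional.cross_entropy(logits_aux, y)
loss.backward()          # the auxiliary route adds a full-precision gradient path
# at inference, only logits_lp is used and head_aux is discarded
```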
[] [module, object, detection, feature, table, propose, retinanet, backbone, ross, semantic, pyramid, head, chunhua] [auxiliary, model, improve, overview, original, adaptor] [proposed, ieee, method, block, convolutional, intermediate, based, output, skip, capability] [loss, image, shared] [network, training, neural, baseline, quantized, quantization, learning, auxi, gradient, performance, classification, plain, knowledge, sharing, weight, imagenet, layer, strategy, binary, achieve, deep, accuracy, classifier, observe, distillation, comparing, fullprecision, design, note, set, teacher, quantize, validation, arxiv, preprint, precision, search, architecture, representational, convergence, rate] [additional, approach, accurate, directly, jointly]
@InProceedings{Zhuang_2020_CVPR,
  author = {Zhuang, Bohan and Liu, Lingqiao and Tan, Mingkui and Shen, Chunhua and Reid, Ian},
  title = {Training Quantized Neural Networks With a Full-Precision Auxiliary Module},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation From a Blackbox Model
Dongdong Wang, Yandong Li, Liqiang Wang, Boqing Gong


We study how to train a student deep neural network for visual recognition by distilling knowledge from a blackbox teacher model in a data-efficient manner. Progress on this problem can significantly reduce the dependence on large-scale datasets for learning high-performing visual recognition models. There are two major challenges. One is that the number of queries into the teacher model should be minimized to save computational and/or financial costs. The other is that the number of images used for the knowledge distillation should be small; otherwise, it violates our expectation of reducing the dependence on large-scale datasets. To tackle these challenges, we propose an approach that blends mixup and active learning. The former effectively augments the few unlabeled images by a big pool of synthetic images sampled from the convex hull of the original images, and the latter actively chooses hard examples for the student neural network from the pool and queries their labels from the teacher model. We validate our approach with extensive experiments.
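The blend-then-query loop can be sketched in a few lines; `student_probs` and `teacher_labels` are hypothetical callables, and the confidence-based selection rule here is a simplified stand-in for the paper's active learning criterion.

```python
import numpy as np

def mixup_pool(images, n_mix, rng):
    """Build synthetic images as convex combinations of random image pairs."""
    idx = rng.integers(0, len(images), size=(n_mix, 2))
    lam = rng.uniform(0, 1, size=(n_mix, 1, 1, 1))
    return lam * images[idx[:, 0]] + (1 - lam) * images[idx[:, 1]]

def active_queries(pool, student_probs, budget):
    """Pick the pool images the student is least confident about."""
    conf = student_probs(pool).max(axis=1)        # max softmax probability
    return pool[np.argsort(conf)[:budget]]

# usage sketch (student_probs / teacher_labels are hypothetical hooks):
# pool   = mixup_pool(unlabeled_images, n_mix=10000, rng=np.random.default_rng(0))
# chosen = active_queries(pool, student_probs, budget=512)
# labels = teacher_labels(chosen)   # query the blackbox teacher only on these
```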
[natural, pair, recognition, work, construct, querying, visual] [table, confidence, propose, hard] [model, blackbox, query, original, success, adversarial, mnist, black, white, whitebox, input, study, improve] [ieee, proposed, figure, big, coefficient, comparison] [synthetic, image, real, train] [teacher, active, mixup, student, learning, network, knowledge, distillation, neural, training, data, number, selected, accuracy, arxiv, preprint, pool, deep, unlabeled, subset, test, random, set, performance, classification, algorithm, small, machine, vanilla, candidate, reduce, distill, select, higher, search, actively, augmentation, scheme, save] [approach, conference, computer, international, convex, initial, acquire]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Dongdong and Li, Yandong and Wang, Liqiang and Gong, Boqing},
  title = {Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation From a Blackbox Model},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Dimensional Pruning: A Unified Framework for Model Compression
Jinyang Guo, Wanli Ouyang, Dong Xu


In this work, we propose a unified model compression framework called Multi-Dimensional Pruning (MDP) to simultaneously compress the convolutional neural networks (CNNs) on multiple dimensions. In contrast to the existing model compression methods that only aim to reduce the redundancy along either the spatial/spatial-temporal dimension (e.g., spatial dimension for 2D CNNs, spatial and temporal dimensions for 3D CNNs) or the channel dimension, our newly proposed approach can simultaneously reduce the spatial/spatial-temporal and the channel redundancies for CNNs. Specifically, in order to reduce the redundancy along the spatial/spatial-temporal dimension, we downsample the input tensor of a convolutional layer, in which the scaling factor for the downsampling operation is adaptively selected by our approach. After the convolution operation, the output tensor is upsampled to the original size to ensure the unchanged input size for the subsequent CNN layers. To reduce the channel-wise redundancy, we introduce a gate for each channel of the output tensor as its importance score, in which the gate value is automatically learned. The channels with small importance scores will be removed after the model compression process. Our comprehensive experiments on four benchmark datasets demonstrate that our MDP framework outperforms the existing methods when pruning both 2D CNNs and 3D CNNs.
[temporal, multiple, work, outperforms, video, order, three] [branch, table, framework, effectiveness, feature, stage, pooling, unified] [model, input, original] [mdp, channel, spatial, tensor, convolutional, output, method, cnns, figure, compression, proposed, compress, downsampling, dcp, compressed, existing, convolution, based, downsample, resolution] [perform, factor, aim, image, representation] [pruning, layer, network, redundancy, reduce, compressing, selected, scaling, searching, neural, learning, prune, dimension, reducing, operation, gate, accuracy, indicates, imagenet, deep, simultaneously, performance, efficient, number, classification, pruned, average, better] [approach, demonstrate]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Jinyang and Ouyang, Wanli and Xu, Dong},
  title = {Multi-Dimensional Pruning: A Unified Framework for Model Compression},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Efficient Model Compression via Learned Global Ranking
Ting-Wu Chin, Ruizhou Ding, Cha Zhang, Diana Marculescu


Pruning convolutional filters has demonstrated its effectiveness in compressing ConvNets. Prior art in filter pruning requires users to specify a target model complexity (e.g., model size or FLOP count) for the resulting architecture. However, determining a target model complexity can be difficult for optimizing various embodied AI applications such as autonomous robots, drones, and user-facing applications. First, both the accuracy and the speed of ConvNets can affect the performance of the application. Second, the performance of the application can be hard to assess without evaluating ConvNets during inference. As a consequence, finding a sweet spot between accuracy and speed via filter pruning, which needs to be done in a trial-and-error fashion, can be time-consuming. This work takes a first step toward making this process more efficient by altering the goal of model compression to producing a set of ConvNets with various accuracy and latency trade-offs instead of producing one ConvNet targeting some pre-defined latency constraint. To this end, we propose to learn a global ranking of the filters across different layers of the ConvNet, which is used to obtain a set of ConvNet architectures that have different accuracy/latency trade-offs by pruning the bottom-ranked filters. Our proposed algorithm, LeGR, is shown to be 2x to 3x faster than prior work while having comparable or better performance when targeting seven pruned ResNet-56 models with different accuracy/FLOPs profiles on the CIFAR-100 dataset. Additionally, we have evaluated LeGR on ImageNet and Bird-200 with ResNet-50 and MobileNetV2 to demonstrate its effectiveness. Code available at https://github.com/cmu-enyac/LeGR.
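A hedged sketch of the learned global ranking: per-layer affine parameters (alpha_l, kappa_l) rescale filter norms into one comparable score, and different operating points come from truncating a single sorted list. How alpha and kappa are learned (an evolutionary search in the paper) is not shown, and the toy norms are illustrative.

```python
import numpy as np

def global_filter_ranking(filter_norms, alpha, kappa):
    """filter_norms: list of 1-D arrays (e.g. L2 norms of filters), one per layer.
    alpha, kappa: per-layer affine parameters making scores comparable globally.
    Returns (layer, filter) pairs sorted from least to most important."""
    scored = []
    for l, norms in enumerate(filter_norms):
        for f, n in enumerate(norms):
            scored.append((alpha[l] * n + kappa[l], l, f))
    scored.sort()
    return [(l, f) for _, l, f in scored]

def prune_to_ratio(ranking, total_filters, keep_ratio):
    """One ranking yields many operating points: drop the lowest-ranked filters."""
    n_drop = int(total_filters * (1 - keep_ratio))
    return set(ranking[:n_drop])                   # filters to remove

norms = [np.random.rand(16), np.random.rand(32)]
rank = global_filter_ranking(norms, alpha=[1.0, 0.7], kappa=[0.0, 0.1])
drop = prune_to_ratio(rank, total_filters=48, keep_ratio=0.6)
```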
[work, recognition, speed, embodied, pair, goal, explore] [propose, art, global, faster] [model, curve] [prior, ieee, figure, convolutional, pattern, assumption, proposed, method, compression, affine, channel] [learn, target, image, generate] [pruning, flop, filter, learning, neural, count, ranking, accuracy, convnets, pruned, legr, deep, network, arxiv, preprint, convnet, learned, set, efficient, complexity, performance, training, architecture, find, algorithm, search, imagenet, prune, gradient, compared, validation, rate, size, comparable, measure, note, subset, processing, considered, layer, regularization, consider, amc, group, diana, optimizing, latency] [conference, computer, vision, international, cost, single, formulation, compare]
@InProceedings{Chin_2020_CVPR,
  author = {Chin, Ting-Wu and Ding, Ruizhou and Zhang, Cha and Marculescu, Diana},
  title = {Towards Efficient Model Compression via Learned Global Ranking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
HRank: Filter Pruning Using High-Rank Feature Map
Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, Ling Shao


Neural network pruning offers a promising prospect for deploying deep neural networks on resource-limited devices. However, existing methods are still challenged by training inefficiency and labor cost in pruning designs, due to missing theoretical guidance on non-salient network components. In this paper, we propose a novel filter pruning method by exploring the High Rank of feature maps (HRank). Our HRank is inspired by the discovery that the average rank of multiple feature maps generated by a single filter is always the same, regardless of the number of image batches CNNs receive. Based on HRank, we develop a method that is mathematically formulated to prune filters with low-rank feature maps. The principle behind our pruning is that low-rank feature maps contain less information, and thus pruned results can be easily reproduced. Besides, we experimentally show that weights with high-rank feature maps contain more important information, such that even when a portion is not updated, very little damage would be done to the model performance. Without introducing any additional constraints, HRank leads to significant improvements over the state-of-the-art in terms of FLOPs and parameter reduction, with similar accuracies. For example, with ResNet-110, we achieve a 58.2%-FLOPs reduction by removing 59.2% of the parameters, with only a small loss of 0.14% in top-1 accuracy on CIFAR-10. With ResNet-50, we achieve a 43.8%-FLOPs reduction by removing 36.7% of the parameters, with only a loss of 1.17% in the top-1 accuracy on ImageNet. The code is available at https://github.com/lmbxmu/HRank.
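The rank-based criterion can be sketched directly from the abstract: average the matrix rank of each channel's feature map over a few input batches and prune the channels with the lowest averages. `feature_maps` below is a hypothetical hook returning (N, C, H, W) activations for one layer.

```python
import numpy as np

def average_feature_map_rank(feature_maps, batches):
    """feature_maps(batch) -> (N, C, H, W) activations for one layer (hypothetical
    hook). Returns the average matrix rank per output channel; prune the channels
    with the smallest values first."""
    ranks, count = None, 0
    for batch in batches:
        acts = feature_maps(batch)                     # (N, C, H, W)
        N, C = acts.shape[0], acts.shape[1]
        r = np.zeros(C)
        for c in range(C):
            r[c] = np.mean([np.linalg.matrix_rank(acts[n, c]) for n in range(N)])
        ranks = r if ranks is None else ranks + r
        count += 1
    return ranks / count
```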
[recognition, time] [feature, map, including, cnn, jian] [model, input, googlenet, conduct] [based, convolutional, convolution, adaptive, pattern, compression, cnns, high, proposed, method, remove, removed, block, removing, figure] [generated, image, loss, zhao, extensive, corresponding] [pruning, filter, hrank, rank, accuracy, neural, network, better, deep, oij, compared, learning, average, set, acceleration, efficient, pruned, number, small, weight, performance, large, layer, rongrong, training, portion, reduction, optimization, machine, denote, rate, prune, achieve, discussed, batch, imagenet] [computer, vision, property, conference, international, structure, estimate, single, demonstrate, decomposition, additional]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Mingbao and Ji, Rongrong and Wang, Yan and Zhang, Yichen and Zhang, Baochang and Tian, Yonghong and Shao, Ling},
  title = {HRank: Filter Pruning Using High-Rank Feature Map},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DMCP: Differentiable Markov Channel Pruning for Neural Networks
Shaopeng Guo, Yujie Wang, Quanquan Li, Junjie Yan


Recent works imply that channel pruning can be regarded as searching for the optimal sub-structure of an unpruned network. However, existing works based on this observation require training and evaluating a large number of structures, which limits their application. In this paper, we propose a novel differentiable method for channel pruning, named Differentiable Markov Channel Pruning (DMCP), to efficiently search for the optimal sub-structure. Our method is differentiable and can be directly optimized by gradient descent with respect to the standard task loss and a budget regularization (e.g. a FLOPs constraint). In DMCP, we model channel pruning as a Markov process, in which each state represents retaining the corresponding channel during pruning, and transitions between states denote the pruning process. In the end, our method is able to implicitly select the proper number of channels in each layer by the Markov process with optimized transitions. To validate the effectiveness of our method, we perform extensive experiments on ImageNet with ResNet and MobilenetV2. Results show our method can achieve consistent improvements over state-of-the-art pruning methods in various FLOPs settings.
[state, illustrated] [table, stage, effectiveness, resnet] [model, trained, budget, influence, original, iterative] [channel, method, figure, optimized, output, convolutional, proposed, cout, scale] [loss, train, perform, target, corresponding] [pruning, architecture, unpruned, pruned, layer, network, training, markov, search, number, process, sampling, dmcp, performance, expected, probability, large, lop, neural, sampled, updating, learning, update, uniform, warmup, regularization, retaining, transition, sandwich, task, set, variant, equation, arxiv, preprint, optimal, imagenet, note, weight, efficient, accuracy, gradient, metapruning, soft, better, baseline, searching, respect, amc, best, ratio, space, sample] [differentiable, computed, direct, computer]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Shaopeng and Wang, Yujie and Li, Quanquan and Yan, Junjie},
  title = {DMCP: Differentiable Markov Channel Pruning for Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ReSprop: Reuse Sparsified Backpropagation
Negar Goli, Tor M. Aamodt


The success of Convolutional Neural Networks (CNNs) in various applications is accompanied by a significant increase in computation and training time. In this work, we focus on accelerating training by observing that about 90% of gradients are reusable during training. Leveraging this observation, we propose a new algorithm, Reuse-Sparse-Backprop (ReSprop), as a method to sparsify gradient vectors during CNN training. ReSprop maintains state-of-the-art accuracy on CIFAR-10, CIFAR-100, and ImageNet datasets with less than 1.1% accuracy loss while enabling a reduction in back-propagation computations by a factor of 10x resulting in a 2.7x overall speedup in training. As the computation reduction introduced by ReSprop is accomplished by introducing fine-grained sparsity that reduces computation efficiency on GPUs, we introduce a generic sparse convolution neural network accelerator (GSCN), which is designed to accelerate sparse back-propagation convolutions. When combined with ReSprop, GSCN achieves 8.0x and 7.2x speedup in the backward pass on ResNet34 and VGG16 versus a GTX 1080 Ti GPU.
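A toy sketch of the gradient-reuse idea: keep the previous iteration's output gradient as a base and back-propagate only the largest components of the change. The top-k thresholding rule is an illustrative choice; the paper's sparsification scheme and the GSCN accelerator are not modeled here.

```python
import numpy as np

def resprop_gradient(g_new, g_prev, keep_ratio=0.1):
    """Return a gradient that reuses g_prev and recomputes only the top
    keep_ratio fraction of the change |g_new - g_prev|."""
    delta = g_new - g_prev
    k = max(1, int(keep_ratio * delta.size))
    thresh = np.partition(np.abs(delta).ravel(), -k)[-k]
    mask = np.abs(delta) >= thresh                 # sparse update positions
    return g_prev + delta * mask, mask

g_prev = np.random.randn(64, 64)
g_new = g_prev + 0.01 * np.random.randn(64, 64)    # gradients change slowly across iterations
g_used, mask = resprop_gradient(g_new, g_prev)
print(mask.mean())                                 # roughly 0.1 of entries recomputed
```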
[overhead, previous, dataset, work] [table, threshold, propose, achieves, faster] [trained, datasets, percentage, original, model] [output, figure, convolutional, ieee, convolution, high] [loss, mode] [resprop, training, accuracy, gradient, neural, deep, computation, sparsity, arxiv, preprint, network, algorithm, speedup, meprop, validation, reuse, stochastic, layer, iteration, backward, learning, imagenet, pass, forward, vector, batch, average, compared, random, pruning, number, baseline, processing, backpropagation, reusing, memory, theoretical, computing, machine, accelerate, reduction, accelerator, size, higher, efficient, reducing, gscn, sparsifying, dsg, larger, architecture] [sparse, international, conference, computer, angle, compute, approach, full]
@InProceedings{Goli_2020_CVPR,
  author = {Goli, Negar and Aamodt, Tor M.},
  title = {ReSprop: Reuse Sparsified Backpropagation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adversarial Texture Optimization From RGB-D Scans
Jingwei Huang, Justus Thies, Angela Dai, Abhijit Kundu, Chiyu "Max" Jiang, Leonidas J. Guibas, Matthias Niessner, Thomas Funkhouser


Realistic color texture generation is an important step in RGB-D surface reconstruction, but remains challenging in practice due to inaccuracies in reconstructed geometry, misaligned camera poses, and view-dependent imaging artifacts. In this work, we present a novel approach for color texture generation using a conditional adversarial loss obtained from weakly-supervised views. Specifically, we propose an approach to produce photorealistic textures for approximate surfaces, even from misaligned images, by learning an objective function that is robust to these errors. The key idea of our approach is to learn a patch-based conditional discriminator which guides the texture optimization to be tolerant to misalignments. Our discriminator takes a synthesized view and a real image, and evaluates whether the synthesized one is realistic, under a broadened definition of realism. We train the discriminator by providing as `real' examples pairs of input views and their misaligned versions -- so that the learned adversarial loss will tolerate errors from the scans. Experiments on synthetic and real data under quantitative or qualitative evaluation demonstrate the advantage of our approach in comparison to the state of the art.
[evaluation] [object, propose, table] [input, adversarial, vgg, auxiliary, robust, model, example, condition] [figure, color, patch, perceptual, ieee, method, pixel, pattern, proposed, exact, sharp] [texture, image, loss, discriminator, real, colormap, realistic, source, misalignment, misaligned, conditional, synthetic, produce, generation, mapping, fake, learn, synthesized] [optimization, optimize, metric, learned, evaluate, objective, function, neural, selection] [camera, geometry, view, approach, surface, sharpest, cad, ground, rendered, computer, truth, texturing, conference, reconstruction, error, nearest, rendering, pose, local, acm, reconstructed, textured, vision, matthias, thomas, parametric, handle, scanning, scanned]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Jingwei and Thies, Justus and Dai, Angela and Kundu, Abhijit and Jiang, Chiyu "Max" and Guibas, Leonidas J. and Niessner, Matthias and Funkhouser, Thomas},
  title = {Adversarial Texture Optimization From RGB-D Scans},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Synchronizing Probability Measures on Rotations via Optimal Transport
Tolga Birdal, Michael Arbel, Umut Simsekli, Leonidas J. Guibas


We introduce a new paradigm, `measure synchronization', for synchronizing graphs with measure-valued edges. We formulate this problem as maximization of the cycle-consistency in the space of probability measures over relative rotations. In particular, we aim at estimating marginal distributions of absolute orientations by synchronizing the `conditional' ones, which are defined on the Riemannian manifold of quaternions. Such graph optimization on distributions-on-manifolds enables a natural treatment of multimodal hypotheses, ambiguities and uncertainties arising in many computer vision applications such as SLAM, SfM, and object pose estimation. We first formally define the problem as a generalization of the classical rotation graph synchronization, where in our case the vertices denote probability measures over rotations. We then measure the quality of the synchronization by using Sinkhorn divergences, which reduce to other popular metrics such as the Wasserstein distance or the maximum mean discrepancy as limit cases. We propose a nonparametric Riemannian particle optimization approach to solve the problem. Even though the problem is non-convex, by drawing a connection to the recently proposed sparse optimization methods, we show that the proposed algorithm converges to the global optimum in a special case of the problem under certain conditions. Our qualitative and quantitative experiments show the validity of our approach and we bring in new perspectives to the study of synchronization.
[graph, gij, multiple, sense, work] [global, denotes, positive] [case, definition, noise] [ieee, pattern, proposed, kernel, journal, particle, analysis] [mmd, wasserstein, transport, loss, discrepancy, generative, introduce, real] [probability, problem, algorithm, measure, riemannian, gradient, optimal, optimization, learning, arxiv, preprint, space, special, distribution, machine, set, divergence, note, number, neural, processing, descent, consider, function, rpgd, matrix, qij, denote, maximum, optimum, min] [synchronization, absolute, pose, computer, relative, conference, sinkhorn, vision, international, rotation, camera, distance, defined, single, ground, averaging, euclidean, approach, define, solution, estimation, form, quaternion, synchronizing, sparse, constraint, point, handle, joint, structure, geodesic, cost]
@InProceedings{Birdal_2020_CVPR,
  author = {Birdal, Tolga and Arbel, Michael and Simsekli, Umut and Guibas, Leonidas J.},
  title = {Synchronizing Probability Measures on Rotations via Optimal Transport},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GhostNet: More Features From Cheap Operations
Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, Chang Xu


Deploying convolutional neural networks (CNNs) on embedded devices is difficult due to the limited memory and computation resources. The redundancy in feature maps is an important characteristic of successful CNNs, but it has rarely been investigated in neural architecture design. This paper proposes a novel Ghost module to generate more feature maps from cheap operations. Based on a set of intrinsic feature maps, we apply a series of linear transformations with cheap cost to generate many ghost feature maps that could fully reveal the information underlying the intrinsic features. The proposed Ghost module can be taken as a plug-and-play component to upgrade existing convolutional neural networks. Ghost bottlenecks are designed to stack Ghost modules, and then the lightweight GhostNet can be easily established. Experiments conducted on benchmarks demonstrate that the proposed Ghost module is an impressive alternative to convolution layers in baseline models, and our GhostNet can achieve higher recognition performance (e.g. 75.7% top-1 accuracy) than MobileNetV3 with similar computational cost on the ImageNet ILSVRC-2012 classification dataset. Code is available at https://github.com/huawei-noah/ghostnet.
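A compact PyTorch sketch of a Ghost-style module in the spirit of the abstract: an ordinary convolution produces a few intrinsic maps, a cheap depthwise convolution generates the remaining "ghost" maps, and the two are concatenated. The kernel sizes and the intrinsic-to-ghost ratio are illustrative defaults, not the released configuration.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    def __init__(self, in_ch, out_ch, ratio=2, kernel=1, cheap_kernel=3):
        super().__init__()
        init_ch = out_ch // ratio                    # intrinsic feature maps
        ghost_ch = out_ch - init_ch                  # cheap "ghost" feature maps
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel, padding=kernel // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        # depthwise convolution: one cheap linear transform per intrinsic map
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, ghost_ch, cheap_kernel, padding=cheap_kernel // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(ghost_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

x = torch.randn(1, 16, 32, 32)
print(GhostModule(16, 64)(x).shape)                  # torch.Size([1, 64, 32, 32])
```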
[pair] [feature, module, table, object, detection, map, faster, including] [model, input, series, original, primary, conduct] [convolutional, proposed, convolution, figure, compression, output, channel, existing, kernel, chao, kai, designed, cnns, residual] [image, generate, generated, generating, common] [ghost, neural, efficient, ghostnet, deep, linear, number, architecture, size, performance, imagenet, learning, layer, network, accuracy, computational, depthwise, operation, training, large, chang, cheap, yunhe, pruning, larger, smaller, bottleneck, classification, mobile, ordinary, chunjing, compact, width, small, andrew, redundancy, data, applied] [intrinsic, cost]
@InProceedings{Han_2020_CVPR,
  author = {Han, Kai and Wang, Yunhe and Tian, Qi and Guo, Jianyuan and Xu, Chunjing and Xu, Chang},
  title = {GhostNet: More Features From Cheap Operations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Attention-Aware Multi-View Stereo
Keyang Luo, Tao Guan, Lili Ju, Yuesong Wang, Zhuo Chen, Yawei Luo


Multi-view stereo is a crucial task in computer vision that requires accurate and robust photo-consistency among input images for depth estimation. Recent studies have shown that learning-based feature matching and confidence regularization can play a vital role in this task. Nevertheless, how to design good matching confidence volumes as well as effective regularizers for them is still under in-depth study. In this paper, we propose an attention-aware deep neural network "AttMVS" for learning multi-view stereo. In particular, we propose a novel attention-enhanced matching confidence volume, which combines the raw pixel-wise matching confidence from the extracted perceptual features with the contextual information of local scenes, to improve the matching robustness. Furthermore, we develop an attention-guided regularization module, which consists of multilevel ray fusion modules, to hierarchically aggregate and regularize the matching confidence volume into a latent depth probability volume. Experimental results show that our approach achieves the best overall performance on the DTU dataset and the intermediate sequences of the Tanks & Temples benchmark over many state-of-the-art MVS algorithms.
[attention, evaluation, construct, recognition, hierarchically, dataset] [confidence, map, feature, module, table, contextual, benchmark, denotes, propose, refinement, advanced, aggregate] [quality, ray, improve, model, input] [pattern, based, ieee, figure, reference, method, proposed, raw, perceptual, intermediate, comparison, filtering, conventional, fusion, introduced] [image, corresponding, loss, tao, consists, source] [network, training, regularization, performance, learning, algorithm, number, deep, neural, regularize, probability, best, set] [depth, matching, computer, conference, volume, vision, attmvs, dtu, stereo, reconstruction, scene, surface, scan, point, mvsnet, hypothesized, local, camera, view, international, multiview]
@InProceedings{Luo_2020_CVPR,
  author = {Luo, Keyang and Guan, Tao and Ju, Lili and Wang, Yuesong and Chen, Zhuo and Luo, Yawei},
  title = {Attention-Aware Multi-View Stereo},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Bi3D: Stereo Depth Estimation via Binary Classifications
Abhishek Badki, Alejandro Troccoli, Kihwan Kim, Jan Kautz, Pradeep Sen, Orazio Gallo


Stereo-based depth estimation is a cornerstone of computer vision, with state-of-the-art methods delivering accurate results in real time. For several applications such as autonomous navigation, however, it may be useful to trade accuracy for lower latency. We present Bi3D, a method that estimates depth via a series of binary classifications. Rather than testing if objects are at a particular depth D, as existing stereo methods do, it classifies them as being closer or farther than D. This property offers a powerful mechanism to balance accuracy and latency. Given a strict time budget, Bi3D can detect objects closer than a given distance in as little as a few milliseconds, or estimate depth with arbitrarily coarse quantization, with complexity linear with the number of quantization levels. Bi3D can also use the allotted quantization levels to get continuous depth, but in a specific depth range. For standard stereo (i.e., continuous depth on the whole range), our method is close to or on par with state-of-the-art, finely tuned stereo methods.
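Reading depth out of a stack of binary "closer than plane k" decisions can be sketched as below: with K evenly spaced planes, summing the per-plane probabilities of lying beyond each plane yields a depth estimate, and soft probabilities make it continuous rather than quantized. The binary classifier itself is mocked with a probability volume, so this only illustrates the readout, not the network.

```python
import numpy as np

def depth_from_binary_maps(p_closer, d_min, d_max):
    """p_closer: (K, H, W) per-plane probabilities that a pixel is closer than
    plane k (planes evenly spaced in [d_min, d_max]). Summing the complementary
    probabilities counts how many planes the pixel lies beyond."""
    K = p_closer.shape[0]
    step = (d_max - d_min) / K
    return d_min + step * np.sum(1.0 - p_closer, axis=0)

# toy usage: a pixel exactly halfway through the range
K, H, W = 8, 1, 1
p = np.zeros((K, H, W))
p[K // 2:] = 1.0        # closer than the 4 farthest planes, not the 4 nearest ones
print(depth_from_binary_maps(p, d_min=2.0, d_max=10.0))   # ~6.0
```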
[recognition, selective, time, correct, work, multiple, pair, context, dataset] [object, confidence, table, region, par, segmentation] [case, trained, budget] [disparity, figure, method, range, ieee, pattern, existing, pixel, output, flow, warping] [train, image, specific] [binary, network, closer, quantization, equation, quantized, number, note, neural, classify, vector, deep, close, computing, consider, search, layer, accuracy, computational, standard, candidate, larger, classification] [depth, stereo, plane, computer, vision, conference, estimation, cost, estimate, matching, front, continuous, volume, scene, direction, kitti, farther, full, cdi, distance, compute, refer, gwcnet, accurate, estimating, allows, approach, form, sweep]
@InProceedings{Badki_2020_CVPR,
  author = {Badki, Abhishek and Troccoli, Alejandro and Kim, Kihwan and Kautz, Jan and Sen, Pradeep and Gallo, Orazio},
  title = {Bi3D: Stereo Depth Estimation via Binary Classifications},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Joint Filtering of Intensity Images and Neuromorphic Events for High-Resolution Noise-Robust Imaging
Zihao W. Wang, Peiqi Duan, Oliver Cossairt, Aggelos Katsaggelos, Tiejun Huang, Boxin Shi


We present a novel computational imaging system with high resolution and low noise. Our system consists of a traditional video camera which captures high-resolution intensity images, and an event camera which encodes high-speed motion as a stream of asynchronous binary events. To process the hybrid input, we propose a unifying framework that first bridges the two sensing modalities via a noise-robust motion compensation model, and then performs joint image filtering. The filtered output represents the temporal gradient of the captured space-time volume, which can be viewed as motion-compensated event frames with high resolution and low noise. Therefore, the output can be widely applied to many existing event-based algorithms that are highly dependent on spatial resolution and noise robustness. In experiments performed on both publicly available datasets and our contributed RGB-DAVIS dataset, we show systematic performance improvements in applications such as high frame-rate video synthesis, feature/corner detection and tracking, as well as high dynamic range image reconstruction.
[frame, video, temporal, recognition, three, time, visual, dataset] [guided, detection, edge, corner, tracking, framework] [noise] [event, flow, gef, filtering, intensity, high, motion, resolution, contrast, pattern, sensing, denoising, ieee, low, spatial, figure, davide, hdr, dynamic, warped, patch, output, range, compressive, tref, guillermo, henri, optical, guidance, ltref, jcm, edi, compensation, captured, signal, window, upsampling, super, based, analysis] [image, filtered, perform] [gradient, filter, performance, computational, applied, compared, set, maximization, machine, regularization] [vision, conference, computer, camera, joint, international, reconstruction, well, local, supplementary, hybrid, velocity, system, estimation, additional, term, compare, michael]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Zihao W. and Duan, Peiqi and Cossairt, Oliver and Katsaggelos, Aggelos and Huang, Tiejun and Shi, Boxin},
  title = {Joint Filtering of Intensity Images and Neuromorphic Events for High-Resolution Noise-Robust Imaging},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SGAS: Sequential Greedy Architecture Search
Guohao Li, Guocheng Qian, Itzel C. Delgadillo, Matthias Muller, Ali Thabet, Bernard Ghanem


Architecture design has become a crucial component of successful deep learning. Recent progress in automatic neural architecture search (NAS) shows a lot of promise. However, discovered architectures often fail to generalize in the final evaluation. Architectures with a higher validation accuracy during the search phase may perform worse in the evaluation. Aiming to alleviate this common issue, we introduce sequential greedy architecture search (SGAS), an efficient method for neural architecture search. By dividing the search procedure into sub-problems, SGAS chooses and prunes candidate operations in a greedy fashion. We apply SGAS to search architectures for Convolutional Neural Networks (CNN) and Graph Convolutional Networks (GCN). Extensive experiments show that SGAS is able to find state-of-the-art architectures for tasks such as image classification, point cloud classification and node classification in protein-protein interaction graphs with minimal computational cost.
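A minimal sketch of a greedy selection step of this flavor: score every undecided edge by how peaked its softmax-normalized operation weights are (one minus normalized entropy), then fix the most certain edge to its best operation. This certainty measure is a simplified stand-in for the paper's selection criteria, and the edge/weight structure is illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_select(edge_alphas, decided):
    """edge_alphas: dict edge -> architecture weights over candidate operations.
    Returns the most certain undecided edge and its best operation."""
    best_edge, best_cert, best_op = None, -1.0, None
    for edge, alpha in edge_alphas.items():
        if edge in decided:
            continue
        p = softmax(alpha)
        entropy = -np.sum(p * np.log(p + 1e-12)) / np.log(len(p))
        cert = 1.0 - entropy                       # peaked distribution -> high certainty
        if cert > best_cert:
            best_edge, best_cert, best_op = edge, cert, int(np.argmax(p))
    return best_edge, best_op

alphas = {("n0", "n2"): np.array([0.1, 2.0, 0.2]),
          ("n1", "n2"): np.array([0.5, 0.6, 0.4])}
print(greedy_select(alphas, decided=set()))        # picks the peaked edge ("n0","n2"), op 1
```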
[evaluation, graph, node, sequential, gcn, dataset, order, bernard, previous] [edge, final, table, correlation, architectural, feature] [model] [cell, convolutional, ieee, pattern, figure, phase, channel, method, comparison, based, high] [image, discrepancy] [search, architecture, sgas, neural, best, gradient, learning, greedy, size, criterion, network, manual, selection, classification, performance, accuracy, training, operation, arxiv, preprint, deep, weight, random, discovered, small, kendall, space, ppi, test, batch, validation, efficient, large, searching, average, report, evolution, reduce, imagenet, optimization, standard, rate, quoc, design, computational, larger, sharing] [computer, vision, conference, point, initial, cost, differentiable, normal]
@InProceedings{Li_2020_CVPR,
  author = {Li, Guohao and Qian, Guocheng and Delgadillo, Itzel C. and Muller, Matthias and Thabet, Ali and Ghanem, Bernard},
  title = {SGAS: Sequential Greedy Architecture Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
HVNet: Hybrid Voxel Network for LiDAR Based 3D Object Detection
Maosheng Ye, Shuangjie Xu, Tongyi Cao


We present Hybrid Voxel Network (HVNet), a novel one-stage unified network for point cloud based 3D object detection for autonomous driving. Recent studies show that 2D voxelization with per voxel PointNet style feature extractor leads to accurate and efficient detector for large 3D scenes. Since the size of the feature map determines the computation and memory cost, the size of the voxel becomes a parameter that is hard to balance. A smaller voxel size gives a better performance, especially for small objects, but a longer inference time. A larger voxel can cover the same area with a smaller feature map, but fails to capture intricate features and accurate location for smaller objects. We present a Hybrid Voxel network that solves this problem by fusing voxel feature encoder (VFE) of different scales at point-wise level and project into multiple pseudo-image feature maps. We further propose an attentive voxel feature encoding that outperforms plain VFE and a feature fusion pyramid network to aggregate multi-scale information at feature map level. Experiments on the KITTI benchmark show that a single HVNet achieves the best mAP among all existing methods with a real time inference speed of 31Hz.
[attention, three, speed, encoding, multiple, time] [feature, object, detection, hvnet, map, vfe, bev, car, lidar, pedestrian, head, avfe, location, hard, attentive, cyclist, main, avfeo, extractor, pyramid, achieves, pointpillars, propose, aggregated, backbone, denotes, easy] [input, physical] [scale, ieee, pattern, based, output, fusion, figure, tensor, fast, dynamic, block, method] [loss, corresponding, image, encoder] [set, network, layer, size, inference, performance, validation, knowledge, data, number, max, best, compared, learning, implementation, computation, class, test] [voxel, point, hybrid, cloud, conference, kitti, computer, projection, vision, second, voxelization, projected, sparse, grid, novel, single]
@InProceedings{Ye_2020_CVPR,
  author = {Ye, Maosheng and Xu, Shuangjie and Cao, Tongyi},
  title = {HVNet: Hybrid Voxel Network for LiDAR Based 3D Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Frequency Domain Compact 3D Convolutional Neural Networks
Hanting Chen, Yunhe Wang, Han Shu, Yehui Tang, Chunjing Xu, Boxin Shi, Chao Xu, Qi Tian, Chang Xu


This paper studies the compression and acceleration of 3-dimensional convolutional neural networks (3D CNNs). To reduce the memory cost and computational complexity of deep neural networks, a number of algorithms have been explored by discovering redundant parameters in pre-trained networks. However, most existing methods are designed for processing neural networks consisting of 2-dimensional convolution filters (i.e. image classification and detection) and cannot be straightforwardly applied to 3-dimensional filters (i.e. time series data). In this paper, we develop a novel approach for eliminating redundancy in the time dimension of 3D convolution filters by converting them into the frequency domain through a series of learned optimal transforms with extremely few parameters. Moreover, these transforms are forced to be orthogonal, and the calculation of feature maps can be accomplished in the frequency domain to achieve considerable speed-up rates. Experimental results on benchmark 3D CNN models and datasets demonstrate that the proposed Frequency Domain Compact 3D CNNs (FDC3D) can achieve state-of-the-art performance, e.g. a 2x speed-up ratio on the 3D-ResNet-18 without obviously affecting its accuracy.
[temporal, dataset, video, convert, action, extract] [segmentation, redundant, feature, effectiveness, table, tumor, denotes, mask] [input, original, suitable, model] [proposed, frequency, convolution, convolutional, method, compression, transform, cnns, figure, channel, dct, medical, compressed, eliminating, conventional, ieee, spatial, converting, based, designed, coefficient] [domain, image, eliminate] [neural, pruning, dimension, optimal, network, deep, redundancy, learning, achieve, matrix, layer, filter, number, arxiv, computational, accuracy, compressing, increased, preprint, data, performance, rate, compact, complexity, higher, training, efficient, size, learned, converted, computation, reduced, sparsity, discard] [dimensional, demonstrate, cost, computer, directly, volumetric, conference, novel, single]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Hanting and Wang, Yunhe and Shu, Han and Tang, Yehui and Xu, Chunjing and Shi, Boxin and Xu, Chao and Tian, Qi and Xu, Chang},
  title = {Frequency Domain Compact 3D Convolutional Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Single-Image HDR Reconstruction by Learning to Reverse the Camera Pipeline
Yu-Lun Liu, Wei-Sheng Lai, Yu-Sheng Chen, Yi-Lung Kao, Ming-Hsuan Yang, Yung-Yu Chuang, Jia-Bin Huang


Recovering a high dynamic range (HDR) image from a single low dynamic range (LDR) input image is challenging due to missing details in under-/over-exposed regions caused by quantization and saturation of camera sensors. In contrast to existing learning-based methods, our core idea is to incorporate the domain knowledge of the LDR image formation pipeline into our model. We model the HDR-to-LDR image formation pipeline as the (1) dynamic range clipping, (2) non-linear mapping from a camera response function, and (3) quantization. We then propose to learn three specialized CNNs to reverse these steps. By decomposing the problem into specific sub-tasks, we impose effective physical constraints to facilitate the training of individual sub-networks. Finally, we jointly fine-tune the entire model end-to-end to reduce error accumulation. With extensive quantitative and qualitative experiments on diverse image datasets, we demonstrate that the proposed method performs favorably against state-of-the-art single-image HDR reconstruction algorithms.
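The three inverse stages can be mirrored with placeholder operators, as in the sketch below: dequantization, an inverse camera response (here a fixed gamma instead of a learned CRF), and hallucination of clipped highlights (here a crude multiplicative boost instead of a learned network). Only the decomposition, not the paper's learned components, is illustrated.

```python
import numpy as np

def dequantize(ldr):
    """Stage 1: undo quantization (paper: learned CNN; here just rescaling)."""
    return ldr.astype(np.float64) / 255.0

def inverse_crf(ldr_nonlinear, gamma=2.2):
    """Stage 2: map nonlinear LDR values back to linear irradiance
    (paper: learned per-image CRF; here a fixed gamma curve)."""
    return ldr_nonlinear ** gamma

def hallucinate_highlights(linear, sat_thresh=0.95, boost=4.0, gamma=2.2):
    """Stage 3: expand clipped (over-exposed) regions beyond the LDR range
    (paper: learned hallucination network; here a crude multiplicative boost)."""
    hdr = linear.copy()
    hdr[linear >= sat_thresh ** gamma] *= boost
    return hdr

ldr = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
hdr = hallucinate_highlights(inverse_crf(dequantize(ldr)))
```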
[visual, three, predict] [crf, table, edge, response, mask] [model, input, noise] [hdr, ldr, inverse, dynamic, range, method, formation, histogram, figure, high, proposed, hdrcnn, expandnet, drtmo, ynth, eal, existing, monotonically, tone, radiometric, pixel, perceptual, captured, reverse, performs, convolutional, output, imaging, quantitative, intensity, dequantization] [image, loss, missing, mapping, train, real, learn, generate, user] [deep, training, linear, quantization, learning, increasing, set, reduce, network, evaluate, filter, soft, performance, design] [reconstruction, camera, pipeline, single, acm, estimate, reconstructed, recovering, constraint, error, scene, well, accurate]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yu-Lun and Lai, Wei-Sheng and Chen, Yu-Sheng and Kao, Yi-Lung and Yang, Ming-Hsuan and Chuang, Yung-Yu and Huang, Jia-Bin},
  title = {Single-Image HDR Reconstruction by Learning to Reverse the Camera Pipeline},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DNU: Deep Non-Local Unrolling for Computational Spectral Imaging
Lizhi Wang, Chen Sun, Maoqing Zhang, Ying Fu, Hua Huang


Computational spectral imaging has been striving to capture the spectral information of the dynamic world in the last few decades. In this paper, we propose an interpretable neural network for computational spectral imaging. First, we introduce a novel data-driven prior that can adaptively exploit both the local and non-local correlations among the spectral image. Our data-driven prior is integrated as a regularizer into the reconstruction problem. Then, we propose to unroll the reconstruction problem into an optimization-inspired deep neural network. The architecture of the network has high interpretability by explicitly characterizing the image correlation and the system imaging model. Finally, we learn the complete parameters in the network through end-to-end training, enabling robust performance with high spatial-spectral fidelity. Extensive simulation and hardware experiments validate the superior performance of our method over state-of-the-art methods.
[dataset, explicitly, natural, exploit, previous] [propose, branch] [model, quality, iterative] [spectral, prior, method, ieee, pattern, imaging, compressive, figure, sensing, hyperspectral, coded, nls, aperture, proposed, based, sam, snapshot, denoising, analysis, spatial, recursion, residual, psnr, hscnn, twist, block, inverse, icvl, signal, proximal, convolutional, harvard, bpdn, sslr, ssim, unrolling, adaptively, high, cassi, hyperreconnet] [image, interpretable, learn, autoencoder, real, representation] [network, deep, optimization, neural, computational, matrix, learning, problem, training, performance, diagonal, linear, hardware, machine, regularization, set, processing, arg, compared, general, operation, accuracy] [local, conference, computer, reconstruction, vision, system, sparse, international, solve, capture, structure, rmse]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Lizhi and Sun, Chen and Zhang, Maoqing and Fu, Ying and Huang, Hua},
  title = {DNU: Deep Non-Local Unrolling for Computational Spectral Imaging},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Single Image Optical Flow Estimation With an Event Camera
Liyuan Pan, Miaomiao Liu, Richard Hartley


Event cameras are bio-inspired sensors that asynchronously report intensity changes at microsecond resolution. DAVIS can capture high dynamics of a scene and simultaneously output high temporal resolution events and low frame-rate intensity images. In this paper, we propose an optical flow estimation approach based on a single (potentially blurred) image and events. First, we demonstrate how events can be used to improve flow estimates. To this end, we encode the relation between flow and events effectively by presenting an event-based photometric consistency formulation. Then, we consider the special case of image blur caused by high dynamics in the visual environment and show that including the blur formation in our model further constrains flow estimation. This is in sharp contrast to existing works that ignore blurred images, whereas our formulation can naturally handle either blurred or sharp images to achieve accurate flow estimation. Finally, we reduce flow estimation, as well as image deblurring, to an alternative optimization problem of an objective function using the primal-dual algorithm. Experimental results on both synthetic and real data (with blurred and non-blurred images) show the superiority of our model in comparison to state-of-the-art approaches.
[dataset, video, time, frame, visual, provide, temporal, relation] [table, propose, davis, framework, map, tracking] [model, input] [flow, optical, event, ieee, blurred, based, intensity, pattern, motion, deblurring, blur, edi, result, high, june, dynamic, sharp, brightness, method, constancy, deblurred, selflow, pixel, formation, dual, sintel, output, kernel, figure, asynchronous] [image, real, latent, synthetic] [function, data, learning, algorithm, optimization, min, deep, achieve, rate, neural, objective] [estimation, estimate, single, error, camera, estimated, vision, approach, defined, handle, term, constraint, scene, reconstruct, reconstructed, stereo, demonstrate, photometric, well, computer, computed, reconstruction, jointly]
@InProceedings{Pan_2020_CVPR,
  author = {Pan, Liyuan and Liu, Miaomiao and Hartley, Richard},
  title = {Single Image Optical Flow Estimation With an Event Camera},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-View Neural Human Rendering
Minye Wu, Yuehao Wang, Qiang Hu, Jingyi Yu


We present an end-to-end Neural Human Renderer (NHR) for dynamic human captures under the multi-view setting. NHR adopts PointNet++ for feature extraction (FE) to enable robust 3D correspondence matching on low-quality, dynamic 3D reconstructions. To render new views, we map 3D features onto the target camera as a 2D feature map and employ an anti-aliased CNN to handle holes and noise. Newly synthesized views from NHR can be further used to construct visual hulls to handle textureless and/or dark regions such as black clothing. Comprehensive experiments show NHR significantly outperforms the state-of-the-art neural and image-based rendering techniques, especially on hands, hair, noses, feet, etc.
[visual, video, time, modeling, individual, work] [feature, map, mask, semantic, module, foreground, refinement, final] [quality, model, improve, strong, christian, input, technique] [dynamic, high, based, ieee, figure, resolution, pattern, low, captured, color] [image, target, produce, appearance, texture, synthesized] [neural, network, number, training, set, process, sampled, learning, deep, layer] [point, rendering, human, nhr, cloud, geometry, reconstruction, view, computer, render, depth, conference, camera, handle, shape, acm, michael, hull, directly, rgb, body, renderer, capture, dome, approach, vision, rendered, recovered, volume, dense, textureless]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Minye and Wang, Yuehao and Hu, Qiang and Yu, Jingyi},
  title = {Multi-View Neural Human Rendering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Depth Sensing Beyond LiDAR Range
Kai Zhang, Jiaxin Xie, Noah Snavely, Qifeng Chen


Depth sensing is a critical component of autonomous driving technologies, but today's LiDAR- or stereo-camera-based solutions have limited range. We seek to increase the maximum range of self-driving vehicles' depth perception modules for the sake of better safety. To that end, we propose a novel three-camera system that utilizes small-field-of-view cameras. Our system, along with our novel algorithm for computing metric depth, does not require full pre-calibration and can output dense depth maps with practically acceptable accuracy for scenes and objects at long distances not well covered by most commercial LiDARs.
[step, three, driving, recognition, vehicle, pair, shift] [map, autonomous, offset, propose, marked] [input, case, example] [disparity, method, figure, pixel, clr, pattern, motion, proposed, rectification, affine, sensing, range, output, existing, removal] [image, synthetic, corresponding, unknown] [small, problem, set, algorithm, setup, matrix, randomly, sampled] [depth, camera, left, stereo, estimated, ambiguity, computer, view, estimation, relative, vision, clb, sfm, system, distance, calibration, loop, richard, dense, solution, approach, error, novel, distant, intrinsics, fov, structure, fundamental, matching, pose, full, well, accurate, telephoto, rotation, assume, focal, estimate]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Kai and Xie, Jiaxin and Snavely, Noah and Chen, Qifeng},
  title = {Depth Sensing Beyond LiDAR Range},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Event Probability Mask (EPM) and Event Denoising Convolutional Neural Network (EDnCNN) for Neuromorphic Cameras
R. Wes Baldwin, Mohammed Almatrafi, Vijayan Asari, Keigo Hirakawa


This paper presents a novel method for labeling real-world neuromorphic camera sensor data by calculating the likelihood of generating an event at each pixel within a short time window, which we refer to as "event probability mask" or EPM. Its applications include (i) objective benchmarking of event denoising performance, (ii) training convolutional neural networks for noise removal called "event denoising convolutional neural network" (EDnCNN), and (iii) estimating internal neuromorphic camera parameters. We provide the first dataset (DVSNOISE20) of real-world labeled neuromorphic camera events for noise removal.
[time, temporal, moving, dataset, multiple] [object, edge, detection, feature, threshold, tracking, mask] [noise, change, actual, trained, internal, input] [event, dvs, denoising, neuromorphic, edncnn, ieee, aps, figure, epm, pixel, spatial, intensity, rpmd, benchmarking, noisy, pattern, range, asynchronous, davide, dynamic, motion, optical, sensor, high, ryad, convolutional, method, likelihood, designed, filtering, window, tobi, guillermo, henri, exact, timing, contrast] [generated, real, mapping, image] [data, log, probability, performance, learning, training, test, neural, network, classification, arxiv, preprint, objective, label, machine] [camera, vision, conference, computer, international, scene, hypothesis, neighborhood, simulated, local]
@InProceedings{Baldwin_2020_CVPR,
  author = {Baldwin, R. Wes and Almatrafi, Mohammed and Asari, Vijayan and Hirakawa, Keigo},
  title = {Event Probability Mask (EPM) and Event Denoising Convolutional Neural Network (EDnCNN) for Neuromorphic Cameras},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud
Weijing Shi, Raj Rajkumar


In this paper, we propose a graph neural network to detect objects from a LiDAR point cloud. Towards this end, we encode the point cloud efficiently in a fixed radius near-neighbors graph. We design a graph neural network, named Point-GNN, to predict the category and shape of the object that each vertex in the graph belongs to. In Point-GNN, we propose an auto-registration mechanism to reduce translation variance, and also design a box merging and scoring operation to combine detections from multiple vertices accurately. Our experiments on the KITTI benchmark show the proposed approach achieves leading accuracy using the point cloud alone and can even surpass fusion-based algorithms. Our results demonstrate the potential of using the graph neural network as a new approach for 3D object detection. The code is available at https://github.com/WeijingShi/Point-GNN.
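A minimal NumPy/SciPy sketch of the two ingredients described above, the fixed-radius near-neighbors graph and an iterative vertex update with max aggregation over edges, is given below. The single-matrix "MLPs", feature sizes, and radius are illustrative assumptions, and the auto-registration offsets and box merging/scoring steps of Point-GNN are omitted.

import numpy as np
from scipy.spatial import cKDTree

def build_radius_graph(points, radius):
    """Fixed-radius near-neighbors graph: neighbor index list per point."""
    tree = cKDTree(points)
    return tree.query_ball_tree(tree, r=radius)

def relu(x):
    return np.maximum(x, 0.0)

def gnn_step(points, states, neighbors, W_edge, W_state):
    """One iteration: s_i <- ReLU(max_j ReLU([x_j - x_i, s_j] W_edge) W_state) + s_i."""
    new_states = np.empty_like(states)
    for i, nbrs in enumerate(neighbors):
        offsets = points[nbrs] - points[i]                     # relative coordinates
        edge_in = np.concatenate([offsets, states[nbrs]], axis=1)
        messages = relu(edge_in @ W_edge)                      # per-edge features
        new_states[i] = relu(messages.max(axis=0) @ W_state) + states[i]
    return new_states

# toy usage
pts = np.random.rand(200, 3).astype(np.float32)
st = np.random.rand(200, 16).astype(np.float32)
nbrs = build_radius_graph(pts, radius=0.2)
W_e = 0.1 * np.random.randn(3 + 16, 16).astype(np.float32)
W_s = 0.1 * np.random.randn(16, 16).astype(np.float32)
st = gnn_step(pts, st, nbrs, W_e, W_s)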
[graph, gnn, state, regular, three, dataset, mechanism, multiple, time, recognition, encode, work, predict, extract] [object, box, detection, bounding, lidar, easy, moderate, merging, hard, table, car, scoring, pedestrian, score, split, feature, propose, bev, downsampled, center, benchmark, background, localization, detect, category, achieves, semantic] [study] [figure, ieee, pattern, proposed, june, method, convolutional, convolution] [image, translation, loss, representation] [neural, network, set, accuracy, classification, average, number, training, learning, size, test, design, reduce, standard, deep, class, algorithm, precision, operation, sample, sampling] [point, cloud, vertex, kitti, vision, computer, conference, approach, grid, scanning, view, initial, radius]
@InProceedings{Shi_2020_CVPR,
  author = {Shi, Weijing and Rajkumar, Raj},
  title = {Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Learning Video Rain Streak Removal: When Cyclic Consistency Meets Temporal Correspondence
Wenhan Yang, Robby T. Tan, Shiqi Wang, Jiaying Liu


In this paper, we address the problem of rain streak removal in video by developing a self-learned rain streak removal method, which does not require any clean ground-truth images in the training process. The method is inspired by the fact that adjacent frames are highly correlated and can be regarded as different versions of the same scene, while rain streaks are randomly distributed along the temporal dimension. With this in mind, we construct a two-stage Self-Learned Deraining Network (SLDNet) to remove rain streaks based on both temporal correlation and consistency. In the first stage, SLDNet utilizes the temporal correlations and learns to predict the clean version of the current frame based on its adjacent rain video frames. In the second stage, SLDNet enforces the temporal consistency among different frames. It takes both the current rain frame and adjacent rain video frames to recover structural details. The first stage is responsible for reconstructing main structures, and the second stage is responsible for extracting structural details. We build our network architecture with two sub-tasks, i.e., motion estimation and rain region detection, and optimize them jointly. Our extensive experiments demonstrate the effectiveness of our method, offering better results both quantitatively and qualitatively.
[video, temporal, frame, current, recurrent, visual, red, work] [background, region, correlation, stage, detection, effectiveness, cnn] [model, clean, input, effective, trained, quality] [rain, ieee, removal, deraining, based, streak, optical, adjacent, pattern, motion, june, method, remove, spaccnn, fastderain, figure, psnr, result, ehnet, detail, convolutional, proposed, prednet, dip, comparison, wenhan, jiaying] [image, consistency, real, loss, aligned, learn, paired, structural] [network, training, deep, learning, denoted, better, denote, large, performance, function, architecture] [computer, vision, estimation, single, sparse, joint, estimated, intrinsic, compare, second, well, ground]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Wenhan and Tan, Robby T. and Wang, Shiqi and Liu, Jiaying},
  title = {Self-Learning Video Rain Streak Removal: When Cyclic Consistency Meets Temporal Correspondence},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neuromorphic Camera Guided High Dynamic Range Imaging
Jin Han, Chu Zhou, Peiqi Duan, Yehui Tang, Chang Xu, Chao Xu, Tiejun Huang, Boxin Shi


Reconstructing a high dynamic range image from a single low dynamic range image captured by a frame-based conventional camera, which suffers from over- or under-exposure, is an ill-posed problem. In contrast, recent neuromorphic cameras are able to record high dynamic range scenes in the form of an intensity map, with much lower spatial resolution and without color. In this paper, we propose a neuromorphic camera guided high dynamic range imaging pipeline, together with a network consisting of modules specially designed for each step in the pipeline, which bridges the domain gaps in resolution, dynamic range, and color representation between the two types of sensors and images. A hybrid camera system has been built to validate that the proposed method is able to reconstruct quantitatively and qualitatively high-quality high dynamic range images by successfully fusing the images and intensity maps for various real-world scenarios.
[attention, visual, video, multiple, three] [map, guided, mask, feature, merging, fuse, propose, davis] [input, quality, adversarial] [hdr, intensity, dynamic, range, ldr, color, proposed, high, neuromorphic, luminance, tone, imaging, conventional, fusion, pixel, method, chrominance, upsampling, inverse, figure, pattern, compensation, output, captured, event, spatial, resolution, fused, fsm, designed, reconstructs] [image, loss, mapping, gap, domain, encoder, specific, representation, real] [network, function, deep, data, arxiv, preprint, neural, space, weighting, weight, size, calculated, architecture, training, learning] [camera, computer, vision, acm, reconstruct, conference, single, reconstruction, scene, international, hybrid, pipeline, radiance, system, directly, rgb]
@InProceedings{Han_2020_CVPR,
  author = {Han, Jin and Zhou, Chu and Duan, Peiqi and Tang, Yehui and Xu, Chang and Xu, Chao and Huang, Tiejun and Shi, Boxin},
  title = {Neuromorphic Camera Guided High Dynamic Range Imaging},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning in the Frequency Domain
Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, Fengbo Ren


Deep neural networks have achieved remarkable success in computer vision tasks. Existing neural networks mainly operate in the spatial domain with fixed input sizes. For practical applications, images are usually large and have to be downsampled to the predetermined input size of neural networks. Even though the downsampling operations reduce computation and the required communication bandwidth, they remove both redundant and salient information indiscriminately, which results in accuracy degradation. Inspired by digital signal processing theories, we analyze the spectral bias from the frequency perspective and propose a learning-based frequency selection method to identify the trivial frequency components which can be removed without accuracy loss. The proposed method of learning in the frequency domain leverages identical structures of well-known neural networks, such as ResNet-50, MobileNetV2, and Mask R-CNN, while accepting the frequency-domain information as the input. Experimental results show that learning in the frequency domain with static channel selection can achieve higher accuracy than the conventional spatial downsampling approach while further reducing the input data size. Specifically, for ImageNet classification with the same input size, the proposed method achieves 1.60% and 0.63% top-1 accuracy improvements on ResNet-50 and MobileNetV2, respectively. Even with half the input size, the proposed method still improves the top-1 accuracy on ResNet-50 by 1.42%. In addition, we observe a 0.8% average precision improvement on Mask R-CNN for instance segmentation on the COCO dataset.
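As a rough illustration of how an image plane can be fed to a network as frequency channels, the sketch below reshapes 8x8 block DCT coefficients of a single plane into a (64, H/8, W/8) tensor and then keeps a static subset of channels. The block size, the toy low-frequency selection mask, and the random input are assumptions for illustration; the paper learns which channels to keep.

import numpy as np
from scipy.fft import dctn

def to_dct_channels(plane, block=8):
    """(H, W) image plane -> (block*block, H//block, W//block) frequency channels."""
    H, W = plane.shape
    H, W = H - H % block, W - W % block
    blocks = plane[:H, :W].reshape(H // block, block, W // block, block)
    blocks = blocks.transpose(0, 2, 1, 3)                   # (H/8, W/8, 8, 8)
    coeffs = dctn(blocks, axes=(-2, -1), norm='ortho')      # 2D DCT per block
    return coeffs.reshape(H // block, W // block, block * block).transpose(2, 0, 1)

y_plane = np.random.rand(224, 224).astype(np.float32)       # stand-in for a Y plane
freq = to_dct_channels(y_plane)                              # (64, 28, 28)

# Static channel selection at inference time: keep only a subset of channels.
keep = np.zeros(64, dtype=bool)
keep[:16] = True                                             # toy selection mask
selected = freq[keep]                                        # (16, 28, 28)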
[communication, bandwidth, dataset, static, trivial, visual] [cnn, mask, segmentation, instance, propose, table, object, module, feature, coco, detection, achieves] [input, model, trained] [frequency, channel, method, figure, dct, tensor, proposed, spatial, convolutional, dynamic, conventional, based, existing, compression, ycbcr, color, spectral, low, degradation] [image, domain, loss, train] [learning, selection, data, size, accuracy, classification, inference, neural, imagenet, selected, gate, experiment, deep, higher, lower, task, network, number, validation, baseline, reduce, computation, training, applied, required, bias, average, small, requires, improved] [rgb, heat, shape, vision, approach]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Kai and Qin, Minghai and Sun, Fei and Wang, Yuhao and Chen, Yen-Kuang and Ren, Fengbo},
  title = {Learning in the Frequency Domain},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Polarized Reflection Removal With Perfect Alignment in the Wild
Chenyang Lei, Xuhua Huang, Mengdi Zhang, Qiong Yan, Wenxiu Sun, Qifeng Chen


We present a novel formulation for removing reflection from polarized images in the wild. We first identify the misalignment issues of existing reflection removal datasets, where the collected reflection-free images are not perfectly aligned with the input mixed images due to glass refraction. We then build a new dataset with more than 100 types of glass in which the obtained transmission images are perfectly aligned with the input mixed images. Second, capitalizing on the special relationship between reflection and polarized light, we propose a polarized reflection removal model with a two-stage architecture. In addition, we design a novel perceptual NCC loss that can improve the performance of reflection removal and general image decomposition tasks. We conduct extensive experiments, and the results suggest that our model outperforms state-of-the-art methods on reflection removal.
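For readers unfamiliar with NCC losses, the following PyTorch sketch shows a plain normalized cross-correlation loss between two tensors. It is only a sketch: the paper's perceptual NCC loss is computed on deep feature maps, which is not reproduced here.

import torch

def ncc_loss(pred, target, eps=1e-6):
    """1 - normalized cross-correlation, averaged over the batch."""
    b = pred.shape[0]
    p = pred.reshape(b, -1)
    t = target.reshape(b, -1)
    p = p - p.mean(dim=1, keepdim=True)        # zero-mean per sample
    t = t - t.mean(dim=1, keepdim=True)
    ncc = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)
    return (1.0 - ncc).mean()

# toy usage on random "transmission" predictions and targets
loss = ncc_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))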
[dataset, previous, work] [propose, wei, background, table, feature] [input, model, collected, perfect, difference, degree, study, type, collect, exists] [reflection, transmission, glass, removal, raw, method, light, pncc, intensity, based, zhang, figure, perceptual, wieschollek, remove, assumption, bdn, perfectly, unpolarized, pol, comparison, psnr, exclusion, separation, kong, proposed, gamma, doubledip, qifeng] [image, loss, misalignment, real, alignment, diverse, yang, wen] [data, performance, mixed, learning, network, deep, design, training, layer, set, note, better] [polarization, polarized, single, rgb, pipeline, reflected, well, angle, decomposition, assume, capture, approach, novel, formulation, collection, closeup, directly, michael]
@InProceedings{Lei_2020_CVPR,
  author = {Lei, Chenyang and Huang, Xuhua and Zhang, Mengdi and Yan, Qiong and Sun, Wenxiu and Chen, Qifeng},
  title = {Polarized Reflection Removal With Perfect Alignment in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Multiview 3D Point Cloud Registration
Zan Gojcic, Caifa Zhou, Jan D. Wegner, Leonidas J. Guibas, Tolga Birdal


We present a novel, end-to-end learnable, multiview 3D point cloud registration algorithm. Registration of multiple scans typically follows a two-stage pipeline: the initial pairwise alignment and the globally consistent refinement. The former is often ambiguous due to the low overlap of neighboring point clouds, symmetries and repetitive scene parts. Therefore, the latter global refinement aims at establishing the cyclic consistency across multiple scans and helps in resolving the ambiguous cases. In this paper we propose, to the best of our knowledge, the first end-to-end algorithm for joint learning of both parts of this two-stage problem. Experimental evaluation on well accepted benchmark datasets shows that our approach outperforms the state-of-the-art by a significant margin, while being end-to-end trainable and computationally less costly. Moreover, we present detailed analysis and an ablation study that validate the novel components of our approach. The source code and pretrained models are publicly available under https://github.com/zgojcic/3D_multiview_reg.
[recognition, graph, evaluation, individual, dataset, multiple, pair] [global, confidence, feature, edge, refinement, fully, object, recall, ablation, propose, fed] [input, robust, iterative, study] [ieee, pattern, block, proposed, based, analysis, method, traditional] [translation, loss] [pairwise, learning, network, layer, algorithm, function, deep, data, neural, rij, cij, set, iteration, average, problem, report, pruning, machine, efficient, weighting, good, performance] [registration, point, transformation, conference, computer, vision, multiview, cloud, approach, synchronization, estimation, relative, international, local, scene, well, rotation, fcgf, initial, globally, correspondence, differentiable, tolga, estimated, solution, consistent, scannet, slobodan, tij, indoor, european, geometric]
@InProceedings{Gojcic_2020_CVPR,
  author = {Gojcic, Zan and Zhou, Caifa and Wegner, Jan D. and Guibas, Leonidas J. and Birdal, Tolga},
  title = {Learning Multiview 3D Point Cloud Registration},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Sparse Resultant Based Method for Efficient Minimal Solvers
Snehal Bhayani, Zuzana Kukelova, Janne Heikkila


Many computer vision applications require robust and efficient estimation of camera geometry. The robust estimation is usually based on solving camera geometry problems from a minimal number of input data measurements, i.e. solving minimal problems in a RANSAC framework. Minimal problems often result in complex systems of polynomial equations. Many state-of-the-art efficient polynomial solvers to these problems are based on Grobner basis and the action-matrix method that has been automatized and highly optimized in recent years. In this paper we study an alternative algebraic method for solving systems of polynomial equations, i.e., the sparse resultant-based method, and propose a novel approach to convert the resultant constraint to an eigenvalue problem. This technique can significantly improve the efficiency and stability of existing resultant-based solvers. We applied our new resultant-based method to a large variety of computer vision problems and show that for most of the considered problems, the new method leads to solvers that are the same size as the best available Grobner basis solvers and of similar accuracy. For some problems the new sparse-resultant-based method leads to even smaller and more stable solvers than the state-of-the-art Grobner basis solvers. Our new method can be fully automatized and incorporated into existing tools for automatic generation of efficient polynomial solvers, and as such it represents a competitive alternative to popular Grobner basis methods for minimal problems in computer vision.
[automatic, recognition, length] [extra, table, template] [input, distortion, original, radial, stability] [method, based, proposed, pattern, ieee, partition, coefficient, column, block] [generated, generating, variable, image, generate, real] [matrix, problem, efficient, set, size, algorithm, general, special, equation, random, smaller, computing, indexed, performance, vector, multiplication, machine, number, studied] [polynomial, minimal, basis, computer, pose, vision, monomial, solver, sparse, approach, eigenvalue, resultant, camera, conference, system, relative, form, solving, monomials, focal, international, solve, compute, well, zuzana, geometry, elimination, kalle, alternate, supplementary, formulation, estimation, solution, algebraic]
@InProceedings{Bhayani_2020_CVPR,
  author = {Bhayani, Snehal and Kukelova, Zuzana and Heikkila, Janne},
  title = {A Sparse Resultant Based Method for Efficient Minimal Solvers},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement
Chunle Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, Runmin Cong


The paper presents a novel method, Zero-Reference Deep Curve Estimation (Zero-DCE), which formulates light enhancement as a task of image-specific curve estimation with a deep network. Our method trains a lightweight deep network, DCE-Net, to estimate pixel-wise and high-order curves for dynamic range adjustment of a given image. The curve estimation is specially designed, considering pixel value range, monotonicity, and differentiability. Zero-DCE is appealing in its relaxed assumption on reference images, i.e., it does not require any paired or unpaired data during training. This is achieved through a set of carefully formulated non-reference loss functions, which implicitly measure the enhancement quality and drive the learning of the network. Our method is efficient as image enhancement can be achieved by an intuitive and simple nonlinear curve mapping. Despite its simplicity, we show that it generalizes well to diverse lighting conditions. Extensive experiments on various benchmarks demonstrate the advantages of our method over state-of-the-art methods qualitatively and quantitatively. Furthermore, the potential benefits of our Zero-DCE to face detection in the dark are discussed.
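The quadratic curve used by Zero-DCE, LE(x) = x + alpha * x * (1 - x) applied iteratively with per-pixel curve parameters, is easy to sketch. In the snippet below the random curve-parameter maps stand in for the DCE-Net output; the iteration count of 8 follows the paper, while the rest is an illustrative assumption.

import numpy as np

def apply_curves(image, curve_maps):
    """image: (H, W, 3) in [0, 1]; curve_maps: list of per-pixel alpha maps."""
    x = image
    for a in curve_maps:                 # iterative high-order curve adjustment
        x = x + a * x * (1.0 - x)        # LE(x) = x + alpha * x * (1 - x)
    return np.clip(x, 0.0, 1.0)

img = np.random.rand(128, 128, 3).astype(np.float32)        # stand-in low-light image
alphas = [np.random.uniform(-1, 1, img.shape).astype(np.float32) for _ in range(8)]
enhanced = apply_curves(img, alphas)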
[three, dataset, visual] [wang, detection, map, region, table] [curve, input, face, quality, trained, lexp, study, example, model, difference] [enhancement, color, enhanced, method, illumination, light, exposure, adjustment, proposed, dynamic, range, ieee, enlightengan, figure, dark, result, neighboring, pixel, contrast, lime, convolutional, constancy, lspa, retinexnet, perceptual, spatial, retinex, intensity, version, lcol, ltva, lightweight, reference, existing] [image, loss, paired, unpaired, perform, control, consistency, mapping] [deep, training, data, network, set, parameter, learning, size, average, performance, better, best, design, number] [estimation, local, lighting, smoothness, iteratively]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Chunle and Li, Chongyi and Guo, Jichang and Loy, Chen Change and Hou, Junhui and Kwong, Sam and Cong, Runmin},
  title = {Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
BlendedMVS: A Large-Scale Dataset for Generalized Multi-View Stereo Networks
Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, Long Quan


While deep learning has recently achieved great success on multi-view stereo (MVS), limited training data makes it hard for the trained model to generalize to unseen scenarios. Compared with other computer vision tasks, it is rather difficult to collect a large-scale MVS dataset, as it requires expensive active scanners and a labor-intensive process to obtain ground-truth 3D structures. In this paper, we introduce BlendedMVS, a novel large-scale dataset, to provide sufficient training ground truth for learning-based MVS. To create the dataset, we apply a 3D reconstruction pipeline to recover high-quality textured meshes from images of well-selected scenes. Then, we render these mesh models to color images and depth maps. To introduce the ambient lighting information during training, the rendered color images are further blended with the input images to generate the training input. Our dataset contains over 17k high-resolution images covering a variety of scenes, including cities, architecture, sculptures, and small objects. Extensive experiments demonstrate that BlendedMVS endows the trained model with significantly better generalization ability compared with other MVS datasets. The dataset and pretrained models are available at https://github.com/YoYo000/BlendedMVS.
[dataset, recognition, three, visual, build, extract, evaluation] [map, applies, apply, score] [trained, input, blended, model, generalization, datasets] [pattern, pixel, proposed, color, figure, cnns, range] [image, generate, synthetic, ability, generation, train] [training, validation, network, data, random, learning, deep, better, fixed, set, small, online, size, compared, process, precision, filter] [depth, rendered, computer, vision, point, blendedmvs, ground, dtu, truth, error, mvsnet, camera, textured, stereo, mesh, megadepth, reconstruction, conference, lighting, view, international, reconstructed, cost, pipeline, ambient, variety, cloud, volume, estimation, demonstrate, scene, indoor]
@InProceedings{Yao_2020_CVPR,
  author = {Yao, Yao and Luo, Zixin and Li, Shiwei and Zhang, Jingyang and Ren, Yufan and Zhou, Lei and Fang, Tian and Quan, Long},
  title = {BlendedMVS: A Large-Scale Dataset for Generalized Multi-View Stereo Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Convolution in the Cloud: Learning Deformable Kernels in 3D Graph Convolution Networks for Point Cloud Analysis
Zhi-Hao Lin, Sheng-Yu Huang, Yu-Chiang Frank Wang


Point clouds are among the popular geometry representations for 3D vision applications. However, without regular structures like 2D images, processing and summarizing information over these unordered data points is very challenging. Although a number of previous works attempt to analyze point clouds and achieve promising performance, their performance degrades significantly when data variations such as shifts and scale changes are present. In this paper, we propose 3D Graph Convolution Networks (3D-GCN), which is designed to extract local 3D features from point clouds across scales, while introducing shift and scale-invariance properties. The novelty of our 3D-GCN lies in the definition of learnable kernels with a graph max-pooling mechanism. We show that 3D-GCN can be applied to 3D classification and segmentation tasks, with ablation studies and visualizations verifying the design of 3D-GCN.
[graph, recognition, shift, rnm, extract, associated, describe, describing] [object, segmentation, pooling, feature, table, global, propose, cnn, miou, semantic, denotes] [model, input] [convolution, figure, kernel, receptive, ieee, scale, pattern, convolutional, field, learnable, output, neighboring, deformable] [invariance, corresponding, image, perform, learn] [learning, classification, neural, note, data, number, deep, operation, processing, network, layer, size, set, consider, vector, similarity, standard, denoted, promising, denote, support, ratio, comparable, applied] [point, cloud, vision, conference, computer, shape, local, directional, pointnet, xyz, geometric, unordered, dgcnn, neighbor, kpconv, leonidas, voxel, handle]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Zhi-Hao and Huang, Sheng-Yu and Wang, Yu-Chiang Frank},
  title = {Convolution in the Cloud: Learning Deformable Kernels in 3D Graph Convolution Networks for Point Cloud Analysis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Semi-Supervised Assessor of Neural Architectures
Yehui Tang, Yunhe Wang, Yixing Xu, Hanting Chen, Boxin Shi, Chao Xu, Chunjing Xu, Qi Tian, Chang Xu


Neural architecture search (NAS) aims to automatically design deep neural networks of satisfactory performance. Here, an architecture performance predictor is critical for efficiently evaluating an intermediate neural architecture. However, to train this predictor, a number of neural architectures and their corresponding real performance often have to be collected. In contrast to a classical performance predictor optimized in a fully supervised way, this paper suggests a semi-supervised assessor of neural architectures. We employ an auto-encoder to discover meaningful representations of neural architectures. Taking each neural architecture as an individual instance in the search space, we construct a graph to capture their intrinsic similarities, where both labeled and unlabeled architectures are involved. A graph convolutional neural network is introduced to predict the performance of architectures based on the learned representations and their relation modeled by the graph. Extensive experimental results on the NAS-Benchmark-101 dataset demonstrate that our method significantly reduces the number of fully trained architectures required for finding efficient architectures.
[graph, prediction, relation, gcn, predict, dataset, construct, constructed, decoder, lrc, artificial] [predicted, table, annotated, denotes] [trained] [based, proposed, method, figure, ieee, convolutional, pattern, cell, output] [loss, representation, encoder, real, common, supervised, image, train] [performance, architecture, neural, search, unlabeled, assessor, labeled, training, space, network, learned, arxiv, preprint, predictor, learning, optimization, similarity, best, ktau, large, massive, ranking, deep, number, randomly, accuracy, classification, selected, random, efficient, algorithm, entire, compared, peephole, path, quoc, evolutionary, weight, process, higher, lrg] [conference, computer, vision, intrinsic, limited, reconstruction, accurate, continuous, system]
@InProceedings{Tang_2020_CVPR,
  author = {Tang, Yehui and Wang, Yunhe and Xu, Yixing and Chen, Hanting and Shi, Boxin and Xu, Chao and Xu, Chunjing and Tian, Qi and Xu, Chang},
  title = {A Semi-Supervised Assessor of Neural Architectures},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning a Reinforced Agent for Flexible Exposure Bracketing Selection
Zhouxia Wang, Jiawei Zhang, Mude Lin, Jiong Wang, Ping Luo, Jimmy Ren


Automatically selecting an exposure bracketing (images exposed differently) is important for obtaining a high dynamic range image via multi-exposure fusion. Unlike previous methods that have many restrictions, such as requiring a camera response function, a sensor noise model, and a stream of preview images with different exposures (not accessible in some scenarios, e.g., mobile applications), we propose a novel deep neural network to automatically select the exposure bracketing, named EBSNet, which is sufficiently flexible and free of the above restrictions. EBSNet is formulated as a reinforced agent that is trained by maximizing rewards provided by a multi-exposure fusion network (MEFNet). By utilizing the illumination and semantic information extracted from just a single auto-exposure preview image, EBSNet is able to select an optimal exposure bracketing for multi-exposure fusion. EBSNet and MEFNet can be jointly trained to produce favorable results against recent state-of-the-art approaches. To facilitate future research, we provide a new benchmark dataset for multi-exposure selection and fusion.
[reward, dataset, reinforcement, agent, stream, considering, ten, time, three, policy, future, provide, considers] [semantic, branch, feature, propose, response, table, fuse, building, framework, effectiveness] [trained, model, input, noise] [exposure, bracketing, proposed, ebsnet, preview, hdr, mefnet, fusion, method, illumination, dynamic, range, figure, saturated, high, barakat, billboard, existing, beek, ldr, based, dark, psnr, captured, irradiance, histogram, recover, flexible] [image, generate, generated, train, loss] [selected, selection, network, training, learning, neural, select, optimal, better, number, deep, set, candidate, large, update, function, distribution, performance, mobile] [scene, ground, single, camera, well, truth, capture, additional, jointly, joint, facilitate]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Zhouxia and Zhang, Jiawei and Lin, Mude and Wang, Jiong and Luo, Ping and Ren, Jimmy},
  title = {Learning a Reinforced Agent for Flexible Exposure Bracketing Selection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CARS: Continuous Evolution for Efficient Neural Architecture Search
Zhaohui Yang, Yunhe Wang, Xinghao Chen, Boxin Shi, Chao Xu, Chunjing Xu, Qi Tian, Chang Xu


Search techniques in most existing neural architecture search (NAS) algorithms are dominated by differentiable methods for efficiency reasons. In contrast, we develop an efficient continuous evolutionary approach for searching neural networks. Architectures in the population that share parameters within one SuperNet in the latest generation are tuned on the training dataset for a few epochs. The search in the next evolution generation directly inherits both the SuperNet and the population, which accelerates optimal network generation. A non-dominated sorting strategy is further applied to preserve only results on the Pareto front for accurately updating the SuperNet. Several neural networks with different model sizes and performances are produced after the continuous search with only 0.4 GPU days. As a result, our framework provides a series of networks with the number of parameters ranging from 3.7M to 5.1M under mobile settings. These networks surpass those produced by the state-of-the-art methods on the benchmark ImageNet dataset.
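The non-dominated sorting step that keeps only Pareto-optimal architectures can be sketched in a few lines of Python. Here each candidate is reduced to an (accuracy, parameter count) pair; SuperNet weight sharing and the evolution operators themselves are omitted, and the numbers are made up for illustration.

def pareto_front(candidates):
    """candidates: list of (accuracy, num_params); return the non-dominated subset."""
    front = []
    for i, (acc_i, size_i) in enumerate(candidates):
        dominated = any(
            acc_j >= acc_i and size_j <= size_i and (acc_j > acc_i or size_j < size_i)
            for j, (acc_j, size_j) in enumerate(candidates) if j != i)
        if not dominated:
            front.append((acc_i, size_i))
    return front

# toy population: (0.89, 4.0M) is dominated by (0.90, 3.7M) and gets discarded
pool = [(0.92, 5.1e6), (0.90, 3.7e6), (0.91, 4.9e6), (0.89, 4.0e6)]
print(pareto_front(pool))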
[time, considers, trap, speed, step] [stage, benchmark] [model, acc] [figure, proposed, convolutional, range, method, chao, kai] [generation, train, image, corresponding] [architecture, search, evolution, neural, optimization, network, accuracy, supernet, set, parameter, efficient, searched, gradient, searching, number, algorithm, small, updating, size, yunhe, large, training, pareto, chang, sorting, chunjing, evolutionary, strategy, higher, larger, performance, learning, cutout, population, better, evaluate, update, increasing, select, quoc, validation, sharing, maintained, smaller, eevo, ratio, manual, snas, reduction, deep, optimal, latency, initialize, evolve] [continuous, cost, normal, thomas]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Zhaohui and Wang, Yunhe and Chen, Xinghao and Shi, Boxin and Xu, Chao and Xu, Chunjing and Tian, Qi and Xu, Chang},
  title = {CARS: Continuous Evolution for Efficient Neural Architecture Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Joint 3D Instance Segmentation and Object Detection for Autonomous Driving
Dingfu Zhou, Jin Fang, Xibin Song, Liu Liu, Junbo Yin, Yuchao Dai, Hongdong Li, Ruigang Yang


Currently, in Autonomous Driving (AD), most 3D object detection frameworks (either anchor-based or anchor-free) treat detection as a Bounding Box (BBox) regression problem. However, this compact representation is not sufficient to explore all the information of the objects. To tackle this problem, we propose a simple but practical detection framework to jointly predict the 3D BBox and instance segmentation. For instance segmentation, we propose a Spatial Embeddings (SEs) strategy to assemble all foreground points into their corresponding object centers. Based on the SE results, the object proposals can be generated with a simple clustering strategy. For each cluster, only one proposal is generated, so the Non-Maximum Suppression (NMS) process is no longer needed. Finally, with our proposed instance-aware ROI pooling, the BBox is refined by a second-stage network. Experimental results on the public KITTI dataset show that the proposed SEs can significantly improve the instance segmentation results compared with other feature-embedding-based methods. Meanwhile, it also outperforms most of the 3D object detectors on the KITTI testing benchmark.
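The spatial-embedding idea, foreground points vote for their object centers and a clustering step then yields one proposal per cluster, can be sketched as follows. DBSCAN and its radius are a stand-in for the simple clustering strategy mentioned above, and the point coordinates and predicted offsets are fabricated toy data.

import numpy as np
from sklearn.cluster import DBSCAN

points = np.random.rand(500, 3) * 10.0               # foreground LiDAR points (toy)
predicted_offsets = 0.05 * np.random.randn(500, 3)   # per-point offsets from the network (toy)
shifted = points + predicted_offsets                  # points assemble around object centers

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(shifted)
# One proposal per cluster (label -1 is noise), so no NMS is required afterwards.
proposals = [shifted[labels == k].mean(axis=0) for k in set(labels) if k != -1]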
[embedding, dataset, evaluation, represent, prediction, work, three, driving] [object, instance, segmentation, detection, bbox, lidar, feature, proposal, semantic, framework, autonomous, predicted, bboxes, employed, dingfu, easy, mask, center, backbone, pointrcnn, yuchao, ruigang, regression, foreground, stage, region, roi, global, grouping, china] [public, testing, experimental] [proposed, ieee, based, pattern, spatial, method, convolutional, hongdong, figure, designed, comparison, xibin] [loss, generated, image, representation] [network, deep, learning, clustering, top, neural, applied, number, validation, performance, data, randomly, compared] [point, conference, computer, vision, kitti, cloud, ground, truth, international, second, local, shape, well, directly, approach, jointly, direction, joint]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Dingfu and Fang, Jin and Song, Xibin and Liu, Liu and Yin, Junbo and Dai, Yuchao and Li, Hongdong and Yang, Ruigang},
  title = {Joint 3D Instance Segmentation and Object Detection for Autonomous Driving},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
View-GCN: View-Based Graph Convolutional Network for 3D Shape Analysis
Xin Wei, Ruixuan Yu, Jian Sun


The view-based approach, which recognizes a 3D shape through its projected 2D images, has achieved state-of-the-art results for 3D shape recognition. The major challenge for view-based approaches is how to aggregate multi-view features into a global shape descriptor. In this work, we propose a novel view-based Graph Convolutional Neural Network, dubbed view-GCN, to recognize 3D shapes based on a graph representation of multiple views in flexible view configurations. We first construct a view-graph with multiple views as graph nodes, then design a graph convolutional neural network over the view-graph to hierarchically learn a discriminative shape descriptor considering the relations among views. The view-GCN is a hierarchical network based on local and non-local graph convolution for feature transform, and selective view-sampling for graph coarsening. Extensive experiments on benchmark datasets show that view-GCN achieves state-of-the-art results for 3D shape classification and retrieval.
[graph, node, multiple, selective, message, hierarchical, recognition, passing, gcn, coarsened, retrieval, represent, dataset, hierarchically, considering, attention, sequential, irregular, viewgcn, construct] [feature, achieves, object, global, aggregate, table, instance, backbone, level, fps, including, map] [input, model] [convolutional, based, convolution, neighboring, method, ieee, figure] [representation, loss, learn, image, discriminative, real] [network, classification, learning, neural, training, class, deep, design, compared, matrix, updated, rate, sampled, higher, strategy, architecture, accuracy, evaluate] [shape, view, local, defined, rotationnet, novel, camera, shapenet, point, mvcnn, descriptor, multiview]
@InProceedings{Wei_2020_CVPR,
  author = {Wei, Xin and Yu, Ruixuan and Sun, Jian},
  title = {View-GCN: View-Based Graph Convolutional Network for 3D Shape Analysis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Collaborative Distillation for Ultra-Resolution Universal Style Transfer
Huan Wang, Yijun Li, Yuehai Wang, Haoji Hu, Ming-Hsuan Yang


Universal style transfer methods typically leverage rich representations from deep Convolutional Neural Network (CNN) models (e.g., VGG-19) pre-trained on large collections of images. Despite their effectiveness, their application is heavily constrained by the large model size when handling ultra-resolution images given limited memory. In this work, we present a new knowledge distillation method (named Collaborative Distillation) for encoder-decoder based neural style transfer to reduce the number of convolutional filters. The main idea is underpinned by the finding that encoder-decoder pairs construct an exclusive collaborative relationship, which is regarded as a new kind of knowledge for style transfer models. Moreover, to overcome the feature-size mismatch when applying collaborative distillation, a linear embedding loss is introduced to drive the student network to learn a linear embedding of the teacher's features. Extensive experiments show the effectiveness of our method when applied to different universal style transfer approaches (WCT and AdaIN), even when the model size is reduced by 15.5 times. In particular, on WCT with the compressed models, we achieve ultra-resolution (over 40 megapixels) universal style transfer on a 12GB GPU for the first time. Further experiments on an optimization-based stylization scheme show the generality of our algorithm across different stylization paradigms. Our code and trained models are available at https://github.com/mingsun-tse/collaborative-distillation.
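The linear embedding loss can be written in a few lines: a learned linear (1x1 convolution) map projects the narrow student features to the teacher's channel dimension before matching. The feature sizes and the MSE matching below are illustrative assumptions; the paper applies this between specific VGG encoder layers of teacher and student.

import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_feat = torch.rand(4, 512, 32, 32)    # wide (frozen) teacher features
student_feat = torch.rand(4, 128, 32, 32)    # narrow student features

# Learn a linear map from student channels to teacher channels, then match.
proj = nn.Conv2d(128, 512, kernel_size=1, bias=False)   # the linear embedding
embed_loss = F.mse_loss(proj(student_feat), teacher_feat)
embed_loss.backward()                        # gradients flow into the projection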
[collaborative, decoder, work, embedding, relationship, visual, three] [feature, propose] [model, original, universal, trained, input, middle, study, exclusive] [compressed, convolutional, method, proposed, compression, comparison, figure, based, collaboration, output, perceptual, visually] [style, transfer, image, encoder, stylized, content, stylization, loss, wct, adain, user, gatys, lcollab, synthesis, learn, arbitrary, train, specific, gram, texture, yijun, mismatch] [neural, deep, size, knowledge, small, linear, distillation, network, large, arxiv, preprint, matrix, fewer, learning, training, achieve, scheme, classification, architecture, efficient, typically, student, collaborator, comparable, pruning, algorithm, considerable] [limited, vision]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Huan and Li, Yijun and Wang, Yuehai and Hu, Haoji and Yang, Ming-Hsuan},
  title = {Collaborative Distillation for Ultra-Resolution Universal Style Transfer},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TomoFluid: Reconstructing Dynamic Fluid From Sparse View Videos
Guangming Zang, Ramzi Idoughi, Congli Wang, Anthony Bennett, Jianguo Du, Scott Skeen, William L. Roberts, Peter Wonka, Wolfgang Heidrich


Visible light tomography is a promising and increasingly popular technique for fluid imaging. However, the use of a sparse set of viewpoints in the capture setup makes the reconstruction of fluid flows very challenging. In this paper, we present a state-of-the-art 4D tomographic reconstruction framework that integrates several regularizers into a multi-scale, matrix-free optimization algorithm. In addition to existing regularizers, we propose two new regularizers for improved results: a regularizer based on view interpolation of projected images and a regularizer to encourage reprojection consistency. We demonstrate our method with extensive experiments on both simulated and real data.
[time, temporal, work, viewing, visual] [framework, propose] [improve, input, quality, highly] [method, figure, flow, captured, tomography, imaging, based, comparison, field, zang, spatial, proposed, light, tomographic, interpolation, prior, optical, dynamic, lsmooth, ieee, capturing, sart, particle, medical, slice] [image, consistency, loss, missing, generated, synthetic, appearance, corresponding, real] [density, set, number, regularizer, data, setup, baseline, optimization, algorithm, regularizers, problem, applied, learning, function] [fluid, reconstruction, view, volume, sparse, projection, camera, novel, reconstruct, approach, reconstructed, term, acm, computer, reprojection, velocity, capture, estimated, smoothness, second, application, laser, estimation, visible, vision]
@InProceedings{Zang_2020_CVPR,
  author = {Zang, Guangming and Idoughi, Ramzi and Wang, Congli and Bennett, Anthony and Du, Jianguo and Skeen, Scott and Roberts, William L. and Wonka, Peter and Heidrich, Wolfgang},
  title = {TomoFluid: Reconstructing Dynamic Fluid From Sparse View Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Instance Shadow Detection
Tianyu Wang, Xiaowei Hu, Qiong Wang, Pheng-Ann Heng, Chi-Wing Fu


Instance shadow detection is a brand new problem, aiming to find shadow instances paired with object instances. To approach it, we first prepare a new dataset called SOBA, named after Shadow-OBject Association, with 3,623 pairs of shadow and object instances in 1,000 photos, each with individual labeled masks. Second, we design LISA, named after Light-guided Instance Shadow-object Association, an end-to-end framework to automatically predict the shadow and object instances, together with the shadow-object associations and light direction. Then, we pair up the predicted shadow and object instances, and match them with the predicted shadow-object associations to generate the final results. In our evaluations, we formulate a new metric named the shadow-object average precision to measure the performance of our results. Further, we conduct various experiments and demonstrate our method's applicability to light direction estimation and photo editing.
[associated, pair, lisa, dataset, predict, individual, minh, work, multiple, three, evaluation] [instance, object, detection, box, mask, predicted, association, bounding, detect, head, branch, ross, piotr, named, framework, final, soba, segmentation, region, kaiming, yago, xiaowei, shadowobject, feature, semantic] [input, dimitris, example, applicability, adversarial] [light, figure, method, remove, called, based, output, result, convolutional, pattern] [shadow, image, photo, loss, produced] [baseline, network, learning, training, find, deep, performance, set, design, problem, average, label, machine] [direction, cast, full, single, pipeline, jointly, approach, match, demonstrate, predicts, estimated]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Tianyu and Hu, Xiaowei and Wang, Qiong and Heng, Pheng-Ann and Fu, Chi-Wing},
  title = {Instance Shadow Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self2Self With Dropout: Learning Self-Supervised Denoising From Single Image
Yuhui Quan, Mingqin Chen, Tongyao Pang, Hui Ji


In the last few years, supervised deep learning has emerged as a powerful tool for image denoising, which trains a denoising network over an external dataset of noisy/clean image pairs. However, the requirement for a high-quality training dataset limits the broad applicability of such denoising networks. Recently, there have been a few works that allow training a denoising network on a set of external noisy images only. Taking one step further, this paper proposes a self-supervised learning method which only uses the input noisy image itself for training. In the proposed method, the network is trained with dropout on pairs of Bernoulli-sampled instances of the input image, and the result is estimated by averaging the predictions generated from multiple instances of the trained model with dropout. The experiments show that the proposed method not only significantly outperforms existing single-image learning or non-learning methods, but is also competitive with denoising networks trained on external datasets.
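A minimal PyTorch sketch of the training and inference recipe described above follows: train with dropout on Bernoulli-sampled instances of the single noisy image, computing the loss only on the dropped pixels, then average many stochastic predictions at test time. The tiny network, sampling probability, and iteration counts are illustrative assumptions, not the Self2Self architecture or schedule.

import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                    nn.Dropout2d(0.3),
                    nn.Conv2d(32, 3, 3, padding=1))
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

noisy = torch.rand(1, 3, 64, 64)                     # the single noisy input image

# Training: predict the pixels dropped from a Bernoulli-sampled copy of the image.
for _ in range(100):
    mask = (torch.rand_like(noisy) > 0.3).float()    # Bernoulli-sampled instance
    pred = net(noisy * mask)
    loss = (((pred - noisy) ** 2) * (1.0 - mask)).mean()   # loss on dropped pixels only
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference: keep dropout active and average multiple stochastic predictions.
net.train()                                          # keeps Dropout2d stochastic
with torch.no_grad():
    denoised = torch.stack([net(noisy) for _ in range(20)]).mean(dim=0)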
[multiple, dataset, outperforms, prediction] [table, pconv, denotes, instance] [trained, noise, input, model, external, identity, effective] [denoising, noisy, method, dip, proposed, blind, nns, conv, existing, comparison, hui, output, figure, competitive, unorganized, denoisers, ksvd, psnr, lei, prerequisite, based] [image, train, loss, inpainting, generate] [training, learning, dropout, bernoulli, deep, performance, sampled, variance, set, data, test, paper, large, number, average, layer, scheme, better, network, sampling, architecture, learned, randomly, probability, standard, denoted, good, reducing, dropping, dictionary, iteration] [single, partial, truth, sparse]
@InProceedings{Quan_2020_CVPR,
  author = {Quan, Yuhui and Chen, Mingqin and Pang, Tongyao and Ji, Hui},
  title = {Self2Self With Dropout: Learning Self-Supervised Denoising From Single Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Discrete Model Compression With Resource Constraint for Deep Neural Networks
Shangqian Gao, Feihu Huang, Jian Pei, Heng Huang


In this paper, we address the problem of compression and acceleration of Convolutional Neural Networks (CNNs). Specifically, we propose a novel structural pruning method to obtain a compact CNN with strong discriminative power. To find such networks, we propose an efficient discrete optimization method that directly optimizes a channel-wise differentiable discrete gate under a resource constraint while freezing all the other model parameters. Although directly optimizing discrete variables is a complex, non-smooth, non-convex and NP-hard problem, our optimization method can circumvent these difficulties by using the straight-through estimator. Thus, our method is able to ensure that the sub-network discovered within the training process reflects the true sub-network. We further extend the discrete gate to its stochastic version in order to thoroughly explore the potential sub-networks. Unlike many previous methods requiring per-layer hyper-parameters, we only require one hyper-parameter to control the FLOPs budget. Moreover, our method is globally discrimination-aware due to the discrete setting. The experimental results on CIFAR-10 and ImageNet show that our method is competitive with state-of-the-art methods.
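The gate-with-straight-through-estimator idea can be sketched as below: the forward pass applies a hard 0/1 channel gate, while the backward pass uses the gradient of its soft (sigmoid) relaxation. The gating parameterization and the simple budget penalty are simplified stand-ins for the paper's constrained discrete optimization.

import torch
import torch.nn as nn

class STEGate(nn.Module):
    """Per-channel binary gate: hard 0/1 forward, soft-sigmoid gradient backward."""
    def __init__(self, channels):
        super().__init__()
        self.logits = nn.Parameter(torch.full((channels,), 0.1))

    def forward(self, x):
        soft = torch.sigmoid(self.logits)
        hard = (soft > 0.5).float()
        gate = hard + soft - soft.detach()        # straight-through estimator
        return x * gate.view(1, -1, 1, 1), gate

gate = STEGate(64)
feat = torch.rand(2, 64, 16, 16)
pruned_feat, g = gate(feat)

# Soft resource penalty: keep the fraction of open channels near a target budget.
budget = 0.5
resource_loss = (g.mean() - budget).abs()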
[outperforms] [propose, feature, jian, cnn, global] [model, magnitude, original, acc] [method, channel, compression, convolutional, dcp, ieee, pattern, comparison, proposed, output, relu] [loss, structural, discriminative] [pruning, gate, discrete, neural, weight, regularization, learning, deep, training, pruned, stochastic, performance, imagenet, function, decay, rate, algorithm, optimization, search, network, better, gradient, accuracy, prune, number, machine, compared, arxiv, preprint, processing, efficient, computational, binary, deterministic, close, resource, problem, achieve, impact, power, architecture, ste, process] [conference, computer, vision, international, estimation, symmetric, differentiable, dmc, constraint, directly]
@InProceedings{Gao_2020_CVPR,
  author = {Gao, Shangqian and Huang, Feihu and Pei, Jian and Huang, Heng},
  title = {Discrete Model Compression With Resource Constraint for Deep Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Structured Compression by Weight Encryption for Unstructured Pruning and Quantization
Se Jung Kwon, Dongsoo Lee, Byeongwook Kim, Parichay Kapoor, Baeseong Park, Gu-Yeon Wei


Model compression techniques, such as pruning and quantization, are becoming increasingly important to reduce memory footprints and the amount of computation. Despite the model size reduction, achieving performance enhancement on devices is still challenging, mainly due to the irregular representations of sparse matrix formats. This paper proposes a new weight representation scheme for Sparse Quantized Neural Networks, specifically those obtained by a fine-grained and unstructured pruning method. The representation is encrypted in a structured regular format, which can be efficiently decoded through an XOR-gate network during inference in a parallel manner. We demonstrate that various deep learning models can be compressed and represented by our proposed format with a fixed and high compression ratio. For example, for the fully-connected layers of AlexNet on the ImageNet dataset, we can represent the sparse weights with only 0.28 bits/weight for 1-bit quantization and a 91% pruning rate, with a fixed decoding rate and full memory bandwidth usage. Decoding through the XOR-gate network can be performed without any model accuracy degradation, with additional patch data incurring only a small overhead.
[decoding, time, structured, bandwidth, regular, represent, order] [seed] [model] [compression, proposed, figure, compressed, high, format, patch, encryption, conventional, method, parallel, remove] [representation, row] [pruning, care, memory, number, matrix, neural, dpatch, nin, npatch, nout, quantization, rate, weight, quantized, reduction, network, deep, data, random, vector, algorithm, fixed, encrypted, large, learning, csr, size, ratio, linear, ternary, pruned, execution, decryption, scheme, imagenet, small, reduced, viterbi, bit, binary, note, dongsoo, reduce, performance, inference, accuracy, parallelism, fifo, processing, alexnet, higher, sqnns, process, xor, byeongwook] [sparse, international, conference, additional, dense, unstructured]
@InProceedings{Kwon_2020_CVPR,
  author = {Kwon, Se Jung and Lee, Dongsoo and Kim, Byeongwook and Kapoor, Parichay and Park, Baeseong and Wei, Gu-Yeon},
  title = {Structured Compression by Weight Encryption for Unstructured Pruning and Quantization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
End-to-End Learning Local Multi-View Descriptors for 3D Point Clouds
Lei Li, Siyu Zhu, Hongbo Fu, Ping Tan, Chiew-Lan Tai


In this work, we propose an end-to-end framework to learn local multi-view descriptors for 3D point clouds. To adopt a similar multi-view representation, existing studies use hand-crafted viewpoints for rendering in a preprocessing stage, which is detached from the subsequent descriptor learning stage. In our framework, we integrate the multi-view rendering into neural networks by using a differentiable renderer, which allows the viewpoints to be optimizable parameters for capturing more informative local context of interest points. To obtain discriminative descriptors, we also design a soft-view pooling module to attentively fuse convolutional features across views. Extensive experiments on existing 3D registration benchmarks show that our method outperforms existing local descriptors both quantitatively and qualitatively.
[eth, recognition, context] [feature, pooling, table, benchmark, recall, object, framework, adopt, module, cnn, propose, achieves] [input, trained, robust] [method, ieee, existing, fusion, convolutional, figure, patch, flow, range, cnns] [representation, image, invariant, perform] [learning, network, average, performance, neural, set, number, deep, gradient, shot, randomly, learned, design, large, computation, size, data] [point, local, rendering, cloud, descriptor, viewpoint, view, differentiable, lmvcnn, shape, registration, cgf, optimizable, voxel, geometric, rendered, outdoor, renderer, fpfh, geometry, approach, well, rotation, surface, depth, matching, hotel, multiview, novel]
@InProceedings{Li_2020_CVPR,
  author = {Li, Lei and Zhu, Siyu and Fu, Hongbo and Tan, Ping and Tai, Chiew-Lan},
  title = {End-to-End Learning Local Multi-View Descriptors for 3D Point Clouds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Minimal Solutions for Relative Pose With a Single Affine Correspondence
Banglei Guan, Ji Zhao, Zhang Li, Fang Sun, Friedrich Fraundorfer


In this paper we present four cases of minimal solutions for two-view relative pose estimation by exploiting the affine transformation between feature points, and we demonstrate efficient solvers for these cases. It is shown that, under the planar motion assumption or with knowledge of a vertical direction, a single affine correspondence is sufficient to recover the relative camera pose. The four cases considered are: closed-form and least-squares solutions for two-view planar relative motion with calibrated cameras, a closed-form solution for unknown focal length, and the case of a known vertical direction. These algorithms can be used efficiently for outlier detection within a RANSAC loop and for initial motion estimation. All the methods are evaluated on both synthetic data and real-world datasets from the KITTI benchmark. The experimental results demonstrate that our methods outperform comparable state-of-the-art methods in accuracy, with the benefit of a reduced number of RANSAC iterations.
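The claim about fewer RANSAC iterations follows from the standard relation N = log(1 - p) / log(1 - w^s) between sample size s, inlier ratio w, and confidence p. A small sketch with illustrative numbers (not values from the paper) compares a single-correspondence solver with a 5-point solver:

    # Standard RANSAC iteration count N = log(1 - p) / log(1 - w**s), where
    # w is the inlier ratio, s the sample size of the minimal solver, and p the
    # desired confidence. Values below are illustrative, not from the paper.
    import math

    def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
        return math.ceil(math.log(1 - confidence) / math.log(1 - inlier_ratio ** sample_size))

    w = 0.5
    print(ransac_iterations(w, sample_size=1))  # 1-correspondence solver: ~7 iterations
    print(ransac_iterations(w, sample_size=5))  # 5-point solver: ~146 iterations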
[visual, three, length] [feature, table] [noise, case, robust, pitch] [motion, affine, method, ieee, proposed, figure, pattern, homography, based, analysis, exploiting] [image, translation, unknown, common, synthetic] [matrix, equation, number, performance, efficient, set, data, random, standard, compared, denote, space, better] [relative, pose, rotation, estimation, camera, planar, minimal, point, error, computer, conference, vision, solution, ground, system, correspondence, vertical, essential, direction, solver, focal, odometry, estimated, estimate, imu, sin, international, single, ransac, angle, truth, kitti, friedrich, transformation, calibrated, additional, roll, monocular]
@InProceedings{Guan_2020_CVPR,
  author = {Guan, Banglei and Zhao, Ji and Li, Zhang and Sun, Fang and Fraundorfer, Friedrich},
  title = {Minimal Solutions for Relative Pose With a Single Affine Correspondence},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Point Cloud Completion by Skip-Attention Network With Hierarchical Folding
Xin Wen, Tianyang Li, Zhizhong Han, Yu-Shen Liu


Point cloud completion aims to infer the complete geometries for missing regions of 3D objects from incomplete ones. Previous methods usually predict the complete point cloud based on the global shape representation extracted from the incomplete input. However, the global representation often suffers from the information loss of structure details on local regions of incomplete point cloud. To address this problem, we propose Skip-Attention Network (SA-Net) for 3D point cloud completion. Our main contributions are two-fold. First, we propose a skip-attention mechanism to effectively exploit the local structure details of incomplete point clouds during the inference of missing parts. The skip-attention mechanism selectively conveys geometric information from the local regions of incomplete point clouds for the generation of complete ones at different resolutions, where the skip-attention reveals the completion process in an interpretable way. Second, in order to fully utilize the selected geometric information encoded by skip-attention mechanism at different resolutions, we propose a novel structure-preserving decoder with hierarchical folding for complete shape generation. The hierarchical folding preserves the structure of complete point cloud generated in upper layer by progressively detailing the local regions, using the skip-attentioned geometry at the same resolution. We conduct comprehensive experiments on ShapeNet and KITTI datasets, which demonstrate that the proposed SA-Net outperforms the state-of-the-art point cloud completion methods.
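A minimal sketch of the skip-attention step, assuming encoder local features of shape (B, N, C) and a decoder query of shape (B, C). The cosine-similarity weighting follows the description in the word list and abstract, but the shapes and the softmax normalization are illustrative rather than the exact SA-Net layer.

    # Minimal sketch of a skip-attention step: weight encoder local features by
    # their cosine similarity to a decoder feature and aggregate.
    import torch
    import torch.nn.functional as F

    def skip_attention(decoder_feat, encoder_feats):
        # decoder_feat: (B, C) query from the decoder at some resolution
        # encoder_feats: (B, N, C) local features of the incomplete point cloud
        sim = F.cosine_similarity(decoder_feat.unsqueeze(1), encoder_feats, dim=-1)  # (B, N)
        attn = torch.softmax(sim, dim=1).unsqueeze(-1)                               # (B, N, 1)
        return (attn * encoder_feats).sum(dim=1)                                     # (B, C)

    out = skip_attention(torch.randn(2, 256), torch.randn(2, 512, 256))
    print(out.shape)  # torch.Size([2, 256])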
[decoder, attention, hierarchical, order, previous, mechanism, dataset, three, predict, selectively] [region, feature, global, level, table, propose, segmentation, visualization, effectiveness, pij, semantic, junwei] [input, wing, variation] [figure, ieee, pattern, resolution, based, proposed, block, learnable, comparison, existing] [encoder, missing, representation, unsupervised, preserve, generate, generated, loss, image, progressively] [learning, network, number, performance, compared, deep, process, evaluate, similarity, neural, cosine, classification] [point, local, shape, cloud, completion, folding, conference, structure, incomplete, computer, complete, zhizhong, vision, geometric, matthias, international, compare, shapenet, kitti, volumetric, directly, plane, foldingnet, zhenbao]
@InProceedings{Wen_2020_CVPR,
  author = {Wen, Xin and Li, Tianyang and Han, Zhizhong and Liu, Yu-Shen},
  title = {Point Cloud Completion by Skip-Attention Network With Hierarchical Folding},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fast-MVSNet: Sparse-to-Dense Multi-View Stereo With Learned Propagation and Gauss-Newton Refinement
Zehao Yu, Shenghua Gao


Almost all previous deep learning-based multi-view stereo (MVS) approaches focus on improving reconstruction quality. Besides quality, efficiency is also a desirable feature for MVS in real scenarios. Towards this end, this paper presents Fast-MVSNet, a novel sparse-to-dense coarse-to-fine framework, for fast and accurate depth estimation in MVS. Specifically, in our Fast-MVSNet, we first construct a sparse cost volume for learning a sparse and high-resolution depth map. Then we leverage a small-scale convolutional neural network to encode the depth dependencies for pixels within a local region to densify the sparse high-resolution depth map. Finally, a simple but efficient Gauss-Newton layer is proposed to further optimize the depth map. On the one hand, the high-resolution depth map, the data-adaptive propagation method and the Gauss-Newton layer jointly guarantee the effectiveness of our method. On the other hand, all modules in our Fast-MVSNet are lightweight and thus guarantee the efficiency of our approach. In addition, our approach is memory-friendly because of the sparse depth representation. Extensive experimental results show that our method is 5 times and 14 times faster than Point-MVSNet and R-MVSNet, respectively, while achieving comparable or even better results on the challenging Tanks and Temples dataset as well as the DTU dataset. Code is available at https://github.com/svip-lab/FastMVSNet.
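The Gauss-Newton layer boils down to the update delta = -(J^T J)^{-1} J^T r for the per-pixel depth; for a scalar depth this collapses to an elementwise expression. A hedged sketch, with random tensors standing in for the multi-view feature residuals and Jacobians actually used in Fast-MVSNet:

    # One Gauss-Newton update for per-pixel depth: with residual r(d) and Jacobian
    # J = dr/dd, the step is delta = -(J^T J)^{-1} J^T r. For a scalar depth per
    # pixel this collapses to a simple elementwise expression.
    import torch

    def gauss_newton_depth_step(residual, jacobian, eps=1e-6):
        # residual, jacobian: (B, K, H, W) -- K residual terms per pixel
        jtj = (jacobian * jacobian).sum(dim=1)          # (B, H, W)
        jtr = (jacobian * residual).sum(dim=1)          # (B, H, W)
        return -jtr / (jtj + eps)                       # depth increment per pixel

    depth = torch.rand(1, 1, 32, 32) * 5
    delta = gauss_newton_depth_step(torch.randn(1, 8, 32, 32), torch.randn(1, 8, 32, 32))
    depth = depth + delta.unsqueeze(1)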
[predict, work, dataset, evaluation, step] [map, propagation, cnn, propose, effectiveness, module, table, refine, refinement, feature, faster] [] [method, ieee, proposed, figure, pattern, convolutional, reference, low, bilateral, high, resolution, upsampling, img, based, comparison, spatial, highresolution] [image, representation, learn] [learning, efficient, layer, network, memory, efficiency, set, deep, strategy, neural, simple, optimization, number, optimize, algorithm, learned, size, training, machine, comparable, better, regularization] [depth, sparse, computer, cost, reconstruction, volume, conference, dense, vision, mvsnet, point, stereo, differentiable, dtu, international, scene, joint, cloud, local, nearest, estimate]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Zehao and Gao, Shenghua},
  title = {Fast-MVSNet: Sparse-to-Dense Multi-View Stereo With Learned Propagation and Gauss-Newton Refinement},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
AANet: Adaptive Aggregation Network for Efficient Stereo Matching
Haofei Xu, Juyong Zhang


Despite the remarkable progress made by learning based stereo matching algorithms, one key challenge remains unsolved. Current state-of-the-art stereo models are mostly based on costly 3D convolutions; the cubic computational complexity and high memory consumption make them quite expensive to deploy in real-world applications. In this paper, we aim at completely replacing the commonly used 3D convolutions to achieve fast inference speed while maintaining comparable accuracy. To this end, we first propose a sparse points based intra-scale cost aggregation method to alleviate the well-known edge-fattening issue at disparity discontinuities. Further, we approximate the traditional cross-scale cost aggregation algorithm with neural network layers to handle large textureless regions. Both modules are simple, lightweight, and complementary, leading to an effective and efficient architecture for cost aggregation. With these two modules, we can not only significantly speed up existing top-performing models (e.g., 41x faster than GC-Net, 4x faster than PSMNet and 38x faster than GA-Net), but also improve the performance of fast stereo models (e.g., StereoNet). We also achieve competitive results on Scene Flow and KITTI datasets while running at 62ms, demonstrating the versatility and high efficiency of the proposed method. Our full framework is available at https://github.com/haofeixu/aanet.
[prediction, regular, visual] [aggregation, feature, object, pyramid, propose, correlation, challenging, final] [model, csa] [disparity, ieee, pattern, adaptive, method, proposed, based, flow, convolution, deformable, psmnet, fast, aanet, traditional, existing, high, pixel, stereonet, window, resolution, scale, epe, figure, analysis, isa, competitive] [image, pseudo, learn, representation] [sampling, efficient, learning, large, performance, set, network, better, computational, achieve, training, test, algorithm, neural, deep, architecture, layer, complexity, memory, function] [cost, stereo, computer, conference, kitti, vision, matching, volume, ground, scene, truth, textureless, local, international, sparse, error, depth]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Haofei and Zhang, Juyong},
  title = {AANet: Adaptive Aggregation Network for Efficient Stereo Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Unified INT8 Training for Convolutional Neural Network
Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, Junjie Yan


Recently, low-bit (e.g., 8-bit) network quantization has been extensively studied to accelerate inference. Beyond inference, low-bit training with quantized gradients can bring further acceleration, since the backward process is often computation-intensive. Unfortunately, inappropriate quantization of backward propagation usually makes training unstable and can even cause it to crash. A successful unified low-bit training framework that supports diverse networks on various tasks is still lacking. In this paper, we attempt to build a unified 8-bit (INT8) training framework for common convolutional neural networks from the aspects of both accuracy and speed. First, we empirically identify four distinctive characteristics of gradients, which provide insightful clues for gradient quantization. Then, we give an in-depth theoretical analysis of the convergence bound and derive two principles for stable INT8 training. Finally, we propose two universal techniques: Direction Sensitive Gradient Clipping, which reduces the direction deviation of gradients, and Deviation Counteractive Learning Rate Scaling, which avoids illegal gradient updates along the wrong direction. The experiments show that our unified solution enables accurate and efficient INT8 training for a variety of networks and tasks, including MobileNetV2, InceptionV3 and object detection, on which prior studies have never succeeded. Moreover, it has the flexibility to run on off-the-shelf hardware, and reduces the training time by 22% on a Pascal GPU without much optimization effort. We believe that this pioneering study will help lead the community towards fully unified INT8 training for convolutional neural networks.
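To make the quantities concrete, the following sketch shows symmetric uniform INT8 quantization of a gradient under a clipping value, together with a cosine-based direction deviation of the kind the two proposed techniques react to; the paper's exact clipping rule and learning-rate scaling function are not reproduced here.

    # Symmetric uniform INT8 quantization of a gradient with a clipping value c,
    # plus the cosine-distance "deviation" between original and quantized gradient.
    import torch

    def quantize_int8(grad, clip_value):
        g = grad.clamp(-clip_value, clip_value)
        scale = clip_value / 127.0
        return torch.round(g / scale) * scale        # dequantized INT8 gradient

    def direction_deviation(grad, grad_q):
        cos = torch.nn.functional.cosine_similarity(grad.flatten(), grad_q.flatten(), dim=0)
        return 1.0 - cos                              # larger => worse direction match

    g = torch.randn(1024) * 0.01
    g_q = quantize_int8(g, clip_value=g.abs().max())
    print(direction_deviation(g, g_q).item())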
[time, build, recognition] [table, unified, pascal, object, framework, including, detection, faster] [deviation, sensitive, original, model, study] [method, convolutional, figure, june, analysis, based, ieee, pattern, integer, low] [loss, common, image] [training, gradient, learning, quantization, clipping, neural, rate, accuracy, deep, quantized, scaling, backward, convergence, efficient, distribution, cosine, network, find, optimization, counteractive, update, achieve, precision, processing, arxiv, preprint, stable, forward, epoch, performance, layer, set, bound, compared, better, theoretical, indicates, reduce, gpus, quantizing, speedup, comparable, acceleration, wageubn] [direction, computer, conference, error, vision, distance]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Feng and Gong, Ruihao and Yu, Fengwei and Liu, Xianglong and Wang, Yanfei and Li, Zhelong and Yang, Xiuqi and Yan, Junjie},
  title = {Towards Unified INT8 Training for Convolutional Neural Network},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Active 3D Motion Visualization Based on Spatiotemporal Light-Ray Integration
Fumihiko Sakaue, Jun Sato


In this paper, we propose a method of visualizing 3D motion with zero latency. This method achieves motion visualization by projecting special high-frequency light patterns on moving objects without using any feedback mechanisms. For this objective, we focus on the time integration of light rays in the sensing system of observers. It is known that the visual system of human observers integrates light rays in a certain period. Similarly, the image sensor in a camera integrates light rays during the exposure time. Thus, our method embeds multiple images into a time-varying light field, such that the observer of the time-varying light field observes completely different images according to the dynamic motion of the scene. Based on this concept, we propose a method of generating special high-frequency patterns of projector lights. After projection onto target objects with projectors, the image observed on the target changes automatically depending on the motion of the objects and without any scene sensing and data analysis. In other words, we achieve motion visualization without the time delay incurred during sensing and computing.
[observed, time, observation, speed, spatiotemporal, frame, static, visual, vehicle, three, multiple, relationship, moving] [object, visualization, horizontal, box] [observer, change, case, visualizing, experimental] [motion, light, method, proposed, figure, projector, screen, integration, field, dynamic, intensity, based, sensing, coded, observes, pattern, exposure, existing, optical, ieee, integrates, delay, range] [image, target, moved, corresponding, visualize, real, representation] [consider, objective, backward, forward, number, rate, indicates, accuracy, active, data, achieve, drastically, computational] [epipolar, projected, camera, point, computer, scene, estimation, axis, planar, projecting, conference, system, projection, relative, computed, position, human, require]
@InProceedings{Sakaue_2020_CVPR,
  author = {Sakaue, Fumihiko and Sato, Jun},
  title = {Active 3D Motion Visualization Based on Spatiotemporal Light-Ray Integration},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Block-Wisely Supervised Neural Architecture Search With Knowledge Distillation
Changlin Li, Jiefeng Peng, Liuchun Yuan, Guangrun Wang, Xiaodan Liang, Liang Lin, Xiaojun Chang


Neural Architecture Search (NAS), aiming at automatically designing network architectures by machines, is expected to bring about a new revolution in machine learning. Despite these high expectations, the effectiveness and efficiency of existing NAS solutions are unclear, with some recent works going so far as to suggest that many existing NAS solutions are no better than random architecture selection. The ineffectiveness of NAS solutions may be attributed to inaccurate architecture evaluation. Specifically, to speed up NAS, recent works have proposed under-training different candidate architectures in a large search space concurrently by using shared network parameters; however, this has resulted in incorrect architecture ratings and furthered the ineffectiveness of NAS. In this work, we propose to modularize the large search space of NAS into blocks to ensure that the potential candidate architectures are fully trained; this reduces the representation shift caused by the shared parameters and leads to the correct rating of the candidates. Thanks to the block-wise search, we can also evaluate all of the candidate architectures within each block. Moreover, we find that the knowledge of a network model lies not only in the network parameters but also in the network architecture. Therefore, we propose to distill the neural architecture (DNA) knowledge from a teacher model to supervise our block-wise architecture search, which significantly improves the effectiveness of NAS. Remarkably, the performance of our searched architectures has exceeded the teacher model, demonstrating the practicability of our method. Finally, our method achieves a state-of-the-art 78.4% top-1 accuracy on ImageNet in a mobile setting. All of our searched models along with the evaluation code are available at https://github.com/changlin31/DNA.
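The block-wise supervision can be pictured as a per-block regression onto the teacher: a candidate student block receives the teacher's previous-block feature and is trained to reproduce the teacher's current-block output. A hedged sketch with illustrative module shapes:

    # Sketch of block-wise knowledge distillation: a candidate student block takes
    # the teacher's previous-block feature as input and regresses the teacher's
    # current-block output. Module names and shapes are illustrative.
    import torch
    import torch.nn as nn

    def block_distill_loss(student_block, teacher_prev_feat, teacher_curr_feat):
        pred = student_block(teacher_prev_feat)
        return nn.functional.mse_loss(pred, teacher_curr_feat)

    student_block = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(64, 64, 3, padding=1))
    t_prev = torch.randn(2, 64, 28, 28)   # teacher feature entering the block
    t_curr = torch.randn(2, 64, 28, 28)   # teacher feature leaving the block
    loss = block_distill_loss(student_block, t_prev, t_curr)
    loss.backward()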
[evaluation, previous, recognition, current] [feature, map, table, effectiveness, propose, fully, achieves, supervision] [model, input, trained] [block, existing, figure, ieee, method, pattern, output, cell, proposed, channel, comparison] [loss, train, shared, representation, gap, image] [search, architecture, network, teacher, supernet, candidate, learning, knowledge, searched, training, neural, space, accuracy, performance, student, distillation, dna, layer, size, find, number, scratch, deep, machine, supervising, best, path, algorithm, evaluate, note, weight, sharing, denote, rate, operation, increase, strategy, better, large, imagenet, fairly, sampled, smaller] [conference, international, computer, vision, single, partial, depth]
@InProceedings{Li_2020_CVPR,
  author = {Li, Changlin and Peng, Jiefeng and Yuan, Liuchun and Wang, Guangrun and Liang, Xiaodan and Lin, Liang and Chang, Xiaojun},
  title = {Block-Wisely Supervised Neural Architecture Search With Knowledge Distillation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GreedyNAS: Towards Fast One-Shot NAS With Greedy Supernet
Shan You, Tao Huang, Mingmin Yang, Fei Wang, Chen Qian, Changshui Zhang


Training a supernet matters for one-shot neural architecture search (NAS) methods since it serves as a basic performance estimator for different architectures (paths). Current methods mainly hold the assumption that a supernet should give a reasonable ranking over all paths. They thus treat all paths equally, and spend much effort training paths. However, it is hard for a single supernet to evaluate accurately on such a huge-scale search space (e.g., 7^21). In this paper, instead of covering all paths, we ease the burden of the supernet by encouraging it to focus more on evaluation of those potentially-good ones, which are identified using a surrogate portion of validation data. Concretely, during training, we propose a multi-path sampling strategy with rejection, and greedily filter the weak paths. The training efficiency is thus boosted since the training space has been greedily shrunk from all paths to those potentially-good ones. Moreover, we further adopt an exploration and exploitation policy by introducing an empirical candidate path pool. Our proposed method GreedyNAS is easy to follow, and experimental results on the ImageNet dataset indicate that it can achieve better Top-1 accuracy under the same search space and FLOPs or latency level, with only 60% of the supernet training cost. By searching on a larger space, our GreedyNAS can also obtain new state-of-the-art architectures.
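The multi-path sampling with rejection can be sketched as: draw several random paths, score them on a small surrogate validation split, and keep only the best ones for the current supernet update. In the sketch below, sample_path and surrogate_score are placeholders for the supernet-specific pieces.

    # Sketch of multi-path sampling with rejection: draw several random paths,
    # score them on a small surrogate validation split, and keep only the
    # potentially-good ones for this supernet update.
    import random

    def greedy_path_batch(sample_path, surrogate_score, n_sample=10, n_keep=5):
        candidates = [sample_path() for _ in range(n_sample)]
        scored = sorted(candidates, key=surrogate_score, reverse=True)
        return scored[:n_keep]            # train the supernet only on these paths

    # toy usage: paths are tuples of per-layer choices, scored by a dummy metric
    paths = greedy_path_batch(
        sample_path=lambda: tuple(random.randrange(7) for _ in range(21)),
        surrogate_score=lambda p: -sum(p))
    print(paths[0])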
[dataset, evaluation, current, exploration] [correlation, table, weak, propose, adopt] [trained, acc] [filtering, ieee, method, figure, based, proposed, comparison, chen, block, coefficient, superiority] [image, loss] [supernet, training, path, sampling, search, pool, candidate, architecture, validation, greedy, greedynas, neural, performance, space, uniform, searching, dval, agood, sample, random, probability, searched, accuracy, evolutionary, arxiv, preprint, rank, efficiency, learning, sampled, size, number, ranking, distribution, algorithm, evaluate, strategy, exploitation, imagenet, larger, optimal, good, greedily, latency, implement, data, aweak, smaller, fei, indicates, report, filter, deep, hardware] [single, conference, computer, vision, cost, constraint, estimator]
@InProceedings{You_2020_CVPR,
  author = {You, Shan and Huang, Tao and Yang, Mingmin and Wang, Fei and Qian, Chen and Zhang, Changshui},
  title = {GreedyNAS: Towards Fast One-Shot NAS With Greedy Supernet},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Filter Pruning Criteria for Deep Convolutional Neural Networks Acceleration
Yang He, Yuhang Ding, Ping Liu, Linchao Zhu, Hanwang Zhang, Yi Yang


Filter pruning has been widely applied to neural network compression and acceleration. Existing methods usually utilize pre-defined pruning criteria, such as Lp-norm, to prune unimportant filters. There are two major limitations to these methods. First, existing methods fail to consider the variety of filter distribution across layers. To extract features from the coarse level to the fine level, the filters of different layers have various distributions. Therefore, it is not suitable to apply the same pruning criteria to different functional layers. Second, prevailing layer-by-layer pruning methods process each layer independently and sequentially, failing to consider that all the layers in the network collaboratively make the final prediction. In this paper, we propose Learning Filter Pruning Criteria (LFPC) to solve the above problems. Specifically, we develop a differentiable pruning criteria sampler. This sampler is learnable and optimized by the validation loss of the pruned network obtained from the sampled criteria. In this way, we can adaptively select the appropriate pruning criteria for different functional layers. In addition, when evaluating the sampled criteria, LFPC comprehensively considers the contribution of all the layers at the same time. Experiments validate our approach on three image classification benchmarks. Notably, on ILSVRC-2012, our LFPC reduces more than 60% FLOPs on ResNet-50 with only 0.83% top-5 accuracy loss.
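For readers unfamiliar with pruning criteria, the sketch below shows the predefined Lp-norm criterion that the paper treats as the baseline: rank the filters of a convolution by their norm and zero out the lowest-ranked ones. LFPC's learnable, per-layer criterion sampler is not reproduced here.

    # What a predefined "pruning criterion" looks like in practice: rank filters of
    # a conv layer by their Lp-norm and zero out the lowest-ranked ones.
    import torch
    import torch.nn as nn

    def prune_filters_by_norm(conv, prune_ratio=0.5, p=1):
        with torch.no_grad():
            scores = conv.weight.abs().pow(p).sum(dim=(1, 2, 3))   # one score per filter
            n_prune = int(prune_ratio * conv.out_channels)
            drop = torch.argsort(scores)[:n_prune]                  # lowest-norm filters
            conv.weight[drop] = 0.0
            if conv.bias is not None:
                conv.bias[drop] = 0.0
        return drop

    conv = nn.Conv2d(64, 128, 3)
    pruned_idx = prune_filters_by_norm(conv, prune_ratio=0.6)
    print(len(pruned_idx))  # 76 filters zeroed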
[previous, three, work, ith] [feature, resnet, map, final, table, achieves, effectiveness] [model, adversarial, norm, conduct, input] [method, convolutional, based, optimized, figure, output, channel, existing, conventional, compression, adaptively, proposed] [loss, utilize] [pruning, filter, pruned, network, neural, learning, deep, training, lfpc, layer, accuracy, set, criterion, fpgm, prune, forward, acceleration, lth, ratio, distribution, number, process, sfp, arxiv, preprint, weight, achieve, computation, denote, higher, pfec, consider, validation, greedy, probability, efficient, random, sampler, appropriate, architecture, accelerating, performance, select, lower, sample, update, nisp, small, denoted] [geometric, functional, differentiable, median]
@InProceedings{He_2020_CVPR,
  author = {He, Yang and Ding, Yuhang and Liu, Ping and Zhu, Linchao and Zhang, Hanwang and Yang, Yi},
  title = {Learning Filter Pruning Criteria for Deep Convolutional Neural Networks Acceleration},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DIST: Rendering Deep Implicit Signed Distance Function With Differentiable Sphere Tracing
Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, Zhaopeng Cui


We propose a differentiable sphere tracing algorithm to bridge the gap between inverse graphics methods and the recently proposed deep learning based implicit signed distance function. Due to the nature of the implicit function, the rendering process requires tremendous function queries, which is particularly problematic when the function is represented as a neural network. We optimize both the forward and backward pass of our rendering layer to make it run efficiently with affordable memory consumption on a commodity graphics card. Our rendering method is fully differentiable such that losses can be directly computed on the rendered 2D observations, and the gradients can be propagated backward to optimize the 3D geometry. We show that our rendering method can effectively reconstruct accurate 3D shapes from various inputs, such as sparse depth and multi-view images, through inverse optimization. With the geometry based reasoning, our 3D shape prediction methods show excellent generalization capability and robustness against various noises.
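Sphere tracing itself is compact: each ray is advanced by the signed distance at its current point until the distance falls below a threshold. A minimal sketch with an analytic unit-sphere SDF standing in for the learned implicit function; the paper's memory and speed optimizations of the forward and backward passes are omitted.

    # Minimal sphere tracing: march each ray forward by the SDF value at the
    # current point until |sdf| is small.
    import torch

    def sdf_unit_sphere(points):                      # (N, 3) -> (N,)
        return points.norm(dim=-1) - 1.0

    def sphere_trace(origins, dirs, sdf, n_steps=50, eps=1e-4):
        t = torch.zeros(origins.shape[0])
        for _ in range(n_steps):
            points = origins + t.unsqueeze(-1) * dirs
            dist = sdf(points)
            t = t + dist                              # safe step: SDF lower-bounds hit distance
            if (dist.abs() < eps).all():
                break
        return origins + t.unsqueeze(-1) * dirs       # estimated surface points

    origins = torch.tensor([[0.0, 0.0, -3.0]])
    dirs = torch.tensor([[0.0, 0.0, 1.0]])
    print(sphere_trace(origins, dirs, sdf_unit_sphere))  # converges near (0, 0, -1)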
[recognition, prediction, reasoning, work] [object, propose, threshold] [ray, generalization, robustness] [method, pattern, based, figure, resolution, inverse, color, comparison, pixel, proposed, high, capability] [image, code, latent, loss, representation, corresponding] [learning, neural, function, deep, network, optimization, algorithm, process, performance, forward, aggressive, convergence, optimize, efficient, random, memory, better, set, note, processing, requires] [shape, rendering, distance, differentiable, vision, computer, depth, implicit, signed, sphere, tracing, surface, deepsdf, geometric, pmo, camera, conference, sdf, marching, renderer, single, rendered, geometry, international, point, render, mesh, michael, represented, directly, accurate, continuous, initial, dense, normal, sparse, hao]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Shaohui and Zhang, Yinda and Peng, Songyou and Shi, Boxin and Pollefeys, Marc and Cui, Zhaopeng},
  title = {DIST: Rendering Deep Implicit Signed Distance Function With Differentiable Sphere Tracing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Visually Imbalanced Stereo Matching
Yicun Liu, Jimmy Ren, Jiawei Zhang, Jianbo Liu, Mude Lin


Understanding of the human vision system (HVS) has inspired many computer vision algorithms. Stereo matching, which borrows the idea from human stereopsis, has been extensively studied in the existing literature. However, scant attention has been paid to a typical scenario where binocular inputs are qualitatively different (e.g., a high-res master camera and a low-res slave camera in a dual-lens module). Recent advances in human optometry reveal the capability of the human visual system to maintain coarse stereopsis under such visually imbalanced conditions. Inspired by this biological capability, it is natural to ask: do stereo machines share the same capability? In this paper, we carry out a systematic comparison to investigate the effect of various imbalanced conditions on current popular stereo matching algorithms. We show that, resembling the human visual system, those algorithms can handle limited degrees of monocular downgrading but are also prone to collapse beyond a certain threshold. To avoid such collapse, we propose a solution to recover the stereopsis through a joint guided-view-restoration and stereo-reconstruction framework. We show the superiority of our framework on the KITTI dataset and its extension to real-world applications.
[visual, positional, master, perception, work, evaluation, shift, current] [guided, framework, object, horizontal, feature, map, table, contour] [corrupted, noise, model] [disparity, blur, figure, spatial, proposed, stereopsis, dynamic, rectification, based, phase, binocular, downgrading, restored, filtering, crl, visually, slave, resolution, capability, acuity, downgraded, scale, pixel, convolutional, jimmy, psmnet] [image, synthesis, loss, corresponding, generated] [imbalanced, network, filter, performance, size, neural, layer, deep, set, problem, design, linear, large, test, search, applied, learning, accuracy] [view, stereo, monocular, matching, displacement, human, left, depth, vision, error, volume, camera, coarse, single, computer, position, local, kitti]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yicun and Ren, Jimmy and Zhang, Jiawei and Liu, Jianbo and Lin, Mude},
  title = {Visually Imbalanced Stereo Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Mesh-Guided Multi-View Stereo With Pyramid Architecture
Yuesong Wang, Tao Guan, Zhuo Chen, Yawei Luo, Keyang Luo, Lili Ju


Multi-view stereo (MVS) aims to reconstruct the 3D geometry of the target scene by using only information from 2D images. Although much progress has been made, it still suffers from textureless regions. To overcome this difficulty, we propose a mesh-guided MVS method with pyramid architecture, which makes use of the surface mesh obtained from coarse-scale images to guide the reconstruction process. Specifically, a PatchMatch-based MVS algorithm is first used to generate depth maps for coarse-scale images, and the corresponding surface mesh is obtained by a surface reconstruction algorithm. Next, we project the mesh onto each of the depth maps to replace unreliable depth values, and the corrected depth maps are fed to fine-scale reconstruction for initialization. To alleviate the influence of possible erroneous faces on the mesh, we further design and train a convolutional neural network to remove incorrect depths. In addition, it is often hard for the correct depth values of low-textured regions to survive at the fine scale, so we also develop an efficient method to seek out these regions and further enforce the geometric consistency in these regions. Experimental results on the ETH3D high-resolution dataset demonstrate that our method achieves state-of-the-art performance, especially in completeness.
[attention, correct, wrong, three, evaluation, dataset] [map, confidence, pyramid, region, module, feature, table, fuse, detector, threshold, guide, aspp, score] [input, improve, robust] [method, scale, ieee, figure, pattern, receptive, field, erroneous, coarsest, based, remove, patch, fusion] [consistency, corresponding, image] [set, network, accuracy, size, learning, neural, test, filter, achieve, architecture, design, deep, number] [depth, mesh, computer, conference, geometric, untextured, surface, completeness, point, view, stereo, textureless, vision, acmh, reconstruction, finer, neighbor, initial, international, local, matching, cost, photometric, cloud, estimation, acmm, scene]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yuesong and Guan, Tao and Chen, Zhuo and Luo, Yawei and Luo, Keyang and Ju, Lili},
  title = {Mesh-Guided Multi-View Stereo With Pyramid Architecture},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
BiDet: An Efficient Binarized Object Detector
Ziwei Wang, Ziyi Wu, Jiwen Lu, Jie Zhou


In this paper, we propose a binarized neural network learning method called BiDet for efficient object detection. Conventional network binarization methods directly quantize the weights and activations in one-stage or two-stage detectors with constrained representational capacity, so that the information redundancy in the networks causes numerous false positives and degrades the performance significantly. On the contrary, our BiDet fully utilizes the representational capacity of the binary neural networks for object detection by redundancy removal, through which the detection precision is enhanced with alleviated false positives. Specifically, we generalize the information bottleneck (IB) principle to object detection, where the amount of information in the high-level feature maps is constrained and the mutual information between the feature maps and object detection is maximized. Meanwhile, we learn sparse object priors so that the posteriors are concentrated on informative detection prediction with false positive elimination. Extensive experiments on the PASCAL VOC and COCO datasets show that our method outperforms the state-of-the-art binary neural networks by a sizable margin.
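BiDet builds on the standard binarized-network building block: sign() in the forward pass and a straight-through estimator in the backward pass. The sketch below shows only that generic block, not the information-bottleneck objective the paper adds on top.

    # Standard building block of a binarized network: sign() in the forward pass,
    # straight-through estimator (clipped identity) in the backward pass.
    import torch

    class BinarizeSTE(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return torch.sign(x)

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            return grad_out * (x.abs() <= 1).float()   # pass gradient only where |x| <= 1

    x = torch.randn(8, requires_grad=True)
    y = BinarizeSTE.apply(x)
    y.sum().backward()
    print(x.grad)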
[prediction, sign] [object, detection, bidet, false, feature, backbone, map, faster, fully, pascal, positive, voc, predicted, framework, location, detector, coco, ross, represents, china, redundant, region] [model, input, datasets] [proposed, method, figure, block, convolutional, based, compression, utilized, scale, prior] [learn, enforces] [neural, binary, network, distribution, efficient, redundancy, principle, binarized, learning, mutual, representational, informative, class, capacity, bottleneck, training, performance, storage, quantization, number, arxiv, preprint, set, classification, deep, concentrated, learned, large, log, probability, precision, amount, power, computation, compared, parameterized] [sparse, directly]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Ziwei and Wu, Ziyi and Lu, Jiwen and Zhou, Jie},
  title = {BiDet: An Efficient Binarized Object Detector},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Local Non-Rigid Structure-From-Motion From Diffeomorphic Mappings
Shaifali Parashar, Mathieu Salzmann, Pascal Fua


We propose a new formulation to non-rigid structure-from-motion that only requires the deforming surface to preserve its differential structure. This is a much weaker assumption than the traditional ones of isometry or conformality. We show that it is nevertheless sufficient to establish local correspondences between the surface in two different images and therefore to perform point-wise reconstruction using only first-order derivatives. To this end, we formulate differential constraints and solve them algebraically using the theory of resultants. We will demonstrate that our approach is more widely applicable, more stable in noisy and sparse imaging conditions and much faster than earlier ones, while delivering similar accuracy. The code is available at https://github.com/cvlab-epfl/diff-nrsfm/.
[order, dataset] [challenge, faster] [isometry, differential, model] [pattern, figure, method, expressed, diffeomorphic, assumption, motion, ieee, analysis, deformable, optical, high] [image, diff, perform, corresponding, consists] [number, best, requires, machine, large, find, performance, computing, linear, equation, paper, formulate] [surface, local, computer, reconstruction, conference, shape, approach, locally, structure, point, deforming, second, vision, nrsfm, write, solve, compute, assume, isometric, computed, depth, substituting, international, deformation, well, solution, express, additional, define, perspective, formulation, sufficient, conformality]
@InProceedings{Parashar_2020_CVPR,
  author = {Parashar, Shaifali and Salzmann, Mathieu and Fua, Pascal},
  title = {Local Non-Rigid Structure-From-Motion From Diffeomorphic Mappings},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Seeing Around Street Corners: Non-Line-of-Sight Detection and Tracking In-the-Wild Using Doppler Radar
Nicolas Scheiner, Florian Kraus, Fangyin Wei, Buu Phan, Fahim Mannan, Nils Appenrodt, Werner Ritter, Jurgen Dickmann, Klaus Dietmayer, Bernhard Sick, Felix Heide


Conventional sensor systems record information about directly visible objects, whereas occluded scene components are considered lost in the measurement process. Non-line-of-sight (NLOS) methods try to recover such hidden objects from their indirect reflections - faint signal components, traditionally treated as measurement noise. Existing NLOS approaches struggle to record these low-signal components outside the lab, and do not scale to large-scale outdoor scenes and high-speed motion, typical in automotive scenarios. In particular, optical NLOS capture is fundamentally limited by the quartic intensity falloff of diffuse indirect reflections. In this work, we depart from visible-wavelength approaches and demonstrate detection, classification, and tracking of hidden objects in large-scale dynamic environments using Doppler radars that can be manufactured at low-cost in series production. To untangle noisy indirect and direct reflections, we learn from temporal sequences of Doppler velocity and position measurements, which we fuse in a joint NLOS detection and tracking network over time. We validate the approach on in-the-wild automotive scenes, including sequences of parked cars or house facades as relay surfaces, and demonstrate low-cost, real-time NLOS in dynamic automotive environments.
[hidden, frame, time, road, shift, temporal, work, multiple, vehicle, parked, three] [detection, tracking, object, localization, occluded, box, lidar, pyramid] [model, input, typical] [ieee, frequency, automotive, signal, imaging, pattern, figure, proposed, range, receiver, sensor, recover, existing, intensity, dynamic, based, high, automated] [] [network, data, large, size, training, small] [radar, nlos, velocity, wall, relay, single, conference, doppler, computer, visible, vision, point, estimation, measurement, ground, truth, joint, distance, scene, indirect, diffuse, supplemental, received, specular, cloud, international, capture, demonstrate, direct, fmcw, acm, position, virtual, camera, system, angle, chirp]
@InProceedings{Scheiner_2020_CVPR,
  author = {Scheiner, Nicolas and Kraus, Florian and Wei, Fangyin and Phan, Buu and Mannan, Fahim and Appenrodt, Nils and Ritter, Werner and Dickmann, Jurgen and Dietmayer, Klaus and Sick, Bernhard and Heide, Felix},
  title = {Seeing Around Street Corners: Non-Line-of-Sight Detection and Tracking In-the-Wild Using Doppler Radar},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
APQ: Joint Search for Network Architecture, Pruning and Quantization Policy
Tianzhe Wang, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Hanrui Wang, Yujun Lin, Song Han


We present APQ, a novel design methodology for efficient deep learning deployment. Unlike previous methods that separately optimize the neural network architecture, pruning policy, and quantization policy, we optimize them jointly. To deal with the larger design space this brings, we train a quantization-aware accuracy predictor that is fed to the evolutionary search to select the best fit. Since directly training such a predictor requires time-consuming quantization data collection, we propose a predictor-transfer technique to obtain the quantization-aware predictor: we first generate a large dataset of pairs by sampling a pretrained unified supernet and doing direct evaluation; then we use these data to train an accuracy predictor without quantization, further transferring its weights to train the quantization-aware predictor, which largely reduces the quantization data collection time. Extensive experiments on ImageNet show the benefits of this joint design methodology: the model searched by our method maintains the same level of accuracy as the 8-bit ResNet34 model while saving 8x BitOps; we obtain the same level of accuracy as MobileNetV2+HAQ while achieving 2x/1.3x latency/energy savings; joint optimization for a new deployment scenario outperforms separate optimizations using ProxylessNAS+AMC+HAQ by 2.3% accuracy at a marginal search cost, while reducing GPU hours and CO2 emission by 600x.
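The role of the accuracy predictor is to replace expensive evaluation inside the evolutionary loop. A hedged sketch of such a predictor-driven search, where the predictor, the architecture encoding, and the mutation rule are all toy placeholders:

    # Sketch of predictor-driven evolutionary search: candidates are ranked by a
    # cheap accuracy predictor instead of real training/evaluation.
    import random

    def evolutionary_search(predictor, random_arch, mutate, pop=50, generations=20):
        population = [random_arch() for _ in range(pop)]
        for _ in range(generations):
            population.sort(key=predictor, reverse=True)      # rank by predicted accuracy
            parents = population[:pop // 4]
            children = [mutate(random.choice(parents)) for _ in range(pop - len(parents))]
            population = parents + children
        return max(population, key=predictor)

    best = evolutionary_search(
        predictor=lambda a: -sum((x - 4) ** 2 for x in a),    # toy stand-in predictor
        random_arch=lambda: [random.randrange(8) for _ in range(10)],
        mutate=lambda a: [x if random.random() > 0.1 else random.randrange(8) for x in a])
    print(best)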
[dataset, policy, time] [level, table, propose] [model, technique, choose] [channel, figure, method, comparison, proposed, based, kernel, block, existing, convolutional] [train, target, perform, transfer, han] [accuracy, architecture, search, network, predictor, quantization, neural, design, training, efficient, quantized, pruning, data, deep, learning, performance, latency, song, space, large, optimization, hardware, number, precision, deployment, energy, optimal, compared, haq, layer, best, imagenet, searched, achieve, searching, find, path, resource, size, better, fixed, hanrui, evolutionary, requires, marginal, gpu, mobile, amc, apq, bitops, pairwise] [joint, cost, directly, pipeline, jointly, full, computer]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Tianzhe and Wang, Kuan and Cai, Han and Lin, Ji and Liu, Zhijian and Wang, Hanrui and Lin, Yujun and Han, Song},
  title = {APQ: Joint Search for Network Architecture, Pruning and Quantization Policy},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
On the Acceleration of Deep Learning Model Parallelism With Staleness
An Xu, Zhouyuan Huo, Heng Huang


Training deep convolutional neural networks for computer vision problems is slow and inefficient, especially when the networks are large and distributed across multiple devices. The inefficiency is caused by the backpropagation algorithm's forward locking, backward locking, and update locking problems. Existing solutions for acceleration either can only handle one locking problem or lead to severe accuracy loss or memory inefficiency. Moreover, none of them consider the straggler problem among devices. In this paper, we propose Layer-wise Staleness and a novel efficient training algorithm, Diversely Stale Parameters (DSP), to address these challenges. We also analyze the convergence of DSP with two popular gradient-based methods and prove that both of them are guaranteed to converge to critical points for non-convex problems. Finally, extensive experimental results on training deep learning models demonstrate that our proposed DSP algorithm can achieve significant training speedup with stronger robustness than the compared methods.
[time, slow, critical] [table, propose] [model, difference, input, robustness] [block, figure, parallel, proposed, method, convolutional, output, ieee] [loss] [dsp, gradient, training, staleness, learning, neural, forward, backward, data, deep, locking, distributed, test, arxiv, preprint, rate, batch, parallelism, convergence, algorithm, accuracy, momentum, stochastic, epoch, performance, imagenet, machine, network, backpropagation, update, stale, speedup, sgd, decay, zhouyuan, efficient, pass, decoupled, cifar, heng, memory, diversely, recomputation, random, implementation, theorem, processing, large, converge, compared, ddg, lower, denote, problem, larger] [error, computer, conference, vision, assume, novel]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, An and Huo, Zhouyuan and Huang, Heng},
  title = {On the Acceleration of Deep Learning Model Parallelism With Staleness},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
RevealNet: Seeing Behind Objects in RGB-D Scans
Ji Hou, Angela Dai, Matthias Niessner


During 3D reconstruction, it is often the case that people cannot scan each individual object from all views, resulting in missing geometry in the captured scan. This missing geometry can be fundamentally limiting for many applications, e.g., a robot needs to know the unseen geometry to perform a precise grasp on an object. Thus, we introduce the task of semantic instance completion: from an incomplete RGB-D scan of a scene, we aim to detect the individual object instances and infer their complete object geometry. This will open up new possibilities for interactions with objects in a scene, for instance for virtual or robotic agents. We tackle this problem by introducing RevealNet, a new data-driven approach that jointly detects object instances and predicts their complete geometry. This enables a semantically meaningful decomposition of a scanned scene into individual, complete 3D objects, including hidden and unobserved object parts. RevealNet is an end-to-end 3D neural network architecture that leverages joint color and geometry feature learning. The fully-convolutional nature of our 3D network enables efficient inference of semantic instance completion for 3D scans at scale of large indoor environments in a single forward pass. We show that predicting complete object geometry improves both 3D detection and instance segmentation performance. We evaluate on both real and synthetic scan benchmark data for the new task, where we outperform state-of-the-art approaches by over 15 in mAP@0.5 on ScanNet, and over 18 in mAP@0.5 on SUNCG.
[predict, predicting, understanding, prediction, individual, order] [instance, object, semantic, detection, segmentation, bounding, predicted, feature, box, backbone, proposal, detected, map, ross, improves, unified] [input, model] [color, pattern, ieee, method] [loss, synthetic, missing, real, cross] [task, network, evaluate, arxiv, preprint, data, performance, class, deep, learning, architecture, binary, neural] [completion, geometry, approach, scan, complete, computer, scene, volumetric, conference, well, vision, shape, ground, geometric, angela, matthias, single, truth, scannet, point, voxel, volume, revealnet, partial, suncg, leonidas, indoor, surface, international, acm, predicts, enables, joint, full]
@InProceedings{Hou_2020_CVPR,
  author = {Hou, Ji and Dai, Angela and Niessner, Matthias},
  title = {RevealNet: Seeing Behind Objects in RGB-D Scans},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MemNAS: Memory-Efficient Neural Architecture Search With Grow-Trim Learning
Peiye Liu, Bo Wu, Huadong Ma, Mingoo Seok


Recent studies on automatic neural architecture search techniques have demonstrated significant performance, competitive with or even better than hand-crafted neural architectures. However, most of the existing search approaches tend to use residual structures and a concatenation connection between shallow and deep features. The resulting neural network model is therefore non-trivial for resource-constrained devices to execute, since such a model requires large memory to store network parameters and intermediate feature maps, along with excessive computing complexity. To address this challenge, we propose MemNAS, a novel growing-and-trimming based neural architecture search framework that optimizes not only performance but also the memory requirement of the inference network. Specifically, in the search process, we consider running memory use, including the memory required for network parameters and essential intermediate feature maps, as an optimization objective along with performance. In addition, to improve the accuracy of the search, we extract the correlation information among multiple candidate architectures to rank them and then choose the candidates with desired performance and memory efficiency. On the ImageNet classification task, our MemNAS achieves 75.4% accuracy, 0.7% higher than MobileNetV2 with a 42.1% lower memory requirement. Additional experiments confirm that the proposed MemNAS performs well across different targets of the trade-off between accuracy and memory consumption.
[multiple, time, previous, current] [achieves, correlation, propose, represents, table, feature, denotes, framework, round, score] [model, input, technique] [intermediate, proposed, figure, convolution, block, cell, output, existing, based, conventional] [target, perform, generation, produce, representation] [memory, neural, search, network, memnas, architecture, candidate, accuracy, requirement, data, size, inference, layer, number, performance, controller, operation, large, ranking, scc, efficient, mobile, base, training, metric, manual, consider, process, shufflenet, total, hardware, best, set, lifetime, space, optimize, learning, deep, requires, computing, trimming, classification, design, find, average, experiment, arxiv, preprint] [auto, estimate, structure, relative]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Peiye and Wu, Bo and Ma, Huadong and Seok, Mingoo},
  title = {MemNAS: Memory-Efficient Neural Architecture Search With Grow-Trim Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
StegaStamp: Invisible Hyperlinks in Physical Photographs
Matthew Tancik, Ben Mildenhall, Ren Ng


Printed and digitally displayed photos have the ability to hide imperceptible digital data that can be accessed through internet-connected imaging systems. Another way to think about this is physical photographs that have unique QR codes invisibly embedded within them. This paper presents an architecture, algorithms, and a prototype implementation addressing this vision. Our key technical contribution is StegaStamp, a learned steganographic algorithm to enable robust encoding and decoding of arbitrary hyperlink bitstrings into photos in a manner that approaches perceptual invisibility. StegaStamp comprises a deep neural network that learns an encoding/decoding algorithm robust to image perturbations approximating the space of distortions resulting from real printing and photography. We demonstrate real-time decoding of hyperlinks in photos from in-the-wild videos that contain variation in lighting, shadows, perspective, occlusion and viewing distance. Our prototype system robustly retrieves 56-bit hyperlinks after error correction -- sufficient to embed a unique code within every photo on the internet.
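The train-time trick is to insert differentiable corruptions between the encoder and decoder so gradients still reach the encoder. The sketch below applies only noise, brightness jitter, and a blur-like smoothing; the paper's full pipeline (perspective warp, JPEG approximation, print-and-rephotograph effects) is not reproduced.

    # Sketch of the train-time corruption idea: apply differentiable perturbations
    # to the encoded image before decoding, so gradients still flow to the encoder.
    import torch
    import torch.nn.functional as F

    def perturb(img, noise_std=0.02, brightness=0.1):
        img = img + torch.randn_like(img) * noise_std                  # sensor noise
        img = img * (1.0 + (torch.rand(1) * 2 - 1) * brightness)       # brightness jitter
        kernel = torch.full((3, 1, 3, 3), 1.0 / 9.0)                   # 3x3 box blur per channel
        img = F.conv2d(img, kernel, padding=1, groups=3)
        return img.clamp(0.0, 1.0)

    encoded = torch.rand(1, 3, 64, 64, requires_grad=True)
    corrupted = perturb(encoded)
    corrupted.mean().backward()          # gradients flow back through the corruptions
    print(encoded.grad.abs().sum() > 0)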
[message, decoding, decoder, hidden, work, encoding, encode, dataset, three, length] [detection] [robust, encoded, trained, robustness, digital, adversarial, physical, stegastamp, pixelwise, jpeg, perturbation, cellphone, printed, noise, model, input, steganography, printer, watermarking, lfm, barcode, original, consumer, hiding, quality, stegastamps, hyperlink] [color, spatial, figure, method, perceptual, high, imaging, ieee, captured, residual, warp, blur, pixel] [image, real, encoder, loss, photo, synthetic, arbitrary, code, produce, train] [network, training, data, deep, bit, accuracy, set, random, test, find, randomly, learning, mobile, number, increased] [camera, system, unique, error, perspective, pipeline, computer, demonstrate]
@InProceedings{Tancik_2020_CVPR,
  author = {Tancik, Matthew and Mildenhall, Ben and Ng, Ren},
  title = {StegaStamp: Invisible Hyperlinks in Physical Photographs},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
L2-GCN: Layer-Wise and Learned Efficient Training of Graph Convolutional Networks
Yuning You, Tianlong Chen, Zhangyang Wang, Yang Shen


Graph convolution networks (GCN) are increasingly popular in many applications, yet remain notoriously hard to train over large graph datasets. They need to compute node representations recursively from their neighbors. Current GCN training algorithms suffer from either high computational costs that grow exponentially with the number of layers, or high memory usage for loading the entire graph and node embeddings. In this paper, we propose a novel, efficient layer-wise training framework for GCN (L-GCN) that disentangles feature aggregation and feature transformation during training, greatly reducing time and memory complexity. We present a theoretical analysis of L-GCN under the graph isomorphism framework, showing that, under mild conditions, L-GCN leads to GCNs as powerful as those produced by the more costly conventional training algorithm. We further propose L^2-GCN, which learns a controller for each layer that can automatically adjust the training epochs per layer in L-GCN. Experiments show that L-GCN is faster than the state of the art by at least an order of magnitude, with consistent memory usage that does not depend on dataset size, while maintaining comparable prediction performance. With the learned controller, L^2-GCN can further cut the training time in half. Our code is available at https://github.com/Shen-Lab/L2-GCN.
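The layer-wise idea is that the aggregation A_hat X has no parameters and can be computed once per layer, after which a small transformation is trained with its own auxiliary classifier before being frozen. A hedged sketch (the learned controller that sets per-layer epochs is not shown):

    # Sketch of layer-wise GCN training: the parameter-free aggregation (A_hat @ X)
    # is computed once per layer, then a per-layer linear transform is trained with
    # its own auxiliary classifier before moving to the next layer.
    import torch
    import torch.nn as nn

    def train_gcn_layerwise(a_hat, x, labels, dims, n_classes, epochs_per_layer=50):
        layers = []
        for d_out in dims:
            agg = a_hat @ x                                        # aggregation, no parameters
            layer = nn.Linear(x.shape[1], d_out)
            clf = nn.Linear(d_out, n_classes)                      # throwaway auxiliary head
            opt = torch.optim.Adam(list(layer.parameters()) + list(clf.parameters()), lr=1e-2)
            for _ in range(epochs_per_layer):
                opt.zero_grad()
                h = torch.relu(layer(agg))
                loss = nn.functional.cross_entropy(clf(h), labels)
                loss.backward()
                opt.step()
            x = torch.relu(layer(agg)).detach()                    # freeze and feed forward
            layers.append(layer)
        return layers, x

    n = 100
    a_hat = torch.eye(n)                                           # stand-in normalized adjacency
    feats, y = torch.randn(n, 32), torch.randint(0, 4, (n,))
    layers, out = train_gcn_layerwise(a_hat, feats, y, dims=[16, 16], n_classes=4)
    print(out.shape)  # torch.Size([100, 16])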
[time, graph, gcn, rnn, node, gnn, powerful, embeddings, graphsage, hidden, embedding, gcns, prediction, dataset] [feature, table, propose, aggregation, propagation, stage, final] [trained, input, dependent] [conventional, convolutional, figure, usage, proposed, output] [loss, train, representation, learn, mapping] [training, memory, layer, controller, network, epoch, learning, complexity, algorithm, performance, neural, llay, learned, lth, linear, capacity, large, matrix, weight, process, probability, arxiv, preprint, number, theorem, deeper, cora, theoretical, isomorphism, comparable, vrgcn, data, compared, sample, set, sgd, optimizer, fastgcn, gpu, ppi, reddit, search, efficient, gradient, layerwise, injective] [transformation, international, neighborhood]
@InProceedings{You_2020_CVPR,
  author = {You, Yuning and Chen, Tianlong and Wang, Zhangyang and Shen, Yang},
  title = {L2-GCN: Layer-Wise and Learned Efficient Training of Graph Convolutional Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Polarized Non-Line-of-Sight Imaging
Kenichiro Tanaka, Yasuhiro Mukaigawa, Achuta Kadambi


This paper presents a method of passive non-line-of-sight (NLOS) imaging using polarization cues. A key observation is that oblique light has a different polarimetric signal. This effect turns out to be due to polarization axis rotation, a phenomenon that can be used to better condition the light transport matrix for non-line-of-sight imaging. Our analysis and results show that the use of polarization for NLOS is both a standalone technique and an enhancement technique that boosts the results of other forms of passive NLOS imaging. We make the surprising finding that, despite the 50% light attenuation from polarization optics, polarized NLOS overall outperforms unpolarized NLOS.
[multiple, observation, conditioning, work, time, previous, observed] [object, improvement, key] [condition, occluder, effective, improve, variation, model, original, case] [light, imaging, ieee, method, pattern, reflection, figure, existing, intensity, resolution, patch, putting, enhanced, result, enhancement, enhance, proposed] [transport, image, source, target] [number, active, matrix, baseline, computational, paper, top, better, depends, improved, setting] [nlos, polarization, scene, camera, polarizer, polarized, angle, wall, conference, computer, passive, oblique, partial, vision, leakage, ramesh, roughness, international, single, los, point, axis, brewster, achuta, diffuse, view, front, recovered, reconstruction, azimuth, plane, geometry, lcd, incident, surface, andreas, zenith, rough, rotating, reflective]
@InProceedings{Tanaka_2020_CVPR,
  author = {Tanaka, Kenichiro and Mukaigawa, Yasuhiro and Kadambi, Achuta},
  title = {Polarized Non-Line-of-Sight Imaging},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
AdaBits: Neural Network Quantization With Adaptive Bit-Widths
Qing Jin, Linjie Yang, Zhenyu Liao


Deep neural networks with adaptive configurations have gained increasing attention due to the instant and flexible deployment of these models on platforms with different resource budgets. In this paper, we investigate a novel option to achieve this goal by enabling adaptive bit-widths of weights and activations in the model. We first examine the benefits and challenges of training a quantized model with adaptive bit-widths, and then experiment with several approaches including direct adaptation, progressive training and joint training. We discover that joint training enables the adaptive model to reach performance comparable to individually trained models. We also propose a new technique named Switchable Clipping Level (S-CL) to further improve quantized models at the lowest bit-width. With our proposed techniques applied to a range of models including MobileNet V1/V2 and ResNet50, we demonstrate that the bit-width of weights and activations is a new option for adaptively executable deep neural networks, offering a distinct opportunity for an improved accuracy-efficiency trade-off as well as instant adaptation to platform constraints in real-world applications.
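Joint training can be sketched as accumulating, for each batch, the loss of the same shared weights fake-quantized at every target bit-width before taking a single optimizer step. The sketch below uses a toy linear model and a straight-through estimator; the Switchable Clipping Level (S-CL) technique is not shown.

    # Sketch of joint training over several bit-widths: accumulate the loss of the
    # same weights quantized at every target bit-width, then take one optimizer step.
    import torch

    def fake_quant(x, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().max() / qmax + 1e-12
        return torch.round(x / scale).clamp(-qmax, qmax) * scale

    def joint_bitwidth_step(model_weight, inputs, targets, optimizer, bit_widths=(4, 6, 8)):
        optimizer.zero_grad()
        total = 0.0
        for b in bit_widths:
            w_q = model_weight + (fake_quant(model_weight, b) - model_weight).detach()  # STE
            loss = torch.nn.functional.mse_loss(inputs @ w_q, targets)
            total = total + loss
        total.backward()
        optimizer.step()
        return total.item()

    w = torch.randn(16, 8, requires_grad=True)
    opt = torch.optim.SGD([w], lr=0.1)
    print(joint_bitwidth_step(w, torch.randn(32, 16), torch.randn(32, 8), opt))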
[individual, work, individually] [level, sat, table, including, named, alan, propose] [model, trained, original] [adaptive, figure, method, ieee, pattern, based, adopted, channel] [progressive, adaptation, modified, train, image] [quantization, neural, bit, quantized, clipping, training, mobilenet, arxiv, preprint, performance, architecture, scheme, adabits, network, search, deep, achieve, accuracy, imagenet, learning, size, lowest, efficient, vanilla, large, larger, lower, number, quoc, switchable, strategy, activation, weight, layer, batch, barret, resource, investigate, better, algorithm, validation, indicates, dorefa, higher, linjie, deployment, optimal, bitwidth, compared] [conference, approach, computer, vision, direct, joint, error, single, application]
@InProceedings{Jin_2020_CVPR,
  author = {Jin, Qing and Yang, Linjie and Liao, Zhenyu},
  title = {AdaBits: Neural Network Quantization With Adaptive Bit-Widths},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Scale Boosted Dehazing Network With Dense Feature Fusion
Hang Dong, Jinshan Pan, Lei Xiang, Zhe Hu, Xinyi Zhang, Fei Wang, Ming-Hsuan Yang


In this paper, we propose a Multi-Scale Boosted Dehazing Network with Dense Feature Fusion based on the U-Net architecture. The proposed method is designed based on two principles, boosting and error feedback, and we show that they are suitable for the dehazing problem. By incorporating the Strengthen-Operate-Subtract boosting strategy in the decoder of the proposed model, we develop a simple yet effective boosted decoder to progressively restore the haze-free image. To address the issue of preserving spatial information in the U-Net architecture, we design a dense feature fusion module using the back-projection feedback scheme. We show that the dense feature fusion module can simultaneously remedy the missing spatial information from high-resolution features and exploit the non-adjacent features. Extensive evaluations demonstrate that the proposed model performs favorably against the state-of-the-art approaches on the benchmark datasets as well as real-world hazy images.
[decoder, dataset, unit, exploit] [feature, module, detection, pyramid, denotes, level, table, propose, effectiveness, benchmark] [effective, model, technique, feedback] [dehazing, boosted, proposed, boosting, ieee, pattern, dff, figure, method, hazy, fusion, convolutional, based, enhanced, preceding, pffnet, residual, haze, transmission, msbdn, spatial, atmospheric, dehazed, restore, reside, designed, restoration, denoising, deconvolutional] [image, encoder, progressively] [network, deep, strategy, algorithm, evaluate, performance, better, layer, training, learning, neural, best, architecture, design] [conference, computer, vision, single, dense, international, outdoor, indoor, demonstrate, scene, error, estimate, estimated]
@InProceedings{Dong_2020_CVPR,
  author = {Dong, Hang and Pan, Jinshan and Xiang, Lei and Hu, Zhe and Zhang, Xinyi and Wang, Fei and Yang, Ming-Hsuan},
  title = {Multi-Scale Boosted Dehazing Network With Dense Feature Fusion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings
Jiahui Huang, Sheng Yang, Tai-Jiang Mu, Shi-Min Hu


We present ClusterVO, a stereo visual odometry system that simultaneously clusters and estimates the motion of both the ego-camera and surrounding rigid clusters/objects. Unlike previous solutions that rely on batch input or impose priors on scene structure or dynamic object models, ClusterVO is online and general, and can thus be used in various scenarios including indoor scene understanding and autonomous driving. At the core of our system lie a multi-level probabilistic association mechanism and a heterogeneous Conditional Random Field (CRF) clustering approach combining semantic, spatial and motion information to jointly infer cluster segmentations online for every frame. The poses of the camera and dynamic objects are instantly solved through a sliding-window optimization. Our system is evaluated on the Oxford Multimotion and KITTI datasets both quantitatively and qualitatively, reaching results comparable to state-of-the-art solutions on both odometry and dynamic trajectory recovery.
[frame, moving, state, static, visual, current, sequence, heterogeneous, dataset, trajectory, driving, multiple, time, recognition] [object, bounding, track, detection, tracking, semantic, autonomous, association, crf, box, table, feature, detected, assignment, occlusion, localization, map, segmentation] [landmark, input, robust] [dynamic, motion, ieee, spatial, figure, based, method, pattern] [cluster, corresponding] [probabilistic, energy, number, clustering, performance, optimization, probability, set, strategy] [conference, stereo, scene, clustervo, vision, computer, camera, international, system, pose, slam, estimation, indoor, dense, rigid, robotics, dynslam, term, odometry, kitti, clusterslam, ate, automation, marginalization, multimotion, simultaneous, geometric, accurate]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Jiahui and Yang, Sheng and Mu, Tai-Jiang and Hu, Shi-Min},
  title = {ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Automatic Neural Network Compression by Sparsity-Quantization Joint Learning: A Constrained Optimization-Based Approach
Haichuan Yang, Shupeng Gui, Yuhao Zhu, Ji Liu


Deep Neural Networks (DNNs) are applied in a wide range of use cases. There is an increasing demand for deploying DNNs on devices that do not have abundant resources such as memory and computation units. Recently, network compression through a variety of techniques such as pruning and quantization has been proposed to reduce the resource requirements. A key parameter that all existing compression techniques are sensitive to is the compression ratio (e.g., pruning sparsity, quantization bitwidth) of each layer. Traditional solutions treat the compression ratio of each layer as a hyper-parameter and tune it using human heuristics. Recent work has turned to black-box hyper-parameter optimization, but such methods introduce new hyper-parameters of their own and have efficiency issues. In this paper, we propose a framework to jointly prune and quantize DNNs automatically according to a target model size, without any hyper-parameter for manually setting the compression ratio of each layer. In the experiments, we show that our framework can compress the weights of ResNet-50 to be 836x smaller without accuracy loss on CIFAR-10, and compress AlexNet to be 205x smaller without accuracy loss on ImageNet classification.
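As a small helper sketch for the size constraint discussed above (illustrative only; index overhead of a real sparse encoding is ignored, so this under-counts relative to an actual compressed format): the compressed weight footprint of a network given a per-layer pruning sparsity and quantization bitwidth, checked against a target budget.

def compressed_size_bits(layers):
    """layers: list of (num_weights, sparsity, bitwidth) tuples.
    Returns total bits for the nonzero, quantized weights."""
    return sum(int(n * (1.0 - s)) * b for n, s, b in layers)

# Example: check a (hypothetical) two-layer configuration against a budget.
config = [(2_359_296, 0.9, 4), (1_179_648, 0.8, 2)]
budget_bits = 8 * 1024 * 1024
print(compressed_size_bits(config) <= budget_bits)
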
[reinforcement] [framework, table] [dnn, model, dnns, budget, constrained] [compression, method, compressed, automated, proposed, compress, based, convolutional, ieee, figure, pattern, channel, admm] [loss, introduce] [pruning, quantization, neural, deep, problem, accuracy, bitwidth, learning, arxiv, preprint, layer, weight, set, network, ratio, algorithm, training, search, optimization, sbudget, smaller, sparsity, knapsack, size, rate, alexnet, number, efficient, min, applied, resource, find, mobile, nonzero, manually, imagenet, binary, arg, mckp, evaluate, compact, processing, haq, uniform, compressing, total, fixed, update] [conference, computer, projection, vision, solve, constraint, compare, european, direction]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Haichuan and Gui, Shupeng and Zhu, Yuhao and Liu, Ji},
  title = {Automatic Neural Network Compression by Sparsity-Quantization Joint Learning: A Constrained Optimization-Based Approach},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Normal Assisted Stereo Depth Estimation
Uday Kusupati, Shuo Cheng, Rui Chen, Hao Su


Accurate stereo depth estimation plays a critical role in various 3D tasks in both indoor and outdoor environments. Recently, learning-based multi-view stereo methods have demonstrated competitive performance with a limited number of views. However, in challenging scenarios, especially when building cross-view correspondences is hard, these methods still cannot produce satisfactory results. In this paper, we study how to enforce consistency between surface normals and depth at training time to improve performance. We couple the learning of a multi-view normal estimation module and a multi-view depth estimation module. In addition, we propose a novel consistency loss to train an independent consistency module that refines the depths from depth/normal pairs. We find that joint learning improves the prediction of both normals and depth, and that accuracy and smoothness can be further improved by enforcing the consistency. Experiments on MVS, SUN3D, RGBD and Scenes11 demonstrate the effectiveness of our method and its state-of-the-art performance.
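A rough illustration (not the paper's implementation) of a depth-normal consistency penalty of the kind described above: normals implied by the gradients of a back-projected depth map should agree with the separately predicted normal map. The intrinsics handling and the cross-product orientation are simplified assumptions.

import torch

def depth_to_normals(depth, fx, fy):
    """depth: (B,1,H,W). Returns unit normals (B,3,H-1,W-1) from back-projected points."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xs = (xs.float() - W / 2) / fx
    ys = (ys.float() - H / 2) / fy
    pts = torch.stack([xs * depth[:, 0], ys * depth[:, 0], depth[:, 0]], dim=1)
    dzdx = pts[..., :, 1:] - pts[..., :, :-1]      # finite difference along x
    dzdy = pts[..., 1:, :] - pts[..., :-1, :]      # finite difference along y
    n = torch.cross(dzdx[..., 1:, :], dzdy[..., :, 1:], dim=1)
    return torch.nn.functional.normalize(n, dim=1)

def consistency_loss(depth, pred_normals, fx, fy):
    n_from_depth = depth_to_normals(depth, fx, fy)
    pred = torch.nn.functional.normalize(pred_normals[..., 1:, 1:], dim=1)
    return (1.0 - (n_from_depth * pred).sum(dim=1)).mean()   # 1 - cosine similarity
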
[prediction, dataset, previous, recognition] [table, feature, module, aggregation, semantic, map, predicted, supervision] [model, datasets, improve] [ieee, method, pattern, based, pixel, figure, slice, june, cvpr, flow, cnns, spatial] [consistency, image, loss, train, perform] [learning, performance, better, deep, network, probability, test, training, evaluate, gradient] [normal, depth, cost, volume, computer, conference, estimation, surface, vision, stereo, scene, joint, international, estimate, ground, truth, textureless, single, dpsnet, matching, enforce, geometry, error, scannet, coordinate, view, plane, rgb, camera, enforcing, local, nnet, pipeline, european, indoor, absolute, rmse, thomas, david]
@InProceedings{Kusupati_2020_CVPR,
  author = {Kusupati, Uday and Cheng, Shuo and Chen, Rui and Su, Hao},
  title = {Normal Assisted Stereo Depth Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fusing Wearable IMUs With Multi-View Images for Human Pose Estimation: A Geometric Approach
Zhe Zhang, Chunyu Wang, Wenhu Qin, Wenjun Zeng


We propose to estimate 3D human pose from multi-view images and a few IMUs attached to a person's limbs. The method operates by first detecting 2D poses from the two signals and then lifting them to 3D space. We present a geometric approach that reinforces the visual features of each pair of joints based on the IMUs. This notably improves 2D pose estimation accuracy, especially when one joint is occluded. We call this approach the Orientation Regularized Network (ORN). We then lift the multi-view 2D poses to 3D with an Orientation Regularized Pictorial Structure Model (ORPSM), which jointly minimizes the projection error between the 3D and 2D poses along with the discrepancy between the 3D pose and the IMU orientations. This simple two-step approach reduces the error of the state of the art by a large margin on a public dataset. Our code will be released at https://github.com/microsoft/imu-human-pose-estimation-pytorch.
[] [] [] [] [] [] []
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Zhe and Wang, Chunyu and Qin, Wenhu and Zeng, Wenjun},
  title = {Fusing Wearable IMUs With Multi-View Images for Human Pose Estimation: A Geometric Approach},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
gDLS*: Generalized Pose-and-Scale Estimation Given Scale and Gravity Priors
Victor Fragoso, Joseph DeGol, Gang Hua


Many real-world applications in augmented reality (AR), 3D mapping, and robotics require both fast and accurate estimation of camera poses and scales from multiple images captured by multiple cameras or a single moving camera. Achieving high speed and maintaining high accuracy in a pose-and-scale estimator are often conflicting goals. To simultaneously achieve both, we exploit a priori knowledge about the solution space. We present gDLS*, a generalized-camera-model pose-and-scale estimator that utilizes rotation and scale priors. gDLS* allows an application to flexibly weigh the contribution of each prior, which is important since priors often come from noisy sensors. Compared to state-of-the-art generalized-pose-and-scale estimators (e.g., gDLS), our experiments on both synthetic and real data consistently demonstrate that gDLS* accelerates the estimation process and improves scale and pose accuracy.
[drive, recognition, multiple, work, speed, evaluation, three] [improves, table, localization, martin] [noise, improve] [scale, ieee, prior, pattern, pixel, noisy, reference, journal, figure, fast] [translation, generalized, produce, pioneer] [accuracy, function, similarity, optimal, efficient, vector, augmented, data, regularizers, sample, random, problem, set, general, experiment, parameter, average] [rotation, gravity, pose, vision, computer, error, camera, estimator, gdls, minimal, slam, cost, solution, accurate, polynomial, system, upnp, estimation, transformation, robotics, kitti, estimate, modestly, point, absolute, tum, rigid, single, estimating, computes, position, solver, direction, zuzana, tobias, reality]
@InProceedings{Fragoso_2020_CVPR,
  author = {Fragoso, Victor and DeGol, Joseph and Hua, Gang},
  title = {gDLS*: Generalized Pose-and-Scale Estimation Given Scale and Gravity Priors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Embodied Language Grounding With 3D Visual Feature Representations
Mihir Prabhudesai, Hsiao-Yu Fish Tung, Syed Ashar Javed, Maximilian Sieb, Adam W. Harley, Katerina Fragkiadaki


We propose associating language utterances with 3D visual abstractions of the scenes they describe. The 3D visual abstractions are encoded as three-dimensional visual feature maps. We infer these 3D visual scene feature maps from RGB images of the scene via view prediction: when the generated 3D scene feature map is neurally projected from a camera viewpoint, it should match the corresponding RGB image. We present generative models that condition on the dependency tree of an utterance and generate a corresponding visual 3D feature map as well as reason about its plausibility, and detector models that condition on both the dependency tree of an utterance and a related image and localize the object referents in the 3D feature map inferred from the image. Our models outperform models of language and vision that associate language with 2D CNN activations or 2D images by a large margin on a variety of tasks, such as classifying the plausibility of utterances, detecting referential expressions, and supplying rewards for trajectory optimization of object placement policies from language instructions. We perform numerous ablations and show that the improved performance of our detectors is due to their better generalization across camera viewpoints and the lack of object interference in the inferred 3D feature space, and that the improved performance of our generators is due to their ability to reason spatially about objects and their configurations in 3D when mapping from language to scenes.
[language, referential, visual, dependency, utterance, natural, cylinder, blue, placement, rubber, noun, parse, red, infer, goal, yellow, dataset, work, reasoning, grnns, affordability, time, describe, multiple, compositional, grounding] [object, feature, map, location, box, detector, bounding, detection, predicted, module, detected, score, table, deng, annotated] [model, expression, input, trained, condition, detecting] [spatial, figure, tree, green, ieee, inverse, cube, pattern] [image, generative, train, appearance, generated, corresponding, plausible, generation, real, desired] [neural, training, learning, baseline, network, pairwise, sample, performance, consider, stochastic, test] [scene, rgb, front, camera, left, vision, computer, conference, sphere, view, inferred, intersection, well]
@InProceedings{Prabhudesai_2020_CVPR,
  author = {Prabhudesai, Mihir and Tung, Hsiao-Yu Fish and Javed, Syed Ashar and Sieb, Maximilian and Harley, Adam W. and Fragkiadaki, Katerina},
  title = {Embodied Language Grounding With 3D Visual Feature Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Autofocus
Charles Herrmann, Richard Strong Bowen, Neal Wadhwa, Rahul Garg, Qiurui He, Jonathan T. Barron, Ramin Zabih


Autofocus is an important task for digital cameras, yet current approaches often exhibit poor performance. We propose a learning-based approach to this problem, and provide a realistic dataset of sufficient size for effective learning. Our dataset is labeled with per-pixel depths obtained from multi-view stereo, following [9]. Using this dataset, we apply modern deep classification models and an ordinal regression loss to obtain an efficient learning-based autofocus technique. We demonstrate that our approach provides a significant improvement compared with previous learned and non-learned methods: our model reduces the mean absolute error by a factor of 3.6 over the best comparable baseline algorithm. Our dataset and code are publicly available.
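A hedged sketch of an ordinal regression loss over focal-stack slice indices, as mentioned above: one binary "is the in-focus slice beyond threshold k?" decision per threshold. This is a generic ordinal regression formulation, not necessarily the exact loss used in the paper.

import torch
import torch.nn.functional as F

def ordinal_regression_loss(logits, target_idx):
    """logits: (B, K-1) cumulative logits; target_idx: (B,) in-focus slice index in [0, K-1]."""
    B, K_minus_1 = logits.shape
    thresholds = torch.arange(K_minus_1, device=logits.device).unsqueeze(0)   # (1, K-1)
    targets = (target_idx.unsqueeze(1) > thresholds).float()                  # (B, K-1)
    return F.binary_cross_entropy_with_logits(logits, targets)

def predict_slice(logits):
    # predicted slice = number of thresholds whose probability exceeds 0.5
    return (torch.sigmoid(logits) > 0.5).sum(dim=1)
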
[dataset, prediction, work, step, predict, rahul] [focus, table] [model, input, move, noise, digital] [autofocus, stack, lens, slice, figure, disparity, patch, sensor, wavelet, blur, range, light, pixel, sharpness, raw, output, pattern, mae] [image, corresponding, produce] [problem, algorithm, deep, metric, data, measure, energy, gradient, better, baseline, set, ratio, learning, number, test, learned, performance, ordinal, compared, lower, sum, network, task] [focal, depth, camera, local, distance, estimate, single, capture, error, scene, ground, truth, breathing, ambiguity, approach, left, computer, jonathan, calibration, defocus, delta, stereo, neal]
@InProceedings{Herrmann_2020_CVPR,
  author = {Herrmann, Charles and Bowen, Richard Strong and Wadhwa, Neal and Garg, Rahul and He, Qiurui and Barron, Jonathan T. and Zabih, Ramin},
  title = {Learning to Autofocus},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Joint Demosaicing and Denoising With Self Guidance
Lin Liu, Xu Jia, Jianzhuang Liu, Qi Tian


Usually located at the very early stages of the computational photography pipeline, demosaicing and denoising play important parts in modern camera image processing. Recently, some neural networks have shown effectiveness in joint demosaicing and denoising (JDD). Most of them first decompose a Bayer raw image into a four-channel RGGB image and then feed it into a neural network. This practice ignores the fact that the green channels are sampled at double the rate of the red and blue channels. In this paper, we propose a self-guidance network (SGNet), where the green channels are estimated first and then serve as guidance to recover all missing values in the input image. In addition, since regions of different frequencies suffer different levels of degradation in image restoration, we propose a density-map guidance to help the model deal with a wide range of frequencies. Our model outperforms state-of-the-art joint demosaicing and denoising methods on four public datasets, including two real and two synthetic data sets. Finally, we also verify that our method obtains the best results in joint demosaicing, denoising and super-resolution.
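A tiny sketch of the preprocessing step mentioned above: splitting a Bayer raw mosaic into its four colour planes (assuming an RGGB layout; other layouts just permute the offsets). The green planes are sampled twice as densely as red/blue, which is the observation the self-guidance design exploits.

import numpy as np

def pack_bayer_rggb(raw):
    """raw: (H, W) mosaic with even H, W. Returns (4, H/2, W/2): R, G1, G2, B planes."""
    r  = raw[0::2, 0::2]
    g1 = raw[0::2, 1::2]
    g2 = raw[1::2, 0::2]
    b  = raw[1::2, 1::2]
    return np.stack([r, g1, g2, b], axis=0)
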
[blue, red, outperforms, difficulty] [map, edge, propose, branch, table, main, level, feature, ablation] [noise, model, input, conduct, study, verify] [demosaicing, green, denoising, bayer, guidance, color, sgnet, adaptive, method, channel, convolutional, raw, deepjoint, recover, ledge, convolution, output, kokkinos, residual, lsmooth, admm, comparison, spatially, proposed, based, traditional, low, lei, rggb, flexisp, gaussian, demosaicking, figure, high, psnr, tenet] [image, texture, loss, missing, real, synthetic, lpips] [deep, network, neural, density, learning, test, arxiv, preprint, task, set, size, filter] [joint, dense, reconstruction, sparse, ground, truth, rgb, camera, full, initial, smooth, single]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Lin and Jia, Xu and Liu, Jianzhuang and Tian, Qi},
  title = {Joint Demosaicing and Denoising With Self Guidance},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Forward and Backward Information Retention for Accurate Binary Neural Networks
Haotong Qin, Ruihao Gong, Xianglong Liu, Mingzhu Shen, Ziran Wei, Fengwei Yu, Jingkuan Song


Weight and activation binarization is an effective approach to deep neural network compression and can accelerate inference by leveraging bitwise operations. Although many binarization methods have improved the accuracy of the model by minimizing the quantization error in forward propagation, there remains a noticeable performance gap between the binarized model and the full-precision one. Our empirical study indicates that quantization brings information loss in both forward and backward propagation, which is the bottleneck of training accurate binary neural networks. To address these issues, we propose an Information Retention Network (IR-Net) to retain the information contained in the forward activations and backward gradients. IR-Net mainly relies on two technical contributions: (1) Libra Parameter Binarization (Libra-PB), which simultaneously minimizes the quantization error and the information loss of parameters through balanced and standardized weights in forward propagation; and (2) an Error Decay Estimator (EDE), which minimizes the information loss of gradients by gradually approximating the sign function in backward propagation, jointly considering the updating ability and the accuracy of gradients. We are the first to investigate both the forward and backward processes of binary networks from a unified information perspective, which provides new insight into the mechanism of network binarization. Comprehensive experiments with various network structures on CIFAR-10 and ImageNet show that the proposed IR-Net consistently outperforms state-of-the-art quantization methods.
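A simplified sketch, in the spirit of the two ideas above (the exact scaling factor and sharpening schedule are assumptions, not the paper's): weights are balanced and standardized before binarization, and the backward pass uses a tanh surrogate of sign() whose steepness t grows during training so the approximation error decays.

import torch

class BalancedSign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, t):
        ctx.save_for_backward(w, torch.tensor(t))
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        w, t = ctx.saved_tensors
        # gradient of a tanh(t*w) surrogate; larger t -> closer to the true sign
        grad = t * (1.0 - torch.tanh(t * w) ** 2)
        return grad_out * grad, None

def binarize_weights(w, t):
    # balance and standardize before binarizing (Libra-PB-style preprocessing)
    w_std = (w - w.mean()) / (w.std() + 1e-8)
    scale = w_std.abs().mean()            # per-tensor scaling factor (assumption)
    return scale * BalancedSign.apply(w_std, t)
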
[sign, clip] [propagation, stage, table, including, denotes, sota, object] [fig, identity, highly, model, caused, study] [ieee, convolutional, method, proposed, high, figure, output, existing] [loss, ability, minimizing, image, diversity] [binary, neural, quantization, network, backward, binarization, training, forward, function, gradient, ede, deep, quantized, entropy, weight, approximation, performance, binarized, distribution, accuracy, retain, data, parameter, balanced, learning, inference, libra, updating, activation, imagenet, clipping, bitwise, compared, operation, layer, update, retention, vanilla, reduce, process, bernoulli, maximum, balance, improved, indicates, decay, number] [error, accurate, derivative, estimator]
@InProceedings{Qin_2020_CVPR,
  author = {Qin, Haotong and Gong, Ruihao and Liu, Xianglong and Shen, Mingzhu and Wei, Ziran and Yu, Fengwei and Song, Jingkuan},
  title = {Forward and Backward Information Retention for Accurate Binary Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Light Field Spatial Super-Resolution via Deep Combinatorial Geometry Embedding and Structural Consistency Regularization
Jing Jin, Junhui Hou, Jie Chen, Sam Kwong


Light field (LF) images acquired by hand-held devices usually suffer from low spatial resolution, as the limited sampling resources have to be shared with the angular dimension. LF spatial super-resolution (SR) thus becomes an indispensable part of the LF camera processing pipeline. The high dimensionality and complex geometrical structure of LF images make the problem more challenging than traditional single-image SR. The performance of existing methods is still limited, as they fail to thoroughly explore the coherence among LF views and are insufficient at accurately preserving the parallax structure of the scene. In this paper, we propose a novel learning-based LF spatial SR framework, in which each view of an LF image is first individually super-resolved by exploring the complementary information among views with combinatorial geometry embedding. To accurately preserve the parallax structure among the reconstructed views, a regularization network trained with a structure-aware loss function is subsequently appended to enforce correct parallax relationships over the intermediate estimation. Our approach is evaluated on datasets with a large number of test images, including both synthetic and real-world scenes. Experimental results demonstrate the advantage of our approach over state-of-the-art methods: it not only improves the average PSNR by more than 1.0 dB but also preserves more accurate parallax details, at a lower computational cost.
[recognition, embedding, time, individual] [table, module, feature, final] [complementary, auxiliary, datasets, quality, model] [light, spatial, field, ieee, parallax, method, reslf, residual, convolutional, proposed, pattern, fusion, bicubic, lfnet, reference, resolution, intermediate, figure, lytro, disparity, edsr, advantage, psnr, epi, quantitative, hci, analysis, traditional] [image, structural, consistency, synthetic, loss, consists] [regularization, angular, deep, network, combinatorial, compared, stanford, learning, training, performance, number, average, efficient, layer, problem, function, computational, general, higher] [view, reconstruction, conference, computer, vision, geometry, structure, ground, reconstructed, alternate, truth, dense, approach, scene]
@InProceedings{Jin_2020_CVPR,
  author = {Jin, Jing and Hou, Junhui and Chen, Jie and Kwong, Sam},
  title = {Light Field Spatial Super-Resolution via Deep Combinatorial Geometry Embedding and Structural Consistency Regularization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Multi-Hypothesis Approach to Color Constancy
Daniel Hernandez-Juarez, Sarah Parisot, Benjamin Busam, Ales Leonardis, Gregory Slabaugh, Steven McDonagh


Contemporary approaches frame the color constancy problem as learning camera-specific illuminant mappings. While high accuracy can be achieved on camera-specific data, these models depend on camera spectral sensitivity and typically exhibit poor generalisation to new devices. Additionally, regression methods produce point estimates that do not explicitly account for potential ambiguities among plausible illuminant solutions, due to the ill-posed nature of the problem. We propose a Bayesian framework that naturally handles color constancy ambiguity via a multi-hypothesis strategy. First, we select a set of candidate scene illuminants in a data-driven fashion and apply them to a target image to generate a set of corrected images. Second, we estimate, for each corrected image, the likelihood of the light source being achromatic using a camera-agnostic CNN. Finally, our method explicitly learns a final illumination estimate from the generated posterior probability distribution. Our likelihood estimator learns to answer a camera-agnostic question and thus enables effective multi-camera training by disentangling illuminant estimation from the supervised learning task. We extensively evaluate our proposed approach and additionally set a benchmark for novel-sensor generalisation without re-training. Our method provides state-of-the-art accuracy on multiple public datasets (up to 11% median angular error improvement) while maintaining real-time execution.
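A schematic sketch (not the authors' code) of the multi-hypothesis idea: correct the image with each candidate illuminant, score how achromatic each corrected image looks with a small network, and combine candidates by their posterior weights. The `likelihood_net` here is a placeholder for any learned scorer; the combination rule is a simplifying assumption.

import torch

def estimate_illuminant(image, candidates, likelihood_net):
    """image: (3,H,W) linear RGB; candidates: (N,3) candidate illuminants;
    likelihood_net: maps a corrected (1,3,H,W) image to a scalar log-likelihood."""
    log_liks = []
    for ell in candidates:
        corrected = image / ell.view(3, 1, 1).clamp(min=1e-6)   # von Kries-style correction
        log_liks.append(likelihood_net(corrected.unsqueeze(0)).squeeze())
    posterior = torch.softmax(torch.stack(log_liks), dim=0)     # (N,)
    est = (posterior.unsqueeze(1) * candidates).sum(dim=0)      # posterior-weighted estimate
    return est / est.norm()
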
[dataset, multiple, provide, work, frame, highlight] [cnn, table, propose, challenge, regression] [model, datasets, input, white, worst, nature] [illuminant, color, constancy, method, likelihood, ieee, ffcc, corrected, pattern, captured, june, light, sensor, figure, brian, cvpr, illumination, device, prior, convolutional, imaging, histogram] [image, train, learns, specific, produce] [candidate, set, training, learning, bayesian, network, distribution, angular, problem, posterior, selection, typically, probability, evaluate, strategy, space, linear, deep, accuracy, neural, processing, small, inference, alternative, note, classification] [conference, camera, computer, estimation, vision, estimate, scene, error, single, international, reflectance, well, surface, supplementary, novel, define]
@InProceedings{Hernandez-Juarez_2020_CVPR,
  author = {Hernandez-Juarez, Daniel and Parisot, Sarah and Busam, Benjamin and Leonardis, Ales and Slabaugh, Gregory and McDonagh, Steven},
  title = {A Multi-Hypothesis Approach to Color Constancy},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Restore Low-Light Images via Decomposition-and-Enhancement
Ke Xu, Xin Yang, Baocai Yin, Rynson W.H. Lau


Low-light images typically suffer from two problems. First, they have low visibility (i.e., small pixel values). Second, noise becomes significant and disrupts the image content, due to the low signal-to-noise ratio. Most existing low-light image enhancement methods, however, learn from noise-negligible datasets; they rely on users having good photographic skills and taking images with low noise. Unfortunately, this is not the case for the majority of low-light images. While concurrently enhancing a low-light image and removing its noise is ill-posed, we observe that noise exhibits different levels of contrast in different frequency layers, and it is much easier to detect noise in the low-frequency layer than in the high-frequency one. Inspired by this observation, we propose a frequency-based decomposition-and-enhancement model for low-light image enhancement. Based on this model, we present a novel network that first learns to recover image objects in the low-frequency layer and then enhances high-frequency details based on the recovered image objects. In addition, we have prepared a new low-light image dataset with real noise to facilitate learning. Finally, we have conducted extensive experiments which show that the proposed method outperforms state-of-the-art approaches in enhancing practical noisy low-light images.
[attention, dataset, encoding, context, work] [module, map, propose, global, table, stage, enhances] [noise, input, model, trained] [enhancement, method, proposed, figure, srgb, enhance, denoising, lime, raw, noisy, low, sid, existing, ace, based, enhancing, enhanced, deepupe, color, dslr, wvm, drht, hdrcnn, exposure, fail, illumination, frequency, recover, cdt, xin, lowlight, high, remove, ieee, lei, rynson, pixel, contrast, decompose, adaptively, detail, output, receptive] [image, learn, real, domain, corresponding, loss, produce] [learning, layer, network, deep, data, performance, select, training, function, note] [ground, truth, directly, cape, second, camera, single, recovered, novel, well]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Ke and Yang, Xin and Yin, Baocai and Lau, Rynson W.H.},
  title = {Learning to Restore Low-Light Images via Decomposition-and-Enhancement},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Background Matting: The World Is Your Green Screen
Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steven M. Seitz, Ira Kemelmacher-Shlizerman


We propose a method for creating a matte - the per-pixel foreground color and alpha - of a person by taking photos or videos in an everyday setting with a handheld camera. Most existing matting methods require a green-screen background or a manually created trimap to produce a good matte. Automatic, trimap-free methods are appearing, but they are not of comparable quality. In our trimap-free approach, we ask the user to take an additional photo of the background without the subject at the time of capture. This step requires a small amount of foresight but is far less time-consuming than creating a trimap. We train a deep network with an adversarial loss to predict the matte. We first train a matting network with a supervised loss on ground truth data with synthetic composites. To bridge the domain gap to real imagery with no labeling, we train another matting network guided by the first network and by a discriminator that judges the quality of composites. We demonstrate results on a wide variety of photos and videos and show significant improvement over the state of the art.
[video, natural, dataset, work, predict] [background, foreground, segmentation, focus, propose] [input, adversarial, subject, trained, switching] [color, ieee, method, figure, block, pattern, motion, green, captured, brian, result, created] [matting, image, real, alpha, matte, adobe, train, handheld, person, user, gadobe, trimap, discriminator, greal, loss, photo, gap, composited, produce, domain, perform, compositing, trimaps, composite, synthetic] [network, deep, data, training, better, small, learning, soft, set, worse, requires, architecture, sampling, random, setting, compared] [computer, conference, approach, camera, vision, require, international, volume, novel, ground, capture, truth, variety, compare, additional]
@InProceedings{Sengupta_2020_CVPR,
  author = {Sengupta, Soumyadip and Jayaram, Vivek and Curless, Brian and Seitz, Steven M. and Kemelmacher-Shlizerman, Ira},
  title = {Background Matting: The World Is Your Green Screen},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Supervised Raw Video Denoising With a Benchmark Dataset on Dynamic Scenes
Huanjing Yue, Cong Cao, Lei Liao, Ronghe Chu, Jingyu Yang


In recent years, the supervised learning strategy for real noisy image denoising has been emerging and has achieved promising results. In contrast, realistic noise removal for raw noisy videos is rarely studied, due to the lack of noisy-clean pairs for dynamic scenes. Clean video frames of dynamic scenes cannot be captured with a long-exposure shutter or by averaging multiple shots, as was done for static images. In this paper, we solve this problem by creating motions for controllable objects, such as toys, and capturing each static moment multiple times to generate clean video frames. In this way, we construct a dataset with 55 groups of noisy-clean videos with ISO values ranging from 1600 to 25600. To our knowledge, this is the first dynamic video dataset with noisy-clean pairs. Correspondingly, we propose a raw video denoising network (RViDeNet) that explores the temporal, spatial, and channel correlations of video frames. Since the raw video has Bayer patterns, we pack it into four sub-sequences, i.e., RGBG sequences, which are denoised by the proposed RViDeNet separately and finally fused into a clean video. In addition, our network outputs not only a raw denoising result but also an sRGB result obtained by passing through an image signal processing (ISP) module, which enables users to generate the sRGB result with their favourite ISPs. Experimental results demonstrate that our method outperforms state-of-the-art video and raw image denoising algorithms on both indoor and outdoor videos.
[video, temporal, dataset, frame, attention, multiple, static, visual, work] [propose, module, level] [noise, clean, input, model, trained] [raw, denoising, noisy, srgb, captured, ieee, method, iso, proposed, pattern, isp, fusion, dynamic, spatial, result, capturing, bayer, neighboring, convolution, removal, high, convolutional, psnr, didn, lei, gaussian, based, comparison, channel, color, chen, videnn, motion, figure] [image, utilize, domain, real, generate, alignment, loss, realistic, aligned, mapping, train, synthetic, generated] [network, processing, training, set, learning, data, deep, process, size, strategy] [computer, conference, vision, indoor, capture, outdoor, international, averaging, directly, scene, camera, ground, demonstrate]
@InProceedings{Yue_2020_CVPR,
  author = {Yue, Huanjing and Cao, Cong and Liao, Lei and Chu, Ronghe and Yang, Jingyu},
  title = {Supervised Raw Video Denoising With a Benchmark Dataset on Dynamic Scenes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Photometric Stereo via Discrete Hypothesis-and-Test Search
Kenji Enomoto, Michael Waechter, Kiriakos N. Kutulakos, Yasuyuki Matsushita


In this paper, we consider the problem of estimating surface normals of a scene with spatially varying, general BRDFs observed by a static camera under varying, known, distant illumination. Unlike previous approaches that are mostly based on continuous local optimization, we cast the problem as a discrete hypothesis-and-test search problem over the discretized space of surface normals. While a naive search requires a significant amount of time, we show that the expensive computation block can be precomputed in a scene-independent manner, resulting in accelerated inference for new scenes. It allows us to perform a full search over the finely discretized space of surface normals to determine the globally optimal surface normal for each scene point. We show that our method can accurately estimate surface normals of scenes with spatially varying different reflectances in a reasonable amount of time.
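A toy illustration of the discrete hypothesis-and-test flavour described above, under a simplifying Lambertian assumption (the paper handles general, spatially varying BRDFs and precomputes the expensive part in a scene-independent way; this sketch only conveys the exhaustive search over a discretized normal sphere).

import numpy as np

def fibonacci_hemisphere(n):
    """n roughly uniform unit vectors on the upper hemisphere."""
    i = np.arange(n) + 0.5
    phi = np.arccos(1 - i / n)                  # polar angle in (0, pi/2)
    theta = np.pi * (1 + 5 ** 0.5) * i
    return np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)], axis=1)

def best_normal(intensities, light_dirs, n_hypotheses=10000):
    """intensities: (L,) observations of one pixel; light_dirs: (L,3) unit vectors."""
    normals = fibonacci_hemisphere(n_hypotheses)              # (N,3) hypotheses
    shading = np.clip(light_dirs @ normals.T, 0, None)        # (L,N)
    # per-hypothesis least-squares albedo, then residual of the rendered intensities
    albedo = (shading * intensities[:, None]).sum(0) / (shading ** 2).sum(0).clip(1e-9)
    residual = ((shading * albedo - intensities[:, None]) ** 2).sum(0)
    return normals[residual.argmin()]
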
[dataset, time, recognition, work] [table, object] [model, noise] [method, light, tensor, pattern, reference, discretized, existing, figure, performed, spatially, proposed, analysis, noisy] [diverse, target, synthetic, reasonable, image] [angular, search, number, space, set, computation, optimal, problem, matrix, discrete, test, machine, general, data, best, find, objective, strategy, sampled, optimization] [surface, normal, brdf, photometric, merl, stereo, brdfs, varying, error, reflectance, vision, estimation, hypothesized, reconstruction, sphere, computer, estimated, cast, scene, globally, virtual, specular, shape, conference, intelligence, precomputation, measurement, material, camera, assume, parametric]
@InProceedings{Enomoto_2020_CVPR,
  author = {Enomoto, Kenji and Waechter, Michael and Kutulakos, Kiriakos N. and Matsushita, Yasuyuki},
  title = {Photometric Stereo via Discrete Hypothesis-and-Test Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference
Thomas Verelst, Tinne Tuytelaars


Modern convolutional neural networks apply the same operations on every pixel in an image. However, not all image regions are equally important. To address this inefficiency, we propose a method to dynamically apply convolutions conditioned on the input image. We introduce a residual block where a small gating branch learns which spatial positions should be evaluated. These discrete gating decisions are trained end-to-end using the Gumbel-Softmax trick, in combination with a sparsity criterion. Our experiments on CIFAR, ImageNet, Food-101 and MPII show that our method has better focus on the region of interest and better accuracy than existing methods, at a lower computational complexity. Moreover, we provide an efficient CUDA implementation of our dynamic convolutions using a gather-scatter approach, achieving a significant improvement in inference speed on MobileNetV2 and ShuffleNetV2. On human pose estimation, a task that is inherently spatially sparse, the processing speed is increased by 60% with no loss in accuracy.
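A hedged sketch of the spatial gating mechanism described above: a tiny gating branch predicts a per-pixel execution mask with the Gumbel-Softmax trick, and the mask gates a residual convolution. The real speedups require the paper's gather/scatter CUDA implementation; this dense version only mimics the numerics, and the layer layout is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyGatedConv(nn.Module):
    def __init__(self, channels, tau=1.0):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Conv2d(channels, 2, 1)   # logits for {skip, execute} per pixel
        self.tau = tau

    def forward(self, x):
        logits = self.gate(x)                                            # (B,2,H,W)
        mask = F.gumbel_softmax(logits, tau=self.tau, hard=True, dim=1)[:, 1:2]
        out = self.conv(x) * mask + x * (1 - mask)                       # residual-style skip
        sparsity = mask.mean()     # fraction of executed pixels, for a sparsity penalty
        return out, sparsity
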
[unit, recognition, time, work] [mask, trick, gather, apply, focus, resnet, table] [input, budget, model, trained] [residual, spatial, method, dynamic, block, convolutional, figure, pattern, convolution, ieee, conv, spatially, squeeze, existing, based, tensor, adaptive, intermediate, processed] [conditional, loss, image] [neural, network, execution, deep, inference, learning, arxiv, preprint, computational, gating, efficient, sact, sparsity, processing, amount, dynconv, accuracy, architecture, number, operation, standard, criterion, depthwise, small, better, simple, speedup, computation, typically, training, active, lower, implementation] [conference, computer, pose, vision, sparse, cost, human, estimation, international]
@InProceedings{Verelst_2020_CVPR,
  author = {Verelst, Thomas and Tuytelaars, Tinne},
  title = {Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fixed-Point Back-Propagation Training
Xishan Zhang, Shaoli Liu, Rui Zhang, Chang Liu, Di Huang, Shiyi Zhou, Jiaming Guo, Qi Guo, Zidong Du, Tian Zhi, Yunji Chen


The recently emerged quantization technique (i.e., using low bit-width fixed-point data instead of high bit-width floating-point data) has been applied to the inference of deep neural networks for fast and efficient execution. However, directly applying quantization in training can cause significant accuracy loss, so quantized training remains an open challenge. In this paper, we propose a novel training approach that applies layer-wise precision-adaptive quantization in deep neural networks. The new training approach leverages our key insight that the degradation of training accuracy is attributable to dramatic changes in the data distribution. Therefore, by keeping the data distribution stable through layer-wise precision-adaptive quantization, we are able to train deep neural networks directly with low bit-width fixed-point data and achieve guaranteed accuracy, without changing hyper-parameters. Experimental results on a wide variety of network architectures (e.g., convolutional and recurrent networks) and applications (e.g., image classification, object detection, segmentation and machine translation) show that the proposed approach can train these neural networks with negligible accuracy losses (-1.40%-1.3%, 0.02% on average), and speed up training by 252% on a state-of-the-art Intel CPU.
[observation, speed, automatically, three, evaluation] [propagation, object, segmentation, correlation, detection, propose, extra] [quantify, change, difference, curve] [adaptive, resolution, proposed, range, figure, adjustment, method, low, convolution, frequency, convolutional] [train, image, translation, corresponding] [quantization, training, data, accuracy, deep, activation, neural, network, distribution, dif, gradient, learning, layer, alexnet, backward, precision, update, machine, max, iteration, itv, parameter, forward, inference, average, computation, imagenet, large, convergence, appendix, qpa, achieve, wide, search, compared, mixed, evaluate] [error, approach, qem, initial, measurement]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Xishan and Liu, Shaoli and Zhang, Rui and Liu, Chang and Huang, Di and Zhou, Shiyi and Guo, Jiaming and Guo, Qi and Du, Zidong and Zhi, Tian and Chen, Yunji},
  title = {Fixed-Point Back-Propagation Training},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Heterogeneous Knowledge Distillation Using Information Flow Modeling
Nikolaos Passalis, Maria Tzelepi, Anastasios Tefas


Knowledge Distillation (KD) methods are capable of transferring the knowledge encoded in a large and complex teacher into a smaller and faster student. Early methods were usually limited to transferring knowledge only between the last layers of the networks, while later approaches were capable of performing multi-layer KD, further increasing the accuracy of the student. However, despite their improved performance, these methods still suffer from several limitations that restrict both their efficiency and flexibility. First, existing KD methods typically ignore that neural networks go through different learning phases during training, each of which often requires a different type of supervision. Furthermore, existing multi-layer KD methods are usually unable to effectively handle networks with significantly different architectures (heterogeneous KD). In this paper we propose a novel KD method that works by modeling the information flow through the various layers of the teacher and then training a student to mimic this information flow. The proposed method overcomes the aforementioned limitations by using an appropriate supervision scheme during the different phases of training, and by designing and training an appropriate auxiliary teacher model that acts as a proxy capable of "explaining" how the teacher works to the student. The effectiveness of the proposed method is demonstrated on four image datasets and several different evaluation setups.
[critical, evaluation, heterogeneous, modeling, work, order, providing, provide, multiple, retrieval, account] [supervision, map, final, table, employed, propose] [model, auxiliary, demonstrated, trained, evaluated, effectively, existence] [proposed, method, flow, intermediate, ieee, period, existing, pattern] [transfer, transferring, representation, extracted, loss, train, forming, transferred] [student, teacher, layer, knowledge, learning, network, training, neural, divergence, appropriate, deep, process, accuracy, note, mutual, task, distillation, scheme, compared, metric, performance, large, designing, worth, number, better, noting, precision, classification, set, probability, calculated, pkt, increasing, paper, convergence, data, architecture] [conference, matching, well, computer, approach, refer, vision, international, provided]
@InProceedings{Passalis_2020_CVPR,
  author = {Passalis, Nikolaos and Tzelepi, Maria and Tefas, Anastasios},
  title = {Heterogeneous Knowledge Distillation Using Information Flow Modeling},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Rethinking Differentiable Search for Mixed-Precision Neural Networks
Zhaowei Cai, Nuno Vasconcelos


Low-precision networks, with weights and activations quantized to low bit-width, are widely used to accelerate inference on edge devices. However, current solutions are uniform, using identical bit-width for all filters. This fails to account for the different sensitivities of different filters and is suboptimal. Mixed-precision networks address this problem, by tuning the bit-width to individual filter requirements. In this work, the problem of optimal mixed-precision network search (MPS) is considered. To circumvent its difficulties of discrete search space and combinatorial optimization, a new differentiable search architecture is proposed, with several novel contributions to advance the efficiency by leveraging the unique properties of the MPS problem. The resulting Efficient differentiable MIxed-Precision network Search (EdMIPS) method is effective at finding the optimal bit allocation for multiple popular networks, and can search a large model, e.g. Inception-V3, directly on ImageNet without proxy task in a reasonable amount of time. The learned mixed-precision networks significantly outperform their uniform counterparts.
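A rough sketch of a composite convolution for differentiable mixed-precision search of the kind described above: the layer output is a softmax-weighted sum of the same convolution evaluated with weights quantized at several candidate bit-widths, and the softmax parameters are learned jointly with the weights. The quantizer and the omission of the cost regularizer are simplifying assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize(w, bits):
    n = 2 ** (bits - 1) - 1
    w_c = torch.tanh(w) / torch.tanh(w).abs().max()       # DoReFa-style pre-scaling
    q = torch.round(w_c * n) / n
    return q + w_c - w_c.detach()                          # straight-through estimator

class CompositeConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, bit_widths=(1, 2, 4)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.05)
        self.alpha = nn.Parameter(torch.zeros(len(bit_widths)))   # architecture parameters
        self.bit_widths = bit_widths
        self.pad = k // 2

    def forward(self, x):
        pi = torch.softmax(self.alpha, dim=0)
        out = 0
        for p, b in zip(pi, self.bit_widths):
            out = out + p * F.conv2d(x, quantize(self.weight, b), padding=self.pad)
        return out
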
[multiple] [module, highest, object] [model, googlenet, effective, trained] [figure, parallel, convolutional, proposed, low, convolution, comparison, tensor] [composite, image, inception] [search, architecture, network, bit, uniform, neural, complexity, weight, filter, optimal, edmips, space, accuracy, learning, deep, efficient, optimization, large, allocation, activation, layer, higher, computation, quantization, classification, candidate, training, size, mixed, popular, proxy, learned, expensive, receive, performance, imagenet, closer, function, precision, binary, epoch, lower, arxiv, preprint, problem, small, requires, set, gradient] [differentiable, single, point, avoid, solution]
@InProceedings{Cai_2020_CVPR,
  author = {Cai, Zhaowei and Vasconcelos, Nuno},
  title = {Rethinking Differentiable Search for Mixed-Precision Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Residual Feature Aggregation Network for Image Super-Resolution
Jie Liu, Wenjie Zhang, Yuting Tang, Jie Tang, Gangshan Wu


Recently, very deep convolutional neural networks (CNNs) have shown great power in single image super-resolution (SISR) and achieved significant improvements over traditional methods. Among these CNN-based methods, residual connections play a critical role in boosting network performance. As the network depth grows, the residual features gradually focus on different aspects of the input image, which is very useful for reconstructing spatial details. However, existing methods neglect to fully utilize the hierarchical features on the residual branches. To address this issue, we propose a novel residual feature aggregation (RFA) framework for more efficient feature extraction. The RFA framework groups several residual modules together and directly forwards the features on each local residual branch by adding skip connections. The RFA framework is therefore capable of aggregating these informative residual features to produce more representative features. To maximize the power of the RFA framework, we further propose an enhanced spatial attention (ESA) block that makes the residual features focus more on critical spatial content. The ESA block is designed to be lightweight and efficient. Our final RFANet is constructed by applying the proposed RFA framework with the ESA blocks. Comprehensive experiments demonstrate the necessity of our RFA framework and the superiority of our RFANet over state-of-the-art SISR methods.
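A condensed sketch of the residual feature aggregation idea just described: the outputs of several stacked residual blocks are kept, concatenated, and fused by a 1x1 convolution instead of only passing the last block's output forward. The enhanced spatial attention block is omitted, and the block layout is an assumption.

import torch
import torch.nn as nn

class RFAGroup(nn.Module):
    def __init__(self, channels, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU(inplace=True),
                          nn.Conv2d(channels, channels, 3, padding=1))
            for _ in range(n_blocks))
        self.fuse = nn.Conv2d(channels * n_blocks, channels, 1)

    def forward(self, x):
        feats, h = [], x
        for block in self.blocks:
            r = block(h)        # residual branch output
            feats.append(r)     # forwarded directly to the aggregation point
            h = h + r           # usual residual connection
        return x + self.fuse(torch.cat(feats, dim=1))
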
[attention, hierarchical, mechanism, powerful, three, visual, multiple] [feature, framework, module, aggregation, propose, table, achieves, fully, key, map] [model, input, representative, identity, norm] [residual, block, rfa, spatial, convolutional, rfanet, proposed, ieee, esa, psnr, rcan, enhanced, conv, channel, san, edsr, rdn, degradation, vdsr, bicubic, fsrcnn, achieved, figure, convolution, memnet, srcnn, lapsrn, srmd, based, dbpn, existing, lightweight, receptive, output, combination, srfbn, nlrn, applying] [image, consists] [network, deep, learning, performance, best, layer, memory, training, compared, better, basic, neural, design, indicates, efficient] [computer, dense, single, volume, focused, directly, local, lecture, depth]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Jie and Zhang, Wenjie and Tang, Yuting and Tang, Jie and Wu, Gangshan},
  title = {Residual Feature Aggregation Network for Image Super-Resolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Resolution Adaptive Networks for Efficient Inference
Le Yang, Yizeng Han, Xi Chen, Shiji Song, Jifeng Dai, Gao Huang


Adaptive inference is an effective mechanism to achieve a dynamic tradeoff between accuracy and computational cost in deep networks. Existing works mainly exploit architecture redundancy in network depth or width. In this paper, we focus on spatial redundancy of input samples and propose a novel Resolution Adaptive Network (RANet), which is inspired by the intuition that low-resolution representations are sufficient for classifying "easy" inputs containing large objects with prototypical features, while only some "hard" samples need spatially detailed information. In RANet, the input images are first routed to a lightweight sub-network that efficiently extracts low-resolution representations, and those samples with high prediction confidence will exit early from the network without being further processed. Meanwhile, high-resolution paths in the network maintain the capability to recognize the "hard" samples. Therefore, RANet can effectively reduce the spatial redundancy involved in inferring high-resolution inputs. Empirically, we demonstrate the effectiveness of the proposed RANet on the CIFAR-10, CIFAR-100 and ImageNet datasets in both the anytime prediction setting and the budgeted batch classification setting.
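A minimal sketch of the confidence-based early exiting described above for the budgeted/anytime setting: classifiers attached to increasingly expensive sub-networks are queried in order, and the first one whose softmax confidence clears a threshold answers. Sub-network structure and threshold calibration are simplified, and the function assumes a batch of one image.

import torch

def adaptive_inference(x, sub_networks, threshold=0.9):
    """sub_networks: list of callables, cheapest (low resolution) first,
    each returning class logits for the single-image batch x."""
    logits = None
    for i, net in enumerate(sub_networks):
        logits = net(x)
        conf, pred = torch.softmax(logits, dim=1).max(dim=1)
        if conf.item() >= threshold:          # "easy" sample: exit early
            return pred, i
    return logits.argmax(dim=1), len(sub_networks) - 1   # "hard" sample: full network used
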
[prediction, previous, correctly, recognize, gao, multiple] [feature, confidence, classifying] [input, budget, model, ensemble, classified] [adaptive, figure, scale, resolution, block, conv, convolutional, proposed, high, spatial, fusion, lightweight, output, downsampling] [image, adaptation, corresponding, loss] [network, ranet, classification, computational, inference, accuracy, neural, layer, msdnet, deep, ranets, imagenet, higher, efficient, batch, redundancy, learning, architecture, sample, budgeted, base, training, msdnets, resnets, achieve, anytime, classifier, lowest, early, setting, set, densenets, densenet, design, larger, improved, evaluate, efficiency, cifar] [depth, dense, cost, coarse, initial]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Le and Han, Yizeng and Chen, Xi and Song, Shiji and Dai, Jifeng and Huang, Gao},
  title = {Resolution Adaptive Networks for Efficient Inference},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Forget for Meta-Learning
Sungyong Baik, Seokil Hong, Kyoung Mu Lee


Few-shot learning is a challenging problem in which the goal is to achieve generalization from only a few examples. Model-agnostic meta-learning (MAML) tackles the problem by formulating prior knowledge as a common initialization across tasks, which is then used to quickly adapt to unseen tasks. However, forcibly sharing an initialization can lead to conflicts among tasks and to a compromised (undesired by some tasks) location on the optimization landscape, thereby hindering task adaptation. Further, we observe that the degree of conflict differs not only among tasks but also among the layers of a neural network. Thus, we propose task- and layer-wise attenuation of the compromised initialization to reduce its influence. As the attenuation dynamically controls (or selectively forgets) the influence of prior knowledge for a given task and each layer, we name our method L2F (Learn to Forget). The experimental results demonstrate that the proposed method provides faster adaptation and greatly improves performance. Furthermore, L2F can be easily applied to and improve other state-of-the-art MAML-based frameworks, illustrating its simplicity and generalizability.
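A hedged sketch of the attenuation idea above in a MAML-style inner loop: a small network produces one attenuation factor per layer (conditioned here, as a crude assumption, on a summary of the initial task gradients), the shared initialization is scaled by those factors, and ordinary inner-loop updates follow. Names and the conditioning signal are illustrative, not the authors'.

import torch

def adapt_to_task(theta, grads, attenuation_net, support_loss_fn, inner_lr=0.01, steps=5):
    """theta: list of per-layer parameter tensors (the meta-learned initialization);
    grads: list of per-layer gradients of the support loss at theta."""
    layer_stats = torch.stack([g.mean() for g in grads])          # crude per-layer summary
    gamma = torch.sigmoid(attenuation_net(layer_stats))           # one factor per layer, in (0,1)
    params = [g_i * p for g_i, p in zip(gamma, theta)]            # selectively "forget" the prior
    for _ in range(steps):
        loss = support_loss_fn(params)
        step_grads = torch.autograd.grad(loss, params, create_graph=True)
        params = [p - inner_lr * g for p, g in zip(params, step_grads)]
    return params
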
[starting, encode, reinforcement] [table, location, propose, effectiveness, ablation] [degree, model, generalization, generalizability] [figure, fast, method, prior, proposed, conv, sharp] [adaptation, learn, loss, common, generate, generated] [initialization, maml, task, attenuation, learning, conflict, network, knowledge, performance, lti, landscape, gradient, training, layer, compromised, optimization, neural, distribution, miniimagenet, classification, update, parameter, deeper, grad, number, sampled, set, problem, achieve, dti, accuracy, observe, better, forget, simplicity, amount, average, lower, argue, leo, function, tieredimagenet, class, deep, large, learner, disagreement] [point, demonstrate, compute, transformation]
@InProceedings{Baik_2020_CVPR,
  author = {Baik, Sungyong and Hong, Seokil and Lee, Kyoung Mu},
  title = {Learning to Forget for Meta-Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Learning for Handling Kernel/model Uncertainty in Image Deconvolution
Yuesong Nan, Hui Ji


Most existing non-blind image deconvolution methods assume that the given blurring kernel is error-free. In practice, the blurring kernel is often estimated by a blind deblurring algorithm and is therefore not exact. Also, the convolution model is only an approximation to the practical blurring effect. It is known that non-blind deconvolution is susceptible to such kernel/model errors. Based on an error-in-variable (EIV) model of image blurring that takes kernel error into consideration, this paper presents a deep learning method for deconvolution which unrolls a total-least-squares (TLS) estimator whose associated priors are learned by neural networks (NNs). The experiments show that the proposed method is robust to kernel/model error and noticeably outperforms existing solutions when deblurring images with noisy kernels, e.g. those estimated by existing blind motion deblurring methods.
[dataset, step, visual] [table, denotes, contour, sun, main, stage, cnn, including] [noise, model, input, iterative, correction, fig, trained, robust, cho, robustness, true] [kernel, method, deblurring, proposed, blind, deconvolution, existing, noisy, prior, ieee, blurring, based, motion, handling, levin, comparison, blurred, blur, treatment, lai, denoising, pattern, figure, relating, blurry, sharp, restoration, gaussian, vasu, convolution] [image, latent, specific, real] [learning, deep, regularization, set, algorithm, optimization, performance, practical, training, process, scheme, paper, better, function, gain, matrix, procedure, sampling] [error, term, estimated, computer, conference, truth, second, estimate, estimation, solving, vision, assume, measurement, approach, ground]
@InProceedings{Nan_2020_CVPR,
  author = {Nan, Yuesong and Ji, Hui},
  title = {Deep Learning for Handling Kernel/model Uncertainty in Image Deconvolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Reflection Scene Separation From a Single Image
Renjie Wan, Boxin Shi, Haoliang Li, Ling-Yu Duan, Alex C. Kot


For images taken through glass, existing methods focus on restoring the background scene by regarding the reflection components as noise. However, the scene reflected by the glass surface also contains important information to be recovered, especially for surveillance or criminal investigations. In this paper, instead of removing reflection components from the mixture image, we aim at recovering reflection scenes from the mixture image. We first propose a strategy to obtain such ground truth and its corresponding input images. Then, we propose a two-stage framework to obtain the visible reflection scene from the mixture image. Specifically, we train the network with a shift-invariant loss which is robust to misalignment between the input and output images. Experimental results show that the proposed method achieves promising results.
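A shift-invariant reconstruction loss of the kind mentioned above can be sketched as taking the minimum error over small spatial offsets; the offset range and the choice of L1 here are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def shift_invariant_l1(pred, target, max_shift=4):
    # Compare the prediction against the target translated by every offset in
    # [-max_shift, max_shift]^2 and keep the smallest error, so small
    # misalignments between input and ground truth are not penalized.
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = torch.roll(target, shifts=(dy, dx), dims=(-2, -1))
            err = F.l1_loss(pred, shifted)
            best = err if best is None else torch.minimum(best, err)
    return best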
[dataset, evaluation, previous, three] [background, denotes, global, propose, mirror, framework, adopt, feature, table, achieves] [model, visibility, input, correction] [separation, figure, method, enhancement, proposed, removal, light, color, reflection, ieee, pattern, dpe, glass, emitted, patch, separated, pixel, enlightengan, putting, convolutional, exposure, based, quantitative, residue, dark, zhang] [image, loss, component, misalignment, invariant, cyclegan, corresponding, introduce, real, lpips] [network, mixture, training, deep, data, learning, equation, problem, better, layer, increase, strategy, gradient] [scene, conference, estimated, computer, ground, vision, truth, local, solve, single, camera, directly, estimation, recovering]
@InProceedings{Wan_2020_CVPR,
  author = {Wan, Renjie and Shi, Boxin and Li, Haoliang and Duan, Ling-Yu and Kot, Alex C.},
  title = {Reflection Scene Separation From a Single Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Wavelet Synthesis Net for Disparity Estimation to Synthesize DSLR Calibre Bokeh Effect on Smartphones
Chenchi Luo, Yingmao Li, Kaimo Lin, George Chen, Seok-Jun Lee, Jihwan Choi, Youngjun Francis Yoo, Michael O. Polley


Modern smartphone cameras can match traditional digital single lens reflex (DSLR) cameras in many areas thanks to the introduction of camera arrays and multi-frame processing. Among all types of DSLR effects, the narrow depth of field (DoF), or so-called bokeh, probably arouses the most interest. Today's smartphones try to overcome the physical lens and sensor limitations by introducing computational methods that utilize a depth map to synthesize the narrow DoF effect from all-in-focus images. However, a high quality depth map remains the key differentiator between computational bokeh and DSLR optical bokeh. Empowered by a novel wavelet synthesis network architecture, we have narrowed the gap between DSLR and smartphone cameras in terms of bokeh more than ever before. We describe three key enablers of our bokeh solution: a synthetic graphics engine to generate training data with precisely prescribed characteristics that match real smartphone captures, a novel wavelet synthesis neural network (WSN) architecture to produce unprecedented high definition disparity maps promptly on smartphones, and a new evaluation metric to quantify the quality of the disparity map for real images from the bokeh rendering perspective. Experimental results show that the disparity map produced by our neural network achieves much better accuracy than other state-of-the-art CNN based algorithms. Combining the high resolution disparity map with our rendering algorithm, we demonstrate visually superior bokeh pictures compared with existing top-rated flagship smartphones listed on DXOMARK Mobile.
[evaluation, three, engine, pair] [feature, map, correlation, main, stage, module, foreground, mask, key, level] [quality, input, trained] [disparity, bokeh, wavelet, wsn, smartphone, resolution, high, optical, smartphones, ieee, dslr, figure, convolutional, pattern, invertible, output, spatial, detail, existing, captured, lens, based, pixel, mxiou, calibre, low] [image, synthetic, real, produce, synthesis] [training, network, data, layer, normalized, augmentation, neural, baseline, search, rate, learning, better, performance, large, algorithm, achieve, dimension, computational, top, size, paper] [camera, depth, stereo, computer, conference, photometric, estimation, vision, rendering, calibrated, ground, left, direction, truth, rendered, calibration, international, match]
@InProceedings{Luo_2020_CVPR,
  author = {Luo, Chenchi and Li, Yingmao and Lin, Kaimo and Chen, George and Lee, Seok-Jun and Choi, Jihwan and Yoo, Youngjun Francis and Polley, Michael O.},
  title = {Wavelet Synthesis Net for Disparity Estimation to Synthesize DSLR Calibre Bokeh Effect on Smartphones},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Bundle Adjustment on a Graph Processor
Joseph Ortiz, Mark Pupilli, Stefan Leutenegger, Andrew J. Davison


Graph processors such as Graphcore's Intelligence Processing Unit (IPU) are part of the major new wave of novel computer architecture for AI, and have a general design with massively parallel computation, distributed on-chip memory and very high inter-core communication bandwidth which allows breakthrough performance for message passing algorithms on arbitrary graphs. We show for the first time that the classical computer vision problem of bundle adjustment (BA) can be solved extremely fast on a graph processor using Gaussian Belief Propagation. Our simple but fully parallel implementation uses the 1216 cores on a single IPU chip to, for instance, solve a real BA problem with 125 keyframes and 1919 points in under 40ms, compared to 1450ms for the Ceres CPU library. Further code optimisation will surely increase this difference on static problems, but we argue that the real promise of graph processing is for flexible in-place optimisation of general, dynamically changing factor graphs representing Spatial AI problems. We give indications of this with experiments showing the ability of GBP to efficiently solve incremental SLAM problems, and deal with robust cost functions and different types of factors.
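To make the Gaussian Belief Propagation idea concrete, here is a minimal sketch on a toy 1D pose chain with relative measurements, with messages kept in information form; this is a simplification for illustration, not the paper's IPU bundle adjustment implementation, and all names and defaults are assumptions.

import numpy as np

def gbp_chain(relative_meas, meas_sigma=0.1, prior_sigma=1e-3, iters=None):
    # Variable i is a scalar position, anchored by a prior on x[0] = 0;
    # factor f measures x[f+1] - x[f].  Messages are (eta, lam) pairs.
    n = len(relative_meas) + 1
    iters = 2 * n if iters is None else iters
    lam_z = 1.0 / meas_sigma ** 2
    msg = np.zeros((n - 1, 2, 2))     # msg[f, side] -> variable f (side 0) or f+1 (side 1)
    prior_eta, prior_lam = np.zeros(n), np.zeros(n)
    prior_lam[0] = 1.0 / prior_sigma ** 2

    def beliefs():
        eta, lam = prior_eta.copy(), prior_lam.copy()
        for f in range(n - 1):
            eta[f] += msg[f, 0, 0];     lam[f] += msg[f, 0, 1]
            eta[f + 1] += msg[f, 1, 0]; lam[f + 1] += msg[f, 1, 1]
        return eta, lam

    for _ in range(iters):
        eta, lam = beliefs()
        for f, z in enumerate(relative_meas):
            # Variable-to-factor messages: belief minus this factor's own message.
            eta_i, lam_i = eta[f] - msg[f, 0, 0], lam[f] - msg[f, 0, 1]
            eta_j, lam_j = eta[f + 1] - msg[f, 1, 0], lam[f + 1] - msg[f, 1, 1]
            # Factor in canonical form for 0.5*lam_z*(x_j - x_i - z)^2:
            #   Lambda_f = lam_z*[[1,-1],[-1,1]],  eta_f = lam_z*z*[-1, 1].
            ljj = lam_z + lam_j               # condition on j's message, marginalize j
            msg[f, 0, 1] = lam_z - lam_z ** 2 / ljj
            msg[f, 0, 0] = -lam_z * z + lam_z * (lam_z * z + eta_j) / ljj
            lii = lam_z + lam_i               # condition on i's message, marginalize i
            msg[f, 1, 1] = lam_z - lam_z ** 2 / lii
            msg[f, 1, 0] = lam_z * z + lam_z * (-lam_z * z + eta_i) / lii

    eta, lam = beliefs()
    return eta / lam                          # posterior mean of every pose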
[graph, belief, message, node, passing, time, speed, evaluation, recognition] [propagation, map, cpu] [landmark, noise, robust] [adjustment, gaussian, figure, prior, ieee, adjacent, parallel, spatial, pattern] [factor, variable, loss, independent, arbitrary, mapping, specific] [set, data, distribution, iteration, implementation, efficient, marginal, convergence, processing, large, converge, general, chip, update, design, incremental, function, exp, mkm, performance, problem, inference, algorithm] [gbp, measurement, bundle, vision, keyframe, compute, ipu, huber, conference, computer, single, loopy, slam, solve, optimisation, reprojection, international, keyframes, robotics, joint, error, local, form, tum, structure, zkm, intelligence, massively, cost]
@InProceedings{Ortiz_2020_CVPR,
  author = {Ortiz, Joseph and Pupilli, Mark and Leutenegger, Stefan and Davison, Andrew J.},
  title = {Bundle Adjustment on a Graph Processor},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
3D-ZeF: A 3D Zebrafish Tracking Benchmark Dataset
Malte Pedersen, Joakim Bruslund Haurum, Stefan Hein Bengtson, Thomas B. Moeslund


In this work we present a novel publicly available stereo based 3D RGB dataset for multi-object zebrafish tracking, called 3D-ZeF. Zebrafish is an increasingly popular model organism used for studying neurological disorders, drug addiction, and more. Behavioral analysis is often a critical part of such research. However, visual similarity, occlusion, and the erratic movement of zebrafish make robust 3D tracking a challenging and unsolved problem. The proposed dataset consists of eight sequences with durations between 15 and 120 seconds and 1-10 freely moving zebrafish. The videos have been annotated with a total of 86,400 points and bounding boxes. Furthermore, we present a complexity score and a novel open-source modular baseline system for 3D tracking of zebrafish. The performance of the system is measured with respect to two detectors: a naive approach and a Faster R-CNN based fish head detector. The system reaches a MOTA of up to 77.6%. Links to the code and dataset are available at the project page http://vap.aau.dk/3d-zef
[dataset, multiple, naive, node, behavior, video, temporally, visual, water, order, time, three, graph, associated, evaluation, behavioral, social, constructed] [tracking, fish, zebrafish, tracklet, occlusion, tracklets, main, gallery, head, bounding, detection, mot, association, object, annotated, edge] [model, animal, example, publicly, developed, experimental] [based, method, figure, proposed, ieee, analysis, motion, journal, june, biological] [] [set, complexity, number, amount, test, compared, measure, data, performance, algorithm, average, training, setup, path] [conference, system, computer, distance, point, vision, international, single, camera, view, intersection, reprojection, stereo, ground, truth, estimated]
@InProceedings{Pedersen_2020_CVPR,
  author = {Pedersen, Malte and Haurum, Joakim Bruslund and Bengtson, Stefan Hein and Moeslund, Thomas B.},
  title = {3D-ZeF: A 3D Zebrafish Tracking Benchmark Dataset},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models
Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, Cynthia Rudin


The primary aim of single-image super-resolution is to construct a high-resolution (HR) image from a corresponding low-resolution (LR) input. In previous approaches, which have generally been supervised, the training objective typically measures a pixel-wise average distance between the super-resolved (SR) and HR images. Optimizing such metrics often leads to blurring, especially in high variance (detailed) regions. We propose an alternative formulation of the super-resolution problem based on creating realistic SR images that downscale correctly. We present a novel super-resolution algorithm addressing this problem, PULSE (Photo Upsampling via Latent Space Exploration), which generates high-resolution, realistic images at resolutions previously unseen in the literature. It accomplishes this in an entirely self-supervised fashion and is not confined to a specific degradation operator used during training, unlike previous methods (which require training on databases of LR-HR image pairs for supervised learning). Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the "downscaling loss," which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, we restrict the search space to guarantee that our outputs are realistic. PULSE thereby generates super-resolved images that both are realistic and downscale correctly. We show extensive experimental results demonstrating the efficacy of our approach in the domain of face super-resolution (also known as face hallucination). Our method outperforms state-of-the-art methods in perceptual quality at higher resolutions and scale factors than previously possible.
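The core loop can be sketched as optimizing a latent code so that the generator's output downscales back to the observed LR image, with the code kept near the sphere where most of a high-dimensional Gaussian's mass concentrates; the generator interface, downscaling operator, spherical projection, and hyperparameters below are illustrative assumptions, not the authors' exact procedure.

import torch

def pulse_search(lr_image, generator, downscale, latent_dim=512, steps=200, lr=0.4):
    # Find z such that downscale(G(z)) matches the LR input, while projecting z
    # back onto the sphere of radius sqrt(latent_dim) after every step.
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    radius = latent_dim ** 0.5
    for _ in range(steps):
        opt.zero_grad()
        sr = generator(z)                                   # candidate HR image
        loss = torch.nn.functional.mse_loss(downscale(sr), lr_image)  # downscaling loss
        loss.backward()
        opt.step()
        with torch.no_grad():                               # spherical constraint
            z.mul_(radius / z.norm(dim=1, keepdim=True))
    return generator(z).detach()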
[work, natural, recognition, provide, previous] [map, focus, score] [face, quality, input, model, adversarial, original, norm, trained] [downscale, high, resolution, downscaling, pulse, method, ilr, perceptual, isr, convolutional, ieee, figure, traditional, prior, scale, gaussian, pattern, degradation, ihr, upsampling, proposed, psnr, kpp] [image, generative, latent, loss, realistic, supervised, unsupervised, generated, manifold, aim, generating, discriminator, gans, stylegan, learn, celeba] [space, find, average, problem, set, neural, algorithm, network, learning, function, deep, training, higher, architecture, metric, random] [computer, conference, vision, approach, solution, well, single, international]
@InProceedings{Menon_2020_CVPR,
  author = {Menon, Sachit and Damian, Alexandru and Hu, Shijia and Ravi, Nikhil and Rudin, Cynthia},
  title = {PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Scalability in Perception for Autonomous Driving: Waymo Open Dataset
Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, Dragomir Anguelov


The research community has increasing interest in autonomous driving research, despite the resource intensity of obtaining representative real world data. Existing self-driving datasets are limited in the scale and variation of the environments they capture, even though generalization within and between operating regions is crucial to the overall viability of the technology. In an effort to help align the research community's contributions with real-world self-driving problems, we introduce a new large scale, high quality, diverse dataset. Our new dataset consists of 1150 scenes that each span 20 seconds, consisting of well synchronized and calibrated high quality LiDAR and camera data captured across a range of urban and suburban geographies. It is 15x more diverse than the largest camera+LiDAR dataset available, based on our proposed diversity metric. We exhaustively annotated this data with 2D (camera image) and 3D (LiDAR) bounding boxes, with consistent identifiers across frames. Finally, we provide strong baselines for 2D as well as 3D detection and tracking tasks. We further study the effects of dataset size and generalization across geographies on 3D detection methods. Find data, code and more up-to-date information at http://www.waymo.com/open.
[dataset, vehicle, frame, multiple, time, driving, provide, geographical] [lidar, object, tracking, detection, table, autonomous, level, center, pedestrian, iou, area, aph, box, track, positive, heading, score, annotated, bounding, mota, semantic, exciting, including, addition] [evaluating, datasets, model, trained] [sensor, range, figure, ieee, pattern, based, san, rolling, shutter, method, pixel, field] [image, domain, mountain, gap, diverse] [data, training, number, set, metric, validation, size, larger, labeled] [camera, point, vision, computer, conference, ground, coordinate, truth, scene, well, recorded, front, view, single, limited, coverage, synchronization, kitti, pose, defined]
@InProceedings{Sun_2020_CVPR,
  author = {Sun, Pei and Kretzschmar, Henrik and Dotiwalla, Xerxes and Chouard, Aurelien and Patnaik, Vijaysai and Tsui, Paul and Guo, James and Zhou, Yin and Chai, Yuning and Caine, Benjamin and Vasudevan, Vijay and Han, Wei and Ngiam, Jiquan and Zhao, Hang and Timofeev, Aleksei and Ettinger, Scott and Krivokon, Maxim and Gao, Amy and Joshi, Aditya and Zhang, Yu and Shlens, Jonathon and Chen, Zhifeng and Anguelov, Dragomir},
  title = {Scalability in Perception for Autonomous Driving: Waymo Open Dataset},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Extreme Relative Pose Network Under Hybrid Representations
Zhenpei Yang, Siming Yan, Qixing Huang


In this paper, we introduce a novel RGB-D based relative pose estimation approach that is suitable for small-overlapping or non-overlapping scans and can output multiple relative poses. Our method performs scene completion and matches the completed scans. However, instead of using a fixed representation for completion, the key idea is to utilize hybrid representations that combine 360° images, 2D image-based layout, and planar patches. This approach offers adaptive feature representations for relative pose estimation. Besides, we introduce a global-2-local matching procedure, which utilizes initial relative poses obtained during the global phase to detect and then integrate geometric relations for pose refinement. Experimental results justify the potential of this approach across a wide range of benchmark datasets. For example, on ScanNet, the rotation/translation errors of the top-1/top-5 predictions of our approach are 28.6°/0.90m and 16.8°/0.76m, respectively. Our approach also considerably boosts the performance of multi-scan reconstruction in few-view reconstruction settings.
[multiple, three, pair, relation, evaluation] [module, global, feature, table, predicted, extreme, utilizes] [input, robust, experimental] [ieee, pattern, spectral, output, method, figure, performs, analysis, june, based] [representation, loss, rot, translation, alignment, extracted, layout] [network, learning, baseline, training, data, performance, top, neural, procedure, set, matrix, standard, deep] [relative, approach, pose, local, scan, conference, computer, geometric, vision, completion, matching, initial, estimation, overlapping, scene, planar, point, rotation, trans, single, hybrid, reconstruction, registration, international, dense, consistent, completed, indoor, rigid, term, descriptor, thomas, second]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Zhenpei and Yan, Siming and Huang, Qixing},
  title = {Extreme Relative Pose Network Under Hybrid Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Single-Shot Monocular RGB-D Imaging Using Uneven Double Refraction
Andreas Meuleman, Seung-Hwan Baek, Felix Heide, Min H. Kim


Cameras that capture color and depth information have become an essential imaging modality for applications in robotics, autonomous driving, virtual, and augmented reality. Existing RGB-D cameras rely on multiple sensors or on active illumination with specialized sensors. In this work, we propose a method for monocular single-shot RGB-D imaging. Instead of learning depth from single-image depth cues, we revisit double-refraction imaging using a birefractive medium, measuring depth as the displacement of differently refracted images superimposed in a single capture. However, existing double-refraction methods are orders of magnitude too slow to be used in real-time applications, e.g., in robotics, and provide only inaccurate depth due to correspondence ambiguity in double refraction. We resolve this ambiguity optically by leveraging the orthogonality of the two linearly polarized rays in double refraction -- introducing uneven double refraction by adding a linear polarizer to the birefractive medium. Doing so makes it possible to develop a method for reconstructing sparse depth and color simultaneously in real time. We validate the proposed method, both synthetically and experimentally, and demonstrate 3D object detection and photographic applications.
[dependency, previous, recognition, current] [detection, object, map, horizontal, table] [input, model, ray] [color, double, method, birefractive, restoration, disparity, figure, uneven, restored, ieee, pattern, light, imaging, rectification, captured, existing, optical, intensity, residual, pixel, rectified, calcite, proposed, spatial, birefringent, binocular, aperture, runtime, remove] [image, synthetic] [baseline, linear, approximated, equation, note, computational, algorithm, accuracy, memory, learning, efficient, function, lower, set, candidate] [depth, refraction, stereo, estimate, computer, cost, camera, vision, reconstruction, volume, polarizer, sparse, ground, conference, single, ambiguity, estimation, error, monocular, estimated, scene, joint, novel, truth, rely, front, refer, supplemental]
@InProceedings{Meuleman_2020_CVPR,
  author = {Meuleman, Andreas and Baek, Seung-Hwan and Heide, Felix and Kim, Min H.},
  title = {Single-Shot Monocular RGB-D Imaging Using Uneven Double Refraction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Inverse Rendering for Complex Indoor Scenes: Shape, Spatially-Varying Lighting and SVBRDF From a Single Image
Zhengqin Li, Mohammad Shafiei, Ravi Ramamoorthi, Kalyan Sunkavalli, Manmohan Chandraker


We propose a deep inverse rendering framework for indoor scenes. From a single RGB image of an arbitrary indoor scene, we obtain a complete scene reconstruction, estimating shape, spatially-varying lighting, and spatially-varying, non-Lambertian surface reflectance. Our novel inverse rendering network incorporates physical insights -- including a spatially-varying spherical Gaussian lighting representation, a differentiable rendering layer to model scene appearance, a cascade structure to iteratively refine the predictions and a bilateral solver for refinement -- allowing us to jointly reason about shape, lighting, and reflectance. Since no existing dataset provides ground truth high quality spatially-varying material and spatially-varying lighting, we propose novel methods to map complex materials to existing indoor scene datasets and a new physically-based GPU renderer to create a large-scale, photorealistic indoor dataset. Experiments show that our framework outperforms previous methods and enables various novel applications like photorealistic object insertion and material editing.
[dataset, predict, environment, work] [object, cascade, global, map, table, predicted] [insertion, original, model, input, datasets] [figure, inverse, high, scale, illumination, bilateral, light, frequency, gaussian, method, quantitative, recover, proposed] [image, real, photorealistic, editing, synthetic, loss, realistic, mapping, texture, synthesis, invariant] [network, training, deep, test, layer, neural, replace] [lighting, material, rendering, scene, indoor, single, spherical, specular, complex, rendered, diffuse, geometry, svbrdf, reflectance, intrinsic, computer, estimation, supplementary, brdf, render, phong, differentiable, ground, depth, local, shape, acm, truth, microfacet, error, solver, albedo, normal, joint, vision, kalyan, novel, structure, roughness, barron, decomposition, shading, lambertian]
@InProceedings{Li_2020_CVPR,
  author = {Li, Zhengqin and Shafiei, Mohammad and Ramamoorthi, Ravi and Sunkavalli, Kalyan and Chandraker, Manmohan},
  title = {Inverse Rendering for Complex Indoor Scenes: Shape, Spatially-Varying Lighting and SVBRDF From a Single Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
3D Packing for Self-Supervised Monocular Depth Estimation
Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, Adrien Gaidon


Although cameras are ubiquitous, robotic platforms typically rely on active sensors like LiDAR for direct 3D perception. In this work, we propose a novel self-supervised monocular depth estimation method combining geometry with a new deep network, PackNet, learned only from unlabeled monocular videos. Our architecture leverages novel symmetrical packing and unpacking blocks to jointly learn to compress and decompress detail-preserving representations using 3D convolutions. Although self-supervised, our method outperforms other self-, semi-, and fully-supervised methods on the KITTI benchmark. The 3D inductive bias in PackNet enables it to scale with input resolution and number of parameters without overfitting, generalizing better on out-of-domain data such as the NuScenes dataset. Furthermore, it does not require large-scale supervised pretraining on ImageNet and can run in real time. Finally, we release DDAD (Dense Depth for Automated Driving), a new urban driving dataset with more challenging and accurate depth evaluation, thanks to longer-range and denser ground-truth depth generated from high-density LiDARs mounted on a fleet of self-driving cars operating worldwide.
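A detail-preserving "packing" block of this flavor can be sketched as space-to-depth folding, a 3D convolution over the folded structure, and a 2D convolution to reduce channels; the channel counts, kernel sizes, and module layout below are illustrative assumptions, not the paper's exact block.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PackingBlock(nn.Module):
    # Sketch of a packing block: fold spatial resolution into channels with
    # space-to-depth, mix the folded structure with a 3D convolution, then
    # compress back to 2D feature maps with a 2D convolution.
    def __init__(self, in_ch, out_ch, r=2, d=4):
        super().__init__()
        self.r, self.d = r, d
        self.conv3d = nn.Conv3d(1, d, kernel_size=3, padding=1)
        self.conv2d = nn.Conv2d(in_ch * r * r * d, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.pixel_unshuffle(x, self.r)          # B, C*r*r, H/r, W/r
        b, c, h, w = x.shape
        x = self.conv3d(x.unsqueeze(1))           # B, d, C*r*r, H/r, W/r
        x = x.reshape(b, c * self.d, h, w)        # fold 3D features back to 2D
        return self.conv2d(x)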
[dataset, time, previous] [table, feature, resnet, pooling, lidar, supervision, propose, nuscenes] [input, trained, model, original] [packing, convolutional, ieee, pattern, spatial, scale, proposed, resolution, figure, output, method, flow, invertible] [image, loss, learn, target, supervised, unsupervised, source] [learning, network, training, performance, architecture, deep, number, higher, unlabeled, pretraining, imagenet, better, arxiv, preprint, standard, family, neural] [depth, monocular, packnet, estimation, unpacking, conference, computer, pose, vision, kitti, velocity, accurate, photometric, single, ddad, rel, scene, metrically, camera, dense, sfm, thomas, international, novel]
@InProceedings{Guizilini_2020_CVPR,
  author = {Guizilini, Vitor and Ambrus, Rares and Pillai, Sudeep and Raventos, Allan and Gaidon, Adrien},
  title = {3D Packing for Self-Supervised Monocular Depth Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching
Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, Ping Tan


The deep multi-view stereo (MVS) and stereo matching approaches generally construct 3D cost volumes to regularize and regress the output depth or disparity. These methods are limited when high-resolution outputs are needed, since the memory and time costs grow cubically as the volume resolution increases. In this paper, we propose a cost volume formulation that is both memory- and time-efficient and complementary to existing multi-view stereo and stereo matching approaches based on 3D cost volumes. First, the proposed cost volume is built upon a standard feature pyramid encoding geometry and context at gradually finer scales. Then, we narrow the depth (or disparity) range of each stage using the depth (or disparity) map from the previous stage. With gradually higher cost volume resolution and adaptive adjustment of depth (or disparity) intervals, the output is recovered in a coarse-to-fine manner. We apply the cascade cost volume to the representative MVSNet, and obtain a 35.6% improvement on the DTU benchmark (1st place), with 50.6% and 59.3% reductions in GPU memory and run-time. It is also the state-of-the-art learning-based method on the Tanks and Temples benchmark. The statistics of accuracy, run-time and GPU memory on other representative stereo CNNs also validate the effectiveness of our proposed method. Our source code is available at https://github.com/alibaba/cascade-stereo.
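The stage-to-stage narrowing can be sketched as re-centering a smaller set of finer-spaced depth hypotheses on the previous stage's prediction (upsampling to the finer resolution is omitted here); the plane count, interval, and clamping are illustrative assumptions rather than the paper's exact settings.

import torch

def next_stage_hypotheses(prev_depth, num_planes, interval):
    # prev_depth: [B, H, W] depth from the previous (coarser) stage.
    # Returns per-pixel depth hypotheses [B, num_planes, H, W] centered on it.
    offsets = (torch.arange(num_planes, dtype=prev_depth.dtype, device=prev_depth.device)
               - (num_planes - 1) / 2) * interval          # symmetric offsets around 0
    hyp = prev_depth.unsqueeze(1) + offsets.view(1, -1, 1, 1)
    return hyp.clamp(min=1e-3)                              # keep hypotheses positive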
[dataset, evaluation, construct, context, previous, multiple, build, long, time] [cascade, feature, stage, table, pyramid, benchmark, denotes, map, correlation, aggregation, improvement, semantic, global] [input, original] [resolution, spatial, disparity, figure, range, based, proposed, psmnet, output, method, cnns, quantitative, flow, generally, convolutional, intermediate] [image, corresponding, generate, loss] [memory, gpu, set, number, network, accuracy, learning, deep, neural, training, standard, performance, increased, efficient, higher, larger, regularization] [cost, stereo, volume, depth, matching, hypothesis, mvsnet, dtu, plane, gwcnet, point, scene, reconstruction, view, interval, formulation, finer, multiview]
@InProceedings{Gu_2020_CVPR,
  author = {Gu, Xiaodong and Fan, Zhiwen and Zhu, Siyu and Dai, Zuozhuo and Tan, Feitong and Tan, Ping},
  title = {Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
From Two Rolling Shutters to One Global Shutter
Cenek Albl, Zuzana Kukelova, Viktor Larsson, Michal Polic, Tomas Pajdla, Konrad Schindler


Most consumer cameras are equipped with electronic rolling shutter, leading to image distortions when the camera moves during image capture. We explore a surprisingly simple camera configuration that makes it possible to undo the rolling shutter distortion: two cameras mounted to have different rolling shutter directions. Such a setup is easy and cheap to build and it possesses the geometric constraints needed to correct rolling shutter distortion using only a sparse set of point correspondences between the two images. We derive equations that describe the underlying geometry for general and special motions and present an efficient method for finding their solutions. Our synthetic and real experiments demonstrate that our approach is able to remove large rolling shutter distortions of all types without relying on any specific scene structure.
[three, time] [global] [model, case, distortion, undistorted, constant, input, trained, typical] [motion, shutter, rolling, figure, method, remove, optical, flow, low, assumption, proposed] [image, translation, real, synthetic, corresponding] [baseline, efficient, general, angular, consider, requires, note, configuration, setup, close, small, matrix, equation] [camera, rotation, depth, scene, solution, velocity, minimal, translational, pure, geometry, rig, solver, pose, single, well, sfm, rotational, point, second, error, tomas, relative, cenek, zuzana, approach, projection, reconstruction, stereo, undistortion, full, correspondence, perspective, dense, absolute, undistort, epipolar, opposite, initial, distance, system, solve, structure, supplementary, handle]
@InProceedings{Albl_2020_CVPR,
  author = {Albl, Cenek and Kukelova, Zuzana and Larsson, Viktor and Polic, Michal and Pajdla, Tomas and Schindler, Konrad},
  title = {From Two Rolling Shutters to One Global Shutter},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Global Registration
Christopher Choy, Wei Dong, Vladlen Koltun


We present Deep Global Registration, a differentiable framework for pairwise registration of real-world 3D scans. Deep global registration is based on three modules: a 6-dimensional convolutional network for correspondence confidence prediction, a differentiable Weighted Procrustes algorithm for closed-form pose estimation, and a robust gradient-based SE(3) optimizer for pose refinement. Experiments demonstrate that our approach outperforms state-of-the-art methods, both learning-based and classical, on real-world data.
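The closed-form pose step can be illustrated by a weighted Procrustes solve given putative correspondences and per-correspondence confidence weights; this sketch covers only that step (not the confidence network or the robust refinement), and the variable names are assumptions.

import numpy as np

def weighted_procrustes(src, dst, w):
    # Find R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.
    # src, dst: [N, 3] corresponding points; w: [N] non-negative confidences.
    w = w / (w.sum() + 1e-12)
    mu_s = (w[:, None] * src).sum(axis=0)                   # weighted centroids
    mu_d = (w[:, None] * dst).sum(axis=0)
    S = (dst - mu_d).T @ (w[:, None] * (src - mu_s))        # weighted cross-covariance
    U, _, Vt = np.linalg.svd(S)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt
    t = mu_d - R @ mu_s
    return R, t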
[dataset, prediction, outperforms, work] [global, feature, recall, module, final, propose, benchmark, table] [robust, iterative] [method, convolutional, dcp, fast, classical, likelihood, figure, analysis, based, pattern] [loss, translation, train, row, representation, generate] [network, pairwise, weighted, deep, optimization, set, learning, space, training, test, function, algorithm, number, weight, gradient, random, neural, convnet] [registration, point, procrustes, correspondence, pose, ransac, error, fgr, icp, cloud, reconstruction, rotation, geometric, differentiable, vladlen, pointnetlk, pipeline, closest, globally, accurate, form, inlier, dense, scene, partial, approach, geometry, structure, neighbor, surface, defined, jaesik, thomas]
@InProceedings{Choy_2020_CVPR,
  author = {Choy, Christopher and Dong, Wei and Koltun, Vladlen},
  title = {Deep Global Registration},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Stereo Using Adaptive Thin Volume Representation With Uncertainty Awareness
Shuo Cheng, Zexiang Xu, Shilin Zhu, Zhuwen Li, Li Erran Li, Ravi Ramamoorthi, Hao Su


We present Uncertainty-aware Cascaded Stereo Network (UCS-Net) for 3D reconstruction from multiple RGB images. Multi-view stereo (MVS) aims to reconstruct fine-grained scene geometry from multi-view images. Previous learning-based MVS methods estimate per-view depth using plane sweep volumes (PSVs) with a fixed depth hypothesis at each plane; this requires densely sampled planes for high accuracy, which is impractical for high-resolution depth because of limited memory. In contrast, we propose adaptive thin volumes (ATVs); in an ATV, the depth hypothesis of each plane is spatially varying, which adapts to the uncertainties of previous per-pixel depth predictions. Our UCS-Net has three stages: the first stage processes a small PSV to predict low-resolution depth; two ATVs are then used in the following stages to refine the depth with higher resolution and higher accuracy. Our ATV consists of only a small number of planes with low memory and computation costs; yet, it efficiently partitions local depth ranges within learned small uncertainty intervals. We propose to use variance-based uncertainty estimates to adaptively construct ATVs; this differentiable process leads to reasonable and fine-grained spatial partitioning. Our multi-stage framework progressively sub-divides the vast scene space with increasing depth resolution and precision, which enables reconstruction with high completeness and accuracy in a coarse-to-fine fashion. We demonstrate that our method achieves superior performance compared with other learning-based MVS methods on various challenging datasets.
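The variance-based uncertainty used to bound the next stage's thin volume can be sketched as the mean and standard deviation of the current probability volume; the scaling factor k and interface below are illustrative assumptions, not the paper's exact parameterization.

import torch

def adaptive_thin_volume_bounds(prob, depth_values, k=1.5):
    # prob: [B, D, H, W] softmax over depth hypotheses; depth_values: [D].
    # Returns per-pixel lower/upper depth bounds for the next stage's hypotheses.
    d = depth_values.view(1, -1, 1, 1)
    mean = (prob * d).sum(dim=1)                            # expected depth per pixel
    var = (prob * (d - mean.unsqueeze(1)) ** 2).sum(dim=1)
    sigma = var.clamp(min=1e-12).sqrt()
    return mean - k * sigma, mean + k * sigma               # uncertainty-aware interval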
[prediction, multiple, previous, three, predict, construct, work] [stage, feature, achieves, propose, map, predicted, cnn] [highly] [ieee, high, method, pattern, spatial, adaptive, resolution, pixel] [image, corresponding, progressively] [learning, network, accuracy, probability, deep, sampling, achieve, small, higher, memory, efficient, number, large, learned, better, size, note, space] [depth, conference, computer, uncertainty, reconstruction, volume, vision, atv, point, ground, truth, cost, international, plane, local, novel, atvs, stereo, reconstruct, scene, estimation, thin, sweep, completeness, dtu, enables, shape, geometry, cloud, accurate, surface, single, european, dense, estimated]
@InProceedings{Cheng_2020_CVPR,
  author = {Cheng, Shuo and Xu, Zexiang and Zhu, Shilin and Li, Zhuwen and Li, Li Erran and Ramamoorthi, Ravi and Su, Hao},
  title = {Deep Stereo Using Adaptive Thin Volume Representation With Uncertainty Awareness},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Why Having 10,000 Parameters in Your Camera Model Is Better Than Twelve
Thomas Schops, Viktor Larsson, Marc Pollefeys, Torsten Sattler


Camera calibration is an essential first step in setting up 3D Computer Vision systems. Commonly used parametric camera models are limited to a few degrees of freedom and thus often do not optimally fit to complex real lens distortion. In contrast, generic camera models allow for very accurate calibration due to their flexibility. Despite this, they have seen little use in practice. In this paper, we argue that this should change. We propose a calibration pipeline for generic models that is fully automated, easy to use, and can act as a drop-in replacement for parametric calibration, with a focus on accuracy. We compare our results to parametric calibrations. Considering stereo depth estimation and camera pose estimation as examples, we show that the calibration error acts as a bias on the results. We thus argue that in contrast to current common practice, generic models should be preferred over parametric ones whenever possible. To facilitate this, we released our calibration pipeline at https://github.com/puzzlepaint/camera_calibration, making both easy-to-use and accurate camera calibration available to everyone.
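The flavor of a generic central model can be sketched as a stored grid of calibrated viewing directions that is interpolated per pixel; bilinear interpolation and the grid layout here are simplifying assumptions for illustration (the paper's model is denser and more sophisticated).

import numpy as np

def unproject_generic(px, py, dir_grid, image_w, image_h):
    # dir_grid: [Gh, Gw, 3] unit viewing directions covering the image.
    # Returns the (approximate) unit ray direction for pixel (px, py).
    gh, gw, _ = dir_grid.shape
    u = px / (image_w - 1) * (gw - 1)                       # pixel -> grid coordinates
    v = py / (image_h - 1) * (gh - 1)
    x0, y0 = int(np.floor(u)), int(np.floor(v))
    x1, y1 = min(x0 + 1, gw - 1), min(y0 + 1, gh - 1)
    a, b = u - x0, v - y0
    d = ((1 - a) * (1 - b) * dir_grid[y0, x0] + a * (1 - b) * dir_grid[y0, x1]
         + (1 - a) * b * dir_grid[y1, x0] + a * b * dir_grid[y1, x1])
    return d / np.linalg.norm(d)                            # unit ray direction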
[observation, modeling, step, evaluation, multiple] [feature, detection, refinement, final, center, propose] [generic, model, central, distortion, radial, fisheye, robust] [pattern, pixel, figure, checkerboard, method, interpolated, adjustment, star, lens, window, result] [image, control] [better, data, number, performance, random, space, note, sample, accuracy, approximate, optimize, general, function, consider, small, optimization, set] [camera, calibration, parametric, error, point, stereo, accurate, bundle, reprojection, grid, local, compare, median, peter, direction, approach, dense, fit, depth, estimation, pose, defined, cost, computer, pipeline, calibrated, projection, define, avoid, compute, opencv, srikumar]
@InProceedings{Schops_2020_CVPR,
  author = {Schops, Thomas and Larsson, Viktor and Pollefeys, Marc and Sattler, Torsten},
  title = {Why Having 10,000 Parameters in Your Camera Model Is Better Than Twelve},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Blur Aware Calibration of Multi-Focus Plenoptic Camera
Mathieu Labussiere, Celine Teuliere, Frederic Bernardin, Omar Ait-Aider


This paper presents a novel calibration algorithm for Multi-Focus Plenoptic Cameras (MFPCs) using raw images only. The design of such cameras is usually complex and relies on precise placement of optical elements. Several calibration procedures have been proposed to retrieve the camera parameters, but they rely on simplified models, on reconstructed images to extract features, or on multiple calibrations when several types of micro-lenses are used. Considering blur information, we propose a new Blur Aware Plenoptic (BAP) feature. It is first exploited in a pre-calibration step that retrieves initial camera parameters, and then used to express a new cost function for our single optimization process. The effectiveness of our calibration method is validated by quantitative and qualitative experiments.
[mla, retrieve, step, account, dataset, composed, length] [main, feature, center, aware, object] [model, white, type, internal, christian, case] [lens, blur, raw, method, light, sensor, pixel, field, ieee, spatial, based, checkerboard, expressed, rxlive, aperture, optimized, proposed, introduced, figure, raytrix] [image] [optimization, set, metric, function, process, size, paper, respect, standard, distribution, algorithm] [calibration, plenoptic, camera, distance, point, radius, focal, error, bap, focused, depth, computed, virtual, initial, reprojection, computer, array, estimated, relative, international, conference, plane, intrinsic, projection, single, allows, extrinsic, estimation, compute, vision, lecture, system, position]
@InProceedings{Labussiere_2020_CVPR,
  author = {Labussiere, Mathieu and Teuliere, Celine and Bernardin, Frederic and Ait-Aider, Omar},
  title = {Blur Aware Calibration of Multi-Focus Plenoptic Camera},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Fused Pixel and Feature-Based View Reconstructions for Light Fields
Jinglei Shi, Xiaoran Jiang, Christine Guillemot


In this paper, we present a learning-based framework for light field view synthesis from a subset of input views. Building upon a light-weight optical flow estimation network to obtain depth maps, our method employs two reconstruction modules, in the pixel and feature domains respectively. For the pixel-wise reconstruction, occlusions are explicitly handled by a disparity-dependent interpolation filter, whereas inpainting of disoccluded areas is learned by convolutional layers. Due to disparity inconsistencies, the pixel-based reconstruction may lead to blurriness in highly textured areas as well as on object contours. On the contrary, the feature-based reconstruction performs well on high frequencies, making the reconstructions in the two domains complementary. End-to-end learning is finally performed, including a fusion module that merges the pixel and feature-based reconstructions. Experimental results show that our method achieves state-of-the-art performance on both synthetic and real-world datasets; moreover, it can even extend the light field baseline by extrapolating high quality views without additional training.
[work, recognition] [feature, module, framework, propose, final, object, including, mask] [input, model, quality, trained, highly] [light, field, disparity, based, warped, method, ieee, pixel, pixrnet, interpolation, fusion, convolutional, high, featrnet, llff, pattern, reference, color, proposed, spatial, figure, epi, resolution] [target, image, synthesis, synthetic, domain, synthesize, representation] [learning, network, learned, training, angular, large, set, deep, ltr, note, sampled, soft, extrapolation, better] [view, reconstruction, depth, estimation, well, computer, vision, novel, position, reconstructed, acm, textured, single, approach, computed, error, rendering, stereo, scene, second, projected, fpfr, purepix, camera, plane, viewpoint, inferred]
@InProceedings{Shi_2020_CVPR,
  author = {Shi, Jinglei and Jiang, Xiaoran and Guillemot, Christine},
  title = {Learning Fused Pixel and Feature-Based View Reconstructions for Light Fields},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SAL: Sign Agnostic Learning of Shapes From Raw Data
Matan Atzmon, Yaron Lipman


Recently, neural networks have been used as implicit representations for surface reconstruction, modelling, learning, and generation. So far, training neural networks to be implicit representations of surfaces has required training data sampled from ground-truth signed implicit functions such as signed distance or occupancy functions, which are notoriously hard to compute. In this paper we introduce Sign Agnostic Learning (SAL), a deep learning approach for learning implicit shape representations directly from raw, unsigned geometric data, such as point clouds and triangle soups. We have tested SAL on the challenging problem of surface reconstruction from an un-oriented point cloud, as well as on end-to-end human shape space learning directly from a dataset of raw scans, and achieved state-of-the-art reconstructions compared to current approaches. We believe SAL opens the door to many geometric deep learning applications with real-world data, alleviating the usual painstaking, often manual pre-processing.
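The sign-agnostic idea can be sketched as regressing the magnitude of the network's prediction to the unsigned distance from query points to the raw point cloud, so no inside/outside labels are needed; the network interface, sampling, and the plain L1 form below are illustrative assumptions, not the paper's exact loss.

import torch

def sal_loss(f, points, samples):
    # f: network mapping [M, 3] query points to a scalar per point.
    # points: [N, 3] raw point cloud; samples: [M, 3] query points.
    unsigned = torch.cdist(samples, points).min(dim=1).values   # distance to the cloud
    pred = f(samples).squeeze(-1)
    return (pred.abs() - unsigned).abs().mean()                 # sign-agnostic regression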
[sign, hidden, work, dataset] [] [input, model, case] [raw, figure, ieee, pattern, method, version, gray] [loss, latent, variational, unseen, generative, learn, representation] [learning, function, neural, data, test, equation, agnostic, space, deep, theorem, initialization, paper, approximate, set, forward, arxiv, preprint, problem, approximation, reproduction, training, network, experiment, note, processing, optimization, choice] [sal, surface, point, reconstruction, implicit, distance, signed, shape, computer, unsigned, conference, cloud, vision, plane, geometric, normal, directly, triangle, local, single, human, defined, scan, approach, basis, mlp, occupancy, parametric, left, volume, michael, collection, geometry]
@InProceedings{Atzmon_2020_CVPR,
  author = {Atzmon, Matan and Lipman, Yaron},
  title = {SAL: Sign Agnostic Learning of Shapes From Raw Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Google Landmarks Dataset v2 - A Large-Scale Benchmark for Instance-Level Recognition and Retrieval
Tobias Weyand, Andre Araujo, Bingyi Cao, Jack Sim


While image retrieval and instance recognition techniques are progressing rapidly, there is a need for challenging datasets to accurately measure their performance -- while posing novel challenges that are relevant for practical applications. We introduce the Google Landmarks Dataset v2 (GLDv2), a new benchmark for large-scale, fine-grained instance recognition and image retrieval in the domain of human-made and natural landmarks. GLDv2 is the largest such dataset to date by a large margin, including over 5M images and 200k distinct instance labels. Its test set consists of 118k images with ground truth annotations for both the retrieval and recognition tasks. The ground truth construction involved over 800 hours of human annotator work. Our new dataset has several challenging properties inspired by real-world applications that previous datasets did not consider: An extremely long-tailed class distribution, a large fraction of out-of-domain test photos and large intra-class variability. The dataset is sourced from Wikimedia Commons, the world's largest crowdsourced collection of landmark photos. We provide baseline results for both recognition and retrieval tasks based on state-of-the-art methods as well as competitive results from a public challenge. We further demonstrate the suitability of the dataset for transfer learning by showing that image embeddings trained on it achieve competitive retrieval performance on independent datasets. The dataset images, ground-truth and metric scoring code are available at https://github.com/cvdfoundation/google-landmark
[dataset, recognition, retrieval, visual, oxford, graph, natural, relevant, goal, embedding, associated] [global, instance, object, feature, category, table, challenge, benchmark] [query, landmark, google, datasets, wikimedia, worldwide, trained, city, testing, public, delg, showing, paris, revisited] [based, figure, version, existing, scale] [image, loss, consists] [training, number, set, large, class, deep, data, test, learning, metric, top, manual, validation, baseline, knowledge, label, precision, performance, compared, distribution, total, average] [local, ground, truth, descriptor, well, single, system, matching, human]
@InProceedings{Weyand_2020_CVPR,
  author = {Weyand, Tobias and Araujo, Andre and Cao, Bingyi and Sim, Jack},
  title = {Google Landmarks Dataset v2 - A Large-Scale Benchmark for Instance-Level Recognition and Retrieval},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Instance Guided Proposal Network for Person Search
Wenkai Dong, Zhaoxiang Zhang, Chunfeng Song, Tieniu Tan


Person detection networks have been widely used in person search. These detectors discriminate persons from the background and generate proposals for all the persons in a gallery of scene images for each query. However, such a large number of proposals has a negative influence on the subsequent identity matching process because many distractors are involved. In this paper, we propose a new detection network for person search, named Instance Guided Proposal Network (IGPN), which can learn the similarity between query persons and proposals. Thus, we can decrease the number of proposals according to the similarity scores. To incorporate information of the query into the detection network, we introduce the Siamese region proposal network to Faster-RCNN and we propose improved cross-correlation layers to alleviate the imbalance of the parameter distribution. Furthermore, we design a local relation block and a global relation branch to leverage the proposal-proposal relations and query-scene relations, respectively. Extensive experiments show that our method improves person search performance by decreasing the number of proposals and achieves competitive performance on two large person search benchmark datasets, CUHK-SYSU and PRW.
[relation, previous, context] [igpn, detection, gallery, global, bounding, proposal, feature, siamese, branch, denotes, region, correlation, qeeps, pcb, propose, instance, guided, rpn, positive, object, table, named, locate, prw, regression] [query, model, influence, decrease, trained, identity] [based, method, proposed, figure, block, ieee, pattern, comparison] [person, learn, target, image, appearance, introduce, loss] [search, similarity, network, performance, size, number, set, process, learning, large, benefit, task, training, negative, improved, compared, imbalance, neural, calculation] [scene, conference, computer, local, vision, matching, second, european]
@InProceedings{Dong_2020_CVPR,
  author = {Dong, Wenkai and Zhang, Zhaoxiang and Song, Chunfeng and Tan, Tieniu},
  title = {Instance Guided Proposal Network for Person Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Which Is Plagiarism: Fashion Image Retrieval Based on Regional Representation for Design Protection
Yining Lang, Yuan He, Fan Yang, Jianfeng Dong, Hui Xue


With the rapid growth of e-commerce and the popularity of online shopping, fashion retrieval has received considerable attention in the computer vision community. Different from existing works that mainly focus on identical or similar fashion item retrieval, in this paper we aim to study plagiarized clothes retrieval, which has been largely ignored by the academic community despite its great application value. One of the key challenges is that plagiarized clothes are usually modified in a certain region of the original design to escape detection by traditional retrieval methods. To address this, we propose a novel network named Plagiarized-Search-Net (PS-Net) based on regional representations, where we utilize landmarks to guide the learning of regional representations and compare fashion items region by region. Besides, we propose a new dataset, Plagiarized Fashion, for plagiarized clothes retrieval, which provides a meaningful complement to the existing fashion retrieval field. Experiments on the Plagiarized Fashion dataset verify that our approach is superior to other instance-level counterparts for plagiarized clothes retrieval, showing promising results for original design protection. Moreover, our PS-Net can also be adapted to traditional fashion retrieval and landmark estimation tasks and achieves state-of-the-art performance on the DeepFashion and DeepFashion2 datasets.
[retrieval, attention, dataset, work, previous, visual, step, evaluation, yuan, retrieve, order, mechanism] [region, feature, recall, map, detection, branch, guided, table, backbone, propose, named, obtains, category, focus, guide] [clothes, plagiarized, fashion, landmark, regional, original, deepfashion, model, identical, manipulation, evaluated, clothing, item, study] [traditional, method, figure, proposed, based, achieved, output, introduced] [image, loss, modified, attribute, learn, representation, ltri, ability, style] [learning, design, network, performance, similarity, rate, training, deep, task, manual, problem, compared, weight, set, find, best] [approach, estimation, human, novel, complete, compare, geometric, pose]
@InProceedings{Lang_2020_CVPR,
  author = {Lang, Yining and He, Yuan and Yang, Fan and Dong, Jianfeng and Xue, Hui},
  title = {Which Is Plagiarism: Fashion Image Retrieval Based on Regional Representation for Design Protection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Inter-Task Association Critic for Cross-Resolution Person Re-Identification
Zhiyi Cheng, Qi Dong, Shaogang Gong, Xiatian Zhu


Person images captured by unconstrained surveillance cameras often have low resolution (LR). This causes a resolution mismatch problem when they are matched against high-resolution (HR) gallery images, negatively affecting the performance of person re-identification (re-id). An effective approach is to leverage image super-resolution (SR) along with person re-id in a joint learning manner. However, this scheme is limited because gradient backpropagation becomes dramatically more difficult during training. In this paper, we introduce a novel model training regularisation method, called Inter-Task Association Critic (INTACT), to address this fundamental problem. Specifically, INTACT discovers the underlying association knowledge between image SR and person re-id, and leverages it as an extra learning constraint for enhancing the compatibility of the SR model with person re-id in HR image space. This is realised by parameterising the association constraint, which enables it to be automatically learned from the training data. Extensive experiments validate the superiority of INTACT over state-of-the-art approaches on the cross-resolution re-id task using five standard person re-id datasets.
[recognition, dataset, artificial, difficulty] [association, feature, gallery, table, module, achieves] [model, identity, query, adversarial, trained] [ieee, pattern, resolution, based, cascaded, captured, proposed, method, analysis, low, existing, figure, resolved] [person, image, intact, discriminator, shaogang, loss, representation, xiatian, critic, gan, regularisation, unsupervised, mismatch, generator, tao, bridging, address, underlying, learn, latent] [learning, training, performance, network, standard, deep, learned, space, task, classification, problem, set, compared, design, objective, update, best, test, machine] [conference, computer, vision, constraint, joint, matching, international, camera, novel]
@InProceedings{Cheng_2020_CVPR,
  author = {Cheng, Zhiyi and Dong, Qi and Gong, Shaogang and Zhu, Xiatian},
  title = {Inter-Task Association Critic for Cross-Resolution Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding
Dian Shao, Yue Zhao, Bo Dai, Dahua Lin


On public benchmarks, current action recognition techniques have achieved great success. However, when used in real-world applications, e.g. sports analysis, which require the capability to parse an activity into phases and to differentiate between subtly different actions, their performance remains far from satisfactory. To take action recognition to a new level, we develop FineGym, a new dataset built on top of gymnastics videos. Compared to existing action recognition datasets, FineGym is distinguished in richness, quality, and diversity. In particular, it provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy. For example, a "balance beam" activity will be annotated as a sequence of elementary sub-actions derived from five sets: "leap-jump-hop", "beam-turns", "flight-salto", "flight-handspring", and "dismount", where the sub-action in each set will be further annotated with finely defined class labels. This new level of granularity presents significant challenges for action recognition, e.g. how to parse the temporal structure of a coherent action, and how to distinguish between subtly different action classes. We systematically investigate different methods on this dataset and obtain a number of interesting findings. We hope this dataset can advance research towards action understanding.
[action, temporal, finegym, recognition, element, video, dataset, tsn, trn, tsm, provide, three, visual, salto, understanding, dahua, vault, reasoning, turn, kinetics, multiple, future, tucked, yue, described, activity] [semantic, table, localization, official, challenging, instance, annotation, level, including, segment, annotated, focus] [datasets, model, quality, study, decision, representative] [figure, event, flow, motion, existing, high, convolutional, analysis, method] [subtle] [set, number, data, backward, top, network, empirical, performance, learning, large, balance, manually] [rgb, human, well, pose, consistent, defined]
@InProceedings{Shao_2020_CVPR,
  author = {Shao, Dian and Zhao, Yue and Dai, Bo and Lin, Dahua},
  title = {FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Mapillary Street-Level Sequences: A Dataset for Lifelong Place Recognition
Frederik Warburg, Soren Hauberg, Manuel Lopez-Antequera, Pau Gargallo, Yubin Kuang, Javier Civera


Lifelong place recognition is an essential and challenging task in computer vision with vast applications in robust localization and efficient large-scale 3D reconstruction. Progress is currently hindered by a lack of large, diverse, publicly available datasets. We contribute Mapillary Street-Level Sequences (SLS), a large dataset for urban and suburban place recognition from image sequences. It contains more than 1.6 million images curated from the Mapillary collaborative mapping platform. The dataset is orders of magnitude larger than current data sources, and is designed to reflect the diversities of true lifelong learning. It features images from 30 major cities across six continents, hundreds of distinct cameras, and substantially different viewpoints and capture times, spanning all seasons over a nine-year period. All images are geo-located with GPS and compass, and feature high-level attributes such as road type. We propose a set of benchmark tasks designed to push state-of-the-art performance and provide baseline studies. We show that current state-of-the-art methods still have a long way to go, and that the lack of diversity in existing datasets has prevented generalization to new environments. The dataset and benchmarks are available for academic research.
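For concreteness, the standard retrieval-style evaluation behind such place-recognition benchmarks can be sketched as follows. This is a generic recall@k computation in NumPy; the function name, inputs, and the GPS-radius definition of positives are illustrative assumptions, not the paper's exact protocol.

import numpy as np

def recall_at_k(query_feats, db_feats, is_positive, ks=(1, 5, 10)):
    """Generic recall@k for retrieval-based place recognition.

    query_feats: (Q, D) L2-normalised query descriptors.
    db_feats:    (N, D) L2-normalised database descriptors.
    is_positive: (Q, N) boolean matrix, True where a database image is a
                 correct match (e.g. within some GPS radius of the query).
    """
    sims = query_feats @ db_feats.T          # cosine similarity
    ranking = np.argsort(-sims, axis=1)      # best database match first
    recalls = {}
    for k in ks:
        topk = ranking[:, :k]
        hit = np.take_along_axis(is_positive, topk, axis=1).any(axis=1)
        recalls[k] = hit.mean()
    return recalls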
[place, recognition, netvlad, dataset, geographical, msls, urban, sequence, url, visual, time, viewing, retrieval, day, suburban, tokyo, temporal, seasonal, road, provide, gps, three, include] [mapillary, table, gem, propose, challenging, feature, positive] [query, database, model, trained, datasets] [ieee, figure, pattern, weather, based, raw, night, june] [image, appearance, diverse, structural] [training, test, set, large, data, performance, deep, number, learning, lifelong, distribution, size, negative, machine, min] [conference, computer, international, vision, coverage, viewpoint, well, camera, robotics, direction, local, depth, matching]
@InProceedings{Warburg_2020_CVPR,
  author = {Warburg, Frederik and Hauberg, Soren and Lopez-Antequera, Manuel and Gargallo, Pau and Kuang, Yubin and Civera, Javier},
  title = {Mapillary Street-Level Sequences: A Dataset for Lifelong Place Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning
Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, Trevor Darrell


Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study multitask learning for autonomous driving. Researchers are usually constrained to study a small set of problems on one dataset, while real-world computer vision applications require performing tasks of various complexities. We construct BDD100K, the largest driving video dataset with 100K videos and 10 tasks to evaluate the exciting progress of image recognition algorithms on autonomous driving. The dataset possesses geographic, environmental, and weather diversity, which is useful for training models that are less likely to be surprised by new conditions. Based on this diverse dataset, we build a benchmark for heterogeneous multitask learning and study how to solve the tasks together. Our experiments show that special training strategies are needed for existing models to perform such heterogeneous tasks. BDD100K opens the door for future studies in this important venue.
[dataset, lane, driving, drivable, multitask, multiple, video, visual, heterogeneous, road, evaluation, provide, recognition, complicated, three] [object, segmentation, detection, instance, tracking, marking, table, area, semantic, mot, autonomous, benchmark, box, challenge, occlusion, bounding, annotated] [datasets, model, study, trained, improve, collected] [figure, existing, ieee, pattern, weather] [diverse, image, domain, diversity, train, perform] [learning, training, set, data, number, task, arxiv, deep, large, validation, accuracy, observe, preprint, simple, label] [computer, vision, conference, scene, supplementary, jointly, require, international, homogeneous, single, kitti, joint]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Fisher and Chen, Haofeng and Wang, Xin and Xian, Wenqi and Chen, Yingying and Liu, Fangchen and Madhavan, Vashisht and Darrell, Trevor},
  title = {BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Rethinking Computer-Aided Tuberculosis Diagnosis
Yun Liu, Yu-Huan Wu, Yunfeng Ban, Huifang Wang, Ming-Ming Cheng


As a serious infectious disease, tuberculosis (TB) is one of the major threats to human health worldwide, leading to millions of deaths every year. Although early diagnosis and treatment can greatly improve the chances of survival, it remains a major challenge, especially in developing countries. Computer-aided tuberculosis diagnosis (CTD) is a promising choice for TB diagnosis due to the great successes of deep learning. However, when it comes to TB diagnosis, the lack of training data has hampered the progress of CTD. To solve this problem, we establish a large-scale TB dataset, namely the Tuberculosis X-ray (TBX11K) dataset. This dataset contains 11200 X-ray images with corresponding bounding box annotations for TB areas, while the existing largest public TB dataset only has 662 X-ray images with corresponding image-level annotations. The proposed dataset enables the training of sophisticated detectors for high-quality CTD. We reform the existing object detectors to adapt them to simultaneous image classification and TB area detection. These reformed detectors are trained and evaluated on the proposed TBX11K dataset and serve as baselines for future research.
[dataset, evaluation, future, provide, people] [detection, object, ssd, fpn, area, faster, bounding, apbb, retinanet, recall, feature, fcos, box, achieves, backbone, loc, global, promote, iou] [ctd, tuberculosis, diagnosis, sick, chest, datasets, golden, major, healthy, uncertain, shenzhen, help, publicly, refers, evaluating, infectious, mycobacterium, experienced] [proposed, existing, ieee, clinical, convolutional, yun, medical] [image, latent, train, health] [deep, classification, active, data, test, training, learning, imagenet, set, performance, pretraining, precision, neural, report, standard, early, rate, accuracy] [computer, conference, international, human, simultaneous]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yun and Wu, Yu-Huan and Ban, Yunfeng and Wang, Huifang and Cheng, Ming-Ming},
  title = {Rethinking Computer-Aided Tuberculosis Diagnosis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
IntrA: 3D Intracranial Aneurysm Dataset for Deep Learning
Xi Yang, Ding Xia, Taichi Kin, Takeo Igarashi


Medicine is an important application area for deep learning models. Research in this field is a combination of medical expertise and data science knowledge. In this paper, instead of 2D medical images, we introduce an open-access 3D intracranial aneurysm dataset, IntrA, that makes the application of points-based and mesh-based classification and segmentation models available. Our dataset can be used to diagnose intracranial aneurysms and to extract the neck for a clipping operation in medicine and other areas of deep learning, such as normal estimation and surface reconstruction. We provide a large-scale benchmark of classification and part segmentation by testing state-of-the-art networks. We also discuss the performance of each method and demonstrate the challenges of our dataset. The published dataset can be accessed here: https://github.com/intra2d2019/IntrA.
[dataset, described, provide, automatically] [aneurysm, segmentation, vessel, intracranial, blood, annotated, segment, annotation, boundary, iou, segmented, neck, abdominal] [model, healthy, input, diagnosis, university, generalization] [figure, medical, ieee, based, convolutional, pattern, brain, method, proposed, convolution, dynamic, treatment, introduced] [image, generated, manifold] [data, deep, learning, network, neural, classification, performance, processing, number, accuracy, training, large, selected, excellent] [point, computer, conference, vision, geodesic, shape, distance, surface, normal, complete, sparse, pointconv, pointcnn, spidercnn, pipeline, cloud, acm, aortic, complex, volumetric, local, leonidas, hao, international, application, estimation, collection, euclidean, cad]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Xi and Xia, Ding and Kin, Taichi and Igarashi, Takeo},
  title = {IntrA: 3D Intracranial Aneurysm Dataset for Deep Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Revisiting Saliency Metrics: Farthest-Neighbor Area Under Curve
Sen Jia, Neil D. B. Bruce


In this paper, we propose a new metric to address the long-standing problem of center bias in saliency evaluation. We first show that distribution-based metrics cannot measure saliency performance across datasets due to ambiguity in the choice of standard deviation, especially for Convolutional Neural Networks. Therefore, our proposed metric is AUC-based because ROC curves are relatively robust to the standard deviation problem. However, this requires sufficient unique values in the saliency prediction to compute AUC scores. Secondly, we propose a global smoothing function for the problem of few value degrees in predicted saliency output. Compared with random noise, our smoothing function can create unique values without losing the existing relative saliency relationship. Finally, we show our proposed AUC-based metric can generate a more directional negative set for evaluation, denoted as Farthest-Neighbor AUC (FN-AUC). Our experiments show FN-AUC can measure spatial biases, central and peripheral, more effectively than S-AUC without penalizing the fixation locations.
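A minimal sketch of the underlying idea of scoring a saliency map by AUC against an explicit negative fixation set (in the spirit of shuffled-AUC variants); how FN-AUC actually selects its farthest-neighbor negatives is not reproduced here, and the helper below is an illustrative assumption.

import numpy as np

def auc_with_negative_set(sal_map, pos_fix, neg_fix):
    """AUC of a saliency map with positives = ground-truth fixations and
    negatives drawn from an explicit negative fixation set (ties broken
    arbitrarily). pos_fix / neg_fix are (K, 2) arrays of (row, col)."""
    pos = sal_map[pos_fix[:, 0], pos_fix[:, 1]]
    neg = sal_map[neg_fix[:, 0], neg_fix[:, 1]]
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    # Mann-Whitney U statistic: probability a positive outranks a negative.
    u = ranks[labels == 1].sum() - len(pos) * (len(pos) + 1) / 2
    return u / (len(pos) * len(neg))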
[fixation, visual, dataset, prediction, relationship, considers, build, predict] [saliency, center, map, positive, score, global, pall, penalize, table, propose, cnn, salicon, toronto, fps] [auc, model, deviation, neil, datasets, robust] [figure, spatial, gaussian, ieee, proposed, based, method, output, applying, achieved, spatially, pattern] [image, drawn, mit] [set, negative, bias, distribution, metric, problem, sampled, probability, higher, performance, standard, sample, training, sampling, measure, function, random, achieve, choice, considered, small, smoothing, computational, evaluate, process, size, strategy, experiment, large, applied, lower, test] [vision, conference, relative, computer, international, matthias, compare]
@InProceedings{Jia_2020_CVPR,
  author = {Jia, Sen and Bruce, Neil D. B.},
  title = {Revisiting Saliency Metrics: Farthest-Neighbor Area Under Curve},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Computing the Testing Error Without a Testing Set
Ciprian A. Corneanu, Sergio Escalera, Aleix M. Martinez


Deep Neural Networks (DNNs) have revolutionized computer vision. We now have DNNs that achieve top (accuracy) results in many problems, including object recognition, facial expression analysis, and semantic segmentation, to name but a few. The design of the DNNs that achieve top results is, however, non-trivial and mostly done by trial-and-error. That is, typically, researchers will derive many DNN architectures (i.e., topologies) and then test them on multiple datasets. However, there are no guarantees that the selected DNN will perform well in the real world. One can use a testing set to estimate the performance gap between the training and testing sets, but avoiding overfitting-to-the-testing-data is of concern. Using sequestered testing data may address this problem, but this requires a constant update of the dataset, a very expensive venture. Here, we derive an algorithm to estimate the performance gap between training and testing without the need for a testing dataset. Specifically, we derive a set of persistent topology measures that identify when a DNN is learning to generalize to unseen samples. We provide extensive experimental validation on multiple networks and datasets to demonstrate the feasibility of the proposed approach.
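One representative ingredient of such topological summaries is the number of connected components (Betti-0) of graphs obtained by thresholding pairwise activation correlations at a sweep of thresholds. The sketch below assumes that construction and only illustrates the idea; it is not the paper's full persistent-homology pipeline.

import numpy as np

def betti0_curve(corr, thresholds):
    """Connected-component counts (Betti-0) of graphs built by keeping edges
    with |correlation| >= t, for each threshold t. `corr` is an (n, n)
    symmetric matrix of pairwise activation correlations."""
    n = corr.shape[0]

    def find(parent, x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    curve = []
    for t in thresholds:
        parent = list(range(n))
        rows, cols = np.where(np.abs(corr) >= t)
        for i, j in zip(rows, cols):
            if i < j:
                ri, rj = find(parent, i), find(parent, j)
                if ri != rj:
                    parent[ri] = rj
        curve.append(len({find(parent, i) for i in range(n)}))
    return curve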
[dataset, recognition, multiple, evaluation, three, observed] [semantic, object, table, segmentation, correlation] [testing, topological, persistent, dnn, dnns, deviation, facial, generalization, simplicial, model, homology, sequestered, datasets, filtration, persistence, summary, betti] [figure, proposed, ieee, based, convolutional, conv, pattern, classical] [gap, perform, mapping, unseen, train, extensive, corresponding, image] [performance, training, standard, number, deep, algorithm, set, neural, metric, test, network, data, linear, computing, space, learning, classification, average, problem, function, validation, margin] [compute, computer, error, vision, topology, estimate, functional, define, defined, computed, conference, approach, estimated, derive, complex, estimating]
@InProceedings{Corneanu_2020_CVPR,
  author = {Corneanu, Ciprian A. and Escalera, Sergio and Martinez, Aleix M.},
  title = {Computing the Testing Error Without a Testing Set},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Improving Confidence Estimates for Unfamiliar Examples
Zhizhong Li, Derek Hoiem


Intuitively, unfamiliarity should lead to lack of confidence. In reality, current algorithms often make highly confident yet wrong predictions when faced with relevant but unfamiliar examples. A classifier we trained to recognize gender is 12 times more likely to be wrong with a 99% confident prediction if presented with a subject from a different age group than those seen during training. In this paper, we compare and evaluate several methods to improve confidence estimates for unfamiliar and familiar samples. We propose a testing methodology of splitting unfamiliar and familiar samples by attribute (age, breed, subcategory) or sampling (similar datasets collected by different people at different times). We evaluate methods including confidence calibration, ensembles, distillation, and a Bayesian model and use several metrics to analyze label, likelihood, and calibration error. While all methods reduce over-confident errors, the ensemble of calibrated models performs best overall, and T-scaling performs best among the approaches with fastest inference.
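Among the compared methods, T-scaling (temperature scaling) is the simplest to sketch: fit a single temperature on held-out logits by minimizing negative log-likelihood, then divide test logits by it. Below is a minimal NumPy version using grid search instead of the usual LBFGS fit; function names are illustrative.

import numpy as np

def nll(logits, labels, T):
    """Mean negative log-likelihood of softmax(logits / T)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)              # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the single temperature minimising validation NLL."""
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))

# Calibrated test probabilities are then softmax(test_logits / T_best).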
[prediction, dataset, dog, work, recognition, correct] [confidence, object, table, propose, feature, split] [ensemble, improve, model, datasets, generalization, quality] [pattern, method, ieee, likelihood, figure, performs, high, analysis] [loss, gender, cat, unsupervised, drawn, image] [unfamiliar, familiar, training, bayesian, neural, nll, deep, label, data, validation, baseline, test, distillation, set, machine, network, performance, learning, classification, brier, confident, classifier, distribution, reduce, evaluate, best, better, novelty, consider, lower, scaling, outperform, distill, sampled, higher, temperature, expected, dropout, kendall, presence, arxiv] [calibration, error, conference, computer, calibrated, uncertainty, vision, single, international, uncalibrated]
@InProceedings{Li_2020_CVPR,
  author = {Li, Zhizhong and Hoiem, Derek},
  title = {Improving Confidence Estimates for Unfamiliar Examples},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CycleISP: Real Image Restoration via Improved Data Synthesis
Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, Ling Shao


The availability of large-scale datasets has helped unleash the true potential of deep convolutional neural networks (CNNs). However, for the single-image denoising problem, capturing a real dataset is an unacceptably expensive and cumbersome procedure. Consequently, image denoising algorithms are mostly developed and evaluated on synthetic data that is usually generated with a widespread assumption of additive white Gaussian noise (AWGN). While the CNNs achieve impressive results on these synthetic datasets, they do not perform well when applied to real camera images, as reported in recent benchmark datasets. This is mainly because the AWGN is not adequate for modeling the real camera noise, which is signal-dependent and heavily transformed by the camera imaging pipeline. In this paper, we present a framework that models the camera imaging pipeline in forward and reverse directions. It allows us to produce any number of realistic image pairs for denoising both in RAW and sRGB spaces. By training a new image denoising network on realistic synthetic data, we achieve the state-of-the-art performance on real camera benchmark datasets. Our models have 5 times fewer parameters than the previous best method for RAW denoising. Furthermore, we demonstrate that the proposed framework generalizes beyond the image denoising problem, e.g., for color matching in stereoscopic cinema. The source code and pre-trained models are available at https://github.com/swz30/CycleISP.
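As a point of contrast with plain AWGN, the signal-dependent noise the abstract refers to is commonly approximated by a heteroscedastic Gaussian (shot + read) model on linear RAW data. A minimal sketch of that approximation, not CycleISP's learned pipeline:

import numpy as np

def add_signal_dependent_noise(raw, shot=1e-2, read=1e-4, rng=None):
    """Heteroscedastic Gaussian noise on a linear RAW image in [0, 1]:
    variance = shot * signal + read, i.e. a signal-dependent term plus a
    signal-independent term, as opposed to constant-variance AWGN."""
    rng = np.random.default_rng(rng)
    std = np.sqrt(shot * raw + read)
    return np.clip(raw + rng.normal(0.0, std), 0.0, 1.0)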
[attention, dataset, order] [branch, table, framework, feature, apply, benchmark, propose, main, cnn] [noise, model, clean, input, correction, trained, datasets, injection, generalization, technique] [srgb, denoising, raw, color, cycleisp, noisy, sidd, method, dnd, convolutional, output, psnr, gaussian, proposed, figure, spatial, sensor, ssim, imaging, cnns, irgb, residual, bayer, iraw, channel, lei, isp, performs, stereoscopic, based] [image, real, synthetic, realistic, generate, perform, synthesize, source] [network, data, deep, learning, training, processing, size, applied, procedure, layer, process, performance, neural, best, set, space] [camera, michael, matching, well, pipeline, allows, computer, vision, single, approach]
@InProceedings{Zamir_2020_CVPR,
  author = {Zamir, Syed Waqas and Arora, Aditya and Khan, Salman and Hayat, Munawar and Khan, Fahad Shahbaz and Yang, Ming-Hsuan and Shao, Ling},
  title = {CycleISP: Real Image Restoration via Improved Data Synthesis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Enhanced Blind Face Restoration With Multi-Exemplar Images and Adaptive Spatial Feature Fusion
Xiaoming Li, Wenyu Li, Dongwei Ren, Hongzhi Zhang, Meng Wang, Wangmeng Zuo


In many real-world face restoration applications, e.g., smartphone photo albums and old films, multiple high-quality (HQ) images of the same person are usually available for a given degraded low-quality (LQ) observation. However, most existing guided face restoration methods are based on a single HQ exemplar image, and are limited in properly exploiting guidance for improving the generalization ability to unknown degradation processes. To address these issues, this paper proposes to enhance blind face restoration by utilizing multi-exemplar images and adaptive fusion of features from guidance and degraded images. First, given a degraded observation, we select the optimal guidance based on the weighted affine distance on landmark sets, where the landmark weights are learned to make the guidance image optimal for HQ image reconstruction. Second, moving least squares and adaptive instance normalization are leveraged for spatial alignment and illumination translation of the guidance image in the feature space. Finally, for better feature fusion, multiple adaptive spatial feature fusion (ASFF) layers are introduced to incorporate guidance features in an adaptive and progressive manner, resulting in our ASFFNet. Experiments show that our ASFFNet performs favorably in terms of quantitative and qualitative evaluation, and is effective in generating photo-realistic results on real-world LQ images. The source code and models are available at https://github.com/csxmli2016/ASFFNet.
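The illumination translation step relies on adaptive instance normalization (AdaIN), which re-normalizes the channel statistics of one feature map to match another's. A generic sketch; the paper applies this inside its network, so the standalone function here is only illustrative.

import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalisation over (C, H, W) feature maps:
    re-normalise the channel statistics of `content` to match `style`."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean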
[multiple, visual, three, attention] [feature, adopt, denotes, mask, table, guided] [face, landmark, input, generalization, model, quality, effective, difference, facial, expression, adversarial, testing] [guidance, degraded, restoration, asffnet, gfrnet, adaptive, fusion, asff, gwainet, spatial, illumination, blind, degradation, method, warped, quantitative, adopted, based, subnet, restored, lgk, extraction, figure, waveletsr, affine, suggested, scgan, introduced, result, warping, spatially, block] [image, exemplar, loss, alignment, translation, adain, ability, progressive, unknown, style] [deep, learning, performance, better, selected, set, training, optimal, weighted, network, consider, select, normalization] [reconstruction, pose, single, reconstructed, limited, defined]
@InProceedings{Li_2020_CVPR,
  author = {Li, Xiaoming and Li, Wenyu and Ren, Dongwei and Zhang, Hongzhi and Wang, Meng and Zuo, Wangmeng},
  title = {Enhanced Blind Face Restoration With Multi-Exemplar Images and Adaptive Spatial Feature Fusion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Explorable Super Resolution
Yuval Bahat, Tomer Michaeli


Single image super resolution (SR) has seen major performance leaps in recent years. However, existing methods do not allow exploring the infinitely many plausible reconstructions that might have given rise to the observed low-resolution (LR) image. These different explanations to the LR image may dramatically vary in their textures and fine details, and may often encode completely different semantic information. In this paper, we introduce the task of explorable super resolution. We propose a framework comprising a graphical user interface with a neural network backend, allowing editing the SR output so as to explore the abundance of plausible HR explanations to the LR input. At the heart of our method is a novel module that can wrap any existing SR network, analytically guaranteeing that its SR outputs would precisely match the LR input, when downsampled. Besides its importance in our setting, this module is guaranteed to decrease the reconstruction error of any SR network it wraps, and can be used to cope with blur kernels that are different from the one the network was trained for. We illustrate our approach in a variety of use cases, ranging from medical imaging and forensics, to graphics.
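One simple way to obtain the guarantee that an SR output exactly reproduces the LR input when downsampled is to add back the upsampled LR residual. The sketch below assumes box-average downsampling and nearest-neighbor upsampling (for which D∘U is the identity); it is a stand-in under those assumptions, not necessarily the paper's analytic construction.

import numpy as np

def box_downsample(img, s):
    """Average-pool an (H, W) image by integer factor s (H, W divisible by s)."""
    h, w = img.shape
    return img.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def nearest_upsample(img, s):
    return np.repeat(np.repeat(img, s, axis=0), s, axis=1)

def enforce_lr_consistency(sr, lr, s):
    """Correct an SR estimate so that box-downsampling it by factor s
    reproduces lr exactly: sr + U(lr - D(sr)), with D the box average and
    U nearest-neighbour upsampling, for which D(U(x)) = x."""
    return sr + nearest_upsample(lr - box_downsample(sr, s), s)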
[natural, cem, explorable, explore, exploring, incorporating] [module, region, framework, scribble, semantic] [input, trained, adversarial, wrap, quality] [output, figure, esrgan, pattern, signal, ieee, existing, kernel, method, resolution, low, super, spatially, blur, based, frequency, tomer, high, valid, perceptual, kpn] [image, editing, loss, user, plausible, control, tool, gui, corresponding, desired, manipulating, edited, lmap, producing, perceptually, generative, encourage, consistency, generator, content, identically, diverse, diversity, gan, texture] [network, neural, deep, set, achieve, training, space, processing, imprint, imprinting, learning] [conference, computer, consistent, vision, reconstruction, allows, allow, single, approach, match, additional, enforcing]
@InProceedings{Bahat_2020_CVPR,
  author = {Bahat, Yuval and Michaeli, Tomer},
  title = {Explorable Super Resolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Syn2Real Transfer Learning for Image Deraining Using Gaussian Processes
Rajeev Yasarla, Vishwanath A. Sindagi, Vishal M. Patel


Recent CNN-based methods for image deraining have achieved excellent performance in terms of reconstruction error as well as visual quality. However, these methods are limited in the sense that they can be trained only on fully labeled data. Due to various challenges in obtaining real-world fully-labeled image deraining datasets, existing methods are trained only on synthetically generated data and hence generalize poorly to real-world images. The use of real-world data in training image deraining networks is relatively less explored in the literature. We propose a Gaussian Process-based semi-supervised learning framework which enables the network to learn to derain from a synthetic dataset while leveraging unlabeled real-world images to generalize better. Through extensive experiments and ablations on several challenging datasets (such as Rain800, Rain100H and DDN-SIRR), we show that the proposed method, when trained on limited labeled data, achieves on-par performance with fully-labeled training. Additionally, we demonstrate that using unlabeled real-world images in the proposed GP-based framework results in superior performance as compared to existing methods.
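The generic machinery behind GP-based pseudo-supervision is the Gaussian-process posterior at unlabeled points given labeled (feature, target) pairs. A minimal RBF-kernel sketch; the paper's exact kernel, features, and targets may differ.

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X_lab, y_lab, X_unlab, noise=1e-2, lengthscale=1.0):
    """GP posterior mean/variance at unlabelled points given labelled
    (feature, target) pairs -- the generic machinery behind GP-based
    pseudo-supervision."""
    K = rbf_kernel(X_lab, X_lab, lengthscale) + noise * np.eye(len(X_lab))
    K_star = rbf_kernel(X_unlab, X_lab, lengthscale)
    alpha = np.linalg.solve(K, y_lab)
    mean = K_star @ alpha
    v = np.linalg.solve(K, K_star.T)
    var = 1.0 - np.einsum('ij,ji->i', K_star, v)   # unit prior variance of RBF
    return mean, var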
[dataset, recognition, observed, recurrent, goal] [framework, wei, detection, achieves, predicted] [input, model, clean, trained, datasets, quality, generalization, adversarial] [rain, proposed, method, ieee, fzl, deraining, pattern, rainy, gaussian, removal, zuk, existing, zhang, based, sirr, figure, intermediate, convolutional, streak] [image, latent, synthetic, loss, train, consists, minimizing, learn, pseudo, supervised, encoder] [unlabeled, labeled, data, training, network, set, performance, better, space, learning, test, gain, compared, deep, vector, function, ssl, distribution, process, matrix, neural, typically, achieve] [computer, conference, vision, single, leverage, international, error, joint, sparse, approach, defined, jointly]
@InProceedings{Yasarla_2020_CVPR,
  author = {Yasarla, Rajeev and Sindagi, Vishwanath A. and Patel, Vishal M.},
  title = {Syn2Real Transfer Learning for Image Deraining Using Gaussian Processes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deblurring by Realistic Blurring
Kaihao Zhang, Wenhan Luo, Yiran Zhong, Lin Ma, Bjorn Stenger, Wei Liu, Hongdong Li


Existing deep learning methods for image deblurring typically train models using pairs of sharp images and their blurred counterparts. However, synthetically blurring images does not necessarily model the blurring process in real-world scenarios with sufficient accuracy. To address this problem, we propose a new method which combines two GAN models, i.e., a learning-to-Blur GAN (BGAN) and learning-to-DeBlur GAN (DBGAN), in order to learn a better model for image deblurring by primarily learning how to blur images. The first model, BGAN, learns how to blur sharp images with unpaired sharp and blurry image sets, and then guides the second model, DBGAN, to learn how to correctly deblur such images. In order to reduce the discrepancy between real blur and synthesized blur, a relativistic blur loss is leveraged. As an additional contribution, this paper also introduces a Real-World Blurred Image (RWBI) dataset including diverse blurry images. Our experiments show that the proposed method achieves consistently superior quantitative performance as well as higher perceptual quality on both the newly proposed dataset and the public GOPRO dataset.
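The relativistic blur loss builds on the relativistic average GAN formulation, where real samples should score higher than the average fake and vice versa. A hedged NumPy sketch of that discriminator objective (the generator uses the symmetric counterpart); this illustrates the general formulation, not the paper's exact loss.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relativistic_average_d_loss(real_logits, fake_logits, eps=1e-12):
    """Relativistic average discriminator loss: real samples should look
    more realistic than the average fake, and fakes less realistic than
    the average real."""
    d_real = sigmoid(real_logits - fake_logits.mean())
    d_fake = sigmoid(fake_logits - real_logits.mean())
    return -(np.log(d_real + eps).mean() + np.log(1.0 - d_fake + eps).mean())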
[order, dataset, gao, video, lin, three] [propose, framework, module, achieves, wei] [model, input, trained, adversarial, public, noise, help] [blurry, sharp, bgan, deblurring, blur, proposed, based, blurred, method, relativistic, blurring, conv, perceptual, gopro, convolutional, motion, output, kernel, figure, traditional, rwbi, nah, hongdong, prior, lrelu, deblur, recover, rbl, residual, comparison, dynamic, xin, jiaya, kaihao, wenhan, existing] [image, real, loss, dbgan, generator, realistic, generate, synthesized, generated, gan, learn, train, synthetic, tao, fake, discriminator, paired, learns, gap, qualitative] [learning, performance, training, network, deep, probability, function, process, update, neural, architecture, better, increase, paper, data, labeled] [camera, single]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Kaihao and Luo, Wenhan and Zhong, Yiran and Ma, Lin and Stenger, Bjorn and Liu, Wei and Li, Hongdong},
  title = {Deblurring by Realistic Blurring},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Bringing Old Photos Back to Life
Ziyu Wan, Bo Zhang, Dongdong Chen, Pan Zhang, Dong Chen, Jing Liao, Fang Wen


We propose to restore old photos that suffer from severe degradation through a deep learning approach. Unlike conventional restoration tasks that can be solved through supervised learning, the degradation in real photos is complex and the domain gap between synthetic images and real old photos makes the network fail to generalize. Therefore, we propose a novel triplet domain translation network by leveraging real photos along with massive synthetic image pairs. Specifically, we train two variational autoencoders (VAEs) to respectively transform old photos and clean photos into two latent spaces. And the translation between these two latent spaces is learned with synthetic paired data. This translation generalizes well to real photos because the domain gap is closed in the compact latent space. Besides, to address multiple degradations mixed in one old photo, we design a global branch with a partial nonlocal block targeting the structured defects, such as scratches and dust spots, and a local branch targeting the unstructured defects, such as noises and blurriness. The two branches are fused in the latent space, leading to improved capability to restore old photos from multiple defects. The proposed method outperforms state-of-the-art methods in terms of visual quality for old photo restoration.
[structured, attention, multiple, dataset, visual, three] [propose, global, table, branch, ablation, detection, adopt, focus] [adversarial, clean, trained, corrupted, study, model, quality] [restoration, ieee, method, degradation, figure, pattern, nonlocal, restore, convolutional, block, color, denoising, proposed, prior, restoring] [image, latent, real, synthetic, translation, domain, photo, mapping, loss, inpainting, gap, vaes, learn, corresponding, vae, train, variational, address, film, supervised, inpainted] [network, space, deep, neural, mixed, data, learning, learned, training, arxiv, better, processing, preprint, performance, scratch, triplet, compact] [computer, conference, vision, partial, unstructured, ground, international, local, complex, well, distance]
@InProceedings{Wan_2020_CVPR,
  author = {Wan, Ziyu and Zhang, Bo and Chen, Dongdong and Zhang, Pan and Chen, Dong and Liao, Jing and Wen, Fang},
  title = {Bringing Old Photos Back to Life},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Physics-Based Noise Formation Model for Extreme Low-Light Raw Denoising
Kaixuan Wei, Ying Fu, Jiaolong Yang, Hua Huang


Lacking rich and realistic data, learned single image denoising algorithms generalize poorly to real raw images that do not resemble the data used for training. Although the problem can be alleviated by the heteroscedastic Gaussian noise model, the noise sources caused by digital camera electronics are still largely overlooked, despite their significant effect on raw measurements, especially under extremely low-light conditions. To address this issue, we present a highly accurate noise formation model based on the characteristics of CMOS photosensors, thereby enabling us to synthesize realistic samples that better match the physics of the image formation process. Given the proposed noise model, we additionally propose a method to calibrate the noise parameters for available modern digital cameras, which is simple and reproducible for any new device. We systematically study the generalizability of a neural network trained with existing schemes, by introducing a new low-light denoising dataset that covers many modern digital cameras from diverse brands. Extensive empirical results collectively show that by utilizing our proposed noise formation model, a network can reach the capability as if it had been trained with rich real data, which demonstrates the effectiveness of our noise formation model.
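A stripped-down version of the physics-based ingredients such a model combines — Poisson photon shot noise, Gaussian read noise, and per-row banding noise on linear RAW data in photo-electrons — can be sketched as follows. The parameter names are illustrative, and the paper's full model additionally covers long-tailed read noise, quantization, and calibrated per-camera parameters.

import numpy as np

def synthesize_lowlight_raw(clean_electrons, gain=1.0, read_std=2.0,
                            row_std=0.5, rng=None):
    """Simplified noise synthesis on a linear RAW image given in
    photo-electrons: Poisson shot noise + Gaussian read noise + per-row
    banding noise, followed by analog gain."""
    rng = np.random.default_rng(rng)
    shot = rng.poisson(np.clip(clean_electrons, 0, None)).astype(np.float64)
    read = rng.normal(0.0, read_std, clean_electrons.shape)
    rows = rng.normal(0.0, row_std, (clean_electrons.shape[0], 1))   # banding
    return gain * (shot + read + rows)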
[dataset, recognition, current, modeling, evaluation] [extreme, plot, denotes] [noise, model, digital, trained, clean, input] [ieee, raw, denoising, gaussian, noisy, pattern, formation, sensor, light, method, figure, low, read, psnr, photon, imaging, sid, ssim, cmos, proposed, exposure, dark, iso, sony, color, banding, nread, presented, lambda, electronic, tukey, captured, june, heteroscedastic, burst, generally, analysis, scale, eld] [image, real, paired, synthetic, row, realistic] [data, training, distribution, parameter, learning, bias, network, better, process, statistical, probability, neural, shot, performance, modern, deep] [camera, conference, computer, vision, international, single, calibration, pipeline, estimated]
@InProceedings{Wei_2020_CVPR,
  author = {Wei, Kaixuan and Fu, Ying and Yang, Jiaolong and Huang, Hua},
  title = {A Physics-Based Noise Formation Model for Extreme Low-Light Raw Denoising},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Super Resolve Intensity Images From Events
S. Mohammad Mostafavi I. , Jonghyun Choi, Kuk-Jin Yoon


An event camera detects per-pixel intensity differences and produces an asynchronous event stream with low latency, high dynamic range, and low power consumption. As a trade-off, the event camera has low spatial resolution. We propose an end-to-end network to reconstruct high resolution, high dynamic range (HDR) images directly from the event stream. We evaluate our algorithm on both simulated and real-world sequences and verify that it captures fine details of a scene and outperforms the combination of state-of-the-art event-to-image algorithms with state-of-the-art super-resolution schemes in many quantitative measures by large margins. We further extend our method by using the active pixel sensor (APS) frames or reconstructing images iteratively.
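Event-based networks typically consume the asynchronous stream after binning it into a fixed-size stack. A minimal sketch of one common stacking scheme (signed event counts over uniform temporal bins); the paper's exact stacking strategy may differ.

import numpy as np

def stack_events(events, H, W, num_bins):
    """Accumulate an event stream into num_bins temporal frames of signed
    event counts. events: (N, 4) array of (t, x, y, polarity in {-1, +1})."""
    t = events[:, 0]
    rel = (t - t.min()) / (t.max() - t.min() + 1e-9)
    bins = np.clip((rel * num_bins).astype(int), 0, num_bins - 1)
    stack = np.zeros((num_bins, H, W))
    np.add.at(stack,
              (bins, events[:, 2].astype(int), events[:, 1].astype(int)),
              events[:, 3])
    return stack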
[state, frame, previous, temporal, sequence, video, stream, outperforms, dataset] [table, challenging, propose, location] [input, quality, central, create, korea, complementary] [intensity, event, method, stack, aps, flow, high, output, optical, resolution, ieee, super, low, stacking, comparison, dynamic, based, figure, range, convolutional, misr, combination, proposed, intermediate, rectified, psnr, ssim, mse, timestamp, sisr, residual, perceptual, quantitative, extend] [image, loss, lpips, structural, qualitative, fine] [network, number, learning, deep, similarity, neural, better, data, higher, evaluate, large, size, power] [reconstruct, reconstruction, camera, scene, directly, estimation, compare, single, error, initial, supplementary]
@InProceedings{I._2020_CVPR,
  author = {Mostafavi I., S. Mohammad and Choi, Jonghyun and Yoon, Kuk-Jin},
  title = {Learning to Super Resolve Intensity Images From Events},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Camouflaged Object Detection
Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, Ling Shao


We present a comprehensive study on a new task named camouflaged object detection (COD), which aims to identify objects that are "seamlessly" embedded in their surroundings. The high intrinsic similarities between the target object and the background make COD far more challenging than the traditional object detection task. To address this issue, we elaborately collect a novel dataset, called COD10K, which comprises 10,000 images covering camouflaged objects in various natural scenes, over 78 object categories. All the images are densely annotated with category, bounding-box, object-/instance-level, and matting-level labels. This dataset could serve as a catalyst for progressing many vision tasks, e.g., localization, segmentation, and alpha-matting, etc. In addition, we develop a simple but effective framework for COD, termed Search Identification Network (SINet). Without any bells and whistles, SINet outperforms various state-of-the-art object detection baselines on all datasets tested, making it a robust, general framework that can help facilitate future research in COD. Finally, we conduct a large-scale COD study, evaluating 13 cutting-edge models, providing some interesting findings, and showing several potential applications. Our research offers the community an opportunity to explore more in this new field. The code will be available at https://github.com/DengPingFan/SINet/.
[dataset, visual, attention, provide, natural, three, recognition, evaluation, time, engine, decoder, includes] [object, camouflaged, detection, salient, cod, sinet, challenging, camo, jianbing, segmentation, ali, semantic, chameleon, pdc, background, framework, center, module, feature, wenguan, ling, annotated, indefinable, benchmark, god, score, sod, map, area] [camouflage, model, generic, identification, trained, datasets, animal, collected, testing] [ieee, figure, proposed, existing, resolution, contrast, densely, receptive, based] [image, component, loss] [training, search, learning, deep, performance, network, size, best, task, number, set, potential, distribution, find, simple, small, metric, layer, function] [computer, vision, human, system, partial]
@InProceedings{Fan_2020_CVPR,
  author = {Fan, Deng-Ping and Ji, Ge-Peng and Sun, Guolei and Cheng, Ming-Ming and Shen, Jianbing and Shao, Ling},
  title = {Camouflaged Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Holistically-Attracted Wireframe Parsing
Nan Xue, Tianfu Wu, Song Bai, Fudong Wang, Gui-Song Xia, Liangpei Zhang, Philip H.S. Torr


This paper presents a fast and parsimonious parsing method to accurately and robustly detect a vectorized wireframe in an input image with a single forward pass. The proposed method is end-to-end trainable, consisting of three components: (i) line segment and junction proposal generation, (ii) line segment and junction matching, and (iii) line segment and junction verification. For computing line segment proposals, a novel exact dual representation is proposed which exploits a parsimonious geometric reparameterization for line segments and forms a holistic 4-dimensional attraction field map for an input image. Junctions can be treated as the "basins" in the attraction field. The proposed method is thus called Holistically-Attracted Wireframe Parser (HAWP). In experiments, the proposed method is tested on two benchmarks, the Wireframe dataset [14] and the YorkUrban dataset [8]. On both benchmarks, it obtains state-of-the-art performance in terms of accuracy and efficiency. For example, on the Wireframe dataset, compared to the previous state-of-the-art method L-CNN [36], it improves the challenging mean structural average precision (msAP) by a large margin (2.8% absolute improvements), and achieves 29.5 FPS on a single GPU (89% relative improvement). A systematic ablation study is performed to further justify the proposed method.
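The basic attraction-field idea can be sketched as the per-pixel displacement to the closest point on the nearest line segment; HAWP's exact 4-D reparameterization is richer, so the NumPy function below only illustrates the underlying geometry.

import numpy as np

def attraction_field(H, W, segments):
    """Per-pixel 2D displacement to the closest point on the nearest line
    segment. segments: (M, 4) array of (x1, y1, x2, y2) endpoints."""
    segments = np.asarray(segments, dtype=np.float64)
    ys, xs = np.mgrid[0:H, 0:W]
    p = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2).astype(np.float64)
    a, b = segments[None, :, :2], segments[None, :, 2:]
    ab = b - a
    t = ((p - a) * ab).sum(-1) / ((ab ** 2).sum(-1) + 1e-9)
    closest = a + np.clip(t, 0.0, 1.0)[..., None] * ab      # (H*W, M, 2)
    disp = closest - p
    nearest = np.linalg.norm(disp, axis=-1).argmin(axis=1)
    return disp[np.arange(len(nearest)), nearest].reshape(H, W, 2)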
[three, parser, work, dataset, vectorized, recognition] [segment, wireframe, junction, proposal, hawp, afm, attraction, parsing, map, holistic, detection, region, feature, groundtruth, lsd, dwp, module, yorkurban, adopt, benchmark, aph, fps, object, heatmap, offset, positive] [verification, input, reparameterization] [proposed, based, field, method, ieee, pattern, exact, dual, convolutional] [image, representation, loss, generation] [learning, negative, set, vector, performance, support, number, training, denoted, deep, simple, machine, precision, better, paper, computing, operation] [computer, vision, distance, conference, international, point, computed, novel, displacement, single, geometric, matching, distant, pose, estimation, human, supported]
@InProceedings{Xue_2020_CVPR,
  author = {Xue, Nan and Wu, Tianfu and Bai, Song and Wang, Fudong and Xia, Gui-Song and Zhang, Liangpei and Torr, Philip H.S.},
  title = {Holistically-Attracted Wireframe Parsing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Conv-MPN: Convolutional Message Passing Neural Network for Structured Outdoor Architecture Reconstruction
Fuyang Zhang, Nelson Nauata, Yasutaka Furukawa


This paper proposes a novel message passing neural (MPN) architecture, Conv-MPN, which reconstructs an outdoor building as a planar graph from a single RGB image. Conv-MPN is specifically designed for cases where nodes of a graph have explicit spatial embedding. In our problem, nodes correspond to building edges in an image. Conv-MPN is different from MPN in that 1) the feature associated with a node is represented as a feature volume instead of a 1D vector; and 2) convolutions encode messages instead of fully connected layers. Conv-MPN learns to select a true subset of nodes (i.e., building edges) to reconstruct a building planar graph. Our qualitative and quantitative evaluations over 2,000 buildings show that Conv-MPN makes significant improvements over existing fully neural solutions. We believe that this paper has the potential to open a new line of graph neural network research for structured geometry reconstruction.
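Schematically, one Conv-MPN-style message-passing step keeps a feature volume per node and updates it by convolving the concatenation of the node's own volume with a pooled neighbor volume. A hedged PyTorch sketch; the layer sizes and sum-pooling are illustrative choices, not the paper's exact architecture.

import torch
import torch.nn as nn

class ConvMessagePassing(nn.Module):
    """One message-passing step where each node keeps a (C, H, W) feature
    volume and messages are encoded by convolutions rather than fully
    connected layers."""
    def __init__(self, channels):
        super().__init__()
        self.update = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, node_feats, adjacency):
        # node_feats: (N, C, H, W); adjacency: (N, N) float 0/1 matrix.
        pooled = torch.einsum('ij,jchw->ichw', adjacency, node_feats)
        return self.update(torch.cat([node_feats, pooled], dim=1))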
[graph, message, passing, structured, node, nauata, convmpn, connected, infer, associated, current, gnn, failure] [building, feature, edge, corner, cnn, detect, region, ppgnet, recall, yasutaka, fully, challenge, spacenet, confidence] [input] [ieee, convolutional, figure, pattern, spatial, existing] [image, structural, learns, qualitative] [neural, architecture, network, learning, inference, memory, problem, deep, set, paper, standard, processing, requires, classification, simple, update, optimization, training, gpu, large, performance, arxiv, preprint] [computer, conference, vision, planar, volume, reconstruction, rgb, geometric, geometry, structure, international, outdoor, single, satellite, human, complex, pose, hamaguchi]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Fuyang and Nauata, Nelson and Furukawa, Yasutaka},
  title = {Conv-MPN: Convolutional Message Passing Neural Network for Structured Outdoor Architecture Reconstruction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Domain Adaptation for Image Dehazing
Yuanjie Shao, Lerenhan Li, Wenqi Ren, Changxin Gao, Nong Sang


Image dehazing using learning-based methods has achieved state-of-the-art performance in recent years. However, most existing methods train a dehazing model on synthetic hazy images, which are less able to generalize well to real hazy images due to domain shift. To address this issue, we propose a domain adaptation paradigm, which consists of an image translation module and two image dehazing modules. Specifically, we first apply a bidirectional translation network to bridge the gap between the synthetic and real domains by translating images from one domain to another. And then, we use images before and after translation to train the two proposed image dehazing networks with a consistency constraint. In this phase, we incorporate the real hazy images into the dehazing training by exploiting the properties of the clear image (e.g., dark channel prior and image gradient smoothing) to further improve the domain adaptivity. By training the image translation and dehazing networks in an end-to-end manner, we obtain better results for both image translation and dehazing. Experimental results on both synthetic and real-world images demonstrate that our model performs favorably against the state-of-the-art dehazing algorithms.
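Of the clear-image properties used to supervise the real branch, the dark channel prior is easy to make concrete: the per-pixel minimum over color channels followed by a local minimum filter, which tends toward zero for haze-free outdoor images. A minimal (unoptimized) sketch:

import numpy as np

def dark_channel(img, patch=15):
    """Dark channel of an (H, W, 3) image in [0, 1]: per-pixel minimum over
    colour channels, then a minimum filter over a patch x patch window."""
    h, w, _ = img.shape
    min_c = img.min(axis=2)
    pad = patch // 2
    padded = np.pad(min_c, pad, mode='edge')
    out = np.empty_like(min_c)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out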
[visual, incorporate, work, dataset] [module, map, propose, framework, table, denotes, feature] [model, adversarial, trained, clean, improve, datasets] [dehazing, hazy, ieee, figure, method, proposed, dehazed, pattern, dark, channel, epdn, clear, transmission, haze, color, prior, img, conv, nld, dehazenet, dcpdn, gfn, result, quantitative, wenqi, performs, atmospheric, convolution, hazerd, psnr] [image, real, synthetic, domain, translation, loss, adaptation, train, translated, unsupervised, eat, perform, gan, consistency, utilize, gap, discrepancy, translate, translator] [network, training, learning, performance, better, deep, data, reduce, layer, gradient, set] [conference, computer, depth, vision, single, international, estimate, demonstrate, scene, well]
@InProceedings{Shao_2020_CVPR,
  author = {Shao, Yuanjie and Li, Lerenhan and Ren, Wenqi and Gao, Changxin and Sang, Nong},
  title = {Domain Adaptation for Image Dehazing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Auto-Encoding Twin-Bottleneck Hashing
Yuming Shen, Jie Qin, Jiaxin Chen, Mengyang Yu, Li Liu, Fan Zhu, Fumin Shen, Ling Shao


Conventional unsupervised hashing methods usually take advantage of similarity graphs, which are either pre-computed in the high-dimensional space or obtained from random anchor points. On the one hand, existing methods uncouple the procedures of hash function learning and graph construction. On the other hand, graphs empirically built upon original data could introduce biased prior knowledge of data relevance, leading to sub-optimal retrieval performance. In this paper, we tackle the above problems by proposing an efficient and adaptive code-driven graph, which is updated by decoding in the context of an auto-encoder. Specifically, we introduce into our framework twin bottlenecks (i.e., latent variables) that exchange crucial information collaboratively. One bottleneck (i.e., binary codes) conveys the high-level intrinsic data structure captured by the code-driven graph to the other (i.e., continuous variables for low-level detail information), which in turn propagates the updated network feedback for the encoder to learn more discriminative binary codes. The auto-encoding learning objective literally rewards the code-driven graph to learn an optimal encoder. Moreover, the proposed model can be simply optimized by gradient descent without violating the binary constraints. Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods. Our source code can be found at https://github.com/ymcidence/TBH.
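The code-driven graph can be thought of as an adjacency matrix derived from the binary bottleneck itself, e.g. via normalized Hamming similarity. The sketch below is a simplified stand-in for the paper's construction, not its exact formulation.

import numpy as np

def code_driven_adjacency(codes):
    """Adjacency matrix from binary codes in {0, 1}^(N, B): similarity is
    1 - normalised Hamming distance, so identical codes get weight 1 and
    complementary codes weight 0."""
    codes = codes.astype(np.float64)
    B = codes.shape[1]
    hamming = codes @ (1 - codes).T + (1 - codes) @ codes.T
    return 1.0 - hamming / B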
[graph, decoder, retrieval, decoding, adjacency, gcn, encoding, trainable] [feature, framework, adopt, fully, sigmoid, ling] [model, adversarial, lae, original, refers] [proposed, existing, relu, convolutional, figure, method, based, adaptive] [unsupervised, latent, code, image, generative, encoder, loss, variable, supervised] [binary, tbh, data, hashing, learning, network, bottleneck, similarity, training, hamming, deep, sgh, stochastic, set, regularization, gradient, performance, discrete, layer, hash, log, baseline, batch, machine, function, efficient, objective, greedyhash, precision, yuming, updated, better, problem, neural, design, procedure, note, involves, wae, neuron] [continuous, reconstruction, directly, single, computed]
@InProceedings{Shen_2020_CVPR,
  author = {Shen, Yuming and Qin, Jie and Chen, Jiaxin and Yu, Mengyang and Liu, Li and Zhu, Fan and Shen, Fumin and Shao, Ling},
  title = {Auto-Encoding Twin-Bottleneck Hashing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Agriculture-Vision: A Large Aerial Image Database for Agricultural Pattern Analysis
Mang Tik Chiu, Xingqian Xu, Yunchao Wei, Zilong Huang, Alexander G. Schwing, Robert Brunner, Hrant Khachatrian, Hovnatan Karapetyan, Ivan Dozier, Greg Rose, David Wilson, Adrian Tudor, Naira Hovakimyan, Thomas S. Huang, Honghui Shi


The success of deep learning in visual recognition tasks has driven advancements in multiple fields of research. Particularly, increasing attention has been drawn towards its application in agriculture. Nevertheless, while visual pattern recognition on farmlands carries enormous economic values, little progress has been made to merge computer vision and crop sciences due to the lack of suitable agricultural image datasets. Meanwhile, problems in agriculture also pose new challenges in computer vision. For example, semantic segmentation of aerial farmland images requires inference over extremely large-size images with extreme annotation sparsity. These challenges are not present in most of the common object datasets, and we show that they are more challenging than many other aerial image datasets. To encourage research in computer vision for agriculture, we present Agriculture-Vision: a large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns. We collected 94,986 high-quality aerial images from 3,432 farmlands across the US, where each image consists of RGB and Near-infrared (NIR) channels with resolution as high as 10 cm per pixel. We annotate nine types of field anomaly patterns that are most important to farmers. As a pilot study of aerial agricultural semantic segmentation, we perform comprehensive experiments using popular semantic segmentation models; we also propose an effective model designed for aerial agricultural pattern recognition. Our experiments demonstrate several challenges Agriculture-Vision poses to both the computer vision and agriculture communities. Future versions of this dataset will include even more aerial images, anomaly patterns and image channels.
[dataset, recognition, multiple, visual, red, work] [aerial, semantic, segmentation, table, remote, object, yunchao, annotation, annotated, extreme, detection, miou] [model, datasets, study, trained, collected] [agricultural, field, farmland, pattern, ieee, honghui, weed, figure, proposed, convolution, convolutional, window, analysis, pixel, land, crop, captured, agriculture, method, resolution, color, storm, channel, nrg, sensing, pilot] [image, common, nir, transfer, encourage] [deep, learning, neural, large, arxiv, preprint, size, training, classification, data, sample, note, larger, network, layer] [computer, conference, vision, rgb, thomas, international, scene, cover, david]
@InProceedings{Chiu_2020_CVPR,
  author = {Chiu, Mang Tik and Xu, Xingqian and Wei, Yunchao and Huang, Zilong and Schwing, Alexander G. and Brunner, Robert and Khachatrian, Hrant and Karapetyan, Hovnatan and Dozier, Ivan and Rose, Greg and Wilson, David and Tudor, Adrian and Hovakimyan, Naira and Huang, Thomas S. and Shi, Honghui},
  title = {Agriculture-Vision: A Large Aerial Image Database for Agricultural Pattern Analysis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Bi-Directional Interaction Network for Person Search
Wenkai Dong, Zhaoxiang Zhang, Chunfeng Song, Tieniu Tan


Existing works have designed end-to-end frameworks based on Faster-RCNN for person search. Due to the large receptive fields in deep networks, the feature maps of each proposal, cropped from the stem feature maps, involve redundant context information outside the bounding boxes. However, person search is a fine-grained task which needs accurate appearance information. Such context information can make the model fail to focus on persons, so the learned representations lack the capacity to discriminate various identities. To address this issue, we propose a Siamese network which owns an additional instance-aware branch, named Bi-directional Interaction Network (BINet). During the training phase, in addition to scene images, BINet also takes as inputs person patches which help the model discriminate identities based on human appearance. Moreover, two interaction losses are designed to achieve bi-directional interaction between branches at two levels. The interaction can help the model learn more discriminative features for persons in the scene. At the inference stage, only the major branch is applied, so BINet introduces no additional computation. Extensive experiments on two widely used person search benchmarks, CUHK-SYSU and PRW, have shown that our BINet achieves state-of-the-art results among end-to-end methods without loss of efficiency.
[interaction, context, dataset, previous] [binet, feature, cropped, gallery, bounding, branch, table, oim, detection, prw, faster, map, siamese, focus, object, framework, apply, named, achieves, cnn, qeeps, resized, redundant, discriminate, propose, effectiveness] [model, identity, influence, improve] [ieee, pattern, method, based, figure, proposed, scale, guidance, existing] [person, loss, learn, discriminative, appearance, introduce] [search, performance, network, size, training, learning, baseline, deep, compared, large, layer, set, neural, best, task, achieve, classification, share] [scene, conference, computer, vision, additional, human, solve, second]
@InProceedings{Dong_2020_CVPR,
  author = {Dong, Wenkai and Zhang, Zhaoxiang and Song, Chunfeng and Tan, Tieniu},
  title = {Bi-Directional Interaction Network for Person Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Meshlet Priors for 3D Mesh Reconstruction
Abhishek Badki, Orazio Gallo, Jan Kautz, Pradeep Sen


Estimating a mesh from an unordered set of sparse, noisy 3D points is a challenging problem that requires carefully selecting priors. Existing hand-crafted priors, such as smoothness regularizers, impose an undesirable trade-off between attenuating noise and preserving local detail. Recent deep-learning approaches produce impressive results by learning priors directly from the data. However, the priors are learned at the object level, which makes these algorithms class-specific, and even sensitive to the pose of the object. We introduce meshlets, small patches of mesh that we use to learn local shape priors. Meshlets act as a dictionary of local features and thus allow learned priors to be used to reconstruct object meshes in any pose and from unseen classes, even when the noise is large and the samples sparse.
[recognition, natural, work, extract] [object, global, marked] [noise, poisson, quality, input] [figure, method, ieee, pattern, traditional, prior, noisy, low, fail] [learn, latent, consistency, introduce, corresponding, ability, manifold, produce, disentangle, variational] [optimization, learning, training, deep, space, set, algorithm, small, learned, large, number, vector, neural, sparsity] [meshlets, mesh, point, local, meshlet, shape, computer, conference, vision, cloud, reconstruction, pose, laplacian, surface, reconstruct, distance, depth, enforce, enforcing, smoothness, estimate, occnet, geometric, compute, error, approach, atlasnet, single, directly, watertight, measured, estimation, vertex, canonical, match, compare, allows]
@InProceedings{Badki_2020_CVPR,
  author = {Badki, Abhishek and Gallo, Orazio and Kautz, Jan and Sen, Pradeep},
  title = {Meshlet Priors for 3D Mesh Reconstruction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Space-Time-Aware Multi-Resolution Video Enhancement
Muhammad Haris, Greg Shakhnarovich, Norimichi Ukita


We consider the problem of space-time super-resolution (ST-SR): increasing spatial resolution of video frames and simultaneously interpolating frames to increase the frame rate. Modern approaches handle these axes one at a time. In contrast, our proposed model called STARnet super-resolves jointly in space and time. This allows us to leverage mutually informative relationships between time and space: higher resolution can provide more detailed information about motion, and higher frame-rate can provide better pixel alignment. The components of our model that generate latent low- and high-resolution representations during ST-SR can be used to finetune a specialized mechanism for just spatial or just temporal super-resolution. Experimental results demonstrate that STARnet improves the performances of space-time, spatial, and temporal video super-resolution by substantial margins on publicly available datasets.
[video, temporal, frame, multiple, time, visual, greg] [table, stage, feature, refinement, muhammad] [input, improve, original, model, niqe] [psnr, spatial, flow, star, ssim, rbpn, dbpn, starnet, dain, motion, method, resolution, itsr, low, interpolation, analysis, toflow, itl, netd, optical, convolutional, figure, comparison, lvgg, ieee, proposed, residual, downscaled, netf, norimichi, based] [image, loss, subtle, produce, perform, representation] [space, learning, network, better, deep, training, large, test, performance, higher, set, indicates, compared, best, neural] [joint, direct, jointly, human]
@InProceedings{Haris_2020_CVPR,
  author = {Haris, Muhammad and Shakhnarovich, Greg and Ukita, Norimichi},
  title = {Space-Time-Aware Multi-Resolution Video Enhancement},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FSS-1000: A 1000-Class Dataset for Few-Shot Segmentation
Xiang Li, Tianhan Wei, Yau Pun Chen, Yu-Wing Tai, Chi-Keung Tang


Over the past few years, we have witnessed the success of deep learning in image recognition thanks to the availability of large-scale human-annotated datasets such as PASCAL VOC, ImageNet, and COCO. Although these datasets have covered a wide range of object categories, there are still a significant number of objects that are not included. Can we perform the same task without a lot of human annotations? In this paper, we are interested in few-shot object segmentation, where the number of annotated training examples is limited to only 5. To evaluate and validate the performance of our approach, we have built a few-shot segmentation dataset, FSS-1000, which consists of 1000 object classes with pixelwise annotation of ground-truth segmentation. Unique to FSS-1000, our dataset contains a significant number of objects that have never been seen or annotated in previous datasets, such as tiny daily objects, merchandise, cartoon characters, logos, etc. We build our baseline model using standard backbone networks such as VGG-16, ResNet-101, and Inception. To our surprise, we found that training our model from scratch using FSS-1000 achieves comparable and even better results than training with weights pre-trained on ImageNet, which is more than 100 times larger than FSS-1000. Both our approach and dataset are simple, effective, and easily extensible to learn segmentation of new object classes given very few annotated training examples. The dataset is available at https://github.com/HKUSTCV/FSS-1000
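As a rough illustration of the episodic setup the abstract describes (5 annotated support examples per novel class), the sketch below assembles one few-shot segmentation episode; the directory layout, file naming, and the sample_episode helper are hypothetical conveniences, not the released FSS-1000 tooling.

```python
import random
from pathlib import Path

def sample_episode(root, n_support=5, n_query=1):
    """Sample one few-shot segmentation episode: (image, mask) support pairs
    and query pairs from a single randomly chosen class. Assumes a layout
    like root/<class_name>/{1.jpg, 1.png, ...} -- an illustrative convention,
    not the official loader."""
    classes = [d for d in Path(root).iterdir() if d.is_dir()]
    cls = random.choice(classes)
    images = sorted(cls.glob("*.jpg"))
    chosen = random.sample(images, n_support + n_query)
    pairs = [(img, img.with_suffix(".png")) for img in chosen]  # mask assumed to sit next to the image
    return pairs[:n_support], pairs[n_support:]

support, query = sample_episode("FSS-1000")  # hypothetical dataset root directory
```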
[dataset, relation, decoder, previous, visual, recognition, prediction] [segmentation, object, module, feature, semantic, pascal, coco, table, annotation, voc, instance, guided, fscoco, iou, achieves, hard, backbone, ilsvrc, segment, level, map] [model, datasets, trained, query, pixelwise, animal, input, example] [figure, existing, convolutional, scale, output, proposed] [image, unseen, loss, consists, train, encoder, produce, learn, corresponding] [support, network, set, training, number, learning, class, test, deep, label, performance, baseline, imagenet, classification, accuracy, architecture, binary, good, large, evaluate, better, neural, small, hierarchy] [human, limited, compare]
@InProceedings{Li_2020_CVPR,
  author = {Li, Xiang and Wei, Tianhan and Chen, Yau Pun and Tai, Yu-Wing and Tang, Chi-Keung},
  title = {FSS-1000: A 1000-Class Dataset for Few-Shot Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MSeg: A Composite Dataset for Multi-Domain Semantic Segmentation
John Lambert, Zhuang Liu, Ozan Sener, James Hays, Vladlen Koltun


We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains. A naive merge of the constituent datasets yields poor performance due to inconsistent taxonomies and annotation practices. We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images. The resulting composite dataset enables training a single semantic segmentation model that functions effectively across domains and generalizes to datasets that were not seen during training. We adopt zero-shot cross-dataset transfer as a benchmark to systematically evaluate a model's robustness and show that MSeg training yields substantially more robust models in comparison to training on individual datasets or naive mixing of datasets without the presented contributions. A model trained on MSeg ranks first on the WildDash leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
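The relabeling step described above boils down to a per-pixel lookup from each source dataset's label ids into the unified taxonomy. A minimal sketch, where the id mapping DATASET_TO_MSEG is invented for illustration (MSeg ships its own mapping tables):

```python
import numpy as np

# Hypothetical mapping from one source dataset's label ids to unified MSeg ids.
DATASET_TO_MSEG = {0: 7, 1: 3, 2: 3, 3: 11}   # e.g. two source classes merged into one unified class
IGNORE = 255

def remap_mask(mask: np.ndarray, mapping: dict) -> np.ndarray:
    """Remap an HxW integer label mask into the unified taxonomy via a
    lookup table; any id not covered by the mapping becomes IGNORE."""
    lut = np.full(256, IGNORE, dtype=np.uint8)
    for src, dst in mapping.items():
        lut[src] = dst
    return lut[mask]

unified = remap_mask(np.zeros((4, 4), dtype=np.uint8), DATASET_TO_MSEG)
```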
[dataset, multiple, driving, road, work, visual, individual, evaluation, lane, naive] [semantic, segmentation, table, unified, coco, mapillary, split, pascal, merging, annotation, object, sun, voc, merge, benchmark, curtain, bdd, labeling, building, bicyclist, motorcyclist] [datasets, model, trained, generalization, robust, counter, relabeling, compatible] [figure, result, resolution, pixel, presented] [mseg, wilddash, taxonomy, domain, component, image, composite, transfer, mixing, rider, perform, sidewalk, train, idd, mix, harmonic] [training, performance, test, data, learning, class, accuracy, validation, set, best, good, evaluate, classification, algorithm, john] [vladlen, vision, indoor, single, ground, provided, scene, truth, enables, scannet, computer]
@InProceedings{Lambert_2020_CVPR,
  author = {Lambert, John and Liu, Zhuang and Sener, Ozan and Hays, James and Koltun, Vladlen},
  title = {MSeg: A Composite Dataset for Multi-Domain Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DeeperForensics-1.0: A Large-Scale Dataset for Real-World Face Forgery Detection
Liming Jiang, Ren Li, Wayne Wu, Chen Qian, Chen Change Loy


We present our ongoing effort to construct a large-scale benchmark for face forgery detection. The first version of this benchmark, DeeperForensics-1.0, represents the largest face forgery detection dataset by far, with 60,000 videos constituted by a total of 17.6 million frames, 10 times larger than existing datasets of the same kind. Extensive real-world perturbations are applied to obtain a more challenging benchmark of larger scale and higher diversity. All source videos in DeeperForensics-1.0 are carefully collected, and fake videos are generated by a newly proposed end-to-end face swapping framework. The quality of generated videos outperforms those in existing datasets, validated by user studies. The benchmark features a hidden test set, which contains manipulated videos achieving high deceptive scores in human evaluations. We further contribute a comprehensive study that evaluates five representative detection baselines and provides a thorough analysis of different settings.
[dataset, hidden, video, temporal, previous, evaluation, three, future, youtube, order] [detection, table, benchmark, propose, module, challenging] [face, forgery, manipulated, quality, datasets, trained, facial, std, model, collected, swapped, madain, deepfake, study, improve, forensics, dfdc, aforementioned, detecting] [high, figure, existing, method, scale, prior, column, raw, proposed] [source, fake, swapping, target, appearance, style, diversity, image, user, real, ensure, generated, introduce, variational, perform, train] [set, test, data, training, standard, accuracy, deep, better, arxiv, learning, total, distribution, compared, neural, larger] [structure, collection, human]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Liming and Li, Ren and Wu, Wayne and Qian, Chen and Loy, Chen Change},
  title = {DeeperForensics-1.0: A Large-Scale Dataset for Real-World Face Forgery Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification
Yichao Yan, Jie Qin, Jiaxin Chen, Li Liu, Fan Zhu, Ying Tai, Ling Shao


Video-based person re-identification (re-ID) is an important research topic in computer vision. The key to tackling this challenging task is to exploit both spatial and temporal clues in video sequences. In this work, we propose a novel graph-based framework, namely Multi-Granular Hypergraph (MGH), to pursue better representational capabilities by modeling spatiotemporal dependencies in terms of multiple granularities. Specifically, hypergraphs with different spatial granularities are constructed using various levels of part-based features across the video sequence. In each hypergraph, different temporal granularities are captured by hyperedges that connect a set of graph nodes (i.e., part-based features) across different temporal ranges. Two critical issues (misalignment and occlusion) are explicitly addressed by the proposed hypergraph propagation and feature aggregation schemes. Finally, we further enhance the overall video representation by learning more diversified graph-level representations of multiple granularities based on mutual information minimization. Extensive experiments on three widely-adopted benchmarks clearly demonstrate the effectiveness of the proposed framework. Notably, 90.0% top-1 accuracy on MARS is achieved using MGH, outperforming the state of the art.
[temporal, hypergraph, video, node, graph, attention, mgh, sequence, hyperedge, three, multiple, hypergraphs, hyperedges, explicitly, spatiotemporal, videobased, people, granularity, dependency, length, exploit, modeling] [feature, aggregation, propagation, framework, achieves, map, pooling, global, correlation, denotes, table, final, adopt, aggregate, attentive, propose] [model, robust, influence] [spatial, figure, proposed, convolutional, based] [person, discriminative, loss, representation, image, address] [learning, neural, performance, mutual, set, network, deep, number, training, better, standard, average, baseline, accuracy, compared, layer] [human, body, capture, local, additional, novel]
@InProceedings{Yan_2020_CVPR,
  author = {Yan, Yichao and Qin, Jie and Chen, Jiaxin and Liu, Li and Zhu, Fan and Tai, Ying and Shao, Ling},
  title = {Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Online Joint Multi-Metric Adaptation From Frequent Sharing-Subset Mining for Person Re-Identification
Jiahuan Zhou, Bing Su, Ying Wu


Person Re-IDentification (P-RID), as an instance-level recognition problem, still remains challenging in the computer vision community. Many P-RID works aim to learn faithful and discriminative features/metrics from offline training data and directly use them for the unseen online testing data. However, their performance is largely limited due to the severe data shifting issue between training and testing data. Therefore, we propose an online joint multi-metric adaptation model to adapt the offline-learned P-RID models to the online data by learning a series of metrics for all the sharing-subsets. Each sharing-subset is obtained from the proposed novel frequent sharing-subset mining module and contains a group of testing samples which share strong visual similarity relationships with each other. Unlike existing online P-RID methods, our model simultaneously takes both the sample-specific discriminant and the set-based visual similarity among testing samples into consideration, so that the adapted multiple metrics can refine the discriminant of all the given testing samples jointly via a multi-kernel late fusion framework. Our proposed model is generally suitable for boosting any offline-learned P-RID baseline online; the performance improvement of our model is not only verified by extensive experiments on several widely-used P-RID benchmarks (CUHK03, Market1501, DukeMTMC-reID and MSMT17) and state-of-the-art P-RID baselines but also guaranteed by the provided in-depth theoretical analyses.
[visual, multiple, retrieval, attention, evaluation, late] [gallery, map, improvement, feature, liang, propose, table, shifting, distractors, challenging, fully] [testing, model, query, offline, discriminant, influence, generalization, strong, probe] [proposed, method, based, fusion, kernel, figure, utilized] [person, adaptation, discriminative, learn, list] [online, learning, performance, metric, data, similarity, frequent, learned, mining, sharing, sample, baseline, sjv, training, large, ranking, compared, network, equality, siv, set, sssets, eest, distribution, rank, algorithm, deep, theoretical, negative, efficient, mahalanobis, ssset] [local, joint, solution, form, error, novel, directly, limited, matching]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Jiahuan and Su, Bing and Wu, Ying},
  title = {Online Joint Multi-Metric Adaptation From Frequent Sharing-Subset Mining for Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Taking a Deeper Look at Co-Salient Object Detection
Deng-Ping Fan, Zheng Lin, Ge-Peng Ji, Dingwen Zhang, Huazhu Fu, Ming-Ming Cheng


Co-salient object detection (CoSOD) is a newly emerging and rapidly growing branch of salient object detection (SOD), which aims to detect the co-occurring salient objects in multiple images. However, existing CoSOD datasets often have a serious data bias, which assumes that each group of images contains salient objects of similar visual appearances. This bias leads to idealized settings, and the effectiveness of models trained on existing datasets may be impaired in real-life situations, where the similarity is usually semantic or conceptual. To tackle this issue, we first collect a new high-quality dataset, named CoSOD3k, which contains 3,316 images divided into 160 groups with multiple levels of annotation, i.e., category, bounding box, object, and instance levels. CoSOD3k makes a significant leap in terms of diversity, difficulty and scalability, benefiting related vision tasks. Besides, we comprehensively summarize 34 cutting-edge algorithms, benchmarking 19 of them over four existing CoSOD datasets (MSRC, iCoSeg, Image Pair and CoSal2015) and our CoSOD3k with a total of 61K images (largest scale), and reporting group-level performance analysis. Finally, we discuss the challenges and future work of CoSOD. Our study would give a strong boost to growth in the CoSOD community. Benchmark toolbox and results are available on our project page.
[dataset, visual, multiple, hierarchical, pair, provide, graph, current, three, attention, evaluation] [object, cosod, detection, salient, sod, bounding, instance, saliency, icoseg, egnet, junwei, dingwen, category, cpd, msrc, csmg, umlf, wei, huazhu, detect, benchmark, table, fully, cshs, esmg, framework, ali, cosaliency, xiang, huchuan, box, pcsd, feature] [model, datasets, comprehensive, input, evaluating] [ieee, existing, based, convolutional, proposed, traditional, figure] [image, common] [learning, deep, group, performance, number, large, metric, network, average, imagenet, computational, deeper, data] [single, acm, vision, provided]
@InProceedings{Fan_2020_CVPR,
  author = {Fan, Deng-Ping and Lin, Zheng and Ji, Ge-Peng and Zhang, Dingwen and Fu, Huazhu and Cheng, Ming-Ming},
  title = {Taking a Deeper Look at Co-Salient Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Single-Stage 6D Object Pose Estimation
Yinlin Hu, Pascal Fua, Wei Wang, Mathieu Salzmann


Most recent 6D pose estimation frameworks first rely on a deep network to establish correspondences between 3D object keypoints and 2D image locations and then use a variant of a RANSAC-based Perspective-n-Point (PnP) algorithm. This two-stage process, however, is suboptimal: First, it is not end-to-end trainable. Second, training the deep network relies on a surrogate loss that does not directly reflect the final 6D pose estimation task. In this work, we introduce a deep architecture that directly regresses 6D poses from correspondences. It takes as input a group of candidate correspondences for each 3D keypoint and accounts for the fact that the order of the correspondences within each group is irrelevant, while the order of the groups, that is, of the 3D keypoints, is fixed. Our architecture is generic and can thus be exploited in conjunction with existing correspondence-extraction networks so as to yield single-stage 6D pose estimation frameworks. Our experiments demonstrate that these single-stage frameworks consistently outperform their two-stage counterparts in terms of both accuracy and speed.
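For reference, the two-stage baseline criticized above, once a network has predicted 2D-3D correspondences, typically ends with a RANSAC-based PnP solve. A minimal sketch using OpenCV, with toy correspondences and a placeholder camera matrix:

```python
import numpy as np
import cv2

# Toy setup: 3D object keypoints and an assumed camera intrinsic matrix.
pts_3d = np.random.rand(8, 3).astype(np.float32)
K = np.array([[600, 0, 320], [0, 600, 240], [0, 0, 1]], dtype=np.float32)

# Simulate the first stage's output: noisy 2D detections of those keypoints,
# obtained here by projecting with a made-up ground-truth pose.
rvec_gt = np.array([[0.1], [0.2], [0.0]], dtype=np.float32)
tvec_gt = np.array([[0.0], [0.0], [3.0]], dtype=np.float32)
pts_2d, _ = cv2.projectPoints(pts_3d, rvec_gt, tvec_gt, K, None)
pts_2d = pts_2d.reshape(-1, 2) + np.random.randn(8, 2)

# Stage two of the classic pipeline: RANSAC-based PnP on the correspondences.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None)
R, _ = cv2.Rodrigues(rvec)  # rotation matrix of the recovered 6D pose
```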
[order, multiple, three] [object, feature, pascal, level, table, bounding, detection, global, module, center, box] [input, noise, model, original, create] [pnp, method, pattern, figure, cell, stefan, ieee] [image, cluster, synthetic, target, loss, train] [network, deep, training, data, architecture, algorithm, processing, learning, average, set, problem, randomly, accuracy, neural, vector, machine, fact] [pose, computer, conference, estimation, correspondence, international, ransac, vision, error, approach, accurate, point, segdriven, pvnet, vincent, posecnn, grid, camera, uik, local, compare, keypoint, solution, rigid, single, mathieu, establish, keypoints, directly]
@InProceedings{Hu_2020_CVPR,
  author = {Hu, Yinlin and Fua, Pascal and Wang, Wei and Salzmann, Mathieu},
  title = {Single-Stage 6D Object Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
OccuSeg: Occupancy-Aware 3D Instance Segmentation
Lei Han, Tian Zheng, Lan Xu, Lu Fang


3D instance segmentation, with a variety of applications in robotics and augmented reality, is in high demand these days. Unlike 2D images, which are projective observations of the environment, 3D models provide metric reconstructions of the scenes without occlusion or scale ambiguity. In this paper, we define the "3D occupancy size" as the number of voxels occupied by each instance. Occupancy size is robust to predict, and on this basis we propose OccuSeg, an occupancy-aware 3D instance segmentation scheme. Our multi-task learning produces both an occupancy signal and embedding representations, where the training of spatial and feature embeddings varies according to their difference in scale awareness. Our clustering scheme benefits from the reliable comparison between the predicted occupancy size and the clustered occupancy size, which encourages hard samples to be correctly clustered and avoids over-segmentation. The proposed approach achieves state-of-the-art performance on three real-world datasets, i.e. ScanNetV2, S3DIS and SceneNN, while maintaining high efficiency.
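A highly simplified stand-in for the occupancy-aware clustering idea: clusters whose current voxel count falls well short of the occupancy size predicted for their voxels are flagged for further merging. The threshold ratio and the toy data below are illustrative assumptions, not the paper's actual graph-based clustering scheme:

```python
import numpy as np

def occupancy_aware_merge(cluster_sizes, pred_occupancy, ratio=0.5):
    """Flag preliminary clusters that still need merging: a cluster whose
    current size is much smaller than the occupancy predicted for its voxels
    is probably a fragment of a larger instance (simplified heuristic)."""
    return [i for i, (size, occ) in enumerate(zip(cluster_sizes, pred_occupancy))
            if size < ratio * occ]

sizes = np.array([120, 900, 45])        # voxels currently assigned to each cluster (toy)
occ = np.array([1000, 950, 60])         # mean predicted occupancy size per cluster (toy)
to_merge = occupancy_aware_merge(sizes, occ)  # clusters flagged for further merging
```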
[embedding, previous, graph, prediction] [instance, segmentation, feature, semantic, object, predicted, belonging, stage, map, occuseg, benchmark, table, achieves, employ, inherent, propose, represents, merge, scenenn] [input, public] [spatial, ieee, method, pattern, convolutional, signal, based, figure, comparison, proposed, high, convolution, pixel] [image, utilize, learn] [clustering, learning, network, number, covariance, scheme, set, validation, arxiv, preprint, space, metric, performance, evaluate, deep, vector, neural, efficient, function, size, margin] [occupancy, conference, computer, vision, point, voxels, term, approach, reconstruction, international, geometry, voxel, occupied, sparse, indoor, jointly, reconstructed, well, scene]
@InProceedings{Han_2020_CVPR,
  author = {Han, Lei and Zheng, Tian and Xu, Lan and Fang, Lu},
  title = {OccuSeg: Occupancy-Aware 3D Instance Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Camera Trace Erasing
Chang Chen, Zhiwei Xiong, Xiaoming Liu, Feng Wu


Camera trace is a unique noise produced during the digital imaging process. Most existing forensic methods analyze camera trace to identify image origins. In this paper, we address a new low-level vision problem, camera trace erasing, to reveal the weakness of trace-based forensic methods. A comprehensive investigation of existing anti-forensic methods reveals that it is non-trivial to effectively erase camera trace while avoiding the destruction of the content signal. To reconcile these two demands, we propose Siamese Trace Erasing (SiamTE), in which a novel hybrid loss is designed on the basis of a Siamese architecture for network training. Specifically, we propose embedded similarity, truncated fidelity, and cross identity to form the hybrid loss. Compared with existing anti-forensic methods, SiamTE has a clear advantage for camera trace erasing, which is demonstrated on three representative tasks.
[embedded, visual, three, dataset, order, shift] [erasing, adopt, propose, table, siamese, denotes] [trace, adversarial, forensic, siamte, manipulation, jpeg, forensics, identity, noise, degree, digital, type, ltf, ori, destruction, origin, niqe, truncated, verification, effectively, erase, identification, conduct, cyclic, input, reveal, comprehensive] [method, captured, proposed, ieee, signal, denoising, existing, figure, listed, compression, comparison, adopted, based, operator, imaging] [image, loss, content, cross, lci, extracted, address] [network, performance, similarity, clustering, classification, filter, accuracy, learning, processing, metric, better, measure, set, neural, deep, machine, calculate, size, architecture, compared] [camera, hybrid, median, distance, define, vision]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Chang and Xiong, Zhiwei and Liu, Xiaoming and Wu, Feng},
  title = {Camera Trace Erasing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Metric Learning via Adaptive Learnable Assessment
Wenzhao Zheng, Jiwen Lu, Jie Zhou


In this paper, we propose a deep metric learning via adaptive learnable assessment (DML-ALA) method for image retrieval and clustering, which aims to learn a sample assessment strategy that maximizes the generalization of the trained metric. Unlike existing deep metric learning methods that usually employ a fixed sampling strategy such as hard negative mining, we propose a sequence-aware learnable assessor which re-weights each training example to train the metric towards good generalization. We formulate the learning of this assessor as a meta-learning problem, where we employ an episode-based training scheme and update the assessor at each iteration to adapt to the current model status. We construct each episode by sampling two disjoint subsets of labels to simulate the procedure of training and testing, and use the performance of the one-gradient-updated metric on the validation subset as the meta-objective of the assessor. Experimental results on the widely used CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate the effectiveness of the proposed approach.
[embedding, state, sequence, current, retrieval, previous, construct, three, composed, lstm, connected, dataset] [hard, table, propose, positive, china, fully, grant] [model, assessment, generalization, trained, original, disjoint] [method, proposed, existing, learnable, adaptive, figure, simulate] [loss, image, train, learn, utilize, ability, person] [metric, training, assessor, learning, deep, sampling, ala, triplet, set, strategy, mining, tuple, updated, sample, episode, network, update, margin, negative, performance, validation, online, clustering, subset, weight, min, nmi, maximize, stanford, tuples, test, learned, gradient, jiwen, weighted, simultaneously, function, jie, class, number, knowledge, process, arg] [distance]
@InProceedings{Zheng_2020_CVPR,
  author = {Zheng, Wenzhao and Lu, Jiwen and Zhou, Jie},
  title = {Deep Metric Learning via Adaptive Learnable Assessment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Representation Learning on Long-Tailed Data: A Learnable Embedding Augmentation Perspective
Jialun Liu, Yifan Sun, Chuchu Han, Zhaopeng Dou, Wenhui Li


This paper considers learning deep features from long-tailed data. We observe that in the deep feature space, the head classes and the tail classes present different distribution patterns. The head classes have a relatively large spatial span, while the tail classes have a significantly small spatial span, due to the lack of intra-class diversity. This uneven distribution between head and tail classes distorts the overall feature space, which compromises the discriminative ability of the learned features. In response, we seek to expand the distribution of the tail classes during training, so as to alleviate the distortion of the feature space. To this end, we propose to augment each instance of the tail classes with certain disturbances in the deep feature space. With the augmentation, a specified feature vector becomes a set of probable features scattered around itself, which is analogous to an atomic nucleus surrounded by its electron cloud. Intuitively, we name it the "feature cloud". The intra-class distribution of the feature cloud is learned from the head classes, and thus provides higher intra-class variation to the tail classes. Consequently, it alleviates the distortion of the learned feature space and improves deep representation learning on long-tailed data. Extensive experimental evaluations on person re-identification and face recognition tasks confirm the effectiveness of our method.
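A minimal sketch of the "feature cloud" augmentation as described: each tail-class feature is expanded into a set of probable features whose spread is borrowed from head-class statistics. The Gaussian disturbance and the variable names are assumptions made for illustration, not the paper's exact formulation:

```python
import numpy as np

def augment_tail_features(tail_feats, head_feats, n_samples=10):
    """Expand each tail-class feature into a cloud of probable features,
    using the per-dimension standard deviation observed in a head class
    as the disturbance scale (a simplifying assumption of this sketch)."""
    head_std = head_feats.std(axis=0, keepdims=True)   # intra-class spread of the head class
    noise = np.random.randn(len(tail_feats), n_samples, tail_feats.shape[1]) * head_std
    return tail_feats[:, None, :] + noise              # shape: (n_tail, n_samples, dim)

head = np.random.randn(500, 128)   # abundant head-class features (toy data)
tail = np.random.randn(5, 128)     # scarce tail-class features (toy data)
cloud = augment_tail_features(tail, head)
```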
[recognition, dataset, embedding] [head, feature, center, table, map, liang, instance, propose, effectiveness] [cosface, arcface, face, combined, model, original, datasets, improve] [method, ieee, pattern, version, based, comparison, proposed, formulated, figure] [person, loss, corresponding, transfer, dukemtmc, representation, diversity, discriminative] [tail, class, distribution, angular, deep, learning, data, number, variance, learned, set, margin, training, vanilla, baseline, vector, calculate, cosine, performance, large, sample, network, augmentation, space, compared, sampled, evaluate, epoch, accuracy, observe, reduce] [conference, computer, vision, cloud, full, angle, distance, international, european, approach, well]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Jialun and Sun, Yifan and Han, Chuchu and Dou, Zhaopeng and Li, Wenhui},
  title = {Deep Representation Learning on Long-Tailed Data: A Learnable Embedding Augmentation Perspective},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fantastic Answers and Where to Find Them: Immersive Question-Directed Visual Attention
Ming Jiang, Shi Chen, Jinhui Yang, Qi Zhao


While most visual attention studies focus on bottom-up attention with a restricted field of view, real-life situations are filled with embodied vision tasks. The role of attention is more significant in the latter due to information overload, and attention to the most important regions is critical to the success of tasks. The effects of visual attention on task performance in this context have also been widely ignored. This research addresses a number of challenges to bridge this research gap, on both the data and model aspects. Specifically, we introduce the first dataset of top-down attention in immersive scenes. The Immersive Question-directed Visual Attention (IQVA) dataset features visual attention and corresponding task performance (i.e., answer correctness). It consists of 975 questions and answers collected from people viewing 360° videos in a head-mounted display. Analyses of the data demonstrate a significant correlation between people's task performance and their eye movements, suggesting the role of attention in task performance. With that, a neural network is developed to encode the differences of correct and incorrect attention and jointly predict the two. The proposed attention model for the first time takes into account answer correctness, whose outputs naturally distinguish important regions from distractions. This study with new data and features may enable new tasks that leverage attention and answer correctness, and inspire new research that reveals the process behind decision making in performing various tasks.
[attention, visual, correct, question, dataset, video, answer, people, difficulty, prediction, immersive, fixation, time, predicting, semantics, length, predict, swm, iqva, understanding, answering, natural, temporal, attended, role, viewing, modeling, previous, correctly, order, equator] [saliency, head, table, aggregated, correlation, propose, semantic, tracking, map] [incorrect, model, eye, gaze, study, difference, verify, query] [figure, proposed, existing, correctness, based, indicate, comparison, spatial] [loss, corresponding, introduce, image] [task, data, accuracy, bias, performance, network, count, memory, best] [human, vision, well, spherical, distance, jointly, virtual]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Ming and Chen, Shi and Yang, Jinhui and Zhao, Qi},
  title = {Fantastic Answers and Where to Find Them: Immersive Question-Directed Visual Attention},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
HUMBI: A Large Multiview Dataset of Human Body Expressions
Zhixuan Yu, Jae Shin Yoon, In Kyu Lee, Prashanth Venkatesh, Jaesik Park, Jihun Yu, Hyun Soo Park


This paper presents a new large multiview dataset called HUMBI for human body expressions with natural clothing. The goal of HUMBI is to facilitate modeling view-specific appearance and geometry of gaze, face, hand, body, and garment from assorted people. 107 synchronized HD cameras are used to capture 772 distinctive subjects across gender, ethnicity, age, and physical condition. With the multiview image streams, we reconstruct high fidelity body expressions using 3D mesh models, which allows representing view-specific appearance using their canonical atlas. We demonstrate that HUMBI is highly effective in learning and reconstructing a complete human model and is complementary to the existing datasets of human body expressions with limited views and subjects such as MPII-Gaze, Multi-PIE, Human3.6M, and Panoptic Studio datasets.
[dataset, evaluation, prediction, modeling, natural, represent, people, cmu, social] [map, table, head, benchmark, tracking, including] [humbi, model, gaze, face, datasets, garment, eye, trained, utmv, complementary, facial, testing, clothing, conduct, synchronized, physical, summarized] [figure, captured, existing, range, motion] [appearance, image, alignment, real, atlas, diverse] [variance, network, performance, number, large, learning, evaluate, training, accuracy, set, data] [body, hand, human, pose, mesh, multiview, camera, reconstruction, error, capture, markerless, monocular, geometry, mpii, cloth, single, shape, estimation, stereo, occupancy, distinctive, reconstruct, dense, view, reconstructed, median, canonical, rgb, keypoints, reprojection, system]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Zhixuan and Yoon, Jae Shin and Lee, In Kyu and Venkatesh, Prashanth and Park, Jaesik and Yu, Jihun and Park, Hyun Soo},
  title = {HUMBI: A Large Multiview Dataset of Human Body Expressions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Image Search With Text Feedback by Visiolinguistic Attention Learning
Yanbei Chen, Shaogang Gong, Loris Bazzani


Image search with text feedback has promising impacts in various real-world applications, such as e-commerce and internet search. Given a reference image and text feedback from the user, the goal is to retrieve images that not only resemble the input image, but also change certain aspects in accordance with the given text. This is a challenging task as it requires the synergistic understanding of both image and text. In this work, we tackle this task by a novel Visiolinguistic Attention Learning (VAL) framework. Specifically, we propose a composite transformer that can be seamlessly plugged into a CNN to selectively preserve and transform the visual features conditioned on language semantics. By inserting multiple composite transformers at varying depths, VAL is incentivized to encapsulate the multi-granular visiolinguistic information, thus yielding an expressive representation for effective image search. We conduct comprehensive evaluation on three datasets: Fashion200k, Shoes and FashionIQ. Extensive experiments show our model exceeds existing approaches on all datasets, demonstrating consistent superiority in coping with various forms of text feedback, including attribute-like and natural language descriptions.
[text, visual, language, visiolinguistic, attention, natural, semantics, question, xivl, three, hierarchical, multiple, retrieval, xir, lvs, multimodal, attentional, tirg, transformer, linguistic, stream, fashioniq, selectively, aich] [feature, val, interactive, table, side, cnn, including] [feedback, model, fashion, auxiliary, input, change] [ieee, pattern, reference, figure, transform, spatial, output] [image, composite, representation, learn, user, target, composition, preserve, attribute, encoder, content, preservation, learnt, qualitative, desired, latent] [learning, search, neural, deep, training, processing, task, machine, product, set] [conference, computer, vision, matching, international, relative, european, varying, capture, transformation, jointly, match, essential, supplementary]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Yanbei and Gong, Shaogang and Bazzani, Loris},
  title = {Image Search With Text Feedback by Visiolinguistic Attention Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Image Processing Using Multi-Code GAN Prior
Jinjin Gu, Yujun Shen, Bolei Zhou


Despite the success of Generative Adversarial Networks (GANs) in image synthesis, applying trained GAN models to real image processing remains challenging. Previous methods typically invert a target image back to the latent space either by back-propagation or by learning an additional encoder. However, the reconstructions from both of these methods are far from ideal. In this work, we propose a novel approach, called mGANprior, to incorporate well-trained GANs as an effective prior for a variety of image processing tasks. In particular, we employ multiple latent codes to generate multiple feature maps at some intermediate layer of the generator, then compose them with adaptive channel importance to recover the input image. Such an over-parameterization of the latent space significantly improves the image reconstruction quality, outperforming existing competitors. The resulting high-fidelity image reconstruction enables trained GAN models to serve as a prior for many real-world applications, such as image colorization, super-resolution, image inpainting, and semantic manipulation. We further analyze the properties of the layer-wise representation learned by GAN models and shed light on what knowledge each layer is capable of representing.
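A toy sketch of the multi-code inversion idea: several latent codes produce intermediate feature maps that are composed with adaptive channel importance and optimized to reconstruct a target image. The tiny g_lower/g_upper networks below stand in for the two halves of a pretrained generator and are invented for illustration; the actual method uses a pretrained GAN (e.g., PGGAN) and additional perceptual objectives.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a pretrained generator split at an intermediate layer.
g_lower = nn.Sequential(nn.Linear(64, 8 * 16 * 16), nn.Unflatten(1, (8, 16, 16)))
g_upper = nn.Sequential(nn.Conv2d(8, 3, 3, padding=1), nn.Upsample(size=(32, 32)))

target = torch.rand(1, 3, 32, 32)             # image to invert (toy data)
N, C = 4, 8                                   # number of latent codes, feature channels
z = torch.randn(N, 64, requires_grad=True)    # multiple latent codes
alpha = torch.ones(N, C, requires_grad=True)  # adaptive channel importance

opt = torch.optim.Adam([z, alpha], lr=1e-2)
for _ in range(200):
    feats = g_lower(z)                                       # (N, C, H, W) intermediate features
    composed = (alpha[:, :, None, None] * feats).sum(0, keepdim=True)
    recon = g_upper(composed)
    loss = ((recon - target) ** 2).mean()                    # reconstruction objective (sketch only)
    opt.zero_grad(); loss.backward(); opt.step()
```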
[multiple, compose, semantics, work] [feature, semantic, propose, apply, including, segmentation] [model, adversarial, trained, manipulation, input, face, quality] [prior, method, channel, dip, recover, figure, intermediate, proposed, comparison, adaptive, existing, quantitative, based, result, rcan] [image, latent, gan, inversion, generative, code, target, pggan, inpainting, colorization, real, gans, bedroom, encoder, generator, composition, mganprior, corresponding, bolei, invert, church, grayscale, qualitative, phillip, representation, generation, style] [layer, processing, space, deep, task, optimizing, knowledge, higher, better, learned, learning, number, optimization, training, data, inverted] [single, reconstruction, approach, well, reconstructed, compare, david, variety, error, reconstruct]
@InProceedings{Gu_2020_CVPR,
  author = {Gu, Jinjin and Shen, Yujun and Zhou, Bolei},
  title = {Image Processing Using Multi-Code GAN Prior},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
What Does Plate Glass Reveal About Camera Calibration?
Qian Zheng, Jinnan Chen, Zhan Lu, Boxin Shi, Xudong Jiang, Kim-Hui Yap, Ling-Yu Duan, Alex C. Kot


This paper aims to calibrate the orientation of glass and the field of view of the camera from a single reflection-contaminated image. We show how a reflective amplitude coefficient map can be used as a calibration cue. Different from existing methods, the proposed solution is free from image contents. To reduce the impact of a noisy calibration cue estimated from a reflection-contaminated image, we propose two strategies: an optimization-based method that imposes only a reliable subset of the map's entries, and a learning-based method that fully exploits all entries. We collect a dataset containing 320 samples together with their camera parameters for evaluation. We demonstrate that our method not only facilitates a general single image camera calibration method that leverages image contents but also contributes to improving the performance of single image reflection removal. Furthermore, we show that our byproduct output helps alleviate the ill-posed problem of estimating the panorama from a single image.
[considering, cue, context, represent, dataset] [map, table, horizontal, propose] [radial, distortion] [reflection, method, pattern, glass, figure, separation, removal, amplitude, based, concentric, coefficient, transmission, formation, ontrolled, internatoinal, ieee, ild, analysis, psnr, boxin, reflectioncontaminated, imaged, ssim, proposed, byproduct] [image, loss, free] [performance, problem, network, data, deep, reliable, machine, paper, function, equation, set, better, alex, neural] [calibration, camera, single, computer, vision, conference, panorama, reflective, estimated, geometric, estimate, orientation, fov, estimation, plate, extrinsic, solve, ground, indoor, view, well, scene, intrinsic, rotation, coordinate, plane, focal, international]
@InProceedings{Zheng_2020_CVPR,
  author = {Zheng, Qian and Chen, Jinnan and Lu, Zhan and Shi, Boxin and Jiang, Xudong and Yap, Kim-Hui and Duan, Ling-Yu and Kot, Alex C.},
  title = {What Does Plate Glass Reveal About Camera Calibration?},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Zero-Assignment Constraint for Graph Matching With Outliers
Fudong Wang, Nan Xue, Jin-Gang Yu, Gui-Song Xia


Graph matching (GM), as a longstanding problem in computer vision and pattern recognition, still suffers from numerous cluttered outliers in practical applications. To address this issue, we present the zero-assignment constraint (ZAC) for approaching the graph matching problem in the presence of outliers. The underlying idea is to suppress the matchings of outliers by assigning zero-valued vectors to the potential outliers in the obtained optimal correspondence matrix. We provide an elaborate theoretical analysis of the problem, i.e., GM with ZAC, and show that the GM problems with and without outliers are intrinsically different, which enables us to put forward a sufficient condition for constructing a valid and reasonable objective function. Consequently, we design an efficient outlier-robust algorithm to significantly reduce the incorrect or redundant matchings caused by numerous outliers. Extensive experiments demonstrate that our method achieves state-of-the-art performance in terms of accuracy and efficiency, especially in the presence of numerous outliers.
[graph, time, node, complicated, pair, aij, dataset] [edge, recall, assignment, affinity] [numerous, identification, condition] [ieee, method, figure, based, mpm, proposed, removal, pattern, quantitative] [reasonable, foundation] [objective, zac, algorithm, function, average, zacr, matrix, problem, optimization, precision, achieve, accuracy, matchings, theoretical, pia, set, fgmd, bpfg, frgm, find, rrwm, optimal, efficient, consumption, ideal, linear, complexity, better, proposition, note, presence, empirical, eij, pairwise, applied, potential, reduce, performance, distinguishability, minimum, space] [matching, inliers, constraint, outlier, varying, point, correspondence, approach, sufficient, rigid, shape, match, defined, solved, cluttered, geometric]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Fudong and Xue, Nan and Yu, Jin-Gang and Xia, Gui-Song},
  title = {Zero-Assignment Constraint for Graph Matching With Outliers},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cascaded Deep Video Deblurring Using Temporal Sharpness Prior
Jinshan Pan, Haoran Bai, Jinhui Tang


We present a simple and effective deep convolutional neural network (CNN) model for video deblurring. The proposed algorithm mainly consists of optical flow estimation from intermediate latent frames and latent frame restoration steps. It first develops a deep CNN model to estimate optical flow from intermediate latent frames and then restores the latent frames based on the estimated optical flow. To better exploit the temporal information in videos, we develop a temporal sharpness prior to constrain the deep CNN model and help the latent frame restoration. We develop an effective cascaded training approach and jointly train the proposed CNN model in an end-to-end manner. We show that exploiting the domain knowledge of video deblurring makes the deep CNN model more compact and efficient. Extensive experimental results show that the proposed algorithm performs favorably against state-of-the-art methods on the benchmark datasets as well as real-world videos.
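One plausible instantiation of a temporal sharpness prior, following the intuition above (a pixel is likely sharp if neighboring frames, warped to the current frame by optical flow, agree with it); the Gaussian form and the sigma parameter are assumptions of this sketch, not necessarily the paper's exact definition:

```python
import numpy as np

def temporal_sharpness(center, warped_neighbors, sigma=1.0):
    """Per-pixel sharpness confidence: agreement between the current frame
    and its flow-warped neighbors is mapped to a value in (0, 1]."""
    err = sum((w - center) ** 2 for w in warped_neighbors)
    return np.exp(-err / (2 * sigma ** 2))

frame_t = np.random.rand(64, 64)                            # current latent frame (toy)
warped = [np.random.rand(64, 64), np.random.rand(64, 64)]   # I_{t-1->t}, I_{t+1->t} warped by flow (toy)
prior = temporal_sharpness(frame_t, warped)
```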
[video, frame, temporal, dataset, explore] [cnn, table, denotes, stage, module, main, effectiveness, benchmark] [model, effective, input, help, success, improve] [optical, flow, proposed, sharpness, blur, prior, method, deblurring, blurred, restoration, motion, cascaded, figure, deblurred, develop, kim, sharp, intermediate, based, consecutive, edvr, restore, clearer, adjacent, convolutional, stfan, remove, dynamic, pixel, psnr, psnrs, ssims, lee, performs, favorably, conventional, cnns] [latent, image, generate, variational, alignment, generates, real, train, domain] [deep, algorithm, network, training, better, note, neural, learning, compact, evaluate, performance, accuracy, test, simple, knowledge, denote, size, set] [estimation, estimate, estimated, well, directly]
@InProceedings{Pan_2020_CVPR,
  author = {Pan, Jinshan and Bai, Haoran and Tang, Jinhui},
  title = {Cascaded Deep Video Deblurring Using Temporal Sharpness Prior},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
JL-DCF: Joint Learning and Densely-Cooperative Fusion Framework for RGB-D Salient Object Detection
Keren Fu, Deng-Ping Fan, Ge-Peng Ji, Qijun Zhao


This paper proposes a novel joint learning and densely-cooperative fusion (JL-DCF) architecture for RGB-D salient object detection. Existing models usually treat RGB and depth as independent information and design separate networks for feature extraction from each. Such schemes can easily be constrained by a limited amount of training data or over-reliance on an elaborately-designed training process. In contrast, our JL-DCF learns from both RGB and depth inputs through a Siamese network. To this end, we propose two effective components: joint learning (JL), and densely-cooperative fusion (DCF). The JL module provides robust saliency feature learning, while the latter is introduced for complementary feature discovery. Comprehensive experiments on four popular metrics show that the designed framework yields a robust RGB-D saliency detector with good generalization. As a result, JL-DCF significantly advances the top-1 D3Net model by an average of 1.9% (S-measure) across six challenging datasets, showing that the proposed framework offers a potential solution for real-world applications and could provide more insight into the cross-modality complementarity task. The code will be available at https://github.com/kerenfu/JLDCF/.
[attention, visual, video, hierarchical, prediction, concatenation] [object, salient, saliency, feature, detection, module, map, jianbing, side, backbone, wenguan, framework, cnn, final, ali, ablation, table, cpfp, segmentation, sod, fed, dcf, effectiveness, stere, huchuan, sota, subsequent] [model, input, sip, effective] [fusion, ieee, convolutional, proposed, output, based, color, conv, contrast, channel, extraction, figure, spatial] [image, loss, separate, shared, component, inception, independent] [learning, deep, network, max, batch, architecture, training, size, performance, note, strategy, scheme, set, data, number, neural] [depth, rgb, joint, coarse, ground, single]
@InProceedings{Fu_2020_CVPR,
  author = {Fu, Keren and Fan, Deng-Ping and Ji, Ge-Peng and Zhao, Qijun},
  title = {JL-DCF: Joint Learning and Densely-Cooperative Fusion Framework for RGB-D Salient Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
From Fidelity to Perceptual Quality: A Semi-Supervised Approach for Low-Light Image Enhancement
Wenhan Yang, Shiqi Wang, Yuming Fang, Yue Wang, Jiaying Liu


Under-exposure introduces a series of visual degradations, e.g. decreased visibility, intensive noise, and biased color. To address these problems, we propose a novel semi-supervised learning approach for low-light image enhancement. A deep recursive band network (DRBN) is proposed to recover a linear band representation of an enhanced normal-light image with paired low/normal-light images, and then obtain an improved one by recomposing the given bands via another learnable linear transformation based on a perceptual quality-driven adversarial learning with unpaired data. The architecture is powerful and flexible, with the merit of being trainable on both paired and unpaired data. On one hand, the proposed network is well designed to extract a series of coarse-to-fine band representations, whose estimations are mutually beneficial in a recursive process. On the other hand, the extracted band representation of the enhanced image in the first stage of DRBN (recursive band learning) bridges the gap between the restoration knowledge of paired data and the perceptual quality preference for real high-quality images. Its second stage (band recomposition) learns to recompose the band representation towards fitting the perceptual properties of high-quality images via adversarial learning. With the help of this two-stage design, our approach generates enhanced results with well-reconstructed details and visually promising contrast and color distributions. Qualitative and quantitative evaluations demonstrate the superiority of our DRBN.
[visual, previous, dataset, order] [stage, global, equalization, framework, fully] [quality, noise, adversarial, series, trained, input, model] [band, enhancement, recursive, ieee, contrast, perceptual, illumination, enhanced, based, signal, retinex, color, method, recurrence, drbn, guidance, histogram, intensive, proposed, figure, ssim, low, enlightengan, captured, light, restore, detail, prior, result, pattern, recover, designed, recompose, residual, denoising, fusion] [image, paired, representation, unpaired, real, extracted, structural, learn, gap, loss, discriminator, perform] [learning, deep, network, learned, weighting, set, training, knowledge, data, better, linear, process, good, power] [well, human, vision, computer, recomposition, second, approach, joint]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Wenhan and Wang, Shiqi and Fang, Yuming and Wang, Yue and Liu, Jiaying},
  title = {From Fidelity to Perceptual Quality: A Semi-Supervised Approach for Low-Light Image Enhancement},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Adaptation Learning for Hyperspectral Imagery Super-Resolution
Lei Zhang, Jiangtao Nie, Wei Wei, Yanning Zhang, Shengcai Liao, Ling Shao


The key for fusion-based hyperspectral image (HSI) super-resolution (SR) is to infer the posterior of a latent HSI using an appropriate image prior and a likelihood that depends on the degeneration. However, in practice the priors of high-dimensional HSIs can be extremely complicated and the degeneration is often unknown. Consequently, most existing approaches, which assume a shallow hand-crafted image prior and a pre-defined degeneration, fail to generalize well in real applications. To tackle this problem, we present an unsupervised adaptation learning (UAL) framework. Instead of directly modelling the complicated image prior, we propose to first implicitly learn a general image prior using deep networks and then adapt it to a specific HSI. Following this idea, we develop a two-stage SR network that leverages two consecutive modules: a fusion module and an adaptation module, to recover the latent HSI in a coarse-to-fine scheme. The fusion module is pretrained in a supervised manner on synthetic data to capture a spatial-spectral prior that is general across most HSIs. To adapt the learned general prior to the specific HSI under unknown degeneration, we introduce a simple degeneration network to assist learning both the adaptation module and the degeneration in an unsupervised way. In this way, the resulting image-specific prior and the estimated degeneration can benefit the inference of a more accurate posterior, thereby increasing generalization capacity. To verify the efficacy of UAL, we extensively evaluate it on four benchmark datasets and report strong results that surpass existing approaches.
[three, observed, dataset, concatenation, visual] [module, table, yong, denotes, propose, wei, shallow, employ, including] [noise, datasets, yanning, generalization] [ual, degeneration, fusion, hsi, prior, figure, proposed, hyperspectral, block, ieee, spectral, nssr, dip, scale, kernel, psnr, hsis, cave, spatial, sam, ssim, pattern, based, existing, ntire, gaussian, convolutional, msi, recover] [image, adaptation, unsupervised, latent, unknown, real, learn, mapping, specific, utilize, produce, produced, consists, generalize, introduce] [network, deep, learning, performance, training, general, test, size, denote, learned, simple] [computer, conference, rmse, vision, sparse, demonstrate, structure, estimation, ground, truth, fxy, international, reconstruction]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Lei and Nie, Jiangtao and Wei, Wei and Zhang, Yanning and Liao, Shengcai and Shao, Ling},
  title = {Unsupervised Adaptation Learning for Hyperspectral Imagery Super-Resolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Central Similarity Quantization for Efficient Image and Video Retrieval
Li Yuan, Tao Wang, Xiaopeng Zhang, Francis EH Tay, Zequn Jie, Wei Liu, Jiashi Feng


Existing data-dependent hashing methods usually learn hash functions from pairwise or triplet data relationships, which only capture the data similarity locally, and often suffer from low learning efficiency and low collision rates. In this work, we propose a new global similarity metric, termed central similarity, with which the hash codes of similar data pairs are encouraged to approach a common center and those of dissimilar pairs to converge to different centers, improving hash learning efficiency and retrieval accuracy. We principally formulate the computation of the proposed central similarity metric by introducing a new concept, the hash center, which refers to a set of data points scattered in the Hamming space with sufficient mutual distance between each other. We then provide an efficient method to construct well-separated hash centers by leveraging the Hadamard matrix and Bernoulli distributions. Finally, we propose Central Similarity Quantization (CSQ), which optimizes the central similarity between data points w.r.t. their hash centers instead of optimizing the local similarity. CSQ is generic and applicable to both image and video hashing scenarios. Extensive experiments on large-scale image and video retrieval tasks demonstrate that CSQ can generate cohesive hash codes for similar data pairs and dispersed hash codes for dissimilar pairs, achieving a noticeable boost in retrieval performance, i.e. 3%-20% in mAP over the previous state of the art.
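A minimal sketch of constructing well-separated hash centers from a Hadamard matrix, as the abstract outlines; when more centers are needed than twice the code length, the paper falls back on Bernoulli sampling (not shown here). The helper name is ours:

```python
import numpy as np
from scipy.linalg import hadamard

def make_hash_centers(n_classes: int, n_bits: int) -> np.ndarray:
    """Build binary hash centers from the rows of a Hadamard matrix H and
    its negation -H, which are mutually distant in Hamming space.
    n_bits must be a power of two for scipy's hadamard()."""
    H = hadamard(n_bits)                       # entries in {+1, -1}
    centers = np.vstack([H, -H])[:n_classes]   # up to 2 * n_bits well-separated centers
    return (centers > 0).astype(np.uint8)      # map to {0, 1} hash codes

centers = make_hash_centers(n_classes=100, n_bits=64)   # e.g., 64-bit codes for 100 classes
```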
[video, retrieval, three, temporal, recognition, evaluation, construct] [center, semantic, map, coco, propose, adopt, global, feature, table, achieves] [central, datasets] [ieee, pattern, proposed, method, based, figure, convolutional] [image, generate, learn, loss, generated, corresponding, supervised, code, common] [hash, data, similarity, hamming, hashing, learning, csq, deep, imagenet, performance, pairwise, space, hadamard, function, learned, quantization, set, training, number, compared, binary, hashnet, dissimilar, matrix, dhn, log, triplet, neural, precision, sample, efficient, metric, better, find, dch, average, calculate, processing] [distance, conference, computer, approach, vision, directly, international]
@InProceedings{Yuan_2020_CVPR,
  author = {Yuan, Li and Wang, Tao and Zhang, Xiaopeng and Tay, Francis EH and Jie, Zequn and Liu, Wei and Feng, Jiashi},
  title = {Central Similarity Quantization for Efficient Image and Video Retrieval},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ARCH: Animatable Reconstruction of Clothed Humans
Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, Tony Tung


In this paper, we propose ARCH (Animatable Reconstruction of Clothed Humans), a novel end-to-end framework for accurate reconstruction of animation-ready 3D clothed humans from a monocular image. Existing approaches to digitize 3D humans struggle to handle pose variations and recover details. Also, they do not produce models that are animation ready. In contrast, ARCH is a learned pose-aware model that produces detailed 3D rigged full-body human avatars from a single unconstrained RGB image. A Semantic Space and a Semantic Deformation Field are created using a parametric 3D body estimator. They allow the transformation of 2D/3D clothed humans into a canonical space, reducing ambiguities in geometry caused by pose variations and occlusions in training data. Detailed surface geometry and appearance are learned using an implicit function representation with spatial local features. Furthermore, we propose additional per-pixel supervision on the 3D reconstruction using opacity-aware differentiable rendering. Our experiments indicate that ARCH increases the fidelity of the reconstructed humans. We obtain more than 50% lower reconstruction errors for standard metrics compared to state-of-the-art methods on public datasets. We also show numerous qualitative examples of animated, high-quality reconstructed avatars unseen in the literature so far.
[dataset, three, prediction, granular, people, represent] [semantic, feature, predicted, propose, framework] [model, input, clothing, rigged, christian] [color, ieee, pattern, field, spatial, based, figure, proposed] [image, arbitrary, loss, representation, factor] [function, space, training, learning, linear, neural, sampled] [conference, human, body, computer, reconstruction, normal, surface, vision, canonical, pose, occupancy, differentiable, shape, point, clothed, implicit, single, ground, arch, mesh, skinning, pifu, truth, rendering, deformation, reconstructed, gerard, international, avatar, estimation, directly, reconstruct, distance, detailed, mlp, scan, depth, michael, european, hao, render, capture, fpo, geometry, defined, rbf, acm, yuanlu]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Zeng and Xu, Yuanlu and Lassner, Christoph and Li, Hao and Tung, Tony},
  title = {ARCH: Animatable Reconstruction of Clothed Humans},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Model-Driven Deep Neural Network for Single Image Rain Removal
Hong Wang, Qi Xie, Qian Zhao, Deyu Meng


Deep learning (DL) methods have achieved state-of-the-art performance in the task of single image rain removal. However, most current DL architectures still lack sufficient interpretability and are not fully integrated with the physical structures underlying general rain streaks. To address this issue, in this paper we propose a model-driven deep neural network for the task, with fully interpretable network structures. Specifically, based on the convolutional dictionary learning mechanism for representing rain, we propose a novel single image deraining model and utilize the proximal gradient descent technique to design an iterative algorithm containing only simple operators for solving the model. Such a simple implementation scheme facilitates unfolding it into a new deep network architecture, called the rain convolutional dictionary network (RCDNet), with almost every network module corresponding one-to-one to an operation involved in the algorithm. By training the proposed RCDNet end-to-end, all the rain kernels and proximal operators can be automatically extracted, faithfully characterizing the features of both the rain and clean background layers, and thus naturally leading to better deraining performance, especially in real scenarios. Comprehensive experiments substantiate the superiority of the proposed network, especially its strong generality to diverse testing scenarios and the good interpretability of all its modules, compared with state-of-the-art methods both visually and quantitatively.
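To make the unfolding idea concrete, below is one generic proximal-gradient update for a single-kernel convolutional dictionary model of rain, O ≈ B + K * M (with * denoting convolution), using soft-thresholding as the proximal operator. This is an illustrative simplification, not the exact RCDNet derivation, which learns the kernels and proximal operators across multiple unfolded stages:

```python
import numpy as np
from scipy.signal import fftconvolve

def soft_threshold(x, tau):
    """Proximal operator of the L1 norm (promotes sparsity of the rain map)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_grad_step(O, B, M, K, step=0.1, tau=0.01):
    """One proximal-gradient update of the rain map M for O ≈ B + K * M
    (single-kernel sketch; an unfolded network would turn such updates into stages)."""
    residual = B + fftconvolve(M, K, mode="same") - O          # data-fit residual
    grad = fftconvolve(residual, K[::-1, ::-1], mode="same")   # gradient w.r.t. M (flipped kernel = correlation)
    return soft_threshold(M - step * grad, tau)

O = np.random.rand(64, 64)       # rainy observation (toy)
B = np.random.rand(64, 64)       # current background estimate (toy)
M = np.zeros((64, 64))           # rain map, initialized to zero
K = np.random.rand(9, 9) * 0.1   # rain kernel (toy)
M = prox_grad_step(O, B, M, K)
```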
[video, current, visual, recurrent] [background, stage, detection, table, module] [model, input, iterative, interpretability, comprehensive, testing] [rain, ieee, removal, rainy, convolutional, deraining, rcdnet, prior, pattern, based, proposed, psnr, proximal, analysis, ssim, deyu, gmm, removing, figure, rcd, method, facilitates, repetitive, unfolding, clear, streak, prenet, spanet, tensor] [image, corresponding, real, extracted, competing, diverse] [network, deep, learning, layer, algorithm, performance, better, training, dictionary, updating, general, learned, design, knowledge, optimization, architecture, number, problem, set, simple] [single, computer, conference, vision, international, sparse, solving, joint, local, qian, well, representing, intrinsic]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Hong and Xie, Qi and Zhao, Qian and Meng, Deyu},
  title = {A Model-Driven Deep Neural Network for Single Image Rain Removal},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Novel Object Viewpoint Estimation Through Reconstruction Alignment
Mohamed El Banani, Jason J. Corso, David F. Fouhey


The goal of this paper is to estimate the viewpoint for a novel object. Standard viewpoint estimation approaches generally fail on this task due to their reliance on a 3D model for alignment or large amounts of class-specific training data and their corresponding canonical pose. We overcome those limitations by learning a reconstruct and align approach. Our key insight is that although we do not have an explicit 3D model or a predefined canonical pose, we can still learn to estimate the object's shape in the viewer's frame and then use an image to provide our reference model or canonical pose. In particular, we propose learning two networks: the first maps images to a 3D geometry-aware feature bottleneck and is trained via an image-to-image translation loss; the second learns whether two instances of features are aligned. At test time, our model finds the relative transformation that best aligns the bottleneck features of our test image to a reference image. We evaluate our method on novel object viewpoint estimation by generalizing across different datasets, analyzing the impact of our different modules, and providing a qualitative analysis of the learned features to identify what representation is being learnt for alignment.
[predict, prediction, frame, work, goal, dataset, provide, previous, represent] [object, feature, mask, iou, table] [model, trained, generalization, input, testing, datasets, success] [reference, figure, convolutional, proposed, removing] [learn, image, train, alignment, representation, generalize, ability, unseen, third, learns, discriminator, align, perform] [network, learning, training, find, evaluate, performance, space, test, better, bottleneck, respect, data, learned, class, large, problem, deep, standard, best] [viewpoint, pose, shape, novel, reconstruction, relative, estimation, depth, approach, shapenet, view, canonical, well, estimate, grid, second, voxel, transformation, mederr, despite, occupancy, jitendra, computer, single, directly, coordinate, rotation, carving, reconstruct, generalizing]
@InProceedings{Banani_2020_CVPR,
  author = {Banani, Mohamed El and Corso, Jason J. and Fouhey, David F.},
  title = {Novel Object Viewpoint Estimation Through Reconstruction Alignment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Creating Something From Nothing: Unsupervised Knowledge Distillation for Cross-Modal Hashing
Hengtong Hu, Lingxi Xie, Richang Hong, Qi Tian


In recent years, cross-modal hashing (CMH) has attracted increasing attention, mainly because of its ability to map contents from different modalities, especially vision and language, into the same space, which makes cross-modal data retrieval efficient. There are two main frameworks for CMH, differing from each other in whether semantic supervision is required. Compared to the unsupervised methods, the supervised methods often enjoy more accurate results, but require much heavier labor in data annotation. In this paper, we propose a novel approach that enables guiding a supervised method using outputs produced by an unsupervised method. Specifically, we make use of teacher-student optimization for propagating knowledge. Experiments are performed on two popular CMH benchmarks, i.e., the MIRFlickr and NUS-WIDE datasets. Our approach outperforms all existing unsupervised methods by a large margin.
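For context, the retrieval operation that all CMH methods share, ranking one modality's binary codes against the other's by Hamming distance, can be sketched as follows; the sign-based binarization and function names are illustrative assumptions, not the paper's specific model.

```python
import numpy as np

def binarize(features):
    """Map real-valued codes to {-1, +1} hash codes by sign (a common CMH convention)."""
    return np.where(features >= 0, 1, -1).astype(np.int8)

def hamming_rank(query_codes, db_codes):
    """Rank database items by Hamming distance to each query.

    With codes in {-1, +1}, Hamming distance = (bits - dot(q, d)) / 2,
    so ranking by distance is equivalent to ranking by negative inner product.
    """
    sims = query_codes @ db_codes.T            # [num_queries, num_db]
    return np.argsort(-sims, axis=1)           # nearest (smallest distance) first

# Toy usage: 64-bit codes, image queries against a text database (or vice versa).
rng = np.random.default_rng(0)
img_codes = binarize(rng.standard_normal((5, 64)))
txt_codes = binarize(rng.standard_normal((100, 64)))
ranking = hamming_rank(img_codes, txt_codes)   # per-query indices of retrieved texts
```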
[text, relevant, retrieval, modality, work, dataset, artificial, outperforms] [feature, table, semantic, supervision, map, wei, instance, framework, correlation] [model, query, adversarial, improve, trained, original] [ieee, pattern, method, figure, proposed, output, advantage] [unsupervised, image, supervised, ugach, mirflickr, uch, idea, generative, cmh, dcmh, ukd, common, extracted, mapping, paired, fii] [hashing, learning, knowledge, teacher, deep, training, data, similarity, student, number, distillation, selected, ssah, network, task, accuracy, precision, compared, hash, set, neural, gain, arxiv, preprint, performance, better, large, consider, hamming, matrix, labeled] [conference, approach, computer, vision, international, accurate, acm, additional]
@InProceedings{Hu_2020_CVPR,
  author = {Hu, Hengtong and Xie, Lingxi and Hong, Richang and Tian, Qi},
  title = {Creating Something From Nothing: Unsupervised Knowledge Distillation for Cross-Modal Hashing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Evaluating Weakly Supervised Object Localization Methods Right
Junsuk Choe, Seong Joon Oh, Seungho Lee, Sanghyuk Chun, Zeynep Akata, Hyunjung Shim


Weakly-supervised object localization (WSOL) has gained popularity over recent years for its promise to train localization models with only image-level labels. Since the seminal WSOL work of class activation mapping (CAM), the field has focused on how to expand the attention regions to cover objects more broadly and localize them better. However, these strategies rely on full localization supervision to validate hyperparameters and for model selection, which is in principle prohibited under the WSOL setup. In this paper, we argue that the WSOL task is ill-posed with only image-level labels, and propose a new evaluation protocol where full supervision is limited to only a small held-out set not overlapping with the test set. We observe that, under our protocol, the five most recent WSOL methods have not made a major improvement over the CAM baseline. Moreover, we report that existing WSOL methods have not reached the few-shot learning baseline, where the full supervision at validation time is used for model training instead. Based on our findings, we discuss some future directions for WSOL.
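A simplified, hypothetical sketch of the box-accuracy style of evaluation discussed here: threshold the CAM, take the tight box around above-threshold pixels, count it correct when the IoU with the ground-truth box reaches 0.5, and maximize over thresholds. The paper's MaxBoxAcc additionally reasons about connected components; that simplification and the names below are assumptions.

```python
import numpy as np

def box_from_cam(cam, tau):
    """Tight bounding box (x1, y1, x2, y2) around CAM pixels with activation >= tau."""
    ys, xs = np.where(cam >= tau)
    if len(xs) == 0:
        return None
    return xs.min(), ys.min(), xs.max(), ys.max()

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1 + 1) * max(0, iy2 - iy1 + 1)
    area = lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1)
    return inter / float(area(a) + area(b) - inter)

def max_box_acc(cams, gt_boxes, thresholds=np.linspace(0.05, 0.95, 19), delta=0.5):
    """Best localization accuracy over CAM thresholds (simplified MaxBoxAcc-style score)."""
    accs = []
    for tau in thresholds:
        hits = 0
        for cam, gt in zip(cams, gt_boxes):
            box = box_from_cam(cam, tau)
            hits += box is not None and iou(box, gt) >= delta
        accs.append(hits / len(cams))
    return max(accs)
```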
[evaluation, work, future, three, visual] [wsol, cam, object, localization, supervision, threshold, score, box, mask, table, acol, semantic, bounding, map, erasing, foreground, maxboxacc, adl, weakly, propose, segmentation, background, fully, pxap, iou, spg, center, instance] [model, drop, protocol, input] [figure, convolutional, pixel, based] [image, cub, supervised, cutmix, zeynep, discriminative] [learning, performance, hyperparameter, training, hyperparameters, class, set, imagenet, fsl, search, task, appendix, baseline, deep, vanilla, data, openimages, test, accuracy, labeled, classification, posterior, best, binary, random, better, network, amount, fixed, label, sij, measure] [full, operating, define]
@InProceedings{Choe_2020_CVPR,
  author = {Choe, Junsuk and Oh, Seong Joon and Lee, Seungho and Chun, Sanghyuk and Akata, Zeynep and Shim, Hyunjung},
  title = {Evaluating Weakly Supervised Object Localization Methods Right},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Style Normalization and Restitution for Generalizable Person Re-Identification
Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen, Li Zhang


Existing fully-supervised person re-identification (ReID) methods usually suffer from poor generalization capability caused by domain gaps. The key to solving this problem lies in filtering out identity-irrelevant interference and learning domain-invariant person representations. In this paper, we aim to design a generalizable person ReID framework which trains a model on source domains yet is able to generalize/perform well on target domains. To achieve this goal, we propose a simple yet effective Style Normalization and Restitution (SNR) module. Specifically, we filter out style variations (e.g., illumination, color contrast) by Instance Normalization (IN). However, such a process inevitably removes discriminative information. We propose to distill identity-relevant feature from the removed information and restitute it to the network to ensure high discrimination. For better disentanglement, we enforce a dual causal loss constraint in SNR to encourage the separation of identity-relevant features and identity-irrelevant features. Extensive experiments demonstrate the strong generalization capability of our framework. Our models empowered by the SNR modules significantly outperform the state-of-the-art domain generalization approaches on multiple widely-used person ReID benchmarks, and also show superiority on unsupervised domain adaptation.
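A minimal sketch of the core idea as described: Instance Normalization filters out style, and a learned gate restitutes the identity-relevant part of what was removed. The SE-style gating network and its sizes below are illustrative assumptions rather than the authors' exact SNR module.

```python
import torch
import torch.nn as nn

class StyleNormalizationRestitution(nn.Module):
    """IN removes style; a channel gate adds back identity-relevant residual (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.instance_norm = nn.InstanceNorm2d(channels, affine=True)
        # SE-style gate deciding how much of the removed residual to restitute (assumption).
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid())

    def forward(self, x):
        normalized = self.instance_norm(x)   # style (illumination, contrast) filtered out
        residual = x - normalized            # removed information, style and identity mixed
        relevant = self.gate(residual) * residual
        return normalized + relevant         # restitute the identity-relevant part

# feat = torch.randn(8, 256, 24, 12); out = StyleNormalizationRestitution(256)(feat)
```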
[outperforms, dataset, work] [feature, map, module, instance, effectiveness, propose, framework, liang, table, denotes] [generalization, model, strong, datasets, adding, study, subsection, input, discarded] [dual, figure, proposed, capability, method, convolutional, high, residual, adaptive, conv, block] [person, snr, domain, reid, style, loss, duke, generalizable, source, target, unsupervised, restitution, causality, discriminative, adaptation, unseen, discrepancy, discrimination, lsn, inevitably, disentangle, image, transfer, disentanglement, ability, shaogang, restitute, uda] [baseline, normalization, learning, scheme, normalized, network, design, performance, denote, deep, data, training, activation, sample, arxiv, preprint, better, label, practical] [constraint, demonstrate, well]
@InProceedings{Jin_2020_CVPR,
  author = {Jin, Xin and Lan, Cuiling and Zeng, Wenjun and Chen, Zhibo and Zhang, Li},
  title = {Style Normalization and Restitution for Generalizable Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Reconstruct Locally, Localize Globally: A Model Free Method for Object Pose Estimation
Ming Cai, Ian Reid


Six degree-of-freedom pose estimation of a known object in a single image is a long-standing computer vision objective. It is classically posed as a correspondence problem between a known geometric model, such as a CAD model, and image locations. If a CAD model is not available, it is possible to use multi-view visual reconstruction methods to create a geometric model, and use this in the same manner. Instead, we propose a learning-based method whose input is a collection of images of a target object, and whose output is the pose of the object in a novel view. At inference time, our method maps from the RoI features of the input image to a dense collection of object-centric 3D coordinates, one per pixel. This dense 2D-3D mapping is then used to determine 6dof pose using standard PnP plus RANSAC. The model that maps 2D to object 3D coordinates is established at training time by automatically discovering and matching image landmarks that are consistent across multiple views. We show that this method eliminates the requirement for a 3D CAD model (needed by classical geometry-based methods and state-of-the-art learning-based methods alike) but still achieves performance on a par with the prior art.
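The final step the abstract describes, recovering 6-DoF pose from the dense per-pixel 2D-3D map with PnP plus RANSAC, can be sketched with OpenCV as below; how correspondences are sampled from the RoI and the camera intrinsics are illustrative assumptions.

```python
import numpy as np
import cv2

def pose_from_dense_map(coords_3d, mask, K, min_points=6):
    """Recover object pose from a per-pixel map of object-centric 3D coordinates.

    coords_3d: (H, W, 3) predicted 3D coordinate for each pixel inside the RoI.
    mask:      (H, W) boolean foreground mask selecting valid correspondences.
    K:         (3, 3) camera intrinsics.
    Returns (R, t), a 3x3 rotation and a translation vector, or None on failure.
    """
    ys, xs = np.where(mask)
    if len(xs) < min_points:
        return None
    image_pts = np.stack([xs, ys], axis=1).astype(np.float64)   # 2D pixel locations
    object_pts = coords_3d[ys, xs].astype(np.float64)           # matching 3D coordinates
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_pts, image_pts, K, distCoeffs=None,
        reprojectionError=3.0, iterationsCount=200)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```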
[recognition, build, visual, multiple, dataset] [object, head, branch, feature, detection, matched, predicted, box, roi, supervision, table, propose, mask, bounding, occlusion] [model, landmark, trained, input] [method, ieee, output, based, pattern, figure] [image, loss, target, synthetic, source, train] [learning, network, training, test, inference, performance, set, problem] [pose, coordinate, computer, conference, estimation, international, vision, reconstruction, geometric, camera, point, projection, cad, reprojection, rgb, geometry, dense, accurate, deformation, structure, matching, ground, truth, linemod, error, term, vincent, estimate, single, direct, solution, rotation, european, reconstruct, transformation]
@InProceedings{Cai_2020_CVPR,
  author = {Cai, Ming and Reid, Ian},
  title = {Reconstruct Locally, Localize Globally: A Model Free Method for Object Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
RoboTHOR: An Open Simulation-to-Real Embodied AI Platform
Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, Luca Weihs, Mark Yatskar, Ali Farhadi


Visual recognition ecosystems (e.g. ImageNet, Pascal, COCO) have undeniably played a prevailing role in the evolution of modern computer vision. We argue that interactive and embodied visual AI has reached a stage of development similar to visual recognition prior to the advent of these ecosystems. Recently, various synthetic environments have been introduced to facilitate research in embodied AI. Notwithstanding this progress, the crucial question of how well models trained in simulation generalize to reality has remained largely unanswered. The creation of a comparable ecosystem for simulation-to-real embodied AI presents many challenges: (1) the inherently interactive nature of the problem, (2) the need for tight alignments between real and simulated worlds, (3) the difficulty of replicating physical conditions for repeatable experiments, (4) and the associated cost. In this paper, we introduce RoboTHOR to democratize research in interactive and embodied visual AI. RoboTHOR offers a framework of simulated environments paired with physical counterparts to systematically explore and overcome the challenges of simulation-to-real transfer, and a platform where researchers across the globe can remotely test their embodied models in the physical world. As a first benchmark, our experiments show there exists a significant gap between the performance of models trained in simulation when they are tested in both simulations and their carefully constructed physical analogs. We hope that RoboTHOR will spur the next stage of evolution in embodied computer vision.
[navigation, embodied, robothor, visual, agent, length, shortest, spl, starting, action, question, progress, reinforcement, time, step, policy, navigate, roozbeh, language, abhinav, explore, natural] [object, semantic, location, table, framework, easy, detection, benchmark, autonomous, ali, interactive] [physical, trained, model, success, noise, adversarial, medium] [figure, designed, spatial] [real, image, target, domain, transfer, learn, appearance, train, adaptation, rotate, control, synthetic, generalize] [learning, path, training, deep, performance, task, episode, open, platform, distribution, test, space, set, baseline, large] [simulation, robot, simulated, scene, well, furniture, computer, wall, camera, human, vision, single, matthew, robotic]
@InProceedings{Deitke_2020_CVPR,
  author = {Deitke, Matt and Han, Winson and Herrasti, Alvaro and Kembhavi, Aniruddha and Kolve, Eric and Mottaghi, Roozbeh and Salvador, Jordi and Schwenk, Dustin and VanderBilt, Eli and Wallingford, Matthew and Weihs, Luca and Yatskar, Mark and Farhadi, Ali},
  title = {RoboTHOR: An Open Simulation-to-Real Embodied AI Platform},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
All in One Bad Weather Removal Using Architectural Search
Ruoteng Li, Robby T. Tan, Loong-Fah Cheong


Many methods have set state-of-the-art performance on restoring images degraded by bad weather such as rain, haze, fog, and snow; however, they are designed specifically to handle one type of degradation. In this paper, we propose a method that can handle multiple bad weather degradations: rain, fog, snow and adherent raindrops using a single network. To achieve this, we first design a generator with multiple task-specific encoders, each of which is associated with a particular bad weather degradation type. We utilize a neural architecture search to optimally process the image features extracted from all encoders. Subsequently, to convert degraded image features to clean background features, we introduce a series of tensor-based operations encapsulating the underlying physics principles behind the formation of rain, fog, snow and adherent raindrops. These operations serve as the basic building blocks for our architectural search. Finally, our discriminator simultaneously assesses the correctness and classifies the degradation type of the restored image. We design a novel adversarial learning scheme that only backpropagates the loss of a degradation type to the respective task-specific encoder. Although it is designed to handle different types of bad weather, extensive experiments demonstrate that our method performs competitively with individual, dedicated state-of-the-art image restoration methods.
[multiple, attention, recognition, dataset, video] [feature, propose, background, heavy, table] [adversarial, type, input, clean, model, trained, effective, generic] [rain, removal, bad, weather, ieee, snow, method, raindrop, degradation, pattern, cell, based, adherent, june, proposed, dedicated, restoration, figure, conv, robby, residue, convolutional, july, degraded, deraining, psnr, fog, designed, restored, veiling] [image, discriminator, generator, generative, loss, encoder, corresponding, real, synthetic, encoders] [network, search, learning, architecture, neural, deep, operation, training, test, set, compared, achieve, design, performance, update, layer, process, basic, processing] [conference, computer, vision, single, international, fundamental, ground, decomposition, truth, handle, error]
@InProceedings{Li_2020_CVPR,
  author = {Li, Ruoteng and Tan, Robby T. and Cheong, Loong-Fah},
  title = {All in One Bad Weather Removal Using Architectural Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Relation-Aware Global Attention for Person Re-Identification
Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, Zhibo Chen


For person re-identification (re-id), attention mechanisms have become attractive as they aim at strengthening discriminative features and suppressing irrelevant ones, which matches well the key of re-id, i.e., discriminative feature learning. Previous approaches typically learn attention using local convolutions, ignoring the mining of knowledge from global structure patterns. Intuitively, the affinities among spatial positions/nodes in the feature map provide clustering-like information and are helpful for inferring semantics and thus attention, especially for person images where the feasible human poses are constrained. In this work, we propose an effective Relation-Aware Global Attention (RGA) module which captures the global structural information for better attention learning. Specifically, for each feature position, in order to compactly grasp the structural information of global scope and local appearance information, we propose to stack the relations, i.e., its pairwise correlations/affinities with all the feature positions (e.g., in raster scan order), and the feature itself together to learn the attention with a shallow convolutional model. Extensive ablation studies demonstrate that our RGA can significantly enhance the feature representation power and help achieve the state-of-the-art performance on several popular benchmarks. The source code is available at https://github.com/microsoft/Relation-Aware-Global-Attention-Networks.
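A compact, hypothetical sketch of the relation-stacking idea described above: for every spatial position, its pairwise affinities with all positions are stacked together with an embedding of the feature itself and passed through a shallow convolutional model to produce the attention. The embedding sizes and layer widths are assumptions, not the released RGA module.

```python
import torch
import torch.nn as nn

class RelationAwareSpatialAttention(nn.Module):
    """Spatial attention from stacked pairwise relations plus the feature itself (sketch)."""
    def __init__(self, channels, height, width, embed=32):
        super().__init__()
        n = height * width
        self.embed_feat = nn.Conv2d(channels, embed, 1)
        self.embed_rel = nn.Conv2d(channels, embed, 1)
        # Shallow model mapping [relations (2n) + embedded feature] -> one attention value.
        self.attn = nn.Sequential(
            nn.Conv2d(2 * n + embed, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        e = self.embed_rel(x).flatten(2)                    # [b, embed, n]
        affinity = torch.einsum('bci,bcj->bij', e, e)       # pairwise affinities [b, n, n]
        # For each position, stack its outgoing and incoming relation vectors.
        relations = torch.cat([affinity, affinity.transpose(1, 2)], dim=2)   # [b, n, 2n]
        relations = relations.permute(0, 2, 1).reshape(b, 2 * n, h, w)
        stacked = torch.cat([relations, self.embed_feat(x)], dim=1)
        return x * self.attn(stacked)                       # attended feature map
```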
[attention, relation, node, scope, embedding, ith, connecting, semantics, infer, exploration, represent, modeling, three, exploit] [feature, global, map, rga, module, affinity, propose, table, mine, achieves, liang, aggregated] [model, original, effective, helpful] [spatial, channel, convolutional, proposed, block, stack, based, relu, comparison, receptive, residual] [person, structural, target, learn, discriminative, image, representation, source, invariant] [pairwise, network, vector, baseline, learning, learned, performance, scheme, function, large, note, mining, better, size, deep, weight, knowledge, design, layer, neural, observe, number, matrix] [position, local, human, determine, globally]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Zhizheng and Lan, Cuiling and Zeng, Wenjun and Jin, Xin and Chen, Zhibo},
  title = {Relation-Aware Global Attention for Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
HOnnotate: A Method for 3D Annotation of Hand and Object Poses
Shreyas Hampali, Mahdi Rad, Markus Oberweger, Vincent Lepetit


We propose a method for annotating images of a hand manipulating an object with the 3D poses of both the hand and the object, together with a dataset created using this method. Our motivation is the current lack of annotated real images for this problem, as estimating the 3D poses is challenging, mostly because of the mutual occlusions between the hand and the object. To tackle this challenge, we capture sequences with one or several RGB-D cameras and jointly optimize the 3D hand and object poses over all the frames simultaneously. This method allows us to automatically annotate each frame with accurate estimates of the poses, despite large mutual occlusions. With this method, we created HO-3D, the first markerless dataset of color images with 3D annotations for both the hand and object. This dataset is currently made of 77,558 frames, 68 sequences, 10 persons, and 10 objects. Using our dataset, we develop a single RGB image-based method to predict the hand pose when interacting with objects under severe occlusions and show it generalizes to objects not seen in the dataset.
[dataset, frame, recognition, predict, interaction, temporal, evaluation, provide, egocentric, action] [object, tracking, annotation, table] [model, robust, physical, christian, datasets] [ieee, method, pattern, based, color, created] [synthetic, real, image, generative, discriminative, manipulating] [optimization, accuracy, learning, deep, training, data, function, optimize, setup, large, initialization] [hand, pose, computer, conference, joint, vision, estimation, single, pto, international, term, grasp, camera, rgb, depth, estimate, error, mano, shape, point, pth, edpt, european, antonis, markerless, ground, vertex, mesh, vincent, interacting, complex, truth, wrist, defined, ephy, markus]
@InProceedings{Hampali_2020_CVPR,
  author = {Hampali, Shreyas and Rad, Mahdi and Oberweger, Markus and Lepetit, Vincent},
  title = {HOnnotate: A Method for 3D Annotation of Hand and Object Poses},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics
Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, Siwei Lyu


AI-synthesized face-swapping videos, commonly known as DeepFakes, are an emerging problem threatening the trustworthiness of online information. The need to develop and evaluate DeepFake detection algorithms calls for datasets of DeepFake videos. However, current DeepFake datasets suffer from low visual quality and do not resemble the DeepFake videos circulated on the Internet. We present a new large-scale challenging DeepFake video dataset, Celeb-DF, which contains 5,639 high-quality DeepFake videos of celebrities generated using an improved synthesis process. We conduct a comprehensive evaluation of DeepFake detection methods and datasets to demonstrate the escalated level of challenges posed by Celeb-DF.
[visual, dataset, video, current, evaluation, frame, individual, decoder] [detection, table, head, mask, score, challenging, cnn, challenge] [deepfake, face, datasets, original, quality, trained, auc, facial, dfdc, dfd, model, col, uadfv, input, unpublished, fwa, deepfakes, maker, improve, christian, nov] [based, method, existing, ieee, color, figure, capsule, low, pattern, convolutional] [synthesized, synthesis, image, real, generated, encoder, corresponding, generation, target, code, row, fake] [performance, average, neural, improved, deep, compared, training, basic, learning, algorithm, arxiv, preprint, evaluate] [conference, international, compare, computer]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yuezun and Yang, Xin and Sun, Pu and Qi, Honggang and Lyu, Siwei},
  title = {Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Unfolding Network for Image Super-Resolution
Kai Zhang, Luc Van Gool, Radu Timofte


Learning-based single image super-resolution (SISR) methods are continuously showing superior effectiveness and efficiency over traditional model-based methods, largely due to the end-to-end training. However, different from model-based methods that can handle the SISR problem with different scale factors, blur kernels and noise levels under a unified MAP (maximum a posteriori) framework, learning-based methods generally lack such flexibility. To address this issue, this paper proposes an end-to-end trainable unfolding network which leverages both learningbased methods and model-based methods. Specifically, by unfolding the MAP inference via a half-quadratic splitting algorithm, a fixed number of iterations consisting of alternately solving a data subproblem and a prior subproblem can be obtained. The two subproblems then can be solved with neural modules, resulting in an end-to-end trainable, iterative network. As a result, the proposed network inherits the flexibility of model-based methods to super-resolve blurry, noisy images for different scale factors via a single model, while maintaining the advantages of learning-based methods. Extensive experiments demonstrate the superiority of the proposed deep unfolding network in terms of flexibility, effectiveness and also generalizability.
[visual, trainable, work] [module, level, map, downsampled, main, table, jian, luc, van, effectiveness, denotes] [noise, model, iterative, trained, splitting] [degradation, blur, scale, usrnet, kernel, prior, bicubic, unfolding, sisr, proposed, ieee, gaussian, method, usrgan, classical, psnr, based, fast, kai, figure, convolutional, residual, ircnn, deblurring, blind, motion, ikc, wangmeng, radu, subproblem, convolution, denoiser, flexible, restoration, anisotropic, rcan, zssr, lei, subproblems, result] [image, factor, distinct, loss, real] [deep, data, network, learning, training, large, note, function, optimization, inference, number, neural, size, problem, performance] [single, handle, term, solution, estimation, michael, solving]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Kai and Gool, Luc Van and Timofte, Radu},
  title = {Deep Unfolding Network for Image Super-Resolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
On the Uncertainty of Self-Supervised Monocular Depth Estimation
Matteo Poggi, Filippo Aleotti, Fabio Tosi, Stefano Mattoccia


Self-supervised paradigms for monocular depth estimation are very appealing since they do not require ground truth annotations at all. Despite the astonishing results yielded by such methodologies, learning to reason about the uncertainty of the estimated depth maps is of paramount importance for practical applications, yet uncharted in the literature. Purposely, we explore for the first time how to estimate the uncertainty for this task and how this affects depth accuracy, proposing a novel peculiar technique specifically designed for self-supervised approaches. On the standard KITTI dataset, we exhaustively assess the performance of each method with different self-supervised paradigms. Such evaluation highlights that our proposal i) always improves depth accuracy significantly and ii) yields state-of-the-art results concerning uncertainty estimation when training on sequences and competitive results uniquely deploying stereo pairs.
[evaluation, recognition, prediction, multiple, time, dataset] [confidence, supervision, table, lidar, instance] [model, trained] [ieee, pattern, optical, flow, method, figure, traditional, scale, output] [image, unsupervised, train, unknown] [learning, training, empirical, network, neural, deep, log, variance, predictive, bayesian, distribution, accuracy, function, better, machine, best, improved, sampling, dropout, test, compared, strategy, report, data, requires] [depth, uncertainty, monocular, conference, estimation, computer, stereo, vision, single, ground, international, matteo, stefano, truth, pose, estimate, leveraging, modelling, fabio, estimated, boot, rmse, european, joint, ause, aurg, camera, error, rel, sparsification, concerning, estimating, volume, scene, geometry]
@InProceedings{Poggi_2020_CVPR,
  author = {Poggi, Matteo and Aleotti, Filippo and Tosi, Fabio and Mattoccia, Stefano},
  title = {On the Uncertainty of Self-Supervised Monocular Depth Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Proxy Anchor Loss for Deep Metric Learning
Sungyeon Kim, Dongwon Kim, Minsu Cho, Suha Kwak


Existing metric learning losses can be categorized into two classes: pair-based and proxy-based losses. The former class can leverage fine-grained semantic relations between data points, but slows convergence in general due to its high training complexity. In contrast, the latter class enables fast and reliable convergence, but cannot consider the rich data-to-data relations. This paper presents a new proxy-based loss that takes advantages of both pair- and proxy-based methods and overcomes their limitations. Thanks to the use of proxies, our loss boosts the speed of convergence and is robust against noisy labels and outliers. At the same time, it allows embedding vectors of data to interact with each other in its gradients to exploit data-to-data relations. Our method is evaluated on four public benchmarks, where a standard network trained with our loss achieves state-of-the-art performance and most quickly converges.
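A hedged PyTorch sketch of a proxy-anchor-style loss: for each class proxy, in-batch positives are pulled and negatives pushed through softplus-of-LogSumExp terms over scaled cosine similarities, so the gradients couple all embeddings in the batch. The margin and scale defaults are assumptions, and this is not guaranteed to match the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProxyAnchorStyleLoss(nn.Module):
    """Proxy-anchor-style metric learning loss (sketch; margin/scale are assumptions)."""
    def __init__(self, num_classes, embed_dim, margin=0.1, alpha=32.0):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_classes, embed_dim) * 0.01)
        self.margin = margin
        self.alpha = alpha

    def forward(self, embeddings, labels):
        # Cosine similarity between every embedding and every class proxy.
        sim = F.normalize(embeddings, dim=1) @ F.normalize(self.proxies, dim=1).T  # [B, C]
        pos_mask = F.one_hot(labels, self.proxies.shape[0]).bool()                  # [B, C]
        neg_mask = ~pos_mask

        pos_exp = torch.exp(-self.alpha * (sim - self.margin))
        neg_exp = torch.exp(self.alpha * (sim + self.margin))

        # For each proxy, aggregate its positive and negative embeddings in the batch.
        pos_term = torch.log1p((pos_exp * pos_mask).sum(dim=0))   # [C]
        neg_term = torch.log1p((neg_exp * neg_mask).sum(dim=0))   # [C]

        with_pos = pos_mask.any(dim=0)   # proxies that have at least one positive in the batch
        return pos_term[with_pos].mean() + neg_term.mean()
```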
[embedding, recognition, pair, retrieval, rich, previous, outperforms, speed] [positive, anchor, table, achieves, faster] [model, trained, input, example, datasets, quality, robust, versus] [ieee, pattern, figure, comparison, high, analysis, existing, fast, method] [loss, image, pull] [data, training, learning, metric, deep, complexity, proxy, negative, batch, triplet, class, convergence, performance, size, larger, number, hardness, network, accuracy, set, dimension, neural, space, standard, gradient, sop, sampling, lifted, large, hyperparameters, contrastive, consider, reliable] [conference, computer, vision, international, point, relative, enables, distance, structure, leverage]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Sungyeon and Kim, Dongwon and Cho, Minsu and Kwak, Suha},
  title = {Proxy Anchor Loss for Deep Metric Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Learning for Intrinsic Image Decomposition From a Single Image
Yunfei Liu, Yu Li, Shaodi You, Feng Lu


Intrinsic image decomposition, which is an essential task in computer vision, aims to infer the reflectance and shading of the scene. It is challenging since it needs to separate one image into two components. To tackle this, conventional methods introduce various priors to constrain the solution, yet with limited performance. Meanwhile, the problem is typically solved by supervised learning methods, which is actually not an ideal solution since obtaining ground truth reflectance and shading for massive general natural scenes is challenging and even impossible. In this paper, we propose a novel unsupervised intrinsic image decomposition framework, which relies on neither labeled training data nor hand-crafted priors. Instead, it directly learns the latent feature of reflectance and shading from unsupervised and uncorrelated data. To enable this, we explore the independence between reflectance and shading, the domain invariant content constraint and the physical constraint. Extensive experiments on both synthetic and real image datasets demonstrate consistently superior performance of the proposed method.
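One piece of this framework that is easy to make concrete is the physical constraint, reconstructing the input image from the predicted reflectance and shading. A minimal sketch follows; the function name and the grayscale-shading assumption are mine.

```python
import torch
import torch.nn.functional as F

def physical_constraint_loss(image, reflectance, shading):
    """I ~= R * S: penalize deviation of the reconstructed image from the input.

    image, reflectance: [B, 3, H, W]; shading: [B, 1, H, W] (grayscale shading assumption).
    """
    reconstruction = reflectance * shading   # shading broadcast over the RGB channels
    return F.l1_loss(reconstruction, image)
```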
[natural, dataset, three, illustrated, visual, infer, explore, work] [propose, table, ablation, feature, object, module, employ] [input, physical, trained, datasets, adversarial] [prior, method, figure, proposed, comparison, feng, based, convolutional, existing, retinex, lmse, constrain] [image, unsupervised, supervised, code, domain, content, latent, translation, loss, learn, style, mscr, real, mapping, munit, encoders, lcnt, lphy, transfer, appearance, mit, ladv, invariant, consistency, fdcp, qualitative, independence] [learning, training, set, distribution, data, performance, unlabelled, log, total, follow, best, test] [intrinsic, reflectance, shading, decomposition, single, ground, constraint, shapenet, iiw, numerical, term, assume, provided, michael, truth, directly]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yunfei and Li, Yu and You, Shaodi and Lu, Feng},
  title = {Unsupervised Learning for Intrinsic Image Decomposition From a Single Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Domain Learning for Accurate and Few-Shot Color Constancy
Jin Xiao, Shuhang Gu, Lei Zhang


Color constancy is an important process in the camera pipeline to remove the color bias of the captured image caused by scene illumination. Recently, significant improvements in color constancy accuracy have been achieved by using deep neural networks (DNNs). However, existing DNN-based color constancy methods learn distinct mappings for different cameras, which requires a costly data acquisition process for each camera device. In this paper, we present a pioneering work that introduces multi-domain learning to the area of color constancy. For different camera devices, we train a branch of networks which share the same feature extractor and illuminant estimator, and only employ a camera-specific channel re-weighting module to adapt to the camera-specific characteristics. Such a multi-domain learning strategy enables us to benefit from cross-device training data. The proposed multi-domain learning color constancy method achieved state-of-the-art performance on three commonly used benchmark datasets. Furthermore, we also validate the proposed method in a few-shot color constancy setting. Given a new unseen device with a limited number of training samples, our method is capable of delivering accurate color constancy by merely learning the camera-specific parameters from the few-shot dataset. Our project page is publicly available at https://github.com/msxiaojin/MDLCC.
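For reference, the standard error metric in this area, the angular error between the estimated and ground-truth illuminant, is a one-liner; the function name below is mine.

```python
import numpy as np

def angular_error_degrees(estimated, ground_truth):
    """Angle in degrees between two RGB illuminant vectors (scale-invariant)."""
    e = np.asarray(estimated, dtype=np.float64)
    g = np.asarray(ground_truth, dtype=np.float64)
    cos = np.dot(e, g) / (np.linalg.norm(e) * np.linalg.norm(g))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: a small error between two nearby illuminants.
# angular_error_degrees([0.95, 1.0, 0.80], [1.0, 1.0, 0.78])
```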
[dataset, work, multiple, previous, combining, three] [feature, module, table, extractor] [model, worst, input, improve, datasets] [color, constancy, proposed, illuminant, mdlcc, channel, device, ieee, method, pattern, achieved, raw, analysis, existing, scale, combination, cube, utilized, convolutional, ffcc, based] [image, learn, train, shared, utilize] [training, learning, network, data, performance, number, large, best, adapt, deep, architecture, machine, function, set, neural, better, arxiv, preprint, strategy] [single, camera, estimation, conference, estimate, approach, computer, scene, vision, limited, error, estimating, well]
@InProceedings{Xiao_2020_CVPR,
  author = {Xiao, Jin and Gu, Shuhang and Zhang, Lei},
  title = {Multi-Domain Learning for Accurate and Few-Shot Color Constancy},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PANDA: A Gigapixel-Level Human-Centric Video Dataset
Xueyang Wang, Xiya Zhang, Yinheng Zhu, Yuchen Guo, Xiaoyun Yuan, Liuyu Xiang, Zerun Wang, Guiguang Ding, David Brady, Qionghai Dai, Lu Fang


We present PANDA, the first gigaPixel-level humAN-centric viDeo dAtaset, for large-scale, long-term, and multi-object visual analysis. The videos in PANDA were captured by a gigapixel camera and cover real-world scenes with both wide field-of-view ( 1 square kilometer area) and high-resolution details ( gigapixel-level/frame). The scenes may contain 4k head counts with over 100x scale variation. PANDA provides enriched and hierarchical ground-truth annotations, including 15,974.6k bounding boxes, 111.8k fine-grained attribute labels, 12.7k trajectories, 2.2k groups and 2.9k interactions. We benchmark the human detection and tracking tasks. Due to the vast variance of pedestrian pose, scale, occlusion and trajectory, existing approaches are challenged by both accuracy and efficiency. Given the uniqueness of PANDA with both wide FoV and high resolution, a new task of interaction-aware group detection is introduced. We design a 'global-to-local zoom-in' framework, where global trajectories and local interactions are simultaneously encoded, yielding promising results. We believe PANDA will contribute to the community of artificial intelligence and praxeology by understanding human behaviors and interactions in large-scale real-world scenes. PANDA Website: http://www.panda-dataset.com.
[video, visual, dataset, interaction, trajectory, multiple, social, recognition, three, duration, temporal, understanding] [panda, detection, tracking, pedestrian, object, bounding, global, benchmark, false, annotation, occlusion, faster, edge, eglobal, crowded, head] [datasets, face, representative] [ieee, pattern, crowd, analysis, high, spatial, resolution, scale, based, existing, figure, gigapixel] [image, person, surveillance] [group, performance, number, distribution, label, arxiv, preprint, wide, deep, data, task, learning, neural, network, processing, large, accuracy, caltech] [computer, conference, local, vision, human, international, scene, body, fov, full, visible, provided, camera, dan]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Xueyang and Zhang, Xiya and Zhu, Yinheng and Guo, Yuchen and Yuan, Xiaoyun and Xiang, Liuyu and Wang, Zerun and Ding, Guiguang and Brady, David and Dai, Qionghai and Fang, Lu},
  title = {PANDA: A Gigapixel-Level Human-Centric Video Dataset},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-View Tracking for Multi-Human 3D Pose Estimation at Over 100 FPS
Long Chen, Haizhou Ai, Rui Chen, Zijie Zhuang, Shuang Liu


Estimating 3D poses of multiple humans in real-time is a classic but still challenging task in computer vision. Its major difficulty lies in the ambiguity in cross-view association of 2D poses and the huge state space when there are multiple people in multiple views. In this paper, we present a novel solution for multi-human 3D pose estimation from multiple calibrated camera views. It takes 2D poses in different camera coordinates as inputs and aims for the accurate 3D poses in the global coordinate. Unlike previous methods that associate 2D poses among all pairs of views from scratch at every frame, we exploit the temporal consistency in videos to match the 2D inputs with 3D poses directly in 3-space. More specifically, we propose to retain the 3D pose for each person and update them iteratively via the cross-view multi-human tracking. This novel formulation improves both accuracy and efficiency, as we demonstrated on widely-used public datasets. To further verify the scalability of our method, we propose a new large-scale multi-human dataset with 12 to 28 camera views. Without bells and whistles, our solution achieves 154 FPS on 12 cameras and 34 FPS on 28 cameras, indicating its ability to handle large-scale real-world applications. The proposed dataset will be released at https://github.com/longcw/crossview_3d_pose_tracking.
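The underlying geometric primitive, triangulating a joint from calibrated views, can be sketched with OpenCV as below; the paper fuses many views and tracks in 3-space, so the two-view restriction, camera matrices, and function names here are illustrative assumptions.

```python
import numpy as np
import cv2

def triangulate_joint(P1, P2, pt1, pt2):
    """Triangulate one 3D joint from its 2D detections in two calibrated cameras.

    P1, P2: (3, 4) projection matrices; pt1, pt2: (x, y) pixel coordinates.
    Returns the 3D point in the global coordinate frame.
    """
    a = np.asarray(pt1, dtype=np.float64).reshape(2, 1)
    b = np.asarray(pt2, dtype=np.float64).reshape(2, 1)
    X_h = cv2.triangulatePoints(P1, P2, a, b)   # homogeneous (4, 1)
    return (X_h[:3] / X_h[3]).ravel()

def reprojection_error(P, X, pt):
    """Pixel distance between a projected 3D point and its 2D detection."""
    x_h = P @ np.append(X, 1.0)
    return np.linalg.norm(x_h[:2] / x_h[2] - np.asarray(pt, dtype=np.float64))
```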
[multiple, time, people, dataset, frame, previous, campus, three, cmu, temporal, store, state, work] [tracking, association, propose, affinity, detected, panoptic, table, fps, location, associate, denotes, framework] [iterative, verify, public, dong] [method, ieee, pattern, figure, motion, comparison, proposed, based, conventional, generally, designed] [target, person, consistency] [incremental, number, accuracy, problem, processing, computational, algorithm, baseline, rate, performance, equation, space, updated] [pose, camera, estimation, human, computer, conference, vision, triangulation, point, body, joint, reconstruction, matching, shelf, estimated, geometric, correspondence, solution, monocular, formulation, view, pictorial, estimate, single]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Long and Ai, Haizhou and Chen, Rui and Zhuang, Zijie and Liu, Shuang},
  title = {Cross-View Tracking for Multi-Human 3D Pose Estimation at Over 100 FPS},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Spatial-Temporal Graph Convolutional Network for Video-Based Person Re-Identification
Jinrui Yang, Wei-Shi Zheng, Qize Yang, Ying-Cong Chen, Qi Tian


While video-based person re-identification (Re-ID) has drawn increasing attention and made great progress in recent years, it is still very challenging to effectively overcome the occlusion problem and the visual ambiguity problem for visually similar negative samples. On the other hand, we observe that different frames of a video can provide complementary information for each other, and the structural information of pedestrians can provide extra discriminative cues for appearance features. Thus, modeling the temporal relations of different frames and the spatial relations within a frame has the potential for solving the above problems. In this work, we propose a novel Spatial-Temporal Graph Convolutional Network (STGCN) to solve these problems. The STGCN includes two GCN branches, a spatial one and a temporal one. The spatial branch extracts structural information of a human body. The temporal branch mines discriminative cues from adjacent frames. By jointly optimizing these branches, our model extracts robust spatial-temporal information that is complementary with appearance information. As shown in the experiments, our model achieves state-of-the-art results on MARS and DukeMTMC-VideoReID datasets.
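Both branches build on graph convolution; below is a minimal graph convolution layer with symmetric normalization (the standard GCN form), used only as an illustrative stand-in for the paper's spatial and temporal branches. The shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class GraphConvolution(nn.Module):
    """One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x, adjacency):
        # x: [B, N, F] node features; adjacency: [N, N] (e.g. a body-part or frame graph).
        a_hat = adjacency + torch.eye(adjacency.shape[0], device=adjacency.device)
        deg_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        norm_adj = deg_inv_sqrt[:, None] * a_hat * deg_inv_sqrt[None, :]
        return torch.relu(torch.einsum('ij,bjf->bif', norm_adj, self.linear(x)))

# parts = torch.randn(4, 8, 256); adj = torch.ones(8, 8)
# out = GraphConvolution(256, 256)(parts, adj)
```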
[gcn, temporal, graph, video, frame, modeling, tgcn, attention, provide, dataset] [map, branch, feature, module, global, pooling, table, occlusion, xiaogang, china, stgcn, alleviate] [model, complementary, robust] [ieee, spatial, pattern, proposed, figure, method, convolutional, adjacent, analysis, patch, convolution] [person, structural, discriminative, appearance, image, loss, representation, tao, perform] [number, learning, network, performance, baseline, matrix, deep, neural, large, metric, impact, arxiv, preprint, average, layer, operation, set, best, activation, problem, potential, compared, evaluate] [computer, conference, vision, body, sgcn, european, international, jointly, distance]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Jinrui and Zheng, Wei-Shi and Yang, Qize and Chen, Ying-Cong and Tian, Qi},
  title = {Spatial-Temporal Graph Convolutional Network for Video-Based Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Salience-Guided Cascaded Suppression Network for Person Re-Identification
Xuesong Chen, Canmiao Fu, Yong Zhao, Feng Zheng, Jingkuan Song, Rongrong Ji, Yi Yang


Employing attention mechanisms to model both global and local features as a final pedestrian representation has become a trend for person re-identification (Re-ID) algorithms. A potential limitation of these methods is that they focus on the most salient features, but the re-identification of a person may rely on diverse clues masked by the most salient features in different situations, e.g., body, clothes or even shoes. To handle this limitation, we propose a novel Salience-guided Cascaded Suppression Network (SCSN) which enables the model to mine diverse salient features and integrate these features into the final representation in a cascaded manner. Our work makes the following contributions: (i) We observe that previously learned salient features may hinder the network from learning other important information. To tackle this limitation, we introduce a cascaded suppression strategy, which enables the network to mine, stage by stage, diverse potentially useful features that are masked by the other salient features, and each stage integrates a different feature embedding into the final discriminative pedestrian representation. (ii) We propose a Salient Feature Extraction (SFE) unit, which suppresses the salient features learned in the previous cascaded stage and then adaptively extracts other potential salient features to obtain different clues about pedestrians. (iii) We develop an efficient feature aggregation strategy that fully increases the network's capacity for all potential salience features. Finally, experimental results demonstrate that our proposed method outperforms the state-of-the-art methods on four large-scale datasets. Especially, our approach exceeds the current best method by over 7% on the CUHK03 dataset.
[attention, recognition, extract, unit, dataset, mechanism] [feature, salient, suppression, salience, global, stage, map, pooling, pyramid, module, backbone, table, pedestrian, semantic, sfe, aggregation, employ, scsn, final, focus, denotes, mine, china, parsing, object, including] [improve, input, model, query, prone] [ieee, pattern, spatial, cascaded, method, residual, proposed, fusion, convolutional, extraction, adaptively, figure, dual] [person, loss, diverse, discriminative, image, representation, learn, consists] [network, potential, learning, deep, baseline, training, performance, strategy, triplet, set, number, average, multiplication, vector, reduce] [conference, computer, vision, local, international, human, pose, descriptor, enables, demonstrate, body]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Xuesong and Fu, Canmiao and Zhao, Yong and Zheng, Feng and Song, Jingkuan and Ji, Rongrong and Yang, Yi},
  title = {Salience-Guided Cascaded Suppression Network for Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fashion Outfit Complementary Item Retrieval
Yen-Liang Lin, Son Tran, Larry S. Davis


Complementary fashion item recommendation is critical for fashion outfit completion. Existing methods mainly focus on outfit compatibility prediction but not in a retrieval setting. We propose a new framework for outfit complementary item retrieval. Specifically, a category-based subspace attention network is presented, which is a scalable approach for learning the subspace attentions. In addition, we introduce an outfit ranking loss that better models the item relationships of an entire outfit. We evaluate our method on the outfit compatibility, FITB and new retrieval tasks. Experimental results demonstrate that our approach outperforms state-of-the-art methods in both compatibility prediction and complementary item retrieval.
[retrieval, attention, embeddings, embedding, dataset, prediction, multiple, indexing, text, work, pair, visual] [category, feature, positive, framework, final, table, aggregation] [outfit, item, compatibility, complementary, ops, model, oms, ear, query, polyvore, fitb, compatible, disjoint, fashion, input, jewellery, original, testing, auc, outerwear] [method, figure, existing, based, comparison, designed] [image, loss, target, generate] [ranking, subspace, set, network, learning, performance, negative, entire, vector, average, top, training, triplet, better, number, task, pairwise, function, accuracy, metric, evaluate, large, similarity, deep] [distance, body, approach, single, well, system, computed, compare]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Yen-Liang and Tran, Son and Davis, Larry S.},
  title = {Fashion Outfit Complementary Item Retrieval},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Event-Based Motion Deblurring
Zhe Jiang, Yu Zhang, Dongqing Zou, Jimmy Ren, Jiancheng Lv, Yebin Liu


Recovering a sharp video sequence from a motion-blurred image is highly ill-posed due to the significant loss of motion information in the blurring process. For event-based cameras, however, fast motion can be captured as events at a high frame rate, raising new opportunities for exploring effective solutions. In this paper, we start from a sequential formulation of event-based motion deblurring, then show how its optimization can be unfolded with a novel end-to-end deep architecture. The proposed architecture is a convolutional recurrent neural network that integrates visual and temporal knowledge at both global and local scales in a principled manner. To further improve the reconstruction, we propose a differentiable directional event filtering module to effectively extract rich boundary prior from the evolution of events. We conduct extensive experiments on the synthetic GoPro dataset and a large newly introduced dataset captured by a DAVIS240C camera. The proposed approach achieves state-of-the-art reconstruction quality, and generalizes better to handling real-world motion blur.
[time, recognition, video, temporal, dataset, mpn, recurrent, sequential, work, frame, mrl, sequence, visual, srn] [boundary, table, propose, global, achieves] [input, model, physical] [motion, event, deblurring, ieee, pattern, blurred, proposed, sharp, blind, bha, fast, blur, high, convolutional, guidance, prior, result, figure, intensity, compensation, etv, gopro, psnr, ssim, cie, captured, low, contrast, optical, read] [image, loss, latent, appearance] [network, learning, deep, note, sampling, neural, architecture, better, data, learned, performance, process] [vision, conference, computer, scene, approach, reconstruction, local, international, directional, camera, initial, estimation, assume, novel, joint, well, estimate, european]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Zhe and Zhang, Yu and Zou, Dongqing and Ren, Jimmy and Lv, Jiancheng and Liu, Yebin},
  title = {Learning Event-Based Motion Deblurring},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Domain Decluttering: Simplifying Images to Mitigate Synthetic-Real Domain Shift and Improve Depth Estimation
Yunhan Zhao, Shu Kong, Daeyun Shin, Charless Fowlkes


Leveraging synthetically rendered data offers great potential to improve monocular depth estimation and other geometric estimation tasks, but closing the synthetic-real domain gap is a non-trivial and important task. While much recent work has focused on unsupervised domain adaptation, we consider a more realistic scenario where a large amount of synthetic training data is supplemented by a small set of real images with ground-truth. In this setting, we find that existing domain translation approaches are difficult to train and offer little advantage over simple baselines that use a mix of real and synthetic data. A key failure mode is that real-world images contain novel objects and clutter not present in synthetic training. This high-level domain shift isn't handled by existing image translation models. Based on these observations, we develop an attention module that learns to identify and remove difficult out-of-domain regions in real images in order to improve depth prediction for a model trained primarily on synthetic data. We carry out extensive experiments to validate our attend-remove-complete approach (ARC) and find that it significantly outperforms state-of-the-art domain adaptation methods for depth prediction. Visualizing the removed regions provides interpretable insights into the synthetic-real domain gap.
[attention, prediction, modular, video] [module, mask, table, annotated, feature, improves, ablation] [model, adversarial, trained, improve, study, testing, original, input] [ieee, pattern, remove, output, convolutional, figure, removed] [synthetic, domain, real, arc, adaptation, train, image, translation, mix, loss, style, unsupervised, translator, learns, inpainting, transfer, translated, exr, trevor] [training, data, learning, better, set, performance, binary, deep, large, amount, neural, predictor, small, sparsity, arxiv, preprint, find, descent, labeled, note, baseline] [depth, conference, computer, vision, scene, monocular, kitti, error, international, indoor, leveraging, sparse, coordinate, single, full, michael]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Yunhan and Kong, Shu and Shin, Daeyun and Fowlkes, Charless},
  title = {Domain Decluttering: Simplifying Images to Mitigate Synthetic-Real Domain Shift and Improve Depth Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Blind Deconvolution Using Deep Priors
Dongwei Ren, Kai Zhang, Qilong Wang, Qinghua Hu, Wangmeng Zuo


Blind deconvolution is a classical yet challenging low-level vision problem with many real-world applications. Traditional maximum a posteriori (MAP) based methods rely heavily on fixed and handcrafted priors that are certainly insufficient in characterizing clean images and blur kernels, and usually adopt specially designed alternating minimization to avoid trivial solutions. In contrast, existing deep motion deblurring networks learn the mapping to a clean image or blur kernel from massive training images, but are limited in handling various complex and large blur kernels. To connect MAP and deep models, in this paper we present two generative networks for respectively modeling the deep priors of the clean image and the blur kernel, and propose an unconstrained neural optimization solution to blind deconvolution. In particular, we adopt an asymmetric Autoencoder with skip connections for generating the latent clean image, and a fully-connected network (FCN) for generating the blur kernel. Moreover, the SoftMax nonlinearity is applied to the output layer of the FCN to meet the non-negativity and equality constraints. The process of neural optimization can be explained as a kind of "zero-shot" self-supervised learning of the generative networks, and thus our proposed method is dubbed SelfDeblur. Experimental results show that SelfDeblur achieves notable quantitative gains as well as more visually plausible deblurring results in comparison to state-of-the-art blind deconvolution methods on benchmark datasets and real-world blurry images. The source code is publicly available at https://github.com/csdwren/SelfDeblur
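The optimization the abstract describes, jointly fitting an image generator and a kernel generator to a single blurry observation, can be sketched as below. The generator architectures are left as placeholders, only the data term is used, and the grayscale assumption is mine, so this is an illustrative skeleton rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_deblur(blurry, image_net, kernel_net, z_x, z_k, kernel_size, steps=5000, lr=1e-3):
    """'Zero-shot' blind deconvolution: fit x and k so that k * x reproduces the blurry input.

    blurry:     [1, 1, H, W] observed blurry image (grayscale assumption).
    image_net:  network mapping noise z_x to a latent clean image of the same size.
    kernel_net: network mapping noise z_k to kernel_size**2 logits.
    """
    params = list(image_net.parameters()) + list(kernel_net.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        x = torch.sigmoid(image_net(z_x))                      # latent clean image in [0, 1]
        k = F.softmax(kernel_net(z_k).view(-1), dim=0)         # non-negative, sums to one
        k = k.view(1, 1, kernel_size, kernel_size)
        blurred = F.conv2d(x, k, padding=kernel_size // 2)      # simulate the blurring process
        loss = F.mse_loss(blurred, blurry)                      # data term only (sketch)
        loss.backward()
        optimizer.step()
    return x.detach(), k.detach()
```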
[dataset, natural, modeling, visual, making] [adopt, fcn, table, level, sun, cnn] [clean, noise, model, unconstrained, study] [blur, blind, deconvolution, selfdeblur, ieee, kernel, blurry, pattern, deblurring, prior, method, comparison, dip, output, motion, levin, performs, traditional, nonlinearity, visually, skip, proposed, quantitative, figure, convolutional, gxt, compulsory, analysis, based, designed] [image, latent, generative, generate, generating, competing] [optimization, network, neural, deep, learning, alternating, layer, algorithm, regularization, applied, machine, equality, minimization, choice, average, softmax, performance, gradient, large, size] [conference, computer, vision, joint, capture, estimated, international, estimation, limited, estimate, solve, estimating, solution]
@InProceedings{Ren_2020_CVPR,
  author = {Ren, Dongwei and Zhang, Kai and Wang, Qilong and Hu, Qinghua and Zuo, Wangmeng},
  title = {Neural Blind Deconvolution Using Deep Priors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Anisotropic Convolutional Networks for 3D Semantic Scene Completion
Jie Li, Kai Han, Peng Wang, Yu Liu, Xia Yuan


As a voxel-wise labeling task, semantic scene completion (SSC) tries to simultaneously infer the occupancy and semantic labels for a scene from a single depth and/or RGB image. The key challenge for SSC is how to effectively take advantage of the 3D context to model various objects or stuff with severe variations in shapes, layouts, and visibility. To handle such variations, we propose a novel module called anisotropic convolution, which offers flexibility and power unattainable by competing methods such as standard 3D convolution and some of its variants. In contrast to the standard 3D convolution that is limited to a fixed 3D receptive field, our module is capable of modeling the dimensional anisotropy voxel-wise. The basic idea is to enable an anisotropic 3D receptive field by decomposing a 3D convolution into three consecutive 1D convolutions, where the kernel size for each such 1D convolution is adaptively determined on the fly. By stacking multiple such anisotropic convolution modules, the voxel-wise modeling capability can be further enhanced while maintaining a controllable number of model parameters. Extensive experiments on two SSC benchmarks, NYU-Depth-v2 and NYUCAD, show the superior performance of the proposed method.
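The decomposition idea can be sketched as follows. The hypothetical AnisoConv below uses a fixed set of candidate 1D kernel sizes and a single soft selection weight per axis; the paper's module instead determines the kernel size adaptively per voxel.

import torch
import torch.nn as nn

class AnisoConv(nn.Module):
    """Simplified anisotropic convolution: a 3D conv is decomposed into three
    consecutive 1D convs (one per spatial axis), each with a softly selected
    kernel size. Illustrative only; not the paper's exact formulation."""
    def __init__(self, channels, sizes=(3, 5, 7)):
        super().__init__()
        self.bx = nn.ModuleList([nn.Conv3d(channels, channels, (1, 1, k),
                                           padding=(0, 0, k // 2)) for k in sizes])
        self.by = nn.ModuleList([nn.Conv3d(channels, channels, (1, k, 1),
                                           padding=(0, k // 2, 0)) for k in sizes])
        self.bz = nn.ModuleList([nn.Conv3d(channels, channels, (k, 1, 1),
                                           padding=(k // 2, 0, 0)) for k in sizes])
        self.logits = nn.Parameter(torch.zeros(3, len(sizes)))  # per-axis size selection

    def _axis(self, x, branches, logits):
        w = torch.softmax(logits, dim=-1)
        return sum(wi * b(x) for wi, b in zip(w, branches))

    def forward(self, x):                       # x: (N, C, D, H, W)
        x = self._axis(x, self.bz, self.logits[0])
        x = self._axis(x, self.by, self.logits[1])
        x = self._axis(x, self.bx, self.logits[2])
        return x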
[three, modulation, multiple, context, modeling, dataset, work] [semantic, module, table, feature, object, iou, cnn, denotes, recall, sofa, propose, achieves] [model, input] [aic, convolution, receptive, kernel, anisotropic, field, proposed, ddrnet, convolutional, ssc, sscnet, stacking, adaptively, method, spatial, nyucad, ieee, anisotropy, existing, figure] [row, image, flexibility] [network, performance, dimension, fixed, size, set, standard, learned, replace, bottleneck, candidate, computation, comparing, selection] [scene, completion, depth, dimensional, computer, voxel, nyu, conference, single, rgb, hybrid, floor, wall, chair, bed, occupancy, novel, vision, voxels, shape, structure]
@InProceedings{Li_2020_CVPR,
  author = {Li, Jie and Han, Kai and Wang, Peng and Liu, Yu and Yuan, Xia},
  title = {Anisotropic Convolutional Networks for 3D Semantic Scene Completion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution
Yapeng Tian, Yulun Zhang, Yun Fu, Chenliang Xu


Video super-resolution (VSR) aims to restore a photo-realistic high-resolution (HR) video frame from both its corresponding low-resolution (LR) frame (reference frame) and multiple neighboring frames (supporting frames). Due to varying motion of cameras or objects, the reference frame and each supporting frame are not aligned. Therefore, temporal alignment is a challenging yet important problem for VSR. Previous VSR methods usually utilize optical flow between the reference frame and each supporting frame to warp the supporting frame for temporal alignment. However, both inaccurate flow and the image-level warping strategy lead to artifacts in the warped supporting frames. To overcome this limitation, we propose a temporally-deformable alignment network (TDAN) to adaptively align the reference frame and each supporting frame at the feature level without computing optical flow. The TDAN uses features from both the reference frame and each supporting frame to dynamically predict offsets of sampling convolution kernels. By using the corresponding kernels, TDAN transforms supporting frames to align with the reference frame. To predict the HR video frame, a reconstruction network taking the aligned frames and the reference frame as input is utilized. Experimental results demonstrate that the TDAN is capable of alleviating occlusions and artifacts for temporal alignment and that the TDAN-based VSR model outperforms several recent state-of-the-art VSR networks with a comparable or even much smaller model size. The source code and pre-trained models are released at https://github.com/YapengTian/TDAN-VSR.
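A rough sketch of feature-level alignment with a deformable convolution is shown below, assuming torchvision's DeformConv2d. The AlignBlock is an illustrative single alignment layer with a hypothetical offset predictor; TDAN itself uses several such layers plus a reconstruction network on top.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AlignBlock(nn.Module):
    """Illustrative temporally-deformable alignment of supporting-frame
    features to reference-frame features (simplified from TDAN)."""
    def __init__(self, channels=64, ksize=3):
        super().__init__()
        # 2 * ksize * ksize offset channels: one (x, y) offset per sampling location
        self.offset_pred = nn.Conv2d(2 * channels, 2 * ksize * ksize,
                                     kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size=ksize,
                                   padding=ksize // 2)

    def forward(self, feat_ref, feat_sup):
        # offsets are predicted from both frames; warping is applied to the supporting frame
        offsets = self.offset_pred(torch.cat([feat_ref, feat_sup], dim=1))
        return self.deform(feat_sup, offsets)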
[frame, video, temporal, visual, predict, previous, outperforms] [feature, table, module, achieves, propose] [model, quality, strong] [tdan, supporting, reference, vsr, motion, optical, flow, deformable, proposed, convolutional, duf, flowvsr, toflow, residual, convolution, sisr, filr, iilr, figure, based, liu, ieee, yapeng, restore, bicubic, itlr, ithr, rcan, drvsr, method, capability, yulun, yun, warping, capable, ftlr, adaptively, dynamic, psnr, frvsr, output] [image, alignment, aligned, corresponding, align, utilize, loss] [network, deep, sampling, layer, performance, better, training, note, learned] [reconstruction, estimation, accurate, additional, vision, reconstructed, single]
@InProceedings{Tian_2020_CVPR,
  author = {Tian, Yapeng and Zhang, Yulun and Fu, Yun and Xu, Chenliang},
  title = {TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution
Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P. Allebach, Chenliang Xu


In this paper, we explore the space-time video super-resolution task, which aims to generate a high-resolution (HR) slow-motion video from a low frame rate (LFR), low-resolution (LR) video. A simple solution is to split it into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR). However, temporal interpolation and spatial super-resolution are intra-related in this task, and two-stage methods cannot fully exploit this natural property. In addition, state-of-the-art VFI or VSR networks require a large frame-synthesis or reconstruction module for predicting high-quality video frames, which makes two-stage methods large in model size and thus time-consuming. To overcome these problems, we propose a one-stage space-time video super-resolution framework, which directly synthesizes an HR slow-motion video from an LFR, LR video. Rather than synthesizing missing LR video frames as VFI networks do, we first temporally interpolate LR frame features of the missing LR video frames, capturing local temporal contexts with the proposed feature temporal interpolation network. Then, we propose a deformable ConvLSTM to align and aggregate temporal information simultaneously for better leveraging global temporal contexts. Finally, a deep reconstruction network is adopted to predict HR slow-motion video frames. Extensive experiments on benchmark datasets demonstrate that the proposed method not only achieves better quantitative and qualitative performance but also is more than three times faster than recent two-stage state-of-the-art methods, e.g., DAIN+EDVR and DAIN+RBPN.
[video, temporal, frame, bidirectional, visual, state, hidden, predict, previous, sequence, time, temporally, exploit, current, explicit] [feature, propose, global, table, framework, map] [model, input] [deformable, interpolation, motion, convlstm, ieee, pattern, intermediate, proposed, fast, vfi, figure, stvsr, vsr, overlayed, method, handling, edvr, convolution, dain, convolutional, rbpn, spatial, sepconv, ftl, analysis, flow, residual, cell] [learn, image, corresponding, alignment, generate, missing, synthesizing, synthesize] [network, large, sampling, better, deep, function, performance, simultaneously, machine, rate, neural] [conference, computer, vision, local, reconstruction, accurate, directly, capture, handle, leverage, reconstruct]
@InProceedings{Xiang_2020_CVPR,
  author = {Xiang, Xiaoyu and Tian, Yapeng and Zhang, Yulun and Fu, Yun and Allebach, Jan P. and Xu, Chenliang},
  title = {Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fast MSER
Hailiang Xu, Siqi Xie, Fan Chen


Maximally Stable Extremal Regions (MSER) algorithms are based on the component tree and are used to detect invariant regions. OpenCV MSER, the most popular MSER implementation, uses a linked list to associate pixels with ERs. The data structure of an ER contains the attributes of a head and a tail linked node, which makes OpenCV MSER hard to parallelize using existing parallel component tree strategies. Besides, pixel extraction (i.e. extracting the pixels in MSERs) in OpenCV MSER is very slow. In this paper, we propose two novel MSER algorithms, called Fast MSER V1 and V2. They first divide an image into several spatial partitions, then construct sub-trees and doubly linked lists (for V1) or a labelled image (for V2) on the partitions in parallel. A novel sub-tree merging algorithm is used in V1 to merge the sub-trees into the final tree, with the doubly linked lists merged in the same process, while V2 merges the sub-trees using an existing merging algorithm. Finally, MSERs are recognized, and the pixels in them are extracted through two novel pixel extraction methods that take advantage of the fact that many pixels in parent and child MSERs are duplicated. Both V1 and V2 outperform three open source MSER algorithms (28 and 26 times faster than OpenCV MSER, respectively), and reduce the memory used for the pixels in MSERs by 78%.
[recognition, time, node, text, icdar, dataset, child, extract, three] [merging, parent, merge, faster, head, labelled, area, detection, region, object] [recognized, easily, robust] [mser, linked, pixel, msers, tree, fast, extraction, parallel, partition, figure, maximally, extremal, performed, channel, block, running, detectoreval, visiting, gray, usage, dark, green, discon] [image, component, list, extracted, corresponding, real, invariant] [memory, algorithm, tail, linear, stable, execution, size, note, merged, process, standard, number, procedure, small, set, compared, reduce, top] [novel, doubly, continuous, scene, opencv, array]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Hailiang and Xie, Siqi and Chen, Fan},
  title = {Fast MSER},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Person Re-Identification via Softened Similarity Learning
Yutian Lin, Lingxi Xie, Yu Wu, Chenggang Yan, Qi Tian


Person re-identification (re-ID) is an important topic in computer vision. This paper studies the unsupervised setting of re-ID, which does not require any labeled information and thus can be freely deployed to new scenarios. There are very few studies under this setting, and one of the best approaches to date uses iterative clustering and classification, so that unlabeled images are clustered into pseudo classes for a classifier to get trained, the updated features are used for clustering, and so on. This approach suffers from two problems, namely, the difficulty of determining the number of clusters, and the hard quantization loss in clustering. In this paper, we follow the iterative training mechanism but discard clustering, since it incurs loss from hard quantization, yet its only product, image-level similarity, can be easily replaced by pairwise computation and a softened classification task. With these improvements, our approach becomes more elegant and is more robust to hyper-parameter changes. Experiments on two image-based and video-based datasets demonstrate state-of-the-art performance under the unsupervised re-ID setting.
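The softened classification idea can be illustrated with a short, hypothetical loss in which part of the one-hot exemplar label mass is spread over an image's most similar neighbors (the paper's exact formulation differs):

import torch
import torch.nn.functional as F

def softened_label_loss(logits, own_index, neighbor_indices, alpha=0.2):
    """Illustrative softened exemplar classification loss.

    logits:           (N, num_images) classifier scores over all exemplars
    own_index:        (N,) index of each image's own exemplar class
    neighbor_indices: (N, k) indices of the k most similar images
                      (assumed not to include the image itself)
    """
    k = neighbor_indices.shape[1]
    target = torch.zeros_like(logits)
    target.scatter_(1, own_index.unsqueeze(1), 1.0 - alpha)   # softened own label
    target.scatter_(1, neighbor_indices, alpha / k)           # mass shared with neighbors
    log_prob = F.log_softmax(logits, dim=1)
    return -(target * log_prob).sum(dim=1).mean()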
[dataset, embedding, video, predict] [feature, map, hard, table, liang, global, propose, framework, pedestrian] [model, auxiliary, identity, query, iterative, robust, trained] [method, figure, proposed, based, ieee, high, captured, adopted] [person, unsupervised, image, softened, learn, cce, buc, dissimilarity, domain, loss, target, adaptation, encouragement, discriminative, chenggang, pseudo, shaogang, yutian, source, oneex, supervised] [similarity, learning, reliable, training, network, label, clustering, performance, set, unlabeled, classification, baseline, number, accuracy, impact, learned, distribution, class, parameter, labeled, negative, quantization, soft, deep, data, find, selected, better, observe] [distance, camera, ground, truth, term, approach]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Yutian and Xie, Lingxi and Wu, Yu and Yan, Chenggang and Tian, Qi},
  title = {Unsupervised Person Re-Identification via Softened Similarity Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
COCAS: A Large-Scale Clothes Changing Person Dataset for Re-Identification
Shijie Yu, Shihua Li, Dapeng Chen, Rui Zhao, Junjie Yan, Yu Qiao


Recent years have witnessed great progress in person re-identification (re-id). Several academic benchmarks such as Market1501, CUHK03 and DukeMTMC play important roles in promoting re-id research. To the best of our knowledge, all the existing benchmarks assume the same person will wear the same clothes, while in real-world scenarios it is very common for a person to change clothes. To address the clothes changing person re-id problem, we construct a novel large-scale re-id benchmark named Clothes Changing Person Set (COCAS), which provides multiple images of the same identity with different clothes. In total, COCAS contains 62,382 body images from 5,266 persons. Based on COCAS, we introduce a new person re-id setting for the clothes changing problem, where the query includes both a clothes template and a person image wearing different clothes. Moreover, we propose a two-branch network named Biometric-Clothes Network (BC-Net) which can effectively integrate biometric and clothes features for re-id under our setting. Experiments show that clothes changing re-id is feasible with clothes templates.
[dataset, retrieval, extract, red, multiple] [feature, map, faster, template, rcnn, branch, mask, detector, benchmark, including, anchor, liang, gallery, module, employed, xiaogang, detection, box, region, bounding, rui, named] [clothes, biometric, query, desensitized, combined, euclid, face, facial, dapeng, suspect, kind, datasets, identification, input, xqda, identity, white, trained, influence, testing, changed] [based, figure, method, partition, proposed, raw] [person, image, target, loss, changing, appearance, corresponding, realistic, dukemtmc] [set, learning, network, training, deep, triplet, performance, similarity, better, applied, data, indicates, metric, setting, neural, search] [body, human, joint, shape, hand, distance]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Shijie and Li, Shihua and Chen, Dapeng and Zhao, Rui and Yan, Junjie and Qiao, Yu},
  title = {COCAS: A Large-Scale Clothes Changing Person Dataset for Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Formation of Physically-Based Face Attributes
Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang, Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, Hao Li


Based on a combined data set of 4000 high resolution facial scans, we introduce a non-linear morphable face model, capable of producing multifarious face geometry of pore-level resolution, coupled with material attributes for use in physically-based rendering. We aim to maximize the variety of the participant's face identities, while increasing the robustness of correspondence between unique components, including middle-frequency geometry, albedo maps, specular intensity maps and high-frequency displacement details. Our deep learning based generative model learns to correlate albedo and geometry, which ensures the anatomical correctness of the generated assets. We demonstrate potential use of our generative model for novel identity generation, model fitting, interpolation, animation, high fidelity data visualization, and low-to-high resolution data domain transferring. We hope the release of this generative model will encourage further cooperation between all graphics, vision, and data focused professionals, while demonstrating the cumulative value of every individual's complete biometric profile.
[recognition] [offset, map] [face, model, expression, facial, identity, morphable, age, zexp, neutral, lexp, christian, trained, quality, ethnicity, input, djoint, adversarial, database, manipulation, skin] [resolution, based, ieee, figure, pattern, high, interpolation, light, method, low] [generated, generative, latent, image, texture, generate, enable, discriminator, code, introduce, generates, source] [data, network, training, set, linear, inference, learning, base, deep, number, space, large, statistical] [geometry, albedo, computer, vision, acm, conference, specular, capture, fitting, paul, scan, truth, hao, physically, displacement, reconstruction, well, rendering, reflectance, ground, thomas, material, joint, enables, pca, rendered, novel, limited]
@InProceedings{Li_2020_CVPR,
  author = {Li, Ruilong and Bladin, Karl and Zhao, Yajie and Chinara, Chinmay and Ingraham, Owen and Xiang, Pengda and Ren, Xinglei and Prasad, Pratusha and Kishore, Bipin and Xing, Jun and Li, Hao},
  title = {Learning Formation of Physically-Based Face Attributes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Generalized Product Quantization Network for Semi-Supervised Image Retrieval
Young Kyun Jang, Nam Ik Cho


Image retrieval methods that employ hashing or vector quantization have achieved great success by taking advantage of deep learning. However, these approaches do not meet expectations unless expensive label information is sufficient. To resolve this issue, we propose the first quantization-based semi-supervised image retrieval scheme: Generalized Product Quantization (GPQ) network. We design a novel metric learning strategy that preserves semantic similarity between labeled data, and employ entropy regularization term to fully exploit inherent potentials of unlabeled data. Our solution increases the generalization capacity of the quantization network, which allows overcoming previous limitations in the retrieval community. Extensive experimental results demonstrate that GPQ yields state-of-the-art performance on large-scale real image benchmark datasets.
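A simplified sketch of the soft product quantization idea is given below; SoftProductQuantizer and assignment_entropy are illustrative stand-ins, and the paper's metric-learning and entropy terms differ in detail.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftProductQuantizer(nn.Module):
    """Soft product quantization: the feature is split into M sub-vectors,
    each softly assigned to K learnable codewords (illustrative only)."""
    def __init__(self, dim=256, num_books=8, num_codewords=16, temp=10.0):
        super().__init__()
        assert dim % num_books == 0
        self.m, self.temp = num_books, temp
        self.sub_dim = dim // num_books
        self.codebooks = nn.Parameter(torch.randn(num_books, num_codewords, self.sub_dim))

    def forward(self, x):                            # x: (N, dim)
        n = x.shape[0]
        sub = x.view(n, self.m, 1, self.sub_dim)
        dist = ((sub - self.codebooks.unsqueeze(0)) ** 2).sum(-1)   # (N, M, K)
        assign = F.softmax(-self.temp * dist, dim=-1)               # soft assignments
        quantized = (assign.unsqueeze(-1) * self.codebooks.unsqueeze(0)).sum(2)
        return quantized.reshape(n, -1), assign

def assignment_entropy(assign):
    """Mean entropy of the soft assignments; an entropy-based term of this kind
    can act on unlabeled data, though the paper's regularizer differs in form."""
    p = assign.clamp_min(1e-8)
    return -(p * p.log()).sum(-1).mean()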
[retrieval, dataset, build, outperforms] [feature, semantic, map, table, lsem, employ, fully, lcls, benchmark, extractor] [protocol, database, experimental, query, input] [figure, proposed, ieee, high, based, method, optimized, analysis] [image, loss, learn, supervised, code, representation, train, generalized] [deep, quantization, unlabeled, data, hashing, learning, product, gpq, labeled, network, entropy, binary, similarity, training, vector, cosine, codewords, metric, label, performance, number, function, codeword, subspace, set, strategy, scheme, minimize, distribution, triplet, hash, classifier, simultaneously, find, class, classification, bit, machine, amount, search, approximate, neural] [distance, nearest, codebook, neighbor]
@InProceedings{Jang_2020_CVPR,
  author = {Jang, Young Kyun and Cho, Nam Ik},
  title = {Generalized Product Quantization Network for Semi-Supervised Image Retrieval},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Stereoscopic Flash and No-Flash Photography for Shape and Albedo Recovery
Xu Cao, Michael Waechter, Boxin Shi, Ye Gao, Bo Zheng, Yasuyuki Matsushita


We present a minimal imaging setup that harnesses both geometric and photometric approaches for shape and albedo recovery. We adopt a stereo camera and a flashlight to capture a stereo image pair and a flash/no-flash pair. From the stereo image pair, we recover a rough shape that captures low-frequency shape variation without high-frequency details. From the flash/no-flash pair, we derive an image formation model for Lambertian objects under natural lighting, based on which a fine normal map is obtained and fused with the rough shape. Further, we use the flash/no-flash pair for cast shadow detection and albedo canceling, making the shape recovery robust against shadows and albedo variation. We verify the effectiveness of our approach on both synthetic and real-world data.
[pair, recognition, natural, multiple, three, work, environmental, observed] [map, global, confidence] [model, variation, robust] [formation, light, figure, pattern, based, imaging, method, recover, intensity, comparison, illumination, photography] [image, fine, han, yan, synthetic] [ratio, setup, vector, optimization, set, linear, data, function, applied, note] [shape, normal, albedo, stereo, surface, coarse, shading, recovery, computer, vision, photometric, flash, acm, lighting, depth, geometric, flashlight, cast, mnf, spherical, left, camera, recovered, conference, single, international, lambertian, capture, geometry, lnf, initial, michael, approach, matching, direction, estimated, ground, error, assume, sfo]
@InProceedings{Cao_2020_CVPR,
  author = {Cao, Xu and Waechter, Michael and Shi, Boxin and Gao, Ye and Zheng, Bo and Matsushita, Yasuyuki},
  title = {Stereoscopic Flash and No-Flash Photography for Shape and Albedo Recovery},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Context-Aware Group Captioning via Self-Attention and Contrastive Features
Zhuowan Li, Quan Tran, Long Mai, Zhe Lin, Alan L. Yuille


While image captioning has progressed rapidly, existing works focus mainly on describing single images. In this paper, we introduce a new task, context-aware group captioning, which aims to describe a group of target images in the context of another group of related reference images. Context-aware group captioning requires not only summarizing information from both the target and reference image groups but also contrasting between them. To solve this problem, we propose a framework combining a self-attention mechanism with contrastive feature construction to effectively summarize common information from each image group while capturing discriminative information between them. To build the dataset for this task, we propose to group the images and generate the group captions based on single image captions using scene graph matching. Our datasets are constructed on top of the public Conceptual Captions dataset and our new Stock Captions dataset. Experiments on the two datasets show the effectiveness of our method on this new task.
[captioning, dataset, woman, stock, attention, visual, caption, language, context, describe, natural, cowboy, suggestion, graph, hat, description, text, referring, long, describing, summarize, constructed, automatic] [feature, conceptual, focus, table, effectiveness, grouping, propose, aggregation, split] [model, query, datasets, effectively, difference, noise, white] [reference, ieee, pattern, figure, method, based, proposed] [image, target, generate, user, representation, discriminative, generation, introduce, common, train] [group, contrastive, neural, performance, problem, set, data, learning, arxiv, preprint, larger, search, training, deep, setting, number, process, top, compared, processing, better] [conference, computer, vision, scene, international, joint, acm, single]
@InProceedings{Li_2020_CVPR,
  author = {Li, Zhuowan and Tran, Quan and Mai, Long and Lin, Zhe and Yuille, Alan L.},
  title = {Context-Aware Group Captioning via Self-Attention and Contrastive Features},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MEBOW: Monocular Estimation of Body Orientation in the Wild
Chenyan Wu, Yukun Chen, Jiajia Luo, Che-Chun Su, Anuja Dawane, Bikramjot Hanzra, Zhuo Deng, Bilan Liu, James Z. Wang, Cheng-hao Kuo


Body orientation estimation provides crucial visual cues in many applications, including robotics and autonomous driving. It is particularly desirable when the 3-D pose is difficult to infer due to poor image resolution, occlusion or indistinguishable body parts. We present COCO-MEBOW (Monocular Estimation of Body Orientation in the Wild), a new large-scale dataset for orientation estimation from a single in-the-wild image. The body-orientation labels for around 130K human bodies within 55K images from the COCO dataset have been collected using an efficient and high-precision annotation pipeline. We also validated the benefits of the dataset. First, we show that our dataset can substantially improve the performance and the robustness of a human body orientation estimation model, the development of which was previously limited by the scale and diversity of the available training data. Additionally, we present a novel triple-source solution for 3-D human pose estimation, where 3-D pose labels, 2-D pose labels, and our body-orientation labels are all used in joint training. Our model significantly outperforms state-of-the-art dual-source solutions for monocular 3-D human pose estimation, where training only uses 3-D pose labels and 2-D pose labels. This substantiates an important advantage of MEBOW for 3-D human pose estimation, which is particularly appealing because the per-instance labeling cost for body orientations is far less than that for 3-D poses. This work demonstrates the high potential of MEBOW in addressing real-world challenges involving understanding human behaviors. Further information on this work is available at https://chenyanwu.github.io/MEBOW/.
[dataset, evaluation, work, previous, prediction, outperforms, recognition] [coco, table, bin, backbone, annotation, predicted, hrnet, pedestrian, labeling, instance, bounding, cropped, supervision] [model, trained, datasets, input, example, evaluating, improving] [ieee, based, pattern, method, captured, proposed, figure, motion, existing] [loss, image, source, train, pretrained] [baseline, training, network, label, performance, learning, deep, set, test, classification, architecture, better, vector, labeled, neural, best, number] [orientation, pose, human, estimation, body, mebow, tud, conference, computer, hboe, vision, international, joint, continuous, direction, single, monocular, ground, michael, left, approach, computed, shape, solution, camera, additional]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Chenyan and Chen, Yukun and Luo, Jiajia and Su, Che-Chun and Dawane, Anuja and Hanzra, Bikramjot and Deng, Zhuo and Liu, Bilan and Wang, James Z. and Kuo, Cheng-hao},
  title = {MEBOW: Monocular Estimation of Body Orientation in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Distilling Image Dehazing With Heterogeneous Task Imitation
Ming Hong, Yuan Xie, Cuihua Li, Yanyun Qu


State-of-the-art deep dehazing models are often difficult to train. Knowledge distillation paves the way to train a student network assisted by a teacher network. However, most knowledge distillation methods are used for image classification, segmentation and object detection, and few investigate distillation for image restoration or the use of a different task for knowledge transfer. In this paper, we propose a knowledge-distillation dehazing network which distills image dehazing with heterogeneous task imitation. In our network, the teacher is an off-the-shelf auto-encoder used for image reconstruction. The dehazing network is trained assisted by the teacher network with a process-oriented learning mechanism: the student network imitates the image reconstruction task of the teacher network. Moreover, we design a spatially-weighted channel-attention residual block for the student image dehazing network to adaptively learn content-aware channel-level attention and to pay more attention to the features needed for reconstructing dense hazy regions. To evaluate the effectiveness of the proposed method, we compare our method with several state-of-the-art methods on two synthetic and real-world datasets, as well as on real hazy images.
@InProceedings{Hong_2020_CVPR,
  author = {Hong, Ming and Xie, Yuan and Li, Cuihua and Qu, Yanyun},
  title = {Distilling Image Dehazing With Heterogeneous Task Imitation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Select, Supplement and Focus for RGB-D Saliency Detection
Miao Zhang, Weisong Ren, Yongri Piao, Zhengkun Rong, Huchuan Lu


Depth data containing a preponderance of discriminative power in location have been proven beneficial for accurate saliency prediction. However, RGB-D saliency detection methods are also negatively influenced by randomly distributed erroneous or missing regions on the depth map or along the object boundaries. This offers the possibility of achieving more effective inference by well designed models. In this paper, we propose a new framework for accurate RGB-D saliency detection taking account of local and global complementarities from two modalities. This is achieved by designing a complimentary interaction model discriminative enough to simultaneously select useful representation from RGB and depth data, and meanwhile to refine the object boundaries. Moreover, we propose a compensation-aware loss to further process the information not being considered in the complimentary interaction model, leading to improvement of the generalization ability for challenging scenes. Experiments on six public datasets show that our method outperforms 18 state-of-the-art methods.
[attention, modal, unit, three, visual, unreliable, interaction, decoder, work, prediction] [saliency, salient, boundary, object, detection, map, edge, represents, table, cau, propose, hard, region, feature, supplement, challenging, bsu, background, final, mask, level, location, complimentary, cim, lfsd, focus, huchuan, module, foreground, concatenate] [effectively, effective, model, datasets, help] [figure, ieee, proposed, pattern, method, fusion, based, convolutional, introduced, erroneous, mae, high] [loss, image, generate, discriminative, row, generated] [network, baseline, number, binary, select, performance, design] [depth, rgb, computer, conference, vision, accurate, ground, stereo, international, initial, demonstrate, local, complex]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Miao and Ren, Weisong and Piao, Yongri and Rong, Zhengkun and Lu, Huchuan},
  title = {Select, Supplement and Focus for RGB-D Saliency Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Transfer Learning From Synthetic to Real-Noise Denoising With Adaptive Instance Normalization
Yoonsik Kim, Jae Woong Soh, Gu Yong Park, Nam Ik Cho


Real-noise denoising is a challenging task because the statistics of real noise do not follow a normal distribution, and they are also spatially and temporally changing. In order to cope with various and complex real noise, we propose a well-generalized denoising architecture and a transfer learning scheme. Specifically, we adopt adaptive instance normalization to build a denoiser, which can regularize the feature map and prevent the network from overfitting to the training set. We also introduce a transfer learning scheme that transfers knowledge learned from synthetic-noise data to the real-noise denoiser. With the proposed transfer learning, the synthetic-noise denoiser can learn general features from various synthetic-noise data, and the real-noise denoiser can learn the real-noise characteristics from real data. From the experiments, we find that the proposed denoising method has great generalization ability, such that our network trained with synthetic noise achieves the best performance on the Darmstadt Noise Dataset (DND) among the methods from published papers. We also see that the proposed transfer learning scheme works robustly for real-noise images even when learning with a very small number of labeled data.
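A minimal sketch of an adaptive-instance-normalization block is shown below. The conditioning input and the 1x1 parameter predictors are assumptions for illustration; the paper's AIN design and its placement inside the denoiser differ in detail.

import torch
import torch.nn as nn

class AdaptiveInstanceNorm(nn.Module):
    """Feature maps are instance-normalized and then re-scaled/shifted with
    parameters predicted from a conditioning input (illustrative only)."""
    def __init__(self, channels, cond_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_gamma = nn.Conv2d(cond_channels, channels, kernel_size=1)
        self.to_beta = nn.Conv2d(cond_channels, channels, kernel_size=1)

    def forward(self, feat, cond):
        # cond is assumed to share the spatial resolution of feat
        return self.to_gamma(cond) * self.norm(feat) + self.to_beta(cond)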
[dataset, work, recognition, order] [level, feature, denotes, map, cnn, table, instance, propose, achieves, challenge] [noise, trained, model] [denoiser, ieee, proposed, denoising, pattern, blind, noisy, ain, method, sidd, lei, gaussian, cbdnet, overfitted, aindnet, dnd, based, convolutional, presented, psnr, adaptive, figure, denoised, residual, awgn, ridnet, wangmeng, kai] [image, transfer, real, synthetic, domain, train, discrepancy, generalized, generated, learns, loss] [training, learning, data, performance, network, scheme, test, regularizer, normalization, best, number, average, architecture, deep, set, parameter, overfitting, small, better, regularization, learned, general] [conference, computer, vision, estimator, reconstruction, well, single, international, camera, novel, estimation, michael]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Yoonsik and Soh, Jae Woong and Park, Gu Yong and Cho, Nam Ik},
  title = {Transfer Learning From Synthetic to Real-Noise Denoising With Adaptive Instance Normalization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
On Joint Estimation of Pose, Geometry and svBRDF From a Handheld Scanner
Carolin Schmitt, Simon Donne, Gernot Riegler, Vladlen Koltun, Andreas Geiger


We propose a novel formulation for joint recovery of camera pose, object geometry and spatially-varying BRDF. The input to our approach is a sequence of RGB-D images captured by a mobile, hand-held scanner that actively illuminates the scene with point light sources. Compared to previous works that jointly estimate geometry and materials from a hand-held scanner, we formulate this problem using a single objective function that can be minimized using off-the-shelf gradient-based solvers. By integrating material clustering as a differentiable operation into the optimization process, we avoid pre-processing heuristics and demonstrate that our model is able to determine the correct number of specular materials independently. We provide a study on the importance of each component in our formulation and on the requirements of the initial geometry. We show that optimizing over the poses is crucial for accurately recovering fine details and show that our approach naturally results in a semantically meaningful material segmentation.
[multiple, provide] [object, denotes, location] [model, input, robust] [light, method, figure, reference, pixel, recover, proposed, sensor, captured, illumination, spatially] [image, handheld, appearance, variational] [optimization, number, base, problem, function, objective, optimizing, set, optimize, evaluate, note, test] [geometry, photometric, material, depth, surface, camera, approach, specular, stereo, acm, reflectance, shape, brdf, joint, point, geometric, reconstruction, initial, normal, yvain, estimation, assume, roberto, estimate, daniel, svbrdf, jointly, single, term, michael, scene, varying, well, higo, demonstrate, recovering, shading, accurate, error, estimating, estimated, view, formulation, diffuse, assuming, allows, smoothness, albedo, volumetric, dense, tsdf, matthias]
@InProceedings{Schmitt_2020_CVPR,
  author = {Schmitt, Carolin and Donne, Simon and Riegler, Gernot and Koltun, Vladlen and Geiger, Andreas},
  title = {On Joint Estimation of Pose, Geometry and svBRDF From a Handheld Scanner},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Differentiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D Supervision
Michael Niemeyer, Lars Mescheder, Michael Oechsle, Andreas Geiger


Learning-based 3D reconstruction methods have shown impressive results. However, most methods require 3D supervision which is often hard to obtain for real-world datasets. Recently, several works have proposed differentiable rendering techniques to train reconstruction models from RGB images. Unfortunately, these approaches are currently restricted to voxel- and mesh-based representations, suffering from discretization or low resolution. In this work, we propose a differentiable rendering formulation for implicit shape and texture representations. Implicit representations have recently gained popularity as they represent shape and texture continuously. Our key insight is that depth gradients can be derived analytically using the concept of implicit differentiation. This allows us to learn implicit shape and texture representations directly from RGB images. We experimentally show that our single-view reconstructions rival those learned with full 3D supervision. Moreover, we find that our method can be used for multi-view 3D reconstruction, directly resulting in watertight meshes.
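A rough sketch of locating the surface point along each camera ray for an implicit occupancy network is given below (uniform marching plus secant refinement). This is an illustrative simplification: the paper's key contribution, the analytic depth gradient obtained via implicit differentiation, is not reproduced here.

import torch

def find_surface_depth(occupancy_fn, ray_o, ray_d, near=0.0, far=4.0,
                       n_steps=64, tau=0.5, secant_iters=8):
    """occupancy_fn maps (P, 3) points to (P,) occupancies in [0, 1];
    ray_o, ray_d are (R, 3) origins and unit directions. Returns (R,) depths."""
    ts = torch.linspace(near, far, n_steps, device=ray_o.device)          # (S,)
    pts = ray_o[:, None, :] + ts[None, :, None] * ray_d[:, None, :]       # (R, S, 3)
    occ = occupancy_fn(pts.reshape(-1, 3)).reshape(ray_o.shape[0], n_steps)
    inside = occ > tau
    # number of leading "outside" samples = index of the first sample inside the surface
    first = (~inside).float().cumprod(dim=1).sum(dim=1).long().clamp(1, n_steps - 1)
    t_lo, t_hi = ts[first - 1], ts[first]
    f_lo = occ.gather(1, (first - 1).unsqueeze(1)).squeeze(1) - tau
    f_hi = occ.gather(1, first.unsqueeze(1)).squeeze(1) - tau
    for _ in range(secant_iters):                                         # secant refinement
        t_mid = t_lo - f_lo * (t_hi - t_lo) / (f_hi - f_lo + 1e-8)
        f_mid = occupancy_fn(ray_o + t_mid[:, None] * ray_d) - tau
        go_hi = f_mid > 0
        t_hi = torch.where(go_hi, t_mid, t_hi)
        f_hi = torch.where(go_hi, f_mid, f_hi)
        t_lo = torch.where(go_hi, t_lo, t_mid)
        f_lo = torch.where(go_hi, f_lo, f_mid)
    return t_lo - f_lo * (t_hi - t_lo) / (f_hi - f_lo + 1e-8)

Rays that never cross the surface keep an arbitrary bracket in this sketch; a full implementation would mask them out.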
[recognition, infer, visual, prediction] [object, supervision, predicted, table] [model, ray, trained, input] [ieee, pattern, method, spsr, contrast, figure, output, pixel] [texture, image, loss, learn, representation, supervised, train, unsupervised] [learning, network, neural, deep, gradient, space, backward, processing, training, data, pass, investigate, trim, machine, soft] [vision, computer, depth, surface, implicit, reconstruction, shape, differentiable, international, point, single, occupancy, volumetric, rendering, michael, ground, truth, andreas, rgb, european, directly, define, geometry, approach, require, mesh, stereo, camera, dtu, sparse, voxel, lrgb, colmap, hao, allows, full, hull, predicts, ldepth, lars]
@InProceedings{Niemeyer_2020_CVPR,
  author = {Niemeyer, Michael and Mescheder, Lars and Oechsle, Michael and Geiger, Andreas},
  title = {Differentiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D Supervision},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Meta-Transfer Learning for Zero-Shot Super-Resolution
Jae Woong Soh, Sunwoo Cho, Nam Ik Cho


Convolutional neural networks (CNNs) have shown dramatic improvements in single image super-resolution (SISR) by using large-scale external samples. Despite their remarkable performance based on external datasets, they cannot exploit internal information within a specific image. Another problem is that they are applicable only to the specific conditions of the data on which they were supervised. For instance, the low-resolution (LR) image should be a "bicubic" downsampled noise-free image from a high-resolution (HR) one. To address both issues, zero-shot super-resolution (ZSSR) has been proposed for flexible internal learning. However, it requires thousands of gradient updates, i.e., a long inference time. In this paper, we present Meta-Transfer Learning for Zero-Shot Super-Resolution (MZSR), which leverages ZSSR. Precisely, it is based on finding a generic initial parameter that is suitable for internal learning. Thus, we can exploit both external and internal information, where a single gradient update can yield quite considerable results. With our method, the network can quickly adapt to a given image condition. In this respect, our method can be applied to a large spectrum of image conditions within a fast adaptation process.
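The meta-training loop can be sketched in a first-order MAML style, as below. The task tuples, L1 losses and learning rates are illustrative assumptions; MZSR additionally performs ZSSR-style self-supervised adaptation on the test image itself at inference time.

import copy
import torch
import torch.nn.functional as F

def meta_transfer_step(model, tasks, inner_lr=1e-2, meta_lr=1e-4):
    """One first-order meta step: for each SR 'task' (a specific degradation),
    adapt a copy of the model with a single gradient update and use its
    post-adaptation gradients as an approximate meta-gradient.

    tasks: list of (lr_train, hr_train, lr_test, hr_test) tensor tuples
    """
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for lr_tr, hr_tr, lr_te, hr_te in tasks:
        adapted = copy.deepcopy(model)
        # inner update: one plain gradient step on the task's training pair
        inner_loss = F.l1_loss(adapted(lr_tr), hr_tr)
        grads = torch.autograd.grad(inner_loss, list(adapted.parameters()))
        with torch.no_grad():
            for p, g in zip(adapted.parameters(), grads):
                p -= inner_lr * g
        # outer loss on the held-out pair of the same task
        outer_loss = F.l1_loss(adapted(lr_te), hr_te)
        outer_grads = torch.autograd.grad(outer_loss, list(adapted.parameters()))
        for mg, og in zip(meta_grads, outer_grads):
            mg += og
    with torch.no_grad():                       # update the shared initialization
        for p, mg in zip(model.parameters(), meta_grads):
            p -= meta_lr * mg / len(tasks)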
[dataset, natural, time, exploit, step] [cnn, table, adopt] [internal, model, external, trained, help] [ieee, method, kernel, zssr, mzsr, blur, pattern, proposed, bicubic, downsampling, figure, based, psnr, fast, isotropic, gaussian, degradation, rcan, ilr, residual, result, lightweight] [image, learn, specific, unsupervised, learns, loss, adaptation, real, transfer] [gradient, learning, training, test, update, network, number, performance, large, task, descent, requires, maml, algorithm, compared, neural, parameter, deep, scaling, distribution, average, adapt, adapted, simple] [conference, computer, vision, single, initial, point, international]
@InProceedings{Soh_2020_CVPR,
  author = {Soh, Jae Woong and Cho, Sunwoo and Cho, Nam Ik},
  title = {Meta-Transfer Learning for Zero-Shot Super-Resolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Solving Jigsaw Puzzles With Eroded Boundaries
Dov Bridger, Dov Danon, Ayellet Tal


Jigsaw puzzle solving is an intriguing problem which has been explored in computer vision for decades. This paper focuses on a specific variant of the problem--solving puzzles with eroded boundaries. Such erosion makes the problem extremely difficult, since most existing solvers utilize solely the information at the boundaries. Nevertheless, this variant is important since erosion and missing data often occur at the boundaries. The key idea of our proposed approach is to inpaint the eroded boundaries between puzzle pieces and later leverage the quality of the inpainted area to classify a pair of pieces as "neighbors or not". An interesting feature of our architecture is that the same GAN discriminator is used for both inpainting and classification; training of the second task is simply a continuation of the training of the first, beginning from the point it left off. We show that our approach outperforms other SOTA methods.
[pair, goal, previous, order, prediction, placement, outperforms, correct] [positive, key, table, extent, area, boundary] [input, case, trained, quality, model, original, example] [method, figure, based, result, cvpr, adjacent, output, ieee, proposed] [image, discriminator, puzzle, inpainting, jigsaw, erosion, dissimilarity, missing, generated, gap, piece, eroded, idea, generator, loss, inpainted, train, gan, consists, unknown, encoder, fresh, israel] [training, problem, learning, probability, pairwise, negative, larger, algorithm, greedy, better, function, computing, classify, task, set, classification, layer, large] [solving, neighbor, square, solution, solve, vision, solver, computer, second, acm, single, defined, handle, continuous]
@InProceedings{Bridger_2020_CVPR,
  author = {Bridger, Dov and Danon, Dov and Tal, Ayellet},
  title = {Solving Jigsaw Puzzles With Eroded Boundaries},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Context-Aware Attention Network for Image-Text Retrieval
Qi Zhang, Zhen Lei, Zhaoxiang Zhang, Stan Z. Li


As a typical cross-modal problem, image-text bi-directional retrieval relies heavily on the joint embedding learning and similarity measure for each image-text pair. It remains challenging because prior works seldom explore semantic correspondences between modalities and semantic correlations in a single modality at the same time. In this work, we propose a unified Context-Aware Attention Network (CAAN), which selectively focuses on critical local fragments (regions and words) by aggregating the global context. Specifically, it simultaneously utilizes global inter-modal alignments and intra-modal correlations to discover latent semantic relations. Considering the interactions between images and sentences in the retrieval process, intra-modal correlations are derived from the second-order attention of region-word alignments instead of intuitively comparing the distance between original features. Our method achieves fairly competitive results on two generic image-text retrieval datasets Flickr30K and MS-COCO.
[attention, retrieval, visual, context, sentence, caan, language, question, word, modality, bidirectional, text, textual, mechanism, embedding, considering, embeddings, attend, multimodal, explore, selectively] [semantic, global, table, region, focus, achieves, propose, feature, final, unified] [model] [based, proposed, figure, method, adaptive, formulated, convolutional] [image, latent, learn, perform] [test, matrix, network, learning, process, similarity, neural, compared, baseline, arxiv, preprint, deep, consider, learned, function, set, fairly, ranking, better, objective] [local, single, matching, full, fragment, joint]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Qi and Lei, Zhen and Zhang, Zhaoxiang and Li, Stan Z.},
  title = {Context-Aware Attention Network for Image-Text Retrieval},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
M-LVC: Multiple Frames Prediction for Learned Video Compression
Jianping Lin, Dong Liu, Houqiang Li, Feng Wu


We propose an end-to-end learned video compression scheme for low-latency scenarios. Previous methods are limited to using only the previous frame as the reference. Our method introduces the usage of multiple previous frames as references. In our scheme, the motion vector (MV) field is calculated between the current frame and the previous one. With multiple reference frames and the associated multiple MV fields, our designed network can generate a more accurate prediction of the current frame, yielding less residual. Multiple reference frames also help generate the MV prediction, which reduces the coding cost of the MV field. We use two deep auto-encoders to compress the residual and the MV, respectively. To compensate for the compression error of the auto-encoders, we further design an MV refinement network and a residual refinement network, making use of the multiple reference frames as well. All the modules in our scheme are jointly optimized through a single rate-distortion loss function. We use a step-by-step training strategy to optimize the entire scheme. Experimental results show that the proposed method outperforms the existing learned video compression methods for low-latency mode. Our method also performs better than H.265 in both PSNR and MS-SSIM. Our code and models are publicly available.
[video, frame, multiple, previous, prediction, current, encoding, time, step, three, work, recurrent, predict] [add, refinement, predicted, propose, feature] [model, original, adding, experimental, curve, trained] [compression, residual, reference, motion, proposed, decoded, method, mvd, coding, based, dvc, compensation, field, convolutional, figure, hevc, block, compress, traditional, warped, psnr] [image, latent, loss, train, perform, generate] [network, learned, scheme, bit, training, entropy, class, gain, deep, buffer, compressing, reduce, size, neural, number, rate, experiment, arxiv, preprint, vector, compared] [reconstructed, error, jointly, single, directly, david, estimation]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Jianping and Liu, Dong and Li, Houqiang and Wu, Feng},
  title = {M-LVC: Multiple Frames Prediction for Learned Video Compression},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Efficient Dynamic Scene Deblurring Using Spatially Variant Deconvolution Network With Optical Flow Guided Training
Yuan Yuan, Wei Su, Dandan Ma


In order to remove the non-uniform blur of images captured from dynamic scenes, many deep learning based methods design deep networks with large receptive fields and strong fitting capabilities, or use a multi-scale strategy to deblur the image gradually on different scales. Restricted by their fixed structures and parameters, these methods are always huge in model size in order to handle complex blurs. In this paper, we start from the deblurring deconvolution operation and then design an effective and real-time deblurring network. The main contributions are threefold: 1) we construct a spatially variant deconvolution network using modulated deformable convolutions, which can adjust receptive fields adaptively according to the blur features; 2) our analysis shows that the sampling points of a deformable convolution can be used to approximate the blur kernel and can be simplified to bi-directional optical flows, so the position learning of the sampling points can be supervised by bi-directional optical flows; 3) we build a light-weight backbone for the image restoration problem, which balances computation and effectiveness well. Experimental results show that the proposed method achieves state-of-the-art deblurring performance with fewer parameters and a shorter running time.
[order, shortest, regular] [feature, module, backbone, achieves, guided, global, neck, effectiveness, represents, head] [input, model] [deblurring, blur, optical, deformable, ieee, deconvolution, blurry, convolution, proposed, pattern, dynamic, flow, receptive, kernel, method, figure, convolutional, motion, skip, clear, based, spatial, modulated, dilated, perceptual, prior, field, conv, formulated, spatially, dilation, inverse] [image, loss, content, generate, train] [network, sampling, deep, training, learning, approximate, neural, better, large, design, number, performance, size, operation, calculate, set, variant, process, adjusted, architecture, larger, algorithm] [conference, computer, vision, scene, international, position, local]
@InProceedings{Yuan_2020_CVPR,
  author = {Yuan, Yuan and Su, Wei and Ma, Dandan},
  title = {Efficient Dynamic Scene Deblurring Using Spatially Variant Deconvolution Network With Optical Flow Guided Training},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Single Image Reflection Removal Through Cascaded Refinement
Chao Li, Yixiao Yang, Kun He, Stephen Lin, John E. Hopcroft


We address the problem of removing undesirable reflections from a single image captured through a glass surface, which is an ill-posed, challenging, but practically important problem for photo enhancement. Inspired by iterative structure reduction for hidden community detection in social networks, we propose an Iterative Boost Convolutional LSTM Network (IBCLN) that enables cascaded prediction for reflection removal. IBCLN is a cascaded network that iteratively refines the estimates of the transmission and reflection layers in a manner that allows them to boost each other's prediction quality, with information across steps of the cascade transferred using an LSTM. The intuition is that the transmission is the strong, dominant structure while the reflection is the weak, hidden structure. They are complementary to each other in a single image, and thus a better estimate and reduction of one layer from the original image leads to a more accurate estimate of the other. To facilitate training over multiple cascade steps, we employ an LSTM to address the vanishing gradient problem, and propose a residual reconstruction loss as further training guidance. Besides, we create a dataset of real-world images with reflection and ground-truth transmission layers to mitigate the problem of insufficient data. Comprehensive experiments demonstrate that the proposed method can effectively remove reflections in real and synthetic images compared with state-of-the-art reflection removal methods.
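The cascaded refinement can be sketched as a simple loop over joint transmission/reflection estimates. CascadedSeparation below is a toy stand-in: the real IBCLN uses two dedicated sub-networks with a ConvLSTM carrying state across the cascade steps.

import torch
import torch.nn as nn

class CascadedSeparation(nn.Module):
    """Illustrative cascaded reflection/transmission refinement."""
    def __init__(self, ch=32, steps=3):
        super().__init__()
        self.steps = steps
        self.refine = nn.Sequential(
            nn.Conv2d(9, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 6, 3, padding=1), nn.Sigmoid())

    def forward(self, image):                        # image: (N, 3, H, W) in [0, 1]
        t = image.clone()                            # current transmission estimate
        r = torch.zeros_like(image)                  # current reflection estimate
        outputs = []
        for _ in range(self.steps):
            out = self.refine(torch.cat([image, t, r], dim=1))
            t, r = out[:, :3], out[:, 3:]            # each estimate helps refine the other
            outputs.append((t, r))
        return outputs                               # supervise every step during training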
[time, lstm, step, dataset, hidden, prediction, previous, three, multiple, visual] [predicted, boost, cascade, table, propose, detection] [model, input, auxiliary, iterative, complementary, original, quality] [transmission, residual, convolutional, removal, ibcln, ieee, figure, zhang, psnr, ssim, conv, pattern, cascaded, relu, method, proposed, perceptual, reflection, glass, output, captured, rmnet, imaging] [image, loss, synthetic, synthesis, real] [network, layer, training, linear, set, architecture, best, performance, deep, learning, neural, objective, function, total, problem, better, compared, iteration, data] [computer, conference, single, reconstruction, vision, ground, truth, approach, international, structure, dominant, accurate, iteratively]
@InProceedings{Li_2020_CVPR,
  author = {Li, Chao and Yang, Yixiao and He, Kun and Lin, Stephen and Hopcroft, John E.},
  title = {Single Image Reflection Removal Through Cascaded Refinement},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
From Patches to Pictures (PaQ-2-PiQ): Mapping the Perceptual Space of Picture Quality
Zhenqiang Ying, Haoran Niu, Praful Gupta, Dhruv Mahajan, Deepti Ghadiyaram, Alan Bovik


Blind or no-reference (NR) perceptual picture quality prediction is a difficult, unsolved problem of great consequence to the social and streaming media industries that impacts billions of viewers daily. Unfortunately, popular NR prediction models perform poorly on real-world distorted pictures. To advance progress on this problem, we introduce the largest (by far) subjective picture quality database, containing about 40,000 real-world distorted pictures and 120,000 patches, on which we collected about 4M human judgments of picture quality. Using these picture and patch quality labels, we built deep region-based architectures that learn to produce state-of-the-art global picture quality predictions as well as useful local picture quality maps. Our innovations include picture quality prediction architectures that produce global-to-local inferences as well as local-to-global inferences (via feedback). The dataset and source code are available at https://live.ece.utexas.edu/research.php.
[prediction, visual, dataset, social, predict, three, video] [global, table, feature, pooling] [quality, picture, model, roipool, database, subjective, feedback, lcc, distortion, trained, srcc, study, clive, brisque, largest, subject, niqe, collected, highly, scatter, nima, assessment, distorted, cnniqa, iqa] [ieee, patch, perceptual, blind, pattern, aspect, june, signal, spatial] [image, content, diverse, produce, synthetic] [deep, learning, linear, test, validation, performance, training, baseline, data, neural, problem, popular, set, number, size, better, randomly, layer, applied] [human, local, vision, well, collection, unique, complex, single]
@InProceedings{Ying_2020_CVPR,
  author = {Ying, Zhenqiang and Niu, Haoran and Gupta, Praful and Mahajan, Dhruv and Ghadiyaram, Deepti and Bovik, Alan},
  title = {From Patches to Pictures (PaQ-2-PiQ): Mapping the Perceptual Space of Picture Quality},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Video to Events: Recycling Video Datasets for Event Cameras
Daniel Gehrig, Mathias Gehrig, Javier Hidalgo-Carrio, Davide Scaramuzza


Event cameras are novel sensors that output brightness changes in the form of a stream of asynchronous "events" instead of intensity frames. They offer significant advantages over conventional cameras: high dynamic range (HDR), high temporal resolution, and no motion blur. Recently, novel learning approaches operating on event data have achieved impressive results. Yet, these methods require a large amount of event data for training, which is hardly available due to the novelty of event sensors in computer vision research. In this paper, we present a method that addresses these needs by converting any existing video dataset recorded with conventional cameras to synthetic event data. This unlocks the use of a virtually unlimited number of existing video datasets for training networks designed for real event data. We evaluate our method on two relevant vision tasks, i.e., object recognition and semantic segmentation, and show that models trained on synthetic events have several benefits: (i) they generalize well to real event data, even in scenarios where standard-camera images are blurry or overexposed, by inheriting the outstanding properties of event cameras; (ii) they can be used for fine-tuning on real data to improve over the state of the art for both classification and semantic segmentation.
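A toy per-pixel contrast-threshold conversion is sketched below; frames_to_events is an illustrative simplification, and the paper's converter additionally interpolates the video to a high frame rate and uses a more faithful event-camera simulator.

import numpy as np

def frames_to_events(frames, timestamps, contrast_threshold=0.25, eps=1e-3):
    """Emit an event whenever the log-intensity at a pixel has changed by more
    than the contrast threshold since that pixel's last event.

    frames:     (T, H, W) grayscale frames in [0, 1]
    timestamps: (T,) frame times in seconds
    returns:    list of (t, y, x, polarity) tuples
    """
    log_ref = np.log(frames[0] + eps)       # log intensity at each pixel's last event
    events = []
    for t in range(1, len(frames)):
        log_cur = np.log(frames[t] + eps)
        diff = log_cur - log_ref
        ys, xs = np.nonzero(np.abs(diff) >= contrast_threshold)
        for y, x in zip(ys, xs):
            polarity = 1 if diff[y, x] > 0 else -1
            events.append((timestamps[t], y, x, polarity))
            log_ref[y, x] = log_cur[y, x]   # reset the reference at this pixel
    return events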
[video, frame, dataset, work, temporal, state, recognition, time, described, three, visual] [semantic, segmentation, object, score, threshold, art, challenging, davis, positive] [trained, datasets, model, original, case] [event, high, method, ieee, contrast, interpolation, low, pattern, davide, intensity, existing, dynamic, pixel, conventional, motion, converting, proposed, tobi, sensor, neuromorphic, brightness, asynchronous, figure, intermediate, optical] [real, synthetic, generated, image, generate, generalize, train, fine, generative, generation, est, address, representation, alonso, domain] [learning, data, network, test, rate, large, standard, training, classification, number, deep, neural, evaluate, accuracy, negative] [camera, vision, well, computer, novel, recorded, front]
@InProceedings{Gehrig_2020_CVPR,
  author = {Gehrig, Daniel and Gehrig, Mathias and Hidalgo-Carrio, Javier and Scaramuzza, Davide},
  title = {Video to Events: Recycling Video Datasets for Event Cameras},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Composed Query Image Retrieval Using Locally Bounded Features
Mehrdad Hosseinzadeh, Yang Wang


Composed query image retrieval is a new problem where the query consists of an image together with a requested modification expressed via a textual sentence. The goal is then to retrieve the images that are generally similar to the query image, but differ according to the requested modification. Previous methods usually consider the image as a whole. In this paper, we propose a novel method that represents the image using a set of local areas in the image. The relationship between each word in the modification text and each area in the image is then explicitly established, allowing the model to accurately correlate the modification text to parts of the image. We conduct extensive experiments on three benchmark datasets. The results show that our method outperforms other state-of-the-art approaches by a considerable margin.
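A hedged sketch of the kind of word-to-area correlation described above: each word of the modification text attends over local image area features, and the result is fused into a single composed query vector. All tensor shapes and the fusion rule are illustrative assumptions, not the paper's architecture.

```python
import torch

def compose_query(word_emb, region_feat):
    """word_emb: B x T x D modification-text embeddings; region_feat: B x R x D local area features."""
    scale = word_emb.size(-1) ** 0.5
    attn = torch.softmax(torch.bmm(word_emb, region_feat.transpose(1, 2)) / scale, dim=-1)  # B x T x R
    attended = torch.bmm(attn, region_feat)        # image areas most relevant to each word
    return (word_emb + attended).mean(dim=1)       # one composed query vector per example
```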
[retrieval, visual, text, recognition, composed, dataset, attention, work, embedding, language, outperforms, textual, word, sentence, relationship, tmax, attended, tirg, previous, question] [module, feature, region, main, propose, table, object] [query, auxiliary, input, effectively, model, testing, face] [method, proposed, ieee, pattern, reference, figure, output] [image, modification, representation, target, loss, source, consists, avg, requested, user, retrieved] [candidate, network, vector, learning, set, layer, deep, processing, problem, linear, average, training, performance, neural, product, filter, setting, test, consider, function, dimension, pool, triplet, requires, applied] [conference, vision, computer, local, joint, international, distance, coarse]
@InProceedings{Hosseinzadeh_2020_CVPR,
  author = {Hosseinzadeh, Mehrdad and Wang, Yang},
  title = {Composed Query Image Retrieval Using Locally Bounded Features},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Spatially-Attentive Patch-Hierarchical Network for Adaptive Motion Deblurring
Maitreya Suin, Kuldeep Purohit, A. N. Rajagopalan


This paper tackles the problem of motion deblurring of dynamic scenes. Although end-to-end fully convolutional designs have recently advanced the state-of-the-art in non-uniform motion deblurring, their performance-complexity trade-off is still sub-optimal. Existing approaches achieve a large receptive field by increasing the number of generic convolution layers and kernel-size, but this comes at the expense of increased model size and reduced inference speed. In this work, we propose an efficient pixel adaptive and feature attentive design for handling large blur variations across different spatial locations and process each test image adaptively. We also propose an effective content-aware global-local filtering module that significantly improves performance by considering not only global dependencies but also by dynamically exploiting neighboring pixel information. We use a patch-hierarchical attentive architecture composed of the above module that implicitly discovers the spatial variations in the blur present in the input image and, in turn, performs local and global modulation of intermediate features. Extensive qualitative and quantitative comparisons with prior art on deblurring benchmarks demonstrate that our design offers significant improvements over the state-of-the-art in accuracy as well as speed.
[attention, decoder, previous, work] [module, feature, global, table, level, map, mask, key, attentive] [input, model, hide] [deblurring, motion, dynamic, blur, convolutional, pixel, ieee, receptive, convolution, spatial, proposed, gopro, blurred, adaptive, field, figure, kernel, method, filtering, high, fusion, pattern, adaptively, output, psnr, existing, quantitative, spatially, handling] [image, encoder, generate, cross, pdf] [network, processing, performance, better, filter, large, design, number, efficient, neural, compared, standard, test, architecture, deep, memory, matrix, set, multiplication, increase, accuracy] [computer, local, conference, vision, approach, single, scene, international, well, camera]
@InProceedings{Suin_2020_CVPR,
  author = {Suin, Maitreya and Purohit, Kuldeep and Rajagopalan, A. N.},
  title = {Spatially-Attentive Patch-Hierarchical Network for Adaptive Motion Deblurring},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
End-to-End Illuminant Estimation Based on Deep Metric Learning
Bolei Xu, Jingxin Liu, Xianxu Hou, Bozhi Liu, Guoping Qiu


Previous deep learning approaches to color constancy usually directly estimate illuminant value from input image. Such approaches might suffer heavily from being sensitive to the variation of image content. To overcome this problem, we introduce a deep metric learning approach named Illuminant-Guided Triplet Network (IGTN) to color constancy. IGTN generates an Illuminant Consistent and Discriminative Feature (ICDF) for achieving robust and accurate illuminant color estimation. ICDF is composed of semantic and color features based on a learnable color histogram scheme. In the ICDF space, regardless of the similarities of their contents, images taken under the same or similar illuminants are placed close to each other and at the same time images taken under different illuminants are placed far apart. We also adopt an end-to-end training strategy to simultaneously group image features and estimate illuminant value, and thus our approach does not have to classify illuminant in a separate module. We evaluate our method on two public datasets and demonstrate our method outperforms state-of-the-art approaches. Furthermore, we demonstrate that our method is less sensitive to image appearances, and can achieve more robust and consistent results than other methods on a High Dynamic Range dataset.
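Two standard ingredients the abstract relies on are sketched below under their usual definitions: the angular error used to score illuminant estimates and a margin-based triplet loss over image features. The margin value and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def angular_error_deg(pred_illuminant, gt_illuminant):
    """Angle in degrees between predicted and ground-truth RGB illuminant vectors."""
    cos = F.cosine_similarity(pred_illuminant, gt_illuminant, dim=-1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Pull same-illuminant features together and push different-illuminant features apart."""
    d_pos = (anchor - positive).pow(2).sum(dim=-1)
    d_neg = (anchor - negative).pow(2).sum(dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()
```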
[work, extract, dataset, three, previous] [feature, semantic, pooling, table, global, framework, propose, pyramid, including] [input, datasets, trained, variation, model, sensitive] [color, illuminant, learnable, constancy, histogram, icdf, ieee, method, based, pattern, imaging, illumination, gamut, igtn, convolutional, science, spatial, figure, exposure, dynamic, range, removed, proposed] [image, loss, discriminative, checker, shared, mapping] [deep, network, learning, triplet, training, angular, neural, metric, achieve, performance, size, set, function, problem, base, data, alexnet, equation, experiment, evaluate, setting, average, lower, strategy, group, statistical] [estimation, approach, computer, estimate, consistent, conference, scene, vision, estimated, error, local, camera, single, directly]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Bolei and Liu, Jingxin and Hou, Xianxu and Liu, Bozhi and Qiu, Guoping},
  title = {End-to-End Illuminant Estimation Based on Deep Metric Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Variational-EM-Based Deep Learning for Noise-Blind Image Deblurring
Yuesong Nan, Yuhui Quan, Hui Ji


Non-blind deblurring is an important problem encountered in many image restoration tasks. The focus of non-blind deblurring is on how to suppress noise magnification during deblurring. In practice, it often happens that the noise level of input image is unknown and varies among different images. This paper aims at developing a deep learning framework for deblurring images with unknown noise level. Based on the framework of variational expectation maximization (EM), an iterative noise-blind deblurring scheme is proposed which integrates the estimation of noise level and the quantification of image prior uncertainty. Then, the proposed scheme is unrolled to a neural network (NN) where image prior is modeled by NN with uncertainty quantification. Extensive experiments showed that the proposed method not only outperformed existing noise-blind deblurring methods by a large margin, but also outperformed those state-of-the-art image deblurring methods designed/trained with known noise level.
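A heavily simplified, hedged sketch of the alternation behind such a noise-blind scheme: re-estimate the noise level from the current data misfit, then re-solve a regularized deconvolution with that estimate. The Wiener-style update below stands in for the paper's learned NN prior and unrolled network; everything here is a toy illustration.

```python
import numpy as np

def noise_blind_deblur(blurred, kernel, iters=5):
    """Toy EM-flavoured loop: alternate noise-variance estimation and regularized deconvolution."""
    K = np.fft.fft2(kernel, s=blurred.shape)
    B = np.fft.fft2(blurred)
    sigma2 = 1e-2                                         # initial noise-variance guess
    for _ in range(iters):
        X = np.conj(K) * B / (np.abs(K) ** 2 + sigma2)    # Wiener-style image update
        residual = blurred - np.real(np.fft.ifft2(K * X))
        sigma2 = max(float(residual.var()), 1e-6)         # re-estimate the noise level
    return np.real(np.fft.ifft2(X))
```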
[natural, dataset, prediction, bank] [level, table, denotes, stage, cnn, sun, framework] [noise, model, trained, iterative, universal, poisson] [deblurring, prior, proposed, method, blurred, kfi, based, wavelet, existing, kernel, vem, designed, ieee, blur, deblur, noisy, denoising, comparison, called, blind, gaussian, learnable, bayes, levin, hui, outperformed, clear, likelihood, deconvolution, figure] [image, latent, unknown, variable, variational, learn] [learning, deep, distribution, set, performance, regularization, gain, filter, training, log, argmin, scheme, network, gradient, parameter, data, updating, size, test, linear, algorithm, vector, compared, problem, better, fixed] [uncertainty, estimator, estimate, approach, varying, estimation, term]
@InProceedings{Nan_2020_CVPR,
  author = {Nan, Yuesong and Quan, Yuhui and Ji, Hui},
  title = {Variational-EM-Based Deep Learning for Noise-Blind Image Deblurring},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Image Demoireing with Learnable Bandpass Filters
Bolun Zheng, Shanxin Yuan, Gregory Slabaugh, Ales Leonardis


Image demoireing is a multi-faceted image restoration task involving both texture and color restoration. In this paper, we propose a novel multiscale bandpass convolutional neural network (MBCNN) to address this problem. As an end-to-end solution, MBCNN respectively solves the two sub-problems. For texture restoration, we propose a learnable bandpass filter (LBF) to learn the frequency prior for moire texture removal. For color restoration, we propose a two-step tone mapping strategy, which first applies a global tone mapping to correct for a global color shift, then performs local fine tuning of the color per pixel. Through an ablation study, we demonstrate the effectiveness of the different components of MBCNN. Experimental results on two public datasets show that our method outperforms state-of-the-art methods by a large margin (more than 2dB in terms of PSNR).
[attention, three, dataset] [global, cnn, feature, table, propose, denotes, ablation, advanced, including, final] [model, input, clean, trained] [moire, color, tone, mbcnn, frequency, figure, proposed, demoireing, learnable, bandpass, convolutional, residual, block, gtmb, prior, scale, sobel, removal, output, mtrb, channel, restoration, convolution, comparison, lcdmoire, method, compression, introduced, ssim, dmcnn, based, ltmb, remove, existing, dct, spectrum, relu, asl, multiscale, lbf, artifact, high] [image, texture, loss, mapping, learn, structural, domain] [deep, learning, network, performance, validation, large, training, size, layer, filter, set, function, neural, strategy, connection] [local, dense, structure, accurate]
@InProceedings{Zheng_2020_CVPR,
  author = {Zheng, Bolun and Yuan, Shanxin and Slabaugh, Gregory and Leonardis, Ales},
  title = {Image Demoireing with Learnable Bandpass Filters},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Assessing Image Quality Issues for Real-World Problems
Tai-Yin Chiu, Yinan Zhao, Danna Gurari


We introduce a new large-scale dataset that links the assessment of image quality issues to two practical vision tasks: image captioning and visual question answering. First, we identify for 39,181 images taken by people who are blind whether each is of sufficient quality to recognize the content as well as what quality flaws are observed from six options. These labels serve as a critical foundation for us to make the following contributions: (1) a new problem and algorithms for deciding whether an image is of insufficient quality to recognize the content and so is not captionable, (2) a new problem and algorithms for deciding which of six quality flaws an image contains, (3) a new problem and algorithms for deciding whether a visual question is unanswerable due to unrecognizable content versus the content of interest being missing from the field of view, and (4) a novel application of more efficiently creating a large-scale image captioning dataset by automatically deciding whether an image is of insufficient quality and so should not be captioned. We publicly share our datasets and code to facilitate future extensions of this work: https://vizwiz.org.
[visual, dataset, question, captioning, work, people, evaluation, predicting, prediction, jeffrey, recognize, recognizing, natural] [alan, score, table, object, feature] [quality, unrecognizable, assessment, recognizable, datasets, unanswerable, unrecognizability, answerability, assessing, crowdworkers, percentage, flaw, trained, hog, zhou, insufficient, answered, distorted, flag, brt, drk, recognizability, danna, identify, deemed] [ieee, blind, figure, pattern, method, existing, signal, high] [image, content, introduce, missing, real, train] [learning, training, task, observe, random, problem, evaluate, average, algorithm, support, call, precision, performance, label, benefit, probability, set] [conference, vision, computer, novel, scene, demonstrate, human, well]
@InProceedings{Chiu_2020_CVPR,
  author = {Chiu, Tai-Yin and Zhao, Yinan and Gurari, Danna},
  title = {Assessing Image Quality Issues for Real-World Problems},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Memory-Efficient Hierarchical Neural Architecture Search for Image Denoising
Haokui Zhang, Ying Li, Hao Chen, Chunhua Shen


Recently, neural architecture search (NAS) methods have attracted much attention and outperformed manually designed architectures on a few high-level vision tasks. In this paper, we propose HiNAS (Hierarchical NAS), an effort towards employing NAS to automatically design effective neural network architectures for image denoising. HiNAS adopts gradient-based search strategies and employs operations with adaptive receptive fields to build a flexible hierarchical search space. During the search stage, HiNAS shares cells across different feature levels to save memory and employs an early stopping strategy to avoid the collapse issue in NAS, considerably accelerating the search. The proposed HiNAS is both memory and computation efficient, taking only about 4.5 hours of searching on a single GPU. We evaluate the effectiveness of our proposed HiNAS on two different datasets, namely an additive white Gaussian noise dataset, BSD500, and a realistic noise dataset, SIM1800. Experimental results show that the architecture found by HiNAS has fewer parameters and enjoys a faster inference speed, while achieving highly competitive performance compared with state-of-the-art methods. We also present an analysis of the architectures found by NAS. HiNAS also shows good performance in experiments on image de-raining.
[three, node, hierarchical, build, dataset, previous] [table, employ, feature] [input, noise, trained] [hinas, cell, ieee, denoising, proposed, psnr, ssim, convolution, designed, figure, conv, skip, based, deformable, def, restoration, falsr, output, nlrn, competitive, conventional, comparison, lssim, adaptive, receptive, field, flexible] [image, loss, corresponding, train] [search, architecture, network, neural, layer, set, width, performance, supernet, learning, training, searching, gradient, design, memory, deep, outer, number, space, strategy, early, inference, inner, best, efficient, gpu, basic, sharing, test, compared] [single, differentiable, continuous]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Haokui and Li, Ying and Chen, Hao and Shen, Chunhua},
  title = {Memory-Efficient Hierarchical Neural Architecture Search for Image Denoising},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Blindly Assess Image Quality in the Wild Guided by a Self-Adaptive Hyper Network
Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, Yanning Zhang


Blind image quality assessment (BIQA) for authentically distorted images has always been a challenging problem, since images captured in the wild include varied contents and diverse types of distortions. The vast majority of prior BIQA methods focus on how to predict synthetic image quality, but fail when applied to real-world distorted images. To deal with this challenge, we propose a self-adaptive hyper network architecture to blindly assess image quality in the wild. We separate the IQA procedure into three stages: content understanding, perception rule learning and quality prediction. After extracting image semantics, a perception rule is established adaptively by a hyper network and then adopted by a quality prediction network. In our model, image quality can be estimated in a self-adaptive manner, and thus the model generalizes well to diverse images captured in the wild. Experimental results verify that our approach not only outperforms state-of-the-art methods on challenging authentic image databases but also achieves competitive performance on synthetic image databases, though it is not explicitly designed for the synthetic task.
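The self-adaptive part can be pictured as a hyper network that outputs the parameters of a small quality-prediction head from the content feature, so the "perception rule" differs per image. The layer sizes and class name below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class HyperQualityHead(nn.Module):
    """Content feature -> weights/bias of a one-layer quality predictor (illustrative sizes)."""
    def __init__(self, content_dim=512, quality_dim=112):
        super().__init__()
        self.weight_gen = nn.Linear(content_dim, quality_dim)
        self.bias_gen = nn.Linear(content_dim, 1)

    def forward(self, content_feat, quality_feat):
        w = self.weight_gen(content_feat)                       # per-image weights
        b = self.bias_gen(content_feat)                         # per-image bias
        return (w * quality_feat).sum(dim=1, keepdim=True) + b  # predicted quality score
```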
[prediction, perception, three, extract, understanding, connected] [semantic, feature, aware, challenge, module, fully, table, alan, global, backbone, propose] [quality, iqa, model, live, distortion, authentic, database, assessment, distorted, rule, dbcnn, srcc, csiq, input, bid, plcc, worst, livec, pqr, authentically, generalization] [proposed, ieee, based, blind, figure, conv, pattern, synthetically, high, achieved] [image, content, synthetic, target, extracted, generated, competing, learn, gap, consists, ability] [network, learning, hyper, deep, training, fixed, weight, layer, neural, architecture, procedure, best, performance, selected] [local, computer, conference, approach, vision, capture, human]
@InProceedings{Su_2020_CVPR,
  author = {Su, Shaolin and Yan, Qingsen and Zhu, Yu and Zhang, Cheng and Ge, Xin and Sun, Jinqiu and Zhang, Yanning},
  title = {Blindly Assess Image Quality in the Wild Guided by a Self-Adaptive Hyper Network},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Perceptual Quality Assessment of Smartphone Photography
Yuming Fang, Hanwei Zhu, Yan Zeng, Kede Ma, Zhou Wang


As smartphones become people's primary cameras to take photos, the quality of their cameras and the associated computational photography modules has become a de facto standard in evaluating and ranking smartphones in the consumer market. We conduct so far the most comprehensive study of perceptual quality assessment of smartphone photography. We introduce the Smartphone Photography Attribute and Quality (SPAQ) database, consisting of 11,125 pictures taken by 66 smartphones, where each image is attached with so far the richest annotations. Specifically, we collect a series of human opinions for each image, including image quality, image attributes (brightness, colorfulness, contrast, noisiness, and sharpness), and scene category labels (animal, cityscape, human, indoor scene, landscape, night scene, plant, still life, and others) in a well-controlled laboratory environment. The exchangeable image file format (EXIF) data for all images are also recorded to aid deeper analysis. We also make the first attempts using the database to train blind image quality assessment (BIQA) models constructed by baseline and multi-task deep neural networks. The results provide useful insights on how EXIF data, image attributes and high-level semantics interact with image quality, how next-generation BIQA models can be designed, and how better computational photography systems can be optimized on mobile devices. The database along with the proposed BIQA models are available at https://github.com/h4nwei/SPAQ.
[visual, prediction, time, connected, natural, multiple] [category, including, table, challenge, semantic, fully, score] [quality, subjective, exif, assessment, biqa, database, spaq, iqa, model, laboratory, study, moss, live, srcc, input, plcc, digital, noise, poor] [smartphone, ieee, figure, perceptual, photography, captured, proposed, exposure, iso, based, night, blind, high, brightness, smartphones, comparison] [image, attribute, realistic, train, synthetic, user] [learning, computational, deep, training, data, task, baseline, objective, performance, function, neural, mobile, sample, number, network] [scene, camera, human, conference, computer, indoor, continuous]
@InProceedings{Fang_2020_CVPR,
  author = {Fang, Yuming and Zhu, Hanwei and Zeng, Yan and Ma, Kede and Wang, Zhou},
  title = {Perceptual Quality Assessment of Smartphone Photography},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Don't Hit Me! Glass Detection in Real-World Scenes
Haiyang Mei, Xin Yang, Yang Wang, Yuanyuan Liu, Shengfeng He, Qiang Zhang, Xiaopeng Wei, Rynson W.H. Lau


Glass is very common in our daily life. Existing computer vision systems neglect it and thus may have severe consequences, e.g., a robot may crash into a glass wall. However, sensing the presence of glass is not straightforward. The key challenge is that arbitrary objects/scenes can appear behind the glass, and the content within the glass region is typically similar to that behind it. In this paper, we propose an important problem of detecting glass from a single RGB image. To address this problem, we construct a large-scale glass detection dataset (GDD) and design a glass detection network, called GDNet, which explores abundant contextual cues for robust glass detection with a novel large-field contextual feature integration (LCFI) module. Extensive experiments demonstrate that the proposed method achieves superior glass detection results on our GDD test set compared with state-of-the-art methods fine-tuned for glass detection.
[attention, dataset, context, integrate, three, extract, explore, water] [contextual, detection, segmentation, object, salient, module, feature, mirror, region, detect, semantic, propose, denotes, inside, area, segment, fully, saliency, edge, table] [input, detecting] [glass, lcfi, conv, gdd, figure, relu, reflection, gdnet, method, abundant, proposed, spatially, separable, block, convolution, field, existing, integration, removal, based, dilation, parallel, comparison] [image, loss, shadow, content, address, extracted, common] [network, large, typically, problem, test, set, small, base, deep, size, rate, training, learning] [scene, single, local, rgb, depth, combine, leverage, vision, novel]
@InProceedings{Mei_2020_CVPR,
  author = {Mei, Haiyang and Yang, Xin and Wang, Yang and Liu, Yuanyuan and He, Shengfeng and Zhang, Qiang and Wei, Xiaopeng and Lau, Rynson W.H.},
  title = {Don't Hit Me! Glass Detection in Real-World Scenes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Progressive Mirror Detection
Jiaying Lin, Guodong Wang, Rynson W.H. Lau


The mirror detection problem is important as mirrors can affect the performances of many vision tasks. It is a difficult problem as it requires an understanding of global scene semantics. Recently, a method was proposed to detect mirrors by learning multi-level contextual contrasts between inside and outside of mirrors, which helps locate mirror edges implicitly. We observe that the content of a mirror reflects the content of its surrounding, separated by the edge of the mirror. Hence, we propose a model in this paper to progressively learn the content similarity between the inside and outside of the mirror while explicitly detecting the mirror edges. Our work has two main contributions. First, we propose a new relational contextual contrasted local (RCCL) module to extract and compare the mirror features with its corresponding context features, and an edge detection and fusion (EDF) module to learn the features of mirror edges in complex scenes via explicit supervision. Second, we construct a challenging benchmark dataset of 6,461 mirror images. Unlike the existing MSD dataset, which has limited diversity, our dataset covers a variety of scenes and is much larger in scale. Experimental results show that our model outperforms relevant state-of-the-art methods.
[relational, extract, dataset, context, relation, decoder, explicitly, visual, relevant, prediction, considers] [mirror, edge, contextual, module, detection, contrasted, edf, feature, propose, map, detect, msd, rccl, extractor, benchmark, boundary, object, salient, inside, mirrornet, region, semantic, global, segmentation, extracting, table, pspnet, lot, saliency, pyramid, score] [input, model, datasets, help, detecting] [figure, method, fusion, proposed, based, output, convolution, existing, contrast, extraction, ieee] [image, corresponding, content, loss, progressive, produce, extracted] [similarity, network, layer, basic, size, total, evaluate, set, rate, popular] [local, ground, error, single, truth, scene, novel]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Jiaying and Wang, Guodong and Lau, Rynson W.H.},
  title = {Progressive Mirror Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Category-Level Articulated Object Pose Estimation
Xiaolong Li, He Wang, Li Yi, Leonidas J. Guibas, A. Lynn Abbott, Shuran Song


This paper addresses the task of category-level pose estimation for articulated objects from a single depth image. We present a novel category-level approach that correctly accommodates object instances previously unseen during training. We introduce Articulation-aware Normalized Coordinate Space Hierarchy (ANCSH) - a canonical representation for different articulated objects in a given category. As the key to achieve intra-category generalization, the representation constructs a canonical object space as well as a set of canonical part spaces. The canonical object space normalizes the object orientation, scales and articulations (e.g. joint parameters and states) while each canonical part space further normalizes its part pose and scale. We develop a deep network based on PointNet++ that predicts ANCSH from a single depth point cloud, including part segmentation, normalized coordinates, and joint parameters in the canonical object space. By leveraging the canonicalized joints, we demonstrate: 1) improved performance in part pose and scale estimations using the induced kinematic constraints from joints; 2) high accuracy for joint parameter estimation in camera space.
[predict, state, work, prediction, associated, predicting, individual, frame] [object, segmentation, amodal, bounding, category, head, predicted, including] [model, combined] [ieee, pattern, reference, based, figure, scale] [representation, unseen, translation, perform, corresponding, loss] [space, normalized, algorithm, network, optimization, parameter, hierarchy, set, deep, learning, scaling, performance, accuracy] [joint, pose, naocs, articulated, ancsh, estimation, point, revolute, depth, coordinate, kinematic, conference, rigid, canonical, camera, prismatic, error, computer, single, defined, vision, approach, predicts, rotation, axis, human, compute, estimate, define, novel, cad, orientation, cloud, dense, direction, rest, well]
@InProceedings{Li_2020_CVPR,
  author = {Li, Xiaolong and Wang, He and Yi, Li and Guibas, Leonidas J. and Abbott, A. Lynn and Song, Shuran},
  title = {Category-Level Articulated Object Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unbiased Scene Graph Generation From Biased Training
Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, Hanwang Zhang


Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse "human walk on / sit on / lay on beach" into "human on beach". Given such SGG, the down-stream tasks such as VQA can hardly infer better scene structures than merely a bag of objects. However, debiasing in SGG is not trivial because traditional debiasing methods cannot distinguish between the good and bad bias, e.g., good context prior (e.g., "person read book" rather than "eat") and bad long-tailed bias (e.g., "near" dominating "behind / in front of"). In this paper, we present a novel SGG framework based on causal inference but not the conventional likelihood. We first build a causal graph for SGG, and perform traditional biased training with the graph. Then, we propose to draw the counterfactual causality from the trained graph to infer the effect from the bad bias, which should be removed. In particular, we use Total Direct Effect (TDE) as the proposed final predicate score for unbiased SGG. Note that our framework is agnostic to any SGG model and thus can be widely applied in the community who seeks unbiased predictions. By using the proposed Scene Graph Diagnosis toolkit on the SGG benchmark Visual Genome and several prevailing models, we observed significant improvements over the previous state-of-the-art methods.
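A minimal sketch of the counterfactual Total Direct Effect at inference time: subtract the logits obtained when the pair's visual evidence is wiped out (e.g. zeroed) from the ordinary biased logits. The classifier interface and the choice of the "blank" intervention are placeholders, not the paper's exact implementation.

```python
import torch

def tde_predicate_logits(predicate_classifier, visual_feat, context_feat):
    """Unbiased score = factual logits minus counterfactual logits with visual evidence removed."""
    factual = predicate_classifier(visual_feat, context_feat)
    counterfactual = predicate_classifier(torch.zeros_like(visual_feat), context_feat)
    return factual - counterfactual
```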
[graph, visual, tde, sgg, causal, unbiased, relationship, umbrella, predicate, context, retrieval, prediction, three, debiasing, link, language, man, woman, trivial, previous, dog, surfboard, intervention, dataset, reasoning, standing, embedding, pair, vctree] [object, biased, feature, detected, table, detection, framework, faster, bag] [counterfactual, model, input, caused, original, diagnosis, example, difference, reweight, fraction, trained] [figure, proposed, conventional, fusion, analysis, based, nie, prior] [image, generation, person, street, generated] [baseline, training, bias, set, learning, total, sum, neural, gate, inference, label, good, note, logits, classification, data, worse, performance, size, better] [scene, human, direct, focal]
@InProceedings{Tang_2020_CVPR,
  author = {Tang, Kaihua and Niu, Yulei and Huang, Jianqiang and Shi, Jiaxin and Zhang, Hanwang},
  title = {Unbiased Scene Graph Generation From Biased Training},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dynamic Graph Message Passing Networks
Li Zhang, Dan Xu, Anurag Arnab, Philip H.S. Torr


Modelling long-range dependencies is critical for scene understanding tasks in computer vision. Although CNNs have excelled in many vision tasks, they are still limited in capturing long-range structured relationships as they typically consist of layers of local kernels. A fully-connected graph is beneficial for such modelling, however, its computational overhead is prohibitive. We propose a dynamic graph message passing network, that significantly reduces the computational complexity compared to related works modelling a fully-connected graph. This is achieved by adaptively sampling nodes in the graph, conditioned on the input, for message passing. Based on the sampled nodes, we dynamically predict node-dependent filter weights and the affinity matrix for propagating information between them. Using this model, we show significant improvements with respect to strong, state-of-the-art baselines on three different tasks and backbone architectures. Our approach also outperforms fully-connected graphs while using substantially fewer floating-point operations and parameters.
[message, graph, node, dgmn, latexit, samp, passing, ure, fea, dynam, dynamically, attention, pred, affin, multiple, convo, structured, deno, understanding, work, context, ransforma] [feature, semantic, segmentation, map, mask, instance, object, detection, module, backbone, coco, resnet, table, apbox, apmask, effectiveness] [model, effective, npu] [dynamic, proposed, deformable, based, convolution, convolutional, method, receptive, field, dilated, phase, comparison] [image, conditioned, latent] [sampling, learning, neural, deep, performance, random, sampled, network, baseline, set, filter, number, computational, strategy, sample, consider, validation, uniform, learned, better] [form, scene, approach, vision, local]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Li and Xu, Dan and Arnab, Anurag and Torr, Philip H.S.},
  title = {Dynamic Graph Message Passing Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Weakly Supervised Visual Semantic Parsing
Alireza Zareian, Svebor Karaman, Shih-Fu Chang


Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images, enabling deep understanding of visual content, with many applications such as visual reasoning and image retrieval. Nevertheless, existing SGG methods require millions of manually annotated bounding boxes for training, and are computationally inefficient, as they exhaustively process all pairs of object proposals to detect predicates. In this paper, we address those two limitations by first proposing a generalized formulation of SGG, namely Visual Semantic Parsing, which disentangles entity and predicate recognition, and enables sub-quadratic performance. Then we propose the Visual Semantic Parsing Network, VSPNet, based on a dynamic, attention-based, bipartite message passing framework that jointly infers graph nodes and edges through an iterative process. Additionally, we propose the first graph-based weakly supervised learning framework, based on a novel graph alignment algorithm, which enables training without bounding box annotations. Through extensive experiments, we show that VSPNet outperforms weakly supervised baselines significantly and approaches fully supervised performance, while being several times faster. We publicly release the source code of our method.
[graph, visual, vspn, predicate, entity, message, sgg, vsp, attention, passing, role, connected, node, state, relationship, represent, pair, bipartite, outperforms, time, dataset, extract, order, embedding] [semantic, object, weakly, bounding, fully, detection, propose, faster, box, table, parsing, framework, proposal] [model, iterative, subject] [ieee, method, proposed, pattern, output, based, existing, figure] [supervised, alignment, image, loss, generation] [number, learning, training, processing, set, problem, network, deep, performance, process, optimization, algorithm, neural, class, arxiv, preprint, inference, note] [computer, conference, scene, vision, ground, truth, international, define, novel, formulation, european]
@InProceedings{Zareian_2020_CVPR,
  author = {Zareian, Alireza and Karaman, Svebor and Chang, Shih-Fu},
  title = {Weakly Supervised Visual Semantic Parsing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GPS-Net: Graph Property Sensing Network for Scene Graph Generation
Xin Lin, Changxing Ding, Jinquan Zeng, Dacheng Tao


Scene graph generation (SGG) aims to detect objects in an image along with their pairwise relationships. There are three key properties of scene graph that have been underexplored in recent works: namely, the edge direction information, the difference in priority between nodes, and the long-tailed distribution of relationships. Accordingly, in this paper, we propose a Graph Property Sensing Network (GPS-Net) that fully explores these three properties for SGG. First, we propose a novel message passing module that augments the node feature with node-specific contextual information and encodes the edge direction information via a tri-linear model. Second, we introduce a node priority sensitive loss to reflect the difference in priority between nodes during training. This is achieved by designing a mapping function that adjusts the focusing parameter in the focal loss. Third, since the frequency of relationships is affected by the long-tailed distribution problem, we mitigate this issue by first softening the distribution and then enabling it to be adjusted for each subject-object pair according to their visual appearance. Systematic experiments demonstrate the effectiveness of the proposed techniques. Moreover, GPS-Net achieves state-of-the-art performance on three popular databases: VG, OI, and VRD by significant gains under various settings and metrics. The code and models are available at https://github.com/taksau/GPS-Net.
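The node priority sensitive loss can be pictured as a focal-style cross-entropy whose focusing parameter is adjusted per sample; the mapping from priority to the focusing parameter below is a hypothetical placeholder, not the paper's exact function.

```python
import torch
import torch.nn.functional as F

def priority_focal_loss(logits, targets, priority, gamma_base=2.0):
    """priority in [0, 1], one value per sample: higher priority -> weaker down-weighting."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                           # probability assigned to the true class
    gamma = gamma_base * (1.0 - priority)         # illustrative priority-to-gamma mapping
    return ((1.0 - pt) ** gamma * ce).mean()
```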
[relationship, visual, graph, node, priority, three, context, dmp, message, man, attention, sgg, paw, wearing, shirt, passing, transformer, sitting, evaluation, sgcls, dog, sgdet, modeling, reldn, hat] [object, edge, contextual, table, module, feature, detection, effectiveness, propose, adopt, denotes, represents, ablation, achieves, global] [model, ear, difference, subject] [frequency, figure, proposed, tree, relu, existing, prior, method, stacking] [loss, image, utilize, generation, mapping] [performance, distribution, function, layer, training, network, class, denoted, equation, problem, neural, metric, operation, compared, set] [scene, direction, focal, novel, front, leg]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Xin and Ding, Changxing and Zeng, Jinquan and Tao, Dacheng},
  title = {GPS-Net: Graph Property Sensing Network for Scene Graph Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
End-to-End Optimization of Scene Layout
Andrew Luo, Zhoutong Zhang, Jiajun Wu, Joshua B. Tenenbaum


We propose an end-to-end variational generative model for scene layout synthesis conditioned on scene graphs. Unlike unconditional scene layout generation, we use scene graphs as an abstract but general representation to guide the synthesis of diverse scene layouts that satisfy relationships included in the scene graph. This gives rise to more flexible control over the synthesis process, allowing various forms of inputs such as scene layouts extracted from sentences or inferred from a single color image. Using our conditional layout synthesizer, we can generate various layouts that share the same structure of the input example. In addition to this conditional generation design, we also integrate a differentiable rendering module that enables layout refinement using only 2D projections of the scene. Given a depth and a semantics map, the differentiable rendering module enables optimizing over the synthesized layout to fit the given input in an analysis-by-synthesis fashion. Experiments suggest that our model achieves higher accuracy and diversity in conditional scene synthesis and allows exemplar-based scene generation from various input forms.
[graph, multiple, decoder, dataset, television, text, work, three, tall] [object, semantic, bounding, box, denotes, siyuan, refinement, predicted, map] [model, input, deviation] [figure, ieee, pattern, convolutional, proposed, prior, convolution] [layout, synthesis, latent, image, generate, generated, loss, conditional, generation, target, conditioned, generates, nightstand, diverse, synthesized, encoder, representation, wooden, variational, perform, exemplar] [network, sampled, standard, optimization, distribution, neural, sample, calculate, accuracy, stochastic, vector, training, process, learned] [scene, bed, depth, computer, conference, vision, differentiable, cabinet, front, left, indoor, rotation, ground, truth, single, demonstrate, rendering, desk, manolis, renderer, represented, fabric, allows, define]
@InProceedings{Luo_2020_CVPR,
  author = {Luo, Andrew and Zhang, Zhoutong and Wu, Jiajun and Tenenbaum, Joshua B.},
  title = {End-to-End Optimization of Scene Layout},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Intra-Domain Adaptation for Semantic Segmentation Through Self-Supervision
Fei Pan, Inkyu Shin, Francois Rameau, Seokju Lee, In So Kweon


Convolutional neural network-based approaches have achieved remarkable progress in semantic segmentation. However, these approaches heavily rely on annotated data, which are labor intensive. To cope with this limitation, automatically annotated data generated from graphic engines are used to train segmentation models. However, the models trained from synthetic data are difficult to transfer to real images. To tackle this issue, previous works have considered directly adapting models from the source data to the unlabeled target data (to reduce the inter-domain gap). Nonetheless, these techniques do not consider the large distribution gap among the target data itself (intra-domain gap). In this work, we propose a two-step self-supervised domain adaptation approach to minimize the inter-domain and intra-domain gaps together. First, we conduct the inter-domain adaptation of the model; from this adaptation, we separate the target domain into an easy and a hard split using an entropy-based ranking function. Finally, to decrease the intra-domain gap, we propose to employ a self-supervised adaptation technique from the easy to the hard subdomain. Experimental results on numerous benchmark datasets highlight the effectiveness of our method against existing state-of-the-art approaches. The source code is available at https://github.com/feipan664/IntraDA.git.
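The entropy-based ranking step can be sketched directly from the abstract: score each target image by the mean pixel-wise prediction entropy of the inter-domain-adapted model and take the lowest-entropy fraction as the easy split. The split ratio and names are illustrative.

```python
import torch

def entropy_rank_split(prob_maps, image_ids, easy_ratio=0.67):
    """prob_maps: per-image C x H x W softmax outputs; returns (easy_ids, hard_ids)."""
    scores = []
    for p in prob_maps:
        ent = -(p * torch.log(p + 1e-12)).sum(dim=0)   # per-pixel prediction entropy
        scores.append(ent.mean().item())               # mean entropy per image
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    cut = int(len(order) * easy_ratio)
    easy = [image_ids[i] for i in order[:cut]]
    hard = [image_ids[i] for i in order[cut:]]
    return easy, hard
```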
[dataset, previous, shift, predict, work] [segmentation, semantic, easy, hard, split, table, map, propose, achieves, miou, predicted, annotated, adopt] [model, trained, adversarial, conduct, input, mnist] [proposed, method, figure, output, based, existing, optimized] [adaptation, target, domain, ginter, source, image, synthetic, gap, dinter, dintra, pseudo, loss, gintra, train, real, generator, advent, separate, lseg, adaptsegnet, unsupervised, align, inter, utilize, generated, discriminator, intra, synthia, uda, digit, proposes, alignment, ladv, generate, pte] [data, entropy, ranking, learning, unlabeled, training, distribution, minimize, set, consider, performance, hyperparameter, network, close, validation, large, labeled] [approach, scene]
@InProceedings{Pan_2020_CVPR,
  author = {Pan, Fei and Shin, Inkyu and Rameau, Francois and Lee, Seokju and Kweon, In So},
  title = {Unsupervised Intra-Domain Adaptation for Semantic Segmentation Through Self-Supervision},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dual Super-Resolution Learning for Semantic Segmentation
Li Wang, Dong Li, Yousong Zhu, Lu Tian, Yi Shan


Current state-of-the-art semantic segmentation methods often apply high-resolution input to attain high performance, which brings large computation budgets and limits their applications on resource-constrained devices. In this paper, we propose a simple and flexible two-stream framework named Dual Super-Resolution Learning (DSRL) to effectively improve the segmentation accuracy without introducing extra computation costs. Specifically, the proposed method consists of three parts: Semantic Segmentation Super-Resolution (SSSR), Single Image Super-Resolution (SISR) and Feature Affinity (FA) module, which can keep high-resolution representations with low-resolution input while simultaneously reducing the model computation complexity. Moreover, it can be easily generalized to other tasks, e.g., human pose estimation. This simple yet effective method leads to strong representations and is evidenced by promising performance on both semantic segmentation and human pose estimation. Specifically, for semantic segmentation on CityScapes, we can achieve ≥2% higher mIoU with similar FLOPs, and keep the performance with 70% FLOPs. For human pose estimation, we can gain ≥2% mAP with the same FLOPs and maintain mAP with 30% fewer FLOPs. Code and models are available at https://github.com/wanglixilinx/DSRL.
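A hedged sketch of a feature-affinity style loss: compare pairwise spatial similarity matrices of the segmentation-branch and super-resolution-branch features. The normalization and the L1 penalty are illustrative choices, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def feature_affinity_loss(seg_feat, sr_feat):
    """seg_feat, sr_feat: B x C x H x W feature maps from the two branches."""
    def affinity(f):
        b, c, h, w = f.shape
        f = F.normalize(f.view(b, c, h * w), dim=1)    # unit-norm feature per spatial position
        return torch.bmm(f.transpose(1, 2), f)         # B x HW x HW similarity matrix
    return (affinity(seg_feat) - affinity(sr_feat)).abs().mean()
```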
[three, dataset] [semantic, segmentation, feature, effectiveness, table, framework, affinity, branch, module, final, miou, map, bisenet, propose, extra, atrous, coco, pyramid] [input, improve, original, effectively, model, effective, generality] [method, sisr, figure, proposed, sssr, dsrl, convolutional, dual, resolution, upsampling, convolution, based, highresolution, output, comparison] [image, loss, consists, representation, person, corresponding] [learning, performance, deep, computation, validation, network, accuracy, set, neural, test, simple, size, reduce, training, similarity, baseline, efficient, large, compared, architecture, equation, task, inference, higher, compact, design] [human, pose, single, scene, demonstrate, dense, estimation, ground, structure]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Li and Li, Dong and Zhu, Yousong and Tian, Lu and Shan, Yi},
  title = {Dual Super-Resolution Learning for Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Supervised Scene De-Occlusion
Xiaohang Zhan, Xingang Pan, Bo Dai, Ziwei Liu, Dahua Lin, Chen Change Loy


Natural scene understanding is a challenging task, particularly when encountering images of multiple objects that are partially occluded. This obstacle arises from varying object ordering and positioning. Existing scene understanding paradigms are able to parse only the visible parts, resulting in incomplete and unstructured scene interpretation. In this paper, we investigate the problem of scene de-occlusion, which aims to recover the underlying occlusion ordering and complete the invisible parts of occluded objects. We make the first attempt to address the problem through a novel and unified framework that recovers hidden scene structures without ordering and amodal annotations as supervisions. This is achieved via Partial Completion Network (PCNet)-mask (M) and -content (C), which learn to recover fractions of object masks and contents, respectively, in a self-supervised manner. Based on PCNet-M and PCNet-C, we devise a novel inference scheme to accomplish scene de-occlusion, via progressive ordering recovery, amodal completion and content completion. Extensive experiments on real-world scenes demonstrate the superior performance of our approach over other alternatives. Remarkably, our approach that is trained in a self-supervised manner achieves comparable results to fully-supervised methods. The proposed scene de-occlusion framework benefits many applications, including high-quality and controllable image manipulation and scene recomposition (see Fig. 1), as well as the conversion of existing modal mask annotations to amodal mask annotations.
[modal, order, dataset, graph, multiple, natural, understanding, dahua] [amodal, ordering, instance, mask, occlusion, occluded, object, framework, segmentation, predicted, table, partially, including, semantic, category, manca, cocoa, convexr, propose, region, pcnets, represents, eraser, ama, split] [trained, case, input, invisible, manipulation, testing, change, increment, original] [recover, figure, neighboring, method, existing, proposed, raw, chen, achieved] [content, image, supervised, train, perform, target, synthetic, corresponding, unsupervised, pseudo, ziwei, intact, learn] [training, manual, data, network, learning, comparable, indicates, problem] [completion, scene, partial, complete, ground, truth, convex, approach, novel, well, full, solve, recovery, visible]
@InProceedings{Zhan_2020_CVPR,
  author = {Zhan, Xiaohang and Pan, Xingang and Dai, Bo and Liu, Ziwei and Lin, Dahua and Loy, Chen Change},
  title = {Self-Supervised Scene De-Occlusion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
BANet: Bidirectional Aggregation Network With Occlusion Handling for Panoptic Segmentation
Yifeng Chen, Guangchen Lin, Songyuan Li, Omar Bourahla, Yiming Wu, Fangfang Wang, Junyi Feng, Mingliang Xu, Xi Li


Panoptic segmentation aims to perform instance segmentation for foreground instances and semantic segmentation for background stuff simultaneously. The typical top-down pipeline concentrates on two key issues: 1) how to effectively model the intrinsic interaction between semantic segmentation and instance segmentation, and 2) how to properly handle occlusion for panoptic segmentation. Intuitively, the complementarity between semantic segmentation and instance segmentation can be leveraged to improve the performance. Besides, we notice that using detection/mask scores is insufficient for resolving the occlusion problem. Motivated by these observations, we propose a novel deep panoptic segmentation scheme based on a bidirectional learning pipeline. Moreover, we introduce a plug-and-play occlusion handling algorithm to deal with the occlusion between different object instances. The experimental results on COCO panoptic benchmark validate the effectiveness of our proposed method. Codes will be released soon at https://github.com/Mooonside/BANet.
[bidirectional, interaction, bilinear, prediction, concatenated, recognition] [semantic, segmentation, instance, occlusion, object, panoptic, feature, roiinlay, stuff, head, module, val, pqth, coco, sim, occluded, pqsf, table, propose, fpn, overlap, backbone, cropped, pyramid, improvement, contextual, score, ocm, roiupsample, box, key, fully, roialign, finlay, thing] [model, improve] [handling, based, figure, pixel, proposed, deformable, convolution, conv, convolutional, applying, crop, spatial, resolve] [appearance, image, loss, perform, structural] [sampling, learning, performance, deep, higher, network, algorithm, set, task, class, better, applied, training, path, similarity] [ground, truth, scene, approach, handle]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Yifeng and Lin, Guangchen and Li, Songyuan and Bourahla, Omar and Wu, Yiming and Wang, Fangfang and Feng, Junyi and Xu, Mingliang and Li, Xi},
  title = {BANet: Bidirectional Aggregation Network With Occlusion Handling for Panoptic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CPR-GCN: Conditional Partial-Residual Graph Convolutional Network in Automated Anatomical Labeling of Coronary Arteries
Han Yang, Xingjian Zhen, Ying Chi, Lei Zhang, Xian-Sheng Hua


Automated anatomical labeling plays a vital role in the coronary artery disease diagnosing procedure. The main challenge in this problem is the large individual variability inherent in human anatomy. Existing methods usually rely on the position information and the prior knowledge of the topology of the coronary artery tree, which may lead to unsatisfactory performance when the main branches are confusing. Motivated by the wide application of graph neural networks to structured data, in this paper, we propose a conditional partial-residual graph convolutional network (CPR-GCN), which takes both position and CT image into consideration, since the CT image contains abundant information such as branch size and spanning direction. Two major parts, a Partial-Residual GCN and a conditions extractor, are included in CPR-GCN. The conditions extractor is a hybrid model containing a 3D CNN and an LSTM, which can extract 3D spatial image features along the branches. On the technical side, the Partial-Residual GCN takes the position features of the branches, with the 3D spatial image features as conditions, to predict the label for each branch. On the mathematical side, our approach twists the partial differential equation (PDE) into the graph modeling. A dataset with 511 subjects is collected from the clinic and annotated by two experts with a two-phase annotation process. According to the five-fold cross-validation, our CPR-GCN yields 95.8% meanRecall, 95.4% meanPrecision and 0.955 meanF1, which outperforms state-of-the-art approaches.
[graph, gcn, dataset, build, structured, extract, bidirectional, compose, automatic] [main, labeling, cnn, table, branch, side, recall, vessel, score, ablation, annotation, treated] [model, input, original, study, differential, collected] [coronary, artery, ccta, tree, figure, residual, convolutional, automated, block, channel, anatomical, method, traditional, based, conventional, rca, prior, author, scts, meanprecision, cardiovascular] [image, domain, extracted, synthetic, conditional, treat, missing, control] [neural, number, deep, network, learning, precision, size, connection, layer, performance, training, average, label, data, matrix, processing, arxiv, preprint] [position, approach, left, point, rely, topology, direction]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Han and Zhen, Xingjian and Chi, Ying and Zhang, Lei and Hua, Xian-Sheng},
  title = {CPR-GCN: Conditional Partial-Residual Graph Convolutional Network in Automated Anatomical Labeling of Coronary Arteries},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-View Correspondence Reasoning Based on Bipartite Graph Convolutional Network for Mammogram Mass Detection
Yuhang Liu, Fandong Zhang, Qianyi Zhang, Siwen Wang, Yizhou Wang, Yizhou Yu


Mammogram mass detection is of great clinical significance due to its high proportion in breast cancers. The information from cross views (i.e., mediolateral oblique and cranio-caudal) is highly related and complementary, and is helpful to make comprehensive decisions. However, unlike radiologists who are able to recognize masses with reasoning ability in cross-view images, most existing methods lack the ability to reason under the guidance of domain knowledge, thus it limits the performance. In this paper, we introduce bipartite graph convolutional network to endow existing methods with cross-view reasoning ability of radiologists in mammogram mass detection. The bipartite node sets are constructed by cross-view images respectively to represent relatively consistent regions in breasts, while the bipartite edge learns to model both inherent cross-view geometric constraints and appearance similarities between correspondences. Based on the bipartite graph, the information propagates methodically through correspondences and enables spatial visual features equipped with customized cross-view reasoning ability. Experimental results on DDSM dataset demonstrate that the proposed algorithm achieves state-of-the-art performance. Besides, visual analysis shows the model has a clear physical meaning, which is helpful for radiologists in clinical interpretation.
[graph, bipartite, node, reasoning, visual, mammogram, represent, ddsm, relation, mlo, dataset, pectoral, rhw, bgn, yizhou, mammography, recognition] [detection, mass, object, feature, table, mask, breast, semantic, faster, region, edge, rcnn, backbone, represents, module] [model, examined, auxiliary, screening, digital, helpful, customized, representative, knn] [ieee, convolutional, pattern, based, figure, method, spatial, proposed, enhance, muscle, analysis, enhanced, medical, clinical, designed] [pseudo, mapping, image, ability, domain, learn, cancer] [learning, network, neural, deep, performance, set, design, uniform] [computer, view, conference, geometric, vision, international, correspondence, stereo, consistent]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yuhang and Zhang, Fandong and Zhang, Qianyi and Wang, Siwen and Wang, Yizhou and Yu, Yizhou},
  title = {Cross-View Correspondence Reasoning Based on Bipartite Graph Convolutional Network for Mammogram Mass Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MPM: Joint Representation of Motion and Position Map for Cell Tracking
Junya Hayashida, Kazuya Nishimura, Ryoma Bise


Conventional cell tracking methods detect multiple cells in each frame (detection) and then associate the detection results in successive time-frames (association). Most cell tracking methods perform the association task independently from the detection task. However, there is no guarantee of preserving coherence between these tasks, and lack of coherence may adversely affect tracking performance. In this paper, we propose the Motion and Position Map (MPM) that jointly represents both detection and association for not only migration but also cell division. It guarantees coherence such that if a cell is detected, the corresponding motion flow can always be obtained. It is a simple but powerful method for multi-object tracking in dense environments. We compared the proposed method with current tracking methods under various conditions in real biological images and found that it outperformed the state-of-the-art (+5.2% improvement compared to the second-best).
[frame, associated, individual, multiple, successive, sequence, intervention, recognition, three] [tracking, detection, association, map, detected, annotated, represents, annotation, segmentation, tracked, associate, table, track] [coherence, magnitude, example, symposium, case] [cell, mpm, method, motion, ieee, based, likelihood, proposed, medical, division, figure, microscopy, ryoma, biomedical, imaging, hayashida, migration, flow, culture, bise, outperformed, pixel, daughter, pattern] [image, target, corresponding, representation] [vector, indicates, learning, performance, number, compared, network, accuracy, function, data, training, similarity, deep, computing, simple, distribution, manual] [position, conference, international, estimated, computer, vision, direction, jointly, estimation, takeo, defined, point, error]
@InProceedings{Hayashida_2020_CVPR,
  author = {Hayashida, Junya and Nishimura, Kazuya and Bise, Ryoma},
  title = {MPM: Joint Representation of Motion and Position Map for Cell Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Distance Transform for Tubular Structure Segmentation in CT Scans
Yan Wang, Xu Wei, Fengze Liu, Jieneng Chen, Yuyin Zhou, Wei Shen, Elliot K. Fishman, Alan L. Yuille


Tubular structure segmentation in medical images, e.g., segmenting vessels in CT scans, serves as a vital step in the use of computers to aid in screening early stages of related diseases. But automatic tubular structure segmentation in CT scans is a challenging problem, due to issues such as poor contrast, noise and complicated background. A tubular structure usually has a cylinder-like shape which can be well represented by its skeleton and cross-sectional radii (scales). Inspired by this, we propose a geometry-aware tubular structure segmentation method, Deep Distance Transform (DDT), which combines intuitions from the classical distance transform for skeletonization and modern deep segmentation networks. DDT first learns a multi-task network to predict a segmentation mask for a tubular structure and a distance map. Each value in the map represents the distance from each tubular structure voxel to the tubular structure surface. Then the segmentation mask is refined by leveraging the shape prior reconstructed from the distance map. We apply our DDT on six medical image datasets. Results show that (1) DDT can boost tubular structure segmentation performance significantly (e.g., over 13% DSC improvement for pancreatic duct segmentation), and (2) DDT additionally provides a geometrical measurement for a tubular structure, which is important for clinical diagnosis (e.g., the cross-sectional scale of a pancreatic duct can be an indicator for pancreatic cancer).
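A minimal NumPy/SciPy sketch of how per-voxel distance-to-surface targets could be built from a binary tubular-structure mask, in the spirit of the classical distance transform the abstract describes; the quantization into a fixed number of bins and the bin parameters are illustrative assumptions.

import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_targets(mask, spacing=(1.0, 1.0, 1.0), num_bins=8, max_dist=16.0):
    # mask: binary 3D array (1 inside the tubular structure, 0 outside).
    # Distance from every foreground voxel to the nearest background voxel,
    # i.e. approximately to the tubular surface, in physical units.
    dist = distance_transform_edt(mask, sampling=spacing)
    # Quantize into K bins so a network can predict the distance as a class label.
    bins = np.clip((dist / max_dist * num_bins).astype(np.int64), 0, num_bins - 1)
    bins[mask == 0] = 0  # background voxels carry no distance label
    return dist, bins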
[skeleton, dataset, prediction] [tubular, segmentation, ddt, tumor, duct, predicted, pancreatic, map, pdac, segbaseline, alan, wei, table, elliot, detection, head, mask, resdsn, gvk, wdis, including, backbone, wcls, gar, vessel, yuyin, segment, refinement, aorta, denotes, segmenting, skeletonization] [model, datasets] [scale, transform, dilated, medical, method, clinical, dsc, proposed, figure, ieee, extraction] [loss, image, yan, pseudo] [deep, network, performance, class, training, learning, candidate, label, pancreas, set, find, abnormal, classification, probability, average, size, finding, better, reported] [distance, structure, voxel, shape, normal, geometric, voxels, approach, term, scan, surface, reconstructed]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yan and Wei, Xu and Liu, Fengze and Chen, Jieneng and Zhou, Yuyin and Shen, Wei and Fishman, Elliot K. and Yuille, Alan L.},
  title = {Deep Distance Transform for Tubular Structure Segmentation in CT Scans},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Instance Segmentation of Biological Images Using Harmonic Embeddings
Victor Kulikov, Victor Lempitsky


We present a new instance segmentation approach tailored to biological images, where instances may correspond to individual cells, organisms or plant parts. Unlike instance segmentation for user photographs or road scenes, in biological data object instances may be particularly densely packed, the appearance variation may be particularly low, the processing power may be restricted, while, on the other hand, the variability of sizes of individual instances may be limited. The proposed approach successfully addresses these peculiarities. Our approach describes each object instance using an expectation of a limited number of sine waves with frequencies and phases adjusted to particular object sizes and densities. At train time, a fully-convolutional network is learned to predict the object embeddings at each pixel using a simple pixelwise regression loss, while at test time the instances are recovered using clustering in the embedding space. In the experiments, we show that our approach outperforms previous embedding-based instance segmentation approaches on a number of biological datasets, achieving state-of-the-art on a popular CVPPP benchmark. This excellent performance is combined with computational efficiency that is needed for deployment to domain specialists. The source code of the approach is available at https://github.com/kulikovv/harmonic .
[embedding, embeddings, dataset, recurrent, previous, individual, outperforms, work] [instance, segmentation, object, guide, sinconv, table, guided, plant, mask, semantic, cvppp, coordconv, sbd, coco, ablation] [datasets, input, study, pixelwise] [method, convolutional, ieee, biological, pixel, pattern, biomedical, figure, proposed, based, low, high, scale] [image, loss, harmonic, train, discriminative, row, perform] [set, network, learning, training, test, number, deep, neural, function, performance, implementation, good, process, small, achieve, metric, random, learned, simple, clustering, size, layer, best, baseline, data, space] [approach, conference, computer, ground, truth, vision, well, european, second, complex, compare]
@InProceedings{Kulikov_2020_CVPR,
  author = {Kulikov, Victor and Lempitsky, Victor},
  title = {Instance Segmentation of Biological Images Using Harmonic Embeddings},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-scale Domain-adversarial Multiple-instance CNN for Cancer Subtype Classification with Unannotated Histopathological Images
Noriaki Hashimoto, Daisuke Fukushima, Ryoichi Koga, Yusuke Takagi, Kaho Ko, Kei Kohno, Masato Nakaguro, Shigeo Nakamura, Hidekata Hontani, Ichiro Takeuchi


We propose a new method for cancer subtype classification from histopathological images, which can automatically detect tumor-specific features in a given whole slide image (WSI). The cancer subtype should be classified by referring to a WSI, i.e., a large-sized image (typically 40,000x40,000 pixels) of an entire pathological tissue slide, which consists of cancer and non-cancer portions. One difficulty arises from the high cost associated with annotating tumor regions in WSIs. Furthermore, both global and local image features must be extracted from the WSI by changing the magnifications of the image. In addition, the image features should be stably detected against the differences of staining conditions among the hospitals/specimens. In this paper, we develop a new CNN-based cancer subtype classification method by effectively combining multiple-instance, domain adversarial, and multi-scale learning frameworks in order to overcome these practical difficulties. When the proposed method was applied to malignant lymphoma subtype classifications of 196 cases collected from multiple hospitals, the classification performance was significantly better than the standard CNN or other conventional methods, and the accuracy compared favorably with that of standard pathologists.
[multiple, attention, order, expert, difficulty] [bag, feature, instance, cnn, mil, histopathological, tumor, positive, predicted, extractor, van, stage, breast] [trained, digital, magnification, input, effectively] [method, proposed, tissue, scale, wsi, ieee, staining, patch, figure, color, medical, convolutional, wsis, imaging, high] [image, subtype, stained, domain, slide, cancer, malignant, lymphoma, dlbcl, extracted, loss, histopathology, pathology, generated, pathological] [classification, learning, class, label, neural, set, training, deep, network, number, vector, predictor, standard, large, negative, accuracy, problem, probability, exp, parameter] [international, conference, approach, computer]
@InProceedings{Hashimoto_2020_CVPR,
  author = {Hashimoto, Noriaki and Fukushima, Daisuke and Koga, Ryoichi and Takagi, Yusuke and Ko, Kaho and Kohno, Kei and Nakaguro, Masato and Nakamura, Shigeo and Hontani, Hidekata and Takeuchi, Ichiro},
  title = {Multi-scale Domain-adversarial Multiple-instance CNN for Cancer Subtype Classification with Unannotated Histopathological Images},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SOS: Selective Objective Switch for Rapid Immunofluorescence Whole Slide Image Classification
Sam Maksoud, Kun Zhao, Peter Hobson, Anthony Jennings, Brian C. Lovell


The difficulty of processing gigapixel whole slide images (WSIs) in clinical microscopy has been a long-standing barrier to implementing computer aided diagnostic systems. Since modern computing resources are unable to perform computations at this extremely large scale, current state-of-the-art methods utilize patch-based processing to preserve the resolution of WSIs. However, these methods are often resource intensive and make significant compromises on processing time. In this paper, we demonstrate that conventional patch-based processing is redundant for certain WSI classification tasks where high resolution is only required in a minority of cases. This reflects what is observed in clinical practice, where a pathologist may screen slides using a low-power objective and only switch to a high-power objective in cases where they are uncertain about their findings. To eliminate these redundancies, we propose a method for the selective use of high resolution processing based on the confidence of predictions on downscaled WSIs --- we call this the Selective Objective Switch (SOS). Our method is validated on a novel dataset of 684 Liver-Kidney-Stomach immunofluorescence WSIs routinely used in the investigation of autoimmune liver disease. By limiting high resolution processing to cases which cannot be classified confidently at low resolution, we maintain the accuracy of patch-level analysis whilst reducing the inference time by a factor of 7.74.
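A minimal PyTorch-style sketch of the confidence-gated routing idea: run a cheap low-resolution pass first and only fall back to the expensive high-resolution (patch-based) pass when the prediction is not confident. The threshold value and the network/downscale interfaces are illustrative assumptions.

import torch

@torch.no_grad()
def classify_wsi(wsi, low_res_net, high_res_net, downscale, threshold=0.9):
    # low_res_net, high_res_net, downscale are assumed callables; logits shape (1, num_classes).
    thumb = downscale(wsi)                        # heavily downscaled WSI
    probs = low_res_net(thumb).softmax(dim=-1)    # fast first pass
    conf, pred = probs.max(dim=-1)
    if conf.item() >= threshold:                  # confident enough: stop here
        return pred.item()
    # Otherwise pay for the high-resolution, patch-based pass.
    return high_res_net(wsi).softmax(dim=-1).argmax(dim=-1).item()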
[time, correct, policy, multiple, visual, recurrent, evaluation, dataset, described, individual, long, gru] [confidence, feature, table, predicted, segmentation] [model, protocol, input, classified, decision] [resolution, wsi, high, low, method, lrn, hrn, wsis, patch, dynamic, rdms, conventional, based, figure, epu, lks, analysis, proposed, executive, medical, ieee, lhe, clinical, liver, convolutional, cnns, paradoxical, spatial, scale] [image, loss, cancer, cross, slide] [classification, processing, class, accuracy, network, max, objective, neural, learning, probability, function, switch, inference, training, number, set, label, lower, entropy, classify, size, arg, total, deep, performance] [conference, computer, compute]
@InProceedings{Maksoud_2020_CVPR,
  author = {Maksoud, Sam and Zhao, Kun and Hobson, Peter and Jennings, Anthony and Lovell, Brian C.},
  title = {SOS: Selective Objective Switch for Rapid Immunofluorescence Whole Slide Image Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Task Agnostic Robust Learning on Corrupt Outputs by Correlation-Guided Mixture Density Networks
Sungjoon Choi, Sanghoon Hong, Kyungjae Lee, Sungbin Lim


In this paper, we focus on weakly supervised learning with noisy training data for both classification and regression problems. We assume that the training outputs are collected from a mixture of a target and correlated noise distributions. Our proposed method simultaneously estimates the target distribution and the quality of each data point, which is defined as the correlation between the target and data generating distributions. The cornerstone of the proposed method is a Cholesky Block that enables modeling dependencies among mixture distributions in a differentiable manner where we maintain the distribution over the network weights. We first provide illustrative examples in both regression and classification tasks to show the effectiveness of the proposed method. Then, the proposed method is extensively evaluated in a number of experiments where we show that it consistently achieves comparable or superior performance compared to existing baseline methods in the handling of noisy data.
[dataset, modeling, work, infer, mechanism] [correlation, regression, feature, table, employ] [cholesky, choicenet, robust, correlated, model, noise, corrupt, datasets, corruption, input, robustness, clean, quality, illustrative, true, collected] [noisy, proposed, method, output, figure, block, ieee, pattern, gaussian, transform, based, presented] [target, loss, synthetic, distinguish, image, learn, train] [mixture, training, data, learning, neural, deep, distribution, classification, network, label, function, weight, test, task, density, probability, variance, processing, compared, matrix, performance, random, mdn, sampling, algorithm, binary, baseline, requires, problem, sampled, regularization, small, base] [conference, second, international, computer, fitting, estimating, additional]
@InProceedings{Choi_2020_CVPR,
  author = {Choi, Sungjoon and Hong, Sanghoon and Lee, Kyungjae and Lim, Sungbin},
  title = {Task Agnostic Robust Learning on Corrupt Outputs by Correlation-Guided Mixture Density Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
METAL: Minimum Effort Temporal Activity Localization in Untrimmed Videos
Da Zhang, Xiyang Dai, Yuan-Fang Wang


Existing Temporal Activity Localization (TAL) methods largely adopt strong supervision for model training, which requires (1) vast amounts of untrimmed videos for each activity category and (2) accurate segment-level boundary annotations (start time and end time) for every instance. This poses a critical restriction on current methods in practical scenarios where not only are segment-level annotations expensive to obtain, but many activity categories are also rare and unobserved during training. Therefore, can we learn a TAL model under weak supervision that can localize unseen activity classes? To address this scenario, we define a novel example-based TAL problem called Minimum Effort Temporal Activity Localization (METAL): Given only a few examples, the goal is to find the occurrences of semantically-related segments in an untrimmed video sequence while model training is only supervised by the video-level annotation. Towards this objective, we propose a novel Similarity Pyramid Network (SPN) that adopts the few-shot learning technique of Relation Network and directly encodes hierarchical multi-scale correlations, which we learn by optimizing two complementary loss functions in an end-to-end manner. We evaluate the SPN on the THUMOS'14 and ActivityNet datasets, for which we rearrange the videos to fit the METAL setup. Results show that our SPN achieves performance superior or competitive to state-of-the-art approaches with stronger supervision.
[temporal, untrimmed, trimmed, activity, video, action, spn, relation, tal, activitynet, metal, gcn, embedding, sequence, cssl, dataset, localize, recognition] [localization, feature, pyramid, module, boundary, positive, table, map, supervision, weakly, detection, score, weak, challenging] [model, testing, trained, input] [ieee, pattern, based, convolutional, proposed, method, figure] [loss, supervised, unseen, learn] [similarity, training, network, set, learning, problem, number, support, neural, deep, follow, performance, setup, sample, applied, classification, consider, denote, average, reported] [conference, computer, vision, international, directly, structure, defined, define, compute, approach, single]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Da and Dai, Xiyang and Wang, Yuan-Fang},
  title = {METAL: Minimum Effort Temporal Activity Localization in Untrimmed Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data
Xi Yan, David Acuna, Sanja Fidler


Transfer learning has proven to be a successful technique to train deep learning models in the domains where little training data is available. The dominant approach is to pretrain a model on a large generic dataset such as ImageNet and finetune its weights on the target domain. However, in the new era of an ever-increasing number of massive datasets, selecting the relevant data for pretraining is a critical issue. We introduce Neural Data Server (NDS), a large-scale search engine for finding the most useful transfer learning data to the target domain. NDS consists of a dataserver which indexes several large popular image datasets, and aims to recommend data to a client, an end-user with a target application with its own small labeled dataset. The dataserver represents large datasets with a much more compact mixture-of-experts model, and employs it to perform data search in a series of dataserver-client transactions at a low computational cost. We show the effectiveness of NDS in various transfer learning scenarios, demonstrating state-of-the-art performance on several target datasets and tasks such as image classification, object detection and instance segmentation. Neural Data Server is available as a web-service at http://aidemos.cs.toronto.edu/nds/.
[dataset, expert, relevant, engine, downstream, represent] [object, detection, instance, segmentation, table, semantic, sanja, represents] [datasets, model, trained, case] [method, figure, scale] [target, transfer, image, domain, source, representation, train, perform, pretrained] [data, client, learning, performance, task, dataserver, subset, server, neural, training, classification, set, network, proxy, imagenet, pretraining, sampling, uniform, search, indexed, large, deep, labeled, class, massive, selected, sampled, number, openimages, small, amount, sample, function, evaluate, selecting, finding, problem, size, gating, simple, recommend] [approach, computer, david, determine, assume, compute, compare]
@InProceedings{Yan_2020_CVPR,
  author = {Yan, Xi and Acuna, David and Fidler, Sanja},
  title = {Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Revisiting Knowledge Distillation via Label Smoothing Regularization
Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, Jiashi Feng


Knowledge Distillation (KD) aims to distill the knowledge of a cumbersome teacher model into a lightweight student model. Its success is generally attributed to the privileged information on similarities among categories provided by the teacher model, and in this sense, only strong teacher models are deployed to teach weaker students in practice. In this work, we challenge this common belief with the following experimental observations: 1) beyond the acknowledgment that the teacher can improve the student, the student can also enhance the teacher significantly by reversing the KD procedure; 2) a poorly-trained teacher with much lower accuracy than the student can still improve the latter significantly. To explain these observations, we provide a theoretical analysis of the relationships between KD and label smoothing regularization. We prove that 1) KD is a type of learned label smoothing regularization and 2) label smoothing regularization provides a virtual teacher model for KD. From these results, we argue that the success of KD is not only due to the similarity information between categories provided by the teacher, but also to the regularization of soft targets, which is equally or even more important. Based on these analyses, we further propose a novel Teacher-free Knowledge Distillation (Tf-KD) framework, where a student model learns from itself or from a manually-designed regularization distribution. Tf-KD achieves performance comparable to normal KD from a superior teacher, and can be applied when a stronger teacher model is unavailable. Meanwhile, Tf-KD is generic and can be directly deployed for training deep neural networks. Without any extra computation cost, Tf-KD achieves up to 0.65% improvement on ImageNet over well-established baseline models, which is superior to label smoothing regularization.
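A minimal PyTorch-style sketch of the "virtual teacher" view described above: label smoothing written as a KD-style loss against a hand-designed soft distribution that puts mass a on the correct class and spreads the rest uniformly. The hyper-parameter values and the exact loss weighting are illustrative assumptions, not the paper's released configuration.

import torch
import torch.nn.functional as F

def teacher_free_kd_loss(logits, targets, a=0.99, temperature=20.0, alpha=0.9):
    num_classes = logits.size(1)
    # Hand-crafted "virtual teacher": probability a on the true class, (1 - a)/(K - 1) elsewhere.
    virtual = torch.full_like(logits, (1.0 - a) / (num_classes - 1))
    virtual.scatter_(1, targets.unsqueeze(1), a)
    ce = F.cross_entropy(logits, targets)                       # standard supervised term
    kd = F.kl_div(F.log_softmax(logits / temperature, dim=1),   # KL(virtual teacher || student)
                  virtual, reduction='batchmean') * temperature ** 2
    return (1.0 - alpha) * ce + alpha * kd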
[work, provide, correct] [achieves, improvement, table, propose, adopt, weak] [model, improve, stronger, trained, explain] [method, output, ieee, enhance, pattern, based, designed, superior, smoothed] [train, common, loss, image] [teacher, knowledge, student, label, distillation, smoothing, regularization, distribution, baseline, accuracy, lsr, experiment, soft, performance, neural, teach, similarity, learning, find, temperature, training, uniform, comparable, taught, network, manually, improved, probability, arxiv, preprint, learned, deep, exploratory, observe, set, imagenet, higher, implementation, function, computation] [normal, computer, conference, vision, virtual, supplementary]
@InProceedings{Yuan_2020_CVPR,
  author = {Yuan, Li and Tay, Francis EH and Li, Guilin and Wang, Tao and Feng, Jiashi},
  title = {Revisiting Knowledge Distillation via Label Smoothing Regularization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
WCP: Worst-Case Perturbations for Semi-Supervised Deep Learning
Liheng Zhang, Guo-Jun Qi


In this paper, we present a novel regularization mechanism for training deep networks by minimizing the Worst-Case Perturbation (WCP). It is based on the idea that a robust model is least likely to be affected by small perturbations, such that its output decisions should be as stable as possible on both labeled and unlabeled examples. We consider two forms of WCP regularization -- additive and DropConnect perturbations, which impose additive noise on network weights and make structural changes by dropping network connections, respectively. We show that the worst cases of both perturbations can be derived by solving the respective optimization problems with spectral methods. The WCP can be minimized on both labeled and unlabeled data so that networks can be trained in a semi-supervised fashion. This leads to a novel paradigm of semi-supervised classifiers obtained by stabilizing the predicted outputs in the presence of worst-case perturbations imposed on the network weights and structures.
[relation, temporal] [table, boundary, sigmoid] [model, perturbation, robust, change, adversarial, trained, largest, example, perturbed, input, liheng] [ieee, proposed, pattern, spectral, method, block, convolutional, based, output] [train, minimizing, corresponding, idea, loss, unsupervised, ensembling] [wcp, dropconnect, network, additive, training, learning, deep, regularizer, rate, labeled, unlabeled, data, machine, margin, large, vector, neural, function, impact, applied, max, gradient, svhn, entropy, minimization, regularization, consider, principle, classifier, number, dropped, bqp, performance, stable, dropping, linear] [error, conference, international, constraint, computer, vision, virtual, solving]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Liheng and Qi, Guo-Jun},
  title = {WCP: Worst-Case Perturbations for Semi-Supervised Deep Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DEPARA: Deep Attribution Graph for Deep Knowledge Transferability
Jie Song, Yixin Chen, Jingwen Ye, Xinchao Wang, Chengchao Shen, Feng Mao, Mingli Song


Exploring the intrinsic interconnections between the knowledge encoded in PRe-trained Deep Neural Networks (PR-DNNs) of heterogeneous tasks sheds light on their mutual transferability, and consequently enables knowledge transfer from one task to another so as to reduce the training effort of the latter. In this paper, we propose the DEeP Attribution gRAph (DEPARA) to investigate the transferability of knowledge learned from PR-DNNs. In DEPARA, nodes correspond to the inputs and are represented by their vectorized attribution maps with regards to the outputs of the PR-DNN. Edges denote the relatedness between inputs and are measured by the similarity of their features extracted from the PR-DNN. The knowledge transferability of two PR-DNNs is measured by the similarity of their corresponding DEPARAs. We apply DEPARA to two important yet under-studied problems in transfer learning: pre-trained model selection and layer selection. Extensive experiments are conducted to demonstrate the effectiveness and superiority of the proposed method in solving both these problems. Code, data and models reproducing the results in this paper are available at https://github.com/zju-vipa/DEPARA.
[embedding, graph, three, represent] [adopt, denotes, effectiveness, edge, propose, table, highest] [transferability, attribution, depara, model, probe, trained, fki, input, topological, relatedness, nonlinear, definition, gki, deparas, easily] [proposed, method, figure, viewed] [target, transfer, transferred, produced, produce, transferring, source, representation, domain, introduce] [task, knowledge, data, deep, layer, similarity, taskonomy, selection, performance, space, learned, learning, set, denoted, labeled, randomly, accuracy, problem, neural, imagenet, better, denote, higher, conducted, note, rsa, yield, investigate, amount, number, large, efficient, inclusiveness, rank, indicates] [solving, measuring, provided, defined, directly, supplementary, demonstrate, measured, assume]
@InProceedings{Song_2020_CVPR,
  author = {Song, Jie and Chen, Yixin and Ye, Jingwen and Wang, Xinchao and Shen, Chengchao and Mao, Feng and Song, Mingli},
  title = {DEPARA: Deep Attribution Graph for Deep Knowledge Transferability},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Conditional Channel Gated Networks for Task-Aware Continual Learning
Davide Abati, Jakub Tomczak, Tijmen Blankevoort, Simone Calderara, Rita Cucchiara, Babak Ehteshami Bejnordi


Convolutional Neural Networks experience catastrophic forgetting when optimized on a sequence of learning problems: as they meet the objective of the current training examples, their performance on previous tasks drops drastically. In this work, we introduce a novel framework to tackle this problem with conditional computation. We equip each convolutional layer with task-specific gating modules, selecting which filters to apply on the given input. This way, we achieve two appealing properties. Firstly, the execution patterns of the gates allow us to identify and protect important filters, ensuring no loss in the performance of the model for previously learned tasks. Secondly, by using a sparsity objective, we can promote the selection of a limited set of kernels, allowing the model to retain sufficient capacity to digest new tasks. Existing solutions require, at test time, awareness of the task to which each example belongs. This knowledge, however, may not be available in many practical scenarios. Therefore, we additionally introduce a task classifier that predicts the task label of each example, to deal with settings in which a task oracle is not available. We validate our proposal on four continual learning datasets. Results show that our model consistently outperforms existing methods both in the presence and the absence of a task oracle. Notably, on the Split SVHN and Imagenet-50 datasets, our model yields up to 23.98% and 17.42% improvement in accuracy w.r.t. competing methods.
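A minimal PyTorch-style sketch of a task-conditioned, input-dependent channel-gating module of the kind the abstract describes, using Gumbel-softmax as a generic differentiable relaxation of the binary keep/drop decision. The gating head architecture, hidden size, and temperature are illustrative assumptions and differ from the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelGate(nn.Module):
    def __init__(self, channels, num_tasks, hidden=16):
        super().__init__()
        # One small gating head per task, producing keep/drop logits per channel.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(),
                          nn.Linear(hidden, channels * 2))
            for _ in range(num_tasks)])

    def forward(self, x, task_id):
        # x: (B, C, H, W) feature map entering a convolutional layer.
        b, c = x.shape[:2]
        pooled = x.mean(dim=(2, 3))                          # (B, C) global context
        logits = self.heads[task_id](pooled).view(b, c, 2)   # keep/drop logits per channel
        gate = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 0]  # (B, C) binary gates
        return x * gate.view(b, c, 1, 1)                     # switch channels on/off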
[prediction, current, gated, multiple, hat, future] [split, feature, backbone, map, module, employ, framework, employed] [model, mnist, input, trained] [convolutional, prior, figure, pattern, based, block, ieee, existing, relu, conv] [generative, conditional] [task, learning, gating, continual, layer, forgetting, classifier, memory, number, neural, network, classification, setting, class, accuracy, catastrophic, objective, training, replay, applied, test, episodic, set, buffer, performance, sparsity, svhn, machine, deep, gate, knowledge, gradient, stored, problem, capacity, sampling, forward, rehearsal, binary, optimization, computation, requires, average] [conference, international, computer, vision, rely, cost, approach]
@InProceedings{Abati_2020_CVPR,
  author = {Abati, Davide and Tomczak, Jakub and Blankevoort, Tijmen and Calderara, Simone and Cucchiara, Rita and Bejnordi, Babak Ehteshami},
  title = {Conditional Channel Gated Networks for Task-Aware Continual Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Discriminability and Diversity: Batch Nuclear-Norm Maximization Under Label Insufficient Situations
Shuhao Cui, Shuhui Wang, Junbao Zhuo, Liang Li, Qingming Huang, Qi Tian


The learning of deep networks largely relies on data with human-annotated labels. In some label-insufficient situations, the performance degrades on the decision boundary with high data density. A common solution is to directly minimize the Shannon entropy, but the side effect caused by entropy minimization, i.e., reduction of the prediction diversity, is mostly ignored. To address this issue, we reinvestigate the structure of the classification output matrix of a randomly selected data batch. We find by theoretical analysis that the prediction discriminability and diversity could be separately measured by the Frobenius-norm and rank of the batch output matrix. Besides, the nuclear-norm is an upper bound of the Frobenius-norm, and a convex approximation of the matrix rank. Accordingly, to improve both discriminability and diversity, we propose Batch Nuclear-norm Maximization (BNM) on the output matrix. BNM can boost learning under typical label-insufficient scenarios, such as semi-supervised learning, domain adaptation and open domain recognition. On these tasks, extensive experimental results show that BNM outperforms competitors and works well with existing well-known methods. The code is available at https://github.com/cuishuhao/BNM
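A minimal PyTorch sketch of the batch nuclear-norm maximization term: on an unlabeled batch, maximize the nuclear norm of the (batch x classes) prediction matrix, which jointly encourages confident (discriminable) and diverse predictions. The loss weight is an illustrative assumption.

import torch

def bnm_loss(logits, weight=1.0):
    probs = torch.softmax(logits, dim=1)        # (B, C) batch output matrix
    nuclear = torch.norm(probs, p='nuc')        # nuclear norm = sum of singular values
    return -weight * nuclear / probs.size(0)    # minimizing the negative maximizes the norm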
[prediction, recognition, outperforms, three, graph] [category, improvement, table, predicted] [adversarial, insufficient, decision, improve, model] [ieee, method, pattern, output, figure, existing, prior] [domain, diversity, unsupervised, discriminability, adaptation, unknown, loss, discrepancy, image] [bnm, batch, entropy, matrix, learning, kakf, classification, unlabeled, data, open, deep, labeled, training, minimization, ratio, minority, label, maximization, randomly, neural, large, number, accuracy, rank, majority, maximum, balance, knowledge, achieve, machine, arxiv, preprint, applied, maximizing, average, size, entmin, selected, better, increase, maintain, imbalanced, calculated, fixed, processing] [conference, computer, vision, measured, directly, convex, michael, international]
@InProceedings{Cui_2020_CVPR,
  author = {Cui, Shuhao and Wang, Shuhui and Zhuo, Junbao and Li, Liang and Huang, Qingming and Tian, Qi},
  title = {Towards Discriminability and Diversity: Batch Nuclear-Norm Maximization Under Label Insufficient Situations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FocalMix: Semi-Supervised Learning for 3D Medical Image Detection
Dong Wang, Yuan Zhang, Kexin Zhang, Liwei Wang


Applying artificial intelligence techniques in medical imaging is one of the most promising areas in medicine. However, most of the recent success in this area highly relies on large amounts of carefully annotated data, whereas annotating medical images is a costly process. In this paper, we propose a novel method, called FocalMix, which, to the best of our knowledge, is the first to leverage recent advances in semi-supervised learning (SSL) for 3D medical image detection. We conducted extensive experiments on two widely used datasets for lung nodule detection, LUNA16 and NLST. Results show that our proposed SSL methods can achieve a substantial improvement of up to 17.3% over state-of-the-art supervised learning approaches with 400 unlabeled CT scans.
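A minimal PyTorch sketch of image-level MixUp with soft targets, the basic ingredient that FocalMix adapts to anchor-level detection targets; the Beta parameter and the flat soft-target format are illustrative assumptions, and the anchor-level target mixing and soft-target focal loss of the paper are not shown.

import torch

def mixup(images, soft_targets, alpha=1.0):
    # images: (B, ...) tensor; soft_targets: (B, ...) soft labels of matching batch size.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_targets = lam * soft_targets + (1.0 - lam) * soft_targets[perm]
    return mixed_images, mixed_targets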
[prediction, work, dataset] [detection, anchor, nodule, object, focalmix, pulmonary, positive, propose, cpm, table, annotated, score, predicted, feature, bounding, framework, box, fpn, sharpening] [model, improve, example, ensemble] [medical, proposed, figure, method, imaging, patch, ieee, high, output] [image, loss, target, lesion, supervised, consistency, train] [mixup, learning, unlabeled, data, labeled, ssl, training, performance, augmentation, deep, classification, soft, neural, lung, modern, mixmatch, set, class, large, better, average, network, base, processing, computing, achieve, probability, semisupervised] [conference, focal, international, term, computer, approach]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Dong and Zhang, Yuan and Zhang, Kexin and Wang, Liwei},
  title = {FocalMix: Semi-Supervised Learning for 3D Medical Image Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning 3D Semantic Scene Graphs From 3D Indoor Reconstructions
Johanna Wald, Helisa Dhamo, Nassir Navab, Federico Tombari


Scene understanding has been of high interest in computer vision. It encompasses not only identifying objects in a scene, but also their relationships within the given context. With this goal, a recent line of work tackles 3D semantic segmentation and scene layout prediction. In our work we focus on scene graphs, a data structure that organizes the entities of a scene in a graph, where objects are nodes and their relationships are modeled as edges. We leverage inference on scene graphs as a way to carry out 3D scene understanding, mapping objects and their relationships. In particular, we propose a learned method that regresses a scene graph from the point cloud of a scene. Our novel architecture is based on PointNet and Graph Convolutional Networks (GCN). In addition, we introduce 3DSSG, a semi-automatically generated dataset that contains semantically rich scene graphs of 3D scenes. We show the application of our method in a domain-agnostic retrieval task, where graphs serve as an intermediate representation for 3D-3D and 2D-3D matching.
[graph, prediction, retrieval, recognition, node, predicate, dataset, relationship, understanding, context, hierarchical, visual, represent, multiple, lexical, proximity, explore, gcn, lying, describe, language] [object, semantic, instance, segmentation, focus, table, edge, predicted, propose, feature, federico, annotated] [model, definition] [pattern, method, based, convolutional, figure, spatial, reference] [image, changing, semantically, domain, loss, mapping, representation] [class, support, set, classification, similarity, data, network, task, learning, processing, neural, large, hierarchy] [scene, computer, vision, conference, point, indoor, international, well, single, matching, shape, pointnet, leonidas, chair, matthias, european, hao, define]
@InProceedings{Wald_2020_CVPR,
  author = {Wald, Johanna and Dhamo, Helisa and Navab, Nassir and Tombari, Federico},
  title = {Learning 3D Semantic Scene Graphs From 3D Indoor Reconstructions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Supervised Viewpoint Learning From Image Collections
Siva Karthik Mustikovela, Varun Jampani, Shalini De Mello, Sifei Liu, Umar Iqbal, Carsten Rother, Jan Kautz


Training deep neural networks to estimate the viewpoint of objects requires large labeled training datasets. However, manually labeling viewpoints is notoriously hard, error-prone, and time-consuming. On the other hand, it is relatively easy to mine many unlabeled images of an object category from the internet, e.g., of cars or faces. We seek to answer the research question of whether such unlabeled collections of in-the-wild images can be successfully utilized to train viewpoint estimation networks for general object categories purely via self-supervision. Self-supervision here refers to the fact that the only true supervisory signal that the network has is the input image itself. We propose a novel learning framework which incorporates an analysis-by-synthesis paradigm to reconstruct images in a viewpoint aware manner with a generative network, along with symmetry and adversarial constraints to successfully supervise our viewpoint estimation network. We show that our approach performs competitively to fully-supervised approaches for several object categories like human faces, cars, buses, and trains. Our work opens up further research in self-supervised viewpoint learning and serves as a robust baseline for it. We open-source our code at https://github.com/NVlabs/SSV.
[dataset, predict] [object, head, predicted, table, framework, propose, car, aware, supervision, feature] [face, input, model, adversarial, facial, flipped, datasets, create] [figure, mae, performs, tilt, analysis, existing] [image, synthesis, style, train, consistency, loss, generative, learn, code, synthesized, supervised, synthetic, unsupervised, real, paired, manner, corresponding, selfsupervised, representation, jan] [network, learning, training, deep, neural, general, better, large] [viewpoint, estimation, pose, ssv, symmetry, hologan, approach, biwi, additional, reconstruction, error, elevation, collection, constraint, human, demonstrate, keypoints, euler, vision, geometric, leverage, azimuth, lsv, canonical]
@InProceedings{Mustikovela_2020_CVPR,
  author = {Mustikovela, Siva Karthik and Jampani, Varun and Mello, Shalini De and Liu, Sifei and Iqbal, Umar and Rother, Carsten and Kautz, Jan},
  title = {Self-Supervised Viewpoint Learning From Image Collections},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Two-Shot Spatially-Varying BRDF and Shape Estimation
Mark Boss, Varun Jampani, Kihwan Kim, Hendrik P.A. Lensch, Jan Kautz


Capturing the shape and spatially-varying appearance (SVBRDF) of an object from images is a challenging task that has applications in both computer vision and graphics. Traditional optimization-based approaches often need a large number of images taken from multiple views in a controlled environment. Newer deep learning-based approaches require only a few input images, but the reconstruction quality is not on par with optimization techniques. We propose a novel deep learning architecture with a stage-wise estimation of shape and SVBRDF. The previous predictions guide each estimation, and a joint refinement network later refines both SVBRDF and shape. We follow a practical mobile image capture setting and use unaligned two-shot flash and no-flash images as input. Both our two-shot image capture and network inference can run on mobile hardware. We also create a large-scale synthetic training dataset with domain-randomized geometry and realistic materials. Extensive experiments on both synthetic and real-world datasets show that our network trained on a synthetic dataset can generalize well to real-world images. Comparisons with recent approaches demonstrate the superior performance of the proposed approach.
[dataset, recognition, environment, work, visual] [object, predicted, map, mask, refinement, merge] [input, trained] [illumination, ieee, pattern, cascaded, light, method, figure, pixel, comparison, residual, convolutional] [image, synthetic, loss, separate, appearance] [network, deep, mobile, learning, compared, better, neural, practical, data, architecture, problem, large, training, layer, sample, task, optimization, inference] [shape, svbrdf, depth, estimation, flash, computer, vision, conference, brdf, normal, capture, single, diffuse, specular, joint, acm, rendering, roughness, intrinsic, material, well, estimate, reflectance, international, scene, monocular, approach, camera, direct, surface, estimated, novel, albedo, refer, supplementary, geometry]
@InProceedings{Boss_2020_CVPR,
  author = {Boss, Mark and Jampani, Varun and Kim, Kihwan and Lensch, Hendrik P.A. and Kautz, Jan},
  title = {Two-Shot Spatially-Varying BRDF and Shape Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Variational Context-Deformable ConvNets for Indoor Scene Parsing
Zhitong Xiong, Yuan Yuan, Nianhui Guo, Qi Wang


Context information is critical for image semantic segmentation. Especially in indoor scenes, the large variation of object scales makes spatial context an important factor for improving the segmentation performance. Thus, in this paper, we propose a novel variational context-deformable (VCD) module to learn adaptive receptive fields in a structured fashion. Different from standard ConvNets, which share a fixed-size spatial context for all pixels, the VCD module learns a deformable spatial context with the guidance of depth information: depth information provides clues for identifying real local neighborhoods. Specifically, adaptive Gaussian kernels are learned with the guidance of multimodal information. By multiplying the learned Gaussian kernel with standard convolution filters, the VCD module can aggregate flexible spatial context for each pixel during convolution. The main contributions of this work are as follows: 1) a novel VCD module is proposed, which exploits learnable Gaussian kernels to enable feature learning with structured adaptive context; 2) variational Bayesian probabilistic modeling is introduced for the training of the VCD module, which makes it continuous and more stable; 3) a perspective-aware guidance module is designed to take advantage of multi-modal information for RGB-D segmentation. We evaluate the proposed approach on three widely-used datasets, and the performance improvement demonstrates the effectiveness of the proposed method.
[context, dataset, structured, understanding, modality, multiple, attention, recognition, multimodal] [semantic, segmentation, module, map, feature, cnn, table, object, miou, dcn, employed, sun, wang] [effective, model] [vcd, convolution, proposed, gaussian, ieee, kernel, pattern, spatial, method, pixel, convolutional, deformable, scale, based, figure, fusion, adaptive, guidance, june, analysis, learnable, designed, sigma, result] [image, variational, learn] [learned, performance, standard, learning, deep, network, distribution, large, training, neural, set, baseline, size, small, bayesian, probabilistic] [depth, conference, computer, rgb, vision, scene, indoor, international, rgbd, geometric, local]
@InProceedings{Xiong_2020_CVPR,
  author = {Xiong, Zhitong and Yuan, Yuan and Guo, Nianhui and Wang, Qi},
  title = {Variational Context-Deformable ConvNets for Indoor Scene Parsing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Strip Pooling: Rethinking Spatial Pooling for Scene Parsing
Qibin Hou, Li Zhang, Ming-Ming Cheng, Jiashi Feng


Spatial pooling has been proven highly effective for capturing long-range contextual information in pixel-wise prediction tasks, such as scene parsing. In this paper, beyond conventional spatial pooling that usually has a regular shape of NxN, we rethink the formulation of spatial pooling by introducing a new pooling strategy, called strip pooling, which considers a long but narrow kernel, i.e., 1xN or Nx1. Based on strip pooling, we further investigate spatial pooling architecture design by 1) introducing a new strip pooling module that enables backbone networks to efficiently model long-range dependencies; 2) presenting a novel building block with diverse spatial pooling as a core; and 3) systematically comparing the performance of the proposed strip pooling and conventional spatial pooling techniques. Both novel pooling-based designs are lightweight and can serve as efficient plug-and-play modules in existing scene parsing networks. Extensive experiments on the Cityscapes and ADE20K benchmarks demonstrate that our simple approach establishes new state-of-the-art results. Code is available at https://github.com/Andrew-Qibin/SPNet.
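A minimal PyTorch-style sketch of a strip-pooling block as described above: pool along full rows and full columns (Nx1 and 1xN strips), refine each strip with a 1D-style convolution, broadcast back to the full map and gate the input. The layer sizes and the gating fusion follow the general description and are assumptions, not the released SPNet code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StripPool(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # N x 1 strips: average over each row
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # 1 x N strips: average over each column
        self.conv_h = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.conv_w = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        sh = self.conv_h(self.pool_h(x)).expand(-1, -1, h, w)  # broadcast row strips across columns
        sw = self.conv_w(self.pool_w(x)).expand(-1, -1, h, w)  # broadcast column strips across rows
        return x * torch.sigmoid(self.fuse(F.relu(sh + sw)))   # gate the input with the fused strips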
[context, dependency, work, previous, attention, dataset, visual, long, collecting] [pooling, semantic, parsing, backbone, fcn, module, spm, feature, table, building, global, segmentation, aggregation, pyramid, miou, mpms, contextual, adopt, horizontal, add, object, spnet, narrow, ablation, gang, pascal] [input, effective, improve] [strip, spatial, proposed, convolutional, mpm, figure, based, kernel, output, conv, block, pixel, receptive, analysis, capturing, conventional, residual, spms, tensor, ieee] [image] [base, network, average, performance, set, mixed, size, layer, design, learning, neural, deep, top, number, better, test] [scene, vertical, approach, local, shape, demonstrate, capture, enables]
@InProceedings{Hou_2020_CVPR,
  author = {Hou, Qibin and Zhang, Li and Cheng, Ming-Ming and Feng, Jiashi},
  title = {Strip Pooling: Rethinking Spatial Pooling for Scene Parsing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Few-Shot Object Detection With Attention-RPN and Multi-Relation Detector
Qi Fan, Wei Zhuo, Chi-Keung Tang, Yu-Wing Tai


Conventional methods for object detection typically require a substantial amount of training data and preparing such high-quality training data is very labor-intensive. In this paper, we propose a novel few-shot object detection network that aims at detecting objects of unseen categories with only a few annotated examples. Central to our method are our Attention-RPN, Multi-Relation Detector and Contrastive Training strategy, which exploit the similarity between the few shot support set and query set to detect novel objects while suppressing false detection in the background. To train our network, we contribute a new dataset that contains 1000 categories of various objects with high-quality annotations. To the best of our knowledge, this is one of the first datasets specifically designed for few-shot object detection. Once our few-shot network is trained, it can detect objects of unseen categories without further training or fine-tuning. Our method is general and has a wide range of potential applications. We produce a new state-of-the-art performance on different datasets in the few-shot setting. The dataset link is https://github.com/fanq15/Few-Shot-Object-Detection-Dataset.
[dataset, attention, relation, evaluation, visual, relationship] [object, detection, rpn, fsod, category, coco, detector, feature, table, proposal, faster, head, propose, detect, ross, module, box, lstd, split, background, global, semantic] [model, query, trained, datasets, experimental] [method, figure, proposed, designed, existing] [image, target, train, learn, unseen] [training, support, learning, set, network, test, number, general, performance, better, strategy, large, contrastive, data, imagenet, task, best, deep, metric, label, evaluate, problem, negative, shot, potential, setting, knowledge, size] [novel, matching, approach, directly]
@InProceedings{Fan_2020_CVPR,
  author = {Fan, Qi and Zhuo, Wei and Tang, Chi-Keung and Tai, Yu-Wing},
  title = {Few-Shot Object Detection With Attention-RPN and Multi-Relation Detector},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
What Can Be Transferred: Unsupervised Domain Adaptation for Endoscopic Lesions Segmentation
Jiahua Dong, Yang Cong, Gan Sun, Bineng Zhong, Xiaowei Xu


Unsupervised domain adaptation has attracted growing research attention on semantic segmentation. However, 1) most existing models cannot be directly applied to lesion transfer in medical images, due to the diverse appearances of the same lesion among different datasets; 2) equal attention has been paid to all semantic representations instead of neglecting irrelevant knowledge, which leads to negative transfer of untransferable knowledge. To address these challenges, we develop a new unsupervised semantic transfer model including two complementary modules (i.e., T_D and T_F ) for endoscopic lesions segmentation, which can alternatively determine where and how to explore transferable domain-invariant knowledge between a labeled source lesions dataset (e.g., gastroscope) and an unlabeled target diseases dataset (e.g., enteroscopy). Specifically, T_D focuses on where to translate transferable visual information of medical lesions via a residual transferability-aware bottleneck, while neglecting untransferable visual characterizations. Furthermore, T_F highlights how to augment transferable semantic features of various lesions and automatically ignore untransferable representations, which explores domain-invariant knowledge and in turn improves the performance of T_D. In the end, theoretical analysis and extensive experiments on a medical endoscopic dataset and several non-medical public datasets demonstrate the superiority of our proposed method.
[dataset, visual, attention, recognition, explore, perception, step, automatically, highlight, shift] [semantic, segmentation, feature, table, module, denotes, employed] [model, transferability, datasets, complementary, adversarial, input, theory] [medical, figure, ieee, proposed, pixel, june, pattern, residual, analysis, output, convolutional, develop, high] [transferable, domain, target, source, transfer, unsupervised, adaptation, endoscopic, untransferable, translation, quantified, irrelevant, translate, pseudo, alternatively, augment, image, lad, neglecting, xsi, gap, discrepancy, utilize, corresponding, xtj, discriminator, translated, yang, gta] [knowledge, performance, network, distribution, training, learning, deep, number, unlabeled, theoretical, bottleneck, alternative, set] [conference, computer, vision, international, determine]
@InProceedings{Dong_2020_CVPR,
  author = {Dong, Jiahua and Cong, Yang and Sun, Gan and Zhong, Bineng and Xu, Xiaowei},
  title = {What Can Be Transferred: Unsupervised Domain Adaptation for Endoscopic Lesions Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ADINet: Attribute Driven Incremental Network for Retinal Image Classification
Qier Meng, Satoh Shin'ichi


Retinal diseases encompass a variety of types, including different diseases and severity levels. Training a model with all types of disease in advance is impractical. Dynamically training a model is necessary when a patient with a new disease appears. Deep learning techniques have stood out in recent years, but they suffer from catastrophic forgetting, i.e., a dramatic decrease in performance when new training classes appear. We found that keeping the feature distribution of an old model helps maintain the performance of incremental learning. In this paper, we design a framework named "Attribute Driven Incremental Network" (ADINet), a new architecture that integrates class label prediction and attribute prediction into an incremental learning framework to boost the classification performance. With image-level classification, we apply knowledge distillation (KD) to retain the knowledge of base classes. With attribute prediction, we calculate the weight of each attribute of an image and use these weights for more precise attribute prediction. We design an attribute distillation (AD) loss to retain the information of base class attributes as new classes appear. This incremental learning can be performed multiple times with a moderate drop in performance. The results of an experiment on our private retinal fundus image dataset demonstrate that our proposed method outperforms existing state-of-the-art methods. To demonstrate the generalization of our proposed method, we test it on the ImageNet-150K-sub dataset and show good performance.
[dataset, prediction, recognition, predict, visual, contribution, work, attention, step, evaluation] [table, framework, annotation, feature, boost, category] [model, private, study, trained, face] [proposed, method, ieee, figure, pattern, output, june, comparison, medical, convolutional, designed] [attribute, image, loss, adinet, fundus, disease, retinal, representation, macular] [incremental, learning, classification, base, weight, performance, class, knowledge, distillation, accuracy, average, training, teacher, catastrophic, network, experiment, forgetting, layer, function, compared, deep, label, calculate, retain, student, neural, entropy, problem, lth, data, machine, indicates, number, conducted] [conference, computer, vision, international, estimation, approach, conf]
@InProceedings{Meng_2020_CVPR,
  author = {Meng, Qier and Shin'ichi, Satoh},
  title = {ADINet: Attribute Driven Incremental Network for Retinal Image Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Domain Adaptation With Hierarchical Gradient Synchronization
Lanqing Hu, Meina Kan, Shiguang Shan, Xilin Chen


Domain adaptation attempts to boost the performance on a target domain by borrowing knowledge from a well-established source domain. To handle the distribution gap between two domains, the prominent approaches endeavor to extract domain-invariant features. It is known that after a perfect domain alignment, the domain-invariant representations of the two domains should share the same characteristics from the perspective of both the global overview and any local piece. Inspired by this, we propose a novel method called Hierarchical Gradient Synchronization to model the synchronization relationship among the local distribution pieces and the global distribution, aiming for more precise domain-invariant features. Specifically, the hierarchical domain alignments including class-wise alignment, group-wise alignment and global alignment are first constructed. Then, these three types of alignment are constrained to be consistent to ensure better structure preservation. As a result, the obtained features are domain invariant and intrinsically structure preserved. As evaluated on extensive domain adaptation tasks, our proposed method achieves state-of-the-art classification performance on both vanilla unsupervised domain adaptation and partial domain adaptation.
[hierarchical, recognition, three, shift, relation, work] [global, feature, category, object, denotes, including, extractor, china] [adversarial, magnitude, perfect, model] [method, pattern, ieee, proposed, figure, based, designed, formulated] [domain, alignment, target, unsupervised, adaptation, source, discrepancy, transfer, loss, discriminator, consistency, gsda, aligned, conditional, aligning, ptj, piece, discriminative, mingsheng, jianmin] [distribution, learning, gradient, data, denoted, processing, better, classification, machine, training, classifier, class, neural, deep, objective, entropy, labeled, group, share, sample, probability, equation, promising, set, number, unlabeled, network] [local, conference, synchronization, computer, vision, well, international, partial, structure, novel, direction]
@InProceedings{Hu_2020_CVPR,
  author = {Hu, Lanqing and Kan, Meina and Shan, Shiguang and Chen, Xilin},
  title = {Unsupervised Domain Adaptation With Hierarchical Gradient Synchronization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Grouping Model for Unified Perceptual Parsing
Zhiheng Li, Wenxuan Bao, Jiayang Zheng, Chenliang Xu


The perceptual-based grouping process produces a hierarchical and compositional image representation that helps both human and machine vision systems recognize heterogeneous visual concepts. Examples can be found in the classical hierarchical superpixel segmentation or image parsing works. However, the grouping process is largely overlooked in modern CNN-based image segmentation networks due to many challenges, including the inherent incompatibility between the grid-shaped CNN feature map and the irregular-shaped perceptual grouping hierarchy. Overcoming these challenges, we propose a deep grouping model (DGM) that tightly marries the two types of representations and defines a bottom-up and a top-down process for feature exchanging. When evaluated on the recent Broden+ dataset for the unified perceptual parsing task, the model achieves state-of-the-art results while having a small computational overhead compared to other contextual-based segmentation models. Furthermore, the DGM has better interpretability compared with modern CNN methods.
[graph, recognition, context, hierarchical, adjacency, overhead, semantics, message, dataset, modeling, passing, node, emgp, work, prediction] [feature, segmentation, grouping, level, dgm, parsing, object, map, unified, semantic, module, superpixel, pooling, cnn, contextual, click, upernet, tdmp, backbone, ocnet, propose, bottom, global, apply] [model] [ieee, perceptual, pattern, june, method, proposed, pixel, based, convolutional, figure, journal] [image, texture, representation] [task, learning, deep, number, classification, process, better, hierarchy, compared, neural, network, performance, lower, training, machine, average, higher] [computer, vision, conference, scene, international, grid, material, vertex, projection, computed, defined]
@InProceedings{Li_2020_CVPR,
  author = {Li, Zhiheng and Bao, Wenxuan and Zheng, Jiayang and Xu, Chenliang},
  title = {Deep Grouping Model for Unified Perceptual Parsing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Where Am I Looking At? Joint Location and Orientation Estimation by Cross-View Matching
Yujiao Shi, Xin Yu, Dylan Campbell, Hongdong Li


Cross-view geo-localization is the problem of estimating the position and orientation (latitude, longitude and azimuth angle) of a camera at ground level given a large-scale database of geo-tagged aerial (e.g., satellite) images. Existing approaches treat the task as a pure location estimation problem by learning discriminative feature descriptors, but neglect orientation alignment. It is well-recognized that knowing the orientation between ground and aerial images can significantly reduce matching ambiguity between these two views, especially when the ground-level images have a limited Field of View (FoV) instead of a full field-of-view panorama. Therefore, we design a Dynamic Similarity Matching network to estimate cross-view orientation alignment during localization. In particular, we address the cross-view domain gap by applying a polar transform to the aerial images to approximately align the images up to an unknown azimuth angle. Then, a two-stream convolutional network is used to learn deep features from the ground and polar-transformed aerial images. Finally, we obtain the orientation by computing the correlation between cross-view features, which also provides a more accurate measure of feature similarity, improving location recall. Experiments on standard datasets demonstrate that our method significantly improves state-of-the-art performance. Remarkably, we improve the top-1 location recall rate on the CVUSA dataset by a factor of 1.5x for panoramas with known orientation, by a factor of 3.3x for panoramas with unknown orientation, and by a factor of 6x for 180-degree FoV images with unknown orientation.
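As an illustration of the orientation step described above, the sketch below slides ground features along the azimuth axis of the polar-transformed aerial features and picks the shift with the highest correlation. This is a minimal numpy sketch under assumed shapes; the function name, feature shapes, and the simple inner-product score are illustrative assumptions, not the authors' implementation.

import numpy as np

def estimate_orientation(ground_feat, aerial_feat):
    """Hypothetical sketch: circular correlation along the azimuth axis.

    ground_feat: (H, Wg, C) features from the ground image (Wg <= Wa for limited FoV).
    aerial_feat: (H, Wa, C) features from the polar-transformed aerial image.
    Returns the azimuth shift (in feature columns) with the highest correlation.
    """
    H, Wg, C = ground_feat.shape
    Wa = aerial_feat.shape[1]
    scores = np.empty(Wa)
    for s in range(Wa):
        # wrap the aerial features around the azimuth axis and crop to the ground FoV
        shifted = np.roll(aerial_feat, -s, axis=1)[:, :Wg, :]
        scores[s] = np.sum(ground_feat * shifted)  # inner-product similarity
    return int(np.argmax(scores)), scores

# usage (hypothetical): best_shift, _ = estimate_orientation(g_feat, a_feat)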
[extract, dataset, correctly, evaluation, localizing, illustrated] [aerial, feature, location, cvusa, polar, table, correlation, recall, localization, horizontal, module, cnn, apply, cvft] [query] [figure, method, ieee, transform, existing, liu, pattern, convolutional, spatial, comparison, dynamic, captured, proposed] [image, domain, unknown, learn, gap, corresponding, alignment, bridge] [similarity, network, large, performance, learning, deep, respect, neural, top, maximum, standard, reduces, training, test] [ground, orientation, matching, limited, computer, estimation, vision, azimuth, fov, conference, dsm, estimate, camera, direction, localized, position, angle, scene, estimated, cvact, view, distance, geometric, compute, international]
@InProceedings{Shi_2020_CVPR,
  author = {Shi, Yujiao and Yu, Xin and Campbell, Dylan and Li, Hongdong},
  title = {Where Am I Looking At? Joint Location and Orientation Estimation by Cross-View Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Gum-Net: Unsupervised Geometric Matching for Fast and Accurate 3D Subtomogram Image Alignment and Averaging
Xiangrui Zeng, Min Xu


We propose a Geometric unsupervised matching Network (Gum-Net) for finding the geometric correspondence between two images, with application to 3D subtomogram alignment and averaging. Subtomogram alignment is the most important task in cryo-electron tomography (cryo-ET), a revolutionary 3D imaging technique for visualizing the molecular organization of unperturbed cellular landscapes in single cells. However, subtomogram alignment and averaging are very challenging due to severe imaging limits such as noise and missing wedge effects. We introduce an end-to-end trainable architecture with three novel modules specifically designed for preserving feature spatial information and propagating feature matching information. The training is performed in a fully unsupervised fashion to optimize a matching metric. No ground truth transformation information nor category-level or instance-level matching supervision information is needed. After systematic assessments on six real and nine simulated datasets, we demonstrate that Gum-Net reduced the alignment error by 40 to 50% and improved the averaging resolution by 10%. Gum-Net also achieved a 70- to 110-times speedup in practice with GPU acceleration compared to state-of-the-art subtomogram alignment methods. Our work is the first 3D unsupervised geometric matching method for images with strong transformation variation and a high noise level. The training code, trained model, and datasets are available in our open-source software AITom.
[dataset, three] [feature, pooling, correlation, map, module, fully, siamese, semantic] [model, input, datasets, noise] [subtomogram, ieee, spatial, subtomograms, pattern, spectral, medical, output, journal, dct, proposed, convolutional, deformable, wedge, filtering, fourier, resolution, achieved, macromolecular, fast, imaging, electron, extraction, optical, cellular, method, figure] [alignment, image, unsupervised, structural, missing, align, snr, transformed, real] [size, learning, data, neural, network, deep, max, better, training, average, layer, computational, algorithm, compared, large, accuracy, processing] [matching, geometric, transformation, conference, computer, averaging, vision, registration, supplementary, michael, ground, structure, international, novel, simulated, rigid, voxel, accurate, single]
@InProceedings{Zeng_2020_CVPR,
  author = {Zeng, Xiangrui and Xu, Min},
  title = {Gum-Net: Unsupervised Geometric Matching for Fast and Accurate 3D Subtomogram Image Alignment and Averaging},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FDA: Fourier Domain Adaptation for Semantic Segmentation
Yanchao Yang, Stefano Soatto


We describe a simple method for unsupervised domain adaptation, whereby the discrepancy between the source and target distributions is reduced by swapping the low-frequency spectrum of one with the other. We illustrate the method in semantic segmentation, where densely annotated images are plentiful in one domain (synthetic data), but difficult to obtain in another (real images). Current state-of-the-art methods are complex, some requiring adversarial optimization to render the backbone of a neural network invariant to the discrete domain selection variable. Our method does not require any training to perform the domain alignment, just a simple Fourier Transform and its inverse. Despite its simplicity, it achieves state-of-the-art performance in the current benchmarks, when integrated into a relatively standard semantic segmentation model. Our results indicate that even simple procedures can discount nuisance variability in the data that more sophisticated methods struggle to learn away.
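The spectrum swap itself is simple enough to sketch. The snippet below is a minimal numpy version of the idea: replace the low-frequency amplitude of a source image with that of a target image while keeping the source phase. The function name, the beta window parameter, and the assumption of equally sized images are ours, not the paper's released code.

import numpy as np

def fda_source_to_target(src, trg, beta=0.01):
    """Minimal sketch of the low-frequency amplitude swap, assuming (H, W, C) floats
    of matching size; beta controls the size of the swapped low-frequency window."""
    src_fft = np.fft.fft2(src, axes=(0, 1))
    trg_fft = np.fft.fft2(trg, axes=(0, 1))
    amp_src, pha_src = np.abs(src_fft), np.angle(src_fft)
    amp_trg = np.abs(trg_fft)

    # centre the spectra so the low frequencies form one contiguous square
    amp_src = np.fft.fftshift(amp_src, axes=(0, 1))
    amp_trg = np.fft.fftshift(amp_trg, axes=(0, 1))
    H, W = src.shape[:2]
    b = int(np.floor(min(H, W) * beta))
    cH, cW = H // 2, W // 2
    amp_src[cH - b:cH + b, cW - b:cW + b] = amp_trg[cH - b:cH + b, cW - b:cW + b]
    amp_src = np.fft.ifftshift(amp_src, axes=(0, 1))

    # recombine the target-like amplitude with the source phase and invert
    out = np.fft.ifft2(amp_src * np.exp(1j * pha_src), axes=(0, 1))
    return np.real(out)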
[dataset, multiple, visual, road, describe, current] [semantic, segmentation, round, achieves, backbone, improvement, miou] [adversarial, trained, model, improve, original, sophisticated] [method, ieee, pattern, fourier, scale, simply, spectrum, transform, spectral, figure, output] [domain, image, adaptation, target, source, unsupervised, fda, train, bdl, synthetic, alignment, transfer, sst, variability, loss, real, nuisance, perform, common, translation, mbt] [training, learning, performance, network, entropy, deep, simple, note, neural, data, test, set, better, machine, task, size, scratch, best, arxiv, preprint, processing, standard] [conference, computer, vision, international, single, second, performer, averaging, ground, european, stefano]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Yanchao and Soatto, Stefano},
  title = {FDA: Fourier Domain Adaptation for Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery
Zhuo Zheng, Yanfei Zhong, Junjue Wang, Ailong Ma


Geospatial object segmentation, as a particular semantic segmentation task, always faces larger scale variation, larger intra-class variance of the background, and foreground-background imbalance in high spatial resolution (HSR) remote sensing imagery. However, general semantic segmentation methods mainly focus on scale variation in the natural scene, with inadequate consideration of the other two problems that usually happen in large-area earth observation scenes. In this paper, we argue that the problems lie in the lack of foreground modeling and propose a foreground-aware relation network (FarSeg) from the perspectives of relation-based and optimization-based foreground modeling, to alleviate the above two problems. From the perspective of relation, FarSeg enhances the discrimination of foreground features via foreground-correlated contexts associated by learning the foreground-scene relation. Meanwhile, from the perspective of optimization, a foreground-aware optimization is proposed to focus on foreground examples and hard examples of the background during training for a balanced optimization. The experimental results obtained using a large-scale dataset suggest that the proposed method is superior to state-of-the-art general semantic segmentation methods and achieves a better trade-off between speed and accuracy.
[relation, context, decoder, step, modeling, embedding, road, dataset, vehicle] [remote, segmentation, feature, object, semantic, foreground, geospatial, hsr, hard, module, pyramid, map, table, farseg, focus, background, false, denotes, miou, alleviate, imagery, deeplab, atrous, detection, isaid, geoscience, propose] [example, input] [sensing, ieee, convolutional, spatial, pattern, proposed, resolution, high, scale, upsampling, method, figure, enhance, dynamic, extraction] [image, loss, discrimination, representation] [network, deep, optimization, learning, function, set, imbalance, normalization, general, performance, neural, baseline, problem, larger, large, weight, training] [scene, conference, computer, annealing, vision, projection, estimation]
@InProceedings{Zheng_2020_CVPR,
  author = {Zheng, Zhuo and Zhong, Yanfei and Wang, Junjue and Ma, Ailong},
  title = {Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
When2com: Multi-Agent Perception via Communication Graph Grouping
Yen-Cheng Liu, Junjiao Tian, Nathaniel Glaser, Zsolt Kira


While significant advances have been made for single-agent perception, many applications require multiple sensing agents and cross-agent communication due to benefits such as coverage and robustness. It is therefore critical to develop frameworks which support multi-agent collaborative perception in a distributed and bandwidth-efficient manner. In this paper, we address the collaborative perception problem, where one agent is required to perform a perception task and can communicate and share information with other agents on the same task. Specifically, we propose a communication framework by learning both to construct communication groups and decide when to communicate. We demonstrate the generalizability of our framework on two different perception tasks and show that it significantly reduces communication bandwidth while maintaining superior performance.
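A rough, hypothetical sketch of the "learning when and with whom to communicate" idea: each agent compares its query against the keys of all agents and skips communication when its own key already dominates. All names, shapes, and the threshold rule here are assumptions for illustration, not the paper's exact mechanism.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def when2com_step(queries, keys, messages, self_idx, thresh=0.5):
    """queries: (N, dq) compressed queries, keys: (N, dq), messages: (N, dm).
    Agent `self_idx` matches its query against every agent's key; if its own
    score already exceeds `thresh`, it skips communication ("when"); otherwise
    it fuses messages weighted by the matching scores ("with whom")."""
    scores = softmax(keys @ queries[self_idx])   # matching scores to all agents
    if scores[self_idx] > thresh:
        return messages[self_idx], scores        # no request needed
    fused = scores @ messages                    # weighted fusion of received messages
    return fused, scores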
[communication, agent, bandwidth, communicate, perception, collaborative, recognition, requesting, handshake, attention, construct, wardrobe, decide, tarmac, mechanism, provide, prediction, randcom, message, multiagent, commnet, previous, dataset, three, visual, multiple] [segmentation, semantic, key, box, framework, feature, object, grouping, represents, propose] [model, query, improve, experimental, case] [supporting, figure, degraded, ieee, proposed, transmission, based, pattern] [person, perform, image, learn, address] [learning, accuracy, task, size, neural, processing, matrix, group, note, performance, selection, network, consider, compared, baseline, distributed, number, informative] [conference, vision, computer, shape, international, local, matching, depth, sufficient, compute, demonstrate]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yen-Cheng and Tian, Junjiao and Glaser, Nathaniel and Kira, Zsolt},
  title = {When2com: Multi-Agent Perception via Communication Graph Grouping},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Human-Object Interaction Detection Using Interaction Points
Tiancai Wang, Tong Yang, Martin Danelljan, Fahad Shahbaz Khan, Xiangyu Zhang, Jian Sun


Understanding interactions between humans and objects is one of the fundamental problems in visual classification and an essential step towards detailed scene understanding. Human-object interaction (HOI) detection strives to localize both the human and the object, as well as to identify the complex interactions between them. Most existing HOI detection approaches are instance-centric, where interactions between all possible human-object pairs are predicted based on appearance features and coarse spatial information. We argue that appearance features alone are insufficient to capture complex human-object interactions. In this paper, we therefore propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs. Our network predicts interaction points, which directly localize and classify the interaction. Paired with the densely predicted interaction vectors, the interactions are associated with human and object detections to obtain final predictions. To the best of our knowledge, we are the first to propose an approach where HOI detection is posed as a keypoint detection and grouping problem. Experiments are performed on two popular benchmarks: V-COCO and HICO-DET. Our approach sets a new state-of-the-art on both datasets. Code is available at https://github.com/vaesl/IP-Net.
[interaction, multiple, three, work, action, visual, stream, pair, attention] [object, detection, hoi, grouping, box, center, maprole, rare, feature, detected, score, achieves, final, backbone, muhammad, fahad, shahbaz, predicted, branch, yanwei, propose, detects, employed, positive, ibox, threshold, motorcycle, rao] [heatmaps, detecting, input] [based, proposed, existing, figure, reference, method] [generated, corresponding, appearance, image, generation, produce] [vector, network, architecture, learning, performance, problem, set, pairwise, filter, negative, default, standard, note] [point, human, approach, keypoint, single, directly, pose, defined, full, angle, complex]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Tiancai and Yang, Tong and Danelljan, Martin and Khan, Fahad Shahbaz and Zhang, Xiangyu and Sun, Jian},
  title = {Learning Human-Object Interaction Detection Using Interaction Points},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
C2FNAS: Coarse-to-Fine Neural Architecture Search for 3D Medical Image Segmentation
Qihang Yu, Dong Yang, Holger Roth, Yutong Bai, Yixiao Zhang, Alan L. Yuille, Daguang Xu


3D convolutional neural networks (CNNs) have proven very successful in parsing organs or tumours in 3D medical images, but it remains sophisticated and time-consuming to choose or design proper 3D networks given different task contexts. Recently, Neural Architecture Search (NAS) has been proposed to solve this problem by searching for the best network architecture automatically. However, the inconsistency between the search stage and the deployment stage often exists in NAS algorithms due to memory constraints and the large search space, which could become more serious when applying NAS to some memory- and time-consuming tasks, such as 3D medical image segmentation. In this paper, we propose a coarse-to-fine neural architecture search (C2FNAS) to automatically search a 3D segmentation network from scratch without inconsistency on network size or input size. Specifically, we divide the search procedure into two stages: 1) the coarse stage, where we search the macro-level topology of the network, i.e. how each convolution module is connected to other modules; 2) the fine stage, where we search at micro-level for operations in each cell based on the previously searched macro-level topology. The coarse-to-fine manner divides the search procedure into two consecutive stages and meanwhile resolves the inconsistency. We evaluate our method on 10 public datasets from the Medical Segmentation Decathlon (MSD) challenge, and achieve state-of-the-art performance with the network searched using one dataset, which demonstrates the effectiveness and generalization of our searched models.
[dataset, automatically, node] [segmentation, stage, msd, alan, table, module, final, propose, apply] [model, trained, inconsistency, input, datasets] [medical, cell, based, convolution, proposed, method, conv, convolutional, figure, anisotropic] [image, fine, avg, manner, cluster, colon] [search, network, architecture, space, neural, size, pancreas, training, learning, performance, lung, algorithm, operation, small, memory, searched, data, validation, set, searching, procedure, path, number, better, problem, scratch, candidate, scaling, task, large, deep, random, arxiv, preprint, reduce, report, test, quoc, daguang, best, deployment, achieve, smaller, reduced] [topology, coarse]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Qihang and Yang, Dong and Roth, Holger and Bai, Yutong and Zhang, Yixiao and Yuille, Alan L. and Xu, Daguang},
  title = {C2FNAS: Coarse-to-Fine Neural Architecture Search for 3D Medical Image Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adaptive Subspaces for Few-Shot Learning
Christian Simon, Piotr Koniusz, Richard Nock, Mehrtash Harandi


Object recognition requires a generalization capability to avoid overfitting, especially when the samples are extremely few. Generalization from limited samples, usually studied under the umbrella of meta-learning, equips learning techniques with the ability to adapt quickly in dynamical environments and proves to be an essential aspect of lifelong learning. In this paper, we provide a framework for few-shot learning by introducing dynamic classifiers that are constructed from few samples. A subspace method is exploited as the central block of a dynamic classifier. We will empirically show that such modelling leads to robustness against perturbations (e.g., outliers) and yields competitive results on the task of supervised and semi-supervised few-shot classification. We also develop a discriminative form which can boost the accuracy even further. Our code is available at https://github.com/chrysts/dsn_fewshot
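The core of a subspace-based dynamic classifier can be sketched in a few lines: build a low-dimensional basis from each class's few support embeddings via SVD and assign the query to the class with the smallest projection residual. This is a minimal sketch assuming precomputed embeddings; the function and parameter names are illustrative, not the released code.

import numpy as np

def subspace_classify(query, support_by_class, n_dim=3):
    """query: (D,) embedding; support_by_class: list of (K, D) support embeddings,
    one array per class. Each class gets an n_dim subspace from a truncated SVD of
    its mean-centred supports; the query goes to the class with the smallest
    reconstruction residual."""
    errors = []
    for S in support_by_class:
        mu = S.mean(axis=0)
        U, _, _ = np.linalg.svd((S - mu).T, full_matrices=False)
        B = U[:, :n_dim]                       # orthonormal basis of the class subspace
        r = (query - mu) - B @ (B.T @ (query - mu))
        errors.append(np.dot(r, r))            # squared projection residual
    return int(np.argmin(errors))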
[dataset, recognition, work, relation, embedding, attention, visual] [feature, table, object] [model, query, trained, testing, study] [method, ieee, pattern, proposed, dynamic, based, fast, analysis] [discriminative, learn, image, prototype, specific, adaptation, generate, train] [learning, subspace, set, classification, class, neural, prototypical, unlabeled, classifier, dsn, training, deep, data, support, network, machine, fsl, performance, task, accuracy, function, problem, open, processing, number, similarity, mic, better, labeled, matrix, average] [conference, computer, international, vision, limited, term, projection, matching, symmetric, basis, geometric]
@InProceedings{Simon_2020_CVPR,
  author = {Simon, Christian and Koniusz, Piotr and Nock, Richard and Harandi, Mehrtash},
  title = {Adaptive Subspaces for Few-Shot Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Detect Important People in Unlabelled Images for Semi-Supervised Important People Detection
Fa-Ting Hong, Wei-Hong Li, Wei-Shi Zheng


Important people detection is to automatically detect the individuals who play the most important roles in a social event image, which requires the designed model to understand a high-level pattern. However, existing methods rely heavily on supervised learning using large quantities of annotated image samples, which are more costly to collect for important people detection than for individual entity recognition (i.e., object recognition). To overcome this problem, we propose learning important people detection on partially annotated images. Our approach iteratively learns to assign pseudo-labels to individuals in un-annotated images and learns to update the important people detection model based on data with both labels and pseudo-labels. To alleviate the pseudo-labelling imbalance problem, we introduce a ranking strategy for pseudo-label estimation, and also introduce two weighting strategies: one for weighting the confidence that individuals are important people to strengthen the learning on important people and the other for neglecting noisy unlabelled images (i.e., images without any important people). We have collected two large-scale datasets for evaluation. The extensive experimental results clearly confirm the efficacy of our method attained by leveraging unlabelled images for improving the performance of important people detection.
[people, relation, dataset, three, graph, work, current, automatically, social, assist] [labelled, score, detection, table, effectiveness, partially, fully, detected, detect, annotated, alleviate, developing, feature, propagation] [model, datasets, face, adding, input, trained] [method, figure, proposed, noisy, pattern, based, event, existing] [image, supervised, person, introduce, loss, consistency, learn, generated] [unlabelled, data, learning, sampling, training, label, problem, weight, number, imbalance, ranking, weighting, semisupervised, set, large, performance, encaa, sampled, isw, strategy, clearly, classification, class, deep, entropy, teacher, function, costly, design, indicates, network, maximum] [computer, estimated, approach, vision, point, estimate, limited, quantity]
@InProceedings{Hong_2020_CVPR,
  author = {Hong, Fa-Ting and Li, Wei-Hong and Zheng, Wei-Shi},
  title = {Learning to Detect Important People in Unlabelled Images for Semi-Supervised Important People Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Stochastic Sparse Subspace Clustering
Ying Chen, Chun-Guang Li, Chong You


State-of-the-art subspace clustering methods are based on the self-expressive model, which represents each data point as a linear combination of other data points. By enforcing such representation to be sparse, sparse subspace clustering is guaranteed to produce a subspace-preserving data affinity where two points are connected only if they are from the same subspace. However, data points from the same subspace may not be well-connected, leading to the issue of over-segmentation. We introduce dropout to address the issue of over-segmentation, which is based on randomly dropping out data points in the self-expressive model. In particular, we show that dropout is equivalent to adding a squared l_2 norm regularization on the representation coefficients, and therefore induces denser solutions. Then, we reformulate the optimization problem as a consensus problem over a set of small-scale subproblems. This leads to a scalable and flexible sparse subspace clustering approach, termed Stochastic Sparse Subspace Clustering, which can effectively handle large scale datasets. Extensive experiments on synthetic data and real world datasets validate the efficiency and effectiveness of our proposal.
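Since the abstract identifies dropout on the self-expressive model with a squared l_2 penalty, a simple stand-in is to solve each self-expressive regression with an elastic-net penalty and feed the resulting affinity to spectral clustering. The sketch below uses scikit-learn as a generic solver; the parameter values and the use of ElasticNet are assumptions for illustration, not the paper's consensus algorithm.

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.cluster import SpectralClustering

def self_expressive_affinity(X, alpha=1e-2, l1_ratio=0.5):
    """X: (N, D) data matrix, one point per row. Each point is regressed on all
    the others with an l1 + squared-l2 penalty; |C| + |C|^T gives the affinity."""
    N = X.shape[0]
    C = np.zeros((N, N))
    for j in range(N):
        idx = np.arange(N) != j
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False)
        model.fit(X[idx].T, X[j])              # express x_j by the other points
        C[j, idx] = model.coef_
    return np.abs(C) + np.abs(C).T

# usage (hypothetical):
# labels = SpectralClustering(n_clusters=5, affinity="precomputed").fit_predict(
#     self_expressive_affinity(X))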
[graph, time] [affinity, consensus, table, segmentation, feature] [model, norm, gtsrb] [ieee, pattern, based, method, proposed, spectral, flexible, analysis, journal, ssc, subproblems] [synthetic, real, image, issue, representation, introduce, address] [subspace, clustering, data, dropout, problem, algorithm, comp, set, ensc, machine, sscomp, matrix, accuracy, min, optimization, performance, rate, chong, scalable, stochastic, cij, learning, neural, number, dimension, cjj, normalized, observe, equivalent, regularization, random, computation, good, dictionary, orthogonal, update, support, higher, olrsc, esc, metric, linear, dropping, greedy, sparsity, omp] [sparse, connectivity, conference, computer, international, solve, vision, solving, solution, matching, dense]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Ying and Li, Chun-Guang and You, Chong},
  title = {Stochastic Sparse Subspace Clustering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CRNet: Cross-Reference Networks for Few-Shot Segmentation
Weide Liu, Chi Zhang, Guosheng Lin, Fayao Liu


Over the past few years, state-of-the-art image segmentation algorithms have been based on deep convolutional neural networks. To endow a deep network with the ability to understand a concept, humans need to collect a large amount of pixel-level annotated data to train the models, which is time-consuming and tedious. Recently, few-shot segmentation has been proposed to solve this problem. Few-shot segmentation aims to learn a segmentation model that can be generalized to novel classes with only a few training images. In this paper, we propose a cross-reference network (CRNet) for few-shot segmentation. Unlike previous works which only predict the mask in the query image, our proposed model concurrently makes predictions for both the support image and the query image. With a cross-reference mechanism, our network can better find the co-occurrent objects in the two images, thus helping the few-shot segmentation task. We also develop a mask refinement module to recurrently refine the prediction of the foreground regions. For k-shot learning, we propose to finetune parts of the network to take advantage of multiple labeled support images. Experiments on the PASCAL VOC 2012 dataset show that our network achieves state-of-the-art performance.
[previous, prediction, evaluation, dataset, multiple] [module, segmentation, mask, feature, refinement, foreground, table, semantic, object, siamese, propose, refine, branch, pascal, voc, category, crnet, predicted, fully, global, ablation, recurrently, achieves, guide] [query, model, condition, input, testing] [based, conv, method, proposed, convolutional, block, comparison, ieee, figure, pattern] [image, encoder, target, common, generate] [support, network, set, performance, training, labeled, finetuning, deep, learning, finetune, cache, better, design, average, vector, task, test, neural, large, data, crossreference, baseline, metric] [computer, reinforced, conference, vision]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Weide and Zhang, Chi and Lin, Guosheng and Liu, Fayao},
  title = {CRNet: Cross-Reference Networks for Few-Shot Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Shoestring: Graph-Based Semi-Supervised Classification With Severely Limited Labeled Data
Wanyu Lin, Zhaolin Gao, Baochun Li


Graph-based semi-supervised learning has been shown to be one of the most effective classification approaches, as it can exploit connectivity patterns between labeled and unlabeled samples to improve learning performance. However, we show that existing techniques perform poorly when labeled data are severely limited. To address the problem of semi-supervised learning in the presence of severely limited labeled samples, we propose a new framework, called Shoestring, that incorporates metric learning into the paradigm of graph-based semi-supervised learning. In particular, our base model consists of a graph embedding network, followed by a metric learning network that learns a semantic metric space to represent the semantic similarity between the sparsely labeled and large numbers of unlabeled samples. Then the classification can be performed by clustering the unlabeled samples according to the learned semantic space. We empirically demonstrate Shoestring's superiority over many baselines, including graph convolutional networks, label propagation and their recent label-efficient variations (IGCN and GLP). We show that our framework achieves state-of-the-art performance for node classification in the low-data regime. In addition, we demonstrate the effectiveness of our framework on image classification tasks in the few-shot learning regime, with significant gains on miniImageNet (2.57%~3.59%) and tieredImageNet (1.05%~2.70%).
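A minimal sketch of the metric-learning step implied above: given node embeddings from a graph network, compute class centroids from the few labeled nodes and assign the remaining nodes by cosine similarity. The inputs and names are assumed for illustration and omit the learned metric network itself.

import numpy as np

def centroid_classify(embeddings, labels, mask_labeled):
    """embeddings: (N, D) node embeddings (assumed given by a graph network);
    labels: (N,) integer labels, valid only where mask_labeled is True.
    Unlabeled nodes are assigned to the class with the most similar centroid."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    classes = np.unique(labels[mask_labeled])
    centroids = np.stack(
        [emb[mask_labeled & (labels == c)].mean(axis=0) for c in classes])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = emb @ centroids.T                 # cosine similarity to every centroid
    return classes[np.argmax(sims, axis=1)]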
[graph, embedding, gcn, node, dataset, tpn] [framework, semantic, propagation, table, centroid, feature, propagate] [model, datasets, effective, original] [convolutional, proposed, assumption, based, called, output, performed] [image, learn, representation, cluster, loss, pubmed] [learning, labeled, classification, label, similarity, class, performance, data, metric, shoestring, unlabeled, large, number, network, severely, cora, set, sample, miniimagenet, accuracy, citation, space, neural, learned, deep, semisupervised, function, knowledge, training, cosine, matrix, objective, filter, citeseer, problem, empirically, tieredimagenet, achieve, vector, classifier] [limited, conference, international, computer, vision, distance, structure, term, well, smoothness]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Wanyu and Gao, Zhaolin and Li, Baochun},
  title = {Shoestring: Graph-Based Semi-Supervised Classification With Severely Limited Labeled Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Uninformed Students: Student-Teacher Anomaly Detection With Discriminative Latent Embeddings
Paul Bergmann, Michael Fauser, David Sattlegger, Carsten Steger


We introduce a powerful student-teacher framework for the challenging problem of unsupervised anomaly detection and pixel-precise anomaly segmentation in high-resolution images. Student networks are trained to regress the output of a descriptive teacher network that was pretrained on a large dataset of patches from natural images. This circumvents the need for prior data annotation. Anomalies are detected when the outputs of the student networks differ from that of the teacher network. This happens when they fail to generalize outside the manifold of anomaly-free training data. The intrinsic uncertainty in the student networks is used as an additional scoring function that indicates anomalies. We compare our method to a large number of existing deep learning based methods for unsupervised anomaly detection. Our experiments demonstrate improvements over state-of-the-art methods on a number of real-world datasets, including the recently introduced MVTec Anomaly Detection dataset that was specifically designed to benchmark anomaly segmentation algorithms.
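The scoring rule implied by the abstract (students regress a pretrained teacher; anomalies show up as regression error plus ensemble disagreement) can be sketched directly. The shapes and the absence of per-term normalization below are simplifying assumptions.

import numpy as np

def anomaly_map(teacher_feat, student_feats):
    """teacher_feat: (H, W, C) descriptors from the pretrained teacher.
    student_feats: (S, H, W, C) descriptors from S students trained to regress it.
    Returns a per-pixel anomaly score combining regression error and the
    predictive variance of the student ensemble."""
    mean_student = student_feats.mean(axis=0)
    regression_error = np.sum((mean_student - teacher_feat) ** 2, axis=-1)
    predictive_var = student_feats.var(axis=0).sum(axis=-1)
    # in practice each term would typically be normalised before summation
    return regression_error + predictive_var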
[dataset, multiple, recognition, embedding, embeddings, natural, work] [detection, feature, segmentation, shallow, regression, table, fully, segment, detect, region, area] [input, trained, ensemble, model, mnist] [receptive, output, method, pixel, based, figure, field, patch, pattern, ieee, prior, june, gaussian] [image, pretrained, unsupervised, generative, discriminative, descriptive, loss, train, autoencoders, perform] [anomaly, training, learning, deep, student, network, teacher, large, performance, number, distribution, machine, predictive, size, mvtec, anomalous, data, classification, neural, simple, knowledge, metric, novelty, class, architecture, problem, larger, dimension, triplet, randomly] [computer, vision, conference, single, local, uncertainty, error, descriptor]
@InProceedings{Bergmann_2020_CVPR,
  author = {Bergmann, Paul and Fauser, Michael and Sattlegger, David and Steger, Carsten},
  title = {Uninformed Students: Student-Teacher Anomaly Detection With Discriminative Latent Embeddings},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
3D Sketch-Aware Semantic Scene Completion via Semi-Supervised Structure Prior
Xiaokang Chen, Kwan-Yee Lin, Chen Qian, Gang Zeng, Hongsheng Li


The goal of the Semantic Scene Completion (SSC) task is to simultaneously predict a completed 3D voxel representation of volumetric occupancy and semantic labels of objects in the scene from a single-view observation. Since the computational cost generally increases explosively along with the growth of voxel resolution, most current state-of-the-art methods have to tailor their frameworks to a low-resolution representation, sacrificing detail prediction. Thus, voxel resolution becomes one of the crucial difficulties that lead to the performance bottleneck. In this paper, we propose to devise a new geometry-based strategy to embed depth information with a low-resolution voxel representation, which could still be able to encode sufficient geometric information, e.g., room layout, object sizes and shapes, to infer the invisible areas of the scene with structure-preserving details. To this end, we first propose a novel 3D sketch-aware feature embedding to explicitly encode geometric information effectively and efficiently. With the 3D sketch in hand, we further devise a simple yet effective semantic scene completion framework that incorporates a light-weight 3D Sketch Hallucination module to guide the inference of occupancy and the semantic labels via a semi-supervised structure prior learning strategy. We demonstrate that our proposed geometric embedding works better than the depth features learned in conventional SSC frameworks. Our final model consistently surpasses the state of the art on three public benchmarks, while only requiring 3D volumes of 60 x 36 x 60 resolution for both input and output.
[embedding, explicit, infer, three, context, decoder, embedded, predict] [semantic, feature, table, ablation, object, stage, module, detection, boundary, iou, guide, map, propose, employ, framework] [input, hallucination, ined, invisible, study] [prior, proposed, nyucad, resolution, ssc, figure, output, method, based, sscnet, result, convolution] [sketch, cvae, representation, image, perform, row, proposes, encoder, introduce] [network, learning, task, performance, computational, rate, observe, better, deep, inference, large, space, architecture, training] [scene, completion, structure, depth, nyu, rgb, shape, voxel, geometric, geometry, partial, tsdf, single, volume, approach, complete, full, implicit, ground, estimated, truth, cost, well, novel]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Xiaokang and Lin, Kwan-Yee and Qian, Chen and Zeng, Gang and Li, Hongsheng},
  title = {3D Sketch-Aware Semantic Scene Completion via Semi-Supervised Structure Prior},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Graph-Guided Architecture Search for Real-Time Semantic Segmentation
Peiwen Lin, Peng Sun, Guangliang Cheng, Sirui Xie, Xi Li, Jianping Shi


Designing a lightweight semantic segmentation network often requires researchers to find a trade-off between performance and speed, which is always empirical due to the limited interpretability of neural networks. In order to release researchers from these tedious mechanical trials, we propose a Graph-guided Architecture Search (GAS) pipeline to automatically search real-time semantic segmentation networks. Unlike previous works that use a simplified search space and stack a repeatable cell to form a network, we introduce a novel search mechanism with a new search space where a lightweight model can be effectively explored through the cell-level diversity and latency oriented constraint. Specifically, to produce the cell-level diversity, the cell-sharing constraint is eliminated through the cell-independent manner. Then a graph convolution network (GCN) is seamlessly integrated as a communication mechanism between cells. Finally, a latency-oriented constraint is endowed into the search process to balance the speed and performance. Extensive experiments on Cityscapes and CamVid datasets demonstrate that GAS achieves the new state-of-the-art trade-off between accuracy and speed. In particular, on Cityscapes dataset, GAS achieves the new best performance of 73.5% mIoU with the speed of 108.4 FPS on Titan Xp.
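A latency-oriented constraint of the kind described can be sketched as a differentiable expected-latency term added to the task loss; under a softmax relaxation of the architecture parameters, the expected latency is a probability-weighted sum of per-operation latencies. The weighting factor, shapes, and the measured latencies below are hypothetical.

import numpy as np

def latency_aware_loss(task_loss, arch_logits, op_latency, lam=0.1):
    """arch_logits: (E, O) architecture parameters, one row of O candidate ops
    per edge; op_latency: (O,) measured latency of each candidate operation.
    The expected latency under the softmax-relaxed architecture is added to the
    task loss, steering the search toward fast networks."""
    probs = np.exp(arch_logits - arch_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    expected_latency = float(np.sum(probs * op_latency))   # sum over edges and ops
    return task_loss + lam * expected_latency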
[graph, speed, reasoning, gcn, titan, mechanism, communication, node, previous] [semantic, segmentation, module, miou, table, edge, represents, effectiveness, fps, achieves, fully, camvid, denotes, icnet, backbone, ablation, object] [model, conduct, input] [cell, figure, high, convolution, adjacent, convolutional, lightweight, low, result, method, based, optimized, intermediate, comparison, stacked, dilated] [image, independent, loss] [search, architecture, network, performance, gas, latency, neural, random, set, learning, operation, deep, parameter, weight, accuracy, ggm, size, validation, space, best, searched, candidate, training, achieve, efficient, equation, quoc, test, dfanet, optimization, distribution, data] [constraint, cost, novel, structure]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Peiwen and Sun, Peng and Cheng, Guangliang and Xie, Sirui and Li, Xi and Shi, Jianping},
  title = {Graph-Guided Architecture Search for Real-Time Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Composing Good Shots by Exploiting Mutual Relations
Debang Li, Junge Zhang, Kaiqi Huang, Ming-Hsuan Yang


Finding views with a good composition from an input image is a common but challenging problem. There are usually at least dozens of candidates (regions) in an image, and how to evaluate these candidates is subjective. Most existing methods only use the feature corresponding to each candidate to evaluate the quality. However, the mutual relations between the candidates from an image play an essential role in composing a good shot due to the comparative nature of this problem. Motivated by this, we propose a graph-based module with a gated feature update to model the relations between different candidates. The candidate region features are propagated on a graph that models mutual relations between different regions for mining the useful information such that the relation features and region features are adaptively fused. We design a multi-task loss to train the model, especially, a regularization term is adopted to incorporate the prior knowledge about the relations into the graph. A data augmentation method is also developed by mixing nodes from different graphs to improve the model generalization ability. Experimental results show that the proposed model performs favorably against state-of-the-art methods, and comprehensive ablation studies demonstrate the contribution of each module and graph-based inference of the proposed method.
[graph, relation, gated, construct, reasoning, gaicd, automatic, gaic, attention, dataset, ven, visual, constructed, incorporate, recognition] [feature, region, annotated, score, table, propose, ablation, lreg, predicted, module, correlation, achieves, map, vpn, saliency] [model, input, help, study, srcc, generalization, influence, datasets] [proposed, method, figure, prior, cropping, based, convolution, fusion] [image, loss, photo, mixing, ability, train] [good, data, learning, augmentation, randomly, training, number, knowledge, finding, evaluate, find, set, performance, mutual, candidate, update, regularization, best, dimension, matrix, function, better, weight, deep, mining, design, similarity, ranking, sorting] [demonstrate, computed, term, view]
@InProceedings{Li_2020_CVPR,
  author = {Li, Debang and Zhang, Junge and Huang, Kaiqi and Yang, Ming-Hsuan},
  title = {Composing Good Shots by Exploiting Mutual Relations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Organ at Risk Segmentation for Head and Neck Cancer Using Stratified Learning and Neural Architecture Search
Dazhou Guo, Dakai Jin, Zhuotun Zhu, Tsung-Ying Ho, Adam P. Harrison, Chun-Hung Chao, Jing Xiao, Le Lu


OAR segmentation is a critical step in radiotherapy of head and neck (H&N) cancer, where inconsistencies across radiation oncologists and prohibitive labor costs motivate automated approaches. However, leading methods use standard fully convolutional network workflows that are challenged when the number of OARs becomes large, e.g. > 40. For such scenarios, insights can be gained from the stratification approaches seen in manual clinical OAR delineation. This is the goal of our work, where we introduce stratified organ at risk segmentation (SOARS), an approach that stratifies OARs into anchor, mid-level, and small & hard (S&H) categories. SOARS stratifies across two dimensions. The first dimension is that distinct processing pipelines are used for each OAR category. In particular, inspired by clinical practices, anchor OARs are used to guide the mid-level and S&H categories. The second dimension is that distinct network architectures are used to manage the significant contrast, size, and anatomy variations between different OARs. We use differentiable neural architecture search (NAS), allowing the network to choose among 2D, 3D or Pseudo-3D convolutions. Extensive 4-fold cross-validation on 142 H&N cancer patients with 42 manually labeled OARs, the most comprehensive OAR dataset to date, demonstrates that both pipeline- and NAS-stratification significantly improve quantitative performance over the state-of-the-art (from 69.52% to 73.68% in absolute Dice scores). Thus, SOARS provides a powerful and principled means to manage the highly complex segmentation space of OARs.
[dataset, three, automatic, work] [segmentation, anchor, head, neck, detection, segmenting, branch, table, framework, segment, center, map, effectiveness, improvement, fully] [risk, input, trained, highly] [medical, dsc, conv, figure, convolutional, method, ieee, clinical, optic, brain, contrast, pattern, automated, proposed, demonstrates] [oar, image, stratification, radiotherapy, uanet, cancer, stratified, radiation, distinct, rtct, asd, organ, target, stratifies] [network, learning, architecture, search, performance, processing, set, training, neural, better, deep, compared, small, baseline, large, size, best, dimension, average, computing, task, statistical, number] [conference, computer, volume, international, differentiable, vision, shape, heat]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Dazhou and Jin, Dakai and Zhu, Zhuotun and Ho, Tsung-Ying and Harrison, Adam P. and Chao, Chun-Hung and Xiao, Jing and Lu, Le},
  title = {Organ at Risk Segmentation for Head and Neck Cancer Using Stratified Learning and Neural Architecture Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
G2L-Net: Global to Local Network for Real-Time 6D Pose Estimation With Embedding Vector Features
Wei Chen, Xi Jia, Hyung Jin Chang, Jinming Duan, Ales Leonardis


In this paper, we propose a novel real-time 6D object pose estimation framework, named G2L-Net. Our network operates on point clouds from RGB-D detection in a divide-and-conquer fashion. Specifically, our network consists of three steps. First, we extract the coarse object point cloud from the RGB-D image by 2D detection. Second, we feed the coarse object point cloud to a translation localization network to perform 3D segmentation and object translation prediction. Third, via the predicted segmentation and translation, we transfer the fine object point cloud into a local canonical coordinate frame, in which we train a rotation localization network to estimate the initial object rotation. In the third step, we define point-wise embedding vector features to capture viewpoint-aware information. To calculate a more accurate rotation, we adopt a rotation residual estimator to estimate the residual between the initial rotation and the ground truth, which can boost initial pose estimation performance. Our proposed G2L-Net is real-time despite the fact that multiple steps are stacked via the proposed coarse-to-fine framework. Extensive experiments on two benchmark datasets show that G2L-Net achieves state-of-the-art performance in terms of both accuracy and speed.
[embedding, recognition, three, predict, extract] [object, localization, global, bounding, box, propose, feature, segmentation, detection, table, predicted, locate, achieves, add, wei, adopt, cnn] [model, input] [residual, figure, method, proposed, ieee, pattern, block, based, output, fast] [translation, image, train, transfer, perform, third] [vector, network, learning, performance, deep, accuracy, space, training, max, pool, metric, inference, label, data, better, class, arxiv, preprint] [rotation, pose, point, estimation, computer, conference, cloud, vision, ground, estimator, rgb, depth, linemod, keypoints, estimate, local, viewpoint, truth, sphere, international, distance, european, vincent, canonical, keypoint, estimated, initial, accurate]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Wei and Jia, Xi and Chang, Hyung Jin and Duan, Jinming and Leonardis, Ales},
  title = {G2L-Net: Global to Local Network for Real-Time 6D Pose Estimation With Embedding Vector Features},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Instance Segmentation in Microscopy Images via Panoptic Domain Adaptation and Task Re-Weighting
Dongnan Liu, Donghao Zhang, Yang Song, Fan Zhang, Lauren O'Donnell, Heng Huang, Mei Chen, Weidong Cai


Unsupervised domain adaptation (UDA) for nuclei instance segmentation is important for digital pathology, as it alleviates the burden of labor-intensive annotation and domain shift across datasets. In this work, we propose a Cycle Consistency Panoptic Domain Adaptive Mask R-CNN (CyC-PDAM) architecture for unsupervised nuclei segmentation in histopathology images, by learning from fluorescence microscopy images. More specifically, we first propose a nuclei inpainting mechanism to remove the auxiliary generated objects in the synthesized images. Secondly, a semantic branch with a domain discriminator is designed to achieve panoptic-level domain adaptation. Thirdly, in order to avoid the influence of the source-biased features, we propose a task re-weighting mechanism to dynamically add trade-off weights for the task-specific loss functions. Experimental results on three datasets indicate that our proposed method outperforms state-of-the-art UDA methods significantly, and demonstrates a similar performance as fully supervised methods.
[mechanism, recognition, prediction, three, dataset] [segmentation, semantic, instance, feature, mask, panoptic, object, fully, table, propose, detection, level, effectiveness, branch, foreground, kaiming, employed, faster, sifa] [model, adversarial, auxiliary, original] [proposed, based, method, pattern, microscopy, medical, comparison, adaptive, remove, designed, figure, output] [domain, histopathology, uda, image, adaptation, synthesized, inpainting, unsupervised, source, supervised, loss, target, fluorescence, kumar, real, cycada, discriminator, generated, translation] [task, learning, training, architecture, large, performance, bias, compared, entropy, classification, size, data, deep] [computer, vision, conference, international, avoid, local]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Dongnan and Zhang, Donghao and Song, Yang and Zhang, Fan and O'Donnell, Lauren and Huang, Heng and Chen, Mei and Cai, Weidong},
  title = {Unsupervised Instance Segmentation in Microscopy Images via Panoptic Domain Adaptation and Task Re-Weighting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Single-Stage Semantic Segmentation From Image Labels
Nikita Araslanov, Stefan Roth


Recent years have seen a rapid growth in new approaches improving the accuracy of semantic segmentation in a weakly supervised setting, i.e. with only image-level labels available for training. However, this has come at the cost of increased model complexity and sophisticated multi-stage training procedures. This is in contrast to earlier work that used only a single stage -- training one segmentation network on image labels -- which was abandoned due to inferior segmentation accuracy. In this work, we first define three desirable properties of a weakly supervised method: local consistency, semantic fidelity, and completeness. Using these properties as guidelines, we then develop a segmentation-based network model and a self-supervised training scheme to train for semantic masks from image-level annotations in a single stage. We show that despite its simplicity, our method achieves results that are competitive with significantly more complex pipelines, substantially outperforming earlier single-stage methods.
[attention, three, multiple, provide, work] [segmentation, semantic, mask, saliency, weakly, object, iou, fully, affinity, feature, refinement, global, pamr, gci, table, shallow, stage, detection, recall, crf, salient, propose, normalised, instance, score, pooling, final, confidence] [model, trained, original, quality, localisation] [convolutional, kernel, method, figure, receptive, output] [image, supervised, pseudo, loss, train, appearance, common] [training, classification, network, deep, class, learning, stochastic, penalty, size, data, accuracy, small, gate, simple, inference, note, softmax, increase, practice] [additional, local, ground, single, focal, compute, truth, approach, iteratively]
@InProceedings{Araslanov_2020_CVPR,
  author = {Araslanov, Nikita and Roth, Stefan},
  title = {Single-Stage Semantic Segmentation From Image Labels},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cascaded Human-Object Interaction Recognition
Tianfei Zhou, Wenguan Wang, Siyuan Qi, Haibin Ling, Jianbing Shen


Rapid progress has been witnessed for human-object interaction (HOI) recognition, but most existing models are confined to single-stage reasoning pipelines. Considering the intrinsic complexity of the task, we introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. Each of the two networks is also connected to its predecessor at the previous stage, enabling cross-stage information propagation. The interaction recognition network has two crucial parts: a relation ranking module for high-quality HOI proposal selection and a triple-stream classifier for relation prediction. With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding. Further beyond relation detection on a bounding-box level, we make our framework flexible to perform fine-grained pixel-wise relation segmentation; this provides a new glimpse into better relation modeling. Our approach reached the 1st place in the ICCV2019 Person in Context Challenge, on both relation detection and segmentation tasks. It also shows promising results on V-COCO.
[relation, interaction, recognition, visual, pair, three, action, context, graph, attention, previous, place, understanding, considering, work, recognizing] [hoi, instance, object, cascade, feature, localization, detection, semantic, table, pic, hoiw, segmentation, mask, score, module, region, stage, challenge, rrm, box, maprel, wenguan, jianbing, bounding, siyuan, propose, ablation, val, oursmask, bbox, ross, ling, objecti, fed, final, detected, roialign] [model, facial, study, trained, face] [figure, pixel, based] [representation, loss, image, corresponding, learn] [network, learning, performance, ranking, inference, training, architecture, set, better, neural, test, classification, deep] [human, geometric, approach, computer, well, single, pose]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Tianfei and Wang, Wenguan and Qi, Siyuan and Ling, Haibin and Shen, Jianbing},
  title = {Cascaded Human-Object Interaction Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DuDoRNet: Learning a Dual-Domain Recurrent Network for Fast MRI Reconstruction With Deep T1 Prior
Bo Zhou, S. Kevin Zhou


MRI with multiple protocols is commonly used for diagnosis, but it suffers from a long acquisition time, which leaves the image quality vulnerable to, say, motion artifacts. To accelerate, various methods have been proposed to reconstruct full images from under-sampled k-space data. However, these algorithms are inadequate for two main reasons. Firstly, aliasing artifacts generated in the image domain are structural and non-local, so that image-domain restoration alone is insufficient. Secondly, though MRI comprises multiple protocols during one exam, almost all previous studies only employ the reconstruction of an individual protocol using a highly distorted undersampled image as input, leaving the use of a fully-sampled short protocol (say, T1) as complementary information highly underexplored. In this work, we address the above two limitations by proposing a Dual Domain Recurrent Network (DuDoRNet) with a deep T1 prior embedded to simultaneously recover k-space and images for accelerating the acquisition of MRI with a long imaging protocol. Specifically, a Dilated Residual Dense Network (DRDNet) is customized for dual domain restorations from undersampled MRI data. Extensive experiments on different sampling patterns and acceleration rates demonstrate that our method consistently outperforms state-of-the-art methods, and can reconstruct high quality MRI.
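Recovering in both k-space and the image domain typically relies on a data-consistency operation; the generic numpy sketch below (not the paper's recurrent network) re-imposes the acquired k-space samples on the current image estimate. Names and shapes are assumptions for illustration.

import numpy as np

def data_consistency(recon_img, sampled_kspace, mask):
    """recon_img: (H, W) current image-domain estimate.
    sampled_kspace: (H, W) acquired k-space, meaningful only where `mask` is True.
    At sampled frequencies the network estimate is replaced by the measurement,
    keeping the reconstruction consistent with the acquired data."""
    k_est = np.fft.fft2(recon_img)
    k_mixed = np.where(mask, sampled_kspace, k_est)
    return np.real(np.fft.ifft2(k_mixed))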
[recurrent, three, previous, long] [feature, fully, global, atrous, propose, denotes, cartesian] [input, radial, quality, highly, demonstrated, model] [mri, prior, figure, residual, undersampled, ieee, restoration, dudornet, based, output, magnetic, resonance, imaging, block, pattern, convolution, dual, ssim, dilated, medical, sensing, undersampling, sdrdb, compressed, convolutional, receptive, conv, fast, aliasing, grappa, proposed, high, psnr, nrec, acquisition, inverse, field] [image, domain, consistency, loss, structural] [network, learning, deep, data, sampled, acceleration, sampling, rate, performance, better, arg, neural, function, compared, set, layer, design, large] [reconstruction, dense, spiral, conference, local, computer, vision, structure, international, reconstructed]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Bo and Zhou, S. Kevin},
  title = {DuDoRNet: Learning a Dual-Domain Recurrent Network for Fast MRI Reconstruction With Deep T1 Prior},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Integral Objects With Intra-Class Discriminator for Weakly-Supervised Semantic Segmentation
Junsong Fan, Zhaoxiang Zhang, Chunfeng Song, Tieniu Tan


Image-level weakly-supervised semantic segmentation (WSSS) aims at learning semantic segmentation by adopting only image class labels. Existing approaches generally rely on class activation maps (CAM) to generate pseudo-masks and then train segmentation models. The main difficulty is that the CAM estimate covers only part of the foreground objects. In this paper, we argue that the critical factor preventing the full object mask from being obtained is the classification boundary mismatch problem in applying the CAM to WSSS. Because the CAM is optimized by the classification task, it focuses on the discrimination across different image-level classes. However, WSSS requires distinguishing pixels sharing the same image-level class to separate them into the foreground and the background. To alleviate this contradiction, we propose an efficient end-to-end Intra-Class Discriminator (ICD) framework, which learns intra-class boundaries to help separate the foreground and the background within each image-level class. Without bells and whistles, our approach achieves state-of-the-art performance for image-label-based WSSS, with mIoU 68.0% on the VOC 2012 semantic segmentation benchmark, demonstrating the effectiveness of the proposed approach.
[sign, previous, multiple] [icd, cam, foreground, semantic, background, score, segmentation, adopt, saliency, object, miou, feature, boundary, branch, crf, map, table, achieves, weakly, alleviate, adopts, recall, refine, threshold] [model, external, study] [ieee, pattern, convolutional, based, generally, proposed, applying, pixel, comparison, figure, adopted] [image, generate, train, learn, adaptation, discrimination, separate, loss, supervised, mismatch, learns] [class, training, learning, classification, problem, deep, efficient, set, performance, label, vector, machine, strategy, task, network, adapted] [conference, computer, approach, vision, international, estimate, estimation, directional, compute, directly, demonstrate]
@InProceedings{Fan_2020_CVPR,
  author = {Fan, Junsong and Zhang, Zhaoxiang and Song, Chunfeng and Tan, Tieniu},
  title = {Learning Integral Objects With Intra-Class Discriminator for Weakly-Supervised Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
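For context, a class activation map of the kind this approach builds on can be computed as a weighted sum of the final convolutional feature maps, using the classifier weights of the target class. A minimal PyTorch sketch (names such as features and fc_weights are hypothetical; this is the standard CAM computation, not the ICD module itself):

import torch

def class_activation_map(features, fc_weights, class_idx):
    # features: (C, H, W) feature maps from the last conv layer
    # fc_weights: (num_classes, C) weights of the global-pooling classifier
    w = fc_weights[class_idx]                     # (C,)
    cam = torch.einsum('c,chw->hw', w, features)  # weighted sum over channels
    cam = torch.relu(cam)
    return cam / (cam.max() + 1e-8)               # normalize to [0, 1]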
FPConv: Learning Local Flattening for Point Convolution
Yiqun Lin, Zizheng Yan, Haibin Huang, Dong Du, Ligang Liu, Shuguang Cui, Xiaoguang Han


We introduce FPConv, a novel surface-style convolution operator designed for 3D point cloud analysis. Unlike previous methods, FPConv does not require transforming the data into an intermediate representation such as a 3D grid or graph, and works directly on the surface geometry of the point cloud. More specifically, for each point, FPConv performs a local flattening by automatically learning a weight map to softly project surrounding points onto a 2D grid. Regular 2D convolution can thus be applied for efficient feature learning. FPConv can be easily integrated into various network architectures for tasks like 3D object classification and 3D scene segmentation, and achieves performance comparable with existing volumetric-type convolutions. More importantly, our experiments also show that FPConv is complementary to volumetric convolutions, and jointly training them further boosts overall performance to state-of-the-art results.
[graph, previous] [feature, semantic, fps, segmentation, object, apply, global, pooling, level, miou, area] [input, complementary, conduct] [convolution, ieee, fusion, pattern, convolutional, method, analysis, figure, spectral, interpolation, residual, block, performs, parallel, intensity, based] [shared, representation, learn] [learning, deep, performance, network, neural, classification, data, large, training, better, arxiv, preprint, normalization, accuracy, design, applied, processing, sparsity, indicates, mentioned, weight, efficient] [point, fpconv, conference, local, computer, cloud, kpconv, grid, vision, projection, pointconv, scene, plane, international, surface, flattening, continuous, dense, scannet, second, directly, indoor, flat, shape, curvature, sparse, pointnet, capture]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Yiqun and Yan, Zizheng and Huang, Haibin and Du, Dong and Liu, Ligang and Cui, Shuguang and Han, Xiaoguang},
  title = {FPConv: Learning Local Flattening for Point Convolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
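The core idea of local flattening, learning soft weights that project a point's neighborhood onto a 2D grid before applying a regular 2D convolution, can be sketched roughly as follows in PyTorch. This is a simplified single-point illustration with hypothetical names, not the released FPConv implementation.

import torch
import torch.nn as nn

class LocalFlattening(nn.Module):
    def __init__(self, in_channels, out_channels, grid_size=5):
        super().__init__()
        self.grid_size = grid_size
        # Predicts, for each neighbor, soft weights over the G*G grid cells.
        self.weight_net = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, grid_size * grid_size))
        self.conv2d = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, rel_xyz, neighbor_feats):
        # rel_xyz: (k, 3) neighbor coordinates relative to the center point
        # neighbor_feats: (k, C) features of the k neighbors
        w = torch.softmax(self.weight_net(rel_xyz), dim=1)  # (k, G*G)
        grid_feat = w.t() @ neighbor_feats                  # (G*G, C)
        g = self.grid_size
        grid_feat = grid_feat.t().reshape(1, -1, g, g)      # (1, C, G, G)
        return self.conv2d(grid_feat)                       # (1, C_out, G, G)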
Rotation Equivariant Graph Convolutional Network for Spherical Image Classification
Qin Yang, Chenglin Li, Wenrui Dai, Junni Zou, Guo-Jun Qi, Hongkai Xiong


Convolutional neural networks (CNNs) designed for low-dimensional regular grids will unfortunately lead to non-optimal solutions for analyzing spherical images, due to their geometrical properties, which differ from those of planar images. In this paper, we generalize grid-based CNNs to a non-Euclidean space by taking into account the geometry of spherical surfaces and propose a Spherical Graph Convolutional Network (SGCN) to encode rotation equivariant representations. Specifically, we propose a spherical graph construction criterion showing that a graph needs to be regular, evenly covering the spherical surface, in order to allow a rotation equivariant graph convolutional layer to be designed. For the practical case where a perfectly regular graph does not exist, we design two quantitative measures to evaluate the degree of irregularity of a spherical graph. The Geodesic ICOsahedral Pixelation (GICOPix) is adopted to construct spherical graphs with the minimum degree of irregularity compared to the current popular pixelation schemes. In addition, we design a hierarchical pooling layer to keep rotation-equivariance, followed by a transition layer to enforce invariance to rotations for spherical image classification. We evaluate the proposed graph convolutional layers under different pixelation schemes in terms of equivariance errors. We also assess the effectiveness of the proposed SGCN in fulfilling rotation-invariance, via the invariance error of the transition layers, and in recognizing spherical images and 3D objects.
[graph, regular, dataset, hierarchical, three, constructed, irregular, order, encode, construct] [pooling, feature, level, table, object] [degree, original, model, definition, input] [proposed, convolutional, based, signal, ieee, cnns, convolution, pattern, adjacent, figure] [image, invariance, ability] [layer, classification, performance, transition, number, group, learning, network, neural, evaluate, compared, scheme, variance, set, filter, weight, feeding] [spherical, rotation, sgcn, conference, equivariant, pixelation, isometric, transformation, equivariance, computer, vertex, error, gicopix, international, construction, irregularity, icosahedron, polynomial, rotated, vision, geodesic, sphere, pdos, icosahedral, projection, healpix, chebyshev, point, spherenet, distance]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Qin and Li, Chenglin and Dai, Wenrui and Zou, Junni and Qi, Guo-Jun and Xiong, Hongkai},
  title = {Rotation Equivariant Graph Convolutional Network for Spherical Image Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
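Graph convolutions of this family are commonly implemented as Chebyshev polynomial filtering of the graph Laplacian. A minimal NumPy sketch of order-K Chebyshev filtering (hypothetical names; the actual SGCN layer adds equivariance-specific design on top of this):

import numpy as np

def chebyshev_graph_conv(x, L, theta):
    # x: (N, C_in) signal on N graph vertices
    # L: (N, N) rescaled graph Laplacian with eigenvalues in [-1, 1]
    # theta: (K, C_in, C_out) Chebyshev filter coefficients
    K = theta.shape[0]
    Tx = [x, L @ x]                         # T_0(L) x and T_1(L) x
    for _ in range(2, K):
        Tx.append(2 * L @ Tx[-1] - Tx[-2])  # Chebyshev recurrence
    return sum(Tx[k] @ theta[k] for k in range(K))  # (N, C_out)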
FOAL: Fast Online Adaptive Learning for Cardiac Motion Estimation
Hanchao Yu, Shanhui Sun, Haichao Yu, Xiao Chen, Honghui Shi, Thomas S. Huang, Terrence Chen


Motion estimation of cardiac MRI videos is crucial for the evaluation of human heart anatomy and function. Recent research shows promising results with deep learning-based methods. In clinical deployment, however, these methods suffer dramatic performance drops due to mismatched distributions between training and testing datasets, which are commonly encountered in the clinical environment. On the other hand, it is arguably impossible to collect all representative datasets and to train a universal tracker before deployment. In this context, we propose a novel fast online adaptive learning (FOAL) framework: an online gradient descent based optimizer that is optimized by a meta-learner. The meta-learner enables the online optimizer to perform a fast and robust adaptation. We evaluated our method through extensive experiments on two public clinical datasets. The results show the superior accuracy of FOAL compared to the offline-trained tracking method. On average, FOAL takes only 0.4 seconds per video for online optimization.
[video, dataset, work, frame, recognition] [tracking, segmentation, inside, tracker, table, feature, framework, mask] [model, trained, kaggle, datasets] [motion, foal, cardiac, proposed, method, ieee, acdc, medical, myo, optical, flow, dice, reference, heart, cmr, pattern, clinical, based, adaptive, magnetic, utilized, fast, performed, mri, june, resonance] [image, unsupervised, loss, train, adaptation, source, perform, disease, idea] [learning, distribution, baseline, online, meta, training, data, optimizer, test, deep, compared, number, algorithm, optimization, set, gradient, network, performance, neural, experiment, descent, adapt, task, large, note, problem, improved, maml, rate, averaged] [estimation, dense, conference, computer, vision, international, approach, david]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Hanchao and Sun, Shanhui and Yu, Haichao and Chen, Xiao and Shi, Honghui and Huang, Thomas S. and Chen, Terrence},
  title = {FOAL: Fast Online Adaptive Learning for Cardiac Motion Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
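A minimal sketch of the kind of online adaptation step FOAL performs: take a few gradient steps on a self-supervised loss computed on the incoming test video before producing the final motion estimate. The names (model, online_loss, test_frames) are hypothetical, and the paper's meta-learned optimizer is replaced here by plain SGD.

import copy
import torch

def online_adapt(model, online_loss, test_frames, lr=1e-4, steps=3):
    # Clone the offline-trained tracker so its weights are not overwritten.
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = online_loss(adapted, test_frames)  # e.g. an image warping error
        loss.backward()
        opt.step()
    return adapted  # use the adapted model for motion estimation on this video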
ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation
Sharon Fogel, Hadar Averbuch-Elor, Sarel Cohen, Shai Mazor, Roee Litman


The performance of optical character recognition (OCR) systems has improved significantly in the deep learning era. This is especially true for handwritten text recognition (HTR), where each author has a unique style, unlike printed text, where the variation is smaller by design. That said, deep learning based HTR is limited, as in every other task, by the number of training examples. Gathering data is a challenging and costly task, and the labeling task that follows, on which we focus here, is even more so. One possible approach to reduce the burden of data annotation is semi-supervised learning. Semi-supervised methods use, in addition to labeled data, some unlabeled samples to improve performance compared to fully supervised ones. Consequently, such methods may adapt to unseen images during test time. We present ScrabbleGAN, a semi-supervised approach to synthesize handwritten text images that are versatile both in style and lexicon. ScrabbleGAN relies on a novel generative model which can generate images of words with an arbitrary length. We show how to operate our approach in a semi-supervised manner, enjoying the aforementioned benefits such as a performance boost over state-of-the-art supervised HTR. Furthermore, our generator can manipulate the resulting text style. This allows us to change, for instance, whether the text is cursive, or how thin the pen stroke is.
[text, handwriting, recognition, word, handwritten, character, iam, recurrent, dataset, sequence, recognizer, three, lexicon, work, inspired] [table, fed, final, fully] [noise, trained, model, offline, adversarial, original, case, input] [figure, method, convolutional, presented, receptive, analysis, column, based, proposed, output, adjacent] [cvl, generated, htr, generate, image, generator, gan, style, train, row, discriminator, loss, generation, generative, synthetic, supervised, pen, generating, alonso, letter, document, synthesize, scrabblegan] [data, training, performance, network, architecture, neural, learning, set, test, written, vector, online, arxiv, preprint, balancing, process, width, deep, unlabeled, compared, modern, large, layer, standard, gradient, best] [approach, conference, international, allows, well]
@InProceedings{Fogel_2020_CVPR,
  author = {Fogel, Sharon and Averbuch-Elor, Hadar and Cohen, Sarel and Mazor, Shai and Litman, Roee},
  title = {ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-Domain Semantic Segmentation via Domain-Invariant Interactive Relation Transfer
Fengmao Lv, Tao Liang, Xiang Chen, Guosheng Lin


Exploiting photo-realistic synthetic data to train semantic segmentation models has received increasing attention over the past years. However, the domain mismatch between synthetic and real images will cause a significant performance drop when the model trained with synthetic images is directly applied to real-world scenarios. In this paper, we propose a new domain adaptation approach, called Pivot Interaction Transfer (PIT). Our method mainly focuses on constructing pivot information that is common knowledge shared across domains as a bridge to promote the adaptation of semantic segmentation model from synthetic domains to real-world domains. Specifically, we first infer the image-level category information about the target images, which is then utilized to facilitate pixel-level transfer for semantic segmentation, with the assumption that the interactive relation between the image-level category information and the pixel-level semantic information is invariant across domains. To this end, we propose a novel multi-level region expansion mechanism that aligns both the image-level and pixel-level information. Comprehensive experiments on the adaptation from both GTAV and SYNTHIA to Cityscapes clearly demonstrate the superiority of our method.
[unit, urban, pivot, interaction, traffic, relation, road, text, dataset, mechanism, includes, granularity, infer] [semantic, segmentation, region, category, map, interactive, aggregation, table, propose, promote, regression, miou, fully, imagelevel] [model, adversarial, trained, reveal, drop, strong] [expansion, method, convolutional, clear, prior, proposed, output] [domain, adaptation, target, source, component, synthetic, gtav, synthia, image, unsupervised, transfer, learn, common, loss, train, shared, specific, produce] [activation, class, learning, performance, training, neural, good, data, knowledge, large, logistic, layer, rate, constructing, deep, size] [reconstruction, ground, directly, approach, scene]
@InProceedings{Lv_2020_CVPR,
  author = {Lv, Fengmao and Liang, Tao and Chen, Xiang and Lin, Guosheng},
  title = {Cross-Domain Semantic Segmentation via Domain-Invariant Interactive Relation Transfer},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Inflated Episodic Memory With Region Self-Attention for Long-Tailed Visual Recognition
Linchao Zhu, Yi Yang


There has been increasing interest in modeling long-tailed data. Unlike artificially collected datasets, long-tailed data naturally exist in the real world and are thus more realistic. To deal with the class imbalance problem, we introduce an Inflated Episodic Memory (IEM) for long-tailed visual recognition. First, our IEM augments the convolutional neural networks with categorical representative features for rapid learning on tail classes. In traditional few-shot learning, a single prototype is usually leveraged to represent a category. However, long-tailed data have higher intra-class variance, and it can be challenging to learn a single prototype for one category. Thus, we introduce IEM to store the most discriminative feature for each category individually. Besides, the memory banks are updated independently, which further decreases the chance of learning skewed classifiers. Second, we introduce a novel region self-attention mechanism for multi-scale spatial feature map encoding. It is beneficial to incorporate more discriminative features to improve generalization on tail classes. We propose to encode local feature maps at multiple scales, while aggregating the spatial contextual information at the same time. Equipped with IEM and region self-attention, we achieve state-of-the-art performance on four standard long-tailed image recognition benchmarks. Besides, we validate the effectiveness of IEM on a long-tailed video recognition benchmark, i.e., YouTube-8M.
[iem, video, recognition, visual, encoding, bank, mechanism, inflated, oltr, prediction, dataset, pair, linchao, modeling, multiple, netvlad] [region, feature, global, key, effectiveness, map, table, category, head, improvement, pooling, propose] [model, generalization, original, improve, conduct] [convolutional, figure, introduced, method, spatial, block] [loss, image, introduce, discriminative, representation, train, prototype, learn, generated, corresponding] [memory, learning, tail, classification, episodic, data, rate, performance, deep, training, imbalanced, class, updated, number, rsa, set, size, maximum, average, arxiv, preprint, neural, evaluate, update, longtailed, achieve, vector, standard, large, network, classifier] [local, single, leverage, novel]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Linchao and Yang, Yi},
  title = {Inflated Episodic Memory With Region Self-Attention for Long-Tailed Visual Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
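As a rough sketch of a per-class feature memory of the kind the IEM abstract describes (a hypothetical simplification; the paper stores the most discriminative feature per class and updates each bank independently):

import torch

class ClassMemory:
    def __init__(self, num_classes, feat_dim, momentum=0.9):
        self.bank = torch.zeros(num_classes, feat_dim)
        self.momentum = momentum

    def update(self, feats, labels):
        # feats: (B, D) features, labels: (B,) class indices
        for f, y in zip(feats, labels):
            self.bank[y] = self.momentum * self.bank[y] + (1 - self.momentum) * f

    def retrieve(self, feats):
        # Similarity of each query feature to every class slot in the memory.
        q = torch.nn.functional.normalize(feats, dim=1)
        m = torch.nn.functional.normalize(self.bank, dim=1)
        return q @ m.t()  # (B, num_classes)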
Multimodal Future Localization and Emergence Prediction for Objects in Egocentric View With a Reachability Prior
Osama Makansi, Ozgun Cicek, Kevin Buchicchio, Thomas Brox


In this paper, we investigate the problem of anticipating future dynamics, particularly the future location of other vehicles and pedestrians, in the view of a moving vehicle. We approach two fundamental challenges: (1) the partial visibility due to the egocentric view with a single RGB camera and considerable field-of-view change due to the egomotion of the vehicle; (2) the multimodality of the distribution of future states. In contrast to many previous works, we do not assume structural knowledge from maps. We rather estimate a reachability prior for certain classes of objects from the semantic map of the present image and propagate it into the future using the planned egomotion. Experiments show that the reachability prior combined with multi-hypotheses learning improves multimodal prediction of the future location of tracked objects and, for the first time, the emergence of new objects. We also demonstrate promising zero-shot transfer to unseen datasets.
[future, prediction, trajectory, multimodal, egocentric, predicting, reachability, emergence, egomotion, multiple, predict, work, dataset, environment, fln, visual, fde, planned, kalman, vehicle, traffic, driving, forecasting, static, dtp, sted, observed, social, silvio, lstms, time, interaction, attention] [object, framework, localization, semantic, pedestrian, autonomous, bounding, nuscenes, rpn, table, iou, epn, challenging, waymo] [model, testing] [prior, june, motion, proposed, dynamic, method, convolutional] [learn, street, jimei, diverse] [arxiv, preprint, bayesian, network, learning, task, deep, sampling, mixture, set, filter, linear, distribution, report, test, nll, applied] [scene, approach, predicts, solution, human, error, second, single, estimate, view]
@InProceedings{Makansi_2020_CVPR,
  author = {Makansi, Osama and Cicek, Ozgun and Buchicchio, Kevin and Brox, Thomas},
  title = {Multimodal Future Localization and Emergence Prediction for Objects in Egocentric View With a Reachability Prior},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Structure Preserving Generative Cross-Domain Learning
Haifeng Xia, Zhengming Ding


Unsupervised domain adaptation (UDA) offers a way to deal with insufficient or absent labeled data in the target domain by exploiting well-annotated source knowledge drawn from a different distribution. Most research efforts on UDA seek a domain-invariant classifier under source supervision. However, due to the scarcity of label information in the target domain, such a classifier lacks ground-truth target supervision, which dramatically limits its robustness and discriminative power. To this end, we develop a novel Generative cross-domain learning via Structure-Preserving (GSP) approach, which attempts to transform target data into the source domain in order to take advantage of source supervision. Specifically, a novel cross-domain graph alignment is developed to capture the intrinsic relationship across the two domains during target-source translation. Simultaneously, two distinct classifiers are trained to trigger domain-invariant feature learning, both guided by source supervision: one is a traditional source classifier and the other is a source-supervised target classifier. Extensive experimental results on several cross-domain visual benchmarks demonstrate the effectiveness of our model in comparison with other state-of-the-art UDA algorithms.
[graph, dataset, relationship, three, visual, relation, step, shift, extract] [feature, adopt, promote] [adversarial, model, difference, trained, difficult, robust, input] [ieee, pattern, method, proposed, analysis, develop] [domain, target, ast, source, alignment, discrepancy, adaptation, unsupervised, train, uda, generator, manner, gsp, discriminative, fis, generative, crossdomain, corresponding, mingsheng, jianmin, zhengming, address, distinction, loss] [classifier, learning, training, label, distribution, measure, deep, accuracy, data, classification, metric, performance, machine, maximum, learned, probability, neural, space, network, compared, task, function, similarity, indicates, larger] [conference, computer, vision, matching, structure, symmetric, international, novel, defined, distance]
@InProceedings{Xia_2020_CVPR,
  author = {Xia, Haifeng and Ding, Zhengming},
  title = {Structure Preserving Generative Cross-Domain Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Reverse Perspective Network for Perspective-Aware Object Counting
Yifan Yang, Guorong Li, Zhe Wu, Li Su, Qingming Huang, Nicu Sebe


One of the critical challenges of object counting is the dramatic scale variation, which is introduced by arbitrary perspectives. We propose a reverse perspective network that resolves the scale variations of input images, instead of generating perspective maps to smooth final outputs. The reverse perspective network explicitly evaluates the perspective distortions, and efficiently corrects the distortions by uniformly warping the input images. The proposed network then delivers images with similar instance scales to the regressor, so the regression network does not need multi-scale receptive fields to match the various scales. Besides, to further address the scale problem in more congested areas, we enhance the corresponding regions of the ground-truth with the evaluation errors. We then force the regressor to learn from the augmented ground-truth via an adversarial process. Furthermore, to verify the proposed model, we collected a vehicle counting dataset based on Unmanned Aerial Vehicles (UAVs). The proposed dataset has fierce scale variations. Extensive experimental results on four benchmark datasets show the improvements of our method over the state of the art.
[dataset, evaluation, vehicle, critical] [employ, regression, table, achieves, propose, instance, map, object, framework] [adversarial, input, trained, original, verify, collected, variation, datasets] [scale, method, proposed, reverse, crowd, counting, pattern, warped, mae, convolutional, figure, dramatic, receptive, shanghaitech, csrnet, congested, diminish, warp, column, based, convolution, residual, spatial, ieee] [image, unsupervised, row, learn, factor, loss, corresponding, train] [network, density, learning, training, deep, capacity, augmented, set, performance, evaluate, neural, adapt, space, vector, baseline, count, average, normalize] [perspective, computer, vision, regressor, coordinate, conference, limited, grid, structure, international, solve, uniformly, estimator, estimation, force]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Yifan and Li, Guorong and Wu, Zhe and Su, Li and Huang, Qingming and Sebe, Nicu},
  title = {Reverse Perspective Network for Perspective-Aware Object Counting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Path Region Mining for Weakly Supervised 3D Semantic Segmentation on Point Clouds
Jiacheng Wei, Guosheng Lin, Kim-Hui Yap, Tzu-Yi Hung, Lihua Xie


Point clouds provide intrinsic geometric information and surface context for scene understanding. Existing methods for point cloud segmentation require a large amount of fully labeled data. With advanced depth sensors, collecting large-scale 3D datasets is no longer a cumbersome process. However, manually producing point-level labels on such large-scale datasets is time- and labor-intensive. In this paper, we propose a weakly supervised approach to predict point-level results using weak labels on 3D point clouds. We introduce a multi-path region mining module to generate pseudo point-level labels from a classification network trained with weak labels. It mines the localization cues for each class from various aspects of the network feature using different attention modules. Then, we use the point-level pseudo labels to train a point cloud segmentation network in a fully supervised manner. To the best of our knowledge, this is the first method that uses cloud-level weak labels on raw 3D space to train a point cloud semantic segmentation network. In our setting, the 3D weak labels only indicate the classes that appear in the input sample. We discuss both scene- and subcloud-level weak labels on raw 3D point clouds and perform in-depth experiments on them. On the ScanNet dataset, our result trained with subcloud-level labels is comparable to some fully supervised methods.
[attention, prediction, context] [segmentation, feature, module, semantic, region, weakly, weak, map, table, pcam, object, subcloud, annotation, mprm, global, fully, localization, supervision, subclouds, nxc, pooling, propose, appeared, labor, backbone, aggregated, final] [trained, input, original, model] [spatial, ieee, convolution, pattern, channel, figure, convolutional, result, method, fusion, scale, raw] [pseudo, supervised, generate, discriminative, learn, train, introduce, produce, image] [network, classification, label, learning, class, deep, training, path, mining, number, layer, neural, data, average, better, large, performance, max, batch] [point, cloud, conference, computer, vision, scene, approach, directly, local]
@InProceedings{Wei_2020_CVPR,
  author = {Wei, Jiacheng and Lin, Guosheng and Yap, Kim-Hui and Hung, Tzu-Yi and Xie, Lihua},
  title = {Multi-Path Region Mining for Weakly Supervised 3D Semantic Segmentation on Point Clouds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Reliable Weighted Optimal Transport for Unsupervised Domain Adaptation
Renjun Xu, Pelen Liu, Liyan Wang, Chao Chen, Jindong Wang


Recently, extensive research has addressed the UDA problem, which aims to learn transferable models for the unlabeled target domain. Among the available tools, optimal transport is a promising metric to align the representations of the source and target domains. However, most existing works based on optimal transport ignore the intra-domain structure and only achieve coarse pair-wise matching. Target samples distributed near the edge of the clusters, or far from their corresponding class centers, are easily misclassified by the decision boundary learned from the source domain. In this paper, we present Reliable Weighted Optimal Transport (RWOT) for unsupervised domain adaptation, including a novel Shrinking Subspace Reliability (SSR) and a weighted optimal transport strategy. Specifically, SSR exploits spatial prototypical information and intra-domain structure to dynamically measure the sample-level domain discrepancy across domains. Besides, the weighted optimal transport strategy based on SSR is exploited to achieve a precise pair-wise optimal transport procedure, which reduces the negative transfer brought by samples near decision boundaries in the target domain. RWOT is also equipped with a discriminative centroid clustering exploitation strategy to learn transferable features. A thorough evaluation shows that RWOT outperforms existing state-of-the-art methods on standard domain adaptation benchmarks.
[dataset, previous, work] [feature, centroid, achieves, propose, boundary] [adversarial, shrinking, decision] [spatial, figure, based, ieee, proposed, pattern, existing, method, coupling, kernel] [domain, target, source, transport, adaptation, rwot, transfer, unsupervised, discriminative, discrepancy, deepjdot, alignment, csk, ssr, learn, wasserstein, loss, jindong] [optimal, deep, weighted, prototypical, learning, class, subspace, strategy, reliability, matrix, reliable, network, data, training, performance, probability, distribution, negative, classification, neural, accuracy, classifier, sample, machine, large, measure, standard, knowledge, number, average, arxiv, preprint, metric, learned, maximum] [conference, computer, distance, vision, joint, cost, international, structure, approach]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Renjun and Liu, Pelen and Wang, Liyan and Chen, Chao and Wang, Jindong},
  title = {Reliable Weighted Optimal Transport for Unsupervised Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ImVoteNet: Boosting 3D Object Detection in Point Clouds With Image Votes
Charles R. Qi, Xinlei Chen, Or Litany, Leonidas J. Guibas


3D object detection has seen quick progress thanks to advances in deep learning on point clouds. A few recent works have even shown state-of-the-art performance with point cloud input alone (e.g., VoteNet). However, point cloud data have inherent limitations. They are sparse, lack color information and often suffer from sensor noise. Images, on the other hand, have high resolution and rich texture. Thus they can complement the 3D geometry provided by point clouds. Yet how to effectively use image information to assist point-cloud-based detection is still an open question. In this work, we build on top of VoteNet and propose a 3D detection architecture called ImVoteNet specialized for RGB-D scenes. ImVoteNet is based on fusing 2D votes in images and 3D votes in point clouds. Compared to prior work on multi-modal detection, we explicitly extract both geometric and semantic features from the 2D images. We leverage camera parameters to lift these features to 3D. To improve the synergy of 2D-3D feature fusion, we also propose a multi-tower training scheme. We validate our model on the challenging SUN RGB-D dataset, advancing state-of-the-art results by 5.7 mAP. We also provide rich ablation studies to analyze the contribution of each design choice.
[provide, work, visual, previous] [object, detection, vote, semantic, seed, feature, table, voting, center, box, bounding, hough, tower, sun, propose, proposal, ablation, sofa, region, score, boost, segmentation, detector, faster, roi, map] [ray, model, input, blending, showing] [based, pixel, fusion, analysis, color, proposed, method] [image, texture, pseudo, generate, extracted] [deep, network, data, vector, training, gradient, performance, learning, design, space, best, pass, set, arxiv, preprint, architecture] [point, cloud, geometric, rgb, depth, camera, scene, joint, leonidas, indoor, sparse, geometry, leverage, lift, directly, dense, additional, surface, charles]
@InProceedings{Qi_2020_CVPR,
  author = {Qi, Charles R. and Chen, Xinlei and Litany, Or and Guibas, Leonidas J.},
  title = {ImVoteNet: Boosting 3D Object Detection in Point Clouds With Image Votes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
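The geometric part of lifting image cues into the point cloud boils down to back-projecting a pixel with known depth through the camera intrinsics. A minimal sketch with hypothetical names (ImVoteNet itself lifts 2D vote directions rather than single pixels):

import numpy as np

def backproject_pixel(u, v, depth, K):
    # u, v: pixel coordinates; depth: metric depth at that pixel
    # K: (3, 3) camera intrinsic matrix
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])  # 3D point in camera coordinates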
Understanding Road Layout From Videos as a Whole
Buyu Liu, Bingbing Zhuang, Samuel Schulter, Pan Ji, Manmohan Chandraker


In this paper, we address the problem of inferring the layout of complex road scenes from video sequences. To this end, we formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently. In contrast to prior work, we exploit the following three novel aspects: leveraging camera motions in videos, including context cues and incorporating long-term video information. Specifically, we introduce a model that aims to enforce prediction consistency in videos. Our model consists of one LSTM and one Feature Transform Module (FTM). The former implicitly incorporates the consistency constraint with its hidden states, and the latter explicitly takes the camera motion into consideration when aggregating information along videos. Moreover, we propose to incorporate context information by introducing road participants, e.g. objects, into our model. When the entire video sequence is available, our model is also able to encode both local and global cues, e.g. information from both past and future frames. Experiments on two data sets show that: (1) Incorporating either global or contextual cues improves the prediction accuracy and leveraging both gives the best performance. (2) Introducing the LSTM and FTM modules improves the prediction consistency in videos. (3) The proposed method outperforms the SOTA by a large margin.
[road, lstm, video, prediction, context, temporal, understanding, frame, individual, work, explicitly, predict, sequence, multiple] [feature, semantic, global, propose, bev, object, map, segmentation, nuscenes, module, aggregate, boost, samuel, final, detection, effectiveness] [model, input, improve] [ftm, proposed, ieee, method, figure, pattern, transform, convolutional, implicitly] [layout, representation, image, consistency, utilize] [data, network, observe, neural, accuracy, note, report, performance, basic, compared, binary, learning, entire] [scene, single, computer, conference, perspective, estimation, well, camera, depth, vision, kitti, reconstruction, rgb, view, consistent, point, local, plane, full, complex, parametric]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Buyu and Zhuang, Bingbing and Schulter, Samuel and Ji, Pan and Chandraker, Manmohan},
  title = {Understanding Road Layout From Videos as a Whole},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Bi-Directional Relationship Inferring Network for Referring Image Segmentation
Zhiwei Hu, Guang Feng, Jiayu Sun, Lihe Zhang, Huchuan Lu


Most existing methods do not explicitly formulate the mutual guidance between vision and language. In this work, we propose a bi-directional relationship inferring network (BRINet) to model the dependencies of cross-modal information. In detail, the vision-guided linguistic attention is used to learn the adaptive linguistic context corresponding to each visual region. Combining with the language-guided visual attention, a bi-directional cross-modal attention module (BCAM) is built to learn the relationship between multi-modal features. Thus, the ultimate semantic context of the target object and referring expression can be represented accurately and consistently. Moreover, a gated bi-directional fusion module (GBFM) is designed to integrate the multi-level features where a gate function is used to guide the bi-directional flow of multi-level information. Extensive experiments on four benchmark datasets demonstrate that the proposed method outperforms other state-of-the-art methods under different evaluation metrics.
[referring, visual, attention, linguistic, relationship, bcam, language, brinet, context, lstm, gated, word, dataset, unc, natural, mechanism, vlam, length, inferring, modeling, recurrent, gbfm, previous, referit, outperforms] [segmentation, feature, module, semantic, object, region, guide, represents, final, contextual, propose, huchuan, fully, mask, aspp, iou, lihe, segment, pyramid] [expression, model, datasets, input, query] [fusion, proposed, method, spatial, guidance, convolutional, adaptive, figure, channel, enhance] [image, target, learn, piece, corresponding, representation, generate, introduce] [network, gate, learning, performance, baseline, mutual, function, design, neural, better] [left, demonstrate, detailed]
@InProceedings{Hu_2020_CVPR,
  author = {Hu, Zhiwei and Feng, Guang and Sun, Jiayu and Zhang, Lihe and Lu, Huchuan},
  title = {Bi-Directional Relationship Inferring Network for Referring Image Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
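The language-guided visual attention at the heart of such bi-directional cross-modal modules can be sketched as ordinary scaled dot-product attention between word features and spatial visual features. A simplified PyTorch sketch with hypothetical names (the actual BCAM adds the reverse, vision-guided linguistic direction and gating):

import torch

def language_guided_visual_attention(word_feats, vis_feats):
    # word_feats: (L, D) linguistic features for L words
    # vis_feats: (HW, D) visual features for HW spatial positions
    d = word_feats.shape[1]
    attn = torch.softmax(word_feats @ vis_feats.t() / d ** 0.5, dim=1)  # (L, HW)
    # Each word attends over spatial positions to gather visual context.
    return attn @ vis_feats  # (L, D)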
Perspective Plane Program Induction From a Single Image
Yikai Li, Jiayuan Mao, Xiuming Zhang, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu


We study the inverse graphics problem of inferring a holistic representation for natural images. Given an input image, our goal is to induce a neuro-symbolic, program-like representation that jointly models camera poses, object locations, and global scene structures. Such high-level, holistic scene representations further facilitate low-level image manipulation tasks such as inpainting. We formulate this problem as jointly finding the camera pose and scene structure that best describe the input image. The benefits of such joint inference are two-fold: scene regularity serves as a new cue for perspective correction, and in turn, correct perspective correction leads to a simplified scene structure, similar to how the correct shape leads to the most regular texture in shape from texture. Our proposed framework, Perspective Plane Program Induction (P3I), combines search-based and gradient-based algorithms to efficiently solve the problem. P3I outperforms a set of baselines on a collection of Internet images, across tasks including camera pose estimation, global structure inference, and down-stream image manipulation tasks.
[natural, individual, regular, outperforms, dataset, work, composed, correct] [global, object, feature, circular, detection, detected, holistic, table] [model, correction, input, manipulation] [based, repeated, pattern, figure, inverse, proposed] [image, inpainting, perform, texture, generative, missing, representation, generated, loss, inpaint] [inference, algorithm, learning, search, neural, problem, deep, set, function, william, best] [perspective, camera, scene, program, plane, pose, structure, lattice, single, regularity, rpd, estimation, planenet, shape, quilting, induction, joint, transformation, hybrid, npns, vanishing, compare, acm, local, estimated, view, surface, joshua, jiajun, jointly, solve, collection]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yikai and Mao, Jiayuan and Zhang, Xiuming and Freeman, William T. and Tenenbaum, Joshua B. and Wu, Jiajun},
  title = {Perspective Plane Program Induction From a Single Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DeepFLASH: An Efficient Network for Learning-Based Medical Image Registration
Jian Wang, Miaomiao Zhang


This paper presents DeepFLASH, a novel network with efficient training and inference for learning-based medical image registration. In contrast to existing approaches that learn spatial transformations from training data in the high dimensional imaging space, we develop a new registration network entirely in a low dimensional bandlimited space. This dramatically reduces the computational cost and memory footprint of an expensive training and inference. To achieve this goal, we first introduce complex-valued operations and representations of neural architectures that provide key components for learning-based registration models. We then construct an explicit loss function of transformation fields fully characterized in a bandlimited space with much fewer parameterizations. Experimental results show that our method is significantly faster than the state-of-the-art deep learning based image registration methods, while producing equally accurate alignment. We demonstrate our algorithm in two different applications of image registration: 2D synthetic data and 3D real brain magnetic resonance (MR) images.
[time, prediction] [segmentation, fully, denotes] [model, testing] [diffeomorphic, deepflash, brain, medical, method, low, fourier, imaging, spatial, bandlimited, high, convolutional, field, miaomiao, based, fast, shooting, figure, jacobian, dice, deformable, mri, frequency, voxelmorph, quicksilver, determinant, journal, operator, lddmm, detjac, ieee, zhang] [image, real, source, target, domain, synthetic, loss] [training, network, space, neural, learning, data, function, computational, deep, vector, efficient, optimization, dimension, memory, algorithm, gpu, computing, set, large, optimal, weight, parameter, pre, average, reduces] [registration, dimensional, transformation, international, geodesic, velocity, conference, deformed, imaginary, deformation, flash, defined, volume, complex, computer, smoothness, initial, demonstrate]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Jian and Zhang, Miaomiao},
  title = {DeepFLASH: An Efficient Network for Learning-Based Medical Image Registration},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
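The band-limited representation underlying DeepFLASH can be illustrated by truncating a 2D field to its lowest Fourier frequencies, which drastically reduces the number of parameters that need to be learned. A rough NumPy sketch with hypothetical names (not the paper's complex-valued network layers):

import numpy as np

def truncate_to_low_frequencies(field, keep=16):
    # field: (H, W) one component of a velocity/transformation field
    F = np.fft.fftshift(np.fft.fft2(field))
    H, W = field.shape
    mask = np.zeros_like(F)
    cy, cx = H // 2, W // 2
    mask[cy - keep:cy + keep, cx - keep:cx + keep] = 1  # keep only low frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))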
Semi-Supervised Learning for Few-Shot Image-to-Image Translation
Yaxing Wang, Salman Khan, Abel Gonzalez-Garcia, Joost van de Weijer, Fahad Shahbaz Khan


In the last few years, unpaired image-to-image translation has witnessed remarkable progress. Although the latest methods are able to generate realistic images, they crucially rely on a large number of labeled images. Recently, some methods have tackled the challenging setting of few-shot image-to-image translation, reducing the labeled data requirements for the target domain during inference. In this work, we go one step further and also reduce the amount of labeled data required from the source domain during training. To do so, we propose applying semi-supervised learning via a noise-tolerant pseudo-labeling procedure. We also apply a cycle consistency constraint to further exploit the information from unlabeled images, either from the same dataset or an external one. Additionally, we propose several structural modifications to facilitate the image translation task under these circumstances. Our semi-supervised method for few-shot image translation, called SEMIT, achieves excellent results on four different datasets using as little as 10% of the source labels, and matches the performance of the main fully-supervised competitor using only 20% labeled data. Our code and models are made public at: https://github.com/yaxingwang/SEMIT.
[dataset, work, exploit, previous] [feature, van, propose, achieves, extractor, labeling, employ] [adversarial, model, trained, datasets, input, study] [method, proposed, output, figure, noisy, comparison] [translation, image, source, target, loss, semit, funit, domain, train, octconv, generative, ntpl, appearance, unpaired, cycle, unsupervised, generate, discriminator, xsc, unseen, regulation, consistency, generator, joost, latent] [labeled, unlabeled, data, training, learning, set, classification, entropy, large, test, performance, classifier, label, number, consider, network, task, setting, process, procedure] [pose, approach, single, limited]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yaxing and Khan, Salman and Gonzalez-Garcia, Abel and Weijer, Joost van de and Khan, Fahad Shahbaz},
  title = {Semi-Supervised Learning for Few-Shot Image-to-Image Translation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
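One ingredient of this kind of semi-supervised setup, pseudo-labeling of unlabeled images, can be sketched as keeping only predictions above a confidence threshold. Hypothetical names; the paper's noise-tolerant procedure is more elaborate:

import torch

def pseudo_label(classifier, images, threshold=0.95):
    # images: (B, C, H, W) unlabeled batch; classifier returns (B, num_classes) logits
    with torch.no_grad():
        probs = torch.softmax(classifier(images), dim=1)
    conf, labels = probs.max(dim=1)
    keep = conf >= threshold          # only trust confident predictions
    return images[keep], labels[keep]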
Semantic Correspondence as an Optimal Transport Problem
Yanbin Liu, Linchao Zhu, Makoto Yamada, Yi Yang


Establishing dense correspondences across semantically similar images is a challenging task. Due to the large intra-class variation and background clutter, two common issues occur in current approaches. First, many pixels in a source image are assigned to one target pixel, i.e., many to one matching. Second, some object pixels are assigned to the background pixels, i.e., background matching. We solve the first issue by global feature matching, which maximizes the total matching correlations between images to obtain a global optimal matching matrix. The row sum and column sum constraints are enforced on the matching matrix to induce a balanced solution, thus suppressing the many to one matching. We solve the second issue by applying a staircase function on the class activation maps to re-weight the importance of pixels into four levels from foreground to background. The whole procedure is combined into a unified optimal transport algorithm by converting the maximization problem to the optimal transport formulation and incorporating the staircase weights into optimal transport algorithm to act as empirical distributions. The proposed algorithm achieves state-of-the-art performance on four benchmark datasets. Notably, a 26% relative improvement is achieved on the large-scale SPair-71k dataset.
[individual, dataset, evaluation, previous] [semantic, feature, background, object, staircase, map, table, assigned, global, denotes, correlation, hpf, cnn, minsu, foreground, employ, cam, hough] [model, original, variation, input] [proposed, method, column, figure, convolutional, prior, extraction, tss] [transport, target, image, source, row, alignment] [optimal, problem, algorithm, activation, class, sum, layer, validation, probability, large, matrix, function, number, set, learning, total, network, best, arg, cosine, average, empirical, strategy, min, deep] [matching, correspondence, pck, cost, geometric, computed, tij, dense, jean, solve, keypoints, point, compute, keypoint]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yanbin and Zhu, Linchao and Yamada, Makoto and Yang, Yi},
  title = {Semantic Correspondence as an Optimal Transport Problem},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
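The global matching step amounts to solving an entropy-regularized optimal transport problem, which is commonly done with Sinkhorn iterations. A minimal NumPy sketch (hypothetical names; the paper additionally injects the staircase weights as the marginal distributions):

import numpy as np

def sinkhorn(cost, mu, nu, eps=0.05, iters=100):
    # cost: (n, m) matching cost between source and target features
    # mu: (n,), nu: (m,) marginal weights, each summing to 1
    K = np.exp(-cost / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        u = mu / (K @ (nu / (K.T @ u)))  # alternating scaling updates
    v = nu / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport (matching) matrix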
How Much Time Do You Have? Modeling Multi-Duration Saliency
Camilo Fosco, Anelise Newman, Pat Sukhum, Yun Bin Zhang, Nanxuan Zhao, Aude Oliva, Zoya Bylinskii


What jumps out in a single glance of an image is different than what you might notice after closer inspection. Yet conventional models of visual saliency produce predictions at an arbitrary, fixed viewing duration, offering a limited view of the rich interactions between image content and gaze location. In this paper we propose to capture gaze as a series of snapshots, by generating population-level saliency heatmaps for multiple viewing durations. We collect the CodeCharts1K dataset, which contains multiple distinct heatmaps per image corresponding to 0.5, 3, and 5 seconds of free-viewing. We develop an LSTM-based model of saliency that simultaneously trains on data from multiple viewing durations. Our Multi-Duration Saliency Excited Model (MD-SEM) achieves competitive performance on the LSUN 2017 Challenge with 57% fewer parameters than comparable architectures. It is the first model that produces heatmaps at multiple viewing durations, enabling applications where multi-duration saliency can be used to prioritize visual content to keep, transmit, and render.
[viewing, attention, duration, visual, codecharts, multiple, time, temporal, prediction, dataset, predict, scanpath, three, lstm, zoya, action, antonio, modeling, sequence, people, captioning] [saliency, feature, module, salicon, map, correlation, salient, fully, predicted, segmentation] [gaze, model, eye, face, heatmaps, aude, collected, trained, crowdsourcing, input] [ieee, pattern, figure, analysis, convolutional, conventional, coefficient, based, excitation] [image, content, loss, distinct, introduce, consistency] [data, training, network, architecture, deep, performance, scaling, set, learning, large, validation] [conference, computer, human, ground, truth, vision, international, view, capture]
@InProceedings{Fosco_2020_CVPR,
  author = {Fosco, Camilo and Newman, Anelise and Sukhum, Pat and Zhang, Yun Bin and Zhao, Nanxuan and Oliva, Aude and Bylinskii, Zoya},
  title = {How Much Time Do You Have? Modeling Multi-Duration Saliency},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fine-Grained Generalized Zero-Shot Learning via Dense Attribute-Based Attention
Dat Huynh, Ehsan Elhamifar


We address the problem of fine-grained generalized zero-shot recognition of visually similar classes without training images for some classes. We propose a dense attribute-based attention mechanism that for each attribute focuses on the most relevant image regions, obtaining attribute-based features. Instead of aligning a global feature vector of an image with its associated class semantic vector, we propose an attribute embedding technique that aligns each attribute-based feature with its attribute semantic vector. Hence, we compute a vector of attribute scores, for the presence of each attribute in an image, whose similarity with the true class semantic vector is maximized. Moreover, we adjust each attribute score using an attention mechanism over attributes to better capture the discriminative power of different attributes. To tackle the challenge of bias towards seen classes during testing, we propose a new self-calibration loss that adjusts the probability of unseen classes to account for the training bias. We conduct experiments on three popular datasets of CUB, SUN and AWA2 as well as the large-scale DeepFashion dataset, showing that our model significantly improves the state of the art.
[attention, visual, recognition, embedding, dataset, work, prediction, mechanism, relevant, three] [semantic, feature, score, propose, sun, denotes, localization, holistic, table, improves, region, effectiveness, global] [model, deepfashion, testing, compatibility, datasets] [ieee, pattern, method, figure, proposed, based, traditional] [attribute, unseen, image, discriminative, generalized, loss, notice, lce, harmonic, cub, hai, accu, learn, lcal, eai, supervised] [class, learning, training, vector, accuracy, performance, neural, classification, number, set, processing, knowledge, function, bias, probability, test, small] [conference, computer, vision, dense, international, compute, capture, well, calibration, define]
@InProceedings{Huynh_2020_CVPR,
  author = {Huynh, Dat and Elhamifar, Ehsan},
  title = {Fine-Grained Generalized Zero-Shot Learning via Dense Attribute-Based Attention},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
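The attribute-based attention can be sketched as each attribute's semantic vector attending over region features to produce an attribute-specific feature, whose alignment with the attribute vector gives an attribute score. A simplified PyTorch sketch with hypothetical names (region_feats, attr_vecs, proj), not the authors' exact formulation:

import torch

def attribute_attention_scores(region_feats, attr_vecs, proj):
    # region_feats: (R, D) features of R image regions
    # attr_vecs: (A, E) semantic vectors of A attributes
    # proj: (E, D) projection mapping attribute vectors into feature space
    q = attr_vecs @ proj                               # (A, D)
    attn = torch.softmax(q @ region_feats.t(), dim=1)  # (A, R) attention over regions
    attr_feats = attn @ region_feats                   # (A, D) attribute-based features
    return (attr_feats * q).sum(dim=1)                 # (A,) attribute presence scores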
Online Depth Learning Against Forgetting in Monocular Videos
Zhenyu Zhang, Stephane Lathuiliere, Elisa Ricci, Nicu Sebe, Yan Yan, Jian Yang


Online depth learning is the problem of consistently adapting a depth estimation model to handle a continuously changing environment. This problem is challenging because the network easily overfits to the current environment and forgets its past experience. To address this problem, this paper presents a novel Learning to Prevent Forgetting (LPF) method for online mono-depth adaptation to new target domains in an unsupervised manner. Instead of updating the universal parameters, LPF learns adapter modules to efficiently adjust the feature representation and distribution without losing the pre-learned knowledge in the online setting. Specifically, to adapt to temporally-continuous depth patterns in videos, we introduce a novel meta-learning approach that learns the adapter modules by incorporating the online adaptation process into the learning objective. To further avoid overfitting, we propose a novel temporally-consistent regularization that harmonizes the gradient descent procedure at each online learning step. Extensive evaluations on real-world datasets demonstrate that the proposed method, with very limited parameters, significantly improves estimation quality.
[naive, video, time, evaluation, prediction, step, visual, current, dataset, work, shift, previous, frame] [framework, propose, table, jian, main, zhenyu] [model, robust] [method, proposed, based, fast, adjust] [adaptation, domain, target, unsupervised, source, perform, vkitti, changing, learn, loss, adapting, supervised] [online, learning, adapter, lpf, deep, basic, weight, ideal, training, forgetting, performance, better, adapt, data, knowledge, neural, standard, prevent, process, descent, observe, learned, update, stable, implemented, problem, network, paper, updating, gradient, lmeta, layer] [depth, monocular, estimation, novel, kitti, initial, approach, single, scene, rel, rmse, limited]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Zhenyu and Lathuiliere, Stephane and Ricci, Elisa and Sebe, Nicu and Yan, Yan and Yang, Jian},
  title = {Online Depth Learning Against Forgetting in Monocular Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Few-Shot Learning of Part-Specific Probability Space for 3D Shape Segmentation
Lingjing Wang, Xiang Li, Yi Fang


Recently, deep neural networks have been introduced as supervised discriminative models for learning 3D point cloud segmentation. Most previous supervised methods require a large amount of training data with human-annotated part labels to guide the training process and ensure the model's generalization ability on test data. In comparison, we propose a novel 3D shape segmentation method that requires few labeled examples for training. Given an input 3D shape, the training of our model starts with identifying a similar 3D shape with part annotations from a mini-pool of shape templates (e.g. 10 shapes). With the selected template shape, a novel Coherent Point Transformer is proposed to fully leverage the power of a deep neural network to smoothly morph the template shape towards the input shape. Then, based on the transformed template shapes with part labels, a newly proposed Part-specific Density Estimator is developed to learn a continuous part-specific probability distribution function on the entire 3D space with a batch consistency regularization term. With the learned part-specific probability distribution, our model is able to predict the part labels of a new input 3D shape in an end-to-end manner. We demonstrate that our proposed method achieves remarkable segmentation results on the ShapeNet dataset with few shots, compared to previous supervised learning approaches.
[transformer, dataset, predict, encode, three] [template, segmentation, table, semantic, weakly, feature, category, iou, object, annotated, global, xiang, effectiveness, propose] [input, model, trained] [figure, method, based, proposed, comparison, ieee, pattern] [supervised, learn, consistency, loss, unsupervised, target, retrieved, introduce, firstly] [learning, probability, network, label, density, function, deep, neural, training, performance, number, batch, distribution, selected, set, randomly, regularization, experiment, sampled, group, evaluate, data, compared] [point, shape, cloud, deformed, coherent, estimator, continuous, conference, computer, deformation, vision, geometric, chair, estimation, lingjing, novel, pointnet, demonstrate, directly, distance, descriptor, mlp, hao, acm, approach]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Lingjing and Li, Xiang and Fang, Yi},
  title = {Few-Shot Learning of Part-Specific Probability Space for 3D Shape Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Pattern-Structure Diffusion for Multi-Task Learning
Ling Zhou, Zhen Cui, Chunyan Xu, Zhenyu Zhang, Chaoqun Wang, Tong Zhang, Jian Yang


Inspired by the observation that pattern structures recur frequently both within a task and across tasks, we propose a pattern-structure diffusion (PSD) framework to mine and propagate task-specific and cross-task pattern structures in the task-level space for joint depth estimation, segmentation and surface normal prediction. To represent local pattern structures, we model them as small-scale graphlets, and propagate them in two different ways, i.e., intra-task and inter-task PSD. For the former, to overcome the locality limit of pattern structures, we use high-order recursive aggregation over neighbors to multiplicatively increase the spread scope, so that long-distance patterns are propagated in the intra-task space. In the inter-task PSD, we mutually transfer the counterpart structures corresponding to the same spatial position into the task itself, based on the matching degree of the paired pattern structures therein. Finally, the intra-task and inter-task pattern structures are jointly diffused among the task-level patterns, and encapsulated into an end-to-end PSD network to boost the performance of multi-task learning. Extensive experiments on two widely-used benchmarks demonstrate that our proposed PSD is more effective and achieves state-of-the-art or competitive results.
[prediction, three, dataset, graph] [psd, segmentation, semantic, feature, table, propagate, global, graphlet, miou, jian, boost, chunyan, zhenyu, propose, diffused] [diffusion, model, api, trained, original] [pattern, proposed, convolutional, scale, adjacent, method, based, figure, pixel, patch, recursive, spatial, fusion, utilized] [image, transfer] [learning, network, deep, task, number, layer, performance, iteration, neural, matrix, size, better, data, process, large, computation, memory] [depth, surface, normal, local, rgb, estimation, joint, rgbd, monocular, single, jointly, structure, eigen, scene, error, rmse, position, dense, well, ground]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Ling and Cui, Zhen and Xu, Chunyan and Zhang, Zhenyu and Wang, Chaoqun and Zhang, Tong and Yang, Jian},
  title = {Pattern-Structure Diffusion for Multi-Task Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Training Noise-Robust Deep Neural Networks via Meta-Learning
Zhen Wang, Guosheng Hu, Qinghua Hu


Label noise may significantly degrade the performance of Deep Neural Networks (DNNs). To train noise-robust DNNs, loss correction (LC) approaches have been introduced. LC approaches assume the noisy labels are corrupted from clean (ground-truth) labels by an unknown noise transition matrix T. The backbone DNNs and T can be trained separately, where T is approximated with prior knowledge; for example, T is constructed by stacking the maximum or mean predictions of the samples from each class. In this work, we propose a new loss correction approach, named Meta Loss Correction (MLC), to directly learn T from data via a meta-learning framework. MLC is model-agnostic and learns T from data rather than heuristically approximating it using prior knowledge. Extensive evaluations are conducted on computer vision (MNIST, CIFAR-10, CIFAR-100, Clothing1M) and natural language processing (Twitter) datasets. The experimental results show that MLC achieves very competitive performance against state-of-the-art approaches.
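A hedged sketch of the forward loss-correction pattern that MLC builds on: the model's class posteriors are pushed through the noise transition matrix T before the cross-entropy with the noisy labels, while T itself would be meta-updated on a small clean validation set (only indicated by a comment below). Names are hypothetical and the paper's exact meta-learning procedure is not reproduced.

import torch
import torch.nn.functional as F

def corrected_loss(logits, noisy_labels, T):
    # T: (C, C) row-stochastic matrix, T[i, j] ~ P(noisy label = j | clean label = i)
    p_clean = F.softmax(logits, dim=1)            # model's estimate of the clean-label posteriors
    p_noisy = p_clean @ T                         # push the posteriors through the noise process
    return F.nll_loss(torch.log(p_noisy + 1e-8), noisy_labels)

# One training step would update the backbone with corrected_loss on the noisy data,
# while T would be meta-updated to minimise the uncorrected loss on a small clean meta set.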
[dataset, three, outperforms, natural] [backbone, stage, achieves, propose, supervision, mask] [noise, mlc, clean, correction, model, robust, glc, corrupted, mnist, conduct, success, face, datasets, trained] [noisy, prior, ieee, assumption, optimized, based, convolutional, figure] [loss, learn, learns, train, extensive] [training, validation, network, set, deep, learning, small, transition, neural, optimization, matrix, label, data, optimize, class, performance, knowledge, test, maximum, meta, processing, accuracy, arxiv, preprint, uniform, baseline, twitter, function, crossentropy, rate, classification, achieve, note, layer, size, forward, consistently, machine] [approach, estimate, directly, computer, vision]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Zhen and Hu, Guosheng and Hu, Qinghua},
  title = {Training Noise-Robust Deep Neural Networks via Meta-Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fusion-Aware Point Convolution for Online Semantic 3D Scene Segmentation
Jiazhao Zhang, Chenyang Zhu, Lintao Zheng, Kai Xu


Online semantic 3D segmentation in company with real-time RGB-D reconstruction poses special challenges, such as how to perform 3D convolution directly over the progressively fused 3D geometric data and how to smartly fuse information from frame to frame. We propose a novel fusion-aware 3D point convolution which operates directly on the geometric surface being reconstructed and effectively exploits the inter-frame correlation for high-quality 3D feature learning. This is enabled by a dedicated dynamic data structure that organizes the online acquired point cloud with local-global trees. Globally, we compile the online reconstructed 3D points into an incrementally growing coordinate interval tree, enabling fast point insertion and neighborhood query. Locally, we maintain the neighborhood information for each point using an octree whose construction benefits from the fast query of the global tree. The local octrees facilitate efficient surface-aware point convolution. Both levels of trees update dynamically and help the 3D convolution effectively exploit the temporal coherence for information fusion across RGB-D frames.
[frame, node, three, dynamically, work, prediction, sequence] [segmentation, feature, semantic, global, labeling, table, adopt, tgx] [offline, improve] [convolution, method, fusion, figure, tree, dynamic, convolutional, based, ieee, fused, comparison, pattern, fast, consecutive] [corresponding, mapping] [online, data, accuracy, learning, deep, network, neural, performance, update, set, find, efficient, benefit, better, operation, path, maintain, label] [point, neighborhood, scene, local, interval, conference, coordinate, structure, distance, computer, geometric, vision, reconstructed, octrees, euclidean, reconstruction, cloud, octree, correspondence, geodesic, scannet, international, directly, construction, organization, volumetric, xmax]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Jiazhao and Zhu, Chenyang and Zheng, Lintao and Xu, Kai},
  title = {Fusion-Aware Point Convolution for Online Semantic 3D Scene Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Universal Source-Free Domain Adaptation
Jogendra Nath Kundu, Naveen Venkat, Rahul M V, R. Venkatesh Babu


There is a strong incentive to develop versatile learning techniques that can transfer the knowledge of class-separability from a labeled source domain to an unlabeled target domain in the presence of a domain-shift. Existing domain adaptation (DA) approaches are not equipped for practical DA scenarios as a result of their reliance on the knowledge of source-target label-set relationship (e.g. Closed-set, Open-set or Partial DA). Furthermore, almost all prior unsupervised DA works require coexistence of source and target samples even during deployment, making them unsuitable for real-time adaptation. Devoid of such impractical assumptions, we propose a novel two-stage learning process. 1) In the Procurement stage, we aim to equip the model for future source-free deployment, assuming no prior knowledge of the upcoming category-gap and domain-shift. To achieve this, we enhance the model's ability to reject out-of-source distribution samples by leveraging the available source data, in a novel generative classifier framework. 2) In the Deployment stage, the goal is to design a unified adaptation algorithm capable of operating across a wide range of category-gaps, with no access to the previously seen source samples. To this end, in contrast to the usage of complex adversarial training regimes, we define a simple yet effective source-free adaptation objective by utilizing a novel instance-level weighting mechanism, named the Source Similarity Metric (SSM). A thorough evaluation shows the practical usability of the proposed learning framework, with superior DA performance even over state-of-the-art source-dependent approaches.
[dataset, evaluation, relationship] [positive, stage, framework, feature, propose, denotes, table] [model, adversarial, universal, access, highly, datasets, private, sensitivity] [proposed, prior, result, method, figure] [source, target, domain, adaptation, procurement, unsupervised, generative, ssm, shared, tavg, uan, loss, image, tunk, transfer, latent, aim, common, generate, frozen, mingsheng, jianmin] [negative, learning, class, deployment, training, knowledge, data, label, labeled, distribution, accuracy, deep, achieve, metric, space, setting, sample, classifier, set, algorithm, imagenet, compactness, practical, wide, weighting, similarity, problem, number, entropy, task] [partial, approach, novel, define]
@InProceedings{Kundu_2020_CVPR,
  author = {Kundu, Jogendra Nath and Venkat, Naveen and V, Rahul M and Babu, R. Venkatesh},
  title = {Universal Source-Free Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Exploring Spatial-Temporal Multi-Frequency Analysis for High-Fidelity and Temporal-Consistency Video Prediction
Beibei Jin, Yu Hu, Qiankun Tang, Jingyu Niu, Zhiping Shi, Yinhe Han, Xiaowei Li


Video prediction is a pixel-wise dense prediction task to infer future frames based on past frames. Missing appearance details and motion blur are still two major problems for current models, leading to image distortion and temporal inconsistency. We point out the necessity of exploring multi-frequency analysis to deal with the two problems. Inspired by the frequency band decomposition characteristic of the Human Vision System (HVS), we propose a video prediction network based on multi-level wavelet analysis to uniformly deal with spatial and temporal information. Specifically, a multi-level spatial discrete wavelet transform decomposes each video frame into anisotropic sub-bands with multiple frequencies, helping to enrich structural information and preserve fine details. On the other hand, a multi-level temporal discrete wavelet transform, which operates on the time axis, decomposes the frame sequence into sub-band groups of different frequencies to accurately capture multi-frequency motions under a fixed frame rate. Extensive experiments on diverse datasets demonstrate that our model shows significant improvements in fidelity and temporal consistency over the state-of-the-art works. Source code and videos are available at https://github.com/Bei-Jin/STMFANet.
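A small sketch of the two decompositions described above, using PyWavelets: a spatial 2D DWT of one frame and a 1D temporal DWT along the time axis of a frame sequence. This only illustrates the sub-band decomposition, not the prediction network; the clip shape is hypothetical and the wavelet choice (Haar) is an assumption.

import numpy as np
import pywt

frames = np.random.rand(16, 64, 64)          # (time, height, width) grayscale clip, hypothetical shape

# Spatial DWT of one frame: a low-frequency approximation plus three anisotropic detail sub-bands.
ll, (lh, hl, hh) = pywt.dwt2(frames[0], "haar")

# Temporal DWT along the time axis: low- and high-frequency motion components at half the frame rate.
low_t, high_t = pywt.dwt(frames, "haar", axis=0)
print(ll.shape, low_t.shape)                 # (32, 32) and (8, 64, 64)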
[video, prediction, temporal, frame, dataset, future, time, sequence, visual, evaluation, three, multiple, recurrent, predict, previous] [table, visualization, module, predicted, feature, pedestrian, propose] [model, adversarial, input, trained, datasets, generalization] [wavelet, figure, analysis, spatial, transform, motion, frequency, bair, kth, high, based, dwt, method, quantitative, psnr, ssim, comparison, savp, proposed, decomposes, dynamic] [image, generation, loss, fidelity, consistency, latent, generate, generative, ability, corresponding, unsupervised, variational, encoder, generator, discriminator] [stochastic, network, arxiv, preprint, discrete, deep, learning, neural, dimension, better, best, stem] [ground, human, consistent, kitti, decomposition, system, axis, truth]
@InProceedings{Jin_2020_CVPR,
  author = {Jin, Beibei and Hu, Yu and Tang, Qiankun and Niu, Jingyu and Shi, Zhiping and Han, Yinhe and Li, Xiaowei},
  title = {Exploring Spatial-Temporal Multi-Frequency Analysis for High-Fidelity and Temporal-Consistency Video Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Varicolored Image De-Hazing
Akshay Dudhane, Kuldeep M. Biradar, Prashant W. Patil, Praful Hambarde, Subrahmanyam Murala


The quality of images captured in bad weather is often affected by chromatic casts and low visibility due to the presence of atmospheric particles. Restoration of the color balance is often ignored in most of the existing image de-hazing methods. In this paper, we propose a varicolored end-to-end image de-hazing network which restores the color balance in a given varicolored hazy image and recovers the haze-free image. The proposed network comprises 1) a haze color correction (HCC) module and 2) a visibility improvement (VI) module. The proposed HCC module provides the required attention to each color channel and generates a color-balanced hazy image, while the proposed VI module processes the color-balanced hazy image through a novel inception attention block to recover the haze-free image. We also propose a novel approach to generate a large-scale varicolored synthetic hazy image database. An ablation study has been carried out to demonstrate the effect of different factors on the performance of the proposed network for image de-hazing. Three benchmark synthetic datasets have been used for quantitative analysis of the proposed network. Visual results on a set of real-world hazy images captured in different weather conditions demonstrate the effectiveness of the proposed approach for varicolored image de-hazing.
[attention, recognition, evaluation, relevant, considering, observed, visual] [module, feature, improvement, table, map, represents, benchmark, refined] [visibility, correction, database, model, white, input] [proposed, hazy, color, varicolored, haze, ieee, figure, pattern, atmospheric, existing, hcc, dehazing, channel, analysis, restores, quantitative, spatial, captured, weather, method, illumination, ssim, recover, scattering, gwa, restoration, light, intense, psnr, block, restore, transmission, brighter, bad, convolution, intermediate] [image, generator, synthetic, loss, inception, generate, pseudo] [network, balance, balanced, performance, learning, respective, considered, set, sample, deep, discussed] [computer, conference, vision, single, approach, dense, scene, well, novel, recovered, indoor, estimate]
@InProceedings{Dudhane_2020_CVPR,
  author = {Dudhane, Akshay and Biradar, Kuldeep M. and Patil, Prashant W. and Hambarde, Praful and Murala, Subrahmanyam},
  title = {Varicolored Image De-Hazing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SpSequenceNet: Semantic Segmentation Network on 4D Point Clouds
Hanyu Shi, Guosheng Lin, Hao Wang, Tzu-Yi Hung, Zhenhua Wang


Point clouds are useful in many applications like autonomous driving and robotics as they provide natural 3D information of the surrounding environments. While there is extensive research on 3D point clouds, scene understanding on 4D point clouds, i.e., series of consecutive 3D point cloud frames, is an emerging and still under-investigated topic. With 4D point clouds (3D point cloud videos), robotic systems could enhance their robustness by leveraging the temporal information from previous frames. However, the existing semantic segmentation methods on 4D point clouds suffer from low precision due to the spatial and temporal information loss in their network structures. In this paper, we propose SpSequenceNet to address this problem. The network is designed based on 3D sparse convolution. We introduce two novel modules, a cross-frame global attention module and a cross-frame local interpolation module, to capture spatial and temporal information in 4D point clouds. We conduct extensive experiments on SemanticKITTI, and achieve the state-of-the-art result of 43.1% on mIoU, which is 1.5% higher than the previous best approach.
[attention, temporal, frame, current, previous, extract, prediction, video, work, dataset, moving, static] [semantic, global, segmentation, backbone, feature, cli, module, semantickitti, area, table, lidar, object, apply, miou, improvement, fuse, spsequencenet, tangentconv, achieves, reorganized] [model, input, status, improve] [figure, interpolation, spatial, convolution, motion, ieee, based, result, pattern, proposed, method, convolutional, designed] [generate, row, train] [network, performance, top, task, data, set, neural, training, design, computation, arxiv, preprint, label, learning, number, better] [point, cloud, local, conference, computer, vision, nearest, sparse, structure, scene, coordinate, international, novel, combine, voxel, hao]
@InProceedings{Shi_2020_CVPR,
  author = {Shi, Hanyu and Lin, Guosheng and Wang, Hao and Hung, Tzu-Yi and Wang, Zhenhua},
  title = {SpSequenceNet: Semantic Segmentation Network on 4D Point Clouds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Separating Particulate Matter From a Single Microscopic Image
Tushar Sandhan, Jin Young Choi


Particulate matter (PM) is the blend of various solid and liquid particles suspended in the atmosphere. These submicron particles are imperceptible in usual hand-held camera photography, but become a great obstacle in microscopic imaging. PM removal from a single microscopic image is highly ill-posed and one of the most challenging image denoising problems. In this work, we thoroughly analyze the physical properties of PM, the microscope, and their inevitable interaction, and propose an optimization scheme which removes the PM from a high-resolution microscopic image within a few seconds. Experiments on real-world microscopic images show that the proposed method significantly outperforms other competitive image denoising methods. It preserves the comprehensive microscopic foreground details while clearly separating the PM from a single monochromatic or color image.
[visual, time, multimodal] [background, bottom, nearby, table] [noise, input, clean, poisson, true, living] [microscopic, specimen, figure, denoising, imaging, method, ieee, microscopy, light, contrast, low, diffraction, cell, psnr, based, optical, assembly, clahe, illumination, udnet, particulate, matter, analysis, enhancement, bright, ssim, artifact, twsc, dehaze, obstacle, color, glass, captured, intensity, microscope, proposed, lens, dehazing, signal, yeast, resolution, medical, electron, bacterial, quantitative, high, highlighted] [image, row, produce, structural, underlying, domain, translation, synthetic] [data, process, gradient, better, function, optimization, size, learning, average, reduced, processing, objective, reduce, deep] [single, well, approach, sparse, cover]
@InProceedings{Sandhan_2020_CVPR,
  author = {Sandhan, Tushar and Choi, Jin Young},
  title = {Separating Particulate Matter From a Single Microscopic Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adaptive Dilated Network With Self-Correction Supervision for Counting
Shuai Bai, Zhiqun He, Yu Qiao, Hanzhe Hu, Wei Wu, Junjie Yan


The counting problem aims to estimate the number of objects in images. Due to large scale variation and labeling deviations, it remains a challenging task. The static density map supervised learning framework is widely used in existing methods; it uses a Gaussian kernel to generate a density map as the learning target and utilizes the Euclidean distance to optimize the model. However, the framework is intolerant of labeling deviations and cannot reflect the scale variation. In this paper, we propose an adaptive dilated convolution and a novel supervised learning framework named self-correction (SC) supervision. At the supervision level, the SC supervision utilizes the outputs of the model to iteratively correct the annotations and employs the SC loss to simultaneously optimize the model at both the whole and the individual level. At the feature level, the proposed adaptive dilated convolution predicts a continuous value as the specific dilation rate for each location, which adapts to scale variation better than a discrete and static dilation rate. Extensive experiments illustrate that our approach has achieved a consistent improvement on four challenging benchmarks. Especially, our approach achieves better performance than the state-of-the-art methods on all benchmark datasets.
[correct, dataset, step, static, ucf] [map, supervision, annotation, feature, labeling, response, table, adnet, framework, utilizes, denotes, propose, location] [model, variation, effectively, deviation] [crowd, counting, scale, ieee, dilated, dilation, convolution, adaptive, pattern, gaussian, receptive, mae, responsibility, convolutional, proposed, corrected, dotted, residual, relu, deformable, csrnet, output, conventional, dgt, dest, congested, figure, based] [loss, target, introduce, image, supervised, specific, mixing] [density, network, learning, number, size, large, rate, better, performance, baseline, batch, neural, iteration, variance, mixture, deep, problem, optimize, data] [computer, conference, vision, estimation, estimated, international, initial, continuous, consistent, position, october, distance, perspective]
@InProceedings{Bai_2020_CVPR,
  author = {Bai, Shuai and He, Zhiqun and Qiao, Yu and Hu, Hanzhe and Wu, Wei and Yan, Junjie},
  title = {Adaptive Dilated Network With Self-Correction Supervision for Counting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PointPainting: Sequential Fusion for 3D Object Detection
Sourabh Vora, Alex H. Lang, Bassam Helou, Oscar Beijbom


Camera and lidar are important sensor modalities for robotics in general and self-driving cars in particular. The sensors provide complementary information, offering an opportunity for tight sensor fusion. Surprisingly, lidar-only methods outperform fusion methods on the main benchmark datasets, suggesting a gap in the literature. In this work, we propose PointPainting: a sequential fusion method to fill this gap. PointPainting works by projecting lidar points into the output of an image-only semantic segmentation network and appending the class scores to each point. The appended (painted) point cloud can then be fed to any lidar-only method. Experiments show large improvements on three different state-of-the-art methods, Point-RCNN, VoxelNet and PointPillars, on the KITTI and nuScenes datasets. The painted version of PointRCNN represents a new state of the art on the KITTI leaderboard for the bird's-eye view detection task. In ablation, we study how the effect of painting depends on the quality and format of the semantic segmentation output, and demonstrate how latency can be minimized through pipelining.
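The core "painting" step is simple enough to sketch: lidar points are projected into the image with a camera matrix, the per-pixel class scores of the segmentation output are gathered at those locations, and the scores are appended to the point features. A hedged numpy sketch with hypothetical shapes; no dataset-specific calibration conventions are handled.

import numpy as np

def paint_points(points, seg_scores, P):
    # points: (N, 3) lidar xyz; seg_scores: (H, W, C) softmax output of an image segmentation network;
    # P: (3, 4) projection matrix taking homogeneous lidar-frame points to pixel coordinates.
    homo = np.hstack([points, np.ones((points.shape[0], 1))])          # (N, 4)
    uvw = homo @ P.T                                                   # (N, 3)
    u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
    H, W, _ = seg_scores.shape
    valid = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)  # in front of the camera and inside the image
    painted = np.concatenate([points[valid], seg_scores[v[valid], u[valid]]], axis=1)
    return painted       # (M, 3 + C): xyz plus appended class scores, ready for any lidar-only detector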
[state, dataset, semantics, previous, three, time, sequential, work, frame] [lidar, detection, painted, segmentation, pointpainting, object, pointpillars, nuscenes, semantic, pointrcnn, cyclist, art, map, voxelnet, table, pedestrian, feature, autonomous, car, val, hard, detector, leaderboard, decorated, easy, bounding, improvement, ablation, main, improves] [quality, original, public, input, trained] [fusion, method, based, figure, version] [image, painting, perform, encoder] [network, performance, test, class, latency, set, requires, top, precision, architecture, average, improved, better, design, general] [point, kitti, cloud, view, camera, measured, matching, despite, vision, delta, supplementary, transformation, construction, depth, demonstrate]
@InProceedings{Vora_2020_CVPR,
  author = {Vora, Sourabh and Lang, Alex H. and Helou, Bassam and Beijbom, Oscar},
  title = {PointPainting: Sequential Fusion for 3D Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Rethinking Zero-Shot Video Classification: End-to-End Training for Realistic Applications
Biagio Brattoli, Joseph Tighe, Fedor Zhdanov, Pietro Perona, Krzysztof Chalupka


Trained on large datasets, deep learning (DL) can accurately classify videos into hundreds of diverse classes. However, video data is expensive to annotate. Zero-shot learning (ZSL) proposes one solution to this problem. ZSL trains a model once, and generalizes to new tasks whose classes are not present in the training dataset. We propose the first end-to-end algorithm for ZSL in video classification. Our training procedure builds on insights from recent video classification literature and uses a trainable 3D CNN to learn the visual features. This is in contrast to previous video ZSL methods, which use pretrained feature extractors. We also extend the current benchmarking paradigm: Previous techniques aim to make the test task unknown at training time but fall short of this goal. We encourage domain shift across training and test data and disallow tailoring a ZSL model to a specific test dataset. We outperform the state-of-the-art by a wide margin. Our code, evaluation procedure and model weights are available online github.com/bbrattoli/ZeroShotVideoClassification.
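A hedged sketch of the zero-shot inference step implied above: a trainable 3D CNN maps a clip into a semantic embedding space (e.g. word embeddings of class names), and a test clip is assigned to the nearest unseen-class embedding. The backbone and embeddings are placeholders; the paper's exact architecture and losses are not reproduced.

import torch
import torch.nn.functional as F

def zero_shot_classify(clip_embedding, class_embeddings):
    # clip_embedding: (D,) output of the video backbone for one clip
    # class_embeddings: (K, D) word embeddings of the K unseen class names
    sims = F.cosine_similarity(clip_embedding.unsqueeze(0), class_embeddings, dim=1)
    return int(sims.argmax())        # index of the predicted unseen class

# Training would regress clip embeddings towards the word embeddings of their (seen) class names,
# so that nearest-neighbour search generalises to classes never observed during training.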
[video, action, kinetics, evaluation, previous, visual, dataset, recognition, embedding, ucf, multiple, work, embeddings, hmdb, url, biagio, amazon] [semantic, overlap, propose, easy, table, sun] [model, protocol, trained, datasets, choose] [ieee, pattern, method, figure, removed] [zsl, pretrained, image, domain, train, diverse, realistic] [training, test, classification, class, large, learning, performance, data, procedure, pretraining, deep, neural, set, inference, arxiv, preprint, good, number, baseline, outperform, simple, network, standard, processing, algorithm, task, setting, averaged, random, accuracy] [conference, computer, vision, distance, international, human, european, approach, computed, full, scene]
@InProceedings{Brattoli_2020_CVPR,
  author = {Brattoli, Biagio and Tighe, Joseph and Zhdanov, Fedor and Perona, Pietro and Chalupka, Krzysztof},
  title = {Rethinking Zero-Shot Video Classification: End-to-End Training for Realistic Applications},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Select Base Classes for Few-Shot Classification
Linjun Zhou, Peng Cui, Xu Jia, Shiqiang Yang, Qi Tian


Few-shot learning has attracted intensive research attention in recent years. Many methods have been proposed to generalize a model learned from provided base classes to novel classes, but no previous work studies how to select base classes, or even whether different base classes will result in different generalization performance of the learned model. In this paper, we utilize a simple yet effective measure, the Similarity Ratio, as an indicator for the generalization performance of a few-shot model. We then formulate the base class selection problem as a submodular optimization problem over Similarity Ratio. We further provide theoretical analysis on the optimization lower bound of different optimization methods, which could be used to identify the most appropriate algorithm for different experimental settings. The extensive experiments on ImageNet, Caltech256 and CUB-200-2011 demonstrate that our proposed method is effective in selecting a better base dataset.
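As a generic illustration of the optimization template the paper casts base-class selection into, here is a minimal greedy routine for maximizing a monotone submodular objective under a cardinality budget; the Similarity-Ratio objective itself is abstracted as a callable, and all names are hypothetical.

def greedy_select(candidates, objective, budget):
    # objective(S) -> float is assumed monotone submodular (e.g. a Similarity-Ratio surrogate);
    # the classic greedy algorithm then attains a (1 - 1/e) approximation guarantee.
    selected = []
    for _ in range(budget):
        remaining = [c for c in candidates if c not in selected]
        gains = [(objective(selected + [c]) - objective(selected), c) for c in remaining]
        best_gain, best_c = max(gains, key=lambda t: t[0])
        selected.append(best_c)
    return selected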
[dataset, mechanism, time, turn, previous, work, three] [table, regression, represents, positive] [model, testing, trained, case, effective, change] [proposed, method, ieee, result, based, figure, double, pattern, exact] [representation, image, transfer, domain, target, curriculum] [base, algorithm, selection, learning, class, similarity, optimization, random, performance, problem, greedy, data, function, set, submodular, selected, support, number, general, training, average, select, process, setting, candidate, space, classification, theorem, simple, imagenet, cold, start, experiment, cardinality, bound, better, incremental, complexity, cosine, maximizing, monotone, compared, larger, indicator, bayesian, network] [novel, conference, continuous, computer, term, vision, defined]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Linjun and Cui, Peng and Jia, Xu and Yang, Shiqiang and Tian, Qi},
  title = {Learning to Select Base Classes for Few-Shot Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CONSAC: Robust Multi-Model Fitting by Conditional Sample Consensus
Florian Kluger, Eric Brachmann, Hanno Ackermann, Carsten Rother, Michael Ying Yang, Bodo Rosenhahn


We present a robust estimator for fitting multiple parametric models of the same form to noisy measurements. Applications include finding multiple vanishing points in man-made scenes, fitting planes to architectural imagery, or estimating multiple rigid motions within the same sequence. In contrast to previous works, which resorted to hand-crafted search strategies for multiple model detection, we learn the search strategy from data. A neural network conditioned on previously detected models guides a RANSAC estimator to different subsets of all measurements, thereby finding model instances one after another. We train our method both in a supervised and a self-supervised manner. For supervised training of the search strategy, we contribute a new dataset for vanishing point estimation. Leveraging this dataset, the proposed algorithm is superior to other robust estimators as well as to dedicated vanishing point estimation algorithms. For self-supervised learning of the search, we evaluate the proposed algorithm on multi-homography estimation and demonstrate an accuracy that is superior to state-of-the-art methods.
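A hedged sketch of the conditional, sequential sampling idea: after each model instance is found, a (here placeholder) network re-weights the measurements, and the next RANSAC run draws its minimal sets according to those weights. Line fitting stands in for the actual vanishing-point/homography problems, and predict_weights, the threshold, and the iteration counts are hypothetical.

import numpy as np

def fit_line(pts):
    # minimal solver: the line through two points, returned as (a, b, c) with ax + by + c = 0
    (x1, y1), (x2, y2) = pts
    a, b = y2 - y1, x1 - x2
    c = -(a * x1 + b * y1)
    n = np.hypot(a, b) + 1e-12
    return np.array([a, b, c]) / n

def sequential_ransac(points, predict_weights, num_models=3, iters=100, thresh=0.02):
    state = np.zeros(len(points))                    # per-point record of previously explained structure
    models = []
    for _ in range(num_models):
        w = predict_weights(points, state)           # network-conditioned sampling weights (placeholder)
        w = w / w.sum()
        best, best_inl = None, None
        for _ in range(iters):
            idx = np.random.choice(len(points), size=2, replace=False, p=w)
            line = fit_line(points[idx])
            resid = np.abs(points @ line[:2] + line[2])
            inl = resid < thresh
            if best_inl is None or inl.sum() > best_inl.sum():
                best, best_inl = line, inl
        models.append(best)
        state = np.maximum(state, best_inl.astype(float))   # condition the next round on what is explained
    return models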
[multiple, dataset, sequential, order, state, evaluation, provide, three, work, horizon] [instance, vps, feature, detection] [model, robust, datasets] [homography, based, method, figure, result] [image, conditional, supervised, loss, generate, train, generation] [sampling, network, neural, training, data, learning, sample, set, selected, task, finding, test, average, evaluate, accuracy, selection, max, standard, achieve] [fitting, vanishing, point, consac, estimation, ransac, inlier, hypothesis, single, ground, truth, minimal, estimator, well, adelaidermf, approach, yud, andrea, parametric, fundamental, estimate, mct, geometric, david, carsten, homographies, brachmann]
@InProceedings{Kluger_2020_CVPR,
  author = {Kluger, Florian and Brachmann, Eric and Ackermann, Hanno and Rother, Carsten and Yang, Michael Ying and Rosenhahn, Bodo},
  title = {CONSAC: Robust Multi-Model Fitting by Conditional Sample Consensus},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fast Symmetric Diffeomorphic Image Registration with Convolutional Neural Networks
Tony C.W. Mok, Albert C.S. Chung


Diffeomorphic deformable image registration is crucial in many medical image studies, as it offers unique, special features including topology preservation and invertibility of the transformation. Recent deep learning-based deformable image registration methods achieve fast image registration by leveraging a convolutional neural network (CNN) to learn the spatial transformation from the synthetic ground truth or the similarity metric. However, these approaches often ignore the topology preservation of the transformation and the smoothness of the transformation which is enforced by a global smoothing energy function alone. Moreover, deep learning-based approaches often estimate the displacement field directly, which cannot guarantee the existence of the inverse transformation. In this paper, we present a novel, efficient unsupervised symmetric image registration method which maximizes the similarity between images within the space of diffeomorphic maps and estimates both forward and inverse transformations simultaneously. We evaluate our method on 3D image registration with a large scale brain image dataset. Our method achieves state-of-the-art registration accuracy and running time while maintaining desirable diffeomorphic properties.
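For context, a hedged 2D sketch of scaling-and-squaring, the standard way such frameworks integrate a stationary velocity field into a diffeomorphic displacement (negating the velocity gives the inverse transformation). This is a generic illustration, not the paper's exact formulation; displacement fields are assumed to be expressed in grid_sample's normalized coordinates, with channels ordered (x, y).

import torch
import torch.nn.functional as F

def compose(disp_outer, disp_inner):
    # displacement of applying disp_inner first, then disp_outer; both fields: (1, 2, H, W)
    _, _, h, w = disp_inner.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=disp_inner.device),
                            torch.linspace(-1, 1, w, device=disp_inner.device), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)                  # identity grid, (1, H, W, 2)
    warped = F.grid_sample(disp_outer, grid + disp_inner.permute(0, 2, 3, 1),
                           padding_mode="border", align_corners=True)
    return disp_inner + warped

def exp_velocity(velocity, steps=6):
    # scaling and squaring: exp(v) approximated by composing v / 2**steps with itself `steps` times
    disp = velocity / (2 ** steps)
    for _ in range(steps):
        disp = compose(disp, disp)
    return disp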
[time, moving, pair, transformer] [including, denotes, table, segmentation, global] [model, input, identity, subject] [diffeomorphic, deformable, method, field, proposed, jacobian, medical, determinant, brain, spatial, anatomical, inverse, fast, warped, dsc, affine, convolution, convolutional, dice, lsim, warp, mri, syn, running] [image, loss, unsupervised, consistency, mapping, atlas, desirable, invertibility, utilize] [similarity, function, problem, average, learning, regularization, computing, set, number, deep, large, accuracy, neural, evaluate, fixed, layer, space, normalization, john, network] [registration, deformation, symmetric, local, transformation, velocity, displacement, orientation, conference, differentiable, international, voxels, smoothness, shape, estimate, volume, ground, position, computed, topology]
@InProceedings{Mok_2020_CVPR,
  author = {Mok, Tony C.W. and Chung, Albert C.S.},
  title = {Fast Symmetric Diffeomorphic Image Registration with Convolutional Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Distilled Semantics for Comprehensive Scene Understanding from Videos
Fabio Tosi, Filippo Aleotti, Pierluigi Zama Ramirez, Matteo Poggi, Samuele Salti, Luigi Di Stefano, Stefano Mattoccia


Holistic understanding of the surroundings is paramount to autonomous systems. Recent works have shown that deep neural networks can learn geometry (depth) and motion (optical flow) from a monocular video without any explicit supervision from ground truth annotations, which are particularly hard to source for these two tasks. In this paper, we take an additional step toward holistic scene understanding with monocular cameras by learning depth and motion alongside semantics, with supervision for the latter provided by a pre-trained network distilling proxy ground truth images. We address the three tasks jointly by a) a novel training protocol based on knowledge distillation and self-supervision and b) a compact network architecture which enables efficient scene understanding on both power-hungry GPUs and low-power embedded platforms. We thoroughly assess the performance of our framework and show that it yields state-of-the-art results for monocular depth estimation, optical flow and motion segmentation.
[recognition, understanding, semantics, video, moving, prediction, static, titan, previous, order, dataset] [semantic, segmentation, table, mask, supervision, framework, map, wei, key, pyramid] [testing, protocol, trained, model] [optical, flow, ieee, motion, pattern, convolutional, figure, proposed, method, dynamic, spatial, resolution] [image, unsupervised, train, loss, learn, source] [learning, network, training, deep, proxy, neural, performance, better, standard, architecture, procedure] [depth, conference, computer, vision, monocular, estimation, camera, scene, rigid, stereo, joint, international, european, kitti, stefano, geometry, matteo, pose, fabio, single, michael, novel, defined, dsnet, additional]
@InProceedings{Tosi_2020_CVPR,
  author = {Tosi, Fabio and Aleotti, Filippo and Ramirez, Pierluigi Zama and Poggi, Matteo and Salti, Samuele and Stefano, Luigi Di and Mattoccia, Stefano},
  title = {Distilled Semantics for Comprehensive Scene Understanding from Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Modeling Biological Immunity to Adversarial Examples
Edward Kim, Jocelyn Rego, Yijing Watkins, Garrett T. Kenyon


While deep learning continues to permeate through all fields of signal processing and machine learning, a critical exploit in these frameworks exists and remains unsolved. These exploits, or adversarial examples, are a type of signal attack that can change the output class of a classifier by perturbing the stimulus signal by an imperceptible amount. The attack takes advantage of statistical irregularities within the training data, where the added perturbations can move the image across deep learning decision boundaries. What is even more alarming is the transferability of these attacks to different deep learning models and architectures. This means a successful attack on one model has adversarial effects on other, unrelated models. In a general sense, adversarial attack through perturbations is not a machine learning vulnerability. Human and biological vision can also be fooled by various methods, i.e. mixing high and low frequency images together, by altering semantically related signals, or by sufficiently distorting the input signal. However, the amount and magnitude of such a distortion required to alter biological perception is at a much larger scale. In this work, we explored this gap through the lens of biology and neuroscience in order to understand the robustness exhibited in human perception. Our experiments show that by leveraging sparsity and modeling the biological mechanisms at a cellular level, we are able to mitigate the effect of adversarial alterations to the signal that have no perceptible meaning. Furthermore, we present and illustrate the effects of top-down functional processes that contribute to the inherent immunity in human perception in the context of exploiting these properties to make a more robust machine vision system.
[visual, perception, work, action, natural, inspired, activity] [horizontal, level] [model, adversarial, input, retina, feedback, ganglion, lateral, primary, cortex, lgn, inhibitory, attack, attacked, biologically, defense, jpeg, garrett, amacrine, parvocellular, excitatory, internal, original, photoreceptors, bipolar, graded] [figure, coding, signal, biological, receptive, spike, high, light, cell, output, brain, coded, compression, convolutional, fire] [image, retinal, representation, inhibition, code] [neural, learning, deep, layer, processing, network, dictionary, sum, arxiv, preprint, machine, sparsity, small, training, classification, process, rate, early, higher] [sparse, vision, human, reconstruction, david, computer, relay, conference]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Edward and Rego, Jocelyn and Watkins, Yijing and Kenyon, Garrett T.},
  title = {Modeling Biological Immunity to Adversarial Examples},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DOA-GAN: Dual-Order Attentive Generative Adversarial Network for Image Copy-Move Forgery Detection and Localization
Ashraful Islam, Chengjiang Long, Arslan Basharat, Anthony Hoogs


Images can be manipulated for nefarious purposes to hide content or to duplicate certain objects through copy-move operations. Discovering a well-crafted copy-move forgery in images can be very challenging for both humans and machines; for example, an object on a uniform background can be replaced by an image patch of the same background. In this paper, we propose a Generative Adversarial Network with a dual-order attention model to detect and localize copy-move forgeries. In the generator, the first-order attention is designed to capture copy-move location information, and the second-order attention exploits more discriminative features for the patch co-occurrence. Both attention maps are extracted from the affinity matrix and are used to fuse location-aware and co-occurrence features for the final detection and localization branches of the network. The discriminator network is designed to further ensure more accurate localization results. To the best of our knowledge, we are the first to propose such a network architecture with the 1st-order attention mechanism from the affinity matrix. We have performed extensive experimental validation and our state-of-the-art results strongly demonstrate the efficacy of the proposed approach.
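A small sketch of the affinity-matrix computation from which the two attention maps are derived: feature vectors at every spatial location are L2-normalized, and their pairwise dot products form an (HW x HW) affinity matrix whose rows can be aggregated into a location-level, first-order attention map. The aggregation shown (max over off-diagonal entries) is a simplification, not the paper's exact operator.

import torch
import torch.nn.functional as F

def affinity_attention(feat):
    # feat: (B, C, H, W) backbone feature map
    b, c, h, w = feat.shape
    x = F.normalize(feat.flatten(2), dim=1)           # (B, C, HW), unit-norm descriptor per location
    aff = torch.bmm(x.transpose(1, 2), x)             # (B, HW, HW) cosine affinities between all location pairs
    aff = aff - torch.eye(h * w, device=feat.device)  # suppress trivial self-matches
    attn = aff.max(dim=2).values.view(b, 1, h, w)     # high where a location has a strong match elsewhere (copy-move cue)
    return attn, aff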
[attention, dataset, extract, video, three, provide, visual] [detection, localization, feature, map, affinity, score, region, attentive, module, mask, final, branch, table, object, predicted, ldet, aware, location, framework, atrous, visualization] [forgery, busternet, cmfd, adversarial, forged, densefield, manipulation, input, splicing, comofod, faspp, fcat, fattn, pristine, casia, copymove, forensic, fcooc] [figure, output, based, convolution, proposed, designed, method, spatial, patch, kernel, block] [image, discriminator, source, generator, chengjiang, generative, loss, target, ladv] [network, matrix, performance, deep, learning, size, number, best, neural, precision] [accurate, matching]
@InProceedings{Islam_2020_CVPR,
  author = {Islam, Ashraful and Long, Chengjiang and Basharat, Arslan and Hoogs, Anthony},
  title = {DOA-GAN: Dual-Order Attentive Generative Adversarial Network for Image Copy-Move Forgery Detection and Localization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Correspondence-Free Material Reconstruction using Sparse Surface Constraints
Sebastian Weiss, Robert Maier, Daniel Cremers, Rudiger Westermann, Nils Thuerey


We present a method to infer physical material parameters, and even external boundaries, from the scanned motion of a homogeneous deformable object via the solution of an inverse problem. Parameters are estimated from real-world data sources such as sparse observations from a Kinect sensor without correspondences. We introduce a novel Lagrangian-Eulerian optimization formulation, including a cost function that penalizes differences to observations during an optimization run. This formulation matches correspondence-free, sparse observations from a single-view depth image with a finite element simulation of deformable bodies. In a number of tests using synthetic datasets and real-world measurements, we analyse the robustness of our approach and the convergence behavior of the numerical optimization scheme.
[observed, step, time, state, current] [object, boundary, mass, table] [physical, model] [method, inverse, deformable, modulus, figure, proposed, reference, dynamic, ieee, motion, indicate] [control] [optimization, function, forward, soft, gradient, parameter, problem, data, convergence, best] [material, reconstruction, adjoint, simulation, point, cost, initial, stiffness, collision, ground, computer, sparse, surface, depth, single, acm, elasticity, solver, damping, reconstructed, formulation, well, compute, displacement, capture, pose, extension, sdf, shape, truth, computed, force, geometry, volume, conference, estimated, reconstruct, match, estimation, camera, rest, grid, displaced, gravity, michael, approach, dense, falling]
@InProceedings{Weiss_2020_CVPR,
  author = {Weiss, Sebastian and Maier, Robert and Cremers, Daniel and Westermann, Rudiger and Thuerey, Nils},
  title = {Correspondence-Free Material Reconstruction using Sparse Surface Constraints},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Augmenting Colonoscopy Using Extended and Directional CycleGAN for Lossy Image Translation
Shawn Mathew, Saad Nadeem, Sruti Kumari, Arie Kaufman


Colorectal cancer screening modalities, such as optical colonoscopy (OC) and virtual colonoscopy (VC), are critical for diagnosing and ultimately removing polyps (precursors for colon cancer). The non-invasive VC is normally used to inspect a 3D reconstructed colon (from computed tomography scans) for polyps and, if found, the OC procedure is performed to physically traverse the colon via an endoscope and remove these polyps. In this paper, we present a deep learning framework, Extended and Directional CycleGAN, for lossy unpaired image-to-image translation between OC and VC to augment OC video sequences with scale-consistent depth information from VC, and VC with patient-specific textures, color and specular highlights from OC (e.g. for realistic polyp synthesis). Both OC and VC contain structural information, but it is obscured in OC by additional patient-specific texture and specular highlights, hence making the translation from OC to VC lossy. The existing CycleGAN approaches do not handle lossy transformations. To address this shortcoming, we introduce an extended cycle consistency loss, which compares the geometric structures from OC in the VC domain. This loss removes the need for the CycleGAN to embed OC information in the VC domain. To handle a stronger removal of the textures and lighting, a Directional Discriminator is introduced to differentiate the direction of translation (by creating paired information for the discriminator), as opposed to the standard CycleGAN which is direction-agnostic. Combining the extended cycle consistency loss and the Directional Discriminator, we show state-of-the-art results on scale-consistent depth inference for phantom, textured VC and for real polyp and normal colon video sequences. We also present results for realistic pedunculated and flat polyp synthesis from bumps introduced in 3D VC models.
[video, work, passed, link] [center] [input, adversarial, stronger, model, create] [figure, lossy, output, ieee, method, introduced, color, medical, remove, pattern, removing] [loss, image, cyclegan, colon, discriminator, cycle, consistency, domain, extended, translation, xdcyclegan, polyp, synthetic, texture, real, conditional, cancer, realistic, paired, gans, goc, colonoscopy, endoscope, generative, generator, produce, gan, phantom, gvc, csyn, corresponding, xcyclegan, colorectal, unpaired, address] [network, learning, data, deep, task, requires, standard, inference, arxiv, preprint] [depth, specular, directional, conference, ground, computer, international, virtual, reconstructed, handle, truth, approach, vision, additional, direction, textured, geometry]
@InProceedings{Mathew_2020_CVPR,
  author = {Mathew, Shawn and Nadeem, Saad and Kumari, Sruti and Kaufman, Arie},
  title = {Augmenting Colonoscopy Using Extended and Directional CycleGAN for Lossy Image Translation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Attention Scaling for Crowd Counting
Xiaoheng Jiang, Li Zhang, Mingliang Xu, Tianzhu Zhang, Pei Lv, Bing Zhou, Xin Yang, Yanwei Pang


Convolutional Neural Network (CNN) based methods generally take crowd counting as a regression task by outputting crowd densities. They learn the mapping between image contents and crowd density distributions. Though having achieved promising results, these data-driven counting networks are prone to overestimate or underestimate people counts of regions with different density patterns, which degrades the whole count accuracy. To overcome this problem, we propose an approach to alleviate the counting performance differences in different regions. Specifically, our approach consists of two networks named Density Attention Network (DANet) and Attention Scaling Network (ASNet). DANet provides ASNet with attention masks related to regions of different density levels. ASNet first generates density maps and scaling factors and then multiplies them by attention masks to output separate attention-based density maps. These density maps are summed to give the final density map. The attention scaling factors help attenuate the estimation errors in different regions. Furthermore, we present a novel Adaptive Pyramid Loss (APLoss) to hierarchically calculate the estimation losses of sub-regions, which alleviates the training bias. Extensive experiments on four challenging datasets (ShanghaiTech Part A, UCF_CC_50, UCF-QNRF, and WorldExpo'10) demonstrate the superiority of the proposed approach.
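A hedged sketch of the final combination step described above: ASNet's density maps and scaling factors are multiplied with DANet's attention masks for each density level and then summed into a single map whose integral is the count. Tensor names and shapes are hypothetical.

import torch

def combine_density(density_maps, scale_factors, attention_masks):
    # density_maps:    (B, L, H, W)  one map per density level from ASNet
    # scale_factors:   (B, L, 1, 1)  per-level scaling factors predicted by ASNet
    # attention_masks: (B, L, H, W)  per-level masks from DANet (roughly a partition of the image)
    per_level = density_maps * scale_factors * attention_masks
    final_map = per_level.sum(dim=1, keepdim=True)     # (B, 1, H, W) combined density map
    counts = final_map.sum(dim=(2, 3))                 # predicted crowd count per image
    return final_map, counts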
[attention, dataset, people, ucf, video] [pyramid, map, achieves, level, final, cnn, region, object, utilizes, propose, named, threshold, table, challenging] [datasets] [crowd, counting, ieee, asnet, convolutional, proposed, based, method, mae, figure, pattern, shanghaitech, danet, aploss, mse, adaptive, presented, analysis, scale, intermediate, output, sindagi, pixel, science] [loss, image, corresponding, generate, generates, ability, learn] [density, network, scaling, training, count, learning, neural, test, set, average, divide, performance, deep, number, baseline, calculate, lower, size] [conference, computer, estimation, local, vision, international, approach, novel, cluttered]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Xiaoheng and Zhang, Li and Xu, Mingliang and Zhang, Tianzhu and Lv, Pei and Zhou, Bing and Yang, Xin and Pang, Yanwei},
  title = {Attention Scaling for Crowd Counting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Shape Reconstruction by Learning Differentiable Surface Representations
Jan Bednarik, Shaifali Parashar, Erhan Gundogdu, Mathieu Salzmann, Pascal Fua


Generative models that produce point clouds have emerged as a powerful tool to represent 3D surfaces, and the best current ones rely on learning an ensemble of parametric representations. Unfortunately, they offer no control over the deformations of the surface patches that form the ensemble and thus fail to prevent them from either overlapping or collapsing into single points or lines. As a consequence, computing shape properties such as surface normals and curvatures becomes difficult and unreliable. In this paper, we show that we can exploit the inherent differentiability of deep networks to leverage differential surface properties during training so as to prevent patch collapse and strongly reduce patch overlap. Furthermore, this lets us reliably compute quantities such as surface normals and curvatures. We will demonstrate on several tasks that this yields more accurate surface reconstructions than the state-of-the-art methods in terms of normals estimation and amount of collapsed and overlapped patches.
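To illustrate the inherent differentiability being exploited: given a learned patch mapping f(u, v) -> R^3, the tangent vectors are exact Jacobian columns obtained with autograd, and their cross product gives the surface normal. A minimal PyTorch sketch with a placeholder MLP; the actual losses on normals, curvatures, and patch overlap are not reproduced.

import torch

mlp = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 3))

def surface_normal(uv):
    # uv: (N, 2) patch parameters; returns unit normals of the predicted surface at those parameters
    uv = uv.clone().requires_grad_(True)
    xyz = mlp(uv)                                                  # (N, 3) points on the patch
    jac = [torch.autograd.grad(xyz[:, k].sum(), uv, create_graph=True)[0] for k in range(3)]
    jac = torch.stack(jac, dim=1)                                  # (N, 3, 2): d(x, y, z) / d(u, v)
    normal = torch.cross(jac[:, :, 0], jac[:, :, 1], dim=1)        # tangent_u x tangent_v
    return torch.nn.functional.normalize(normal, dim=1)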
[dataset, represent] [overlap, predicted, area, table, object] [differential, trained, model, depicted] [patch, pattern, figure, mae, deformable, relu, method, ieee] [collapse, loss, mapping, latent, target, generated, minimizing, image, representation, generative, control] [training, number, learning, set, deep, accuracy, function, prevent, data, network, note, sampled, amount, better, metric, basic, space, approximate, report, randomly] [surface, point, shape, computer, conference, approach, vision, computed, single, ldef, reconstruction, cloud, fwk, deformation, chd, compute, collapsed, normal, curvature, shapenet, mcol, svr, molap, partial, depth, distance, tds, pcae, rely, atlasnet, completion, lol, cloth, international, analytically]
@InProceedings{Bednarik_2020_CVPR,
  author = {Bednarik, Jan and Parashar, Shaifali and Gundogdu, Erhan and Salzmann, Mathieu and Fua, Pascal},
  title = {Shape Reconstruction by Learning Differentiable Surface Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Spatiotemporal Volumetric Interpolation Network for 4D Dynamic Medical Image
Yuyu Guo, Lei Bi, Euijoon Ahn, Dagan Feng, Qian Wang, Jinman Kim


Dynamic medical images are often limited in their application due to large radiation doses and long image scanning and reconstruction times. Existing methods attempt to reduce the volume samples in the dynamic sequence by interpolating the volumes between the acquired samples. However, these methods are either limited to 2D images or unable to support large but periodic variations in the functional motion between the image volume samples. In this paper, we present a spatiotemporal volumetric interpolation network (SVIN) designed for 4D dynamic medical images. SVIN introduces dual networks: the first is the spatiotemporal motion network that leverages a 3D convolutional neural network (CNN) for unsupervised parametric volumetric registration to derive a spatiotemporal motion field from a pair of image volumes; the second is the sequential volumetric interpolation network, which uses the derived motion field to interpolate image volumes, together with a new regression-based module to characterize the periodic motion cycles in functional organ structures. We also introduce an adaptive multi-scale architecture to capture large volumetric anatomy motions. Experimental results demonstrate that our SVIN outperformed state-of-the-art temporal medical interpolation methods and a natural video interpolation method that has been extended to support volumetric images. Code is available at [1].
[spatiotemporal, temporal, time, video, frame, natural, sequence, represent, dataset, evaluation, sequential] [table, module, segmentation, regression] [model, derived] [motion, interpolation, medical, field, cardiac, dynamic, intermediate, imaging, method, adaptive, based, acdc, ieee, spatial, figure, optical, psnr, nrmse, ssim, svin, clinical, interpolated, voxelmorph, flow, periodic, convolutional, interpolate, high, deformable, proposed, constrain, slomo, warped, mri, rvli, designed] [image, loss, unsupervised, organ, real] [network, architecture, large, learning, similarity, deep, linear, performance, data, neural, better, training, support] [volumetric, volume, registration, estimation, deformation, conference, international, functional, defined, computer, limited, capture, estimate, well, left]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Yuyu and Bi, Lei and Ahn, Euijoon and Feng, Dagan and Wang, Qian and Kim, Jinman},
  title = {A Spatiotemporal Volumetric Interpolation Network for 4D Dynamic Medical Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Attention-Based Context Aware Reasoning for Situation Recognition
Thilini Cooray, Ngai-Man Cheung, Wei Lu


Situation Recognition (SR) is a fine-grained action recognition task where the model is expected to not only predict the salient action of the image, but also predict values of all associated semantic roles of the action. Predicting semantic roles is very challenging: a vast variety of possibilities can be the match for a semantic role. Existing work has focused on dependency modelling architectures to solve this issue. Inspired by the success achieved by query-based visual reasoning (e.g., Visual Question Answering), we propose to address semantic role prediction as a query-based visual reasoning problem. However, existing query-based reasoning methods have not considered handling of inter-dependent queries which is a unique requirement of semantic role prediction in SR. Therefore, to the best of our knowledge, we propose the first set of methods to address inter-dependent queries in query-based visual reasoning. Extensive experiments demonstrate the effectiveness of our proposed method which achieves outstanding performance on Situation Recognition task. Furthermore, leveraging query inter-dependency, our methods improve upon a state-of-the-art method that answers queries separately. Our code: https://github.com/thilinicooray/context-aware-reasoning-for-sr
[role, reasoning, visual, context, verb, frame, prediction, recognition, action, tda, attention, situation, current, caq, encoding, question, hidden, work, dataset, realized, predict, graph, imsitu, answer, incorporate, cair, answering, vqa, order, agent, dependency, combining, yatskar] [semantic, predicted, aware, propose, final, table, object, region, cnn] [model, query, original, vgg, improve] [ieee, pattern, existing, figure, based, proposed, handling] [image, representation, address, tool, generated, loss, generation] [neural, equation, performance, best, task, network, classifier, processing, set, label, achieve] [conference, computer, vision, neighbour, modelling, novel, well, scene]
@InProceedings{Cooray_2020_CVPR,
  author = {Cooray, Thilini and Cheung, Ngai-Man and Lu, Wei},
  title = {Attention-Based Context Aware Reasoning for Situation Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PatchVAE: Learning Local Latent Codes for Recognition
Kamal Gupta, Saurabh Singh, Abhinav Shrivastava


Unsupervised representation learning holds the promise of exploiting large amounts of unlabeled data to learn general representations. A promising technique for unsupervised learning is the framework of Variational Auto-encoders (VAEs). However, unsupervised representations learned by VAEs are significantly outperformed by those learned by supervised learning for recognition. Our hypothesis is that to learn useful representations for recognition the model needs to be encouraged to learn about repeating and consistent patterns in data. Drawing inspiration from the mid-level representation discovery work, we propose PatchVAE, that reasons about images at patch level. Our key contribution is a bottleneck formulation that encourages mid-level style representations in the VAE framework. Our experiments demonstrate that representations learned by our method perform much better on the recognition tasks compared to those learned by vanilla VAEs.
[recognition, visual, decoder, abhinav, work, multiple] [table, framework, feature, map, location, propose] [model, latents, trained, adversarial] [patch, occ, conv, figure, prior, ieee, repetitive, residual, pattern, proposed] [image, patchvae, unsupervised, occurrence, representation, learn, appearance, generative, vae, zapp, loss, latent, encoder, variational, supervised, train, zocc, zlocc, vaes, learns, zprior, target, discriminative] [learning, better, learned, classification, data, training, network, imagenet, neural, number, layer, deep, probability, task, performance, arxiv, preprint, distribution, architecture, posterior, increasing, large, bottleneck, compared, standard, note, weighted] [reconstruction, computer, conference, capture, vision, demonstrate, approach, single]
@InProceedings{Gupta_2020_CVPR,
  author = {Gupta, Kamal and Singh, Saurabh and Shrivastava, Abhinav},
  title = {PatchVAE: Learning Local Latent Codes for Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Supervised Monocular Trained Depth Estimation Using Self-Attention and Discrete Disparity Volume
Adrian Johnston, Gustavo Carneiro


Monocular depth estimation has become one of the most studied applications in computer vision, where the most accurate approaches are based on fully supervised learning models. However, the acquisition of accurate and large ground truth data sets to model these fully supervised methods is a major challenge for the further development of the area. Self-supervised methods trained with monocular videos constitute one of the most promising approaches to mitigate the challenge mentioned above due to the widespread availability of training data. Consequently, they have been intensively studied, where the main ideas explored consist of different types of model architectures, loss functions, and occlusion masks to address non-rigid motion. In this paper, we propose two new ideas to improve self-supervised monocular trained depth estimation: 1) self-attention, and 2) discrete disparity prediction. Compared with the usual localised convolution operation, self-attention can explore more general contextual information that allows the inference of similar disparity values at non-contiguous regions of the image. Discrete disparity prediction has been shown by fully supervised methods to provide a more robust and sharper depth estimation than the more common continuous disparity prediction, besides enabling the estimation of depth uncertainty. We show that extending the state-of-the-art self-supervised monocular trained depth estimator Monodepth2 with these two ideas allows us to design a model that produces the best results in the field on KITTI 2015 and Make3D, closing the gap with respect to self-supervised stereo training and fully supervised approaches.
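A short sketch of the discrete disparity prediction described above: the network emits logits over a fixed set of disparity bins, the softmax over bins gives a distribution whose expectation is the disparity and whose spread can serve as an uncertainty estimate. Bin values and tensor shapes are hypothetical.

import torch

def disparity_from_volume(logits, bin_values):
    # logits: (B, K, H, W) scores over K discrete disparity bins; bin_values: (K,) disparity of each bin
    prob = torch.softmax(logits, dim=1)
    bins = bin_values.view(1, -1, 1, 1)
    disparity = (prob * bins).sum(dim=1)                                  # per-pixel expected disparity
    variance = (prob * (bins - disparity.unsqueeze(1)) ** 2).sum(dim=1)   # a simple per-pixel uncertainty proxy
    return disparity, variance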
[attention, context, video, prediction, visual] [fully, module, map, semantic, object, table, contextual, autonomous, segmentation, propose] [trained, model, input, improve] [disparity, method, convolutional, based, low, resolution, figure, proposed, dilated, motion, convolution, flow, ieee, sharper] [image, supervised, loss, unsupervised, train, common, selfsupervised] [learning, discrete, training, deep, data, neural, set, compared, network, better, baseline, large, performance, best] [depth, monocular, estimation, stereo, volume, computer, pose, ddv, vision, kitti, uncertainty, single, ground, truth, accurate, photometric, conference, estimator, estimate, computed, eigen, error, rely, reprojection, camera, scene, relative, defined, rel, allows]
@InProceedings{Johnston_2020_CVPR,
  author = {Johnston, Adrian and Carneiro, Gustavo},
  title = {Self-Supervised Monocular Trained Depth Estimation Using Self-Attention and Discrete Disparity Volume},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
STAViS: Spatio-Temporal AudioVisual Saliency Network
Antigoni Tsiami, Petros Koutras, Petros Maragos


We introduce STAViS, a spatio-temporal audiovisual saliency network that combines spatio-temporal visual and auditory information in order to efficiently address the problem of saliency estimation in videos. Our approach employs a single network that combines visual saliency and auditory features and learns to appropriately localize sound sources and to fuse the two saliencies in order to obtain a final saliency map. The network has been designed, trained end-to-end, and evaluated on six different databases that contain audiovisual eye-tracking data of a large variety of videos. We compare our method against 8 different state-of-the-art visual saliency models. Evaluation results across databases indicate that our STAViS model outperforms our visual only variant as well as the other state-of-the-art models in the majority of cases. Also, the consistently good performance it achieves for all databases indicates that it is appropriate for estimating saliency "in-the-wild". The code is available at https://github.com/atsiami/STAViS.
[visual, audiovisual, audio, video, auditory, attention, sound, recognition, stavis, order, temporal, prediction, petros, evaluation, multimodal, fixation, yden, multiple, sauc, frame, outperforms, acoustic, coutrot] [saliency, localization, map, module, sim, final, employed, employ, feature] [model, database, trained, depicted] [ieee, proposed, spatial, based, pattern, fusion, method, signal, figure] [image, loss, corresponding, source, representation] [network, learning, deep, performance, data, good, processing, applied, better, large, architecture, majority, problem, sample] [vision, computer, well, ground, human, estimation, european, approach, truth, single, second, scene]
@InProceedings{Tsiami_2020_CVPR,
  author = {Tsiami, Antigoni and Koutras, Petros and Maragos, Petros},
  title = {STAViS: Spatio-Temporal AudioVisual Saliency Network},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
More Grounded Image Captioning by Distilling Image-Text Matching Model
Yuanen Zhou, Meng Wang, Daqing Liu, Zhenzhen Hu, Hanwang Zhang


Visual attention not only improves the performance of image captioners, but also serves as a visual interpretation to qualitatively measure the caption rationality and model transparency. Specifically, we expect that a captioner can fix its attentive gaze on the correct objects while generating the corresponding words. This ability is also known as grounded image captioning. However, the grounding accuracy of existing captioners is far from satisfactory. To improve the grounding accuracy while retaining the captioning quality, it is expensive to collect the word-region alignment as strong supervision. To this end, we propose a Part-of-Speech (POS) enhanced image-text matching model (SCAN): POS-SCAN, as the effective knowledge distillation for more grounded image captioning. The benefits are two-fold: 1) given a sentence and an image, POS-SCAN can ground the objects more accurately than SCAN; 2) POS-SCAN serves as a word-region alignment regularization for the captioner's visual attention module. By showing benchmark experimental results, we demonstrate that conventional image captioners equipped with POS-SCAN can significantly improve the grounding accuracy without strong supervision. Last but not least, we explore the indispensable Self-Critical Sequence Training (SCST) in the context of grounded image captioning and show that the image-text matching score can serve as a reward for more grounded captioning.
[attention, captioning, grounding, visual, caption, grounded, reward, sentence, evaluation, cider, word, shirt, language, hanwang, man, red, embedding, sequence, scst, avt, lstm, attended, hidden, state, daqing, correct, context, woman, apron] [supervision, table, denotes, score, feature, region, object, module, semantic, global, visualization] [model, improve, quality, trained, original] [proposed, figure, method, based, adopted] [image, alignment, generate, generated, corresponding, generation, loss] [performance, knowledge, neural, training, accuracy, function, distillation, learning, set, arxiv, preprint, distilling, serve, compared, similarity, find, better, standard, size, expensive, deep, weight] [matching, scan, ground, truth, local]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Yuanen and Wang, Meng and Liu, Daqing and Hu, Zhenzhen and Zhang, Hanwang},
  title = {More Grounded Image Captioning by Distilling Image-Text Matching Model},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DUNIT: Detection-Based Unsupervised Image-to-Image Translation
Deblina Bhattacharjee, Seungryong Kim, Guillaume Vizier, Mathieu Salzmann


Image-to-image translation has made great strides in recent years, with current techniques being able to handle unpaired training images and to account for the multi-modality of the translation problem. Despite this, most methods treat the image as a whole, which makes the results they produce for content-rich scenes less realistic. In this paper, we introduce a Detection-based Unsupervised Image-to-image Translation (DUNIT) approach that explicitly accounts for the object instances in the translation process. To this end, we extract separate representations for the global image and for the instances, which we then fuse into a common representation from which we generate the translated image. This allows us to preserve the detailed content of object instances, while still modeling the fact that we aim to produce an image of a single consistent scene. We introduce an instance consistency loss to maintain the coherence between the detections. Furthermore, by incorporating a detector into our architecture, we can still exploit object instances at test time. As evidenced by our experiments, this allows us to outperform the state-of-the-art unsupervised image-to-image translation methods. Furthermore, our approach can also be used as an unsupervised domain adaptation strategy for object detection, and it also achieves state-of-the-art performance on this task.
[unit, exploit, extract, work] [object, global, instance, detection, retinanet, feature, table, detector, map, bounding, score] [input, adversarial, original] [figure, method, adaptive, block, residual, ieee, pattern, comparison, night] [domain, image, translation, style, loss, content, unsupervised, translated, consistency, adaptation, init, dunit, drit, sunny, representation, conditional, cyclegan, source, munit, inception, unpaired, realistic, translate, target, transfer, lpips, alexei, introduce, generate, aim, diverse, translating, corresponding, latent, disentangled, eys, exci, lic, phillip, paired, extracted] [training, learning, note, test, process, task, report, average, merged] [approach, single, computer, consistent, compare, conference]
@InProceedings{Bhattacharjee_2020_CVPR,
  author = {Bhattacharjee, Deblina and Kim, Seungryong and Vizier, Guillaume and Salzmann, Mathieu},
  title = {DUNIT: Detection-Based Unsupervised Image-to-Image Translation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Observe: Approximating Human Perceptual Thresholds for Detection of Suprathreshold Image Transformations
Alan Dolhasz, Carlo Harvey, Ian Williams


Many tasks in computer vision are often calibrated and evaluated relative to human perception. In this paper, we propose to directly approximate the perceptual function performed by human observers completing a visual detection task. Specifically, we present a novel methodology for learning to detect image transformations visible to human observers through approximating perceptual thresholds. To do this, we carry out a subjective two-alternative forced-choice study to estimate perceptual thresholds of human observers detecting local exposure shifts in images. We then leverage transformation equivariant representation learning to overcome issues of limited perceptual data. This representation is then used to train a dense convolutional classifier capable of detecting local suprathreshold exposure shifts - a distortion common to image composites. In this context, our model can approximate perceptual thresholds with an average error of 0.1148 exposure stops between empirical and predicted thresholds. It can also be trained to detect a range of different local transformations.
[visual, dataset, three, encode] [threshold, object, detection, alan, mask, positive, semantic, segmentation] [model, observer, quality, psychometric, suprathreshold, input, stimulus, subjective, detecting, assessment, sensitivity, methodology, decision, original] [perceptual, exposure, ieee, convolutional, based, figure, method, pattern, range, output, pixel, luminance, approximating, indicate, channel, multiscale, analysis] [image, corresponding, aet, unsupervised, representation, target, learn, train, loss] [function, learning, performance, data, applied, empirical, training, layer, deep, parameter, negative, process, set, network, neural, respect, class, pool, batch, validation, approximate] [human, transformation, local, computer, international, vision, conference, approach, defined, volume, visible]
@InProceedings{Dolhasz_2020_CVPR,
  author = {Dolhasz, Alan and Harvey, Carlo and Williams, Ian},
  title = {Learning to Observe: Approximating Human Perceptual Thresholds for Detection of Suprathreshold Image Transformations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Show, Edit and Tell: A Framework for Editing Image Captions
Fawaz Sammani, Luke Melas-Kyriazi


Most image captioning frameworks generate captions directly from images, learning a mapping from visual features to natural language. However, editing existing captions can be easier than generating new ones from scratch. Intuitively, when editing captions, a model is not required to learn information that is already present in the caption (i.e. sentence structure), enabling it to focus on fixing details (e.g. replacing repetitive words). This paper proposes a novel approach to image captioning based on iterative adaptive refinement of an existing caption. Specifically, our caption-editing model consists of two sub-modules: (1) EditNet, a language module with an adaptive copy mechanism (Copy-LSTM) and a Selective Copy Memory Attention mechanism (SCMA), and (2) DCNet, an LSTM-based denoising auto-encoder. These components enable our model to directly copy from and modify existing captions. Experiments demonstrate that our new approach achieves state-of-the-art performance on the MS COCO dataset both with and without sequence-level training.
[attention, caption, lstm, visual, captioning, word, state, language, hidden, editnet, textual, scma, mechanism, dcnet, decoder, context, selective, previous, current, aoanet, natural, includes, tanh, sentence, sequence, attended, cei, cgt, automatic] [score, framework, table, module, final, achieves, highest, ablation, mask, focus] [model, input, copy, copied] [existing, output, figure, mse, adaptive, denoising, proposed, pattern, ieee, based] [image, editing, corresponding, encoder, generating, edit, learn, generated, train] [memory, performance, neural, optimization, training, learning, standard, indicates, machine, gate, probability, set, note, maximum, equation, network, optimize, better, size] [computer, vision, conference, structure, directly, novel]
@InProceedings{Sammani_2020_CVPR,
  author = {Sammani, Fawaz and Melas-Kyriazi, Luke},
  title = {Show, Edit and Tell: A Framework for Editing Image Captions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Structure Boundary Preserving Segmentation for Medical Image With Ambiguous Boundary
Hong Joo Lee, Jung Uk Kim, Sangmin Lee, Hak Gu Kim, Yong Man Ro


In this paper, we propose a novel image segmentation method to tackle two critical problems of medical images, which are (i) ambiguity of structure boundary in the medical image domain and (ii) uncertainty of the segmented region without specialized domain knowledge. To solve those two problems in automatic medical segmentation, we propose a novel structure boundary preserving segmentation framework. To this end, the boundary key point selection algorithm is proposed. In the proposed algorithm, the key points on the structural boundary of the target object are estimated. Then, a boundary preserving block (BPB) with the boundary key point map is applied for predicting the structure boundary of the target object. Further, for embedding experts' knowledge in the fully automatic segmentation, we propose a novel shape boundary-aware evaluator (SBE) with the ground-truth structure information indicated by experts. The proposed SBE could give feedback to the segmentation network based on the structure boundary key point. The proposed method is general and flexible enough to be built on top of any deep learning-based segmentation network. We demonstrate that the proposed method could surpass the state-of-the-art segmentation network and improve the accuracy of three different segmentation network models on different types of medical image datasets.
[automatic, dataset, evaluation, red, construct] [segmentation, boundary, key, map, bpb, sbe, region, fcn, fully, tvus, table, interactive, object, framework, predicted, bpbs, sgt, center, evaluator, feature, denotes, propose, segment] [ambiguous, skin, input, model, adversarial] [medical, proposed, method, figure, ieee, coefficient, block, convolutional, dice, convolution, analysis, imaging, pattern, based, stn, comparison] [image, preserving, target, lesion, generated, generator, preserve, domain, train, loss, user, preserved] [network, performance, number, selection, deep, training, algorithm, learning, statistical, baseline, selected, machine, knowledge, randomly, function, log, base, experiment] [point, structure, novel, international, conference, shape, approach]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Hong Joo and Kim, Jung Uk and Lee, Sangmin and Kim, Hak Gu and Ro, Yong Man},
  title = {Structure Boundary Preserving Segmentation for Medical Image With Ambiguous Boundary},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Predicting Cognitive Declines Using Longitudinally Enriched Representations for Imaging Biomarkers
Lyujian Lu, Hua Wang, Saad Elbeleidy, Feiping Nie


With rapid progress in high-throughput genotyping and neuroimaging, research on complex brain disorders, such as Alzheimer's Disease (AD), has gained significant attention in recent years. Many prediction models have been studied to relate neuroimaging measures to cognitive status over the progression of these diseases. Missing data is one of the biggest challenges in accurate cognitive score prediction of subjects in longitudinal neuroimaging studies. To tackle this problem, in this paper we propose a novel formulation to learn an enriched representation for imaging biomarkers that can simultaneously capture both the information conveyed by baseline neuroimaging records and that conveyed by progressive variations across the varied counts of available follow-up records over time. While the numbers of brain scans of the participants vary, the learned biomarker representation for every participant is a fixed-length vector, which enables us to use traditional learning models to study AD development. Our new objective is formulated to maximize the ratio of the summations of a number of L1-norm distances for improved robustness, which, though, is difficult to solve efficiently in general. Thus we derive a new efficient iterative solution algorithm and rigorously prove its convergence. We have performed extensive experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. A performance gain has been achieved in predicting four different cognitive scores when we compare the original baseline representations against the learned representations with enrichments. These promising empirical results demonstrate the improved performance of our new method and validate its effectiveness.
[prediction, time, temporal, predicting, participant, predict, month] [regression, global, cnn] [study, original, iterative, robust, impairment, model] [proposed, imaging, medical, analysis, brain, method, ieee] [representation, disease, learn, consistency, missing, progressive, preserve, image, introduce] [neuroimaging, cognitive, enriched, data, baseline, biomarker, algorithm, longitudinal, learning, hua, objective, adni, heng, learned, feiping, vector, problem, number, optimization, shannon, ratio, performance, andrew, record, support, space, set, theorem, simultaneously, machine, matrix, linear, sungeun, studied, biomarkers, predictive] [solve, local, conference, international, projection, solution, projected, joint, computer, derive]
@InProceedings{Lu_2020_CVPR,
  author = {Lu, Lyujian and Wang, Hua and Elbeleidy, Saad and Nie, Feiping},
  title = {Predicting Cognitive Declines Using Longitudinally Enriched Representations for Imaging Biomarkers},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Predicting Lymph Node Metastasis Using Histopathological Images Based on Multiple Instance Learning With Deep Graph Convolution
Yu Zhao, Fan Yang, Yuqi Fang, Hailing Liu, Niyun Zhou, Jun Zhang, Jiarui Sun, Sen Yang, Bjoern Menze, Xinjuan Fan, Jianhua Yao


Multiple instance learning (MIL) is a typical weakly-supervised learning method where the label is associated with a bag of instances instead of a single instance. Despite extensive research over past years, effectively deploying MIL remains an open and challenging problem, especially when the commonly assumed standard multiple instance (SMI) assumption is not satisfied. In this paper, we propose a multiple instance learning method based on deep graph convolutional network and feature selection (FS-GCN-MIL) for histopathological image classification. The proposed method consists of three components, including instance-level feature extraction, instance-level feature selection, and bag-level classification. We develop a self-supervised learning mechanism to train the feature extractor based on a combination model of variational autoencoder and generative adversarial network (VAE-GAN). Additionally, we propose a novel instance-level feature selection method to select the discriminative instance features. Furthermore, we employ a graph convolutional network (GCN) for learning the bag-level representation and then performing the classification. We apply the proposed method in the prediction of lymph node metastasis using histopathological images of colorectal cancer. Experimental results demonstrate that the proposed method achieves superior performance compared to the state-of-the-art methods.
[multiple, graph, node, gcn, prediction, three, dataset, work, attention] [feature, instance, bag, mil, lymph, metastasis, resnet, histopathological, positive, pooling, object, weakly, survival, propose, lnm, fully, tumor, framework, lllike, disl, table, challenging, including, paradigm] [model] [method, proposed, histogram, based, ieee, convolutional, clinical, wsi, analysis, pattern, convolution, cell, develop, workload, medical] [representation, image, cancer, discriminative, colorectal, generating, loss, extracted, slide, generate, supervised, colon, encoder, vae, gan] [learning, network, deep, selection, neural, classification, negative, training, performance, machine, evaluate, data, label, layer] [conference, computer, international, vision, approach, distance, novel]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Yu and Yang, Fan and Fang, Yuqi and Liu, Hailing and Zhou, Niyun and Zhang, Jun and Sun, Jiarui and Yang, Sen and Menze, Bjoern and Fan, Xinjuan and Yao, Jianhua},
  title = {Predicting Lymph Node Metastasis Using Histopathological Images Based on Multiple Instance Learning With Deep Graph Convolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Extremely Dense Point Correspondences Using a Learned Feature Descriptor
Xingtong Liu, Yiping Zheng, Benjamin Killeen, Masaru Ishii, Gregory D. Hager, Russell H. Taylor, Mathias Unberath


High-quality 3D reconstructions from endoscopy video play an important role in many clinical applications, including surgical navigation where they enable direct video-CT registration. While many methods exist for general multi-view 3D reconstruction, these methods often fail to deliver satisfactory performance on endoscopic video. Part of the reason is that local descriptors that establish pair-wise point correspondences, and thus drive reconstruction, struggle when confronted with the texture-scarce surface of anatomy. Learning-based dense descriptors usually have larger receptive fields enabling the encoding of global information, which can be used to disambiguate matches. In this work, we present an effective self-supervised training scheme and novel loss design for dense descriptor learning. In direct comparison to recent local and dense descriptors on an in-house sinus endoscopy dataset, we demonstrate that our proposed dense descriptor can generalize to unseen patients and scopes, thereby largely improving the performance of Structure from Motion (SfM) in terms of model density and completeness. We also evaluate our method on a public dense optical flow dataset and a small-scale SfM public dataset to further demonstrate the effectiveness and generality of our method. The source code is available at https://github.com/lppllppl920/DenseDescriptorLearning-Pytorch.
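The following is a hedged sketch of a dense-descriptor training signal: a generic hardest-in-batch triplet margin loss over ground-truth pixel correspondences. It is not the authors' exact loss design; the margin and the sampling of correspondences are illustrative.

# Hardest-negative triplet loss on dense descriptor maps (PyTorch sketch).
import torch
import torch.nn.functional as F

def hardest_triplet_loss(desc_src, desc_tgt, src_uv, tgt_uv, margin=1.0):
    """desc_*: (C, H, W) L2-normalized descriptor maps; *_uv: (N, 2) integer (x, y) pixel coords."""
    anchors   = desc_src[:, src_uv[:, 1], src_uv[:, 0]].t()      # (N, C)
    positives = desc_tgt[:, tgt_uv[:, 1], tgt_uv[:, 0]].t()      # (N, C)
    pos_dist = (anchors - positives).pow(2).sum(dim=1)
    dist = torch.cdist(anchors, positives)                       # (N, N) cross-distances
    dist.fill_diagonal_(float('inf'))                            # exclude each anchor's true match
    neg_dist = dist.min(dim=1).values.pow(2)                     # hardest negative per anchor
    return F.relu(margin + pos_dist - neg_dist).mean()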
[evaluation, dataset, pair, video, three, observed, recognition, trajectory, dog, natural] [feature, groundtruth, location, response, heatmap, table, localization, positive, threshold, bce, detector] [trained, model, input] [proposed, ieee, method, flow, pattern, optical, comparison, high, figure, spatial] [target, loss, source, image, corresponding, generated] [training, performance, number, network, task, evaluate, compared, learning, negative, hardest, contrastive, data, scheme, distribution, selected, softmax] [dense, descriptor, keypoint, matching, local, sfm, computer, conference, vision, point, endoscopy, sparse, slam, softargmax, sift, camera, estimated, international, sinus, match, reconstruction, relative, scene, estimation, ucn, correspondence, estimate, accurate, kitti]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Xingtong and Zheng, Yiping and Killeen, Benjamin and Ishii, Masaru and Hager, Gregory D. and Taylor, Russell H. and Unberath, Mathias},
  title = {Extremely Dense Point Correspondences Using a Learned Feature Descriptor},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Local Deep Implicit Functions for 3D Shape
Kyle Genova, Forrester Cole, Avneesh Sud, Aaron Sarna, Thomas Funkhouser


The goal of this project is to learn a 3D shape representation that enables accurate surface reconstruction, compact storage, efficient computation, consistency for similar shapes, generalization across diverse shape categories, and inference from depth camera observations. Towards this end, we introduce Local Deep Implicit Functions (LDIF), a 3D shape representation that decomposes space into a structured set of learned implicit functions. We provide networks that infer the space decomposition and local deep implicit functions from a 3D mesh or posed depth image. During experiments, we find that it provides 10.3 points higher surface reconstruction accuracy (F-Score) than the state-of-the-art (OccNet), while requiring fewer than 1% of the network parameters. Experiments on posed depth image completion and generalization to unseen classes show 15.8 and 17.8 point improvements over the state-of-the-art, while producing a structured 3D representation for each input with consistency across diverse shape collections.
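A minimal sketch of evaluating a structured implicit representation: the field value at a query point is a sum of scaled, axis-aligned Gaussian elements (the analytic part of SIF-style representations); the per-element learned decoders that LDIF adds on top are omitted, and all shapes and values are illustrative.

# Structured implicit field as a sum of scaled Gaussian elements (NumPy sketch).
import numpy as np

def structured_implicit(query, centers, radii, scales):
    """query: (3,); centers, radii: (N, 3); scales: (N,) -> scalar field value."""
    diff = query[None, :] - centers
    exponent = -0.5 * np.sum((diff / radii) ** 2, axis=1)
    return float(np.sum(scales * np.exp(exponent)))

# Example: two blobs; a surface can be extracted as a level set of this field.
centers = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
radii = np.full((2, 3), 0.3)
scales = np.array([-1.0, -1.0])
print(structured_implicit(np.array([0.25, 0.0, 0.0]), centers, radii, scales))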
[structured, element, decoder, encoding, represent, encode] [table, predicted, object, achieves, feature, center] [trained, input, generalization] [figure, ieee, pattern, gaussian, method, decoded] [latent, representation, image, encoder, loss, autoencoder, consistency] [deep, network, function, learning, set, test, better, space, higher, accuracy, vector, dif, learned, fewer, neural, experiment] [shape, local, ldif, implicit, depth, computer, surface, reconstruction, conference, sif, vision, point, occnet, human, partial, analytic, camera, pipeline, pointnet, single, acm, thomas, accurate, completion, representing, symmetry, well, hao, mesh, posed, approach, voxel, grid, complete, chamfer, international]
@InProceedings{Genova_2020_CVPR,
  author = {Genova, Kyle and Cole, Forrester and Sud, Avneesh and Sarna, Aaron and Funkhouser, Thomas},
  title = {Local Deep Implicit Functions for 3D Shape},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation
Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi-Wing Fu, Jiaya Jia


Instance segmentation is an important task for scene understanding. Compared to its fully-developed 2D counterpart, 3D instance segmentation for point clouds has much room for improvement. In this paper, we present PointGroup, a new end-to-end bottom-up architecture, specifically focused on better grouping the points by exploring the void space between objects. We design a two-branch network to extract point features and predict semantic labels and offsets, for shifting each point towards its respective instance centroid. A clustering component then follows, utilizing both the original and offset-shifted point coordinate sets and taking advantage of their complementary strength. Further, we formulate the ScoreNet to evaluate the candidate instances, followed by the Non-Maximum Suppression (NMS) to remove duplicates. We conduct extensive experiments on two challenging datasets, ScanNet v2 and S3DIS, on which our method achieves the highest performance, 63.6% and 64.0%, compared to 54.9% and 54.4% achieved by former best solutions in terms of mAP with IoU threshold 0.5.
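The dual-set grouping idea can be sketched as follows: cluster points once on their original coordinates and once on their centroid-shifted coordinates, and keep candidate clusters from both passes for later scoring and NMS. This simplified radius-based BFS grouping is not the official implementation, and the radius is an assumed value.

# Dual-set point grouping sketch (NumPy); greedy BFS within a radius, per semantic label.
import numpy as np

def radius_clusters(coords, labels, radius=0.03):
    n = len(coords)
    cluster_id = np.full(n, -1)
    current = 0
    for seed in range(n):
        if cluster_id[seed] != -1:
            continue
        queue, cluster_id[seed] = [seed], current
        while queue:
            i = queue.pop()
            near = np.where((np.linalg.norm(coords - coords[i], axis=1) < radius)
                            & (labels == labels[i]) & (cluster_id == -1))[0]
            cluster_id[near] = current
            queue.extend(near.tolist())
        current += 1
    return cluster_id

# coords, offsets and semantic labels would come from the two-branch network:
# candidates = [radius_clusters(coords, sem), radius_clusters(coords + offsets, sem)]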
[predict, evaluation, extract] [instance, semantic, segmentation, object, pointgroup, offset, scorenet, shifted, predicted, feature, table, grouping, backbone, score, challenging, branch, detection, denotes, ablation, threshold, final, kaiming, shu, highest, framework, nearby, bounding] [original, input, testing, conduct, model] [method, convolutional, based, proposed, jiaya, figure, convolution] [cluster, produce, loss, utilize, learn, separate, introduce] [clustering, set, network, learning, number, group, deep, better, label, candidate, space, neural, performance, denote, validation, best, algorithm, large, vector, design, respective, evaluate, training, higher] [point, scannet, coordinate, scene, directly, cloud, leonidas]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Li and Zhao, Hengshuang and Shi, Shaoshuai and Liu, Shu and Fu, Chi-Wing and Jia, Jiaya},
  title = {PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cost Volume Pyramid Based Depth Inference for Multi-View Stereo
Jiayu Yang, Wei Mao, Jose M. Alvarez, Miaomiao Liu


We propose a cost volume-based neural network for depth inference from multi-view images. We demonstrate that building a cost volume pyramid in a coarse-to-fine manner instead of constructing a cost volume at a fixed resolution leads to a compact, lightweight network and allows us to infer high-resolution depth maps to achieve better reconstruction results. To this end, we first build a cost volume based on uniform sampling of fronto-parallel planes across the entire depth range at the coarsest resolution of an image. Then, given the current depth estimate, we construct new cost volumes iteratively on the pixelwise depth residual to perform depth map refinement. While sharing similar insight with Point-MVSNet in predicting and refining depth iteratively, we show that working on the cost volume pyramid can lead to a more compact yet efficient network structure compared with Point-MVSNet on 3D points. We further provide detailed analyses of the relation between (residual) depth sampling and image resolution, which serves as a principle for building a compact cost volume pyramid. Experimental results on benchmark datasets show that our model runs 6x faster while achieving performance similar to state-of-the-art methods. Code is available at https://github.com/JiayuYANG/CVP-MVSNet
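The coarse-to-fine depth sampling can be illustrated as below: uniform fronto-parallel hypotheses at the coarsest level, then residual hypotheses in a narrow interval around the upsampled current estimate at finer levels. Bin counts and the interval are illustrative assumptions.

# Coarse and residual depth hypotheses for a cost volume pyramid (NumPy sketch).
import numpy as np

def coarse_hypotheses(d_min, d_max, num=48):
    return np.linspace(d_min, d_max, num)                         # uniform sweep at coarsest level

def residual_hypotheses(current_depth, interval, num=8):
    """current_depth: (H, W); returns (num, H, W) hypotheses around the current estimate."""
    offsets = np.linspace(-interval, interval, num).reshape(num, 1, 1)
    return current_depth[None] + offsets

depth = np.full((60, 80), 2.0)        # dummy coarse estimate, in meters
print(residual_hypotheses(depth, interval=0.2).shape)             # (8, 60, 80)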
[build, current, provide, construct, evaluation, dataset] [map, pyramid, feature, level, building, table, faster, propose, framework, adopt] [input, model] [resolution, residual, pixel, method, based, reference, high, range, coarsest, figure, fusion] [image, source, corresponding, loss] [memory, network, sampling, learning, size, inference, deep, set, better, number, search, accuracy, best, achieve, efficient, compared, performance, process, small, gpu, compact] [depth, cost, volume, view, point, stereo, reconstruction, dtu, estimate, scene, approach, iteratively, matching, partial, interval, mvsnet, defined, estimation, ground, truth, vision, coarse, completeness]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Jiayu and Mao, Wei and Alvarez, Jose M. and Liu, Miaomiao},
  title = {Cost Volume Pyramid Based Depth Inference for Multi-View Stereo},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
RoutedFusion: Learning Real-Time Depth Map Fusion
Silvan Weder, Johannes Schonberger, Marc Pollefeys, Martin R. Oswald


The efficient fusion of depth maps is a key part of most state-of-the-art 3D reconstruction methods. Besides requiring high accuracy, these depth fusion methods need to be scalable and real-time capable. To this end, we present a novel real-time capable machine learning-based method for depth map fusion. Similar to the seminal depth map fusion approach by Curless and Levoy, we only update a local group of voxels to ensure real-time capability. Instead of a simple linear fusion of depth information, we propose a neural network that predicts non-linear updates to better account for typical fusion errors. Our network is composed of a 2D depth routing network and a 3D depth fusion network which efficiently handle sensor-specific noise and outliers. This is especially useful for surface edges and thin objects for which the original approach suffers from thickening artifacts. Our method outperforms the traditional fusion approach and related learned approaches on both synthetic and real data. We demonstrate the performance of our method in reconstructing fine geometric details from noise and outlier contaminated data on various scenes.
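For reference, the classic running-average TSDF update (Curless and Levoy) that RoutedFusion replaces with learned non-linear updates looks roughly like the sketch below; the truncation distance and per-update weight are illustrative.

# Running-average TSDF fusion update (NumPy sketch of the classic baseline).
import numpy as np

def tsdf_update(tsdf, weights, new_sdf, new_weight=1.0, trunc=0.05):
    """tsdf, weights: (D, H, W) volumes; new_sdf: per-voxel signed distances from one depth map."""
    new_sdf = np.clip(new_sdf, -trunc, trunc)
    fused = (weights * tsdf + new_weight * new_sdf) / (weights + new_weight)
    return fused, weights + new_weight    # new_weight > 0 keeps the denominator non-zero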
[recognition, outperforms] [map, global, semantic, confidence, propose, fusing, feature] [noise, ray, trained, input, model, christian] [fusion, method, routing, pattern, figure, noisy, proposed, fused, convolutional, ieee, high, traditional, reconstructing] [corresponding, train, synthetic, qualitative, loss, fine] [network, standard, data, update, learning, learned, performance, neural, better, efficient, processing, training, function, evaluate, number, optimization, andrew] [depth, tsdf, conference, computer, reconstruction, vision, international, surface, scene, approach, local, voxel, thin, well, volumetric, marc, distance, volume, michael, thickening, grid, shapenet, thomas, andreas, pipeline, modelnet, european, voxels, signed, mesh, psdf, compute, october]
@InProceedings{Weder_2020_CVPR,
  author = {Weder, Silvan and Schonberger, Johannes and Pollefeys, Marc and Oswald, Martin R.},
  title = {RoutedFusion: Learning Real-Time Depth Map Fusion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
VOLDOR: Visual Odometry From Log-Logistic Dense Optical Flow Residuals
Zhixiang Min, Yiding Yang, Enrique Dunn


We propose a dense indirect visual odometry method taking as input externally estimated optical flow fields instead of hand-crafted feature correspondences. We define our problem as a probabilistic model and develop a generalized-EM formulation for the joint inference of camera motion, pixel depth, and motion-track confidence. Contrary to traditional methods assuming Gaussian-distributed observation errors, we supervise our inference framework under an (empirically validated) adaptive log-logistic distribution model. Moreover, the log-logistic residual model generalizes well to different state-of-the-art optical flow methods, making our approach modular and agnostic to the choice of optical flow estimators. Our method achieved top-ranking results on both TUM RGB-D and KITTI odometry benchmarks. Our open-sourced implementation is inherently GPU-friendly with only linear computational and storage growth.
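A hedged sketch of scoring optical-flow residuals under a log-logistic (Fisk) model, the distribution the paper adopts in place of a Gaussian; the shape and scale parameters below are placeholders rather than fitted values.

# Log-likelihood of flow residual magnitudes under a log-logistic (Fisk) model.
import numpy as np
from scipy.stats import fisk

residuals = np.abs(np.random.randn(1000)) * 0.5   # dummy end-point-error magnitudes
c, scale = 2.0, 0.3                               # assumed shape and scale parameters
print(fisk.logpdf(residuals, c, scale=scale).sum())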
[visual, time, observation, video, sequence] [map, framework, table, feature, tracking, benchmark] [model, input, robust] [flow, ieee, optical, pattern, residual, figure, motion, pixel, likelihood, method, analysis, scale, gaussian] [unsupervised, image, mapping, mode] [learning, distribution, deep, problem, set, probability, update, posterior, inference, mle, probabilistic, performance, accuracy, network] [conference, computer, depth, vision, camera, monocular, dense, odometry, rigidness, international, pose, estimation, kitti, stereo, direct, robotics, wtj, fisk, estimated, slam, geometric, european, sparse, voldor, rigid, thomas, well, mie, michael, automation, intelligent, indirect, joint, error, daniel, scene, estimate, approach, tum, accurate, outlier, bundle]
@InProceedings{Min_2020_CVPR,
  author = {Min, Zhixiang and Yang, Yiding and Dunn, Enrique},
  title = {VOLDOR: Visual Odometry From Log-Logistic Dense Optical Flow Residuals},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Optimize Non-Rigid Tracking
Yang Li, Aljaz Bozic, Tianwei Zhang, Yanli Ji, Tatsuya Harada, Matthias Niessner


One of the widespread solutions for non-rigid tracking has a nested-loop structure: with Gauss-Newton to minimize a tracking objective in the outer loop, and Preconditioned Conjugate Gradient (PCG) to solve a sparse linear system in the inner loop. In this paper, we employ learnable optimizations to improve tracking robustness and speed up solver convergence. First, we upgrade the tracking objective by integrating an alignment data term on deep features which are learned end-to-end through CNN. The new tracking objective can capture the global deformation which helps Gauss-Newton to jump over local minimum, leading to robust tracking on large non-rigid motions. Second, we bridge the gap between the preconditioning technique and learning method by introducing a ConditionNet which is trained to generate a preconditioner such that PCG can converge within a small number of steps. Experimental results indicate that the proposed learning method converges faster than the original PCG by a large margin.
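For context, the inner-loop solver in question is preconditioned conjugate gradient; a minimal generic PCG with a Jacobi preconditioner is sketched below. The paper's contribution, learning the preconditioner with a ConditionNet, is not reproduced here.

# Preconditioned conjugate gradient for A x = b (NumPy sketch).
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=100):
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv @ r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv @ r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Small SPD example with a Jacobi (diagonal) preconditioner.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(pcg(A, b, np.diag(1.0 / np.diag(A))))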
[frame, dataset, graph, step] [tracking, feature, map, global, apply, extractor, propose] [model, input, trained, iterative] [method, proposed, based, classic, color, figure, motion, flow, block, spatial] [source, target, alignment, generate, image] [learning, large, learned, number, deep, training, convergence, data, optimization, energy, matrix, update, objective, linear, problem, network, total, gradient, converge, good, neural, size] [pcg, term, preconditioner, depth, deformation, conditionnet, fitting, dense, reconstruction, matthias, scannet, system, preconditioning, rigid, solution, point, geometric, scene, local, nonrigid, correspondence, camera, solve, sparse, surface, solving, mesh, ground, truth, solver, dynamicfusion, descriptor, directly]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yang and Bozic, Aljaz and Zhang, Tianwei and Ji, Yanli and Harada, Tatsuya and Niessner, Matthias},
  title = {Learning to Optimize Non-Rigid Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
KFNet: Learning Temporal Camera Relocalization Using Kalman Filtering
Lei Zhou, Zixin Luo, Tianwei Shen, Jiahui Zhang, Mingmin Zhen, Yao Yao, Tian Fang, Long Quan


Temporal camera relocalization estimates the pose with respect to each video frame in sequence, as opposed to one-shot relocalization which focuses on a still image. Even though the time dependency has been taken into account, current temporal relocalization methods still generally underperform the state-of-the-art one-shot approaches in terms of accuracy. In this work, we improve the temporal relocalization method by using a network architecture that incorporates Kalman filtering (KFNet) for online camera relocalization. In particular, KFNet extends the scene coordinate regression problem to the time domain in order to recursively establish 2D and 3D correspondences for the pose determination. The network architecture design and the loss formulation are based on Kalman filtering in the context of Bayesian learning. Extensive experiments on multiple relocalization benchmarks demonstrate the high accuracy of KFNet at the top of both one-shot and temporal relocalization approaches.
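The underlying recursion is the standard linear Kalman filter; a minimal predict/update step is sketched below with generic matrices. KFNet extends this recursion to per-pixel scene coordinates with learned components, which this sketch does not attempt to reproduce.

# One linear Kalman filter predict/update step (NumPy sketch).
import numpy as np

def kalman_step(x, P, z, F, Q, H, R):
    x_pred = F @ x                                  # predict state
    P_pred = F @ P @ F.T + Q                        # predict covariance
    S = H @ P_pred @ H.T + R                        # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)             # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)           # update with measurement z
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new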
[temporal, time, kalman, state, video, three, retrieval, recurrent, visual] [regression, table, localization, map, feature, tracking, score, propose] [model, noise, datasets, testing] [based, flow, filtering, optical, pixel, prior, motion, likelihood, gaussian, blur, warping, figure, convolutional, innovation, proposed, spatial, lost] [loss, image, mapping, domain] [process, learning, accuracy, distribution, deep, test, network, filter, training, posterior, architecture, bayesian, linear, covariance, problem, search, transition, statistical, larger, size] [scene, pose, relocalization, camera, kfnet, scoordnet, coordinate, system, measurement, uncertainty, cost, accurate, oflownet, error, estimation, volume, deeploc, formulation, relative, absolute, outlier, cambridge]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Lei and Luo, Zixin and Shen, Tianwei and Zhang, Jiahui and Zhen, Mingmin and Yao, Yao and Fang, Tian and Quan, Long},
  title = {KFNet: Learning Temporal Camera Relocalization Using Kalman Filtering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Information-Driven Direct RGB-D Odometry
Alejandro Fontan, Javier Civera, Rudolph Triebel


This paper presents an information-theoretic approach to point selection in direct RGB-D odometry. The aim is to select only the most informative measurements, in order to reduce the optimization problem with a minimal impact on accuracy. It is usual practice in visual odometry/SLAM to track several hundreds of points, achieving real-time performance on high-end desktop PCs. Reducing their computational footprint will facilitate the implementation of odometry and SLAM in low-end platforms such as small robots and AR/VR glasses. Our experimental results show that our novel information-based selection criterion allows us to reduce the number of tracked points by an order of magnitude (down to only 24 of them), achieving an accuracy similar to the state of the art (sometimes outperforming it) while reducing the computational demand by a factor of 10.
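A hedged simplification of an information-driven selection criterion: greedily pick the residuals whose Jacobian blocks most increase the log-determinant of the 6-DoF pose information matrix, i.e. most reduce the pose entropy. This is an illustrative reading of the idea, not the paper's exact criterion or implementation.

# Greedy entropy-based point selection (NumPy sketch).
import numpy as np

def select_points(jacobians, k, prior_info=1e-6):
    """jacobians: (N, 1, 6) per-point residual Jacobians w.r.t. the 6-DoF pose."""
    H = np.eye(6) * prior_info
    selected, remaining = [], list(range(len(jacobians)))
    for _ in range(k):
        gains = [np.linalg.slogdet(H + jacobians[i].T @ jacobians[i])[1] for i in remaining]
        best = remaining[int(np.argmax(gains))]
        H += jacobians[best].T @ jacobians[best]
        selected.append(best)
        remaining.remove(best)
    return selected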
[visual, state, order, frame, three, work, current, trajectory] [tracking, threshold, tracked] [model, creation, difference, robust, freedom, case] [figure, ieee, based, residual, motion, adjustment, inverse, intensity, high] [image, mapping, aim, notice] [informative, number, entropy, matrix, computational, set, selection, accuracy, covariance, optimization, small, function, respect, algorithm, select, reduce, reduction, large, higher, vector, andrew, paper, performance, criterion, mutual, selected, problem, impact, reducing] [point, keyframe, odometry, direct, photometric, conference, slam, error, international, pose, camera, keyframes, system, bundle, marginalization, cost, robotics, approach, translational, intelligent, local, second, relative, daniel, javier, minimal, grid]
@InProceedings{Fontan_2020_CVPR,
  author = {Fontan, Alejandro and Civera, Javier and Triebel, Rudolph},
  title = {Information-Driven Direct RGB-D Odometry},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SuperGlue: Learning Feature Matching With Graph Neural Networks
Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich


This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at github.com/magicleap/SuperGluePretrainedNetwork.
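The optimal-transport step can be sketched as a log-domain Sinkhorn normalization of the score matrix; the extra "dustbin" row and column that SuperGlue uses for unmatched points are omitted here, and the iteration count is arbitrary.

# Log-domain Sinkhorn normalization of a score matrix (PyTorch sketch).
import math
import torch

def log_sinkhorn(scores, num_iters=50):
    """scores: (M, N) similarity matrix; returns the log of a doubly-normalized transport plan."""
    m, n = scores.shape
    log_mu = scores.new_full((m,), -math.log(m))    # uniform row marginal
    log_nu = scores.new_full((n,), -math.log(n))    # uniform column marginal
    u, v = torch.zeros_like(log_mu), torch.zeros_like(log_nu)
    for _ in range(num_iters):
        u = log_mu - torch.logsumexp(scores + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(scores + u[:, None], dim=0)
    return scores + u[:, None] + v[None, :]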
[graph, visual, attention, context, message, attentional, work] [feature, assignment, score, aggregation, table, global, instance, challenging, focus] [trained] [figure, based, homography, performed] [image, learn, representation, transport] [neural, network, learning, deep, optimal, learned, test, mutual, layer, training, precision, linear, ratio, set, andrew, large, better, architecture, applied, number, optimization, data, efficient, algorithm] [superglue, matching, local, keypoints, pose, superpoint, indoor, estimation, sift, keypoint, oanet, outdoor, handcrafted, partial, matcher, point, descriptor, sinkhorn, single, ground, truth, match, computed, nearest, relative, contextdesc, pointcn, tomasz, estimated, solving, enabling, geometric, cost, formulation, complex]
@InProceedings{Sarlin_2020_CVPR,
  author = {Sarlin, Paul-Edouard and DeTone, Daniel and Malisiewicz, Tomasz and Rabinovich, Andrew},
  title = {SuperGlue: Learning Feature Matching With Graph Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Reinforced Feature Points: Optimizing Feature Detection and Description for a High-Level Task
Aritra Bhowmik, Stefan Gumhold, Carsten Rother, Eric Brachmann


We address a core problem of computer vision: Detection and description of 2D feature points for image matching. For a long time, hand-crafted designs, like the seminal SIFT algorithm, were unsurpassed in accuracy and efficiency. Recently, learned feature detectors emerged that implement detection and description using neural networks. Training these networks usually resorts to optimizing low-level matching scores, often pre-defining sets of image patches which should or should not match, or which should or should not contain key points. Unfortunately, increased accuracy for these low-level matching scores does not necessarily translate to better performance in high-level vision tasks. We propose a new training methodology which embeds the feature detector in a complete vision pipeline, and where the learnable parameters are trained in an end-to-end fashion. We overcome the discrete nature of key point selection and descriptor matching using principles from reinforcement learning. As an example, we address the task of relative pose estimation between a pair of images. We demonstrate that the accuracy of a state-of-the-art learning-based feature detector can be increased when trained for the task it is supposed to solve at test time. Our training methodology poses little restrictions on the task to learn, and works for any architecture which predicts key point heat maps, and descriptors for key point locations.
[description, reinforcement] [key, feature, detection, detector, map, apply] [robust, auc, trained, model, methodology] [illumination, patch, output, learnable, figure] [image, loss, train, learn] [training, task, accuracy, learning, probability, network, learned, selection, neural, probabilistic, distribution, matrix, sampling, ratio, sample, observe, increased, performance, test, find, filter, set, selected, large, architecture] [superpoint, matching, point, relative, pose, descriptor, vision, pipeline, heat, essential, sift, ground, truth, local, reinforced, rootsift, error, estimation, nearest, ransac, lift, camera, complete, match, well, compare, distance, sparse, inlier, viewpoint, transformation, estimator, estimated, reprojection, computer, define]
@InProceedings{Bhowmik_2020_CVPR,
  author = {Bhowmik, Aritra and Gumhold, Stefan and Rother, Carsten and Brachmann, Eric},
  title = {Reinforced Feature Points: Optimizing Feature Detection and Description for a High-Level Task},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ReDA:Reinforced Differentiable Attribute for 3D Face Reconstruction
Wenbin Zhu, HsiangTao Wu, Zeyu Chen, Noranart Vesdapunt, Baoyuan Wang


The key challenge for 3D face shape reconstruction is to build the correct dense face correspondence between the deformable mesh and the single input image. Given the ill-posed nature, previous works heavily rely on prior knowledge (such as 3DMM [2]) to reduce depth ambiguity. Although impressive results have been achieved recently [42, 14, 8], there is still large room to improve the correspondence so that the projected face shape better aligns with the silhouette of each face region (i.e., eye, mouth, nose, cheek, etc.) on the image. To further reduce the ambiguities, we present a novel framework called "Reinforced Differentiable Attributes" ("ReDA") which is more general and effective than previous Differentiable Rendering ("DR"). Specifically, we first extend from color to more broad attributes, including the depth and the face parsing mask. Secondly, unlike the previous Z-buffer rendering, we make the rendering more differentiable through a set of convolution operations with multi-scale kernel sizes. Meanwhile, to make "ReDA" more successful for 3D face reconstruction, we further introduce a new free-form deformation layer that sits on top of 3DMM to enjoy both the prior knowledge and out-of-space modeling. Both techniques can be easily integrated into existing 3D face reconstruction pipelines. Extensive experiments on both RGB and RGB-D datasets show that our approach outperforms prior art.
[recognition, previous, represent, work, dataset] [mask, parsing, effectiveness, pyramid, table, segmentation, map, regression, propose, apply] [face, model, reda, facial, input, morphable, improve, landmark, micc, christian, adding, expression, pablo, patrick] [ieee, color, pattern, pixel, figure, prior, convolution, kernel, proposed, based, june, traditional] [image, attribute, loss, texture, corresponding, learn] [learning, better, deep, set, capacity, optimization, soft, layer, optimize, large, denote, comparing, reduce] [differentiable, reconstruction, computer, vision, shape, depth, fitting, mesh, deformation, conference, correspondence, rasterization, rendering, dense, geometry, triangle, vertex, rendered, single, error, acm, michael, rgb, monocular, directly, enclosed, pose, hao]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Wenbin and Wu, HsiangTao and Chen, Zeyu and Vesdapunt, Noranart and Wang, Baoyuan},
  title = {ReDA:Reinforced Differentiable Attribute for 3D Face Reconstruction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
EventCap: Monocular 3D Capture of High-Speed Human Motions Using an Event Camera
Lan Xu, Weipeng Xu, Vladislav Golyanik, Marc Habermann, Lu Fang, Christian Theobalt


The high frame rate is a critical requirement for capturing fast human motions. In this setting, existing markerless image-based methods are constrained by the lighting requirement, the high data bandwidth and the consequent high computation overhead. In this paper, we propose EventCap -- the first approach for 3D capturing of high-speed human motions using a single event camera. Our method combines model-based optimization and CNN-based human pose detection to capture high frequency motion details and to reduce the drifting in the tracking. As a result, we can capture fast motions at millisecond resolution with significantly higher data efficiency than using high frame rate videos. Experiments on our new event-based fast human motion dataset demonstrate the effectiveness and accuracy of our method, as well as its robustness to challenging lighting conditions.
[frame, recognition, stream, temporal, multiple, skeleton, current, time, dataset] [tracking, template, detection, fps, propose, feature, refine, boundary, achieves, track, apply, refinement] [christian, model] [event, motion, high, intensity, method, pattern, fast, asynchronous, low, figure, eventcap, capturing, based, pixel, reference, dynamic, resolution, adjacent, quantitative] [image, latent, corresponding] [rate, data, optimization, batch, linear, accuracy, neural, performance, deep] [pose, human, vision, capture, computer, estimation, conference, body, mesh, monocular, single, hmr, international, markerless, joint, camera, shape, mono, michael, approach, depth, closest, reconstructed, refer, rgb, hybrid, accurate, position, overlay, full, lighting, smpl]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Lan and Xu, Weipeng and Golyanik, Vladislav and Habermann, Marc and Fang, Lu and Theobalt, Christian},
  title = {EventCap: Monocular 3D Capture of High-Speed Human Motions Using an Event Camera},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-Modal Deep Face Normals With Deactivable Skip Connections
Victoria Fernandez Abrevaya, Adnane Boukhayma, Philip H.S. Torr, Edmond Boyer


We present an approach for estimating surface normals from in-the-wild color images of faces. While data-driven strategies have been proposed for single face images, limited available ground truth data makes this problem difficult. To alleviate this issue, we propose a method that can leverage all available image and normal data, whether paired or not, thanks to a novel cross-modal learning architecture. In particular, we enable additional training with single modality data, either color or normal, by using two encoder-decoder networks with a shared latent space. The proposed architecture also enables face details to be transferred between the image and normal domains, given paired data, through skip connections between the image encoder and normal decoder. Core to our approach is a novel module that we call deactivable skip connections, which allows integrating both the auto-encoded and image-to-normal branches within the same architecture that can be trained end-to-end. This allows learning of a rich latent space that can accurately capture the normal information. We compare against state-of-the-art methods and show that our approach can achieve significant improvements, both quantitative and qualitative, with natural face images.
[decoder, work, dataset, order, previous] [map, extreme, table, propose] [face, facial, model, christian, prn, datasets, input, morphable, stefanos, patrick, pablo] [skip, ieee, pattern, proposed, color, method, output, figure, recover, convolutional, quantitative, analysis] [image, encoder, transfer, latent, qualitative, unsupervised, paired] [training, learning, deep, architecture, layer, neural, data, angular, standard, machine, fei, problem] [normal, computer, conference, vision, single, surface, approach, reconstruction, international, deactivable, shape, detailed, depth, estimation, acm, monocular, well, parametric, photoface, florence, michael, limited, allows, accurate, geometry, estimate, additional, compare, thomas]
@InProceedings{Abrevaya_2020_CVPR,
  author = {Abrevaya, Victoria Fernandez and Boukhayma, Adnane and Torr, Philip H.S. and Boyer, Edmond},
  title = {Cross-Modal Deep Face Normals With Deactivable Skip Connections},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Weakly-Supervised Mesh-Convolutional Hand Reconstruction in the Wild
Dominik Kulon, Riza Alp Guler, Iasonas Kokkinos, Michael M. Bronstein, Stefanos Zafeiriou


We introduce a simple and effective network architecture for monocular 3D hand pose estimation consisting of an image encoder followed by a mesh convolutional decoder that is trained through a direct 3D hand mesh reconstruction loss. We train our network by gathering a large-scale dataset of hand action in YouTube videos and use it as a source of weak supervision. Our weakly-supervised mesh convolutions-based system largely outperforms state-of-the-art methods, even halving the errors on the in the wild benchmark. The dataset and additional resources are available at https://arielai.com/mesh_hands.
[dataset, youtube, decoder, graph, outperforms, length, recognition, sign] [table] [model, iterative, trained, wild, datasets] [ieee, pattern, convolutional, spectral, figure, method, spatial, based, proposed, prior, fast, automated] [image, loss, latent, filtered, introduce, train] [training, learning, network, set, neural, performance, data, function, deep, test, simple, architecture, large, standard, find] [pose, hand, mesh, conference, computer, vision, estimation, reconstruction, spiral, shape, michael, system, international, mano, fitting, error, human, single, rgb, approach, vertex, camera, mpii, monocular, joint, body, collection, keypoint, term, ground, truth, allows, depth, recovery, freihand]
@InProceedings{Kulon_2020_CVPR,
  author = {Kulon, Dominik and Guler, Riza Alp and Kokkinos, Iasonas and Bronstein, Michael M. and Zafeiriou, Stefanos},
  title = {Weakly-Supervised Mesh-Convolutional Hand Reconstruction in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Face X-Ray for More General Face Forgery Detection
Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, Baining Guo


In this paper we propose a novel image representation called face X-ray for detecting forgery in face images. The face X-ray of an input face image is a greyscale image that reveals whether the input image can be decomposed into the blending of two images from different sources. It does so by showing the blending boundary for a forged image and the absence of blending for a real image. We observe that most existing face manipulation methods share a common step: blending the altered face into an existing background image. For this reason, face X-ray provides an effective way for detecting forgery generated by most existing face manipulation algorithms. Face X-ray is general in the sense that it only assumes the existence of a blending step and does not rely on any knowledge of the artifacts associated with a specific face manipulation technique. Indeed, the algorithm for computing face X-ray can be trained without fake images generated by any of the state-of-the-art face manipulation methods. Extensive experiments show that face X-ray remains effective when applied to forgery generated by unseen face manipulation techniques, while most existing face forgery detection or deepfake detection algorithms experience a significant performance drop.
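As an illustration of the blending step that face X-ray targets, the sketch below composites a foreground face into a background with a soft mask and derives a greyscale boundary image as 4*M*(1-M); that particular form is my assumption about the formulation, not a quotation of it.

# Blended composite and blending-boundary ("X-ray"-style) image (NumPy sketch).
import numpy as np

def blend_and_xray(foreground, background, mask):
    """foreground/background: (H, W, 3) float images; mask: (H, W) soft blending mask in [0, 1]."""
    blended = mask[..., None] * foreground + (1.0 - mask[..., None]) * background
    xray = 4.0 * mask * (1.0 - mask)   # zero where purely one source, peaks at the blending boundary
    return blended, xray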
[dataset, visual, step, work] [detection, mask, framework, table, detect, adopt, boundary, including, hrnet, background, groundtruth, localization, foreground] [face, forgery, blending, manipulation, manipulated, facial, detecting, input, blended, model, generalization, auc, trained, deepfake, forged, forensics, xception, luisa, correction, eer, workshop, multimedia, effective] [ieee, existing, figure, color, based, convolutional, method, comparison, proposed, analysis, develop] [image, real, fake, generated, unseen, loss, corresponding, ability, representation, specific, supervised, train] [training, data, performance, arxiv, preprint, learning, neural, network, set, deep, binary, test, large, number, equation, better, general] [conference, approach, international, intrinsic, computer, well, matthias]
@InProceedings{Li_2020_CVPR,
  author = {Li, Lingzhi and Bao, Jianmin and Zhang, Ting and Yang, Hao and Chen, Dong and Wen, Fang and Guo, Baining},
  title = {Face X-Ray for More General Face Forgery Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Morphable Face Albedo Model
William A. P. Smith, Alassane Seck, Hannah Dee, Bernard Tiddeman, Joshua B. Tenenbaum, Bernhard Egger


In this paper, we bring together two divergent strands of research: photometric face capture and statistical 3D face appearance modelling. We propose a novel lightstage capture and processing pipeline for acquiring ear-to-ear, truly intrinsic diffuse and specular albedo maps that fully factor out the effects of illumination, camera and geometry. Using this pipeline, we capture a dataset of 50 scans and combine them with the only existing publicly available albedo dataset (3DRFE) of 23 scans. This allows us to build the first morphable face albedo model. We believe this is the first statistical analysis of the variability of facial specular albedo maps. This model can be used as a plug-in replacement for the texture model of the Basel Face Model and we make our new albedo model publicly available. We ensure careful spectral calibration such that our model is built in a linear sRGB space, suitable for inverse rendering of images taken by typical cameras. We demonstrate our model in a state-of-the-art analysis-by-synthesis 3DMM fitting pipeline, are the first to integrate specular map estimation, and outperform the Basel Face Model in albedo reconstruction.
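At its core, a morphable albedo model is a linear statistical model (a mean plus principal modes of variation) over vectorized per-vertex albedo maps. A toy numpy sketch of building and sampling such a model from registered scans (variable names are illustrative; the actual model separates diffuse and specular albedo and is built in linear sRGB):

```python
import numpy as np

def build_albedo_model(albedos, n_components=50):
    """albedos: (N, V*3) matrix, one row per registered scan of per-vertex RGB albedo."""
    mean = albedos.mean(axis=0)
    X = albedos - mean
    # principal components via thin SVD; rows of Vt are the modes of variation
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:n_components]                           # (k, V*3)
    stddev = S[:n_components] / np.sqrt(max(len(albedos) - 1, 1))
    return mean, components, stddev

def sample_albedo(mean, components, stddev, rng=np.random.default_rng(0)):
    coeffs = rng.standard_normal(len(stddev)) * stddev       # draw per-mode coefficients
    return mean + coeffs @ components                        # a new plausible albedo map
```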
[dataset, three, recognition, built, provide, build] [template, final] [model, face, morphable, publicly, facial, poisson, blending, nonlinear] [illumination, colour, captured, proposed, ieee, inverse, light, based, figure, scale, pattern, spectral, srgb, separation, bernhard, existing] [texture, appearance, image, masked, source, perform] [statistical, linear, gradient, set, data, setup, filter, william, performance, matrix] [albedo, specular, diffuse, photometric, capture, camera, computer, bfm, conference, vision, geometry, multiview, rendering, reflectance, view, shape, intrinsic, single, reconstruction, additional, spherical, international, allows, estimation, polarising, acm, thomas, lightstage, pipeline, polarisation, vertex, transformation, spec, principal, basel, fitting, directly]
@InProceedings{Smith_2020_CVPR,
  author = {Smith, William A. P. and Seck, Alassane and Dee, Hannah and Tiddeman, Bernard and Tenenbaum, Joshua B. and Egger, Bernhard},
  title = {A Morphable Face Albedo Model},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cascade EF-GAN: Progressive Facial Expression Editing With Local Focuses
Rongliang Wu, Gongjie Zhang, Shijian Lu, Tao Chen


Recent advances in Generative Adversarial Nets (GANs) have shown remarkable improvements for facial expression editing. However, current methods are still prone to generate artifacts and blurs around expression-intensive regions, and often introduce undesired overlapping artifacts while handling large-gap expression transformations such as transformation from furious to laughing. To address these limitations, we propose Cascade Expression Focal GAN (Cascade EF-GAN), a novel network that performs progressive facial expression editing with local expression focuses. The introduction of the local focus enables the Cascade EF-GAN to better preserve identity-related features and details around eyes, noses and mouths, which further helps reduce artifacts and blurs within the generated facial images. In addition, an innovative cascade transformation strategy is designed by dividing a large facial expression transformation into multiple small ones in cascade, which helps suppress overlapping artifacts and produce more realistic editing while dealing with large-gap expression transformations. Extensive experiments over two publicly available facial expression datasets show that our proposed Cascade EF-GAN achieves superior performance for facial expression editing.
[attention, step, transformer, concatenation] [cascade, global, response, branch, final] [expression, input] [output, intermediate] [target, stargan] [baseline, label] [local, initial, refiner]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Rongliang and Zhang, Gongjie and Lu, Shijian and Chen, Tao},
  title = {Cascade EF-GAN: Progressive Facial Expression Editing With Local Focuses},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GanHand: Predicting Human Grasp Affordances in Multi-Object Scenes
Enric Corona, Albert Pumarola, Guillem Alenya, Francesc Moreno-Noguer, Gregory Rogez


The rise of deep learning has brought remarkable progress in estimating hand geometry from images where the hands are part of the scene. This paper focuses on a new, previously unexplored problem: predicting how a human would grasp one or several objects, given a single RGB image of these objects. This is a problem with enormous potential in e.g. augmented reality, robotics or prosthetic design. In order to predict feasible grasps, we need to understand the semantic content of the image, its geometric structure and all potential interactions with a hand physical model. To this end, we introduce a generative model that jointly reasons at all these levels and 1) regresses the 3D shape and pose of the objects in the scene; 2) estimates the grasp types; and 3) refines the 51 DoF of a 3D hand model to minimize a graspability loss. To train this model we build the YCB-Affordance dataset, which contains more than 133k images of 21 objects in the YCB-Video dataset. We have annotated these images with more than 28M plausible 3D human grasps according to a 33-class taxonomy. A thorough evaluation on synthetic and real images shows that our model can robustly predict realistic grasps, even in cluttered scenes with multiple objects in close contact.
[dataset, predict, prediction, predicting, work, affordance, three, multiple, modeling, affordances, contribution] [object, annotated, table, predicted, tracking] [model, input, physical, type] [figure, valid, method, based] [image, realistic, loss, synthetic, taxonomy, real, train, generative, representation] [learning, problem, deep, number, optimization, manually, classification, consider, layer, baseline, data, set, sample, architecture, evaluate, network, training, feasible, process] [hand, grasp, pose, human, shape, rgb, single, ganhand, grasping, estimation, contact, interpenetration, graspit, reconstruction, scene, joint, mano, predicts, estimate, ycb, obman, approach, cad, plane, rotation, robotic, point, simulation, estimating, robotics]
@InProceedings{Corona_2020_CVPR,
  author = {Corona, Enric and Pumarola, Albert and Alenya, Guillem and Moreno-Noguer, Francesc and Rogez, Gregory},
  title = {GanHand: Predicting Human Grasp Affordances in Multi-Object Scenes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Spatial Gradient and Temporal Depth Learning for Face Anti-Spoofing
Zezheng Wang, Zitong Yu, Chenxu Zhao, Xiangyu Zhu, Yunxiao Qin, Qiusheng Zhou, Feng Zhou, Zhen Lei


Face anti-spoofing is critical to the security of face recognition systems. Depth-supervised learning has proven to be one of the most effective methods for face anti-spoofing. Despite the great success, most previous works still formulate the problem as a single-frame multi-task one by simply augmenting the loss with depth, while neglecting the detailed fine-grained information and the interplay between facial depths and moving patterns. In contrast, we design a new approach to detect presentation attacks from multiple frames based on two insights: 1) detailed discriminative clues (e.g., spatial gradient magnitude) between living and spoofing faces may be discarded through stacked vanilla convolutions, and 2) the dynamics of 3D moving faces provide important clues for detecting spoofing faces. The proposed method is able to capture discriminative details via a Residual Spatial Gradient Block (RSGB) and efficiently encode spatio-temporal information with a Spatio-Temporal Propagation Module (STPM). Moreover, a novel Contrastive Depth Loss is presented for more accurate depth supervision. To assess the efficacy of our method, we also collect a Double-modal Anti-spoofing Dataset (DMAD) which provides actual depth for each sample. The experiments demonstrate that the proposed approach achieves state-of-the-art results on five benchmark datasets including OULU-NPU, SiW, CASIA-MFSD, Replay-Attack, and the new DMAD. Code will be available at https://github.com/clks-wzz/FAS-SGTD.
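The Contrastive Depth Loss supervises local depth contrast rather than only absolute depth values: the difference between each pixel and its eight neighbours should match between the predicted and ground-truth depth maps. A hedged PyTorch sketch of one plausible form, implemented with fixed contrast kernels:

```python
import torch
import torch.nn.functional as F

def contrastive_depth_loss(pred, gt):
    """pred, gt: (B, 1, H, W) depth maps. Compares per-pixel contrast against the
    8 neighbours, implemented as eight fixed 3x3 convolution kernels."""
    kernels = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            k = torch.zeros(1, 1, 3, 3)
            k[0, 0, 1, 1] = 1.0               # centre pixel
            k[0, 0, 1 + dy, 1 + dx] = -1.0    # minus one neighbour
            kernels.append(k)
    kernels = torch.cat(kernels, dim=0).to(pred.device)      # (8, 1, 3, 3)
    contrast_pred = F.conv2d(pred, kernels, padding=1)       # (B, 8, H, W)
    contrast_gt = F.conv2d(gt, kernels, padding=1)
    return (contrast_pred - contrast_gt).pow(2).mean()
```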
[temporal, frame, three, dataset, recognition, convgru, current] [detection, map, denotes, feature, module, propose, backbone, table, propagation, predicted] [face, model, spoofing, living, protocol, testing, auxiliary, attack, presentation, ststb, rsgb, facial, siw, stpm, actual, stasn, magnitude, generalization, acer, jukka, zhen, difference, abdenour, spoof, input] [spatial, method, proposed, based, convolution, residual, motion, block, figure, convolutional, ieee] [loss, supervised, image, discriminative, corresponding, learn, generated, train, representation, real] [gradient, contrastive, binary, network, learning, performance, neural, design, classification, training] [depth, coarse, conference, capture, novel, camera]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Zezheng and Yu, Zitong and Zhao, Chenxu and Zhu, Xiangyu and Qin, Yunxiao and Zhou, Qiusheng and Zhou, Feng and Lei, Zhen},
  title = {Deep Spatial Gradient and Temporal Depth Learning for Face Anti-Spoofing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DeepCap: Monocular Human Performance Capture Using Weak Supervision
Marc Habermann, Weipeng Xu, Michael Zollhofer, Gerard Pons-Moll, Christian Theobalt


Human performance capture is a highly important computer vision problem with many applications in movie production and virtual/augmented reality. Many previous performance capture approaches either required expensive multi-view setups or did not recover dense space-time coherent geometry with frame-to-frame correspondences. We propose a novel deep learning approach for monocular dense human performance capture. Our method is trained in a weakly supervised manner based on multi-view supervision completely removing the need for training data with 3D ground truth annotations. The network architecture is based on two separate networks that disentangle the task into a pose estimation and a non-rigid surface deformation step. Extensive qualitative and quantitative evaluations show that our approach outperforms the state of the art in terms of quality and robustness.
[video, recognition, people, work, embedded, graph] [global, template, tracking, supervision, propose, foreground] [input, model, clothing] [method, ieee, pattern, motion, based, dynamic, high, reference] [image, translation, loss, qualitative, representation, corresponding] [performance, training, learning, set, note, evaluate, network, deep, general, accuracy, number, layer] [human, computer, pose, capture, vision, conference, deformation, shape, monocular, single, camera, surface, acm, body, joint, reconstruction, approach, dense, estimation, mesh, ground, truth, international, sparse, skeletal, view, rotation, volumetric, articulated, differentiable, depth, geometry, deformed, deephuman, volume, keypoint, implicit, parametric, livecap, distance, defnet, overlay, amviou]
@InProceedings{Habermann_2020_CVPR,
  author = {Habermann, Marc and Xu, Weipeng and Zollhofer, Michael and Pons-Moll, Gerard and Theobalt, Christian},
  title = {DeepCap: Monocular Human Performance Capture Using Weak Supervision},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Attention Mechanism Exploits Temporal Contexts: Real-Time 3D Human Pose Reconstruction
Ruixu Liu, Ju Shen, He Wang, Chen Chen, Sen-ching Cheung, Vijayan Asari


We propose a novel attention-based framework for 3D human pose estimation from a monocular video. Despite the general success of end-to-end deep learning paradigms, our approach is based on two key observations: (1) single-frame predictions often yield temporal incoherence and jitter; (2) the error rate can be markedly reduced by increasing the temporal receptive field over a video. Therefore, we design an attentional mechanism to adaptively identify significant frames and tensor outputs from each deep neural net layer, leading to better estimates. To achieve large temporal receptive fields, multi-scale dilated convolutions are employed to model long-range dependencies among frames. The architecture is straightforward to implement and can be flexibly adopted for real-time applications. Any off-the-shelf 2D pose estimation system, e.g. Mocap libraries, can be easily integrated in an ad-hoc fashion. We both quantitatively and qualitatively evaluate our method on various standard benchmark datasets (e.g. Human3.6M, HumanEva). Our method considerably outperforms state-of-the-art algorithms, with up to an 8% error reduction (average mean per joint position error: 34.7) compared to the best-reported results. Code is available at: (https://github.com/lrxjason/Attention3DHumanPose)
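The two ingredients highlighted above, large temporal receptive fields from dilated 1D convolutions and an attention mechanism that re-weights frames before aggregation, can be illustrated with a much-simplified PyTorch block (single-scale, with illustrative layer sizes rather than the paper's multi-scale architecture):

```python
import torch
import torch.nn as nn

class TemporalAttentionBlock(nn.Module):
    """Dilated temporal convolution over a 2D-keypoint sequence, followed by a
    learned soft attention over frames (simplified illustration)."""
    def __init__(self, in_dim=34, channels=256, dilation=3):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, channels, kernel_size=3,
                              dilation=dilation, padding=dilation)
        self.attn = nn.Conv1d(channels, 1, kernel_size=1)
        self.head = nn.Linear(channels, 17 * 3)              # 3D joints of the centre frame

    def forward(self, x):                                    # x: (B, T, in_dim) 2D keypoints
        h = torch.relu(self.conv(x.transpose(1, 2)))         # (B, C, T)
        w = torch.softmax(self.attn(h), dim=-1)              # (B, 1, T) per-frame weights
        pooled = (h * w).sum(dim=-1)                         # attention-weighted temporal pooling
        return self.head(pooled).view(-1, 17, 3)
```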
[attention, temporal, recognition, evaluation, three, tcn, unit, frame, mechanism, video, individual, long, work, prediction, causal, dataset, martinez, hossain] [table, module, level, ablation, pyramid, feature] [model, input, improve] [kernel, receptive, dilation, pattern, figure, field, tensor, convolutional, output, motion, dilated, channel, ieee, method, demonstrates, cascaded] [cpn] [neural, learning, layer, number, network, deep, performance, training, size, accuracy, large, machine, best, compared, data, weight, rate, processing, applied, configuration, algorithm, dimension, setting, impact] [pose, estimation, conference, human, vision, computer, mpjpe, approach, error, body, joint, european, international, pavllo, single, articulated, system]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Ruixu and Shen, Ju and Wang, He and Chen, Chen and Cheung, Sen-ching and Asari, Vijayan},
  title = {Attention Mechanism Exploits Temporal Contexts: Real-Time 3D Human Pose Reconstruction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Advancing High Fidelity Identity Swapping for Forgery Detection
Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, Fang Wen


In this work, we study various existing benchmarks for deepfake detection research. In particular, we examine a novel two-stage face swapping algorithm, called FaceShifter, for high-fidelity and occlusion-aware face swapping. Unlike many existing face swapping works that leverage only limited information from the target image when synthesizing the swapped face, FaceShifter generates the swapped face with high fidelity by exploiting and integrating the target attributes thoroughly and adaptively. FaceShifter can handle facial occlusions with a second synthesis stage consisting of a Heuristic Error Acknowledging Refinement Network (HEAR-Net), which is trained to recover anomaly regions in a self-supervised way without any manual annotations. Experiments show that existing deepfake detection algorithms perform poorly on FaceShifter, since it achieves higher quality than all existing benchmarks. However, our newly developed Face X-Ray method can reliably detect forged images created by FaceShifter.
[embedding, previous, embeddings, attentional, three, multiple, recognition] [feature, stage, detection, level, table, occlusion, mask] [face, identity, aad, swapped, zatt, model, expression, ipgan, facial, faceswap, nirkin, zid, heuristic, trained, deepfakes, input, fsgan, christian, forgery, dong, faceshifter, study] [figure, ieee, method, adaptive, integration, proposed, pattern, result, comparison, high, formulated, output, existing] [target, image, source, swapping, preserve, generated, loss, fidelity, generate, synthesizing, encoder, resblk, synthesis] [network, training, arxiv, preprint, large, vector, activation, manual, better] [conference, computer, pose, vision, lighting, international, well, second, error, novel, hao, handle, reconstruction]
@InProceedings{Li_2020_CVPR,
  author = {Li, Lingzhi and Bao, Jianmin and Yang, Hao and Chen, Dong and Wen, Fang},
  title = {Advancing High Fidelity Identity Swapping for Forgery Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Controllable Person Image Synthesis With Attribute-Decomposed GAN
Yifang Men, Yiming Mao, Yuning Jiang, Wei-Ying Ma, Zhouhui Lian


This paper introduces the Attribute-Decomposed GAN, a novel generative model for controllable person image synthesis, which can produce realistic person images with desired human attributes (e.g., pose, head, upper clothes and pants) provided in various source inputs. The core idea of the proposed model is to embed human attributes into the latent space as independent codes and thus achieve flexible and continuous control of attributes via mixing and interpolation operations in explicit style representations. Specifically, a new architecture consisting of two encoding pathways with style block connections is proposed to decompose the original hard mapping into multiple more accessible subtasks. In source pathway, we further extract component layouts with an off-the-shelf human parser and feed them into a shared global texture encoder for decomposed latent codes. This strategy allows for the synthesis of more realistic output images and automatic separation of un-annotated attributes. Experimental results demonstrate the proposed method's superiority over the state of the art in pose transfer and its effectiveness in the brand-new task of component attribute transfer.
[encoding, parser, natural, multiple, construct, automatic, automatically] [module, global, contextual, annotation, semantic, feature] [model, clothes, adversarial, vgg, original] [figure, method, ieee, pattern, result, proposed, output, dce, block, introduced, fusion, flexible, separation] [person, image, source, style, component, target, synthesis, transfer, texture, generated, code, attribute, generator, desired, latent, encoder, loss, controllable, generative, decomposed, appearance, realistic, user, control, extracted, synthesize, arbitrary, corresponding, editing, unsupervised, conditional, csty, synthesizing, specific, synthesized, manifold, real] [architecture, network, upper, training, arxiv, preprint, neural, task, learning, space, data] [pose, human, computer, conference, vision, directly, detailed, international, full]
@InProceedings{Men_2020_CVPR,
  author = {Men, Yifang and Mao, Yiming and Jiang, Yuning and Ma, Wei-Ying and Lian, Zhouhui},
  title = {Controllable Person Image Synthesis With Attribute-Decomposed GAN},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Attentive Normalization for Conditional Image Generation
Yi Wang, Ying-Cong Chen, Xiangyu Zhang, Jian Sun, Jiaya Jia


Traditional convolution-based generative adversarial networks synthesize images based on hierarchical local operations, where long-range dependencies are implicitly modeled with a Markov chain. This is still not sufficient for categories with complicated structures. In this paper, we characterize long-range dependence with attentive normalization (AN), which is an extension to traditional instance normalization. Specifically, the input feature map is softly divided into several regions based on its internal semantic similarity, each of which is then normalized separately. This enhances consistency between distant regions with semantic correspondence. Compared with self-attention GAN, our attentive normalization does not need to measure correlations between all locations, and thus can be directly applied to large-size feature maps without much computational burden. Extensive experiments on class-conditional image generation and semantic inpainting verify the efficacy of our proposed module.
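The core of attentive normalization is to softly assign every spatial location to one of a few regions and normalize each region with its own statistics. A compact PyTorch sketch of that normalization core (the published module additionally learns affine parameters and uses a more elaborate attention design):

```python
import torch
import torch.nn as nn

class AttentiveNorm2d(nn.Module):
    """Softly splits the feature map into n_regions and normalizes each region
    with its own mean/variance (affine parameters omitted for brevity)."""
    def __init__(self, channels, n_regions=8, eps=1e-5):
        super().__init__()
        self.region_logits = nn.Conv2d(channels, n_regions, kernel_size=1)
        self.eps = eps

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        attn = torch.softmax(self.region_logits(x), dim=1)   # (B, K, H, W) soft region assignment
        a = attn.reshape(b, -1, h * w)                        # (B, K, HW)
        xf = x.reshape(b, c, h * w)                           # (B, C, HW)
        weight = a.sum(dim=-1, keepdim=True) + self.eps       # (B, K, 1) region mass
        mean = torch.bmm(a, xf.transpose(1, 2)) / weight      # (B, K, C) per-region mean
        sq = torch.bmm(a, (xf ** 2).transpose(1, 2)) / weight
        var = (sq - mean ** 2).clamp(min=0.0)                 # per-region variance
        # scatter region statistics back to every pixel via the soft assignments
        mean_map = torch.bmm(a.transpose(1, 2), mean).transpose(1, 2).reshape(b, c, h, w)
        var_map = torch.bmm(a.transpose(1, 2), var).transpose(1, 2).reshape(b, c, h, w)
        return (x - mean_map) / torch.sqrt(var_map + self.eps)
```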
[semantics, attention, relationship, modeling, visual, dependency, relation, natural, work, time, entity] [feature, semantic, module, attentive, instance, table, branch, map, correlation, effectiveness, employed, predicted] [input, adversarial, regional, model, useless] [proposed, figure, method, based, convolutional, spatial, residual, intermediate, quantitative] [image, generative, generation, layout, conditional, fid, inpainting, generator, gan, intra, activated, generated, discriminator, diverse] [normalization, learning, training, arxiv, preprint, learned, neural, compared, deep, batch, class, task, layer, large, lower, network, imagenet, complexity, computational, data, regularization, conducted, computation, soft, randomly, better, distribution] [distant, capture, computed, well, relies]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yi and Chen, Ying-Cong and Zhang, Xiangyu and Sun, Jian and Jia, Jiaya},
  title = {Attentive Normalization for Conditional Image Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SEAN: Image Synthesis With Semantic Region-Adaptive Normalization
Peihao Zhu, Rameen Abdal, Yipeng Qin, Peter Wonka


We propose semantic region-adaptive normalization (SEAN), a simple but effective building block for Generative Adversarial Networks conditioned on segmentation masks that describe the semantic regions in the desired output image. Using SEAN normalization, we can build a network architecture that can control the style of each semantic region individually, e.g., we can specify one style reference image per region. SEAN is better suited to encode, transfer, and synthesize style than the best previous method in terms of reconstruction quality, variability, and visual quality. We evaluate SEAN on multiple datasets and report better quantitative metrics (e.g. FID, PSNR) than the current state of the art. SEAN also pushes the frontier of interactive image editing. We can interactively edit images by changing segmentation masks or the style for any given region. We can also interpolate styles from two reference images per region.
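Per-region style control can be sketched as: average the style-encoder features inside each semantic region to obtain one style code per region, broadcast the codes back through the segmentation mask, and use them to modulate a normalized activation. A hedged simplification in PyTorch (the actual SEAN block also mixes in SPADE-style modulation computed from the mask itself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSEAN(nn.Module):
    def __init__(self, channels, style_dim=512, eps=1e-5):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_gamma = nn.Conv2d(style_dim, channels, kernel_size=1)
        self.to_beta = nn.Conv2d(style_dim, channels, kernel_size=1)
        self.eps = eps

    def forward(self, x, style_feat, seg):
        """x: (B, C, H, W) activations; style_feat: (B, S, H, W) style-encoder features
        of the reference image; seg: (B, K, H, W) one-hot semantic region masks."""
        seg = F.interpolate(seg.float(), size=x.shape[-2:], mode="nearest")
        style_feat = F.interpolate(style_feat, size=x.shape[-2:], mode="bilinear",
                                   align_corners=False)
        area = seg.sum(dim=(2, 3)).clamp(min=self.eps)                         # (B, K)
        # one style code per region: masked average of the style features
        codes = torch.einsum("bkhw,bshw->bks", seg, style_feat) / area.unsqueeze(-1)
        # broadcast the region codes back to a per-pixel style map
        style_map = torch.einsum("bkhw,bks->bshw", seg, codes)
        gamma, beta = self.to_gamma(style_map), self.to_beta(style_map)
        return self.norm(x) * (1 + gamma) + beta
```

Interactive editing then amounts to swapping the code of a single region (e.g. hair) for one extracted from a different reference image.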
[visual, current, work, state, encoding, three] [semantic, segmentation, region, mask, building, table, unified] [input, adversarial, quality, face, datasets, manipulation, noise, trained] [ieee, method, conv, figure, pattern, psnr, block, high, quantitative, ssim, reference, convolutional] [style, image, sean, spade, generative, synthesis, encoder, control, fid, generator, conditional, source, editing, transfer, gans, translation, corresponding, loss, unsupervised, texture, resblk] [normalization, network, architecture, better, neural, best, arxiv, preprint, evaluate, matrix, training, report, learning, set] [computer, vision, conference, reconstruction, rmse, enables, supplementary, measured, international, peter]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Peihao and Abdal, Rameen and Qin, Yipeng and Wonka, Peter},
  title = {SEAN: Image Synthesis With Semantic Region-Adaptive Normalization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Blurry Video Frame Interpolation
Wang Shen, Wenbo Bao, Guangtao Zhai, Li Chen, Xiongkuo Min, Zhiyong Gao


Existing works reduce motion blur and up-convert frame rate in two separate ways: frame deblurring and frame interpolation. However, few studies have approached the joint video enhancement problem, namely synthesizing high-frame-rate clear results from low-frame-rate blurry inputs. In this paper, we propose a blurry video frame interpolation method to reduce motion blur and up-convert frame rate simultaneously. Specifically, we develop a pyramid module to cyclically synthesize clear intermediate frames. The pyramid module features adjustable spatial receptive field and temporal scope, thus contributing to controllable computational complexity and restoration ability. Besides, we propose an inter-pyramid recurrent module to connect sequential models to exploit the temporal relationship. The pyramid module integrates a recurrent module and can thus iteratively synthesize temporally smooth results without significantly increasing the model size. Extensive experimental results demonstrate that our method performs favorably against state-of-the-art methods. The source code and pre-trained model are available at https://github.com/laomao0/BIN.
[frame, video, recurrent, temporal, time, multiple, dataset, srn, evaluation, three, exploit, visual, construct, integrate, current, hidden, unit] [module, pyramid, backbone, propose, cascade, table, propagate] [model] [interpolation, motion, proposed, deblurring, figure, blurry, method, convlstm, intermediate, psnr, flow, blur, scale, optical, jin, dain, ssim, edvr, performs, convolutional, existing, clear, pixel, restoration, deblurred, based, residual, spatial, shutter, exposure, adaptive, blurred, enhancement, receptive, favorably, interpolated, dynamic] [consistency, cycle, image, loss, synthesize, generate, consists] [network, rate, training, better, reduce, learning, performance, metric, neural, problem, formulate, compared] [joint, smoothness, dense, compare, camera]
@InProceedings{Shen_2020_CVPR,
  author = {Shen, Wang and Bao, Wenbo and Zhai, Guangtao and Chen, Li and Min, Xiongkuo and Gao, Zhiyong},
  title = {Blurry Video Frame Interpolation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Physics-Guided Face Relighting Under Directional Light
Thomas Nestmeyer, Jean-Francois Lalonde, Iain Matthews, Andreas Lehrmann


Relighting is an essential step in realistically transferring objects from a captured image into another environment. For example, authentic telepresence in Augmented Reality requires faces to be displayed and relit consistent with the observer's scene lighting. We investigate end-to-end deep learning architectures that both de-light and relight an image of a human face. Our model decomposes the input image into intrinsic components according to a diffuse physics-based image formation model. We enable non-diffuse effects including cast shadows and specular highlights by predicting a residual correction to the diffuse render. To train and evaluate our model, we collected a portrait database of 21 subjects with various expressions and poses. Each sample is captured in a controlled light stage setup with 32 individual light sources. Our method creates precise and believable relighting results and generalizes to complex illumination conditions and challenging poses, including when the subject is not looking straight at the camera.
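The de-light/relight pipeline rests on a diffuse (Lambertian) image-formation model, with a learned residual accounting for cast shadows and specular highlights that the diffuse term cannot explain. The diffuse part is only a few lines; a numpy sketch (the network-predicted residual is indicated but omitted):

```python
import numpy as np

def diffuse_render(albedo, normals, light_dir):
    """albedo: (H, W, 3); normals: (H, W, 3) unit surface normals;
    light_dir: (3,) unit vector of a directional light."""
    shading = np.clip(normals @ light_dir, 0.0, None)     # (H, W) Lambertian n.l, clamped at 0
    diffuse = albedo * shading[..., None]
    # the full model adds a network-predicted residual for cast shadows / speculars:
    # relit = np.clip(diffuse + residual, 0.0, 1.0)
    return diffuse
```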
[environment, evaluation, structured, explicit] [including, stage, table] [model, input, face, access, trained, strong, visibility] [light, ieee, illumination, pattern, residual, captured, formation, quantitative, figure, assumption, based, output, comparison] [image, source, portrait, target, loss, qualitative, corresponding, learn, unknown, desired, translation, generator] [data, deep, training, learning, neural, set, process, test, architecture, consider] [relighting, intrinsic, diffuse, lighting, conference, computer, acm, directional, single, reflectance, vision, approach, shading, relit, rendering, relight, albedo, sfsnet, cast, ground, truth, kalyan, dssim, david, complex, point, photometric, international, scene, human, surface, spherical, well]
@InProceedings{Nestmeyer_2020_CVPR,
  author = {Nestmeyer, Thomas and Lalonde, Jean-Francois and Matthews, Iain and Lehrmann, Andreas},
  title = {Learning Physics-Guided Face Relighting Under Directional Light},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Disentangled Image Generation Through Structured Noise Injection
Yazeed Alharbi, Peter Wonka


We explore different design choices for injecting noise into generative adversarial networks (GANs) with the goal of disentangling the latent space. Instead of traditional approaches, we propose feeding multiple noise codes through separate fully-connected layers. The aim is to restrict the influence of each noise code to specific parts of the generated image. We show that disentanglement in the first layer of the generator network leads to disentanglement in the generated image. Through a grid-based structure, we achieve several aspects of disentanglement without complicating the network architecture and without requiring labels. We achieve spatial disentanglement, scale-space disentanglement, and disentanglement of the foreground object from the background style, allowing fine-grained control over the generated images. Examples include changing facial expressions in face images, changing beak length in bird images, and changing car dimensions in car images. This empirically leads to better disentanglement scores than state-of-the-art methods on the FFHQ dataset.
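The main design choice is to route independent noise codes through separate fully-connected layers into distinct cells of the generator's first spatial tensor, so each code can only influence its own part of the image. An illustrative PyTorch sketch of such an input layer (a simplification; the paper also uses shared and global codes across cells):

```python
import torch
import torch.nn as nn

class GridNoiseInput(nn.Module):
    """Maps one independent noise code per grid cell to the generator's
    initial (channels x grid x grid) activation tensor."""
    def __init__(self, grid=4, code_dim=32, channels=512):
        super().__init__()
        self.grid, self.channels = grid, channels
        self.cell_fc = nn.ModuleList(
            [nn.Linear(code_dim, channels) for _ in range(grid * grid)])

    def forward(self, codes):                                # codes: (B, grid*grid, code_dim)
        cells = [fc(codes[:, i]) for i, fc in enumerate(self.cell_fc)]   # each (B, C)
        x = torch.stack(cells, dim=2)                        # (B, C, grid*grid)
        return x.reshape(-1, self.channels, self.grid, self.grid)

# resampling codes[:, i] for a single cell i should change only the corresponding
# image region, which is the spatial disentanglement being targeted
```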
[starting, length, previous, dataset, structured] [global, background, foreground, map, propose, feature, main, object, car] [noise, input, change, face, injection, adversarial, facial, degree, datasets, quality, mouth] [method, tensor, spatial, figure, ieee, cell, high, pattern, traditional, pixel, color] [code, style, disentanglement, image, changing, generated, shared, specific, latent, stylegan, mapping, generate, generation, control, independent, generative, ffhq, row, disentangled, unsupervised, content, separability, requiring, separate] [network, linear, general, find, arxiv, preprint, dimension, path, design, layer, earlier, learning, lower, training, achieve] [local, conference, computer, vision, pose, structure, approach, second, single]
@InProceedings{Alharbi_2020_CVPR,
  author = {Alharbi, Yazeed and Wonka, Peter},
  title = {Disentangled Image Generation Through Structured Noise Injection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-Domain Correspondence Learning for Exemplar-Based Image Translation
Pan Zhang, Bo Zhang, Dong Chen, Lu Yuan, Fang Wen


We present a general framework for exemplar-based image translation, which synthesizes a photo-realistic image from the input in a distinct domain (e.g., semantic segmentation mask, or edge map, or pose keypoints), given an exemplar image. The output has the style (e.g., color, texture) in consistency with the semantically corresponding objects in the exemplar. We propose to jointly learn the cross-domain correspondence and the image translation, where both tasks facilitate each other and thus can be learned with weak supervision. The images from distinct domains are first aligned to an intermediate domain where dense correspondence is established. Then, the network synthesizes images based on the appearance of semantically corresponding patches in the exemplar. We demonstrate the effectiveness of our approach on several image translation tasks. Our method significantly outperforms state-of-the-art methods in terms of image quality, with the image style faithful to the exemplar while maintaining semantic consistency. Moreover, we show the utility of our method for several applications.
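Once both the input and the exemplar are embedded in a shared intermediate domain, dense correspondence reduces to a softmax over feature similarities, and the exemplar can be warped to the input layout by gathering its colours with those weights. A hedged PyTorch sketch of this warping step (temperature, shapes and the coarse resolution at which it would run are illustrative):

```python
import torch
import torch.nn.functional as F

def warp_exemplar(feat_in, feat_ex, exemplar, tau=0.01):
    """feat_in: (B, C, H, W) features of the input (e.g. a segmentation mask),
    feat_ex: (B, C, H, W) features of the exemplar, both mapped to a shared domain;
    exemplar: (B, 3, H, W) exemplar image to be warped to the input layout."""
    b, c, h, w = feat_in.shape
    fi = F.normalize(feat_in.reshape(b, c, -1), dim=1)       # cosine-normalized features
    fe = F.normalize(feat_ex.reshape(b, c, -1), dim=1)
    corr = torch.bmm(fi.transpose(1, 2), fe)                 # (B, HW, HW) similarity matrix
    attn = torch.softmax(corr / tau, dim=-1)                 # soft dense correspondence
    ex = exemplar.reshape(b, 3, -1)                          # (B, 3, HW)
    warped = torch.bmm(ex, attn.transpose(1, 2))             # exemplar colours gathered per input location
    return warped.reshape(b, 3, h, w)
```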
[semantics, natural] [semantic, propose, feature, edge, lreg, mask, table, segmentation, global, weak] [input, adversarial, quality, face] [method, figure, ieee, pattern, output, warped, color, comparison, based] [image, exemplar, translation, style, domain, synthesis, loss, learn, semantically, transfer, corresponding, fid, makeup, spade, conditional, pretrained, representation, learns, train, munit, generate, mapping, synthesizes, consistency, aligned, generative, alignment, synthesized, texture, swd, unsupervised, photorealistic, distinct] [network, arxiv, preprint, normalization, learning, neural, general, best, deep, layer, processing, data, find, large, learned] [correspondence, conference, computer, vision, dense, international, matching, pose, sparse, refer]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Pan and Zhang, Bo and Chen, Dong and Yuan, Lu and Wen, Fang},
  title = {Cross-Domain Correspondence Learning for Exemplar-Based Image Translation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning
Yu Deng, Jiaolong Yang, Dong Chen, Fang Wen, Xin Tong


We propose an approach for face image generation of virtual people with disentangled, precisely-controllable latent representations for the identity of non-existing people, expression, pose, and illumination. We embed 3D priors into adversarial learning and train the network to imitate the image formation of an analytic 3D face deformation and rendering process. To deal with the generation freedom induced by the domain gap between real and rendered faces, we further introduce contrastive learning to promote disentanglement by comparing pairs of generated images. Experiments show that through our imitative-contrastive learning, the factor variations are very well disentangled and the properties of a generated face can be precisely controlled. We also analyze the learned latent space and present several meaningful properties supporting factor disentanglement. Our method can also be used to embed real images into the disentangled latent space. We hope our method can provide new insight into the relationship between physical properties and deep image synthesis.
[embedding, embed, recognition] [feature, denotes, table, score] [face, identity, adversarial, expression, input, difference, model, dong, fang, targeted, variation, facial, trained] [ieee, method, pattern, figure, illumination, coefficient] [image, latent, generation, disentangled, real, generated, disentanglement, variable, factor, gan, representation, generative, imitative, train, generate, domain, gap, stylegan, generator, loss, synthesis, fid, control, conditional, editing, changing, independent, meaningful, gans, unsupervised] [learning, contrastive, space, network, training, random, deep, neural, set, randomly, arxiv, preprint, learned, processing, paper] [conference, computer, pose, vision, rendered, international, lighting, direction, property, well, reconstruction, virtual, rendering, approach]
@InProceedings{Deng_2020_CVPR,
  author = {Deng, Yu and Yang, Jiaolong and Chen, Dong and Wen, Fang and Tong, Xin},
  title = {Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Single Image Reflection Removal With Physically-Based Training Images
Soomin Kim, Yuchi Huo, Sung-Eui Yoon


Recently, deep learning-based single-image reflection separation methods have been widely explored. To support the learning approach, large numbers of training image pairs (i.e., with and without reflections) have been synthesized in various ways, yet these are far from physically based. In this paper, physically based rendering is used to faithfully synthesize the required training images, and a corresponding network structure and loss term are proposed. We utilize existing RGBD/RGB images to estimate meshes, then physically simulate the light transport between meshes, glass, and lens with path tracing to synthesize training data, which successfully reproduces the spatially variant, anisotropic visual effects of glass reflection. To better guide the separation, we additionally introduce a backtrack network (BT-net) for backtracking the reflections, which removes the complicated ghosting, attenuation, blurring and defocus effects of the glass/lens. This provides a priori information about the reflection before distortion. The proposed method, combining this additional a priori information with physically simulated training data, is validated on various real reflection images and shows visually pleasing results and numerical advantages over state-of-the-art techniques.
[dataset, visual, order, transportation] [predicted, table, feature, focus] [input, model, physical, adversarial, trained, wild] [reflection, removal, priori, ieee, glass, figure, separation, pattern, method, transmission, psnr, ssim, light, proposed, zhang, based, existing, lens, simulate, spatially, backtrack, captured, removing, faithful, posteriori, sir, comparison, remove, bdn, blurred] [image, loss, real, synthesized, synthesize, wen, utilize, generated, train, synthesizing] [training, network, deep, test, learning, data, path, set, better, variant] [scene, computer, front, rendered, single, conference, rendering, vision, depth, physically, additional, geometry, capture, render, rgb, international, term, indoor, ground, error, tracing]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Soomin and Huo, Yuchi and Yoon, Sung-Eui},
  title = {Single Image Reflection Removal With Physically-Based Training Images},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SketchyCOCO: Image Generation From Freehand Scene Sketches
Chengying Gao, Qi Liu, Qi Xu, Limin Wang, Jianzhuang Liu, Changqing Zou


We introduce the first method for automatic image generation from scene-level freehand sketches. Our model allows for controllable image generation by specifying the synthesis goal via freehand sketches. The key contribution is an attribute vector bridged Generative Adversarial Network called EdgeGAN, which supports high visual-quality object-level image content generation without using freehand sketches as training data. We have built a large-scale composite dataset called SketchyCOCO to support and evaluate the solution. We validate our approach on the tasks of both object-level and scene-level image generation on SketchyCOCO. Through quantitative, qualitative results, human evaluation and ablation studies, we demonstrate the method's capacity to generate realistic complex scene-level images from various freehand sketches.
[dataset, natural, evaluation] [foreground, edge, background, object, semantic, map, segmentation, coco, stuff, instance, including, stage] [model, input, adversarial, trained, study] [ieee, method, figure, proposed, called, based, quantitative, pattern, output, comparison] [image, generation, freehand, sketch, generated, edgegan, sketchygan, attribute, realistic, gaugan, ashual, faithfulness, generate, encoder, realism, sketchycoco, fid, generative, fake, train, synthesis, generating, contextualgan, qualitative, corresponding, layout, content, gans, loss, mapping, gan] [training, vector, data, test, neural, problem, compared, network, better, arxiv, preprint] [scene, computer, conference, approach, ground, vision, single, truth, international, human, left]
@InProceedings{Gao_2020_CVPR,
  author = {Gao, Chengying and Liu, Qi and Xu, Qi and Wang, Limin and Liu, Jianzhuang and Zou, Changqing},
  title = {SketchyCOCO: Image Generation From Freehand Scene Sketches},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Image Based Virtual Try-On Network From Unpaired Data
Assaf Neuberger, Eran Borenstein, Bar Hilleli, Eduard Oks, Sharon Alpert


This paper presents a new image-based virtual try-on approach (Outfit-VITON) that helps visualize how a composition of clothing items selected from various reference images forms a cohesive outfit on a person in a query image. Our algorithm has two distinctive properties. First, it is inexpensive, as it simply requires a large set of single (non-corresponding) images (both real and catalog) of people wearing various garments without explicit 3D information. The training phase requires only single images, eliminating the need for manually creating image pairs, where one image shows a person wearing a particular garment and the other shows the same catalog garment alone. Second, it can synthesize images of multiple garments composed into a single, coherent outfit; and it enables control of the type of garments rendered in the final outfit. Once trained, our approach can then synthesize a cohesive outfit from multiple images of clothed human models, while fitting the outfit to the body shape and pose of the query person. An online optimization step takes care of fine details such as intricate textures and logos. Quantitative and qualitative evaluations on an image dataset containing large shape and style variations demonstrate superior accuracy compared to existing state-of-the-art methods, especially when dealing with highly detailed garments.
[multiple, wearing, step, evaluation, people, natural] [segmentation, map, feature, semantic, mask, score, main] [query, garment, original, model, adversarial, outfit, trained, gapp, type, input, fashion, eshape, gshape] [reference, pattern, output, ieee, method, phase, quantitative, figure, result, comparison] [image, appearance, loss, generation, generated, person, generator, autoencoder, synthesis, conditional, corresponding, discriminator, fid, generative, transfer, generate, gan, generates, realistic, gans, control, qualitative, generating, train] [online, training, optimization, network, selected, arxiv, preprint, set, data, neural, large, test, requires, vector, process, note] [shape, virtual, body, conference, computer, human, single, vision, pose, approach, geometric]
@InProceedings{Neuberger_2020_CVPR,
  author = {Neuberger, Assaf and Borenstein, Eran and Hilleli, Bar and Oks, Eduard and Alpert, Sharon},
  title = {Image Based Virtual Try-On Network From Unpaired Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PSGAN: Pose and Expression Robust Spatial-Aware GAN for Customizable Makeup Transfer
Wentao Jiang, Si Liu, Chen Gao, Jie Cao, Ran He, Jiashi Feng, Shuicheng Yan


In this paper, we address the makeup transfer task, which aims to transfer the makeup from a reference image to a source image. Existing methods have achieved promising progress in constrained scenarios, but transferring between images with large pose and expression differences is still challenging. Besides, they cannot realize customizable transfer that allows a controllable shade of makeup or specifies the part to transfer, which limits their applications. To address these issues, we propose Pose and expression robust Spatial-aware GAN (PSGAN). It first utilizes Makeup Distill Network to disentangle the makeup of the reference image as two spatial-aware makeup matrices. Then, Attentive Makeup Morphing module is introduced to specify how the makeup of a pixel in the source image is morphed from the reference image. With the makeup matrices and the source image, Makeup Apply Network is used to perform makeup transfer. Our PSGAN not only achieves state-of-the-art results even when large pose and expression differences exist but also is able to perform partial and shade-controllable makeup transfer. Both the code and a newly collected dataset containing facial images with various poses and expressions will be available at https://github.com/wtjiang98/PSGAN.
[visual, attention, lip, dataset, red, considering] [feature, module, map, attentive, parsing, framework, mdnet, propose, apply, region] [facial, face, expression, adversarial, frontal, robust, model, neutral, skin] [reference, figure, proposed, pixel, method, existing, applying, coefficient, color, indicate] [makeup, transfer, source, image, psgan, style, transferred, perform, loss, shade, transferring, manet, beautyglow, ladn, morphing, realize, morphed, code, generated, ladv, customizable, row, corresponding, generative, cyclegan, specific] [network, bottleneck, test, matrix, set, neural, calculating, applied, weight, large, deep, distill, general] [amm, pose, partial, relative, position, well, left, point]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Wentao and Liu, Si and Gao, Chen and Cao, Jie and He, Ran and Feng, Jiashi and Yan, Shuicheng},
  title = {PSGAN: Pose and Expression Robust Spatial-Aware GAN for Customizable Makeup Transfer},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild
Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, Stefanos Zafeiriou


Though tremendous strides have been made in uncontrolled face detection, accurate and efficient 2D face alignment and 3D face reconstruction in-the-wild remain an open challenge. In this paper, we present a novel single-shot, multi-level face localisation method, named RetinaFace, which unifies face box prediction, 2D facial landmark localisation and 3D vertices regression under one common target: point regression on the image plane. To fill the data gap, we manually annotated five facial landmarks on the WIDER FACE dataset and employed a semi-automatic annotation pipeline to generate 3D vertices for face images from the WIDER FACE, AFLW and FDDB datasets. Based on extra annotations, we propose a mutually beneficial regression target for 3D face reconstruction, that is predicting 3D vertices projected on the image plane constrained by a common 3D topology. The proposed 3D face reconstruction branch can be easily incorporated, without any optimisation difficulty, in parallel with the existing box and 2D landmark regression branches during joint training. Extensive experimental results show that RetinaFace can simultaneously achieve stable face detection, accurate 2D face alignment and robust 3D face reconstruction while being efficient through single-shot inference.
[dataset, context, prediction, evaluation, beneficial] [regression, detection, box, feature, bounding, cascade, head, employ, semantic, object, segmentation, false, pyramid, module, annotation, ross, branch, edge, challenging, anchor, predicted, positive, iou, hard] [face, facial, localisation, retinaface, landmark, wider, stefanos, jiankang, robust, improve, model, aflw, tiny, prn, densereg] [proposed, figure, method, based, high, scale, convolutional] [image, loss, alignment] [training, network, set, performance, learning, deep, data, number, size, average, subset, validation] [pose, reconstruction, joint, mesh, accurate, dense, estimation, vertex, point, single, directly, predicts, indirect, regress, shape, topology]
@InProceedings{Deng_2020_CVPR,
  author = {Deng, Jiankang and Guo, Jia and Ververas, Evangelos and Kotsia, Irene and Zafeiriou, Stefanos},
  title = {RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Semantic Image Manipulation Using Scene Graphs
Helisa Dhamo, Azade Farshad, Iro Laina, Nassir Navab, Gregory D. Hager, Federico Tombari, Christian Rupprecht


Image manipulation can be considered a special case of image generation where the image to be produced is a modification of an existing image. Image generation and manipulation have been, for the most part, tasks that operate on raw pixels. However, the remarkable progress in learning rich image and object representations has opened the way for tasks such as text-to-image or layout-to-image generation that are mainly driven by semantics. In our work, we address the novel problem of image manipulation from scene graphs, in which a user can edit images by merely applying changes in the nodes or edges of a semantic graph that is generated from the image. Our goal is to encode image information in a given constellation and from there on generate new constellations, such as replacing objects or even changing relationships between objects, while respecting the semantics and style from the original image. We introduce a spatio-semantic scene graph network that does not require direct supervision for constellation changes or image edits. This makes it possible to train the system from existing real-world datasets with no additional annotation effort.
[graph, visual, relationship, node, crn, goal, work, natural, semantics] [object, semantic, feature, bounding, mask, box, region, predicted, supervision, interactive, table] [original, manipulation, adversarial, model, change, input, masking, query] [method, ieee, pattern, figure, spatial, removal, existing] [image, generation, source, editing, conditional, generative, synthesis, generate, user, representation, layout, target, modified, changing, spade, modification, fid, generated, corresponding, loss, content] [neural, training, arxiv, preprint, task, network, processing, deep, data, problem, evaluate, learning] [scene, conference, computer, vision, approach, international, ground, novel, full, reconstruction, truth, directly, european, system, allows, compare, require]
@InProceedings{Dhamo_2020_CVPR,
  author = {Dhamo, Helisa and Farshad, Azade and Laina, Iro and Navab, Nassir and Hager, Gregory D. and Tombari, Federico and Rupprecht, Christian},
  title = {Semantic Image Manipulation Using Scene Graphs},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Stochastic Conditioning Scheme for Diverse Human Motion Prediction
Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Lars Petersson, Stephen Gould


Human motion prediction, the task of predicting future 3D human poses given a sequence of observed ones, has been mostly treated as a deterministic problem. However, human motion is a stochastic process: Given an observed sequence of poses, multiple future motions are plausible. Existing approaches to modeling this stochasticity typically combine a random noise vector with information about the previous poses. This combination, however, is done in a deterministic manner, which gives the network the flexibility to learn to ignore the random noise. Alternatively, in this paper, we propose to stochastically combine the root of variations with previous pose information, so as to force the model to take the noise into account. We exploit this idea for motion prediction by incorporating it into a recurrent encoder-decoder network with a conditional variational autoencoder block that learns to exploit the perturbations. Our experiments on two large-scale motion prediction datasets demonstrate that our model yields high-quality pose sequences that are much more diverse than those from state-of-the-art stochastic motion prediction techniques.
[prediction, hidden, future, state, evaluation, rnn, conditioning, multiple, observed, ignore, decoder, recurrent, sequence, barsoum, time, action, modeling] [propose, table] [quality, model, perturbation, noise, condition, input] [motion, method, ieee, output, high, comparison, pattern] [diversity, generated, diverse, generate, loss, learn, encoder, variable, generating, conditional, cvae, corresponding, yan, latent, introduce, walker, generative] [random, stochastic, training, learning, deterministic, vector, sampling, arxiv, preprint, note, neural, evaluate, standard, sampled, network, large, sample, report, classifier, discussed] [human, approach, conference, pose, computer, vision, international, single, lars, root, term, compare, european, rely]
@InProceedings{Aliakbarian_2020_CVPR,
  author = {Aliakbarian, Sadegh and Saleh, Fatemeh Sadat and Salzmann, Mathieu and Petersson, Lars and Gould, Stephen},
  title = {A Stochastic Conditioning Scheme for Diverse Human Motion Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Transferring Dense Pose to Proximal Animal Classes
Artsiom Sanakoyeu, Vasil Khalidov, Maureen S. McCarthy, Andrea Vedaldi, Natalia Neverova


Recent contributions have demonstrated that it is possible to recognize the pose of humans densely and accurately given a large dataset of poses annotated in detail. In principle, the same approach could be extended to any animal class, but the effort required for collecting new annotations for each case makes this strategy impractical, despite important applications in natural conservation, science and business. We show that, at least for proximal animal classes such as chimpanzees, it is possible to transfer the knowledge existing in dense pose recognition for humans, as well as in more general object detectors and segmenters, to the problem of dense pose recognition in other classes. We do this by (1) establishing a DensePose model for the new animal which is also geometrically aligned to humans (2) introducing a multi-head R-CNN architecture that facilitates transfer of multiple recognition tasks between classes, (3) finding which combination of known classes can be transferred most effectively to the new animal and (4) using self-calibrated uncertainty heads to generate pseudo-labels graded by quality for training a model for this class. We also introduce two benchmark datasets labelled in the manner of DensePose for the class chimpanzee and use them to evaluate our approach, showing excellent transfer learning performance.
[dataset, recognition, provide, multiple, work, visual, prediction] [object, segmentation, coco, mask, annotated, detection, bounding, semantic, head, instance, supervision] [model, animal, densepose, trained, chimpanzee, datasets, original, collected, collect] [existing, combination, figure, motion, based, chart] [image, transfer, unsupervised, target, learn, mapping, person, domain, adaptation] [learning, class, data, training, network, performance, sampling, number, deep, teacher, student, problem, large, neural, classification, sampled, best, task, unlabeled, top] [pose, dense, human, body, estimation, uncertainty, shape, andrea, mesh, michael, single, computer, well, vision, keypoints, smpl, detailed]
@InProceedings{Sanakoyeu_2020_CVPR,
  author = {Sanakoyeu, Artsiom and Khalidov, Vasil and McCarthy, Maureen S. and Vedaldi, Andrea and Neverova, Natalia},
  title = {Transferring Dense Pose to Proximal Animal Classes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Weakly-Supervised 3D Human Pose Learning via Multi-View Images in the Wild
Umar Iqbal, Pavlo Molchanov, Jan Kautz


One major challenge for monocular 3D human pose estimation in-the-wild is the acquisition of training data that contains unconstrained images annotated with accurate 3D poses. In this paper, we address this challenge by proposing a weakly-supervised approach that does not require 3D annotations and learns to estimate 3D poses from unlabeled multi-view data, which can be acquired easily in in-the-wild environments. We propose a novel end-to-end learning framework that enables weakly-supervised training using multi-view consistency. Since multi-view consistency is prone to degenerated solutions, we adopt a 2.5D pose representation and propose a novel objective function that can only be minimized when the predictions of the trained model are consistent and plausible across all camera views. We evaluate our proposed approach on two large scale datasets (Human3.6M and MPII-INF-3DHP) where it achieves state-of-the-art performance among semi-/weakly-supervised methods.
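Multi-view consistency can be made concrete as: the 3D poses predicted independently from two views of the same instant should agree up to a rigid transform. A hedged numpy sketch of such a consistency term using Procrustes alignment (the paper's actual objective operates on a scale-normalized 2.5D representation and includes additional terms to avoid degenerate solutions):

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rigid transform (rotation + translation) mapping src (J, 3) onto dst (J, 3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    S, D = src - mu_s, dst - mu_d
    U, _, Vt = np.linalg.svd(S.T @ D)
    sign = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    return S @ R.T + mu_d

def multiview_consistency(poses):
    """poses: list of (J, 3) predictions of the same frame from different cameras."""
    losses = []
    for i in range(len(poses)):
        for j in range(len(poses)):
            if i == j:
                continue
            aligned = rigid_align(poses[i], poses[j])
            losses.append(np.linalg.norm(aligned - poses[j], axis=-1).mean())
    return float(np.mean(losses))
```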
[dataset, people, length, evaluation, provide] [supervision, weak, heatmap, predicted, annotated, confidence, propose, adopt] [model, trained, datasets, heatmaps, improve, wild, unconstrained] [proposed, scale, method, existing] [loss, consistency, representation, image, learn, train, latent] [training, data, learning, network, unlabeled, large, performance, normalized, deep, neural, better, objective, evaluate] [pose, human, estimation, approach, camera, body, joint, depth, error, assume, monocular, additional, lmc, limb, require, rgb, reconstructed, estimated, relative, mqc, estimate, single, rcc, mpii, novel, collection, rigid, multiview, reconstruction, kocabas, degenerated, directly]
@InProceedings{Iqbal_2020_CVPR,
  author = {Iqbal, Umar and Molchanov, Pavlo and Kautz, Jan},
  title = {Weakly-Supervised 3D Human Pose Learning via Multi-View Images in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
VIBE: Video Inference for Human Body Pose and Shape Estimation
Muhammed Kocabas, Nikos Athanasiou, Michael J. Black


Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose "Video Inference for Body Pose and Shape Estimation" (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a novel temporal network architecture with a self-attention mechanism and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance. Code and pretrained models are available at https://github.com/mkocabas/VIBE
[temporal, video, sequence, attention, previous, gru, dataset, hidden, work, frame, recurrent, static, mechanism, time] [challenging, feature, final, table, employ, improves, predicted] [model, adversarial, datasets, input, trained] [motion, ieee, pattern, method, figure, output] [discriminator, loss, pretrained, real, produce, learn, encoder, representation, train, generator, learns, generative] [learning, training, neural, machine, max, deep, performance, network, architecture] [pose, human, conference, body, shape, computer, vibe, vision, estimation, international, michael, single, amass, smpl, joint, european, accurate, keypoint, mesh, mposer, capture, mpjpe, estimate, monocular, keypoints, error, kolotouros, hmr]
@InProceedings{Kocabas_2020_CVPR,
  author = {Kocabas, Muhammed and Athanasiou, Nikos and Black, Michael J.},
  title = {VIBE: Video Inference for Human Body Pose and Shape Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
G3AN: Disentangling Appearance and Motion for Video Generation
Yaohui Wang, Piotr Bilinski, Francois Bremond, Antitza Dantcheva


Creating realistic human videos entails the challenge of simultaneously generating both appearance and motion. To tackle this challenge, we introduce G3AN, a novel spatio-temporal generative model, which seeks to capture the distribution of high dimensional video data and to model appearance and motion in a disentangled manner. The latter is achieved by decomposing appearance and motion in a three-stream Generator, where the main stream aims to model spatio-temporal consistency, whereas the two auxiliary streams augment the main stream with multi-scale appearance and motion features, respectively. An extensive quantitative and qualitative analysis shows that our model systematically and significantly outperforms state-of-the-art methods on the facial expression datasets MUG and UvA-NEMO, as well as on the Weizmann and UCF101 human action datasets. Additional analysis on the learned latent representations confirms the successful decomposition of appearance and motion.
[video, stream, temporal, modeling, three, factorized, frame, dataset, evaluation, order, work, aiming] [feature, table, main, level] [adversarial, input, noise, facial, model, manipulation, quality, datasets, expression] [motion, figure, proposed, method, removing, spatial, convolution, comparison, high, quantitative, based] [appearance, generated, generation, generative, fid, latent, mocogan, representation, generate, image, transposed, disentangled, weizmann, generating, generator, realistic, conditional, tgan, gan, real, vgan, fsn, ftn, ability] [learning, dimension, report, observe, training, note, indicates, arxiv, preprint, distribution, data, set] [human, well, novel, additional, decomposition]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yaohui and Bilinski, Piotr and Bremond, Francois and Dantcheva, Antitza},
  title = {G3AN: Disentangling Appearance and Motion for Video Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Domain Adaptive Image-to-Image Translation
Ying-Cong Chen, Xiaogang Xu, Jiaya Jia


Unpaired image-to-image translation (I2I) has achieved great success in various applications. However, its generalization capacity is still an open question. In this paper, we show that existing I2I models do not generalize well for samples outside the training domain. The cause is twofold. First, an I2I model may not work well when testing samples are beyond its valid input domain. Second, results could be unreliable if the expected output is far from what the model was trained to produce. To deal with these issues, we propose the Domain Adaptive Image-To-Image translation (DAI2I) framework that adapts an I2I model for out-of-domain samples. Our framework introduces two sub-modules -- one maps testing samples to the valid input domain of the I2I model, and the other transforms the output of the I2I model to expected results. Extensive experiments manifest that our framework improves the capacity of existing I2I models, allowing them to handle samples that are distinctively different from their primary targets.
[dataset, work, relation] [framework, including, propose, table, feature] [model, input, expression, trained, face, adversarial, quality, datasets, testing] [output, ieee, valid, pattern, adaptive, existing, perceptual, applying, figure, assumption, method, removing] [domain, target, image, cat, translation, attribute, loss, style, adaptation, stargan, analogy, lada, oil, sketch, painting, smiling, lrec, stylized, fid, unsupervised, translate, adain, transfer, introduce, source, row, generative, unpaired, mapping, train, reconstructor, address, generate, lgan] [base, training, deep, network, learning, neural, note, expected, normalization, space, machine, min] [conference, computer, human, international, vision, well, handle, directly, approach, european, novel]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Ying-Cong and Xu, Xiaogang and Jia, Jiaya},
  title = {Domain Adaptive Image-to-Image Translation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GAN Compression: Efficient Architectures for Interactive Conditional GANs
Muyang Li, Ji Lin, Yaoyao Ding, Zhijian Liu, Jun-Yan Zhu, Song Han


Conditional Generative Adversarial Networks (cGANs) have enabled controllable image synthesis for many computer vision and graphics applications. However, recent cGANs are 1-2 orders of magnitude more computationally-intensive than modern recognition CNNs. For example, GauGAN consumes 281G MACs per image, compared to 0.44G MACs for MobileNet-v3, making it difficult for interactive deployment. In this work, we propose a general-purpose compression framework for reducing the inference time and model size of the generator in cGANs. Directly applying existing CNNs compression methods yields poor performance due to the difficulty of GAN training and the differences in generator architectures. We address these challenges in two ways. First, to stabilize the GAN training, we transfer knowledge of multiple intermediate representations of the original model to its compressed model, and unify unpaired and paired learning. Second, instead of reusing existing CNN designs, our method automatically finds efficient architectures via neural architecture search (NAS). To accelerate the search process, we decouple the model training and architecture search via weight sharing. Experiments demonstrate the effectiveness of our method across different supervision settings (paired and unpaired), model architectures, and learning methods (e.g., pix2pix, GauGAN, CycleGAN). Without losing image quality, we reduce the computation of CycleGAN by more than 20x and GauGAN by 9x, paving the way for interactive image synthesis. The code and demo are publicly available.
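As an illustration of the intermediate-representation transfer the abstract mentions, here is a rough sketch, under assumed layer pairings and toy channel sizes, of distilling teacher generator features into a smaller student generator; the learned 1x1 convolutions that lift student channels to teacher channels are an assumption for this example.

```python
# Hypothetical sketch of intermediate-feature distillation between generators.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_loss(student_feats, teacher_feats, mappers):
    """student_feats/teacher_feats: lists of feature maps from paired layers.
    mappers: 1x1 convs that lift student channels to teacher channels."""
    loss = torch.zeros(())
    for fs, ft, m in zip(student_feats, teacher_feats, mappers):
        loss = loss + F.mse_loss(m(fs), ft)
    return loss

# toy usage: one paired layer, student has 16 channels, teacher has 64
mapper = nn.Conv2d(16, 64, kernel_size=1)
fs = torch.randn(1, 16, 32, 32)
ft = torch.randn(1, 64, 32, 32)
print(distill_loss([fs], [ft], [mapper]).item())
```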
[dataset, recognition, visual] [interactive, semantic, decouple, edge, segmentation, effectiveness] [model, original, adversarial, trained, input, change] [compression, method, channel, figure, compressed, output, proposed, convolution, existing, intermediate, convolutional] [generator, conditional, image, unpaired, paired, gan, gans, cyclegan, generative, fid, gaugan, pseudo, generated, target, discriminator, translation, synthesis, address] [training, learning, neural, architecture, computation, teacher, performance, search, efficient, knowledge, student, network, better, arxiv, deep, number, distillation, song, reduce, evaluate, preprint, compared, compressing, size, find, data, pruning, best, design, large, andrew, inference, weight, layer, objective, reduction] [computer, require]
@InProceedings{Li_2020_CVPR,
  author = {Li, Muyang and Lin, Ji and Ding, Yaoyao and Liu, Zhijian and Zhu, Jun-Yan and Han, Song},
  title = {GAN Compression: Efficient Architectures for Interactive Conditional GANs},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Searching Central Difference Convolutional Networks for Face Anti-Spoofing
Zitong Yu, Chenxu Zhao, Zezheng Wang, Yunxiao Qin, Zhuo Su, Xiaobai Li, Feng Zhou, Guoying Zhao


Face anti-spoofing (FAS) plays a vital role in face recognition systems. Most state-of-the-art FAS methods 1) rely on stacked convolutions and expert-designed networks, which are weak at describing detailed fine-grained information and easily become ineffective when the environment varies (e.g., different illumination), and 2) prefer to use long sequences as input to extract dynamic features, making them difficult to deploy in scenarios that need a quick response. Here we propose a novel frame-level FAS method based on Central Difference Convolution (CDC), which is able to capture intrinsic detailed patterns via aggregating both intensity and gradient information. A network built with CDC, called the Central Difference Convolutional Network (CDCN), is able to provide more robust modeling capacity than its counterpart built with vanilla convolution. Furthermore, over a specifically designed CDC search space, Neural Architecture Search (NAS) is utilized to discover a more powerful network structure (CDCN++), which can be assembled with a Multiscale Attention Fusion Module (MAFM) for further boosting performance. Comprehensive experiments are performed on six benchmark datasets to show that 1) the proposed method not only achieves superior performance on intra-dataset testing (especially 0.2% ACER in Protocol-1 of the OULU-NPU dataset), and 2) it also generalizes well on cross-dataset testing (particularly 6.5% HTER from CASIA-MFSD to Replay-Attack datasets). The codes are available at https://github.com/ZitongYu/CDCN.
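The central difference convolution can be summarized as blending a vanilla convolution with a central-difference term weighted by a hyperparameter theta; the sketch below illustrates one common way to realize this (the kernel size, theta value, and overall wiring are assumptions for illustration, not the released implementation).

```python
# A minimal sketch of a central-difference convolution layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDiffConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.theta = theta

    def forward(self, x):
        out_vanilla = self.conv(x)
        # central-difference term: convolving (x(p0+pn) - x(p0)) with the same
        # weights is equivalent to subtracting x convolved with the spatially
        # summed kernel, applied as a 1x1 convolution
        kernel_diff = self.conv.weight.sum(dim=(2, 3), keepdim=True)  # (out, in, 1, 1)
        out_diff = F.conv2d(x, kernel_diff)
        return out_vanilla - self.theta * out_diff

x = torch.randn(1, 3, 64, 64)
print(CentralDiffConv2d(3, 8)(x).shape)  # torch.Size([1, 8, 64, 64])
```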
[attention, node, video, recognition, frame, three, extract, visual] [level, table, backbone, feature, detection, map, ablation, propose] [face, cdc, central, difference, spoofing, testing, cdcn, auxiliary, model, presentation, input, attack, stasn, robust, study, mafm, abdenour, jukka, acer, living] [convolution, based, ieee, pattern, proposed, convolutional, method, output, spatial, cell, kernel, intermediate, designed, dynamic, utilized, fusion, figure] [image, supervised, invariant, representation] [vanilla, architecture, learning, search, network, neural, performance, deep, binary, searching, max, pool, task, size, test, capacity, operation, training, gradient, searched] [conference, local, computer, international, depth, vision, detailed]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Zitong and Zhao, Chenxu and Wang, Zezheng and Qin, Yunxiao and Su, Zhuo and Li, Xiaobai and Zhou, Feng and Zhao, Guoying},
  title = {Searching Central Difference Convolutional Networks for Face Anti-Spoofing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting
Zhuoqian Yang, Wentao Zhu, Wayne Wu, Chen Qian, Qiang Zhou, Bolei Zhou, Chen Change Loy


We present a lightweight video motion retargeting approach, TransMoMo, that is capable of realistically transferring the motion of a person in a source video to another video of a target person. Without using any paired data for supervision, the proposed method can be trained in an unsupervised manner by exploiting invariance properties of three orthogonal factors of variation including motion, structure, and view-angle. Specifically, with loss functions carefully derived based on invariance, we train an auto-encoder to disentangle the latent representations of such factors given the source and target video clips. This allows us to selectively transfer motion extracted from the source video seamlessly to the target video in spite of structural and view-angle disparities between the source and the target. The relaxed assumption of paired data allows our method to be trained on a vast amount of videos without manual annotation of source-target pairing, leading to improved robustness against large structural variations and extreme motion in videos. We demonstrate the effectiveness of our method over state-of-the-art methods. Code, model and data are publicly available on our project page (https://yzhq97.github.io/transmomo).
[video, skeleton, sequence, temporal, three, length, time] [effectiveness, web] [model, trained, adversarial, input, quality, change, university] [motion, method, proposed, figure, designed, chen, created, removing] [retargeting, loss, source, target, unsupervised, invariance, structural, representation, code, latent, invariant, person, disentangled, disentanglement, translation, content, generated, cross, linv, train, generative, transferred, generate, mixamo, jan, paired] [data, network, learning, scaling, training, max, triplet, process, performance, pool, space, large, neural] [structure, joint, body, human, pose, limb, view, reconstruction, novel, rotation, estimation, despite, complex, rendering, reconstructed, rotated, defined, error, approach]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Zhuoqian and Zhu, Wentao and Wu, Wayne and Qian, Chen and Zhou, Qiang and Zhou, Bolei and Loy, Chen Change},
  title = {TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
AdaCoF: Adaptive Collaboration of Flows for Video Frame Interpolation
Hyeongmin Lee, Taeoh Kim, Tae-young Chung, Daehyun Pak, Yuseok Ban, Sangyoun Lee


Video frame interpolation is one of the most challenging tasks in video processing research. Recently, many studies based on deep learning have been proposed. Most of these methods focus on finding locations with useful information to estimate each output pixel using their own frame warping operations. However, many of them have Degrees of Freedom (DoF) limitations and fail to deal with the complex motions found in real-world videos. To solve this problem, we propose a new warping module named Adaptive Collaboration of Flows (AdaCoF). Our method estimates both kernel weights and offset vectors for each target pixel to synthesize the output frame. AdaCoF is one of the most generalized warping modules compared to other approaches, and covers most of them as special cases. Therefore, it can deal with a significantly wide domain of complex motions. To further improve our framework and synthesize more realistic outputs, we introduce a dual-frame adversarial loss which is applicable only to video frame interpolation tasks. The experimental results show that our method outperforms the state-of-the-art methods for both fixed training set environments and the Middlebury benchmark. Our source code is available at https://github.com/HyeongminLEE/AdaCoF-pytorch
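To illustrate the warping operation described above, here is a naive sketch in which each output pixel is a weighted sum of input pixels fetched at per-pixel offsets; nearest-neighbour sampling and the toy shapes are simplifications of the paper's differentiable operator.

```python
# Hypothetical sketch of kernel-and-offset warping for one grayscale frame.
import numpy as np

def adacof_like_warp(img, weights, dy, dx):
    """img: (H, W) grayscale frame.
    weights, dy, dx: (K, H, W) per-output-pixel kernel weights and offsets."""
    H, W = img.shape
    out = np.zeros((H, W), dtype=np.float64)
    for k in range(weights.shape[0]):
        # round offsets to integers and clamp to the image border
        ys = np.clip(np.arange(H)[:, None] + np.rint(dy[k]).astype(int), 0, H - 1)
        xs = np.clip(np.arange(W)[None, :] + np.rint(dx[k]).astype(int), 0, W - 1)
        out += weights[k] * img[ys, xs]
    return out

# toy usage: K=2 samples per output pixel; zero offsets reproduce the frame
img = np.random.rand(8, 8)
w = np.full((2, 8, 8), 0.5)
zero = np.zeros((2, 8, 8))
print(np.allclose(adacof_like_warp(img, w, zero, zero), img))  # True
```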
[frame, video, recognition, dataset, evaluation] [offset, occlusion, table, location, map, area, add, davis, propose, module] [input, adversarial, deal, experimental] [flow, kernel, ieee, interpolation, adacof, figure, psnr, pixel, pattern, output, method, motion, optical, ssim, warping, based, adaptive, reference, iout, nie, middlebury, intermediate, result, convolutional, perceptual, dilation, collaboration, sepconv, june, convolution] [loss, image, train, target, real] [network, large, neural, learning, deep, operation, processing, size, training, better, performance, compared, forward, backward, machine, fixed] [computer, conference, vision, approach, international, complex, estimate, estimation, refer, second, compare, solve]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Hyeongmin and Kim, Taeoh and Chung, Tae-young and Pak, Daehyun and Ban, Yuseok and Lee, Sangyoun},
  title = {AdaCoF: Adaptive Collaboration of Flows for Video Frame Interpolation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FReeNet: Multi-Identity Face Reenactment
Jiangning Zhang, Xianfang Zeng, Mengmeng Wang, Yusu Pan, Liang Liu, Yong Liu, Yu Ding, Changjie Fan


This paper presents a novel multi-identity face reenactment framework, named FReeNet, to transfer facial expressions from an arbitrary source face to a target face with a shared model. The proposed FReeNet consists of two parts: Unified Landmark Converter (ULC) and Geometry-aware Generator (GAG). The ULC adopts an encoder-decoder architecture to efficiently convert expression in a latent landmark space, which significantly narrows the gap of the face contour between source and target identities. The GAG leverages the converted landmark to reenact the photorealistic image with a reference image of the target person. Moreover, a new triplet perceptual loss is proposed to force the GAG module to learn appearance and geometry information simultaneously, which also enriches facial details of the reenacted images. Further experiments demonstrate the superiority of our approach for generating photorealistic and expression-alike faces, as well as the flexibility for transferring facial expressions between identities.
[three, dataset, multiple, time, video, convert] [unified, module, table, contour] [face, landmark, facial, expression, model, reenactment, gag, reenact, adversarial, ulc, identity, converter, reenacted, input, freenet, quality] [reference, proposed, figure, perceptual, method, based, column, ssim, designed] [target, source, image, loss, generated, person, generator, appearance, latent, row, rafd, transfer, arbitrary, third, generative, generate, photorealistic, synthesis, consists, generation, translation, transferring, corresponding] [converted, triplet, training, arxiv, preprint, learning, task, network, decoupling, baseline, space, simultaneously, randomly, set, efficiently, experiment, design] [geometry, approach, second, well, consistent, term, acm, novel]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Jiangning and Zeng, Xianfang and Wang, Mengmeng and Pan, Yusu and Liu, Liang and Liu, Yong and Ding, Yu and Fan, Changjie},
  title = {FReeNet: Multi-Identity Face Reenactment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Novel View Synthesis of Dynamic Scenes With Globally Coherent Depths From a Monocular Camera
Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, Jan Kautz


This paper presents a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene. A key challenge for the novel view synthesis arises from dynamic scene reconstruction where epipolar geometry does not apply to the local motion of dynamic contents. To address this challenge, we propose to combine the depth from a single view (DSV) and the depth from multi-view stereo (DMV), where DSV is complete, i.e., a depth is assigned to every pixel, yet view-variant in its scale, while DMV is view-invariant yet incomplete. Our insight is that although its scale and quality are inconsistent with other views, the depth estimation from a single view can be used to reason about the globally coherent geometry of dynamic contents. We cast this problem as learning to correct the scale of DSV, and to refine each depth with locally consistent motions between views to form a coherent depth estimation. We integrate these tasks into a depth fusion network in a self-supervised fashion. Given the fused depth maps, we synthesize a photorealistic virtual view in a specific location and time with our deep blending network that completes the scene and renders the virtual view. We evaluate our method of depth estimation and view synthesis on diverse real-world dynamic scenes and show outstanding performance over existing methods.
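The DSV/DMV combination rests on aligning the scale of the complete single-view depth to the incomplete multi-view depth. The paper learns this correction with a network; the closed-form least-squares scale over valid DMV pixels sketched below is only an illustration of the underlying alignment idea.

```python
# Illustrative sketch: global scale alignment of DSV to DMV on valid pixels.
import numpy as np

def align_scale(dsv, dmv, valid):
    """dsv, dmv: (H, W) depth maps; valid: boolean mask where dmv is defined."""
    d1, d2 = dsv[valid], dmv[valid]
    scale = np.sum(d1 * d2) / np.sum(d1 * d1)  # least-squares scale factor
    return scale * dsv

dsv = np.ones((4, 4)); dmv = 2.0 * np.ones((4, 4))
valid = np.zeros((4, 4), dtype=bool); valid[:2] = True
print(align_scale(dsv, dmv, valid)[0, 0])  # 2.0
```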
[static, moving, time, visual, video, evaluation, trajectory, prediction, dataset, people] [foreground, background, object, mask, table, challenge] [correction, input, blending] [dynamic, warping, figure, scale, flow, motion, method, pixel, optical, fusion, spatial, warped, perceptual, captured, based, existing] [synthesis, image, source, synthesized, synthesize, missing, jan] [network, learning, function, set, problem, accuracy, metric] [depth, view, scene, monocular, camera, single, novel, reconstruction, geometry, estimation, multiview, stereo, dsv, complete, consistent, deepblender, estimated, dmv, dfnet, rendering, relative, drt, richard, coherent, virtual, shape, local, rmvsnet, monodepth, noah, globally, locally, geometric, reconstruct, geometrically]
@InProceedings{Yoon_2020_CVPR,
  author = {Yoon, Jae Shin and Kim, Kihwan and Gallo, Orazio and Park, Hyun Soo and Kautz, Jan},
  title = {Novel View Synthesis of Dynamic Scenes With Globally Coherent Depths From a Monocular Camera},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Monocular Real-Time Hand Shape and Motion Capture Using Multi-Modal Data
Yuxiao Zhou, Marc Habermann, Weipeng Xu, Ikhsanul Habibie, Christian Theobalt, Feng Xu


We present a novel method for monocular hand shape and pose estimation at unprecedented runtime performance of 100fps and at state-of-the-art accuracy. This is enabled by a new learning-based architecture designed such that it can make use of all the sources of available hand training data: image data with either 2D or 3D annotations, as well as stand-alone 3D animations without corresponding image data. It features a 3D hand joint detection module and an inverse kinematics module which not only regresses 3D joint positions but also maps them to joint rotations in a single feed-forward pass. This output makes the method more directly usable for applications in computer vision and graphics compared to only regressing 3D joint positions. We demonstrate that our architectural design leads to a significant quantitative and qualitative improvement over the state of the art on several challenging benchmarks. We will make our code publicly available for future research.
[dataset, mocap, recognition, previous] [feature, challenging, supervision, tracking, detection, annotated, predicted, location] [model, trained, datasets, robust, christian, input] [ieee, pattern, based, method, inverse, motion, proposed, figure, output] [image, real, representation, synthetic, loss, train, corresponding] [data, training, network, learning, better, compared, neural, note, architecture, deep, large, evaluate, design] [hand, joint, pose, computer, conference, vision, shape, estimation, iknet, single, depth, detnet, rgb, ground, monocular, well, truth, capture, kinematics, stb, international, approach, mesh, rotation, estimate, directly, heat, bone, quaternion, rhd, acm, novel, human, camera, rotational, mano, leverage]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Yuxiao and Habermann, Marc and Xu, Weipeng and Habibie, Ikhsanul and Theobalt, Christian and Xu, Feng},
  title = {Monocular Real-Time Hand Shape and Motion Capture Using Multi-Modal Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
The GAN That Warped: Semantic Attribute Editing With Unpaired Data
Garoe Dorta, Sara Vicente, Neill D. F. Campbell, Ivor J. A. Simpson


Deep neural networks have recently been used to edit images with great success, in particular for faces. However, they are often limited to only being able to work at a restricted range of resolutions. Many methods are so flexible that face edits can often result in an unwanted loss of identity. This work proposes to learn how to perform semantic image edits through the application of smooth warp fields. Previous approaches that attempted to use warping for semantic edits required paired data, i.e. example images of the same subject with different semantic attributes. In contrast, we employ recent advances in Generative Adversarial Networks that allow our model to be trained with unpaired data. We demonstrate face editing at very high resolutions (4k images) with a single forward pass of a deep network at a lower resolution. We also show that our edits are substantially better at preserving the subject's identity. The robustness of our approach is demonstrated by showing plausible image editing results on the Cub200 birds dataset. To our knowledge, this has not been previously accomplished, due to the challenging nature of the dataset.
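A warp-field edit of this kind can be applied at high resolution even when the field is predicted at low resolution; the sketch below (illustrative resolutions and a zero field, not the paper's network) upsamples a smooth offset field and resamples the image with it.

```python
# Hypothetical sketch: applying a low-resolution warp field to a high-resolution image.
import torch
import torch.nn.functional as F

def apply_warp_field(image, flow_lr):
    """image: (1, C, H, W); flow_lr: (1, 2, h, w) offsets (dx, dy) in normalized [-1, 1] coords."""
    _, _, H, W = image.shape
    flow = F.interpolate(flow_lr, size=(H, W), mode='bilinear', align_corners=True)
    # identity sampling grid in normalized coordinates, (x, y) order
    ys = torch.linspace(-1, 1, H)
    xs = torch.linspace(-1, 1, W)
    gy, gx = torch.meshgrid(ys, xs, indexing='ij')
    grid = torch.stack((gx, gy), dim=-1).unsqueeze(0)   # (1, H, W, 2)
    grid = grid + flow.permute(0, 2, 3, 1)              # add predicted offsets
    return F.grid_sample(image, grid, align_corners=True)

img = torch.rand(1, 3, 128, 128)
zero_flow = torch.zeros(1, 2, 16, 16)
print(torch.allclose(apply_warp_field(img, zero_flow), img, atol=1e-5))  # True
```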
[previous, work, recognition, dataset] [semantic, head] [model, input, face, identity, adversarial, original, trained] [warp, resolution, high, method, warping, ieee, pattern, figure, based, pixel, flow, field, transform, intermediate, color, proposed] [image, edits, attribute, editing, stargan, loss, target, paired, domain, edited, generator, edit, beak, generative, generated, unpaired, user, transformed, nose, realism, real, gan, learn, discriminator, transfer, translation, source, corresponding] [binary, applied, training, data, learning, label, deep, classifier, higher, larger, forward, classification, pass, network, better, number, set, accuracy, smaller] [computer, vision, approach, conference, single, partial, transformation, international, allow, require, smooth, local]
@InProceedings{Dorta_2020_CVPR,
  author = {Dorta, Garoe and Vicente, Sara and Campbell, Neill D. F. and Simpson, Ivor J. A.},
  title = {The GAN That Warped: Semantic Attribute Editing With Unpaired Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
4D Visualization of Dynamic Events From Unconstrained Multi-View Videos
Aayush Bansal, Minh Vo, Yaser Sheikh, Deva Ramanan, Srinivasa Narasimhan


We present a data-driven approach for 4D space-time visualization of dynamic events from videos captured by hand-held multiple cameras. Key to our approach is the use of self-supervised neural networks specific to the scene to compose static and dynamic aspects of an event. Though captured from discrete viewpoints, this model enables us to move around the space-time of the event continuously. This model allows us to create virtual cameras that facilitate: (1) freezing the time and exploring views; (2) freezing a view and moving through time; and (3) simultaneously changing both time and view. We can also edit the videos and reveal occluded objects for a given view if it is visible in any of the other views. We validate our approach on challenging in-the-wild events captured using up to 15 mobile cameras.
[time, static, video, multiple, work, spatiotemporal, people, three, modeling, sequence, exploring, visual, pair, temporal, minh] [background, foreground, visualization] [model, study, adversarial, unconstrained, example] [dynamic, figure, captured, event, output, disparity, freezing, convolutional, stacked, motion] [image, target, user, content, person, control, generate, composition, synthesis, generated, enable] [large, neural, space, data, deep, computing, learning, better, lower] [camera, view, approach, capture, stereo, instantaneous, reconstruction, acm, virtual, median, compute, scene, pose, richard, human, enables, estimation, allows, complete, system, noah, yaser, sfm, rendering, depth, multiview, reprojected, takeo, srinivasa]
@InProceedings{Bansal_2020_CVPR,
  author = {Bansal, Aayush and Vo, Minh and Sheikh, Yaser and Ramanan, Deva and Narasimhan, Srinivasa},
  title = {4D Visualization of Dynamic Events From Unconstrained Multi-View Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Global-Local Bidirectional Reasoning for Unsupervised Representation Learning of 3D Point Clouds
Yongming Rao, Jiwen Lu, Jie Zhou


Local and global patterns of an object are closely related. Although each part of an object is incomplete, the underlying attributes about the object are shared among all parts, which makes it possible to reason about the whole object from a single part. We hypothesize that a powerful representation of a 3D object should model the attributes that are shared between parts and the whole object, and are distinguishable from other objects. Based on this hypothesis, we propose to learn point cloud representations by bidirectional reasoning between the local structures at different abstraction hierarchies and the global shape without human supervision. Experimental results on various benchmark datasets demonstrate that the unsupervisedly learned representation is even better than the supervised representation in discriminative power, generalization ability, and robustness. We show that unsupervisedly trained point cloud models can outperform their supervised counterparts on downstream classification tasks. Most notably, by simply increasing the channel width of an SSG PointNet++, our unsupervised model surpasses the state-of-the-art supervised methods on both synthetic and real-world 3D object classification datasets. We expect our observations to offer a new perspective on learning better representations from data structures instead of human annotations for point cloud understanding.
[reasoning, dataset, downstream, bidirectional, hierarchical, previous, prediction, powerful, understanding] [global, object, semantic, feature, table, abstraction, propose, grant, china, supervision] [model, trained, generalization, improve, robustness, tsinghua] [method, proposed, based, figure, channel, analysis] [representation, unsupervised, supervised, learn, shared, loss, structural, underlying, discriminative, ssg, ability, cross] [learning, deep, accuracy, data, training, classification, learned, network, metric, neural, arxiv, preprint, knowledge, mutual, achieve, set, performance, linear, compared, large, better, problem, number, outperform] [point, cloud, local, rscnn, scanobjectnn, unsupervisedly, estimation, normal, human, pointnet, single, shape, capture, view, scannet]
@InProceedings{Rao_2020_CVPR,
  author = {Rao, Yongming and Lu, Jiwen and Zhou, Jie},
  title = {Global-Local Bidirectional Reasoning for Unsupervised Representation Learning of 3D Point Clouds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation
Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S. Huang, Lei Zhang


Bottom-up human pose estimation methods have difficulties in predicting the correct pose for small persons due to challenges in scale variation. In this paper, we present HigherHRNet: a novel bottom-up human pose estimation method for learning scale-aware representations using high-resolution feature pyramids. Equipped with multi-resolution supervision for training and multi-resolution aggregation for inference, the proposed approach is able to solve the scale variation challenge in bottom-up multi-person pose estimation and localize keypoints more precisely, especially for small persons. The feature pyramid in HigherHRNet consists of feature map outputs from HRNet and upsampled higher-resolution outputs through a transposed convolution. HigherHRNet outperforms the previous best bottom-up method by 2.5% AP for medium persons on COCO test-dev, showing its effectiveness in handling scale variation. Furthermore, HigherHRNet achieves a new state-of-the-art result on COCO test-dev (70.5% AP) without using refinement or other post-processing techniques, surpassing all existing bottom-up methods. HigherHRNet even surpasses all top-down methods on CrowdPose test (67.6% AP), suggesting its robustness in crowded scenes.
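The multi-resolution aggregation at inference time amounts to bringing heatmaps from different resolutions to the finest scale and combining them; a minimal sketch, assuming simple averaging and toy resolutions, is shown below.

```python
# Illustrative sketch of multi-resolution heatmap aggregation.
import torch
import torch.nn.functional as F

def aggregate_heatmaps(heatmaps):
    """heatmaps: list of (1, K, h_i, w_i) tensors at increasing resolutions."""
    target = heatmaps[-1].shape[-2:]
    up = [F.interpolate(h, size=target, mode='bilinear', align_corners=False)
          for h in heatmaps]
    return torch.stack(up, dim=0).mean(dim=0)  # average before decoding keypoints

low = torch.rand(1, 17, 64, 48)    # e.g. coarser heatmaps
high = torch.rand(1, 17, 128, 96)  # e.g. finer heatmaps
print(aggregate_heatmaps([low, high]).shape)  # torch.Size([1, 17, 128, 96])
```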
[predict, predicting, outperforms, prediction, previous, dataset] [feature, higherhrnet, hrnet, heatmap, pyramid, aggregation, coco, module, table, detection, achieves, crowdpose, ablation, apm, grouping, object, supervision, crowded, george, map, personlab, backbone, tagmaps, predicted, extra, val, apl, semantic, bowen] [heatmaps, input, medium, model, adding, hourglass] [resolution, deconvolution, scale, method, high, residual, figure, convolution] [person, image, generate, train, representation] [large, small, network, training, performance, higher, size, find, larger, learning, gain, strategy, standard, baseline, best, simple, deep] [pose, keypoints, estimation, keypoint, human, single, ground, truth, thomas, solve]
@InProceedings{Cheng_2020_CVPR,
  author = {Cheng, Bowen and Xiao, Bin and Wang, Jingdong and Shi, Honghui and Huang, Thomas S. and Zhang, Lei},
  title = {HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Detecting Attended Visual Targets in Video
Eunji Chong, Yongxin Wang, Nataniel Ruiz, James M. Rehg


We address the problem of detecting attention targets in video. Our goal is to identify where each person in each frame of a video is looking, and correctly handle the case where the gaze target is out-of-frame. Our novel architecture models the dynamic interaction between the scene and head features and infers time-varying attention targets. We introduce a new annotated dataset, VideoAttentionTarget, containing complex and dynamic patterns of real-world gaze behavior. Our experiments show that our model can effectively infer dynamic attention in videos. In addition, we apply our predicted attention maps to two social gaze behavior recognition tasks, and show that the resulting classifiers significantly outperform existing methods. We achieve state-of-the-art performance on three datasets: GazeFollow (static images), VideoAttentionTarget (videos), and VideoCoAtt (videos), and obtain the first results for automatically classifying clinically-relevant gaze behavior without wearable cameras or eye trackers.
[attention, social, dataset, behavior, prediction, video, videoattentiontarget, frame, recognition, gazefollow, shift, work, autism, visual, people, agata, three, temporal, eunji, infer, videocoatt, spatiotemporal, natural, inferring, concatenated, evaluation, attended] [head, map, detection, heatmap, feature, location, annotated, module, table, final, key, detect, branch] [gaze, model, eye, detecting, case, james, input, development, example] [ieee, method, spatial, dynamic, convolutional, pattern, analysis, figure, journal] [target, image, shared, person, address] [learning, performance, architecture, layer, deep, random, neural, training, task, problem] [scene, conference, computer, approach, human, vision, joint, international, position, novel, computed, ground, contact, estimation, single, pose]
@InProceedings{Chong_2020_CVPR,
  author = {Chong, Eunji and Wang, Yongxin and Ruiz, Nataniel and Rehg, James M.},
  title = {Detecting Attended Visual Targets in Video},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Closed-Loop Matters: Dual Regression Networks for Single Image Super-Resolution
Yong Guo, Jian Chen, Jingdong Wang, Qi Chen, Jiezhang Cao, Zeshuai Deng, Yanwu Xu, Mingkui Tan


Deep neural networks have exhibited promising performance in image super-resolution (SR) by learning a nonlinear mapping function from low-resolution (LR) images to high-resolution (HR) images. However, there are two underlying limitations to existing SR methods. First, learning the mapping function from LR to HR images is typically an ill-posed problem, because there exist infinite HR images that can be downsampled to the same LR image. As a result, the space of the possible functions can be extremely large, which makes it hard to find a good solution. Second, the paired LR-HR data may be unavailable in real-world applications and the underlying degradation method is often unknown. For such a more general case, existing SR models often incur the adaptation problem and yield poor performance. To address the above issues, we propose a dual regression scheme by introducing an additional constraint on LR data to reduce the space of the possible functions. Specifically, besides the mapping from LR to HR images, we learn an additional dual regression mapping that estimates the down-sampling kernel and reconstructs LR images, which forms a closed loop to provide additional supervision. More critically, since the dual regression process does not depend on HR images, we can directly learn from LR images. In this sense, we can easily adapt SR models to real-world data, e.g., raw video frames from YouTube. Extensive experiments with paired training data and unpaired real-world data demonstrate our superiority over existing methods.
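A minimal sketch of the closed-loop idea follows: a primal network maps LR to HR and a dual network maps the result back to LR, so the dual term can be evaluated without HR ground truth. The toy networks, losses, and the weight lambda below are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical sketch of a dual regression (closed-loop) objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

primal = nn.Sequential(nn.Upsample(scale_factor=2, mode='bilinear'),
                       nn.Conv2d(3, 3, 3, padding=1))          # LR -> HR (toy)
dual = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1, stride=2))   # HR -> LR (toy)

def dual_regression_loss(lr, hr, lam=0.1):
    sr = primal(lr)
    loss_primal = F.l1_loss(sr, hr)        # needs paired HR
    loss_dual = F.l1_loss(dual(sr), lr)    # only needs LR, enabling real-world adaptation
    return loss_primal + lam * loss_dual

lr = torch.rand(1, 3, 32, 32)
hr = torch.rand(1, 3, 64, 64)
print(dual_regression_loss(lr, hr).item())
```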
[video, visual, provide, attention] [regression, yong, jian, propose, table, grant, apply] [model, generalization, improve, adversarial, easily] [dual, ieee, proposed, figure, primal, method, psnr, pattern, degradation, ssim, comparison, based, conv, existing, rcan, analysis, drn, raw, edsr, dbpn, downsampling, kernel] [image, paired, unpaired, mapping, mingkui, adaptation, synthetic, loss, learn, supervised, corresponding, cyclegan, train] [data, performance, scheme, learning, space, training, network, deep, function, neural, number, bound, algorithm, reduce, task, promising, good, problem, design, note, theoretical, machine, adapt, baseline, large] [conference, computer, vision, international, single, additional, constraint, reconstruct, demonstrate, compare, nearest]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Yong and Chen, Jian and Wang, Jingdong and Chen, Qi and Cao, Jiezhang and Deng, Zeshuai and Xu, Yanwu and Tan, Mingkui},
  title = {Closed-Loop Matters: Dual Regression Networks for Single Image Super-Resolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Voxel Renderer: Learning an Accurate and Controllable Rendering Tool
Konstantinos Rematas, Vittorio Ferrari


We present a neural rendering framework that maps a voxelized scene into a high quality image. Highly-textured objects and scene element interactions are realistically rendered by our method, despite having a rough representation as an input. Moreover, our approach allows controllable rendering: geometric and appearance modifications in the input are accurately propagated to the output. The user can move, rotate and scale an object, change its appearance and texture or modify the position of the light and all these edits are represented in the final rendering. We demonstrate the effectiveness of our approach by rendering scenes with varying appearance, from single color per object to complex, high-frequency textures. We show that our rerendering network can generate very detailed images that represent precisely the appearance of the input scene. Our experiments illustrate that our approach achieves more accurate image synthesis results compared to alternatives and can also handle low voxel grid resolutions. Finally, we show how our neural rendering framework can capture and faithfully render objects from real images and from a diverse set of classes.
[encoding] [object, framework, final, map] [input, model, modify, typical] [light, color, figure, output, method, illumination, convolutional, splatting, based, high] [image, appearance, source, texture, rerendering, realistic, synthesis, real, translation, latent, target, row, learns, painting, generate, ability, faithfully, edits, generative] [neural, network, training, deep, set, learning, randomly, setting, sampled] [rendering, scene, voxel, voxels, single, camera, rendered, render, acm, approach, position, ground, nvr, detailed, differentiable, capture, view, renderer, geometry, geometric, novel, illustrate, textured, chair, michael, rotation, floor, represented, accurately, reconstruction, projective, mesh, hao, accurate, colored]
@InProceedings{Rematas_2020_CVPR,
  author = {Rematas, Konstantinos and Ferrari, Vittorio},
  title = {Neural Voxel Renderer: Learning an Accurate and Controllable Rendering Tool},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Contours: Learning to Draw Lines From 3D Shapes
Difan Liu, Mohamed Nabail, Aaron Hertzmann, Evangelos Kalogerakis


This paper introduces a method for learning to generate line drawings from 3D models. Our architecture incorporates a differentiable module operating on geometric features of the 3D model, and an image-based module operating on view-based shape representations. At test time, geometric and view-based reasoning are combined with the help of a neural module to create a line drawing. The model is trained on a large number of crowdsourced comparisons of line drawings. Experiments demonstrate that our method achieves significant improvements in line drawing over the state-of-the-art when evaluated on standard benchmarks, resulting in drawings that are comparable to those produced by experienced human artists.
[dataset, evaluation, describe] [module, branch, table, threshold, map, iou] [model, input, create, decision, trained, study, combined] [method, figure, based, output, pixel, existing, thresholding, canny, reference] [drawing, image, translation, synthetic, train, generated, aaron, stylization] [neural, training, learning, ranking, test, set, best, network, max, compared, number, function, draw, architecture, select, selected] [shape, human, geometric, cole, shaded, occluding, surface, acm, suggestive, rendered, apparent, mesh, curvature, supplementary, depth, geometry, computer, differentiable, capture, camera, rtsc, chamfer, distance, material, operating, gathered, mturk]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Difan and Nabail, Mohamed and Hertzmann, Aaron and Kalogerakis, Evangelos},
  title = {Neural Contours: Learning to Draw Lines From 3D Shapes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Softmax Splatting for Video Frame Interpolation
Simon Niklaus, Feng Liu


Differentiable image sampling in the form of backward warping has seen broad adoption in tasks like depth estimation and optical flow prediction. In contrast, how to perform forward warping has seen less attention, partly due to additional challenges such as resolving the conflict of mapping multiple pixels to the same target location in a differentiable way. We propose softmax splatting to address this paradigm shift and show its effectiveness on the application of frame interpolation. Specifically, given two input frames, we forward-warp the frames and their feature pyramid representations based on an optical flow estimate using softmax splatting. In doing so, the softmax splatting seamlessly handles cases where multiple source pixels map to the same target location. We then use a synthesis network to predict the interpolation result from the warped representations. Our softmax splatting allows us to not only interpolate frames at an arbitrary time but also to fine tune the feature pyramid and the optical flow. We show that our synthesis approach, empowered by softmax splatting, achieves new state-of-the-art results for video frame interpolation.
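The core of softmax splatting is resolving collisions during forward warping with weights derived from a per-pixel importance score; the naive sketch below uses nearest-neighbour rounding and an explicit loop purely to make the accumulation visible, whereas the paper's operator is differentiable and bilinear.

```python
# Illustrative sketch of softmax-weighted forward splatting for one grayscale frame.
import numpy as np

def softmax_splat(img, flow, z):
    """img: (H, W); flow: (H, W, 2) as (dx, dy); z: (H, W) importance scores."""
    H, W = img.shape
    num = np.zeros((H, W))
    den = np.zeros((H, W))
    w = np.exp(z - z.max())   # stabilized softmax weights
    for y in range(H):
        for x in range(W):
            tx = int(round(x + flow[y, x, 0]))
            ty = int(round(y + flow[y, x, 1]))
            if 0 <= tx < W and 0 <= ty < H:
                num[ty, tx] += w[y, x] * img[y, x]   # accumulate weighted colors
                den[ty, tx] += w[y, x]               # accumulate weights
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

img = np.arange(16.0).reshape(4, 4)
zero_flow = np.zeros((4, 4, 2))
print(np.allclose(softmax_splat(img, zero_flow, np.zeros((4, 4))), img))  # True
```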
[frame, video, multiple, temporal, dataset, evaluation, temporally, context, extract] [feature, pyramid, map, extractor, car, table, location, effectiveness, benchmark] [input, subject, quality, trained, model] [splatting, optical, interpolation, flow, warping, ieee, pattern, figure, proposed, psnr, niklaus, motion, supervise, brightness, ssim, summation, warp, llap, performs, result, simon, dain, warped, quantitative, middlebury, pixel] [image, synthesis, target, source, unsupervised, synthesize, lpips, address, perform, learn, common] [softmax, forward, network, better, training, learning, deep, backward, metric, linear, support, neural] [computer, conference, approach, vision, well, depth, estimation, differentiable, estimate, enables, allows, international, view, estimator]
@InProceedings{Niklaus_2020_CVPR,
  author = {Niklaus, Simon and Liu, Feng},
  title = {Softmax Splatting for Video Frame Interpolation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CIAGAN: Conditional Identity Anonymization Generative Adversarial Networks
Maxim Maximov, Ismail Elezi, Laura Leal-Taixe


The unprecedented increase in the usage of computer vision technology in society goes hand in hand with an increased concern in data privacy. In many real-world scenarios like people tracking or action recognition, it is important to be able to process the data while taking careful consideration in protecting people's identity. We propose and develop CIAGAN, a model for image and video anonymization based on conditional generative adversarial networks. Our model is able to remove the identifying characteristics of faces and bodies while producing high-quality images and videos that can be used for any computer vision task, such as detection or tracking. Unlike previous methods, we have full control over the de-identification (anonymization) procedure, ensuring both anonymization as well as diversity. We compare our method to several baselines and achieve state-of-the-art results. To facilitate further research, we make available the code and the models at https://github.com/dvl-tum/ciagan.
[temporal, recognition, dataset, work, december, people, video, order, provide, multiple, time] [detection, siamese, framework, propose, table, tracking] [identity, face, anonymization, adversarial, model, anonymized, input, identification, landmark, ciagan, original, trained, privacy, anonymize, showing] [method, ieee, based, convolutional, guidance, iccv, figure] [generated, image, real, generator, generative, control, discriminator, conditional, source, desired, generate, fake, consistency, representation, loss, generation, gan, person, translation, qualitative, realistic, preservation, perform, masked] [neural, network, learning, processing, training, rate, data, random, better, achieve, performance, standard, set] [computer, conference, vision, pose, international, full, compare, body, october, novel]
@InProceedings{Maximov_2020_CVPR,
  author = {Maximov, Maxim and Elezi, Ismail and Leal-Taixe, Laura},
  title = {CIAGAN: Conditional Identity Anonymization Generative Adversarial Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Probabilistic Structural Latent Representation for Unsupervised Embedding
Mang Ye, Jianbing Shen


Unsupervised embedding learning aims at extracting low-dimensional visually meaningful representations from large-scale unlabeled images, which can then be directly used for similarity-based search. This task faces two major challenges: 1) mining positive supervision from highly similar fine-grained classes and 2) generalizing to unseen testing categories. To tackle these issues, this paper proposes a probabilistic structural latent representation (PSLR), which incorporates an adaptable softmax embedding to approximate the positive concentrated and negative instance separated properties in the graph latent space. It improves the discriminability by enlarging the positive/negative difference without introducing any additional computational cost while maintaining high learning efficiency. To address the limited supervision using data augmentation, a smooth variational reconstruction loss is introduced by modeling the intra-instance variance, which improves the robustness. Extensive experiments demonstrate the superiority of PSLR over state-of-the-art unsupervised methods on both seen and unseen categories with cosine similarity. Code is available at https://github.com/mangye16/PSLR
[embedding, graph, visual, dataset, nce, recognition, relationship] [instance, positive, feature, table, supervision, improves, main, achieves, mine] [testing, knn, generalizability, input, datasets, original, major] [enhance, demonstrates, figure, proposed, high, convolutional] [latent, representation, pslr, unsupervised, isif, adaptable, unseen, variational, image, structural, loss, uel, supervised, mom, exemplar, invariant, preserving, mang, discriminability] [learning, training, similarity, negative, performance, deep, softmax, data, network, learned, set, better, accuracy, classifier, product, mining, linear, augmentation, augmented, probabilistic, cosine, distribution, metric, note, label, search, optimization, random, sample, strategy] [smooth, directly, additional, neighborhood, structure, reconstruction]
@InProceedings{Ye_2020_CVPR,
  author = {Ye, Mang and Shen, Jianbing},
  title = {Probabilistic Structural Latent Representation for Unsupervised Embedding},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Semantically Multi-Modal Image Synthesis
Zhen Zhu, Zhiliang Xu, Ansheng You, Xiang Bai


In this paper, we focus on the semantically multi-modal image synthesis (SMIS) task, namely, generating multi-modal images at the semantic level. Previous work seeks to use multiple class-specific generators, constraining its usage in datasets with a small number of classes. We instead propose a novel Group Decreasing Network (GroupDNet) that leverages group convolutions in the generator and progressively decreases the group numbers of the convolutions in the decoder. Consequently, GroupDNet is armed with much more controllability on translating semantic labels to natural images and produces plausible, high-quality results for datasets with many classes. Experiments on several challenging datasets demonstrate the superiority of GroupDNet on performing the SMIS task. We also show that GroupDNet is capable of performing a wide range of interesting synthesis applications. Codes and models are available at: https://github.com/Seanseattle/SMIS.
[decoder, natural, work, represent, previous, multiple, dataset, evaluation] [semantic, feature, parsing, gconv, focus, challenging, represents] [model, adversarial, input, quality, datasets, clothes, deepfashion] [convolutional, high, comparison, figure, gaussian, superiority] [image, groupdnet, synthesis, smis, latent, encoder, generated, code, conditional, spade, semantically, generative, mulnet, groupnet, generator, style, diversity, generation, bicyclegan, diverse, generate, dscgan, specific, translation, real, loss, controllability, transfer, person, idea, vary] [group, network, number, class, performance, task, set, architecture, training, decreasing, batch, deep, upper, design, normalization, equal, support] [human, distance, single]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Zhen and Xu, Zhiliang and You, Ansheng and Bai, Xiang},
  title = {Semantically Multi-Modal Image Synthesis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Nested Scale-Editing for Conditional Image Synthesis
Lingzhi Zhang, Jiancong Wang, Yinshuang Xu, Jie Min, Tarmily Wen, James C. Gee, Jianbo Shi


We propose an image synthesis approach that provides stratified navigation in the latent code space. Given only a partial or very low-resolution image, our approach can consistently outperform state-of-the-art counterparts in terms of generating the closest sampled image to the ground truth. We achieve this through scale-independent editing while expanding scale-specific diversity. Scale-independence is achieved with a nested scale disentanglement loss. Scale-specific diversity is created by incorporating a progressive diversification constraint. We introduce semantic persistency across the scales by sharing common latent codes. Together they provide better control of the image synthesis process. We evaluate the effectiveness of our proposed approach through various tasks, including image outpainting, image superresolution, and cross-domain image translation.
[visual, multimodal, decoder, evaluation, work, previous, hierarchical] [feature, propose, semantic, level] [model, identity, adversarial, quality, input, face, facial, difficult] [ieee, scale, pattern, spatial, output, proposed, figure, method, resolution, convolutional, comparison, perceptual, recover] [image, latent, code, conditional, generative, disentanglement, synthesis, diversity, diverse, style, diversification, corresponding, progressive, loss, variable, synthesized, control, content, generate, munit, translation, specific, generator, generated, disentangled, transfer, bicyclegan, mode] [sampled, deep, neural, random, normalized, processing, best, pairwise, arxiv, preprint, evaluate, learning, layer, sampling, network] [conference, computer, vision, distance, ground, truth, single, recovery, international, european, approach, enforce]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Lingzhi and Wang, Jiancong and Xu, Yinshuang and Min, Jie and Wen, Tarmily and Gee, James C. and Shi, Jianbo},
  title = {Nested Scale-Editing for Conditional Image Synthesis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
UnrealText: Synthesizing Realistic Scene Text Images From the Unreal World
Shangbang Long, Cong Yao


Synthetic data has been a critical tool for training scene text detection and recognition models. On the one hand, synthetic word images have proven to be a successful substitute for real images in training scene text recognizers. On the other hand, however, scene text detectors still heavily rely on a large amount of manually annotated real-world images, which are expensive. In this paper, we introduce UnrealText, an efficient image synthesis method that renders realistic images via a 3D graphics engine. The 3D synthetic engine provides realistic appearance by rendering scene and text as a whole, and allows for better text region proposals with access to precise scene information, e.g. normals and even object meshes. Comprehensive experiments verify its effectiveness on both scene text detection and recognition. We also generate a multilingual version for future research into multilingual scene text detection and recognition. The code and the generated datasets are released at: https://github.com/Jyouhou/UnrealText/.
[text, recognition, engine, mlt, multilingual, environment, synthtext, visd, walk, natural, east, previous, icdar, dataset, viewfinder, latin, cong, unrealtext, evaluation] [detection, object, module, region, background, effectiveness, achieves, semantic, refined] [datasets, model, trained, robust] [proposed, method, ieee, pattern, based, analysis, result, screen] [synthetic, image, synthesis, real, generate, randomization, generated, train, realistic, diverse, generation, diversity, domain] [data, training, random, randomly, set, number, large, performance, better, size, manual, arxiv, preprint, total, find, experiment] [scene, camera, computer, conference, vision, initial, viewpoint, lighting, rendering, normal, compare]
@InProceedings{Long_2020_CVPR,
  author = {Long, Shangbang and Yao, Cong},
  title = {UnrealText: Synthesizing Realistic Scene Text Images From the Unreal World},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fast Texture Synthesis via Pseudo Optimizer
Wu Shi, Yu Qiao


Texture synthesis using deep neural networks can generate high-quality and diversified textures. However, it usually requires a heavy optimization process. Follow-up works accelerate the process by using feed-forward networks, but at the cost of scalability, diversity, or quality. We propose a new efficient method that aims to simulate the optimization process while retaining most of its properties. Our method takes a noise image and the gradients from a descriptor network as inputs, and synthesizes a refined image with respect to the target image. The proposed method can synthesize images with better quality and diversity than other fast synthesis methods. Moreover, our method trained on a large-scale dataset can generalize to synthesize unseen textures.
[time, step, dataset, recognition] [propose, instance, refine, feature, named] [input, model, noise, iterative, quality, trained, vgg, shenzhen, change, universal] [method, fast, adaptive, figure, convolutional, reference, pattern, unfolding, output, ieee, simulate, proposed, transform] [texture, image, synthesis, loss, synthesize, target, gatys, synthesized, diversity, pseudo, style, arbitrary, progressive, gram, generate, descriptive, transfer, diversified, idea, wct, train, common] [network, optimization, neural, process, optimizer, gradient, training, learning, objective, set, layer, inference, architecture, deep, forward, requires, efficient, computation, backward, function, matrix, size, problem, optimal, implemented] [computer, conference, single, vision, international, defined, second]
@InProceedings{Shi_2020_CVPR,
  author = {Shi, Wu and Qiao, Yu},
  title = {Fast Texture Synthesis via Pseudo Optimizer},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
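The heavy optimization process mentioned in the abstract above is typically driven by a Gram-matrix texture loss over VGG features (Gatys et al.); the paper's pseudo optimizer learns to imitate the gradient steps of such an objective. Below is a minimal PyTorch sketch of that loss; the feature lists are assumed to come from a pretrained descriptor network and are illustrative only.

import torch
import torch.nn.functional as F

def gram(feat):
    # Channel-wise correlations of a (B, C, H, W) feature map.
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def texture_loss(features_synth, features_target):
    # Lists of feature maps from several layers of the descriptor network.
    return sum(F.mse_loss(gram(fs), gram(ft))
               for fs, ft in zip(features_synth, features_target))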
Towards Learning Structure via Consensus for Face Segmentation and Parsing
Iacopo Masi, Joe Mathai, Wael AbdAlmageed


Face segmentation is the task of densely labeling pixels on the face according to their semantics. While current methods place an emphasis on developing sophisticated architectures, use conditional random fields for smoothness, or rather employ adversarial training, we follow an alternative path towards robust face segmentation and parsing. Occlusions, along with other parts of the face, have a proper structure that needs to be propagated in the model during training. Unlike state-of-the-art methods that treat face segmentation as an independent pixel prediction problem, we argue instead that it should hold highly correlated outputs within the same object pixels. We thereby offer a novel learning mechanism to enforce structure in the prediction via consensus, guided by a robust loss function that forces pixel objects to be consistent with each other. Our face parser is trained by transferring knowledge from another model, yet it encourages spatial consistency while fitting the labels. Different than current practice, our method enjoys pixel-wise predictions, yet paves the way for fewer artifacts, less sparse masks, and spatially coherent outputs.
[prediction, connected, work, recurrent, structured, previous, graph] [segmentation, semantic, mask, parsing, table, consensus, fully, occlusion, crf, blob, dkl, final, predicted, crfs, ablation, recall] [face, adversarial, input, cofw, model, generic, robust, case, iacopo, nirkin] [method, pixel, convolutional, proposed, figure, result, simply, residual, comparison, liu] [image, loss, transfer, qualitative] [learning, training, network, class, label, neural, baseline, deep, sparsity, set, report, arxiv, preprint, test, average, random, knowledge, softmax, expected, size, probability] [structure, smoothness, sparse, additional, second, provided, system, computer, shape, enforce, term]
@InProceedings{Masi_2020_CVPR,
  author = {Masi, Iacopo and Mathai, Joe and AbdAlmageed, Wael},
  title = {Towards Learning Structure via Consensus for Face Segmentation and Parsing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CookGAN: Causality Based Text-to-Image Synthesis
Bin Zhu, Chong-Wah Ngo


This paper addresses the problem of text-to-image synthesis from a new perspective, i.e., the cause-and-effect chain in image generation. Causality is a common phenomenon in cooking. The dish appearance changes depending on the cooking actions and ingredients. The challenge of synthesis is that a generated image should depict the visual result of action-on-object. This paper presents a new network architecture, CookGAN, that mimics the visual effect in the causality chain, preserves fine-grained details, and progressively upsamples the image. Particularly, a cooking simulator sub-network is proposed to incrementally make changes to food images based on the interaction between ingredients and cooking methods over a series of steps. Experiments on Recipe1M verify that CookGAN manages to generate food images with a reasonably impressive inception score. Furthermore, the images are semantically interpretable and manipulable.
[cooking, visual, recipe, simulator, attended, semantics, text, ith, gru, retrieval, step, three, hidden, red, medr, instruction] [feature, map, table] [adversarial, black, manipulation, quality, change, trained, original, input] [figure, pattern, ieee, result, based, resolution, upsampling, visually, channel, high, color] [image, cookgan, ingredient, food, generated, gan, real, loss, generate, causality, synthesis, generation, conditional, generative, dish, generator, ingredientgan, progressively, manages, semantically, realistic, encoder, stepgan, composition, appearance, inception, stir, content, dried, discriminator, unconditional, fiattend, cooked] [learning, performance, network, paper, lower, size, higher, distribution, indicates, better] [conference, computer, vision]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Bin and Ngo, Chong-Wah},
  title = {CookGAN: Causality Based Text-to-Image Synthesis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Weakly Supervised Discriminative Feature Learning With State Information for Person Identification
Hong-Xing Yu, Wei-Shi Zheng


Unsupervised learning of identity-discriminative visual feature is appealing in real-world tasks where manual labelling is costly. However, the images of an identity can be visually discrepant when images are taken under different states, e.g. different camera views and poses. This visual discrepancy leads to great difficulty in unsupervised discriminative learning. Fortunately, in real-world tasks we could often know the states without human annotation, e.g. we can easily have the camera view labels in person re-identification and facial pose labels in face recognition. In this work we propose utilizing the state information as weak supervision to address the visual discrepancy caused by different states. We formulate a simple pseudo label model and utilize the state information in an attempt to refine the assigned pseudo labels by the weakly supervised decision boundary rectification and weakly supervised feature drift regularization. We evaluate our model on unsupervised person re-identification and pose-invariant face recognition. Despite the simplicity of our method, it could outperform the state-of-the-art results on Duke-reID, MultiPIE and CFP datasets with a standard ResNet-50 backbone. We also find our model could perform comparably with the standard supervised fine-tuning results on the three datasets. Code is available at https://github.com/KovenYu/state-information.
[state, visual, recognition, work, correct] [feature, weakly, boundary, hard, table, liang, weak, supervision] [model, face, decision, distortion, identity, cfp, frontal, caused, highly, example] [figure, method, dark, illumination] [person, unsupervised, supervised, surrogate, wdbr, discriminative, image, learn, address, discrepancy, pretrained, pseudo, cluster, loss, pifr, domain, adaptation, ancong, wfdr] [learning, deep, set, class, training, basic, performance, soft, drift, standard, unlabeled, data, label, compared, problem, better, arxiv, preprint, unlabelled, extremely, clustering, network, note, learned, regularization] [camera, pose, view, full, local, vision, refer, mpi]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Hong-Xing and Zheng, Wei-Shi},
  title = {Weakly Supervised Discriminative Feature Learning With State Information for Person Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Future Video Synthesis With Object Motion Prediction
Yue Wu, Rongrong Gao, Jaesik Park, Qifeng Chen


We present an approach to predict future video frames given a sequence of continuous video frames in the past. Instead of synthesizing images directly, our approach is designed to understand the complex scene dynamics by decoupling the background scene and moving objects. The appearance of the scene components in the future is predicted by non-rigid deformation of the background and affine transformation of moving objects. The anticipated appearances are combined to create a reasonable video in the future. With this procedure, our method exhibits much less tearing or distortion artifact compared to other approaches. Experimental results on the Cityscapes and KITTI datasets show that our model outperforms the state-of-the-art in terms of visual quality and accuracy.
[video, prediction, future, moving, frame, dataset, predict, static, sequence, trajectory, predicting, visual, long, temporal, multiple, sequential, dobj, dseq] [object, background, predicted, semantic, foreground, instance, module, feature, propose, detection, mask, employ, tracking] [model, input, adversarial, quality] [motion, flow, optical, dynamic, method, spatial, convolutional, based, proposed, affine, mcnet, warped, resolution, lsmooth] [image, loss, discriminator, inpainting, lpips, synthesize, unsupervised, train, generate, generative, transformed, appearance, synthesis, realistic, generated, missing, encoder, inpainted] [network, set, learning, deep, training, andrew, stochastic, backward] [approach, scene, kitti, single, defined, estimated, transformation, structure, directly]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Yue and Gao, Rongrong and Park, Jaesik and Chen, Qifeng},
  title = {Future Video Synthesis With Object Motion Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MaskGAN: Towards Diverse and Interactive Facial Image Manipulation
Cheng-Han Lee, Ziwei Liu, Lingyun Wu, Ping Luo


Facial image manipulation has achieved great progress in recent years. However, previous methods either operate on a predefined set of face attributes or leave users little freedom to interactively manipulate images. To overcome these drawbacks, we propose a novel framework termed MaskGAN, enabling diverse and interactive face manipulation. Our key insight is that semantic masks serve as a suitable intermediate representation for flexible face manipulation with fidelity preservation. MaskGAN has two main components: 1) Dense Mapping Network (DMN) and 2) Editing Behavior Simulated Training (EBST). Specifically, DMN learns style mapping between a free-form user modified mask and a target image, enabling diverse generation results. EBST models the user editing behavior on the source mask, making the overall framework more robust to various manipulated inputs. Specifically, it introduces dual-editing consistency as the auxiliary supervision signal. To facilitate extensive studies, we construct a large-scale high-resolution face dataset with fine-grained mask annotations named CelebAMask-HQ. MaskGAN is comprehensively evaluated on two challenging tasks: attribute transfer and style copy, demonstrating superior performance over other state-of-the-art methods. The code, models, and dataset are available at https://github.com/switchablenorms/CelebAMask-HQ.
[behavior, dataset, visual, evaluation, blend] [mask, semantic, interactive, table, segmentation] [face, manipulation, facial, model, trained, quality, adversarial, identity, adding] [conv, ieee, figure, spatial, comparison, superior] [image, style, target, editing, maskgan, mapping, attribute, source, alpha, transfer, loss, blender, user, generation, src, inter, stargan, elegant, spade, smiling, dmn, maskvae, conditional, latent, learns, generated, consistency, encoder, manipulating, diverse, ziwei, consists] [training, network, size, outer, accuracy, label, deep, performance, classification, arxiv, preprint, set, indicates, update, learning, better] [dense, computer, simulated, conference, vision, structure, thomas]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Cheng-Han and Liu, Ziwei and Wu, Lingyun and Luo, Ping},
  title = {MaskGAN: Towards Diverse and Interactive Facial Image Manipulation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Graduated Filter Method for Large Scale Robust Estimation
Huu Le, Christopher Zach


Due to the highly non-convex nature of large-scale robust parameter estimation, avoiding poor local minima is challenging in real-world applications where input data is contaminated by a large or unknown fraction of outliers. In this paper, we introduce a novel solver for robust estimation that possesses a strong ability to escape poor local minima. Our algorithm is built upon the class of traditional graduated optimization techniques, which are considered state-of-the-art local methods to solve problems having many poor minima. The novelty of our work lies in the introduction of an adaptive kernel (or residual) scaling scheme, which allows us to achieve faster convergence rates. Like other existing methods that aim to return good local minima for robust estimation tasks, our method relaxes the original robust problem, but adapts a filter framework from non-linear constrained optimization to automatically choose the level of relaxation. Experimental results on real large-scale datasets such as bundle adjustment instances demonstrate that our proposed method achieves competitive results.
[step, work, order, described, provide, current, christopher] [main, cooperative, framework, region, global, easy] [robust, poor, original, constrained, chosen, nonlinear] [method, graduated, figure, kernel, adjustment, competitive, proposed, restoration, ieee, scale, pattern, existing, optimized, fast, contrast, residual] [image] [filter, optimization, problem, objective, algorithm, set, gnc, parameter, performance, schedule, irls, min, feasible, convergence, number, compared, minimization, best, penalty, large, scaling, achieve, good, maximum, function, choice, zach, scaled] [local, estimation, bundle, computer, cost, solution, vision, conference, constraint, lifting, acceptable, solver, approach, defined, solve, supplementary, novel, fitting]
@InProceedings{Le_2020_CVPR,
  author = {Le, Huu and Zach, Christopher},
  title = {A Graduated Filter Method for Large Scale Robust Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
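As a rough illustration of the graduated optimization idea discussed above, the sketch below fits a line robustly by annealing the scale of a Geman-McClure-style kernel from large (nearly convex) to small, solving an IRLS problem at each level; the fixed sigma schedule is an assumption standing in for the paper's adaptive, filter-based relaxation rule.

import numpy as np

def graduated_robust_line_fit(x, y, sigmas=(10.0, 5.0, 2.0, 1.0), iters=20):
    # Fit y ~ a*x + b while down-weighting outliers; each sigma level reuses
    # the previous estimate as a warm start.
    theta = np.zeros(2)                              # [a, b]
    A = np.stack([x, np.ones_like(x)], axis=1)
    for sigma in sigmas:
        for _ in range(iters):
            r = A @ theta - y
            w = (sigma**2 / (sigma**2 + r**2))**2    # IRLS weights of the robust kernel
            Aw = A * w[:, None]
            theta = np.linalg.solve(A.T @ Aw, Aw.T @ y)
    return theta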
Deep Face Super-Resolution With Iterative Collaboration Between Attentive Recovery and Landmark Estimation
Cheng Ma, Zhenyu Jiang, Yongming Rao, Jiwen Lu, Jie Zhou


Recent works based on deep learning and facial priors have succeeded in super-resolving severely degraded facial images. However, the prior knowledge is not fully exploited in existing methods, since facial priors such as landmark and component maps are always estimated by low-resolution or coarsely super-resolved images, which may be inaccurate and thus affect the recovery performance. In this paper, we propose a deep face super-resolution (FSR) method with iterative collaboration between two recurrent networks which focus on facial image recovery and landmark estimation respectively. In each recurrent step, the recovery branch utilizes the prior knowledge of landmarks to yield higher-quality images which facilitate more accurate landmark estimation in turn. Therefore, the iterative information interaction between two processes boosts the performance of each other progressively. Moreover, a new attentive fusion module is designed to strengthen the guidance of landmark maps, where facial components are generated individually and aggregated attentively for better restoration. Quantitative and qualitative experimental results show the proposed method significantly outperforms state-of-the-art FSR methods in recovering high-quality face images.
[step, recurrent, attention, previous, reason] [attentive, table, branch, module, propose, feature, china, grant, framework] [face, facial, landmark, iterative, quality, model, input, dicgan, hourglass, feedback, adversarial, helen, heatmaps, hallucination] [fusion, method, proposed, fsr, figure, nrmse, perceptual, prior, comparison, collaboration, guidance, convolutional, ssim, dic, psnr, recursive, block, quantitative, based, residual, conv, bicubic, erc, chen] [alignment, image, loss, corresponding, generative, component, celeba, structural, ability, extracted, generate, generation] [deep, better, network, performance, learning, best, group, design, process, training, knowledge, number] [accurate, recovery, estimated, single, estimation, local, recovered]
@InProceedings{Ma_2020_CVPR,
  author = {Ma, Cheng and Jiang, Zhenyu and Rao, Yongming and Lu, Jiwen and Zhou, Jie},
  title = {Deep Face Super-Resolution With Iterative Collaboration Between Attentive Recovery and Landmark Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Coherent Reconstruction of Multiple Humans From a Single Image
Wen Jiang, Nikos Kolotouros, Georgios Pavlakos, Xiaowei Zhou, Kostas Daniilidis


In this work, we address the problem of multi-person 3D pose estimation from a single image. A typical regression approach in the top-down setting of this problem would first detect all humans and then reconstruct each one of them independently. However, this type of prediction suffers from incoherent results, e.g., interpenetration and inconsistent depth ordering between the people in the scene. Our goal is to train a single network that learns to avoid these problems and generate a coherent 3D reconstruction of all the humans in the scene. To this end, a key design choice is the incorporation of the SMPL parametric body model in our top-down framework, which enables the use of two novel losses. First, a distance field-based collision loss penalizes interpenetration among the reconstructed people. Second, a depth ordering-aware loss reasons about occlusions and promotes a depth ordering of people that leads to a rendering which is consistent with the annotated instance segmentation. This provides depth supervision signals to the network, even if the image has no explicit 3D annotations. The experiments show that our approach outperforms previous methods on standard 3D pose benchmarks, while our proposed losses enable more coherent reconstruction in natural images. The project website with videos, results, and code can be found at: https://jiangwenpl.github.io/multiperson
[people, multiple, dataset, work, evaluation, previous, natural, prediction] [instance, ordering, regression, holistic, segmentation, table, bounding, framework, promote, apply, box, panoptic, annotated, detection, overlap, penalize, detect, key] [model, trained, typical, input, improve] [proposed, figure, based, pixel, method] [person, loss, image, train, qualitative] [network, baseline, training, learning, problem, evaluate, performance, deep, ordinal, architecture, feedforward] [pose, depth, human, shape, estimation, single, reconstruction, approach, scene, smpl, coherent, interpenetration, mesh, estimate, michael, body, rendering, coherency, reconstructed, zanfir, ground, avoid, full, georgios, distance, estimated, truth, monocular, kostas, overlapping, parametric, consistent, rgb]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Wen and Kolotouros, Nikos and Pavlakos, Georgios and Zhou, Xiaowei and Daniilidis, Kostas},
  title = {Coherent Reconstruction of Multiple Humans From a Single Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
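To make the depth-ordering idea in the abstract above concrete, here is a minimal PyTorch sketch of an ordering penalty: wherever the annotated instance mask says one person is visible but another person's rendered depth is closer, the depth gap is penalized. The tensor shapes and the differentiable renderer producing rendered_depths are assumptions, and this is not the authors' exact loss.

import torch

def depth_order_loss(rendered_depths, visible_ids):
    # rendered_depths: (P, H, W) per-person depth maps, +inf where a person
    # does not cover the pixel; visible_ids: (H, W) long tensor with the index
    # of the person the instance segmentation marks as visible.
    closest_depth, closest_id = rendered_depths.min(dim=0)
    visible_depth = rendered_depths.gather(0, visible_ids.unsqueeze(0)).squeeze(0)
    wrong = (closest_id != visible_ids) & torch.isfinite(visible_depth)
    if not wrong.any():
        return rendered_depths.new_zeros(())
    # Penalize how far behind the annotated visible person is rendered.
    return (visible_depth[wrong] - closest_depth[wrong]).mean()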
PointASNL: Robust Point Clouds Processing Using Nonlocal Neural Networks With Adaptive Sampling
Xu Yan, Chaoda Zheng, Zhen Li, Sheng Wang, Shuguang Cui


Raw point cloud data inevitably contains outliers or noise introduced during acquisition from 3D sensors or by reconstruction algorithms. In this paper, we present a novel end-to-end network for robust point cloud processing, named PointASNL, which can deal with point clouds with noise effectively. The key component in our approach is the adaptive sampling (AS) module. It first re-weights the neighbors around the initial sampled points from farthest point sampling (FPS), and then adaptively adjusts the sampled points beyond the entire point cloud. Our AS module can not only benefit the feature learning of point clouds, but also ease the biased effect of outliers. To further capture the neighbor and long-range dependencies of the sampled point, we propose a local-nonlocal (L-NL) module inspired by the nonlocal operation. Such an L-NL module makes the learning process insensitive to noise. Extensive experiments verify the robustness and superiority of our approach in point cloud processing tasks on synthetic, indoor, and outdoor data, with or without noise. Specifically, PointASNL achieves state-of-the-art robust performance for classification and segmentation tasks on all datasets, and significantly outperforms previous methods on the real-world outdoor SemanticKITTI dataset with considerable noise.
[hierarchical, dataset, context, attention, previous] [module, feature, global, segmentation, semantic, key, aggregation, pnl, voting, fps] [input, model, robust, noise, robustness, improve, query, original] [ieee, pattern, adaptive, convolution, nonlocal, pnt, cell, proposed, convolutional, figure, spatial, method, raw, adaptively, noisy, conv, channel] [] [sampled, sampling, learning, neural, classification, data, network, group, deep, training, distribution, entire, function, number, processing, strategy, random, performance, subset, operation, achieve, updated, design, sum] [point, local, computer, conference, vision, cloud, pointasnl, indoor, shape, outdoor, scene, scannet, outlier, pointconv, neighbor, directly, pointnet, position, dgcnn, limited]
@InProceedings{Yan_2020_CVPR,
  author = {Yan, Xu and Zheng, Chaoda and Li, Zhen and Wang, Sheng and Cui, Shuguang},
  title = {PointASNL: Robust Point Clouds Processing Using Nonlocal Neural Networks With Adaptive Sampling},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
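As an informal illustration of the adaptive sampling module described above, the NumPy sketch below runs farthest point sampling and then shifts every sampled point to a distance-weighted average of its k nearest neighbors, which pulls outlier seeds back toward the underlying surface; the paper learns these weights, whereas a simple softmax over distances is assumed here.

import numpy as np

def adaptive_sample(points, n_samples=512, k=8, temperature=0.1):
    # Farthest point sampling over an (N, 3) point cloud.
    idx = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_samples - 1):
        idx.append(int(dist.argmax()))
        dist = np.minimum(dist, np.linalg.norm(points - points[idx[-1]], axis=1))
    sampled = points[idx]
    # Adaptive shift: re-weight the k nearest neighbors of each seed point.
    adjusted = np.empty_like(sampled)
    for i, p in enumerate(sampled):
        d = np.linalg.norm(points - p, axis=1)
        nn = np.argsort(d)[:k]
        w = np.exp(-d[nn] / temperature)
        w /= w.sum()
        adjusted[i] = (w[:, None] * points[nn]).sum(axis=0)
    return adjusted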
A Neural Rendering Framework for Free-Viewpoint Relighting
Zhang Chen, Anpei Chen, Guli Zhang, Chengyuan Wang, Yu Ji, Kiriakos N. Kutulakos, Jingyi Yu


We present a novel Relightable Neural Renderer (RNR) for simultaneous view synthesis and relighting using multi-view image inputs. Existing neural rendering (NR) does not explicitly model the physical rendering process and hence has limited capabilities on relighting. RNR instead models image formation in terms of environment lighting, object intrinsic attributes, and light transport function (LTF), each corresponding to a learnable component. In particular, the incorporation of a physically based rendering process not only enables relighting but also improves the quality of view synthesis. Comprehensive experiments on synthetic and real data show that RNR provides a practical and effective solution for conducting free-viewpoint relighting.
[environment, gcn, encode] [object, map, feature, global] [model, input, ltf] [light, illumination, ieee, pattern, based, method, learnable, captured, figure, convolutional] [synthesis, image, transport, texture, appearance, loss, synthetic, real, row, produce, representation, synthesize, learn, latent] [neural, deep, learning, network, training, sampling, data, sampled, number, arxiv, preprint, set, space, processing] [view, rendering, relighting, computer, rnr, conference, acm, novel, direction, surface, specular, geometry, vision, diffuse, deferrednr, michael, reflectance, deepvoxels, radiance, albedo, volume, differentiable, renderer, physically, point, sparse, scene, geometric, single, normal, initial, international, relightable, intrinsic, material, camera, estimation]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Zhang and Chen, Anpei and Zhang, Guli and Wang, Chengyuan and Ji, Yu and Kutulakos, Kiriakos N. and Yu, Jingyi},
  title = {A Neural Rendering Framework for Free-Viewpoint Relighting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Multi-Task Mean Teacher for Semi-Supervised Shadow Detection
Zhihao Chen, Lei Zhu, Liang Wan, Song Wang, Wei Feng, Pheng-Ann Heng


Existing shadow detection methods suffer from an intrinsic limitation in relying on limited labeled datasets, and they may produce poor results in some complicated situations. To boost the shadow detection performance, this paper presents a multi-task mean teacher model for semi-supervised shadow detection by leveraging unlabeled data and exploring the learning of multiple information of shadows simultaneously. To be specific, we first build a multi-task baseline model to simultaneously detect shadow regions, shadow edges, and shadow count by leveraging their complementary information and assign this baseline model to the student and teacher network. After that, we encourage the predictions of the three tasks from the student and teacher networks to be consistent for computing a consistency loss on unlabeled data, which is then added to the supervised loss on the labeled data from the predictions of the multi-task baseline model. Experimental results on three widely-used benchmark datasets show that our method consistently outperforms all the compared state-of-the-art methods, which verifies that the proposed network can effectively leverage additional unlabeled data to boost the shadow detection performance.
[three, pred, dataset, ucf, natural, work, visual] [detection, edge, region, ber, cnn, sbu, map, benchmark, feature, annotated, denotes, segmentation, object, saliency, table, detect, supervision, fuse, semantic, apply, bdrar] [model, input, detecting, datasets, complementary, testing] [method, convolutional, existing, proposed, based, analysis, pattern, spatial, ieee, lei, figure, develop, result] [shadow, loss, supervised, consistency, image, istd, train, produce, produced] [network, training, unlabeled, data, count, teacher, labeled, learning, student, deep, baseline, better, set, performance, number, neural, layer, machine, compared, large] [ground, single, well, second]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Zhihao and Zhu, Lei and Wan, Liang and Wang, Song and Feng, Wei and Heng, Pheng-Ann},
  title = {A Multi-Task Mean Teacher for Semi-Supervised Shadow Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
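The mean-teacher mechanism in the abstract above boils down to two pieces: the teacher's weights track an exponential moving average of the student's, and a consistency loss ties their predictions on unlabeled images. A minimal PyTorch sketch of both pieces follows; the module names are placeholders and the multi-task heads (regions, edges, count) are collapsed into a single output for brevity.

import torch
import torch.nn.functional as F

def update_teacher(student, teacher, ema_decay=0.99):
    # Teacher parameters follow an exponential moving average of the student's.
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(ema_decay).add_(s_p, alpha=1.0 - ema_decay)

def consistency_loss(student, teacher, unlabeled_images):
    # Student and teacher shadow predictions on unlabeled data should agree.
    with torch.no_grad():
        teacher_pred = torch.sigmoid(teacher(unlabeled_images))
    student_pred = torch.sigmoid(student(unlabeled_images))
    return F.mse_loss(student_pred, teacher_pred)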
GroupFace: Learning Latent Groups and Constructing Group-Based Representations for Face Recognition
Yonghyun Kim, Wonpyo Park, Myung-Cheol Roh, Jongju Shin


In the field of face recognition, a model learns to distinguish millions of face images with low-dimensional embedding features, and such vast information may not be properly encoded in the conventional model with a single branch. We propose a novel face-recognition-specialized architecture called GroupFace that utilizes multiple group-aware representations, simultaneously, to improve the quality of the embedding feature. The proposed method provides self-distributed labels that balance the number of samples belonging to each group without additional human annotations, and learns the group-aware representations that can narrow down the search space of the target identity. We prove the effectiveness of the proposed method by showing extensive ablation studies and visualizations. All the components of the proposed method can be trained in an end-to-end manner with a marginal increase of computational complexity. Finally, the proposed method achieves the state-of-the-art results with significant improvements in 1:1 face verification and 1:N face identification tasks on the following public datasets: LFW, YTF, CALFW, CPLFW, CFP, AgeDB-30, MegaFace, IJB-B and IJB-C.
[recognition, embedding, multiple, visual, dataset, considering, evaluation] [feature, ablation, grouping, improvement, final, effectiveness, denotes, aggregation, refined, table] [face, groupface, model, trained, arcface, gdn, verification, cosface, percentage, lfw, megaface, improve, identification, datasets, identity, deploying] [ieee, proposed, pattern, method, figure, based, conventional, high] [representation, loss, latent, image, common, corresponding] [group, learning, deep, network, number, similarity, probability, accuracy, performance, large, baseline, set, distribution, margin, angular, softmax, cosine, label, marginal, training, machine, dimension, sample] [conference, computer, vision, european, international, additional, compare, novel, distance, defined]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Yonghyun and Park, Wonpyo and Roh, Myung-Cheol and Shin, Jongju},
  title = {GroupFace: Learning Latent Groups and Constructing Group-Based Representations for Face Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Channel Attention Based Iterative Residual Learning for Depth Map Super-Resolution
Xibin Song, Yuchao Dai, Dingfu Zhou, Liu Liu, Wei Li, Hongdong Li, Ruigang Yang


Despite the remarkable progress made in deep learning based depth map super-resolution (DSR), how to tackle real-world degradation in low-resolution (LR) depth maps remains a major challenge. Existing DSR models are generally trained and tested on synthetic datasets, which are very different from what one would get from a real depth sensor. In this paper, we argue that DSR models trained under this setting are restrictive and not effective in dealing with real-world DSR tasks. We make two contributions in tackling real-world degradation of different depth sensors. First, we propose to classify the generation of LR depth maps into two types: non-linear downsampling with noise and interval downsampling, for which DSR models are learned correspondingly. Second, we propose a new framework for real-world DSR, which consists of four modules: 1) An iterative residual learning module with deep supervision to learn effective high-frequency components of depth maps in a coarse-to-fine manner; 2) A channel attention strategy to enhance channels with abundant high-frequency components; 3) A multi-stage fusion module to effectively re-exploit the results in the coarse-to-fine process; and 4) A depth refinement module to improve the depth map by TGV regularization and input loss. Extensive experiments on benchmarking datasets demonstrate the superiority of our method over current state-of-the-art DSR methods.
[attention, dataset] [map, framework, feature, propose, apolloscape, yuchao, tackle] [input, noise, effective, model, datasets, iterative, effectively] [proposed, ieee, channel, degradation, based, method, pattern, residual, dsr, color, okfe, dvs, captured, srfbn, output, san, figure, convolutional, utilized, dcnn, block, middlebury, abpn, xibin, analysis, airws, comparison, downsampling, tgv, superresolution, kernel, extraction] [image, loss, real, factor, generation, generate, component] [learning, network, deep, number, best, strategy, evaluate, total, performance, smaller, sharing] [depth, conference, computer, vision, international, interval, reconstruction, nearest, defined, second, single, demonstrate, kinect, stereo, approach]
@InProceedings{Song_2020_CVPR,
  author = {Song, Xibin and Dai, Yuchao and Zhou, Dingfu and Liu, Liu and Li, Wei and Li, Hongdong and Yang, Ruigang},
  title = {Channel Attention Based Iterative Residual Learning for Depth Map Super-Resolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
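The channel attention strategy named in the abstract above is, in spirit, a squeeze-and-excitation style re-weighting of feature channels. Below is a minimal, generic PyTorch sketch of such a block; it is not the authors' exact design, and the reduction ratio is an arbitrary choice.

import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Global average pooling yields one descriptor per channel ...
        w = x.mean(dim=(2, 3))
        # ... which is mapped to per-channel weights in (0, 1).
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w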
Time Flies: Animating a Still Image With Time-Lapse Video As Reference
Chia-Chi Cheng, Hung-Yu Chen, Wei-Chen Chiu


Time-lapse videos usually present eye-catching appearances but are often hard to create. In this paper, we propose a self-supervised end-to-end model to generate a time-lapse video from a single image and a reference video. Our key idea is to extract both the style and the features of temporal variation from the reference video, and transfer them onto the input image. To ensure both the temporal consistency and realness of our resultant videos, we introduce several novel designs in our architecture, including classwise NoiseAdaIN, flow loss, and the video discriminator. In comparison to baselines built on state-of-the-art style transfer approaches, our proposed method is not only efficient in computation but also able to create more realistic and temporally smooth time-lapse videos of a still image, with temporal variation consistent with the reference.
[video, temporal, recognition, frame, time, order, lip, modulation, dataset, clip, step, decoder] [feature, map, semantic] [model, input, variation, adversarial, difference] [reference, proposed, figure, ieee, pattern, based, method, flow, comparison, color, wavelet, result, spatial, output, adjacent] [style, transfer, image, adain, generated, photorealistic, egv, consistency, loss, egi, noiseadain, wct, perform, classwise, artistic, content, corresponding, preserve, generate, realness, synthesis, arbitrary, generator, discriminator, appearance, target, generative, gram, photorealism, pretrained, stylization] [learning, training, network, neural, set, problem, deep, objective, layer, better, design] [computer, conference, vision, single, second, well, outdoor]
@InProceedings{Cheng_2020_CVPR,
  author = {Cheng, Chia-Chi and Chen, Hung-Yu and Chiu, Wei-Chen},
  title = {Time Flies: Animating a Still Image With Time-Lapse Video As Reference},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
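The style-transfer backbone referenced above builds on adaptive instance normalization (AdaIN), shown in the minimal PyTorch sketch below; the paper's classwise NoiseAdaIN additionally conditions on semantic classes and injected noise, which this generic version omits.

import torch

def adain(content_feat, style_feat, eps=1e-5):
    # Align the channel-wise statistics of the content features with the style's.
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean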
SER-FIQ: Unsupervised Estimation of Face Image Quality Based on Stochastic Embedding Robustness
Philipp Terhorst, Jan Niklas Kolf, Naser Damer, Florian Kirchbuchner, Arjan Kuijper


Face image quality is an important factor to enable high-performance face recognition systems. Face quality assessment aims at estimating the suitability of a face image for the purpose of recognition. Previous work proposed supervised solutions that require artificially or human-labelled quality values. However, both labelling mechanisms are error-prone as they do not rely on a clear definition of quality and may not know the best characteristics for the utilized face recognition system. Avoiding the use of inaccurate quality labels, we propose a novel concept to measure face quality based on an arbitrary face recognition model. By determining the embedding variations generated from random subnetworks of a face model, the robustness of a sample representation, and thus its quality, is estimated. The experiments are conducted in a cross-database evaluation setting on three publicly available databases. We compare our proposed solution on two face embeddings against six state-of-the-art approaches from academia and industry. The results show that our unsupervised solution outperforms all other approaches in the majority of the investigated scenarios. In contrast to previous works, the proposed solution shows a stable performance over all scenarios. Utilizing the deployed face recognition model for our face quality assessment methodology avoids the training phase completely and further outperforms all baseline approaches by a large margin. Our solution can be easily integrated into current face recognition systems, and can be adapted to other tasks beyond face recognition.
[recognition, embeddings, embedding, evaluation, three, previous, outperforms, current] [predicted] [face, quality, assessment, model, facenet, robustness, arcface, verification, subnetworks, deployed, biometric, adience, trained, lfw, fmr, faceqnet, colorferet, methodology, robust, decision, highly, unconstrained, evaluated, facial] [figure, based, proposed, ieee, low, high, pattern, method, comparison, demonstrates, june, utilized, presented] [image, unsupervised, representation, pretrained] [performance, sample, training, stochastic, dropout, learning, random, baseline, machine, network, best, stable, data, small, large, forward, rate, considered, measure, investigated] [approach, solution, conference, computer, international, error, estimation, human, vision, relative, well, require]
@InProceedings{Terhorst_2020_CVPR,
  author = {Terhorst, Philipp and Kolf, Jan Niklas and Damer, Naser and Kirchbuchner, Florian and Kuijper, Arjan},
  title = {SER-FIQ: Unsupervised Estimation of Face Image Quality Based on Stochastic Embedding Robustness},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
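Because the abstract above describes the quality estimate procedurally, a minimal PyTorch sketch may help: run the face model several times with dropout left active, and map the spread of the resulting embeddings to a score, so that a smaller spread means higher quality. The model is assumed to contain dropout layers, and the final mapping to a score is one common choice rather than necessarily the authors' exact formula.

import torch
import torch.nn.functional as F

def serfiq_quality(model, image, n_passes=32):
    model.train()                      # keep dropout stochastic at inference time
    with torch.no_grad():
        embs = torch.stack([model(image.unsqueeze(0)).squeeze(0)
                            for _ in range(n_passes)])
    embs = F.normalize(embs, dim=1)
    dists = torch.cdist(embs, embs)    # pairwise distances between subnetwork embeddings
    mean_dist = dists.sum() / (n_passes * (n_passes - 1))
    return 2.0 * torch.sigmoid(-mean_dist).item()   # larger spread -> lower quality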
Grid-GCN for Fast and Scalable Point Cloud Learning
Qiangeng Xu, Xudong Sun, Cho-Ying Wu, Panqu Wang, Ulrich Neumann


Due to the sparsity and irregularity of the point cloud data, methods that directly consume points have become popular. Among all point-based models, graph convolutional networks (GCN) lead to notable performance by fully preserving the data granularity and exploiting point interrelation. However, point-based networks spend a significant amount of time on data structuring (e.g., Farthest Point Sampling (FPS) and neighbor point querying), which limits speed and scalability. In this paper, we present a method, named Grid-GCN, for fast and scalable point cloud learning. Grid-GCN uses a novel data structuring strategy, Coverage-Aware Grid Query (CAGQ). By leveraging the efficiency of grid space, CAGQ improves spatial coverage while reducing the theoretical time complexity. Compared with popular sampling methods such as Farthest Point Sampling (FPS) and Ball Query, CAGQ achieves up to 50 times speed-up. With a Grid Context Aggregation (GCA) module, Grid-GCN achieves state-of-the-art performance on major point cloud classification and segmentation benchmarks with significantly faster runtime than previous studies. Remarkably, Grid-GCN achieves an inference speed of 50 FPS on ScanNet using 81920 points as input. The supplementary xharlie.github.io/papers/GGCN_supCamReady.pdf and the code github.com/xharlie/Grid-GCN are released.
[context, node, graph, previous, speed, relation, attention, time, includes] [center, table, fps, edge, feature, achieves, module, pooling, semantic, segmentation, aggregation, faster, object] [query, model, input, ball] [ieee, pattern, convolutional, figure, cube, convolution] [representation] [data, sampling, group, number, learning, space, neural, deep, computation, efficient, latency, performance, network, arxiv, classification, preprint, layer, processing, sample, achieve, better, random, size] [point, cloud, grid, conference, voxel, computer, vision, coverage, occupied, structuring, cagq, voxels, local, gridconv, volumetric, rps, neighbor, cost]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Qiangeng and Sun, Xudong and Wu, Cho-Ying and Wang, Panqu and Neumann, Ulrich},
  title = {Grid-GCN for Fast and Scalable Point Cloud Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
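As a rough NumPy illustration of why grid-based data structuring is cheap, the sketch below hashes points into voxels and keeps one seed (the centroid) per occupied voxel in a single pass, in contrast to the quadratic cost of farthest point sampling; the coverage-aware re-weighting of CAGQ is not modeled here, so this is only a simplified analogue.

import numpy as np

def grid_seed_points(points, voxel_size=0.1):
    # Hash each (x, y, z) point to its voxel and average the points per voxel.
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    n_vox = inverse.max() + 1
    centroids = np.zeros((n_vox, points.shape[1]))
    counts = np.zeros(n_vox)
    np.add.at(centroids, inverse, points)
    np.add.at(counts, inverse, 1.0)
    return centroids / counts[:, None]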
Domain Balancing: Face Recognition on Long-Tailed Domains
Dong Cao, Xiangyu Zhu, Xingyu Huang, Jianzhu Guo, Zhen Lei


The long-tailed problem has been an important topic in face recognition. However, existing methods only concentrate on the long-tailed distribution of classes. Differently, we devote ourselves to the long-tailed domain distribution problem, which refers to the fact that a small number of domains appear frequently while other domains appear far less often. The key challenge of the problem is that domain labels are too complicated (related to race, age, pose, illumination, etc.) and inaccessible in real applications. In this paper, we propose a novel Domain Balancing (DB) mechanism to handle this problem. Specifically, we first propose a Domain Frequency Indicator (DFI) to judge whether a sample is from head domains or tail domains. Secondly, we formulate a light-weighted Residual Balancing Mapping (RBM) block to balance the domain distribution by adjusting the network according to DFI. Finally, we propose a Domain Balancing Margin (DBM) in the loss function to further optimize the feature space of the tail domains to improve generalization. Extensive analysis and experiments on several face recognition benchmarks demonstrate that the proposed method effectively enhances the generalization capacity and achieves superior performance.
[recognition, three, mechanism, embedding, contribution] [feature, table, head, propose, module, effectiveness, boundary] [face, lfw, dfi, rbm, dbm, cplfw, cosface, megaface, calfw, verification, arcface, model, agedb, zhen, testing, poor, age, sphereface, refers, race, decision, datasets, identification] [residual, method, ieee, frequency, proposed, figure, pattern, enhancement, adjust, based] [domain, loss, mapping, representation] [balancing, performance, margin, tail, deep, training, learning, distribution, soft, data, softmax, large, problem, class, indicator, network, compactness, evaluate, gate, function, set, average, xiangyu, balance, indicates, accuracy, arxiv, preprint, number, investigate, process, bias] [conference, computer, vision, international, well]
@InProceedings{Cao_2020_CVPR,
  author = {Cao, Dong and Zhu, Xiangyu and Huang, Xingyu and Guo, Jianzhu and Lei, Zhen},
  title = {Domain Balancing: Face Recognition on Long-Tailed Domains},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
AdversarialNAS: Adversarial Neural Architecture Search for GANs
Chen Gao, Yunpeng Chen, Si Liu, Zhenxiong Tan, Shuicheng Yan


Neural Architecture Search (NAS), which aims to automate the procedure of architecture design, has achieved promising results in many computer vision fields. In this paper, we propose an AdversarialNAS method specially tailored for Generative Adversarial Networks (GANs) to search for a superior generative model on the task of unconditional image generation. AdversarialNAS is the first method that can search the architectures of generator and discriminator simultaneously in a differentiable manner. During searching, the designed adversarial search algorithm does not need to compute any extra metric to evaluate the performance of the searched architecture, and the search paradigm considers the relevance between the two network architectures and improves their mutual balance. Therefore, AdversarialNAS is very efficient and only takes 1 GPU day to search for a superior generative model in the proposed large search space. Experiments demonstrate the effectiveness and superiority of our method. The discovered generative model sets a new state-of-the-art FID score of 10.87 and a highly competitive Inception Score of 8.74 on CIFAR-10. Its transferability is also proven by setting a new state-of-the-art FID score of 26.98 and an Inception score of 9.63 on STL-10. Code is at: https://github.com/chengaopro/AdversarialNAS.
[evaluation, previous, bilinear, reinforcement, natural, reward] [score, adopt, propose, edge, achieves, supervision] [adversarial, model, transferability, noise] [proposed, method, convolution, conv, superior, achieved, designed, figure, field, signal] [generator, discriminator, generative, gan, gans, image, fid, inception, progressive, specific, generation, loss, train, corresponding] [search, architecture, space, training, neural, performance, searching, adversarialnas, searched, optimal, network, autogan, set, arxiv, preprint, update, learning, discovered, algorithm, design, fixed, evaluate, agan, probability, stochastic, function, size, gpu, large, strategy, random, efficient, extremely, sample, distribution, gradient, calculating, operation, optimization, task, mutual, candidate, note, manual] [differentiable, directly]
@InProceedings{Gao_2020_CVPR,
  author = {Gao, Chen and Chen, Yunpeng and Liu, Si and Tan, Zhenxiong and Yan, Shuicheng},
  title = {AdversarialNAS: Adversarial Neural Architecture Search for GANs},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Image Super-Resolution With Cross-Scale Non-Local Attention and Exhaustive Self-Exemplars Mining
Yiqun Mei, Yuchen Fan, Yuqian Zhou, Lichao Huang, Thomas S. Huang, Honghui Shi


Deep convolution-based single image super-resolution (SISR) networks embrace the benefits of learning from large-scale external image resources for local recovery, yet most existing works have ignored the long-range feature-wise similarities in natural images. Some recent works have successfully leveraged this intrinsic feature correlation by exploring non-local attention modules. However, none of the current deep models have studied another inherent property of images: cross-scale feature correlation. In this paper, we propose the first Cross-Scale Non-Local (CS-NL) attention module with integration into a recurrent neural network. By combining the new CS-NL prior with local and in-scale non-local priors in a powerful recurrent fusion cell, we can find more cross-scale feature correlations within a single low-resolution (LR) image. The performance of SISR is significantly improved by exhaustively integrating all possible priors. Extensive experiments demonstrate the effectiveness of the proposed CS-NL module by setting new state-of-the-art results on multiple SISR benchmarks.
[attention, recurrent, natural, multiple, embedded, previous] [feature, module, effectiveness, table, branch, correlation, stride, achieves, map, yuchen, propose, mine, region] [model, internal, external, input] [ieee, proposed, figure, pattern, patch, convolution, fusion, psnr, based, sem, residual, san, rcan, convolutional, dbpn, edsr, rdn, sisr, scale, lapsrn, oisr, prior, method, abundant, downsample, ssim, repeated, existing, nonlocal, deconvolution] [image, factor] [deep, network, performance, size, mining, learning, neural, best, better, small] [conference, computer, vision, single, local, thomas, directly, international, intrinsic, matching, demonstrate]
@InProceedings{Mei_2020_CVPR,
  author = {Mei, Yiqun and Fan, Yuchen and Zhou, Yuqian and Huang, Lichao and Huang, Thomas S. and Shi, Honghui},
  title = {Image Super-Resolution With Cross-Scale Non-Local Attention and Exhaustive Self-Exemplars Mining},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
The Devil Is in the Details: Delving Into Unbiased Data Processing for Human Pose Estimation
Junjie Huang, Zheng Zhu, Feng Guo, Guan Huang


Recently, the leading performance of human pose estimation is dominated by top-down methods. Being a fundamental component in training and inference, data processing has not been systematically considered in the pose estimation community, to the best of our knowledge. In this paper, we focus on this problem and find that the devil of top-down pose estimators is in biased data processing. Specifically, by investigating the standard data processing in state-of-the-art approaches, mainly including data transformation and encoding-decoding, we find that the results obtained by the common flipping strategy are unaligned with the original ones at inference. Moreover, there is statistical error in the standard encoding-decoding during both training and inference. The two problems couple together and significantly degrade pose estimation performance. Based on quantitative analyses, we then formulate a principled way to tackle this dilemma. Data is processed in continuous space based on unit length (the intervals between pixels) instead of in discrete space with pixels, and a combined classification and regression approach is adopted to perform encoding-decoding. Unbiased Data Processing (UDP) for human pose estimation can be achieved by combining the two together. UDP not only boosts the performance of existing methods by a large margin but also plays an important role in reproducing results and in future exploration. As a model-agnostic approach, UDP promotes SimpleBaseline-ResNet50-256x192 by 1.5 AP (70.2 to 71.7) and HRNet-W32-256x192 by 1.7 AP (73.5 to 75.2) on the COCO test-dev set. HRNet-W48-384x288 equipped with UDP achieves 76.5 AP and sets a new state-of-the-art for human pose estimation. The source code is publicly available for further research.
[unbiased, unit, length, shift, future, recognition, predict, decoding] [coco, heatmap, predicted, improvement, achieves, detection, hrnet, biased, shifting, val, table, regression, simplebaseline, leading] [input, flipping, flip, original, combined, aforementioned, flipped] [proposed, result, ieee, method, pixel, pattern, output, based, figure, analysis, compensation] [image, source, person, corresponding, common, firstly] [data, network, processing, standard, performance, size, space, training, strategy, inference, label, classification, set, degrade, large, sample, find, principled] [pose, human, computer, estimation, conference, error, udp, transformation, coordinate, keypoint, vision, ground, truth, european, systematic, direction, international, continuous, approach]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Junjie and Zhu, Zheng and Guo, Feng and Huang, Guan},
  title = {The Devil Is in the Details: Delving Into Unbiased Data Processing for Human Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
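The "unit length" convention in the abstract above can be made concrete with a few lines of NumPy: when resizing or flipping, keypoint coordinates should be scaled by the number of pixel intervals (size - 1) rather than the number of pixels, which removes the systematic sub-pixel shift the paper analyzes. The helper names below are illustrative, not part of the released code.

import numpy as np

def resize_keypoints(kpts, src_hw, dst_hw):
    # kpts: (N, 2) array of (x, y); scale by interval counts, not pixel counts.
    sy = (dst_hw[0] - 1) / (src_hw[0] - 1)
    sx = (dst_hw[1] - 1) / (src_hw[1] - 1)
    return kpts * np.array([sx, sy])

def flip_keypoints(kpts, width):
    # Horizontal flip in continuous coordinates: x -> (width - 1) - x.
    flipped = kpts.copy()
    flipped[:, 0] = (width - 1) - flipped[:, 0]
    return flipped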
Data Uncertainty Learning in Face Recognition
Jie Chang, Zhonghao Lan, Changmao Cheng, Yichen Wei


Modeling data uncertainty is important for noisy images, but is seldom explored for face recognition. The pioneering work, PFE, considers uncertainty by modeling each face image embedding as a Gaussian distribution. It is quite effective. However, it uses a fixed feature (the mean of the Gaussian) from an existing model. It only estimates the variance and relies on an ad-hoc and costly metric. Thus, it is not easy to use. It is unclear how uncertainty affects feature learning. This work applies data uncertainty learning to face recognition, such that the feature (mean) and uncertainty (variance) are learnt simultaneously, for the first time. Two learning methods are proposed. They are easy to use and outperform existing deterministic methods as well as PFE on challenging unconstrained scenarios. We also provide insightful analysis on how incorporating uncertainty estimation helps reduce the adverse effects of noisy samples and affects the feature learning.
[embedding, recognition, illustrated, dataset, work, embeddings] [feature, regression, table, predicted, easy, score, hard, propose, center] [face, model, dulcls, dulrgs, pfe, identity, trained, dul, fig, quality, noise, datasets, megaface, unconstrained, refers, ytf, genuine, imposter] [proposed, ieee, noisy, gaussian, figure, likelihood, pattern, analysis, existing, output, method, viewed] [image, representation, mapping, latent, train, loss, learn, target] [data, baseline, learning, training, variance, deep, learned, deterministic, space, better, neural, similarity, large, cosine, class, classification, distributional, compared, distribution, arxiv, preprint, fixed, best] [uncertainty, conference, computer, vision, estimated, international, point, estimation, well, term, continuous, matching]
@InProceedings{Chang_2020_CVPR,
  author = {Chang, Jie and Lan, Zhonghao and Cheng, Changmao and Wei, Yichen},
  title = {Data Uncertainty Learning in Face Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
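To illustrate how a mean and a variance can be learned simultaneously as the abstract above describes, here is a minimal PyTorch sketch: two heads on a shared backbone predict mu and log-variance, an embedding is sampled with the reparameterization trick, and a KL-style term keeps the predicted variance from collapsing or exploding. The regularizer shown, taken against a unit-variance Gaussian centered at mu, is one common choice and not necessarily the paper's exact formulation.

import torch

def uncertain_embedding(mu, log_var):
    # mu, log_var: (B, D) outputs of two heads on a shared face backbone.
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)          # sampled identity embedding
    # KL(N(mu, sigma^2) || N(mu, I)) reduces to a per-dimension variance penalty.
    kl = 0.5 * (std.pow(2) - log_var - 1.0).sum(dim=1).mean()
    return z, kl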
Regularizing Discriminative Capability of CGANs for Semi-Supervised Generative Learning
Yi Liu, Guangchang Deng, Xiangping Zeng, Si Wu, Zhiwen Yu, Hau-San Wong


Semi-supervised generative learning aims to learn the underlying class-conditional distribution of partially labeled data. Generative Adversarial Networks (GANs) have led to promising progress in this task. However, the issue of imbalance between real labeled data and fake data in the adversarial learning process remains underexplored. To address this issue, we propose a regularization technique based on Random Regional Replacement (R^3-regularization) to facilitate the generative learning process. Specifically, we construct two types of between-class instances: cross-category ones and real-fake ones. These instances could be closer to the decision boundaries and are important for regularizing the classification and discriminative networks in our class-conditional GANs, which we refer to as R^3-CGAN. Better guidance from these two networks makes the generative network produce instances with class-specific information and high fidelity. We experiment with multiple standard benchmarks, and demonstrate that the R^3-regularization can lead to significant improvement in both classification and class-conditional image synthesis.
[constructed, construct, constituent] [adopt, denotes, improvement, region, instance, table, effectiveness] [model, adversarial, adv, regional, improving, improve, trained, original] [proposed, figure, ieee, pattern, method, enhancing] [generative, real, image, discriminative, synthesized, fake, synthesis, generator, fid, corresponding, replacement, synthesize, cutmix, perform, competing, address, supervised, enhancedtgan, discriminator, generation, issue] [learning, training, network, data, classification, labeled, class, random, baseline, neural, processing, deep, unlabeled, classifier, regularization, better, distribution, strategy, svhn, log, label, indicates, optimization, performance, machine, imbalance, experiment, lead, architecture] [conference, international, computer, term, demonstrate]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yi and Deng, Guangchang and Zeng, Xiangping and Wu, Si and Yu, Zhiwen and Wong, Hau-San},
  title = {Regularizing Discriminative Capability of CGANs for Semi-Supervised Generative Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FM2u-Net: Face Morphological Multi-Branch Network for Makeup-Invariant Face Verification
Wenxuan Wang, Yanwei Fu, Xuelin Qian, Yu-Gang Jiang, Qi Tian, Xiangyang Xue


Learning a makeup-invariant face verification model is challenging due to (1) insufficient makeup/non-makeup face training pairs, (2) the lack of diverse makeup faces, and (3) the significant appearance changes caused by cosmetics. To address these challenges, we propose a unified Face Morphological Multi-branch Network (FMMu-Net) for makeup-invariant face verification, which can simultaneously synthesize many diverse makeup faces through a face morphology network (FM-Net) and effectively learn cosmetics-robust face representations using an attention-based multi-branch learning network (AttM-Net). For challenges (1) and (2), FM-Net (two stacked auto-encoders) can synthesize realistic makeup face images by transferring specific regions of cosmetics via a cycle-consistent loss. For challenge (3), AttM-Net, consisting of one global and three local (task-driven on two eyes and mouth) branches, can effectively capture the complementary holistic and detailed information. Unlike DeepID2, which uses simple concatenation fusion, we introduce a heuristic method AttM-FM, attached to AttM-Net, to adaptively weight the features of different branches guided by the holistic information. We conduct extensive experiments on makeup face verification benchmarks (M-501, M-203, and FAM) and general face recognition datasets (LFW and IJB-A). Our framework FMMu-Net achieves state-of-the-art performance.
[recognition, three, dataset, visual, concatenation] [global, key, feature, propose, heavy, yanwei, achieves] [face, facial, model, verification, identity, original, izk, morphological, morphology, robust, izi, effectively, datasets, adversarial, covered] [ieee, fusion, pattern, method, figure, proposed, patch, adaptively] [makeup, loss, swapping, image, diverse, realistic, generative, generated, synthesize, synthetic, learn, cycle, person, representation, paired, generate, discriminative, transfer, introduce, corresponding] [network, learning, general, data, deep, training, performance, neural, test, achieve, compared, accuracy, indicates, problem, learned] [local, conference, computer, vision, international, consistent, capture]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Wenxuan and Fu, Yanwei and Qian, Xuelin and Jiang, Yu-Gang and Tian, Qi and Xue, Xiangyang},
  title = {FM2u-Net: Face Morphological Multi-Branch Network for Makeup-Invariant Face Verification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
UCTGAN: Diverse Image Inpainting Based on Unsupervised Cross-Space Translation
Lei Zhao, Qihang Mo, Sihuan Lin, Zhizhong Wang, Zhiwen Zuo, Haibo Chen, Wei Xing, Dongming Lu


Although existing image inpainting approaches have been able to produce visually realistic and semantically correct results, they produce only one result for each masked input. In order to produce multiple and diverse reasonable solutions, we present Unsupervised Cross-space Translation Generative Adversarial Network (called UCTGAN) which mainly consists of three network modules: conditional encoder module, manifold projection module and generation module. The manifold projection module and the generation module are combined to learn one-to-one image mapping between two spaces in an unsupervised way by projecting instance image space and conditional completion image space into common low-dimensional manifold space, which can greatly improve the diversity of the repaired samples. For understanding of global information, we also introduce a new cross semantic attention layer that exploits the long-range dependencies between the known parts and the completed parts, which can improve realism and appearance consistency of repaired samples. Extensive experiments on various datasets such as CelebA-HQ, Places2, Paris Street View and ImageNet clearly demonstrate that our method not only generates diverse inpainting solutions from the same image to be repaired, but also has high image quality.
[attention, multiple, order] [instance, semantic, module, table, feature] [model, adversarial, paris, improve, quality] [method, based, ieee, output, figure, existing, pattern, restored, proposed, visually] [image, inpainting, conditional, diverse, masked, manifold, generate, loss, cross, generation, mapping, corresponding, diversity, uctgan, llrec, unsupervised, reasonable, produce, repaired, translation, missing, generative, consists, encoder, semantically, appearance, content, munit, latent, consistency, mode, style, generated, ladv, train] [training, network, space, distribution, set, learning, layer, probability, neural, deep, test, processing, function, data, scc, zhejiang] [completion, conference, computer, projection, vision, constraint, defined, ground, truth, structure, left]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Lei and Mo, Qihang and Lin, Sihuan and Wang, Zhizhong and Zuo, Zhiwen and Chen, Haibo and Xing, Wei and Lu, Dongming},
  title = {UCTGAN: Diverse Image Inpainting Based on Unsupervised Cross-Space Translation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Decoupled Representation Learning for Skeleton-Based Gesture Recognition
Jianbo Liu, Yongcheng Liu, Ying Wang, Veronique Prinet, Shiming Xiang, Chunhong Pan


Skeleton-based gesture recognition is very challenging, as the high-level information in a gesture is expressed through a sequence of complex, composite motions. Previous works often learn all the motions with a single model. In this paper, we propose to decouple the gesture into hand posture variations and hand movements, which are then modeled separately. For the former, the skeleton sequence is embedded into a 3D hand posture evolution volume (HPEV) to represent fine-grained posture variations. For the latter, the shifts of the hand center and fingertips are arranged as a 2D hand movement map (HMM) to capture holistic movements. To learn from the two inhomogeneous representations for gesture recognition, we propose an end-to-end two-stream network. The HPEV stream integrates both spatial layout and temporal evolution information of hand postures by a dedicated 3D CNN, while the HMM stream develops an efficient 2D CNN to extract hand movement features. Eventually, the predictions of the two streams are aggregated with high efficiency. Extensive experiments on SHREC'17 Track, DHG-14/28 and FPHA datasets demonstrate that our method is competitive with the state-of-the-art.
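A minimal sketch of the decoupling step described above: a skeleton sequence is split into per-frame postures (joints relative to the hand center) and holistic movement (shifts of the center and fingertips). The joint indices and array layout are assumptions for illustration; the paper further embeds the two parts into a 3D evolution volume and a 2D movement map before the dedicated 3D/2D CNN streams.

import numpy as np

def decouple_skeleton(seq, center_idx=0, fingertip_idx=(4, 8, 12, 16, 20)):
    """Split a hand-skeleton sequence into posture and movement parts.

    seq: array of shape (T, J, 3) -- T frames, J joints, xyz coordinates.
    posture:  (T, J, 3) joints expressed relative to the hand center,
              capturing fine-grained posture variation per frame.
    movement: (T, 1 + len(fingertip_idx), 3) shifts of the hand center and
              fingertips w.r.t. the first frame, capturing holistic movement.
    """
    center = seq[:, center_idx:center_idx + 1, :]            # (T, 1, 3)
    posture = seq - center                                    # translation-invariant posture
    tracked = np.concatenate([center, seq[:, list(fingertip_idx), :]], axis=1)
    movement = tracked - tracked[0:1]                         # shifts relative to frame 0
    return posture, movement

seq = np.random.rand(32, 22, 3)   # e.g. 32 frames, 22 joints (illustrative)
posture, movement = decouple_skeleton(seq)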
[gesture, skeleton, recognition, dataset, action, fpha, hpev, sequence, represent, movement, temporal, fingertip, hmm, frpv, three, frame, extract, lstm, order, embedded] [track, table, feature, cnn, map, propose, final, achieves, center, framework] [posture, input, model, influence] [method, based, motion, spatial, figure, dynamic, convolution, comparison, relu, output, raw, adjacent, resolution, convolutional] [learn, gap, representation, subtle, fine, introduce] [learning, evolution, network, deep, accuracy, vector, performance, training, normalized, compared, bottleneck, neural, decoupled, efficient, applied] [hand, volume, relative, human, capture, approach, coarse, local, position]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Jianbo and Liu, Yongcheng and Wang, Ying and Prinet, Veronique and Xiang, Shiming and Pan, Chunhong},
  title = {Decoupled Representation Learning for Skeleton-Based Gesture Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
An Efficient PointLSTM for Point Clouds Based Gesture Recognition
Yuecong Min, Yanxiao Zhang, Xiujuan Chai, Xilin Chen


Point clouds contain rich spatial information, which provides complementary cues for gesture recognition. In this paper, we formulate gesture recognition as an irregular sequence recognition problem and aim to capture long-term spatial correlations across point cloud sequences. A novel and effective PointLSTM is proposed to propagate information from past to future while preserving the spatial structure. The proposed PointLSTM combines state information from neighboring points in the past with current features to update the current states by a weight-shared LSTM layer. This method can be integrated into many other sequence learning approaches. In the task of gesture recognition, the proposed PointLSTM achieves state-of-the-art results on two challenging datasets (NVGesture and SHREC'17) and outperforms previous skeleton-based methods. To show its advantages in generalization, we evaluate our method on MSR Action3D dataset, and it produces competitive results with previous skeleton-based methods.
[gesture, pointlstm, recognition, lstm, sequence, action, dataset, skeleton, previous, frame, state, video, temporal, current, msr, time, relevant, hidden, nvgesture, recurrent, extract, explore, irregular] [grouping, table, achieves, challenging, propose] [model] [proposed, ieee, method, pattern, dynamic, comparison, spatial, motion, based, convolutional, flow, neighboring, cell, figure, window] [corresponding, shared, preserving, idea] [performance, learning, baseline, layer, neural, network, evaluate, sampling, operation, small, inference, find, architecture, better, training, set, update, basic] [point, conference, hand, computer, vision, cloud, capture, depth, structure, international, human, rgb, scene]
@InProceedings{Min_2020_CVPR,
  author = {Min, Yuecong and Zhang, Yanxiao and Chai, Xiujuan and Chen, Xilin},
  title = {An Efficient PointLSTM for Point Clouds Based Gesture Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Editing in Style: Uncovering the Local Semantics of GANs
Edo Collins, Raja Bala, Bob Price, Sabine Susstrunk


While the quality of GAN image synthesis has improved tremendously in recent years, our ability to control and condition the output is still limited. Focusing on StyleGAN, we introduce a simple and effective method for making local, semantically-aware edits to a target output image. This is accomplished by borrowing elements from a source image, also a GAN output, via a novel manipulation of style vectors. Our method requires neither supervision from an external model, nor involves complex spatial morphing operations. Instead, it relies on the emergent disentanglement of semantic objects that is learned by StyleGAN during its training. Semantic editing is demonstrated on GANs producing human faces, indoor scenes, cats, and cars. We measure the locality and photorealism of the edits produced by our method, and find that it accomplishes both.
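The local edits described above amount to replacing a subset of the generator's style entries with those of a source image. The toy sketch below shows that swap, assuming the set of channels associated with a semantic part is already known; how those channels are identified (the paper derives them from StyleGAN's emergent feature clusters) is not shown, and all names and sizes are illustrative.

import torch

def local_edit(style_target, style_source, channel_idx, alpha=1.0):
    """Blend selected entries of per-layer style vectors from source into target.

    style_*: dict mapping layer name -> style vector of shape (C,).
    channel_idx: dict mapping layer name -> indices of channels associated
                 with the semantic part to edit (e.g. eyes, mouth).
    """
    edited = {k: v.clone() for k, v in style_target.items()}
    for layer, idx in channel_idx.items():
        edited[layer][idx] = (1 - alpha) * style_target[layer][idx] \
                             + alpha * style_source[layer][idx]
    return edited

# toy example with a single 512-dim style layer and three hand-picked channels
tgt = {"conv4": torch.randn(512)}
src = {"conv4": torch.randn(512)}
edit = local_edit(tgt, src, {"conv4": torch.tensor([3, 17, 42])})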
[natural, work, relevant] [semantic, feature, object, global, supervision, roi] [face, blending, query, adversarial, trained, poisson, mouth, external, review, facial] [reference, method, spatial, output, figure, ieee, convolutional, tensor, interpolation, spatially, analysis, channel] [image, editing, style, target, stylegan, gan, latent, generative, gans, transfer, control, representation, generator, edited, edits, photorealism, disentanglement, disentangled, attribute, ffhq, appearance, specific, locality, produced, learn, fid, conditioned, nose, transferring, synthesis, swapping, cluster] [arxiv, preprint, neural, layer, space, deep, vector, simple, learned, activation, matrix, data, applied, processing, best] [local, computer, conference, approach, vision, localized, complex, spherical, international, human]
@InProceedings{Collins_2020_CVPR,
  author = {Collins, Edo and Bala, Raja and Price, Bob and Susstrunk, Sabine},
  title = {Editing in Style: Uncovering the Local Semantics of GANs},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
On the Detection of Digital Face Manipulation
Hao Dang, Feng Liu, Joel Stehouwer, Xiaoming Liu, Anil K. Jain


Detecting manipulated facial images and videos is an increasingly important topic in digital media forensics. As advanced face synthesis and manipulation methods are made available, new types of fake face representations are being created which have raised significant concerns for their use in social media. Hence, it is crucial to detect manipulated face images and localize manipulated regions. Instead of simply using multi-task learning to simultaneously detect manipulated images and predict the manipulated mask (regions), we propose to utilize an attention mechanism to process and improve the feature maps for the classification task. The learned attention maps highlight the informative regions to further improve the binary classification (genuine face v. fake face), and also visualize the manipulated regions. To enable our study of manipulated face detection and localization, we collect a large-scale database that contains numerous types of facial forgeries. With this dataset, we perform a thorough analysis of data-driven fake face detection. We show that the use of an attention mechanism improves facial forgery detection and manipulated region localization.
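A minimal sketch of the attention mechanism described above: a single-channel map is predicted from backbone features, re-weights those features before the real/fake classifier, and can optionally be supervised with a manipulation mask. Layer sizes and names are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class AttentionForgeryHead(nn.Module):
    """Attention map that both highlights manipulated regions and re-weights
    backbone features before binary real/fake classification."""
    def __init__(self, in_ch=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(in_ch, 1, kernel_size=1), nn.Sigmoid())
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, 1))

    def forward(self, feat):
        a = self.attn(feat)                # (N, 1, H, W) manipulation attention map
        weighted = feat * a                # re-weight features by attention
        logit = self.classifier(weighted)  # real vs. fake logit
        return logit, a

head = AttentionForgeryHead()
feat = torch.randn(2, 128, 19, 19)
logit, attn_map = head(feat)
# attn_map can additionally be supervised with a ground-truth manipulation mask
# when one is available, alongside the binary real/fake loss.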
[attention, dataset, mechanism, video, three, matt, highlight] [map, detection, backbone, detect, feature, propose, mgt, localization, weakly, improves, supervision, table] [face, manipulated, manipulation, facial, forgery, adversarial, expression, identity, iinc, swap, model, digital, xiaoming, xception, pbca, uadfv, dffd, christian, collect, datasets, mam, deepfake, justus, detecting, improve] [proposed, figure, method, convolutional] [fake, real, image, supervised, diverse, utilize, modified, source, loss, generate, generative, attribute, produce] [learning, classification, deep, network, binary, entire, layer, data, performance, accuracy, training, large, set, number, cosine] [intersection, ground, approach, truth, direct, matthias, computer]
@InProceedings{Dang_2020_CVPR,
  author = {Dang, Hao and Liu, Feng and Stehouwer, Joel and Liu, Xiaoming and Jain, Anil K.},
  title = {On the Detection of Digital Face Manipulation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Texture Transformer Network for Image Super-Resolution
Fuzhi Yang, Huan Yang, Jianlong Fu, Hongtao Lu, Baining Guo


We study image super-resolution (SR), which aims to recover realistic textures from a low-resolution (LR) image. Recent progress has been made by taking high-resolution images as references (Ref), so that relevant textures can be transferred to LR images. However, existing SR approaches neglect to use attention mechanisms to transfer high-resolution (HR) textures from Ref images, which limits these approaches in challenging cases. In this paper, we propose a novel Texture Transformer Network for Image Super-Resolution (TTSR), in which the LR and Ref images are formulated as queries and keys in a transformer, respectively. TTSR consists of four closely-related modules optimized for image generation tasks, including a learnable texture extractor by DNN, a relevance embedding module, a hard-attention module for texture transfer, and a soft-attention module for texture synthesis. Such a design encourages joint feature learning across LR and Ref images, in which deep feature correspondences can be discovered by attention, and thus accurate texture features can be transferred. The proposed texture transformer can be further stacked in a cross-scale way, which enables texture recovery from different levels (e.g., from 1x to 4x magnification). Extensive experiments show that TTSR achieves significant improvements over state-of-the-art approaches on both quantitative and qualitative evaluations.
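The following sketch illustrates the hard/soft attention idea in the abstract: for each query patch (from the upsampled LR features) the most relevant reference patch is selected (hard attention) and its similarity kept as a confidence weight (soft attention). Patch size, normalization and the averaging of overlaps are simplifications and assumptions, not TTSR's exact modules.

import torch
import torch.nn.functional as F

def texture_transfer(q, k, v, patch=3):
    """q: query features (upsampled LR), k: key features (reference),
    v: value features (reference textures); all (N, C, H, W)."""
    n = q.shape[0]
    qu = F.unfold(q, patch, padding=patch // 2)           # (N, C*p*p, L)
    ku = F.unfold(k, patch, padding=patch // 2)
    vu = F.unfold(v, patch, padding=patch // 2)           # (N, Cv*p*p, L)
    qu = F.normalize(qu, dim=1)
    ku = F.normalize(ku, dim=1)
    rel = torch.bmm(qu.transpose(1, 2), ku)               # (N, L, L) relevance
    soft, hard = rel.max(dim=2)                           # best score & index per query
    idx = hard.unsqueeze(1).expand(-1, vu.shape[1], -1)
    transferred = torch.gather(vu, 2, idx)                # hard-attended reference textures
    ones = torch.ones_like(vu)
    norm = F.fold(ones, q.shape[-2:], patch, padding=patch // 2)
    out = F.fold(transferred, q.shape[-2:], patch, padding=patch // 2) / norm
    weight = soft.view(n, 1, *q.shape[-2:])               # soft-attention confidence map
    return out, weight

q = torch.randn(1, 8, 16, 16); k = torch.randn(1, 8, 16, 16); v = torch.randn(1, 16, 16, 16)
tex, conf = texture_transfer(q, k, v)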
[transformer, relevance, relevant, attention, visual, embedding, represent, three] [feature, module, propose, map, achieves, table, extractor, ablation, denotes, effectiveness] [adversarial, model, study, quality, original, testing, input] [ttsr, reference, proposed, figure, perceptual, srntt, refsr, learnable, comparison, sisr, residual, convolutional, integration, csfi, psnr, rsrgan, crossnet, method, stacked, quantitative, based, enhanced, output, rcan, channel] [image, texture, transfer, loss, transferred, generative, generation, extracted, transferal, user] [network, performance, deep, learning, design, better, training, search, process, achieve, best, layer, improved, neural] [accurate, single, approach, enables, position, novel]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Fuzhi and Yang, Huan and Fu, Jianlong and Lu, Hongtao and Guo, Baining},
  title = {Learning Texture Transformer Network for Image Super-Resolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Reference-Based Sketch Image Colorization Using Augmented-Self Reference and Dense Semantic Correspondence
Junsoo Lee, Eungyeup Kim, Yunsung Lee, Dongjun Kim, Jaehyuk Chang, Jaegul Choo


This paper tackles the automatic colorization task of a sketch image given an already-colored reference image. Colorizing a sketch image is in high demand in comics, animation, and other content creation applications, but it suffers from information scarcity of a sketch image. To address this, a reference image can render the colorization process in a reliable and user-driven manner. However, it is difficult to prepare for a training data set that has a sufficient amount of semantically meaningful pairs of images as well as the ground truth for a colored image reflecting a given reference (e.g., coloring a sketch of an originally blue car given a reference green car). To tackle this challenge, we propose to utilize the identical image with geometric distortion as a virtual reference, which makes it possible to secure the ground truth for a colored output image. Furthermore, it naturally provides the ground truth for dense semantic correspondence, which we utilize in our internal attention mechanism for color transfer from reference to sketch input. We demonstrate the effectiveness of our approach in various types of sketch image colorization via quantitative as well as qualitative evaluation against existing methods.
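A small sketch of the augmented-self reference idea above: the ground-truth color image is geometrically distorted to serve as a virtual reference, so the original image provides ground truth for both the colorization output and the dense correspondence. The random affine warp and its parameters are assumptions standing in for whatever distortion the authors actually apply.

import torch
import torch.nn.functional as F

def augmented_self_reference(img, max_shear=0.2, max_shift=0.1):
    """img: (N, 3, H, W) ground-truth color image.
    Returns a randomly warped copy to be used as the training reference."""
    n = img.shape[0]
    theta = torch.eye(2, 3).unsqueeze(0).repeat(n, 1, 1)
    theta[:, :, :2] += (torch.rand(n, 2, 2) - 0.5) * 2 * max_shear   # random shear/scale
    theta[:, :, 2] += (torch.rand(n, 2) - 0.5) * 2 * max_shift       # random translation
    grid = F.affine_grid(theta, img.shape, align_corners=False)
    return F.grid_sample(img, grid, align_corners=False, padding_mode="reflection")

gt = torch.rand(2, 3, 64, 64)
reference = augmented_self_reference(gt)
# the sketch input is obtained separately (e.g. by an edge/sketch extractor);
# training then colorizes the sketch conditioned on this distorted reference.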
[attention, evaluation, dataset, visual, described, work, automatic] [feature, semantic, module, table, score, object, car, reshape, region, map, key] [model, face, datasets, original, query, input, adversarial] [reference, color, pixel, output, method, spatially, figure, spatial, existing, quantitative, based, proposed] [image, sketch, colorization, loss, corresponding, transfer, semantically, style, generated, huang, translation, qualitative, content, encourages, generation, igt, lrec, real, ladv, fid, utilize, generating, scft] [imagenet, triplet, top, training, activation, ltr, task, set, deep, indicates, network, performance, close] [correspondence, ground, truth, human, colored, computer, full, directly, transformation, supplementary, international, well, approach, position, computed]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Junsoo and Kim, Eungyeup and Lee, Yunsung and Kim, Dongjun and Chang, Jaehyuk and Choo, Jaegul},
  title = {Reference-Based Sketch Image Colorization Using Augmented-Self Reference and Dense Semantic Correspondence},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deblurring Using Analysis-Synthesis Networks Pair
Adam Kaufman, Raanan Fattal


Blind image deblurring remains a challenging problem for modern artificial neural networks. Unlike other image restoration problems, deblurring networks fall behind the performance of existing deblurring algorithms in the case of uniform and 3D blur models. This follows from the diverse and profound effect that the unknown blur-kernel has on the deblurring operator. We propose a new architecture which breaks the deblurring network into an analysis network that estimates the blur, and a synthesis network that uses this kernel to deblur the image. Unlike existing deblurring networks, this design allows us to explicitly incorporate the blur-kernel in the network's training. In addition, we introduce new cross-correlation layers that allow better blur estimation, as well as unique components that let the estimated blur control the action of the synthesis deblurring network. Evaluating the new approach over established benchmark datasets shows its ability to achieve state-of-the-art deblurring accuracy on various tests, as well as offer a major speedup in runtime.
[dataset, three, pair, order, srn, recognition, action, describe, natural] [table, correlation, guided] [trained, case, input, model] [deblurring, analysis, blur, kernel, blind, ieee, convolution, figure, motion, pattern, deconvolution, existing, blurry, sharp, method, convolutional, psnr, applying, recover, spatial, deblur] [image, synthesis, consists, produced, loss, introduce, train] [network, training, size, neural, uniform, architecture, number, learning, layer, set, deep, accuracy, note, large, better, operation, design, entire, process] [computer, conference, well, vision, camera, single, approach, estimated, allows, reconstruction, international, estimation, allow, estimate, pipeline, novel]
@InProceedings{Kaufman_2020_CVPR,
  author = {Kaufman, Adam and Fattal, Raanan},
  title = {Deblurring Using Analysis-Synthesis Networks Pair},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Exploring Unlabeled Faces for Novel Attribute Discovery
Hyojin Bahng, Sunghyo Chung, Seungjoo Yoo, Jaegul Choo


Despite remarkable success in unpaired image-to-image translation, existing systems still require a large amount of labeled images. This is a bottleneck for their real-world applications; in practice, a model trained on the labeled CelebA dataset does not work well for test images from a different distribution, greatly limiting their application to unlabeled images of a much larger quantity. In this paper, we attempt to alleviate this necessity for labeled data in the facial image translation domain. We aim to explore the degree to which novel attributes can be discovered from unlabeled faces and used to perform high-quality translation. To this end, we use prior knowledge about the visual world as guidance to discover novel attributes and transfer them via a novel normalization method. Experiments show that our method, trained on unlabeled data, produces high-quality translations, preserves identity, and is perceptually realistic, as good as, or better than, state-of-the-art methods trained on labeled data.
[recognition, dataset, multiple, work] [feature, instance, adopt, object, table] [adversarial, trained, facial, input, model, face, skin, summary, identity] [ieee, pattern, method, existing, figure, affine, convolutional, color] [translation, image, attribute, cluster, hair, celeba, real, generative, unpaired, style, discover, transfer, common, generate, perform, utilize, loss, translated, texture, generator, stargan, domain, blond, content, fake, munit, discovery, unsupervised, asin, target, diverse, drit, alexei, xploregan, conditional, emotionet, user] [unlabeled, normalization, learning, labeled, data, test, training, group, number, accuracy, clustering, imagenet, classification] [conference, computer, vision, international, single, novel, well, european]
@InProceedings{Bahng_2020_CVPR,
  author = {Bahng, Hyojin and Chung, Sunghyo and Yoo, Seungjoo and Choo, Jaegul},
  title = {Exploring Unlabeled Faces for Novel Attribute Discovery},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Pose Transfer by Spatially Adaptive Instance Normalization
Jiashun Wang, Chao Wen, Yanwei Fu, Haitao Lin, Tianyun Zou, Xiangyang Xue, Yinda Zhang


Pose transfer has been studied for decades, in which the pose of a source mesh is applied to a target mesh. In this paper, we are particularly interested in transferring the pose of a source human mesh to deform a target human mesh, while the source and target meshes may carry different identity information. Traditional studies assume that paired source and target meshes exist with point-wise correspondences of user-annotated landmarks/mesh points, which requires heavy labelling effort. On the other hand, the generalization ability of deep models is limited when the source and target meshes have different identities. To break this limitation, we propose the first neural pose transfer model that solves pose transfer via the latest technique for image style transfer, leveraging the newly proposed component of spatially adaptive instance normalization. Our model does not require any correspondences between the source and target meshes. Extensive experiments show that the proposed model can effectively transfer deformation from source to target meshes, and generalizes well to unseen identities or poses of meshes. Code is available at https://github.com/jiashunwang/Neural-Pose-Transfer.
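A minimal sketch of a spatially adaptive instance normalization layer over per-vertex features, in the spirit of the component named above: point-wise features are instance-normalized, then modulated by a per-point scale and bias predicted from a conditioning point set (here the identity mesh's vertex coordinates). Which branch conditions which, and all channel sizes, are assumptions for illustration.

import torch
import torch.nn as nn

class SPAdaIN(nn.Module):
    """Spatially adaptive instance normalization over point features (N, C, V)."""
    def __init__(self, feat_ch, cond_ch=3):
        super().__init__()
        self.norm = nn.InstanceNorm1d(feat_ch, affine=False)
        self.gamma = nn.Conv1d(cond_ch, feat_ch, kernel_size=1)
        self.beta = nn.Conv1d(cond_ch, feat_ch, kernel_size=1)

    def forward(self, feat, cond):
        # feat: (N, C, V) pose-branch features; cond: (N, 3, V) identity mesh vertices
        return self.norm(feat) * self.gamma(cond) + self.beta(cond)

layer = SPAdaIN(feat_ch=64)
pose_feat = torch.randn(2, 64, 6890)       # e.g. an SMPL-sized vertex set
identity_verts = torch.randn(2, 3, 6890)
out = layer(pose_feat, identity_verts)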
[order, work, decoder, naive] [feature, edge, instance, key, propose, semantic, global, denotes] [model, identity, input, auxiliary, generalization] [figure, output, based, spatial, ieee, proposed, convolution, pattern, spatially, affine, method, comparison, quantitative] [transfer, source, target, spadain, style, image, mid, unseen, produce, generate, control, loss, qualitative, invariant, conditional, pmd, consists, ability, corresponding, arbitrary, learn, introduce] [network, learning, training, normalization, data, deep, architecture, regularization, activation, number, test, learned, neural, normalized] [mesh, pose, deformation, vertex, human, shape, conference, computer, vision, smpl, additional, acm, faust, point, ground, geometry, mpose, require, correspondence]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Jiashun and Wen, Chao and Fu, Yanwei and Lin, Haitao and Zou, Tianyun and Xue, Xiangyang and Zhang, Yinda},
  title = {Neural Pose Transfer by Spatially Adaptive Instance Normalization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fine-Grained Image-to-Image Transformation Towards Visual Recognition
Wei Xiong, Yutong He, Yixuan Zhang, Wenhan Luo, Lin Ma, Jiebo Luo


Existing image-to-image transformation approaches primarily focus on synthesizing visually pleasing data. Generating images with correct identity labels is challenging yet much less explored. It is even more challenging to deal with image transformation tasks with large deformation in poses, viewpoints, or scales while preserving the identity, such as face rotation and object viewpoint morphing. In this paper, we aim at transforming an image with a fine-grained category to synthesize new images that preserve the identity of the input image, which can thereby benefit the subsequent fine-grained image recognition and few-shot learning tasks. The generated images, transformed with large geometric deformation, do not necessarily need to be of high visual quality but are required to maintain as much identity information as possible. To this end, we adopt a model based on generative adversarial networks to disentangle the identity related and unrelated factors of an image. In order to preserve the fine-grained contextual details of the input image during the deformable transformation, a constrained nonalignment connection method is proposed to construct learnable highways between intermediate convolution blocks in the generator. Moreover, an adaptive identity modulation mechanism is proposed to transfer the identity information into the output image effectively. Extensive experiments on the CompCars and Multi-PIE datasets demonstrate that our model preserves the identity of the generated images much better than the state-of-the-art image-to-image transformation models, and as a result significantly boosts the visual recognition performance in fine-grained few-shot learning.
[visual, dataset, attention, recognition, modulation, conditioning] [feature, car, contextual, propose, map, location, object, adopt, boost] [identity, model, input, constrained, face, adversarial] [spatial, convolution, adaptive, output, existing, deformable, ieee, figure, method, proposed] [image, generated, generative, preserve, compcars, train, nonalignment, generate, generator, preservation, attribute, cnc, encoder, target, augment, discriminator, fid, real, aim, latent, generation, conditional] [learning, performance, training, vanilla, connection, classification, data, set, large, better, test, benefit, neural, label, classifier, accuracy, batch, network, selected] [transformation, neighborhood, conference, viewpoint, geometric, vision, computer, well]
@InProceedings{Xiong_2020_CVPR,
  author = {Xiong, Wei and He, Yutong and Zhang, Yixuan and Luo, Wenhan and Ma, Lin and Luo, Jiebo},
  title = {Fine-Grained Image-to-Image Transformation Towards Visual Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Facial Non-Rigid Multi-View Stereo
Ziqian Bai, Zhaopeng Cui, Jamal Ahmed Rahim, Xiaoming Liu, Ping Tan


We present a method for 3D face reconstruction from multi-view images with different expressions. We formulate this problem from the perspective of non-rigid multi-view stereo (NRMVS). Unlike previous learning-based methods, which often regress the face shape directly, our method optimizes the 3D face shape by explicitly enforcing multi-view appearance consistency, which is known to be effective in recovering shape details according to conventional multi-view stereo methods. Furthermore, by estimating face shape through optimization based on multi-view consistency, our method can potentially have better generalization to unseen data. However, this optimization is challenging since each input image has a different expression. We facilitate it with a CNN network that learns to regularize the non-rigid 3D face according to the input image and preliminary optimization results. Extensive experiments show that our method achieves the state-of-the-art performance on various datasets and generalizes well to in-the-wild data.
[recognition, step] [feature, level, table, map, achieves] [face, model, facial, nrmvs, morphable, input, preliminary, generic, tewari, expression, database, xiaoming, generalization, trained, bosphorus] [adaptive, pattern, method, proposed, based, ieee, figure, comparison, intensity] [image, alignment, qualitative, appearance, texture] [optimization, better, deep, training, learning, network, parameter, objective, performance, data, standard, set, note, large] [computer, vision, reconstruction, basis, stereo, geometry, shape, conference, dense, international, geometric, ground, capture, acm, volume, multiview, view, truth, well, fitting, limited, monocular, compute, michael, regress, error]
@InProceedings{Bai_2020_CVPR,
  author = {Bai, Ziqian and Cui, Zhaopeng and Rahim, Jamal Ahmed and Liu, Xiaoming and Tan, Ping},
  title = {Deep Facial Non-Rigid Multi-View Stereo},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Attention-Driven Cropping for Very High Resolution Facial Landmark Detection
Prashanth Chandran, Derek Bradley, Markus Gross, Thabo Beeler


Facial landmark detection is a fundamental task for many consumer and high-end applications and is almost entirely solved by machine learning methods today. Existing datasets used to train such algorithms are primarily made up of only low resolution images, and current algorithms are limited to inputs of comparable quality and resolution as the training dataset. On the other hand, high resolution imagery is becoming increasingly common as consumer cameras improve in quality every year. Therefore, there is a need for algorithms that can leverage the rich information available in high resolution imagery. Naively attempting to reuse existing network architectures on high resolution imagery is prohibitive due to memory bottlenecks on GPUs. The only current solution is to downsample the images, sacrificing resolution and quality. Building on top of recent progress in attention-based networks, we present a novel, fully convolutional regional architecture that is specially designed for predicting landmarks on very high resolution facial images without downsampling. We demonstrate the flexibility of our architecture by training the proposed model with images of resolutions ranging from 256 x 256 to 4K. In addition to being the first method for facial landmark detection on high resolution images, our approach achieves superior performance over traditional (holistic) state-of-the-art architectures across all resolutions, leading to a general-purpose, extremely flexible, high quality landmark detector.
[attention, recognition, predict, prediction, multiple, work, dataset] [global, fully, stage, detection, region, bounding, table, imagery, heatmap, predicted, cnn, box, localization, downsampled, interest] [landmark, facial, regional, hourglass, face, heatmaps, model, quality, datasets, original, input, trained, robust, argmax] [resolution, high, method, low, convolutional, ieee, pattern, cropping, existing, crop, based, figure, proposed, designed, driven] [image, alignment, train, latent, loss, corresponding] [network, architecture, training, deep, higher, test, learning, performance, set, size, neural, machine, operation] [computer, conference, vision, approach, ground, truth, international, single, additional, pose, refer]
@InProceedings{Chandran_2020_CVPR,
  author = {Chandran, Prashanth and Bradley, Derek and Gross, Markus and Beeler, Thabo},
  title = {Attention-Driven Cropping for Very High Resolution Facial Landmark Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis
Yiyi Liao, Katja Schwarz, Lars Mescheder, Andreas Geiger


In recent years, Generative Adversarial Networks have achieved impressive results in photorealistic image synthesis. This progress nurtures hopes that one day the classical rendering pipeline can be replaced by efficient models that are learned directly from images. However, current image synthesis models operate in the 2D domain where disentangling 3D properties such as camera viewpoint or object pose is challenging. Furthermore, they lack an interpretable and controllable representation. Our key hypothesis is that the image generation process should be modeled in 3D space as the physical world surrounding us is intrinsically three-dimensional. We define the new task of 3D controllable image synthesis and propose an approach for solving it by reasoning both in 3D space and in the 2D image domain. We demonstrate that our model is able to disentangle latent 3D factors of simple multi-object scenes in an unsupervised fashion from raw images. Compared to pure 2D baselines, it allows for synthesizing scenes that are consistent wrt. changes in viewpoint or object pose. We further evaluate various 3D representations in terms of their usefulness for this challenging task.
[dataset, multiple, recognition] [object, background, feature, map, foreground, car, challenging, propose, supervision, including] [model, adversarial, input] [method, ieee, figure, pattern, classical] [image, generative, generator, representation, controllable, synthesis, latent, learn, unsupervised, loss, generation, fid, disentangle, generated, alpha, generates, real, abstract, code, consistency, generate, manipulating, translation, disentangled, generating, photorealistic, interpretable, content] [learning, neural, processing, training, process, task, set, entire, data, sample, learned] [computer, vision, primitive, scene, international, pose, camera, differentiable, rendering, viewpoint, well, single, point, depth, render, allows, projected, geometric, consistent, sphere, indoor, directly, approach, full, rotation]
@InProceedings{Liao_2020_CVPR,
  author = {Liao, Yiyi and Schwarz, Katja and Mescheder, Lars and Geiger, Andreas},
  title = {Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
End-to-End Pseudo-LiDAR for Image-Based 3D Object Detection
Rui Qian, Divyansh Garg, Yan Wang, Yurong You, Serge Belongie, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger, Wei-Lun Chao


Reliable and accurate 3D object detection is a necessity for safe autonomous driving. Although LiDAR sensors can provide accurate 3D point cloud estimates of the environment, they are also prohibitively expensive for many settings. Recently, the introduction of pseudo-LiDAR (PL) has led to a drastic reduction in the accuracy gap between methods based on LiDAR sensors and those based on cheap stereo cameras. PL combines state-of-the-art deep neural networks for 3D depth estimation with those for 3D object detection by converting 2D depth map outputs to 3D point cloud inputs. However, so far these two networks have to be trained separately. In this paper, we introduce a new framework based on differentiable Change of Representation (CoR) modules that allow the entire PL pipeline to be trained end-to-end. The resulting framework is compatible with most state-of-the-art networks for both tasks and in combination with PointRCNN improves over PL consistently across all benchmarks --- yielding the highest entry on the KITTI image-based 3D object detection leaderboard at the time of submission. Our code will be made available at https://github.com/mileyan/pseudo-LiDAR_e2e.
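The fixed change of representation at the heart of pseudo-LiDAR is the standard pinhole back-projection of a depth map into a 3D point cloud, sketched below with NumPy; the paper's contribution is making this step and the surrounding depth/detection networks trainable end-to-end, which the sketch does not cover. The intrinsics in the example are only KITTI-like placeholder values.

import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W), in metres, into an (H*W, 3) point cloud
    using pinhole camera intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.full((375, 1242), 20.0)   # dummy depth map: a 20 m fronto-parallel plane
cloud = depth_to_pseudo_lidar(depth, fx=721.5, fy=721.5, cx=609.6, cy=172.9)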
[dataset, red] [object, detection, bin, lidar, detector, framework, table, autonomous, hard, bounding, moderate, ldet, map, raquel, iou, easy, module, car, final, predicted, apbev, box, center] [trained, input, model, change, pixor, improve] [figure, based, applying, signal, result, convolutional, subsampling, existing, proposed] [loss, corresponding, image, gap, qualitative, train, representation] [training, network, quantization, soft, set, deep, best, pass, equation, neural, respect, report, performance, learning, gradient, average] [depth, point, pipeline, kitti, cloud, stereo, estimation, accurate, ground, estimator, joint, truth, differentiable, directly, error]
@InProceedings{Qian_2020_CVPR,
  author = {Qian, Rui and Garg, Divyansh and Wang, Yan and You, Yurong and Belongie, Serge and Hariharan, Bharath and Campbell, Mark and Weinberger, Kilian Q. and Chao, Wei-Lun},
  title = {End-to-End Pseudo-LiDAR for Image-Based 3D Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards High-Fidelity 3D Face Reconstruction From In-the-Wild Images Using Graph Convolutional Networks
Jiangke Lin, Yi Yuan, Tianjia Shao, Kun Zhou


3D Morphable Model (3DMM) based methods have achieved great success in recovering 3D face shapes from single-view images. However, the facial textures recovered by such methods lack the fidelity exhibited in the input images. Recent works demonstrate high-quality facial texture recovery with generative networks trained on a large-scale database of high-resolution UV maps of face textures, which is hard to prepare and not publicly available. In this paper, we introduce a method to reconstruct 3D facial shapes with high-fidelity textures from single-view images in the wild, without the need to capture a large-scale face texture database. The main idea is to refine the initial texture generated by a 3DMM based method with facial details from the input image. To this end, we propose to use graph convolutional networks to reconstruct the detailed colors for the mesh vertices instead of reconstructing the UV map. Experiments show that our method can generate high-quality results and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.
[graph, recognition, work, decoder, gcn, embedding] [refinement, module, feature, framework, propose, adopt] [face, input, model, facial, morphable, adversarial, expression, facenet] [convolutional, ieee, spectral, pattern, based, method, high, figure, quantitative, comparison, proposed] [texture, image, loss, fidelity, generate, generated, utilize, produce, train, discriminator] [training, deep, network, learning, higher, neural, layer, large, better] [shape, computer, mesh, conference, detailed, vision, single, rendering, differentiable, rendered, reconstruction, coarse, vertex, reconstructed, albedo, computed, pose, regressor, thomas, international, projected, lighting, compute, refiner, reconstruct, compare, approach, defined, chebyshev, rgb, regress, fitting, laplacian]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Jiangke and Yuan, Yi and Shao, Tianjia and Zhou, Kun},
  title = {Towards High-Fidelity 3D Face Reconstruction From In-the-Wild Images Using Graph Convolutional Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition
Yuge Huang, Yuhan Wang, Ying Tai, Xiaoming Liu, Pengcheng Shen, Shaoxin Li, Jilin Li, Feiyue Huang


As an emerging topic in face recognition, designing margin-based loss functions can increase the feature margin between different classes for enhanced discriminability. More recently, the idea of mining-based strategies has been adopted to emphasize misclassified samples, achieving promising results. However, during the entire training process, prior methods either do not explicitly emphasize each sample according to its importance, which leaves hard samples not fully exploited, or explicitly emphasize the effects of semi-hard/hard samples even at the early training stage, which may lead to convergence issues. In this work, we propose a novel Adaptive Curriculum Learning loss (CurricularFace) that embeds the idea of curriculum learning into the loss function to achieve a novel training strategy for deep face recognition, which mainly addresses easy samples in the early training stage and hard ones in the later stage. Specifically, our CurricularFace adaptively adjusts the relative importance of easy and hard samples during different training stages. In each stage, different samples are assigned different importance according to their difficultness. Extensive experimental results on popular benchmarks demonstrate the superiority of our CurricularFace over the state-of-the-art competitors.
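A sketch of a margin-based softmax loss with curriculum-style re-weighting of hard negatives, in the spirit of the description above: the positive class receives an angular margin, negatives that are harder than the margined positive are scaled up as a curriculum parameter t grows, and t follows an exponential moving average of the positive cosine. The exact functional form, constants and update rule are assumptions and may differ from the paper's.

import torch
import torch.nn.functional as F

def curriculum_margin_loss(cosine, labels, t, m=0.5, s=64.0, alpha=0.99):
    """cosine: (N, num_classes) cosines between embeddings and class centres;
    labels: (N,) ground-truth indices; t: scalar curriculum parameter."""
    pos = cosine.gather(1, labels.view(-1, 1)).squeeze(1)                   # cos(theta_y)
    pos_m = torch.cos(torch.acos(pos.clamp(-1 + 1e-7, 1 - 1e-7)) + m)       # cos(theta_y + m)

    logits = cosine.clone()
    hard = cosine > pos_m.view(-1, 1)                   # negatives harder than the margined positive
    logits[hard] = cosine[hard] * (t + cosine[hard])    # emphasized more as t grows during training
    logits.scatter_(1, labels.view(-1, 1), pos_m.view(-1, 1))

    t_new = alpha * t + (1 - alpha) * pos.mean().item() # curriculum parameter update (EMA)
    return F.cross_entropy(s * logits, labels), t_new

emb_cos = F.normalize(torch.randn(8, 512)) @ F.normalize(torch.randn(512, 1000), dim=0)
loss, t = curriculum_margin_loss(emb_cos, torch.randint(0, 1000, (8,)), t=0.0)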
[recognition, modulation, embedding] [hard, easy, stage, positive, denotes, achieves, sota, feature, backbone, adopt, boundary, table, determined] [face, curricularface, arcface, model, verification, original, emphasize, emphasizes, decision, megaface, lfw, agedb, roc, emphasized, difficultness] [adaptive, method, adaptively, figure, proposed, formulated, coefficient] [loss, curriculum, corresponding, idea, discriminative, est] [training, cosine, negative, learning, sample, deep, early, large, similarity, margin, function, small, performance, set, softmax, parameter, larger, mining, strategy, harder, fixed, better, entire, convergence, achieve, popular, log, manually, average, best] [novel, focal, easier, truth, defined]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Yuge and Wang, Yuhan and Tai, Ying and Liu, Xiaoming and Shen, Pengcheng and Li, Shaoxin and Li, Jilin and Huang, Feiyue},
  title = {CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Rotate-and-Render: Unsupervised Photorealistic Face Rotation From Single-View Images
Hang Zhou, Jihao Liu, Ziwei Liu, Yu Liu, Xiaogang Wang


Though face rotation has achieved rapid progress in recent years, the lack of high-quality paired training data remains a great hurdle for existing methods. The current generative models heavily rely on datasets with multi-view images of the same person. Thus, their generated results are restricted by the scale and domain of the data source. To overcome these challenges, we propose a novel unsupervised framework that can synthesize photo-realistic rotated faces using only single-view image collections in the wild. Our key insight is that rotating faces in the 3D space back and forth, and re-rendering them to the 2D plane, can serve as a strong self-supervision. We leverage recent advances in 3D face modeling and high-resolution GANs to constitute our building blocks. Since the 3D rotate-and-render operation on faces can be applied at arbitrary angles without losing details, our approach is extremely suitable for in-the-wild scenarios (i.e. no paired data are available), where existing methods fall short. Extensive experiments demonstrate that our approach has superior synthesis quality as well as identity preservation over the state-of-the-art methods, across a wide range of poses and domains. Furthermore, we validate that our rotate-and-render framework can naturally act as an effective data augmentation engine for boosting modern face recognition systems, even on strong baseline models.
[recognition, previous] [propose, framework, table, key, boost, feature, xiaogang, map] [face, frontalization, model, strong, input, create, datasets, identity, facial, rdb, adversarial, trained, fig, casia, invisible, original, lfw] [figure, ieee, method, existing, pattern, validate] [image, loss, translation, unsupervised, texture, generated, rotate, representation, paired, photorealistic, gan, generation, generate, discriminator, ziwei, generative, gans, real, synthesis] [training, data, learning, deep, neural, space, baseline, process, network, set, strategy, normalization, performance] [rendered, rotation, view, pose, rotating, rotated, render, position, computer, rendering, conference, approach, ground, fitting, vision, pipeline, projection, novel]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Hang and Liu, Jihao and Liu, Ziwei and Liu, Yu and Wang, Xiaogang},
  title = {Rotate-and-Render: Unsupervised Photorealistic Face Rotation From Single-View Images},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
One-Shot Domain Adaptation for Face Generation
Chao Yang, Ser-Nam Lim


In this paper, we propose a framework capable of generating face images that fall into the same distribution as that of a given one-shot example. We leverage a pre-trained StyleGAN model that has already learned the generic face distribution. Given the one-shot target, we develop an iterative optimization scheme that rapidly adapts the weights of the model to shift the output's high-level distribution to the target's. To generate images of the same distribution, we introduce a style-mixing technique that transfers the low-level statistics from the target to faces randomly generated with the model. With that, we are able to generate an unlimited number of faces that inherit from the distribution of both generic human faces and the one-shot example. The newly generated faces can serve as augmented training data for other downstream tasks. Such a setting is appealing as it requires labeling very few, or even only one, example in the target domain, which is often the case for real-world face manipulations that result from a variety of unknown and unique distributions, each with extremely low prevalence. We show the effectiveness of our one-shot approach for detecting face manipulations and compare it with other few-shot domain adaptation methods qualitatively and quantitatively.
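The generic moment-matching idea behind "transferring low-level statistics" is sketched below as channel-wise mean/std matching between feature maps (an AdaIN-style operation); operating on raw feature maps is a simplification and an assumption, since the paper mixes statistics inside a pre-trained StyleGAN.

import torch

def transfer_channel_statistics(content_feat, style_feat, eps=1e-5):
    """Match per-channel mean/std of content features to those of the one-shot
    style example. Both inputs are (N, C, H, W)."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return (content_feat - c_mean) / c_std * s_std + s_mean

generated = torch.randn(4, 64, 32, 32)   # features of randomly generated faces
one_shot = torch.randn(1, 64, 32, 32)    # features of the single target example
adapted = transfer_channel_statistics(generated, one_shot)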
[visual, embeddings, shift, hierarchical, natural] [shifting, detect, detection, head, table] [face, model, deepfake, manipulation, generic, input, example, original, adversarial, trained, detecting, noise] [ieee, output, figure, pattern] [stylegan, image, target, domain, generated, style, manifold, adaptation, train, real, generate, generative, synthetic, progan, learn, loss, funit, gan, synthesis, latent, realistic] [vector, distribution, training, random, deep, classifier, neural, classification, learning, randomly, accuracy, number, large, arxiv, preprint, processing, optimization, capacity, network, updating, test, probabilistic, weight, compared, better] [conference, computer, international, reconstruction, vision, reconstructed, human, single, projection, approach, distance]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Chao and Lim, Ser-Nam},
  title = {One-Shot Domain Adaptation for Face Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
BidNet: Binocular Image Dehazing Without Explicit Disparity Estimation
Yanwei Pang, Jing Nie, Jin Xie, Jungong Han, Xuelong Li


Heavy haze results in severe image degradation and thus hampers the performance of visual perception, object detection, etc. On the assumption that dehazed binocular images are superior to hazy ones for stereo vision tasks such as 3D object detection, and given that image haze is a function of depth, this paper proposes a Binocular image dehazing Network (BidNet) aimed at dehazing both the left and right images of a binocular pair within the deep learning framework. Existing binocular dehazing methods rely on simultaneously dehazing and estimating disparity, whereas BidNet does not need to explicitly perform time-consuming and notoriously challenging disparity estimation. Note that a small error in disparity gives rise to a large variation in depth and in the estimation of the haze-free image. The relationship and correlation between binocular images are explored and encoded by the proposed Stereo Transformation Module (STM). Jointly dehazing binocular image pairs is mutually beneficial and better than dehazing the left images only. We extend the Foggy Cityscapes dataset to a Stereo Foggy Cityscapes dataset with binocular foggy image pairs. Experimental results demonstrate that BidNet significantly outperforms state-of-the-art dehazing methods in both subjective and objective assessments.
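The dependence of haze on depth mentioned above comes from the standard atmospheric scattering model, I(x) = J(x) * t(x) + A * (1 - t(x)) with t(x) = exp(-beta * d(x)), which most dehazing work (including synthetic foggy datasets) builds on. The NumPy sketch below applies and inverts that model; beta and the airlight value are illustrative, and the actual BidNet architecture is not shown.

import numpy as np

def synthesize_haze(clear, depth, beta=1.0, airlight=0.9):
    """clear: (H, W, 3) scene radiance; depth: (H, W) depth map.
    Haze density grows with depth, which is why stereo cues help dehazing."""
    t = np.exp(-beta * depth)[..., None]        # transmission map, (H, W, 1)
    return clear * t + airlight * (1.0 - t)

def invert_haze(hazy, transmission, airlight=0.9, t_min=0.1):
    """Recover scene radiance once transmission and airlight are estimated."""
    t = np.clip(transmission, t_min, 1.0)[..., None]
    return (hazy - airlight) / t + airlight

clear = np.random.rand(120, 160, 3)
depth = np.random.rand(120, 160)                # normalized depth in [0, 1]
hazy = synthesize_haze(clear, depth)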
[dataset, three, visual, pair, concatenation] [module, map, feature, correlation, yanwei, object, horizontal, denotes, val, refinement, refined, china, semantic, ling, detection] [input, model] [dehazing, foggy, binocular, transmission, bidnet, atmospheric, proposed, dehazed, disparity, convolutional, based, light, method, perceptual, haze, psnr, clear, fog, stm, ssim, mscnn, hazy, scattering, griddehazenet, figure, extraction, ublock, degradation] [image, loss, real, synthetic, perform, corresponding] [network, learning, deep, layer, large, better, training, simultaneously, size, performance, function, matrix, set] [stereo, left, estimation, matching, depth, transformation, single, distance, estimated, estimating, jointly, estimate, camera, ground, error, truth, vision, demonstrate]
@InProceedings{Pang_2020_CVPR,
  author = {Pang, Yanwei and Nie, Jing and Xie, Jin and Han, Jungong and Li, Xuelong},
  title = {BidNet: Binocular Image Dehazing Without Explicit Disparity Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Shutter Unrolling Network
Peidong Liu, Zhaopeng Cui, Viktor Larsson, Marc Pollefeys


We present a novel network for rolling shutter effect correction. Our network takes two consecutive rolling shutter images and estimates the corresponding global shutter image of the latest frame. The dense displacement field from a rolling shutter image to its corresponding global shutter image is estimated via a motion estimation network. The learned feature representation of a rolling shutter image is then warped, via the displacement field, to its global shutter representation by a differentiable forward warping block. An image decoder recovers the global shutter image based on the warped feature representation. Our network can be trained end-to-end and only requires the global shutter image for supervision. Since there is no public dataset available, we also propose two large datasets: the Carla-RS dataset and the Fastec-RS dataset. Experimental results demonstrate that our network outperforms the state-of-the-art methods. We make both our code and datasets available at https://github.com/ethliup/DeepUnrollNet.
[dataset, time, decoder, multiple, recognition] [global, feature, propose, map, occlusion, pyramid, table] [model, input, correction, trained, datasets, experimental, difference] [shutter, rolling, motion, field, based, warping, zhuang, ieee, consecutive, pixel, classical, block, captured, method, rectification, formation, quantitative, iggt, warped, figure, recover, pattern, latest, demonstrates, advantage] [image, row, corresponding, loss, real, representation, qualitative, learn, generated, train] [network, deep, forward, learned, problem, learning, data, performance, neural, training, find, test, better, vector, function] [camera, displacement, estimated, single, dense, ground, estimation, depth, vision, differentiable, computer, truth, virtual, estimate, scene, velocity, recovered, demonstrate, well, predicts]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Peidong and Cui, Zhaopeng and Larsson, Viktor and Pollefeys, Marc},
  title = {Deep Shutter Unrolling Network},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Joint Texture and Geometry Optimization for RGB-D Reconstruction
Yanping Fu, Qingan Yan, Jie Liao, Chunxia Xiao


Due to inevitable noise and quantization error, 3D models reconstructed via RGB-D sensors are always accompanied by geometric error and camera drift, which consequently lead to blurry and unnatural texture mapping results. Most 3D reconstruction methods focus on either geometry refinement or texture improvement alone, which decouples the inter-relationship between geometry and texture. In this paper, we propose a novel approach that can jointly optimize the camera poses, texture and geometry of the reconstructed model, as well as the color consistency between key-frames. Instead of expensively computing Shape-From-Shading (SFS), our method directly optimizes the reconstructed mesh according to color and geometric consistency and high-boost normal cues, which can effectively overcome the texture-copy problem generated by SFS and achieve more detailed shape reconstruction. As the joint optimization involves multiple correlated terms, we further introduce an iterative framework that interleaves their optimization. Experiments demonstrate that our method can recover not only fine-scale geometry but also high-fidelity texture.
[correct, time, three, sequence, frame] [refine, refinement, effectiveness, represents] [model, datasets, inconsistency, zhou, consumer, external, christian, effectively, quality] [color, method, figure, proposed, detail, result, captured, blurring, illumination, restore, based, optimized, enhance, enhanced, enhancement] [texture, consistency, image, mapping, corresponding, transfer, chunxia, qualitative, perform, generate] [optimization, optimize, set, achieve, function, number] [geometry, geometric, camera, vertex, reconstructed, depth, reconstruction, joint, mesh, normal, pose, computer, error, surface, triangle, directly, demonstrate, position, acm, photometric, initial, jointly, detailed, provided, visible, local, consistent, kinectfusion, kinect, defined, matthias, qingan, plane, projection, etex, term]
@InProceedings{Fu_2020_CVPR,
  author = {Fu, Yanping and Yan, Qingan and Liao, Jie and Xiao, Chunxia},
  title = {Joint Texture and Geometry Optimization for RGB-D Reconstruction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep 3D Capture: Geometry and Reflectance From Sparse Multi-View Images
Sai Bi, Zexiang Xu, Kalyan Sunkavalli, David Kriegman, Ravi Ramamoorthi


We introduce a novel learning-based method to reconstruct the high-quality geometry and complex, spatially-varying BRDF of an arbitrary object from a sparse set of only six images captured by wide-baseline cameras under collocated point lighting. We first estimate per-view depth maps using a deep multi-view stereo network; these depth maps are used to coarsely align the different views. We propose a novel multi-view reflectance estimation network architecture that is trained to pool features from these coarsely aligned images and predict per-view spatially-varying diffuse albedo, surface normals, specular roughness and specular albedo. We do this by jointly optimizing the latent space of our multi-view reflectance network to minimize the photometric error between images rendered with our predictions and the input images. While previous state-of-the-art methods fail on such sparse acquisition setups, we demonstrate, via extensive experiments on synthetic and real data, that our method produces high-quality reconstructions that can be used to render photorealistic images.
[predict, previous] [object, feature, predicted, map, fuse, propose] [input, trained, poisson, robust] [method, captured, acquisition, figure, light, warped, pixel, illumination] [image, latent, real, appearance, encoder, loss, arbitrary, synthetic] [network, optimization, deep, set, learning, training, architecture, optimizing] [geometry, svbrdf, depth, reflectance, view, sparse, reconstruct, estimation, reconstruction, acm, novel, single, point, specular, capture, surface, shape, collocated, estimate, lighting, photometric, normal, rendering, brdf, albedo, mesh, initial, stereo, complex, reconstructed, roughness, vertex, brdfs, camera, ground, truth, rendered, render, estimated, refer, kalyan, ravi, multiview, computer]
@InProceedings{Bi_2020_CVPR,
  author = {Bi, Sai and Xu, Zexiang and Sunkavalli, Kalyan and Kriegman, David and Ramamoorthi, Ravi},
  title = {Deep 3D Capture: Geometry and Reflectance From Sparse Multi-View Images},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Auto-Tuning Structured Light by Optical Stochastic Gradient Descent
Wenzheng Chen, Parsa Mirdehghan, Sanja Fidler, Kiriakos N. Kutulakos


We consider the problem of optimizing the performance of an active imaging system by automatically discovering the illuminations it should use, and the way to decode them. Our approach tackles two seemingly incompatible goals: (1) "tuning" the illuminations and decoding algorithm precisely to the devices at hand---to their optical transfer functions, non-linearities, spectral responses, image processing pipelines---and (2) doing so without modeling or calibrating the system; without modeling the scenes of interest; and without prior training data. The key idea is to formulate a stochastic gradient descent (SGD) optimization procedure that puts the actual system in the loop: projecting patterns, capturing images, and calculating the gradient of expected reconstruction error. We apply this idea to structured-light triangulation to "auto-tune" several devices---from smartphones and laser projectors to advanced computational cameras. Our experiments show that despite being model-free and automatic, optical SGD can boost system 3D accuracy substantially over state-of-the-art coding schemes.
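The optimization loop described above can be illustrated with a small, self-contained sketch. The real method keeps the physical projector-camera system inside the loop; here a synthetic system_error() stands in for projecting patterns, capturing images and decoding, and the SPSA-style two-sided perturbation used to estimate the gradient is an illustrative assumption rather than the authors' exact procedure.

import numpy as np

# Minimal sketch of "system-in-the-loop" stochastic gradient descent over
# structured-light patterns. system_error() is a synthetic stand-in for the
# real projector/camera; the SPSA-style perturbation is an illustrative way
# to estimate a gradient without a system model.
rng = np.random.default_rng(0)
gain = rng.uniform(2.0, 4.0)                       # unknown device non-linearity
vignetting = rng.uniform(0.6, 1.0, size=128)       # unknown per-column attenuation

def system_error(patterns):
    """Pretend to project, capture and decode; return how ambiguous the codes are."""
    captured = np.tanh(gain * patterns) * vignetting          # unknown optical transfer
    codes = captured.T                                        # one code vector per projector column
    d = np.linalg.norm(codes[:, None, :] - codes[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return float(np.mean(1.0 / (1.0 + d.min(axis=1))))        # near-duplicate codes -> high error

patterns = rng.uniform(0.0, 1.0, size=(4, 128))    # 4 projected patterns x 128 projector columns
lr, eps = 0.05, 1e-2
for _ in range(500):
    delta = rng.choice([-1.0, 1.0], size=patterns.shape)
    g = (system_error(patterns + eps * delta) -
         system_error(patterns - eps * delta)) / (2 * eps) * delta
    patterns = np.clip(patterns - lr * g, 0.0, 1.0)           # keep projectable intensities
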
[decoder, structured, three, account, step] [map, correlation] [noise, model] [optical, light, imaging, ieee, figure, projector, pixel, disparity, jacobian, optimized, board, tog, column, coding, illumination, pattern, noisy, zncc, phase, carte, spatial, frequency, field] [image, control, avg, transport] [vector, sgd, training, performance, optimization, penalty, gradient, function, optimal, evaluate, neural, stochastic, procedure, computational, set, deep, problem, optimizing, processing, expected, requires, learning, iteration] [system, scene, depth, error, acm, camera, correspondence, reconstruction, differentiable, numerical, projection, well, hamiltonian, laser, estimate, neighborhood, capture, compute, stereo]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Wenzheng and Mirdehghan, Parsa and Fidler, Sanja and Kutulakos, Kiriakos N.},
  title = {Auto-Tuning Structured Light by Optical Stochastic Gradient Descent},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MARMVS: Matching Ambiguity Reduced Multiple View Stereo for Efficient Large Scale Scene Reconstruction
Zhenyu Xu, Yiguang Liu, Xuelei Shi, Ying Wang, Yunan Zheng


The ambiguity in image matching is one of the main factors degrading the quality of 3D models reconstructed by PatchMatch-based multiple view stereo. In this paper, we present a novel method, matching ambiguity reduced multiple view stereo (MARMVS), to address this issue. MARMVS handles ambiguity in the image matching process with three newly proposed strategies: 1) Matching ambiguity is measured by the differential geometry properties of the image surface under the epipolar constraint, which serves as a critical criterion for optimal scale selection of every pixel with its corresponding neighbouring images. 2) The depth of every pixel is initialized closer to the true depth by utilizing the depths of its surrounding sparse feature points, which yields faster convergence in the subsequent PatchMatch stereo and alleviates the ambiguity introduced by self-similar structures in the image. 3) In the last propagation of the PatchMatch stereo, higher priority is given to planes whose related 2D image patches possess less ambiguity; this strategy further propagates a correctly reconstructed surface into poorly textured regions. In addition, the proposed method is very efficient even when running on consumer-grade CPUs, due to proper parameterization and discretization in the depth map computation step. MARMVS is validated on public benchmarks, and experimental results demonstrate competitive performance against the state of the art.
[evaluation, multiple, speed, time, individual] [map, propagation, feature, level, table, benchmark, surrounding] [stability] [pixel, method, scale, reference, based, proposed, ieee, pattern, patch, raw, high, figure, window, range, introduced, resolution, comparison] [image, texture, corresponding, consistency, generate] [set, selection, random, efficient, accuracy, strategy, computational, matrix, large, optimal, higher, computation, denote, proper, parameter, reduced, smaller, problem] [depth, matching, computer, neighbouring, stereo, surface, conference, normal, ambiguity, patchmatch, epipolar, vision, plane, neighbour, compute, point, computed, second, reconstruction, reconstructed, geometry, international, view, single, completeness, curvature, camera, dense, marmvs, direction, term, multiview, scene, measured]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Zhenyu and Liu, Yiguang and Shi, Xuelei and Wang, Ying and Zheng, Yunan},
  title = {MARMVS: Matching Ambiguity Reduced Multiple View Stereo for Efficient Large Scale Scene Reconstruction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Uncertainty Based Camera Model Selection
Michal Polic, Stanislav Steidl, Cenek Albl, Zuzana Kukelova, Tomas Pajdla


The quality and speed of Structure from Motion (SfM) methods depend significantly on the camera model chosen for the reconstruction. In most of the SfM pipelines, the camera model is manually chosen by the user. In this paper, we present a new automatic method for camera model selection in large scale SfM that is based on efficient uncertainty evaluation. We first perform an extensive comparison of classical model selection based on known Information Criteria and show that they do not provide sufficiently accurate results when applied to camera model selection. Then we propose a new Accuracy-based Criterion, which evaluates an efficient approximation of the uncertainty of the estimated parameters in tested models. Using the new criterion, we design a camera model selection method and fine-tune it by machine learning. Our simulated and real experiments demonstrate a significant increase in reconstruction quality as well as a considerable speedup of the SfM process.
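The classical information-criterion baselines that the paper compares against can be written in a few lines. The AIC/BIC formulas below are the textbook definitions under an i.i.d. Gaussian reprojection-error assumption and only illustrate that baseline, not the proposed accuracy-based criterion; the candidate models, residuals and parameter counts are hypothetical inputs.

import numpy as np

# Textbook AIC/BIC selection over candidate camera models (the baseline the
# paper argues is not accurate enough). `candidates` maps a model name to
# (reprojection residuals in pixels, number of camera parameters) -- toy data.
def information_criteria(residuals, k):
    r = np.asarray(residuals, dtype=float)
    n = r.size
    sigma2 = max(np.mean(r ** 2), 1e-12)                 # ML estimate of residual variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return aic, bic

candidates = {
    "pinhole":           (np.random.default_rng(1).normal(0, 1.2, 5000), 3),
    "pinhole+1radial":   (np.random.default_rng(2).normal(0, 0.8, 5000), 4),
    "pinhole+2radial":   (np.random.default_rng(3).normal(0, 0.79, 5000), 5),
}
best = min(candidates, key=lambda m: information_criteria(*candidates[m])[1])  # select by BIC
print("BIC-selected camera model:", best)
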
[time, eth, work, automatic, visual, evaluation] [table, threshold, positive] [model, distortion, radial, datasets, robust, true, input, tested, largest] [method, ieee, based, pattern, journal, comparison, figure, affine] [synthetic, real, image, common, loss] [selection, number, matrix, covariance, set, function, accuracy, selected, criterion, data, select, task, machine, standard, distribution, efficient, increase, statistical, depends, learning, evaluate, larger] [camera, reconstruction, reprojection, computer, sfm, estimated, assume, error, conference, uncertainty, gauge, registered, inliers, colmap, international, coordinate, point, vision, calibration, ctu, accurate, term, structure, well, polynomial, register, geometry]
@InProceedings{Polic_2020_CVPR,
  author = {Polic, Michal and Steidl, Stanislav and Albl, Cenek and Kukelova, Zuzana and Pajdla, Tomas},
  title = {Uncertainty Based Camera Model Selection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Local Implicit Grid Representations for 3D Scenes
Chiyu "Max" Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Niessner, Thomas Funkhouser


Shape priors learned from data are commonly used to reconstruct 3D objects from partial or noisy data. Yet no such shape priors are available for indoor scenes, since typical 3D autoencoders cannot handle their scale, complexity, or diversity. In this paper, we introduce Local Implicit Grid Representations, a new 3D shape representation designed for scalability and generality. The motivating idea is that most 3D surfaces share geometric details at some scale -- i.e., at a scale smaller than an entire object and larger than a small patch. We train an autoencoder to learn an embedding of local crops of 3D shapes at that size. Then, we use the decoder as a component in a shape optimization that solves for a set of latent codes on a regular grid of overlapping crops such that an interpolation of the decoded local shapes matches a partial or noisy observation. We demonstrate the value of this proposed approach for 3D surface reconstruction from sparse point observations, showing significantly better results than alternative approaches.
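The decoder-in-the-loop latent optimization described above follows a simple pattern: freeze a crop decoder and solve only for per-cell latent codes so that the decoded implicit values match sparse point observations. In the sketch below the tiny MLP is a randomly initialized stand-in for the pretrained part decoder, the grid cells do not overlap, and the point-to-cell assignment is simplified; it illustrates the optimization pattern, not the paper's model.

import torch
import torch.nn as nn

class CropDecoder(nn.Module):
    """Stand-in for the pretrained crop decoder (latent code + local xyz -> implicit value)."""
    def __init__(self, code_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(code_dim + 3, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, code, xyz_local):
        return self.net(torch.cat([code, xyz_local], dim=-1)).squeeze(-1)

torch.manual_seed(0)
decoder = CropDecoder().eval()
for p in decoder.parameters():
    p.requires_grad_(False)                                    # decoder weights stay fixed

grid = 4                                                       # 4x4x4 cells over the unit cube
codes = nn.Parameter(torch.zeros(grid ** 3, 32))               # one latent code per cell
points = torch.rand(2048, 3)                                   # sparse point observations (toy)
targets = torch.zeros(2048)                                    # observed surface -> implicit value 0

cell = (points * grid).clamp(max=grid - 1e-4).long()           # which cell each point falls in
cell_idx = cell[:, 0] * grid * grid + cell[:, 1] * grid + cell[:, 2]
local = points * grid - cell.float()                           # coordinates inside the cell

opt = torch.optim.Adam([codes], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = torch.mean((decoder(codes[cell_idx], local) - targets) ** 2)   # fit observations
    loss.backward()
    opt.step()
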
[embedding, decoder] [object, table, global, oriented, category] [input] [method, ieee, pattern, scale, reconstructing, proposed, figure, cell, based, comparison, interpolation, high] [latent, representation, learn, autoencoder, loss, corresponding, train, code] [learned, learning, training, network, entire, neural, deep, task, data, set, function, optimization, sampled, arxiv, preprint, large, sample, performance, space, number] [point, implicit, reconstruction, grid, shape, scene, local, geometric, computer, surface, conference, vision, sparse, single, overlapping, normal, matthias, reconstruct, thomas, approach, shapenet, well, continuous, scenenet, angela, michael, scalability, representing, leverage, tsdf, geometry, distance, exterior]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Chiyu "Max" and Sud, Avneesh and Makadia, Ameesh and Huang, Jingwei and Niessner, Matthias and Funkhouser, Thomas},
  title = {Local Implicit Grid Representations for 3D Scenes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TetraTSDF: 3D Human Reconstruction From a Single Image With a Tetrahedral Outer Shell
Hayato Onizuka, Zehra Hayirci, Diego Thomas, Akihiro Sugimoto, Hideaki Uchiyama, Rin-ichiro Taniguchi


Recovering the 3D shape of a person from their 2D appearance is ill-posed due to ambiguities. Nevertheless, with the help of convolutional neural networks (CNNs) and prior knowledge of the 3D human body, it is possible to overcome such ambiguities and recover detailed 3D shapes of human bodies from single images. Current solutions, however, fail to reconstruct all the details of a person wearing loose clothes. This is because of either (a) a huge memory requirement that cannot be met even on modern GPUs or (b) a compact 3D representation that cannot encode all the details. In this paper, we propose the tetrahedral outer shell volumetric truncated signed distance function (TetraTSDF) model for the human body, and its corresponding part connection network (PCN) for 3D human body shape regression. Our proposed model is compact, dense, accurate, and yet well suited for the CNN-based regression task. The proposed PCN allows us to learn the distribution of the TSDF in the tetrahedral volume from a single image in an end-to-end manner. Results show that our proposed method can reconstruct detailed shapes of humans wearing loose clothes from single RGB images.
[dataset, wearing, built, connected, work] [propose, cnn, regression, template] [model, input, clothes, university, hourglass, comparative] [proposed, method, figure, ieee, pattern, field, output, convolutional, high, color, cnns, resolution] [representation, image, person] [network, outer, layer, number, standard, memory, learning, neural, function, training, amount, note] [human, body, shape, single, tsdf, tetrahedral, computer, pose, volumetric, conference, detailed, reconstruct, reconstruction, smpl, vision, ground, truth, voxels, loose, shell, distance, voxel, international, michael, volume, bodynet, well, allows, mesh, articulated, coarse, surface, estimation, rgb, depth, reconstructed, grid, surreal, signed, regress, estimate, structure, dense, pcn]
@InProceedings{Onizuka_2020_CVPR,
  author = {Onizuka, Hayato and Hayirci, Zehra and Thomas, Diego and Sugimoto, Akihiro and Uchiyama, Hideaki and Taniguchi, Rin-ichiro},
  title = {TetraTSDF: 3D Human Reconstruction From a Single Image With a Tetrahedral Outer Shell},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Averaging Essential and Fundamental Matrices in Collinear Camera Settings
Amnon Geifman, Yoni Kasten, Meirav Galun, Ronen Basri


Global methods for Structure from Motion have gained popularity in recent years. A significant drawback of global methods is their sensitivity to collinear camera settings. In this paper, we introduce an analysis and algorithms for averaging bifocal tensors (essential or fundamental matrices) when either subsets or all of the camera centers are collinear. We provide a complete spectral characterization of bifocal tensors in collinear scenarios and further propose two averaging algorithms. The first algorithm uses rank constrained minimization to recover camera matrices in fully collinear settings. The second algorithm enriches the set of possibly mixed collinear and non-collinear cameras with additional, "virtual cameras," which are placed in general position, enabling the application of existing averaging methods to the enriched set of bifocal tensors. Our algorithms are shown to achieve state of the art results on various benchmarks that include autonomous car datasets and unordered image collections in both calibrated and uncalibrated settings.
[three, time, viewing, graph, include, rit, realized, construct] [global, table, fully] [datasets, satisfies, condition] [ieee, method, pattern, motion, recover, column, tensor] [translation, image, corresponding, consistency, photo, characterization, third] [matrix, algorithm, triplet, set, denote, general, eij, implies, rank, execution, number, incremental, note, vit, linear, lemma] [camera, collinear, bifocal, averaging, computer, fundamental, calibrated, essential, uncalibrated, cover, point, decomposition, virtual, conference, consistent, form, vision, determine, second, rotation, unordered, solve, internet, structure, position, full, projective, fij, implying, error, svd, international, ronen, bundle, sfm, initial, additional, kitti]
@InProceedings{Geifman_2020_CVPR,
  author = {Geifman, Amnon and Kasten, Yoni and Galun, Meirav and Basri, Ronen},
  title = {Averaging Essential and Fundamental Matrices in Collinear Camera Settings},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
On the Distribution of Minima in Intrinsic-Metric Rotation Averaging
Kyle Wilson, David Bindel


Rotation Averaging is a non-convex optimization problem that determines orientations of a collection of cameras from their images of a 3D scene. The problem has been studied using a variety of distances and robustifiers. The intrinsic (or geodesic) distance on SO(3) is geometrically meaningful; but while some extrinsic distance-based solvers admit (conditional) guarantees of correctness, no comparable results have been found under the intrinsic metric. In this paper, we study the spatial distribution of local minima. First, we conduct a novel empirical study to demonstrate sharp transitions in qualitative behavior: as problems become noisier, they transition from a single (easy-to-find) dominant minimum to a cost surface filled with minima. In the second part of this paper we derive a theoretical bound for when this transition occurs. This is an extension of the results of [24], which used local convexity as a proxy to study the difficulty of the problem. By recognizing the underlying quotient manifold geometry of the problem we achieve an n-fold improvement over prior work. Incidentally, our analysis also extends the prior l2 work to general lp costs. Our results suggest using algebraic connectivity as an indicator of problem difficulty.
[graph, natural, description, work, considering] [global, edge, horizontal, improvement, propose] [noise, study, robust] [analysis, figure, residual, method, based, spectral, nlm, prior] [manifold, row, distinct, notice] [problem, space, function, matrix, random, hessian, paper, bound, optimization, minimum, small, sij, note, orthogonal, consider, empirical, theoretical, rij, exactly, wij, distribution, best, group] [local, rotation, averaging, cost, gauge, convexity, tangent, solver, solution, quotient, relative, structure, connectivity, geodesic, initial, relaxed, gnm, intrinsic, dominant, convex, point, computer, distance, extrinsic, single, algebraic, orientation, absolute, vertex, guess, briales, sufficient, vertical, term, estimation, geometrically, demonstrate]
@InProceedings{Wilson_2020_CVPR,
  author = {Wilson, Kyle and Bindel, David},
  title = {On the Distribution of Minima in Intrinsic-Metric Rotation Averaging},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Lightweight Multi-View 3D Pose Estimation Through Camera-Disentangled Representation
Edoardo Remelli, Shangchen Han, Sina Honari, Pascal Fua, Robert Wang


We present a lightweight solution to recover 3D pose from multi-view images captured with spatially calibrated cameras. Building upon recent advances in interpretable representation learning, we exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points. This allows us to reason effectively about 3D pose across different views without using compute-intensive volumetric grids. Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections, which can simply be lifted to 3D via a differentiable Direct Linear Transform (DLT) layer. To do this efficiently, we propose a novel implementation of DLT that is orders of magnitude faster on GPU architectures than standard SVD-based triangulation methods. We evaluate our approach on two large-scale human pose datasets (H36M and Total Capture): our method outperforms or performs comparably to the state-of-the-art volumetric methods, while, unlike them, yielding real-time performance.
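For reference, the textbook SVD-based DLT triangulation that the proposed layer accelerates (and replaces with an SVD-free GPU implementation) can be sketched as follows; the toy cameras at the end are purely illustrative.

import numpy as np

# Standard SVD-based direct linear transform (DLT) triangulation of one joint
# from per-view 2D detections and calibrated 3x4 projection matrices.
def triangulate_dlt(points_2d, projections):
    """points_2d: (V, 2) detections; projections: (V, 3, 4) camera matrices."""
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows.append(u * P[2] - P[0])           # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                          # (2V, 4)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                                  # right singular vector of the smallest singular value
    return X[:3] / X[3]                         # de-homogenize to a 3D point

# Toy check with two synthetic cameras looking at a known 3D point.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 3.0, 1.0])
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]
print(triangulate_dlt(np.stack([x1, x2]), np.stack([P1, P2])))   # ~ [0.2, -0.1, 3.0]
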
[reason, decoder, multiple, outperforms] [feature, table, faster, unified, propose] [input, model, technique, effectively] [fusion, method, figure, ieee, convolutional, pattern, transform, lightweight, simply, captured, proposed] [representation, unseen, image, latent, disentangled, train, encoder, consists] [training, network, accuracy, baseline, note, architecture, set, report, linear, implementation, neural, performance, efficiently, setting, matrix, test, learned, evaluate, computationally, data, standard, efficient, simple] [pose, camera, human, approach, canonical, computer, estimation, volumetric, conference, vision, dlt, projection, view, joint, additional, novel, triangulation, monocular, jointly, totalcapture, compare, pictorial, refer, allows, direct, differentiable, capture]
@InProceedings{Remelli_2020_CVPR,
  author = {Remelli, Edoardo and Han, Shangchen and Honari, Sina and Fua, Pascal and Wang, Robert},
  title = {Lightweight Multi-View 3D Pose Estimation Through Camera-Disentangled Representation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Novel Recurrent Encoder-Decoder Structure for Large-Scale Multi-View Stereo Reconstruction From an Open Aerial Dataset
Jin Liu, Shunping Ji


A great deal of research has recently demonstrated that multi-view stereo (MVS) matching can be solved with deep learning methods. However, these efforts focused on close-range objects, and only very few of the deep learning-based methods were specifically designed for large-scale 3D urban reconstruction, due to the lack of multi-view aerial image benchmarks. In this paper, we present a synthetic aerial dataset, called the WHU dataset, which we created for MVS tasks and which, to our knowledge, is the first large-scale multi-view aerial dataset. It was generated from a highly accurate 3D digital surface model produced from thousands of real aerial images with precise camera parameters. We also introduce RED-Net, a novel network for wide-range depth inference, which we developed from a recurrent encoder-decoder structure that regularizes cost maps across depths, built on a 2D fully convolutional network framework. RED-Net's low memory requirements and high performance make it suitable for large-scale and highly accurate 3D Earth surface reconstruction. Our experiments confirmed that our method not only exceeds the current state-of-the-art MVS methods, reducing mean absolute error (MAE) by more than 50% with less memory and computational cost, but is also more efficient: it outperformed one of the best commercial software packages based on conventional methods, improving efficiency 16-fold. Moreover, we demonstrated that our RED-Net model pre-trained on the synthetic WHU dataset can be efficiently transferred to very different multi-view aerial image datasets without any fine-tuning. Dataset and code are available at http://gpcv.whu.edu.cn/data.
[dataset, recurrent, three, software, gru, recognition] [aerial, map, feature, table, area, stride, cropped] [input, trained, model, datasets] [whu, ieee, figure, pattern, created, convolutional, method, based, resolution, output, called, reference, high, conventional] [image, synthetic, generated, consists, produced, real] [deep, learning, memory, size, network, training, set, sample, test, large, compared, number, neural, machine, regularized, paper] [depth, stereo, cost, computer, conference, vision, reconstruction, camera, ground, surface, dtu, matching, scene, structure, dense, complete, international, virtual, colmap, mvsnet, provided, volume, direction, truth, accurate]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Jin and Ji, Shunping},
  title = {A Novel Recurrent Encoder-Decoder Structure for Large-Scale Multi-View Stereo Reconstruction From an Open Aerial Dataset},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Factorized Higher-Order CNNs With an Application to Spatio-Temporal Emotion Estimation
Jean Kossaifi, Antoine Toisoul, Adrian Bulat, Yannis Panagakis, Timothy M. Hospedales, Maja Pantic


Training deep neural networks with spatio-temporal (i.e., 3D) or multidimensional convolutions of higher-order is computationally challenging due to millions of unknown parameters across dozens of layers. To alleviate this, one approach is to apply low-rank tensor decompositions to convolution kernels in order to compress the network and reduce its number of parameters. Alternatively, new convolutional blocks, such as MobileNet, can be directly designed for efficiency. In this paper, we unify these two approaches by proposing a tensor factorization framework for efficient multidimensional (separable) convolutions of higher-order. Interestingly, the proposed framework enables a novel higher-order transduction, allowing to train a network on a given domain (e.g., 2D images or N-dimensional data in general) and using transduction to generalize to higher-order data such as videos (or (N+K)--dimensional data in general), capturing for instance temporal dynamics while preserving the learnt spatial information. We apply the proposed methodology, coined CP-Higher-Order Convolution (HO-CPConv), to spatio-temporal facial emotion analysis. Most existing facial affect models focus on static imagery and discard all temporal information. This is due to the above-mentioned burden of training 3D convolutional nets and the lack of large bodies of video data annotated by experts. We address both issues with our proposed framework. Initial training is first done on static imagery before using transduction to generalize to the temporal domain. We demonstrate superior performance on three challenging large scale affect estimation datasets, AffectNet, SEWA, and AFEW-VA.
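A CP-style separable spatio-temporal convolution of the kind described can be sketched in PyTorch: a full C_out x C_in x T x H x W kernel is replaced by a 1x1x1 input projection, one cheap rank-wise 1D convolution per mode, and a 1x1x1 output projection. The sizes, ordering and naming below are illustrative assumptions, not the paper's HO-CPConv; in the transduction setting one could imagine training the spatial factors on images first and only later introducing the temporal factor, though the block here does not implement that mechanism.

import torch
import torch.nn as nn

class SeparableConv3d(nn.Module):
    """Illustrative CP-factorized 3D convolution: pointwise in, per-mode 1D convs, pointwise out."""
    def __init__(self, in_ch, out_ch, rank, k=3):
        super().__init__()
        p = k // 2
        self.in_proj = nn.Conv3d(in_ch, rank, kernel_size=1)
        self.conv_t = nn.Conv3d(rank, rank, (k, 1, 1), padding=(p, 0, 0), groups=rank)
        self.conv_h = nn.Conv3d(rank, rank, (1, k, 1), padding=(0, p, 0), groups=rank)
        self.conv_w = nn.Conv3d(rank, rank, (1, 1, k), padding=(0, 0, p), groups=rank)
        self.out_proj = nn.Conv3d(rank, out_ch, kernel_size=1)

    def forward(self, x):                               # x: (N, C_in, T, H, W)
        x = self.in_proj(x)
        x = self.conv_w(self.conv_h(self.conv_t(x)))    # one cheap 1D convolution per mode
        return self.out_proj(x)

block = SeparableConv3d(in_ch=64, out_ch=64, rank=16)
video = torch.randn(2, 64, 8, 32, 32)                   # (batch, channels, frames, height, width)
print(block(video).shape)                                # torch.Size([2, 64, 8, 32, 32])
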
[temporal, static, factorized, regular, emotion, video, order, automatic] [framework, propose, apply] [facial, input, trained, pcc, model, database, series] [tensor, convolution, convolutional, affect, proposed, separable, kernel, valence, transduction, arousal, pattern, spatial, figure, ieee, applying, output, method, ccc, existing, sewa, analysis, tucker, sagr, introduced, conv, range, multidimensional, block] [train, loss, factor, image, domain] [deep, efficient, neural, number, learning, performance, rank, training, network, large, size, data, depthwise, applied, mobilenet, bottleneck, weight, layer, consider, machine, memory] [vision, estimation, computer, approach, decomposition, conference, rmse, jean, allows, continuous, structure]
@InProceedings{Kossaifi_2020_CVPR,
  author = {Kossaifi, Jean and Toisoul, Antoine and Bulat, Adrian and Panagakis, Yannis and Hospedales, Timothy M. and Pantic, Maja},
  title = {Factorized Higher-Order CNNs With an Application to Spatio-Temporal Emotion Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Effectively Unbiased FID and Inception Score and Where to Find Them
Min Jin Chong, David Forsyth


This paper shows that two commonly used evaluation metrics for generative models, the Frechet Inception Distance (FID) and the Inception Score (IS), are biased -- the expected value of the score computed for a finite sample set is not the true value of the score. Worse, the paper shows that the bias term depends on the particular model being evaluated, so model A may get a better score than model B simply because model A's bias term is smaller. This effect cannot be fixed by evaluating at a fixed number of samples. This means all comparisons using FID or IS as currently computed are unreliable. We then show how to extrapolate the score to obtain an effectively bias-free estimate of scores computed with an infinite number of samples, which we term FID Infinity and IS Infinity. In turn, this effectively bias-free estimate requires good estimates of scores with a finite number of samples. We show that using Quasi-Monte Carlo integration notably improves estimates of FID and IS for finite sample sets. Our extrapolated scores are simple, drop-in replacements for the finite sample scores. Additionally, we show that using low discrepancy sequence in GAN training offers small improvements in the resulting generator.
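The extrapolation to an effectively bias-free score can be illustrated directly: evaluate FID at several finite sample sizes, fit a line against 1/N (the bias is, to first order, linear in 1/N), and read off the intercept. Here fid_at(n) is a hypothetical stand-in for any existing FID implementation evaluated on n generated samples, ideally drawn with a quasi-Monte Carlo sampler as the paper recommends; the toy fake_fid at the end only checks the extrapolation under that assumed 1/N form.

import numpy as np

# Sketch of extrapolating FID to an infinite number of samples.
def fid_infinity(fid_at, sizes=(5000, 10000, 20000, 50000)):
    ns = np.array(sizes, dtype=float)
    fids = np.array([fid_at(int(n)) for n in ns])
    slope, intercept = np.polyfit(1.0 / ns, fids, deg=1)   # FID(N) ~ FID_inf + c / N
    return intercept

# Toy demonstration with a synthetic bias of the assumed 1/N form.
fake_fid = lambda n: 12.3 + 4.0e4 / n + np.random.default_rng(n).normal(0, 0.05)
print(round(fid_infinity(fake_fid), 2))                     # close to the true value 12.3
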
[sequence, unbiased, evaluation, regular] [score, biased, table] [model, trained, effectively, adversarial, true, finite] [figure, low, high, ieee] [fid, inception, generative, gans, generator, biggan, dcgan, gan, image, generated] [sobol, bias, variance, fidn, carlo, better, isn, lower, log, sobolinv, training, arxiv, sampling, qmc, preprint, depends, small, standard, linear, number, best, random, distribution, monte, note, higher, good, function, integral, integrator, fids, sample, imagenet, compared, accuracy, comparable, sobolbm, fixed] [estimate, normal, term, estimated, error, computed, computer, point, compute, estimating, accurate, conference]
@InProceedings{Chong_2020_CVPR,
  author = {Chong, Min Jin and Forsyth, David},
  title = {Effectively Unbiased FID and Inception Score and Where to Find Them},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Robust Homography Estimation via Dual Principal Component Pursuit
Tianjiao Ding, Yunchen Yang, Zhihui Zhu, Daniel P. Robinson, Rene Vidal, Laurent Kneip, Manolis C. Tsakiris


We revisit robust estimation of homographies over point correspondences between two or three views, a fundamental problem in geometric vision. The analysis serves as a platform to support a rigorous investigation of Dual Principal Component Pursuit (DPCP) as a valid and powerful alternative to RANSAC for robust model fitting in multiple-view geometry. Homography fitting is cast as a robust nullspace estimation problem over either homographic or epipolar/trifocal embeddings. We prove that the nullspace of epipolar or trifocal embeddings in the homographic scenario, of dimension 3 and 6 for two and three views respectively, is defined by unique, computable homographies. Experiments show that DPCP performs on par with USAC with local optimization, while requiring an order of magnitude less computing time, and it also outperforms a recent deep learning implementation for homography estimation.
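Casting homography fitting as nullspace estimation can be illustrated with the classical, non-robust version: stack the homographic embedding of each correspondence and recover the homography as the right singular vector with the smallest singular value. DPCP, as studied in the paper, replaces this least-squares step with a robust nullspace estimate that tolerates a large fraction of outliers; the noiseless toy data below is purely illustrative.

import numpy as np

# Homography fitting as (non-robust) nullspace estimation via the DLT embedding.
def fit_homography(x, xp):
    """x, xp: (N, 2) matched points in the two images."""
    rows = []
    for (u, v), (up, vp) in zip(x, xp):
        rows.append([-u, -v, -1, 0, 0, 0, up * u, up * v, up])
        rows.append([0, 0, 0, -u, -v, -1, vp * u, vp * v, vp])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)                 # nullspace vector = homography up to scale

rng = np.random.default_rng(0)
H_true = np.array([[1.0, 0.02, 5.0], [-0.01, 0.98, -3.0], [1e-4, -2e-4, 1.0]])
x = rng.uniform(0, 100, size=(50, 2))
xh = np.hstack([x, np.ones((50, 1))]) @ H_true.T
xp = xh[:, :2] / xh[:, 2:3]
H = fit_homography(x, xp)
print(np.allclose(H / H[2, 2], H_true, atol=1e-4))           # True on noiseless data
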
[embeddings, three, time, sequence] [corner, threshold, table, global, faster] [robust, model, case, compatible, hyperplane] [homography, ieee, pattern, analysis, motion, tensor, running, patch, dual, high, figure] [component, translation, image] [subspace, matrix, linear, set, problem, dimension, learning, optimization, vector, proposition, implementation, computational, large, machine, number, space, higher, general, ratio, group, consider, min, note] [computer, trifocal, homographic, conference, epipolar, estimation, error, vision, nullspace, dpcp, uniquely, homographies, usac, principal, international, fundamental, ransac, structure, planar, reprojection, point, local, camera, correspondence, inliers, well, left, rotation, view, accurate, inlier, geometry, plane, fitting, perspective]
@InProceedings{Ding_2020_CVPR,
  author = {Ding, Tianjiao and Yang, Yunchen and Zhu, Zhihui and Robinson, Daniel P. and Vidal, Rene and Kneip, Laurent and Tsakiris, Manolis C.},
  title = {Robust Homography Estimation via Dual Principal Component Pursuit},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Non-Adversarial Video Synthesis With Learned Priors
Abhishek Aich, Akash Gupta, Rameswar Panda, Rakib Hyder, M. Salman Asif, Amit K. Roy-Chowdhury


Most of the existing works in video synthesis focus on generating videos using adversarial learning. Despite their success, these methods often require input reference frame or fail to generate diverse videos from the given data distribution, with little to no uniformity in the quality of videos that can be generated. Different from these methods, we focus on the problem of generating videos from latent noise vectors, without any reference input frames. To this end, we develop a novel approach that jointly optimizes the input latent space, the weights of a recurrent neural network and a generator through non-adversarial learning. Optimizing for the input latent space along with the network weights allows us to generate videos in a controlled environment, i.e., we can faithfully generate all videos the model has seen during the learning process as well as new unseen videos. Extensive experiments on three challenging and diverse datasets well demonstrate that our proposed approach generates superior quality videos compared to the existing state-of-the-art methods.
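The non-adversarial joint optimization of latent codes and network weights can be sketched in a GLO-style toy: per-video latents are trainable parameters optimized together with the generator under a plain reconstruction loss, with no discriminator. The tiny MLP generator and the random 8-frame clips are stand-ins for illustration only; the paper additionally optimizes a recurrent network that produces per-frame latents.

import torch
import torch.nn as nn

torch.manual_seed(0)
n_videos, frames, h, w, zdim = 32, 8, 16, 16, 64
videos = torch.rand(n_videos, frames, h, w)                 # training clips (toy data)

latents = nn.Parameter(torch.randn(n_videos, zdim) * 0.01)  # one learnable code per clip
generator = nn.Sequential(nn.Linear(zdim, 256), nn.ReLU(),
                          nn.Linear(256, frames * h * w), nn.Sigmoid())

opt = torch.optim.Adam([{"params": [latents]},
                        {"params": generator.parameters()}], lr=1e-3)
for step in range(500):
    opt.zero_grad()
    recon = generator(latents).view(n_videos, frames, h, w)
    loss = torch.mean((recon - videos) ** 2)                # non-adversarial objective
    loss.backward()
    opt.step()                                              # updates codes and weights together
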
[video, frame, static, action, dataset, recurrent, work] [propose] [adversarial, input, model, quality, condition] [proposed, method, ieee, reference, interpolation, figure, motion, pattern, comparison, convolutional, range] [latent, generate, generative, loss, generated, mocogan, generator, unseen, generation, golf, representation, weizmann, synthesis, generating, vgan, image, corresponding, fcs, diverse, gans, kzi] [space, network, learned, learning, training, neural, randomly, triplet, vector, set, portion, function, arxiv, preprint, data, optimization, min, class, random, deep, optimize, note, performance, better, respect] [transient, approach, conference, computer, human, relative, international, scene, vision, require, jointly, represented, chair]
@InProceedings{Aich_2020_CVPR,
  author = {Aich, Abhishek and Gupta, Akash and Panda, Rameswar and Hyder, Rakib and Asif, M. Salman and Roy-Chowdhury, Amit K.},
  title = {Non-Adversarial Video Synthesis With Learned Priors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Uncertainty-Aware Mesh Decoder for High Fidelity 3D Face Reconstruction
Gun-Hee Lee, Seong-Whan Lee


The 3D Morphable Model (3DMM) is a statistical model of facial shape and texture based on a set of linear basis functions. Most recent 3D face reconstruction methods aim to embed the 3D morphable basis functions into a deep convolutional neural network (DCNN). However, balancing the requirements of strong regularization for global shape and weak regularization for high-level details remains ill-posed. To address this problem, we properly control generality and specificity in terms of regularization by harnessing the power of uncertainty. Additionally, we focus on the concept of nonlinearity and find that a Graph Convolutional Neural Network (Graph CNN) and a Generative Adversarial Network (GAN) are effective in reconstructing high-quality 3D shapes and textures, respectively. In this paper, we propose to employ (i) an uncertainty-aware encoder that represents face features as distributions and (ii) a fully nonlinear decoder model combining a Graph CNN with a GAN. We demonstrate that our method produces excellent high-quality results and outperforms previous state-of-the-art methods on 3D face reconstruction tasks for both constrained and in-the-wild images.
[graph, decoder, dataset, represent, work, embedding, recognition, previous] [cnn, feature, propose, map, employ, level, unified, including, regression] [face, model, facial, input, morphable, robust, nonlinear, identity, quality, generality, uncertain, tran, specificity, korea, strong, effective, expression] [high, method, figure, proposed, convolutional, based, pixel, reconstructing, perceptual, comparison] [image, texture, loss, gan, encoder, representation, fidelity, alignment] [network, linear, regularization, power, training, learning, vector, deep, large, space, parameter, set, deterministic, confident, performance, distribution] [shape, reconstruction, uncertainty, mesh, reconstruct, rendered, single, additional, camera, directly, approach, pose, unobserved, novel, monocular, defined, vertex]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Gun-Hee and Lee, Seong-Whan},
  title = {Uncertainty-Aware Mesh Decoder for High Fidelity 3D Face Reconstruction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
3FabRec: Fast Few-Shot Face Alignment by Reconstruction
Bjorn Browatzki, Christian Wallraven


Current supervised methods for facial landmark detection require a large amount of training data and may suffer from overfitting to specific datasets due to the massive number of parameters. We introduce a semi-supervised method in which the crucial idea is to first generate implicit face knowledge from the large amounts of unlabeled images of faces available today. In a first, completely unsupervised stage, we train an adversarial autoencoder to reconstruct faces via a low-dimensional face embedding. In a second, supervised stage, we interleave the decoder with transfer layers to retask the generation of color images to the prediction of landmark heatmaps. Our framework (3FabRec) achieves state-of-the-art performance on several common benchmarks and, most importantly, is able to maintain impressive accuracy on extremely small training sets down to as few as 10 images. As the interleaved layers only add a low amount of parameters to the decoder, inference runs at several hundred FPS on a GPU.
[dataset, order, current, prediction] [table, localization, heatmap, framework, stage, annotated, resnet, regression, predicted, detection, feature, including, add] [landmark, face, facial, adversarial, trained, datasets, robust, nme, interleaved, itls, wflw, heatmaps, aflw, lab, wing] [ieee, method, pattern, convolutional, figure, based, high, itl] [image, autoencoder, supervised, loss, alignment, unsupervised, transfer, generator, appearance, generative, encoder, latent, train, generation] [training, deep, learning, set, data, performance, large, layer, network, size, knowledge, number, accuracy, neural, rate, achieve, finetuning] [computer, conference, vision, reconstruction, approach, full, shape, international, pose, reconstruct, additional, well, reconstructed, implicit]
@InProceedings{Browatzki_2020_CVPR,
  author = {Browatzki, Bjorn and Wallraven, Christian},
  title = {3FabRec: Fast Few-Shot Face Alignment by Reconstruction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Weakly-Supervised Domain Adaptation via GAN and Mesh Model for Estimating 3D Hand Poses Interacting Objects
Seungryul Baek, Kwang In Kim, Tae-Kyun Kim


Despite recent successes in hand pose estimation, challenges remain for RGB-based 3D hand pose estimation (HPE) under hand-object interaction (HOI) scenarios, which exhibit severe occlusions and cluttered backgrounds. Recent RGB HOI benchmarks have been collected in either the real or the synthetic domain; however, the size of these datasets is far from sufficient to cover diverse objects combined with hand poses, and 3D pose annotations of real samples are lacking, especially for occluded cases. In this work, we propose a novel end-to-end trainable pipeline that adapts the hand-object domain to the single hand-only domain while learning HPE. The domain adaptation occurs in image space via 2D pixel-level guidance by a Generative Adversarial Network (GAN) and 3D mesh guidance by a mesh renderer (MR). Via the domain adaptation in image space, not only is 3D HPE accuracy improved, but HOI input images are also translated to segmented and de-occluded hand-only images. The proposed method takes advantage of both guidances: the GAN accurately aligns hands, while the MR effectively fills in occluded pixels. Experiments on the Dexter-Object, Ego-Dexter and HO3D datasets show that our method significantly outperforms state-of-the-art methods trained on hand-only data and is comparable to those supervised with HOI data. Note that our method is trained primarily on hand-only images with pose labels, and on HOI images without pose labels.
[skeleton, interaction, dataset] [hoi, segmentation, feature, object, tracking, final, table, reg, framework, supervision, fully, occluded, refine] [input, trained, datasets, model, testing, adversarial, heatmaps, christian] [based, existing, method, convolutional, figure] [image, corresponding, gan, domain, real, adaptation, synthetic, generative, generated, generator] [training, data, algorithm, learning, performance, network, deep, space, neural, comparable, test] [hand, pose, estimation, mesh, rgb, fpe, estimator, single, depth, skeletal, joint, provided, hme, renderer, dan, dhand, human, lheat, tex, mano, shape, interacting, initial, estimated, dgan, system, markus, vincent, antonis, reconstruction, thomas, estimating, seungryul, hpe]
@InProceedings{Baek_2020_CVPR,
  author = {Baek, Seungryul and Kim, Kwang In and Kim, Tae-Kyun},
  title = {Weakly-Supervised Domain Adaptation via GAN and Mesh Model for Estimating 3D Hand Poses Interacting Objects},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Vec2Face: Unveil Human Faces From Their Blackbox Features in Face Recognition
Chi Nhan Duong, Thanh-Dat Truong, Khoa Luu, Kha Gia Quach, Hung Bui, Kaushik Roy


Unveiling the face images of a subject given his/her high-level representations extracted from a blackbox Face Recognition engine is extremely challenging, because of the limited information accessible from that engine, including its structure and its uninterpretable extracted features. This paper presents a novel generative structure with Bijective Metric Learning, namely Bijective Generative Adversarial Networks in a Distillation framework (DiBiGAN), for synthesizing faces of an identity given that person's features. To effectively address this problem, this work first introduces a bijective metric so that distance measurement and metric learning can be carried out directly in the image domain for an image reconstruction task. Secondly, a distillation process is introduced to maximize the information exploited from the blackbox face recognition engine. Then a Feature-Conditional Generator Structure with an Exponential Weighting Strategy is presented for a more robust generator that can synthesize realistic faces with ID preservation. Results on several benchmark datasets, including CelebA, LFW, AgeDB and CFP-FP, against matching engines demonstrate the effectiveness of DiBiGAN in terms of both image realism and ID preservation.
[recognition, embedding, exploit] [feature, chi, framework, table, propose, fully] [face, blackbox, adversarial, input, subject, whitebox, model, effectively, facial, lzg, nhan, khoa, kha, gia, quality, exploited, dibigan, testing, bijection] [proposed, method, adopted, figure, perceptual, prior] [image, bijective, real, generator, loss, latent, extracted, ldistill, realistic, mapping, synthesizing, generative, synthesized, variable, domain, synthesize, realism, preservation, synthesis, representation, conditional, learn] [learning, metric, deep, function, process, training, knowledge, distillation, arg, accuracy, set, space, min, distribution, arxiv, preprint, task, classifier, neural, feat, network, better] [reconstruction, structure, distance, directly, reconstructed, reconstruct, limited, matching, matcher]
@InProceedings{Duong_2020_CVPR,
  author = {Duong, Chi Nhan and Truong, Thanh-Dat and Luu, Khoa and Quach, Kha Gia and Bui, Hung and Roy, Kaushik},
  title = {Vec2Face: Unveil Human Faces From Their Blackbox Features in Face Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
StyleRig: Rigging StyleGAN for 3D Control Over Portrait Images
Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Perez, Michael Zollhofer, Christian Theobalt


StyleGAN generates photorealistic portrait images of faces with eyes, teeth, hair and context (neck, shoulders, background), but lacks a rig-like control over semantic face parameters that are interpretable in 3D, such as face pose, expressions, and scene illumination. Three-dimensional morphable face models (3DMMs) on the other hand offer control over the semantic parameters, but lack photorealism when rendered and only model the face interior, not other parts of a portrait image (hair, mouth interior, background). We present the first method to provide a face rig-like control over a pretrained and fixed StyleGAN via a 3DMM. A new rigging network, RigNet is trained between the 3DMM's semantic parameters and StyleGAN's input. The network is trained in a self-supervised manner, without the need for manual annotations. At test time, our method generates portrait images with the photorealism of StyleGAN and provides explicit control over the 3D semantic parameters of the face.
[explicit, three, provide] [semantic, head, interactive, employ, florian, imagery] [face, model, facial, expression, morphable, trained, quality, identity, christian, adversarial, medium, ayush, input, change, patrick] [illumination, based, figure, output, high] [control, latent, image, stylegan, loss, generative, target, rignet, generated, consistency, code, editing, conditional, stylerig, photorealistic, pretrained, synthesis, source, portrait, photorealism, karras, style, mixing, train, transfer, gans, corresponding, modified, lack, generator, rerendering] [training, space, network, vector, parameter, learning, neural, function, deep, architecture, set, data, learned] [reconstruction, allows, approach, differentiable, scene, computer, michael, rotation, parametric, mesh, coarse, well, rig, acm, rendered, vision]
@InProceedings{Tewari_2020_CVPR,
  author = {Tewari, Ayush and Elgharib, Mohamed and Bharaj, Gaurav and Bernard, Florian and Seidel, Hans-Peter and Perez, Patrick and Zollhofer, Michael and Theobalt, Christian},
  title = {StyleRig: Rigging StyleGAN for 3D Control Over Portrait Images},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis
Jogendra Nath Kundu, Siddharth Seth, Varun Jampani, Mugalodi Rakesh, R. Venkatesh Babu, Anirban Chakraborty


Camera captured human pose is an outcome of several sources of variation. Performance of supervised 3D pose estimation approaches comes at the cost of dispensing with variations, such as shape and appearance, that may be useful for solving other related tasks. As a result, the learned model not only inculcates task-bias but also dataset-bias because of its strong reliance on the annotated samples, which also holds true for weakly-supervised models. Acknowledging this, we propose a self-supervised learning framework to disentangle such variations from unlabeled video frames. We leverage the prior knowledge on human skeleton and poses in the form of a single part-based 2D puppet model, human pose articulation constraints, and a set of unpaired 3D poses. Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, not only facilitates discovery of interpretable pose disentanglement, but also allows us to operate on videos with diverse camera movements. Qualitative results on unseen in-the-wild datasets establish our superior generalization across multiple tasks beyond the primary tasks of 3D pose estimation and part segmentation. Furthermore, we demonstrate state-of-the-art weakly-supervised 3D pose estimation performance on both Human3.6M and MPI-INF-3DHP datasets.
[dataset, decoder, video] [segmentation, supervision, map, framework, table, predicted, object, detection] [model, datasets, input, wild] [spatial, proposed, prior, chen, based, output, figure] [image, appearance, unsupervised, representation, encoder, supervised, consistency, unpaired, latent, unseen, paired, corresponding, synthesis, discovery, disentangled, loss, absence, gap] [learning, network, set, training, performance, energy, deep, presence, knowledge] [pose, human, estimation, camera, shape, canonical, differentiable, limb, joint, single, consistent, msal, novel, puppet, local, articulation, form, lsp, transformation, michael, leverage, rigid, uncertainty, articulated, direct, wunc, depth]
@InProceedings{Kundu_2020_CVPR,
  author = {Kundu, Jogendra Nath and Seth, Siddharth and Jampani, Varun and Rakesh, Mugalodi and Babu, R. Venkatesh and Chakraborty, Anirban},
  title = {Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Meta Face Recognition in Unseen Domains
Jianzhu Guo, Xiangyu Zhu, Chenxu Zhao, Dong Cao, Zhen Lei, Stan Z. Li


Face recognition systems are usually faced with unseen domains in real-world applications and show unsatisfactory performance due to their poor generalization. For example, a model well trained on webface data cannot deal with the ID vs. Spot task in a surveillance scenario. In this paper, we aim to learn a generalized model that can directly handle new unseen domains without any model updating. To this end, we propose a novel face recognition method via meta-learning named Meta Face Recognition (MFR). MFR synthesizes the source/target domain shift with a meta-optimization objective, which requires the model to learn effective representations not only on synthesized source domains but also on synthesized target domains. Specifically, we build domain-shift batches through a domain-level sampling strategy and get back-propagated gradients/meta-gradients on synthesized source/target domains by optimizing multi-domain distributions. The gradients and meta-gradients are further combined to update the model to improve generalization. Besides, we propose two benchmarks for generalized face recognition evaluation. Experiments on our benchmarks validate the generalization of our method compared to several baselines and other state-of-the-art methods. The proposed benchmarks and code will be available at https://github.com/cleardusk/MFR.
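The meta-optimization objective, in which the model must also perform well on a held-out synthesized target domain after a gradient step on the synthesized source domains, can be sketched in a simplified first-order form. The embedding network, the plain cross-entropy loss, the fake domain batches and the hyper-parameters below are illustrative stand-ins, not the paper's MFR losses or architecture.

import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
outer_opt = torch.optim.SGD(net.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()
inner_lr, meta_weight = 1e-2, 1.0

def fake_batch(domain_seed):
    """Hypothetical per-domain batch; stands in for domain-level sampling."""
    g = torch.Generator().manual_seed(domain_seed)
    return torch.randn(32, 128, generator=g), torch.randint(0, 10, (32,), generator=g)

for step in range(100):
    src_x, src_y = fake_batch(step % 3)              # synthesized source domains
    tgt_x, tgt_y = fake_batch(3 + step % 2)          # synthesized (held-out) target domain

    src_loss = criterion(net(src_x), src_y)

    # Temporary gradient step on the source domains (first-order inner update).
    fast = copy.deepcopy(net)
    grads = torch.autograd.grad(src_loss, net.parameters(), retain_graph=True)
    with torch.no_grad():
        for p, g in zip(fast.parameters(), grads):
            p -= inner_lr * g
    meta_loss = criterion(fast(tgt_x), tgt_y)        # must also generalize to the target domain

    outer_opt.zero_grad()
    src_loss.backward()                              # gradient on source domains
    meta_loss.backward()                             # first-order meta-gradient on the copied net
    with torch.no_grad():                            # combine both signals for the final update
        for p, fp in zip(net.parameters(), fast.parameters()):
            if fp.grad is not None:
                p.grad += meta_weight * fp.grad
    outer_opt.step()
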
[recognition, three, order, den, embedding, dataset, embeddings] [table, propose, hard, gallery, benchmark, named] [face, model, mfr, doma, mldg, generalization, dmtr, dmte, casia, african, probe, caucasian, trained, protocol, racial, datasets, comparative, asian, indian, race, spot, improve, testing] [ieee, method, pattern, proposed, high, performs] [domain, source, target, generalized, loss, unseen, learn, alignment, adaptation, image, synthesized] [base, learning, meta, performance, sampling, deep, update, gradient, compared, label, problem, set, batch, random, better, standard, arxiv, preprint, task] [conference, computer, vision, well, compare, directly, handle]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Jianzhu and Zhu, Xiangyu and Zhao, Chenxu and Cao, Dong and Lei, Zhen and Li, Stan Z.},
  title = {Learning Meta Face Recognition in Unseen Domains},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cascaded Deep Monocular 3D Human Pose Estimation With Evolutionary Training Data
Shichao Li, Lei Ke, Kevin Pratama, Yu-Wing Tai, Chi-Keung Tang, Kwang-Ting Cheng


End-to-end deep representation learning has achieved remarkable accuracy for monocular 3D human pose estimation, yet these models may fail on unseen poses when trained on limited and fixed data. This paper proposes a novel data augmentation method that: (1) is scalable for synthesizing massive amounts of training data (over 8 million valid 3D human poses with corresponding 2D projections) for training 2D-to-3D networks, and (2) can effectively reduce dataset bias. Our method evolves a limited dataset to synthesize unseen 3D human skeletons based on a hierarchical human representation and heuristics inspired by prior knowledge. Extensive experiments show that our approach not only achieves state-of-the-art accuracy on the largest public benchmark, but also generalizes significantly better to unseen and rare poses. Relevant files and tools are available at the project website.
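The evolutionary augmentation of 2D-to-3D training pairs can be sketched with simple crossover and mutation operators over 3D skeletons: children are bred from an initial population, kept only if a validity check passes, and projected to 2D to form new training pairs. The 16-joint layout, the limb grouping, the bone-length validity test and the orthographic camera below are toy assumptions, not the paper's hierarchical representation or heuristics.

import numpy as np

rng = np.random.default_rng(0)
J = 16
population = rng.normal(0, 0.3, size=(200, J, 3))            # initial 3D poses (toy)
left_arm = [10, 11, 12]                                       # hypothetical joint indices

def crossover(a, b):
    child = a.copy()
    child[left_arm] = b[left_arm]                             # swap a limb group between parents
    return child

def mutate(pose, scale=0.02):
    return pose + rng.normal(0, scale, size=pose.shape)

def is_valid(pose, max_bone=1.0):
    """Crude plausibility check on consecutive-joint distances (toy heuristic)."""
    return np.all(np.linalg.norm(np.diff(pose, axis=0), axis=1) < max_bone)

def project(pose):
    return pose[:, :2]                                        # orthographic projection to 2D

augmented = []
while len(augmented) < 1000:
    a, b = population[rng.integers(len(population), size=2)]
    child = mutate(crossover(a, b))
    if is_valid(child):
        augmented.append((project(child), child))             # new (2D, 3D) training pair
print(len(augmented), "synthesized training pairs")
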
[dataset, skeleton, temporal, hierarchical] [cascade, table, regression, stage, parent, biased, propose] [model, trained, generalization, input, testing, dnn, datasets] [ieee, pattern, cascaded, figure, method, valid, residual, comparison] [representation, image, unseen, synthesizing, synthesize, train] [data, training, deep, evolution, learning, network, performance, augmentation, number, better, vector, set, size, dnew, evolved, crossover, augmented, neural, function, learner, evolutionary, accuracy, bias, evolve] [pose, human, conference, computer, vision, estimation, international, monocular, bone, coordinate, novel, initial, joint, error, mpjpe, michael, approach, local, single, left, supplementary, limited, camera, geometric, collection]
@InProceedings{Li_2020_CVPR,
  author = {Li, Shichao and Ke, Lei and Pratama, Kevin and Tai, Yu-Wing and Tang, Chi-Keung and Cheng, Kwang-Ting},
  title = {Cascaded Deep Monocular 3D Human Pose Estimation With Evolutionary Training Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models
Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, William T. Freeman, Rahul Sukthankar, Cristian Sminchisescu


We present a statistical, articulated 3D human shape modeling pipeline, within a fully trainable, modular, deep learning framework. Given high-resolution complete 3D body scans of humans, captured in various poses, together with additional closeups of their head and facial expressions, as well as hand articulation, and given initial, artist-designed, gender-neutral rigged quad-meshes, we train all model parameters, including non-linear shape spaces based on variational auto-encoders, pose-space deformation correctives, skeleton joint center predictors, and blend skinning functions, in a single consistent learning loop. The models are simultaneously trained with all the 3D dynamic scan data (over 60,000 diverse human configurations in our new dataset) in order to capture correlations and ensure consistency of the various components. The models support facial expression analysis, as well as body (with detailed hand) shape and pose estimation. We provide fully trainable generic human models at different resolutions: the moderate-resolution GHUM, consisting of 10,168 vertices, and the low-resolution GHUML(ite), consisting of 3,194 vertices. We run comparisons between them, analyze the impact of different components and illustrate their reconstruction from image data. The models will be available for research.
[skeleton, order, multiple, sequence, blend, trainable, represent, natural] [head, including, center] [model, facial, expression, face, input, neutral, nonlinear, trained] [motion, based, detail, figure, resolution, captured, advantage] [variational, latent, loss, vae, image, learn] [data, space, learning, deep, training, linear, statistical, large, network, vector, initialize, set] [body, shape, pose, human, joint, ghum, hand, skinning, mesh, full, articulated, deformation, reconstruction, registration, rest, error, point, estimation, well, capture, ghuml, surface, estimate, acm, michael, cristian, pca, smpl, additional, single, closeup, registered, monocular, illustrate, angle, rely]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Hongyi and Bazavan, Eduard Gabriel and Zanfir, Andrei and Freeman, William T. and Sukthankar, Rahul and Sminchisescu, Cristian},
  title = {GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Generating 3D People in Scenes Without People
Yan Zhang, Mohamed Hassan, Heiko Neumann, Michael J. Black, Siyu Tang


We present a fully automatic system that takes a 3D scene and generates plausible 3D human bodies that are posed naturally in that 3D scene. Given a 3D scene without people, humans can easily imagine how people could interact with the scene and the objects in it. However, this is a challenging task for a computer, as solving it requires that (1) the generated human bodies be semantically plausible within the 3D environment (e.g. people sitting on the sofa or cooking near the stove), and (2) the generated human-scene interaction be physically feasible, such that the human body and scene do not interpenetrate while, at the same time, body-scene contact supports physical interactions. To that end, we make use of the surface-based 3D human model SMPL-X. We first train a conditional variational autoencoder to predict semantically plausible 3D human poses conditioned on latent scene representations, then we further refine the generated 3D bodies using scene constraints to enforce feasible physical interaction. We show that our approach is able to synthesize realistic and expressive 3D human bodies that naturally interact with the 3D environment. We perform extensive experiments demonstrating that our generative framework compares favorably with existing methods, both qualitatively and quantitatively. We believe that our scene-conditioned 3D human generation pipeline will be useful for numerous applications, e.g. to generate training data for human pose estimation, in video games and in VR/AR. Our project page for data and code can be found at https://vlg.inf.ethz.ch/projects/PSI/.
[recognition, people, interaction, semantics, predict, dataset, interact, work, affordance, individual, natural] [object, semantic, global, propose, score, feature, denotes, table] [model, physical, improve] [ieee, pattern, method, figure, based] [generated, loss, generate, image, plausible, train, realistic, perform, generation, semantically, variational, generative, cat] [learning, data, baseline, denote, training, test, neural, set, network, evaluate, metric, average] [body, scene, human, conference, computer, vision, pose, mesh, depth, contact, virtual, international, fitting, estimation, camera, indoor, vposer, manolis, reconstruction, habitat, rgb, physically, shape, collision, capture, error]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Yan and Hassan, Mohamed and Neumann, Heiko and Black, Michael J. and Tang, Siyu},
  title = {Generating 3D People in Scenes Without People},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Transferring Cross-Domain Knowledge for Video Sign Language Recognition
Dongxu Li, Xin Yu, Chenchen Xu, Lars Petersson, Hongdong Li


Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation. It requires models to recognize isolated sign words from videos. However, annotating WSLR data needs expert knowledge, thus limiting WSLR dataset acquisition. In contrast, there are abundant subtitled sign news videos on the internet. Since these videos have no word-level annotation and exhibit a large domain gap from isolated signs, they cannot be directly used for training WSLR models. We observe that despite the existence of a large domain gap, isolated and news signs share the same visual concepts, such as hand gestures and body movements. Motivated by this observation, we propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news signs to them. To this end, we extract news signs using a base WSLR model, and then design a classifier jointly trained on news and isolated signs to coarsely align the features of these two domains. In order to learn domain-invariant features within each class and suppress domain-specific features, our method further resorts to an external memory to store the class centroids of the aligned news signs. We then design a temporal attention based on the learnt descriptor to improve recognition performance. Experimental results on standard WSLR datasets show that our method outperforms previous state-of-the-art methods significantly. We also demonstrate the effectiveness of our method on automatically localizing signs from sign news, achieving 28.1 AP@0.5.
[sign, news, isolated, recognition, temporal, wslr, language, attention, word, video, visual, extract, action, recurrent, msasl, american, dataset, domaininvariant, wlasl, embedding, coarsely, order, exploit, represent, subtitled] [feature, table, web, propose, focus, sliding, employ, localization, annotation] [model, external, datasets, collected] [ieee, method, pattern, based, convolutional, figure, proposed] [domain, common, learn, representation, transferring, train, shared, alignment] [memory, training, learning, prototypical, knowledge, data, classifier, class, neural, network, performance, average, classification, number, accuracy, large, observe, deep, space] [conference, coarse, computer, vision, descriptor, international, jointly, approach, directly]
@InProceedings{Li_2020_CVPR,
  author = {Li, Dongxu and Yu, Xin and Xu, Chenchen and Petersson, Lars and Li, Hongdong},
  title = {Transferring Cross-Domain Knowledge for Video Sign Language Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Bodies at Rest: 3D Human Pose and Shape Estimation From a Pressure Image Using Synthetic Data
Henry M. Clever, Zackory Erickson, Ariel Kapusta, Greg Turk, Karen Liu, Charles C. Kemp


People spend a substantial part of their lives at rest in bed. 3D human pose and shape estimation for this activity would have numerous beneficial applications, yet line-of-sight perception is complicated by occlusion from bedding. Pressure sensing mats are a promising alternative, but training data is challenging to collect at scale. We describe a physics-based method that simulates human bodies at rest in a bed with a pressure sensing mat, and present PressurePose, a synthetic dataset with 206K pressure images with 3D human poses and shapes. We also present PressureNet, a deep learning model that estimates human pose and shape given a pressure image and gender. PressureNet incorporates a pressure map reconstruction (PMR) network that models pressure image generation to promote consistency between estimated 3D body models and pressure image input. In our evaluations, PressureNet performed well with real data from participants in diverse poses, even though it had only been trained with synthetic data. When we ablated the PMR network, performance dropped substantially.
[dataset, work] [table, map] [model, mat, trained, input, posture, noise] [sensing, based, particle, capsule, figure, method, dynamic, motion, journal] [synthetic, image, real, pmr, generate, consists, loss, generated] [data, network, training, learning, soft, deep, dart, set, appendix, best, function, applied, karen] [pressure, human, body, pose, estimation, mesh, shape, bed, flex, joint, simulation, rigid, pressurenet, resting, smpl, contact, error, capsulized, simulated, force, point, articulated, rest, initial, surface, mattress, cloud, michael, pressurepose, well, supine, array, single, international, charles, estimate, distance, normal, compute, reconstructed, ablating, conference, zackory, rgb]
@InProceedings{Clever_2020_CVPR,
  author = {Clever, Henry M. and Erickson, Zackory and Kapusta, Ariel and Turk, Greg and Liu, Karen and Kemp, Charles C.},
  title = {Bodies at Rest: 3D Human Pose and Shape Estimation From a Pressure Image Using Synthetic Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Bayesian Adversarial Human Motion Synthesis
Rui Zhao, Hui Su, Qiang Ji


We propose a generative probabilistic model for human motion synthesis. Our model has a hierarchy of three layers. At the bottom layer, we utilize a Hidden semi-Markov Model (HSMM), which explicitly models the spatial pose, temporal transition and speed variations in motion sequences. At the middle layer, HSMM parameters are treated as random variables which are allowed to vary across data instances in order to capture large intra- and inter-class variations. At the top layer, hyperparameters define the prior distributions of parameters, preventing the model from overfitting. By explicitly capturing the distribution of the data and parameters, our model has a more compact parameterization compared to GAN-based generative models. We formulate the data synthesis as an adversarial Bayesian inference problem, in which the distributions of generator and discriminator parameters are obtained for data synthesis. We evaluate our method through a variety of metrics, where we show an advantage over other competing methods, with better fidelity and diversity. We further evaluate the synthesis quality as a data augmentation method for a recognition task. Finally, we demonstrate the benefit of our fully probabilistic approach in a data restoration task.
[sequence, bleu, hidden, berkeley, modeling, cmu, dataset, hierarchical, action, mocap, explicitly, order, temporal, sequential, state, video, tsbn] [score, table, framework, including, achieves, propose, represents] [model, adversarial, quality, variation] [motion, dynamic, proposed, figure, method, based, prior, restoration, likelihood, spatial] [real, synthetic, generative, generator, synthesis, discriminator, synthesized, hsmm, generate, fidelity, missing, perform, hhmm, realistic, conditional, mmd, generating, consists] [data, bayesian, inference, learning, distribution, probabilistic, set, number, random, better, training, posterior, compared, benefit, experiment, best, indicates, large, evaluate, sample, cardinality, log, higher, augmentation, class, probability, capacity, standard] [human, pose, joint, second, demonstrate, handle]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Rui and Su, Hui and Ji, Qiang},
  title = {Bayesian Adversarial Human Motion Synthesis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
LSM: Learning Subspace Minimization for Low-Level Vision
Chengzhou Tang, Lu Yuan, Ping Tan


We study the energy minimization problem in low-level vision tasks from a novel perspective. We replace the heuristic regularization term with a data-driven learnable subspace constraint, and preserve the data term to exploit domain knowledge derived from the first principles of a task. This learning subspace minimization (LSM) framework unifies the network structures and the parameters for many different low-level vision tasks, which allows us to train a single network for multiple tasks simultaneously with shared parameters, and even generalizes the trained network to an unseen task as long as the data term can be formulated. We validate our LSM framework on four low-level tasks including edge detection, interactive segmentation, stereo matching, and optical flow, and validate the network on various datasets. The experiments demonstrate that the proposed LSM generates state-of-the-art results with smaller model size, faster training convergence, and real-time inference.
[recognition, video, context, three, dataset] [cnn, interactive, segmentation, framework, feature, object, pyramid, including] [model, derived, trained] [optical, ieee, pattern, flow, lsm, based, conventional, figure, convolutional, june, motion, cnns, proposed, method, performs, conv, residual, ldof, intermediate] [image, variational, generate, train, corresponding] [subspace, minimization, network, learning, data, regularization, training, task, neural, better, learned, average, deep, efficient, vector, processing, energy, comparable, matrix, indicates, machine, problem, objective] [vision, computer, conference, stereo, term, solution, matching, international, basis, constraint, joint, projection, european, single, well, solve, dense, estimation, compare, structure, implicit, compute]
@InProceedings{Tang_2020_CVPR,
  author = {Tang, Chengzhou and Yuan, Lu and Tan, Ping},
  title = {LSM: Learning Subspace Minimization for Low-Level Vision},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning a Neural Solver for Multiple Object Tracking
Guillem Braso, Laura Leal-Taixe


Graphs offer a natural way to formulate Multiple Object Tracking (MOT) within the tracking-by-detection paradigm. However, they also introduce a major challenge for learning methods, as defining a model that can operate on such a structured domain is not trivial. As a consequence, most learning-based work has been devoted to learning better features for MOT and then using these with well-established optimization frameworks. In this work, we exploit the classical network flow formulation of MOT to define a fully differentiable framework based on Message Passing Networks (MPNs). By operating directly on the graph domain, our method can reason globally over an entire set of detections and predict final solutions. Hence, we show that learning in MOT does not need to be restricted to feature extraction, but it can also be applied to the data association step. We show a significant improvement in both MOTA and IDF1 on three publicly available benchmarks. Our code is available at https://bit.ly/motsolv.
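A rough sketch, in PyTorch, of the kind of node/edge message passing such a framework builds on: detections are nodes, candidate associations are edges, and after a few update rounds each edge embedding is scored as an active or inactive association. Dimensions, the sum aggregation, and the final scorer are our assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    """One round of edge/node updates on a detection graph (illustrative only)."""
    def __init__(self, node_dim=32, edge_dim=16):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * node_dim + edge_dim, edge_dim), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(node_dim + edge_dim, node_dim), nn.ReLU())

    def forward(self, nodes, edges, edge_index):
        src, dst = edge_index                                   # (E,) source / target node ids
        # Edge update: fuse the two endpoint embeddings with the current edge embedding.
        edges = self.edge_mlp(torch.cat([nodes[src], nodes[dst], edges], dim=-1))
        # Node update: sum incoming edge messages, then fuse with the node embedding.
        agg = nodes.new_zeros(nodes.size(0), edges.size(-1)).index_add(0, dst, edges)
        nodes = self.node_mlp(torch.cat([nodes, agg], dim=-1))
        return nodes, edges

# After several rounds, a small head scores each edge as a (soft) association decision.
edge_scorer = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())
```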
[graph, message, passing, node, embeddings, multiple, recognition, embedding, three, time, people, step, encoding, trajectory, exploit, order, future, work, goal] [tracking, edge, object, mot, final, feature, association, framework, detection, global, propose, main, improvement, mota] [model, robust] [ieee, pattern, flow, method, convolutional, output, based, figure, proposed, motion] [learn, perform, appearance, image, train] [learning, neural, network, set, update, data, problem, optimization, performance, number, deep, binary, entire, simple, task, active, respect, classification, vanilla, training, online] [computer, conference, vision, formulation, relative, directly, initial, cost, distance, position, international, solver, partitioning, structure]
@InProceedings{Braso_2020_CVPR,
  author = {Braso, Guillem and Leal-Taixe, Laura},
  title = {Learning a Neural Solver for Multiple Object Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GLU-Net: Global-Local Universal Network for Dense Flow and Correspondences
Prune Truong, Martin Danelljan, Radu Timofte


Establishing dense correspondences between a pair of images is an important and general problem, covering geometric matching, optical flow and semantic correspondences. While these applications share fundamental challenges, such as large displacements, pixel-accuracy, and appearance changes, they are currently addressed with specialized network architectures, designed for only one particular task. This severely limits the generalization capabilities of such networks to new scenarios, where e.g. robustness to larger displacements or higher accuracy is required. In this work, we propose a universal network architecture that is directly applicable to all the aforementioned dense correspondence problems. We achieve both high accuracy and robustness to large displacements by investigating the combined use of global and local correlation layers. We further propose an adaptive resolution strategy, allowing our network to operate on virtually any input image resolution. The proposed GLU-Net achieves state-of-the-art performance for geometric and semantic matching as well as optical flow, when using the same network and weights. Code and trained models are available at https://github.com/PruneTruong/GLU-Net.
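A minimal sketch of the global correlation layer that this kind of dense-matching network relies on: every location in one feature map is compared against every location in the other. PyTorch is assumed; the normalization and output layout are our choices for illustration.

```python
import torch
import torch.nn.functional as F

def global_correlation(feat_a, feat_b):
    """All-pairs similarity between two feature maps feat_a, feat_b of shape (B, C, H, W).

    Returns a (B, H*W, H, W) correlation volume: channel m holds the similarity map of
    source location m over all target locations."""
    b, c, h, w = feat_a.shape
    fa = F.normalize(feat_a.flatten(2), dim=1)        # (B, C, H*W), L2-normalized per location
    fb = F.normalize(feat_b.flatten(2), dim=1)        # (B, C, H*W)
    corr = torch.einsum('bcm,bcn->bmn', fa, fb)       # (B, H*W, H*W) dot products
    return corr.view(b, h * w, h, w)
```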
[dataset, december] [global, correlation, semantic, feature, table, employ, pyramid, map, architectural, level, achieves] [trained, universal, input, model, original] [flow, optical, resolution, ieee, pattern, figure, corr, aepe, cvpr, adaptive, june, convolutional, field, high, tss, based, liteflownet, coarsest, lake] [image, target, source, appearance] [network, large, architecture, neural, processing, training, layer, small, task, accuracy, performance, deep, applied, compared, andrew, larger] [local, computer, geometric, conference, dense, vision, correspondence, matching, cost, estimation, international, estimate, approach, accurate, detailed, volume, kitti, additional, allows, pck]
@InProceedings{Truong_2020_CVPR,
  author = {Truong, Prune and Danelljan, Martin and Timofte, Radu},
  title = {GLU-Net: Global-Local Universal Network for Dense Flow and Correspondences},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking
Dongyan Guo, Jun Wang, Ying Cui, Zhenhua Wang, Shengyong Chen


By decomposing the visual tracking task into two subproblems as classification for pixel category and regression for object bounding box at this pixel, we propose a novel fully convolutional Siamese network to solve visual tracking end-to-end in a per-pixel manner. The proposed framework SiamCAR consists of two simple subnetworks: one Siamese subnetwork for feature extraction and one classification-regression subnetwork for bounding box prediction. Different from state-of-the-art trackers like Siamese-RPN, SiamRPN++ and SPM, which are based on region proposal, the proposed framework is both proposal and anchor free. Consequently, we are able to avoid the tricky hyper-parameter tuning of anchors and reduce human intervention. The proposed framework is simple, neat and effective. Extensive experiments and comparisons with state-of-the-art trackers are conducted on challenging benchmarks including GOT-10K, LaSOT, UAV123 and OTB-50. Without bells and whistles, our SiamCAR achieves the leading performance with a considerable real-time speed. The code is available at https://github.com/ohhhyeahhh/SiamCAR.
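A short sketch of the depthwise cross-correlation commonly used in this family of Siamese trackers to fuse template and search features before per-pixel classification/regression heads; that SiamCAR uses exactly this fusion is our assumption.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat, template_feat):
    """Channel-wise cross-correlation of search features (B, C, H, W) with the
    template features (B, C, h, w); the output feeds the per-pixel cls/reg heads."""
    b, c, h, w = search_feat.shape
    x = search_feat.reshape(1, b * c, h, w)
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    out = F.conv2d(x, kernel, groups=b * c)            # each channel correlated independently
    return out.reshape(b, c, out.shape[-2], out.shape[-1])
```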
[visual, evaluation, dataset, prediction, speed] [tracking, siamcar, siamese, object, bounding, location, regression, siamfc, siamrpn, box, region, feature, branch, map, correlation, response, threshold, matlab, ope, eco, cfnet, srdcf, cpu, overlap, staple, meem, template, tracker, backbone, fully, framework, challenging, background, kcfdp, proposal, achieves, represents, lasot, anchor, including, semantic, python, official, fdsst, predicted, centerness] [success, testing, input, improve, robust] [proposed, figure, scale, convolutional, based, extraction] [target, corresponding, loss] [network, classification, precision, training, search, subnetwork, performance, rate, large, deep, learning, set, better, accuracy, task, simple, online, data, best, achieve] [provided, error]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Dongyan and Wang, Jun and Cui, Ying and Wang, Zhenhua and Chen, Shengyong},
  title = {SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MaskFlownet: Asymmetric Feature Matching With Learnable Occlusion Mask
Shengyu Zhao, Yilun Sheng, Yue Dong, Eric I-Chao Chang, Yan Xu


Feature warping is a core technique in optical flow estimation; however, the ambiguity caused by occluded areas during warping is a major problem that remains unsolved. In this paper, we propose an asymmetric occlusion-aware feature matching module, which can learn a rough occlusion mask that filters useless (occluded) areas immediately after feature warping without any explicit supervision. The proposed module can be easily integrated into end-to-end network architectures and enjoys performance gains while introducing negligible computational cost. The learned occlusion mask can be further fed into a subsequent network cascade with dual feature pyramids with which we achieve state-of-the-art performance. At the time of submission, our method, called MaskFlownet, surpasses all published optical flow methods on the MPI Sintel, KITTI 2012 and 2015 benchmarks. Code is available at https://github.com/microsoft/MaskFlownet.
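A hedged sketch of the core idea described above: warp the target feature map with the current flow estimate, then gate it with an occlusion mask that is learned without explicit supervision. The warping helper, mask head, and fill-in term are our stand-ins, not the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def backward_warp(feat, flow):
    """Warp feat (B, C, H, W) by flow (B, 2, H, W) with bilinear sampling."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys)).float().to(feat.device).unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

class MaskedFeatureWarping(nn.Module):
    """Warp target features with the current flow and gate them by a learned
    occlusion mask (no occlusion labels), a rough stand-in for the asymmetric
    feature matching module described above."""
    def __init__(self, channels):
        super().__init__()
        self.mask_head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.filler = nn.Parameter(torch.zeros(1, channels, 1, 1))  # learned fill-in for occluded areas

    def forward(self, feat_target, flow):
        warped = backward_warp(feat_target, flow)
        occ = torch.sigmoid(self.mask_head(warped))                  # soft occlusion mask in [0, 1]
        return warped * occ + self.filler * (1.0 - occ)
```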
[explicit, previous, concatenated, attention] [feature, mask, occlusion, pyramid, occluded, module, final, level, table, correlation, achieves, object, propose, stage, fed, subsequent, map, foreground] [clean, trained, major, easily] [flow, optical, ieee, warping, learnable, warped, sintel, convolutional, pattern, maskflownet, figure, proposed, asymofmm, dual, ofmm, deformable, convolution, upsampled, based, aepe, cnns] [image, asymmetric, unsupervised, learn, train, masked, source, target] [network, learning, layer, training, performance, learned, better, problem, design, large, operation, test, architecture, achieve, process] [computer, conference, matching, vision, cost, volume, estimation, displacement, kitti, thomas, european, international, single]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Shengyu and Sheng, Yilun and Dong, Yue and Chang, Eric I-Chao and Xu, Yan},
  title = {MaskFlownet: Asymmetric Feature Matching With Learnable Occlusion Mask},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Tracking by Instance Detection: A Meta-Learning Approach
Guangting Wang, Chong Luo, Xiaoyan Sun, Zhiwei Xiong, Wenjun Zeng


We consider the tracking problem as a special type of object detection problem, which we call instance detection. With proper initialization, a detector can be quickly converted into a tracker by learning the new instance from a single image. We find that model-agnostic meta-learning (MAML) offers a strategy to initialize the detector that satisfies our needs. We propose a principled three-step approach to build a high-performance tracker. First, pick any modern object detector trained with gradient descent. Second, conduct offline training (or initialization) with MAML. Third, perform domain adaptation using the initial frame. We follow this procedure to build two trackers, named Retina-MAML and FCOS-MAML, based on two modern detectors, RetinaNet and FCOS. Evaluations on four benchmarks show that both trackers are competitive against state-of-the-art trackers. On OTB-100, Retina-MAML achieves the highest ever AUC of 0.712. On TrackingNet, FCOS-MAML ranks first on the leaderboard with an AUC of 0.757 and a normalized precision of 0.822. Both trackers run in real-time at 40 FPS.
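A compact sketch of the meta-learned initialization this relies on, written here as first-order MAML for brevity (the paper's exact MAML variant and losses may differ): adapt a copy of the detector to the initial frame of each task, evaluate it on a later frame, and push the resulting gradients back into the shared initialization. `loss_fn` is whatever detection loss the chosen detector uses and is supplied by the caller.

```python
import copy
import torch

def fomaml_outer_step(detector, meta_optimizer, tasks, loss_fn,
                      inner_lr=1e-2, inner_steps=1):
    """One first-order MAML update (sketch). Each task is
    (init_img, init_boxes, test_img, test_boxes); loss_fn(model, img, boxes) -> scalar loss."""
    meta_optimizer.zero_grad()
    for init_img, init_boxes, test_img, test_boxes in tasks:
        fast = copy.deepcopy(detector)
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                      # inner loop: learn the new instance
            inner_opt.zero_grad()
            loss_fn(fast, init_img, init_boxes).backward()
            inner_opt.step()
        test_loss = loss_fn(fast, test_img, test_boxes)   # outer objective on a future frame
        grads = torch.autograd.grad(test_loss, list(fast.parameters()))
        for p, g in zip(detector.parameters(), grads):    # first-order: copy grads to the init
            p.grad = g.detach() if p.grad is None else p.grad + g.detach()
    meta_optimizer.step()
```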
[visual, step, frame, evaluation, convert, build] [object, detector, tracking, instance, tracker, detection, table, branch, box, regression, trackingnet, achieves, feature, lasot, bounding, region, atom, template, mdnet, head, backbone, sota, siamese, anchor, overlap, score, correlation, retinanet, highest] [trained, offline, model, auc, success] [based, learnable, called, figure, comparison, competitive] [loss, domain, adaptation, target, image, train, perform] [training, learning, online, maml, network, updating, performance, baseline, support, set, number, gradient, update, rate, procedure, test, classification, find, modern, good, initialization, meta, deep, precision, achieve, algorithm, best, arxiv, strategy] [initial, single, approach]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Guangting and Luo, Chong and Sun, Xiaoyan and Xiong, Zhiwei and Zeng, Wenjun},
  title = {Tracking by Instance Detection: A Meta-Learning Approach},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
High-Performance Long-Term Tracking With Meta-Updater
Kenan Dai, Yunhua Zhang, Dong Wang, Jianhua Li, Huchuan Lu, Xiaoyun Yang


Long-term visual tracking has drawn increasing attention because it is much closer to practical applications than short-term tracking. Most top-ranked long-term trackers adopt the offline-trained Siamese architectures, so they cannot benefit from the great progress of short-term trackers with online update. However, it is quite risky to straightforwardly introduce online-update-based trackers to solve the long-term problem, due to long-term uncertain and noisy observations. In this work, we propose a novel offline-trained Meta-Updater to address an important but unsolved problem: Is the tracker ready for updating in the current frame? The proposed meta-updater can effectively integrate geometric, discriminative, and appearance cues in a sequential manner, and then mine the sequential information with a designed cascaded LSTM module. Our meta-updater learns a binary output to guide the tracker's update and can be easily embedded into different trackers. This work also introduces a long-term tracking framework consisting of an online local tracker, an online verifier, a SiamRPN-based re-detector, and our meta-updater. Numerous experimental results on the VOT2018LT, VOT2019LT, OxUvALT, TLP, and LaSOT benchmarks show that our tracker performs remarkably better than other competing algorithms. Our project is available on the website: https://github.com/Daikenan/LTMU.
[visual, time, current, sequential, lstm, evaluation, frame, dataset, three] [tracking, tracker, object, table, response, atom, lasot, score, siamese, bounding, splt, huchuan, framework, template, rtmdnet, eco, box, global, tnr, region, map, mdnet, correlation, mbmd, positive, propose, oxuvalt, threshold] [dong, model, generalization, input, success, trained, effectively, conduct, risk, uncertain] [figure, proposed, output, method, cascaded, ieee, noisy, color] [target, appearance, image, competing] [online, update, training, learning, network, performance, set, precision, average, best, updated, search, rate, negative, better, good, deep, sample, updating] [local, novel, matching]
@InProceedings{Dai_2020_CVPR,
  author = {Dai, Kenan and Zhang, Yunhua and Wang, Dong and Li, Jianhua and Lu, Huchuan and Yang, Xiaoyun},
  title = {High-Performance Long-Term Tracking With Meta-Updater},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model
Bo Pang, Yizhuo Li, Yifan Zhang, Muchen Li, Cewu Lu


Multi-object tracking is a fundamental vision problem that has been studied for a long time. As deep learning brings excellent performance to object detection algorithms, Tracking by Detection (TBD) has become the mainstream tracking framework. Despite the success of TBD, this two-step method is too complicated to train in an end-to-end manner and induces many challenges as well, such as insufficient exploration of video spatial-temporal information, vulnerability when facing object occlusion, and excessive reliance on detection results. To address these challenges, we propose a concise end-to-end model, TubeTK, which needs only one-step training, by introducing the "bounding-tube" to indicate temporal-spatial locations of objects in a short video clip. TubeTK provides a novel direction of multi-object tracking, and we demonstrate its potential to solve the above challenges without bells and whistles. We analyze the performance of TubeTK on several MOT benchmarks and provide empirical evidence to show that TubeTK has the ability to overcome occlusions to some extent without any ancillary technologies like Re-ID. Compared with other methods that adopt private detection results, our one-stage end-to-end model achieves state-of-the-art performance even if it adopts no ready-made detection results. We hope that the proposed TubeTK model can serve as a simple but strong alternative for the video-based MOT task. The code and model will be publicly available accompanying this paper.
[video, temporal, moving, multiple, cewu, length, three, link, step, short, encode, action] [detection, tracking, btubes, tubetk, object, btube, mot, linking, tube, track, adopt, poi, iou, giou, predicted, map, achieves, bbox, adopting, propose, occluded, regression, head, overlap] [model, external, original, robust, input] [method, spatial, proposed, convolutional, figure, based, motion, scale] [loss, utilize, generated, train, target, image] [online, training, task, learning, arxiv, preprint, network, better, simple, algorithm, set, deep, potential, performance, reduce, find, linear] [matching, capture, position, regress, tbd, facing, direction, pose, point, structure, directly, form]
@InProceedings{Pang_2020_CVPR,
  author = {Pang, Bo and Li, Yizhuo and Zhang, Yifan and Li, Muchen and Lu, Cewu},
  title = {TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Collaborative Motion Prediction via Neural Motion Message Passing
Yue Hu, Siheng Chen, Ya Zhang, Xiao Gu


Motion prediction is essential and challenging for autonomous vehicles and social robots. One challenge of motion prediction is to model the interaction among traffic actors, which could cooperate with each other to avoid collisions or form groups. To address this challenge, we propose neural motion message passing (NMMP) to explicitly model the interaction and learn representations for directed interactions between actors. Based on the proposed NMMP, we design motion prediction systems for two settings: the pedestrian setting and the joint pedestrian and vehicle setting. Both systems share a common pattern: we use an individual branch to model the behavior of a single actor and an interactive branch to model the interaction between actors, but use different wrappers to handle the varied input formats and characteristics. The experimental results show that both systems outperform the previous state-of-the-art methods on several existing benchmarks. In addition, we provide interpretability for interaction learning.
[interaction, actor, trajectory, nmmp, prediction, individual, embedding, future, social, graph, traffic, vehicle, ith, previous, attention, sgan, message, passing, embeddings, predict, explicitly, considers, time, behavior, mechanism, interacted, ade, fde, dataset, outperforms, temporal, observed] [pedestrian, module, branch, final, table, interactive, predicted, propose, pooling, autonomous, feature, represents] [model, indicating, interpretability] [motion, spatial, ieee, figure, based, proposed, pattern, comparison, output, crowd, quantitative, prior, convolutional] [corresponding, gan, generator, discriminator, component, loss, domain] [neural, consider, learning, setting, number, arxiv, preprint, design, better] [conference, scene, computer, vision, joint, human, system, structure, predicts, international, capture, coordinate]
@InProceedings{Hu_2020_CVPR,
  author = {Hu, Yue and Chen, Siheng and Zhang, Ya and Gu, Xiao},
  title = {Collaborative Motion Prediction via Neural Motion Message Passing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
P2B: Point-to-Box Network for 3D Object Tracking in Point Clouds
Haozhe Qi, Chen Feng, Zhiguo Cao, Feng Zhao, Yang Xiao


Towards 3D object tracking in point clouds, a novel point-to-box network termed P2B is proposed in an end-to-end learning manner. Our main idea is to first localize potential target centers in the 3D search area embedded with target information. Then point-driven 3D target proposal and verification are executed jointly. In this way, the time-consuming 3D exhaustive search can be avoided. Specifically, we first sample seeds from the point clouds of the template and the search area, respectively. Then, we execute permutation-invariant feature augmentation to embed target clues from the template into the search-area seeds and represent them with target-specific features. Consequently, the augmented search-area seeds regress the potential target centers via Hough voting. The centers are further strengthened with seed-wise targetness scores. Finally, each center clusters its neighbors to leverage the ensemble power for joint 3D target proposal and verification. We apply PointNet++ as our backbone, and experiments on the KITTI tracking dataset demonstrate P2B's superiority (about a 10% improvement over the state of the art). Note that P2B can run at 40 FPS on a single NVIDIA 1080Ti GPU. Our code and model are available at https://github.com/HaozheQi/P2B.
[recognition, previous, visual, include, frame, current, dataset, clue, predict, three, work] [feature, tracking, area, template, object, proposal, targetness, siamese, seed, hough, center, score, table, car, final, box, main, backbone, frtj, voting, detection, denotes, map] [success, verification, robust] [ieee, pattern, result, figure, method, degraded] [target, generate, feed, cluster, discriminative] [search, potential, network, learning, number, similarity, augmentation, deep, size, set, data, setting, precision, augmented, power, applied, test, observe, default, sparsity, better, yield] [point, conference, vision, computer, local, international, represented, kitti, cloud, novel, joint, position]
@InProceedings{Qi_2020_CVPR,
  author = {Qi, Haozhe and Feng, Chen and Cao, Zhiguo and Zhao, Feng and Xiao, Yang},
  title = {P2B: Point-to-Box Network for 3D Object Tracking in Point Clouds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Supervised Deep Visual Odometry With Online Adaptation
Shunkai Li, Xin Wang, Yingdian Cao, Fei Xue, Zike Yan, Hongbin Zha


Self-supervised VO methods have shown great success in jointly estimating camera pose and depth from videos. However, like most data-driven methods, existing VO networks suffer from a notable decrease in performance when confronted with scenes different from the training data, which makes them unsuitable for practical applications. In this paper, we propose an online meta-learning algorithm to enable VO networks to continuously adapt to new environments in a self-supervised manner. The proposed method utilizes convolutional long short-term memory (convLSTM) to aggregate rich spatial-temporal information in the past. The network is able to memorize and learn from its past experience for better estimation and fast adaptation to the current frame. When running VO in the open world, in order to deal with the changing environment, we propose an online feature alignment method by aligning feature distributions at different time. Our VO network is able to seamlessly adapt to different environments. Extensive experiments on unseen outdoor scenes, virtual to real world and outdoor to indoor environments demonstrate that our method consistently outperforms state-of-the-art self-supervised VO baselines considerably.
[current, order, previous, visual, outperforms, time, long, dataset, sequential, naive, temporal] [feature, table, propose, utilizes] [model, continuously] [method, fast, convlstm, running, motion, proposed, figure, convolutional] [adaptation, alignment, domain, changing, loss, learn, perform, learns, unsupervised, unseen, synthetic, image, aligning] [online, learning, network, training, data, adapt, deep, test, set, terr, rerr, gradient, open, performance, better, savo, meta, neural, machine, sfmlearner, experience, problem, objective, evaluate, reduce, update, consistently, stochastic] [error, kitti, depth, estimation, pose, well, monocular, odometry, camera, ground, truth, despite, virtual, approach, geonet, indoor]
@InProceedings{Li_2020_CVPR,
  author = {Li, Shunkai and Wang, Xin and Cao, Yingdian and Xue, Fei and Yan, Zike and Zha, Hongbin},
  title = {Self-Supervised Deep Visual Odometry With Online Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Globally Optimal Contrast Maximisation for Event-Based Motion Estimation
Daqi Liu, Alvaro Parra, Tat-Jun Chin


Contrast maximisation estimates the motion captured in an event stream by maximising the sharpness of the motion-compensated event image. To carry out contrast maximisation, many previous works employ iterative optimisation algorithms, such as conjugate gradient, which require good initialisation to avoid converging to bad local minima. To alleviate this weakness, we propose a new globally optimal event-based motion estimation algorithm. Based on branch-and-bound (BnB), our method solves rotational (3DoF) motion estimation on event streams, which supports practical applications such as video stabilisation and attitude estimation. Underpinning our method are novel bounding functions for contrast maximisation, whose theoretical validity is rigorously established. We show concrete examples from public datasets where globally optimal solutions are vital to the success of contrast maximisation. Despite its exact nature, our algorithm is currently able to process a 50,000-event input in approx 300 seconds (a locally optimal solver takes approx 30 seconds on the same input), and has the potential to be further speeded-up using GPUs.
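For reference, the contrast (sharpness) objective being maximised can be written, in the common formulation of event-based motion compensation (the notation here is ours), as the variance of the warped event image over the image domain \Omega:

\[
H(\mathbf{x};\,\boldsymbol{\omega}) = \sum_{k=1}^{N} \delta\big(\mathbf{x} - \mathbf{x}'_k(\boldsymbol{\omega})\big),
\qquad
C(\boldsymbol{\omega}) = \frac{1}{|\Omega|}\sum_{\mathbf{x}\in\Omega}\big(H(\mathbf{x};\boldsymbol{\omega}) - \bar{H}\big)^2 ,
\]

where \mathbf{x}'_k(\boldsymbol{\omega}) is event k warped to a common reference time by the candidate rotational motion \boldsymbol{\omega} and \bar{H} is the mean of H over \Omega. Branch-and-bound then searches the motion space for the global maximiser of C using the paper's bounding functions.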
[time, sequence, previous, visual, stream, duration] [global, tracking, bounding, region] [quality, ray, deviation] [event, motion, pixel, contrast, ieee, cmbnb, pattern, davide, method, based, bnb, guillermo, iqp, figure, rmax, disc, subseq, dynamic, star] [image] [bound, upper, algorithm, optimal, lower, discrete, max, angular, average, function, set, matrix, maximum, number, lemma, standard, maximisation, centre, objective, conjugate, gpu] [solution, local, computer, error, conference, vision, globally, estimation, rotational, continuous, locally, rotation, define, dominant, supplementary, international, camera, point, intersect, robotics, optimisation, require, novel, solver, scene]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Daqi and Parra, Alvaro and Chin, Tat-Jun},
  title = {Globally Optimal Contrast Maximisation for Event-Based Motion Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
D3Feat: Joint Learning of Dense Detection and Description of 3D Local Features
Xuyang Bai, Zixin Luo, Lei Zhou, Hongbo Fu, Long Quan, Chiew-Lan Tai


A successful point cloud registration often relies on the robust establishment of sparse matches through discriminative 3D local features. Despite the fast evolution of learning-based 3D feature descriptors, little attention has been drawn to the learning of 3D feature detectors, even less for a joint learning of the two tasks. In this paper, we leverage a 3D fully convolutional network for 3D point clouds, and propose a novel and practical learning mechanism that densely predicts both a detection score and a description feature for each 3D point. In particular, we propose a keypoint selection strategy that overcomes the inherent density variations of 3D point clouds, and further propose a self-supervised detector loss guided by the on-the-fly feature matching results during training. Finally, our method achieves state-of-the-art results in both indoor and outdoor scenarios, evaluated on the 3DMatch and KITTI datasets, and shows its strong generalization ability on the ETH dataset. Towards practical use, we show that by adopting a reliable feature detector, sampling a smaller number of features is sufficient to achieve accurate and fast point cloud alignment.
[description, dataset, evaluation, eth, pair, predict] [feature, detector, detection, score, fully, table, saliency, perfectmatch, recall, propose, threshold, detected, achieves] [model, original, trained] [proposed, convolutional, method, kernel, convolution, fast, figure] [loss, image, ability, unsupervised] [network, learning, number, strategy, performance, set, data, selection, max, better, evaluate, higher, density, random, test, randomly, contrastive, margin, size] [point, local, keypoint, cloud, descriptor, matching, keypoints, joint, dense, kitti, fcgf, registration, repeatability, usip, distance, inlier, outdoor, voxel, neighborhood, correspondence, relative, transformation, geometric, handle, repeatable, rotation, demonstrate, defined, radius, dneg, error, compare, sparse, indoor]
@InProceedings{Bai_2020_CVPR,
  author = {Bai, Xuyang and Luo, Zixin and Zhou, Lei and Fu, Hongbo and Quan, Long and Tai, Chiew-Lan},
  title = {D3Feat: Joint Learning of Dense Detection and Description of 3D Local Features},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Backward-Compatible Representation Learning
Yantao Shen, Yuanjun Xiong, Wei Xia, Stefano Soatto


We propose a way to learn visual features that are compatible with previously computed ones even when they have different dimensions and are learned via different neural network architectures and loss functions. Compatible means that, if such features are used to compare images, then "new" features can be compared directly to "old" features, so they can be used interchangeably. This enables visual search systems to bypass computing new features for all previously seen images when updating the embedding models, a process known as backfilling. Backward compatibility is critical to quickly deploy new embedding models that leverage ever-growing large-scale training datasets and improvements in deep learning architectures and training methods. We propose a framework to train embedding models, called backward-compatible training (BCT), as a first step towards backward compatible representation learning. In experiments on learning embeddings for face recognition, models trained with BCT successfully achieve backward compatibility without sacrificing accuracy, thus enabling backfill-free model updates of visual embeddings.
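A hedged sketch of the backward-compatible training idea described above: the new embedding is trained with its own classifier plus the same classification loss routed through the frozen old classifier, which ties the new feature space to the old one. PyTorch is assumed and the function and argument names are ours.

```python
import torch.nn.functional as F

def bct_loss(new_embedding, new_classifier, old_classifier, images, labels):
    """Backward-compatible training objective (sketch).

    old_classifier's parameters are assumed frozen (requires_grad=False); gradients
    still flow through it into the new embedding, pulling new features toward
    regions where the old classifier already works."""
    feats = new_embedding(images)
    loss_new = F.cross_entropy(new_classifier(feats), labels)
    loss_old = F.cross_entropy(old_classifier(feats), labels)   # "influence" term via the old head
    return loss_new + loss_old
```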
[embedding, pair, recognition, dataset, visual, described, embeddings, work, multiple, evaluation] [feature, gallery] [model, compatible, bct, trained, compatibility, face, influence, tnew, told, query, verification, tbct] [ieee, pattern, proposed, comparison, existing] [loss, representation, train, person, image, domain, common, photo] [training, backward, learning, set, classifier, update, deep, accuracy, gain, neural, search, test, margin, cosine, data, network, class, process, function, metric, note, task, arxiv, preprint, achieve, problem, knowledge, architecture, compared, classification, vector, open, distillation, softmax, experiment] [conference, computer, vision, approach, form, collection, second, directly, distance, additional, absolute]
@InProceedings{Shen_2020_CVPR,
  author = {Shen, Yantao and Xiong, Yuanjun and Xia, Wei and Soatto, Stefano},
  title = {Towards Backward-Compatible Representation Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PointAugment: An Auto-Augmentation Framework for Point Cloud Classification
Ruihui Li, Xianzhi Li, Pheng-Ann Heng, Chi-Wing Fu


We present PointAugment, a new auto-augmentation framework that automatically optimizes and augments point cloud samples to enrich the data diversity when we train a classification network. Different from existing auto-augmentation methods for 2D images, PointAugment is sample-aware and takes an adversarial learning strategy to jointly optimize an augmentor network and a classifier network, such that the augmentor can learn to produce augmented samples that best fit the classifier. Moreover, we formulate a learnable point augmentation function with a shape-wise transformation and a point-wise displacement, and carefully design loss functions to adopt the augmented samples based on the learning progress of the classifier. Extensive experiments also confirm PointAugment's effectiveness and robustness to improve the performance of various networks on shape classification and retrieval.
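A rough sketch of the alternating augmentor/classifier updates this describes; the paper's actual loss terms constrain how much harder the augmented sample may become, so this only shows the adversarial structure, with assumed names and a simplified augmentor objective.

```python
import torch

def pointaugment_step(augmentor, classifier, opt_a, opt_c, points, labels, loss_fn):
    """One alternating update (sketch). The augmentor is pushed to produce clouds that are
    at least as hard as the originals for the current classifier; the classifier then trains
    on both original and augmented clouds."""
    # --- augmentor update ---
    aug_points = augmentor(points)
    loss_orig = loss_fn(classifier(points), labels)
    loss_aug = loss_fn(classifier(aug_points), labels)
    augmentor_loss = torch.relu(loss_orig.detach() - loss_aug)   # encourage loss_aug >= loss_orig
    opt_a.zero_grad()
    augmentor_loss.backward()       # classifier grads produced here are cleared below
    opt_a.step()
    # --- classifier update ---
    aug_points = augmentor(points).detach()
    cls_loss = loss_fn(classifier(points), labels) + loss_fn(classifier(aug_points), labels)
    opt_c.zero_grad()
    cls_loss.backward()
    opt_c.step()
```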
[recognition, retrieval, graph, work, three, evaluation, dataset] [table, feature, employ, framework, object, benchmark] [input, original, adversarial, improve, robustness, datasets, workshop] [ieee, figure, conventional, pattern, based, existing, learnable, convolutional, extraction, analysis, presented, enhance] [loss, learn, train, generate, produce, image] [training, augmentor, classifier, augmentation, classification, pointaugment, learning, network, augmented, sample, data, accuracy, strategy, neural, deep, function, test, design, set, note, best, class, processing, small, number, update, comparing, performance, fixed, better, random, architecture, optimizes, formulate, larger] [point, shape, computer, cloud, vision, transformation, pointnet, regress, jointly, rscnn, dgcnn, local, fit]
@InProceedings{Li_2020_CVPR,
  author = {Li, Ruihui and Li, Xianzhi and Heng, Pheng-Ann and Fu, Chi-Wing},
  title = {PointAugment: An Auto-Augmentation Framework for Point Cloud Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-Batch Memory for Embedding Learning
Xun Wang, Haozhi Zhang, Weilin Huang, Matthew R. Scott


Mining informative negative instances is of central importance to deep metric learning (DML). However, the hard-mining ability of existing DML methods is intrinsically limited by mini-batch training, where only a mini-batch of instances is accessible at each iteration. In this paper, we identify a "slow drift" phenomenon by observing that the embedding features drift exceptionally slowly even as the model parameters are updating throughout the training process. It suggests that the features of instances computed at preceding iterations can closely approximate the features extracted by the current model. We propose a cross-batch memory (XBM) mechanism that memorizes the embeddings of past iterations, allowing the model to collect sufficient hard negative pairs across multiple mini-batches - even over the whole dataset. Our XBM can be directly integrated into a general pair-based DML framework. We demonstrate that, without bells and whistles, XBM-augmented DML can boost the performance considerably on image retrieval. In particular, with XBM, a simple contrastive loss can have large R@1 improvements of 12%-22.5% on three large-scale datasets, surpassing the most sophisticated state-of-the-art methods [38, 27, 2] by a large margin. Our XBM is conceptually simple, easy to implement - requiring only a few lines of code, and is memory efficient - with a negligible 0.2 GB extra GPU memory.
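A minimal sketch (assuming PyTorch; names and the contrastive formulation are ours) of a cross-batch memory: a FIFO queue of recent embeddings and labels that the current mini-batch is compared against when mining pairs, exploiting the "slow drift" of features.

```python
import torch

class CrossBatchMemory:
    """FIFO queue of recent embeddings and labels used as extra comparison pairs."""
    def __init__(self, size, dim):
        self.feats = torch.zeros(size, dim)
        self.labels = torch.full((size,), -1, dtype=torch.long)   # -1 marks empty slots
        self.ptr, self.size = 0, size

    @torch.no_grad()
    def enqueue(self, feats, labels):
        n = feats.size(0)
        idx = (self.ptr + torch.arange(n)) % self.size
        self.feats[idx] = feats.detach()
        self.labels[idx] = labels
        self.ptr = (self.ptr + n) % self.size

    def contrastive_loss(self, feats, labels, margin=0.5):
        # Compare the current batch against the whole memory, not just the batch itself.
        sim = feats @ self.feats.t()                               # (B, M) similarities
        pos = labels[:, None] == self.labels[None, :]              # same-identity pairs
        valid = self.labels[None, :] >= 0                          # ignore empty slots
        loss_pos = ((1.0 - sim) * (pos & valid)).sum()
        loss_neg = (torch.clamp(sim - margin, min=0) * (~pos & valid)).sum()
        return (loss_pos + loss_neg) / max(valid.sum().item(), 1)
```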
[embedding, embeddings, current, pair, three, retrieval, provide, multiple, visual, mechanism, recognition, time] [module, hard, feature, positive, table, vehicleid, instance, unified, framework, anchor] [model, collect, datasets, sophisticated] [method, figure, valid, existing, based, simply] [loss, image, person] [memory, xbm, negative, contrastive, learning, training, deep, dml, metric, mining, performance, large, informative, weighting, size, triplet, set, drift, gpu, wij, augmented, weight, general, problem, sampling, number, ratio, similarity, small, simple, computational, requires, sop, update, suggests, data, extremely] [computed, directly, limited, sufficient, solution, compute, single]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Xun and Zhang, Haozhi and Huang, Weilin and Scott, Matthew R.},
  title = {Cross-Batch Memory for Embedding Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Circle Loss: A Unified Perspective of Pair Similarity Optimization
Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, Yichen Wei


This paper provides a pair similarity optimization viewpoint on deep feature learning, aiming to maximize the within-class similarity s_p and minimize the between-class similarity s_n. We find a majority of loss functions, including the triplet loss and the softmax cross-entropy loss, embed s_n and s_p into similarity pairs and seek to reduce (s_n-s_p). Such an optimization manner is inflexible, because the penalty strength on every single similarity score is restricted to be equal. Our intuition is that if a similarity score deviates far from the optimum, it should be emphasized. To this end, we simply re-weight each similarity to highlight the less-optimized similarity scores. It results in a Circle loss, which is named due to its circular decision boundary. The Circle loss has a unified formula for two elemental deep feature learning paradigms, i.e., learning with class-level labels and learning with pair-wise labels. Analytically, we show that the Circle loss offers a more flexible optimization approach towards a more definite convergence target, compared with the loss functions optimizing (s_n-s_p). Experimentally, we demonstrate the superiority of the Circle loss on a variety of deep feature learning tasks. On face recognition, person re-identification, as well as several fine-grained image retrieval datasets, the achieved performance is on par with the state of the art.
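For reference, with within-class similarities s_p^i, between-class similarities s_n^j, margin m and scale \gamma, the Circle loss takes the form (our transcription):

\[
\mathcal{L}_{\text{circle}} = \log\Big[\,1 + \sum_{j}\exp\big(\gamma\,\alpha_n^{j}(s_n^{j}-\Delta_n)\big)\sum_{i}\exp\big(-\gamma\,\alpha_p^{i}(s_p^{i}-\Delta_p)\big)\Big],
\]

with adaptive weights \alpha_p^{i}=[\,1+m-s_p^{i}\,]_+ and \alpha_n^{j}=[\,s_n^{j}+m\,]_+ and margins \Delta_p = 1-m, \Delta_n = m, so a similarity score that is far from its optimum receives a larger effective weight.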
[pair, recognition, three, retrieval, embedding, state] [feature, boundary, score, unified, table, circular, achieves] [decision, face, sip, arcface, sjn, verification, elemental, ambiguous, status] [ieee, pattern, scale, society, flexible, proposed, figure, comparison] [loss, factor, person, image, manner] [circle, similarity, learning, optimization, deep, softmax, triplet, convergence, training, function, margin, metric, large, equal, performance, classification, weighting, sample, set, definite, class, log, popular, reducing, respect, setting, larger, relaxation, minimize, find, penalty, compared, data, gradient, cosine, better, higher, evaluate, accuracy, paper, optimize] [computer, conference, vision, international, single, well, mgn]
@InProceedings{Sun_2020_CVPR,
  author = {Sun, Yifan and Cheng, Changmao and Zhang, Yuhan and Zhang, Chi and Zheng, Liang and Wang, Zhongdao and Wei, Yichen},
  title = {Circle Loss: A Unified Perspective of Pair Similarity Optimization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Steering Self-Supervised Feature Learning Beyond Local Pixel Statistics
Simon Jenni, Hailin Jin, Paolo Favaro


We introduce a novel principle for self-supervised feature learning based on the discrimination of specific transformations of an image. We argue that the generalization capability of learned features depends on what image neighborhood size is sufficient to discriminate different image transformations: the larger the required neighborhood size, the more global the image statistics that the feature can describe. An accurate description of global image statistics makes it possible to better represent the shape and configuration of objects and their context, which ultimately generalizes better to new tasks such as object classification and detection. This suggests a criterion to choose and design image transformations. Based on this criterion, we introduce a novel image transformation that we call limited context inpainting (LCI). This transformation inpaints an image patch conditioned only on a small rectangular pixel boundary (the limited context). Because of the limited boundary information, the inpainter can learn to match local pixel statistics, but is unlikely to match the global statistics of the image. We claim that the same principle can be used to justify the performance of transformations such as image rotations and warping. Indeed, we demonstrate experimentally that learning to discriminate transformations such as LCI, image warping and rotations, yields features with state of the art generalization capabilities on several datasets such as Pascal VOC, STL-10, CelebA, and ImageNet. Remarkably, our trained features achieve a performance on Places on par with features trained through supervised learning with ImageNet labels.
[context, predict, prediction, visual, natural] [global, feature, table, detection, boundary] [trained, adversarial, original, case, example] [patch, ieee, pattern, convolutional, warping, based, warp, zhang, pixel, figure, proposed, method] [image, lci, transformed, unsupervised, learn, rot, noroozi, representation, transfer, inpainting, inpainter, favaro, train, frozen, supervised, distinguish, gidaris] [learning, training, classifier, imagenet, arxiv, preprint, network, performance, task, random, set, linear, accuracy, classification, achieve, data, principle, learned, neural, validation, deep, andrew, size, better, top, ssl] [conference, computer, local, transformation, vision, international, limited, european, novel, orientation]
@InProceedings{Jenni_2020_CVPR,
  author = {Jenni, Simon and Jin, Hailin and Favaro, Paolo},
  title = {Steering Self-Supervised Feature Learning Beyond Local Pixel Statistics},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hyperbolic Image Embeddings
Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan Oseledets, Victor Lempitsky


Computer vision tasks such as image classification, image retrieval, and few-shot learning are currently dominated by Euclidean and spherical embeddings, so that the final decisions about class membership or the degree of similarity are made using linear hyperplanes, Euclidean distances, or spherical geodesic distances (cosine similarity). In this work, we demonstrate that in many practical scenarios, hyperbolic embeddings provide a better alternative.
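The key ingredient is replacing the Euclidean or cosine distance with the geodesic distance of the Poincaré ball model (here with curvature -1), which for two points x, y with ||x||, ||y|| < 1 is

\[
d(\mathbf{x},\mathbf{y}) = \operatorname{arccosh}\!\Big(1 + 2\,\frac{\lVert \mathbf{x}-\mathbf{y}\rVert^{2}}{(1-\lVert \mathbf{x}\rVert^{2})(1-\lVert \mathbf{y}\rVert^{2})}\Big).
\]

Distances grow rapidly near the boundary of the ball, which is what allows hyperbolic space to embed tree-like (hierarchical) structure with low distortion.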
[embeddings, dataset, embedding, hierarchical, visual, natural, recognition, order, provide, embedded, work, language] [map, table, feature] [model, ball, origin, mnist, datasets, face, trained] [conv, ieee, pattern, figure, tree, based] [image, person, gromov] [hyperbolic, learning, deep, space, neural, metric, classification, omniglot, processing, standard, simple, network, training, arxiv, similarity, preprint, class, hyperbolicity, closer, miniimagenet, protonet, better, number, test, klein, appendix, exponential, accuracy, learned, data, serve, task] [euclidean, computer, vision, distance, conference, approach, defined, geometry, point, international, geodesic, spherical, curvature, computed]
@InProceedings{Khrulkov_2020_CVPR,
  author = {Khrulkov, Valentin and Mirvakhabova, Leyla and Ustinova, Evgeniya and Oseledets, Ivan and Lempitsky, Victor},
  title = {Hyperbolic Image Embeddings},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Controllable Orthogonalization in Training DNNs
Lei Huang, Li Liu, Fan Zhu, Diwen Wan, Zehuan Yuan, Bo Li, Ling Shao


Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1 and reduce redundancy in representation. This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI), to learn a layer-wise orthogonal weight matrix in DNNs. ONI works by iteratively stretching the singular values of a weight matrix towards 1. This property enables it to control the orthogonality of a weight matrix by its number of iterations. We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction. We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization (SN), and further outperforms SN by providing controllable orthogonality.
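A compact sketch of orthogonalizing a weight matrix with the Newton-iteration scheme the abstract refers to: the iteration approximates (W W^T)^{-1/2}, and more iterations stretch the singular values closer to 1. The pre-scaling and default iteration count below are our assumptions; see the paper for the full ONI formulation.

```python
import torch

def orthogonalize_newton(weight, iterations=5, eps=1e-5):
    """Approximately orthogonalize the rows of `weight` (out_features x in_features)
    by multiplying with a Newton-Schulz approximation of (W W^T)^{-1/2}."""
    w = weight / (weight.norm() + eps)          # pre-scale so the iteration converges
    s = w @ w.t()
    b = torch.eye(s.size(0), device=w.device, dtype=w.dtype)
    for _ in range(iterations):
        b = 1.5 * b - 0.5 * b @ b @ b @ s       # b -> (w w^T)^{-1/2} as iterations grow
    return b @ w                                # rows become approximately orthonormal
```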
[provide, work, recurrent, described] [resnet, bounding, table, propose, achieves, apply] [adversarial, dnns, improve, isometry, norm] [method, spectral, residual, figure, proposed, convolutional, based, column, comparison] [image, loss, corresponding, train, row, fid, learn, generative, perform] [oni, training, weight, matrix, orthogonal, neural, iteration, orthogonality, deep, singular, performance, learning, orthogonalization, network, optimization, normalization, observe, batch, test, convergence, linear, number, better, proxy, dynamical, instability, covariance, distribution, scaling, problem, imagenet, compared, algorithm, layer, computationally, initialization, standard, size, plain, rate, representational, improved, compact, achieve, gradient, theorem] [eigenvalue, property, eigen, decomposition, initial, numerical, solution, transformation]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Lei and Liu, Li and Zhu, Fan and Wan, Diwen and Yuan, Zehuan and Li, Bo and Shao, Ling},
  title = {Controllable Orthogonalization in Training DNNs},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
An Investigation Into the Stochasticity of Batch Whitening
Lei Huang, Lei Zhao, Yi Zhou, Fan Zhu, Li Liu, Ling Shao


Batch Normalization (BN) is extensively employed in various network architectures by performing standardization within mini-batches. A full understanding of the process has been a central target in the deep learning community. Unlike existing works, which usually only analyze the standardization operation, this paper investigates the more general Batch Whitening (BW). Our work originates from the observation that while various whitening transformations equivalently improve the conditioning, they show significantly different behaviors in discriminative scenarios and in training Generative Adversarial Networks (GANs). We attribute this phenomenon to the stochasticity that BW introduces. We quantitatively investigate the stochasticity of different whitening transformations and show that it correlates well with the optimization behaviors during training. We also investigate how stochasticity relates to the estimation of population statistics during inference. Based on our analysis, we provide a framework for designing and comparing BW algorithms in different scenarios. Our proposed BW algorithm improves the residual networks by a significant margin on ImageNet classification. Besides, we show that the stochasticity of BW can improve the GAN's performance, however at the cost of training stability.
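As a reference point, ZCA whitening of a mini-batch (one of the whitening transformations analysed) can be sketched as follows in PyTorch; group-wise whitening, back-propagation through the eigendecomposition, and running statistics for inference are omitted here.

```python
import torch

def zca_whiten(x, eps=1e-5):
    """ZCA-whiten a mini-batch of features x with shape (N, D): after the transform the
    batch has (approximately) zero mean and identity covariance, while staying as close
    as possible to the original coordinates (unlike PCA whitening)."""
    xc = x - x.mean(dim=0, keepdim=True)
    cov = xc.t() @ xc / (x.size(0) - 1) + eps * torch.eye(x.size(1))
    evals, evecs = torch.linalg.eigh(cov)                      # cov = U diag(evals) U^T
    whitening = evecs @ torch.diag(evals.rsqrt()) @ evecs.t()  # cov^{-1/2}
    return xc @ whitening
```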
[provide, observation, conditioning, previous] [table, feature, achieves, main] [improve, investigation, input, stability] [figure, analysis, based, proposed, method, output, introduced, residual, comparison] [fid, discriminative, image, control, diversity] [whitening, batch, zca, training, stochasticity, performance, normalization, matrix, size, group, snd, better, stochastic, learning, covariance, optimization, itn, population, find, deep, whitened, observe, network, paper, investigate, sample, suggests, calculate, large, data, normalized, number, neural, distribution, rate, decay, evaluate, standardization, algorithm, achieve, stable, note, smaller, performing, sampled, test, classification, dimension, respect, operation] [pca, estimation, full, estimate, estimating, supplementary, transformation, error]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Lei and Zhao, Lei and Zhou, Yi and Zhu, Fan and Liu, Li and Shao, Ling},
  title = {An Investigation Into the Stochasticity of Batch Whitening},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
High-Order Information Matters: Learning Relation and Topology for Occluded Person Re-Identification
Guan'an Wang, Shuo Yang, Huanyu Liu, Zhicheng Wang, Yang Yang, Shuliang Wang, Gang Yu, Erjin Zhou, Jian Sun


Occluded person re-identification (ReID) aims to match occluded person images to holistic ones across disjoint cameras. In this paper, we propose a novel framework that learns high-order relation and topology information for discriminative features and robust alignment. First, we use a CNN backbone to learn feature maps and a key-point estimation model to extract semantic local features. Even so, occluded images still suffer from occlusion and outliers. Then, we view the extracted local features of an image as nodes of a graph and propose an adaptive direction graph convolutional (ADGC) layer to pass relation information between nodes. The proposed ADGC layer can automatically suppress the message passing of meaningless features by dynamically learning the direction and degree of linkage. When aligning two groups of local features, we view it as a graph matching problem and propose a cross-graph embedded-alignment (CGEA) layer to jointly learn and embed topology information into local features and directly predict the similarity score. The proposed CGEA layer both makes full use of the alignment learned by graph matching and replaces the sensitive one-to-one alignment with a robust soft one. Finally, extensive experiments on occluded, partial, and holistic ReID tasks show the effectiveness of our proposed method. Specifically, our framework significantly outperforms the state of the art by 6.5% mAP on the Occluded-Duke dataset.
[graph, relation, passing, message, directed, three, extract] [occluded, module, holistic, semantic, feature, framework, propose, map, global, table, meaningless, adgc, achieves, cnn, suppress, cgea, effectiveness, gallery, jian, hard, score] [model, robust, experimental, datasets, verification, sensitive] [proposed, convolutional, ieee, figure, pattern, adaptive, method, formulated, analysis, based] [person, reid, learn, alignment, image, loss, discriminative, yang] [learning, layer, similarity, deep, performance, arxiv, preprint, matrix, vanilla, training, learned, problem, network] [local, partial, conference, matching, computer, vision, topology, novel, human, match, direction, international, view, well, avoid, pose, jointly]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Guan'an and Yang, Shuo and Liu, Huanyu and Wang, Zhicheng and Yang, Yang and Wang, Shuliang and Yu, Gang and Zhou, Erjin and Sun, Jian},
  title = {High-Order Information Matters: Learning Relation and Topology for Occluded Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Same Features, Different Day: Weakly Supervised Feature Learning for Seasonal Invariance
Jaime Spencer, Richard Bowden, Simon Hadfield


"Like night and day" is a commonly used expression to imply that two things are completely different. Unfortunately, this tends to be the case for current visual feature representations of the same scene across varying seasons or times of day. The aim of this paper is to provide a dense feature representation that can be used to perform localization, sparse matching or image retrieval, regardless of the current seasonal or temporal appearance. Recently, there have been several proposed methodologies for deep learning dense feature representations. These methods make use of ground truth pixel-wise correspondences between pairs of images and focus on the spatial properties of the features. As such, they don't address temporal or seasonal variation. Furthermore, obtaining the required pixel-wise correspondence data to train in cross-seasonal environments is highly complex in most scenarios. We propose Deja-Vu, a weakly supervised approach to learning season invariant features that does not require pixel-wise ground truth data. The proposed system only requires coarse labels indicating if two images correspond to the same location or not. From these labels, the network is trained to produce "similar" dense feature maps for corresponding locations despite environmental changes. Code will be made available at: https://github.com/jspenmar/DejaVu_Features
[seasonal, visual, order, dataset, recognition, provide, relational, temporal, work, place] [feature, localization, location, contextual, positive, sun, propose, anchor, table, weakly] [dawn, trained, robust, case, indicating] [ieee, proposed, figure, dvf, night, pattern, spatial, traditional, rain, snow] [image, loss, perform, appearance, corresponding, representation, train, target, supervised, source, invariant, produce] [learning, similarity, network, training, performance, triplet, deep, negative, set, metric, paper, sample] [dense, computer, conference, season, vision, volume, international, matching, dusk, sift, robotcar, sand, sparse, robotics, ground, truth, single, match, posenet, lecture, system, despite]
@InProceedings{Spencer_2020_CVPR,
  author = {Spencer, Jaime and Bowden, Richard and Hadfield, Simon},
  title = {Same Features, Different Day: Weakly Supervised Feature Learning for Seasonal Invariance},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Dress 3D People in Generative Clothing
Qianli Ma, Jinlong Yang, Anurag Ranjan, Sergi Pujades, Gerard Pons-Moll, Siyu Tang, Michael J. Black


Three-dimensional human body models are widely used in the analysis of human pose and motion. Existing models, however, are learned from minimally-clothed 3D scans and thus do not generalize to the complexity of dressed people in common images and videos. Additionally, current models lack the expressive power needed to represent the complex non-linear geometry of pose-dependent clothing shapes. To address this, we learn a generative 3D mesh model of clothed people from 3D scans with varying pose and clothing. Specifically, we train a conditional Mesh-VAE-GAN to learn the clothing deformation from the SMPL body model, making clothing an additional term in SMPL. Our model is conditioned on both pose and clothing type, giving the ability to draw samples of clothing to dress different body shapes in a variety of styles and poses. To preserve wrinkle detail, our Mesh-VAE-GAN extends patchwise discriminators to 3D meshes. Our model, named CAPE, represents global shape and fine local structure, effectively extending the SMPL body model to clothing. To our knowledge, this is the first generative model that directly dresses 3D human body meshes and generalizes to different poses. The model, code and data are available for research purposes at https://cape.is.tue.mpg.de.
[graph, recognition, people, dataset, represent, work] [table, offset, global, template] [clothing, model, type, dress, garment, condition] [ieee, pattern, convolutional, existing, figure, based, captured, method] [generative, image, loss, generated, learn, latent, generate, discriminator, real, train, corresponding, fine, representation] [learning, layer, learned, data, network, neural, function, test, sampling, group, sample] [body, pose, human, shape, conference, computer, vision, smpl, clothed, cape, mesh, international, michael, reconstruction, capture, single, gerard, deformation, local, error, displacement, fitting, geometry, smplify, pca, acm, javier, variety, reconstruct, parametric, vertex, reconstructed, scan, skinning]
@InProceedings{Ma_2020_CVPR,
  author = {Ma, Qianli and Yang, Jinlong and Ranjan, Anurag and Pujades, Sergi and Pons-Moll, Gerard and Tang, Siyu and Black, Michael J.},
  title = {Learning to Dress 3D People in Generative Clothing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MAST: A Memory-Augmented Self-Supervised Tracker
Zihang Lai, Erika Lu, Weidi Xie


Recent interest in self-supervised dense tracking has yielded rapid progress, but performance still remains far from supervised methods. We propose a dense tracking model trained on videos without any annotations that surpasses previous self-supervised methods on existing benchmarks by a significant margin (+15%), and achieves performance comparable to supervised methods. In this paper, we first reassess the traditional choices used for self-supervised training and reconstruction loss by conducting thorough experiments that finally elucidate the optimal choices. Second, we further improve on existing methods by augmenting our architecture with a crucial memory component. Third, we benchmark on large-scale semi-supervised video object segmentation (aka. dense tracking), and propose a new metric: generalizability. Our first two contributions yield a self-supervised network that for the first time is competitive with supervised methods on standard evaluation metrics of dense tracking. When measuring generalizability, we show self-supervised approaches are actually superior to the majority of supervised methods. We believe this new generalizability metric can better capture the real-world use-cases for dense tracking, and will spur new interest in this research direction.
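The self-supervision signal in this line of work is a reconstruction loss: target-frame features attend over a reference frame and copy its colors, and the copy is compared with the real target frame. The PyTorch sketch below illustrates that idea under assumed shapes and an assumed softmax temperature; it is not the authors' architecture and omits the memory component.

import torch
import torch.nn.functional as F

def copy_colors_from_reference(feat_ref, feat_tgt, color_ref, temperature=0.07):
    """Reconstruct target-frame colors by attending over a reference frame.

    feat_ref, feat_tgt: (C, H, W) dense features of the two frames.
    color_ref: (3, H, W) colors of the reference frame (e.g. Lab channels).
    Returns the reconstructed (3, H, W) target colors.
    """
    C, H, W = feat_ref.shape
    fr = feat_ref.reshape(C, H * W)                            # (C, N_ref)
    ft = feat_tgt.reshape(C, H * W)                            # (C, N_tgt)
    affinity = F.softmax(ft.t() @ fr / temperature, dim=1)     # (N_tgt, N_ref)
    colors = color_ref.reshape(3, H * W).t()                   # (N_ref, 3)
    rec = affinity @ colors                                    # (N_tgt, 3)
    return rec.t().reshape(3, H, W)

feat_ref, feat_tgt = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
color_ref, color_tgt = torch.rand(3, 32, 32), torch.rand(3, 32, 32)
loss = F.l1_loss(copy_colors_from_reference(feat_ref, feat_tgt, color_ref), color_tgt)
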
[video, frame, visual, previous, temporal, attention, mast, outperforms, multiple, long, evaluation, osvos, dataset, short, corrflow] [object, tracking, segmentation, feature, table, mask, propose, affinity, key, instance, roi, van, tracker, crucial, benchmark, davis] [model, trained, lab, query, input, generalizability, testing, refers] [color, reference, pixel, method, existing, figure, proposed, raw, based] [supervised, loss, unseen, representation, image, gap, target, learn, unsupervised, selfsupervised, train] [learning, memory, training, performance, network, neural, higher, algorithm, large, space, arxiv, matrix, task, preprint, andrew, architecture] [dense, rgb, term, correspondence, compute, reconstruction, human, defined]
@InProceedings{Lai_2020_CVPR,
  author = {Lai, Zihang and Lu, Erika and Xie, Weidi},
  title = {MAST: A Memory-Augmented Self-Supervised Tracker},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning by Analogy: Reliable Supervision From Transformations for Unsupervised Optical Flow Estimation
Liang Liu, Jiangning Zhang, Ruifei He, Yong Liu, Yabiao Wang, Ying Tai, Donghao Luo, Chengjie Wang, Jilin Li, Feiyue Huang


Unsupervised learning of optical flow, which leverages the supervision from view synthesis, has emerged as a promising alternative to supervised methods. However, the objective of unsupervised learning is likely to be unreliable in challenging scenes. In this work, we present a framework to use more reliable supervision from transformations. It simply twists the general unsupervised learning pipeline by running a second forward pass on transformed data from augmentation and using the transformed predictions of the original data as the self-supervision signal. In addition, we introduce a lightweight multi-frame network with a highly shared flow decoder. Our method consistently yields a large performance gain on several benchmarks, achieving the best accuracy among deep unsupervised methods. It also achieves results competitive with recent fully supervised methods while using far fewer parameters.
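A minimal PyTorch sketch of the transformation-based self-supervision described above: predict flow on the original pair, transform that prediction, and use it as the pseudo label for a second pass on the transformed pair. A horizontal flip stands in for the paper's richer augmentations, and the toy flow_net is a placeholder for any module mapping an image pair to a flow field.

import torch

def flip_flow(flow):
    """Horizontally flip a flow field (B, 2, H, W); the x-component changes sign."""
    flipped = torch.flip(flow, dims=[3])
    flipped[:, 0] = -flipped[:, 0]
    return flipped

def self_supervision_from_transform(flow_net, img1, img2):
    """Loss tying the prediction on flipped inputs to the flipped original prediction."""
    with torch.no_grad():                                   # pseudo label carries no gradient
        pseudo = flip_flow(flow_net(img1, img2))
    flow_aug = flow_net(torch.flip(img1, dims=[3]), torch.flip(img2, dims=[3]))
    return (flow_aug - pseudo).abs().mean()

toy_net = lambda a, b: torch.zeros(a.size(0), 2, a.size(2), a.size(3))  # placeholder network
loss = self_supervision_from_transform(toy_net, torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
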
[recognition, dataset, video, previous, decoder, work, prediction, unreliable, regular] [occlusion, framework, occluded, table, challenging, final, feature, supervision, heavy] [model, original, trained, clean, improve, change] [flow, optical, ieee, method, pattern, sintel, spatial, aepe, laug, proposed, lightweight, based] [unsupervised, transformed, supervised, image, loss, learn, introduce, appearance, cross] [learning, training, augmentation, data, network, random, forward, distillation, performance, objective, deep, general, large, accuracy, best, reliable, set, teacher, indicates, regularization, augmented] [conference, vision, computer, transformation, kitti, dense, view, international, photometric, ground, pipeline, displacement, extension, scene, estimation, avoid, approach]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Liang and Zhang, Jiangning and He, Ruifei and Liu, Yong and Wang, Yabiao and Tai, Ying and Luo, Donghao and Wang, Chengjie and Li, Jilin and Huang, Feiyue},
  title = {Learning by Analogy: Reliable Supervision From Transformations for Unsupervised Optical Flow Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking With 2D-3D Multi-Feature Learning
Xinshuo Weng, Yongxin Wang, Yunze Man, Kris M. Kitani


3D Multi-object tracking (MOT) is crucial to autonomous systems. Recent work uses a standard tracking-by-detection pipeline, where feature extraction is first performed independently for each object in order to compute an affinity matrix. Then the affinity matrix is passed to the Hungarian algorithm for data association. A key process of this standard pipeline is to learn discriminative features for different objects in order to reduce confusion during data association. In this work, we propose two techniques to improve discriminative feature learning for MOT: (1) instead of obtaining features for each object independently, we propose a novel feature interaction mechanism by introducing the Graph Neural Network. As a result, the feature of one object is informed by the features of other objects, so that each object's feature can move towards objects with similar features (i.e., probably the same ID) and away from objects with dissimilar features (i.e., probably different IDs), leading to a more discriminative feature for each object; (2) instead of obtaining the feature from either 2D or 3D space as in prior work, we propose a novel joint feature extractor to learn appearance and motion features from 2D and 3D space simultaneously. As features from different modalities often carry complementary information, the joint feature can be more discriminative than features from either individual modality. To ensure that the joint feature extractor does not rely too heavily on one modality, we also propose an ensemble training paradigm. Through extensive evaluation, our proposed method achieves state-of-the-art performance on the KITTI and nuScenes 3D MOT benchmarks. Our code will be made available at https://github.com/xinshuoweng/GNN3DMOT
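The feature interaction step can be sketched as one round of message passing over a fully-connected graph of tracks and detections before the affinity matrix is formed. The PyTorch snippet below is a hedged illustration; layer sizes, the attention-style edge weights, and the cosine affinity are assumptions, not the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureInteraction(nn.Module):
    """Refine each object's feature with messages from every other object."""
    def __init__(self, dim=128):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, feats):                                  # feats: (N, dim)
        weights = torch.softmax(feats @ feats.t() / feats.size(1) ** 0.5, dim=1)
        agg = weights @ self.msg(feats)                        # aggregated messages
        return torch.relu(self.update(torch.cat([feats, agg], dim=1)))

def affinity_matrix(track_feats, det_feats):
    """Cosine affinities that would be passed to the Hungarian algorithm."""
    return F.normalize(track_feats, dim=1) @ F.normalize(det_feats, dim=1).t()

tracks, dets = torch.randn(3, 128), torch.randn(4, 128)
refined = FeatureInteraction()(torch.cat([tracks, dets]))      # objects inform each other
aff = affinity_matrix(refined[:3], refined[3:])                # (3, 4) association scores
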
[node, graph, frame, work, evaluation, interaction, gnns, gnn, mechanism, lstm, kris, three, order] [feature, object, mot, affinity, detection, tracked, extractor, tracking, edge, table, aggregation, propose, detected, nit, mota, final, samota, nuscenes, regression, xinshuo, association, matched, branch, amota, amotp, hungarian, apply, autonomous, introducing] [type, ensemble, input] [motion, figure, proposed, method, prior, fusion, based, convolutional, extraction, comparison] [appearance, discriminative, loss, image, learn] [network, performance, learning, neural, matrix, data, training, online, set, number, space, similarity, layer, deep, triplet, batch, applied, pairwise, accuracy, process, entire, negative, algorithm] [kitti, joint, point, novel, mlp, neighborhood, term, cloud]
@InProceedings{Weng_2020_CVPR,
  author = {Weng, Xinshuo and Wang, Yongxin and Man, Yunze and Kitani, Kris M.},
  title = {GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking With 2D-3D Multi-Feature Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ClusterFit: Improving Generalization of Visual Representations
Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, Dhruv Mahajan


Pre-training convolutional neural networks with weakly-supervised and self-supervised strategies is becoming increasingly popular for several computer vision tasks. However, due to the lack of strong discriminative signals, these learned representations may overfit to the pre-training objective (e.g., hashtag prediction) and not generalize well to downstream tasks. In this work, we present a simple strategy - ClusterFit to improve the robustness of the visual representations learned during pre-training. Given a dataset, we (a) cluster its features extracted from a pre-trained network using k-means and (b) re-train a new network from scratch on this dataset using cluster assignments as pseudo-labels. We empirically show that clustering helps reduce the pre-training task-specific information from the extracted features thereby minimizing overfitting to the same. Our approach is extensible to different pre-training frameworks -- weak- and self-supervised, modalities -- images and videos, and pre-training tasks -- object and action classification. Through extensive transfer learning experiments on 11 different target datasets of varied vocabularies and granularities, we show that ClusterFit significantly improves the representation quality compared to the state-of-the-art large-scale (millions / billions) weakly-supervised image and video models and self-supervised image models.
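The two steps are simple enough to sketch directly. In the Python below, extract_features and train_from_scratch are hypothetical helpers and the cluster count is an assumed setting; only the k-means pseudo-labeling step is shown concretely.

import numpy as np
from sklearn.cluster import KMeans

def clusterfit_pseudo_labels(features, n_clusters=1000, seed=0):
    """Step (a) of the recipe: k-means over features from a pre-trained network.

    The returned cluster assignments serve as pseudo-labels for re-training a
    new network from scratch with a standard cross-entropy loss (step (b)).
    """
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return km.fit_predict(features)

# feats = extract_features(pretrained_model, dataset)    # (N, D) array, any pre-training
# labels = clusterfit_pseudo_labels(feats)
# train_from_scratch(new_model, dataset, labels)         # hypothetical training loop
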
[visual, dataset, video, kinetics, hashtags, longer, downstream, action] [dcf, feature, table, semantic, object, framework, weakly, van] [model, noise, trained, study, acc, datasets] [figure, method, noisy] [transfer, target, unsupervised, cluster, supervised, jigsaw, image, train, representation] [npre, ncf, learning, clusterfit, performance, dpre, label, clustering, number, network, proxy, objective, training, task, accuracy, space, rotnet, data, knowledge, distillation, better, capacity, arxiv, inaturalist, report, preprint, deep, architecture, layer, fixed, higher, neural, evaluate, ttar, compared, linear, baseline, note] [approach]
@InProceedings{Yan_2020_CVPR,
  author = {Yan, Xueting and Misra, Ishan and Gupta, Abhinav and Ghadiyaram, Deepti and Mahajan, Dhruv},
  title = {ClusterFit: Improving Generalization of Visual Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Dynamic Relationships for 3D Human Motion Prediction
Qiongjie Cui, Huaijiang Sun, Fei Yang


3D human motion prediction, i.e., forecasting future sequences from given historical poses, is a fundamental task for action analysis, human-computer interaction, and machine intelligence. Recently, the state-of-the-art method assumes that the whole human motion sequence involves a fully-connected graph formed by links between each joint pair. Although encouraging results have been achieved, the neglect of the inherent and meaningful characteristics of the natural connectivity of human joints may produce unexpected results. Moreover, such a complicated topology greatly increases the training difficulty. To tackle these issues, we propose a deep generative model based on graph networks and adversarial learning. Specifically, the skeleton pose is represented as a novel dynamic graph, in which natural connectivities of the joint pairs are exploited explicitly, and the links of geometrically separated joints can also be learned implicitly. Notably, in the proposed model, the natural connection strength is adaptively learned, whereas in previous schemes it was constant. Our approach is evaluated on two representations (i.e., angle-based, position-based) from various large-scale 3D skeleton benchmarks (e.g., H3.6M, CMU, 3DPW MoCap). Extensive experiments demonstrate that our approach achieves significant improvements over existing baselines in accuracy and visualization. Code will be available at https://github.com/cuiqiongjie/LDRGCN.
[graph, prediction, skeleton, sequence, natural, future, temporal, millisecond, connective, gcn, previous, frame, mocap, adjacency, walking, historical, action, recurrent, extract, cmu, aged, long, rnn, work, connected] [predicted, table, global, final, propose] [model, adversarial, input, adv] [motion, proposed, method, dynamic, residual, learnable, based, figure, convolutional, adaptively, introduced, output, comparison] [learn, discriminator, loss, train, generative, corresponding] [training, matrix, learning, connection, set, fixed, data, network, note, performance, deep] [human, error, ground, angle, joint, implicit, pose, capture, structure, position, directly, novel, approach]
@InProceedings{Cui_2020_CVPR,
  author = {Cui, Qiongjie and Sun, Huaijiang and Yang, Fei},
  title = {Learning Dynamic Relationships for 3D Human Motion Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Knowledge As Priors: Cross-Modal Knowledge Generalization for Datasets Without Superior Knowledge
Long Zhao, Xi Peng, Yuxiao Chen, Mubbasir Kapadia, Dimitris N. Metaxas


Cross-modal knowledge distillation deals with transferring knowledge from a model trained with superior modalities (Teacher) to another model trained with weak modalities (Student). Existing approaches require that paired training examples exist in both modalities. However, accessing data from superior modalities may not always be feasible. For example, in the case of 3D hand pose estimation, depth maps, point clouds, or stereo images usually capture better hand structures than RGB images, but most of them are expensive to collect. In this paper, we propose a novel scheme to train the Student on a Target dataset where the Teacher is unavailable. Our key idea is to generalize the distilled cross-modal knowledge learned from a Source dataset, which contains paired examples from both modalities, to the Target dataset by modeling knowledge as priors on the parameters of the Student. We name our method "Cross-Modal Knowledge Generalization" and demonstrate that our scheme results in competitive performance for 3D hand pose estimation on standard benchmark datasets.
[dataset, modality, work, attention, long, yuan] [regression, feature, table, weak, key, final] [trained, generalization, datasets, argmax, model, dimitris] [proposed, superior, method, figure, epe, intermediate] [target, source, loss, transfer, learn, paired, synthetic, domain, latent, aim, latt, transferring, generalize, introduce] [knowledge, network, learning, distillation, learned, performance, student, algorithm, teacher, data, regularizer, training, deep, scheme, neural, log, regularization, note, large, regularizers, default, distill, evaluate, lact, weight, proposition, procedure, activation] [hand, pose, approach, depth, estimation, rgb, stb, single, rhd, ground, match, term, point, stereo, novel, shape, error, joint, pck]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Long and Peng, Xi and Chen, Yuxiao and Kapadia, Mubbasir and Metaxas, Dimitris N.},
  title = {Knowledge As Priors: Cross-Modal Knowledge Generalization for Datasets Without Superior Knowledge},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation
Yizhe Zhu, Martin Renqiang Min, Asim Kadav, Hans Peter Graf


We propose a sequential variational autoencoder to learn disentangled representations of sequential data (e.g., videos and audios) under self-supervision. Specifically, we exploit the benefits of some readily accessible supervision signals from input data itself or some off-the-shelf functional models and accordingly design auxiliary tasks for our model to utilize these signals. With the supervision of the signals, our model can easily disentangle the representation of an input sequence into static factors and dynamic factors (i.e., time-invariant and time-varying parts). Comprehensive experiments across videos and audios verify the effectiveness of our model on representation disentanglement and generation of sequential data, and demonstrate that, our model with self-supervision performs comparable to, if not better than, the fully-supervised model with ground truth labels, and outperforms state-of-the-art unsupervised models by a large margin.
[video, sequential, static, sequence, dsvae, visual, frame, recurrent, three, audio, speech, accessible, outperforms, factorized, temporal, action] [supervision, object, table] [model, auxiliary, facial, expression, acc, input] [dynamic, figure, ieee, motion, pattern, prior, based, optical, performs, designed] [representation, disentangled, disentanglement, latent, generation, variable, appearance, variational, supervisory, unsupervised, generated, vae, generate, ldf, preserve, real, generative, ability, image, realistic, encourage, mocogan] [data, learning, neural, arxiv, preprint, processing, regularization, performance, randomly, mutual, posterior, sampled, better, expected, training] [computer, conference, full, vision, international, ground, truth, european, well]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Yizhe and Min, Martin Renqiang and Kadav, Asim and Graf, Hans Peter},
  title = {S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning
Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, Qixiang Ye


In self-supervised spatio-temporal representation learning, the temporal resolution and long-short term characteristics are not yet fully explored, which limits representation capabilities of learned models. In this paper, we propose a novel self-supervised method, referred to as video Playback Rate Perception (PRP), to learn spatio-temporal representation in a simple-yet-effective way. PRP roots in a dilated sampling strategy, which produces self-supervision signals about video playback rates for representation model learning. PRP is implemented with a feature encoder, a classification module, and a reconstructing decoder, to achieve spatio-temporal semantic retention in a collaborative discrimination-generation manner. The discriminative perception model follows a feature encoder to prefer perceiving low temporal resolution and long-term representation by classifying fast-forward rates. The generative perception model acts as a feature decoder to focus on comprehending high temporal resolution and short-term representation by introducing a motion-attention mechanism. PRP is applied on typical video target tasks including action recognition and video retrieval. Experiments show that PRP outperforms state-of-the-art self-supervised models with significant margins. Code is available at github.com/yuanyao366/PRP.
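A hedged PyTorch sketch of the dilated sampling signal: a clip is sampled with a random temporal stride, and a classifier must recover which playback rate was used. The rate set, clip length, and the linear stand-in for the 3D-CNN encoder are assumptions, and the generative (reconstruction) branch is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

def dilated_sample(video, rate, clip_len=16):
    """Take every `rate`-th frame of `video` (T, C, H, W), wrapping for short videos."""
    idx = torch.arange(clip_len) * rate
    return video[idx % video.shape[0]]

rates = [1, 2, 4, 8]                                     # assumed playback rates
video = torch.rand(64, 3, 112, 112)
label = torch.randint(len(rates), (1,))                  # which rate was applied
clip = dilated_sample(video, rates[label.item()])        # (16, 3, 112, 112)

classifier = nn.Linear(clip.numel(), len(rates))         # stand-in for encoder + head
logits = classifier(clip.reshape(1, -1))
loss = F.cross_entropy(logits, label)                    # discriminative perception loss
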
[video, perception, temporal, playback, prp, action, attention, recognition, visual, retrieval, frame, clip, decoder, considering, predicting, spatiotemporal, rich, perceive, multiple, work] [feature, table, focus, propose, semantic, including, supervision, foreground] [model, input, difference, perceiving, trained] [motion, ieee, dilated, reconstructing, resolution, relu, convolutional, spatial, figure, based, high, proposed] [representation, discriminative, generative, target, learn, content, image, encoder, learns, unsupervised, loss] [learning, sampling, rate, network, set, accuracy, data, learned, activation, proxy, classification, task, performance, implemented, applied] [approach, term, interval]
@InProceedings{Yao_2020_CVPR,
  author = {Yao, Yuan and Liu, Chang and Luo, Dezhao and Zhou, Yu and Ye, Qixiang},
  title = {Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Manipulate Individual Objects in an Image
Yanchao Yang, Yutong Chen, Stefano Soatto


We describe a method to train a generative model with latent factors that are (approximately) independent and localized. This means that perturbing the latent variables affects only local regions of the synthesized image, corresponding to objects. Unlike other unsupervised generative models, ours enables object-centric manipulation, without requiring object-level annotations, or any form of annotation for that matter. The key to our method is the combination of spatial disentanglement, enforced by a Contextual Information Separation loss, and perceptual cycle-consistency, enforced by a loss that penalizes changes in the image partition in response to perturbations of the latent factors. We test our method's ability to allow independent control of spatial and semantic factors of variability on existing datasets and also introduce two new ones that highlight the limitations of current methods.
[context, decoder, natural, temporal, structured] [segmentation, object, contextual, semantic, background, alexander, instance, map, edge, segment] [model, identity, adversarial, manipulation] [spatial, method, perceptual, figure, ieee, pattern, color, proposed, separation, quantitative, partition] [generative, image, disentanglement, latent, independent, inpainting, unsupervised, representation, disentangled, learn, loss, variational, appearance, flying, consistency, masked, traversing, loic, proposes, minimizing, perturbing, corresponding] [network, learning, neural, note, training, randomly, processing, data, number, learned, mutual, task, distribution, machine, bottleneck, deep, binary] [conference, vision, computer, monet, complex, scene, textured, approach, well, international, shape, room, enables, single, colored]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Yanchao and Chen, Yutong and Soatto, Stefano},
  title = {Learning to Manipulate Individual Objects in an Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PADS: Policy-Adapted Sampling for Visual Similarity Learning
Karsten Roth, Timo Milbich, Bjorn Ommer


Learning visual similarity requires learning relations, typically between triplets of images. Although triplet approaches are powerful, their computational complexity mostly limits training to only a subset of all possible training triplets. Thus, sampling strategies that decide when to use which training sample during learning are crucial. Currently, the prominent paradigm is fixed or curriculum sampling strategies that are predefined before training starts. However, the problem truly calls for a sampling process that adjusts based on the actual state of the similarity representation during training. We therefore employ reinforcement learning and have a teacher network adjust the sampling distribution based on the current state of the learner network, which represents visual similarity. Experiments on benchmark datasets using standard triplet-based losses show that our adaptive sampling strategy significantly outperforms fixed sampling strategies. Moreover, although our adaptive sampling is only applied on top of basic triplet-learning frameworks, we reach results competitive with state-of-the-art approaches that employ diverse additional learning signals or strong ensemble architectures. Code can be found under https://github.com/Confusezius/CVPR2020_PADS.
[policy, state, static, provide, recognition, reinforcement, embedding, current, difficulty, multiple] [hard, improves, denotes] [model, strong, face, datasets, effectively] [adaptive, ieee, pattern, based, adjustment, range, analysis] [learn, curriculum, image, representation, control, loss] [sampling, learning, training, distribution, deep, metric, dml, triplet, negative, fixed, performance, process, network, set, nmi, neural, machine, strategy, standard, adapt, margin, ival, complexity, progression, gradient, initialization, processing, manually, optimal, learned, validation, ranking, number, similarity, sample, basic, find, probability, support, optimizing, data] [conference, computer, vision, international, distance, dan, interval, european, additional]
@InProceedings{Roth_2020_CVPR,
  author = {Roth, Karsten and Milbich, Timo and Ommer, Bjorn},
  title = {PADS: Policy-Adapted Sampling for Visual Similarity Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Siam R-CNN: Visual Tracking by Re-Detection
Paul Voigtlaender, Jonathon Luiten, Philip H.S. Torr, Bastian Leibe


We present Siam R-CNN, a Siamese re-detection architecture which unleashes the full power of two-stage object detection approaches for visual object tracking. We combine this with a novel tracklet-based dynamic programming algorithm, which takes advantage of re-detections of both the first-frame template and previous-frame predictions, to model the full history of both the object to be tracked and potential distractor objects. This enables our approach to make better tracking decisions, as well as to re-detect tracked objects after long occlusion. Finally, we propose a novel hard example mining strategy to improve Siam R-CNN's robustness to similar looking objects. Siam R-CNN achieves the current best performance on ten tracking benchmarks, with especially strong results for long-term tracking. We make our code and models available at www.vision.rwth-aachen.de/page/siamrcnn.
[previous, visual, video, frame, current, outperforms, evaluation, work] [object, siam, tracking, tracklet, bounding, box, detection, segmentation, siamese, hard, score, tracklets, rpn, head, template, achieves, mask, coco, region, recall, faster, davis, overlap, sota, proposal, vot, roi, backbone, threshold, table, benchmark, siammask] [success, percentage, example, strong, trained] [dynamic, method, reference, figure, spatial, result] [image] [set, average, training, programming, algorithm, learning, network, precision, best, online, negative, deep, compared, mining, similarity, higher, evaluate, rate, distractor] [ground, truth, supplemental, novel]
@InProceedings{Voigtlaender_2020_CVPR,
  author = {Voigtlaender, Paul and Luiten, Jonathon and Torr, Philip H.S. and Leibe, Bastian},
  title = {Siam R-CNN: Visual Tracking by Re-Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ASLFeat: Learning Local Features of Accurate Shape and Localization
Zixin Luo, Lei Zhou, Xuyang Bai, Hongkai Chen, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, Long Quan


This work focuses on mitigating two limitations in the joint learning of local feature detectors and descriptors. First, the ability to estimate the local shape (scale, orientation, etc.) of feature points is often neglected during dense feature extraction, while the shape-awareness is crucial to acquire stronger geometric invariance. Second, the localization accuracy of detected keypoints is not sufficient to reliably recover camera geometry, which has become the bottleneck in tasks such as 3D reconstruction. In this paper, we present ASLFeat, with three light-weight yet effective modifications to mitigate above issues. First, we resort to deformable convolutional networks to densely estimate and apply local transformation. Second, we take advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization. Finally, we use a peakiness measurement to relate feature responses and derive more indicative detection scores. The effect of each modification is thoroughly studied, and the evaluation is extensively conducted across a variety of practical scenarios. State-of-the-art results are reported that demonstrate the superiority of our methods.
[dataset, three, evaluation, multiple, prediction, visual, outperforms] [feature, detection, localization, score, dcn, detector, including, propose, backbone, adopt, map, table, denotes, threshold] [model, effective, original, input] [spatial, proposed, affine, deformable, convolutional, resolution, convolution, homography, patch] [image, loss] [learning, set, number, ratio, accuracy, training, network, better, learned, find, implementation, report, architecture] [local, keypoint, aslfeat, shape, peakiness, keypoints, error, superpoint, dense, geometric, sift, estimation, matching, hpatches, joint, transformation, reconstruction, descriptor, modelling, muldet, accurate, camera, measurement, geometry, sparse, correspondence, kitti, point]
@InProceedings{Luo_2020_CVPR,
  author = {Luo, Zixin and Zhou, Lei and Bai, Xuyang and Chen, Hongkai and Zhang, Jiahui and Yao, Yao and Li, Shiwei and Fang, Tian and Quan, Long},
  title = {ASLFeat: Learning Local Features of Accurate Shape and Localization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Filter Grafting for Deep Neural Networks
Fanxu Meng, Hao Cheng, Ke Li, Zhixin Xu, Rongrong Ji, Xing Sun, Guangming Lu


This paper proposes a new learning paradigm called filter grafting, which aims to improve the representation capability of Deep Neural Networks (DNNs). The motivation is that DNNs have unimportant (invalid) filters (e.g., l1 norm close to 0). These filters limit the potential of DNNs since they are identified as having little effect on the network. While filter pruning removes these invalid filters for efficiency consideration, filter grafting re-activates them from an accuracy-boosting perspective. The re-activation is achieved by grafting external information (weights) into invalid filters. To better perform the grafting process, we develop an entropy-based criterion to measure the information of filters and an adaptive weighting strategy for balancing the grafted information among networks. After the grafting operation, the network has very few invalid filters compared with its untouched state, empowering the model with more representation capacity. We also perform extensive experiments on classification and recognition tasks to show the superiority of our method. For example, the grafted MobileNetV2 outperforms the non-grafted MobileNetV2 by about 7 percent on the CIFAR-100 dataset.
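A rough numpy sketch of the two ingredients named above: an entropy measure of a layer's weights and an information-weighted graft from a peer network. The fixed histogram range and the sigmoid weighting are assumptions standing in for the paper's exact criterion and adaptive strategy.

import numpy as np

def weight_entropy(w, bins=32):
    """Entropy of a layer's weights, via a histogram over an assumed fixed range."""
    hist, _ = np.histogram(w.ravel(), bins=bins, range=(-1.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def graft(w_self, w_other, c=10.0):
    """Mix in a peer layer's weights; low-information (invalid) layers take more."""
    h_self, h_other = weight_entropy(w_self), weight_entropy(w_other)
    alpha = 1.0 / (1.0 + np.exp(-c * (h_self - h_other)))   # weight kept on our own layer
    return alpha * w_self + (1.0 - alpha) * w_other

rng = np.random.default_rng(0)
dead = 1e-4 * rng.standard_normal((64, 3, 3, 3))            # mostly invalid filters
live = 0.3 * rng.standard_normal((64, 3, 3, 3))
print(weight_entropy(dead), weight_entropy(live))           # low vs. high information
grafted = graft(dead, live)                                 # dead layer mostly refreshed
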
[multiple, recognition, three, dataset, decide] [table, propose, denotes, china] [norm, model, improve, external, dnns, difference, noise, study, internal] [figure, method, ieee, coefficient, convolutional, adaptive, valid, pattern, viewed] [learn, train, loss, person, changing] [grafting, filter, network, learning, invalid, training, baseline, neural, layer, performance, entropy, mutual, number, deep, weight, pruning, graft, weighting, criterion, rate, distillation, arxiv, preprint, better, grafted, performing, calculate, larger, repr, setting, measure, classification, imagenet, suppose, strategy, smaller, algorithm, increase, data, distribution, find, decay, motivation, unimportant, potential] [computer, conference, vision, single, consistent, international, initial, hao]
@InProceedings{Meng_2020_CVPR,
  author = {Meng, Fanxu and Cheng, Hao and Li, Ke and Xu, Zhixin and Ji, Rongrong and Sun, Xing and Lu, Guangming},
  title = {Filter Grafting for Deep Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
HOPE-Net: A Graph-Based Model for Hand-Object Pose Estimation
Bardia Doosti, Shujon Naha, Majid Mirbagheri, David J. Crandall


Hand-object pose estimation (HOPE) aims to jointly detect the poses of both a hand and of a held object. In this paper, we propose a lightweight model called HOPE-Net which jointly estimates hand and object pose in 2D and 3D in real-time. Our network uses a cascade of two adaptive graph convolutional neural networks, one to estimate 2D coordinates of the hand joints and object corners, followed by another to convert 2D coordinates to 3D. Our experiments show that through end-to-end training of the full network, we achieve better accuracy for both the 2D and 3D coordinate estimation problems. The proposed 2D to 3D graph convolution-based model could be applied to other 3D landmark detection problems, where it is possible to first predict the 2D keypoints and then transform them to 3D.
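To make the cascade concrete, here is a small PyTorch sketch of a graph convolution whose adjacency is a trainable parameter, used twice to lift 2D keypoints to 3D. The node count (21 hand joints plus 8 object corners), layer widths, and the softmax row-normalization are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """out = (softmax(A) X) W with the adjacency A learned end-to-end rather than fixed."""
    def __init__(self, n_nodes, in_dim, out_dim):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(n_nodes) + 0.01 * torch.randn(n_nodes, n_nodes))
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x):                          # x: (B, n_nodes, in_dim)
        a = torch.softmax(self.adj, dim=1)         # row-normalized learned adjacency
        return self.lin(a @ x)

g1 = AdaptiveGraphConv(29, 2, 64)                  # 29 nodes: 21 hand joints + 8 box corners
g2 = AdaptiveGraphConv(29, 64, 3)
coords_2d = torch.rand(4, 29, 2)                   # predicted 2D keypoints
coords_3d = g2(torch.relu(g1(coords_2d)))          # lifted to 3D, shape (4, 29, 3)
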
[graph, action, adjacency, recognition, dataset, work, node, correct, connected, convert, predict, gao, skeleton] [object, pooling, annotated, bounding, predicted, box, table] [model, percentage, trained, input] [adaptive, convolution, convolutional, figure, ieee, pattern, lightweight, proposed] [image, loss, learn, encoder] [network, matrix, learning, neural, training, better, layer, architecture, average, compared, best] [hand, pose, conference, estimation, computer, vision, human, single, unpooling, error, approach, initial, well, international, jointly, keypoints, shape, body, joint, mesh, estimate, tekin, wrist]
@InProceedings{Doosti_2020_CVPR,
  author = {Doosti, Bardia and Naha, Shujon and Mirbagheri, Majid and Crandall, David J.},
  title = {HOPE-Net: A Graph-Based Model for Hand-Object Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DeepFaceFlow: In-the-Wild Dense 3D Facial Motion Estimation
Mohammad Rami Koujan, Anastasios Roussos, Stefanos Zafeiriou


Dense 3D facial motion capture from only monocular in-the-wild pairs of RGB images is a highly challenging problem with numerous applications, ranging from facial expression recognition to facial reenactment. In this work, we propose DeepFaceFlow, a robust, fast, and highly accurate framework for the dense estimation of 3D non-rigid facial flow between pairs of monocular images. Our DeepFaceFlow framework was trained and tested on two very large-scale facial video datasets, one of them collected and annotated by ourselves, with the aid of an occlusion-aware, 3D-based loss function. We conduct comprehensive experiments probing different aspects of our approach and demonstrating its improved performance over state-of-the-art flow and 3D reconstruction methods. Furthermore, we incorporate our framework into a full-head state-of-the-art facial video synthesis method and demonstrate the ability of our method to better represent and capture facial dynamics, resulting in highly realistic facial video synthesis. Given registered pairs of images, our framework generates 3D flow maps at 60 fps.
[dataset, pair, video, work, temporal, recognition, frame] [framework, table, head, annotated] [facial, face, expression, trained, input, stefanos, anastasios, datasets, model, highly, help] [flow, optical, motion, method, ieee, figure, proposed, pattern, pixel, convolutional, output, scale, based] [image, train, corresponding, loss, synthesis] [training, network, learning, deep, test, equation, problem, size, large, rate, average, entire, architecture] [estimation, scene, conference, computer, dense, rgb, monocular, international, reconstruction, shape, vision, estimating, estimated, approach, human, second, supplementary, well, visible, stereo, error, thomas, collection, pose, depth, estimate]
@InProceedings{Koujan_2020_CVPR,
  author = {Koujan, Mohammad Rami and Roussos, Anastasios and Zafeiriou, Stefanos},
  title = {DeepFaceFlow: In-the-Wild Dense 3D Facial Motion Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning for Video Compression With Hierarchical Quality and Recurrent Enhancement
Ren Yang, Fabian Mentzer, Luc Van Gool, Radu Timofte


In this paper, we propose a Hierarchical Learned Video Compression (HLVC) method with three hierarchical quality layers and a recurrent enhancement network. The frames in the first layer are compressed by an image compression method with the highest quality. Using these frames as references, we propose the Bi-Directional Deep Compression (BDDC) network to compress the second layer with relatively high quality. Then, the third layer frames are compressed with the lowest quality, by the proposed Single Motion Deep Compression (SMDC) network, which adopts a single motion map to estimate the motions of multiple frames, thus saving bits for motion information. In our deep decoder, we develop the Weighted Recurrent Quality Enhancement (WRQE) network, which takes both compressed frames and the bit stream as inputs. In the recurrent cell of WRQE, the memory and update signal are weighted by quality features to reasonably leverage multi-frame information for enhancement. In our HLVC approach, the hierarchical quality benefits the coding efficiency, since the high quality information facilitates the compression and enhancement of low quality frames at encoder and decoder sides, respectively. Finally, the experiments validate that our HLVC approach advances the state-of-the-art of deep video compression methods, and outperforms the "Low-Delay P (LDP) very fast" mode of x265 in terms of both PSNR and MS-SSIM. The project page is at https://github.com/RenYang-home/HLVC.
[video, hierarchical, frame, recurrent, outperforms, decoder, recognition, previous] [propose, map, framework, table, achieves] [quality, model, improve, encoded] [compression, motion, figure, compressed, enhancement, psnr, high, hlvc, wrqe, ieee, proposed, low, smdc, cell, method, compress, optimized, dvc, reference, bdbr, coding, bddc, wim, ren, subnet, pattern, inverse, mai, reasonably, traditional, based, residual, uvg] [image, loss] [layer, deep, network, learned, performance, weighted, learning, bit, neural, calculated, test, baseline, memory, compressing, average, better, higher, training, class] [conference, approach, single, computer, vision, international, estimate, david]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Ren and Mentzer, Fabian and Gool, Luc Van and Timofte, Radu},
  title = {Learning for Video Compression With Hierarchical Quality and Recurrent Enhancement},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Better Lossless Compression Using Lossy Compression
Fabian Mentzer, Luc Van Gool, Michael Tschannen


We leverage the powerful lossy image compression algorithm BPG to build a lossless image compression system. Specifically, the original image is first decomposed into the lossy reconstruction obtained after compressing it with BPG and the corresponding residual. We then model the distribution of the residual with a convolutional neural network-based probabilistic model that is conditioned on the BPG reconstruction, and combine it with entropy coding to losslessly encode the residual. Finally, the image is stored using the concatenation of the bitstreams produced by BPG and the learned residual coder. The resulting compression system achieves state-of-the-art performance in learned lossless full-resolution image compression, outperforming previous learned approaches as well as PNG, WebP, and JPEG2000.
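The decomposition is easy to sketch. Below, residual_bits_upper_bound computes the ideal code length of the residual under any per-pixel probability model; the uniform model in the demo is only a stand-in for the paper's conditional CNN, and the BPG encode/decode step itself is left out.

import numpy as np

def residual_bits_upper_bound(original, lossy, probs):
    """Ideal number of bits to entropy-code the residual `original - lossy`.

    original, lossy: uint8 arrays of equal shape. probs: for every pixel, a
    probability for each possible residual value in [-255, 255], i.e. shape
    original.shape + (511,). The stored stream would be the BPG bitstream plus
    an entropy-coded residual of roughly this many bits.
    """
    residual = original.astype(np.int16) - lossy.astype(np.int16)
    idx = residual + 255                                     # map [-255, 255] -> [0, 510]
    p = np.take_along_axis(probs, idx[..., None], axis=-1).squeeze(-1)
    return -np.log2(np.clip(p, 1e-12, 1.0)).sum()

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
lossy = np.clip(original.astype(np.int16) + rng.integers(-2, 3, size=(8, 8)), 0, 255).astype(np.uint8)
probs = np.full((8, 8, 511), 1.0 / 511)                      # uniform stand-in model
print(residual_bits_upper_bound(original, lossy, probs) / original.size, "bits per pixel")
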
[encode, previous, modeling, video, store, encoding, predict, natural, stream, fabian] [predicted, van, table, cnn, luc] [model, trained, input, overview, jpeg] [compression, lossless, residual, lossy, bpg, coding, bpsp, compress, method, conv, bitrate, flif, arithmetic, based, convolutional, compressor, proposed, pixelcnn, symbol, eirikur, losslessly, figure, coder, likelihood] [image, generative, train, minimizing] [learned, distribution, data, open, training, entropy, probability, set, mixture, network, probabilistic, fixed, deep, neural, optimal, note, random, learning, parameter, forward, evaluate, smaller, logistic, quantization, simple, discrete, tail, achieve, performance, better] [well, reconstruction, approach, rgb, david, michael, leverage, system, single]
@InProceedings{Mentzer_2020_CVPR,
  author = {Mentzer, Fabian and Gool, Luc Van and Tschannen, Michael},
  title = {Learning Better Lossless Compression Using Lossy Compression},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Flow2Stereo: Effective Self-Supervised Learning of Optical Flow and Stereo Matching
Pengpeng Liu, Irwin King, Michael R. Lyu, Jia Xu


In this paper, we propose a unified method to jointly learn optical flow and stereo matching. Our first intuition is that stereo matching can be modeled as a special case of optical flow, and that we can leverage the 3D geometry behind stereoscopic videos to guide the learning of these two forms of correspondence. We then incorporate this knowledge into the state-of-the-art self-supervised learning framework, and train one single network to estimate both flow and stereo. Second, we unveil the bottlenecks in prior self-supervised learning approaches, and propose to create a new set of challenging proxy tasks to boost performance. These two insights yield a single model that achieves the highest accuracy among all existing unsupervised flow and stereo methods on the KITTI 2012 and 2015 benchmarks. More remarkably, our self-supervised method even outperforms several state-of-the-art fully supervised methods, including PWC-Net and FlowNet2, on KITTI 2012.
[time, relationship, evaluation, prediction] [occluded, challenging, employ, stage, achieves, propose, fully, including, improves, key] [model, create, improve, input, difference, testing] [flow, optical, disparity, method, pixel, plt, motion, quadrilateral, selflow, stereoscopic, figure, based] [unsupervised, loss, supervised, image, row, train, consistency, learn, creating] [learning, confident, performance, training, achieve, network, set, teacher, proxy, denote, best, large, deep, better, student, accuracy, data] [stereo, kitti, matching, geometric, estimation, constraint, single, correspondence, estimate, photometric, compute, triangle, depth, michael, second, camera, direction, jointly, accurate, cost, ground, projection, point, epipolar]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Pengpeng and King, Irwin and Lyu, Michael R. and Xu, Jia},
  title = {Flow2Stereo: Effective Self-Supervised Learning of Optical Flow and Stereo Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Scale Fusion Subspace Clustering Using Similarity Constraint
Zhiyuan Dang, Cheng Deng, Xu Yang, Heng Huang


Classical subspace clustering methods often assume that the raw data lie in a union of low-dimensional linear subspaces. This assumption is too strict in practice, which largely limits the generalization of subspace clustering. To tackle this issue, deep subspace clustering (DSC) networks based on a deep autoencoder (DAE) have been proposed, which non-linearly map the raw data into a latent space well adapted to subspace clustering. However, existing DSC models ignore the important multi-scale information embedded in the DAE and thus abandon the much more useful deep features, leading to suboptimal clustering results. In this paper, we propose the Multi-Scale Fusion Subspace Clustering Using Similarity Constraint (SC-MSFSC) network, which learns a more discriminative self-expression coefficient matrix through a novel multi-scale fusion module. More importantly, it introduces a similarity constraint module to guide the fused self-expression coefficient matrix during training. Specifically, the multi-scale fusion module generates the self-expression coefficient matrix of each convolutional layer in the DAE and then fuses them with the convolutional kernel. In addition, the similarity constraint module supervises the fused self-expression coefficient matrix with the designed similarity matrix. Extensive experimental results on four benchmark datasets demonstrate the superiority of our new model against state-of-the-art methods.
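The classical linear core behind self-expression based clustering can be sketched in a few lines: express each sample as a combination of the others, then spectral-cluster the resulting affinity. The deep model above instead learns the coefficient matrix inside an autoencoder and fuses it across scales; the ridge-regression solver and the parameters below are assumptions for illustration only.

import numpy as np
from sklearn.cluster import SpectralClustering

def self_expression_clustering(X, n_clusters, tau=1e-2):
    """X: (d, n) data, one sample per column.

    Solves min_C ||X - XC||_F^2 + tau ||C||_F^2 in closed form, then runs
    spectral clustering on the symmetric affinity |C| + |C|^T.
    """
    n = X.shape[1]
    G = X.T @ X
    C = np.linalg.solve(G + tau * np.eye(n), G)       # ridge self-expression coefficients
    np.fill_diagonal(C, 0.0)                          # heuristic: drop self-explanation
    W = np.abs(C) + np.abs(C).T
    return SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                              random_state=0).fit_predict(W)

rng = np.random.default_rng(0)
B1, B2 = rng.normal(size=(10, 2)), rng.normal(size=(10, 2))   # two 2D subspaces of R^10
X = np.hstack([B1 @ rng.normal(size=(2, 40)), B2 @ rng.normal(size=(2, 40))])
print(self_expression_clustering(X, n_clusters=2))
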
[dataset, embedded, decoder] [module, table, segmentation, feature, affinity, adopt, benchmark, final] [input, face, experimental, norm, datasets, kxl] [coefficient, kernel, fusion, convolutional, fused, dsc, orl, proposed, figure, spectral, ieee, based, supervise, channel, coil, stacked, extraction, pattern, existing, selfexpression, raw] [loss, latent, extended, image, discriminative, variable, unsupervised, extracted, learn, consists] [clustering, matrix, subspace, similarity, network, data, deep, dae, layer, learning, size, set, function, performance, best, yale, linear, training, initialization, baseline, space, proper, process, note, design] [constraint, reconstruction, sparse, novel, median, error, structure, demonstrate, property]
@InProceedings{Dang_2020_CVPR,
  author = {Dang, Zhiyuan and Deng, Cheng and Yang, Xu and Huang, Heng},
  title = {Multi-Scale Fusion Subspace Clustering Using Similarity Constraint},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Siamese Box Adaptive Network for Visual Tracking
Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang, Rongrong Ji


Most of the existing trackers usually rely on either a multi-scale searching scheme or pre-defined anchor boxes to accurately estimate the scale and aspect ratio of a target. Unfortunately, they typically call for tedious and heuristic configurations. To address this issue, we propose a simple yet effective visual tracking framework (named Siamese Box Adaptive Network, SiamBAN) by exploiting the expressive power of the fully convolutional network (FCN). SiamBAN views the visual tracking problem as a parallel classification and regression problem, and thus directly classifies objects and regresses their bounding boxes in a unified FCN. The no-prior box design avoids hyper-parameters associated with the candidate boxes, making SiamBAN more flexible and general. Extensive experiments on visual tracking benchmarks including VOT2018, VOT2019, OTB100, NFS, UAV123, and LaSOT demonstrate that SiamBAN achieves state-of-the-art performance and runs at 40 FPS, confirming its effectiveness and efficiency. The code will be available at https://github.com/hqucv/siamban.
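As a rough illustration of the anchor-free design, the sketch below computes a depthwise cross-correlation between template and search-region features and attaches 1x1 classification and box-offset heads. Shapes, head widths, and the ReLU on the offsets are assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    """Correlate each template channel with its own search channel.

    search: (B, C, Hs, Ws), template: (B, C, Ht, Wt); returns a
    (B, C, Hs-Ht+1, Ws-Wt+1) response map.
    """
    b, c, h, w = search.shape
    out = F.conv2d(search.reshape(1, b * c, h, w),
                   template.reshape(b * c, 1, *template.shape[2:]),
                   groups=b * c)
    return out.reshape(b, c, out.shape[2], out.shape[3])

resp = depthwise_xcorr(torch.rand(2, 256, 31, 31), torch.rand(2, 256, 7, 7))
cls_head = nn.Conv2d(256, 1, 1)                      # per-location foreground score
reg_head = nn.Conv2d(256, 4, 1)                      # distances to the four box sides
scores, offsets = cls_head(resp), F.relu(reg_head(resp))   # (2,1,25,25) and (2,4,25,25)
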
[visual, prediction] [object, tracking, siamese, tracker, box, bounding, feature, siamban, regression, map, ellipse, dimp, achieves, location, module, center, detection, correlation, atom, eco, framework, backbone, template, table, rectangle, anchor, lasot, reg, positive, eao, head, atrous, gyc, overlap, assignment, including, challenge, siamrpn] [success] [ieee, pattern, figure, scale, based, adaptive, aspect, convolutional, patch, convolution, comparison] [target, image, corresponding, loss, appearance] [network, search, classification, precision, rate, performance, label, learning, size, deep, compared, neural, circle, ratio, design, best, set, candidate] [computer, conference, vision, international, european, position, michael, accurately, point]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Zedu and Zhong, Bineng and Li, Guorong and Zhang, Shengping and Ji, Rongrong},
  title = {Siamese Box Adaptive Network for Visual Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-Domain Face Presentation Attack Detection via Multi-Domain Disentangled Representation Learning
Guoqing Wang, Hu Han, Shiguang Shan, Xilin Chen


Face presentation attack detection (PAD) has been an urgent problem to solve in face recognition systems. Conventional approaches usually assume that testing and training are within the same domain; as a result, they may not generalize well to unseen scenarios because the representations learned for PAD may overfit to the subjects in the training set. In light of this, we propose an efficient disentangled representation learning approach for cross-domain face PAD. Our approach consists of disentangled representation learning (DR-Net) and multi-domain learning (MD-Net). DR-Net learns a pair of encoders via generative models that can disentangle PAD-informative features from subject-discriminative features. The disentangled features from different domains are fed to MD-Net, which learns domain-independent features for the final cross-domain face PAD task. Extensive experiments on several public datasets validate the effectiveness of the proposed approach for cross-domain PAD.
[dataset, work, recognition, pair, video, extract, individual] [feature, detection, table, propose, module, cnn, effectiveness] [face, pad, spoof, live, generalization, model, protocol, attack, subject, presentation, testing, identity, adversarial, maddg, auxiliary, antispoofing, casia, drec, forensics, improve, robust] [proposed, conventional, based, ieee, method, figure, color] [disentangled, domain, representation, image, encoders, learn, source, generative, loss, generated, adaptation, lce, photo, ability, gan, train, consists, discriminative, texture, utilize, perform, lrec, unseen, disentangle, corresponding] [learning, classification, deep, training, learned, performance, better, network, space, binary, denote, data, baseline] [approach, well]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Guoqing and Han, Hu and Shan, Shiguang and Chen, Xilin},
  title = {Cross-Domain Face Presentation Attack Detection via Multi-Domain Disentangled Representation Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Online Deep Clustering for Unsupervised Representation Learning
Xiaohang Zhan, Jiahao Xie, Ziwei Liu, Yew-Soon Ong, Chen Change Loy


Joint clustering and feature learning methods have shown remarkable performance in unsupervised representation learning. However, the training schedule alternating between feature clustering and network parameter updates leads to unstable learning of visual representations. To overcome this challenge, we propose Online Deep Clustering (ODC), which performs clustering and network update simultaneously rather than alternatingly. Our key insight is that the cluster centroids should evolve steadily to keep the classifier stably updated. Specifically, we design and maintain two dynamic memory modules, i.e., a samples memory to store samples' labels and features, and a centroids memory for centroid evolution. We break down the abrupt global clustering into steady memory updates and batch-wise label re-assignment. The process is integrated into network update iterations. In this way, labels and the network evolve shoulder-to-shoulder rather than alternatingly. Extensive experiments demonstrate that ODC stabilizes the training process and boosts the performance effectively.
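The following is a rough PyTorch sketch of one batch-wise update in the spirit of the samples/centroids memories described above; it is not the authors' implementation, and the momentum value, feature dimension and cluster count are illustrative assumptions.

import torch

N, K, D = 10_000, 100, 128          # dataset size, clusters, feature dim (assumed)
momentum = 0.5                       # memory update momentum (assumed)

samples_mem = torch.nn.functional.normalize(torch.randn(N, D), dim=1)
labels_mem = torch.randint(0, K, (N,))
centroids = torch.nn.functional.normalize(torch.randn(K, D), dim=1)

def odc_step(indices, features):
    """One batch-wise memory update and label re-assignment."""
    feats = torch.nn.functional.normalize(features, dim=1)
    # 1) steady samples-memory update
    samples_mem[indices] = torch.nn.functional.normalize(
        momentum * samples_mem[indices] + (1 - momentum) * feats, dim=1)
    # 2) re-assign each sample in the batch to its nearest centroid
    new_labels = (samples_mem[indices] @ centroids.t()).argmax(dim=1)
    labels_mem[indices] = new_labels
    # 3) refresh the centroids touched by this batch from the samples memory
    for k in new_labels.unique():
        members = samples_mem[labels_mem == k]
        if len(members) > 0:
            centroids[k] = torch.nn.functional.normalize(members.mean(dim=0), dim=0)
    return new_labels

idx = torch.randint(0, N, (32,))
targets = odc_step(idx, torch.randn(32, D))   # pseudo-labels for this batch

The returned pseudo-labels then serve as classification targets for the same batch, so clustering and network training advance together rather than in alternating global passes.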
[visual, previous, abhinav] [feature, map, table, propose, backbone, split, highest] [model, original, trained, adversarial, change, largest] [figure, method, high, frequency, proposed] [unsupervised, representation, cluster, image, loss, supervised, train, perform, generative, ziwei] [odc, learning, clustering, performance, deep, imagenet, update, training, network, classification, svm, memory, small, process, size, layer, class, report, label, online, learned, large, linear, batch, number, arxiv, preprint, iteration, classifier, task, alexnet, observe, best, ratio, andrew, randomly, random, xiaohang] [joint, rotation]
@InProceedings{Zhan_2020_CVPR,
  author = {Zhan, Xiaohang and Xie, Jiahao and Liu, Ziwei and Ong, Yew-Soon and Loy, Chen Change},
  title = {Online Deep Clustering for Unsupervised Representation Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Density-Aware Feature Embedding for Face Clustering
Senhui Guo, Jing Xu, Dapeng Chen, Chao Zhang, Xiaogang Wang, Rui Zhao


Clustering has many applications in research and industry. However, traditional clustering methods, such as K-means, DBSCAN and HAC, impose oversimplifying assumptions and thus are not well-suited to face clustering. To adapt to the distribution of realistic problems, a natural approach is to use Graph Convolutional Networks (GCNs) to enhance features for clustering. However, GCNs can only utilize local information, which ignores the overall characteristics of the clusters. In this paper, we propose a Density-Aware Feature Embedding Network (DA-Net) for the task of face clustering, which utilizes both local and non-local information, to learn a robust feature embedding. Specifically, DA-Net uses GCNs to aggregate features locally, and then incorporates non-local information using a density chain, which is a chain of faces from low density to high density. This density chain exploits the non-uniform distribution of face images in the dataset. Then, an LSTM takes the density chain as input to generate the final feature embedding. Once this embedding is generated, traditional clustering methods, such as density-based clustering, can be used to obtain the final clustering results. Extensive experiments verify the effectiveness of the proposed feature embedding method, which can achieve state-of-the-art performance on public benchmarks.
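As a rough illustration of the density-chain idea, the sketch below greedily hops from a query face towards denser neighbourhoods, producing an ordered sequence that could be fed to an LSTM. The density definition (mean similarity to the k nearest neighbours), the chain length and the greedy hopping rule are assumptions for illustration, not the paper's exact formulation.

import numpy as np

def knn(feats, i, k):
    sims = feats @ feats[i]
    order = np.argsort(-sims)
    return order[order != i][:k]

def density(feats, i, k=10):
    return (feats @ feats[i])[knn(feats, i, k)].mean()

def density_chain(feats, start, k=10, length=5):
    """Greedily hop to the highest-density neighbour until the chain is full."""
    chain, cur = [start], start
    for _ in range(length - 1):
        neighbours = knn(feats, cur, k)
        nxt = max(neighbours, key=lambda j: density(feats, j, k))
        if density(feats, nxt, k) <= density(feats, cur, k):
            break                      # already at a local density peak
        chain.append(int(nxt))
        cur = int(nxt)
    return chain                       # indices ordered from low to high density

feats = np.random.randn(200, 64)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
print(density_chain(feats, start=0))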
[graph, node, embedding, gcn, lstm, youtube, outperforms, connected, step, attention, time, relevant, visual] [feature, table, final, nearby, threshold, module, peak, key, aggregate] [face, clique, original, query, model, dbscan, robust, input, datasets] [method, figure, based, proposed, high, comparison, traditional] [cluster, image, utilize, person, learn, generate, corresponding, consists] [clustering, chain, density, network, distribution, learning, data, higher, find, layer, better, complexity, probability, efficient, training, update, neural, algorithm, deep, number, learned, pairwise, function, precision, learnclust, belong] [local, nearest, approach, represented, conference, structure, neighbor, international]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Senhui and Xu, Jing and Chen, Dapeng and Zhang, Chao and Wang, Xiaogang and Zhao, Rui},
  title = {Density-Aware Feature Embedding for Face Clustering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Supervised Learning of Pretext-Invariant Representations
Ishan Misra, Laurens van der Maaten


The goal of self-supervised learning from images is to construct image representations that are semantically meaningful via pretext tasks that do not require semantic annotations. Many pretext tasks lead to representations that are covariant with image transformations. We argue that, instead, semantic representations ought to be invariant under such transformations. Specifically, we develop Pretext-Invariant Representation Learning (PIRL, pronounced as `pearl') that learns invariant representations based on pretext tasks. We use PIRL with a commonly used pretext task that involves solving jigsaw puzzles. We find that PIRL substantially improves the semantic quality of the learned image representations. Our approach sets a new state-of-the-art in self-supervised learning from images on several popular benchmarks for self-supervised learning. Despite being unsupervised, PIRL outperforms supervised pre-training in learning image representations for object detection. Altogether, our results demonstrate the potential of self-supervised representations with good invariance properties.
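A minimal sketch of a noise-contrastive objective with a memory bank, in the spirit of the invariance objective described above: the representation of a jigsaw-transformed view is pulled towards the memory entry of its own image and pushed away from all other entries. The temperature, dimensions and single-loss form (the paper combines two NCE terms) are assumptions.

import torch
import torch.nn.functional as F

def pirl_style_nce_loss(v_transformed, memory_bank, pos_idx, tau=0.07):
    """v_transformed: (B, D) features of jigsaw views; memory_bank: (N, D)."""
    v = F.normalize(v_transformed, dim=1)
    m = F.normalize(memory_bank, dim=1)
    logits = v @ m.t() / tau                 # similarity to every memory entry
    return F.cross_entropy(logits, pos_idx)  # positive = the view's own image entry

memory = torch.randn(1000, 128)              # per-image memory bank (assumed size)
views = torch.randn(32, 128)                 # features of transformed views
idx = torch.randint(0, 1000, (32,))          # index of each view's source image
loss = pirl_style_nce_loss(views, memory, idx)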
[outperforms, visual, work, dataset, predicting, video, recognition, bank, predict, latexit] [object, table, semantic, feature, detection, van] [quality, trained, model, study] [prior, figure, patch, convolutional, method] [image, representation, jigsaw, invariant, loss, unsupervised, supervised, covariant, invariance, learn, transformed, learns, train, encourages, transfer] [pirl, learning, pretext, imagenet, accuracy, number, data, linear, performance, task, classification, network, arxiv, set, equation, preprint, function, memory, contrastive, training, large, better, negative, layer, setup, npid, andrew, learned, setting, fixed, deep, measure, best] [transformation, rotation, approach]
@InProceedings{Misra_2020_CVPR,
  author = {Misra, Ishan and Maaten, Laurens van der},
  title = {Self-Supervised Learning of Pretext-Invariant Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ROAM: Recurrently Optimizing Tracking Model
Tianyu Yang, Pengfei Xu, Runbo Hu, Hua Chai, Antoni B. Chan


In this paper, we design a tracking model consisting of response generation and bounding box regression, where the first component produces a heat map to indicate the presence of the object at different positions and the second part regresses the relative bounding box shifts to anchors mounted on sliding-window locations. Thanks to the resizable convolutional filters used in both components to adapt to the shape changes of objects, our tracking model does not need to enumerate anchors of different sizes, thus saving model parameters. To effectively adapt the model to appearance variations, we propose to train a recurrent neural optimizer offline, in a meta-learning setting, to update the tracking model; it converges the model in a few gradient steps. This improves the convergence speed of updating the tracking model while achieving better performance. We extensively evaluate our trackers, ROAM and ROAM++, on the OTB, VOT, LaSOT, GOT-10K and TrackingNet benchmarks, and our methods perform favorably against state-of-the-art algorithms.
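The recurrent-optimizer idea can be sketched as a tiny LSTM that maps each weight's gradient to an adaptive, element-wise learning rate for one inner adaptation step. This is only an illustrative sketch under assumed shapes; the offline meta-training loop and the inputs to the optimizer differ in the actual method.

import torch
import torch.nn as nn

class RecurrentOptimizer(nn.Module):
    def __init__(self, hidden=20):
        super().__init__()
        self.rnn = nn.LSTMCell(1, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, grad, state=None):
        g = grad.reshape(-1, 1)                               # one scalar per weight
        h, c = self.rnn(g, state) if state else self.rnn(g)
        lr = torch.sigmoid(self.out(h)).reshape(grad.shape)   # per-weight learning rate
        return lr, (h, c)

# one inner adaptation step of a correlation-filter-like weight tensor
weights = torch.randn(1, 64, 3, 3, requires_grad=True)
loss = (weights ** 2).sum()                                   # stand-in tracking loss
grad, = torch.autograd.grad(loss, weights)
opt = RecurrentOptimizer()
lr, state = opt(grad)
weights = weights - lr * grad                                 # meta-learned update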
[visual, recurrent, frame, future, video, lstm, current, previous] [tracking, object, bounding, box, roam, response, map, feature, regression, subsequent, siamfc, siamese, metatracker, eco, propose, anchor, correlation, reg, overlap, recurrently, resizable, siamrpn, framework, faster, including] [model, offline, success, input, auc, trained] [convolutional, based, aspect, scale, traditional, adaptive, figure, spatial, method] [loss, target, generation, appearance, perform, image, train] [learning, neural, gradient, training, optimizer, update, rate, filter, updating, size, meta, learned, optimization, network, sgd, compared, adapt, number, performance, deep, precision, large, online, note, better, updated, performing, function] [initial, shape, well]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Tianyu and Xu, Pengfei and Hu, Runbo and Chai, Hua and Chan, Antoni B.},
  title = {ROAM: Recurrently Optimizing Tracking Model},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deformable Siamese Attention Networks for Visual Object Tracking
Yuechen Yu, Yilei Xiong, Weilin Huang, Matthew R. Scott


Siamese-based trackers have achieved excellent performance on visual object tracking. However, the target template is not updated online, and the features of the target template and search image are computed independently in a Siamese architecture. In this paper, we propose Deformable Siamese Attention Networks, referred to as SiamAttn, by introducing a new Siamese attention mechanism that computes deformable self-attention and cross-attention. The self-attention learns strong context information via spatial attention, and selectively emphasizes interdependent channel-wise features with channel attention. The cross-attention is capable of aggregating rich contextual interdependencies between the target template and the search image, providing an implicit manner to adaptively update the target template. In addition, we design a region refinement module that computes depth-wise cross correlations between the attentional features for more accurate tracking. We conduct experiments on six benchmarks, where our method achieves new state-of-the-art results, outperforming the strong recent baseline SiamRPN++ and improving EAO from 0.464 to 0.537 on VOT2016 and from 0.415 to 0.470 on VOT2018.
[visual, attention, attentional, three, mechanism, context, work, multiple, prediction] [siamese, object, tracking, region, module, refinement, mask, template, bounding, eao, rpn, correlation, proposal, box, head, feature, score, atom, lasot, achieves, siamrpn, regression, trackingnet, table, wei, contextual, tracker, siammask, siamattn, backbone, apply] [improve, model, strong, success, robustness] [deformable, spatial, convolutional, method, convolution, proposed, channel, introduced, enhance] [target, image, cross, discriminative, dsa, representation, generating, loss] [search, learning, performance, training, deep, large, applied, precision, accuracy, compared, network, classification, set] [computed, complex, single, michael, computes]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Yuechen and Xiong, Yilei and Huang, Weilin and Scott, Matthew R.},
  title = {Deformable Siamese Attention Networks for Visual Object Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
15 Keypoints Is All You Need
Michael Snower, Asim Kadav, Farley Lai, Hans Peter Graf


Pose-tracking is an important problem that requires identifying unique human pose-instances and matching them temporally across different frames in a video. However, existing pose-tracking methods are unable to accurately model temporal relationships and require significant computation, often computing the tracks offline. We present an efficient multi-person pose-tracking method, KeyTrack, which relies only on keypoint information, without using any RGB or optical flow, to locate and track human keypoints in real time. KeyTrack is a top-down approach that learns spatio-temporal pose relationships by modeling the multi-person pose-tracking problem as a novel Pose Entailment task using a Transformer based architecture. Furthermore, KeyTrack uses a novel, parameter-free keypoint refinement technique that improves the keypoint estimates used by the Transformers. We achieve state-of-the-art results on the PoseTrack'17 and PoseTrack'18 benchmarks while using only a fraction of the computation used by most other methods for computing the tracking information.
[temporal, transformer, entailment, attention, frame, video, token, pair, keytrack, gcn, work, toks, visual, step, sentence, multiple, timestep, idsw, temporally] [tracking, posetrack, bounding, box, track, location, mota, detection, table, improvement, object, hrnet, fps] [model, type, input, improve] [spatial, ieee, flow, pattern, optical, method, figure, convolutional, based, low, cnns] [person, image, learn] [network, test, learning, efficient, set, arxiv, preprint, number, deep, problem, binary, neural, accuracy, indicates, classification] [pose, keypoint, keypoints, estimation, computer, conference, vision, human, matching, match, compare, position, approach, accurate, joint, unique, single]
@InProceedings{Snower_2020_CVPR,
  author = {Snower, Michael and Kadav, Asim and Lai, Farley and Graf, Hans Peter},
  title = {15 Keypoints Is All You Need},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Optical Flow in the Dark
Yinqiang Zheng, Mingfang Zhang, Feng Lu


Many successful optical flow estimation methods have been proposed, but they become invalid when tested in dark scenes, because low-light scenarios are not considered when they are designed and current optical flow benchmark datasets lack low-light samples. Even if we preprocess the dark images with an enhancement method that achieves great visual perception, the optical flow results remain poor or even get worse, because information such as motion consistency may be broken during enhancement. We propose an end-to-end data-driven method that avoids error accumulation and learns optical flow directly from low-light noisy images. Specifically, we develop a method to synthesize large-scale low-light optical flow datasets by simulating the noise model on dark raw images. We also collect a new optical flow dataset in raw format with a large range of exposure to be used as a benchmark. The models trained on our synthetic dataset largely maintain optical flow accuracy as the image brightness descends, and they outperform the existing methods greatly on low-light images.
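A toy example of simulating low-light raw data from clean raw frames with a simple Poisson-Gaussian noise model, the kind of synthesis one could apply to existing optical-flow datasets before training. The parameters and the simplistic sensor model are assumptions, not the paper's calibrated noise model.

import numpy as np

def darken_raw(raw, exposure_ratio=0.05, full_well=4000.0, read_std=2.0, rng=None):
    """raw: clean linear raw image in [0, 1]; returns a noisy low-light version."""
    rng = rng or np.random.default_rng()
    photons = raw * exposure_ratio * full_well          # fewer photons are collected
    shot = rng.poisson(photons).astype(np.float32)      # signal-dependent shot noise
    read = rng.normal(0.0, read_std, raw.shape)         # signal-independent read noise
    noisy = (shot + read) / full_well                   # stays dark: no re-brightening
    return np.clip(noisy, 0.0, 1.0).astype(np.float32)

clean = np.random.rand(128, 128).astype(np.float32)
dark = darken_raw(clean)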
[dataset, current, work, order] [table, propose, benchmark, level, achieves, main] [noise, model, input, trained, datasets, poisson, collect, deal, choose, original, poor] [optical, flow, dark, raw, brightness, sid, method, figure, flyingchairs, nlm, existing, vbof, exposure, bright, ieee, pattern, fcdn, denoising, flownets, sony, enhance, noisy, result, flownetc, enhancing, analysis, based, reference, enhanced, simulate, sensor, read, fujifilm, iso, feng, format, range, descends] [image, real, synthetic, synthesize, produced, corresponding] [data, training, performance, better, large, accuracy, network, evaluate, distribution, parameter, worse, achieve, learning, set] [conference, computer, vision, solution, camera, directly, international, complex, rgb, estimation]
@InProceedings{Zheng_2020_CVPR,
  author = {Zheng, Yinqiang and Zhang, Mingfang and Lu, Feng},
  title = {Optical Flow in the Dark},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Sketch-BERT: Learning Sketch Bidirectional Encoder Representation From Transformers by Self-Supervised Learning of Sketch Gestalt
Hangyu Lin, Yanwei Fu, Xiangyang Xue, Yu-Gang Jiang


Previous research on sketches has often considered them in pixel format and leveraged CNN-based models for sketch understanding. Fundamentally, a sketch is stored as a sequence of data points, a vector format representation, rather than a photo-realistic image of pixels. SketchRNN studied a generative neural representation for sketches in vector format using Long Short-Term Memory (LSTM) networks. Unfortunately, the representation learned by SketchRNN is primarily for generation tasks, rather than the other tasks of sketch recognition and retrieval. To this end, and inspired by the recent BERT model, we present a model for learning Sketch Bidirectional Encoder Representation from Transformer (Sketch-BERT). We generalize BERT to the sketch domain, with novel proposed components and pre-training algorithms, including newly designed sketch embedding networks and self-supervised learning of sketch gestalt. In particular, towards the pre-training task, we present a novel Sketch Gestalt Model (SGM) to help train the Sketch-BERT. Experimentally, we show that the learned representation of Sketch-BERT can improve the performance of the downstream tasks of sketch recognition, sketch retrieval, and sketch gestalt.
[retrieval, embedding, state, transformer, recognition, sequence, sequential, bert, dataset, inspired, predict, language, downstream, hidden, bidirectional, understanding] [mask, offset, cnn, refine, improvement, feature] [model, input, help] [based, utilized, proposed, format, figure, pixel, designed, method] [sketch, gestalt, representation, masked, image, quickdraw, sketchbert, sketchrnn, train, loss, inpainting, generation, encoder, generative, stroke, learn] [learning, task, classification, network, performance, neural, training, data, vector, number, test, set, layer, arxiv, preprint, deep, studied, learned, indicates, efficiently] [point, novel, position, structure, ground, truth, computer, shape, recovering]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Hangyu and Fu, Yanwei and Xue, Xiangyang and Jiang, Yu-Gang},
  title = {Sketch-BERT: Learning Sketch Bidirectional Encoder Representation From Transformers by Self-Supervised Learning of Sketch Gestalt},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Unified Object Motion and Affinity Model for Online Multi-Object Tracking
Junbo Yin, Wenguan Wang, Qinghao Meng, Ruigang Yang, Jianbing Shen


Current popular online multi-object tracking (MOT) solutions apply single object trackers (SOTs) to capture object motion, while often requiring an extra affinity network to associate objects, especially the occluded ones. This brings extra computational overhead due to repetitive feature extraction for SOT and affinity computation. Meanwhile, the model size of the sophisticated affinity network is usually non-trivial. In this paper, we propose a novel MOT framework that unifies object motion and affinity modeling into a single network, named UMA, in order to learn a compact feature that is discriminative for both object motion and affinity measures. In particular, UMA integrates single object tracking and metric learning into a unified triplet network by means of multi-task learning. This design brings the advantages of improved computation efficiency, low memory requirements and a simplified training procedure. In addition, we equip our model with a task-specific attention module, which is used to boost task-aware feature learning. The proposed UMA can be easily trained end-to-end, and is elegant - requiring only one training stage. Experimental results show that it achieves promising performance on several MOT Challenge benchmarks.
[multiple, attention, prediction, previous, video, work, temporal] [tracking, object, sot, mot, feature, uma, siamese, detection, tracklet, tsa, occluded, instance, positive, association, lsot, module, table, wenguan, jianbing, apply, extra, global, backbone, liden, tracked, anchor, propose] [model, identity, robust, trained] [motion, based, lightweight, convolutional, figure, proposed, quantitative, method] [target, loss, learn, exemplar, discriminative, address, extracted, learnt, appearance] [online, learning, network, triplet, training, data, metric, performance, deep, candidate, measure, better, set, negative, ranking, applied, sample, neural, compared, design, computation] [single]
@InProceedings{Yin_2020_CVPR,
  author = {Yin, Junbo and Wang, Wenguan and Meng, Qinghao and Yang, Ruigang and Shen, Jianbing},
  title = {A Unified Object Motion and Affinity Model for Online Multi-Object Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Sub-Frame Appearance and 6D Pose Estimation of Fast Moving Objects
Denys Rozumnyi, Jan Kotera, Filip Sroubek, Jiri Matas


We propose a novel method that tracks fast-moving objects, mainly non-uniform spherical ones, in full 6 degrees of freedom, simultaneously estimating their 3D motion trajectory, 3D pose and object appearance changes with a time step that is a fraction of the video frame exposure time. The sub-frame object localization and appearance estimation allow realistic temporal super-resolution and precise shape estimation. The method, called TbD-3D (Tracking by Deblatting in 3D), relies on a novel reconstruction algorithm which solves a piece-wise deblurring and matting problem. The 3D rotation is estimated by minimizing the reprojection error. As a second contribution, we present a new challenging dataset with fast moving objects that change their appearance and distance to the camera. High-speed camera recordings with zero lag between frame exposures were used to generate videos with different frame rates annotated with ground-truth trajectory and pose.
[trajectory, moving, frame, video, dataset, sequence, time, temporal, visual, recognition, previous, work] [object, tracking, location, fps, mask, annotated, correlation, final] [input, ball, model, curve, change, constant] [method, fast, proposed, figure, motion, ieee, output, blurred, pattern, deblurring, blur, formation] [appearance, image, corresponding, jan] [angular, average, set, learning, compared, optimization, problem, accuracy, number] [rotation, estimated, estimation, camera, axis, error, ground, shape, truth, velocity, estimate, tbd, angle, computer, distance, conference, deblatting, vision, pose, reconstruction, spherical, czech, single, assume, defined, term, filip, footage]
@InProceedings{Rozumnyi_2020_CVPR,
  author = {Rozumnyi, Denys and Kotera, Jan and Sroubek, Filip and Matas, Jiri},
  title = {Sub-Frame Appearance and 6D Pose Estimation of Fast Moving Objects},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
How to Train Your Deep Multi-Object Tracker
Yihong Xu, Aljosa Osep, Yutong Ban, Radu Horaud, Laura Leal-Taixe, Xavier Alameda-Pineda


The recent trend in vision-based multi-object tracking (MOT) is heading towards leveraging the representational power of deep learning to jointly learn to detect and track objects. However, existing methods train only certain sub-modules using loss functions that often do not correlate with established tracking evaluation measures such as Multi-Object Tracking Accuracy (MOTA) and Precision (MOTP). As these measures are not differentiable, the choice of appropriate loss functions for end-to-end training of multi-object tracking methods is still an open research problem. In this paper, we bridge this gap by proposing a differentiable proxy of MOTA and MOTP, which we combine in a loss function suitable for end-to-end training of deep multi-object trackers. As a key ingredient, we propose a Deep Hungarian Net (DHN) module that approximates the Hungarian matching algorithm. DHN allows estimating the correspondence between object tracks and ground truth objects to compute differentiable proxies of MOTA and MOTP, which are in turn used to optimize deep trackers directly. We experimentally demonstrate that the proposed differentiable framework improves the performance of existing multi-object trackers, and we establish a new state of the art on the MOTChallenge benchmark. Our code is publicly available from https://github.com/yihongXU/deepMOT.
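To illustrate the idea of a differentiable tracking-metric proxy, the toy sketch below turns a soft assignment matrix (e.g., produced by a deep Hungarian-style module) and a normalized distance matrix into soft counts of false negatives, false positives and poor matches, yielding a MOTA-like score that gradients can flow through. The weighting, threshold and normalization are assumptions and do not reproduce the paper's dMOTA/dMOTP definitions.

import torch

def soft_mota_proxy(A, D, dist_thresh=0.5):
    """A, D: (num_tracks, num_gt); A is a soft assignment, D a normalized distance matrix."""
    num_gt = D.shape[1]
    fn = (1.0 - A.sum(dim=0)).clamp(min=0).sum()        # GT objects left (softly) unmatched
    fp = (1.0 - A.sum(dim=1)).clamp(min=0).sum()        # tracks left (softly) unmatched
    # matched pairs that are too far apart also count against the score
    far = (A * torch.sigmoid((D - dist_thresh) * 10.0)).sum()
    return 1.0 - (fn + fp + far) / max(num_gt, 1)

A = torch.softmax(torch.randn(4, 5), dim=1)
D = torch.rand(4, 5)
score = soft_mota_proxy(A, D)      # differentiable w.r.t. both A and D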
[evaluation, multiple, hidden, video, frame] [tracking, object, mota, tracktor, mot, deepmot, bounding, track, assignment, association, motp, propose, hungarian, framework, module, tracker, motchallenge, predicted, detection, box, btp, improves, benchmark] [trained] [proposed, based, existing, method, ieee, output, flow] [loss, train, perform, appearance, representation, reid, learn] [training, deep, dhn, number, performance, matrix, network, learning, standard, impact, optimal, vanilla, data, neural, soft, approximation, base, problem, proxy, optimize, optimization, online, set, compared] [differentiable, distance, smooth, compute, matching, directly, demonstrate, establish]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Yihong and Osep, Aljosa and Ban, Yutong and Horaud, Radu and Leal-Taixe, Laura and Alameda-Pineda, Xavier},
  title = {How to Train Your Deep Multi-Object Tracker},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TPNet: Trajectory Proposal Network for Motion Prediction
Liangji Fang, Qinhong Jiang, Jianping Shi, Bolei Zhou


Making accurate motion predictions of surrounding traffic agents such as pedestrians, vehicles, and cyclists is crucial for autonomous driving. Recent data-driven motion prediction methods have attempted to learn to directly regress the exact future position or its distribution from massive amounts of trajectory data. However, it remains difficult for these methods to provide multimodal predictions as well as integrate physical constraints such as traffic rules and movable areas. In this work we propose a novel two-stage motion prediction framework, Trajectory Proposal Network (TPNet). TPNet first generates a candidate set of future trajectories as hypothesis proposals, then makes the final predictions by classifying and refining the proposals which meet the physical constraints. By steering the proposal generation process, safe and multimodal predictions are realized. Thus this framework effectively mitigates the complexity of the motion prediction problem while ensuring multimodal output. Experiments on four large-scale trajectory prediction datasets, i.e. the ETH, UCY, Apollo and Argoverse datasets, show that TPNet achieves state-of-the-art results both quantitatively and qualitatively.
[prediction, trajectory, future, multimodal, tpnet, argoverse, road, traffic, social, vehicle, lstm, dataset, fde, movable, agent, eth, ade, ucy, lane, transportation, extract, time, tobs, tpre, multiple] [proposal, predicted, framework, propose, semantic, refinement, effectiveness, surrounding, regressed, map, module, apolloscape, regression, pedestrian, stage, final] [physical, model, curve, safety, safe, guarantee] [motion, ieee, based, proposed, method, pattern, reference, prior] [generation, generated, generate, diversity, loss, generates] [set, classification, deep, network, knowledge, base, arxiv, preprint, neural, learning, best] [point, conference, intelligent, computer, position, vision, international, second, interval, grid]
@InProceedings{Fang_2020_CVPR,
  author = {Fang, Liangji and Jiang, Qinhong and Shi, Jianping and Zhou, Bolei},
  title = {TPNet: Trajectory Proposal Network for Motion Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Large Scale Video Representation Learning via Relational Graph Clustering
Hyodong Lee, Joonseok Lee, Joe Yue-Hei Ng, Paul Natsev


Representation learning is widely applied for various tasks on multimedia data, e.g., retrieval and search. One approach for learning useful representations is to utilize the relationships or similarities between examples. In this work, we explore two promising scalable representation learning approaches in the video domain. With hierarchical graph clusters built upon video-to-video similarities, we propose: 1) a smart negative sampling strategy that significantly boosts training efficiency with a triplet loss, and 2) a pseudo-classification approach using the clusters as pseudo-labels. The embeddings trained with the proposed methods are competitive on multiple video understanding tasks, including related video retrieval and video annotation. Both of these proposed methods are highly scalable, as verified by experiments on large-scale datasets.
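A small sketch of cluster-aware ("smart") negative sampling for a triplet loss: negatives are drawn from the cluster whose centroid is most similar to the anchor's cluster, so they are harder than uniformly random negatives. The centroid-similarity notion of a "nearby" cluster and the margin value are assumptions.

import torch
import torch.nn.functional as F

def smart_negatives(anchor_cluster, cluster_ids, centroids, num_neg=1):
    """Pick negatives from the cluster most similar to the anchor's own cluster."""
    sims = F.normalize(centroids, dim=1) @ F.normalize(centroids[anchor_cluster], dim=0)
    sims[anchor_cluster] = -1.0                       # exclude the anchor's own cluster
    hard_cluster = sims.argmax()
    candidates = (cluster_ids == hard_cluster).nonzero(as_tuple=True)[0]
    return candidates[torch.randint(0, len(candidates), (num_neg,))]

embeddings = F.normalize(torch.randn(500, 64), dim=1)
cluster_ids = torch.randint(0, 10, (500,))            # precomputed graph-cluster labels (assumed)
centroids = torch.stack([embeddings[cluster_ids == c].mean(0) for c in range(10)])

a, p = 0, 1                                           # anchor / positive (assumed related videos)
n = int(smart_negatives(int(cluster_ids[a]), cluster_ids, centroids)[0])
loss = F.triplet_margin_loss(embeddings[a:a+1], embeddings[p:p+1], embeddings[n:n+1], margin=0.2)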
[video, graph, relational, embedding, retrieval, recognition, difficulty, embeddings, hierarchical, audio] [level, map, table, feature, assigned, affinity, anchor, annotation, represents] [model, trained, face, chosen, query] [ieee, proposed, figure, pattern, method] [cluster, loss, representation, learn, train] [learning, training, metric, negative, clustering, sampled, gcml, cdml, deep, smart, classification, triplet, similarity, performance, sampling, margin, number, network, large, sample, batch, size, neural, online, informative, mining, data, set, ratio, better, softmax, processing, label, randomly, random, efficient, note, average, top, evaluate] [conference, computer, vision, international, european, approach, distance]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Hyodong and Lee, Joonseok and Ng, Joe Yue-Hei and Natsev, Paul},
  title = {Large Scale Video Representation Learning via Relational Graph Clustering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Universal Representation Learning for Deep Face Recognition
Yichun Shi, Xiang Yu, Kihyuk Sohn, Manmohan Chandraker, Anil K. Jain


Recognizing wild faces is extremely hard as they appear with all kinds of variations. Traditional methods either train with specifically annotated variation data from target domains, or introduce unlabeled target variation data to adapt from the training data. Instead, we propose a universal representation learning framework that can deal with larger variations unseen in the given training data without leveraging target domain knowledge. We first synthesize training data alongside some semantically meaningful variations, such as low resolution, occlusion and head pose. However, training directly on the augmented data does not converge well, as the newly introduced samples are mostly hard examples. We propose to split the feature embedding into multiple sub-embeddings, and associate different confidence values with each sub-embedding to smooth the training procedure. The sub-embeddings are further decorrelated by regularizing variation classification loss and variation adversarial loss on different partitions of them. Experiments show that our method achieves top performance on general face recognition datasets such as LFW and MegaFace, while performing significantly better on extreme benchmarks such as TinyFace and IJB-S.
[recognition, embedding, dataset, multiple, three, evaluation, video] [feature, confidence, occlusion, challenging, table, hard, propose, achieves, xiang, benchmark] [face, variation, model, datasets, identification, universal, testing, lfw, decorrelation, original, type, identity, quality, adversarial, trained, adding, verification, kihyuk, tinyface, unconstrained] [method, figure, proposed, blur] [loss, domain, representation, target, prototype, introduce, unseen, learn] [training, performance, data, learning, augmentation, deep, better, large, indicates, log, augmented, classification, general, probabilistic, baseline, sample, margin, accuracy, achieve, metric, set, learned, size, test, good, neural, network, equation] [pose, manmohan, uncertainty, limited]
@InProceedings{Shi_2020_CVPR,
  author = {Shi, Yichun and Yu, Xiang and Sohn, Kihyuk and Chandraker, Manmohan and Jain, Anil K.},
  title = {Towards Universal Representation Learning for Deep Face Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Robust Partial Matching for Person Search in the Wild
Yingji Zhong, Xiaoyu Wang, Shiliang Zhang


Various factors such as occlusions and backgrounds can lead to misaligned detected bounding boxes, e.g., ones covering only portions of the human body. This issue is common but overlooked by previous person search works. To alleviate this issue, this paper proposes an Align-to-Part Network (APNet) for person detection and re-identification (reID). APNet refines detected bounding boxes to cover the estimated holistic body regions, from which discriminative part features can be extracted and aligned. Aligned part features naturally formulate reID as a partial feature matching procedure, where valid part features are selected for similarity computation, while part features on occluded or noisy regions are discarded. This design enhances the robustness of person search to real-world challenges with marginal computation overhead. This paper also contributes a Large-Scale dataset for Person Search in the wild (LSPS), which is by far the largest and most challenging dataset for person search. Experiments show that APNet brings considerable performance improvement on LSPS. Meanwhile, it achieves competitive performance on existing person search benchmarks such as CUHK-SYSU and PRW.
[dataset, extract, outperforms] [bounding, feature, global, apnet, box, bba, stripe, lsps, region, detected, map, rsfe, detector, table, prw, detection, validity, oim, horizontal, refined, holistic, achieves, gallery, denotes, extractor, vpm, occluded, offset, challenging, degrades, cnn, propose, stage, decayed, chi] [query, trained, sensitive, robust, refers] [valid, noisy, figure, based, existing, method, proposed, comparison] [person, reid, loss, extracted, misalignment, aligned, shiliang, misaligned, issue, discriminative, image, wen] [search, learning, performance, training, network, number, set, compared, paper, rate, mutual, test, denote, larger, similarity, design] [partial, body, matching, visible, vertical, cover, well]
@InProceedings{Zhong_2020_CVPR,
  author = {Zhong, Yingji and Wang, Xiaoyu and Zhang, Shiliang},
  title = {Robust Partial Matching for Person Search in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Correlation-Guided Attention for Corner Detection Based Visual Tracking
Fei Du, Peng Liu, Wei Zhao, Xianglong Tang


Accurate bounding box estimation has recently attracted much attention in the tracking community because traditional multi-scale search strategies cannot estimate tight bounding boxes in many challenging scenarios involving changes to the target. A tracker capable of detecting target corners can flexibly adapt to such changes, but existing corner detection based tracking methods have not achieved adequate success. We analyze the reasons for their failure and propose a state-of-the-art tracker that performs correlation-guided attentional corner detection in two stages. First, a region of interest (RoI) is obtained by employing an efficient Siamese network to distinguish the target from the background. Second, a pixel-wise correlation-guided spatial attention module and a channel-wise correlation-guided channel attention module exploit the relationship between the target template and the RoI to highlight corner regions and enhance features of the RoI for corner detection. The correlation-guided attention modules improve the accuracy of corner detection, thus enabling accurate bounding box estimation. When trained on large-scale datasets using a novel RoI augmentation strategy, the proposed tracker, running at a high speed of 70 FPS, performs comparably with state-of-the-art trackers on five challenging benchmarks.
[attention, visual, attentional, dataset, relationship, construct, state] [corner, tracking, roi, template, siamese, correlation, detection, module, tracker, bounding, feature, object, box, achieves, cgacd, atom, table, propose, regression, backbone, overlap, employed, eco, martin, challenging, location, detect, background] [auc, improve, trained, success] [spatial, channel, proposed, method, comparison, integration, based, high, figure, result, scale] [target, image, learn, discriminative] [network, performance, learning, similarity, test, precision, augmentation, function, normalized, online, best, accuracy, strategy, expected, achieve, architecture, set, efficient, comparable] [estimate, accurate, compare, michael, estimated, estimation]
@InProceedings{Du_2020_CVPR,
  author = {Du, Fei and Liu, Peng and Zhao, Wei and Tang, Xianglong},
  title = {Correlation-Guided Attention for Corner Detection Based Visual Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Multi-Object Tracking and Segmentation From Automatic Annotations
Lorenzo Porzi, Markus Hofinger, Idoia Ruiz, Joan Serrat, Samuel Rota Bulo, Peter Kontschieder


In this work we contribute a novel pipeline to automatically generate training data, and to improve over state-of-the-art multi-object tracking and segmentation (MOTS) methods. Our proposed track mining algorithm turns raw street-level videos into high-fidelity MOTS training data, is scalable and overcomes the need for expensive and time-consuming manual annotation approaches. We leverage state-of-the-art instance segmentation results in combination with optical flow predictions, also trained on automatically harvested training data. Our second major contribution is MOTSNet - a deep learning, tracking-by-detection architecture for MOTS - deploying a novel mask-pooling layer for improved object association over time. Training MOTSNet with our automatically extracted data leads to significantly improved sMOTSA scores on the novel KITTI MOTS dataset (+1.9%/+7.5% on cars/pedestrians), and MOTSNet improves by +4.1% over previously best-performing methods on the MOTSChallenge dataset. Our most impressive finding is that we can improve over previous best-performing works, even in the complete absence of manually annotated MOTS training data.
[video, dataset, embedding, frame, automatically, provide, described, work] [segmentation, tracking, object, instance, motsnet, bounding, head, synth, mask, mot, mapillary, semantic, box, annotation, smotsa, segment, region, ave, track, annotated, car, payoff, table, ped, bastian, assignment] [trained, model, datasets, improve, quality] [ieee, flow, pattern, optical, based, proposed, raw] [generation, generated, extracted] [training, data, set, function, network, deep, task, learning, layer, best, validation, performance, manually, vector, process, class, manual, learned] [kitti, conference, computer, vision, international, approach, novel, ground, pipeline, second, directly, scene]
@InProceedings{Porzi_2020_CVPR,
  author = {Porzi, Lorenzo and Hofinger, Markus and Ruiz, Idoia and Serrat, Joan and Bulo, Samuel Rota and Kontschieder, Peter},
  title = {Learning Multi-Object Tracking and Segmentation From Automatic Annotations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PandaNet: Anchor-Based Single-Shot Multi-Person 3D Pose Estimation
Abdallah Benzine, Florian Chabot, Bertrand Luvison, Quoc Cuong Pham, Catherine Achard


Recently, several deep learning models have been proposed for 3D human pose estimation. Nevertheless, most of these approaches only focus on the single-person case or estimate the 3D pose of a few people at high resolution. Furthermore, many applications such as autonomous driving or crowd analysis require pose estimation of a large number of people, possibly at low resolution. In this work, we present PandaNet (Pose estimAtioN and Detection Anchor-based Network), a new single-shot, anchor-based, multi-person 3D pose estimation approach. The proposed model performs bounding box detection and, for each detected person, 2D and 3D pose regression in a single forward pass. It does not need any post-processing to regroup joints since the network predicts a full 3D pose for each bounding box and allows the pose estimation of a possibly large number of people at low resolution. To manage people overlapping, we introduce a Pose-Aware Anchor Selection strategy. Moreover, as an imbalance exists between different people sizes in the image, and joint coordinates have different uncertainties depending on these sizes, we propose a method to automatically optimize the weights associated with different people scales and joints for efficient training. PandaNet surpasses previous single-shot methods on several challenging datasets: a multi-person urban virtual but very realistic dataset (JTA Dataset), and two real-world 3D multi-person datasets (CMU Panoptic and MuPoTS-3D).
[people, dataset, automatic, associated, previous, prediction, predict, three] [anchor, bounding, detection, object, table, propose, box, predicted, crowded, iou, panoptic, matched, regression, overlap, feature] [model, input, subject, trained, heatmaps, ambiguous] [low, method, figure, based, resolution, proposed, existing, convolutional] [loss, image, introduce, perform, corresponding] [number, large, selection, network, learning, weighting, set, process, strategy, readout, learned, arxiv, preprint, problem, best] [pose, human, estimation, pandanet, approach, joint, single, jta, camera, full, estimate, body, multiperson, defined, ground, distance, truth, predicts, uncertainty, second]
@InProceedings{Benzine_2020_CVPR,
  author = {Benzine, Abdallah and Chabot, Florian and Luvison, Bertrand and Pham, Quoc Cuong and Achard, Catherine},
  title = {PandaNet: Anchor-Based Single-Shot Multi-Person 3D Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Rotation Consistent Margin Loss for Efficient Low-Bit Face Recognition
Yudong Wu, Yichao Wu, Ruihao Gong, Yuanhao Lv, Ken Chen, Ding Liang, Xiaolin Hu, Xianglong Liu, Junjie Yan


In this paper, we consider the low-bit quantization problem of face recognition (FR) under the open-set protocol. Different from the well-explored low-bit quantization of closed-set image classification, the open-set task is more sensitive to quantization errors (QEs). We redefine the QEs in angular space and disentangle them into a class error and an individual error. These two parts correspond to inter-class separability and intra-class compactness, respectively. Instead of eliminating the entire QEs, we propose the rotation consistent margin (RCM) loss to minimize the individual error, which is more essential to feature discriminative power. Extensive experiments on popular benchmark datasets such as MegaFace Challenge, YouTube Faces (YTF), Labeled Faces in the Wild (LFW) and IJB-C show the superiority of the proposed loss in low-bit FR quantization tasks.
[individual, dataset, recognition] [feature, center, propose, supervision, hard, positive, represents] [face, model, arcface, megaface, original, trained, decision, refers, improving, lfw, change, cosface] [ieee, pattern, proposed, figure, method, low] [loss, discriminative, train] [class, quantization, margin, deep, training, softmax, neural, performance, accuracy, quantized, qes, learning, angular, function, rcm, arxiv, preprint, network, compactness, efficient, classification, entire, additive, uniform, minimize, tpr, increase, set, quantizers, sample, number, processing] [conference, computer, error, rotation, vision, angle, consistent, approach, international, directly]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Yudong and Wu, Yichao and Gong, Ruihao and Lv, Yuanhao and Chen, Ken and Liang, Ding and Hu, Xiaolin and Liu, Xianglong and Yan, Junjie},
  title = {Rotation Consistent Margin Loss for Efficient Low-Bit Face Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Joint Spatial-Temporal Optimization for Stereo 3D Object Tracking
Peiliang Li, Jieqi Shi, Shaojie Shen


Directly learning the motion of multiple 3D objects from sequential images is difficult, while geometric bundle adjustment lacks the ability to localize the invisible object centroid. To benefit from the powerful object understanding capability of deep neural networks while tackling precise geometric modeling for consistent trajectory estimation, we propose a joint spatial-temporal optimization-based stereo 3D object tracking method. From the network, we detect corresponding 2D bounding boxes on adjacent images and regress an initial 3D bounding box. Dense object cues (local depth and local coordinates) associated with the object centroid are then predicted using a region-based network. Considering both the instant localization accuracy and motion consistency, our optimization models the relations between the object centroid and observed cues in a joint spatial-temporal error function. All historic cues are summarized to contribute to the current estimation via a per-frame marginalization strategy, without repeated computation. Quantitative evaluation on the KITTI tracking dataset shows our approach outperforms previous image-based 3D tracking methods by significant margins. We also report extensive results on multiple categories and larger datasets (KITTI raw and Argoverse Tracking) for future benchmarking.
[temporal, previous, current, sequential, trajectory, multiple, dataset, argoverse, frame, concatenated, observation, predict, evaluation, history, associated] [object, tracking, detection, box, feature, bounding, centroid, mota, autonomous, iou, motp, association, roi, foreground, lidar, raquel, proposal, peiliang, localization, region, positive] [model] [ieee, pattern, motion, based, spatial, pixel, raw, method, adjacent, figure] [image, paired] [network, optimization, note, learning, deep, linear, arxiv, preprint, benefit, report, comparing, performance] [stereo, local, conference, kitti, computer, error, vision, dense, joint, depth, point, estimation, monocular, distance, directly, accurate, reprojection, cloud, geometric, angle, pose, shaojie, consistent, continuous, photometric, defined, respecting]
@InProceedings{Li_2020_CVPR,
  author = {Li, Peiliang and Shi, Jieqi and Shen, Shaojie},
  title = {Joint Spatial-Temporal Optimization for Stereo 3D Object Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unity Style Transfer for Person Re-Identification
Chong Liu, Xiaojun Chang, Yi-Dong Shen


Style variation has been a major challenge for person re-identification, which aims to match the same pedestrians across different cameras. Existing works have attempted to address this problem with camera-invariant descriptor subspace learning. However, image artifacts increase when the difference between the images taken by different cameras is larger. To solve this problem, we propose a UnityStyle adaptation method, which can smooth the style disparities within the same camera and across different cameras. Specifically, we first create UnityGAN to learn the style changes between cameras, producing shape-stable, style-unified images for each camera, called UnityStyle images. Meanwhile, we use UnityStyle images to eliminate style differences between different images, which makes a better match between query and gallery. Then, we apply the proposed method to Re-ID models, expecting to obtain more style-robust deep features for querying. We conduct extensive experiments on widely used benchmark datasets to evaluate the performance of the proposed framework, the results of which confirm the superiority of the proposed model.
[attention, multiple, evaluation, making] [gallery, feature, cam, map, propose, module, final, liang, shallow, add, unified, pcb] [query, model, improve, adversarial, trained, input, technology, difference] [method, proposed, figure, enhanced, block, output, convolutional] [style, unitystyle, unitygan, image, person, transfer, generated, generate, loss, real, con, train, cyclegan, ide, ensure, xiaojun, generative, learn, corresponding, ibn, camstyle, generates, unsupervised, sln, lcross, gans] [training, data, deep, test, learning, sample, number, accuracy, set, better, network, layer, performance, achieve, classification, probability, problem, introduction, stable, compared] [camera, smooth, match, solve, compare]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Chong and Chang, Xiaojun and Shen, Yi-Dong},
  title = {Unity Style Transfer for Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Suppressing Uncertainties for Large-Scale Facial Expression Recognition
Kai Wang, Xiaojiang Peng, Jianfei Yang, Shijian Lu, Yu Qiao


Annotating a qualitative large-scale facial expression dataset is extremely difficult due to the uncertainties caused by ambiguous facial expressions, low-quality facial images, and the subjectiveness of annotators. These uncertainties suspend the progress of large-scale Facial Expression Recognition (FER) in the data-driven deep learning era. To address this problem, this paper proposes to suppress the uncertainties with a simple yet efficient Self-Cure Network (SCN). Specifically, SCN suppresses the uncertainty from two aspects: 1) a self-attention mechanism over the FER dataset that weights each training sample with a ranking regularization, and 2) a careful relabeling mechanism that modifies the labels of the samples in the lowest-ranked group. Experiments on synthetic FER datasets and our collected WebEmotion dataset validate the effectiveness of our method. Results on public benchmarks demonstrate that our SCN outperforms current state-of-the-art methods with 88.14% on RAF-DB, 60.23% on AffectNet, and 89.35% on FERPlus.
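The two mechanisms can be sketched as follows: per-sample importance weights modulate the loss and are regularized so the high-weight group's mean exceeds the low-weight group's mean by a margin, and low-weight samples are relabeled when the model is much more confident in another class. The ratios, margin and confidence threshold below are assumptions, not the paper's exact values.

import torch
import torch.nn.functional as F

def rank_regularization(weights, high_ratio=0.7, margin=0.15):
    w_sorted, _ = torch.sort(weights, descending=True)
    k = int(high_ratio * len(weights))
    return F.relu(margin - (w_sorted[:k].mean() - w_sorted[k:].mean()))

def relabel_low_rank(logits, labels, weights, low_ratio=0.3, delta=0.2):
    probs = F.softmax(logits, dim=1)
    new_labels = labels.clone()
    cutoff = torch.quantile(weights, low_ratio)
    for i in torch.nonzero(weights <= cutoff).flatten():
        top_p, top_c = probs[i].max(0)
        if top_p - probs[i, labels[i]] > delta:
            new_labels[i] = top_c          # switch to the more confident class
    return new_labels

logits = torch.randn(16, 7)                # 7 expression classes
labels = torch.randint(0, 7, (16,))
weights = torch.sigmoid(torch.randn(16))   # per-sample self-attention weights
loss = (weights * F.cross_entropy(logits, labels, reduction="none")).mean()
loss = loss + rank_regularization(weights)
labels = relabel_low_rank(logits, labels, weights)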
[recognition, dataset, attention, three, evaluation, emotion] [module, table, feature, cnn, propose, annotated, jianfei, suppress] [scn, facial, expression, face, relabeling, fer, uncertain, datasets, webemotion, original, affectnet, noise, metacleaner, collected, ferplus, model, ambiguous, inconsistent, relabel, curriculumnet, xiaojiang, public, robust, clean, incorrect] [noisy, low, figure, high, ieee, kai, comparison, based] [loss, synthetic, image, consists, learns, learn] [training, deep, learning, regularization, rank, network, sample, ratio, label, weight, margin, data, baseline, ranking, weighting, group, set, maximum, probability, performance, neural, learned, arxiv] [conference, computer, vision, international, human]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Kai and Peng, Xiaojiang and Yang, Jianfei and Lu, Shijian and Qiao, Yu},
  title = {Suppressing Uncertainties for Large-Scale Facial Expression Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multiview-Consistent Semi-Supervised Learning for 3D Human Pose Estimation
Rahul Mitra, Nitesh B. Gundavarapu, Abhishek Sharma, Arjun Jain


The best-performing methods for 3D human pose estimation from monocular images require large amounts of in-the-wild 2D and controlled 3D pose annotated datasets, which are costly and require sophisticated systems to acquire. To reduce this annotation dependency, we propose a Multiview-Consistent Semi-Supervised Learning (MCSS) framework that utilizes the similarity in pose information from unannotated, uncalibrated but synchronized multi-view videos of human motions as an additional weak supervision signal to guide 3D human pose regression. Our framework applies hard-negative mining based on temporal relations in multi-view videos to arrive at a multi-view consistent pose embedding, and when jointly trained with limited 3D pose annotations, our approach improves the baseline by 25% and the state of the art by 8.7%, whilst using substantially smaller networks. Lastly, but importantly, we demonstrate the advantages of the learned embedding and establish view-invariant pose retrieval benchmarks on two popular, publicly available multi-view human pose datasets, Human 3.6M and MPI-INF-3DHP, to facilitate future research.
[embedding, retrieval, dataset, temporal, multiple] [supervision, framework, feature, regression, annotated, positive, background, global, pascal, weak] [trained, model, query, datasets, subject, case, synchronized, aforementioned] [method, proposed, based, captured, figure, chen, motion] [representation, supervised, learn, image, loss, mapping, corresponding, shared] [learning, training, network, learned, baseline, performance, test, data, deep, semisupervised, space, large, batch, set, requires] [pose, human, estimation, canonical, limited, rhodin, camera, approach, view, require, mpjpe, monocular, novel, capture, viewpoint, distance, lpose, structure, lcnstr, joint, additional]
@InProceedings{Mitra_2020_CVPR,
  author = {Mitra, Rahul and Gundavarapu, Nitesh B. and Sharma, Abhishek and Jain, Arjun},
  title = {Multiview-Consistent Semi-Supervised Learning for 3D Human Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Regularizing Neural Networks via Minimizing Hyperspherical Energy
Rongmei Lin, Weiyang Liu, Zhen Liu, Chen Feng, Zhiding Yu, James M. Rehg, Li Xiong, Le Song


Inspired by the Thomson problem in physics where the distribution of multiple propelling electrons on a unit sphere can be modeled via minimizing some potential energy, hyperspherical energy minimization has demonstrated its potential in regularizing neural networks and improving their generalization power. In this paper, we first study the important role that hyperspherical energy plays in neural network training by analyzing its training dynamics. Then we show that naively minimizing hyperspherical energy suffers from some difficulties due to highly non-linear and non-convex optimization as the space dimensionality becomes higher, therefore limiting the potential to further improve the generalization. To address these problems, we propose the compressive minimum hyperspherical energy (CoMHE) as a more effective regularization for neural networks. Specifically, CoMHE utilizes projection mappings to reduce the dimensionality of neurons and minimizes their hyperspherical energy. According to different designs for the projection mapping, we propose several distinct yet well-performing variants and provide some theoretical guarantees to justify their effectiveness. Our experiments show that CoMHE consistently outperforms existing regularization methods, and can be easily applied to different neural networks.
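A compact sketch of the compressive idea: project the neuron weight vectors to a lower dimension with a random matrix, normalize them onto the hypersphere, and penalize the sum of inverse pairwise distances (a Thomson-style energy with s = 1). Using a single fixed random projection and this particular energy power are simplifying assumptions; the paper studies several projection variants.

import torch
import torch.nn.functional as F

def hyperspherical_energy(W, eps=1e-4):
    """W: (num_neurons, dim); returns the sum of 1/||w_i - w_j|| over i < j."""
    Wn = F.normalize(W, dim=1)
    dists = torch.cdist(Wn, Wn)
    iu = torch.triu_indices(len(W), len(W), offset=1)
    return (1.0 / (dists[iu[0], iu[1]] + eps)).sum()

def compressive_mhe(W, proj_dim=8, generator=None):
    P = torch.randn(W.shape[1], proj_dim, generator=generator) / proj_dim ** 0.5
    return hyperspherical_energy(W @ P)

# regularize a linear layer's neurons alongside the usual task loss
layer = torch.nn.Linear(128, 64)
reg = 1e-3 * compressive_mhe(layer.weight)     # add this term to the training loss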
[multiple, recognition, goal, construct] [table, propose, cnn, resnet, denotes] [original, randomness, unrolled, study, adversarial] [stationary, compressive] [diversity, minimizing, loss, preserving, perform] [comhe, hyperspherical, energy, random, mhe, neural, network, number, matrix, deep, optimization, learning, dimension, theorem, regularization, space, gradient, angular, performance, training, appendix, neuron, better, regularizing, baseline, randomly, distribution, reduce, minimizes, objective, group, set, weiyang, minimize, plain, arxiv, preprint, theoretical, hypersphere, note, evaluate, achieve, regularize, problem, dimensionality, weight, update, alternating, gain, zhiding, thomson, consistently, standard, lower, orthogonal, approximate, descent, linear, distributed, best] [projection, error, projected, angle, local, well]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Rongmei and Liu, Weiyang and Liu, Zhen and Feng, Chen and Yu, Zhiding and Rehg, James M. and Xiong, Li and Song, Le},
  title = {Regularizing Neural Networks via Minimizing Hyperspherical Energy},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Representations by Predicting Bags of Visual Words
Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Perez, Matthieu Cord


Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data. Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions that encode discrete visual concepts, here called visual words. To build such discrete representations, we quantize the feature maps of a first pre-trained self-supervised convnet over a k-means based vocabulary. Then, as a self-supervised task, we train another convnet to predict the histogram of visual words of an image (i.e., its Bag-of-Words representation) given as input a perturbed version of that image. The proposed task forces the convnet to learn perturbation-invariant and context-aware image features, useful for downstream image understanding tasks. We extensively evaluate our method and demonstrate very strong empirical results, e.g., our pre-trained self-supervised representations transfer better on the detection task and similarly on classification over classes "unseen" during pre-training, when compared to the supervised case. This also shows that the process of image discretization into visual words can provide the basis for very powerful self-supervised approaches in the image domain, thus allowing further connections to be made to related methods from the NLP domain that have been extremely successful so far.
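The target construction can be sketched as follows: quantize every spatial feature vector of the pre-trained convnet against a k-means vocabulary and form a normalized visual-word histogram, which a second convnet is trained to predict from a perturbed view. The vocabulary size, hard assignment and simple cross-entropy-style objective are assumptions.

import torch
import torch.nn.functional as F

def bow_target(feature_map, vocabulary):
    """feature_map: (C, H, W); vocabulary: (K, C) k-means centroids."""
    C, H, W = feature_map.shape
    feats = feature_map.reshape(C, H * W).t()               # (HW, C)
    assign = torch.cdist(feats, vocabulary).argmin(dim=1)   # nearest visual word per location
    hist = torch.bincount(assign, minlength=len(vocabulary)).float()
    return hist / hist.sum()                                # normalized BoW vector

K, C = 2048, 256
vocab = torch.randn(K, C)                                   # from offline k-means (assumed)
fmap = torch.randn(C, 14, 14)                                # pre-trained convnet features
target = bow_target(fmap, vocab)

pred_logits = torch.randn(K)                                 # prediction head on a perturbed view
loss = -(target * F.log_softmax(pred_logits, dim=0)).sum()   # soft cross-entropy against the BoW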
[visual, prediction, word, work, provide, predicting, nlp, predict, vocabulary, multiple] [feature, table, detection, object, propose, van] [model, strong, input, perturbed, original] [method, ieee, pattern, based, residual, spatial, spatially, proposed, convolutional] [image, representation, train, unsupervised, learn, supervised, missing] [learning, bow, bownet, training, convnet, linear, task, classification, imagenet, discrete, rotnet, better, neural, arxiv, preprint, processing, applied, vector, learned, evaluate, set, layer, clustering, miniimagenet, classifier, accuracy, deep, performance, random, weight, batch, base, compared] [computer, vision, dense, local, european, second]
@InProceedings{Gidaris_2020_CVPR,
  author = {Gidaris, Spyros and Bursuc, Andrei and Komodakis, Nikos and Perez, Patrick and Cord, Matthieu},
  title = {Learning Representations by Predicting Bags of Visual Words},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
AnimalWeb: A Large-Scale Hierarchical Dataset of Annotated Animal Faces
Muhammad Haris Khan, John McDonagh, Salman Khan, Muhammad Shahabuddin, Aditya Arora, Fahad Shahbaz Khan, Ling Shao, Georgios Tzimiropoulos


Several studies show that animal needs are often expressed through their faces. Though remarkable progress has been made towards the automatic understanding of human faces, this has not been the case with animal faces. There exists significant room for algorithmic advances that could realize automatic systems for interpreting animal faces. Besides scientific value, the resulting technology will foster better and cheaper animal care. We believe the underlying research progress is mainly obstructed by the lack of an adequately annotated dataset of animal faces, covering a wide spectrum of animal species. To this end, we introduce a large-scale, hierarchical annotated dataset of animal faces, featuring 22.4K faces from 350 diverse species and 21 animal orders across biological taxonomy. These faces are captured under `in-the-wild' conditions and are consistently annotated with 9 landmarks on key facial features. The dataset is structured and scalable by design; its development underwent four systematic stages involving a rigorous overall effort of over 6K man-hours. We benchmark it for face alignment using the existing art under two new problem settings. Results showcase its challenging nature, unique attributes and present definite prospects for novel, adaptive, and generalized face-oriented CV algorithms. Further benchmarking the dataset across face detection and fine-grained recognition tasks demonstrates its multi-task applications and room for improvement. The dataset is available at: https://fdmaproject.wordpress.com/.
[dataset, hierarchical, evaluation, understanding, recognition, progress] [annotated, detection, challenging, annotation, key, benchmark, table] [face, animal, facial, animalweb, datasets, testing, landmark, workflow, cofw, development, trained, zooniverse, case, scientific, effort, developed, team, annotating, featuring, help, markup, robust] [ieee, pattern, figure, comparison, range, spectrum, existing, biological, captured, based] [alignment, image, diverse, unknown, project, appearance] [training, large, total, observe, performance, accuracy, count, popular, distribution, number, top, data, average, randomly, reported] [human, computer, conference, vision, collection, international, point, georgios, european, unique, limited]
@InProceedings{Khan_2020_CVPR,
  author = {Khan, Muhammad Haris and McDonagh, John and Khan, Salman and Shahabuddin, Muhammad and Arora, Aditya and Khan, Fahad Shahbaz and Shao, Ling and Tzimiropoulos, Georgios},
  title = {AnimalWeb: A Large-Scale Hierarchical Dataset of Annotated Animal Faces},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Transductive Approach for Video Object Segmentation
Yizhuo Zhang, Zhirong Wu, Houwen Peng, Stephen Lin


Semi-supervised video object segmentation aims to separate a target object from a video sequence, given the mask in the first frame. Most current prevailing methods utilize information from additional modules trained in other domains like optical flow and instance segmentation, and as a result they do not compete with other methods on common ground. To address this issue, we propose a simple yet strong transductive method, in which additional modules, datasets, and dedicated architectural designs are not needed. Our method takes a label propagation approach where pixel labels are passed forward based on feature similarity in an embedding space. Different from other propagation methods, ours diffuses temporal information in a holistic manner that takes account of long-term object appearance. In addition, our method requires little additional computational overhead, and runs at a fast 37 fps. Our single model with a vanilla ResNet50 backbone achieves an overall score of 72.3% on the DAVIS 2017 validation set and 63.1% on the test set. This simple yet high-performing and efficient method can serve as a solid baseline that facilitates future research. Code and models are available at https://github.com/microsoft/transductive-vos.pytorch.
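A minimal PyTorch sketch of the core label-propagation step suggested by the abstract: labels of reference-frame pixels are diffused to current-frame pixels through softmax-normalized feature similarity. Tensor shapes, the temperature and the cosine normalization are illustrative assumptions.

import torch
import torch.nn.functional as F

def propagate_labels(ref_feats, ref_labels, cur_feats, temperature=0.07):
    # ref_feats:  (N, C) embeddings of reference-frame pixels
    # ref_labels: (N, K) one-hot (or soft) object labels of those pixels
    # cur_feats:  (M, C) embeddings of current-frame pixels
    ref = F.normalize(ref_feats, dim=1)
    cur = F.normalize(cur_feats, dim=1)
    affinity = cur @ ref.t() / temperature        # (M, N) similarity in embedding space
    weights = affinity.softmax(dim=1)             # each current pixel attends to reference pixels
    return weights @ ref_labels                   # (M, K) propagated label probabilities

ref_feats, cur_feats = torch.randn(1000, 64), torch.randn(1200, 64)
ref_labels = F.one_hot(torch.randint(0, 3, (1000,)), 3).float()
print(propagate_labels(ref_feats, ref_labels, cur_feats).shape)  # torch.Size([1200, 3])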
[video, frame, temporal, embedding, work, evaluation, prediction, current, speed, previous, time] [object, segmentation, davis, tracking, instance, premvos, table, global, dyenet, propagation, feature, van, mask] [model, trained] [optical, ieee, flow, pattern, prior, figure, spatial, motion, stm, method, pixel, based, fast, convolutional, preceding, reference, tvos] [target, transductive, appearance, unsupervised, image] [learning, validation, similarity, simple, set, training, label, performance, unlabeled, neural, inference, data, sampling, online, finetuning, arxiv, preprint, test, learned, network, wij, sample, random, matrix, measure, efficient] [computer, conference, vision, local, additional, single, term, dense, approach, distant, sparse, european]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Yizhuo and Wu, Zhirong and Peng, Houwen and Lin, Stephen},
  title = {A Transductive Approach for Video Object Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dynamic Face Video Segmentation via Reinforcement Learning
Yujiang Wang, Mingzhi Dong, Jie Shen, Yang Wu, Shiyang Cheng, Maja Pantic


For real-time semantic video segmentation, most recent works utilised a dynamic framework with a key scheduler to make online key/non-key decisions. Some works used a fixed key scheduling policy, while others proposed adaptive key scheduling methods based on heuristic strategies, both of which may lead to suboptimal global performance. To overcome this limitation, we model the online key decision process in dynamic video segmentation as a deep reinforcement learning problem and learn an efficient and effective scheduling policy from expert information about decision history and from the process of maximising global return. Moreover, we study the application of dynamic video segmentation on face videos, a field that has not been investigated before. By evaluating on the 300VW dataset, we show that our reinforcement key scheduler outperforms various baselines in terms of both effective key selections and running speed. Further results on the Cityscapes dataset demonstrate that our proposed method can also generalise to other scenarios. To the best of our knowledge, this is the first work to use reinforcement learning for online key-frame decisions in dynamic video segmentation, and also the first to apply it to face videos.
[video, frame, scheduler, reinforcement, action, policy, dataset, dvsnet, kar, expert, ntask, reward, work, agent, ldk] [key, segmentation, semantic, feature, miou, global, fps, fully, mask, george] [face, model, decision, trained, evaluated, effective, facial, versus, scheduling] [ieee, pattern, dynamic, method, flow, proposed, convolutional, output, figure, optical, adaptive, interpolation, adopted, comparison, dff] [image, eat, loss, learn] [deep, training, arxiv, preprint, learning, performance, set, network, machine, neural, architecture, better, episode, fixed, strategy, average, selected, online, efficient, function, lead] [conference, computer, vision, international, demonstrate, estimation, limit]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yujiang and Dong, Mingzhi and Shen, Jie and Wu, Yang and Cheng, Shiyang and Pantic, Maja},
  title = {Dynamic Face Video Segmentation via Reinforcement Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion
Julian Chibane, Thiemo Alldieck, Gerard Pons-Moll


While many works focus on 3D reconstruction from images, in this paper, we focus on 3D shape reconstruction and completion from a variety of 3D inputs, which are deficient in some respect: low and high resolution voxels, sparse and dense point clouds, complete or incomplete. Processing of such 3D inputs is an increasingly important problem as they are the output of 3D scanners, which are becoming more accessible, and are the intermediate output of 3D computer vision algorithms. Recently, learned implicit functions have shown great promise as they produce continuous reconstructions. However, we identified two limitations in reconstruction from 3D inputs: 1) details present in the input data are not retained, and 2) articulated humans are reconstructed poorly. To solve this, we propose Implicit Feature Networks (IF-Nets), which deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data, retaining the nice properties of recent learned implicit functions; critically, they can also retain detail when it is present in the input data, and can reconstruct articulated humans. Our work differs from prior work in two crucial aspects. First, instead of using a single vector to encode a 3D shape, we extract a learnable 3-dimensional multi-scale tensor of deep features, which is aligned with the original Euclidean space embedding the shape. Second, instead of classifying x-y-z point coordinates directly, we classify deep features extracted from the tensor at a continuous query point. We show that this forces our model to make decisions based on global and local shape structure, as opposed to point coordinates, which are arbitrary under Euclidean transformations. Experiments demonstrate that IF-Nets outperform prior work in 3D object reconstruction on ShapeNet, and obtain significantly more accurate 3D human reconstructions. Code and project website are available at https://virtualhumans.mpi-inf.mpg.de/ifnets/.
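The central operation, sampling a learned 3D feature tensor at continuous query points and classifying occupancy from those features, can be sketched in PyTorch as below. The single feature scale, tensor shapes and the tiny MLP head are assumptions for illustration rather than the paper's architecture.

import torch
import torch.nn.functional as F

def query_features(feature_grid, points):
    # feature_grid: (B, C, D, H, W) features aligned with the input 3D grid
    # points:       (B, N, 3) query coordinates normalized to [-1, 1]
    grid = points.view(points.shape[0], 1, 1, -1, 3)           # (B, 1, 1, N, 3)
    sampled = F.grid_sample(feature_grid, grid, align_corners=True)
    return sampled.view(feature_grid.shape[0], feature_grid.shape[1], -1).transpose(1, 2)  # (B, N, C)

# A small MLP then classifies each per-point feature vector as inside/outside.
occupancy_head = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

feat = torch.randn(2, 32, 16, 16, 16)
pts = torch.rand(2, 500, 3) * 2 - 1
logits = occupancy_head(query_features(feat, pts))             # (2, 500, 1)
print(logits.shape)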
[encoding, work, multiple, imnet] [object, feature, global, focus] [input, query, model, christian, clothing] [ieee, pattern, method, based, output, detail, resolution] [representation, aligned, missing, image, preserve, produce, arbitrary, latent] [learning, deep, neural, data, processing, network, learned, training, space, vector, function] [point, computer, conference, shape, reconstruction, implicit, vision, international, continuous, human, surface, reconstruct, gerard, voxel, articulated, single, occupancy, sparse, dense, mesh, michael, cloud, complete, local, rigid, shapenet, detailed, grid, hao, completion, complex, depth, marching, voxels, distance, capture, occnet, dmc, european]
@InProceedings{Chibane_2020_CVPR,
  author = {Chibane, Julian and Alldieck, Thiemo and Pons-Moll, Gerard},
  title = {Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Semantic Drift Compensation for Class-Incremental Learning
Lu Yu, Bartlomiej Twardowski, Xialei Liu, Luis Herranz, Kai Wang, Yongmei Cheng, Shangling Jui, Joost van de Weijer


Class-incremental learning of deep networks sequentially increases the number of classes to be classified. During training, the network has only access to data of one task at a time, where each task contains several classes. In this setting, networks suffer from catastrophic forgetting which refers to the drastic drop in performance on previous tasks. The vast majority of methods have studied this scenario for classification networks, where for each new task the classification layer of the network must be augmented with additional weights to make room for the newly added classes. Embedding networks have the advantage that new classes can be naturally included into the network without adding new weights. Therefore, we study incremental learning for embedding networks. In addition, we propose a new method to estimate the drift, called semantic drift, of features and compensate for it without the need of any exemplars. We approximate the drift of previous tasks based on the drift that is experienced by current task data. We perform experiments on fine-grained datasets, CIFAR100 and ImageNet-Subset. We demonstrate that embedding networks suffer significantly less from catastrophic forgetting. We outperform existing methods which do not require exemplars and obtain competitive results compared to methods which store exemplars. Furthermore, we show that our proposed SDC when combined with existing methods to prevent forgetting consistently improves results.
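A hedged NumPy sketch of the drift-compensation idea in the abstract above: the drift observed on current-task embeddings (before vs. after training on the new task) is interpolated with Gaussian weights to update the stored prototypes of previous classes. The kernel width, shapes and weighting scheme are illustrative assumptions.

import numpy as np

def compensate_prototypes(prototypes, feats_before, feats_after, sigma=0.3):
    # prototypes: (P, D) class means of previous tasks (no exemplars stored)
    # feats_before / feats_after: (N, D) current-task embeddings before / after training the new task
    drift = feats_after - feats_before                          # per-sample drift vectors
    d2 = ((prototypes[:, None, :] - feats_before[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))                          # (P, N) closeness of samples to each prototype
    w = w / (w.sum(axis=1, keepdims=True) + 1e-8)
    return prototypes + w @ drift                               # drift-compensated prototypes

rng = np.random.default_rng(0)
protos = rng.normal(size=(10, 128))
before = rng.normal(size=(200, 128))
after = before + 0.05 * rng.normal(size=(200, 128))             # simulated drift
print(compensate_prototypes(protos, before, after).shape)       # (10, 128)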
[embedding, previous, embeddings, current, dataset, three, work] [semantic, propose, van, improves] [access, trained, combined, model] [method, based, proposed, existing, output, compensation, figure, comparison, suffer] [loss, image, prototype, exemplar] [learning, task, drift, forgetting, training, network, data, continual, average, classification, incremental, number, catastrophic, metric, accuracy, sdc, learned, class, prevent, softmax, deep, triplet, performance, classifier, consider, ewc, finetuning, preventing, lwf, large, negative, function, ncm, observe, rebalance, approximate, outperform, applied, neural, weight, evaluate, memory] [distance, compute, joint, estimate, single, computer, refer, approach]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Lu and Twardowski, Bartlomiej and Liu, Xialei and Herranz, Luis and Wang, Kai and Cheng, Yongmei and Jui, Shangling and Weijer, Joost van de},
  title = {Semantic Drift Compensation for Class-Incremental Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Context-Aware Human Motion Prediction
Enric Corona, Albert Pumarola, Guillem Alenya, Francesc Moreno-Noguer


The problem of predicting human motion given a sequence of past observations is at the core of many applications in robotics and computer vision. The current state of the art formulates this problem as a sequence-to-sequence task, in which a history of 3D skeletons feeds a Recurrent Neural Network (RNN) that predicts future movements, typically in the order of 1 to 2 seconds. However, one aspect that has been overlooked so far is the fact that human motion is inherently driven by interactions with objects and/or other humans in the environment. In this paper, we explore this scenario using a novel context-aware motion prediction architecture. We use a semantic-graph model where the nodes parameterize the human and objects in the scene and the edges their mutual interactions. These interactions are iteratively learned through a graph attention layer, fed with the past observations, which now include both object and human body motions. Once this semantic graph is learned, we inject it into a standard RNN to predict future movements of the human/s and object/s. We consider two variants of our architecture, either freezing the contextual interactions in the future or updating them. A thorough evaluation on the Whole-Body Human Motion Database shows that, in both cases, our context-aware networks clearly outperform baselines in which the context information is not considered.
[prediction, context, rnn, graph, future, predict, interaction, people, cmu, mocap, time, predicting, provide, node, hidden, attention, dataset, previous, action, state, adjacency, relevant, recurrent, include, cup, observed] [object, table, predicted, branch, box, contextual, bounding, semantic] [model, noise, trained, datasets, influence] [motion, figure, residual, convolutional, proposed] [image, representation, person] [learning, consider, arxiv, preprint, problem, neural, network, evaluate, baseline, learned, deep, matrix] [human, pose, estimation, approach, francesc, ground, scene, single, body, distance, predicts, represented, rigid, truth, joint, defined, capture]
@InProceedings{Corona_2020_CVPR,
  author = {Corona, Enric and Pumarola, Albert and Alenya, Guillem and Moreno-Noguer, Francesc},
  title = {Context-Aware Human Motion Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DeepDeform: Learning Non-Rigid RGB-D Reconstruction With Semi-Supervised Data
Aljaz Bozic, Michael Zollhofer, Christian Theobalt, Matthias Niessner


Applying data-driven approaches to non-rigid 3D reconstruction has been difficult, which we believe can be attributed to the lack of a large-scale training corpus. Unfortunately, self-supervised approaches based on non-rigid reconstruction fail for important cases such as highly non-rigid deformations. We first address this lack of data by introducing a novel semi-supervised strategy to obtain dense inter-frame correspondences from a sparse set of annotations. This way, we obtain a large dataset of 400 scenes, over 390,000 RGB-D frames, and 5,533 densely aligned frame pairs; in addition, we provide a test set along with several metrics for evaluation. Based on this corpus, we introduce a data-driven non-rigid feature matching approach, which we integrate into an optimization-based reconstruction pipeline. Here, we propose a new neural network that operates on RGB-D frames, while maintaining robustness under large non-rigid deformations and producing accurate predictions. Our approach significantly outperforms existing non-rigid reconstruction methods that do not use learned data terms, as well as learning-based approaches that only use self-supervision.
[frame, provide, dataset, moving, prediction, static, graph, integrate] [feature, object, heatmap, annotated, tracking, propose, employ, challenging] [visibility, robust, christian, datasets, trained] [ieee, based, pattern, dynamic, motion, comparison, pixel, method, figure, convolutional, densely, color] [source, target, alignment, aligned, image] [data, network, training, learned, learning, set, average, performance, strategy, large, test, probability] [reconstruction, depth, matching, dense, approach, correspondence, computer, sparse, deformation, vision, point, conference, scene, acm, ground, truth, surface, capture, dynamicfusion, volumetric, single, rigid, error, shahram, michael, matthias, geometry, nonrigid, camera, handle, compare]
@InProceedings{Bozic_2020_CVPR,
  author = {Bozic, Aljaz and Zollhofer, Michael and Theobalt, Christian and Niessner, Matthias},
  title = {DeepDeform: Learning Non-Rigid RGB-D Reconstruction With Semi-Supervised Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Optical Non-Line-of-Sight Physics-Based 3D Human Pose Estimation
Mariko Isogawa, Ye Yuan, Matthew O'Toole, Kris M. Kitani


We describe a method for 3D human pose estimation from transient images (i.e., a 3D spatio-temporal histogram of photons) acquired by an optical non-line-of-sight (NLOS) imaging system. Our method can perceive 3D human pose by 'looking around corners' through the use of light indirectly reflected by the environment. We bring together a diverse set of technologies from NLOS imaging, human pose estimation and deep reinforcement learning to construct an end-to-end data processing pipeline that converts a raw stream of photon measurements into a full 3D human pose sequence estimate. Our contributions are the design of a data representation process, which includes (1) a learnable inverse point spread function (PSF) to convert raw transient images into a deep feature vector; (2) a neural humanoid control policy conditioned on the transient image feature and learned from interactions with a physics simulator; and (3) a data synthesis and augmentation strategy based on depth data that can be transferred to a real-world NLOS imaging system. Our preliminary experiments suggest that our method is able to generalize to real-world NLOS measurements to estimate physically-valid 3D human poses.
[policy, sequence, time, temporal, work, hidden, state, reinforcement, recognition, context, reward, visual, extract] [feature, extractor] [model, poisson, noise, subject] [imaging, method, light, captured, ieee, inverse, optical, based, net, psf, pattern, motion, figure, sensor, proposed, result, photon] [image, real, person, control, ability, synthetic, generate] [data, deep, learning, process, function, network, test, procedure, set, training, neural, computational, large, metric] [pose, transient, human, nlos, estimation, humanoid, volume, ground, depth, conference, joint, computer, vision, estimate, single, point, confocal, acm, wall, reconstruction, posereg, matthew, system, truth, estimated, capture, visible, reflectance, ramesh]
@InProceedings{Isogawa_2020_CVPR,
  author = {Isogawa, Mariko and Yuan, Ye and O'Toole, Matthew and Kitani, Kris M.},
  title = {Optical Non-Line-of-Sight Physics-Based 3D Human Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Transfer Texture From Clothing Images to 3D Humans
Aymen Mir, Thiemo Alldieck, Gerard Pons-Moll


In this paper, we present a simple yet effective method to automatically transfer textures of clothing images (front and back) to 3D garments worn on top of SMPL, in real time. We first automatically compute training pairs of images with aligned 3D garments using a custom non-rigid 3D to 2D registration method, which is accurate but slow. Using these pairs, we learn a mapping from pixels to the 3D garment surface. Our idea is to learn dense correspondences from garment image silhouettes to a 2D-UV map of a 3D garment surface using shape information alone, completely ignoring texture, which allows us to generalize to the wide range of web images. Several experiments demonstrate that our model is more accurate than widely used baselines such as thin-plate-spline warping and image-to-image translation networks, while being orders of magnitude faster. Our model opens the door for applications such as virtual try-on, and allows for generation of 3D humans with varied textures, which is necessary for learning. Code will be available at https://virtualhumans.mpi-inf.mpg.de/pix2surf/.
[dataset, work, people, automatically, automatic, recognition] [map, segmentation, focus, mask] [garment, clothing, model, input, christian] [ieee, pattern, method, based, figure, warping, pixel] [image, texture, mapping, learn, translation, real, produce, transfer, person, train, generalize, aligned] [learning, training, network, andrew, neural, data, online, minimize] [computer, shape, vision, conference, virtual, human, pose, body, gerard, international, silhouette, surface, acm, single, view, mesh, front, smpl, fitting, matching, correspondence, allows, dense, geometry, textured, michael, hao, retail, parametric, depth, european, david, capture, thiemo, accurate, novel, directly, reconstruction, compare, avatar, compute]
@InProceedings{Mir_2020_CVPR,
  author = {Mir, Aymen and Alldieck, Thiemo and Pons-Moll, Gerard},
  title = {Learning to Transfer Texture From Clothing Images to 3D Humans},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
UniPose: Unified Human Pose Estimation in Single Images and Videos
Bruno Artacho, Andreas Savakis


We propose UniPose, a unified framework for human pose estimation, based on our "Waterfall" Atrous Spatial Pooling architecture, that achieves state-of-the-art results on several pose estimation metrics. UniPose incorporates contextual segmentation and joint localization to estimate the human pose in a single stage, with high accuracy, without relying on statistical postprocessing methods. The Waterfall module in UniPose leverages the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Additionally, our method is extended to UniPose-LSTM for multi-frame processing and achieves state-of-the-art results for temporal pose estimation in video. Our results on multiple datasets demonstrate that UniPose, with a ResNet backbone and Waterfall module, is a robust and efficient architecture for pose estimation, obtaining state-of-the-art results in single-person pose detection for both single images and videos.
[dataset, lstm, decoder, temporal, action, video, bbc, recognition, order, frame, sign, previous, multiple] [detection, module, atrous, cascade, feature, semantic, bounding, table, pooling, box, contextual, final, aspp, segmentation, detect, unified, framework, backbone, main, cpm] [heatmaps, original, input, datasets] [unipose, wasp, ieee, method, figure, pattern, convolutional, spatial, waterfall, penn, resolution, based, high, flow, convolution, utilized] [image, person] [network, architecture, deep, efficient, better, learning, size, rate, processing, large, number, larger, configuration, machine, performance] [pose, estimation, human, conference, computer, vision, single, joint, body, approach, mpii, fov, lsp, international, european, estimate]
@InProceedings{Artacho_2020_CVPR,
  author = {Artacho, Bruno and Savakis, Andreas},
  title = {UniPose: Unified Human Pose Estimation in Single Images and Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Minimal Solutions to Relative Pose Estimation From Two Views Sharing a Common Direction With Unknown Focal Length
Yaqing Ding, Jian Yang, Jean Ponce, Hui Kong


We propose minimal solutions to the relative pose estimation problem from two views sharing a common direction with unknown focal length. This is relevant for cameras equipped with an IMU (inertial measurement unit), e.g., smart phones and tablets. Similar to the 6-point algorithm for two cameras with unknown but equal focal lengths and the 7-point algorithm for two cameras with different and unknown focal lengths, we derive new 4- and 5-point algorithms for these two cases, respectively. The proposed algorithms can cope with coplanar points, which is a degenerate configuration for these 6- and 7-point counterparts. We present a detailed analysis and comparisons with the state of the art. Experimental results on both synthetic data and real images from a smart phone demonstrate the usefulness of the proposed algorithms.
[length, action, recognition, phone] [martin] [noise, case, pitch, degree, choose] [proposed, motion, ieee, figure, pattern, based, coefficient, analysis, homography, reference] [unknown, image, translation, real, shared, common] [algorithm, matrix, standard, smart, data, number, problem, vector, general, increased, set, forward, better, fewer, equal, denote, written, efficient] [focal, computer, rotation, relative, solution, minimal, polynomial, eigenvalue, pose, camera, gravity, conference, vision, roll, direction, imu, degenerate, point, system, error, planar, estimation, ransac, zuzana, estimated, additional, second, rewritten, pure, inliers, international, tomas, assume, monomials]
@InProceedings{Ding_2020_CVPR,
  author = {Ding, Yaqing and Yang, Jian and Ponce, Jean and Kong, Hui},
  title = {Minimal Solutions to Relative Pose Estimation From Two Views Sharing a Common Direction With Unknown Focal Length},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
3D Human Mesh Regression With Dense Correspondence
Wang Zeng, Wanli Ouyang, Ping Luo, Wentao Liu, Xiaogang Wang


Estimating a 3D mesh of the human body from a single 2D image is an important task with many applications, such as augmented reality and human-robot interaction. However, prior works reconstructed the 3D mesh from a global image feature extracted by a convolutional neural network (CNN), where the dense correspondences between the mesh surface and the image pixels are missing, leading to suboptimal solutions. This paper proposes a model-free 3D human mesh estimation framework, named DecoMR, which explicitly establishes the dense correspondence between the mesh and the local image features in the UV space (i.e. a 2D space used for texture mapping of 3D meshes). DecoMR first predicts a pixel-to-surface dense correspondence map (i.e., IUV image), with which we transfer local features from the image space to the UV space. Then the transferred local image features are processed in the UV space to regress a location map, which is well aligned with the transferred features. Finally, we reconstruct the 3D human mesh from the regressed location map with a predefined mapping function. We also observe that the existing discontinuous UV map is unfriendly to network learning. Therefore, we propose a novel UV map that maintains most of the neighboring relations on the original mesh surface. Experiments demonstrate that our proposed local feature alignment and continuous UV map outperform existing 3D mesh-based methods on multiple public benchmarks. Code will be made available at https://github.com/zengwang430521/DecoMR.
[outperforms, multiple, work, dataset, explicitly] [map, location, feature, global, framework, table, predicted, regression, regressed, achieves, predefined] [original, model, input] [figure, ieee, based, neighboring, pattern, pixel, method, output, raw, comparison, proposed, cmr, prior, existing] [image, transferred, loss, corresponding, transfer, representation, train, mapping, aligned] [space, learning, network, training, default, large, test, data, maintains, performance, function, neural] [mesh, human, local, iuv, body, pose, dense, correspondence, smpl, surface, conference, computer, estimation, shape, vision, continuous, distance, single, well, reconstruction, international, point, joint, reconstruct, spin, directly, lnet, consistent, surreal, estimate, cnet, reconstructed, regress]
@InProceedings{Zeng_2020_CVPR,
  author = {Zeng, Wang and Ouyang, Wanli and Luo, Ping and Liu, Wentao and Wang, Xiaogang},
  title = {3D Human Mesh Regression With Dense Correspondence},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-Modal Pattern-Propagation for RGB-T Tracking
Chaoqun Wang, Chunyan Xu, Zhen Cui, Ling Zhou, Tong Zhang, Xiaoya Zhang, Jian Yang


Motivated by our observation on RGB-T data that pattern correlations recur frequently across modalities as well as along sequence frames, in this paper we propose a cross-modal pattern-propagation (CMPP) tracking framework to diffuse instance patterns across RGB-T data in the spatial domain as well as the temporal domain. To bridge the RGB-T modalities, cross-modal correlations on intra-modal paired pattern-affinities are derived to reveal latent cues between heterogeneous modalities. Through these correlations, useful patterns may be mutually propagated between the RGB-T modalities so as to fulfill inter-modal pattern propagation. Further, considering the temporal continuity of sequence frames, we extend the spirit of pattern propagation to the dynamic temporal domain, in which long-term historical contexts are adaptively correlated and propagated into the current frame for more effective information inheritance. Extensive experiments demonstrate the effectiveness of our proposed CMPP, and new state-of-the-art results are achieved with significant improvements on two RGB-T object tracking benchmarks.
[historical, pair, visual, current, context, modality, sequence, temporal, frame, construct, video] [tracking, object, cmpp, thermal, feature, propagation, correlation, affinity, module, chenglong, impp, plot, gtot, threshold, propagated, cnn, effectiveness, positive, instance, achieves, location, propose, framework, rgbt] [model, success] [pattern, ieee, proposed, convolutional, method, low, jin, comparison, adaptively, figure, based, spatial] [infrared, target, image, representation] [network, learning, candidate, online, data, rate, performance, set, precision, deep, neural, dissimilar, statistical, confident] [computer, conference, rgb, vision, well, sparse, demonstrate, single]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Chaoqun and Xu, Chunyan and Cui, Zhen and Zhou, Ling and Zhang, Tong and Zhang, Xiaoya and Yang, Jian},
  title = {Cross-Modal Pattern-Propagation for RGB-T Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Distilling Knowledge From Graph Convolutional Networks
Yiding Yang, Jiayan Qiu, Mingli Song, Dacheng Tao, Xinchao Wang


Existing knowledge distillation methods focus on convolutional neural networks (CNNs), where the input samples like images lie in a grid domain, and have largely overlooked graph convolutional networks (GCNs) that handle non-grid data. In this paper, we propose, to the best of our knowledge, the first dedicated approach to distilling knowledge from a pre-trained GCN model. To enable the knowledge transfer from the teacher GCN to the student, we propose a local structure preserving module that explicitly accounts for the topological semantics of the teacher. In this module, the local structure information from both the teacher and the student is extracted as distributions, and hence minimizing the distance between these distributions enables topology-aware knowledge transfer from the teacher, yielding a compact yet high-performance student model. Moreover, the proposed approach is readily extendable to dynamic graph models, where the input graphs for the teacher and the student may differ. We evaluate the proposed method on two different datasets using GCN models of different architectures, and demonstrate that our method achieves the state-of-the-art knowledge distillation performance for GCN models.
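A minimal PyTorch sketch of a local-structure-preserving loss of the kind described: for each node, teacher and student embeddings each induce a distribution over that node's neighbors (here via an RBF kernel and softmax), and the student is trained to match the teacher's distribution with a KL divergence. The kernel, toy graph and shapes are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def local_structure_loss(student_h, teacher_h, adjacency, gamma=1.0):
    # student_h, teacher_h: (N, D) node embeddings; adjacency: (N, N) 0/1 graph structure
    def neighbor_dist(h):
        d2 = torch.cdist(h, h) ** 2                                   # pairwise squared distances
        logits = (-gamma * d2).masked_fill(adjacency == 0, float('-inf'))  # keep only neighbors
        return logits.softmax(dim=1)                                  # per-node distribution over neighbors
    p_t = neighbor_dist(teacher_h)
    p_s = neighbor_dist(student_h)
    return F.kl_div(p_s.clamp_min(1e-8).log(), p_t, reduction='batchmean')

N = 6
adj = torch.ones(N, N)          # toy fully-connected graph (no self-loops)
adj.fill_diagonal_(0)
loss = local_structure_loss(torch.randn(N, 16), torch.randn(N, 16), adj)
print(loss)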
[graph, gcn, node, attention, dataset, provide, embedded, constructed, relationship] [feature, object, center, module, table, focus, fully] [model, input, topological, trained, adding, original] [method, convolutional, proposed, kernel, dynamic, ieee, pattern, intermediate, based, comparison, output] [preserving, transfer, learn, representation, loss] [student, knowledge, teacher, distillation, function, performance, network, set, neural, learning, deep, training, arxiv, preprint, best, stu, distilling, xinchao, distribution, classification, similarity, epoch, processing, mingli, evaluate, smaller, learned, distill, space, strategy, dimension] [structure, local, conference, computer, vision, lsp, point, distance, approach, grid, compute, rbf, international, represented]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Yiding and Qiu, Jiayan and Song, Mingli and Tao, Dacheng and Wang, Xinchao},
  title = {Distilling Knowledge From Graph Convolutional Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Identity-Invariant Motion Representations for Cross-ID Face Reenactment
Po-Hsiang Huang, Fu-En Yang, Yu-Chiang Frank Wang


Human face reenactment aims at transferring motion patterns from one face (from a source-domain video) to another (in the target domain with the identity of interest). While recent works report impressive results, they are not able to handle multiple identities in a unified model. In this paper, we propose a unique network, CrossID-GAN, to perform multi-ID face reenactment. Given a source-domain video with extracted facial landmarks and a target-domain image, our CrossID-GAN learns identity-invariant motion patterns via the extracted landmarks and uses such information to produce videos whose identity matches that of the target domain. Both supervised and unsupervised settings are proposed to train and guide our model during training. Our qualitative/quantitative results confirm the robustness and effectiveness of our model, with ablation studies confirming our network design.
[video, frame, multiple, observed, three, temporal] [unified, effectiveness, feature, table, boundary, propose, ablation] [face, identity, reenactment, model, facial, landmark, input, expression, trained, adv, reenacted, adversarial, quality] [output, motion, ieee, proposed, figure, pattern, designed, quantitative, existing, perceptual, convolutional] [image, target, source, unsupervised, supervised, loss, code, extracted, unseen, content, transfer, latent, domain, representation, encoder, ladv, perform, consistency, produce, realistic, generator, disentanglement, dtmp, qualitative, transferring] [learning, training, network, data, deep, neural] [pose, human, conference, computer, vision, shape, ground]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Po-Hsiang and Yang, Fu-En and Wang, Yu-Chiang Frank},
  title = {Learning Identity-Invariant Motion Representations for Cross-ID Face Reenactment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Distribution-Aware Coordinate Representation for Human Pose Estimation
Feng Zhang, Xiatian Zhu, Hanbin Dai, Mao Ye, Ce Zhu


While being the de facto standard coordinate representation for human pose estimation, the heatmap has not been investigated in depth. This work fills this gap. For the first time, we find that the process of decoding the predicted heatmaps into the final joint coordinates in the original image space is surprisingly significant for the performance. We further probe the design limitations of the standard coordinate decoding method, and propose a more principled distribution-aware decoding method. Also, we improve the standard coordinate encoding process (i.e. transforming ground-truth coordinates to heatmaps) by generating unbiased/accurate heatmaps. Taking the two together, we formulate a novel Distribution-Aware coordinate Representation of Keypoints (DARK) method. Serving as a model-agnostic plug-in, DARK brings a significant performance boost to existing human pose estimation models. Extensive experiments show that DARK yields the best results on two common benchmarks, MPII and COCO. Besides, DARK achieves the 2nd place entry in the ICCV 2019 COCO Keypoints Challenge. The code is available online.
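A simplified NumPy sketch of distribution-aware decoding: rather than the conventional quarter-pixel shift towards the second-highest response, the sub-pixel offset is obtained from the first and second derivatives (a Taylor expansion) of the log-heatmap around its peak. Heatmap smoothing, boundary handling and the paper's exact formulation are omitted, so treat this as an illustration only.

import numpy as np

def dark_decode(heatmap, eps=1e-10):
    # heatmap: (H, W) predicted heatmap for one joint
    h = np.log(np.maximum(heatmap, eps))
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if 1 <= x < heatmap.shape[1] - 1 and 1 <= y < heatmap.shape[0] - 1:
        dx = 0.5 * (h[y, x + 1] - h[y, x - 1])
        dy = 0.5 * (h[y + 1, x] - h[y - 1, x])
        dxx = h[y, x + 1] - 2 * h[y, x] + h[y, x - 1]
        dyy = h[y + 1, x] - 2 * h[y, x] + h[y - 1, x]
        dxy = 0.25 * (h[y + 1, x + 1] - h[y + 1, x - 1] - h[y - 1, x + 1] + h[y - 1, x - 1])
        hess = np.array([[dxx, dxy], [dxy, dyy]])
        if np.linalg.det(hess) != 0:
            offset = -np.linalg.solve(hess, np.array([dx, dy]))
            return np.array([x, y], dtype=float) + offset     # (x, y) with sub-pixel refinement
    return np.array([x, y], dtype=float)

# toy example: a Gaussian blob whose true center falls between pixels
yy, xx = np.mgrid[0:64, 0:48]
hm = np.exp(-((xx - 20.3) ** 2 + (yy - 31.7) ** 2) / (2 * 2.0 ** 2))
print(dark_decode(hm))   # close to [20.3, 31.7]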
[decoding, encoding, unbiased, prediction, modulation, work] [heatmap, coco, table, predicted, shifting, location, regression, propose, cnn, biased, china] [model, input, original, heatmaps, maximal, technology, unconstrained] [dark, method, ieee, existing, resolution, pattern, based, proposed, gaussian, science, convolutional, spatial, feng] [representation, image, person, whilst, common, xiatian] [standard, performance, learning, distribution, training, activation, validation, neural, data, inference, maximum, compared, problem, best, accuracy, label, network, size, process, design, operation, efficient, deep, processing, reduction] [coordinate, pose, human, computer, estimation, conference, joint, vision, accurate, european, keypoints, mpii, cost, quantisation, mao, second, body, recovery]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Feng and Zhu, Xiatian and Dai, Hanbin and Ye, Mao and Zhu, Ce},
  title = {Distribution-Aware Coordinate Representation for Human Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Parsing-Based View-Aware Embedding Network for Vehicle Re-Identification
Dechao Meng, Liang Li, Xuejing Liu, Yadong Li, Shijie Yang, Zheng-Jun Zha, Xingyu Gao, Shuhui Wang, Qingming Huang


Vehicle re-identification aims to find images of the same vehicle from various views in the cross-camera scenario. The main challenges of this task are the large intra-instance distance caused by different views and the subtle inter-instance discrepancy caused by similar vehicles. In this paper, we propose a parsing-based view-aware embedding network (PVEN) to achieve view-aware feature alignment and enhancement for vehicle ReID. First, we introduce a parsing network to parse a vehicle into four different views and then align the features by mask average pooling. Such alignment provides a fine-grained representation of the vehicle. Second, in order to enhance the view-aware features, we design a common-visible attention mechanism to focus on the commonly visible views, which not only shortens the distance among intra-instances, but also enlarges the discrepancy of inter-instances. PVEN helps capture the stable, discriminative information of a vehicle under different views. The experiments conducted on three datasets show that our model outperforms state-of-the-art methods by a large margin.
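A small PyTorch sketch of the mask-average-pooling step mentioned in the abstract: a parser yields one soft mask per vehicle view, and a view-aware feature is obtained by averaging the backbone feature map under each mask. The shapes and number of views follow the abstract; everything else is an illustrative assumption.

import torch

def mask_average_pooling(features, masks, eps=1e-6):
    # features: (B, C, H, W) backbone feature map
    # masks:    (B, V, H, W) soft parsing masks, one per view (V = 4 in the paper)
    weighted = features.unsqueeze(1) * masks.unsqueeze(2)        # (B, V, C, H, W)
    pooled = weighted.sum(dim=(3, 4)) / (masks.sum(dim=(2, 3)).unsqueeze(2) + eps)
    return pooled                                                # (B, V, C) view-aware features

feats = torch.randn(2, 256, 16, 16)
masks = torch.rand(2, 4, 16, 16)
print(mask_average_pooling(feats, masks).shape)                  # torch.Size([2, 4, 256])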
[vehicle, attention, three, embedding, dataset] [feature, pven, global, parsing, table, map, score, vehicleid, key, side, region, pooling, improvement, liang, mask, focus, gallery, effectiveness, semantic, propose, shortens] [model, medium, query, datasets, difference, adversarial, visibility] [ieee, based, figure, enhancement, proposed, method, enhance, color, introduced] [reid, loss, alignment, discriminative, person, representation, image, discrepancy, common, learn, subtle, introduce, corresponding, generated, qingming, target, extracted] [large, learning, network, test, performance, triplet, deep, set, average, top, training, better] [local, distance, visible, view, front, distinctive]
@InProceedings{Meng_2020_CVPR,
  author = {Meng, Dechao and Li, Liang and Liu, Xuejing and Li, Yadong and Yang, Shijie and Zha, Zheng-Jun and Gao, Xingyu and Wang, Shuhui and Huang, Qingming},
  title = {Parsing-Based View-Aware Embedding Network for Vehicle Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
HandVoxNet: Deep Voxel-Based Network for 3D Hand Shape and Pose Estimation From a Single Depth Map
Jameel Malik, Ibrahim Abdelaziz, Ahmed Elhayek, Soshi Shimada, Sk Aziz Ali, Vladislav Golyanik, Christian Theobalt, Didier Stricker


3D hand shape and pose estimation from a single depth map is a new and challenging computer vision problem with many applications. The state-of-the-art methods directly regress 3D hand meshes from 2D depth images via 2D convolutional neural networks, which leads to artefacts in the estimations due to perspective distortions in the images. In contrast, we propose a novel architecture with 3D convolutions trained in a weakly-supervised manner. The input to our method is a 3D voxelized depth map, and we rely on two hand shape representations. The first one is the 3D voxelized grid of the shape, which is accurate but does not preserve the mesh topology or the number of mesh vertices. The second representation is the 3D hand surface, which is less accurate but does not suffer from the limitations of the first representation. We combine the advantages of these two representations by registering the hand surface to the voxelized hand shape. In the extensive experiments, the proposed approach improves over the state of the art by 47.8% on the SynHand5M dataset. Moreover, our augmentation policy for voxelized depth maps further enhances the accuracy of 3D hand pose estimation on real data. Our method produces visually more reasonable and realistic hand shapes on the NYU and BigHand2.2M datasets compared to the existing approaches.
[dataset, recognition, work] [map, regression, table, fully, propose, improves] [heatmaps, input, datasets, model, effective] [method, proposed, pattern, figure, convolutional, based] [real, perform, synthetic, representation, mapping, loss, plausible] [data, accuracy, network, augmentation, deep, training, learning, compared, size, set, architecture, test, number] [hand, shape, pose, voxelized, depth, estimation, surface, estimated, approach, vision, computer, nyu, single, conference, accurate, mesh, estimate, international, joint, didier, grid, dispvoxnet, ground, combine, accurately, pipeline, truth, reconstruction, directly, registration, jameel, ahmed, topology, point, nrga, deephps, perspective, second, recovery]
@InProceedings{Malik_2020_CVPR,
  author = {Malik, Jameel and Abdelaziz, Ibrahim and Elhayek, Ahmed and Shimada, Soshi and Ali, Sk Aziz and Golyanik, Vladislav and Theobalt, Christian and Stricker, Didier},
  title = {HandVoxNet: Deep Voxel-Based Network for 3D Hand Shape and Pose Estimation From a Single Depth Map},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Determinant Regularization for Gradient-Efficient Graph Matching
Tianshu Yu, Junchi Yan, Baoxin Li


Graph matching refers to finding vertex correspondences between a pair of graphs, which plays a fundamental role in many vision and learning related tasks. Directly applying gradient-based continuous optimization to graph matching can be attractive for its simplicity, but calls for effective ways of converting the continuous solution to the discrete one under the matching constraint. In this paper, we show a novel regularization technique with the tool of determinant analysis on the matching matrix, which is relaxed into the continuous domain with gradient-based optimization. Meanwhile, we present a theoretical study of the properties of our relaxation technique. Our paper makes an attempt to understand the geometric properties of different regularization techniques and the gradient behavior during the optimization. We show that the proposed regularization is more gradient-efficient than traditional ones during early update stages. The analysis will also bring about insights for other problems under bijection constraints. The algorithm procedure is simple, and empirical results on public benchmarks show its effectiveness on both synthetic and real-world data.
[graph, pair, node, time, dataset, behavior, current, cmu, house, element, previous] [affinity, score, edge, feature, assignment, object, pascal] [technique, effective, noise, case, multiplicative] [determinant, figure, method, based, analysis, graduated, spectral, tensor] [synthetic, corresponding, gap, image] [algorithm, matrix, bpf, detgm, gradient, accuracy, objective, regularization, rrwm, optimization, discrete, gagm, ggmlp, bgm, learning, polytope, ipfp, ggm, compared, path, relaxation, number, performance, updating, stochastic, permutation, implies, ratio, procedure, kxj, data, entropy, deep, proper, convergence] [matching, solution, point, absolute, continuous, correspondence, geometric, doubly, inliers, outlier, property, local, continuation, euclidean]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Tianshu and Yan, Junchi and Li, Baoxin},
  title = {Determinant Regularization for Gradient-Efficient Graph Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
D3S - A Discriminative Single Shot Segmentation Tracker
Alan Lukezic, Jiri Matas, Matej Kristan


Template-based discriminative trackers are currently the dominant tracking paradigm due to their robustness, but are restricted to bounding box tracking and a limited range of transformation models, which reduces their localization accuracy. We propose a discriminative single-shot segmentation tracker - D3S, which narrows the gap between visual object tracking and video object segmentation. A single-shot network applies two target models with complementary geometric properties, one invariant to a broad range of transformations, including non-rigid deformations, the other assuming a rigid object to simultaneously achieve high robustness and online target segmentation. Without per-dataset finetuning and trained only for segmentation as the primary output, D3S outperforms all trackers on VOT2016, VOT2018 and GOT-10k benchmarks and performs close to the state-of-the-art trackers on the TrackingNet. D3S outperforms the leading segmentation tracker SiamMask on video segmentation benchmark and performs on par with top video object segmentation algorithms, while running an order of magnitude faster, close to real-time.
[video, visual, outperforms, pathway, frame, evaluation, dataset] [segmentation, object, tracking, bounding, siammask, gim, gem, box, tracker, correlation, dcf, backbone, sota, foreground, background, mask, region, atom, table, location, template, siamese, matej, eao, alan, localization, refinement, module] [model, robust, trained, constrained, robustness] [channel, figure, output, relu, convolutional, ieee, method, proposed, comparison, performs, adaptive, fast] [target, discriminative, invariant, extracted, image, produce] [performance, top, network, learning, training, accuracy, deep, average, similarity, compared, search, set, reduces, posterior, online, large] [computer, geometrically, accurate, single, rotated, ground, truth, michael, european, position, euclidean, fitting]
@InProceedings{Lukezic_2020_CVPR,
  author = {Lukezic, Alan and Matas, Jiri and Kristan, Matej},
  title = {D3S - A Discriminative Single Shot Segmentation Tracker},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MANTRA: Memory Augmented Networks for Multiple Trajectory Prediction
Francesco Marchetti, Federico Becattini, Lorenzo Seidenari, Alberto Del Bimbo


Autonomous vehicles are expected to drive in complex scenarios with several independent, non-cooperating agents. Path planning for safely navigating such environments cannot rely only on perceiving the present location and motion of other agents; it instead requires predicting such variables sufficiently far into the future. In this paper we address the problem of multimodal trajectory prediction exploiting a Memory Augmented Neural Network. Our method learns past and future trajectory embeddings using recurrent neural networks and exploits an associative external memory to store and retrieve such embeddings. Trajectory prediction is then performed by decoding in-memory future encodings conditioned on the observed past. We incorporate scene knowledge in the decoding state by learning a CNN on top of semantic scene maps. Memory growth is limited by learning a writing controller based on the predictive capability of existing embeddings. We show that our method is able to natively perform multi-modal trajectory prediction, obtaining state-of-the-art results on three datasets. Moreover, thanks to the non-parametric nature of the memory module, we show how, once trained, our system can continuously improve by ingesting novel patterns.
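A toy NumPy sketch of the memory read implied by the abstract: the encoding of the observed past is compared against stored past encodings (keys), and the future encodings (values) of the closest matches are returned for decoding into multiple candidate trajectories. The cosine-similarity lookup, sizes and top-k choice are illustrative assumptions.

import numpy as np

def read_memory(past_code, memory_keys, memory_values, k=5):
    # past_code:     (D,)   encoding of the observed past trajectory
    # memory_keys:   (M, D) stored past encodings
    # memory_values: (M, D) stored future encodings
    sim = memory_keys @ past_code / (
        np.linalg.norm(memory_keys, axis=1) * np.linalg.norm(past_code) + 1e-8)
    top = np.argsort(-sim)[:k]                 # indices of the k most similar memories
    return memory_values[top]                  # k future encodings, later decoded into k trajectories

rng = np.random.default_rng(0)
keys, values = rng.normal(size=(100, 48)), rng.normal(size=(100, 48))
futures = read_memory(rng.normal(size=48), keys, values)
print(futures.shape)                           # (5, 48)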
[future, trajectory, mantra, prediction, observed, multiple, state, decoder, multimodal, infer, desire, encoding, predict, social, recurrent, vehicle, relevant, kalman, store, current, time, mann, behavior, dataset, ade, fde] [semantic, map, table, refinement, module, autonomous, associative] [model, trained, external] [ieee, method, pattern, based, figure, read, proposed, exploiting] [address, encoder, perform, train, representation, meaningful, generate, encoders] [memory, neural, learning, controller, training, augmented, size, set, differently, sample, arxiv, preprint, network, linear, knowledge, machine, increasing, test, report, number, data, function, problem] [conference, error, computer, vision, kitti, international, reconstruction, robotcar, novel, single, write, point, rotation]
@InProceedings{Marchetti_2020_CVPR,
  author = {Marchetti, Francesco and Becattini, Federico and Seidenari, Lorenzo and Bimbo, Alberto Del},
  title = {MANTRA: Memory Augmented Networks for Multiple Trajectory Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
End-to-End Model-Free Reinforcement Learning for Urban Driving Using Implicit Affordances
Marin Toromanoff, Emilie Wirbel, Fabien Moutarde


Reinforcement Learning (RL) aims at learning an optimal behavior policy from an agent's own experiments rather than from rule-based control methods. However, there is no RL algorithm yet capable of handling a task as difficult as urban driving. We present a novel technique, coined implicit affordances, to effectively leverage RL for urban driving, including lane keeping, pedestrian and vehicle avoidance, and traffic light detection. To our knowledge, we are the first to present a successful RL agent handling such a complex task, especially regarding traffic light detection. Furthermore, we have demonstrated the effectiveness of our method by winning the Camera Only track of the CARLA challenge.
[traffic, agent, driving, carla, reward, urban, lane, affordances, speed, reinforcement, state, time, town, predict, red, lbc, imitation, work, affordance, current] [autonomous, semantic, benchmark, track, segmentation, main, ablation, table, final, challenge, car] [steering, trained, input, auxiliary, choose, middle] [light, figure, handling, signal, comparison, method, high] [encoder, supervised, desired, train, control, image, real, unseen, loss] [training, network, learning, performance, data, task, deep, discrete, large, replay, impact, maximum, arxiv, preprint, larger, size, algorithm, applied, better, memory, test, scheme] [implicit, handle, distance, angle, conference, david, international, solve, approach, second, single]
@InProceedings{Toromanoff_2020_CVPR,
  author = {Toromanoff, Marin and Wirbel, Emilie and Moutarde, Fabien},
  title = {End-to-End Model-Free Reinforcement Learning for Urban Driving Using Implicit Affordances},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GraphTER: Unsupervised Learning of Graph Transformation Equivariant Representations via Auto-Encoding Node-Wise Transformations
Xiang Gao, Wei Hu, Guo-Jun Qi


Recent advances in Graph Convolutional Neural Networks (GCNNs) have shown their efficiency for non-Euclidean data on graphs, which often require a large amount of labeled data with high cost. It is thus critical to learn graph feature representations in an unsupervised manner in practice. To this end, we propose a novel unsupervised learning of Graph Transformation Equivariant Representations (GraphTER), aiming to capture intrinsic patterns of graph structure under both global and local transformations. Specifically, we sample different groups of nodes from a graph and then transform them node-wise, either isotropically or anisotropically. We then self-train a representation encoder to capture the graph structures by reconstructing these node-wise transformations from the feature representations of the original and transformed graphs. In experiments, we apply the learned GraphTER to graphs of 3D point cloud data, and results on point cloud segmentation/classification show that GraphTER significantly outperforms state-of-the-art unsupervised approaches and pushes much closer to the upper bound set by the fully supervised counterparts. The code is available at: https://github.com/gyshgx868/graph-ter.
[graph, graphter, node, decoder, isotropically, three, recognition, outperforms, adjacency, dataset, associated, decoding, individual] [feature, global, segmentation, achieves, edge, fully, propose, apply] [model, original, trained, input, experimental, aes] [proposed, signal, ieee, convolutional, method, pattern, figure, output] [unsupervised, learn, encoder, representation, supervised, transformed, train, generative, variational, aet] [learning, sampling, data, neural, classification, deep, sampled, processing, sample, learned, matrix, set, training, randomly, applied, network, accuracy, machine, group, comparable] [transformation, point, equivariant, conference, cloud, international, local, computer, vision, edgeconv, capture, intrinsic]
@InProceedings{Gao_2020_CVPR,
  author = {Gao, Xiang and Hu, Wei and Qi, Guo-Jun},
  title = {GraphTER: Unsupervised Learning of Graph Transformation Equivariant Representations via Auto-Encoding Node-Wise Transformations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Can Facial Pose and Expression Be Separated With Weak Perspective Camera?
Evangelos Sariyanidi, Casey J. Zampella, Robert T. Schultz, Birkan Tunc


Separating facial pose and expression within images requires a camera model for 3D-to-2D mapping. The weak perspective (WP) camera has been the most popular choice; it is the default, if not the only option, in state-of-the-art facial analysis methods and software. WP camera is justified by the supposition that its errors are negligible when the subjects are relatively far from the camera, yet this claim has never been tested despite nearly 20 years of research. This paper critically examines the suitability of WP camera for separating facial pose and expression. First, we theoretically show that WP causes pose-expression ambiguity, as it leads to estimation of spurious expressions. Next, we experimentally quantify the magnitude of spurious expressions. Finally, we test whether spurious expressions have detrimental effects on a common facial analysis application, namely Action Unit (AU) detection. Contrary to conventional wisdom, we find that severe pose-expression ambiguity exists even when subjects are not close to the camera, leading to large false positive rates in AU detection. We also demonstrate that the magnitude and characteristics of spurious expressions depend on the point distribution model used to model the expressions. Our results suggest that common assumptions about WP need to be revisited in facial expression modeling, and that facial analysis software should encourage and facilitate the use of the true camera model whenever possible.
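For reference (not taken from the paper), the two camera models being compared can be written as below; the weak perspective model replaces the per-point depth Z by a single scale s = f/Z_0, and it is the depth variation across the face discarded by this approximation that can be absorbed as spurious expression.

```latex
% Full perspective vs. weak perspective projection of a 3D point (X, Y, Z);
% f is the focal length and Z_0 a reference (average) depth of the face.
\begin{aligned}
\text{perspective:}\quad &
\begin{pmatrix} x \\ y \end{pmatrix} = \frac{f}{Z}\begin{pmatrix} X \\ Y \end{pmatrix},\\
\text{weak perspective:}\quad &
\begin{pmatrix} x \\ y \end{pmatrix} \approx s \begin{pmatrix} X \\ Y \end{pmatrix},
\qquad s = \frac{f}{Z_0}.
\end{aligned}
```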
[action, software] [false, positive, head] [facial, expression, spurious, model, face, pdm, magnitude, neutral, yaw, itwmm, pitch, fovs, study, pdms, true, morphable, theoretically, subject, variation, analyze, interocular, diod, landmark] [analysis, pattern, ieee, figure, conventional, separation, separating, generally] [mapping, image, corresponding, minimizing, separate, synthesized] [close, large, set, average, matrix, approximation, optimization, machine, size, theorem, rate, experimentally, increase, small, separately, algorithm] [camera, pose, rotation, shape, error, computer, fov, projection, perspective, point, distance, defined, estimated, vision, estimation, computed, ambiguity, conference, rigid, estimate, assume, axis]
@InProceedings{Sariyanidi_2020_CVPR,
  author = {Sariyanidi, Evangelos and Zampella, Casey J. and Schultz, Robert T. and Tunc, Birkan},
  title = {Can Facial Pose and Expression Be Separated With Weak Perspective Camera?},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Probabilistic Regression for Visual Tracking
Martin Danelljan, Luc Van Gool, Radu Timofte


Visual tracking is fundamentally the problem of regressing the state of the target in each video frame. While significant progress has been achieved, trackers are still prone to failures and inaccuracies. It is therefore crucial to represent the uncertainty in the target estimation. Although current prominent paradigms rely on estimating a state-dependent confidence score, this value lacks a clear probabilistic interpretation, complicating its use. In this work, we therefore propose a probabilistic regression formulation and apply it to tracking. Our network predicts the conditional probability density of the target state given an input image. Crucially, our formulation is capable of modeling label noise stemming from inaccurate annotations and ambiguities in the task. The regression network is trained by minimizing the Kullback-Leibler divergence. When applied for tracking, our formulation not only allows a probabilistic representation of the output, but also substantially improves the performance. Our tracker sets a new state-of-the-art on six datasets, achieving 59.8% AUC on LaSOT and 75.8% Success on TrackingNet. The code and models are available at https://github.com/visionml/pytracking.
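A minimal sketch of this kind of objective, under the assumptions that the target state is discretized on a grid and the annotation noise is modeled as a Gaussian label distribution (grid size, bandwidth, and the stand-in score vector are illustrative, not the paper's parameterization):

```python
# Hedged sketch: the network outputs unnormalized scores over candidate states,
# a softmax turns them into a predicted density, and training minimizes the KL
# divergence to a Gaussian label distribution that models annotation noise.
import numpy as np

grid = np.linspace(-1.0, 1.0, 101)           # discretized candidate states y

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_loss(scores, y_true, sigma=0.05):
    p_pred = softmax(scores)                 # predicted density p(y | x)
    label = np.exp(-0.5 * ((grid - y_true) / sigma) ** 2)
    label /= label.sum()                     # Gaussian label distribution
    return np.sum(label * (np.log(label + 1e-12) - np.log(p_pred + 1e-12)))

scores = np.random.randn(grid.size)          # stand-in for the network output
print(kl_loss(scores, y_true=0.2))
```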
[visual, state, predict, prediction, frame, previous] [regression, tracking, center, bounding, box, confidence, object, dimp, predicted, tracker, bbr, overlap, table, correlation, branch, martin, siamese, employed, tcr, atom, annotation, benchmark, opt, achieves, employ, goutam, fahad] [model, auc, trained, success, noise, robust, input] [output, figure, ieee, pattern, based, proposed, convolutional, gaussian, analysis] [target, loss, pseudo, image, conditional, representation, discriminative] [network, label, probabilistic, learning, distribution, training, density, function, baseline, deep, set, probability, task, performance, standard, problem, general, applied, large, sampling, average, strategy, space, negative] [approach, computer, vision, formulation, uncertainty, conference, coordinate, michael, regressing, accurate, grid]
@InProceedings{Danelljan_2020_CVPR,
  author = {Danelljan, Martin and Gool, Luc Van and Timofte, Radu},
  title = {Probabilistic Regression for Visual Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
3DRegNet: A Deep Neural Network for 3D Point Registration
G. Dias Pais, Srikumar Ramalingam, Venu Madhav Govindu, Jacinto C. Nascimento, Rama Chellappa, Pedro Miraldo


We present 3DRegNet, a novel deep learning architecture for the registration of 3D scans. Given a set of 3D point correspondences, we build a deep neural network to address the following two challenges: (i) classification of the point correspondences into inliers/outliers, and (ii) regression of the motion parameters that align the scans into a common reference frame. With regard to regression, we present two alternative approaches: (i) a Deep Neural Network (DNN) registration and (ii) a Procrustes approach using SVD to estimate the transformation. Our correspondence-based approach achieves a higher speedup compared to competing baselines. We further propose the use of a refinement network, which consists of a smaller 3DRegNet as a refinement to improve the accuracy of the registration. Extensive experiments on two challenging datasets demonstrate that we outperform other methods and achieve state-of-the-art results. The code is available.
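The SVD-based Procrustes alternative mentioned above is the classical closed-form rigid alignment from correspondences; a minimal NumPy version (not the 3DRegNet code, which works on network-weighted correspondences) is:

```python
# Kabsch/Procrustes alignment: recover the rigid transform (R, t) that maps
# matched 3D points P onto Q in the least-squares sense.
import numpy as np

def procrustes_rigid(P, Q):
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                        # enforce det(R) = +1 (no reflection)
    t = cQ - R @ cP
    return R, t

theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.3])
P = np.random.rand(100, 3)
Q = P @ R_true.T + t_true
R, t = procrustes_rigid(P, Q)
print(np.allclose(R, R_true), np.allclose(t, t_true))
```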
[recognition, pair, previous, time, evaluation, three, connected, extract] [refinement, regression, global] [input, robust, dnns, original, trained] [ieee, pattern, proposed, method, block, figure, analysis, based, output, motion, comparison, fast, convolutional] [loss, translation, image, alignment, unseen] [classification, deep, network, learning, training, number, neural, problem, set, accuracy, data, better, total, machine, architecture, pairwise, function, andrew, achieve, computing] [point, registration, computer, vision, rotation, icp, fgr, transformation, cloud, median, approach, pose, second, scan, ransac, distance, inliers, estimation, local, lie, intelligence, procrustes, computed, minimal, solution, algebra, error, closest, correspondence, inlier, solving, compute]
@InProceedings{Pais_2020_CVPR,
  author = {Pais, G. Dias and Ramalingam, Srikumar and Govindu, Venu Madhav and Nascimento, Jacinto C. and Chellappa, Rama and Miraldo, Pedro},
  title = {3DRegNet: A Deep Neural Network for 3D Point Registration},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Compressed Volumetric Heatmaps for Multi-Person 3D Pose Estimation
Matteo Fabbri, Fabio Lanzi, Simone Calderara, Stefano Alletto, Rita Cucchiara


In this paper we present a novel approach for bottom-up multi-person 3D human pose estimation from monocular RGB images. We propose to use high resolution volumetric heatmaps to model joint locations, devising a simple and effective compression method to drastically reduce the size of this representation. At the core of the proposed method lies our Volumetric Heatmap Autoencoder, a fully-convolutional network tasked with the compression of ground-truth heatmaps into a dense intermediate representation. A second model, the Code Predictor, is then trained to predict these codes, which can be decompressed at test time to re-obtain the original representation. Our experimental evaluation shows that our method performs favorably when compared to state of the art on both multi-person and single-person 3D human pose estimation datasets and, thanks to our novel compression strategy, can process full-HD images at the constant runtime of 8 fps regardless of the number of subjects in the scene. Code and models are publicly available.
[people, recognition, dataset, multiple, predict, natural, time, state, decoder, composed] [location, panoptic, heatmap, propose, detection, art, table] [heatmaps, trained, model] [ieee, pattern, method, compression, comparison, convolutional, pixel, compressed, proposed, relu] [representation, code, person, image, corresponding, loss, autoencoder] [training, network, set, learning, size, predictor, simple, test, number, deep, best, neural, data, note, architecture] [pose, volumetric, conference, computer, vision, human, joint, hpe, single, estimation, jta, ground, international, body, truth, european, approach, monocular, loco, directly, vha, camera, additional, shape, mpjpe, rgb, estimate, distance, demonstrate, root, well, coordinate, depth]
@InProceedings{Fabbri_2020_CVPR,
  author = {Fabbri, Matteo and Lanzi, Fabio and Calderara, Simone and Alletto, Stefano and Cucchiara, Rita},
  title = {Compressed Volumetric Heatmaps for Multi-Person 3D Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Three-Dimensional Reconstruction of Human Interactions
Mihai Fieraru, Mihai Zanfir, Elisabeta Oneata, Alin-Ionut Popa, Vlad Olaru, Cristian Sminchisescu


Understanding 3d human interactions is fundamental for fine grained scene analysis and behavioural modeling. However, most of the existing models focus on analyzing a single person in isolation, and those who process several people focus largely on resolving multi-person data association, rather than inferring interactions. This may lead to incorrect, lifeless 3d estimates, that miss the subtle human contact aspects--the essence of the event--and are of little use for detailed behavioral understanding. This paper addresses such issues and makes several contributions: (1) we introduce models for interaction signature estimation (ISP) encompassing contact detection, segmentation, and 3d contact signature prediction; (2) we show how such components can be leveraged in order to produce augmented losses that ensure contact consistency during 3d reconstruction; (3) we construct several large datasets for learning and evaluating 3d contact prediction and reconstruction methods; specifically, we introduce CHI3D, a lab-based accurate 3d motion capture dataset with 631 sequences containing 2,525 contact events, 728,664 ground truth 3d poses, as well as FlickrCI3D, a dataset of 11,216 images, with 14,081 processed pairs of people, and 81,233 facet-level surface correspondences within 138,213 selected contact regions. Finally, (4) we present models and baselines to illustrate how contact estimation supports meaningful 3d reconstruction where essential interactions are captured. Models and data are made available for research purposes at http://vision.imar.ro/ci3d.
[people, interaction, prediction, dataset, multiple, action, video, connected, order, visual, social, involved] [segmentation, region, annotation, annotated, table, fully, semantic, map, level] [signature, physical, datasets, model] [motion, analysis, figure, method, proposed, performed] [image, person, consistency, alignment, train, loss, introduce] [close, data, set, learning, large, deep, evaluate, number, performance, selected, energy, task] [contact, human, pose, body, estimation, reconstruction, shape, facet, capture, surface, interacting, mesh, michael, truth, single, ground, well, monocular, hand, rgb, acets, distance, estimated, full, scene, accurate, correspondence]
@InProceedings{Fieraru_2020_CVPR,
  author = {Fieraru, Mihai and Zanfir, Mihai and Oneata, Elisabeta and Popa, Alin-Ionut and Olaru, Vlad and Sminchisescu, Cristian},
  title = {Three-Dimensional Reconstruction of Human Interactions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Distribution-Induced Bidirectional Generative Adversarial Network for Graph Representation Learning
Shuai Zheng, Zhenfeng Zhu, Xingxing Zhang, Zhizhe Liu, Jian Cheng, Yao Zhao


Graph representation learning aims to encode all nodes of a graph into low-dimensional vectors that will serve as input to many computer vision tasks. However, most existing algorithms ignore the existence of the inherent data distribution and even noise. This may significantly increase the phenomenon of over-fitting and deteriorate the testing accuracy. In this paper, we propose a Distribution-induced Bidirectional Generative Adversarial Network (named DBGAN) for graph representation learning. Instead of the widely used Gaussian assumption, the prior distribution of the latent representation in our DBGAN is estimated in a structure-aware way, which implicitly bridges the graph and content spaces by prototype learning. Thus discriminative and robust representations are generated for all nodes. Furthermore, to improve their generalization ability while preserving representation ability, the sample-level and distribution-level consistency are well balanced via a bidirectional adversarial learning framework. An extensive group of experiments is then carefully designed and presented, demonstrating that our DBGAN obtains a remarkably more favorable trade-off between representation and robustness than currently available alternatives in various tasks, while also being dimension-efficient.
[graph, node, bidirectional, adjacency, link, prediction, hidden, three, artificial, dataset, social, represent, evaluation] [feature, denotes, jian, table, framework, china, propose, effectiveness, improvement] [adversarial, generalization, model, improve, original, auc, robustness] [prior, method, spectral, raw, output, figure, achieved] [representation, latent, dbgan, ability, consistency, encoder, arga, generative, preserving, prototype, unsupervised, mapping, gae, gala, preserve, loss, generator, pubmed, sigkdd, discovery, deepwalk, fake, aidw, asp] [learning, distribution, data, matrix, network, clustering, learned, set, cora, performance, dimension, citeseer, knowledge, arxiv, preprint, deep, algorithm, machine, probability] [conference, international, structure, reconstruction, acm, estimation, normal, well, reconstructed, defined]
@InProceedings{Zheng_2020_CVPR,
  author = {Zheng, Shuai and Zhu, Zhenfeng and Zhang, Xingxing and Liu, Zhizhe and Cheng, Jian and Zhao, Yao},
  title = {Distribution-Induced Bidirectional Generative Adversarial Network for Graph Representation Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Minimal Solvers for 3D Scan Alignment With Pairs of Intersecting Lines
Andre Mateus, Srikumar Ramalingam, Pedro Miraldo


We explore the possibility of using line intersection constraints for 3D scan registration. Typical 3D registration algorithms exploit point and plane correspondences, while line intersection constraints have not been used in the context of 3D scan registration before. Constraints from a match of pairs of intersecting lines in two 3D scans can be seen as two 3D line intersections, a plane correspondence, and a point correspondence. In this paper, we present minimal solvers that combine these different types of constraints: 1) three line intersections and one point match; 2) one line intersection and two point matches; 3) three line intersections and one plane match; 4) one line intersection and two plane matches; and 5) one line intersection, one point match, and one plane match. To use all the available solvers, we present a hybrid RANSAC loop. We propose a non-linear refinement technique using all the inliers obtained from the RANSAC. Extensive experiments with simulated data and two real-world datasets show that the use of these features and the combined solvers improves accuracy. The code is available.
[recognition, three, frame, pair, multiple] [refinement, predefined, global] [robust] [ieee, pattern, method, presented, proposed] [translation, real, consists] [consider, set, selected, number, data, problem, computing, deep, neural, denote, probability, network, algorithm] [point, ransac, vision, computer, plane, registration, minimal, intersection, rotation, solver, scan, intersecting, solution, pose, icp, transformation, second, single, relative, robotics, error, volume, odometry, loop, cloud, camera, estimation, distance, compute, fgr, srikumar, intersect, coordinate, supplementary, daniel, european, hybrid, local, closest, geometric, matching, defined]
@InProceedings{Mateus_2020_CVPR,
  author = {Mateus, Andre and Ramalingam, Srikumar and Miraldo, Pedro},
  title = {Minimal Solvers for 3D Scan Alignment With Pairs of Intersecting Lines},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Wavelet Integrated CNNs for Noise-Robust Image Classification
Qiufu Li, Linlin Shen, Sheng Guo, Zhihui Lai


Convolutional Neural Networks (CNNs) are generally prone to noise interruptions, i.e., small image noise can cause drastic changes in the output. To suppress the effect of noise on the final prediction, we enhance CNNs by replacing max-pooling, strided-convolution, and average-pooling with Discrete Wavelet Transform (DWT). We present general DWT and Inverse DWT (IDWT) layers applicable to various wavelets such as Haar, Daubechies, and Cohen, and design wavelet integrated CNNs (WaveCNets) using these layers for image classification. In WaveCNets, feature maps are decomposed into low-frequency and high-frequency components during the down-sampling. The low-frequency component stores the main information, including the basic object structures, and is transmitted to the subsequent layers to extract robust high-level features. The high-frequency components, containing most of the data noise, are dropped during inference to improve the noise-robustness of the WaveCNets. Our experimental results on ImageNet and ImageNet-C (the noisy version of ImageNet) show that WaveCNets, the wavelet integrated versions of VGG, ResNets, and DenseNet, achieve higher accuracy and better noise-robustness than their vanilla versions.
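A minimal sketch of the down-sampling step, assuming the Haar wavelet (the paper's general layers also cover Daubechies and Cohen wavelets): a feature map is decomposed into one low-frequency band that is passed on and three high-frequency detail bands that can be dropped at inference.

```python
# Single-level 2D Haar DWT of a feature map (orthonormal normalization).
import numpy as np

def haar_dwt2d(x):
    """x: (H, W) with even H and W; returns (LL, LH, HL, HH), each (H/2, W/2)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0               # low-frequency band (kept)
    lh = (a + b - c - d) / 2.0               # detail bands (droppable)
    hl = (a - b + c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

x = np.random.rand(8, 8)
ll, lh, hl, hh = haar_dwt2d(x)
print(ll.shape)                              # (4, 4): passed to the next layer
```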
[extract] [feature, cnn, object, pooling, map, suppress, segmentation, table, camvid, final] [noise, original, improve, input, mce] [dwt, wavelet, idwt, wavecnets, denoising, cnns, noisy, convolutional, segnet, dwtll, transform, integrated, ieee, figure, based, output, signal, relu, pattern, xll, xlh, xhh, haar, mainstream, commonly, method, denoised, xhl, tensor, filtering, downsampling, proposed] [image, component, row, train] [data, network, accuracy, deep, classification, imagenet, neural, training, better, arxiv, preprint, design, size, general, basic, validation, achieve, increase, average, evaluate, test] [conference, computer, vision, structure]
@InProceedings{Li_2020_CVPR,
  author = {Li, Qiufu and Shen, Linlin and Guo, Sheng and Lai, Zhihui},
  title = {Wavelet Integrated CNNs for Noise-Robust Image Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Embedding Expansion: Augmentation in Embedding Space for Deep Metric Learning
Byungsoo Ko, Geonmo Gu


Learning the distance metric between pairs of samples has been studied for image retrieval and clustering. With the remarkable success of pair-based metric learning losses, recent works have proposed the use of generated synthetic points on metric learning losses for augmentation and generalization. However, these methods require additional generative networks along with the main network, which can lead to a larger model size, slower training speed, and harder optimization. Meanwhile, post-processing techniques, such as query expansion and database augmentation, have proposed the combination of feature points to obtain additional semantic information. In this paper, inspired by query expansion and database augmentation, we propose an augmentation method in an embedding space for pair-based metric learning losses, called embedding expansion. The proposed method generates synthetic points containing augmented information by a combination of feature points and performs hard negative pair mining to learn with the most informative feature representations. Because of its simplicity and flexibility, it can be used for existing metric learning losses without affecting model size, training speed, or optimization difficulty. Finally, the combination of embedding expansion and representative metric learning losses outperforms the state-of-the-art losses and previous sample generation methods in both image retrieval and clustering tasks. The implementation is publicly available.
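A minimal sketch of the augmentation idea, under the assumption that synthetic embeddings are L2-normalized internal division points between the two members of a pair and that the hardest synthetic point is then mined (the number of points, margin, and mining rule are placeholders):

```python
# Hedged embedding-expansion sketch: interpolate a pair on the unit sphere,
# mine the hardest (closest-to-anchor) negative, and form a triplet-style loss.
import numpy as np

def expand_pair(x1, x2, n_points=2):
    alphas = np.linspace(0.0, 1.0, n_points + 2)[1:-1]   # internal division points
    pts = np.stack([(1 - a) * x1 + a * x2 for a in alphas])
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)

def unit(v):
    return v / np.linalg.norm(v)

anchor, positive = unit(np.random.randn(128)), unit(np.random.randn(128))
neg_a, neg_b = unit(np.random.randn(128)), unit(np.random.randn(128))

candidates = np.vstack([expand_pair(neg_a, neg_b), neg_a, neg_b])
hard_neg = candidates[np.argmin(np.linalg.norm(candidates - anchor, axis=1))]
margin = 0.2
loss = max(0.0, np.linalg.norm(anchor - positive)
                - np.linalg.norm(anchor - hard_neg) + margin)
print(loss)
```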
[embedding, pair, retrieval, structured, time, previous, illustrated, dataset, combining, three] [hard, feature, positive, main, occlusion, easy] [original, model, query, database, adversarial] [proposed, method, figure, ieee, expansion, pattern, combination, comparison, performs, existing] [synthetic, loss, image, generation, generate, generating, generates, generative, train, person, learn] [triplet, learning, metric, negative, training, performance, deep, mining, number, class, space, clustering, sample, lifted, selected, augmentation, internally, set, augmented, log, test, ratio, linear, equal, network, similarity, computing, baseline, increasing, daml, hdml, arxiv, preprint, larger, indicates, online] [conference, computer, vision, distance, additional, dividing, point, international]
@InProceedings{Ko_2020_CVPR,
  author = {Ko, Byungsoo and Gu, Geonmo},
  title = {Embedding Expansion: Augmentation in Embedding Space for Deep Metric Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PropagationNet: Propagate Points to Curve to Learn Structure Information
Xiehe Huang, Weihong Deng, Haifeng Shen, Xiubao Zhang, Jieping Ye


Deep learning techniques have dramatically boosted the performance of face alignment algorithms. However, due to large variability and lack of samples, the alignment problem in unconstrained situations, e.g. large head poses, exaggerated expression, and uneven illumination, is still largely unsolved. In this paper, we explore the intuitions and reasons behind our two proposals, i.e. Propagation Module and Focal Wing Loss, to tackle the problem. Concretely, we present a novel structure-infused face alignment algorithm based on heatmap regression via propagating landmark heatmaps to boundary heatmaps, which provide structure information for further attention map generation. Moreover, we propose a Focal Wing Loss for mining and emphasizing the difficult samples under in-the-wild conditions. In addition, we adopt methods like CoordConv and Anti-aliased CNN from other fields that address the shift variance problem of CNN for face alignment. In extensive experiments on different benchmarks, i.e. WFLW, 300W, and COFW, our method outperforms the state of the art by a significant margin. Our proposed approach achieves 4.05% mean error on WFLW, 2.93% mean error on the 300W full-set, and 3.71% mean error on COFW.
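The abstract does not spell out the Focal Wing Loss itself; as a point of reference, the standard Wing loss (Feng et al., 2018) that it presumably builds on is sketched below, with the focal re-weighting of difficult samples left out.

```python
# Standard Wing loss for landmark regression; w and eps are the usual
# hyper-parameters, and C makes the two branches meet continuously.
import numpy as np

def wing_loss(err, w=10.0, eps=2.0):
    err = np.abs(err)
    C = w - w * np.log(1.0 + w / eps)
    return np.where(err < w, w * np.log(1.0 + err / eps), err - C)

print(wing_loss(np.array([0.5, 5.0, 50.0])))
```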
[attention, shift, connected, order] [boundary, module, propagation, table, cnn, heatmap, regression, feature, head, map, extreme, pooling, detection, localization, adopt] [face, landmark, hourglass, facial, wing, model, heatmaps, lab, nme, awing, wflw, robust, cofw, curve, expression, effective, testset] [ieee, pattern, method, stacked, convolution, figure, based, convolutional, block] [loss, alignment, image, learn, common] [network, sample, large, subset, algorithm, data, training, number, set, deep, performance, function, baseline, larger, neural, size, potential, weight, process, normalization, architecture] [computer, conference, vision, focal, structure, pose, human, shape, ground, truth, error, coordinate, european, distance]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Xiehe and Deng, Weihong and Shen, Haifeng and Zhang, Xiubao and Ye, Jieping},
  title = {PropagationNet: Propagate Points to Curve to Learn Structure Information},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Sequential 3D Human Pose and Shape Estimation From Point Clouds
Kangkan Wang, Jin Xie, Guofeng Zhang, Lei Liu, Jian Yang


This work addresses the problem of 3D human pose and shape estimation from a sequence of point clouds. Existing sequential 3D human shape estimation methods mainly focus on the template model fitting from a sequence of depth images or the parametric model regression from a sequence of RGB images. In this paper, we propose a novel sequential 3D human pose and shape estimation framework from a sequence of point clouds. Specifically, the proposed framework can regress 3D coordinates of mesh vertices at different resolutions from the latent features of point clouds. Based on the estimated 3D coordinates and features at the low resolution, we develop a spatial-temporal mesh attention convolution (MAC) to predict the 3D coordinates of mesh vertices at the high resolution. By assigning specific attentional weights to different neighboring points in the spatial and temporal domains, our spatial-temporal MAC can capture structured spatial and temporal features of point clouds. We further generalize our framework to the real data of human bodies with a weakly supervised fine-tuning method. The experimental results on SURREAL, Human3.6M, DFAUST and the real detailed data demonstrate that the proposed approach can accurately recover the 3D body model sequence from a sequence of point clouds.
[temporal, attention, sequence, sequential, attentional, frame, predict, structured] [template, regression, feature, predicted, table, framework] [model, input] [method, spatial, ieee, convolution, pattern, proposed, convolutional, based, recover, consecutive, neighboring, motion, color, captured] [real, corresponding, generate, loss, image] [data, mac, network, learning, number, accuracy, large, training, set, test, sampled] [mesh, point, human, body, vertex, depth, computer, conference, shape, estimation, vision, pose, smpl, reconstruction, single, capture, michael, dfaust, detailed, fitting, recovery, parametric, accurately, international, error, estimated, local, coordinate, gerard, correspondence, recovered, directly, kanazawa]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Kangkan and Xie, Jin and Zhang, Guofeng and Liu, Lei and Yang, Jian},
  title = {Sequential 3D Human Pose and Shape Estimation From Point Clouds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Improving the Robustness of Capsule Networks to Image Affine Transformations
Jindong Gu, Volker Tresp


Convolutional neural networks (CNNs) achieve translational invariance by using pooling operations. However, these operations do not preserve the spatial relationships in the learned representations. Hence, CNNs cannot extrapolate to various geometric transformations of inputs. Recently, Capsule Networks (CapsNets) have been proposed to tackle this problem. In CapsNets, each entity is represented by a vector and routed to high-level entity representations by a dynamic routing algorithm. CapsNets have been shown to be more robust than CNNs to affine transformations of inputs. However, there is still a huge gap between their performance on transformed inputs and on untransformed versions. In this work, we first revisit the routing procedure by (un)rolling its forward and backward passes. Our investigation reveals that the routing procedure contributes neither to the generalization ability nor to the affine robustness of the CapsNets. Furthermore, we explore the limitations of capsule transformations and propose affine CapsNets (Aff-CapsNets), which are more robust to affine transformations. On our benchmark task, where models are trained on the MNIST dataset and tested on the AffNIST dataset, our Aff-CapsNets improve the benchmark performance by a large margin (from 79% to 93.21%), without using any routing mechanism.
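For reference, the routing procedure being (un)rolled here is the routing-by-agreement of Sabour et al. (2017); a minimal sketch (shapes and iteration count are illustrative):

```python
# Dynamic routing by agreement: coupling coefficients are iteratively refined
# toward high-level capsules whose outputs agree with the prediction vectors.
import numpy as np

def squash(s, axis=-1):
    n2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + 1e-9)

def dynamic_routing(u_hat, n_iter=3):
    """u_hat: (n_low, n_high, dim) prediction vectors from low-level capsules."""
    b = np.zeros(u_hat.shape[:2])                       # routing logits b_ij
    for _ in range(n_iter):
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)               # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)          # weighted sum per capsule
        v = squash(s)                                   # high-level capsule outputs
        b = b + np.einsum('ijd,jd->ij', u_hat, v)       # agreement update
    return v

print(dynamic_routing(np.random.randn(32, 10, 16)).shape)   # (10, 16)
```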
[dataset, attention, mechanism, work, visual, entity, current, activity, connected] [cnn, propose, benchmark, table] [robustness, mnist, robust, iterative, input, model, primary, generalization, trained, improve, difference, investigation] [routing, capsnets, affine, capsule, dynamic, coupling, capsnet, figure, agreement, affnist, convolutional, proposed, cnns, parallel, contributes, untransformed] [image, train, corresponding, ability, target, loss, visualize, meaningful, transformed] [procedure, test, training, performance, cij, process, matrix, learning, layer, achieve, learned, accuracy, standard, investigate, neural, vector, forward, equation, gradient, class, network, classification, observe, set, large] [transformation, novel, local, coordinate]
@InProceedings{Gu_2020_CVPR,
  author = {Gu, Jindong and Tresp, Volker},
  title = {Improving the Robustness of Capsule Networks to Image Affine Transformations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Noise Modeling, Synthesis and Classification for Generic Object Anti-Spoofing
Joel Stehouwer, Amin Jourabloo, Yaojie Liu, Xiaoming Liu


Printed photographs and replayed videos of biometric modalities, such as iris, fingerprint and face, are common attacks used to fool recognition systems into granting access as the genuine user. With the growing popularity of online person-to-person shopping (e.g., Ebay and Craigslist), such attacks also threaten those services, where the online photo illustration might be captured not from the real item but from paper or a digital screen. Thus, the study of anti-spoofing should be extended from modality-specific solutions to generic-object-based ones. In this work, we define and tackle the problem of Generic Object Anti-Spoofing (GOAS) for the first time. One significant cue to detect these attacks is the noise patterns introduced by the capture sensors and spoof mediums. Different sensor/medium combinations can result in diverse noise patterns. We propose a GAN-based architecture to synthesize and identify the noise patterns from seen and unseen medium/sensor combinations. We show that the procedures of synthesis and identification are mutually beneficial. We further demonstrate that the learned GOAS models can directly contribute to modality-specific anti-spoofing without domain transfer. The code and GOSet dataset are available at cvlab.cse.msu.edu/project-goas.html.
[recognition, dataset, modeling, work, modality, three, visual] [object, final, detection, table] [spoof, noise, face, golab, live, medium, gogen, goset, model, generic, gopad, trained, input, hter, eer, identification, godisc, auc, generalization, atoum, biometric, digital, study, spoofing, antispoofing, collected, boulkenafet, security, fingerprint] [sensor, ieee, pattern, proposed, captured, output, figure, based, method, patch] [image, train, loss, real, synthetic, synthesis, synthesize, generator, texture, unseen, specific, domain, discriminator] [data, training, performance, network, learning, test, deep, classification, algorithm, accuracy, problem, learned, binary, size] [conference, computer, vision, camera, international, additional, collection]
@InProceedings{Stehouwer_2020_CVPR,
  author = {Stehouwer, Joel and Jourabloo, Amin and Liu, Yaojie and Liu, Xiaoming},
  title = {Noise Modeling, Synthesis and Classification for Generic Object Anti-Spoofing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Quaternion Product Units for Deep Learning on 3D Rotation Groups
Xuan Zhang, Shaofei Qin, Yi Xu, Hongteng Xu


We propose a novel quaternion product unit (QPU) to represent data on 3D rotation groups. The QPU leverages quaternion algebra and the law of 3D rotation group, representing 3D rotation data as quaternions and merging them via a weighted chain of Hamilton products. We prove that the representations derived by the proposed QPU can be disentangled into "rotation-invariant" features and "rotation-equivariant" features, respectively, which supports the rationality and the efficiency of the QPU in theory. We design quaternion neural networks based on our QPUs and make our models compatible with existing deep learning models. Experiments on both synthetic and real-world data show that the proposed QPU is beneficial for the learning tasks requiring rotation robustness.
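A minimal reference implementation of the Hamilton product that a QPU chains together; the learned per-input power weighting described in the paper is omitted here.

```python
# Hamilton product of two quaternions stored as (w, x, y, z).
import numpy as np

def hamilton(q, p):
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = p
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

q = np.array([np.cos(0.1), np.sin(0.1), 0.0, 0.0])   # rotation about the x-axis
p = np.array([np.cos(0.2), 0.0, np.sin(0.2), 0.0])   # rotation about the y-axis
r = hamilton(q, p)                                   # composed rotation
print(r, np.linalg.norm(r))                          # unit norm is preserved
```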
[unit, action, skeleton, dataset, graph, represent, three, recognition, node, time, composed, multiple] [feature, edge, map, propose, apply] [input, model, testing, robustness] [output, proposed, figure, based, existing, convolutional] [real, representation, synthetic, corresponding, generate] [qpu, hamilton, product, layer, learning, data, deep, chain, neural, requires, weighted, qmlp, training, vector, standard, random, matrix, gradient, design, achieve, computational, network, group, function, set, note, connect, forward, classification, accuracy, power, scalar, weighting] [rotation, quaternion, point, human, imaginary, joint, compute, hand, represented, additional, cloud, representing]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Xuan and Qin, Shaofei and Xu, Yi and Xu, Hongteng},
  title = {Quaternion Product Units for Deep Learning on 3D Rotation Groups},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Representation Learning for Gaze Estimation
Yu Yu, Jean-Marc Odobez


Although automatic gaze estimation is very important to a large variety of application areas, it is difficult to train accurate and robust gaze models, in great part due to the difficulty in collecting large and diverse data (annotating 3D gaze is expensive and existing datasets use different setups). To address this issue, our main contribution in this paper is to propose an effective approach to learning a low-dimensional gaze representation without gaze annotations, which, to the best of our knowledge, is the first work to do so. The main idea is to rely on a gaze redirection network and use the gaze representation difference of the input and target images (of the redirection network) as the redirection variable. A redirection loss in the image domain allows the joint training of both the redirection network and the gaze representation network. In addition, we propose a warping field regularization which not only provides an explicit physical meaning to the gaze representations but also avoids redirection distortions. Promising results on few-shot gaze estimation (competitive results can be achieved with as few as <= 100 calibration samples), cross-dataset gaze estimation, gaze network pretraining, and another task (head pose estimation) demonstrate the validity of our framework.
[work, recognition, dataset, three, predict] [head, framework, main, tracking, feature, apply, resnet] [gaze, eye, redirection, model, input, columbia, eyediap, trained, datasets, ired, difference, physical, face, robust, meaning, pitch, utmultiview, yaw, iti] [warping, field, ieee, pattern, method, proposed, figure, based, motion, analysis, output] [representation, unsupervised, image, loss, train, learn, target, extracted, adaptation, appearance, pretrained] [network, data, learning, training, regularization, performance, linear, better, large, note, layer, architecture, machine, deep, distribution, randomly] [estimation, conference, computer, vision, pose, approach, calibration, european, international, well, geometric, ground, accurate, limg, acm]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Yu and Odobez, Jean-Marc},
  title = {Unsupervised Representation Learning for Gaze Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
P-nets: Deep Polynomial Neural Networks
Grigorios G. Chrysos, Stylianos Moschoglou, Giorgos Bouritsas, Yannis Panagakis, Jiankang Deng, Stefanos Zafeiriou


Deep Convolutional Neural Networks (DCNNs) are currently the method of choice for both generative and discriminative learning in computer vision and machine learning. The success of DCNNs can be attributed to the careful selection of their building blocks (e.g., residual blocks, rectifiers, sophisticated normalization schemes, to mention but a few). In this paper, we propose Pi-Nets, a new class of DCNNs. Pi-Nets are polynomial neural networks, i.e., the output is a high-order polynomial of the input. Pi-Nets can be implemented using a special kind of skip connection, and their parameters can be represented via high-order tensors. We empirically demonstrate that Pi-Nets have better representation power than standard DCNNs, and they even produce good results without the use of non-linear activation functions in a large battery of tasks and signals, i.e., images, graphs, and audio. When used in conjunction with activation functions, Pi-Nets produce state-of-the-art results in challenging tasks, such as image generation. Lastly, our framework elucidates why recent generative models, such as StyleGAN, improve upon their predecessors, e.g., ProGAN.
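A hedged sketch of the core construction (not the exact Pi-Net parameterization): a degree-N polynomial of the input built by repeatedly taking a Hadamard product between a linear transform of the input and the running representation, with a skip connection and no elementwise activations.

```python
# Toy polynomial network: each step raises the polynomial degree of the input
# by one via a Hadamard product plus skip connection; all names are placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, degree = 16, 32, 3
U = [rng.standard_normal((d_hid, d_in)) * 0.1 for _ in range(degree)]
C = rng.standard_normal((1, d_hid)) * 0.1        # final linear head

def poly_net(z):
    x = U[0] @ z
    for n in range(1, degree):
        x = (U[n] @ z) * x + x                   # degree grows multiplicatively
    return C @ x

print(poly_net(rng.standard_normal(d_in)))
```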
[order, recognition, hierarchical, graph, work] [table, resnet, improvement] [input, model, adversarial, improve, kind] [residual, output, pattern, skip, method, figure, expansion, tensor, convolutional, block, proposed, learnable, scale, recursive] [image, generative, generation, discriminative, representation, stylegan, generator, latent] [neural, learning, activation, deep, network, product, linear, training, performance, prodpoly, function, brns, classification, processing, arxiv, preprint, machine, approximation, implemented, special, power, layer, applied, expressivity, srns, ncp, large, space, imagenet, schematic, expressive, note, accuracy, czt, experiment] [polynomial, conference, international, computer, vision, single, demonstrate, decomposition, well, error]
@InProceedings{Chrysos_2020_CVPR,
  author = {Chrysos, Grigorios G. and Moschoglou, Stylianos and Bouritsas, Giorgos and Panagakis, Yannis and Deng, Jiankang and Zafeiriou, Stefanos},
  title = {P-nets: Deep Polynomial Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hierarchically Robust Representation Learning
Qi Qian, Juhua Hu, Hao Li


With the tremendous success of deep learning in visual tasks, the representations extracted from intermediate layers of learned models, that is, deep features, attract much attention of researchers. Previous empirical analysis shows that those features can contain appropriate semantic information. Therefore, with a model trained on a large-scale benchmark data set (e.g., ImageNet), the extracted features can work well on other tasks. In this work, we investigate this phenomenon and demonstrate that deep features can be suboptimal due to the fact that they are learned by minimizing the empirical risk. When the data distribution of the target task is different from that of the benchmark data set, the performance of deep features can degrade. Hence, we propose a hierarchically robust optimization method to learn more generic features. Considering the example-level and concept-level robustness simultaneously, we formulate the problem as a distributionally robust optimization problem with Wasserstein ambiguity set constraints, and an efficient algorithm with the conventional training pipeline is proposed. Experiments on benchmark data sets demonstrate the effectiveness of the robust deep representations.
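As a generic point of reference for the formulation above (the paper's hierarchical, concept-level objective is more involved), a distributionally robust objective with a Wasserstein ambiguity set has the form:

```latex
% Worst-case risk over a Wasserstein ball of radius \epsilon around the
% empirical distribution \hat{P}_n (generic form, not the paper's exact objective).
\min_{\theta}\;\sup_{Q:\,W(Q,\hat{P}_n)\le\epsilon}\;
\mathbb{E}_{(x,y)\sim Q}\big[\ell(\theta;x,y)\big]
```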
[visual, hierarchically, work, step, hierarchical] [benchmark, feature, table, propose] [robust, model, robustness, adversarial, generic, difference, svmerm, original, svmel, svmcl, concept, dnns, example, trained, yik] [proposed, comparison, conventional, demonstrates, figure] [image, learn, wasserstein, extracted, target, specific] [deep, distribution, learning, data, set, performance, learned, problem, training, imagenet, gradient, optimization, empirical, erm, rate, min, algorithm, variance, max, size, accuracy, better, theorem, task, parameter, sop, augmented, svmhrrl, efficient, class, large, augmentation, larger, neural, layer, objective, optimize, regularization, confirms, denote, function] [ambiguity, well, distance, demonstrate, pipeline, consistent]
@InProceedings{Qian_2020_CVPR,
  author = {Qian, Qi and Hu, Juhua and Li, Hao},
  title = {Hierarchically Robust Representation Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
How Useful Is Self-Supervised Pretraining for Visual Tasks?
Alejandro Newell, Jia Deng


Recent advances have spurred incredible progress in self-supervised pretraining for vision. We investigate what factors may play a role in the utility of these pretraining methods for practitioners. To do this, we evaluate various self-supervised algorithms across a comprehensive array of synthetic datasets and downstream tasks. We prepare a suite of synthetic data that enables an endless supply of annotated images as well as full control over dataset difficulty. Our experiments offer insights into how the utility of self-supervision changes as the number of available labels grows as well as how the utility changes as a function of the downstream task and the properties of the training data. We also find that linear evaluation does not correlate with finetuning performance. Code and data are available at github.com/princeton-vl/selfstudy.
[downstream, visual, dataset, evaluation, work, predicting, multiple, abhinav] [object, semantic, feature, benchmark] [model, trained, datasets, variation, change, finetuned, chosen] [figure, ieee, color, pattern, method] [utility, image, synthetic, unsupervised, train, pretrained, common, control, selfsupervised, representation, produce, texture, cmc] [performance, pretraining, learning, labeled, accuracy, number, data, arxiv, preprint, classification, linear, task, network, training, random, report, finetuning, baseline, large, deep, better, fixed, measure, best, scratch, complexity, contrastive, imagenet, amount, distribution, amdim, investigate] [computer, conference, vision, well, viewpoint, pose, additional, dense, render, international, variety, depth]
@InProceedings{Newell_2020_CVPR,
  author = {Newell, Alejandro and Deng, Jia},
  title = {How Useful Is Self-Supervised Pretraining for Visual Tasks?},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Copy and Paste GAN: Face Hallucination From Shaded Thumbnails
Yang Zhang, Ivor W. Tsang, Yawei Luo, Chang-Hui Hu, Xiaobo Lu, Xin Yu


Existing face hallucination methods based on convolutional neural networks (CNN) have achieved impressive performance on low-resolution (LR) faces in a normal illumination condition. However, their performance degrades dramatically when LR faces are captured in low or non-uniform illumination conditions. This paper proposes a Copy and Paste Generative Adversarial Network (CPGAN) to recover authentic high-resolution (HR) face images while compensating for low and non-uniform illumination. To this end, we develop two key components in our CPGAN: internal and external Copy and Paste nets (CPnets). Specifically, our internal CPnet exploits facial information residing in the input image to enhance facial details; while our external CPnet leverages an external HR face for illumination compensation. A new illumination compensation loss is thus developed to capture illumination from the external guided face image effectively. Furthermore, our method offsets illumination and upsamples facial details alternatively in a coarse-to-fine fashion, thus alleviating the correspondence ambiguity between LR inputs and external HR inputs. Extensive experiments demonstrate that our method manifests authentic HR face images in a uniform illumination condition and outperforms state-of-the-art methods qualitatively and quantitatively.
[dataset, work] [guided, feature, propose, module, china, represents] [face, facial, input, external, internal, hallucination, cpnet, copy, cpgan, paste, adversarial, fatih, tdae, hallucinate, fhc, unaligned, hallucinating, authentic, tiny] [illumination, compensation, proposed, result, pattern, ieee, xin, method, figure, psnr, output, ssim, based, block, rain, enhance, low, upsampling, compensate, spatial, bicubic, srgan, adaptive, upsample, adopted] [image, loss, style, celeba, generate, transfer, generative, generated, translation] [network, training, deep, normalization, performance, randomly, neural, uniform, learning, data] [computer, vision, conference, normal, lighting, shading, international, human, well]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Yang and Tsang, Ivor W. and Luo, Yawei and Hu, Chang-Hui and Lu, Xiaobo and Yu, Xin},
  title = {Copy and Paste GAN: Face Hallucination From Shaded Thumbnails},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TailorNet: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style
Chaitanya Patel, Zhouyingcheng Liao, Gerard Pons-Moll


In this paper, we present TailorNet, a neural model which predicts clothing deformation in 3D as a function of three factors: pose, shape and style (garment geometry), while retaining wrinkle detail. This goes beyond prior models, which are either specific to one style and shape, or generalize to different shapes producing smooth results, despite being style specific. Our hypothesis is that (even non-linear) combinations of examples smoothes out high frequency components such as fine-wrinkles, which makes learning the three factors jointly hard. At the heart of our technique is a decomposition of deformation into a high frequency and a low frequency component. While the low-frequency component is predicted from pose, shape and style parameters with an MLP, the high-frequency component is predicted with a mixture of shape-style specific pose models. The weights of the mixture are computed with a narrow bandwidth kernel to guarantee that only predictions with similar high-frequency patterns are combined. The style variation is obtained by computing, in a canonical pose, a subspace of deformation, which satisfies physical constraints such as inter-penetration, and draping on the body. TailorNet delivers 3D garments which retain the wrinkles from the physics based simulations (PBS) it is learned from, while running more than 1000 times faster. In contrast to classical PBS, TailorNet is easy to use and fully differentiable, which is crucial for computer vision and learning algorithms. Several experiments demonstrate TailorNet produces more realistic results than prior work, and even generates temporally coherent deformations on sequences of the AMASS dataset, despite being trained on static poses from a different dataset. To stimulate further research in this direction, we will make a dataset consisting of 55800 frames, as well as our model publicly available at https://virtualhumans.mpi-inf.mpg.de/tailornet/.
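A minimal sketch of the narrow-bandwidth kernel mixture used for the high-frequency component, with placeholder bandwidth, style descriptors, and per-model outputs (each "model" would in practice be a shape/style-specific pose network):

```python
# RBF-kernel mixture: only models whose style anchors are close to the query
# contribute, so dissimilar high-frequency wrinkle patterns are not averaged out.
import numpy as np

def mixture_hf(query_style, anchor_styles, anchor_preds, bandwidth=0.2):
    """anchor_styles: (K, d); anchor_preds: (K, V, 3) per-model displacements."""
    d2 = np.sum((anchor_styles - query_style) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))        # narrow-bandwidth kernel
    w /= w.sum()
    return np.einsum('k,kvc->vc', w, anchor_preds)  # blended HF displacements

K, d, V = 20, 4, 100
styles = np.random.randn(K, d)
preds = np.random.randn(K, V, 3)
query = styles[0] + 0.05 * np.random.randn(d)       # close to the first anchor
print(mixture_hf(query, styles, preds).shape)       # (100, 3)
```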
[static, dataset, predict, video, graph, previous, people] [predicted, easy] [model, garment, clothing, variation, trained, dependent, physical, christian] [frequency, high, ieee, figure, based, method, low, pattern, simulate, dynamic, kernel] [style, real, specific, learn, generate, fine, component, train, control] [function, mixture, learning, fixed, training, data, space, baseline] [shape, pose, computer, body, vision, conference, tailornet, acm, deformation, smpl, gerard, human, cloth, smooth, single, canonical, simulation, fit, international, pca, vertex, virtual, michael, mesh, error, predicts, well, mlp, capture, wrinkle, despite]
@InProceedings{Patel_2020_CVPR,
  author = {Patel, Chaitanya and Liao, Zhouyingcheng and Pons-Moll, Gerard},
  title = {TailorNet: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Object-Occluded Human Shape and Pose Estimation From a Single Color Image
Tianshu Zhang, Buzhen Huang, Yangang Wang


Occlusions between humans and objects, especially during human-object interactions, are very common in practical applications. However, most existing approaches for 3D human shape and pose estimation require that human bodies be well captured, without occlusions or with only minor self-occlusions. In this paper, we focus on the problem of directly estimating the object-occluded human shape and pose from single color images. Our key idea is to utilize a partial UV map to represent an object-occluded human body, so that full 3D human shape estimation is ultimately converted into an image inpainting problem. We propose a novel two-branch network architecture to train an end-to-end regressor via latent feature supervision, which also includes a novel saliency map sub-net to extract the human information from object-occluded color images. To supervise the network training, we further build a novel dataset named 3DOH50K. Several experiments are conducted to reveal the effectiveness of the proposed method. Experimental results demonstrate that the proposed method achieves state-of-the-art results compared with previous methods. The dataset and code are publicly available at https://www.yangangwang.com.
[dataset, work, explicitly, decoder, build, previous, describe, order] [map, occlusion, occluded, saliency, feature, segmentation, propose, mask, named, effectiveness, supervision, hard] [datasets, face, model, input] [color, method, proposed, figure, existing, comparison, captured, based, convolutional, result, performed, performs] [image, inpainting, representation, encoder, latent, synthetic, real, train, utilize, corresponding] [network, deep, learning, data, training, performance, problem, neural, inference, good] [human, shape, pose, estimation, body, single, smpl, full, mesh, partial, position, novel, accurate, reconstruction, directly, estimate, rgb, detailed, estimating, objectoccluded, demonstrate, monocular, recovering, geometry]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Tianshu and Huang, Buzhen and Wang, Yangang},
  title = {Object-Occluded Human Shape and Pose Estimation From a Single Color Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Recursive Least-Squares Estimator-Aided Online Learning for Visual Tracking
Jin Gao, Weiming Hu, Yan Lu


Online learning is crucial to robust visual object tracking, as it can provide high discrimination power in the presence of background distractors. However, there are two contradictory factors affecting its successful deployment on a real visual tracking platform: the discrimination issue due to the challenges in vanilla gradient descent, which does not guarantee good convergence; and the robustness issue due to over-fitting resulting from excessive updates with limited memory size (the oldest samples are discarded). Despite the many dedicated techniques proposed to address these issues, in this paper we take a new approach to striking a compromise between them based on the recursive least-squares estimation (LSE) algorithm. After connecting each fully-connected layer with LSE separately via normal equations, we further propose an improved mini-batch stochastic gradient descent algorithm for fully-connected network learning with memory retention in a recursive fashion. This characteristic can spontaneously reduce the risk of over-fitting resulting from catastrophic forgetting in excessive online learning. Meanwhile, it can effectively improve convergence even though the cost function is computed over all the training samples that the algorithm has ever seen. We realize this recursive LSE-aided online learning technique in the state-of-the-art RT-MDNet tracker, and the consistent improvements on four challenging benchmarks prove its efficiency without additional offline training or tedious parameter tuning.
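For reference, the classical recursive least-squares (LSE) update that the method builds on is sketched below; the paper's improved mini-batch SGD with memory retention is not reproduced here.

```python
# Recursive least squares with forgetting factor lam: w are the weights,
# P the inverse correlation matrix, x an input feature, d the desired response.
import numpy as np

def rls_update(w, P, x, d, lam=0.99):
    Px = P @ x
    k = Px / (lam + x @ Px)                 # gain vector
    e = d - w @ x                           # a-priori error
    w = w + k * e
    P = (P - np.outer(k, Px)) / lam
    return w, P

n = 8
w, P = np.zeros(n), np.eye(n) * 1e3
for _ in range(200):
    x = np.random.randn(n)
    d = x @ np.arange(n) + 0.01 * np.random.randn()   # toy linear target
    w, P = rls_update(w, P, x, d)
print(np.round(w, 2))                       # approaches [0, 1, ..., 7]
```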
[visual, historical, work, provide] [tracking, tracker, object, overlap, correlation, meem, score, background, leading, instance, eco, challenging] [success, input, robustness, case, model, robust, improve, original, risk, offline] [recursive, based, ieee, pattern, proposed, figure, method, comparison, output] [target, loss, discrimination] [online, learning, memory, update, gradient, training, improved, function, layer, network, algorithm, deep, convergence, set, performance, mbsgd, uli, average, forgetting, updating, rate, weight, sgd, stochastic, descent, retention, catastrophic, classification, derivation, matrix, validation, lse, iteration, continual, ecohc, power, good, number] [normal, cost, term, computed, mlp, system, solving]
@InProceedings{Gao_2020_CVPR,
  author = {Gao, Jin and Hu, Weiming and Lu, Yan},
  title = {Recursive Least-Squares Estimator-Aided Online Learning for Visual Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Supervised Monocular Scene Flow Estimation
Junhwa Hur, Stefan Roth


Scene flow estimation has been receiving increasing attention for 3D environment perception. Monocular scene flow estimation - obtaining 3D structure and 3D motion from two temporally consecutive images - is a highly ill-posed problem, and practical solutions are lacking to date. We propose a novel monocular scene flow method that yields competitive accuracy and real-time performance. By taking an inverse problem view, we design a single convolutional neural network (CNN) that successfully estimates depth and 3D motion simultaneously from a classical optical flow cost volume. We adopt self-supervised learning with 3D loss functions and occlusion reasoning to leverage unlabeled data. We validate our design choices, including the proxy loss and augmentation setup. Our model achieves state-of-the-art accuracy among unsupervised/self-supervised learning approaches to monocular scene flow, and yields competitive results for the optical flow and monocular depth estimation sub-tasks. Semi-supervised fine-tuning further improves the accuracy and yields promising results in real-time.
[decoder, dataset, previous, frame, sequence, evaluation] [table, cnn, occlusion, split, pyramid, improves, ablation, feature, map, adopt] [model, input, trained, study] [flow, optical, disparity, method, motion, reference, competitive, stefan, based, pixel, scale, residual, cnns, output, demonstrates, proposed, exploiting, slfw, figure, consecutive] [loss, image, target, unsupervised, corresponding, train, separate] [accuracy, learning, training, network, augmentation, data, test, problem, proxy] [scene, monocular, depth, estimation, kitti, stereo, camera, single, approach, point, estimate, joint, photometric, geometric, structure, reconstruction, jointly, well, limited, ground, rigid, smoothness, andreas, truth, absolute, correspondence, intrinsics, cost]
@InProceedings{Hur_2020_CVPR,
  author = {Hur, Junhwa and Roth, Stefan},
  title = {Self-Supervised Monocular Scene Flow Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Fast and Robust Target Models for Video Object Segmentation
Andreas Robinson, Felix Jaremo Lawin, Martin Danelljan, Fahad Shahbaz Khan, Michael Felsberg


Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test-time. The main difficulty is to effectively handle appearance changes and similar background objects, while maintaining accurate segmentation. Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame-rates and risk of overfitting. More recent methods integrate generative target appearance models, but either achieve limited robustness or require large amounts of training data. We propose a novel VOS architecture consisting of two network components. The target appearance model consists of a light-weight module, which is learned during the inference stage using fast optimization techniques to predict a coarse but robust target segmentation. The segmentation model is exclusively trained offline, designed to process the coarse scores into high quality segmentation masks. Our method is fast, easily trainable and remains highly effective in cases of limited training data. We perform extensive experiments on the challenging YouTube-VOS and DAVIS datasets. Our network achieves favorable performance, while operating at higher frame-rates compared to state-of-the-art. Code and trained models are available at https://github.com/andr345/frtm-vos.
[video, frame, previous, predict, recognition, dataset] [segmentation, object, davis, feature, score, mask, agame, final, tracking, vos, employ, backbone, table, premvos, background, challenging] [model, trained, robust, offline, employing, input] [ieee, pattern, output, method, based, figure, fast, convolutional, stm, comparison, block, analysis] [target, appearance, image, discriminative, train, generated, generative, extensive, learn] [network, training, data, learning, validation, inference, update, sample, problem, large, learned, compared, online, set, report, optimization, performance, strategy] [computer, conference, vision, approach, coarse, international, accurate, additional, compare, michael, initial, matching, despite]
@InProceedings{Robinson_2020_CVPR,
  author = {Robinson, Andreas and Lawin, Felix Jaremo and Danelljan, Martin and Khan, Fahad Shahbaz and Felsberg, Michael},
  title = {Learning Fast and Robust Target Models for Video Object Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
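The "light-weight module learned during inference" in the VOS paper above can be pictured as a linear model fit to first-frame features and applied to later frames; the PyTorch sketch below uses plain ridge regression with illustrative names (the released code uses a dedicated fast optimizer and a few convolutional layers) to produce coarse target scores that a separately trained refinement network would turn into a mask.

import torch

def fit_target_model(feats, mask, reg=1e-2):
    # Fit a linear target-appearance model W from first-frame backbone features.
    # feats: (C, H, W), mask: (H, W) initial object mask.
    c, h, w = feats.shape
    X = feats.reshape(c, -1).t()                               # (H*W, C)
    y = mask.reshape(-1, 1).float()                            # (H*W, 1)
    A = X.t() @ X + reg * torch.eye(c, dtype=X.dtype, device=X.device)
    return torch.linalg.solve(A, X.t() @ y)                    # (C, 1)

def coarse_scores(feats, W):
    # Apply the target model to features of a new frame to get rough object scores.
    c, h, w = feats.shape
    return (feats.reshape(c, -1).t() @ W).reshape(h, w)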
Reciprocal Learning Networks for Human Trajectory Prediction
Hao Sun, Zhiqun Zhao, Zhihai He


We observe that the human trajectory is not only forward predictable, but also backward predictable. Both forward and backward trajectories follow the same social norms and obey the same physical constraints; the only difference lies in their time directions. Based on this unique property, we develop a new approach, called reciprocal learning, for human trajectory prediction. Two networks, forward and backward prediction networks, are tightly coupled, satisfying the reciprocal constraint, which allows them to be jointly learned. Based on this constraint, we borrow the concept of adversarial attacks on deep neural networks, which iteratively modify the input of the network to match a given or forced network output, and develop a new method for network prediction, called reciprocal attack for matched prediction, which further improves the prediction accuracy. Our experimental results on benchmark datasets demonstrate that our new method outperforms state-of-the-art methods for human trajectory prediction.
[prediction, trajectory, reciprocal, social, future, lstm, time, predict, work, multiple, moving, hidden, eth, silvio, outperforms, observed, predicting, walk, illustrated] [predicted, feature, matched, pooling, map, ablation, crowded] [adversarial, model, input, attack, physical, developed, experimental, datasets, improve, trained, major, original] [method, ieee, pattern, based, motion, figure, proposed, develop, called, existing] [loss, image, target, perform, idea, person, cycle, gan, consistency, encoder] [network, backward, learning, forward, performance, training, neural, linear, set, function, deep, feasible, data, better] [human, conference, computer, vision, depth, scene, error, approach, ground, international, distance, european, unique, tightly, iteratively, joint, coupled, truth]
@InProceedings{Sun_2020_CVPR,
  author = {Sun, Hao and Zhao, Zhiqun and He, Zhihai},
  title = {Reciprocal Learning Networks for Human Trajectory Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
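The reciprocal constraint can be sketched as two sequence predictors trained jointly, one running forward in time and one on the time-reversed trajectory. The toy PyTorch code below is illustrative only and omits the social and interaction terms of the actual networks; it shows how the two losses are coupled.

import torch
import torch.nn as nn

class TrajPredictor(nn.Module):
    # Toy autoregressive trajectory predictor (stand-in for the paper's networks).
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(2, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, traj, steps):
        _, state = self.lstm(traj)                   # encode the observed trajectory
        preds, last = [], traj[:, -1]
        for _ in range(steps):
            out, state = self.lstm(last.unsqueeze(1), state)
            last = last + self.head(out[:, -1])      # predict a displacement
            preds.append(last)
        return torch.stack(preds, dim=1)

def reciprocal_loss(fwd_net, bwd_net, observed, future):
    # Forward net predicts the future from the past; backward net predicts the
    # time-reversed past from the reversed future. Training both couples them.
    pred_future = fwd_net(observed, future.size(1))
    pred_past = bwd_net(torch.flip(future, dims=[1]), observed.size(1))
    l_fwd = (pred_future - future).norm(dim=-1).mean()
    l_bwd = (pred_past - torch.flip(observed, dims=[1])).norm(dim=-1).mean()
    return l_fwd + l_bwd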
Nonparametric Object and Parts Modeling With Lie Group Dynamics
David S. Hayden, Jason Pacheco, John W. Fisher III


Articulated motion analysis often utilizes strong prior knowledge such as a known or trained parts model for humans. Yet, the world contains a variety of articulating objects--mammals, insects, mechanized structures--where the number and configuration of parts for a particular object is unknown in advance. Here, we relax such strong assumptions via an unsupervised, Bayesian nonparametric parts model that infers an unknown number of parts with motions coupled by a body dynamic and parameterized by SE(D), the Lie group of rigid transformations. We derive an inference procedure that utilizes short observation sequences (image, depth, point cloud or mesh) of an object in motion without need for markers or learned body models. Efficient Gibbs decompositions for inference over distributions on SE(D) demonstrate robust part decompositions of moving objects under both 3D and 2D observation models. The inferred representation permits novel analysis, such as object segmentation by relative part motion, and transfers to new observations of the same object type.
[frame, time, observation, work, dirichlet, element, modeling, multiple] [object, segmentation] [model, noise] [motion, figure, ieee, gaussian, dynamic, pattern, analysis, method] [translation, unknown, representation, unsupervised, infinite, conditional, image] [group, space, number, covariance, data, inference, vector, distribution, sampling, bayesian, matrix, linear, dynamical, sampled, riemannian, spider, john, efficient, posterior, approximation, random] [body, lie, tangent, nonparametric, computer, conference, rotation, mesh, canonical, michael, rigid, ztn, vision, ytn, david, articulated, point, shape, depth, human, international, rxt, dxt, npe, gibbs, novel, transformation, homogeneous, hand, relative, full, single]
@InProceedings{Hayden_2020_CVPR,
  author = {Hayden, David S. and Pacheco, Jason and Fisher, III, John W.},
  title = {Nonparametric Object and Parts Modeling With Lie Group Dynamics},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Shadow Hand-Drawn Sketches
Qingyuan Zheng, Zhuoru Li, Adam Bargteil


We present a fully automatic method to generate detailed and accurate artistic shadows from pairs of line drawing sketches and lighting directions. We also contribute a new dataset of one thousand examples of pairs of line drawings and shadows that are tagged with lighting directions. Remarkably, the generated shadows quickly communicate the underlying 3D structure of the sketched scene. Consequently, the shadows generated by our approach can be used directly or as an excellent starting point for artists. We demonstrate that the deep learning network we propose takes a hand-drawn sketch, builds a 3D model in latent space, and renders the resulting shadows. The generated shadows respect the hand-drawn lines and underlying 3D space and contain sophisticated and accurate details, such as self-shadowing effects. Moreover, the generated shadows contain artistic effects, such as rim lighting or halos appearing from backlighting, that would be achievable with traditional 3D rendering methods.
[work, visual, include, reasoning] [stage, final, interactive, focus] [input, adversarial, model] [residual, method, output, proposed, light, ieee, intermediate, figure, pattern, spatial] [image, sketch, loss, shadow, discriminator, drawing, animation, generator, generate, generative, cel, artistic, user, stylized, translation, realistic, colorization, loutput, inker, shadowing, gans, film, aaron, satoshi, edgar, hiroshi, phillip, alexei, separate, learn] [deep, network, learning, binary, neural, training, simple, soft, set, small, architecture, processing, arxiv, preprint, function, layer, data] [lighting, acm, computer, conference, normal, relighting, direction, ground, international, directly, vision, additional, second, compare]
@InProceedings{Zheng_2020_CVPR,
  author = {Zheng, Qingyuan and Li, Zhuoru and Bargteil, Adam},
  title = {Learning to Shadow Hand-Drawn Sketches},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Intuitive, Interactive Beard and Hair Synthesis With Generative Models
Kyle Olszewski, Duygu Ceylan, Jun Xing, Jose Echevarria, Zhili Chen, Weikai Chen, Hao Li


We present an interactive approach to synthesizing realistic variations in facial hair in images, ranging from subtle edits to existing hair to the addition of complex and challenging hair in images of clean-shaven subjects. To circumvent the tedious and computationally expensive tasks of modeling, rendering and compositing the 3D geometry of the target hairstyle using the traditional graphics pipeline, we employ a neural network pipeline that synthesizes realistic and detailed images of facial hair directly in the target image in under one second. The synthesis is controlled by simple and sparse guide strokes from the user defining the general structural and color properties of the target hairstyle. We qualitatively and quantitatively evaluate our chosen method compared to several alternative approaches. We show compelling interactive editing results with a prototype user interface that allows novice users to progressively refine the generated image to match their desired hairstyle, and demonstrate that our approach also allows for flexible and high-fidelity scalp hair synthesis.
[dataset, modeling, provide, visual] [region, guide, interactive, mask, final, stage, segmented, segmentation] [facial, input, adversarial, trained, example, study] [color, field, ieee, method, figure, perceptual, result, pattern, reference] [hair, image, synthesis, editing, texture, user, target, synthesized, real, style, generative, realistic, synthesizing, desired, hairstyle, corresponding, intuitive, loss, transfer, perform, synthesize, scalp, generate, synthetic, stroke, generated, plausible] [network, training, vector, set, neural, large, deep, appropriate, architecture] [acm, computer, initial, approach, allow, conference, allows, local, structure, orientation, system, ground, truth, vision, hao, varying, provided, well, complex, interface, single, shape, supplementary]
@InProceedings{Olszewski_2020_CVPR,
  author = {Olszewski, Kyle and Ceylan, Duygu and Xing, Jun and Echevarria, Jose and Chen, Zhili and Chen, Weikai and Li, Hao},
  title = {Intuitive, Interactive Beard and Hair Synthesis With Generative Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Semantic Pyramid for Image Generation
Assaf Shocher, Yossi Gandelsman, Inbar Mosseri, Michal Yarom, Michal Irani, William T. Freeman, Tali Dekel


We present a novel GAN-based model that utilizes the space of deep features learned by a pre-trained classification model. Inspired by classical image pyramid representations, we construct our model as a Semantic Generation Pyramid -- a hierarchical framework which leverages the continuum of semantic information encapsulated in such deep features; this ranges from low level information contained in fine features to high level, semantic information contained in deeper features. More specifically, given a set of features extracted from a reference image, our model generates diverse image samples, each with matching features at each semantic level of the classification model. We demonstrate that our model results in a versatile and flexible framework that can be used in various classic and novel image generation tasks. These include: generating images with a controllable extent of semantic similarity to a reference image, and different manipulation tasks such as semantically-controlled inpainting and compositing; all achieved with the same model, with no further training.
[work, hierarchical, understanding, recognition, natural] [semantic, feature, pyramid, level, map, mask, framework, fed, object, region, unified, apply, global] [model, input, original, adversarial, manipulation, trained, noise] [reference, convolutional, ieee, figure, high, based, method, output, result, pattern, classical, perceptual] [image, generated, generator, generation, generative, generate, extracted, diverse, generating, content, inverting, gan, realistic, gans, user, fid, real, loss, mapping] [deep, classification, random, training, set, neural, learning, layer, space, optimization, deeper, distribution, similarity, class, machine] [conference, international, matching, allows, computer, vision, structure, reconstructed, approach, demonstrate, reconstruction, single]
@InProceedings{Shocher_2020_CVPR,
  author = {Shocher, Assaf and Gandelsman, Yossi and Mosseri, Inbar and Yarom, Michal and Irani, Michal and Freeman, William T. and Dekel, Tali},
  title = {Semantic Pyramid for Image Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
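The central ingredient of the paper above is matching generator outputs to reference features at several depths of a frozen, pretrained classifier. The sketch below uses an off-the-shelf torchvision VGG-16 purely for illustration; the paper's classifier, per-level masks, and losses differ.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class MultiLevelFeatureLoss(nn.Module):
    # L1 feature-matching loss at several semantic levels of a frozen classifier.
    def __init__(self, levels=(3, 8, 15, 22)):
        super().__init__()
        self.backbone = vgg16(weights="DEFAULT").features.eval()   # downloads ImageNet weights
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.levels = set(levels)

    def extract(self, x):
        feats = []
        for i, layer in enumerate(self.backbone):
            x = layer(x)
            if i in self.levels:
                feats.append(x)
        return feats

    def forward(self, generated, reference):
        loss = 0.0
        for fg, fr in zip(self.extract(generated), self.extract(reference)):
            loss = loss + (fg - fr).abs().mean()
        return loss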
SynSin: End-to-End View Synthesis From a Single Image
Olivia Wiles, Georgia Gkioxari, Richard Szeliski, Justin Johnson


View synthesis allows for the generation of new views of a scene given one or more images. This is challenging; it requires comprehensively understanding the 3D scene from images. As a result, current methods typically use multiple images, train on ground-truth depth, or are limited to synthetic data. We propose a novel end-to-end model for this task using a single image at test time; it is trained on real images without any ground-truth 3D information. To this end, we introduce a novel differentiable point cloud renderer that is used to transform a latent 3D point cloud of features into the target view. The projected features are decoded by our refinement network to inpaint missing regions and generate a realistic output image. The 3D component inside of our generative model allows for interpretable manipulation of the latent feature space at test time, e.g. we can animate trajectories from a single image. Additionally, we can generate high resolution images and generalise to other input resolutions. We outperform baselines and prior work on the Matterport, Replica, and RealEstate10K datasets.
[work, multiple, dataset, video, prediction, understanding] [feature, refinement, hard, table, predicted, object, semantic, map] [input, model, trained, quality] [resolution, method, figure, convolutional, spatial, comparison, output, based, pixel] [image, synthesis, generative, target, generate, representation, train, generated, realistic, perform, generalisation, synthetic, latent, inpaint, missing] [learning, test, training, higher, network, set, neural, better, learned, performance, baseline, evaluate, deep, setup, requires, task] [depth, view, synsin, point, scene, single, cloud, renderer, acm, differentiable, rendering, system, approach, indoor, replica, richard, voxel, rendered, novel, structure, projected, visible, nearest, vox, rgb, noah]
@InProceedings{Wiles_2020_CVPR,
  author = {Wiles, Olivia and Gkioxari, Georgia and Szeliski, Richard and Johnson, Justin},
  title = {SynSin: End-to-End View Synthesis From a Single Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Characteristic Function Approach to Deep Implicit Generative Modeling
Abdul Fatir Ansari, Jonathan Scarlett, Harold Soh


Implicit Generative Models (IGMs) such as GANs have emerged as effective data-driven models for generating samples, particularly images. In this paper, we formulate the problem of learning an IGM as minimizing the expected distance between characteristic functions. Specifically, we minimize the distance between characteristic functions of the real and generated data distributions under a suitably-chosen weighting distribution. This distance metric, which we term the characteristic function distance (CFD), can be (approximately) computed with linear time-complexity in the number of samples, in contrast with the quadratic-time Maximum Mean Discrepancy (MMD). By replacing the discrepancy measure in the critic of a GAN with the CFD, we obtain a model that is simple to implement and stable to train. The proposed metric enjoys desirable theoretical properties including continuity and differentiability with respect to generator parameters, and continuity in the weak topology. We further propose a variation of the CFD in which the weighting distribution parameters are also optimized during training; this obviates the need for manual tuning, and leads to an improvement in test power relative to CFD. We demonstrate experimentally that our proposed method outperforms WGAN and MMD-GAN variants on a variety of unsupervised image generation benchmarks.
[work, provide, dataset, outperforms, time] [continuity, weak, improvement] [model, adversarial, effective, true, lipschitz, variation, datasets] [proposed, optimized, scale, kernel, smoothed] [characteristic, generative, gan, generator, generated, mmd, ecfd, wgan, critic, image, fid, kid, gans, celeba, real, discrepancy, minimizing, synthetic, loss] [distribution, cfd, function, weighting, gradient, training, data, empirical, number, probability, learning, network, test, penalty, power, metric, optimization, random, parameter, better, set, appendix, performance, maximum, simple, stable, theorem, mixture, compared, best, deep, theoretical, divergence, alternative, integral, vector, maximize, squared] [distance, implicit, continuous, approach]
@InProceedings{Ansari_2020_CVPR,
  author = {Ansari, Abdul Fatir and Scarlett, Jonathan and Soh, Harold},
  title = {A Characteristic Function Approach to Deep Implicit Generative Modeling},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
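A hedged sketch of the characteristic function distance itself: sample frequencies from a weighting distribution (a fixed Gaussian here; the paper also learns its parameters) and compare the empirical characteristic functions of real and generated batches. Names and the sigma parameter are illustrative.

import torch

def cfd(x, y, num_freqs=64, sigma=1.0):
    # Empirical characteristic function distance between sample sets x, y of shape (N, D),
    # with frequencies t ~ N(0, sigma^2 I) standing in for the weighting distribution.
    d = x.size(1)
    t = torch.randn(num_freqs, d) * sigma
    def ecf(z):
        proj = z @ t.t()                                            # (N, num_freqs)
        return torch.complex(torch.cos(proj).mean(0), torch.sin(proj).mean(0))
    diff = ecf(x) - ecf(y)
    return (diff.abs() ** 2).mean()                                 # average over sampled frequencies

In a GAN-style setup this quantity would replace the critic's discrepancy measure, computed per minibatch between real samples and generator outputs (typically after a learned feature mapping).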
High-Resolution Daytime Translation Without Domain Labels
Ivan Anokhin, Pavel Solovev, Denis Korzhenkov, Alexey Kharlamov, Taras Khakhulin, Aleksei Silvestrov, Sergey Nikolenko, Victor Lempitsky, Gleb Sterkin


Modeling daytime changes in high resolution photographs, e.g., re-rendering the same scene under different illuminations typical for day, night, or dawn, is a challenging image manipulation task. We present the high-resolution daytime translation (HiDT) model for this task. HiDT combines a generative image-to-image model and a new upsampling scheme that makes it possible to apply image translation at high resolution. The model demonstrates competitive results in terms of both commonly used GAN metrics and human evaluation. Importantly, this good performance comes as a result of training on a dataset of still landscape images with no daytime labels available.
[dataset, work, previous, decoder] [segmentation, apply, merging, semantic, score, main] [model, original, trained, input, adversarial, medium] [enhancement, figure, resolution, ieee, high, convolutional, result, prior, pattern, upsampling, skip, method, output, downsampling, june, residual] [image, style, translation, hidt, translated, content, daytime, domain, loss, timelapse, extracted, adain, funit, transfer, generative, target, conditional, encoder, user, generator, genh, generation, discriminator, xhi, drit, swapping, springer, paired] [training, network, random, architecture, learning, distribution, sampled, applied, task, space, scheme, inference, landscape, consider, performance] [computer, conference, vision, international, well, approach, reconstruction, single, application, human]
@InProceedings{Anokhin_2020_CVPR,
  author = {Anokhin, Ivan and Solovev, Pavel and Korzhenkov, Denis and Kharlamov, Alexey and Khakhulin, Taras and Silvestrov, Aleksei and Nikolenko, Sergey and Lempitsky, Victor and Sterkin, Gleb},
  title = {High-Resolution Daytime Translation Without Domain Labels},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
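Models in this image-to-image translation family typically inject the style (here, a daytime or appearance code) into content features via adaptive instance normalization; below is a generic AdaIN sketch with assumed shapes, not HiDT's exact decoder.

import torch

def adain(content_feats, style_mean, style_std, eps=1e-5):
    # Re-normalize content features (B,C,H,W) with statistics predicted from a style code.
    # style_mean and style_std are assumed to be (B,C,1,1) tensors produced by a small MLP.
    mean = content_feats.mean(dim=(2, 3), keepdim=True)
    std = content_feats.std(dim=(2, 3), keepdim=True) + eps
    return (content_feats - mean) / std * style_std + style_mean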
Leveraging 2D Data to Learn Textured 3D Mesh Generation
Paul Henderson, Vagia Tsiminaki, Christoph H. Lampert


Numerous methods have been proposed for probabilistic generative modelling of 3D objects. However, none of these is able to produce textured objects, which renders them of limited use for practical tasks. In this work, we present the first generative model of textured 3D meshes. Training such a model would traditionally require a large dataset of textured meshes, but unfortunately, existing datasets of meshes lack detailed textures. We instead propose a new training methodology that allows learning from collections of 2D images without any 3D information. To do so, we train our model to explain a distribution of images by modelling each image as a 3D foreground object placed in front of a 2D background. Thus, it learns to generate meshes that when rendered, produce images similar to those in its training set. A well-known problem when generating meshes with deep networks is the emergence of self-intersections, which are problematic for many use-cases. As a second contribution we therefore introduce a new generation process for 3D meshes that guarantees no self-intersections arise, based on the physical intuition that faces should push one another out of the way as they move. We conduct extensive experiments on our approach, reporting quantitative and qualitative results on both synthetic data and natural images. These show our method successfully learns to generate plausible and diverse textured 3D samples for five challenging object classes.
[decoder, natural, multiple, zbg] [object, background, foreground, segmentation, challenging, mask, propose, car] [model, trained, face, physical, datasets] [method, quantitative, figure, color] [image, generative, learn, generation, latent, generate, train, generated, variational, learns, encoder, zcolor, diverse, texture, corresponding, bird, fid, kid] [training, learning, setting, data, network, distribution, deep, process, space, performance, group, note, set, sampled, probabilistic] [mesh, textured, reconstruction, shape, surface, vertex, pose, reconstruct, parametrization, zshape, dmin, camera, local, reconstructed, single, paul, allows, predicts, intersect, intersecting, differentiable, chair, modelling]
@InProceedings{Henderson_2020_CVPR,
  author = {Henderson, Paul and Tsiminaki, Vagia and Lampert, Christoph H.},
  title = {Leveraging 2D Data to Learn Textured 3D Mesh Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Contextual Residual Aggregation for Ultra High-Resolution Image Inpainting
Zili Yi, Qiang Tang, Shekoofeh Azizi, Daesik Jang, Zhan Xu


Recently, data-driven image inpainting methods have made inspiring progress, impacting fundamental image editing tasks such as object removal and damaged image repairing. These methods are more effective than classic approaches; however, due to memory limitations they can only handle low-resolution inputs, typically smaller than 1K. Meanwhile, the resolution of photos captured with mobile devices increases up to 8K. Naive up-sampling of the low-resolution inpainted result can merely yield a large yet blurry result, whereas adding a high-frequency residual image onto the large blurry image can generate a sharp result, rich in details and textures. Motivated by this, we propose a Contextual Residual Aggregation (CRA) mechanism that can produce high-frequency residuals for missing contents by weighted aggregation of residuals from contextual patches, thus only requiring a low-resolution prediction from the network. Since the convolutional layers of the neural network only need to operate on low-resolution inputs and outputs, the cost of memory and computing power is well suppressed. Moreover, the need for high-resolution training datasets is alleviated. In our experiments, we train the proposed model on small images with resolution 512 x 512 and perform inference on high-resolution images, achieving compelling inpainting quality. Our model can inpaint images as large as 8K with considerable hole sizes, which is intractable for previous learning-based approaches. We further elaborate on the light-weight design of the network architecture, achieving real-time performance on 2K images on a GTX 1080 Ti GPU. Code is available at: https://github.com/Ascend-Huawei/Ascend-Canada/tree/master/Models/Research_HiFIll_Model
[attention, mechanism, gated, irregular, multiple, time, three] [contextual, feature, region, mask, aggregation, refine, object, including, inside] [model, input, trained, original, quality] [hole, figure, patch, convolution, ieee, residual, proposed, lwgc, method, filled, cra, pattern, blurry, output, convolutional, performs, cin, sharp] [image, inpainting, missing, loss, generator, inpainted, transfer, generates, fid, perform, inpaint, fill, filling] [network, size, training, large, processing, number, neural, weighted, layer, batch, memory, computing, deep, ultra] [conference, computer, vision, coarse, enables, partial, averaging, reconstruction]
@InProceedings{Yi_2020_CVPR,
  author = {Yi, Zili and Tang, Qiang and Azizi, Shekoofeh and Jang, Daesik and Xu, Zhan},
  title = {Contextual Residual Aggregation for Ultra High-Resolution Image Inpainting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
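The aggregation step above can be sketched as attention computed between hole patches and context patches at low resolution, with the same weights reused to mix high-frequency residual patches from the context. The PyTorch snippet below operates on already-flattened patches with assumed shapes; it is not the paper's implementation.

import torch
import torch.nn.functional as F

def aggregate_residuals(hole_feats, ctx_feats, ctx_residuals, temperature=10.0):
    # hole_feats: (Nh, D) low-res patches inside the hole, ctx_feats: (Nc, D) context patches,
    # ctx_residuals: (Nc, P) high-frequency residual patches from the context at high resolution.
    sim = F.normalize(hole_feats, dim=1) @ F.normalize(ctx_feats, dim=1).t()   # cosine similarity
    weights = F.softmax(temperature * sim, dim=1)                               # attention scores
    return weights @ ctx_residuals                                              # aggregated residual patches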
Flow Contrastive Estimation of Energy-Based Models
Ruiqi Gao, Erik Nijkamp, Diederik P. Kingma, Zhen Xu, Andrew M. Dai, Ying Nian Wu


This paper studies a training method to jointly estimate an energy-based model and a flow-based model, in which the two models are iteratively updated based on a shared adversarial value function. This joint training method has the following traits. (1) The update of the energy-based model is based on noise contrastive estimation, with the flow model serving as a strong noise distribution. (2) The update of the flow model approximately minimizes the Jensen-Shannon divergence between the flow model and the data distribution. (3) Unlike generative adversarial networks (GANs), which estimate an implicit probability distribution defined by a generator model, our method estimates two explicit probabilistic distributions on the data. Using the proposed method we demonstrate a significant improvement in the synthesis quality of the flow model, and show the effectiveness of unsupervised feature learning by the learned energy-based model. Furthermore, the proposed training method can be easily adapted to semi-supervised learning. We achieve results competitive with state-of-the-art semi-supervised learning methods.
[language, observed, nce] [table, feature, effectiveness] [model, noise, adversarial, trained, easily] [flow, method, glow, based, figure, invertible, proposed, convolutional, ieee] [generative, pdata, variational, generated, diederik, unsupervised, real, synthesized, learns, generator, discriminative, celeba, aaron] [data, learning, arxiv, preprint, training, learned, ebm, distribution, neural, fce, contrastive, function, processing, probability, density, classification, deep, log, update, divergence, objective, labeled, energy, mle, top, machine, parameter, epdata, probabilistic, efficient, sampling, network, ebms, classifier, yoshua, negative, normalizing, logistic, suppose, gradient, svhn, inference] [estimation, conference, defined, international, joint, assume, estimated, form, david, computer]
@InProceedings{Gao_2020_CVPR,
  author = {Gao, Ruiqi and Nijkamp, Erik and Kingma, Diederik P. and Xu, Zhen and Dai, Andrew M. and Wu, Ying Nian},
  title = {Flow Contrastive Estimation of Energy-Based Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
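The energy-based model update above can be read as a logistic-regression (noise contrastive estimation) step in which the flow plays the role of the noise distribution. A sketch of that single step follows, assuming energy_fn(x) returns the unnormalized energy and flow_log_prob(x) the flow's exact log-density; the alternating flow update is omitted.

import torch.nn.functional as F

def fce_ebm_loss(energy_fn, flow_log_prob, x_data, x_flow):
    # Log-density ratio between the EBM (up to its normalizing constant) and the flow.
    def logit(x):
        return -energy_fn(x) - flow_log_prob(x)
    pos = F.logsigmoid(logit(x_data)).mean()     # data samples classified as "data"
    neg = F.logsigmoid(-logit(x_flow)).mean()    # flow samples classified as "noise"
    return -(pos + neg)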
Hardware-in-the-Loop End-to-End Optimization of Camera Image Processing Pipelines
Ali Mosleh, Avinash Sharma, Emmanuel Onzon, Fahim Mannan, Nicolas Robidoux, Felix Heide


Commodity imaging systems rely on hardware image signal processing (ISP) pipelines. These low-level pipelines consist of a sequence of processing blocks that, depending on their hyperparameters, reconstruct a color image from RAW sensor measurements. Hardware ISP hyperparameters have a complex interaction with the output image, and therefore with the downstream application ingesting these images. Traditionally, ISPs are manually tuned in isolation by imaging experts without an end-to-end objective. Very recently, ISPs have been optimized with 1st-order methods that require differentiable approximations of the hardware ISP. Departing from such approximations, we present a hardware-in-the-loop method that directly optimizes hardware image processing pipelines for end-to-end domain-specific losses by solving a nonlinear multi-objective optimization problem with a novel 0th-order stochastic solver directly interfaced with the hardware ISP. We validate the proposed method with recent hardware ISPs and 2D object detection, segmentation, and human viewing as end-to-end downstream tasks. For automotive 2D object detection, the proposed method outperforms manual expert tuning by 30% mean average precision (mAP) and recent methods using ISP approximations by 18% mAP.
[downstream, evaluation, understanding, viewing, expert, recognition, outperforms, automatic] [object, detection, segmentation, map, table, box, instance, panoptic, including] [quality, black, model] [isp, proposed, method, perceptual, ieee, raw, output, isps, optimized, tseng, automotive, pixel, color, sensor, existing, pattern, onsemi, mar, imaging, validate, blockwise, based] [image, loss, synthetic] [optimization, hyperparameter, hyperparameters, processing, hardware, search, space, optimize, set, tuning, evolutionary, learning, data, task, default, parameter, problem, algorithm, manually, manual, large, approximation, number, function] [vision, computer, conference, human, arm, international, camera, approach, local, differentiable, interval, supplemental]
@InProceedings{Mosleh_2020_CVPR,
  author = {Mosleh, Ali and Sharma, Avinash and Onzon, Emmanuel and Mannan, Fahim and Robidoux, Nicolas and Heide, Felix},
  title = {Hardware-in-the-Loop End-to-End Optimization of Camera Image Processing Pipelines},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
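Because the hardware ISP is a black box, its hyperparameters must be tuned with 0th-order (derivative-free) updates. The loop below is a deliberately simple (1+1)-style random search standing in for the paper's multi-objective solver; run_isp_and_score is an assumed callback that programs the ISP, runs the downstream task, and returns a scalar loss.

import numpy as np

def zeroth_order_search(run_isp_and_score, init_params, bounds, iters=200, sigma=0.1, seed=0):
    # Derivative-free search over ISP hyperparameters with the hardware in the loop.
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds[0], dtype=float), np.asarray(bounds[1], dtype=float)
    best = np.clip(np.asarray(init_params, dtype=float), lo, hi)
    best_loss = run_isp_and_score(best)
    for _ in range(iters):
        cand = np.clip(best + sigma * (hi - lo) * rng.standard_normal(best.shape), lo, hi)
        loss = run_isp_and_score(cand)           # each evaluation is a real capture/processing run
        if loss < best_loss:
            best, best_loss = cand, loss
    return best, best_loss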
Search to Distill: Pearls Are Everywhere but Not the Eyes
Yu Liu, Xuhui Jia, Mingxing Tan, Raviteja Vemulapalli, Yukun Zhu, Bradley Green, Xiaogang Wang


Standard Knowledge Distillation (KD) approaches distill the knowledge of a cumbersome teacher model into the parameters of a student model with a pre-defined architecture. However, the knowledge of a neural network, which is represented by the network's output distribution conditioned on its input, depends not only on its parameters but also on its architecture. Hence, a more generalized approach for KD is to distill the teacher's knowledge into both the parameters and architecture of the student. To achieve this, we present a new Architecture-aware Knowledge Distillation (AKD) approach that finds student models (pearls for the teacher) that are best for distilling the given teacher model. In particular, we leverage Neural Architecture Search (NAS), equipped with our KD-guided reward, to search for the best student architectures for a given teacher. Experimental results show our proposed AKD consistently outperforms the conventional NAS plus KD approach, and achieves state-of-the-art results on the ImageNet classification task under various latency settings. Furthermore, the best AKD student architecture for the ImageNet classification task also transfers well to other tasks such as million level face recognition and ensemble learning.
[agent, reward, previous, recognition, work] [hard, table, final, improvement] [model, face, ensemble, trained, interesting, megaface] [ieee, method, output, figure, conventional, based, pattern, proposed, dark] [structural, target, image, train] [teacher, knowledge, distillation, neural, student, architecture, search, akdnet, akd, latency, performance, label, searching, network, training, imagenet, learning, distribution, best, space, sampled, deep, optimal, distill, task, set, accuracy, searched, classification, better, compared, nasnet, quoc, distilling, process, investigate, probability, gain, soft, converge, consistently, machine, argue, select, function, denote, large, random, ratio] [conference, computer, vision, approach]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yu and Jia, Xuhui and Tan, Mingxing and Vemulapalli, Raviteja and Zhu, Yukun and Green, Bradley and Wang, Xiaogang},
  title = {Search to Distill: Pearls Are Everywhere but Not the Eyes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
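The KD-guided reward in the paper above can be pictured as the (negated) standard distillation objective evaluated for a candidate student; a generic sketch follows, with illustrative temperature and mixing constants rather than the paper's settings.

import torch.nn.functional as F

def kd_objective(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft teacher-matching term plus the usual hard-label cross-entropy.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def kd_reward(student_logits, teacher_logits, labels):
    # Lower distillation loss means a higher reward for the architecture search controller.
    return -kd_objective(student_logits, teacher_logits, labels)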
Total Deep Variation for Linear Inverse Problems
Erich Kobler, Alexander Effland, Karl Kunisch, Thomas Pock


Diverse inverse problems in imaging can be cast as variational problems composed of a task-specific data fidelity term and a regularization term. In this paper, we propose a novel learnable general-purpose regularizer exploiting recent architectural design patterns from deep learning. We cast the learning problem as a discrete sampled optimal control problem, for which we derive the adjoint state equations and an optimality condition. By exploiting the variational structure of our approach, we perform a sensitivity analysis with respect to the learned parameters obtained from different training datasets. Moreover, we carry out a nonlinear eigenfunction analysis, which reveals interesting properties of the learned regularizer. We show state-of-the-art performance for classical image restoration and medical image reconstruction problems.
[time, state, order, dataset, three] [] [noise, variation, condition, model, input, nonlinear, sensitivity] [psnr, figure, ieee, proposed, inverse, analysis, denoising, operator, based, rnc, pattern, xinit, imaging, gaussian, method, convolutional, low, resolution, tdv, scale, classical, medical, residual, xiinit, accelerated, undersampled, prior] [image, variational, control, fidelity] [optimal, regularizer, data, deep, learned, training, learning, problem, gradient, stopping, total, sampled, linear, neural, network, discrete, theorem, energy, set, maximum, number, regularization, fixed, average, test, design, respect, function] [conference, reconstruction, single, computer, thomas, term, international, computed, structure, approach, initial, vision, adjoint, supplementary, ground, michael, novel, form, local]
@InProceedings{Kobler_2020_CVPR,
  author = {Kobler, Erich and Effland, Alexander and Kunisch, Karl and Pock, Thomas},
  title = {Total Deep Variation for Linear Inverse Problems},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Relative Interior Rule in Block-Coordinate Descent
Tomas Werner, Daniel Prusa, Tomas Dlask


It is well-known that for general convex optimization problems, block-coordinate descent can get stuck in poor local optima. Despite that, versions of this method known as convergent message passing are very successful at approximately solving the dual LP relaxation of the MAP inference problem in graphical models. In an attempt to identify the reason why these methods often achieve good local minima, we argue that if in block-coordinate descent the set of minimizers over a variable block has multiple elements, one should choose an element from the relative interior of this set. We show that this rule is not worse than any other rule for choosing block-minimizers. Based on this observation, we develop a theoretical framework for block-coordinate descent applied to general convex problems. We illustrate this theory on convergent message-passing methods.
[sequence, message, graphical, element, passing] [map, global, framework] [face, rule, diffusion, dim, bounded, subject, finite, condition] [dual, method, pattern, ieee, analysis] [arc, cluster, variable, mplp] [interior, set, minimum, theorem, satisfying, lemma, max, convergent, update, corollary, function, descent, lim, bcd, closed, general, problem, relaxation, fixed, convergence, vector, linear, minimize, number, implies, optimization, inference, denote, iff, equality, machine, minimizers, objective, converges, upper, call, consider, optimal, subsequence, minimizes, energy, applied] [local, relative, convex, point, coordinate, single, technical, computer, direction, continuous, vertex, czech, unique]
@InProceedings{Werner_2020_CVPR,
  author = {Werner, Tomas and Prusa, Daniel and Dlask, Tomas},
  title = {Relative Interior Rule in Block-Coordinate Descent},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Combinatorial Solver for Graph Matching
Tao Wang, He Liu, Yidong Li, Yi Jin, Xiaohui Hou, Haibin Ling


Learning-based approaches to graph matching have been developed and explored for more than a decade, and have grown rapidly in scope and popularity in recent years. However, previous learning-based algorithms, with or without a deep learning strategy, mainly focus on learning node and/or edge affinities, and pay less attention to learning the combinatorial solver. In this paper we propose a fully trainable framework for graph matching, in which learning of affinities and solving of the combinatorial optimization are not explicitly separated as in many previous works. We first convert the problem of building node correspondences between two input graphs into the problem of selecting reliable nodes from a constructed assignment graph. Subsequently, a graph network block module is adopted to perform computation on the graph to form structured representations for each node. It finally predicts a label for each node that is used for node classification, and the training is performed under the supervision of both permutation differences and the one-to-one matching constraints. The proposed method is evaluated on four public benchmarks in comparison with several state-of-the-art algorithms, and the experimental results illustrate its excellent performance.
[graph, node, dataset, constructed, structured, previous, trainable, relational, work] [edge, assignment, affinity, module, framework, object, denotes, propose, fully, labeling, predicted, haibin] [input] [convolution, proposed, convolutional, method, ieee, pattern, comparison, spectral, based, block] [loss, generate, attribute, perform, encoder, tao] [learning, problem, combinatorial, function, deep, algorithm, network, training, update, matrix, permutation, relaxation, accuracy, layer, set, optimization, random, randomly, computation, updated, probabilistic, path, discrete, class, vector, data, reliable] [matching, form, correspondence, compute, outlier, parametric, computed, point, solver, finally]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Tao and Liu, He and Li, Yidong and Jin, Yi and Hou, Xiaohui and Ling, Haibin},
  title = {Learning Combinatorial Solver for Graph Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SampleNet: Differentiable Point Cloud Sampling
Itai Lang, Asaf Manor, Shai Avidan


There is a growing number of tasks that work directly on point clouds. As the size of the point cloud grows, so do the computational demands of these tasks. A possible solution is to sample the point cloud first. Classic sampling approaches, such as farthest point sampling (FPS), do not consider the downstream task. A recent work showed that learning a task-specific sampling can improve results significantly. However, the proposed technique did not deal with the non-differentiability of the sampling operation and offered a workaround instead. We introduce a novel differentiable relaxation for point cloud sampling that approximates sampled points as a mixture of points in the primary input cloud. Our approximation scheme leads to consistently good results on classification and geometry reconstruction applications. We also show that the proposed sampling method can be used as a front to a point cloud registration network. This is a challenging task since sampling must be consistent across two different point clouds for a shared downstream task. In all cases, our approach outperforms existing non-learned and learned sampling alternatives. Our code is publicly available.
[recognition, work, downstream, outperforms] [fps, split, feature, object, segmentation, template] [input, trained, profile, adversarial] [ieee, figure, pattern, method, proposed, coefficient, optimized] [loss, source, progressive] [sampling, sampled, classification, network, task, set, soft, temperature, learned, neural, accuracy, sample, operation, performance, ratio, learning, deep, training, weight, test, processing, average, inference, arxiv, preprint, size, relaxation, selection, approximate, data, random, weighted, optimal, close, lower, linear] [point, cloud, samplenet, conference, computer, vision, nearest, projection, neighbor, registration, complete, reconstruction, simplified, projected, shape, error, softly, international, dovrat, differentiable, local, pointnet, simplification, leonidas, approach, nre, daniel, farthest, neighborhood, distance, rotation, charles]
@InProceedings{Lang_2020_CVPR,
  author = {Lang, Itai and Manor, Asaf and Avidan, Shai},
  title = {SampleNet: Differentiable Point Cloud Sampling},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
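The differentiable relaxation described above replaces each generated sample point by a convex combination of its nearest neighbors in the input cloud, weighted by a softmax over negative squared distances with a temperature; a compact sketch of that soft projection, with illustrative defaults:

import torch

def soft_project(query_points, cloud, k=7, temperature=0.1):
    # query_points: (M, 3) generated points, cloud: (N, 3) input point cloud.
    d2 = torch.cdist(query_points, cloud) ** 2               # squared distances (M, N)
    knn_d2, idx = d2.topk(k, dim=1, largest=False)           # k nearest neighbors per query
    w = torch.softmax(-knn_d2 / temperature, dim=1)          # soft assignment weights (M, k)
    neighbors = cloud[idx]                                    # (M, k, 3)
    return (w.unsqueeze(-1) * neighbors).sum(dim=1)           # differentiable "sampled" points

As the temperature is driven toward zero, the projection approaches a hard nearest-neighbor selection.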
Can We Learn Heuristics for Graphical Model Inference Using Reinforcement Learning?
Safa Messaoud, Maghav Kumar, Alexander G. Schwing


Combinatorial optimization is frequently used in computer vision. For instance, in applications like semantic segmentation, human pose estimation and action recognition, programs are formulated for solving inference in Conditional Random Fields (CRFs) to produce a structured output that is consistent with visual features of the image. However, solving inference in CRFs is in general intractable, and approximation methods are computationally demanding and limited to unary, pairwise and hand-crafted forms of higher order potentials. In this paper, we show that we can learn program heuristics, i.e., policies, for solving inference in higher order CRFs for the task of semantic segmentation, using reinforcement learning. Our method solves inference tasks efficiently without imposing any constraints on the form of the potentials. We show compelling results on the Pascal VOC and MOTS datasets.
[reinforcement, policy, graph, order, node, reward, state, action, structured, graphical, work, embedding] [semantic, segmentation, bounding, pascal, iou, crf, map, box, voc, superpixels, superpixel, pspnet, slic, object, crfs, labeling] [model, developed] [based, output, method, classical, tree, proposed, traditional, figure, convolutional] [image, learn, conditional, address, perform, variable] [learning, mcts, inference, energy, dqn, combinatorial, deep, higher, pairwise, unary, network, label, function, optimization, random, potential, search, neural, number, learned, set, linear, selected, training, algorithm, distribution, approximation, task, relaxation, large, problem, find, maximum, efficient, unaries] [approach, solving, solve, well, program, local, second]
@InProceedings{Messaoud_2020_CVPR,
  author = {Messaoud, Safa and Kumar, Maghav and Schwing, Alexander G.},
  title = {Can We Learn Heuristics for Graphical Model Inference Using Reinforcement Learning?},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Quasi-Newton Solver for Robust Non-Rigid Registration
Yuxin Yao, Bailin Deng, Weiwei Xu, Juyong Zhang


Imperfect data (noise, outliers and partial overlap) and high degrees of freedom make non-rigid registration a classic, challenging problem in computer vision. Existing methods typically adopt the l_p type robust estimator to regularize the fitting and smoothness, and use the proximal operator to solve the resulting non-smooth problem. However, the slow convergence of these algorithms limits their wide application. In this paper, we propose a formulation for robust non-rigid registration based on a globally smooth robust estimator for data fitting and regularization, which can handle outliers and partial overlaps. We apply the majorization-minimization algorithm to the problem, which reduces each iteration to solving a simple least-squares problem with L-BFGS. Extensive experiments demonstrate the effectiveness of our method for non-rigid alignment between two shapes with outliers and partial overlap, with quantitative evaluation showing that it outperforms state-of-the-art methods in terms of registration accuracy and computational speed. The source code is available at https://github.com/Juyong/Fast_RNRR.
[graph, dataset, time] [adopt] [robust, deviation, model, noise, robustness] [method, ieee, proposed, affine, pattern, existing, comparison, based, motion, classical] [target, source, alignment, surrogate, corresponding, align] [function, algorithm, set, problem, optimization, matrix, large, regularization, simple, number, dij, gradient, data, iteration, computational, sparsity, achieve, compared, exp, quadratic, update, efficiently] [registration, point, transformation, deformation, computer, surface, distance, term, rpts, rmse, partial, error, icp, formulation, solve, ealign, erot, conference, local, vision, rigid, closest, sparse, projr, acm, induce, direction, nonrigid, position, mesh, numerical, solved, approach, iteratively]
@InProceedings{Yao_2020_CVPR,
  author = {Yao, Yuxin and Deng, Bailin and Xu, Weiwei and Zhang, Juyong},
  title = {Quasi-Newton Solver for Robust Non-Rigid Registration},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition From a Domain Adaptation Perspective
Muhammad Abdullah Jamal, Matthew Brown, Ming-Hsuan Yang, Liqiang Wang, Boqing Gong


Object frequency in the real world often follows a power law, leading to a mismatch between datasets with long-tailed class distributions seen by a machine learning model and our expectation of the model to perform well on all classes. We analyze this mismatch from a domain adaptation point of view. First of all, we connect existing class-balanced methods for long-tailed classification to target shift, a well-studied scenario in domain adaptation. The connection reveals that these methods implicitly assume that the training data and test data share the same class-conditioned distribution, which does not hold in general and especially for the tail classes. While a head class could contain abundant and diverse training examples that well represent the expected data at inference time, the tail classes are often short of representative training data. To this end, we propose to augment the classic class-balanced learning by explicitly estimating the differences between the class-conditioned distributions with a meta-learning approach. We validate our approach with six benchmark datasets and three loss functions.
[recognition, visual, dataset, three, shift, explicitly, work] [table, head, object, hard, framework, king] [model, development, datasets, original, example] [method, assumption, existing, figure] [domain, conditional, loss, adaptation, target, factor, source, boqing, unsupervised, mismatch, perform, discrepancy, train, learn] [training, learning, class, set, imbalance, test, inat, data, tail, distribution, algorithm, deep, weighting, sample, network, balanced, ldam, classification, neural, size, reported, weight, rate, longtailed, inference, machine, inaturalist, number, problem, validation, large, update, imbalanced, expected, small] [focal, approach, well, error, second]
@InProceedings{Jamal_2020_CVPR,
  author = {Jamal, Muhammad Abdullah and Brown, Matthew and Yang, Ming-Hsuan and Wang, Liqiang and Gong, Boqing},
  title = {Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition From a Domain Adaptation Perspective},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
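One family of class-balanced weights that this paper revisits (the "effective number of samples" weighting of Cui et al.) can be computed as below; the meta-learned correction the paper adds on top of such weights is not shown.

import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    # Per-class weights proportional to 1 / (effective number of samples), renormalized
    # so that the weights sum to the number of classes.
    n = np.asarray(samples_per_class, dtype=float)
    eff_num = 1.0 - np.power(beta, n)
    w = (1.0 - beta) / eff_num
    return w / w.sum() * len(n)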
Optimizing Rank-Based Metrics With Blackbox Differentiation
Michal Rolinek, Vit Musil, Anselm Paulus, Marin Vlastelica, Claudio Michaelis, Georg Martius


Rank-based metrics are some of the most widely used criteria for performance evaluation of computer vision models. Despite years of effort, direct optimization for these metrics remains a challenge due to their non-differentiable and non-decomposable nature. We present an efficient, theoretically sound, and general method for differentiating rank-based metrics with mini-batch gradient descent. In addition, we address optimization instability and sparsity of the supervision signal that both arise from using rank-based metrics as optimization targets. Resulting losses based on recall and Average Precision are applied to image retrieval and object detection tasks. We obtain performance that is competitive with state-of-the-art on standard image retrieval datasets and consistently improve performance of near state-of-the-art object detectors.
[retrieval, relevant, multiple, evaluation, dataset, element, work] [object, detection, faster, positive, table, voc, map, recall, feature, pascal, denotes, highest] [blackbox, example, clothes, suitable, model] [ieee, pattern, method, proposed] [loss, image] [learning, metric, deep, ranking, average, rambo, log, training, performance, optimization, set, precision, batch, gradient, vector, neural, implementation, function, combinatorial, optimizing, margin, online, class, machine, rank, test, report, proxy, evaluate, note, backward, negative, lap, stanford, processing, general, standard, efficient, triplet, permutation, number] [conference, computer, vision, international, direct, differentiation, differentiable, directly, compute, truth]
@InProceedings{Rolinek_2020_CVPR,
  author = {Rolinek, Michal and Musil, Vit and Paulus, Anselm and Vlastelica, Marin and Michaelis, Claudio and Martius, Georg},
  title = {Optimizing Rank-Based Metrics With Blackbox Differentiation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
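Rank-based metrics become trainable once the ranking operation gets a surrogate gradient. The sketch below follows the generic blackbox-differentiation recipe (call the solver again on inputs perturbed by the incoming gradient and take a finite difference), treating ranking as the minimizer of a linear objective; lambda_ is an interpolation hyperparameter, and this is a simplified stand-in rather than the paper's full loss.

import torch

class BlackboxRanking(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores, lambda_=10.0):
        # Rank 0 is assigned to the highest score (descending ranking).
        ranks = torch.argsort(torch.argsort(scores, descending=True)).float()
        ctx.save_for_backward(scores, ranks)
        ctx.lambda_ = lambda_
        return ranks

    @staticmethod
    def backward(ctx, grad_output):
        scores, ranks = ctx.saved_tensors
        perturbed = scores + ctx.lambda_ * grad_output
        ranks_p = torch.argsort(torch.argsort(perturbed, descending=True)).float()
        # Finite-difference surrogate gradient of the solver output w.r.t. the scores.
        return (ranks - ranks_p) / ctx.lambda_, None

A recall- or AP-style loss can then be computed from ranks = BlackboxRanking.apply(model(x)) and backpropagated through this surrogate.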
DualSDF: Semantic Shape Manipulation Using a Two-Level Representation
Zekun Hao, Hadar Averbuch-Elor, Noah Snavely, Serge Belongie


We are seeing a Cambrian explosion of 3D shape representations for use in machine learning. Some representations seek high expressive power in capturing high-resolution detail. Other approaches seek to represent shapes as compositions of simple parts, which are intuitive for people to understand and easy to edit and manipulate. However, it is difficult to achieve both fidelity and interpretability in the same representation. We propose DualSDF, a representation expressing shapes at two levels of granularity, one capturing fine details and the other representing an abstracted proxy shape using simple and semantically consistent shape primitives. To achieve a tight coupling between the two representations, we use a variational objective over a shared latent space. Our two-level model gives rise to a new shape manipulation technique in which a user can interactively manipulate the coarse proxy shape and see the changes instantly mirrored in the high-resolution shape. Moreover, our model actively augments and guides the manipulation towards producing semantically meaningful shapes, making complex manipulations possible with minimal user input.
[individual, modeling, multiple, work, describe] [semantic, table, propagated, framework, interactive, represents] [model, manipulation, technique, adversarial] [ieee, figure, pattern, high, resolution, prior, coupling, based] [latent, representation, generative, shared, consistency, user, code, semantically, variational, learn, image, generation, intuitive] [space, learning, arxiv, preprint, simple, objective, training, function, proxy, neural, deep, report, sampled, entire, network, sampling, evaluate, machine, achieve] [shape, distance, computer, primitive, conference, point, vision, signed, coarse, surface, well, vad, chair, reconstruction, approach, demonstrate, mesh, representing, novel, represented, sphere, single, collection, cloud, geometric, hao, dualsdf, consistent, complex]
@InProceedings{Hao_2020_CVPR,
  author = {Hao, Zekun and Averbuch-Elor, Hadar and Snavely, Noah and Belongie, Serge},
  title = {DualSDF: Semantic Shape Manipulation Using a Two-Level Representation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
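The coarse, editable level of such a two-level representation can be built from simple primitives whose signed distance is cheap to evaluate; for a union of spheres the SDF is the minimum over per-sphere distances, as in this sketch (shapes illustrative):

import torch

def sphere_union_sdf(points, centers, radii):
    # points: (N, 3) query points; centers: (P, 3) and radii: (P,) define sphere primitives.
    d = torch.cdist(points, centers) - radii.unsqueeze(0)    # per-primitive signed distance (N, P)
    return d.min(dim=1).values                                # SDF of the union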
Dynamic Hierarchical Mimicking Towards Consistent Optimization Objectives
Duo Li, Qifeng Chen


While the depth of modern Convolutional Neural Networks (CNNs) surpasses that of the pioneering networks by a significant margin, the traditional way of appending supervision only over the final classifier and progressively propagating gradient flow upstream remains the training mainstay. Seminal Deeply-Supervised Networks (DSN) were proposed to alleviate the difficulty of optimization arising from gradient flow through a long chain. However, this strategy is still vulnerable to issues including interference with the hierarchical representation generation process and inconsistent optimization objectives, as illustrated theoretically and empirically in this paper. Complementary to previous training strategies, we propose Dynamic Hierarchical Mimicking, a generic feature learning mechanism, to advance CNN training with enhanced generalization ability. Partially inspired by DSN, we fork delicately designed side branches from the intermediate layers of a given neural network. Each branch can emerge from certain locations of the main branch dynamically, which not only retains representation rooted in the backbone network but also generates more diverse representations along its own pathway. We go one step further to promote multi-level interactions among different branches through an optimization formula with probabilistic prediction matching losses, thus guaranteeing a more robust optimization process and better representation ability. Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method over its corresponding counterparts using diverse state-of-the-art CNN architectures. Code and models are publicly available at https://github.com/d-li14/DHM.
[hierarchical, prediction, recognition, hidden, visual, connected, mechanism, dataset, bidirectional] [supervision, cnn, main, branch, table, side, feature, denotes, backbone, final, including, semantic, resnet] [auxiliary, model, datasets, original, generalization, improving, comprehensive, trained] [method, intermediate, proposed, convolutional, dynamic, residual, designed, based, flow] [loss, image, representation, corresponding, diverse, supervised, transfer, person] [training, deep, knowledge, network, optimization, baseline, mimicking, neural, classifier, performance, dhm, probabilistic, process, learning, dsl, layer, accuracy, imagenet, distribution, regularization, gradient, classification, standard, weight, set, data, objective, log, equation, architecture, scheme, function, rate, denote, advance] [term, single, supplementary, error, defined]
@InProceedings{Li_2020_CVPR,
  author = {Li, Duo and Chen, Qifeng},
  title = {Dynamic Hierarchical Mimicking Towards Consistent Optimization Objectives},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
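The "probabilistic prediction matching" between branches can be sketched as a symmetric KL term on the branch outputs; below is a generic version with illustrative names, not the paper's exact formulation.

import torch.nn.functional as F

def mimicking_loss(logits_a, logits_b):
    # Symmetric KL divergence between the predictive distributions of two branches.
    log_pa = F.log_softmax(logits_a, dim=1)
    log_pb = F.log_softmax(logits_b, dim=1)
    kl_ab = F.kl_div(log_pb, log_pa.exp(), reduction="batchmean")   # KL(p_a || p_b)
    kl_ba = F.kl_div(log_pa, log_pb.exp(), reduction="batchmean")   # KL(p_b || p_a)
    return 0.5 * (kl_ab + kl_ba)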
Deep Homography Estimation for Dynamic Scenes
Hoang Le, Feng Liu, Shu Zhang, Aseem Agarwala


Homography estimation is an important step in many computer vision problems. Recently, deep neural network methods have shown to be favorable for this problem when compared to traditional methods. However, these new methods do not consider dynamic content in input images. They train neural networks with only image pairs that can be perfectly aligned using homographies. This paper investigates and discusses how to design and train a deep neural network that handles dynamic scenes. We first collect a large video dataset with dynamic content. We then develop a multi-scale neural network and show that when properly trained using our new dataset, this neural network can already handle dynamic scenes to some extent. To estimate a homography of a dynamic scene in a more principled way, we need to identify the dynamic content. Since dynamic content detection and homography estimation are two tightly coupled tasks, we follow the multi-task learning principles and augment our multi-scale network such that it jointly estimates the dynamics masks and homographies. Our experiments show that our method can robustly estimate homography for challenging scenarios with dynamic scenes, blur artifacts, or lack of textures.
[video, static, dataset, pair, work, moving, previous, evaluation] [mask, feature, corner, global, area, challenging, boundary, pooling] [trained, input, robust, improve] [homography, dynamic, method, motion, figure, vidsetd, convolutional, optical, mhn, ieee, flow, warp, pfnet, vidsets, pattern, warped, existing, develop, version, consecutive, mhnm, perfectly] [image, train, loss, content, corresponding, aligned] [network, neural, deep, large, training, learning, base, number, dhn, better, size, consider, average, problem, compared] [estimation, handle, computer, estimate, conference, international, vision, error, scene, compute, estimated, camera, local]
@InProceedings{Le_2020_CVPR,
  author = {Le, Hoang and Liu, Feng and Zhang, Shu and Agarwala, Aseem},
  title = {Deep Homography Estimation for Dynamic Scenes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PF-Net: Point Fractal Network for 3D Point Cloud Completion
Zitian Huang, Yikuan Yu, Jiawen Xu, Feng Ni, Xinyi Le


In this paper, we propose a Point Fractal Network (PF-Net), a novel learning-based approach for precise and high-fidelity point cloud completion. Unlike existing point cloud completion networks, which generate the overall shape from the incomplete point cloud, often altering existing points and suffering from noise and geometric loss, PF-Net preserves the spatial arrangement of the incomplete point cloud and can recover the detailed geometric structure of the missing region(s) in the prediction. To succeed at this task, PF-Net estimates the missing point cloud hierarchically by utilizing a feature-points-based multi-scale generating network. Further, we combine a multi-stage completion loss with an adversarial loss to generate more realistic missing region(s). The adversarial loss can better tackle multiple modes in the prediction. Our experiments demonstrate the effectiveness of our method on several challenging point cloud completion tasks.
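Completion networks of this kind are commonly supervised with a Chamfer-style distance between predicted and ground-truth point sets; the paper's multi-stage and adversarial losses are more elaborate, but a minimal unbatched NumPy version of the basic metric looks like this.

import numpy as np

def chamfer_distance(pred, gt):
    # pred: (N, 3), gt: (M, 3) point sets.
    # Average nearest-neighbour distance, measured in both directions.
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()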
[pred, prediction, three, decoder, multiple, extract, predict, dataset] [feature, table, pyramid, region, object, final, predicted, semantic, center, focus, map, lose] [input, adversarial, original, change, combined] [method, figure, resolution, output, existing, based] [missing, loss, generate, encoder, latent, train, discriminator, generating, generation] [network, size, learning, data, architecture, vector, deep, set, compared, test, better, training, performance, sampling, classification] [point, cloud, completion, error, shape, incomplete, ground, detailed, partial, ppd, cmlp, mlp, distance, structure, geometric, local, geometry, truth, chair, pcn, repairing, complete, lamp, ydetail, compute, yprimary, ysecondary, ygt]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Zitian and Yu, Yikuan and Xu, Jiawen and Ni, Feng and Le, Xinyi},
  title = {PF-Net: Point Fractal Network for 3D Point Cloud Completion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
On the Regularization Properties of Structured Dropout
Ambar Pal, Connor Lane, Rene Vidal, Benjamin D. Haeffele


Dropout and its extensions (e.g. DropBlock and DropConnect) are popular heuristics for training neural networks, which have been shown to improve generalization performance in practice. However, a theoretical understanding of their optimization and regularization properties remains elusive. Recent work shows that in the case of single hidden-layer linear networks, Dropout is a stochastic gradient descent method for minimizing a regularized loss, and that the regularizer induces solutions that are low-rank and balanced. In this work we show that for single hidden-layer linear networks, DropBlock induces spectral k-support norm regularization, and promotes solutions that are low-rank and have factors with equal norm. We also show that the global minimizer for DropBlock can be computed in closed form, and that DropConnect is equivalent to Dropout. We then show that some of these results can be extended to a general class of Dropout-strategies, and, with some assumptions, to deep non-linear networks when Dropout is applied to the last layer. We verify our theoretical claims and assumptions experimentally with commonly used network architectures.
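For intuition only, a toy NumPy sketch of block-structured dropout on a (batch, features) activation matrix is shown below: contiguous blocks of features are zeroed together and the survivors are rescaled, following the usual inverted-dropout convention. This illustrates the structured-dropout family, not the exact DropBlock operator analyzed in the paper, which acts on spatial feature maps.

import numpy as np

def block_dropout(x, block_size=4, drop_prob=0.3, rng=np.random):
    # x: (batch, d) activations. Each contiguous block of `block_size`
    # features is zeroed with probability `drop_prob`, and kept units are
    # rescaled so the expected activation is unchanged.
    batch, d = x.shape
    n_blocks = d // block_size
    keep = rng.random((batch, n_blocks)) >= drop_prob            # (batch, n_blocks)
    mask = np.repeat(keep, block_size, axis=1).astype(x.dtype)   # expand to feature level
    mask = np.pad(mask, ((0, 0), (0, d - mask.shape[1])), constant_values=1.0)
    return x * mask / (1.0 - drop_prob)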
[hidden, ith, order] [global, final, denotes] [norm, study, case, original] [output, block, result, analysis, figure, method] [] [dropblock, dropout, regularization, training, network, layer, lower, objective, dropconnect, matrix, regularizer, theorem, problem, note, bound, deterministic, neural, induces, applied, probability, equivalent, algorithm, optimization, linear, stochastic, minimizer, deep, kui, size, induced, capacity, minimizers, singular, performance, gradient, closed, function, set, minimum, theoretical, general, min, denote, nuclear, dropping, lemma, envelope, squared, neuron, approximation, constraining, corollary, diagonal, equal, weight, vector, compared, typically, small, simple] [convex, form, computed, single, inf, second, solution]
@InProceedings{Pal_2020_CVPR,
  author = {Pal, Ambar and Lane, Connor and Vidal, Rene and Haeffele, Benjamin D.},
  title = {On the Regularization Properties of Structured Dropout},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Oracle Attention for High-Fidelity Face Completion
Tong Zhou, Changxing Ding, Shaowen Lin, Xinchao Wang, Dacheng Tao


High-fidelity face completion is a challenging task due to the rich and subtle facial textures involved. What makes it more complicated are the correlations between different facial components, for example, the symmetry in texture and structure between the two eyes. While recent works have adopted the attention mechanism to learn the contextual relations among elements of the face, they have largely overlooked the disastrous impacts of inaccurate attention scores; in addition, they fail to pay sufficient attention to key facial components, the completion results of which largely determine the authenticity of a face image. Accordingly, in this paper, we design a comprehensive framework for face completion based on the U-Net structure. Specifically, we propose a dual spatial attention module to efficiently learn the correlations between facial textures at multiple scales; moreover, we provide an oracle supervision signal to the attention module to ensure that the obtained attention scores are reasonable. Furthermore, we take the location of the facial components as prior knowledge and impose a multi-discriminator on these regions, with which the fidelity of facial components is significantly promoted. Extensive experiments on two high-resolution face datasets including CelebA-HQ and Flickr-Faces-HQ demonstrate that the proposed approach outperforms state-of-the-art methods by large margins.
[attention, order, oracle, multiple] [module, feature, supervision, foreground, contextual, area, map, key, propose, mask, gconv, background, denotes, table, semantic, reshape, global, adopt, focus] [facial, face, input, adversarial, impose, model, conduct, original] [figure, proposed, method, quantitative, signal, result, psnr, ssim, prior, convolutional, patch, viewed, dual, output] [image, dsa, inpainting, discriminator, masked, generated, loss, generate, missing, transpose, generative, filling, qualitative, lpips, learn, row, structural] [network, learned, learning, eij, training, layer, best, size, higher, matrix, set, number] [ground, completion, truth, local, approach, structure, left]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Tong and Ding, Changxing and Lin, Shaowen and Wang, Xinchao and Tao, Dacheng},
  title = {Learning Oracle Attention for High-Fidelity Face Completion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Image Spatial Transformation for Person Image Generation
Yurui Ren, Xiaoming Yu, Junming Chen, Thomas H. Li, Ge Li


Pose-guided person image generation aims to transform a source person image to a target pose. This task requires spatial manipulation of source data. However, Convolutional Neural Networks are limited by their lack of ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. Specifically, our model first calculates the global correlations between sources and targets to predict flow fields. Then, the flowed local patch pairs are extracted from the feature maps to calculate the local attention coefficients. Finally, we warp the source features using a content-aware sampling method with the obtained local attention coefficients. The results of both subjective and objective experiments demonstrate the superiority of our model. Besides, additional results in video animation and view synthesis show that our model is applicable to other tasks requiring spatial transformation. Our source code is available at https://github.com/RenYurui/Global-Flow-Local-Attention.
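The spatial reassembly that plain CNNs lack can be emulated by warping source features with a predicted flow field through differentiable bilinear sampling; the PyTorch sketch below shows a generic flow-based warp (not the paper's content-aware local-attention sampler), with all names chosen by us.

import torch
import torch.nn.functional as F

def warp_with_flow(source_feat, flow):
    # source_feat: (B, C, H, W); flow: (B, 2, H, W) pixel offsets (dx, dy).
    # Builds an absolute sampling grid and bilinearly samples the source features.
    b, _, h, w = source_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().to(source_feat.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                   # absolute positions
    # Normalize to [-1, 1] as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)             # (B, H, W, 2)
    return F.grid_sample(source_feat, sample_grid, align_corners=True)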
[attention, video, correct, bilinear, dataset] [feature, global, framework, ablation, occlusion, propose] [model, input, adversarial, subjective] [flow, spatial, ieee, figure, pattern, method, patch, spatially, affine, field, proposed, perceptual, transform, result, vivid, convolutional, output, calculates, warp] [image, source, target, generate, person, appearance, loss, generation, texture, generated, extracted, synthesis, generating, train, animation] [neural, sampling, network, processing, matrix, deep, task, operation, arxiv, preprint, calculate, data] [local, computer, conference, transformation, vision, pose, estimator, renderer, view, coordinate, international, thomas, deformation, human, full]
@InProceedings{Ren_2020_CVPR,
  author = {Ren, Yurui and Yu, Xiaoming and Chen, Junming and Li, Thomas H. and Li, Ge},
  title = {Deep Image Spatial Transformation for Person Image Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Optimize on SPD Manifolds
Zhi Gao, Yuwei Wu, Yunde Jia, Mehrtash Harandi


Many tasks in computer vision and machine learning are modeled as optimization problems with constraints in the form of Symmetric Positive Definite (SPD) matrices. Solving such optimization problems is challenging due to the non-linearity of the SPD manifold, which makes optimization with SPD constraints heavily reliant on expert knowledge and human involvement. In this paper, we propose a meta-learning method to automatically learn an iterative optimizer on SPD manifolds. Specifically, we introduce a novel recurrent model that takes into account the structure of input gradients and identifies the updating scheme of optimization. We parameterize the optimizer by the recurrent model and utilize Riemannian operations to ensure that our method is faithful to the geometry of SPD manifolds. Compared with existing SPD optimizers, our optimizer effectively exploits the underlying data distribution and learns a better optimization trajectory in a data-driven manner. Extensive experiments on various computer vision tasks including metric nearness, clustering, and similarity learning demonstrate that our optimizer outperforms existing state-of-the-art methods consistently.
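For background, a single hand-written Riemannian gradient-descent step on the SPD manifold under the affine-invariant metric — the kind of fixed update rule a learned optimizer would replace — can be sketched as follows. The formulas are the standard AIRM gradient and exponential map; this is an illustration, not the paper's learned update.

import numpy as np
from scipy.linalg import sqrtm, expm

def rsgd_step(X, euclid_grad, lr=0.1):
    # X: SPD matrix; euclid_grad: Euclidean gradient of the loss at X.
    # Under the affine-invariant metric the Riemannian gradient is X sym(G) X,
    # and stepping along the exponential map keeps the iterate SPD.
    sym = 0.5 * (euclid_grad + euclid_grad.T)
    rgrad = X @ sym @ X
    Xh = np.real(sqrtm(X))       # matrix square root (real part guards tiny imaginary noise)
    Xih = np.linalg.inv(Xh)
    return Xh @ expm(-lr * (Xih @ rgrad @ Xih)) @ Xh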
[previous, recognition, state, automatically, time, dataset] [positive, faster, denotes, propose] [model, trained] [figure, ieee, method, pattern, existing] [learn, loss, manifold, image, underlying, preserve, cub] [spd, optimizer, learning, optimization, gradient, riemannian, matrix, training, metric, update, space, optimizers, mlstm, retraction, parameter, set, clustering, stochastic, experience, test, machine, better, similarity, task, neural, base, vector, data, optimize, operation, mehrtash, scheme, descent, rate, orthogonal, performance, definite, distribution, design, converges, function, rsgd, search, inner, batchsize, replay, pool, rsvrg, conducted, knowledge] [conference, computer, international, vision, symmetric, tangent, compute, projection, computed, eigenvalue, geometry, euclidean, directly, form]
@InProceedings{Gao_2020_CVPR,
  author = {Gao, Zhi and Wu, Yuwei and Jia, Yunde and Harandi, Mehrtash},
  title = {Learning to Optimize on SPD Manifolds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep 3D Portrait From a Single Image
Sicheng Xu, Jiaolong Yang, Dong Chen, Fang Wen, Yu Deng, Yunde Jia, Xin Tong


In this paper, we present a learning-based approach for recovering the 3D geometry of human head from a single portrait image. Our method is learned in an unsupervised manner without any ground-truth 3D data. We represent the head geometry with a parametric 3D face model together with a depth map for other head regions including hair and ear. A two-step geometry learning scheme is proposed to learn 3D head reconstruction from in-the-wild face images, where we first learn face shape on single images using self-reconstruction and then learn hair and ear geometry using pairs of images in a stereo-matching fashion. The second step is based on the output of the first to not only improve the accuracy but also ensure the consistency of overall head geometry. We evaluate the accuracy of our method both in 3D and with pose manipulation tasks on 2D images. We alter pose based on the recovered geometry and apply a refinement network trained with adversarial learning to ameliorate the reprojected images and translate them to the real image domain. Extensive evaluations and comparison with previous methods show that our new method can produce high-fidelity 3D head geometry and head pose manipulation results.
[recognition, video] [head, region, apply, refinement, denotes, background, map, table, propose, segmentation] [face, manipulation, input, facial, model, adversarial, expression, ear, dong, change, reenactment, trained] [ieee, method, pattern, figure, based, comparison, proposed, warping, output, presented, perceptual, convolutional, warped] [image, hair, portrait, train, loss, unsupervised, generative, real, learn, synthesis, missing, paired, conditional, inpainting, obvious, editing, zhu] [network, deep, learning, data, training, scheme, accuracy, neural, note, process, average] [conference, computer, vision, reconstruction, pose, depth, geometry, single, acm, international, estimation, shape, reconstructed, rotation, consistent, rgbd, error, michael, compare, well]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Sicheng and Yang, Jiaolong and Chen, Dong and Wen, Fang and Deng, Yu and Jia, Yunde and Tong, Xin},
  title = {Deep 3D Portrait From a Single Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
RDCFace: Radial Distortion Correction for Face Recognition
He Zhao, Xianghua Ying, Yongjie Shi, Xin Tong, Jingsi Wen, Hongbin Zha


The effects of radial lens distortion often appear in wide-angle cameras of surveillance and safeguard systems, which may severely degrade the performance of previous face recognition algorithms. Traditional methods for radial lens distortion correction usually employ line features in scenarios that are not suitable for face images. In this paper, we propose a distortion-invariant face recognition system called RDCFace, which directly and only utilizes the distorted face images, to alleviate the effects of radial lens distortion. RDCFace is an end-to-end trainable cascade network, which can learn rectification and alignment parameters to achieve better face recognition performance without requiring supervision of facial landmarks and distortion parameters. We design sequential spatial transformer layers to optimize the correction, alignment, and recognition modules jointly. The feasibility of our method comes from implicitly using the statistics of the layout of face features learned from large-scale face data. Extensive experiments indicate that our method is distortion robust and gains significant improvements over state-of-the-art methods on LFW, YTF, CFP, and RadialFace, a real distorted face benchmark.
[recognition, dataset, previous, three, transformer, construct, outperforms] [edge, module] [face, distortion, correction, distorted, radial, rdcface, original, input, model, radialface, help, fcorrect, degree, generalization, facial, rong, frec, verification, rdc, suitable, identity, fisheye, improve] [method, rectification, corrected, coefficient, lens, spatial, figure, based, proposed, division, existing, foveal] [image, alignment, loss, align, lrec, real, aligned, generate, lack, train] [network, performance, data, layer, learning, test, training, deep, set, baseline, accuracy, better, large, inverted, evaluate, parameter, standard, weight, achieve, design, optimize, compared] [system, directly, compare, geometric, transformation, error, view, calibration]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, He and Ying, Xianghua and Shi, Yongjie and Tong, Xin and Wen, Jingsi and Zha, Hongbin},
  title = {RDCFace: Radial Distortion Correction for Face Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Global-Local GCN: Large-Scale Label Noise Cleansing for Face Recognition
Yaobin Zhang, Weihong Deng, Mei Wang, Jiani Hu, Xian Li, Dongyue Zhao, Dongchao Wen


In the field of face recognition, large-scale web-collected datasets are essential for learning discriminative representations, but they suffer from noisy identity labels, such as outliers and label flips. It is beneficial to automatically cleanse their label noise for improving recognition accuracy. Unfortunately, existing cleansing methods cannot accurately identify noise in the wild. To solve this problem, we propose an effective automatic label noise cleansing framework for face recognition datasets, FaceGraph. Using two cascaded graph convolutional networks, FaceGraph performs global-to-local discrimination to select useful data in a noisy environment. Extensive experiments show that cleansing widely used datasets, such as CASIA-WebFace, VGGFace2, MegaFace2, and MS-Celeb-1M, using the proposed method can improve the recognition performance of state-of-the-art representation learning methods like Arcface. Further, we cleanse massive self-collected celebrity data, namely MillionCelebs, to provide 18.8M images of 636K identities. Training with the new data, Arcface surpasses state-of-the-art performance by a notable margin to reach 95.62% TPR at 1e-5 FPR on the IJB-C benchmark.
[graph, recognition, dataset, node, gcn, prediction, graphsage, work] [global, table, positive, feature, propagation, hard, confidence, center] [face, noise, cleansing, facegraph, ggn, datasets, cleansed, trained, lgn, garbage, model, verification, cleanse, millioncelebs, identity, casia, subgraphs, improve, arcface] [figure, ieee, pattern, method, convolutional, designed, output, net, noisy, existing, proposed, based, signal, big] [loss, image, train] [label, data, learning, rate, training, performance, classification, deep, class, number, similarity, select, layer, large, higher, randomly, test, arxiv, preprint, tpr, pairwise, network] [conference, local, computer, vision, international, predicts]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Yaobin and Deng, Weihong and Wang, Mei and Hu, Jiani and Li, Xian and Zhao, Dongyue and Wen, Dongchao},
  title = {Global-Local GCN: Large-Scale Label Noise Cleansing for Face Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MISC: Multi-Condition Injection and Spatially-Adaptive Compositing for Conditional Person Image Synthesis
Shuchen Weng, Wenbo Li, Dawei Li, Hongxia Jin, Boxin Shi


In this paper, we explore synthesizing person images with multiple conditions for various backgrounds. To this end, we propose a framework named "MISC" for conditional image generation and image compositing. For conditional image generation, we improve the existing condition injection mechanisms by leveraging the inter-condition correlations. For image compositing, we theoretically prove the weaknesses of the cutting-edge methods, and make compositing more robust by removing the spatial-invariance constraint and enabling the bounding mechanism and spatial adaptability. We show the effectiveness of our method on the Video Instance-level Parsing dataset, and demonstrate its robustness through controllability tests.
[three, mechanism, multiple, embedding] [foreground, parsing, bounding, table, effectiveness, background, feature, stage, propose, framework, instance, semantic] [condition, model, injection, input, egc, adversarial, create, study, auxiliary, noise] [color, figure, pattern, spatial, tone, column, gaussian, proposed, adaptive, conv, quantitative, formulated, based, pixel, output, removing, visually] [image, compositing, person, attribute, conditional, generation, misc, loss, generated, adain, synthesis, lcm, unicolor, realistic, composited, injecting, abstract, spade, qualitative, necessity, adjusting, controllability] [network, gain, uniform, compared, number, set, problem, gradient, neural, normalization, denote, binary, baseline, similarity, training] [body, transformation, geometry, inferred, constraint]
@InProceedings{Weng_2020_CVPR,
  author = {Weng, Shuchen and Li, Wenbo and Li, Dawei and Jin, Hongxia and Shi, Boxin},
  title = {MISC: Multi-Condition Injection and Spatially-Adaptive Compositing for Conditional Person Image Synthesis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SAINT: Spatially Aware Interpolation NeTwork for Medical Slice Synthesis
Cheng Peng, Wei-An Lin, Haofu Liao, Rama Chellappa, S. Kevin Zhou


Deep learning-based single image super-resolution (SISR) methods face various challenges when applied to 3D medical volumetric data (i.e., CT and MR images) due to the high memory cost and anisotropic resolution, which adversely affect their performance. Furthermore, mainstream SISR methods are designed to work over specific upsampling factors, which makes them ineffective in clinical practice. In this paper, we introduce a Spatially Aware Interpolation NeTwork (SAINT) for medical slice synthesis to alleviate the memory constraint that volumetric data poses. Compared to other super-resolution methods, SAINT utilizes voxel spacing information to provide desirable levels of details, and allows the upsampling factor to be determined on the fly. Our evaluations based on 853 CT scans from four datasets that contain liver, colon, hepatic vessels, and kidneys show that SAINT consistently outperforms other SISR methods in terms of medical slice synthesis quality, while using only a single model to deal with different upsampling factors.
[visual, three, work, dataset, order] [feature, stage, segmentation, cnn, aware, propose, apply, table] [quality, datasets, model, physical, input] [slice, upsampling, interpolation, medical, saint, ami, icor, sisr, resolution, anisotropic, convolutional, isag, proposed, mdsr, rdn, ieee, residual, icsr, method, axial, convolution, quantitative, based, figure, mdcsrn, kidney, hepatic, spatial, spatially, high, acquisition, pixel, rfn, june, upsample, applying, called] [image, factor, generation, generate, generates, generated, arbitrary, igt, unseen] [network, filter, memory, learning, performance, deep, data, compared, better, best, meta, inference, achieve] [distance, single, allows, refer, dense, axis, volume, computer, voxel, approach]
@InProceedings{Peng_2020_CVPR,
  author = {Peng, Cheng and Lin, Wei-An and Liao, Haofu and Chellappa, Rama and Zhou, S. Kevin},
  title = {SAINT: Spatially Aware Interpolation NeTwork for Medical Slice Synthesis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Recurrent Feature Reasoning for Image Inpainting
Jingyuan Li, Ning Wang, Lefei Zhang, Bo Du, Dacheng Tao


Existing inpainting methods have achieved promising performance for recovering regular or small image defects. However, filling in large continuous holes remains difficult due to the lack of constraints for the hole center. In this paper, we devise a Recurrent Feature Reasoning (RFR) network which is mainly constructed by a plug-and-play Recurrent Feature Reasoning module and a Knowledge Consistent Attention (KCA) module. Analogous to how humans solve puzzles (i.e., first solve the easier parts and then use the results as additional information to solve difficult parts), the RFR module recurrently infers the hole boundaries of the convolutional feature maps and then uses them as clues for further inference. The module progressively strengthens the constraints for the hole center and the results become explicit. To capture information from distant places in the feature map for RFR, we further develop KCA and incorporate it in RFR. Empirically, we first compare the proposed RFR-Net with existing backbones, demonstrating that RFR-Net is more efficient (e.g., a 4% SSIM improvement for the same model size). We then place the network in the context of the current state-of-the-art, where it exhibits improved performance. The corresponding source code is available at: https://github.com/jingyuanli001/RFR-Inpainting
[attention, reasoning, recurrent, dataset, previous, three, current] [feature, module, map, area, mask, score, pconv, location, semantic, table, center, merging, pic, ablation, background] [model, paris, input] [convolutional, recurrence, hole, existing, pixel, quantitative, figure, method, convolution, prvs, proposed, valid, ssim] [image, inpainting, rfr, loss, progressive, generated, masked, kca, inpaint, edgeconnect, progressively, generative, streetview, gatedconv, corresponding, texture, semantically, structural, style, lack, damaged] [network, number, performance, knowledge, layer, deep, size, process, training, calculated, merged, architecture, computational, inference, function, space, devise] [consistent, partial, compare, directly, solve, reconstructed, acm, rgb, structure]
@InProceedings{Li_2020_CVPR,
  author = {Li, Jingyuan and Wang, Ning and Zhang, Lefei and Du, Bo and Tao, Dacheng},
  title = {Recurrent Feature Reasoning for Image Inpainting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Structure-Preserving Super Resolution With Gradient Guidance
Cheng Ma, Yongming Rao, Yean Cheng, Ce Chen, Jiwen Lu, Jie Zhou


Structures matter in single image super resolution (SISR). Recent studies benefiting from generative adversarial networks (GAN) have promoted the development of SISR by recovering photo-realistic images. However, there are always undesired structural distortions in the recovered images. In this paper, we propose a structure-preserving super resolution method to alleviate the above issue while maintaining the merits of GAN-based methods to generate perceptually pleasant details. Specifically, we exploit gradient maps of images to guide the recovery in two aspects. On the one hand, we restore high-resolution gradient maps by a gradient branch to provide additional structure priors for the SR process. On the other hand, we propose a gradient loss which imposes a second-order restriction on the super-resolved images. Along with the previous image-space loss functions, the gradient-space objectives help generative networks concentrate more on geometric structures. Moreover, our method is model-agnostic and can potentially be used with off-the-shelf SR networks. Experimental results show that we achieve the best PI and LPIPS performance while maintaining comparable PSNR and SSIM relative to state-of-the-art perceptual-driven SR methods. Visual results demonstrate our superiority in restoring structures while generating natural SR images.
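The gradient-space supervision can be illustrated with a small PyTorch sketch: gradient maps are extracted with fixed Sobel filters and an L1 penalty is placed between the gradient maps of the super-resolved and ground-truth images. This is a simplified stand-in for the paper's full gradient branch and losses.

import torch
import torch.nn.functional as F

_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)

def gradient_map(img):
    # img: (B, C, H, W); per-channel gradient magnitude via depthwise Sobel filtering.
    c = img.shape[1]
    gx = F.conv2d(img, _SOBEL_X.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, _SOBEL_Y.repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def gradient_loss(sr, hr):
    # L1 distance between gradient maps of the SR output and the HR target.
    return F.l1_loss(gradient_map(sr), gradient_map(hr))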
[natural, visual, provide, exploit] [branch, propose, map, feature, edge, china, guide, effectiveness, grant] [adversarial, model, quality, improve, testing, gradiant] [spsr, method, esrgan, figure, perceptual, psnr, natsr, proposed, ssim, block, sharp, convolutional, recover, srgan, resolution, guidance, blurry, residual, conv, bicubic, comparison, super, high, sharpness, output, sftgan, optimized, utilized, enhancenet, based] [image, loss, generative, structural, lpips, generate, produce, extracted, generator, translation] [gradient, network, deep, best, better, performance, learning, arxiv, preprint, achieve, layer, compared] [geometric, single, recovered, structure, second]
@InProceedings{Ma_2020_CVPR,
  author = {Ma, Cheng and Rao, Yongming and Cheng, Yean and Chen, Ce and Lu, Jiwen and Zhou, Jie},
  title = {Structure-Preserving Super Resolution With Gradient Guidance},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Epipolar Transformers
Yihui He, Rui Yan, Katerina Fragkiadaki, Shoou-I Yu


A common approach to localizing 3D human joints in a synchronized and calibrated multi-view setup consists of two steps: (1) apply a 2D detector separately on each view to localize joints in 2D, and (2) perform robust triangulation on the 2D detections from each view to acquire the 3D joint locations. However, in step 1, the 2D detector is limited to solving challenging cases, such as occlusions and oblique viewing angles, purely in 2D without leveraging any 3D information, even though such cases could potentially be better resolved in 3D. Therefore, we propose the differentiable "epipolar transformer", which enables the 2D detector to leverage 3D-aware features to improve 2D pose estimation. The intuition is: given a 2D location p in the current view, we would like to first find its corresponding point p' in a neighboring view, and then combine the features at p' with the features at p, thus leading to a 3D-aware feature at p. Inspired by stereo matching, the epipolar transformer leverages epipolar constraints and feature matching to approximate the features at p'. Experiments on InterHand and Human3.6M show that our approach has consistent improvements over the baselines. Specifically, in the condition where no external data is used, our Human3.6M model trained with a ResNet-50 backbone and image size 256 x 256 outperforms the state of the art by 4.23 mm and achieves an MPJPE of 26.9 mm.
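The geometric core — finding the corresponding point p' in a neighbouring view — can be sketched by sampling candidates along the epipolar line of p (given a fundamental matrix mapping source points to lines in the neighbour) and keeping the candidate with the most similar deep feature. The NumPy sketch below is a simplified, hypothetical version of that matching step, not the paper's differentiable fusion.

import numpy as np

def match_along_epipolar_line(p, F, feat_src, feat_nbr, n_samples=64):
    # p: (x, y) pixel in the source view; F: 3x3 fundamental matrix mapping
    # source points to epipolar lines in the neighbouring view.
    # feat_src, feat_nbr: (C, H, W) feature maps of the two views.
    C, H, W = feat_nbr.shape
    a, b, c = F @ np.array([p[0], p[1], 1.0])       # line ax + by + c = 0
    xs = np.linspace(0, W - 1, n_samples)
    ys = -(a * xs + c) / (b + 1e-9)                 # assumes the line is not vertical
    valid = (ys >= 0) & (ys <= H - 1)
    xs, ys = xs[valid].astype(int), ys[valid].astype(int)
    query = feat_src[:, int(p[1]), int(p[0])]       # feature at p
    cand = feat_nbr[:, ys, xs]                      # (C, K) candidate features
    scores = query @ cand / (np.linalg.norm(query) * np.linalg.norm(cand, axis=0) + 1e-9)
    k = scores.argmax()
    return (xs[k], ys[k]), scores[k]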
[transformer, viewing, attention, prediction, dataset, multiple, embedded] [feature, table, detector, module, fuse, challenging, location, detection, final] [trained, input, external, difference, datasets, model, identity, hourglass] [reference, ieee, pattern, figure, fusion, proposed, gaussian, neighboring, intermediate, method, comparison, based, learnable, color] [source, corresponding, image] [network, data, number, size, deep, training, learning, baseline, better, performance, learned, sampler, similarity, softmax, sum, max] [epipolar, pose, view, conference, computer, hand, human, estimation, vision, mpjpe, interhand, triangulation, joint, matching, camera, european, depth, qiu, monocular, leverage, point, single, rgb, angle, compute, accurate]
@InProceedings{He_2020_CVPR,
  author = {He, Yihui and Yan, Rui and Fragkiadaki, Katerina and Yu, Shoou-I},
  title = {Epipolar Transformers},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Diversified Arbitrary Style Transfer via Deep Feature Perturbation
Zhizhong Wang, Lei Zhao, Haibo Chen, Lihong Qiu, Qihang Mo, Sihuan Lin, Wei Xing, Dongming Lu


Image style transfer is an underdetermined problem, where a large number of solutions can satisfy the same constraint (the content and style). Although there have been some efforts to improve the diversity of style transfer by introducing an alternative diversity loss, they have restricted generalization, limited diversity and poor scalability. In this paper, we tackle these limitations and propose a simple yet effective method for diversified arbitrary style transfer. The key idea of our method is an operation called deep feature perturbation (DFP), which uses an orthogonal random noise matrix to perturb the deep image feature maps while keeping the original style information unchanged. Our DFP operation can be easily integrated into many existing WCT (whitening and coloring transform)-based methods, and empower them to generate diverse results for arbitrary styles. Experimental results demonstrate that this learning-free and universal method can greatly increase the diversity while maintaining the quality of stylization.
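The key operation is straightforward to reproduce: draw a random orthogonal matrix (for example via QR decomposition of a Gaussian matrix) and use it to rotate the whitened content features before the coloring step. The NumPy sketch below shows only this perturbation, with the WCT whitening and coloring themselves assumed to come from an existing pipeline.

import numpy as np

def random_orthogonal(dim, rng=np.random):
    # QR of a Gaussian matrix yields an orthogonal matrix; fixing column signs
    # by the diagonal of R makes the distribution well behaved.
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))

def perturb_whitened(fc_whitened, rng=np.random):
    # fc_whitened: (C, H*W) whitened content features from a WCT pipeline.
    # Rotating them with an orthogonal matrix leaves their covariance (identity)
    # unchanged, so the subsequent coloring step still matches the style statistics.
    Q = random_orthogonal(fc_whitened.shape[0], rng)
    return Q @ fc_whitened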
[recognition, work] [feature, level, bottom] [noise, quality, perturbation, original, perturb, perturbed, strength, easily, sheng] [method, ieee, pattern, transform, proposed, figure, pixel, convolutional, column, fast, quantitative, integrated, existing, based] [style, diversity, transfer, content, image, gram, stylization, est, arbitrary, diversified, coloring, diverse, dfp, generate, texture, row, fcsn, generated, loss, perturbing, artistic, satisfy, wct, extracted, ulyanov, synthesis] [matrix, deep, orthogonal, number, neural, whitening, default, random, top, insert, hyperparameter, network, distribution, achieve, learning, large, set, small, layer, increase, problem, space] [conference, computer, vision, distance, international]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Zhizhong and Zhao, Lei and Chen, Haibo and Qiu, Lihong and Mo, Qihang and Lin, Sihuan and Xing, Wei and Lu, Dongming},
  title = {Diversified Arbitrary Style Transfer via Deep Feature Perturbation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MSG-GAN: Multi-Scale Gradients for Generative Adversarial Networks
Animesh Karnewar, Oliver Wang


While Generative Adversarial Networks (GANs) have seen huge successes in image synthesis tasks, they are notoriously difficult to adapt to different datasets, in part due to instability during training and sensitivity to hyperparameters. One commonly accepted reason for this instability is that gradients passing from the discriminator to the generator become uninformative when there isn't enough overlap in the supports of the real and fake distributions. In this work, we propose the Multi-Scale Gradient Generative Adversarial Network (MSG-GAN), a simple but effective technique for addressing this by allowing the flow of gradients from the discriminator to the generator at multiple scales. This technique provides a stable approach for high resolution image synthesis, and serves as an alternative to the commonly used progressive growing technique. We show that MSG-GAN converges stably on a variety of image datasets of different sizes, resolutions and domains, as well as different types of loss functions and architectures, all with the same set of fixed hyperparameters. When compared to state-of-the-art GANs, our approach matches or exceeds the performance in most of the cases we tried.
[multiple, time, dataset, work, lin, previous] [final, table, overlap, score] [adversarial, technique, datasets, indian, trained, quality, stability, model] [resolution, high, method, intermediate, output, figure, proposed, flow, viewed, commonly] [image, generative, generator, discriminator, generated, real, progans, progressive, gan, fid, growing, loss, ffhq, synthesis, gans, generation, stylegan, celebs, latent, generate, lsun, fake, address] [training, function, learning, note, layer, neural, processing, best, gradient, standard, lower, instability, fixed, architecture, number, performance, data, problem, hyperparameters] [approach, combine, defined, single, conference]
@InProceedings{Karnewar_2020_CVPR,
  author = {Karnewar, Animesh and Wang, Oliver},
  title = {MSG-GAN: Multi-Scale Gradients for Generative Adversarial Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Overcoming Multi-Model Forgetting in One-Shot NAS With Diversity Maximization
Miao Zhang, Huiqi Li, Shirui Pan, Xiaojun Chang, Steven Su


One-Shot Neural Architecture Search (NAS) significantly improves computational efficiency through weight sharing. However, this approach also introduces multi-model forgetting during supernet training (the architecture search phase), where the performance of previous architectures degrades when sequentially training new architectures with partially-shared weights. To overcome such catastrophic forgetting, the state-of-the-art method assumes that the shared weights are optimal when jointly optimizing a posterior probability. However, this strict assumption does not necessarily hold for One-Shot NAS in practice. In this paper, we formulate supernet training in One-Shot NAS as a constrained optimization problem of continual learning, in which learning the current architecture should not degrade the performance of previous architectures during supernet training. We propose a Novelty Search based Architecture Selection (NSAS) loss function and demonstrate that the posterior probability can be calculated without the strict assumption when maximizing the diversity of the selected constraints. A greedy novelty search method is devised to find the most representative subset to regularize the supernet training. We apply our proposed approach to two One-Shot NAS baselines, random sampling NAS (RandomNAS) and gradient-based sampling NAS (GDAS). Extensive experiments demonstrate that our method enhances the predictive ability of the supernet in One-Shot NAS and achieves remarkable performance on CIFAR-10, CIFAR-100, and PTB with high efficiency.
[previous, current, step, const] [achieves, effectiveness] [trained, model, constrained, experimental, representative, numerous] [based, proposed, method, assumption, figure, cell, enhance] [loss, shared, train, diversity, ability] [architecture, supernet, search, neural, training, learning, weight, function, performance, forgetting, validation, randomnas, accuracy, novelty, path, random, subset, gdas, predictive, sharing, gradient, optimal, retraining, optimization, inheriting, ranking, posterior, problem, selection, test, sample, arxiv, preprint, catastrophic, probability, space, better, continual, greedy, algorithm, machine, oneshot, consider, best, param, overcoming, promising, select, ltrain, discrete, maximize, wpl] [single, conference, international, approach, continuous, normal, error]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Miao and Li, Huiqi and Pan, Shirui and Chang, Xiaojun and Su, Steven},
  title = {Overcoming Multi-Model Forgetting in One-Shot NAS With Diversity Maximization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Select to Better Learn: Fast and Accurate Deep Learning Using Data Selection From Nonlinear Manifolds
Mohsen Joneidi, Saeed Vahidian, Ashkan Esmaeili, Weijia Wang, Nazanin Rahnavard, Bill Lin, Mubarak Shah


Finding a small subset of data points whose linear combination spans the other data points, also called the column subset selection problem (CSSP), is an important open problem in computer science with many applications in computer vision and deep learning. Some studies solve CSSP in polynomial time w.r.t. the size of the original dataset. A simple and efficient selection algorithm with linear complexity, referred to as spectrum pursuit (SP), is proposed that pursues the spectral components of the dataset using available sample points. The proposed non-greedy algorithm aims to iteratively find K data samples whose span is close to that of the first K spectral components of the entire dataset. SP has no parameters to be fine-tuned, and this desirable property makes it problem-independent. The simplicity of SP enables us to extend the underlying linear model to more complex models such as nonlinear manifolds and graph-based models. The nonlinear extension of SP is introduced as kernel-SP (KSP). The superiority of the proposed algorithms is demonstrated in a wide range of applications.
[graph, dataset, social, span, recognition, referred] [siamese, propose] [nonlinear, trained, identification, representative, model, face] [proposed, ieee, based, column, pattern, figure, kernel, fast, spectrum, spectral, analysis] [real, gan] [selection, data, algorithm, selected, problem, matrix, subset, learning, network, training, linear, ksp, neural, test, set, best, performance, selecting, number, select, complexity, processing, normalized, accuracy, singular, ipm, classification, deep, open, efficient, pursuit, entire, function, random, evaluate, similarity, sosis, class, classifier, finding, sample, large, selects, sampled, sampling, subspace, clustering, labeled, machine] [computer, conference, projection, vision, error, volume, international, approach, convex, accurate]
@InProceedings{Joneidi_2020_CVPR,
  author = {Joneidi, Mohsen and Vahidian, Saeed and Esmaeili, Ashkan and Wang, Weijia and Rahnavard, Nazanin and Lin, Bill and Shah, Mubarak},
  title = {Select to Better Learn: Fast and Accurate Deep Learning Using Data Selection From Nonlinear Manifolds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Point Cloud Rendering via Multi-Plane Projection
Peng Dai, Yinda Zhang, Zhuwen Li, Shuaicheng Liu, Bing Zeng


We present a new deep point cloud rendering pipeline through multi-plane projections. The input to the network is the raw point cloud of a scene and the output is an image or image sequence from a novel view or along a novel camera trajectory. Unlike previous approaches that directly project features from 3D points onto the 2D image domain, we propose to project these features into a layered volume of the camera frustum. In this way, the visibility of 3D points can be automatically learnt by the network, such that ghosting effects due to false visibility checks, as well as occlusions caused by noise interference, are both avoided successfully. Next, the 3D feature volume is fed into a 3D CNN to produce multiple planes of images w.r.t. the space division in the depth directions. The multi-plane images are then blended based on learned weights to produce the final rendering results. Experiments show that our network produces more stable renderings compared to previous methods, especially near object boundaries. Moreover, our pipeline is robust to noisy and relatively sparse point clouds for a variety of challenging scenes.
[temporal, work, multiple, dataset, previous] [feature, interactive, propose, occluded, adopt, framework] [blending, noise, visibility, input, robust] [method, based, figure, result, perceptual, output, noisy, light, proposed, color, pixel] [image, produce, project, generated, loss, representation, appearance, inpainting, synthesis] [neural, deep, network, learning, space, better, weight, learned, memory, number, large, training, note] [point, rendering, cloud, camera, view, depth, volume, render, npg, scene, frustum, computer, novel, sparse, scannet, geometry, matterport, acm, directly, voxelization, rgb, voxel, direct, conference, projection, layered, surface, projected, plane, well, direction, matthias, pipeline, aliev, ground, voxels]
@InProceedings{Dai_2020_CVPR,
  author = {Dai, Peng and Zhang, Yinda and Li, Zhuwen and Liu, Shuaicheng and Zeng, Bing},
  title = {Neural Point Cloud Rendering via Multi-Plane Projection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Wish You Were Here: Context-Aware Human Generation
Oran Gafni, Lior Wolf


We present a novel method for inserting objects, specifically humans, into existing images, such that they blend in a photorealistic manner, while respecting the semantic context of the scene. Our method involves three subnetworks: the first generates the semantic map of the new person, given the pose of the other persons in the scene and an optional bounding box specification. The second network renders the pixels of the novel person and its blending mask, based on specifications in the form of multiple appearance components. A third network refines the generated face in order to match those of the target person. Our experiments present convincing high-resolution outputs in this novel and challenging application domain. In addition, the three networks are evaluated individually, demonstrating for example, state of the art results in pose transfer benchmarks.
[work, three, provide, conditioning, context, dataset, recognition, previous] [semantic, map, bounding, mask, ablation, employ, box, background, segmented, add, challenging] [face, input, trained, demonstrated, densepose, deepfashion, facial, blending] [method, figure, ieee, existing, pattern, based, output, perceptual, resolution, presented, channel, comparison] [person, image, generated, target, generation, egn, mcrn, generates, loss, appearance, generate, transfer, user, ability, source, third, component, realistic, drawing, optional, control] [network, arxiv, preprint, training, set, number, size, higher, binary, applied, frn, randomly] [pose, human, novel, conference, computer, vision, application, additional, second, scene, well, rendered, form, supplementary, coherent, rendering, demonstrate, single]
@InProceedings{Gafni_2020_CVPR,
  author = {Gafni, Oran and Wolf, Lior},
  title = {Wish You Were Here: Context-Aware Human Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content
Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, Ping Luo


Image visual try-on aims at transferring a target clothes image onto a reference person, and has become a hot topic in recent years. Prior arts usually focus on preserving the character of a clothes image (e.g. texture, logo, embroidery) when warping it to an arbitrary human pose. However, it remains a big challenge to generate photo-realistic try-on images when large occlusions and human poses are presented in the reference person. To address this issue, we propose a novel visual try-on network, namely the Adaptive Content Generating and Preserving Network (ACGPN). In particular, ACGPN first predicts the semantic layout of the reference image that will be changed after try-on (e.g. long sleeve shirt-arm, arm-jacket), and then determines whether its image content needs to be generated or preserved according to the predicted semantic layout, leading to photo-realistic try-on and rich clothes details. ACGPN generally involves three major modules. First, a semantic layout generation module utilizes semantic segmentation of the reference image to progressively predict the desired semantic layout after try-on. Second, a clothes warping module warps the clothes image according to the generated semantic layout, where a second-order difference constraint is introduced to stabilize the warping process during training. Third, an inpainting module for content fusion integrates all information (e.g. reference image, semantic layout, warped clothes) to adaptively produce each semantic part of the human body. In comparison to the state-of-the-art methods, ACGPN can generate photo-realistic images with much better perceptual quality and richer fine details.
[visual, difficulty, three, dataset, character] [semantic, mask, module, hard, easy, map, region, segmentation, bottom, table, fully, ping] [clothing, clothes, acgpn, viton, vtnfp, fashion, medium, difference, adversarial, posture, quality, original, great, refers, study] [reference, warping, proposed, ieee, warped, based, method, adaptively, fusion, comparison, adaptive, figure, spatial] [image, target, generate, layout, generation, content, synthesized, preserve, preservation, generated, preserving, loss, texture, person, generative, inpainting, generating, generates, composition, user] [network, better, deep, training, learning, process] [body, virtual, computer, constraint, pose, human, conference, shape, coarse, vision, well, torso, transformation]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Han and Zhang, Ruimao and Guo, Xiaobao and Liu, Wei and Zuo, Wangmeng and Luo, Ping},
  title = {Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Breaking the Cycle - Colleagues Are All You Need
Ori Nizan, Ayellet Tal


This paper proposes a novel approach to performing image-to-image translation between unpaired domains. Rather than relying on a cycle constraint, our method takes advantage of collaboration between various GANs. This results in a multi-modal method, in which multiple optional and diverse images are produced for a given image. Our model addresses some of the shortcomings of classical GANs: (1) It is able to remove large objects, such as glasses. (2) Since it does not need to support the cycle constraint, no irrelevant traces of the input are left on the generated image. (3) It manages to translate between domains that require large shape modifications. Our results are shown to outperform those generated by state-of-the-art methods for several challenging applications on commonly-used datasets, both qualitatively and quantitatively.
[multiple, recognition, dataset, previous, goal, three, multimodal] [focus, propose, table, challenging] [input, adversarial, model, face, trained, member, selfie] [figure, ieee, result, pattern, output, classical, method, quantitative, remove] [council, image, generated, generator, generative, translation, gan, unsupervised, domain, source, target, discriminator, loss, male, generate, lossi, diverse, produced, anime, cycle, real, produce, cyclegan, female, common, unpaired, idea, distinguish, distinguishes, fid, kid, completely] [learning, outperform, random, neural, large, function, size, processing, set, equation, number, training, data, converge, entropy, paper] [conference, computer, vision, international, shape, novel, approach, single, structure, left]
@InProceedings{Nizan_2020_CVPR,
  author = {Nizan, Ori and Tal, Ayellet},
  title = {Breaking the Cycle - Colleagues Are All You Need},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation
Hao Tang, Dan Xu, Yan Yan, Philip H.S. Torr, Nicu Sebe


In this paper, we address the task of semantic-guided scene generation. One open challenge widely observed in global image-level generation methods is the difficulty of generating small objects and detailed local texture. To tackle this issue, in this work we consider learning the scene generation in a local context, and correspondingly design a local class-specific generative network with semantic maps as a guidance, which separately constructs and learns sub-generators concentrating on the generation of different classes, and is able to provide more scene details. To learn more discriminative class-specific feature representations for the local generation, a novel classification module is also proposed. To combine the advantage of both global image-level and the local class-specific generation, a joint generation network is designed with an attention fusion module and a dual-discriminator structure embedded. Extensive experiments on two scene image generation tasks show superior generation performance of the proposed model. State-of-the-art results are established by large margins on both tasks and on challenging public benchmarks. The source code and trained models are available at https://github.com/Ha0Tang/LGGAN.
[three, evaluation, attention, dataset] [global, semantic, feature, map, object, propose, module, miou, table, final, sims] [adversarial, input, model] [proposed, convolutional, method, existing, fusion, figure] [image, generation, generative, lggan, synthesis, generator, translation, generate, igl, discriminative, generated, conditional, selectiongan, gaugan, learn, gan, fid, qualitative, learns, dayton, loss, inception, nicu, discriminator, encoder, produce] [network, learning, weight, class, observe, better, training, design, classification, number, large, compared, follow, task, small, learned, size] [local, scene, novel, structure, hao, thomas, dan, single]
@InProceedings{Tang_2020_CVPR,
  author = {Tang, Hao and Xu, Dan and Yan, Yan and Torr, Philip H.S. and Sebe, Nicu},
  title = {Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ManiGAN: Text-Guided Image Manipulation
Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, Philip H.S. Torr


The goal of our paper is to semantically edit parts of an image matching a given text that describes desired attributes (e.g., texture, colour, and background), while preserving other contents that are irrelevant to the text. To achieve this, we propose a novel generative adversarial network (ManiGAN), which contains two key components: text-image affine combination module (ACM) and detail correction module (DCM). The ACM selects image regions relevant to the given text and then correlates the regions with corresponding semantic words for effective manipulation. Meanwhile, it encodes original image features to help reconstruct text-irrelevant contents. The DCM rectifies mismatched attributes and completes missing contents of the synthetic image. Finally, we suggest a new metric for evaluating image manipulation results, in terms of both the generation of new attributes and the reconstruction of text-irrelevant contents. Extensive experiments on the CUB and COCO datasets demonstrate the superior performance of the proposed method.
[text, visual, natural, red, attention, hidden, language, concatenation, yellow, three] [module, main, coco, semantic, feature, denotes, fuse, focus, propose, adopt] [model, manipulation, original, input, adversarial, correction, regional, manipulated, effective, effectively, black] [affine, combination, detail, ieee, method, proposed, existing, pattern, figure, based] [image, corresponding, generation, generative, cub, bird, produce, generate, missing, generated, semantically, desired, generating, translation, sisgan, tagan, conditional, synthesis, manipulative, preserving, fails, style, pretrained] [training, neural, achieve, network, arxiv, preprint, processing, similarity, precision, required, learning, architecture] [conference, computer, acm, dcm, reconstruction, matching, vision, reconstruct]
@InProceedings{Li_2020_CVPR,
  author = {Li, Bowen and Qi, Xiaojuan and Lukasiewicz, Thomas and Torr, Philip H.S.},
  title = {ManiGAN: Text-Guided Image Manipulation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Watch Your Up-Convolution: CNN Based Generative Deep Neural Networks Are Failing to Reproduce Spectral Distributions
Ricard Durall, Margret Keuper, Janis Keuper


Generative convolutional deep neural networks, e.g. popular GAN architectures, rely on convolution-based up-sampling methods to produce non-scalar outputs like images or video sequences. In this paper, we show that common up-sampling methods, known as up-convolution or transposed convolution, cause such models to fail to reproduce the spectral distributions of natural training data correctly. This effect is independent of the underlying architecture, and we show that it can be used to easily detect generated data like deepfakes with up to 100% accuracy on public benchmarks. To overcome this drawback of current generative models, we propose adding a novel spectral regularization term to the training optimization objective. We show that this approach not only allows training spectrally consistent GANs that avoid high-frequency errors, but also that a correct approximation of the frequency spectrum has positive effects on the training stability and output quality of generative networks.
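The statistic underlying such analyses — the azimuthally averaged (1D) power spectrum of an image — can be computed with a few lines of NumPy; this reproduces the general idea, not the authors' exact preprocessing or regularizer.

import numpy as np

def azimuthal_power_spectrum(img):
    # img: 2D grayscale array. Returns the 1D power spectrum obtained by
    # averaging the centred 2D FFT power over rings of equal radius.
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(x - w // 2, y - h // 2).astype(int)
    radial_sum = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return radial_sum / np.maximum(counts, 1)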
[correct, order, video, artificial, current] [detection, detect, propose, positive] [adversarial, face, deepfake, input, original, public, stability] [spectral, figure, frequency, high, ieee, spectrum, convolutional, based, pattern, output, spatial, low, resolution, transconv, convolution, proposed, signal, analysis, interpolation, integration, commonly] [generative, image, loss, gan, real, fid, generated, gans, generator, common, generation, azimuthal, celeba, fake, dcgan, lsgan] [training, data, neural, learning, power, regularization, arxiv, preprint, deep, simple, processing, filter, layer, network, size, stable, large, evaluate, machine, top, impact, theoretical] [conference, computer, vision, international, term, well]
@InProceedings{Durall_2020_CVPR,
  author = {Durall, Ricard and Keuper, Margret and Keuper, Janis},
  title = {Watch Your Up-Convolution: CNN Based Generative Deep Neural Networks Are Failing to Reproduce Spectral Distributions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems
Patrick Knobelreiter, Christian Sormann, Alexander Shekhovtsov, Friedrich Fraundorfer, Thomas Pock


It has been proposed by many researchers that combining deep neural networks with graphical models can create more efficient and better regularized composite models. The main difficulties in implementing this in practice are associated with a discrepancy in suitable learning objectives as well as with the necessity of approximations for the inference. In this work we take one of the simplest inference methods, a truncated max-product Belief Propagation, and add what is necessary to make it a proper component of a deep learning model: connect it to learning formulations with losses on marginals and compute the backprop operation. This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs), allowing us to design a hierarchical model composing BP inference and CNNs at different scale levels. The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, flow and semantic segmentation.
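For intuition, one left-to-right min-sum pass (max-product in the negative-log domain) over a chain with a truncated pairwise cost looks as follows; this toy NumPy sketch only illustrates the style of inference a BP-Layer wraps, with all costs assumed:

import numpy as np

def chain_min_sum(unary, smooth=1.0, truncation=2.0):
    """unary: (N, L) negative-log costs for N chain nodes and L labels.
    Returns per-node beliefs from a single left-to-right message pass."""
    n, labels = unary.shape
    pairwise = smooth * np.minimum(
        np.abs(np.arange(labels)[:, None] - np.arange(labels)[None, :]),
        truncation)                                   # truncated linear pairwise cost
    msg = np.zeros(labels)
    beliefs = np.empty_like(unary)
    for i in range(n):
        belief = unary[i] + msg
        beliefs[i] = belief
        msg = (belief[:, None] + pairwise).min(axis=0)  # message passed to node i+1
    return beliefs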
[prediction, recognition, work, belief, graph, hierarchical, time, three] [semantic, propagation, refinement, table, cnn, segmentation, crf, apply, score, add] [model, input, robust] [flow, pattern, ieee, optical, dynamic, method, marginals, middlebury, pixel, proposed, disparity, block, convolutional, spatial, figure, output, sgm, tree] [loss, image, conditional] [learning, inference, algorithm, deep, pairwise, training, neural, max, efficient, approximate, chain, random, gradient, processing, backprop, programming, function, computation, variant, learned, set, log, better, general, layer, approximation, label, linear] [stereo, computer, conference, vision, compute, matching, sweep, kitti, dense, allows, direction, intelligence, volume]
@InProceedings{Knobelreiter_2020_CVPR,
  author = {Knobelreiter, Patrick and Sormann, Christian and Shekhovtsov, Alexander and Fraundorfer, Friedrich and Pock, Thomas},
  title = {Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Barycenters of Natural Images Constrained Wasserstein Barycenters for Image Morphing
Dror Simon, Aviad Aberdam


Image interpolation, or image morphing, refers to a visual transition between two (or more) input images. For such a transition to look visually appealing, its desirable properties are (i) to be smooth; (ii) to apply the minimal required change in the image; and (iii) to seem "real", avoiding unnatural artifacts in each image in the transition. To obtain a smooth and straightforward transition, one may adopt the well-known Wasserstein Barycenter Problem (WBP). While this approach guarantees minimal changes under the Wasserstein metric, the resulting images might seem unnatural. In this work, we propose a novel approach for image morphing that possesses all three desired properties. To this end, we define a constrained variant of the WBP that enforces the intermediate images to satisfy an image prior. We describe an algorithm that solves this problem and demonstrate it using the sparse prior and generative adversarial networks.
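In one dimension the unconstrained Wasserstein barycenter has a closed form via quantile averaging, which is a useful baseline for intuition; the sketch below shows only that baseline, not the paper's constrained variant:

import numpy as np

def wasserstein_barycenter_1d(samples_a, samples_b, t=0.5, n_quantiles=256):
    """Empirical 1D Wasserstein barycenter of two sample sets, obtained by
    interpolating their quantile (inverse CDF) functions."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    qa = np.quantile(samples_a, q)
    qb = np.quantile(samples_b, q)
    return (1.0 - t) * qa + t * qb   # quantiles of the barycenter at weight t

# Example: the barycenter of N(0,1) and N(4,1) at t=0.5 concentrates near 2.
a = np.random.randn(10000)
b = np.random.randn(10000) + 4.0
print(wasserstein_barycenter_1d(a, b).mean())   # roughly 2.0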
[natural, step, visual, dataset, work, three, described, english, sequence] [leading, propose] [input, constrained, adversarial, model, trained, mnist] [figure, method, prior, interpolation, ieee, intermediate, signal, pixel, suggested, color, output, pattern, avoiding] [image, wasserstein, latent, barycenter, gan, morphing, generative, manifold, dcgan, wbp, representation, generate, desired, generated, train, satisfy, shoe] [space, algorithm, transition, problem, linear, min, optimal, arg, learning, simple, probability, set, matrix, process, vector, size, processing, training, measure, approximation, straightforward, entire] [approach, sparse, distance, conference, computer, smooth, solution, defined, transformation, second, minimal, euclidean, novel, demonstrate, solving, solve, convex, vision, additional]
@InProceedings{Simon_2020_CVPR,
  author = {Simon, Dror and Aberdam, Aviad},
  title = {Barycenters of Natural Images Constrained Wasserstein Barycenters for Image Morphing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Guided Variational Autoencoder for Disentanglement Learning
Zheng Ding, Yifan Xu, Weijian Xu, Gaurav Parmar, Yang Yang, Max Welling, Zhuowen Tu


We propose an algorithm, guided variational autoencoder (Guided-VAE), that is able to learn a controllable generative model by performing latent representation disentanglement learning. The learning objective is achieved by providing signal to the latent encoding/embedding in VAE without changing its main backbone architecture, hence retaining the desirable properties of the VAE. We design an unsupervised and a supervised strategy in Guided-VAE and observe enhanced modeling and controlling capability over the vanilla VAE. In the unsupervised strategy, we guide the VAE learning by introducing a lightweight decoder that learns latent geometric transformation and principal components; in the supervised strategy, we use an adversarial excitation and inhibition mechanism to encourage the disentanglement of the latent variables. Guided-VAE enjoys its transparency and simplicity for the general representation learning task, as well as disentanglement learning. On a number of experiments for representation learning, improved synthesis/sampling, better disentanglement for classification, and reduced classification errors in meta learning have been observed.
[modeling, dataset, decoder, making, work] [table, represents, score, main, guided, introducing] [model, adversarial, input, mnist] [figure, excitation, guidance, deformable, lightweight, method, journal, analysis, interpolation, achieved, result] [latent, vae, disentanglement, supervised, representation, unsupervised, generative, attribute, variational, inhibition, variable, autoencoder, image, zcont, content, traversal, disentangling, zdef, discriminative, uided, guidedvae, component, zrst, qualitative, traversing, disentangled, generated, encourage, encoder] [learning, classification, neural, processing, max, data, training, classifier, deep, learned, task, lower, machine, vanilla, process, network, space, standard, denoted] [pca, error, term, compare, transformation, principal, deformation, geometric, rest, rotation, david]
@InProceedings{Ding_2020_CVPR,
  author = {Ding, Zheng and Xu, Yifan and Xu, Weijian and Parmar, Gaurav and Yang, Yang and Welling, Max and Tu, Zhuowen},
  title = {Guided Variational Autoencoder for Disentanglement Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-Spectral Face Hallucination via Disentangling Independent Factors
Boyan Duan, Chaoyou Fu, Yi Li, Xingguang Song, Ran He


The cross-sensor gap is one of the challenges that have aroused much research interest in Heterogeneous Face Recognition (HFR). Although recent methods have attempted to fill the gap with deep generative networks, most of them suffer from the inevitable misalignment between different face modalities. The misalignment primarily results not from the imaging sensors but from facial geometric variations that are independent of the spectrum. Rather than building a monolithic but complex structure, this paper proposes a Pose Aligned Cross-spectral Hallucination (PACH) approach to disentangle the independent factors and deal with them in individual stages. In the first stage, an Unsupervised Face Alignment (UFA) module is designed to align the facial shapes of the near-infrared (NIR) images with those of the visible (VIS) images in a generative way, where UV maps are effectively utilized as the shape guidance. Thus the task of the second stage becomes spectrum translation with aligned paired data. We develop a Texture Prior Synthesis (TPS) module to achieve complexion control and consequently generate more realistic VIS images than existing methods. Experiments on three challenging NIR-VIS datasets verify the effectiveness of our approach in producing visually appealing images and achieving state-of-the-art performance in HFR.
[recognition, heterogeneous, dataset, three] [stage, table, module, feature, xiang, tackle, map, including] [face, facial, casia, input, identity, adversarial, hallucination, model, experimental, testing, datasets, trained, lightcnn, stan, discriminant, zhen] [method, figure, prior, based, quantitative, proposed, presented, output] [nir, image, paired, synthesized, aligned, texture, synthesis, unsupervised, ufa, alignment, realistic, loss, pcfh, translation, common, complexion, encs, adfl, independent, misalignment, proposes, pach, train, align, enci, sketch, gap, generative, disentangle, representation, generator] [training, learning, compared, set, deep, performance, data, network, baseline] [shape, second, well, local]
@InProceedings{Duan_2020_CVPR,
  author = {Duan, Boyan and Fu, Chaoyou and Li, Yi and Song, Xingguang and He, Ran},
  title = {Cross-Spectral Face Hallucination via Disentangling Independent Factors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learned Image Compression With Discretized Gaussian Mixture Likelihoods and Attention Modules
Zhengxue Cheng, Heming Sun, Masaru Takeuchi, Jiro Katto


Image compression is a fundamental research field and many well-known compression standards have been developed for many decades. Recently, learned compression methods exhibit a fast development trend with promising results. However, there is still a performance gap between learned compression algorithms and reigning compression standards, especially in terms of the widely used PSNR metric. In this paper, we explore the remaining redundancy of recent learned compression algorithms. We have found accurate entropy models for rate estimation largely affect the optimization of network parameters and thus affect the rate-distortion performance. Therefore, in this paper, we propose to use discretized Gaussian Mixture Likelihoods to parameterize the distributions of latent codes, which can achieve a more accurate and flexible entropy model. Besides, we take advantage of recent attention modules and incorporate them into network architecture to enhance the performance. Experimental results demonstrate our proposed method achieves a state-of-the-art performance compared to existing learned compression methods on both Kodak and high-resolution datasets. To our knowledge, our approach is the first work to achieve comparable performance with the latest compression standard, Versatile Video Coding (VVC), in terms of PSNR. More importantly, our approach generates more visually pleasant results when optimized by MS-SSIM.
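The discretized Gaussian mixture likelihood itself is simple to write down: the probability of an integer-quantized latent is the mixture-weighted difference of Gaussian CDFs at y +/- 0.5. A minimal PyTorch sketch (shapes and clamping constants are assumptions):

import torch

def discretized_gaussian_mixture_likelihood(y, weights, means, scales):
    """P(y) = sum_k w_k * (CDF_k(y+0.5) - CDF_k(y-0.5)).
    y: (...,) quantized latents; weights/means/scales: (..., K), with
    weights assumed normalized (e.g. via softmax) over the K mixtures."""
    y = y.unsqueeze(-1)                              # broadcast against K mixtures
    dist = torch.distributions.Normal(means, scales.clamp(min=1e-6))
    upper = dist.cdf(y + 0.5)
    lower = dist.cdf(y - 0.5)
    probs = (weights * (upper - lower)).sum(dim=-1)
    return probs.clamp(min=1e-9)                     # avoid log(0) in the rate term

# Rate estimate in bits: (-torch.log2(probs)).sum() over all latent elements.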
[attention, work, video, three, element] [module, achieves, propose, visualization, denotes] [model, quality, jpeg, improve, development] [compression, proposed, method, coding, gaussian, optimized, discretized, figure, kodak, scale, ieee, spatial, june, psnr, existing, adaptive, vvc, column, high, flexible, based, autoregressive, lee, transform, arithmetic, hyperprior, resolution, residual, gmm] [image, loss, latent, visualize] [entropy, learned, mixture, performance, achieve, learning, network, distribution, redundancy, rate, parameterized, training, remaining, required, neural, better, architecture, deep, test, set, comparable, fewer, quantization, logistic] [approach, reconstructed, accurate, joint, computer, simplified, estimation, single]
@InProceedings{Cheng_2020_CVPR,
  author = {Cheng, Zhengxue and Sun, Heming and Takeuchi, Masaru and Katto, Jiro},
  title = {Learned Image Compression With Discretized Gaussian Mixture Likelihoods and Attention Modules},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds
Albert Pumarola, Stefan Popov, Francesc Moreno-Noguer, Vittorio Ferrari


Flow-based generative models have highly desirable properties like exact log-likelihood evaluation and exact latent-variable inference, however they are still in their infancy and have not received as much attention as alternative generative models. In this paper, we introduce C-Flow, a novel conditioning scheme that brings normalizing flows to an entirely new scenario with great possibilities for multimodal data modeling. C-Flow is based on a parallel sequence of invertible mappings in which a source flow guides the target flow at every step, enabling fine-grained control over the generation process. We also devise a new strategy to model unordered 3D point clouds that, in combination with the conditioning scheme, makes it possible to address 3D reconstruction from a single image and its inverse problem of rendering an image given a point cloud. We demonstrate our conditioning method to be very adaptable, being also applicable to image manipulation, style transfer and multi-modal image-to-image mapping in a diversity of domains, including RGB images, segmentation maps and edge masks.
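The invertible building block underlying such flows is the affine coupling layer; the PyTorch sketch below is a generic single-flow version (C-Flow additionally conditions one flow on another), with hidden sizes assumed and an even channel count required:

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Generic affine coupling step: half of the channels are transformed with
    a scale/shift predicted from the other half, keeping the mapping invertible
    with a cheap log-determinant."""
    def __init__(self, channels, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels // 2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1))

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(xa).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                 # keep scales well-behaved
        yb = xb * torch.exp(log_s) + t
        logdet = log_s.flatten(1).sum(dim=1)      # per-sample log-determinant
        return torch.cat([xa, yb], dim=1), logdet

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=1)
        log_s, t = self.net(ya).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        xb = (yb - t) * torch.exp(-log_s)
        return torch.cat([ya, xb], dim=1)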
[conditioning, modeling, embedding, multiple, step] [segmentation, global, map, object, propose, table] [model, adversarial, manipulation, input, original, sort, condition, trained] [coupling, invertible, flow, figure, proposed, based, exact, affine, squeeze, high, applicable] [generative, image, conditional, latent, cycle, style, target, mapping, domain, transfer, content, loss, perform, consistency, source, bijective, generate, introduce, generation] [data, learning, normalizing, distribution, training, layer, scheme, deep, arxiv, preprint, sample, sampling, space, log, size, learned, architecture, note, network, sorting, backward] [point, reconstruction, cloud, shape, structure, approach, distance, rendering, chamfer, albert, francesc, unordered, single, transformation]
@InProceedings{Pumarola_2020_CVPR,
  author = {Pumarola, Albert and Popov, Stefan and Moreno-Noguer, Francesc and Ferrari, Vittorio},
  title = {C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cogradient Descent for Bilinear Optimization
Li'an Zhuo, Baochang Zhang, Linlin Yang, Hanlin Chen, Qixiang Ye, David Doermann, Rongrong Ji, Guodong Guo


Conventional learning methods simplify the bilinear model by regarding two intrinsically coupled factors independently, which degrades the optimization procedure. One reason lies in the insufficient training due to the asynchronous gradient descent, which results in vanishing gradients for the coupled variables. In this paper, we introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem, based on a theoretical framework to coordinate the gradient of hidden variables via a projection function. We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent to facilitate the optimization procedure. Our algorithm is applied to solve problems with one variable under the sparsity constraint, which is widely used in the learning paradigm. We validate our CoGD considering an extensive set of applications including image reconstruction, inpainting, and network pruning. Experiments show that it improves the state-of-the-art by a significant margin.
[] [final] [] [] [project, image] [filter, epoch] [estimate, sparse, dense, incomplete]
@InProceedings{Zhuo_2020_CVPR,
  author = {Zhuo, Li'an and Zhang, Baochang and Yang, Linlin and Chen, Hanlin and Ye, Qixiang and Doermann, David and Ji, Rongrong and Guo, Guodong},
  title = {Cogradient Descent for Bilinear Optimization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Instance-Aware Image Colorization
Jheng-Wei Su, Hung-Kuo Chu, Jia-Bin Huang


Image colorization is inherently an ill-posed problem with multi-modal uncertainty. Previous methods leverage the deep neural network to map input grayscale images to plausible color outputs directly. Although these learning-based methods have shown impressive performance, they usually fail on the input images that contain multiple objects. The leading cause is that existing models perform learning and colorization on the entire image. In the absence of a clear figure-ground separation, these models cannot effectively locate and learn meaningful object-level semantics. In this paper, we propose a method for achieving instance-aware colorization. Our network architecture leverages an off-the-shelf object detector to obtain cropped object images and uses an instance colorization network to extract object-level features. We use a similar network to extract the full-image features and apply a fusion module to fuse object-level and image-level features to predict the final colors. Both colorization networks and fusion modules are learned from a large-scale dataset. Experimental results show that our work outperforms existing methods on different quality metrics and achieves state-of-the-art performance on image colorization.
[multiple, visual, three, dataset, predict, work, automatic, semantics] [instance, object, feature, module, bounding, map, jxi, split, semantic, fusing, mask, box, cocostuff, detection, table, cropped, level, adopt, fuse] [model, input, improve, trained, datasets, experimental, legacy] [fusion, color, method, zhang, figure, proposed, existing, ssim, tog, psnr, reference, validate, fused, quantitative] [colorization, image, grayscale, deoldify, colorizing, lpips, learn, colorize, user, extracted, representation, phillip, alexei, diverse] [network, training, learning, deep, performance, imagenet, weight, validation, architecture, learned, neural, task, weighted, sum] [acm, full, well, complex]
@InProceedings{Su_2020_CVPR,
  author = {Su, Jheng-Wei and Chu, Hung-Kuo and Huang, Jia-Bin},
  title = {Instance-Aware Image Colorization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Joint Training of Variational Auto-Encoder and Latent Energy-Based Model
Tian Han, Erik Nijkamp, Linqi Zhou, Bo Pang, Song-Chun Zhu, Ying Nian Wu


This paper proposes a joint training method to learn both the variational auto-encoder (VAE) and the latent energy-based model (EBM). The joint training of VAE and latent EBM is based on an objective function that consists of three Kullback-Leibler divergences between three joint distributions on the latent vector and the image, and the objective function is of an elegant symmetric and anti-symmetric form of divergence triangle that seamlessly integrates variational and adversarial learning. In this joint training scheme, the latent EBM serves as a critic of the generator model, while the generator model and the inference model in VAE serve as the approximate synthesis sampler and inference sampler of the latent EBM. Our experiments show that the joint training greatly improves the synthesis quality of the VAE. It also enables learning of an energy function that is capable of detecting out-of-sample examples for anomaly detection.
[three, observed, directed] [table, detection] [model, adversarial, testing, quality, trained] [method, based, called, likelihood, alice, figure, convolutional, proposed, phase, integrates] [latent, generator, vae, synthesis, image, variational, learn, generative, generation, celeba, consists, qdata, unsupervised, svae, eqdata, conditional, helmholtz, minimizing] [inference, training, learning, density, ebm, data, vector, arxiv, preprint, sampling, network, function, distribution, energy, anomaly, machine, divergence, approximate, objective, gradient, close, deep, learned, ood, neural, algorithm, boltzmann, defines, log, evaluate, sampler, call, minimization, maximum, negative, uniform, baseline, ying, nian, sample, compared] [joint, form, triangle, compare, conference, symmetric]
@InProceedings{Han_2020_CVPR,
  author = {Han, Tian and Nijkamp, Erik and Zhou, Linqi and Pang, Bo and Zhu, Song-Chun and Wu, Ying Nian},
  title = {Joint Training of Variational Auto-Encoder and Latent Energy-Based Model},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adaptive Loss-Aware Quantization for Multi-Bit Networks
Zhongnan Qu, Zimu Zhou, Yun Cheng, Lothar Thiele


We investigate the compression of deep neural networks by quantizing their weights and activations into multiple binary bases, known as multi-bit networks (MBNs), which accelerate the inference and reduce the storage for the deployment on low-resource mobile and embedded platforms. We propose Adaptive Loss-aware Quantization (ALQ), a new MBN quantization pipeline that is able to achieve an average bitwidth below one-bit without notable loss in inference accuracy. Unlike previous MBN quantization solutions that train a quantizer by minimizing the error to reconstruct full precision weights, ALQ directly minimizes the quantization-induced error on the loss function involving neither gradient approximation nor full precision maintenance. ALQ also exploits strategies including adaptive bitwidth, smooth bitwidth reduction, and iterative trained quantization to allow a smaller network size without loss in accuracy. Experiment results on popular image datasets show that ALQ outperforms state-of-the-art compressed networks in terms of both storage and accuracy.
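For context, the multi-bit decomposition that ALQ starts from approximates a weight group by a sum of binary bases with per-base scales. The PyTorch sketch below shows the standard greedy residual binarization only, not ALQ's loss-aware optimizer:

import torch

def multibit_quantize(w, n_bits=2):
    """Greedy residual binarization: w is approximated by sum_i alpha_i * b_i
    with b_i in {-1, +1} and alpha_i >= 0."""
    residual = w.clone()
    alphas, bases = [], []
    for _ in range(n_bits):
        b = torch.sign(residual)
        b[b == 0] = 1.0
        alpha = residual.abs().mean()     # least-squares optimal scale for sign(residual)
        alphas.append(alpha)
        bases.append(b)
        residual = residual - alpha * b
    w_q = sum(a * b for a, b in zip(alphas, bases))
    return w_q, alphas, bases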
[step, previous, multiple] [table, global] [trained, vgg, increment, original] [adaptive, compression, comparison, convolutional, method, low, output, ieee, pattern, notable] [loss, domain, minimizing, corresponding] [alq, bitwidth, binary, quantization, neural, learning, accuracy, deep, pruning, ste, average, precision, network, quantized, optimization, storage, group, layer, optimizing, weight, gradient, number, appendix, size, optimal, quantize, set, training, activation, mbn, higher, bitwise, scheme, uniform, problem, optimizer, amsgrad, rate, baseline, inference, function, replace, efficient, optimize, imin, pytorch, reduce] [conference, full, international, computer, error, directly, reconstruction, basis, projection, compare, vision]
@InProceedings{Qu_2020_CVPR,
  author = {Qu, Zhongnan and Zhou, Zimu and Cheng, Yun and Thiele, Lothar},
  title = {Adaptive Loss-Aware Quantization for Multi-Bit Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ScopeFlow: Dynamic Scene Scoping for Optical Flow
Aviram Bar-Haim, Lior Wolf


We propose to modify the common training protocols of optical flow, leading to sizable accuracy improvements without adding to the computational complexity of the training process. The improvement is based on observing the bias in sampling challenging data that exists in the current training protocol, and improving the sampling process. In addition, we find that both regularization and augmentation should decrease during the training protocol. Using an existing low-parameter architecture, the method ranks first on the MPI Sintel benchmark among all other methods, improving the accuracy of the best two-frame method by more than 10%. The method also surpasses all similar architecture variants by more than 12% and 19.7% on the KITTI benchmarks, achieving the lowest Average End-Point Error on KITTI2012 among two-frame methods, without using extra datasets.
[order, recognition, dataset, scope, three, current, includes] [occlusion, cnn, final, leading, sized, category, val, table, improvement] [model, trained, improve, protocol, tested, maximal, improving] [crop, flow, optical, sintel, cropping, pixel, ieee, method, pattern, fast, valid, range, zoom, motion, fchairs, scoping, suggested, mepe, dynamic, zooming, based, applying, affine] [image, common, train] [training, size, random, regularization, fixed, sampling, best, larger, accuracy, probability, network, learning, weight, data, test, number, decay, augmentation, pretraining, finetune, bias, set, deep, reducing, architecture, performance, improved, large, ratio, lower, validation, finetuning] [kitti, computer, vision, conference, scene, mpi, estimation, cost, approach, second, error]
@InProceedings{Bar-Haim_2020_CVPR,
  author = {Bar-Haim, Aviram and Wolf, Lior},
  title = {ScopeFlow: Dynamic Scene Scoping for Optical Flow},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Video Super-Resolution With Temporal Group Attention
Takashi Isobe, Songjiang Li, Xu Jia, Shanxin Yuan, Gregory Slabaugh, Chunjing Xu, Ya-Li Li, Shengjin Wang, Qi Tian


Video super-resolution, which aims at producing a high-resolution video from its corresponding low-resolution version, has recently drawn increasing attention. In this work, we propose a novel method that can effectively incorporate temporal information in a hierarchical way. The input sequence is divided into several groups, with each one corresponding to a kind of frame rate. These groups provide complementary information to recover missing details in the reference frame, which is further integrated with an attention module and a deep intra-group fusion module. In addition, a fast spatial alignment is proposed to handle videos with large motion. Extensive results demonstrate the capability of the proposed model in handling videos with various motion. It achieves favorable performance against state-of-the-art methods on several benchmark datasets.
[temporal, video, frame, attention, hierarchical, sequence, unit, time, three, explicit, integrate, extract] [module, feature, grouping, effectiveness, map, propose, achieves] [model, input, complementary, effectively, original, conduct] [proposed, motion, method, reference, fusion, spatial, convolutional, fast, optical, flow, neighboring, block, duf, figure, residual, compensation, homography, rbpn, tga, vsr, result, edvr, upsampling, based, resolution, toflow, applying, consecutive, fsa, dynamic, super, bicubic] [image, alignment, missing, corresponding, produced] [group, network, deep, large, performance, neural, better, size, rate, layer, computation, efficient, conducted] [dense, single, estimation, computed, implicit, novel, demonstrate, full]
@InProceedings{Isobe_2020_CVPR,
  author = {Isobe, Takashi and Li, Songjiang and Jia, Xu and Yuan, Shanxin and Slabaugh, Gregory and Xu, Chunjing and Li, Ya-Li and Wang, Shengjin and Tian, Qi},
  title = {Video Super-Resolution With Temporal Group Attention},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Group Sparsity: The Hinge Between Filter Pruning and Decomposition for Network Compression
Yawei Li, Shuhang Gu, Christoph Mayer, Luc Van Gool, Radu Timofte


In this paper, we analyze two popular network compression techniques, i.e. filter pruning and low-rank decomposition, in a unified sense. By simply changing the way the sparsity regularization is enforced, filter pruning and low-rank decomposition can be derived accordingly. This provides another flexible choice for network compression because the techniques complement each other. For example, in popular network architectures with shortcut connections (e.g. ResNet), filter pruning cannot deal with the last convolutional layer in a ResBlock while the low-rank decomposition methods can. In addition, we propose to compress the whole network jointly instead of in a layer-wise manner. Our approach proves its potential as it compares favorably to the state-of-the-art on several benchmarks. Code is available at https://github.com/ofsoundof/group_sparsity.
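The group-sparsity regularization discussed here is typically enforced with a proximal (group soft-thresholding) step on filter groups; a minimal PyTorch sketch, assuming grouping by output filter (the paper also considers other groupings):

import torch

def group_prox_step(conv_weight, threshold):
    """Proximal operator of the group-lasso penalty applied per output filter.
    conv_weight: (out_channels, in_channels, k, k). Filters whose l2 norm falls
    below `threshold` (step size times the regularization strength) are zeroed;
    the remaining filters are shrunk toward zero."""
    w = conv_weight.flatten(1)                      # one row per filter group
    norms = w.norm(dim=1, keepdim=True)
    scale = torch.clamp(1.0 - threshold / (norms + 1e-12), min=0.0)
    return (w * scale).view_as(conv_weight)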
[structured, step, current] [threshold, feature, resnet, resnext, table, map, van] [original, norm, input, model, influence] [compression, method, convolutional, proposed, output, convolution, proximal, tensor, based, compressed, comparison, compress, block, figure, channel, residual, lightweight] [factor, loss, target, image, enforced] [network, group, filter, sparsity, pruning, hinge, matrix, regularization, neural, ratio, learning, deep, rate, gradient, flop, layer, algorithm, parameter, weight, regularizer, compact, better, set, kse, efficient, optimization, distillation, function, bottleneck, linear, equivalent, applied, number, approximate, binary] [decomposition, error, annealing, solve, approach]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yawei and Gu, Shuhang and Mayer, Christoph and Gool, Luc Van and Timofte, Radu},
  title = {Group Sparsity: The Hinge Between Filter Pruning and Decomposition for Network Compression},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
3D Photography Using Context-Aware Layered Depth Inpainting
Meng-Li Shih, Shih-Yang Su, Johannes Kopf, Jia-Bin Huang


We propose a method for converting a single RGB-D input image into a 3D photo, i.e., a multi-layer representation for novel view synthesis that contains hallucinated color and depth structures in regions occluded in the original view. We use a Layered Depth Image with explicit pixel connectivity as underlying representation, and present a learning-based inpainting model that iteratively synthesizes new local color-and-depth content into the occluded region in a spatial context-aware manner. The resulting 3D photos can be efficiently rendered with motion parallax using standard graphics engines. We validate the effectiveness of our method on a wide range of challenging everyday scenes and show fewer artifacts compared with the state of the art.
[context, work, visual, dataset, connected] [region, edge, occluded, apply, background, table] [model, input, diffusion] [color, method, figure, pixel, light, dilation, field, proposed, photography, dual] [inpainting, image, synthesis, inpainted, representation, content, inpaint, synthesized, photo, missing, produce, texture, lpips, jimei] [algorithm, network, number, deep, standard, better, training, learning, layer] [depth, view, novel, single, acm, ldi, facebook, stereo, richard, layered, rendering, well, silhouette, local, completion, michael, camera, mesh, form, computer, johannes, connectivity, conference, thomas, noah, require, capture]
@InProceedings{Shih_2020_CVPR,
  author = {Shih, Meng-Li and Su, Shih-Yang and Kopf, Johannes and Huang, Jia-Bin},
  title = {3D Photography Using Context-Aware Layered Depth Inpainting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MixNMatch: Multifactor Disentanglement and Encoding for Conditional Image Generation
Yuheng Li, Krishna Kumar Singh, Utkarsh Ojha, Yong Jae Lee


We present MixNMatch, a conditional generative model that learns to disentangle and encode background, object pose, shape, and texture from real images with minimal supervision, for mix-and-match image generation. We build upon FineGAN, an unconditional generative model, to learn the desired disentanglement and image generator, and leverage adversarial joint image-code distribution matching to learn the latent factor encoders. MixNMatch requires bounding boxes during training to model background, but requires no other supervision. Through extensive experiments, we demonstrate MixNMatch's ability to accurately disentangle, encode, and combine multiple factors for mix-and-match image generation, including sketch2color, cartoon2img, and img2gif applications. Our code/models/demo can be found at https://github.com/Yuheng-Li/MixNMatch
[child, encode, work, multiple, extract] [object, feature, background, parent, stage, supervision, table, bounding] [model, adversarial, input, trained, identity] [reference, figure, high] [image, real, code, texture, latent, mode, mixnmatch, finegan, disentanglement, disentangle, generated, generator, conditional, disentangled, generative, generation, generate, fake, encoders, loss, unsupervised, learn, conditioned, representation, bird, learns, train, domain, corresponding, paired, factor, extracted, preserve, desired, disentangling, disentangles, duck, ability, encoder, specific, translation, appearance, perform, realistic] [sampled, distribution, learned, clustering, training, learning, data, requires, number, randomly] [shape, pose, well, varying, minimal, joint, combine, directly, require, single]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yuheng and Singh, Krishna Kumar and Ojha, Utkarsh and Lee, Yong Jae},
  title = {MixNMatch: Multifactor Disentanglement and Encoding for Conditional Image Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Low-Rank Compression of Neural Nets: Learning the Rank of Each Layer
Yerlan Idelbayev, Miguel A. Carreira-Perpinan


Neural net compression can be achieved by approximating each layer's weight matrix by a low-rank matrix. The real difficulty in doing this is not in training the resulting neural net (made up of one low-rank matrix per layer), but in determining what the optimal rank of each layer is--effectively, an architecture search problem with one hyperparameter per layer. We show that, with a suitable formulation, this problem is amenable to a mixed discrete-continuous optimization jointly over the ranks and over the matrix elements, and give a corresponding algorithm. We show that this indeed can select ranks much better than existing approaches, making low-rank compression much more attractive than previously thought. For example, we can make a VGG network faster than a ResNet and with nearly the same classification error.
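Once a rank is chosen, replacing a dense layer by its low-rank factorization is a short SVD exercise; the PyTorch sketch below fixes the rank by hand, whereas the paper's contribution is learning each layer's rank jointly with the weights:

import torch
import torch.nn as nn

def low_rank_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace one dense layer by two thinner ones via truncated SVD of its weight."""
    W = layer.weight.data                                     # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]   # (rank, in_features)
    second.weight.data = U[:, :rank] * S[:rank].sqrt()             # (out_features, rank)
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)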
[step, recognition, work] [faster] [model, norm, original, case, input] [compression, net, convolutional, reference, compressed, pattern, ieee, june, method, society, tensor, published, output] [train, loss, corresponding, image, mit] [rank, neural, deep, layer, algorithm, problem, matrix, learning, training, number, weight, test, optimization, nin, pruning, network, linear, singular, function, baseline, scheme, classification, inference, selection, optimal, machine, filter, selected, approximation, architecture, distribution, resnets, better, memory, optimize, nuclear, min, processing, miguel, large, sgd, achieve, size] [error, computer, cost, svd, vision, solution, solved, single, volume]
@InProceedings{Idelbayev_2020_CVPR,
  author = {Idelbayev, Yerlan and Carreira-Perpinan, Miguel A.},
  title = {Low-Rank Compression of Neural Nets: Learning the Rank of Each Layer},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Global Texture Enhancement for Fake Face Detection in the Wild
Zhengzhe Liu, Xiaojuan Qi, Philip H.S. Torr


Generative Adversarial Networks (GANs) can generate realistic fake face images that can easily fool human beings. On the contrary, a common Convolutional Neural Network (CNN) discriminator can achieve more than 99.9% accuracy in discerning fake/real images. In this paper, we conduct an empirical study on fake/real faces, and have two important observations: firstly, the texture of fake faces is substantially different from real ones; secondly, global texture statistics are more robust to image editing and transferable to fake faces from different GANs and datasets. Motivated by the above observations, we propose a new architecture coined as Gram-Net, which leverages global image texture representations for robust fake image detection. Experimental results on several datasets demonstrate that our Gram-Net outperforms existing approaches. Especially, our Gram-Net is more robust to image editings, e.g. down-sampling, JPEG compression, blur, and noise. More importantly, our Gram-Net generalizes significantly better in detecting fake faces from GAN models not seen in the training phase and can perform decently in detecting fake natural images.
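The global texture statistics referred to above are typically Gram matrices of CNN feature maps; a minimal PyTorch sketch (the normalization constant is an assumption):

import torch

def gram_matrix(features):
    """Global Gram (texture) statistics of a feature map. features: (B, C, H, W)."""
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)   # (B, C, C)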
[natural, understanding, outperforms] [resnet, table, feature, cnn, global, detection, detect, propose] [face, model, original, trained, generalization, robust, skin, jpeg, detecting, adversarial, input, noise, testing, robustness, easily] [figure, cnns, contrast, analysis, ieee, color, method, blur, convolutional, based, proposed] [fake, texture, image, stylegan, real, gram, gan, gans, ffhq, generated, pggan, train, generative, discriminator, editing, biggan, generate] [matrix, training, arxiv, preprint, performance, neural, set, architecture, test, baseline, accuracy, better, imagenet, evaluate, large, pool, learning, processing, deep, layer, empirical] [human, conference, reconstructed, computer, capture]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Zhengzhe and Qi, Xiaojuan and Torr, Philip H.S.},
  title = {Global Texture Enhancement for Fake Face Detection in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Panoptic-Based Image Synthesis
Aysegul Dundar, Karan Sapra, Guilin Liu, Andrew Tao, Bryan Catanzaro


Conditional image synthesis for generating photorealistic images serves various applications, from content editing to content generation. Previous conditional image synthesis algorithms mostly rely on semantic maps, and often fail in complex environments where multiple instances occlude each other. We propose a panoptic aware image synthesis network to generate high fidelity and photorealistic images conditioned on panoptic maps, which unify semantic and instance information. To achieve this, we efficiently use panoptic maps in convolution and upsampling layers. We show that with the proposed changes to the generator, we improve on previous state-of-the-art methods by generating higher-fidelity images in complex environments with interacting instances and by rendering tiny objects in more detail. Furthermore, our proposed method also outperforms the previous state-of-the-art methods in terms of mean IoU (Intersection over Union) and detAP (Detection Average Precision).
[provide, crn, dataset, previous, multiple, recognition, outperforms, interested] [panoptic, semantic, aware, map, instance, feature, sims, segmentation, object, miou, detection, detap, boundary, resnet, table, propose] [correction, adversarial, identity, trained, model] [upsampling, convolution, figure, ieee, method, pattern, upsampled, pixel, proposed, based, resolution, high, block, convolutional] [image, synthesis, spade, conditional, generate, generates, generated, generative, content, fidelity, generation, generator, unsupervised, bryan, generating] [layer, network, higher, neural, learning, accuracy, andrew, baseline, operation, arxiv, preprint, binary, algorithm, number, training, processing, deep] [conference, computer, vision, partial, nearest, neighbor, international]
@InProceedings{Dundar_2020_CVPR,
  author = {Dundar, Aysegul and Sapra, Karan and Liu, Guilin and Tao, Andrew and Catanzaro, Bryan},
  title = {Panoptic-Based Image Synthesis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Lighthouse: Predicting Lighting Volumes for Spatially-Coherent Illumination
Pratul P. Srinivasan, Ben Mildenhall, Matthew Tancik, Jonathan T. Barron, Richard Tucker, Noah Snavely


We present a deep learning solution for estimating the incident illumination at any 3D location within a scene from an input narrow-baseline stereo image pair. Previous approaches for predicting global illumination from images either predict just a single illumination for the entire scene, or separately estimate the illumination at each 3D location without enforcing that the predictions are consistent with the same 3D scene. Instead, we propose a deep learning model that estimates a 3D volumetric RGBA model of a scene, including content outside the observed field of view, and then uses standard volume rendering to estimate the incident illumination at any 3D location within that volume. Our model is trained without any ground truth 3D data and only requires a held-out perspective view near the input stereo pair and a spherical panorama taken within each scene as supervision, as opposed to prior methods for spatially-varying lighting estimation, which require ground truth scene geometry for training. We demonstrate that our method can predict consistent spatially-varying lighting that is convincing enough to plausibly relight and insert highly specular virtual objects into real images.
[environment, predict, observed, pair, predicting, prediction, work, dataset] [map, location, cnn, object, predicted, global, level] [input, model, adversarial] [illumination, multiscale, method, light, resolution, figure, reference, hdr, pixel, inverse, field, prior] [image, representation, content, inserted, realistic, loss, real, synthesis, train] [deep, network, learning, training, neural, set, entire, requires] [scene, lighting, volume, rendering, geometry, single, stereo, spherical, virtual, estimating, estimate, camera, view, unobserved, mpi, perspective, volumetric, ground, truth, relit, estimated, visible, indoor, incident, specular, completion, approach, panorama, predicts, render, supplementary, kalyan, consistent, novel, relighting, completed, frustum]
@InProceedings{Srinivasan_2020_CVPR,
  author = {Srinivasan, Pratul P. and Mildenhall, Ben and Tancik, Matthew and Barron, Jonathan T. and Tucker, Richard and Snavely, Noah},
  title = {Lighthouse: Predicting Lighting Volumes for Spatially-Coherent Illumination},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Cartoonize Using White-Box Cartoon Representations
Xinrui Wang, Jinze Yu


This paper presents an approach for image cartoonization. By observing the cartoon painting behavior and consulting artists, we propose to separately identify three white-box representations from images: the surface representation that contains the smooth surface of cartoon images, the structure representation that refers to the sparse color-blocks and flattened global content in the celluloid style workflow, and the texture representation that reflects high-frequency texture, contours and details in cartoon images. A Generative Adversarial Network (GAN) framework is used to learn the extracted representations and to cartoonize images. The learning objectives of our method are separately based on each extracted representation, making our framework controllable and adjustable. This enables our approach to meet artists' requirements in different styles and diverse use cases. Qualitative comparisons and quantitative analyses, as well as user studies, have been conducted to validate the effectiveness of this approach, and our method outperforms previous methods in all comparisons. Finally, the ablation study demonstrates the influence of each component in our framework.
[three, previous, extract, dataset, represent, outperforms] [global, framework, segmentation, guided, abstraction, table, guide, superpixel, propose] [model, input, adversarial, quality, original, face] [method, color, figure, ieee, based, pattern, proposed, adaptive, quantitative, clear, output, filtering, introduced] [image, cartoon, representation, style, texture, extracted, photo, fid, loss, diverse, user, generate, content, generator, cartoonization, generative, qualitative, artistic, painting, learn, controllable, cartoonized, coloring, translation, celluloid, gan, transfer, discriminator] [network, neural, training, learning, algorithm, weight, smoothing, conducted, data, processing, standard, performance, evaluate] [structure, computer, surface, conference, vision, sparse, international, smooth, european, human]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Xinrui and Yu, Jinze},
  title = {Learning to Cartoonize Using White-Box Cartoon Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
End-to-End Learnable Geometric Vision by Backpropagating PnP Optimization
Bo Chen, Alvaro Parra, Jiewei Cao, Nan Li, Tat-Jun Chin


Deep networks excel in learning patterns from large amounts of data. On the other hand, many geometric vision tasks are specified as optimization problems. To seamlessly combine deep learning and geometric vision, it is vital to perform learning and geometric optimization end-to-end. Towards this aim, we present BPnP, a novel network module that backpropagates gradients through a Perspective-n-Points (PnP) solver to guide parameter updates of a neural network. Based on implicit differentiation, we show that the gradients of a "self-contained" PnP solver can be derived accurately and efficiently, as if the optimizer block were a differentiable function. We validate BPnP by incorporating it in a deep model that can learn camera intrinsics, camera extrinsics (poses) and 3D structure from training datasets. Further, we develop an end-to-end trainable pipeline for object pose estimation, which achieves greater accuracy by combining feature-based heatmap losses with 2D-3D reprojection errors. Since our approach can be extended to other optimization problems, our work helps to pave the way to perform learnable geometric vision in a principled manner. Our PyTorch implementation of BPnP is available on http://github.com/BoChenYS/BPnP.
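The key mechanism, differentiating through the solution of an inner optimization via the implicit function theorem, can be shown on a scalar toy problem; the PyTorch snippet below is only that toy illustration (the objective and step sizes are assumptions), not a PnP solver:

import torch

# Toy implicit differentiation: y*(x) = argmin_y f(y, x), with
#   dy*/dx = - (d2f/dy2)^{-1} * (d2f/dy dx)   evaluated at the solved y*.
def inner_objective(y, x):
    return (y - x) ** 2 + 0.1 * y ** 4      # illustrative objective only

x = torch.tensor(1.5, requires_grad=True)

# 1) Solve the inner problem numerically, treated as a black box (no autograd to x).
y = torch.tensor(0.0)
for _ in range(200):
    y = y.detach().requires_grad_(True)
    g, = torch.autograd.grad(inner_objective(y, x.detach()), y)
    y = y - 0.1 * g

# 2) Differentiate through the solution with the implicit function theorem.
y_star = y.detach().requires_grad_(True)
f_y, = torch.autograd.grad(inner_objective(y_star, x), y_star, create_graph=True)
f_yy, = torch.autograd.grad(f_y, y_star, retain_graph=True)
f_yx, = torch.autograd.grad(f_y, x)
dy_dx = -f_yx / f_yy
print(float(dy_dx))   # sensitivity of the optimal y* to the parameter x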
[incorporate, predict] [object, feature, location, predicted, heatmap, fully, regression] [model, input, backpropagated, robust] [pnp, output, convolutional, method, based, figure] [loss, image, learn, target, train, perform] [deep, learning, optimization, network, function, algorithm, evolution, neural, training, set, test, objective, backpropagation, squared, gradient, matrix, iteration, parameter, regularization, sample, random] [pose, geometric, camera, bpnp, vision, solver, estimation, implicit, structure, depth, ground, differentiation, truth, computer, projection, differentiable, pipeline, reprojection, approach, point, compute, partial, keypoint, sfm, keypoints, single, solving, form, intrinsic, initial]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Bo and Parra, Alvaro and Cao, Jiewei and Li, Nan and Chin, Tat-Jun},
  title = {End-to-End Learnable Geometric Vision by Backpropagating PnP Optimization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Analyzing and Improving the Image Quality of StyleGAN
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, Timo Aila


The style-based GAN architecture (StyleGAN) yields state-of-the-art results in data-driven unconditional generative image modeling. We expose and analyze several of its characteristic artifacts, and propose changes in both model architecture and training methods to address them. In particular, we redesign the generator normalization, revisit progressive growing, and regularize the generator to encourage good conditioning in the mapping from latent codes to images. In addition to improving image quality, this path length regularizer yields the additional benefit that the generator becomes significantly easier to invert. This makes it possible to reliably attribute a generated image to a particular network. We furthermore visualize how well the generator utilizes its output resolution, and identify a capacity problem, motivating us to train larger models for additional quality improvements. Overall, our improved model redefines the state of the art in unconditional image modeling, both in terms of existing distribution quality metrics as well as perceived image quality.
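A path length regularizer of the kind described can be written in a few lines of PyTorch: penalize the deviation of the Jacobian-vector-product norm from its running mean. The sketch below follows that recipe with assumed shapes and decay constant, and assumes the images were generated from the given latents inside the current autograd graph:

import math
import torch

def path_length_penalty(fake_images, w_latents, pl_mean, decay=0.01):
    """fake_images: (B, C, H, W) generated from w_latents: (B, w_dim)."""
    b, c, h, w = fake_images.shape
    noise = torch.randn_like(fake_images) / math.sqrt(h * w)
    grads, = torch.autograd.grad(
        outputs=(fake_images * noise).sum(), inputs=w_latents, create_graph=True)
    lengths = grads.pow(2).sum(dim=1).sqrt()
    pl_mean = pl_mean + decay * (lengths.mean().detach() - pl_mean)  # running mean
    penalty = (lengths - pl_mean).pow(2).mean()
    return penalty, pl_mean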
[length, work, modulation] [feature, table, instance, recall, focus, main] [quality, original, adversarial, input, noise, norm, trained, improving, deviation, std] [figure, residual, output, conv, resolution, mod, based, scale, skip, method, signal, block] [generator, image, stylegan, latent, ppl, generated, fid, generative, lsun, style, progressive, synthesis, discriminator, trgb, gan, growing, corresponding, frgb, real, characteristic, mapping, adain, code, wijk, row, tero, timo, droplet] [training, normalization, space, network, path, architecture, regularization, design, standard, appendix, precision, weight, nvidia, improved, random, find, deep, regularizer, higher, observe, problem, baseline, data, metric, better] [projection, additional, easier, projected]
@InProceedings{Karras_2020_CVPR,
  author = {Karras, Tero and Laine, Samuli and Aittala, Miika and Hellsten, Janne and Lehtinen, Jaakko and Aila, Timo},
  title = {Analyzing and Improving the Image Quality of StyleGAN},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fashion Editing With Adversarial Parsing Learning
Haoye Dong, Xiaodan Liang, Yixuan Zhang, Xujie Zhang, Xiaohui Shen, Zhenyu Xie, Bowen Wu, Jian Yin


Interactive fashion image manipulation, which enables users to edit images with sketches and color strokes, is an interesting research problem with great application value. Existing works often treat it as a general inpainting task and do not fully leverage the semantic structural information in fashion images. Moreover, they directly utilize conventional convolution and normalization layers to restore the incomplete image, which tends to wash away the sketch and color information. In this paper, we propose a novel Fashion Editing Generative Adversarial Network (FE-GAN), which is capable of manipulating fashion images by free-form sketches and sparse color strokes. FE-GAN consists of two modules: 1) a free-form parsing network that learns to control the human parsing generation by manipulating sketch and color; 2) a parsing-aware inpainting network that renders detailed textures with semantic guidance from the human parsing map. A new attention normalization layer is further applied at multiple scales in the decoder of the inpainting network to enhance the quality of the synthesized image. Extensive experiments on high-resolution fashion image datasets demonstrate that the proposed FE-GAN significantly outperforms the state-of-the-art methods on fashion image manipulation.
[attention, extract, dataset, composed, three, decoder, outperforms, anls] [parsing, map, mask, semantic, feature, propose, table, interactive, denotes, foreground] [fashion, adversarial, face, manipulation, external, deepfashion, quality, noise, conduct, original] [color, figure, proposed, convolution, ssim, affine, method, psnr, convolutional] [image, inpainting, sketch, editing, loss, generative, train, generate, generated, fid, synthesized, realistic, conditioned, mpv, conditional, manipulating, encoder, mforeground, lforeground, fashione] [network, normalization, arxiv, preprint, layer, training, set, learning, design, deep, data, learned, normalized, performance, neural, objective, function, test] [incomplete, human, sparse, partial, novel, complete, structure]
@InProceedings{Dong_2020_CVPR,
  author = {Dong, Haoye and Liang, Xiaodan and Zhang, Yixuan and Zhang, Xujie and Shen, Xiaohui and Xie, Zhenyu and Wu, Bowen and Yin, Jian},
  title = {Fashion Editing With Adversarial Parsing Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Augment Your Batch: Improving Generalization Through Instance Repetition
Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, Daniel Soudry


Large-batch SGD is important for scaling training of deep neural networks. However, without fine-tuning hyperparameter schedules, the generalization of the model may be hampered. We propose to use batch augmentation: replicating instances of samples within the same batch with different data augmentations. Batch augmentation acts as a regularizer and an accelerator, increasing both generalization and performance scaling for a fixed budget of optimization steps. We analyze the effect of batch augmentation on gradient variance and show that it empirically improves convergence for a wide variety of networks and datasets. Our results show that batch augmentation reduces the number of necessary SGD updates to achieve the same accuracy as the state-of-the-art. Overall, this simple yet effective method enables faster training and better generalization by allowing more computational resources to be used concurrently.
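The core recipe is easy to state in code: repeat each sample several times within the batch and draw an independent augmentation for every copy. A minimal PyTorch sketch (the transform is any stochastic per-image augmentation; names are illustrative):

import torch

def augment_batch(images, labels, transform, repeats=4):
    """images: (B, C, H, W), labels: (B,). Returns a batch of size B * repeats
    in which every sample appears `repeats` times, each copy with an
    independently drawn augmentation."""
    augmented = torch.stack([transform(img)
                             for _ in range(repeats)
                             for img in images])
    repeated_labels = labels.repeat(repeats)   # same ordering as the stacked images
    return augmented, repeated_labels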
[dataset, multiple, time, previous, observed] [table, final, improvement, faster] [generalization, model, input, original, trained, improve, correlated, effective, norm, tested] [figure, method, suggested, transforms, result, scale, introduced, runtime] [image, common, train] [batch, training, augmentation, accuracy, data, number, learning, validation, large, imagenet, arxiv, preprint, neural, size, deep, gradient, variance, baseline, rate, sgd, regime, network, regularization, convergence, standard, random, scaling, increasing, achieve, better, larger, reduction, sample, performance, processing, small, classification, fixed, optimization, augmented, dropout, parallelism, distributed, compared, reduced, improved] [error, well, additional]
@InProceedings{Hoffer_2020_CVPR,
  author = {Hoffer, Elad and Ben-Nun, Tal and Hubara, Itay and Giladi, Niv and Hoefler, Torsten and Soudry, Daniel},
  title = {Augment Your Batch: Improving Generalization Through Instance Repetition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ARShadowGAN: Shadow Generative Adversarial Network for Augmented Reality in Single Light Scenes
Daquan Liu, Chengjiang Long, Hongpan Zhang, Hanning Yu, Xinzhi Dong, Chunxia Xiao


Generating virtual object shadows consistent with the real-world environment shading effects is important but challenging in computer vision and augmented reality applications. To address this problem, we propose an end-to-end Generative Adversarial Network for shadow generation named ARShadowGAN for augmented reality in single light scenes. Our ARShadowGAN makes full use of an attention mechanism and is able to directly model the mapping relation between the virtual object shadow and the real-world environment without any explicit estimation of the illumination and 3D geometric information. In addition, we collect an image set which provides rich clues for shadow generation and construct a dataset for training and evaluating our proposed ARShadowGAN. The extensive experimental results show that our proposed ARShadowGAN is capable of directly generating plausible virtual object shadows in single light scenes. Our source code is available at https://github.com/ldq9526/ARShadowGAN.
[attention, recognition, dataset, visual, work, decoder, construct] [object, mask, map, occluders, refinement, feature, detection, area, gang, final, ablation] [adversarial, input, model] [figure, ieee, light, pattern, block, illumination, residual, proposed, inverse, output, remove] [shadow, image, arshadowgan, corresponding, generator, generation, inserted, discriminator, generative, chengjiang, synthetic, lper, generated, generate, loss, real, consists, plausible, shadowgan, chunxia, source, ladv, produce] [training, network, learning, deep, data, indicates, distribution, augmented, set, measure, batch] [virtual, computer, vision, conference, single, lighting, ground, international, directly, camera, rendering, indoor, truth, consistent, estimation, full, marker]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Daquan and Long, Chengjiang and Zhang, Hongpan and Yu, Hanning and Dong, Xinzhi and Xiao, Chunxia},
  title = {ARShadowGAN: Shadow Generative Adversarial Network for Augmented Reality in Single Light Scenes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
An End-to-End Edge Aggregation Network for Moving Object Segmentation
Prashant W. Patil, Kuldeep M. Biradar, Akshay Dudhane, Subrahmanyam Murala


Moving object segmentation in videos (MOS) is a highly demanding task for security-based applications like automated outdoor video surveillance. Most existing techniques for MOS depend heavily on fine-tuning a model on the first frame(s) of the test sequence or on a complicated training procedure, which limits their practical serviceability. In this paper, an inherent correlation learning-based edge extraction mechanism (EEM) and a dense residual block (DRB) are proposed for discriminative foreground representation. The multi-scale EEM module provides efficient foreground edge-related information (with the help of the encoder) to the decoder through skip connections at each subsequent scale. Further, the responses of the optical flow encoder stream and the last EEM module are embedded in the bridge network. The bridge network comprises multi-scale residual blocks with dense connections to learn effective and efficient foreground-relevant features. Finally, to generate accurate and consistent foreground object maps, a decoder block is proposed with skip connections from the respective multi-scale EEM feature maps and the down-sampled response of the previous frame's output. Specifically, the proposed network does not require any pre-trained models or fine-tuning of the parameters with the initial frame(s) of the test video. The performance of the proposed network is evaluated with different configurations like disjoint, cross-data, and global training-testing techniques. An ablation study is conducted to analyse each module of the proposed network. To demonstrate the effectiveness of the proposed framework, a comprehensive analysis on four benchmark video datasets is conducted. Experimental results show that the proposed approach outperforms the state-of-the-art methods for MOS.
[video, frame, stream, visual, decoder, previous, moving, mechanism, three, attention, concatenation] [object, foreground, segmentation, feature, module, table, edge, background, ablation, response, effectiveness, thermal, subsequent] [database, model, effective, testing, trained, input, adversarial] [proposed, ieee, optical, method, flow, eem, figure, motion, existing, based, extraction, residual, block, scale, quantitative, achieved, automated, analysis, high, output, fast, drb] [encoder, bridge, learn, image] [network, training, learning, test, performance, practical, respective, compared, data, task, average] [conference, approach, computer, international, dense, system, vision, rgb]
@InProceedings{Patil_2020_CVPR,
  author = {Patil, Prashant W. and Biradar, Kuldeep M. and Dudhane, Akshay and Murala, Subrahmanyam},
  title = {An End-to-End Edge Aggregation Network for Moving Object Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Video Stabilization Using Optical Flow
Jiyang Yu, Ravi Ramamoorthi


We propose a novel neural network that infers the per-pixel warp fields for video stabilization from the optical flow fields of the input video. While previous learning based video stabilization methods attempt to implicitly learn frame motions from color videos, our method resorts to optical flow for motion analysis and directly learns the stabilization using the optical flow. We also propose a pipeline that uses optical flow principal components for motion inpainting and warp field smoothing, making our method robust to moving objects, occlusion and optical flow inaccuracy, which is challenging for other video stabilization methods. Our method achieves quantitatively and visually better results than the state-of-the-art optimization based and deep learning based video stabilization methods. Our method also gives a 3x speed improvement compared to the optimization based methods.
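The abstract describes smoothing motion by projecting per-frame optical flow onto a small number of principal components. Below is a minimal NumPy sketch of that idea under simplifying assumptions: flow is already computed for every frame, the PCA basis is taken from the video itself, and a plain temporal box filter stands in for the paper's warp-field smoothing; function and parameter names are illustrative, not the authors' pipeline.

```python
import numpy as np

def pca_smooth_flow(flows, n_components=16, window=15):
    """Project per-frame flow fields onto a PCA basis and temporally smooth
    the low-dimensional coefficients (assumes T is larger than n_components).

    flows: (T, H, W, 2) array of dense optical flow fields.
    Returns smoothed flow fields of the same shape.
    """
    T, H, W, C = flows.shape
    X = flows.reshape(T, -1)                      # one flattened flow field per frame
    mean = X.mean(axis=0, keepdims=True)
    Xc = X - mean
    # PCA basis from an SVD of the centered frame-by-pixel matrix
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:n_components]                     # (k, H*W*2) principal flow components
    coeff = Xc @ basis.T                          # (T, k) per-frame coefficients
    # temporal smoothing of the coefficients with a simple box filter
    kernel = np.ones(window) / window
    coeff_smooth = np.stack(
        [np.convolve(coeff[:, i], kernel, mode="same") for i in range(coeff.shape[1])],
        axis=1)
    X_smooth = coeff_smooth @ basis + mean
    return X_smooth.reshape(T, H, W, C)
```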
[video, frame, moving, visual, dataset] [wang, stage, feature, mask, propose, introduces, sliding, category, achieves, object, region, map] [distortion, input, example, robust, original, noise, stability] [flow, warp, optical, motion, field, stabilization, based, method, figure, liu, comparison, warping, window, grundmann, valid, ieee, output, proposed, pixel, color, raw, smoothed, spatial, result, warped, traditional, inaccurate, frequency, stabilizing, affine] [loss, generate, inpainted, image, domain, introduce] [network, training, learning, large, deep, better, optimization, compared, discussed, indicates, set, number, neural, invalid, larger, function, note] [pca, local, principal, pipeline, camera, grid, second, compute, directly, physically, fit, transformation]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Jiyang and Ramamoorthi, Ravi},
  title = {Learning Video Stabilization Using Optical Flow},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Reusing Discriminators for Encoding: Towards Unsupervised Image-to-Image Translation
Runfa Chen, Wenbing Huang, Binghui Huang, Fuchun Sun, Bin Fang


Unsupervised image-to-image translation is a central task in computer vision. Current translation frameworks abandon the discriminator once the training process is completed. This paper argues for a novel role of the discriminator by reusing it to encode the images of the target domain. The proposed architecture, termed NICE-GAN, exhibits two advantageous patterns over previous approaches: first, it is more compact, since no independent encoding component is required; second, this plug-in encoder is directly trained by the adversarial loss, making it more informative and more effectively trained if a multi-scale discriminator is applied. The main issue in NICE-GAN is the coupling of translation with discrimination along the encoder, which could incur training inconsistency when we play the min-max game via GAN. To tackle this issue, we develop a decoupled training strategy by which the encoder is only trained when maximizing the adversarial loss and is kept frozen otherwise. Extensive experiments on four popular benchmarks demonstrate the superior performance of NICE-GAN over state-of-the-art methods in terms of FID, KID, and also human preference. Comprehensive ablation studies are also carried out to isolate the validity of each proposed component. Our codes are available at https://github.com/alpc91/NICE-GAN-pytorch.
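A minimal PyTorch sketch of the two ideas the abstract highlights: the discriminator's early layers double as the domain encoder, and that shared encoder is frozen during the translation (min) step of the decoupled training. The layer sizes, the `decoder` module, and the adversarial term are illustrative assumptions, not the paper's exact architecture or losses.

```python
import torch
import torch.nn as nn

class ReusableDiscriminator(nn.Module):
    """Discriminator whose early layers double as the encoder of its domain."""
    def __init__(self, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(                      # reused as the encoder
            nn.Conv2d(3, ch, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.LeakyReLU(0.2))
        self.head = nn.Sequential(                         # discrimination head only
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 4, 1, 4, 1, 0))

    def encode(self, x):
        return self.encoder(x)

    def forward(self, x):
        return self.head(self.encoder(x))

def translation_step(disc_src, disc_tgt, decoder, x_src, opt):
    """Decoupled training: the shared encoders are updated only in the
    discriminator (max) step, so they are frozen here in the min step."""
    for p in list(disc_src.parameters()) + list(disc_tgt.parameters()):
        p.requires_grad_(False)
    fake_tgt = decoder(disc_src.encode(x_src))             # translate via the reused encoder
    loss = -disc_tgt(fake_tgt).mean()                      # illustrative adversarial term
    opt.zero_grad(); loss.backward(); opt.step()
    for p in list(disc_src.parameters()) + list(disc_tgt.parameters()):
        p.requires_grad_(True)
    return loss
```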
[encoding, hidden, attention, three, current, unit, dog, dataset] [table, feature, ablation] [trained, adversarial, input, identity] [ieee, proposed, figure, pattern, residual, scale, method, receptive, based, convolutional] [translation, discriminator, image, encoder, unsupervised, fid, kid, domain, generator, generative, loss, cyclegan, nice, introspective, latent, independent, gan, component, translated, lgan, lcycle, generated, real, conditional, munit, drit, encoders, lrecon] [training, decoupled, learning, neural, number, conducted, reusing, paper, performance, network, andrew, processing, compact, function, layer, architecture, space] [conference, computer, vision, international, formulation, human, reconstruction, single]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Runfa and Huang, Wenbing and Huang, Binghui and Sun, Fuchun and Fang, Bin},
  title = {Reusing Discriminators for Encoding: Towards Unsupervised Image-to-Image Translation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Robust Design of Deep Neural Networks Against Adversarial Attacks Based on Lyapunov Theory
Arash Rahnama, Andre T. Nguyen, Edward Raff


Deep neural networks (DNNs) are vulnerable to subtle adversarial perturbations applied to the input. These adversarial perturbations, though imperceptible, can easily mislead the DNN. In this work, we take a control theoretic approach to the problem of robustness in DNNs. We treat each individual layer of the DNN as a nonlinear system and use Lyapunov theory to prove stability and robustness locally. We then proceed to prove stability and robustness globally for the entire DNN. We develop empirically tight bounds on the response of the output layer, or any hidden layer, to adversarial perturbations added to the input, or the input of hidden layers. Recent works have proposed spectral norm regularization as a solution for improving robustness against l2 adversarial attacks. Our results give new insights into how spectral norm regularization can mitigate the adversarial effects. Finally, we evaluate the power of our approach on a variety of data sets and network architectures and against some of the well-known adversarial attacks.
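The abstract connects its Lyapunov analysis to spectral norm regularization as a mitigation for l2 perturbations. The sketch below shows one common way to add such a penalty (a power-iteration estimate of each weight matrix's largest singular value); it illustrates the regularizer the abstract refers to, not the authors' Lyapunov-based bounds, and the reshaping of conv kernels into matrices is a simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spectral_norm_penalty(model, n_iter=1):
    """Approximate sum of squared spectral norms of the weight matrices."""
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            W = module.weight.reshape(module.weight.shape[0], -1)
            u = torch.randn(W.shape[0], device=W.device)
            for _ in range(n_iter):                 # power iteration (fresh start each call)
                v = F.normalize(W.t() @ u, dim=0)
                u = F.normalize(W @ v, dim=0)
            sigma = u @ W @ v                       # largest singular value estimate
            penalty = penalty + sigma ** 2
    return penalty

# usage (weight 0.01 is illustrative):
# loss = criterion(model(x), y) + 0.01 * spectral_norm_penalty(model)
```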
[relationship, behavior, hidden, work, individual, previous] [positive, global, level, table, response, inside] [adversarial, robustness, dnn, input, nonlinear, lyapunov, robust, stability, norm, theory, definition, dnns, attack, instantaneously, iifofp, model, conic, bounded, condition, lipschitz, trained, adversary, iifg, subsection] [output, spectral, based, analysis, signal, relu, method] [control, mapping] [layer, training, matrix, regularization, set, arxiv, appendix, theorem, preprint, entire, data, larger, weight, activation, neural, accuracy, network, stable, learning, consider, respective, design, deep, bound, machine, empirically, selection, size, vector, applied, general] [system, approach, defined, transformation, enforcing]
@InProceedings{Rahnama_2020_CVPR,
  author = {Rahnama, Arash and Nguyen, Andre T. and Raff, Edward},
  title = {Robust Design of Deep Neural Networks Against Adversarial Attacks Based on Lyapunov Theory},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
StarGAN v2: Diverse Image Synthesis for Multiple Domains
Yunjey Choi, Youngjung Uh, Jaejun Yoo, Jung-Woo Ha


A good image-to-image translation model should learn a mapping between different visual domains while satisfying the following properties: 1) diversity of generated images and 2) scalability over multiple domains. Existing methods address either of the issues, having limited diversity or multiple models for all domains. We propose StarGAN v2, a single framework that tackles both and shows significantly improved results over the baselines. Experiments on CelebA-HQ and a new animal faces dataset (AFHQ) validate our superiority in terms of visual quality, diversity, and scalability. To better assess image-to-image translation models, we release AFHQ, high-quality animal faces with large inter- and intra-domain differences. The code, pretrained models, and dataset are available at https://github.com/clovaai/stargan-v2.
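To make the diversity-plus-scalability claim concrete, here is a hedged sketch of a StarGAN v2-style mapping network: a random latent code plus a target-domain index is turned into a style code, with one output head per domain so a single network serves all domains. Layer sizes and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Latent code z + domain index -> style code s (one head per domain)."""
    def __init__(self, latent_dim=16, style_dim=64, num_domains=3, hidden=512):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, style_dim) for _ in range(num_domains)])

    def forward(self, z, domain):
        h = self.shared(z)                                             # (B, hidden)
        styles = torch.stack([head(h) for head in self.heads], dim=1)  # (B, D, style)
        idx = domain.view(-1, 1, 1).expand(-1, 1, styles.shape[-1])
        return styles.gather(1, idx).squeeze(1)                        # target-domain style

# Different z for the same target domain yield different style codes, hence diverse outputs:
# mapping = MappingNetwork()
# s = mapping(torch.randn(8, 16), torch.full((8,), 1, dtype=torch.long))
```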
[multiple, visual, dataset, provide, considering, individual] [table, framework, leading] [input, model, adversarial, quality, trained, animal, original, translates] [reference, method, output, figure, proposed, comparison, transform, high] [style, image, domain, stargan, code, diverse, latent, generator, mapping, source, synthesis, generated, translation, afhq, learns, lpips, target, diversity, generate, loss, fid, learn, encoder, generative, discriminator, real, drit, msgan, synthesize, train, corresponding, reflecting, produce, munit, specific, unsupervised, generating] [network, training, baseline, large, note, better, learning, regularization, number, randomly, evaluate, learned, space, test, random, objective] [single, reconstruction, allows]
@InProceedings{Choi_2020_CVPR,
  author = {Choi, Yunjey and Uh, Youngjung and Yoo, Jaejun and Ha, Jung-Woo},
  title = {StarGAN v2: Diverse Image Synthesis for Multiple Domains},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Warping Residual Based Image Stitching for Large Parallax
Kyu-Yul Lee, Jae-Young Sim


Image stitching techniques align two images captured at different viewing positions onto a single wider image. When the captured 3D scene is not planar and the camera baseline is large, two images exhibit parallax where the relative positions of scene structures are quite different from each view. The existing image stitching methods often fail to work on the images with large parallax. In this paper, we propose an image stitching algorithm robust to large parallax based on the novel concept of warping residuals. We first estimate multiple homographies and find their inlier feature matches between two images. Then we evaluate warping residual for each feature match with respect to the multiple homographies. To alleviate the parallax artifacts, we partition input images into superpixels and warp each superpixel adaptively according to an optimal homography which is computed by minimizing the error of feature matches weighted by the warping residuals. Experimental results demonstrate that the proposed algorithm provides accurate stitching results for images with large parallax, and outperforms the existing methods qualitatively and quantitatively.
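A rough OpenCV/NumPy sketch of the two ingredients named in the abstract: sequentially fitting several homographies to feature matches, then scoring every match by its residual under each homography. This is a simplification (the per-superpixel weighted warping is omitted), and the model count and RANSAC threshold are illustrative.

```python
import cv2
import numpy as np

def multi_homographies(pts1, pts2, n_models=4, thresh=3.0):
    """Sequentially fit homographies with RANSAC, removing inliers each round."""
    homographies, remaining = [], np.ones(len(pts1), dtype=bool)
    for _ in range(n_models):
        idx = np.flatnonzero(remaining)
        if idx.size < 8:
            break
        H, mask = cv2.findHomography(pts1[idx], pts2[idx], cv2.RANSAC, thresh)
        if H is None:
            break
        homographies.append(H)
        remaining[idx[mask.ravel().astype(bool)]] = False
    return homographies

def warping_residuals(pts1, pts2, homographies):
    """Residual of each match under each homography: ||H(p1) - p2||."""
    res = np.empty((len(pts1), len(homographies)))
    for j, H in enumerate(homographies):
        warped = cv2.perspectiveTransform(
            pts1.reshape(-1, 1, 2).astype(np.float32), H)
        res[:, j] = np.linalg.norm(warped.reshape(-1, 2) - pts2, axis=1)
    return res   # small residual -> the match agrees with that local homography
```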
[multiple, red, three, regular] [feature, superpixels, superpixel, object, foreground, located, background, occlusion, occluded, global, refined, matched] [input, vicinity, white, exhibit, experimental] [warping, homography, based, stitching, proposed, figure, parallax, residual, warped, warp, ieee, cell, apap, sjj, existing, adaptively, conventional, reference, pixel, severe, method, spatial, partition, adaptive, partitioned, stitched, neighboring, captured] [image, target, aligns, alignment, align, domain] [large, algorithm, optimal, note, performance, weight, applied, set, average, small] [scene, estimated, estimate, transformation, grid, homographies, error, point, distance, planar, inlier, ground, relative, computed, local, estimation, plane]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Kyu-Yul and Sim, Jae-Young},
  title = {Warping Residual Based Image Stitching for Large Parallax},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A U-Net Based Discriminator for Generative Adversarial Networks
Edgar Schonfeld, Bernt Schiele, Anna Khoreva


Among the major remaining challenges for generative adversarial networks (GANs) is the capacity to synthesize globally and locally coherent images with object shapes and textures indistinguishable from real images. To target this issue, we propose an alternative U-Net based discriminator architecture, borrowing insights from the segmentation literature. The proposed U-Net based architecture allows the discriminator to provide detailed per-pixel feedback to the generator while maintaining the global coherence of synthesized images, by also providing global image feedback. Empowered by the per-pixel response of the discriminator, we further propose a per-pixel consistency regularization technique based on the CutMix data augmentation, encouraging the U-Net discriminator to focus more on semantic and structural changes between real and fake images. This improves the U-Net discriminator training, further enhancing the quality of generated samples. The novel discriminator improves over the state of the art in terms of standard distribution and image quality metrics, enabling the generator to synthesize images with varying structure, appearance and levels of detail, maintaining global and local realism. Compared to the BigGAN baseline, we achieve an average improvement of 2.7 FID points across FFHQ, CelebA, and the proposed COCO-Animals dataset.
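A hedged sketch of the CutMix-based per-pixel consistency term the abstract describes: mix a real and a fake image with a rectangular binary mask, and penalize the difference between the U-Net decoder's output on the mixed image and the same mix of its outputs on the unmixed images. The box sampling and the MSE penalty are plausible choices, not necessarily the paper's exact ones; `disc_pix` stands for the per-pixel (decoder) output of the U-Net discriminator.

```python
import torch
import torch.nn.functional as F

def rand_box_mask(h, w, device):
    """Random rectangular binary mask (1 = take pixels from the real image)."""
    lam = torch.rand(1).item()
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mask = torch.ones(1, 1, h, w, device=device)
    mask[:, :, y1:y2, x1:x2] = 0.0
    return mask

def cutmix_consistency(disc_pix, real, fake):
    """disc_pix(x) returns the per-pixel real/fake map of the U-Net discriminator."""
    mask = rand_box_mask(real.shape[2], real.shape[3], real.device)
    mixed = mask * real + (1 - mask) * fake
    d_mixed = disc_pix(mixed)
    with torch.no_grad():
        d_target = mask * disc_pix(real) + (1 - mask) * disc_pix(fake)
    return F.mse_loss(d_mixed, d_target)
```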
[decoder, dataset, provide] [global, propose, score, table, improves, semantic, focus, improvement] [adversarial, model, feedback, original, quality, improving, input, improve] [figure, proposed, based, output, resolution, high, pixel, pattern, introduced, method] [discriminator, image, gan, cutmix, fid, real, generator, consistency, fake, biggan, ddec, ffhq, generative, encoder, generated, synthetic, loss, denc, synthesis, celeba, conditional, synthesize, unconditional, latent, structural] [training, regularization, learning, architecture, neural, network, processing, note, class, data, classification, standard, improved, best, space, objective, augmentation, observe, vector, better] [conference, local, international, computer, vision, median, human, well, refer, globally, coherent, allows]
@InProceedings{Schonfeld_2020_CVPR,
  author = {Schonfeld, Edgar and Schiele, Bernt and Khoreva, Anna},
  title = {A U-Net Based Discriminator for Generative Adversarial Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unpaired Portrait Drawing Generation via Asymmetric Cycle Mapping
Ran Yi, Yong-Jin Liu, Yu-Kun Lai, Paul L. Rosin


Portrait drawing is a common form of art with high abstraction and expressiveness. Due to its unique characteristics, existing methods achieve decent results only with paired training data, which is costly and time-consuming to obtain. In this paper, we address the problem of automatic transfer from face photos to portrait drawings with unpaired training data. We observe that, due to the significant imbalance of information richness between photos and drawings, existing unpaired transfer methods such as CycleGAN tend to embed invisible reconstruction information indiscriminately in the whole drawing, leading to important facial features partially missing in drawings. To address this problem, we propose a novel asymmetric cycle mapping that enforces the reconstruction information to be visible (by a truncation loss) and only embedded in selective facial regions (by a relaxed forward cycle-consistency loss). Along with localized discriminators for the eyes, nose and lips, our method well preserves all important facial features in the generated portrait drawings. By introducing a style classifier and taking the style vector into account, our method can learn to generate portrait drawings in multiple styles using a single network. Extensive experiments show that our model outperforms state-of-the-art methods.
[three, recognition, unit, multiple] [feature, web, edge, partially, propose] [face, facial, input, model, adversarial, strict, study, quality, collected, invisible, indiscriminately, trained] [method, ieee, pattern, convolution, figure, proposed, removing] [style, drawing, loss, photo, generated, portrait, image, cycle, translation, transfer, real, unpaired, generator, consistency, cyclegan, apdrawing, discriminator, asymmetric, truncation, combogan, nose, content, dualgan, munit, apdrawings, generation, paired, missing, generate, dcls, gatys, user, mapping, generative, domain] [training, data, neural, network, learning, probability, classifier, function, classification, problem, set] [conference, computer, relaxed, local, vision, reconstruction, international, structure, reconstruct, single, reconstructed]
@InProceedings{Yi_2020_CVPR,
  author = {Yi, Ran and Liu, Yong-Jin and Lai, Yu-Kun and Rosin, Paul L.},
  title = {Unpaired Portrait Drawing Generation via Asymmetric Cycle Mapping},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
When to Use Convolutional Neural Networks for Inverse Problems
Nathaniel Chodosh, Simon Lucey


Reconstruction tasks in computer vision aim fundamentally to recover an undetermined signal from a set of noisy measurements. Examples include super-resolution, image denoising, and non-rigid structure from motion, all of which have seen recent advancements through deep learning. However, earlier work made extensive use of sparse signal reconstruction frameworks (e.g. convolutional sparse coding). While this work was ultimately surpassed by deep learning, it rested on a much more developed theoretical framework. Recent work by Papyan et al. provides a bridge between the two approaches by showing how a convolutional neural network (CNN) can be viewed as an approximate solution to a convolutional sparse coding (CSC) problem. In this work we argue that for some types of inverse problems the CNN approximation breaks down, leading to poor performance. We argue that for these types of problems the CSC approach should be used instead and validate this argument with empirical evidence. Specifically, we identify JPEG artifact reduction and non-rigid trajectory reconstruction as challenging inverse problems for CNNs and demonstrate state-of-the-art performance on them using a CSC method. Furthermore, we offer some practical improvements to this model and its application, and also show how insights from the CSC model can be used to make CNNs effective in tasks where their naive application fails.
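To ground the CNN-versus-CSC contrast, here is a minimal sketch of a convolutional sparse coding solver: ISTA with a fixed convolutional dictionary, where the gradient step uses the adjoint convolution and sparsity comes from soft thresholding. It is a generic CSC baseline, not the authors' specific model; the step size and iteration count are illustrative and step must be small enough for convergence.

```python
import torch
import torch.nn.functional as F

def conv_ista(y, D, lam=0.1, step=0.1, n_iter=50):
    """ISTA for convolutional sparse coding:
        min_z 0.5 * || y - D * z ||^2 + lam * ||z||_1,
    where D * z is a transposed convolution of the sparse maps z with filters D.

    y: (B, C, H, W) observation;  D: (K, C, k, k) dictionary filters (k odd).
    Returns sparse feature maps z of shape (B, K, H, W).
    """
    pad = D.shape[-1] // 2
    z = torch.zeros(y.shape[0], D.shape[0], y.shape[2], y.shape[3], device=y.device)
    for _ in range(n_iter):
        residual = y - F.conv_transpose2d(z, D, padding=pad)    # data-fit residual
        z = z + step * F.conv2d(residual, D, padding=pad)       # gradient step (adjoint conv)
        z = torch.sign(z) * torch.clamp(z.abs() - step * lam, min=0.0)  # soft threshold
    return z
```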
[work, trajectory, state, modeling, predict, include] [cnn, art, table] [jpeg, model, original, quality, input, developed] [csc, convolutional, inverse, method, signal, cnns, figure, coding, block, ieee, based, artifact, assumption, thresholding, applying, operator, sulam, simon, papyan, convolutionally, prior, degradation, removal, proposed, journal, pattern] [image, perform, synthetic, zhu, expect, train] [algorithm, deep, neural, problem, performance, learning, matrix, network, experiment, sparsity, optimization, dictionary, objective, equation, task, layer, diagonal, linear, practical, min, processing, better, note, set, learned] [sparse, reconstruction, computer, measurement, structure, vision, conference, form, approach, demonstrate, solving, solve, full, michael, international]
@InProceedings{Chodosh_2020_CVPR,
  author = {Chodosh, Nathaniel and Lucey, Simon},
  title = {When to Use Convolutional Neural Networks for Inverse Problems},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
LUVLi Face Alignment: Estimating Landmarks' Location, Uncertainty, and Visibility Likelihood
Abhinav Kumar, Tim K. Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, Chen Feng


Modern face alignment methods have become quite accurate at predicting the locations of facial landmarks, but they do not typically estimate the uncertainty of their predicted locations nor predict whether landmarks are visible. In this paper, we present a novel framework for jointly predicting landmark locations, associated uncertainties of these predicted locations, and landmark visibilities. We model these as mixed random variables and estimate them using a deep network trained using our proposed Location, Uncertainty, and Visibility Likelihood (LUVLi) loss. In addition, we release an entirely new labeling of a large face alignment dataset with over 19,000 face images in a full range of head poses. Each face is manually labeled with the ground-truth locations of 68 landmarks, with the additional information of whether each landmark is visible, self-occluded (due to extreme head poses), or externally occluded. Not only does our joint estimation yield accurate estimates of the uncertainty of predicted landmark locations, but it also yields state-of-the-art estimates for the landmark locations themselves on multiple standard face alignment datasets. Our method's estimates of the uncertainty of predicted landmark locations could be used to automatically identify input images on which face alignment fails, which can be critical for downstream tasks.
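A hedged sketch of the kind of joint objective the abstract describes: a 2D Gaussian negative log-likelihood per landmark, with the covariance parameterized through a Cholesky factor, plus a binary cross-entropy term for visibility. The parameterization, masking of invisible landmarks, and equal weighting of the two terms are illustrative assumptions rather than the exact LUVLi loss.

```python
import torch
import torch.nn.functional as F

def landmark_nll_loss(mu, chol_params, vis_logit, gt_xy, gt_vis):
    """mu: (B, N, 2) predicted landmark means
    chol_params: (B, N, 3) lower-triangular Cholesky entries [l11, l21, l22]
    vis_logit: (B, N) visibility logits; gt_xy: (B, N, 2); gt_vis: (B, N) in {0, 1}."""
    l11 = F.softplus(chol_params[..., 0]) + 1e-4        # keep the diagonal positive
    l21 = chol_params[..., 1]
    l22 = F.softplus(chol_params[..., 2]) + 1e-4
    d = gt_xy - mu
    # forward substitution: solve L y = d for the 2x2 lower-triangular L
    y1 = d[..., 0] / l11
    y2 = (d[..., 1] - l21 * y1) / l22
    mahalanobis = y1 ** 2 + y2 ** 2                      # d^T Sigma^{-1} d, Sigma = L L^T
    log_det = 2.0 * (torch.log(l11) + torch.log(l22))    # log |Sigma|
    nll = 0.5 * (mahalanobis + log_det)                  # constant term dropped
    nll = (nll * gt_vis).sum() / gt_vis.sum().clamp(min=1)   # only visible landmarks
    vis_loss = F.binary_cross_entropy_with_logits(vis_logit, gt_vis.float())
    return nll + vis_loss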
[dataset, three, prediction, multiple] [predicted, regression, location, table, occluded, heatmap, split, localization, object, box, sota, detection, denotes, positive] [landmark, face, visibility, luvli, model, facial, externally, menpo, nmebox, lij, unoccluded, cholesky, vbij, robust, frontal, input, stefanos, xiaoming] [method, figure, gaussian, likelihood, cvpr, convolutional, chen, spatial] [alignment, loss, image, appearance, multivariate, train] [network, test, covariance, deep, neural, distribution, matrix, training, active, set, learning, probability, best, function, arxiv, preprint, variance, labeled] [uncertainty, estimate, estimation, pose, estimator, accurate, human, direct, laplacian, estimating, full, shape, joint, estimated, second, visible]
@InProceedings{Kumar_2020_CVPR,
  author = {Kumar, Abhinav and Marks, Tim K. and Mou, Wenxuan and Wang, Ye and Jones, Michael and Cherian, Anoop and Koike-Akino, Toshiaki and Liu, Xiaoming and Feng, Chen},
  title = {LUVLi Face Alignment: Estimating Landmarks' Location, Uncertainty, and Visibility Likelihood},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Affinity Graph Supervision for Visual Recognition
Chu Wang, Babak Samari, Vladimir G. Kim, Siddhartha Chaudhuri, Kaleem Siddiqi


Affinity graphs are widely used in deep architectures, including graph convolutional neural networks and attention networks. Thus far, the literature has focused on abstracting features from such graphs, while the learning of the affinities themselves has been overlooked. Here we propose a principled method to directly supervise the learning of weights in affinity graphs, to exploit meaningful connections between entities in the data source. Applied to a visual attention network, our affinity supervision improves relationship recovery between objects, even without the use of manually annotated relationship labels. We further show that affinity learning between objects boosts scene categorization performance and that the supervision of affinity can also be applied to graphs built from mini-batches, for neural network training. In an image classification task we demonstrate consistent improvement over the baseline, with diverse network architectures and datasets.
[attention, graph, visual, relationship, relation, built, represent, recognition, node, context] [affinity, supervision, object, feature, mass, cnn, proposal, detection, resnet, categorization, ablation, module, pooling, table, propose, improvement, edge, aggregation, box, main, boost] [tiny, model, study, input] [figure, convolutional, based, version, method, proposed, reference] [loss, target, image, supervised, pretrained] [learning, network, training, neural, applied, batch, matrix, classification, set, baseline, metric, performance, pairwise, learned, test, fin, deep, arxiv, imagenet, data, large, best, class, task, general, supervising, optimization, reported, convergence] [scene, ground, truth, additional, focal, defined, directly, demonstrate, distance, vision]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Chu and Samari, Babak and Kim, Vladimir G. and Chaudhuri, Siddhartha and Siddiqi, Kaleem},
  title = {Affinity Graph Supervision for Visual Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Magnification of Posture Deviations Across Subjects
Michael Dorkenwald, Uta Buchler, Bjorn Ommer


Analyzing human posture and precisely comparing it across different subjects is essential for accurate understanding of behavior and numerous vision applications such as medical diagnostics, sports, or surveillance. Motion magnification techniques help to see even small deviations in posture that are invisible to the naked eye. However, they fail when comparing subtle posture differences across individuals with diverse appearance. Keypoint-based posture estimation and classification techniques can handle large variations in appearance, but are invariant to subtle deviations in posture. We present an approach to unsupervised magnification of posture differences across individuals despite large deviations in appearance. We do not require keypoint annotation and visualize deviations on a sub-bodypart level. To transfer appearance across subjects onto a magnified posture, we propose a novel loss for disentangling appearance and posture in an autoencoder. Posture magnification yields exaggerated images that are different from the training set. Therefore, we incorporate magnification already into the training of the disentangled autoencoder and learn on real data and synthesized magnifications without supervision. Experiments confirm that our approach improves upon the state-of-the-art in magnification and on the application of discovering posture deviations due to impairment.
[video, encoding, frame, dataset, behavior, decoder, recognition, work, three, represent, previous] [final, propose] [posture, magnification, model, magnified, query, subject, magnifying, trained, input, quality, original, magnify, amplify, ldis, lmag, healthy, deviation, datasets, adversarial, precisely] [motion, ieee, pattern, reference, figure, analysis, comparison, proposed] [appearance, image, subtle, unsupervised, loss, transfer, generated, introduce, generate, real, train, encoder, disentangling, disentanglement, target, generation, content, representation] [neural, learning, deep, training, william, classification, data, evaluate, processing, large, problem, space] [computer, conference, vision, approach, distance, reconstruction, require, human, european, compare, directly, pose, international, novel, michael, estimation]
@InProceedings{Dorkenwald_2020_CVPR,
  author = {Dorkenwald, Michael and Buchler, Uta and Ommer, Bjorn},
  title = {Unsupervised Magnification of Posture Deviations Across Subjects},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Accurate Estimation of Body Height From a Single Depth Image via a Four-Stage Developing Network
Fukun Yin, Shizhe Zhou


Non-contact measurement of human body height can be very difficult under some circumstances. In this paper we address the problem of accurately estimating the height of a person with arbitrary postures from a single depth image. By introducing a novel part-based intermediate representation plus a four-stage increasingly complex deep neural network, we manage to achieve significantly higher accuracy than previous methods. We first describe the human body in the form of a segmentation of the human torso into four nearly rigid parts and then predict their lengths respectively by three CNNs. Instead of directly adding the lengths of these parts together, we further construct another independent developing CNN that combines the intermediate representation, part lengths and depth information to finally predict the body height. Here we develop an increasingly complex network architecture and adopt hybrid pooling to optimize the training process. To the best of our knowledge, this is the first method that estimates height only from a single depth image. In experiments our average accuracy reaches 99.1% for people in various positions and postures.
[prediction, three, dataset, increasingly, predict, construct, order, length] [height, table, segmentation, head, developing, pooling, propose, adopt, stand, segment, edge, fully] [input, verify, improve, posture] [intermediate, method, figure, based, convolutional, ieee, output, proposed, science] [image, representation, person, train] [network, architecture, accuracy, data, average, number, set, test, label, deep, neural, learning, training, layer, connection, rate, problem, better] [human, depth, body, error, torso, estimation, single, camera, conference, estimate, relative, computer, rgb, estimating, complex, calibration, international, structure, distance, reconstruction, vanishing, point, pose, accurate]
@InProceedings{Yin_2020_CVPR,
  author = {Yin, Fukun and Zhou, Shizhe},
  title = {Accurate Estimation of Body Height From a Single Depth Image via a Four-Stage Developing Network},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fast Soft Color Segmentation
Naofumi Akimoto, Huachun Zhu, Yanghua Jin, Yoshimitsu Aoki


We address the problem of soft color segmentation, defined as decomposing a given image into several RGBA layers, each containing only homogeneous color regions. The resulting layers from decomposition pave the way for applications that benefit from layer-based editing, such as recoloring and compositing of images and videos. The current state-of-the-art approach for this problem is hindered by slow processing time due to its iterative nature, and consequently does not scale to certain real-world scenarios. To address this issue, we propose a neural network based method for this task that decomposes a given image into multiple layers in a single forward pass. Furthermore, our method separately decomposes the color layers and the alpha channel layers. By leveraging a novel training objective, our method achieves proper assignment of colors amongst layers. As a consequence, our method achieves promising quality without the inference-speed issue of iterative approaches. Our thorough experimental analysis shows that our method produces qualitative and quantitative results comparable to previous methods while achieving a 300,000x speed improvement. Finally, we utilize our proposed method on several applications, and demonstrate its speed advantage, especially in video editing.
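A minimal sketch of the compositing objective implied by the decomposition into RGBA layers: the per-pixel alphas should sum to one, and the alpha-weighted layer colors should reconstruct the input image. The additive compositing model, the L1 penalties, and the relative weight are illustrative assumptions, not the paper's exact training objective.

```python
import torch
import torch.nn.functional as F

def soft_segmentation_losses(colors, alphas, image):
    """colors: (B, L, 3, H, W) per-layer RGB; alphas: (B, L, 1, H, W) in [0, 1];
    image: (B, 3, H, W). Returns reconstruction and alpha-normalization terms."""
    composite = (alphas * colors).sum(dim=1)             # alpha-weighted additive compositing
    recon = F.l1_loss(composite, image)
    alpha_sum = alphas.sum(dim=1)                        # (B, 1, H, W)
    alpha_reg = F.l1_loss(alpha_sum, torch.ones_like(alpha_sum))  # alphas should sum to 1
    return recon, alpha_reg

# total = recon + 0.1 * alpha_reg   (the relative weight is illustrative)
```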
[recognition, video, time, speed, multiple] [segmentation, hard, semantic, score, table, propose, mask, guided] [input, original, versus, model] [color, palette, method, figure, ieee, aksoy, pattern, based, decompose, residue, comparison, rgba, pixel, koyama, tan, decomposing, proposed, recoloring, analysis, removal, quantitative, unmixing, skip, fast, output, decomposes, processed] [image, alpha, decomposed, loss, user, qualitative, consists, editing, corresponding] [layer, soft, network, neural, predictor, sparsity, training, processing, function, energy, number, objective, note, variance, deep, machine, task, normalized, sum, size, problem] [single, computer, conference, vision, rgb, reconstruction, approach, decomposition, error, estimation, acm, homogeneous, novel, distance, constraint, overlapping]
@InProceedings{Akimoto_2020_CVPR,
  author = {Akimoto, Naofumi and Zhu, Huachun and Jin, Yanghua and Aoki, Yoshimitsu},
  title = {Fast Soft Color Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Global Optimality for Point Set Registration Using Semidefinite Programming
Jose Pedro Iglesias, Carl Olsson, Fredrik Kahl


In this paper we present a study of global optimality conditions for Point Set Registration (PSR) with missing data. PSR is the problem of aligning multiple point clouds with an unknown target point cloud. Since non-linear rotation constraints are present, the problem is inherently non-convex and typically relaxed by computing the Lagrange dual, which is a Semidefinite Program (SDP). In this work we show that given a local minimizer, the dual variables of the SDP can be computed in closed form. This opens up the possibility of verifying optimality using the SDP formulation without explicitly solving it. In addition, it allows us to study, through spectral analysis, under what conditions the relaxation is tight. We show that if the errors in the (unknown) optimal solution are bounded, the SDP formulation will be able to recover it.
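For readers unfamiliar with the construction, the following is a schematic of the Lagrangian duality used here, written for a generic homogeneous QCQP of the kind that rotation-constrained registration reduces to (the orthogonality constraints R^T R = I become quadratic equality constraints on the stacked variable); the PSR-specific structure of M and the A_i is omitted.

```latex
% Primal (generic homogeneous QCQP):
\min_{x}\ x^\top M x \quad \text{s.t.}\quad x^\top A_i x = b_i,\ \ i=1,\dots,m.
% Lagrangian and its dual, which is an SDP:
\mathcal{L}(x,\lambda) \;=\; x^\top\Big(M-\textstyle\sum_i \lambda_i A_i\Big)x \;+\; \sum_i \lambda_i b_i,
\qquad
\max_{\lambda}\ \sum_i \lambda_i b_i \quad \text{s.t.}\quad M-\sum_i \lambda_i A_i \succeq 0.
% A feasible x^\star is certified globally optimal if dual variables \lambda exist with
% \big(M-\sum_i \lambda_i A_i\big)x^\star = 0 \ \text{and}\ M-\sum_i \lambda_i A_i \succeq 0,
% since then the primal and dual objective values coincide.
```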
[evaluation, sequence, work, multiple, observed, rjt] [global, positive, apply, main] [noise, case, condition, study] [dual, primal, result, figure, ieee, spectral, analysis, block, pattern, noisy, spatial] [missing, source, target, gap, generated, corresponding, real] [problem, data, matrix, theorem, candidate, set, relaxation, note, optimization, rank, bound, lemma, average, function, diagonal, corollary, optimizer, optimal, minimum, carl, paper, closed, studied, distribution, min] [point, solution, local, sdp, optimality, registration, duality, rotation, sufficient, computer, eigenvalue, conference, semidefinite, psr, solving, lagrangian, cloud, international, vision, averaging, kkt, coordinate, cost, smallest, write, fulfills, allows, solved, notation]
@InProceedings{Iglesias_2020_CVPR,
  author = {Iglesias, Jose Pedro and Olsson, Carl and Kahl, Fredrik},
  title = {Global Optimality for Point Set Registration Using Semidefinite Programming},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Image2StyleGAN++: How to Edit the Embedded Images?
Rameen Abdal, Yipeng Qin, Peter Wonka


We propose Image2StyleGAN++, a flexible image editing framework with many applications. Our framework extends the recent Image2StyleGAN in three ways. First, we introduce noise optimization as a complement to the W+ latent space embedding. Our noise optimization can restore high frequency features in images and thus significantly improves the quality of reconstructed images, e.g. a big increase of PSNR from 20 dB to 45 dB. Second, we extend the global W+ latent space embedding to enable local embeddings. Third, we combine embedding with activation tensor manipulation to perform high quality local edits along with global semantic edits on images. Such edits motivate various high quality image editing applications, e.g. image reconstruction, image inpainting, image crossover, local style transfer, image editing using scribbles, and attribute level feature transfer. Examples of the edited images are shown across the paper for visual inspection.
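A hedged sketch of the embedding-by-optimization idea the abstract builds on: jointly optimizing an extended latent (W+) and per-layer noise maps of a pretrained StyleGAN-like generator to minimize pixel and perceptual losses against a target image. Here `generator` and `perceptual` are assumed callables, not a specific library API, and the step count, learning rate, and loss weights are illustrative.

```python
import torch
import torch.nn.functional as F

def embed_image(generator, perceptual, target, w_init, noise_maps,
                steps=1000, lr=0.01, w_pix=1.0, w_percept=1.0):
    """generator(w_plus, noises) -> image; perceptual(x, y) -> scalar loss.
    Both are assumed interfaces to a pretrained StyleGAN-like model."""
    w = w_init.clone().requires_grad_(True)                  # W+ latent, one code per layer
    noises = [n.clone().requires_grad_(True) for n in noise_maps]
    opt = torch.optim.Adam([w] + noises, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        img = generator(w, noises)
        loss = w_pix * F.mse_loss(img, target) + w_percept * perceptual(img, target)
        loss.backward()
        opt.step()
    return w.detach(), [n.detach() for n in noises]

# Local edits can then be made by blending two embeddings with a spatial mask,
# or by restricting the pixel loss to a masked region during optimization.
```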
[embedding, embedded, embed, multiple, three, previous] [mask, semantic, framework, propose, global, building] [noise, adversarial, quality, manipulation, improve, input, face] [high, tensor, perceptual, spatial, ieee, wout, figure, mse, frequency, column, pattern, result, adam, psnr] [image, nini, style, masked, generative, loss, latent, code, perform, edits, editing, inpainting, wini, mblur, transfer, copying, stylegan, ail, gan, third, synthesis, target, corresponding, generated, gans, generator, fourth, attribute, control, extended, texture] [optimization, space, algorithm, activation, function, optimizing, learning, neural, nout, set, layer, network, rate, improved, training] [local, computer, conference, vision, second, international, reconstruction]
@InProceedings{Abdal_2020_CVPR,
  author = {Abdal, Rameen and Qin, Yipeng and Wonka, Peter},
  title = {Image2StyleGAN++: How to Edit the Embedded Images?},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SQE: a Self Quality Evaluation Metric for Parameters Optimization in Multi-Object Tracking
Yanru Huang, Feiyu Zhu, Zheni Zeng, Xi Qiu, Yuan Shen, Jianan Wu


We present a novel self quality evaluation metric, SQE, for parameter optimization in the challenging yet critical multi-object tracking task. Current evaluation metrics all require annotated ground truth and thus fail in test environments and realistic circumstances, prohibiting further optimization after training. By contrast, our metric reflects the internal characteristics of trajectory hypotheses and measures tracking performance without ground truth. We demonstrate that trajectories of different quality exhibit different single or multiple peaks over the feature distance distribution, inspiring us to design a simple yet effective method to assess the quality of trajectories using a two-class Gaussian mixture model. Experiments, mainly on the MOT16 Challenge data sets, verify the effectiveness of our method in both correlating with existing metrics and enabling parameter self-optimization to achieve better performance. We believe that our conclusions and method are inspiring for future multi-object tracking in practice.
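A hedged scikit-learn sketch of the core idea: fit a two-component Gaussian mixture to the distribution of appearance-feature distances within a trajectory hypothesis, and flag trajectories whose fit is clearly bimodal (well-separated means with non-trivial weights) as likely identity switches. The distance definition, thresholds, and the boolean output are illustrative simplifications of the described metric.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def trajectory_looks_clean(features, sep_thresh=2.0, min_weight=0.2):
    """features: (T, D) appearance embeddings of one trajectory hypothesis.
    Returns True if the trajectory looks clean (single mode), False if suspect."""
    center = features.mean(axis=0)
    dists = np.linalg.norm(features - center, axis=1).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, covariance_type="full").fit(dists)
    means = np.sort(gmm.means_.ravel())
    stds = np.sqrt(gmm.covariances_.ravel())
    # well-separated means with both components carrying weight -> likely two identities
    separated = (means[1] - means[0]) > sep_thresh * stds.mean()
    both_used = gmm.weights_.min() > min_weight
    return not (separated and both_used)
```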
[evaluation, multiple, trajectory, video, work, length] [tracking, sqe, threshold, feature, tracker, false, merging, mot, effectiveness, object, table, visualization, track, association, denotes, chi, detection] [quality, identification, identity, model] [method, ieee, figure, gaussian, pattern, based, existing, comparison] [reid, target, intra, inter, corresponding, person, changing] [metric, performance, set, test, distribution, number, data, baseline, optimization, training, online, parameter, optimal, deep, better, simple, mixture, algorithm, large, small, dif, design, find, calculate, ideal, practical] [distance, conference, ground, computer, vision, truth, single, demonstrate, matching, kitti, international]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Yanru and Zhu, Feiyu and Zeng, Zheni and Qiu, Xi and Shen, Yuan and Wu, Jianan},
  title = {SQE: a Self Quality Evaluation Metric for Parameters Optimization in Multi-Object Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
EventSR: From Asynchronous Events to Image Reconstruction, Restoration, and Super-Resolution via End-to-End Adversarial Learning
Lin Wang, Tae-Kyun Kim, Kuk-Jin Yoon


Event cameras sense intensity changes and have many advantages over conventional cameras. To take advantage of event cameras, some methods have been proposed to reconstruct intensity images from event streams. However, the outputs are still in low resolution (LR), noisy, and unrealistic. The low-quality outputs limit broader applications of event cameras, where high spatial resolution (HR) is needed in addition to high temporal resolution, high dynamic range, and freedom from motion blur. We consider the problem of reconstructing and super-resolving intensity images from pure events, when no ground truth (GT) HR images and down-sampling kernels are available. To tackle these challenges, we propose a novel end-to-end pipeline that reconstructs LR images from event streams, enhances the image quality and upsamples the enhanced images, called EventSR. In the absence of real GT images, our method is primarily unsupervised, deploying adversarial learning. To train EventSR, we create an open dataset including both real-world and simulated scenes. The use of both datasets boosts the network performance, and the network architectures and various loss functions in each phase help improve the image quality. The whole pipeline is trained in three phases. While each phase is mainly for one of the three tasks, the networks in earlier phases are fine-tuned by the respective loss functions in an end-to-end manner. Experimental results show that EventSR generates high-quality SR images from events for both simulated and real-world data.
[dataset, visual, three, frame, video, evaluation, recognition, embedding, goal] [achieves, propose, sota, including, tracking, wang, cropped] [adversarial, clean, quality, feedback, lid, trained, experimental, pgd] [event, phase, eventsr, intensity, ieee, aps, pattern, proposed, figure, hdr, high, reconstructing, motion, method, stacked, based, resolution, dynamic, esim, pgr, restoration, deblur, davide, asynchronous, low, blur, contrast, clr, comparison, remove] [image, loss, real, train, unsupervised, utilize, realistic, mapping] [training, learning, network, better, data, deep, problem, open, set, performance, evaluate] [conference, computer, vision, reconstruct, reconstruction, camera, simulated, international, single, well, reconstructed, second]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Lin and Kim, Tae-Kyun and Yoon, Kuk-Jin},
  title = {EventSR: From Asynchronous Events to Image Reconstruction, Restoration, and Super-Resolution via End-to-End Adversarial Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hierarchical Pyramid Diverse Attention Networks for Face Recognition
Qiangchang Wang, Tianyi Wu, He Zheng, Guodong Guo


Deep learning has achieved great success in face recognition (FR); however, few existing models take hierarchical multi-scale local features into consideration. In this work, we propose a hierarchical pyramid diverse attention (HPDA) network. First, it is observed that local patches play important roles in FR when the global face appearance changes dramatically. Some recent works apply attention modules to locate local patches automatically without relying on face landmarks. Unfortunately, without considering diversity, some learned attentions tend to have redundant responses around similar local patches, while neglecting other potentially discriminative facial parts. Meanwhile, local patches may appear at different scales due to pose variations or large expression changes. To alleviate these challenges, we propose a pyramid diverse attention (PDA) module to learn multi-scale diverse local representations automatically and adaptively. More specifically, a pyramid attention is developed to capture multi-scale features. Meanwhile, a diversity learning scheme is developed to encourage models to focus on different local patches and generate diverse local features. Second, almost all existing models focus on extracting features from the last convolutional layer, lacking the local details and small-scale face parts present in lower layers. Instead of simple concatenation or addition, we propose to use hierarchical bilinear pooling (HBP) to fuse information from different layers effectively. Thus, the HPDA is developed by integrating the PDA into the HBP. Experimental results on several datasets show the effectiveness of the HPDA compared to state-of-the-art methods.
[attention, hierarchical, multiple, recognition, bilinear, represent, dataset, automatically, concatenation, three, observed, work] [feature, global, pyramid, cnn, background, locate, table, pooling, focus, guide, framework, propose, challenging] [face, model, facial, hpda, hbp, complementary, guodong, expression, lanet, calfw, cplfw, lfw] [proposed, cnns, convolutional, column, ieee, scale, noisy, pattern, spatial, figure] [diverse, discriminative, row, learn, loss] [learning, deep, layer, compared, number, network, arxiv, preprint, large, performance] [local, conference, computer, pose, vision, varying, capture]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Qiangchang and Wu, Tianyi and Zheng, He and Guo, Guodong},
  title = {Hierarchical Pyramid Diverse Attention Networks for Face Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
RGBD-Dog: Predicting Canine Pose from RGBD Sensors
Sinead Kearney, Wenbin Li, Martin Parsons, Kwang In Kim, Darren Cosker


The automatic extraction of animal 3D pose from images without markers is of interest in a range of scientific fields. Most work to date predicts animal pose from RGB images, based on 2D labelling of joint positions. However, due to the difficult nature of obtaining training data, no ground truth dataset of 3D animal motion is available to quantitatively evaluate these approaches. In addition, a lack of 3D animal pose data also makes it difficult to train 3D pose-prediction methods in a similar manner to the popular field of body-pose prediction. In our work, we focus on the problem of 3D canine pose estimation from RGBD images, recording a diverse range of dog breeds with several Microsoft Kinect v2s, simultaneously obtaining the 3D ground truth skeleton via a motion capture system. We generate a dataset of synthetic RGBD images from this data. A stacked hourglass network is trained to predict 3D joint locations, which is then constrained using prior models of shape and pose. We evaluate our model on both synthetic and real RGBD images and compare our results to previously published work fitting canine models to images. Finally, despite our training set consisting only of dog data, visual inspection implies that our network can produce good predictions for images of other quadrupeds - e.g. horses or cats - when their pose is similar to that contained in our training set.
[dog, skeleton, dataset, prediction, predict, three, work, associated] [predicted, mask, cnn, highest, threshold] [model, animal, trained, heatmaps, chosen, neutral] [motion, figure, method, ieee, result, range, based, scale, created] [synthetic, image, real, generate, produce, train, texture, generated, generation, unknown] [network, data, training, set, neural, process, number, evaluate] [pose, joint, shape, depth, pck, mpjpe, human, kinect, ground, estimation, capture, rgbd, truth, conference, body, computer, mesh, rgb, root, system, smal, provided, pipeline, error, vision, rendered, bone, pca, accurate, michael, fitting, rotation, international, canine, compare, well, additional, supplementary]
@InProceedings{Kearney_2020_CVPR,
  author = {Kearney, Sinead and Li, Wenbin and Parsons, Martin and Kim, Kwang In and Cosker, Darren},
  title = {RGBD-Dog: Predicting Canine Pose from RGBD Sensors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Scale Progressive Fusion Network for Single Image Deraining
Kui Jiang, Zhongyuan Wang, Peng Yi, Chen Chen, Baojin Huang, Yimin Luo, Jiayi Ma, Junjun Jiang


Rain streaks appear with various degrees of blurring and at various resolutions, owing to their different distances from the camera. Similar rain patterns are visible in a rain image as well as in its multi-scale (or multi-resolution) versions, which makes it possible to exploit such complementary information for rain streak representation. In this work, we explore the multi-scale collaborative representation of rain streaks from the perspective of input image scales and hierarchical deep features in a unified framework, termed multi-scale progressive fusion network (MSPFN), for single image rain streak removal. For similar rain streaks at different positions, we employ recurrent calculation to capture the global texture, thus allowing us to explore the complementary and redundant information at the spatial dimension to characterize target rain streaks. Besides, we construct a multi-scale pyramid structure and further introduce an attention mechanism to guide the fine fusion of this correlated information from different scales. This multi-scale progressive fusion strategy not only promotes the cooperative representation, but also boosts end-to-end training. Our proposed method is extensively evaluated on several benchmark datasets and achieves state-of-the-art results. Moreover, we conduct experiments on joint deraining, detection, and segmentation tasks, and inspire a new research direction of vision-task-driven image deraining. The source code is available at https://github.com/kuihua/MSPFN.
[recurrent, attention, exploit, three, collaborative] [pyramid, detection, module, table, segmentation, feature, object, semantic, achieves, propose, cau, denotes, global, framework] [datasets, model, original, correlated, quality, complementary, cfm, input] [rain, deraining, ieee, mspfn, fusion, streak, residual, mspfnm, proposed, scale, based, restoration, prenet, ffm, convolution, rescan, comparison, figure, removal, cascaded, umrl, psnr, iderain, block, skip, ssim, spatial, parallel, gaussian, channel, remove] [image, progressive, representation, loss, synthetic, target, introduce, fine] [network, performance, deep, learning, better, set, basic, layer, function, average] [conference, single, well, joint, reconstruction, international, depth, capture]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Kui and Wang, Zhongyuan and Yi, Peng and Chen, Chen and Huang, Baojin and Luo, Yimin and Ma, Jiayi and Jiang, Junjun},
  title = {Multi-Scale Progressive Fusion Network for Single Image Deraining},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning a Neural 3D Texture Space From 2D Exemplars
Philipp Henzler, Niloy J. Mitra, Tobias Ritschel


We suggest a generative model of 2D and 3D natural textures with diversity, visual fidelity and at high computational efficiency. This is enabled by a family of methods that extend ideas from classic stochastic procedural texturing (Perlin noise) to learned, deep, non-linearities. Our model encodes all exemplars from a diverse set of textures without a need to be re-trained for each exemplar. Applications include texture interpolation, and learning 3D textures from 2D exemplars.
[work, visual, natural, multiple, decoder, three, time] [cnn, key, seed] [noise, vgg, model, input, variation, quality, adversarial, access, success] [method, figure, cnns, high, interpolation, spatial, classic, achieved, convolutional, pixel] [texture, diversity, perlin, exemplar, code, synthesis, image, produce, generate, latent, style, infinite, perlint, cnnd, oursnot, encoder, generative, extended, oursp, loss, diverse, generating, produced, procedural, transformed] [space, random, similarity, stochastic, learning, memory, linear, efficiency, neural, set, vector, entire, computational, required, simple, data] [approach, mlp, single, position, allows, well, mlps, acm, volume, vision, require, match, transformation, error]
@InProceedings{Henzler_2020_CVPR,
  author = {Henzler, Philipp and Mitra, Niloy J. and Ritschel, Tobias},
  title = {Learning a Neural 3D Texture Space From 2D Exemplars},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
BachGAN: High-Resolution Image Synthesis From Salient Object Layout
Yandong Li, Yu Cheng, Zhe Gan, Licheng Yu, Liqiang Wang, Jingjing Liu


We propose a new task towards more practical applications for image generation - high-quality image synthesis from salient object layout. This new setting requires users to provide only the layout of salient objects (i.e., foreground bounding boxes and categories) and lets the model complete the drawing with an invented background and a matching foreground. Two main challenges spring from this new task: (i) how to generate fine-grained details and realistic textures without segmentation map input; and (ii) how to create and weave a background into standalone objects in a seamless way. To tackle this, we propose Background Hallucination Generative Adversarial Network (BachGAN), which leverages a background retrieval module to first select a set of segmentation maps from a large candidate pool, then encodes these candidate layouts via a background fusion module to hallucinate a suitable background for the given objects. By generating the hallucinated background representation dynamically, our model can synthesize high-resolution images with both photo-realistic foreground and integral background. Experiments on Cityscapes and ADE20K datasets demonstrate the advantage of BachGAN over existing approaches, measured on both visual fidelity of generated images and visual alignment between output images and input layouts.
[bank, retrieval, visual, text, work, graph, previous] [object, background, segmentation, salient, map, semantic, foreground, car, bounding, module, feature, table, denotes, propose] [input, model, adversarial, hallucination, hallucinate, datasets, quality, query, adding] [figure, based, proposed, fusion, conv, method, pixel, analysis] [image, layout, bachgan, synthesis, generate, retrieved, spade, generation, conditional, loss, generator, synthesized, gan, fid, generated, translation, generative, synthesize, corresponding, realistic, train, windowpane] [label, memory, training, set, task, layer, network, candidate, normalization, compared, pool, learning, number, simple, max, baseline, performance] [scene, provided, demonstrate, consistent]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yandong and Cheng, Yu and Gan, Zhe and Yu, Licheng and Wang, Liqiang and Liu, Jingjing},
  title = {BachGAN: High-Resolution Image Synthesis From Salient Object Layout},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Rethinking Data Augmentation for Image Super-resolution: A Comprehensive Analysis and a New Strategy
Jaejun Yoo, Namhyuk Ahn, Kyung-Ah Sohn


Data augmentation is an effective way to improve the performance of deep networks. Unfortunately, current methods are mostly developed for high-level vision tasks (e.g., classification) and few are studied for low-level vision tasks (e.g., image restoration). In this paper, we provide a comprehensive analysis of the existing augmentation methods applied to the super-resolution task. We find that the methods discarding or manipulating the pixels or features too much hamper the image restoration, where the spatial relationship is very important. Based on our analyses, we propose CutBlur that cuts a low-resolution patch and pastes it to the corresponding high-resolution image region and vice versa. The key intuition of CutBlur is to enable a model to learn not only "how" but also "where" to super-resolve an image. By doing so, the model can understand "how much", instead of blindly learning to apply super-resolution to every given pixel. Our method consistently and significantly improves the performance across various scenarios, especially when the model size is big and the data is collected under real-world environments. We also show that our method improves other low-level vision tasks, such as denoising and compression artifact removal.
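A minimal sketch of the CutBlur operation as described: cut a rectangular region from the bicubically upsampled LR input and paste the corresponding HR patch into it, or vice versa, so the augmented input differs from the original only in resolution inside the box. The box sizing, swap probability, and function signature are illustrative, not the authors' exact implementation.

```python
import torch

def cutblur(lr_up, hr, alpha=0.7, p=0.5):
    """lr_up: LR image upsampled to HR size (C, H, W) or (B, C, H, W); hr: HR target.
    Returns an augmented (input, target) pair; the HR target is left unchanged."""
    if torch.rand(1).item() > p:
        return lr_up, hr
    h, w = hr.shape[-2:]
    cut_h = int(h * alpha * torch.rand(1).item())
    cut_w = int(w * alpha * torch.rand(1).item())
    cy = torch.randint(h - cut_h + 1, (1,)).item()
    cx = torch.randint(w - cut_w + 1, (1,)).item()
    if torch.rand(1).item() < 0.5:
        aug = lr_up.clone()                                   # paste an HR patch into the LR input
        aug[..., cy:cy + cut_h, cx:cx + cut_w] = hr[..., cy:cy + cut_h, cx:cx + cut_w]
    else:
        aug = hr.clone()                                      # or paste an LR patch into the HR image
        aug[..., cy:cy + cut_h, cx:cx + cut_w] = lr_up[..., cy:cy + cut_h, cx:cx + cut_w]
    return aug, hr
```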
[dataset, provide] [region, apply, improves, table, feature, benchmark] [model, trained, input, datasets, jpeg, comprehensive] [cutblur, proposed, edsr, method, rcan, carn, ieee, figure, psnr, realsr, comparison, pattern, denoising, residual, analysis, resolution, ssim, artifact, output, existing, radu, srcnn, applying, xhr, intensity, gaussian, perceptual, convolutional, spatial, based] [image, learn, synthetic, cutmix, train, lpips, corresponding, unrealistic, qualitative, gap, real] [performance, augmentation, baseline, training, data, test, arxiv, preprint, size, network, regularization, applied, cutout, strategy, better, mixup, deep, random, find, learning, consistently, large, set, small, problem, simple] [vision, computer, conference, single]
@InProceedings{Yoo_2020_CVPR,
  author = {Yoo, Jaejun and Ahn, Namhyuk and Sohn, Kyung-Ah},
  title = {Rethinking Data Augmentation for Image Super-resolution: A Comprehensive Analysis and a New Strategy},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
On Positive-Unlabeled Classification in GAN
Tianyu Guo, Chang Xu, Jiajun Huang, Yunhe Wang, Boxin Shi, Chao Xu, Dacheng Tao


This paper defines a positive and unlabeled classification problem for standard GANs, which then leads to a novel technique to stabilize the training of the discriminator in GANs. Traditionally, real data are taken as positive while generated data are negative. This positive-negative classification criterion was kept fixed all through the learning process of the discriminator without considering the gradually improved quality of generated data, even if they could be more realistic than real data at times. In contrast, it is more reasonable to treat the generated data as unlabeled, which could be positive or negative according to their quality. The discriminator is thus a classifier for this positive and unlabeled classification problem, and we derive a new Positive-Unlabeled GAN (PUGAN). We theoretically discuss the global optimality the proposed model will achieve and the equivalent optimization goal. Empirically, we find that PUGAN can achieve comparable or even better performance than those sophisticated discriminator stabilization methods.
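To illustrate what treating generated samples as unlabeled can look like in practice, here is a hedged sketch of a discriminator objective built on the standard non-negative PU risk estimator (Kiryo et al.), with real samples as positives and generated samples as unlabeled. This is a generic PU formulation for illustration; the paper's exact objective and its choice of the class prior may differ.

```python
import torch
import torch.nn.functional as F

def pu_discriminator_loss(d_real, d_fake, prior=0.5):
    """d_real / d_fake: discriminator logits on real (positive) and generated
    (treated as unlabeled) samples; prior is the assumed fraction of
    positive-quality samples among the generated ones."""
    loss_pos = lambda logit: F.binary_cross_entropy_with_logits(logit, torch.ones_like(logit))
    loss_neg = lambda logit: F.binary_cross_entropy_with_logits(logit, torch.zeros_like(logit))
    risk_pos = prior * loss_pos(d_real)
    # negative risk estimated from the unlabeled (generated) data, corrected by the positive part
    risk_neg = loss_neg(d_fake) - prior * loss_neg(d_real)
    return risk_pos + torch.clamp(risk_neg, min=0.0)     # non-negative correction
```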
[dataset, three, sgan, provide] [positive, table, framework] [quality, adversarial, datasets, stability, model, trained, improve] [proposed, method, figure, resolution, version, prior, existing, analysis, pattern, relativistic] [generated, real, gan, discriminator, loss, generator, fid, generative, generation, gans, fake, pusgan, image, lsgan, pdata, pgf, distinguish, xgr, hingegan, xgf, pulsgan] [data, training, distribution, function, basic, learning, objective, batch, algorithm, standard, performance, sample, arxiv, classification, achieve, preprint, unlabeled, better, neural, network, max, size, evaluate, stable, min, class, proportion, problem, theoretical, general, enjoys, optimal, theorem, experiment, setting, processing, fixed] [demonstrate]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Tianyu and Xu, Chang and Huang, Jiajun and Wang, Yunhe and Shi, Boxin and Xu, Chao and Tao, Dacheng},
  title = {On Positive-Unlabeled Classification in GAN},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DoveNet: Deep Image Harmonization via Domain Verification
Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, Liqing Zhang


Image composition is an important operation in image processing, but the inconsistency between foreground and background significantly degrades the quality of the composite image. Image harmonization, which aims to make the foreground compatible with the background, is a promising yet challenging task. However, the lack of a high-quality, publicly available dataset for image harmonization greatly hinders the development of image harmonization techniques. In this work, we contribute an image harmonization dataset, iHarmony4, by generating synthesized composite images based on the COCO (resp. Adobe5k, Flickr, day2night) dataset, leading to our HCOCO (resp. HAdobe5k, HFlickr, Hday2night) sub-dataset. Moreover, we propose a new deep image harmonization method, DoveNet, using a novel domain verification discriminator, with the insight that the foreground needs to be translated to the same domain as the background. Extensive experiments on our constructed dataset demonstrate the effectiveness of our proposed method. Our dataset and code are available at https://github.com/bcmi/Image_Harmonization_Datasets.
[dataset, attention, constructed, visual, evaluation, build, microsoft] [foreground, background, table, coco, flickr, global, segmentation, region, object, apply, effectiveness, positive, propose] [verification, model, input, adversarial, datasets, compatible, study, quality] [color, based, method, proposed, convolutional, mse, range, figure, traditional, captured, remove, convolution] [image, composite, domain, real, harmonization, synthesized, dovenet, discriminator, harmonized, transfer, generate, target, dih, generated, paired, representation, corresponding, unrealistic, generator, hflickr, hcoco, produce, harmonious, zhu, loss, zhe, jianfu, liqing, generating] [training, deep, learning, set, test, large, best, better, network, expected, ratio] [acm, partial, single, novel]
@InProceedings{Cong_2020_CVPR,
  author = {Cong, Wenyan and Zhang, Jianfu and Niu, Li and Liu, Liu and Ling, Zhixin and Li, Weiyuan and Zhang, Liqing},
  title = {DoveNet: Deep Image Harmonization via Domain Verification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Noise Robust Generative Adversarial Networks
Takuhiro Kaneko, Tatsuya Harada


Generative adversarial networks (GANs) are neural networks that learn data distributions through adversarial training. Intensive studies have shown that recent GANs achieve promising results in reproducing training images. However, because they reproduce the training data faithfully, they also reproduce any noise contained in the training images. As an alternative, we propose a novel family of GANs called noise robust GANs (NR-GANs), which can learn a clean image generator even when training images are noisy. In particular, NR-GANs can solve this problem without complete noise information (e.g., the noise distribution type, noise amount, or signal-noise relationship). To achieve this, we introduce a noise generator and train it along with a clean image generator. However, without any constraints, there is no incentive to generate the image and the noise separately. Therefore, we propose distribution and transformation constraints that encourage the noise generator to capture only the noise-specific components. In particular, considering such constraints under different assumptions, we devise two variants of NR-GANs for signal-independent noise and three variants of NR-GANs for signal-dependent noise. On three benchmark datasets, we demonstrate the effectiveness of NR-GANs in noise robust image generation. Furthermore, we show the applicability of NR-GANs in image denoising. Our code is available at https://github.com/takuhirok/NR-GAN/.
[three, provide, relationship, natural, outperforms] [table, apply, propose] [noise, clean, robust, ambientgan, adversarial, model, type, poisson, edroom, multiplicative, study, trained, datasets, agf, mgf, limitation, comprehensive, examined, case, takuhiro] [noisy, gaussian, figure, assumption, denoising, comparison, applicable, ieee, lei] [image, gans, generator, generative, learn, gan, real, variable, unknown, lsun, generated, ffhq, generate, introduce, generation, representation, tatsuya] [training, distribution, learning, deep, performance, arxiv, fixed, best, amount, problem, neural, standard, achieve, preprint, knowledge, data, test, devise, improved] [transformation, constraint, complex, defined, solution, sparse, full]
@InProceedings{Kaneko_2020_CVPR,
  author = {Kaneko, Takuhiro and Harada, Tatsuya},
  title = {Noise Robust Generative Adversarial Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Normalizing Flows With Multi-Scale Autoregressive Priors
Apratim Bhattacharyya, Shweta Mahajan, Mario Fritz, Bernt Schiele, Stefan Roth


Flow-based generative models are an important class of exact inference models that admit efficient inference and sampling for image synthesis. Owing to the efficiency constraints on the design of the flow layers, e.g. split coupling flow layers in which approximately half the pixels do not undergo further transformations, they have limited expressiveness for modeling long-range data dependencies compared to autoregressive models that rely on conditional pixel-wise generation. In this work, we improve the representational power of flow-based models by introducing channel-wise dependencies in their latent space through multi-scale autoregressive priors (mAR). Our mAR prior for models with split coupling flow layers (mAR-SCF) can better capture dependencies in complex multimodal data. The resulting model achieves state-of-the-art density estimation results on MNIST, CIFAR-10, and ImageNet. Furthermore, we show that mAR-SCF allows for improved image generation quality, with gains in FID and Inception scores compared to state-of-the-art flow-based models.
[modeling, speed, den, multimodal, sequential, powerful, latexit, previous] [split, table, van, propose, level] [model, improve, mnist, quality, input, adversarial, difficult] [autoregressive, prior, flow, mar, invertible, residual, affine, mixlogcdf, coupling, glow, exact, spatial, plit, scf, pixelcnn, channel, comparison, interpolation, resolution, likelihood, method, pixel, modeled] [generative, latent, image, variational, conditional, synthesis] [sampling, data, number, sample, inference, density, normalizing, distribution, space, size, better, computational, required, efficient, imagenet, note, compared, architecture, layer, improved, operation, log, neural] [complex, allows, capture, allow, limited, estimation, cost]
@InProceedings{Bhattacharyya_2020_CVPR,
  author = {Bhattacharyya, Apratim and Mahajan, Shweta and Fritz, Mario and Schiele, Bernt and Roth, Stefan},
  title = {Normalizing Flows With Multi-Scale Autoregressive Priors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Robust Reference-Based Super-Resolution With Similarity-Aware Deformable Convolution
Gyumin Shim, Jinsun Park, In So Kweon


In this paper, we propose a novel and efficient reference feature extraction module referred to as the Similarity Search and Extraction Network (SSEN) for reference-based super-resolution (RefSR) tasks. The proposed module extracts aligned relevant features from a reference image to increase the performance over single image super-resolution (SISR) methods. In contrast to conventional algorithms which utilize brute-force searches or optical flow estimations, the proposed algorithm is end-to-end trainable without any additional supervision or heavy computation, predicting the best match with a single network forward operation. Moreover, the proposed module is aware of not only the best matching position but also the relevancy of the best match. This makes our algorithm substantially robust when irrelevant reference images are given, overcoming the major cause of the performance degradation when using existing RefSR methods. Furthermore, our module can be utilized for self-similarity SR if no reference image is available. Experimental results demonstrate the superior performance of the proposed algorithm compared to previous works both quantitatively and qualitatively.
[recognition, dataset, video, attention] [offset, feature, module, propose, adopt] [input, adversarial, robustness, quality] [reference, deformable, refsr, proposed, convolution, method, sisr, psnr, pattern, ssen, dynamic, extraction, ssim, figure, patch, srntt, flow, residual, convolutional, receptive, perceptual, output, based, optical, degradation, utilized, conventional, superior, block] [image, realistic, aligned, utilize, content] [network, similarity, performance, algorithm, deep, sampling, baseline, best, training, search, compared, learning, neural, large, processing, number, process, better] [vision, computer, matching, reconstruction, single, ground, estimator, defined, truth, demonstrate, reconstruct, accurate, approach]
@InProceedings{Shim_2020_CVPR,
  author = {Shim, Gyumin and Park, Jinsun and Kweon, In So},
  title = {Robust Reference-Based Super-Resolution With Similarity-Aware Deformable Convolution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Painting Many Pasts: Synthesizing Time Lapse Videos of Paintings
Amy Zhao, Guha Balakrishnan, Kathleen M. Lewis, Fredo Durand, John V. Guttag, Adrian V. Dalca


We introduce a new video synthesis task: synthesizing time lapse videos depicting how a given painting might have been created. Artists paint using unique combinations of brushes, strokes, and colors. There are often many possible ways to create a given painting. Our goal is to learn to capture this rich range of possibilities. Creating distributions of long-term videos is a challenge for learning-based video synthesis methods. We present a probabilistic model that, given a single image of a completed painting, recurrently synthesizes steps of the painting process. We implement this model as a convolutional neural network, and introduce a novel training scheme to enable learning from a limited dataset of painting time lapses. We demonstrate that this model can be used to sample many time steps, enabling long-term stochastic video synthesis. We evaluate our method on digital and watercolor paintings collected from video websites, and show that human raters find our synthetic videos to be similar to time lapse videos produced by real artists.
[video, time, frame, work, sequential, visual, prediction, temporal, sequence, multiple, dataset, future, natural, previous] [focus, apply] [model, digital, change, input, physical, collected, study] [method, ieee, figure, vdp, pattern, convolutional, quantitative, capturing, interpolation] [painting, image, synthesis, real, paint, watercolor, lapse, synthesize, realistic, loss, synthesized, critic, artist, produce, conditional, brush, introduce, artistic, variational, style, synthesizing, encourages, content] [training, learning, neural, similarity, distribution, test, stochastic, sampling, sampled, probabilistic, deep, processing, problem, small, arxiv, preprint, sample] [computer, conference, international, vision, single, completed, human, capture, volume, distance]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Amy and Balakrishnan, Guha and Lewis, Kathleen M. and Durand, Fredo and Guttag, John V. and Dalca, Adrian V.},
  title = {Painting Many Pasts: Synthesizing Time Lapse Videos of Paintings},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GeoDA: A Geometric Framework for Black-Box Adversarial Attacks
Ali Rahmati, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard, Huaiyu Dai


Adversarial examples are known as carefully perturbed images fooling image classifiers. We propose a geometric framework to generate adversarial examples in one of the most challenging black-box settings, where the adversary can only generate a small number of queries, each of them returning the top-1 label of the classifier. Our framework is based on the observation that the decision boundary of deep networks usually has a small mean curvature in the vicinity of data samples. We propose an effective iterative algorithm to generate query-efficient black-box perturbations with small ℓp norms, which is confirmed via experimental evaluations on state-of-the-art natural image classifiers. Moreover, for p = 2, we theoretically show that our algorithm actually converges to the minimal perturbation when the curvature of the decision boundary is bounded. We also obtain the optimal distribution of the queries over the iterations of the algorithm. Finally, experimental results confirm that our principled black-box attack algorithm performs better than state-of-the-art algorithms, as it generates smaller perturbations with a reduced number of queries.
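A hedged NumPy sketch of the geometric core of such an attack: estimating the (locally flat) decision-boundary normal from top-1 label queries around a point near the boundary. The Gaussian probe distribution, sigma, and function names are illustrative assumptions, not the paper's exact estimator.

    import numpy as np

    def estimate_normal(query_label, x_boundary, n_queries=100, sigma=0.01):
        # query_label(x) returns the classifier's top-1 label (the only feedback
        # available in this black-box setting); x_boundary lies near the boundary.
        original = query_label(x_boundary)
        normal = np.zeros_like(x_boundary, dtype=float)
        for _ in range(n_queries):
            eta = sigma * np.random.randn(*x_boundary.shape)
            flipped = query_label(x_boundary + eta) != original
            # accumulate probes signed by the side of the boundary they fall on
            normal += eta if flipped else -eta
        # unit normal pointing toward the adversarial side
        return normal / (np.linalg.norm(normal) + 1e-12)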
[order] [boundary, framework, pascal, propose, box] [decision, adversarial, geoda, attack, query, perturbation, hyperplane, case, fooling, hopskipjump, truncated, access, adversary, vicinity, robustness, classified] [prior, proposed, based, ieee, low, method, comparison] [image, generate] [number, vector, distribution, algorithm, deep, performance, iteration, arxiv, preprint, optimal, classifier, search, small, neural, rate, optimization, label, covariance, matrix, data, convergence, problem, space, efficient, sparsity, general, consider, find, bound, compared] [normal, point, estimation, curvature, sparse, estimate, direction, geometric, minimal, limited, distance, conference, error, estimated, define, computer]
@InProceedings{Rahmati_2020_CVPR,
  author = {Rahmati, Ali and Moosavi-Dezfooli, Seyed-Mohsen and Frossard, Pascal and Dai, Huaiyu},
  title = {GeoDA: A Geometric Framework for Black-Box Adversarial Attacks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GAMIN: Generative Adversarial Multiple Imputation Network for Highly Missing Data
Seongwook Yoon, Sanghoon Sull


We propose a novel imputation method for highly missing data. Though most existing imputation methods focus on moderate missing rates, imputation at high missing rates, over 80%, is still important but challenging. As we expect multiple imputation to be indispensable at high missing rates, we propose a generative adversarial multiple imputation network (GAMIN) based on the generative adversarial network (GAN) for multiple imputation. Compared with similar imputation methods adopting GANs, our method makes three novel contributions: 1) we propose a novel imputation architecture which generates candidate imputations; 2) we present a confidence prediction method to perform reliable multiple imputation; 3) we realize them with GAMIN and train it using novel loss functions based on the confidence. We synthesized highly missing datasets using MNIST and CelebA to perform various experiments. The results show that our method outperforms baseline methods at high missing rates from 80% to 95%.
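A hedged sketch of how such a GAN-based imputer produces multiple imputations: observed entries are kept, and missing ones are filled from generator samples drawn with different noise vectors. The generator interface and candidate count are illustrative assumptions.

    import numpy as np

    def impute(x, mask, generator, noise_dim=64, n_candidates=5):
        # x: data vector with missing entries; mask: 1 where observed, 0 where missing.
        # generator(x_obs, mask, z) -> full candidate sample (assumed interface).
        candidates = []
        for _ in range(n_candidates):
            z = np.random.randn(noise_dim)
            x_gen = generator(x * mask, mask, z)
            # keep observed values, fill missing ones from the generator
            candidates.append(mask * x + (1.0 - mask) * x_gen)
        # multiple imputations; e.g. combine them weighted by a confidence network
        return candidates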
[multiple, dataset, prediction, observation, observed, work, three, describe] [confidence, table, mask, propose, regression, focus] [mnist, adversarial, input, highly, difference, actual, true, masking, model, case, korea, datasets, major] [method, figure, prior, based, high, indicate, low] [imputation, missing, generator, conditional, substitution, loss, misgan, fake, generation, imputer, discriminator, unconditional, missingness, generative, substituted, discriminates, perform, gamin, celeba, real, ladv, generates, synthesized, unsupervised, generated, gan] [data, candidate, learning, dropout, architecture, gain, stochastic, function, performance, amount, classification, rate, distribution, better, random, vector, good, objective, neural, network, metric, deep] [complete, rmse, novel, square, represented, term, dimensional]
@InProceedings{Yoon_2020_CVPR,
  author = {Yoon, Seongwook and Sull, Sanghoon},
  title = {GAMIN: Generative Adversarial Multiple Imputation Network for Highly Missing Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
An Internal Covariate Shift Bounding Algorithm for Deep Neural Networks by Unitizing Layers' Outputs
You Huang, Yuanlong Yu


Batch Normalization (BN) techniques have been proposed to reduce the so-called Internal Covariate Shift (ICS) by attempting to keep the distributions of layer outputs unchanged. Experiments have shown their effectiveness for training deep neural networks. However, since only the first two moments are controlled by these BN techniques, only a weak constraint is imposed on the layer distributions, and it is unknown whether such a constraint can reduce ICS. This paper therefore proposes a measure of ICS based on the Earth Mover (EM) distance and derives upper and lower bounds for this measure to provide a theoretical analysis of BN. The upper bound shows that BN techniques can control ICS only for outputs with low dimensions and small noise, whereas their control is not effective in other cases. This paper also proves that such control is a bounding of ICS rather than a reduction of ICS. Meanwhile, the analysis shows that the high-order moments and noise, which BN cannot control, have a great impact on the lower bound. Based on this analysis, the paper further proposes an algorithm that unitizes the outputs with an adjustable parameter to further bound ICS and cope with the problems of BN. The upper bound for the proposed unitization is noise-free and dominated only by the parameter. Thus, the parameter can be trained to tune the bound and thereby control ICS. In addition, the unitization is embedded into the BN framework to reduce information loss. Experiments show that the proposed algorithm outperforms existing BN techniques on the CIFAR-10, CIFAR-100 and ImageNet datasets.
[order, dataset, shift, provide] [kaiming, including, feature, table, bounding] [constant, trained, noise, covariate, controlled, norm, internal, case, effective] [proposed, analysis, method, convolutional, ieee] [control, proposes, image, train, shared] [bound, unitization, algorithm, upper, training, network, neural, batch, lower, deep, accuracy, normalization, reduce, paper, layer, learning, distribution, unitized, classification, arxiv, preprint, processing, imagenet, gradient, measure, performance, problem, normalizing, weight, large, size, theoretical, normalized, theorem, suppose, parameter, lth, sample, practical, unitizing, small, reduction] [distance, conference, defined, computer, international, estimated, constraint, second, transformation]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, You and Yu, Yuanlong},
  title = {An Internal Covariate Shift Bounding Algorithm for Deep Neural Networks by Unitizing Layers' Outputs},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Unified Optimization Framework for Low-Rank Inducing Penalties
Marcus Valtonen Ornhag, Carl Olsson


In this paper we study the convex envelopes of a new class of functions. Using this approach, we are able to unify two important classes of regularizers, stemming from unbiased non-convex formulations and from weighted nuclear norm penalties. This opens up the possibility of combining the best of both worlds, and of leveraging each method's contribution in cases where simply enforcing one of the regularizers is insufficient. We show that the proposed regularizers can be incorporated in standard splitting schemes such as the Alternating Direction Method of Multipliers (ADMM), as well as other sub-gradient methods. This can be implemented efficiently since the proximal operator can be computed quickly. Furthermore, we show, on real non-rigid structure-from-motion datasets, the issues that arise from using weighted nuclear norm penalties, and how these can be remedied using our proposed prior-free method.
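For context, a hedged NumPy sketch of why such splitting schemes are cheap to run: the proximal operator of a weighted nuclear norm with non-decreasing weights reduces to soft-thresholding of the singular values. This illustrates the standard WNNM-style operator, not the paper's envelope-based regularizer.

    import numpy as np

    def prox_weighted_nuclear_norm(X, weights):
        # weights: one non-negative threshold per singular value (non-decreasing).
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s_shrunk = np.maximum(s - np.asarray(weights), 0.0)
        return U @ np.diag(s_shrunk) @ Vt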
[sequence, work, marcus, recognition, inducing, observed] [table] [norm, shrinking, robust, unconstrained, splitting] [proposed, method, wnnm, figure, pattern, ieee, journal, analysis, proximal, operator, low, unifying] [missing, image, corresponding, minimizing, larsson] [singular, rank, nuclear, data, min, matrix, problem, algorithm, weighted, bias, penalty, consider, maximizing, optimization, regularization, function, datafit, carl, objective, minimization, standard, general, parameter, krx, large, max, envelope, note, set, class, computing, small, factorization, number] [computer, conference, approach, international, local, vision, convex, ground, reconstruction, form, formulation, distance, solution, recovered, allows, truth, dai, enforcing, limited, rigid, accurate, recovery, assume, supplementary]
@InProceedings{Ornhag_2020_CVPR,
  author = {Ornhag, Marcus Valtonen and Olsson, Carl},
  title = {A Unified Optimization Framework for Low-Rank Inducing Penalties},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Single-Side Domain Generalization for Face Anti-Spoofing
Yunpei Jia, Jie Zhang, Shiguang Shan, Xilin Chen


Existing domain generalization methods for face anti-spoofing endeavor to extract common differentiation features to improve generalization. However, due to the large distribution discrepancies among fake faces from different domains, it is difficult to seek a compact and generalized feature space for the fake faces. In this work, we propose an end-to-end single-side domain generalization framework (SSDG) to improve the generalization ability of face anti-spoofing. The main idea is to learn a generalized feature space where the feature distribution of real faces is compact while that of fake ones is dispersed among domains but compact within each domain. Specifically, a feature generator is trained to make only the real faces from different domains undistinguishable, but not the fake ones, thus forming single-side adversarial learning. Moreover, an asymmetric triplet loss is designed to keep the fake faces of different domains separated while the real ones are aggregated. These two components are integrated into a unified framework in an end-to-end training manner, resulting in a more generalized class boundary, especially for samples from novel domains. Feature and weight normalization is incorporated to further improve the generalization ability. Extensive experiments show that our proposed approach is effective and outperforms state-of-the-art methods on four public databases. The code is released online.
[recognition, extract, temporal, multiple, work, incorporated] [feature, detection, propose, boundary, table, leading, aggregate, framework, china] [face, generalization, adversarial, ssdg, testing, improve, dispersed, attack, database, model, maddg, biometrics, difficult, trained, effective, forensics, security, presentation] [method, figure, proposed, pattern, comparison] [fake, real, domain, source, generalized, asymmetric, loss, generator, learn, target, unseen, discriminative, discriminator, perform, ability, extracted, seeking, image, separate, corresponding, seek, texture] [space, learning, triplet, distribution, class, weight, normalization, better, deep, network, compact, training, performance, processing, mining, achieve, optimization, classifier, data] [conference, computer, vision, international, novel, well]
@InProceedings{Jia_2020_CVPR,
  author = {Jia, Yunpei and Zhang, Jie and Shan, Shiguang and Chen, Xilin},
  title = {Single-Side Domain Generalization for Face Anti-Spoofing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
The Knowledge Within: Methods for Data-Free Model Compression
Matan Haroush, Itay Hubara, Elad Hoffer, Daniel Soudry


Background: Recently, an extensive amount of research has been focused on compressing and accelerating Deep Neural Networks (DNN). So far, high compression rate algorithms require part of the training dataset for a low precision calibration, or a fine-tuning process. However, this requirement is unacceptable when the data is unavailable or contains sensitive information, as in medical and biometric use-cases. Contributions: We present three methods for generating synthetic samples from trained models. Then, we demonstrate how these samples can be used to calibrate and fine-tune quantized models without using any real data in the process. Our best performing method has a negligible accuracy degradation compared to the original training set. This method, which leverages intrinsic batch normalization layers' statistics of the trained model, can be used to evaluate data similarity. Our approach opens a path towards genuine data-free model compression, alleviating the need for training data during model deployment.
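A hedged PyTorch sketch of the BatchNorm-statistics idea behind the best-performing method: synthesize inputs whose induced batch statistics match the running statistics stored in the trained model, then use them for calibration or fine-tuning. The hook-based implementation and plain L2 matching loss are illustrative assumptions, not the authors' code.

    import torch

    def bn_statistics_loss(model, x):
        # Returns a loss measuring how far the batch statistics induced by x are
        # from the running statistics stored in the model's BatchNorm2d layers.
        # Minimising it w.r.t. x yields synthetic calibration samples.
        acts, hooks = {}, []

        def make_hook(name):
            def hook(module, inp, out):
                acts[name] = inp[0]
            return hook

        bn_layers = [(n, m) for n, m in model.named_modules()
                     if isinstance(m, torch.nn.BatchNorm2d)]
        for name, m in bn_layers:
            hooks.append(m.register_forward_hook(make_hook(name)))
        model(x)
        loss = 0.0
        for name, m in bn_layers:
            a = acts[name]
            mu = a.mean(dim=(0, 2, 3))
            var = a.var(dim=(0, 2, 3), unbiased=False)
            loss = loss + (mu - m.running_mean).pow(2).sum() \
                        + (var - m.running_var).pow(2).sum()
        for h in hooks:
            h.remove()
        return loss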
@InProceedings{Haroush_2020_CVPR,
  author = {Haroush, Matan and Hubara, Itay and Hoffer, Elad and Soudry, Daniel},
  title = {The Knowledge Within: Methods for Data-Free Model Compression},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Scale-Space Flow for End-to-End Optimized Video Compression
Eirikur Agustsson, David Minnen, Nick Johnston, Johannes Balle, Sung Jin Hwang, George Toderici


Despite considerable progress on end-to-end optimized deep networks for image compression, video coding remains a challenging task. Recently proposed methods for learned video compression use optical flow and bilinear warping for motion compensation and show competitive rate-distortion performance relative to hand-engineered codecs like H.264 and HEVC. However, these learning-based methods rely on complex architectures and training schemes, including the use of pre-trained optical flow networks, sequential training of sub-networks, adaptive rate control, and buffering intermediate reconstructions to disk during training. In this paper, we show that a generalized warping operator that better handles common failure cases, e.g., disocclusions and fast motion, can provide competitive compression results with a greatly simplified model and training procedure. Specifically, we propose scale-space flow, an intuitive generalization of optical flow that adds a scale parameter to allow the network to better model uncertainty. Our experiments show that a low-latency video compression model (no B-frames) using scale-space flow for motion compensation can outperform analogous state-of-the-art learned video compression models while being trained using a much simpler procedure and without any pre-trained optical flow networks.
[video, bilinear, outperforms, frame, dataset, prediction, previous, work] [final] [model, trained, distortion] [flow, compression, warping, optical, scale, figure, motion, hevc, field, method, residual, ieee, optimized, compensation, gaussian, proposed, psnr, codecs, warped, warp, decoded, kernel, pattern, coding, uvg] [image, corresponding, latent, encoder, loss] [training, rate, size, learned, performance, standard, network, compared, space, architecture, learning, better, equal, deep, neural, good, entropy, note, bit] [computer, conference, reconstruction, vision, estimation, international, complex, volume, well, johannes, relative, displacement, estimate, refer, directly, approach, system]
@InProceedings{Agustsson_2020_CVPR,
  author = {Agustsson, Eirikur and Minnen, David and Johnston, Nick and Balle, Johannes and Hwang, Sung Jin and Toderici, George},
  title = {Scale-Space Flow for End-to-End Optimized Video Compression},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dynamic Neural Relational Inference
Colin Graber, Alexander G. Schwing


Understanding interactions between entities, e.g., joints of the human body, team sports players, etc., is crucial for tasks like forecasting. However, interactions between entities are commonly not observed and often hard to quantify. To address this challenge, recently, `Neural Relational Inference' was introduced. It predicts static relations between entities in a system and provides an interpretable representation of the underlying system dynamics that are used for better trajectory forecasting. However, generally, relations between entities change as time progresses. Hence, static relations improperly model the data. In response to this, we develop Dynamic Neural Relational Inference (dNRI), which incorporates insights from sequential latent variable models to predict separate relation graphs for every time-step. We demonstrate on several real-world datasets that modeling dynamic relations improves forecasting of complex trajectories.
[static, relation, time, nri, trajectory, dnri, predict, relational, graph, step, prediction, future, decoder, represent, sequential, predicting, gnn, state, entity, previous, hidden, fcgraph, observed, frame, lstm, work, passing, kipf, described, basketball, singlelstm, jointlstm, modeling, traffic, attention, three] [predicted, edge, represents] [model, input, subject, change, study, improve] [prior, dynamic, motion, figure, mse, based] [latent, encoder, variable, consists, separate, address, underlying, assumes, learn] [neural, inference, posterior, distribution, approximate, function, data, learning, number, sample, entire, training, better, process, set] [capture, predicts, provided, system, point, additional, form, human]
@InProceedings{Graber_2020_CVPR,
  author = {Graber, Colin and Schwing, Alexander G.},
  title = {Dynamic Neural Relational Inference},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Real-Time Panoptic Segmentation From Dense Detections
Rui Hou, Jie Li, Arjun Bhargava, Allan Raventos, Vitor Guizilini, Chao Fang, Jerome Lynch, Adrien Gaidon


Panoptic segmentation is a complex full scene parsing task requiring simultaneous instance and semantic segmentation at high resolution. Current state-of-the-art approaches cannot run in real-time, and simplifying these architectures to improve efficiency severely degrades their accuracy. In this paper, we propose a new single-shot panoptic segmentation network that leverages dense detections and a global self-attention mechanism to operate in real-time with performance approaching the state of the art. We introduce a novel parameter-free mask construction method that substantially reduces computational complexity by efficiently reusing information from the object detection and semantic segmentation sub-tasks. The resulting network has a simple data flow that requires no feature map re-sampling, enabling significant hardware acceleration. Our experiments on the Cityscapes and COCO benchmarks show that our network works at 30 FPS on 1024x2048 resolution, trading a 3% relative performance degradation from the current state of the art for up to 440% faster inference.
[predict, state, provide, time, prediction, current, explicit, associated] [segmentation, panoptic, instance, semantic, bounding, box, mask, object, detection, predicted, feature, fpn, location, global, level, table, final, weakly, foreground, fully, propose, coco, art, backbone, framework, btj, ross, kaiming, piotr, including, assignment, iou] [model, input, quality] [ieee, method, pixel, pattern, figure, proposed, convolutional] [loss, supervised, image, introduce, target] [inference, network, performance, probability, class, number, task, arxiv, preprint, set, accuracy, learning, architecture, clustering] [computer, conference, dense, vision, construction, ground, novel, truth, scene]
@InProceedings{Hou_2020_CVPR,
  author = {Hou, Rui and Li, Jie and Bhargava, Arjun and Raventos, Allan and Guizilini, Vitor and Fang, Chao and Lynch, Jerome and Gaidon, Adrien},
  title = {Real-Time Panoptic Segmentation From Dense Detections},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Snake for Real-Time Instance Segmentation
Sida Peng, Wen Jiang, Huaijin Pi, Xiuli Li, Hujun Bao, Xiaowei Zhou


This paper introduces a novel contour-based approach named deep snake for real-time instance segmentation. Unlike some recent methods that directly regress the coordinates of the object boundary points from an image, deep snake uses a neural network to iteratively deform an initial contour to match the object boundary, which implements the classic idea of snake algorithms with a learning-based approach. For structured feature learning on the contour, we propose to use circular convolution in deep snake, which better exploits the cycle-graph structure of a contour compared against generic graph convolution. Based on deep snake, we develop a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation, which can handle errors in object localization. Experiments show that the proposed approach achieves competitive performances on the Cityscapes, KINS, SBD and COCO datasets while being efficient for real-time applications with a speed of 32.3 fps for 512 x 512 images on a 1080Ti GPU. The code is available at https://github.com/zju3dv/snake/.
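A hedged PyTorch sketch of the circular convolution applied along the contour: vertex features are ordered along the closed contour, and wrap-around padding lets an ordinary 1D convolution respect its cycle-graph structure. The explicit padding and function name are illustrative; the released implementation may differ.

    import torch
    import torch.nn.functional as F

    def circular_conv1d(features, weight, bias=None):
        # features: (batch, in_channels, n_vertices) ordered along the contour.
        # weight:   (out_channels, in_channels, kernel_size), as for F.conv1d.
        k = weight.shape[-1]
        pad = k // 2
        if pad > 0:
            # wrap the contour around so the kernel sees a closed cycle
            features = torch.cat(
                [features[..., -pad:], features, features[..., :pad]], dim=-1)
        return F.conv1d(features, weight, bias)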
[graph, dataset, three, prediction, construct] [contour, object, snake, instance, circular, feature, detection, segmentation, extreme, sbd, detector, table, box, apvol, octagon, mask, boundary, proposal, propose, achieves, diamond, centernet, fps, bounding, fully, detected, semantic, amodal, pascal, coco, deforms] [input, model, trained] [convolution, proposed, figure, based, kernel, method, output] [image, fine, perform, representation] [deep, learning, network, set, standard, efficient, function, test, performance, strategy, training, active, algorithm, validation, energy] [initial, approach, vertex, pipeline, match, defined, point, deformation, deform, shape]
@InProceedings{Peng_2020_CVPR,
  author = {Peng, Sida and Jiang, Wen and Pi, Huaijin and Li, Xiuli and Bao, Hujun and Zhou, Xiaowei},
  title = {Deep Snake for Real-Time Instance Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
AdaCoSeg: Adaptive Shape Co-Segmentation With Group Consistency Loss
Chenyang Zhu, Kai Xu, Siddhartha Chaudhuri, Li Yi, Leonidas J. Guibas, Hao Zhang


We introduce AdaCoSeg, a deep neural network architecture for adaptive co-segmentation of a set of 3D shapes represented as point clouds. Differently from the familiar single-instance segmentation problem, co-segmentation is intrinsically contextual: how a shape is segmented can vary depending on the set it is in. Hence, our network features an adaptive learning module to produce a consistent shape segmentation which adapts to a set. Specifically, given an input set of unsegmented shapes, we first employ an offline pre-trained part prior network to propose per-shape parts. Then the co-segmentation network iteratively and jointly optimizes the part labelings across the set, subject to a novel group consistency loss defined by matrix ranks. While the part prior network can be trained with noisy and inconsistently segmented shapes, the final output of AdaCoSeg is a consistent part labeling for the input set, with each shape segmented into up to a user-specified number K of parts. Overall, our method is weakly supervised, producing segmentations tailored to the test set, without consistent ground-truth segmentations. We show qualitative and quantitative results from AdaCoSeg and evaluate it via ablation studies and comparisons to state-of-the-art co-segmentation methods.
[dataset, work] [feature, segmentation, adacoseg, module, segmented, labeling, semantic, mrg, object, complementme, weak, foreground, table, weakly, cosegmentation, final, ablation, refined] [input, trained, offline, model, denoise] [prior, figure, method, adaptive, noisy, output, based, quantitative, proposed, denoising] [consistency, loss, encoder, msg, supervised, unsupervised, learn, image, plausible, learns] [network, set, deep, training, group, test, learning, rank, classifier, matrix, label, binary, online, data, small, learned, architecture, function, large, fixed] [shape, point, consistent, hao, leonidas, collection, ground, second, shapenet, chair, iteratively, truth, cloud, single, siddhartha, novel, computer, geometric]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Chenyang and Xu, Kai and Chaudhuri, Siddhartha and Yi, Li and Guibas, Leonidas J. and Zhang, Hao},
  title = {AdaCoSeg: Adaptive Shape Co-Segmentation With Group Consistency Loss},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Dynamic Routing for Semantic Segmentation
Yanwei Li, Lin Song, Yukang Chen, Zeming Li, Xiangyu Zhang, Xingang Wang, Jian Sun


Recently, numerous handcrafted and searched networks have been applied for semantic segmentation. However, previous works intend to handle inputs with various scales in pre-defined static architectures, such as FCN, U-Net, and DeepLab series. This paper studies a conceptually new method to alleviate the scale variance in semantic representation, named dynamic routing. The proposed framework generates data-dependent routes, adapting to the scale distribution of each image. To this end, a differentiable gating function, called soft conditional gate, is proposed to select scale transform paths on the fly. In addition, the computational cost can be further reduced in an end-to-end manner by giving budget constraints to the gating function. We further relax the network level routing space to support multi-path propagations and skip-connections in each forward, bringing substantial network capacity. To demonstrate the superiority of the dynamic property, we compare with several static architectures, which can be modeled as special cases in the routing space. Extensive experiments are conducted on Cityscapes and PASCAL VOC 2012 to illustrate the effectiveness of the dynamic framework. Code is available at https://github.com/yanwei-li/DynamicRouting.
[previous, static, dataset] [semantic, feature, pascal, val, segmentation, voc, achieves, inside, table, framework, object, adopt] [input, budget] [dynamic, routing, proposed, scale, cell, output, designed, convolutional, modeled, method, upsample, resolution, adopted, activating, superiority, traditional, presented, formulated, flopsm] [image, conditional, corresponding, common, generate, factor] [network, architecture, space, resource, performance, search, neural, set, path, gate, soft, learning, computational, stem, activation, function, efficient, process, forward, better, deep, variance, distribution, denote, fixed, layer, indicates] [cost, well, compare, handcrafted, differentiable, transformation]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yanwei and Song, Lin and Chen, Yukang and Li, Zeming and Zhang, Xiangyu and Wang, Xingang and Sun, Jian},
  title = {Learning Dynamic Routing for Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Boosting Semantic Human Matting With Coarse Annotations
Jinlin Liu, Yuan Yao, Wendi Hou, Miaomiao Cui, Xuansong Xie, Changshui Zhang, Xian-Sheng Hua


Semantic human matting aims to estimate the per-pixel opacity of the foreground human regions. It is quite challenging and usually requires interactive user trimaps and plenty of high-quality annotated data. Annotating such data is labor-intensive and requires great skill beyond that of normal users, especially for the very detailed hair regions. In contrast, coarsely annotated human data are much easier to acquire and collect from public datasets. In this paper, we propose to leverage coarsely annotated data coupled with finely annotated data to boost end-to-end semantic human matting without trimaps as extra input. Specifically, we train a mask prediction network to estimate the coarse semantic mask using the hybrid data, and then propose a quality unification network to unify the quality of the previous coarse mask outputs. A matting refinement network takes the unified mask and the input image to predict the final alpha matte. The collected coarsely annotated data enrich our dataset significantly and allow generating high-quality alpha mattes for real images. Experimental results show that the proposed method performs comparably against state-of-the-art methods. Moreover, the proposed method can be used for refining coarsely annotated public datasets, as well as semantic segmentation methods, which reduces the cost of annotating high-quality human data to a great extent.
[dataset, prediction, predict, natural, mpn, recognition, three, work] [mask, annotated, semantic, refinement, unified, foreground, background, segmentation, propose, deeplab, final, map, interactive, including, pascal, predicted, annotation, coco, denotes, table, refine] [quality, input, trained, dim, public, collected] [method, proposed, high, figure, low, ieee, output, pattern, based, resolution, inaccurate] [matting, alpha, image, matte, fine, unification, trimap, qun, train, loss, trimaps, real, mrn, user, generate, shm, portrait, unknown, corresponding] [network, data, training, deep, sampling, set, requires, performance, carefully, size, better] [coarse, human, accurate, conference, computer, vision, well, estimate, hybrid, rgb, estimated, international, estimation, volume, constraint, solution, structure]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Jinlin and Yao, Yuan and Hou, Wendi and Cui, Miaomiao and Xie, Xuansong and Zhang, Changshui and Hua, Xian-Sheng},
  title = {Boosting Semantic Human Matting With Coarse Annotations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation
Hao Chen, Kunyang Sun, Zhi Tian, Chunhua Shen, Yongming Huang, Youliang Yan


Instance segmentation is one of the fundamental vision tasks. Recently, fully convolutional instance segmentation methods have drawn much attention as they are often simpler and more efficient than two-stage approaches like Mask R-CNN. To date, almost all such approaches fall behind the two-stage Mask R-CNN method in mask precision when models have similar computation complexity, leaving great room for improvement. In this work, we achieve improved mask prediction by effectively combining instance-level information with lower-level, fine-granularity semantic information. Our main contribution is a blender module which draws inspiration from both top-down and bottom-up instance segmentation approaches. The proposed BlendMask can effectively predict dense per-pixel position-sensitive instance features with very few channels, and learn attention maps for each instance with merely one convolution layer, thus being fast in inference. BlendMask can be easily incorporated into state-of-the-art one-stage detection frameworks and outperforms Mask R-CNN under the same training schedule while being faster. A light-weight version of BlendMask achieves 36.0 mAP at 27 FPS evaluated on a single 1080Ti. Because of its simplicity and efficacy, we hope that our BlendMask could serve as a simple yet strong baseline for a wide range of instance-wise prediction tasks.
[bilinear, time, attention, prediction, predict, speed] [mask, bottom, instance, module, blendmask, yolact, segmentation, table, object, semantic, detection, fcis, backbone, map, box, feature, fpn, fully, kaiming, ablation, roi, score, bounding, region, final, coco, tensormask, ross, piotr, represents, fcos] [model] [resolution, figure, convolutional, convolution, fast, method, interpolation, version, output, channel, comparison] [blender, representation, image, generate] [top, number, performance, computation, inference, training, set, sampling, higher, learning, increasing, base, width, simple, network, learned, accuracy, larger, better] [dense, nearest, detailed, compare, single]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Hao and Sun, Kunyang and Tian, Zhi and Shen, Chunhua and Huang, Yongming and Yan, Youliang},
  title = {BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
UC-Net: Uncertainty Inspired RGB-D Saliency Detection via Conditional Variational Autoencoders
Jing Zhang, Deng-Ping Fan, Yuchao Dai, Saeed Anwar, Fatemeh Sadat Saleh, Tong Zhang, Nick Barnes


In this paper, we propose the first framework (UCNet) to employ uncertainty for RGB-D saliency detection by learning from the data labeling process. Existing RGB-D saliency detection methods treat the saliency detection task as a point estimation problem, and produce a single saliency map following a deterministic learning pipeline. Inspired by the saliency data labeling process, we propose probabilistic RGB-D saliency detection network via conditional variational autoencoders to model human annotation uncertainty and generate multiple saliency maps for each input image by sampling in the latent space. With the proposed saliency consensus process, we are able to generate an accurate saliency map based on these multiple predictions. Quantitative and qualitative evaluations on six challenging benchmark datasets against 18 competing algorithms demonstrate the effectiveness of our approach in learning the distribution of saliency maps, leading to a new state-of-the-art in RGB-D saliency detection.
[multiple, prediction, pair, inspired, visual, mechanism, three] [saliency, detection, salient, object, map, feature, consensus, labeling, saliencynet, depthcorrectionnet, propose, effectiveness, module, priornet, segmentation, latentnet, refined, framework, voting, ssb, benchmark, semantic, stage] [model, input, datasets, testing] [ieee, based, channel, proposed, method, convolutional, prior, gaussian, figure, fusion, existing, output] [image, produce, latent, cvae, loss, generate, diverse, variational, conditional, variable, vae] [network, performance, learning, deep, training, data, distribution, deterministic, probabilistic, stochastic, size, majority, set, standard, better, posterior] [depth, rgb, human, uncertainty, single, define, provided, representing, rgbd, point, ground, truth, estimation]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Jing and Fan, Deng-Ping and Dai, Yuchao and Anwar, Saeed and Saleh, Fatemeh Sadat and Zhang, Tong and Barnes, Nick},
  title = {UC-Net: Uncertainty Inspired RGB-D Saliency Detection via Conditional Variational Autoencoders},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Geometric Functional Maps: Robust Feature Learning for Shape Correspondence
Nicolas Donati, Abhishek Sharma, Maks Ovsjanikov


We present a novel learning-based approach for computing correspondences between non-rigid 3D shapes. Unlike previous methods that either require extensive training data or operate on handcrafted input descriptors and thus generalize poorly across diverse datasets, our approach is both accurate and robust to changes in shape structure. Key to our method is a feature-extraction network that learns directly from raw shape geometry, combined with a novel regularized map extraction layer and loss, based on the functional map representation. We demonstrate through extensive experiments in challenging shape matching scenarios that our method can learn from less training data than existing supervised approaches and generalizes significantly better than current descriptor-based learning methods. Our source code is available at: https://github.com/LIX-shape-analysis/GeomFmaps.
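A hedged NumPy sketch of the functional-map extraction step such pipelines rely on: project learned per-vertex descriptors onto the Laplace-Beltrami eigenbases of the two shapes and solve a regularized least-squares problem for the map. The plain Tikhonov term is an illustrative stand-in for the paper's structured regularization.

    import numpy as np

    def fit_functional_map(feat_src, feat_tgt, evecs_src, evecs_tgt, lam=1e-3):
        # feat_*: (n_vertices, d) learned descriptors; evecs_*: (n_vertices, k)
        # truncated eigenbases. Returns the k x k functional map C with C A ~= B.
        A = evecs_src.T @ feat_src          # spectral coefficients on the source
        B = evecs_tgt.T @ feat_tgt          # spectral coefficients on the target
        k = A.shape[0]
        C = B @ A.T @ np.linalg.inv(A @ A.T + lam * np.eye(k))
        return C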
[dataset, work] [map, feature, fully, key, challenging, main] [input, robust, testing, model, generalization] [method, spectral, based, figure, existing, convolutional, ieee, raw] [loss, learn, supervised, train, unsupervised, representation] [training, learning, number, test, set, deep, experiment, data, network, computing, small, neural, good, optimal, learned, power, achieve, efficient] [functional, shape, computer, point, correspondence, ground, truth, computed, michael, fmnet, volume, approach, directly, pipeline, emanuele, surfmnet, acm, novel, faust, conference, accurate, transformation, vision, leonidas, second, geodesic, international, geometry, estimation, compute, human, descriptor, distance, error]
@InProceedings{Donati_2020_CVPR,
  author = {Donati, Nicolas and Sharma, Abhishek and Ovsjanikov, Maks},
  title = {Deep Geometric Functional Maps: Robust Feature Learning for Shape Correspondence},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Polarization Cues for Transparent Object Segmentation
Agastya Kalra, Vage Taamazyan, Supreeth Krishna Rao, Kartik Venkataraman, Ramesh Raskar, Achuta Kadambi


Segmentation of transparent objects is a hard, open problem in computer vision. Transparent objects lack texture of their own, adopting instead the texture of scene background. This paper reframes the problem of transparent object segmentation into the realm of light polarization, i.e., the rotation of light waves. We use a polarization camera to capture multi-modal imagery and couple this with a unique deep learning backbone for processing polarization input data. Our method achieves instance segmentation on cluttered, transparent objects in various scene and background conditions, demonstrating an improvement over traditional image-based approaches. As an application we use this for robotic bin picking of transparent objects.
[three, dataset, environment, attention, work] [segmentation, object, mask, instance, table, detection, bin, cnn, picking, framework, backbone, semantic, polar, background, improvement, ablation, false, feature] [input, robust, model] [intensity, ieee, figure, pattern, fusion, light, method, range, reflection, analysis, imaging, cnns, based, formation, sensor] [image, texture, real] [deep, test, data, learning, set, problem, performance, training, network, applied] [transparent, polarization, conference, polarized, computer, vision, international, unique, depth, visible, novel, camera, robotic, clutter, cluttered, aolp, rgb, diffuse, estimation, iun, angle, shape, single, surface, geometric, scene, application, compare, specular, pose, reconstruction]
@InProceedings{Kalra_2020_CVPR,
  author = {Kalra, Agastya and Taamazyan, Vage and Rao, Supreeth Krishna and Venkataraman, Kartik and Raskar, Ramesh and Kadambi, Achuta},
  title = {Deep Polarization Cues for Transparent Object Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DualConvMesh-Net: Joint Geodesic and Euclidean Convolutions on 3D Meshes
Jonas Schult, Francis Engelmann, Theodora Kontogianni, Bastian Leibe


We propose DualConvMesh-Nets (DCM-Net), a family of deep hierarchical convolutional networks over 3D geometric data that combines two types of convolutions. The first type, geodesic convolutions, defines the kernel weights over mesh surfaces or graphs; that is, the convolutional kernel weights are mapped to the local surface of a given mesh. The second type, Euclidean convolutions, is independent of any underlying mesh structure: the convolutional kernel is applied on a neighborhood obtained from a local affinity representation based on the Euclidean distance between 3D points. Intuitively, geodesic convolutions can easily separate objects that are spatially close but have disconnected surfaces, while Euclidean convolutions can better represent interactions between nearby objects, as they are oblivious to object surfaces. To realize a multi-resolution architecture, we borrow well-established mesh simplification methods from the geometry processing domain and adapt them to define mesh-preserving pooling and unpooling operations. We experimentally show that combining both types of convolutions in our architecture leads to significant performance gains for 3D semantic segmentation, and we report competitive results on three scene segmentation benchmarks. Models and code will be made publicly available.
[graph, recognition, hierarchical, work] [pooling, semantic, segmentation, edge, table, feature, threshold, propose, miou, level, benchmark, ablation] [trace, model] [ieee, convolutional, pattern, figure, dual, method, kernel, convolution] [perform, representation] [learning, deep, sampling, size, neural, architecture, network, training, processing, random, number, set, clustering, test, performance, report, reducing, better, data, applied] [mesh, point, geodesic, euclidean, conference, vertex, vision, computer, neighborhood, qem, surface, single, scannet, dualconv, radius, international, error, define, shape, cloud, quadric, geometric, local, simplification, scene, well, matthias, unpooling, defined, indoor, thomas, geo, michael]
@InProceedings{Schult_2020_CVPR,
  author = {Schult, Jonas and Engelmann, Francis and Kontogianni, Theodora and Leibe, Bastian},
  title = {DualConvMesh-Net: Joint Geodesic and Euclidean Convolutions on 3D Meshes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
F-BRS: Rethinking Backpropagating Refinement for Interactive Segmentation
Konstantin Sofiiuk, Ilia Petrov, Olga Barinova, Anton Konushin


Deep neural networks have become a mainstream approach to interactive segmentation. As we show in our experiments, while for some images a trained network provides accurate segmentation results with just a few clicks, for some unknown objects it cannot achieve a satisfactory result even with a large amount of user input. The recently proposed backpropagating refinement scheme (BRS) introduces an optimization problem for interactive segmentation that results in significantly better performance on the hard cases. At the same time, BRS requires running forward and backward passes through a deep network several times, which leads to a significantly increased computational budget per click compared to other methods. We propose f-BRS (feature backpropagating refinement scheme), which solves an optimization problem with respect to auxiliary variables instead of the network inputs, and requires running forward and backward passes only for a small part of the network. Experiments on the GrabCut, Berkeley, DAVIS and SBD datasets set a new state of the art at an order of magnitude lower time per click compared to the original BRS. The code and trained models are available at https://github.com/saic-vul/fbrs_interactive_segmentation.
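A hedged PyTorch sketch of the auxiliary-variable idea: optimize a channel-wise scale and bias applied to an intermediate feature map so the predicted mask agrees with the user clicks, leaving the backbone untouched. Helper names, the L-BFGS settings, and the click encoding are illustrative assumptions, not the released implementation.

    import torch
    import torch.nn.functional as F

    def f_brs_refine(head, feats, clicks_xy, click_labels, lr=0.1, n_iters=20):
        # head: the network part after the chosen feature map; feats: (1, C, H, W).
        # clicks_xy: (N, 2) long tensor of (x, y) clicks at mask resolution;
        # click_labels: (N,) tensor with 1 (foreground) / 0 (background) per click.
        C = feats.shape[1]
        scale = torch.ones(1, C, 1, 1, device=feats.device, requires_grad=True)
        bias = torch.zeros(1, C, 1, 1, device=feats.device, requires_grad=True)
        opt = torch.optim.LBFGS([scale, bias], lr=lr, max_iter=n_iters)

        def closure():
            opt.zero_grad()
            logits = head(feats * scale + bias)        # (1, 1, H, W) mask logits
            picked = logits[0, 0, clicks_xy[:, 1], clicks_xy[:, 0]]
            loss = F.binary_cross_entropy_with_logits(picked, click_labels.float())
            loss.backward()
            return loss

        opt.step(closure)
        return head(feats * scale + bias).detach()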
[berkeley, time, dataset, evaluation, passed, order, previous] [segmentation, interactive, object, backpropagating, refinement, davis, mask, grabcut, click, sbd, table, positive, satisfactory, semantic, propose] [input, auxiliary, model, datasets, trained, original] [ieee, pattern, proposed, method, result, figure, running, comparison, scale, output] [image, user, target, loss] [network, optimization, number, problem, respect, deep, set, small, function, training, energy, neural, large, requires, denote, backward, compared, standard, learning, bias, achieve, negative, report, forward, algorithm, scheme, performance, pass, data, formulate] [computer, conference, vision, distance, approach, accurate]
@InProceedings{Sofiiuk_2020_CVPR,
  author = {Sofiiuk, Konstantin and Petrov, Ilia and Barinova, Olga and Konushin, Anton},
  title = {F-BRS: Rethinking Backpropagating Refinement for Interactive Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Approximating shapes in images with low-complexity polygons
Muxingzi Li, Florent Lafarge, Renaud Marlet


We present an algorithm for extracting and vectorizing objects in images with polygons. Departing from a polygonal partition that oversegments an image into convex cells, the algorithm refines the geometry of the partition while labeling its cells by a semantic class. The result is a set of polygons, each capturing an object in the image. The quality of a configuration is measured by an energy that accounts for both the fidelity to input data and the complexity of the output polygons. To efficiently explore the configuration space, we perform splitting and merging operations in tandem on the cells of the polygonal partition. The exploration mechanism is controlled by a priority queue that sorts the operations most likely to decrease the energy. We show the potential of our algorithm on different types of scenes, from organic shapes to man-made objects through floor maps, and demonstrate its efficiency compared to existing vectorization methods.
[mechanism, exploration, priority, time, illustrated, visual, polyline, composed] [semantic, object, merging, map, edge, extracting, grouping, detection, saliency, segmentation, inside, table] [splitting, input, model, quality] [partition, output, figure, compression, based, low, pixel, high, adjacent, method, extraction] [image, fidelity, consists, mapping, produce] [algorithm, energy, probability, data, set, complexity, number, accuracy, queue, good, typically, average, compared, gradient, note, configuration, strategy, label, operation, compact] [polygonal, polygon, initial, facet, vectorization, voronoi, mesh, term, geometry, well, capture, kinetic, simulated, floor, allow, geometric, accurate, solution, delaunay, convex, organic, partitioning]
@InProceedings{Li_2020_CVPR,
  author = {Li, Muxingzi and Lafarge, Florent and Marlet, Renaud},
  title = {Approximating shapes in images with low-complexity polygons},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Visually Explaining Variational Autoencoders
Wenqian Liu, Runze Li, Meng Zheng, Srikrishna Karanam, Ziyan Wu, Bir Bhanu, Richard J. Radke, Octavia Camps


Recent advances in Convolutional Neural Network (CNN) model interpretability have led to impressive progress in visualizing and understanding model predictions. In particular, gradient-based visual attention methods have driven much recent effort in using visual attention maps as a means for visual explanations. A key problem, however, is that these methods are designed for classification and categorization tasks, and their extension to explaining generative models, e.g., variational autoencoders (VAEs), is not trivial. In this work, we take a step towards bridging this crucial gap, proposing the first technique to visually explain VAEs by means of gradient-based attention. We present methods to generate visual attention from the learned latent space, and also demonstrate that such attention explanations serve more than just explaining VAE predictions. We show how these attention maps can be used to localize anomalies in images, demonstrating state-of-the-art performance on the MVTec-AD dataset. We also show how they can be infused into model training, helping bootstrap the VAE into learning improved latent space disentanglement, demonstrated on the Dsprites dataset.
[attention, visual, work, dataset, localize, mechanism, understanding, step, trainable] [localization, feature, detection, score, map, cnn] [model, input, explain, trained, explaining, testing, difference, adversarial, reparameterization, conduct] [proposed, figure, method, convolutional, visually, existing, based, gaussian] [latent, vae, disentanglement, generate, generative, variational, factorvae, generated, generation, intuition, digit, qualitative, image, train, dsprites, generating, unsupervised, representation, autoencoder, lad, vaes, visualize, loss] [anomaly, space, learning, deep, performance, data, distribution, training, standard, note, sample, learned, objective, improved, test, baseline, neural, network, classification, dimension, best, anomalous] [normal, compute, reconstruction, well, approach, demonstrate, inferred]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Wenqian and Li, Runze and Zheng, Meng and Karanam, Srikrishna and Wu, Ziyan and Bhanu, Bir and Radke, Richard J. and Camps, Octavia},
  title = {Towards Visually Explaining Variational Autoencoders},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Global Explanations of Convolutional Neural Networks With Concept Attribution
Weibin Wu, Yuxin Su, Xixian Chen, Shenglin Zhao, Irwin King, Michael R. Lyu, Yu-Wing Tai


With the growing prevalence of convolutional neural networks (CNNs), there is an urgent demand to explain their behaviors. Global explanations contribute to understanding model predictions on a whole category of samples, and thus have attracted increasing interest recently. However, existing methods overwhelmingly conduct separate input attribution or rely on local approximations of models, making them fail to offer faithful global explanations of CNNs. To overcome such drawbacks, we propose a novel two-stage framework, Attacking for Interpretability (AfI), which explains model decisions in terms of the importance of user-defined concepts. AfI first conducts a feature occlusion analysis, which resembles a process of attacking models to derive the category-wide importance of different features. We then map the feature importance to concept importance through ad-hoc semantic tasks. Experimental results confirm the effectiveness of AfI and its superiority in providing more accurate estimations of concept importance than existing proposals.
[prediction, recognition, individual, work, visual] [feature, global, occlusion, propose, score, semantic, framework, cnn, occluders, effectiveness, table, alexander, focus] [concept, model, attribution, input, occluder, afi, explanation, tcav, conduct, original, googlenet, attacking, adversarial, interpretability, sscs, explaining] [cnns, convolutional, figure, existing, ieee, pattern, prior, based, quantitative] [image, corresponding, learn] [class, neural, learning, deep, accuracy, classification, random, processing, layer, student, vector, average, imagenet, number, procedure, set, process, data, performance, function, teacher, machine] [conference, international, computer, vision, local, novel, resultant, approach, accurate, view]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Weibin and Su, Yuxin and Chen, Xixian and Zhao, Shenglin and King, Irwin and Lyu, Michael R. and Tai, Yu-Wing},
  title = {Towards Global Explanations of Convolutional Neural Networks With Concept Attribution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Interpretable and Accurate Fine-grained Recognition via Region Grouping
Zixuan Huang, Yin Li


We present an interpretable deep model for fine-grained visual recognition. At the core of our method lies the integration of region-based part discovery and attribution within a deep neural network. Our model is trained using image-level object labels, and provides an interpretation of its results via the segmentation of object parts and the identification of their contributions towards classification. To facilitate the learning of object parts without direct supervision, we explore a simple prior of the occurrence of object parts. We demonstrate that this prior, when combined with our region-based part discovery and attribution, leads to an interpretable model that remains highly accurate. Our model is evaluated on major fine-grained recognition datasets, including CUB-200, CelebA and iNaturalist. Our results compare favourably to state-of-the-art methods on classification tasks, and outperform previous approaches on the localization of object parts.
[attention, recognition, visual, work, dataset, previous] [object, localization, feature, region, assignment, map, table, segmentation, including, cnn, grouping, annotated, visualization, bounding, finegrained, pointing, achieves, tasn, segment, key] [model, facial, input, trained, interpretability, landmark, scops, beta, identify, decision, game, face, attribution, mouth] [convolutional, prior, method, proposed, dff, figure] [image, interpretable, occurrence, attribute, bird, celeba, discriminative, discovery, loss, meaningful] [deep, learning, network, classification, neural, distribution, accuracy, set, regularization, dictionary, report, baseline, qij, evaluate, activation, consider, small, batch, training, linear, vector, compared, simple, find] [error, distance, compare]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Zixuan and Li, Yin},
  title = {Interpretable and Accurate Fine-grained Recognition via Region Grouping},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SAM: The Sensitivity of Attribution Methods to Hyperparameters
Naman Bansal, Chirag Agarwal, Anh Nguyen


Attribution methods can provide powerful insights into the reasons for a classifier's decision. We argue that a key desideratum of an explanation is its robustness to input hyperparameter changes that are often randomly set or empirically tuned. High sensitivity to arbitrary hyperparameter choices not only impedes reproducibility but also calls the correctness of an explanation into question and impairs end-users' trust. In this paper, we provide a thorough empirical study on the sensitivity of existing attribution methods. We found an alarming trend that many methods are highly sensitive to changes in their common hyperparameters, e.g., even changing a random seed can yield a different explanation! In contrast, explanations generated for robust classifiers that are trained to be invariant to pixel-wise perturbations are surprisingly more robust. Interestingly, such sensitivity is not reflected in the average explanation correctness scores over the entire dataset as commonly reported in the literature.
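One simple way to probe this kind of sensitivity, sketched below under my own assumptions rather than the paper's exact protocol, is to compute the same noise-based saliency map under two random seeds and measure how strongly the two maps agree.

# Sketch: how much does a noise-based saliency map change when only the random
# seed changes? The model, image, and saliency routine are stand-ins.
import numpy as np
import torch
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
image = torch.randn(1, 3, 224, 224)

def smoothgrad_saliency(model, image, seed, n_samples=8, sigma=0.1):
    torch.manual_seed(seed)
    grads = torch.zeros_like(image)
    for _ in range(n_samples):
        noisy = (image + sigma * torch.randn_like(image)).requires_grad_(True)
        score = model(noisy).max()          # top-class logit
        score.backward()
        grads += noisy.grad
    return grads.abs().sum(dim=1).squeeze(0) / n_samples

m1 = smoothgrad_saliency(model, image, seed=0).numpy().ravel()
m2 = smoothgrad_saliency(model, image, seed=1).numpy().ravel()
print("Pearson r between the two maps:", np.corrcoef(m1, m2)[0, 1])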
[regular, three] [seed, resnet, localization, object, heatmap, saliency, score, map] [attribution, input, sensitivity, robust, explanation, heatmaps, sensitive, nsg, variation, niter, noise, insertion, wojciech, explaining, deletion, std, highly, change, model, quantify, sweeping] [patch, lime, blur, ssim, noisy, ieee, pattern, figure, gaussian, range, medical] [image, generated, changing, common, invariant, interpretable] [accuracy, size, random, gradient, similarity, hyperparameters, deep, number, arxiv, preprint, average, neural, hyperparameter, learning, classification, sample, set, classifier, compared, higher, machine, small, optimization, yield, vanilla, experiment, large] [conference, international, computer, varying, radius, consistent, vision]
@InProceedings{Bansal_2020_CVPR,
  author = {Bansal, Naman and Agarwal, Chirag and Nguyen, Anh},
  title = {SAM: The Sensitivity of Attribution Methods to Hyperparameters},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
High-Frequency Component Helps Explain the Generalization of Convolutional Neural Networks
Haohan Wang, Xindi Wu, Zeyi Huang, Eric P. Xing


We investigate the relationship between the frequency spectrum of image data and the generalization behavior of convolutional neural networks (CNNs). We first notice CNNs' ability to capture the high-frequency components of images. These high-frequency components are almost imperceptible to a human. This observation leads to multiple hypotheses related to the generalization behaviors of CNNs, including a potential explanation for adversarial examples, a discussion of CNNs' trade-off between robustness and accuracy, and some evidence towards understanding training heuristics.
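The low-/high-frequency decomposition referred to above can be reproduced with a radial mask in the Fourier domain; the sketch below uses an arbitrary cut-off radius and verifies that the two components sum back to the input.

# Sketch: split an image into low- and high-frequency components with a
# centered radial mask in the Fourier domain. The radius is arbitrary.
import numpy as np

def frequency_split(img, radius=12):
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(f * (~mask))).real
    return low, high

img = np.random.rand(64, 64)
lfc, hfc = frequency_split(img)
print(np.allclose(lfc + hfc, img))   # the two components sum to the input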
[prediction, multiple, exploit, behavior] [cnn, denotes, map, including] [adversarial, generalization, robustness, model, lfc, hfc, mnatural, madversarial, batchnorm, trained, original, robust, defense, tend, improve, testing, tendency, explain, memorizing, pgd, mshuffle, haohan, discussion] [figure, convolutional, frequency, method, high, ieee, low, kernel, capturing, fourier, disparity] [image, train, learn, component, eric, shuffled, loss, gap, notice] [training, learning, test, neural, deep, data, arxiv, preprint, accuracy, epoch, performance, label, sample, vanilla, set, batch, investigate, denote, smaller, yoshua, predictive, capacity, network, size, machine] [conference, human, international, computer, well, directly, vision]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Haohan and Wu, Xindi and Huang, Zeyi and Xing, Eric P.},
  title = {High-Frequency Component Helps Explain the Generalization of Convolutional Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CNN-Generated Images Are Surprisingly Easy to Spot... for Now
Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, Alexei A. Efros


In this work we ask whether it is possible to create a “universal” detector for telling apart real images from those generated by a CNN, regardless of the architecture or dataset used. To test this, we collect a dataset consisting of fake images generated by 11 different CNN-based image generator models, chosen to span the space of commonly used architectures today (ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, DeepFakes, cascaded refinement networks, implicit maximum likelihood estimation, second-order attention super-resolution, seeing-in-the-dark). We demonstrate that, with careful pre- and post-processing and data augmentation, a standard image classifier trained on only one specific CNN generator (ProGAN) is able to generalize surprisingly well to unseen architectures, datasets, and training methods (including the just-released StyleGAN2 [21]). Our findings suggest the intriguing possibility that today’s CNN-generated images share some common systematic flaws, preventing them from achieving realistic image synthesis.
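The training recipe described above, a plain binary real-vs-fake classifier with blur and re-compression style augmentation, can be approximated roughly as follows; the dataset path, augmentation probabilities, and backbone are placeholders, and the re-compression step is only a crude stand-in for JPEG.

# Sketch: binary real-vs-CNN-generated classifier with blur/re-compression
# style augmentation. Paths, probabilities, and backbone are placeholders.
import random
import torch
from torchvision import datasets, models, transforms

def random_blur_recompress(img):
    # Randomly blur; "re-compression" is approximated by down- and upscaling,
    # purely as a stand-in for JPEG augmentation.
    if random.random() < 0.5:
        img = transforms.GaussianBlur(3, sigma=(0.1, 2.0))(img)
    if random.random() < 0.5:
        img = transforms.Resize(224)(transforms.Resize(112)(img))
    return img

train_tf = transforms.Compose([
    transforms.RandomCrop(224, pad_if_needed=True),
    random_blur_recompress,
    transforms.ToTensor(),
])
# Expects a hypothetical folder layout like data/train/{real,fake}/*.png
dataset = datasets.ImageFolder("data/train", transform=train_tf)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

model = models.resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for images, labels in loader:
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    opt.step()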
[dataset, work, crn, chance, visual] [cnn, detect, detection, detector, table] [model, trained, detecting, deepfake, forensics, generalization, jpeg, study, quality, adversarial, tested, robustness, face, manipulation] [figure, san, zhang, blur, convolutional, frequency, ieee] [image, progan, fake, real, biggan, generated, cyclegan, stylegan, stargan, synthesis, gan, generalize, train, generation, gaugan, imle, sitd, common, unconditional, alexei, translation, synthetic, specific, synthesized, diversity, gans, lsun, conditional, percentile, generator, generative] [training, augmentation, data, test, learning, deep, performance, classifier, evaluate, network, find, number, andrew, architecture, observe, simple, note] [rgb, well, single, variety, provided, supplemental]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Sheng-Yu and Wang, Oliver and Zhang, Richard and Owens, Andrew and Efros, Alexei A.},
  title = {CNN-Generated Images Are Surprisingly Easy to Spot... for Now},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FALCON: A Fourier Transform Based Approach for Fast and Secure Convolutional Neural Network Predictions
Shaohua Li, Kaiping Xue, Bin Zhu, Chenkai Ding, Xindi Gao, David Wei, Tao Wan


Deep learning as a service has been widely deployed to utilize deep neural network models to provide prediction services. However, this raises privacy concerns since clients need to send sensitive information to servers. In this paper, we focus on the scenario where clients want to classify private images with a convolutional neural network model hosted in the server, while both parties keep their data private. We present FALCON, a fast and secure approach for CNN predictions based on fast Fourier Transform. Our solution enables linear layers of a CNN model to be evaluated simply and efficiently with fully homomorphic encryption. We also introduce the first efficient and privacy-preserving protocol for softmax function, which is an indispensable component in CNNs and has not yet been evaluated in previous work due to its high complexity.
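The building block FALCON relies on is that convolution becomes element-wise multiplication in the Fourier domain, which is what makes linear layers cheap to evaluate under packed homomorphic encryption. The sketch below checks only this plaintext identity and implements no cryptography.

# Sketch: convolution via FFT equals direct 2D convolution (plaintext only;
# no homomorphic encryption here). Zero-padding avoids circular wrap-around.
import numpy as np
from scipy.signal import convolve2d

x = np.random.rand(8, 8)     # "image"
k = np.random.rand(3, 3)     # "kernel"

out_h, out_w = x.shape[0] + k.shape[0] - 1, x.shape[1] + k.shape[1] - 1
fft_result = np.real(np.fft.ifft2(
    np.fft.fft2(x, (out_h, out_w)) * np.fft.fft2(k, (out_h, out_w))))
direct_result = convolve2d(x, k, mode="full")
print(np.allclose(fft_result, direct_result))   # True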
[overhead, prediction, time, evaluation, previous] [pooling, cnn, fully, table, framework, module, propose] [input, model, security, privacy, symposium, private, protocol] [secure, relu, falcon, convolutional, homomorphic, fft, output, plaintext, gazelle, ciphertext, conv, fast, encryption, performs, result, additively, fourier, minionn, garbled, based, listing, mod, high, optimized, ciphertexts, transform] [image, real, introduce, corresponding] [layer, max, client, softmax, neural, data, learning, server, deep, function, size, computation, online, network, setup, number, vector, machine, share, linear, additive, calculate, boolean, note, random, processing, probability, evaluate, execution] [conference, computer, acm]
@InProceedings{Li_2020_CVPR,
  author = {Li, Shaohua and Xue, Kaiping and Zhu, Bin and Ding, Chenkai and Gao, Xindi and Wei, David and Wan, Tao},
  title = {FALCON: A Fourier Transform Based Approach for Fast and Secure Convolutional Neural Network Predictions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dreaming to Distill: Data-Free Knowledge Transfer via DeepInversion
Hongxu Yin, Pavlo Molchanov, Jose M. Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K. Jha, Jan Kautz


We introduce DeepInversion, a new method for synthesizing images from the image distribution used to train a deep neural network. We "invert" a trained network (teacher) to synthesize class-conditional input images starting from random noise, without using any additional information about the training dataset. Keeping the teacher fixed, our method optimizes the input while regularizing the distribution of intermediate feature maps using information stored in the batch normalization layers of the teacher. Further, we improve the diversity of synthesized images using Adaptive DeepInversion, which maximizes the Jensen-Shannon divergence between the teacher and student network logits. The resulting synthesized images from networks trained on the CIFAR-10 and ImageNet datasets demonstrate high fidelity and degree of realism, and help enable a new breed of data-free applications - ones that do not require any real images or labeled data. We demonstrate the applicability of our proposed method to three tasks of immense practical importance - (i) data-free network pruning, (ii) data-free knowledge transfer, and (iii) data-free continual learning.
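A heavily simplified rendering of the optimization described above (not the released code): starting from noise, the input batch is updated so that a frozen teacher predicts chosen classes while the per-channel statistics at each BatchNorm input match the stored running statistics. The network, labels, step count, and loss weight are placeholders.

# Sketch of a DeepInversion-style objective: optimize noise images so the
# frozen teacher predicts target classes while intermediate feature statistics
# match the stored BatchNorm running mean/variance. Simplified placeholder.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

teacher = resnet18(weights=None).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

bn_losses = []
def bn_stat_hook(module, inputs, _output):
    x = inputs[0]
    mean = x.mean(dim=(0, 2, 3))
    var = x.var(dim=(0, 2, 3), unbiased=False)
    bn_losses.append(F.mse_loss(mean, module.running_mean) +
                     F.mse_loss(var, module.running_var))

for m in teacher.modules():
    if isinstance(m, torch.nn.BatchNorm2d):
        m.register_forward_hook(bn_stat_hook)

images = torch.randn(4, 3, 224, 224, requires_grad=True)
target = torch.tensor([1, 2, 3, 4])       # arbitrary class labels
opt = torch.optim.Adam([images], lr=0.05)

for step in range(100):
    opt.zero_grad()
    bn_losses.clear()
    logits = teacher(images)
    loss = F.cross_entropy(logits, target) + 0.01 * sum(bn_losses)
    loss.backward()
    opt.step()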
[dataset, work, natural, oracle, multiple, three] [table, feature, cnn] [trained, model, original, noise, improve, input, datasets, adversarial] [method, adaptive, prior, high, output, convolutional, based, figure, intermediate, proposed] [image, loss, pretrained, generated, transfer, synthesized, inversion, introduce, real, synthesize, fidelity, generative, cub, target, generate, generator, synthesizing, train, diversity] [deepinversion, knowledge, network, student, training, teacher, neural, imagenet, learning, pruning, deep, distribution, distillation, continual, data, accuracy, deepdream, class, regularization, large, set, classification, initialized, pxk, divergence, batch, classifier, achieve, efficient, arxiv, preprint, inverted, inference, layer] [additional, term, demonstrate, require]
@InProceedings{Yin_2020_CVPR,
  author = {Yin, Hongxu and Molchanov, Pavlo and Alvarez, Jose M. and Li, Zhizhong and Mallya, Arun and Hoiem, Derek and Jha, Niraj K. and Kautz, Jan},
  title = {Dreaming to Distill: Data-Free Knowledge Transfer via DeepInversion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Domain Adaptation via Structurally Regularized Deep Clustering
Hui Tang, Ke Chen, Kui Jia


Unsupervised domain adaptation (UDA) aims to make predictions for unlabeled data on a target domain, given labeled data on a source domain whose distribution shifts from the target one. Mainstream UDA methods learn aligned features between the two domains, such that a classifier trained on the source features can be readily applied to the target ones. However, such a transferring strategy has a potential risk of damaging the intrinsic discrimination of target data. To alleviate this risk, we are motivated by the assumption of structural domain similarity, and propose to directly uncover the intrinsic target discrimination via discriminative clustering of target data. We constrain the clustering solutions using structural source regularization that hinges on our assumed structural domain similarity. Technically, we use a flexible framework of deep network based discriminative clustering that minimizes the KL divergence between the predictive label distribution of the network and an introduced auxiliary one; replacing the auxiliary distribution with that formed by ground-truth labels of source data implements the structural source regularization via a simple strategy of joint network training. We term our proposed method Structurally Regularized Deep Clustering (SRDC), where we also enhance target discrimination with clustering of intermediate network features, and enhance structural regularization with soft selection of less divergent source examples. Careful ablation studies show the efficacy of our proposed SRDC. Notably, with no explicit domain alignment, SRDC outperforms all existing methods on three UDA benchmarks.
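Schematically, the clustering objective described above can be written as a KL term between the network's predictive distribution and an auxiliary one on target data, with ground-truth labels playing the role of the auxiliary distribution on source data. The sharpening rule below is a common DEC-style choice used here only for illustration; it is not necessarily the paper's exact construction.

# Schematic loss: KL(auxiliary || predicted) on target data, plus one-hot
# source labels as the auxiliary distribution on source data.
import torch
import torch.nn.functional as F

def sharpen(p, eps=1e-8):
    # DEC-style auxiliary target: square and renormalize cluster probabilities.
    q = (p ** 2) / (p.sum(dim=0, keepdim=True) + eps)
    return q / q.sum(dim=1, keepdim=True)

def srdc_style_loss(target_logits, source_logits, source_labels):
    p_t = F.softmax(target_logits, dim=1)
    aux_t = sharpen(p_t).detach()                    # auxiliary distribution
    loss_target = F.kl_div(p_t.log(), aux_t, reduction="batchmean")
    # Structural source regularization: auxiliary distribution = true labels.
    loss_source = F.cross_entropy(source_logits, source_labels)
    return loss_target + loss_source

t_logits = torch.randn(16, 10, requires_grad=True)
s_logits = torch.randn(16, 10, requires_grad=True)
s_labels = torch.randint(0, 10, (16,))
print(srdc_style_loss(t_logits, s_logits, s_labels))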
[recognition, work, explicit] [feature, table, propose, ablation, benchmark] [auxiliary, adversarial, model, trained, efficacy] [based, pattern, proposed, june, figure, enhance, ieee, method, assumption, existing, introduced] [source, target, domain, structural, srdc, discrimination, discriminative, unsupervised, transfer, uda, adaptation, cluster, alignment, mingsheng, uncover, jianmin, structurally, yjs, jun, damaging, shared] [clustering, learning, data, deep, network, regularization, training, distribution, machine, labeled, label, soft, strategy, selection, space, objective, neural, classifier, class, sample, regularized, unlabeled, processing, note, similarity, min, log, divergence, predictive, simple, classification] [conference, computer, vision, international, intrinsic, joint, term, volume, directly]
@InProceedings{Tang_2020_CVPR,
  author = {Tang, Hui and Chen, Ke and Jia, Kui},
  title = {Unsupervised Domain Adaptation via Structurally Regularized Deep Clustering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
HyperSTAR: Task-Aware Hyperparameters for Deep Networks
Gaurav Mittal, Chang Liu, Nikolaos Karianakis, Victor Fragoso, Mei Chen, Yun Fu


While deep neural networks excel in solving visual recognition tasks, they require significant effort to find hyperparameters that make them work optimally. Hyperparameter Optimization (HPO) approaches have automated the process of finding good hyperparameters, but they do not adapt to a given task (they are task-agnostic), making them computationally inefficient. To reduce HPO time, we present HyperSTAR (System for Task Aware Hyperparameter Recommendation), a task-aware method to warm-start HPO for deep neural networks. HyperSTAR ranks and recommends hyperparameters by predicting their performance conditioned on a joint dataset-hyperparameter space. It learns a dataset (task) representation along with the performance predictor directly from raw images in an end-to-end fashion. The recommendations, when integrated with an existing HPO method, make it task-aware and significantly reduce the time to achieve optimal performance. We conduct extensive experiments on 10 publicly available large-scale image classification datasets over two different network architectures, validating that HyperSTAR evaluates 50% fewer configurations to achieve the best performance compared to existing methods. We further demonstrate that HyperSTAR makes Hyperband (HB) task-aware, achieving the optimal accuracy in just 25% of the budget required by both vanilla HB and Bayesian Optimized HB (BOHB).
[dataset, time, order, outperforms, visual] [global, achieves] [recommendation, datasets, budget, offline, model] [based, ieee, pattern, method, phase, existing, raw, figure, achieved] [representation, image, unseen, learns, learn, list] [performance, hyperparameter, hyperstar, task, learning, hpo, predictor, training, hyperparameters, accuracy, configuration, hyperband, neural, machine, search, optimal, number, best, optimization, bayesian, batch, deep, classification, network, similarity, gwt, average, random, space, compared, efficient, algorithm, ranking, accelerate, large, online, bohb, test, rank, vanilla, learned, feurer, processing, frank, function, hws, baseline, observe, achieve] [conference, vision, computer, international, joint, computed]
@InProceedings{Mittal_2020_CVPR,
  author = {Mittal, Gaurav and Liu, Chang and Karianakis, Nikolaos and Fragoso, Victor and Chen, Mei and Fu, Yun},
  title = {HyperSTAR: Task-Aware Hyperparameters for Deep Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu, Yi Yang


In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clue extraction from contextual information. It enforces the joint video-text representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperforms the state of the art, demonstrating its superiority in video-text representation learning.
[video, action, actbert, visual, transformer, linguistic, text, clip, language, question, attention, embedding, dataset, tangled, modeling, three, bert, frame, token, downstream, step, instructional, embeddings, temporal, extract, retrieval, tvje, sequence, answering, captioning, outperforms, natural, incorporate] [object, global, feature, region, add, table, contextual, detection] [model, regional, input, original] [block, output, convolutional] [masked, representation, image, corresponding, introduce, supervised, learn] [learning, arxiv, preprint, evaluate, task, better, data, classification, network, training, neural, label, performance] [local, joint, position, human, detailed, leverage]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Linchao and Yang, Yi},
  title = {ActBERT: Learning Global-Local Video-Text Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
State-Relabeling Adversarial Active Learning
Beichen Zhang, Liang Li, Shijie Yang, Shuhui Wang, Zheng-Jun Zha, Qingming Huang


Active learning aims to design label-efficient algorithms by sampling the most representative samples to be labeled by an oracle. In this paper, we propose a state relabeling adversarial active learning model (SRAAL) that leverages both the annotation and the labeled/unlabeled state information for deriving the most informative unlabeled samples. The SRAAL consists of a representation generator and a state discriminator. The generator uses the complementary annotation information with traditional reconstruction information to generate the unified representation of samples, which embeds the semantics into the whole data representation. Then, we design an online uncertainty indicator in the discriminator, which endows unlabeled samples with different importance. As a result, we can select the most informative samples based on the discriminator's predicted state. We also design an algorithm to initialize the labeled pool, which makes subsequent sampling more efficient. The experiments conducted on various datasets show that our model outperforms the previous state-of-the-art active learning methods and that our initial sampling algorithm achieves better performance.
[state, outperforms, dataset, previous, prediction] [annotation, unified, segmentation, semantic, score, propose, ablation, module, key, subsequent] [model, adversarial, relabeling, study, trained] [method, based, ieee, initially, figure, proposed] [image, representation, target, generator, supervised, discriminator, latent, loss, consists, unsupervised, introduce, generative, learns] [learning, labeled, active, unlabeled, data, pool, sampling, performance, sraal, informative, algorithm, sample, select, better, deep, online, selection, evaluate, experiment, arxiv, preprint, indicator, network, selected, classification, initialization, vaal, accuracy, label, set, task, neural, learner, distribution, vector, random, design, number, stl, objective, function, training] [uncertainty, computer, conference, initial, international, vision, approach, distance]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Beichen and Li, Liang and Yang, Shijie and Wang, Shuhui and Zha, Zheng-Jun and Huang, Qingming},
  title = {State-Relabeling Adversarial Active Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Erasing Integrated Learning: A Simple Yet Effective Approach for Weakly Supervised Object Localization
Jinjie Mai, Meng Yang, Wenfeng Luo


Weakly supervised object localization (WSOL) aims to localize objects with only weak supervision like image-level labels. However, a long-standing problem for available techniques based on the classification network is that they often result in highlighting the most discriminative parts rather than the entire extent of the object. Conversely, trying to explore the integral extent of the object can degrade the performance of image classification. To remedy this, we propose a simple yet powerful approach by introducing a novel adversarial erasing technique, erasing integrated learning (EIL). By integrating discriminative region mining and adversarial erasing in a single forward-backward propagation in a vanilla CNN, the proposed EIL explores the high-response class-specific area and the less discriminative region simultaneously, and thus can maintain high performance in classification while jointly discovering the full extent of the object. Furthermore, we apply multiple EIL (MEIL) modules at different levels of the network in a sequential manner, which for the first time integrates semantic features of multiple levels and multiple scales through adversarial erasing learning. In particular, the proposed EIL and advanced MEIL both achieve new state-of-the-art performance on the CUB-200-2011 and ILSVRC 2016 benchmarks, making significant improvements in localization while maintaining high performance in image classification.
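The erasing step itself is simple to sketch: compute a class activation map, zero out the most discriminative region, and classify the erased features with the same head in the same pass. The threshold, tensor shapes, and CAM construction below are placeholders rather than the paper's exact configuration.

# Sketch of adversarial erasing for weakly supervised localization: zero out
# the highest-activation region and classify the erased features with the
# same head, so less discriminative parts must also respond.
import torch
import torch.nn.functional as F

feat = torch.randn(2, 256, 14, 14)        # backbone feature map (toy)
classifier = torch.nn.Linear(256, 200)    # shared classification head
labels = torch.tensor([3, 17])

def classify(feature_map):
    pooled = F.adaptive_avg_pool2d(feature_map, 1).flatten(1)
    return classifier(pooled)

# Class activation map: weight channels by the ground-truth class weights.
cam = torch.einsum("nchw,nc->nhw", feat, classifier.weight[labels])
threshold = 0.6 * cam.amax(dim=(1, 2), keepdim=True)
erase_mask = (cam < threshold).float().unsqueeze(1)   # keep only low-CAM regions

loss_full = F.cross_entropy(classify(feat), labels)
loss_erased = F.cross_entropy(classify(feat * erase_mask), labels)
loss = loss_full + loss_erased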
[attention, multiple, stream, visual, explore] [erasing, object, eil, localization, weakly, erased, map, feature, unerased, meil, region, cam, table, extent, mask, ilsvrc, semantic, segmentation, visualization, cnn, advanced, branch, location, backbone, threshold, loc, propose] [adversarial, erase] [ieee, pattern, proposed, cnns, convolutional, high, integrated, figure, guidance, based, result] [discriminative, supervised, image, loss, shared, learn, generate, perform, produce] [network, classification, learning, performance, training, data, layer, average, accuracy, activation, calculate, dropout, set, compared, deep, simple, entire, mining, vanilla] [computer, conference, vision, international, single, approach, full, october]
@InProceedings{Mai_2020_CVPR,
  author = {Mai, Jinjie and Yang, Meng and Luo, Wenfeng},
  title = {Erasing Integrated Learning: A Simple Yet Effective Approach for Weakly Supervised Object Localization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Shared Multi-Attention Framework for Multi-Label Zero-Shot Learning
Dat Huynh, Ehsan Elhamifar


In this work, we develop a shared multi-attention model for multi-label zero-shot learning. We argue that designing attention mechanism for recognizing multiple seen and unseen labels in an image is a non-trivial task as there is no training signal to localize unseen labels and an image only contains a few present labels that need attentions out of thousands of possible labels. Therefore, instead of generating attentions for unseen labels which have unknown behaviors and could focus on irrelevant regions due to the lack of any training sample, we let the unseen labels select among a set of shared attentions which are trained to be label-agnostic and to focus on only relevant/foreground regions through our novel loss. Finally, we learn a compatibility function to distinguish labels based on the selected attention. We further propose a novel loss function that consists of three components guiding the attention to focus on diverse and relevant image regions while utilizing all attention features. By extensive experiments, we show that our method improves the state of the art by 2.9% and 1.4% F1 score on the NUS-WIDE and the large scale Open Images datasets, respectively.
[attention, relevant, recognition, prediction, visual, work, mechanism, multiple, localize, lrank] [score, feature, map, module, semantic, region, focus, object, localization, propose, denotes, framework, improves, bounding, effectiveness, table, improvement, global] [model] [ieee, pattern, method, figure, proposed, convolutional, based] [image, unseen, loss, shared, learn, notice, gzs, generalized, generalize, diverse, generating, lrel] [learning, label, training, number, neural, open, large, processing, function, set, larger, find, ranking, test, memory, select, problem, performance, vector, machine, learned, classifier, computing, data] [conference, computer, vision, single, well, international, novel, define, approach, allows]
@InProceedings{Huynh_2020_CVPR,
  author = {Huynh, Dat and Elhamifar, Ehsan},
  title = {A Shared Multi-Attention Framework for Multi-Label Zero-Shot Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Supervised Learning of Interpretable Keypoints From Unlabelled Videos
Tomas Jakab, Ankush Gupta, Hakan Bilen, Andrea Vedaldi


We propose a new method for recognizing the pose of objects from a single image that, for learning, uses only unlabelled videos and a weak empirical prior on the object poses. Video frames differ primarily in the pose of the objects they contain, so our method distils the pose information by analyzing the differences between frames. The distillation uses a new dual representation of the geometry of objects as a set of 2D keypoints, and as a pictorial representation, i.e. a skeleton image. This has three benefits: (1) it provides a tight 'geometric bottleneck' which disentangles pose from appearance, (2) it can leverage powerful image-to-image translation networks to map between photometry and geometry, and (3) it allows incorporating empirical pose priors into the learning process. The pose priors are obtained from unpaired data, such as from a different dataset or modality such as mocap, such that no annotated image is ever used in learning the pose recognition network. On standard benchmarks for pose recognition for humans and faces, our method achieves state-of-the-art performance among methods that do not require any labelled images for training. Project page: http://www.robots.ox.ac.uk/ vgg/research/unsupervised_pose/
[skeleton, dataset, video, decoder, recognition, predict] [object, detection, supervision, table, predicted, annotated] [landmark, model, face, adversarial, facial, input, datasets] [method, prior, figure, ieee, convolutional, dual] [image, unpaired, appearance, supervised, learn, representation, conditional, unsupervised, translation, generator, cyclegan, generation, loss, cat, train, discriminator, encoder, target, extracted, paired, domain] [learning, network, bottleneck, training, test, set, performance, neural, unlabelled, empirical, report, deep, sample, consider] [pose, human, conference, estimation, computer, keypoint, keypoints, pictorial, simplified, international, vision, geometry, error, second, allows, directly, full, andrea, tight, leverage, reconstruct]
@InProceedings{Jakab_2020_CVPR,
  author = {Jakab, Tomas and Gupta, Ankush and Bilen, Hakan and Vedaldi, Andrea},
  title = {Self-Supervised Learning of Interpretable Keypoints From Unlabelled Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Few-Shot Open-Set Recognition Using Meta-Learning
Bo Liu, Hao Kang, Haoxiang Li, Gang Hua, Nuno Vasconcelos


The problem of open-set recognition is considered. While previous approaches only consider this problem in the context of large-scale classifier training, we seek a unified solution for this and the low-shot classification setting. It is argued that the classic softmax classifier is a poor solution for open-set recognition, since it tends to overfit on the training classes. Randomization is then proposed as a solution to this problem. This suggests the use of meta-learning techniques, commonly used for few-shot classification, for the solution of open-set recognition. A new oPen sEt mEta LEaRning (PEELER) algorithm is then introduced. This combines the random selection of a set of novel classes per episode, a loss that maximizes the posterior entropy for examples of those classes, and a new metric learning formulation based on the Mahalanobis distance. Experimental results show that PEELER achieves state of the art open set recognition performance for both few-shot and large-scale recognition. On CIFAR and miniImageNet, it achieves substantial gains in seen/unseen class detection AUROC for a given seen-class classification accuracy.
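A rough sketch of the two loss ingredients named above, under my own simplifications: a (diagonal) Mahalanobis-style distance to class prototypes provides the closed-set logits, and the posterior entropy of examples from the episode's open classes is maximized.

# Sketch: closed-set cross-entropy plus entropy maximization on "open" episode
# examples, with diagonal Mahalanobis-style distances to class prototypes.
import torch
import torch.nn.functional as F

def mahalanobis_logits(queries, prototypes, inv_var):
    # Negative squared Mahalanobis distance (diagonal covariance) as logits.
    diff = queries[:, None, :] - prototypes[None, :, :]
    return -(diff ** 2 * inv_var[None, :, :]).sum(dim=2)

emb_dim, n_way = 64, 5
prototypes = torch.randn(n_way, emb_dim, requires_grad=True)
inv_var = torch.ones(n_way, emb_dim, requires_grad=True)

closed = torch.randn(20, emb_dim)      # embeddings of seen-class queries
closed_y = torch.randint(0, n_way, (20,))
open_q = torch.randn(10, emb_dim)      # embeddings from classes unseen in the episode

ce = F.cross_entropy(mahalanobis_logits(closed, prototypes, inv_var), closed_y)
p_open = F.softmax(mahalanobis_logits(open_q, prototypes, inv_var), dim=1)
entropy = -(p_open * p_open.clamp_min(1e-8).log()).sum(dim=1).mean()
loss = ce - entropy                    # maximize posterior entropy on open examples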
[recognition, embedding, work, state, embeddings] [feature, detection, table, object, main] [model, trained, query, tsi, counterfactual, adversarial] [proposed, ieee, figure, pattern, gaussian, based, method, traditional] [unseen, loss, image, produce] [set, training, learning, class, number, basic, classification, test, gaussiane, problem, classifier, open, metric, support, performance, softmax, meta, oploss, network, openmax, deep, neural, ssi, peeler, large, episode, setting, space, better, sampling, procedure, optimal, openset, posterior, entropy, fewshot, randomly, prototypical, learned, algorithm, auroc, distribution, popular, data] [computer, conference, vision, distance, solution, novel, lecture, well]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Bo and Kang, Hao and Li, Haoxiang and Hua, Gang and Vasconcelos, Nuno},
  title = {Few-Shot Open-Set Recognition Using Meta-Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Few-Shot Learning via Embedding Adaptation With Set-to-Set Functions
Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, Fei Sha


Learning with limited data is a key challenge for visual recognition. Many few-shot learning methods address this challenge by learning an instance embedding function from seen classes and applying the function to instances from unseen classes with limited labels. This style of transfer learning is task-agnostic: the embedding function is not learned to be optimally discriminative with respect to the unseen classes, where discerning among them is the target task. In this paper, we propose a novel approach to adapt the instance embeddings to the target classification task with a set-to-set function, yielding embeddings that are task-specific and discriminative. We empirically investigated various instantiations of such set-to-set functions and observed that the Transformer is the most effective, as it naturally satisfies key properties of our desired model. We denote this model as FEAT (few-shot embedding adaptation w/ Transformer) and validate it on both the standard few-shot classification benchmark and four extended few-shot learning settings with essential use cases, i.e., cross-domain, transductive, generalized few-shot learning, and low-shot learning. It achieved consistent improvements over baseline models as well as previous methods, and established new state-of-the-art results on two benchmarks.
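In its simplest instantiation, the set-to-set adaptation can be a single self-attention layer applied jointly to the class prototypes of a task, after which queries are classified by their nearest adapted prototype. The sketch below uses torch.nn.MultiheadAttention for this; the dimensions and data are toy placeholders, and the full FEAT model adds further components.

# Sketch: adapt few-shot class prototypes with a set-to-set (self-attention)
# function, then classify queries by nearest adapted prototype.
import torch

n_way, dim = 5, 64
prototypes = torch.randn(n_way, dim)    # mean support embedding per class
queries = torch.randn(15, dim)          # query embeddings

set_to_set = torch.nn.MultiheadAttention(embed_dim=dim, num_heads=4,
                                         batch_first=True)
adapted, _ = set_to_set(prototypes[None], prototypes[None], prototypes[None])
adapted = adapted[0] + prototypes       # residual connection

# Nearest-prototype prediction on the task-adapted embeddings.
dists = torch.cdist(queries, adapted)   # (num_queries, n_way)
pred = dists.argmin(dim=1)
print(pred)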
[embedding, embeddings, visual, transformer, bilstm, previous, gcn, work, graph, prediction] [instance, key, table, apply, cnn, backbone, resnet, achieves, propose] [model, strong, study] [figure, based, interpolation, residual] [adaptation, unseen, target, transductive, discriminative, generalized, domain, train, learn] [learning, feat, set, function, test, task, classification, number, fsl, data, evaluate, deepsets, standard, support, training, dtrain, protonet, observe, baseline, class, miniimagenet, network, extrapolation, xtrain, performance, adapted, accuracy, better, best, implement, deep, permutation, sampled, ytest, convnet, learned, discerning, denote, space, adapt] [additional, approach, nearest, well, neighbor, transformation]
@InProceedings{Ye_2020_CVPR,
  author = {Ye, Han-Jia and Hu, Hexiang and Zhan, De-Chuan and Sha, Fei},
  title = {Few-Shot Learning via Embedding Adaptation With Set-to-Set Functions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Temporally Distributed Networks for Fast Video Semantic Segmentation
Ping Hu, Fabian Caba, Oliver Wang, Zhe Lin, Stan Sclaroff, Federico Perazzi


We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks. Leveraging the inherent temporal continuity in videos, we distribute these sub-networks over sequential frames. Therefore, at each time step, we only need to perform a lightweight computation to extract a sub-features group from a single sub-network. The full features used for segmentation are then recomposed by application of a novel attention propagation module that compensates for geometry deformation between frames. A grouped knowledge distillation loss is also introduced to further improve the representation power at both full and sub-feature levels. Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
[attention, video, frame, previous, temporally, temporal, time, encoding, context, outperforms, speed, current, evaluation, extract] [segmentation, semantic, feature, propagation, tdnet, table, miou, module, grouped, achieves, apm, map, apply, shallow, effectiveness, propose, aggregation, camvid, distribute] [model, improve, strong, original] [method, based, convolutional, spatial, output, motion, flow, high, figure, convolution, downsampling] [image, extracted, loss, representation, independent] [deep, knowledge, accuracy, network, distributed, computation, distillation, latency, group, performance, neural, better, efficient, large, reduction, teacher, training, layer, lower] [full, single, scene]
@InProceedings{Hu_2020_CVPR,
  author = {Hu, Ping and Caba, Fabian and Wang, Oliver and Lin, Zhe and Sclaroff, Stan and Perazzi, Federico},
  title = {Temporally Distributed Networks for Fast Video Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Benchmarking the Robustness of Semantic Segmentation Models
Christoph Kamann, Carsten Rother


When designing a semantic segmentation module for a practical application, such as autonomous driving, it is crucial to understand the robustness of the module with respect to a wide range of image corruptions. While there are recent robustness studies for full-image classification, we are the first to present an exhaustive study for semantic segmentation, based on the state-of-the-art model DeepLabv3+. To increase the realism of our study, we utilize almost 400,000 images generated from Cityscapes, PASCAL VOC 2012, and ADE20K. Based on the benchmark study, we gain several new insights. Firstly, contrary to full-image classification, model robustness increases with model performance, in most cases. Secondly, some architecture properties affect robustness significantly, such as a Dense Prediction Cell, which was designed to maximize performance on clean data only.
[prediction, link, dataset, evaluation, three] [semantic, miou, segmentation, atrous, pascal, voc, architectural, backbone, pooling, aspp, global, object, pyramid, ablation, van, module, table] [robustness, model, noise, clean, corruption, robust, corrupted, adversarial, severity, study, distortion, dpc, trained, type] [blur, convolutional, based, reference, ablated, psf, rcd, cell, gaussian, spatial, range, result, degradation, weather] [image, corresponding] [network, performance, architecture, deep, average, data, neural, learning, respective, evaluate, impact, validation, averaged, respect, increase, efficient] [dense, geometric, camera, volume]
@InProceedings{Kamann_2020_CVPR,
  author = {Kamann, Christoph and Rother, Carsten},
  title = {Benchmarking the Robustness of Semantic Segmentation Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
There and Back Again: Revisiting Backpropagation Saliency Methods
Sylvestre-Alvise Rebuffi, Ruth Fong, Xu Ji, Andrea Vedaldi


Saliency methods seek to explain the predictions of a model by producing an importance map across each input sample. A popular class of such methods is based on backpropagating a signal and analyzing the resulting gradient. Despite much research on such methods, relatively little work has been done to clarify the differences between such methods as well as the desiderata of these techniques. Thus, there is a need for rigorously understanding the relationships between different methods as well as their failure modes. In this work, we conduct a thorough analysis of backpropagation-based saliency methods and propose a single framework under which several such methods can be unified. As a result of our study, we make three additional contributions. First, we use our framework to propose NormGrad, a novel saliency method based on the spatial contribution of gradients of convolutional weights. Second, we combine saliency maps at different layers to test the ability of saliency methods to extract complementary information at different network levels (e.g. trading off spatial resolution and distinctiveness) and we explain why some methods fail at specific layers (e.g., Grad-CAM anywhere besides the last convolutional layer). Third, we introduce a class-sensitivity metric and a meta-learning inspired paradigm applicable to any saliency method for improving sensitivity to the output class being explained.
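A rough flavor of the spatial gradient-weight contribution described above (my own approximation, not the paper's exact formulation): at a chosen layer, the saliency at each location is taken as the product of the activation norm and the backpropagated gradient norm, i.e., the Frobenius norm of their outer product.

# Sketch: saliency at one layer from activation norm times gradient norm at
# each spatial position. Model, layer choice, and input are placeholders.
import torch
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
acts, grads = {}, {}

def capture(module, inputs, output):
    acts["a"] = output.detach()
    output.register_hook(lambda g: grads.update(g=g))

model.layer3.register_forward_hook(capture)

image = torch.randn(1, 3, 224, 224)
score = model(image)[0].max()           # top-class logit
score.backward()

saliency = acts["a"].norm(dim=1) * grads["g"].norm(dim=1)   # (1, H, W)
print(saliency.shape)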
[selective, contribution, visual, combining, step, work, order, understanding] [saliency, correlation, map, aggregation, location, pointing, framework, feature, guided, table, pascal, propose, heatmap, cam, object] [identity, input, model, sensitivity, norm, game, attribution, heatmaps, explanation, sensitive] [spatial, output, convolutional, figure, method, xin, spatially, phase, conv, existing, based, signal] [image, target, introduce, produce] [class, layer, gradient, network, linear, normgrad, min, deep, scaling, xout, approximation, backprop, function, guout, learning, neural, performance, bias, backpropagation, metric, sum, product, max, weighted, best, imagenet, set, classification, sgd, average, weighting, training] [virtual, single, compute]
@InProceedings{Rebuffi_2020_CVPR,
  author = {Rebuffi, Sylvestre-Alvise and Fong, Ruth and Ji, Xu and Vedaldi, Andrea},
  title = {There and Back Again: Revisiting Backpropagation Saliency Methods},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Semantic Clustering by Partition Confidence Maximisation
Jiabo Huang, Shaogang Gong, Xiatian Zhu


By simultaneously learning visual features and data grouping, deep clustering has shown impressive ability to deal with unsupervised learning for structure analysis of high-dimensional visual data. Existing deep clustering methods typically rely on local learning constraints based on inter-sample relations and/or self-estimated pseudo labels. This is susceptible to the inevitable errors distributed in the neighbourhoods and suffers from error-propagation during training. In this work, we propose to solve this problem by learning the most confident clustering solution from all the possible separations, based on the observation that assigning samples from the same semantic categories into different clusters will reduce both the intra-cluster compactness and inter-cluster diversity, i.e. lower partition confidence. Specifically, we introduce a novel deep clustering method named PartItion Confidence mAximisation (PICA). It is established on the idea of learning the most semantically plausible data separation, in which all clusters can be mapped to the ground-truth classes one-to-one, by maximising the "global" partition confidence of clustering solution. This is realised by introducing a differentiable partition uncertainty index and its stochastic approximation as well as a principled objective loss function that minimises such index, all of which together enables a direct adoption of the conventional deep networks and mini-batch based model training. Extensive experiments on six widely-adopted clustering benchmarks demonstrate our model's performance superiority over a wide range of the state-of-the-art approaches. The code is available online.
[recognition, visual, prediction] [assignment, confidence, global, table, semantic, assigned, feature, propose] [model, robustness, auxiliary, acc, decision, trained, case] [partition, ieee, method, pattern, based, proposed, analysis, range, existing] [cluster, image, target, loss, unsupervised, idea, semantically, plausible, representation, learn, ari] [clustering, deep, learning, pica, data, training, objective, performance, stochastic, neural, asv, function, machine, set, sample, maximisation, confident, network, accuracy, standard, randomly, probability, random, nmi, processing, problem, approximation, margin, class, pui, entropy] [conference, uncertainty, vision, computer, international, local, solution, novel, initialisation]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Jiabo and Gong, Shaogang and Zhu, Xiatian},
  title = {Deep Semantic Clustering by Partition Confidence Maximisation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
StructEdit: Learning Structural Shape Variations
Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy J. Mitra, Leonidas J. Guibas


Learning to encode differences in the geometry and (topological) structure of the shapes of ordinary objects is key to generating semantically plausible variations of a given shape, transferring edits from one shape to another, and for many other applications in 3D content creation. The common approach of encoding shapes as points in a high-dimensional latent feature space suggests treating shape differences as vectors in that space. Instead, we treat shape differences as primary objects in their own right and propose to encode them in their own latent space. In a setting where the shapes themselves are encoded in terms of fine-grained part hierarchies, we demonstrate that a separate encoding of shape deltas or differences provides a principled way to deal with inhomogeneities in the shape space due to different combinatorial part structures, while also allowing for compactness in the representation, as well as edit abstraction and transfer. Our approach is based on a conditional variational autoencoder for encoding and decoding shape deltas, conditioned on a source shape. We demonstrate the effectiveness and robustness of our approach in multiple shape modification and generation tasks, and provide comparison and ablation studies on the PartNet dataset, one of the largest publicly available 3D datasets.
[bar, encode, encoding, decoder, dataset, child, hierarchical, represent] [feature, box, object, bounding, table] [model, identity, input, encoded] [figure, method, ieee, pattern, based] [source, edit, edits, structural, transfer, latent, target, generative, encoder, loss, modified, conditional, learn, generated, corresponding, generation, consistency, train, vkbox, component, variational] [set, learning, space, distribution, vector, deep, network, neural] [shape, delta, point, geometric, deformation, distance, computer, acm, conference, reconstruction, leg, leonidas, ground, truth, defined, deleted, volume, hao, directly, structedit, error, structure, structurenet, vision, daniel, geometry, approach, partnet, single, local, compare, correspondence, coverage, niloy]
@InProceedings{Mo_2020_CVPR,
  author = {Mo, Kaichun and Guerrero, Paul and Yi, Li and Su, Hao and Wonka, Peter and Mitra, Niloy J. and Guibas, Leonidas J.},
  title = {StructEdit: Learning Structural Shape Variations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Harmonizing Transferability and Discriminability for Adapting Object Detectors
Chaoqi Chen, Zebiao Zheng, Xinghao Ding, Yue Huang, Qi Dou


Recent advances in adaptive object detection have achieved compelling results by virtue of adversarial feature adaptation to mitigate the distributional shifts along the detection pipeline. Whilst adversarial adaptation significantly enhances the transferability of feature representations, the feature discriminability of object detectors remains less investigated. Moreover, transferability and discriminability may come into contradiction in adversarial adaptation given the complex combinations of objects and the differentiated scene layouts between domains. In this paper, we propose a Hierarchical Transferability Calibration Network (HTCN) that hierarchically (local-region/image/instance) calibrates the transferability of feature representations for harmonizing transferability and discriminability. The proposed model consists of three components: (1) Importance Weighted Adversarial Training with input Interpolation (IWAT-I), which strengthens the global discriminability by re-weighting the interpolated image-level features; (2) a Context-aware Instance-Level Alignment (CILA) module, which enhances the local discriminability by capturing the underlying complementary effect between the instance-level feature and the global context information for the instance-level feature alignment; (3) local feature masks that calibrate the local transferability to provide semantic guidance for the following discriminative pattern alignment. Experimental results show that HTCN significantly outperforms the state-of-the-art methods on benchmark datasets.
[context, hierarchical, explicitly, dataset, outperforms] [feature, object, detection, semantic, propose, global, table, region, map, bounding, pascal, denotes, iou, instance, module] [adversarial, transferability, input, model] [proposed, based, adaptive, interpolation, figure, tensor] [domain, adaptation, source, target, alignment, unsupervised, htcn, discriminability, image, transferable, loss, contradiction, uda, transfer, calibrate, representation, mingsheng, jianmin, kate, adapting, consistency, discriminator] [training, learning, deep, network, performance, set, product, informative, distribution, vector, problem, distributional, labeled, data, note, negative, achieve, denote, dimension] [local, approach, uncertainty, scene, computer, calibration, matching, computed, defined, hypothesis]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Chaoqi and Zheng, Zebiao and Ding, Xinghao and Huang, Yue and Dou, Qi},
  title = {Harmonizing Transferability and Discriminability for Adapting Object Detectors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching
Xuhua Huang, Jiarui Xu, Yu-Wing Tai, Chi-Keung Tang


Significant progress has been made in Video Object Segmentation (VOS), the video object tracking task at its finest level. While the VOS task can be naturally decoupled into image semantic segmentation and video object tracking, significantly more research effort has been made in segmentation than in tracking. In this paper, we introduce "tracking-by-detection" into VOS, which coherently integrates segmentation into tracking, by proposing a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance. Notably, our method is entirely online and thus suitable for one-shot learning, and our end-to-end trainable model allows multiple object segmentation in one forward pass. We achieve new state-of-the-art performance on the DAVIS benchmark without complicated bells and whistles in both speed and accuracy, with a speed of 0.14 seconds per frame and a J&F measure of 75.9%, respectively.
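A rough way to picture the dynamic time-evolving template matching described above: per-pixel embeddings of the current frame are matched by cosine similarity against a small bank of template embeddings, and the bank is updated as the video progresses. The PyTorch sketch below is only an illustration under these assumptions; the feature extractor, bank size, and update rule (match_to_templates, update_bank) are placeholders, not the authors' implementation.

import torch
import torch.nn.functional as F

def match_to_templates(frame_feat, templates):
    # frame_feat: (C, H, W) per-pixel embeddings; templates: (K, C) template bank
    C, H, W = frame_feat.shape
    pixels = F.normalize(frame_feat.reshape(C, -1), dim=0)   # (C, H*W)
    temps = F.normalize(templates, dim=1)                     # (K, C)
    scores = temps @ pixels                                   # (K, H*W) cosine similarities
    return scores.max(dim=0).values.view(H, W)                # best-matching template per pixel

def update_bank(templates, new_template, max_size=5):
    # time-evolving bank: append the newest template, drop the oldest
    bank = torch.cat([templates, new_template[None]], dim=0)
    return bank[-max_size:]

# toy usage
feat = torch.randn(64, 30, 40)
bank = torch.randn(3, 64)
logits = match_to_templates(feat, bank)
bank = update_bank(bank, feat.mean(dim=(1, 2)))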
[video, temporal, frame, multiple, dataset, speed, current, moving, bank, complicated, recognition] [object, segmentation, davis, tracking, template, aggregation, semantic, feature, detection, vos, table, backbone, bounding, ross, kaiming, benchmark, leading, dttm, mask, box, premvos, fully, instance, challenge, piotr, feelvos, easy, map, head, iou] [model, input, change] [ieee, pattern, method, figure, based, dynamic, fast, optical, flow, high, convolutional, extend, proposed] [target, image, appearance] [network, performance, online, training, learning, note, setting, design, neural, baseline, deep, validation, task, simple, denote] [conference, computer, vision, matching, international, pipeline, novel, single]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Xuhua and Xu, Jiarui and Tai, Yu-Wing and Tang, Chi-Keung},
  title = {Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CascadePSP: Toward Class-Agnostic and Very High-Resolution Segmentation via Global and Local Refinement
Ho Kei Cheng, Jihoon Chung, Yu-Wing Tai, Chi-Keung Tang


State-of-the-art semantic segmentation methods were almost exclusively trained on images within a fixed resolution range. These segmentations are inaccurate for very high-resolution images since using bicubic upsampling of low-resolution segmentation does not adequately capture high-resolution details along object boundaries. In this paper, we propose a novel approach to address the high-resolution segmentation problem without using any high-resolution training data. The key insight is our CascadePSP network which refines and corrects local boundaries whenever possible. Although our network is trained with low-resolution segmentation data, our method is applicable to any resolution even for very high-resolution images larger than 4K. We present quantitative and qualitative studies on different datasets to show that CascadePSP can reveal pixel-accurate segmentation boundaries using our novel refinement module without any finetuning. Thus, our method can be regarded as class-agnostic. Finally, we demonstrate the application of our model to scene parsing in multi-class segmentation.
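The abstract's key idea, refining and correcting local boundaries in a coarse-to-fine cascade, can be sketched as repeatedly feeding the image and the current mask through a refinement network at increasing resolutions. Below is a minimal, hypothetical PyTorch sketch of that loop; TinyRefiner and the scale schedule are stand-ins and do not reflect the actual CascadePSP architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRefiner(nn.Module):
    # stand-in for the real refinement network: image (3 ch) + coarse mask (1 ch) -> refined mask
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, image, mask):
        return torch.sigmoid(self.net(torch.cat([image, mask], dim=1)))

def cascade_refine(image, coarse_mask, refiner, scales=(0.25, 0.5, 1.0)):
    # image: (B,3,H,W) at full resolution; coarse_mask: (B,1,h,w) low-resolution prediction
    mask = coarse_mask
    for s in scales:
        size = (int(image.shape[2] * s), int(image.shape[3] * s))
        img_s = F.interpolate(image, size=size, mode='bilinear', align_corners=False)
        mask = F.interpolate(mask, size=size, mode='bilinear', align_corners=False)
        mask = refiner(img_s, mask)   # correct boundaries at this scale
    return mask

refined = cascade_refine(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 64, 64), TinyRefiner())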
[step, dataset] [segmentation, refinement, semantic, global, boundary, pascal, cascade, object, stride, module, voc, table, iou, refine, pyramid, cascadepsp, refines, fully, pspnet, ablation, parsing, refined, feature, propose, deeplab, bilinearly, region, edge, detection, mba] [model, input, trained, datasets, robust] [figure, output, method, resolution, big, convolutional, highresolution, high, pixel] [image, loss, produce, perform, specific, generate] [training, network, deep, data, memory, higher, validation, learning, large, evaluate, accuracy, gpu, size, gradient, note, test, better] [local, capture, scene, structure, ground, accurate, coarse, single, truth, initial]
@InProceedings{Cheng_2020_CVPR,
  author = {Cheng, Ho Kei and Chung, Jihoon and Tai, Yu-Wing and Tang, Chi-Keung},
  title = {CascadePSP: Toward Class-Agnostic and Very High-Resolution Segmentation via Global and Local Refinement},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Correlating Edge, Pose With Parsing
Ziwei Zhang, Chi Su, Liang Zheng, Xiaodong Xie


According to existing studies, human body edge and pose are two beneficial factors for human parsing. The effectiveness of each of the high-level features (edge and pose) is confirmed through the concatenation of their features with the parsing features. Driven by these insights, this paper studies how human semantic boundaries and keypoint locations can jointly improve human parsing. Compared with the existing practice of feature concatenation, we find that uncovering the correlation among the three factors is a superior way of leveraging the pivotal contextual cues provided by edges and poses. To capture such correlations, we propose a Correlation Parsing Machine (CorrPM) employing a heterogeneous non-local block to discover the spatial affinity among feature maps from the edge, pose and parsing branches. The proposed CorrPM allows us to report new state-of-the-art accuracy on three human parsing datasets. Importantly, comparative studies confirm the advantages of feature correlation over concatenation.
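One plausible reading of the heterogeneous non-local block is standard non-local attention in which queries come from the parsing features while keys and values come from the concatenation of edge, pose and parsing features, so parsing features are re-weighted by their spatial affinity to the other cues. The PyTorch sketch below illustrates that reading; the channel sizes and 1x1 projections are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HeterogeneousNonLocal(nn.Module):
    def __init__(self, c_parse=64, c_edge=32, c_pose=32, c_inner=64):
        super().__init__()
        c_all = c_parse + c_edge + c_pose
        self.query = nn.Conv2d(c_parse, c_inner, 1)   # queries from parsing features
        self.key = nn.Conv2d(c_all, c_inner, 1)       # keys/values from all three cues
        self.value = nn.Conv2d(c_all, c_parse, 1)

    def forward(self, f_parse, f_edge, f_pose):
        B, _, H, W = f_parse.shape
        f_all = torch.cat([f_parse, f_edge, f_pose], dim=1)
        q = self.query(f_parse).flatten(2).transpose(1, 2)       # (B, HW, C')
        k = self.key(f_all).flatten(2)                            # (B, C', HW)
        v = self.value(f_all).flatten(2).transpose(1, 2)          # (B, HW, Cp)
        attn = F.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)    # spatial affinity
        out = (attn @ v).transpose(1, 2).reshape(B, -1, H, W)
        return f_parse + out   # residual connection back onto the parsing branch

block = HeterogeneousNonLocal()
y = block(torch.randn(2, 64, 32, 32), torch.randn(2, 32, 32, 32), torch.randn(2, 32, 32, 32))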
[three, heterogeneous, concatenation, relationship, lip, concatenating, attention, semantics, prediction] [parsing, edge, feature, semantic, correlation, boundary, miou, propose, segmentation, hnl, contextual, map, module, corrpm, correlating, liang, detection, framework, fed, achieves, fully, boost, mask, add, table] [model, shuicheng, clothes] [ieee, proposed, pattern, existing, fusion, method, convolution, comparison, adjacent, convolutional, block, figure] [image, loss, representation, generate, encoder] [network, performance, learning, training, baseline, size, compared, machine, neural, strategy, deep, number, base, higher] [human, pose, conference, computer, body, vision, keypoint, estimation, structure, single, hybrid, european, international, capture, left, leverage, joint]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Ziwei and Su, Chi and Zheng, Liang and Xie, Xiaodong},
  title = {Correlating Edge, Pose With Parsing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
VecRoad: Point-Based Iterative Graph Exploration for Road Graphs Extraction
Yong-Qiang Tan, Shang-Hua Gao, Xuan-Yi Li, Ming-Ming Cheng, Bo Ren


Extracting road graphs from aerial images automatically is more efficient and costs less than field acquisition. This can be done by a post-processing step that vectorizes road segmentation predicted by a CNN, but imperfect predictions will result in road graphs with low connectivity. On the other hand, iterative next move exploration could construct road graphs with better road connectivity, but often focuses on local information and does not provide precise alignment with the real road. To enhance the road connectivity while maintaining the precise alignment between the graph and the real road, we propose a point-based iterative graph exploration scheme with segmentation-cues guidance and flexible steps. In our approach, we represent the location of the next move as a 'point' that unifies the representation of multiple constraints such as the direction and step size in each moving step. Information cues such as road segmentation and road junctions are jointly detected and utilized to guide the next move and achieve better alignment of roads. We demonstrate that our proposed method has a considerable improvement over state-of-the-art road graph extraction methods in terms of F-measure and road connectivity metrics on common datasets.
[road, graph, step, exploration, trajectory, starting, moving, extract, multiple, current, centerline, length, represent, prediction, predict, construct, apls, cue, time, roadtracer] [segmentation, junction, aerial, supervision, predicted, feature, apply, adopt, precise, map, detection, backbone, propose, location, global] [move, iterative, input, hourglass, model] [method, extraction, output, guidance, proposed, figure, result, convolutional, gaussian, block, ieee, adopted, designed, advantage] [image, alignment, generate, representation, real, generated, loss] [size, network, learning, neural, training, set, better, path, performance, scheme, distribution, evaluate, deep, applied, algorithm] [connectivity, point, vertex, complex, local, angle]
@InProceedings{Tan_2020_CVPR,
  author = {Tan, Yong-Qiang and Gao, Shang-Hua and Li, Xuan-Yi and Cheng, Ming-Ming and Ren, Bo},
  title = {VecRoad: Point-Based Iterative Graph Exploration for Road Graphs Extraction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Fairness in Visual Recognition: Effective Strategies for Bias Mitigation
Zeyu Wang, Klint Qinami, Ioannis Christos Karakozis, Kyle Genova, Prem Nair, Kenji Hata, Olga Russakovsky


Computer vision models learn to perform a task by capturing relevant statistics from training data. It has been shown that models learn spurious age, gender, and race correlations when trained for seemingly unrelated tasks like activity recognition or image captioning. Various mitigation techniques have been presented to prevent models from utilizing or learning such biases. However, there has been little systematic comparison between these techniques. We design a simple but surprisingly effective visual recognition benchmark for studying bias mitigation. Using this benchmark, we provide a thorough analysis of a wide range of techniques. We highlight the shortcomings of popular adversarial training approaches for bias mitigation, propose a simple but similarly effective alternative to the inference-time Reducing Bias Amplification method of Zhao et al., and design a domain-independent training technique that outperforms all other methods. Finally, we validate our findings on the attribute classification task in the CelebA dataset, where attribute presence is known to be correlated with the gender of people in the image, and demonstrate that the proposed technique is effective at mitigating real-world gender bias.
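The abstract does not spell out the domain-independent training technique, but a common way to realize the idea is to share a backbone and keep one classifier head per protected domain, training each head only on its own domain and combining the heads' logits at inference so that no single head has to encode the domain. The hypothetical DomainIndependentHead below is a sketch of that general recipe, not necessarily the paper's exact formulation.

import torch
import torch.nn as nn

class DomainIndependentHead(nn.Module):
    # One linear classifier per (known, possibly spurious) domain, e.g. gender groups.
    def __init__(self, feat_dim, num_classes, num_domains):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feat_dim, num_classes) for _ in range(num_domains))

    def forward(self, feats, domain=None):
        logits = torch.stack([h(feats) for h in self.heads], dim=1)  # (B, D, C)
        if domain is not None:                   # training: use only the head of the true domain
            return logits[torch.arange(feats.size(0)), domain]
        return logits.sum(dim=1)                 # inference: combine heads, ignoring the domain

head = DomainIndependentHead(feat_dim=512, num_classes=10, num_domains=2)
train_logits = head(torch.randn(4, 512), domain=torch.tensor([0, 1, 0, 1]))
test_logits = head(torch.randn(4, 512))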
[dataset, shift, recognition, work, visual, evaluation, outperforms, provide, time] [benchmark, correlation, object, table, positive, achieves, feature, map, biased] [model, trained, adversarial, effective, spurious] [color, prior, output, method] [domain, gender, attribute, pte, target, mitigation, image, omain, amplification, skew, maxy, representation, celeba, discriminative, grayscale, rba, perform, train, loss, confusion, protected, men] [bias, training, accuracy, inference, learning, test, class, classifier, arg, baseline, consider, ptr, fairness, data, set, max, task, classification, presence, distribution, setting, simple, learned, number, softmax, imagenet, reducing, mitigating, oversampling, validation, alternative, deep, machine, uniform] [computer, demonstrate, approach, vision]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Zeyu and Qinami, Klint and Karakozis, Ioannis Christos and Genova, Kyle and Nair, Prem and Hata, Kenji and Russakovsky, Olga},
  title = {Towards Fairness in Visual Recognition: Effective Strategies for Bias Mitigation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hierarchical Human Parsing With Typed Part-Relation Reasoning
Wenguan Wang, Hailong Zhu, Jifeng Dai, Yanwei Pang, Jianbing Shen, Ling Shao


Human parsing aims at pixel-wise human semantic understanding. As human bodies are inherently hierarchically structured, how to model human structures is the central theme of this task. Focusing on this, we seek to simultaneously exploit the representational capacity of deep graph networks and the hierarchical human structures. In particular, we provide the following two contributions. First, three kinds of part relations, i.e., decomposition, composition, and dependency, are, for the first time, completely and precisely described by three distinct relation networks. This is in stark contrast to previous parsers, which only focus on a portion of the relations and adopt a type-agnostic relation modeling strategy. More expressive relation information can be captured by explicitly constraining the parameters in the relation networks to satisfy the specific characteristics of different relations. Second, previous parsers largely ignore the need for an approximation algorithm over the loopy human hierarchy, while we instead adopt an iterative reasoning process, assimilating generic message-passing networks with their edge-typed, convolutional counterparts. With these efforts, our parser lays the foundation for more sophisticated and flexible reasoning over human relation patterns. Comprehensive experiments on five datasets demonstrate that our parser sets a new state-of-the-art on each.
[relation, node, graph, parser, attention, dependency, three, modeling, structured, hierarchical, compositional, message, cont, decompositional, cnif, previous, passing, dec, reasoning, child, attdec, attcom, context, understanding, visual] [parsing, feature, semantic, table, liang, segmentation, edge, xiaohui, xiaodan, wenguan, parent, jian, jiashi, atr, alan, object, jianbing, illustration] [model, iterative, shuicheng, clothing, fashion, datasets, input] [convolutional, spatial, method, comparison, figure, designed, based, convolution, pixel] [image, address, representation, distinct, corresponding] [learning, inference, neural, performance, network, function, better, set, test, deep, average, indicates, expressive, training, hierarchy] [human, pose, structure, body, full, approach, joint]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Wenguan and Zhu, Hailong and Dai, Jifeng and Pang, Yanwei and Shen, Jianbing and Shao, Ling},
  title = {Hierarchical Human Parsing With Typed Part-Relation Reasoning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Compositional Convolutional Neural Networks: A Deep Architecture With Innate Robustness to Partial Occlusion
Adam Kortylewski, Ju He, Qing Liu, Alan L. Yuille


Recent work has shown that deep convolutional neural networks (DCNNs) do not generalize well under partial occlusion. Inspired by the success of compositional models at classifying partially occluded objects, we propose to integrate compositional models and DCNNs into a unified deep model with innate robustness to partial occlusion. We term this architecture Compositional Convolutional Neural Network. In particular, we propose to replace the fully connected classification head of a DCNN with a differentiable compositional model. The generative nature of the compositional model enables it to localize occluders and subsequently focus on the non-occluded parts of the object. We conduct classification experiments on artificially occluded images as well as real images of partially occluded objects from the MS-COCO dataset. The results show that DCNNs do not classify occluded objects robustly, even when trained with data that is strongly augmented with partial occlusions. Our proposed model outperforms standard DCNNs by a large margin at classifying partially occluded objects, even when it has not been exposed to occluded objects during training. Additional experiments demonstrate that CompositionalNets can also localize the occluders accurately, despite being trained with class labels only. The code and data used in this work are publicly available.
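A loose sketch of the compositional head described above: backbone features are compared against von Mises-Fisher (vMF) kernels, the resulting activations are combined into per-class evidence, and positions better explained by a flat occluder likelihood are clipped, which is what lets the model ignore occluded regions. The kernel count, kappa, and the occluder_level constant below are assumptions; the paper's actual generative mixture formulation is richer than this.

import torch
import torch.nn.functional as F

def vmf_activations(features, kernels, kappa=30.0):
    # features: (B, C, H, W) backbone features; kernels: (K, C) learned vMF mean directions
    f = F.normalize(features, dim=1)
    mu = F.normalize(kernels, dim=1)
    cos = torch.einsum('bchw,kc->bkhw', f, mu)      # cosine similarity to every kernel
    return torch.exp(kappa * (cos - 1.0))            # vMF-style likelihood per kernel and position

def occlusion_aware_score(vmf, class_mixture, occluder_level=0.2):
    # vmf: (B, K, H, W); class_mixture: (num_classes, K, H, W) expected kernel activations per class
    class_lik = torch.einsum('bkhw,nkhw->bnhw', vmf, class_mixture)   # per-position class evidence
    class_lik = torch.maximum(class_lik, torch.full_like(class_lik, occluder_level))
    return torch.log(class_lik).mean(dim=(2, 3))     # positions explained by the occluder are clipped

scores = occlusion_aware_score(
    vmf_activations(torch.randn(2, 64, 7, 7), torch.randn(32, 64)),
    class_mixture=torch.rand(5, 32, 7, 7))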
[compositional, localize, dataset, artificial, integrate, outperforms, hierarchical] [occlusion, occluded, feature, compositionalnets, occluders, object, partially, dcnns, vmf, classifying, propose, alan, fully, map, table, tdapnet, localization, unified, head, score, innate, compositionalnet] [model, occluder, trained, robustness, robust, artificially] [proposed, likelihood, convolutional, figure, pattern, dcnn, ieee, adam] [image, generative, real, cluster, loss, discriminative, perform, qualitative] [neural, classification, deep, mixture, training, data, performance, learning, compared, network, classify, learned, class, augmentation, arxiv, preprint, number, note, outperform, augmented, architecture, standard] [partial, computer, vision, conference, well, position, differentiable, demonstrate]
@InProceedings{Kortylewski_2020_CVPR,
  author = {Kortylewski, Adam and He, Ju and Liu, Qing and Yuille, Alan L.},
  title = {Compositional Convolutional Neural Networks: A Deep Architecture With Innate Robustness to Partial Occlusion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Spatial Pyramid Based Graph Reasoning for Semantic Segmentation
Xia Li, Yibo Yang, Qijie Zhao, Tiancheng Shen, Zhouchen Lin, Hong Liu


The convolution operation suffers from a limited receptive field, while global modeling is fundamental to dense prediction tasks, such as semantic segmentation. In this paper, we apply graph convolution to the semantic segmentation task and propose an improved Laplacian. The graph reasoning is directly performed in the original feature space organized as a spatial pyramid. Different from existing methods, our Laplacian is data-dependent and we introduce an attention diagonal matrix to learn a better distance metric. It gets rid of projecting and re-projecting processes, which makes our proposed method a light-weight module that can be easily plugged into current computer vision architectures. More importantly, performing graph reasoning directly in the feature space retains spatial relationships and enables the spatial pyramid to explore multiple long-range contextual patterns from different scales. Experiments on Cityscapes, COCO Stuff, PASCAL Context and PASCAL VOC demonstrate the effectiveness of our proposed methods on semantic segmentation. We achieve comparable performance with advantages in computational and memory overhead.
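The improved Laplacian can be thought of as one step of graph propagation over spatial positions, where the adjacency is built from projected features weighted by a learned diagonal metric (the attention diagonal matrix) rather than taken from a fixed graph. The sketch below is a rough PyTorch reading of that idea, with softmax row-normalization standing in for the exact normalized Laplacian; proj_dim and the residual form are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphReasoningLayer(nn.Module):
    # Data-dependent graph reasoning directly over spatial positions (a rough reading of the abstract).
    def __init__(self, channels, proj_dim=64):
        super().__init__()
        self.proj = nn.Conv2d(channels, proj_dim, 1)
        self.diag = nn.Parameter(torch.ones(proj_dim))   # learned diagonal metric for the affinity
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        p = self.proj(x).flatten(2).transpose(1, 2)              # (B, N, d), N = H*W nodes
        adj = torch.einsum('bnd,d,bmd->bnm', p, self.diag, p)    # data-dependent adjacency
        adj = F.softmax(adj, dim=-1)                              # row normalization
        nodes = x.flatten(2).transpose(1, 2)                      # (B, N, C)
        propagated = adj @ nodes                                  # one step of graph propagation
        propagated = propagated.transpose(1, 2).reshape(B, C, H, W)
        return x + self.out(propagated)                           # residual update

layer = GraphReasoningLayer(channels=128)
y = layer(torch.randn(1, 128, 16, 16))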
[graph, reasoning, spygr, context, attention, multiple, current, overhead, order, dataset, outperforms, prediction, build] [feature, semantic, pyramid, global, contextual, segmentation, pascal, table, coco, cnn, stuff, object, stride, module, propagation, final, miou, propose] [input, original, model] [spatial, convolution, convolutional, method, proposed, figure, output, danet, performed, based, spectral, introduced, transform] [image, perform, introduce, representation, train, learn] [matrix, similarity, computational, better, deep, network, learning, diagonal, performance, set, space, neural, computation, training, calculate, memory, data, arxiv, preprint, inner, product, layer, test, improved, performing, retains] [laplacian, directly, distance, scene, capture, dense, left]
@InProceedings{Li_2020_CVPR,
  author = {Li, Xia and Yang, Yibo and Zhao, Qijie and Shen, Tiancheng and Lin, Zhouchen and Liu, Hong},
  title = {Spatial Pyramid Based Graph Reasoning for Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Video Object Segmentation From Unlabeled Videos
Xiankai Lu, Wenguan Wang, Jianbing Shen, Yu-Wing Tai, David J. Crandall, Steven C. H. Hoi


We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos, unlike most existing methods which rely heavily on extensive annotated data. We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures intrinsic properties of VOS at multiple granularities. Our approach can help advance understanding of visual patterns in VOS and significantly reduce annotation burden. With a carefully-designed architecture and strong representation learning ability, our learned model can be applied to diverse VOS settings, including object-level zero-shot VOS, instance-level zero-shot VOS, and one-shot VOS. Experiments demonstrate promising performance in these settings, as well as the potential of MuG in leveraging unlabeled data to further improve the segmentation accuracy.
[video, frame, granularity, evaluation, visual, current, three, understanding, embedding, time] [object, vos, segmentation, weakly, semantic, global, feature, supervision, table, saliency, wenguan, jianbing, mask, tracking, foreground, cam, fully, map, recall, region, background, tracked, annotation, leading, zvos, instance, val, boundary, federico, ling] [model, trained, input, query, primary] [method, based, convolutional, patch, pattern, flow, fast] [unsupervised, supervised, loss, representation, discriminative] [learning, training, performance, network, deep, unlabeled, knowledge, similarity, neural, set, learned, inference, test, randomly, decay, accuracy, applied] [initial, local, correspondence, matching, compute]
@InProceedings{Lu_2020_CVPR,
  author = {Lu, Xiankai and Wang, Wenguan and Shen, Jianbing and Tai, Yu-Wing and Crandall, David J. and Hoi, Steven C. H.},
  title = {Learning Video Object Segmentation From Unlabeled Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Part-Aware Context Network for Human Parsing
Xiaomei Zhang, Yingying Chen, Bingke Zhu, Jinqiao Wang, Ming Tang


Recent works have made significant progress in human parsing by exploiting rich contexts. However, human parsing still faces the challenge of generating adaptive contextual features for the various sizes and shapes of human parts. In this work, we propose a Part-aware Context Network (PCNet), a novel and effective algorithm to deal with the challenge. PCNet mainly consists of three modules, including a part class module, a relational aggregation module, and a relational dispersion module. The part class module extracts the high-level representations of every human part from a categorical perspective. We design a relational aggregation module to capture the representative global context by mining associated semantics of human parts, which adaptively augments the context for human parts. We propose a relational dispersion module to generate the discriminative and effective local context and suppress the disturbing ones by dispersing the affinity of human parts. The relational dispersion module ensures that features in the same class will be close to each other and away from those of different classes. By fusing the outputs of the relational aggregation module, the relational dispersion module and the backbone network, our PCNet generates adaptive contextual features for various sizes of human parts, improving the parsing accuracy. We achieve a new state-of-the-art segmentation performance on three challenging human parsing datasets, i.e., PASCAL-Person-Part, LIP, and CIHP.
[relational, context, graph, three, lip, attention, relation, dataset, associated, outperforms] [module, aggregation, parsing, pcnet, global, semantic, backbone, denotes, table, contextual, liang, affinity, including, achieves, cihp, xiaodan, ablation, xiaohui, propose, segmentation, object, improves, alan] [dispersion, original, effective, input, datasets] [convolution, figure, convolutional, method, proposed, dynamic, ieee, adaptive, kernel, adaptively, pixel, output, pattern, comparison] [generate, discriminative, image, generates, generated, loss] [network, class, performance, number, layer, learning, set, validation, baseline, applied, experiment, best, neural, deep, filter, compared] [human, local, conference, computer, body, pose, international, joint, vision]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Xiaomei and Chen, Yingying and Zhu, Bingke and Wang, Jinqiao and Tang, Ming},
  title = {Part-Aware Context Network for Human Parsing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SCOUT: Self-Aware Discriminant Counterfactual Explanations
Pei Wang, Nuno Vasconcelos


The problem of counterfactual visual explanations is considered. A new family of discriminant explanations is introduced. These produce heatmaps that attribute high scores to image regions informative of a classifier prediction but not of a counter class. They connect attributive explanations, which are based on a single heat map, to counterfactual explanations, which account for both predicted class and counter class. The latter are shown to be computable by combination of two discriminant explanations, with reversed class pairs. It is argued that self-awareness, namely the ability to produce classification confidence scores, is important for the computation of discriminant explanations, which seek to identify regions where it is easy to discriminate between prediction and counter class. This suggests the computation of discriminant explanations by the combination of three attribution maps. The resulting counterfactual explanations are optimization free and thus much faster than previous methods. To address the difficulty of their evaluation, a proxy task and set of quantitative metrics are also proposed. Experiments under this protocol show that the proposed counterfactual explanations outperform the state of the art while running much faster on popular networks. In a human-learning machine teaching experiment, they are also shown to improve mean student accuracy from chance level to 95%.
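The construction suggested by the abstract, a counterfactual explanation obtained by combining attribution maps for the predicted class and the counter class with a confidence (self-awareness) map, can be written in a few lines. The combination rule below (product of the predicted-class attribution, the complement of the counter-class attribution, and the confidence map) is a plausible stand-in, not the paper's exact operator.

import numpy as np

def discriminant_explanation(attr_pred, attr_counter, confidence):
    # attr_pred: attribution heatmap for the predicted class
    # attr_counter: attribution heatmap for the counterfactual (counter) class
    # confidence: per-location self-awareness / confidence scores
    # All maps are assumed to be in [0, 1] and of equal shape.
    return attr_pred * (1.0 - attr_counter) * confidence

h = 14
heatmap = discriminant_explanation(np.random.rand(h, h), np.random.rand(h, h), np.random.rand(h, h))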
[prediction, evaluation, visual, expert, work, three, previous] [confidence, map, score, predicted, region, object, segmentation, advanced] [counterfactual, discriminant, attribution, attributive, explanation, counter, difficult, scout, piou, tend, protocol, feedback, query, exhaustive, identify] [figure, proposed, based, comparison, quantitative, ieee, combination, high] [image, user, produce, generated, specific, bird, attribute, asked] [class, machine, learning, deep, classifier, teaching, network, large, neural, arxiv, preprint, performance, set, classification, training, respect, size, layer, informative, computation, contrastive, search, function, popular] [conference, computed, computer, ground, international, truth, system, compute, vision, require, human, single, second]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Pei and Vasconcelos, Nuno},
  title = {SCOUT: Self-Aware Discriminant Counterfactual Explanations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Weakly-Supervised Semantic Segmentation via Sub-Category Exploration
Yu-Ting Chang, Qiaosong Wang, Wei-Chih Hung, Robinson Piramuthu, Yi-Hsuan Tsai, Ming-Hsuan Yang


Existing weakly-supervised semantic segmentation methods using image-level annotations typically rely on initial responses to locate object regions. However, such response maps generated by the classification network usually focus on discriminative object parts, due to the fact that the network does not need the entire object for optimizing the objective function. To force the network to pay attention to other parts of an object, we propose a simple yet effective approach that introduces a self-supervised task by exploiting the sub-category information. Specifically, we perform clustering on image features to generate pseudo sub-category labels within each annotated parent class, and construct a sub-category objective to assign the network a more challenging task. By iteratively clustering image features, the training process does not limit itself to the most discriminative object parts, hence improving the quality of the response maps. We conduct extensive analysis to validate the proposed method and show that our approach performs favorably against the state-of-the-art approaches.
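The pseudo sub-category labels can be produced with off-the-shelf clustering: within each annotated parent class, image features are clustered into k groups, and each (class, cluster) pair becomes an extra label for a second classification head. A minimal sketch with scikit-learn KMeans, assuming a fixed k per class and offline re-clustering between training rounds:

import numpy as np
from sklearn.cluster import KMeans

def assign_subcategories(features, parent_labels, k=3):
    # features: (N, D) image features; parent_labels: (N,) annotated parent classes
    # Returns pseudo sub-category labels, k per parent class, used as extra supervision.
    sub_labels = np.zeros(len(features), dtype=np.int64)
    for c in np.unique(parent_labels):
        idx = np.where(parent_labels == c)[0]
        clusters = KMeans(n_clusters=k, n_init=10).fit_predict(features[idx])
        sub_labels[idx] = c * k + clusters        # give each (class, cluster) pair a unique id
    return sub_labels

feats = np.random.randn(200, 128)
parents = np.random.randint(0, 5, size=200)
pseudo = assign_subcategories(feats, parents)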
[attention, visual, dataset, context] [semantic, segmentation, response, object, feature, parent, cam, map, pascal, voc, table, final, refinement, region, weakly, focus, propose, challenging, category, framework, improves, adopt, miou] [model, improving, iterative, original, improve] [method, proposed, figure, based, convolutional, analysis, applying, performs, existing, validate, enhance] [image, pseudo, unsupervised, train, discriminative, supervised, generate, learn, loss, extensive, generating, person, generated] [classification, clustering, training, network, learning, class, better, task, activation, objective, classifier, validation, label, number, optimize, algorithm, simple, deep, performance, note, set, learned] [initial, approach, ground, iteratively, truth, demonstrate]
@InProceedings{Chang_2020_CVPR,
  author = {Chang, Yu-Ting and Wang, Qiaosong and Hung, Wei-Chih and Piramuthu, Robinson and Tsai, Yi-Hsuan and Yang, Ming-Hsuan},
  title = {Weakly-Supervised Semantic Segmentation via Sub-Category Exploration},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Continual Learning With Extended Kronecker-Factored Approximate Curvature
Janghyeon Lee, Hyeong Gwon Hong, Donggyu Joo, Junmo Kim


We propose a quadratic penalty method for continual learning of neural networks that contain batch normalization (BN) layers. The Hessian of a loss function represents the curvature of the quadratic penalty function, and a Kronecker-factored approximate curvature (K-FAC) is used widely to practically compute the Hessian of a neural network. However, the approximation is not valid if there is dependence between examples, typically caused by BN layers in deep network architectures. We extend the K-FAC method so that the inter-example relations are taken into account and the Hessian of deep neural networks can be properly approximated under practical assumptions. We also propose a method of weight merging and reparameterization to properly handle statistical parameters of BN, which plays a critical role for continual learning with BN, and a method that selects hyperparameters without source task data. Our method shows better performance than baselines in the permuted MNIST task with BN layers and in sequential learning from the ImageNet classification task to fine-grained classification tasks with ResNet-50, without any explicit or implicit use of source task data for hyperparameter selection.
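For a single linear layer under the standard K-FAC factorization, the quadratic penalty on the deviation from the previous task's weights reduces to a trace expression, since (A kron G) acts on vec(dW) as G dW A. The sketch below shows only this basic per-layer penalty; the paper's actual contribution, extending K-FAC to inter-example dependence caused by BN layers and merging/reparameterizing BN statistics, is not captured here.

import torch

def kfac_quadratic_penalty(delta_w, a_factor, g_factor, lam=1.0):
    # delta_w: (out, in) = current W minus the W saved after the previous task
    # a_factor: (in, in) Kronecker factor from layer inputs; g_factor: (out, out) from output grads
    # penalty = lam/2 * vec(dW)^T (A kron G) vec(dW) = lam/2 * trace(dW^T G dW A)
    return 0.5 * lam * torch.trace(delta_w.t() @ g_factor @ delta_w @ a_factor)

w_old = torch.randn(10, 20)
w_new = w_old + 0.01 * torch.randn(10, 20)
A = torch.eye(20)   # stand-in Kronecker factors
G = torch.eye(10)
penalty = kfac_quadratic_penalty(w_new - w_old, A, G)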
[sequential, multiple, current, natural] [propose] [original, model, mnist] [method, block, proposed, figure, valid, simply, convolutional, preceding] [loss, source, target, dependence, free] [task, learning, hessian, network, penalty, continual, weight, neural, matrix, merged, approximation, hyperparameters, brn, layer, equation, gradient, hyperparameter, function, statistical, performance, catastrophic, deep, data, training, fixed, imagenet, validation, size, parameter, set, accuracy, approximate, approximated, classification, optimization, forgetting, quadratic, batch, better, diagonal, fisher, linear, large, rate, considered, learned, kronecker, appendix, simple, small, population, decay, number] [conference, international, curvature, damping, initial, computer, single, vision, local]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Janghyeon and Hong, Hyeong Gwon and Joo, Donggyu and Kim, Junmo},
  title = {Continual Learning With Extended Kronecker-Factored Approximate Curvature},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Phase Consistent Ecological Domain Adaptation
Yanchao Yang, Dong Lao, Ganesh Sundaramoorthi, Stefano Soatto


We introduce two criteria to regularize the optimization involved in learning a classifier in a domain where no annotated data are available, leveraging annotated data in a different domain, a problem known as unsupervised domain adaptation. We focus on the task of semantic segmentation, where annotated synthetic data are aplenty, but annotating real data is laborious. The first criterion, inspired by visual psychophysics, is that the map between the two image domains be phase-preserving. This restricts the set of possible learned maps, while enabling enough flexibility to transfer semantic information. The second criterion aims to leverage ecological statistics, or regularities in the scene which are manifest in any image of it, regardless of the characteristics of the illuminant or the imaging sensor. It is implemented using a deep neural network that scores the likelihood of each possible segmentation given a single un-annotated image. Incorporating these two priors in a standard domain adaptation framework improves performance across the board in the most common unsupervised domain adaptation benchmarks for semantic segmentation.
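The phase-preserving criterion can be approximated by penalizing the difference between the unit-magnitude Fourier spectra of a source image and its translation, so only the phase (not the amplitude) is constrained. The loss below is a simple stand-in for that criterion using torch.fft; the paper's exact formulation of the phase constraint and of the ecological-statistics prior may differ.

import torch

def phase_consistency_loss(source, translated):
    # source, translated: (B, C, H, W) images; penalize changes of the Fourier phase
    fs = torch.fft.fft2(source)
    ft = torch.fft.fft2(translated)
    # compare unit-magnitude spectra so only the phase matters (avoids angle wrap-around)
    ps = fs / (fs.abs() + 1e-8)
    pt = ft / (ft.abs() + 1e-8)
    return (ps - pt).abs().pow(2).mean()

loss = phase_consistency_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))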
[dataset, visual, recognition] [semantic, segmentation, map, achieves, ablation, feature, miou, improves, apply, annotated] [compatibility, model, trained, adversarial, improve, original] [phase, ieee, pattern, method, prior, fourier, output, amplitude, proposed, result, based, figure, convolutional] [domain, image, target, adaptation, source, consistency, cpn, uda, unsupervised, alignment, surrogate, train, loss, cycada, transformed, conditional, xsi, bdl, ecological, synthetic, transfer, sky, adaptsegnet, introduce, component] [network, training, learning, performance, data, space, note, learned, class, better, deep, neural, machine, implemented, best, classifier, set, standard, prevent] [conference, computer, vision, scene, transformation, international, consistent, single, european, stefano]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Yanchao and Lao, Dong and Sundaramoorthi, Ganesh and Soatto, Stefano},
  title = {Phase Consistent Ecological Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
AD-Cluster: Augmented Discriminative Clustering for Domain Adaptive Person Re-Identification
Yunpeng Zhai, Shijian Lu, Qixiang Ye, Xuebo Shan, Jie Chen, Rongrong Ji, Yonghong Tian


Domain adaptive person re-identification (re-ID) is a challenging task, especially when person identities in target domains are unknown. Existing methods attempt to address this challenge by transferring image styles or aligning feature distributions across domains, whereas the rich unlabeled samples in target domains are not sufficiently exploited. This paper presents a novel augmented discriminative clustering (AD-Cluster) technique that estimates and augments person clusters in target domains and enforces the discrimination ability of re-ID models with the augmented clusters. AD-Cluster is trained by iterative density-based clustering, adaptive sample augmentation, and discriminative feature learning. It learns an image generator and a feature encoder which aim to maximize the intra-cluster diversity in the sample space and minimize the intra-cluster distance in the feature space in an adversarial min-max manner. Finally, AD-Cluster increases the diversity of sample clusters and improves the discrimination capability of re-ID models greatly. Extensive experiments over Market-1501 and DukeMTMC-reID show that AD-Cluster outperforms the state-of-the-art with large margins.
[outperforms, illustrated] [feature, map, cam, table, liang, china, improves, predicted, denotes] [model, adversarial, trained, iterative, original, identity] [ieee, adaptive, proposed, figure] [person, domain, target, discriminative, unsupervised, image, source, encoder, generator, cluster, transfer, diversity, adaptation, discrimination, supervised, uda, loss, learn, ability, generative, ltri, ldiv, enforces, learns, gan, ddiv, train] [sample, clustering, learning, training, augmentation, performance, accuracy, augmented, baseline, deep, network, unlabeled, space, large, optimization, maximize, minimize, maximizing, triplet, number, better, procedure] [distance, camera, defined, direct, well]
@InProceedings{Zhai_2020_CVPR,
  author = {Zhai, Yunpeng and Lu, Shijian and Ye, Qixiang and Shan, Xuebo and Chen, Jie and Ji, Rongrong and Tian, Yonghong},
  title = {AD-Cluster: Augmented Discriminative Clustering for Domain Adaptive Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
3D-MPA: Multi-Proposal Aggregation for 3D Semantic Instance Segmentation
Francis Engelmann, Martin Bokeloh, Alireza Fathi, Bastian Leibe, Matthias Niessner


We present 3D-MPA, a method for instance segmentation on 3D point clouds. Given an input point cloud, we propose an object-centric approach where each point votes for its object center. We sample object proposals from the predicted object centers. Then, we learn proposal features from grouped point features that voted for the same object center. A graph convolutional network introduces inter-proposal relations, providing higher-level feature learning in addition to the lower-level point features. Each proposal comprises a semantic label, a set of associated points over which we define a foreground-background mask, an objectness score and aggregation features. Previous works usually perform non-maximum-suppression (NMS) over proposals to obtain the final object detections or semantic instances. However, NMS can discard potentially correct predictions. Instead, our approach keeps all proposals and groups them together based on the learned aggregation features. We show that grouping proposals improves over NMS and outperforms previous state-of-the-art methods on the tasks of 3D object detection and semantic instance segmentation on the ScanNetV2 benchmark and the S3DIS dataset.
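The object-centric voting step amounts to adding a predicted per-point offset to each point's coordinates and sampling proposal seeds from the resulting votes; features of points that voted for the same center then form the proposal features. A tiny sketch of just the voting and sampling part, with random sampling standing in for whatever proposal sampling the method actually uses:

import torch

def vote_for_centers(points, offsets, num_proposals=32):
    # points: (N, 3) coordinates; offsets: (N, 3) per-point predicted vectors toward the object center
    votes = points + offsets                                  # every point votes for a center
    idx = torch.randperm(votes.shape[0])[:num_proposals]      # sample proposal seeds from the votes
    return votes[idx], votes

points = torch.rand(2048, 3)
offsets = torch.randn(2048, 3) * 0.05
proposal_centers, all_votes = vote_for_centers(points, offsets)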
[recognition, graph, associated, predict, previous, prediction, multiple, outperforms, work, embedding, gcn] [object, proposal, instance, semantic, segmentation, aggregation, feature, detection, center, final, bounding, predicted, grouping, objectness, table, iou, mask, map, voted, backbone, refined, ablation, grouped, score] [input, model] [ieee, pattern, convolutional, method, based] [loss, generation] [network, learning, set, number, deep, neural, learned, validation, metric, experiment, average, report, precision, class] [point, conference, computer, vision, truth, ground, cloud, approach, international, volumetric, define, sparse, predicts, scene, dense, geometric, indoor, multi, reconstruction]
@InProceedings{Engelmann_2020_CVPR,
  author = {Engelmann, Francis and Bokeloh, Martin and Fathi, Alireza and Leibe, Bastian and Niessner, Matthias},
  title = {3D-MPA: Multi-Proposal Aggregation for 3D Semantic Instance Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Active Learning for Biased Datasets via Fisher Kernel Self-Supervision
Denis Gudovskiy, Alec Hodgkinson, Takuya Yamaguchi, Sotaro Tsukizawa


Active learning (AL) aims to minimize labeling efforts for data-demanding deep neural networks (DNNs) by selecting the most representative data points for annotation. However, currently used methods are ill-equipped to deal with biased data. The main motivation of this paper is to consider a realistic setting for pool-based semi-supervised AL, where the unlabeled collection of train data is biased. We theoretically derive an optimal acquisition function for AL in this setting. It can be formulated as distribution shift minimization between the unlabeled train data and a weakly-labeled validation dataset. To implement such an acquisition function, we propose a low-complexity method for feature density matching using a self-supervised Fisher kernel (FK) as well as several novel pseudo-label estimators. Our FK-based method outperforms state-of-the-art methods on MNIST, SVHN, and ImageNet classification while requiring only 1/10th of the processing. The conducted experiments show at least a 40% drop in labeling efforts for the biased class-imbalanced data compared to existing methods.
[dataset, shift] [biased, feature, dkl, labeling, propose, framework, ablation] [dnn, model, fraction, mnist] [method, figure, acquisition, proposed, kernel, prior, result] [train, image, unsupervised, loss, representation] [data, learning, distribution, training, fisher, function, compared, class, validation, accuracy, test, random, unlabeled, size, task, complexity, deep, number, pool, vaal, pfk, active, metric, imbalance, neural, imagenet, pretraining, matrix, better, density, subset, selected, machine, clustering, labeled, sampling, ropt, arg, mutual, svhn, varr, minimize, setting, optimal, classification, large, find, forward, processing, similarity, respect, objective, log, setup] [conference, uncertainty, descriptor, international, estimation, full, collection, matching]
@InProceedings{Gudovskiy_2020_CVPR,
  author = {Gudovskiy, Denis and Hodgkinson, Alec and Yamaguchi, Takuya and Tsukizawa, Sotaro},
  title = {Deep Active Learning for Biased Datasets via Fisher Kernel Self-Supervision},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adaptive Graph Convolutional Network With Attention Graph Clustering for Co-Saliency Detection
Kaihua Zhang, Tengpeng Li, Shiwen Shen, Bo Liu, Jin Chen, Qingshan Liu


Co-saliency detection aims to discover the common and salient foregrounds from a group of relevant images. For this task, we present a novel adaptive graph convolutional network with attention graph clustering (GCAGC). Three major contributions have been made, and are experimentally shown to have substantial practical merits. First, we propose a graph convolutional network design to extract information cues to characterize the intra- and inter-image correspondence. Second, we develop an attention graph clustering algorithm to discriminate the common objects from all the salient foreground objects in an unsupervised fashion. Third, we present a unified framework with an encoder-decoder structure to jointly train and optimize the graph convolutional network, the attention graph clustering module, and the co-saliency detection decoder in an end-to-end manner. We evaluate our proposed GCAGC method on three co-saliency detection benchmark datasets (iCoseg, Cosal2015 and COCO-SEG). Our GCAGC method obtains significant improvements over the state of the art on most of them.
[graph, attention, agcn, three, visual, node, extract, decoder, adjacency, gcns] [detection, gcagc, salient, feature, denotes, agcm, foreground, cnn, framework, background, including, cosaliency, semantic, object, saliency, edge, lgc, rcgs, unified, challenging, module, score, mcatt, esmg, icoseg] [model, input, datasets, developed] [convolutional, adaptive, proposed, figure, spatial, method, develop, output, learnable, filtering, noisy] [image, common, learn, discover] [network, clustering, group, learning, neural, deep, matrix, set, arxiv, preprint, task, design, compared, weight, better, function, training, performance] [structure, capture, jointly, well, directly, correspondence]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Kaihua and Li, Tengpeng and Shen, Shiwen and Liu, Bo and Chen, Jin and Liu, Qingshan},
  title = {Adaptive Graph Convolutional Network With Attention Graph Clustering for Co-Saliency Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A2dele: Adaptive and Attentive Depth Distiller for Efficient RGB-D Salient Object Detection
Yongri Piao, Zhengkun Rong, Miao Zhang, Weisong Ren, Huchuan Lu


Existing state-of-the-art RGB-D salient object detection methods explore RGB-D data relying on a two-stream architecture, in which an independent subnetwork is required to process depth data. This inevitably incurs extra computational costs and memory consumption, and using depth data during testing may hinder the practical applications of RGB-D saliency detection. To tackle these two dilemmas, we propose a depth distiller (A2dele) to explore the way of using network prediction and attention as two bridges to transfer the depth knowledge from the depth stream to the RGB stream. First, by adaptively minimizing the differences between predictions generated from the depth stream and RGB stream, we realize the desired control of pixel-wise depth knowledge transferred to the RGB stream. Second, to transfer the localization knowledge to RGB features, we encourage consistencies between the dilated prediction of the depth stream and the attention map from the RGB stream. As a result, we achieve a lightweight architecture without the use of depth data at test time by embedding our A2dele. Our extensive experimental evaluation on five benchmarks demonstrates that our RGB stream achieves state-of-the-art performance, which reduces the model size by 76% and runs 12 times faster, compared with the best performing method. Furthermore, our A2dele can be applied to existing RGB-D networks to significantly improve their efficiency while maintaining performance (nearly doubling the FPS for DMRA and tripling it for CPFP).
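The two bridges can be read as two losses on the RGB stream: a prediction-level distillation term weighted so that only reliable depth-stream predictions transfer, and an attention term that pulls the RGB attention map toward a dilated version of the depth-stream prediction. The sketch below loosely follows that reading; the confidence weighting, the dilation via max pooling, and alpha are placeholder choices, not the paper's exact losses.

import torch
import torch.nn.functional as F

def a2dele_style_losses(pred_rgb, pred_depth, attn_rgb, alpha=0.5):
    # pred_rgb, pred_depth: (B,1,H,W) saliency logits from the RGB and depth streams
    # attn_rgb: (B,1,H,W) attention map (logits) from the RGB stream
    p_d = torch.sigmoid(pred_depth).detach()          # the depth stream acts as the teacher
    p_r = torch.sigmoid(pred_rgb)

    # bridge 1: prediction-level distillation, weighted by a crude teacher reliability score
    weight = (p_d - 0.5).abs() * 2.0                  # confident teacher pixels count more
    pred_loss = (weight * F.binary_cross_entropy(p_r, p_d, reduction='none')).mean()

    # bridge 2: attention transfer toward a dilated teacher foreground
    dilated = F.max_pool2d(p_d, kernel_size=5, stride=1, padding=2)
    attn_loss = F.binary_cross_entropy(torch.sigmoid(attn_rgb), (dilated > 0.5).float())

    return pred_loss + alpha * attn_loss

loss = a2dele_style_losses(torch.randn(2, 1, 32, 32), torch.randn(2, 1, 32, 32), torch.randn(2, 1, 32, 32))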
[stream, attention, prediction, privileged, decoder, visual, embedding, evaluation, observed] [salient, object, saliency, attentive, detection, map, distiller, nlpr, dmra, table, propose, cpfp, localization, background, achieves, njud, extra, challenging, semantic, represents, cpd, huchuan, rfb, module] [model, original, improve, datasets, input, effectively] [adaptive, figure, proposed, conv, method, existing, comparison, fusion, convolutional, chen] [transfer, transferring, loss, transferred, desired, free, bridge, generated, discriminative] [distillation, knowledge, network, scheme, size, achieve, learning, training, data, compared, set, process, test, performance, efficient, subnetwork, design, large, architecture, comparable, reliable] [depth, rgb, stereo, accurate]
@InProceedings{Piao_2020_CVPR,
  author = {Piao, Yongri and Rong, Zhengkun and Zhang, Miao and Ren, Weisong and Lu, Huchuan},
  title = {A2dele: Adaptive and Attentive Depth Distiller for Efficient RGB-D Salient Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Fair Clustering for Visual Learning
Peizhao Li, Han Zhao, Hongfu Liu


Fair clustering aims to hide sensitive attributes during data partition by balancing the distribution of protected subgroups in each cluster. Existing work attempts to address this problem by reducing it to a classical balanced clustering with a constraint on the proportion of protected subgroups of the input space. However, the input space may limit the clustering performance, and so far only low-dimensional datasets have been considered. In light of these limitations, in this paper, we propose Deep Fair Clustering (DFC) to learn fair and clustering-favorable representations for clustering simultaneously. Our approach could effectively filter out sensitive attributes from representations, and also lead to representations that are amenable for the following cluster analysis. Theoretically, we show that our fairness constraint in DFC will not incur much loss in terms of several clustering metrics. Empirically, we provide extensive experimental demonstrations on four visual datasets to corroborate the superior performance of the proposed approach over existing fair clustering and deep clustering methods on both cluster validity and fairness criterion.
[visual, work, provide, dataset, goal] [feature, assignment, propose, validity, table] [sensitive, quality, original, adversarial, input, datasets, mnist, conduct, external, worst, experimental, model] [partition, analysis, proposed, existing, figure, method, color, reverse, based, spectral, ieee] [protected, cluster, attribute, representation, subgroup, encoder, structural, loss, independent, learn, unsupervised, unfair, preservation] [clustering, fair, deep, fairness, learning, data, accuracy, algorithm, balance, label, achieve, training, entropy, set, distribution, machine, balanced, performance, neural, function, arxiv, preprint, best, objective, space, fairlet, large, classifier, problem, soft, mtfl, size, nmi] [conference, international, matching, structure, computer, constraint, well]
@InProceedings{Li_2020_CVPR,
  author = {Li, Peizhao and Zhao, Han and Liu, Hongfu},
  title = {Deep Fair Clustering for Visual Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Bidirectional Graph Reasoning Network for Panoptic Segmentation
Yangxin Wu, Gengwei Zhang, Yiming Gao, Xiajun Deng, Ke Gong, Xiaodan Liang, Liang Lin


Recent research on panoptic segmentation resorts to a single end-to-end network to combine the tasks of instance segmentation and semantic segmentation. However, prior models only unified the two related tasks at the architectural level via a multi-branch scheme or revealed the underlying correlation between them by unidirectional feature fusion, which disregards the explicit semantic and co-occurrence relations among objects and background. Inspired by the fact that context information is critical to recognize and localize the objects, and inclusive object details are significant to parse the background scene, we thus investigate explicitly modeling the correlations between object and background to achieve a holistic understanding of an image in the panoptic segmentation task. We introduce a Bidirectional Graph Reasoning Network (BGRNet), which incorporates graph structure into the conventional panoptic segmentation network to mine the intra-modular and inter-modular relations within and between foreground things and background stuff classes. In particular, BGRNet first constructs image-specific graphs in both instance and semantic segmentation branches that enable flexible reasoning at the proposal level and class level, respectively. To establish the correlations between separate branches and fully leverage the complementary relations between things and stuff, we propose a Bidirectional Graph Connection Module to diffuse information across branches in a learnable fashion. Experimental results demonstrate the superiority of our BGRNet that achieves the new state-of-the-art performance on challenging COCO and ADE20K panoptic segmentation benchmarks.
[graph, reasoning, bidirectional, attention, visual, node, build, context, modeling] [segmentation, stuff, panoptic, semantic, feature, instance, branch, foreground, bgrnet, background, object, module, mask, table, fully, global, region, pqst, kaiming, ross, unidirectional, coco, refine, center, building, wst, unified, level, challenging, backbone, detection, wth, pqt, proposal, propose, achieves, parsing] [model, quality] [figure, convolutional, based, pixel, enhance, convolution, method] [extracted, image, xst, separate, project, underlying] [class, connection, network, performance, arxiv, preprint, matrix, knowledge, weight, scheme, investigate] [local, scene, single, demonstrate, structure, well]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Yangxin and Zhang, Gengwei and Gao, Yiming and Deng, Xiajun and Gong, Ke and Liang, Xiaodan and Lin, Liang},
  title = {Bidirectional Graph Reasoning Network for Panoptic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Exploit Clues From Views: Self-Supervised and Regularized Learning for Multiview Object Recognition
Chih-Hui Ho, Bo Liu, Tz-Ying Wu, Nuno Vasconcelos


Multiview recognition has been well studied in the literature and achieves decent performance in object recognition and retrieval tasks. However, most previous works rely on supervised learning and some impractical underlying assumptions, such as the availability of all views in training and inference time. In this work, the problem of multiview self-supervised learning (MV-SSL) is investigated, where only image-to-object association is given. Given this setup, a novel surrogate task for self-supervised learning is proposed by pursuing "object invariant" representation. This is solved by randomly selecting an image feature of an object as the object prototype, accompanied by multiview consistency regularization, which results in view invariant stochastic prototype embedding (VISPE). Experiments show that the recognition and retrieval results using VISPE outperform those of other self-supervised learning methods on seen and unseen data. VISPE can also be applied to the semi-supervised scenario and demonstrates robust performance with limited data available. Code is available at https://github.com/chihhuiho/VISPE
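The stochastic prototype embedding can be sketched as: pick one random view's embedding per object as that object's prototype, then classify every view against all prototypes with a softmax over objects, which pushes the representation to be object-invariant across views. The sketch below assumes a fixed number of views per object and a temperature hyperparameter, and omits the multiview consistency regularization term.

import torch
import torch.nn.functional as F

def stochastic_prototype_loss(view_feats, temperature=0.1):
    # view_feats: (num_objects, num_views, D) embeddings of several views per object
    n_obj, n_view, d = view_feats.shape
    feats = F.normalize(view_feats, dim=-1)
    proto_idx = torch.randint(0, n_view, (n_obj,))                  # random view as prototype
    prototypes = feats[torch.arange(n_obj), proto_idx]              # (num_objects, D)
    logits = feats.reshape(-1, d) @ prototypes.t() / temperature    # every view vs every prototype
    labels = torch.arange(n_obj).repeat_interleave(n_view)          # a view should match its object
    return F.cross_entropy(logits, labels)

loss = stochastic_prototype_loss(torch.randn(8, 4, 128))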
[embedding, recognition, embeddings, dataset, retrieval, previous, context, work, visual, multiple, video] [object, feature, instance, table] [model, generalization, trained, impractical, face] [proposed, based, ieee, convolutional, pattern, june, figure] [unseen, image, surrogate, prototype, randomization, unsupervised, vispe, invariant, learn, lwumor, loss, representation, supervised, consistency] [learning, training, set, class, neural, task, metric, ssl, deep, data, better, triplet, performance, softmax, regularization, network, stochastic, learned, algorithm, consider, stable, subset, problem, good, number, accuracy, clustering, randomly, memory] [multiview, view, computer, conference, vision, shape, robot, pose, modelnet, structure, approach, international, well]
@InProceedings{Ho_2020_CVPR,
  author = {Ho, Chih-Hui and Liu, Bo and Wu, Tz-Ying and Vasconcelos, Nuno},
  title = {Exploit Clues From Views: Self-Supervised and Regularized Learning for Multiview Object Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Spherical Space Domain Adaptation With Robust Pseudo-Label Loss
Xiang Gu, Jian Sun, Zongben Xu


Adversarial domain adaptation (DA) has been an effective approach for learning domain-invariant features by adversarial training. In this paper, we propose a novel adversarial DA approach completely defined in a spherical feature space, in which we define a spherical classifier for label prediction and a spherical domain discriminator for discriminating domain labels. To utilize pseudo-labels robustly, we develop a robust pseudo-label loss in the spherical feature space, which weights the importance of estimated labels of target data by the posterior probability of correct labeling, modeled by a Gaussian-uniform mixture model in the spherical feature space. Extensive experiments show that our method achieves state-of-the-art results, and also confirm the effectiveness of the spherical classifier, the spherical discriminator, and the robust spherical pseudo-label loss.
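Two ingredients of the abstract are easy to sketch: features are constrained to the unit hypersphere by normalization, and each target pseudo-label is weighted by the posterior probability that it is correct under a two-component mixture (Gaussian for correct labels, uniform for wrong ones) over some distance statistic. The densities, mixture weight pi, and the distance statistic below are illustrative assumptions; the paper fits these quantities rather than fixing them.

import math
import torch
import torch.nn.functional as F

def sphere_project(feats):
    # map features onto the unit hypersphere, where the classifier and discriminator are defined
    return F.normalize(feats, dim=-1)

def pseudo_label_weights(dist, pi=0.7, mu=0.2, sigma=0.1):
    # dist: distance of each target sample to its pseudo-class center on the sphere
    # posterior that the pseudo-label is correct under a Gaussian (correct) + uniform (wrong) mixture
    gauss = torch.exp(-0.5 * ((dist - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    uniform = torch.ones_like(dist)   # stand-in for the uniform density on the bounded distance range
    return pi * gauss / (pi * gauss + (1 - pi) * uniform)

w = pseudo_label_weights(torch.rand(16))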
[correct, dataset, outperforms, visual] [feature, labeling, table, propose, achieves, extractor, effectiveness, regression] [robust, adversarial, model] [based, method, transform, proposed, figure] [domain, target, loss, adaptation, discriminator, source, unsupervised, dann, conditional, adaption, wrongly, mstn, mingsheng, learn, jianmin, utilize, invariant, kate, completely, defining, rsda, transfer, snr, symnets, mdd, trevor] [space, classifier, data, learning, deep, labeled, probability, mixture, class, classification, training, entropy, network, layer, neural, linear, accuracy, distribution, perceptron, discussed, posterior, performance, logistic, performing, better, large] [spherical, defined, approach, euclidean, estimate, michael, novel, estimated, distance, define]
@InProceedings{Gu_2020_CVPR,
  author = {Gu, Xiang and Sun, Jian and Xu, Zongben},
  title = {Spherical Space Domain Adaptation With Robust Pseudo-Label Loss},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Stochastic Classifiers for Unsupervised Domain Adaptation
Zhihe Lu, Yongxin Yang, Xiatian Zhu, Cong Liu, Yi-Zhe Song, Tao Xiang


A common strategy adopted by existing state-of-the-art unsupervised domain adaptation (UDA) methods is to employ two classifiers to identify the misaligned local regions between source and target domain. Following the 'wisdom of the crowd' principle, one has to ask: why stop at two? Indeed, we find that using more classifiers leads to better performance, but also introduces more model parameters, therefore risking overfitting. In this paper, we introduce a novel method called STochastic clAssifieRs (STAR) for addressing this problem. Instead of representing one classifier as a weight vector, STAR models it as a Gaussian distribution with its variance representing the inter-classifier discrepancy. With STAR, we can now sample an arbitrary number of classifiers from the distribution, whilst keeping the model size the same as having two classifiers. Extensive experiments demonstrate that a variety of existing UDA methods can greatly benefit from STAR and achieve the state-of-the-art performance on both image classification and semantic segmentation tasks.
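Modeling a classifier as a Gaussian over its weight matrix makes "arbitrarily many classifiers" cheap: each forward pass draws fresh weights via the reparameterization trick, and the discrepancy between the sampled classifiers' predictions on target data can then be used as in MCD-style adaptation. A minimal sketch (bias terms and the actual training schedule are omitted):

import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticClassifier(nn.Module):
    # Classifier weight modeled as a Gaussian; every forward pass samples fresh classifiers.
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.log_sigma = nn.Parameter(torch.zeros(num_classes, feat_dim))

    def forward(self, feats, num_samples=2):
        outs = []
        for _ in range(num_samples):
            w = self.mu + self.log_sigma.exp() * torch.randn_like(self.mu)
            outs.append(F.softmax(feats @ w.t(), dim=-1))
        return outs

def discrepancy(outs):
    # mean pairwise L1 distance between sampled classifiers' predictions (MCD-style)
    total, count = 0.0, 0
    for i in range(len(outs)):
        for j in range(i + 1, len(outs)):
            total = total + (outs[i] - outs[j]).abs().mean()
            count += 1
    return total / max(count, 1)

clf = StochasticClassifier(256, 10)
preds = clf(torch.randn(16, 256), num_samples=4)
d = discrepancy(preds)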
[recognition, sign, three, step, traffic, previous] [semantic, feature, segmentation, table, final, object, extra] [model, adversarial, mnist, trained, decision, datasets, identify] [star, method, figure, existing, based, proposed, gaussian] [domain, source, uda, target, unsupervised, mcd, image, whilst, usps, clan, adaptation, discrepancy, alignment, loss, digit, diverse, tao, common, asn, synthia, misaligned, transfer, misalignment] [classifier, distribution, number, classification, training, stochastic, learning, data, performance, deep, accuracy, weight, neural, large, set, reported, sampled, variance, size, standard, larger, test, network, better, sample, best, task, svhn, suggests, random, iteration, layer, batch, setting] [local, joint, uncertainty]
@InProceedings{Lu_2020_CVPR,
  author = {Lu, Zhihe and Yang, Yongxin and Zhu, Xiatian and Liu, Cong and Song, Yi-Zhe and Xiang, Tao},
  title = {Stochastic Classifiers for Unsupervised Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Learning of Intrinsic Structural Representation Points
Nenglun Chen, Lingjie Liu, Zhiming Cui, Runnan Chen, Duygu Ceylan, Changhe Tu, Wenping Wang


Learning structures of 3D shapes is a fundamental problem in the field of computer graphics and geometry processing. We present a simple yet interpretable unsupervised method for learning a new structural representation in the form of 3D structure points. The 3D structure points produced by our method encode the shape structure intrinsically and exhibit semantic consistency across all the shape instances with similar structures. This is a challenging goal that has not fully been achieved by other methods. Specifically, our method takes a 3D point cloud as input and encodes it as a set of local features. The local features are then passed through a novel point integration module to produce a set of 3D structure points. The chamfer distance is used as reconstruction loss to ensure the structure points lie close to the input point cloud. Extensive experiments have shown that our method outperforms the state-of-the-art on the semantic shape correspondence task and achieves comparable performance with the state-of-the-art on the segmentation label transfer task. Moreover, the PCA based shape embedding built upon consistent structure points demonstrates good performance in preserving the shape structures. Code is available at https://github.com/NolenChen/3DStructurePoints
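The reconstruction loss named in the abstract is the symmetric Chamfer distance between the predicted structure points and the input cloud; a minimal PyTorch version, written for clarity rather than speed, might look like this:

```python
import torch

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a [B, N, 3] and b [B, M, 3]."""
    # pairwise squared distances: [B, N, M]
    d = torch.cdist(a, b, p=2) ** 2
    a_to_b = d.min(dim=2).values.mean(dim=1)   # each structure point to its nearest input point
    b_to_a = d.min(dim=1).values.mean(dim=1)   # each input point to its nearest structure point
    return (a_to_b + b_to_a).mean()

# toy usage: the structure points are pulled towards the input cloud
structure_points = torch.randn(2, 16, 3, requires_grad=True)
input_cloud = torch.randn(2, 1024, 3)
loss = chamfer_distance(structure_points, input_cloud)
loss.backward()
```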
[embedding, structured, hierarchical] [semantic, segmentation, map, module, feature, contextual, category] [input, testing, example] [figure, based, method, proposed, integration] [corresponding, consistency, unsupervised, produced, transfer, structural, encoder, real, learn, train, row, representation, produce, loss, semantically, generate, pji] [learning, network, label, deep, sample, set, probability, training, performance, arxiv, preprint, good, architecture, task, large, processing, sampling, space, average, number, learned, note] [point, structure, shape, cloud, correspondence, local, pca, consistent, leonidas, reconstruction, scanned, principal, hao, computer, shapenet, vladimir, approach, functional, thomas, collection, pointnet, well, keypoints, dense, mlp, geometric]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Nenglun and Liu, Lingjie and Cui, Zhiming and Chen, Runnan and Ceylan, Duygu and Tu, Changhe and Wang, Wenping},
  title = {Unsupervised Learning of Intrinsic Structural Representation Points},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PolyTransform: Deep Polygon Transformer for Instance Segmentation
Justin Liang, Namdar Homayounfar, Wei-Chiu Ma, Yuwen Xiong, Rui Hu, Raquel Urtasun


In this paper, we propose PolyTransform, a novel instance segmentation algorithm that produces precise, geometry-preserving masks by combining the strengths of prevailing segmentation approaches and modern polygon-based methods. In particular, we first exploit a segmentation network to generate instance masks. We then convert the masks into a set of polygons that are then fed to a deforming network that transforms the polygons such that they better fit the object boundaries. Our experiments on the challenging Cityscapes dataset show that our PolyTransform significantly improves the performance of the backbone instance segmentation network and ranks 1st on the Cityscapes test-set leaderboard. We also show impressive gains in the interactive annotation setting.
[dataset, exploit, transformer, goal] [instance, segmentation, object, annotation, feature, mask, upsnet, table, raquel, interactive, bounding, boundary, semantic, box, polytransform, backbone, sanja, improvement, coco, iou, val, kaiming, annotate, panet, split, detection, predicted, level, car, alexander, ross, piotr, propose] [model, improve] [figure, output, convolutional, pixel, extraction] [image, fine, train, loss, generate] [network, initialization, deep, report, set, learning, test, metric, better, validation, task, performance, problem, neural, active, gain, average, modern] [polygon, ground, approach, deforming, truth, vertex, capture, novel, fit, compute, scene, complex, handle]
@InProceedings{Liang_2020_CVPR,
  author = {Liang, Justin and Homayounfar, Namdar and Ma, Wei-Chiu and Xiong, Yuwen and Hu, Rui and Urtasun, Raquel},
  title = {PolyTransform: Deep Polygon Transformer for Instance Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Interactive Two-Stream Decoder for Accurate and Fast Saliency Detection
Huajun Zhou, Xiaohua Xie, Jian-Huang Lai, Zixuan Chen, Lingxiao Yang


Recently, contour information has been shown to largely improve the performance of saliency detection. However, the discussion on the correlation between saliency and contour remains scarce. In this paper, we first analyze such correlation and then propose an interactive two-stream decoder to explore multiple cues, including saliency, contour and their correlation. Specifically, our decoder consists of two branches, a saliency branch and a contour branch. Each branch is assigned to learn distinctive features for predicting the corresponding map. Meanwhile, the intermediate connections are forced to learn the correlation by interactively transmitting the features from each branch to the other one. In addition, we develop an adaptive contour loss to automatically discriminate hard examples during the learning process. Extensive experiments on six benchmarks demonstrate that our network achieves competitive performance at a fast speed of around 50 FPS. Moreover, our VGG-based model only contains 17.08 million parameters, which is significantly smaller than other VGG-based approaches. Code has been made available at: https://github.com/moothes/ITSD-pytorch.
[attention, decoder, hierarchical, integrate, speed, cue, visual] [saliency, contour, salient, object, detection, correlation, feature, hard, module, branch, map, predicted, table, huchuan, itsd, employed, interactive, egnet, poolnet, cpd, supervision, boundary, global, segmentation, propose, achieves, ctloss] [model, improve, input, vgg] [ieee, proposed, pattern, figure, mae, fusion, intermediate, convolutional, method, adaptive, based, analysis, develop, fast, introduced, existing] [loss, learn, image, train, corresponding] [network, learning, performance, deep, number, size, compared, arxiv, preprint, objective, function, best, machine, task, neural] [computer, conference, vision, well, international, ground, accurate]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Huajun and Xie, Xiaohua and Lai, Jian-Huang and Chen, Zixuan and Yang, Lingxiao},
  title = {Interactive Two-Stream Decoder for Accurate and Fast Saliency Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Better Generalization: Joint Depth-Pose Learning Without PoseNet
Wang Zhao, Shaohui Liu, Yezhi Shu, Yong-Jin Liu


In this work, we tackle the essential problem of scale inconsistency for self-supervised joint depth-pose learning. Most existing methods assume that a consistent scale of depth and pose can be learned across all input samples, which makes the learning problem harder, resulting in degraded performance and limited generalization in indoor environments and long-sequence visual odometry applications. To address this issue, we propose a novel system that explicitly disentangles scale from the network estimation. Instead of relying on PoseNet architecture, our method recovers relative pose by directly solving the fundamental matrix from dense optical flow correspondences and makes use of a two-view triangulation module to recover an up-to-scale 3D structure. Then, we align the scale of the depth prediction with the triangulated point cloud and use the transformed depth map for depth error computation and dense reprojection check. Our whole system can be jointly trained end-to-end. Extensive experiments show that our system not only reaches state-of-the-art performance on KITTI depth and flow estimation, but also significantly improves the generalization ability of existing self-supervised depth-pose learning methods under a variety of challenging scenarios, and achieves state-of-the-art results among self-supervised learning-based methods on the KITTI Odometry and NYUv2 datasets. Furthermore, we present some interesting findings on the limitation of PoseNet-based relative pose estimation methods in terms of generalization ability. Code is available at https://github.com/B1ueber2y/TrianFlow.
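The scale-alignment step, matching the scale of the predicted depth map to the up-to-scale triangulated structure before computing depth and reprojection errors, admits a very small sketch. The median-ratio estimator below is a common choice in self-supervised depth work and is an assumption, not necessarily the exact operator used by the authors:

```python
import torch

def align_depth_scale(pred_depth, triangulated_depth, valid_mask):
    """Rescale predicted depth so that it agrees with an up-to-scale triangulated point cloud.

    pred_depth, triangulated_depth : [H, W] tensors
    valid_mask : [H, W] bool tensor marking pixels with a reliable triangulated depth
    """
    ratio = triangulated_depth[valid_mask] / pred_depth[valid_mask].clamp(min=1e-6)
    scale = ratio.median()                     # robust to outlier correspondences
    return pred_depth * scale, scale

# toy usage: the aligned depth can then be used for depth error and reprojection checks
pred = torch.rand(4, 4) + 0.5
tri = pred * 3.2                               # triangulation recovered the same geometry at 3.2x scale
mask = torch.ones(4, 4, dtype=torch.bool)
aligned, s = align_depth_scale(pred, tri, mask)
print(float(s))                                # ~3.2
```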
[visual, prediction, dataset, explicitly, essentially, work] [map, challenging, table, predicted, module, occlusion, score, propose, achieves, supervision] [generalization, robust, inconsistency, input, robustness, original, zhou] [flow, optical, scale, method, figure, existing, pixel, proposed, recover] [unsupervised, image, ability, loss, consistency, unseen, align, learn] [learning, training, network, deep, performance, matrix, problem, better, sample, large, test, data, design, learned, compared, neural] [depth, pose, estimation, monocular, system, camera, relative, odometry, kitti, error, triangulation, joint, fundamental, dense, triangulated, correspondence, posenet, indoor, reprojection, directly, geometric, accurate, single, structure, slam, consistent, jointly, photometric]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Wang and Liu, Shaohui and Shu, Yezhi and Liu, Yong-Jin},
  title = {Towards Better Generalization: Joint Depth-Pose Learning Without PoseNet},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
LT-Net: Label Transfer by Learning Reversible Voxel-Wise Correspondence for One-Shot Medical Image Segmentation
Shuxin Wang, Shilei Cao, Dong Wei, Renzhen Wang, Kai Ma, Liansheng Wang, Deyu Meng, Yefeng Zheng


We introduce a one-shot segmentation method to alleviate the burden of manual annotation for medical images. The main idea is to treat one-shot segmentation as a classical atlas-based segmentation problem, where voxel-wise correspondence from the atlas to the unlabelled data is learned. Subsequently, the segmentation label of the atlas can be transferred to the unlabelled data with the learned correspondence. However, since ground truth correspondence between images is usually unavailable, the learning system must be well-supervised to avoid mode collapse and convergence failure. To overcome this difficulty, we resort to the forward-backward consistency, which is widely used in correspondence problems, and additionally learn the backward correspondences from the warped atlases back to the original atlas. This cycle-correspondence learning design enables a variety of extra, cycle-consistency-based supervision signals to make the training process stable, while also boosting performance. We demonstrate the superiority of our method over both deep learning-based one-shot segmentation methods and a classical multi-atlas segmentation method via thorough experiments.
[work] [segmentation, supervision, framework, map, table, extra, tracking, labelled, ablation, sota] [adversarial, original, difference, model] [medical, method, voxelmorph, proposed, classical, warped, brain, dice, anatomical, optical, flow, ieee, warp, lsmooth, pattern, spatial, slice, comparison] [image, atlas, consistency, loss, cycle, cyc, anatomy, lgan, lcyc, learn, synthetic, transfer, target, mabmis, dataaug, assisted, lanatomy, gan, ltrans, corresponding, introduce] [learning, label, unlabelled, backward, deep, forward, data, learned, network, training, neural, min, max, computing, design, performance, basic, sample, problem, test] [correspondence, computer, transformation, registration, vision, reconstructed, matching, defined, constraint]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Shuxin and Cao, Shilei and Wei, Dong and Wang, Renzhen and Ma, Kai and Wang, Liansheng and Meng, Deyu and Zheng, Yefeng},
  title = {LT-Net: Label Transfer by Learning Reversible Voxel-Wise Correspondence for One-Shot Medical Image Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FGN: Fully Guided Network for Few-Shot Instance Segmentation
Zhibo Fan, Jin-Gang Yu, Zhihao Liang, Jiarong Ou, Changxin Gao, Gui-Song Xia, Yuanqing Li


Few-shot instance segmentation (FSIS) conjoins the few-shot learning paradigm with general instance segmentation, which provides a possible way of tackling instance segmentation in the absence of abundant labeled data for training. This paper presents a Fully Guided Network (FGN) for few-shot instance segmentation. FGN perceives FSIS as a guided model where a so-called support set is encoded and utilized to guide the predictions of a base instance segmentation network (i.e., Mask R-CNN), critical to which is the guidance mechanism. In this view, FGN introduces different guidance mechanisms into the various key components in Mask R-CNN, including Attention-Guided RPN, Relation-Guided Detector, and Attention-Guided FCN, in order to make full use of the guidance effect from the support set and adapt better to inter-class generalization. Experiments on public datasets demonstrate that our proposed FGN can outperform the state-of-the-art methods.
[three, previous, work] [instance, segmentation, mask, fgn, object, cow, guided, rpn, fsis, plant, semantic, detection, pot, ted, fully, table, guide, siamese, feature, detector, fcn, bottle, key, branch, mrcnn, mbike, bike, background, voc, paradigm, bbox] [model, query, termed, effective, experimental] [guidance, ieee, proposed, pattern, comparison, figure] [image, bus, bird, conditional] [set, support, base, learning, network, training, classification, performance, data, setting, task, problem, strategy, dbase, achieve, meta, general, fsl, class, outperform] [novel, conference, computer, vision, full, matching, international, limited]
@InProceedings{Fan_2020_CVPR,
  author = {Fan, Zhibo and Yu, Jin-Gang and Liang, Zhihao and Ou, Jiarong and Gao, Changxin and Xia, Gui-Song and Li, Yuanqing},
  title = {FGN: Fully Guided Network for Few-Shot Instance Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Quantum Computational Approach to Correspondence Problems on Point Sets
Vladislav Golyanik, Christian Theobalt


Modern adiabatic quantum computers (AQC) are already used to solve difficult combinatorial optimisation problems in various domains of science. Currently, only a few applications of AQC in computer vision have been demonstrated. We review AQC and derive a new algorithm for correspondence problems on point sets suitable for execution on AQC. Our algorithm has a subquadratic computational complexity of the state preparation. Examples of successful transformation estimation and point set alignment by simulated sampling are shown in the systematic experimental evaluation. Finally, we analyse the differences in the solutions and the corresponding energy values.
[state, time] [template] [physical, model, experimental, aqa, noise, difference, theory, christian] [classical, method, analysis, magnetic, journal, reference, electron, result, spectral, field] [alignment, corresponding, image, misalignment, gap] [set, energy, computing, matrix, number, problem, algorithm, computational, superposition, modern, complexity, binary, optimal, applied, machine, theorem, evolution, finding, denoted, probability, computation, exponential, arxiv, william, andrew, combinatorial, hardware] [quantum, point, transformation, computer, system, annealing, adiabatic, ground, qubit, qubits, basis, hamiltonian, aqc, estimation, vision, initial, qubop, spin, solution, form, optimisation, annealers, globally, ising, approach, international, conference, simulated, solving, rigid, gravitational, correspondence, registration]
@InProceedings{Golyanik_2020_CVPR,
  author = {Golyanik, Vladislav and Theobalt, Christian},
  title = {A Quantum Computational Approach to Correspondence Problems on Point Sets},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Data-Efficient Semi-Supervised Learning by Reliable Edge Mining
Peibin Chen, Tao Ma, Xu Qin, Weidi Xu, Shuchang Zhou


Learning powerful discriminative features is a challenging task in Semi-Supervised Learning, as the estimation of the feature space is more likely to be wrong with scarcer labeled data. Previous methods utilize a relation graph with edges representing 'similarity' or 'dissimilarity' between nodes. Similar nodes are forced to output consistent features, while dissimilar nodes are forced to be inconsistent. However, since unlabeled data may be wrongly pseudo-labeled, the judgment of edges may be unreliable. Besides, the nodes connected by edges may already be well fitted, thus contributing little to the model training. We propose Reliable Edge Mining (REM), which forms a reliable graph by only selecting reliable and useful edges. Guided by the graph, the feature extractor is able to learn discriminative features in a data-efficient way, and consequently boosts the accuracy of the learned classifier. Visual analyses show that the features learned are more discriminative and better reveal the underlying structure of the data. REM can be combined with perturbation-based methods like Pi-model, TempEns and Mean Teacher to further improve accuracy. Experiments prove that our method is data-efficient on simple tasks like SVHN and CIFAR-10, and achieves state-of-the-art results on the challenging CIFAR-100.
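A rough sketch of the edge-selection idea: build similarity/dissimilarity edges within a batch from the model's own pseudo-labels, keep an edge only when both endpoints are confidently predicted, and apply a consistency loss only on the kept edges. The confidence threshold, the cosine-similarity targets and the margin are all illustrative assumptions rather than the paper's exact REM criterion:

```python
import torch
import torch.nn.functional as F

def reliable_edge_loss(features, probs, conf_thresh=0.95, margin=0.5):
    """Consistency loss on a graph whose edges are kept only when both endpoints are confident.

    features : [B, D] embeddings, probs : [B, C] softmax outputs of the current model.
    The confidence threshold and margin are illustrative values, not the paper's.
    """
    conf, pseudo = probs.max(dim=1)
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.t()                                   # cosine similarity, [B, B]

    same = pseudo[:, None] == pseudo[None, :]                 # "similar" vs "dissimilar" edges
    reliable = (conf[:, None] > conf_thresh) & (conf[None, :] > conf_thresh)
    reliable = reliable & ~torch.eye(len(probs), dtype=torch.bool)

    # pull similar pairs together, push dissimilar pairs below a margin
    pos = (1.0 - sim)[reliable & same]
    neg = F.relu(sim - margin)[reliable & ~same]
    terms = torch.cat([pos, neg])
    return terms.mean() if len(terms) else sim.new_zeros(())

# toy usage (threshold lowered so that some edges survive on random inputs)
f = torch.randn(16, 32, requires_grad=True)
p = F.softmax(torch.randn(16, 5), dim=1)
reliable_edge_loss(f, p, conf_thresh=0.0).backward()
```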
[graph, previous, node, embedding, construct, considering, annual, december, connected, long, represent, constructed] [edge, feature, extractor, represents, table, guided, challenging, achieves] [model, original, tiny, adversarial, example] [output, method, ieee, based, pattern, figure, june, cvpr] [attribute, discriminative, generating, encourages, learn, corresponding, generative, utilize] [reliable, rem, learning, data, certainty, sntg, neural, unlabeled, processing, teacher, deep, set, training, test, eij, confident, labeled, mining, network, reliability, select, candidate, number, standard, ssl, rate, randomly, function, better, svhn, machine, expected, surpasses] [conference, computer, error, vision, international, form, neighbor]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Peibin and Ma, Tao and Qin, Xu and Xu, Weidi and Zhou, Shuchang},
  title = {Data-Efficient Semi-Supervised Learning by Reliable Edge Mining},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
NestedVAE: Isolating Common Factors via Weak Supervision
Matthew J. Vowels, Necati Cihan Camgoz, Richard Bowden


Fair and unbiased machine learning is an important and active field of research, as decision processes are increasingly driven by models that learn from data. Unfortunately, any biases present in the data may be learned by the model, thereby inappropriately transferring that bias into the decision making process. We identify the connection between the task of bias reduction and that of isolating factors common between domains whilst encouraging domain specific invariance. To isolate the common factors we combine the theory of deep latent variable models with information bottleneck theory for scenarios whereby data may be naturally paired across domains and no additional supervision is required. The result is the Nested Variational AutoEncoder (NestedVAE). Two outer VAEs with shared weights attempt to reconstruct the input and infer a latent space, whilst a nested VAE attempts to reconstruct the latent representation of one image, from the latent representation of its paired image. In so doing, the nested VAE isolates the common latent factors/causes and becomes invariant to unwanted factors that are not shared between paired images. We also propose a new metric to provide a balanced method of evaluating consistency and classifier performance across domains which we refer to as the Adjusted Parity metric. An evaluation of NestedVAE on both domain and attribute invariance, change detection, and learning common factors for the prediction of biological sex demonstrates that NestedVAE significantly outperforms alternative methods.
[work, dataset, prediction, embeddings, outperforms, decoder, confounders, pair, order] [detection, supervision, weak, table] [model, adversarial, trained, mnist, change, white, black, theory, identity] [figure, prior, method, biological] [latent, domain, nestedvae, vae, common, variational, representation, parity, invariant, disentanglement, digit, vaes, sex, specific, nested, shared, invariance, disentangled, unsupervised, paired, whilst, image, learn, variable, attribute, autoencoder, generative, utkface, gender] [learning, data, training, deep, network, distribution, machine, bias, classification, performance, adjusted, bottleneck, outer, inference, best, metric, classifier, alternative, accuracy, task, random, learned, test, fair, number, problem, label, neural] [rotation, rotated, well, term]
@InProceedings{Vowels_2020_CVPR,
  author = {Vowels, Matthew J. and Camgoz, Necati Cihan and Bowden, Richard},
  title = {NestedVAE: Isolating Common Factors via Weak Supervision},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Progressive Adversarial Networks for Fine-Grained Domain Adaptation
Sinan Wang, Xinyang Chen, Yunbo Wang, Mingsheng Long, Jianmin Wang


Fine-grained visual categorization has long been considered an important problem; however, its real-world application is still restricted, since precisely annotating a large fine-grained image dataset is a laborious task and requires expert-level human knowledge. A solution to this problem is applying domain adaptation approaches to fine-grained scenarios, where the key idea is to discover the commonality between existing fine-grained image datasets and massive unlabeled data in the wild. The main technical bottleneck is that the large inter-domain variation will deteriorate the subtle boundaries of small inter-class variation during domain alignment. This paper presents the Progressive Adversarial Networks (PAN) to align fine-grained categories across domains with a curriculum-based adversarial learning framework. In particular, throughout the learning process, domain adaptation is carried out through all multi-grained features, progressively exploiting the label hierarchy from coarse to fine. The progressive learning is applied upon both category classification and domain alignment, boosting both the discriminability and the transferability of the fine-grained features. Our method is evaluated on three benchmarks, two of which are proposed by us, and it outperforms the state-of-the-art domain adaptation methods.
[visual, recognition, bilinear, dataset, three, granularity, work, hierarchical, outperforms] [feature, table, categorization, extractor, web, cnn, predicted, benchmark, category] [adversarial, trained, model, datasets, pietro] [pan, figure, method, proposed, existing] [domain, adaptation, progressive, source, target, transfer, loss, image, subtle, curriculum, discriminator, pal, mingsheng, jianmin, progressively, conditional, compcars, dann, serge, trevor, subordinate, corresponding, diversity, avg] [learning, label, large, deep, classifier, accuracy, distribution, data, network, small, classification, class, problem, average, performance, schedule, number, task, hierarchy] [error, joint, hybrid, coarse, michael, subhransu]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Sinan and Chen, Xinyang and Wang, Yunbo and Long, Mingsheng and Wang, Jianmin},
  title = {Progressive Adversarial Networks for Fine-Grained Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Disentangling Invertible Interpretation Network for Explaining Latent Representations
Patrick Esser, Robin Rombach, Bjorn Ommer


Neural networks have greatly boosted performance in computer vision by learning powerful representations of input data. The drawback of end-to-end training for maximal overall performance is black-box models whose hidden representations lack interpretability: since distributed coding is optimal for latent layers to improve their robustness, attributing meaning to parts of a hidden feature vector or to individual neurons is hindered. We formulate interpretation as a translation of hidden representations onto semantic concepts that are comprehensible to the user. The mapping between both domains has to be bijective so that semantic modifications in the target domain correctly alter the original representation. The proposed invertible interpretation network can be transparently applied on top of existing architectures with no need to modify or retrain them. Consequently, we translate an original representation to an equivalent yet interpretable one and backwards without affecting the expressiveness and performance of the original. The invertible interpretation network disentangles the hidden representation into separate, semantically meaningful concepts. Moreover, we present an efficient approach to define semantic concepts by only sketching two images and also an unsupervised strategy. Experimental evaluation demonstrates the wide applicability to interpretation of existing classification and image generation networks as well as to semantically guided image manipulation.
[hidden, multiple, visual, provide, walk, understanding] [semantic, object, map, correlation, feature] [interpretation, concept, original, interpretability, case, internal, adversarial, modify, model, change, analyze, input, trained, meaning] [invertible, pattern, ieee, existing, figure, residual, convolutional, based, output] [image, representation, interpretable, factor, latent, disentangled, translation, attribute, generative, digit, meaningful, arbitrary, disentangling, unsupervised, invertibility, autoencoder, modified, specific, transfer, translate, semantically] [network, deep, neural, training, space, linear, learning, distributed, data, classifier, dimensionality, performance, applied, layer, class, arxiv, preprint, distribution, vector, classification, learned, processing] [approach, computer, conference, vision, international, enables]
@InProceedings{Esser_2020_CVPR,
  author = {Esser, Patrick and Rombach, Robin and Ommer, Bjorn},
  title = {A Disentangling Invertible Interpretation Network for Explaining Latent Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Modeling the Background for Incremental Learning in Semantic Segmentation
Fabio Cermelli, Massimiliano Mancini, Samuel Rota Bulo, Elisa Ricci, Barbara Caputo


Despite their effectiveness in a wide range of tasks, deep architectures suffer from some important limitations. In particular, they are vulnerable to catastrophic forgetting, i.e. they perform poorly when they are required to update their model as new classes are available but the original training set is not retained. This paper addresses this problem in the context of semantic segmentation. Current strategies fail on this task because they do not consider a peculiar aspect of semantic segmentation: since each training step provides annotation only for a subset of all possible classes, pixels of the background class (i.e. pixels that do not belong to any other classes) exhibit a semantic distribution shift. In this work we revisit classical incremental learning methods, proposing a new distillation-based framework which explicitly accounts for this shift. Furthermore, we introduce a novel strategy to initialize classifier's parameters, thus preventing biased predictions toward the background class. We demonstrate the effectiveness of our approach with an extensive evaluation on the Pascal-VOC 2012 and ADE20K datasets, significantly outperforming state of the art incremental learning methods.
[previous, step, current, shift, modeling, considering, account, multiple, explicitly, three, dataset, work, state] [semantic, background, segmentation, object, table, qxt, overlapped, addition, fully, assigned, propose] [model, disjoint, experimental] [method, pixel, output, convolutional, proposed] [loss, image, address, learn] [learning, class, incremental, icl, distillation, training, standard, catastrophic, set, problem, forgetting, lwf, performance, knowledge, probability, deep, label, classifier, setting, network, strategy, classification, initialization, space, ilt, denote, large, task, neural, report, ewc, best, peculiar, distribution, initialize, objective] [novel, approach, ground, assume, defined, truth]
@InProceedings{Cermelli_2020_CVPR,
  author = {Cermelli, Fabio and Mancini, Massimiliano and Bulo, Samuel Rota and Ricci, Elisa and Caputo, Barbara},
  title = {Modeling the Background for Incremental Learning in Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Interpreting the Latent Space of GANs for Semantic Face Editing
Yujun Shen, Jinjin Gu, Xiaoou Tang, Bolei Zhou


Despite the recent advance of Generative Adversarial Networks (GANs) in high-fidelity image synthesis, there is still limited understanding of how GANs are able to map a latent code sampled from a random distribution to a photo-realistic image. Previous work assumes the latent space learned by GANs follows a distributed representation but observes the vector arithmetic phenomenon. In this work, we propose a novel framework, called InterFaceGAN, for semantic face editing by interpreting the latent semantics learned by GANs. In this framework, we conduct a detailed study on how different semantics are encoded in the latent space of GANs for face synthesis. We find that the latent code of well-trained generative models actually learns a disentangled representation after linear transformations. We explore the disentanglement between various semantics and manage to decouple some entangled semantics with subspace projection, leading to more precise control of facial attributes. Besides manipulating gender, age, expression, and the presence of eyeglasses, we can even vary the face pose as well as fix the artifacts accidentally generated by GAN models. The proposed method is further applied to achieve real image manipulation when combined with GAN inversion methods or some encoder-involved models. Extensive results suggest that learning to synthesize faces spontaneously brings a disentangled and controllable facial attribute representation.
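The editing rule itself is simple: move the latent code along the unit normal of a learned separating hyperplane, and for conditional (disentangled) editing project one normal off another before moving. A numpy sketch under those assumptions (in practice the normals would come from linear classifiers trained on labelled latent codes):

```python
import numpy as np

def edit_latent(z, normal, alpha):
    """Move a latent code along a semantic boundary's unit normal (e.g. an 'age' hyperplane)."""
    n = normal / np.linalg.norm(normal)
    return z + alpha * n

def conditional_normal(primary, conditioned):
    """Remove from the primary direction its component along a second, entangled direction,
    so that editing the primary attribute leaves the conditioned one (approximately) unchanged."""
    n1 = primary / np.linalg.norm(primary)
    n2 = conditioned / np.linalg.norm(conditioned)
    n = n1 - (n1 @ n2) * n2
    return n / np.linalg.norm(n)

# toy usage with random 512-d vectors standing in for a GAN latent space and two boundaries
rng = np.random.default_rng(0)
z = rng.standard_normal(512)
age_n, glasses_n = rng.standard_normal(512), rng.standard_normal(512)
z_older = edit_latent(z, age_n, alpha=3.0)
z_older_same_glasses = edit_latent(z, conditional_normal(age_n, glasses_n), alpha=3.0)
```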
[semantics, work, moving, understanding] [semantic, correlation, propose, positive, framework, boundary, map] [face, age, manipulation, adversarial, original, facial, model, input, hyperplane, study, encoded, change] [figure, proposed, analysis, separation] [latent, gan, image, generative, code, gans, real, attribute, gender, conditional, synthesis, stylegan, editing, interfacegan, manipulating, pggan, smile, disentangled, generator, encoder, disentanglement, inversion, learn, bolei, learns, yujun, representation, entangled, synthesized] [space, linear, training, learned, random, achieve, learning, data, fixed, deep, sampled, distribution, find, better, vector, applied, respect, sample, set, arxiv, preprint] [pose, direction, property, well, distance, single, projection, approach, normal, directly, david, varying]
@InProceedings{Shen_2020_CVPR,
  author = {Shen, Yujun and Gu, Jinjin and Tang, Xiaoou and Zhou, Bolei},
  title = {Interpreting the Latent Space of GANs for Semantic Face Editing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Super-BPD: Super Boundary-to-Pixel Direction for Fast Image Segmentation
Jianqiang Wan, Yang Liu, Donglai Wei, Xiang Bai, Yongchao Xu


Image segmentation is a fundamental vision task and remains a crucial step for many applications. In this paper, we propose a fast image segmentation method based on a novel super boundary-to-pixel direction (super-BPD) and a customized segmentation algorithm with super-BPD. Precisely, we define BPD on each pixel as a two-dimensional unit vector pointing from its nearest boundary to the pixel. In the BPD, nearby pixels from different regions have opposite directions departing from each other, and nearby pixels in the same region have directions pointing toward one another (i.e., around medial points). We make use of this property to partition image into super-BPDs, which are novel informative superpixels with robust direction similarity for fast grouping into segmentation regions. Extensive experimental results on BSDS500 and Pascal Context demonstrate the accuracy and efficiency of the proposed super-BPD in segmenting images. Specifically, we achieve comparable or superior performance to MCG while running at 25 fps vs. 0.07 fps. Super-BPD also exhibits a noteworthy transferability to unseen scenes.
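The BPD itself, a unit vector at every pixel pointing from its nearest boundary pixel towards that pixel, can be derived from a label map with a distance transform. The sketch below uses SciPy; the particular 4-neighbourhood boundary definition is an assumption about the details:

```python
import numpy as np
from scipy import ndimage

def boundary_to_pixel_direction(labels):
    """Per-pixel unit vector pointing from the nearest region boundary to the pixel."""
    # boundary = pixels whose 4-neighbourhood contains a different label
    pad = np.pad(labels, 1, mode='edge')
    boundary = ((pad[1:-1, 1:-1] != pad[:-2, 1:-1]) | (pad[1:-1, 1:-1] != pad[2:, 1:-1]) |
                (pad[1:-1, 1:-1] != pad[1:-1, :-2]) | (pad[1:-1, 1:-1] != pad[1:-1, 2:]))
    # for every pixel, the index of its nearest boundary pixel
    _, (ny, nx) = ndimage.distance_transform_edt(~boundary, return_indices=True)
    yy, xx = np.mgrid[:labels.shape[0], :labels.shape[1]]
    vy, vx = yy - ny, xx - nx                         # vector from nearest boundary to the pixel
    norm = np.sqrt(vy ** 2 + vx ** 2) + 1e-6
    return np.stack([vy / norm, vx / norm], axis=0)   # [2, H, W]

# toy usage: two regions split down the middle produce opposite directions on either side
labels = np.zeros((6, 8), dtype=int)
labels[:, 4:] = 1
bpd = boundary_to_pixel_direction(labels)
print(bpd[1, 0, 0], bpd[1, 0, 7])                     # x-components point away from the boundary
```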
[context, graph, dataset, adjacency, step] [segmentation, bpd, pascal, region, object, del, boundary, fop, nearby, achieves, weak, merge, proposal, semantic, mcg, propose, superpixels, threshold, detection, merging, grouping, watershed, aspp, pointing, egb, slic, rag, superbpd, instance, adopt, parent] [robust, depicted, input] [pixel, ieee, proposed, pattern, analysis, based, adjacent, fast, partition, super, perceptual, figure, convolutional, neighboring, classical, field] [image, loss, learn, generation] [similarity, efficient, machine, small, set, accuracy, performance, learning, algorithm, size, number, alternative, good, group, deep, large, function] [direction, root, initial, vision, novel, define, defined]
@InProceedings{Wan_2020_CVPR,
  author = {Wan, Jianqiang and Liu, Yang and Wei, Donglai and Bai, Xiang and Xu, Yongchao},
  title = {Super-BPD: Super Boundary-to-Pixel Direction for Fast Image Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Learning With Rectification Strategy for Human Parsing
Tao Li, Zhiyuan Liang, Sanyuan Zhao, Jiahao Gong, Jianbing Shen


In this paper, we solve the sample shortage problem in the human parsing task. We begin with the self-learning strategy, which generates pseudo-labels for unlabeled data to retrain the model. However, directly using noisy pseudo-labels will cause error amplification and accumulation. Considering the topology structure of human body, we propose a trainable graph reasoning method that establishes internal structural connections between graph nodes to correct two typical errors in the pseudo-labels, i.e., the global structural error and the local consistency error. For the global error, we first transform category-wise features into a high-level graph model with coarse-grained structural information, and then decouple the high-level graph to reconstruct the category features. The reconstructed features have a stronger ability to represent the topology structure of the human body. Enlarging the receptive field of features can effectively reduce the local error. We first project feature pixels into a local graph model to capture pixel-wise relations in a hierarchical graph manner, then reverse the relation information back to the pixels. With the global structural and local consistency modules, these errors are rectified and confident pseudo-labels are generated for retraining. Extensive experiments on the LIP and the ATR datasets demonstrate the effectiveness of our global and local rectification modules. Our method outperforms other state-of-the-art methods in supervised human parsing tasks.
[graph, reasoning, hierarchical, lip, dataset, gsm, trainable, node, ied, represent, build, correct, fhead] [global, module, segmentation, parsing, predicted, semantic, denotes, feature, miou, propose, atr, effectiveness, adopt, category] [model, dress, stronger] [rectification, ieee, noisy, proposed, method, rectified, convolutional, receptive, adopted, transform] [structural, consistency, perform, representation, image, corresponding, lcm, generated] [network, labeled, training, strategy, learning, retraining, data, deep, number, matrix, performance, unlabeled, retrain, improved, arxiv, neural, label, algorithm, set, weight, average, preprint, process, decoupling, problem] [human, local, structure, error, body, pose, left, capture]
@InProceedings{Li_2020_CVPR,
  author = {Li, Tao and Liang, Zhiyuan and Zhao, Sanyuan and Gong, Jiahao and Shen, Jianbing},
  title = {Self-Learning With Rectification Strategy for Human Parsing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hyperbolic Visual Embedding Learning for Zero-Shot Recognition
Shaoteng Liu, Jingjing Chen, Liangming Pan, Chong-Wah Ngo, Tat-Seng Chua, Yu-Gang Jiang


This paper proposes a Hyperbolic Visual Embedding Learning Network for zero-shot recognition. The network learns image embeddings in hyperbolic space, which is capable of preserving the hierarchical structure of semantic classes in low dimensions. Compared with existing zero-shot learning approaches, the network is more robust because the embedding feature in hyperbolic space better represents class hierarchy and thereby avoids being misled by unrelated siblings. Our network outperforms existing baselines under hierarchical evaluation with an extremely challenging setting, i.e., learning only from 1,000 categories to recognize 20,841 unseen categories. Under flat evaluation, it achieves performance competitive with state-of-the-art methods while using five times fewer embedding dimensions. Our code is publicly available (https://github.com/ShaoTengLiu/Hyperbolic_ZSL).
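Hyperbolic embeddings of this kind are usually realised in the Poincare ball, where the geodesic distance below replaces the Euclidean or cosine metric; whether the paper uses exactly this model is an assumption, but the formula is the standard one:

```python
import torch

def poincare_distance(x, y, eps=1e-5):
    """Geodesic distance between points x, y inside the unit Poincare ball, shape [..., D]."""
    x2 = (x * x).sum(dim=-1).clamp(max=1 - eps)
    y2 = (y * y).sum(dim=-1).clamp(max=1 - eps)
    diff2 = ((x - y) ** 2).sum(dim=-1)
    arg = 1 + 2 * diff2 / ((1 - x2) * (1 - y2))
    return torch.acosh(arg.clamp(min=1 + eps))

# toy usage: points near the ball's boundary are "far" even when Euclidean-close,
# which is what lets low-dimensional embeddings encode deep hierarchies
a = torch.tensor([[0.10, 0.0], [0.95, 0.0]])
b = torch.tensor([[0.12, 0.0], [0.97, 0.0]])
print(poincare_distance(a, b))   # second pair is much farther despite the same Euclidean gap
```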
[embedding, embeddings, hierarchical, glove, visual, word, recognition, graph, explicit, dataset, correct, red, predict, outperforms, evaluation, work, transformer] [semantic, feature, map, parent, table, object] [model, robust, ball] [based, figure, sync, proposed, method, existing, version, tree] [image, unseen, zsl, squirrel, learns, learn, manifold, loss, gzsl, specific, train, project, mapping] [hyperbolic, space, class, learning, knowledge, label, hierarchy, performance, devise, riemannian, conse, gcnz, network, exponential, set, training, wordnet, dgp, learned, test, better, vector, metric, function, equation, imagenet, general, compared, dimension, deep] [euclidean, distance, defined, implicit, transformation, projected, structure, directly, distant]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Shaoteng and Chen, Jingjing and Pan, Liangming and Ngo, Chong-Wah and Chua, Tat-Seng and Jiang, Yu-Gang},
  title = {Hyperbolic Visual Embedding Learning for Zero-Shot Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Sequential Mastery of Multiple Visual Tasks: Networks Naturally Learn to Learn and Forget to Forget
Guy Davidson, Michael C. Mozer


We explore the behavior of a standard convolutional neural net in a continual-learning setting that introduces visual classification tasks sequentially and requires the net to master new tasks while preserving mastery of previously learned tasks. This setting corresponds to that which human learners face as they acquire domain expertise serially, for example, as an individual studies a textbook. Through simulations involving sequences of ten related visual tasks, we find reason for optimism that nets will scale well as they advance from having a single skill to becoming multi-skill domain experts. We observe two key phenomena. First, forward facilitation---the accelerated learning of task n+1 having learned n previous tasks---grows with n. Second, backward interference---the forgetting of the n previous tasks when learning task n+1 ---diminishes with n. Amplifying forward facilitation is the goal of research on metalearning, and attenuating backward interference is the goal of research on catastrophic forgetting. We find that both of these goals are attained simply through broader exposure to a domain.
[visual, sequence, specialized, three, heterogeneous, ten, previous, observed, psychological, modulation, multiple, behavior, corresponds, skill, facilitation] [improves, object] [trained, model, curve, input, series] [figure, introduced, convolutional, net, journal, indicate, half] [learn, domain, perform, image, representation] [task, training, learning, number, forgetting, accuracy, episode, catastrophic, neural, metalearning, criterion, interference, required, learned, function, memory, maml, reach, continual, performance, backward, network, ordinal, processing, standard, data, knowledge, practice, rate, mastery, forward, set, architecture, setting, decay, cognitive, machine, sequentially, requires, increase, reduce, fact, retraining, epoch] [human, position, conference, single, international, computer, vision]
@InProceedings{Davidson_2020_CVPR,
  author = {Davidson, Guy and Mozer, Michael C.},
  title = {Sequential Mastery of Multiple Visual Tasks: Networks Naturally Learn to Learn and Forget to Forget},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Distilling Effective Supervision From Severe Label Noise
Zizhao Zhang, Han Zhang, Sercan O. Arik, Honglak Lee, Tomas Pfister


Collecting large-scale data with clean labels for supervised training of neural networks is practically challenging. Although noisy labels are usually cheap to acquire, existing methods suffer a lot from label noise. This paper targets the challenge of robust training in high label-noise regimes. The key insight to achieve this goal is to wisely leverage a small trusted set to estimate exemplar weights and pseudo labels for noisy data in order to reuse them for supervised training. We present a holistic framework to train deep neural networks in a way that is highly invulnerable to label noise. Our method sets the new state of the art on various types of label noise and achieves excellent performance on large-scale datasets with real-world label noise. For instance, on CIFAR100 with a 40% uniform noise ratio and only 10 trusted labeled data per class, our method achieves 80.2% classification accuracy, where the error rate is only 1.4% higher than a neural network trained without label noise. Moreover, increasing the noise ratio to 80%, our method still maintains a high accuracy of 75.5%, compared to the previous best accuracy 48.2%.
[dataset, step, previous, construct] [table, labeling, achieves, framework] [noise, probe, model, robust, trained, mislabeled, effective, datasets, clean, input, original, highly] [noisy, method, figure, high, proposed, based, comparison, low] [pseudo, loss, supervised, train, image, lkl, generate, exemplar] [data, training, learning, label, accuracy, neural, deep, meta, trusted, set, ratio, small, class, best, performance, rate, compared, standard, random, uniform, mixup, batch, augmentation, labeled, gradient, arxiv, preprint, observe, learned, validation, test, equation, indicates, rog, requires, large, weight, descent, reduce] [approach, estimate, initial]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Zizhao and Zhang, Han and Arik, Sercan O. and Lee, Honglak and Pfister, Tomas},
  title = {Distilling Effective Supervision From Severe Label Noise},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks
Aditya Golatkar, Alessandro Achille, Stefano Soatto


We explore the problem of selectively forgetting a particular subset of the data used for training a deep neural network. While the effects of the data to be forgotten can be hidden from the output of the network, insights may still be gleaned by probing deep into its weights. We propose a method for "scrubbing" the weights clean of information about a particular set of training data. The method does not require retraining from scratch, nor access to the data originally used for training. Instead, the weights are modified so that any probing function of the weights is indistinguishable from the same function applied to the weights of a network trained without the data to be forgotten. This condition is a generalized and weaker form of Differential Privacy. Exploiting ideas related to the stability of stochastic gradient descent, we introduce an upper-bound on the amount of information remaining in the weights, which can be estimated efficiently even for deep neural networks.
[dataset, time, selective, extract] [confidence, denotes, seed, add] [model, trained, noise, original, stability, case, differential, attacker, definition, adding, robust] [figure, ldr, method, output, remove, proposed] [loss, introduce, train, variational] [forgetting, scrubbing, data, function, forget, random, training, procedure, deep, learning, cohort, neural, algorithm, network, readout, set, gradient, distribution, test, forgotten, optimal, class, subset, amount, arxiv, remaining, machine, fisher, log, preprint, stochastic, number, bound, proposition, retrain, problem, scrub, hessian, approximation, knowledge, quadratic, optimization, entropy, retain, scrubbed, membership, divergence, catastrophic, accuracy, simple, fixed, closer] [error, conference, assume, compute, point, local]
@InProceedings{Golatkar_2020_CVPR,
  author = {Golatkar, Aditya and Achille, Alessandro and Soatto, Stefano},
  title = {Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CenterMask: Single Shot Instance Segmentation With Point Representation
Yuqing Wang, Zhaoliang Xu, Hao Shen, Baoshan Cheng, Lirong Yang


In this paper, we propose a single-shot instance segmentation method, which is simple, fast and accurate. There are two main challenges for one-stage instance segmentation: object instances differentiation and pixel-wise feature alignment. Accordingly, we decompose the instance segmentation into two parallel subtasks: Local Shape prediction that separates instances even in overlapping conditions, and Global Saliency generation that segments the whole image in a pixel-to-pixel manner. The outputs of the two branches are assembled to form the final instance masks. To realize that, the local shape information is adopted from the representation of object center points. Totally trained from scratch and without any bells and whistles, the proposed CenterMask achieves 34.5 mask AP with a speed of 12.3 fps, using a single-model with single-scale training/testing on the challenging COCO dataset. The accuracy is higher than all other one-stage instance segmentation methods except the 5 times slower TensorMask, which shows the effectiveness of CenterMask. Besides, our method can be easily embedded into other one-stage object detectors such as FCOS and performs well, showing the generalization ability of CenterMask.
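The assembly step, multiplying a coarse per-instance local shape (resized to the instance box) with the corresponding crop of the class-agnostic global saliency map, can be sketched directly. The fixed S x S shape size, the threshold and all names are assumptions; only the resize-and-multiply structure comes from the abstract:

```python
import torch
import torch.nn.functional as F

def assemble_instance_mask(local_shape, global_saliency, box, thresh=0.5):
    """Combine a coarse S x S local shape with the global saliency map inside the box.

    local_shape     : [S, S] coarse mask predicted at the instance's centre point
    global_saliency : [H, W] pixel-wise foreground map for the whole image
    box             : (x1, y1, x2, y2) integer pixel coordinates of the instance
    """
    x1, y1, x2, y2 = box
    h, w = y2 - y1, x2 - x1
    local_resized = F.interpolate(local_shape[None, None], size=(h, w),
                                  mode='bilinear', align_corners=False)[0, 0]
    crop = global_saliency[y1:y2, x1:x2]
    mask = torch.zeros_like(global_saliency)
    mask[y1:y2, x1:x2] = (local_resized * crop > thresh).float()
    return mask

# toy usage
local = torch.rand(28, 28)
saliency = torch.rand(120, 160)
print(assemble_instance_mask(local, saliency, (40, 30, 90, 80)).sum())
```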
[represent, prediction, predict] [saliency, instance, mask, segmentation, object, centermask, global, branch, center, map, backbone, feature, table, final, head, detection, precise, predicted, ross, achieves, represents, coco, semantic, height, apm, apl, visualization, kaiming, realizes, fully, faster, offset, brings, piotr, assembled, challenging, effectiveness] [model, trained] [figure, comparison, proposed, method, performs, pixel, output, aps, convolutional, combination, parallel] [loss, representation, separate, image, realize, extracted, corresponding, generation] [size, performance, setting, number, binary, higher, function, compared] [shape, local, point, coarse, form, overlapping, predicts, single, well]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yuqing and Xu, Zhaoliang and Shen, Hao and Cheng, Baoshan and Yang, Lirong},
  title = {CenterMask: Single Shot Instance Segmentation With Point Representation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Mitigating Bias in Face Recognition Using Skewness-Aware Reinforcement Learning
Mei Wang, Weihong Deng


Racial equality is an important theme of international human rights law, but it has been largely obscured when the overall face recognition accuracy is pursued blindly. Mounting evidence indicates that racial bias indeed degrades the fairness of recognition systems, and the error rates on non-Caucasians are usually much higher than on Caucasians. To encourage fairness, we introduce the idea of adaptive margin to learn balanced performance for different races based on large margin losses. A reinforcement learning based race balance network (RL-RBN) is proposed. We formulate the process of finding the optimal margins for non-Caucasians as a Markov decision process and employ deep Q-learning to learn policies for an agent to select appropriate margin by approximating the Q-value function. Guided by the agent, the skewness of feature scatter between races can be reduced. Besides, we provide two ethnicity-aware training datasets, called BUPT-Globalface and BUPT-Balancedface dataset, which can be utilized to study racial bias from both data and algorithm aspects. Extensive experiments on RFW database show that RL-RBN successfully mitigates racial bias and learns more balanced performance.
[recognition, agent, action, policy, state, reinforcement, dataset, step, reward, current] [feature, aware, table, guided] [face, racial, race, rfw, arcface, binter, datasets, cosface, african, trained, asian, indian, ethnicity, skewness, difficult, caucasian, generalization, noise, debiased] [adaptive, based, gaussian, method, figure, ieee, formulated, blur] [train, loss, learn, domain, perform, gender] [margin, training, deep, bias, performance, balanced, learning, group, set, network, fairness, accuracy, arxiv, preprint, data, number, larger, optimal, algorithm, distribution, large, test, softmax, process, ratio] [distance, computed, conference, international, computer, angle, error]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Mei and Deng, Weihong},
  title = {Mitigating Bias in Face Recognition Using Skewness-Aware Reinforcement Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MineGAN: Effective Knowledge Transfer From GANs to Target Domains With Few Images
Yaxing Wang, Abel Gonzalez-Garcia, David Berga, Luis Herranz, Fahad Shahbaz Khan, Joost van de Weijer


One of the attractive characteristics of deep neural networks is their ability to transfer knowledge obtained in one domain to other related domains. As a result, high-quality networks can be trained in domains with relatively little training data. This property has been extensively studied for discriminative networks but has received significantly less attention for generative models. Given the often enormous effort required to train GANs, both computationally and in dataset collection, the re-use of pretrained GANs is a desirable objective. We propose a novel knowledge transfer method for generative models based on mining the knowledge that is most beneficial to a specific target domain, either from a single or multiple pretrained GANs. This is done using a miner network that identifies which part of the generative distribution of each pretrained GAN outputs samples closest to the target domain. Mining effectively steers GAN sampling towards suitable regions of the latent space, which facilitates the posterior finetuning and avoids pathologies of other methods such as mode collapse and lack of flexibility. We perform experiments on several complex datasets using various GAN architectures (BigGAN, Progressive GAN) and show that the proposed method, called MineGAN, effectively transfers knowledge to domains with few target images, outperforming existing methods. In addition, MineGAN can successfully transfer knowledge from multiple pretrained GANs. Our code is available at: https://github.com/yaxingwang/MineGAN.
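A skeletal PyTorch sketch of the mining idea: a small miner network re-maps input noise before it enters a frozen pretrained generator, and only the miner and a critic on target images are updated. The hinge losses, network sizes and stand-in modules are assumptions; this shows the shape of the training loop, not the authors' implementation:

```python
import torch
import torch.nn as nn

class Miner(nn.Module):
    """Small MLP that steers samples of a frozen pretrained generator towards a target domain."""
    def __init__(self, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, z_dim), nn.ReLU(),
                                 nn.Linear(z_dim, z_dim))

    def forward(self, z):
        return self.net(z)

def mining_step(miner, generator, critic, target_images, z_dim, opt_m, opt_c):
    """One hinge-loss adversarial update: only the miner and the critic are trained."""
    z = torch.randn(target_images.size(0), z_dim)
    fake = generator(miner(z))                       # generator weights stay frozen

    # critic update
    loss_c = (torch.relu(1 - critic(target_images)).mean() +
              torch.relu(1 + critic(fake.detach())).mean())
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # miner update
    loss_m = -critic(generator(miner(z))).mean()
    opt_m.zero_grad(); loss_m.backward(); opt_m.step()
    return loss_c.item(), loss_m.item()

# toy usage with stand-in generator / critic (a real setup would load a pretrained GAN and freeze it)
z_dim = 16
generator = nn.Linear(z_dim, 64)
for p in generator.parameters():
    p.requires_grad_(False)
critic = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
miner = Miner(z_dim)
opt_m = torch.optim.Adam(miner.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-4)
print(mining_step(miner, generator, critic, torch.randn(8, 64), z_dim, opt_m, opt_c))
```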
[multiple, red] [table, including] [adversarial, trained, model, input, quality, identifies, datasets, targeted] [method, prior, figure, proposed, based, convolutional] [target, pretrained, minegan, gan, generative, generator, transfer, transfergan, miner, generate, image, critic, gans, real, generated, transferring, conditional, progressive, latent, ptdata, fid, kmmd, domain, generation, fake, ppgn, introduce, discriminator, pdata, bsa, source, distinguish, variable, train, mode, generates, biggan, loss] [distribution, knowledge, training, mining, data, deep, finetuning, scratch, class, sampling, set, batch, sample, network, normalization, imagenet, label, neural, learning, consider, architecture, learned, small, closer, lower, improved] [single, approach]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yaxing and Gonzalez-Garcia, Abel and Berga, David and Herranz, Luis and Khan, Fahad Shahbaz and Weijer, Joost van de},
  title = {MineGAN: Effective Knowledge Transfer From GANs to Target Domains With Few Images},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DLWL: Improving Detection for Lowshot Classes With Weakly Labelled Data
Vignesh Ramanathan, Rui Wang, Dhruv Mahajan


Large detection datasets have a long tail of lowshot classes with very few bounding box annotations. We wish to improve detection for lowshot classes with weakly labelled web-scale datasets only having image-level labels. This requires a detection framework that can be jointly trained with limited number of bounding box annotated images and large number of weakly labelled images. Towards this end, we propose a modification to the FRCNN model to automatically infer label assignment for objects proposals from weakly labelled images during training. We pose this label assignment as a Linear Program with constraints on the number and overlap of object instances in an image. We show that this can be solved efficiently during training for weakly labelled images. Compared to just training with few annotated examples, augmenting with weakly labelled examples in our framework provides significant gains. We demonstrate this on the LVIS dataset with a 3.5 point gain in AP, as well as on different lowshot variants of the COCO dataset. We provide a thorough analysis of the effect of amount of weakly labelled and fully labelled data required to train the detection model. Our DLWL framework can also outperform self-supervised baselines like omni-supervision for lowshot classes.
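One way to read the label-assignment step is as a small linear program over proposal-to-class assignment variables, solvable per image during training. The particular constraints below (at least one proposal per image-level class, each proposal used at most once) are an illustrative guess at the "number and overlap" constraints, not the exact program from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def assign_weak_labels(scores, min_per_class=1):
    """Relaxed LP assignment of proposals to image-level classes.

    scores : [P, C] detector scores for P proposals and the C classes known to be in the image.
    """
    P, C = scores.shape
    c = -scores.reshape(-1)                       # maximise total assigned score
    A, b = [], []
    for cls in range(C):                          # at least min_per_class proposals per class
        row = np.zeros(P * C); row[cls::C] = -1
        A.append(row); b.append(-min_per_class)
    for p in range(P):                            # each proposal assigned to at most one class
        row = np.zeros(P * C); row[p * C:(p + 1) * C] = 1
        A.append(row); b.append(1)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=(0, 1))
    return res.x.reshape(P, C)

# toy usage: 5 proposals, 2 image-level classes
y = assign_weak_labels(np.random.rand(5, 2))
print(y.round(2))
```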
[dataset, work, multiple, dog] [weakly, labelled, object, lowshot, detection, bounding, box, frcnn, supervision, proposal, weak, fully, assignment, dlwl, rare, segmentation, instance, highshot, lvis, framework, coco, augmenting, pascal, threshold, localization, main, score] [model, improve, trained, datasets] [ieee, pattern, based, figure, noisy, analysis] [supervised, image, train, augment, loss, cat] [class, training, number, data, label, performance, learning, large, average, compared, network, linear, standard, better, observe, count, amount, classification, optimization, set, note, arxiv, preprint, problem, gain, lead] [computer, conference, additional, vision, international, approach, initial, program, well, refer]
@InProceedings{Ramanathan_2020_CVPR,
  author = {Ramanathan, Vignesh and Wang, Rui and Mahajan, Dhruv},
  title = {DLWL: Improving Detection for Lowshot Classes With Weakly Labelled Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Deep Shape Descriptor With Point Distribution Learning
Yi Shi, Mengchen Xu, Shuaihang Yuan, Yi Fang


Deep learning models have achieved great success in supervised shape descriptor learning for 3D shape retrieval, classification, and correspondence. However, unsupervised shape descriptors calculated via deep learning are less studied than their supervised counterparts due to the design challenges of unsupervised neural network architectures. This paper proposes a novel probabilistic framework for the learning of unsupervised deep shape descriptors with point distribution learning. In our approach, we first associate each point with a Gaussian, and the point clouds are modeled as the distribution of the points. We then use deep neural networks (DNNs) to model a maximum likelihood estimation process that is traditionally solved with an iterative Expectation-Maximization (EM) process. Our key novelty is that "training" these DNNs with an unsupervised self-correspondence L2 distance loss elegantly reveals a statistically significant deep shape descriptor representation for the distribution of the point clouds. We have conducted experiments over various 3D datasets. Qualitative and quantitative comparisons demonstrate that our proposed method achieves superior classification performance over existing unsupervised 3D shape descriptors. In addition, we verified the following attractive properties of our shape descriptor through experiments: multi-scale shape representation, robustness to shape rotation, and robustness to noise.
[decoder, recognition, evaluation, describe, represent] [instance, feature, object, global, level] [model, trained, noise, robust, deviation, original, major, great, developed] [gaussian, figure, ieee, proposed, pattern, based, likelihood, convolutional, method, optimized] [unsupervised, representation, corresponding, loss, generative, image, synthesized, supervised, generated, encoder, learns, mapping, generate, learn, latent] [distribution, learning, deep, data, neural, performance, standard, training, set, classification, network, sampled, process, processing, space, learned, entire, sampling, accuracy, experiment, calculated, probabilistic, maximum, probability, fixed] [shape, point, descriptor, computer, cloud, conference, geometric, vision, distance, approach, reconstruction, volumetric, rotation, international, estimation]
@InProceedings{Shi_2020_CVPR,
  author = {Shi, Yi and Xu, Mengchen and Yuan, Shuaihang and Fang, Yi},
  title = {Unsupervised Deep Shape Descriptor With Point Distribution Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Stylization-Based Architecture for Fast Deep Exemplar Colorization
Zhongyou Xu, Tingting Wang, Faming Fang, Yun Sheng, Guixu Zhang


Exemplar-based colorization aims to add colors to a grayscale image guided by a content-related reference image. Existing methods are either sensitive to the selection of reference images (content, position) or extremely time and resource consuming, which limits their practical application. To tackle these problems, we propose a deep exemplar colorization architecture inspired by the characteristics of stylization in feature extracting and blending. Our coarse-to-fine architecture consists of two parts: a fast transfer sub-net and a robust colorization sub-net. The transfer sub-net obtains a coarse chrominance map via matching basic feature statistics of the input pairs in a progressive way. The colorization sub-net refines the map to generate the final results. The proposed end-to-end network can jointly learn faithful colorization with a related reference and plausible color prediction with unrelated reference. Extensive experimental validation demonstrates that our approach outperforms the state-of-the-art methods in less time whether in exemplar-based colorization or image stylization tasks.
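The statistic-matching step behind stylization-style transfer amounts to aligning channel-wise means and standard deviations between the input's features and the reference's features. A minimal AdaIN-like sketch of that single operation follows; the progressive, multi-layer scheme of the transfer sub-net itself is not reproduced here.

```python
import torch

def match_statistics(content_feat, reference_feat, eps=1e-5):
    """Align channel-wise mean/std of content features to a reference (AdaIN-style).

    content_feat, reference_feat: tensors of shape (N, C, H, W) from the same encoder layer.
    """
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    r_mean = reference_feat.mean(dim=(2, 3), keepdim=True)
    r_std = reference_feat.std(dim=(2, 3), keepdim=True) + eps
    return (content_feat - c_mean) / c_std * r_std + r_mean
```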
[time, dataset, automatic, inspired, previous, decoder] [feature, map, propose, propagate, level, semantic, final, alleviate] [input, trained, model, datasets] [reference, result, color, gray, method, ieee, proposed, figure, chrominance, comparison, fast, based, output, net, perceptual] [colorization, image, transfer, target, stylization, style, adain, photorealistic, user, content, photowct, loss, unrelated, tab, encoder, grayscale, consists, plausible, generate, learn, colorized, corresponding, arbitrary, introduce, semantically] [network, deep, architecture, learning, achieve, compared, layer, task, operation, similarity, training, problem, applied, neural] [matching, coarse, initial, acm, well, computer, single, structure]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Zhongyou and Wang, Tingting and Fang, Faming and Sheng, Yun and Zhang, Guixu},
  title = {Stylization-Based Architecture for Fast Deep Exemplar Colorization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cars Can't Fly Up in the Sky: Improving Urban-Scene Segmentation via Height-Driven Attention Networks
Sungha Choi, Joanne T. Kim, Jaegul Choo


This paper exploits the intrinsic features of urban-scene images and proposes a general add-on module, called height-driven attention networks (HANet), for improving semantic segmentation for urban-scene images. It emphasizes informative features or classes selectively according to the vertical position of a pixel. The pixel-wise class distributions are significantly different from each other among horizontally segmented sections in the urban-scene images. Likewise, urban-scene images have their own distinct characteristics, but most semantic segmentation networks do not reflect such unique attributes in the architecture. The proposed network architecture incorporates the capability exploiting the attributes to handle the urban scene dataset effectively. We validate the consistent performance (mIoU) increase of various semantic segmentation models on two datasets when HANet is adopted. This extensive quantitative analysis demonstrates that adding our module to existing models is easy and cost-effective. Our method achieves a new state-of-the-art performance on the Cityscapes benchmark with a large margin among ResNet101 based segmentation models. Also, we show that the proposed model is coherent with the facts observed in the urban scene by visualizing and interpreting the attention map. Our code and trained models are publicly available.
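A minimal sketch of a height-driven attention module: pool the feature map over the width, predict per-row channel weights with a small 1D network, and rescale the features row by row. The layer sizes and the average-pooling choice are illustrative assumptions, not the exact HANet design.

```python
import torch
import torch.nn as nn

class HeightAttention(nn.Module):
    """Per-row (height) channel attention for urban-scene features (illustrative sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (N, C, H, W)
        rows = x.mean(dim=3)              # pool over width -> (N, C, H)
        attn = self.fc(rows)              # per-height channel weights in [0, 1]
        return x * attn.unsqueeze(3)      # broadcast the row-wise weights over the width
```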
[attention, recognition, positional, context, dataset, multiple, road, urban, work, three] [semantic, hanet, segmentation, map, feature, pooling, table, backbone, height, denotes, including, region, miou, atrous, assigned, contextual, global, stride, fully, horizontal, main] [model, middle, adding, korea] [ieee, pattern, convolutional, proposed, spatial, output, method, comparison, based, figure, analysis, intermediate, channel] [image, row, sky] [class, baseline, average, performance, entire, lower, number, set, size, learning, validation, architecture, distribution, upper, training, deep, layer, network, entropy, neural, small, classification, scaling, wide] [computer, vision, conference, vertical, international, position, scene]
@InProceedings{Choi_2020_CVPR,
  author = {Choi, Sungha and Kim, Joanne T. and Choo, Jaegul},
  title = {Cars Can't Fly Up in the Sky: Improving Urban-Scene Segmentation via Height-Driven Attention Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
State-Aware Tracker for Real-Time Video Object Segmentation
Xi Chen, Zuoxin Li, Ye Yuan, Gang Yu, Jianxin Shen, Donglian Qi


In this work, we address the task of semi-supervised video object segmentation (VOS) and explore how to make efficient use of video properties to tackle the challenge of semi-supervision. We propose a novel pipeline called State-Aware Tracker (SAT), which can produce accurate segmentation results with real-time speed. For higher efficiency, SAT takes advantage of the inter-frame consistency and deals with each target object as a tracklet. For more stable and robust performance over video sequences, SAT gains awareness of each state and adapts itself via two feedback loops. One loop assists SAT in generating more stable tracklets. The other loop helps to construct a more robust and holistic target representation. SAT achieves a promising result of 72.3% J&F mean at 39 FPS on the DAVIS 2017-Val dataset, which shows a decent trade-off between efficiency and accuracy.
[video, state, frame, modeling, current, previous, predict, visual, temporal, provide, construct, speed, order] [object, segmentation, global, mask, feature, sat, denotes, region, score, tracking, saliency, predicted, vos, box, brings, bounding, tracker, achieves, apply, background, table, fps, challenge, propose, davis, regression, head] [robust, feedback, offline, model, input] [ieee, method, pattern, fast, result, cropping, high, figure, based, enhance, crop, version] [target, encoder, representation, image, appearance] [network, search, strategy, online, abnormal, similarity, training, accuracy, process, learning, stable, performance, filter, task, efficient, binary, set, data] [computer, conference, loop, vision, joint, accurate, estimation, normal, pipeline, estimator]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Xi and Li, Zuoxin and Yuan, Ye and Yu, Gang and Shen, Jianxin and Qi, Donglian},
  title = {State-Aware Tracker for Real-Time Video Object Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Iteratively-Refined Interactive 3D Medical Image Segmentation With Multi-Agent Reinforcement Learning
Xuan Liao, Wenhao Li, Qisen Xu, Xiangfeng Wang, Bo Jin, Xiaoyun Zhang, Yanfeng Wang, Ya Zhang


Existing automatic 3D image segmentation methods usually fail to meet the demands of clinical use. Many studies have explored an interactive strategy to improve the image segmentation performance by iteratively incorporating user hints. However, the dynamic process for successive interactions is largely ignored. We here propose to model the dynamic process of iterative interactive image segmentation as a Markov decision process (MDP) and solve it with reinforcement learning (RL). Unfortunately, it is intractable to use single-agent RL for voxel-wise prediction due to the large exploration space. To reduce the exploration space to a tractable size, we treat each voxel as an agent with a shared voxel-level behavior strategy so that it can be solved with multi-agent reinforcement learning. An additional advantage of this multi-agent model is to capture the dependency among voxels for the segmentation task. Meanwhile, to enrich the information of previous segmentations, we reserve the prediction uncertainty in the state space of the MDP and derive an adjustment action space leading to a more precise and finer segmentation. In addition, to improve the efficiency of exploration, we design a relative cross-entropy gain-based reward to update the policy in a constrained direction. Experimental results on various medical datasets have shown that our method significantly outperforms existing state-of-the-art methods, with the advantage of fewer interactions and faster convergence.
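One plausible reading of the relative cross-entropy gain reward is the per-voxel decrease in cross-entropy after an adjustment action, normalized by the previous error; the exact form used in the paper may differ from this sketch.

```python
import numpy as np

def relative_ce_gain_reward(prev_prob, cur_prob, gt, eps=1e-6):
    """Per-voxel reward as the relative decrease in cross-entropy after one refinement step.

    prev_prob, cur_prob: predicted foreground probabilities before/after the action.
    gt: binary ground-truth mask. All arrays share the same shape.
    """
    ce = lambda p: -(gt * np.log(p + eps) + (1 - gt) * np.log(1 - p + eps))
    prev_ce, cur_ce = ce(prev_prob), ce(cur_prob)
    return (prev_ce - cur_ce) / (prev_ce + eps)   # positive wherever the segmentation improved
```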
[hint, action, previous, reward, interaction, prediction, step, state, current, agent, policy, reinforcement, successive, dataset, three, sequence, actor, time, considering] [segmentation, interactive, refinement, map, table, click, object, improvement, head, propose, visualization, precise] [model, datasets, accumulated, experimental, testing, influence, improve, iterative] [medical, method, based, adjustment, convolutional, block, existing, dynamic, brain, mri, figure, advantage, result] [image, user] [probability, update, performance, network, set, training, learning, process, space, better, large, binary, algorithm, gain, number, good, neural] [initial, relative, ground, voxel, coarse, voxels, truth, iteratively, geodesic, distance, uncertainty, finer, error]
@InProceedings{Liao_2020_CVPR,
  author = {Liao, Xuan and Li, Wenhao and Xu, Qisen and Wang, Xiangfeng and Jin, Bo and Zhang, Xiaoyun and Wang, Yanfeng and Zhang, Ya},
  title = {Iteratively-Refined Interactive 3D Medical Image Segmentation With Multi-Agent Reinforcement Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ENSEI: Efficient Secure Inference via Frequency-Domain Homomorphic Convolution for Privacy-Preserving Visual Recognition
Song Bian, Tianchen Wang, Masayuki Hiromoto, Yiyu Shi, Takashi Sato


In this work, we propose ENSEI, a secure inference (SI) framework based on the frequency-domain secure convolution (FDSC) protocol for the efficient execution of image inference in the encrypted domain. Our observation is that, under the combination of homomorphic encryption and secret sharing, homomorphic convolution can be obliviously carried out in the frequency domain, significantly simplifying the related computations. We provide protocol designs and parameter derivations for number-theoretic transform (NTT) based FDSC. In the experiment, we thoroughly study the accuracy-efficiency trade-offs between time- and frequency-domain homomorphic convolution. With ENSEI, compared to the best known works, we achieve 5-11x online time reduction, up to 33x setup time reduction, and up to 10x reduction in the overall inference time. A further 33% of bandwidth reductions can be obtained on binary neural networks with only 3% of accuracy degradation on the CIFAR-10 dataset.
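The protocol rests on the convolution theorem: after a suitable transform (an NTT over a finite field in the secure setting, a DFT in the clear), convolution becomes an element-wise product. The NumPy check below illustrates only this plaintext equivalence, not the homomorphic or secret-shared arithmetic.

```python
import numpy as np

def circular_conv_direct(x, w):
    """Circular convolution computed directly from its definition."""
    n = len(x)
    return np.array([sum(x[(i - k) % n] * w[k] for k in range(n)) for i in range(n)])

def circular_conv_fft(x, w):
    """Element-wise product in the transform domain == circular convolution in the signal domain."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w)))

x, w = np.random.randn(8), np.random.randn(8)
assert np.allclose(circular_conv_direct(x, w), circular_conv_fft(x, w))
```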
[time, three, prediction, visual, recognition] [table, main, key, cnn] [protocol, security, privacy, input] [homomorphic, convolution, secure, ensei, alice, ciphertext, ntt, bob, modulus, mod, gazelle, encryption, plaintext, secret, relu, dft, ieee, based, pahe, oblivious, high, frequency, transform, proposed, conv, result, rlwe, yiyu, field, cryptographic] [image, transformed] [neural, inference, accuracy, network, parameter, multiplication, architecture, filter, efficient, deep, observe, vector, dimension, binary, computational, complexity, set, encrypted, precision, arxiv, preprint, general, weight, setup, design, machine, learning, performance, matrix, operation, sharing, number, product, larger, smaller, small, online, reduction, scheme, procedure] [conference, computer, complex]
@InProceedings{Bian_2020_CVPR,
  author = {Bian, Song and Wang, Tianchen and Hiromoto, Masayuki and Shi, Yiyu and Sato, Takashi},
  title = {ENSEI: Efficient Secure Inference via Frequency-Domain Homomorphic Convolution for Privacy-Preserving Visual Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Scale Interactive Network for Salient Object Detection
Youwei Pang, Xiaoqi Zhao, Lihe Zhang, Huchuan Lu


Deep-learning-based salient object detection methods have achieved great progress. However, the variable scale and unknown category of salient objects remain great challenges. These are closely related to the utilization of multi-level and multi-scale features. In this paper, we propose the aggregate interaction modules to integrate the features from adjacent levels, in which less noise is introduced because of only using small up-/down-sampling rates. To obtain more efficient multi-scale features from the integrated features, the self-interaction modules are embedded in each decoder unit. Besides, the class imbalance issue caused by the scale variation weakens the effect of the binary cross entropy loss and results in the spatial inconsistency of the predictions. Therefore, we exploit the consistency-enhanced loss to highlight the fore-/back-ground difference and preserve the intra-class consistency. Experimental results on five benchmark datasets demonstrate that the proposed method without any post-processing performs favorably against 23 state-of-the-art approaches. The source code will be publicly available at https://github.com/lartpang/MINet.
[ucf, attention, interaction, prediction, visual, extract, integrate, decoder] [salient, object, saliency, detection, feature, huchuan, cpd, afnet, branch, pagr, module, amulet, egnet, picanet, msrnet, nldf, sims, favg, mlmsnet, propose, foreground, aggregate, pyramid, recall, lihe, region, sim, fmax, threshold, ali, interactive] [model, improve, variation, deal, input] [proposed, scale, convolutional, spatial, mae, based, method, figure, resolution, residual, adjacent, existing, fusion, ieee] [loss, image, issue, cross, cel] [network, learning, deep, layer, strategy, training, better, precision, imbalance, computational, binary, large, performance, function, evaluate] [ground]
@InProceedings{Pang_2020_CVPR,
  author = {Pang, Youwei and Zhao, Xiaoqi and Zhang, Lihe and Lu, Huchuan},
  title = {Multi-Scale Interactive Network for Salient Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Interactive Multi-Label CNN Learning With Partial Labels
Dat Huynh, Ehsan Elhamifar


We address the problem of efficient end-to-end learning of a multi-label Convolutional Neural Network (CNN) on training images with partial labels. Training a CNN with partial labels, hence a small number of images for every label, using the standard cross-entropy loss is prone to overfitting and performance drop. We introduce a new loss function that regularizes the cross-entropy loss with a cost function that measures the smoothness of labels and features of images on the data manifold. Given that optimizing the new loss function over the CNN parameters requires learning similarities among labels and images, which itself depends on knowing the parameters of the CNN, we develop an efficient interactive learning framework in which the two steps of similarity learning and CNN training interact and improve the performance of each other. Our method learns the CNN parameters without requiring all training data to be kept in memory, learns a few informative similarities only for images in each mini-batch, and handles changing feature representations. By extensive experiments on the Open Images, CUB and MS-COCO datasets, we demonstrate the effectiveness of our method. In particular, on the large-scale Open Images dataset, we improve the state of the art by 1.02% in mAP score over 5,000 classes.
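A rough sketch of the loss structure: binary cross-entropy on the observed labels plus a smoothness term that encourages images with similar features in the mini-batch to receive similar label predictions. The batch affinity and weighting below are placeholders, not the paper's learned similarities.

```python
import torch
import torch.nn.functional as F

def partial_label_loss(logits, labels, features, lam=0.1):
    """Cross-entropy on observed labels plus a manifold-smoothness regularizer (sketch).

    logits:   (B, C) predictions.
    labels:   (B, C) with 1 = positive, 0 = negative, -1 = unknown (partial labels).
    features: (B, D) image embeddings used to build a batch similarity graph.
    """
    observed = labels >= 0
    ce = F.binary_cross_entropy_with_logits(logits[observed], labels[observed].float())

    probs = torch.sigmoid(logits)
    sim = torch.softmax(features @ features.t(), dim=1)        # batch affinity (illustrative)
    smooth = (sim * torch.cdist(probs, probs).pow(2)).mean()   # nearby images -> similar labels
    return ce + lam * smooth
```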
[graph, dataset, artificial, prediction, recognition, work, visual, aaai, dependency] [cnn, map, framework, feature, interactive, score, table, positive, labeling, improvement] [model, improve] [method, ieee, pattern, noisy, figure, proposed, adaptive] [image, loss, missing, learn, cub, curriculum, attribute, notice, representation, fixing] [learning, label, training, performance, similarity, number, function, open, data, set, large, learned, classifier, class, logistic, group, better, problem, neural, vector, algorithm, validation, efficient, requires, classification, multilabel, fixed, regularization, small, knowing, negative, unlabeled, find, denote] [conference, computer, partial, smoothness, vision, international, require, allows, define]
@InProceedings{Huynh_2020_CVPR,
  author = {Huynh, Dat and Elhamifar, Ehsan},
  title = {Interactive Multi-Label CNN Learning With Partial Labels},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ViewAL: Active Learning With Viewpoint Entropy for Semantic Segmentation
Yawar Siddiqui, Julien Valentin, Matthias Niessner


We propose ViewAL, a novel active learning strategy for semantic segmentation that exploits viewpoint consistency in multi-view datasets. Our core idea is that inconsistencies in model predictions across viewpoints provide a very reliable measure of uncertainty and encourage the model to perform well irrespective of the viewpoint under which objects are observed. To incorporate this uncertainty measure, we introduce a new viewpoint entropy formulation, which is the basis of our active learning strategy. In addition, we propose uncertainty computations on a superpixel level, which exploits inherently localized signal in the segmentation task, directly lowering the annotation costs. This combination of viewpoint entropy and the use of superpixels allows us to efficiently select samples that are highly informative for improving the network. We demonstrate that our proposed active learning strategy not only yields the best-performing models for the same amount of required labeled data, but also significantly reduces labeling effort. For instance, our method achieves 95% of maximum achievable network performance using only 7%, 17%, and 24% labeled data on SceneNet-RGBD, ScanNet, and Matterport3D, respectively. On these datasets, the best state-of-the-art method achieves the same performance with 14%, 27% and 33% labeled data. Finally, we demonstrate that labeling using superpixels yields the same quality of ground-truth compared to labeling whole images, but requires 25% less time.
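One way to realize the viewpoint-based uncertainty is to average a surface point's class distribution over all views that observe it and score it by the entropy of that average (aggregated over superpixels in the paper). A minimal sketch, assuming the cross-view correspondences have already been computed:

```python
import numpy as np

def viewpoint_entropy(view_probs):
    """Entropy of the mean class distribution across views observing the same surface point.

    view_probs: (V, C) softmax outputs of the same point seen from V viewpoints.
    High values mean the views disagree, so labeling this point is informative.
    """
    mean_p = view_probs.mean(axis=0)
    return float(-(mean_p * np.log(mean_p + 1e-12)).sum())
```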
[dataset, time, associated, prediction, multiple, work, current] [superpixels, semantic, segmentation, score, superpixel, labeling, miou, object, annotation, achieves, propose, detection, improves] [model, effort, trained, coming] [method, ieee, pattern, figure, convolutional, pixel, based] [image] [learning, active, entropy, labeled, performance, divergence, data, selection, dropout, network, deep, label, unlabeled, probability, set, selected, machine, neural, maximum, expected, softmax, training, random, arxiv, preprint, select, average, classification, class, distribution, andrew, viewal, sampling, strategy, informative, subset, number] [view, conference, computer, uncertainty, vision, scannet, international, approach, indoor, european, viewpoint, ground, pose, truth, scene]
@InProceedings{Siddiqui_2020_CVPR,
  author = {Siddiqui, Yawar and Valentin, Julien and Niessner, Matthias},
  title = {ViewAL: Active Learning With Viewpoint Entropy for Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Scene-Adaptive Video Frame Interpolation via Meta-Learning
Myungsub Choi, Janghoon Choi, Sungyong Baik, Tae Hyun Kim, Kyoung Mu Lee


Video frame interpolation is a challenging problem because there are different scenarios for each video depending on the variety of foreground and background motion, frame rate, and occlusion. It is therefore difficult for a single network with fixed parameters to generalize across different videos. Ideally, one could have a different network for each scenario, but this is computationally infeasible for practical applications. In this work, we propose to adapt the model to each video by making use of additional information that is readily available at test time and yet has not been exploited in previous works. We first show the benefits of 'test-time adaptation' through simple fine-tuning of a network, then we greatly improve its efficiency by incorporating meta-learning. We obtain significant performance gains with only a single gradient update without any additional parameters. Finally, we show that our meta-learning framework can be easily employed to any video frame interpolation network and can consistently improve its performance on multiple benchmark datasets.
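The test-time adaptation amounts to a single MAML-style inner gradient step on frames that the test video itself provides (reproducing an existing middle frame from its neighbors) before interpolating unseen frames. The model signature and the L1 inner loss below are assumptions for illustration.

```python
import torch

def adapt_and_interpolate(model, frames, lr=1e-5):
    """One inner-loop update on the test video, then interpolate (illustrative sketch).

    frames: list of consecutive frames [f0, f1, f2, f3]; synthesizing the known frame f2
    from (f1, f3) is a self-supervised target that is available at test time.
    """
    pred = model(frames[1], frames[3])                    # try to reproduce the known frame f2
    inner_loss = torch.nn.functional.l1_loss(pred, frames[2])

    grads = torch.autograd.grad(inner_loss, list(model.parameters()))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):       # single gradient step, no extra parameters
            p -= lr * g

    return model(frames[2], frames[3])                    # interpolate an unseen in-between frame
```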
[frame, video, time, dataset, sequence, multiple, step, lin, goal] [framework, table, final] [model, input, original, improve] [interpolation, motion, sepconv, proposed, method, flow, psnr, intermediate, adaptive, existing, optical, figure, based, prior, feasibility, low] [adaptation, loss, image, train, qualitative] [test, performance, learning, baseline, training, algorithm, update, gradient, inner, network, task, number, dti, vimeoseptuplet, adapt, large, set, maml, deep, better, neural, note, outer, lti, adapted, problem, data, process, small, achieve, greatly, overfitting, knowledge, optimization] [loop, single, additional, estimation, well]
@InProceedings{Choi_2020_CVPR,
  author = {Choi, Myungsub and Choi, Janghoon and Baik, Sungyong and Kim, Tae Hyun and Lee, Kyoung Mu},
  title = {Scene-Adaptive Video Frame Interpolation via Meta-Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Action Segmentation With Joint Self-Supervised Temporal Domain Adaptation
Min-Hung Chen, Baopu Li, Yingze Bao, Ghassan AlRegib, Zsolt Kira


Despite the recent progress of fully-supervised action segmentation techniques, the performance is still not fully satisfactory. One main challenge is the problem of spatiotemporal variations (e.g. different people may perform the same activity in various ways). Therefore, we exploit unlabeled videos to address this problem by reformulating the action segmentation task as a cross-domain problem with domain discrepancy caused by spatio-temporal variations. To reduce the discrepancy, we propose SelfSupervised Temporal Domain Adaptation (SSTDA), which contains two self-supervised auxiliary tasks (binary and sequential domain prediction) to jointly align cross-domain feature spaces embedded with local and global temporal dynamics, achieving better performance than other Domain Adaptation (DA) approaches. On three challenging benchmark datasets (GTEA, 50Salads, and Breakfast), SSTDA outperforms the current state-of-the-art method by large margins (e.g. for the F1@25 score, from 59.6% to 69.1% on Breakfast, from 73.4% to 81.5% on 50Salads, and from 83.6% to 89.1% on GTEA), and requires only 65% of the labeled training data for comparable performance, demonstrating the usefulness of adapting to unlabeled target videos across variations. The source code is available at https://github.com/cmhungsteve/SSTDA.
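Both auxiliary tasks are domain classifiers trained adversarially against the feature extractor, commonly implemented with a gradient reversal layer. A minimal sketch of the binary (frame-level) variant follows; the sequential domain prediction task is omitted, and the classifier head is assumed.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def binary_domain_loss(features, domain_label, classifier, lam=1.0):
    """Predict source (0) vs. target (1) from frame features through a reversed gradient,
    so minimizing this loss pushes the feature extractor toward domain invariance."""
    logits = classifier(GradReverse.apply(features, lam))          # classifier outputs (B, 1)
    target = torch.full((features.size(0), 1), float(domain_label))
    return nn.functional.binary_cross_entropy_with_logits(logits, target)
```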
[temporal, action, sstda, video, prediction, sequential, recognition, three, outperforms, predict, lld, spatiotemporal, untrimmed, integrating, integrate, ggd, embedded, current, long, multiple] [segmentation, feature, table, global, main, segment, propose, fully, stage] [auxiliary, model, datasets, adversarial, acc, effectively] [ieee, pattern, figure, proposed, method, convolution] [domain, target, source, adaptation, edit, address, discrepancy, loss, unsupervised, learn, align, lgd] [learning, performance, binary, unlabeled, training, task, baseline, problem, labeled, data, number, design, permutation, achieve, large, best, classifier] [conference, local, vision, computer, human, approach, international, jointly, compare, ground]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Min-Hung and Li, Baopu and Bao, Yingze and AlRegib, Ghassan and Kira, Zsolt},
  title = {Action Segmentation With Joint Self-Supervised Temporal Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Pixel Consensus Voting for Panoptic Segmentation
Haochen Wang, Ruotian Luo, Michael Maire, Greg Shakhnarovich


The core of our approach, Pixel Consensus Voting, is a framework for instance segmentation based on the generalized Hough transform. Pixels cast discretized, probabilistic votes for the likely regions that contain instance centroids. At the detected peaks that emerge in the voting heatmap, backprojection is applied to collect pixels and produce instance masks. Unlike a sliding window detector that densely enumerates object proposals, our method detects instances as a result of the consensus among pixel-wise votes. We implement vote aggregation and backprojection using native operators of a convolutional neural network. The discretization of centroid voting reduces the training of instance segmentation to pixel labeling, analogous and complementary to FCN-style semantic segmentation, leading to an efficient and unified architecture that jointly models things and stuff. We demonstrate the effectiveness of our pipeline on COCO and Cityscapes Panoptic Segmentation and obtain competitive results. Code will be open-sourced.
[work, prediction, step, oracle, length] [voting, instance, segmentation, object, detection, coco, panoptic, semantic, pcv, segment, vote, mask, region, peak, val, resnet, table, hough, discretization, centroid, branch, ross, consensus, backprojection, stuff, feature, center, kaiming, pqth, category, location, piotr, aggregation, offset, pqst] [query, input, model, trained, argmax] [pixel, ieee, pattern, spatial, figure, convolutional, dilated, transform, cell] [loss, image, generalized, produce] [filter, size, classification, network, training, learning, performance, small, default, set, simple, arxiv, preprint, large, normalization, deep, neural, efficient, clustering] [computer, conference, grid, vision, single, ground, international, truth, european, approach, predicts, pose]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Haochen and Luo, Ruotian and Maire, Michael and Shakhnarovich, Greg},
  title = {Pixel Consensus Voting for Panoptic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Minimizing Discrete Total Curvature for Image Processing
Qiuxiang Zhong, Yutong Li, Yijie Yang, Yuping Duan


Curvature regularities have received growing attention for providing strong priors on the continuity of edges in image processing applications. However, owing to the non-convex and non-smooth properties of the high-order regularizer, the numerical solution becomes challenging in real-time tasks. In this paper, we propose a novel curvature regularity, the total curvature (TC), by minimizing the normal curvatures along different directions. We estimate the normal curvatures discretely in the local neighborhood according to differential geometry theory. The resulting curvature regularity can be regarded as a re-weighted total variation (TV) minimization problem, which can be efficiently solved by an alternating direction method of multipliers (ADMM) based algorithm. By comparing with TV and Euler's elastica energy, we demonstrate the effectiveness and superiority of the total curvature regularity for various image processing applications.
[three, time, tmax] [segmentation, cpu, level, siam, mask, table, center] [model, noise, tony, variation] [figure, journal, ieee, proposed, psnr, based, denoising, half, intensity, ssim, imaging, method, fast, gaussian, neighboring, color, admm] [image, inpainting, minimizing, corresponding, introduce, preserve, variational] [total, minimization, algorithm, processing, discrete, set, energy, applied, efficiently, function, linear, problem, regularization, better, approximated, compared] [curvature, elastica, normal, point, tangent, regularity, computer, numerical, conference, well, estimated, surface, direction, defined, plane, compute, local, grid, fundamental, solved, geometric, solve, second, trv, international, demonstrate, smooth, arclength, vision, smoother, approach]
@InProceedings{Zhong_2020_CVPR,
  author = {Zhong, Qiuxiang and Li, Yutong and Yang, Yijie and Duan, Yuping},
  title = {Minimizing Discrete Total Curvature for Image Processing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Robust Image Classification Using Sequential Attention Models
Daniel Zoran, Mike Chrzanowski, Po-Sen Huang, Sven Gowal, Alex Mott, Pushmeet Kohli


In this paper we propose to augment a modern neural-network architecture with an attention model inspired by human perception. Specifically, we adversarially train and analyze a neural model incorporating a human inspired, visual attention component that is guided by a recurrent top-down sequential process. Our experimental evaluation uncovers several notable findings about the robustness and behavior of this new model. First, introducing attention to the model significantly improves adversarial robustness resulting in state-of-the-art ImageNet accuracies under a wide range of random targeted attack strengths. Second, we show that by varying the number of attention steps (glances/fixations) for which the model is unrolled, we are able to make its defense capabilities stronger, even in light of stronger attacks --- resulting in a "computational race" between the attacker and the defender. Finally, we show that some of the adversarial examples generated by attacking our model are quite different from conventional adversarial examples --- they contain global, salient and spatially coherent structures coming from the target class that would be recognizable even to a human, and work by distracting the attention of the model away from the main object in the original image.
[attention, visual, time, lstm, step, sequential, recurrent, work, dataset, answer] [object, resnet, map, main, table, salient, improves] [model, adversarial, attack, pgd, robustness, trained, input, success, targeted, risk, robust, adversarially, spsa, denoise, perturbation, query, stronger, primate, strong, nature, nominal, ian] [spatial, figure, output, ieee, tensor, convolutional] [image, generated, target, source, loss, produce, train] [neural, arxiv, preprint, training, gradient, imagenet, accuracy, network, number, processing, learning, deep, random, class, better, classification, performance, note, bottleneck, alex, large, vector] [human, vision, structure, computer, conference, coherent, basis, allows]
@InProceedings{Zoran_2020_CVPR,
  author = {Zoran, Daniel and Chrzanowski, Mike and Huang, Po-Sen and Gowal, Sven and Mott, Alex and Kohli, Pushmeet},
  title = {Towards Robust Image Classification Using Sequential Attention Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Discovering Synchronized Subsets of Sequences: A Large Scale Solution
Evangelos Sariyanidi, Casey J. Zampella, Keith G. Bartley, John D. Herrington, Theodore D. Satterthwaite, Robert T. Schultz, Birkan Tunc


Finding the largest subset of sequences (i.e., time series) that are correlated above a certain threshold, within large datasets, is of significant interest for computer vision and pattern recognition problems across domains, including behavior analysis, computational biology, neuroscience, and finance. Maximal clique algorithms can be used to solve this problem, but they are not scalable. We present an approximate, but highly efficient and scalable, method that represents the search space as a union of sets called epsilon-expanded clusters, one of which is theoretically guaranteed to contain the largest subset of synchronized sequences. The method finds synchronized sets by fitting a Euclidean ball on epsilon-expanded clusters, using Jung's theorem. We validate the method on data from the three distinct domains of facial behavior analysis, finance, and neuroscience, where we respectively discover the synchrony among pixels of face videos, stock market item prices, and dynamic brain connectivity data. Experiments show that our method produces results comparable to, but up to 300 times faster than, maximal clique algorithms, with speed gains increasing exponentially with the number of input sequences.
[time, temporal, stock, three, dataset, speed, identifying, sequence, behavior] [correlation, faster, table] [synchronized, syncref, largest, facial, maximal, clique, synchrony, input, face, correlated, ball, identify, mpe, maxclique, mmi, expression, fernando, discovering, condition, expanded, robust, financial] [method, brain, ieee, exact, analysis, phase, motion, pattern, journal, warping, dynamic] [cluster, representation, market, satisfy, corresponding, discovery, distinct, unsupervised] [set, subset, number, data, problem, large, approximate, finding, find, theorem, algorithm, john, entire, randomly, computational, space, pairwise, clustering, efficiently, appendix, matrix] [pca, solution, computer, functional, fitting, approach, human, vision]
@InProceedings{Sariyanidi_2020_CVPR,
  author = {Sariyanidi, Evangelos and Zampella, Casey J. and Bartley, Keith G. and Herrington, John D. and Satterthwaite, Theodore D. and Schultz, Robert T. and Tunc, Birkan},
  title = {Discovering Synchronized Subsets of Sequences: A Large Scale Solution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Going Deeper With Lean Point Networks
Eric-Tuan Le, Iasonas Kokkinos, Niloy J. Mitra


In this work we introduce Lean Point Networks (LPNs) to train deeper and more accurate point processing networks by relying on three novel point processing blocks that improve memory consumption, inference time, and accuracy: a convolution-type block for point sets that blends neighborhood information in a memory-efficient manner; a crosslink block that efficiently shares information across low- and high-resolution processing branches; and a multi-resolution point cloud processing block for faster diffusion of information. By combining these blocks, we design wider and deeper point-based architectures. We report systematic accuracy and memory consumption improvements on multiple publicly available segmentation tasks by using our generic modules as drop-in replacements for the blocks of multiple architectures (PointNet++, DGCNN, SpiderNet, PointCNN).
[time, dataset, three, speed, going, work, turn, graph] [segmentation, table, iou, propose, shallow, main, feature, grouping] [knn, improve, generic, datasets, original] [convolution, residual, block, convolutional, flow, based, figure, proposed] [lpn, image, train] [memory, network, deep, processing, performance, slp, inference, consumption, deeper, learning, architecture, compared, layer, pool, backward, convpn, efficiency, matrix, design, training, accuracy, increasing, number, report, baseline, complexity, counterpart, efficient, pass, impact, neural, mres, increase, gradient, data, operation, standard] [point, cloud, partnet, scannet, lean, pointnet, neighborhood, footprint, computer, shape, complex, local, allow, vision, mlp]
@InProceedings{Le_2020_CVPR,
  author = {Le, Eric-Tuan and Kokkinos, Iasonas and Mitra, Niloy J.},
  title = {Going Deeper With Lean Point Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Efficient and Robust Shape Correspondence via Sparsity-Enforced Quadratic Assignment
Rui Xiang, Rongjie Lai, Hongkai Zhao


In this work, we introduce a novel local pairwise descriptor and then develop a simple, effective iterative method to solve the resulting quadratic assignment through sparsity control for shape correspondence between two approximate isometric surfaces. Our pairwise descriptor is based on the stiffness and mass matrix of finite element approximation of the Laplace-Beltrami differential operator, which is local in space, sparse to represent, and extremely easy to compute while containing global information. It allows us to deal with open surfaces, partial matching, and topological perturbations robustly. To solve the resulting quadratic assignment problem efficiently, the two key ideas of our iterative algorithm are: 1) select pairs with good (approximate) correspondence as anchor points, 2) solve a regularized quadratic assignment problem only in the neighborhood of selected anchor points through sparsity control. These two ingredients can improve and increase the number of anchor points quickly while reducing the computation cost in each quadratic assignment iteration significantly. With enough high-quality anchor points, one may use various pointwise global features with reference to these anchor points to further improve the dense shape correspondence. We use various experiments to show the efficiency, quality, and versatility of our method on large data sets, patches, and point clouds (without global meshes).
[step, element, construct] [anchor, assignment, map, global, mass, post, boundary] [distortion, topological, iterative, model] [method, figure, based, kernel, patch, ieee, spectral, spectrum, column, result] [corresponding, control, mapping] [matrix, sparsity, stochastic, pairwise, algorithm, data, good, pointwise, quadratic, problem, find, number, size, set, iteration, large, test, processing, approximation, selected, computation, efficient] [local, shape, correspondence, point, qap, doubly, computer, dense, matching, relaxed, geodesic, solve, neighborhood, mesh, conference, descriptor, distance, heat, cloud, initial, volume, isometric, stiffness, sparse, partial, ron, vision, constraint, lbo, second, tosca, international, michael, intrinsic, cost, surface]
@InProceedings{Xiang_2020_CVPR,
  author = {Xiang, Rui and Lai, Rongjie and Zhao, Hongkai},
  title = {Efficient and Robust Shape Correspondence via Sparsity-Enforced Quadratic Assignment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Explainable Object-Induced Action Decision for Autonomous Vehicles
Yiran Xu, Xiaoyin Yang, Lihang Gong, Hsuan-Chu Lin, Tz-Ying Wu, Yunsheng Li, Nuno Vasconcelos


A new paradigm is proposed for autonomous driving. The new paradigm lies between the end-to-end and pipelined approaches, and is inspired by how humans solve the problem. While it relies on scene understanding, the latter only considers objects that could originate hazard. These are denoted as action inducing, since changes in their state should trigger vehicle actions. They also define a set of explanations for these actions, which should be produced jointly with the latter. An extension of the BDD100K dataset, annotated for a set of 4 actions and 21 explanations, is proposed. A new multi-task formulation of the problem, which optimizes the accuracy of both action commands and explanations, is then introduced. A CNN architecture is finally proposed to solve this problem, by combining reasoning about action inducing objects and global scene context. Experimental results show that the requirement of explanations improves the recognition of action-inducing objects, which in turn leads to better action predictions.
[action, driving, prediction, recognition, attention, dataset, lane, traffic, multiple, reasoning, visual, turn, video, predict, driver, associated] [global, object, autonomous, table, feature, module, faster, detection, car, contextual, annotated, backbone, map, semantic, branch, score] [explanation, model, datasets, trained] [ieee, proposed, figure, pattern, based, light, combination] [image, selector, trevor, generation, produced, loss, produce] [network, performance, set, learning, induced, deep, architecture, neural, training, number, processing, problem, classification, selection, data, size] [computer, conference, vision, local, scene, single, system, left, joint, international, european]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Yiran and Yang, Xiaoyin and Gong, Lihang and Lin, Hsuan-Chu and Wu, Tz-Ying and Li, Yunsheng and Vasconcelos, Nuno},
  title = {Explainable Object-Induced Action Decision for Autonomous Vehicles},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Spatially Attentive Output Layer for Image Classification
Ildoo Kim, Woonhyuk Baek, Sungwoong Kim


Most convolutional neural networks (CNNs) for image classification use a global average pooling (GAP) followed by a fully-connected (FC) layer for output logits. However, this spatial aggregation procedure inherently restricts the utilization of location-specific information at the output layer, although this spatial information can be beneficial for classification. In this paper, we propose a novel spatial output layer on top of the existing convolutional feature maps to explicitly exploit the location-specific output information. Specifically, given the spatial feature maps, we replace the previous GAP-FC layer with a spatially attentive output layer (SAOL) by employing an attention mask on spatial logits. The proposed location-specific attention selectively aggregates spatial logits within a target region, which leads not only to performance improvements but also to spatially interpretable outputs. Moreover, the proposed SAOL also permits fully exploiting location-specific self-supervision as well as self-distillation to enhance the generalization ability during training. The proposed SAOL with self-supervision and self-distillation can be easily plugged into existing CNNs. Experimental results on various classification tasks with representative architectures show consistent performance improvements by SAOL at almost the same computational cost.
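A compact sketch of the idea: produce location-specific class logits and a normalized spatial attention map from the same feature map, then aggregate the logits by attention-weighted summation instead of global average pooling. The 1x1 kernels and single-head attention are illustrative choices, not the exact SAOL architecture.

```python
import torch
import torch.nn as nn

class SpatiallyAttentiveOutput(nn.Module):
    """Replace GAP+FC with attention-weighted aggregation of location-specific logits (sketch)."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.logit_conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.attn_conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feat):                        # feat: (N, C, H, W)
        spatial_logits = self.logit_conv(feat)      # (N, K, H, W) location-specific logits
        attn = torch.softmax(self.attn_conv(feat).flatten(2), dim=2)
        attn = attn.view(feat.size(0), 1, feat.size(2), feat.size(3))
        return (spatial_logits * attn).sum(dim=(2, 3))   # (N, K) class logits
```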
[attention, recognition, previous, visual, mechanism] [saol, feature, object, map, table, localization, abn, final, cam, semantic, wsol, attentive, pooling, aggregation] [improve, model, original, trained, auxiliary, input] [spatial, output, proposed, ieee, convolutional, based, pattern, spatially, figure, block, convolution, method, conventional, residual, existing, cnns, intermediate] [cutmix, image, loss, target, generate, produce, supervised, interpretable] [classification, logits, layer, learning, deep, class, neural, performance, network, baseline, imagenet, computational, activation, better, arxiv, preprint, average, data, training] [conference, vision, computer, international, well, additional, novel]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Ildoo and Baek, Woonhyuk and Kim, Sungwoong},
  title = {Spatially Attentive Output Layer for Image Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Attack to Explain Deep Representation
Mohammad A. A. K. Jalwana, Naveed Akhtar, Mohammed Bennamoun, Ajmal Mian


Deep visual models are susceptible to extremely low magnitude perturbations to input images. Though carefully crafted, the perturbation patterns generally appear noisy, yet they are able to perform controlled manipulation of model predictions. This observation is used to argue that deep representation is misaligned with human perception. This paper counter-argues and proposes the first attack on deep learning that aims at explaining the learned representation instead of fooling it. By extending the input domain of the manipulative signal and employing a model faithful channelling, we iteratively accumulate adversarial perturbations for a deep model. The accumulated signal gradually manifests itself as a collection of visually salient features of the target label (in model fooling), casting adversarial perturbations as primitive features of the target label. Our attack provides the first demonstration of systematically computing perturbations for adversarially non-robust classifiers that comprise salient visual features of objects. We leverage the model explaining character of our algorithm to perform image generation, inpainting and interactive image manipulation by attacking adversarially robust classifiers. The visually appealing results across these applications demonstrate the utility of our attack (and perturbations in general) beyond model fooling.
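Stripped to its essentials, the procedure accumulates one shared perturbation over many targeted gradient steps so that the perturbation itself gradually takes on salient features of the target class; the paper's model-faithful channelling and extended input domain are not reproduced in this hedged sketch.

```python
import torch

def accumulate_target_perturbation(model, images, target_class, steps=200, step_size=1e-2):
    """Accumulate a single perturbation that pushes a batch of inputs toward `target_class`."""
    delta = torch.zeros_like(images[0:1], requires_grad=True)
    targets = torch.full((images.size(0),), target_class, dtype=torch.long)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(model(images + delta), targets)
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()   # targeted step: descend the cross-entropy
            delta.clamp_(-1.0, 1.0)                  # keep the accumulated signal in a valid range
        delta.grad.zero_()
    return delta.detach()
```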
[visual, perception, semantics] [salient, seed, interactive, refinement, refined, region, focus] [adversarial, perturbation, model, robust, attack, input, manipulation, technique, explaining, universal, representative, adversarially, explanation, fooling, attacking, santurkar, emerge, iterative, norm, dimitris, aleksander] [ieee, pattern, signal, visually, based, figure, high, convolutional, proposed] [image, target, representation, inpainting, generated, misalignment, loss, perform, domain, utility, discriminative] [deep, algorithm, arxiv, preprint, set, gradient, class, label, distribution, objective, learning, classifier, random, neural, classification, probability, sample, processing, andrew, computing] [computer, conference, vision, human, computed, geometric, direction, compute, demonstrate, surface, iteratively]
@InProceedings{Jalwana_2020_CVPR,
  author = {Jalwana, Mohammad A. A. K. and Akhtar, Naveed and Bennamoun, Mohammed and Mian, Ajmal},
  title = {Attack to Explain Deep Representation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Computing Valid P-Values for Image Segmentation by Selective Inference
Kosuke Tanizaki, Noriaki Hashimoto, Yu Inatsu, Hidekata Hontani, Ichiro Takeuchi


Image segmentation is one of the most fundamental tasks in computer vision. In many practical applications, it is essential to properly evaluate the reliability of individual segmentation results. In this study, we propose a novel framework for quantifying the statistical significance of individual segmentation results in the form of p-values by statistically testing the difference between the object region and the background region. This seemingly simple problem is actually quite challenging because the difference --- called segmentation bias --- can be deceptively large due to the adaptation of the segmentation algorithm to the data. To overcome this difficulty, we introduce a statistical approach called selective inference, and develop a framework for computing valid p-values in which segmentation bias is properly accounted for. Although the proposed framework is potentially applicable to various segmentation algorithms, we focus in this paper on graph-cut- and threshold-based segmentation algorithms, and develop two specific methods for computing valid p-values for the segmentation results obtained by these algorithms. We prove the theoretical validity of these two methods and demonstrate their practicality by applying them to the segmentation of medical images.
[selective, individual, graph, naive, node, observed] [segmentation, object, background, psegi, threshold, region, significance, framework, tumor, seed, global, determined] [difference, original, quantify, testing, condition] [pixel, proposed, method, result, valid, based, medical, figure, journal, analysis, called, intensity, event, ieee, pattern, adjacent] [image, properly, specific, conditional, pathological, target] [algorithm, statistical, problem, set, inference, reliability, quadratic, data, consider, distribution, computing, null, test, written, variance, large, function, maximum, weight, theorem, fibrous, bias, similarity, indicates, selected, practical, deceptively, cut, average, linear, vector] [local, defined, form, international, hypothesis, computed, computer, approach]
@InProceedings{Tanizaki_2020_CVPR,
  author = {Tanizaki, Kosuke and Hashimoto, Noriaki and Inatsu, Yu and Hontani, Hidekata and Takeuchi, Ichiro},
  title = {Computing Valid P-Values for Image Segmentation by Selective Inference},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Learning From Video With Deep Neural Embeddings
Chengxu Zhuang, Tianwei She, Alex Andonian, Max Sobol Mark, Daniel Yamins


Because of the rich dynamical structure of videos and their ubiquity in everyday life, it is a natural idea that video data could serve as a powerful unsupervised learning signal for visual representations. However, instantiating this idea, especially at large scale, has remained a significant artificial intelligence challenge. Here we present the Video Instance Embedding (VIE) framework, which trains deep nonlinear embeddings on video sequence inputs. By learning embedding dimensions that identify and group similar videos together, while pushing inherently different videos apart in the embedding space, VIE captures the strong statistical structure inherent in videos, without the need for external annotation labels. We find that, when trained on a large-scale video dataset, VIE yields powerful representations both for action recognition and single-frame object categorization, substantially improving on the state of the art wherever direct comparisons are possible. We show that a two-pathway model with both static and dynamic processing pathways is optimal, provide analyses indicating how the model works, and perform ablation studies showing the importance of key architecture and loss function choices. Our results suggest that deep neural embeddings are a promising approach to unsupervised video learning for a wide variety of task domains.
[video, vie, embedding, action, visual, temporal, kinetics, static, recognition, frame, embeddings, work, slowfast, previous, spatiotemporal, natural, powerful, multiple, dataset] [table, including, object, feature, instance, aggregation, improvement] [model, trained, input, datasets] [dynamic, ieee, convolutional, motion, pattern, method] [unsupervised, loss, supervised, transfer, representation, image] [learning, deep, neural, performance, network, training, architecture, function, better, sampling, data, processing, task, imagenet, arxiv, preprint, learned, test, scratch, predictive, memory, augmentation, find, general, procedure] [computer, conference, vision, approach, single, local, international, daniel, directly, dense, european]
@InProceedings{Zhuang_2020_CVPR,
  author = {Zhuang, Chengxu and She, Tianwei and Andonian, Alex and Mark, Max Sobol and Yamins, Daniel},
  title = {Unsupervised Learning From Video With Deep Neural Embeddings},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Partial Weight Adaptation for Robust DNN Inference
Xiufeng Xie, Kyu-Han Kim


Mainstream video analytics uses a pre-trained DNN model with an assumption that inference input and training data follow the same probability distribution. However, this assumption does not always hold in the wild: autonomous vehicles may capture video with varying brightness; unstable wireless bandwidth calls for adaptive bitrate streaming of video; and, inference servers may serve inputs from heterogeneous IoT devices/cameras. In such situations, the level of input distortion changes rapidly, thus reshaping the probability distribution of the input. We present GearNN, an adaptive inference architecture that accommodates DNN inputs with varying distortions. GearNN employs an optimization algorithm to identify a tiny set of "distortion-sensitive" DNN parameters, given a memory budget. Based on the distortion level of the input, GearNN then adapts only the distortion-sensitive parameters, while reusing the rest of constant parameters across all input qualities. In our evaluation of DNN inference with dynamic input distortions, GearNN improves the accuracy (mIoU) by an average of 18.12% over a DNN trained with the undistorted dataset and 4.84% over stability training from Google, with only 1.8% extra memory overhead.
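A simplified sketch of the adaptation idea: fine-tune a copy of the model per distortion level, rank parameter tensors by how far they moved from the shared base weights, and store only the top fraction as a per-distortion adaptor that is swapped in at inference time. Selecting whole tensors under a simple fraction, rather than the paper's budgeted optimization, is a simplification.

```python
import torch

def build_adaptor(base_state, tuned_state, budget_fraction=0.02):
    """Keep only the parameters that changed most during fine-tuning on one distortion level.

    base_state, tuned_state: state_dicts of the shared model and a distortion-specific copy.
    Returns a small dict {name: tensor} covering roughly `budget_fraction` of the tensors.
    """
    diffs = {k: (tuned_state[k].float() - base_state[k].float()).abs().mean().item()
             for k in base_state}
    ranked = sorted(diffs, key=diffs.get, reverse=True)
    keep = ranked[: max(1, int(budget_fraction * len(ranked)))]
    return {k: tuned_state[k].clone() for k in keep}

def apply_adaptor(model, base_state, adaptor):
    """Swap in the distortion-sensitive parameters; everything else stays shared."""
    state = dict(base_state)
    state.update(adaptor)
    model.load_state_dict(state)
```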
[dataset, video, multiple, visual, overhead] [level, mask, achieves, segmentation, response, miou, partially] [dnn, gearnn, input, distortion, original, quality, stability, jpeg, distorted, adaptor, model, switching, dnns, sensitivity, change, streaming, accommodate, robust, tiny, adapts, difference, trained, undistorted, iot] [spatial, brightness, frequency, adaptive, figure, high, dynamic, based, resolution, ieee, existing, compression, low, drn, output] [loss, image, adaptation, transfer] [training, inference, accuracy, memory, size, mixed, data, weight, neural, average, higher, performance, base, small, learning, layer, set, portion, deep, probability, architecture] [partial, fit, computer, conference, relative, instantaneous, single]
@InProceedings{Xie_2020_CVPR,
  author = {Xie, Xiufeng and Kim, Kyu-Han},
  title = {Partial Weight Adaptation for Robust DNN Inference},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Probability Weighted Compact Feature for Domain Adaptive Retrieval
Fuxiang Huang, Lei Zhang, Yang Yang, Xichuan Zhou


Domain adaptive image retrieval includes single-domain retrieval and cross-domain retrieval. Most of the existing image retrieval methods only focus on single-domain retrieval, which assumes that the distributions of retrieval databases and queries are similar. However, in practical application, the discrepancies between retrieval databases often taken in ideal illumination/pose/background/camera conditions and queries usually obtained in uncontrolled conditions are very large. In this paper, considering the practical application, we focus on challenging cross-domain retrieval. To address the problem, we propose an effective method named Probability Weighted Compact Feature Learning (PWCF), which provides inter-domain correlation guidance to promote cross-domain retrieval accuracy and learns a series of compact binary codes to improve the retrieval speed. First, we derive our loss function through the Maximum A Posteriori Estimation (MAP): Bayesian Perspective (BP) induced focal-triplet loss, BP induced quantization loss and BP induced classification loss. Second, we propose a common manifold structure between domains to explore the potential correlation across domains. Considering the original feature representation is biased due to the inter-domain discrepancy, the manifold structure is difficult to be constructed. Therefore, we propose a new feature named Histogram Feature of Neighbors (HFON) from the sample statistics perspective. Extensive experiments on various benchmark databases validate that our method outperforms many state-of-the-art image retrieval methods for domain adaptive image retrieval. The source code is available at https://github.com/fuxianghuang1/PWCF .
[retrieval, represent, considering, dataset, length] [feature, propose, named, correlation, hard, table, map, positive] [original] [ieee, histogram, method, based, adaptive, pattern, proposed, figure] [domain, loss, image, source, target, pwcf, manifold, sik, sdh, code, transfer, hfon, itq, discrepancy, content, lsh, och, gth] [induced, binary, probability, hashing, sample, learning, function, training, data, compact, quantization, performance, set, triplet, number, sij, class, distribution, reduce, negative, maximum, similarity, algorithm, matrix, objective, variant, zij, weighted, bayesian, standard, update, sgh] [conference, distance, international, structure, nearest, computer, solution, vision, solving, capture, represented]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Fuxiang and Zhang, Lei and Yang, Yang and Zhou, Xichuan},
  title = {Probability Weighted Compact Feature for Domain Adaptive Retrieval},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Where Does It End? - Reasoning About Hidden Surfaces by Object Intersection Constraints
Michael Strecke, Jorg Stuckler


Dynamic scene understanding is an essential capability in robotics and VR/AR. In this paper we propose Co-Section, an optimization-based approach to 3D dynamic scene reconstruction, which infers hidden shape information from intersection constraints. An object-level dynamic SLAM frontend detects, segments, tracks and maps dynamic objects in the scene. Our optimization backend completes the shapes using hull and intersection constraints between the objects. In experiments, we demonstrate our approach on real and synthetic dynamic scene datasets. We also assess the shape completion performance of our method quantitatively. To the best of our knowledge, our approach is the first method to incorporate such physical plausibility constraints on object intersections for shape completion of dynamic objects in an energy minimization framework.
[observed, moving, multiple, incorporate, future, static, time] [object, oriented, propose, association, background, including, mask, global] [model, input, physical] [dynamic, method, ieee, pattern, figure, optimized, proposed, field, fast] [synthetic, qualitative, oct, free] [optimization, energy, data, baseline, accuracy, function, minimization, set, max, regularization, equation] [intersection, surface, approach, hull, distance, shape, point, depth, reconstruction, constraint, scene, computer, conference, vision, completion, implicit, signed, tsdf, international, slam, dense, mesh, michael, volumetric, sdf, voxel, completeness, frontend, full, formulation, complete, single, estimated, pose, reconstructed, plausibility, local, term, well, schroers, edata]
@InProceedings{Strecke_2020_CVPR,
  author = {Strecke, Michael and Stuckler, Jorg},
  title = {Where Does It End? - Reasoning About Hidden Surfaces by Object Intersection Constraints},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation
Yang Zhang, Zixiang Zhou, Philip David, Xiangyu Yue, Zerong Xi, Boqing Gong, Hassan Foroosh


The requirement of fine-grained perception by autonomous driving systems has resulted in recently increased research on the online semantic segmentation of single-scan LiDAR. Emerging datasets and technological advancements have enabled researchers to benchmark this problem and improve the applicable semantic segmentation algorithms. Still, online semantic segmentation of LiDAR scans in autonomous driving applications remains challenging for three reasons: (1) the need for near-real-time latency with limited hardware, (2) the uneven distribution of points across space, and (3) an increasing number of more fine-grained semantic classes. The combination of the aforementioned challenges motivates us to propose a new LiDAR-specific, KNN-free segmentation algorithm - PolarNet. Instead of using common spherical or bird's-eye-view projection, our polar bird's-eye-view representation balances the points per grid cell and thus indirectly redistributes the network's attention over the long-tailed point distribution along the radial axis in polar coordinates. We find that our encoding scheme greatly increases the mIoU on three drastically different real urban LiDAR single-scan segmentation datasets while retaining ultra-low latency and near real-time throughput.
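A minimal sketch of the polar bird's-eye-view quantization described above, assuming a uniform (radius, azimuth) grid; the range and grid resolution are illustrative, not the paper's settings.

```python
import numpy as np

def polar_bev_indices(points, r_max=50.0, grid=(480, 360)):
    """points: (N, 3) array of x, y, z in the sensor frame -> (N, 2) polar cell indices.
    Binning by (radius, azimuth) instead of (x, y) balances the points per grid cell."""
    x, y = points[:, 0], points[:, 1]
    r = np.sqrt(x ** 2 + y ** 2)              # radial distance from the sensor
    theta = np.arctan2(y, x)                  # azimuth in [-pi, pi)
    r_idx = np.clip((r / r_max * grid[0]).astype(int), 0, grid[0] - 1)
    t_idx = ((theta + np.pi) / (2 * np.pi) * grid[1]).astype(int) % grid[1]
    return np.stack([r_idx, t_idx], axis=1)
```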
[dataset, prediction, perception, outperforms, graph, downstream] [segmentation, bev, lidar, polar, semantic, cartesian, miou, object, feature, detection, polarnet, squeezeseg, denotes, table, semantickitti, cnn, iou, autonomous, backbone, split, challenging, fully, improvement, assigned, propose] [model, input, datasets] [pattern, ieee, convolution, unet, sensor, cell, convolutional, based, figure, spatial, field] [representation, image] [network, neural, class, performance, data, validation, learning, distribution, number, test, training, size, problem, matrix, space, improved, online, deep] [point, grid, conference, computer, scan, vision, spherical, cloud, ring, international, projection, pointnet, distance, coordinate, kitti, approach, single, despite]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Yang and Zhou, Zixiang and David, Philip and Yue, Xiangyu and Xi, Zerong and Gong, Boqing and Foroosh, Hassan},
  title = {PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Pathological Retinal Region Segmentation From OCT Images Using Geometric Relation Based Augmentation
Dwarikanath Mahapatra, Behzad Bozorgtabar, Ling Shao


Medical image segmentation is important for computer-aided diagnosis. Pixelwise manual annotation of large datasets requires high expertise and is time consuming. Conventional data augmentations have limited benefit by not fully representing the underlying distribution of the training set, thus affecting model robustness when tested on images captured from different sources. Prior work leverages synthetic images for data augmentation, ignoring the interleaved geometric relationship between different anatomical labels. We propose improvements over previous GAN-based medical image synthesis methods by jointly encoding the intrinsic relationship of geometry and shape. Latent space variable sampling results in diverse generated images from a base image and improves robustness. Augmented datasets produced with our method for automatic segmentation of retinal optical coherence tomography (OCT) images outperform existing methods on the public RETOUCH dataset, whose images are captured with different acquisition procedures. Ablation studies and visual analysis also demonstrate the benefits of integrating geometry and diversity.
[dataset, relationship, relation] [segmentation, mask, table, ablation, fully, segment, region] [adversarial, model, trained, coherence] [figure, method, medical, unet, based, proposed, convolutional, ieee, output, anatomical, agreement, device, dsc, optical] [image, geogan, diseased, retinal, oct, generated, generation, generate, disease, generative, loss, pathological, diversity, synthetic, zhao, dwarikanath, dagan, cgan, conditional, realistic, generator, latent, train, corresponding, behzad, lshape, retouch, macular, ladv, lclass, learn] [training, data, augmentation, test, network, performance, set, classification, sampling, learning, label, base, manual, distribution, deep, neural, population, parameter, better, best] [normal, fluid, shape, uncertainty, geometric, approach, registration, computer, limited, geometry, demonstrate]
@InProceedings{Mahapatra_2020_CVPR,
  author = {Mahapatra, Dwarikanath and Bozorgtabar, Behzad and Shao, Ling},
  title = {Pathological Retinal Region Segmentation From OCT Images Using Geometric Relation Based Augmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Transferring and Regularizing Prediction for Semantic Segmentation
Yiheng Zhang, Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Dong Liu, Tao Mei


Semantic segmentation often requires a large set of images with pixel-level annotations. In view of extremely expensive expert labeling, recent research has shown that models trained on photo-realistic synthetic data (e.g., computer games) with computer-generated annotations can be adapted to real images. Despite this progress, without constraining the prediction on real images, the models will easily overfit on synthetic data due to severe domain mismatch. In this paper, we exploit the intrinsic properties of semantic segmentation to alleviate this problem for model transfer. Specifically, we present a Regularizer of Prediction Transfer (RPT) that imposes the intrinsic properties as constraints to regularize model transfer in an unsupervised fashion. These constraints include patch-level, cluster-level and context-level semantic prediction consistencies at different levels of image formation. As the transfer is label-free and data-driven, the robustness of prediction is addressed by selectively involving a subset of image regions for model regularization. Extensive experiments are conducted to verify the proposal of RPT on the transfer of models trained on GTA5 and SYNTHIA (synthetic data) to the Cityscapes dataset (urban street scenes). RPT shows consistent improvements when injecting the constraints into several neural networks for semantic segmentation. More remarkably, when integrating RPT into the adversarial-based segmentation framework, we report the best results to date: mIoU of 53.2%/51.7% when transferring from GTA5/SYNTHIA to Cityscapes, respectively.
[prediction, lstm, road, sequence, state, three, visual, logic, urban, dataset] [semantic, segmentation, superpixel, category, superpixels, fcn, miou, fully, feature, predicted, table, building, framework, segment, region] [model, adversarial, trained, logical] [spatial, figure, convolutional, based, proposed] [domain, adaptation, image, target, consistency, dominative, rpt, fcnadv, synthetic, source, loss, real, transfer, synthia, masked, unsupervised, transferring, ting, tao, learnt, gap] [learning, training, network, performance, data, regularization, probability, number, top, label, deep, regularizer, best, architecture, updating, baseline, large, set, expensive, problem] [computer, defined, ground, intrinsic, consistent]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Yiheng and Qiu, Zhaofan and Yao, Ting and Ngo, Chong-Wah and Liu, Dong and Mei, Tao},
  title = {Transferring and Regularizing Prediction for Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PREDICT & CLUSTER: Unsupervised Skeleton Based Action Recognition
Kun Su, Xiulong Liu, Eli Shlizerman


We propose a novel system for unsupervised skeleton-based action recognition. Given inputs of body-keypoint sequences obtained during various movements, our system associates the sequences with actions. Our system is based on an encoder-decoder recurrent neural network, where the encoder learns a separable feature representation within its hidden states, formed by training the model to perform the prediction task. We show that with such unsupervised training, the decoder and the encoder self-organize their hidden states into a feature space which clusters similar movements into the same cluster and distinct movements into distant clusters. Current state-of-the-art methods for action recognition are strongly supervised, i.e., they rely on providing labels for training. Unsupervised methods have been proposed; however, they require camera and depth inputs (RGB+D) at each time step. In contrast, our system is fully unsupervised, does not require action labels at any stage, and can operate with body-keypoint input only. Furthermore, the method can work with various dimensions of body-keypoints (2D or 3D) and can include additional cues describing movements. We evaluate our system on three action recognition benchmarks with different numbers of actions and examples. Our results outperform prior unsupervised skeleton-based methods and unsupervised RGB+D based methods on cross-view tests, and, despite being unsupervised, achieve performance similar to supervised skeleton-based action recognition.
[action, skeleton, decoder, recognition, sequence, hidden, state, prediction, three, rnn, time, graph, predict, recurrent, ntu, include, previous, dataset, work, temporal] [feature, final, propose, table] [input, datasets, effective] [ieee, based, method, pattern, prior, proposed, figure, motion, scale, convolutional, captured, output, enhanced, performed] [unsupervised, encoder, supervised, learn, representation, perform, gan, loss, cluster, learns] [training, network, performance, data, learning, neural, accuracy, random, task, large, vector, deep, classification, better, test] [human, computer, conference, system, body, keypoints, vision, depth, additional, require, initial, view]
@InProceedings{Su_2020_CVPR,
  author = {Su, Kun and Liu, Xiulong and Shlizerman, Eli},
  title = {PREDICT & CLUSTER: Unsupervised Skeleton Based Action Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Model Adaptation: Unsupervised Domain Adaptation Without Source Data
Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, Si Wu


In this paper, we investigate a challenging unsupervised domain adaptation setting --- unsupervised model adaptation. We aim to explore how to rely only on unlabeled target data to improve performance of an existing source prediction model on the target domain, since labeled source data may not be available in some real-world scenarios due to data privacy issues. For this purpose, we propose a new framework, which is referred to as collaborative class conditional generative adversarial net to bypass the dependence on the source data. Specifically, the prediction model is to be improved through generated target-style data, which provides more accurate guidance for the generator. As a result, the generator and the prediction model can collaborate with each other without source data. Furthermore, due to the lack of supervision from source data, we propose a weight constraint that encourages similarity to the source model. A clustering-based regularization is also introduced to produce more discriminative features in the target domain. Compared to conventional domain adaptation methods, our model achieves superior performance on multiple adaptation tasks with only unlabeled target data, which verifies its effectiveness in this challenging setting.
[prediction, dataset, collaborative, multiple, outperforms, sign, visual, recognition] [table, propose, achieves, effectiveness, semantic, challenging, feature] [model, adversarial, improve, datasets, generalization, improving, noise] [method, based, proposed, existing, figure, enhanced] [domain, adaptation, source, target, unsupervised, conditional, generative, generated, generator, generation, discrepancy, digit, transfer, kate, absence, loss, gan, translation, image, row, judy, mingsheng, jianmin, learn, discriminator] [data, performance, learning, regularization, class, training, deep, accuracy, unlabeled, distribution, labeled, weight, task, similarity, neural, improved, compared, large, label, expected, achieve, entropy, set, log, classification, objective] [constraint, joint, michael, demonstrate]
@InProceedings{Li_2020_CVPR,
  author = {Li, Rui and Jiao, Qianfen and Cao, Wenming and Wong, Hau-San and Wu, Si},
  title = {Model Adaptation: Unsupervised Domain Adaptation Without Source Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Evade Deep Image Retrieval by Stashing Private Images in the Hash Space
Yanru Xiao, Cong Wang, Xing Gao


With the rapid growth of visual content, deep learning to hash is gaining popularity in the image retrieval community. Although it greatly facilitates search efficiency, privacy is also at risk when images on the web are retrieved at a large scale and exploited as a rich mine of personal information. An adversary can extract private images by querying similar images from the targeted category for any usable model. Existing methods based on image processing preserve privacy at a sacrifice of perceptual quality. In this paper, we propose a new mechanism based on adversarial examples to "stash" private images in the deep hash space while maintaining perceptual similarity. We first find that a simple approach of Hamming distance maximization is not robust against brute-force adversaries. Then we develop a new loss function that maximizes the Hamming distance not only to the original category, but also to the centers of all the classes, partitioned into clusters of various sizes. Extensive experiments show that the proposed defense can harden the attacker's efforts by 2-7 orders of magnitude, without significant increase of computational overhead or perceptual degradation. We also demonstrate 30-60% transferability in hash space under a black-box setting. The code is available at: https://github.com/sugarruy/hashstash
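A hedged sketch of the defense objective described above: push the perturbed image's (relaxed) hash code far, in Hamming distance, from its original category center and from cluster centers of all classes. The cluster partitioning scheme and the exact weighting from the paper are not reproduced; all names are illustrative.

```python
import numpy as np

def relaxed_hamming(code, center):
    """code, center: (K,) vectors in [-1, 1]; equals the Hamming distance when both are binary."""
    return 0.5 * (len(code) - np.dot(code, center))

def stash_objective(code, own_center, cluster_centers):
    """Return the negative of the distances we want to maximize, so a standard
    minimizer (e.g. gradient descent on the image) can be applied."""
    d_own = relaxed_hamming(code, own_center)
    d_clusters = np.mean([relaxed_hamming(code, c) for c in cluster_centers])
    return -(d_own + d_clusters)
```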
[retrieval, mechanism, visual, social, extract] [weak, center, threshold, propose, table] [original, private, defense, privacy, attack, strong, cwdm, adversary, adversarial, model, fashion, hdm, transferability, database, noise, success, query, dbscan] [based, ieee, perceptual, high, figure, pattern, proposed, ssim] [image, protected, loss, cluster, code, user, target] [hamming, hash, number, deep, function, expected, imagenet, learning, set, space, search, maximization, best, quadratic, large, data, hashing, similarity, clustering, accuracy, distribution, neural, weighted, gradient, min, optimization, total, class, classification, arxiv, preprint] [distance, conference, computer, acm, system, rest]
@InProceedings{Xiao_2020_CVPR,
  author = {Xiao, Yanru and Wang, Cong and Gao, Xing},
  title = {Evade Deep Image Retrieval by Stashing Private Images in the Hash Space},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Advisable Learning for Self-Driving Vehicles by Internalizing Observation-to-Action Rules
Jinkyu Kim, Suhong Moon, Anna Rohrbach, Trevor Darrell, John Canny


Humans learn to drive through both practice and theory, e.g. by studying the rules, while most self-driving systems are limited to the former. Being able to incorporate human knowledge of typical causal driving behaviour should benefit autonomous systems. We propose a new approach that learns vehicle control with the help of human advice. Specifically, our system learns to summarize its visual observations in natural language, predict an appropriate action response (e.g. "I see a pedestrian crossing, so I stop"), and predict the controls, accordingly. Moreover, to enhance interpretability of our system, we introduce a fine-grained attention mechanism which relies on semantic segmentation and object-centric RoI pooling. We show that our approach of training the autonomous system with human advice, grounded in a rich semantic representation, matches or outperforms prior work in terms of control prediction and explanation generation. Our approach also results in more interpretable visual explanations by visualizing object-centric attention maps. Code is available at https://github.com/JinkyuKimUCB/advisable-driving.
[attention, visual, driving, textual, vehicle, action, advice, natural, observation, language, lane, provide, command, dataset, explainable, ing, advisable, work, prediction, predict, lstm, speed, attended, attends, future, road, slow, red, incorporate, evaluation, trajectory, trust] [car, semantic, segmentation, feature, instance, module, pedestrian, predicted, propose, improves] [model, input, trained] [figure, spatial, light] [control, image, encoder, generated, user, latent, generator, conditioned, representation, trevor, generates, generate, learn, corresponding, train, loss] [learning, controller, training, performance, baseline, deep, report, vector, neural, evaluate, convnet] [human, system, approach, form]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Jinkyu and Moon, Suhong and Rohrbach, Anna and Darrell, Trevor and Canny, John},
  title = {Advisable Learning for Self-Driving Vehicles by Internalizing Observation-to-Action Rules},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ProAlignNet: Unsupervised Learning for Progressively Aligning Noisy Contours
VSR Veeravasarapu, Abhishek Goel, Deepak Mittal, Maneesh Singh


Contour shape alignment is a fundamental but challenging problem in computer vision, especially when the observations are partial, noisy, and largely misaligned. Recent ConvNet-based architectures that were proposed to align image structures tend to fail with contour representations of shapes, mostly due to the use of proximity-insensitive pixel-wise similarity measures as loss functions in their training processes. This work presents a novel ConvNet, "ProAlignNet," that accounts for large-scale misalignments and complex transformations between contour shapes. It infers the warp parameters in a multi-scale fashion, with progressively more complex transformations over increasing scales. It learns, without supervision, to align contours, agnostic to noise and missing parts, by training with a novel loss function derived as an upper bound of a proximity-sensitive, local shape-dependent similarity metric that uses the classical Morphological Chamfer Distance Transform. We evaluate the reliability of these proposals on a simulated MNIST noisy contours dataset via basic sanity-check experiments. Next, we demonstrate the effectiveness of the proposed models in two real-world applications: (i) aligning geo-parcel data to aerial image maps and (ii) refining coarsely annotated segmentation labels. In both applications, the proposed models consistently outperform state-of-the-art methods.
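A minimal sketch of a chamfer-distance-transform style alignment loss in the spirit of the proximity-sensitive objective described above, assuming binary contour masks and a Euclidean distance transform as a stand-in for the morphological chamfer transform used in the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_dt_loss(warped_source_mask, target_mask):
    """Both inputs: (H, W) binary arrays with 1 on contour pixels.
    Returns the mean distance from warped source contour pixels to the target contour."""
    dt = distance_transform_edt(1 - target_mask)   # distance of each pixel to the nearest target contour pixel
    src = warped_source_mask.astype(bool)
    if not src.any():
        return 0.0
    return float(dt[src].mean())
```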
[work, proximity, coarsely] [contour, aerial, segmentation, table, semantic, val, annotated, propose] [trained, noise, morphological, mnist, original, input] [warp, figure, transform, scale, proposed, noisy, ieee, transforms, based, multiscale, field, affine, method, refining, convolutional, resolution, spatial] [image, loss, source, alignment, target, aligned, train, align, misalignment, progressively, unsupervised, aligning, learns] [training, increasing, data, max, learning, function, network, set, process, similarity, predictor, performance, neural, backward, layer, better, problem, requires, deep] [chamfer, distance, proalignnet, local, upperbound, shape, transformation, computer, coarser, conference, novel, dirnet, coarse, complex, alignet, error, vision, overlaid]
@InProceedings{Veeravasarapu_2020_CVPR,
  author = {Veeravasarapu, VSR and Goel, Abhishek and Mittal, Deepak and Singh, Maneesh},
  title = {ProAlignNet: Unsupervised Learning for Progressively Aligning Noisy Contours},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Attribution in Scale and Space
Shawn Xu, Subhashini Venugopalan, Mukund Sundararajan


We study the attribution problem for deep networks applied to perception tasks. For vision tasks, attribution techniques attribute the prediction of a network to the pixels of the input image. We propose a new technique called Blur Integrated Gradients (Blur IG). This technique has several advantages over other methods. First, it can tell at what scale a network recognizes an object. It produces scores in the scale/frequency dimension, which we find captures interesting phenomena. Second, it satisfies the scale-space axioms, which imply that it employs perturbations that are free of artifacts. We therefore produce explanations that are cleaner and consistent with the operation of deep networks. Third, it eliminates the need for the baseline parameter of Integrated Gradients for perception tasks. This is desirable because the choice of baseline has a significant effect on the explanations. We compare the proposed technique against previous techniques and demonstrate application on three tasks: ImageNet object recognition, Diabetic Retinopathy prediction, and AudioSet audio event identification. Code and examples are at https://github.com/PAIR-code/saliency.
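A hedged sketch of the blur-path integration described above: gradients are accumulated along a path of progressively less-blurred versions of the input rather than a path from a baseline image. `model_grad` is an assumed callable returning the gradient of the class score w.r.t. the input; the step count and maximum blur scale are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def _blur(image, sigma):
    # blur only the spatial axes of an (H, W, C) image
    return image if sigma <= 0 else gaussian_filter(image, sigma=(sigma, sigma, 0))

def blur_integrated_gradients(image, model_grad, max_sigma=20.0, steps=50):
    """image: (H, W, C) float array; returns an attribution map of the same shape."""
    sigmas = np.linspace(max_sigma, 0.0, steps + 1)
    attribution = np.zeros_like(image)
    for i in range(steps):
        x_curr = _blur(image, sigmas[i])
        x_next = _blur(image, sigmas[i + 1])
        # Riemann sum along the blur path, from maximally blurred towards the input
        attribution += model_grad(x_curr) * (x_next - x_curr)
    return attribution
```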
[prediction, audio, perception, visual] [feature, positive, confidence, saliency] [attribution, model, input, explanation, diabetic, violin, theory, retinopathy, gradcam, technique, black, satisfies, literature, blurig, condition, true, gaussians, identify, remark, xrai] [blur, figure, scale, integrated, gaussian, intensity, based, integration, signal, method, kernel, ieee, event, result, frequency, color] [image, produce, target, specific, notice] [path, baseline, deep, class, random, network, imagenet, label, lower, higher, function, scaling, proposition, indicates, learning, classification, space, smaller, large, gradient, note, top] [computer, vision, international, conference, form, second, arch, human, axis, defined, distance]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Shawn and Venugopalan, Subhashini and Sundararajan, Mukund},
  title = {Attribution in Scale and Space},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing
Vedika Agarwal, Rakshith Shetty, Mario Fritz


Despite significant success in Visual Question Answering (VQA), VQA models have been shown to be notoriously brittle to linguistic variations in the questions. Due to deficiencies in models and datasets, today's models often rely on correlations rather than predictions that are causal w.r.t. the data. In this paper, we propose a novel way to analyze and measure the robustness of state-of-the-art models w.r.t. semantic visual variations, and we propose ways to make models more robust against spurious correlations. Our method performs automated semantic image manipulations and tests for consistency in model predictions to quantify model robustness, as well as to generate synthetic data to counter these problems. We perform our analysis on three diverse, state-of-the-art VQA models and diverse question types, with a particular focus on challenging counting questions. In addition, we show that models can be made significantly more robust against inconsistent predictions using our edited data. Finally, we show that the results also translate to real-world error cases of state-of-the-art models, which results in improved overall performance.
[vqa, question, answer, saaa, snmn, visual, dataset, iqas, red, answering, three, predict, linguistic, correct, making, vocabulary, brittle, exploit, order] [object, semantic, table, propose, coco, instance, module, predicted, overlap] [model, white, robustness, study, original, spurious, change, robust, inconsistent, improve, iqa, trained, create] [figure, counting, remove, removal, based, removing, color, removed, glass, analysis, existing] [edited, real, image, consistency, synthetic, covariant, train, target, invariant, editing, expect, edit, generate, irrelevant, mapping, corresponding] [data, accuracy, augmentation, set, baseline, learning, test, neural, reduction, number, mentioned, select, uniform] [consistent, well, approach, accurate]
@InProceedings{Agarwal_2020_CVPR,
  author = {Agarwal, Vedika and Shetty, Rakshith and Fritz, Mario},
  title = {Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection
Shi-Xue Zhang, Xiaobin Zhu, Jie-Bo Hou, Chang Liu, Chun Yang, Hongfa Wang, Xu-Cheng Yin


Arbitrary shape text detection is a challenging task due to the high variety and complexity of scene text. In this paper, we propose a novel unified relational reasoning graph network for arbitrary shape text detection. In our method, an innovative local graph bridges a text proposal model via a Convolutional Neural Network (CNN) and a deep relational reasoning network via a Graph Convolutional Network (GCN), making our network end-to-end trainable. To be concrete, every text instance is divided into a series of small rectangular components, and the geometric attributes (e.g., height, width, and orientation) of the small components are estimated by our text proposal model. Given the geometric attributes, the local graph construction model can roughly establish linkages between different text components. For further reasoning and deducing the likelihood of linkages between a component and its neighbors, we adopt a graph-based network to perform deep relational reasoning on local graphs. Experiments on publicly available datasets demonstrate the state-of-the-art performance of our method. Code is available at https://github.com/GXYM/DRRG.
[text, graph, reasoning, relational, linkage, pivot, node, hmean, link, embedding, long, gcn, dataset, relationship, adjacency, shortest, resize, natural] [detection, instance, feature, proposal, region, side, xiang, adopt, achieves, regression, center, recall, map, represents, bottom, wei, propose, unified, textsnake, framework] [detecting, datasets, robust, experimental, craft] [method, based, convolutional, likelihood, proposed, ieee, figure, convolution, pattern] [component, arbitrary, image, consists, loss, perform] [network, deep, matrix, set, performance, group, classification, learning, training, applied, precision, clustering, top] [local, scene, shape, geometry, geometric, sin, novel]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Shi-Xue and Zhu, Xiaobin and Hou, Jie-Bo and Liu, Chang and Yang, Chun and Wang, Hongfa and Yin, Xu-Cheng},
  title = {Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Large-Scale Object Detection in the Wild From Imbalanced Multi-Labels
Junran Peng, Xingyuan Bu, Ming Sun, Zhaoxiang Zhang, Tieniu Tan, Junjie Yan


Training with more data has always been the most stable and effective way of improving performance in the deep learning era. As the largest object detection dataset so far, Open Images brings great opportunities and challenges for object detection, both in general and in sophisticated scenarios. However, owing to its semi-automatic collecting and labeling pipeline, designed to cope with the huge data scale, the Open Images dataset suffers from label-related problems: objects may explicitly or implicitly have multiple labels, and the label distribution is extremely imbalanced. In this work, we quantitatively analyze these label problems and provide a simple but effective solution. We design a concurrent softmax to handle the multi-label problems in object detection and propose a soft-sampling method with a hybrid training scheduler to deal with the label imbalance. Overall, our method yields a dramatic improvement of 3.34 points, leading to the best single model with 60.90 mAP on the public object detection test set of Open Images. Our ensembling result achieves 67.17 mAP, which is 4.29 points higher than the first-place method of last year.
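A heavily hedged sketch of one plausible reading of the concurrent softmax mentioned above: classes that co-occur with the ground-truth labels (e.g. parent categories) are down-weighted in the normalizer so they are not penalized as negatives. The exact weighting in the paper may differ; `concurrent` is an assumed per-sample co-occurrence rate vector.

```python
import numpy as np

def concurrent_softmax(logits, labels, concurrent):
    """logits: (C,) scores; labels: (C,) multi-hot ground truth;
    concurrent: (C,) in [0, 1], degree to which each class co-occurs with the labels."""
    z = np.exp(logits - logits.max())
    # suppress concurrent, non-labelled classes in the normalizer
    weights = 1.0 - concurrent * (1.0 - labels)
    return z / np.sum(weights * z)
```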
[dataset, multiple, scheduler, explicitly] [object, concurrent, detection, leaf, table, map, infrequent, parent, box, category, challenge, car, knife, achieves] [model, deal, trained, effective, public, apple, evaluated, toy] [ieee, figure, pattern, method, proposed, based] [loss, image, train, fruit] [training, open, softmax, sampling, label, number, data, performance, learning, imagenet, class, problem, deep, frequent, labeled, imbalanced, set, large, better, classification, distribution, neural, strategy, rij, probability, balance, arxiv, preprint, extremely, best, higher, achieve] [computer, conference, vision, human, hybrid, bed, international, single]
@InProceedings{Peng_2020_CVPR,
  author = {Peng, Junran and Bu, Xingyuan and Sun, Ming and Zhang, Zhaoxiang and Tan, Tieniu and Yan, Junjie},
  title = {Large-Scale Object Detection in the Wild From Imbalanced Multi-Labels},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
BBN: Bilateral-Branch Network With Cumulative Learning for Long-Tailed Visual Recognition
Boyan Zhou, Quan Cui, Xiu-Shen Wei, Zhao-Min Chen


Our work focuses on tackling the challenging but natural visual recognition task of long-tailed data distributions (i.e., a few classes occupy most of the data, while most classes have only a few samples). In the literature, class re-balancing strategies (e.g., re-weighting and re-sampling) are the prominent and effective methods proposed to alleviate the extreme imbalance of long-tailed problems. In this paper, we first discover that these re-balancing methods achieve satisfactory recognition accuracy because they significantly promote the classifier learning of deep networks. However, at the same time, they unexpectedly damage the representative ability of the learned deep features to some extent. Therefore, we propose a unified Bilateral-Branch Network (BBN) that takes care of both representation learning and classifier learning simultaneously, where each branch performs its own duty separately. In particular, our BBN model is further equipped with a novel cumulative learning strategy, which is designed to first learn the universal patterns and then gradually pay attention to the tail data. Extensive experiments on four benchmark datasets, including the large-scale iNaturalist ones, show that the proposed BBN significantly outperforms state-of-the-art methods. Furthermore, validation experiments demonstrate both our preliminary discovery and the effectiveness of the tailored designs in BBN for long-tailed problems. Our method won first place in the iNaturalist 2019 large-scale species classification competition, and our code is open-source and available at https://github.com/Megvii-Nanjing/BBN.
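A minimal sketch of the cumulative learning schedule described above, assuming a parabolic decay of the mixing weight and, for simplicity, a single shared linear classifier over the mixed branch features (the full model keeps per-branch classifiers); names are illustrative.

```python
import numpy as np

def bbn_alpha(epoch, max_epoch):
    """Parabolic decay: 1 at the start (uniform-sampler branch dominates),
    0 at the end (reversed, tail-favouring branch dominates)."""
    return 1.0 - (epoch / max_epoch) ** 2

def mixed_logits(f_conventional, f_reversed, classifier_w, alpha):
    """Weighted mix of the two branch features, followed by a shared linear classifier."""
    feat = alpha * f_conventional + (1.0 - alpha) * f_reversed
    return feat @ classifier_w
```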
[recognition, three, previous, visual, work, damage] [branch, feature, table, extreme, promote, achieves] [datasets, model, original, trained, universal, conduct, effective] [proposed, figure, conventional] [representation, loss, image, corresponding, ability, manifold] [learning, class, training, data, bbn, deep, tail, inaturalist, network, distribution, imbalance, cumulative, number, sampler, mixup, better, reversed, learned, imbalanced, uniform, rate, performance, validation, achieve, strategy, sample, decay, accuracy, large, test, lower, separately, design, sampling, parabolic, arxiv, preprint] [error]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Boyan and Cui, Quan and Wei, Xiu-Shen and Chen, Zhao-Min},
  title = {BBN: Bilateral-Branch Network With Cumulative Learning for Long-Tailed Visual Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Momentum Contrast for Unsupervised Visual Representation Learning
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick


We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
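A minimal NumPy sketch of the dictionary look-up with a queue and a momentum-updated key encoder, forward pass only; in the actual method the query encoder is trained by backpropagating a cross-entropy loss over these logits (positive at index 0), and the oldest mini-batch of keys is dequeued after each step.

```python
import numpy as np

def moco_logits(q, k_pos, queue, temperature=0.07):
    """q, k_pos: (D,) L2-normalised query and positive key; queue: (K, D) stored keys.
    Returns (K+1,) logits whose correct class is index 0."""
    l_pos = np.dot(q, k_pos)          # similarity with the positive key
    l_neg = queue @ q                 # similarities with the queued negatives
    return np.concatenate([[l_pos], l_neg]) / temperature

def momentum_update(theta_key, theta_query, m=0.999):
    """Key-encoder parameters slowly track the query encoder (exponential moving average)."""
    return {name: m * theta_key[name] + (1.0 - m) * theta_query[name] for name in theta_key}
```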
[visual, bank, mechanism, downstream, three, current, dataset, previous, language, context] [key, ross, table, instance, detection, object, kaiming, apbb, feature, voc, piotr, positive, semantic, building, segmentation, pascal, coco, mask] [query, encoded, input, adversarial] [figure, method, comparison, contrast, based] [supervised, unsupervised, loss, encoder, representation, image, encoders, perform, common] [moco, learning, contrastive, dictionary, imagenet, pretext, momentum, memory, apmk, large, training, random, data, task, accuracy, queue, size, linear, set, update, larger, schedule, deep, counterpart, better, sample, classification, network, negative, andrew] [form, consistent, estimation]
@InProceedings{He_2020_CVPR,
  author = {He, Kaiming and Fan, Haoqi and Wu, Yuxin and Xie, Saining and Girshick, Ross},
  title = {Momentum Contrast for Unsupervised Visual Representation Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation
Gedas Bertasius, Lorenzo Torresani


We introduce a method for simultaneously classifying, segmenting and tracking object instances in a video sequence. Our method, named MaskProp, adapts the popular Mask R-CNN to video by adding a mask propagation branch that propagates frame-level object instance masks from each video frame to all the other frames in a video clip. This allows our system to predict clip-level instance tracks with respect to the object instances segmented in the middle frame of the clip. Clip-level instance tracks generated densely for each frame in the sequence are finally aggregated to produce video-level object instance segmentation and classification. Our experiments demonstrate that our clip-level instance segmentation makes our approach robust to motion blur and object occlusions in video. MaskProp achieves the best reported accuracy on the YouTube-VIS dataset, outperforming the ICCV 2019 video instance segmentation challenge winner despite being much simpler and using orders of magnitude less labeled data (1.3M vs 1B images and 860K vs 14M bounding boxes). The project page is at: https://gberta.github.io/maskprop/.
[video, time, frame, clip, latexit, length, recognition, individual, temporally] [instance, object, segmentation, mask, propagation, feature, masktrack, tracking, maskprop, track, detection, predicted, propagated, branch, map, detected, coco, table, challenge, head, score, bounding, iou, backbone, semantic, kaiming, ross, bastian, segmenting, winner, unified, ensemblevis] [model] [method, tensor, ieee, figure, pattern, deformable, motion, convolutional, iccv] [image, separate] [accuracy, requires, average, set, imagenet, network, labeled, data, task, large, compared, learning, simple, performance] [computer, vision, conference, approach, computed, international, system, despite, compute, centered, compare, matching, overlapping]
@InProceedings{Bertasius_2020_CVPR,
  author = {Bertasius, Gedas and Torresani, Lorenzo},
  title = {Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Weakly Supervised Fine-Grained Image Classification via Guassian Mixture Model Oriented Discriminative Learning
Zhihui Wang, Shijie Wang, Shuhui Yang, Haojie Li, Jianjun Li, Zezhou Li


Existing weakly supervised fine-grained image recognition (WFGIR) methods usually pick out the discriminative regions from high-level feature maps directly. We discover that, due to the stacking of local receptive fields, Convolutional Neural Networks cause discriminative region diffusion in high-level feature maps, which leads to inaccurate discriminative region localization. In this paper, we propose an end-to-end Discriminative Feature-oriented Gaussian Mixture Model (DF-GMM) to address the problem of discriminative region diffusion and find better fine-grained details. Specifically, DF-GMM consists of 1) a low-rank representation mechanism (LRM), which learns a set of low-rank discriminative bases via a Gaussian Mixture Model (GMM) on high-level semantic feature maps to improve the discriminative ability of the feature representation, and 2) a low-rank representation reorganization mechanism (LR^2M), which restores the spatial information corresponding to the low-rank discriminative bases to reconstruct the low-rank feature maps. This alleviates the discriminative region diffusion problem and locates discriminative regions more precisely. Extensive experiments verify that DF-GMM yields the best performance under the same settings as the most competitive approaches on the CUB-Bird, Stanford-Cars, and FGVC-Aircraft datasets.
[recognition, mechanism, step, attention, visual, speed, difficulty, dataset, work] [feature, region, table, response, correlation, weakly, znk, map, denotes, object, cnn, pooling, localization, wfgir, focus, propose, reorganization, global, achieves, semantic, located] [model, diffusion, original] [gaussian, cvpr, figure, proposed, method, convolutional, spatial, gmm, june, scale, lowrank, july, patch, comparison, iccv, ieee] [discriminative, image, representation, supervised, loss, latent, discover, learns] [learning, network, problem, weight, mixture, linear, performance, accuracy, indicates, selected, matrix, number, default, classification, data, find, set, select, deep, initialization, function, min] [conference, computer, local, october, international]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Zhihui and Wang, Shijie and Yang, Shuhui and Li, Haojie and Li, Jianjun and Li, Zezhou},
  title = {Weakly Supervised Fine-Grained Image Classification via Guassian Mixture Model Oriented Discriminative Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection
Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, Stan Z. Li


Object detection has been dominated by anchor-based detectors for several years. Recently, anchor-free detectors have become popular due to the proposal of FPN and Focal Loss. In this paper, we first point out that the essential difference between anchor-based and anchor-free detection is actually how positive and negative training samples are defined, which leads to the performance gap between them. If they adopt the same definition of positive and negative samples during training, there is no obvious difference in the final performance, no matter whether they regress from a box or a point. This shows that how positive and negative training samples are selected is important for current object detectors. We then propose Adaptive Training Sample Selection (ATSS), which automatically selects positive and negative samples according to the statistical characteristics of the object. It significantly improves the performance of both anchor-based and anchor-free detectors and bridges the gap between them. Finally, we discuss the necessity of tiling multiple anchors per location on the image to detect objects. Extensive experiments conducted on MS COCO support our analysis and conclusions. With the newly introduced ATSS, we improve state-of-the-art detectors by a large margin, to 50.7% AP, without introducing any overhead. The code is available at https://github.com/sfzhang15/ATSS.
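A minimal sketch of the adaptive IoU threshold at the core of ATSS: for each ground-truth box, candidate anchors (the closest centres per pyramid level) are scored by IoU, and the positive/negative split uses the mean plus the standard deviation of those IoUs. The additional check that a positive anchor's centre must lie inside the ground-truth box is omitted here.

```python
import numpy as np

def atss_positive_mask(ious, candidate_idx):
    """ious: (A,) IoU of every anchor with one ground-truth box;
    candidate_idx: indices of the k closest anchors per pyramid level, flattened."""
    candidate_idx = np.asarray(candidate_idx)
    cand_ious = ious[candidate_idx]
    threshold = cand_ious.mean() + cand_ious.std()   # adaptive, per ground-truth box
    positive = np.zeros(ious.shape, dtype=bool)
    positive[candidate_idx[cand_ious >= threshold]] = True
    return positive
```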
[abhinav, multiple] [object, anchor, positive, box, retinanet, fcos, center, table, iou, detection, pyramid, feature, coco, final, location, bounding, detector, regression, detect, minival, threshold, ross, region, anchorbased, preset, faster, apm, apl, kaiming, proposal, improves, tiling, introducing] [difference, definition, improve] [method, scale, proposed, figure, adaptive, spatial, analysis, based, aspect, listed, convolutional, high, aps] [gap, loss, image] [training, negative, select, performance, sample, selection, candidate, network, statistical, set, learning, standard, large, number, better, strategy, hyperparameter, hyperparameters] [point, define, essential, accurate]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Shifeng and Chi, Cheng and Yao, Yongqiang and Lei, Zhen and Li, Stan Z.},
  title = {Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning User Representations for Open Vocabulary Image Hashtag Prediction
Thibaut Durand


In this paper, we introduce an open vocabulary model for image hashtag prediction - the task of mapping an image to its accompanying hashtags. Recent work shows that to build an accurate hashtag prediction model, it is necessary to model the user because of the self-expression problem, in which similar image content may be labeled with different tags. To take the user behaviour into account, we propose a new model that extracts a representation of a user from his/her image history. Our model makes it possible to improve a user representation with new images, or to add a new user, without retraining the model. Because new hashtags appear all the time on social networks, we design an open vocabulary model which can deal with new hashtags without retraining. Our model learns a cross-modal embedding between user-conditional visual representations and hashtag word representations. Experiments on a subset of the YFCC100M dataset demonstrate the efficacy of our user representation in user-conditional hashtag prediction and user retrieval. We further validate the open vocabulary prediction ability of our model.
[hashtags, hashtag, vocabulary, embedding, visual, prediction, embeddings, word, history, extract, bilinear, recognition, gru, dataset, retrieval, glove, time, predict, tag, exploit, language, temporal, work] [table, semantic, propose, improves] [model, deal, improve] [ieee, pattern, fusion, analysis, based, figure, proposed, introduced, tensor] [user, representation, image, conditional, unseen, learn, pretrained, loss, content] [learning, open, training, set, classification, observe, retraining, large, sum, standard, space, fixed, performance, deep, note, product, similarity, neural, machine, problem, knowledge, evaluate, empirical, processing, task, number, size, setting, agnostic, dimension] [conference, vision, computer, joint, compute, international, approach, compare, allows]
@InProceedings{Durand_2020_CVPR,
  author = {Durand, Thibaut},
  title = {Learning User Representations for Open Vocabulary Image Hashtag Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Sketch Less for More: On-the-Fly Fine-Grained Sketch-Based Image Retrieval
Ayan Kumar Bhunia, Yongxin Yang, Timothy M. Hospedales, Tao Xiang, Yi-Zhe Song


Fine-grained sketch-based image retrieval (FG-SBIR) addresses the problem of retrieving a particular photo instance given a user's query sketch. Its widespread applicability is however hindered by the fact that drawing a sketch takes time, and most people struggle to draw a complete and faithful sketch. In this paper, we reformulate the conventional FG-SBIR framework to tackle these challenges, with the ultimate goal of retrieving the target photo with the least number of strokes possible. We further propose an on-the-fly design that starts retrieving as soon as the user starts drawing. To accomplish this, we devise a reinforcement learning based cross-modal retrieval framework that directly optimizes rank of the ground-truth photo over a complete sketch drawing episode. Additionally, we introduce a novel reward scheme that circumvents the problems related to irrelevant sketch strokes, and thus provides us with a more consistent rank list during the retrieval. We achieve superior early-retrieval efficiency over state-of-the-art methods and alternative baselines on two publicly available fine-grained sketch retrieval datasets.
[retrieval, policy, reward, reinforcement, retrieve, embedding, order, account, rasterized, visual, attention, time, goal] [framework, feature, table, global, branch, final, focus] [model, query, trained] [based, figure, proposed, designed, noisy] [sketch, photo, image, loss, sbir, drawing, paired, tao, user, list, stroke, representation, target, generation, train, fatt, yongxin, introduce] [rank, learning, triplet, deep, performance, ranking, early, baseline, objective, number, timothy, episode, training, network, vector, data, function, negative, layer, design, compared, optimization, john] [complete, rendering, distance, incomplete, partial, well, computer, vision, completion, initial, novel]
@InProceedings{Bhunia_2020_CVPR,
  author = {Bhunia, Ayan Kumar and Yang, Yongxin and Hospedales, Timothy M. and Xiang, Tao and Song, Yi-Zhe},
  title = {Sketch Less for More: On-the-Fly Fine-Grained Sketch-Based Image Retrieval},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Few-Shot Pill Recognition
Suiyi Ling, Andreas Pastor, Jing Li, Zhaohui Che, Junle Wang, Jieun Kim, Patrick Le Callet


Pill image recognition is vital for many personal/public health-care applications and should be robust to diverse unconstrained real-world conditions. Most existing pill recognition models are limited in tackling this challenging few-shot learning problem due to the insufficient number of instances per category. With limited training data, neural network-based models have limitations in discovering the most discriminative features, or in going deeper. In particular, existing models fail to handle hard samples taken under less controlled imaging conditions. In this study, a new pill image database, namely CURE, is first developed with more varied imaging conditions and more instances per pill category. Secondly, a W2-net is proposed for better pill segmentation. Thirdly, a Multi-Stream (MS) deep network that captures task-related features, along with a novel two-stage training methodology, is proposed. Within the proposed framework, a Batch All strategy that considers all the samples is first employed for the sub-streams, and then a Batch Hard strategy that considers only the hard samples mined in the first stage is utilized for the fusion network. In this way, the model focuses on complex samples that cannot be represented by a single type of feature and is forced to exploit other domain-related information more effectively. Experimental results show that the proposed model outperforms state-of-the-art models on both the National Institutes of Health (NIH) and our CURE databases.
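A minimal sketch of the Batch Hard stage mentioned above: within a mini-batch, each anchor is paired with its hardest positive (farthest same-class sample) and hardest negative (closest other-class sample) in a standard triplet margin loss. The margin value and the multi-stream fusion context are illustrative, not taken from the paper.

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """embeddings: (B, D); labels: (B,) integer class ids."""
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)  # (B, B) pairwise distances
    same = labels[:, None] == labels[None, :]
    idx = np.arange(len(labels))
    losses = []
    for i in idx:
        pos_mask = same[i] & (idx != i)
        if not pos_mask.any() or same[i].all():
            continue  # anchor needs at least one positive and one negative in the batch
        hardest_pos = d[i][pos_mask].max()
        hardest_neg = d[i][~same[i]].min()
        losses.append(max(0.0, margin + hardest_pos - hardest_neg))
    return float(np.mean(losses)) if losses else 0.0
```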
[recognition, text, stream, dataset, three, individual, recognizer] [hard, table, contour, map, stage, category, segmentation, employed, feature] [pill, imprinted, model, cure, nih, trained, summarized, consumer, database, dts, retrained, input] [proposed, figure, fusion, reference, based, ieee, adam, pattern, imaging, noisy, mdp] [texture, image, loss, train] [learning, training, network, strategy, set, batch, triplet, neural, deep, performance, data, similarity, metric, compared, better, number, considered, size, function, probability, impact, negative, selected, processing] [conference, second, international, rgb, computer, vision, novel, system, shape]
@InProceedings{Ling_2020_CVPR,
  author = {Ling, Suiyi and Pastor, Andreas and Li, Jing and Che, Zhaohui and Wang, Junle and Kim, Jieun and Callet, Patrick Le},
  title = {Few-Shot Pill Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PointRend: Image Segmentation As Rendering
Alexander Kirillov, Yuxin Wu, Kaiming He, Ross Girshick


We present a new method for efficient high-quality image segmentation of objects and scenes. By analogizing classical computer graphics methods for efficient rendering with over- and undersampling challenges faced in pixel labeling tasks, we develop a unique perspective of image segmentation as a rendering problem. From this vantage, we present the PointRend (Point-based Rendering) neural network module: a module that performs point-based segmentation predictions at adaptively selected locations based on an iterative subdivision algorithm. PointRend can be flexibly applied to both instance and semantic segmentation tasks by building on top of existing state-of-the-art models. While many concrete implementations of the general idea are possible, we show that a simple design already achieves excellent results. Qualitatively, PointRend outputs crisp object boundaries in regions that are over-smoothed by previous methods. Quantitatively, PointRend yields significant gains on COCO and Cityscapes, for both instance and semantic segmentation. PointRend's efficiency enables output resolutions that are otherwise impractical in terms of memory or computation compared to existing approaches. Code has been made available at https://github.com/facebookresearch/detectron2/tree/master/projects/PointRend.
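A minimal sketch of the adaptive point selection used at inference time: after upsampling the coarse mask, the N most uncertain locations (foreground probability closest to 0.5) are picked and re-predicted from fine-grained point features. Only the selection step is shown; the point head itself is omitted.

```python
import numpy as np

def most_uncertain_points(mask_probs, num_points):
    """mask_probs: (H, W) foreground probabilities; returns (num_points, 2) (row, col) indices."""
    uncertainty = -np.abs(mask_probs - 0.5)           # larger means closer to the 0.5 decision boundary
    flat = np.argsort(uncertainty.ravel())[-num_points:]
    rows, cols = np.unravel_index(flat, mask_probs.shape)
    return np.stack([rows, cols], axis=1)
```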
[prediction, regular, predict, bilinear, step] [pointrend, segmentation, mask, semantic, feature, instance, subdivision, head, coco, object, table, biased, ross, map, predicted, lvis, semanticfpn, kaiming, box, piotr, module] [input, uncertain] [output, resolution, conv, based, convolutional, figure, method, high, adaptively, convolution, classical] [image, representation] [sampling, training, selected, strategy, inference, standard, higher, selection, sampled, default, small, neural, applied, label, number, network, class, uniform, design, set, larger, vector, baseline] [coarse, point, grid, computer, mlp, rendering, compute]
@InProceedings{Kirillov_2020_CVPR,
  author = {Kirillov, Alexander and Wu, Yuxin and He, Kaiming and Girshick, Ross},
  title = {PointRend: Image Segmentation As Rendering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ABCNet: Real-Time Scene Text Spotting With Adaptive Bezier-Curve Network
Yuliang Liu, Hao Chen, Chunhua Shen, Tong He, Lianwen Jin, Liangwei Wang


Scene text detection and recognition has received increasing research attention. Existing methods can be roughly categorized into two groups: character-based and segmentation-based. These methods either are costly for character annotation or need to maintain a complex pipeline, which is often not suitable for real-time applications. Here we address the problem by proposing the Adaptive Bezier-Curve Network (\BeCan). Our contributions are three-fold: 1) For the first time, we adaptively fit oriented or curved text by a parameterized Bezier curve. 2) We design a novel BezierAlign layer for extracting accurate convolution features of a text instance with arbitrary shapes, significantly improving the precision compared with previous methods. 3) Compared with standard bounding box detection, our Bezier curve detection introduces negligible computation overhead, resulting in superiority of our method in both efficiency and accuracy. Experiments on oriented or curved benchmark datasets, namely Total-Text and CTW1500, demonstrate that \BeCan achieves state-of-the-art accuracy, meanwhile significantly improving the speed. In particular, on Total-Text, our real-time version is over 10 times faster than recent state-of-the-art methods with a competitive recognition accuracy. Code is available at https://git.io/AdelaiDet.
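A minimal sketch of the cubic Bezier parameterization used for curved text boundaries: each boundary is described by four control points, and points on the curve are Bernstein-polynomial combinations of them. The BezierAlign feature sampling itself is not shown.

```python
import numpy as np

def cubic_bezier(control_points, t):
    """control_points: (4, 2) array; t: (T,) parameters in [0, 1] -> (T, 2) curve points."""
    t = np.asarray(t, dtype=float)[:, None]
    c0, c1, c2, c3 = control_points
    # Bernstein basis of degree 3
    return ((1 - t) ** 3 * c0 + 3 * (1 - t) ** 2 * t * c1
            + 3 * (1 - t) * t ** 2 * c2 + t ** 3 * c3)
```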
[text, bezier, curved, recognition, spotting, bezieralign, abcnet, previous, dataset, represent, trainable, yuliang, straight, lianwen, time, recognize] [oriented, detection, bounding, represents, feature, table, box, roi, proposal, cnn, mask, faster, propose, region, ablation, branch, chunhua, tong, framework, grouping, xiang, annotation, introduces, detect] [curve, original, robust] [method, figure, ieee, quadrilateral, proposed, convolutional, cubic, based, analysis, output, version, pattern, adaptive] [image, control, arbitrary, generation, synthesized, document, qualitative] [sampling, data, network, neural, size, compared, performance, equation, note, computation, standard, deep, inference, better, training, number, parameterized, design, learning, improved] [scene, ground, truth, demonstrate, shape]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yuliang and Chen, Hao and Shen, Chunhua and He, Tong and Jin, Lianwen and Wang, Liangwei},
  title = {ABCNet: Real-Time Scene Text Spotting With Adaptive Bezier-Curve Network},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Temporal Co-Attention Models for Unsupervised Video Action Localization
Guoqiang Gong, Xinghan Wang, Yadong Mu, Qi Tian


Temporal action localization (TAL) in untrimmed videos has recently received tremendous research attention. To the best of our knowledge, this is the first attempt in the literature to explore this task under an unsupervised setting, hereafter referred to as action co-localization (ACL), where only the total count of unique actions that appear in the video set is known. To solve ACL, we propose a two-step "clustering + localization" iterative procedure. The clustering step provides noisy pseudo-labels for the localization step, and the localization step provides temporal co-attention models that in turn improve the clustering performance. Using this two-step procedure, weakly-supervised TAL can be regarded as a direct extension of our ACL model. Technically, our contributions are twofold: 1) temporal co-attention models, either class-specific or class-agnostic, learned from video-level labels or pseudo-labels in an iterative reinforced fashion; 2) new losses specially designed for ACL, including an action-background separation loss and a cluster-based triplet loss. Comprehensive evaluations are conducted on 20-action THUMOS14 and 100-action ActivityNet-1.2. On both benchmarks, the proposed model for ACL exhibits strong performance, surprisingly comparable even with state-of-the-art weakly-supervised methods. For example, the previous best weakly-supervised model achieves 26.8% under mAP@0.5 on THUMOS14; our new records are 30.1% (weakly-supervised) and 25.0% (unsupervised).
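A schematic sketch of the "clustering + localization" loop, with a hypothetical refine_fn standing in for training the temporal co-attention model and re-extracting video-level features; the helper name and the scikit-learn choice are assumptions for illustration, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans

def acl_iterations(video_features: np.ndarray, num_actions: int, refine_fn, num_iters: int = 5):
    # video_features: (num_videos, feat_dim) initial video-level features.
    # refine_fn(features, pseudo_labels) -> refined features; placeholder for the
    # localization step that trains co-attention models on the pseudo-labels.
    features = video_features
    pseudo_labels = None
    for _ in range(num_iters):
        # Clustering step: noisy pseudo-labels from the current features.
        pseudo_labels = KMeans(n_clusters=num_actions, n_init=10).fit_predict(features)
        # Localization step: improve the features using the pseudo-labels.
        features = refine_fn(features, pseudo_labels)
    return pseudo_labels, features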
[action, temporal, video, attention, untrimmed, snippet, work, evaluation, context, previous, extract, step, recognition, inspired] [localization, feature, map, denotes, table, score, weakly, detection, propose, aggregation, precise, threshold, proposal, acl, boundary, category, object, module, background, iou] [model, iterative] [separation, proposed, figure, method, based, flow, high, spectral, block, designed] [loss, unsupervised, cluster, representation, supervised, pseudo, image, train, learn] [clustering, set, training, number, label, denote, class, network, iteration, average, performance, triplet, test, validation, architecture, learning, best, task, problem, setting] [rgb]
@InProceedings{Gong_2020_CVPR,
  author = {Gong, Guoqiang and Wang, Xinghan and Mu, Yadong and Tian, Qi},
  title = {Learning Temporal Co-Attention Models for Unsupervised Video Action Localization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Spatiotemporal Fusion in 3D CNNs: A Probabilistic View
Yizhou Zhou, Xiaoyan Sun, Chong Luo, Zheng-Jun Zha, Wenjun Zeng


Despite the success in still image recognition, deep neural networks for spatiotemporal signal tasks (such as human action recognition in videos) have long suffered from low efficacy and inefficiency. Recently, human experts have put more effort into analyzing the importance of different components in 3D convolutional neural networks (3D CNNs) to design more powerful spatiotemporal learning backbones. Among many others, spatiotemporal fusion is one of the essentials. It controls how spatial and temporal signals are extracted at each layer during inference. Previous attempts usually start with ad-hoc designs that empirically combine certain convolutions and then draw conclusions based on the performance obtained by training the corresponding networks. These methods only support network-level analysis on a limited number of fusion strategies. In this paper, we propose to convert the spatiotemporal fusion strategies into a probability space, which allows us to perform network-level evaluations of various fusion strategies without having to train them separately. In addition, we can obtain fine-grained numerical information such as layer-level preference on spatiotemporal fusion within the probability space. Our approach greatly boosts the efficiency of analyzing spatiotemporal fusion. Based on the probability space, we further generate new fusion strategies which achieve state-of-the-art performance on four well-known action recognition datasets.
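For context, two of the layer-level fusion choices such an analysis ranges over can be written in a few lines of PyTorch; this is an illustrative assumption about the building blocks being compared, not the paper's probabilistic formulation itself.

import torch.nn as nn

def full_3d_fusion(in_ch: int, out_ch: int) -> nn.Module:
    # Spatial and temporal signals fused jointly in a single 3D kernel.
    return nn.Conv3d(in_ch, out_ch, kernel_size=(3, 3, 3), padding=1)

def factorized_fusion(in_ch: int, out_ch: int) -> nn.Module:
    # Spatial-only convolution followed by a temporal-only convolution.
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
        nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
    )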
[spatiotemporal, action, video, temporal, unit, recognition, three, individual, dataset, yizhou, xiaoyan, work, wenjun, semantics, construct, associated, droppath] [template, propose, backbone, feature, effectiveness, table, denotes] [evaluated, derived, analyze] [fusion, ieee, based, posteriori, pattern, convolution, cnns, spatial, analysis, kernel, conv, method, designed, convolutional, proposed, residual, optimized, motion] [variational, corresponding] [probability, network, space, distribution, strategy, basic, training, performance, layer, learning, marginal, probabilistic, neural, number, arxiv, preprint, accuracy, dropout, deep, random, weight, best, equivalent, higher, observe, set, mixed] [conference, computer, vision, approach, defined, international, human, numerical, european]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Yizhou and Sun, Xiaoyan and Luo, Chong and Zha, Zheng-Jun and Zeng, Wenjun},
  title = {Spatiotemporal Fusion in 3D CNNs: A Probabilistic View},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Uncertainty-Aware Score Distribution Learning for Action Quality Assessment
Yansong Tang, Zanlin Ni, Jiahuan Zhou, Danyang Zhang, Jiwen Lu, Ying Wu, Jie Zhou


Assessing action quality from videos has attracted growing attention in recent years. Most existing approaches tackle this problem with regression algorithms, which ignore the intrinsic ambiguity in the score labels caused by multiple judges or their subjective appraisals. To address this issue, we propose an uncertainty-aware score distribution learning (USDL) approach for action quality assessment (AQA). Specifically, we regard an action as an instance associated with a score distribution, which describes the probability of different evaluated scores. Moreover, under the circumstance where finer-grained score labels are available (e.g., the difficulty degree of an action or multiple scores from different judges), we further devise a multi-path uncertainty-aware score distribution learning (MUSDL) method to explore the disentangled components of a score. To demonstrate the effectiveness of our proposed methods, we conduct experiments on two AQA datasets containing various Olympic actions. Our approaches set new state-of-the-art results under Spearman's rank correlation (i.e., 0.8102 on AQA-7 and 0.9273 on MTL-AQA).
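A minimal sketch of the score-distribution target described above, assuming the score range is discretized into bins (the bin set and sigma below are illustrative choices, not the paper's exact settings): the ground-truth score is softened into a Gaussian over the bins and the predicted distribution is matched to it with KL divergence.

import torch
import torch.nn.functional as F

def gaussian_score_target(score: float, bins: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # bins: 1-D tensor of discrete score values; returns a normalized target distribution.
    logits = -((bins - score) ** 2) / (2 * sigma ** 2)
    return torch.softmax(logits, dim=0)

def score_distribution_loss(pred_logits: torch.Tensor, score: float, bins: torch.Tensor) -> torch.Tensor:
    # pred_logits: (num_bins,) raw network outputs over the same bins.
    target = gaussian_score_target(score, bins)
    return F.kl_div(F.log_softmax(pred_logits, dim=0), target, reduction="sum")

# Example bin set (hypothetical): scores from 0 to 100 in unit steps.
bins = torch.linspace(0, 100, steps=101)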
[action, video, dataset, multiple, diving, temporal, difficulty, order, prediction, three, recognition, predict] [score, final, predicted, regression, table, effectiveness, grant, china, fully, pooling] [model, aqa, assessment, quality, musdl, usdl, degree, judge, surgical, study, spre, olympic, yansong, assessing, conduct, parmar, input, facial, testing] [method, figure, proposed, gaussian, based, xin, existing, performs, analysis, convolutional] [generated, generate] [distribution, learning, label, network, training, performance, compared, best, jiwen, set, better, deep, average, normalized, jie, probability, calculated, soft, experiment] [single, approach, uncertainty, computer, well, ambiguity, vision]
@InProceedings{Tang_2020_CVPR,
  author = {Tang, Yansong and Ni, Zanlin and Zhou, Jiahuan and Zhang, Danyang and Lu, Jiwen and Wu, Ying and Zhou, Jie},
  title = {Uncertainty-Aware Score Distribution Learning for Action Quality Assessment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Interactions and Relationships Between Movie Characters
Anna Kukleva, Makarand Tapaswi, Ivan Laptev


Interactions between people are often governed by their relationships. On the flip side, social relationships are built upon several interactions. Two strangers are more likely to greet and introduce themselves while becoming friends over time. We are fascinated by this interplay between interactions and relationships, and believe that it is an important aspect of understanding social situations. In this work, we propose neural models to learn and jointly predict interactions, relationships, and the pair of characters that are involved. We note that interactions are informed by a mixture of visual and dialog cues, and present a multimodal architecture to extract meaningful information from them. Localizing the pair of interacting characters in video is a time-consuming process, instead, we train our model to learn from clip-level weak labels. We evaluate our models on the MovieGraphs dataset and show the impact of modalities, use of longer temporal context for predicting relationships, and achieve encouraging performance using weak labels as compared with ground-truth labels. Code is online.
[interaction, pair, character, relationship, clip, visual, social, predict, video, prediction, predicting, people, dataset, understanding, temporal, multiple, work, action, dialog, correct, sic, movie, multimodal, int, modeling, trimmed, char, makarand, juno, short, long, explains, recognition, localizing, moviegraphs] [weak, table, parent, feature, score, hard, sanja, track] [model, example, studying, improve, drop] [based, relu, figure, presented, performed] [loss, learn, train, common] [accuracy, learning, max, set, function, test, performance, label, sum, training, number, andrew, best, entire, neural, note] [joint, jointly, approach, human, overlapping, full, rel]
@InProceedings{Kukleva_2020_CVPR,
  author = {Kukleva, Anna and Tapaswi, Makarand and Laptev, Ivan},
  title = {Learning Interactions and Relationships Between Movie Characters},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Video Panoptic Segmentation
Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon


Panoptic segmentation has become a new standard visual recognition task, unifying the previously separate semantic segmentation and instance segmentation tasks. In this paper, we propose and explore a new video extension of this task, called video panoptic segmentation. The task requires generating consistent panoptic segmentation as well as an association of instance ids across video frames. To invigorate research on this new task, we present two types of video panoptic datasets. The first is a re-organization of the synthetic VIPER dataset into the video panoptic format to exploit its large-scale pixel annotations. The second is a temporal extension of the Cityscapes val. set, providing new video panoptic annotations (Cityscapes-VPS). Moreover, we propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance ids for tracking, and semantic segmentation in video frames. To provide appropriate metrics for this task, we propose a video panoptic quality (VPQ) metric and evaluate our method and several other baselines. Experimental results demonstrate the effectiveness of the two presented datasets. We achieve state-of-the-art results in image PQ on Cityscapes and also in VPQ on Cityscapes-VPS and VIPER datasets.
[video, dataset, temporal, frame, recognition, provide, work, temporally, prediction, evaluation] [panoptic, segmentation, instance, object, semantic, feature, viper, track, vpsnet, fuse, vps, stuff, vpq, propose, module, tracking, predicted, mask, iou, head, table, thing, roi, tube, fusetrack, kaiming, benchmark, coco] [quality, datasets, model, public, improve] [ieee, pattern, reference, figure, method, proposed, window, existing, pixel, based, flow, high] [image, target, consistency, train] [task, network, set, baseline, class, learning, metric, arxiv, preprint, problem, evaluate, number, design, label, size] [computer, conference, vision, ground, truth, single, international, matching, dense, scene]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Dahun and Woo, Sanghyun and Lee, Joon-Young and Kweon, In So},
  title = {Video Panoptic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Understanding Human Hands in Contact at Internet Scale
Dandan Shan, Jiaqi Geng, Michelle Shu, David F. Fouhey


Hands are the central means by which humans manipulate their world, and being able to reliably extract hand state information from Internet videos of humans using their hands has the potential to pave the way to systems that can learn from petabytes of video data. This paper proposes steps towards this by inferring a rich representation of hands engaged in interaction that includes: hand location, side, contact state, and a box around the object in contact. To support this effort, we gather a large-scale dataset of hands in contact with objects consisting of 131 days of footage, as well as a 100K annotated hand-contact video frame dataset. The model learned on this dataset can serve as a foundation for hand-contact understanding in videos. We quantitatively evaluate it both on its own and in service of predicting and learning from 3D meshes of human hands.
[dataset, state, video, interaction, understanding, rich, work, correct, build, youtube, multiple, recognizing, egocentric, visual, frame, predicting, infer, identifying, action] [object, box, detection, side, bounding, table, annotation, annotated, center] [datasets, trained, model, identify, vgg] [figure, scale, method, analysis, existing, comparison] [image, person, train, representation, learn, foundation, source] [training, learning, data, evaluate, set, performance, standard, test, wide, random, compared] [hand, contact, system, well, mesh, pose, reconstruction, full, internet, human, vlog, engaged, approach, grasp, variety, single, enables, portable, viva, david, estimation, rgb, gathered, mano]
@InProceedings{Shan_2020_CVPR,
  author = {Shan, Dandan and Geng, Jiaqi and Shu, Michelle and Fouhey, David F.},
  title = {Understanding Human Hands in Contact at Internet Scale},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman


Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
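The core of a MIL-NCE-style objective can be sketched compactly; the shapes and the in-batch negatives below are assumptions for illustration rather than the authors' exact implementation. Each clip embedding is scored against several candidate narration embeddings that are jointly treated as positives, with all other clip-narration pairs in the batch acting as negatives.

import torch

def mil_nce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # video_emb: (B, D); text_emb: (B, K, D) with K candidate narrations per clip.
    b, k, d = text_emb.shape
    # Similarity of every clip with every candidate narration in the batch: (B, B, K).
    sim = torch.einsum("bd,nkd->bnk", video_emb, text_emb).reshape(b, b * k)
    # The K narrations belonging to the same clip are the positive bag.
    pos_mask = torch.zeros_like(sim, dtype=torch.bool)
    for i in range(b):
        pos_mask[i, i * k:(i + 1) * k] = True
    pos = torch.logsumexp(sim.masked_fill(~pos_mask, float("-inf")), dim=1)
    denom = torch.logsumexp(sim, dim=1)
    # -log( sum over positives / sum over all pairs ), averaged over the batch.
    return (denom - pos).mean()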
[video, action, visual, dataset, language, instructional, embedding, narrated, multiple, text, narration, uncurated, recognition, josef, evaluation, pair, ucf, ivan, retrieval, nce, clip, work, outperforms, crosstask, time, downstream, speech, ctr, hmdb] [positive, table, annotated, instance, localization, ablation, challenging, object] [model, trained, noise] [method, figure, based, prior] [representation, learn, supervised, learnt, image, train, pretrained, perform, loss] [learning, training, arxiv, preprint, set, evaluate, candidate, manually, negative, andrew, number, contrastive, data, deep, classification, objective, imagenet, unlabeled, consider, sampling, report] [joint, approach, compare, well, single]
@InProceedings{Miech_2020_CVPR,
  author = {Miech, Antoine and Alayrac, Jean-Baptiste and Smaira, Lucas and Laptev, Ivan and Sivic, Josef and Zisserman, Andrew},
  title = {End-to-End Learning of Visual Representations From Uncurated Instructional Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions
Evonne Ng, Donglai Xiang, Hanbyul Joo, Kristen Grauman


The body pose of a person wearing a camera is of great interest for applications in augmented reality, healthcare, and robotics, yet much of the person's body is out of view for a typical wearable camera. We propose a learning-based approach to estimate the camera wearer's 3D body pose from egocentric video sequences. Our key insight is to leverage interactions with another person---whose body pose we can directly observe---as a signal inherently linked to the body pose of the first-person subject. We show that since interactions between individuals often induce a well-ordered series of back-and-forth responses, it is possible to learn a temporal model of the interlinked poses even though one party is largely out of view. We demonstrate our idea on a variety of domains with dyadic interaction and show the substantial impact on egocentric body pose estimation, which improves the state of the art.
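A schematic sketch, under assumed feature and joint dimensions, of the kind of temporal model such an approach relies on: per-frame observations (including the visible interactee's pose) feed a recurrent network that regresses the out-of-view camera wearer's 3D body joints. This is an illustration, not the authors' architecture.

import torch.nn as nn

class WearerPoseLSTM(nn.Module):
    def __init__(self, obs_dim: int = 512, hidden: int = 256, num_joints: int = 25):
        # obs_dim, hidden and num_joints are hypothetical sizes.
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_joints * 3)  # 3D coordinates per joint

    def forward(self, obs_seq):
        # obs_seq: (B, T, obs_dim) per-frame features (interactee pose + egocentric scene cues).
        h, _ = self.lstm(obs_seq)
        return self.head(h)  # (B, T, num_joints * 3)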
[egocentric, recognition, video, wearer, activity, interactee, social, interaction, work, sequence, frame, lstm, state, static, time, wearable, recurrent, predict, infer, recognizing, standing, embedding, visual, modeling, individual, action, skeleton] [panoptic, predicted, table, feature] [model, input, detecting] [ieee, pattern, method, motion, figure, prior, existing, dynamic] [person, row] [network, test, data, training, upper, learning, vector, neural, average, impact, better] [pose, camera, body, conference, vision, computer, approach, human, estimation, joint, ground, capture, scene, kinect, international, second, view, visible, estimate, truth, single, openpose, european, hand, structure, full, studio, inferred]
@InProceedings{Ng_2020_CVPR,
  author = {Ng, Evonne and Xiang, Donglai and Joo, Hanbyul and Grauman, Kristen},
  title = {You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning a Weakly-Supervised Video Actor-Action Segmentation Model With a Wise Selection
Jie Chen, Zhiheng Li, Jiebo Luo, Chenliang Xu


We address weakly-supervised video actor-action segmentation (VAAS), which extends general video object segmentation (VOS) to additionally consider action labels of the actors. The most successful methods on VOS synthesize a pool of pseudo-annotations (PAs) and then refine them iteratively. However, they face challenges as to how to select high-quality PAs from a massive pool of candidates, how to set an appropriate stopping condition for weakly-supervised training, and how to initialize PAs for VAAS. To overcome these challenges, we propose a general Weakly-Supervised framework with a Wise Selection of training samples and model evaluation criterion (WS^2). Instead of blindly trusting quality-inconsistent PAs, WS^2 employs a learning-based selection to select effective PAs and a novel region integrity criterion as a stopping condition for weakly-supervised training. In addition, a 3D-Conv GCAM is devised to adapt to the VAAS task. Extensive experiments show that WS^2 achieves state-of-the-art performance on both weakly-supervised VOS and VAAS tasks and is on par with the best fully-supervised method on VAAS.
[action, video, attention, actor, multiple, frame, dataset, prediction, recognition, evaluation, predict] [segmentation, object, gcam, ric, vaas, mask, mioupa, foreground, map, semantic, vos, table, minit, region, background, slic, weakly, framework, integrity, refined, highest, superpixel, boundary, weaklysupervised, segment, mrefine, refinement, propose, achieves, union] [model, trained, iterative, effective] [ieee, pattern, figure, version, motion, proposed, patch, convolutional, comparison] [train, supervised, generate] [pas, training, network, set, selected, learning, criterion, selection, select, performance, subset, validation, test, classification, label, evolution] [computer, conference, vision, full, initial, european]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Jie and Li, Zhiheng and Luo, Jiebo and Xu, Chenliang},
  title = {Learning a Weakly-Supervised Video Actor-Action Segmentation Model With a Wise Selection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Measure the Static Friction Coefficient in Cloth Contact
Abdullah Haroon Rasheed, Victor Romero, Florence Bertails-Descoubes, Stefanie Wuhrer, Jean-Sebastien Franco, Arnaud Lazarus


Measuring friction coefficients between cloth and an external body is a longstanding issue in mechanical engineering, never yet addressed with a pure vision-based system. The latter offers the prospect of simpler, less invasive friction measurement protocols compared to traditional ones, and can vastly benefit from recent deep learning advances. Such a novel measurement strategy however proves challenging, as no large labelled dataset for cloth contact exists, and creating one would require thousands of physics workbench measurements with broad coverage of cloth-material pairs. Using synthetic data instead is only possible assuming the availability of a soft-body mechanical simulator with true-to-life friction physics accuracy, yet to be verified. We propose a first vision-based measurement network for friction between cloth and a substrate, using a simple and repeatable video acquisition protocol. We train our network on purely synthetic data generated by a state-of-the-art frictional contact simulator, which we carefully calibrate and validate against real experiments under controlled conditions. We show promising results on a large set of contact pairs between real cloth samples and various kinds of substrates, with 93.6% of all measurements predicted within 0.1 range of standard physics bench measurements.
[dataset, visual, video, static, work, simulator, predicting, sequence] [propose, predicted, object, table] [model, physical, input, experimental, behaviour, controlled, protocol, curve] [coefficient, based, strip, range, motion, figure, reference, relu, conv] [real, conditional, synthetic, train, generalisation, image] [data, training, baseline, test, learning, parameter, deep, set, accuracy, architecture, class, label, problem, setup, neural, consider, experiment] [friction, material, cloth, simulated, contact, error, estimation, substrate, computer, measurement, simulation, frictional, rgus, estimating, dry, estimate, conference, textile, physically, law, supplemental, acm, elastic, force, vision, accurate, geometric, capture, drag, calibrated, viewpoint, plane]
@InProceedings{Rasheed_2020_CVPR,
  author = {Rasheed, Abdullah Haroon and Romero, Victor and Bertails-Descoubes, Florence and Wuhrer, Stefanie and Franco, Jean-Sebastien and Lazarus, Arnaud},
  title = {Learning to Measure the Static Friction Coefficient in Cloth Contact},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SpeedNet: Learning the Speediness in Videos
Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Michal Irani, Tali Dekel


We wish to automatically predict the "speediness" of moving objects in videos - whether they move faster, at, or slower than their "natural" speed. The core component in our approach is SpeedNet--a novel deep network trained to detect if a video is playing at normal rate, or if it is sped up. SpeedNet is trained on a large corpus of natural videos in a self-supervised manner, without requiring any manual annotations. We show how this single, binary classification network can be used to detect arbitrary rates of speediness of objects. We demonstrate prediction results by SpeedNet on a wide range of videos containing complex natural motions, and examine the visual cues it utilizes for making those predictions. Importantly, we show that through predicting the speed of videos, the model learns a powerful and meaningful space-time representation that goes beyond simple motion cues. We demonstrate how those learned features can boost the performance of self supervised action recognition, and can be used for video retrieval. Furthermore, we also apply SpeedNet for generating time-varying, adaptive video speedups, which can allow viewers to watch videos faster, but with less of the jittery, unnatural motions typical to videos that are sped up uniformly.
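The self-supervised training pairs can be formed with simple frame subsampling; below is a tiny sketch under assumed tensor shapes, not the authors' pipeline.

import torch

def make_speed_pair(video: torch.Tensor, clip_len: int, start: int):
    # video: (T, C, H, W); assumes T >= start + 2 * clip_len.
    # Returns (normal_clip, fast_clip), each of shape (clip_len, C, H, W).
    normal = video[start:start + clip_len]             # playback at the natural rate
    fast = video[start:start + 2 * clip_len:2]         # every 2nd frame ~ 2x speed
    return normal, fast                                 # labels: 0 = normal, 1 = sped up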
[video, speed, speednet, speediness, action, played, temporal, natural, kinetics, moving, prediction, recognition, frame, sped, clip, playback, predict, predicting, slow, time, dataset, work, visual, order, described, sequence, walking] [object, feature, segment, score] [model, trained, input, magnitude, original, query, curve] [motion, adaptive, spatial, ieee, method, pattern, figure, version, fast, flow, based] [representation, train, factor, arbitrary] [speedup, training, learning, network, test, task, set, accuracy, rate, classification, consider, random, arxiv, preprint, deep, large, max, vector, binary, performance] [normal, computer, vision, conference, camera, human, solving, international, second, demonstrate, determine]
@InProceedings{Benaim_2020_CVPR,
  author = {Benaim, Sagie and Ephrat, Ariel and Lang, Oran and Mosseri, Inbar and Freeman, William T. and Rubinstein, Michael and Irani, Michal and Dekel, Tali},
  title = {SpeedNet: Learning the Speediness in Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Telling Left From Right: Learning Spatial Correspondence of Sight and Sound
Karren Yang, Bryan Russell, Justin Salamon


Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the sensory streams. We propose a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream. Our approach is simple yet effective. We train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams. To train and evaluate our method, we introduce a large-scale video dataset, YouTube-ASMR-300K, with spatial audio comprising over 900 hours of footage. We demonstrate that understanding spatial correspondence enables models to perform better on three audio-visual tasks, achieving quantitative gains over supervised and self-supervised baselines that do not leverage spatial audio cues. We also show how to extend our self-supervised approach to 360 degree videos with ambisonic audio.
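The pretext task itself amounts to a small data-preparation step; the array layout below is an assumption for illustration, not the authors' pipeline.

import random
import numpy as np

def maybe_flip_channels(stereo_audio: np.ndarray):
    # stereo_audio: (2, num_samples) with rows [left, right].
    flipped = random.random() < 0.5
    audio = stereo_audio[::-1].copy() if flipped else stereo_audio
    # The model sees the (possibly flipped) audio plus the video frames and
    # predicts the binary label, forcing it to relate spatial audio cues to
    # the on-screen positions of sound sources.
    return audio, int(flipped)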
[audio, visual, sound, video, dataset, stream, time, temporal, binaural, downstream, evaluation, spatialization, embeddings, gao, understanding, three, audiovisual, moving, order] [localization, semantic, table, location, feature] [model, trained, input, difference, strong] [spatial, ieee, based, figure, separation, pattern, proposed, signal] [pretrained, source, learn, supervised, representation, train, alignment, separate] [task, learning, pretext, test, classification, learned, training, performance, network, data, neural, set, imagenet, andrew, processing, evaluate, scratch] [correspondence, stereo, conference, vision, computer, left, mono, determine, matching, leverage, error, leveraging, international]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Karren and Russell, Bryan and Salamon, Justin},
  title = {Telling Left From Right: Learning Spatial Correspondence of Sight and Sound},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Visual-Textual Capsule Routing for Text-Based Video Segmentation
Bruce McIntosh, Kevin Duarte, Yogesh S Rawat, Mubarak Shah


Joint understanding of vision and natural language is a challenging problem with a wide range of applications in artificial intelligence. In this work, we focus on the integration of video and text for the task of actor and action video segmentation from a sentence. We propose a capsule-based approach which performs pixel-level localization based on a natural language query describing the actor of interest. We encode both the video and textual input in the form of capsules, which provide a more effective representation in comparison with standard convolution-based features. Our novel visual-textual routing mechanism allows for the fusion of video and text capsules to successfully localize the actor and action. Existing works on actor-action localization are mainly focused on localization in a single frame instead of the full video. Different from existing works, we propose to perform the localization on all frames of the video. To validate the potential of the proposed network for actor and action video localization, we extend an existing actor-action dataset (A2D) with annotations for all the frames. The experimental evaluation demonstrates the effectiveness of our capsule network for text-selective actor and action localization in videos. The proposed method also improves upon the performance of existing state-of-the-art works on single frame-based localization.
[video, sentence, actor, action, textual, dataset, visual, language, natural, text, described, frame, previous, outperforms, multiple, artificial, captioning] [segmentation, iou, bounding, localization, box, feature, propose, map, object, merging, table, detection, merge] [trained, masking, input, model, query, effective, create] [capsule, routing, method, figure, based, proposed, dynamic, output, convolutional, existing, ieee, convolution] [image, loss, representation, perform, corresponding, row, generate, train] [network, classification, find, set, training, task, procedure, neural, algorithm, test, higher, entire, processing] [conference, vision, computer, single, pose, second, approach, allows]
@InProceedings{McIntosh_2020_CVPR,
  author = {McIntosh, Bruce and Duarte, Kevin and Rawat, Yogesh S and Shah, Mubarak},
  title = {Visual-Textual Capsule Routing for Text-Based Video Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Graph-Structured Referring Expression Reasoning in the Wild
Sibei Yang, Guanbin Li, Yizhou Yu


Grounding referring expressions aims to locate in an image an object referred to by a natural language expression. The linguistic structure of a referring expression provides a layout of reasoning over the visual contents, and it is often crucial to align and jointly understand the image and the referring expression. In this paper, we propose a scene graph guided modular network (SGMN), which performs reasoning over a semantic graph and a scene graph with neural modules under the guidance of the linguistic structure of the expression. In particular, we model the image as a structured semantic graph, and parse the expression into a language scene graph. The language scene graph not only decodes the linguistic structure of the expression, but also has a consistent representation with the image semantic graph. In addition to exploring structured solutions to grounding referring expressions, we also propose Ref-Reasoning, a large-scale real-world dataset for structured referring expression reasoning. We automatically generate referring expressions over the scene graphs of images using diverse expression templates and functional programs. This dataset is equipped with real-world visual contents as well as semantically rich expressions with different reasoning layouts. Experimental results show that our SGMN not only significantly outperforms existing state-of-the-art algorithms on the new Ref-Reasoning dataset, but also surpasses state-of-the-art structured methods on commonly used benchmark datasets. It can also provide interpretable visual evidence of reasoning.
[graph, node, referring, attention, reasoning, visual, language, dataset, sgmn, grounding, linguistic, structured, relation, referent, vko, vks, modular, natural, evaluation, word, order, associated, blue] [semantic, object, module, map, feature, loc, edge, table, benchmark, merge, final, guided, holistic] [expression, datasets, model, norm, input, query] [ieee, pattern, spatial, existing, proposed, guidance, figure, intermediate, method, performs] [image, generate, learn, transfer, appearance, representation, perform, layout] [neural, set, performance, inference, network, performing, number, process, denoted, learning] [scene, conference, vision, computer, structure, functional, compute, combine, defined]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Sibei and Li, Guanbin and Yu, Yizhou},
  title = {Graph-Structured Referring Expression Reasoning in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs
Shizhe Chen, Qin Jin, Peng Wang, Qi Wu


Humans are able to describe image contents with coarse to fine details as they wish. However, most image captioning models are intention-agnostic and cannot proactively generate diverse descriptions according to different user intentions. In this work, we propose the Abstract Scene Graph (ASG) structure to represent user intention at a fine-grained level and control what, and how detailed, the generated description should be. The ASG is a directed graph consisting of three types of abstract nodes (object, attribute, relationship) grounded in the image without any concrete semantic labels. Thus it is easy to obtain either manually or automatically. From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph, and therefore generate desired captions following the graph structure. Our model achieves better controllability conditioning on ASGs than carefully designed baselines on both VisualGenome and MSCOCO datasets. It also significantly improves caption diversity by automatically sampling diverse ASGs as control signals. Code will be released at https://github.com/cshizhe/asg2cap.
[graph, node, asg, attention, captioning, caption, asgs, visual, describe, visualgenome, order, mscoco, automatically, decoder, language, relationship, embedding, intention, designated, grounded, evaluation, hat, three, embeddings, connected, lstm, represent, previous] [semantic, propose, object, global, table, feature, level, fully, add, employ] [model, access] [ieee, pattern, flow, figure, proposed, signal, based] [image, control, generate, user, diverse, abstract, generation, generated, content, generating, diversity, controllable, attribute, encoder, desired, row, controllability, corresponding] [neural, learning, updating, machine, set, training, sampled, best, update] [conference, computer, vision, scene, structure, international, capture, dense]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Shizhe and Jin, Qin and Wang, Peng and Wu, Qi},
  title = {Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hierarchical Conditional Relation Networks for Video Question Answering
Thao Minh Le, Vuong Le, Svetha Venkatesh, Truyen Tran


Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts. We introduce a general-purpose reusable neural unit called Conditional Relation Network (CRN) that serves as a building block to construct more sophisticated structures for representation and reasoning over video. CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building becomes a simple exercise of replication, rearrangement and stacking of these reusable units for diverse modalities and contextual information. This design thus supports high-order relational and multi-step reasoning. The resulting architecture for VideoQA is a CRN hierarchy whose branches represent sub-videos or clips, all sharing the same question as the contextual condition. Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
[video, question, crn, kmax, hcrn, linguistic, hierarchical, conditioning, relation, visual, unit, reasoning, temporal, videoqa, clip, answer, frame, attention, crns, context, action, relational, answering, length, long, modeling, multimodal, time, dataset, interaction, order, mechanism, reusable] [feature, table, level, building, object, including] [model, input, zhou] [motion, ieee, output, pattern, block] [representation, conditional, conditioned, appearance, jun] [network, hierarchy, neural, design, memory, architecture, size, deep, set, performance, learning, count, task, linear, sampling, deeper, subset, number] [conference, array, computer, vision, international, acm]
@InProceedings{Le_2020_CVPR,
  author = {Le, Thao Minh and Le, Vuong and Venkatesh, Svetha and Tran, Truyen},
  title = {Hierarchical Conditional Relation Networks for Video Question Answering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments
Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, Anton van den Hengel


One of the long-term challenges of robotics is to enable robots to interact with humans in the visual world via natural language, as humans are visual animals that communicate through language. Overcoming this challenge requires the ability to perform a wide variety of complex tasks in response to multifarious instructions from humans. In the hope that it might drive progress towards more flexible and powerful human interactions with robots, we propose a dataset of varied and complex robot tasks, described in natural language, in terms of objects visible in a large set of real images. Given an instruction, success requires navigating through a previously-unseen environment to identify an object. This represents a practical challenge, but one that closely reflects one of the core visual problems in robotics. Several state-of-the-art vision-and-language navigation, and referring-expression models are tested to verify the difficulty of this new task, but none of them show promising results because there are many fundamental differences between our task and previous ones. A novel Interactive Navigator-Pointer model is also proposed that provides a strong baseline on the task. The proposed model especially achieves the best performance on the unseen test split, but still leaves substantial room for improvement compared to the human performance. Repository: https://github.com/YuankaiQi/REVERIE.
[navigation, agent, referring, reverie, visual, language, natural, instruction, dataset, goal, pointer, embodied, action, navigator, simulator, length, three, comprehension, current, attention, navigate, question, previous, localise, context, step, progress, bring, navigable, grounding, provide, evaluation, word, interaction] [object, bounding, interactive, module, remote, location, sota, box, val, challenge, achieves, including] [expression, model, success, identify, picture, strong] [proposed, dynamic, output, figure] [target, unseen, real, loss, bedroom] [task, test, performance, requires, set, baseline, candidate, data, top, rate, number, path, achieve] [human, viewpoint, robot, matching, provided, room, detailed, camera, indoor, rel]
@InProceedings{Qi_2020_CVPR,
  author = {Qi, Yuankai and Wu, Qi and Anderson, Peter and Wang, Xin and Wang, William Yang and Shen, Chunhua and Hengel, Anton van den},
  title = {REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA
Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach


Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a deep water label on a warning sign warns people about the danger in the scene. Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question. However, existing approaches for TextVQA are mostly based on custom pairwise fusion mechanisms between a pair of two modalities and are restricted to a single prediction step by casting TextVQA as a classification task. In this work, we propose a novel model for the TextVQA task based on a multimodal transformer architecture accompanied by a rich representation for text in images. Our model naturally fuses different modalities homogeneously by embedding them into a common semantic space where self-attention is applied to model inter- and intra- modality context. Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification. Our model outperforms existing approaches on three benchmark datasets for the TextVQA task by a large margin.
[ocr, answer, question, text, textvqa, visual, embedding, word, multimodal, transformer, decoding, token, dataset, previous, lorra, work, fasttext, vocabulary, prediction, attention, multiple, bert, three, rich, pointer, vqa, modality, decoder, language, step, outperforms, answering, extract, phoc, book, predict, sign, understanding] [feature, detected, object, table, bbox] [model, iterative, input, datasets] [output, based, figure, ieee, dynamic, pattern, fusion] [image, representation, pretrained, copying, appearance, list, common] [fixed, set, classifier, arxiv, preprint, training, space, learning, task, accuracy, test, architecture, large, vector, neural, processing] [conference, computer, vision, single, international]
@InProceedings{Hu_2020_CVPR,
  author = {Hu, Ronghang and Singh, Amanpreet and Darrell, Trevor and Rohrbach, Marcus},
  title = {Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SQuINTing at VQA Models: Introspecting VQA Models With Sub-Questions
Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Tulio Ribeiro, Besmira Nushi, Ece Kamar


Existing VQA datasets contain questions with varying levels of complexity. While the majority of questions in these datasets require perception for recognizing existence, properties, and spatial relationships of entities, a significant portion of questions pose challenges that correspond to reasoning tasks - tasks that can only be answered through a synthesis of perception and knowledge about the world, logic, and/or reasoning. Analyzing performance across this distinction allows us to notice when existing VQA models have consistency issues - they answer the reasoning questions correctly but fail on the associated low-level perception questions. For example, in Figure 1, models answer the complex reasoning question "Is the banana ripe enough to eat?" correctly, but fail on the associated perception question "Are the bananas mostly green or yellow?", indicating that the model likely answered the reasoning question correctly but for the wrong reason. We quantify the extent to which this phenomenon occurs by creating a new Reasoning split of the VQA dataset and collecting VQAintrospect, a new dataset which currently consists of 200K new perception questions that serve as sub-questions corresponding to the set of perceptual tasks needed to effectively answer the complex reasoning questions in the Reasoning split. Our evaluation shows that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems. To address this shortcoming, we propose an approach called Sub-Question Importance-aware Network Tuning (SQuINT), which encourages the model to attend to the same parts of the image when answering the reasoning question and the perception sub-question. We show that SQuINT improves model consistency by 7% and marginally improves performance on the Reasoning questions in VQA, while also producing better attention maps.
[reasoning, perception, question, answer, vqa, visual, attention, dataset, answering, associated, evaluation, squint, pythia, devi, correctly, grounding, ripe, work, language, dhruv, banana, compositional, vqaintrospect, current, natural, understanding, qualification, multiple, explicitly, order, described, correct, ramprasaath, wrong] [main, split] [model, original, answered, trained, datasets, identify, incorrect] [figure, spatial, ieee, fail, color, based] [loss, consistency, image, corresponding, cross, encourages, asked, common, train] [knowledge, accuracy, learning, data, task, entropy, requires, binary, performance, network, evaluate, simple, note, finetuning, tuning] [require, complex, vision, conference, approach, computer, additional, detailed]
@InProceedings{Selvaraju_2020_CVPR,
  author = {Selvaraju, Ramprasaath R. and Tendulkar, Purva and Parikh, Devi and Horvitz, Eric and Ribeiro, Marco Tulio and Nushi, Besmira and Kamar, Ece},
  title = {SQuINTing at VQA Models: Introspecting VQA Models With Sub-Questions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks
Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang


Vision-Language Navigation (VLN) is a task where an agent learns to navigate following a natural language instruction. The key to this task is to perceive both the visual scene and the natural language sequentially. Conventional approaches fully exploit vision and language features in cross-modal grounding. However, the VLN task remains challenging, since previous works have implicitly neglected the rich semantic information contained in environments (such as navigation graphs or sub-trajectory semantics). In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks that exploit additional training signals derived from this semantic information. The auxiliary tasks have four reasoning objectives: explaining the previous actions, evaluating the trajectory consistency, estimating the progress and predicting the next direction. As a result, these additional training signals help the agent acquire knowledge of semantic representations in order to reason about its activities and build a thorough perception of environments. Our experiments demonstrate that auxiliary reasoning tasks improve both the performance of the main task and the model's generalizability by a large margin. We further demonstrate empirically that an agent trained with self-supervised auxiliary reasoning tasks substantially outperforms the previous state-of-the-art method, being the best existing approach on the standard benchmark.
[navigation, agent, reasoning, language, trajectory, progress, attention, visual, forcing, previous, action, reinforcement, auxrn, context, natural, retelling, prediction, spl, step, imitation, outperforms, vln, turn, instruction, predict, feto, panoramic, word, fbt, current, monitor, three, question, beam] [feature, ablation, propose, val] [auxiliary, model, success, study] [method, pattern, ieee, proposed, result, based] [unseen, image, loss, introduce, domain, consists, train] [learning, task, teacher, training, validation, arxiv, preprint, set, baseline, data, higher, rate, processing, performance, evaluate, number, student, label, neural, standard, search] [vision, conference, computer, matching, error, international, estimation, approach, single, david, orientation, compare]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Fengda and Zhu, Yi and Chang, Xiaojun and Liang, Xiaodan},
  title = {Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Sign Language Transformers: Joint End-to-End Sign Language Recognition and Translation
Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden


Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) improves the translation performance drastically. In fact, the current state-of-the-art in translation requires gloss-level tokenization in order to work. We introduce a novel transformer-based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information, simultaneously solves two co-dependent sequence-to-sequence learning problems, and leads to significant performance gains. We evaluate the recognition and translation performance of our approaches on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report state-of-the-art sign language recognition and translation results achieved by our Sign Language Transformers. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models, in some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 score). We also share new baseline translation results using transformer networks for several other text-to-text sign language translation tasks.
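A hedged PyTorch sketch of such a joint objective (the loss weighting, padding index and tensor shapes are assumptions, not the paper's exact configuration): a CTC term ties the encoder outputs to gloss sequences for recognition, while a cross-entropy term over the decoder outputs drives translation.

import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
xent = nn.CrossEntropyLoss(ignore_index=0)  # 0 assumed to be the padding index

def joint_loss(gloss_log_probs, gloss_targets, input_lens, gloss_lens,
               word_logits, word_targets, ctc_weight: float = 1.0):
    # gloss_log_probs: (T, B, num_glosses) log-probabilities from the recognition head.
    # word_logits: (B, L, vocab) decoder outputs; word_targets: (B, L) spoken-language tokens.
    recognition = ctc(gloss_log_probs, gloss_targets, input_lens, gloss_lens)
    translation = xent(word_logits.reshape(-1, word_logits.size(-1)),
                       word_targets.reshape(-1))
    return ctc_weight * recognition + translation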
[sign, language, recognition, spoken, transformer, gloss, video, slt, cslr, word, sequence, necati, cihan, oscar, slrt, wer, automatic, german, hermann, work, sentence, evaluation, embedding, temporal, understanding, heute, nacht, goal, nmt] [table, level] [model, trained] [based, spatial, ieee, pattern, proposed, method] [translation, loss, learn, representation, train, generate, utilize, mapping] [machine, performance, learning, neural, training, network, linear, set, processing, best, grad, share, deep, report, performing, computational, test, architecture, problem, improved] [conference, international, continuous, computer, vision, approach, richard, jointly, joint, system]
@InProceedings{Camgoz_2020_CVPR,
  author = {Camgoz, Necati Cihan and Koller, Oscar and Hadfield, Simon and Bowden, Richard},
  title = {Sign Language Transformers: Joint End-to-End Sign Language Recognition and Translation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation
Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, Rongrong Ji


Referring expression comprehension (REC) and segmentation (RES) are two highly related tasks, which both aim at identifying the referent according to a natural language expression. In this paper, we propose a novel Multi-task Collaborative Network (MCN) to achieve joint learning of REC and RES for the first time. In MCN, RES can help REC to achieve better language-vision alignment, while REC can help RES to better locate the referent. In addition, we address a key challenge in this multi-task setup, i.e., the prediction conflict, with two innovative designs, namely Consistency Energy Maximization (CEM) and Adaptive Soft Non-Located Suppression (ASNLS). Specifically, CEM enables REC and RES to focus on similar visual regions by maximizing the consistency energy between the two tasks. ASNLS suppresses the response of unrelated regions in RES based on the prediction of REC. To validate our model, we conduct extensive experiments on three benchmark datasets for REC and RES, i.e., RefCOCO, RefCOCO+ and RefCOCOg. The experimental results report significant performance gains of MCN over all existing methods, i.e., up to +7.13% for REC and +11.50% for RES over SOTA, which confirm the validity of our model for joint REC and RES learning.
[rec, mcn, referring, prediction, collaborative, multimodal, cem, visual, asnls, language, attention, comprehension, three, refcoco, natural, refcocog, referent, mattnet, correct, testb, modeling, testa] [segmentation, bounding, val, feature, suppression, object, detection, hard, propose, box, mask, response, denotes, predicted, semantic, focus, regression, confidence, final, iou, table] [expression, model] [adaptive, based, existing, proposed, figure, convolutional, crop] [loss, image, consistency, address, tsc] [network, learning, performance, set, energy, soft, inference, maximization, test, processing, compared, better, deep, connection, log, training, arxiv, maximize, impact] [joint, single, error, compare, structure]
@InProceedings{Luo_2020_CVPR,
  author = {Luo, Gen and Zhou, Yiyi and Sun, Xiaoshuai and Cao, Liujuan and Wu, Chenglin and Deng, Cheng and Ji, Rongrong},
  title = {Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Counterfactual Vision and Language Learning
Ehsan Abbasnejad, Damien Teney, Amin Parvaneh, Javen Shi, Anton van den Hengel


The ongoing success of visual question answering methods has been somewhat surprising given that, at its most general, the problem requires understanding the entire variety of both visual and language stimuli. It is particularly remarkable that this success has been achieved on the basis of comparatively small datasets, given the scale of the problem. One explanation is that this has been accomplished partly by exploiting bias in the datasets rather than developing deeper multi-modal reasoning. This fundamentally limits the generalization of the method, and thus its practical applicability. We propose a method that addresses this problem by introducing counterfactuals in the training. In doing so, we leverage structural causal models for counterfactual evaluation to formulate alternatives, for instance, questions that could be asked of the same image set. We show that simulating plausible alternative training data through this process results in better generalization.
[question, answer, vqa, exogenous, visual, intervention, language, counterfactuals, embedding, answering, dataset, scm, causal, natural, agent, recognition, dhruv, devi, reasoning, observational, anton, den, observed, multimodal, interested, christopher, understanding] [table, van] [counterfactual, model, risk, input, adversarial, generalization, improve, trained, success] [pattern, ieee, prior, likelihood, figure] [image, variable, generate, corresponding, generating, learn, loss, target, structural, asked] [learning, training, distribution, data, set, posterior, better, machine, alternative, performance, note, processing, ehsan, process, empirical, baseline, test, minimum, simple, sampling, function, deep, random, objective, max, number, requires] [approach, conference, vision, computer, international, additional]
@InProceedings{Abbasnejad_2020_CVPR,
  author = {Abbasnejad, Ehsan and Teney, Damien and Parvaneh, Amin and Shi, Javen and Hengel, Anton van den},
  title = {Counterfactual Vision and Language Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Iterative Context-Aware Graph Inference for Visual Dialog
Dan Guo, Hui Wang, Hanwang Zhang, Zheng-Jun Zha, Meng Wang


Visual dialog is a challenging task that requires the comprehension of the semantic dependencies among implicit visual and textual contexts. This task can be cast as relation inference in a graphical model with sparse contexts and an unknown graph structure (relation descriptor), and how to model the underlying context-aware relation inference is critical. To this end, we propose a novel Context-Aware Graph (CAG) neural network. Each node in the graph corresponds to a joint semantic feature, including both object-based (visual) and history-related (textual) context representations. The graph structure (relations in dialog) is iteratively updated using an adaptive top-K message passing mechanism. Specifically, in every message passing step, each node selects the K most relevant nodes, and only receives messages from them. Then, after the update, we impose graph attention on all the nodes to get the final graph embedding and infer the answer. In CAG, each node has dynamic relations in the graph (different related K neighbor nodes), and only the most relevant nodes contribute to the context-aware relational graph inference. Experimental results on VisDial v0.9 and v1.0 datasets show that CAG outperforms comparative methods. Visualization results further validate the interpretability of our method.
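The adaptive top-K message passing described above lends itself to a compact sketch. The PyTorch snippet below is an illustrative rendering under assumed design choices (scaled dot-product relevance scores and a residual node update); the function name topk_message_passing and all shapes are hypothetical, not taken from the authors' code.

import torch
import torch.nn.functional as F

def topk_message_passing(node_feats: torch.Tensor, k: int) -> torch.Tensor:
    """One step of adaptive top-K message passing (illustrative sketch).

    node_feats: (N, D) joint visual-textual node features.
    Each node scores all others, keeps its K most relevant neighbors,
    and aggregates messages only from them.
    """
    # Pairwise relevance scores (scaled dot product is an assumption).
    scores = node_feats @ node_feats.t() / node_feats.size(-1) ** 0.5   # (N, N)
    scores.fill_diagonal_(float("-inf"))                                # no self-messages

    # Each node keeps only its K highest-scoring neighbors.
    topk_vals, topk_idx = scores.topk(k, dim=-1)                        # (N, K)
    attn = F.softmax(topk_vals, dim=-1)                                 # (N, K)

    # Gather neighbor features and aggregate the weighted messages.
    neighbors = node_feats[topk_idx]                                    # (N, K, D)
    messages = (attn.unsqueeze(-1) * neighbors).sum(dim=1)              # (N, D)

    # Residual update of node states (one of several plausible choices).
    return node_feats + messages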
[graph, visual, attention, cag, question, node, message, context, dialog, textual, visdial, passing, relational, history, relevant, step, embedding, reasoning, answer, infer, current, word, hanwang, outperforms, attended, lstm, fga, receives, mechanism, gnn, command, sequence] [denotes, feature, final, edge, object, semantic, visualization, correlation, table, propose, attentive, map] [iterative, model, vgg] [dynamic, proposed, adaptive, figure, adjacent, method, dual, resolution, guidance] [image, representation] [inference, learning, neural, number, network, performance, compared, candidate, process, set, test, updated] [neighbor, structure, iteratively, dan, implicit, joint]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Dan and Wang, Hui and Zhang, Hanwang and Zha, Zheng-Jun and Wang, Meng},
  title = {Iterative Context-Aware Graph Inference for Visual Dialog},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TA-Student VQA: Multi-Agents Training by Self-Questioning
Peixi Xiong, Ying Wu


There are two main challenges in Visual Question Answering (VQA). The first is that each model has its own strengths and shortcomings when applied to different questions; what is more, the "ceiling effect" for specific questions is difficult to overcome with simple consecutive training. The second challenge is that, even though state-of-the-art datasets are large-scale, the questions targeted at a single image are limited in format and lack diversity in content. We introduce our self-questioning model with multi-agent training: TA-student VQA. This framework differs from standard VQA algorithms by involving question-generating mechanisms and collaborative learning of questions between question-answering agents. Thus, TA-student VQA overcomes the limitation of the content diversity and format variation of questions and improves the overall performance of multiple question-answering agents. We evaluate our model on VQA-v2, which outperforms algorithms without such mechanisms. In addition, TA-student VQA achieves a greater model capacity, allowing it to answer more generated questions in addition to those in the annotated datasets.
[question, visual, answering, answer, vqa, agent, lstm, dataset, agts, natural, reasoning, language, long, oracle, policy, december, exam, previous, work, time, annual] [annotated, stage, cnn, supervision, main, table, association] [model, improve, type, adversarial] [ieee, figure, pattern, cvpr, method, june, based, proposed, output, format, img] [image, generating, generated, generation, responsible, lack, corresponding, diversity, generative] [learning, iteration, neural, training, processing, update, informative, standard, performance, data, deep, evaluate, knowledge, computational, good, student, better, set, empirical] [conference, computer, vision, system, ground, international, second, structure, european]
@InProceedings{Xiong_2020_CVPR,
  author = {Xiong, Peixi and Wu, Ying},
  title = {TA-Student VQA: Multi-Agents Training by Self-Questioning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Exploring Self-Attention for Image Recognition
Hengshuang Zhao, Jiaya Jia, Vladlen Koltun


Recent work has shown that self-attention can serve as a basic building block for image recognition models. We explore variations of self-attention and assess their effectiveness for image recognition. We consider two forms of self-attention. One is pairwise self-attention, which generalizes standard dot-product attention and is fundamentally a set operator. The other is patchwise self-attention, which is strictly more powerful than convolution. Our pairwise self-attention networks match or outperform their convolutional counterparts, and the patchwise models substantially outperform the convolutional baselines. We also conduct experiments that probe the robustness of learned representations and conclude that self-attention networks may have significant benefits in terms of robustness and generalization.
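As a rough illustration of the pairwise form of self-attention discussed above, the PyTorch sketch below computes attention weights over a small local footprint from feature differences. It simplifies the paper's formulation (a scalar weight per neighbor rather than vector attention), and the 1x1 mappings and footprint size are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseSelfAttention2d(nn.Module):
    """Simplified pairwise self-attention over a local footprint (sketch).

    Neighbor weights come from feature differences, one of several relation
    forms one could use; the mapping layers here are illustrative assumptions.
    """
    def __init__(self, channels: int, footprint: int = 3):
        super().__init__()
        self.k = footprint
        self.query = nn.Conv2d(channels, channels, 1)
        self.key = nn.Conv2d(channels, channels, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        pad = self.k // 2
        # Unfold keys/values into local footprints: (B, C, K*K, H*W).
        k_n = F.unfold(k, self.k, padding=pad).view(b, c, self.k * self.k, h * w)
        v_n = F.unfold(v, self.k, padding=pad).view(b, c, self.k * self.k, h * w)
        q = q.view(b, c, 1, h * w)
        # Relation from feature differences, softmax-normalized over the footprint.
        attn = F.softmax(-(q - k_n).pow(2).sum(dim=1, keepdim=True), dim=2)
        out = (attn * v_n).sum(dim=2)                      # (B, C, H*W)
        return out.view(b, c, h, w)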
[attention, recognition, explore, relation, powerful, work, language] [feature, table, aggregation, building, aggregated, stride] [robustness, controlled, attack, conduct] [convolutional, convolution, block, spatial, channel, comparison, operator, output, residual, method, based, indicate] [image, mapping] [linear, pairwise, patchwise, number, set, accuracy, vector, dimensionality, scalar, function, weight, lower, size, outperform, imagenet, deep, flop, training, computation, network, transition, params, reported, adapt, parameter, rate, neural, applied, learning, impact, memory, operation, layer, hadamard, max, pool, reduce] [footprint, position, computer, transformation, vision, construction, match, local]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Hengshuang and Jia, Jiaya and Koltun, Vladlen},
  title = {Exploring Self-Attention for Image Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension
Zhenfang Chen, Peng Wang, Lin Ma, Kwan-Yee K. Wong, Qi Wu


Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression. It requires joint reasoning over the textual and visual domains to solve the problem. Some popular referring expression datasets, however, fail to provide an ideal test bed for evaluating the reasoning ability of the models, mainly because 1) their expressions typically describe only some simple distinctive properties of the object and 2) their images contain limited distracting information. To bridge the gap, we propose a new dataset for visual reasoning in the context of referring expression comprehension with two main features. First, we design a novel expression engine rendering various reasoning logics that can be flexibly combined with rich visual properties to generate expressions with varying compositionality. Second, to better exploit the full reasoning chain embodied in an expression, we propose a new test setting by adding additional distracting images containing objects sharing similar properties with the referent, thus minimising the success rate of reasoning-free cross-domain alignment. We evaluate several state-of-the-art REF models, but find none of them can achieve promising performance. A proposed modular hard mining strategy performs the best but still leaves substantial room for improvement.
[reasoning, visual, dataset, referring, distracting, logic, mattnet, modular, textual, compositional, provide, order, grounding, natural, language, comprehension, gqa, pair, grounder, graph, text, previous, context, flowery, attention, refcoco, described, engine] [object, hard, distractors, table, region, category, semantic, achieves, propose, including] [expression, model, datasets, stronger, original, study] [proposed, based, tree] [target, image, ability, cat, generate, distinguish, attribute, synthetic, specific] [performance, task, set, negative, test, requires, mining, simple, evaluate, strategy, training, number, frequent, bias, best, accuracy, find, similarity, achieve, average] [scene, full, form, left, complex, matching, ground, define]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Zhenfang and Wang, Peng and Ma, Lin and Wong, Kwan-Yee K. and Wu, Qi},
  title = {Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Improving Convolutional Networks With Self-Calibrated Convolutions
Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Changhu Wang, Jiashi Feng


Recent advances in CNNs are mostly devoted to designing more complex architectures to enhance their representation learning capacity. In this paper, we consider how to improve the basic convolutional feature transformation process of CNNs without tuning the model architectures. To this end, we present novel self-calibrated convolutions that explicitly expand the fields-of-view of each convolutional layer through internal communications and hence enrich the output features. In particular, unlike the standard convolutions that fuse spatial and channel-wise information using small kernels (e.g., 3x3), self-calibrated convolutions adaptively build long-range spatial and inter-channel dependencies around each spatial location through a novel self-calibration operation. Thus, they can help CNNs generate more discriminative representations by explicitly incorporating richer information. Our self-calibrated convolution design is simple and generic, and can be easily applied to augment standard convolutional layers without introducing extra parameters and complexity. Extensive experiments demonstrate that when applying self-calibrated convolutions to different backbones, our networks can significantly improve the baseline models in a variety of vision tasks, including image recognition, object detection, instance segmentation, and keypoint detection, with no need to change the network architectures. We hope this work could provide a promising way for future research in designing novel convolutional feature transformations for improving convolutional networks. Code is available on the project page.
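The self-calibration idea, as described, splits the channels into a plain convolution branch and a branch whose responses are gated by features computed in a down-sampled space. The following PyTorch sketch captures that structure under assumed layer sizes and pooling rate; it is not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfCalibratedConv(nn.Module):
    """Sketch of a self-calibrated convolution block (layer sizes are assumptions).

    Half of the channels take the normal convolution path; the other half
    are calibrated by a gate computed in a down-sampled (wider-view) space.
    """
    def __init__(self, channels: int, pool: int = 4):
        super().__init__()
        c = channels // 2
        self.conv_plain = nn.Conv2d(c, c, 3, padding=1)   # ordinary branch
        self.conv_down = nn.Conv2d(c, c, 3, padding=1)    # operates on pooled features
        self.conv_local = nn.Conv2d(c, c, 3, padding=1)   # original-resolution transform
        self.conv_out = nn.Conv2d(c, c, 3, padding=1)     # after calibration
        self.pool = pool

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.chunk(x, 2, dim=1)

        # Calibration branch: a gate from a down-sampled view enlarges the field-of-view.
        down = F.avg_pool2d(x2, self.pool)
        gate = F.interpolate(self.conv_down(down), size=x2.shape[-2:],
                             mode="bilinear", align_corners=False)
        calibrated = self.conv_local(x2) * torch.sigmoid(x2 + gate)
        y2 = self.conv_out(calibrated)

        # Plain branch keeps a standard convolution.
        y1 = self.conv_plain(x1)
        return torch.cat([y1, y2], dim=1)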
[attention, work, multiple, context] [feature, scnet, object, detection, resnet, table, pooling, instance, resnext, building, mask, introducing, grouped, selfcalibrated, side, coco, kaiming, location, including, locate, adopt] [model, input, original, generalization, helpful] [convolutional, proposed, spatial, convolution, figure, residual, scale, output, version, based, learnable] [image, discriminative, produced, target] [network, classification, operation, set, deep, better, architecture, large, accuracy, rate, report, design, neural, size, performance, space, learning, layer, investigate, designing, standard, small, imagenet, average] [transformation, vision, keypoint, demonstrate, human, capture, approach]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Jiang-Jiang and Hou, Qibin and Cheng, Ming-Ming and Wang, Changhu and Feng, Jiashi},
  title = {Improving Convolutional Networks With Self-Calibrated Convolutions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Modality Shifting Attention Network for Multi-Modal Video Question Answering
Junyeong Kim, Minuk Ma, Trung Pham, Kyungsu Kim, Chang D. Yoo


This paper considers a network referred to as Modality Shifting Attention Network (MSAN) for the Multimodal Video Question Answering (MVQA) task. MSAN decomposes the task into two sub-tasks: (1) localization of the temporal moment relevant to the question, and (2) accurate prediction of the answer based on the localized moment. The modality required for temporal localization may be different from that for answer prediction, and this ability to shift modality is essential for performing the task. To this end, MSAN is based on (1) the moment proposal network (MPN) that attempts to locate the most appropriate temporal moment from each of the modalities, and also on (2) the heterogeneous reasoning network (HRN) that predicts the answer using an attention mechanism on both modalities. MSAN is able to place importance weights on the two modalities for each sub-task using a component referred to as Modality Importance Modulation (MIM). Experimental results show that MSAN outperforms the previous state-of-the-art by achieving 71.13% test accuracy on the TVQA benchmark dataset. Extensive ablation studies and qualitative analysis are conducted to validate the various components of the network.
[video, modality, attention, moment, question, temporal, msan, answer, visual, subtitle, mpn, heterogeneous, reasoning, moi, answering, mechanism, modulation, tvqa, recognition, mvqa, prediction, three, context, action, text, multimodal, correct, relevant, localize, ham, previous, vcpt, natural, language, localizes] [feature, localization, proposal, ablation, shifting, table, final, score, framework, object, interest] [model, mim, input, type, study, example] [ieee, pattern, based, figure, proposed, motion, analysis, hrn, comparison, method] [image] [network, set, required, memory, performance, accuracy, validation, learning, test, space, candidate, task, ranking, requires, number, weight, consider] [conference, computer, vision, international, localized, represented, hypothesis, well]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Junyeong and Ma, Minuk and Pham, Trung and Kim, Kyungsu and Yoo, Chang D.},
  title = {Modality Shifting Attention Network for Multi-Modal Video Question Answering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Structure an Image With Few Colors
Yunzhong Hou, Liang Zheng, Stephen Gould


Color and structure are the two pillars that construct an image. Usually, the structure is well expressed through a rich spectrum of colors, allowing objects in an image to be recognized by neural networks. However, under extreme limitations of color space, the structure tends to vanish, and thus a neural network might fail to understand the image. Interested in exploring this interplay between color and structure, we study the scientific problem of identifying and preserving the most informative image structures while constraining the color space to just a few bits, such that the resulting image can be recognized with accuracy as high as possible. To this end, we propose a color quantization network, ColorCNN, which learns to structure the images from the classification loss in an end-to-end manner. Given a color space size, ColorCNN quantizes colors in the original image by generating a color index map and an RGB color palette. Then, this color-quantized image is fed to a pre-trained task network to evaluate its performance. In our experiment, with only a 1-bit color space (i.e., two colors), the proposed network achieves 82.1% top-1 accuracy on the CIFAR10 dataset, outperforming traditional color quantization methods by a large margin. For applications, when encoded with PNG, the proposed color quantization shows superiority over other image compression methods in the extremely low bit-rate regime. The code is available at https://github.com/hou-yz/color_distillation.
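A minimal sketch of the learned color quantization pipeline described above: a small network predicts a per-pixel distribution over a handful of color entries, the palette is the probability-weighted mean color of each entry, and the quantized image is then fed to a frozen task network. The architecture and soft/hard assignment details below are assumptions for illustration, not the released ColorCNN code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorQuantizer(nn.Module):
    """Sketch of learned color quantization in the spirit of ColorCNN."""
    def __init__(self, num_colors: int = 2):
        super().__init__()
        # The backbone here is a placeholder conv stack, not the paper's network.
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, num_colors, 1),
        )

    def forward(self, img: torch.Tensor, hard: bool = False) -> torch.Tensor:
        logits = self.net(img)                              # (B, C, H, W)
        probs = F.softmax(logits, dim=1)
        if hard:                                            # test time: one color per pixel
            idx = probs.argmax(dim=1, keepdim=True)
            probs = torch.zeros_like(probs).scatter_(1, idx, 1.0)
        # Palette: probability-weighted average RGB per color index, (B, C, 3).
        w = probs.flatten(2)                                # (B, C, HW)
        rgb = img.flatten(2).transpose(1, 2)                # (B, HW, 3)
        palette = (w @ rgb) / w.sum(-1, keepdim=True).clamp_min(1e-8)
        # Quantized image: each pixel becomes its (soft) palette color.
        quant = torch.einsum("bchw,bcd->bdhw", probs, palette)
        return quant                                        # feed to a frozen task network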
[recognition, critical, visual] [map, denotes, table, feature, propose] [original, jpeg, trained, datasets, study] [color, colorcnn, ieee, mediancut, compression, traditional, pattern, output, pixel, figure, palette, jitter, proposed, result, compressed, method, convolutional] [image, preserve] [quantization, quantized, accuracy, network, classification, neural, space, learning, regularization, classifier, higher, clustering, size, arxiv, preprint, problem, large, set, small, probability, test, deep, training, average, performance, activation, lower, informative, extremely, softmax, alexnet, arg, weighted, rate, design, forward, pass, max, distribution, lead, weight] [octree, conference, structure, computer, full, vision, term, computed, michael, international]
@InProceedings{Hou_2020_CVPR,
  author = {Hou, Yunzhong and Zheng, Liang and Gould, Stephen},
  title = {Learning to Structure an Image With Few Colors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering
Xinyu Wang, Yuliang Liu, Chunhua Shen, Chun Chet Ng, Canjie Luo, Lianwen Jin, Chee Seng Chan, Anton van den Hengel, Liangwei Wang


Visual Question Answering (VQA) methods have made incredible progress, but suffer from a failure to generalize. This is visible in the fact that they are vulnerable to learning coincidental correlations in the data rather than deeper relations between image content and ideas expressed in language. We present a dataset that takes a step towards addressing this problem in that it contains questions expressed in two languages, and an evaluation process that co-opts a well understood image-based metric to reflect the method's ability to reason. Measuring reasoning directly encourages generalization by penalizing answers that are coincidentally correct. The dataset reflects the scene-text version of the VQA problem, and the reasoning evaluation can be seen as a text-based version of a referring expression challenge. Experiments and analyses are provided that show the value of the dataset. The dataset is available at www.est-vqa.org.
[vqa, question, text, answer, chinese, ocr, dataset, evidence, visual, evaluation, english, answering, correct, language, reasoning, provide, reading, embedding, clc, textual, three, bilingual, recognition, levenshtein, multiple, word, vocabulary, anton, den, current] [bounding, predicted, score, box, challenge, table, detection, van, employed, focusing] [model, datasets, generic] [proposed, ieee, based, figure, method, existing, traditional, conventional, version] [image, content, encoder, ability, corresponding, generalize] [metric, performance, training, baseline, set, space, accuracy, achieve, classification, upper, data, support, fixed, requires, bound, normalized, fact, problem, test, required, evaluate] [scene, provided, require, well, approach]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Xinyu and Liu, Yuliang and Shen, Chunhua and Ng, Chun Chet and Luo, Canjie and Jin, Lianwen and Chan, Chee Seng and Hengel, Anton van den and Wang, Liangwei},
  title = {On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
From Paris to Berlin: Discovering Fashion Style Influences Around the World
Ziad Al-Halah, Kristen Grauman


The evolution of clothing styles and their migration across the world is intriguing, yet difficult to describe quantitatively. We propose to discover and quantify fashion influences from everyday images of people wearing clothes. We introduce an approach that detects which cities influence which other cities in terms of propagating their styles. We then leverage the discovered influence patterns to inform a forecasting model that predicts the popularity of any given style at any given city into the future. Demonstrating our idea with GeoStyle--a large-scale dataset of 7.7M images covering 44 major world cities, we present the discovered influence relationships, revealing how cities exert and receive fashion influence for an array of 50 observed visual styles. Furthermore, the proposed forecasting model achieves state-of-the-art results for a challenging style forecasting task, showing the advantage of grounding visual style evolution both spatially and temporally.
[visual, forecasting, time, temporal, dataset, trajectory, forecast, modeling, future, seasonal, relation, people, work, social, observed, multiple, york, predict] [global, correlation, location, score, propose, table] [influence, fashion, model, popularity, city, clothing, coherence, trend, paris, influencer, discovering, kristen, geostyle, influenced, major, influential, exerted, gdp, everyday, worldwide, analyze, trained, deseasonalized, asian, austin, milan] [based, high, figure, prior, analysis, net] [style, image, discover, learn, attribute, introduce, loss] [discovered, set, meta, ranking, impact, learned, consider, weight, average, rank, population, number] [approach, capture, vision, error, acm, ground, coherent, relative]
@InProceedings{Al-Halah_2020_CVPR,
  author = {Al-Halah, Ziad and Grauman, Kristen},
  title = {From Paris to Berlin: Discovering Fashion Style Influences Around the World},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation
Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, Dahua Lin


Scene, as the crucial unit of storytelling in movies, contains complex activities of actors and their interactions in a physical environment. Identifying the composition of scenes serves as a critical step towards semantic understanding of movies. This is very challenging compared to the videos studied in conventional vision problems, e.g. action recognition, as scenes in movies usually contain much richer temporal structures and more complex semantic information. Towards this goal, we scale up the scene segmentation task by building a large-scale video dataset MovieScenes, which contains 21K annotated scene segments from 150 movies. We further propose a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie. This framework is able to distill complex semantics from hierarchical temporal structures over a long movie, providing top-down guidance for scene segmentation. Our experiments show that the proposed network is able to segment a movie into scenes with high accuracy, consistently outperforming previous methods. We also found that pretraining on our MovieScenes can bring significant improvements to the existing approaches.
[movie, dataset, sequence, multiple, video, lgss, understanding, temporal, action, long, moviescenes, visual, clip, three, place, prediction, modeling, time, audio, semantics, short, bbc] [semantic, segmentation, boundary, global, level, grouping, table, segment, detection, improves, framework, annotation, siamese, annotated, propose, hard, achieves] [model, help, datasets] [figure, super, ieee, existing, pattern, method, based, analysis, range, result, high] [representation, pretrained, cross, image, consistency] [shot, optimal, number, cut, task, set, performance, optimization, best, achieve, better, deep, problem, large, base] [scene, local, conference, computer, vision, coarse, complex, initial, capture, international, cast]
@InProceedings{Rao_2020_CVPR,
  author = {Rao, Anyi and Xu, Linning and Xiong, Yu and Xu, Guodong and Huang, Qingqiu and Zhou, Bolei and Lin, Dahua},
  title = {A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
G-TAD: Sub-Graph Localization for Temporal Action Detection
Mengmeng Xu, Chen Zhao, David S. Rojas, Ali Thabet, Bernard Ghanem


Temporal action detection is a fundamental yet challenging task in video understanding. Video context is a critical cue to effectively detect actions, but current works mainly focus on temporal context, while neglecting semantic context as well as other important context properties. In this work, we propose a graph convolutional network (GCN) model to adaptively incorporate multi-level semantic context into video features and cast temporal action detection as a sub-graph localization problem. Specifically, we formulate video snippets as graph nodes, snippet-snippet correlations as edges, and actions associated with context as target sub-graphs. With graph convolution as the basic operation, we design a GCN block called GCNeXt, which learns the features of each node by aggregating its context and dynamically updates the edges in the graph. To localize each sub-graph, we also design an SGAlign layer to embed each sub-graph into the Euclidean space. Extensive experiments show that G-TAD is capable of finding effective video context without extra supervision and achieves state-of-the-art performance on two detection benchmarks. On ActivityNet-1.3 it obtains an average mAP of 34.09%; on THUMOS14 it reaches 51.6% at IoU@0.5 when combined with a proposal processing method. The code has been made available at https://github.com/frostinassiky/gtad.
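To make the "edges updated dynamically from features" idea above concrete, the sketch below rebuilds k-nearest-neighbor edges in feature space at every call and aggregates neighbor differences, in the spirit of a GCNeXt-style semantic stream; the grouped convolutions, temporal stream, and SGAlign layer are omitted, and the value of k and the max aggregation are assumptions.

import torch

def dynamic_edge_conv(snippets: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Sketch of dynamic semantic-edge aggregation over video snippets.

    snippets: (T, D) features of the T temporal snippets; requires T > k.
    Edges are rebuilt each call as feature-space k-nearest neighbors, and each
    snippet aggregates the differences to its neighbors (EdgeConv-style);
    the MLPs that would normally follow are omitted.
    """
    dist = torch.cdist(snippets, snippets)                 # (T, T) pairwise distances
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]   # drop self, keep k neighbors
    neighbors = snippets[knn]                              # (T, k, D)
    edge = neighbors - snippets.unsqueeze(1)               # edge features per neighbor
    return snippets + edge.max(dim=1).values               # residual max-aggregation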
[action, temporal, video, context, graph, gcnext, recognition, sgalign, node, bernard, untrimmed, snippet, represent, sequence, gcn, predict, tiou, fabian, caba, victor, activity, limin, yuanjun, dynamically] [semantic, detection, feature, localization, edge, proposal, map, background, iou, region, achieves] [model, input] [ieee, pattern, convolution, based, convolutional, adaptively, figure, output, block, proposed] [loss, alignment] [set, network, performance, average, training, layer, deep, learning, classification, amount, arxiv, preprint, strategy, ratio, size, validation, start, number, sampling] [vision, conference, computer, international, defined, well, european, define, point, human]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Mengmeng and Zhao, Chen and Rojas, David S. and Thabet, Ali and Ghanem, Bernard},
  title = {G-TAD: Sub-Graph Localization for Temporal Action Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Detailed 2D-3D Joint Representation for Human-Object Interaction
Yong-Lu Li, Xinpeng Liu, Han Lu, Shiyi Wang, Junqi Liu, Jiefeng Li, Cewu Lu


Human-Object Interaction (HOI) detection lies at the core of action understanding. Besides 2D information such as human/object appearance and locations, 3D pose is also usually utilized in HOI learning because of its view-independence. However, rough 3D body joints just carry sparse body information and are not sufficient to understand complex interactions. Thus, we need detailed 3D body shape to go further. Meanwhile, the interacted object in 3D has also not been fully studied in HOI learning. In light of these, we propose a detailed 2D-3D joint representation learning method. First, we utilize a single-view human body capture method to obtain detailed 3D body, face and hand shapes. Next, we estimate the 3D object location and size with reference to the 2D human-object spatial configuration and object category priors. Finally, a joint learning framework and cross-modal consistency tasks are proposed to learn the joint HOI representation. To better evaluate the 2D ambiguity processing capacity of models, we propose a new benchmark named Ambiguous-HOI consisting of hard ambiguous images. Extensive experiments on a large-scale HOI benchmark and Ambiguous-HOI show the impressive effectiveness of our method. Code and data are available at https://github.com/DirtyHarryLYL/DJ-RN.
[attention, action, cewu, extract, visual, recognition, interaction, represent, embedding] [object, hoi, feature, map, location, detection, adopt, semantic, propose, benchmark, center, ican, interactiveness, category, box, concatenate] [face, fsp, model, datasets] [spatial, block, method, based, figure, proposed, prior, recover] [representation, consistency, image, learn, corresponding, generate, train, appearance, alignment, consists, loss, latt] [learning, configuration, network, set, arxiv, preprint, size, deep, evaluate] [body, human, pose, joint, sphere, detailed, shape, volume, estimation, estimate, point, estimated, capture, depth, hand, single, finally, full]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yong-Lu and Liu, Xinpeng and Lu, Han and Wang, Shiyi and Liu, Junqi and Li, Jiefeng and Lu, Cewu},
  title = {Detailed 2D-3D Joint Representation for Human-Object Interaction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
One-Shot Adversarial Attacks on Visual Tracking With Dual Attention
Xuesong Chen, Xiyu Yan, Feng Zheng, Yong Jiang, Shu-Tao Xia, Yong Zhao, Rongrong Ji


Almost all adversarial attacks in computer vision are aimed at pre-known object categories, which could be offline trained for generating perturbations. But as for visual object tracking, the tracked target categories are normally unknown in advance. However, the tracking algorithms also have potential risks of being attacked, which could be maliciously used to fool the surveillance systems. Meanwhile, adversarial attacks on tracking remain a challenging task, since the tracked target is model-free. Therefore, to help draw more attention to the potential risks, we study adversarial attacks on tracking algorithms. In this paper, we propose a novel one-shot adversarial attack method to generate adversarial examples for model-free single object tracking, where merely adding slight perturbations on the target patch in the initial frame causes state-of-the-art trackers to lose the target in subsequent frames. Specifically, the optimization objective of the proposed attack consists of two components and leverages dual attention mechanisms. The first component adopts a targeted attack strategy by optimizing the batch confidence loss with confidence attention, while the second one applies a general perturbation strategy by optimizing the feature loss with channel attention. Experimental results show that our approach can significantly lower the accuracy of the most advanced Siamese network-based trackers on three benchmarks.
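The one-shot attack amounts to optimizing a small perturbation on the initial target patch so that the tracker's confidence on the true target collapses. The sketch below is a generic projected-gradient rendering of that idea; tracker_score_fn is a placeholder for a Siamese tracker's confidence head, the step size and budget are assumptions, and the paper's additional feature loss with dual attention is omitted.

import torch

def one_shot_patch_attack(template: torch.Tensor, tracker_score_fn,
                          steps: int = 50, eps: float = 8 / 255, alpha: float = 1 / 255):
    """Sketch of a one-shot attack on the initial target patch.

    template: the target patch from the first frame, shape (1, 3, H, W).
    tracker_score_fn: placeholder callable returning the tracker's confidence
    for the true target given a (possibly perturbed) template.
    """
    delta = torch.zeros_like(template, requires_grad=True)
    for _ in range(steps):
        score = tracker_score_fn(template + delta)     # confidence of the true target
        loss = score.mean()                            # we want this to go down
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()         # gradient step against confidence
            delta.clamp_(-eps, eps)                    # keep the perturbation slight
        delta.grad.zero_()
    return (template + delta).detach()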
[attention, visual, frame, recognition, video, mechanism, evaluation, dataset, org] [tracking, siamese, confidence, object, feature, siamrpn, siamfc, siammask, tracker, box, table, propose, subsequent, template, including, vot, map, denotes, tracked] [attack, adversarial, success, perturbation, original, experimental, attacked, attacking, adding, clean, noise, fool, difficult, example] [dual, method, ieee, based, pattern, proposed, patch, pixel, gaussian] [target, loss, generate, image] [precision, network, deep, learning, candidate, rate, batch, function, accuracy, classification, algorithm, strategy, general, set, search, random, reduces, task, optimization, online, problem, similarity, best, applied, arxiv, preprint, potential] [computer, vision, conference, initial]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Xuesong and Yan, Xiyu and Zheng, Feng and Jiang, Yong and Xia, Shu-Tao and Zhao, Yong and Ji, Rongrong},
  title = {One-Shot Adversarial Attacks on Visual Tracking With Dual Attention},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Rethinking Classification and Localization for Object Detection
Yue Wu, Yinpeng Chen, Lu Yuan, Zicheng Liu, Lijuan Wang, Hongzhi Li, Yun Fu


Two head structures (i.e. fully connected head and convolution head) have been widely used in R-CNN based detectors for classification and localization tasks. However, there is a lack of understanding of how these two head structures work for these two tasks. To address this issue, we perform a thorough analysis and find an interesting fact that the two head structures have opposite preferences towards the two tasks. Specifically, the fully connected head (fc-head) is more suitable for the classification task, while the convolution head (conv-head) is more suitable for the localization task. Furthermore, we examine the output feature maps of both heads and find that fc-head has more spatial sensitivity than conv-head. Thus, fc-head has more capability to distinguish a complete object from part of an object, but is not robust to regress the whole object. Based upon these findings, we propose a Double-Head method, which has a fully connected head focusing on classification and a convolution head for bounding box regression. Without bells and whistles, our method gains +3.5 and +2.8 AP on MS COCO dataset from Feature Pyramid Network (FPN) baselines with ResNet-50 and ResNet-101 backbones, respectively.
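The double-head design can be sketched directly: pooled RoI features feed a fully connected branch for classification and a convolutional branch for box regression. The PyTorch module below is an illustrative sketch; the layer widths and depths are assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class DoubleHead(nn.Module):
    """Sketch of a double-head detector head: fc branch for classification,
    conv branch for bounding-box regression (sizes are assumptions)."""
    def __init__(self, in_channels: int = 256, roi_size: int = 7, num_classes: int = 80):
        super().__init__()
        flat = in_channels * roi_size * roi_size
        self.fc_head = nn.Sequential(                 # classification branch
            nn.Flatten(), nn.Linear(flat, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
        )
        self.cls_score = nn.Linear(1024, num_classes + 1)
        self.conv_head = nn.Sequential(               # localization branch
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.bbox_pred = nn.Linear(in_channels, 4 * num_classes)

    def forward(self, roi_feats: torch.Tensor):
        # roi_feats: (N, C, roi_size, roi_size) pooled per-proposal features.
        scores = self.cls_score(self.fc_head(roi_feats))
        deltas = self.bbox_pred(self.conv_head(roi_feats))
        return scores, deltas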
[connected, outperforms, dataset, multiple] [box, head, object, feature, bounding, fpn, correlation, regression, iou, proposal, fully, detection, map, rcnn, localization, coco, regressed, table, faster, unfocused, score, backbone, ross, propose, roi, cascade, mask] [medium, suitable] [figure, spatial, convolution, ieee, output, method, conv, pattern, fusion, comparison, residual, analysis, block, based] [loss, corresponding, perform] [classification, higher, weight, task, group, network, performance, small, training, better, standard, large, compared, baseline, class, deep] [conference, computer, vision, single, ground, international, truth]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Yue and Chen, Yinpeng and Yuan, Lu and Liu, Zicheng and Wang, Lijuan and Li, Hongzhi and Fu, Yun},
  title = {Rethinking Classification and Localization for Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Correspondence Networks With Adaptive Neighbourhood Consensus
Shuda Li, Kai Han, Theo W. Costain, Henry Howard-Jenkins, Victor Prisacariu


In this paper, we tackle the task of establishing dense visual correspondences between images containing objects of the same category. This is a challenging task due to large intra-class variations and a lack of dense pixel level annotations. We propose a convolutional neural network architecture, called adaptive neighbourhood consensus network (ANC-Net), that can be trained end-to-end with sparse key-point annotations, to handle this challenge. At the core of ANC-Net is our proposed non-isotropic 4D convolution kernel, which forms the building block for the adaptive neighbourhood consensus module for robust matching. We also introduce a simple and efficient multi-scale self-similarity module in ANC-Net to make the learned feature robust to intra-class variations. Furthermore, we propose a novel orthogonal loss that can enforce the one-to-one matching constraint. We thoroughly evaluate the effectiveness of our method on various benchmarks, where it substantially outperforms state-of-the-art methods.
[recognition, pair, outperforms, multiple] [feature, map, semantic, correlation, module, consensus, cnn, anc, propose, mgt, object, refined, effectiveness, table, bumsub] [model, trained, robust, input] [neighbourhood, ieee, pattern, figure, adaptive, proposed, method, convolutional, introduced, convolution, flow, kernel, dccnet, called, based, scale, selfsimilarity, isotropic] [image, target, loss, introduce, source, learn, cub] [learning, size, orthogonal, training, probability, large, set, neural, task, network, evaluate, better] [matching, vision, computer, correspondence, dense, point, sparse, nearest, estimation, jean, novel, capture, handle, enforce]
@InProceedings{Li_2020_CVPR,
  author = {Li, Shuda and Han, Kai and Costain, Theo W. and Howard-Jenkins, Henry and Prisacariu, Victor},
  title = {Correspondence Networks With Adaptive Neighbourhood Consensus},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multiple Anchor Learning for Visual Object Detection
Wei Ke, Tianliang Zhang, Zeyi Huang, Qixiang Ye, Jianzhuang Liu, Dong Huang


Classification and localization are two pillars of visual object detectors. However, in CNN-based detectors, these two modules are usually optimized under a fixed set of candidate (or anchor) bounding boxes. This configuration significantly limits the possibility of jointly optimizing classification and localization. In this paper, we propose a Multiple Instance Learning (MIL) approach that selects anchors and jointly optimizes the two modules of a CNN-based object detector. Our approach, referred to as Multiple Anchor Learning (MAL), constructs anchor bags and selects the most representative anchors from each bag. Such an iterative selection process is potentially NP-hard to optimize. To address this issue, we solve MAL by repetitively depressing the confidence of selected anchors by perturbing their corresponding features. In an adversarial selection-depression manner, MAL not only pursues optimal solutions but also fully leverages multiple anchors/features to learn a detection model. Experiments show that MAL improves the baseline RetinaNet with significant margins on the commonly used MS-COCO object detection benchmark and achieves new state-of-the-art detection performance compared with recent methods.
[multiple, outperforms, attention, visual, work, predict] [anchor, mal, object, detection, localization, depression, retinanet, detector, feature, bounding, table, bag, box, achieves, propose, positive, fpn, backbone, map, ross, freeanchor, regression, centernet, denotes, wei, instance, improves, iou, faster, pyramid] [adversarial, input] [ieee, method, convolutional, figure, based, high, aspect] [loss, image, learn] [learning, network, selection, training, optimal, baseline, performance, strategy, set, selected, selects, optimization, procedure, selecting, function, arg, optimize, compared, top, better, best, indicates, parameter] [approach, jointly, local, accurate]
@InProceedings{Ke_2020_CVPR,
  author = {Ke, Wei and Zhang, Tianliang and Huang, Zeyi and Ye, Qixiang and Liu, Jianzhuang and Huang, Dong},
  title = {Multiple Anchor Learning for Visual Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PhraseCut: Language-Based Image Segmentation in the Wild
Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, Subhransu Maji


We consider the problem of segmenting image regions given a natural language phrase, and study it on a novel dataset of 77,262 images and 345,486 phrase-region pairs. Our dataset is collected on top of the Visual Genome dataset and uses the existing annotations to generate a challenging set of referring phrases for which the corresponding regions are manually annotated. Phrases in our dataset correspond to multiple regions and describe a large number of object and stuff categories as well as their attributes such as color, shape, parts, and relationships with other entities in the image. Our experiments show that the scale and diversity of concepts in our dataset poses significant challenges to the existing state-of-the-art. We systematically handle the long-tail nature of these concepts and present a modular approach to combine category, attribute, and relationship cues that outperforms existing approaches.
[dataset, referring, language, relationship, phrase, visual, attention, grounding, natural, step, recognition, vgp, hrase, mattnet, modular, modeling, prediction, describe, phrasecut, genome, evaluation, individual, refcoco, context, embedding] [category, module, object, instance, segmentation, box, rmi, mask, table, rare, stuff, region, final, detection, coco, score] [input, model, expression, datasets] [pattern, figure, based, existing, frequency] [image, attribute, target, corresponding, generate, generation] [performance, top, number, large, set, task, network, small, test, training, binary, size] [vision, computer, conference, international, approach, well, european]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Chenyun and Lin, Zhe and Cohen, Scott and Bui, Trung and Maji, Subhransu},
  title = {PhraseCut: Language-Based Image Segmentation in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Mask Encoding for Single Shot Instance Segmentation
Rufeng Zhang, Zhi Tian, Chunhua Shen, Mingyu You, Youliang Yan


To date, instance segmentation is dominated by two-stage methods, as pioneered by Mask R-CNN. In contrast, one-stage alternatives cannot compete with Mask R-CNN in mask AP, mainly due to the difficulty of compactly representing masks, making the design of one-stage methods very challenging. In this paper, we propose a simple single-shot instance segmentation framework, termed mask encoding based instance segmentation (MEInst). Instead of predicting the two-dimensional mask directly, MEInst distills it into a compact and fixed-dimensional representation vector, which allows the instance segmentation task to be incorporated into one-stage bounding-box detectors and results in a simple yet efficient instance segmentation framework. The proposed one-stage MEInst achieves 36.4% in mask AP with single-model (ResNeXt-101-FPN backbone) and single-scale testing on the MS-COCO benchmark. We show that this much simpler and more flexible one-stage instance segmentation method can also achieve competitive performance. This framework can be easily adapted for other instance-level recognition tasks. Code is available at: git.io/AdelaiDet
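The "distill the 2D mask into a compact fixed-dimensional vector" step is naturally a linear encoding. The NumPy sketch below uses a PCA-style fit as one plausible instantiation (the exact encoding in the paper may differ); the one-stage detector would then regress the code for each box and decode it back into a mask at inference.

import numpy as np

def fit_mask_codebook(masks: np.ndarray, dim: int = 60):
    """Fit a linear mask encoding (PCA-style, an assumed instantiation of the
    compact fixed-dimensional representation described above).

    masks: (N, S*S) flattened binary training masks resized to a fixed S x S grid.
    Returns the mean vector and the top-`dim` principal components.
    """
    mean = masks.mean(axis=0)
    _, _, vt = np.linalg.svd(masks - mean, full_matrices=False)
    return mean, vt[:dim]                        # components: (dim, S*S)

def encode_mask(mask: np.ndarray, mean, components) -> np.ndarray:
    """Project one flattened mask into its compact code (regression target)."""
    return (mask - mean) @ components.T          # (dim,)

def decode_mask(code: np.ndarray, mean, components, size: int) -> np.ndarray:
    """Reconstruct a binary S x S mask from a predicted code."""
    recon = code @ components + mean
    return recon.reshape(size, size) > 0.5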
[encoding, prediction] [mask, instance, segmentation, meinst, object, fcos, apbb, detection, table, feature, coco, regression, achieves, semantic, backbone, predicted, bounding, chunhua, framework, box, head, kaiming, fully, polarmask, apl, ross, piotr, challenging, branch, including, denotes, apm, zhi, propose] [model, termed, easily] [ieee, figure, method, convolutional, receptive, performs, based, deformable, aps, pixel] [representation, loss, image] [performance, better, large, learning, training, compact, compared, network, simple, classification, dimension, matrix, note, vector, number, set, shot, deep, process, arxiv, task] [reconstruction, single, demonstrate, error, pipeline]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Rufeng and Tian, Zhi and Shen, Chunhua and You, Mingyu and Yan, Youliang},
  title = {Mask Encoding for Single Shot Instance Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs
Jingwei Ji, Ranjay Krishna, Li Fei-Fei, Juan Carlos Niebles


Action recognition has typically treated actions and activities as monolithic events that occur in videos. However, there is evidence from Cognitive Science and Neuroscience that people actively encode activities into consistent hierarchical part structures. In Computer Vision, however, few explorations of representations that encode event partonomies have been made. Inspired by evidence that the prototypical unit of an event is an action-object interaction, we introduce Action Genome, a representation that decomposes actions into spatio-temporal scene graphs. Action Genome captures changes between objects and their pairwise relationships while an action occurs. It contains 10K videos with 0.4M objects and 1.7M visual relationships annotated. With Action Genome, we extend an existing action recognition model by incorporating scene graphs as spatio-temporal feature banks to achieve better performance on the Charades dataset. Next, by decomposing and learning the temporal changes in visual relationships that result in an action, we demonstrate the utility of a hierarchical event decomposition by enabling few-shot action recognition, achieving 42.7% mAP using as few as 10 examples. Finally, we benchmark existing scene graph models on the new task of spatio-temporal scene graph prediction.
[action, graph, video, visual, recognition, sgfb, genome, prediction, relationship, temporal, predict, multiple, ranjay, sequence, dataset, people, understanding, hperson, oracle, work, occur, hierarchical, structured, three, lying, lfb, modeling, sitting, beneath] [object, feature, table, detection, bounding, benchmark] [improve, model, study, trained] [ieee, pattern, event, figure, existing, proposed, spatial] [image, person, representation, generation, introduce, enable] [task, learning, arxiv, preprint, performance, neural, cognitive, deep, better, set, classification] [scene, computer, conference, vision, international, european, ground, front, contact, decomposition, truth, human, michael]
@InProceedings{Ji_2020_CVPR,
  author = {Ji, Jingwei and Krishna, Ranjay and Fei-Fei, Li and Niebles, Juan Carlos},
  title = {Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Unseen Concepts via Hierarchical Decomposition and Composition
Muli Yang, Cheng Deng, Junchi Yan, Xianglong Liu, Dacheng Tao


Composing and recognizing new concepts from known sub-concepts has been a fundamental and challenging vision task, mainly due to 1) the diversity of sub-concepts and 2) the intricate contextuality between sub-concepts and their corresponding visual features. However, most of the current methods simply treat the contextuality as rigid semantic relationships and fail to capture fine-grained contextual correlations. We propose to learn unseen concepts in a hierarchical decomposition-and-composition manner. Considering the diversity of sub-concepts, our method decomposes each seen image into visual elements according to its labels, and learns corresponding sub-concepts in their individual subspaces. To model intricate contextuality between sub-concepts and their visual features, compositions are generated from these subspaces in three hierarchical forms, and the composed concepts are learned in a unified composition space. To further refine the captured contextual relationships, adaptively semi-positive concepts are defined and then learned with pseudo supervision exploited from the generated compositions. We validate the proposed approach on two challenging benchmarks, and demonstrate its superiority over state-of-the-art approaches.
[visual, word, compositional, hierarchical, recognition, three, embedding, state, composed, evaluation] [object, lcls, contextual, positive, semantic, table, propose, supervision, ablation, guided, key, denotes, anchor] [concept, model, cheng, input, trained] [proposed, figure, adaptive, method, adaptively, scale] [unseen, young, composition, attribute, image, lcomp, corresponding, hidc, learn, tiger, lconc, contextuality, loss, generated, attrasoperator, lrec, zsl, advfinegrained, adjusting, cat, intricate, pseudo, lquin, composing, ability, common] [training, open, learning, margin, closed, triplet, negative, test, accuracy, metric, space, parameter, deep, reported, label, max] [hybrid, approach]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Muli and Deng, Cheng and Yan, Junchi and Liu, Xianglong and Tao, Dacheng},
  title = {Learning Unseen Concepts via Hierarchical Decomposition and Composition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hi-CMD: Hierarchical Cross-Modality Disentanglement for Visible-Infrared Person Re-Identification
Seokeon Choi, Sumin Lee, Youngeun Kim, Taekyung Kim, Changick Kim


Visible-infrared person re-identification (VI-ReID) is an important task in night-time surveillance applications, since it is difficult for visible cameras to capture valid appearance information under poor illumination conditions. Compared to traditional person re-identification that handles only the intra-modality discrepancy, VI-ReID suffers from additional cross-modality discrepancy caused by different types of imaging systems. To reduce both intra- and cross-modality discrepancies, we propose a Hierarchical Cross-Modality Disentanglement (Hi-CMD) method, which automatically disentangles ID-discriminative factors and ID-excluded factors from visible-thermal images. We only use ID-discriminative factors for robust cross-modality matching without ID-excluded factors such as pose or illumination. To implement our approach, we introduce an ID-preserving person image generation network and a hierarchical feature learning module. Our generation network learns the disentangled representation by generating a new cross-modality image with different poses and illuminations while preserving a person's identity. At the same time, the feature learning module enables our model to explicitly extract the common ID-discriminative characteristic between visible-infrared images. Extensive experimental results demonstrate that our method outperforms the state-of-the-art methods on two VI-ReID datasets. The source code is available at: https://github.com/bismex/HiCMD.
[hierarchical, recognition, modality, work, evaluation] [feature, module, map, thermal, framework, apply] [adversarial, identity, clothes, testing] [illumination, ieee, pattern, method, figure, proposed, based] [person, image, attribute, code, generation, loss, recon, disentanglement, prototype, style, infrared, crossmodality, regdb, reid, representation, cross, disentangled, disentangle, characteristic, aex, lrecon, introduce, generator, eip, ladv, appearance, common, maintaining, transfer, unsupervised, cycle] [learning, network, set, training, reduce, best, strategy, task, compared, problem, note, performance] [pose, conference, computer, vision, visible, reconstruction, approach, novel, matching, distance, international]
@InProceedings{Choi_2020_CVPR,
  author = {Choi, Seokeon and Lee, Sumin and Kim, Youngeun and Kim, Taekyung and Kim, Changick},
  title = {Hi-CMD: Hierarchical Cross-Modality Disentanglement for Visible-Infrared Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
In Defense of Grid Features for Visual Question Answering
Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen


Popularized as `bottom-up' attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA). However, it is not clear whether the advantages of regions (e.g. better localization) are the key reasons for the success of bottom-up attention. In this paper, we revisit grid features for VQA, and find they can work surprisingly well -- running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion). Through extensive experiments, we verify that this observation holds true across different VQA models (reporting a state-of-the-art accuracy on VQA 2.0 test-std, 72.71), datasets, and generalizes well to other tasks like image captioning. As grid features make the model design and training process much simpler, this enables us to train them end-to-end and also use a more flexible network design. We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training. We hope our findings help further improve the scientific understanding and the practical application of VQA. Code and features will be made available.
[vqa, visual, attention, language, question, time, work, dataset, pythia, answering, captioning, devi, answer, represent, lawrence, marcus] [region, feature, object, faster, table, detection, coco, detector, resnet, final, main, backbone, ross] [model, input, trained, roipool, original, major, effective, study] [convolutional, based, figure, output] [image, attribute, train, loss, yfcc] [accuracy, imagenet, better, number, performance, set, size, task, training, note, arxiv, preprint, learning, find, deep, standard, reported, larger, report, network, selection, achieve, inference, top, compared] [grid, vision, directly, additional, well]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Huaizu and Misra, Ishan and Rohrbach, Marcus and Learned-Miller, Erik and Chen, Xinlei},
  title = {In Defense of Grid Features for Visual Question Answering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Mutual Consistency Induced Transfer Subspace Learning for Human Motion Segmentation
Tao Zhou, Huazhu Fu, Chen Gong, Jianbing Shen, Ling Shao, Fatih Porikli


Human motion segmentation based on transfer subspace learning is of rising interest in action-related tasks. Although progress has been made, there are still several issues within the existing methods. First, existing methods transfer knowledge from source data to target tasks by learning domain-invariant features, but they neglect to preserve domain-specific knowledge. Second, the transfer subspace learning is employed in either low-level or high-level feature spaces, but few methods consider fusing multi-level features for subspace learning. To this end, we propose a novel multi-mutual consistency induced transfer subspace learning framework for human motion segmentation. Specifically, our model factorizes the source and target data into distinct multi-layer feature spaces and reduces the distribution gap between them through a multi-mutual consistency learning strategy. In this way, the domain-specific knowledge and domain-invariant properties can be explored simultaneously. Our model also conducts the transfer subspace learning on different layers to capture multi-level structural information. Further, to preserve the temporal correlations, we project the learned representations into a block-like space. The proposed model is efficiently optimized by using the Augmented Lagrange Multiplier (ALM) algorithm. Experimental results on four human motion datasets demonstrate the effectiveness of our method over other state-of-the-art approaches.
[temporal, action, dataset, video, multiple, three] [segmentation, feature, effectiveness, framework, affinity, obtains, correlation] [model, difference, experimental, datasets, original] [motion, ieee, method, proposed, comparison, based, analysis, figure, ssc, existing, fusion] [transfer, source, target, representation, consistency, structural, keck, preserve, domain, utilize, lrr, unsupervised, tsc, project, corresponding, adaptation] [subspace, learning, data, clustering, deep, matrix, knowledge, performance, strategy, learned, nmi, algorithm, better, distribution, min, set, number, layer, optimization, dictionary, denote, updating, convergence, complexity, reduce, normalized, parameter] [human, capture, sparse, constraint, novel]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Tao and Fu, Huazhu and Gong, Chen and Shen, Jianbing and Shao, Ling and Porikli, Fatih},
  title = {Multi-Mutual Consistency Induced Transfer Subspace Learning for Human Motion Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dense Regression Network for Video Grounding
Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, Chuang Gan


We address the problem of video grounding from natural language queries. The key challenge in this task is that one training video might only contain a few annotated starting/ending frames that can be used as positive examples for model training. Most conventional approaches directly train a binary classifier using such imbalanced data, thus achieving inferior results. The key idea of this paper is to use the distances between the frame within the ground truth and the starting (ending) frame as dense supervisions to improve the video grounding accuracy. Specifically, we design a novel dense regression network (DRN) to regress the distances from each frame to the starting (ending) frame of the video segment described by the query. We also propose a simple but effective IoU regression head module to explicitly consider the localization quality of the grounding results (i.e., the IoU between the predicted location and the ground truth). Experimental results show that our approach significantly outperforms the state of the art on three datasets (i.e., Charades-STA, ActivityNet-Captions, and TACoS).
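The dense supervision described above can be written down directly: every frame inside the ground-truth segment regresses its distances to the segment's start and end, and the IoU head can be trained on the temporal overlap implied by those distances. The helpers below are an illustrative sketch, not the released implementation.

import torch

def dense_regression_targets(num_frames: int, gt_start: int, gt_end: int):
    """Sketch of the dense supervision: frames inside the ground-truth segment
    regress their distances to the segment start and end.

    Returns (targets, mask): targets is (T, 2) with [dist_to_start, dist_to_end];
    mask marks which frames act as positives.
    """
    t = torch.arange(num_frames, dtype=torch.float32)
    inside = (t >= gt_start) & (t <= gt_end)
    targets = torch.stack([t - gt_start, gt_end - t], dim=1)
    targets[~inside] = 0.0                      # only in-segment frames carry targets
    return targets, inside

def predicted_iou(pred_start_dist, pred_end_dist, gt_start_dist, gt_end_dist):
    """Temporal IoU between predicted and ground-truth segments, expressed
    through the per-frame distances (a natural target for the IoU head)."""
    inter = (torch.min(pred_start_dist, gt_start_dist)
             + torch.min(pred_end_dist, gt_end_dist))
    union = (torch.max(pred_start_dist, gt_start_dist)
             + torch.max(pred_end_dist, gt_end_dist))
    return inter / union.clamp_min(1e-6)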
[video, grounding, temporal, frame, starting, dataset, predict, language, interaction, natural, recognition, embedding, three, action, described, outperforms, time, work] [regression, iou, location, head, feature, box, score, propose, localization, positive, semantic, module, table, object, predicted, bounding, centerness, val, segment, annotated, ablation, fcos] [query, model, quality, testing, input] [ieee, drn, method, fusion, figure, pattern] [train, mingkui, target, loss, image, runhao, consists] [training, network, performance, neural, select, learning, set, task, best, number, evaluate, follow, consider] [ground, conference, computer, vision, matching, truth, dense, international, directly, predicts, distance]
@InProceedings{Zeng_2020_CVPR,
  author = {Zeng, Runhao and Xu, Haoming and Huang, Wenbing and Chen, Peihao and Tan, Mingkui and Gan, Chuang},
  title = {Dense Regression Network for Video Grounding},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Architecture Search for Lightweight Non-Local Networks
Yingwei Li, Xiaojie Jin, Jieru Mei, Xiaochen Lian, Linjie Yang, Cihang Xie, Qihang Yu, Yuyin Zhou, Song Bai, Alan L. Yuille


Non-Local (NL) blocks have been widely studied in various vision tasks. However, embedding NL blocks in mobile neural networks has rarely been explored, mainly due to the following challenges: 1) NL blocks generally have a heavy computation cost, which makes them difficult to apply when computational resources are limited, and 2) it is an open problem to discover an optimal configuration to embed NL blocks into mobile neural networks. We propose AutoNL to overcome the above two obstacles. Firstly, we propose a Lightweight Non-Local (LightNL) block by squeezing the transformation operations and incorporating compact features. With the novel design choices, the proposed LightNL block is 400 times computationally cheaper than its conventional counterpart without sacrificing performance. Secondly, by relaxing the structure of the LightNL block to be differentiable during training, we propose an efficient neural architecture search algorithm to learn an optimal configuration of LightNL blocks in an end-to-end manner. Notably, using only 32 GPU hours, the searched AutoNL model achieves 77.7% top-1 accuracy on ImageNet under a typical mobile setting (350M FLOPs), significantly outperforming previous mobile models including MobileNetV2 (+5.7%), FBNet (+2.8%) and MnasNet (+2.1%). Code and models are available at https://github.com/LiYingwei/AutoNL.
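The general flavor of a slimmed-down non-local block can be sketched in a few lines of PyTorch. The block below only illustrates dropping the query/key/value transforms and downsampling the key/value map to shrink the affinity matrix; it is not the authors' LightNL block, and the pooling stride and scaling are assumptions.

import torch
import torch.nn as nn

class SimplifiedNonLocal(nn.Module):
    """Rough sketch of a lightweight non-local block: the 1x1 query/key/value
    transforms are dropped (features are used directly) and the key/value map
    is spatially downsampled to cut the cost of the affinity matrix."""
    def __init__(self, stride=2):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=stride, stride=stride)

    def forward(self, x):
        b, c, h, w = x.shape
        q = x.reshape(b, c, h * w)                        # (B, C, N)
        k = self.pool(x).reshape(b, c, -1)                # (B, C, M), M < N
        attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)  # (B, N, M)
        out = (attn @ k.transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                    # residual connection

if __name__ == "__main__":
    block = SimplifiedNonLocal()
    print(block(torch.randn(1, 64, 16, 16)).shape)        # torch.Size([1, 64, 16, 16])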
[attention, work] [feature, affinity, propose, achieves, semantic, denotes, table, alan, improves, heavy, pascal, including, segmentation, apply] [model, input, original] [block, proposed, downsampling, figure, lightweight, convolution, convolutional, spatial, channel, conv, kernel, method, conventional] [image] [lightnl, search, neural, architecture, computation, mobile, performance, design, efficient, accuracy, imagenet, matrix, autonl, deep, optimal, network, configuration, algorithm, learning, computing, reduce, insert, compact, computationally, classification, better, xgt, arxiv, preprint, computational, searching, depthwise, indicator, quoc, mnasnet, large, mobilenet, function, ratio, training, gpu, searched, xri, fbnet, manually, achieve] [cost, depth, vision, compute, differentiable, transformation, novel]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yingwei and Jin, Xiaojie and Mei, Jieru and Lian, Xiaochen and Yang, Linjie and Xie, Cihang and Yu, Qihang and Zhou, Yuyin and Bai, Song and Yuille, Alan L.},
  title = {Neural Architecture Search for Lightweight Non-Local Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Saliency Propagation for Semi-Supervised Instance Segmentation
Yanzhao Zhou, Xin Wang, Jianbin Jiao, Trevor Darrell, Fisher Yu


Instance segmentation is a challenging task for both modeling and annotation. Due to the high annotation cost, modeling becomes more difficult because of the limited amount of supervision. We aim to improve the accuracy of the existing instance segmentation models by utilizing a large amount of detection supervision. We propose ShapeProp, which learns to activate the salient regions within the object detection and propagate the areas to the whole instance through an iterative learnable message passing module. ShapeProp can benefit from more bounding box supervision to locate the instances more accurately and utilize the feature activations from the larger number of instances to achieve more accurate segmentation. We extensively evaluate ShapeProp on three datasets (MS COCO, PASCAL VOC, and BDD100k) with different supervision setups based on both two-stage (Mask R-CNN) and single-stage (RetinaMask) models. The results show our method establishes a new state of the art for semi-supervised instance segmentation.
[recognition, message, passing, predict, extract, provide] [mask, instance, segmentation, shapeprop, box, object, saliency, detection, head, module, improves, propagation, supervision, region, semantic, bounding, salient, grabcut, feature, fully, backbone, segment, shapemask, voc, weakly, kaiming, propagate, retinamask, predicted] [model, strong, generalization, improve, trained, quality] [ieee, pattern, prior, existing, based, method, figure, abundant] [supervised, learn, representation, latent, unseen] [learning, baseline, activation, training, setting, set, deep, label, accuracy, subset, data, performance, test, number, note] [shape, computer, vision, conference, limited, approach, novel, full]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Yanzhao and Wang, Xin and Jiao, Jianbin and Darrell, Trevor and Yu, Fisher},
  title = {Learning Saliency Propagation for Semi-Supervised Instance Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Speech2Action: Cross-Modal Supervision for Action Recognition
Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman


Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays describe actions, as well as contain the speech of characters and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single manually labelled action example.
[speech, action, video, dataset, verb, ava, movie, mined, visual, imsdb, corpus, recognition, dance, described, phone, kinetics, order, keyword, ivan, dialogue, transcribed, kiss, spotting, cordelia, work, automatically, temporal, drink, evaluation] [labelled, weak, stage, correlation, supervision, weakly, table, segment] [model, datasets, correlated, trained] [ieee, pattern, method, scale, performed] [train, learn, supervised] [training, learning, performance, data, number, large, manual, classification, note, classifier, manually, unlabelled, andrew, set, applied, baseline, arxiv, preprint, mining, follow, test, evaluate, label] [conference, computer, vision, human, international, single, directly, point, well]
@InProceedings{Nagrani_2020_CVPR,
  author = {Nagrani, Arsha and Sun, Chen and Ross, David and Sukthankar, Rahul and Schmid, Cordelia and Zisserman, Andrew},
  title = {Speech2Action: Cross-Modal Supervision for Action Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Normalized and Geometry-Aware Self-Attention Network for Image Captioning
Longteng Guo, Jing Liu, Xinxin Zhu, Peng Yao, Shichen Lu, Hanqing Lu


Self-attention (SA) networks have shown profound value in image captioning. In this paper, we improve SA from two aspects to promote the performance of image captioning. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization was previously only applied outside SA, we introduce a novel normalization method and demonstrate that it is both possible and beneficial to perform it on the hidden activations inside SA. Second, to compensate for a major limitation of the Transformer, namely that it fails to model the geometric structure of the input objects, we propose a class of Geometry-aware Self-Attention (GSA) that extends SA to explicitly and efficiently consider the relative geometry relations between the objects in the image. To construct our image captioning model, we combine the two modules and apply them to the vanilla self-attention network. We extensively evaluate our proposals on the MS-COCO image captioning dataset, and superior results are achieved compared to state-of-the-art approaches. Further experiments on three challenging tasks, i.e. video captioning, machine translation, and visual question answering, show the generality of our methods.
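A rough PyTorch sketch of geometry-aware attention over detected objects is given below. It stands in for the two ideas only loosely: a LayerNorm applied to the features inside the attention block is used as a stand-in for NSA's reparameterization, and a small MLP over pairwise box geometry adds a bias to the attention logits as a stand-in for GSA. The geometry features, MLP sizes, and single-head formulation are assumptions, not the paper's exact design.

import torch
import torch.nn as nn

def box_geometry(boxes):
    """Pairwise relative geometry features between boxes given as (x1, y1, x2, y2)."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]).clamp(min=1e-3)
    h = (boxes[:, 3] - boxes[:, 1]).clamp(min=1e-3)
    dx = torch.log((cx[:, None] - cx[None, :]).abs().clamp(min=1e-3) / w[:, None])
    dy = torch.log((cy[:, None] - cy[None, :]).abs().clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[:, None] / w[None, :])
    dh = torch.log(h[:, None] / h[None, :])
    return torch.stack([dx, dy, dw, dh], dim=-1)           # (N, N, 4)

class GeometryAwareAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                      # normalization inside attention
        self.qkv = nn.Linear(dim, dim * 3)
        self.geo = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats, boxes):
        q, k, v = self.qkv(self.norm(feats)).chunk(3, dim=-1)
        logits = q @ k.T / q.shape[-1] ** 0.5
        logits = logits + self.geo(box_geometry(boxes)).squeeze(-1)  # geometry bias
        return torch.softmax(logits, dim=-1) @ v

if __name__ == "__main__":
    attn = GeometryAwareAttention()
    feats = torch.randn(5, 256)
    boxes = torch.rand(5, 4) * 100
    boxes[:, 2:] += boxes[:, :2] + 1.0                     # ensure x2 > x1 and y2 > y1
    print(attn(feats, boxes).shape)                        # torch.Size([5, 256])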
[captioning, attention, visual, gsa, transformer, question, cider, gij, video, shift, evaluation, dataset, decoder, sequence, relation, outperforms, mcan, caption, dynamically] [table, object, module, jing, score, inside, instance, effectiveness, add, propose, apply] [model, input, internal, covariate, query, generality] [san, method, ieee, pattern, proposed, channel] [image, encoder, content] [normalization, network, layer, machine, nsa, normalized, neural, normalizing, arxiv, preprint, performance, deep, replace, distribution, parameter, applied, vanilla, problem, softmax, baseline, learning, set, training] [geometric, conference, computer, geometry, vision, relative, computed, position, absolute, additional, compare, international]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Longteng and Liu, Jing and Zhu, Xinxin and Yao, Peng and Lu, Shichen and Lu, Hanqing},
  title = {Normalized and Geometry-Aware Self-Attention Network for Image Captioning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Memory Enhanced Global-Local Aggregation for Video Object Detection
Yihong Chen, Yue Cao, Han Hu, Liwei Wang


How do humans recognize an object in a piece of video? Due to the deteriorated quality of a single frame, it may be hard for people to identify an occluded object in that frame by just utilizing information within one image. We argue that there are two important cues for humans to recognize objects in videos: the global semantic information and the local localization information. Recently, plenty of methods adopt self-attention mechanisms to enhance the features in the key frame with either global semantic information or local localization information. In this paper we introduce the memory enhanced global-local aggregation (MEGA) network, which is among the first methods to take full account of both global and local information. Furthermore, empowered by a novel and carefully designed Long Range Memory (LRM) module, our proposed MEGA enables the key frame to access much more content than any previous method. Enhanced by these two sources of information, our method achieves state-of-the-art performance on the ImageNet VID dataset. Code is available at https://github.com/Scalsol/mega.pytorch.
[frame, video, relation, long, temporal, previous, current, vid, time] [global, aggregation, object, detection, mega, key, module, feature, semantic, stage, table, ineffective, map, localization, denotes, cached, aggregate, regression, final, ross, gather, precomputed, totally, main, box, lrm] [model, insufficient, frm, access, influence] [range, reference, enhanced, figure, result, method, enhance, adjacent, rdn, introduced, convolutional, version, stack, proposed, scale, flow, intermediate] [content] [memory, number, size, base, performance, set, candidate, process, pool, imagenet, network, connection, problem, approximation, deep, better, classification, setting, utilizing, impact, function, update] [local, full, novel, single, refer, solving, solve, globally, form]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Yihong and Cao, Yue and Hu, Han and Wang, Liwei},
  title = {Memory Enhanced Global-Local Aggregation for Video Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Solving Mixed-Modal Jigsaw Puzzle for Fine-Grained Sketch-Based Image Retrieval
Kaiyue Pang, Yongxin Yang, Timothy M. Hospedales, Tao Xiang, Yi-Zhe Song


ImageNet pre-training has long been considered crucial by the fine-grained sketch-based image retrieval (FG-SBIR) community due to the lack of large sketch-photo paired datasets for FG-SBIR training. In this paper, we propose a self-supervised alternative for representation pre-training. Specifically, we consider the jigsaw puzzle game of recomposing images from shuffled parts. We identify two key facets of jigsaw task design that are required for effective FG-SBIR pre-training. The first is formulating the puzzle in a mixed-modality fashion. Second, we show that framing the optimisation as permutation matrix inference via Sinkhorn iterations is more effective than the common classifier formulation of Jigsaw self-supervision. Experiments show that this self-supervised pre-training strategy significantly outperforms the standard ImageNet-based pipeline across all four product-level FG-SBIR benchmarks. Interestingly, it also leads to improved cross-category generalisation across both pre-train/fine-tune and fine-tune/testing stages.
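Sinkhorn normalization itself is compact enough to show in full. The PyTorch sketch below turns an n x n patch-to-position score matrix into a doubly-stochastic soft permutation by alternating row and column normalization in log space; the iteration count and the way scores are produced upstream are assumptions, not details taken from the paper.

import torch

def sinkhorn(log_scores, n_iters=20):
    """Map an (n, n) score matrix to a doubly-stochastic (soft permutation)
    matrix by alternately normalizing rows and columns in log space."""
    log_p = log_scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows sum to 1
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # cols sum to 1
    return log_p.exp()

if __name__ == "__main__":
    torch.manual_seed(0)
    scores = torch.randn(4, 4)              # e.g. patch-to-position assignment scores
    P = sinkhorn(scores)
    print(P.sum(dim=0), P.sum(dim=1))       # both close to all-ones
    print(P.argmax(dim=1))                  # hard assignment read off the soft matrix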
[dataset, visual, granularity, retrieval, provide, downstream] [category, object, cnn, feature, instance, assignment, table, stage, including, edge] [model, success, input, effective, datasets, testing] [figure, patch, proposed, method, operator] [jigsaw, puzzle, image, photo, shoe, representation, edgemap, qmul, oursshoe, generalisation, tao, sketch, handbag, sketchy, sbir, target, edgemaps, loss, shuffled, domain] [imagenet, learning, training, task, triplet, permutation, strategy, performance, matrix, standard, rate, deep, set, data, classification, problem, timothy, ranking, product, requires, best, number, better, architecture, required] [solving, solver, sinkhorn, chair, matching, solve]
@InProceedings{Pang_2020_CVPR,
  author = {Pang, Kaiyue and Yang, Yongxin and Hospedales, Timothy M. and Xiang, Tao and Song, Yi-Zhe},
  title = {Solving Mixed-Modal Jigsaw Puzzle for Fine-Grained Sketch-Based Image Retrieval},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
LG-GAN: Label Guided Adversarial Network for Flexible Targeted Attack of Point Cloud Based Deep Networks
Hang Zhou, Dongdong Chen, Jing Liao, Kejiang Chen, Xiaoyi Dong, Kunlin Liu, Weiming Zhang, Gang Hua, Nenghai Yu


Deep neural networks have made tremendous progress in 3D point-cloud recognition. Recent works have shown that these 3D recognition networks are also vulnerable to adversarial samples produced by various attack methods, including the optimization-based 3D Carlini-Wagner attack, the gradient-based iterative fast gradient method, and skeleton-detach based point-dropping. However, after a careful analysis, these methods are either extremely slow because of the optimization/iterative scheme, or not flexible enough to support targeted attacks on a specific category. To overcome these shortcomings, this paper proposes a novel label guided adversarial network (LG-GAN) for real-time flexible targeted point cloud attack. To the best of our knowledge, this is the first generation-based 3D point cloud attack method. By feeding the original point clouds and the target attack label into LG-GAN, it learns how to deform the point clouds to mislead the recognition network into the specified label with only a single forward pass. In detail, LG-GAN first leverages one multi-branch adversarial network to extract hierarchical features of the input point clouds, then incorporates the specified label information into multiple intermediate features using the label encoder. Finally, the encoded features are fed into the coordinate reconstruction decoder to generate the target adversarial sample. By evaluating different point-cloud recognition models (e.g., PointNet, PointNet++ and DGCNN), we demonstrate that the proposed LG-GAN can support flexible targeted attacks on the fly while guaranteeing good attack performance and higher efficiency simultaneously.
[recognition, graph, three, multiple, work, extract, hierarchical, critical] [object, guided, jing, weiming, faster, overcome, effectiveness, propose] [attack, adversarial, targeted, success, nenghai, original, input, fgsm, ifgm, dongdong, model, defense, ian, hang, vulnerable, motivated] [ieee, based, pattern, method, flexible, proposed, existing, intermediate, convolutional] [target, loss, generation, generate, specific, real] [network, deep, neural, label, arxiv, preprint, learning, classification, better, support, objective, rate, processing, gradient, good, performance, efficiency, evaluate, achieve] [point, computer, conference, vision, cloud, international, hao, novel, reconstruction, demonstrate, pointnet, shape, leonidas, single]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Hang and Chen, Dongdong and Liao, Jing and Chen, Kejiang and Dong, Xiaoyi and Liu, Kunlin and Zhang, Weiming and Hua, Gang and Yu, Nenghai},
  title = {LG-GAN: Label Guided Adversarial Network for Flexible Targeted Attack of Point Cloud Based Deep Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Memory Aggregation Networks for Efficient Interactive Video Object Segmentation
Jiaxu Miao, Yunchao Wei, Yi Yang


Interactive video object segmentation (iVOS) aims at efficiently harvesting high-quality segmentation masks of the target object in a video with user interactions. Most previous state-of-the-art methods tackle iVOS with two independent networks for conducting user interaction and temporal propagation, respectively, leading to inefficiencies during the inference stage. In this work, we propose a unified framework, named Memory Aggregation Networks (MA-Net), to address the challenging iVOS in a more efficient way. Our MA-Net integrates the interaction and the propagation operations into a single network, which significantly promotes the efficiency of iVOS in the scheme of multi-round interactions. More importantly, we propose a simple yet effective memory aggregation mechanism to record the informative knowledge from the previous interaction rounds, greatly improving the robustness of discovering challenging objects of interest. We conduct extensive experiments on the validation set of the DAVIS Challenge 2018 benchmark. In particular, our MA-Net achieves a J@60 score of 76.1% without any bells and whistles, outperforming the state of the art by more than 2.7%.
[frame, interaction, embedding, video, previous, current, time, mechanism, multiple, temporal] [segmentation, map, object, round, global, interactive, propagation, mask, annotated, davis, branch, ivos, predicted, employ, vos, propose, denotes, aggregation, challenge, achieves, positive, head, tackle, score, annotation, refine] [model, input] [pixel, method, figure, convolutional, conv, read, proposed, based] [user, target, generate, unsupervised, encoder, train, synthesized, utilize] [memory, set, training, learning, augmented, accuracy, efficient, record, size, informative, validation, network, processing, knowledge, denote, number, select, inference, negative, computation] [local, matching, nearest, distance, compute, additional]
@InProceedings{Miao_2020_CVPR,
  author = {Miao, Jiaxu and Wei, Yunchao and Yang, Yi},
  title = {Memory Aggregation Networks for Efficient Interactive Video Object Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
VQA With No Questions-Answers Training
Ben-Zion Vatashsky, Shimon Ullman


Methods for teaching machines to answer visual questions have made significant progress in recent years, but current methods still lack important human capabilities, including integrating new visual classes and concepts in a modular manner, providing explanations for the answers and handling new domains without explicit examples. We propose a novel method that consists of two main parts: generating a question graph representation, and an answering procedure, guided by the abstract structure of the question graph to invoke an extendable set of visual estimators. Training is performed for the language part and the visual part on their own, but unlike existing schemes, the method does not require any training using images with associated questions and answers. This approach is able to handle novel domains (extended question types and new object classes, properties and relations) as long as corresponding visual estimators are available. In addition, it can provide explanations to its answers and suggest alternatives when questions are not grounded in the image. We demonstrate that this approach achieves both high performance and domain extensibility without any questions-answers training.
[visual, question, answering, graph, clevr, uncord, language, pythia, vqa, dataset, blue, node, answer, provide, devi, reasoning, natural, sequence, dhruv, recognition, anton, yellow, attention, include, vocabulary] [object, table, van, including, mask] [trained, model] [figure, ieee, color, pattern, method, based, existing, valid] [domain, corresponding, image, representation, extended, modified, generated, mapping] [training, arxiv, preprint, neural, set, learning, data, procedure, accuracy, performance, machine, simple, number, test, processing, size, general, note, required] [conference, vision, computer, property, novel, left, approach, well, international, handle, scene, additional, full]
@InProceedings{Vatashsky_2020_CVPR,
  author = {Vatashsky, Ben-Zion and Ullman, Shimon},
  title = {VQA With No Questions-Answers Training},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Counting Out Time: Class Agnostic Video Repetition Counting in the Wild
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman


We present an approach for estimating the period with which an action is repeated in a video. The crux of the approach lies in constraining the period prediction module to use temporal self-similarity as an intermediate representation bottleneck that allows generalization to unseen repetitions in videos in the wild. We train this model, called RepNet, with a synthetic dataset that is generated from a large unlabeled video collection by sampling short clips of varying lengths and repeating them with different periods and counts. This combination of synthetic data and a powerful yet constrained model allows us to predict periods in a class-agnostic fashion. Our model substantially exceeds state-of-the-art performance on existing periodicity (PERTUBE) and repetition counting (QUVA) benchmarks. We also collect a new challenging dataset called Countix (90 times larger than existing datasets) which captures the challenges of repetition counting in real-world videos. Project webpage: https://sites.google.com/view/repnet.
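The self-similarity bottleneck at the heart of this approach is easy to reproduce in a few lines. The PyTorch sketch below builds a row-softmaxed temporal self-similarity matrix from per-frame embeddings using negative squared Euclidean distance; the temperature and the toy repeating clip are assumptions, and RepNet's period-prediction network on top of the matrix is not shown.

import torch

def temporal_self_similarity(embeddings, temperature=1.0):
    """Per-frame embeddings (T, D) -> row-softmaxed self-similarity matrix (T, T),
    using negative squared Euclidean distance as the similarity."""
    dists = torch.cdist(embeddings, embeddings) ** 2       # (T, T) squared distances
    return torch.softmax(-dists / temperature, dim=1)

if __name__ == "__main__":
    torch.manual_seed(0)
    period, reps, dim = 5, 4, 16
    one_cycle = torch.randn(period, dim)
    video = one_cycle.repeat(reps, 1)                      # a perfectly repeating clip
    tsm = temporal_self_similarity(video)
    # Repetition shows up as a periodic stripe pattern in the TSM:
    # frame 0 is most similar to frames {0, 5, 10, 15}.
    print(tsm.shape, tsm[0].topk(reps).indices)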
[video, frame, temporal, dataset, action, length, embeddings, predict, recognition, kinetics, tsm, temporally, visual, transformer, speed, prediction] [table, detection, semantic, score] [model, trained, datasets] [repetition, period, counting, periodicity, motion, repeating, countix, ieee, figure, periodic, repnet, pattern, repeated, obo, existing, quva, mae, scale, analysis, convolutional] [synthetic, real, train, encoder, representation, project, image, person, generated, diverse] [training, data, count, matrix, learning, number, architecture, andrew, augmentation, set, classification, predictor, performance, deep, large, layer, network, size, neural, learned] [computer, conference, vision, camera, international, estimation, jonathan]
@InProceedings{Dwibedi_2020_CVPR,
  author = {Dwibedi, Debidatta and Aytar, Yusuf and Tompson, Jonathan and Sermanet, Pierre and Zisserman, Andrew},
  title = {Counting Out Time: Class Agnostic Video Repetition Counting in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SaccadeNet: A Fast and Accurate Object Detector
Shiyi Lan, Zhou Ren, Yi Wu, Larry S. Davis, Gang Hua


Object detection is an essential step towards holistic scene understanding. Most existing object detection algorithms attend to certain object areas once and then predict the object locations. However, studies have revealed that humans do not look at a scene with a fixed gaze. Instead, human eyes move around, locating informative parts to understand the object location. This active perceptual movement process is called a saccade. In this paper, inspired by this mechanism, we propose a fast and accurate object detector called SaccadeNet. It contains four main modules: the Center Attentive Module, the Corner Attentive Module, the Attention Transitive Module, and the Aggregation Attentive Module, which allow it to attend to different informative object keypoints actively and predict object locations from coarse to fine. The Corner Attentive Module is used only during training to extract more informative corner features, which brings a free performance boost. On the MS COCO dataset, we achieve 40.4% mAP at 28 FPS and 30.5% mAP at 118 FPS. Among all the real-time object detectors, our SaccadeNet achieves the best detection performance, which demonstrates the effectiveness of the proposed detection mechanism.
[attention, work, predict, previous, time] [object, center, saccadenet, corner, module, bounding, attentive, table, feature, head, coco, pascal, backbone, detection, box, heatmap, map, iou, aggregation, centernet, location, grouping, cnn, voc, boundary, transitive, achieves, region, holistic, faster, centerness, height, dla, represents] [input, middle, study] [figure, output, proposed, ieee, fast, convolutional, pattern, based] [loss, image] [performance, informative, training, inference, size, layer, set, number, width, arxiv, preprint, large, network, conducted, learning] [keypoints, accurate, computer, conference, vision, keypoint, predicts, human, directly]
@InProceedings{Lan_2020_CVPR,
  author = {Lan, Shiyi and Ren, Zhou and Wu, Yi and Davis, Larry S. and Hua, Gang},
  title = {SaccadeNet: A Fast and Accurate Object Detector},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-Based Person Re-Identification
Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Zhibo Chen


Video-based person re-identification (reID) aims at matching the same person across video clips. It is a challenging task due to the existence of redundancy among frames, newly revealed appearance, occlusion, and motion blur. In this paper, we propose an attentive feature aggregation module, namely Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA), to delicately aggregate spatio-temporal features into a discriminative video-level feature representation. In order to determine the contribution/importance of a spatio-temporal feature node, we propose to learn the attention from a global view with convolutional operations. Specifically, we stack its relations, i.e., pairwise correlations with respect to a representative set of reference feature nodes (S-RFNs) that represents global video information, together with the feature itself to infer the attention. Moreover, to exploit the semantics of different levels, we propose to learn multi-granularity attentions based on the relations captured at different granularities. Extensive ablation studies demonstrate the effectiveness of our attentive feature aggregation module MG-RAFA. Our framework achieves state-of-the-art performance on three benchmark datasets.
[attention, temporal, video, granularity, modeling, node, semantics, frame, relation, outperforms, sequence, three, difficulty, multiple, recurrent] [feature, aggregation, global, attentive, propose, pooling, map, final, table, module, denotes, effectiveness, split, aggregate, semantic] [effective, representative, model, fall] [spatial, reference, proposed, figure, convolutional, comparison, motion, captured, relu, resolution] [person, discriminative, reid, loss, learn, representation, corresponding] [set, network, average, scheme, vector, computational, baseline, number, optimization, redundancy, neural, learning, design, learned, pairwise, performance, setting, small, complexity, function, dimension] [capture, single, matching, human]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Zhizheng and Lan, Cuiling and Zeng, Wenjun and Chen, Zhibo},
  title = {Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-Based Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Video Object Grounding Using Semantic Roles in Language Description
Arka Sadhu, Kan Chen, Ram Nevatia


We explore the task of Video Object Grounding (VOG), which grounds objects in videos referred to in natural language descriptions. Previous methods apply image-grounding-based algorithms to address VOG, but they fail to exploit object relation information and suffer from limited generalization. Here, we investigate the role of object relations in VOG and propose a novel framework, VOGNet, to encode multi-modal object relations via self-attention with relative position encoding. To evaluate VOGNet, we propose novel contrasting sampling methods to generate more challenging grounding input samples, and construct a new dataset called ActivityNet-SRL (ASRL) based on existing caption and grounding datasets. Experiments on ASRL validate the need for encoding object relations in VOG, and our VOGNet outperforms competitive baselines by a significant margin.
[video, language, spat, visual, grounding, transformer, vognet, temp, encoding, role, vog, encode, frame, temporal, concatenation, multiple, natural, dataset, referring, action, phrase, relation, grounded, sacc, description, referred, time, svsq, imggrnd, vidgrnd, contrasting, lemmatized, concatenated, man, question, correct, evaluation, attention] [object, semantic, proposal, table, propose, module, height, feature, bounding, box, sep, score, segment] [model, query, acc, trained] [figure, based, spatial, proposed] [image] [contrastive, sampling, training, set, sample, accuracy, width, network, mark, validation, neural, learning, number, applied, test] [relative, position, single, system, additional]
@InProceedings{Sadhu_2020_CVPR,
  author = {Sadhu, Arka and Chen, Kan and Nevatia, Ram},
  title = {Video Object Grounding Using Semantic Roles in Language Description},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Designing Network Design Spaces
Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollar


In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
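The quantized linear parameterization behind RegNet is small enough to write out. The sketch below follows the published rule as commonly implemented: block widths u_j = w0 + wa*j are snapped to w0*wm^s and then rounded to a multiple of q. Rounding widths to multiples of 8 and the example parameter values are conventional assumptions, not figures quoted from this abstract.

import numpy as np

def regnet_widths(depth, w0, wa, wm, q=8):
    """Quantized linear width rule: u_j = w0 + wa * j, snapped to the nearest
    power of wm times w0, then rounded to a multiple of q."""
    j = np.arange(depth)
    u = w0 + wa * j                                   # linear per-block widths
    s = np.round(np.log(u / w0) / np.log(wm))         # per-block exponent
    w = w0 * np.power(wm, s)                          # quantized widths
    w = (np.round(w / q) * q).astype(int)             # snap to multiples of q
    widths, stage_depths = np.unique(w, return_counts=True)
    return widths, stage_depths                       # one entry per stage

if __name__ == "__main__":
    widths, depths = regnet_widths(depth=13, w0=24, wa=36.0, wm=2.5)
    print(widths, depths)   # a handful of stages with increasing width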
[goal, work, individual] [stage, table, apply, focus, resnet] [model, methodology, analyze] [block, figure, convolutional] [train, generalize, discover, consists] [design, network, space, width, training, regnetx, regnet, best, fficient, anynetxe, linear, number, anynetxa, neural, good, anynetx, anynetxc, manual, search, regime, note, better, general, anynetxb, designing, simple, standard, higher, top, test, cumulative, group, architecture, bottleneck, mobile, empirical, size, anynetxd, efit, quantized, deep, process, population, distribution, anynet, params, flop, finding, fixed, sampling, imagenet, schedule, parameterization] [error, structure, single, compute, compare, body, depth, initial, refer]
@InProceedings{Radosavovic_2020_CVPR,
  author = {Radosavovic, Ilija and Kosaraju, Raj Prateek and Girshick, Ross and He, Kaiming and Dollar, Piotr},
  title = {Designing Network Design Spaces},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
12-in-1: Multi-Task Vision and Language Representation Learning
Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee


Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task model. Our approach culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth analysis of the effect of joint training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.
[visual, language, question, referring, vqa, grounding, multitask, dataset, natural, vilbert, work, retrieval, token, outperforms, caption, devi, answering, gqa, dsgt, dhruv] [table, score, coco, overlap, flickr] [model, trained, datasets, expression, input, cleaned] [dynamic, comparison, ieee, based, proposed, pattern, analysis] [image, train, curriculum, independent, shared, row, diverse, loss, representation, perform] [task, training, learning, performance, test, arxiv, preprint, average, consider, base, architecture, set, pretraining, validation, neural, data, number, deep, compared, setting, size, simple, group] [single, vision, full, conference, computer, jointly, approach, compute]
@InProceedings{Lu_2020_CVPR,
  author = {Lu, Jiasen and Goswami, Vedanuj and Rohrbach, Marcus and Parikh, Devi and Lee, Stefan},
  title = {12-in-1: Multi-Task Vision and Language Representation Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MLCVNet: Multi-Level Context VoteNet for 3D Object Detection
Qian Xie, Yu-Kun Lai, Jing Wu, Zhoutao Wang, Yiming Zhang, Kai Xu, Jun Wang


In this paper, we address the 3D object detection task by capturing multi-level contextual information with the self-attention mechanism and multi-scale feature fusion. Most existing 3D object detection methods recognize objects individually, without giving any consideration to contextual information between these objects. Comparatively, we propose Multi-Level Context VoteNet (MLCVNet) to recognize 3D objects correlatively, building on the state-of-the-art VoteNet. We introduce three context modules into the voting and classifying stages of VoteNet to encode contextual information at different levels. Specifically, a Patch-to-Patch Context (PPC) module is employed to capture contextual information between the point patches, before voting for their corresponding object centroid points. Subsequently, an Object-to-Object Context (OOC) module is incorporated before the proposal and classification stage, to capture the contextual information between object candidates. Finally, a Global Scene Context (GSC) module is designed to learn the global scene context. We demonstrate these by capturing contextual information at patch, object and scene levels. Our method is an effective way to promote detection accuracy, achieving new state-of-the-art detection performance on challenging 3D object detection datasets, i.e., SUN RGBD and ScanNet. We also release our code at https://github.com/NUAAXQ/MLCVNet.
[context, three, attention, selfattention, work, dataset, recognize, mechanism, encode, understanding, multiple] [object, detection, contextual, module, votenet, feature, global, voting, sun, bounding, ppc, ooc, mlcvnet, table, proposal, segmentation, gsc, map, propose, semantic, level, box, hough, surrounding, effectiveness, seed] [input, model, improve] [ieee, pattern, proposed, patch, method, based, comparison, figure, fusion, convolutional] [cluster, missing, qualitative, image] [network, learning, deep, neural, processing, performance, data, max, architecture, strategy, set] [point, scene, computer, conference, vision, cloud, indoor, capture, chair, scannet, room, mlp, international, leonidas, demonstrate]
@InProceedings{Xie_2020_CVPR,
  author = {Xie, Qian and Lai, Yu-Kun and Wu, Jing and Wang, Zhoutao and Zhang, Yiming and Xu, Kai and Wang, Jun},
  title = {MLCVNet: Multi-Level Context VoteNet for 3D Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Listen to Look: Action Recognition by Previewing Audio
Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani


In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalities---a single frame and its accompanying audio---reducing short-term temporal redundancy for efficient clip-level recognition. Second, building on ImgAud2Vid, we further propose ImgAud-Skimming, an attention-based long short-term memory network that iteratively selects useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition. Extensive experiments on four action recognition datasets demonstrate that our method achieves the state-of-the-art in terms of both recognition accuracy and speed.
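The ImgAud2Vid distillation step can be pictured with a very small sketch: a lightweight student fuses one frame feature and one audio feature and regresses the feature a heavy clip-level teacher would have produced. Everything below (layer sizes, feature dimensions, MSE as the distillation loss) is an illustrative assumption, not the paper's exact architecture.

import torch
import torch.nn as nn

class ImageAudioStudent(nn.Module):
    """Toy student that fuses a single-frame feature and an audio feature and is
    trained to regress the expensive clip-level (teacher) feature."""
    def __init__(self, img_dim=512, aud_dim=128, out_dim=1024):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + aud_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, img_feat, aud_feat):
        return self.fuse(torch.cat([img_feat, aud_feat], dim=-1))

if __name__ == "__main__":
    student = ImageAudioStudent()
    img, aud = torch.randn(8, 512), torch.randn(8, 128)
    teacher_feat = torch.randn(8, 1024)          # would come from a heavy clip model
    loss = nn.functional.mse_loss(student(img, aud), teacher_feat)
    loss.backward()                              # distillation step on the student only
    print(float(loss))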
[video, action, audio, recognition, aud, untrimmed, indexing, frame, time, temporal, visual, clip, long, work, lstm, short, step, attention, state, prediction, hidden, sound, activitynet, skimming, kimming, mechanism, accompanying, spatiotemporal, modality, current] [feature, propose, key, framework, final, segment, table] [model, input] [method, figure, based, preview, fusion, convolutional] [image, perform] [efficient, network, accuracy, distillation, learning, selection, selected, expensive, redundancy, teacher, distilled, deep, uniform, processing, efficiency, layer, student, average, imagenet, selects, subset, entire, process, select, achieve, computation, vector] [approach, single, descriptor, human, cost]
@InProceedings{Gao_2020_CVPR,
  author = {Gao, Ruohan and Oh, Tae-Hyun and Grauman, Kristen and Torresani, Lorenzo},
  title = {Listen to Look: Action Recognition by Previewing Audio},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Attention Convolutional Binary Neural Tree for Fine-Grained Visual Categorization
Ruyi Ji, Longyin Wen, Libo Zhang, Dawei Du, Yanjun Wu, Chen Zhao, Xianglong Liu, Feiyue Huang


Fine-grained visual categorization (FGVC) is an important but challenging task due to high intra-class variances and low inter-class variances caused by deformation, occlusion, illumination, etc. An attention convolutional binary neural tree architecture is presented to address those problems for weakly supervised FGVC. Specifically, we incorporate convolutional operations along edges of the tree structure, and use the routing functions in each node to determine the root-to-leaf computational paths within the tree. The final decision is computed as the summation of the predictions from leaf nodes. The deep convolutional operations learn to capture the representations of objects, and the tree structure characterizes the coarse-to-fine hierarchical feature learning process. In addition, we use the attention transformer module to force the network to capture discriminative features. The negative log-likelihood loss is used to train the entire network in an end-to-end fashion by SGD with back-propagation. Several experiments on the CUB-200-2011, Stanford Cars and Aircraft datasets demonstrate that the proposed method performs favorably against the state of the art.
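The routing-and-sum structure (soft routers at internal nodes, leaf predictions combined by path probability) can be illustrated with a generic soft binary decision tree in PyTorch. The sketch below omits the convolutional edge operations and attention transformer modules that the paper adds on top, and all layer shapes are assumptions.

import torch
import torch.nn as nn

class SoftBinaryTree(nn.Module):
    """Bare-bones soft binary decision tree: each internal node routes with a
    sigmoid, each leaf holds class scores, and the output is the
    path-probability-weighted sum over leaves."""
    def __init__(self, feat_dim, n_classes, depth=3):
        super().__init__()
        self.depth = depth
        self.routers = nn.Linear(feat_dim, 2 ** depth - 1)    # one logit per internal node
        self.leaves = nn.Parameter(torch.zeros(2 ** depth, n_classes))

    def forward(self, x):
        route = torch.sigmoid(self.routers(x))                # (B, n_internal_nodes)
        path_prob = x.new_ones(x.shape[0], 1)
        node = 0
        for level in range(self.depth):
            n_nodes = 2 ** level
            p = route[:, node:node + n_nodes]                 # routers at this level
            # Go left with prob p, right with prob (1 - p); interleave the children.
            path_prob = torch.stack([path_prob * p, path_prob * (1 - p)], dim=-1)
            path_prob = path_prob.reshape(x.shape[0], 2 * n_nodes)
            node += n_nodes
        logits = path_prob @ self.leaves                      # weighted sum of leaf scores
        return torch.log_softmax(logits, dim=-1)              # pairs with an NLL loss

if __name__ == "__main__":
    tree = SoftBinaryTree(feat_dim=128, n_classes=10)
    print(tree(torch.randn(4, 128)).shape)                    # torch.Size([4, 10])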
[attention, transformer, dataset, visual, node, hierarchical, construct, context, mechanism, prediction, evaluation] [leaf, module, acnet, object, feature, branch, backbone, table, focus, effectiveness, height, pooling, fully, weakly, global, final, aircraft, cnn, aspp, dcl, category, categorization, challenging] [decision, model, accumulated] [tree, convolutional, method, figure, routing, proposed, formed, block, summation, dilated, high] [discriminative, supervised, image, learn, subordinate, representation, loss] [network, neural, learning, architecture, deep, training, accuracy, binary, classification, set, layer, probability, stanford, better, process, size, label, best, performance, number, data] [accurate, left, capture, structure]
@InProceedings{Ji_2020_CVPR,
  author = {Ji, Ruyi and Wen, Longyin and Zhang, Libo and Du, Dawei and Wu, Yanjun and Zhao, Chen and Liu, Xianglong and Huang, Feiyue},
  title = {Attention Convolutional Binary Neural Tree for Fine-Grained Visual Categorization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Music Gesture for Visual Sound Separation
Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba


Recent deep learning approaches have achieved impressive performance on visual sound separation tasks. However, these approaches are mostly built on appearance and optical flow like motion feature representations, which exhibit limited abilities to find the correlations between audio signals and visual points, especially when separating multiple instruments of the same types, such as multiple violins in a scene. To address this, we propose "Music Gesture," a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music. We first adopt a context-aware graph network to integrate visual semantic context with body dynamics and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals. Experimental results on three music performance datasets show: 1) strong improvements upon benchmark metrics for hetero-musical separation tasks (i.e. different instruments); 2) a new ability for effective homo-musical separation for piano, flute, and trumpet duets, which to the best of our knowledge has never been achieved with alternative methods.
[sound, visual, video, audio, music, speech, structured, context, spectrogram, work, attention, graph, previous, dataset, chuang, antonio, explicit, som, temporal, extract, multiple, three, instrument, musical, urmp] [table, feature, semantic, adopt, module, associate, predicted, propose, challenging] [model, input, study, hang] [separation, analysis, motion, fusion, proposed, ieee, figure, separating, signal, based, dynamic] [source, perform, real, separate, appearance, consists] [network, learning, deep, andrew, training, performance, matrix, mixture, arxiv, preprint, processing, neural, data, better] [body, human, hand, conference, system, keypoints, international, vision, keypoint, sdr, computer, pose, approach]
@InProceedings{Gan_2020_CVPR,
  author = {Gan, Chuang and Huang, Deng and Zhao, Hang and Tenenbaum, Joshua B. and Torralba, Antonio},
  title = {Music Gesture for Visual Sound Separation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Referring Image Segmentation via Cross-Modal Progressive Comprehension
Shaofei Huang, Tianrui Hui, Si Liu, Guanbin Li, Yunchao Wei, Jizhong Han, Luoqi Liu, Bo Li


Referring image segmentation aims at segmenting the foreground masks of the entities that best match the description given in the natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities, but usually fail to explore informative words of the expression to well align features from the two modalities for accurately identifying the referred entity. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task. Concretely, the CMPC module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. Then, the relational words are adopted to highlight the correct entity as well as suppress other irrelevant ones by multimodal graph reasoning. In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels with the guidance of textual information. In this way, features from multiple levels can communicate with each other and be refined based on the textual context. We conduct extensive experiments on four popular referring segmentation benchmarks and achieve new state-of-the-art performance. Code is available at https://github.com/spyflying/CMPC-Refseg.
[referring, graph, visual, multimodal, entity, reasoning, exchange, cmpc, linguistic, relational, context, comprehension, referent, man, word, language, tgfe, natural, holding, perception, unc, relationship, highlight, referred, lstm, illustrated, multiple, frisbee, rcl, previous, interaction] [feature, module, segmentation, val, table, iou, propose, region, semantic, level, global, stage, effectiveness, cnn, denotes, mask, fully, affinity, challenging] [expression, model, white, conduct] [convolution, figure, spatial, based, method, fusion, fused, convlstm, convolutional] [image, progressive, row, attribute, progressively, extracted] [set, number, arxiv, preprint, problem, matrix, deep, learning, vector, baseline] [well, vertex, single, full]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Shaofei and Hui, Tianrui and Liu, Si and Li, Guanbin and Wei, Yunchao and Han, Jizhong and Liu, Luoqi and Li, Bo},
  title = {Referring Image Segmentation via Cross-Modal Progressive Comprehension},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cloth in the Wind: A Case Study of Physical Measurement Through Simulation
Tom F. H. Runia, Kirill Gavrilyuk, Cees G. M. Snoek, Arnold W. M. Smeulders


For many of the physical phenomena around us, we have developed sophisticated models explaining their behavior. Nevertheless, measuring physical properties from visual observations is challenging due to the high number of causally underlying physical parameters -- including material properties and external forces. In this paper, we propose to measure latent physical properties for cloth in the wind without ever having seen a real example before. Our solution is an iterative refinement procedure with simulation at its core. The algorithm gradually updates the physical model parameters by running a simulation of the observed phenomenon and comparing the current simulation to a real-world observation. The correspondence is measured using an embedding function that maps physically similar examples to nearby points. We consider a case study of cloth in the wind, with curling flags as our leading example -- a seemingly simple phenomenon that is physically highly involved. Based on the physics of cloth and its visual manifestation, we propose an instantiation of the embedding function. For this mapping, modeled as a deep network, we introduce a spectral layer that decomposes a video volume into its temporal spectral power and corresponding frequencies. Our experiments demonstrate that the proposed method compares favorably to prior work on the task of measuring cloth material properties and external wind force from a real-world video.
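The spectral layer described at the end reduces, at its simplest, to an FFT over the time axis followed by taking spectral power. The PyTorch sketch below shows only that primitive on a toy oscillating clip; the input layout and the absence of any learned filtering are assumptions, not the paper's full layer.

import math
import torch

def temporal_power_spectrum(video_volume):
    """Decompose a (T, C, H, W) video volume into per-pixel temporal spectral
    power: FFT along time, then squared magnitude per frequency bin."""
    spec = torch.fft.rfft(video_volume, dim=0)       # (T//2 + 1, C, H, W), complex
    return spec.real ** 2 + spec.imag ** 2

if __name__ == "__main__":
    t = torch.arange(64, dtype=torch.float32)
    # Toy "flag" signal oscillating 4 times over the clip, copied onto a 1x8x8 grid.
    video = torch.sin(2 * math.pi * 4 * t / 64)[:, None, None, None].repeat(1, 1, 8, 8)
    power = temporal_power_spectrum(video)
    print(power.shape, power[:, 0, 0, 0].argmax())   # peak at frequency bin 4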
[video, embedding, visual, speed, dataset, work, engine, temporal, observation, clip, goal] [propose, area, table] [physical, model, external, input, example, case, iterative] [spectral, figure, proposed, method, frequency, based, signal] [real, train, corresponding, learn, image] [similarity, function, space, consider, parameter, layer, measure, power, learning, network, optimization, weight, stretching, number, dij, metric, training, deep, contrastive, search] [cloth, wind, simulation, material, bending, intrinsic, render, decomposition, simulated, hanging, measurement, distance, fabric, extrinsic, measuring, xsim, measured, flagsim, estimating, sdn, physically, force, scene, compare, surface]
@InProceedings{Runia_2020_CVPR,
  author = {Runia, Tom F. H. and Gavrilyuk, Kirill and Snoek, Cees G. M. and Smeulders, Arnold W. M.},
  title = {Cloth in the Wind: A Case Study of Physical Measurement Through Simulation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction
Junwei Liang, Lu Jiang, Kevin Murphy, Ting Yu, Alexander Hauptmann


This paper studies the problem of predicting the distribution over multiple possible future paths of people as they move through various visual scenes. We make two main contributions. The first contribution is a new dataset, created in a realistic 3D simulator, which is based on real world trajectory data, and then extrapolated by human annotators to achieve different latent goals. This provides the first benchmark for quantitative evaluation of the models to predict multi-future trajectories. The second contribution is a new model to generate multiple plausible future trajectories, which contains novel designs of using multi-scale location encodings and convolutional RNNs over graphs. We refer to our model as Multiverse. We show that our model achieves the best results on our dataset, as well as on the real-world VIRAT/ActEV dataset (which just contains one possible future).
[trajectory, future, prediction, multiple, dataset, predict, video, agent, attention, social, decoder, time, forking, graph, visual, evaluation, vehicle, lstm, forecasting, alexandre, predicting, work, state, activity, silvio, red] [location, semantic, final, segmentation, table, benchmark, predicted, offset, object, pedestrian, junwei, alexander, tracking] [model, trained, input] [convolutional, method, proposed, cell, spatial, column, output, quantitative, comparison, figure] [real, person, plausible, synthetic, generate, fine, encoder, train] [arxiv, preprint, learning, set, top, data, distribution, training, size, average, performance, evaluate, test, path] [grid, human, ground, truth, scene, simulation, single, second, error, coarse]
@InProceedings{Liang_2020_CVPR,
  author = {Liang, Junwei and Jiang, Lu and Murphy, Kevin and Yu, Ting and Hauptmann, Alexander},
  title = {The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CentripetalNet: Pursuing High-Quality Keypoint Pairs for Object Detection
Zhiwei Dong, Guoxuan Li, Yue Liao, Fei Wang, Pengju Ren, Chen Qian


Keypoint-based detectors have achieved strong performance. However, incorrect keypoint matching is still widespread and greatly affects the performance of the detector. In this paper, we propose CentripetalNet, which uses centripetal shift to pair corner keypoints from the same instance. CentripetalNet predicts the position and the centripetal shift of the corner points and matches corners whose shifted results are aligned. Combining position information, our approach matches corner points more accurately than the conventional embedding approaches do. Corner pooling extracts information inside the bounding boxes onto the border. To make this information more accessible at the corners, we design a cross-star deformable convolution network to conduct feature adaption. Furthermore, we explore instance segmentation on anchor-free detectors by equipping our CentripetalNet with a mask prediction module. On COCO test-dev, our CentripetalNet not only outperforms all existing anchor-free detectors with an AP of 48.0% but also achieves comparable performance to the state-of-the-art instance segmentation approaches with a 40.2% Mask AP. Code is available at https://github.com/KiveeDong/CentripetalNet.
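A toy version of the pairing rule helps explain what "shifted results are aligned" means: each corner is shifted toward the box center it implies, and a top-left / bottom-right pair is accepted only if both shifted points fall inside a small central region of the candidate box. The NumPy sketch below uses that simplified rule with a hand-picked region ratio; it is not the paper's exact matching criterion, which operates on heatmaps in feature space and defines the central region differently.

import numpy as np

def match_corners(tl_pts, tl_shifts, br_pts, br_shifts, mu=0.3):
    """Pair top-left / bottom-right corners whose centripetal-shifted points
    both land inside a central region of the candidate box (toy rule)."""
    pairs = []
    for i, (tl, ts) in enumerate(zip(tl_pts, tl_shifts)):
        for j, (br, bs) in enumerate(zip(br_pts, br_shifts)):
            if not (br[0] > tl[0] and br[1] > tl[1]):
                continue                                   # must form a valid box
            ctr_tl = tl + ts                               # shifted toward the center
            ctr_br = br + bs
            cx, cy = (tl + br) / 2                         # candidate box center
            rw, rh = mu * (br - tl) / 2                    # half-size of central region
            ok = (np.abs(ctr_tl - [cx, cy]) <= [rw, rh]).all() and \
                 (np.abs(ctr_br - [cx, cy]) <= [rw, rh]).all()
            if ok:
                pairs.append((i, j))
    return pairs

if __name__ == "__main__":
    tl = np.array([[0.0, 0.0], [50.0, 0.0]])
    br = np.array([[40.0, 40.0], [90.0, 40.0]])
    tl_shift = np.array([[20.0, 20.0], [20.0, 20.0]])      # each points to its own center
    br_shift = np.array([[-20.0, -20.0], [-20.0, -20.0]])
    print(match_corners(tl, tl_shift, br, br_shift))       # [(0, 0), (1, 1)]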
[shift, prediction, embedding, pair, predict] [corner, centripetal, object, feature, centripetalnet, center, bounding, instance, mask, detection, offset, segmentation, cornernet, predicted, module, box, anchor, associative, table, roi, region, pooling, centernet, apply, achieves, map, extreme, backbone, ross, add, kaiming, propose, positive] [model, improve] [deformable, convolution, figure, method, field, conv, based, chen, guiding, convolutional, comparison, pattern] [adaption, loss, generate, train, learn] [performance, network, learning, compared, arxiv, preprint, set, deep, better, fei, large, training, top] [matching, geometric, computer, point, match, border, predicts, ground, position]
@InProceedings{Dong_2020_CVPR,
  author = {Dong, Zhiwei and Li, Guoxuan and Liao, Yue and Wang, Fei and Ren, Pengju and Qian, Chen},
  title = {CentripetalNet: Pursuing High-Quality Keypoint Pairs for Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection
Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, Hongsheng Li


We present a novel and high-performance 3D object detection framework, named PointVoxel-RCNN (PV-RCNN), for accurate 3D object detection from point clouds. Our proposed method deeply integrates both 3D voxel Convolutional Neural Network (CNN) and PointNet-based set abstraction to learn more discriminative point cloud features. It takes advantages of efficient learning and high-quality proposals of the 3D voxel CNN and the flexible receptive fields of the PointNet-based networks. Specifically, the proposed framework summarizes the 3D scene with a 3D voxel CNN into a small set of keypoints via a novel voxel set abstraction module to save follow-up computations and also to encode representative scene features. Given the high-quality 3D proposals generated by the voxel CNN, the RoI-grid pooling is proposed to abstract proposal-specific features from the keypoints to the RoI-grid points via keypoint set abstraction. Compared with conventional pooling operations, the RoI-grid feature points encode much richer context information for accurately estimating object confidences and locations. Extensive experiments on both the KITTI dataset and the Waymo Open dataset show that our proposed PV-RCNN surpasses state-of-the-art 3D detection methods with remarkable margins.
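One concrete piece worth sketching is how a small set of scene-summarizing keypoints can be chosen from a raw cloud; farthest point sampling is the standard choice for this in PointNet-style pipelines. The NumPy sketch below is a generic greedy FPS, offered as background for the keypoint sampling step rather than as the paper's implementation; the random seed and toy cloud are illustrative.

import numpy as np

def furthest_point_sampling(points, n_samples, seed=0):
    """Select n_samples keypoints from an (N, 3) cloud so that each new point
    is the one furthest from the set chosen so far (greedy FPS)."""
    rng = np.random.default_rng(seed)
    chosen = np.empty(n_samples, dtype=int)
    chosen[0] = rng.integers(points.shape[0])
    min_dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for i in range(1, n_samples):
        chosen[i] = int(min_dist.argmax())                 # furthest from current set
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(points - points[chosen[i]], axis=1))
    return points[chosen]

if __name__ == "__main__":
    cloud = np.random.default_rng(1).uniform(-1, 1, size=(2048, 3))
    keypoints = furthest_point_sampling(cloud, n_samples=32)
    print(keypoints.shape)                                 # (32, 3)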
[dataset, encoding, previous, multiple, difficulty] [feature, detection, object, abstraction, proposal, pooling, lidar, roi, cnn, table, module, propose, refinement, confidence, waymo, box, level, map, hard, easy, framework, aggregate, aggregated, semantic, split, car, predicted, val, moderate, autonomous, recall, iou] [] [ieee, proposed, method, receptive, pattern, raw, convolution, neighboring, adopted, convolutional, flexible, based] [learn] [set, performance, learning, strategy, small, number, open, operation, network, efficient, better, training, test] [point, voxel, keypoint, scene, keypoints, conference, kitti, sparse, computer, vision, accurate, cloud, grid, voxels, rgb, directly, international, novel]
@InProceedings{Shi_2020_CVPR,
  author = {Shi, Shaoshuai and Guo, Chaoxu and Jiang, Li and Wang, Zhe and Shi, Jianping and Wang, Xiaogang and Li, Hongsheng},
  title = {PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Graph Embedded Pose Clustering for Anomaly Detection
Amir Markovitz, Gilad Sharir, Itamar Friedman, Lihi Zelnik-Manor, Shai Avidan


We propose a new method for anomaly detection of human actions. Our method works directly on human pose graphs that can be computed from an input video sequence. This makes the analysis independent of nuisance parameters such as viewpoint or illumination. We map these graphs to a latent space and cluster them. Each action is then represented by its soft-assignment to each of the clusters. This gives a kind of "bag of words" representation to the data, where every action is represented by its similarity to a group of base action-words. Then, we use a Dirichlet process based mixture, which is useful for handling proportional data such as our soft-assignment vectors, to determine if an action is normal or not. We evaluate our method on two types of data sets. The first is a fine-grained anomaly detection data set (e.g. ShanghaiTech) where we wish to detect unusual variations of some action. The second is a coarse-grained anomaly detection data set (e.g., a Kinetics-based data set) where few actions are considered normal, and every other action should be considered abnormal. Extensive experiments on the benchmarks show that our method performs considerably better than other state-of-the-art methods.
[graph, action, adjacency, video, temporal, dataset, dirichlet, attention, sequence, embedding, embedded, campus, people, frame, prediction, gcn, embed, kinetics, pik] [detection, score, split, table] [model, input] [method, based, proposed, figure, shanghaitech, convolutional, ieee, spatial, optimized, presented, convolution] [cluster, latent, meaningful, autoencoder, consists, loss, representation, unsupervised, extracted] [anomaly, clustering, training, abnormal, algorithm, sample, deep, learning, data, random, distribution, set, large, number, process, setting, denote, evaluate, mixture, test, matrix, learned, considered, amount, subset] [pose, human, normal, conference, represented, estimation, international, inferred, computer, single, reconstruction, provided, vision, determine, second, capture]
@InProceedings{Markovitz_2020_CVPR,
  author = {Markovitz, Amir and Sharir, Gilad and Friedman, Itamar and Zelnik-Manor, Lihi and Avidan, Shai},
  title = {Graph Embedded Pose Clustering for Anomaly Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Disp R-CNN: Stereo 3D Object Detection via Shape Prior Guided Instance Disparity Estimation
Jiaming Sun, Linghao Chen, Yiming Xie, Siyu Zhang, Qinhong Jiang, Xiaowei Zhou, Hujun Bao


In this paper, we propose a novel system named Disp R-CNN for 3D object detection from stereo images. Many recent works solve this problem by first recovering a point cloud with disparity estimation and then apply a 3D detector. The disparity map is computed for the entire image, which is costly and fails to leverage category-specific prior. In contrast, we design an instance disparity estimation network (iDispNet) that predicts disparity only for pixels on objects of interest and learns a category-specific shape prior for more accurate disparity estimation. To address the challenge from scarcity of disparity annotation in training, we propose to use a statistical shape model to generate dense disparity pseudo-ground-truth without the need of LiDAR point clouds, which makes our system more widely applicable. Experiments on the KITTI dataset show that, even when LiDAR ground-truth is not available at training time, Disp R-CNN achieves competitive performance and outperforms previous state-of-the-art methods by 20% in terms of average precision. The code will be available at https://github.com/zju3dv/disprcnn.
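To make concrete how an instance disparity map can be turned into a 3D point cloud for the downstream detector, here is a minimal back-projection sketch using a standard pinhole/stereo model; the function name and the toy calibration values are illustrative assumptions, not taken from the Disp R-CNN code.

import numpy as np

def instance_disparity_to_points(disparity, mask, fx, baseline, cx, cy):
    """Back-project an instance disparity map into a 3D point cloud.
    disparity: (H, W) predicted disparities; mask: (H, W) boolean instance mask;
    fx, baseline, cx, cy: stereo calibration (focal length, baseline, principal point)."""
    v, u = np.nonzero(mask & (disparity > 0))
    d = disparity[v, u]
    z = fx * baseline / d                 # depth from disparity
    x = (u - cx) * z / fx                 # back-project with a pinhole model
    y = (v - cy) * z / fx
    return np.stack([x, y, z], axis=1)    # (N, 3) points in the camera frame

# toy example with a 4x4 disparity map and a 2x2 instance mask
disp = np.full((4, 4), 20.0)
mask = np.zeros((4, 4), bool); mask[1:3, 1:3] = True
pts = instance_disparity_to_points(disp, mask, fx=700.0, baseline=0.54, cx=2.0, cy=2.0)
print(pts.shape, pts[0])                  # (4, 3) ...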
[dataset, previous, provide, time] [object, detection, instance, lidar, bounding, idispnet, supervision, foreground, autonomous, benchmark, mask, easy, propose, map, detector, box, segmentation, roi, apbev, hard] [model, input] [disparity, method, ieee, pattern, proposed, prior, psmnet, figure, based, running, pixel] [generation] [training, network, dimension, process, data, performance, large, statistical, learning, optimization, regularization, deep, design, average, set, function] [shape, point, stereo, estimation, conference, cloud, kitti, computer, vision, depth, left, accurate, sparse, dense, cost, monocular, international, defined, disp, matching, full, volume, zoomnet, pose, system]
@InProceedings{Sun_2020_CVPR,
  author = {Sun, Jiaming and Chen, Linghao and Xie, Yiming and Zhang, Siyu and Jiang, Qinhong and Zhou, Xiaowei and Bao, Hujun},
  title = {Disp R-CNN: Stereo 3D Object Detection via Shape Prior Guided Instance Disparity Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deepstrip: High-Resolution Boundary Refinement
Peng Zhou, Brian Price, Scott Cohen, Gregg Wilensky, Larry S. Davis


In this paper, we target refining the boundaries in high-resolution images given low-resolution masks. For memory and computation efficiency, we propose to convert the regions of interest into strip images and compute a boundary prediction in the strip domain. To detect the target boundary, we present a framework with two prediction layers. First, all potential boundaries are predicted as an initial prediction and then a selection layer is used to pick the target boundary and smooth the result. To encourage accurate prediction, a loss which measures the boundary distance in the strip domain is introduced. In addition, we enforce a matching consistency and C0 continuity regularization to the network to reduce false alarms. Extensive experiments on both public datasets and a newly created high-resolution dataset strongly validate our approach.
[prediction, predict, dataset, bilinear, provide, extract, speed] [boundary, mask, final, object, segmentation, denotes, height, davis, propose, contour, continuity, region, edge, refine, detection, table, interactive, refinement, precise, guided, semantic, sanja, framework, apply] [quality, input, original, argmax, public] [strip, resolution, upsampled, pixahr, upsampling, high, figure, based, low, scale, dice, pixel, column, convolutional, bilateral, steal, comparison, brian] [image, loss, target, factor, encourage] [memory, selection, layer, network, deep, learning, potential, size, active, set, regularization, computation, performance, function, applied, closer, better, sij, evaluate, large, efficient] [ground, approach, truth, initial, distance, matching, coarse, directly, normal, dense, accurate, spline, full]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Peng and Price, Brian and Cohen, Scott and Wilensky, Gregg and Davis, Larry S.},
  title = {Deepstrip: High-Resolution Boundary Refinement},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Smoothing Adversarial Domain Attack and P-Memory Reconsolidation for Cross-Domain Person Re-Identification
Guangcong Wang, Jian-Huang Lai, Wenqi Liang, Guangrun Wang


Most of the existing person re-identification (re-ID) methods achieve promising accuracy in a supervised manner, but they assume the identity labels of the target domain are available. This greatly limits the scalability of person re-ID in real-world scenarios. Therefore, the current person re-ID community focuses on the cross-domain person re-ID that aims to transfer the knowledge from a labeled source domain to an unlabeled target domain and exploits the specific knowledge from the data distribution of the target domain to further improve the performance. To reduce the gap between the source and target domains, we propose a Smoothing Adversarial Domain Attack (SADA) approach that guides the source domain images to align the target domain images by using a trained camera classifier. To stabilize a memory trace of cross-domain knowledge transfer after its initial acquisition from the source domain, we propose a p-Memory Reconsolidation (pMR) method that reconsolidates the source knowledge with a small probability p during the self-training of the target domain. With both SADA and pMR, the proposed method significantly improves the cross-domain person re-ID. Extensive experiments on Market-1501 and DukeMTMC-reID benchmarks show that our pMR-SADA outperforms all state-of-the-art methods by a large margin.
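A minimal sketch of the reconsolidation idea, assuming the usual self-training setup: the target batches carry pseudo labels, and with probability p the model is additionally updated on a labelled source batch so the source knowledge is not forgotten. The function names, the default p, and the dummy usage are hypothetical, not the authors' exact procedure.

import random

def self_train_with_pmr(target_batches, source_batches, train_step, p=0.1, epochs=1):
    """Self-training on pseudo-labelled target batches; with probability p the
    model is additionally refreshed ('reconsolidated') on a labelled source batch,
    so source knowledge is retained during target adaptation."""
    for _ in range(epochs):
        for tgt in target_batches:
            train_step(tgt)                               # usual target self-training step
            if random.random() < p:
                train_step(random.choice(source_batches)) # p-memory reconsolidation

# toy usage with a dummy training step (output order varies with the random draws)
log = []
self_train_with_pmr(target_batches=["t1", "t2", "t3"],
                    source_batches=["s1", "s2"],
                    train_step=log.append, p=0.5)
print(log)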
[dataset, current, outperforms, observed] [feature, map, table, cam, propose, improvement, main, denotes, effectiveness, obtains, ablation] [attack, model, adversarial, identity, improve, tar, iterative, overview, experimental] [proposed, method, based, convolutional, figure] [domain, person, target, source, transfer, unsupervised, image, reconsolidation, transferred, pmr, sada, gap, align, aligned, train, amnesia, gan, sour, supervised, representation, loss] [learning, smoothing, knowledge, memory, clustering, algorithm, probability, deep, classifier, network, distribution, neural, arxiv, preprint, reduce, set, unlabeled, small, large, performance, labeled, best, metric, data, accuracy] [camera, approach, second, local]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Guangcong and Lai, Jian-Huang and Liang, Wenqi and Wang, Guangrun},
  title = {Smoothing Adversarial Domain Attack and P-Memory Reconsolidation for Cross-Domain Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Meshed-Memory Transformer for Image Captioning
Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara


Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M2 - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at decoding stage to exploit low- and high-level features. Experimentally, we investigate the performance of the M2 Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performances when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.
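The "memory" part of the encoder can be pictured as attention whose keys and values are extended with learned slots encoding a-priori knowledge. The single-head PyTorch sketch below (torch assumed available) is a simplification of that idea; the class name and sizes are illustrative, and the paper's multi-head and meshed-connectivity details are omitted.

import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    """Scaled dot-product attention whose keys/values are extended with learned
    'memory' slots, so the layer can attend to knowledge not present in the
    current input (single head, for brevity)."""
    def __init__(self, d_model, n_memory):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.mem_k = nn.Parameter(torch.randn(n_memory, d_model))  # learned memory keys
        self.mem_v = nn.Parameter(torch.randn(n_memory, d_model))  # learned memory values
        self.scale = d_model ** -0.5

    def forward(self, x):                       # x: (batch, seq, d_model)
        b = x.size(0)
        q = self.q(x)
        k = torch.cat([self.k(x), self.mem_k.unsqueeze(0).expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v(x), self.mem_v.unsqueeze(0).expand(b, -1, -1)], dim=1)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v

feats = torch.randn(2, 10, 64)                  # e.g. 10 image-region features
print(MemoryAugmentedAttention(64, n_memory=4)(feats).shape)   # torch.Size([2, 10, 64])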
[attention, transformer, encoding, captioning, decoder, language, visual, sequence, meshed, decoding, cider, state, word, recurrent, evaluation, previous, three, dataset, represent, exploit, describing, connected, understanding] [table, object, art, coco, region, final, achieves, key] [model, input, ensemble, original, trained] [output, pattern, ieee, priori, operator, learnable, comparison, figure, method, convolutional] [image, encoder, generation] [layer, test, training, set, memory, performance, neural, machine, respect, learned, online, knowledge, evaluate, architecture, applied, standard, network, pairwise, learning] [conference, computer, approach, vision, international, novel, compare, single, connectivity, computed]
@InProceedings{Cornia_2020_CVPR,
  author = {Cornia, Marcella and Stefanini, Matteo and Baraldi, Lorenzo and Cucchiara, Rita},
  title = {Meshed-Memory Transformer for Image Captioning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning From Noisy Anchors for One-Stage Object Detection
Hengduo Li, Zuxuan Wu, Chen Zhu, Caiming Xiong, Richard Socher, Larry S. Davis


State-of-the-art object detectors rely on regressing and classifying an extensive list of possible anchors, which are divided into positive and negative samples based on their intersection-over-union (IoU) with corresponding ground-truth objects. Such a harsh split conditioned on IoU results in binary labels that are potentially noisy and challenging for training. In this paper, we propose to mitigate noise incurred by imperfect label assignment such that the contributions of anchors are dynamically determined by a carefully constructed cleanliness score associated with each anchor. Exploring outputs from both regression and classification branches, the cleanliness scores, estimated without incurring any additional computational overhead, are used not only as soft labels to supervise the training of the classification branch but also as sample re-weighting factors for improved localization and classification accuracy. We conduct extensive experiments on COCO, and demonstrate, among other things, that the proposed approach steadily improves RetinaNet by 2% with various backbones.
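A hedged sketch of the cleanliness idea: mix each positive anchor's localisation accuracy (IoU of its regressed box with the ground truth) and its classification confidence into one score that serves both as a soft label and as a re-weighting factor. The mixing weight alpha, the mean-based weight normalisation, and the toy numbers are assumptions for illustration, not the paper's exact formulation.

import numpy as np

def cleanliness_scores(iou_with_gt, cls_confidence, alpha=0.75):
    """Per-anchor 'cleanliness' mixing localisation accuracy (IoU of the regressed
    box with its ground truth) and classification confidence; the result can be
    used both as a soft classification target and as a loss re-weighting factor."""
    return alpha * iou_with_gt + (1.0 - alpha) * cls_confidence

iou  = np.array([0.85, 0.55, 0.30])          # regressed-box IoU for three positive anchors
conf = np.array([0.90, 0.40, 0.60])          # predicted class confidence
soft_labels = cleanliness_scores(iou, conf)
weights = soft_labels / soft_labels.mean()   # one possible re-weighting scheme
print(soft_labels, weights)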
[dynamically] [object, cleanliness, positive, detection, iou, regression, table, localization, retinanet, apos, box, hard, confidence, backbone, faster, branch, improves, background, ross, score, coco, kaiming, split, focus, aneg, recall, assignment, regressed, anchor, region, lcls, lreg] [noise] [method, noisy, based, figure, proposed, high] [loss, image, train, extensive, learn] [classification, training, soft, negative, set, network, learning, performance, candidate, sample, baseline, label, better, deep, large, binary, number, imbalance, sampling, labeled, equation, mitigate, standard, neural, note] [approach, focal, demonstrate]
@InProceedings{Li_2020_CVPR,
  author = {Li, Hengduo and Wu, Zuxuan and Zhu, Chen and Xiong, Caiming and Socher, Richard and Davis, Larry S.},
  title = {Learning From Noisy Anchors for One-Stage Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Instance-Aware, Context-Focused, and Memory-Efficient Weakly Supervised Object Detection
Zhongzheng Ren, Zhiding Yu, Xiaodong Yang, Ming-Yu Liu, Yong Jae Lee, Alexander G. Schwing, Jan Kautz


Weakly supervised learning has emerged as a compelling tool for object detection by reducing the need for strong supervision during training. However, major challenges remain: (1) differentiation of object instances can be ambiguous; (2) detectors tend to focus on discriminative parts rather than entire objects; (3) without ground truth, object proposals have to be redundant for high recalls, causing significant memory consumption. Addressing these challenges is difficult, as it often requires eliminating uncertainties and trivial solutions. To target these issues, we develop an instance-aware and context-focused unified framework. It employs an instance-aware self-training algorithm and a learnable Concrete DropBlock while devising a memory-efficient sequential batch back-propagation. Our proposed method achieves state-of-the-art results on COCO (12.1% AP, 24.8% AP50), VOC 2007 (54.9% AP), and VOC 2012 (52.1% AP), improving baselines by great margins. In addition, the proposed method is the first to benchmark ResNet based models and weakly supervised video object detection. Refer to our project page for code, models, and more details: https://github.com/NVlabs/wetectron.
[multiple, sequential, video, three] [object, weakly, detection, instance, voc, region, wsod, category, coco, mist, module, feature, supervision, proposal, regression, segmentation, localization, table, domination, yong, box, recall, score, roi, iou, bounding, resnet, framework, faster] [model, input, improve, trained] [proposed, method, spatial, figure, intermediate, convolutional] [supervised, image, discriminative, pseudo, address, jae, generate] [training, dropblock, memory, concrete, performance, deep, learning, network, set, algorithm, batch, number, class, student, dropout, better, forward, large, classification, applied, test, online, size, update, increase] [single, ambiguity, computed]
@InProceedings{Ren_2020_CVPR,
  author = {Ren, Zhongzheng and Yu, Zhiding and Yang, Xiaodong and Liu, Ming-Yu and Lee, Yong Jae and Schwing, Alexander G. and Kautz, Jan},
  title = {Instance-Aware, Context-Focused, and Memory-Efficient Weakly Supervised Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Density-Based Clustering for 3D Object Detection in Point Clouds
Syeda Mariam Ahmed, Chee Meng Chew


Current 3D detection networks either rely on 2D object proposals or try to directly predict bounding box parameters from each point in a scene. While the former methods depend on the performance of 2D detectors, the latter approaches are challenging due to the sparsity and occlusion in point clouds, making it difficult to regress accurate parameters. In this work, we introduce a novel approach for 3D object detection that is significant in two main aspects: a) a cascaded modular approach that focuses the receptive field of each module on specific points in the point cloud, for improved feature learning and b) a class-agnostic instance segmentation module that is initiated using unsupervised clustering. The objective of the cascaded approach is to sequentially minimize the number of points running through the network, while three different modules perform the tasks of background-foreground segmentation, class-agnostic instance segmentation and object detection through individually trained point-based networks. We also evaluate Bayesian uncertainty in the modules, demonstrating the overall level of confidence in our prediction results. Performance of the network is evaluated on the SUN RGB-D benchmark dataset, which demonstrates an improvement compared to state-of-the-art methods.
[predict, prediction, associated] [object, segmentation, instance, detection, module, feature, bounding, epn, box, centroid, predicted, offset, amodal, foreground, table, proposal, semantic, propose, cnn, map, represents, voting, background, regression] [original, input, true, trained, dbscan] [based, proposed, ieee, figure, pattern, cascaded, convolutional, method] [consists, loss, unsupervised, generate, cluster] [network, class, number, learning, size, deep, layer, bayesian, clustering, performance, agnostic, binary, classification, neural, algorithm, improved, evaluate, task, achieve, higher, dropout, density, objective, smaller, vector] [point, conference, cloud, computer, directly, uncertainty, pointnet, approach, vision, international, ground, novel, defined, joint]
@InProceedings{Ahmed_2020_CVPR,
  author = {Ahmed, Syeda Mariam and Chew, Chee Meng},
  title = {Density-Based Clustering for 3D Object Detection in Point Clouds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Few-Shot Video Classification via Temporal Alignment
Kaidi Cao, Jingwei Ji, Zhangjie Cao, Chien-Yi Chang, Juan Carlos Niebles


Difficulty in collecting and annotating large-scale video data raises a growing interest in learning models which can recognize novel classes with only a few training examples. In this paper, we propose the Ordered Temporal Alignment Module (OTAM), a novel few-shot learning framework that can learn to classify a previously unseen video. While most previous work neglects long-term temporal ordering information, our proposed model explicitly leverages the temporal ordering information in video data through ordered temporal alignment. This leads to strong data-efficiency for few-shot learning. Concretely, our proposed pipeline learns a deep distance measurement of the query video with respect to novel class proxies over its alignment path. We adopt an episode-based training scheme and directly optimize the few-shot learning objective. We evaluate OTAM on two challenging real-world datasets, Kinetics and Something-Something-V2, and show that our model leads to significant improvement of few-shot video classification over a wide range of competitive baselines and outperforms state-of-the-art benchmarks by a large margin.
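The alignment-path distance can be pictured with a classic dynamic-time-warping recursion over a frame-to-frame distance matrix, as sketched below. The paper actually uses a relaxed, differentiable variant with special boundary handling, so this plain DTW and the cosine-distance toy data are simplifying assumptions.

import numpy as np

def dtw_alignment_cost(dist):
    """Classic dynamic-time-warping cost over a frame-to-frame distance matrix
    dist (T_query x T_support); a monotone alignment path is accumulated with
    dynamic programming."""
    T, S = dist.shape
    acc = np.full((T + 1, S + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                                 acc[i - 1, j],
                                                 acc[i, j - 1])
    return acc[T, S]

# toy example: cosine distances between 4 query and 4 support frame embeddings
rng = np.random.default_rng(0)
q, s = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
q /= np.linalg.norm(q, axis=1, keepdims=True)
s /= np.linalg.norm(s, axis=1, keepdims=True)
print(dtw_alignment_cost(1.0 - q @ s.T))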
[video, temporal, action, ordered, kinetics, embedding, previous, sequence, cmn, work, frame, explicitly, otam, relation, prediction, recognition, visual, time, dtw, outperforms, making] [score, feature, module, ordering, final, propose, framework] [model, query, datasets] [ieee, method, pattern, proposed, figure] [alignment, learn, unseen, representation, introduce, image, perform] [learning, classification, support, training, data, set, function, matrix, arxiv, preprint, deep, meta, network, class, measure, average, path, find, metric, problem, dimension, smoothing, large, labeled, neural, machine, randomly, follow, sample] [distance, conference, computer, novel, vision, matching, approach, international, compare, european]
@InProceedings{Cao_2020_CVPR,
  author = {Cao, Kaidi and Ji, Jingwei and Cao, Zhangjie and Chang, Chien-Yi and Niebles, Juan Carlos},
  title = {Few-Shot Video Classification via Temporal Alignment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Densely Connected Search Space for More Flexible Neural Architecture Search
Jiemin Fang, Yuzhu Sun, Qian Zhang, Yuan Li, Wenyu Liu, Xinggang Wang


Neural architecture search (NAS) has dramatically advanced the development of neural network design. We revisit the search space design in most previous NAS methods and find the number and widths of blocks are set manually. However, block counts and block widths determine the network scale (depth and width) and have a great influence on both the accuracy and the model cost (FLOPs/latency). In this paper, we propose to search block counts and block widths by designing a densely connected search space, i.e., DenseNAS. The new search space is represented as a dense super network, which is built upon our designed routing blocks. In the super network, routing blocks are densely connected and we search for the best path between them to derive the final architecture. We further propose a chained cost estimation algorithm to approximate the model cost during the search. Both the accuracy and model cost are optimized in DenseNAS. For experiments on the MobileNetV2-based search space, DenseNAS achieves 75.3% top-1 accuracy on ImageNet with only 361MB FLOPs and 17.9ms latency on a single TITAN-XP. The larger model searched by DenseNAS achieves 76.1% accuracy with only 479M FLOPs. DenseNAS further promotes the ImageNet classification accuracies of ResNet-18, -34 and -50-B by 1.5%, 0.5% and 0.3% with 200M, 600M and 680M FLOPs reduction respectively. The related code is available at https://github.com/JaminFong/DenseNAS.
[connected, hierarchical, previous, construct] [wei, propose, subsequent, stage, final, denotes, object, assign, semantic] [model, input] [routing, block, super, densely, spatial, output, method, based, skip, tensor, convolutional, kernel, expansion, flexible, cell] [image, representation] [search, architecture, network, neural, space, layer, basic, width, design, operation, path, efficient, candidate, quoc, set, algorithm, searched, number, barret, performance, accuracy, probability, parameter, connection, optimize, chained, searching, relaxation, total, learning, mbconvs, latency, manually, better, imagenet, vijay, andrew, computation, densenas] [cost, depth, estimation, dense, assume, define, derive, differentiable, structure]
@InProceedings{Fang_2020_CVPR,
  author = {Fang, Jiemin and Sun, Yuzhu and Zhang, Qian and Li, Yuan and Liu, Wenyu and Wang, Xinggang},
  title = {Densely Connected Search Space for More Flexible Neural Architecture Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning
Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu


Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach is to learn a joint embedding space to measure cross-modal similarities. However, simple embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels. The model disentangles text into a hierarchical semantic graph including three levels of events, actions, entities, and generates hierarchical textual embeddings via attention-based graph reasoning. Different levels of texts can guide the learning of diverse and hierarchical video representations for cross-modal matching to capture both global and local details. Experimental results on three video-text datasets demonstrate the advantages of our model. Such hierarchical decomposition also enables better generalization across datasets and improves the ability to distinguish fine-grained semantic differences. Code will be released at https://github.com/cshizhe/hgr_v2t.
[video, graph, retrieval, hierarchical, action, node, attention, reasoning, hgr, dataset, encoding, text, entity, embeddings, textual, three, role, embedding, rsum, sentence, medr, mnr, visual, encode, multiple, gcn, word, vatex, complicated, sequence, order] [semantic, global, propose, table, level, achieves, improves, hard] [model, testing, generalization, datasets, type] [event, ieee, figure, pattern, dual, proposed] [utilize, image, ability, corresponding, learn, unseen] [performance, better, binary, set, selection, average, arxiv, preprint, learning, similarity] [local, conference, matching, computer, vision, joint, international, capture, compute, european, demonstrate]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Shizhe and Zhao, Yida and Jin, Qin and Wu, Qi},
  title = {Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Warp to the Future: Joint Forecasting of Features and Feature Motion
Josip Saric, Marin Orsic, Tonci Antunovic, Sacha Vrazic, Sinisa Segvic


We address anticipation of scene development by forecasting semantic segmentation of future frames. Several previous works approach this problem by F2F (feature-to-feature) forecasting where future features are regressed from observed features. Different from previous work, we consider a novel F2M (feature-to-motion) formulation, which performs the forecast by warping observed features according to regressed feature flow. This formulation models a causal relationship between the past and the future, and regularizes inference by reducing dimensionality of the forecasting target. However, the emergence of future scenery that was not visible in observed frames cannot be explained by warping. We propose to address this issue by complementing F2M forecasting with the classic F2F approach. We realize this idea as a multi-head F2MF model built atop shared features. Experiments show that the F2M head prevails in static parts of the scene while the F2F head kicks in to fill in the novel regions. The proposed F2MF model operates in synergy with correlation features and outperforms all previous approaches both in short-term and mid-term forecast on the Cityscapes dataset.
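The F2M branch boils down to sampling the observed feature map along a regressed flow field. A minimal PyTorch sketch of that warping step is below; the helper name, tensor layout, and zero-flow sanity check are illustrative assumptions, and bilinear grid sampling stands in for whatever interpolation the authors use.

import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp a feature map by a per-pixel flow field (the F2M idea: the future
    feature map is sampled from the observed one along the regressed flow).
    feat: (B, C, H, W); flow: (B, 2, H, W) in pixels, ordered (dx, dy)."""
    B, _, H, W = feat.shape
    ys = torch.arange(H, dtype=torch.float32).view(1, 1, H, 1).expand(B, 1, H, W)
    xs = torch.arange(W, dtype=torch.float32).view(1, 1, 1, W).expand(B, 1, H, W)
    grid = torch.cat((xs, ys), dim=1) + flow          # absolute sampling positions
    # normalise to [-1, 1] as expected by grid_sample
    grid_x = 2.0 * grid[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (H - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)      # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

feat = torch.randn(1, 8, 16, 16)
flow = torch.zeros(1, 2, 16, 16)       # zero flow: output equals input
print(torch.allclose(warp_features(feat, flow), feat, atol=1e-5))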
[forecasting, future, observed, forecast, video, previous, prediction, forecasted, work, frame, outperforms, receives, temporal, order, recognition, time] [semantic, feature, segmentation, correlation, module, miou, head, table, regressed, cnn] [model, input, trained] [flow, ieee, warping, convolutional, optical, pattern, motion, pixel, figure, warp, upsampling, proposed, receptive, advantage, based, deformable, classic, clear] [independent, image, compound, shared] [backward, accuracy, forward, training, requires, deep, better, performance, best, consider, inference, learning, achieve, path, small, task] [conference, computer, vision, approach, correspondence, dense, scene, novel, geometric, international]
@InProceedings{Saric_2020_CVPR,
  author = {Saric, Josip and Orsic, Marin and Antunovic, Tonci and Vrazic, Sacha and Segvic, Sinisa},
  title = {Warp to the Future: Joint Forecasting of Features and Feature Motion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Network Adjustment: Channel Search Guided by FLOPs Utilization Ratio
Zhengsu Chen, Jianwei Niu, Lingxi Xie, Xuefeng Liu, Longhui Wei, Qi Tian


Automatically designing computationally efficient neural networks has received much attention in recent years. Existing approaches either utilize network pruning or leverage network architecture search methods. This paper presents a new framework named network adjustment, which considers network accuracy as a function of FLOPs, so that under each network configuration, one can estimate the FLOPs utilization ratio (FUR) for each layer and use it to determine whether to increase or decrease the number of channels on the layer. Note that FUR, like the gradient of a non-linear function, is accurate only in a small neighborhood of the current network. Hence, we design an iterative mechanism so that the initial network undergoes a number of steps, each of which has a small 'adjusting rate' to control the changes to the network. The computational overhead of the entire search process is reasonable, i.e., comparable to that of re-training the final model from scratch. Experiments on standard image classification datasets and a wide range of base networks demonstrate the effectiveness of our approach, which consistently outperforms the pruning counterpart. The code is available at https://github.com/danczs/NetworkAdjustment.
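As a toy reading of the FUR idea: probe how much accuracy a layer would gain per extra FLOP if widened, then grow high-FUR layers and shrink low-FUR ones in small iterative steps. Everything here (function name, per-layer probe numbers, the mean-based decision rule) is a made-up illustration of that mechanism, not the paper's actual procedure.

def flops_utilization_ratio(acc_gain, flops_delta):
    """FLOPs utilization ratio of a layer: accuracy gained per extra FLOP when the
    layer is (virtually) widened.  Layers with high FUR get more channels, layers
    with low FUR get fewer, in small iterative adjustment steps."""
    return acc_gain / flops_delta

# toy adjustment step with made-up per-layer probe results: (accuracy gain, extra FLOPs)
layers = {"conv1": (0.004, 2.0e7), "conv2": (0.001, 6.0e7), "conv3": (0.006, 3.0e7)}
furs = {name: flops_utilization_ratio(*probe) for name, probe in layers.items()}
mean_fur = sum(furs.values()) / len(furs)
adjust = {name: ("+channels" if fur > mean_fur else "-channels")
          for name, fur in furs.items()}
print(adjust)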
[outperforms, current, work] [table, assigned, named] [original, model, trained, drop, decrease, input, iterative] [channel, method, convolutional, residual, adjustment, ieee, output, pattern, figure, conv, based, padding] [image, adjusting] [network, search, searched, layer, neural, pruning, accuracy, training, arxiv, preprint, configuration, number, architecture, deep, fur, efficient, set, utilization, learning, indicates, computational, ratio, process, spatialdropout, validation, performance, knowledge, increase, gradient, efficiency, reduce, imagenet, increased, designing, small, wide, iteration, involves, rate, data, searching, lccl, sfp, fpgm, function, entire, find, amount, equivalent, evaluate, optimization] [computer, conference, vision, structure, approach, compute, differentiable]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Zhengsu and Niu, Jianwei and Xie, Lingxi and Liu, Xuefeng and Wei, Longhui and Tian, Qi},
  title = {Network Adjustment: Channel Search Guided by FLOPs Utilization Ratio},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, Lianli Gao


In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG). Given an untrimmed video and a declarative/interrogative sentence depicting an object, STVG aims to localize the spatio-temporal tube of the queried object. STVG has two challenging settings: (1) We need to localize spatio-temporal object tubes from untrimmed videos, where the object may only exist in a very small segment of the video; (2) We deal with multi-form sentences, including the declarative sentences with explicit objects and interrogative sentences with unknown objects. Existing methods cannot tackle the STVG task due to the ineffective tube pre-generation and the lack of object relationship modeling. Thus, we then propose a novel Spatio-Temporal Graph Reasoning Network (STGRN) for this task. First, we build a spatio-temporal region graph to capture the region relationships with temporal object dynamics, which involves the implicit and explicit spatial subgraphs in each frame and the temporal dynamic subgraph across frames. We then incorporate textual clues into the graph and develop the multi-step cross-modal graph reasoning. Next, we introduce a spatio-temporal localizer with a dynamic selection method to directly retrieve the spatio-temporal tubes without tube pre-generation. Moreover, we contribute a large-scale video grounding dataset VidSTG based on video relation dataset VidOR. The extensive experiments demonstrate the effectiveness of our method.
[temporal, graph, grounding, video, frame, visual, stvg, interrogative, language, relation, sentence, explicit, localizer, dataset, natural, textual, declarative, localize, stgrn, clip, subgraph, referring, boy, attention, rit, viou, spatiotemporal, queried, modeling, retrieve, extract, moment, reasoning, build, vidstg, tem, stpr, wsstg, retrieval, sit, word] [region, object, tube, feature, score, apply, propose, bounding, semantic, localization, faster] [zhou, query, model, expression] [spatial, dynamic, convolution, develop, method, figure, existing] [loss, encoder, unknown, learn, alignment] [set, task, network, selection, performance, layer, select] [ground, capture, novel, implicit, directly, truth, matching]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Zhu and Zhao, Zhou and Zhao, Yang and Wang, Qi and Liu, Huasheng and Gao, Lianli},
  title = {Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-Modal Cross-Domain Moment Alignment Network for Person Search
Ya Jing, Wei Wang, Liang Wang, Tieniu Tan


Text-based person search has drawn increasing attention due to its wide applications in video surveillance. However, most of the existing models depend heavily on paired image-text data, which is very expensive to acquire. Moreover, they always face a huge performance drop when directly applied to new domains. To overcome this problem, we make the first attempt to adapt the model to new target domains in the absence of pairwise labels, which combines the challenges from both cross-modal (text-based) person search and cross-domain person search. Specifically, we propose a moment alignment network (MAN) to solve the cross-modal cross-domain person search task in this paper. The idea is to learn three effective moment alignments including domain alignment (DA), cross-modal alignment (CA) and exemplar alignment (EA), which together can learn domain-invariant and semantically aligned cross-modal representations to improve model generalization. Extensive experiments are conducted on the CUHK Person Description dataset (CUHK-PEDES) and the Richly Annotated Pedestrian dataset (RAP). Experimental results show that our proposed model achieves state-of-the-art performance on five transfer tasks.
[dataset, man, moment, visual, three, textual, shift, text, attention] [propose, semantic, feature, table, achieves, wei, effectiveness, liang] [model, adversarial, query, trained, datasets] [based, proposed, method, adaptive, utilized, called, comparison, existing] [domain, person, target, source, alignment, image, transfer, learn, exemplar, loss, adaptation, unsupervised, train, supervised, gap, perform, corresponding, discriminative, pseudo, utilize, crossmodal, rap] [learning, class, search, data, labeled, learned, network, performance, deep, unlabeled, distribution, task, neural, best, classifier, note, set, training, compared, sample, ranking] [matching, distance]
@InProceedings{Jing_2020_CVPR,
  author = {Jing, Ya and Wang, Wei and Wang, Liang and Tan, Tieniu},
  title = {Cross-Modal Cross-Domain Moment Alignment Network for Person Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Training With Noisy Student Improves ImageNet Classification
Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le


We present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2. To achieve this result, we first train an EfficientNet model on labeled ImageNet images and use it as a teacher to generate pseudo labels on 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. We iterate this process by putting back the student as the teacher. During the generation of the pseudo labels, the teacher is not noised so that the pseudo labels are as accurate as possible. However, during the learning of the student, we inject noise such as dropout, stochastic depth and data augmentation via RandAugment to the student so that the student generalizes better than the teacher.
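A compact, hedged sketch of the self-training loop described above: a clean teacher pseudo-labels the unlabeled pool, a larger noised student is trained on the union, and the student is put back as the teacher. The callables `train` and `predict`, the number of rounds, and the toy majority-vote "model" are placeholders supplied for illustration; noise injection (RandAugment, dropout, stochastic depth) is assumed to live inside `train`.

def noisy_student(train, predict, labeled, unlabeled, rounds=3):
    """Iterative self-training: a teacher (trained without noise) pseudo-labels the
    unlabeled set, then a noised student is trained on labeled + pseudo-labeled
    data and becomes the next teacher."""
    teacher = train(labeled, noised=False)
    for _ in range(rounds):
        pseudo = [(x, predict(teacher, x)) for x, _ in unlabeled]
        teacher = train(labeled + pseudo, noised=True)   # student, then promoted to teacher
    return teacher

# dummy usage: a 'model' is just the majority label it saw during training
train   = lambda data, noised: max(set(y for _, y in data), key=[y for _, y in data].count)
predict = lambda model, x: model
print(noisy_student(train, predict,
                    labeled=[(0, "cat"), (1, "cat"), (2, "dog")],
                    unlabeled=[(3, None), (4, None)]))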
[work, previous, natural, three] [improves, table, weakly, main] [model, noise, robustness, adversarial, trained, iterative, improve, study, difficult] [method, ieee, pattern, figure, resolution, convolutional] [pseudo, train, image, consistency, supervised, learn, loss, generate, cross] [unlabeled, learning, noisystudent, student, labeled, teacher, data, neural, accuracy, arxiv, training, preprint, imagenet, deep, better, size, batch, processing, augmentation, machine, set, quoc, large, dropout, test, larger, stochastic, efficientnet, noised, entropy, knowledge, wsl, rate, soft, algorithm, number, requires, compared, randaugment] [conference, computer, vision, international, depth, david, error]
@InProceedings{Xie_2020_CVPR,
  author = {Xie, Qizhe and Luong, Minh-Thang and Hovy, Eduard and Le, Quoc V.},
  title = {Self-Training With Noisy Student Improves ImageNet Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Longterm Representations for Person Re-Identification Using Radio Signals
Lijie Fan, Tianhong Li, Rongyao Fang, Rumen Hristov, Yuan Yuan, Dina Katabi


Person Re-Identification (ReID) aims to recognize a person-of-interest across different places and times. Existing ReID methods rely on images or videos collected using RGB cameras. They extract appearance features like clothes, shoes, hair, etc. Such features, however, can change drastically from one day to the next, leading to inability to identify people over extended time periods. In this paper, we introduce RF-ReID, a novel approach that harnesses radio frequency (RF) signals for longterm person ReID. RF signals traverse clothes and reflect off the human body; thus they can be used to extract more persistent human-identifying features like body size and shape. We evaluate the performance of RF-ReID on longitudinal datasets that span days and weeks, where the person may wear different clothes across days. Our experiments demonstrate that RF-ReID outperforms state-of-the-art RGB-based ReID approaches for long term person ReID. Our results also reveal two interesting features: First since RF signals work in the presence of occlusions and poor lighting, RF-ReID allows for person ReID in such scenarios. Second, unlike photos and videos which reveal personal and private information, RF signals are more privacy-preserving, and hence can help extend person ReID to privacy-concerned domains, like healthcare.
[people, attention, environment, video, radio, skeleton, dataset, walking, extract, work, prediction, longterm, temporal, time, hierarchical, identifiable, recognition, mingmin, rumen, wear, wearing] [feature, tracklet, tracklets, table, gallery, module, heatmap, aggregate, segment, map] [model, clothes, query, collected, identify, change, poor, personal, identification, datasets, private] [ieee, figure, extraction, pattern, based, signal] [person, reid, loss, discriminator, dina, appearance, learn, image, train, row] [data, network, training, set, learning, performance, triplet, evaluate, deep, task, test, computing, layer] [conference, human, computer, international, rgb, second, vision, body, acm, shape, compare]
@InProceedings{Fan_2020_CVPR,
  author = {Fan, Lijie and Li, Tianhong and Fang, Rongyao and Hristov, Rumen and Yuan, Yuan and Katabi, Dina},
  title = {Learning Longterm Representations for Person Re-Identification Using Radio Signals},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation
Keunhong Park, Arsalan Mousavian, Yu Xiang, Dieter Fox


Current 6D object pose estimation methods usually require a 3D model for each object. These methods also require additional training in order to incorporate new objects. As a result, they are difficult to scale to a large number of objects and cannot be directly applied to unseen objects. We propose a novel framework for 6D pose estimation of unseen objects. We present a network that reconstructs a latent 3D representation of an object using a small number of reference views at inference time. Our network is able to render the latent 3D representation from arbitrary views. Using this neural renderer, we directly optimize for pose given an input image. By training our network with a large number of 3D shapes for reconstruction and rendering, our network generalizes well to unseen objects. We present a new dataset for unseen object pose estimation--MOPED. We evaluate the performance of our method for unseen object pose estimation on MOPED as well as the ModelNet and LINEMOD datasets. Our method performs competitively to supervised methods that are trained on those objects. Code and data will be available at https://keunhong.com/publications/latentfusion/
[recognition, dataset, unit, multiple, modeling, evaluation] [object, add, feature, predicted, mask, table, category] [input, model, query, trained] [reference, method, ieee, color, pattern, pixel, output, high, based, figure] [latent, image, representation, unseen, loss, train, target, translation, perform, arbitrary] [network, neural, training, learning, number, space, small, evaluate, optimize, test, sample, deep, random, large, inference, requires] [pose, estimation, depth, rendering, conference, computer, vision, reconstruction, camera, view, novel, differentiable, volume, require, point, single, directly, voxel, international, shape, rgb, moped, modelnet, rendered, distance, pipeline, well, canonical, rotation, european, dieter, render]
@InProceedings{Park_2020_CVPR,
  author = {Park, Keunhong and Mousavian, Arsalan and Xiang, Yu and Fox, Dieter},
  title = {LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Instance Occlusion for Panoptic Segmentation
Justin Lazarow, Kwonjoon Lee, Kunyu Shi, Zhuowen Tu


Panoptic segmentation requires segments of both "things" (countable object instances) and "stuff" (uncountable and amorphous regions) within a single output. A common approach involves the fusion of instance segmentation (for "things") and semantic segmentation (for "stuff") into a non-overlapping placement of segments, and resolves overlaps. However, instance ordering by detection confidence does not correlate well with natural occlusion relationships. To resolve this issue, we propose a branch that is tasked with modeling how two instance masks should overlap one another as a binary relation. Our method, named OCFusion, is lightweight but particularly effective in the instance fusion process. OCFusion is trained with the ground truth relation derived automatically from the existing dataset annotations. We obtain state-of-the-art results on COCO and show competitive results on the Cityscapes panoptic segmentation benchmark.
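A simplified greedy sketch of the fusion step: instances are laid down in confidence order, but when a new mask overlaps an already placed one, a learned binary relation decides which instance appears on top. The function names, the overlap threshold, and the toy masks are assumptions; `occludes` stands in for the paper's learned mask-pair head.

import numpy as np

def fuse_instances(masks, scores, occludes, overlap_thresh=0.2):
    """Greedy panoptic fusion with a learned occlusion relation.
    masks: list of (H, W) boolean arrays; scores: detection confidences;
    occludes(i, j): True if instance i should cover instance j where they overlap."""
    order = np.argsort(scores)[::-1]
    canvas = np.full(masks[0].shape, -1, dtype=int)       # -1 = unassigned
    for i in order:
        free = canvas == -1
        inter = masks[i] & ~free
        if inter.sum() > overlap_thresh * masks[i].sum():
            j = np.bincount(canvas[inter]).argmax()       # most-overlapped placed instance
            if occludes(i, j):                            # i should appear on top of j
                canvas[masks[i]] = i
                continue
        canvas[masks[i] & free] = i                       # default: only fill free pixels
    return canvas

# toy example: two overlapping squares, the lower-scoring one occludes the other
m1 = np.zeros((6, 6), bool); m1[1:5, 1:5] = True
m2 = np.zeros((6, 6), bool); m2[2:6, 2:6] = True
print(fuse_instances([m1, m2], scores=[0.9, 0.8], occludes=lambda i, j: i == 1))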
[dataset, work, relation, time, tasked] [occlusion, instance, panoptic, segmentation, mask, head, semantic, table, coco, object, detection, ocfusion, assigned, fpn, stuff, pqth, thing, piotr, confidence, feature, box, pqst, branch, backbone, appreciable, pyramid, ordering, overlap, threshold, val, adaptis, kaiming, ross, zhuowen, parsing] [model] [fusion, method, figure, comparison, output, proposed, convolution, deformable, based, resolve, existing] [image, loss, corresponding, common, proposes] [baseline, number, process, learning, binary, classification, top, network, problem, class, architecture, ratio, validation, task, performance, computational] [intersection, single, ground, truth, approach, well, additional, iij, human, computer, scene, handle]
@InProceedings{Lazarow_2020_CVPR,
  author = {Lazarow, Justin and Lee, Kwonjoon and Shi, Kunyu and Tu, Zhuowen},
  title = {Learning Instance Occlusion for Panoptic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Vision-Dialog Navigation by Exploring Cross-Modal Memory
Yi Zhu, Fengda Zhu, Zhaohuan Zhan, Bingqian Lin, Jianbin Jiao, Xiaojun Chang, Xiaodan Liang


Vision-dialog navigation, posed as a new holy-grail task in the vision-language discipline, targets learning an agent endowed with the capability of constant conversation for help with natural language and navigating according to human responses. Besides the common challenges faced in visual language navigation, vision-dialog navigation also requires handling the language intentions of a series of questions about the temporal context from dialogue history and co-reasoning over both dialogs and visual scenes. In this paper, we propose the Cross-modal Memory Network (CMN) for remembering and understanding the rich information relevant to historical navigation actions. Our CMN consists of two memory modules, the language memory module (L-mem) and the visual memory module (V-mem). Specifically, L-mem learns latent relationships between the current language interaction and a dialog history by employing a multi-head attention mechanism. V-mem learns to associate the current visual views and the cross-modal memory about the previous navigation actions. The cross-modal memory is generated via a vision-to-language attention and a language-to-vision attention. Benefiting from the collaborative learning of the L-mem and the V-mem, our CMN is able to explore the memory about the decision making of historical navigation actions, which benefits the decision at the current step. Experiments on the CVDN dataset show that our CMN outperforms the previous state-of-the-art model by a significant margin on both seen and unseen environments.
[visual, language, dialog, navigation, previous, attention, navigator, current, agent, cmn, action, oracle, history, panoramic, goal, vln, step, natural, question, ndh, historical, instruction, reinforcement, dialogue, temporal, understanding, cvdn, evlm, progress, context, interaction, explore, environment, embodied, reasoning, encoding, dctx, shortest, exploring, rich, making, work, dataset] [module, feature, round, propose, val, supervision] [decision, model, help, input] [method, proposed, based, pattern, resolve] [representation, unseen, target, consists, learns] [memory, learning, path, baseline, task, arxiv, preprint, performance, better, mixed, rate, processing, network] [conference, vision, computer, international, scene, human]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Yi and Zhu, Fengda and Zhan, Zhaohuan and Lin, Bingqian and Jiao, Jianbin and Chang, Xiaojun and Liang, Xiaodan},
  title = {Vision-Dialog Navigation by Exploring Cross-Modal Memory},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, Dieter Fox


We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like "Rinse off a mug and place it in the coffee maker." and low-level language instructions like "Walk to the coffee maker on the right." ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision- and-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
[language, alfred, visual, expert, action, agent, navigation, interaction, natural, state, demonstration, goal, potato, place, sponge, environment, progress, embodied, turn, instruction, grounding, question, coffee, planning, hidden, include, walk, lstm, egocentric, sequence, previous, ego, dataset, monitoring, drying, three, attention] [object, mask, table, interactive, semantic, benchmark] [model, success, datasets, example, counter, clean, manipulation, input] [slice, figure, high, based, existing] [unseen, train, corresponding] [task, learning, test, weighted, discrete, validation, performance, evaluate, path, baseline, number] [daniel, human, vision, heat, pick, left, well]
@InProceedings{Shridhar_2020_CVPR,
  author = {Shridhar, Mohit and Thomason, Jesse and Gordon, Daniel and Bisk, Yonatan and Han, Winson and Mottaghi, Roozbeh and Zettlemoyer, Luke and Fox, Dieter},
  title = {ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
NMS by Representative Region: Towards Crowded Pedestrian Detection by Proposal Pairing
Xin Huang, Zheng Ge, Zequn Jie, Osamu Yoshie


Although significant progress has been made in pedestrian detection recently, pedestrian detection in crowded scenes is still challenging. The heavy occlusion between pedestrians imposes great challenges to the standard Non-Maximum Suppression (NMS). A relatively low threshold of intersection over union (IoU) leads to missing highly overlapped pedestrians, while a higher one brings in plenty of false positives. To avoid such a dilemma, this paper proposes a novel Representative Region NMS (R2NMS) approach leveraging the less occluded visible parts, effectively removing the redundant boxes without bringing in many false positives. To acquire the visible parts, a novel Paired-Box Model (PBM) is proposed to simultaneously predict the full and visible boxes of a pedestrian. The full and visible boxes constitute a pair serving as the sample unit of the model, thus guaranteeing a strong correspondence between the two boxes throughout the detection pipeline. Moreover, the pairing allows convenient feature integration of the two boxes, improving performance on both full and visible pedestrian detection tasks. Experiments on the challenging CrowdHuman and CityPersons benchmarks sufficiently validate the effectiveness of the proposed approach on pedestrian detection in the crowded situation.
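The core trick can be sketched as ordinary greedy NMS in which the suppression test uses the IoU of the paired visible boxes while the paired full boxes are what gets returned. The helper names, the 0.5 threshold, and the toy boxes below are illustrative assumptions.

import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def r2_nms(full_boxes, visible_boxes, scores, thresh=0.5):
    """NMS by representative region: suppression is decided on the overlap of the
    (less occluded) visible regions, while the paired full boxes are returned, so
    heavily overlapping full boxes of different pedestrians can both survive."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou(visible_boxes[i], visible_boxes[j]) < thresh for j in keep):
            keep.append(i)
    return [full_boxes[k] for k in keep]

full = [[0, 0, 10, 30], [2, 0, 12, 30]]        # full-box IoU ~0.67: plain NMS drops one
vis  = [[0, 0, 10, 12], [2, 18, 12, 30]]       # but the visible parts do not overlap
print(r2_nms(full, vis, scores=[0.9, 0.8]))    # both full boxes are kept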
[pair, attention, unit, red] [detection, pedestrian, bboxes, iou, pbm, feature, table, proposal, npm, false, crowdhuman, faster, threshold, citypersons, predicted, object, adaptivenms, mask, crowded, bbox, ppfe, occlusion, region, occluded, propose, scored, recall, anchor, overlapped, positive, roi, ross, effectiveness, detected] [model, highly, original, strong, difficult] [ieee, method, proposed, pattern, based, low, figure] [paired, reasonable, missing, image] [performance, standard, baseline, set, validation, large, better, density, rate, evaluate, learning, higher, number, sample] [visible, full, body, conference, computer, vision, international, novel, human, predicts, relative, correspondence]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Xin and Ge, Zheng and Jie, Zequn and Yoshie, Osamu},
  title = {NMS by Representative Region: Towards Crowded Pedestrian Detection by Proposal Pairing},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Visual Commonsense R-CNN
Tan Wang, Jianqiang Huang, Hanwang Zhang, Qianru Sun


We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., using Faster R-CNN), like any other unsupervised feature learning methods (e.g., word2vec), the proxy training objective of VC R-CNN is to predict the contextual objects of a region. However, they are fundamentally different: the prediction of VC R-CNN is by using causal intervention: P(Y|do(X)), while others are by using the conventional likelihood: P(Y|X). This is also the core reason why VC R-CNN can learn "sense-making" knowledge, such as that a chair can be sat on, and not just "common" co-occurrences, such as that a chair is likely to exist if a table is observed. We extensively apply VC R-CNN features in prevailing models of three popular tasks: Image Captioning, VQA, and VCR, and observe consistent performance boosts across them, achieving many new state-of-the-art results.
[visual, causal, context, attention, intervention, commonsense, question, captioning, sense, vqa, dataset, confounder, prediction, obj, confounders, toilet, observational, language, downstream, aoanet, devi, hanwang, reason, three, dog, previous, baseball, answering, word, sink, predicting, vcr, concatenated, reasoning, making] [feature, table, object, faster, roi, region, denotes, detection, ablative, alexander, ski, sota] [model, trained, original, difference, ball] [figure, proposed, based, ncc, snow] [image, common, person, unsupervised, representation, causality] [learning, dictionary, training, task, open, arxiv, preprint, bias, deep, compared, neural, knowledge, note, learned, predictor, set, random, denote, validation, network, observe, probability, test] [leg, chair, single]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Tan and Huang, Jianqiang and Zhang, Hanwang and Sun, Qianru},
  title = {Visual Commonsense R-CNN},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
What Deep CNNs Benefit From Global Covariance Pooling: An Optimization Perspective
Qilong Wang, Li Zhang, Banggu Wu, Dongwei Ren, Peihua Li, Wangmeng Zuo, Qinghua Hu


Recent works have demonstrated that global covariance pooling (GCP) has the ability to improve performance of deep convolutional neural networks (CNNs) on visual classification tasks. Despite considerable advances, the reasons for the effectiveness of GCP on deep CNNs have not been well studied. In this paper, we make an attempt to understand what deep CNNs benefit from GCP from the viewpoint of optimization. Specifically, we explore the effect of GCP on deep CNNs in terms of the Lipschitzness of the optimization loss and the predictiveness of gradients, and show that GCP can make the optimization landscape more smooth and the gradients more predictive. Furthermore, we discuss the connection between GCP and second-order optimization for deep CNNs. More importantly, the above findings can account for several merits of covariance pooling for training deep CNNs that have not been recognized previously or fully explored, including significant acceleration of network convergence (i.e., networks trained with GCP can support rapid decay of learning rates, achieving favorable performance while significantly reducing the number of training epochs), stronger robustness to distorted examples generated by image corruptions and perturbations, and good generalization ability to different vision tasks, e.g., object detection and instance segmentation. We conduct extensive experiments using various deep CNN architectures on diversified tasks, and the results provide strong support to our findings.
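For reference, a minimal numpy sketch of second-order (covariance) pooling with matrix square-root normalisation, which is one common GCP formulation; the eigen-decomposition route, the epsilon regulariser, and the upper-triangle flattening are implementation assumptions rather than the exact pipeline studied in the paper.

import numpy as np

def global_covariance_pooling(feat, eps=1e-5):
    """Second-order pooling of a conv feature map.  feat: (C, H, W).  Returns the
    matrix square root of the channel covariance, flattened to a vector."""
    C = feat.shape[0]
    X = feat.reshape(C, -1)                     # C x N matrix of spatial samples
    X = X - X.mean(axis=1, keepdims=True)
    cov = X @ X.T / X.shape[1] + eps * np.eye(C)
    w, V = np.linalg.eigh(cov)                  # matrix square root via eigen-decomposition
    sqrt_cov = (V * np.sqrt(np.clip(w, 0, None))) @ V.T
    iu = np.triu_indices(C)                     # covariance is symmetric: keep upper triangle
    return sqrt_cov[iu]

feat = np.random.randn(8, 7, 7)                 # e.g. the last conv feature map
print(global_covariance_pooling(feat).shape)    # (36,) = 8 * 9 / 2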
[visual, understand, work, provide, step, explore, bilinear] [faster, table, pooling, cnn, global, object, backbone, detection, mask, effectiveness, instance, including] [gcp, trained, lradju, lrnorm, lipschitzness, robustness, predictiveness, lrf, gcpm, improve, input, stability, gcpd, generalization, model, original, rapid] [cnns, method, convolution, figure, convolutional, indicate] [gap, loss, image, ast, ability, train] [deep, optimization, training, gradient, matrix, convergence, learning, performance, better, epoch, network, normalization, accuracy, covariance, landscape, decay, shufflenet, neural, connection, imagenet, set, classification, support, achieve, setting, acceleration, note, power] [vision, matching]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Qilong and Zhang, Li and Wu, Banggu and Ren, Dongwei and Li, Peihua and Zuo, Wangmeng and Hu, Qinghua},
  title = {What Deep CNNs Benefit From Global Covariance Pooling: An Optimization Perspective},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
EfficientDet: Scalable and Efficient Object Detection
Mingxing Tan, Ruoming Pang, Quoc V. Le


Model efficiency has become increasingly important in computer vision. In this paper, we systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations and EfficientNet backbones, we have developed a new family of object detectors, called EfficientDet, which consistently achieve much better efficiency than prior art across a wide spectrum of resource constraints. In particular, with single-model and single-scale, our EfficientDet-D7 achieves state-of-the-art 52.2 AP on COCO test-dev with 52M parameters and 325B FLOPs, being 4x - 9x smaller and using 13x - 42x fewer FLOPs than previous detectors.
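The fast normalized feature fusion at the heart of BiFPN, O = sum_i(w_i * I_i) / (eps + sum_j w_j) with ReLU-constrained weights, fits in a few lines. The sketch below shows only this fusion step and assumes the input features have already been resized to a common resolution; the surrounding depthwise convolutions and the repeated top-down/bottom-up structure are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    # Weighted feature fusion used in BiFPN: each input feature gets a learnable,
    # non-negative scalar weight, normalized by the sum of all weights.
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.weights)
        w = w / (w.sum() + self.eps)
        return sum(wi * x for wi, x in zip(w, inputs))

fuse = FastNormalizedFusion(num_inputs=2)
out = fuse([torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)])
print(out.shape)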
[previous, prediction, bidirectional, three, node] [feature, object, bifpn, efficientdet, achieves, backbone, fpn, table, retinanet, coco, detection, pyramid, level, propose, detector, panet, faster, ross, kaiming, semantic, piotr, resnet] [model, input, study, improve, developed] [fusion, fast, figure, scale, based, method, comparison, repeated, resolution, conv, proposed, convolutional] [compound, image] [network, scaling, accuracy, better, fewer, normalized, size, efficiency, latency, architecture, softmax, resource, neural, achieve, gpu, weight, quoc, amoebanet, weighted, search, batch, ratio, efficient, design, learning, width, performance, efficientnet, family, wide] [jointly]
@InProceedings{Tan_2020_CVPR,
  author = {Tan, Mingxing and Pang, Ruoming and Le, Quoc V.},
  title = {EfficientDet: Scalable and Efficient Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fast Template Matching and Update for Video Object Tracking and Segmentation
Mingjie Sun, Jimin Xiao, Eng Gee Lim, Bingfeng Zhang, Yao Zhao


In this paper, the main task we aim to tackle is multi-instance semi-supervised video object segmentation across a sequence of frames where only the first-frame box-level ground truth is provided. Detection-based algorithms are widely adopted to handle this task, and the challenges lie in the selection of the matching method to predict the result as well as in deciding whether to update the target template using the newly predicted result. The existing methods, however, make these selections in a rough and inflexible way, compromising their performance. To overcome this limitation, we propose a novel approach which utilizes reinforcement learning to make these two decisions at the same time. Specifically, the reinforcement learning agent learns to decide whether to update the target template according to the quality of the predicted result. The choice of the matching method is determined at the same time, based on the action history of the reinforcement learning agent. Experiments show that our method is almost 10 times faster than the previous state-of-the-art method with even higher accuracy (region similarity of 69.1% on the DAVIS 2017 dataset).
[video, agent, dataset, frame, step, action, evaluation, speed, current, previous, decide, reinforcement, three, observed, state, reward, sequence] [object, segmentation, template, vos, davis, predicted, including, region, bounding, score, tracking, box, final, vot, mask, instance, adopt, faster, rame, table, achieves, adopting, segtrack, feature, pmask, boltvos, luc, van] [trained, model, quality, preliminary] [method, result, figure, fast, running, adopted, existing, proposed] [target, generate, image, third, appearance] [update, accuracy, network, learning, training, similarity, simple, candidate, indicates, process, task, higher, find, evaluate, performance, online, set] [matching, second]
@InProceedings{Sun_2020_CVPR,
  author = {Sun, Mingjie and Xiao, Jimin and Lim, Eng Gee and Zhang, Bingfeng and Zhao, Yao},
  title = {Fast Template Matching and Update for Video Object Tracking and Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Counterfactual Samples Synthesizing for Robust Visual Question Answering
Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, Shiliang Pu, Yueting Zhuang


Although Visual Question Answering (VQA) has realized impressive progress over the last few years, today's VQA models tend to capture superficial linguistic correlations in the train set and fail to generalize to the test set with different QA distributions. To reduce the language biases, several recent works introduce an auxiliary question-only model to regularize the training of the targeted VQA model, and achieve dominating performance on VQA-CP. However, owing to the complexity of their design, current methods are unable to equip the ensemble-based models with two indispensable characteristics of an ideal VQA model: 1) visual-explainable: the model should rely on the right visual regions when making decisions; 2) question-sensitive: the model should be sensitive to the linguistic variations in the question. To this end, we propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme. The CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions, and assigning different ground-truth answers. After training with the complementary samples (i.e., the original and generated samples), the VQA models are forced to focus on all critical objects and words, which significantly improves both visual-explainable and question-sensitive abilities. In return, the performance of these models is further boosted. Extensive ablations have shown the effectiveness of CSS. Particularly, by building on top of the model LMH, we achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
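A hypothetical sketch of the question-side synthesis (Q-CSS): mask the most influential question words and pair the masked question with a different ground-truth answer. The importance scores and the answer re-assignment below are placeholders; in the paper the word/object contributions come from the trained model itself and the new ground-truth answers are chosen by a dynamic assignment mechanism.

# Illustrative helper: 'importance' is a per-token contribution score supplied
# by the caller, and the new answer is simply any answer different from the
# original ground truth (both are simplifications of the paper's procedure).
def synthesize_counterfactual(question_tokens, importance, gt_answer, all_answers, k=1):
    order = sorted(range(len(question_tokens)), key=lambda i: -importance[i])
    critical = set(order[:k])
    masked = ["[MASK]" if i in critical else t for i, t in enumerate(question_tokens)]
    new_answer = next(a for a in all_answers if a != gt_answer)
    return masked, new_answer

q = ["what", "color", "is", "the", "ball"]
print(synthesize_counterfactual(q, [0.1, 0.9, 0.05, 0.05, 0.6], "red", ["red", "not red"]))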
[vqa, question, visual, critical, answer, lmh, answering, attention, language, word, pvqa, updn, contribution, devi, assigning, pair, long, linguistic, incorporated, dataset, fvqa, dhruv, hanwang, making, natural, blue, red] [object, table, effectiveness, predicted, improves, score, highest, achieves] [model, counterfactual, improve, original, influence, adversarial, drop] [figure, color, dynamic, green, based, quantitative] [image, ability, synthesizing, introduce, loss, jun, train, qualitative] [training, performance, set, test, sample, algorithm, compared, reduce, observe, replacing, design, function, baseline, best, selection, size, metric] [ground, human, local, truth]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Long and Yan, Xin and Xiao, Jun and Zhang, Hanwang and Pu, Shiliang and Zhuang, Yueting},
  title = {Counterfactual Samples Synthesizing for Robust Visual Question Answering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Local-Global Video-Text Interactions for Temporal Grounding
Jonghwan Mun, Minsu Cho, Bohyung Han


This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query. We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query, which correspond to important semantic entities described in the query (e.g., actors, objects, and actions), and reflect bi-modal interactions between the linguistic features of the query and the visual features of the video at multiple levels. The proposed method effectively predicts the target time interval by exploiting contextual information from local to global during bi-modal interactions. Through in-depth ablation studies, we find that incorporating both local and global context in video and text interactions is crucial to accurate grounding. Our experiments show that the proposed method outperforms the state of the art on the Charades-STA and ActivityNet Captions datasets by large margins, 7.44% and 4.61% points at Recall@tIoU=0.5, respectively.
[temporal, attention, video, context, phrase, modeling, time, text, embedding, action, activitynet, extract, modality, ltag, lgi, grounding, multiple, understanding, interaction, nlblock, individual, sequential, ldqa, three, sentence, activity, visual, ablr, word, sqan, language, outperforms, identifying] [semantic, segment, global, feature, table, localization, regression, detection, location, attentive, ablation, main, proposal] [query, model, effective] [fusion, proposed, method, based, resblock, guidance, figure, comparison] [loss, perform, distinct, masked, corresponding, target] [performance, network, note, algorithm, number, set, learning, best, hadamard, learned, vector, product, matrix] [local, interval, position, matching]
@InProceedings{Mun_2020_CVPR,
  author = {Mun, Jonghwan and Cho, Minsu and Han, Bohyung},
  title = {Local-Global Video-Text Interactions for Temporal Grounding},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Set-Constrained Viterbi for Set-Supervised Action Segmentation
Jun Li, Sinisa Todorovic


This paper is about weakly supervised action segmentation, where the ground truth specifies only a set of actions present in a training video, but not their true temporal ordering. Prior work typically uses a classifier that independently labels video frames for generating the pseudo ground truth, and multiple instance learning for training the classifier. We extend this framework by specifying an HMM, which accounts for co-occurrences of action classes and their temporal lengths, and by explicitly training the HMM on a Viterbi-based loss. Our first contribution is the formulation of a new set-constrained Viterbi algorithm (SCV). Given a video, the SCV generates the MAP action segmentation that satisfies the ground truth. This prediction is used as a framewise pseudo ground truth in our HMM training. Our second contribution in training is a new regularization of feature affinities between training videos that share the same action classes. Evaluation on action segmentation and alignment on the Breakfast, MPII Cooking2, Hollywood Extended datasets demonstrates our significant performance improvement for the two tasks over prior work.
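For orientation, plain (unconstrained) Viterbi decoding over framewise action log-probabilities is sketched below; the paper's set-constrained Viterbi additionally forces the decoded segmentation to use exactly the action set given as weak supervision, a constraint this sketch does not enforce. The toy scores and uniform transition matrix are invented.

import numpy as np

def viterbi(log_probs, log_trans):
    # log_probs: (T, K) framewise class log-probabilities; log_trans: (K, K).
    T, K = log_probs.shape
    dp = np.full((T, K), -np.inf)
    ptr = np.zeros((T, K), dtype=int)
    dp[0] = log_probs[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + log_trans      # rows: previous state, cols: next state
        ptr[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_probs[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(ptr[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
print(viterbi(np.log(rng.dirichlet(np.ones(4), size=10)), np.log(np.full((4, 4), 0.25))))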
[action, scv, video, temporal, hmm, frame, static, work, length, sequence, evaluation, hollywood, cooking, state, step, three, breakfast, outperforms, framewise] [map, segmentation, weakly, ordering, predicted, supervision, feature, fully, score, art, table] [model, true, example] [dynamic, ieee, prior, pattern, figure] [loss, supervised, pseudo, alignment, shared, legal, consists, generating, extended, train] [training, class, set, test, viterbi, algorithm, regularization, learning, inference, carlo, neural, complexity, sampling, number, monte, problem, network, label, total, consider, maximum] [ground, truth, conference, computer, vision, second, estimating, international, approach, estimate, mpii]
@InProceedings{Li_2020_CVPR,
  author = {Li, Jun and Todorovic, Sinisa},
  title = {Set-Constrained Viterbi for Set-Supervised Action Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Probabilistic Video Prediction From Noisy Data With a Posterior Confidence
Yunbo Wang, Jiajun Wu, Mingsheng Long, Joshua B. Tenenbaum


We study a new research problem of probabilistic future frames prediction from a sequence of noisy inputs, which is useful because it is difficult to guarantee the quality of input frames in practical spatiotemporal prediction applications. It is also challenging because it involves two levels of uncertainty: the perceptual uncertainty from noisy observations and the dynamics uncertainty in forward modeling. In this paper, we propose to tackle this problem with an end-to-end trainable model named Bayesian Predictive Network (BP-Net). Unlike previous work in stochastic video prediction that assumes spatiotemporal coherence and therefore fails to deal with perceptual uncertainty, BP-Net models both levels of uncertainty in an integrated framework. Furthermore, unlike previous work that can only provide unsorted estimations of future frames, BP-Net leverages a differentiable sequential importance sampling (SIS) approach to make future predictions based on the inference of underlying physical states, thereby providing sorted prediction candidates in accordance with the SIS importance weights, i.e., the confidences. Our experiment results demonstrate that BP-Net remarkably outperforms existing approaches on predicting future frames from noisy data.
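A minimal sequential importance sampling (SIS) loop on a toy 1-D state-space model, only to illustrate how candidate predictions can be ranked by their importance weights (the "confidences"); BP-Net embeds a differentiable variant of this inside a neural video predictor, and all dynamics and noise levels below are invented.

import numpy as np

rng = np.random.default_rng(0)
T, N = 20, 200                                       # time steps, particles
true_x = np.cumsum(rng.normal(size=T))               # latent trajectory
obs = true_x + rng.normal(scale=0.5, size=T)         # noisy observations

particles = rng.normal(size=N)
log_w = np.zeros(N)
for t in range(T):
    particles = particles + rng.normal(size=N)                 # propose with the transition model
    log_w += -0.5 * ((obs[t] - particles) / 0.5) ** 2          # reweight by observation likelihood

w = np.exp(log_w - log_w.max())
w /= w.sum()
order = np.argsort(-w)                                         # candidates sorted by confidence
print("top-3 state estimates:", particles[order[:3]], "weights:", w[order[:3]])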
[prediction, video, future, time, state, sequence, frame, previous, sequential, observation, dataset, spatiotemporal, recurrent, work, predicting, temporal, lstm, moving] [module, predicted, table, highest, confidence, cnn] [model, input, adversarial, quality, noise, worst, trained, mnist, physical] [particle, noisy, perceptual, figure, ssim, based, denoising, filtering, prior, existing, proposed, convolutional, method, mse, phase, gaussian, integrated] [generated, variational, generate, content] [predictive, bayesian, learning, training, weight, stochastic, deep, deterministic, problem, network, inference, best, sampling, compared, note, random, probabilistic, better, test, find, baseline, posterior, approximate, algorithm, neural, data] [uncertainty, measurement, differentiable, ground, well, truth]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yunbo and Wu, Jiajun and Long, Mingsheng and Tenenbaum, Joshua B.},
  title = {Probabilistic Video Prediction From Noisy Data With a Posterior Confidence},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Beyond Short-Term Snippet: Video Relation Detection With Spatio-Temporal Global Context
Chenchen Liu, Yang Jin, Kehan Xu, Guoqiang Gong, Yadong Mu


Video visual relation detection (VidVRD) aims to describe all interacting objects in a video. Different from relationships in static images, videos contain an additional temporal channel. A majority of existing works divide a video into short segments, predict relationships in each segment, and merge them. Such methods cannot capture relations involving long motions. Predicting the same relationship across neighboring video segments is also inefficient. To address these issues, this work proposes a novel sliding-window scheme to simultaneously predict short-term and long-term relationships. We run windows with different kernel sizes on object tracklets to generate sub-tracklet proposals with different durations, while the computational load is similar to that in segment-based methods. To fully utilize spatial and temporal information in videos, we construct one spatial and one temporal graph and employ a Graph Convolutional Network to generate contextual embeddings for tracklet proposal compatibility evaluation. We only predict relationships on highly-compatible proposal pairs. Our method achieves state-of-the-art performance on both the ImageNet-VidVRD and VidOR datasets across multiple tasks. Especially for ImageNet-VidVRD, we obtain an average improvement of 3% (R@50 from 8.07% to 11.21%) under all evaluation metrics.
[relation, video, graph, visual, temporal, relationship, pair, gcn, vidvrd, three, dataset, tagging, vidor, embedding, predict, predicting, multiple, evaluation, short, length, prediction, static] [object, detection, proposal, tracklet, stage, feature, module, detected, contextual, tracklets, ablation, association, tracking, bounding, extractor, table, category, pooling, sliding, fully] [subject, compatibility, model, detecting] [spatial, method, motion, convolutional, figure, proposed, achieved, window, convolution, ieee] [image, generate] [neural, deep, performance, network, classification, layer, better, number, learning, greedy, evaluate, observe, set] [relative, approach, varying, acm, pipeline, scene, local, compare]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Chenchen and Jin, Yang and Xu, Kehan and Gong, Guoqiang and Mu, Yadong},
  title = {Beyond Short-Term Snippet: Video Relation Detection With Spatio-Temporal Global Context},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Visual Grounding in Video for Unsupervised Word Translation
Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, Joao Carreira, Phil Blunsom, Andrew Zisserman


There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in the native language. Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages. We apply these methods to translate words from English to French, Korean, and Japanese -- all without any parallel corpora and simply by watching many videos of people speaking while doing things.
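The text-side refinement that the shared embedding initializes is essentially the orthogonal Procrustes mapping between monolingual embedding spaces. The NumPy sketch below uses synthetic embeddings and a given seed dictionary in place of the video-grounded shared space that the paper learns; all sizes and the noise level are invented.

import numpy as np

rng = np.random.default_rng(0)
d, n_seed, vocab = 50, 200, 1000
X = rng.normal(size=(vocab, d))                        # source-language embeddings
W_true, _ = np.linalg.qr(rng.normal(size=(d, d)))      # hidden orthogonal map
Y = X @ W_true + 0.01 * rng.normal(size=(vocab, d))    # target-language embeddings

# Orthogonal Procrustes on the seed pairs: minimize ||X W - Y|| with W orthogonal.
U, _, Vt = np.linalg.svd(X[:n_seed].T @ Y[:n_seed])
W = U @ Vt

def translate(i):
    # Nearest neighbour of the mapped source word in the target space.
    mapped = X[i] @ W
    sims = Y @ mapped / (np.linalg.norm(Y, axis=1) * np.linalg.norm(mapped) + 1e-8)
    return int(np.argmax(sims))

# Translation accuracy on held-out words (ground-truth pairing is i -> i here).
print(sum(translate(i) == i for i in range(200, 300)) / 100)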
[word, language, visual, video, english, french, embedding, muve, embeddings, instructional, bilingual, multilingual, muse, grounding, text, adaptlayer, vocabulary, multiple, dataset, retrieval, work, vecmap, narrated, watching, question, three, multimodal, describe] [table, map, propose] [model, robust, datasets, improve, create] [method, parallel, figure, proposed, visually] [unsupervised, translation, mapping, shared, translate, loss, dissimilarity, common, learn, paired, domain, image, encoder, representation, supervised, list, pretrained] [training, learning, linear, dictionary, performance, space, base, observe, size, simple, report, machine, orthogonal, random, sharing, layer, set, task, large, amount, batch, matrix, better] [joint, approach, vision, second, procrustes]
@InProceedings{Sigurdsson_2020_CVPR,
  author = {Sigurdsson, Gunnar A. and Alayrac, Jean-Baptiste and Nematzadeh, Aida and Smaira, Lucas and Malinowski, Mateusz and Carreira, Joao and Blunsom, Phil and Zisserman, Andrew},
  title = {Visual Grounding in Video for Unsupervised Word Translation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Two Causal Principles for Improving Visual Dialog
Jiaxin Qi, Yulei Niu, Jianqiang Huang, Hanwang Zhang


This paper unravels the design tricks adopted by us, the champion team MReaL-BDAI, for Visual Dialog Challenge 2019: two causal principles for improving Visual Dialog (VisDial). By "improving", we mean that they can promote almost every existing VisDial model to state-of-the-art performance on the leaderboard. Such a major improvement is only due to our careful inspection of the causality behind the model and data, finding that the community has overlooked two causalities in VisDial. Intuitively, Principle 1 suggests: we should remove the direct input of the dialog history to the answer model, otherwise a harmful shortcut bias will be introduced; Principle 2 says: there is an unobserved confounder for history, question, and answer, leading to spurious correlations in the training data. In particular, to remove the confounder suggested by Principle 2, we propose several causal intervention algorithms, which make the training fundamentally different from traditional likelihood estimation. Note that the two principles are model-agnostic, so they are applicable to any VisDial model.
[visual, causal, answer, visdial, dialog, history, question, attention, graph, relevance, confounder, hidden, word, hanwang, dataset, ndcg, illustrated, language, wearing, hciae, coatt, rva, community, vqa, natural, length, embedding, node, three] [score, challenge, denotes, table, feature, framework] [model, ranked, type, input] [figure, pattern, ieee, proposed, existing, applying, based, likelihood] [loss, image, generation, preference, introduce, common] [baseline, set, principle, learning, neural, note, performance, bias, training, candidate, processing, machine, dictionary, rank, ranking, validation, top, sample, approximation, inference, average, process, denote, implementation, shortcut, applied] [conference, computer, vision, direct, international, dense, unobserved, supplementary]
@InProceedings{Qi_2020_CVPR,
  author = {Qi, Jiaxin and Niu, Yulei and Huang, Jianqiang and Zhang, Hanwang},
  title = {Two Causal Principles for Improving Visual Dialog},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Spatio-Temporal Graph for Video Captioning With Knowledge Distillation
Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, Juan Carlos Niebles


Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions using either scene-level or object-level information but without explicitly modeling object interactions. Thus, they often fail to make visually grounded predictions, and are sensitive to spurious correlations. In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time. Our model builds interpretable links and is able to provide explicit visual grounding. To avoid unstable performance caused by the variable number of objects, we further propose an object-aware knowledge distillation mechanism, in which local object information is used to regularize global scene features. We demonstrate the efficacy of our approach through extensive experiments on two benchmarks, showing our approach yields competitive performance with interpretable predictions.
[video, graph, temporal, language, captioning, visual, time, frame, action, explicitly, msvd, previous, attention, order, transformer, evaluation, extract, gspace, gtime, interaction, modeling, work, sequence, understanding, privileged, hou, decoder, description, text] [object, branch, feature, wang, propose, global, pooling, table, wei, focus] [model] [spatial, ieee, proposed, pattern, method, convolutional] [perform, representation, image, cat, generate, interpretable] [knowledge, distillation, network, arxiv, preprint, learning, neural, performance, number, set, deep, note, training, follow, space, classification, test, better, standard, problem] [scene, conference, computer, vision, full, directly, approach, capture, international, well, compare, transformation, human, local]
@InProceedings{Pan_2020_CVPR,
  author = {Pan, Boxiao and Cai, Haoye and Huang, De-An and Lee, Kuan-Hui and Gaidon, Adrien and Adeli, Ehsan and Niebles, Juan Carlos},
  title = {Spatio-Temporal Graph for Video Captioning With Knowledge Distillation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Real-Time Cross-Modality Correlation Filtering Method for Referring Expression Comprehension
Yue Liao, Si Liu, Guanbin Li, Fei Wang, Yanjie Chen, Chen Qian, Bo Li


Referring expression comprehension aims to localize the object instance described by a natural language expression. Current referring expression methods have achieved good performance. However, none of them is able to achieve real-time inference without an accuracy drop. The reason for the relatively slow inference speed is that these methods artificially split referring expression comprehension into two sequential stages, proposal generation and proposal ranking, which does not conform well to human cognition. To this end, we propose a novel Realtime Cross-modality Correlation Filtering method (RCCF). RCCF reformulates referring expression comprehension as a correlation filtering process. The expression is first mapped from the language domain to the visual domain and then treated as a template (kernel) to perform correlation filtering on the image feature map. The peak value in the correlation heatmap indicates the center point of the target box. In addition, RCCF also regresses a 2-D object size and a 2-D offset. The center point coordinates, object size and center point offset together form the target bounding box. Our method runs at 40 FPS while achieving leading performance on the RefClef, RefCOCO, RefCOCO+ and RefCOCOg benchmarks. On the challenging RefClef dataset, our method almost doubles the state-of-the-art performance (from 34.70% to 63.79%). We hope this work will draw more attention and studies to the new cross-modality correlation filtering framework as well as the one-stage framework for referring expression comprehension.
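A toy version of the correlation-filtering step: a (hypothetical) linear layer turns the expression embedding into a 1x1 correlation kernel, the kernel is correlated with the image feature map, and the response peak gives the referred object's centre. Feature sizes are invented, and the size and offset regression branches are omitted.

import torch
import torch.nn.functional as F

C, H, W = 32, 40, 40
img_feat = torch.randn(1, C, H, W)                   # image feature map
expr_feat = torch.randn(1, 256)                      # expression embedding
to_kernel = torch.nn.Linear(256, C)                  # language -> correlation kernel (illustrative)

kernel = to_kernel(expr_feat).view(1, C, 1, 1)       # expression as a 1x1 kernel
heatmap = F.conv2d(img_feat, kernel)                 # (1, 1, H, W) response map
peak = int(heatmap.flatten().argmax())
cy, cx = divmod(peak, W)
print("predicted centre:", cy, cx)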
[referring, visual, three, language, comprehension, refcoco, mattnet, attention, localize, described, refclef, refcocog, natural, context, grounding, speed, time, prediction] [object, correlation, feature, center, offset, rccf, regression, table, detection, region, heatmap, map, level, framework, proposal, coco, ablation, template, peak, highest, module, location] [expression, model, input] [method, filtering, figure, proposed, output, spatial, convolutional, achieved, based, existing, kernel] [image, target, row, encoder, perform, corresponding, loss, generate, train, generation] [size, performance, filter, inference, set, layer, setting, network, deep, imagenet, achieve, precision, function, comparing] [point, single, second, well, matching, error, match, local]
@InProceedings{Liao_2020_CVPR,
  author = {Liao, Yue and Liu, Si and Li, Guanbin and Wang, Fei and Chen, Yanjie and Qian, Chen and Li, Bo},
  title = {A Real-Time Cross-Modality Correlation Filtering Method for Referring Expression Comprehension},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Better Captioning With Sequence-Level Exploration
Jia Chen, Qin Jin


The sequence-level learning objective has been widely used in captioning tasks to achieve state-of-the-art performance for many models. In this objective, the model is trained by the reward on the quality of its generated captions (sequence-level). In this work, we show the limitation of the current sequence-level learning objective for captioning tasks from both theory and empirical results. In theory, we show that the current objective is equivalent to only optimizing the precision side of the caption set generated by the model and therefore overlooks the recall side. Empirical results show that models trained with this objective tend to get lower scores on the recall side. We propose to add a sequence-level exploration term to the current objective to boost recall. It guides the model to explore more plausible captions during training. In this way, the proposed objective takes both the precision and recall sides of generated captions into account. Experiments show the effectiveness of the proposed method on both video and image captioning datasets.
[captioning, caption, current, man, sll, standing, attention, video, exploration, sentence, toilet, step, cider, evaluation, word, correct, sitting, decoding, explore, sequence, visual, couple, ocean, reinforcement] [recall, side, groundtruth, table, predicted, score, semantic, improves, level, propose] [model, input, trained, original, difference] [proposed, ieee, figure, pattern, method, output, based] [image, diversity, loss, generated, generalized] [objective, learning, precision, set, function, training, neural, performance, standard, sample, empirical, sampling, sampled, membership, better, task, network, log, architecture, theoretical, proxy, regularization, number, width, calculate, gradient, equivalent, test] [term, conference, computer, vision, measurement, compare, defined, relaxed]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Jia and Jin, Qin},
  title = {Better Captioning With Sequence-Level Exploration},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Violin: A Large-Scale Dataset for Video-and-Language Inference
Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, Jingjing Liu


We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels. In order to address our new multimodal inference task, a model is required to possess sophisticated reasoning skills, from surface-level grounding (e.g., identifying objects and characters in the video) to in-depth commonsense reasoning (e.g., inferring causal relations of events in the video). We present a detailed analysis of the dataset and an extensive evaluation over many strong baselines, providing valuable insights on the challenges of this new task.
[video, visual, statement, dataset, language, reasoning, natural, question, bert, understanding, iolin, text, multimodal, man, movie, woman, clip, inferring, temporal, attention, glove, three, compositional, word, licheng, answering, entailment, tvqa, phone, explicit, sequence, stmt, jingjing, rich, textual, upset, provide, subtitle] [positive, table, module, challenging, global, det, benchmark] [model, collected, adversarial, adding, datasets, input] [figure, proposed, analysis, img, fusion] [image, zhe, encoder, content, trevor, aligned] [arxiv, preprint, task, negative, inference, learning, test, accuracy, requires, select, baseline, popular, required, bias, performance, neural, network] [human, write, joint, hypothesis, require]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Jingzhou and Chen, Wenhu and Cheng, Yu and Gan, Zhe and Yu, Licheng and Yang, Yiming and Liu, Jingjing},
  title = {Violin: A Large-Scale Dataset for Video-and-Language Inference},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
RiFeGAN: Rich Feature Generation for Text-to-Image Synthesis From Prior Knowledge
Jun Cheng, Fuxiang Wu, Yanling Tian, Lei Wang, Dapeng Tao


Text-to-image synthesis is a challenging task that generates realistic images from a textual sequence, which usually contains limited information compared with the corresponding image and is therefore ambiguous and abstract. The limited textual information describes a scene only partly, so the generator must complete the remaining details implicitly, which often leads to low-quality images. To address this problem, we propose a novel rich-feature-generating text-to-image synthesis method, called RiFeGAN, that enriches the given description. To provide additional visual details and avoid conflicts, RiFeGAN exploits an attention-based caption matching model to select and refine compatible candidate captions from prior knowledge. Given the enriched captions, RiFeGAN uses self-attentional embedding mixtures to extract features across them effectively and to further handle the diverging features. It then exploits multi-caption attentional generative adversarial networks to synthesize images from those features. Experiments conducted on widely-used datasets show that the models can effectively generate images from enriched captions and improve the results significantly.
[caption, yellow, attentional, oursa, grey, purple, embedding, text, brown, pink, long, prominent, visual, red, crown, ourf, extract, exploit, belly, sentence, orange, saems, attention, embeddings, txt, conflicting, word, attngan] [score, feature, propose, semantic] [black, white, model, item, adversarial, compatible, improve, datasets] [ieee, figure, light, prior, gray, dark, pattern, convolutional] [bird, image, flower, synthesize, synthesized, real, inception, generative, generate, loss, generated, synthesis, generating, generation, corresponding, generator, retrieved, introduce, gan, gans] [small, training, neural, knowledge, large, problem, enriched, arxiv, preprint, base, processing] [matching, conference, computer, limited]
@InProceedings{Cheng_2020_CVPR,
  author = {Cheng, Jun and Wu, Fuxiang and Tian, Yanling and Wang, Lei and Tao, Dapeng},
  title = {RiFeGAN: Rich Feature Generation for Text-to-Image Synthesis From Prior Knowledge},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Graph Structured Network for Image-Text Matching
Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, Yongdong Zhang


Image-text matching has received growing interest since it bridges vision and language. The key challenge lies in how to learn the correspondence between image and text. Existing works learn coarse correspondence based on object co-occurrence statistics, while failing to learn fine-grained phrase correspondence. In this paper, we present a novel Graph Structured Matching Network (GSMN) to learn fine-grained correspondence. GSMN explicitly models objects, relations and attributes as a structured phrase, which not only allows learning the correspondence of objects, relations and attributes separately, but also benefits learning the fine-grained correspondence of the structured phrase. This is achieved by node-level matching and structure-level matching. Node-level matching associates each node with its relevant nodes from the other modality, where a node can be an object, relation or attribute. The associated nodes then jointly infer fine-grained correspondence by fusing neighborhood associations at structure-level matching. Comprehensive experiments show that GSMN outperforms state-of-the-art methods on benchmarks, with relative Recall@1 improvements of nearly 7% and 2% on Flickr30K and MSCOCO, respectively. Code will be released at: https://github.com/CrossmodalGroup/GSMN.
[graph, node, relation, visual, textual, text, phrase, structured, infer, word, gsmn, attention, associated, multimodal, correlate, dog, explicitly, relevant, brown, represent, gru, pfan] [object, global, faster, feature, semantic, salient, edge, focus, region, propose, matched, table] [model] [figure, ieee, pattern, proposed, based, performs, method] [image, learn, attribute, representation, corresponding, common, specific] [learning, network, similarity, weight, set, matrix, note, fixed, learned, greatly, indicates, better] [correspondence, matching, vision, conference, computer, approach, jointly, local, sparse, relative, scan, scene, compute, dense, coarse]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Chunxiao and Mao, Zhendong and Zhang, Tianzhu and Xie, Hongtao and Wang, Bin and Zhang, Yongdong},
  title = {Graph Structured Network for Image-Text Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Straight to the Point: Fast-Forwarding Videos via Reinforcement Learning Using Textual Data
Washington Ramos, Michel Silva, Edson Araujo, Leandro Soriano Marcolino, Erickson Nascimento


The rapid increase in the amount of published visual data and the limited time of users bring the demand for processing untrimmed videos to produce shorter versions that convey the same information. Despite the remarkable progress that has been made by summarization methods, most of them can only select a few frames or skims, which creates visual gaps and breaks the video context. In this paper, we present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos. Our approach can adaptively select frames that are not relevant to convey the information without creating gaps in the final video. Our agent is textually and visually oriented to select which frames to remove to shrink the input video. Additionally, we propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space to represent both textual and visual data. Our experiments show that our method achieves the best performance in terms of F1 Score and coverage at the video segment level.
[video, agent, embedding, visual, reinforcement, textual, frame, attention, summarization, recognition, instructional, action, state, reward, hidden, represent, recipe, temporal, relevant, text, composed, current, hij, step, work, ffnet, vdan, egocentric, long, cooking] [semantic, score, segment, shorter] [create, input, creates, encoded, trained, methodology, highly] [method, ieee, based, pattern, figure, proposed, june, observes] [document, image, train, user, produce, creating] [learning, network, space, set, training, best, rate, select, performance, neural, data, task, deep, sampling, note, vector, higher, test, accelerate] [vision, computer, approach, coverage, conference, novel, international]
@InProceedings{Ramos_2020_CVPR,
  author = {Ramos, Washington and Silva, Michel and Araujo, Edson and Marcolino, Leandro Soriano and Nascimento, Erickson},
  title = {Straight to the Point: Fast-Forwarding Videos via Reinforcement Learning Using Textual Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Modality Cross Attention Network for Image and Sentence Matching
Xi Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, Feng Wu


The key to image and sentence matching is to accurately measure the visual-semantic similarity between an image and a sentence. However, most existing methods make use of only the intra-modality relationship within each modality or the inter-modality relationship between image regions and sentence words for the cross-modal matching task. Different from them, in this work, we propose a novel Multi-Modality Cross Attention (MMCA) Network for image and sentence matching by jointly modeling the intra-modality and inter-modality relationships of image regions and sentence words in a unified deep model. In the proposed MMCA, we design a novel cross-attention mechanism, which is able to exploit not only the intra-modality relationship within each modality, but also the inter-modality relationship between image regions and sentence words to complement and enhance each other for image and sentence matching. Extensive experimental results on two standard benchmarks, Flickr30K and MS-COCO, demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.
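A bare-bones sketch of one cross-modal attention pass (regions attending to words and vice versa); the actual MMCA stacks Transformer-style self- and cross-attention blocks with learned projections, and the pooling into a single similarity score below is only illustrative.

import torch
import torch.nn.functional as F

def cross_attention(query, context):
    # Scaled dot-product attention where one modality attends to the other.
    d = query.size(-1)
    attn = torch.softmax(query @ context.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ context

regions = torch.randn(1, 36, 512)    # image region features (e.g. from Faster R-CNN)
words = torch.randn(1, 12, 512)      # word features
region_ctx = cross_attention(regions, words)   # regions attending to words
word_ctx = cross_attention(words, regions)     # words attending to regions
sim = F.cosine_similarity(region_ctx.mean(1), word_ctx.mean(1))
print(region_ctx.shape, word_ctx.shape, sim)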
[sentence, attention, visual, retrieval, relationship, transformer, textual, embedding, language, word, embeddings, unit, man, extract, dashed, exploit, question, work, mechanism, hidden, blue, water, modeling] [module, table, including, unified, global, feature, region, building, key, matched] [model, testing, query, experimental] [proposed, figure, based, method, pattern, ieee, green, convolutional, output, existing] [image, cross, representation, loss, yan] [deep, arxiv, preprint, neural, set, learning, network, similarity, layer, machine, triplet, training, impact, task] [matching, vision, conference, computer, jointly, novel, fragment]
@InProceedings{Wei_2020_CVPR,
  author = {Wei, Xi and Zhang, Tianzhu and Li, Yan and Zhang, Yongdong and Wu, Feng},
  title = {Multi-Modality Cross Attention Network for Image and Sentence Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Generalized ODIN: Detecting Out-of-Distribution Image Without Learning From Out-of-Distribution Data
Yen-Chang Hsu, Yilin Shen, Hongxia Jin, Zsolt Kira


Deep neural networks have attained remarkable performance when applied to data that comes from the same distribution as the training set, but can degrade significantly otherwise. Therefore, detecting whether an example is out-of-distribution (OoD) is crucial to enable a system that can reject such samples or alert users. Recent works have made significant progress on OoD benchmarks consisting of small image datasets. However, many recent methods based on neural networks rely on training or tuning with both in-distribution and out-of-distribution data. The latter is generally hard to define a priori, and its selection can easily bias the learning. We base our work on ODIN, a popular method, and propose two strategies that free it from the need to tune with OoD data while improving its OoD detection performance. Specifically, we propose to decompose confidence scoring as well as a modified input pre-processing method. We show that both significantly help detection performance. Our further analysis on a larger-scale image dataset shows that the two types of distribution shift, namely semantic shift and non-semantic shift, present a significant difference in the difficulty of the problem, providing an analysis of when ODIN-like strategies do or do not work.
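A rough sketch of the two ingredients: a decomposed confidence head with logits h_i(x)/g(x), and OoD-data-free input pre-processing that perturbs the input to increase the numerator score. The layer sizes and the particular forms of h and g are simplified assumptions; the paper studies several variants.

import torch
import torch.nn as nn

class DeconfNet(nn.Module):
    # Decomposed confidence: logits f_i(x) = h_i(x) / g(x), with g(x) > 0.
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(32, feat_dim), nn.ReLU())
        self.h = nn.Linear(feat_dim, num_classes)                     # class-specific numerator
        self.g = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())  # shared denominator

    def forward(self, x):
        z = self.backbone(x)
        h, g = self.h(z), self.g(z) + 1e-6
        return h / g, h, g

def ood_score(model, x, eps=0.002):
    # Input pre-processing without OoD data: nudge x to increase max_i h_i(x),
    # then use max_i h_i on the perturbed input as the score (higher = more in-distribution).
    x = x.clone().requires_grad_(True)
    _, h, _ = model(x)
    h.max(dim=1).values.sum().backward()
    x_pert = x + eps * x.grad.sign()
    with torch.no_grad():
        _, h, _ = model(x_pert)
    return h.max(dim=1).values

model = DeconfNet()
print(ood_score(model, torch.randn(4, 32)))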
[shift, dataset, work, three] [detection, table, semantic, confidence, extra, scoring, score, benchmark, val] [input, model, trained, detecting, original, robustness, perturbation, magnitude, datasets, case] [figure, method, generally, analysis, high, tuned, ieee, based, scale, gaussian] [image, domain, decomposed, modified] [ood, data, neural, learning, performance, din, class, classifier, distribution, training, probability, function, deep, odin, rate, note, pout, svhn, tuning, network, deconf, arxiv, set, mahalanobis, preprint, baseline, temperature, scaling, preprocessing, number, processing, setting, strategy, higher, regularization, problem, softmax, metric, hyperparameter, classification, aha, uniform, better, best] [conference, international, computer, structure, vision]
@InProceedings{Hsu_2020_CVPR,
  author = {Hsu, Yen-Chang and Shen, Yilin and Jin, Hongxia and Kira, Zsolt},
  title = {Generalized ODIN: Detecting Out-of-Distribution Image Without Learning From Out-of-Distribution Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Augmentation Network via Influence Functions
Donghoon Lee, Hyunsin Park, Trung Pham, Chang D. Yoo


Data augmentation can impact the generalization performance of an image classification model in a significant way. However, it is currently conducted on the basis of trial and error, and its impact on the generalization performance cannot be predicted during training. This paper considers an influence function that predicts how generalization performance, in terms of validation loss, is affected by a particular augmented training sample. The influence function provides an approximation of the change in validation loss without actually comparing the performances that include and exclude the sample in the training process. Based on this function, a differentiable augmentation network is learned to augment an input training sample to reduce validation loss. The augmented sample is fed into the classification network, and its influence is approximated as a function of the parameters of the last fully-connected layer of the classification network. By backpropagating the influence to the augmentation network, the augmentation network parameters are learned. Experimental results on CIFAR-10, CIFAR-100, and ImageNet show that the proposed method provides better generalization performance than conventional data augmentation methods do.
[recognition, dataset, include] [table, framework] [model, adversarial, generalization, input, influence, change, trained, technology] [proposed, spatial, ieee, pattern, convolutional, method, figure, based, transform, conventional, applying, comparison] [loss, appearance, transformed, image, generative, unsupervised, generates, gan, train, transposed] [augmentation, training, learning, data, validation, neural, sample, network, augmented, performance, function, deep, learned, test, conducted, baseline, zval, set, machine, imagenet, random, layer, gpu, space, arxiv, preprint, impact, approximation, approximated, labeled, ratner, consider, maximize, requires, average, processing, trial] [transformation, conference, vision, computer, international, differentiable, compute, error]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Donghoon and Park, Hyunsin and Pham, Trung and Yoo, Chang D.},
  title = {Learning Augmentation Network via Influence Functions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
X-Linear Attention Networks for Image Captioning
Yingwei Pan, Ting Yao, Yehao Li, Tao Mei


Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2nd order interactions across multi-modal inputs. Nevertheless, there has not been evidence in support of building such interactions concurrently with an attention mechanism for image captioning. In this paper, we introduce a unified attention block --- the X-Linear attention block --- that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning. Technically, the X-Linear attention block simultaneously exploits both the spatial and channel-wise bilinear attention distributions to capture the 2nd order interactions between the input single-modal or multi-modal features. Higher and even infinity order feature interactions are readily modeled by stacking multiple X-Linear attention blocks and by equipping the block with the Exponential Linear Unit (ELU) in a parameter-free fashion, respectively. Furthermore, we present X-Linear Attention Networks (dubbed X-LAN) that integrate X-Linear attention block(s) into the image encoder and the sentence decoder of an image captioning model to leverage higher order intra- and inter-modal interactions. The experiments on the COCO benchmark demonstrate that our X-LAN obtains the best published CIDEr performance to date, 132.0%, on the COCO Karpathy test split. When further endowing Transformer with X-Linear attention blocks, CIDEr is boosted up to 132.8%. Source code is available at https://github.com/Panda-Peter/image-captioning.
[attention, order, bilinear, visual, sentence, captioning, embed, lstm, decoder, attended, interaction, infinity, embedding, mechanism, cider, word, natural, decoding, reasoning, yingwei, question, hidden, truck, sitting, yehao, karpathy, time, current] [feature, pooling, region, faster, module, table, coco, fully, key, denotes, unified, boost] [query, model, input, testing] [block, spatial, conventional, output, figure, enhanced, stacking, elu, stack, high] [image, encoder, content, transformed, ting, tao, perform] [higher, set, linear, performance, training, softmax, neural, weight, exponential, group, sum, capacity] [capture, demonstrate]
@InProceedings{Pan_2020_CVPR,
  author = {Pan, Yingwei and Yao, Ting and Li, Yehao and Mei, Tao},
  title = {X-Linear Attention Networks for Image Captioning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Person Re-Identification via Multi-Label Classification
Dongkai Wang, Shiliang Zhang


The challenge of unsupervised person re-identification (ReID) lies in learning discriminative features without true labels. This paper formulates unsupervised person ReID as a multi-label classification task to progressively seek true labels. Our method starts by assigning each person image a single-class label, then evolves to multi-label classification by leveraging the updated ReID model for label prediction. The label prediction comprises similarity computation and cycle consistency to ensure the quality of predicted labels. To boost the ReID model training efficiency in multi-label classification, we further propose the memory-based multi-label classification loss (MMCL). MMCL works with a memory-based non-parametric classifier and integrates multi-label classification and single-label classification in a unified framework. Our label prediction and MMCL work iteratively and substantially boost the ReID performance. Experiments on several large-scale person ReID datasets demonstrate the superiority of our method in unsupervised person ReID. Our method also allows the use of labeled person images from other domains. Under this transfer learning setting, our method also achieves state-of-the-art performance.
[prediction, dataset, bank, work, outperforms] [positive, feature, table, hard, score, map, achieves, predicted, liang, threshold, propose, utilizes] [model, effectively, knn, true, datasets] [method, comparison] [person, unsupervised, mmcl, image, loss, mplp, reid, transfer, duke, market, cycle, domain, source, train, issue, discriminative, consistency, shiliang, cross, supervised, ecn, ssg, msmt, wen, ensure] [learning, label, classification, negative, similarity, training, labeled, class, accuracy, performance, memory, gradient, deep, better, data, number, network, large, mining, set, unlabeled, select, test, rank, paper, updated, discussed, multilabel, neural] [single, vanishing, computes, computed, leveraging, iteratively]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Dongkai and Zhang, Shiliang},
  title = {Unsupervised Person Re-Identification via Multi-Label Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Overcoming Classifier Imbalance for Long-Tail Object Detection With Balanced Group Softmax
Yu Li, Tao Wang, Bingyi Kang, Sheng Tang, Chunfeng Wang, Jintao Li, Jiashi Feng


Solving long-tail large vocabulary object detection with deep learning based models is a challenging and demanding task, which is however under-explored. In this work, we provide the first systematic analysis of the underperformance of state-of-the-art models under a long-tail distribution. We find that existing detection methods are unable to model few-shot classes when the dataset is extremely skewed, which can result in classifier imbalance in terms of parameter magnitude. Directly adapting long-tail classification models to detection frameworks cannot solve this problem due to the intrinsic difference between detection and classification. In this work, we propose a novel balanced group softmax (BAGS) module for balancing the classifiers within the detection frameworks through group-wise training. It implicitly modulates the training process for the head and tail classes and ensures they are both sufficiently trained, without requiring any extra sampling of instances from the tail classes. Extensive experiments on the very recent long-tail large vocabulary object recognition benchmark LVIS show that our proposed BAGS significantly improves the performance of detectors with various backbones and frameworks on both object detection and instance segmentation. It beats all state-of-the-art methods transferred from long-tail image classification and establishes a new state of the art. Code is available at https://github.com/FishYuLi/BalancedGroupSoftmax.
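A rough sketch of the group-wise softmax idea: classes are binned by training-instance count and normalized within each bin, so head classes cannot suppress tail classes. The bin edges below are arbitrary, and the real BAGS module also adds an "others" category to each group and treats background separately.

import torch

def balanced_group_softmax(logits, class_counts, bin_edges=(10, 100, 1000)):
    # Bin classes by their number of training instances and apply a softmax
    # independently within each bin.
    groups = torch.bucketize(class_counts, torch.tensor(bin_edges))
    probs = torch.zeros_like(logits)
    for g in groups.unique():
        idx = (groups == g).nonzero(as_tuple=True)[0]
        probs[:, idx] = torch.softmax(logits[:, idx], dim=1)
    return probs

logits = torch.randn(2, 6)
counts = torch.tensor([5, 8, 50, 200, 5000, 20000])   # made-up per-class instance counts
print(balanced_group_softmax(logits, counts))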
[dataset, prediction] [detection, object, category, head, instance, background, map, proposal, lvis, faster, mask, feature, coco, segmentation, table, ross, foreground, bounding, false, cascade, htc, denotes, framework, improvement, kaiming, key, propose, module] [model, trained, original] [ieee, pattern, method, based, figure, analysis] [loss, corresponding] [training, softmax, tail, group, classification, classifier, balanced, weight, number, data, performance, class, learning, imbalanced, probability, large, imbalance, deep, sampling, distribution, arxiv, preprint, better, set, initialized, find] [computer, conference, well, vision, normal, focal, international]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yu and Wang, Tao and Kang, Bingyi and Tang, Sheng and Wang, Chunfeng and Li, Jintao and Feng, Jiashi},
  title = {Overcoming Classifier Imbalance for Long-Tail Object Detection With Balanced Group Softmax},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
What You See is What You Get: Exploiting Visibility for 3D Object Detection
Peiyun Hu, Jason Ziglar, David Held, Deva Ramanan


Recent advances in 3D sensing have created unique challenges for computer vision. One fundamental challenge is finding a good representation for 3D sensor data. Most popular representations (such as PointNet) are proposed in the context of processing truly 3D data (e.g. points sampled from mesh models), ignoring the fact that 3D sensored data such as a LiDAR sweep is in fact 2.5D. We argue that representing 2.5D data as collections of (x,y,z) points fundamentally destroys hidden information about freespace. In this paper, we demonstrate such knowledge can be efficiently recovered through 3D raycasting and readily incorporated into batch-based gradient learning. We describe a simple approach to augmenting voxel-based networks with visibility: we add a voxelized visibility map as an additional input stream. In addition, we show that visibility can be combined with two crucial modifications common to state-of-the-art 3D detectors: synthetic data augmentation of virtual objects and temporal aggregation of LiDAR sweeps over multiple time frames. On the NuScenes 3D detection benchmark, we show that, by adding an additional stream for visibility input, we can significantly improve the overall detection accuracy of a state-of-the-art 3D detector.
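A simplified raycasting sketch: walk from the sensor origin to each LiDAR return, marking traversed voxels as observed free space and the end voxel as occupied. The paper uses a proper voxel traversal and aggregates multiple sweeps; here the ray is just sampled densely, and the grid, voxel size and points are invented.

import numpy as np

def visibility_volume(points, origin, voxel_size=0.5, grid=(80, 80, 8)):
    vis = np.zeros(grid, dtype=np.int8)            # 0 unknown, 1 free, 2 occupied
    for p in points:
        direction = p - origin
        dist = np.linalg.norm(direction)
        n_steps = int(dist / (voxel_size * 0.5)) + 1
        for s in np.linspace(0.0, 1.0, n_steps, endpoint=False):
            idx = np.floor((origin + s * direction) / voxel_size).astype(int)
            if (idx >= 0).all() and (idx < grid).all():
                vis[tuple(idx)] = max(vis[tuple(idx)], 1)    # observed free space along the ray
        end = np.floor(p / voxel_size).astype(int)
        if (end >= 0).all() and (end < grid).all():
            vis[tuple(end)] = 2                              # the LiDAR return itself
    return vis

pts = np.array([[10.0, 12.0, 1.0], [30.0, 5.0, 2.0]])
vol = visibility_volume(pts, origin=np.array([0.0, 0.0, 1.5]))
print((vol == 1).sum(), "free voxels,", (vol == 2).sum(), "occupied voxels")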
[temporal, late, context, stream, reasoning, integrate, time, naive, multiple, exploit] [object, lidar, detection, map, pointpillars, raycasting, feature, nuscenes, drilling, improvement, aggregation, framework, improves, ablation, culling, backbone, official, anchor, sensored, augmenting, including, autonomous, detect] [visibility, ray, model, input, original, drop, origin] [sensor, fusion, range, based, convolutional, figure, proposed, captured] [representation, visualize, introduce, image] [data, augmentation, deep, augmented, simple, network, early, set, processing, learning, algorithm, probabilistic, training, vanilla, efficient, strategy, general, online] [point, voxel, approach, occupancy, virtual, scene, freespace, instantaneous, compute, occupied, demonstrate, additional, volume, cloud, voxels, sweep]
@InProceedings{Hu_2020_CVPR,
  author = {Hu, Peiyun and Ziglar, Jason and Held, David and Ramanan, Deva},
  title = {What You See is What You Get: Exploiting Visibility for 3D Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Structure-Revealed Network for Texture Recognition
Wei Zhai, Yang Cao, Zheng-Jun Zha, HaiYong Xie, Feng Wu


Texture recognition is a challenging visual task since various primitives, along with their arrangements, can be recognized from the same texture image when it is perceived in different contexts. Some recent work building on CNNs exploits orderless aggregation to provide invariance to spatial arrangements. However, these methods ignore the inherent structural property of textures, which is a critical cue for distinguishing and describing texture images in the wild. To address this problem, we propose a novel Deep Structure-Revealed Network (DSR-Net) that leverages spatial dependency among the captured primitives as a structural representation for texture recognition. Specifically, a primitive capturing module (PCM) is devised to generate multiple primitives from eight directional spatial contexts, in which deep features are first extracted under the constraints of a direction map and then encoded based on neighborhood similarities. Next, these primitives are associated with a dependence learning module (DLM) to generate the structural representation, in which a two-way collaborative relationship strategy is introduced to perceive the spatial dependencies among multiple primitives. Finally, the structure-revealed texture representations are integrated with spatially ordered information to achieve real-world texture recognition. Evaluation on the five most challenging texture recognition datasets has demonstrated the superiority of the proposed model against state-of-the-art methods. The structure-revealing behavior of DSR-Net is further verified in additional experiments, including fine-grained classification and semantic segmentation.
[dependency, revealed, ordered, collaborative] [map, pcm, inherent, pooling, branch, feature, reshape, pspnet, module, center, resnet] [input, true] [spatial, dsr] [avg, dependence, image, composition, gatys] [weight, candidate, learning, metric] [position, directional, structure, principal, ground]
@InProceedings{Zhai_2020_CVPR,
  author = {Zhai, Wei and Cao, Yang and Zha, Zheng-Jun and Xie, HaiYong and Wu, Feng},
  title = {Deep Structure-Revealed Network for Texture Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Online Knowledge Distillation via Collaborative Learning
Qiushan Guo, Xinjiang Wang, Yichao Wu, Zhipeng Yu, Ding Liang, Xiaolin Hu, Ping Luo


This work presents an efficient yet effective online Knowledge Distillation method via Collaborative Learning, termed KDCL, which is able to consistently improve the generalization ability of deep neural networks (DNNs) that have different learning capacities. Unlike existing two-stage knowledge distillation approaches that pre-train a DNN with large capacity as the "teacher" and then transfer the teacher's knowledge to another "student" DNN unidirectionally (i.e. one-way), KDCL treats all DNNs as "students" and collaboratively trains them in a single stage (knowledge is transferred among arbitrary students during collaborative training), enabling parallel computing, fast computation, and appealing generalization ability. Specifically, we carefully design multiple methods to generate soft targets as supervision by effectively ensembling the students' predictions and distorting the input images. Extensive experiments show that KDCL consistently improves all the "students" on different datasets, including CIFAR-100 and ImageNet. For example, when trained together by using KDCL, ResNet-50 and MobileNetV2 achieve 78.2% and 74.0% top-1 accuracy on ImageNet, outperforming the original results by 1.4% and 2.0% respectively. We also verify that models pre-trained with KDCL transfer well to object detection and semantic segmentation on the MS COCO dataset. For instance, the FPN detector is improved by 0.9% mAP.
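A minimal sketch of one KDCL-style training step under the simplest possible ensembling rule (a plain average of softened student probabilities): every network is a student, the ensemble output serves as the soft target, and each student is trained with a hard-label loss plus a distillation loss. The paper studies several more elaborate target-generation schemes; the temperature and the averaging rule here are illustrative.

import torch
import torch.nn.functional as F

def kdcl_losses(logits_list, labels, T=2.0):
    probs = [F.softmax(l / T, dim=1) for l in logits_list]
    soft_target = torch.stack(probs).mean(dim=0).detach()    # ensemble of all students
    losses = []
    for l in logits_list:
        ce = F.cross_entropy(l, labels)                       # hard-label loss
        kd = F.kl_div(F.log_softmax(l / T, dim=1), soft_target,
                      reduction="batchmean") * (T * T)        # distillation loss
        losses.append(ce + kd)
    return losses                                             # one loss per student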
[collaborative, multiple, recognition, dataset, element, illustrated] [table, supervision, object, propose, detection, branch, extra, fusing] [model, generalization, trained, improve, ensemble, logit, input, quality, identical] [method, output, ieee, pattern, proposed, result, based] [target, ability, generate, loss, ensembling, transfer, invariance, produced, diversity, generated, image] [knowledge, network, soft, teacher, learning, training, student, distillation, neural, kdcl, accuracy, set, performance, validation, deep, data, online, rate, clnn, weight, vanilla, linear, compact, better, logits, test, arxiv, preprint, dml, gain, probability, imagenet, capacity, randomly, mutual] [conference, computer, vision, additional, ground, error, complex]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Qiushan and Wang, Xinjiang and Wu, Yichao and Yu, Zhipeng and Liang, Ding and Hu, Xiaolin and Luo, Ping},
  title = {Online Knowledge Distillation via Collaborative Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dynamic Convolution: Attention Over Convolution Kernels
Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, Zicheng Liu


Light-weight convolutional neural networks (CNNs) suffer performance degradation as their low computational budgets constrain both the depth (number of convolution layers) and the width (number of channels) of CNNs, resulting in limited representation capability. To address this issue, we present Dynamic Convolution, a new design that increases model complexity without increasing the network depth or width. Instead of using a single convolution kernel per layer, dynamic convolution aggregates multiple parallel convolution kernels dynamically based upon their attentions, which are input dependent. Assembling multiple kernels is not only computationally efficient due to the small kernel size, but also has more representation power since these kernels are aggregated in a non-linear way via attention. By simply using dynamic convolution for the state-of-the-art architecture MobileNetV3-Small, the top-1 accuracy of ImageNet classification is boosted by 2.9% with only 4% additional FLOPs and 2.9 AP gain is achieved on COCO keypoint detection.
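The sketch below shows the core mechanism in PyTorch: K parallel kernels are aggregated per sample with input-dependent attention (computed from globally pooled features and softened by a temperature), and the aggregated kernel is applied as a grouped convolution so that each sample uses its own kernel. The kernel count, temperature, and attention MLP are illustrative choices, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, K=4, temperature=30.0):
        super().__init__()
        self.K, self.temperature = K, temperature
        self.weight = nn.Parameter(torch.randn(K, out_ch, in_ch, k, k) * 0.01)
        hidden = max(in_ch // 4, 4)
        self.attn = nn.Sequential(nn.Linear(in_ch, hidden), nn.ReLU(),
                                  nn.Linear(hidden, K))

    def forward(self, x):
        b, c, h, w = x.shape
        pooled = x.mean(dim=(2, 3))                                  # global average pooling
        pi = F.softmax(self.attn(pooled) / self.temperature, dim=1)  # (b, K) kernel attention
        agg = torch.einsum('bk,koihw->boihw', pi, self.weight)       # per-sample aggregated kernels
        x = x.reshape(1, b * c, h, w)                                # run all samples as one grouped conv
        out = F.conv2d(x, agg.reshape(-1, c, *agg.shape[-2:]),
                       padding=agg.shape[-1] // 2, groups=b)
        return out.reshape(b, -1, h, w)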
[attention, static, multiple, recognition, work, three] [table, backbone, head, aggregated, improvement, extra, cnn] [model, input, trained] [convolution, dynamic, kernel, figure, output, method, ieee, residual, pattern, convolutional, based, dconv, june, low] [image, representation, train] [neural, efficient, learning, training, accuracy, network, layer, computational, architecture, performance, temperature, softmax, deep, classification, search, width, number, counterpart, compared, imagenet, best, denote, bottleneck, small, linear, increase, early, weight, rate, song, madds, depthwise, size, shuffle] [conference, computer, vision, cost, international, depth, additional, constraint, single, keypoint]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Yinpeng and Dai, Xiyang and Liu, Mengchen and Chen, Dongdong and Yuan, Lu and Liu, Zicheng},
  title = {Dynamic Convolution: Attention Over Convolution Kernels},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
3DSSD: Point-Based 3D Single Stage Object Detector
Zetong Yang, Yanan Sun, Shu Liu, Jiaya Jia


The prevalence of voxel-based 3D single-stage detectors contrasts with underexplored point-based methods. In this paper, we present a lightweight point-based 3D single-stage object detector, 3DSSD, that achieves a decent balance of accuracy and efficiency. In this paradigm, all upsampling layers and the refinement stage, which are indispensable in all existing point-based methods, are abandoned. We instead propose a fusion sampling strategy in the downsampling process to make detection on less representative points feasible. A delicate box prediction network, including a candidate generation layer and an anchor-free regression head with a 3D center-ness assignment strategy, is developed to meet the demands of high accuracy and speed. Our 3DSSD paradigm is an elegant, anchor-free single-stage design. We evaluate it on the widely used KITTI dataset and the more challenging nuScenes dataset. Our method outperforms all state-of-the-art voxel-based single-stage methods by a large margin, yields performance comparable to two-stage point-based methods, and runs at an inference speed of 25+ FPS, 2x faster than former state-of-the-art point-based methods.
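A loose sketch of the fusion sampling idea: farthest point sampling run once with purely Euclidean distance (D-FPS) and once with a fused feature-plus-geometry distance (F-FPS), after which the two index sets are combined. The helper functions, the balance term lambda_, and the even split are illustrative, not the paper's exact formulation.

import numpy as np

def fps(dist_fn, n_points, n_sample):
    chosen = [0]
    min_d = np.full(n_points, np.inf)
    for _ in range(n_sample - 1):
        min_d = np.minimum(min_d, dist_fn(chosen[-1]))   # distance to the newest sample
        chosen.append(int(np.argmax(min_d)))             # pick the farthest remaining point
    return np.array(chosen)

def fusion_sampling(xyz, feat, n_sample, lambda_=1.0):
    d_xyz = lambda i: np.linalg.norm(xyz - xyz[i], axis=1)
    d_fused = lambda i: lambda_ * d_xyz(i) + np.linalg.norm(feat - feat[i], axis=1)
    idx_d = fps(d_xyz, len(xyz), n_sample // 2)          # D-FPS: geometry only
    idx_f = fps(d_fused, len(xyz), n_sample // 2)        # F-FPS: feature + geometry
    return np.unique(np.concatenate([idx_d, idx_f]))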
[prediction, outperforms, illustrated, dataset, order, predict, extract, multiple, time] [object, detection, table, regression, refinement, nuscenes, positive, assignment, instance, shifting, recall, feature, lidar, stage, box, head, including, proposal, backbone, semantic, bounding, apply, final, pedestrian, module, center, detector, faster, autonomous, hard] [representative, model, original] [fusion, method, figure, lightweight, existing, based, convolutional, downsampling] [loss, generation, corresponding, utilize] [sampling, candidate, network, strategy, performance, inference, set, classification, layer, better, large, applied, negative, class, compared, higher, label, randomly, evaluate] [point, kitti, distance, cloud, second, orientation, compare, single, voxel, angle]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Zetong and Sun, Yanan and Liu, Shu and Jia, Jiaya},
  title = {3DSSD: Point-Based 3D Single Stage Object Detector},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Degradation Prior for Low-Quality Image Classification
Yang Wang, Yang Cao, Zheng-Jun Zha, Jing Zhang, Zhiwei Xiong


State-of-the-art image classification algorithms building upon convolutional neural networks (CNNs) are commonly trained on large annotated datasets of high-quality images. When applied to low-quality images, they suffer a significant degradation in performance, since the structural and statistical properties of pixels in the neighborhood are obstructed by image degradation. To address this problem, this paper proposes a novel deep degradation prior for low-quality image classification. It is based on the statistical observations that, in the deep representation space, image patches with structural similarity have a uniform distribution even if they come from different images, and that the distributions of corresponding patches in low- and high-quality images have uniform margins under the same degradation condition. Therefore, we propose a feature de-drifting module (FDM) to learn the mapping relationship between deep representations of low- and high-quality images, and leverage it as a deep degradation prior (DDP) for low-quality image classification. Since the statistical properties are independent of image content, the deep degradation prior can be learned on a training set of limited images without supervision from semantic labels, and serves as a "plug-in" module for existing classification networks to improve their performance on degraded images. Evaluations on the benchmark dataset ImageNet-C demonstrate that our proposed DDP can improve the accuracy of the pre-trained network model by more than 20% under various degradation conditions. Even under the extreme setting in which only 10 images from the CUB-C dataset are used to train the DDP, our method improves the accuracy of VGG16 on ImageNet-C from 37% to 55%.
[dataset, visual, work] [feature, module, propose, semantic, level, jing] [trained, input, improve, quality, clean, robust] [degraded, degradation, clear, figure, ieee, method, fdm, fog, prior, contrast, foggy, pattern, proposed, convolutional, ddp, existing, enhancement, field, low, dehazing, receptive, patch, output, enhanced, based, brightness, directtest] [image, representation, structural, domain, yang, learn, mapping, proposes, cub] [deep, classification, accuracy, network, performance, learned, neural, training, learning, statistical, alexnet, imagenet, filter, paper, uniform, distribution] [vision, conference, computer, local, structure, novel]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yang and Cao, Yang and Zha, Zheng-Jun and Zhang, Jing and Xiong, Zhiwei},
  title = {Deep Degradation Prior for Low-Quality Image Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ViBE: Dressing for Diverse Body Shapes
Wei-Lin Hsiao, Kristen Grauman


Body shape plays an important role in determining what garments will best suit a given person, yet today's clothing recommendation methods take a "one shape fits all" approach. These body-agnostic vision methods and datasets are a barrier to inclusion, ill-equipped to provide good suggestions for diverse body shapes. We introduce ViBE, a VIsual Body-aware Embedding that captures clothing's affinity with different body shapes. Given an image of a person, the proposed embedding identifies garments that will flatter her specific body shape. We show how to learn the embedding from an online catalog displaying fashion models of various shapes and sizes wearing the products, and we devise a method to explain the algorithm's suggestions for well-fitting garments. We apply our approach to a dataset of diverse subjects, and demonstrate its strong advantages over status quo body-agnostic recommendation, both according to automated metrics and human opinion.
[embedding, visual, dataset, people, work, wearing, multiple, current] [positive, affinity, cnn] [clothing, fashion, garment, model, recommendation, catalog, suitable, dress, recommended, compatibility, auc, datasets, type, item, flatter, worn, example, fbody, vital, create] [figure, based, existing, method, proposed, prior, quantitative] [image, person, unseen, diverse, loss, specific, train, style, user, introduce] [learning, learned, test, best, size, product, training, data, agnostic, deep, online] [body, shape, human, fit, approach, virtual, smpl, computer, mesh, michael, capture, single, gerard, vision, estimation, joint, silhouette, vibe, implicit]
@InProceedings{Hsiao_2020_CVPR,
  author = {Hsiao, Wei-Lin and Grauman, Kristen},
  title = {ViBE: Dressing for Diverse Body Shapes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Don't Judge an Object by Its Context: Learning to Overcome Contextual Bias
Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, Deepti Ghadiyaram


Existing models often leverage co-occurrences between objects and their context to improve recognition accuracy. However, strongly relying on context risks a model's generalizability, especially when typical co-occurrence patterns are absent. This work focuses on addressing such contextual biases to improve the robustness of the learnt feature representations. Our goal is to accurately recognize a category in the absence of its context, without compromising performance when it co-occurs with context. Our key idea is to decorrelate feature representations of a category from its co-occurring context. We achieve this by learning a feature subspace that explicitly represents categories occurring in the absence of context, alongside a joint feature subspace that represents both categories and context. Our very simple yet effective method is extensible to two multi-label tasks -- object and attribute classification. On 4 challenging datasets, we demonstrate the effectiveness of our method in reducing contextual bias.
[context, recognition, visual, recognize, occur, goal, dataset, work] [category, biased, feature, object, occurs, contextual, cam, skateboard, occurring, location, propose, overlap, split, table, represents, key, decorrelate] [model, exclusive, typical, datasets, trained, identify] [method, proposed, spatial, figure, pixel, remove] [loss, absence, attribute, image, learn, perform, learns, person, gender] [bias, standard, training, performance, network, learning, class, test, data, classifier, classification, deep, space, weighted, subspace, higher, negative, entire, observe, compared, activation, better, set] [approach, rely, leverage, scene, second, jointly]
@InProceedings{Singh_2020_CVPR,
  author = {Singh, Krishna Kumar and Mahajan, Dhruv and Grauman, Kristen and Lee, Yong Jae and Feiszli, Matt and Ghadiyaram, Deepti},
  title = {Don't Judge an Object by Its Context: Learning to Overcome Contextual Bias},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SESS: Self-Ensembling Semi-Supervised 3D Object Detection
Na Zhao, Tat-Seng Chua, Gim Hee Lee


The performance of existing point cloud-based 3D object detection methods heavily relies on large-scale high-quality 3D annotations. However, such annotations are often tedious and expensive to collect. Semi-supervised learning is a good alternative to mitigate the data annotation issue, but has remained largely unexplored in 3D object detection. Inspired by the recent success of self-ensembling technique in semi-supervised image classification task, we propose SESS, a self-ensembling semi-supervised 3D object detection framework. Specifically, we design a thorough perturbation scheme to enhance generalization of the network on unlabeled and new unseen data. Furthermore, we propose three consistency losses to enforce the consistency between two sets of predicted 3D object proposals, to facilitate the learning of structure and semantic invariances of objects. Extensive experiments conducted on SUN RGB-D and ScanNet datasets demonstrate the effectiveness of SESS in both inductive and transductive semi-supervised 3D object detection. Our SESS achieves competitive performance compared to the state-of-the-art fully-supervised method by using only 50% labeled data. Our code is available at https://github.com/Na-Z/sess.
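A minimal sketch of the self-ensembling setup the framework builds on: the teacher is typically maintained as an exponential moving average (EMA) of the student, and consistency between the two sets of predicted proposals is encouraged. Only a simple center-alignment term is sketched here, whereas the paper formulates three consistency losses over aligned proposals; ema_update and center_consistency are illustrative helpers, not the paper's code.

import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(alpha).add_(ps, alpha=1.0 - alpha)

def center_consistency(student_centers, teacher_centers):
    # student_centers: (Ns, 3), teacher_centers: (Nt, 3) predicted box centers
    d = torch.cdist(student_centers, teacher_centers)      # pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()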
[three] [object, detection, sess, votenet, bounding, sun, semantic, propose, predicted, framework, table, box, amodal, proposal, benchmark, denotes, center, detector, val] [perturbation, model, input, datasets, strong, improve, trained] [based, figure, proposed, comparison, existing, method, output] [consistency, supervised, loss, transductive, image, alignment, aligned, corresponding, transformed] [labeled, network, teacher, student, data, training, learning, unlabeled, set, performance, deep, random, scheme, inductive, semisupervised, stochastic, task, large, number, class, randomly, promising, applied, average, sampled, note] [point, cloud, ground, scene, scannet, truth, computed]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Na and Chua, Tat-Seng and Lee, Gim Hee},
  title = {SESS: Self-Ensembling Semi-Supervised 3D Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Combining Detection and Tracking for Human Pose Estimation in Videos
Manchen Wang, Joseph Tighe, Davide Modolo


We propose a novel top-down approach that tackles the problem of multi-person human pose estimation and tracking in videos. In contrast to existing top-down approaches, our method is not limited by the performance of its person detector and can predict the poses of person instances not localized by the detector. It achieves this capability by propagating known person locations forward and backward in time and searching for poses in those regions. Our approach consists of three components: (i) a Clip Tracking Network that performs body joint detection and tracking simultaneously on small video clips; (ii) a Video Tracking Pipeline that merges the fixed-length tracklets produced by the Clip Tracking Network into tracks of arbitrary length; and (iii) a Spatial-Temporal Merging procedure that refines the joint locations based on spatial and temporal smoothing terms. Thanks to the precision of our Clip Tracking Network and our merging procedure, our approach produces very accurate joint predictions and can fix common mistakes in hard scenarios like heavily entangled people. Our approach achieves state-of-the-art results on both joint detection and tracking, on both the PoseTrack 2017 and 2018 datasets, and against all top-down and bottom-up approaches.
[video, clip, temporal, frame, people, length, predict, multiple, time, three, short, correctly] [tracking, hrnet, bounding, detection, tracklets, detector, posetrack, merging, coco, location, achieves, box, tube, table, mota, merge, map, detect, missed, improvement, detected, merges, propose, propagating, object] [model, highly, datasets] [method, spatial, figure, based, running] [person, image, learn, arbitrary, entangled] [network, performance, procedure, set, baseline, similarity, evaluate, architecture, number, validation, problem, best, learning, large] [pose, approach, joint, estimation, body, human, pipeline, overlapping, novel, error, localized, hypothesis]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Manchen and Tighe, Joseph and Modolo, Davide},
  title = {Combining Detection and Tracking for Human Pose Estimation in Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SAPIEN: A SimulAted Part-Based Interactive ENvironment
Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, Hao Su


Building home assistant robots has long been a goal for vision and robotics researchers. To achieve this task, a simulated environment with physically realistic simulation, sufficient articulated objects, and transferability to the real robot is indispensable. Existing environments achieve these requirements for robotics simulation with different levels of simplification and focus. We take one step further in constructing an environment that supports household tasks for training robot learning algorithms. Our work, SAPIEN, is a realistic and physics-rich simulated environment that hosts a large-scale set of articulated objects. SAPIEN enables various robotic vision and interaction tasks that require detailed part-level understanding. We evaluate state-of-the-art vision algorithms for part detection and motion attribute recognition, and demonstrate robotic interaction tasks using heuristic approaches and reinforcement learning algorithms. We hope that SAPIEN will open research directions yet to be explored, including learning cognition through interaction, part motion discovery, and construction of robotics-ready simulated game environments.
[environment, interaction, provide, dataset, perception, engine, movable, reinforcement, simulator, visual, navigation, recognition, planning, agent] [object, door, table, interactive, detection, segmentation, mask, including, box, semantic] [physical, manipulation, model, move, heuristic, game] [motion, figure, ieee, pattern, dynamic, based] [real, control, train, realistic, target] [learning, arxiv, preprint, training, set, support, average, deep, client, hinge, test, open] [sapien, robot, simulation, robotic, vision, conference, robotics, joint, computer, articulated, rendering, international, simulated, interface, hao, detailed, physx, point, drawer, axis, enables, demonstrate, camera, accurate, pose, rotation, thomas, angel, require, scene, rgb]
@InProceedings{Xiang_2020_CVPR,
  author = {Xiang, Fanbo and Qin, Yuzhe and Mo, Kaichun and Xia, Yikuan and Zhu, Hao and Liu, Fangchen and Liu, Minghua and Jiang, Hanxiao and Yuan, Yifu and Wang, He and Yi, Li and Chang, Angel X. and Guibas, Leonidas J. and Su, Hao},
  title = {SAPIEN: A SimulAted Part-Based Interactive ENvironment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds
Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, Andrew Markham


We study the problem of efficient semantic segmentation for large-scale 3D point clouds. Because they rely on expensive sampling techniques or computationally heavy pre/post-processing steps, most existing approaches can only be trained on and operate over small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation- and memory-efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Extensive experiments show that our RandLA-Net can process 1 million points in a single pass, up to 200x faster than existing approaches. Moreover, our RandLA-Net clearly surpasses state-of-the-art approaches for semantic segmentation on two large-scale benchmarks, Semantic3D and SemanticKITTI.
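A sketch of the two ingredients named in the abstract: plain random downsampling, and a relative point-position encoding that gathers K nearest neighbours and concatenates absolute positions, neighbour positions, offsets, and distances (in the full local feature aggregation module these encodings then pass through attentive pooling, omitted here). The brute-force KNN and the value of K are illustrative.

import numpy as np

def random_downsample(xyz, feat, ratio=4):
    keep = np.random.choice(len(xyz), len(xyz) // ratio, replace=False)
    return xyz[keep], feat[keep]

def relative_position_encoding(xyz, K=16):
    d = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)   # (n, n) pairwise distances
    knn = np.argsort(d, axis=1)[:, :K]                               # K nearest neighbours
    centers = np.repeat(xyz[:, None, :], K, axis=1)                  # (n, K, 3)
    neighbours = xyz[knn]                                            # (n, K, 3)
    offsets = centers - neighbours
    dists = np.linalg.norm(offsets, axis=-1, keepdims=True)
    return np.concatenate([centers, neighbours, offsets, dists], axis=-1)  # (n, K, 10)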
[graph, unit, time, encoding, explicitly, attention, multiple, infer, work] [feature, semantic, segmentation, attentive, pooling, aggregation, module, table, spg, semantickitti, object, fps, key, lidar] [input, effectively, effective] [existing, receptive, figure, spatial, based, residual, dilated, convolutional, block, fast, convolution, raw] [learn, shared, consists, progressively, preserve] [sampling, random, neural, memory, learning, network, deep, process, computationally, large, number, expensive, set, computational, total, performance, small, sample, evaluate, consumption, entire, pki, computation] [point, local, cloud, neighbouring, locse, directly, approach, complex, geometric, single, relative, pointnet]
@InProceedings{Hu_2020_CVPR,
  author = {Hu, Qingyong and Yang, Bo and Xie, Linhai and Rosa, Stefano and Guo, Yulan and Wang, Zhihua and Trigoni, Niki and Markham, Andrew},
  title = {RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving
Zhenpei Yang, Yuning Chai, Dragomir Anguelov, Yin Zhou, Pei Sun, Dumitru Erhan, Sean Rafferty, Henrik Kretzschmar


Autonomous driving system development is critically dependent on the ability to replay complex and diverse traffic scenarios in simulation. In such scenarios, the ability to accurately simulate the vehicle sensors such as cameras, lidar or radar is hugely helpful. However, current sensor simulators leverage gaming engines such as Unreal or Unity, requiring manual creation of environments, objects, and material properties. Such approaches have limited scalability and fail to produce realistic approximations of camera, lidar, and radar data without significant additional work. In this paper, we present a simple yet effective approach to generate realistic scenario sensor data, based only on a limited amount of lidar and camera data collected by an autonomous vehicle. Our approach uses texture-mapped surfels to efficiently reconstruct the scene from an initial vehicle pass or set of passes, preserving rich information about object 3D geometry and appearance, as well as the scene conditions. We then leverage a SurfelGAN network to reconstruct realistic camera images for novel positions and orientations of the self-driving vehicle and moving objects in the scene. We demonstrate our approach on the Waymo Open Dataset and show that it can synthesize realistic camera data for simulated scenarios. We also create a novel dataset that contains cases in which two self-driving vehicles observe the same scene at the same time. We use this dataset to provide additional evaluation and demonstrate the usefulness of our SurfelGAN model.
[vehicle, dataset, driving, environment, work, evaluation, rich, multiple, time] [autonomous, object, lidar, semantic, map, detector, bounding, segmentation, waymo, instance, detection, propose] [adversarial, model, trained, quality, internal, original, scenario] [sensor, based, dynamic, figure, proposed] [real, image, generated, realistic, loss, unpaired, generate, supervised, paired, realism, synthesis, consistency, generator, gan, synthetic, discriminator, cycle, representation] [training, data, arxiv, preprint, learning, set, open, large, deep, baseline, metric, simple, network] [surfel, scene, camera, surfelgan, reconstruction, novel, rendering, additional, reconstruct, approach, simulation, distance, pose, view, system, geometry, point, render, leverage, well, demonstrate, simulated]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Zhenpei and Chai, Yuning and Anguelov, Dragomir and Zhou, Yin and Sun, Pei and Erhan, Dumitru and Rafferty, Sean and Kretzschmar, Henrik},
  title = {SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Programmatic and Semantic Approach to Explaining and Debugging Neural Network Based Object Detectors
Edward Kim, Divya Gopinath, Corina Pasareanu, Sanjit A. Seshia


Even as deep neural networks have become very effective for tasks in vision and perception, it remains difficult to explain and debug their behavior. In this paper, we present a programmatic and semantic approach to explaining, understanding, and debugging the correct and incorrect behaviors of a neural network based perception system. Our approach is semantic in that it employs a high-level representation of the distribution of environment scenarios that the detector is intended to work on. It is programmatic in that the representation is a program in a domain-specific probabilistic programming language, with which synthetic data can be generated to train and test the neural network. We present a framework that assesses the performance of the neural network to identify correct and incorrect detections, extracts rules from those results that semantically characterize the correct and incorrect scenarios, and then specializes the probabilistic program with those rules in order to more precisely characterize the scenarios in which the neural network operates correctly or not, without human intervention. We demonstrate our results using the Scenic probabilistic programming language and a neural network-based object detector. Our experiments show that it is possible to automatically generate compact rules that significantly increase the correct detection rate (or conversely the incorrect detection rate) of the network and can thus help with debugging and understanding its behavior.
[correct, ego, perception, behavior, language, extract, work, environment, traffic, provide] [feature, object, detection, car, anchor, semantic, table, module, autonomous, detector, framework, refined, score, detected] [scenario, cenic, incorrect, decision, rule, model, explaining, othercar, input, help, technique, debugging, programmatic, explain] [tree, pattern, based, figure, method, output, extraction, color, low, proposed] [generated, generate, image] [neural, learning, set, data, network, deep, programming, activation, precision, number, probabilistic, test, label, rate, space, performance, machine, lead] [approach, program, conference, vision, computer, ground, scene, distance, international, truth]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Edward and Gopinath, Divya and Pasareanu, Corina and Seshia, Sanjit A.},
  title = {A Programmatic and Semantic Approach to Explaining and Debugging Neural Network Based Object Detectors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
Thomas Roddick, Roberto Cipolla


Autonomous vehicles commonly rely on highly detailed birds-eye-view maps of their environment, which capture both static elements of the scene, such as road layout, and dynamic elements, such as other cars and pedestrians. Generating these map representations on the fly is a complex multi-stage process which incorporates many important vision-based elements, including ground plane estimation, road segmentation and 3D object detection. In this work we present a simple, unified approach for estimating these map representations directly from monocular images using a single end-to-end deep learning architecture. For the maps themselves we adopt a semantic Bayesian occupancy grid framework, allowing us to trivially accumulate information over multiple cameras and timesteps. We demonstrate the effectiveness of our approach by evaluating against several challenging baselines on the NuScenes and Argoverse datasets, and show that we are able to achieve a relative improvement of 9.1% and 22.3% respectively compared to the best-performing existing method.
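One common way to accumulate per-frame occupancy probabilities from multiple cameras and timesteps into a Bayesian occupancy grid is log-odds fusion; the sketch below shows only that accumulation step and is an assumption about the bookkeeping rather than the paper's implementation (the network itself predicts the per-frame probability maps).

import numpy as np

def accumulate_occupancy(prob_maps, prior=0.5, eps=1e-6):
    prior_lo = np.log(prior / (1.0 - prior))
    log_odds = prior_lo * np.ones_like(prob_maps[0])
    for p in prob_maps:                                   # one probability map per camera / timestep
        p = np.clip(p, eps, 1.0 - eps)
        log_odds += np.log(p / (1.0 - p)) - prior_lo      # standard log-odds update
    return 1.0 / (1.0 + np.exp(-log_odds))                # back to probabilities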
[transformer, multiple, predict, road, predicting, dataset, driving, argoverse, work, represent, prediction, static, provide, build] [map, semantic, feature, object, pyramid, autonomous, nuscenes, segmentation, lidar, topdown, table, vpn, backbone, final, including, propose, height] [input, model] [figure, method, spatial, sensor, ieee, inverse, high, output, resolution, range, prior] [image, representation, loss, mapping] [network, layer, learning, class, deep, simple, bayesian, performance, ipm, arxiv, preprint, set, probability] [occupancy, grid, ground, conference, depth, approach, dense, vision, view, camera, monocular, computer, plane, directly, truth, ved, well, geometry, coordinate, capture, single, complete]
@InProceedings{Roddick_2020_CVPR,
  author = {Roddick, Thomas and Cipolla, Roberto},
  title = {Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Efficient Derivative Computation for Cumulative B-Splines on Lie Groups
Christiane Sommer, Vladyslav Usenko, David Schubert, Nikolaus Demmel, Daniel Cremers


Continuous-time trajectory representation has recently gained popularity for tasks where the fusion of high-frame-rate sensors and multiple unsynchronized devices is required. Lie group cumulative B-splines are a popular way of representing continuous trajectories without singularities. They have been used in near real-time SLAM and odometry systems with IMU, LiDAR, regular, RGB-D and event cameras, as well as for offline calibration. These applications require efficient computation of time derivatives (velocity, acceleration), but all prior works rely on a computationally suboptimal formulation. In this work we present an alternative derivation of time derivatives based on recurrence relations that needs O(k) instead of O(k^2) matrix operations (for a spline of order k) and results in simple and elegant expressions. While producing the same result, the proposed approach significantly speeds up the trajectory optimization and allows for computing simple analytic derivatives with respect to spline knots. The results presented in this paper pave the way for incorporating continuous-time trajectory representations into more applications where real-time performance is required.
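For reference, the cumulative B-spline parameterization on a Lie group that the paper builds on is usually written in the following standard form from the spline-fusion literature (notation may differ slightly from the paper's, and the O(k) derivative recurrences that constitute the contribution are not reproduced here):

T(u) = T_i \prod_{j=1}^{k-1} \exp\!\big(\tilde{B}_j(u)\,\Omega_{i+j}\big), \qquad \Omega_{i+j} = \log\!\big(T_{i+j-1}^{-1}\,T_{i+j}\big),

where the T_j are control poses, \tilde{B}_j(u) are the cumulative basis functions, and exp/log denote the Lie group exponential and logarithm. The paper derives recurrence relations that evaluate the time derivatives (velocity, acceleration) of this expression in O(k) matrix operations instead of O(k^2).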
[time, trajectory, order, work, temporal, multiple, represent, explicitly] [split, apply, table, focus] [definition, difference, case] [jacobian, proposed, recurrence, transform, ieee, motion] [representation, control] [matrix, group, computation, cumulative, vector, optimization, acceleration, baseline, simple, number, set, multiplication, theorem, efficient, log, linear, note, paper, general, uniform, angular, scheme, computing, space, computational] [lie, spline, calibration, second, derivative, formulation, jacobians, velocity, computer, defined, vision, computed, compute, conference, continuous, analytic, imu, international, accelerometer, gyroscope, estimate, estimation, define, camera, algebra, representing, odometry, allows, pose]
@InProceedings{Sommer_2020_CVPR,
  author = {Sommer, Christiane and Usenko, Vladyslav and Schubert, David and Demmel, Nikolaus and Cremers, Daniel},
  title = {Efficient Derivative Computation for Cumulative B-Splines on Lie Groups},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
RL-CycleGAN: Reinforcement Learning Aware Simulation-to-Real
Kanishka Rao, Chris Harris, Alex Irpan, Sergey Levine, Julian Ibarz, Mohi Khansari


Deep neural network based reinforcement learning (RL) can learn appropriate visual representations for complex tasks like vision-based robotic grasping without the need for manually engineering or prior learning a perception system. However, data for RL is collected via running an agent in the desired environment, and for applications like robotics, running a robot in the real world may be extremely costly and time consuming. Simulated training offers an appealing alternative, but ensuring that policies trained in simulation can transfer effectively into the real world requires additional machinery. Simulations may not match reality, and typically bridging the simulation-to-reality gap requires domain knowledge and task-specific engineering. We can automate this process by employing generative models to translate simulated images into realistic ones. However, this sort of translation is typically task-agnostic, in that the translated images may not preserve all features that are relevant to the task. In this paper, we introduce the RL-scene consistency loss for image translation, which ensures that the translation operation is invariant with respect to the Q-values associated with the image. This allows us to learn a task-aware translation. Incorporating this loss into unsupervised domain translation, we obtain the RL-CycleGAN, a new approach for simulation-to-real-world transfer for reinforcement learning. In evaluations of RL-CycleGAN on two vision-based robotics grasping tasks, we show that RL-CycleGAN offers a substantial improvement over a number of prior methods for sim-to-real transfer, attaining excellent real-world performance with only a modest number of real-world observations.
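A sketch of the RL-scene consistency idea in isolation: the sim-to-real generator should leave the task's Q-values unchanged, so a penalty on Q-value drift is added to the usual adversarial and cycle-consistency terms (omitted here). q_net and generator are placeholders for the trained Q-function and the CycleGAN generator; the paper enforces the constraint more broadly than this single term.

import torch.nn.functional as F

def rl_scene_consistency(q_net, generator, sim_batch):
    translated = generator(sim_batch)              # simulated -> "realistic" images
    q_sim = q_net(sim_batch)
    q_trans = q_net(translated)
    return F.mse_loss(q_trans, q_sim.detach())     # penalize changes in Q-values under translation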
[visual, policy, reinforcement, simulator, three, semantics, relevant, incorporating] [object, table, achieves] [trained, model, success, adversarial, input, collected, original] [rcan, figure, based, prior, ieee] [real, image, domain, gan, cyclegan, realistic, consistency, adaptation, loss, train, graspgan, transfer, randomization, gap, rlcyclegan, generated, qreal, qsim, generative, address, generator, gans, learn, unsupervised] [data, learning, performance, training, task, deep, adapted, adapt, large, neural, requires, randomized, setup, sergey, learned, required, applied, alex, network] [grasping, simulated, robot, grasp, simulation, robotic, scene, approach, require, jointly, additional, directly, conference]
@InProceedings{Rao_2020_CVPR,
  author = {Rao, Kanishka and Harris, Chris and Irpan, Alex and Levine, Sergey and Ibarz, Julian and Khansari, Mohi},
  title = {RL-CycleGAN: Reinforcement Learning Aware Simulation-to-Real},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World
Sivabalan Manivasagam, Shenlong Wang, Kelvin Wong, Wenyuan Zeng, Mikita Sazanovich, Shuhan Tan, Bin Yang, Wei-Chiu Ma, Raquel Urtasun


We tackle the problem of producing realistic simulations of LiDAR point clouds, the sensor of preference for most self-driving vehicles. We argue that, by leveraging real data, we can simulate the complex world more realistically compared to employing virtual worlds built from CAD/procedural models. Towards this goal, we first build a large catalog of 3D static maps and 3D dynamic objects by driving around several cities with our self-driving fleet. We can then generate scenarios by selecting a scene from our catalog and "virtually" placing the self-driving vehicle (SDV) and a set of dynamic objects from the catalog in plausible locations in the scene. To produce realistic simulations, we develop a novel simulator that captures both the power of physics-based and learning-based simulation. We first utilize raycasting over the 3D scene and then use a deep neural network to produce deviations from the physics-based simulation, producing realistic LiDAR point clouds. We showcase LiDARsim's usefulness for perception algorithms-testing on long-tail events and end-to-end closed-loop evaluation on safety-critical scenarios.
[vehicle, perception, driving, build, dataset, static, carla, incidence, time, bank] [lidar, object, autonomous, segmentation, table, map, iou, detection, raycasting, semantic, bounding, box, apply] [trained, model, testing, safety, evaluated, ray, original, catalog] [sensor, dynamic, simulate, figure, intensity, based] [real, realistic, train, generate, unknown, generated, gap, utilize, domain, generation, realism] [data, learning, evaluate, test, performance, set, network, neural, deep, training, arxiv, preprint, compared, large] [simulation, lidarsim, point, scene, raydrop, cloud, virtual, simulated, kitti, system, cad, autonomy, approach, osis, leveraging, well, demonstrate, full, surfel, sweep, realistically]
@InProceedings{Manivasagam_2020_CVPR,
  author = {Manivasagam, Sivabalan and Wang, Shenlong and Wong, Kelvin and Zeng, Wenyuan and Sazanovich, Mikita and Tan, Shuhan and Yang, Bin and Ma, Wei-Chiu and Urtasun, Raquel},
  title = {LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Just Go With the Flow: Self-Supervised Scene Flow Estimation
Himangi Mittal, Brian Okorn, David Held


When interacting with highly dynamic environments, scene flow allows autonomous systems to reason about the non-rigid motion of multiple independent objects. This is of particular interest in the field of autonomous driving, in which many cars, people, bicycles, and other objects need to be accurately tracked. Current state-of-the-art methods require annotated scene flow data from autonomous driving scenes to train scene flow networks with supervised learning. As an alternative, we present a method of training scene flow that uses two self-supervised losses, based on nearest neighbors and cycle consistency. These self-supervised losses allow us to train our method on large unlabeled autonomous driving datasets; the resulting method matches current state-of-the-art supervised performance using no real world annotations and exceeds state-of-the-art performance when combining our self-supervised approach with supervised learning on a smaller labeled dataset.
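A sketch of the two self-supervised losses described in the abstract: a nearest-neighbour term that pulls flowed points toward the second cloud, and a cycle-consistency term that flows the warped points back and penalizes their distance to the originals. flow_net is a placeholder for the scene-flow network; refinements such as the anchoring used in the paper are omitted.

import torch

def self_supervised_flow_losses(flow_net, p1, p2):
    flow_fwd = flow_net(p1, p2)                                       # (N, 3) forward flow
    p1_warped = p1 + flow_fwd
    nn_loss = torch.cdist(p1_warped, p2).min(dim=1).values.mean()     # nearest-neighbour loss
    flow_bwd = flow_net(p1_warped, p1)                                # flow the warped points back
    cycle_loss = ((p1_warped + flow_bwd) - p1).norm(dim=1).mean()     # cycle consistency
    return nn_loss, cycle_loss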
[time, dataset, work, outperforms, previous, current, driving, combining, temporal] [nuscenes, autonomous, predicted, annotated, lidar, table, ablation, overlap, object, propose] [datasets, trained, acc, original, true] [flow, method, figure, reverse, motion, epe, combination, captured, analysis, optical, tuned] [supervised, loss, cycle, synthetic, consistency, transformed, real, train, fine, selfsupervised, image, perform] [training, baseline, data, learning, large, performance, unlabeled, small, network, deep, amount, forward, equation, smaller, labeled] [point, scene, cloud, kitti, nearest, neighbor, ground, estimation, truth, compute, estimated, estimate, anchored, position, rigid, degenerate, error, purely, anchoring, approach, directly, distance, well, compare, andreas, avoid]
@InProceedings{Mittal_2020_CVPR,
  author = {Mittal, Himangi and Okorn, Brian and Held, David},
  title = {Just Go With the Flow: Self-Supervised Scene Flow Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TITAN: Future Forecast Using Action Priors
Srikanth Malla, Behzad Dariush, Chiho Choi


We consider the problem of predicting the future trajectory of scene agents from egocentric views obtained from a moving platform. This problem is important in a variety of domains, particularly for autonomous systems making reactive or strategic decisions in navigation. In an attempt to address this problem, we introduce TITAN (Trajectory Inference using Targeted Action priors Network), a new model that incorporates prior positions, actions, and context to forecast future trajectory of agents and future ego-motion. In the absence of an appropriate dataset for this task, we created the TITAN dataset that consists of 700 labeled video-clips (with odometry) captured from a moving vehicle on highly interactive urban traffic scenes in Tokyo. Our dataset includes 50 labels including vehicle states and actions, pedestrian age groups, and targeted pedestrian action attributes that are organized hierarchically corresponding to atomic, simple/complex-contextual, transportive, and communicative actions. To evaluate our model, we conducted extensive experiments on the TITAN dataset, revealing significant performance improvement against baselines and state-of-the-art algorithms. We also report promising results from our Agent Importance Mechanism (AIM), a module which provides insight into assessment of perceived risk by calculating the relative influence of each agent on the future ego-trajectory. The dataset is available at https://usa.honda-ri.com/titan
[future, action, titan, trajectory, dataset, agent, prediction, forecast, vehicle, time, interaction, egocentric, video, behavior, recognition, context, hidden, state, moving, includes, predicting, traffic, driving, participant, ego, incorporates, mechanism, social, predict, atomic, evaluation, chiho, urban, organized, hierarchically] [bounding, pedestrian, box, object, module, contextual, localization, predicted, table, autonomous, location, including] [model, datasets, input, age] [ieee, motion, figure, pattern, prior, captured, proposed, method] [aim, loss, encoder, target] [performance, better, set, arxiv, preprint, neural, note, problem, respect, consider, evaluate] [conference, computer, vision, human, scene, international, error, european, capture, complex]
@InProceedings{Malla_2020_CVPR,
  author = {Malla, Srikanth and Dariush, Behzad and Choi, Chiho},
  title = {TITAN: Future Forecast Using Action Priors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Robust Learning Through Cross-Task Consistency
Amir R. Zamir, Alexander Sax, Nikhil Cheerla, Rohan Suri, Zhangjie Cao, Jitendra Malik, Leonidas J. Guibas


Visual perception entails solving a wide set of tasks (e.g., object detection, depth estimation, etc). The predictions made for different tasks out of one image are not independent, and therefore, are expected to be 'consistent'. We propose a flexible and fully computational framework for learning while enforcing Cross-Task Consistency (X-TAC). The proposed formulation is based on 'inference path invariance' over an arbitrary graph of prediction domains. We observe that learning with cross-task consistency leads to more accurate predictions, better generalization to out-of-distribution samples, and improved sample efficiency. This framework also leads to a powerful unsupervised quantity, called 'Consistency Energy, based on measuring the intrinsic consistency of the system. Consistency Energy well correlates with the supervised error (r=0.67), thus it can be employed as an unsupervised robustness metric as well as for detection of out-of-distribution inputs (AUC=0.99). The evaluations were performed on multiple datasets, including Taskonomy, Replica, CocoDoom, and ApolloScape.
[prediction, multiple, dataset, shift, graph, predicting, explicit] [predicted, confidence, including] [trained, query, case, adversarial, generalization, concept] [perceptual, method, figure, ieee, pattern, based, proposed, output] [consistency, domain, lxy, loss, image, transfer, separate, arbitrary, independent, unsupervised, learn] [learning, training, energy, baseline, network, taskonomy, neural, path, data, general, function, standard, set, inference, test, arxiv, inequality, preprint, better, higher, objective, large, optimization, task, convergence, sample] [fxy, depth, triangle, consistent, computer, error, enforcing, direct, vision, conference, surface, ground, constraint, truth, supplementary, term, replica, well, system, rgb, provided, leonidas, curvature]
@InProceedings{Zamir_2020_CVPR,
  author = {Zamir, Amir R. and Sax, Alexander and Cheerla, Nikhil and Suri, Rohan and Cao, Zhangjie and Malik, Jitendra and Guibas, Leonidas J.},
  title = {Robust Learning Through Cross-Task Consistency},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dynamic Refinement Network for Oriented and Densely Packed Object Detection
Xingjia Pan, Yuqiang Ren, Kekai Sheng, Weiming Dong, Haolei Yuan, Xiaowei Guo, Chongyang Ma, Changsheng Xu


Object detection has achieved remarkable progress in the past decade. However, the detection of oriented and densely packed objects remains challenging for the following inherent reasons: (1) the receptive fields of neurons are all axis-aligned and of the same shape, whereas objects are usually of diverse shapes and align along various directions; (2) detection models are typically trained with generic knowledge and may not generalize well to handle specific objects at test time; (3) the limited datasets hinder development on this task. To resolve the first two issues, we present a dynamic refinement network that consists of two novel components, i.e., a feature selection module (FSM) and a dynamic refinement head (DRH). Our FSM enables neurons to adjust receptive fields in accordance with the shapes and orientations of target objects, whereas the DRH empowers our model to refine the prediction dynamically in an object-aware manner. To address the limited availability of related benchmarks, we collect an extensive and fully annotated dataset, namely SKU110K-R, which is relabeled with oriented bounding boxes based on SKU110K. We perform quantitative evaluations on several publicly available benchmarks including DOTA, HRSC2016, SKU110K, and our own SKU110K-R dataset. Experimental results show that our method achieves consistent and substantial gains compared with baseline approaches. Our source code and dataset will be released to encourage follow-up research.
[dataset, prediction, attention, evaluation, multiple, three, predict] [object, feature, oriented, detection, bounding, refinement, regression, table, offset, map, module, roi, head, represents, propose, refine, predicted, aerial, region, drhs, add, adopt, iou, remote, ross, coco, horizontal] [model, example] [dynamic, ieee, method, receptive, fsm, kernel, convolution, pattern, dota, densely, packed, adjust, based, figure, spatial, flexible] [factor, misalignment, target] [network, baseline, set, classification, selection, training, test, learned, learning, select, improved, size, general] [conference, computer, rotation, vision, rotated, international, angle, dense, consistent]
@InProceedings{Pan_2020_CVPR,
  author = {Pan, Xingjia and Ren, Yuqiang and Sheng, Kekai and Dong, Weiming and Yuan, Haolei and Guo, Xiaowei and Ma, Chongyang and Xu, Changsheng},
  title = {Dynamic Refinement Network for Oriented and Densely Packed Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
AOWS: Adaptive and Optimal Network Width Search With Latency Constraints
Maxim Berman, Leonid Pishchulin, Ning Xu, Matthew B. Blaschko, Gerard Medioni


Neural architecture search (NAS) approaches aim at automatically finding novel CNN architectures that fit computational constraints while maintaining a good performance on the target platform. We introduce a novel efficient one-shot NAS approach to optimally search for channel numbers, given latency constraints on a specific hardware. We first show that we can use a black-box approach to estimate a realistic latency model for a specific inference platform, without the need for low-level access to the inference computation. Then, we design a pairwise MRF to score any channel configuration and use dynamic programming to efficiently decode the best performing configuration, yielding an optimal solution for the network width search. Finally, we propose an adaptive channel configuration sampling scheme to gradually specialize the training phase to the target computational constraints. Experiments on ImageNet classification show that our approach can find networks fitting the resource constraints on different target platforms while improving accuracy over the state-of-the-art efficient networks.
[individual, modeling, order] [table, propose, cpu, final] [model, trained, input, case] [channel, adaptive, figure, convolutional, dynamic, ieee, output] [target, specific, image] [latency, search, network, training, neural, slimmable, inference, configuration, aows, number, layer, autoslim, greedy, architecture, space, efficient, resource, optimization, sampling, set, optimal, performance, efficiently, size, proxy, algorithm, viterbi, selection, learning, gpu, ows, procedure, problem, width, accuracy, entire, validation, batch, trt, design, pairwise, classification, better, optimizing, min, andrew, quoc, computational] [error, approach, vision, conference, computer, single, measured, novel, allows, international]
@InProceedings{Berman_2020_CVPR,
  author = {Berman, Maxim and Pishchulin, Leonid and Xu, Ning and Blaschko, Matthew B. and Medioni, Gerard},
  title = {AOWS: Adaptive and Optimal Network Width Search With Latency Constraints},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
High-Dimensional Convolutional Networks for Geometric Pattern Recognition
Christopher Choy, Junha Lee, Rene Ranftl, Jaesik Park, Vladlen Koltun


High-dimensional geometric patterns appear in many computer vision problems. In this work, we present high-dimensional convolutional networks for geometric pattern recognition problems that arise in 2D and 3D registration. We first propose high-dimensional convolutional networks from 4 to 32 dimensions and analyze their geometric pattern recognition capacity on high-dimensional linear regression problems. Next, we show that 3D correspondences form a hyper-surface in a 6-dimensional space and validate our network on 3D registration problems. Finally, we use image correspondences, which form a 4-dimensional hyper-conic section, and show that high-dimensional convolutional networks are on par with many state-of-the-art multi-layered perceptrons.
[dataset, recognition] [feature, global, table, detection, score, denotes, consensus] [robust, model, input, noise, success, study] [convolutional, tensor, pattern, kernel, convolution, resblock, conv, figure, indicate, based, filtering, presented] [image, translation, generalized] [network, set, deep, data, matrix, rate, size, efficient, learning, ratio, requires, presence, training, random, find, average, linear, higher, problem, space, sample, architecture, precision, neural, number] [registration, geometric, sparse, inlier, point, correspondence, error, approach, computer, form, ransac, fpfh, voxel, fgr, vladlen, local, leverage, estimation, vision, inliers, geometry, jaesik, epipolar, fitting, essential, rotation, structure, fundamental, outlier, globally]
@InProceedings{Choy_2020_CVPR,
  author = {Choy, Christopher and Lee, Junha and Ranftl, Rene and Park, Jaesik and Koltun, Vladlen},
  title = {High-Dimensional Convolutional Networks for Geometric Pattern Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks
Saurabh Singh, Shankar Krishnan


Batch Normalization (BN) uses mini-batch statistics to normalize the activations during training, introducing dependence between mini-batch elements. This dependency can hurt the performance if the mini-batch size is too small, or if the elements are correlated. Several alternatives, such as Batch Renormalization and Group Normalization (GN), have been proposed to address this issue. However, they either do not match the performance of BN for large batches, or still exhibit degradation in performance for smaller batches, or introduce artificial constraints on the model architecture. In this paper we propose the Filter Response Normalization (FRN) layer, a novel combination of a normalization and an activation function, that can be used as a replacement for other normalizations and activations. Our method operates on each activation channel of each batch element independently, eliminating the dependency on other batch elements. Our method outperforms BN and other alternatives in a variety of settings for all batch sizes. FRN layer performs 0.7-1.0% better than BN on top-1 validation accuracy with large mini-batch sizes for Imagenet classification using InceptionV3 and ResnetV2-50 architectures. Further, it performs >1% better than GN on the same problem in the small mini-batch size regime. For object detection problem on COCO dataset, FRN layer outperforms all other methods by at least 0.3-0.5% in all batch size regimes.
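The normalization and activation described in the abstract admit a compact implementation: each channel of each sample is divided by the root of its mean squared activation over the spatial extent (no mean subtraction, no batch statistics), followed by a learned affine transform and a thresholded linear unit max(y, tau). The epsilon value below is an illustrative default, not necessarily the paper's setting.

import torch
import torch.nn as nn

class FRNTLU(nn.Module):
    def __init__(self, channels, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.tau = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):                                   # x: (N, C, H, W)
        nu2 = x.pow(2).mean(dim=(2, 3), keepdim=True)       # per-sample, per-channel statistic
        x = x * torch.rsqrt(nu2 + self.eps)
        return torch.max(self.gamma * x + self.beta, self.tau)   # FRN followed by TLU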
[outperforms, connected] [table, object, detection, fully, response, kaiming, propose, coco] [model, trained, batchnorm, largest] [method, high, figure, relu, proposed, degradation, channel, performs, comparison, convolutional, learnable, pattern, combination, based, ieee] [image, train, perform, common, issue, discrepancy] [normalization, batch, frn, training, performance, layer, size, activation, large, learning, imagenet, classification, neural, tlu, filter, deep, small, network, groupnorm, higher, rate, accuracy, group, number, note, smaller, consistently, sample, function, better, validation, lead, normalized, decay, gpu, thresholded, learned, normalizing, performing, normalize] [computer, conference, vision, variety, david]
@InProceedings{Singh_2020_CVPR,
  author = {Singh, Saurabh and Krishnan, Shankar},
  title = {Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Iterative Surface Normal Estimation
Jan Eric Lenssen, Christian Osendorfer, Jonathan Masci


This paper presents an end-to-end differentiable algorithm for robust and detail-preserving surface normal estimation on unstructured point-clouds. We utilize graph neural networks to iteratively parameterize an adaptive anisotropic kernel that produces point weights for weighted least-squares plane fitting in local neighborhoods. The approach retains the interpretability and efficiency of traditional sequential plane fitting while benefiting from adaptation to data set statistics through deep learning. This results in a state-of-the-art surface normal estimator that is robust to noise, outliers and point density variation, preserves sharp features through anisotropic kernels and equivariance through a local quaternion-based spatial transformer. Contrary to previous deep learning methods, the proposed approach does not require any hand-crafted features or preprocessing. It improves on the state-of-the-art results while being more than two orders of magnitude faster and more parameter efficient.
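
The learned part of the pipeline predicts per-point weights; the classical core is a weighted least-squares plane fit in each neighborhood. A minimal NumPy sketch of that inner step (the graph-network weight prediction and the quaternion-based spatial transformer are omitted, and the helper name is mine):

import numpy as np

def weighted_plane_normal(neighbors, weights):
    """One weighted least-squares plane fit, the inner step the abstract
    describes. neighbors: (k, 3), weights: (k,) nonnegative."""
    w = weights / (weights.sum() + 1e-12)
    centroid = (w[:, None] * neighbors).sum(axis=0)
    centered = neighbors - centroid
    cov = centered.T @ (w[:, None] * centered)       # 3x3 weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    return eigvecs[:, 0]                             # normal = smallest-variance direction

# Example: noisy samples from the z = 0 plane recover a normal close to (0, 0, 1).
pts = np.random.randn(64, 3) * np.array([1.0, 1.0, 0.01])
print(weighted_plane_normal(pts, np.ones(64)))
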
[graph, dataset, recognition, observed] [edge, table] [noise, robust, model, input, trained, iterative] [sharp, kernel, figure, method, ieee, proposed, low, pattern, comparison, high, presented, traditional, spatial] [consists, jan] [deep, learning, neural, network, algorithm, set, matrix, density, large, size, data, problem, function, test, weighted, number, better, training, optimization, processing, architecture, arg, min, average] [point, normal, surface, neighborhood, estimation, computer, approach, conference, local, pca, vision, varying, fitting, cloud, plane, pcpnet, rotation, unoriented, error, reconstruction, angle, differentiable, international, unstructured, geometric, solution, provided, iteratively, estimating]
@InProceedings{Lenssen_2020_CVPR,
  author = {Lenssen, Jan Eric and Osendorfer, Christian and Masci, Jonathan},
  title = {Deep Iterative Surface Normal Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dataless Model Selection With the Deep Frame Potential
Calvin Murdock, Simon Lucey


Choosing a deep neural network architecture is a fundamental problem in applications that require balancing performance and parameter efficiency. Standard approaches rely on ad-hoc engineering or computationally expensive validation on a specific dataset. We instead attempt to quantify networks by their intrinsic capacity for unique and robust representations, enabling efficient architecture comparisons without requiring any data. Building upon theoretical connections between deep learning and sparse approximation, we propose the deep frame potential: a measure of coherence that is approximately related to representation stability but has minimizers that depend only on network structure. This provides a framework for jointly quantifying the contributions of architectural hyper-parameters such as depth, width, and skip connections. We validate its use as a criterion for model selection and demonstrate correlation with generalization error on a variety of common residual and densely connected network architectures.
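
As a rough illustration of a dataless, coherence-style measure of a weight matrix, the sketch below computes the frame potential and mutual coherence of a layer's column-normalized atoms. How the paper combines such quantities across depth, width, and skip connections is not reproduced here; the function is a hypothetical helper, not the authors' definition.

import numpy as np

def frame_potential_and_coherence(W):
    """Frame potential (sum of squared pairwise inner products of normalized
    atoms) and mutual coherence (largest off-diagonal Gram entry) of W,
    where columns of W are treated as atoms."""
    A = W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-12)
    G = A.T @ A                                   # Gram matrix of normalized atoms
    potential = (G ** 2).sum()                    # squared Frobenius norm of G
    off_diag = G - np.diag(np.diag(G))
    coherence = np.abs(off_diag).max()
    return potential, coherence

print(frame_potential_and_coherence(np.random.randn(128, 512)))
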
[frame, connected, provide, individual] [correlation, propose] [model, coherence, generalization, input, effective, case, theory, overcomplete, sensitivity, trained, norm] [residual, skip, convolutional, densely, comparison, coding, figure, analysis, low] [gram, corresponding, representation] [deep, network, group, potential, minimum, validation, dictionary, chain, number, lower, learning, neural, mutual, data, width, theoretical, layer, parameter, matrix, architecture, capacity, nonzero, induced, approximation, training, bound, increasing, normalized, base, machine, performance, empirical, efficiency, optimization, selection, better, approximate, nonnegative, regularization, reduce, improved, arg] [sparse, error, conference, structure, local, international, compare]
@InProceedings{Murdock_2020_CVPR,
  author = {Murdock, Calvin and Lucey, Simon},
  title = {Dataless Model Selection With the Deep Frame Potential},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
UNAS: Differentiable Architecture Search Meets Reinforcement Learning
Arash Vahdat, Arun Mallya, Ming-Yu Liu, Jan Kautz


Neural architecture search (NAS) aims to discover network architectures with desired properties such as high accuracy or low latency. Recently, differentiable NAS (DNAS) has demonstrated promising results while maintaining a search cost orders of magnitude lower than reinforcement learning (RL) based NAS. However, DNAS models can only optimize differentiable loss functions in search, and they require an accurate differentiable approximation of non-differentiable criteria. In this work, we present UNAS, a unified framework for NAS, that encapsulates recent DNAS and RL-based approaches under one framework. Our framework brings the best of both worlds, and it enables us to search for architectures with both differentiable and non-differentiable criteria in one unified framework while maintaining a low search cost. Further, we introduce a new objective function for search based on the generalization gap that prevents the selection of architectures prone to overfitting. We present extensive experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets and we perform search in two fundamentally different search spaces. We show that UNAS obtains the state-of-the-art average accuracy on all three datasets when compared to the architectures searched in the DARTS space. Moreover, we show that UNAS can find an efficient and accurate architecture in the ProxylessNAS search space, that outperforms existing MobileNetV2 based architectures. The source code is available at https://github.com/NVlabs/unas.
[work, previous, node, reinforcement, three] [framework, table, categorical] [generalization, reparameterization, model, input, correlated] [cell, skip, low, based, introduced, convolutional, comparison, figure, proposed] [loss, gap, image] [architecture, search, unas, gradient, network, latency, function, neural, objective, best, distribution, proxylessnas, discovered, number, operation, imagenet, performance, training, validation, learning, space, efficient, reinforce, snas, quoc, approximation, expected, optimizing, andrew, discrete, set, rebar, updating, arxiv, preprint, find, problem, gpu, applied, variance] [differentiable, estimator, accurate, continuous, well, error, estimation, cost, structure]
@InProceedings{Vahdat_2020_CVPR,
  author = {Vahdat, Arash and Mallya, Arun and Liu, Ming-Yu and Kautz, Jan},
  title = {UNAS: Differentiable Architecture Search Meets Reinforcement Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Local Context Normalization: Revisiting Local Normalization
Anthony Ortiz, Caleb Robinson, Dan Morris, Olac Fuentes, Christopher Kiekintveld, Md Mahmudulla Hassan, Nebojsa Jojic


Normalization layers have been shown to improve convergence in deep neural networks, and even add useful inductive biases. In many vision applications the local spatial context of the features is important, but most common normalization schemes including Group Normalization (GN), Instance Normalization (IN), and Layer Normalization (LN) normalize over the entire spatial dimension of a feature. This can wash out important signals and degrade performance. For example, in applications that use satellite imagery, input images can be arbitrarily large; consequently, it is nonsensical to normalize over the entire area. Positional Normalization (PN), on the other hand, only normalizes over a single spatial position at a time. A natural compromise is to normalize features by local context, while also taking into account group level information. In this paper, we propose Local Context Normalization (LCN): a normalization layer where every feature is normalized based on a window around it and the filters in its group. We propose an algorithmic solution to make LCN efficient for arbitrary window sizes, even if every point in the image has a unique window. LCN outperforms its Batch Normalization (BN), GN, IN, and LN counterparts for object detection, semantic segmentation, and instance segmentation applications in several benchmark datasets, while keeping performance independent of the batch size and facilitating transfer learning.
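
A rough PyTorch sketch of the idea: normalize each feature with the mean and variance taken over a spatial window around it and over the channels of its group. The paper's integral-image formulation for arbitrary windows and its learned affine parameters are omitted; the window size, padding mode, and grouping below are illustrative assumptions.

import torch
import torch.nn.functional as F

def local_context_norm(x, groups=2, window=9, eps=1e-5):
    """Normalize x (N, C, H, W) with statistics over a local spatial window
    and over the channels of each group; groups is assumed to divide C."""
    n, c, h, w = x.shape
    xg = x.view(n, groups, c // groups, h, w)
    mean_c = xg.mean(dim=2)                       # per-position mean over the group's channels
    sq_c = (xg ** 2).mean(dim=2)
    pad = window // 2

    def box(t):                                   # box filter over the spatial window
        return F.avg_pool2d(F.pad(t, (pad, pad, pad, pad), mode='reflect'),
                            window, stride=1)

    mean = box(mean_c)                            # local mean over window x group channels
    var = box(sq_c) - mean ** 2                   # local variance
    out = (xg - mean.unsqueeze(2)) / torch.sqrt(var.unsqueeze(2) + eps)
    return out.view(n, c, h, w)
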
[context, dataset, microsoft, outperforms, visual] [lcn, table, feature, segmentation, object, semantic, instance, detection, global, iou, miou, coco, aerial, imagery, labeling, including, propose] [input, trained, model, chosen] [window, spatial, proposed, pixel, ieee, contrast, pattern, land, method, convolutional, figure, nebojsa, channel, based, dilated] [image, train, mapping, keeping, transfer] [normalization, size, small, performance, batch, group, learning, number, neural, training, set, deep, layer, variance, test, rate, data, imagenet, class, convergence, normalizes, normalized, implementation, arxiv, preprint, entire, dimension, efficient, network, best, classification, normalize, integral] [local, computer, conference, vision, cover, computed, international]
@InProceedings{Ortiz_2020_CVPR,
  author = {Ortiz, Anthony and Robinson, Caleb and Morris, Dan and Fuentes, Olac and Kiekintveld, Christopher and Hassan, Md Mahmudulla and Jojic, Nebojsa},
  title = {Local Context Normalization: Revisiting Local Normalization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ACNe: Attentive Context Normalization for Robust Permutation-Equivariant Learning
Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, Kwang Moo Yi


Many problems in computer vision require dealing with sparse, unordered data in the form of point clouds. Permutation-equivariant networks have become a popular solution - they operate on individual data points with simple perceptrons and extract contextual information with global pooling. This can be achieved with a simple normalization of the feature maps, a global operation that is unaffected by the order. In this paper, we propose Attentive Context Normalization (ACN), a simple yet effective technique to build permutation-equivariant networks robust to outliers. Specifically, we show how to normalize the feature maps with weights that are estimated within the network, excluding outliers from this normalization. We use this mechanism to leverage two types of attention: local and global - by combining them, our method is able to find the essential data points in high-dimensional space in order to solve a given task. We demonstrate through extensive experiments that our approach, which we call Attentive Context Networks (ACNe), provides a significant leap in performance compared to the state-of-the-art on camera pose estimation, robust fitting, and point cloud classification under noise and outliers. Source code: https://github.com/vcg-uvic/acne.
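
The mechanism the abstract describes amounts to normalizing a permutation-equivariant feature map with weighted moments, where the weights are attention scores produced by the network so that likely outliers contribute little. A compact sketch under those assumptions (the attention branch itself is not shown, and this is not the authors' implementation):

import torch

def attentive_context_norm(features, attention, eps=1e-6):
    """features: (B, N, C) per-point features; attention: (B, N) nonnegative
    scores, e.g. the output of a small network passed through a sigmoid or
    softmax. Returns features normalized by weighted mean and variance."""
    w = attention / (attention.sum(dim=1, keepdim=True) + eps)   # weights sum to 1 per set
    w = w.unsqueeze(-1)                                          # (B, N, 1)
    mean = (w * features).sum(dim=1, keepdim=True)
    var = (w * (features - mean) ** 2).sum(dim=1, keepdim=True)
    return (features - mean) / torch.sqrt(var + eps)
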
[attention, context, multiple, order, work, mechanism, latexit] [feature, attentive, table, global, apply, focus, map, add] [robust, effective] [method, pattern, performs, proposed, tensor, output, convolutional, residual] [image, perform, train, generate, loss] [normalization, deep, network, neural, data, classification, learning, performance, ratio, note, learned, consider, training, best, number, problem, architecture, sample, vector, matrix, better, test, normalize, layer, evaluate, group, report, simple, small, standard, weight, set] [point, acne, local, cne, cloud, stereo, pointnet, ransac, fundamental, well, outlier, relative, fitting, acn, approach, estimation, oanet, pose, ground, additional, form, solution, demonstrate, camera, truth, inliers]
@InProceedings{Sun_2020_CVPR,
  author = {Sun, Weiwei and Jiang, Wei and Trulls, Eduard and Tagliasacchi, Andrea and Yi, Kwang Moo},
  title = {ACNe: Attentive Context Normalization for Robust Permutation-Equivariant Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Situational Driving
Eshed Ohn-Bar, Aditya Prakash, Aseem Behl, Kashyap Chitta, Andreas Geiger


Human drivers have a remarkable ability to drive in diverse visual conditions and situations, e.g., from maneuvering in rainy, limited visibility conditions with no lane markings to turning in a busy intersection while yielding to pedestrians. In contrast, we find that state-of-the-art sensorimotor driving models struggle when encountering diverse settings with varying relationships between observation and action. To generalize when making decisions across diverse conditions, humans leverage multiple types of situation-specific reasoning and learning strategies. Motivated by this observation, we develop a framework for learning a situational driving policy that effectively captures reasoning under varying types of scenarios. Our key idea is to learn a mixture model with a set of policies that can capture multiple driving modes. We first optimize the mixture model through behavior cloning and show it to result in significant gains in terms of driving performance in diverse conditions. We then refine the model by directly optimizing for the driving task itself, i.e., supervised with the navigation task reward. Our method is more scalable than methods assuming access to privileged information, e.g., perception labels, as it only assumes demonstration and reward-based supervision. We achieve over 98% success rate on the CARLA driving benchmark as well as state-of-the-art performance on a newly introduced generalization benchmark.
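
The central modeling idea, a set of mode-specific policies combined by a learned gate, can be sketched as below. This only illustrates the mixture structure; the perception backbone, behavior cloning, and reward-based refinement described in the abstract are not modeled, and all layer sizes are placeholders.

import torch
import torch.nn as nn

class MixtureOfPolicies(nn.Module):
    """Several policy heads capture different driving modes; a gating network
    produces mixture weights that blend their outputs."""
    def __init__(self, feat_dim=512, action_dim=3, num_modes=4):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(feat_dim, num_modes), nn.Softmax(dim=-1))
        self.policies = nn.ModuleList(
            [nn.Linear(feat_dim, action_dim) for _ in range(num_modes)])

    def forward(self, feats):                        # feats: (B, feat_dim)
        pi = self.gate(feats)                        # (B, num_modes) mixture weights
        actions = torch.stack([p(feats) for p in self.policies], dim=1)  # (B, K, A)
        return (pi.unsqueeze(-1) * actions).sum(dim=1)                   # (B, A)
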
[driving, policy, behavior, expert, situational, moe, navigation, imitation, town, context, visual, cloning, carla, agent, cilrs, step, embedding, evaluation, reinforcement, sensorimotor, three, static, monolithic, reasoning, privileged, action, drive, multiple, perception, work, hierarchical] [table, autonomous, benchmark, employ, improves, framework, lsd] [model, generalization, success, trained, improve, analyze, access] [dynamic, weather, proposed, comparison] [diverse, learn, control, loss, image, generalize, train] [learning, performance, training, task, mixture, learned, neural, network, optimization, data, set, processing, respect, rate, requires, objective] [approach, directly, additional, varying, leverage, well]
@InProceedings{Ohn-Bar_2020_CVPR,
  author = {Ohn-Bar, Eshed and Prakash, Aditya and Behl, Aseem and Chitta, Kashyap and Geiger, Andreas},
  title = {Learning Situational Driving},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
From Depth What Can You See? Depth Completion via Auxiliary Image Reconstruction
Kaiyue Lu, Nick Barnes, Saeed Anwar, Liang Zheng


Depth completion recovers dense depth from sparse measurements, e.g., LiDAR. Existing depth-only methods use sparse depth as the only input. However, these methods may fail to recover semantically consistent boundaries, or small/thin objects, due to 1) the sparse nature of depth points and 2) the lack of images to provide semantic cues. This paper continues this line of research and aims to overcome the above shortcomings. The unique design of our depth completion model is that it simultaneously outputs a reconstructed image and a dense depth map. Specifically, we formulate image reconstruction from sparse depth as an auxiliary task during training that is supervised by the unlabelled gray-scale images. During testing, our system accepts sparse depth as the only input, i.e., the image is not required. Our design allows the depth completion network to learn complementary image features that help to better understand object structures. The extra supervision incurred by image reconstruction is minimal, because no annotations other than the image are needed. We evaluate our method on the KITTI depth completion benchmark and show that depth completion can be significantly improved via the auxiliary supervision of image reconstruction. Our algorithm consistently outperforms depth-only methods and is also effective for indoor scenes like NYUv2.
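
Training-wise, the abstract boils down to a two-headed network supervised by a depth loss plus an auxiliary image-reconstruction loss that is dropped at test time. A hedged sketch of such a joint objective (the specific loss forms and the weighting factor are my assumptions, not the paper's exact choices):

import torch
import torch.nn.functional as F

def completion_loss(pred_depth, pred_image, gt_depth, gray_image, lam=0.1):
    """Depth supervision on valid (nonzero) ground-truth pixels plus an
    auxiliary L1 reconstruction loss against the unlabelled gray-scale image."""
    valid = (gt_depth > 0).float()                   # sparse ground truth: supervise valid pixels only
    depth_loss = (valid * (pred_depth - gt_depth) ** 2).sum() / (valid.sum() + 1e-6)
    recon_loss = F.l1_loss(pred_image, gray_image)
    return depth_loss + lam * recon_loss
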
[provide, recognition, visual] [semantic, feature, object, module, benchmark, segmentation, map, cnn, lidar, table, detection] [model, auxiliary, input, complementary, primary] [ieee, pattern, recover, existing, comparison, method, figure, quantitative, mae, fusion, performs, june, output] [image, loss, shared, lack, supervised, corresponding, semantically, learn] [learning, network, better, training, sharing, performance, smaller, task, data, deep, best, larger, general] [depth, completion, sparse, conference, computer, vision, reconstruction, dense, rmse, international, kitti, rgb, ground, consistent, reconstructed, truth, indoor, additional, irmse, well, structure, distant, scene]
@InProceedings{Lu_2020_CVPR,
  author = {Lu, Kaiyue and Barnes, Nick and Anwar, Saeed and Zheng, Liang},
  title = {From Depth What Can You See? Depth Completion via Auxiliary Image Reconstruction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Symmetry and Group in Attribute-Object Compositions
Yong-Lu Li, Yue Xu, Xiaohan Mao, Cewu Lu


Attributes and objects can compose diverse compositions. To model the compositional nature of these general concepts, it is a good choice to learn them through transformations, such as coupling and decoupling. However, complex transformations need to satisfy specific principles to guarantee the rationality. In this paper, we first propose a previously ignored principle of attribute-object transformation: Symmetry. For example, coupling peeled-apple with attribute peeled should result in peeled-apple, and decoupling peeled from apple should still output apple. Incorporating the symmetry principle, a transformation framework inspired by group theory is built, i.e. SymNet. SymNet consists of two modules, Coupling Network and Decoupling Network. With the group axioms and symmetry property as objectives, we adopt Deep Neural Networks to implement SymNet and train it in an end-to-end paradigm. Moreover, we propose a Relative Moving Distance (RMD) based recognition method to utilize the attribute change instead of the attribute pattern itself to classify attributes. Our symmetry learning can be utilized for the Compositional Zero-Shot Learning task and outperforms the state-of-the-art on widely-used benchmarks. Code is available at https://github.com/DirtyHarryLYL/SymNet.
[embedding, moving, visual, recognition, embeddings, word, pair, compositional, construct, previous, retrieval, cewu, element, inspired, outperforms] [object, propose, framework, semantic, category] [model, input, identity, theory, apple, change, typical] [coupling, method, based, output, figure] [attribute, foi, symnet, image, fox, czsl, unseen, con, composition, specific, peeled, invertibility, latent, decon, loss, learn, satisfy, compositionality, train, lsym, utilize, address, operate, perform] [group, learning, decoupling, space, network, linear, operation, set, classification, test, evaluate, deep, implement, task, comparing, training, product, sample] [symmetry, distance, relative, transformation, property, define, well]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yong-Lu and Xu, Yue and Mao, Xiaohan and Lu, Cewu},
  title = {Symmetry and Group in Attribute-Object Compositions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Noise-Aware Fully Webly Supervised Object Detection
Yunhang Shen, Rongrong Ji, Zhiwei Chen, Xiaopeng Hong, Feng Zheng, Jianzhuang Liu, Mingliang Xu, Qi Tian


We investigate the emerging task of learning object detectors with sole image-level labels on the web without requiring any other supervision like precise annotations or additional images from well-annotated benchmark datasets. Such a task, termed as fully webly supervised object detection, is extremely challenging, since image-level labels on the web are always noisy, leading to poor performance of the learned detectors. In this work, we propose an end-to-end framework to jointly learn webly supervised detectors and reduce the negative impact of noisy labels. Such noise is heterogeneous, which is further categorized into two types, namely background noise and foreground noise. Regarding the background noise, we propose a residual learning structure incorporated with weakly supervised detection, which decomposes background noise and models clean data. To explicitly learn the residual feature between clean data and noisy labels, we further propose a spatially-sensitive entropy criterion, which exploits the conditional distribution of detection results to estimate the confidence of background categories being noise. Regarding the foreground noise, a bagging-mixup learning is introduced, which suppresses foreground noisy signals from incorrectly labelled images, whilst maintaining the diversity of training data. We evaluate the proposed approach on popular benchmark datasets by training detectors on web images, which are retrieved by the corresponding category tags from photo-sharing sites. Extensive experiments show that our method achieves significant improvements over the state-of-the-art methods.
[sse, multiple, three] [object, detection, web, background, foreground, weakly, voc, pascal, head, webly, confidence, category, wsddn, framework, coco, proposal, detector, fully, propose, bounding, flickr, achieves, instance, semantic, wsod, fwebsod, segmentation, table] [noise, trained, model, clean, datasets, original, google] [noisy, method, proposed, residual, figure, spatial, convolutional, result] [supervised, image, learn, corresponding, target, domain, whilst, synthetic] [learning, training, data, entropy, label, set, deep, reduce, negative, impact, performance, classification, baseline, criterion, test, distribution, evaluate, compared] [estimate, additional, handle, approach]
@InProceedings{Shen_2020_CVPR,
  author = {Shen, Yunhang and Ji, Rongrong and Chen, Zhiwei and Hong, Xiaopeng and Zheng, Feng and Liu, Jianzhuang and Xu, Mingliang and Tian, Qi},
  title = {Noise-Aware Fully Webly Supervised Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
3D Part Guided Image Editing for Fine-Grained Object Understanding
Zongdai Liu, Feixiang Lu, Peng Wang, Hui Miao, Liangjun Zhang, Ruigang Yang, Bin Zhou


Holistically understanding an object with its 3D movable parts is essential for visual models of a robot to interact with the world. For example, only by understanding many possible part dynamics of other vehicles (e.g., door or trunk opening, taillight blinking for changing lanes) can a self-driving vehicle succeed in dealing with emergency cases. However, existing visual models rarely tackle these situations and instead focus on bounding box detection. In this paper, we fill this important missing piece in autonomous driving by solving two critical issues. First, for dealing with data scarcity, we propose an effective training data generation process by fitting a 3D car model with dynamic parts to cars in real images. This allows us to directly edit the real images using the aligned 3D parts, yielding effective training data for learning robust deep neural networks (DNNs). Secondly, to benchmark the quality of 3D part understanding, we collected a large dataset of real driving scenarios with cars in uncommon states (CUS), i.e. with the door or trunk opened, etc., on which our network trained with edited images largely outperforms other baselines in terms of 2D detection and instance segmentation accuracy.
[dataset, understanding, state, environment, driving, work, vehicle, movable, visual] [car, object, uncommon, detection, instance, segmentation, benchmark, backbone, semantic, autonomous, door, map, labelled, region, peng, guided, bounding, apolloscape, parsing] [model, datasets, trained, invisible, study] [ieee, dynamic, pattern, existing, figure, output, motion, reverse, trunk] [editing, real, image, train, synthetic, perform, domain, generate, generation, gap, edited, common] [network, data, training, deep, baseline, number, performance, large, arxiv, preprint, neural, evaluate, learning, amount, manually] [computer, conference, vision, rendering, international, directly, single, pose, kitti, scene]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Zongdai and Lu, Feixiang and Wang, Peng and Miao, Hui and Zhang, Liangjun and Yang, Ruigang and Zhou, Bin},
  title = {3D Part Guided Image Editing for Fine-Grained Object Understanding},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
STINet: Spatio-Temporal-Interactive Network for Pedestrian Detection and Trajectory Prediction
Zhishuai Zhang, Jiyang Gao, Junhua Mao, Yukai Liu, Dragomir Anguelov, Congcong Li


Detecting pedestrians and predicting future trajectories for them are critical tasks for numerous applications, such as autonomous driving. Previous methods either treat the detection and prediction as separate tasks or simply add a trajectory regression head on top of a detector. In this work, we present a novel end-to-end two-stage network: Spatio-Temporal-Interactive Network (STINet). In addition to 3D geometry modeling of pedestrians, we model the temporal information for each of the pedestrians. To do so, our method predicts both current and past locations in the first stage, so that each pedestrian can be linked across frames and the comprehensive spatio-temporal information can be captured in the second stage. Also, we model the interaction among objects with an interaction graph, to gather the information among the neighboring objects. Comprehensive experiments on the Lyft Dataset and the recently released large-scale Waymo Open Dataset for both object detection and future trajectory prediction validate the effectiveness of the proposed method. For the Waymo Open Dataset, we achieve a bird-eyes-view (BEV) detection AP of 80.73 and trajectory prediction average displacement error (ADE) of 33.67cm for pedestrians, which establish the state-of-the-art for both tasks.
[trajectory, prediction, future, temporal, current, history, interaction, dataset, modeling, predict, graph, sequence, three, frame, relational, ade, build, explicitly, movement, action, video, length, node] [detection, feature, object, proposal, backbone, regression, stinet, table, pedestrian, pillar, box, waymo, lyft, intentnet, effectiveness, propose, bev, predicted, anchor, sti, region, iou, reg] [model, comprehensive, input] [proposed, ieee, figure, method, dynamic, pattern, based, convolutional] [generate, corresponding, loss, train] [network, classification, path, open, performance, inference, average, better, neural, indicates, size] [local, conference, point, computer, geometry, vision, well, predicts, international]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Zhishuai and Gao, Jiyang and Mao, Junhua and Liu, Yukai and Anguelov, Dragomir and Li, Congcong},
  title = {STINet: Spatio-Temporal-Interactive Network for Pedestrian Detection and Trajectory Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Rethinking Performance Estimation in Neural Architecture Search
Xiawu Zheng, Rongrong Ji, Qiang Wang, Qixiang Ye, Zhenguo Li, Yonghong Tian, Qi Tian


Neural architecture search (NAS) remains a challenging problem, which is attributed to the indispensable and time-consuming component of performance estimation (PE). In this paper, we provide a novel yet systematic rethinking of PE in a resource constrained regime, termed budgeted PE (BPE), which precisely and effectively estimates the performance of an architecture sampled from an architecture space. Since searching an optimal BPE is extremely time-consuming as it requires training a large number of networks for evaluation, we propose a Minimum Importance Pruning (MIP) approach. Given a dataset and a BPE search space, MIP estimates the importance of hyper-parameters using a random forest and subsequently prunes the least important one from the next iteration. In this way, MIP effectively prunes less important hyper-parameters to allocate more computational resources to more important ones, thus achieving an effective exploration. By combining BPE with various search algorithms including reinforcement learning, evolutionary algorithms, random search, and differentiable architecture search, we achieve a 1,000x NAS speedup with a negligible performance drop compared to the SOTA. All the NAS search code is available at: https://github.com/zhengxiawu/rethinking_performance_estimation_in_NAS
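
The MIP step can be illustrated with a small sketch: fit a random forest from sampled BPE hyper-parameter settings to a measured quality score, then prune the hyper-parameter with the lowest importance before the next iteration. Everything below (hyper-parameter names, data, sizes) is synthetic and for illustration only.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

hyperparam_names = ["epochs", "batch_size", "image_size", "channels", "layers"]
X = np.random.rand(200, len(hyperparam_names))   # sampled BPE settings (normalized)
y = np.random.rand(200)                          # measured proxy quality per setting

# Importance of each hyper-parameter, read off a fitted random forest.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
least_important = hyperparam_names[int(np.argmin(forest.feature_importances_))]
print("prune next:", least_important)
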
[time, reinforcement, previous, node, dataset, evaluation] [correlation, global, including, denotes, regression] [trained, effective, example, termed] [method, based, proposed, cell, convolutional, partition] [image, corresponding, train, specific] [search, architecture, performance, random, training, bpe, neural, learning, space, optimal, minimum, number, network, set, pruning, optimization, sampled, efficient, algorithm, size, large, forest, evolution, find, sampling, sample, parameter, lowest, validation, rongrong, budgeted, batch, evaluate, rank, operation, computational, gpu, consumption, hyperparameter, spearman, xiawu, extremely, deep, epoch, rate, fewer, function, process, barret, quoc, comparing, evolutionary] [estimation, local, cost, estimate, well]
@InProceedings{Zheng_2020_CVPR,
  author = {Zheng, Xiawu and Ji, Rongrong and Wang, Qiang and Ye, Qixiang and Li, Zhenguo and Tian, Yonghong and Tian, Qi},
  title = {Rethinking Performance Estimation in Neural Architecture Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Feature-Metric Registration: A Fast Semi-Supervised Approach for Robust Point Cloud Registration Without Correspondences
Xiaoshui Huang, Guofeng Mei, Jian Zhang


We present a fast feature-metric point cloud registration framework, which drives the optimisation of registration by minimising a feature-metric projection error without correspondences. The advantage of the feature-metric projection error is that it is robust to noise, outliers and density differences, in contrast to the geometric projection error. Besides, minimising the feature-metric projection error does not require searching for correspondences, so the optimisation is fast. The principle behind the proposed method is that the feature difference is smallest when the point clouds are well aligned. We train the proposed method in a semi-supervised or unsupervised manner, which requires limited or no registration label data. Experiments demonstrate that our method obtains higher accuracy and robustness than state-of-the-art methods. Experimental results also show that the proposed method can handle significant noise and density differences, and solve both same-source and cross-source point cloud registration.
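
The quantity being minimized is a feature difference rather than a point-to-point distance, so no correspondence search is needed. The sketch below shows one way such a feature-metric error could look, with a toy permutation-invariant encoder standing in for the paper's learned feature extractor; it is not the authors' implementation.

import torch
import torch.nn as nn

class FeatureMetricError(nn.Module):
    """Encode both clouds with a shared permutation-invariant extractor and
    measure the feature difference after applying the current rigid transform."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128))

    def encode(self, pts):                       # pts: (B, N, 3) -> (B, 128)
        return self.mlp(pts).max(dim=1).values   # max pooling keeps permutation invariance

    def forward(self, source, target, R, t):     # R: (B, 3, 3), t: (B, 3)
        moved = source @ R.transpose(1, 2) + t.unsqueeze(1)
        return (self.encode(moved) - self.encode(target)).pow(2).sum(dim=1).mean()
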
[dataset, decoder] [feature, framework, module, branch, obtains, global] [difference, input, noise, robust, trained] [method, figure, proposed, comparison, ieee, pattern, classical, column, extraction, jacobian, range, gaussian] [unsupervised, encoder, learn, loss, train, alignment, proposes, generate, image, row, minimizing] [learning, network, deep, density, training, better, problem, neural, matrix, best, process, data, large, performance, algorithm, principle, accuracy, searching] [registration, point, cloud, transformation, error, projection, solve, vision, computer, conference, geometric, pointnetlk, local, estimation, rotation, approach, distinctive, directly, initial, optimisation, estimate, distance, partial, correspondence, demonstrate, handle, solving, euclidean]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Xiaoshui and Mei, Guofeng and Zhang, Jian},
  title = {Feature-Metric Registration: A Fast Semi-Supervised Approach for Robust Point Cloud Registration Without Correspondences},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Multi-View Camera Relocalization With Graph Neural Networks
Fei Xue, Xin Wu, Shaojun Cai, Junqiu Wang


We propose to construct a view graph to excavate the information of the whole given sequence for absolute camera pose estimation. Specifically, we harness GNNs to model the graph, allowing even non-consecutive frames to exchange information with each other. Rather than adopting the regular GNNs directly, we redefine the nodes, edges, and embedded functions to fit the relocalization task. Redesigned GNNs cooperate with CNNs in guiding knowledge propagation and feature extraction respectively to process multi-view high-dimension image features iteratively at different levels. Besides, a general graph-based loss function beyond constraints between consecutive views is employed for training the network in an end-to-end fashion. Extensive experiments conducted on both indoor and outdoor datasets demonstrate that our method outperforms previous approaches especially in large-scale and challenging scenarios.
[graph, gnns, message, multiple, node, lsg, visual, mapnet, dataset, previous, gnn, oxford, passing, temporal, concatenated, modeling, lstms, sequence, exchange, outperforms, sequential, contribute, connected, attention] [feature, edge, table, localization, challenging, pooling, regression] [model] [method, extraction, cnns, consecutive, convolutional, utilized, dynamic, figure, output, channel, xin] [image, loss, consistency, source] [neural, deep, function, learning, performance, network, process, number, updating, size] [pose, camera, relocalization, absolute, posenet, orientation, position, robotcar, error, cambridge, scene, relative, full, torsten, single, local, defined, marc]
@InProceedings{Xue_2020_CVPR,
  author = {Xue, Fei and Wu, Xin and Cai, Shaojun and Wang, Junqiu},
  title = {Learning Multi-View Camera Relocalization With Graph Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird's Eye View Maps
Pengxiang Wu, Siheng Chen, Dimitris N. Metaxas


The ability to reliably perceive the environmental states, particularly the existence of objects and their motion behavior, is crucial for autonomous driving. In this work, we propose an efficient deep model, called MotionNet, to jointly perform perception and motion prediction from 3D point clouds. MotionNet takes a sequence of LiDAR sweeps as input and outputs a bird's eye view (BEV) map, which encodes the object category and motion information in each grid cell. The backbone of MotionNet is a novel spatio-temporal pyramid network, which extracts deep spatial and temporal features in a hierarchical fashion. To enforce the smoothness of predictions over both space and time, the training of MotionNet is further regularized with novel spatial and temporal consistency losses. Extensive experiments show that the proposed method overall outperforms the state of the art, including the latest scene-flow- and 3D-object-detection-based methods. This indicates the potential value of the proposed method serving as a backup to the bounding-box-based system, and providing complementary information to the motion planner in autonomous driving. Code is available at https://www.merl.com/research/license#MotionNet.
[temporal, prediction, motionnet, static, time, state, speed, future, current, perception, trajectory, sequence, vehicle, represent, predict, provide, frame, environmental, spatiotemporal, hierarchical] [object, bev, detection, table, stc, category, autonomous, bounding, feature, lidar, pyramid, predicted, box, map, stpn, pooling, raquel, nuscenes, denotes] [model, input, complementary] [motion, ieee, pattern, cell, spatial, flow, method, based, output, fusion, convolutional, proposed] [consistency, representation] [training, network, classification, performance, learning, accuracy, deep, data, number, task, best, neural, space, note, processing, consider] [conference, point, computer, vision, cloud, grid, international, estimation, occupancy, scene, novel, smooth]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Pengxiang and Chen, Siheng and Metaxas, Dimitris N.},
  title = {MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird's Eye View Maps},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
EcoNAS: Finding Proxies for Economical Neural Architecture Search
Dongzhan Zhou, Xinchi Zhou, Wenwei Zhang, Chen Change Loy, Shuai Yi, Xuesen Zhang, Wanli Ouyang


Neural Architecture Search (NAS) achieves significant progress in many computer vision tasks. While many methods have been proposed to improve the efficiency of NAS, the search process is still laborious because training and evaluating plausible architectures over a large search space is time-consuming. Assessing network candidates under a proxy (i.e., a computationally reduced setting) thus becomes inevitable. In this paper, we observe that most existing proxies exhibit different behaviors in maintaining the rank consistency among network candidates. In particular, some proxies can be more reliable - the rank of candidates does not differ much when comparing their reduced-setting performance and final performance. We systematically investigate some widely adopted reduction factors and report our observations. Inspired by these observations, we present a reliable proxy and further formulate a hierarchical proxy strategy that spends more computation on candidate networks that are potentially more accurate, while discarding unpromising ones at an early stage with a fast proxy. This leads to an economical evolutionary-based NAS (EcoNAS), which achieves an impressive 400x search time reduction in comparison to the evolutionary-based state of the art [19] (8 vs. 3150 GPU days). Some new proxies suggested by our observations can also be applied to accelerate other NAS methods while still discovering good candidate networks with performance matching those found by previous proxy strategies. Codes and models will be released to facilitate future research.
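
Rank consistency of a proxy can be quantified exactly as the abstract suggests: rank the same candidate architectures under the reduced setting and under full training, then compare the rankings, for example with a Spearman correlation. The accuracies below are made-up numbers for illustration.

import numpy as np
from scipy.stats import spearmanr

# Full-training accuracies of a few candidate architectures, and the accuracies
# the same candidates reach under a cheap proxy (e.g. fewer epochs, smaller images).
full_training_acc = np.array([93.1, 92.4, 94.0, 91.8, 93.5, 92.9])
proxy_acc = np.array([88.0, 86.9, 88.7, 85.5, 88.1, 87.4])

rho, _ = spearmanr(proxy_acc, full_training_acc)
print(f"rank consistency of this proxy: {rho:.2f}")   # closer to 1.0 = more reliable proxy
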
[hierarchical, previous, time, overhead] [table] [model, original, input, trained, refers] [figure, resolution, based, method] [consistency, train, image, corresponding] [reduced, search, training, proxy, setting, architecture, reduction, number, rank, neural, network, strategy, econas, reducing, fewer, good, reliable, evaluate, computation, large, accuracy, top, performance, gpu, reduce, amoebanet, set, algorithm, evolution, quoc, searching, ratio, design, indicates, population, space, sample, learning, searched, average, entropy, retraining, economical, promising, best, imagenet, randomly] [consistent, cost, error]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Dongzhan and Zhou, Xinchi and Zhang, Wenwei and Loy, Chen Change and Yi, Shuai and Zhang, Xuesen and Ouyang, Wanli},
  title = {EcoNAS: Finding Proxies for Economical Neural Architecture Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hit-Detector: Hierarchical Trinity Architecture Search for Object Detection
Jianyuan Guo, Kai Han, Yunhe Wang, Chao Zhang, Zhaohui Yang, Han Wu, Xinghao Chen, Chang Xu


Neural Architecture Search (NAS) has achieved great success in the image classification task. Some recent works have managed to explore the automatic design of efficient backbones or feature fusion layers for object detection. However, these methods focus on searching only one component of the object detector while leaving the others manually designed. We identify that the inconsistency between the searched component and the manually designed ones keeps the detector from reaching stronger performance. To this end, we propose a hierarchical trinity search framework to simultaneously discover efficient architectures for all components (i.e. backbone, neck, and head) of the object detector in an end-to-end manner. In addition, we empirically reveal that different parts of the detector prefer different operators. Motivated by this, we employ a novel scheme to automatically screen different sub search spaces for different components so as to perform the end-to-end search for each component on the corresponding sub search space efficiently. Without bells and whistles, our searched architecture, namely Hit-Detector, achieves 41.4% mAP on the COCO minival set with 27M parameters. Our implementation is available at https://github.com/ggjy/HitDet.pytorch.
[hierarchical, three] [object, backbone, detection, detector, head, neck, map, feature, fpn, coco, table, trinity, detnas, pyramid, ross, jian, propose, achieves, proposal, ldet, bounding, region, val, kaiming, faster, denotes] [model, screening, suitable, input] [based, figure, proposed, convolution, block, designed, screen, convolutional, residual, chao, method] [image, component, corresponding] [search, architecture, space, searched, operation, neural, network, arxiv, searching, better, set, classification, layer, min, learning, large, arg, baseline, design, efficient, size, manually, rate, training, number, performance, xiangyu, indicates, higher] [differentiable]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Jianyuan and Han, Kai and Wang, Yunhe and Zhang, Chao and Yang, Zhaohui and Wu, Han and Chen, Xinghao and Xu, Chang},
  title = {Hit-Detector: Hierarchical Trinity Architecture Search for Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Geometrically Principled Connections in Graph Neural Networks
Shunwang Gong, Mehdi Bahri, Michael M. Bronstein, Stefanos Zafeiriou


Graph convolution operators bring the advantages of deep learning to a variety of graph and mesh processing tasks previously deemed out of reach. With their continued success comes the desire to design more powerful architectures, often by adapting existing deep learning techniques to non-Euclidean data. In this paper, we argue geometry should remain the primary driving force behind innovation in the emerging field of geometric deep learning. We relate graph neural networks to widely successful computer graphics and data approximation models: radial basis functions (RBFs). We conjecture that, like RBFs, graph convolution layers would benefit from the addition of simple functions to the powerful convolution kernels. We introduce affine skip connections, a novel building block formed by combining a fully connected layer with any graph convolution operator. We experimentally demonstrate the effectiveness of our technique, and show the improved performance is the consequence of more than the increased number of parameters. Operators equipped with the affine skip connection markedly outperform their base performance on every task we evaluated, i.e., shape reconstruction, dense shape correspondence, and graph classification. We hope our simple and effective approach will serve as a solid baseline and help ease future research in graph neural networks.
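
The proposed building block is simply a graph convolution plus a parallel per-vertex fully connected (affine) branch. A minimal sketch, using a plain normalized-adjacency aggregation as a stand-in for whichever graph convolution operator is being wrapped:

import torch
import torch.nn as nn

class AffineSkipGraphConv(nn.Module):
    """Graph convolution with an affine skip connection: the output is the
    aggregated-neighborhood transform plus a per-vertex linear layer."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.conv_weight = nn.Linear(in_dim, out_dim)   # weights of the wrapped graph conv
        self.affine = nn.Linear(in_dim, out_dim)        # the affine skip branch

    def forward(self, x, adj):                          # x: (N, in_dim), adj: (N, N)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighborhood = (adj @ x) / deg                  # mean aggregation over neighbors
        return self.conv_weight(neighborhood) + self.affine(x)
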
[graph, gcn, order, gnns, powerful, connected] [center, table, denotes, addition] [model, improve, radial, mnist] [affine, skip, convolution, kernel, aff, ieee, convolutional, residual, pattern, block, figure, interpolation, operator, based, rbfs, fast, society] [learn, learns, discriminative] [neural, learning, performance, deep, classification, function, network, learned, size, vanilla, number, matrix, processing, data, layer, connection, accuracy, gradient, weight, simple, improved, space] [computer, conference, shape, vision, vertex, michael, rbf, mesh, error, correspondence, geodesic, monet, geometric, feastnet, polynomial, basis, reconstruction, surface, acm, dense, local, vanishing, compare, median, international]
@InProceedings{Gong_2020_CVPR,
  author = {Gong, Shunwang and Bahri, Mehdi and Bronstein, Michael M. and Zafeiriou, Stefanos},
  title = {Geometrically Principled Connections in Graph Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
On Vocabulary Reliance in Scene Text Recognition
Zhaoyi Wan, Jielei Zhang, Liang Zhang, Jiebo Luo, Cong Yao


The pursuit of high performance on public benchmarks has been the driving force for research in scene text recognition, and notable progress has been achieved. However, a close investigation reveals a startling fact that the state-of-the-art methods perform well on images with words within vocabulary but generalize poorly to images with words outside vocabulary. We call this phenomenon "vocabulary reliance". In this paper, we establish an analytical framework, in which different datasets, metrics and module combinations for quantitative comparisons are devised, to conduct an in-depth study on the problem of vocabulary reliance in scene text recognition. Key findings include: (1) Vocabulary reliance is ubiquitous, i.e., all existing algorithms more or less exhibit such a characteristic; (2) Attention-based decoders prove weak in generalizing to words outside vocabulary and segmentation-based decoders perform well in utilizing visual features; (3) Context modeling is highly coupled with the prediction layers. These findings provide new insights and can benefit future research in scene text recognition. Furthermore, we propose a simple yet effective mutual learning strategy to allow models of two families (attention-based and segmentation-based) to learn collaboratively. This remedy alleviates the problem of vocabulary reliance and significantly improves the overall scene text recognition performance.
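
The remedy resembles standard deep mutual learning: each of the two recognizers is trained on the ground-truth labels plus a KL term pulling it toward the other's predictions. The sketch below shows that generic formulation; whether the paper uses exactly these terms and weightings is an assumption.

import torch.nn.functional as F

def mutual_learning_losses(logits_a, logits_b, targets):
    """Per-model losses for two collaborating classifiers (e.g. an
    attention-based and a segmentation-based recognizer producing
    per-character logits): cross-entropy with the labels plus a KL term
    toward the other model's (detached) predictions."""
    ce_a = F.cross_entropy(logits_a, targets)
    ce_b = F.cross_entropy(logits_b, targets)
    kl_a = F.kl_div(F.log_softmax(logits_a, dim=-1),
                    F.softmax(logits_b, dim=-1).detach(), reduction='batchmean')
    kl_b = F.kl_div(F.log_softmax(logits_b, dim=-1),
                    F.softmax(logits_a, dim=-1).detach(), reduction='batchmean')
    return ce_a + kl_a, ce_b + kl_b
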
[text, vocabulary, recognition, pred, context, ctc, cong, reliance, visual, prediction, cntx, evaluation, blstm, ppm, corpus, iiit, lstm, dataset, three, sequence, modeling, attention, decoder, natural, minghui, order, observation, reading, recurrent, word, illustrated] [module, xiang, feature, framework, detection, table, propose, effectiveness] [model, trained, datasets, generalization, robust, effective, collected] [ieee, pattern, proposed, figure, comparison, based, convolutional] [gap, ability, image, synthetic, perform, generated, learn] [learning, data, training, accuracy, performance, mutual, network, deep, test, neural, strategy, ratio, problem, large, evaluate, machine, algorithm] [scene, conference, computer, vision, international, accurate, well]
@InProceedings{Wan_2020_CVPR,
  author = {Wan, Zhaoyi and Zhang, Jielei and Zhang, Liang and Luo, Jiebo and Yao, Cong},
  title = {On Vocabulary Reliance in Scene Text Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Generating Accurate Pseudo-Labels in Semi-Supervised Learning and Avoiding Overconfident Predictions via Hermite Polynomial Activations
Vishnu Suresh Lokhande, Songwong Tasneeyapant, Abhay Venkatesh, Sathya N. Ravi, Vikas Singh


Rectified Linear Units (ReLUs) are among the most widely used activation functions in a broad variety of vision tasks. Recent theoretical results suggest that despite their excellent practical performance, in various cases a substitution with basis expansions (e.g., polynomials) can yield significant benefits from both the optimization and generalization perspective. Unfortunately, the existing results remain limited to networks with a couple of layers, and the practical viability of these results is not yet known. Motivated by some of these results, we explore the use of Hermite polynomial expansions as a substitute for ReLUs in deep networks. While our experiments with supervised learning do not provide a clear verdict, we find that this strategy offers considerable benefits in semi-supervised learning (SSL) / transductive learning settings. We carefully develop this idea and show how the use of Hermite polynomial based activations can yield improvements in pseudo-label accuracies and sizable financial savings (due to concurrent runtime benefits). Further, we show via theoretical analysis that the networks (with Hermite activations) offer robustness to noise and other attractive mathematical properties.
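
The substitution the abstract studies can be sketched as an activation built from a truncated Hermite expansion with learnable coefficients. The recurrence below generates probabilists' Hermite polynomials normalized by sqrt(k!); the truncation order and coefficient initialization are assumptions rather than the authors' settings.

import torch
import torch.nn as nn

class HermiteActivation(nn.Module):
    """Activation as a learnable combination of the first few normalized
    (probabilists') Hermite polynomials."""
    def __init__(self, num_terms=4):
        super().__init__()
        self.coeffs = nn.Parameter(torch.ones(num_terms))

    def forward(self, x):
        # He_0 = 1, He_1 = x, He_k = x * He_{k-1} - (k-1) * He_{k-2};
        # normalized basis element k is He_k / sqrt(k!)
        polys = [torch.ones_like(x), x]
        for k in range(2, len(self.coeffs)):
            polys.append(x * polys[-1] - (k - 1) * polys[-2])
        out = 0.0
        fact = 1.0
        for k, p in enumerate(polys[:len(self.coeffs)]):
            if k > 0:
                fact *= k
            out = out + self.coeffs[k] * p / (fact ** 0.5)
        return out
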
[provide, behavior, work, hidden, dataset, time, order] [faster, table, confidence] [noise, model, interesting] [relu, figure, based, high, ieee, analysis, expansion] [loss, supervised, specific, train] [hermite, learning, deep, activation, network, hermites, training, function, test, optimization, data, accuracy, landscape, neural, number, lower, ssl, better, rate, convergence, relus, layer, set, performance, observe, find, softsign, saas, arxiv, preprint, needed, inner, higher, compared, theoretical, yield, mathematical, unlabeled, svhn, maximum, empirical, standard] [polynomial, computer, vision, smoother, conference, accurate, property, basis, form, cost, well, smooth, initial]
@InProceedings{Lokhande_2020_CVPR,
  author = {Lokhande, Vishnu Suresh and Tasneeyapant, Songwong and Venkatesh, Abhay and Ravi, Sathya N. and Singh, Vikas},
  title = {Generating Accurate Pseudo-Labels in Semi-Supervised Learning and Avoiding Overconfident Predictions via Hermite Polynomial Activations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping
Hao-Shu Fang, Chenxi Wang, Minghao Gou, Cewu Lu


Object grasping is critical for many applications and is also a challenging computer vision problem. However, for cluttered scenes, current research suffers from insufficient training data and a lack of evaluation benchmarks. In this work, we contribute a large-scale grasp pose detection dataset with a unified evaluation system. Our dataset contains 97,280 RGB-D images with over one billion grasp poses. Meanwhile, our evaluation system directly reports whether a grasp is successful by analytic computation, which is able to evaluate any kind of grasp pose without exhaustively labeling ground truth. In addition, we propose an end-to-end grasp pose prediction network given point cloud inputs, where we learn the approaching direction and operation parameters in a decoupled manner. A novel grasp affinity field is also designed to improve the grasping robustness. We conduct extensive experiments to show that our dataset and evaluation system align well with real-world experiments and that our proposed network achieves state-of-the-art performance. Our dataset, source code and models are publicly available at www.graspnet.net.
[dataset, evaluation, previous, predict, frame, prediction, provide] [object, detection, rectangle, denotes, adopt, unified, propose, annotated, annotation, predicted, benchmark, table, score, confidence] [quality, datasets, conduct] [based, ieee, proposed, method, convolutional, pattern, scale, figure, high] [representation, real, image, loss] [network, learning, data, deep, arxiv, preprint, set, evaluate, larger, classification, neural, training, width, metric, operation, better, sampled] [grasp, point, pose, grasping, approaching, conference, cloud, gripper, camera, computer, international, vision, robotic, rotation, robotics, well, cluttered, scene, graspable, vij, robot, directly, analytic, novel, single, estimation, avoid, arm, collision]
@InProceedings{Fang_2020_CVPR,
  author = {Fang, Hao-Shu and Wang, Chenxi and Gou, Minghao and Lu, Cewu},
  title = {GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PFRL: Pose-Free Reinforcement Learning for 6D Pose Estimation
Jianzhun Shao, Yuhang Jiang, Gu Wang, Zhigang Li, Xiangyang Ji


6D pose estimation from a single RGB image is a challenging and vital task in computer vision. Current mainstream deep-model methods resort to 2D images annotated with real-world ground-truth 6D object poses, whose collection is fairly cumbersome and expensive, and even unavailable in many cases. In this work, to get rid of the burden of 6D annotations, we formulate 6D pose refinement as a Markov Decision Process and adopt a reinforcement learning approach that uses only 2D image annotations as weakly-supervised 6D pose information, via a delicate reward definition and a composite reinforced optimization method for efficient and effective policy training. Experiments on the LINEMOD and T-LESS datasets demonstrate that our Pose-Free approach achieves state-of-the-art performance compared with methods that do not use real-world ground-truth 6D pose labels.
[reward, policy, reinforcement, current, action, state, time, agent, observed, visual, exploit, work] [object, add, mask, iou, propose, detection, table, bounding, refinement] [model, trained, decision, difference] [ieee, method, pattern, figure, based, refining] [image, translation, composite, real, train, synthetic, loss, disentangled] [learning, optimization, training, network, problem, deep, data, achieve, performance, accuracy, discrete, compared, function, metric, update, strategy, process, space, design, denote] [pose, estimation, computer, conference, initial, rotation, vision, international, linemod, reinforced, rendered, approach, aae, continuous, european, single, rgb, error, matching, dpod]
@InProceedings{Shao_2020_CVPR,
  author = {Shao, Jianzhun and Jiang, Yuhang and Wang, Gu and Li, Zhigang and Ji, Xiangyang},
  title = {PFRL: Pose-Free Reinforcement Learning for 6D Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Through Fog High-Resolution Imaging Using Millimeter Wave Radar
Junfeng Guan, Sohrab Madani, Suraj Jog, Saurabh Gupta, Haitham Hassanieh


This paper demonstrates high-resolution imaging using millimeter Wave (mmWave) radars that can function even in dense fog. We leverage the fact that mmWave signals have favorable propagation characteristics in low visibility conditions, unlike optical sensors like cameras and LiDARs which cannot penetrate through dense fog. Millimeter-wave radars, however, suffer from very low resolution, specularity, and noise artifacts. We introduce HawkEye, a system that leverages a cGAN architecture to recover high-frequency shapes from raw low-resolution mmWave heat-maps. We propose a novel design that addresses challenges specific to the structure and nature of the radar signals involved. We also develop a data synthesizer to aid with large-scale dataset generation for training. We implement our system on a custom-built mmWave radar platform and demonstrate performance improvement over both standard mmWave radars and other competitive baselines.
[work, dataset] [car, heatmap, map, lidar, object] [input, heatmaps, visibility, adversarial, model, ranging, noise] [fog, resolution, output, high, imaging, ieee, low, figure, based, column, signal, weather, skip, frequency, clear, pattern, june, perceptual] [image, loss, generator, gan, real, synthesized, discriminator, generated, corresponding, representation, conditional] [data, performance, neural, architecture, design, network, compared, note, test, processing, function, standard, learning, achieve, deep] [mmwave, radar, depth, hawkeye, millimeter, wave, camera, point, conference, scene, stereo, specularity, system, view, nearest, vision, computer, cloud, dense, synthesizer, neighbor, press, international, ground, shape, capture, multipath, human, ghz]
@InProceedings{Guan_2020_CVPR,
  author = {Guan, Junfeng and Madani, Sohrab and Jog, Suraj and Gupta, Saurabh and Hassanieh, Haitham},
  title = {Through Fog High-Resolution Imaging Using Millimeter Wave Radar},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Disentangling Physical Dynamics From Unknown Factors for Unsupervised Video Prediction
Vincent Le Guen, Nicolas Thome


Leveraging physical knowledge described by partial differential equations (PDEs) is an appealing way to improve unsupervised video forecasting models. Since physics is too restrictive for describing the full visual content of generic video sequences, we introduce PhyDNet, a two-branch deep architecture, which explicitly disentangles PDE dynamics from unknown complementary information. A second contribution is to propose a new recurrent physical cell (PhyCell), inspired by data assimilation techniques, for performing PDE-constrained prediction in latent space. Extensive experiments conducted on four diverse datasets show the ability of PhyDNet to outperform state-of-the-art methods. Ablation studies also highlight the important gain brought by both disentanglement and PDE-constrained prediction. Finally, we show that PhyDNet presents interesting features for dealing with missing data and long-term forecasting.
[video, phydnet, prediction, phycell, recurrent, moving, future, kalman, ddpae, forecasting, sequence, traffic, pde, pdes, modeling, recognition, frame, order, time, lmoment, state, assimilation] [branch, ablation] [physical, model, input, differential, datasets, mnist] [figure, mse, residual, mae, ssim, prior, dedicated, based, convolutional, convlstm, motion, pattern, cell, pixel, flow] [latent, disentangling, unsupervised, missing, image, specific, loss, representation, encoder, sst, unknown, introduce] [neural, learning, deep, processing, data, space, machine, training, gain, general, architecture, dynamical, predictor, filter, learned, linear, network] [conference, international, computer, vision, human, partial, supplementary, complex]
@InProceedings{Guen_2020_CVPR,
  author = {Guen, Vincent Le and Thome, Nicolas},
  title = {Disentangling Physical Dynamics From Unknown Factors for Unsupervised Video Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
D2Det: Towards High Quality Object Detection and Instance Segmentation
Jiale Cao, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan, Yanwei Pang, Ling Shao


We propose a novel two-stage detection method, D2Det, that collectively addresses both precise localization and accurate classification. For precise localization, we introduce a dense local regression that predicts multiple dense box offsets for an object proposal. Different from traditional regression and keypoint-based localization employed in two-stage detectors, our dense local regression is not limited to a quantized set of keypoints within a fixed region and can regress position-sensitive, real-valued dense offsets, leading to more precise localization. The dense local regression is further improved by a binary overlap prediction strategy that reduces the influence of background region on the final box regression. For accurate classification, we introduce a discriminative RoI pooling scheme that samples from various sub-regions of a proposal and performs adaptive weighting to obtain discriminative features. On MS COCO test-dev, our D2Det outperforms existing two-stage methods, with a single-model performance of 45.4 AP, using ResNet101 backbone. When using multi-scale training and inference, D2Det obtains AP of 50.1. In addition to detection, we adapt D2Det for instance segmentation, achieving a mask AP of 40.2 with a two-fold speedup, compared to the state-of-the-art. We also demonstrate the effectiveness of our D2Det on airborne sensors by performing experiments for object detection in UAV images (UAVDT dataset) and instance segmentation in satellite images (iSAID dataset). Source code is available at https://github.com/JialeCao001/D2Det.
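The sketch below shows one plausible way to combine per-cell dense offsets with a binary overlap prediction into a single box, in the spirit of the dense local regression described above; the cell layout, threshold, and averaging rule are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def combine_dense_local_regression(cell_centers, offsets, overlap_scores, thresh=0.5):
    """Keep cells whose overlap prediction is positive, let each cast a box from its
    (l, t, r, b) offsets relative to its centre, and average the resulting boxes."""
    keep = overlap_scores >= thresh
    if not np.any(keep):
        keep = np.ones_like(overlap_scores, dtype=bool)      # fall back to all cells
    cx, cy = cell_centers[keep, 0], cell_centers[keep, 1]
    l, t, r, b = offsets[keep].T
    boxes = np.stack([cx - l, cy - t, cx + r, cy + b], axis=1)   # (M, 4) as x1, y1, x2, y2
    return boxes.mean(axis=0)
```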
[prediction, multiple, dataset, connected] [object, regression, box, detection, roi, coco, instance, feature, pooling, mask, localization, proposal, faster, fpn, segmentation, bounding, offset, overlap, fully, achieves, precise, region, backbone, uavdt, dlr, yanwei, obtains, branch, global, cascade, ross, employed, background] [case] [ieee, pattern, method, existing, comparison, traditional, convolutional, adaptive, performs, figure] [discriminative, introduce, corresponding, target] [classification, candidate, compared, binary, gain, set, training, network, performance, sampling, standard, large, achieving, precision, number, weighting] [local, dense, computer, grid, vision, international, single, accurate, predicts, keypoints, absolute, approach]
@InProceedings{Cao_2020_CVPR,
  author = {Cao, Jiale and Cholakkal, Hisham and Anwer, Rao Muhammad and Khan, Fahad Shahbaz and Pang, Yanwei and Shao, Ling},
  title = {D2Det: Towards High Quality Object Detection and Instance Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
LiDAR-Based Online 3D Video Object Detection With Graph-Based Message Passing and Spatiotemporal Transformer Attention
Junbo Yin, Jianbing Shen, Chenye Guan, Dingfu Zhou, Ruigang Yang


Existing LiDAR-based 3D object detectors usually focus on the single-frame detection, while ignoring the spatiotemporal information in consecutive point cloud frames. In this paper, we propose an end-to-end online 3D video object detector that operates on point cloud sequences. The proposed model comprises a spatial feature encoding component and a spatiotemporal feature aggregation component. In the former component, a novel Pillar Message Passing Network (PMPNet) is proposed to encode each discrete point cloud frame. It adaptively collects information for a pillar node from its neighbors by iterative message passing, which effectively enlarges the receptive field of the pillar feature. In the latter component, we propose an Attentive Spatiotemporal Transformer GRU (AST-GRU) to aggregate the spatiotemporal information, which enhances the conventional ConvGRU with an attentive memory gating mechanism. AST-GRU contains a Spatial Transformer Attention (STA) module and a Temporal Transformer Attention (TTA) module, which can emphasize the foreground objects and align the dynamic objects, respectively. Experimental results demonstrate that the proposed 3D video object detector achieves state-of-the-art performance on the large-scale nuScenes benchmark.
[video, spatiotemporal, transformer, message, attention, graph, node, temporal, pmpnet, encoding, convgru, passing, state, frame, extract, recurrent, sta, tta, sequence, step, gru, rich, previous, time] [object, detection, feature, pillar, attentive, detector, nuscenes, module, head, pointpillars, jianbing, aggregation, autonomous, propose, foreground, backbone, wenguan, fully, car, key, table, benchmark] [input, model, effectively] [spatial, convolutional, output, consecutive, dynamic, proposed, adaptively, receptive, motion, deformable] [loss, component, representation, utilize] [memory, network, neural, learning, better, performance, layer, deep, update, iteration, online, gating, vanilla, number, size] [point, cloud, neighbor]
@InProceedings{Yin_2020_CVPR,
  author = {Yin, Junbo and Shen, Jianbing and Guan, Chenye and Zhou, Dingfu and Yang, Ruigang},
  title = {LiDAR-Based Online 3D Video Object Detection With Graph-Based Message Passing and Spatiotemporal Transformer Attention},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Orthogonal Convolutional Neural Networks
Jiayun Wang, Yubei Chen, Rudrasis Chakraborty, Stella X. Yu


Deep convolutional neural networks are hindered by training instability and feature redundancy towards further performance improvement. A promising solution is to impose orthogonality on convolutional filters. We develop an efficient approach to impose filter orthogonality on a convolutional layer based on the doubly block-Toeplitz matrix representation of the convolutional kernel, instead of the common kernel orthogonality approach, which we show is only necessary but not sufficient for ensuring orthogonal convolutions. Our proposed orthogonal convolution requires no additional parameters and little computational overhead. It consistently outperforms the kernel orthogonality alternative on a wide range of tasks such as image classification and inpainting under supervised, semi-supervised and unsupervised settings. It learns more diverse and expressive features with better training stability, robustness, and generalization. Our code is publicly available.
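The convolution-orthogonality idea can be approximated in code by penalizing the self-correlation of a filter bank, computed by convolving the kernel tensor with itself; the PyTorch sketch below (square kernel, stride 1, all names ours) is a simplified stand-in for the paper's doubly block-Toeplitz formulation, not its exact regularizer.

```python
import torch
import torch.nn.functional as F

def conv_orthogonality_penalty(kernel: torch.Tensor) -> torch.Tensor:
    """Penalize deviation of the filter bank's self-correlation from an 'identity' target:
    each filter should correlate with itself only at zero offset, and not with other filters."""
    c_out, c_in, k, _ = kernel.shape
    pad = k - 1
    corr = F.conv2d(kernel, kernel, padding=pad)      # (c_out, c_out, 2k-1, 2k-1)
    target = torch.zeros_like(corr)
    centre = corr.shape[-1] // 2
    idx = torch.arange(c_out)
    target[idx, idx, centre, centre] = 1.0            # delta at zero offset on the diagonal
    return ((corr - target) ** 2).sum()
```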
[time, outperforms, recognition] [feature, table, stride, region, backbone, adopt, add] [input, attack, model, condition, analyze, dbt, adversarial, check, robustness, identity] [kernel, convolutional, convolution, spectrum, output, ieee, channel, column, tensor, figure, pattern, spatial, proposed, cnns, based] [image, row, inpainting, loss, unsupervised, generation] [orthogonal, orthogonality, neural, matrix, ocnn, learning, deep, regularization, classification, training, filter, baseline, layer, performance, accuracy, size, efficient, weight, imagenet, network, machine, processing, uniform, gradient, regularizer, ocnns, linear, gain, standard, set, redundancy, observe, task, improved] [conference, international, vision, computer, additional, approach, doubly, vanishing]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Jiayun and Chen, Yubei and Chakraborty, Rudrasis and Yu, Stella X.},
  title = {Orthogonal Convolutional Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Robust 3D Point Recognition via Gather-Vector Guidance
Xiaoyi Dong, Dongdong Chen, Hang Zhou, Gang Hua, Weiming Zhang, Nenghai Yu


In this paper, we look into the problem of 3D adversarial attacks, and propose to leverage the internal properties of point clouds and adversarial examples to design a new self-robust deep neural network (DNN) based 3D recognition system. On one hand, point clouds are highly structured. Hence, for each local part of a clean point cloud, it is possible to learn what it is ("part of a bottle") and its relative position ("upper part of a bottle") with respect to the global object center. On the other hand, under the visual quality constraint, 3D adversarial samples often produce only small local perturbations, so they roughly keep the original global center but may cause incorrect local relative position estimation. Motivated by these two properties, we use the relative position (dubbed the "gather-vector") as the adversarial indicator and propose a new robust gather module. Equipped with this module, we further propose a new self-robust 3D point recognition network. Through extensive experiments, we demonstrate that the proposed method significantly improves robustness against targeted attacks in the white-box setting. For the I-FGSM-based attack, our method reduces the attack success rate from 94.37% to 75.69%. For the C&W-based attack, our method reduces the attack success rate by more than 40.00%. Moreover, our method is complementary to other types of defense methods and can be combined with them to achieve better defense results.
[] [] [] [] [] [] []
@InProceedings{Dong_2020_CVPR,
  author = {Dong, Xiaoyi and Chen, Dongdong and Zhou, Hang and Hua, Gang and Zhang, Weiming and Yu, Nenghai},
  title = {Self-Robust 3D Point Recognition via Gather-Vector Guidance},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation
Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, Cordelia Schmid


Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights). This paper introduces VectorNet, a hierarchical graph neural network that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all components. In contrast to most recent approaches, which render trajectories of moving agents and road context information as bird-eye images and encode them with convolutional neural networks (ConvNets), our approach operates on the primitive vector representation. By operating on the vectorized high definition (HD) maps and agent trajectories, we avoid lossy rendering and computationally intensive ConvNet encoding steps. To further boost VectorNet's capability in learning context features, we propose a novel auxiliary task to recover the randomly masked out map entities and agent trajectories based on their context. We evaluate VectorNet on our in-house behavior prediction benchmark and the recently released Argoverse forecasting dataset. Our method achieves on par or better performance than the competitive rendering approach on both benchmarks while saving over 70% of the model parameters with an order of magnitude reduction in FLOPs. It also obtains state-of-the-art performance on the Argoverse dataset.
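A minimal sketch of the polyline-level encoding described above, in PyTorch (layer sizes and names are illustrative): each vector node of a polyline is embedded by a shared MLP and the polyline feature is obtained by permutation-invariant max pooling, before any global interaction graph is applied.

```python
import torch
import torch.nn as nn

class PolylineSubgraph(nn.Module):
    """Embed the vector nodes of one polyline and pool them into a single polyline feature."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.node_mlp = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
        )

    def forward(self, vectors: torch.Tensor) -> torch.Tensor:
        # vectors: (num_nodes, in_dim) -- the vectors of a single polyline
        node_feats = self.node_mlp(vectors)      # (num_nodes, hidden)
        return node_feats.max(dim=0).values      # (hidden,) permutation-invariant pooling
```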
[graph, node, prediction, agent, trajectory, vectornet, argoverse, polyline, behavior, context, dataset, vehicle, road, future, rasterized, polylines, vectorized, multiple, hierarchical, lane, encode, forecasting, observed, driving, attention, subgraph, moving, predict, ade, interaction, social, time] [map, feature, table, global, propose, ablation, location, semantic] [model, input, auxiliary, study] [based, receptive, spatial, convolutional, method, cropping, kernel, figure, proposed, crop, resolution] [target, representation, masked, diverse] [number, performance, size, impact, learning, vector, layer, network, convnet, better, baseline, computation, set, objective, convnets, test, neural, average] [approach, scene, form, completion, rendering, single, point, compare]
@InProceedings{Gao_2020_CVPR,
  author = {Gao, Jiyang and Sun, Chen and Zhao, Hang and Shen, Yi and Anguelov, Dragomir and Li, Congcong and Schmid, Cordelia},
  title = {VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks
Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, Qinghua Hu


Recently, the channel attention mechanism has demonstrated great potential for improving the performance of deep convolutional neural networks (CNNs). However, most existing methods are dedicated to developing more sophisticated attention modules to achieve better performance, which inevitably increases model complexity. To overcome this trade-off between performance and complexity, this paper proposes an Efficient Channel Attention (ECA) module, which involves only a handful of parameters while bringing a clear performance gain. By dissecting the channel attention module in SENet, we empirically show that avoiding dimensionality reduction is important for learning channel attention, and that appropriate cross-channel interaction can preserve performance while significantly decreasing model complexity. Therefore, we propose a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented via 1D convolution. Furthermore, we develop a method to adaptively select the kernel size of the 1D convolution, determining the coverage of local cross-channel interaction. The proposed ECA module is both efficient and effective: against a ResNet50 backbone, it adds 80 parameters (vs. 24.37M) and 4.7e-4 GFlops (vs. 3.86 GFlops), while boosting Top-1 accuracy by more than 2%. We extensively evaluate our ECA module on image classification, object detection and instance segmentation with ResNet and MobileNetV2 backbones. The experimental results show our module is more efficient while performing favorably against its counterparts.
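The channel-attention scheme described above maps naturally to a few lines of PyTorch. In the sketch below, the kernel-size rule (an odd size derived from log2 of the channel count) is one plausible instantiation of the adaptive selection the abstract mentions, not necessarily the paper's exact formula, and the class name is ours.

```python
import math
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """Channel attention via a 1D convolution over the pooled channel descriptor,
    with no dimensionality reduction."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                      # force an odd kernel size
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        y = self.pool(x).squeeze(-1).transpose(-1, -2)   # (B, 1, C) channel descriptor
        y = self.sigmoid(self.conv(y))                   # local cross-channel interaction
        return x * y.transpose(-1, -2).unsqueeze(-1)     # rescale channels
```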
[attention, interaction, outperforms] [module, fps, table, cnn, backbone, object, detection, effectiveness, coco, resnet, cbam, mask, achieves, faster, kaiming, feature, pooling, employ, instance, segmentation, ross] [model, original, improve, effective, verify] [channel, eca, block, kernel, method, convolution, senet, convolutional, cnns, adaptively, figure, lightweight, avoiding, spatial, clear, superior, develop] [image, learn] [deep, size, dimensionality, efficient, performance, reduction, learning, neural, lower, group, better, network, weight, linear, evaluate, efficiency, indicates, complexity, involves, parameter, number, set, compared, resnets, note, accuracy, setting] [local, compare, demonstrate, coverage]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Qilong and Wu, Banggu and Zhu, Pengfei and Li, Peihua and Zuo, Wangmeng and Hu, Qinghua},
  title = {ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MTL-NAS: Task-Agnostic Neural Architecture Search Towards General-Purpose Multi-Task Learning
Yuan Gao, Haoping Bai, Zequn Jie, Jiayi Ma, Kui Jia, Wei Liu


We propose to incorporate neural architecture search (NAS) into general-purpose multi-task learning (GP-MTL). Existing NAS methods typically define different search spaces according to different tasks. In order to adapt to different task combinations (i.e., task sets), we disentangle the GP-MTL networks into single-task backbones (optionally encode the task priors), and a hierarchical and layerwise features sharing/fusing scheme across them. This enables us to design a novel and general task-agnostic search space, which inserts cross-task edges (i.e., feature fusion connections) into fixed single-task network backbones. Moreover, we also propose a novel single-shot gradient-based search algorithm that closes the performance gap between the searched architectures and the final evaluation architecture. This is realized with a minimum entropy regularization on the architecture weights during the search phase, which makes the architecture weights converge to near-discrete values and therefore achieves a single model. As a result, our searched model can be directly used for evaluation without (re-)training from scratch. We perform extensive experiments using different single-task backbones on various task sets, demonstrating the promising performance obtained by exploiting the hierarchical and layerwise features, as well as the desirable generalizability to different i) task sets and ii) single-task backbones. The code of our paper is available at https://github.com/bhpfelix/MTLNAS.
[evaluation, multiple, hierarchical, prediction, dataset, node] [feature, semantic, backbone, object, segmentation, discretization, table, propose] [model] [fusion, method, proposed, convolutional] [source, perform, gap, image, target] [search, architecture, neural, learning, task, network, entropy, stochastic, performance, space, deterministic, deep, sampling, relaxation, objective, regularization, fixed, algorithm, minimum, operation, note, general, optimization, bias, variance, design, classification, problem, snas, quoc, searched, random, subfigure, large, discrete, set, distribution, arxiv, preprint, layerwise, converge, candidate] [continuous, novel, surface, single, normal, enables, directly, scene, minimal]
@InProceedings{Gao_2020_CVPR,
  author = {Gao, Yuan and Bai, Haoping and Jie, Zequn and Ma, Jiayi and Jia, Kui and Liu, Wei},
  title = {MTL-NAS: Task-Agnostic Neural Architecture Search Towards General-Purpose Multi-Task Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PnPNet: End-to-End Perception and Prediction With Tracking in the Loop
Ming Liang, Bin Yang, Wenyuan Zeng, Yun Chen, Rui Hu, Sergio Casas, Raquel Urtasun


We tackle the problem of joint perception and motion forecasting in the context of self-driving vehicles. Towards this goal we propose PnPNet, an end-to-end model that takes as input sequential sensor data, and outputs at each time step object tracks and their future trajectories. The key component is a novel tracking module that generates object tracks online from detections and exploits trajectory level features for motion forecasting. Specifically, the object tracks get updated at each time step by solving both the data association problem and the trajectory estimation problem. Importantly, the whole model is end-to-end trainable and benefits from joint optimization of all tasks. We validate PnPNet on two large-scale driving datasets, and show significant improvements over the state-of-the-art with better occlusion recovery and more accurate future prediction.
[prediction, trajectory, perception, frame, future, forecasting, time, current, temporal, history, previous, driving, observation, multiple, modular, explicit, lstm, evaluation] [object, tracking, pnpnet, detection, module, track, feature, association, raquel, tracker, score, lidar, autonomous, table, level, occlusion, occluded, bin, propose, paradigm, detector, map, nuscenes, backbone, affinity, refine, iou] [model, input] [motion, sensor, proposed, figure, fusion, based, convolutional] [representation, perform] [problem, data, performance, online, evaluate, learning, network, baseline, better, discrete, note] [joint, point, system, accurate, approach, single, loop, estimation, solve, continuous, error, matching]
@InProceedings{Liang_2020_CVPR,
  author = {Liang, Ming and Yang, Bin and Zeng, Wenyuan and Chen, Yun and Hu, Rui and Casas, Sergio and Urtasun, Raquel},
  title = {PnPNet: End-to-End Perception and Prediction With Tracking in the Loop},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Revisiting the Sibling Head in Object Detector
Guanglu Song, Yu Liu, Xiaogang Wang


The "shared head for classification and localization" (sibling head), firstly denominated in Fast RCNN, has been leading the fashion of the object detection community in the past five years. This paper provides the observation that the spatial misalignment between the two object functions in the sibling head can considerably hurt the training process, but this misalignment can be resolved by a very simple operator called task-aware spatial disentanglement (TSD). Considering the classification and regression, TSD decouples them from the spatial dimension by generating two disentangled proposals for them, which are estimated by the shared proposal. This is inspired by the natural insight that for one instance, the features in some salient area may have rich information for classification while these around the boundary may be good at bounding box regression. Surprisingly, this simple design can boost all backbones and models on both MS COCO and Google OpenImage consistently by 3% mAP. Further, we propose a progressive constraint to enlarge the performance margin between the disentangled and the shared proposals, and gain 1% more mAP. We show the TSD breaks through the upper bound of nowadays single-model detector by a large margin (mAP 49.4 with ResNet-101, 51.2 with SENet154), and is the core model of our 1st place solution on the Google OpenImage Challenge 2019.
[tangled] [tsd, head, sibling, object, detection, feature, localization, proposal, coco, fpn, faster, iou, backbone, openimage, roi, box, precise, map, table, mask, propose, pooling, inherent, ross, rcnn, bounding, detector, score, confidence, fully, regression, ablation, piotr, kaiming] [derived, improve, conduct, model, easily, sensitive] [spatial, ieee, pattern, proposed, method, deformable, based, classical, figure, scale, comparison, column] [shared, disentanglement, disentangled, misalignment, progressive, loss, specific] [classification, performance, training, conflict, variant, indicates, dimension, learning, simple, network, set, margin, large, task, better, design] [conference, computer, vision, international, constraint, european, grid, detailed, single]
@InProceedings{Song_2020_CVPR,
  author = {Song, Guanglu and Liu, Yu and Wang, Xiaogang},
  title = {Revisiting the Sibling Head in Object Detector},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Visual Reaction: Learning to Play Catch With Your Drone
Kuo-Hao Zeng, Roozbeh Mottaghi, Luca Weihs, Ali Farhadi


In this paper we address the problem of visual reaction: the task of interacting with dynamic environments where the changes in the environment are not necessarily caused by the agent itself. Visual reaction entails predicting the future changes in a visual environment and planning accordingly. We study the problem of visual reaction in the context of playing catch with a drone in visually rich synthetic environments. This is a challenging problem since the agent is required to learn (1) how objects with different physical properties and shapes move, (2) what sequence of actions should be taken according to the prediction, (3) how to adjust the actions based on the visual feedback from the dynamic environment (e.g., when objects bounce off a wall), and (4) how to reason and act with an unexpected state change in a timely manner. We propose a new dataset for this task, which includes 30K throws of 20 types of objects in different directions with different forces. Our results show that our model, which integrates a forecaster with a planner, outperforms a set of strong baselines based on tracking, as well as pure model-based and model-free RL baselines. The code and dataset are available at github.com/KuoHaoZeng/Visual_Reaction.
[action, agent, visual, future, forecaster, state, current, planner, prediction, environment, catch, forecasting, policy, provide, trajectory, reinforcement, time, catching, sequence, predict, mpc, abhinav, launcher, video, planning, includes, thrown, receives, sdt, reaction, predicting, outperforms, static, movement, kalman, roozbeh, dataset, work] [object, propose, ali, interactive, category] [model, ball, physical, success, change] [based, motion, dynamic, figure] [train, learn] [learning, sampler, problem, uniform, training, acceleration, set, consider, network, best, number, rate, sergey, task, deep, equation, maximum, performance] [drone, position, camera, velocity, david, estimate, angle, approach, mlp, human, ground]
@InProceedings{Zeng_2020_CVPR,
  author = {Zeng, Kuo-Hao and Mottaghi, Roozbeh and Weihs, Luca and Farhadi, Ali},
  title = {Visual Reaction: Learning to Play Catch With Your Drone},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Prime Sample Attention in Object Detection
Yuhang Cao, Kai Chen, Chen Change Loy, Dahua Lin


It is a common paradigm in object detection frameworks to treat all samples equally and aim at maximizing average performance. In this work, we revisit this paradigm through a careful study of how different samples contribute to the overall performance measured in terms of mAP. Our study suggests that the samples in each mini-batch are neither independent nor equally important, and therefore a better classifier on average does not necessarily result in higher mAP. Motivated by this study, we propose the notion of Prime Samples, those that play a key role in driving the detection performance. We further develop a simple yet effective sampling and learning strategy called PrIme Sample Attention (PISA) that directs the focus of the training process towards such samples. Our experiments demonstrate that it is often more effective to focus on prime samples than hard samples when training a detector. Particularly, on the MSCOCO dataset, PISA outperforms the random sampling baseline and hard mining schemes, e.g. OHEM and Focal Loss, consistently by around 2% on both single-stage and two-stage detectors, even with a strong ResNeXt-101 backbone. Code is available at: https://github.com/open-mmlab/mmdetection.
[attention, hierarchical] [prime, iou, pisa, object, hard, positive, detection, regression, score, box, table, faster, map, propose, retinanet, localization, ross, bounding, highest, mask, reweighting, kaiming, backbone, coco, focus, ohem, adopt, located, achieves] [study, effective, strong] [high, ieee, figure, pattern, isr, low, adopted, based, analysis, method, proposed] [loss, learn] [classification, negative, sampling, sample, random, performance, training, rank, higher, mining, average, better, carl, strategy, metric, learning, classifier, simple, ranking, top, class, distribution, function, larger] [conference, computer, local, vision, ground, accurate]
@InProceedings{Cao_2020_CVPR,
  author = {Cao, Yuhang and Chen, Kai and Loy, Chen Change and Lin, Dahua},
  title = {Prime Sample Attention in Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization
Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song


Convolutional neural networks typically encode an input image into a series of intermediate features with decreasing resolutions. While this structure is suited to classification tasks, it does not perform well for tasks requiring simultaneous recognition and localization (e.g., object detection). Encoder-decoder architectures have been proposed to resolve this by applying a decoder network onto a backbone model designed for classification tasks. In this paper, we argue that the encoder-decoder architecture is ineffective at generating strong multi-scale features because of its scale-decreased backbone. We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search. Using similar building blocks, SpineNet models outperform ResNet-FPN models by 3%+ AP at various scales while using 10-20% fewer FLOPs. In particular, SpineNet-190 achieves 52.1% AP on COCO, attaining new state-of-the-art performance for single-model object detection without test-time augmentation. SpineNet can transfer to classification tasks, achieving a 5% top-1 accuracy improvement on the challenging fine-grained iNaturalist dataset. Code is at: https://github.com/tensorflow/tpu/tree/master/models/official/detection.
[decoder, recognition, visual] [feature, backbone, object, detection, table, mask, building, achieves, coco, retinanet, resnet, box, apply, improvement, pyramid, kaiming, bounding, ordering, fish, level, parent, piotr, ross] [model, input, protocol, hourglass, trained] [block, scale, figure, convolutional, intermediate, resolution, output, residual, proposed, spatial] [image, learn, target] [architecture, spinenet, network, search, neural, classification, learned, learning, performance, dimension, design, imagenet, space, training, size, stem, resampling, quoc, scalepermuted, accuracy, inaturalist, computation, deep, large, scaledecreased, applied, candidate, fixed, note] [depth, shape]
@InProceedings{Du_2020_CVPR,
  author = {Du, Xianzhi and Lin, Tsung-Yi and Jin, Pengchong and Ghiasi, Golnaz and Tan, Mingxing and Cui, Yin and Le, Quoc V. and Song, Xiaodan},
  title = {SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
KeyPose: Multi-View 3D Labeling and Keypoint Estimation for Transparent Objects
Xingyu Liu, Rico Jonschkowski, Anelia Angelova, Kurt Konolige


Estimating the 3D pose of desktop objects is crucial for applications such as robotic manipulation. Many existing approaches to this problem require a depth map of the object for both training and prediction, which restricts them to opaque, lambertian objects that produce good returns in an RGBD sensor. In this paper we forgo using a depth sensor in favor of raw stereo input. We address two problems: first, we establish an easy method for capturing and labeling 3D keypoints on desktop objects with an RGB camera; and second, we develop a deep neural network, called KeyPose, that learns to accurately predict object poses using 3D keypoints, from stereo input, and works even for transparent objects. To evaluate the performance of our method, we create a dataset of 15 clear objects in five classes, with 48K 3D-keypoint labeled images. We train both instance and category models, and show generalization to new textures, poses, and objects. KeyPose surpasses state-of-the-art performance in 3D pose estimation on this dataset by factors of 1.5 to 3.5, even in cases where the competing method is provided with ground-truth depth. Stereo input is essential for this performance as it improves results compared to using monocular input by a factor of 2. We will release a public version of the data capture and labeling pipeline, the transparent object database, and the KeyPose models and evaluation code. Project website: https://sites.google.com/corp/view/keypose.
[dataset, late, work, context, predict] [object, labeling, predicted, table, instance, ablation, map, category, detection, location, cnn] [model, input, trained, difference] [method, figure, disparity, mae, crop, capturing, fusion, existing] [image, loss, real, unseen, produce, train, factor] [data, training, large, size, deep, early, labeled, probability, permutation, number, small, good, performance, note, test, problem, evaluate, compared] [pose, depth, stereo, transparent, keypoint, keypoints, estimation, rgb, rgbd, keypose, opaque, camera, error, monocular, densefusion, uvd, left, projection, estimating, point, cad, capture, geometry, require, robotic, single, assume, second, well, pipeline, direct, rigid]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Xingyu and Jonschkowski, Rico and Angelova, Anelia and Konolige, Kurt},
  title = {KeyPose: Multi-View 3D Labeling and Keypoint Estimation for Transparent Objects},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SegGCN: Efficient 3D Point Cloud Segmentation With Fuzzy Spherical Kernel
Huan Lei, Naveed Akhtar, Ajmal Mian


Fuzzy clustering is known to perform well in real-world applications. Inspired by this observation, we incorporate a fuzzy mechanism into discrete convolutional kernels for 3D point clouds as our first major contribution. The proposed fuzzy kernel is defined over a spherical volume that uses discrete bins. Discrete volumetric division can normally make a kernel vulnerable to boundary effects during learning as well as point density during inference. However, the proposed kernel remains robust to boundary conditions and point density due to the fuzzy mechanism. Our second major contribution comes as the proposal of an efficient graph convolutional network, SegGCN for segmenting point clouds. The proposed network exploits ResNet like blocks in the encoder and 1 x 1 convolutions in the decoder. SegGCN capitalizes on the separable convolution operation of the proposed fuzzy kernel for efficiency. We establish the effectiveness of the SegGCN with the proposed kernel on the challenging S3DIS and ScanNet real-world datasets. Our experiments demonstrate that the proposed network can segment over one million points per second with highly competitive performance.
[graph, decoder, mechanism, construct, three] [hard, feature, resnet, semantic, table, boundary, template, apply, miou, segmentation, propose, bin, cnn] [input, radial, robust] [kernel, fuzzy, convolution, proposed, convolutional, ieee, seggcn, pattern, range, based, spatial, lei, separable, fic, block, sph, color, dynamic, output, elu, conv] [target, encoder, image, perform, common] [network, learning, deep, efficient, performance, size, set, discrete, neural, computational, training, batch, standard, search, space, machine, weighted, process, parameter, data] [point, spherical, conference, computer, cloud, vision, kpconv, neighbor, international, local, elevation, scannet, compute, indoor, volumetric, demonstrate, neighborhood, single, vertex]
@InProceedings{Lei_2020_CVPR,
  author = {Lei, Huan and Akhtar, Naveed and Mian, Ajmal},
  title = {SegGCN: Efficient 3D Point Cloud Segmentation With Fuzzy Spherical Kernel},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
nuScenes: A Multimodal Dataset for Autonomous Driving
Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, Oscar Beijbom


Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image based benchmark datasets have driven development in computer vision tasks such as object detection, tracking and segmentation of agents in the environment. Most autonomous vehicles, however, carry a combination of cameras and range sensors such as lidar and radar. As machine learning based methods for detection and tracking become more prevalent, there is a need to train and evaluate such methods on datasets containing range sensor data along with images. In this work we present nuTonomy scenes (nuScenes), the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view. nuScenes comprises 1000 scenes, each 20s long and fully annotated with 3D bounding boxes for 23 classes and 8 attributes. It has 7x as many annotations and 100x as many images as the pioneering KITTI dataset. We define novel 3D detection and tracking metrics. We also provide careful dataset analysis as well as baselines for lidar and image based detection and tracking. Data, development kit and more information are available online.
[dataset, driving, multimodal, provide, time, prediction, vehicle, multiple, evaluation, traffic, urban] [detection, lidar, object, nuscenes, tracking, autonomous, semantic, map, localization, pointpillars, annotated, recall, pedestrian, monodis, iou, table, center, box, benchmark, track, raquel, released] [datasets, release] [sensor, based, range, figure, fusion, method, weather] [image, train, attribute, diverse] [data, average, performance, set, class, metric, network, arxiv, preprint, achieve, deep, precision, training, learning, large, size, baseline, best, open, top, number] [kitti, radar, camera, scene, full, matching, point, monocular, velocity, well, distance, error, capture, single, approach]
@InProceedings{Caesar_2020_CVPR,
  author = {Caesar, Holger and Bankiti, Varun and Lang, Alex H. and Vora, Sourabh and Liong, Venice Erin and Xu, Qiang and Krishnan, Anush and Pan, Yu and Baldan, Giancarlo and Beijbom, Oscar},
  title = {nuScenes: A Multimodal Dataset for Autonomous Driving},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PVN3D: A Deep Point-Wise 3D Keypoints Voting Network for 6DoF Pose Estimation
Yisheng He, Wei Sun, Haibin Huang, Jianran Liu, Haoqiang Fan, Jian Sun


In this work, we present a novel data-driven method for robust 6DoF object pose estimation from a single RGBD image. Unlike previous methods that directly regress pose parameters, we tackle this challenging task with a keypoint-based approach. Specifically, we propose a deep Hough voting network to detect 3D keypoints of objects and then estimate the 6D pose parameters in a least-squares fitting manner. Our method is a natural extension of 2D-keypoint approaches that successfully work on RGB-based 6DoF estimation. It allows us to fully utilize the geometric constraint of rigid objects with the extra depth information and is easy for a network to learn and optimize. Extensive experiments were conducted to demonstrate the effectiveness of 3D-keypoint detection in the 6D pose estimation task. Experimental results also show our method outperforms the state-of-the-art methods by large margins on several benchmarks. Code and video are available at https://github.com/ethnhe/PVN3D.git.
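The final step described above, fitting a 6D pose to corresponding 3D keypoints in a least-squares manner, is essentially the classic Kabsch/Umeyama fit; a minimal NumPy sketch follows (function and argument names are ours).

```python
import numpy as np

def fit_rigid_pose(model_kps: np.ndarray, pred_kps: np.ndarray):
    """Least-squares rigid fit: find R, t with R @ model_kps[i] + t ~= pred_kps[i]."""
    mu_m = model_kps.mean(axis=0)
    mu_p = pred_kps.mean(axis=0)
    H = (model_kps - mu_m).T @ (pred_kps - mu_p)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T             # correct a possible reflection
    t = mu_p - R @ mu_m
    return R, t
```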
[evaluation, dataset, making] [object, semantic, module, segmentation, voting, detection, offset, instance, center, feature, table, clamp, box, predicted, hough, vote, voted, occlusion, add, fps, challenging, detect, extra, belongs] [model, robust, iterative, trained] [ieee, pattern, figure, method, based] [translation, image, distinguish, utilize] [network, deep, learning, large, algorithm, training, selected, performance, space, applied, neural, follow, clustering] [pose, conference, computer, keypoints, keypoint, point, estimation, vision, international, coordinate, rgbd, jointly, fitting, ground, european, directly, rigid, depth, distance, estimate, geometric, rotation, camera, truth, single, rgb, projection, linemod, approach, dense]
@InProceedings{He_2020_CVPR,
  author = {He, Yisheng and Sun, Wei and Huang, Haibin and Liu, Jianran and Fan, Haoqiang and Sun, Jian},
  title = {PVN3D: A Deep Point-Wise 3D Keypoints Voting Network for 6DoF Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Probabilistic Pixel-Adaptive Refinement Networks
Anne S. Wannenwetsch, Stefan Roth


Encoder-decoder networks have found widespread use in various dense prediction tasks. However, the strong reduction of spatial resolution in the encoder leads to a loss of location information as well as boundary artifacts. To address this, image-adaptive post-processing methods have proven beneficial by leveraging the high-resolution input image(s) as guidance data. We extend such approaches by considering an important orthogonal source of information: the network's confidence in its own predictions. We introduce probabilistic pixel-adaptive convolutions (PPACs), which not only depend on image guidance data for filtering, but also respect the reliability of per-pixel predictions. As such, PPACs allow for image-adaptive smoothing and simultaneously propagating pixels of high confidence into less reliable regions, while respecting object boundaries. We demonstrate their utility in refinement networks for optical flow and semantic segmentation, where PPACs lead to a clear reduction in boundary artifacts. Moreover, our proposed refinement step is able to substantially improve the accuracy on various widely used benchmarks.
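The core intuition, filtering weights that combine guidance-image similarity with the network's per-pixel confidence, can be illustrated with a brute-force NumPy sketch. This uses a fixed Gaussian bilateral weight rather than the learned PPAC layer, and all names and parameters are our assumptions.

```python
import numpy as np

def confidence_guided_smoothing(pred, conf, guide, radius=2, sigma=0.1):
    """Smooth `pred` with weights = guidance similarity * neighbour confidence, so that
    reliable pixels propagate into unreliable regions without crossing image boundaries."""
    H, W = pred.shape
    out = np.zeros_like(pred, dtype=np.float64)
    for y in range(H):
        for x in range(W):
            num, den = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        diff = guide[y, x] - guide[yy, xx]
                        g = np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
                        w = g * conf[yy, xx]
                        num += w * pred[yy, xx]
                        den += w
            out[y, x] = num / max(den, 1e-8)
    return out
```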
[prediction, step] [refinement, semantic, confidence, table, advanced, final, object, segmentation, guided, feature, map, apply, improvement, refined, propose] [input, improve, clean] [flow, optical, ppac, guidance, pixel, sintel, proposed, pac, convolution, filtering, ppacs, output, kernel, method, convolutional, spatial, extend, field, fast, bilateral, aee, figure] [image, underlying] [network, probabilistic, data, normalization, learning, probability, deep, test, neural, number, accuracy, weight, standard, simple, reliability, clearly, large, rate, filter, reliable, bayesian, learned] [kitti, well, uncertainty, approach, outlier, dense, joint, additional, estimation, allow, error, leverage]
@InProceedings{Wannenwetsch_2020_CVPR,
  author = {Wannenwetsch, Anne S. and Roth, Stefan},
  title = {Probabilistic Pixel-Adaptive Refinement Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Discovering Human Interactions With Novel Objects via Zero-Shot Learning
Suchen Wang, Kim-Hui Yap, Junsong Yuan, Yap-Peng Tan


We aim to detect human interactions with novel objects through zero-shot learning. Different from previous works, we allow unseen object categories by using their semantic word embeddings. To do so, we design a human-object region proposal network specifically for the human-object interaction detection task. The core idea is to leverage human visual clues to localize objects which are interacting with humans. We show that our proposed model can outperform existing methods on detecting interacting objects, and generalize well to novel objects. To recognize objects from unseen categories, we devise a zero-shot classification module upon the classifier of seen categories. It utilizes the classifier logits for seen categories to estimate a vector in the semantic space, and then performs a nearest-neighbor search to find the closest unseen category. We validate our method on V-COCO and HICO-DET datasets, and obtain superior results on detecting human interactions with both seen and unseen objects.
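The zero-shot classification step described above, mapping seen-class logits into word-embedding space and searching for the closest unseen category, can be sketched in NumPy as follows; the softmax-weighted projection is our assumption about how the logits are used, not necessarily the paper's exact mapping.

```python
import numpy as np

def nearest_unseen_category(seen_logits, seen_word_vecs, unseen_word_vecs):
    """Project seen-class logits to a semantic vector, then return the index of the
    unseen category whose word embedding is closest by cosine similarity."""
    p = np.exp(seen_logits - seen_logits.max())
    p /= p.sum()                                        # softmax over seen classes
    sem = p @ seen_word_vecs                            # (D,) estimated semantic vector
    sem /= np.linalg.norm(sem) + 1e-8
    unseen = unseen_word_vecs / (
        np.linalg.norm(unseen_word_vecs, axis=1, keepdims=True) + 1e-8
    )
    return int(np.argmax(unseen @ sem))
```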
[interaction, visual, verb, embedding, recognition, word, recognize, prediction, attention, previous, embeddings] [object, region, detection, semantic, detect, hoi, score, proposal, category, interactiveness, feature, box, table, module, noninteracting, anchor, hois, rpn, focus, detector, detected, propose, main, predicted, ablation, faster, humanobject, vcoco] [model, detecting] [ieee, figure, based, proposed, pattern, existing, method] [unseen, train, generated, image] [network, learning, set, classification, test, performance, evaluate, classifier, training, top, vector, average, design, knowledge, space, class, probability, softmax, number] [human, novel, conference, vision, computer, interacting, international, estimate, well]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Suchen and Yap, Kim-Hui and Yuan, Junsong and Tan, Yap-Peng},
  title = {Discovering Human Interactions With Novel Objects via Zero-Shot Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Equalization Loss for Long-Tailed Object Recognition
Jingru Tan, Changbao Wang, Buyu Li, Quanquan Li, Wanli Ouyang, Changqing Yin, Junjie Yan


Object recognition techniques using convolutional neural networks (CNN) have achieved great success. However, state-of-the-art object detection methods still perform poorly on large vocabulary and long-tailed datasets, e.g. LVIS. In this work, we analyze this problem from a novel perspective: each positive sample of one category can be seen as a negative sample for other categories, making the tail categories receive more discouraging gradients. Based on it, we propose a simple but effective loss, named equalization loss, to tackle the problem of long-tailed rare categories by simply ignoring those gradients for rare categories. The equalization loss protects the learning of rare categories from being at a disadvantage during the network parameter updating. Thus the model is capable of learning better discriminative features for objects of rare classes. Without any bells and whistles, our method achieves AP gains of 4.1% and 4.8% for the rare and common categories on the challenging LVIS benchmark, compared to the Mask R-CNN baseline. With the utilization of the effective equalization loss, we finally won the 1st place in the LVIS Challenge 2019. Code has been made available at: https://github.com/tztztztztz/eql.detectron2
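A hedged sketch of the idea in PyTorch: with a per-class sigmoid loss, the gradient that would suppress a rare category is simply masked out whenever the proposal is a foreground sample of some other class. The frequency threshold, tensor layout, and function name are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def equalization_style_loss(logits, targets, class_freq, is_foreground, freq_thresh=1e-3):
    """logits, targets: (N, C) with one-hot float targets; class_freq: (C,) per-class
    frequency; is_foreground: (N,) float, 1 for foreground proposals. Rare classes are not
    penalized as negatives on foreground proposals, protecting tail categories from
    discouraging gradients."""
    rare = (class_freq < freq_thresh).float()                           # (C,)
    weight = 1.0 - is_foreground[:, None] * rare[None, :] * (1.0 - targets)
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (weight * loss).sum() / logits.shape[0]
```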
[dataset, recognition] [equalization, category, rare, mask, object, table, lvis, sigmoid, detection, apr, ross, eql, feature, apf, instance, background, foreground, threshold, apc, positive, propose, challenge, improvement, kaiming, cascade, ablation, piotr, achieves, false] [model, analyze, great, influence] [ieee, pattern, method, figure, based, frequency, convolutional] [loss, image, common, introduce] [learning, function, large, negative, number, class, training, weight, set, decay, problem, softmax, sample, tail, gradient, distribution, frequent, neural, discouraging, imbalance, network, better, classification, deep, sampling, performance, data, test, equation, average, classifier, probability, imbalanced] [computer, conference, vision, international, novel, term]
@InProceedings{Tan_2020_CVPR,
  author = {Tan, Jingru and Wang, Changbao and Li, Buyu and Li, Quanquan and Ouyang, Wanli and Yin, Changqing and Yan, Junjie},
  title = {Equalization Loss for Long-Tailed Object Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Depth-Guided Convolutions for Monocular 3D Object Detection
Mingyu Ding, Yuqi Huo, Hongwei Yi, Zhe Wang, Jianping Shi, Zhiwu Lu, Ping Luo


3D object detection from a single image without LiDAR is a challenging task due to the lack of accurate depth information. Conventional 2D convolutions are unsuitable for this task because they fail to capture the local object and its scale information, which are vital for 3D object detection. To better represent 3D structure, prior works typically transform depth maps estimated from 2D images into a pseudo-LiDAR representation, and then apply existing 3D point-cloud based object detectors. However, their results depend heavily on the accuracy of the estimated depth maps, resulting in suboptimal performance. In this work, instead of using the pseudo-LiDAR representation, we improve on fundamental 2D convolutions by proposing a new local convolutional network (LCN), termed Depth-guided Dynamic-Depthwise-Dilated LCN (D4LCN), where the filters and their receptive fields can be automatically learned from image-based depth maps, so that different pixels of different images have different filters. D4LCN overcomes the limitation of conventional 2D convolutions and narrows the gap between image representation and 3D point cloud representation. Extensive experiments show that D4LCN outperforms existing works by large margins. For example, the relative improvement of D4LCN against the state-of-the-art on KITTI is 9.1% in the moderate setting. D4LCN ranks 1st on the KITTI monocular 3D object detection benchmark at the time of submission (car, December 2019). The code is available at https://github.com/dingmyu/D4LCN
[three, shift, dataset] [detection, object, feature, map, lcn, box, moderate, module, anchor, easy, hard, regression, table, autonomous, car, denotes] [input] [convolutional, convolution, filtering, method, adaptive, receptive, figure, output, dilation, kernel, scale, fail, dynamic, dilated, based, channel, fusion, extraction, result] [image, learn, loss, generated, gap, generate, generation, corresponding] [network, filter, learning, set, layer, number, arxiv, preprint, deep, better, rate, size, accuracy, problem, note, training, class, data, large, performance] [depth, monocular, point, local, rgb, kitti, estimated, ground, accurate, view, capture, cloud, structure]
@InProceedings{Ding_2020_CVPR,
  author = {Ding, Mingyu and Huo, Yuqi and Yi, Hongwei and Wang, Zhe and Shi, Jianping and Lu, Zhiwu and Luo, Ping},
  title = {Learning Depth-Guided Convolutions for Monocular 3D Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather
Mario Bijelic, Tobias Gruber, Fahim Mannan, Florian Kraus, Werner Ritter, Klaus Dietmayer, Felix Heide


The fusion of multimodal sensor streams, such as camera, lidar, and radar measurements, plays a critical role in object detection for autonomous vehicles, which base their decision making on these inputs. While existing methods exploit redundant information in good environmental conditions, they fail in adverse weather where the sensory streams can be asymmetrically distorted. These rare "edge-case" scenarios are not represented in available datasets, and existing fusion architectures are not designed to handle them. To address this challenge we present a novel multimodal dataset acquired in over 10,000 km of driving in northern Europe. Although this dataset is the first large multimodal dataset in adverse weather, with 100k labels for lidar, camera, radar, and gated NIR sensors, it does not facilitate training as extreme weather is rare. To this end, we present a deep fusion network for robust fusion without a large corpus of labeled training data covering all asymmetric distortions. Departing from proposal-level fusion, we propose a single-shot model that adaptively fuses features, driven by measurement entropy. We validate the proposed method, trained on clean data, on our extensive validation dataset. Code and data are available here https://github.com/princeton-computational-imaging/SeeingThroughFog.
[multimodal, dataset, gated, driving, work, provide, road, exchange] [detection, lidar, feature, object, autonomous, ssd, including, rare, propose, hard, semantic] [model, datasets, trained, input, robust] [weather, fusion, ieee, adverse, sensor, fog, pattern, proposed, existing, figure, clear, automotive, adaptive, sensory, fir, resolution, illumination, based, covering, extraction, validate, convolutional, severe, method] [image, domain, adaptation, asymmetric, unseen, real] [data, deep, entropy, training, network, performance, large, architecture, rate, test, learning, labeled, neural, processing] [conference, computer, vision, camera, radar, international, scene, rgb, depth, single, dense, intelligent, measurement, rely, system, supplemental, estimation, limited]
@InProceedings{Bijelic_2020_CVPR,
  author = {Bijelic, Mario and Gruber, Tobias and Mannan, Fahim and Kraus, Florian and Ritter, Werner and Dietmayer, Klaus and Heide, Felix},
  title = {Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Don't Even Look Once: Synthesizing Features for Zero-Shot Detection
Pengkai Zhu, Hanxiao Wang, Venkatesh Saligrama


Zero-shot detection, namely localizing both seen and unseen objects, is increasingly important for large-scale applications with a large number of object classes, since collecting sufficient annotated data with ground-truth bounding boxes is simply not scalable. While vanilla deep neural networks deliver high performance for objects available during training, detection of unseen objects degrades significantly. At a fundamental level, while vanilla detectors are capable of proposing bounding boxes that include unseen objects, they are often incapable of assigning high confidence to unseen objects, due to the inherent precision/recall trade-off that requires rejecting background objects. We propose a novel detection algorithm, "Don't Even Look Once (DELO)," that synthesizes visual features for unseen objects and augments existing training algorithms to incorporate unseen object detection. Our proposed scheme is evaluated on Pascal VOC and MSCOCO, and we demonstrate significant improvements in test accuracy over vanilla and other state-of-the-art zero-shot detectors.
[visual, recognition, evaluation, three, prediction, decoder] [confidence, bounding, object, detection, feature, semantic, background, pascal, box, split, score, foreground, map, propose, voc, table, objectness] [trained, original, model, venkatesh, datasets, improve] [ieee, pattern, high, proposed, method, june, existing, cell] [unseen, consistency, zsd, loss, generalized, delo, image, attribute, checker, dres, synthetic, generative, generated, conditional, generator, cvae, train, real, synthesize, latent, nunseen, gzsd] [performance, learning, class, predictor, training, vanilla, number, data, size, set, large, rate, classification, consider, evaluate, neural, metric, classifier] [computer, conference, vision, well, ground, truth]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Pengkai and Wang, Hanxiao and Saligrama, Venkatesh},
  title = {Don't Even Look Once: Synthesizing Features for Zero-Shot Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
EPOS: Estimating 6D Pose of Objects With Symmetries
Tomas Hodan, Daniel Barath, Jiri Matas


We present a new method for estimating the 6D pose of rigid objects with available 3D models from a single RGB input image. The method is applicable to a broad range of objects, including challenging ones with global or partial symmetries. An object is represented by compact surface fragments which allow handling symmetries in a systematic manner. Correspondences between densely sampled pixels and the fragments are predicted using an encoder-decoder network. At each pixel, the network predicts: (i) the probability of each object's presence, (ii) the probability of the fragments given the object's presence, and (iii) the precise 3D location on each fragment. A data-dependent number of corresponding 3D locations is selected per pixel, and poses of possibly multiple object instances are estimated using a robust and efficient variant of the PnP-RANSAC algorithm. In the BOP Challenge 2019, the method outperforms all RGB and most RGB-D and D methods on the T-LESS and LM-O datasets. On the YCB-V dataset, it is superior to all competitors, with a large margin over the second-best RGB method. Source code is at: cmp.felk.cvut.cz/epos.
[multiple, time, predict, predicting, three, includes, evaluation] [object, location, predicted, regression, global, precise, including, challenging, challenge, threshold, detection] [robust, model, input, case] [method, pixel, range, proposed, achieved, convolutional, spatial] [image, corresponding, loss] [number, network, set, average, training, probability, performance, deep, efficient, learning, algorithm, sampled, variant, neural, accuracy, random, test] [pose, fragment, surface, estimation, rgb, single, armspd, epos, partial, estimating, estimate, estimated, visible, local, bop, defined, distance, bij, ransac, epnp, vincent, accurate, fitting, solver, carsten]
@InProceedings{Hodan_2020_CVPR,
  author = {Hodan, Tomas and Barath, Daniel and Matas, Jiri},
  title = {EPOS: Estimating 6D Pose of Objects With Symmetries},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Train in Germany, Test in the USA: Making 3D Object Detectors Generalize
Yan Wang, Xiangyu Chen, Yurong You, Li Erran Li, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger, Wei-Lun Chao


In the domain of autonomous driving, deep learning has substantially improved the 3D object detection accuracy for LiDAR and stereo camera data alike. While deep networks are great at generalization, they are also notorious for overfitting to all kinds of spurious artifacts, such as brightness, car sizes and models, that may appear consistently throughout the data. In fact, most datasets for autonomous driving are collected within a narrow subset of cities within one country, typically under similar weather conditions. In this paper we consider the task of adapting 3D object detectors from one dataset to another. We observe that naively, this appears to be a very challenging task, resulting in drastic drops in accuracy levels. We provide extensive experiments to investigate the true adaptation challenges and arrive at a surprising conclusion: the primary adaptation hurdle to overcome is the difference in car sizes across geographic areas. A simple correction based on the average car size yields a strong correction of the adaptation gap. Our proposed method is simple and easily incorporated into most 3D object detection frameworks. It provides a first baseline for 3D object detection adaptation across countries, and gives hope that the underlying problem may be more within grasp than one may have hoped to believe. Our code is available at https://github.com/cxy1997/3D_adapt_auto_driving.
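To make the reported size correction concrete, below is a minimal NumPy sketch of rescaling predicted 3D box dimensions by the ratio of average car sizes between domains. The function name, box layout, and the example mean sizes are illustrative assumptions, not the authors' released code.

import numpy as np

def rescale_boxes(boxes, src_mean_size, tgt_mean_size):
    """Rescale predicted 3D box dimensions from a source-domain detector
    toward the target domain's average car size.

    boxes: (N, 7) array of [x, y, z, l, w, h, yaw].
    src_mean_size, tgt_mean_size: (3,) mean car dimensions per domain.
    """
    boxes = boxes.copy()
    ratio = np.asarray(tgt_mean_size) / np.asarray(src_mean_size)
    boxes[:, 3:6] *= ratio  # scale length/width/height, keep center and yaw
    return boxes

# Illustrative mean sizes only (not statistics from the paper).
src_mean = np.array([3.9, 1.6, 1.5])   # source-domain cars
tgt_mean = np.array([4.8, 1.9, 1.7])   # target-domain cars
preds = np.array([[10.0, 2.0, -1.0, 3.8, 1.6, 1.5, 0.1]])
print(rescale_boxes(preds, src_mean, tgt_mean))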
[argoverse, dataset, driving, multiple, provide] [object, detection, waymo, car, nuscenes, box, lyft, lidar, bounding, detector, autonomous, rcnn, oint, table, iou, hard, raquel, semantic, proposal, focus, detected, predicted, moderate, bharath] [trained, datasets, model, input, collected, tested, difference, percentage] [figure, based, analysis, output, adaptive, captured, result, clear] [domain, target, adaptation, source, corresponding, yan, adapting] [size, training, learning, performance, deep, data, statistical, average, labeled, investigate, validation, distribution, network, report, normalization, simple, number, setting, neural, mark, accuracy] [kitti, point, stereo, cloud, depth, view, single, well, scene]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yan and Chen, Xiangyu and You, Yurong and Li, Li Erran and Hariharan, Bharath and Campbell, Mark and Weinberger, Kilian Q. and Chao, Wei-Lun},
  title = {Train in Germany, Test in the USA: Making 3D Object Detectors Generalize},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Exploring Categorical Regularization for Domain Adaptive Object Detection
Chang-Dong Xu, Xing-Ran Zhao, Xin Jin, Xiu-Shen Wei


In this paper, we tackle the domain adaptive object detection problem, where the main challenge lies in significant domain gaps between source and target domains. Previous work seeks to plainly align image-level and instance-level shifts to eventually minimize the domain discrepancy. However, such methods still overlook matching crucial image regions and important instances across domains, which strongly affects domain shift mitigation. In this work, we propose a simple but effective categorical regularization framework for alleviating this issue. It can be applied as a plug-and-play component on a series of Domain Adaptive Faster R-CNN methods which are prominent for dealing with domain adaptive detection. Specifically, by integrating an image-level multi-label classifier upon the detection backbone, we can obtain the sparse but crucial image regions corresponding to categorical information, thanks to the weak localization ability of the classification manner. Meanwhile, at the instance level, we leverage the categorical consistency between image-level predictions (by the classifier) and instance-level predictions (by the detection head) as a regularization factor to automatically hunt for the hard aligned instances of target domains. Extensive experiments on various domain shift scenarios show that our method obtains a significant performance gain over original Domain Adaptive Faster R-CNN detectors. Furthermore, qualitative visualization and analyses demonstrate the ability of our method to attend to the key regions/instances for domain adaptation. Our code is open-source and available at https://github.com/Megvii-Nanjing/CR-DA-DET.
[shift, dataset] [detection, faster, object, categorical, framework, instance, backbone, crucial, pascal, hard, feature, weakly, voc, module, localization, visualization, semantic, region, table, ross, main, background, global, category, kaiming, interest, annotated, detector, map, imagelevel] [series, adversarial, model, trained, improve, original] [adaptive, figure, foggy, weather, method, convolutional, cnns, output] [domain, target, alignment, source, image, consistency, adaptation, loss, ability, aligned, train, ccr, align, learn] [regularization, training, performance, learning, better, large, denote, weight, dissimilar, network, deep, set, baseline] [scene, local, accurate, enables]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Chang-Dong and Zhao, Xing-Ran and Jin, Xin and Wei, Xiu-Shen},
  title = {Exploring Categorical Regularization for Domain Adaptive Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Implicit Embedding for Point Cloud Analysis
Kent Fujiwara, Taiichi Hashimoto


We present a novel representation for point clouds that encapsulates the local characteristics of the underlying structure. The key idea is to embed an implicit representation of the point cloud, namely the distance field, into neural networks. One neural network is used to embed a portion of the distance field around a point. The resulting network weights are concatenated to be used as a representation of the corresponding point cloud instance. To enable comparison among the weights, Extreme Learning Machine (ELM) is employed as the embedding network. Invariance to scale and coordinate change can be achieved by introducing a scale commutative activation layer to the ELM, and aligning the distance field into a canonical pose. Experimental results using our representation demonstrate that our proposal is capable of similar or better classification and segmentation performance compared to the state-of-the-art point-based methods, while requiring less time for training.
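The abstract describes embedding the local distance field into an Extreme Learning Machine and using the resulting network weights as the descriptor. A rough NumPy sketch of that idea follows; the sampling pattern, hidden size, toy distance field, and the use of only the fitted output weights as the descriptor are assumptions (the canonical-pose alignment and scale-commutative activation are omitted).

import numpy as np

def elm_embed(query_pts, dist_vals, W_hidden, b_hidden):
    """Fit the output weights of an Extreme Learning Machine so it regresses
    the local distance field, and use those weights as a point descriptor.

    query_pts: (M, 3) sample locations around a point.
    dist_vals: (M,) distance-field values at those locations.
    W_hidden, b_hidden: fixed random hidden layer shared by all points,
    which is what makes descriptors comparable across point clouds.
    """
    H = np.tanh(query_pts @ W_hidden + b_hidden)           # (M, hidden)
    beta, *_ = np.linalg.lstsq(H, dist_vals, rcond=None)   # least-squares fit
    return beta                                            # (hidden,) descriptor

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 64)), rng.normal(size=64)
pts = rng.uniform(-0.1, 0.1, size=(256, 3))
d = np.linalg.norm(pts, axis=1)   # toy distance field: distance to the origin
print(elm_embed(pts, d, W, b).shape)   # (64,)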
[embedding, embed, trainable, convert] [global, segmentation, feature, table, instance, extreme, object, propose, proposal, surrounding] [original, conduct, input, change] [ieee, method, field, pattern, proposed, scale, convolutional, analysis, comparison, figure, applying, prior, based, achieved] [representation, invariance, corresponding, train, invariant, unsupervised] [sampling, neural, data, network, learning, classification, accuracy, deep, set, space, number, random, training, scaling, activation, function, efficient, compared, fixed, vector, test, machine, matrix, fact] [point, distance, cloud, conference, computer, elm, sphere, local, vision, modelnet, implicit, coordinate, shape, surface, canonical, international, rotation, unique, radius, capture, demonstrate, directly, hao]
@InProceedings{Fujiwara_2020_CVPR,
  author = {Fujiwara, Kent and Hashimoto, Taiichi},
  title = {Neural Implicit Embedding for Point Cloud Analysis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Pose-Guided Visible Part Matching for Occluded Person ReID
Shang Gao, Jingya Wang, Huchuan Lu, Zimo Liu


Occluded person re-identification is a challenging task as the appearance varies substantially with various obstacles, especially in the crowd scenario. To address this issue, we propose a Pose-guided Visible Part Matching (PVPM) method that jointly learns the discriminative features with pose-guided attention and self-mines the part visibility in an end-to-end framework. Specifically, the proposed PVPM includes two key components: 1) a pose-guided attention (PGA) method for part feature pooling that exploits more discriminative local features; 2) a pose-guided visibility predictor (PVP) that estimates whether a part suffers from occlusion or not. As there are no ground truth training annotations for the occluded part, we exploit the characteristic of part correspondence in positive pairs and self-mine the correspondence scores via graph matching. The generated correspondence scores are then utilized as pseudo-labels for the visibility predictor (PVP). Experimental results on three reported occluded benchmarks show that the proposed method achieves performance competitive with state-of-the-art methods. The source codes are available at https://github.com/hh23333/PVPM.
[attention, graph, three, dataset, extract, prediction] [occluded, feature, pvpm, score, table, map, pvp, propose, rpp, occlusion, pga, including, region, gallery, pcb, pooling, positive, pedestrian, global, framework] [visibility, model, probe, input, experimental] [method, proposed, result, figure, comparison, formulated] [person, loss, reid, discriminative, image, target, appearance, utilize, learn, train, corresponding, generated] [training, performance, learning, predictor, data, number, problem, reported, better, setting, compared, network, set, indicates, manually, deep, task] [matching, pose, correspondence, visible, partial, body, local, distance, demonstrate, solve, estimation, human]
@InProceedings{Gao_2020_CVPR,
  author = {Gao, Shang and Wang, Jingya and Lu, Huchuan and Liu, Zimo},
  title = {Pose-Guided Visible Part Matching for Occluded Person ReID},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ContourNet: Taking a Further Step Toward Accurate Arbitrary-Shaped Scene Text Detection
Yuxin Wang, Hongtao Xie, Zheng-Jun Zha, Mengting Xing, Zilong Fu, Yongdong Zhang


Scene text detection has witnessed rapid development in recent years. However, there still exist two main challenges: 1) many methods suffer from false positives in their text representations; 2) the large scale variance of scene texts makes it hard for the network to learn samples. In this paper, we propose the ContourNet, which effectively handles these two problems, taking a further step toward accurate arbitrary-shaped text detection. At first, a scale-insensitive Adaptive Region Proposal Network (Adaptive-RPN) is proposed to generate text proposals by only focusing on the Intersection over Union (IoU) values between predicted and ground-truth bounding boxes. Then a novel Local Orthogonal Texture-aware Module (LOTM) models the local texture information of proposal features in two orthogonal directions and represents a text region with a set of contour points. Considering that the strong unidirectional or weakly orthogonal activation is usually caused by the monotonous texture characteristic of false-positive patterns (e.g., streaks), our method effectively suppresses these false positives by only outputting predictions with high response value in both orthogonal directions. This gives a more accurate description of text regions. Extensive experiments on three challenging datasets (Total-Text, CTW1500 and ICDAR2015) verify that our method achieves state-of-the-art performance. Code is available at https://github.com/wangyuxin87/ContourNet.
[text, curved, dataset, considering, long, three, modeling] [detection, bounding, region, contour, box, proposal, lotm, predicted, fps, regression, achieves, effectiveness, improvement, horizontal, iou, propose, object, xiang, response, segmentation, localization, rpn, psenet, table, module, feature] [effectively, model, strong, detecting, datasets] [method, proposed, ieee, scale, pattern, based, convolutional, conventional, adaptive] [texture, loss, arbitrary, lomo, representation] [orthogonal, large, performance, training, network, variance, learning, algorithm, set, implemented, size, deep, problem, compared, better] [scene, conference, point, computer, local, international, accurate, vision, shape, approach, direction, vertical, jointly]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yuxin and Xie, Hongtao and Zha, Zheng-Jun and Xing, Mengting and Fu, Zilong and Zhang, Yongdong},
  title = {ContourNet: Taking a Further Step Toward Accurate Arbitrary-Shaped Scene Text Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Exploring Data Aggregation in Policy Learning for Vision-Based Urban Autonomous Driving
Aditya Prakash, Aseem Behl, Eshed Ohn-Bar, Kashyap Chitta, Andreas Geiger


Data aggregation techniques can significantly improve vision-based policy learning within a training environment, e.g., learning to drive in a specific simulation condition. However, as on-policy data is sequentially sampled and added in an iterative manner, the policy can specialize and overfit to the training conditions. For real-world applications, it is useful for the learned policy to generalize to novel scenarios that differ from the training conditions. To improve policy learning while maintaining robustness when training end-to-end driving policies, we perform an extensive analysis of data aggregation techniques in the CARLA environment. We demonstrate how the majority of them have poor generalization performance, and develop a novel approach with empirically better generalization performance compared to existing techniques. Our two key ideas are (1) to sample critical states from the collected on-policy data based on the utility they provide to the learned policy in terms of driving behavior, and (2) to incorporate a replay buffer which progressively focuses on the high uncertainty regions of the policy's state distribution. We evaluate the proposed approach on the CARLA NoCrash benchmark, focusing on the most challenging driving scenarios with dense pedestrian and vehicle traffic. Our approach improves driving success rate by 16% over state-of-the-art, achieving 87% of the expert performance while also reducing the collision rate by an order of magnitude without the use of any additional modality, auxiliary tasks, architectural modifications or reward from the environment.
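One of the two stated ideas, sampling critical states from on-policy data according to their utility, can be sketched roughly as keeping the states where the learned policy deviates most from the expert. This is only a hedged illustration; the actual utility measure and the uncertainty-focused replay buffer in the paper are more involved, and all names and shapes below are assumptions.

import numpy as np

def select_critical_states(states, policy_actions, expert_actions, top_frac=0.2):
    """Keep the on-policy states where the learned policy deviates most from
    the expert, as a simple proxy for how useful relabeling them would be."""
    discrepancy = np.linalg.norm(policy_actions - expert_actions, axis=1)
    n_keep = max(1, int(top_frac * len(states)))
    idx = np.argsort(discrepancy)[-n_keep:]        # highest-discrepancy states
    return states[idx], expert_actions[idx]

rng = np.random.default_rng(0)
states = rng.normal(size=(500, 32))                      # e.g. rollout features
pi_a = rng.normal(size=(500, 2))                         # policy's steering/throttle
exp_a = pi_a + rng.normal(scale=0.1, size=(500, 2))      # expert relabels
crit_s, crit_a = select_critical_states(states, pi_a, exp_a)
print(crit_s.shape, crit_a.shape)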
[policy, driving, dagger, critical, expert, urban, iter, imitation, state, behavior, dataset, failure, recognition, town, reinforcement, brake, carla, traffic, timed, context] [autonomous, aggregation, table] [success, iterative, trained, deviation, collected, model, generalization, examine, case, improve] [ieee, based, pattern, weather, analysis, dynamic] [learn, control, supervised, conditional, smile, generalize] [learning, data, training, sampling, performance, better, learned, rate, variance, sample, replay, active, distribution, algorithm, buffer, observe, machine, compared, setting, deep, lead, proportion, problem, dart, set, random, andrew, neural, sampled, evaluate, indicates] [vision, computer, international, dense, approach, collision, uncertainty, error, robotics, intelligent]
@InProceedings{Prakash_2020_CVPR,
  author = {Prakash, Aditya and Behl, Aseem and Ohn-Bar, Eshed and Chitta, Kashyap and Geiger, Andreas},
  title = {Exploring Data Aggregation in Policy Learning for Vision-Based Urban Autonomous Driving},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Look-Into-Object: Self-Supervised Structure Modeling for Object Recognition
Mohan Zhou, Yalong Bai, Wei Zhang, Tiejun Zhao, Tao Mei


Most object recognition approaches predominantly focus on learning discriminative visual patterns, while overlooking the holistic object structure. Though important, structure modeling usually requires significant manual annotations and therefore is labor-intensive. In this paper, we propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions into the traditional framework. We show the recognition backbone can be substantially enhanced for more robust representation learning, without any cost of extra annotation and inference speed. Specifically, we first propose an object-extent learning module for localizing the object according to the visual patterns shared among the instances in the same category. We then design a spatial context learning module for modeling the internal structures of the object, through predicting the relative positions within the extent. These two modules can be easily plugged into any backbone networks during training and detached at inference time. Extensive experiments show that our look-into-object approach (LIO) achieves large performance gain on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft). We also show that this learning paradigm is highly generalizable to other tasks such as object detection and segmentation (MS COCO). Project page: https://github.com/JDAI-CV/LIO.
[recognition, visual, context, modeling, attention, three, localizing, understanding] [object, module, backbone, feature, lio, scl, extent, oel, positive, polar, mask, segmentation, semantic, detection, region, table, main, focus, propose, annotation, map, localization, lcls, framework, coco, warbler, holistic, extra, predicted, correlation, commonality] [model, trained, generic, input] [spatial, ieee, convolutional, proposed, pattern, method, based, figure] [image, discriminative, structural, learn, representation, pseudo, tao, introduce, ability] [learning, network, classification, deep, neural, performance, inference, baseline, general, better, size, layer, training, number, basic, accuracy] [computer, conference, structure, vision, additional, international, relative]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Mohan and Bai, Yalong and Zhang, Wei and Zhao, Tiejun and Mei, Tao},
  title = {Look-Into-Object: Self-Supervised Structure Modeling for Object Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Recognizing Objects From Any View With Object and Viewer-Centered Representations
Sainan Liu, Vincent Nguyen, Isaac Rehg, Zhuowen Tu


In this paper, we tackle an important task in computer vision: any view object recognition. In both training and testing, for each object instance, we are only given its 2D image viewed from an unknown angle. We propose a computational framework by designing object and viewer-centered neural networks (OVCNet) to recognize an object instance viewed from an arbitrary unknown angle. OVCNet consists of three branches that respectively implement object-centered, 3D viewer-centered, and in-plane viewer-centered recognition. We evaluate our proposed OVCNet using two metrics with unseen views from both seen and novel object instances. Experimental results demonstrate the advantages of OVCNet over classic 2D-image-based CNN classifiers, 3D-object (inferred from 2D image) classifiers, and competing multi-view based approaches. It gives rise to a viable and practical computing framework that combines both viewpoint-dependent and viewpoint-independent features for object recognition from any view.
[recognition, three, dataset, recognize] [object, table, module, branch, pascal, feature, ablation, final, instance, split] [model, ensemble, input, trained, study] [cnns, based, figure, convolutional] [image, representation, unseen, texture, arbitrary, pretrained, grayscale] [network, accuracy, training, classification, neural, test, learning, performance, evaluate, augmentation, layer, deep, set, subset, familiar, data, select, augmented, number, better, compared, algorithm, class, selection, processing, problem, standard] [spherical, genre, view, ovcnet, viewpoint, novel, shape, gmiro, gmivo, reconstruction, single, rotation, genretex, seeninstances, well, ocb, point, novelinstances, mvcnn, hao, computer, system]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Sainan and Nguyen, Vincent and Rehg, Isaac and Tu, Zhuowen},
  title = {Recognizing Objects From Any View With Object and Viewer-Centered Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Gated Channel Transformation for Visual Recognition
Zongxin Yang, Linchao Zhu, Yu Wu, Yi Yang


In this work, we propose a generally applicable transformation unit for visual recognition with deep convolutional neural networks. This transformation explicitly models channel relationships with explainable control variables. These variables determine the neuron behaviors of competition or cooperation, and they are jointly optimized with the convolutional weight towards more accurate recognition. In Squeeze-and-Excitation (SE) Networks, the channel relationships are implicitly learned by fully connected layers, and the SE block is integrated at the block-level. We instead introduce a channel normalization layer to reduce the number of parameters and computational complexity. This lightweight layer incorporates a simple l2 normalization, enabling our transformation unit applicable to operator-level without much increase of additional parameters. Extensive experiments demonstrate the effectiveness of our unit with clear margins on many vision tasks, i.e., image classification on ImageNet, object detection and instance segmentation on COCO, video classification on Kinetics.
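A minimal PyTorch sketch of the transformation as the abstract describes it: a per-channel embedding with a learnable scale, a lightweight l2-style normalization across channels, and a gate applied to the input. The exact gating nonlinearity, constants, and initialization here are assumptions rather than the paper's reference code.

import torch
import torch.nn as nn

class GCTSketch(nn.Module):
    """Sketch of a gated channel transformation: per-channel l2 embedding,
    normalization across channels, then a (1 + tanh) gate on the input."""
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):                                    # x: (N, C, H, W)
        embedding = self.alpha * x.pow(2).sum(dim=(2, 3), keepdim=True).add(self.eps).sqrt()
        norm = embedding * torch.rsqrt(embedding.pow(2).mean(dim=1, keepdim=True) + self.eps)
        gate = 1.0 + torch.tanh(self.gamma * norm + self.beta)   # identity at init
        return x * gate

x = torch.randn(2, 64, 8, 8)
print(GCTSketch(64)(x).shape)   # torch.Size([2, 64, 8, 8])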
[embedding, visual, context, attention, cooperation, gated, recognition, work, behavior] [table, feature, improvement, propose, achieves, global, module, employ, mask, stage, kaiming, segmentation, improves, apply, val, backbone] [input, original, effective, model, norm, create, improve] [channel, convolutional, residual, based, output, scale, cnns] [adaptation, image, train] [gct, normalization, gating, training, deep, layer, learning, weight, performance, variance, neural, better, network, competition, imagenet, activation, compared, evaluate, design, small, computational, deeper, efficient, rate, number, complexity, batch, close, gate, ratio, reduce, large, resnets, process, applied, promising] [compare, transformation, error]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Zongxin and Zhu, Linchao and Wu, Yu and Yang, Yi},
  title = {Gated Channel Transformation for Visual Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Non-Local Neural Networks With Grouped Bilinear Attentional Transforms
Lu Chi, Zehuan Yuan, Yadong Mu, Changhu Wang


Modeling spatial or temporal long-range dependency plays a key role in deep neural networks. Conventional dominant solutions include recurrent operations on sequential data or deeply stacking convolutional layers with small kernel size. Recently, a number of non-local operators (such as self-attention based) have been devised. They are typically generic and can be plugged into many existing network pipelines for globally computing among any two neurons in a feature map. This work proposes a novel non-local operator. It is inspired by the attention mechanism of human visual system, which can quickly attend to important local parts in sight and suppress other less-relevant information. The core of our method is learnable and data-adaptive bilinear attentional transform (BA-Transform), whose merits are three-folds: first, BA-Transform is versatile to model a wide spectrum of local or global attentional operations, such as emphasizing specific local regions. Each BA-Transform is learned in a data-adaptive way; Secondly, to address the discrepancy among features, we further design grouped BA-Transforms, which essentially apply different attentional operations to different groups of feature channels; Thirdly, many existing non-local operators are computation-intensive. The proposed BA-Transform is implemented by simple matrix multiplication and admits better efficacy. For empirical evaluation, we perform comprehensive experiments on two large-scale benchmarks, ImageNet and Kinetics, for image / video classification respectively. The achieved accuracies and various ablation experiments consistently demonstrate significant improvement by large margins.
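The core operation, a bilinear attentional transform implemented by plain matrix multiplication and applied per channel group, might look like the following sketch. How the transform matrices A and B are predicted from the input (the "data-adaptive" part) is omitted, and all shapes are assumptions.

import torch

def grouped_bilinear_transform(x, A, B, groups):
    """Apply Y = A @ X @ B to each channel group of a feature map.

    x: (N, C, H, W) features; A: (N, groups, H, H); B: (N, groups, W, W).
    """
    n, c, h, w = x.shape
    xg = x.view(n, groups, c // groups, h, w)
    # broadcast each group's transform over the channels in that group
    y = torch.einsum('nghi,ngcij,ngjk->ngchk', A, xg, B)
    return y.reshape(n, c, h, w)

x = torch.randn(2, 16, 7, 7)
A = torch.randn(2, 4, 7, 7)
B = torch.randn(2, 4, 7, 7)
print(grouped_bilinear_transform(x, A, B, groups=4).shape)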
[attention, video, temporal, attentional, time, bilinear, work, visual, recurrent, action, modeling, inspired] [table, feature, global, adopt, pooling, ablation, improvement, map, effectiveness, kaiming, focus] [input, model, conduct] [convolutional, proposed, method, spatial, block, bat, conv, figure, based, gflops, channel, resolution, column, existing, residual, receptive, transform, high] [image, row] [neural, network, matrix, deep, classification, learning, number, imagenet, set, large, architecture, accuracy, learned, group, andrew, design, deeper, efficient, performance, reduce, baseline, better, achieve] [transformation, local, full, human]
@InProceedings{Chi_2020_CVPR,
  author = {Chi, Lu and Yuan, Zehuan and Mu, Yadong and Wang, Changhu},
  title = {Non-Local Neural Networks With Grouped Bilinear Attentional Transforms},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Generative-Discriminative Feature Representations for Open-Set Recognition
Pramuditha Perera, Vlad I. Morariu, Rajiv Jain, Varun Manjunatha, Curtis Wigington, Vicente Ordonez, Vishal M. Patel


We address the problem of open-set recognition, where the goal is to determine if a given sample belongs to one of the classes used for training a model (known classes). The main challenge in open-set recognition is to disentangle open-set samples that produce high class activations from known-set samples. We propose two techniques to force class activations of open-set samples to be low. First, we train a generative model for all known classes and then augment the input with the representation obtained from the generative model to learn a classifier. This network learns to associate high classification probabilities both when the image content is from the correct class and when the input and the reconstructed image are consistent with each other. Second, we use self-supervision to force the network to learn more informative features when assigning class scores to improve separation of classes from each other and from open-set samples. We evaluate the performance of the proposed method with recent open-set recognition works across three datasets, where we obtain state-of-the-art results.
[recognition, dataset, considering, vehicle, passed] [feature, detection, table, cnn, positive, object, score, branch] [input, model, trained, decision, study, sophisticated] [proposed, method, figure, ieee, high, pattern, half, separation, disparity, june, vishal, based, carried, conventional] [generative, image, produced, learn, representation, produce, producing, unsupervised, loss, richer, pramuditha, adobe] [network, class, classification, classifier, performance, learning, training, space, set, better, baseline, activation, augmented, considered, deep, softmax, sample, compared, data, machine, lower, vanilla, consider, openmax, svhn, open, problem, learned, randomly] [conference, vision, computer, transformation, reconstructed, international]
@InProceedings{Perera_2020_CVPR,
  author = {Perera, Pramuditha and Morariu, Vlad I. and Jain, Rajiv and Manjunatha, Varun and Wigington, Curtis and Ordonez, Vicente and Patel, Vishal M.},
  title = {Generative-Discriminative Feature Representations for Open-Set Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
RPM-Net: Robust Point Matching Using Learned Features
Zi Jian Yew, Gim Hee Lee


Iterative Closest Point (ICP) solves the rigid point cloud registration problem iteratively in two steps: (1) make hard assignments of spatially closest point correspondences, and then (2) find the least-squares rigid transformation. The hard assignments of closest point correspondences based on spatial distances are sensitive to the initial rigid transformation and noisy/outlier points, which often cause ICP to converge to wrong local minima. In this paper, we propose the RPM-Net -- a less sensitive to initialization and more robust deep learning-based approach for rigid point cloud registration. To this end, our network uses the differentiable Sinkhorn layer and annealing to get soft assignments of point correspondences from hybrid features learned from both spatial coordinates and local geometry. To further improve registration performance, we introduce a secondary network to predict optimal annealing parameters. Unlike some existing methods, our RPM-Net handles missing correspondences and point clouds with partial visibility. Experimental results show that our RPM-Net achieves state-of-the-art performance compared to existing non-deep learning and recent deep learning methods. Our source code is available at the project website (https://github.com/yewzijian/RPMNet).
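A small NumPy sketch of the Sinkhorn-with-annealing step that turns pairwise feature distances into soft correspondences; the slack row/column for outliers and the network that predicts the annealing parameters are omitted, and the values used are assumptions.

import numpy as np

def sinkhorn_soft_assignment(dist, alpha, n_iters=5):
    """Alternate row/column normalization of exp(-dist/alpha) in the log
    domain to obtain a (near) doubly-stochastic soft correspondence matrix.

    dist: (M, N) distances between source and reference point features.
    alpha: annealing temperature; smaller alpha -> harder assignments.
    """
    log_m = -dist / alpha
    for _ in range(n_iters):
        log_m -= np.log(np.exp(log_m).sum(axis=1, keepdims=True))  # rows
        log_m -= np.log(np.exp(log_m).sum(axis=0, keepdims=True))  # columns
    return np.exp(log_m)

rng = np.random.default_rng(0)
d = rng.random((5, 5))
m = sinkhorn_soft_assignment(d, alpha=0.1)
print(m.sum(axis=0))   # columns sum to ~1; rows approach 1 with more iterations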
[work, prediction, recognition, previous] [feature, table, partially, object, global] [robust, iterative, clean, secondary, improve] [spatial, ieee, pattern, extraction, figure, method, reference, transform, based, noisy, range, isotropic] [source, shared, learn, loss] [network, deep, performance, learned, learning, matrix, iteration, data, soft, parameter, problem, initialization, optimal, algorithm, number, stochastic, normalization, deterministic, evaluate, sample] [point, registration, cloud, rigid, local, conference, icp, computer, vision, mjk, transformation, annealing, rpm, international, hybrid, distance, closest, compute, sinkhorn, partial, visible, match, chamfer, pointnetlk, fgr, mlp, correspondence, approach, well, geometric, matching, initial, solution, handcrafted, doubly]
@InProceedings{Yew_2020_CVPR,
  author = {Yew, Zi Jian and Lee, Gim Hee},
  title = {RPM-Net: Robust Point Matching Using Learned Features},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Sideways: Depth-Parallel Training of Video Models
Mateusz Malinowski, Grzegorz Swirszcz, Joao Carreira, Viorica Patraucean


We propose Sideways, an approximate backpropagation scheme for training video models. In standard backpropagation, the gradients and activations at every computation step through the model are temporally synchronized. The forward activations need to be stored until the backward pass is executed, preventing inter-layer (depth) parallelization. However, can we leverage smooth, redundant input streams such as videos to develop a more efficient training scheme? Here, we explore an alternative to backpropagation; we overwrite network activations whenever new ones, i.e., from new frames, become available. Such a more gradual accumulation of information from both passes breaks the precise correspondence between gradients and activations, leading to theoretically more noisy weight updates. Counter-intuitively, we show that Sideways training of deep convolutional video networks not only still converges, but can also potentially exhibit better generalization compared to standard synchronized backpropagation.
[video, frame, time, action, temporal, step, work, sequence, regular, illustrated, mechanism, recognition, multiple] [module, originated, table] [input, trained, model, blocking] [figure, convolutional, high, parallel] [loss, train, cycle] [sideways, training, learning, neural, computation, data, forward, processing, backward, network, deep, algorithm, backpropagation, gradient, sgd, large, machine, arxiv, preprint, layer, report, efficient, process, classification, accuracy, number, standard, note, decoupled, striding, deways, better, stochastic, function, online, rate, update, performance, simple, andrew, weight, experiment, higher, investigate] [conference, international, single, computer, vision, local, smoothness, assume, computed, depth]
@InProceedings{Malinowski_2020_CVPR,
  author = {Malinowski, Mateusz and Swirszcz, Grzegorz and Carreira, Joao and Patraucean, Viorica},
  title = {Sideways: Depth-Parallel Training of Video Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Basis Prediction Networks for Effective Burst Denoising With Large Kernels
Zhihao Xia, Federico Perazzi, Michael Gharbi, Kalyan Sunkavalli, Ayan Chakrabarti


Bursts of images exhibit significant self-similarity across both time and space. This motivates a representation of the kernels as linear combinations of a small set of basis elements. To this end, we introduce a novel basis prediction network that, given an input burst, predicts a set of global basis kernels --- shared within the image --- and the corresponding mixing coefficients --- which are specific to individual pixels. Compared to state-of-the-art techniques that output a large tensor of per-pixel spatiotemporal kernels, our formulation substantially reduces the dimensionality of the network output. This allows us to effectively exploit comparatively larger denoising kernels, achieving both significant quality improvements (over 1dB PSNR) and faster run-times over state-of-the-art methods.
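A toy NumPy sketch of applying per-pixel kernels expressed as mixtures of a few global basis kernels, which is the dimensionality reduction the abstract describes. The network that predicts the basis and coefficients is omitted, and all shapes are assumptions.

import numpy as np

def apply_basis_kernels(burst, basis, coeffs):
    """Filter a burst with per-pixel kernels that are linear combinations of
    global basis kernels.

    burst:  (T, H, W)     input frames (grayscale for simplicity).
    basis:  (K, T, k, k)  global basis kernels predicted once per image.
    coeffs: (K, H, W)     per-pixel mixing coefficients.
    """
    K, T, k, _ = basis.shape
    _, H, W = burst.shape
    pad = k // 2
    padded = np.pad(burst, ((0, 0), (pad, pad), (pad, pad)), mode='edge')
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            patch = padded[:, y:y + k, x:x + k]                # (T, k, k)
            kernel = np.tensordot(coeffs[:, y, x], basis, 1)   # mix the basis
            out[y, x] = (kernel * patch).sum()
    return out

rng = np.random.default_rng(0)
burst = rng.random((4, 16, 16))
basis = rng.random((8, 4, 5, 5)); basis /= basis.sum(axis=(1, 2, 3), keepdims=True)
coeffs = rng.dirichlet(np.ones(8), size=(16, 16)).transpose(2, 0, 1)
print(apply_basis_kernels(burst, basis, coeffs).shape)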
[prediction, video, frame, decoder, predict, work, natural, multiple, outperforms, individual] [table, predicted, overlap, global, ablation] [noise, input, quality, model] [denoising, burst, kernel, spatial, kpn, method, output, denoised, filtering, separable, noisy, color, ieee, motion, pixel, fourier, psnr, coefficient, figure, kpns, based, photography] [image, corresponding, encoder, produce, common, train, grayscale] [network, gain, performance, size, large, set, average, better, standard, training, learning, find, processing, deep, neural, larger, computational, report, validation, test, number, subspace, fixed, rank, small, memory] [basis, approach, structure, conference, direct, acm, computer, single, well]
@InProceedings{Xia_2020_CVPR,
  author = {Xia, Zhihao and Perazzi, Federico and Gharbi, Michael and Sunkavalli, Kalyan and Chakrabarti, Ayan},
  title = {Basis Prediction Networks for Effective Burst Denoising With Large Kernels},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Private-kNN: Practical Differential Privacy for Computer Vision
Yuqing Zhu, Xiang Yu, Manmohan Chandraker, Yu-Xiang Wang


With increasing ethical and legal concerns on privacy for deep models in visual recognition, differential privacy has emerged as a mechanism to disguise membership of sensitive data in training datasets. Recent methods like Private Aggregation of Teacher Ensembles (PATE) leverage a large ensemble of teacher models trained on disjoint subsets of private data, to transfer knowledge to a student model with privacy guarantees. However, labeled vision data is often expensive and datasets, when split into many disjoint training sets, lead to significantly sub-optimal accuracy and thus hardly sustain good privacy bounds. We propose a practically data-efficient scheme based on private release of k-nearest neighbor (kNN) queries, which altogether avoids splitting the training dataset. Our approach allows the use of privacy-amplification by subsampling and iterative refinement of the kNN feature embedding. We rigorously analyze the theoretical properties of our method and demonstrate strong experimental performance on practical computer vision datasets for face attribute recognition and person reidentification. In particular, we achieve comparable or better accuracy than PATE while reducing more than 90% of the privacy loss, thereby providing the "most practical method to-date" for private deep learning in computer vision.
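A minimal sketch of the private kNN query release described above: vote counts from the k nearest private neighbors are perturbed with Gaussian noise before the argmax is returned. The noise scale here is purely illustrative; the actual privacy accounting (RDP composition, subsampling amplification) is in the paper.

import numpy as np

def private_knn_label(query_feat, private_feats, private_labels,
                      k=50, n_classes=10, sigma=20.0, rng=None):
    """Answer a labeling query with a noisy k-nearest-neighbor vote; the
    Gaussian noise on the per-class counts is what yields the privacy guarantee."""
    rng = rng or np.random.default_rng()
    dists = np.linalg.norm(private_feats - query_feat, axis=1)
    neighbors = np.argsort(dists)[:k]
    votes = np.bincount(private_labels[neighbors], minlength=n_classes).astype(float)
    votes += rng.normal(scale=sigma, size=n_classes)
    return int(np.argmax(votes))

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 32))
labels = rng.integers(0, 10, size=1000)
print(private_knn_label(feats[0], feats, labels, rng=rng))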
[mechanism, dataset, answer] [feature, achieves, apply, table, aggregation, extractor] [privacy, private, model, screening, public, rdp, pate, differential, trained, noise, guarantee, knn, datasets, gnmax, query, face, privately, poisson, security, definition, differentially, budget, testing, symposium, sensitive, release, iterative, strong] [noisy, method, gaussian, figure, based, analysis, subsampled, high, ieee] [attribute, composition, pseudo, amplification, utility, celeba] [data, training, accuracy, learning, student, set, deep, sampling, teacher, unlabeled, number, classification, better, compared, practical, performance, machine, achieve, algorithm, procedure, total, ratio, function, large, labeled, process, parameter, neural, random, bound, svhn, standard] [cost, computer, vision, allows, neighbor]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Yuqing and Yu, Xiang and Chandraker, Manmohan and Wang, Yu-Xiang},
  title = {Private-kNN: Practical Differential Privacy for Computer Vision},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SP-NAS: Serial-to-Parallel Backbone Search for Object Detection
Chenhan Jiang, Hang Xu, Wei Zhang, Xiaodan Liang, Zhenguo Li


Advanced object detectors usually adopt a backbone network designed and pretrained by ImageNet classification. Recently neural architecture search (NAS) has emerged to automatically design a task-specific backbone to bridge the gap between the tasks of classification and detection. In this paper, we propose a two-phase serial-to-parallel architecture search framework named SP-NAS towards a flexible task-oriented detection backbone. Specifically, the serial-searching round aims at finding a sequence of serial blocks with optimal scale and output channels in the feature hierarchy by a Swap-Expand-Reignite search algorithm; the parallel-searching phase then assembles several sub-architectures along with the previous searched backbone into a more powerful parallel-structured backbone. We efficiently search a detection backbone by exploring a network morphism strategy on multiple detection benchmarks. The resulting architectures achieve SOTA results, i.e. top performance (LAMR: 0.055) on the automotive detection leaderboard of EuroCityPersons benchmark, improving 2.3% mAP with less FLOPS than NAS-FPN on COCO, and reaching 84.1% AP50 on VOC better than DetNAS and Auto-FPN in terms of both accuracy and speed.
[dataset, current, time, multiple, previous] [backbone, detection, object, feature, coco, stage, voc, table, fpn, bdd, detnas, rcnn, map, fully, cascade, resnet, adopt, propose, det, focus, improvement, serialnet, lamr] [model] [figure, output, comparison, parallel, block, fusion, method, phase, flexible, repeated, resolution, subnet, designed, spatial] [image, train, pretrained] [search, architecture, network, imagenet, training, performance, searched, number, ecp, searching, neural, better, optimal, subnets, arxiv, preprint, space, classification, serial, strategy, efficient, find, algorithm, size, best, random, computation, pretraining, inference, learning, quoc, weight] [structure]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Chenhan and Xu, Hang and Zhang, Wei and Liang, Xiaodan and Li, Zhenguo},
  title = {SP-NAS: Serial-to-Parallel Backbone Search for Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Structure Aware Single-Stage 3D Object Detection From Point Cloud
Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, Lei Zhang


3D object detection from point cloud data plays an essential role in autonomous driving. Current single-stage detectors are efficient by progressively downscaling the 3D point clouds in a fully convolutional manner. However, the downscaled features inevitably lose spatial information and cannot make full use of the structure information of 3D point cloud, degrading their localization precision. In this work, we propose to improve the localization precision of single-stage detectors by explicitly leveraging the structure information of 3D point cloud. Specifically, we design an auxiliary network which converts the convolutional features in the backbone network back to point-level representations. The auxiliary network is jointly optimized, by two point-level supervisions, to guide the convolutional features in the backbone network to be aware of the object structure. The auxiliary network can be detached after training and therefore introduces no extra computation in the inference stage. Besides, considering that single-stage detectors suffer from the discordance between the predicted bounding boxes and corresponding classification confidences, we develop an efficient part-sensitive warping operation to align the confidences to the predicted bounding boxes. Our proposed detector ranks at the top of KITTI 3D/BEV detection leaderboards and runs at 25 FPS for inference.
[current, three] [object, detection, bounding, feature, backbone, predicted, segmentation, pswarp, box, lidar, center, moderate, table, detector, hard, foreground, map, cnn, localization, apply, psroialign, bev, aware, fps, confidence, easy, fully, propose, guide, extra, final, employ] [auxiliary, input, model, improve] [method, proposed, convolutional, ieee, spatial, pattern, warping, figure, tensor] [generate, learn, corresponding, loss] [network, classification, task, performance, data, learning, efficient, training, neural, set, average, achieve, computational, deep, precision, inference, top] [point, conference, cloud, computer, vision, kitti, structure, accurate, second, estimation, international, sparse, voxel, grid]
@InProceedings{He_2020_CVPR,
  author = {He, Chenhang and Zeng, Hui and Huang, Jianqiang and Hua, Xian-Sheng and Zhang, Lei},
  title = {Structure Aware Single-Stage 3D Object Detection From Point Cloud},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
"Looking at the Right Stuff" - Guided Semantic-Gaze for Autonomous Driving
Anwesan Pal, Sayan Mondal, Henrik I. Christensen


In recent years, predicting driver's focus of attention has been a very active area of research in the autonomous driving community. Unfortunately, existing state-of-the-art techniques achieve this by relying only on human gaze information, thereby ignoring scene semantics. We propose a novel Semantics Augmented GazE (SAGE) detection approach that captures driving specific contextual information, in addition to the raw gaze. Such a combined attention mechanism serves as a powerful tool to focus on the relevant regions in an image frame in order to make driving both safe and efficient. Using this, we design a complete saliency prediction framework - SAGE-Net, which modifies the initial prediction from SAGE by taking into account vital aspects such as distance to objects (depth), ego vehicle speed, and pedestrian crossing intent. Exhaustive experiments conducted through four popular saliency algorithms show that on 49/56 (87.5%) cases - considering both the overall dataset and crucial driving scenarios, SAGE outperforms existing techniques without any additional computational overhead during the training process. The augmented dataset along with the relevant code are available as part of the supplementary material.
[sage, driving, attention, dataset, prediction, crossing, video, semantics, predicting, driver, bdda, visual, vehicle, intent, evaluation, relevant, road, context, vthresh, yrgb] [saliency, object, detection, map, focus, pedestrian, semantic, autonomous, predicted, salient, table, framework, segmentation, groundtruth, feature, detect, dkl, segmented, picanet, wenguan, jianbing, propose] [gaze, trained, model] [ieee, pattern, proposed, existing, figure, raw, comparison, convolutional, analysis] [image, completely, unsupervised] [learning, set, neural, consider, deep, entire, network, training, performance, conducted, algorithm, machine, arxiv, preprint, computational] [conference, computer, vision, depth, human, scene, approach, capture, international, intersection, single]
@InProceedings{Pal_2020_CVPR,
  author = {Pal, Anwesan and Mondal, Sayan and Christensen, Henrik I.},
  title = {"Looking at the Right Stuff" - Guided Semantic-Gaze for Autonomous Driving},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
What's Hidden in a Randomly Weighted Neural Network?
Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, Mohammad Rastegari


Training a neural network is synonymous with learning the values of the weights. By contrast, we demonstrate that randomly weighted neural networks contain subnetworks which achieve impressive performance without ever training the weight values. Hidden in a randomly weighted Wide ResNet-50 is a subnetwork (with random weights) that is smaller than, but matches the performance of a ResNet-34 trained on ImageNet. Not only do these "untrained subnetworks" exist, but we provide an algorithm to effectively find them. We empirically show that as randomly weighted neural networks with fixed weights grow wider and deeper, an "untrained subnetwork" approaches a network with learned weights in accuracy.
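A minimal PyTorch sketch in the spirit of the described search for untrained subnetworks: weights stay at their random initialization, a score is learned per weight, and the forward pass keeps only the top-k fraction of scores with a straight-through gradient. Per-layer k values, the paper's initialization scheme, and other details are not reproduced here.

import torch
import torch.nn as nn

class TopKMask(torch.autograd.Function):
    """Keep the top-k fraction of scores (straight-through gradient)."""
    @staticmethod
    def forward(ctx, scores, k):
        flat = scores.flatten()
        n_keep = max(1, int(k * flat.numel()))
        threshold = flat.topk(n_keep).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # pass gradients straight through to the scores

class SubnetLinear(nn.Linear):
    """Linear layer with frozen random weights; only the popup scores train."""
    def __init__(self, in_f, out_f, k=0.5):
        super().__init__(in_f, out_f, bias=False)
        self.weight.requires_grad_(False)            # weights never leave their init
        self.scores = nn.Parameter(torch.randn_like(self.weight))
        self.k = k

    def forward(self, x):
        mask = TopKMask.apply(self.scores, self.k)
        return nn.functional.linear(x, self.weight * mask)

layer = SubnetLinear(16, 4)
layer(torch.randn(8, 16)).sum().backward()           # gradients flow into the scores
print(layer.scores.grad.shape)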
[work, node, provide, associated, connected] [kaiming, score, denotes, achieves, highest, edge, fully] [subnetworks, zhou, model, trained, input, original, choose] [figure, performs, output, adam, method, high, scale, convolutional] [loss, perform, image, intuition, learn] [neural, subnetwork, network, randomly, weighted, weight, performance, learning, find, algorithm, width, imagenet, gradient, random, wuv, wide, accuracy, distribution, number, layer, suv, size, architecture, pool, training, good, standard, search, probability, rate, finding, fixed, optimize, achieve, batch, forward, consider, update, denote, stochastic, pass, initialization, decay, deep, initialized, frankle, learned, carbin, experiment, supermasks, better, agnostic] [well, normal, dense, varying, demonstrate]
@InProceedings{Ramanujan_2020_CVPR,
  author = {Ramanujan, Vivek and Wortsman, Mitchell and Kembhavi, Aniruddha and Farhadi, Ali and Rastegari, Mohammad},
  title = {What's Hidden in a Randomly Weighted Neural Network?},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Structured Multi-Hashing for Model Compression
Elad Eban, Yair Movshovitz-Attias, Hao Wu, Mark Sandler, Andrew Poon, Yerlan Idelbayev, Miguel A. Carreira-Perpinan


Despite the success of deep neural networks (DNNs), state-of-the-art models are too large to deploy on low-resource devices or common server configurations in which multiple models are held in memory. Model compression methods address this limitation by reducing the memory footprint, latency, or energy consumption of a model with minimal impact on accuracy. We focus on the task of reducing the number of learnable variables in the model. In this work we combine ideas from weight hashing and dimensionality reduction, resulting in a simple and powerful structured multi-hashing method based on matrix products that allows direct control of model size of any deep network and is trained end-to-end. We demonstrate the strength of our approach by compressing models from the ResNet, EfficientNet, and MobileNet architecture families. Our method allows us to drastically decrease the number of variables while maintaining high accuracy. For instance, by applying our approach to EfficientNet-B4 (16M parameters) we reduce it to the size of B0 (5M parameters), while gaining over 3% in accuracy over the B0 baseline. On the commonly used benchmark CIFAR10 we reduce the ResNet32 model by 75% with no loss in quality, and are able to do a 10x compression while still achieving above 90% accuracy.
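A toy NumPy sketch of the general idea of generating a model's weights from a much smaller pool of trainable variables via a matrix product; the exact structured multi-hashing mapping in the paper differs, so treat this only as an illustration of the compression mechanism, with all names and sizes assumed.

import numpy as np

def generate_weights(U, V, layer_shapes):
    """Generate all layer weights from a small pool of trainable variables.

    U (r, m) and V (r, n) are the trainable variables; their product gives an
    m*n pool of values that is then sliced into the individual layer tensors.
    """
    pool = (U.T @ V).ravel()                         # m*n generated values
    total = sum(int(np.prod(s)) for s in layer_shapes)
    assert total <= pool.size, "not enough generated values for the model"
    weights, offset = [], 0
    for shape in layer_shapes:
        size = int(np.prod(shape))
        weights.append(pool[offset:offset + size].reshape(shape))
        offset += size
    return weights

rng = np.random.default_rng(0)
U, V = rng.normal(size=(8, 100)), rng.normal(size=(8, 100))   # 1,600 trainable variables
shapes = [(64, 64), (64, 32), (32, 10)]                       # 6,464 generated weights
print([w.shape for w in generate_weights(U, V, shapes)])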
[structured, trainable, connected, work, represent] [fully, resnet, table] [model, trained, original, strong, google] [compression, scale, figure, pattern, method, based, low, convolutional, compressed, resolution, ieee, compress, learnable, applying, tensor] [target, train, locality, mapping, variable, image] [hashing, size, number, neural, accuracy, deep, weight, smh, hash, set, network, note, learning, architecture, width, matrix, memory, smaller, multiplier, compressing, training, efficient, efficientnet, baseline, reducing, quantization, layer, andrew, large, family, arxiv, preprint, mobile, mapped, scheme, function, reducer, base, processing, mark, reduce, machine, linear, pruning] [computer, vision, conference, approach, compare, full, define]
@InProceedings{Eban_2020_CVPR,
  author = {Eban, Elad and Movshovitz-Attias, Yair and Wu, Hao and Sandler, Mark and Poon, Andrew and Idelbayev, Yerlan and Carreira-Perpinan, Miguel A.},
  title = {Structured Multi-Hashing for Model Compression},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DOPS: Learning to Detect 3D Objects and Predict Their 3D Shapes
Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, Vivek Rathod, Thomas Funkhouser, Caroline Pantofaru, David Ross, Larry S. Davis, Alireza Fathi


We propose DOPS, a fast single-stage 3D object detection method for LIDAR data. Previous methods often make domain-specific design decisions, for example projecting points into a bird-eye view image in autonomous driving scenarios. In contrast, we propose a general-purpose method that works on both indoor and outdoor scenes. The core novelty of our method is a fast, single-pass architecture that both detects objects in 3D and estimates their shapes. 3D bounding box parameters are estimated in one pass for every point, aggregated through graph convolutions, and fed into a branch of the network that predicts latent codes representing the shape of each detected object. The latent shape space and shape decoder are learned on a synthetic dataset and then used as supervision for the end-to-end training of the 3D object detection pipeline. Thus our model is able to extract shapes without access to ground-truth shape information in the target dataset. During experiments, we find that our proposed method achieves state-of-the-art results by 5% on object detection in ScanNet scenes, and it gets top results by 3.4% in the Waymo Open Dataset, while reproducing the shapes of detected cars.
[embedding, predict, decoder, dataset, observed, graph, prediction, previous, predicting] [object, detection, predicted, box, bounding, pooling, lidar, waymo, semantic, center, autonomous, proposal, table, segmentation, score, propose, branch] [input, model] [ieee, pattern, figure, convolution, prior, convolutional, high, method, spatial, proposed, based, fast] [loss, consists, encoder, conditional, synthetic, representation, learn, learns] [network, training, learning, open, set, function, arxiv, preprint, learned, batch, processing, sampling, classification, deep, sample] [shape, point, sparse, conference, computer, vision, cloud, predicts, surface, distance, geo, approach, view, indoor, directly, sdf, rgb, convs, rotation, signed, leonidas, cad]
@InProceedings{Najibi_2020_CVPR,
  author = {Najibi, Mahyar and Lai, Guangda and Kundu, Abhijit and Lu, Zhichao and Rathod, Vivek and Funkhouser, Thomas and Pantofaru, Caroline and Ross, David and Davis, Larry S. and Fathi, Alireza},
  title = {DOPS: Learning to Detect 3D Objects and Predict Their 3D Shapes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
AutoTrack: Towards High-Performance Visual Tracking for UAV With Automatic Spatio-Temporal Regularization
Yiming Li, Changhong Fu, Fangqiang Ding, Ziyuan Huang, Geng Lu


Most existing trackers based on discriminative correlation filters (DCF) try to introduce a predefined regularization term to improve the learning of target objects, e.g., by suppressing background learning or by restricting the change rate of correlation filters. However, predefined parameters require much tuning effort and still fail to adapt to new situations that the designer did not think of. In this work, a novel approach is proposed to automatically and adaptively learn a spatio-temporal regularization term online. Spatially local response map variation is introduced as spatial regularization to make the DCF focus on the learning of trust-worthy parts of the object, and global response map variation determines the updating rate of the filter. Extensive experiments on four UAV benchmarks have proven the superiority of our method compared to the state-of-the-art CPU- and GPU-based trackers, with a speed of 60 frames per second running on a single CPU. Our tracker is additionally proposed to be applied in UAV localization. Extensive tests in practical indoor scenarios have proven the effectiveness and versatility of our localization method. The code is available at https://github.com/vision4robotics/AutoTrack.
[visual, frame, speed, temporal, automatic] [tracking, autotrack, uav, response, strcf, correlation, object, dcf, threshold, bacf, staple, tracker, localization, kcf, kcc, fdsst, dsst, overlap, global, map, martin, aerial, fahad, table, shahbaz, changhong, boundary, siamese, gtk, location, uavdt, occlusion, feature, denotes, tracked] [success, variation, robust, hkt, change] [spatial, figure, method, based, proposed, adaptive, motion, illumination] [appearance, discriminative, infrared] [regularization, learning, precision, deep, rate, performance, compared, large, online, best, yiming, credibility] [system, local, error, position, well, michael, term, second, camera]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yiming and Fu, Changhong and Ding, Fangqiang and Huang, Ziyuan and Lu, Geng},
  title = {AutoTrack: Towards High-Performance Visual Tracking for UAV With Automatic Spatio-Temporal Regularization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GP-NAS: Gaussian Process Based Neural Architecture Search
Zhihang Li, Teng Xi, Jiankang Deng, Gang Zhang, Shengzhao Wen, Ran He


Neural architecture search (NAS) advances beyond the state-of-the-art in various computer vision tasks by automating the designs of deep neural networks. In this paper, we aim to address three important questions in NAS: (1) How to measure the correlation between architectures and their performances? (2) How to evaluate the correlation between different architectures? (3) How to learn these correlations with a small number of samples? To this end, we first model these correlations from a Bayesian perspective. Specifically, by introducing a novel Gaussian Process based NAS (GP-NAS) method, the correlations are modeled by the kernel function and mean function. The kernel function is also learnable to enable adaptive modeling for complex correlations in different search spaces. Furthermore, by incorporating a mutual information based sampling method, we can theoretically ensure the high-performance architecture with only a small set of samples. After addressing these problems, training GP-NAS once enables direct performance prediction of any architecture in different scenarios and may obtain efficient networks for different deployment platforms. Extensive experiments on both image classification and face recognition tasks verify the effectiveness of our algorithm.
[child, time, recognition, predict, modeling, reinforcement, work] [correlation, propose, framework, predicted, achieves, semantic] [model, face, trained] [based, gaussian, kernel, method, prior, high, proposed] [image, train, conditioned] [architecture, search, network, neural, performance, sampling, training, function, distribution, matrix, efficient, theorem, learning, mutual, hyperparameters, sampled, equation, set, posterior, covariance, vector, number, space, quoc, process, algorithm, bayesian, strategy, accuracy, imagenet, gpu, deep, large, best, size, barret, measure, evaluate, small, vijay, andrew, classification, predictor, learned, searching, rate, mobile, weight, achieve, alternating] [estimate, estimation, recursively, error, computer, differentiable, truth, vision]
@InProceedings{Li_2020_CVPR,
  author = {Li, Zhihang and Xi, Teng and Deng, Jiankang and Zhang, Gang and Wen, Shengzhao and He, Ran},
  title = {GP-NAS: Gaussian Process Based Neural Architecture Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
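The abstract above frames NAS as learning a predictor of architecture performance with a Gaussian Process. The minimal sketch below fits a GP regressor on a handful of (architecture encoding, accuracy) pairs and predicts a mean and an uncertainty for unseen encodings; it uses a fixed RBF kernel and random data purely for illustration, not the paper's learnable kernel or its mutual-information sampling.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Hypothetical binary encodings of evaluated architectures and their measured accuracies.
X_train = np.random.randint(0, 2, size=(16, 20)).astype(float)
y_train = 0.70 + 0.05 * np.random.rand(16)

gp = GaussianProcessRegressor(kernel=ConstantKernel(1.0) * RBF(length_scale=1.0),
                              normalize_y=True)
gp.fit(X_train, y_train)

# Predict performance (mean) and predictive uncertainty (std) for unseen architectures;
# an acquisition rule can then decide which architecture to train and evaluate next.
X_cand = np.random.randint(0, 2, size=(5, 20)).astype(float)
mean, std = gp.predict(X_cand, return_std=True)

Once such a predictor is fitted, every candidate in the search space can be ranked without training it, which is the efficiency argument made in the abstract.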
NAS-FCOS: Fast Neural Architecture Search for Object Detection
Ning Wang, Yang Gao, Hao Chen, Peng Wang, Zhi Tian, Chunhua Shen, Yanning Zhang


The success of deep neural networks relies on significant architecture engineering. Recently, neural architecture search (NAS) has emerged as a promising way to greatly reduce the manual effort in network design by automatically searching for optimal architectures, although typically such algorithms need an excessive amount of computational resources, e.g., a few thousand GPU-days. To date, on challenging vision tasks such as object detection, NAS, especially fast versions of NAS, is less studied. Here we propose to search for the decoder structure of object detectors with search efficiency being taken into consideration. To be more specific, we aim to efficiently search for the feature pyramid network (FPN) as well as the prediction head of a simple anchor-free object detector, namely FCOS, using a tailored reinforcement learning paradigm. With a carefully designed search space, search algorithms and strategies for evaluating network quality, we are able to efficiently search for a top-performing detection architecture within 4 days using 8 V100 GPUs. The discovered architecture surpasses state-of-the-art object detection models (such as Faster R-CNN, RetinaNet and FCOS) by 1.5 to 3.5 points in AP on the COCO dataset, with comparable computation complexity and memory footprint, demonstrating the efficacy of the proposed NAS for object detection.
[decoder, prediction, reward, three, evaluation, time, reinforcement] [head, object, fpn, detection, backbone, feature, fcos, pyramid, coco, table, kaiming, fully, concat, ross] [model, input, original] [output, ieee, figure, fast, based, convolutional, deformable] [shared, image] [search, neural, architecture, training, proxy, searched, space, network, searching, size, task, weight, sharing, set, sampling, controller, applied, learning, process, note, design, performance, basic, number, deep, reduce, simple, memory, efficient, better, batch, width, best, arxiv, preprint, efficiency] [structure, full, cost]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Ning and Gao, Yang and Chen, Hao and Wang, Peng and Tian, Zhi and Shen, Chunhua and Zhang, Yanning},
  title = {NAS-FCOS: Fast Neural Architecture Search for Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TCTS: A Task-Consistent Two-Stage Framework for Person Search
Cheng Wang, Bingpeng Ma, Hong Chang, Shiguang Shan, Xilin Chen


State-of-the-art person search methods separate person search into detection and re-ID stages, but ignore the consistency between these two stages. The general person detector pays no special attention to the query target, and the re-ID model is trained on hand-drawn bounding boxes which are not available in person search. To address the consistency problem, we introduce a Task-Consistent Two-Stage (TCTS) person search framework, which includes an identity-guided query (IDGQ) detector and a Detection Results Adapted (DRA) re-ID model. In the detection stage, the IDGQ detector learns an auxiliary identity branch to compute query similarity scores for proposals. Taking both the query similarity scores and the foreground score into account, IDGQ produces query-like bounding boxes for the re-ID stage. In the re-ID stage, we predict identity labels of the detected bounding boxes, and use these examples to construct a more practical mixed training set for the DRA model. Training on the mixed set improves the robustness of the re-ID stage to inaccurate detection. We evaluate our method on two benchmark datasets, CUHK-SYSU and PRW. Our framework achieves 93.9% mAP and 95.1% rank-1 accuracy on CUHK-SYSU, outperforming previous state-of-the-art methods.
[order, state, attention] [bounding, detection, idgq, detected, detector, branch, gallery, dra, propose, feature, stage, map, pedestrian, score, achieves, foreground, tcts, framework, easy, faster, effectiveness, benchmark, proposal, positive, liang, art, recall, center, table] [query, identity, model, example, quality, auxiliary] [ieee, pattern, proposed, based, method, figure] [person, train, loss, factor, consistency, image] [search, set, similarity, unlabeled, training, network, performance, learning, mixed, hardness, indicates, size, base, problem, probability, number, classification, labeled, algorithm, weight, adapted, accuracy, softmax] [computer, conference, ground, vision, accurate, truth, focal, european]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Cheng and Ma, Bingpeng and Chang, Hong and Shan, Shiguang and Chen, Xilin},
  title = {TCTS: A Task-Consistent Two-Stage Framework for Person Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SCATTER: Selective Context Attentional Scene Text Recognizer
Ron Litman, Oron Anschel, Shahar Tsiper, Roee Litman, Shai Mazor, R. Manmatha


Scene Text Recognition (STR), the task of recognizing text against complex image backgrounds, is an active area of research. Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes. In this paper, we introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER). SCATTER utilizes a stacked block architecture with intermediate supervision during training, that paves the way to successfully train a deep BiLSTM encoder, thus improving the encoding of contextual dependencies. Decoding is done using a two-step 1D attention mechanism. The first attention step re-weights visual features from a CNN backbone together with contextual features computed by a BiLSTM layer. The second attention step, similar to previous papers, treats the features as a sequence and attends to the intra-sequence relationships. Experiments show that the proposed approach surpasses SOTA performance on irregular text recognition benchmarks by 3.7% on average.
[text, decoder, bilstm, irregular, recognition, attention, regular, selective, visual, sequence, dataset, natural, character, ctc, attentional, recognizing, prediction, cong, decoding, step, textspotter, recurrent, svt] [feature, contextual, supervision, cnn, table, refinement, cropped, detection, map, final, mask, xiang, backbone] [scatter, model, str, datasets, input, trained, robust] [intermediate, block, ieee, proposed, pattern, output, stacked, analysis, transform, figure] [image, encoder, synthetic, arbitrary, train] [training, architecture, accuracy, network, neural, test, processing, deep, number, average, arxiv, preprint, performance, increase, best, computational, inference, probability, increasing, baseline, machine] [scene, conference, computer, international, vision, second, novel, additional]
@InProceedings{Litman_2020_CVPR,
  author = {Litman, Ron and Anschel, Oron and Tsiper, Shahar and Litman, Roee and Mazor, Shai and Manmatha, R.},
  title = {SCATTER: Selective Context Attentional Scene Text Recognizer},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Canonical Shape Space for Category-Level 6D Object Pose and Size Estimation
Dengsheng Chen, Jun Li, Zheng Wang, Kai Xu


We present a novel approach to category-level 6D object pose and size estimation. To tackle intra-class shape variations, we learn a canonical shape space (CASS), a unified representation for a large variety of instances of a certain object category. In particular, CASS is modeled as the latent space of a deep generative model of canonical 3D shapes with normalized pose. We train a variational auto-encoder (VAE) for generating 3D point clouds in the canonical space from an RGBD image. The VAE is trained in a cross-category fashion, exploiting the publicly available large 3D shape repositories. Since the 3D point cloud is generated in normalized pose (with actual size), the encoder of the VAE learns a view-factorized RGBD embedding. It maps an RGBD image in arbitrary view into a pose-independent 3D shape representation. Object pose is then estimated via contrasting it with a pose-dependent feature of the input RGBD extracted with a separate deep neural network. We integrate the learning of CASS and pose and size estimation into an end-to-end trainable network, achieving state-of-the-art performance.
[embedding, evaluation] [object, feature, detection, category, cnn, table, unified, map] [model, input, trained] [method, figure, based, crop, patch, ieee, extraction, light] [image, vae, encoder, mixing, train, target, loss, learn, representation, latent, corresponding, generative, unseen] [network, size, learning, space, batch, training, deep, normalized, accuracy, metric, arxiv, preprint, learned, data, distribution, neural, design, large] [pose, point, shape, rgbd, estimation, canonical, cloud, geometric, conference, reconstruction, computer, rgb, depth, cad, approach, estimated, reconstructed, matching, view, dense, vision, match, full, distance, international]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Dengsheng and Li, Jun and Wang, Zheng and Xu, Kai},
  title = {Learning Canonical Shape Space for Category-Level 6D Object Pose and Size Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hierarchical Scene Coordinate Classification and Regression for Visual Localization
Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, Juho Kannala


Visual localization is critical to many applications in computer vision and robotics. To address single-image RGB localization, state-of-the-art feature-based methods match local descriptors between a query image and a pre-built 3D model. Recently, deep neural networks have been exploited to regress the mapping between raw pixels and 3D coordinates in the scene, and thus the matching is implicitly performed by the forward pass through the network. However, in a large and ambiguous environment, learning such a regression task directly can be difficult for a single network. In this work, we present a new hierarchical scene coordinate network to predict pixel scene coordinates in a coarse-to-fine manner from a single RGB image. The network consists of a series of output layers, each of them conditioned on the previous ones. The final output layer predicts the 3D coordinates and the others produce progressively finer discrete location labels. The proposed method outperforms the baseline regression-only network and allows us to train compact models which scale robustly to large environments. It sets a new state-of-the-art for single-image RGB localization performance on the 7-Scenes, 12-Scenes, Cambridge Landmarks datasets, and three combined scenes. Moreover, for large-scale outdoor localization on the Aachen Day-Night dataset, we present a hybrid approach which outperforms existing scene coordinate regression methods, and reduces significantly the performance gap w.r.t. explicit feature matching methods.
[conditioning, hierarchical, visual, three, prediction, retrieval, previous, outperforms, dataset] [regression, localization, location, table, feature, final, global, predicted] [model, combined, datasets, input, trained, query] [method, receptive, output, scale, proposed, field] [image, loss, cluster, generator, perform] [network, classification, learning, performance, training, data, label, large, test, better, baseline, compared, note, layer, size, accuracy, discrete, architecture, deep, compact, larger, augmentation] [scene, coordinate, camera, pose, local, approach, rgb, esac, dense, single, torsten, allows, cambridge, finer, aachen, joint, directly, matching, accurate, regress, outdoor, coarse, novel, ground, truth]
@InProceedings{Li_2020_CVPR,
  author = {Li, Xiaotian and Wang, Shuzhe and Zhao, Yi and Verbeek, Jakob and Kannala, Juho},
  title = {Hierarchical Scene Coordinate Classification and Regression for Visual Localization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MiLeNAS: Efficient Neural Architecture Search via Mixed-Level Reformulation
Chaoyang He, Haishan Ye, Li Shen, Tong Zhang


Many recently proposed methods for Neural Architecture Search (NAS) can be formulated as bilevel optimization. For efficient implementation, its solution requires approximations of second-order methods. In this paper, we demonstrate that gradient errors caused by such approximations lead to suboptimality, in the sense that the optimization procedure fails to converge to a (locally) optimal solution. To remedy this, this paper proposes MiLeNAS, a mixed-level reformulation for NAS that can be optimized efficiently and reliably. It is shown that even when using a simple first-order method on the mixed-level formulation, MiLeNAS can achieve a lower validation error for NAS problems. Consequently, architectures obtained by our method achieve consistently higher accuracies than those obtained from bilevel optimization. Moreover, MiLeNAS proposes a framework beyond DARTS. It is upgraded via model size-based search and early stopping strategies to complete the search process in around 5 hours. Extensive experiments within the convolutional architecture search space validate the effectiveness of our approach.
[three, evaluation, speed] [framework, faster, fully, effectiveness] [model, caused, experimental] [method, figure, pattern, ieee, convolutional, proposed, range] [loss, image, gap] [search, milenas, architecture, bilevel, neural, optimization, validation, accuracy, training, ltr, size, searching, gradient, lval, approximation, stopping, network, equation, early, better, efficient, optimal, process, parameter, arxiv, preprint, gdas, strategy, find, evolution, min, learning, respect, overfitting, space, data, searched, performance, rate, larger, algorithm, requires, achieve, lower, deep, design, wval, operation, number, reformulation, simple, higher, problem, manual] [error, conference, vision, computer, demonstrate]
@InProceedings{He_2020_CVPR,
  author = {He, Chaoyang and Ye, Haishan and Shen, Li and Zhang, Tong},
  title = {MiLeNAS: Efficient Neural Architecture Search via Mixed-Level Reformulation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
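The contrast drawn in the MiLeNAS abstract above can be written compactly. The bilevel and mixed-level objectives below follow the abstract's description; the trade-off weight \lambda and the alternating update rule are shown generically and may differ in detail from the paper.

\min_{\alpha}\ \mathcal{L}_{\mathrm{val}}\bigl(w^{*}(\alpha),\alpha\bigr)\quad\text{s.t.}\quad w^{*}(\alpha)=\arg\min_{w}\ \mathcal{L}_{\mathrm{tr}}(w,\alpha)\qquad\text{(bilevel NAS)}

\min_{w,\alpha}\ \mathcal{L}_{\mathrm{tr}}(w,\alpha)+\lambda\,\mathcal{L}_{\mathrm{val}}(w,\alpha)\qquad\text{(mixed-level reformulation)}

A simple first-order scheme then alternates
w \leftarrow w-\eta_{w}\,\nabla_{w}\mathcal{L}_{\mathrm{tr}}(w,\alpha),\qquad \alpha \leftarrow \alpha-\eta_{\alpha}\bigl(\nabla_{\alpha}\mathcal{L}_{\mathrm{tr}}(w,\alpha)+\lambda\,\nabla_{\alpha}\mathcal{L}_{\mathrm{val}}(w,\alpha)\bigr),
avoiding the second-order approximations whose gradient errors the abstract identifies as the source of suboptimality.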
Scalable Uncertainty for Computer Vision With Functional Variational Inference
Eduardo D. C. Carvalho, Ronald Clark, Andrea Nicastro, Paul H. J. Kelly


As Deep Learning continues to yield successful applications in Computer Vision, the ability to quantify all forms of uncertainty is a paramount requirement for its safe and reliable deployment in the real-world. In this work, we leverage the formulation of variational inference in function space, where we associate Gaussian Processes (GPs) to both Bayesian CNN priors and variational family. Since GPs are fully determined by their mean and covariance functions, we are able to obtain predictive uncertainty estimates at the cost of a single forward pass through any chosen CNN architecture and for any supervised learning task. By leveraging the structure of the induced covariance matrices, we propose numerically efficient algorithms which enable fast training in the context of high-dimensional tasks such as depth estimation and semantic segmentation. Additionally, we provide sufficient conditions for constructing regression loss functions whose probabilistic counterparts are compatible with aleatoric uncertainty quantification.
[order, work, time, prediction, context] [semantic, segmentation, cnn, table, regression, pooling] [trained, model, input, display] [gaussian, prior, likelihood, kernel, ieee, output, pattern, method, figure, proposed, block] [variational, loss, supervised, corresponding] [learning, bayesian, function, covariance, neural, deep, inference, training, predictive, consider, distribution, test, machine, network, set, forward, probabilistic, matrix, stochastic, epistemic, large, number, architecture, deterministic, objective, arxiv, preprint, posterior, family, practical, pass, efficient, obtaining, approximate, log, choosing, process, weight, diagonal, dropout] [uncertainty, depth, conference, functional, computer, international, vision, estimation, aleatoric, calibration, structure, compute, cost, single, computed, ronald, dense]
@InProceedings{Carvalho_2020_CVPR,
  author = {Carvalho, Eduardo D. C. and Clark, Ronald and Nicastro, Andrea and Kelly, Paul H. J.},
  title = {Scalable Uncertainty for Computer Vision With Functional Variational Inference},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Uncertainty-Aware CNNs for Depth Completion: Uncertainty from Beginning to End
Abdelrahman Eldesokey, Michael Felsberg, Karl Holmquist, Michael Persson


The focus in deep learning research has been mostly to push the limits of prediction accuracy. However, this was often achieved at the cost of increased complexity, raising concerns about the interpretability and the reliability of deep networks. Recently, increasing attention has been given to untangling the complexity of deep networks and quantifying their uncertainty for different computer vision tasks. In contrast, the task of depth completion has not received as much attention, despite the inherently noisy nature of depth sensors. In this work, we thus focus on modeling the uncertainty of depth data in depth completion starting from the sparse noisy input all the way to the final prediction. We propose a novel approach to identify disturbed measurements in the input by learning an input confidence estimator in a self-supervised manner based on normalized convolutional neural networks (NCNNs). Further, we propose a probabilistic version of NCNNs that produces a statistically meaningful uncertainty measure for the final prediction. When we evaluate our approach on the KITTI dataset for depth completion, we outperform all the existing Bayesian Deep Learning approaches in terms of prediction accuracy, quality of the uncertainty measure, and computational efficiency. Moreover, our small network with 670k parameters performs on-par with conventional approaches with millions of parameters. These results give strong evidence that separating the network into parallel uncertainty and prediction streams leads to state-of-the-art performance with accurate uncertainty estimates.
[prediction, dataset] [confidence, final, object, propose, table, groundtruth, lidar] [input, model, noise, trained, applicability, example, ensemble, evaluated, quality, case] [output, proposed, signal, figure, noisy, convolution, ieee, version, performs, flow, convolutional, pattern, optical, gaussian, likelihood, fusion] [loss, produce, learn, image] [network, probabilistic, data, variance, deep, normalized, measure, learning, layer, compared, binary, function, neural, task, bayesian, problem, training, accuracy, arxiv, computational, proper, impact] [uncertainty, depth, ncnn, sparse, approach, vision, error, computer, estimation, conference, disturbed, ncnns, rmse, completion, estimated, rgb, estimate, pncnn, accurate, ause, international, dense, basis, michael, cost, unguided, full]
@InProceedings{Eldesokey_2020_CVPR,
  author = {Eldesokey, Abdelrahman and Felsberg, Michael and Holmquist, Karl and Persson, Michael},
  title = {Uncertainty-Aware CNNs for Depth Completion: Uncertainty from Beginning to End},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
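The input-confidence estimation described above builds on normalized convolution, in which a confidence map travels through the network alongside the sparse depth. The PyTorch sketch below shows a single generic normalized-convolution step, not the paper's NCNN/pNCNN architecture; the kernel, the confidence-propagation rule and the toy data are our assumptions.

import torch
import torch.nn.functional as F

def normalized_conv(depth, conf, weight, eps=1e-8):
    # One normalized-convolution step on sparse depth.
    # depth, conf: (B, 1, H, W); weight: (1, 1, k, k), assumed non-negative.
    pad = weight.shape[-1] // 2
    num = F.conv2d(depth * conf, weight, padding=pad)   # confidence-weighted sum of depths
    den = F.conv2d(conf, weight, padding=pad)           # total confidence under the kernel
    out_depth = num / (den + eps)
    out_conf = den / weight.sum()                       # simple propagated confidence
    return out_depth, out_conf

# Toy usage: roughly 5% valid LiDAR-like samples with confidence 1, the rest unobserved.
d = torch.zeros(1, 1, 64, 64)
c = (torch.rand(1, 1, 64, 64) < 0.05).float()
d[c > 0] = torch.rand_like(d[c > 0])
w = torch.ones(1, 1, 3, 3)                              # in the paper this kernel is learned
dense_d, new_c = normalized_conv(d, c, w)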
Butterfly Transform: An Efficient FFT Based Neural Architecture Design
Keivan Alizadeh vahid, Anish Prabhu, Ali Farhadi, Mohammad Rastegari


In this paper, we show that extending the butterfly operations from the FFT algorithm to a general Butterfly Transform (BFT) can be beneficial in building an efficient block structure for CNN designs. Pointwise convolutions, which we refer to as channel fusions, are the main computational bottleneck in the state-of-the-art efficient CNNs (e.g. MobileNets). We introduce a set of criteria for channel fusion, and prove that BFT yields an asymptotically optimal FLOP count with respect to these criteria. By replacing pointwise convolutions with BFT, we reduce the computational complexity of these layers from O(n^2) to O(n log n) with respect to the number of channels. Our experimental evaluations show that our method results in significant accuracy gains across a wide range of network architectures, especially at low FLOP ranges. For example, BFT results in up to a 6.75% absolute Top-1 improvement for MobileNetV1, 4.4% for ShuffleNet V2 and 5.4% for MobileNetV3 on ImageNet under a similar number of FLOPs. Notably, ShuffleNet-V2+BFT outperforms state-of-the-art architecture search methods MNasNet, FBNet and MobileNetV3 in the low FLOP regime.
[structured, outperforms, work, connected, node] [table, including, cnn] [input, model, internal, constrained, effective] [channel, fusion, figure, low, convolutional, output, transform, convolution, residual, block, method, range, tensor, spatial, proposed, based, recursive, fast, ieee] [image] [butterfly, neural, bft, pointwise, architecture, network, efficient, matrix, accuracy, deep, number, search, layer, design, flop, computational, training, weight, complexity, size, learning, base, bottleneck, log, linear, product, reduce, computation, arxiv, preprint, replacing, extremely, path, replace, procedure, small, bflayer, andrew, power, large, circulant, function] [structure, conference, international, transformation]
@InProceedings{vahid_2020_CVPR,
  author = {vahid, Keivan Alizadeh and Prabhu, Anish and Farhadi, Ali and Rastegari, Mohammad},
  title = {Butterfly Transform: An Efficient FFT Based Neural Architecture Design},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
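The butterfly operation referenced above replaces a dense n x n channel-mixing matrix (a pointwise convolution) with log2(n) stages of 2x2 mixes, the same data flow as the radix-2 FFT, which is where the O(n log n) count comes from. The numpy sketch below shows that data flow for a single channel vector; in the actual layer the 2x2 weights are learned and the operation is applied at every spatial location, and all names here are ours.

import numpy as np

def butterfly_transform(x, weights):
    # x: channel vector of length n (a power of two).
    # weights: one (n//2, 2, 2) array of 2x2 mixing weights per stage.
    n = x.shape[0]
    stages = int(np.log2(n))
    y = x.copy()
    for s in range(stages):
        stride = 1 << s
        out = np.empty_like(y)
        pair = 0
        for block in range(0, n, 2 * stride):
            for i in range(block, block + stride):
                a, b = y[i], y[i + stride]
                w = weights[s][pair]
                out[i] = w[0, 0] * a + w[0, 1] * b
                out[i + stride] = w[1, 0] * a + w[1, 1] * b
                pair += 1
        y = out
    return y

n = 8
x = np.random.randn(n)
ws = [np.random.randn(n // 2, 2, 2) for _ in range(int(np.log2(n)))]
y = butterfly_transform(x, ws)   # n/2 * log2(n) small mixes instead of an n x n matmul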
A Certifiably Globally Optimal Solution to Generalized Essential Matrix Estimation
Ji Zhao, Wanting Xu, Laurent Kneip


We present a convex optimization approach for generalized essential matrix (GEM) estimation. The six-point minimal solver for the GEM has poor numerical stability and applies only for a minimal number of points. Existing non-minimal solvers for GEM estimation rely on either local optimization or relinearization techniques, which impedes high accuracy in common scenarios. Our proposed non-minimal solver minimizes the sum of squared residuals by reformulating the problem as a quadratically constrained quadratic program. The globally optimal solution is thus obtained by a semidefinite relaxation. The algorithm retrieves certifiably globally optimal solutions to the original non-convex problem in polynomial time. We also provide the necessary and sufficient conditions to recover the optimal GEM from the relaxed problems. The improved performance is demonstrated over experiments on both synthetic and real multi-camera systems.
[] [gem, global, redundant] [noise, central, condition, model, easily] [method, ieee, pattern, proposed, figure, scale, recover, dual, motion, primal] [generalized, translation, real, synthetic, common] [problem, matrix, optimal, optimization, number, quadratic, linear, theorem, denote, min, general, relaxation, note, set, vector, interior] [relative, pose, essential, estimation, median, point, computer, sdp, conference, solution, rotation, error, vision, camera, constraint, algebraic, semidefinite, minimal, formulation, globally, laurent, solver, certifiably, convex, sufficient, sdr, international, richard, local, polynomial, geometric, optimality, epipolar, qcqp, marc, kneip]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Ji and Xu, Wanting and Kneip, Laurent},
  title = {A Certifiably Globally Optimal Solution to Generalized Essential Matrix Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
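The pipeline in the abstract, a sum of squared residuals rewritten as a QCQP and then relaxed to an SDP, follows the standard semidefinite-relaxation pattern sketched below in generic form; the particular cost matrix C and constraint matrices A_i for the generalized essential matrix are given in the paper and are not reproduced here.

\min_{x}\ x^{\top}Cx\quad\text{s.t.}\quad x^{\top}A_{i}x=b_{i},\ \ i=1,\dots,m\qquad\text{(QCQP)}

Introducing X=xx^{\top} (which implies X\succeq 0 and \operatorname{rank}(X)=1) and dropping the rank constraint gives the relaxation

\min_{X\succeq 0}\ \operatorname{tr}(CX)\quad\text{s.t.}\quad \operatorname{tr}(A_{i}X)=b_{i},\ \ i=1,\dots,m.

When the SDP optimum has rank one, the minimizer of the original non-convex problem is recovered from it exactly, which is what makes the solution certifiably globally optimal.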
MUXConv: Information Multiplexing in Convolutional Neural Networks
Zhichao Lu, Kalyanmoy Deb, Vishnu Naresh Boddeti


Convolutional neural networks have witnessed remarkable improvements in computational efficiency in recent years. A key driving force has been the idea of trading-off model expressivity and efficiency through a combination of 1x1 and depth-wise separable convolutions in lieu of a standard convolutional layer. The price of the efficiency, however, is the sub-optimal flow of information across space and channels in the network. To overcome this limitation, we present MUXConv, a layer that is designed to increase the flow of information by progressively multiplexing channel and spatial information in the network, while mitigating computational complexity. Furthermore, to demonstrate the effectiveness of MUXConv, we integrate it within an efficient multi-objective evolutionary algorithm to search for the optimal model hyper-parameters while simultaneously optimizing accuracy, compactness, and computational efficiency. On ImageNet, the resulting models, dubbed MUXNets, match the performance (75.3% top-1 accuracy) and multiply-add operations (218M) of MobileNetV3 while being 1.6x more compact, and outperform other mobile models in all the three criteria. MUXNet also performs well under transfer learning and when adapted to object detection. On the ChestX-Ray 14 benchmark, its accuracy is comparable to the state-of-the-art while being 3.3x more compact and 14x more efficient. Similarly, detection on PASCAL VOC 2007 is 1.2% more accurate, 28% faster and 6% more compact compared to MobileNetV2.
[recognition, three, multiple] [feature, object, detection, superpixel, table, cnn, adopt, pascal] [model, input] [spatial, channel, convolutional, reference, ieee, pattern, flow, figure, processed, convolution, separable, designed, combination] [image, transfer] [multiplexing, accuracy, search, number, neural, computational, learning, efficient, muxnet, efficiency, operation, predictive, madds, size, architecture, compactness, imagenet, layer, performance, mobile, training, simultaneously, muxconv, hyperparameter, standard, evolutionary, muxnets, processing, subpixel, group, andrew, quoc, algorithm, compact, large, optimize, classification, multiplexed, width, objective, ideal, deep, small, shufflenet, shuffling] [conference, computer, vision, international, auto, single]
@InProceedings{Lu_2020_CVPR,
  author = {Lu, Zhichao and Deb, Kalyanmoy and Boddeti, Vishnu Naresh},
  title = {MUXConv: Information Multiplexing in Convolutional Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PointGMM: A Neural GMM Network for Point Clouds
Amir Hertz, Rana Hanocka, Raja Giryes, Daniel Cohen-Or


Point clouds are a popular representation for 3D shapes. However, they encode a particular sampling without accounting for shape priors or non-local information. We advocate for the use of a hierarchical Gaussian mixture model (hGMM), which is a compact, adaptive and lightweight representation that probabilistically defines the underlying 3D surface. We present PointGMM, a neural network that learns to generate hGMMs which are characteristic of the shape class, and also coincide with the input point cloud. PointGMM is trained over a collection of shapes to learn a class-specific prior. The hierarchical representation has two main advantages: (i) coarse-to-fine learning, which avoids converging to poor local-minima; and (ii) (an unsupervised) consistent partitioning of the input shape. We show that as a generative model, PointGMM learns a meaningful latent space which enables generating consistent interpolations between existing shapes, as well as synthesizing novel shapes. We also present a novel framework for rigid registration using PointGMM, that learns to disentangle orientation from structure of an input shape.
[decoder, hierarchical, node, attention] [framework, feature, global, level, ablation] [input, model, trained, gaussians, robust] [gmm, gaussian, ieee, figure, pattern, output, proposed, method, tree] [latent, representation, generative, loss, learns, encoder, learn, generation, train, generate, disentangle, missing, generates, source] [network, vector, learning, neural, set, mixture, deep, sampling, learned, sample, vanilla, number, processing, task, training, arxiv, preprint, space, large, function, compared, architecture, test] [point, shape, pointgmm, registration, hgmm, cloud, transformation, gmms, computer, conference, partial, vision, rigid, mlp, approach, novel, local, enables, surface, representing, pointnet, canonical, leonidas, directly, additional, rotation, chair, airplane, acm, michael, hao, consistent]
@InProceedings{Hertz_2020_CVPR,
  author = {Hertz, Amir and Hanocka, Rana and Giryes, Raja and Cohen-Or, Daniel},
  title = {PointGMM: A Neural GMM Network for Point Clouds},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Noisier2Noise: Learning to Denoise From Unpaired Noisy Data
Nick Moran, Dan Schmidt, Yu Zhong, Patrick Coady


We present a method for training a neural network to perform image denoising without access to clean training examples or access to paired noisy training examples. Our method requires only a single noisy realization of each training example and a statistical model of the noise distribution, and is applicable to a wide variety of noise models, including spatially structured noise. Our model produces results which are competitive with other learned methods which require richer training data, and outperforms traditional non-learned denoising methods. We present derivations of our method for arbitrary additive noise, an improvement specific to Gaussian additive noise, and an extension to multiplicative Bernoulli noise.
[predict, structured, work] [propose, table] [noise, clean, input, trained, model, true, original, correction, quality, exist, correlated, white, access, multiplicative] [noisy, method, denoising, figure, output, psnr, ieee, realization, gaussian, kodak, pixel, recover, ssim, spatially, simply, comparison, pattern, applicable] [image, loss, synthetic, perform, paired, masked, train, produce, plausible, learn, produced] [network, training, neural, note, requires, sample, test, learning, observe, arxiv, preprint, standard, function, deep, additive, performance, distribution, set, find, average, data, algorithm, lower, larger, higher, bernoulli] [estimate, additional, single, require, conference, computer, variety, approach, reconstruction, uncertainty, view, vision]
@InProceedings{Moran_2020_CVPR,
  author = {Moran, Nick and Schmidt, Dan and Zhong, Yu and Coady, Patrick},
  title = {Noisier2Noise: Learning to Denoise From Unpaired Noisy Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
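A minimal PyTorch sketch of the training scheme described above for additive Gaussian noise, under the stated statistical model: add a second, independent draw of the noise to each already-noisy image, regress the doubly-noisy input towards the singly-noisy target, and at test time apply the correction that recovers an estimate of the clean image. The tiny network, the synthetic data loader and the noise level are placeholders, not the paper's setup.

import torch
import torch.nn as nn

sigma = 0.1                                               # assumed known std of the additive noise
net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 3, padding=1))       # stand-in for a real denoising network
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
dataloader = [torch.rand(8, 1, 64, 64) for _ in range(10)]  # each batch: singly noisy images y = x + n

for noisy in dataloader:
    noisier = noisy + sigma * torch.randn_like(noisy)     # z = y + m, second independent noise draw
    loss = ((net(noisier) - noisy) ** 2).mean()           # regress the noisier input towards the noisy target
    opt.zero_grad(); loss.backward(); opt.step()

# Inference (additive noise): add one more noise draw, then apply the correction x_hat = 2*f(z) - z.
with torch.no_grad():
    y = dataloader[0][:1]
    z = y + sigma * torch.randn_like(y)
    x_hat = 2 * net(z) - z        # averaging over several draws of z reduces the variance further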
TRPLP - Trifocal Relative Pose From Lines at Points
Ricardo Fabbri, Timothy Duff, Hongyi Fan, Margaret H. Regan, David da Costa de Pinho, Elias Tsigaridas, Charles W. Wampler, Jonathan D. Hauenstein, Peter J. Giblin, Benjamin Kimia, Anton Leykin, Tomas Pajdla


We present a method for solving two minimal problems for relative camera pose estimation from three views, which are based on three view correspondences of (i) three points and one line and (ii) three points and two lines through two of the points. These problems are too difficult to be efficiently solved by state-of-the-art Gröbner basis methods. Our method is based on a new efficient homotopy continuation (HC) solver, which dramatically speeds up previous HC solving by specializing HC methods to generic cases of our problems. We show in simulated experiments that our solvers are numerically robust and stable under image noise. We show in real experiments that (i) SIFT features provide good enough point-and-line correspondences for three-view reconstruction and (ii) we can solve difficult cases with too few or too noisy tentative matches, where state-of-the-art structure-from-motion initialization fails.
[three, dataset, multiple, failure, described, state] [feature, localization] [university, degree, curve, dim, differential] [figure, ieee, pattern, based, journal, method, traditional, motion, analysis, june] [image, real, third] [problem, number, triplet, general, set, average, data, start, efficient, stable, algorithm] [pose, trifocal, computer, estimation, minimal, relative, camera, point, orientation, homotopy, vision, conference, continuation, solver, polynomial, numerical, international, geometry, error, algebraic, solution, calibrated, solve, chicago, tangent, solving, structure, bifocal, system, cleveland, david, sift, reconstruction, approach, colmap, supplementary, ricardo, view, sfm, homogeneous, multiview, reprojection]
@InProceedings{Fabbri_2020_CVPR,
  author = {Fabbri, Ricardo and Duff, Timothy and Fan, Hongyi and Regan, Margaret H. and Pinho, David da Costa de and Tsigaridas, Elias and Wampler, Charles W. and Hauenstein, Jonathan D. and Giblin, Peter J. and Kimia, Benjamin and Leykin, Anton and Pajdla, Tomas},
  title = {TRPLP - Trifocal Relative Pose From Lines at Points},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DSNAS: Direct Neural Architecture Search Without Parameter Retraining
Shoukang Hu, Sirui Xie, Hehui Zheng, Chunxiao Liu, Jianping Shi, Xunying Liu, Dahua Lin


If NAS methods are solutions, what is the problem? Most existing NAS methods require two-stage parameter optimization. However, the performance of the same architecture in the two stages correlates poorly. In this work, we propose a new problem definition for NAS, task-specific end-to-end, based on this observation. We argue that given a computer vision task for which a NAS method is expected, this definition can reduce the vaguely-defined NAS evaluation to i) the accuracy on this task and ii) the total computation consumed to finally obtain a model with satisfactory accuracy. Seeing that most existing methods do not solve this problem directly, we propose DSNAS, an efficient differentiable NAS framework that simultaneously optimizes architecture and parameters with a low-biased Monte Carlo estimate. Child networks derived from DSNAS can be deployed directly without parameter retraining. Compared with two-stage methods, DSNAS successfully discovers networks with comparable accuracy (74.4%) on ImageNet in 420 GPU hours, reducing the total time by more than 34%.
[evaluation, time, provide] [stage, correlation, parent, table, round, propose, framework] [model, derived] [existing, method, proposed, result, assumption] [progressive] [architecture, search, neural, dsnas, searching, network, snas, performance, training, accuracy, optimization, learning, arxiv, preprint, parameter, random, problem, efficiency, proxylessnas, forward, task, data, memory, gradient, computation, spos, retraining, gpu, process, choice, discrete, calculated, space, batch, objective, sampling, backward, total, efficient, set, machine, computational, metric, comparable, imagenet, number, expected, subnetwork, implementation, path, manual, quoc, optimizes, algorithm, searched, distribution, stochastic, complexity, validation, log, ranking, tau] [differentiable, computer, vision, solution, conference, directly, single, direct, continuous]
@InProceedings{Hu_2020_CVPR,
  author = {Hu, Shoukang and Xie, Sirui and Zheng, Hehui and Liu, Chunxiao and Shi, Jianping and Liu, Xunying and Lin, Dahua},
  title = {DSNAS: Direct Neural Architecture Search Without Parameter Retraining},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MonoPair: Monocular 3D Object Detection Using Pairwise Spatial Relationships
Yongjian Chen, Lei Tai, Kai Sun, Mingyang Li


Monocular 3D object detection is an essential component in autonomous driving while challenging to solve, especially for those occluded samples which are only partially visible. Most detectors consider each 3D object as an independent training target, inevitably resulting in a lack of useful information for occluded samples. To this end, we propose a novel method to improve the monocular 3D object detection by considering the relationship of paired samples. This allows us to encode spatial constraints for partially-occluded objects from their adjacent neighbors. Specifically, the proposed detector computes uncertainty-aware predictions for object locations and 3D distances for the adjacent object pairs, which are subsequently jointly optimized by nonlinear least squares. Finally, the one-stage uncertainty-aware prediction structure and the post-optimization module are dedicatedly integrated for ensuring the run-time efficiency. Experiments demonstrate that our method yields the best performance on KITTI 3D detection benchmark, by outperforming state-of-the-art competitors by wide margins, especially for the hard samples.
[pair, prediction, three, relationship, graph, considering, represent, recognition] [object, detection, feature, bounding, autonomous, box, table, predicted, map, offset, monopair, center, apbv, occluded, pedestrian, location, regression, detector, backbone, branch, propose, hard] [input] [spatial, figure, ieee, output, pattern, method, proposed, based] [image, paired] [pairwise, network, training, weight, deep, optimization, matrix, neural, size, set, baseline, problem, compared, validation, learning, precision] [monocular, uncertainty, computer, constraint, vision, conference, depth, distance, camera, keypoint, error, kitti, coordinate, geometric, international, point, orientation, local, kvij, novel, accurate, additional, view, absolute]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Yongjian and Tai, Lei and Sun, Kai and Li, Mingyang},
  title = {MonoPair: Monocular 3D Object Detection Using Pairwise Spatial Relationships},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Regularization on Spatio-Temporally Smoothed Feature for Action Recognition
Jinhyung Kim, Seunghwan Cha, Dongyoon Wee, Soonmin Bae, Junmo Kim


Deep neural networks for video action recognition frequently require 3D convolutional filters and often encounter overfitting due to the larger number of parameters. In this paper, we propose Random Mean Scaling (RMS), a simple and effective regularization method, to relieve the overfitting problem in 3D residual networks. The key idea of RMS is to randomly vary the magnitude of the low-frequency components of the feature to regularize the model. The low-frequency component can be derived by taking a spatio-temporal mean over a local patch of the feature. We show that selective regularization on this locally smoothed feature makes a model handle the low-frequency and high-frequency components distinctively, resulting in a performance improvement. RMS enhances a model with little additional computation, incurred only during training, similar to other regularization methods. RMS can also be incorporated into a typical training process without any bells and whistles. Experimental results show the improvement in generalization performance on popular action recognition datasets, demonstrating the effectiveness of RMS as a regularization technique compared to other state-of-the-art regularization methods.
[action, recognition, dataset, slowfast, video, three, evaluation, spatiotemporal, modulation] [feature, module, table, improves, resnet, propose, branch] [model, perturbation, input, effective, tested, difference, magnitude, generalization, trained, examine, type, choose] [rms, method, proposed, ieee, residual, convolutional, applying, pattern, gaussian, slowonly, shakedrop] [image, component, factor] [accuracy, regularization, performance, training, scaling, baseline, random, filter, learning, network, neural, validation, overfitting, number, compared, average, better, deep, simple, randomly, convnets, batch, larger, sampling, set, bottleneck, best, processing, classification, experiment] [conference, computer, vision, direction, international, additional, compare, local, position]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Jinhyung and Cha, Seunghwan and Wee, Dongyoon and Bae, Soonmin and Kim, Junmo},
  title = {Regularization on Spatio-Temporally Smoothed Feature for Action Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
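Reading the abstract above, Random Mean Scaling splits a feature map into a locally smoothed (low-frequency) part and its residual, and randomly rescales only the smoothed part during training. The sketch below reflects that reading; the pooling window, the sampling range of the scale and the per-sample sampling are our assumptions rather than the paper's exact choices.

import torch
import torch.nn.functional as F

def random_mean_scaling(x, delta=0.5, training=True):
    # x: (B, C, T, H, W) feature map from a 3D CNN.
    if not training:
        return x                                            # no effect and no cost at test time
    low = F.avg_pool3d(x, kernel_size=3, stride=1, padding=1)   # spatio-temporal local mean
    high = x - low                                          # high-frequency residual
    # One random scale per sample, drawn from [1 - delta, 1 + delta].
    scale = 1.0 + delta * (2 * torch.rand(x.size(0), 1, 1, 1, 1, device=x.device) - 1)
    return scale * low + high

feat = torch.randn(4, 64, 8, 14, 14)
out = random_mean_scaling(feat)   # applied only during training, like other regularizers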
Towards Accurate Scene Text Recognition With Semantic Reasoning Networks
Deli Yu, Xuan Li, Chengquan Zhang, Tao Liu, Junyu Han, Jingtuo Liu, Errui Ding


A scene text image contains two levels of content: visual texture and semantic information. Although previous scene text recognition methods have made great progress over the past few years, research on mining semantic information to assist text recognition has attracted less attention; only RNN-like structures have been explored to implicitly model semantic information. However, we observe that RNN-based methods have some obvious shortcomings, such as the time-dependent decoding manner and the one-way serial transmission of semantic context, which greatly limit the benefit of semantic information and the computational efficiency. To mitigate these limitations, we propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition, where a global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission. State-of-the-art results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method. In addition, the speed of SRN has significant advantages over RNN-based methods, demonstrating its value in practical use.
[text, visual, srn, attention, recognition, gsrm, reasoning, context, word, character, time, transformer, reading, embedding, long, order, irregular, sequence, ctc, decoding, trainable, string, prediction, cong, decoder, svt, chinese, step, pvam, mechanism, natural, svtp, previous] [semantic, global, module, feature, backbone, xiang, table, propose, framework, named, including, effectiveness] [model, robust, input] [parallel, based, method, fusion, proposed, spatial, figure, rectification, transmission] [aligned, image, loss, street] [network, set, training, performance, number, test, arxiv, preprint, neural, efficiency, serial, better, function, max, andrew, computation] [scene, capture, structure, accurate, novel, compare, volume]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Deli and Li, Xuan and Zhang, Chengquan and Liu, Tao and Han, Junyu and Liu, Jingtuo and Ding, Errui},
  title = {Towards Accurate Scene Text Recognition With Semantic Reasoning Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Reinforcement Learning of Transferable Meta-Skills for Embodied Navigation
Juncheng Li, Xin Wang, Siliang Tang, Haizhou Shi, Fei Wu, Yueting Zhuang, William Yang Wang


Visual navigation is a task of training an embodied agent by intelligently navigating to a target object (e.g., television) using only visual observations. A key challenge for current deep reinforcement learning models lies in the requirements for a large amount of training data. It is exceedingly expensive to construct sufficient 3D synthetic environments annotated with the target object information. In this paper, we focus on visual navigation in the low-resource setting, where we have only a few training environments annotated with object information. We propose a novel unsupervised reinforcement learning approach to learn transferable meta-skills (e.g., bypass obstacles, go straight) from unannotated environments without any supervisory signals. The agent can then fast adapt to visual navigation through learning a high-level master policy to combine these meta-skills, when the visual-navigation-specified reward is provided. Experimental results show that our method significantly outperforms the baseline by 53.34% relatively on SPL, and further qualitative analysis demonstrates that our method learns transferable motor primitives for visual navigation.
[visual, navigation, policy, agent, reinforcement, master, reward, state, hierarchical, embodied, spl, automatically, current, environment, artificial] [object, final, propose, framework, ablation] [adversarial, success, model, university] [method, figure, fast, ieee, based, pattern] [generator, unsupervised, learn, shared, transferable, transfer, learns, target, generate, proposes, curriculum, diversity, generated] [task, learning, training, number, ultra, update, arxiv, preprint, learned, set, random, deep, algorithm, reptile, neural, rate, denote, large, adapt, performance, path, evaluate, baseline, gradient, curiosity] [conference, vision, international, approach, computer, joint, intrinsic, robotics, novel]
@InProceedings{Li_2020_CVPR,
  author = {Li, Juncheng and Wang, Xin and Tang, Siliang and Shi, Haizhou and Wu, Fei and Zhuang, Yueting and Wang, William Yang},
  title = {Unsupervised Reinforcement Learning of Transferable Meta-Skills for Embodied Navigation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Inferring Attention Shift Ranks of Objects for Image Saliency
Avishek Siris, Jianbo Jiao, Gary K.L. Tam, Xianghua Xie, Rynson W.H. Lau


Psychology studies and behavioural observation show that humans shift their attention from one location to another when viewing an image of a complex scene. This is due to the limited capacity of the human visual system in simultaneously processing multiple visual inputs. The sequential shifting of attention on objects in a non-task oriented viewing can be seen as a form of saliency ranking. Although there are methods proposed for predicting saliency rank, they are not able to model this human attention shift well, as they are primarily based on ranking saliency values from binary prediction. Following psychological studies, in this paper, we propose to predict the saliency rank by inferring human attention shift. Due to the lack of such data, we first construct a large-scale salient object ranking dataset. The saliency rank of objects is defined by the order that an observer attends to these objects based on attention shift. The final saliency rank is an average across the saliency ranks of multiple observers. We then propose a learning-based CNN to leverage both bottom-up and top-down attention mechanisms to predict the saliency rank. Experimental results show that the proposed network achieves state-of-the-art performances on salient object rank prediction. Code and dataset are available at https://github.com/SirisAvishek/Attention_Shift_Ranks
[attention, visual, shift, order, fixation, multiple, prediction, dataset, predict, mechanism, selective, psychological, individual, semantics, relevant] [saliency, object, salient, module, detection, segmentation, sor, map, rsdnet, backbone, score, predicted, mask, propose, final, descending, feature, ali, huchuan, highest, global, contextual, fixated, pyramid, behavioural] [model, study, motivated] [proposed, spatial, based, convolutional, method, figure, pixel, mae] [image, user, distinct, corresponding, generate, generated, person] [rank, network, ranking, learning, consider, binary, average, deep, higher, maximum, classification, neural, size, set, architecture, note] [human, approach, scene, supported, laurent]
@InProceedings{Siris_2020_CVPR,
  author = {Siris, Avishek and Jiao, Jianbo and Tam, Gary K.L. and Xie, Xianghua and Lau, Rynson W.H.},
  title = {Inferring Attention Shift Ranks of Objects for Image Saliency},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Camera On-Boarding for Person Re-Identification Using Hypothesis Transfer Learning
Sk Miraj Ahmed, Aske R. Lejbolle, Rameswar Panda, Amit K. Roy-Chowdhury


Most of the existing approaches for person re-identification consider a static setting where the number of cameras in the network is fixed. An interesting direction, which has received little attention, is to explore the dynamic nature of a camera network, where one tries to adapt the existing re-identification models after on-boarding new cameras, with little additional effort. There have been a few recent methods proposed in person re-identification that attempt to address this problem by assuming the labeled data in the existing network is still available while adding new cameras. This is a strong assumption since there may exist some privacy issues for which one may not have access to those data. Rather, based on the fact that it is easy to store the learned re-identifications models, which mitigates any data privacy concern, we develop an efficient model adaptation approach using hypothesis transfer learning that aims to transfer the knowledge using only source models and limited labeled data, but without using any source camera data from the existing network. Our approach minimizes the effect of negative transfer by finding an optimal weighted combination of multiple source models for transferring the knowledge. Extensive experiments on four challenging benchmark datasets with variable number of cameras well demonstrate the efficacy of our proposed approach over state-of-the-art methods.
[multiple, recognition, dataset, three, step, outperforms, pair] [feature, newly, labeling] [adding, access, model, datasets, difference] [existing, method, figure, based, proposed, introduced, analysis] [source, target, person, transfer, unsupervised, cmc, ward, camel, learn, transferring, domain, yij, raid, reidentification, adaptation, perform, image] [data, learning, labeled, metric, network, average, pairwise, optimization, optimal, xij, accuracy, training, problem, learned, deep, knowledge, number, negative, set, best, algorithm, theorem, rate, rank, consider, amount, large, compared, weight, small, note, respect, test] [camera, limited, approach, hypothesis, distance, matching, defined, term, projection, computed]
@InProceedings{Ahmed_2020_CVPR,
  author = {Ahmed, Sk Miraj and Lejbolle, Aske R. and Panda, Rameswar and Roy-Chowdhury, Amit K.},
  title = {Camera On-Boarding for Person Re-Identification Using Hypothesis Transfer Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Joint Graph-Based Depth Refinement and Normal Estimation
Mattia Rossi, Mireille El Gheche, Andreas Kuhn, Pascal Frossard


Depth estimation is an essential component in understanding the 3D geometry of a scene, with numerous applications in urban and indoor settings. These scenarios are characterized by a prevalence of human made structures, which in most of the cases are either inherently piece-wise planar or can be approximated as such. With these settings in mind, we devise a novel depth refinement framework that aims at recovering the underlying piece-wise planarity of those inverse depth maps associated to piece-wise planar scenes. We formulate this task as an optimization problem involving a data fidelity term, which minimizes the distance to the noisy and possibly incomplete input inverse depth map, as well as a regularization, which enforces a piece-wise planar solution. As for the regularization term, we model the inverse depth map pixels as the nodes of a weighted graph, with the weight of the edge between two pixels capturing the likelihood that they belong to the same plane in the scene. The proposed regularization fits a plane at each pixel automatically, avoiding any a priori estimation of the scene planes, and enforces that strongly connected pixels are assigned to the same plane. The resulting optimization problem is solved efficiently with the ADAM solver. Extensive tests show that our method leads to a significant improvement in depth refinement, both visually and numerically, with respect to state-of-the-art algorithms on the Middlebury, KITTI and ETH3D multi-view datasets.
[dataset, order, associated, graph, recognition, multiple] [map, refinement, framework, refined, confidence, global, table, edge, bottom] [input, model] [disparity, method, pixel, bad, inverse, ieee, pattern, middlebury, proposed, reference, noisy, high, result, column, priori, scale, fast] [image, enforces, corresponding, row] [regularization, problem, set, considered, number, training, optimization, weight, large, reliable, data, weighted, belong, test, average, network, machine, function] [depth, error, normal, plane, planar, stereo, nltgv, conference, vision, scene, computer, ground, truth, kitti, estimation, term, international, matching, estimated, camera, andreas, second, local, reconstruction, provided, estimate, left, joint, incomplete, cost, underneath]
@InProceedings{Rossi_2020_CVPR,
  author = {Rossi, Mattia and Gheche, Mireille El and Kuhn, Andreas and Frossard, Pascal},
  title = {Joint Graph-Based Depth Refinement and Normal Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
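The optimization described above has the generic "data fidelity plus graph regularization" skeleton written below. The exact planarity penalty and edge-weight construction in the paper are richer than this, so treat the formula only as the shape of the objective.

\min_{\hat{d}}\ \sum_{i} c_{i}\,\bigl(\hat{d}_{i}-d_{i}\bigr)^{2}\;+\;\lambda\sum_{(i,j)\in\mathcal{E}} w_{ij}\,\phi_{ij}\bigl(\hat{d}\bigr)

Here \hat{d} is the refined inverse depth, d the noisy and possibly incomplete input with per-pixel confidences c_i, w_{ij} the weight of the graph edge between pixels i and j (large when they likely lie on the same plane), and \phi_{ij} a penalty that vanishes when the two pixels are consistent with a common local plane; as the abstract states, the resulting problem is minimized with the Adam solver.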
DR Loss: Improving Object Detection by Distributional Ranking
Qi Qian, Lei Chen, Hao Li, Rong Jin


Most of object detection algorithms can be categorized into two classes: two-stage detectors and one-stage detectors. Recently, many efforts have been devoted to one-stage detectors for the simple yet effective architecture. Different from two-stage detectors, one-stage detectors aim to identify foreground objects from all candidates in a single stage. This architecture is efficient but can suffer from the imbalance issue with respect to two aspects: the inter-class imbalance between the number of candidates from foreground and background classes and the intra-class imbalance in the hardness of background candidates, where only a few candidates are hard to be identified. In this work, we propose a novel distributional ranking (DR) loss to handle the challenge. For each image, we convert the classification problem to a ranking problem, which considers pairs of candidates within the image, to address the inter-class imbalance problem. Then, we push the distributions of confidence scores for foreground and background towards the decision boundary. After that, we optimize the rank of the expectations of derived distributions in lieu of original pairs. Our method not only mitigates the intra-class imbalance issue in background candidates but also improves the efficiency for the ranking algorithm. By merely replacing the focal loss in RetinaNet with the developed DR loss and applying ResNet-101 as the backbone, mAP of the single-scale test on COCO can be improved from 39.1% to 41.7% without bells and whistles, which demonstrates the effectiveness of the proposed loss function.
[pair] [positive, background, detection, object, foreground, hard, retinanet, ross, coco, table, improves, region, proposal, kaiming, confidence, adopt, piotr, effectiveness, illustration] [original, derived, identify, model, example, scenario, retina, developed] [proposed, comparison, conventional, convolutional] [loss, image, issue, corresponding, cross, address] [ranking, negative, distribution, imbalance, problem, number, performance, classification, rank, set, large, strategy, hinge, small, function, optimizing, min, objective, hardness, training, denote, learning, rate, distributional, expectation, entropy, standard, probability, max, observe, setting, deep, neural, compared, margin, candidate] [focal, single, handle]
@InProceedings{Qian_2020_CVPR,
  author = {Qian, Qi and Chen, Lei and Li, Hao and Jin, Rong},
  title = {DR Loss: Improving Object Detection by Distributional Ranking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
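A simplified reading of the distributional-ranking idea above, for one image: summarize the foreground confidences of the negative candidates and of the positive candidates by softmax-weighted expectations, so that hard candidates dominate each summary, then penalize the margin between the two expectations. This is an illustration of the idea rather than the exact DR loss; the temperatures, margin and smoothed penalty below are placeholders.

import torch
import torch.nn.functional as F

def distributional_ranking_sketch(pos_scores, neg_scores, margin=0.5,
                                  tau_pos=1.0, tau_neg=1.0):
    # pos_scores / neg_scores: 1-D tensors of foreground confidences (in [0, 1])
    # for the positive and negative candidates of a single image.
    q_neg = torch.softmax(neg_scores / tau_neg, dim=0)    # up-weights hard (high-scoring) negatives
    q_pos = torch.softmax(-pos_scores / tau_pos, dim=0)   # up-weights hard (low-scoring) positives
    e_neg = (q_neg * neg_scores).sum()                    # expectation of the negative distribution
    e_pos = (q_pos * pos_scores).sum()                    # expectation of the positive distribution
    return F.softplus(margin + e_neg - e_pos)             # rank the two expectations with a smooth hinge

loss = distributional_ranking_sketch(torch.rand(20), torch.rand(5000))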
Self-Trained Deep Ordinal Regression for End-to-End Video Anomaly Detection
Guansong Pang, Cheng Yan, Chunhua Shen, Anton van den Hengel, Xiao Bai


Video anomaly detection is of critical practical importance to a variety of real applications because it allows human attention to be focused on events that are likely to be of interest, in spite of an otherwise overwhelming volume of video. We show that applying self-trained deep ordinal regression to video anomaly detection overcomes two key limitations of existing methods, namely, 1) being highly dependent on manually labeled normal training data; and 2) sub-optimal feature learning. By formulating a surrogate two-class ordinal regression task we devise an end-to-end trainable video anomaly detection approach that enables joint representation learning and anomaly scoring without manually labeled normal/abnormal data. Experiments on eight real-world video scenes show that our proposed method outperforms state-of-the-art methods that require no labeled training data by a substantial margin, and enables easy and accurate localization of the identified anomalies. Furthermore, we demonstrate that our method offers effective human-in-the-loop anomaly detection which can be critical in applications where anomalies are rare and the false-negative cost is high.
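A hedged sketch of the self-training loop on unlabeled video: an off-the-shelf detector (IsolationForest here, purely as a stand-in for the paper's initialization) provides initial pseudo-labels, and a small network is fitted to two ordinal regression targets before re-scoring and re-labeling. All names, layer sizes, and the fixed ordinal targets are illustrative assumptions:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import IsolationForest

def self_trained_ordinal_scores(frame_feats, rounds=3, k=200, c_anom=1.0, c_norm=0.0):
    """frame_feats: (n_frames, d) array of per-frame features.
    Returns an anomaly score per frame; k, rounds, and the ordinal targets
    c_anom/c_norm are illustrative choices."""
    # Initial pseudo-labels from an off-the-shelf detector (no manual labels).
    scores = -IsolationForest(random_state=0).fit(frame_feats).score_samples(frame_feats)
    scorer = nn.Sequential(nn.Linear(frame_feats.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)
    x = torch.tensor(frame_feats, dtype=torch.float32)
    for _ in range(rounds):
        order = np.argsort(scores)
        anom_idx, norm_idx = order[-k:], order[:k]            # most / least anomalous frames
        idx = np.concatenate([anom_idx, norm_idx])
        target = torch.tensor([c_anom] * k + [c_norm] * k, dtype=torch.float32)
        for _ in range(100):                                  # fit the surrogate ordinal regression
            opt.zero_grad()
            pred = scorer(x[idx]).squeeze(1)
            nn.functional.mse_loss(pred, target).backward()
            opt.step()
        with torch.no_grad():                                 # re-score all frames and iterate
            scores = scorer(x).squeeze(1).numpy()
    return scores

scores = self_trained_ordinal_scores(np.random.rand(1000, 128))
```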
[video, frame, work, critical, previous, include, evaluation, identifying] [detection, feature, score, regression, scoring, improvement, del, table, false, positive, key] [model, auc, iterative, datasets, effective, identify, trained] [method, ieee, figure, existing, based, event, optimized, convolutional] [pseudo, unsupervised, address, corresponding, generate, perform] [anomaly, data, learning, ordinal, anomalous, labeled, set, deep, performance, learner, problem, abnormal, training, test, better, rate, large, iforest, umn, giorno, achieve, neural, function, layer, ucsd, manually, identified, compared, unmasking, note, class, best, network, activation] [normal, approach, initial, enables, well, human, require, iteratively]
@InProceedings{Pang_2020_CVPR,
  author = {Pang, Guansong and Yan, Cheng and Shen, Chunhua and Hengel, Anton van den and Bai, Xiao},
  title = {Self-Trained Deep Ordinal Regression for End-to-End Video Anomaly Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Few-Shot Class-Incremental Learning
Xiaoyu Tao, Xiaopeng Hong, Xinyuan Chang, Songlin Dong, Xing Wei, Yihong Gong


The ability to incrementally learn new classes is crucial to the development of real-world artificial intelligence systems. In this paper, we focus on a challenging but practical few-shot class-incremental learning (FSCIL) problem. FSCIL requires CNN models to incrementally learn new classes from very few labelled samples, without forgetting the previously learned ones. To address this problem, we represent the knowledge using a neural gas (NG) network, which can learn and preserve the topology of the feature manifold formed by different classes. On this basis, we propose the TOpology-Preserving knowledge InCrementer (TOPIC) framework. TOPIC mitigates the forgetting of the old classes by stabilizing NG's topology and improves the representation learning for few-shot new classes by growing and adapting NG to new training samples. Comprehensive experimental results demonstrate that our proposed method significantly outperforms other state-of-the-art class-incremental learning methods on CIFAR100, miniImageNet, and CUB200 datasets.
[node, artificial, outperforms, recognize, dataset] [feature, centroid, table, cnn, adopt, achieves] [model, original] [output, ieee, figure, pattern, method, comparison] [loss, learns, learn, image, representation, corresponding] [learning, training, set, neural, class, fscil, knowledge, incremental, distillation, test, forgetting, topic, base, cil, gas, network, performance, catastrophic, memory, space, incrementally, vector, problem, processing, learned, number, classification, miniimagenet, session, requires, large, finetuning, classifier, encountered, arxiv, preprint, adapt, task, rate, quicknet, deep, continual, yihong, function, randomly] [conference, term, computer, vision, topology, approach, distance, international, defined, well]
@InProceedings{Tao_2020_CVPR,
  author = {Tao, Xiaoyu and Hong, Xiaopeng and Chang, Xinyuan and Dong, Songlin and Wei, Xing and Gong, Yihong},
  title = {Few-Shot Class-Incremental Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PolarMask: Single Shot Instance Segmentation With Polar Representation
Enze Xie, Peize Sun, Xiaoge Song, Wenhai Wang, Xuebo Liu, Ding Liang, Chunhua Shen, Ping Luo


In this paper, we introduce an anchor-box free and single shot instance segmentation method, which is conceptually simple, fully convolutional and can be used by easily embedding it into most off-the-shelf detection methods. Our method, termed PolarMask, formulates the instance segmentation problem as predicting the contour of each instance through instance center classification and dense distance regression in polar coordinates. Moreover, we propose two effective approaches to deal with sampling high-quality center examples and optimization for dense distance regression, respectively, which can significantly improve the performance and simplify the training process. Without any bells and whistles, PolarMask achieves 32.9% in mask mAP with single-model and single-scale training/testing on the challenging COCO dataset. For the first time, we show that the complexity of instance segmentation, in terms of both design and computation, can be the same as bounding box object detection, and that this much simpler and more flexible instance segmentation framework can achieve competitive accuracy. We hope that the proposed PolarMask framework can serve as a fundamental and strong baseline for single shot instance segmentation.
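A hedged sketch of how a binary instance mask can be converted into PolarMask-style regression targets (a center plus distances along equally spaced rays); the ray-marching, the use of the mass center, and n_rays=36 are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def polar_targets(mask, n_rays=36):
    """Convert a binary instance mask (H, W) into polar targets: a centre point
    plus n_rays distances from the centre to the contour, sampled at equally
    spaced angles. A simple ray-marching approximation."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()                 # mass centre as the instance centre
    angles = np.linspace(0, 2 * np.pi, n_rays, endpoint=False)
    h, w = mask.shape
    dists = np.zeros(n_rays)
    max_r = int(np.hypot(h, w))
    for i, a in enumerate(angles):
        # march outwards along the ray, keeping the last radius still inside the mask
        for r in range(1, max_r):
            y = int(round(cy + r * np.sin(a)))
            x = int(round(cx + r * np.cos(a)))
            if y < 0 or y >= h or x < 0 or x >= w or not mask[y, x]:
                break
            dists[i] = r
    return (cy, cx), dists

# Example: a filled square instance
m = np.zeros((64, 64), dtype=bool)
m[20:40, 25:45] = True
center, distances = polar_targets(m)
```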
[prediction, predict, work, length, longer] [polar, instance, mask, iou, polarmask, segmentation, center, object, bounding, regression, box, centerness, contour, detection, backbone, branch, feature, apl, table, achieves, predicted, ross, kaiming, coco, apm, fully, simpler, semantic, main, fcos, area, piotr, propose, map, framework, cartesian, improves, fps, jifeng, challenging] [model, improve, effective, university] [ieee, figure, convolutional, aps, proposed, method, competitive] [loss, representation, image, introduce] [classification, training, performance, network, upper, set, bound, simple, number, compared, better, achieve, large, design, task, sample, best, shot, optimization, equal] [distance, dense, point, coordinate, single, directly, angle, dmin, fundamental]
@InProceedings{Xie_2020_CVPR,
  author = {Xie, Enze and Sun, Peize and Song, Xiaoge and Wang, Wenhai and Liu, Xuebo and Liang, Ding and Shen, Chunhua and Luo, Ping},
  title = {PolarMask: Single Shot Instance Segmentation With Polar Representation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DeepEMD: Few-Shot Image Classification With Differentiable Earth Mover's Distance and Structured Classifiers
Chi Zhang, Yujun Cai, Guosheng Lin, Chunhua Shen


In this paper, we address the few-shot classification task from a new perspective of optimal matching between image regions. We adopt the Earth Mover's Distance (EMD) as a metric to compute a structural distance between dense image representations to determine image relevance. The EMD generates the optimal matching flows between structural elements that have the minimum matching cost, which is used to represent the image distance for classification. To generate the important weights of elements in the EMD formulation, we design a cross-reference mechanism, which can effectively minimize the impact caused by the cluttered background and large intra-class appearance variations. To handle k-shot classification, we propose to learn a structured fully connected layer that can directly classify dense image representations with the EMD. Based on the implicit function theorem, the EMD can be inserted as a layer into the network for end-to-end training. We conduct comprehensive experiments to validate our algorithm and we set new state-of-the-art performance on four popular few-shot classification benchmarks, namely miniImageNet, tieredImageNet, Fewshot-CIFAR100 (FC100) and Caltech-UCSD Birds-200-2011 (CUB).
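A minimal sketch of the core matching step: the EMD between two sets of local features, obtained by solving the underlying transport LP with SciPy. Uniform weights stand in for the paper's cross-reference weights, and the cosine cost is an assumption:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def emd_distance(feats_a, feats_b):
    """Earth Mover's Distance between two sets of local features
    (n, d) and (m, d), with uniform weights (the paper derives the weights
    from a cross-reference mechanism instead)."""
    n, m = len(feats_a), len(feats_b)
    # cost: cosine distance between every pair of local features
    cost = cdist(feats_a, feats_b, metric="cosine").ravel()
    supply = np.full(n, 1.0 / n)
    demand = np.full(m, 1.0 / m)
    # Equality constraints: rows of the flow matrix sum to the supply,
    # columns sum to the demand.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([supply, demand])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun  # minimum total matching cost = image-to-image distance

# Example: 5x5 grids of 64-d local features from two images
d = emd_distance(np.random.rand(25, 64), np.random.rand(25, 64))
```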
[structured, connected, mechanism, embedding, node, recognition, multiple, previous, represent, outperforms] [feature, fully, propose, global, score, background, adopt, object] [model] [based, ieee, pattern, proposed, method, comparison, june, convolutional] [image, generate, learn, prototype] [emd, learning, layer, classification, network, optimal, problem, weight, earth, neural, metric, performance, training, function, algorithm, cosine, vector, set, support, optimization, baseline, deep, arxiv, preprint, classifier, class, average, data, classify, popular, standard, parameter, xij, large] [distance, matching, local, conference, computer, vision, compute, cost, implicit, compare, differentiable, dense]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Chi and Cai, Yujun and Lin, Guosheng and Shen, Chunhua},
  title = {DeepEMD: Few-Shot Image Classification With Differentiable Earth Mover's Distance and Structured Classifiers},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Detection in Crowded Scenes: One Proposal, Multiple Predictions
Xuangeng Chu, Anlin Zheng, Xiangyu Zhang, Jian Sun


We propose a simple yet effective proposal-based object detector, aiming at detecting highly-overlapped instances in crowded scenes. The key to our approach is to let each proposal predict a set of correlated instances rather than a single one as in previous proposal-based frameworks. Equipped with new techniques such as EMD Loss and Set NMS, our detector can effectively handle the difficulty of detecting highly overlapped objects. On an FPN-Res50 baseline, our detector can obtain 4.9% AP gains on the challenging CrowdHuman dataset and 1.0% MR^-2 improvement on the CityPersons dataset, without bells and whistles. Moreover, on less crowded datasets like COCO, our approach can still achieve moderate improvement, suggesting the proposed method is robust to crowdedness.
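A minimal sketch of the Set NMS rule described above (illustrative, not the authors' code): standard greedy NMS, except that suppression is skipped between boxes emitted by the same proposal, so heavily overlapped true instances can survive:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def set_nms(boxes, scores, proposal_ids, iou_thr=0.5):
    """Greedy NMS with the 'set' rule: boxes from the same proposal never
    suppress each other. boxes: (n, 4), scores: (n,), proposal_ids: (n,) ints."""
    order = np.argsort(-scores)
    keep = []
    for i in order:
        suppressed = False
        for j in keep:
            if proposal_ids[i] == proposal_ids[j]:   # same proposal -> skip suppression
                continue
            if iou(boxes[i], boxes[j]) > iou_thr:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return keep

# Example: two predictions from the same proposal overlap heavily but both are kept.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
kept = set_nms(boxes, np.array([0.9, 0.8, 0.7]), proposal_ids=np.array([0, 0, 1]))
```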
[multiple, predict, prediction, previous, recognition, dataset, predicting, three, work, associated] [detection, crowded, proposal, fpn, object, instance, table, crowdhuman, refinement, box, recall, module, coco, false, detector, citypersons, confidence, pedestrian, jian, threshold, score, overlapped, duplicate, relationnet, ross, paradigm, overlap, kaiming, bernt, heavily] [detecting, original, effective, datasets, evaluated] [method, ieee, pattern, proposed, based, june] [loss, corresponding, introduce, common] [set, emd, baseline, validation, better, learning, find, note, xiangyu, performance, neural, network, deep, training, arxiv, preprint, simple, achieve] [computer, conference, vision, approach, ground, single, international, truth]
@InProceedings{Chu_2020_CVPR,
  author = {Chu, Xuangeng and Zheng, Anlin and Zhang, Xiangyu and Sun, Jian},
  title = {Detection in Crowded Scenes: One Proposal, Multiple Predictions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Autolabeling 3D Objects With Differentiable Rendering of SDF Shape Priors
Sergey Zakharov, Wadim Kehl, Arjun Bhargava, Adrien Gaidon


We present an automatic annotation pipeline to recover 9D cuboids and 3D shapes from pre-trained off-the-shelf 2D detectors and sparse LIDAR data. Our autolabeling method solves an ill-posed inverse problem by considering learned shape priors and optimizing geometric and physical parameters. To address this challenging problem, we apply a novel differentiable shape renderer to signed distance fields (SDF), leveraged together with normalized object coordinate spaces (NOCS). Initially trained on synthetic data to predict shape and coordinates, our method uses these predictions for projective and geometric alignment over real samples. Moreover, we also propose a curriculum learning strategy, iteratively retraining on samples of increasing difficulty in subsequent self-improving annotation rounds. Our experiments on the KITTI3D dataset show that we can recover a substantial amount of accurate cuboids, and that these autolabels can be used to train 3D vehicle detectors with state-of-the-art results.
[dataset, predict, work, automatic, difficulty] [object, annotation, detection, easy, iou, bev, table, lidar, employ, detector, map, apply] [trained, verification, query] [figure, recover, method, scale, based] [synthetic, train, image, loss, curriculum, utilize, real] [network, space, learning, data, optimization, label, amount, achieve, better, vector, respect, evaluate, observe, metric, set, normalized] [shape, differentiable, surface, pose, autolabeling, distance, rendering, autolabels, approach, pipeline, loop, sdf, signed, renderer, coordinate, estimate, ground, geometric, human, deepsdf, truth, projective, concerning, ransac, point, estimated, define, autolabel, kitti, novel, scene, initial]
@InProceedings{Zakharov_2020_CVPR,
  author = {Zakharov, Sergey and Kehl, Wadim and Bhargava, Arjun and Gaidon, Adrien},
  title = {Autolabeling 3D Objects With Differentiable Rendering of SDF Shape Priors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Interactive Object Segmentation With Inside-Outside Guidance
Shiyin Zhang, Jun Hao Liew, Yunchao Wei, Shikui Wei, Yao Zhao


This paper explores how to harvest precise object segmentation masks while minimizing the human interaction cost. To achieve this, we propose an Inside-Outside Guidance (IOG) approach in this work. Concretely, we leverage an inside point that is clicked near the object center and two outside points at the symmetrical corner locations (top-left and bottom-right or top-right and bottom-left) of a tight bounding box that encloses the target object. This results in a total of one foreground click and four background clicks for segmentation. The advantages of our IOG are four-fold: 1) the two outside points can help to remove distractions from other objects or background; 2) the inside point can help to eliminate the unrelated regions inside the bounding box; 3) the inside and outside points are easily identified, reducing the confusion raised by the state-of-the-art DEXTR in labeling some extreme samples; 4) our approach naturally supports additional click annotations for further correction. Despite its simplicity, our IOG not only achieves state-of-the-art performance on several popular benchmarks, but also demonstrates strong generalization capability across different domains such as street scenes, aerial imagery and medical images, without fine-tuning. In addition, we also propose a simple two-stage solution that enables our IOG to produce high-quality instance segmentation masks from existing datasets with off-the-shelf bounding boxes such as ImageNet and Open Images, demonstrating the superiority of our IOG as an annotation tool.
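A hedged sketch of how the inside and outside clicks might be encoded as extra input channels (Gaussian guidance maps concatenated to the RGB image); the Gaussian encoding, sigma, and channel layout are illustrative assumptions:

```python
import numpy as np

def gaussian_map(h, w, cy, cx, sigma=10.0):
    """2D Gaussian centred on a click, used as a guidance channel."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def iog_input(image, inside_click, box):
    """Build a 5-channel network input: RGB + inside map + outside map.
    inside_click: (y, x) near the object centre; box: (y1, x1, y2, x2) tight
    bounding box whose two symmetrical corners give the outside clicks."""
    h, w, _ = image.shape
    y1, x1, y2, x2 = box
    inside = gaussian_map(h, w, *inside_click)
    # two outside clicks at opposite corners of the enclosing box
    outside = np.maximum(gaussian_map(h, w, y1, x1), gaussian_map(h, w, y2, x2))
    return np.concatenate([image, inside[..., None], outside[..., None]], axis=-1)

# Example
img = np.random.rand(256, 256, 3)
x = iog_input(img, inside_click=(128, 128), box=(60, 70, 200, 190))  # (256, 256, 5)
```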
[three, dataset, naturally] [segmentation, iog, interactive, object, pascal, box, inside, bounding, semantic, coco, annotation, click, extreme, instance, table, yunchao, backbone, background, center, dextr, propose, foreground, iou, effectiveness, mask, clicking, fully, mval, kaiming, imagery, grabcut, annotated, sstem] [model, generalization, datasets, iterative, input, quality, trained] [figure, proposed, guidance, convolutional, method, comparison] [image, user, train, target, qualitative] [network, training, performance, imagenet, simple, interior, deep, set, open, large, arxiv, preprint, setting, reported] [point, additional, approach, structure, scene, well, human, ground, truth, simulated, thomas]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Shiyin and Liew, Jun Hao and Wei, Yunchao and Wei, Shikui and Zhao, Yao},
  title = {Interactive Object Segmentation With Inside-Outside Guidance},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Mnemonics Training: Multi-Class Incremental Learning Without Forgetting
Yaoyao Liu, Yuting Su, An-An Liu, Bernt Schiele, Qianru Sun


Multi-Class Incremental Learning (MCIL) aims to learn new concepts by incrementally updating a model trained on previous concepts. However, there is an inherent trade-off between effectively learning new concepts and avoiding catastrophic forgetting of previous ones. To alleviate this issue, it has been proposed to keep around a few examples of the previous concepts, but the effectiveness of this approach heavily depends on the representativeness of these examples. This paper proposes a novel and automatic framework we call mnemonics, where we parameterize exemplars and make them optimizable in an end-to-end manner. We train the framework through bilevel optimizations, i.e., model-level and exemplar-level. We conduct extensive experiments on three MCIL benchmarks, CIFAR-100, ImageNet-Subset and ImageNet, and show that using mnemonics exemplars can surpass the state-of-the-art by a large margin. Intriguingly, the mnemonics exemplars tend to lie on the boundaries between different classes.
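A heavily simplified sketch of the exemplar-level bilevel step: the exemplars are parameters, one differentiable inner update of a temporary classifier is taken on them, and the exemplars are then updated so that the adapted classifier still fits held-out old-class data. Raw feature vectors stand in for exemplar images, the linear classifier and all sizes are illustrative assumptions, and the paper alternates this with a model-level phase:

```python
import torch
import torch.nn.functional as F

# Mnemonic exemplars are parameters; in practice they are initialised from real old-class images.
exemplars = torch.randn(20, 512, requires_grad=True)                # 20 exemplar "images" (features here)
exemplar_labels = torch.randint(0, 5, (20,))
w = torch.randn(5, 512, requires_grad=True)                         # temporary linear classifier
val_x, val_y = torch.randn(100, 512), torch.randint(0, 5, (100,))   # held-out old-class data

opt_exemplar = torch.optim.Adam([exemplars], lr=1e-2)
inner_lr = 0.1
for step in range(50):
    opt_exemplar.zero_grad()
    # Inner (model-level) step: adapt the temporary classifier on the exemplars,
    # keeping the graph so gradients flow back into the exemplars.
    inner_loss = F.cross_entropy(exemplars @ w.t(), exemplar_labels)
    (grad_w,) = torch.autograd.grad(inner_loss, w, create_graph=True)
    w_adapted = w - inner_lr * grad_w
    # Outer (exemplar-level) step: the adapted classifier should still perform
    # well on held-out old-class data; update the exemplars to make it so.
    outer_loss = F.cross_entropy(val_x @ w_adapted.t(), val_y)
    outer_loss.backward()                                            # w.grad is also filled but never used
    opt_exemplar.step()
```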
[previous, work, current] [table, framework, visualization, global, subsequent, bernt] [model, trained, original, input] [figure, phase, proposed, based, called, adjust] [loss, train, learn, exemplar, image, transfer, adjusting, generative] [data, learning, training, class, herding, incremental, forgetting, mcil, classification, average, problem, set, random, lucir, number, optimization, temporary, imagenet, bilevel, machine, setting, icarl, validation, applied, test, catastrophic, sample, accuracy, baseline, distillation, subset, learned, performance, optimal, memory, note, weight, knowledge, early, optimize, uniform, denote, min, rate, bic, deep, yaoyao] [approach, bop, program, novel, local, initial]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yaoyao and Su, Yuting and Liu, An-An and Schiele, Bernt and Sun, Qianru},
  title = {Mnemonics Training: Multi-Class Incremental Learning Without Forgetting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Segment 3D Point Clouds in 2D Image Space
Yecheng Lyu, Xinming Huang, Ziming Zhang


In contrast to the literature where local patterns in 3D point clouds are captured by customized convolutional operators, in this paper we study the problem of how to effectively and efficiently project such point clouds into a 2D image space so that traditional 2D convolutional neural networks (CNNs) such as U-Net can be applied for segmentation. To this end, we are motivated by graph drawing and reformulate it as an integer programming problem to learn the topology-preserving graph-to-grid mapping for each individual point cloud. To accelerate the computation in practice, we further propose a novel hierarchical approximate algorithm. With the help of the Delaunay triangulation for graph construction from point clouds and a multi-scale U-Net for segmentation, we manage to demonstrate the state-of-the-art performance on ShapeNet and PartNet, respectively, with significant improvement over the literature. Code is available at https://github.com/Zhang-VISLab.
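As a hedged illustration of the graph-construction stage only, the sketch below derives an edge list from a Delaunay triangulation of a toy point cloud with SciPy (over the xy-coordinates for simplicity); the topology-preserving graph-to-grid mapping that the paper formulates as integer programming, and the multi-scale U-Net that follows, are omitted:

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_graph(points):
    """Edge list of a Delaunay-triangulation graph over a point cloud.
    The triangulation is computed over the xy-coordinates here; the paper
    then learns a topology-preserving mapping of this graph onto a regular
    2D grid before applying a standard U-Net."""
    tri = Delaunay(points[:, :2])
    edges = set()
    for a, b, c in tri.simplices:            # each simplex is a triangle of point indices
        edges.update({(min(a, b), max(a, b)),
                      (min(b, c), max(b, c)),
                      (min(a, c), max(a, c))})
    return sorted(edges)

pts = np.random.rand(1024, 3)                # toy point cloud (x, y, z)
graph_edges = delaunay_graph(pts)
```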
[graph, hierarchical, time, three, frame] [segmentation, table, propose, cnn, leading, lidar, miou, semantic, apply, denotes, feature, instance, key] [knn, literature] [convolutional, comparison, running, method, ieee, june, result, fast, integer, figure] [image, drawing, cluster, layout, representation, mapping, loss, avg] [network, algorithm, performance, complexity, learning, neural, number, size, problem, set, data, space, processing, sij, deep, applied, function, computational, computation, best, training, inference] [point, cloud, grid, local, triangulation, shapenet, shape, delaunay, partnet, pipeline, pointnet, well, sphere, mlp, novel, voxel, distance, october, conference, construction, kmeans, international]
@InProceedings{Lyu_2020_CVPR,
  author = {Lyu, Yecheng and Huang, Xinming and Zhang, Ziming},
  title = {Learning to Segment 3D Point Clouds in 2D Image Space},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Smooth Shells: Multi-Scale Shape Registration With Functional Maps
Marvin Eisenberger, Zorah Lahner, Daniel Cremers


We propose a novel 3D shape correspondence method based on the iterative alignment of so-called smooth shells. Smooth shells define a series of coarse-to-fine shape approximations designed to work well with multiscale algorithms. The main idea is to first align rough approximations of the geometry and then add more and more details to refine the correspondence. We fuse classical shape registration with Functional Maps by embedding the input shapes into an intrinsic-extrinsic product space. Moreover, we disambiguate intrinsic symmetries by applying a surrogate based Markov chain Monte Carlo initialization. Our method naturally handles various types of noise that commonly occur in real scans, like non-isometry or incompatible meshing. Finally, we demonstrate state-of-the-art quantitative results on several datasets and show that our pipeline produces smoother, more realistic results than other automatic matching methods in real world applications.
[embedding, hierarchical, modeling, previous, failure] [map, propose, main] [input, distortion, topological, noise, datasets, model] [method, spectral, figure, based, ieee, operator, scale, high, pattern, noisy] [alignment, align, surrogate, real] [product, algorithm, smoothing, initialization, space, rate, accuracy, markov, chain, monte, carlo, small, random] [shape, computer, correspondence, matching, functional, extrinsic, intrinsic, deformation, surface, michael, smooth, emanuele, approach, error, registration, daniel, geometry, local, vision, volume, rigid, geodesic, acm, conference, compute, mcmc, conformal, human, point, forum, defined, mesh, reconstruction, smoothness, faust, scape, rough]
@InProceedings{Eisenberger_2020_CVPR,
  author = {Eisenberger, Marvin and Lahner, Zorah and Cremers, Daniel},
  title = {Smooth Shells: Multi-Scale Shape Registration With Functional Maps},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation
Yude Wang, Jie Zhang, Meina Kan, Shiguang Shan, Xilin Chen


Image-level weakly supervised semantic segmentation is a challenging problem that has been deeply studied in recent years. Most advanced solutions exploit class activation maps (CAMs). However, CAMs can hardly serve as the object mask due to the gap between full and weak supervision. In this paper, we propose a self-supervised equivariant attention mechanism (SEAM) to discover additional supervision and narrow the gap. Our method is based on the observation that equivariance is an implicit constraint in fully supervised semantic segmentation, whose pixel-level labels take the same spatial transformation as the input images during data augmentation. However, this constraint is lost on the CAMs trained by image-level supervision. Therefore, we propose consistency regularization on predicted CAMs from various transformed images to provide self-supervision for network learning. Moreover, we propose a pixel correlation module (PCM), which exploits context appearance information and refines the prediction of the current pixel by its similar neighbors, leading to further improvement on CAM consistency. Extensive experiments on the PASCAL VOC 2012 dataset demonstrate our method outperforms state-of-the-art methods using the same level of supervision. The code is released online.
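A minimal sketch of the equivariance consistency term: the CAM of a transformed image should match the same transform applied to the CAM of the original image. Horizontal flipping stands in for the affine transforms used in the paper, and the tiny CAM network and L1 penalty are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A minimal stand-in for a CAM network: any fully convolutional classifier
# producing one activation map per class works here (21 classes for VOC).
cam_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 21, 1))

def equivariance_loss(image):
    """Consistency regularization: CAM(A(x)) should equal A(CAM(x)),
    where A is a spatial transform (horizontal flip here)."""
    cam_original = cam_net(image)
    cam_of_flipped = cam_net(torch.flip(image, dims=[3]))     # CAM of the transformed image
    flipped_cam = torch.flip(cam_original, dims=[3])          # transformed CAM of the original image
    return F.l1_loss(cam_of_flipped, flipped_cam)

loss = equivariance_loss(torch.rand(2, 3, 64, 64))
```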
[recognition, attention, prediction, mechanism, context] [segmentation, semantic, seam, weakly, cam, pcm, supervision, module, object, fully, background, siamese, feature, miou, propose, pascal, voc, correlation, foreground, improvement, affinity, narrow, rescaling, table, branch, map, revised, refine, denotes, mfn, mfp] [original, model, improve, input] [ieee, pattern, pixel, affine, convolutional, method, figure, proposed] [supervised, image, loss, pseudo, generated, gap, train, transformed, learn] [network, classification, activation, learning, regularization, performance, function, baseline, deep, class, achieve, fewer, evaluate, set, test, compared, denote, improved] [vision, conference, equivariant, computer, additional, transformation, consistent, ground, truth, international, equivariance]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yude and Zhang, Jie and Kan, Meina and Shan, Shiguang and Chen, Xilin},
  title = {Self-Supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Efficient Neural Vision Systems Based on Convolutional Image Acquisition
Pedram Pad, Simon Narduzzi, Clement Kundig, Engin Turetken, Siavash A. Bigdeli, L. Andrea Dunbar


Despite the substantial progress made in deep learning in recent years, advanced approaches remain computationally intensive. The trade-off between accuracy and computation time and energy limits their use in real-time applications on low power and other resource-constrained systems. In this paper, we tackle this fundamental challenge by introducing a hybrid optical-digital implementation of a convolutional neural network (CNN) based on engineering of the point spread function (PSF) of an optical imaging system. This is done by coding an imaging aperture such that its PSF replicates a large convolution kernel of the first layer of a pre-trained CNN. As the convolution takes place in the optical domain, it has zero cost in terms of energy consumption and has zero latency independent of the kernel size. Experimental results on two datasets demonstrate that our approach yields more than two orders of magnitude reduction in the computational cost while achieving near-state-of-the-art accuracy, or equivalently, better accuracy at the same computational cost.
[recognition, dataset, order, unit] [mask, object, cnn] [input, physical, trained, digital, mnist, model] [optical, convolution, light, kernel, sensor, figure, convolutional, pixel, proposed, ieee, coded, logarithmic, based, spatial, printing, luminosity, transmission, pattern, high, pedram, imaging, field, lens, spatially] [image, domain, factor, train, perform] [neural, network, layer, accuracy, computational, processing, function, training, number, activation, linear, deep, setup, size, perceptron, large, architecture, performance, efficient, learning, power, small, design, set, data, energy, implementation, consumption, appendix] [system, vision, cost, plane, approach, conference, computer, scene, position, international, hybrid, single, transformation, rest]
@InProceedings{Pad_2020_CVPR,
  author = {Pad, Pedram and Narduzzi, Simon and Kundig, Clement and Turetken, Engin and Bigdeli, Siavash A. and Dunbar, L. Andrea},
  title = {Efficient Neural Vision Systems Based on Convolutional Image Acquisition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Visual Chirality
Zhiqiu Lin, Jin Sun, Abe Davis, Noah Snavely


How can we tell whether an image has been mirrored? While we understand the geometry of mirror reflections very well, less has been said about how mirroring affects distributions of imagery at scale, despite its widespread use for data augmentation in computer vision. In this paper, we investigate how the statistics of visual data are changed by reflection. We refer to these changes as "visual chirality," after the concept of geometric chirality---the notion of objects that are distinct from their mirror image. Our analysis of visual chirality reveals surprising results, ranging from low-level chiral signals that pervade imagery due to in-camera image processing, to the ability to discover visual chirality in images of people and faces. Our work has implications for data augmentation, self-supervised learning, and image forensics.
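A hedged sketch of the basic probe: build a binary "was this image mirrored?" task and train a standard classifier on it; accuracy above chance on held-out data indicates chirality in the image distribution. The model choice and batch construction are illustrative, and the paper stresses that preprocessing (e.g., cropping and resizing) must itself be chosen so it does not leak trivial flip cues:

```python
import torch
import torch.nn.functional as F
from torchvision import models

def chirality_batch(images):
    """images: (n, 3, H, W) tensor. Returns a batch in which a random half of
    the images are horizontally mirrored, plus mirror/no-mirror labels."""
    labels = torch.randint(0, 2, (images.shape[0],))
    flipped = torch.where(labels.view(-1, 1, 1, 1).bool(),
                          torch.flip(images, dims=[3]), images)
    return flipped, labels

model = models.resnet18(num_classes=2)     # any standard classifier works
imgs = torch.rand(8, 3, 224, 224)
x, y = chirality_batch(imgs)
loss = F.cross_entropy(model(x), y)
# If this classifier beats chance on held-out data, the image distribution is
# visually chiral with respect to the cues the network can exploit.
```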
[visual, work, text, shirt, dataset, understand, predicting, people, predict] [cam, feature, including, object] [instagram, trained, model, jpeg, input, face, heatmaps, flipped, resizing, analyze, asymmetry, interesting] [figure, analysis, cropping, high, reflection] [chirality, image, chiral, discover, ffhq, achiral, unsupervised, train, hair, learn, common, discriminative, commutativity, cluster, subtle, content] [network, data, distribution, learning, random, training, test, set, accuracy, processing, classification, deep, augmentation, task, sample, subset, neural, consider, note, standard, performance, activation, imagenet, randomly, paper, measure] [left, symmetry, computer, geometric, define, transformation, hand, relative, vision, refer, approach]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Zhiqiu and Sun, Jin and Davis, Abe and Snavely, Noah},
  title = {Visual Chirality},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
What Machines See Is Not What They Get: Fooling Scene Text Recognition Models With Adversarial Text Images
Xing Xu, Jiefu Chen, Jinhui Xiao, Lianli Gao, Fumin Shen, Heng Tao Shen


The research on scene text recognition (STR) has made remarkable progress in recent years with the development of deep neural networks (DNNs). Recent studies on adversarial attacks have verified that a DNN model designed for non-sequential tasks (e.g., classification, segmentation and retrieval) can be easily fooled by adversarial examples. In practice, STR is an application with significant security implications. However, there are few studies considering the safety and reliability of STR models that make sequential predictions. In this paper, we make the first attempt at attacking the state-of-the-art DNN-based STR models. Specifically, we propose a novel and efficient optimization-based method that can be naturally integrated into different sequential prediction schemes, i.e., connectionist temporal classification (CTC) and attention mechanisms. We apply our proposed method to five state-of-the-art STR models in both targeted and untargeted attack modes, and comprehensive results on 7 real-world datasets and 2 synthetic datasets consistently show the vulnerability of these STR models with a significant performance drop. Finally, we also test our attack method on a real-world STR engine of Baidu OCR, which demonstrates the practical potential of our method.
[sequence, text, recognition, iter, character, attention, rosetta, sequential, prediction, natural, ctc, visual, work] [table, rare, propose, detection, adopt] [attack, adversarial, str, targeted, untargeted, model, attacking, trba, crnn, datasets, dist, acc, original, robust, input, perturbation, example, fool, changed, success, baidu, accomplish] [ieee, method, pattern, valid, proposed, output] [image, generated, synthetic, loss, transfer, edit] [learning, log, probability, neural, rate, network, objective, training, deep, efficient, classification, algorithm, min, find, optimization, path, max, reliability, problem, label, best] [conference, computer, scene, international, vision, distance, system]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Xing and Chen, Jiefu and Xiao, Jinhui and Gao, Lianli and Shen, Fumin and Shen, Heng Tao},
  title = {What Machines See Is Not What They Get: Fooling Scene Text Recognition Models With Adversarial Text Images},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Dynamic Traffic Modeling From Overhead Imagery
Scott Workman, Nathan Jacobs


Our goal is to use overhead imagery to understand patterns in traffic flow, for instance answering questions such as how fast could you traverse Times Square at 3am on a Sunday. A traditional approach for solving this problem would be to model the speed of each road segment as a function of time. However, this strategy is limited in that a significant amount of data must first be collected before a model can be used and it fails to generalize to new areas. Instead, we propose an automatic approach for generating dynamic maps of traffic speeds using convolutional neural networks. Our method operates on overhead imagery, is conditioned on location and time, and outputs a local motion model that captures likely directions of travel and corresponding travel speeds. To train our model, we take advantage of historical traffic data collected from New York City. Experimental results demonstrate that our method can be applied to generate accurate city-scale traffic models.
[traffic, road, speed, overhead, time, urban, travel, uber, dataset, transportation, work, york, understanding, decoder, nathan, historical, monday, modeling, context, represent, scott, movement, three, goal, planning, environment, predicting] [segment, location, predicted, segmentation, region, aggregation, imagery, including, propose] [model, city, collected] [figure, flow, ieee, method, dynamic, pattern, convolutional, advantage, spatial] [image, corresponding, loss, street, generate, mapping, underlying] [data, learning, network, impact, function, large, average, architecture, evaluate, deep, neural, support, objective, number, simultaneously, training] [conference, computer, approach, orientation, estimating, estimate, international, vision, estimation, directly, capture, local, compare, satellite, compute]
@InProceedings{Workman_2020_CVPR,
  author = {Workman, Scott and Jacobs, Nathan},
  title = {Dynamic Traffic Modeling From Overhead Imagery},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Satellite Image Time Series Classification With Pixel-Set Encoders and Temporal Self-Attention
Vivien Sainte Fare Garnot, Loic Landrieu, Sebastien Giordano, Nesrine Chehata


Satellite image time series, bolstered by their growing availability, are at the forefront of an extensive effort towards automated Earth monitoring by international institutions. In particular, large-scale control of agricultural parcels is an issue of major political and economic importance. In this regard, hybrid convolutional-recurrent neural architectures have shown promising results for the automated classification of satellite image time series. We propose an alternative approach in which the convolutional layers are advantageously replaced with encoders operating on unordered sets of pixels, exploiting the typically coarse resolution of publicly available satellite images. We also propose to extract temporal features using a bespoke neural architecture based on self-attention instead of recurrent networks. We demonstrate experimentally that our method not only outperforms previous state-of-the-art approaches in terms of precision, but also significantly decreases processing time and memory requirements. Lastly, we release a large open-access annotated dataset as a benchmark for future work on satellite image time series.
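A minimal sketch of the pixel-set encoder idea: a shared MLP is applied to an unordered random subset of a parcel's pixels and pooled with permutation-invariant statistics, giving one embedding per acquisition date; the layer sizes and mean/std pooling are illustrative assumptions, and the temporal self-attention stage is omitted:

```python
import torch
import torch.nn as nn

class PixelSetEncoder(nn.Module):
    """Minimal pixel-set encoder: a shared MLP over a sampled set of a parcel's
    pixels, followed by permutation-invariant pooling (mean and std)."""
    def __init__(self, n_channels=10, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_channels, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, pixel_sets):
        # pixel_sets: (batch, n_pixels, n_channels), an unordered set per parcel
        h = self.mlp(pixel_sets)                        # (batch, n_pixels, hidden)
        return torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)

# One embedding per (parcel, date); a temporal self-attention module would then
# aggregate the per-date embeddings into a single parcel descriptor.
enc = PixelSetEncoder()
emb = enc(torch.rand(4, 32, 10))    # 4 parcels, 32 sampled pixels, 10 spectral bands
```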
[temporal, time, transformer, recurrent, attention, sequence, work, dataset, encoding, observation, element, embedding, embeddings, positional, master] [remote, propose, miou, table, feature, head] [series, query, model, input, type] [crop, spatial, convolutional, sensing, parcel, agricultural, spectral, figure, proposed, resolution, pixel, cnns, cereal, automated, based, high, processed, pse, output] [image, encoder, mapping, encoders] [classification, architecture, neural, learning, network, deep, data, size, dimension, performance, memory, number, set, training, machine, efficient, better, processing, random, vector, note, smaller, equation] [satellite, approach, single, hybrid, cover, tae, handcrafted, geometric, international]
@InProceedings{Garnot_2020_CVPR,
  author = {Garnot, Vivien Sainte Fare and Landrieu, Loic and Giordano, Sebastien and Chehata, Nesrine},
  title = {Satellite Image Time Series Classification With Pixel-Set Encoders and Temporal Self-Attention},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DAVD-Net: Deep Audio-Aided Video Decompression of Talking Heads
Xi Zhang, Xiaolin Wu, Xinliang Zhai, Xianye Ben, Chengjie Tu


Close-up talking heads are among the most common and salient objects in video content, such as face-to-face conversations in social media, teleconferences, news broadcasting, talk shows, etc. Due to the high sensitivity of the human visual system to faces, compression distortions in talking-head videos are highly visible and annoying. To address this problem, we present a novel deep convolutional neural network (DCNN) method for very low bit rate video reconstruction of talking heads. The key innovation is a new DCNN architecture that can exploit the audio-video correlations to repair compression defects in the face region. We further improve reconstruction quality by embedding into our DCNN the encoder information of the video compression standards and introducing a constraining projection module in the network. Extensive experiments demonstrate that the proposed DCNN method outperforms the existing state-of-the-art methods on videos of talking heads.
[video, audio, talking, frame, prediction, attention, dataset, accompanying, extract, order, encoding, outperforms, speech, speaker, temporal, exploit] [feature, module, table, cnn, head, ablation] [face, quality, trained, facial, improve, generic, input] [compression, proposed, compressed, ieee, dct, restoration, spatial, block, obama, residual, decoded, transform, convolutional, existing, decompression, method, figure, artifact, fusion, coding, dkfn, mfqe, high, neighboring, signal, edvr, dcnn, removal, extraction] [image, encoder, code, common] [network, deep, neural, constraining, performance, arxiv, preprint, architecture, learning, design, processing, quantization, reduction, size, linear, training] [projection, conference, reconstruction, computer, international, vision, reconstructed]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Xi and Wu, Xiaolin and Zhai, Xinliang and Ben, Xianye and Tu, Chengjie},
  title = {DAVD-Net: Deep Audio-Aided Video Decompression of Talking Heads},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning When and Where to Zoom With Deep Reinforcement Learning
Burak Uzkent, Stefano Ermon


While high resolution images contain semantically more useful information than their lower resolution counterparts, processing them is computationally more expensive, and in some applications, e.g. remote sensing, they can be much more expensive to acquire. For these reasons, it is desirable to develop an automatic method to selectively use high resolution data when necessary while maintaining accuracy and reducing acquisition/run-time cost. In this direction, we propose PatchDrop, a reinforcement learning approach to dynamically identify when and where to use/acquire high resolution data conditioned on the paired, cheap, low resolution images. We conduct experiments on the CIFAR10, CIFAR100, ImageNet and fMoW datasets, where we use significantly less high resolution data while maintaining similar accuracy to models which use full high resolution images.
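A minimal sketch of the acquisition policy: a small network maps the low-resolution image to per-patch Bernoulli probabilities, patches are sampled, and the log-probabilities feed a REINFORCE update whose reward trades off accuracy against the number of acquired patches. The 4x4 patch grid, network sizes, and the placeholder reward are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Bernoulli

# Policy network: looks at the low-resolution image and outputs, for each of
# the 4x4 patches, the probability of acquiring its high-resolution version.
policy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                       nn.Linear(256, 16), nn.Sigmoid())

def sample_patches(low_res_images):
    probs = policy(low_res_images)            # (batch, 16) acquisition probabilities
    dist = Bernoulli(probs)
    actions = dist.sample()                   # 1 = acquire/keep the HR patch, 0 = drop it
    log_prob = dist.log_prob(actions).sum(1)  # needed for the REINFORCE update
    return actions, log_prob

low_res = torch.rand(8, 3, 32, 32)
actions, log_prob = sample_patches(low_res)
# REINFORCE: the reward mixes downstream classification accuracy with the cost
# of the acquired patches (a random placeholder here).
reward = torch.rand(8)
policy_loss = -(reward * log_prob).mean()
```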
[policy, agent, action, reinforcement, reward, recognition, step, bagnet, attention, represent] [cnn, table, remote, object, represents, map, hard] [drop, input, model, quality] [resolution, patchdrop, ieee, patch, high, pattern, method, low, proposed, fmow, spatial, figure, residual, convolutional, downsampling, cnns, zoom, adaptive] [image, perform, train, learn, loss, domain, learns, generate] [network, accuracy, number, learning, classifier, sampled, fcl, sample, training, imagenet, data, sampling, neural, test, size, deep, stochastic, set, performance, dropping, function, finetune, rate, learned, probability, large, class, finetuning, batch, average] [computer, conference, vision, jointly, full, satellite, european]
@InProceedings{Uzkent_2020_CVPR,
  author = {Uzkent, Burak and Ermon, Stefano},
  title = {Learning When and Where to Zoom With Deep Reinforcement Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-Domain Detection via Graph-Induced Prototype Alignment
Minghao Xu, Hang Wang, Bingbing Ni, Qi Tian, Wenjun Zhang


Applying the knowledge of an object detector trained on a specific domain directly onto a new domain is risky, as the gap between two domains can severely degrade the model's performance. Furthermore, since different instances commonly embody distinct modal information in the object detection scenario, feature alignment of the source and target domains is hard to realize. To mitigate these problems, we propose a Graph-induced Prototype Alignment (GPA) framework to seek category-level domain alignment via elaborate prototype representations. In a nutshell, more precise instance-level features are obtained through graph-based information propagation among region proposals, and, on this basis, the prototype representation of each class is derived for category-level domain alignment. In addition, to alleviate the negative effect of class imbalance on domain adaptation, we design a Class-reweighted Contrastive Loss to harmonize the adaptation training process. Combined with Faster R-CNN, the proposed framework conducts feature alignment in a two-stage manner. Comprehensive results on various cross-domain detection tasks demonstrate that our approach outperforms existing methods by a remarkable margin. Our code is available at https://github.com/ChrisAllenMing/GPA-detection.
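A minimal sketch of category-level prototype alignment (the core idea only, without the graph-based propagation or the class-reweighted contrastive term): per-class mean instance features are computed in each domain and matching prototypes are pulled together. The feature dimension and the MSE distance are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(source_feats, source_labels, target_feats, target_labels, n_classes):
    """Build one prototype (mean instance feature) per class in each domain and
    penalise the distance between matching prototypes."""
    loss, count = 0.0, 0
    for c in range(n_classes):
        src_mask, tgt_mask = source_labels == c, target_labels == c
        if src_mask.any() and tgt_mask.any():
            proto_s = source_feats[src_mask].mean(0)
            proto_t = target_feats[tgt_mask].mean(0)
            loss = loss + F.mse_loss(proto_s, proto_t)
            count += 1
    return loss / max(count, 1)

loss = prototype_alignment_loss(torch.randn(50, 256), torch.randint(0, 9, (50,)),
                                torch.randn(40, 256), torch.randint(0, 9, (40,)), n_classes=9)
```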
[graph, relation, dataset, adjacency] [region, detection, feature, object, faster, instance, proposal, rpn, category, table, propose, foreground, sim, car, ross, semantic, denotes, map] [adversarial, model, derived, trained, experimental] [ieee, pattern, figure, proposed, based, foggy, convolutional, commonly, method] [domain, alignment, adaptation, target, source, prototype, loss, unsupervised, gpa, corresponding, bingbing, discrepancy, image] [performance, learning, training, class, process, deep, matrix, parameter, contrastive, task, set, neural, weight, machine, better, higher, network, compared] [conference, computer, vision, international, approach, distance, camera]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Minghao and Wang, Hang and Ni, Bingbing and Tian, Qi and Zhang, Wenjun},
  title = {Cross-Domain Detection via Graph-Induced Prototype Alignment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Meta-Learning of Neural Architectures for Few-Shot Learning
Thomas Elsken, Benedikt Staffler, Jan Hendrik Metzen, Frank Hutter


The recent progress in neural architecture search (NAS) has allowed scaling the automated design of neural architectures to real-world domains, such as object detection and semantic segmentation. However, one prerequisite for the application of NAS is large amounts of labeled data and compute resources. This renders its application challenging in few-shot learning scenarios, where many related tasks need to be learned, each with limited amounts of data and compute time. Thus, few-shot learning is typically done with a fixed neural architecture. To improve upon this, we propose MetaNAS, the first method which fully integrates NAS with gradient-based meta-learning. MetaNAS optimizes a meta-architecture along with the meta-weights during meta-training. During meta-testing, architectures can be adapted to a novel task with a few steps of the task optimizer, that is: task adaptation becomes computationally cheap and requires only little data per task. Moreover, MetaNAS is agnostic in that it can be used with arbitrary model-agnostic meta-learning algorithms and arbitrary gradient-based NAS methods. Empirical results on standard few-shot classification benchmarks show that MetaNAS with a combination of DARTS and REPTILE yields state-of-the-art results.
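A minimal sketch of the REPTILE-style outer update that MetaNAS builds on: adapt a copy of the meta-model to a sampled task with a few optimizer steps, then move the meta-weights a fraction of the way towards the adapted weights. In MetaNAS the same idea also covers the DARTS architecture parameters; the toy model, task sampler, and learning rates below are illustrative assumptions:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

meta_model = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 5))
meta_lr, inner_lr, inner_steps = 0.1, 0.01, 5

def sample_task():
    # placeholder few-shot task: (support_x, support_y) for a 5-way problem
    return torch.randn(25, 28 * 28), torch.randint(0, 5, (25,))

for meta_step in range(100):
    task_model = copy.deepcopy(meta_model)                 # start from the meta-weights
    opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
    x, y = sample_task()
    for _ in range(inner_steps):                           # task adaptation (inner loop)
        opt.zero_grad()
        F.cross_entropy(task_model(x), y).backward()
        opt.step()
    # Reptile outer update: move meta-weights towards the adapted weights.
    with torch.no_grad():
        for p_meta, p_task in zip(meta_model.parameters(), task_model.parameters()):
            p_meta.add_(meta_lr * (p_task - p_meta))
```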
[work] [semantic] [input, model] [figure, proposed, method, prior, based, pattern, automated, combination, ieee] [loss, image, learn, adaptation, arbitrary] [architecture, learning, neural, task, eta, search, wmeta, meta, algorithm, training, reptile, dtrain, data, mixture, autometa, machine, set, network, fixed, performance, frank, large, standard, update, efficient, classification, space, metalearning, problem, setting, omniglot, deep, ptrain, adapted, requires, adapt, gradient, weight, operation, dtest, note, miniimagenet, quoc] [conference, international, single, compute, computer, vision, allows, thomas, differentiable, limited, novel, require, allow, well]
@InProceedings{Elsken_2020_CVPR,
  author = {Elsken, Thomas and Staffler, Benedikt and Metzen, Jan Hendrik and Hutter, Frank},
  title = {Meta-Learning of Neural Architectures for Few-Shot Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Inheritable Models for Open-Set Domain Adaptation
Jogendra Nath Kundu, Naveen Venkat, Ambareesh Revanur, Rahul M V, R. Venkatesh Babu


There has been tremendous progress in Domain Adaptation (DA) for visual recognition tasks. In particular, open-set DA, wherein the target domain contains additional unseen categories, has gained considerable attention. Existing open-set DA approaches demand access to a labeled source dataset along with unlabeled target instances. However, this reliance on co-existing source and target data is highly impractical in scenarios where data-sharing is restricted due to its proprietary nature or privacy concerns. Addressing this, we introduce a practical DA paradigm where a source-trained model is used to facilitate adaptation in the absence of the source dataset in the future. To this end, we formalize knowledge inheritability as a novel concept and propose a simple yet effective solution to realize inheritable models suitable for the above practical paradigm. Further, we present an objective way to quantify inheritability to enable the selection of the most suitable source model for a given target domain, even in the absence of the source data. We provide theoretical insights followed by a thorough empirical evaluation demonstrating state-of-the-art open-set domain adaptation performance.
[sta, dataset, work] [feature, confidence, paradigm, instance, region, plot, propose] [model, trained, effectively, adversarial, sensitivity, access, suitable] [high, method, figure, assumption, output] [target, source, domain, adaptation, inheritable, inheritability, shared, vendor, unsupervised, absence, percentile, transfer, learn, unknown, uda, ability, osbp, latent, alignment, uoda, openness, kate, image, train] [knowledge, deep, client, data, training, performance, predictor, negative, label, space, labeled, classifier, learning, class, accuracy, practical, unlabeled, task, ood, set, density, proxy, empirical, neural, distribution, probability, best, sample, higher, open, presence, measure] [avoid]
@InProceedings{Kundu_2020_CVPR,
  author = {Kundu, Jogendra Nath and Venkat, Naveen and Revanur, Ambareesh and V, Rahul M and Babu, R. Venkatesh},
  title = {Towards Inheritable Models for Open-Set Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning From Synthetic Animals
Jiteng Mu, Weichao Qiu, Gregory D. Hager, Alan L. Yuille


Despite great success in human parsing, progress for parsing other deformable articulated objects, like animals, is still limited by the lack of labeled data. In this paper, we use synthetic images and ground truth generated from CAD animal models to address this challenge. To bridge the domain gap between real and synthetic images, we propose a novel consistency-constrained semi-supervised learning method (CC-SSL). Our method leverages both spatial and temporal consistencies, to bootstrap weak models trained on synthetic data with unlabeled real images. We demonstrate the effectiveness of our method on highly deformable animals, such as horses and tigers. Without using any real image label, our method allows for accurate keypoint prediction on real images. Moreover, we quantitatively show that models using synthetic data achieve better generalization performance than models trained on real images across different domains in the Visual Domain Adaptation Challenge dataset. Our synthetic dataset contains 10+ animals with diverse poses and rich ground truth, which enables us to use the multi-task learning strategy to further boost models' performance.
[dataset, prediction, temporal, visual, rich, work, previous, built] [segmentation, parsing, propose, confidence, effectiveness, table, challenge] [model, trained, animal, datasets, generalization] [method, proposed, figure, transform] [synthetic, real, domain, image, adaptation, consistency, generation, target, invariance, generate, generated, unsupervised, horse, tiger, learn, source, diverse, proposes, texture, tigdog, painting, kpts, eric, address] [learning, data, training, better, accuracy, performance, algorithm, achieve, large, unlabeled, number, deep, problem, compared, equation, set, random] [pose, keypoints, ground, keypoint, cad, estimation, equivariance, human, truth, demonstrate, accurate, well, visible, michael, limited]
@InProceedings{Mu_2020_CVPR,
  author = {Mu, Jiteng and Qiu, Weichao and Hager, Gregory D. and Yuille, Alan L.},
  title = {Learning From Synthetic Animals},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Distilling Cross-Task Knowledge via Relationship Matching
Han-Jia Ye, Su Lu, De-Chuan Zhan


The discriminative knowledge from a high-capacity deep neural network (a.k.a. the "teacher") can be distilled to facilitate the learning efficacy of a shallow counterpart (a.k.a. the "student"). This paper deals with a general scenario of reusing the knowledge from a cross-task teacher --- the two models target non-overlapping label spaces. We emphasize that the comparison ability between instances acts as an essential factor threading knowledge across domains, and propose the RElationship FacIlitated Local cLassifiEr Distillation (ReFilled) approach, which decomposes the knowledge distillation flow into branches for the embedding and the top-layer classifier. In particular, different from reconciling the instance-label confidence between models, ReFilled requires the teacher to reweight the hard triplets pushed forward by the student so that the similarity comparison levels between instances are matched. A local embedding-induced classifier from the teacher further supervises the student's classification confidence. ReFilled demonstrates its effectiveness when reusing cross-task models, and also achieves state-of-the-art performance on the standard knowledge distillation benchmarks. The code of the paper can be accessed at https://github.com/njulus/ReFilled.
[embedding, relationship, current, embeddings, heterogeneous, three] [instance, table, feature, effectiveness, achieves, resnet] [model, trained, strong, help, verify] [based, comparison, figure] [ability, aligning, discriminative, target, train, learn, transfer, loss] [knowledge, teacher, student, learning, distillation, illed, classification, training, classifier, neural, deep, label, class, distill, reuse, triplet, reusing, data, network, performance, task, set, pijk, accuracy, width, similarity, multiplier, sampled, test, distilled, standard, experience, better, probability, distilling, soft, investigate, vanilla, rkd, number, evaluate, facilitated, dissimilar, stochastic, linear, larger, close] [local, approach, matching, relative, nearest, neighbor]
@InProceedings{Ye_2020_CVPR,
  author = {Ye, Han-Jia and Lu, Su and Zhan, De-Chuan},
  title = {Distilling Cross-Task Knowledge via Relationship Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Open Compound Domain Adaptation
Ziwei Liu, Zhongqi Miao, Xingang Pan, Xiaohang Zhan, Dahua Lin, Stella X. Yu, Boqing Gong


A typical domain adaptation approach is to adapt models trained on the annotated data in a source domain (e.g., sunny weather) for achieving high performance on the test data in a target domain (e.g., rainy weather). Whether the target contains a single homogeneous domain or multiple heterogeneous domains, existing works always assume that there exist clear distinctions between the domains, which is often not true in practice (e.g., changes in weather). We study an open compound domain adaptation (OCDA) problem, in which the target is a compound of multiple homogeneous domains without domain labels, reflecting realistic data collection from mixed and novel situations. We propose a new approach based on two technical insights into OCDA: 1) a curriculum domain adaptation strategy to bootstrap generalization across domains in a data-driven self-organizing fashion and 2) a memory module to increase the model's agility towards novel domains. Our experiments on digit classification, facial expression recognition, semantic segmentation, and reinforcement learning demonstrate the effectiveness of our approach.
[multiple, reinforcement, visual, dynamically, dataset] [semantic, table, feature, module, benchmark, propose, segmentation, effectiveness, instance] [input, generalization, model, adversarial, mnist, datasets, robustness] [figure, existing, comparison, clear, handling] [domain, adaptation, target, source, compound, curriculum, encoder, unsupervised, edomain, latent, usps, representation, enhancer, jan, boqing, transfer, mcd, eclass, disentanglement, trevor, kate, agility, synnum, ocda, discriminative, vdirect, ziwei, realistic] [open, class, learning, data, memory, network, performance, knowledge, test, deep, number, training, learned, set, labeled, classification, classifier, indicator, evaluate, adapt, neural, random] [approach, novel, assume, direct, single, homogeneous]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Ziwei and Miao, Zhongqi and Pan, Xingang and Zhan, Xiaohang and Lin, Dahua and Yu, Stella X. and Gong, Boqing},
  title = {Open Compound Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Context Prior for Scene Segmentation
Changqian Yu, Jingbo Wang, Changxin Gao, Gang Yu, Chunhua Shen, Nong Sang


Recent works have widely explored contextual dependencies to achieve more accurate segmentation results. However, most approaches rarely distinguish different types of contextual dependencies, which may pollute the scene understanding. In this work, we directly supervise the feature aggregation to distinguish the intra-class and inter-class context clearly. Specifically, we develop a Context Prior with the supervision of the Affinity Loss. Given an input image and corresponding ground truth, the Affinity Loss constructs an ideal affinity map to supervise the learning of the Context Prior. The learned Context Prior extracts the pixels belonging to the same category, while the reversed prior focuses on the pixels of different classes. Embedded into a conventional deep CNN, the proposed Context Prior Layer can selectively capture the intra-class and inter-class contextual dependencies, leading to robust feature representation. To validate the effectiveness, we design an effective Context Prior Network (CPNet). Extensive quantitative and qualitative evaluations demonstrate that the proposed model performs favorably against state-of-the-art semantic segmentation approaches. More specifically, our algorithm achieves 46.3% mIoU on ADE20K, 53.9% mIoU on PASCAL-Context, and 81.3% mIoU on Cityscapes. Code is available at https://git.io/ContextPrior.
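A minimal sketch of how the ideal affinity map that supervises the Context Prior can be constructed from the ground truth (the interpolation mode and shapes are illustrative assumptions): downsample the label map to the feature resolution, one-hot encode it, and mark every pair of positions sharing a class with 1:

```python
import torch
import torch.nn.functional as F

def ideal_affinity_map(gt_labels, feat_size, n_classes):
    """gt_labels: (batch, H, W) integer class labels.
    Returns (batch, h*w, h*w) with 1 where two positions share a class."""
    small = F.interpolate(gt_labels[:, None].float(), size=feat_size, mode="nearest")
    small = small.squeeze(1).long()                                   # (batch, h, w)
    one_hot = F.one_hot(small, n_classes).float()                     # (batch, h, w, C)
    flat = one_hot.view(one_hot.shape[0], -1, n_classes)              # (batch, h*w, C)
    return flat @ flat.transpose(1, 2)                                # 1 iff same class

A = ideal_affinity_map(torch.randint(0, 21, (2, 64, 64)), feat_size=(16, 16), n_classes=21)
# A binary cross-entropy between the predicted prior map and A then acts as the Affinity Loss.
```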
[context, recognition, attention, dataset, work, reason, explicit] [affinity, aggregation, map, contextual, semantic, module, segmentation, aggregate, miou, feature, table, fully, achieves, backbone, category, adopt, global, supervision, pyramid, gang, aspp, leading, pooling, effectiveness, pspnet] [model, cpnet, input, testing, improve, conduct] [prior, spatial, ieee, pattern, proposed, convolutional, convolution, figure, based, separable, method, performs, favorably, pixel, comparison, supervise] [loss, image, generate, cross, learn] [network, layer, size, set, learning, ideal, validation, filter, neural, deep, training, performance, binary, entropy, large] [vision, conference, computer, scene, ground, capture, truth, international, demonstrate]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Changqian and Wang, Jingbo and Gao, Changxin and Yu, Gang and Shen, Chunhua and Sang, Nong},
  title = {Context Prior for Scene Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Tangent Images for Mitigating Spherical Distortion
Marc Eder, Mykhailo Shvets, John Lim, Jan-Michael Frahm


In this work, we propose "tangent images," a spherical image representation that facilitates transferable and scalable 360 degree computer vision. Inspired by techniques in cartography and computer graphics, we render a spherical image to a set of distortion-mitigated, locally-planar image grids tangent to a subdivided icosahedron. By varying the resolution of these grids independently of the subdivision level, we can effectively represent high resolution spherical images while still benefiting from the low-distortion icosahedral spherical approximation. We show that training standard convolutional neural networks on tangent images compares favorably to the many specialized spherical convolutional kernels that have been developed, while also scaling efficiently to handle significantly higher spherical resolutions. Furthermore, because our approach does not require specialized kernels, we show that we can transfer networks trained on perspective images to spherical data without fine-tuning and with limited performance drop-off. Finally, we demonstrate that tangent images can be used to improve the quality of sparse feature detection on spherical images, illustrating its usefulness for traditional computer vision tasks like structure-from-motion and SLAM.
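The rendering step can be sketched with the inverse gnomonic projection: each tangent-plane patch is filled by looking up spherical coordinates in the equirectangular input. The snippet below is a hedged illustration (nearest-neighbour sampling, a single tangent point rather than the full set of faces of a subdivided icosahedron); function and parameter names are ours, not the authors'.

```python
import numpy as np

def tangent_patch(erp, lat0, lon0, fov, res):
    """Render one tangent-plane patch from an equirectangular image via the
    inverse gnomonic projection (nearest-neighbour sampling for brevity).

    erp:        (H, W, 3) equirectangular image
    lat0, lon0: tangent point in radians
    fov:        half-extent of the patch on the tangent plane (tan of the angular radius)
    res:        output patch resolution
    """
    H, W = erp.shape[:2]
    x, y = np.meshgrid(np.linspace(-fov, fov, res), np.linspace(-fov, fov, res))
    rho = np.sqrt(x ** 2 + y ** 2)
    c = np.arctan(rho)
    sin_c, cos_c = np.sin(c), np.cos(c)
    # Inverse gnomonic projection: plane coordinates -> latitude / longitude
    lat = np.arcsin(np.clip(cos_c * np.sin(lat0)
                            + y * sin_c * np.cos(lat0) / np.maximum(rho, 1e-12), -1.0, 1.0))
    lon = lon0 + np.arctan2(x * sin_c, rho * np.cos(lat0) * cos_c - y * np.sin(lat0) * sin_c)
    lat = np.where(rho < 1e-12, lat0, lat)
    lon = np.where(rho < 1e-12, lon0, lon)
    # Sample the equirectangular image: u in [0, W), v in [0, H)
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((0.5 - lat / np.pi) * H).astype(int), 0, H - 1)
    return erp[v, u]

# a patch tangent at the equator / prime meridian of a dummy spherical image
patch = tangent_patch(np.zeros((512, 1024, 3), dtype=np.uint8),
                      0.0, 0.0, fov=np.tan(np.pi / 5), res=128)
```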
[dataset, specialized, work, represent] [level, subdivision, segmentation, semantic, table, cnn, overlap, jiang] [distortion, trained, input, model, transferability] [resolution, method, convolution, equirectangular, prior, pixel, figure, convolutional, high, zhang, pattern, proposed, cnns, ieee, traditional, existing, scale, kernel, cohen] [image, representation, transfer, address, train] [performance, network, angular, data, set, training, higher, base, evaluate, test, learning, scaling, number, classification, standard, deep, accuracy, efficient, note, experiment] [spherical, tangent, perspective, computer, conference, icosahedral, vision, icosahedron, limited, fov, keypoints, provided, compute, subdivided, demonstrate, planar, rendering, rgb, approach, sphere, visible, camera, sparse, keypoint, surface, left, international, require]
@InProceedings{Eder_2020_CVPR,
  author = {Eder, Marc and Shvets, Mykhailo and Lim, John and Frahm, Jan-Michael},
  title = {Tangent Images for Mitigating Spherical Distortion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning a Dynamic Map of Visual Appearance
Tawfiq Salem, Scott Workman, Nathan Jacobs


The appearance of the world varies dramatically not only from place to place but also from hour to hour and month to month. Every day billions of images capture this complex relationship, many of which are associated with precise time and location metadata. We propose to use these images to construct a global-scale, dynamic map of visual appearance attributes. Such a map enables fine-grained understanding of the expected appearance at any geographic location and time. Our approach integrates dense overhead imagery with location and time metadata into a general framework capable of mapping a wide variety of visual attributes. A key feature of our approach is that it requires no manual data annotation. We demonstrate how this approach can support various applications, including image-driven mapping, image geolocalization, and metadata verification.
[visual, overhead, time, nathan, dataset, context, scott, geographic, work, understanding, metadata, extract, predict, associated, day, construct, cvt, temporal, conditioning, tawfiq, previous, relationship, predicting] [location, map, imagery, feature, table, including, focus, highest, remote] [model, hour, trained, james] [ieee, dynamic, figure, pattern, captured, spatial, winter, convolutional, range, analysis, method] [image, attribute, mapping, appearance, corresponding, learn] [accuracy, learning, distribution, neural, set, network, training, test, support, deep, wide, problem, higher, expected, requires] [conference, computer, transient, international, approach, vision, distance, full, combine, scene, acm, capture, variety, directly, outdoor, estimator, compute, david, demonstrate]
@InProceedings{Salem_2020_CVPR,
  author = {Salem, Tawfiq and Workman, Scott and Jacobs, Nathan},
  title = {Learning a Dynamic Map of Visual Appearance},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Webly Supervised Knowledge Embedding Model for Visual Reasoning
Wenbo Zheng, Lan Yan, Chao Gou, Fei-Yue Wang


Visual reasoning between a visual image and a natural language description is a long-standing challenge in computer vision. While recent approaches offer great promise through compositionality or relational computing, most of them are hampered by the difficulty of training with datasets that contain only a limited number of images with ground-truth texts. Moreover, building a larger dataset by annotating millions of images with text descriptions is extremely time-consuming and is likely to yield a biased model. Inspired by the broad success of webly supervised learning, we utilize readily available web images with their noisy annotations to learn a robust representation. Our key idea is to draw on web images and their corresponding tags, together with fully annotated datasets, for learning with knowledge embedding. We present a two-stage approach that augments knowledge through an effective embedding model trained with weakly supervised web data. This approach not only learns knowledge-based embeddings derived from key-value memory networks to make joint and full use of textual and visual information, but also exploits the knowledge to improve performance with knowledge-based representation learning for other general reasoning tasks. Experimental results on two benchmarks show that the proposed approach significantly improves performance over state-of-the-art methods and remains robust on visual reasoning and other reasoning tasks.
[visual, question, reasoning, embedding, dataset, modulation, clevr, language, text, nlvr, recognition, work, relational, embeddings, answer, natural, answering, build, compositional] [web, stage, webly, annotated, association, module, object, semantic, table, fully, weakly, thing, propose, final, ablation] [model, datasets, robust, clean, effective, trained, query] [ieee, noisy, figure, pattern, june, proposed, color, relu, advantage, conv, comparison] [representation, image, supervised, learn, film, attribute] [knowledge, learning, network, memory, training, mutual, performance, neural, number, computational, large, better, data, update, design, task, size, learned] [conference, computer, approach, vision, international, compare, joint, shape]
@InProceedings{Zheng_2020_CVPR,
  author = {Zheng, Wenbo and Yan, Lan and Gou, Chao and Wang, Fei-Yue},
  title = {Webly Supervised Knowledge Embedding Model for Visual Reasoning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Gradually Vanishing Bridge for Adversarial Domain Adaptation
Shuhao Cui, Shuhui Wang, Junbao Zhuo, Chi Su, Qingming Huang, Qi Tian


In unsupervised domain adaptation, rich domain-specific characteristics make it challenging to learn domain-invariant representations. Existing solutions typically attempt to minimize the domain discrepancy directly, which is difficult to achieve in practice. Some methods alleviate the difficulty by explicitly modeling domain-invariant and domain-specific parts of the representations, but the downside of this explicit construction is that residual domain-specific characteristics remain in the constructed domain-invariant representations. In this paper, we equip adversarial domain adaptation with a Gradually Vanishing Bridge (GVB) mechanism on both the generator and the discriminator. On the generator, GVB not only reduces the overall transfer difficulty but also reduces the influence of residual domain-specific characteristics in the domain-invariant representations. On the discriminator, GVB enhances the discriminating ability and balances the adversarial training process. Experiments on three challenging datasets show that our GVB methods outperform strong competitors and cooperate well with other adversarial methods. The code is available at https://github.com/cuishuhao/GVB.
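A rough sketch of the generator-side bridge idea under our reading of the abstract: a bridge branch models residual domain-specific responses, the classifier output is used with the bridge subtracted, and the bridge magnitude is penalised so that it gradually vanishes during training. Names and the exact formulation are illustrative, not the authors' code, and the discriminator-side bridge is omitted.

```python
import torch
import torch.nn as nn

class GVBGenerator(nn.Module):
    """Generator with a bridge branch: the transferable output is the
    classifier response minus the bridge, and the bridge's magnitude is
    penalised so that it 'gradually vanishes' (a minimal, assumed sketch)."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.bridge = nn.Linear(feat_dim, num_classes)

    def forward(self, feat):
        logits = self.classifier(feat)
        bridge = self.bridge(feat)
        return logits - bridge, bridge

model = GVBGenerator(256, 31)
feat = torch.randn(8, 256)
out, bridge = model(feat)
# classification loss on the bridged output + penalty shrinking the bridge
loss = nn.functional.cross_entropy(out, torch.randint(0, 31, (8,))) + 0.1 * bridge.abs().mean()
```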
[rich, mechanism, constructed, explicitly, three, multiple, outperforms] [table, key, challenging, framework, china, apply] [adversarial, influence, input, datasets, game] [ieee, figure, pattern, range, intermediate, method, existing, proposed, residual, result] [domain, bridge, adaptation, gvb, discriminator, generator, target, source, unsupervised, discrepancy, transfer, gradually, cdan, discriminating, image, symnets, generative, ability, loss, alignment, minmax, minimizing, kate, shuhui, qingming] [deep, training, learning, network, classifier, baseline, reduce, classification, data, large, neural, function, better, applied, layer, achieve, balanced, machine, balance, denoted, higher] [conference, computer, vision, international, vanishing, directly, reconstruction, distance]
@InProceedings{Cui_2020_CVPR,
  author = {Cui, Shuhao and Wang, Shuhui and Zhuo, Junbao and Su, Chi and Huang, Qingming and Tian, Qi},
  title = {Gradually Vanishing Bridge for Adversarial Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Active Speakers in Context
Juan Leon Alcazar, Fabian Caba, Long Mai, Federico Perazzi, Joon-Young Lee, Pablo Arbelaez, Bernard Ghanem


Current methods for active speaker detection focus on modeling audiovisual information from a single speaker. This strategy can be adequate for addressing single-speaker scenarios, but it prevents accurate detection when the task is to identify which of many candidate speakers is talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our new model learns pairwise and temporal relations from a structured ensemble of audiovisual observations. Our experiments show that a structured feature ensemble already benefits active speaker detection performance. We also find that the proposed Active Speaker Context improves the state-of-the-art on the AVA-ActiveSpeaker dataset, achieving an mAP of 87.1%. Moreover, ablation studies verify that this result is a direct consequence of our long-term multi-speaker analysis.
[speaker, context, temporal, time, audiovisual, multiple, video, audio, dataset, three, visual, speech, attention, long, work, embedding, asc, modeling, short, joon, son, chung, lstm, clip, outperforms, prediction] [refinement, detection, map, table, feature, final, improves, contextual, challenge, score, ablation, challenging] [face, model, input, ensemble, analyzing] [reference, method, figure, analysis, tensor, window, proposed] [representation, learns, loss] [active, performance, pairwise, baseline, number, sampling, learning, observe, training, andrew, size, validation, set, strategy, task, candidate, sampled, average, small, arxiv, preprint, randomly, sample] [single, approach, core, international, accurate, structure]
@InProceedings{Alcazar_2020_CVPR,
  author = {Alcazar, Juan Leon and Caba, Fabian and Mai, Long and Perazzi, Federico and Lee, Joon-Young and Arbelaez, Pablo and Ghanem, Bernard},
  title = {Active Speakers in Context},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation
Bowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh Chen


In this work, we introduce Panoptic-DeepLab, a simple, strong, and fast system for panoptic segmentation, aiming to establish a solid baseline for bottom-up methods that can achieve performance comparable to two-stage methods while yielding fast inference speed. In particular, Panoptic-DeepLab adopts dual-ASPP and dual-decoder structures specific to semantic and instance segmentation, respectively. The semantic segmentation branch follows the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression. As a result, our single Panoptic-DeepLab simultaneously ranks first on all three Cityscapes benchmarks, setting a new state of the art of 84.2% mIoU, 39.0% AP, and 65.5% PQ on the test set. Additionally, equipped with MobileNetV3, Panoptic-DeepLab runs nearly in real time on a single 1025x2049 image (15.8 frames per second), while achieving competitive performance on Cityscapes (54.1% PQ on the test set). On the Mapillary Vistas test set, our ensemble of six models attains 42.7% PQ, outperforming the 2018 challenge winner by a healthy margin of 1.5%. Finally, our Panoptic-DeepLab also performs on par with several top-down approaches on the challenging COCO dataset. For the first time, we demonstrate that a bottom-up approach can deliver state-of-the-art results on panoptic segmentation.
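The class-agnostic instance branch can be illustrated by its grouping step: every "thing" pixel predicts an offset to its instance centre, and pixels are assigned to the nearest candidate centre kept from the centre heatmap. A minimal sketch in PyTorch, with names of ours rather than the authors' code:

```python
import torch

def group_pixels(centers, offsets):
    """Assign every pixel to its nearest predicted instance centre, as in the
    class-agnostic grouping step of a bottom-up panoptic pipeline.

    centers: (K, 2) centre coordinates (y, x) kept after NMS on the centre heatmap
    offsets: (2, H, W) predicted offset from each pixel to its instance centre
    Returns an (H, W) map of instance ids in [1, K].
    """
    H, W = offsets.shape[1:]
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([ys, xs]).float()                 # (2, H, W)
    voted = coords + offsets                               # where each pixel votes its centre is
    # distance of every pixel's vote to every candidate centre: (K, H, W)
    d = torch.linalg.norm(voted.unsqueeze(0) - centers.view(-1, 2, 1, 1), dim=1)
    return d.argmin(dim=0) + 1

centers = torch.tensor([[20.0, 30.0], [80.0, 90.0]])
offsets = torch.zeros(2, 128, 128)
instance_ids = group_pixels(centers, offsets)
```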
[prediction, three, time] [segmentation, instance, semantic, panoptic, coco, mapillary, object, center, deeperlab, predicted, val, upsnet, extra, ross, mask, offset, atrous, stride, table, branch, tascnet, ssap, backbone, feature, grouping, adopt, achieves, head, adaptis, miou, kaiming, piotr, alexander, challenge, final, pyramid, heatmap, seamless] [model] [method, fast, convolution, pixel, scale, proposed, convolutional, output] [image, loss, generate, cross] [set, inference, best, simple, test, network, learning, training, deep, performance, size, report, large, baseline, hartwig, efficient, batch, data] [single, thomas, well]
@InProceedings{Cheng_2020_CVPR,
  author = {Cheng, Bowen and Collins, Maxwell D. and Zhu, Yukun and Liu, Ting and Huang, Thomas S. and Adam, Hartwig and Chen, Liang-Chieh},
  title = {Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Inter-Region Affinity Distillation for Road Marking Segmentation
Yuenan Hou, Zheng Ma, Chunxiao Liu, Tak-Wai Hui, Chen Change Loy


We study the problem of distilling knowledge from a large deep teacher network to a much smaller student network for the task of road marking segmentation. In this work, we explore a novel knowledge distillation (KD) approach that can transfer 'knowledge' on scene structure more effectively from a teacher to a student model. Our method is known as Inter-Region Affinity KD (IntRA-KD). It decomposes a given road scene image into different regions and represents each region as a node in a graph. An inter-region affinity graph is then formed by establishing pairwise relationships between nodes based on their similarity in feature distribution. To learn structural knowledge from the teacher network, the student is required to match the graph generated by the teacher. The proposed method shows promising results on three large-scale road marking segmentation benchmarks, i.e., ApolloScape, CULane and LLAMAS, by taking various lightweight models as students and ResNet-101 as the teacher. IntRA-KD consistently brings higher performance gains on all lightweight models, compared to previous distillation methods. Our code is available at https://github.com/cardwing/Codes-for-IntRA-KD.
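A simplified sketch of the inter-region affinity matching: features are pooled per region into graph nodes (mean pooling here stands in for the paper's moment pooling), pairwise cosine affinities form the graph, and the student's graph is matched to the teacher's. Names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def region_affinity(features, region_mask, num_regions):
    """Mean-pool features per region and return the pairwise cosine affinity
    matrix between region nodes."""
    feats = features.flatten(2)                                   # (B, C, HW)
    mask = F.one_hot(region_mask.flatten(1), num_regions).float() # (B, HW, R)
    nodes = torch.bmm(feats, mask) / (mask.sum(1, keepdim=True) + 1e-6)  # (B, C, R)
    nodes = F.normalize(nodes, dim=1)
    return torch.bmm(nodes.transpose(1, 2), nodes)                # (B, R, R)

def intra_kd_loss(student_feat, teacher_feat, region_mask, num_regions):
    """Match the student's inter-region affinity graph to the teacher's."""
    a_s = region_affinity(student_feat, region_mask, num_regions)
    a_t = region_affinity(teacher_feat, region_mask, num_regions)
    return F.mse_loss(a_s, a_t)

student_feat = torch.randn(2, 64, 36, 100)
teacher_feat = torch.randn(2, 64, 36, 100)
region_mask = torch.randint(0, 5, (2, 36, 100))   # e.g. lane classes + background
loss = intra_kd_loss(student_feat, teacher_feat, region_mask, 5)
```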
[road, moment, graph, attention, three, lane, relationship, latexit, work, extract, previous, dataset] [affinity, feature, map, marking, segmentation, erfnet, aoi, culane, pooling, apolloscape, denotes, enet, bifpn, table, represents, brings, challenging, semantic, miou] [model, input, effective, testing] [ieee, figure, pattern, spatial, proposed, lightweight, block, method, chen] [loss, structural, image, transfer] [distillation, knowledge, student, teacher, performance, similarity, learning, deep, network, class, set, distribution, number, label, small, compared, size, note, operation, large, training, mimic] [conference, computer, scene, international, vision, structure, limited]
@InProceedings{Hou_2020_CVPR,
  author = {Hou, Yuenan and Ma, Zheng and Liu, Chunxiao and Hui, Tak-Wai and Loy, Chen Change},
  title = {Inter-Region Affinity Distillation for Road Marking Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unified Dynamic Convolutional Network for Super-Resolution With Variational Degradations
Yu-Syuan Xu, Shou-Yao Roy Tseng, Yu Tseng, Hsien-Kai Kuo, Yi-Min Tsai


Deep Convolutional Neural Networks (CNNs) have achieved remarkable results on Single Image Super-Resolution (SISR). Although most works consider only a single degradation, recent studies also include multiple degrading effects to better reflect real-world cases. However, most of these works assume a fixed combination of degrading effects, or even train an individual network for each combination. Instead, a more practical approach is to train a single network for wide-ranging and variational degradations. To fulfill this requirement, this paper proposes a unified network to accommodate both inter-image (cross-image) and intra-image (spatial) variations. Different from existing works, we incorporate dynamic convolution, which is a far more flexible alternative for handling different variations. In the non-blind SISR setting, our Unified Dynamic Convolutional Network for Variational Degradations (UDVD) is evaluated on both synthetic and real images with an extensive set of variations. The qualitative results demonstrate the effectiveness of UDVD over various existing works. Extensive experiments show that our UDVD achieves favorable or comparable performance on both synthetic and real images.
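The key ingredient is dynamic convolution, where kernels are predicted on the fly from the input rather than fixed after training. Below is a hedged per-image, depthwise variant in PyTorch (the paper also uses per-pixel kernels and multiple stages; names are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv(nn.Module):
    """Per-image dynamic convolution: a small head predicts one depthwise
    k x k kernel per channel for every sample in the batch (a simplified,
    per-image stand-in for the paper's dynamic blocks)."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.channels, self.k = channels, k
        self.kernel_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels * k * k, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        kernels = self.kernel_head(x).view(b * c, 1, self.k, self.k)
        # grouped conv applies each sample's kernels to its own channels
        out = F.conv2d(x.view(1, b * c, h, w), kernels, padding=self.k // 2, groups=b * c)
        return out.view(b, c, h, w)

layer = DynamicConv(32)
y = layer(torch.randn(4, 32, 48, 48))   # (4, 32, 48, 48)
```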
[recognition, multiple] [table, feature, level, unified, achieves, refinement] [noise, input, trained, quality, deal] [dynamic, udvd, convolution, kernel, srmd, ieee, pattern, proposed, rcan, degradation, convolutional, block, degrading, spatial, residual, blur, scale, sisr, sftmd, gaussian, psnr, ircnn, bicubicly, zssr, rdn, multistage, extraction, upsampling, superresolution, figure, blind, upsample, conv, output, existing] [image, variational, real, loss, synthetic, factor, generate, train, qualitative, generated, extensive, favorable] [network, fixed, width, deep, size, set, performance, paper, problem, note, operation, neural, better, comparable, best, rate, variant] [computer, vision, single, handle]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Yu-Syuan and Tseng, Shou-Yao Roy and Tseng, Yu and Kuo, Hsien-Kai and Tsai, Yi-Min},
  title = {Unified Dynamic Convolutional Network for Super-Resolution With Variational Degradations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Making Better Mistakes: Leveraging Class Hierarchies With Deep Networks
Luca Bertinetto, Romain Mueller, Konstantinos Tertikas, Sina Samangooei, Nicholas A. Lord


Deep neural networks have improved image classification dramatically over the past decade, but have done so by focusing on performance measures that treat all classes other than the ground truth as equally wrong. This has led to a situation in which mistakes are less likely to be made than before, but are equally likely to be absurd or catastrophic when they do occur. Past works have recognised and tried to address this issue of mistake severity, often by using graph distances in class hierarchies, but this has largely been neglected since the advent of the current deep learning era in computer vision. In this paper, we aim to renew interest in this problem by reviewing past approaches and proposing two simple methods which outperform the prior art under several metrics on two large datasets with complex class hierarchies: tieredImageNet and iNaturalist'19.
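One of the two proposed methods can be summarised as training against soft labels that decay with hierarchical distance to the ground-truth class, so that nearby mistakes in the hierarchy are penalised less. A small sketch under that reading; the distance matrix, beta, and all names are placeholders (the paper also proposes a hierarchical cross-entropy, HXE, not shown here).

```python
import torch
import torch.nn.functional as F

def soft_hierarchy_labels(dist_matrix, targets, beta=10.0):
    """Soft labels that decay exponentially with hierarchical distance to the
    ground-truth class.

    dist_matrix: (C, C) pairwise class distances, e.g. derived from the height
                 of the lowest common ancestor in a WordNet-style hierarchy
    targets:     (B,) ground-truth class indices
    """
    logits = -beta * dist_matrix[targets]          # (B, C)
    return F.softmax(logits, dim=1)

def soft_label_loss(pred_logits, dist_matrix, targets, beta=10.0):
    soft = soft_hierarchy_labels(dist_matrix, targets, beta)
    return -(soft * F.log_softmax(pred_logits, dim=1)).sum(dim=1).mean()

# toy hierarchy distance matrix for 4 classes
D = torch.tensor([[0., 1., 2., 2.],
                  [1., 0., 2., 2.],
                  [2., 2., 0., 1.],
                  [2., 2., 1., 0.]])
loss = soft_label_loss(torch.randn(8, 4), D, torch.randint(0, 4, (8,)))
```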
[hierarchical, visual, embedding, dataset, node, making, incorporate, represent, embedded] [height, semantic, framework, interest, object, predicted, extent] [severity, case, example] [ieee, output, pattern, tree, method, prior, figure, based, simply] [loss, image, common, conditional, taxonomy, train] [class, hierarchy, learning, neural, soft, deep, classification, function, classifier, better, wordnet, standard, label, distribution, network, hxe, mistake, large, set, performance, simple, test, problem, barz, note, best, processing, imagenet, modern, denzler, architecture, lca, measure, consider, redmon, respective, linear, higher] [conference, computer, distance, error, international, ground, vision, truth, closely]
@InProceedings{Bertinetto_2020_CVPR,
  author = {Bertinetto, Luca and Mueller, Romain and Tertikas, Konstantinos and Samangooei, Sina and Lord, Nicholas A.},
  title = {Making Better Mistakes: Leveraging Class Hierarchies With Deep Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Data-Free Knowledge Amalgamation via Group-Stack Dual-GAN
Jingwen Ye, Yixin Ji, Xinchao Wang, Xin Gao, Mingli Song


Recent advances in deep learning have provided procedures for learning one network to amalgamate multiple streams of knowledge from pre-trained Convolutional Neural Network (CNN) models, thus reducing the annotation cost. However, almost all existing methods demand massive training data, which may be unavailable due to privacy or transmission issues. In this paper, we propose a data-free knowledge amalgamation strategy to craft a well-behaved multi-task student network from multiple single/multi-task teachers. The main idea is to construct group-stack generative adversarial networks (GANs) with two dual generators. First, one generator is trained to collect knowledge by reconstructing images that approximate the original dataset used to pre-train the teachers. Then a dual generator is trained by taking the output of the former generator as input. Finally, we treat the dual-part generator as the target network and regroup it. As demonstrated on several multi-label classification benchmarks, the proposed method, without any training data, achieves surprisingly competitive results, even compared with some fully supervised methods.
[multiple, dataset, work, artificial, hierarchical] [final, table, framework, feature, propose, detection] [trained, original, customized, adversarial, model, input, noise, effective] [method, proposed, dual, block, ieee, filtering, output, pattern, intermediate, convolutional, figure] [generator, generated, loss, gan, targetnet, real, fgan, train, image, ycst, igan, corresponding, discriminator, learn, fub, amalgamation, amalgamate, target, amalgamated, dafl] [training, knowledge, learning, network, deep, neural, data, label, set, teacher, student, group, classification, task, architecture, fin, compared, function, number, learned, random, unlabeled, processing, machine] [conference, computer, vision, international, human, approach, single, well]
@InProceedings{Ye_2020_CVPR,
  author = {Ye, Jingwen and Ji, Yixin and Wang, Xinchao and Gao, Xin and Song, Mingli},
  title = {Data-Free Knowledge Amalgamation via Group-Stack Dual-GAN},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Screencast Tutorial Video Understanding
Kunpeng Li, Chen Fang, Zhaowen Wang, Seokhwan Kim, Hailin Jin, Yun Fu


Screencast tutorials are videos created by people to teach how to use software applications or to demonstrate procedures for accomplishing tasks. They are very popular among both novice and experienced users for learning new skills, compared to other tutorial media such as text, thanks to their visual guidance and ease of understanding. In this paper, we propose visual understanding of screencast tutorials as a new research problem for the computer vision community. We collect a new dataset of Adobe Photoshop video tutorials and annotate it with both low-level and high-level semantic labels. We introduce a bottom-up pipeline to understand Photoshop video tutorials. We leverage state-of-the-art object detection algorithms with domain-specific visual cues to detect important events in a video tutorial and segment it into clips according to the detected events. We propose a visual cue reasoning algorithm for two high-level tasks: video retrieval and video captioning. We conduct extensive evaluations of the proposed pipeline. Experimental results show that it is effective in understanding video tutorials. We believe our work will serve as a starting point for future research on this important application domain of video understanding.
[video, visual, tutorial, cue, screencast, understanding, dataset, software, photoshop, retrieval, temporal, text, reasoning, clip, pstuts, captioning, evaluation, frame, build, recognition, attention, sequence, understand, work, description, step, described, microsoft, order, word] [segmentation, detection, propose, annotation, feature, table, segment, detected, object, focus, including] [model, datasets, collect, collected, change, case] [figure, method, existing, proposed, based] [tool, adobe, user, representation, content, corresponding, learn, image] [data, learning, set, selected, general, design, problem, small, training, algorithm, machine, entire, large] [pipeline, well, computer, human, vision, capture]
@InProceedings{Li_2020_CVPR,
  author = {Li, Kunpeng and Fang, Chen and Wang, Zhaowen and Kim, Seokhwan and Jin, Hailin and Fu, Yun},
  title = {Screencast Tutorial Video Understanding},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DSGN: Deep Stereo Geometry Network for 3D Object Detection
Yilun Chen, Shu Liu, Xiaoyong Shen, Jiaya Jia


Most state-of-the-art 3D object detectors rely heavily on LiDAR sensors and there remains a large gap in terms of performance between image-based and LiDAR-based methods, caused by inappropriate representation for the prediction in 3D scenarios. Our method, called Deep Stereo Geometry Network (DSGN), reduces this gap significantly by detecting 3D objects on a differentiable volumetric representation -- 3D geometric volume, which effectively encodes 3D geometric structure for 3D regular space. With this representation, we learn depth information and semantic cues simultaneously. For the first time, we provide a simple and effective one-stage stereo-based 3D detection pipeline that jointly estimates the depth and detects 3D objects in an end-to-end learning manner. Our approach outperforms previous stereo-based 3D detectors (about 10 higher in terms of AP) and even achieves comparable performance with a few LiDAR-based methods on the KITTI 3D object detection leaderboard. Code will be made publicly available at https://github.com/chenyilun95/DSGN.
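The starting point of such differentiable stereo pipelines is a plane-sweep cost volume over candidate disparities; DSGN then lifts such a volume into a 3D geometric volume in camera space. Below is a generic, PSMNet-style volume construction for illustration, not the authors' exact code.

```python
import torch

def build_stereo_volume(left_feat, right_feat, max_disp):
    """Concatenation-style plane-sweep volume over candidate disparities.

    left_feat, right_feat: (B, C, H, W) image features
    Returns a (B, 2C, max_disp, H, W) cost volume.
    """
    B, C, H, W = left_feat.shape
    volume = left_feat.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = left_feat
            volume[:, C:, d] = right_feat
        else:
            # shift the right features by the candidate disparity d
            volume[:, :C, d, :, d:] = left_feat[..., d:]
            volume[:, C:, d, :, d:] = right_feat[..., :-d]
    return volume

vol = build_stereo_volume(torch.randn(1, 32, 40, 96), torch.randn(1, 32, 40, 96), max_disp=24)
```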
[construct, pair, regular, prediction, work] [object, detection, feature, hard, autonomous, lidar, semantic, map, easy, box, moderate, table, module, detector, main, regression, ablation, key] [eye, hourglass, model, effective, study] [based, conv, disparity, intermediate, psmnet, comparison, figure, transform, binocular, pixel] [image, representation, loss, learn] [network, learning, deep, space, training, size, data, number, performance, constructing] [volume, stereo, depth, geometric, point, cost, matching, kitti, geometry, camera, monocular, dsgn, cloud, correspondence, approach, transformation, constraint, voxel, frustum, left, projection, pipeline, estimation, scene, jointly, directly, view, grid, plane, structure]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Yilun and Liu, Shu and Shen, Xiaoyong and Jia, Jiaya},
  title = {DSGN: Deep Stereo Geometry Network for 3D Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Weakly-Supervised Salient Object Detection via Scribble Annotations
Jing Zhang, Xin Yu, Aixuan Li, Peipei Song, Bowen Liu, Yuchao Dai


Compared with laborious pixel-wise dense labeling, it is much easier to label data by scribbles, which costs only 1-2 seconds per image. However, using scribble labels to learn salient object detection has not been explored. In this paper, we propose a weakly-supervised salient object detection model to learn saliency from such annotations. In doing so, we first relabel an existing large-scale salient object detection dataset with scribbles, namely the S-DUTS dataset. Since object structure and detail information are not identified by scribbles, directly training with scribble labels will lead to saliency maps of poor boundary localization. To mitigate this problem, we propose an auxiliary edge detection task to localize object edges explicitly, and a gated structure-aware loss to place constraints on the scope of structure to be recovered. Moreover, we design a scribble boosting scheme to iteratively consolidate our scribble annotations, which are then employed as supervision to learn high-quality saliency maps. As existing saliency evaluation metrics neglect to measure structure alignment of the predictions, the saliency map ranking may not comply with human perception. We present a new metric, termed saliency structure measure, as a complementary metric to evaluate the sharpness of the prediction. Extensive experiments on six benchmark datasets demonstrate that our method not only outperforms existing weakly-supervised/unsupervised methods, but also is on par with several fully-supervised state-of-the-art models (Our code and data are publicly available at: https://github.com/JingZhang617/Scribble_Saliency).
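Training from scribbles hinges on supervising only the annotated pixels. A minimal partial cross-entropy term, one ingredient alongside the auxiliary edge task and the gated structure-aware loss, could look like the sketch below; the label convention (1/0 scribbles, 255 unlabeled) is our assumption.

```python
import torch
import torch.nn.functional as F

def partial_bce(pred_logits, scribble):
    """Partial cross-entropy: supervise only the scribbled pixels and ignore
    the (vast) unlabeled region.

    pred_logits: (B, 1, H, W) saliency logits
    scribble:    (B, H, W) with 1 = foreground scribble, 0 = background
                 scribble, 255 = unlabeled
    """
    labeled = scribble != 255
    if labeled.sum() == 0:
        return pred_logits.sum() * 0.0
    target = (scribble == 1).float()
    loss = F.binary_cross_entropy_with_logits(
        pred_logits.squeeze(1), target, reduction="none")
    return loss[labeled].mean()

scribble = torch.full((2, 64, 64), 255)
scribble[:, 10:12, 20:40] = 1     # a foreground scribble
scribble[:, 50:52, 5:30] = 0      # a background scribble
loss = partial_bce(torch.randn(2, 1, 64, 64), scribble)
```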
[gated, prediction, dataset, evaluation, three, outperforms] [saliency, scribble, salient, object, detection, map, semantic, edge, segmentation, predicted, foreground, densecrf, annotation, region, feature, propose, employ, huchuan, boundary, supervision, weakly, weak, module, jing, yuchao, bounding, focus, duts, background, represents, scrf, australia] [model, trained, input] [ieee, method, convolutional, based, proposed, boosting, figure, existing, result, channel] [loss, image, unsupervised, train, produce, learn, supervised, generate] [network, learning, performance, training, deep, labeled, label, measure, indicates, size, metric, compared, better, data] [structure, partial, human, consistent, initial, well, directly, smoothness, dense]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Jing and Yu, Xin and Li, Aixuan and Song, Peipei and Liu, Bowen and Dai, Yuchao},
  title = {Weakly-Supervised Salient Object Detection via Scribble Annotations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Learn Single Domain Generalization
Fengchun Qiao, Long Zhao, Xi Peng


We are concerned with a worst-case scenario in model generalization, in the sense that a model aims to perform well on many unseen domains while there is only one single domain available for training. We propose a new method named adversarial domain augmentation to solve this Out-of-Distribution (OOD) generalization problem. The key idea is to leverage adversarial training to create "fictitious" yet "challenging" populations, from which a model can learn to generalize with theoretical guarantees. To facilitate fast and desirable domain augmentation, we cast the model training in a meta-learning scheme and use a Wasserstein Auto-Encoder (WAE) to relax the widely used worst-case constraint. Detailed theoretical analysis is provided to testify our formulation, while extensive experiments on multiple benchmark datasets indicate its superior performance in tackling single domain generalization.
[embedding, outperforms, transportation, work, multiple] [semantic, level, propose, table, key] [adversarial, model, generalization, robustness, corruption, input, create, robust, mce, severity, datasets, ensemble, trained] [proposed, method, comparison, figure, based, fast] [domain, source, lrelax, unseen, target, wasserstein, gud, learn, adaptation, lconst, ltask, unsupervised, image, discrepancy, ccsa, generate, consists, perform, train] [learning, training, augmented, augmentation, accuracy, data, deep, number, erm, neural, scheme, large, space, classification, problem, theoretical, gradient, task, distribution, observe, test, performance, rate, evaluate, set, report, better, compared, measure, function] [single, distance, constraint, defined, david, detailed, error]
@InProceedings{Qiao_2020_CVPR,
  author = {Qiao, Fengchun and Zhao, Long and Peng, Xi},
  title = {Learning to Learn Single Domain Generalization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Severity-Aware Semantic Segmentation With Reinforced Wasserstein Training
Xiaofeng Liu, Wenxuan Ji, Jane You, Georges El Fakhri, Jonghye Woo


Semantic segmentation is a class of methods that classify each pixel in an image into semantic classes, which is critical for autonomous vehicles and surgery systems. Cross-entropy (CE) loss-based deep neural networks (DNNs) have achieved great success w.r.t. accuracy-based metrics, e.g., mean Intersection-over-Union. However, the CE loss ignores the varying degrees of severity of pairwise misclassifications. For instance, classifying a car as road is far more severe than recognizing it as a bus. To sidestep this, in this work, we propose to incorporate the severity-aware inter-class correlation into our Wasserstein training framework by configuring its ground distance matrix. In addition, our method can adaptively learn the ground metric in a high-fidelity simulator, following an alternating optimization scheme driven by reinforcement learning. We evaluate our method using the CARLA simulator with the DeepLab backbone, demonstrating that it significantly improves the survival time in the simulator. In addition, our method can be readily applied to existing DNN architectures and algorithms while yielding superior performance. We report results from experiments carried out with the CamVid and Cityscapes datasets.
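With a one-hot ground truth, the Wasserstein-style objective reduces to the expected ground distance between the predicted class distribution and the label, so severity enters purely through the ground metric M. A per-sample sketch follows (the paper applies this pixel-wise and additionally learns M with reinforcement learning); the toy matrix and names are ours.

```python
import torch
import torch.nn.functional as F

def severity_aware_loss(logits, targets, ground_metric):
    """Expected ground distance between the predicted class distribution and
    a one-hot label: sum_j p_j * M[j, y].

    logits:        (B, C)
    targets:       (B,)
    ground_metric: (C, C) inter-class severity / ground distance matrix M
    """
    probs = F.softmax(logits, dim=1)              # (B, C)
    costs = ground_metric[:, targets].t()         # (B, C): M[j, y_i]
    return (probs * costs).sum(dim=1).mean()

# toy severity matrix: confusing "car" (0) with "road" (2) is worse than with "bus" (1)
M = torch.tensor([[0.0, 1.0, 5.0],
                  [1.0, 0.0, 4.0],
                  [5.0, 4.0, 0.0]])
loss = severity_aware_loss(torch.randn(4, 3), torch.tensor([0, 1, 2, 0]), M)
```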
[xiaofeng, driving, carla, jane, reinforcement, road, state, simulator, policy, agent, vehicle, actor, evaluation, town, ial, time, prediction, noticing, reward, truck, environment, action, speed] [segmentation, semantic, car, autonomous, propose, framework, table, deeplab, feature, map, ping, classifying, apply, enet, discriminate, conservative] [severity, trained] [pixel, based, figure, method, adaptively, result, fast, convolutional, pattern, ieee, chao] [loss, wasserstein, image, person, latent, bus, target, learn, sky, sidewalk] [learning, matrix, training, deep, alternative, space, learned, metric, group, probability, network, setting, function, neural, label, class, optimization, larger, vector, softmax, sum, indicates, gradient] [ground, distance, solution, conference]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Xiaofeng and Ji, Wenxuan and You, Jane and Fakhri, Georges El and Woo, Jonghye},
  title = {Severity-Aware Semantic Segmentation With Reinforced Wasserstein Training},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Boosting Few-Shot Learning With Adaptive Margin Loss
Aoxue Li, Weiran Huang, Xu Lan, Jiashi Feng, Zhenguo Li, Liwei Wang


Few-shot learning (FSL) has attracted increasing attention in recent years but remains challenging, due to the intrinsic difficulty in learning to generalize from a few examples. This paper proposes an adaptive margin principle to improve the generalization ability of metric-based meta-learning approaches for few-shot learning problems. Specifically, we first develop a class-relevant additive margin loss, where semantic similarity between each pair of classes is considered to separate samples in the feature embedding space from similar classes. Further, we incorporate the semantic context among all classes in a sampled training task and develop a task-relevant additive margin loss to better distinguish samples from different classes. Our adaptive margin method can be easily extended to a more realistic generalized FSL setting. Extensive experiments demonstrate that the proposed method can boost the performance of current metric-based meta-learning approaches, under both the standard FSL and generalized FSL settings.
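The class-relevant term can be sketched as an additive margin whose size grows with the semantic similarity between class embeddings, so that semantically close classes are pushed further apart. The formulation below (cross-entropy over margin-augmented similarity scores) and all names are our assumptions; the paper adds a task-relevant term on top.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_loss(scores, targets, class_emb, alpha=1.0, beta=0.1):
    """Class-relevant additive margin over a metric-based classifier.

    scores:    (B, C) similarity scores (e.g. cosine) to class prototypes
    targets:   (B,) ground-truth class indices
    class_emb: (C, D) semantic (word) embeddings of the classes
    """
    emb = F.normalize(class_emb, dim=1)
    sem_sim = emb @ emb.t()                              # (C, C) semantic similarity
    margins = alpha * sem_sim[targets] + beta            # (B, C): larger for similar classes
    margins = margins.scatter(1, targets.unsqueeze(1), 0.0)  # no margin on the true class
    return F.cross_entropy(scores + margins, targets)

scores = torch.randn(8, 10)
class_emb = torch.randn(10, 300)                         # e.g. GloVe class embeddings
loss = adaptive_margin_loss(scores, torch.randint(0, 10, (8,)), class_emb)
```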
[embedding, dog, word, visual, dataset, evaluation, context, current, pair, recognize, recognition, naive] [semantic, module, feature, table, sofa, propose, effectiveness, add, key] [model, query, improve, trained, comparative, suitable] [adaptive, proposed, method, dynamic, figure] [loss, generator, generalized, learn, discriminative, train, extended, realistic] [margin, class, additive, fsl, learning, metric, test, training, classification, base, set, space, task, standard, similarity, network, gradient, better, deep, labeled, learned, randomly, support, prototypical, performance, data, sample, setting, cosine, wolf, miniimagenet, label, episodic, meta, softmax, angular, accuracy] [novel, approach, matching, full, cabinet, form]
@InProceedings{Li_2020_CVPR,
  author = {Li, Aoxue and Huang, Weiran and Lan, Xu and Feng, Jiashi and Li, Zhenguo and Wang, Liwei},
  title = {Boosting Few-Shot Learning With Adaptive Margin Loss},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
JA-POLS: A Moving-Camera Background Model via Joint Alignment and Partially-Overlapping Local Subspaces
Irit Chelly, Vlad Winter, Dor Litvak, David Rosen, Oren Freifeld


Background models are widely used in computer vision. While successful Static-camera Background (SCB) models exist, Moving-camera Background (MCB) models are limited. Seemingly, there is a straightforward solution: 1) align the video frames; 2) learn an SCB model; 3) warp either original or previously-unseen frames toward the model. This approach, however, has drawbacks, especially when the accumulative camera motion is large and/or the video is long. Here we propose a purely-2D unsupervised modular method that systematically eliminates those issues. First, to estimate warps in the original video, we solve a joint-alignment problem while leveraging a certifiably-correct initialization. Next, we learn both multiple partially-overlapping local subspaces and how to predict alignments. Lastly, at test time, we warp a previously-unseen frame, based on the prediction, and project it on a subset of those subspaces to obtain a background/foreground separation. We show the method handles even large scenes with a relatively free camera motion (provided the camera-to-scene distance does not change much) and that it not only yields state-of-the-art results on the original video but also generalizes gracefully to previously-unseen videos of the same scene. Our code is available at https://github.com/BGU-CS-VIL/JA-POLS.
[video, frame, long, panoramic, moving, dataset, observation] [background, global, foreground, detection, table, propose, tracking, focus] [model, robust, original, accumulative, input] [method, jitter, motion, warped, affine, figure, separation, mcb, stn, geij, scb, warp, net, proposed, phase, pixel, prpca, consecutive] [alignment, image, domain, loss, learn, unsupervised, align, component, missing] [test, large, denote, training, learning, subspace, data, small, problem, pairwise, number, linear, average, incremental, note, entire] [camera, transformation, local, scene, joint, handle, estimated, approach, estimation, relative, novel, coordinate, computer, estimate, principal]
@InProceedings{Chelly_2020_CVPR,
  author = {Chelly, Irit and Winter, Vlad and Litvak, Dor and Rosen, David and Freifeld, Oren},
  title = {JA-POLS: A Moving-Camera Background Model via Joint Alignment and Partially-Overlapping Local Subspaces},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
AugFPN: Improving Multi-Scale Feature Learning for Object Detection
Chaoxu Guo, Bin Fan, Qian Zhang, Shiming Xiang, Chunhong Pan


Current state-of-the-art detectors typically exploit a feature pyramid to detect objects at different scales. Among them, FPN is a representative work that builds a feature pyramid by summing multi-scale features. However, its design defects prevent the multi-scale features from being fully exploited. In this paper, we begin by analyzing the design defects of the feature pyramid in FPN, and then introduce a new feature pyramid architecture named AugFPN to address these problems. Specifically, AugFPN consists of three components: Consistent Supervision, Residual Feature Augmentation, and Soft RoI Selection. AugFPN narrows the semantic gaps between features of different scales before feature fusion through Consistent Supervision. In feature fusion, ratio-invariant context information is extracted by Residual Feature Augmentation to reduce the information loss of the feature map at the highest pyramid level. Finally, Soft RoI Selection is employed to adaptively learn a better RoI feature after feature fusion. By replacing FPN with AugFPN in Faster R-CNN, our models achieve 2.3 and 1.6 points higher Average Precision (AP) when using ResNet50 and MobileNet-v2 as backbones, respectively. Furthermore, AugFPN improves RetinaNet by 1.6 points AP and FCOS by 0.9 points AP when using ResNet50 as the backbone. Code is available at https://github.com/Gus-Guo/AugFPN.
[context, three, connected] [feature, roi, pyramid, object, fpn, pooling, augfpn, semantic, faster, level, supervision, detection, table, fully, asf, ablation, coco, mask, improves, global, assigned, map, region, final, retinanet, ross, backbone, panet, extra, improvement, kaiming, highest, fusing, propose, brings, segmentation, apm] [improve] [adaptive, residual, spatial, fusion, based, method, figure, convolutional, convolution, scale] [loss, generate, corresponding, gap, representation] [soft, setting, baseline, max, selection, performance, better, network, average, augmentation, lower, sum, design, reduce, higher, deep, training, set, large, weight] [consistent, single]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Chaoxu and Fan, Bin and Zhang, Qian and Xiang, Shiming and Pan, Chunhong},
  title = {AugFPN: Improving Multi-Scale Feature Learning for Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
xMUDA: Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation
Maximilian Jaritz, Tuan-Hung Vu, Raoul de Charette, Emilie Wirbel, Patrick Perez


Unsupervised Domain Adaptation (UDA) is crucial to tackle the lack of annotations in a new domain. There are many multi-modal datasets, but most UDA approaches are uni-modal. In this work, we explore how to learn from multi-modality and propose cross-modal UDA (xMUDA) where we assume the presence of 2D images and 3D point clouds for 3D semantic segmentation. This is challenging as the two input spaces are heterogeneous and can be impacted differently by domain shift. In xMUDA, modalities learn from each other through mutual mimicking, disentangled from the segmentation objective, to prevent the stronger modality from adopting false predictions from the weaker one. We evaluate on new UDA scenarios including day-to-night, country-to-country and dataset-to-dataset, leveraging recent autonomous driving datasets. xMUDA brings large improvements over uni-modal UDA on all tested scenarios, and is complementary to state-of-the-art UDA techniques. Code is available at https://github.com/valeoai/xmuda.
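The cross-modal objective can be read as mutual mimicking: each modality carries a second "mimicry" head that matches the other modality's main prediction (detached), keeping the mimicking disentangled from the segmentation objective. A hedged sketch, with argument names of our choosing:

```python
import torch
import torch.nn.functional as F

def xm_loss(pred_2d_mim, pred_3d_main, pred_3d_mim, pred_2d_main):
    """Cross-modal learning objective: each modality's mimicry head matches the
    other modality's (detached) main prediction via KL divergence.

    All inputs are (N, C) logits over classes for the N points that have both
    an image feature and a point-cloud feature.
    """
    kl_2d = F.kl_div(F.log_softmax(pred_2d_mim, dim=1),
                     F.softmax(pred_3d_main.detach(), dim=1), reduction="batchmean")
    kl_3d = F.kl_div(F.log_softmax(pred_3d_mim, dim=1),
                     F.softmax(pred_2d_main.detach(), dim=1), reduction="batchmean")
    return kl_2d + kl_3d

n, c = 1024, 10
loss = xm_loss(torch.randn(n, c), torch.randn(n, c), torch.randn(n, c), torch.randn(n, c))
```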
[modality, prediction, stream, oracle, work, order] [segmentation, semantic, head, main, lidar, feature, fuse, object, miou, propose, car, autonomous, framework] [private, input, model, complementary, datasets, robust, adversarial] [fusion, dual, sensor, proposed, output, figure, method, existing] [uda, domain, target, loss, xmuda, source, lxm, adaptation, image, supervised, xmudapl, unsupervised, learn, shared, crossmodal, mimicry, lseg, train, address, generate, corresponding, project] [learning, network, architecture, training, deep, best, mimicking, performance, size, softmax, vanilla, baseline, objective, data, layer, class, consider, scheme, applied, task, knowledge, sample] [point, single, cloud, camera, additional, sparse]
@InProceedings{Jaritz_2020_CVPR,
  author = {Jaritz, Maximilian and Vu, Tuan-Hung and Charette, Raoul de and Wirbel, Emilie and Perez, Patrick},
  title = {xMUDA: Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Norm-Aware Embedding for Efficient Person Search
Di Chen, Shanshan Zhang, Jian Yang, Bernt Schiele


Person Search is a practically relevant task that aims to jointly solve Person Detection and Person Re-identification (re-ID). Specifically, it requires finding and locating all instances with the same identity as the query person in a set of panoramic gallery images. One major challenge comes from the contradictory goals of the two sub-tasks, i.e., person detection focuses on finding the commonness of all persons while person re-ID handles the differences among multiple identities. Therefore, it is crucial to reconcile the relationship between the two sub-tasks in a joint person search model. To this end, we present a novel approach called Norm-Aware Embedding to disentangle the person embedding into norm and angle for detection and re-ID respectively, allowing for both effective and efficient multi-task training. We further extend the proposal-level person embedding to pixel-level, whose discrimination ability is less affected by mis-alignment. We outperform other one-step methods by a large margin and achieve comparable performance to two-step methods on both CUHK-SYSU and PRW. Also, our method is easy to train and resource-friendly, running at 12 fps on a single GPU.
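The core idea is to split each proposal embedding into its norm, used as the person-vs-background score, and its direction, used for re-ID matching. A toy decomposition is below; the sigmoid rescaling of the norm is only a stand-in for the learnable rescaling used in practice, and all names are ours.

```python
import torch

def norm_aware_split(embeddings):
    """Disentangle a person embedding into its norm (detection score) and its
    direction (re-ID feature).

    embeddings: (N, D) per-proposal feature vectors
    Returns (detection_prob, reid_feature).
    """
    norm = embeddings.norm(dim=1, keepdim=True)       # (N, 1)
    direction = embeddings / (norm + 1e-12)           # unit-length re-ID feature
    det_prob = torch.sigmoid(norm - norm.mean())      # sketch of a learnable rescaling
    return det_prob, direction

emb = torch.randn(16, 256)
det_prob, reid_feat = norm_aware_split(emb)
sim = reid_feat @ reid_feat.t()                       # cosine similarity for re-ID matching
```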
[embedding, embeddings, attention] [detection, nae, feature, bounding, box, gallery, pedestrian, map, faster, oim, background, propose, proposal, cws, regression, region, confidence, final, roialign, liang, contradictory, object, head, ross, jian] [norm, model, query, identity, trained, face] [method, spatial, convolutional, block, figure, comparison, science, based] [person, loss, image] [search, classification, learning, deep, performance, similarity, set, better, normalized, network, problem, size, standard, training, probability, kxk, layer, lower, vector, larger, top, denote, objective, weighted, margin, indicates] [matching, ground, truth, angle, extension, jointly, joint, approach]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Di and Zhang, Shanshan and Yang, Jian and Schiele, Bernt},
  title = {Norm-Aware Embedding for Efficient Person Search},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Intelligent Home 3D: Automatic 3D-House Design From Linguistic Descriptions Only
Qi Chen, Qi Wu, Rui Tang, Yuhan Wang, Shuai Wang, Mingkui Tan


Home design is a complex task that normally requires architects with professional skills and tools to complete. It would be fascinating if one could produce a house plan intuitively, without much knowledge of home design or experience with complex design tools, for example via natural language. In this paper, we formulate this as a language-conditioned visual content generation problem that is further divided into a floor plan generation task and an interior texture (such as floor and wall) synthesis task. The only control signal of the generation process is the linguistic expression given by users that describes the house details. To this end, we propose a House Plan Generative Model (HPGM) that first translates the language input to a structural graph representation, then predicts the layout of rooms with a Graph Conditioned Layout Prediction Network (GC-LPN) and generates the interior texture with a Language Conditioned Texture GAN (LCT-GAN). With some post-processing, the final product of this task is a 3D house model. To train and evaluate our model, we build the first Text-to-3D House Model dataset, which will be released at: https:// hidden-link-for-submission.
[graph, house, linguistic, text, language, natural, prediction, parser, visual, step] [building, propose, feature, bounding, box, table, grant, semantic] [input, model, adversarial, refers] [ieee, figure, proposed, method, convolutional, based, colour, adjacent] [texture, generation, layout, generated, plan, image, generate, generative, corresponding, conditioned, mingkui, synthesis, generating, structural, wood, ability, conditional, produce, representation, generator, gan, real, fid, hpgm, ladv] [design, network, set, interior, performance, size, compared, neural, evaluate, training, task, process, deep, learning, data, objective, yield, indicates, note] [room, wall, scene, human, material, acm, square, rendering]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Qi and Wu, Qi and Tang, Rui and Wang, Yuhan and Wang, Shuai and Tan, Mingkui},
  title = {Intelligent Home 3D: Automatic 3D-House Design From Linguistic Descriptions Only},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Differential Treatment for Stuff and Things: A Simple Unsupervised Domain Adaptation Method for Semantic Segmentation
Zhonghao Wang, Mo Yu, Yunchao Wei, Rogerio Feris, Jinjun Xiong, Wen-mei Hwu, Thomas S. Huang, Honghui Shi


In this work, we consider the problem of unsupervised domain adaptation for semantic segmentation by easing the domain shift between the source domain (synthetic data) and the target domain (real data). State-of-the-art approaches show that performing semantic-level alignment is helpful in tackling the domain shift issue. Based on the observation that stuff categories usually share similar appearances across images of different domains while things (i.e. object instances) have much larger differences, we propose to improve the semantic-level alignment with different strategies for stuff regions and for things: 1) for the stuff categories, we generate a feature representation for each class and conduct the alignment operation from the target domain to the source domain; 2) for the thing categories, we generate a feature representation for each individual instance and encourage the instance in the target domain to align with the most similar one in the source domain. In this way, the individual differences within thing categories are also considered, alleviating over-alignment. Beyond the proposed method, we further reveal why the commonly used adversarial loss is often unstable in minimizing the distribution discrepancy, and show that our method can help ease this issue by minimizing the most similar stuff and instance features between the source and the target domains. We conduct extensive experiments on two unsupervised domain adaptation tasks, i.e. GTA5 to Cityscapes and SYNTHIA to Cityscapes, and achieve new state-of-the-art segmentation accuracy.
[dataset, recognition, shift] [semantic, feature, segmentation, stuff, instance, sim, miou, table, framework, confidence, module, achieves, predicted, background, yunchao, propose, head, map] [model, adversarial, trained] [ieee, pattern, method, output, june, convolutional, proposed, figure, honghui] [domain, source, target, adaptation, image, eqn, loss, unsupervised, generate, synthia, discriminator, minimizing, discrepancy, pseudo, alignment, representation, synthetic, generator, generated, appearance, transferring] [training, class, learning, performance, label, deep, distribution, set, number, follow, function, arxiv, problem, large, data, space, network, min, adapt, stored, operation, achieve] [computer, conference, vision, matching, ground, truth, scene, international, thomas, structure]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Zhonghao and Yu, Mo and Wei, Yunchao and Feris, Rogerio and Xiong, Jinjun and Hwu, Wen-mei and Huang, Thomas S. and Shi, Honghui},
  title = {Differential Treatment for Stuff and Things: A Simple Unsupervised Domain Adaptation Method for Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Robust Object Detection Under Occlusion With Context-Aware CompositionalNets
Angtian Wang, Yihong Sun, Adam Kortylewski, Alan L. Yuille


Detecting partially occluded objects is a difficult task. Our experimental results show that deep learning approaches, such as Faster R-CNN, are not robust at object detection under occlusion. Compositional convolutional neural networks (CompositionalNets) have been shown to be robust at classifying occluded objects by explicitly representing the object as a composition of parts. In this work, we propose to overcome two limitations of CompositionalNets which will enable them to detect partially occluded objects: 1) CompositionalNets, as well as other DCNN architectures, do not explicitly separate the representation of the context from the object itself. Under strong object occlusion, the influence of the context is amplified which can have severe negative effects for detection at test time. In order to overcome this, we propose to segment the context during training via bounding box annotations. We then use the segmentation to learn a context-aware compositionalNet that disentangles the representation of the context and the object. 2) We extend the part-based voting scheme in CompositionalNets to vote for the corners of the object's bounding box, which enables the model to reliably estimate bounding boxes for partially occluded objects. Our extensive experiments show that our proposed model can detect objects robustly, increasing the detection performance of strongly occluded vehicles from PASCAL3D+ and MS-COCO by 41% and 35% respectively in absolute performance relative to Faster R-CNN.
[context, compositional, dataset, work, blue, mechanism, order] [object, detection, bounding, box, occluded, compositionalnets, occlusion, faster, partially, feature, voting, propose, map, compositionalnet, alan, proposal, detect, semantic, region, occludedvehiclesdetection, bbv] [model, robust, trained, strong, influence, detecting, robustly] [proposed, figure, ieee, pattern, adam, convolutional, based, green, dcnn] [image, representation, loss, generative, component, separate] [training, performance, mixture, deep, learning, neural, network, classification, data, learned, arxiv, preprint, layer, classify, set, increasing, fixed, number, standard, note] [vision, computer, partial, conference, ground, estimation, enables, estimate, approach, position]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Angtian and Sun, Yihong and Kortylewski, Adam and Yuille, Alan L.},
  title = {Robust Object Detection Under Occlusion With Context-Aware CompositionalNets},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han


Enabling bi-directional retrieval of images and texts is important for understanding the correspondence between vision and language. Existing methods leverage the attention mechanism to explore such correspondence in a fine-grained manner. However, most of them consider all semantics equally and thus align them uniformly, regardless of their diverse complexities. In fact, semantics are diverse (i.e. involving different kinds of semantic concepts), and humans usually follow a latent structure to combine them into understandable languages. It may be difficult to optimally capture such sophisticated correspondences in existing methods. In this paper, to address such a deficiency, we propose an Iterative Matching with Recurrent Attention Memory (IMRAM) method, in which correspondences between images and texts are captured with multiple steps of alignments. Specifically, we introduce an iterative matching scheme to explore such fine-grained correspondence progressively. A memory distillation unit is used to refine alignment knowledge from early steps to later ones. Experimental results on three benchmark datasets, i.e. Flickr8K, Flickr30K, and MS COCO, show that our IMRAM achieves state-of-the-art performance, well demonstrating its effectiveness. Experiments on a practical business advertisement dataset, named KWAI-AD, further validate the applicability of our method in practical scenarios.
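The iterative matching scheme can be approximated as repeated cross-attention between region and word features, with the query refined after every step. In the sketch below a simple additive update stands in for the paper's gated memory distillation unit, and all names and dimensions are our assumptions.

```python
import torch
import torch.nn.functional as F

def attend(query, context, smooth=9.0):
    """One cross-attention step: each query fragment attends over the context
    fragments and gathers a weighted summary."""
    attn = F.softmax(smooth * F.normalize(query, dim=-1) @ F.normalize(context, dim=-1).t(), dim=-1)
    return attn @ context                                 # (Nq, D)

def iterative_matching(regions, words, steps=3):
    """Iterative matching sketch: score the alignment at each step and refine
    the query with what was attended (simplified memory update)."""
    query = regions
    scores = []
    for _ in range(steps):
        attended = attend(query, words)
        scores.append(F.cosine_similarity(regions, attended, dim=-1).mean())
        query = query + attended
    return torch.stack(scores).sum()

regions = torch.randn(36, 512)   # image region features
words = torch.randn(12, 512)     # caption word features
score = iterative_matching(regions, words)
```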
[attention, retrieval, text, unit, recurrent, semantics, embedding, word, dataset, three, imram, advertisement, business, explore, context, visual, gru, cxi, step, guiguang, understanding, work, extract] [semantic, feature, table, benchmark, coco, effectiveness, salient, refine, matched, adopt, region, score, hard] [iterative, model, datasets, query, great] [proposed, method, ieee, pattern, figure, existing] [image, alignment, corresponding, latent, align, representation, shared, loss] [memory, performance, distillation, learning, function, scheme, knowledge, practical, deep, compared, similarity, set, experiment, neural, vector, better] [matching, vision, conference, correspondence, computer, well, fragment, international]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Hui and Ding, Guiguang and Liu, Xudong and Lin, Zijia and Liu, Ji and Han, Jungong},
  title = {IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Domain-Aware Visual Bias Eliminating for Generalized Zero-Shot Learning
Shaobo Min, Hantao Yao, Hongtao Xie, Chaoqun Wang, Zheng-Jun Zha, Yongdong Zhang


Generalized zero-shot learning aims to recognize images from seen and unseen domains. Recent methods focus on learning a unified semantic-aligned visual representation to transfer knowledge between two domains, while ignoring the effect of semantic-free visual representation in alleviating the biased recognition problem. In this paper, we propose a novel Domain-aware Visual Bias Eliminating (DVBE) network that constructs two complementary visual representations, i.e., semantic-free and semantic-aligned, to treat seen and unseen domains separately. Specifically, we explore cross-attentive second-order visual statistics to compact the semantic-free representation, and design an adaptive margin Softmax to maximize inter-class divergences. Thus, the semantic-free representation becomes discriminative enough to not only predict seen classes accurately but also filter out unseen images, i.e., domain detection, based on the predicted class entropy. For unseen images, we automatically search an optimal semantic-visual alignment architecture, rather than manual designs, to predict unseen classes. With accurate domain detection, the biased recognition problem towards the seen domain is significantly reduced. Experiments on five benchmarks for classification and segmentation show that DVBE outperforms existing methods by an average improvement of 5.7%.
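A minimal sketch of the entropy-based domain detection mentioned above, assuming a seen-class classifier and a hand-picked threshold (both the function name and the threshold are illustrative, not the paper's calibration):

import torch
import torch.nn.functional as F

def route_by_entropy(seen_logits, threshold):
    # seen_logits: (N, num_seen_classes) predictions from the semantic-free branch
    probs = F.softmax(seen_logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)   # per-image prediction entropy
    return entropy > threshold   # True -> likely unseen domain, route to the semantic-aligned branch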
[visual, embedding, recognition, graph, prediction, yongdong, automatically, previous, outperforms, embeddings, interaction] [semantic, biased, table, detection, cau, feature, object, sun, segmentation, predicted, improvement, hard, backbone, detector, extra, obtains] [improve, model, robust] [figure, adaptive, based, channel, proposed, ieee, eliminating] [unseen, dvbe, domain, representation, discriminative, generalized, discrimination, image, amse, cub, gzsl, loss, generate, fatt, latent, alignment, filtered, zeynep] [margin, class, learning, softmax, entropy, network, large, classification, data, architecture, training, performance, bias, deep, filter, optimal, problem, compared, sample, fixed, knowledge, standard] []
@InProceedings{Min_2020_CVPR,
  author = {Min, Shaobo and Yao, Hantao and Xie, Hongtao and Wang, Chaoqun and Zha, Zheng-Jun and Zhang, Yongdong},
  title = {Domain-Aware Visual Bias Eliminating for Generalized Zero-Shot Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Semi-Supervised Semantic Segmentation With Cross-Consistency Training
Yassine Ouali, Celine Hudelot, Myriam Tami


In this paper, we present a novel cross-consistency based semi-supervised approach for semantic segmentation. Consistency training has proven to be a powerful semi-supervised learning framework for leveraging unlabeled data under the cluster assumption, in which the decision boundary should lie in low-density regions. In this work, we first observe that for semantic segmentation, the low-density regions are more apparent within the hidden representations than within the inputs. We thus propose cross-consistency training, where an invariance of the predictions is enforced over different perturbations applied to the outputs of the encoder. Concretely, a shared encoder and a main decoder are trained in a supervised manner using the available labeled examples. To leverage the unlabeled examples, we enforce a consistency between the main decoder predictions and those of the auxiliary decoders, taking as inputs different perturbed versions of the encoder's output, and consequently, improving the encoder's representations. The proposed method is simple and can easily be extended to use additional training signal, such as image-level labels or pixel-level labels across different domains. We perform an ablation study to tease apart the effectiveness of each component, and conduct extensive experiments to demonstrate that our method achieves state-of-the-art results in several datasets.
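A rough PyTorch sketch of the cross-consistency objective described above; the encoder/decoder modules, the Gaussian feature perturbation, and the loss weight are stand-in assumptions rather than the authors' exact choices:

import torch
import torch.nn.functional as F

def cct_loss(encoder, main_decoder, aux_decoders, x_labeled, y_labeled, x_unlabeled, w_u=1.0):
    # Supervised branch: encoder + main decoder on labeled images.
    z_l = encoder(x_labeled)
    sup_loss = F.cross_entropy(main_decoder(z_l), y_labeled)

    # Unsupervised branch: the main decoder gives the target prediction, and the
    # auxiliary decoders see perturbed encoder outputs and must agree with it.
    z_u = encoder(x_unlabeled)
    with torch.no_grad():
        target = F.softmax(main_decoder(z_u), dim=1)
    cons_loss = 0.0
    for aux in aux_decoders:
        z_perturbed = z_u + 0.1 * torch.randn_like(z_u)   # one simple feature-level perturbation
        cons_loss = cons_loss + F.mse_loss(F.softmax(aux(z_perturbed), dim=1), target)
    cons_loss = cons_loss / len(aux_decoders)

    return sup_loss + w_u * cons_loss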
[decoder, multiple, previous, hidden] [semantic, main, segmentation, pascal, sun, ablation, effectiveness, object, weakly, camvid, propose, feature] [auxiliary, input, trained, adversarial, perturbation, perturbed, datasets] [ieee, method, pattern, based, proposed, output, low, assumption, spatial, version, convolutional] [consistency, encoder, cct, supervised, loss, domain, train, image, shared, representation, gak, generate, cluster, adaptation, corresponding, unsupervised] [training, labeled, unlabeled, learning, data, set, deep, applied, network, number, label, arxiv, preprint, average, performance, neural, ssl, classification, baseline, efficient, density, simple, setting, compared] [computer, conference, vision, additional, international, enforcing, approach, compute, distance]
@InProceedings{Ouali_2020_CVPR,
  author = {Ouali, Yassine and Hudelot, Celine and Tami, Myriam},
  title = {Semi-Supervised Semantic Segmentation With Cross-Consistency Training},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Learn Cropping Models for Different Aspect Ratio Requirements
Debang Li, Junge Zhang, Kaiqi Huang


Image cropping aims at improving the framing of an image by removing its extraneous outer areas, and is widely used in the photography and printing industry. In some cases, a specific aspect ratio is required for the cropping result, e.g., to fit a target display or print format. In this paper, we propose a meta-learning (learning to learn) based aspect ratio specified image cropping method called Mars, which can generate cropping results of different expected aspect ratios. In the proposed method, a base model and two meta-learners are obtained during the training stage. Given an aspect ratio in the test stage, a new model with new parameters can be generated from the base model. Specifically, the two meta-learners predict the parameters of the base model based on the given aspect ratio. The learning process of the proposed method is learning how to learn cropping models for different aspect ratio requirements, which is a typical meta-learning process. In the experiments, the proposed method is evaluated on three datasets and outperforms most state-of-the-art methods in terms of accuracy and speed. In addition, both the intermediate and final results show that the proposed model can predict different cropping windows for an image depending on different aspect ratio requirements.
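To make the parameter-prediction idea concrete, here is a hypothetical toy module in which a small meta-learner regresses the weights of one layer of a base cropping model from the requested aspect ratio; the sizes, architecture, and names are invented for illustration, not taken from the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaLayer(nn.Module):
    def __init__(self, in_dim=256, out_dim=128):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # Meta-learner: a small MLP that regresses the layer's weight and bias.
        self.meta = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(),
            nn.Linear(64, in_dim * out_dim + out_dim),
        )

    def forward(self, features, aspect_ratio):
        # aspect_ratio: tensor of shape (1, 1), e.g. torch.tensor([[16.0 / 9.0]])
        params = self.meta(aspect_ratio).squeeze(0)
        w = params[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        b = params[self.in_dim * self.out_dim:]
        return F.linear(features, w, b)

layer = MetaLayer()
out = layer(torch.randn(4, 256), torch.tensor([[16.0 / 9.0]]))   # (4, 128)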
[embedding, speed, three, automatic, prediction, predict, visual, ven, gaic] [feature, table, map, fps, backbone, ablation, module, predicted, sliding, global, vpn, propose] [model, study, input, original, fat] [aspect, cropping, proposed, method, figure, based, output, window, resolution, depending, channel, upsampling, wout, cout, hcdb, convolution, interpolation, column, fcdb, hin] [image, generate, generated, user, learn, qualitative, target] [ratio, learning, required, number, set, base, training, network, vector, dimension, performance, log, layer, better, validation, compared, deep, find, size, test, process, accuracy] [directly, approach, single, transformation]
@InProceedings{Li_2020_CVPR,
  author = {Li, Debang and Zhang, Junge and Huang, Kaiqi},
  title = {Learning to Learn Cropping Models for Different Aspect Ratio Requirements},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
What Makes Training Multi-Modal Classification Networks Hard?
Weiyao Wang, Du Tran, Matt Feiszli


Consider end-to-end training of a multi-modal vs. a uni-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its uni-modal counterpart. In our experiments, however, we observe the opposite: the best uni-modal network can outperform the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks for video classification. This paper identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to increased capacity. Second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient-Blending, which computes an optimal blending of modalities based on their overfitting behaviors. We demonstrate that Gradient Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on various tasks including human action recognition, ego-centric action recognition, and acoustic event detection.
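A simplified, hedged reading of the Gradient-Blending idea: track, over a checkpoint interval, how much each modality head improves on validation (generalization) versus how much its train/validation gap grows (overfitting), and weight each head's loss accordingly. The exact estimator in the paper differs; this sketch only illustrates the shape of the computation, and all names are assumptions:

import torch

def blending_weights(train_losses_t0, val_losses_t0, train_losses_t1, val_losses_t1, eps=1e-8):
    # Each argument is a list with one scalar per modality head, measured at two checkpoints.
    weights = []
    for tr0, va0, tr1, va1 in zip(train_losses_t0, val_losses_t0, train_losses_t1, val_losses_t1):
        G = max(va0 - va1, eps)                      # improvement on validation loss
        O = max((va1 - tr1) - (va0 - tr0), eps)      # growth of the train/validation gap
        weights.append(G / (O ** 2))                 # favor heads that generalize without overfitting
    w = torch.tensor(weights)
    return w / w.sum()

def blended_loss(per_head_losses, weights):
    # per_head_losses: list of scalar losses, e.g. [audio_loss, video_loss, joint_loss]
    return sum(w * l for w, l in zip(weights, per_head_losses))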
[naive, visual, video, outperforms, audio, action, kinetics, multiple, ogr, three, modality, clip, late, slowfast, dataset, multimodal, blend, audioset, temporal, gblend] [table, backbone, including, improvement] [model, offline, blending, trained, input, auxiliary, ensemble] [fusion, method, comparison, optical, figure, convolutional] [loss, train, minimizing] [training, overfitting, learning, network, online, best, gradient, validation, epoch, accuracy, performance, baseline, set, deep, classification, problem, small, compared, dropout, measure, neural, consider, optimization, optimal, subset, sgd, algorithm, outperform, observe] [rgb, joint, single, compare, supplementary, compute, error]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Weiyao and Tran, Du and Feiszli, Matt},
  title = {What Makes Training Multi-Modal Classification Networks Hard?},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Selective Transfer With Reinforced Transfer Network for Partial Domain Adaptation
Zhihong Chen, Chao Chen, Zhaowei Cheng, Boyuan Jiang, Ke Fang, Xinyu Jin


One crucial aspect of partial domain adaptation (PDA) is how to select the relevant source samples in the shared classes for knowledge transfer. Previous PDA methods tackle this problem by re-weighting the source samples based on their high-level information (deep features). However, due to the domain shift between the source and target domains, using only the deep features for sample selection is defective. We argue that it is more reasonable to additionally exploit the pixel-level information for the PDA problem, as the appearance difference between outlier source classes and target classes is significantly large. In this paper, we propose a reinforced transfer network (RTNet), which utilizes both high-level and pixel-level information for the PDA problem. Our RTNet is composed of a reinforced data selector (RDS) based on reinforcement learning (RL), which filters out the outlier source samples, and a domain adaptation model which minimizes the domain discrepancy in the shared label space. Specifically, in the RDS, we design a novel reward based on the reconstruction errors of selected source samples on the target generator, which introduces the pixel-level information to guide the learning of the RDS. Besides, we develop a state containing high-level information, which is used by the RDS for sample selection. The proposed RDS is a general module, which can be easily integrated into existing DA models to make them fit the PDA setting. Extensive experiments indicate that RTNet can achieve state-of-the-art performance for PDA tasks on several benchmark datasets.
[reward, policy, reinforcement, previous, state, dataset, moment, relevant, shift, action, future] [feature, table, utilizes, represents, resnet] [model, adversarial, trained, improve] [based, figure, method, proposed, integrated, existing, analysis] [source, target, domain, pda, rtnet, transfer, adaptation, selector, rds, shared, coral, xbs, filtered, rtnetadv, appearance, alignment, wasserstein, unsupervised, trevor] [sample, label, learning, network, deep, data, negative, select, filter, performance, distribution, selected, batch, selection, class, update, expected, problem, space, probability, total, number, ratio, training, labeled, algorithm, episode, retention, design, classifier, note, ssi] [outlier, reinforced, error, reconstruction, partial, solve, estimate, defined, matching]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Zhihong and Chen, Chao and Cheng, Zhaowei and Jiang, Boyuan and Fang, Ke and Jin, Xinyu},
  title = {Selective Transfer With Reinforced Transfer Network for Partial Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Semi-Supervised Semantic Image Segmentation With Self-Correcting Networks
Mostafa S. Ibrahim, Arash Vahdat, Mani Ranjbar, William G. Macready


Building a large image dataset with high-quality object masks for semantic segmentation is costly and time-consuming. In this paper, we introduce a principled semi-supervised framework that only uses a small set of fully supervised images (having semantic segmentation labels and box labels) and a set of images with only object bounding box labels (we call it the weak-set). Our framework trains the primary segmentation model with the aid of an ancillary model that generates initial segmentation labels for the weak-set and a self-correction module that improves the generated labels during training using the increasingly accurate primary model. We introduce two variants of the self-correction module using either linear or convolutional functions. Experiments on the PASCAL VOC 2012 and Cityscape datasets show that our models trained with a small fully supervised set perform similarly to, or better than, models trained with a large fully supervised set while requiring 7x less annotation effort.
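As a hedged illustration of the linear self-correction variant, the weak-set labels could be formed by mixing the ancillary and primary model predictions with a weight that shifts toward the primary model over training; the linear schedule below is an assumption, not the paper's exact rule:

import torch
import torch.nn.functional as F

def self_corrected_labels(ancillary_logits, primary_logits, step, total_steps):
    # alpha goes from 0 (trust the ancillary model) to 1 (trust the improving primary model).
    alpha = step / float(total_steps)
    probs = (1 - alpha) * F.softmax(ancillary_logits, dim=1) \
            + alpha * F.softmax(primary_logits, dim=1)
    return probs   # soft labels used to supervise the primary model on the weak-set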
[recognition, dataset, previous, work, predict, represent, current] [segmentation, ancillary, semantic, bounding, object, box, fully, weak, panc, pascal, mask, voc, framework, table, weakly, module, feature, annotation, foreground, george] [model, primary, trained, robust, input, noise, auxiliary] [pattern, ieee, convolutional, noisy, output, based] [image, supervised, train, generated, encoder, loss, generates] [training, set, learning, deep, label, linear, neural, network, distribution, small, performance, data, validation, simple, test, log, processing, large, better, logits, classification, labeled, function, number, machine] [computer, vision, conference, international, approach, initial, rely, second, european]
@InProceedings{Ibrahim_2020_CVPR,
  author = {Ibrahim, Mostafa S. and Vahdat, Arash and Ranjbar, Mani and Macready, William G.},
  title = {Semi-Supervised Semantic Image Segmentation With Self-Correcting Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Exemplar Normalization for Learning Deep Representation
Ruimao Zhang, Zhanglin Peng, Lingyun Wu, Zhen Li, Ping Luo


Normalization techniques are important in various advanced neural networks and tasks. This work investigates a novel dynamic learning-to-normalize (L2N) problem by proposing Exemplar Normalization (EN), which is able to learn different normalization methods for different convolutional layers and image samples of a deep network. EN significantly improves the flexibility of the recently proposed switchable normalization (SN), which solves a static L2N problem by linearly combining several normalizers in each normalization layer (the combination is the same for all samples). Instead of directly employing a multi-layer perceptron (MLP) to learn data-dependent parameters as conditional batch normalization (cBN) did, the internal architecture of EN is carefully designed to stabilize its optimization, leading to many appealing benefits. (1) EN enables different convolutional layers, image samples, categories, benchmarks, and tasks to use different normalization methods, shedding light on analyzing them in a holistic view. (2) EN is effective for various network architectures and tasks. (3) It could replace any normalization layers in a deep network and still produce stable model training. Extensive experiments demonstrate the effectiveness of EN in a wide spectrum of tasks including image recognition, noisy label learning, and semantic segmentation. For example, by replacing BN in the ordinary ResNet50, the improvement produced by EN is 300% more than that of SN on both ImageNet and the noisy WebVision dataset. The codes and models will be released.
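The following toy module illustrates the general idea of sample-dependent mixing of normalizers (instance, layer, and batch normalization); it is not the paper's EN architecture, and the pooling-plus-linear head used to predict the mixing weights is an assumption:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleExemplarNorm(nn.Module):
    def __init__(self, num_channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels, affine=False)
        self.inorm = nn.InstanceNorm2d(num_channels, affine=False)
        self.lnorm = nn.GroupNorm(1, num_channels, affine=False)   # layer-norm-like over (C, H, W)
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        # Tiny head that predicts three normalizer weights per sample from pooled features.
        self.fc = nn.Linear(num_channels, 3)

    def forward(self, x):
        w = F.softmax(self.fc(x.mean(dim=(2, 3))), dim=1)                       # (N, 3), per-sample
        outs = torch.stack([self.inorm(x), self.lnorm(x), self.bn(x)], dim=1)   # (N, 3, C, H, W)
        y = (w[:, :, None, None, None] * outs).sum(dim=1)
        return self.gamma * y + self.beta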
[dataset, outperforms, work, multiple, three, attention] [feature, table, ping, semantic, backbone, leading, including, improvement, segmentation, employed] [model, input, ruimao, generalization] [convolutional, dynamic, proposed, method, figure, combination, noisy, adopted, channel] [image, learn, exemplar, distinct, inception, learns] [normalization, learning, training, layer, imagenet, number, set, network, deep, batch, webvision, performance, validation, compared, accuracy, neural, rate, classification, function, standard, ratio, switchable, normalize, calculate, data, replace, group, shufflenet, sample, computational, size, dimension, chw, experiment, averaged, zhanglin, scheme, parameter, best, ordinary, small, average] [single]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Ruimao and Peng, Zhanglin and Wu, Lingyun and Li, Zhen and Luo, Ping},
  title = {Exemplar Normalization for Learning Deep Representation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Imitative Non-Autoregressive Modeling for Trajectory Forecasting and Imputation
Mengshi Qi, Jie Qin, Yu Wu, Yi Yang


Trajectory forecasting and imputation are pivotal steps towards understanding the movement of humans and objects, which are quite challenging since the future trajectories and missing values in a temporal sequence are full of uncertainties, and the spatio-temporal contextual correlation is hard to model. Yet, the relevance between sequence prediction and imputation is disregarded by existing approaches. To this end, we propose a novel imitative non-autoregressive modeling method to simultaneously handle the trajectory prediction task and the missing value imputation task. Specifically, our framework adopts an imitation learning paradigm, which contains a recurrent conditional variational autoencoder (RC-VAE) as a demonstrator, and a non-autoregressive transformation model (NART) as a learner. By jointly optimizing the two models, RC-VAE can predict the future trajectory and capture the temporal relationship in the sequence to supervise the NART learner. As a result, NART learns from the demonstrator and imputes the missing values in a non-autoregressive manner. We conduct extensive experiments on three popular datasets, and the results show that our model achieves state-of-the-art performance across all the datasets.
[imitation, sequence, trajectory, prediction, time, future, nart, action, temporal, demonstrator, basketball, recurrent, dataset, modeling, forecasting, policy, observed, hidden, nonautoregressive, predict, impute, decoder, lstm, state, rnn, movement, relevance, history, step, social, yisong, three, previous, predicting] [framework, module, tracking, adopt, propose, denotes] [model, refers, adversarial, masking, ball] [proposed, figure, method, motion, supervise, prior, utilized, based] [missing, imputation, generative, variational, generated, imitative, latent, autoencoder, generate, conditional, loss, generating] [learning, distribution, data, training, arxiv, preprint, performance, set, denote, regularization, deep, neural, learner, task, discrete] [approach, ground, transformation, capture, continuous, position]
@InProceedings{Qi_2020_CVPR,
  author = {Qi, Mengshi and Qin, Jie and Wu, Yu and Yang, Yi},
  title = {Imitative Non-Autoregressive Modeling for Trajectory Forecasting and Imputation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text
Difei Gao, Ke Li, Ruiping Wang, Shiguang Shan, Xilin Chen


Answering questions that require reading texts in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams. To overcome this difficulty, only resorting to pre-trained word embedding models is far from enough. A desired model should utilize the rich information in multiple modalities of the image to help understand the meaning of scene texts, e.g., the prominent text on a bottle is most likely to be the brand. Following this idea, we propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively. Then, we introduce three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, so as to refine the features of nodes. The updated nodes have better features for the downstream question answering module. Experimental evaluations show that our MM-GNN represents the scene texts better and obviously facilitates the performances on two VQA tasks that require reading scene texts.
[visual, graph, question, ocr, attention, answering, node, vqa, three, textvqa, text, numeric, recognition, answer, reasoning, aggregator, word, gnn, context, language, token, modality, dataset, represent, natural, vocabulary, attended, passing, prediction, message, bert, semantics, infer, multimodal, encode] [semantic, feature, bounding, propose, table, refine, module] [model, type] [ieee, pattern, neighboring, proposed, method] [image, representation, utilize] [neural, processing, learning, better, network, updated, requires, set, number, accuracy, indicates] [conference, scene, vision, computer, international, provided]
@InProceedings{Gao_2020_CVPR,
  author = {Gao, Difei and Li, Ke and Wang, Ruiping and Shan, Shiguang and Chen, Xilin},
  title = {Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
StereoGAN: Bridging Synthetic-to-Real Domain Gap by Joint Optimization of Domain Translation and Stereo Matching
Rui Liu, Chengxi Yang, Wenxiu Sun, Xiaogang Wang, Hongsheng Li


Large-scale synthetic datasets are beneficial to stereo matching but usually introduce known domain bias. Although unsupervised image-to-image translation networks represented by CycleGAN show great potential in dealing with the domain gap, it is non-trivial to generalize this method to stereo matching due to the problem of pixel distortion and stereo mismatch after translation. In this paper, we propose an end-to-end training framework with domain translation and stereo matching networks to tackle this challenge. First, joint optimization between domain translation and stereo matching networks in our end-to-end framework makes the former facilitate the latter to the maximum extent. Second, this framework introduces two novel losses, i.e., bidirectional multi-scale feature re-projection loss and correlation consistency loss, to help translate all synthetic stereo images into realistic ones as well as maintain epipolar constraints. The effective combination of the above two contributions leads to impressive stereo-consistent translation and disparity estimation accuracy. In addition, a mode seeking regularization term is added to endow the synthetic-to-real translation results with higher fine-grained diversity. Extensive experiments demonstrate the effectiveness of the proposed framework on bridging the synthetic-to-real domain gap for stereo matching.
[recognition, dataset, driving, time, evaluation, bidirectional] [feature, correlation, framework, propose, map, table, semantic, extra, ablation] [great, adversarial, model, datasets, help] [disparity, proposed, ieee, pattern, method, figure, epe, dispnet, pixel, warping] [domain, translation, real, synthetic, loss, image, unsupervised, adaptation, consistency, mode, seeking, train, gap, cyclegan, introduce, translated, cycle, synthia, corresponding] [network, training, deep, learning, inference, neural, data, optimization, better, set, large, reduce, performance, machine, regularization, problem] [stereo, matching, conference, computer, vision, joint, left, international, novel, noc, estimation, cost, gwcnet]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Rui and Yang, Chengxi and Sun, Wenxiu and Wang, Xiaogang and Li, Hongsheng},
  title = {StereoGAN: Bridging Synthetic-to-Real Domain Gap by Joint Optimization of Domain Translation and Stereo Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Supervised Domain-Aware Generative Network for Generalized Zero-Shot Learning
Jiamin Wu, Tianzhu Zhang, Zheng-Jun Zha, Jiebo Luo, Yongdong Zhang, Feng Wu


Generalized Zero-Shot Learning (GZSL) aims at recognizing both seen and unseen classes by constructing correspondence between visual and semantic embedding. However, existing methods have severely suffered from the strong bias problem, where unseen instances in the target domain tend to be recognized as seen classes in the source domain. To address this issue, we propose an end-to-end Self-supervised Domain-aware Generative Network (SDGN) by integrating self-supervised learning into a feature generating model for unbiased GZSL. The proposed SDGN model enjoys several merits. First, we design a cross-domain feature generating module to synthesize samples with high fidelity based on class embeddings, which involves a novel target domain discriminator to preserve the domain consistency. Second, we propose a self-supervised learning module to investigate inter-domain relationships, where a set of anchors are introduced as a bridge between seen and unseen categories. In the shared space, we pull the distribution of the target domain away from the source domain, and obtain domain-aware features with high discriminative power for both seen and unseen classes. To our best knowledge, this is the first work to introduce self-supervised learning into GZSL as a learning guidance. Extensive experimental results on five standard benchmarks demonstrate that our model performs favorably against state-of-the-art GZSL methods.
[visual, embedding, dataset, embeddings] [feature, module, semantic, sun, propose, denotes, bernt, including, highest] [model, datasets, strong] [method, figure, based, analysis, ieee, high] [target, source, unseen, domain, synthesized, sdgn, generative, generalized, generating, gzsl, transductive, loss, attribute, cub, synthetic, image, harmonic, slm, zeynep, discriminator, zsl, generator, learn, discriminative, mapping, synthesize, utilize, idea, gxe, xrt, bridge, shared, introduce, corresponding] [learning, accuracy, data, class, bias, triplet, training, network, space, unlabeled, distribution, performance, baseline, best, similarity, mining, inductive, machine, power, compared] [reconstructed, reconstruct]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Jiamin and Zhang, Tianzhu and Zha, Zheng-Jun and Luo, Jiebo and Zhang, Yongdong and Wu, Feng},
  title = {Self-Supervised Domain-Aware Generative Network for Generalized Zero-Shot Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Sparse Layered Graphs for Multi-Object Segmentation
Niels Jeppesen, Anders N. Christensen, Vedrana A. Dahl, Anders B. Dahl


We introduce the novel concept of a Sparse Layered Graph (SLG) for s-t graph cut segmentation of image data. The concept is based on the widely used Ishikawa layered technique for multi-object segmentation, which allows explicit object interactions, such as containment and exclusion with margins. However, the spatial complexity of the Ishikawa technique limits its use for many segmentation problems. To solve this issue, we formulate a general method for adding containment and exclusion interaction constraints to layered graphs. Given some prior knowledge, we can create a SLG, which is often orders of magnitude smaller than traditional Ishikawa graphs, with identical segmentation results. This allows us to solve many problems that could previously not be solved using general graph cut algorithms. We then propose three algorithms for further reducing the spatial complexity of SLGs, by using ordered multi-column graphs. In our experiments, we show that SLGs, and in particular ordered multi-column SLGs, can produce high-quality segmentation results using extremely simple data terms. We also show the scalability of ordered multi-column SLGs, by segmenting a high-resolution volume with several hundred interacting objects.
[graph, interaction, pair, ordered, time, node, three] [object, segmentation, add, segmenting, segmented, redundant] [adding] [exclusion, method, ieee, pattern, remove, figure, column, based, spatial, prior, result, created] [image] [number, containment, algorithm, ishikawa, margin, unlabelled, minimum, size, inner, outer, energy, layer, sampled, data, reduce, general, set, large, maximum, simple, qpbo, accuracy, sample, max, cut, complexity, find, remains, machine, optimal, pairwise, sampling] [layered, solve, dense, approach, computer, distance, volume, interacting, conference, term, solving, surface, position, single, vision, sparse, structure, neighborhood]
@InProceedings{Jeppesen_2020_CVPR,
  author = {Jeppesen, Niels and Christensen, Anders N. and Dahl, Vedrana A. and Dahl, Anders B.},
  title = {Sparse Layered Graphs for Multi-Object Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Visual-Semantic Matching by Exploring High-Order Attention and Distraction
Yongzhi Li, Duo Zhang, Yadong Mu


Cross-modality semantic matching is a vital task in computer vision and has attracted increasing attention in recent years. Existing methods mainly explore object-based alignment between image objects and text words. In this work, we address this task from two previously-ignored aspects: high-order semantic information (e.g., object-predicate-subject triplet, object-attribute pair) and visual distraction (i.e., despite the high relevance to the textual query, images may also contain many prominent distracting objects or visual relations). Specifically, we build scene graphs for both visual and textual modalities. Our technical contributions are two-fold: firstly, we formulate the visual-semantic matching task as an attention-driven cross-modality scene graph matching problem. Graph convolutional networks (GCNs) are used to extract high-order information from two scene graphs. A novel cross-graph attention mechanism is proposed to contextually reweigh graph elements and calculate the inter-graph similarity; secondly, some top-ranked samples are in fact false matches due to the co-occurrence of both highly-relevant and distracting information. We devise an information-theoretic measure for estimating semantic distraction and re-ranking the initial retrieval results. Comprehensive experiments and ablation studies on two large public datasets (MS-COCO and Flickr30K) demonstrate the superiority of the proposed method and the effectiveness of both high-order attention and distraction.
[graph, attention, node, retrieval, sentence, embedding, distraction, visual, relation, text, three, embeddings, matt, textual, represent, work, distracting, language, previous, scorem, vdist, red, explore, querying, explored] [semantic, object, global, feature, table, false, score, ablation, key] [model, query, effectively] [figure, proposed, based, convolutional, color, reference, green, method, column, adopted] [image, attribute, cross, corresponding, representation, xsi, common, row, loss] [set, similarity, learning, neural, data, calculate, deep, rij, arxiv, preprint, large, negative, performance, test, task, search, triplet, network, ranking, matrix] [matching, scene, computer, initial, demonstrate, local]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yongzhi and Zhang, Duo and Mu, Yadong},
  title = {Visual-Semantic Matching by Exploring High-Order Attention and Distraction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
End-to-End 3D Point Cloud Instance Segmentation Without Detection
Haiyong Jiang, Feilong Yan, Jianfei Cai, Jianmin Zheng, Jun Xiao


3D instance segmentation plays a predominant role in environment perception of robotics and augmented reality. Many deep learning based methods have been presented recently for this task. These methods rely on either a detection branch to propose objects or a grouping step to assemble same-instance points. However, detection based methods do not ensure a consistent instance label for each point, while the grouping step requires parameter-tuning and is computationally expensive. In this paper, we introduce a novel framework to enable end-to-end instance segmentation without detection and a separate step of grouping. The core idea is to convert instance segmentation to a candidate assignment problem. At first, a set of instance candidates is sampled. Then we propose an assignment module for candidate assignment and a suppression module to eliminate redundant candidates. A mapping between instance labels and instance candidates is further sought to construct an instance grouping loss for the network training. Experimental results demonstrate that our method is more effective and efficient than previous approaches.
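A small sketch of the candidate-assignment view: predicted candidate masks can be put into one-to-one correspondence with ground-truth instances by solving an assignment problem, here with a generic IoU-based cost and the Hungarian algorithm (the paper's actual assignment and loss construction may differ):

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_candidates(candidate_masks, gt_masks):
    # candidate_masks: (K, P) boolean, gt_masks: (M, P) boolean, P = number of points
    K, M = len(candidate_masks), len(gt_masks)
    cost = np.zeros((K, M))
    for i in range(K):
        for j in range(M):
            inter = np.logical_and(candidate_masks[i], gt_masks[j]).sum()
            union = np.logical_or(candidate_masks[i], gt_masks[j]).sum()
            cost[i, j] = 1.0 - inter / max(union, 1)
    rows, cols = linear_sum_assignment(cost)    # minimal-cost one-to-one assignment
    return list(zip(rows, cols))                # (candidate_index, instance_index) pairs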
[embedding, step, dataset, long, prediction, time, work] [instance, segmentation, semantic, assignment, grouping, centroid, suppression, feature, module, predicted, redundant, mask, asis, detection, denotes, backbone, refined, sgpn, jsis, object, mcov, scenenn, propose] [testing, great] [ieee, method, pattern, cvpr, june, figure, proposed, based, comparison] [loss, mapping, minimizing, firstly, eliminate] [candidate, number, network, similarity, label, set, sampling, matrix, learning, problem, algorithm, process, neural, deep, data, random, group, clustering, training, processing, learned, objective] [point, conference, computer, vision, distance, cloud, ground, mlp, truth, pointnet, additional, scene, lcd, indoor, directly]
@InProceedings{Jiang_2020_CVPR,
  author = {Jiang, Haiyong and Yan, Feilong and Cai, Jianfei and Zheng, Jianmin and Xiao, Jun},
  title = {End-to-End 3D Point Cloud Instance Segmentation Without Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Adversarial Decomposition: A Unified Framework for Separating Superimposed Images
Zhengxia Zou, Sen Lei, Tianyang Shi, Zhenwei Shi, Jieping Ye


Separating individual image layers from a single mixed image has long been an important but challenging task. We propose a unified framework named "deep adversarial decomposition" for single superimposed image separation. Our method deals with both linear and non-linear mixtures under an adversarial training paradigm. Considering the ambiguity of layer separation (given a single mixed input, there could be an infinite number of possible solutions), we introduce a "Separation-Critic", a discriminative network which is trained to identify whether the output layers are well-separated and thus further improves the layer separation. We also introduce a "crossroad L1" loss function, which computes the distance between the unordered outputs and their references in a crossover manner so that the training can be well-instructed with pixel-wise supervision. Experimental results suggest that our method significantly outperforms other popular image separation frameworks. Without specific tuning, our method achieves state-of-the-art results on multiple computer vision tasks, including image deraining, photo reflection removal, and image shadow removal.
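For the two-layer case, a crossroad-style L1 loss can be sketched as comparing both possible pairings of the unordered outputs with the two reference layers and keeping the cheaper one; this is a simplified reading of the loss described above, with illustrative names:

import torch
import torch.nn.functional as F

def crossroad_l1(out_a, out_b, ref_1, ref_2):
    # Try both assignments of the unordered outputs to the reference layers.
    straight = F.l1_loss(out_a, ref_1) + F.l1_loss(out_b, ref_2)
    crossed = F.l1_loss(out_a, ref_2) + F.l1_loss(out_b, ref_1)
    return torch.minimum(straight, crossed)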
[dataset, recognition, three, evaluation, multiple] [table, including, framework, unified, detection, propose, apply] [adversarial, quality, trained, input, clean, datasets, model, experimental] [ieee, method, pattern, reflection, removal, separation, rain, deraining, superimposed, analysis, figure, june, output, based, perceptual, comparison, separating, separator, kurtosis, proposed, prenet, introduced] [image, shadow, loss, introduce, generative, train, critic, crossroad, mixing, lsun, consists, encourages, synthesized] [training, mixed, set, deep, learning, layer, test, standard, machine, mixture, better, random, network, function, neural, follow, evaluate, linear, group, lower] [computer, conference, vision, single, international, compare, additional, ground, truth, decomposition, distance]
@InProceedings{Zou_2020_CVPR,
  author = {Zou, Zhengxia and Lei, Sen and Shi, Tianyang and Shi, Zhenwei and Ye, Jieping},
  title = {Deep Adversarial Decomposition: A Unified Framework for Separating Superimposed Images},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Differentiable Adaptive Computation Time for Visual Reasoning
Cristobal Eyzaguirre, Alvaro Soto


This paper presents a novel attention-based algorithm for achieving adaptive computation called DACT, which, unlike existing ones, is end-to-end differentiable. Our method can be used in conjunction with many networks; in particular, we study its application to the widely known MAC architecture, obtaining a significant reduction in the number of recurrent steps needed to achieve similar accuracies, therefore improving its performance-to-computation ratio. Furthermore, we show that by increasing the maximum number of steps used, we surpass the accuracy of even our best non-adaptive MAC in the CLEVR dataset, demonstrating that our approach is able to control the number of steps without significant loss of performance. Additional advantages provided by our approach include considerably improving interpretability by discarding useless steps and providing more insights into the underlying reasoning process. Finally, we present adaptive computation as an equivalent to an ensemble of models, similar to a mixture of experts formulation. Both the code and the configuration files for our experiments are made available to support further research in this area.
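A loose sketch in the spirit of differentiable adaptive computation: each reasoning step emits an output and a halting score, and the final prediction is a probability-weighted mixture of the per-step outputs, so the effective number of steps is learned without a hard, non-differentiable stop. This is a generic ACT-style formulation, not DACT's exact rule, and the names are assumptions:

import torch

def adaptive_answer(step_outputs, halt_logits):
    # step_outputs: (T, num_classes) logits emitted at each reasoning step
    # halt_logits:  (T,) scores; sigmoid gives a per-step halting probability
    p = torch.sigmoid(halt_logits)
    remain = torch.cumprod(torch.cat([torch.ones(1), 1 - p[:-1]]), dim=0)
    weights = remain * p                   # probability of stopping exactly at step t
    weights = weights / weights.sum()      # renormalize so the weights sum to 1
    return (weights[:, None] * step_outputs).sum(dim=0)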
[question, answer, visual, attention, step, recurrent, current, future, time, clevr, reasoning, mechanism, dataset, previous, rnn, language, provide] [final, main, module, subsequent] [model, improving, ensemble, improve, adding, change, interpretability, case, suitable, input, trained] [adaptive, figure, output, method, achieved, intermediate, existing, residual] [image, control, loss] [computation, number, mac, computational, dact, performance, accuracy, processing, complexity, halting, maximum, neural, probability, class, learning, architecture, ponder, needed, fixed, average, achieve, efficiency, sum, algorithm, deep, network, training, applied, increase, respect, reducing, variant, best, adapt] [approach, cost, differentiable, computer, conference, vision, require, additional, pipeline, complex, full, formulation]
@InProceedings{Eyzaguirre_2020_CVPR,
  author = {Eyzaguirre, Cristobal and Soto, Alvaro},
  title = {Differentiable Adaptive Computation Time for Visual Reasoning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DeepLPF: Deep Local Parametric Filters for Image Enhancement
Sean Moran, Pierre Marza, Steven McDonagh, Sarah Parisot, Gregory Slabaugh


Digital artists often improve the aesthetic quality of digital photographs through manual retouching. Beyond global adjustments, professional image editing programs provide local adjustment tools operating on specific parts of an image. Options include parametric (graduated, radial filters) and unconstrained brush tools. These highly expressive tools enable a diverse set of local image enhancements. However, their use can be time consuming, and requires artistic capability. State-of-the-art automated image enhancement approaches typically focus on learning pixel-level or global enhancements. The former can be noisy and lack interpretability, while the latter can fail to capture fine-grained adjustments. In this paper, we introduce a novel approach to automatically enhance images using learned spatially local filters of three different types (Elliptical Filter, Graduated Filter, Polynomial Filter). We introduce a deep neural network, dubbed Deep Local Parametric Filters (DeepLPF), which regresses the parameters of these spatially localized filters that are then automatically applied to enhance the image. DeepLPF provides a natural form of model regularization and enables interpretable, intuitive adjustments that lead to visually pleasing results. We report on multiple benchmarks and show that DeepLPF produces state-of-the-art performance on two variants of the MIT-Adobe 5k dataset, often using a fraction of the parameters required for competing methods.
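As a very simplified, hypothetical example of one of the filter types, a graduated filter can be thought of as a smooth ramp of per-pixel scaling factors applied across the image; the hand-specified vertical ramp below is only illustrative, whereas DeepLPF regresses richer filter parameters with a network:

import numpy as np

def graduated_filter(image, top_scale=1.2, bottom_scale=0.9):
    # image: (H, W, 3) float array in [0, 1]
    h = image.shape[0]
    ramp = np.linspace(top_scale, bottom_scale, h)[:, None, None]   # (H, 1, 1) scaling ramp
    return np.clip(image * ramp, 0.0, 1.0)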
[three, multiple, provide, automatic, work, stream, prediction] [global, propose, backbone, map, table, feature, challenging] [model, quality, input, digital, improve] [enhancement, graduated, deeplpf, elliptical, ieee, figure, adjustment, colour, channel, output, cubic, method, intensity, pixel, deepupe, pattern, dpe, convolutional, ssim, sid, spatially, contrast, block, capable, chen, enhanced] [image, photo, loss, learn, editing, consists, brush] [filter, deep, learning, parameter, architecture, network, training, function, neural, manual, set, learned, processing, scaling, performance, number, layer, considered, capacity] [local, parametric, polynomial, computer, conference, approach, vision, single, ground, form, human, additional, acm, rgb]
@InProceedings{Moran_2020_CVPR,
  author = {Moran, Sean and Marza, Pierre and McDonagh, Steven and Parisot, Sarah and Slabaugh, Gregory},
  title = {DeepLPF: Deep Local Parametric Filters for Image Enhancement},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Instance Credibility Inference for Few-Shot Learning
Yikai Wang, Chengming Xu, Chen Liu, Li Zhang, Yanwei Fu


Few-shot learning (FSL) aims to recognize new objects with extremely limited training data for each category. Previous efforts tackle this problem either by leveraging the meta-learning paradigm or by novel principles in data augmentation to alleviate this extremely data-scarce problem. In contrast, this paper presents a simple statistical approach, dubbed Instance Credibility Inference (ICI), to exploit the distribution support of unlabeled instances for few-shot learning. Specifically, we first train a linear classifier with the labeled few-shot examples and use it to infer the pseudo-labels for the unlabeled data. To measure the credibility of each pseudo-labeled instance, we then propose to solve another linear regression hypothesis by increasing the sparsity of the incidental parameters and rank the pseudo-labeled instances with their sparsity degree. We select the most trustworthy pseudo-labeled instances alongside the labeled examples to re-train the linear classifier. This process is iterated until all the unlabeled samples are included in the expanded training set, i.e., the pseudo-labels for the unlabeled data pool have converged. Extensive experiments under two few-shot settings show that our simple approach can establish new state-of-the-art results on four widely used few-shot learning benchmark datasets including miniImageNet, tieredImageNet, CIFAR-FS, and CUB. Our code is available at: https://github.com/Yikai-Wang/ICI-FSL
[dataset, tpn] [feature, instance, table, regression, effectiveness, benchmark, category, extractor] [model, query, datasets, trained, robustness, expanded] [proposed, figure] [train, transductive, corresponding, learn] [learning, unlabeled, data, ici, set, classifier, training, linear, support, credibility, labeled, select, class, inference, algorithm, arxiv, preprint, simple, deep, performance, regularization, label, accuracy, distribution, sparsity, compared, neural, base, dimension, reduction, statistical, trustworthy, process, sample, network, incidental, subset, logistic, denoted, miniimagenet, dimensionality, rank, space, ssfsl, better, vector, cnovel, classification, number, tfsl, measure, strategy] [novel, initial, compare, solve, limited, hypothesis, approach]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Yikai and Xu, Chengming and Liu, Chen and Zhang, Li and Fu, Yanwei},
  title = {Instance Credibility Inference for Few-Shot Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning From Web Data With Self-Organizing Memory Module
Yi Tu, Li Niu, Junjie Chen, Dawei Cheng, Liqing Zhang


Learning from web data has attracted lots of research interest in recent years. However, crawled web images usually have two types of noises, label noise and background noise, which induce extra difficulties in utilizing them effectively. Most existing methods either rely on human supervision or ignore the background noise. In this paper, we propose a novel method, which is capable of handling these two types of noises together, without the supervision of clean images in the training stage. Particularly, we formulate our method under the framework of multi-instance learning by grouping ROIs (i.e., images and their region proposals) from the same category into bags. ROIs in each bag are assigned with different weights based on the representative/discriminative scores of their nearest clusters, in which the clusters and their scores are obtained via our designed memory module. Our memory module could be naturally integrated with the classification module, leading to an end-to-end trainable system. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our method.
[dataset, attention, multiple, three, abhinav, visual, trainable] [key, web, module, bag, background, region, category, roi, slot, somnet, table, effectiveness, map, cnn, center, object, instance, denotes, weakly, framework] [noise, clean, representative, model, datasets, robust] [method, noisy, based, figure, proposed, result, handling, designed] [cluster, image, discriminative, learn, supervised, corresponding, unsupervised, curriculum, crawled, idea, expect, loss] [memory, learning, training, label, deep, data, number, set, denote, update, performance, webvision, classification, clustering, neural, network, weighted, weight, algorithm, total, approximate, prototypical, best] [directly]
@InProceedings{Tu_2020_CVPR,
  author = {Tu, Yi and Niu, Li and Chen, Junjie and Cheng, Dawei and Zhang, Liqing},
  title = {Learning From Web Data With Self-Organizing Memory Module},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TransMatch: A Transfer-Learning Scheme for Semi-Supervised Few-Shot Learning
Zhongjie Yu, Lin Chen, Zhongwei Cheng, Jiebo Luo


The successful application of deep learning to many visual recognition tasks relies heavily on the availability of a large amount of labeled data which is usually expensive to obtain. The few-shot learning problem has attracted increasing attention from researchers for building a robust model upon only a few labeled samples. Most existing works tackle this problem under the meta-learning framework by mimicking the few-shot learning task with an episodic training strategy. In this paper, we propose a new transfer-learning framework for semi-supervised few-shot learning to fully utilize the auxiliary information from labeled base-class data and unlabeled novel-class data. The framework consists of three components: 1) pre-training a feature extractor on base-class data; 2) using the feature extractor to initialize the classifier weights for the novel classes; and 3) further updating the model with a semi-supervised learning method. Under the proposed framework, we develop a novel method for semi-supervised few-shot learning called TransMatch by instantiating the three components with imprinting and MixMatch. Extensive experiments on two popular benchmark datasets for few-shot learning, CUB-200-2011 and miniImageNet, demonstrate that our proposed method can effectively utilize the auxiliary information from labeled base-class data and unlabeled novel-class data to significantly improve the accuracy of few-shot learning task, and achieve new state-of-the-art results.
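The weight-imprinting step (component 2 of the framework) can be sketched as initializing each novel-class classifier weight from the L2-normalized mean of that class's few support features; the variable names below are illustrative, and the feature extractor is assumed to be pre-trained and frozen at this point:

import torch
import torch.nn.functional as F

def imprint_weights(support_features, support_labels, num_novel_classes):
    # support_features: (N, D) features from the pre-trained extractor; support_labels: (N,)
    weights = []
    for c in range(num_novel_classes):
        class_feats = support_features[support_labels == c]
        proto = F.normalize(class_feats.mean(dim=0), dim=0)   # unit-norm class prototype
        weights.append(proto)
    return torch.stack(weights)   # (num_novel_classes, D), used to initialize the classifier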
[work, dataset] [feature, framework, table, extractor, fully] [model, auxiliary, datasets] [method, based, proposed, existing, conventional] [utilize, learn, train, consistency] [learning, unlabeled, labeled, data, transmatch, classifier, base, mixmatch, imprinting, performance, set, neural, large, training, soft, weight, deep, amount, semisupervised, fewshot, accuracy, good, class, network, processing, test, classification, regularization, distractor, episodic, number, imprint, algorithm, entropy, popular, family, randomly, prototypical, label, optimization, dbase, miniimagenet, rate, problem, task, denote, best, metalearning, achieve, support, follow, better] [novel, directly, compare]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Zhongjie and Chen, Lin and Cheng, Zhongwei and Luo, Jiebo},
  title = {TransMatch: A Transfer-Learning Scheme for Semi-Supervised Few-Shot Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning the Redundancy-Free Features for Generalized Zero-Shot Object Recognition
Zongyan Han, Zhenyong Fu, Jian Yang


Zero-shot object recognition or zero-shot learning aims to transfer the object recognition ability among semantically related categories, such as fine-grained animal or bird species. However, the images of different fine-grained objects tend to merely exhibit subtle differences in appearance, which severely deteriorates zero-shot object recognition. To reduce the superfluous information in the fine-grained objects, in this paper, we propose to learn redundancy-free features for generalized zero-shot learning. We achieve our motivation by projecting the original visual features into a new (redundancy-free) feature space and then restricting the statistical dependence between these two feature spaces. Furthermore, we require the projected features to keep and even strengthen the category relationship in the redundancy-free feature space. In this way, we can remove the redundant information from the visual features without losing the discriminative information. We extensively evaluate the performance on four benchmark datasets. The results show that our redundancy-free feature based generalized zero-shot learning (RFF-GZSL) approach can outperform the state of the art, often by a large margin.
[visual, embedding, recognition] [feature, semantic, object, map, sun, final, bernt] [original, model, adversarial, evaluated] [method, figure, conventional, proposed, traditional, based, science, remove] [gzsl, unseen, generation, generalized, zsl, learn, generator, synthetic, variational, discriminative, mapping, redundancyfree, image, generative, real, synthesized, awa, zeynep, dependence, conditional, synthesize, flo, loss, discriminator, harmonic, cub, zhenyong, bird, conditioned, supervised, generated] [learning, space, data, class, function, classification, network, deep, redundancy, dimension, set, bound, training, performance, labeled, problem, test, distribution, mutual, accuracy, achieve, evaluate, upper, arxiv, preprint, imbalance, classifier, objective, number, mapped] [descriptor, approach, defined, compare]
@InProceedings{Han_2020_CVPR,
  author = {Han, Zongyan and Fu, Zhenyong and Yang, Jian},
  title = {Learning the Redundancy-Free Features for Generalized Zero-Shot Object Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Topological SLAM for Visual Navigation
Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, Saurabh Gupta


This paper studies the problem of image-goal navigation which involves navigating to the location indicated by a goal image in a novel previously unseen environment. To tackle this problem, we design topological representations for space that effectively leverage semantics and afford approximate geometric reasoning. At the heart of our representations are nodes with associated semantic features, that are interconnected using coarse geometric information. We describe supervised learning-based algorithms that can build, maintain and use such representations under noisy actuation. Experimental study in visually and physically realistic simulation suggests that our method builds effective representations that capture structural regularities and efficiently solve long-horizon navigation problems. We observe a relative improvement of more than 50% over existing methods that study this task.
[goal, node, agent, graph, navigation, policy, prediction, current, exploration, visual, explorable, sequential, time, action, work, saurabh, build, difficulty, abhinav, navigate, actuation] [semantic, score, map, localization, area, global, edge] [topological, model, noise, overview, trained] [spatial, figure, proposed, based, motion, ieee] [image, target, source, realistic, structural, representation, consists, mapping, supervised] [learning, function, metric, neural, performance, path, ghost, number, training, task, arxiv, preprint, reach, space, sample, active, lead, update, belong, learned] [relative, pose, local, geometric, conference, slam, direction, rgbd, international, novel, localized, distance]
@InProceedings{Chaplot_2020_CVPR,
  author = {Chaplot, Devendra Singh and Salakhutdinov, Ruslan and Gupta, Abhinav and Gupta, Saurabh},
  title = {Neural Topological SLAM for Visual Navigation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
WaveletStereo: Learning Wavelet Coefficients of Disparity Map in Stereo Matching
Menglong Yang, Fangrui Wu, Wei Li


Several stereo matching algorithms based on deep learning have been proposed and have achieved state-of-the-art performance since public large-scale datasets became available. However, it remains difficult to accurately estimate the disparity in smooth regions and detailed regions simultaneously. This paper proposes a novel stereo matching method called WaveletStereo, which learns the wavelet coefficients of the disparity rather than the disparity itself. WaveletStereo consists of several sub-modules, where the low-frequency sub-module generates the low-frequency wavelet coefficients, aiming at learning global context information and handling low-frequency regions such as textureless surfaces well, while the others focus on the details. In addition, a densely connected atrous spatial pyramid block is introduced for better learning of multi-scale image features. Experimental results show the effectiveness of the proposed method, which achieves state-of-the-art performance on the large-scale test dataset Scene Flow.
[connected, recognition, dataset, evaluation, mechanism, work, prediction, context] [map, atrous, pyramid, level, contextual, global, table, effectiveness, refine, module, predicted, groundtruth] [model] [wavelet, disparity, pattern, proposed, ieee, flow, resolution, method, pixel, spatial, densely, convolutional, transform, dmax, based, waveletstereo, figure, adopted, analysis, downsampling, dilated, conv, journal] [image, learn, train, representation, loss] [learning, deep, training, test, network, algorithm, softmax, neural, machine, performance, architecture, approximation, max, accuracy, operation, weighted] [stereo, cost, matching, vision, computer, conference, scene, kitti, volume, international, error, compare, estimation, reconstruction, smoothness, left]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Menglong and Wu, Fangrui and Li, Wei},
  title = {WaveletStereo: Learning Wavelet Coefficients of Disparity Map in Stereo Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Robust Superpixel-Guided Attentional Adversarial Attack
Xiaoyi Dong, Jiangfan Han, Dongdong Chen, Jiayang Liu, Huanyu Bian, Zehua Ma, Hongsheng Li, Xiaogang Wang, Weiming Zhang, Nenghai Yu


Deep Neural Networks are vulnerable to adversarial samples, which can fool classifiers by adding small perturbations onto the original image. Since the pioneering optimization-based adversarial attack method, many follow-up methods have been proposed in the past several years. However, most of these methods add perturbations in a "pixel-wise" and "global" way. Firstly, because of the contradiction between the local smoothness of natural images and the noisy property of adversarial perturbations, the "pixel-wise" approach makes these methods not robust to image-processing-based defense methods and steganalysis-based detection methods. Secondly, we find that adding perturbations to the background is less useful than adding them to the salient object, so the "global" approach is also not optimal. Based on these two considerations, we propose the first robust superpixel-guided attentional adversarial attack method. Specifically, the adversarial perturbations are only added to the salient regions and are guaranteed to be the same within each superpixel. Through extensive experiments, we demonstrate that our method preserves its attack ability even in this highly constrained modification space. More importantly, compared to existing methods, it is significantly more robust to image-processing-based defenses and steganalysis-based detection.
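A rough sketch of the two constraints described above, assuming SLIC superpixels from scikit-image and a stand-in saliency mask: the perturbation is averaged within each superpixel and zeroed outside salient regions. A real attack would obtain the raw perturbation from a gradient-based method.

import numpy as np
from skimage.segmentation import slic

def project_perturbation(delta, image, saliency_mask, n_segments=200):
    """Make `delta` constant within each superpixel and zero outside salient regions."""
    segments = slic(image, n_segments=n_segments)
    projected = np.zeros_like(delta)
    for s in np.unique(segments):
        m = segments == s
        projected[m] = delta[m].mean(axis=0)         # uniform within the superpixel
    return projected * saliency_mask[..., None]      # restrict to salient pixels

image = np.random.rand(224, 224, 3)
delta = np.random.uniform(-8 / 255, 8 / 255, size=image.shape)
saliency = np.zeros((224, 224)); saliency[64:160, 64:160] = 1.0   # toy salient region
adv = np.clip(image + project_perturbation(delta, image, saliency), 0.0, 1.0)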
@InProceedings{Dong_2020_CVPR,
  author = {Dong, Xiaoyi and Han, Jiangfan and Chen, Dongdong and Liu, Jiayang and Bian, Huanyu and Ma, Zehua and Li, Hongsheng and Wang, Xiaogang and Zhang, Weiming and Yu, Nenghai},
  title = {Robust Superpixel-Guided Attentional Adversarial Attack},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
BEDSR-Net: A Deep Shadow Removal Network From a Single Document Image
Yun-Hsuan Lin, Wen-Chin Chen, Yung-Yu Chuang


Removing shadows in document images enhances both the visual quality and readability of digital copies of documents. Most existing shadow removal algorithms for document images use hand-crafted heuristics and are often not robust to documents with different characteristics. This paper proposes the Background Estimation Document Shadow Removal Network (BEDSR-Net), the first deep network specifically designed for document image shadow removal. For taking advantage of specific properties of document images, a background estimation module is designed for extracting the global background color of the document. During the process of estimating the background color, the module also learns information about the spatial distribution of background and non-background pixels. We encode such information into an attention map. With the estimated global background color and attention map, the shadow removal network can better recover the shadow-free image. We also show that the model trained on synthetic images remains effective for real photos, and provide a large set of synthetic shadow images of documents along with their corresponding shadow-free images and shadow masks. Extensive quantitative and qualitative experiments on several benchmarks show that the BEDSR-Net outperforms existing methods in enhancing both the visual quality and readability of document images.
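A structural sketch (PyTorch) of the described pipeline, with placeholder layer sizes: one branch estimates the global background color and an attention map, which are concatenated with the input image and fed to the removal network. The module internals are assumptions, not the actual BEDSR-Net architecture.

import torch
import torch.nn as nn

class BackgroundEstimation(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.attention = nn.Conv2d(16, 1, 1)          # spatial attention map
        self.color_head = nn.Linear(16, 3)            # global background color

    def forward(self, x):
        f = self.features(x)
        attn = torch.sigmoid(self.attention(f))
        color = self.color_head(f.mean(dim=(2, 3)))   # global average pooling
        return color, attn

class ShadowRemoval(nn.Module):
    def __init__(self):
        super().__init__()
        # input image (3) + attention map (1) + broadcast background color (3)
        self.net = nn.Sequential(nn.Conv2d(7, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 3, 3, padding=1))

    def forward(self, x, color, attn):
        color_map = color[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        return self.net(torch.cat([x, attn, color_map], dim=1))

x = torch.rand(1, 3, 256, 256)
color, attn = BackgroundEstimation()(x)
shadow_free = ShadowRemoval()(x, color, attn)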
[attention, dataset, visual, recognition, natural, outperforms, phone, exploring] [background, map, global, module, detection, predicted, table, pooling] [model, input, quality, trained, example, datasets, effective, adversarial, robust, collected] [removal, color, method, figure, ieee, removing, proposed, jung, pattern, analysis, psnr, existing, spatial, result, captured, readability, ssim, designed, recover, quantitative] [shadow, document, image, bako, real, kligler, synthetic, generator, content, specific, discriminator, proposes] [training, performance, network, deep, large, set, paper, compared, learning, better, architecture, requires] [conference, vision, computer, estimation, international, estimated, well, single, estimating, camera, lighting]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Yun-Hsuan and Chen, Wen-Chin and Chuang, Yung-Yu},
  title = {BEDSR-Net: A Deep Shadow Removal Network From a Single Document Image},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-Domain Document Object Detection: Benchmark Suite and Method
Kai Li, Curtis Wigington, Chris Tensmeyer, Handong Zhao, Nikolaos Barmpalios, Vlad I. Morariu, Varun Manjunatha, Tong Sun, Yun Fu


Decomposing images of document pages into high-level semantic regions (e.g., figures, tables, paragraphs), document object detection (DOD) is fundamental for downstream tasks like intelligent document editing and understanding. DOD remains a challenging problem as document objects vary significantly in layout, size, aspect ratio, texture, etc. An additional challenge arises in practice because large labeled training datasets are only available for domains that differ from the target domain. We investigate cross-domain DOD, where the goal is to learn a detector for the target domain using labeled data from the source domain and only unlabeled data from the target domain. Documents from the two domains may vary significantly in layout, language, and genre. We establish a benchmark suite consisting of different types of PDF document datasets that can be utilized for cross-domain DOD model training and evaluation. For each dataset, we provide the page images, bounding box annotations, PDF files, and the rendering layers extracted from the PDF files. Moreover, we propose a novel cross-domain DOD model which builds upon the standard detection model and addresses domain shifts by incorporating three novel alignment modules: Feature Pyramid Alignment (FPA) module, Region Alignment (RA) module and Rendering Layer alignment (RLA) module. Extensive experiments on the benchmark suite substantiate the efficacy of the three proposed modules and the proposed method significantly outperforms the baseline methods. The project page is at https://github.com/kailigo/cddod.
[three, dataset, text, natural, provide, includes, shift] [detection, object, feature, table, region, fpn, map, benchmark, pyramid, module, segmentation, bounding, box, proposal, heading, semantic, foreground, propose, frcnn, mask] [model, datasets, trained] [proposed, figure, method, based, pixel, existing] [document, alignment, domain, pdf, target, source, image, rla, swda, suite, fpa, loss, pubmed, legal, dod, chn, list, content, layout, train, generate, adaptation, extracted] [layer, training, data, learning, problem, labeled, objective, performance, set, deep, standard, baseline, class, task, randomly, binary, vector] [rendering, novel, scene, structure, focal, single, ground]
@InProceedings{Li_2020_CVPR,
  author = {Li, Kai and Wigington, Curtis and Tensmeyer, Chris and Zhao, Handong and Barmpalios, Nikolaos and Morariu, Vlad I. and Manjunatha, Varun and Sun, Tong and Fu, Yun},
  title = {Cross-Domain Document Object Detection: Benchmark Suite and Method},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Explaining Knowledge Distillation by Quantifying the Knowledge
Xu Cheng, Zhefan Rao, Yilan Chen, Quanshi Zhang


This paper presents a method to interpret the success of knowledge distillation by quantifying and analyzing task-relevant and task-irrelevant visual concepts that are encoded in intermediate layers of a deep neural network (DNN). More specifically, three hypotheses are proposed as follows. 1. Knowledge distillation makes the DNN learn more visual concepts than learning from raw data. 2. Knowledge distillation ensures that the DNN is prone to learning various visual concepts simultaneously, whereas, when learning from raw data, the DNN learns visual concepts sequentially. 3. Knowledge distillation yields more stable optimization directions than learning from raw data. Accordingly, we design three types of mathematical metrics to evaluate feature representations of the DNN. In experiments, we diagnosed various DNNs, and the above hypotheses were verified.
[visual, three, dataset, encode] [foreground, object, background, feature, box, det, semantic, table, pascal, voc] [dnn, encoded, nconcept, dnns, quantify, input, dmean, dstd, quanshi, ensures, verify, theory, discarded] [raw, figure, ieee, intermediate, proposed, convolutional, pattern] [image, learn, learns, specific, target] [network, knowledge, student, learning, distillation, baseline, teacher, learned, neural, deep, layer, arxiv, preprint, optimization, data, number, larger, entropy, epoch, classification, quantifying, mathematical, measure, set, indicates, considered, discard, training, discarding] [conference, hypothesis, computer, international, vision, compare, measured]
@InProceedings{Cheng_2020_CVPR,
  author = {Cheng, Xu and Rao, Zhefan and Chen, Yilan and Zhang, Quanshi},
  title = {Explaining Knowledge Distillation by Quantifying the Knowledge},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Exploring Bottom-Up and Top-Down Cues With Attentive Learning for Webly Supervised Object Detection
Zhonghua Wu, Qingyi Tao, Guosheng Lin, Jianfei Cai


Fully supervised object detection has achieved great success in recent years. However, abundant bounding box annotations are needed for training a detector for novel classes. To reduce the human labeling effort, we propose a novel webly supervised object detection (WebSOD) method for novel classes which only requires web images without further annotations. Our proposed method combines bottom-up and top-down cues for novel class detection. Within our approach, we introduce a bottom-up mechanism based on a well-trained fully supervised object detector (i.e., Faster RCNN) as an object region estimator for web images by recognizing the common objectness shared by base and novel classes. With the estimated regions on the web images, we then utilize the top-down attention cues as guidance for region classification. Furthermore, we propose a residual feature refinement (RFR) block to tackle the domain mismatch between the web domain and the target domain. We demonstrate our proposed method on the PASCAL VOC dataset with three different novel/base splits. Without any target-domain novel-class images and annotations, our proposed webly supervised object detection model is able to achieve promising performance for novel classes. Moreover, we also conduct transfer learning experiments on the large-scale ILSVRC 2013 detection dataset and achieve state-of-the-art performance.
[attention, dataset, work, three, visual] [object, web, detection, detector, feature, weakly, webly, attentive, region, voc, map, bounding, roi, refinement, propose, pascal, table, fully, level, cam, labeling, box, apply, split, acl, objectness, websod, pooling, jianfei] [model, trained, testing] [method, proposed, ieee, figure, residual, pattern, scale, block, abundant, existing] [supervised, domain, target, image, train, loss, transfer, common, adaptation, rfr] [base, class, classification, training, learning, large, performance, set, data, knowledge, classifier, network, layer, deep, label] [novel, conference, computer, vision, human, estimator, estimated, directly]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Zhonghua and Tao, Qingyi and Lin, Guosheng and Cai, Jianfei},
  title = {Exploring Bottom-Up and Top-Down Cues With Attentive Learning for Webly Supervised Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Enhancing Generic Segmentation With Learned Region Representations
Or Isaacs, Oran Shayer, Michael Lindenbaum


Deep learning approaches to generic (non-semantic) segmentation have so far been indirect and relied on edge detection. This is in contrast to semantic segmentation, where DNNs are applied directly. We propose an alternative approach called Deep Generic Segmentation (DGS) and try to follow the path used for semantic segmentation. Our main contribution is a new method for learning a pixel-wise representation that reflects segment relatedness. This representation is combined with a CRF to yield the segmentation algorithm. We show that we are able to learn meaningful representations that improve segmentation quality and that the representations themselves achieve state-of-the-art segment similarity scores. The segmentation results are competitive and promising.
[pair, associated, context, hierarchical, recognition, current, graph, evaluation] [segmentation, edge, segment, region, detection, semantic, score, fop, achieves, boundary, contour, object, merging, pascal, cob, watershed, improves, agglomerative, ois, table] [generic, trained, face, input, original, improve, quality, combined] [ieee, pattern, pixel, proposed, figure, convolutional, analysis, method, color] [representation, image, dissimilarity, supervised, loss] [learning, algorithm, network, learned, training, classification, deep, classifier, machine, task, belong, space, triplet, label, improved, measure, appendix, test, layer, merged] [computer, approach, conference, volume, well, distance, vision, compare, silhouette, rely]
@InProceedings{Isaacs_2020_CVPR,
  author = {Isaacs, Or and Shayer, Oran and Lindenbaum, Michael},
  title = {Enhancing Generic Segmentation With Learned Region Representations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adaptive Hierarchical Down-Sampling for Point Cloud Classification
Ehsan Nezhadarya, Ehsan Taghavi, Ryan Razani, Bingbing Liu, Jun Luo


Deterministic down-sampling of an unordered point cloud in a deep neural network has not been rigorously studied so far. Existing methods down-sample the points regardless of their importance for the network output, and often down-sample the raw point cloud before any processing. As a result, some important points in the point cloud may be removed, while less valuable points may be passed to the next layers. In contrast, the proposed adaptive down-sampling method samples the points by taking into account the importance of each point, which varies according to application, task and training data. In this paper, we propose a novel deterministic, adaptive, permutation-invariant down-sampling layer, called Critical Points Layer (CPL), which learns to reduce the number of points in an unordered point cloud while retaining the important (critical) ones. Unlike most graph-based point cloud down-sampling methods that use k-NN to find the neighboring points, CPL is a global down-sampling method, rendering it computationally very efficient. The proposed layer can be used along with a graph-based point cloud convolution layer to form a convolutional neural network, dubbed CP-Net in this paper. We introduce a CP-Net for 3D object classification that achieves high accuracy on the ModelNet40 dataset among point-cloud-based methods, which validates the effectiveness of the CPL.
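A simplified sketch of the idea of point criticality: a point is treated as important if its features contribute to the channel-wise max over the whole cloud (as in PointNet), points are ranked by how many channels they dominate, and the top-k are kept. This approximates the concept and is not the paper's exact CPL definition.

import numpy as np

def critical_points_downsample(features, k):
    """features: (N, C) per-point features; returns indices of the k retained points."""
    winners = np.argmax(features, axis=0)                # (C,) index of max per channel
    counts = np.bincount(winners, minlength=features.shape[0])
    order = np.argsort(-counts)                          # most-critical points first
    return order[:k]

feats = np.random.rand(1024, 64)                         # e.g. per-point features
keep = critical_points_downsample(feats, k=256)
downsampled = feats[keep]
print(downsampled.shape)                                 # (256, 64)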
[critical, graph, time, dataset, step, hierarchical, passed] [feature, object, table, fmax, propose] [input, model, original] [proposed, spatial, figure, convolutional, convolution, based, method, output, ieee, pattern, version, adaptive, neighbourhood, spectral, kernel, called] [generate, introduce, learns] [layer, deep, neural, size, network, accuracy, vector, number, learning, set, computational, ratio, training, maximum, computationally, weighted, bottleneck, data, sampling, random, complexity, dimension, processing, lower, applied, kcnet, explained, algorithm, batch, arxiv, preprint, deterministic, task] [point, cloud, cpl, unordered, edgeconv, conference, computer, vision, local, dgcnn, pointnet, wcpl, approach, michael]
@InProceedings{Nezhadarya_2020_CVPR,
  author = {Nezhadarya, Ehsan and Taghavi, Ehsan and Razani, Ryan and Liu, Bingbing and Luo, Jun},
  title = {Adaptive Hierarchical Down-Sampling for Point Cloud Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions
Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu, Matthew Yu, Tao Xu, Kan Chen, Peter Vajda, Joseph E. Gonzalez


Differentiable Neural Architecture Search (DNAS) has demonstrated great success in designing state-of-the-art, efficient neural networks. However, the search space of DARTS-based DNAS is small compared to that of other search methods, since all candidate network layers must be explicitly instantiated in memory. To address this bottleneck, we propose a memory- and computationally efficient DNAS variant: DMaskingNAS. This algorithm expands the search space by up to 10^14x over conventional DNAS, supporting searches over spatial and channel dimensions that are otherwise prohibitively expensive: input resolution and number of filters. We propose a masking mechanism for feature map reuse, so that memory and computational costs stay nearly constant as the search space expands. Furthermore, we employ effective shape propagation to maximize per-FLOP or per-parameter accuracy. The searched FBNetV2s yield state-of-the-art performance when compared with all previous architectures. With up to 421x less search cost, DMaskingNAS finds models with 0.9% higher accuracy and 15% fewer FLOPs than MobileNetV3-Small, and with similar accuracy but 20% fewer FLOPs than EfficientNet-B0. Furthermore, our FBNetV2 outperforms MobileNetV3 by 2.6% in accuracy, with equivalent model size. FBNetV2 models are open-sourced at https://github.com/facebookresearch/mobile-vision.
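A sketch (PyTorch) of the channel-masking idea under stated assumptions: a single maximum-width convolution is shared across all channel-count options and its output is multiplied by a Gumbel-softmax-weighted combination of binary channel masks, so memory stays roughly constant as the number of options grows. Layer sizes and the option list are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedChannelSearch(nn.Module):
    def __init__(self, in_ch, max_out_ch, options=(8, 16, 24, 32)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, max_out_ch, 3, padding=1)   # shared max-width conv
        masks = torch.zeros(len(options), max_out_ch)
        for i, c in enumerate(options):
            masks[i, :c] = 1.0                                    # keep first c channels
        self.register_buffer("masks", masks)
        self.alpha = nn.Parameter(torch.zeros(len(options)))      # architecture parameters

    def forward(self, x, tau=1.0):
        w = F.gumbel_softmax(self.alpha, tau=tau)                  # (num_options,)
        mask = (w[:, None] * self.masks).sum(dim=0)                # soft channel mask
        return self.conv(x) * mask[None, :, None, None]

y = MaskedChannelSearch(3, 32)(torch.rand(2, 3, 32, 32))
print(y.shape)   # torch.Size([2, 32, 32, 32])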
[step, previous, explore] [feature, map, propose, table] [input, effective, model, masking, constant] [channel, output, block, resolution, figure, convolutional, pattern, ieee, spatial, kernel, convolution, receptive, field] [train, address] [search, memory, neural, number, space, architecture, arxiv, preprint, dmaskingnas, efficient, accuracy, computational, pruning, size, network, gumbel, training, design, manual, searched, learning, imagenet, larger, note, deep, proxylessnas, gradient, option, supergraph, reduce, parameter, count, softmax, flop, quoc, designing, fewer, equivalent] [cost, computer, vision, conference, differentiable, shape]
@InProceedings{Wan_2020_CVPR,
  author = {Wan, Alvin and Dai, Xiaoliang and Zhang, Peizhao and He, Zijian and Tian, Yuandong and Xie, Saining and Wu, Bichen and Yu, Matthew and Xu, Tao and Chen, Kan and Vajda, Peter and Gonzalez, Joseph E.},
  title = {FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Texture Invariant Representation for Domain Adaptation of Semantic Segmentation
Myeongjin Kim, Hyeran Byun


Since annotating pixel-level labels for semantic segmentation is laborious, leveraging synthetic data is an attractive solution. However, due to the domain gap between the synthetic domain and the real domain, it is challenging for a model trained with synthetic data to generalize to real data. In this paper, considering texture to be the fundamental difference between the two domains, we propose a method to adapt to the target domain's texture. First, we diversify the texture of synthetic images using a style transfer algorithm. The various textures of the generated images prevent a segmentation model from overfitting to one specific (synthetic) texture. Then, we fine-tune the model with self-training to get direct supervision of the target texture. Our results achieve state-of-the-art performance and we analyze the properties of the model trained on the stylized dataset with extensive experiments.
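A minimal sketch (PyTorch) of the self-training stage mentioned above: the model produces pixel-wise pseudo-labels on unlabeled target images, low-confidence pixels are ignored, and the model is fine-tuned on the rest. The confidence threshold, ignore index and toy network are assumptions.

import torch
import torch.nn.functional as F

def pseudo_label(model, target_image, threshold=0.9, ignore_index=255):
    model.eval()
    with torch.no_grad():
        prob = F.softmax(model(target_image), dim=1)     # (B, C, H, W)
        conf, label = prob.max(dim=1)
        label[conf < threshold] = ignore_index           # drop uncertain pixels
    return label

def self_training_step(model, optimizer, target_image, ignore_index=255):
    labels = pseudo_label(model, target_image, ignore_index=ignore_index)
    model.train()
    loss = F.cross_entropy(model(target_image), labels, ignore_index=ignore_index)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = torch.nn.Conv2d(3, 19, 1)                         # toy "segmentation network"
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
print(self_training_step(model, opt, torch.rand(2, 3, 64, 64)))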
[dataset, visual, road, outperforms, considering] [segmentation, stage, semantic, table, ablation, feature, supervision, adopt] [model, original, trained, adversarial, datasets, study, difference, analyze] [method, figure, ieee, pattern, based, convolutional, color, comparison] [domain, source, target, stylized, texture, synthetic, image, style, adaptation, real, cyclegan, translated, gap, transfer, synthia, loss, learn, representation, generated, generate, drpc, translation, unsupervised, content, stylization] [training, learning, performance, data, arxiv, preprint, large, network, process, imagenet, validation, adapt, reduce, number, set, log, rate] [computer, conference, vision, shape, ground, truth, fundamental, direct, compare, volume]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Myeongjin and Byun, Hyeran},
  title = {Learning Texture Invariant Representation for Domain Adaptation of Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Putting Visual Object Recognition in Context
Mengmi Zhang, Claire Tseng, Gabriel Kreiman


Context plays an important role in visual recognition. Recent studies have shown that visual recognition networks can be fooled by placing objects in inconsistent contexts (e.g., a cow in the ocean). To model the role of contextual information in visual recognition, we systematically investigated ten critical properties of where, when, and how context modulates recognition, including the amount of context, context and object resolution, geometrical structure of context, context congruence, and temporal dynamics of contextual modulation. The tasks involved recognizing a target object surrounded with context in a natural image. As an essential benchmark, we conducted a series of psychophysics experiments where we altered one aspect of context at a time, and quantified recognition accuracy. We propose a biologically-inspired context-aware object recognition model consisting of a two-stream architecture. The model processes visual information at the fovea and periphery in parallel, dynamically incorporates object and contextual information, and sequentially reasons about the class label for the target object. Across a wide range of behavioral tasks, the model approximates human level performance without retraining for each task, captures the dependence of context enhancement on image properties, and provides initial steps towards integrating scene and object information for visual recognition. All source code and data are publicly available: https://github.com/kreimanlab/Put-In-Context.
[context, recognition, catnet, visual, attention, time, recurrent, incongruent, modulation, congruent, lstm, facilitation, role, previous, natural, psychophysics, reasoning, ranksum, step] [object, contextual, feature, semantic, module, detection, location, segmentation] [model, condition, systematically, led] [pattern, ieee, convolutional, blurring, exposure, figure, spatial, blurred, introduced, presented, journal] [target, image, texture] [performance, small, size, neural, accuracy, network, amount, class, large, exp, label, deep, computational, processing, arxiv, preprint, inference, classification, experiment, vector] [human, computer, vision, conference, full, scene, minimal, consistent, ground, european, gabriel]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Mengmi and Tseng, Claire and Kreiman, Gabriel},
  title = {Putting Visual Object Recognition in Context},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SLV: Spatial Likelihood Voting for Weakly Supervised Object Detection
Ze Chen, Zhihang Fu, Rongxin Jiang, Yaowu Chen, Xian-Sheng Hua


Based on the framework of multiple instance learning (MIL), numerous works have advanced weakly supervised object detection (WSOD). However, most MIL-based methods tend to localize instances to their discriminative parts instead of the whole content. In this paper, we propose a spatial likelihood voting (SLV) module to converge the proposal localizing process without any bounding box annotations. Specifically, all region proposals in a given image play the role of voters in every training iteration, voting for the likelihood of each category across spatial dimensions. After dilating alignment on the area with large likelihood values, the voting results are regularized as bounding boxes, which are then used for the final classification and localization. Based on SLV, we further propose an end-to-end training framework for multi-task learning. The classification and localization tasks promote each other, which further improves the detection performance. Extensive experiments on the PASCAL VOC 2007 and 2012 datasets demonstrate the superior performance of SLV.
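A simplified sketch of the voting idea, assuming each proposal simply adds its class score to the pixels it covers and the accumulated map is thresholded to obtain a regularized pseudo box; the normalization and threshold are assumptions rather than the paper's exact procedure.

import numpy as np

def spatial_likelihood_vote(proposals, scores, hw, threshold=0.5):
    """proposals: (R, 4) as (x1, y1, x2, y2); scores: (R,) for one class."""
    H, W = hw
    vote_map = np.zeros((H, W), dtype=np.float32)
    for (x1, y1, x2, y2), s in zip(proposals.astype(int), scores):
        vote_map[y1:y2, x1:x2] += s
    vote_map /= vote_map.max() + 1e-8
    ys, xs = np.nonzero(vote_map >= threshold)
    if len(xs) == 0:
        return vote_map, None
    box = (xs.min(), ys.min(), xs.max(), ys.max())        # regularized pseudo box
    return vote_map, box

props = np.array([[30, 40, 120, 150], [35, 50, 110, 140], [10, 10, 60, 60]])
scores = np.array([0.9, 0.8, 0.1])
vmap, pseudo_box = spatial_likelihood_vote(props, scores, hw=(200, 200))
print(pseudo_box)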
[multiple, localizing, three] [object, slv, proposal, weakly, module, detection, instance, mil, localization, voc, bounding, score, framework, pascal, hslv, voting, propose, branch, table, labeling, refinement, box, region, wsod, feature, detector, map, supervision, fully, det, obtains, category, final] [model, fig, datasets, trained] [spatial, proposed, ieee, likelihood, pattern, method, based, output, figure] [supervised, image, loss, train, cluster, generate, generated, discriminative, pseudo, row] [training, classification, network, learning, basic, performance, deep, problem, classifier, average, set, process, better, algorithm, neural] [computer, conference, vision, single, second, international]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Ze and Fu, Zhihang and Jiang, Rongxin and Chen, Yaowu and Hua, Xian-Sheng},
  title = {SLV: Spatial Likelihood Voting for Weakly Supervised Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Universal Weighting Metric Learning for Cross-Modal Matching
Jiwei Wei, Xing Xu, Yang Yang, Yanli Ji, Zheng Wang, Heng Tao Shen


Cross-modal matching has been a highlighted research topic in both vision and language areas. Learning an appropriate mining strategy to sample and weight informative pairs is crucial for cross-modal matching performance. However, most existing metric learning methods are developed for unimodal matching, which is unsuitable for cross-modal matching on multimodal data with heterogeneous features. To address this problem, we propose a simple and interpretable universal weighting framework for cross-modal matching, which provides a tool to analyze the interpretability of various loss functions. Furthermore, we introduce a new polynomial loss under the universal weighting framework, which defines a weight function for the positive and negative informative pairs respectively. Experimental results on two image-text matching benchmarks and two video-text matching benchmarks validate the efficacy of the proposed method.
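A hedged sketch of one possible polynomial weighting: the weight of a positive pair decreases with its similarity and the weight of a negative pair increases with its similarity, each given by a small polynomial. The coefficients and functional form below are illustrative and not taken from the paper.

import numpy as np

def poly_weight(s, coeffs):
    """Evaluate sum_i coeffs[i] * s**i for similarity s (vectorized)."""
    return sum(c * np.power(s, i) for i, c in enumerate(coeffs))

def weighted_pair_loss(sim_pos, sim_neg, pos_coeffs=(1.0, -0.5), neg_coeffs=(0.0, 1.0)):
    """sim_pos / sim_neg: similarities of matched / unmatched image-text pairs."""
    w_pos = np.clip(poly_weight(sim_pos, pos_coeffs), 0, None)   # harder positives weigh more
    w_neg = np.clip(poly_weight(sim_neg, neg_coeffs), 0, None)   # harder negatives weigh more
    return (w_pos * (1.0 - sim_pos)).mean() + (w_neg * sim_neg).mean()

sim_pos = np.array([0.8, 0.4])     # matched pairs (higher is better)
sim_neg = np.array([0.6, 0.1])     # unmatched pairs (lower is better)
print(weighted_pair_loss(sim_pos, sim_neg))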
[embedding, pair, retrieval, visual, video, text, sii, attention, multimodal, encoding, modality, bathroom, unimodal] [positive, table, framework, score, semantic, anchor, effectiveness, advanced] [universal, analyze, model, experimental, effectively] [proposed, figure, method, ieee, existing, residual, formulated, science] [loss, image, yang, tao, shared, xing, crossmodal, introduce, learn, generate, tool] [negative, triplet, weight, similarity, learning, function, weighting, informative, max, sample, performance, sij, heng, metric, set, network, select, deep, hardest, average, mining, find, data, space, sampling, test, appropriate, random, evaluate, better, discard, selected] [polynomial, matching, scan, form, variety, dense]
@InProceedings{Wei_2020_CVPR,
  author = {Wei, Jiwei and Xu, Xing and Yang, Yang and Ji, Yanli and Wang, Zheng and Shen, Heng Tao},
  title = {Universal Weighting Metric Learning for Cross-Modal Matching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
IDA-3D: Instance-Depth-Aware 3D Object Detection From Stereo Vision for Autonomous Driving
Wanli Peng, Hao Pan, He Liu, Yi Sun


3D object detection is an important scene understanding task in autonomous driving and virtual reality. Approaches based on LiDAR technology have high performance, but LiDAR is expensive. Considering more general scenes, where there is no LiDAR data in the 3D datasets, we propose a 3D object detection approach from stereo vision which does not rely on LiDAR data either as input or as supervision in training, but solely takes RGB images with corresponding annotated 3D bounding boxes as training data. As object depth estimation is the key factor affecting the performance of 3D object detection, we introduce an Instance-Depth-Aware (IDA) module which accurately predicts the depth of the 3D bounding box's center by instance-depth awareness, disparity adaptation and matching cost reweighting. Moreover, our model is an end-to-end learning framework which does not require multiple stages or post-processing algorithms. We provide detailed experiments on the KITTI benchmark and achieve impressive improvements compared with the existing image-based methods. Our code is available at https://github.com/swords123/IDA-3D.
[pair, provide, attention, driving] [object, detection, bounding, box, module, instance, lidar, feature, center, nonuniform, table, regression, autonomous, ida, map, level, rpn, iou, hard, apbev, car, propose, annotated, roi, proposal, reweighting, raquel] [input, model] [disparity, ieee, pattern, method, based, high, range, figure, solely, binocular] [image, loss, corresponding, adaptation] [quantization, performance, training, learning, network, data, compared, design, reduce, uniform, architecture, deep, strategy] [depth, stereo, estimation, conference, vision, cost, computer, monocular, left, point, error, approach, matching, volume, kitti, accurate, orientation, international, angle, rgb]
@InProceedings{Peng_2020_CVPR,
  author = {Peng, Wanli and Pan, Hao and Liu, He and Sun, Yi},
  title = {IDA-3D: Instance-Depth-Aware 3D Object Detection From Stereo Vision for Autonomous Driving},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Label Decoupling Framework for Salient Object Detection
Jun Wei, Shuhui Wang, Zhe Wu, Chi Su, Qingming Huang, Qi Tian


To get more accurate saliency maps, recent methods mainly focus on aggregating multi-level features from fully convolutional networks (FCN) and introducing edge information as auxiliary supervision. Though remarkable progress has been achieved, we observe that the closer a pixel is to the edge, the more difficult it is to predict, because edge pixels have a highly imbalanced distribution. To address this problem, we propose a label decoupling framework (LDF) which consists of a label decoupling (LD) procedure and a feature interaction network (FIN). LD explicitly decomposes the original saliency map into a body map and a detail map, where the body map concentrates on center areas of objects and the detail map focuses on regions around edges. The detail map works better because it involves many more pixels than traditional edge supervision. Different from the saliency map, the body map discards edge pixels and only pays attention to center areas. This successfully avoids the distraction from edge pixels during training. Therefore, we employ two branches in FIN to deal with the body map and the detail map, respectively. Feature interaction (FI) is designed to fuse the two complementary branches to predict the saliency map, which is then used to refine the two branches again. This iterative refinement is helpful for learning better representations and more precise saliency maps. Comprehensive experiments on six benchmark datasets demonstrate that LDF outperforms state-of-the-art approaches on different evaluation metrics.
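A sketch of one plausible label decoupling step, assuming a distance transform is used to split the binary ground truth: the body map emphasizes pixels far from the edge and the detail map emphasizes pixels near it. The normalization is an assumption.

import numpy as np
from scipy.ndimage import distance_transform_edt

def decouple_label(mask):
    """mask: (H, W) binary saliency ground truth in {0, 1}."""
    dist = distance_transform_edt(mask)                  # distance to the background
    body = dist / (dist.max() + 1e-8)                    # large near object centers
    detail = mask - body                                 # large near object edges
    return body, detail

gt = np.zeros((100, 100)); gt[30:70, 20:80] = 1.0
body_map, detail_map = decouple_label(gt)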
[interaction, prediction, decoder, attention] [saliency, salient, edge, map, object, feature, scrn, afnet, poolnet, siba, basnet, detection, tdbu, sod, ecssd, duts, background, huchuan, backbone, ali, center, represents, including, framework, china, fully, propose, challenging, soc] [model, original, datasets, iterative] [detail, proposed, pixel, method, mae, convolutional, designed, ieee, figure, based] [image, loss, encoder, learn, consists, ldf, corresponding] [label, network, performance, better, decoupling, learning, applied, larger, binary, set, deep, measure, best, procedure, fin] [body, distance, computer, error, vision, accurate, conference, ground, demonstrate]
@InProceedings{Wei_2020_CVPR,
  author = {Wei, Jun and Wang, Shuhui and Wu, Zhe and Su, Chi and Huang, Qingming and Tian, Qi},
  title = {Label Decoupling Framework for Salient Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Transform and Tell: Entity-Aware News Image Captioning
Alasdair Tran, Alexander Mathews, Lexing Xie


We propose an end-to-end model which generates captions for images embedded in news articles. News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words. We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism. We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts. On the GoodNews dataset, our model outperforms the previous state of the art by a factor of four in CIDEr score (13 to 54). This performance gain comes from a unique combination of language models, word representation, image embeddings, face embeddings, object embeddings, and improvements in neural network design. We also introduce the NYTimes800k dataset which is 70% larger than GoodNews, has higher article quality, and includes the locations of images within articles as an additional contextual cue.
[attention, news, article, captioning, transformer, caption, language, roberta, text, goodnews, decoder, recognition, dataset, previous, cider, lstm, evaluation, time, attend, embeddings, glove, word, vocabulary, visual, context, state, biten, work, modeling, entity, bert, day, automatic, annual, meeting, sequence] [association, named, rare, object, key, recall, final, table, art, score] [model, face, input] [ieee, pattern, june, output, july, figure, dynamic] [image, generate, generated, representation, encoder, generating, generates] [computational, number, machine, training, neural, set, test, learning, proper, care, network, precision, weighted, performance, bpe, softmax, processing, layer] [conference, vision, computer, international]
@InProceedings{Tran_2020_CVPR,
  author = {Tran, Alasdair and Mathews, Alexander and Xie, Lexing},
  title = {Transform and Tell: Entity-Aware News Image Captioning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
HAMBox: Delving Into Mining High-Quality Anchors on Face Detection
Yang Liu, Xu Tang, Junyu Han, Jingtuo Liu, Dinger Rui, Xiang Wu


Current face detectors utilize anchors to frame a multi-task learning problem which combines classification and bounding box regression. Effective anchor design and anchor matching strategies enable face detectors to localize faces under large pose and scale variations. However, we observe that more than 80% of correctly predicted bounding boxes are regressed from unmatched anchors (anchors whose IoUs with target faces are lower than a threshold) in the inference phase. This indicates that these unmatched anchors have excellent regression ability, but existing methods neglect to learn from them. In this paper, we propose an Online High-quality Anchor Mining Strategy (HAMBox), which explicitly helps outer faces compensate with high-quality anchors. Our proposed HAMBox method can serve as a general strategy for anchor-based single-stage face detection. Experiments on various datasets, including WIDER FACE, FDDB, AFW and PASCAL Face, demonstrate the superiority of the proposed method.
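A simplified sketch of the mining idea, under assumed thresholds: anchors that fail the usual IoU matching but whose regressed boxes overlap a face well are added as extra positives for faces with too few matched anchors.

import numpy as np

def iou(boxes, gt):
    x1 = np.maximum(boxes[:, 0], gt[0]); y1 = np.maximum(boxes[:, 1], gt[1])
    x2 = np.minimum(boxes[:, 2], gt[2]); y2 = np.minimum(boxes[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_b + area_g - inter + 1e-8)

def mine_anchors(anchors, regressed, gt_face, match_thr=0.35, mine_thr=0.5, k=3):
    matched = np.nonzero(iou(anchors, gt_face) >= match_thr)[0]
    if len(matched) >= k:
        return matched
    unmatched = np.setdiff1d(np.arange(len(anchors)), matched)
    quality = iou(regressed[unmatched], gt_face)          # IoU of *regressed* boxes
    good = unmatched[quality >= mine_thr]
    extra = good[np.argsort(-quality[quality >= mine_thr])][: k - len(matched)]
    return np.concatenate([matched, extra])

anchors = np.array([[0, 0, 20, 20], [5, 5, 40, 40], [60, 60, 90, 90]], dtype=float)
regressed = np.array([[8, 8, 34, 36], [10, 10, 38, 38], [62, 61, 88, 92]], dtype=float)
print(mine_anchors(anchors, regressed, gt_face=np.array([10, 10, 40, 40], dtype=float)))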
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yang and Tang, Xu and Han, Junyu and Liu, Jingtuo and Rui, Dinger and Wu, Xiang},
  title = {HAMBox: Delving Into Mining High-Quality Anchors on Face Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hierarchical Feature Embedding for Attribute Recognition
Jie Yang, Jiarou Fan, Yiru Wang, Yige Wang, Weihao Gan, Lin Liu, Wei Wu


Attribute recognition is a crucial but challenging task due to viewpoint changes, illumination variations and appearance diversities, etc. Most previous work only considers attribute-level feature embedding, which might perform poorly in complicated heterogeneous conditions. To address this problem, we propose a hierarchical feature embedding (HFE) framework, which learns a fine-grained feature embedding by combining attribute and ID information. In HFE, we maintain the inter-class and intra-class feature embedding simultaneously. Not only samples with the same attribute but also samples with the same ID are gathered more closely, which restricts the feature embedding of visually hard samples with regard to attributes and improves the robustness to varying conditions. We establish this hierarchical structure by utilizing an HFE loss consisting of attribute-level and ID-level constraints. We also introduce an absolute boundary regularization and a dynamic loss weight as supplementary components to help build up the feature embedding. Experiments show that our method achieves state-of-the-art results on two pedestrian attribute datasets and a facial attribute dataset.
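A sketch (PyTorch) of a hierarchical embedding objective in the spirit of the description above: an ID-level triplet term pulls same-identity samples closer than same-attribute samples, and an attribute-level term pulls same-attribute samples closer than different-attribute samples. Margins and the combination weight are assumptions.

import torch
import torch.nn.functional as F

def hierarchical_triplet_loss(anchor, pos_id, pos_attr, neg_attr,
                              margin_id=0.2, margin_attr=0.4, weight=1.0):
    """All inputs are (B, D) embeddings:
       pos_id   - same identity (and hence same attribute) as the anchor
       pos_attr - same attribute, different identity
       neg_attr - different attribute."""
    d = lambda a, b: (a - b).pow(2).sum(dim=1)
    # intra-class: same-ID pairs closer than same-attribute-different-ID pairs
    l_id = F.relu(d(anchor, pos_id) - d(anchor, pos_attr) + margin_id).mean()
    # inter-class: same-attribute pairs closer than different-attribute pairs
    l_attr = F.relu(d(anchor, pos_attr) - d(anchor, neg_attr) + margin_attr).mean()
    return l_attr + weight * l_id

emb = lambda: F.normalize(torch.randn(8, 128), dim=1)
print(hierarchical_triplet_loss(emb(), emb(), emb(), emb()))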
[embedding, recognition, dataset, hierarchical, attention, combining, visual, construct, semantics, three, length, evaluation] [feature, pedestrian, hard, table, boundary, achieves, apr, positive, framework, backbone, recall, propose] [face, lij, clothing, model, identity, xiaoou, improve, datasets, facial] [ieee, based, method, pattern, dynamic, color, proposed] [attribute, loss, hfe, person, yij, linter, lintra, reid, quintuplet, discriminative, backpack, market, introduce, representation, xnij, duke, appearance, image] [learning, triplet, deep, metric, weight, sample, space, classification, regularization, performance, best, imbalanced, data, neural, training, better, arxiv, preprint] [conference, computer, absolute, international, vision, joint, well, distance, european]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Jie and Fan, Jiarou and Wang, Yiru and Wang, Yige and Gan, Weihao and Liu, Lin and Wu, Wei},
  title = {Hierarchical Feature Embedding for Attribute Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Squeeze-and-Attention Networks for Semantic Segmentation
Zilong Zhong, Zhong Qiu Lin, Rene Bidart, Xiaodan Hu, Ibrahim Ben Daya, Zhifeng Li, Wei-Shi Zheng, Jonathan Li, Alexander Wong


The recent integration of attention mechanisms into segmentation networks improves their representational capabilities through a great emphasis on more informative features. However, these attention mechanisms ignore an implicit sub-task of semantic segmentation and are constrained by the grid structure of convolution kernels. In this paper, we propose a novel squeeze-and-attention network (SANet) architecture that leverages an effective squeeze-and-attention (SA) module to account for two distinctive characteristics of segmentation: i) pixel-group attention, and ii) pixel-wise prediction. Specifically, the proposed SA modules impose pixel-group attention on conventional convolution by introducing an 'attention' convolutional channel, thus taking into account spatial-channel inter-dependencies in an efficient manner. The final segmentation results are produced by merging outputs from four hierarchical stages of a SANet to integrate multi-scale contexts for obtaining an enhanced pixel-wise prediction. Empirical experiments on two challenging public datasets validate the effectiveness of the proposed SANets, which achieves 83.2 % mIoU (without COCO pre-training) on PASCAL VOC and a state-of-the-art mIoU of 54.4 % on PASCAL Context.
[attention, context, prediction, dataset, den, integrate, recognition] [segmentation, semantic, module, feature, sanet, pascal, sanets, miou, voc, table, fully, categorical, backbone, coco, global, main, fcn, object, map, achieves, grouping, pooling, parsing, including, effectiveness, contextual, adopt, ablation, china, focus, pyramid] [model, input, effective, improve] [convolution, pixel, ieee, spatial, channel, figure, pattern, output, dilated, residual, convolutional, validate] [image, loss, generate] [network, learning, deep, design, training, test, neural, arxiv, preprint, efficient, data, representational, performance] [computer, conference, vision, dense, international, local, scene, structure]
@InProceedings{Zhong_2020_CVPR,
  author = {Zhong, Zilong and Lin, Zhong Qiu and Bidart, Rene and Hu, Xiaodan and Daya, Ibrahim Ben and Li, Zhifeng and Zheng, Wei-Shi and Li, Jonathan and Wong, Alexander},
  title = {Squeeze-and-Attention Networks for Semantic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection
Sara Beery, Guanhang Wu, Vivek Rathod, Ronny Votel, Jonathan Huang


In static monitoring cameras, useful contextual information can stretch far beyond the few seconds typical video understanding models might see: subjects may exhibit similar behavior over multiple days, and background objects remain static. Due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and sometimes are irregular due to the use of a motion trigger. In order to perform well in this setting, models must be robust to irregular sampling rates. In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera. Specifically, we propose an attention-based approach that allows our model, Context R-CNN, to index into a long term memory bank constructed on a per-camera basis and aggregate contextual features from other frames to boost object detection performance on the current frame. We apply Context R-CNN to two settings: (1) species detection using camera traps, and (2) vehicle detection in traffic cameras, showing in both settings that Context R-CNN leads to performance gains over strong baselines. Moreover, we show that increasing the contextual time horizon leads to improved results. When applied to camera trap data from the Snapshot Serengeti dataset, Context R-CNN with context from up to a month of images outperforms a single-frame baseline by 17.9% mAP, and outperforms S3D (a 3d convolution based baseline) by 11.2% mAP.
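A sketch (PyTorch) of attending over a per-camera long-term memory bank: box features from the current frame act as queries against features stored from other frames, and the attended context is fused back before classification. Projection sizes and the residual fusion are assumptions.

import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, current_boxes, memory_bank):
        """current_boxes: (N, D) features from this frame;
           memory_bank:   (M, D) features accumulated from past frames."""
        attn = torch.softmax(self.q(current_boxes) @ self.k(memory_bank).t()
                             / current_boxes.size(-1) ** 0.5, dim=-1)   # (N, M)
        context = attn @ self.v(memory_bank)                            # (N, D)
        return current_boxes + context                                  # residual fusion

feats = MemoryAttention()(torch.rand(5, 256), torch.rand(1000, 256))
print(feats.shape)   # torch.Size([5, 256])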
[context, long, frame, attention, time, short, temporal, video, static, traffic, bank, horizon, trap, monitoring, current, dataset, multiple, irregular, month, behavior, serengeti, spatiotemporal, order] [object, detection, feature, map, box, contextual, background, extractor, faster, false, level, improvement, kaiming, ross, sara, aggregate] [model, improve, animal, highly, input] [ieee, figure, pattern, snapshot, motion, based, counting, method] [image] [memory, performance, data, number, arxiv, preprint, find, neural, sampling, deep, consider, top, training] [camera, term, conference, computer, single, vision, international, well, empty, approach, shape, european]
@InProceedings{Beery_2020_CVPR,
  author = {Beery, Sara and Wu, Guanhang and Rathod, Vivek and Votel, Ronny and Huang, Jonathan},
  title = {Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Mixture Dense Regression for Object Detection and Human Pose Estimation
Ali Varamesh, Tinne Tuytelaars


Mixture models are well-established learning approaches that, in computer vision, have mostly been applied to inverse or ill-defined problems. However, they are general-purpose divide-and-conquer techniques, splitting the input space into relatively homogeneous subsets in a data-driven manner. Not only ill-defined but also well-defined complex problems should benefit from them. To this end, we devise a framework for spatial regression using mixture density networks. We realize the framework for object detection and human pose estimation. For both tasks, a mixture model yields higher accuracy and divides the input space into interpretable modes. For object detection, mixture components focus on object scale, with the distribution of components closely following that of the ground-truth object scale. This practically alleviates the need for multi-scale testing, providing a superior speed-accuracy trade-off. For human pose estimation, a mixture model divides the data based on viewpoint and uncertainty -- namely, front and back views, with the back view imposing higher uncertainty. We conduct experiments on the MS COCO dataset and do not face any mode collapse.
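A sketch (PyTorch) of the standard mixture-density negative log-likelihood that such a framework minimizes: each prediction outputs mixing weights, means and scales for K components, and the loss is -log sum_k pi_k N(y | mu_k, sigma_k). The number of components and the parameterization are assumptions.

import torch
import torch.nn.functional as F
from torch.distributions import Normal

def mdn_nll(pi_logits, mu, sigma, target):
    """pi_logits: (B, K); mu, sigma: (B, K, D); target: (B, D)."""
    log_pi = F.log_softmax(pi_logits, dim=-1)                        # (B, K)
    comp = Normal(mu, sigma).log_prob(target.unsqueeze(1)).sum(-1)   # (B, K)
    return -(torch.logsumexp(log_pi + comp, dim=-1)).mean()

B, K, D = 4, 3, 2                      # e.g. D=2 for a keypoint offset
loss = mdn_nll(torch.randn(B, K), torch.randn(B, K, D),
               torch.rand(B, K, D) + 0.1, torch.randn(B, D))
print(loss)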
[prediction, multiple, dataset, includes, evaluation, provide] [object, detection, regression, table, offset, center, coco, location, centernet, box, faster] [model, input, trained, face, difference] [spatial, ieee, scale, based, pattern, gaussian, output, figure, comparison] [component, loss, mode, person, image, generated, target, diverse, conditioned, train] [mixture, density, training, base, learning, classification, neural, distribution, data, variance, accuracy, better, deep, higher, network, arxiv, preprint, set, function, proper, number, rate, space, mdn, size] [pose, human, computer, estimation, conference, vision, dense, keypoints, body, single, ground, truth, uncertainty, formulation, accurate, term, european]
@InProceedings{Varamesh_2020_CVPR,
  author = {Varamesh, Ali and Tuytelaars, Tinne},
  title = {Mixture Dense Regression for Object Detection and Human Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Syntax-Aware Action Targeting for Video Captioning
Qi Zheng, Chaoyue Wang, Dacheng Tao


Existing methods on video captioning have made great efforts to identify objects/instances in videos, but few of them emphasize the prediction of action. As a result, the learned models are likely to depend heavily on the prior of training data, such as the co-occurrence of objects, which may cause an enormous divergence between the generated descriptions and the video content. In this paper, we explicitly emphasize the importance of action by predicting visually-related syntax components including subject, object and predicate. Specifically, we propose a Syntax-Aware Action Targeting (SAAT) module that firstly builds a self-attended scene representation to draw global dependence among multiple objects within a scene, and then decodes the visually-related syntax components by setting different queries. After targeting the action, indicated by predicate, our captioner learns an attention distribution over the predicate and the previously predicted words to guide the generation of the next word. Comprehensive experiments on MSVD and MSR-VTT datasets demonstrate the efficacy of the proposed model.
[video, action, captioning, man, saat, visual, syntax, dataset, cider, captioner, attention, predicate, targeting, temporal, embedding, recurrent, encoding, sequence, description, natural, caption, word, evaluation, msvd, playing, sentence, prediction, multiple, regular, automatic, rouge, language, verb, describe, question, decoder, explicitly] [object, module, feature, predicted, global, semantic, guide, table, detector, wei, car] [model, subject, input, datasets] [guidance, based, proposed, method, existing, dynamic] [generated, representation, image, person, specific, generation, piece, generate, drawing, loss, target, tao, learns] [learning, neural, set, layer, training, baseline, learned, accuracy, performance, compared, memory] [scene, rgb]
@InProceedings{Zheng_2020_CVPR,
  author = {Zheng, Qi and Wang, Chaoyue and Tao, Dacheng},
  title = {Syntax-Aware Action Targeting for Video Captioning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Visual Emotion Representations From Web Data
Zijun Wei, Jianming Zhang, Zhe Lin, Joon-Young Lee, Niranjan Balasubramanian, Minh Hoai, Dimitris Samaras


We present a scalable approach for learning powerful visual features for emotion recognition. A critical bottleneck in emotion recognition is the lack of large-scale datasets that can be used for learning visual emotion features. To this end, we curate a web-derived large-scale dataset, StockEmotion, which has more than a million images. StockEmotion uses 690 emotion-related tags as labels, giving us a fine-grained and diverse set of emotion labels and circumventing the difficulty of manually obtaining emotion annotations. We use this dataset to train a feature extraction network, EmotionNet, which we further regularize using joint text and visual embedding and text distillation. Our experimental results establish that EmotionNet trained on the StockEmotion dataset outperforms SOTA models on four different visual emotion tasks. An added benefit of our joint embedding training approach is that EmotionNet achieves competitive zero-shot recognition performance against fully supervised baselines on the challenging visual emotion dataset EMOTIC, which further highlights the generalizability of the learned emotion features.
[emotion, keywords, visual, emotionnet, dataset, embedding, stockemotion, text, recognition, associated, stock, emotional, keyword, embeddings, affective, provide, previous, conveyed, language, emotic, work, evaluation] [feature, annotated, object, annotation, extra, category, predicted, sota, propose, table] [trained, datasets, model, facial, input] [extraction, ieee, pattern, scale, based, method, analysis] [image, loss, list, representation, train] [training, learning, performance, set, network, imagenet, classification, data, number, large, learned, label, classifier, manually, labeled, space, deep, general, subset, vector, small, basic, simple, layer, evaluate] [conference, computer, joint, vision, international, european, approach]
@InProceedings{Wei_2020_CVPR,
  author = {Wei, Zijun and Zhang, Jianming and Lin, Zhe and Lee, Joon-Young and Balasubramanian, Niranjan and Hoai, Minh and Samaras, Dimitris},
  title = {Learning Visual Emotion Representations From Web Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
The Edge of Depth: Explicit Constraints Between Segmentation and Depth
Shengjie Zhu, Garrick Brazil, Xiaoming Liu


In this work we study the mutual benefits of two common computer vision tasks, self-supervised depth estimation and semantic segmentation from images. For example, to help unsupervised monocular depth estimation, constraints from semantic segmentation have been explored implicitly, such as sharing and transforming features. In contrast, we propose to explicitly measure the border consistency between segmentation and depth and minimize it in a greedy manner by iteratively supervising the network towards a locally optimal solution. This is partially motivated by our observation that semantic segmentation, even trained with limited ground truth (200 images of KITTI), can offer a more accurate border than that of any (monocular or stereo) image-based depth estimation. Through extensive experiments, our proposed approach advances the state of the art on unsupervised monocular depth estimation on the KITTI benchmark.
[recognition, relationship, prediction, explicitly, work, pair, semantics] [segmentation, semantic, edge, map, object, occlusion, propose, background, predicted, mask, supervision, sota, foreground, occluded, table, improvement] [quality] [ieee, disparity, method, pattern, pixel, proposed, prior, figure, optical, flow] [loss, image, morphing, unsupervised, consistency, supervised, corresponding] [learning, network, performance, function, baseline, training, deep, set, proxy, denote, finetuning, strategy, log, arxiv, preprint] [depth, stereo, vision, conference, computer, monocular, estimation, point, morph, left, border, photometric, estimated, rmse, bleeding, rel, distance, define, matching, consistent, view, international, leveraging, local, kitti, surface, relative, continuous, ground, truth, normal]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Shengjie and Brazil, Garrick and Liu, Xiaoming},
  title = {The Edge of Depth: Explicit Constraints Between Segmentation and Depth},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Context-Aware Loss Function for Action Spotting in Soccer Videos
Anthony Cioppa, Adrien Deliege, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck, Rikke Gade, Thomas B. Moeslund


In video understanding, action spotting consists in temporally localizing human-induced events annotated with single timestamps. In this paper, we propose a novel loss function that specifically considers the temporal context naturally present around each action, rather than focusing on the single annotated frame to spot. We benchmark our loss on a large dataset of soccer videos, SoccerNet, and achieve an improvement of 12.8% over the baseline. We show the generalization capability of our loss for generic activity proposals and detection on ActivityNet, by spotting the beginning and the end of each activity. Furthermore, we provide an extended ablation study and display challenging cases for action spotting in soccer videos. Finally, we qualitatively illustrate how our loss induces a precise temporal understanding of actions and show how such semantic knowledge can be used for automatic highlights generation.
[action, temporal, spotting, recognition, video, soccer, frame, activity, context, soccernet, goal, activitynet, automatic, bmn, bernard, understanding, encoding, future, broadcast, visual, ngt] [segmentation, score, detection, annotated, proposal, challenging, feature, table, localization, module, location] [game, spot, player, university] [ieee, pattern, figure, june, based, raw, september, field, method, analysis] [loss, train, market, perform] [network, task, number, set, performance, class, function, precision, baseline, deep, equation, learning, average, vector, dimension] [conference, computer, vision, international, slicing, october, ground, european, closest, novel, human, provided, truth, single, define]
@InProceedings{Cioppa_2020_CVPR,
  author = {Cioppa, Anthony and Deliege, Adrien and Giancola, Silvio and Ghanem, Bernard and Droogenbroeck, Marc Van and Gade, Rikke and Moeslund, Thomas B.},
  title = {A Context-Aware Loss Function for Action Spotting in Soccer Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training
Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, Jianfeng Gao


Learning to navigate in a visual environment following natural-language instructions is a challenging task, because the multimodal inputs to the agent are highly variable, and the training data on a new task is often limited. In this paper, we present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions. It can be easily used as a drop-in for existing VLN frameworks, leading to the proposed agent PREVALENT. It learns more effectively in new tasks and generalizes better in a previously unseen environment. The performance is validated on three VLN tasks. On the Room-to-Room benchmark, our model improves the state-of-the-art from 47% to 51% on success rate weighted by path length. Further, the learned representation is transferable to other VLN tasks. On two recent tasks, vision-and-dialog navigation and "Help, Anna!", the proposed PREVALENT leads to significant improvement over existing methods, achieving a new state of the art.
[agent, navigation, language, visual, revalent, instruction, spl, cvdn, vln, trajectory, action, dataset, three, bert, embedding, environment, word, navigate, natural, navigator, attention, panoramic, embeddings, sequence, progress, length, lmlm, oracle, step, downstream, hanna, dialog, policy, reinforcement, text, goal, jianfeng, multimodal, previous, attend] [table, illustration, interactive] [model, trained, generic, generalization] [based, proposed, existing, figure, output] [unseen, encoder, image, representation, masked, consists, target] [learning, training, layer, indicates, validation, path, data, better, performance, test, arxiv, preprint, task, rate, number, size, mixed, consider] [joint, position, full]
@InProceedings{Hao_2020_CVPR,
  author = {Hao, Weituo and Li, Chunyuan and Li, Xiujun and Carin, Lawrence and Gao, Jianfeng},
  title = {Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Video Instance Segmentation Tracking With a Modified VAE Architecture
Chung-Ching Lin, Ying Hung, Rogerio Feris, Linglin He


We propose a modified variational autoencoder (VAE) architecture built on top of Mask R-CNN for instance-level video segmentation and tracking. The method builds a shared encoder and three parallel decoders, yielding three disjoint branches for predictions of future frames, object detection boxes, and instance segmentation masks. To effectively solve multiple learning tasks, we introduce a Gaussian Process model to enhance the statistical representation of the VAE by relaxing the strong prior assumption of independent and identically distributed (iid) latent variables made by conventional VAEs and allowing potential correlations among the extracted latent variables. The network learns embedded spatial interdependence and motion continuity in video data and creates a representation that is effective for producing high-quality segmentation masks and tracking multiple instances in diverse and unstructured videos. Evaluation on a variety of recently introduced datasets shows that our model outperforms previous methods and achieves new best-in-class performance.
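Purely as a structural sketch of the shared-encoder/three-decoder layout (all layer sizes are hypothetical placeholders, and the Gaussian Process prior on the latent variables described above is not modelled here):

import torch
import torch.nn as nn

class SharedEncoderThreeHeads(nn.Module):
    def __init__(self, in_ch=3, latent=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent, 3, stride=2, padding=1), nn.ReLU(),
        )
        def head(out_ch):
            return nn.Sequential(
                nn.ConvTranspose2d(latent, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
            )
        self.future_frame = head(3)   # next-frame RGB prediction
        self.boxes = head(4)          # per-pixel box offsets (illustrative encoding)
        self.masks = head(1)          # instance mask logits

    def forward(self, x):
        z = self.encoder(x)           # one shared representation feeds all three branches
        return self.future_frame(z), self.boxes(z), self.masks(z)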
[video, dataset, three, multiple, decoder, visual, frame, evaluation, recognition] [object, segmentation, mask, instance, tracking, branch, detection, box, vist, bounding, proposal, correlation, table, masktrack, false, semantic, challenge, smotsa, propose, association, track, achieves, framework] [model, auxiliary, datasets, strong] [ieee, pattern, method, spatial, proposed, motion, figure, prior, based] [latent, vae, variational, encoder, augment, loss, unsupervised, perform, autoencoder, representation, extracted, produce, image] [network, learning, performance, online, architecture, task, set, arxiv, preprint, data, evaluate, process, training, number] [computer, conference, vision, kitti, european, international, matching]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Chung-Ching and Hung, Ying and Feris, Rogerio and He, Linglin},
  title = {Video Instance Segmentation Tracking With a Modified VAE Architecture},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deformation-Aware Unpaired Image Translation for Pose Estimation on Laboratory Animals
Siyuan Li, Semih Gunel, Mirela Ostrek, Pavan Ramdya, Pascal Fua, Helge Rhodin


Our goal is to capture the pose of real animals using synthetic training examples, without using any manual supervision. Our focus is on neuroscience model organisms, to be able to study how neural circuits orchestrate behaviour. Human pose estimation attains remarkable accuracy when trained on real or simulated datasets consisting of millions of frames. However, for many applications simulated models are unrealistic and real training datasets with comprehensive annotations do not exist. We address this problem with a new sim2real domain transfer method. Our key contribution is the explicit and independent modeling of appearance, shape and pose in an unpaired image translation framework. Our model lets us train a pose estimator on the target domain by transferring readily available body keypoint locations from the source domain to generated target images. We compare our approach with existing domain transfer methods and demonstrate improved pose estimation accuracy on Drosophila melanogaster (fruit fly), Caenorhabditis elegans (worm) and Danio rerio (zebrafish), without requiring any manual annotation on the target domain and despite using simplistic off-the-shelf animal characters for simulation, or simple geometric shapes as models. Our new datasets, code and trained models will be published to support future computer vision and neuroscientific studies.
[explicit, three, character] [segmentation, global, zebrafish, mask, threshold, annotation] [adversarial, model, input, animal, trained] [method, field, spatial, figure, existing, stn, motion, output, convolutional, intermediate] [image, domain, target, source, synthetic, real, transfer, unpaired, translation, style, drosophila, train, supervised, realistic, loss, generated, unsupervised, appearance, generator, discriminator, worm, cycle, synthesized, generate, fly, paired, representation] [training, network, large, neural, learning, deep, test, vector, manual, accuracy, simple, small] [pose, deformation, estimation, human, shape, keypoint, silhouette, keypoints, deformed, error, estimator, approach, michael, capture, compare, well, simulated, local, directly, transformation]
@InProceedings{Li_2020_CVPR,
  author = {Li, Siyuan and Gunel, Semih and Ostrek, Mirela and Ramdya, Pavan and Fua, Pascal and Rhodin, Helge},
  title = {Deformation-Aware Unpaired Image Translation for Pose Estimation on Laboratory Animals},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ZeroQ: A Novel Zero Shot Quantization Framework
Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, Kurt Keutzer


Quantization is a promising approach for reducing the inference time and memory footprint of neural networks. However, most existing quantization methods require access to the original training dataset for retraining during quantization. This is often not possible for applications with sensitive or proprietary data, e.g., due to privacy and security concerns. Existing zero-shot quantization methods use different heuristics to address this, but they result in poor performance, especially when quantizing to ultra-low precision. Here, we propose ZeroQ, a novel zero-shot quantization framework to address this. ZeroQ enables mixed-precision quantization without any access to the training or validation data. This is achieved by optimizing for a Distilled Dataset, which is engineered to match the statistics of batch normalization across different layers of the network. ZeroQ supports both uniform and mixed-precision quantization. For the latter, we introduce a novel Pareto frontier based method to automatically determine the mixed-precision bit setting for all layers, with no manual search involved. We extensively test our proposed method on a diverse set of models, including ResNet18/50/152, MobileNetV2, ShuffleNet, SqueezeNext, and InceptionV3 on ImageNet, as well as RetinaNet-ResNet50 on the Microsoft COCO dataset. In particular, we show that ZeroQ can achieve 1.71% higher accuracy on MobileNetV2, as compared to the recently proposed DFQ method. Importantly, ZeroQ has a very low computational overhead, and it can finish the entire quantization process in less than 30s (0.5% of one epoch training time of ResNet50 on ImageNet). We have open-sourced the ZeroQ framework (https://github.com/amirgholami/ZeroQ).
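A minimal sketch of the Distilled Dataset idea, assuming a standard torchvision model: random inputs are optimized so that the activation statistics entering each BatchNorm layer match that layer's stored running statistics. The Pareto-frontier bit assignment is not shown, and the batch size, iteration count and learning rate are illustrative.

import torch
import torchvision

def distill_data(model, batch_size=32, iters=200, lr=0.1, size=224):
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)        # only the synthetic inputs are optimized
    stats_loss = []                    # refilled by the hooks on every forward pass

    def make_hook(bn):
        def hook(module, inp, out):
            x = inp[0]                 # activation entering this BatchNorm layer
            mu = x.mean(dim=(0, 2, 3))
            var = x.var(dim=(0, 2, 3), unbiased=False)
            stats_loss.append(((mu - bn.running_mean) ** 2).mean()
                              + ((var - bn.running_var) ** 2).mean())
        return hook

    handles = [m.register_forward_hook(make_hook(m))
               for m in model.modules() if isinstance(m, torch.nn.BatchNorm2d)]
    x = torch.randn(batch_size, 3, size, size, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(iters):
        stats_loss.clear()
        opt.zero_grad()
        model(x)                                   # hooks collect per-layer mismatches
        torch.stack(stats_loss).sum().backward()   # match input stats to BN stats
        opt.step()
    for h in handles:
        h.remove()
    return x.detach()

# e.g. distilled = distill_data(torchvision.models.resnet18(pretrained=True))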
[dataset, microsoft, work, time, amir] [object, achieves, table, detection, framework, propose, coco, retinanet] [model, sensitivity, access, original, input, sensitive] [method, based, range, figure, proposed, convolutional, ieee, low, pattern, gaussian] [address, image, perform] [quantization, ero, data, training, precision, neural, distilled, accuracy, layer, bit, size, quantized, arxiv, preprint, pareto, quantizing, frontier, higher, compared, deep, configuration, efficient, note, activation, search, computational, network, imagenet, weight, kurt, batch, dfq, performance, clipping, normalization, mixedprecision, test, achieve, setting, epoch, learning, mentioned, optimization] [conference, computer, vision, approach, compute, novel, determine, international, well]
@InProceedings{Cai_2020_CVPR,
  author = {Cai, Yaohui and Yao, Zhewei and Dong, Zhen and Gholami, Amir and Mahoney, Michael W. and Keutzer, Kurt},
  title = {ZeroQ: A Novel Zero Shot Quantization Framework},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Disparity-Aware Domain Adaptation in Stereo Image Restoration
Bo Yan, Chenxi Ma, Bahetiyaer Bare, Weimin Tan, Steven C. H. Hoi


Under stereo settings, the problems of disparity estimation, stereo magnification and stereo-view synthesis have gathered wide attention. However, the limited image quality brings non-negligible difficulties in developing related applications and becomes the main bottleneck of stereo images. To the best of our knowledge, stereo image restoration is rarely studied. Towards this end, this paper analyses how to effectively explore disparity information, and proposes a unified stereo image restoration framework. The proposed framework explicitly learns the inherent pixel correspondence between stereo views and restores stereo images with cross-view information at the image and feature levels. A Feature Modulation Dense Block (FMDB) is introduced to insert the disparity prior throughout the whole network. Experiments in terms of efficiency, objective and perceptual quality, and the accuracy of depth estimation demonstrate the superiority of the proposed framework on various stereo image restoration tasks.
[video, attention, recognition, explore, modulation, provide, dataset, multiple, time, considering, pair] [feature, table, level, framework, ablation, unified] [model, quality, study, noise] [disparity, figure, proposed, restoration, binocular, etm, pixel, flow, ieee, pattern, prior, spatial, parallax, deblurring, convolution, reference, based, etb, passrnet, imaging, restored, block, stereoirn, ldisp, comparison, denoising, restores, perceptual, degraded, exploiting, ldisacc, convolutional] [image, etd, loss, train, utilize] [network, accuracy, deep, learning, better, best, compared, process, performance] [stereo, computer, vision, conference, monocular, left, view, accurate, estimation, single, correspondence, dense, scene, reconstruction, demonstrate, ground, truth, approach, structure]
@InProceedings{Yan_2020_CVPR,
  author = {Yan, Bo and Ma, Chenxi and Bare, Bahetiyaer and Tan, Weimin and Hoi, Steven C. H.},
  title = {Disparity-Aware Domain Adaptation in Stereo Image Restoration},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Offset Bin Classification Network for Accurate Object Detection
Heqian Qiu, Hongliang Li, Qingbo Wu, Hengcan Shi


Object detection combines object classification and object localization problems. Most existing object detection methods locate objects by leveraging regression networks trained with the Smooth L1 loss function to predict offsets between candidate boxes and objects. However, this loss function applies the same penalty to different samples with large errors, which results in suboptimal regression networks and inaccurate offsets. In this paper, we propose an offset bin classification network optimized with a cross entropy loss to predict more accurate offsets. It not only provides different penalties for different samples but also avoids the gradient explosion problem caused by samples with large errors. Specifically, we discretize the continuous offset into a number of bins, and predict the probability of each offset bin. Furthermore, we propose an expectation-based offset prediction and a hierarchical focusing method to improve the prediction precision. Extensive experiments on the PASCAL VOC and MS-COCO datasets demonstrate the effectiveness of our proposed method. Our method outperforms the baseline methods by a large margin.
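A minimal sketch of the bin-classification idea with expectation-based decoding; the bin range [-1, 1] and K = 20 are illustrative placeholders rather than the paper's settings.

import torch
import torch.nn.functional as F

K, LO, HI = 20, -1.0, 1.0
centers = LO + (torch.arange(K) + 0.5) * (HI - LO) / K   # bin centers

def offsets_to_bins(offsets):
    """Map continuous offsets in [LO, HI] to integer bin labels in [0, K-1]."""
    idx = ((offsets - LO) / (HI - LO) * K).long()
    return idx.clamp(0, K - 1)

def bin_classification_loss(logits, offsets):
    """logits: (N, K) predicted bin scores; offsets: (N,) ground-truth offsets."""
    return F.cross_entropy(logits, offsets_to_bins(offsets))

def expected_offset(logits):
    """Expectation-based decoding: sum_k p_k * center_k."""
    return (logits.softmax(dim=-1) * centers).sum(dim=-1)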
[prediction, hierarchical, predict, dataset] [offset, object, bin, detection, bounding, regression, box, focusing, effectiveness, localization, pascal, predicted, cascade, propose, table, voc, hongliang, qingbo, faster, stage, hengcan, fpn, iou, ross, kaiming, king, fanman, expectationbased, precise] [representative, improve] [method, proposed, ieee, figure, range, pattern, based, scale, comparison, discretized, high] [loss, image, cross, train, common] [network, set, number, performance, candidate, baseline, large, function, gradient, probability, achieve, problem, discrete, compared, test, classification, distribution, learning, entropy] [computer, conference, vision, accurate, smooth, international, continuous, second]
@InProceedings{Qiu_2020_CVPR,
  author = {Qiu, Heqian and Li, Hongliang and Wu, Qingbo and Shi, Hengcan},
  title = {Offset Bin Classification Network for Accurate Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TBT: Targeted Neural Network Attack With Bit Trojan
Adnan Siraj Rakin, Zhezhi He, Deliang Fan


Security of modern Deep Neural Networks (DNNs) is under severe scrutiny as the deployment of these models becomes widespread in many intelligence-based applications. Most recently, DNNs have been attacked through Trojans, which can effectively infect the model during the training phase and get activated only through specific input patterns (i.e., a trigger) during inference. In this work, for the first time, we propose a novel Targeted Bit Trojan (TBT) method, which can insert a targeted neural Trojan into a DNN through a bit-flip attack. Our algorithm efficiently generates a trigger specifically designed to locate certain vulnerable bits of DNN weights stored in main memory (i.e., DRAM). The objective is that once the attacker flips these vulnerable bits, the network still operates with normal inference accuracy on benign inputs. However, when the attacker activates the trigger by embedding it in any input, the network is forced to classify all inputs to a certain target class. We demonstrate that flipping only several vulnerable bits identified by our method, using available bit-flip techniques (i.e., row-hammer), can transform a fully functional DNN model into a Trojan-infected model. We perform extensive experiments on the CIFAR-10, SVHN and ImageNet datasets with both VGG-16 and ResNet-18 architectures. Our proposed TBT can cause 92% of test images to be classified to a target class with as few as 84 bit-flips out of 88 million weight bits on ResNet-18 for the CIFAR-10 dataset.
[previous, work, step, dataset] [table, main, area, level, location] [attack, trojan, trigger, dnn, model, attacker, targeted, input, tbt, vulnerable, adversarial, clean, asr, success, access, noise, security, flip, inject, threat, tap, badnet, flipping, supply, insertion, identify, classified] [proposed, method, figure, designed, ieee, output, comparison] [target, specific, generate, generation, image, perform, row] [neural, test, training, weight, network, accuracy, data, bit, class, number, memory, deep, inference, layer, size, arxiv, preprint, imagenet, stored, svhn, small, popular, higher, learning, rate, parameter, amount, quantized, baseline, identified, computing, batch, practical] [computer, conference, assume, international]
@InProceedings{Rakin_2020_CVPR,
  author = {Rakin, Adnan Siraj and He, Zhezhi and Fan, Deliang},
  title = {TBT: Targeted Neural Network Attack With Bit Trojan},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Maintaining Discrimination and Fairness in Class Incremental Learning
Bowen Zhao, Xi Xiao, Guojun Gan, Bin Zhang, Shu-Tao Xia


Deep neural networks (DNNs) have been applied in class incremental learning, which aims to solve common real-world problems of learning new classes continually. One drawback of standard DNNs is that they are prone to catastrophic forgetting. Knowledge distillation (KD) is a commonly used technique to alleviate this problem. In this paper, we demonstrate it can indeed help the model to output more discriminative results within old classes. However, it cannot alleviate the problem that the model tends to classify objects into new classes, causing the positive effect of KD to be hidden and limited. We observed that an important factor causing catastrophic forgetting is that the weights in the last fully connected (FC) layer are highly biased in class incremental learning. In this paper, we propose a simple and effective solution motivated by the aforementioned observations to address catastrophic forgetting. Firstly, we utilize KD to maintain the discrimination within old classes. Then, to further maintain the fairness between old classes and new classes, we propose Weight Aligning (WA) that corrects the biased weights in the FC layer after normal training process. Unlike previous work, WA does not require any extra parameters or a validation set in advance, as it utilizes the information provided by the biased weights themselves. The proposed method is evaluated on ImageNet-1000, ImageNet-100, and CIFAR-100 under various settings. Experimental results show that the proposed method can effectively alleviate catastrophic forgetting and significantly outperform state-of-the-art methods.
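A minimal sketch of the Weight Aligning step described above, assuming the last FC layer's weight matrix is ordered with old classes first: the new-class weight vectors are rescaled so that their average norm matches that of the old classes.

import torch

def weight_align(fc_weight, num_old_classes):
    """fc_weight: (num_classes, feat_dim) weight matrix of the last FC layer."""
    with torch.no_grad():
        norms = fc_weight.norm(dim=1)                            # per-class weight norms
        gamma = norms[:num_old_classes].mean() / norms[num_old_classes:].mean()
        fc_weight[num_old_classes:] *= gamma                     # rescale new-class weights
    return fc_weight

# e.g. weight_align(model.fc.weight, num_old_classes=50) after training on the new task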
[step, previous, correct, recognition] [positive, feature, biased, table, achieves, alleviate, utilizes, fish] [model, trained, effective, help, norm, combined, tend] [method, output, figure, proposed, ieee, based, pattern, analysis, extraction, called] [loss, discrimination, aligning, maintaining, learn, train, factor, cat, generative] [incremental, class, learning, weight, distillation, knowledge, data, layer, catastrophic, performance, training, set, bias, impact, logits, forgetting, rehearsal, neural, better, cold, average, fairness, problem, deep, maintain, selection, imbalance, simple, test, strategy, large, number, baseline, reported, compared, continual, validation, bic] [conference, computer, vision, term, solution, additional]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Bowen and Xiao, Xi and Gan, Guojun and Zhang, Bin and Xia, Shu-Tao},
  title = {Maintaining Discrimination and Fairness in Class Incremental Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Background Data Resampling for Outlier-Aware Classification
Yi Li, Nuno Vasconcelos


The problem of learning an image classifier that allows detection of out-of-distribution (OOD) examples, with the help of auxiliary background datasets, is studied. While training with background data has been shown to improve OOD detection performance, the optimal choice of such a dataset remains an open question, and challenges of data imbalance and computational complexity make it a potentially inefficient or even impractical solution. Aiming to balance efficiency and detection quality, a dataset resampling approach is proposed for obtaining a compact yet representative set of background data points. The resampling algorithm takes inspiration from prior work on hard negative mining, performing an iterative adversarial weighting on the background examples and using the learned weights to obtain the subset of desired size. Experiments on different datasets, model architectures and training strategies validate the universal effectiveness and efficiency of adversarially resampled background data. Code is available at https://github.com/JerryYLi/bg-resample-ood.
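A much-simplified stand-in for the resampling step (not the paper's iterative adversarial weighting): background examples are scored with the current classifier, and the ones that look most in-distribution, i.e. the hardest outliers, are kept.

import torch

@torch.no_grad()
def select_hard_background(model, background_loader, subset_size):
    scores, batches = [], []
    for x, _ in background_loader:
        probs = model(x).softmax(dim=-1)
        scores.append(probs.max(dim=-1).values)   # high confidence on an outlier = hard example
        batches.append(x)
    scores = torch.cat(scores)
    data = torch.cat(batches)
    hardest = scores.topk(min(subset_size, len(scores))).indices
    return data[hardest]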
[dataset, recognition, lin, work, step, time] [background, detection, table, positive, object, hard, confidence] [datasets, trained, tiny, model, adversarial, example, effective, auxiliary, improve] [figure, proposed, ieee, pattern] [loss, image] [ood, data, training, resampling, resampled, classifier, learning, performance, set, test, rate, optimal, distribution, subset, sampling, problem, negative, learned, large, uniform, classification, deep, weight, min, neural, objective, update, lout, arg, aupr, random, accuracy, class, softmax, size, space, standard, optimization, efficiency, algorithm, storage, selection, sample, selecting, network, misclassified, alternative, pool, higher, log, posterior] [conference, computer, vision, international, solution, approach, full]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yi and Vasconcelos, Nuno},
  title = {Background Data Resampling for Outlier-Aware Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
STEFANN: Scene Text Editor Using Font Adaptive Neural Network
Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal


Textual information in a captured scene plays an important role in scene interpretation and decision making. Though there exist methods that can successfully detect and interpret complex text regions present in a scene, to the best of our knowledge, there is no significant prior work that aims to modify the textual information in an image. The ability to edit text directly on images has several advantages including error correction, text restoration and image reusability. In this paper, we propose a method to modify text in an image at character-level. We approach the problem in two stages. At first, the unobserved character (target) is generated from an observed character (source) being modified. We propose two different neural network architectures - (a) FANnet to achieve structural consistency with source font and (b) Colornet to preserve source color. Next, we replace the source character with the generated character maintaining both geometric and visual consistency with neighboring characters. Our method works as a unified platform for modifying text in images. We present the effectiveness of our method on COCO-Text and ICDAR datasets both qualitatively and quantitatively.
[character, text, recognition, observed, visual, multiple, work, natural] [region, seam, bounding, ablation, apply, table, propose] [model, input, quality, original] [color, ieee, figure, proposed, pattern, output, method, based, analysis, convolution, comparison, adaptive] [image, source, target, font, transfer, generated, fannet, generation, colornet, edited, stefann, generate, generative, edit, assisted, assim, editing, synthesis, perform, project, style, row, naptha, train, structural] [network, neural, binary, architecture, learning, layer, size, algorithm, deep, select, best, problem, number, random, set, training, design, entire] [scene, conference, computer, vision, international, error, perspective, directly, carving]
@InProceedings{Roy_2020_CVPR,
  author = {Roy, Prasun and Bhattacharya, Saumik and Ghosh, Subhankar and Pal, Umapada},
  title = {STEFANN: Scene Text Editor Using Font Adaptive Neural Network},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Geometry and Learning Co-Supported Normal Estimation for Unstructured Point Cloud
Haoran Zhou, Honghua Chen, Yidan Feng, Qiong Wang, Jing Qin, Haoran Xie, Fu Lee Wang, Mingqiang Wei, Jun Wang


In this paper, we propose a normal estimation method for unstructured point clouds. We observe that geometric estimators commonly focus more on feature preservation but require careful parameter tuning and are sensitive to noise, while learning-based approaches pursue overall normal estimation accuracy but cannot handle challenging regions such as surface edges well. This paper presents a novel normal estimation method under the co-support of a geometric estimator and deep learning. To lower the learning difficulty, we first propose to compute a suboptimal initial normal at each point by searching for the best fitting patch. Based on the computed normal field, we design a normal-based height map network (NH-Net) to fine-tune the suboptimal normals. Qualitative and quantitative evaluations demonstrate clear improvements of our results over both traditional methods and learning-based methods, in terms of estimation accuracy and feature recovery.
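For reference, the standard PCA baseline that the geometric side of such a pipeline typically starts from: fit a plane to the k nearest neighbours of each point and take the direction of smallest variance as the normal. The paper's best-fitting-patch search and the NH-Net refinement are not reproduced, and k = 16 is an arbitrary neighbourhood size.

import numpy as np
from scipy.spatial import cKDTree

def pca_normals(points, k=16):
    """points: (N, 3) float array. Returns (N, 3) unit normals (sign is ambiguous)."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    normals = np.empty_like(points)
    for i, nbrs in enumerate(idx):
        patch = points[nbrs] - points[nbrs].mean(axis=0)
        # the smallest principal axis of the local patch is the normal direction
        _, _, vt = np.linalg.svd(patch, full_matrices=False)
        normals[i] = vt[-1]
    return normals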
[time, dataset, three, visual, constructed] [module, propose, final, pcv, feature, benchmark] [input, noise, improving, robust] [patch, method, noisy, sharp, based, denoising, ieee, figure, comparison, bilateral, proposed, traditional, pattern] [filtered, synthetic, corresponding, jun, learn, target] [network, learning, set, scheme, matrix, selection, candidate, training, deep, data, parameter, size, neural, architecture, vector, selected, better] [normal, point, estimation, suboptimal, fitting, plane, local, computer, geometric, surface, neighborhood, cloud, reconstruction, error, estimator, unstructured, estimated, gathering, hmp, computed, pca, houghcnn, conference, scanned, smooth, mesh, pcpnet, geometry, compute, initial, estimating, pointnet, approach, estimate]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Haoran and Chen, Honghua and Feng, Yidan and Wang, Qiong and Qin, Jing and Xie, Haoran and Wang, Fu Lee and Wei, Mingqiang and Wang, Jun},
  title = {Geometry and Learning Co-Supported Normal Estimation for Unstructured Point Cloud},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Sequential Motif Profiles and Topological Plots for Offline Signature Verification
Elias N. Zois, Evangelos Zervas, Dimitrios Tsourounis, George Economou


In spite of the overwhelming high-tech marvels and applications that rule our digital lives, the handwritten signature is still recognized worldwide by government, personal and legal entities as the most important behavioral biometric trait. A number of notable research approaches provide advanced results up to a certain point, which allows us to assert with confidence that the performance attained by signature verification (SV) systems is comparable to that provided by any other biometric modality. Up to now, the mainstream trend for offline SV is shared between standard (or handcrafted) feature extraction methods and popular machine learning techniques, with typical examples ranging from sparse representation to Deep Learning. Recent progress in graph mining algorithms provides the prospect of re-evaluating the opportunity of utilizing graph representations by exploring corresponding graph features for offline SV. In this paper, inspired by the recent use of image visibility graphs for mapping images into networks, we introduce for the first time in the offline SV literature their use as a parameter-free, agnostic representation for exploring both global and local information. Global properties of the sparsely located content of the shape of the signature image are encoded with topological information of the whole graph. In addition, local pixel patches are encoded by sequential visibility motifs (subgraphs of size four) into a low, six-dimensional motif profile vector. A number of pooling functions operate on the motif codes in a spatial pyramid context in order to create the final feature vector. The effectiveness of the proposed method is evaluated with the use of two popular datasets. The local visibility graph features are considered to be highly informative for SV; this is sustained by the corresponding results, which are at least comparable with other classic state-of-the-art approaches.
[order, sequential, graph, handwritten, recognition, natural, time, sequence, provide, static, automatic, exploring] [feature, global, table, horizontal, pooling, hard] [signature, verification, visibility, offline, degree, hvg, motif, cedar, case, genuine, writer, derived, datasets, vgs, series, biometric, pfrr, kaze, topological] [pattern, patch, based, ieee, analysis, method, coding, extraction, journal, spatial, proposed, signal, figure] [image, corresponding, representation, mapping, edit, specific, code] [number, size, set, learning, average, machine, popular, deep, data, vector, random, performance, entire, equal, diagonal, small] [local, international, conference, defined, error, sparse, well, computer, vision, distance]
@InProceedings{Zois_2020_CVPR,
  author = {Zois, Elias N. and Zervas, Evangelos and Tsourounis, Dimitrios and Economou, George},
  title = {Sequential Motif Profiles and Topological Plots for Offline Signature Verification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Optical Flow in Dense Foggy Scenes Using Semi-Supervised Learning
Wending Yan, Aashish Sharma, Robby T. Tan


In dense foggy scenes, existing optical flow methods are erroneous. This is due to the degradation caused by dense fog particles that break the optical flow basic assumptions such as brightness and gradient constancy. To address the problem, we introduce a semi-supervised deep learning technique that employs real fog images without optical flow ground-truths in the training process. Our network integrates the domain transformation and optical flow networks in one framework. Initially, given a pair of synthetic fog images, its corresponding clean images and optical flow ground-truths, in one training batch we train our network in a supervised manner. Subsequently, given a pair of real fog images and a pair of clean images that are not corresponding to each other (unpaired), in the next training batch, we train our network in an unsupervised manner. We then alternate the training of synthetic and real data iteratively. We use real data without ground-truths, since to have ground-truths in such conditions is intractable, and also to avoid the overfitting problem of synthetic data training, where the knowledge learned on synthetic data cannot be generalized to real data testing. Together with the network architecture design, we propose a new training strategy that combines supervised synthetic-data training and unsupervised real-data training. Experimental results show that our method is effective and outperforms the state-of-the-art methods in estimating optical flow in dense foggy scenes.
[pair, evaluation, previous, work, decoder, dataset] [module, pyramid, fully, predicted, stage, mask, propose] [clean, input, model, trained] [flow, optical, fog, method, ieee, foggy, pattern, epe, defogging, hazeline, existing, pwcnet, based, result, light, atmospheric, figure, chromaticity, constancy, color, dehazing, robby, transform, berman] [real, synthetic, image, domain, loss, train, unsupervised, consistency, corresponding, supervised, discriminative, encoders, generated, generate, learn, qualitative, vkitti] [training, network, data, learning, deep, architecture, performance, observe, problem, strategy, process] [conference, computer, rendered, transformation, dense, vision, dof, estimation, single, international, estimated, scene, handle, estimate, photometric, compute, accurate, cost, define, second]
@InProceedings{Yan_2020_CVPR,
  author = {Yan, Wending and Sharma, Aashish and Tan, Robby T.},
  title = {Optical Flow in Dense Foggy Scenes Using Semi-Supervised Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
A Spatial RNN Codec for End-to-End Image Compression
Chaoyi Lin, Jiabao Yao, Fangdong Chen, Li Wang


Recently, deep learning has been explored as a promising direction for image compression. Removing the spatial redundancy of the image is crucial for image compression, and most learning-based methods focus on removing the redundancy between adjacent pixels. Intuitively, exploring a larger pixel range beyond adjacent pixels is beneficial for removing this redundancy. In this paper, we propose a fast yet effective method for end-to-end image compression by incorporating a novel spatial recurrent neural network. A block-based LSTM is utilized to remove the redundant information between adjacent pixels and blocks. Moreover, the proposed method is potentially efficient, since parallel computation on individual blocks is possible. Experimental results demonstrate that the proposed model outperforms state-of-the-art traditional image compression standards and learning-based image compression models in terms of both PSNR and MS-SSIM metrics. It provides a 26.73% bit reduction over High Efficiency Video Coding (HEVC), which is the current official state-of-the-art video codec.
[decoder, recurrent, rnn, step, lstm, video, context, represent, decoding, dependency] [represents, correlation, map, table, propose, fully, sigmoid, redundant, side] [model, input, improve, highly] [compression, based, block, figure, hyperprior, proposed, adjacent, spatial, method, transform, coding, adaptive, high, bpg, lossy, output, channel, removing, parallel, psnr, convolution, brnn, ieee, pixel, adopted, gaussian, analysis, frequency, convolutional, hrnn, traditional, minnen, arithmetic] [image, latent, representation, encoder, synthesis] [quantization, network, entropy, neural, learning, architecture, redundancy, size, layer, number, quantized, process, deep, performance, average, efficiency, variance, hyper, function, training, metric, standard, processing, distribution] [estimate, joint]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Chaoyi and Yao, Jiabao and Chen, Fangdong and Wang, Li},
  title = {A Spatial RNN Codec for End-to-End Image Compression},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Object Relational Graph With Teacher-Recommended Learning for Video Captioning
Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, Zheng-Jun Zha


Taking full advantage of the information from both vision and language is critical for the video captioning task. Existing models lack adequate visual representation due to the neglect of interactions between objects, and lack sufficient training for content-related words due to long-tailed problems. In this paper, we propose a complete video captioning system including both a novel model and an effective training strategy. Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich the visual representation. Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of a successful external language model (ELM) to integrate abundant linguistic knowledge into the caption model. The ELM generates more semantically similar word proposals which extend the ground-truth words used for training to deal with the long-tailed problem. Experimental evaluations on three benchmarks (MSVD, MSR-VTT and VATEX) show that the proposed ORG-TRL system achieves state-of-the-art performance. Extensive ablation studies and visualizations illustrate the effectiveness of our system.
[video, graph, language, visual, relational, temporal, caption, word, trl, captioning, attention, linguistic, dataset, org, tel, msvd, extract, explore, frame, decoding, natural, vatex, step, hidden, bidirectional, relation, hierarchical, cap] [object, wei, propose, feature, effectiveness, global] [model, external] [ieee, pattern, cvpr, proposed, learnable, based, motion, method, fusion, spatial, output, figure] [learn, generate, image, appearance, pretrained, encoder, corresponding, representation, common, generation, aligned] [training, learning, knowledge, probability, neural, distribution, baseline, soft, number, compared, task, large, set, problem, machine, test, performance] [conference, vision, computer, elm, detailed, international, novel]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Ziqi and Shi, Yaya and Yuan, Chunfeng and Li, Bing and Wang, Peijin and Hu, Weiming and Zha, Zheng-Jun},
  title = {Object Relational Graph With Teacher-Recommended Learning for Video Captioning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MMTM: Multimodal Transfer Module for CNN Fusion
Hamid Reza Vaezi Joze, Amirreza Shaban, Michael L. Iuzzolino, Kazuhito Koishida


In late fusion, each modality is processed in a separate unimodal Convolutional Neural Network (CNN) stream and the scores of each modality are fused at the end. Due to its simplicity, late fusion is still the predominant approach in many state-of-the-art multimodal applications. In this paper, we present a simple neural network module for leveraging the knowledge from multiple modalities in convolutional neural networks. The proposed unit, named Multimodal Transfer Module (MMTM), can be added at different levels of the feature hierarchy, enabling slow modality fusion. Using squeeze and excitation operations, MMTM utilizes the knowledge of multiple modalities to recalibrate the channel-wise features in each CNN stream. Unlike other intermediate fusion methods, the proposed module can be used for feature modality fusion in convolution layers with different spatial dimensions. Another advantage of the proposed method is that it can be added among unimodal branches with minimal changes to their network architectures, allowing each branch to be initialized with existing pretrained weights. Experimental results show that our framework improves the recognition accuracy of well-known multimodal networks. We demonstrate state-of-the-art or competitive performance on four datasets that span the task domains of dynamic hand gesture recognition, speech enhancement, and action recognition with RGB and body joints.
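A compact sketch of the squeeze-and-excitation style fusion for two streams: each feature map is squeezed by global average pooling, the two squeezed vectors are fused through a joint FC layer, and each stream is then recalibrated channel-wise by its own FC + sigmoid. Channel sizes and the bottleneck ratio are placeholders, and the exact excitation scaling used in the paper may differ.

import torch
import torch.nn as nn

class MMTMSketch(nn.Module):
    def __init__(self, dim_a, dim_b, ratio=4):
        super().__init__()
        dim_joint = (dim_a + dim_b) // ratio
        self.fc_joint = nn.Linear(dim_a + dim_b, dim_joint)
        self.fc_a = nn.Linear(dim_joint, dim_a)
        self.fc_b = nn.Linear(dim_joint, dim_b)

    def forward(self, feat_a, feat_b):
        # feat_a: (N, C_a, H, W), feat_b: (N, C_b, H', W') -- spatial sizes may differ
        sq_a = feat_a.mean(dim=(2, 3))                      # squeeze each stream
        sq_b = feat_b.mean(dim=(2, 3))
        joint = torch.relu(self.fc_joint(torch.cat([sq_a, sq_b], dim=1)))
        exc_a = torch.sigmoid(self.fc_a(joint))[:, :, None, None]
        exc_b = torch.sigmoid(self.fc_b(joint))[:, :, None, None]
        return feat_a * exc_a, feat_b * exc_b               # channel-wise recalibration

# e.g. fused_rgb, fused_depth = MMTMSketch(256, 256)(rgb_feat, depth_feat)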
[multimodal, recognition, gesture, mmtm, speech, late, action, dataset, visual, unimodal, video, audio, hcn, modality, stream, multiple, avse, temporal, recalibrate, mmtms, spatiotemporal, skeleton, modal, slow, work, outperforms, connected] [feature, module, cnn, table, global, level, improves] [input, model] [fusion, method, spatial, convolutional, squeeze, proposed, excitation, intermediate, figure, flow, enhancement, convolution, dynamic, optical, ieee, resblock, fused, based] [representation, transfer] [network, layer, performance, neural, learning, architecture, deep, number, training, accuracy, data, design, operation, processing, objective, andrew, gating, evaluate, baseline, arxiv, preprint, simple, large, randomly] [hand, rgb, human, approach, depth, joint, pose, conference]
@InProceedings{Joze_2020_CVPR,
  author = {Joze, Hamid Reza Vaezi and Shaban, Amirreza and Iuzzolino, Michael L. and Koishida, Kazuhito},
  title = {MMTM: Multimodal Transfer Module for CNN Fusion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Generalized Zero-Shot Learning via Over-Complete Distribution
Rohit Keshari, Richa Singh, Mayank Vatsa


A well-trained and generalized deep neural network (DNN) should be robust to both seen and unseen classes. However, the performance of most existing supervised DNN algorithms degrades for classes which are unseen in the training set. To learn a discriminative classifier which yields good performance in Zero-Shot Learning (ZSL) settings, we propose to generate an Over-Complete Distribution (OCD) of both seen and unseen classes using a Conditional Variational Autoencoder (CVAE). In order to enforce the separability between classes and reduce the class scatter, we propose the use of an Online Batch Triplet Loss (OBTL) and Center Loss (CL) on the generated OCD. The effectiveness of the framework is evaluated using both Zero-Shot Learning and Generalized Zero-Shot Learning protocols on three publicly available benchmark databases, SUN, CUB and AWA2. The results show that generating over-complete distributions and forcing the classifier to learn a transform function from overlapping to non-overlapping distributions can improve the performance on both seen and unseen classes.
[three, observed, dataset, visual, decoder] [hard, framework, sun, propose, table, challenging, center, split, feature, semantic] [trained, model, improve, protocol] [proposed, figure, phase, synthetically, utilized, based] [unseen, generated, loss, attribute, generating, zsl, generalized, ocd, generative, cub, real, gzsl, cvae, latent, generate, train, learn, discriminative, variable, encoder, conditional, corresponding, mapping, obtl, third, zeynep, variational, separability] [distribution, learning, class, training, triplet, approximated, performance, test, set, classification, network, classifier, sample, data, deep, batch, standard, accuracy, equation, closer, space, average, best, online, randomly, sampled, number, reduce, size] [regressor, represented]
@InProceedings{Keshari_2020_CVPR,
  author = {Keshari, Rohit and Singh, Richa and Vatsa, Mayank},
  title = {Generalized Zero-Shot Learning via Over-Complete Distribution},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Gait Recognition via Semi-supervised Disentangled Representation Learning to Identity and Covariate Features
Xiang Li, Yasushi Makihara, Chi Xu, Yasushi Yagi, Mingwu Ren


Existing gait recognition approaches typically focus on learning identity features that are invariant to covariates (e.g., the carrying status, clothing, walking speed, and viewing angle) and seldom involve learning features from the covariate aspect, which may lead to failure modes when variations due to the covariate overwhelm those due to the identity. We therefore propose a method of gait recognition via disentangled representation learning that considers both identity and covariate features. Specifically, we first encode an input gait template to get the disentangled identity and covariate features, and then decode the features to simultaneously reconstruct the input gait template and the canonical version of the same subject with no covariates in a semi-supervised manner to ensure successful disentanglement. We finally feed the disentangled identity features into a contrastive/triplet loss function for a verification/identification task. Moreover, we find that new gait templates can be synthesized by transferring the covariate feature from one subject to another. Experimental results on three publicly available gait data sets demonstrate the effectiveness of the proposed method compared with other state-of-the-art methods.
[recognition, pair, decoder, viewing, considering, video, walking] [feature, module, gallery, table, category, template] [gait, identity, covariate, gei, input, subject, geis, identification, carrying, verification, covariates, face, cov, condition, probe, drl, model, ltrip, biometric, status, lreconst, discriminant, study, lcont, protocol] [ieee, pattern, method, proposed, analysis, figure, based] [loss, disentangled, disentanglement, encoder, representation, disentangle, invariant, image, appearance] [learning, set, data, training, rate, function, triplet, network, performance, number, test, machine, indicates] [computer, conference, vision, reconstructed, joint, human, view, canonical, pose, international, reconstruct, reconstruction]
@InProceedings{Li_2020_CVPR,
  author = {Li, Xiang and Makihara, Yasushi and Xu, Chi and Yagi, Yasushi and Ren, Mingwu},
  title = {Gait Recognition via Semi-supervised Disentangled Representation Learning to Identity and Covariate Features},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unifying Training and Inference for Panoptic Segmentation
Qizhu Li, Xiaojuan Qi, Philip H.S. Torr


We present an end-to-end network to bridge the gap between training and inference pipeline for panoptic segmentation, a task that seeks to partition an image into semantic regions for "stuff" and object instances for "things". In contrast to recent works, our network exploits a parametrised, yet lightweight panoptic segmentation submodule, powered by an end-to-end learnt dense instance affinity, to capture the probability that any pair of pixels belong to the same instance. This panoptic submodule gives rise to a novel propagation mechanism for panoptic logits and enables the network to output a coherent panoptic segmentation map for both "stuff" and "thing" classes, without any post-processing. Reaping the benefits of end-to-end training, our full system sets new records on the popular street scene dataset, Cityscapes, achieving 61.4 PQ with a ResNet-50 backbone using only the fine annotations. On the challenging COCO dataset, our ResNet-50-based network also delivers state-of-the-art accuracy of 43.4 PQ. Moreover, our network flexibly works with and without object mask cues, performing competitively under both settings, which is of interest for applications with computation budgets.
@InProceedings{Li_2020_CVPR,
  author = {Li, Qizhu and Qi, Xiaojuan and Torr, Philip H.S.},
  title = {Unifying Training and Inference for Panoptic Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Associate-3Ddet: Perceptual-to-Conceptual Association for 3D Point Cloud Object Detection
Liang Du, Xiaoqing Ye, Xiao Tan, Jianfeng Feng, Zhenbo Xu, Errui Ding, Shilei Wen


Object detection from 3D point clouds remains a challenging task, though recent studies have pushed the envelope with deep learning techniques. Owing to severe spatial occlusion and the inherent variance of point density with distance to the sensor, the appearance of the same object varies considerably in point cloud data. Designing a feature representation robust to such appearance changes is hence the key issue in a 3D object detection method. In this paper, we propose a domain-adaptation-like approach to enhance the robustness of the feature representation. More specifically, we bridge the gap between the perceptual domain, where the feature comes from a real scene, and the conceptual domain, where the feature is extracted from an augmented scene consisting of non-occluded point clouds rich in detailed information. This domain adaptation approach mimics the functionality of the human brain in object perception. Extensive experiments demonstrate that our simple yet effective approach fundamentally boosts the performance of 3D point cloud object detection and achieves state-of-the-art results.
[titan, length] [conceptual, object, feature, detection, cfg, lidar, table, map, association, offset, bev, hard, occlusion, percepted, proposal, foreground, split, easy, gtx, box, associative, reweighting, atl, raquel, ross, key] [pfe, model, robust] [perceptual, ieee, convolutional, pattern, figure, method, proposed, range, deformable, based, pixel, science] [domain, loss, adaptation, real, representation, learns, corresponding, transfer, source, generate, gap] [network, performance, training, learning, process, data, informative, baseline, average, candidate, neural, set, deep, density, simple, knowledge] [point, conference, computer, vision, cloud, scene, kitti, sparse, rotation, approach, complete, rgb, international, human, distance, directly, incomplete]
@InProceedings{Du_2020_CVPR,
  author = {Du, Liang and Ye, Xiaoqing and Tan, Xiao and Feng, Jianfeng and Xu, Zhenbo and Ding, Errui and Wen, Shilei},
  title = {Associate-3Ddet: Perceptual-to-Conceptual Association for 3D Point Cloud Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Interactive Image Segmentation With First Click Attention
Zheng Lin, Zhao Zhang, Lin-Zhuo Chen, Ming-Ming Cheng, Shao-Ping Lu


In the task of interactive image segmentation, users initially click one point to segment the main body of the target object and then provide more points on mislabeled regions iteratively for a precise segmentation. Existing methods treat all interaction points indiscriminately, ignoring the difference between the first click and the remaining ones. In this paper, we demonstrate the critical role of the first click about providing the location and main body information of the target object. A deep framework, named First Click Attention Network (FCA-Net), is proposed to make better use of the first click. In this network, the interactive segmentation result can be much improved with the following benefits: focus invariance, location guidance, and error-tolerant ability. We then put forward a click-based loss function and a structural integrity strategy for better segmentation effect. The visualized segmentation results and sufficient experiments on five datasets demonstrate the importance of the first click and the superiority of our FCA-Net.
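One simple way to encode the privileged first click as a separate network input is a Gaussian map centred on it, fed to a dedicated branch alongside the usual maps for the remaining clicks; the radius and the exact encoding here are assumptions, not the values used in FCA-Net.

import numpy as np

def first_click_map(h, w, click_yx, sigma=30.0):
    """Return an (h, w) float32 map in [0, 1], peaked at the first click (y, x)."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (ys - click_yx[0]) ** 2 + (xs - click_yx[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2)).astype(np.float32)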
[interaction, attention, prediction, dataset, mscoco, visual, represent, role, graph, three] [click, segmentation, interactive, object, positive, integrity, fca, location, propose, pascal, annotated, miou, segment, global, mask, voc, module, gsc, focus, salient, foreground, final, grc, score, detection, center] [input, model, datasets] [ieee, pattern, proposed, based, guidance, method, result, gaussian, output, figure, supervise, convolutional, formulated] [image, loss, structural, target, user, generate] [network, number, strategy, neural, deep, better, set, function, learning, negative, basic, training, simple, esc, performance, computational, general, task, improved] [point, distance, geodesic, demonstrate]
@InProceedings{Lin_2020_CVPR,
  author = {Lin, Zheng and Zhang, Zhao and Chen, Lin-Zhuo and Cheng, Ming-Ming and Lu, Shao-Ping},
  title = {Interactive Image Segmentation With First Click Attention},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
NETNet: Neighbor Erasing and Transferring Network for Better Single Shot Object Detection
Yazhao Li, Yanwei Pang, Jianbing Shen, Jiale Cao, Ling Shao


Due to the advantages of real-time detection and improved performance, single-shot detectors have gained great attention recently. To handle complex scale variations, single-shot detectors make scale-aware predictions based on multiple pyramid layers. However, the features in the pyramid are not scale-aware enough, which limits the detection performance. Two common problems in single-shot detectors caused by object scale variations can be observed: (1) small objects are easily missed; (2) the salient part of a large object is sometimes detected as an object. With this observation, we propose a new Neighbor Erasing and Transferring (NET) mechanism to reconfigure the pyramid features and explore scale-aware features. In NET, a Neighbor Erasing Module (NEM) is designed to erase the salient features of large objects and emphasize the features of small objects in shallow layers. A Neighbor Transferring Module (NTM) is introduced to transfer the erased features and highlight large objects in deep layers. With this mechanism, a single-shot network called NETNet is constructed for scale-aware object detection. In addition, we propose to aggregate nearest neighboring pyramid features to enhance our NET. NETNet achieves 38.5% AP at a speed of 27 FPS and 32.0% AP at a speed of 55 FPS on the MS COCO dataset. As a result, NETNet achieves a better trade-off between real-time and accurate object detection.
[attention, mechanism, constructed, built, construct, previous] [feature, object, pyramid, detection, netnet, erasing, module, shallow, ssd, nem, false, netm, salient, nnfm, propose, ntm, positive, jianbing, coco, detector, backbone, table, yanwei, semantic, ross, reconfigure, erased, alleviate, improvement, pooling, kaiming, wenguan, ling, china, detected, detect, region, scaleaware] [detecting, input] [scale, spatial, based, net, convolutional, enhance, method, fusion, figure, resolution] [transferring, generate, pes, image, transfer] [small, large, deep, network, layer, baseline, larger, better, gate, problem, learning, size, performance, smaller, negative, evaluate, achieve] [neighbor, accurate, nearest, detailed, error, solve, ground, complex]
@InProceedings{Li_2020_CVPR,
  author = {Li, Yazhao and Pang, Yanwei and Shen, Jianbing and Cao, Jiale and Shao, Ling},
  title = {NETNet: Neighbor Erasing and Transferring Network for Better Single Shot Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Scale-Equalizing Pyramid Convolution for Object Detection
Xinjiang Wang, Shilong Zhang, Zhuoran Yu, Litong Feng, Wayne Zhang


The feature pyramid has been an efficient method to extract features at different scales. Development of this method has mainly focused on aggregating contextual information at different levels, while seldom touching the inter-level correlation in the feature pyramid. Early computer vision methods extracted scale-invariant features by locating feature extrema in both the spatial and scale dimensions. Inspired by this, a convolution across pyramid levels is proposed in this study, which is termed pyramid convolution and is a modified 3-D convolution. Stacked pyramid convolutions directly extract 3-D (scale and spatial) features and outperform other meticulously designed feature fusion modules. From the viewpoint of 3-D convolution, an integrated batch normalization that collects statistics from the whole feature pyramid is naturally inserted after the pyramid convolution. Furthermore, we also show that the naive pyramid convolution, together with the design of the RetinaNet head, actually best applies to extracting features from a Gaussian pyramid, whose properties can hardly be satisfied by a feature pyramid. In order to alleviate this discrepancy, we build a scale-equalizing pyramid convolution (SEPC) that aligns the shared pyramid convolution kernel only at high-level feature maps. Being computationally efficient and compatible with the head design of most single-stage object detectors, the SEPC module brings significant performance improvement (>4 AP increase on the MS-COCO 2017 dataset) to state-of-the-art one-stage object detectors, and a light version of SEPC also yields a 3.5 AP gain with only around 7% increase in inference time. The pyramid convolution also functions well as a stand-alone module in two-stage object detectors and is able to improve the performance by 2 AP. The source code can be found at https://github.com/jshilong/SEPC.
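A rough sketch of a plain pyramid convolution (before the scale-equalizing alignment): three shared 2-D convolutions act as one kernel over the scale dimension, combining each level with its finer and coarser neighbours. BN placement and the deformable alignment of SEPC are omitted, and the nearest-neighbour resizing is only a convenience for mismatched sizes.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # three shared 2-D kernels acting as one kernel over the scale axis
        self.conv_fine = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.conv_same = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.conv_coarse = nn.Conv2d(channels, channels, 3, stride=1, padding=1)

    def forward(self, feats):
        # feats: list of pyramid levels, ordered fine to coarse, each (N, C, H_l, W_l)
        out = []
        for l, f in enumerate(feats):
            y = self.conv_same(f)
            if l > 0:                                   # contribution from the finer level
                down = self.conv_fine(feats[l - 1])
                y = y + F.interpolate(down, size=f.shape[-2:], mode="nearest")
            if l < len(feats) - 1:                      # contribution from the coarser level
                up = self.conv_coarse(feats[l + 1])
                y = y + F.interpolate(up, size=f.shape[-2:], mode="nearest")
            out.append(y)
        return out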
[extract, multiple, natural] [feature, pyramid, pconv, object, head, sepc, detection, retinanet, module, freeanchor, fsaf, map, correlation, faster, backbone, dcn, level, fpn, kaiming, ross, extracting, brings, including, stride, improvement, cascade, extra] [original] [convolution, scale, kernel, ieee, pattern, gaussian, conv, fusion, convolutional, spatial, deformable, comparison, integrated, figure, blurring, method] [image, shared, independent] [performance, increase, design, size, deep, training, network, batch, baseline, libra, computational, neural, inference, learning, efficient, normalization, best] [computer, conference, vision, single, directly, international]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Xinjiang and Zhang, Shilong and Yu, Zhuoran and Feng, Litong and Zhang, Wayne},
  title = {Scale-Equalizing Pyramid Convolution for Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Cluster Faces via Confidence and Connectivity Estimation
Lei Yang, Dapeng Chen, Xiaohang Zhan, Rui Zhao, Chen Change Loy, Dahua Lin


Face clustering is an essential tool for exploiting the unlabeled face data, and has a wide range of applications including face annotation and retrieval. Recent works show that supervised clustering can result in noticeable performance gain. However, they usually involve heuristic steps and require numerous overlapped subgraphs, severely restricting their accuracy and efficiency. In this paper, we propose a fully learnable clustering framework without requiring a large number of overlapped subgraphs. Instead, we transform the clustering problem into two sub-problems. Specifically, two graph convolutional networks, named GCN-V and GCN-E, are designed to estimate the confidence of vertices and the connectivity of edges, respectively. With the vertex confidence and edge connectivity, we can naturally organize more relevant vertices on the affinity graph and group them into clusters. Experiments on two large-scale benchmarks show that our method significantly improves clustering accuracy and thus performance of the recognition models trained on top, yet it is an order of magnitude more efficient than existing supervised methods.
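Given per-vertex confidences and per-edge connectivities (however they are predicted), one natural grouping rule is to link each vertex to its best-connected neighbour of higher confidence and read clusters off the resulting connected components. The sketch below illustrates only that grouping step with hypothetical inputs (confidence, neighbors, connectivity); it is our paraphrase of the idea, not the authors' released implementation, and the two GCNs that produce these estimates are not shown.

import numpy as np

def group_vertices(confidence, neighbors, connectivity, tau=0.5):
    """Link each vertex to its best-connected, more-confident neighbor; return cluster ids.

    confidence:   (N,) vertex confidence scores
    neighbors:    (N, k) indices of each vertex's k nearest neighbors
    connectivity: (N, k) predicted connectivity of the corresponding edges
    """
    n = len(confidence)
    parent = np.arange(n)

    def find(x):
        while parent[x] != x:              # union-find with path halving
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        best_j, best_c = -1, tau
        for j, c in zip(neighbors[i], connectivity[i]):
            if confidence[j] > confidence[i] and c > best_c:   # only link "upwards"
                best_j, best_c = j, c
        if best_j >= 0:
            parent[find(i)] = find(best_j)

    return np.array([find(i) for i in range(n)])

# toy usage with random scores
rng = np.random.default_rng(0)
labels = group_vertices(rng.random(10), rng.integers(0, 10, (10, 3)), rng.random((10, 3)))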
[graph, recognition, gcn, connected, order, predict, previous, dahua, dataset, embedding] [confidence, affinity, edge, feature, table, framework, belonging, detection] [face, model, input, knn, ltc, heuristic, trained, datasets, dbscan, subgraphs] [proposed, based, method, high, convolutional, figure, ieee, learnable, indicate, existing, convolution] [supervised, cluster, specific, unsupervised, idea, learn, train, image] [clustering, number, performance, set, unlabeled, large, matrix, learning, candidate, data, computational, training, higher, belong, density, accuracy, efficient, design, indicates, similarity, class, small, denoted, deep, layer, pairwise, select] [vertex, connectivity, estimate, estimator, defined, conference, approach, neighbor, nearest]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Lei and Chen, Dapeng and Zhan, Xiaohang and Zhao, Rui and Loy, Chen Change and Lin, Dahua},
  title = {Learning to Cluster Faces via Confidence and Connectivity Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-Modality Person Re-Identification With Shared-Specific Feature Transfer
Yan Lu, Yue Wu, Bin Liu, Tianzhu Zhang, Baopu Li, Qi Chu, Nenghai Yu


Cross-modality person re-identification (cm-ReID) is a challenging but key technology for intelligent video analysis. Existing works mainly focus on learning a modality-shared representation by embedding different modalities into the same feature space, lowering the upper bound of feature distinctiveness. In this paper, we tackle the above limitation by proposing a novel cross-modality shared-specific feature transfer algorithm (termed cm-SSFT) to explore the potential of both the modality-shared information and the modality-specific characteristics to boost re-identification performance. We model the affinities of different modality samples according to the shared features and then transfer both shared and specific features among and across modalities. We also propose a complementary feature learning strategy including modality adaption, project adversarial learning and reconstruction enhancement to learn discriminative and complementary shared and specific features of each modality, respectively. The entire cm-SSFT algorithm can be trained in an end-to-end manner. We conducted comprehensive experiments to validate the superiority of the overall algorithm and the effectiveness of each component. The proposed algorithm significantly outperforms the state of the art by 22.5% and 19.3% mAP on the two mainstream benchmark datasets SYSU-MM01 and RegDB, respectively.
[modality, attention, evaluation, work] [feature, map, affinity, thermal, table, effectiveness, extractor, backbone, liang, gallery] [adversarial, complementary, query, model] [ieee, method, proposed, pattern, figure, based, convolutional, enhance, comparison] [shared, specific, person, transfer, loss, reid, project, infrared, sstn, representation, image, discriminative, utilize, row, generative, lsmt, lcmt, modalityspecific, generate] [learning, network, algorithm, set, matrix, deep, arxiv, preprint, performance, sample, triplet, better, metric, neural, feat, training, data, baseline, large, process] [conference, computer, vision, rgb, international, visible, reconstruction, single, european, well]
@InProceedings{Lu_2020_CVPR,
  author = {Lu, Yan and Wu, Yue and Liu, Bin and Zhang, Tianzhu and Li, Baopu and Chu, Qi and Yu, Nenghai},
  title = {Cross-Modality Person Re-Identification With Shared-Specific Feature Transfer},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DPGN: Distribution Propagation Graph Network for Few-Shot Learning
Ling Yang, Liangliang Li, Zilun Zhang, Xinyu Zhou, Erjin Zhou, Yu Liu


Most graph-network-based meta-learning approaches model the instance-level relations of examples. We extend this idea further to explicitly model the distribution-level relation of one example to all other examples in a 1-vs-N manner. We propose a novel approach named distribution propagation graph network (DPGN) for few-shot learning. It conveys both the distribution-level relations and instance-level relations in each few-shot learning task. To combine the distribution-level relations and instance-level relations for all examples, we construct a dual complete graph network which consists of a point graph and a distribution graph, with each node standing for an example. Equipped with the dual graph architecture, DPGN propagates label information from labeled examples to unlabeled examples within several update generations. In extensive experiments on few-shot learning benchmarks, DPGN outperforms state-of-the-art results by a large margin of 5%-12% under the supervised setting and 7%-13% under the semi-supervised setting. Code will be released.
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Ling and Li, Liangliang and Zhang, Zilun and Zhou, Xinyu and Zhou, Erjin and Liu, Yu},
  title = {DPGN: Distribution Propagation Graph Network for Few-Shot Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Density-Aware Graph for Deep Semi-Supervised Visual Recognition
Suichan Li, Bin Liu, Dongdong Chen, Qi Chu, Lu Yuan, Nenghai Yu


Semi-supervised learning (SSL) has been extensively studied to improve the generalization ability of deep neural networks for visual recognition. To involve the unlabelled data, most existing SSL methods are based on the common density-based cluster assumption: samples lying in the same high-density region are likely to belong to the same class, including the methods performing consistency regularization or generating pseudo-labels for the unlabelled images. Despite their impressive performance, we argue that three limitations exist: 1) Though the density information is demonstrated to be an important clue, they all use it in an implicit way and have not exploited it in depth. 2) For feature learning, they often learn the feature embedding based on single data samples and ignore the neighborhood information. 3) For label-propagation based pseudo-label generation, it is often done offline and is difficult to train end-to-end with feature learning. Motivated by these limitations, this paper proposes to solve the SSL problem by building a novel density-aware graph, based on which the neighborhood information can be easily leveraged and the feature learning and label propagation can also be trained in an end-to-end way. Specifically, we first propose a new Density-aware Neighborhood Aggregation (DNA) module to learn more discriminative features by incorporating the neighborhood information in a density-aware manner. Then a novel Density-ascending Path based Label Propagation (DPLP) module is proposed to generate the pseudo-labels for unlabeled samples more efficiently according to the feature distribution characterized by density. Finally, the DNA module and the DPLP module evolve and improve each other end-to-end. Extensive experiments demonstrate the effectiveness of the newly proposed density-aware graph based SSL framework, and our approach outperforms current state-of-the-art methods by a large margin.
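One concrete reading of the density-ascending path idea is sketched below: each sample's density is scored by its similarity to its nearest neighbours, and an unlabeled sample inherits the label of the first labeled sample it reaches by repeatedly stepping to a denser neighbour. This is a minimal, hedged illustration using cosine similarity on fixed feature vectors; the function names are ours, and the learned DNA aggregation and end-to-end training described in the abstract are not reproduced.

import numpy as np

def density(features, k=5):
    """Density score: mean cosine similarity to the k nearest neighbours."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)
    return np.sort(sim, axis=1)[:, -k:].mean(axis=1)

def density_ascending_labels(features, labels, k=5, max_steps=10):
    """Give each unlabeled sample (label -1) the label found by walking to denser neighbours."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)
    dens = density(features, k)
    pseudo = labels.copy()
    for i in np.where(labels == -1)[0]:
        cur = i
        for _ in range(max_steps):
            if pseudo[cur] != -1:                       # reached a labeled sample
                pseudo[i] = pseudo[cur]
                break
            nbrs = np.argsort(sim[cur])[-k:]            # k most similar samples
            denser = [j for j in nbrs if dens[j] > dens[cur]]
            if not denser:                              # density peak without a label
                break
            cur = max(denser, key=lambda j: sim[cur, j])
    return pseudo

# toy usage: two labeled samples, the rest unlabeled
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 8))
y = np.array([0, 1] + [-1] * 18)
print(density_ascending_labels(x, y))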
[node, graph, current, bank, construct, embedding, visual, length, incorporating] [feature, aggregation, labelled, propagation, propose, effectiveness, framework, extractor, global, module, table] [trained, model, study, improve, demonstrated] [based, method, ieee, proposed, high, figure, assumption] [target, perform, pseudo, learn, consistency, loss, cluster, proposes] [label, density, learning, path, unlabelled, training, ssl, similarity, neural, data, unlabeled, number, deep, sample, network, higher, large, set, baseline, processing, labeled, better, weight, efficient, class, rate, belong, regularization, distribution, dplp] [neighborhood, conference, international, neighbor, computer, novel, vision, define]
@InProceedings{Li_2020_CVPR,
  author = {Li, Suichan and Liu, Bin and Chen, Dongdong and Chu, Qi and Yuan, Lu and Yu, Nenghai},
  title = {Density-Aware Graph for Deep Semi-Supervised Visual Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Multi-Modal Image Registration via Geometry Preserving Image-to-Image Translation
Moab Arar, Yiftach Ginger, Dov Danon, Amit H. Bermano, Daniel Cohen-Or


Many applications, such as autonomous driving, heavily rely on multi-modal data where spatial alignment between the modalities is required. Most multi-modal registration methods struggle to compute the spatial correspondence between the images using prevalent cross-modality similarity measures. In this work, we bypass the difficulties of developing cross-modality similarity measures by training an image-to-image translation network on the two input modalities. This learned translation allows training the registration network using simple and reliable mono-modality metrics. We perform multi-modal registration using two networks - a spatial transformation network and a translation network. We show that by encouraging our translation network to be geometry preserving, we manage to train an accurate spatial transformation network. Compared to state-of-the-art multi-modal methods, our method is unsupervised, requiring no pairs of aligned modalities for training, and can be adapted to any pair of modalities. We evaluate our method quantitatively and qualitatively on commercial datasets, showing that it performs well on several modalities and achieves accurate alignment.
[multimodal, dataset, three] [table, annotation, salient] [input, adversarial, trained] [spatial, figure, method, field, medical, bilateral, based, ieee, pixel, stn, deformable, performs, imaging, filtering, presented] [image, loss, translation, train, unsupervised, source, target, preserving, alignment, discriminator, domain, generative, cyclegan, fake, encourage, transformed, translated, gan, translate, generated] [network, training, accuracy, similarity, learning, metric, neural, deep, respect, test, data, simple, epoch, processing, computing, compared] [registration, deformation, transformation, computer, geometry, geometric, international, deformed, reconstruction, sift, conference, accurate, volume, scene, depth, rgb, vision, allows, local, registered, smoothness, lecture]
@InProceedings{Arar_2020_CVPR,
  author = {Arar, Moab and Ginger, Yiftach and Danon, Dov and Bermano, Amit H. and Cohen-Or, Daniel},
  title = {Unsupervised Multi-Modal Image Registration via Geometry Preserving Image-to-Image Translation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Binarizing MobileNet via Evolution-Based Searching
Hai Phan, Zechun Liu, Dang Huynh, Marios Savvides, Kwang-Ting Cheng, Zhiqiang Shen


Binary Neural Networks (BNNs), known to be among the most compact network architectures, have achieved strong results on visual tasks. Designing efficient binary architectures is not trivial due to the binary nature of the network. In this paper, we propose the use of evolutionary search to facilitate the construction and training scheme when binarizing MobileNet, a compact network with separable depth-wise convolution. Inspired by one-shot architecture search frameworks, we leverage the idea of group convolution to design efficient 1-Bit Convolutional Neural Networks (CNNs), assuming an approximately optimal trade-off between computational cost and model accuracy. Our objective is to come up with a tiny yet efficient binary neural architecture by exploring the best candidates of the group convolution while optimizing the model performance in terms of complexity and latency. The approach is threefold. First, we modify and train strong baseline binary networks with a wide range of random group combinations at each convolutional layer. This set-up gives the binary neural networks the capability of preserving essential information through the layers. Second, to find a good set of hyper-parameters for the group convolutions, we make use of evolutionary search, which leverages the exploration of efficient 1-bit models. Lastly, these binary models are trained from scratch in the usual manner to achieve the final binary model. Various experiments on ImageNet are conducted to show that, following our construction guideline, the final model achieves 60.09% Top-1 accuracy and outperforms the state-of-the-art CI-BCNN with the same computational cost.
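The evolutionary part of such a pipeline can be pictured as a simple genetic loop over per-layer group counts: keep the fittest configurations, recombine and mutate them, and re-score. The toy sketch below shows only that loop; the default fitness here is a meaningless placeholder that a real run would replace with, e.g., the validation accuracy of a briefly trained binary network penalized by its cost, and all names are ours rather than the authors' code.

import random

def evolve_group_counts(num_layers, choices=(1, 2, 4, 8), pop=20, generations=10,
                        fitness=None, seed=0):
    """Toy evolutionary search over per-layer group counts for grouped convolutions."""
    rng = random.Random(seed)
    if fitness is None:                      # placeholder objective, for illustration only
        fitness = lambda cfg: -abs(sum(cfg) - 3 * num_layers)
    population = [tuple(rng.choice(choices) for _ in range(num_layers)) for _ in range(pop)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 4]     # keep the fittest quarter
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            child = [rng.choice(p) for p in zip(a, b)]   # crossover, gene by gene
            child[rng.randrange(num_layers)] = rng.choice(choices)   # point mutation
            children.append(tuple(child))
        population = parents + children
    return max(population, key=fitness)

best = evolve_group_counts(num_layers=13)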
[dataset, work, three, explore] [module, feature, table, propose, achieves, fully, jian] [model, input] [convolution, convolutional, proposed, conv, figure, capability, method, range, output, comparison, ieee] [train, image, representation, loss] [neural, group, binary, search, network, architecture, training, efficient, accuracy, random, computational, learning, evolutionary, number, optimal, performance, deep, compact, bnns, mobilenet, searching, imagenet, algorithm, randomized, classification, mobile, uniform, binarizing, design, space, mobinet, binarized, searched, scheme, set, conducted, operation, weight, arxiv, xiangyu, zechun] [conference, international, computer, vision, full, cost, assuming, accurate, compare, approach]
@InProceedings{Phan_2020_CVPR,
  author = {Phan, Hai and Liu, Zechun and Huynh, Dang and Savvides, Marios and Cheng, Kwang-Ting and Shen, Zhiqiang},
  title = {Binarizing MobileNet via Evolution-Based Searching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Temporal-Context Enhanced Detection of Heavily Occluded Pedestrians
Jialian Wu, Chunluan Zhou, Ming Yang, Qian Zhang, Yuan Li, Junsong Yuan


State-of-the-art pedestrian detectors have performed promisingly on non-occluded pedestrians, yet they are still confronted by heavy occlusions. Although many previous works have attempted to alleviate the pedestrian occlusion issue, most of them rest on still images. In this paper, we exploit the local temporal context of pedestrians in videos and propose a tube feature aggregation network (TFAN) aimed at enhancing pedestrian detectors against severe occlusions. Specifically, for an occluded pedestrian in the current frame, we iteratively search for its relevant counterparts along the temporal axis to form a tube. Then, features from the tube are aggregated according to an adaptive weight to enhance the feature representations of the occluded pedestrian. Furthermore, we devise a temporally discriminative embedding module (TDEM) and a part-based relation module (PRM), which adapt our approach to better handle tube drifting and heavy occlusions, respectively. Extensive experiments are conducted on three datasets, Caltech, NightOwls and KAIST, showing that our proposed method is significantly effective for heavily occluded pedestrian detection. Moreover, we achieve state-of-the-art performance on the Caltech and NightOwls datasets.
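The aggregation step can be read as a similarity-weighted average of the proposal features linked along a tube. The few lines below sketch that reading with cosine-similarity softmax weights; this is our simplification (the paper learns adaptive weights and adds further modules such as TDEM and PRM), and the tensor names are hypothetical.

import torch
import torch.nn.functional as F

def aggregate_tube(current_feat, tube_feats):
    """Weight features along a temporal tube by their similarity to the current frame.

    current_feat: (C,) proposal feature in the current frame
    tube_feats:   (T, C) features of the linked proposals in nearby frames
    """
    sims = F.cosine_similarity(tube_feats, current_feat.unsqueeze(0), dim=1)  # (T,)
    weights = torch.softmax(sims, dim=0)
    return (weights.unsqueeze(1) * tube_feats).sum(dim=0)

enhanced = aggregate_tube(torch.randn(256), torch.randn(8, 256))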
[embedding, current, frame, video, temporal, context, temporally, relevant, exploit, three, wiki, relation, aggregating, bki] [pedestrian, occluded, feature, proposal, detection, heavily, tube, object, module, semantic, occlusion, table, prm, tfan, nightowls, linking, background, detector, tdem, bkt, xkt, aggregation, heavy, box, kaist, dem, roi, nearby, faster, region, vkt] [] [adaptive, proposed, spatial, enhance, ieee, figure, method, handling, fast, night] [discriminative, loss, align, corresponding, reasonable] [similarity, learning, network, performance, reliable, baseline, average, caltech, deep, set, search, weight, cosine, better, classifier, neural, best] [visible, approach, local, ground, truth, iteratively, matching]
@InProceedings{Wu_2020_CVPR,
  author = {Wu, Jialian and Zhou, Chunluan and Yang, Ming and Zhang, Qian and Li, Yuan and Yuan, Junsong},
  title = {Temporal-Context Enhanced Detection of Heavily Occluded Pedestrians},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Orderless Recurrent Models for Multi-Label Classification
Vacit Oguz Yazici, Abel Gonzalez-Garcia, Arnau Ramisa, Bartlomiej Twardowski, Joost van de Weijer


Recurrent neural networks (RNN) are popular for many computer vision tasks, including multi-label classification. Since RNNs produce sequential outputs, labels need to be ordered for the multi-label classification task. Current approaches sort labels according to their frequency, typically ordering them in either rare-first or frequent-first order. These imposed orderings do not take into account that the natural order in which to generate the labels can change for each image, e.g., naming the dominant object first before enumerating the smaller objects in the image. Therefore, in this paper, we propose ways to dynamically order the ground-truth labels to match the predicted label sequence. This allows for the faster training of more optimal LSTM models for multi-label classification. Our analysis shows that our method does not suffer from duplicate generation, which is common for other models. Furthermore, it outperforms other CNN-RNN models, and we show that a standard architecture of an image encoder and language decoder trained with our proposed loss obtains state-of-the-art results on the challenging MS-COCO, WIDER Attribute and PA-100K benchmarks and competitive results on NUS-WIDE.
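One concrete way to realize such a dynamic ordering is to match ground-truth labels to decoder time steps with a minimum-cost assignment on the predicted probabilities, so each step is trained on the label it already favours. The sketch below shows that variant only; it is our hedged illustration (the paper defines its own alignment schemes, referred to in the topic words as PLA and MLA), and the function and variable names are ours.

import numpy as np
from scipy.optimize import linear_sum_assignment

def order_labels(step_probs, gt_labels):
    """Re-order ground-truth labels so each decoder step trains on the label it already favours.

    step_probs: (T, C) per-step class probabilities from the label decoder
    gt_labels:  list of ground-truth class indices for the image
    """
    cost = -np.log(step_probs[:, gt_labels] + 1e-12)   # assignment cost: -log p
    steps, which = linear_sum_assignment(cost)          # Hungarian matching
    ordered = [None] * len(step_probs)                  # None marks steps without a label
    for t, j in zip(steps, which):
        ordered[t] = gt_labels[j]
    return ordered

probs = np.random.dirichlet(np.ones(80), size=5)        # 5 decoding steps, 80 classes
print(order_labels(probs, [3, 17, 42]))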
[order, lstm, pla, recurrent, time, rnn, attention, mla, orderless, step, prediction, visual, ttj, recognition, dynamically, sequence, predict, caption, previous, sequential, state, current, captioning] [predicted, bce, ordering, table, cnn, module, object, faster, duplicate] [model, input, trained, wider, ball] [figure, method, chen, proposed, comparison, convolutional, output] [image, loss, train, attribute, alignment, learn, generate, common, proposes] [label, training, learning, network, classification, neural, deep, matrix, problem, random, number, architecture, higher, better, smaller, machine] [ground, truth, approach, compare, conference, computed]
@InProceedings{Yazici_2020_CVPR,
  author = {Yazici, Vacit Oguz and Gonzalez-Garcia, Abel and Ramisa, Arnau and Twardowski, Bartlomiej and Weijer, Joost van de},
  title = {Orderless Recurrent Models for Multi-Label Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Gold Seeker: Information Gain From Policy Distributions for Goal-Oriented Vision-and-Langauge Reasoning
Ehsan Abbasnejad, Iman Abbasnejad, Qi Wu, Javen Shi, Anton van den Hengel


As Computer Vision moves from passive analysis of pixels to active analysis of semantics, the breadth of information algorithms need to reason over has expanded significantly. One of the key challenges in this vein is the ability to identify the information required to make a decision, and select an action that will recover it. We propose a reinforcement-learning approach that maintains a distribution over its internal information, thus explicitly representing the ambiguity in what it knows, and needs to know, towards achieving its goal. Potential actions are then generated according to this distribution. For each potential action a distribution of the expected outcomes is calculated, and the value of the potential information gain assessed. The action taken is that which maximizes the potential information gain. We demonstrate this approach applied to two vision-and-language problems that have attracted significant recent interest, visual dialog and visual query generation. In both cases the method actively selects actions that will best reduce its internal uncertainty, and outperforms its competitors in achieving the goal of the challenge.
[policy, seeker, visual, goal, agent, question, responder, reward, answer, reinforcement, dialog, context, action, dialogue, history, multiple, guesswhat, state, anton, den, executor, clevr, language, sequence, evaluates, compositional] [object, van, response, score] [query, model, input] [conventional, based, proposed, method, figure] [image, generate, generated, variational, seeking, corresponding, ability, generation] [distribution, gain, learning, gradient, posterior, expected, parameter, task, achieving, better, arxiv, preprint, update, sample, consider, performance, ehsan, training, potential, set, neural, space, achieve, best, bayesian, log, function, problem] [approach, single, intrinsic, vision, computer, additional, program, compute, initial]
@InProceedings{Abbasnejad_2020_CVPR,
  author = {Abbasnejad, Ehsan and Abbasnejad, Iman and Wu, Qi and Shi, Javen and Hengel, Anton van den},
  title = {Gold Seeker: Information Gain From Policy Distributions for Goal-Oriented Vision-and-Langauge Reasoning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Rethinking the Route Towards Weakly Supervised Object Localization
Chen-Lin Zhang, Yun-Hao Cao, Jianxin Wu


Weakly supervised object localization (WSOL) aims to localize objects with only image-level labels. Previous methods often try to utilize feature maps and classification weights to localize objects using image level annotations indirectly. In this paper, we demonstrate that weakly supervised object localization should be divided into two parts: class-agnostic object localization and object classification. For class-agnostic object localization, we should use class-agnostic methods to generate noisy pseudo annotations and then perform bounding box regression on them without class labels. We propose the pseudo supervised object localization (PSOL) method as a new way to solve WSOL. Our PSOL models have good transferability across different datasets without fine-tuning. With generated pseudo bounding boxes, we achieve 58.00% localization accuracy on ImageNet and 74.74% localization accuracy on CUB-200, which have a large edge over previous models.
[previous, connected, localize, dataset, current, provide] [localization, bounding, object, wsol, box, psol, fully, loc, ddt, weakly, feature, detection, final, cam, regression, map, table, spg, achieves, cnn, edge, including, adl] [model, trained, datasets, input, original] [method, output, convolutional, proposed, noisy, based] [supervised, generate, pseudo, perform, image, generated, train, transfer] [classification, class, training, accuracy, large, learning, better, imagenet, deep, network, layer, good, achieve, applied, label, validation, test, task] [single, directly, joint, accurate, full, combine]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Chen-Lin and Cao, Yun-Hao and Wu, Jianxin},
  title = {Rethinking the Route Towards Weakly Supervised Object Localization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adversarial Feature Hallucination Networks for Few-Shot Learning
Kai Li, Yulun Zhang, Kunpeng Li, Yun Fu


The recent flourishing of deep learning in various tasks is largely attributed to rich and accessible labeled data. Nonetheless, massive supervision remains a luxury for many real applications, boosting great interest in label-scarce techniques such as few-shot learning (FSL), which aims to learn the concepts of new classes from a few labeled samples. A natural approach to FSL is data augmentation, and many recent works have proved its feasibility by proposing various data synthesis models. However, these models fail to adequately secure the discriminability and diversity of the synthesized data and thus often produce undesirable results. In this paper, we propose Adversarial Feature Hallucination Networks (AFHN) which is based on conditional Wasserstein Generative Adversarial networks (cWGAN) and hallucinates diverse and discriminative features conditioned on the few labeled samples. Two novel regularizers, i.e., the classification regularizer and the anti-collapse regularizer, are incorporated into AFHN to encourage discriminability and diversity of the synthesized features, respectively. An ablation study verifies the effectiveness of the proposed cWGAN-based feature hallucination framework and the proposed regularizers. Comparative results on three common benchmark datasets substantiate the superiority of AFHN over existing data-augmentation-based FSL approaches and other state-of-the-art ones.
[dataset, three] [feature, framework, table, propose, effectiveness, ablation, category, correlation] [adversarial, model, noise, hallucination, query, study] [based, proposed, existing, method, high, enhance] [synthesized, afhn, generator, conditional, gan, discriminability, image, diversity, cwgan, fake, discriminator, wasserstein, generative, discriminative, train, collapse, mode, real, learn, synthesis, synthesize, diverse, encourage, generate] [data, learning, labeled, fsl, augmentation, classification, set, class, network, training, classifier, task, support, metric, deep, distribution, sample, regularizer, performance, neural, number, objective, sampled, arxiv, preprint, reach, similarity, accuracy, compared, variance, space] [novel, directly, well]
@InProceedings{Li_2020_CVPR,
  author = {Li, Kai and Zhang, Yulun and Li, Kunpeng and Fu, Yun},
  title = {Adversarial Feature Hallucination Networks for Few-Shot Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Conditional Gaussian Distribution Learning for Open Set Recognition
Xin Sun, Zhenning Yang, Chi Zhang, Keck-Voon Ling, Guohao Peng


Deep neural networks have achieved state-of-the-art performance in a wide range of recognition/classification tasks. However, when applying deep learning to real-world applications, there are still multiple challenges. A typical challenge is that unknown samples may be fed into the system during the testing phase, and traditional deep neural networks will wrongly recognize an unknown sample as one of the known classes. Open set recognition is a potential solution to overcome this problem, where the open set classifier should have the ability to reject unknown samples as well as maintain high classification accuracy on known classes. The variational auto-encoder (VAE) is a popular model to detect unknowns, but it cannot provide discriminative representations for classifying the known classes. In this paper, we propose a novel method, Conditional Gaussian Distribution Learning (CGDL), for open set recognition. In addition to detecting unknown samples, this method can also classify known samples by forcing different latent features to approximate different Gaussian models. Meanwhile, to prevent information hidden in the input from vanishing in the middle layers, we also adopt the probabilistic ladder architecture to extract high-level abstract features. Experiments on several standard image datasets reveal that the proposed method significantly outperforms the baseline method and achieves new state-of-the-art results.
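The open-set decision implied by such a model can be summarized as a two-part rule: reject a test sample whose reconstruction error is high, otherwise assign it to the class-conditional latent Gaussian it fits best, rejecting again if no class fits well. The sketch below illustrates this rule with hypothetical thresholds and names; it is our paraphrase for orientation, not the authors' exact scoring.

import numpy as np
from scipy.stats import multivariate_normal

def open_set_predict(z, recon_error, class_means, class_covs,
                     recon_thresh=50.0, prob_thresh=0.5):
    """Return a known-class index, or -1 ("unknown") if the sample reconstructs poorly
    or fits none of the class-conditional latent Gaussians well enough."""
    if recon_error > recon_thresh:
        return -1
    probs = np.array([multivariate_normal.pdf(z, mean=m, cov=c)
                      for m, c in zip(class_means, class_covs)])
    probs = probs / probs.sum()
    return int(np.argmax(probs)) if probs.max() >= prob_thresh else -1

# toy usage with 3 made-up classes in a 4-D latent space
rng = np.random.default_rng(0)
means, covs = rng.normal(size=(3, 4)), np.stack([np.eye(4)] * 3)
print(open_set_predict(rng.normal(size=4), recon_error=10.0, class_means=means, class_covs=covs))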
[dataset, recognition, decoder, extract, hidden, outperforms, previous] [detector, detection, achieves, detect] [testing, model, trained, input, mnist, recognized] [proposed, method, gaussian, ieee, pattern, traditional, prior, based, phase, figure] [unknown, latent, conditional, ladder, representation, image, vae, loss, multivariate, abstract, cgdl, unsupervised, encoder, variational, openness, discriminative, generative] [set, open, training, learning, deep, architecture, distribution, probabilistic, performance, classifier, classification, layer, neural, baseline, sample, approximate, space, probability, posterior, anomaly, data, class, closed, softmax, machine, network, function, procedure, label, openmax, standard, log] [reconstruction, conference, computer, defined, vision, international, novel, distance]
@InProceedings{Sun_2020_CVPR,
  author = {Sun, Xin and Yang, Zhenning and Zhang, Chi and Ling, Keck-Voon and Peng, Guohao},
  title = {Conditional Gaussian Distribution Learning for Open Set Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Connect-and-Slice: An Hybrid Approach for Reconstructing 3D Objects
Hao Fang, Florent Lafarge


Converting point clouds generated by laser scanning, multiview stereo imagery or depth cameras into compact polygon meshes is a challenging problem in vision. Existing methods are either robust to imperfect data or scalable, but rarely both. In this paper, we address this issue with a hybrid method that successively connects and slices planes detected from 3D data. The core idea consists in constructing an efficient and compact partitioning data structure. The latter is i) spatially adaptive, in the sense that a plane slices only a restricted number of relevant planes, and ii) composed of components with different structural meanings resulting from a preliminary analysis of the plane connectivity. Our experiments on a variety of objects and sensors show the versatility of our approach as well as its competitiveness with respect to existing methods.
[step, order, observed, three, urban, lying, illustrated, time, recognition] [bounding, detected, anchor, box, building] [input, robust, pcc, impose, preliminary] [output, figure, partition, analysis, running, pattern, extraction, existing, method, fast] [structural, missing, domain, consists, obvious] [data, algorithm, set, number, typically, complexity, compact, large, energy, problem, strategy, computational] [surface, slicing, primitive, computer, connectivity, geometric, facet, reconstruction, polygonal, vision, point, polyfit, partitioning, border, polygon, plane, assembling, allows, solution, well, second, solve, intersection, conference, approach, planar, polyhedral, mesh, intersect, shape, structuring, dense, volume]
@InProceedings{Fang_2020_CVPR,
  author = {Fang, Hao and Lafarge, Florent},
  title = {Connect-and-Slice: An Hybrid Approach for Reconstructing 3D Objects},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Attentive Weights Generation for Few Shot Learning via Information Maximization
Yiluan Guo, Ngai-Man Cheung


Few shot image classification aims at learning a classifier from limited labeled data. Generating the classification weights has been applied in many meta-learning methods for few shot image classification due to its simplicity and effectiveness. In this work, we present Attentive Weights Generation for few shot learning via Information Maximization (AWGIM), which introduces two novel contributions: i) Mutual information maximization between generated weights and data within the task; this enables the generated weights to retain information of the task and the specific query sample. ii) Self-attention and cross-attention paths to encode the context of the task and individual queries. Both two contributions are shown to be very effective in extensive experiments. Overall, AWGIM is competitive with state-of-the-art. Code is available at https://github.com/Yiluan/AWGIM.
[attention, encode, context, encoding, relation, prediction, time, individual] [attentive, contextual, feature, table, extractor] [query, model, trained] [proposed, analysis] [generated, image, generate, cross, generation, generator, conditioned, learn, loss, generating, latent, specific, code] [classification, learning, support, set, awgim, leo, task, shot, path, data, network, accuracy, class, equation, maximization, maximizing, shuffle, labeled, meta, sample, function, xcp, weight, miniimagenet, mutual, deep, objective, random, performance, log, training, problem, neural, lower, sampled, inner, entropy, distribution, optimal, learned, label, bound, complexity, applied, fixed, computational] [mlp, approach]
@InProceedings{Guo_2020_CVPR,
  author = {Guo, Yiluan and Cheung, Ngai-Man},
  title = {Attentive Weights Generation for Few Shot Learning via Information Maximization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Assessing Eye Aesthetics for Automatic Multi-Reference Eye In-Painting
Bo Yan, Qing Lin, Weimin Tan, Shili Zhou


With the wide use of artistic images, aesthetic quality assessment has attracted broad attention. How to integrate aesthetics into image editing is still a problem worthy of discussion. In this paper, aesthetic assessment is introduced into the eye in-painting task for the first time. We construct an eye aesthetic dataset, and train the eye aesthetic assessment network on this basis. Then we propose a novel eye aesthetic and face semantic guided multi-reference eye inpainting GAN approach (AesGAN), which automatically selects the best reference under the guidance of eye aesthetics. A new aesthetic loss has also been introduced into the network to learn the eye aesthetic features and generate high-quality eyes. We prove the effectiveness of eye aesthetic assessment in our experiments, which may inspire more applications of aesthetics assessment. Both qualitative and quantitative experimental results show that the proposed AesGAN can produce more natural and visually attractive eyes compared with state-of-the-art methods.
[dataset, work, order, three, visual, natural, provide] [parsing, module, effectiveness, feature, semantic, guided, branch, scoring, table, score, propose, final] [eye, aesthetic, assessment, quality, face, aesgan, aesnet, exgan, original, input, experimental, identity, facial, adversarial] [reference, figure, ieee, based, method, quantitative, ssim, comparison, pattern, proposed, extraction, result, traditional, output, high, residual] [image, loss, generated, produce, introduce, train, inpainting, generate, inception, gan, learn, structural] [network, better, best, select, performance, task, learning, compared, algorithm, selected, selection, number, training, test, baseline, sample, softmax] [reconstruction, computer, conference, vision, incomplete, single, acm, defined]
@InProceedings{Yan_2020_CVPR,
  author = {Yan, Bo and Lin, Qing and Tan, Weimin and Zhou, Shili},
  title = {Assessing Eye Aesthetics for Automatic Multi-Reference Eye In-Painting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PuppeteerGAN: Arbitrary Portrait Animation With Semantic-Aware Appearance Transformation
Zhuo Chen, Chaoyue Wang, Bo Yuan, Dacheng Tao


Portrait animation, which aims to animate a still portrait to life using poses extracted from target frames, is an important technique for many real-world entertainment applications. Although recent works have achieved highly realistic results in synthesizing or controlling human head images, the puppeteering of arbitrary portraits is still confronted by the following challenges: 1) identity/personality mismatch; 2) training data/domain limitations; and 3) low efficiency in training/fine-tuning. In this paper, we devised a novel two-stage framework called PuppeteerGAN for solving these challenges. Specifically, we first learn identity-preserved semantic segmentation animation, which executes pose retargeting between any portraits. As a general representation, the semantic segmentation results can be adapted to different datasets, environmental conditions or appearance domains. Furthermore, the synthesized semantic segmentation is filled with the appearance of the source portrait. To this end, an appearance transformation network is presented to produce high-fidelity output by jointly considering the warping of semantic features and conditional generation. After training, the two networks can directly perform end-to-end inference on unseen subjects without any retraining or fine-tuning. Extensive experiments on cross-identity/domain/resolution situations demonstrate the superiority of the proposed PuppeteerGAN over existing portrait animation methods in both generation quality and inference speed.
[frame, recognition, video, dataset, time, decoder] [segmentation, mask, semantic, framework, detected, feature, including] [face, facial, identity, trained, landmark, input, adversarial, model, expression, generalization] [proposed, method, based, ieee, driven, pattern, figure, result] [portrait, source, image, appearance, target, coloring, generated, generation, sketching, animation, encoder, animated, puppeteergan, retargeting, extracted, generate, averbuch, realistic, conditional, zakharov, unseen, specific, mjs, arbitrary, person, loss, synthesize, lsj, synthesizing] [network, training, learning, inference, large, experiment, deep, compared, neural, dacheng] [pose, conference, computer, vision, geometry, international, cost, deformation, transformation, acm, demonstrate, animate]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Zhuo and Wang, Chaoyue and Yuan, Bo and Tao, Dacheng},
  title = {PuppeteerGAN: Arbitrary Portrait Animation With Semantic-Aware Appearance Transformation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition
Zhi Qiao, Yu Zhou, Dongbao Yang, Yucan Zhou, Weiping Wang


Scene text recognition is a hot research topic in computer vision. Recently, many recognition methods based on the encoder-decoder framework have been proposed, and they can handle scene text with perspective distortion and curved shapes. Nevertheless, they still face many challenges, such as image blur, uneven illumination, and incomplete characters. We argue that most encoder-decoder methods are based on local visual features without explicit global semantic information. In this work, we propose a semantics enhanced encoder-decoder framework to robustly recognize low-quality scene text. The semantic information is used both in the encoder module for supervision and in the decoder module for initialization. In particular, the state-of-the-art ASTER method is integrated into the proposed framework as an exemplar. Extensive experiments demonstrate that the proposed framework is more robust to low-quality text images, and achieves state-of-the-art results on several benchmark datasets. The source code will be available.
[text, word, recognition, decoder, embedding, aster, visual, attention, language, semantics, context, sequence, fasttext, svtp, natural, irregular, lstm, cong, mechanism, predict, character, spotting, sar, rnn, decoding, predicting, illustrated] [semantic, framework, global, predicted, module, detection, feature, cnn, xiang, supervision, propose, object, represents, including, map] [model, robust, input, trained, datasets] [method, proposed, based, existing, rectification, enhanced, output, figure, convolutional] [encoder, image, loss, supervised, proposes, consists, generate] [neural, performance, learning, training, deep, network, compared, set, function, linear, problem, vector, best, machine, better] [scene, incomplete, predicts, limited, handle, second]
@InProceedings{Qiao_2020_CVPR,
  author = {Qiao, Zhi and Zhou, Yu and Yang, Dongbao and Zhou, Yucan and Wang, Weiping},
  title = {SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Texture and Shape Biased Two-Stream Networks for Clothing Classification and Attribute Recognition
Yuwei Zhang, Peng Zhang, Chun Yuan, Zhi Wang


Clothes category classification and attribute recognition have achieved remarkable success with the development of deep learning. People have found that landmark detection plays a positive role in these tasks. However, little research has been devoted to analyzing these tasks from the perspective of clothing attributes. In our work, we explore the usefulness of landmarks and find that landmarks can assist in extracting shape features, and that using landmarks for joint learning can effectively increase classification and recognition accuracy. We also find that texture features have a strong positive effect on these tasks and that the pre-trained ImageNet model performs well in extracting texture features. To this end, we propose to use two streams to enhance the extraction of shape and texture, respectively. In particular, this paper proposes a simple implementation, Texture and Shape biased Fashion Networks (TS-FashionNet). Comprehensive experiments demonstrate our findings and the effectiveness of our model. We improve the top-3 classification accuracy by 0.83% and the top-3 attribute recognition recall rate by 1.39% compared to state-of-the-art models.
[recognition, attention, stream, dataset, retrieval, extract, role, predict, work, understanding, prediction, previous, mechanism] [category, detection, table, biased, branch, recall, feature, predicted, wang, module, propose, location, grammar, adopt] [landmark, fashion, clothing, model, clothes, visibility, improve, experimental] [method, based, enhance, analysis, convolutional, cnns, figure, extraction, proposed, output, performs, achieved] [attribute, texture, image, learn, corresponding, style, loss] [learning, classification, network, imagenet, accuracy, deep, find, rate, neural, layer, performance, better, data, training, size, bias, task, baseline] [shape, joint, ground, jointly, truth, demonstrate, perspective]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Yuwei and Zhang, Peng and Yuan, Chun and Wang, Zhi},
  title = {Texture and Shape Biased Two-Stream Networks for Clothing Classification and Attribute Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Distortion Agnostic Deep Watermarking
Xiyang Luo, Ruohan Zhan, Huiwen Chang, Feng Yang, Peyman Milanfar


Watermarking is the process of embedding information into an image that can survive under distortions, while requiring the encoded image to have little or no perceptual difference with the original image. Recently, deep learning-based methods achieved impressive results in both visual quality and message payload under a wide variety of image distortions. However, these methods all require differentiable models for the image distortions at training time, and may generalize poorly to unknown distortions. This is undesirable since the types of distortions applied to watermarked images are usually unknown and non-differentiable. In this paper, we propose a new framework for distortion-agnostic watermarking, where the image distortion is not explicitly modeled during training. Instead, the robustness of our system comes from two sources: adversarial training and channel coding. Compared to training on a fixed set of distortions and noise levels, our method achieves comparable or better results on distortions available during training, and better performance overall on unknown distortions.
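Channel coding here means adding redundancy to the message before embedding so the decoder can still recover it after unknown distortions. The simplest instance is a repetition code with majority-vote decoding, sketched below purely for illustration; the paper's actual coding scheme is not specified in this abstract, so treat this as an assumption-labeled toy rather than their implementation.

import numpy as np

def encode_with_redundancy(bits, repeat=3):
    """Repeat each message bit so the decoder can majority-vote after distortion."""
    return np.repeat(np.asarray(bits, dtype=np.uint8), repeat)

def decode_with_majority(noisy_bits, repeat=3):
    """Recover the message by majority vote over each group of repeated bits."""
    groups = np.asarray(noisy_bits).reshape(-1, repeat)
    return (groups.mean(axis=1) > 0.5).astype(np.uint8)

msg = np.random.randint(0, 2, 30)
coded = encode_with_redundancy(msg)
flips = (np.random.rand(coded.size) < 0.1).astype(np.uint8)   # simulate ~10% bit flips
recovered = decode_with_majority(coded ^ flips)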
[message, hidden, decoder, length] [table, redundant, propose, framework, cnn] [adversarial, watermarking, model, distortion, attack, combined, gadv, encoded, trained, noise, input, identity, robustness, fenc, digital, fdec, jpeg, ien, iadv, strength, original, ian, robust, difference, type, xdec] [channel, figure, coding, method, gaussian, convolutional, adjust, ieee, blur, combination, gif, color, comparison, psnr, based, residual] [image, loss, generated, encoder, unknown, train, generate, diverse] [training, network, learning, bit, accuracy, deep, performance, neural, set, equation, arxiv, preprint, applied, compared, better, agnostic, wide, comparable, binary] [conference, international, system, computer, additional, cover, detailed]
@InProceedings{Luo_2020_CVPR,
  author = {Luo, Xiyang and Zhan, Ruohan and Chang, Huiwen and Yang, Feng and Milanfar, Peyman},
  title = {Distortion Agnostic Deep Watermarking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
RMP-SNN: Residual Membrane Potential Neuron for Enabling Deeper High-Accuracy and Low-Latency Spiking Neural Network
Bing Han, Gopalakrishnan Srinivasan, Kaushik Roy


Spiking Neural Networks (SNNs) have recently attracted significant research interest as the third generation of artificial neural networks that can enable low-power event-driven data analytics. The best performing SNNs for image recognition tasks are obtained by converting a trained Analog Neural Network (ANN), consisting of Rectified Linear Units (ReLU), to an SNN composed of integrate-and-fire neurons with "proper" firing thresholds. The converted SNNs typically incur a loss in accuracy compared to that provided by the original ANN and require a sizable number of inference time-steps to achieve the best accuracy. We find that the performance degradation in the converted SNN stems from using "hard reset" spiking neurons that are driven to a fixed reset potential once their membrane potential exceeds the firing threshold, leading to information loss during SNN inference. We propose ANN-SNN conversion using a "soft reset" spiking neuron model, referred to as the Residual Membrane Potential (RMP) spiking neuron, which retains the "residual" membrane potential above the threshold at the firing instants. We demonstrate near loss-less ANN-SNN conversion using RMP neurons for VGG-16, ResNet-20, and ResNet-34 SNNs on challenging datasets including CIFAR-10 (93.63% top-1), CIFAR-100 (70.93% top-1), and ImageNet (73.09% top-1 accuracy). Our results also show that RMP-SNN surpasses the best inference accuracy provided by the converted SNN with "hard reset" spiking neurons using 2-8 times fewer inference time-steps across network architectures and datasets.
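The difference between the two reset rules is easy to see in a toy simulation: a hard reset discards whatever potential exceeded the threshold, while the soft ("residual membrane potential") reset subtracts the threshold and keeps the surplus, so the firing rate tracks the input more faithfully. The sketch below is our minimal illustration of that contrast, not the paper's conversion pipeline.

import numpy as np

def integrate_and_fire(inputs, v_th=1.0, soft_reset=True):
    """Simulate one IF neuron; the soft reset keeps the residual potential above threshold."""
    v, spikes = 0.0, []
    for x in inputs:
        v += x                                   # integrate the weighted input
        if v >= v_th:
            spikes.append(1)
            v = v - v_th if soft_reset else 0.0  # RMP-style subtraction vs. hard reset to zero
        else:
            spikes.append(0)
    return np.array(spikes)

x = np.full(20, 0.4)                              # constant drive of 0.4 per time step
print(integrate_and_fire(x, soft_reset=True).mean(),   # ~0.4 firing rate, matching the input
      integrate_and_fire(x, soft_reset=False).mean())  # lower: potential is lost at each reset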
[dataset, time, described, activity, composed, recognition] [threshold, propose] [input, trained, datasets, encoded] [spike, residual, output, relu, figure, achieved, proposed, range, convolutional, neuromorphic] [loss, image, unsupervised] [spiking, snn, accuracy, inference, neuron, vth, rate, rmp, neural, conversion, ann, potential, firing, vin, deep, fout, network, snns, average, training, membrane, imagenet, fin, linear, baseline, reduced, learning, weighted, converted, latency, performance, higher, initialization, best, computational, compared, sum, achieve, layer, number, max, sparsity, classification, kaushik, processing, balancing, performing, lower, backpropagation] [international, conference, error]
@InProceedings{Han_2020_CVPR,
  author = {Han, Bing and Srinivasan, Gopalakrishnan and Roy, Kaushik},
  title = {RMP-SNN: Residual Membrane Potential Neuron for Enabling Deeper High-Accuracy and Low-Latency Spiking Neural Network},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
BFBox: Searching Face-Appropriate Backbone and Feature Pyramid Network for Face Detector
Yang Liu, Xu Tang


Popular backbones designed for image classification have demonstrated considerable compatibility with the task of general object detection. However, the same phenomenon does not appear in face detection. This is largely because the average scale of ground truth in the WiderFace dataset is far smaller than that of generic objects in the COCO one. To resolve this, the success of Neural Architecture Search (NAS) inspires us to search for a face-appropriate backbone and feature pyramid network (FPN) architecture. First, we design the search space for the backbone and FPN by comparing the performance of feature maps with different backbones and excellent FPN architectures on face detection. Second, we propose an FPN-attention module to jointly search the architecture of the backbone and FPN. Finally, we conduct comprehensive experiments on popular benchmarks, including Wider Face, FDDB, AFW and PASCAL Face, displaying the superiority of our proposed method.
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yang and Tang, Xu},
  title = {BFBox: Searching Face-Appropriate Backbone and Feature Pyramid Network for Face Detector},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PFCNN: Convolutional Neural Networks on 3D Surfaces Using Parallel Frames
Yuqi Yang, Shilin Liu, Hao Pan, Yang Liu, Xin Tong


Surface meshes are widely used shape representations and capture finer geometric data than point clouds or volumetric grids, but their non-Euclidean structure makes it challenging to apply CNNs to them directly. We use parallel frames on surfaces to define PFCNNs that enable effective feature learning on surface meshes by faithfully mimicking standard convolutions. In particular, the convolution of PFCNN not only maps local surface patches onto flat tangent planes, but also aligns the tangent planes such that they locally form a flat Euclidean structure, thus enabling recovery of standard convolutions. The alignment is achieved by the tool of locally flat connections borrowed from discrete differential geometry, which can be efficiently encoded and computed by parallel frame fields. In addition, the lack of a canonical axis on the surface is handled by sampling with the frame directions. Experiments show that for tasks including classification, segmentation and registration on deformable geometric domains, as well as semantic scene segmentation on rigid domains, PFCNNs achieve robust and superior performance compared to state-of-the-art surface-based CNNs, without using sophisticated input features.
[frame, regular, previous, trainable] [feature, map, segmentation, table, framework, salient] [input, original, encoded, effective] [convolution, parallel, cnns, field, patch, kernel, based, convolutional, method, ieee, deformable, figure, pattern] [translation, transport, image, domain, align] [learning, singular, standard, network, accuracy, vector, space, neural, deep, classification, better, data, task, achieve, test, sample] [surface, tangent, flat, shape, vertex, geodesic, local, locally, cover, pfcnn, registration, computer, point, conference, geometric, mdgcnn, equivariance, mesh, plane, geometry, pfcnns, euclidean, curvature, human, body, volumetric, canonical, structure, coordinate, acm, vision, scene]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Yuqi and Liu, Shilin and Pan, Hao and Liu, Yang and Tong, Xin},
  title = {PFCNN: Convolutional Neural Networks on 3D Surfaces Using Parallel Frames},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
iTAML: An Incremental Task-Agnostic Meta-learning Approach
Jathushan Rajasegaran, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Mubarak Shah


Humans can continuously learn new knowledge as their experience grows. In contrast, knowledge previously learned by deep neural networks can quickly fade when they are trained on a new task. In this paper, we hypothesize that this problem can be avoided by learning a set of generalized parameters that are specific to neither old nor new tasks. In this pursuit, we introduce a novel meta-learning approach that seeks to maintain an equilibrium between all the encountered tasks. This is ensured by a new meta-update rule which avoids catastrophic forgetting. In comparison to previous meta-learning techniques, our approach is task-agnostic. When presented with a continuum of data, our model automatically identifies the task and quickly adapts to it with just a single update. We perform extensive experiments on five datasets in a class-incremental setting, leading to significant improvements over state-of-the-art methods (e.g., a 21.3% boost on CIFAR100 with 10 incremental tasks). Specifically, on large-scale datasets that generally prove difficult for incremental learning, our approach delivers absolute gains as high as 19.1% and 7.4% on the ImageNet and MS-Celeb datasets, respectively.
[prediction, predict, tpred, current, previous, automatically] [predicted, feature, response, achieves, propose, final] [model, generic, datasets, move, rule, adapts, case] [proposed, based, figure, existing, scale, pattern, ieee, comparison] [learn, exemplar, generalized, specific, loss, learns] [task, itaml, learning, data, continuum, incremental, class, inner, classification, set, training, number, gradient, update, accuracy, memory, reptile, algorithm, network, size, performance, updated, continual, deep, neural, forgetting, outer, optimal, maximum, arxiv, preprint, fomaml, find, catastrophic, space, rate, close, inference, incrementally] [loop, approach, single, conference, computer, well, joint, solution]
@InProceedings{Rajasegaran_2020_CVPR,
  author = {Rajasegaran, Jathushan and Khan, Salman and Hayat, Munawar and Khan, Fahad Shahbaz and Shah, Mubarak},
  title = {iTAML: An Incremental Task-Agnostic Meta-learning Approach},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Optimal least-squares solution to the hand-eye calibration problem
Amit Dekel, Linus Harenstam-Nielsen, Sergio Caccamo


We propose a least-squares formulation to the noisy hand-eye calibration problem using dual-quaternions, and introduce efficient algorithms to find the exact optimal solution, based on analytic properties of the problem, avoiding non-linear optimization. We further present simple analytic approximate solutions which provide remarkably good estimations compared to the exact solution. In addition, we show how to generalize our solution to account for a given extrinsic prior in the cost function. To the best of our knowledge our algorithm is the most efficient approach to optimally solve the hand-eye calibration problem.
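For orientation, the hand-eye problem asks for the unknown sensor-to-gripper transform X = (R_X, t_X) that satisfies A_i X = X B_i for every measured motion pair (A_i, B_i); written out in rotation and translation parts this gives the two coupled conditions below. This decomposition is the standard textbook formulation of the problem, stated here only as background; the paper's contribution is the exact, optimal dual-quaternion least-squares solution of this system, which these equations themselves do not show.

\[ R_{A_i} R_X = R_X R_{B_i}, \qquad (R_{A_i} - I)\, t_X = R_X\, t_{B_i} - t_{A_i}. \]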
[order, corresponds] [table, positive] [nonlinear, noise, change, case, degree] [prior, introduced, motion, noisy, based, figure, generally] [real, translation, corresponding, synthetic, notice, introduce, free, perform, representation] [problem, optimal, algorithm, minimization, find, function, best, data, optimization, maximum, equation, finding, efficient, approximate, lagrange, number, group, minimize, paper, matrix, equivalent, dqconvrlx, linear, appendix, lower] [solution, cost, calibration, rotation, solve, eigenvalue, dqopt, dan, analytic, relative, term, polynomial, smallest, compare, approach, well, solving, robotics, translational, planar, quaternion, constraint, second, error, transformation, rotational, axis, notation, define, correspond, chvecopt, quatvecopt, international, formulation]
@InProceedings{Dekel_2020_CVPR,
  author = {Dekel, Amit and Harenstam-Nielsen, Linus and Caccamo, Sergio},
  title = {Optimal least-squares solution to the hand-eye calibration problem},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MnasFPN: Learning Latency-Aware Pyramid Architecture for Object Detection on Mobile Devices
Bo Chen, Golnaz Ghiasi, Hanxiao Liu, Tsung-Yi Lin, Dmitry Kalenichenko, Hartwig Adam, Quoc V. Le


Despite the blooming success of architecture search for vision tasks in resource-constrained environments, the design of on-device object detection architectures has mostly been manual. The few automated search efforts are either centered around non-mobile-friendly search spaces or not guided by on-device latency. We propose MnasFPN, a mobile-friendly search space for the detection head, and combine it with latency-aware architecture search to produce efficient object detection models. The learned MnasFPN head, when paired with a MobileNetV2 body, outperforms MobileNetV3+SSDLite by 1.8 mAP at similar latency on Pixel. It is both 1 mAP more accurate and 10% faster than NAS-FPNLite. Ablation studies show that the majority of the performance gain comes from innovations in the search space. Further explorations reveal an interesting coupling between the search space design and the search algorithm, for which the complexity of the MnasFPN search space is opportune.
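Latency-aware search of this kind is typically driven by a reward that trades measured on-device latency against accuracy, for example the MnasNet-style soft constraint reward = accuracy x (latency / target)^w with a small negative exponent. The one-liner below illustrates that objective; whether MnasFPN uses exactly this form is our assumption, and the numbers in the usage line are made up.

def latency_aware_reward(accuracy, latency_ms, target_ms=150.0, w=-0.07):
    """Soft latency constraint: reward accuracy, discount models slower than the target."""
    return accuracy * (latency_ms / target_ms) ** w

# a model slightly over the latency target is penalised, one under it is rewarded
print(latency_aware_reward(0.30, 180.0), latency_aware_reward(0.29, 120.0))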
[work, current] [detection, feature, object, head, map, ablation, coco, table, backbone, faster, merging, pyramid, merge, box, jian, kaiming, ross] [model, input] [intermediate, pattern, ieee, channel, block, convolution, output, resolution, figure, based, cell, residual, comparison, proposed, expansion, kernel, convolutional] [train] [search, mnasfpn, space, architecture, latency, mobile, design, performance, neural, size, efficient, network, ssdlite, arxiv, preprint, controller, sdo, task, proxy, training, learning, quoc, inverted, set, operation, frontier, class, process, depthwise, mnasnet, searchable, count, setup] [conference, computer, vision, connectivity, well, full, despite, compare]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Bo and Ghiasi, Golnaz and Liu, Hanxiao and Lin, Tsung-Yi and Kalenichenko, Dmitry and Adam, Hartwig and Le, Quoc V.},
  title = {MnasFPN: Learning Latency-Aware Pyramid Architecture for Object Detection on Mobile Devices},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
VSGNet: Spatial Attention Network for Detecting Human Object Interactions Using Graph Convolutions
Oytun Ulutan, A S M Iftekhar, B. S. Manjunath


Comprehensive visual understanding requires detection frameworks that can effectively learn and utilize object interactions while analyzing objects individually. This is the main objective in Human-Object Interaction (HOI) detection task. In particular, relative spatial reasoning and structural connections between objects are essential cues for analyzing interactions, which is addressed by the proposed Visual-Spatial-Graph Network (VSGNet) architecture. VSGNet extracts visual features from the human-object pairs, refines the features with spatial configurations of the pair, and utilizes the structural connections between the pair via graph convolutions. The performance of VSGNet is thoroughly evaluated using the Verbs in COCO (V-COCO) dataset. Experimental results indicate that VSGNet outperforms state-of-the-art solutions by 8% or 4 mAP.
[visual, attention, graph, interaction, pair, action, prediction, extract, previous, iho, outperforms, state, adjacency, relation] [object, branch, vsgnet, feature, detection, proposal, map, bounding, detect, table, hoi, achieves, score, fho, aho, ican, backbone, refine, edge] [model, detecting, scenario, input, datasets] [spatial, ieee, proposed, pattern, convolutional, figure, method, existing, spatially, residual] [image, structural, utilize, generate] [network, performance, class, training, configuration, size, base, vector, task, set, binary, learning, classification, evaluate, average] [human, computer, conference, vision, scene, defined, directly, approach, define]
@InProceedings{Ulutan_2020_CVPR,
  author = {Ulutan, Oytun and Iftekhar, A S M and Manjunath, B. S.},
  title = {VSGNet: Spatial Attention Network for Detecting Human Object Interactions Using Graph Convolutions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
End-to-End Camera Calibration for Broadcast Videos
Long Sha, Jennifer Hobbs, Panna Felsen, Xinyu Wei, Patrick Lucey, Sujoy Ganguly


The increasing number of vision-based tracking systems deployed in production has necessitated fast, robust camera calibration. In the domain of sports, the majority of current work focuses on sports where lines and intersections are easy to extract, and appearance is relatively consistent across venues. However, for more challenging sports like basketball, those techniques are not sufficient. In this paper, we propose an end-to-end approach for single moving camera calibration across challenging scenarios in sports. Our method contains three key modules: 1) area-based court segmentation, 2) camera pose estimation with embedded templates, 3) homography prediction via a spatial transform network (STN). All three modules are connected, enabling end-to-end training. We evaluate our method on a new college basketball dataset and demonstrate state-of-the-art performance in variable and dynamic environments. We also validate our method on the World Cup 2014 dataset to show its competitive performance against the state-of-the-art methods. Lastly, we show that our method is two orders of magnitude faster than the previous state of the art on both datasets.
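A minimal sketch of the kind of differentiable homography warp an STN-based module relies on (PyTorch); the function name and the convention that H maps output pixel coordinates to input pixel coordinates are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def warp_with_homography(image, H, out_h, out_w):
    """Warp `image` (B,C,H,W) with homographies `H` (B,3,3) that map output
    pixel coordinates to input pixel coordinates, using grid_sample so the
    operation stays differentiable for end-to-end training."""
    B, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(out_h), torch.arange(out_w), indexing="ij")
    ones = torch.ones_like(xs)
    coords = torch.stack([xs, ys, ones], dim=-1).float().reshape(-1, 3)   # (N,3)
    coords = coords.unsqueeze(0).repeat(B, 1, 1)                          # (B,N,3)
    src = torch.bmm(coords, H.transpose(1, 2))                            # apply H
    src = src[..., :2] / src[..., 2:3].clamp(min=1e-8)
    # normalise source coordinates to [-1, 1] for grid_sample
    sx = src[..., 0] / (w - 1) * 2 - 1
    sy = src[..., 1] / (h - 1) * 2 - 1
    grid = torch.stack([sx, sy], dim=-1).reshape(B, out_h, out_w, 2)
    return F.grid_sample(image, grid, align_corners=True)
```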
[basketball, dataset, soccer, previous, overhead, broadcast, moving, video, work, cup, transformer, evaluation, long, state, length] [semantic, template, segmentation, refinement, module, challenging, table, tracking, location, siamese] [model, input, perfect, fraction, robust] [method, homography, field, figure, chen, ieee, transform, spatial, stn, dynamic, based, pixel] [image, loss, generate, row, appearance, train] [training, network, set, large, initialization, number, small, search, data, neural, better, top, function, matrix, performance, dictionary, task, required, algorithm] [camera, pose, calibration, computer, conference, ground, truth, iouentire, ioupart, focal, view, vision, registration, perspective, occupancy, single, handle, compute, international, approach, court, sharma]
@InProceedings{Sha_2020_CVPR,
  author = {Sha, Long and Hobbs, Jennifer and Felsen, Panna and Wei, Xinyu and Lucey, Patrick and Ganguly, Sujoy},
  title = {End-to-End Camera Calibration for Broadcast Videos},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Regularizing CNN Transfer Learning With Randomised Regression
Yang Zhong, Atsuto Maki


This paper is about regularizing deep convolutional networks (CNNs) based on an adaptive framework for transfer learning with limited training data in the target domain. Recent advances in CNN regularization in this context are commonly due to the use of additional regularization objectives. They guide the training away from the target task using some form of concrete task. Unlike those related approaches, we suggest that an objective without a concrete goal can still serve well as a regularizer. In particular, we demonstrate Pseudo-task Regularization (PtR), which dynamically regularizes a network by simply attempting to regress image representations to pseudo-regression targets during fine-tuning. That is, a CNN is efficiently regularized without additional resources of data or prior domain expertise. In sum, the proposed PtR provides: a) an alternative for network regularization without dependence on the design of concrete regularization objectives or extra annotations; b) a dynamically adjusted and maintained strength of regularization effect, achieved by balancing the gradient norms between objectives on-line. Through numerous experiments, surprisingly, the improvements in classification accuracy from PtR are shown to be greater than, or on a par with, recent state-of-the-art methods.
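A minimal PyTorch sketch of the idea as described above: regress the network's features to fixed random pseudo-targets and balance the two objectives by their gradient norms. The tiny model, shapes and exact balancing rule are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical tiny backbone/head; the pseudo-task operates on penultimate features.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
head = nn.Linear(256, 10)
pseudo_target = torch.randn(256)          # fixed random pseudo-regression target
ce = nn.CrossEntropyLoss()
opt = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()), lr=0.01)

def step(x, y):
    feats = backbone(x)                                      # image representations
    loss_cls = ce(head(feats), y)                            # target task
    loss_ptr = (feats - pseudo_target).pow(2).mean()         # pseudo-task regression
    # balance the objectives via the ratio of their gradient norms w.r.t. the features
    g_cls = torch.autograd.grad(loss_cls, feats, retain_graph=True)[0].norm()
    g_ptr = torch.autograd.grad(loss_ptr, feats, retain_graph=True)[0].norm()
    lam = (g_cls / (g_ptr + 1e-8)).detach()
    loss = loss_cls + lam * loss_ptr
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)

# e.g. step(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))
```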
[recognition] [regression, feature, table, cnn, framework] [model, norm, auxiliary, input, trained, study] [ieee, convolutional, figure, pattern, method, achieved, based, cnns] [target, loss, transfer, image, source, domain, lce, generated, representation, common, independent] [ptr, regularization, training, learning, network, data, gradient, accuracy, gain, deep, task, baseline, weight, concrete, classification, class, vanilla, impact, decay, higher, batch, validation, btfw, performance, machine, objective, neural, better, random, average, rate, set, fnp, compared, gce, ratio, learned, best, test, improved, labeled] [conference, computer, vision, international, additional, form]
@InProceedings{Zhong_2020_CVPR,
  author = {Zhong, Yang and Maki, Atsuto},
  title = {Regularizing CNN Transfer Learning With Randomised Regression},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
KeypointNet: A Large-Scale 3D Keypoint Dataset Aggregated From Numerous Human Annotations
Yang You, Yujing Lou, Chengkun Li, Zhoujun Cheng, Liangwei Li, Lizhuang Ma, Cewu Lu, Weiming Wang


Detecting 3D object keypoints is of great interest to both graphics and computer vision. There have been several 2D and 3D keypoint datasets aiming to address this problem in a data-driven way. These datasets, however, either lack scalability or bring ambiguity to the definition of keypoints. Therefore, we present KeypointNet: the first large-scale and diverse 3D keypoint dataset that contains 83,231 keypoints and 8,329 3D models from 16 object categories, built by leveraging numerous human annotations. To handle the inconsistency between annotations from different people, we propose a novel method to aggregate these keypoints automatically, through minimization of a fidelity loss. Finally, ten state-of-the-art methods are benchmarked on our proposed dataset.
[dataset, embedding, evaluation, people, cap, predict] [semantic, object, table, detection, saliency, benchmark, interest, threshold, propose, aggregate, annotated, map, miou, annotation, category, bottle, helmet, aggregated] [model, datasets] [figure, ieee, pattern, method, based, raw, proposed] [fidelity] [learning, deep, set, labeled, number, large, potential, problem, simple, general, neural, evaluate, clustering, data, arg, network, arxiv, preprint] [keypoint, keypoints, human, point, computer, conference, distance, estimation, correspondence, vision, rscnn, dgcnn, graphcnn, pointconv, local, mesh, error, pointnet, spidercnn, geometric, consistent, chair, airplane, solve, pose, cloud, rsnet, international, leonidas, dutagaci, syncspeccnn, hao]
@InProceedings{You_2020_CVPR,
  author = {You, Yang and Lou, Yujing and Li, Chengkun and Cheng, Zhoujun and Li, Liangwei and Ma, Lizhuang and Lu, Cewu and Wang, Weiming},
  title = {KeypointNet: A Large-Scale 3D Keypoint Dataset Aggregated From Numerous Human Annotations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hierarchical Clustering With Hard-Batch Triplet Loss for Person Re-Identification
Kaiwei Zeng, Munan Ning, Yaohua Wang, Yang Guo


For clustering-guided fully unsupervised person re-identification (re-ID) methods, the quality of pseudo labels generated by clustering directly decides the model performance. In order to improve the quality of pseudo labels in existing methods, we propose the HCT method, which combines hierarchical clustering with hard-batch triplet loss. The key idea of HCT is to make full use of the similarity among samples in the target dataset through hierarchical clustering and to reduce the influence of hard examples through hard-batch triplet loss, so as to generate high-quality pseudo labels and improve model performance. Specifically, (1) we use hierarchical clustering to generate pseudo labels, (2) we use PK sampling in each iteration to generate a new dataset for training, (3) we conduct training with hard-batch triplet loss and evaluate model performance in each iteration. We evaluate our model on Market-1501 and DukeMTMC-reID. Results show that HCT achieves 56.4% mAP on Market-1501 and 50.7% mAP on DukeMTMC-reID, which surpasses the state of the art by a large margin in fully unsupervised re-ID and even outperforms most unsupervised domain adaptation (UDA) methods that use the labeled source dataset. Code will be released soon at https://github.com/zengkaiwei/HCT
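A minimal PyTorch sketch of the batch-hard ("hard-batch") triplet loss typically used with PK sampling; the margin value is illustrative.

```python
import torch

def hard_batch_triplet_loss(feats, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor in a PK-sampled batch, take its
    hardest (farthest) positive and hardest (closest) negative."""
    dist = torch.cdist(feats, feats)                        # (N,N) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)       # same pseudo-label mask
    hardest_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```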
[hierarchical, step, dataset, order, people] [merging, map, fully, hard, annotated, merge, false, liang, propose, represents, table, focus] [model, quality, improve, effectively, influence, conduct, difficult] [method, figure, comparison, high] [pseudo, unsupervised, person, generate, hct, target, domain, transfer, loss, generated, buc, source, supervised, cluster, alse, uda, distinguish, adaptation, image, xia, ositive] [clustering, performance, triplet, training, learning, manually, number, better, set, evaluate, reduce, sampling, labeled, data, sample, best, similarity, iteration, unlabeled, deep, imagenet, baseline, pairwise, epoch, surpasses, good, early] [distance, directly, direct, finally, measurement]
@InProceedings{Zeng_2020_CVPR,
  author = {Zeng, Kaiwei and Ning, Munan and Wang, Yaohua and Guo, Yang},
  title = {Hierarchical Clustering With Hard-Batch Triplet Loss for Person Re-Identification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Joint Semantic Segmentation and Boundary Detection Using Iterative Pyramid Contexts
Mingmin Zhen, Jinglu Wang, Lei Zhou, Shiwei Li, Tianwei Shen, Jiaxiang Shang, Tian Fang, Long Quan


In this paper, we present a joint multi-task learning framework for semantic segmentation and boundary detection. The critical component in the framework is the iterative pyramid context module (PCM), which couples two tasks and stores the shared latent semantics to interact between the two tasks. For semantic boundary detection, we propose the novel spatial gradient fusion to suppress non-semantic edges. As semantic boundary detection is the dual task of semantic segmentation, we introduce a loss function with boundary consistency constraint to improve the boundary pixel accuracy for semantic segmentation. Our extensive experiments demonstrate superior performance over state-of-the-art works, not only in semantic segmentation but also in semantic boundary detection. In particular, a mean IoU score of 81.8% on Cityscapes test set is achieved without using coarse data or any external data for semantic segmentation. For semantic boundary detection, we improve over previous state-of-the-art works by 9.9% in terms of AP and 6.8% in terms of MF(ODS).
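One generic way to couple the two tasks, sketched below in PyTorch: penalize disagreement between the spatial gradient of the predicted segmentation and the predicted semantic boundary map. This is only an illustration of a boundary-consistency constraint, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def boundary_consistency_loss(seg_logits, boundary_prob):
    """Penalise disagreement between the spatial gradient of the predicted
    segmentation and the predicted semantic boundary map.
    seg_logits: (B,C,H,W); boundary_prob: (B,H,W) in [0,1]."""
    probs = seg_logits.softmax(dim=1)
    dx = (probs[..., :, 1:] - probs[..., :, :-1]).abs().sum(1)   # (B,H,W-1)
    dy = (probs[..., 1:, :] - probs[..., :-1, :]).abs().sum(1)   # (B,H-1,W)
    seg_edge = (dx[:, :-1, :] + dy[:, :, :-1]).clamp(0, 1)       # (B,H-1,W-1)
    return F.l1_loss(seg_edge, boundary_prob[:, :-1, :-1])
```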
@InProceedings{Zhen_2020_CVPR,
  author = {Zhen, Mingmin and Wang, Jinglu and Zhou, Lei and Li, Shiwei and Shen, Tianwei and Shang, Jiaxiang and Fang, Tian and Quan, Long},
  title = {Joint Semantic Segmentation and Boundary Detection Using Iterative Pyramid Contexts},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Attention-Guided Hierarchical Structure Aggregation for Image Matting
Yu Qiao, Yuhao Liu, Xin Yang, Dongsheng Zhou, Mingliang Xu, Qiang Zhang, Xiaopeng Wei


Existing deep learning based matting algorithms primarily resort to high-level semantic features to improve the overall structure of alpha mattes. However, we argue that advanced semantics extracted from CNNs contribute unequally to alpha perception, and that advanced semantic information should be reconciled with low-level appearance cues to refine the foreground details. In this paper, we propose an end-to-end Hierarchical Attention Matting Network (HAttMatting), which can predict better-structured alpha mattes from single RGB images without additional input. Specifically, we employ spatial and channel-wise attention to integrate appearance cues and pyramidal features in a novel fashion. This blended attention mechanism can perceive alpha mattes from refined boundaries and adaptive semantics. We also introduce a hybrid loss function fusing Structural SIMilarity (SSIM), Mean Square Error (MSE) and an adversarial loss to guide the network to further improve the overall foreground structure. In addition, we construct a large-scale image matting dataset comprising 59,600 training images and 1,000 test images (646 distinct foreground alpha mattes in total), which can further improve the robustness of our hierarchical structure aggregation model. Extensive experiments demonstrate that the proposed HAttMatting can capture sophisticated foreground structure and achieve state-of-the-art performance with single RGB images as input.
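A rough PyTorch sketch of a hybrid matting loss combining SSIM, MSE and an adversarial term. The SSIM here is a simplified global variant and all weights are illustrative, so this mirrors the structure described above rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified global SSIM (a windowed SSIM would be the usual choice)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def hybrid_matting_loss(pred_alpha, gt_alpha, disc_logits,
                        w_ssim=1.0, w_mse=1.0, w_adv=0.01):
    l_ssim = 1.0 - ssim_global(pred_alpha, gt_alpha)
    l_mse = F.mse_loss(pred_alpha, gt_alpha)
    l_adv = F.binary_cross_entropy_with_logits(disc_logits,
                                               torch.ones_like(disc_logits))
    return w_ssim * l_ssim + w_mse * l_mse + w_adv * l_adv
```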
[attention, semantics, hierarchical, visual, late, predict, natural, dataset, mechanism, perceive] [advanced, semantic, aggregation, foreground, segmentation, grant, region, boundary, feature, global, employ, aggregate] [input, dim, improve, adversarial, sophisticated, effectively] [low, based, spatial, fusion, traditional, adaptive, proposed, figure, method, dcnn, pixel, color, lssim, quantitative, ieee] [alpha, image, matting, appearance, hattmatting, loss, pyramidal, generate, texture, trimap, trimaps, extracted, corresponding, matte, structural, produce, shared, perform] [basic, network, deep, learning, transition, test, training, achieve, distill, function, similarity, sampling, better] [structure, rgb, ground, truth, single, hybrid, error, square]
@InProceedings{Qiao_2020_CVPR,
  author = {Qiao, Yu and Liu, Yuhao and Yang, Xin and Zhou, Dongsheng and Xu, Mingliang and Zhang, Qiang and Wei, Xiaopeng},
  title = {Attention-Guided Hierarchical Structure Aggregation for Image Matting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MetaFuse: A Pre-trained Fusion Model for Human Pose Estimation
Rongchang Xie, Chunyu Wang, Yizhou Wang


Cross view feature fusion is the key to addressing the occlusion problem in human pose estimation. Current fusion methods need to train a separate model for every pair of cameras, which makes them difficult to scale. In this work, we introduce MetaFuse, a pre-trained fusion model learned from a large number of cameras in the Panoptic dataset. The model can be efficiently adapted or finetuned for a new pair of cameras using a small number of labeled images. The strong adaptation power of MetaFuse is due in large part to the proposed factorization of the original fusion model into two parts: (1) a generic fusion model shared by all cameras, and (2) lightweight camera-dependent transformations. Furthermore, the generic model is learned from many cameras by a meta-learning style algorithm to maximize its adaptation capability to various camera poses. We observe in experiments that MetaFuse finetuned on the public datasets outperforms the state of the art by a large margin, which validates its value in practice.
[dataset, pair, three, outperforms, work, multiple] [heatmap, panoptic, feature, fuse, table, backbone, occlusion] [model, generic, heatmaps, trained, finetuned, customized, testing, university] [fusion, figure, affine, pixel, fused, applying, proposed] [learn, target, adaptation, image, corresponding, train, loss] [base, number, training, learning, large, small, data, task, learned, gradient, performance, meta, adapted, network, total, baseline, weight, note, algorithm, simple, problem, labeled, finetuning, average] [pose, camera, metafuse, view, human, approach, estimation, naivefuse, joint, transformation, jdr, pictorial, error, initial, estimated, left, full, capture, epipolar, directly, ground]
@InProceedings{Xie_2020_CVPR,
  author = {Xie, Rongchang and Wang, Chunyu and Wang, Yizhou},
  title = {MetaFuse: A Pre-trained Fusion Model for Human Pose Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Prior Guided GAN Based Semantic Inpainting
Avisek Lahiri, Arnav Kumar Jain, Sanskar Agrawal, Pabitra Mitra, Prabir Kumar Biswas


Contemporary deep learning based semantic inpainting can be approached from two directions. The first, and more explored, approach is to train an offline deep regression network over the masked pixels with an additional refinement by adversarial training. This approach requires a single feed-forward pass for inpainting at inference. Another promising, yet unexplored, approach is to first train a generative model to map a latent prior distribution to the natural image manifold and, at inference time, search for the best-matching prior to reconstruct the signal. The primary aversion towards the latter genre is due to its inference-time iterative optimization and its difficulty in scaling to higher resolutions. In this paper, going against the general trend, we focus on the second paradigm of inpainting and address both of its mentioned problems. Most importantly, we learn a data-driven parametric network to directly predict a matching prior for a given masked image. This converts an iterative paradigm into a single feed-forward inference pipeline with an around 800X speedup. We also regularize our network with a structural prior (computed from the masked image itself) which helps in better preservation of the pose and size of the object to be inpainted. Moreover, to extend our model to sequence reconstruction, we propose a recurrent-net-based grouped latent prior learning. Finally, we leverage recent advancements in high resolution GAN training to scale our inpainting network to 256X256. Experiments (spanning resolutions from 64X64 to 256X256) conducted on the SVHN, Stanford Cars, CelebA, CelebA-HQ and ImageNet image datasets, and the FaceForensics video dataset, reveal that we consistently improve upon contemporary benchmarks from both schools of approaches.
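A minimal sketch of the single feed-forward idea: an encoder predicts the matching latent prior for the masked image, a pretrained generator decodes it, and the observed pixels are kept. E, G and the blending step are placeholder assumptions, not the paper's modules.

```python
import torch

def inpaint(masked_img, mask, E, G):
    """Single feed-forward inpainting sketch: a lightweight encoder E predicts
    the matching latent prior z for the masked image, a (pre-trained) generator
    G decodes it, and only the missing region is filled in.
    `mask` is 1 for observed pixels and 0 for holes."""
    z = E(masked_img)                     # direct prior prediction, no iterative search
    gen = G(z)                            # image sampled from the learned manifold
    return mask * masked_img + (1 - mask) * gen
```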
[video, temporal, predict, time, dataset, natural, current, gated, lstm] [framework, table, grouped, object, refinement, semantic, paradigm] [noise, model, iterative, facial, face, adversarial, original] [prior, based, resolution, ieee, figure, proposed, high, scale, column, driven] [inpainting, masked, image, gan, structural, loss, generative, unmasked, fid, generator, yeh, train, inpainted, learn, latent, gip, contemporary, manifold] [network, deep, inference, training, imagenet, learning, class, better, neural, compared, random, baseline, set, data, vector, metric, lower, andrew, search, higher] [single, approach, conference, reconstruction, pose, error, computer, matching, compare, initial, keypoint, keypoints, international, additional]
@InProceedings{Lahiri_2020_CVPR,
  author = {Lahiri, Avisek and Jain, Arnav Kumar and Agrawal, Sanskar and Mitra, Pabitra and Biswas, Prabir Kumar},
  title = {Prior Guided GAN Based Semantic Inpainting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Weakly Supervised Semantic Point Cloud Segmentation: Towards 10x Fewer Labels
Xun Xu, Gim Hee Lee


Point cloud analysis has received much attention recently, and segmentation is one of its most important tasks. The success of existing approaches is attributed to deep network design and a large amount of labelled training data, where the latter is assumed to be always available. However, obtaining 3D point cloud segmentation labels is often very costly in practice. In this work, we propose a weakly supervised point cloud segmentation approach which requires only a tiny fraction of points to be labelled in the training stage. This is made possible by learning a gradient approximation and exploiting additional spatial and color smoothness constraints. Experiments are done on three public datasets with different degrees of weak supervision. In particular, our proposed method can produce results that are close to, and sometimes even better than, its fully supervised counterpart with 10X fewer labels.
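A rough PyTorch sketch of a loss with this shape: cross-entropy on the small labelled subset plus a spatial/colour smoothness term over pre-computed nearest neighbours. The tensor names, Gaussian affinities and weighting are illustrative assumptions, not the paper's exact terms.

```python
import torch
import torch.nn.functional as F

def weak_seg_loss(logits, labels, xyz, rgb, labelled_mask, knn_idx, lam=1.0):
    """Cross-entropy on the tiny labelled subset plus a smoothness term that
    encourages nearby, similarly coloured points to share predictions.
    logits: (N,C); labels: (N,); xyz/rgb: (N,3); labelled_mask: (N,) bool;
    knn_idx: (N,K) pre-computed neighbour indices."""
    ce = F.cross_entropy(logits[labelled_mask], labels[labelled_mask])
    probs = logits.softmax(dim=-1)                                    # (N,C)
    neigh = probs[knn_idx]                                            # (N,K,C)
    w_xyz = (-((xyz[knn_idx] - xyz[:, None]) ** 2).sum(-1)).exp()     # spatial affinity
    w_rgb = (-((rgb[knn_idx] - rgb[:, None]) ** 2).sum(-1)).exp()     # colour affinity
    smooth = (w_xyz * w_rgb * ((probs[:, None] - neigh) ** 2).sum(-1)).mean()
    return ce + lam * smooth
```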
[graph, three, dataset, embedding, prediction] [labelled, supervision, segmentation, weakly, weak, semantic, fully, annotation, branch, siamese, feature, miou, lbik, propagation, table, propose] [model, fraction, datasets] [proposed, spatial, color, method, analysis, figure, competitive, convolutional] [supervised, encoder, loss, image, manifold, introduce, unsupervised, train, consistency] [learning, label, training, deep, data, inexact, sample, network, baseline, fixed, inference, fewer, wij, amount, labelling, gradient, observe, total, better, classification, performance, alternative, machine, task, strategy, unlabelled] [point, cloud, additional, incomplete, full, rgb, shapenet, shape, partnet, smoothness, constraint, rotation, dgcnn, approach, indoor, smooth]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Xun and Lee, Gim Hee},
  title = {Weakly Supervised Semantic Point Cloud Segmentation: Towards 10x Fewer Labels},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Physically Realizable Adversarial Examples for LiDAR Object Detection
James Tu, Mengye Ren, Sivabalan Manivasagam, Ming Liang, Bin Yang, Richard Du, Frank Cheng, Raquel Urtasun


Modern autonomous driving systems rely heavily on deep learning models to process point cloud sensory data; meanwhile, deep models have been shown to be susceptible to adversarial attacks with visually imperceptible perturbations. Despite the fact that this poses a security concern for the self-driving industry, there has been very little exploration in terms of 3D perception, as most adversarial attacks have only been applied to 2D flat images. In this paper, we address this issue and present a method to generate universal 3D adversarial objects to fool LiDAR detectors. In particular, we demonstrate that placing an adversarial object on the rooftop of any target vehicle hides the vehicle entirely from LiDAR detectors with a success rate of 80%. We report attack results on a suite of detectors using various input representations of point clouds. We also conduct a pilot study on adversarial defense using data augmentation. This is one step closer to safer self-driving under unseen conditions with limited training data.
[vehicle, driving, work, place, evaluation, relevant, fitness] [box, lidar, object, detection, bounding, table, autonomous, detector, iou, global, score, detected, pointrcnn, propose, confidence, apply, pointpillar] [adversarial, attack, pixor, input, defense, success, black, white, model, rooftop, transferability, physical, robust, roof, realizable, universal, adversary, hide, example, adv] [figure, sensor, method, prior] [target, common, image, generate, representation, learn, generation, latent, loss] [learning, training, data, deep, rate, set, top, random, neural, consider, augmentation, density, classification, sample, close] [point, mesh, cloud, scene, physically, initial, shape, differentiable, compute, vertex, occupancy, voxels, kitti]
@InProceedings{Tu_2020_CVPR,
  author = {Tu, James and Ren, Mengye and Manivasagam, Sivabalan and Liang, Ming and Yang, Bin and Du, Richard and Cheng, Frank and Urtasun, Raquel},
  title = {Physically Realizable Adversarial Examples for LiDAR Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Combating Noisy Labels by Agreement: A Joint Training Method with Co-Regularization
Hongxin Wei, Lei Feng, Xiangyu Chen, Bo An


Deep Learning with noisy labels is a practically challenging problem in weakly-supervised learning. The state-of-the-art approaches "Decoupling" and "Co-teaching+" claim that the "disagreement" strategy is crucial for alleviating the problem of learning with noisy labels. In this paper, we start from a different perspective and propose a robust learning paradigm called JoCoR, which aims to reduce the diversity of two networks during training. Specifically, we first use two networks to make predictions on the same mini-batch data and calculate a joint loss with Co-Regularization for each training example. Then we select small-loss examples to update the parameters of both networks simultaneously. Trained by the joint loss, these two networks become more and more similar due to the effect of Co-Regularization. Extensive experimental results on corrupted data from benchmark datasets including MNIST, CIFAR-10, CIFAR-100 and Clothing1M demonstrate that JoCoR is superior to many state-of-the-art approaches for learning with noisy labels.
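A minimal PyTorch sketch of the training objective as described: per-sample cross-entropy for both networks plus a symmetric KL co-regularization term, keeping only the small-loss fraction of the batch. The lambda and keep ratio are illustrative values.

```python
import torch
import torch.nn.functional as F

def jocor_loss(logits1, logits2, targets, lam=0.65, keep_ratio=0.7):
    """Per-sample cross-entropy of both networks plus a symmetric KL
    co-regularisation term; only the smallest-loss fraction of the batch
    (assumed clean) contributes to the update."""
    ce1 = F.cross_entropy(logits1, targets, reduction='none')
    ce2 = F.cross_entropy(logits2, targets, reduction='none')
    logp1, logp2 = logits1.log_softmax(-1), logits2.log_softmax(-1)
    kl = (F.kl_div(logp1, logp2.exp(), reduction='none').sum(-1) +
          F.kl_div(logp2, logp1.exp(), reduction='none').sum(-1))
    per_sample = (1 - lam) * (ce1 + ce2) + lam * kl
    k = max(1, int(keep_ratio * targets.numel()))
    small_loss, _ = per_sample.topk(k, largest=False)     # small-loss selection
    return small_loss.mean()
```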
[] [table, ablation, achieves, gang, including] [noise, clean, robust, datasets, mnist, conduct, example, trained] [noisy, figure, method, based, ieee, agreement, proposed, pattern, performs] [loss, supervised, peer, train] [learning, jocor, label, training, accuracy, test, epoch, deep, precision, neural, decoupling, better, data, network, standard, rate, classification, processing, select, transition, selection, set, performance, strategy, update, matrix, regularization, best, machine, masashi, reduce, selected, disagreement, unlabeled, arxiv, preprint] [joint, conference, computer, approach, international, error, vision, demonstrate]
@InProceedings{Wei_2020_CVPR,
  author = {Wei, Hongxin and Feng, Lei and Chen, Xiangyu and An, Bo},
  title = {Combating Noisy Labels by Agreement: A Joint Training Method with Co-Regularization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Light-weight Calibrator: A Separable Component for Unsupervised Domain Adaptation
Shaokai Ye, Kailu Wu, Mu Zhou, Yunfei Yang, Sia Huat Tan, Kaidi Xu, Jiebo Song, Chenglong Bao, Kaisheng Ma


Existing domain adaptation methods aim at learning features that can be generalized among domains. These methods commonly require updating the source classifier to adapt to the target domain and do not properly handle the trade-off between the source domain and the target domain. In this work, instead of training a classifier to adapt to the target domain, we use a separable component called a data calibrator to help the fixed source classifier recover discrimination power in the target domain, while preserving the source domain's performance. When the difference between two domains is small, the source classifier's representation is sufficient to perform well in the target domain and outperforms GAN-based methods on digit datasets. Otherwise, the proposed method can leverage synthetic images generated by GANs to boost performance and achieves state-of-the-art performance on digit datasets and driving-scene semantic segmentation. Our method also empirically suggests a potential connection between domain adaptation and adversarial attacks. Code is available at https://github.com/yeshaokai/Calibrator-Domain-Adaptation
[prediction, work, shift, dataset, driving, previous] [level, feature, semantic, propose, segmentation, table] [adversarial, model, deployed, trained, mnist, difference, datasets, testing, lenet] [method, pixel, figure, existing, ieee, proposed, frequency, pattern, high, separable, commonly, degradation] [domain, source, target, calibrator, adaptation, discriminator, unsupervised, representation, loss, learn, train, eat, dpixel, stylized, cycada, trevor, kate, generative, changing, judy, shaokai, real, adapting] [data, classifier, performance, neural, training, learning, arxiv, preprint, deep, adapt, distribution, svhn, better, architecture, achieve, compared, test, update, processing, network] [conference, computer, vision, international, require, scene, calibrated, ground, truth, fit]
@InProceedings{Ye_2020_CVPR,
  author = {Ye, Shaokai and Wu, Kailu and Zhou, Mu and Yang, Yunfei and Tan, Sia Huat and Xu, Kaidi and Song, Jiebo and Bao, Chenglong and Ma, Kaisheng},
  title = {Light-weight Calibrator: A Separable Component for Unsupervised Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition
Canjie Luo, Yuanzhi Zhu, Lianwen Jin, Yongpan Wang


Handwritten text and scene text suffer from various shapes and distorted patterns. Thus, training a robust recognition model requires a large amount of data that covers as much diversity as possible. Compared with data collection and annotation, data augmentation is a low-cost alternative. In this paper, we propose a new method for text image augmentation. Different from traditional augmentation methods such as rotation, scaling and perspective transformation, our proposed augmentation method is designed to learn proper and efficient data augmentation which is more effective and specific for training a robust recognizer. By using a set of custom fiducial points, the proposed augmentation method is flexible and controllable. Furthermore, we bridge the gap between the isolated processes of data augmentation and network optimization by joint learning. An agent network learns from the output of the recognition network and controls the fiducial points to generate more proper training samples for the recognition network. Extensive experiments on various benchmarks, including regular scene text, irregular scene text and handwritten text, show that the proposed augmentation and the joint learning methods significantly boost the performance of the recognition networks. A general toolkit for geometric augmentation is available.
[text, recognition, moving, agent, recognizer, handwritten, state, fiducial, irregular, word, difficulty, afdm, multiple, writing, character, handwriting, icdar, natural, wer, dataset, movement, sequence] [table, module, framework, ablation, including, predicted, propose, boost, challenging] [robust, datasets, adversarial, robustness, study] [method, proposed, figure, convolutional, learnable, based, flexible, designed, rectification, existing] [image, diversity, generate, synthetic, loss, edit, shi, generated] [augmentation, network, training, learning, data, neural, set, augmented, performance, general, deep, random, size, accuracy, randomly, baseline, large, distribution, best, scaling, increase, similarity, optimization] [scene, transformation, distance, perspective, geometric, joint, rigid, cover, measured, radius, deformation, limited]
@InProceedings{Luo_2020_CVPR,
  author = {Luo, Canjie and Zhu, Yuanzhi and Jin, Lianwen and Wang, Yongpan},
  title = {Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Selective Self-Mutual Attention for RGB-D Saliency Detection
Nian Liu, Ni Zhang, Junwei Han


Saliency detection on RGB-D images has been receiving more and more research interest recently. Previous models adopt the early fusion or the result fusion scheme to fuse the input RGB and depth data or their saliency maps, which incurs the problem of distribution gap or information loss. Some other models use the feature fusion scheme but are limited by linear feature fusion methods. In this paper, we propose to fuse attention learned in both modalities. Inspired by the Non-local model, we integrate the self-attention and each other's attention to propagate long-range contextual dependencies, thus incorporating multi-modal information to learn attention and propagate contexts more accurately. Considering the reliability of the other modality's attention, we further propose a selection attention to weight the newly added attention term. We embed the proposed attention module in a two-stream CNN for RGB-D saliency detection. Furthermore, we also propose a residual fusion module to fuse the depth decoder features into the RGB stream. Experimental results on seven benchmark datasets demonstrate the effectiveness of the proposed model components and our final saliency model. Our code and saliency maps are available at https://github.com/nnizhang/S2MA.
[attention, decoder, modality, selective, three, mechanism, work, sof, visual] [saliency, module, salient, detection, object, feature, fuse, adopt, propose, map, semantic, maxf, effectiveness, denseaspp, cnn, fusing, propagate, global, table, lfsd, contextual, background, segmentation] [model, improve, experimental, vgg, input, original, datasets] [fusion, proposed, conv, figure, residual, mae, based, comparison, channel, relu, sma, spatial, fused, convolutional] [image, learn, appearance, corresponding] [network, layer, deep, performance, selection, weight, learning, learned, set, mutual, training, strategy, architecture, design, function, softmax] [depth, rgb, complex, novel, second, dense]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Nian and Zhang, Ni and Han, Junwei},
  title = {Learning Selective Self-Mutual Attention for RGB-D Saliency Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cross-domain Object Detection through Coarse-to-Fine Feature Adaptation
Yangtao Zheng, Di Huang, Songtao Liu, Yunhong Wang


Recent years have witnessed great progress in deep learning based object detection. However, due to the domain shift problem, applying off-the-shelf detectors to an unseen domain leads to significant performance drop. To address such an issue, this paper proposes a novel coarse-to-fine feature adaptation approach to cross-domain object detection. At the coarse-grained stage, different from the rough image-level or instance-level feature alignment used in the literature, foreground regions are extracted by adopting the attention mechanism, and aligned according to their marginal distributions via multi-layer adversarial learning in the common feature space. At the fine-grained stage, we conduct conditional distribution alignment of foregrounds by minimizing the distance of global prototypes with the same category but from different domains. Thanks to this coarse-to-fine feature adaptation, domain knowledge in foreground regions can be effectively transferred. Extensive experiments are carried out in various cross-domain detection scenarios. The results are state-of-the-art, which demonstrate the broad applicability and effectiveness of the proposed approach.
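A simple sketch of the fine-grained stage's idea: pull class-wise foreground prototypes from the two domains together (in practice, the target labels would be predicted or pseudo labels). The function and its arguments are illustrative assumptions.

```python
import torch

def prototype_alignment_loss(src_feats, src_labels, tgt_feats, tgt_labels, num_classes):
    """Pull the class-wise mean (prototype) of foreground features from the two
    domains together, per category."""
    loss, count = 0.0, 0
    for c in range(num_classes):
        s, t = src_feats[src_labels == c], tgt_feats[tgt_labels == c]
        if len(s) == 0 or len(t) == 0:
            continue                      # skip categories missing in either domain
        loss = loss + (s.mean(0) - t.mean(0)).norm()
        count += 1
    return loss / max(count, 1)
```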
[attention, shift, three, dataset, extract, mechanism] [feature, object, detection, semantic, art, module, map, foreground, category, car, global, faster, framework, region, detector, rpn, background, false, backbone, table, propose, roi] [model, adversarial] [method, proposed, based, adaptive, figure, flow, block] [domain, adaptation, source, target, alignment, transfer, psa, unsupervised, loss, image, swda, gpk, aligns, address, person, align, mda, common, discrepancy, supervised] [learning, deep, training, knowledge, set, performance, distribution, network, number, labeled, calculate, achieve, compared, gain, size, reduce] [ground, error, truth, approach, distance]
@InProceedings{Zheng_2020_CVPR,
  author = {Zheng, Yangtao and Huang, Di and Liu, Songtao and Wang, Yunhong},
  title = {Cross-domain Object Detection through Coarse-to-Fine Feature Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Estimating Low-Rank Region Likelihood Maps
Gabriela Csurka, Zoltan Kato, Andor Juhasz, Martin Humenberger


Low-rank regions capture geometrically meaningful structures in an image, encompassing typical local features such as edges and corners, as well as all kinds of regular, symmetric, often repetitive patterns commonly found in man-made environments. While such patterns challenge current state-of-the-art feature correspondence methods, the recovered homography of a low-rank texture readily provides 3D structure with respect to a 3D plane, without any prior knowledge of the visual information on that plane. However, the automatic and efficient detection of the broad class of low-rank regions is unsolved. Herein, we propose a novel self-supervised low-rank region detection deep network that predicts a low-rank likelihood map from an image. The evaluation of our method on real-world datasets shows not only that it reliably predicts low-rank regions in the image similarly to our baseline method, but that, thanks to the data augmentations used in the training phase, it generalizes well to difficult cases (e.g. day/night lighting, low contrast, underexposure) where the baseline prediction fails.
[recognition, order, visual, described, multiple, dataset, step, work, day] [map, region, detection, feature, sliding, table, localization, propose, predicted] [model, case, development] [likelihood, figure, window, pattern, tilt, ieee, repetitive, rectifying, homography, method, segnet, analysis, pixel, called, based, proposed, rectified, output, pli, night, low, reference, lowrank] [image, eli, texture, invariant] [probability, network, matrix, algorithm, set, deep, test, training, optimization, considered, divergence, machine, rank, consider, size, data, problem, distribution, paper] [computer, camera, conference, vision, local, estimation, estimate, pose, sparse, single, matching, plane, planar, error, compute, aachen, relative, intelligence, well, intrinsic]
@InProceedings{Csurka_2020_CVPR,
  author = {Csurka, Gabriela and Kato, Zoltan and Juhasz, Andor and Humenberger, Martin},
  title = {Estimating Low-Rank Region Likelihood Maps},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Neural Head Reenactment with Latent Pose Descriptors
Egor Burkov, Igor Pasechnik, Artur Grigorev, Victor Lempitsky


We propose a neural head reenactment system, which is driven by a latent pose representation and is capable of predicting the foreground segmentation alongside the RGB image. The latent pose representation is learned as a part of the entire reenactment system, and the learning process is based solely on image reconstruction losses. We show that despite its simplicity, with a large and diverse enough training dataset, such learning successfully decomposes pose from identity. The resulting system can then reproduce mimics of the driving person and, furthermore, can perform cross-person reenactment. Additionally, we show that the learned descriptors are useful for other pose-related tasks, such as keypoint prediction and pose-based retrieval.
[video, dataset, embedding, people, embeddings, talking, driving, passed, prediction, work] [head, segmentation, foreground, background, annotation] [identity, reenactment, face, model, facial, adversarial, expression, landmark, trained, quality] [based, figure, method, tensor] [image, person, latent, unsupervised, generator, encoder, disentanglement, loss, representation, learn, adain, target, extracted, transfer, generative, ability, learns, arbitrary, source] [learned, learning, network, vector, better, neural, training, deep, size, smaller, consider, large, requires, set, random, applied, normalization, test] [pose, system, descriptor, reconstruction, keypoint, error, single, approach, keypoints, well, full, human, dense, mlp]
@InProceedings{Burkov_2020_CVPR,
  author = {Burkov, Egor and Pasechnik, Igor and Grigorev, Artur and Lempitsky, Victor},
  title = {Neural Head Reenactment with Latent Pose Descriptors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis
K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar


Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip-reading, we take a different path from existing works. We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings. To this end, we collect and release a large-scale benchmark dataset, the first of its kind, specifically to train and evaluate the single-speaker lip to speech task in natural settings. We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis in such unconstrained scenarios for the first time. Extensive evaluation using quantitative, qualitative metrics and human evaluation shows that our method is four times more intelligible than previous works in this space.
[speech, lip, previous, dataset, speaker, text, natural, vocabulary, intelligibility, video, sequence, evaluation, corpus, decoder, ephrat, current, work, visual, context, wer, stoi, language, forcing, estoi, silent, word, pesq, intelligible, attention, multiple, lpc, melspectrograms] [table, head, benchmark, employ] [model, face, unconstrained, datasets, quality, input, highly, constrained] [ieee, method, figure] [generate, encoder, synthesis, generated, generation, train, learn, unseen, generating] [learning, test, training, neural, report, large, problem, objective, andrew, deep, processing, best, teacher, performance, evaluate, achieve, compared] [approach, accurate, conference, human, grid, international, reconstruction, computer]
@InProceedings{Prajwal_2020_CVPR,
  author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
  title = {Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Self-Supervised Learning of Video-Induced Visual Invariances
Michael Tschannen, Josip Djolonga, Marvin Ritter, Aravindh Mahendran, Neil Houlsby, Sylvain Gelly, Mario Lucic


We propose a general framework for self-supervised learning of transferable visual representations based on Video-Induced Visual Invariances (VIVI). We consider the implicit hierarchy present in the videos and make use of (i) frame-level invariances (e.g. stability to color and contrast perturbations), (ii) shot/clip-level invariances (e.g. robustness to changes in object orientation and lighting conditions), and (iii) video-level invariances (semantic relationships of scenes across shots/clips), to define a holistic self-supervised loss. Training models using different variants of the proposed framework on videos from the YouTube-8M (YT8M) data set, we obtain state-of-the-art self-supervised transfer learning results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB), using only 1000 labels per task. We then show how to co-train our models jointly with labeled images, outperforming an ImageNet-pretrained ResNet-50 by 0.8 points with 10x fewer labeled images, as well as the previous best supervised model by 3.7 points using the full ImageNet data set.
[video, prediction, visual, vtab, frame, natural, temporal, embeddings, downstream, work, order, structured, context, predicting, embedding, specialized] [framework, score, benchmark, table, feature, pooling, object, segmentation] [model, trained, example] [proposed, based, motion, color] [loss, image, unsupervised, representation, supervised, learn, exemplar, train, diverse, transfer, invariant] [learning, data, imagenet, shot, set, ssl, accuracy, labeled, training, deep, task, classification, performance, function, andrew, pretext, baseline, consider, better, induced, augmentation, outperform] [rotation, approach, well, full, rely]
@InProceedings{Tschannen_2020_CVPR,
  author = {Tschannen, Michael and Djolonga, Josip and Ritter, Marvin and Mahendran, Aravindh and Houlsby, Neil and Gelly, Sylvain and Lucic, Mario},
  title = {Self-Supervised Learning of Video-Induced Visual Invariances},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Two-Stage Peer-Regularized Feature Recombination for Arbitrary Image Style Transfer
Jan Svoboda, Asha Anoosheh, Christian Osendorfer, Jonathan Masci


This paper introduces a neural style transfer model to generate a stylized image conditioning on a set of examples describing the desired style. The proposed solution produces high-quality images even in the zero-shot setting and allows for more freedom in changes to the content geometry. This is made possible by introducing a novel Two-Stage Peer-Regularization Layer that recombines style and content in latent space by means of a custom graph convolutional layer. Contrary to the vast majority of existing solutions, our model does not depend on any pre-trained networks for computing perceptual losses and can be trained fully end-to-end thanks to a new set of cyclic losses that operate directly in latent space and not on the RGB images. An extensive ablation study confirms the usefulness of the proposed losses and of the Two-Stage Peer-Regularization Layer, with qualitative results that are competitive with respect to the current state of the art using a single model for all presented styles. This opens the door to more abstract and artistic neural image generation scenarios, along with simpler deployment of the model.
[decoder, graph, order, work, composed, account, attention, conditioning, current, state, visual] [feature, main, module, global, instance, map, ablation, art] [model, input, auxiliary, adversarial, trained, original] [proposed, figure, method, convolutional, introduced, pixel, perceptual, separation, ieee, presented] [style, latent, content, image, transfer, loss, code, arbitrary, target, texture, representation, encoder, nst, recombination, qualitative, separate, stylization, cycle, peer, jan, stylized, generated, component, recombine, recombines, artistic] [training, learning, neural, space, network, metric, layer, normalization, architecture, set, respect, deep, better, requires, optimization] [single, defined, approach, local, reconstruction, allows, well, novel]
@InProceedings{Svoboda_2020_CVPR,
  author = {Svoboda, Jan and Anoosheh, Asha and Osendorfer, Christian and Masci, Jonathan},
  title = {Two-Stage Peer-Regularized Feature Recombination for Arbitrary Image Style Transfer},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MINA: Convex Mixed-Integer Programming for Non-Rigid Shape Alignment
Florian Bernard, Zeeshan Khan Suri, Christian Theobalt


We present a convex mixed-integer programming formulation for non-rigid shape matching. To this end, we propose a novel shape deformation model based on an efficient low-dimensional discrete model, so that finding a globally optimal solution is tractable in (most) practical cases. Our approach combines several favourable properties, namely it is independent of the initialisation, it is much more efficient to solve to global optimality compared to analogous quadratic assignment problem formulations, and it is highly flexible in terms of the variants of matching problems it can handle. Experimentally we demonstrate that our approach outperforms existing methods for sparse shape matching, that it can be used for initialising dense shape matching methods, and we showcase its flexibility on several examples.
[time, work, dataset, order] [global, assignment, matched, propose] [model, impose, highly, suitable] [method, based, figure, range, affine] [control, alignment, transformed] [problem, number, linear, efficient, matrix, binary, optimal, quadratic, learning, programming, finding, requires, deep, find, set, discrete, consider, good, combinatorial, matchings] [shape, matching, convex, point, formulation, deformation, mina, sparse, correspondence, dense, globally, rigid, mesh, geodesic, approach, optimality, match, qap, nonrigid, mip, initialisation, emanuele, solution, triangle, polyhedron, outlier, daniel, cloud, initial, allow, functional, vertex, rodola, michael, acm, demonstrate, transformation, optimisation, local, define, registration, tosca]
@InProceedings{Bernard_2020_CVPR,
  author = {Bernard, Florian and Suri, Zeeshan Khan and Theobalt, Christian},
  title = {MINA: Convex Mixed-Integer Programming for Non-Rigid Shape Alignment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Improving One-Shot NAS by Suppressing the Posterior Fading
Xiang Li, Chen Lin, Chuming Li, Ming Sun, Wei Wu, Junjie Yan, Wanli Ouyang


Neural architecture search (NAS) has demonstrated much success in automatically designing effective neural network architectures. To improve the efficiency of NAS, previous approaches adopt a weight-sharing method that forces all models to share the same set of weights. However, it has been observed that a model performing better with shared weights does not necessarily perform better when trained alone. In this paper, we analyse existing weight-sharing one-shot NAS approaches from a Bayesian point of view and identify the Posterior Fading problem, which compromises the effectiveness of shared weights. To alleviate this problem, we present a novel approach to guide the parameter posterior towards its true distribution. Moreover, a hard latency constraint is introduced during the search so that the desired latency can be achieved. The resulting method, namely Posterior Convergent NAS (PC-NAS), achieves state-of-the-art performance under a standard GPU latency constraint on ImageNet.
[evaluation, previous, time] [table, effectiveness, map, final, wei] [model, shrinking, trained, true] [operator, method, kernel, existing, high, comparison] [shared, train] [search, architecture, space, training, neural, accuracy, pool, palone, large, posterior, supergraph, arxiv, preprint, set, number, parameter, latency, small, network, size, log, algorithm, performance, candidate, validation, mbconv, bayesian, distribution, evaluate, denoted, pshare, quoc, imagenet, mixed, expand, rconv, weight, sharing, better, proxy, problem, learning, sampling, potential, sampled, mixop, efficient, gpu, larger, probability, test, computation, randomly, setting, sample] [partial, approach, single, full, conference, constraint, computer, interval, form]
@InProceedings{Li_2020_CVPR,
  author = {Li, Xiang and Lin, Chen and Li, Chuming and Sun, Ming and Wu, Wei and Yan, Junjie and Ouyang, Wanli},
  title = {Improving One-Shot NAS by Suppressing the Posterior Fading},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Incremental Few-Shot Object Detection
Juan-Manuel Perez-Rua, Xiatian Zhu, Timothy M. Hospedales, Tao Xiang


Existing object detection methods typically rely on the availability of abundant labelled training samples per class and offline model training in a batch mode. These requirements substantially limit their scalability to open-ended accommodation of novel classes with limited labelled training data, both in terms of model accuracy and training efficiency during deployment. We present the first study aiming to go beyond these limitations by considering the Incremental Few-Shot Detection (iFSD) problem setting, where new classes must be registered incrementally (without revisiting base classes) and with few examples. To this end we propose OpeN-ended Centre nEt (ONCE), a detector designed for incrementally learning to detect novel class objects with few examples. This is achieved by an elegant adaptation of the efficient CentreNet detector to the few-shot learning scenario, and meta-learning a class-wise code generator model for registering novel classes. ONCE fully respects the incremental learning paradigm, with novel class registration requiring only a single forward pass of few-shot training samples, and no access to base classes - thus making it suitable for deployment on embedded devices, etc. Extensive experiments conducted on both the standard object detection (COCO, PASCAL VOC) and fashion landmark detection (DeepFashion2) tasks show the feasibility of iFSD for the first time, opening an interesting and very important line of research.
[prediction, visual, exploit] [object, detection, feature, extractor, coco, detector, stage, labelled, table, annotated, bounding, voc, backbone, pascal, challenging] [model, landmark, clothing, evaluated, trained, fashion, access] [method, convolutional, existing, proposed, abundant, designed, net, based] [code, generator, image, train, learn, supervised, requiring, perform, transfer] [class, learning, base, incremental, training, performance, data, set, support, centrenet, ifsd, test, batch, locator, classification, deep, setting, maml, problem, incrementally, large, forgetting, shot, centre, number, learned, architecture, standard, compared, size, sampled, meta, reported, randomly, knowledge, adapted, consider] [novel]
@InProceedings{Perez-Rua_2020_CVPR,
  author = {Perez-Rua, Juan-Manuel and Zhu, Xiatian and Hospedales, Timothy M. and Xiang, Tao},
  title = {Incremental Few-Shot Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Synthetic Learning: Learn From Distributed Asynchronized Discriminator GAN Without Sharing Medical Image Data
Qi Chang, Hui Qu, Yikai Zhang, Mert Sabuncu, Chao Chen, Tong Zhang, Dimitris N. Metaxas


In this paper, we propose a data privacy-preserving and communication-efficient distributed GAN learning framework named Distributed Asynchronized Discriminator GAN (AsynDGAN). Our proposed framework trains a central generator that learns from distributed discriminators, and uses only the generated synthetic images to train the segmentation model. We validate the proposed framework on the problem of learning across health entities, which is known to be privacy sensitive. Our experiments show that our approach: 1) can learn the real image distribution from multiple datasets without sharing the patients' raw data; 2) is more efficient and requires lower bandwidth than other distributed deep learning methods; 3) achieves higher performance than a model trained on one real dataset, and almost the same performance as a model trained on all real datasets; 4) has provable guarantees that the generator can learn the distributed distribution and is thus unbiased. We release our AsynDGAN source code at: https://github.com/tommy-qichang/AsynDGAN
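A toy PyTorch sketch of the distributed setup described above: one central generator, one discriminator per data-holding node, and only synthetic images crossing the node boundary. The network sizes, optimizers and training round are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

# Toy sketch: one central generator G, one discriminator per data-holding node.
# Only synthetic images (and the generator's gradients) cross the node boundary.
z_dim = 64
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
Ds = [nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
      for _ in range(3)]
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_ds = [torch.optim.Adam(D.parameters(), lr=2e-4) for D in Ds]
bce = nn.BCEWithLogitsLoss()

def train_round(local_batches):
    """`local_batches[i]` is a batch of real images held privately by node i."""
    g_loss = 0.0
    for D, opt_d, real in zip(Ds, opt_ds, local_batches):
        fake = G(torch.randn(real.size(0), z_dim))
        # discriminator update happens at the node, on its own private data
        d_loss = (bce(D(real), torch.ones(real.size(0), 1)) +
                  bce(D(fake.detach()), torch.zeros(real.size(0), 1)))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # the central generator accumulates adversarial feedback from every node
        g_loss = g_loss + bce(D(fake), torch.ones(real.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```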
[dataset, regular, multiple, communication, node, entity] [segmentation, tumor, framework, achieves, table, apply] [adversarial, trained, privacy, model, central, datasets, access, auxiliary, input, dimitris] [medical, figure, proposed, ieee, brain, method, raw, dice, convolutional, pattern] [asyndgan, synthetic, real, image, generator, discriminator, gan, loss, learn, generative, train, health, learns, conditional, generated, federated, synthesis, organ, asynchronized, fake, variable, generate] [learning, data, training, distributed, distribution, deep, arxiv, preprint, test, performance, set, machine, network, neural, size, architecture, batch, function, subset, sharing, compared, algorithm, process, update, number] [conference, local, computer, international, vision, cost, distance]
@InProceedings{Chang_2020_CVPR,
  author = {Chang, Qi and Qu, Hui and Zhang, Yikai and Sabuncu, Mert and Chen, Chao and Zhang, Tong and Metaxas, Dimitris N.},
  title = {Synthetic Learning: Learn From Distributed Asynchronized Discriminator GAN Without Sharing Medical Image Data},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Exploring Category-Agnostic Clusters for Open-Set Domain Adaptation
Yingwei Pan, Ting Yao, Yehao Li, Chong-Wah Ngo, Tao Mei


Unsupervised domain adaptation has received significant attention in recent years. Most existing works tackle the closed-set scenario, assuming that the source and target domains share exactly the same categories. In practice, nevertheless, a target domain often contains samples of classes unseen in source domain (i.e., unknown class). The extension of domain adaptation from closed-set to such open-set situation is not trivial since the target samples in unknown class are not expected to align with the source. In this paper, we address this problem by augmenting the state-of-the-art domain adaptation technique, Self-Ensembling, with category-agnostic clusters in target domain. Specifically, we present Self-Ensembling with Category-agnostic Clusters (SE-CC) --- a novel architecture that steers domain adaptation with the additional guidance of category-agnostic clusters that are specific to target domain. This clustering information provides domain-specific visual cues, facilitating the generalization of Self-Ensembling for both closed-set and open-set scenarios. Technically, clustering is first performed over all the unlabeled target samples to obtain the category-agnostic clusters, which reveal the underlying data space structure peculiar to target domain. A clustering branch is capitalized on to ensure that the learnt representation preserves such underlying structure by matching the estimated assignment distribution over clusters to the inherent cluster distribution for each target sample. Furthermore, SE-CC enhances the learnt representation with mutual information maximization. Extensive experiments are conducted on Office and VisDA datasets for both open-set and closed-set domain adaptation, and superior results are reported when comparing to the state-of-the-art approaches.
[three, yingwei, visual] [feature, assignment, global, inherent, branch, table, map, module] [input, model, mim] [output, figure, method, comparison, convolutional, cnns] [target, domain, cluster, adaptation, unknown, source, loss, learnt, unsupervised, underlying, visda, office, discriminator, pclu, representation, preserve, xst, categoryagnostic, conditional, image, pscls, rtn, ting, learn, transfer, real, tao, align, discriminative, enforced, pcls] [mutual, clustering, distribution, classification, student, data, sample, performance, set, class, unlabeled, learning, accuracy, entropy, classifier, training, teacher, probability, deep, maximization, better, network, design, parameter, softmax] [structure, local, estimated, additional, additionally, estimation]
@InProceedings{Pan_2020_CVPR,
  author = {Pan, Yingwei and Yao, Ting and Li, Yehao and Ngo, Chong-Wah and Mei, Tao},
  title = {Exploring Category-Agnostic Clusters for Open-Set Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Regularizing Class-Wise Predictions via Self-Knowledge Distillation
Sukmin Yun, Jongjin Park, Kimin Lee, Jinwoo Shin


Deep neural networks with millions of parameters may suffer from poor generalization due to overfitting. To mitigate the issue, we propose a new regularization method that penalizes the predictive distribution between similar samples. In particular, we distill the predictive distribution between different samples of the same label during training. This results in regularizing the dark knowledge (i.e., the knowledge on wrong predictions) of a single network (i.e., a self-knowledge distillation) by forcing it to produce more meaningful and consistent predictions in a class-wise manner. Consequently, it mitigates overconfident predictions and reduces intra-class variations. Our experimental results on various image classification tasks demonstrate that the simple yet powerful method can significantly improve not only the generalization ability but also the calibration performance of modern convolutional neural networks.
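A minimal sketch of the class-wise regularization term described above, assuming pairs of samples that share a label are drawn in each mini-batch; the temperature value is an illustrative choice, not necessarily the paper's.

import torch.nn.functional as F

def class_wise_self_distillation(logits, logits_paired, temperature=4.0):
    # logits: predictions for a batch; logits_paired: predictions for different
    # samples of the same classes. The paired branch is detached so it serves
    # as a soft target, matching predictive distributions within a class.
    soft_target = F.softmax(logits_paired.detach() / temperature, dim=1)
    log_probs = F.log_softmax(logits / temperature, dim=1)
    return F.kl_div(log_probs, soft_target, reduction="batchmean") * temperature ** 2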
[dataset, hierarchical, dog, three] [table, confidence, improves, feature] [model, trained, dnns, face, remark, generalization, original, improve] [method, proposed, figure, dark, convolutional, output, prior] [loss, image, bird, consistency, produce, meaningful] [regularization, knowledge, classification, learning, deep, neural, predictive, distillation, label, training, rate, accuracy, adacos, softmax, standard, misclassified, class, mixup, network, size, data, augmentation, entropy, preact, teacher, stanford, best, sample, random, distribution, performance, batch, evaluate, measure, ece, geoffrey, investigated, imagenet, report, optimal, tinyimagenet, indicated] [error, calibration, single]
@InProceedings{Yun_2020_CVPR,
  author = {Yun, Sukmin and Park, Jongjin and Lee, Kimin and Shin, Jinwoo},
  title = {Regularizing Class-Wise Predictions via Self-Knowledge Distillation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hierarchical Graph Attention Network for Visual Relationship Detection
Li Mi, Zhenzhong Chen


Visual Relationship Detection (VRD) aims to describe the relationship between two objects by providing a structural triplet of the form <subject, predicate, object>. Existing graph-based methods mainly represent the relationships by an object-level graph, which fails to model triplet-level dependencies. In this work, a Hierarchical Graph Attention Network (HGAT) is proposed to capture the dependencies at both the object level and the triplet level. The object-level graph aims to capture the interactions between objects, while the triplet-level graph models the dependencies among relation triplets. In addition, prior knowledge and an attention mechanism are introduced to fix the redundant or missing edges on graphs that are constructed according to spatial correlation. With these approaches, nodes are allowed to attend over their spatial and semantic neighborhoods' features based on the visual or semantic feature correlation. Experimental results on the well-known VG and VRD datasets demonstrate that our model significantly outperforms the state-of-the-art methods.
[graph, relationship, attention, visual, predicate, reasoning, girl, mechanism, vrd, woman, hierarchical, prediction, beach, constructed, previous, hgat, truck, ocean, relation, context, nmp, represent, work, water, boat, dataset, kite, swimsuit, pair, bench, embedding, attend, node, cddn, allowed, question, explicitly, language] [semantic, feature, object, detection, table, module, edge, bounding, correlation, redundant, denotes, sofa, framework] [model] [spatial, based, proposed, prior, figure, method, introduced, utilized] [hair, image, representation, person, consistency, missing] [network, knowledge, learning, set, triplet, neural, deep, denote, performance, baseline, pairwise] [capture, scene, structure, defined]
@InProceedings{Mi_2020_CVPR,
  author = {Mi, Li and Chen, Zhenzhong},
  title = {Hierarchical Graph Attention Network for Visual Relationship Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
M2m: Imbalanced Classification via Major-to-Minor Translation
Jaehyung Kim, Jongheon Jeong, Jinwoo Shin


In most real-world scenarios, labeled training datasets are highly class-imbalanced, where deep neural networks suffer from generalizing to a balanced testing criterion. In this paper, we explore a novel yet simple way to alleviate this issue by augmenting less-frequent classes via translating samples (e.g., images) from more-frequent classes. This simple approach enables a classifier to learn more generalizable features of minority classes, by transferring and leveraging the diversity of the majority information. Our experimental results on a variety of class-imbalanced datasets show that the proposed method improves the generalization on minority classes significantly compared to other existing re-sampling or re-weighting methods. The performance of our method even surpasses those of previous state-of-the-art methods for the imbalanced classification.
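A rough sketch of the majority-to-minority translation idea, assuming a pretrained auxiliary classifier g: a majority sample is perturbed by gradient steps so that g assigns it to the target minority class while confidence on the original label is penalized; step count, step size and the penalty weight are illustrative, not the paper's exact settings.

import torch
import torch.nn.functional as F

def translate_major_to_minor(x_major, y_major, minority_class, g, steps=10, step_size=0.1, lam=0.5):
    x = x_major.clone().requires_grad_(True)
    y_target = torch.full_like(y_major, minority_class)
    for _ in range(steps):
        logits = g(x)
        # push toward the minority class, discourage the original majority label
        p_orig = F.softmax(logits, dim=1).gather(1, y_major.unsqueeze(1)).mean()
        loss = F.cross_entropy(logits, y_target) + lam * p_orig
        grad, = torch.autograd.grad(loss, x)
        x = (x - step_size * grad.sign()).detach().requires_grad_(True)
    return x.detach()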
[dataset, recognition] [table, seed, effectiveness, ablation, kaiming, improves] [datasets, adversarial, original, effective, generalization, case, trained, improve, model] [method, ieee, pattern, figure, proposed, based, comparison, existing] [synthetic, loss, generation, translation, target, diversity, perform, train, attempt, translated] [minority, class, training, sample, imbalanced, learning, majority, erm, neural, balanced, classifier, classification, number, bacc, standard, performance, test, deep, data, dbal, reuters, consider, imbalance, objective, accuracy, sampling, twitter, ldam, compared, set, ratio, simple, smote, size, network, baseline, distribution, problem] [conference, computer, vision, international, rejection, well, supplementary]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Jaehyung and Jeong, Jongheon and Shin, Jinwoo},
  title = {M2m: Imbalanced Classification via Major-to-Minor Translation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CenterMask: Real-Time Anchor-Free Instance Segmentation
Youngwan Lee, Jongyoul Park


We propose a simple yet efficient anchor-free instance segmentation method, called CenterMask, that adds a novel spatial attention-guided mask (SAG-Mask) branch to the anchor-free one-stage object detector FCOS, in the same vein as Mask R-CNN. Plugged into the FCOS object detector, the SAG-Mask branch predicts a segmentation mask on each box with a spatial attention map that helps to focus on informative pixels and suppress noise. We also present an improved backbone network, VoVNetV2, with two effective strategies: (1) residual connections for alleviating the optimization problem of larger VoVNet, and (2) effective Squeeze-Excitation (eSE) dealing with the channel information loss problem of the original SE. With SAG-Mask and VoVNetV2, we design CenterMask and CenterMask-Lite, targeted at large and small models, respectively. Using the same ResNet-101-FPN backbone, CenterMask achieves 38.3% mask AP, surpassing all previous state-of-the-art methods while running at a much faster speed. CenterMask-Lite also outperforms the state-of-the-art by large margins at over 35 fps on a Titan Xp. We hope that CenterMask and VoVNetV2 can serve as a solid baseline for real-time instance segmentation and a backbone network for various vision tasks, respectively. The code is available at https://github.com/youngwanLEE/CenterMask.
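A sketch of the effective Squeeze-Excitation idea mentioned above: unlike standard SE, the channel dimension is not reduced, so a single fully-connected (1x1 convolution) layer produces the channel weights; the sigmoid activation here is an illustrative simplification.

import torch
import torch.nn as nn

class EffectiveSE(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # one 1x1 conv keeps all channels, avoiding the reduce-then-expand bottleneck
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        w = x.mean(dim=(2, 3), keepdim=True)   # global average pooling
        w = torch.sigmoid(self.fc(w))          # per-channel attention weights
        return x * w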
[attention, time, outperforms, speed, titan] [mask, feature, centermask, object, fcos, backbone, roi, box, instance, apmask, detection, module, segmentation, apbox, map, table, osa, branch, vovnet, faster, scoring, ross, detector, level, head, assignment, improves, kaiming, yolact, achieves, anchor, predicted, piotr, coco, propose, focus, ese, denotes, stage, fpn, sigmoid, pooling] [input, effective] [spatial, channel, proposed, residual, figure, conv, based, scale, convolutional, lightweight, sam] [loss] [performance, equation, large, network, accuracy, layer, efficient, connection, small, note, function, dimension, average, improved, inference, compared] [predicts]
@InProceedings{Lee_2020_CVPR,
  author = {Lee, Youngwan and Park, Jongyoul},
  title = {CenterMask: Real-Time Anchor-Free Instance Segmentation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Multi-Path Learning for Object Pose Estimation Across Domains
Martin Sundermeyer, Maximilian Durner, En Yen Puang, Zoltan-Csaba Marton, Narunas Vaskevicius, Kai O. Arras, Rudolph Triebel


We introduce a scalable approach for object pose estimation trained on simulated RGB views of multiple 3D models together. We learn an encoding of object views that does not only describe an implicit orientation of all objects seen during training, but can also relate views of untrained objects. Our single-encoder-multi-decoder network is trained using a technique we denote "multi-path learning": While the encoder is shared by all objects, each decoder only reconstructs views of a single object. Consequently, views of different instances do not have to be separated in the latent space and can share common features. The resulting encoder generalizes well from synthetic to real data and across various instances, categories, model types and datasets. We systematically investigate the learned encodings, their generalization, and iterative refinement strategies on the ModelNet40 and T-LESS dataset. Despite training jointly on multiple objects, our 6D Object Detection pipeline achieves state-of-the-art results on T-LESS at much lower runtimes than competing approaches.
[multiple, encoding, decoder, dataset] [object, refinement, detection, feature, instance, table, category, car, refined, challenge, maskrcnn] [trained, model, iterative, generalization] [ieee, method, pattern, figure] [encoder, real, target, latent, train, generalize, synthetic, domain, init, learn, code, shared] [training, test, data, performance, arxiv, preprint, learning, network, space, large, random, set, deep, learned, number, augmented, metric] [pose, untrained, estimation, computer, conference, vision, single, orientation, novel, relative, rotation, codebook, approach, well, matching, view, european, international, rgb, depth, aae, implicit, pipeline, codebooks, shape, full, ground, truth, deepim]
@InProceedings{Sundermeyer_2020_CVPR,
  author = {Sundermeyer, Martin and Durner, Maximilian and Puang, En Yen and Marton, Zoltan-Csaba and Vaskevicius, Narunas and Arras, Kai O. and Triebel, Rudolph},
  title = {Multi-Path Learning for Object Pose Estimation Across Domains},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Incremental Learning in Online Scenario
Jiangpeng He, Runyu Mao, Zeman Shao, Fengqing Zhu


Modern deep learning approaches have achieved great success in many vision applications by training a model using all available task-specific data. However, there are two major obstacles making it challenging to implement for real-life applications: (1) Learning new classes makes the trained model quickly forget old-class knowledge, which is referred to as catastrophic forgetting. (2) As new observations of old classes come sequentially over time, the distribution may change in unforeseen ways, making the performance degrade dramatically on future data, which is referred to as concept drift. Current state-of-the-art incremental learning methods require a long time to train the model whenever new classes are added and none of them takes into consideration the new observations of old classes. In this paper, we propose an incremental learning framework that can work in the challenging online learning scenario and handle both new-class data and new observations of old classes. We address problem (1) in online mode by introducing a modified cross-distillation loss together with a two-step learning technique. Our method outperforms the results obtained from current state-of-the-art offline incremental learning methods on the CIFAR-100 and ImageNet-1000 (ILSVRC 2012) datasets under the same experiment protocol but in an online scenario. We also provide a simple yet effective method to mitigate problem (2) by updating the exemplar set using the feature of each new observation of old classes and demonstrate a real-life application of online food image classification based on our complete framework using the Food-101 dataset.
[current, step, described, future, previous, time, work, dataset, observation] [framework, challenging, including, propose, feature] [model, offline, concept, trained, scenario, technique, protocol, change] [block, method, figure, proposed, output, based, ieee, pattern, achieved] [loss, exemplar, real, modified, image, learn, representation, food] [learning, data, incremental, online, update, accuracy, set, class, training, life, knowledge, deep, performance, size, classifier, baseline, test, number, drift, distribution, catastrophic, updating, retain, experiment, classification, forgetting, lifelong, compared, network, neural, consider, problem, applied, large, accommodation, machine, ratio, upper, bound, higher, sequentially, simple, mitigate] [conference, computer, complete, vision, handle]
@InProceedings{He_2020_CVPR,
  author = {He, Jiangpeng and Mao, Runyu and Shao, Zeman and Zhu, Fengqing},
  title = {Incremental Learning in Online Scenario},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Enhanced Transport Distance for Unsupervised Domain Adaptation
Mengxue Li, Yi-Ming Zhai, You-Wei Luo, Peng-Fei Ge, Chuan-Xian Ren


Unsupervised domain adaptation (UDA) is a representative problem in transfer learning, which aims to improve the classification performance on an unlabeled target domain by exploiting discriminant information from a labeled source domain. The optimal transport model has been used for UDA in the perspective of distribution matching. However, the transport distance cannot reflect the discriminant information from either domain knowledge or category prior. In this work, we propose an enhanced transport distance (ETD) for UDA. This method builds an attention-aware transport distance, which can be viewed as the prediction feedback of the iteratively learned classifier, to measure the domain discrepancy. Further, the Kantorovich potential variable is re-parameterized by deep neural networks to learn the distribution in the latent space. The entropy-based regularization is developed to explore the intrinsic structure of the target domain. The proposed method is optimized alternately in an end-to-end manner. Extensive experiments are conducted on four benchmark datasets to demonstrate the SOTA performance of ETD.
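For reference, the entropy-regularized transport distance that ETD builds on can be computed with plain Sinkhorn iterations as sketched below (uniform marginals assumed); the paper's attention-based re-weighting of the cost and the network-parameterized Kantorovich potential are not reproduced here.

import torch

def sinkhorn_distance(cost, epsilon=0.1, iters=50):
    # cost: (n, m) pairwise cost between source and target samples
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)
    nu = torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / epsilon)
    u = torch.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)   # entropic optimal transport plan
    return (plan * cost).sum()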
[attention, three, recognition, explore, connected, dataset] [feature, table, fully, propose] [model, adversarial, discriminant, datasets] [method, based, figure, dual, adaptive, ieee, viewed, proposed, formulated, analysis, enhanced] [domain, transport, target, adaptation, source, loss, kantorovich, unsupervised, etd, learn, plan, uda, variable, image, discriminative, consists, reweighed, shared, discrepancy, minimizing, translation, transfer, digit] [network, optimal, accuracy, training, problem, deep, data, learning, optimization, parameter, potential, distribution, task, entropy, update, matrix, compared, algorithm, learned, average, labeled, neural, expected, function, performance, measure, machine, metric, sample, set, unlabeled, regularization] [distance, structure, error, joint, michael]
@InProceedings{Li_2020_CVPR,
  author = {Li, Mengxue and Zhai, Yi-Ming and Luo, You-Wei and Ge, Peng-Fei and Ren, Chuan-Xian},
  title = {Enhanced Transport Distance for Unsupervised Domain Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
TESA: Tensor Element Self-Attention via Matricization
Francesca Babiloni, Ioannis Marras, Gregory Slabaugh, Stefanos Zafeiriou


Representation learning is a fundamental part of modern computer vision, where abstract representations of data are encoded as tensors optimized to solve problems like image segmentation and inpainting. Recently, self-attention in the form of Non-Local Block has emerged as a powerful technique to enrich features, by capturing complex interdependencies in feature tensors. However, standard self-attention approaches leverage only spatial relationships, drawing similarities between vectors and overlooking correlations between channels. In this paper, we introduce a new method, called Tensor Element Self-Attention (TESA) that generalizes such work to capture interdependencies along all dimensions of the tensor using matricization. An order R tensor produces R results, one for each dimension. The results are then fused to produce an enriched output which encapsulates similarity among tensor elements. Additionally, we analyze self-attention mathematically, providing new perspectives on how it adjusts the singular values of the input feature tensor. With these new insights, we present experimental results demonstrating how TESA can benefit diverse problems including classification and instance segmentation. By simply adding a TESA module to existing networks, we substantially improve competitive baselines and set new state-of-the-art results for image inpainting on Celeb and low light raw-to-rgb image translation on SID.
[attention, selfattention, element, three, visual, mechanism, contribution, dataset, described, enrich, order] [segmentation, feature, instance, global, table, module] [input, case, original, trained] [block, tensor, ieee, output, method, spatial, pattern, tesa, matricization, figure, spectrum, convolutional, proposed, channel, capturing, learnable, achieved, raw, sid, version, exposure] [image, mode, inpainting, representation, row, qualitative] [singular, neural, network, learning, matrix, classification, deep, architecture, processing, machine, training, equation, compared, function, imagenet, data, performance, capacity, baseline] [computer, conference, vision, international, capture, single, rgb, complex, leverage]
@InProceedings{Babiloni_2020_CVPR,
  author = {Babiloni, Francesca and Marras, Ioannis and Slabaugh, Gregory and Zafeiriou, Stefanos},
  title = {TESA: Tensor Element Self-Attention via Matricization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Training a Steerable CNN for Guidewire Detection
Donghang Li, Adrian Barbu


Guidewires are thin wires used in coronary angioplasty to guide different tools to access and repair the obstructed artery. The whole procedure is monitored using fluoroscopic (real-time X-ray) images. Due to the guidewire being thin in the low-quality fluoroscopic images, it is usually poorly visible. The poor quality of the X-ray images makes guidewire detection a challenging problem in image-guided interventions. Localizing the guidewire could help in enhancing its visibility and in other automatic procedures. Guidewire localization methods usually contain a first step of computing a pixelwise guidewire response map on the entire image. In this paper, we present a steerable Convolutional Neural Network (CNN), which is a Fully Convolutional Neural Network (FCNN) that can detect objects rotated by an arbitrary 2D angle, without being rotation invariant. In fact, the steerable CNN has an angle parameter that can be changed to make it sensitive to objects rotated by that angle. We present an application of this idea to detecting the guidewire pixels, and compare it with an FCNN trained to be invariant to the guidewire orientation. Results reveal that the proposed method is a good choice, outperforming some popular filter-based and learning-based approaches such as the Frangi Filter, the Spherical Quadrature Filter, an FCNN and a state-of-the-art trained classifier based on hand-crafted features.
[frame] [cnn, detection, response, positive, detect, tracking, table, localization, center, annotation, vessel, guide, map] [trained, sensitive, model, input, pixelwise, detecting] [figure, convolutional, method, haar, based, range, ieee, proposed, introduced, convolution, patch] [loss, image, arbitrary, aligned, extracted, train, invariant] [steerable, guidewire, training, filter, sqf, rank, layer, frangi, test, fluoroscopic, learning, network, fcnn, set, size, rate, neural, quadrature, classifier, number, pbt, entire, data, steered, lorenz, cauchy, average, guidewires, parameter, performance] [angle, rotation, basis, rotated, spherical, thin, distance, orientation, focal, equivariant, sin, conference, adrian]
@InProceedings{Li_2020_CVPR,
  author = {Li, Donghang and Barbu, Adrian},
  title = {Training a Steerable CNN for Guidewire Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Superpixel Segmentation With Fully Convolutional Networks
Fengting Yang, Qian Sun, Hailin Jin, Zihan Zhou


In computer vision, superpixels have been widely used as an effective way to reduce the number of image primitives for subsequent processing. But only a few attempts have been made to incorporate them into deep neural networks. One main reason is that the standard convolution operation is defined on regular grids and becomes inefficient when applied to superpixels. Inspired by an initialization strategy commonly adopted by traditional superpixel algorithms, we present a novel method that employs a simple fully convolutional network to predict superpixels on a regular image grid. Experimental results on benchmark datasets show that our method achieves state-of-the-art superpixel segmentation performance while running at about 50fps. Based on the predicted superpixels, we further develop a downsampling/upsampling scheme for deep networks with the goal of generating high-resolution outputs for dense prediction tasks. Specifically, we modify a popular network architecture for stereo matching to simultaneously predict superpixels and disparities. We show that improved disparity estimation accuracy can be obtained on public datasets.
[regular, predict, dataset, work, lsc, downstream] [superpixel, superpixels, segmentation, ssn, slic, seal, object, association, predicted, boundary, feature, final, affinity, cnn, benchmark, propose, module, propagation, map, fully, main] [input, model, original, choose] [method, disparity, figure, pixel, convolutional, psmnet, convolution, spatial, sceneflow, snic, based, proposed, etps, traditional, upsampling, existing] [image, train, generate, loss, perform, learn, fine] [network, deep, learning, size, better, number, neural, performance, set, test, clustering, standard, training, task, simple, scheme, accuracy, design, consider, note, soft, popular] [stereo, matching, grid, joint, computer, compute, volume, directly, vision, initial]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Fengting and Sun, Qian and Jin, Hailin and Zhou, Zihan},
  title = {Superpixel Segmentation With Fully Convolutional Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SharinGAN: Combining Synthetic and Real Data for Unsupervised Geometry Estimation
Koutilya PNVR, Hao Zhou, David Jacobs


We propose a novel method for combining synthetic and real images when training networks to determine geometric information from a single image. We suggest a method for mapping both image types into a single, shared domain. This is connected to a primary network for end-to-end training. Ideally, this results in images from two domains that present shared information to the primary network. Our experiments demonstrate significant improvements over the state-of-the-art in two important domains, surface normal estimation of human faces and monocular depth estimation for outdoor scenes, both in an unsupervised setting.
[dataset, prediction, combining] [map, table, apply, supervision, propose, predicted, split, module, semantic] [primary, face, trained, adversarial, model, input, help, datasets] [proposed, method, figure, ieee, convolutional, existing] [synthetic, real, domain, image, loss, sharingan, unsupervised, shared, train, generative, specific, mapping, gasda, gap, xsh, adaptation, supervised, row, generator, corresponding, learn, translated] [network, task, data, learning, better, training, deep, test, performance, reduce, compared, neural, problem, set] [depth, monocular, estimation, normal, ground, geometry, truth, reconstruction, kitti, single, david, virtual, error, sfsnet, rel, geometric, demonstrate, surface, computer, left, photoface, rmse, mde, eigen, defined]
@InProceedings{PNVR_2020_CVPR,
  author = {PNVR, Koutilya and Zhou, Hao and Jacobs, David},
  title = {SharinGAN: Combining Synthetic and Real Data for Unsupervised Geometry Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Label Distribution Learning on Auxiliary Label Space Graphs for Facial Expression Recognition
Shikai Chen, Jianfeng Wang, Yuedong Chen, Zhongchao Shi, Xin Geng, Yong Rui


Many existing studies reveal that annotation inconsistency widely exists among a variety of facial expression recognition (FER) datasets. The reason might be the subjectivity of human annotators and the ambiguous nature of the expression labels. One promising strategy tackling such a problem is a recently proposed learning paradigm called Label Distribution Learning (LDL), which allows multiple labels with different intensities to be linked to one expression. However, it is often impractical to directly apply label distribution learning because numerous existing datasets only contain one-hot labels rather than label distributions. To solve the problem, we propose a novel approach named Label Distribution Learning on Auxiliary Label Space Graphs (LDL-ALSG) that leverages the topological information of the labels from related but more distinct tasks, such as action unit recognition and facial landmark detection. The underlying assumption is that facial images should have similar expression distributions to their neighbours in the label space of action unit recognition and facial landmark detection. Our proposed method is evaluated on a variety of datasets and consistently outperforms state-of-the-art methods by a large margin.
[recognition, action, unit, emotion, three, node] [annotation, backbone, framework, feature, annotated, guide, detection, table] [facial, expression, auxiliary, datasets, topological, noise, model, affectnet, neutral, face, central, raf, inconsistency, landmark, trained, wild, mmi, inconsistent, evaluated, logical, lab] [method, proposed, ieee, based, xin, guidance, noisy, pattern, enhancement, figure, assumption] [image, loss, address] [label, learning, training, space, distribution, network, data, baseline, deep, set, number, average, task, accuracy, test, basic, performance, function, arxiv, preprint, problem, neural, classification, mixture] [conference, neighbor, computer, vision, nearest, international, local, ambiguity, relative]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Shikai and Wang, Jianfeng and Chen, Yuedong and Shi, Zhongchao and Geng, Xin and Rui, Yong},
  title = {Label Distribution Learning on Auxiliary Label Space Graphs for Facial Expression Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Residual Flow for Out of Distribution Detection
Ev Zisselman, Aviv Tamar


The effective application of neural networks in the real world relies on proficiently detecting out-of-distribution examples. Contemporary methods seek to model the distribution of feature activations in the training data for adequately distinguishing abnormalities, and the state-of-the-art method uses Gaussian distribution models. In this work, we present a novel approach that improves upon the state-of-the-art by leveraging an expressive density model based on normalizing flows. We introduce the residual flow, a novel flow architecture that learns the residual distribution from a base Gaussian distribution. Our model is general, and can be applied to any data that is approximately Gaussian. For out-of-distribution detection in image datasets, our approach provides a principled improvement over the state-of-the-art. Specifically, we demonstrate the effectiveness of our method in ResNet and DenseNet architectures trained on various image datasets. For example, on a ResNet trained on CIFAR-100 and evaluated on detection of out-of-distribution samples from the ImageNet dataset, holding the true positive rate (TPR) at 95%, we improve the true negative rate (TNR) from 56.7% (current state-of-the-art) to 77.5% (ours).
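For context, the class-conditional Gaussian (Mahalanobis-style) score that the residual flow improves upon looks roughly like the sketch below; the flow-based density itself is not reproduced, and the shared-covariance assumption mirrors the cited baseline rather than this paper's final model.

import numpy as np

def gaussian_ood_score(feature, class_means, shared_cov):
    # Score a penultimate-layer feature by its negative squared Mahalanobis
    # distance to the closest class-conditional Gaussian; higher means more
    # in-distribution. The paper replaces this density with a residual flow.
    cov_inv = np.linalg.inv(shared_cov)
    scores = [-float((feature - mu) @ cov_inv @ (feature - mu)) for mu in class_means]
    return max(scores)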
[work, described, modeling, dataset] [detection, feature, score, confidence, propose, resnet, improvement] [model, input, trained, detecting, curve, adversarial, roc, effective, true, improve] [flow, residual, gaussian, method, proposed, lee, likelihood, figure, based, output, pattern, comparison, ieee] [lsun, image, generative] [neural, linear, distribution, training, data, ood, deep, network, mahalanobis, layer, performance, learning, set, density, svhn, class, validation, classification, test, tinyimagenet, normalizing, covariance, permutation, expressive, log, better, note, applied, processing, space, empirical, matrix, auroc, anomaly, rate, denote, equivalent, principled, densenet, realnvp, vector, average] [approach, conference, transformation, computer, vision, estimation]
@InProceedings{Zisselman_2020_CVPR,
  author = {Zisselman, Ev and Tamar, Aviv},
  title = {Deep Residual Flow for Out of Distribution Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FeatureFlow: Robust Video Interpolation via Structure-to-Texture Generation
Shurui Gui, Chaoyue Wang, Qihua Chen, Dacheng Tao


Video interpolation aims to synthesize non-existent frames between two consecutive frames. Although existing optical flow based methods have achieved promising results, they still face great challenges in dealing with the interpolation of complicated dynamic scenes, which include occlusion, blur or abrupt brightness change. This is mainly because these cases may break the basic assumptions of the optical flow estimation (i.e. smoothness, consistency). In this work, we devised a novel structure-to-texture generation framework which splits the video interpolation task into two stages: structure-guided interpolation and texture refinement. In the first stage, deep structure-aware features are employed to predict feature flows from two consecutive frames to their intermediate result, and further generate the structure image of the intermediate frame. In the second stage, based on the generated coarse result, a Frame Texture Compensator is trained to fill in detailed textures. To the best of our knowledge, this is the first work that attempts to directly generate the intermediate frame through blending deep features. Experiments on both the benchmark datasets and challenging occlusion cases demonstrate the superiority of the proposed framework over the state-of-the-art methods. Codes are available on https://github.com/CM-BF/FeatureFlow.
[video, frame, attention, recognition, evaluation, predict] [feature, occlusion, edge, framework, module, challenging, semantic, extractor, table] [model, original, input, trained, adversarial, great, blending] [interpolation, ieee, proposed, figure, flow, intermediate, result, optical, deformable, based, pattern, motion, psnr, devised, dain, ssim, warped, consecutive, vfi, method, june, existing, dynamic, achieved, compensator, super, adopted, resolution, middlebury] [texture, loss, image, generator, synthesize, generate, generation, generated, produce, align, alignment, synthetic] [deep, best, learning, set, network, test, number, dacheng, training] [computer, conference, vision, triangle, international, coarse, handle, structure, second, european, estimation]
@InProceedings{Gui_2020_CVPR,
  author = {Gui, Shurui and Wang, Chaoyue and Chen, Qihua and Tao, Dacheng},
  title = {FeatureFlow: Robust Video Interpolation via Structure-to-Texture Generation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Nanoscale Motion Patterns of Vesicles in Living Cells
Arif Ahmed Sekh, Ida Sundvor Opstad, Asa Birna Birgisdottir, Truls Myrmel, Balpreet Singh Ahluwalia, Krishna Agarwal, Dilip K. Prasad


Detecting and analyzing nanoscale motion patterns of vesicles, smaller than the microscope resolution (~250 nm), inside living biological cells is a challenging problem. State-of-the-art CV approaches based on detection, tracking, optical flow or deep learning perform poorly for this problem. We propose an integrative approach, built upon physics-based simulations, nanoscopy algorithms, and a shallow residual attention network, to make it possible for the first time to analyze sub-resolution motion patterns in vesicles that may also be of sub-resolution diameter. Our results show state-of-the-art performance, 89% validation accuracy on a simulated dataset and 82% testing accuracy on an experimental dataset of living heart muscle cells imaged under three different pathological conditions. We demonstrate automated analysis of the motion states and changes in them for over 9000 vesicles. Such analysis will enable large-scale biological studies of vesicle transport and interaction in living cells in the future.
[attention, dataset, musical, visual, multiple, activity, interaction, video] [tracking, detection, shallow, localization, framework, feature, roi, inside, table] [noise, living, experimental, nature] [motion, analysis, microscopy, nanoscale, vesicle, residual, ieee, nanoscopy, pattern, optical, cell, biological, based, proposed, figure, imaging, scale, method, raw, range, signal, hypoxia, microscope, presented, sran, hypoxiaadm, indicate, convolutional, resolution] [image, perform] [deep, learning, neural, network, accuracy, number, data, large, random, compared, small, simple, better, randomly, selected] [computer, conference, vision, approach, simulated, simulation, international, variety, single, velocity, normal]
@InProceedings{Sekh_2020_CVPR,
  author = {Sekh, Arif Ahmed and Opstad, Ida Sundvor and Birgisdottir, Asa Birna and Myrmel, Truls and Ahluwalia, Balpreet Singh and Agarwal, Krishna and Prasad, Dilip K.},
  title = {Learning Nanoscale Motion Patterns of Vesicles in Living Cells},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Improving Action Segmentation via Graph-Based Temporal Reasoning
Yifei Huang, Yusuke Sugano, Yoichi Sato


Temporal relations among multiple action segments play an important role in action segmentation especially when observations are limited (e.g., actions are occluded by other objects or happen outside a field of view). In this paper, we propose a network module called Graph-based Temporal Reasoning Module (GTRM) that can be built on top of existing action segmentation models to learn the relation of multiple action segments in various time spans. We model the relations by using two Graph Convolution Networks (GCNs) where each node represents an action segment. The two graphs have different edge properties to account for boundary regression and classification tasks, respectively. By applying graph convolution, we can update each node's representation based on its relation with neighboring nodes. The updated representation is then used for improved action segmentation. We evaluate our model on the challenging egocentric datasets namely EGTEA and EPIC-Kitchens, where actions may be partially observed due to the viewpoint restriction. The results show that our proposed GTRM outperforms state-of-the-art action segmentation models by a large margin. We also demonstrate the effectiveness of our model on two third-person video datasets, the 50Salads dataset and the Breakfast dataset.
[action, temporal, graph, node, gtrm, video, recognition, relation, dataset, reasoning, egtea, gru, work, multiple, breakfast, mstcn, segmental, recurrent, gcns, built, egocentric, modeling, length, time, long, water, gcn, hidden, temporally] [backbone, segmentation, segment, boundary, regression, table, module, detection] [model, datasets, adding] [ieee, pattern, convolution, based, proposed, convolutional, existing, result, neighboring, figure] [representation, loss, edit] [performance, learning, gain, top, network, neural, updated, task, better, layer, class, arxiv, preprint, training, function, average, classification] [vision, computer, conference, international, european, limited, human, directly]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Yifei and Sugano, Yusuke and Sato, Yoichi},
  title = {Improving Action Segmentation via Graph-Based Temporal Reasoning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Episode-Based Prototype Generating Network for Zero-Shot Learning
Yunlong Yu, Zhong Ji, Jungong Han, Zhongfei Zhang


We introduce a simple yet effective episode-based training framework for zero-shot learning (ZSL), where the learning system is required to recognize unseen classes given only the corresponding class semantics. During training, the model is trained within a collection of episodes, each of which is designed to simulate a zero-shot classification task. Through training over multiple episodes, the model progressively accumulates ensemble experiences on predicting the mimetic unseen classes, which generalize well to the real unseen classes. Based on this training framework, we propose a novel generative model that synthesizes visual prototypes conditioned on the class semantic prototypes. The proposed model aligns the visual-semantic interactions by formulating both the visual prototype generation and the class semantic inference into an adversarial framework paired with a parameter-economic Multi-modal Cross-Entropy Loss to capture the discriminative information. Extensive experiments on four datasets under both traditional ZSL and generalized ZSL tasks show that our model outperforms the state-of-the-art approaches by large margins.
[visual, predicting] [semantic, feature, table, framework, paradigm, stage, predicted, achieves, bernt] [model, trained, adversarial, mce, ensemble, datasets] [proposed, traditional, refining, block, existing] [unseen, generative, prototype, zsl, generalized, real, loss, flo, cub, generating, mimetic, image, introduce, corresponding, discriminative, minimizing, consists, fake, zeynep, synthesize, mapping, progressively, generalize, aligns, generation] [class, training, classification, learning, base, network, episode, performance, test, accuracy, set, data, space, metric, observe, number, deep, process, selected, task, better, indicates, compared, sample, function, cosine, inference] [approach, distance, well, collection, euclidean]
@InProceedings{Yu_2020_CVPR,
  author = {Yu, Yunlong and Ji, Zhong and Han, Jungong and Zhang, Zhongfei},
  title = {Episode-Based Prototype Generating Network for Zero-Shot Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Segment the Tail
Xinting Hu, Yi Jiang, Kaihua Tang, Jingyuan Chen, Chunyan Miao, Hanwang Zhang


Real-world visual recognition requires handling the extreme sample imbalance in large-scale long-tailed data. We propose a "divide&conquer" strategy for the challenging LVIS task: divide the whole data into balanced parts and then apply incremental learning to conquer each one. This gives rise to a novel learning paradigm: class-incremental few-shot learning, which is especially effective for the challenge evolving over time: 1) the class imbalance among the old-class knowledge review and 2) the few-shot data in new-class learning. We call our approach Learning to Segment the Tail (LST). In particular, we design an instance-level balanced replay scheme, which is a memory-efficient approximation to balance the instance-level samples from the old-class images. We also propose to use a meta-module for new-class learning, where the module parameters are shared across incremental phases, gaining the learning-to-learn knowledge incrementally, from the data-rich head to the data-poor tail. We empirically show that, at the expense of a little head-class forgetting, we can gain a significant 8.3% AP improvement for the tail classes with less than 10 instances, achieving an overall 2.0% AP boost for the whole 1,230 classes.
[dataset, previous, recognition, visual, work, current, vocabulary] [instance, mask, feature, head, segmentation, lvis, category, object, segment, backbone, box, table, tackle, detection, roi, propose, rare, final, stage, ross, improvement] [model, trained, evaluated, effective, study] [phase, figure, based, method, proposed, severe] [loss, image, generator, transfer] [learning, data, training, tail, balanced, weight, incremental, number, knowledge, base, replay, distillation, imbalance, class, performance, set, classifier, imbalanced, lst, classification, logits, network, baseline, large, sample, mwg, top, catastrophic, learned, meta, function, sampling, size, strategy] [novel]
@InProceedings{Hu_2020_CVPR,
  author = {Hu, Xinting and Jiang, Yi and Tang, Kaihua and Chen, Jingyuan and Miao, Chunyan and Zhang, Hanwang},
  title = {Learning to Segment the Tail},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to Evaluate Perception Models Using Planner-Centric Metrics
Jonah Philion, Amlan Kar, Sanja Fidler


Variants of accuracy and precision are the gold-standard by which the computer vision community measures progress of perception algorithms. One reason for the ubiquity of these metrics is that they are largely task-agnostic; we in general seek to detect zero false negatives or positives. The downside of these metrics is that, at worst, they penalize all incorrect detections equally without conditioning on the task or scene, and at best, heuristics need to be chosen to ensure that different mistakes count differently. In this paper, we propose a principled metric for 3D object detection specifically for the task of self-driving. The core idea behind our metric is to isolate the task of object detection and measure the impact the produced detections would induce on the downstream task of driving. Without hand-designing it to, we find that our metric penalizes many of the mistakes that other metrics penalize by design. In addition, our metric downweighs detections based on additional factors such as distance from a detection to the ego car and the speed of the detection in intuitive ways that other detection metrics do not. For human evaluation, we generate scenes in which standard metrics and our metric disagree and find that humans side with our metric 79% of the time. Our project page including an evaluation server can be found at https://nv-tlabs.github.io/detection-relevance.
[time, ego, perception, driving, future, dataset, planner, downstream, evaluation, vehicle, agent, horizon, drive, speed] [object, pkl, detection, false, car, positive, nuscenes, detector, map, megvii, autonomous, penalizes, dangerous, score, box, raquel, lidar, detected, add, global, detect, penalize] [noise, evaluating, model, ranked, detecting] [figure, noisy, based, designed] [real, loss, generate, train] [metric, task, performance, distribution, measure, learning, precision, set, evaluate, neural, find, validation, average, probability, training, alex, size, top, test, accuracy, rank, data] [ground, truth, scene, human, system, error, local, point, front, joint, distance, computer]
@InProceedings{Philion_2020_CVPR,
  author = {Philion, Jonah and Kar, Amlan and Fidler, Sanja},
  title = {Learning to Evaluate Perception Models Using Planner-Centric Metrics},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Where, What, Whether: Multi-Modal Learning Meets Pedestrian Detection
Yan Luo, Chongyang Zhang, Muming Zhao, Hao Zhou, Jun Sun


Pedestrian detection benefits greatly from deep convolutional neural networks (CNNs). However, it is inherently hard for CNNs to handle situations in the presence of occlusion and scale variation. In this paper, we propose W^3Net, which attempts to address the above challenges by decomposing the pedestrian detection task into Where, What and Whether problems, corresponding to pedestrian localization, scale prediction and classification, respectively. Specifically, for a pedestrian instance, we formulate its feature by three steps. i) We generate a bird view map, which is naturally free from occlusion issues, and scan all points on it to look for suitable locations for each pedestrian instance. ii) Instead of utilizing pre-fixed anchors, we model the interdependency between depth and scale, aiming at generating depth-guided scales at different locations for better matching instances of different sizes. iii) We learn a latent vector shared by both visual and corpus space, by which false positives with similar vertical structure but lacking human partial features would be filtered out. We achieve state-of-the-art results on widely used datasets (Citypersons and Caltech). In particular, when evaluating on the heavy occlusion subset, our results reduce MR^-2 from 49.3% to 18.7% on Citypersons, and from 45.18% to 28.33% on Caltech.
[prediction, visual, corpus, previous, work, predict, embedding] [pedestrian, detection, height, occlusion, feature, table, map, branch, heavy, occluded, proposal, object, csp, location, semantic, unified, possibility, interdependency, achieves, represents, false, detector, faster, center, citypersons] [robust, model] [scale, proposed, net, figure, method, result, based, conv, formulated] [bird, corresponding, image, generation, reasonable, real, synthetic, target, attempt, loss, domain] [network, classification, data, problem, performance, distribution, denoted, task, caltech, width, learning, deep, neural, set, equation, fixed, best, number, process, compared] [view, front, depth, body, computer, estimation, single, uncertainty, camera]
@InProceedings{Luo_2020_CVPR,
  author = {Luo, Yan and Zhang, Chongyang and Zhao, Muming and Zhou, Hao and Sun, Jun},
  title = {Where, What, Whether: Multi-Modal Learning Meets Pedestrian Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CoverNet: Multimodal Behavior Prediction Using Trajectory Sets
Tung Phan-Minh, Elena Corina Grigore, Freddy A. Boulton, Oscar Beijbom, Eric M. Wolff


We present CoverNet, a new method for multimodal, probabilistic trajectory prediction for urban driving. Previous work has employed a variety of methods, including multimodal regression, occupancy maps, and 1-step stochastic policies. We instead frame the trajectory prediction problem as classification over a diverse set of trajectories. The size of this set remains manageable due to the limited number of distinct actions that can be taken over a reasonable prediction horizon. We structure the trajectory set to a) ensure a desired level of coverage of the state space, and b) eliminate physically impossible trajectories. By dynamically generating trajectory sets based on the agent's current state, we can further improve our method's efficiency. We demonstrate our approach on public, real world self-driving datasets, and show that it outperforms state-of-the-art methods.
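A minimal sketch of framing trajectory prediction as classification over a fixed trajectory set, as the abstract describes: the positive class for each example is taken to be the set element closest to the ground-truth trajectory (plain L2 over waypoints here; the paper's set construction and coverage guarantees are not reproduced).

import torch
import torch.nn.functional as F

def trajectory_set_classification_loss(logits, trajectory_set, gt_trajectory):
    # logits: (B, K) scores over the K candidate trajectories
    # trajectory_set: (K, T, 2) fixed candidates; gt_trajectory: (B, T, 2)
    dists = torch.cdist(gt_trajectory.flatten(1), trajectory_set.flatten(1))  # (B, K)
    target = dists.argmin(dim=1)          # closest candidate acts as the label
    return F.cross_entropy(logits, target)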
[trajectory, prediction, current, agent, covernet, multimodal, driving, recognition, urban, state, multiple, vehicle, mtp, context, predict, predicting, dataset, behavior, previous, work, dynamically, time, outperforms, include, planning] [regression, nuscenes, autonomous, predicted, main, represents, map, focus, car, anchor] [model, internal, datasets, input, constant, lateral, insight, public, choose] [dynamic, output, ieee, figure, pattern, june, motion, based, method, result] [representation, diverse, control, mode] [set, fixed, classification, number, function, learning, average, best, problem, probabilistic, size, rate, implementation, data, deep, performance] [conference, computer, vision, coverage, multipath, scene, ground, single, physically, approach, hybrid, uncertainty, international]
@InProceedings{Phan-Minh_2020_CVPR,
  author = {Phan-Minh, Tung and Grigore, Elena Corina and Boulton, Freddy A. and Beijbom, Oscar and Wolff, Eric M.},
  title = {CoverNet: Multimodal Behavior Prediction Using Trajectory Sets},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Real-World Person Re-Identification via Degradation Invariance Learning
Yukun Huang, Zheng-Jun Zha, Xueyang Fu, Richang Hong, Liang Li


Person re-identification (Re-ID) in real-world scenarios usually suffers from various degradation factors, e.g., low-resolution, weak illumination, blurring and adverse weather. On the one hand, these degradations lead to severe discriminative information loss, which significantly obstructs identity representation learning; on the other hand, the feature mismatch problem caused by low-level visual variations greatly reduces retrieval performance. An intuitive solution to this problem is to utilize low-level image restoration methods to improve the image quality. However, existing restoration methods cannot directly serve to real-world Re-ID due to various limitations, e.g., the requirements of reference samples, domain gap between synthesis and reality, and incompatibility between low-level and high-level methods. In this paper, to solve the above problem, we propose a degradation invariance learning framework for real-world person Re-ID. By introducing a self-supervised disentangled representation learning strategy, our method is able to simultaneously extract identity-related robust features and remove real-world degradations without extra supervision. We use low-resolution images as the main demonstration, and experiments show that our approach is able to achieve state-of-the-art performance on several Re-ID benchmarks. In addition, our framework can be easily extended to other real-world degradation factors, such as weak illumination, with only a few modifications.
[extract, provide, attention, pair, retrieval] [feature, propose, framework, liang, pedestrian, table, wei, china, extra, focus] [identity, adversarial, robust, input, improve] [degradation, ieee, pattern, illumination, figure, existing, resolution, method, proposed, dual, restoration, gamma] [person, image, content, representation, domain, invariance, encoder, disentangled, generative, discriminator, introduce, generation, fdi, discriminative, generate, real, loss, fsen, utilize, gap, fdj, generated, caviar, jiawei] [learning, network, performance, deep, data, neural, learned, problem, simultaneously] [conference, computer, vision, international, approach, human, reality, camera, capture, acm]
@InProceedings{Huang_2020_CVPR,
  author = {Huang, Yukun and Zha, Zheng-Jun and Fu, Xueyang and Hong, Richang and Li, Liang},
  title = {Real-World Person Re-Identification via Degradation Invariance Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Defending and Harnessing the Bit-Flip Based Adversarial Weight Attack
Zhezhi He, Adnan Siraj Rakin, Jingtao Li, Chaitali Chakrabarti, Deliang Fan


Recently, a new paradigm of the adversarial attack on the quantized neural network weights has attracted great attention, namely, the Bit-Flip based adversarial weight attack, a.k.a. the Bit-Flip Attack (BFA). BFA has shown extraordinary attacking ability, where the adversary can reduce a quantized Deep Neural Network (DNN) to a random guess through malicious bit-flips on a small set of vulnerable weight bits (e.g., 13 out of 93 million bits of an 8-bit quantized ResNet-18). However, there are no effective defensive methods to enhance the fault-tolerance capability of DNNs against such BFA. In this work, we conduct comprehensive investigations on BFA and propose to leverage binarization-aware training and its relaxation, piece-wise clustering, as simple and effective countermeasures to BFA. The experiments show that, for BFA to achieve the identical prediction accuracy degradation (e.g., below 11% on CIFAR-10), it requires 19.3x and 480.1x more effective malicious bit-flips on ResNet-20 and VGG-11 respectively, compared to defense-free counterparts.
[observation, iter] [table, seed, improvement] [adversarial, bfa, attack, defense, dnn, nbf, model, resistance, input, vulnerable, improve, malicious, effective, robustness, example, fault, clean, case, security, strong, defensive, comprehensive, flip, defend] [based, proposed, output, figure, enhance, comparison, low] [perform] [weight, training, network, clustering, accuracy, trial, neural, binarization, dropout, test, quantized, deep, bit, number, large, layer, average, pruning, relaxation, capacity, binarized, parameter, arxiv, preprint, data, regularization, learning, width, performance, quantization, inference, increasing, report, baseline, random, small, requires, min, sample, gradient, iteration, discussed] [conference, computer, vision, piecewise]
@InProceedings{He_2020_CVPR,
  author = {He, Zhezhi and Rakin, Adnan Siraj and Li, Jingtao and Chakrabarti, Chaitali and Fan, Deliang},
  title = {Defending and Harnessing the Bit-Flip Based Adversarial Weight Attack},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adversarial Latent Autoencoders
Stanislav Pidhorskyi, Donald A. Adjeroh, Gianfranco Doretto


Autoencoder networks are unsupervised approaches that aim to combine generative and representational properties by simultaneously learning an encoder-generator map. Although studied extensively, the questions of whether they have the same generative power as GANs, or whether they learn disentangled representations, have not been fully addressed. We introduce an autoencoder that tackles these issues jointly, which we call the Adversarial Latent Autoencoder (ALAE). It is a general architecture that can leverage recent improvements in GAN training procedures. We designed two autoencoders: one based on an MLP encoder, and another based on a StyleGAN generator, which we call StyleALAE. We verify the disentanglement properties of both architectures. We show that StyleALAE can not only generate 1024x1024 face images with quality comparable to StyleGAN, but at the same resolution can also produce face reconstructions and manipulations based on real images. This makes ALAE the first autoencoder able to match, and go beyond, the capabilities of a generator-only architecture.
[step, work] [table, split, map, instance] [adversarial, face, trained, input, mnist] [figure, resolution, output, ieee, based, conv, designed] [latent, image, stylealae, generative, style, autoencoder, gan, stylegan, real, generator, ffhq, disentanglement, encoder, gans, source, pioneer, fid, synthetic, growing, autoencoders, learn, disentangled, bedroom, bigan, discriminator, variational, unsupervised, ability, reciprocity, progressively, synthesis, representation] [learning, data, space, distribution, training, set, architecture, neural, network, linear, deep, processing, similarity, general, arxiv, preprint, setting, machine, learned, gradient, normalization] [conference, computer, international, vision, reconstruction, well, approach, mlp, representing]
@InProceedings{Pidhorskyi_2020_CVPR,
  author = {Pidhorskyi, Stanislav and Adjeroh, Donald A. and Doretto, Gianfranco},
  title = {Adversarial Latent Autoencoders},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Adaptive Fractional Dilated Convolution Network for Image Aesthetics Assessment
Qiuyu Chen, Wei Zhang, Ning Zhou, Peng Lei, Yi Xu, Yu Zheng, Jianping Fan


To leverage deep learning for image aesthetics assessment, one critical but unsolved issue is how to seamlessly incorporate information about image aspect ratios to learn more robust models. In this paper, an adaptive fractional dilated convolution (AFDC), which is aspect-ratio-embedded, composition-preserving and parameter-free, is developed to tackle this issue natively at the convolutional kernel level. Specifically, the fractional dilated kernel is adaptively constructed according to the image aspect ratio, and the interpolation of the two nearest integer dilated kernels is used to cope with the misalignment of fractional sampling. Moreover, we provide a concise formulation for mini-batch training and utilize a grouping strategy to reduce computational overhead. As a result, it can be easily implemented with common deep learning libraries and plugged into popular CNN architectures in a computation-efficient manner. Our experimental results demonstrate that our proposed method achieves state-of-the-art performance on image aesthetics assessment over the AVA dataset.
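One way to realize the fractional dilation described above is to blend the outputs of the two nearest integer-dilated convolutions that share the same kernel, with blend weights given by the fractional part of the target dilation rate. The PyTorch sketch below illustrates this interpolation along the width dimension; the mapping from aspect ratio to dilation rate and the mini-batch grouping strategy are simplified assumptions.

import math
import torch
import torch.nn.functional as F

def fractional_dilated_conv2d(x, weight, rate):
    # Approximate a conv with fractional dilation `rate` along the width
    # by interpolating two integer-dilated convolutions sharing `weight`.
    lo, hi = int(math.floor(rate)), int(math.ceil(rate))
    alpha = rate - lo                      # weight of the larger dilation
    pad_h, pad_w = weight.shape[2] // 2, weight.shape[3] // 2
    y_lo = F.conv2d(x, weight, padding=(pad_h, lo * pad_w), dilation=(1, lo))
    if hi == lo:
        return y_lo
    y_hi = F.conv2d(x, weight, padding=(pad_h, hi * pad_w), dilation=(1, hi))
    return (1 - alpha) * y_lo + alpha * y_hi

# e.g. a wide image might call for a fractional dilation of 1.6 along width
x = torch.randn(1, 3, 32, 64)
w = torch.randn(8, 3, 3, 3)
print(fractional_dilated_conv2d(x, w, rate=1.6).shape)  # (1, 8, 32, 64)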
[ava, three, multiple] [score, table, pooling, global, grouping, cnn, extra] [original, aesthetic, model, assessment, easily, spp, trained, experimental, effective] [aspect, dilated, dilation, fractional, convolution, kernel, adaptive, proposed, integer, method, cropping, conv, afdc, ieee, warping, convolutional, pattern, interpolation, adaptively, receptive, figure, comparison, spatial, cvpr, deformable, june] [image, loss, photo, introduce, common, train, learn, preserving, preserve] [learning, training, network, deep, vanilla, data, computational, better, size, test, sampling, rate, strategy, classification, ratio, computation, layer, random, augmentation, distribution, implemented, neural, batch, note, weight, compared] [conference, computer, vision, nearest, approach, international]
@InProceedings{Chen_2020_CVPR,
  author = {Chen, Qiuyu and Zhang, Wei and Zhou, Ning and Lei, Peng and Xu, Yi and Zheng, Yu and Fan, Jianping},
  title = {Adaptive Fractional Dilated Convolution Network for Image Aesthetics Assessment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Deep Generative Model for Robust Imbalance Classification
Xinyue Wang, Yilin Lyu, Liping Jing


Discovering hidden patterns in imbalanced data is a critical issue in various real-world applications, including computer vision. Existing classification methods usually suffer from the limitation of data, especially for the minority classes, and result in unstable prediction and low performance. In this paper, a deep generative classifier is proposed to mitigate this issue via both data perturbation and model perturbation. Specifically, the proposed generative classifier is modeled by a deep latent variable model where the latent variable aims to capture the direct cause of the target label. Meanwhile, the latent variable is represented by a probability distribution over possible values rather than a single fixed value, which is able to enforce uncertainty in the model and lead to stable prediction. Furthermore, this latent variable, as a confounder, affects the process of data (feature/label) generation, so that we arrive at well-justified sampling variability considerations in statistics and can implement data perturbation. Extensive experiments have been conducted on widely used real imbalanced image datasets. By comparing with state-of-the-art methods, the experimental results demonstrate the superiority of our proposed model on the imbalance classification task.
[] [bag] [perturbation, model] [] [loss] [data, sampling, distribution, augmented, imbalanced, classification] [uncertainty, estimated]
@InProceedings{Wang_2020_CVPR,
  author = {Wang, Xinyue and Lyu, Yilin and Jing, Liping},
  title = {Deep Generative Model for Robust Imbalance Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Deep Network for Detecting 3D Object Keypoints and 6D Poses
Wanqing Zhao, Shaobo Zhang, Ziyu Guan, Wei Zhao, Jinye Peng, Jianping Fan


State-of-the-art 6D object pose detection methods use convolutional neural networks to estimate objects' 6D poses from RGB images. However, they require huge numbers of images with explicit 3D annotations such as 6D poses, 3D bounding boxes and 3D keypoints, either obtained by manual labeling or inferred from synthetic images generated by 3D CAD models. Manual labeling for a large number of images is a laborious task, and we usually do not have the corresponding 3D CAD models of objects in real environments. In this paper, we develop a keypoint-based 6D object pose detection method (and its deep network) called Object Keypoint based POSe Estimation (OK-POSE). OK-POSE employs relative transformations between viewpoints for training. Specifically, we use pairs of images with object annotations and relative transformation information between their viewpoints to automatically discover objects' 3D keypoints which are geometrically and visually consistent. Then, the 6D object pose can be estimated using a keypoint-based geometric reasoning method with a reference viewpoint. The relative transformation information can be easily obtained from any cheap binocular camera or most smartphone devices, thus greatly lowering the labeling cost. Experiments have demonstrated that OK-POSE achieves acceptable performance compared to methods relying on the object's 3D CAD model or a great deal of 3D labeling. These results show that our method can be used as a suitable alternative when there are no 3D CAD models or large numbers of 3D annotations.
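The keypoint-based geometric reasoning step aligns two sets of corresponding 3D keypoints; the Kabsch algorithm, which the paper builds on, solves for the optimal rotation and translation in closed form. A numpy sketch of this standard step is given below, with the keypoint discovery network itself omitted.

import numpy as np

def kabsch(P, Q):
    # Rotation R and translation t that best align P onto Q,
    # where P and Q are (N, 3) arrays of corresponding 3D keypoints.
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t

# sanity check with a synthetic rigid transform
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1                        # force a proper rotation
Q = P @ R_true.T + np.array([0.1, -0.2, 0.3])
R, t = kabsch(P, Q)
print(np.allclose(R, R_true, atol=1e-6))      # True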
[explicit, predict, order, visual, dataset] [object, detection, detected, regression, branch, table, labeling, faster, feature, location, occlusion, predicted, annotation, achieves, detect, map, china] [input, model, robust, great] [method, reference, figure, convolutional, based, pixel, output] [image, loss, real, consistency, corresponding, train, synthetic, generated, translation] [network, training, deep, function, learning, accuracy, set, number, performance, compared, inference, calculated, large, algorithm, matrix, average, neural] [pose, keypoints, transformation, keypoint, relative, cad, rgb, depth, rotation, estimated, distance, single, estimation, local, camera, consistent, acceptable, distinctiveness, recovery, defined, epipolar, cluttered, ttrans, kabsch, linemod, estimate]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Wanqing and Zhang, Shaobo and Guan, Ziyu and Zhao, Wei and Peng, Jinye and Fan, Jianping},
  title = {Learning Deep Network for Detecting 3D Object Keypoints and 6D Poses},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MetaIQA: Deep Meta-Learning for No-Reference Image Quality Assessment
Hancheng Zhu, Leida Li, Jinjian Wu, Weisheng Dong, Guangming Shi


Recently, increasing interest has been drawn to exploiting deep convolutional neural networks (DCNNs) for no-reference image quality assessment (NR-IQA). Despite the notable success achieved, there is a broad consensus that training DCNNs heavily relies on massive annotated data. Unfortunately, IQA is a typical small-sample problem. Therefore, most existing DCNN-based IQA metrics operate on pre-trained networks. However, these pre-trained networks are not designed for the IQA task, leading to generalization problems when evaluating different types of distortions. With this motivation, this paper presents a no-reference IQA metric based on deep meta-learning. The underlying idea is to learn the meta-knowledge shared by humans when evaluating the quality of images with various distortions, which can then be adapted to unknown distortions easily. Specifically, we first collect a number of NR-IQA tasks for different distortions. Then meta-learning is adopted to learn the prior knowledge shared by diversified distortions. Finally, the quality prior model is fine-tuned on a target NR-IQA task to quickly obtain the quality model. Extensive experiments demonstrate that the proposed metric outperforms the state of the art by a large margin. Furthermore, the meta-model learned from synthetic distortions can also be easily generalized to authentic distortions, which is highly desirable in real-world applications of IQA metrics.
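The quality prior model is obtained by optimizing across many distortion-specific NR-IQA tasks so that a few gradient steps adapt it to a new distortion. The sketch below shows a generic first-order meta-learning loop of that kind; the task sampler, the regression head and all hyper-parameters are placeholders rather than the authors' exact training recipe.

import copy
import torch
import torch.nn.functional as F

def meta_train_step(model, tasks, inner_lr=1e-4, outer_lr=1e-5, inner_steps=2):
    # One first-order meta-update over a batch of NR-IQA tasks.
    # Each task yields (support, query) batches of (images, quality scores).
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support, query in tasks:
        fast = copy.deepcopy(model)
        opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):              # adapt on the support set
            x, y = support
            loss = F.mse_loss(fast(x).squeeze(-1), y)
            opt.zero_grad(); loss.backward(); opt.step()
        x, y = query                              # evaluate the adapted weights
        loss = F.mse_loss(fast(x).squeeze(-1), y)
        for acc, g in zip(meta_grads, torch.autograd.grad(loss, list(fast.parameters()))):
            acc += g
    with torch.no_grad():                         # first-order meta-update
        for p, g in zip(model.parameters(), meta_grads):
            p -= outer_lr * g / len(tasks)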
[evaluation, recognition, visual, natural] [challenge, score, table, regression, predicted, divided] [quality, model, iqa, distorted, assessment, generalization, distortion, srocc, live, database, authentically, query, plcc, evaluating, easily, authentic] [prior, ieee, method, based, blind, figure, proposed, convolutional, pattern, fast, signal, existing] [image, learn, unknown, shared, ability] [deep, learning, set, task, gradient, training, knowledge, network, performance, neural, number, learned, support, higher, metric, classification, evaluate, rate, problem, data, machine] [approach, conference, human, vision, computer, second, defined, international]
@InProceedings{Zhu_2020_CVPR,
  author = {Zhu, Hancheng and Li, Leida and Wu, Jinjian and Dong, Weisheng and Shi, Guangming},
  title = {MetaIQA: Deep Meta-Learning for No-Reference Image Quality Assessment},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Sketchformer: Transformer-Based Representation for Sketched Structure
Leo Sampaio Ferraz Ribeiro, Tu Bui, John Collomosse, Moacir Ponti


Sketchformer is a novel transformer-based representation for encoding free-hand sketches in vector form, i.e. as a sequence of strokes. Sketchformer effectively addresses multiple tasks: sketch classification, sketch based image retrieval (SBIR), and the reconstruction and interpolation of sketches. We report several variants exploring continuous and tokenized input representations, and contrast their performance. Our learned embedding, driven by a dictionary learning tokenization scheme, yields state-of-the-art performance in classification and image retrieval tasks when compared against baseline representations driven by LSTM sequence-to-sequence architectures: SketchRNN and derivatives. We show that sketch reconstruction and interpolation are improved significantly by the Sketchformer embedding for complex sketches with longer stroke sequences.
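The dictionary-learning tokenization turns each continuous pen displacement into a discrete token by assigning it to its nearest entry in a codebook learned with k-means, so that a sketch becomes a token sequence a transformer can consume. A small sketch of that step is shown below; the codebook size and the handling of pen-state bits are simplifications of what the paper describes.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(deltas, k=1000, seed=0):
    # deltas: (M, 2) pen displacements pooled from many training sketches
    return KMeans(n_clusters=k, n_init=4, random_state=seed).fit(deltas)

def tokenize(sketch_deltas, codebook):
    # map each (dx, dy) stroke step to the index of its nearest codebook entry
    return codebook.predict(sketch_deltas)

deltas = np.random.randn(5000, 2).astype(np.float32)
cb = build_codebook(deltas, k=64)
print(tokenize(np.random.randn(20, 2).astype(np.float32), cb)[:10])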
[embedding, transformer, sequence, retrieval, three, modeling, visual, language, attention, work, temporal, tokenization, lstm, longer, embeddings, corpus, dataset, long, token, encoding, state, length, rasterized, explore, decoder] [table, art, object] [input, query, model, trained, medium] [based, proposed, interpolation, figure, convolutional, method, driven] [sketch, image, sketchformer, stroke, raster, representation, livesketch, loss, sketchrnn, sketched, sbir, tokenized, learn, pen, encoder, common, generative, vaswani] [vector, learning, classification, network, search, performance, learned, dictionary, triplet, set, architecture, training, neural, deep, evaluate, class, baseline, sample, quantifying] [reconstruction, complex, continuous, point, acm]
@InProceedings{Ribeiro_2020_CVPR,
  author = {Ribeiro, Leo Sampaio Ferraz and Bui, Tu and Collomosse, John and Ponti, Moacir},
  title = {Sketchformer: Transformer-Based Representation for Sketched Structure},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Cylindrical Convolutional Networks for Joint Object Detection and Viewpoint Estimation
Sunghun Joung, Seungryong Kim, Hanjae Kim, Minsu Kim, Ig-Jae Kim, Junghyun Cho, Kwanghoon Sohn


Existing techniques for encoding spatial invariance within deep convolutional neural networks only model 2D transformation fields. This does not account for the fact that objects in 2D space are projections of 3D ones, and thus these techniques have limited ability to handle severe object viewpoint changes. To overcome this limitation, we introduce a learnable module, cylindrical convolutional networks (CCNs), which exploit a cylindrical representation of a convolutional kernel defined in 3D space. CCNs extract a view-specific feature through a view-specific convolutional kernel to predict object category scores at each viewpoint. With the view-specific features, we simultaneously determine the object category and viewpoint using the proposed sinusoidal soft-argmax module. Our experiments demonstrate the effectiveness of cylindrical convolutional networks on joint object detection and viewpoint estimation.
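One way to read the sinusoidal soft-argmax is as a differentiable expectation over discretized viewpoint bins that respects the periodicity of angles: the per-bin probabilities weight the sine and cosine of each bin centre, and the angle is recovered with atan2. The sketch below implements that reading; it is an interpretation of the abstract rather than the authors' exact module.

import math
import torch

def sinusoidal_soft_argmax(logits):
    # logits: (B, K) viewpoint scores over K azimuth bins.
    # Returns a continuous angle in [-pi, pi) per sample, differentiably.
    K = logits.shape[-1]
    probs = torch.softmax(logits, dim=-1)
    centers = torch.arange(K, device=logits.device) * (2 * math.pi / K)
    s = (probs * torch.sin(centers)).sum(dim=-1)
    c = (probs * torch.cos(centers)).sum(dim=-1)
    return torch.atan2(s, c)

print(sinusoidal_soft_argmax(torch.randn(4, 12)))  # one viewpoint per sample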
[dataset, recognition, extract, predict, visual, order] [object, category, detection, feature, ccns, cylindrical, pascal, bounding, box, sinusoidal, table, map, regression, faster, ross, side, categorization, score, fpn, apply, pooling, region] [model, trained, input, evaluated] [convolutional, kernel, ieee, conventional, pattern, proposed, spatial, cnns, figure, deformable, periodic, discretized, method, based] [image, structural, representation, loss, characteristic] [classification, set, performance, neural, average, network, deep, training, precision, learning, imagenet, probability, compared, processing, similarity, number, classifier, function] [viewpoint, estimation, conference, computer, joint, vision, geometric, kitti, estimate, international, transformation, limited, shape, single, continuous]
@InProceedings{Joung_2020_CVPR,
  author = {Joung, Sunghun and Kim, Seungryong and Kim, Hanjae and Kim, Minsu and Kim, Ig-Jae and Cho, Junghyun and Sohn, Kwanghoon},
  title = {Cylindrical Convolutional Networks for Joint Object Detection and Viewpoint Estimation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning a Unified Sample Weighting Network for Object Detection
Qi Cai, Yingwei Pan, Yu Wang, Jingen Liu, Ting Yao, Tao Mei


Region sampling or weighting is significantly important to the success of modern region-based object detectors. Unlike some previous works, which only focus on "hard" samples when optimizing the objective function, we argue that sample weighting should be data-dependent and task-dependent. The importance of a sample for the objective function optimization is determined by its uncertainties to both object classification and bounding box regression tasks. To this end, we devise a general loss function to cover most region-based object detectors with various sampling strategies, and then based on it we propose a unified sample weighting network to predict a sample's task weights. Our framework is simple yet effective. It leverages the samples' uncertainty distributions on classification loss, regression loss, IoU, and probability score, to predict sample weights. Our approach has several advantages: (i). It jointly learns sample weights for both classification and regression tasks, which differentiates it from most previous work. (ii). It is a data-driven process, so it avoids some manual parameter tuning. (iii). It can be effortlessly plugged into most object detectors and achieves noticeable performance improvements without affecting their inference time. Our approach has been thoroughly evaluated with recent object detection frameworks and it can consistently boost the detection accuracy. Code has been made available at https://github.com/caiqi/sample-weighting-network.
[predict, work, previous, yingwei, individual] [object, regression, detection, swn, faster, region, retinanet, reg, box, mask, bounding, hard, iou, table, ross, proposal, coco, positive, detector, lreg, unified, score, sreg, predicted, kaiming, feature, lcls, propose, framework, boost, easy] [example, evaluated] [figure, based, high, method, proposed] [loss, learn, ting, tao] [sample, classification, weighting, training, network, sampling, learning, function, performance, higher, negative, weight, log, general, large, mining, neural, objective, task, inference, strategy, set, deep, optimization, problem, soft, compared] [approach, ground, uncertainty, truth, jointly, accurate]
@InProceedings{Cai_2020_CVPR,
  author = {Cai, Qi and Pan, Yingwei and Wang, Yu and Liu, Jingen and Yao, Ting and Mei, Tao},
  title = {Learning a Unified Sample Weighting Network for Object Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Old Is Gold: Redefining the Adversarially Learned One-Class Classifier Training Paradigm
Muhammad Zaigham Zaheer, Jin-Ha Lee, Marcella Astrid, Seung-Ik Lee


A popular method for anomaly detection is to use the generator of an adversarial network to formulate an anomaly score over the reconstruction loss of the input. Due to the rare occurrence of anomalies, optimizing such networks can be a cumbersome task. Another possible approach is to use both the generator and the discriminator for anomaly detection. However, attributed to the involvement of adversarial training, this model is often unstable in that its performance fluctuates drastically with each training step. In this study, we propose a framework that effectively generates stable results across a wide range of training steps and allows us to use both the generator and the discriminator of an adversarial model for efficient and robust anomaly detection. Our approach transforms the fundamental role of the discriminator from identifying real and fake data to distinguishing between good and bad quality reconstructions. To this end, we prepare training examples of good quality reconstructions by employing the current generator, whereas poor quality examples are obtained by utilizing an old state of the same generator. This way, the discriminator learns to detect subtle distortions that often appear in reconstructions of anomalous inputs. Extensive experiments performed on the Caltech-256 and MNIST image datasets for novelty detection show superior results. Furthermore, on the UCSD Ped2 video dataset for anomaly detection, our model achieves a frame-level AUC of 98.1%, surpassing recent state-of-the-art methods.
[dataset, video, state, frame, evaluation, role, work] [detection, score, framework, table, object, detect, module] [quality, model, adversarial, auc, trained, robust, input, mnist, case, adversarially, distinguishing] [phase, ieee, low, figure, pattern, based, proposed, bad, method, high, event, range, motion, conventional, analysis, existing] [generator, discriminator, pseudo, image, real, learn, generative, fake, unsupervised, loss] [anomaly, training, epoch, performance, data, abnormal, learning, good, test, baseline, ucsd, anomalous, deep, network, classification, best, class, selection, number, learned, wide] [conference, computer, vision, reconstruction, approach, international, reconstructed, normal, outlier, well, provided]
@InProceedings{Zaheer_2020_CVPR,
  author = {Zaheer, Muhammad Zaigham and Lee, Jin-Ha and Astrid, Marcella and Lee, Seung-Ik},
  title = {Old Is Gold: Redefining the Adversarially Learned One-Class Classifier Training Paradigm},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
An Adaptive Neural Network for Unsupervised Mosaic Consistency Analysis in Image Forensics
Quentin Bammey, Rafael Grompone von Gioi, Jean-Michel Morel


Automatically finding suspicious regions in a potentially forged image by splicing, inpainting or copy-move remains a widely open problem. Blind detection neural networks trained on benchmark data are flourishing. Yet, these methods do not provide an explanation of their detections. The more traditional methods try to provide such evidence by pointing out local inconsistencies in the image noise, JPEG compression, chromatic aberration, or in the mosaic. In this paper we develop a blind method that can train directly on unlabelled and potentially forged images to point out local mosaic inconsistencies. To this aim we designed a CNN structure inspired from demosaicing algorithms and directed at classifying image blocks by their position in the image modulo (2 x 2). Creating a diversified benchmark database using varied demosaicing methods, we explore the efficiency of the method and its ability to adapt quickly to any new data.
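The self-supervised training signal can be produced from unlabelled images alone: each block is labelled by the offset of its top-left pixel modulo the 2x2 mosaic pattern, and inconsistencies between predicted and expected offsets later flag tampered regions. The snippet below sketches such label generation; block size and sampling are illustrative choices, not the paper's settings.

import numpy as np

def sample_mosaic_blocks(img, block=32, n=16, rng=None):
    # Cut random blocks and label each by the parity (row % 2, col % 2)
    # of its top-left pixel, giving 4 position classes modulo (2 x 2).
    rng = rng or np.random.default_rng()
    H, W = img.shape[:2]
    blocks, labels = [], []
    for _ in range(n):
        r = int(rng.integers(0, H - block))
        c = int(rng.integers(0, W - block))
        blocks.append(img[r:r + block, c:c + block])
        labels.append(2 * (r % 2) + (c % 2))
    return np.stack(blocks), np.array(labels)

x, y = sample_mosaic_blocks(np.random.rand(256, 256, 3))
print(x.shape, y[:8])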
[order, three, dataset, bilinear] [detection, detect, detected, positive, false, concatenate, localization, map, feature] [forgery, jpeg, forged, mosaic, pixelwise, cfa, database, forensics, demosaiced, digital, trained, noise, splicing, softplus, multimedia, detecting, create, improve] [ieee, demosaicing, spatial, method, colour, output, pattern, blockwise, color, proposed, created, based, figure, pixel, convolutional, convolution, signal, analysis, adaptive, modulo, interpolation, green, residual, high, compression] [image, train, specific, source] [network, neural, processing, training, data, algorithm, number, rate, sampled, learning, small, retraining, applied] [conference, international, directly, position, computer, vision, full, compare, structure, camera, second, error, complex]
@InProceedings{Bammey_2020_CVPR,
  author = {Bammey, Quentin and Gioi, Rafael Grompone von and Morel, Jean-Michel},
  title = {An Adaptive Neural Network for Unsupervised Mosaic Consistency Analysis in Image Forensics},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
McFlow: Monte Carlo Flow Models for Data Imputation
Trevor W. Richardson, Wencheng Wu, Lei Lin, Beilei Xu, Edgar A. Bernal


We consider the topic of data imputation, a foundational task in machine learning that addresses issues with missing data. To that end, we propose MCFlow, a deep framework for imputation that leverages normalizing flow generative models and Monte Carlo sampling. We address the causality dilemma that arises when training models with incomplete data by introducing an iterative learning scheme which alternately updates the density estimate and the values of the missing entries in the training data. We provide extensive empirical validation of the effectiveness of the proposed method on standard multivariate and image datasets, and benchmark its performance against state-of-the-art alternatives. We demonstrate that MCFlow is superior to competing methods in terms of the quality of the imputed data, as well as with regards to its ability to preserve the semantic structure of the data.
[observed, embedding, multiple, order, current, work] [framework, table, semantic, fully, mask] [model, mnist, trained, adversarial, datasets, iterative, quality] [flow, based, proposed, method, analysis] [missing, imputation, generative, imputed, image, competing, multivariate, misgan, latent, variable, loss, mapping, celeba, address, independent, corresponding] [data, learning, training, network, density, mcflow, set, machine, performance, sample, neural, deep, algorithm, space, statistical, rate, task, normalizing, update, sampling, distribution, gain, monte, carlo, alternating, involves, inference, classification, standard, requires, maximum, log, feedforward, function, uci, processing, scheme, note, optimal, architecture, max] [incomplete, estimate, conference, complete, initial, international, computed, estimation]
@InProceedings{Richardson_2020_CVPR,
  author = {Richardson, Trevor W. and Wu, Wencheng and Lin, Lei and Xu, Beilei and Bernal, Edgar A.},
  title = {McFlow: Monte Carlo Flow Models for Data Imputation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning to See Through Obstructions
Yu-Lun Liu, Wei-Sheng Lai, Ming-Hsuan Yang, Yung-Yu Chuang, Jia-Bin Huang


We present a learning-based approach for removing unwanted obstructions, such as window reflections, fence occlusions or raindrops, from a short sequence of images captured by a moving camera. Our method leverages the motion differences between the background and the obstructing elements to recover both layers. Specifically, we alternate between estimating dense optical flow fields of the two layers and reconstructing each layer from the flow-warped images via a deep convolutional neural network. The learning-based layer reconstruction allows us to accommodate potential errors in the flow estimation and brittle assumptions such as brightness consistency. We show that training on synthetically generated data transfers well to real images. Our results on numerous challenging scenarios of reflection and fence removal demonstrate the effectiveness of the proposed method.
[frame, video, sequence, visual, work, dataset, temporal, moving, natural, exploit] [background, level, table, challenging, refinement, propose, framework, feature] [input, model, clean, difference] [reflection, flow, method, removal, obstruction, motion, optical, fence, figure, proposed, ncc, separation, xue, coarsest, spatial, warping, alayrac, captured, recover, existing] [image, loss, real, train, synthetic, generate, learn, align] [layer, network, online, optimization, training, learning, data, deep, uniform, set, average] [reconstruction, initial, decomposition, reconstruct, single, dense, reconstructed, well, demonstrate, approach, estimation, intrinsic, recovered, keyframe, supplementary, estimate]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Yu-Lun and Lai, Wei-Sheng and Yang, Ming-Hsuan and Chuang, Yung-Yu and Huang, Jia-Bin},
  title = {Learning to See Through Obstructions},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GaitPart: Temporal Part-Based Model for Gait Recognition
Chao Fan, Yunjie Peng, Chunshui Cao, Xu Liu, Saihui Hou, Jiannan Chi, Yongzhen Huang, Qing Li, Zhiqiang He


Gait recognition, which identifies individual walking patterns at a long distance, is one of the most promising video-based biometric technologies. At present, most gait recognition methods take the whole human body as a unit to establish spatio-temporal representations. However, we have observed that different parts of the human body possess evidently different visual appearances and movement patterns during walking. In the latest literature, employing partial features for human body description has been verified to be beneficial for individual recognition. Taking the above insights together, we assume that each part of the human body needs its own spatio-temporal expression. We then propose a novel part-based model, GaitPart, which boosts performance in two respects: on the one hand, the Focal Convolution Layer, a new application of convolution, is presented to enhance fine-grained learning of part-level spatial features; on the other hand, the Micro-motion Capture Module (MCM) is proposed, with several parallel MCMs in GaitPart corresponding to the pre-defined parts of the human body. It is worth mentioning that the MCM is a novel way of temporal modeling for the gait task, focusing on short-range temporal features rather than redundant long-range features of the cyclic gait. Experiments on two of the most popular public datasets, CASIA-B and OU-MVLP, show that our method sets a new state of the art on multiple standard benchmarks. The source code will be available at https://github.com/ChaoFan96/GaitPart.
[temporal, gaitpart, sequence, fconv, recognition, mcm, extract, represent, composed, attention, mtb, regular, individual, walking, modeling] [feature, module, pooling, final, map, ablation, framework, denotes, effectiveness, table, global, inside, extractor] [gait, input, model, study] [convolution, kernel, spatial, periodic, output, field, applying, column, ieee, proposed, parallel, receptive, convolutional, analysis, pattern] [discriminative, person, corresponding, appearance, train, loss, representation] [group, size, learning, set, performance, experiment, applied, max, setting, conducted, vector, deep, average, network, test] [human, body, capture, novel, local, international, conference, structure]
@InProceedings{Fan_2020_CVPR,
  author = {Fan, Chao and Peng, Yunjie and Cao, Chunshui and Liu, Xu and Hou, Saihui and Chi, Jiannan and Huang, Yongzhen and Li, Qing and He, Zhiqiang},
  title = {GaitPart: Temporal Part-Based Model for Gait Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
EmotiCon: Context-Aware Multimodal Emotion Recognition Using Frege's Principle
Trisha Mittal, Pooja Guhan, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha


We present EmotiCon, a learning-based algorithm for context-aware perceived human emotion recognition from videos and images. Motivated by Frege's Context Principle from psychology, our approach combines three interpretations of context for emotion recognition. Our first interpretation is based on using multiple modalities (e.g.faces and gaits) for emotion recognition. For the second interpretation, we gather semantic context from the input image and use a self-attention-based CNN to encode this information. Finally, we use depth maps to model the third interpretation related to socio-dynamic interactions and proximity among agents. We demonstrate the efficiency of our network through experiments on EMOTIC, a benchmark dataset. We report an Average Precision (AP) score of 35.48 across 26 classes, which is an improvement of 7-8 over prior methods. We also introduce a new dataset, GroupWalk, which is a collection of videos captured in multiple real-world settings of people walking. We report an AP of 65.83 across 4 categories on GroupWalk, which is also an improvement over prior methods.
[emotion, context, recognition, dataset, emotic, groupwalk, emoticon, three, multiple, multimodal, people, attention, graph, affective, emotional, psychology, social, proximity, infer, dinesh, work, extract, idepth, agent, convey, uttaran, aniket] [semantic, map, annotated, table, ablation, cnn] [input, facial, datasets, perceived, model, face, expression, interpretation, multiplicative] [figure, fusion, ieee, prior, based, lee, captured, convolutional, pattern, color] [image, perform, loss, row, corresponding, train, consists] [network, classification, discrete, learning, report, number, training, arxiv, preprint, principle] [depth, body, computer, conference, vision, approach, second, compute]
@InProceedings{Mittal_2020_CVPR,
  author = {Mittal, Trisha and Guhan, Pooja and Bhattacharya, Uttaran and Chandra, Rohan and Bera, Aniket and Manocha, Dinesh},
  title = {EmotiCon: Context-Aware Multimodal Emotion Recognition Using Frege's Principle},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Can Deep Learning Recognize Subtle Human Activities?
Vincent Jacquot, Zhuofan Ying, Gabriel Kreiman


Deep Learning has driven recent and exciting progress in computer vision, instilling the belief that these algorithms could solve any visual task. Yet, datasets commonly used to train and test computer vision algorithms have pervasive confounding factors. Such biases make it difficult to truly estimate the performance of those algorithms and how well computer vision models can extrapolate outside the distribution in which they were trained. In this work, we propose a new action classification challenge that is performed well by humans, but poorly by state-of-the-art Deep Learning models. As a proof-of-principle, we consider three exemplary tasks: drinking, reading, and sitting. The best accuracies reached using state-of-the-art computer vision models were 61.7%, 62.8%, and 76.8%, respectively, while human participants scored above 90% accuracy on the three tasks. We propose a rigorous method to reduce confounds when creating datasets, and when comparing human versus computer vision performance. Source code and datasets are publicly available.
[dataset, reading, action, sitting, three, recognition, drinking, reached, chance, current, detectron, progress, duration, extract, visual, text, recognize, psychophysics, provide] [object, ross, kaiming, bounding, coco, piotr, challenge, detection, focus, horizontal] [datasets, model, controlled, example, trained, help, classified, led] [figure, convolutional, exposure] [image, person, grayscale] [accuracy, deep, performance, neural, test, classification, imagenet, best, learning, task, network, better, binary, group, number, set, considered, classifier, algorithm, classify, simple, svm, applied, procedure, class, function, validation] [human, keypoints, pose, computer, approach, vision, well, internet, estimation, despite]
@InProceedings{Jacquot_2020_CVPR,
  author = {Jacquot, Vincent and Ying, Zhuofan and Kreiman, Gabriel},
  title = {Can Deep Learning Recognize Subtle Human Activities?},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
PhysGAN: Generating Physical-World-Resilient Adversarial Examples for Autonomous Driving
Zelun Kong, Junfeng Guo, Ang Li, Cong Liu


Although deep neural networks (DNNs) are being pervasively used in vision-based autonomous driving systems, they have been found vulnerable to adversarial attacks, in which small-magnitude perturbations to the inputs at test time cause dramatic changes to the outputs. While most recent attack methods target digital-world adversarial scenarios, it is unclear how they perform in the physical world, and more importantly, the perturbations generated by such methods would cover a whole driving scene, including fixed background imagery such as the sky, making them inapplicable to physical-world implementation. We present PhysGAN, which generates physical-world-resilient adversarial examples for misleading autonomous driving systems in a continuous manner. We show the effectiveness and robustness of PhysGAN via extensive digital- and real-world evaluations, comparing it with a set of state-of-the-art baseline methods, which it consistently outperforms. To the best of our knowledge, PhysGAN is the first technique for generating realistic and physical-world-resilient adversarial examples for attacking common autonomous driving scenarios.
[driving, sign, video, frame, vehicle, evaluation] [autonomous, table, represents, background, challenge, effectiveness] [adversarial, steering, physgan, roadside, original, physical, model, sadv, attack, example, digital, udacity, input, fgsm, sorig, indistinguishable, mislead, testing, apple, noise, efficacy, continuously, robustness, attacking, case, mcdonalds, ian, targeted, xorig, robust, physfgsm] [visually, slice, ieee, captured, method] [generated, generate, target, generating, image, corresponding, generator, lgan, train, perform, realistic, real, gan, mapping] [random, set, entire, neural, baseline, sample, nvidia, learning, online, arxiv, preprint, classification, training, deep] [angle, error, single, kitti, ground, approach, scene]
@InProceedings{Kong_2020_CVPR,
  author = {Kong, Zelun and Guo, Junfeng and Li, Ang and Liu, Cong},
  title = {PhysGAN: Generating Physical-World-Resilient Adversarial Examples for Autonomous Driving},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
ILFO: Adversarial Attack on Adaptive Neural Networks
Mirazul Haque, Anki Chauhan, Cong Liu, Wei Yang


With the increasing number of layers and parameters in neural networks, their energy consumption has become a great concern to society, especially to users of handheld or embedded devices. In this paper, we investigate the robustness of neural networks against energy-oriented attacks. Specifically, we propose ILFO (Intermediate Output-Based Loss Function Optimization), an attack against a common type of energy-saving neural network, the Adaptive Neural Network (AdNN). AdNNs save energy by dynamically deactivating part of their model based on the needs of the input. ILFO leverages intermediate outputs as a proxy to infer the relation between an input and its corresponding energy consumption. ILFO has been shown to restore up to 100% of the FLOPs (floating-point operations) that AdNNs save, with minimal noise added to the input images. To our knowledge, this is the first attempt to attack the energy consumption of an AdNN.
[] [] [] [] [] [] []
@InProceedings{Haque_2020_CVPR,
  author = {Haque, Mirazul and Chauhan, Anki and Liu, Cong and Yang, Wei},
  title = {ILFO: Adversarial Attack on Adaptive Neural Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
On Translation Invariance in CNNs: Convolutional Layers Can Exploit Absolute Spatial Location
Osman Semih Kayhan, Jan C. van Gemert


In this paper we challenge the common assumption that convolutional layers in modern CNNs are translation invariant. We show that CNNs can and will exploit the absolute spatial location by learning filters that respond exclusively to particular absolute locations by exploiting image boundary effects. Because modern CNNs filters have a huge receptive field, these boundary effects operate even far from the image boundary, allowing the network to exploit absolute spatial location all over the image. We give a simple solution to remove spatial location encoding which improves translation invariance and thus gives a stronger visual inductive bias which particularly benefits small data sets. We broadly demonstrate these benefits on several architectures and various applications such as image classification, patch matching, and two video classification datasets.
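The boundary effect is easy to observe directly: with zero padding, a convolution applied to a spatially constant input already produces location-dependent responses near the borders, which deeper layers with large receptive fields can propagate far into the image. The snippet below demonstrates the phenomenon; it is a toy illustration, not the paper's experiments.

import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)  # "same" zero padding
x = torch.ones(1, 1, 8, 8)        # perfectly translation-invariant input
with torch.no_grad():
    y = conv(x)[0, 0]
print(y[0, :4])                   # border responses ...
print(y[4, :4])                   # ... differ from interior ones, encoding absolute position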
[connected, exploit, encode, action, recognition, include, recurrent, koray] [boundary, fully, cnn, region, pooling, van, location, object, alexander, ross, feature, semantic, kaiming, global, detection, pascal] [adding, robustness, model] [convolutional, ieee, pattern, spatial, cnns, cropping, journal, convolution, scale, analysis, exploiting, receptive, patch, residual, scattering, signal] [image, invariant, learn, translation] [neural, deep, learning, processing, arxiv, preprint, network, machine, data, augmentation, andrew, training, yann, max, statistical, yoshua, alex, filter, huge, large, average, vector, architecture] [conference, computer, vision, international, absolute, local, position, equivariant, daniel, geometric, equivariance, thomas, rotation, andrea, european, david]
@InProceedings{Kayhan_2020_CVPR,
  author = {Kayhan, Osman Semih and Gemert, Jan C. van},
  title = {On Translation Invariance in CNNs: Convolutional Layers Can Exploit Absolute Spatial Location},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Diverse Image Generation via Self-Conditioned GANs
Steven Liu, Tongzhou Wang, David Bau, Jun-Yan Zhu, Antonio Torralba


We introduce a simple but effective unsupervised method for generating diverse images. We train a class-conditional GAN model without using manually annotated class labels. Instead, our model is conditional on labels automatically derived from clustering in the discriminator's feature space. Our clustering step automatically discovers diverse modes, and explicitly requires the generator to cover them. Experiments on standard mode collapse benchmarks show that our method outperforms several competing methods when addressing mode collapse. Our method also performs well on large-scale datasets such as ImageNet and Places365, improving both diversity and standard metrics (e.g., Frechet Inception Distance), compared to previous methods.
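The conditioning labels come from periodically clustering the discriminator's features of real images; the generator and discriminator are then trained class-conditionally on these pseudo-labels. A sketch of the clustering step with scikit-learn is shown below; the feature-extraction hook and the re-clustering schedule are simplified assumptions.

import numpy as np
from sklearn.cluster import KMeans

def compute_pseudo_labels(features, k=50, seed=0):
    # Cluster discriminator features of real images into k pseudo-classes.
    # features: (N, D) array; returns (labels, fitted KMeans model).
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(features), km

# during training this would be re-run periodically on fresh features
feats = np.random.randn(1000, 128).astype(np.float32)
labels, km = compute_pseudo_labels(feats, k=10)
print(np.bincount(labels))        # cluster sizes used as conditioning classes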
[previous, dataset, conditioning, automatically] [score, feature, table] [adversarial, quality, model, trained, mnist, datasets, true] [method, figure, reverse, partition, output, high, proposed, stacked] [cluster, gan, generator, discriminator, image, real, generative, gans, generated, mode, conditional, inception, diverse, diversity, unsupervised, unconditional, train, collapse, target, mgan, purity, conditioned, lpips, specific, generation, generating] [training, clustering, number, imagenet, class, data, learning, random, distribution, sample, layer, iteration, standard, compared, better, mixture, subset, online, algorithm, higher, set, architecture, vanilla, arxiv, classifier, nmi] [well, cover, matching, reconstruction, david]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Steven and Wang, Tongzhou and Bau, David and Zhu, Jun-Yan and Torralba, Antonio},
  title = {Diverse Image Generation via Self-Conditioned GANs},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Inducing Hierarchical Compositional Model by Sparsifying Generator Network
Xianglei Xing, Tianfu Wu, Song-Chun Zhu, Ying Nian Wu


This paper proposes to learn a hierarchical compositional AND-OR model for interpretable image synthesis by sparsifying the generator network. The proposed method adopts the scene-objects-parts-subparts-primitives hierarchy in image representation. A scene has different types (i.e., OR), each of which consists of a number of objects (i.e., AND). This can be recursively formulated across the scene-objects-parts-subparts hierarchy and is terminated at the primitive level (e.g., wavelet-like basis). To realize this AND-OR hierarchy in image synthesis, we learn a generator network that consists of the following two components: (i) each layer of the hierarchy is represented by an over-complete set of convolutional basis functions, with off-the-shelf convolutional neural architectures exploited to implement the hierarchy; (ii) sparsity-inducing constraints are introduced in end-to-end training, which induces a sparsely activated and sparsely connected AND-OR model from the initially densely connected generator network. A straightforward sparsity-inducing constraint is utilized, namely allowing only the top-k basis functions to be activated at each layer (where k is a hyper-parameter). The learned basis functions are also capable of image reconstruction to explain the input images. In experiments, the proposed method is tested on four benchmark datasets. The results show that meaningful and interpretable hierarchical representations are learned, with better image synthesis and reconstruction quality than the baselines.
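The sparsity-inducing constraint keeps only the k largest responses of each layer's basis functions and zeroes the rest, so that an AND-OR structure emerges from the initially dense generator. A minimal PyTorch sketch of such a top-k gate is given below; applying it channel-wise is an assumption made for illustration.

import torch

def topk_mask(x, k):
    # Keep the k largest entries along the channel dimension of a
    # (B, C, H, W) activation map and zero out the rest.
    _, idx = x.topk(k, dim=1)
    mask = torch.zeros_like(x).scatter_(1, idx, 1.0)
    return x * mask

x = torch.randn(2, 16, 4, 4)
y = topk_mask(x, k=3)
print(int((y[0, :, 0, 0] != 0).sum()))  # 3 active channels at this location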
[hierarchical, connected, work, dataset, natural, compositional, step, instantiated, three] [table, car, object, ali] [model, internal, adversarial, face, original, fashion] [proposed, figure, method, convolutional, tree, traditional, coding, consecutive, based] [image, generator, synthesis, meaningful, generative, interpretable, learn, semantically, latent, generated, code, bedroom, generation, gans, generate, synthesized, gan, train, celeba, representation] [layer, learned, network, learning, hierarchy, training, sparsity, log, deep, vector, number, better, process, set, neural, update, ying, nian, principle, vanilla, consider, algorithm, andrew, sparsifying] [basis, sparse, reconstruction, sparsely, dense, reconstructed, human, compare, david]
@InProceedings{Xing_2020_CVPR,
  author = {Xing, Xianglei and Wu, Tianfu and Zhu, Song-Chun and Wu, Ying Nian},
  title = {Inducing Hierarchical Compositional Model by Sparsifying Generator Network},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
CARP: Compression Through Adaptive Recursive Partitioning for Multi-Dimensional Images
Rongjie Liu, Meng Li, Li Ma


Fast and effective image compression for multi-dimensional images has become increasingly important for efficient storage and transfer of massive amounts of high resolution images and videos. Desirable properties in compression methods include (1) high reconstruction quality at a wide range of compression rates while preserving key local details, (2) computational scalability, (3) applicability to a variety of different image/video types and of different dimensions, and (4) ease of tuning. We present such a method for multi-dimensional image compression called Compression via Adaptive Recursive Partitioning (CARP). CARP uses an optimal permutation of the image pixels inferred from a Bayesian probabilistic model on recursive partitions of the image to reduce its effective dimensionality, achieving a parsimonious representation that preserves information. CARP uses a multi-layer Bayesian hierarchical model to achieve self-tuning and regularization to avoid overfitting-- resulting in one single parameter to be specified by the user to achieve the desired compression rate. Extensive numerical experiments using a variety of datasets including 2D ImageNet, 3D medical image, and real-life YouTube and surveillance videos show that CARP dominates the state-of-the-art compression approaches-- including JPEG, JPEG2000, MPEG4, and a neural network-based method--for all of these different image types and often on nearly all of the individual images.
[video, dataset, youtube, frame, hierarchical, individual, decoder, encoding, time] [level, region, including, regression] [jpeg, model, original, rdp] [compression, carp, psnr, figure, wavelet, recursive, transform, adaptive, tree, compressed, method, spatial, applicable, partition, pixel, optimized, huffman, symbol, low, journal, based, prior] [image, surveillance, representation, user, desired] [ratio, bayesian, permutation, posterior, achieve, imagenet, learning, efficient, performance, set, optimal, computational, processing, average, parameter, space, random, deep, standard, probability, pruning, statistical, probabilistic, neural, induced, maximum, distribution] [partitioning, reconstructed, compare, variety, single, computer, local, vision, directly]
@InProceedings{Liu_2020_CVPR,
  author = {Liu, Rongjie and Li, Meng and Ma, Li},
  title = {CARP: Compression Through Adaptive Recursive Partitioning for Multi-Dimensional Images},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GrappaNet: Combining Parallel Imaging With Deep Learning for Multi-Coil MRI Reconstruction
Anuroop Sriram, Jure Zbontar, Tullie Murrell, C. Lawrence Zitnick, Aaron Defazio, Daniel K. Sodickson


Magnetic Resonance Imaging (MRI) acquisition is an inherently slow process, which has spurred the development of two different acceleration methods: acquiring multiple correlated samples simultaneously (parallel imaging) and acquiring fewer samples than necessary for traditional signal processing methods (compressed sensing). Both methods provide complementary approaches to accelerating MRI acquisition. In this paper, we present a novel method to integrate traditional parallel imaging methods into deep neural networks that is able to generate high quality reconstructions even for high acceleration factors. The proposed method, called GrappaNet, performs progressive reconstruction by first mapping the reconstruction problem to a simpler one that can be solved by traditional parallel imaging methods using a neural network, followed by an application of a parallel imaging method, and finally fine-tuning the output with another neural network. The entire network can be trained end-to-end. We present experimental results on the recently released fastMRI dataset and show that GrappaNet can generate higher quality reconstructions than competing methods for both 4x and 8x acceleration.
[work, dataset, multiple, previous, observed] [] [model, sensitivity, input, quality, trained, fat, complementary, experimental] [parallel, imaging, mri, grappanet, grappa, convolutional, magnetic, resonance, transform, coil, figure, inverse, classical, fourier, traditional, compressed, accelerated, acquisition, signal, high, fastmri, sensing, frequency, net, method, called, aliasing, captured, applying, knee, pdfs, ieee, acquiring, proposed, output, receiver, performed] [image, variational, domain, generate] [network, data, deep, baseline, neural, learning, acceleration, applied, training, layer, problem, higher, space, optimization, number, process, processing, architecture] [reconstruction, approach, estimate, michael, combine, single, daniel, scan, measured, ground, application]
@InProceedings{Sriram_2020_CVPR,
  author = {Sriram, Anuroop and Zbontar, Jure and Murrell, Tullie and Zitnick, C. Lawrence and Defazio, Aaron and Sodickson, Daniel K.},
  title = {GrappaNet: Combining Parallel Imaging With Deep Learning for Multi-Coil MRI Reconstruction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Can Weight Sharing Outperform Random Architecture Search? An Investigation With TuNAS
Gabriel Bender, Hanxiao Liu, Bo Chen, Grace Chu, Shuyang Cheng, Pieter-Jan Kindermans, Quoc V. Le


Efficient Neural Architecture Search methods based on weight sharing have shown good promise in democratizing Neural Architecture Search for computer vision models. There is, however, an ongoing debate whether these efficient methods are significantly better than random search. Here we perform a thorough comparison between efficient and random search methods on a family of progressively larger and more challenging search spaces for image classification and detection on ImageNet and COCO. While the efficacies of both methods are problem-dependent, our experiments demonstrate that there are large, realistic tasks where efficient search methods can provide substantial gains over random search. In addition, we propose and evaluate techniques which improve the quality of searched architectures and reduce the need for manual hyper-parameter tuning.
[reward, time, three, previous, multiple, work, reinforcement] [table, object, final, challenging, focus] [model, quality, acc, improve, trained] [output, ieee, pattern, proposed, method, published, expansion, kernel, valid, based, figure] [image, shared, target] [search, architecture, proxylessnas, neural, random, space, filter, efficient, inference, latency, training, algorithm, learning, weight, find, set, network, sharing, function, inverted, controller, arxiv, preprint, larger, quoc, good, test, large, number, evaluate, searched, mnasnet, close, better, andrew, classification, accuracy, finding, bottleneck, size, rate, validation, small, implementation] [conference, computer, vision, absolute, international, single, cost, allows]
@InProceedings{Bender_2020_CVPR,
  author = {Bender, Gabriel and Liu, Hanxiao and Chen, Bo and Chu, Grace and Cheng, Shuyang and Kindermans, Pieter-Jan and Le, Quoc V.},
  title = {Can Weight Sharing Outperform Random Architecture Search? An Investigation With TuNAS},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Context Aware Graph Convolution for Skeleton-Based Action Recognition
Xikun Zhang, Chang Xu, Dacheng Tao


Graph convolutional models have achieved impressive success on skeleton based human action recognition tasks. As graph convolution is a local operation, it cannot fully investigate non-local joints that could be vital to recognizing the action. For example, actions like typing and clapping require the cooperation of two hands, which are distant from each other in the human skeleton graph. Multiple graph convolutional layers thus tend to be stacked together to increase the receptive field, which introduces computational inefficiency and optimization difficulty. But there is still no guarantee that distant joints (e.g. two hands) can be well integrated. In this paper, we propose a context aware graph convolutional network (CA-GCN). Besides the computation of localized graph convolution, CA-GCN considers a context term for each vertex by integrating information from all other vertices. Long-range dependencies among joints are thus naturally integrated in the context information, which eliminates the need to stack multiple layers to enlarge the receptive field and greatly simplifies the network. Moreover, we further propose an advanced CA-GCN, in which asymmetric relevance measurement and higher-level representations are utilized to compute the context information for more flexibility and better performance. Besides joint features, our CA-GCN can also be extended to handle graphs with edge (limb) features. Extensive experiments on two real-world datasets demonstrate the importance of context information and the effectiveness of the proposed CA-GCN in skeleton based action recognition.
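The context term gives every joint a summary of all other joints, weighted by a learned relevance score, so that long-range dependencies (e.g. the two hands) are captured within a single layer. The sketch below shows one attention-style way to compute such a per-vertex context vector; the exact relevance function used in CA-GCN may differ.

import torch
import torch.nn as nn

class VertexContext(nn.Module):
    # Per-vertex context: a relevance-weighted sum over all other vertices.
    def __init__(self, dim, ctx_dim=64):
        super().__init__()
        self.query = nn.Linear(dim, ctx_dim)
        self.key = nn.Linear(dim, ctx_dim)
        self.value = nn.Linear(dim, ctx_dim)

    def forward(self, x):                  # x: (B, V, dim) joint features
        q, k, v = self.query(x), self.key(x), self.value(x)
        rel = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return rel @ v                     # (B, V, ctx_dim) context per joint

print(VertexContext(dim=32)(torch.randn(8, 25, 32)).shape)  # (8, 25, 64)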
[context, graph, relevance, action, recognition, trainable, kinetics, gcns, denoting, concatenation, temporal, zjl, dataset, skeleton, multiple, considering, integrate, gecnn] [aware, advanced, feature, edge, score, table, map, propose, global, including, level] [model, conduct] [convolution, convolutional, light, proposed, integration, figure, based, version, spatial, ieee, spectral, receptive, performs] [generate] [neural, performance, network, data, learning, inner, function, arxiv, preprint, product, matrix, baseline, set, increase, accuracy, class, higher, better, denote, layer, computation, processing, classification, size] [vertex, human, joint, local, term, distant, depth, hand]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Xikun and Xu, Chang and Tao, Dacheng},
  title = {Context Aware Graph Convolution for Skeleton-Based Action Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning
Thiago M. Paixao, Rodrigo F. Berriel, Maria C. S. Boeres, Alessandro L. Koerich, Claudine Badue, Alberto F. De Souza, Thiago Oliveira-Santos


The reconstruction of shredded documents consists in arranging the pieces of paper (shreds) in order to reassemble the original aspect of such documents. This task is particularly relevant for supporting forensic investigation as documents may contain criminal evidence. As an alternative to the laborious and time-consuming manual process, several researchers have been investigating ways to perform automatic digital reconstruction. A central problem in automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds, notably for binary text documents. In this context, deep learning has enabled great progress for accurate reconstructions in the domain of mechanically-shredded documents. A sensitive issue, however, is that current deep model solutions require an inference whenever a pair of shreds has to be evaluated. This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly (rather than quadratically) with the number of shreds. Instead of predicting compatibility directly, deep models are leveraged to asymmetrically project the raw shred content onto a common metric space in which distance is proportional to the compatibility. Experimental results show that our method has accuracy comparable to the state-of-the-art with a speed-up of about 22 times for a test instance with 505 shreds (20 mixed shredded-pages from different documents).
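The linear-versus-quadratic inference argument can be sketched as follows, assuming two small projection networks (stand-in functions embed_left and embed_right here): each shred is embedded once per side, and every pairwise compatibility is then a distance in the shared metric space. This is an illustration of the scaling idea, not the authors' code.

import numpy as np

def pairwise_compatibility(shreds, embed_left, embed_right):
    """Embed each shred once per side (2N inferences), then score all N^2 pairs.

    embed_left(shred) / embed_right(shred) map raw shred content to a common
    metric space; compatibility is taken as the negative Euclidean distance.
    """
    right = np.stack([embed_right(s) for s in shreds])  # right edge of shred i
    left = np.stack([embed_left(s) for s in shreds])    # left edge of shred j
    # dist[i, j]: distance between shred i's right edge and shred j's left edge
    dist = np.linalg.norm(right[:, None, :] - left[None, :, :], axis=-1)
    return -dist                                        # higher = more compatible

# Toy usage with random "shreds" and a random projection as a stand-in network.
rng = np.random.default_rng(0)
shreds = [rng.random((64, 32)) for _ in range(5)]       # 5 fake binary shreds
proj = rng.standard_normal((64 * 32, 16))
f = lambda s: s.reshape(-1) @ proj
scores = pairwise_compatibility(shreds, embed_left=f, embed_right=f)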
[time, evaluation, text, pair, embedding, work, embeddings, order, shift, automatic, three] [positive, stage, feature] [compatibility, model, trained, digital, experimental, original, scenario] [proposed, method, figure, reconstructing, based, linearly, convolutional, pattern] [document, loss, extracted] [accuracy, shredded, learning, pairwise, number, sample, deep, metric, function, training, optimization, fright, flef, average, search, paper, problem, process, performance, processing, shred, space, cut, data, set, genetic, better, algorithm, required, compared, negative, shredding] [reconstruction, vertical, distance, approach, projection, cost, reconstruct, solution, well, left, require, represented, local]
@InProceedings{Paixao_2020_CVPR,
  author = {Paixao, Thiago M. and Berriel, Rodrigo F. and Boeres, Maria C. S. and Koerich, Alessandro L. and Badue, Claudine and Souza, Alberto F. De and Oliveira-Santos, Thiago},
  title = {Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Revisiting Pose-Normalization for Fine-Grained Few-Shot Recognition
Luming Tang, Davis Wertheimer, Bharath Hariharan


Few-shot, fine-grained classification requires a model to learn subtle, fine-grained distinctions between different classes (e.g., birds) based on a few images alone. This requires a remarkable degree of invariance to pose, articulation and background. A solution is to use pose-normalized representations: first localize semantic parts in each image, and then describe images by characterizing the appearance of each part. While such representations are out of favor for fully supervised classification, we show that they are extremely effective for few-shot fine-grained classification. With a minimal increase in model capacity, pose normalization improves accuracy between 10 and 20 percentage points for shallow and deep architectures, generalizes better to new domains, and is effective for multiple few-shot algorithms and network backbones. Code is available at https://github.com/Tsingularity/PoseNorm_Fewshot.
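A pose-normalized representation of the kind described can be sketched as heatmap-weighted pooling of backbone features, one vector per localized part, concatenated into the final descriptor. The sketch below is a minimal illustration under that assumption; the released code linked above is the authoritative reference.

import numpy as np

def pose_normalized_features(feat, heatmaps, eps=1e-8):
    """feat: (C, H, W) backbone features; heatmaps: (P, H, W) part heatmaps.

    Returns a (P * C,) vector: one heatmap-weighted average of the feature
    map per semantic part, concatenated into a pose-normalized descriptor.
    """
    w = heatmaps / (heatmaps.sum(axis=(1, 2), keepdims=True) + eps)  # normalize each part map
    parts = np.einsum('phw,chw->pc', w, feat)                        # (P, C) per-part features
    return parts.reshape(-1)

# Toy usage: 256-dim features on a 7x7 grid with 4 predicted parts.
rng = np.random.default_rng(0)
descriptor = pose_normalized_features(rng.random((256, 7, 7)), rng.random((4, 7, 7)))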
[recognition, three, evaluation, visual, bilinear, outperforms, work, dataset] [feature, map, heatmap, shallow, pooling, extractor, bounding, predicted, table, backbone, localization, semantic, object, location, box] [model, query, effective, highly, percentage] [ieee, figure, pattern, based, convolutional, reference, dynamic, proposed] [representation, image, bird, unsupervised, train, learnt, cub] [normalization, learning, classification, network, set, performance, training, accuracy, vector, deep, number, classifier, neural, class, large, labeled, test, base, dref, data, small, standard, prototypical, learned, convnet, deeper, evaluate, drepre] [pose, novel, computer, conference, vision, estimation, estimator, international, consistent, form, ground]
@InProceedings{Tang_2020_CVPR,
  author = {Tang, Luming and Wertheimer, Davis and Hariharan, Bharath},
  title = {Revisiting Pose-Normalization for Fine-Grained Few-Shot Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
RankMI: A Mutual Information Maximizing Ranking Loss
Mete Kemertas, Leila Pishdad, Konstantinos G. Derpanis, Afsaneh Fazly


We introduce an information-theoretic loss function, RankMI, and an associated training algorithm for deep representation learning for image retrieval. Our proposed framework consists of alternating updates to a network that estimates the divergence between distance distributions of matching and non-matching pairs of learned embeddings, and an embedding network that maximizes this estimate via sampled negatives. In addition, under this information-theoretic lens we draw connections between RankMI and commonly-used ranking losses, e.g., triplet loss. We extensively evaluate RankMI on several standard image retrieval datasets, namely, CUB-200-2011, CARS-196, and Stanford Online Products. Our method achieves competitive results or significant improvements over previous reported results on all datasets.
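A heavily simplified sketch of the alternating scheme described above, assuming a Donsker-Varadhan style divergence estimator (as in MINE) applied to the distances of matching versus non-matching embedding pairs: the statistics network T is updated to tighten the bound, then the embedding network is updated to increase the estimate. The paper's exact estimator, sampling strategy, and loss may differ; every name below is illustrative.

import math
import torch
import torch.nn as nn

def dv_bound(T, d_pos, d_neg):
    """Donsker-Varadhan style lower bound on the divergence between the
    distance distributions of matching (d_pos) and non-matching (d_neg) pairs."""
    e_p = T(d_pos.unsqueeze(1)).mean()
    e_q = torch.logsumexp(T(d_neg.unsqueeze(1)), dim=0).squeeze() - math.log(len(d_neg))
    return e_p - e_q

# Alternate between tightening the bound w.r.t. the statistics network T and
# pushing the embedding network to increase the estimate, i.e. to separate
# the two distance distributions.
embed = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
T = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
opt_T = torch.optim.Adam(T.parameters(), lr=1e-3)
opt_E = torch.optim.Adam(embed.parameters(), lr=1e-3)

x_anchor, x_match, x_nonmatch = torch.randn(128, 32), torch.randn(128, 32), torch.randn(128, 32)
for _ in range(10):
    za, zp, zn = embed(x_anchor), embed(x_match), embed(x_nonmatch)
    d_pos, d_neg = (za - zp).norm(dim=1), (za - zn).norm(dim=1)
    loss_T = -dv_bound(T, d_pos.detach(), d_neg.detach())   # update the estimator
    opt_T.zero_grad(); loss_T.backward(); opt_T.step()
    loss_E = -dv_bound(T, d_pos, d_neg)                     # update the embeddings
    opt_E.zero_grad(); loss_E.backward(); opt_E.step()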
[embedding, retrieval, embeddings, work, pair, evaluation, previous] [positive, propose, score] [ensemble, example] [pattern, figure, high, proposed, method, based, low] [loss, image, learn, representation, variational, common] [deep, learning, negative, network, sampling, training, mutual, rankmi, neural, function, triplet, metric, lower, dij, ranking, quadruplet, divergence, margin, algorithm, stanford, sample, online, class, standard, bound, lrankmi, nmi, fastap, machine, pairwise, product, data, fixed, random, batch, procedure, number, processing, draw] [conference, computer, distance, international, vision, estimate, matching, european]
@InProceedings{Kemertas_2020_CVPR,
  author = {Kemertas, Mete and Pishdad, Leila and Derpanis, Konstantinos G. and Fazly, Afsaneh},
  title = {RankMI: A Mutual Information Maximizing Ranking Loss},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Memory-Guided Normality for Anomaly Detection
Hyunjong Park, Jongyoun Noh, Bumsub Ham


We address the problem of anomaly detection, that is, detecting anomalous events in a video sequence. Anomaly detection methods based on convolutional neural networks (CNNs) typically leverage proxy tasks, such as reconstructing input video frames, to learn models describing normality without seeing anomalous samples at training time, and quantify the extent of abnormalities using the reconstruction error at test time. The main drawbacks of these approaches are that they do not consider the diversity of normal patterns explicitly, and that the powerful representation capacity of CNNs allows them to reconstruct abnormal video frames. To address this problem, we present an unsupervised learning approach to anomaly detection that considers the diversity of normal patterns explicitly, while lessening the representation capacity of CNNs. To this end, we propose to use a memory module with a new update scheme where items in the memory record prototypical patterns of normal data. We also present novel feature compactness and separateness losses to train the memory, boosting the discriminative power of both memory items and deeply learned features from normal data. Experimental results on standard benchmarks demonstrate the effectiveness and efficiency of our approach, which outperforms the state of the art.
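The memory read and the two feature losses can be sketched compactly. The version below is illustrative and makes assumptions (cosine-similarity addressing; compactness/separateness defined with the nearest and second-nearest items in a triplet-like form); the paper's exact read/update scheme may differ.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(queries, memory):
    """queries: (N, C), memory: (M, C). Read = attention-weighted sum of items."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    attn = softmax(q @ m.T)                 # (N, M) addressing weights
    return attn @ memory, attn

def feature_losses(queries, memory, margin=1.0):
    """Compactness pulls each query to its nearest item; separateness pushes
    the nearest and second-nearest items apart (triplet-style, illustrative)."""
    d = np.linalg.norm(queries[:, None, :] - memory[None, :, :], axis=-1)  # (N, M)
    order = np.argsort(d, axis=1)
    d1 = d[np.arange(len(queries)), order[:, 0]]   # distance to nearest item
    d2 = d[np.arange(len(queries)), order[:, 1]]   # distance to 2nd nearest
    compactness = (d1 ** 2).mean()
    separateness = np.maximum(d1 - d2 + margin, 0.0).mean()
    return compactness, separateness

# Toy usage: 8 query features against a memory of 10 prototypical items.
rng = np.random.default_rng(0)
reads, attn = memory_read(rng.random((8, 32)), rng.random((10, 32)))
losses = feature_losses(rng.random((8, 32)), rng.random((10, 32)))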
[video, frame, separateness, prediction, future, qkt, state, avenue, abnormality, decoder, recording, extract, individual] [feature, detection, module, cuhk, propose, map, score, fps, center] [model, query, item, input, auc] [method, based, convolutional, shanghaitech, read, figure, viewed] [loss, unsupervised, discriminative, train, representation, diverse, image, encoder] [memory, anomaly, abnormal, learning, ucsd, test, deep, anomalous, training, update, prototypical, performance, size, compactness, network, average, set, weighted, neural, capacity, record, updating, task, denote, typically, consider, number, mapped, data] [normal, reconstruction, nearest, compute, reconstruct, approach, matching, second, allows, reconstructed, sparse]
@InProceedings{Park_2020_CVPR,
  author = {Park, Hyunjong and Noh, Jongyoun and Ham, Bumsub},
  title = {Learning Memory-Guided Normality for Anomaly Detection},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Appearance Shock Grammar for Fast Medial Axis Extraction From Real Images
Charles-Olivier Dufresne Camaro, Morteza Rezanejad, Stavros Tsogkas, Kaleem Siddiqi, Sven Dickinson


We combine ideas from shock graph theory with more recent appearance-based methods for medial axis extraction from complex natural scenes, improving upon the present best unsupervised method, in terms of efficiency and performance. We make the following specific contributions: i) we extend the shock graph representation to the domain of real images, by generalizing the shock type definitions using local, appearance-based criteria; ii) we then use the rules of a Shock Grammar to guide our search for medial points, drastically reducing run time when compared to other methods, which exhaustively consider all points in the input image; iii) we remove the need for typical post-processing steps including thinning, non-maximum suppression, and grouping, by adhering to the Shock Grammar rules while deriving the medial axis solution; iv) finally, we raise some fundamental concerns with the evaluation scheme used in previous work and propose a more appropriate alternative for assessing the performance of medial axis extraction from scenes. Our experiments on the BMAX500 and SK-LARGE datasets demonstrate the effectiveness of our approach. We outperform the present state-of-the-art, excelling particularly in the high-precision regime, while running an order of magnitude faster and requiring no post-processing.
[skeleton, natural, graph, asg, multiple, evaluation, recognition, connected, work, visual] [medial, shock, object, seed, amat, grammar, kaleem, sven, boundary, tsogkas, segmentation, detection, table, stavros, branch, wei, benjamin, recall, ligature, fully, birth] [type, theory, protocol] [ieee, extraction, scale, pattern, figure, valid, based, color, journal, proposed] [image, unsupervised, domain, supervised, appearance] [problem, algorithm, function, binary, search, consider, performance, space, candidate] [axis, computer, cost, point, local, vision, conference, disk, shape, international, scene, approach, symmetry, growth, ground, rgb, centered, define, truth, compute, allows, grow]
@InProceedings{Camaro_2020_CVPR,
  author = {Camaro, Charles-Olivier Dufresne and Rezanejad, Morteza and Tsogkas, Stavros and Siddiqi, Kaleem and Dickinson, Sven},
  title = {Appearance Shock Grammar for Fast Medial Axis Extraction From Real Images},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Generalizing Hand Segmentation in Egocentric Videos With Uncertainty-Guided Model Adaptation
Minjie Cai, Feng Lu, Yoichi Sato


Although the performance of hand segmentation in egocentric videos has been significantly improved by using CNNs, it still remains a challenging issue to generalize the trained models to new domains, e.g., unseen environments. In this work, we solve the hand segmentation generalization problem without requiring segmentation labels in the target domain. To this end, we propose a Bayesian CNN-based model adaptation framework for hand segmentation, which introduces and considers two key factors: 1) prediction uncertainty when the model is applied in a new domain and 2) common information about hand shapes shared across domains. Consequently, we propose an iterative self-training method for hand segmentation in the new domain, which is guided by the model uncertainty estimated by a Bayesian CNN. We further use an adversarial component in our framework to utilize shared information about hand shapes to constrain the model adaptation process. Experiments on multiple egocentric datasets show that the proposed method significantly improves the generalization performance of hand segmentation.
[egocentric, dataset, prediction, multiple, work] [segmentation, cnn, semantic, map, denotes, framework, improves, region, propose, miou, challenging] [model, adversarial, generalization, iterative, trained, datasets, study, input] [method, proposed, figure, ieee, based, prior, pattern, resolution] [adaptation, domain, target, source, unsupervised, yhg, image, egohands, dhs, generalize, unseen, loss, utg, common, selftraining] [bayesian, performance, learning, training, data, stochastic, network, probability, forward, neural, deep, number, standard, inference, distribution, iteration, set, adapt, test, machine, learned, better, procedure, large, task, compared, confident] [hand, uncertainty, conference, computer, shape, international, vision, estimated, well]
@InProceedings{Cai_2020_CVPR,
  author = {Cai, Minjie and Lu, Feng and Sato, Yoichi},
  title = {Generalizing Hand Segmentation in Egocentric Videos With Uncertainty-Guided Model Adaptation},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
DeFeat-Net: General Monocular Depth via Simultaneous Unsupervised Representation Learning
Jaime Spencer, Richard Bowden, Simon Hadfield


In the current monocular depth research, the dominant approach is to employ unsupervised training on large datasets, driven by warped photometric consistency. Such approaches lack robustness and are unable to generalize to challenging domains such as nighttime scenes or adverse weather conditions where assumptions about photometric consistency break down. We propose DeFeat-Net (Depth & Feature network), an approach to simultaneously learn a cross-domain dense feature representation, alongside a robust depth-estimation framework based on warped feature consistency. The resulting feature representation is learned in an unsupervised manner with no explicit ground-truth correspondences required. We show that within a single domain, our technique is comparable to both the current state of the art in monocular depth estimation and supervised feature representation learning. However, by simultaneously learning features, depth and motion, our technique is able to generalize to challenging domains, allowing DeFeat-Net to outperform the current state-of-the-art with around 10% reduction in all error measures on more challenging sequences such as nighttime driving.
[dataset, artificial, evaluation, order, visual, previous, current, day] [feature, challenging, map, table, predicted, detection, positive, fully] [robust, trained, technique, nov, improve] [ieee, pattern, convolutional, society, proposed, science, based, warp, figure, flow, weather, disparity, introduced, method, optical, night] [unsupervised, loss, image, learn, representation, consistency, target] [learning, deep, network, training, neural, negative, performance, space, contrastive, learned, machine, processing, support] [depth, computer, monocular, conference, lecture, volume, vision, estimation, dense, photometric, robotcar, single, error, matching, kitti, ground, stereo, subseries, intelligence, international, nighttime, defeat, approach, truth, scene, correspondence, local, jointly, richard, geometry]
@InProceedings{Spencer_2020_CVPR,
  author = {Spencer, Jaime and Bowden, Richard and Hadfield, Simon},
  title = {DeFeat-Net: General Monocular Depth via Simultaneous Unsupervised Representation Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Visual Motion Segmentation Using Event Surfaces
Anton Mitrokhin, Zhiyuan Hua, Cornelia Fermuller, Yiannis Aloimonos


Event-based cameras have been designed for scene motion perception - their high temporal resolution and spatial data sparsity convert the scene into a volume of boundary trajectories and allow us to track and analyze the evolution of the scene in time. Analyzing this data is computationally expensive, and there is a substantial lack of theory on dense-in-time object motion to guide the development of new algorithms; hence, many works resort to a simple solution of discretizing the event stream and converting it to classical pixel maps, which allows for the application of conventional image processing methods. In this work we present a Graph Convolutional neural network for the task of scene motion segmentation by a moving camera. We convert the event stream into a 3D graph in (x,y,t) space and keep per-event temporal information. The difficulty of the task stems from the fact that, unlike in metric space, the shape of an object in (x,y,t) space depends on its motion and is not the same across the dataset. We discuss properties of the event data with respect to this 3D recognition problem, and show that our Graph Convolutional architecture is superior to PointNet++. We evaluate our method on the state-of-the-art event-based motion segmentation dataset - EV-IMO - and perform comparisons to a frame-based method proposed by its authors. Our ablation studies show that increasing the event slice width improves accuracy, and show how subsampling and edge configurations affect network performance.
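Constructing the (x, y, t) graph from an event stream can be sketched with a KD-tree radius query, with the time axis rescaled so that a single connection radius is meaningful. This is a hypothetical illustration of the preprocessing described above, not the authors' pipeline; radius and time_scale are made-up values.

import numpy as np
from scipy.spatial import cKDTree

def events_to_graph(events, radius=5.0, time_scale=1e4):
    """events: (N, 3) array of (x, y, t) with t in seconds.

    Returns node coordinates and an edge list connecting events that are
    close in the scaled (x, y, t) volume.
    """
    nodes = events.astype(float).copy()
    nodes[:, 2] *= time_scale                            # bring time into pixel-like units
    tree = cKDTree(nodes)
    pairs = np.array(sorted(tree.query_pairs(r=radius))) # (E, 2) undirected edges
    return nodes, pairs

# Toy usage: 1000 random events over 10 ms on a 260x346 sensor.
rng = np.random.default_rng(0)
ev = np.column_stack([rng.integers(0, 346, 1000),
                      rng.integers(0, 260, 1000),
                      rng.uniform(0, 0.01, 1000)])
nodes, edges = events_to_graph(ev)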
[temporal, time, graph, work, dataset, multiple, connected, trajectory, three, moving, provide, making] [object, segmentation, edge, feature, gconv, occlusion, table, corner, global, ablation, tracking, detection, bottom] [input] [event, motion, flow, optical, slice, ieee, pixel, subsampling, pattern, convolutional, figure, cornelia, yiannis, high, analysis, dynamic, spatial, parallel, asynchronous, fast] [image, perform] [network, learning, neural, data, large, training, augmentation, metric, better, layer, validation, processing, task, architecture, width] [point, cloud, local, conference, vision, camera, scene, shape, surface, international, radius, computer, approach, normal, single, structure, second, allows, estimation, plane, axis, full]
@InProceedings{Mitrokhin_2020_CVPR,
  author = {Mitrokhin, Anton and Hua, Zhiyuan and Fermuller, Cornelia and Aloimonos, Yiannis},
  title = {Learning Visual Motion Segmentation Using Event Surfaces},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction
Abduallah Mohamed, Kun Qian, Mohamed Elhoseiny, Christian Claudel


Better machine understanding of pedestrian behaviors enables faster progress in modeling interactions between agents such as autonomous vehicles and humans. Pedestrian trajectories are not only influenced by the pedestrian itself but also by interaction with surrounding objects. Previous methods modeled these interactions by using a variety of aggregation methods that integrate different learned pedestrian states. We propose the Social Spatio-Temporal Graph Convolutional Neural Network (Social-STGCNN), which obviates the need for aggregation methods by modeling the interactions as a graph. Our results show an improvement over the state of the art by 20% on the Final Displacement Error (FDE) and an improvement on the Average Displacement Error (ADE), with 8.5 times fewer parameters and up to 48 times faster inference speed than previously reported methods. In addition, our model is data efficient, and exceeds the previous state of the art on the ADE metric with only 20% of the training data. We propose a kernel function to embed the social interactions between pedestrians within the adjacency matrix. Through qualitative analysis, we show that our model inherits social behaviors that can be expected between pedestrian trajectories. Code is available at https://github.com/abduallahmohamed/Social-STGCNN.
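The kernel-weighted adjacency matrix can be sketched directly. The inverse-distance kernel below is an assumption about the specific kernel (closer pedestrians receive larger weights); the symmetric normalization is the standard GCN one.

import numpy as np

def social_adjacency(positions, eps=1e-8):
    """positions: (N, 2) pedestrian locations at one time step.

    a_ij = 1 / ||p_i - p_j|| for i != j (closer pedestrians interact more),
    then A is symmetrically normalized as in a standard GCN layer.
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    A = np.where(dist > eps, 1.0 / np.maximum(dist, eps), 0.0)  # zero on the diagonal
    A = A + np.eye(len(positions))                              # self-loops
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt

# Toy usage: 4 pedestrians at one time step.
A_hat = social_adjacency(np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 2.0], [4.0, 4.0]]))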
[graph, social, trajectory, recurrent, prediction, adjacency, time, aij, previous, fde, ade, future, temporal, predict, walking, modeling, observed, speed, work, embedding, vtj, state, predicting, socialstgcnn, people] [pedestrian, aggregation, predicted, table, autonomous, pooling] [model, trained] [kernel, convolution, figure, convolutional, ieee, based, parallel, cnns, analysis, motion, gaussian, designed] [representation, generative] [function, matrix, neural, data, training, inference, equation, size, set, performance, network, deep, compared, number, distribution, layer, weighted, arxiv, preprint, better, operation] [defined, conference, computer, collision, error, international, ground, vision, human, direction, scene, predicts]
@InProceedings{Mohamed_2020_CVPR,
  author = {Mohamed, Abduallah and Qian, Kun and Elhoseiny, Mohamed and Claudel, Christian},
  title = {Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Discriminative Multi-Modality Speech Recognition
Bo Xu, Cheng Lu, Yandong Guo, Jacob Wang


Vision is often used as a complementary modality for audio speech recognition (ASR), especially in noisy environments where the performance of the audio-only modality deteriorates significantly. After combining the visual modality, ASR is upgraded to multi-modality speech recognition (MSR). In this paper, we propose a two-stage speech recognition model. In the first stage, the target voice is separated from background noises with help from the corresponding visual information of lip movements, making the model 'listen' clearly. In the second stage, the audio modality is combined with the visual modality again to better understand the speech via an MSR sub-network, further improving the recognition rate. There are some other key contributions: we introduce a pseudo-3D residual convolution (P3D)-based visual front-end to extract more discriminative features; we upgrade the temporal convolution block from 1D ResNet to the temporal convolutional network (TCN), which is more suitable for temporal tasks; the MSR sub-network is built on top of the Element-wise-Attention Gated Recurrent Unit (EleAtt-GRU), which is more effective than the Transformer on long sequences. We conducted extensive experiments on the LRS3-TED and LRW datasets. Our two-stage model (audio-enhanced multi-modality speech recognition, AE-MSR) consistently achieves state-of-the-art performance by a significant margin, which demonstrates the necessity and effectiveness of AE-MSR.
[audio, speech, visual, recognition, msr, temporal, modality, lip, reading, awareness, video, word, lrw, decoder, dataset, combining, context, wer, joon, son, recurrent, unit, stream, tcn, eleattgru, sequence, spectrogram, extract, built, hearing] [table, resnet, feature, fed, propose] [model, magnitude, noise, trained, help] [enhancement, convolution, noisy, enhanced, ieee, method, block, figure, double, signal, residual, convolutional, demonstrates, based, proposed] [encoder, consists, introduce, image, produce, snr, target] [network, neural, performance, deep, learning, training, andrew, arxiv, preprint, layer, machine, classification, rate, architecture, number, processing, benefit] [conference, single, computer, international, vision, second, error, demonstrate]
@InProceedings{Xu_2020_CVPR,
  author = {Xu, Bo and Lu, Cheng and Guo, Yandong and Wang, Jacob},
  title = {Discriminative Multi-Modality Speech Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Clean-Label Backdoor Attacks on Video Recognition Models
Shihao Zhao, Xingjun Ma, Xiang Zheng, James Bailey, Jingjing Chen, Yu-Gang Jiang


Deep neural networks (DNNs) are vulnerable to backdoor attacks which can hide backdoor triggers in DNNs by poisoning training data. A backdoored model behaves normally on clean test images, yet consistently predicts a particular target class for any test examples that contain the trigger pattern. As such, backdoor attacks are hard to detect, and have raised severe security concerns in real-world applications. Thus far, backdoor research has mostly been conducted in the image domain with image classification models. In this paper, we show that existing image backdoor attacks are far less effective on videos, and outline 4 strict conditions where existing attacks are likely to fail: 1) scenarios with more input dimensions (e.g. videos), 2) scenarios with high resolution, 3) scenarios with a large number of classes and few examples per class (a "sparse dataset"), and 4) attacks with access to correct labels (e.g. clean-label attacks). We propose the use of a universal adversarial trigger as the backdoor trigger to attack video recognition models, a situation where backdoor attacks are likely to be challenged by the above 4 strict conditions. We show on benchmark video datasets that our proposed backdoor attack can manipulate state-of-the-art video models with high success rates by poisoning only a small proportion of training data (without changing the labels). We also show that our proposed backdoor attack is resistant to state-of-the-art backdoor defense/detection methods, and can even be applied to improve image backdoor attacks. Our proposed video backdoor attack not only serves as a strong baseline for improving the robustness of video models, but also provides a new perspective for understanding more powerful backdoor attacks.
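The clean-label poisoning step itself is easy to sketch: stamp a (pre-computed) trigger onto a small fraction of target-class training clips without touching their labels. The code below is illustrative only; generating the universal adversarial trigger, which is the substance of the attack, is not shown, and the corner placement and poisoning rate are assumptions.

import numpy as np

def poison_clean_label(videos, labels, trigger, target_class, rate=0.05, seed=0):
    """videos: (N, T, H, W, C) in [0, 1]; trigger: (h, w, C) patch.

    Adds the trigger to the bottom-right corner of every frame of a randomly
    chosen `rate` fraction of target-class clips. Labels are left unchanged
    (clean-label), so the poisoned clips still look correctly labeled.
    """
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(labels == target_class)
    chosen = rng.choice(idx, size=max(1, int(rate * len(idx))), replace=False)
    h, w, _ = trigger.shape
    poisoned = videos.copy()
    poisoned[chosen, :, -h:, -w:, :] = trigger        # stamp trigger on each frame
    return poisoned, chosen

# Toy usage: 20 clips, 5 classes, a random 4x4 "trigger" patch.
rng = np.random.default_rng(1)
vids = rng.random((20, 16, 32, 32, 3)); labels = np.arange(20) % 5
poisoned, which = poison_clean_label(vids, labels, rng.random((4, 4, 3)), target_class=0)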
[video, recognition, static, dataset, powerful] [table, detection, propose, apply] [trigger, backdoor, attack, adversarial, model, universal, poisoning, success, poisoned, clean, strict, perturbation, targeted, percentage, effective, perturbed, datasets, xingjun, input, trained, james, improve, turner, choose, backdoored, poison, original, resistance, resistant] [proposed, pattern, figure, existing, optical, flow, high, spectral, method] [target, image, loss, generated, generate] [class, training, size, test, data, rate, neural, deep, set, learning, randomly, fixed, performance, find, uniform, baseline, classification, number, applied, space, network, arxiv, small] [rgb]
@InProceedings{Zhao_2020_CVPR,
  author = {Zhao, Shihao and Ma, Xingjun and Zheng, Xiang and Bailey, James and Chen, Jingjing and Jiang, Yu-Gang},
  title = {Clean-Label Backdoor Attacks on Video Recognition Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Detecting Adversarial Samples Using Influence Functions and Nearest Neighbors
Gilad Cohen, Guillermo Sapiro, Raja Giryes


Deep neural networks (DNNs) are notorious for their vulnerability to adversarial attacks, which are small perturbations added to their input images to mislead their prediction. Detection of adversarial examples is, therefore, a fundamental requirement for robust classification frameworks. In this work, we present a method for detecting such adversarial attacks, which is suitable for any pre-trained neural network classifier. We use influence functions to measure the impact of every training sample on the validation set data. From the influence scores, we find the most supportive training samples for any given validation example. A k-nearest neighbor (k-NN) model fitted on the DNN's activation layers is employed to search for the ranking of these supporting training samples. We observe that these samples are highly correlated with the nearest neighbors of the normal inputs, while this correlation is much weaker for adversarial inputs. We train an adversarial detector using the k-NN ranks and distances and show that it successfully distinguishes adversarial examples, getting state-of-the-art results on six attack methods with three datasets. Code is available at https://github.com/giladcohen/NNIF_adv_defense.
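A minimal sketch of the detection statistic described above, under simplifying assumptions: given the indices of a test input's most supportive training samples (from influence scores) and the distances to all training samples in DNN activation space, we measure how highly those supportive samples rank among the nearest neighbors. Such rank/distance features would then be fed to a simple detector (e.g. logistic regression). This is an illustrative reconstruction, not the released code.

import numpy as np

def knn_rank_features(supportive_idx, activations_train, activation_test):
    """supportive_idx: indices of the most helpful training samples (from
    influence scores); activations_train: (N, D); activation_test: (D,).

    Returns (mean k-NN rank of the supportive samples, mean distance of the
    supportive samples to the test input). For normal inputs these samples
    tend to be close and low-ranked; for adversarial inputs the correlation
    is much weaker.
    """
    d = np.linalg.norm(activations_train - activation_test[None, :], axis=1)
    order = np.argsort(d)                       # training indices, nearest first
    rank = np.empty(len(d), dtype=int)
    rank[order] = np.arange(len(d))             # rank[i] = k-NN rank of sample i
    sup = np.asarray(supportive_idx)
    return rank[sup].mean(), d[sup].mean()

# Toy usage: the rank/distance pair per activation layer would be the detector input.
rng = np.random.default_rng(0)
feats = knn_rank_features(supportive_idx=[3, 7, 11],
                          activations_train=rng.random((100, 64)),
                          activation_test=rng.random(64))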
[embedding, three, dataset, time, natural] [detection, detector, table, feature, employed, sota, correlation] [adversarial, attack, nnif, dnn, influence, helpful, model, dknn, lid, input, defense, detecting, robustness, testing, deepfool, auc, dnns, example, decision, fgsm, trained, ztest, robust, harmful, dadv, nicolas, ian, attacker, pgd, ead, dnorm, penultimate, vulnerability, fitted] [method, based, proposed] [image, loss, train] [training, validation, deep, neural, set, mahalanobis, test, activation, space, network, sample, learning, layer, applied, classifier, number, data, classification, search, algorithm, selected, find, gradient, calculated, accuracy, measure] [nearest, normal, distance]
@InProceedings{Cohen_2020_CVPR,
  author = {Cohen, Gilad and Sapiro, Guillermo and Giryes, Raja},
  title = {Detecting Adversarial Samples Using Influence Functions and Nearest Neighbors},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Unsupervised Model Personalization While Preserving Privacy and Scalability: An Open Problem
Matthias De Lange, Xu Jia, Sarah Parisot, Ales Leonardis, Gregory Slabaugh, Tinne Tuytelaars


This work investigates the task of unsupervised model personalization, adapted to continually evolving, unlabeled local user images. We consider the practical scenario where a high capacity server interacts with a myriad of resource-limited edge devices, imposing strong requirements on scalability and local data privacy. We aim to address this challenge within the continual learning paradigm and provide a novel Dual User-Adaptation framework (DUA) to explore the problem. This framework flexibly disentangles user-adaptation into model personalization on the server and local data regularization on the user device, with desirable properties regarding scalability and privacy constraints. First, on the server, we introduce incremental learning of task-specific expert models, subsequently aggregated using a concealed unsupervised user prior. Aggregation avoids retraining, whereas the user prior conceals sensitive raw user data, and grants unsupervised adaptation. Second, local user-adaptation incorporates a domain adaptation point of view, adapting regularizing batch normalization parameters to the user data. We explore various empirical user configurations with different priors in categories and a tenfold of transforms for MIT Indoor Scene recognition, and classify numbers in a combined MNIST and SVHN setup. Extensive experiments yield promising results for data-driven local adaptation and elicit user priors for server adaptation to depend on the model rather than user data. Hence, although user-adaptation remains a challenging open problem, the DUA framework formalizes a principled foundation for personalizing both on server and user device, while maintaining privacy and scalability.
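The local, data-driven side of the adaptation (re-estimating batch-normalization statistics on unlabeled user images, in the spirit of AdaBN) can be sketched in PyTorch. This is a generic illustration of that technique under the stated assumption, not the DUA code.

import torch

@torch.no_grad()
def adapt_batchnorm_stats(model, user_loader, device='cpu'):
    """Re-estimate BatchNorm running statistics on unlabeled user data.

    Resets every BatchNorm layer's running mean/var, then runs forward
    passes in train mode so the statistics are recomputed from the user's
    own images; no labels and no gradient updates are needed.
    """
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None            # use a cumulative moving average
    model.train()
    for images in user_loader:           # unlabeled local user images
        model(images.to(device))
    model.eval()
    return model

# Toy usage with a tiny conv net and random "user" images.
net = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8), torch.nn.ReLU())
loader = [torch.randn(4, 3, 32, 32) for _ in range(3)]
adapt_batchnorm_stats(net, loader)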
[previous, evaluation, aggregating, sequence] [framework, table, final, correlation] [model, privacy, mnist, trained] [raw, prior, dual, output, method, based, figure, adaptive] [user, unsupervised, adaptation, domain, dua, supervised, personalized, loss, imm, mitis, catprior, adapting, fim, adabn, transprior, personalization, subsequently] [data, server, task, learning, training, continual, function, neural, batch, incremental, network, arxiv, preprint, knowledge, deep, performance, labeled, setup, scalable, normalization, weight, parameter, forgetting, validation, storage, gradient, set, precision, alexnet, average, unlabeled, number, computational] [local, single, limited, scalability, locally, approach, mlp, additional]
@InProceedings{Lange_2020_CVPR,
  author = {Lange, Matthias De and Jia, Xu and Parisot, Sarah and Leonardis, Ales and Slabaugh, Gregory and Tuytelaars, Tinne},
  title = {Unsupervised Model Personalization While Preserving Privacy and Scalability: An Open Problem},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
GIFnets: Differentiable GIF Encoding Framework
Innfarn Yoo, Xiyang Luo, Yilin Wang, Feng Yang, Peyman Milanfar


Graphics Interchange Format (GIF) is a widely used image file format. Due to the limited number of palette colors, GIF encoding often introduces color banding artifacts. Traditionally, dithering is applied to reduce color banding, but this introduces dotted-pattern artifacts. To reduce artifacts and provide a better and more efficient GIF encoding, we introduce a differentiable GIF encoding pipeline, which includes three novel neural networks: PaletteNet, DitherNet, and BandingNet. Each of these three networks provides an important functionality within the GIF encoding pipeline. PaletteNet predicts a near-optimal color palette given an input image. DitherNet manipulates the input image to reduce color banding artifacts and provides an alternative to traditional dithering. Finally, BandingNet is designed to detect color banding, and provides a new perceptual loss specifically for GIF images. As far as we know, this is the first fully differentiable GIF encoding pipeline based on deep neural networks and compatible with existing GIF decoders. A user study shows that our algorithm is better than Floyd-Steinberg based GIF encoding.
[encoding, work, visual, prediction] [edge, table, map, final, fully, propose, score, hard] [quality, input, diffusion, original, improve, noise, trained, difference, model, visibility] [banding, palette, gif, perceptual, color, method, palettenet, dithernet, figure, dithering, bandingnet, output, proposed, psnr, june, traditional, artifact, pixel, based, high, november, ieee, extraction, low, dotted, halftoning, ssim, achieved] [image, loss, introduce, produce, train] [training, neural, quantization, network, compared, quantized, number, algorithm, better, deep, reduce, clustering, equation, higher, soft, learning, architecture, discussed, standard, note, evaluate] [error, differentiable, projection, pipeline, rgb, define, defined, point, predicts]
@InProceedings{Yoo_2020_CVPR,
  author = {Yoo, Innfarn and Luo, Xiyang and Wang, Yilin and Yang, Feng and Milanfar, Peyman},
  title = {GIFnets: Differentiable GIF Encoding Framework},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Invariant Representation for Unsupervised Image Restoration
Wenchao Du, Hu Chen, Hongyu Yang


Recently, cross-domain transfer has been applied to unsupervised image restoration tasks. However, directly applying existing frameworks would lead to domain-shift problems in translated images due to a lack of effective supervision. Instead, we propose an unsupervised learning method that explicitly learns an invariant representation from noisy data and reconstructs clear observations. To do so, we introduce discrete disentangling representation and adversarial domain adaptation into a general domain transfer framework, aided by extra self-supervised modules including background and semantic consistency constraints, learning robust representations under dual domain constraints, such as feature and image domains. Experiments on synthetic and real noise removal tasks show the proposed method achieves performance comparable to other state-of-the-art supervised and unsupervised methods, while having faster and more stable convergence than other domain adaptation methods.
[unit, visual, recognition, shift, considering] [semantic, extra, background, including, feature, achieves] [noise, adversarial, clean, effective, adv, poisson, corrupted, model, robust] [method, proposed, ieee, restoration, denoising, removal, based, gaussian, pattern, figure, quantitative, noisy, convolutional, psnr, remove, blur, traditional] [image, domain, unsupervised, representation, consistency, loss, learn, transfer, invariant, supervised, texture, real, synthetic, translation, adaption, unpaired, latent, cross, translated, disentangling, code, cyclegan, unsuitable, generalized, encoder] [learning, training, distribution, deep, general, performance, better, network, convergence, data, neural, lead, discrete, sample] [computer, conference, vision, directly, international, reconstruct, sparse, recovered]
@InProceedings{Du_2020_CVPR,
  author = {Du, Wenchao and Chen, Hu and Yang, Hongyu},
  title = {Learning Invariant Representation for Unsupervised Image Restoration},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Improved Few-Shot Visual Classification
Peyman Bateni, Raghav Goyal, Vaden Masrani, Frank Wood, Leonid Sigal


Few-shot learning is a fundamental task in computer vision that carries the promise of alleviating the need for exhaustively labeled data. Most few-shot learning approaches to date have focused on progressively more complex neural feature extractors and classifier adaptation strategies, and the refinement of the task definition itself. In this paper, we explore the hypothesis that a simple class-covariance-based distance metric, namely the Mahalanobis distance, adopted into a state of the art few-shot learning approach (CNAPS) can, in and of itself, lead to a significant performance improvement. We also discover that it is possible to learn adaptive feature extractors that allow useful estimation of the high dimensional feature covariances required by this metric from surprisingly few samples. The result of our work is a new "Simple CNAPS" architecture which has up to 9.2% fewer trainable parameters than CNAPS and performs up to 6.1% better than state of the art on the standard few-shot image classification benchmark dataset.
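The class-covariance-based metric at the core of the approach can be sketched in a few lines: estimate a per-class mean and a regularized covariance from the support set, then classify a query by the smallest squared Mahalanobis distance. The simple identity-shrinkage regularizer below is an assumption; the paper's covariance estimator may differ.

import numpy as np

def mahalanobis_classify(support_x, support_y, query_x, shrink=1.0):
    """support_x: (N, D) features, support_y: (N,) labels, query_x: (Q, D).

    Per class: mean mu_c and covariance S_c shrunk toward the identity
    (necessary because D usually exceeds the per-class shot count). A query
    is assigned to the class minimizing (x - mu_c)^T S_c^{-1} (x - mu_c).
    """
    classes = np.unique(support_y)
    D = support_x.shape[1]
    scores = np.zeros((len(query_x), len(classes)))
    for j, c in enumerate(classes):
        xc = support_x[support_y == c]
        mu = xc.mean(axis=0)
        cov = np.cov(xc, rowvar=False) if len(xc) > 1 else np.zeros((D, D))
        prec = np.linalg.inv(cov + shrink * np.eye(D))
        diff = query_x - mu
        scores[:, j] = -np.einsum('qd,de,qe->q', diff, prec, diff)  # negative squared distance
    return classes[scores.argmax(axis=1)], scores

# Toy usage: 5-way 5-shot with 16-dim features and 10 queries.
rng = np.random.default_rng(0)
sx, sy = rng.random((25, 16)), np.repeat(np.arange(5), 5)
pred, _ = mahalanobis_classify(sx, sy, rng.random((10, 16)))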
[embedding, visual, work, embeddings, state, dot, making] [feature, extractor, table, sota, benchmark, final] [query, datasets, trained, model] [figure, block, adaptive, extraction, method] [image, adaptation, film, produce, specific] [cnaps, simple, classification, support, learning, class, task, number, accuracy, classifier, mahalanobis, neural, squared, covariance, set, metric, performance, adapted, architecture, deep, network, bregman, average, cosine, better, prototypical, choice, compared, labeled, test, space, family, layer, regularization, reported, fungi, distribution, linear, similarity, function, large, imagenet, negative] [distance, euclidean, conference, computer, matching, international, approach, absolute]
@InProceedings{Bateni_2020_CVPR,
  author = {Bateni, Peyman and Goyal, Raghav and Masrani, Vaden and Wood, Frank and Sigal, Leonid},
  title = {Improved Few-Shot Visual Classification},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Weighted Submanifolds With Variational Autoencoders and Riemannian Variational Autoencoders
Nina Miolane, Susan Holmes


Manifold-valued data naturally arises in medical imaging. In cognitive neuroscience for instance, brain connectomes base the analysis of coactivation patterns between different brain regions on the analysis of the correlations of their functional Magnetic Resonance Imaging (fMRI) time series - an object thus constrained by construction to belong to the manifold of symmetric positive definite matrices. One of the challenges that naturally arises in these studies consists in finding a lower-dimensional subspace for representing such manifold-valued and typically high-dimensional data. Traditional techniques, like principal component analysis, are ill-adapted to tackle non-Euclidean spaces and may fail to achieve a lower-dimensional representation of the data - thus potentially pointing to the absence of a lower-dimensional representation of the data. However, these techniques are restricted in that: (i) they do not leverage the assumption that the connectomes belong to a pre-specified manifold, therefore discarding information; (ii) they can only fit a linear subspace to the data. In this paper, we are interested in variants to learn potentially highly curved submanifolds of manifold-valued data. Motivated by the brain connectomes example, we investigate a latent variable generative model, which has the added benefit of providing us with uncertainty estimates - a crucial quantity in the medical applications we are considering. While latent variable models have been proposed to learn linear and nonlinear spaces for Euclidean data, or geodesic subspaces for manifold data, no intrinsic latent variable model exists to learn non-geodesic subspaces for manifold data. This paper fills this gap and formulates a Riemannian variational autoencoder with an intrinsic generative model of manifold-valued data. We evaluate its performance on synthetic and real datasets by introducing the formalism of weighted Riemannian submanifolds.
[embedding, associated, embedded, dataset, naturally] [represents, framework, pga] [model, nonlinear, noise, true, constant, datasets] [figure, analysis, brain, method, gaussian, comparison, medical] [vae, manifold, latent, generative, variational, variable, generated, learn, wasserstein, multivariate, autoencoders, component, generalize, train, representation, autoencoder, synthetic, learns, vaes] [riemannian, submanifold, data, learning, rvae, weighted, linear, distribution, space, submanifolds, probabilistic, neural, subspace, connectomes, dimension, consider, nongeodesic, learned, family, function, approximation, standard, expm, inference, statistical, formalism, connectome, variance, log, exponential, network, depends, observe, belong, evaluate, training] [principal, geodesic, distance, euclidean, intrinsic, fit, geometric, allows, supplementary, represented, defined, pca]
@InProceedings{Miolane_2020_CVPR,
  author = {Miolane, Nina and Holmes, Susan},
  title = {Learning Weighted Submanifolds With Variational Autoencoders and Riemannian Variational Autoencoders},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Learning Geocentric Object Pose in Oblique Monocular Images
Gordon Christie, Rodrigo Rene Rai Munoz Abujder, Kevin Foster, Shea Hagstrom, Gregory D. Hager, Myron Z. Brown


An object's geocentric pose, defined as the height above ground and orientation with respect to gravity, is a powerful representation of real-world structure for object detection, segmentation, and localization tasks using RGBD images. For close-range vision tasks, height and orientation have been derived directly from stereo-computed depth and more recently from monocular depth predicted by deep networks. For long-range vision tasks such as Earth observation, depth cannot be reliably estimated with monocular images. Inspired by recent work in monocular height above ground prediction and optical flow prediction from static images, we develop an encoding of geocentric pose to address this challenge and train a deep network to compute the representation densely, supervised by publicly available airborne lidar. We exploit these attributes to rectify oblique images and remove observed object parallax to dramatically improve the accuracy of localization and to enable accurate alignment of multiple images taken from very different oblique viewpoints. We demonstrate the value of our approach by extending two large-scale public datasets for semantic segmentation in oblique satellite images. All of our data and code are publicly available.
[prediction, predict, work, static, overhead, include, observed, mag] [height, semantic, building, table, object, segmentation, iou, map, predicted, lidar, regression, localization, supervision] [model, trained, public, datasets, improve, derived] [flow, figure, method, proposed, optical, epe, pixel, output] [image, representation, train, enable, alignment, produced] [test, learning, data, deep, network, accuracy, respect, bias, vector, set, better, note, training, applied] [ground, orientation, geocentric, satellite, oblique, monocular, pose, demonstrate, rotation, depth, single, accurate, dense, truth, rgb, jointly, well, estimation, angle, footprint, vision]
@InProceedings{Christie_2020_CVPR,
  author = {Christie, Gordon and Abujder, Rodrigo Rene Rai Munoz and Foster, Kevin and Hagstrom, Shea and Hager, Gregory D. and Brown, Myron Z.},
  title = {Learning Geocentric Object Pose in Oblique Monocular Images},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Understanding Adversarial Examples From the Mutual Influence of Images and Perturbations
Chaoning Zhang, Philipp Benz, Tooba Imtiaz, In So Kweon


A wide variety of works have explored the reason for the existence of adversarial examples, but there is no consensus on the explanation. We propose to treat the DNN logits as a vector for feature representation, and exploit them to analyze the mutual influence of two independent inputs based on the Pearson correlation coefficient (PCC). We utilize this vector representation to understand adversarial examples by disentangling the clean images and adversarial perturbations, and analyze their influence on each other. Our results suggest a new perspective towards the relationship between images and universal perturbations: Universal perturbations contain dominant features, and images behave like noise to them. This feature perspective leads to a new method for generating targeted universal adversarial perturbations using random source images. We are the first to achieve the challenging task of a targeted universal attack without utilizing original training data. Our approach using a proxy dataset achieves comparable performance to the state-of-the-art baselines which utilize the original training dataset.
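The central analysis tool, the Pearson correlation coefficient between logit vectors, is one line of NumPy. The sketch below illustrates the comparison the paper makes between a clean image, a perturbation, and their combination, with a random linear map standing in for the DNN logit layer (purely illustrative).

import numpy as np

def logit_pcc(logits_a, logits_b):
    """Pearson correlation coefficient between two logit vectors."""
    return np.corrcoef(logits_a, logits_b)[0, 1]

# Toy usage with a random linear "model": for a real DNN one would compare
# pcc(f(x), f(x + v)) against pcc(f(v), f(x + v)) to see which input dominates.
rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 10))            # stand-in for the logit layer
x, v = rng.random(1000), 0.1 * rng.standard_normal(1000)
f = lambda inp: inp @ W
print(logit_pcc(f(x), f(x + v)), logit_pcc(f(v), f(x + v)))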
[previous, dataset, explored, contribution, work, observed] [feature, correlation, table, pascal] [adversarial, universal, targeted, logit, noise, original, input, perturbation, pcc, influence, uap, uaps, analyze, attack, transferability, behave, fooling, existence, indicating, nontargeted, dnn, clean, insight, robustness, access, pccla, googlenet, dnns, interpretation] [analysis, figure, gaussian, method, proposed, based, pattern, coefficient] [image, target, loss, independent, generate, source] [training, vector, class, proxy, learning, data, performance, network, function, neural, ratio, logits, imagenet, machine, linear, indicates, deep, report, algorithm, alexnet, processing, random] [conference, international, computer, vision, dominant, approach, well]
@InProceedings{Zhang_2020_CVPR,
  author = {Zhang, Chaoning and Benz, Philipp and Imtiaz, Tooba and Kweon, In So},
  title = {Understanding Adversarial Examples From the Mutual Influence of Images and Perturbations},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Your Local GAN: Designing Two Dimensional Local Attention Mechanisms for Generative Models
Giannis Daras, Augustus Odena, Han Zhang, Alexandros G. Dimakis


We introduce a new local sparse attention layer that preserves two-dimensional geometry and locality. We show that by just replacing the dense attention layer of SAGAN with our construction, we obtain very significant FID, Inception score and pure visual improvements. FID score is improved from 18.65 to 15.94 on ImageNet, keeping all other parameters the same. The sparse attention patterns that we propose for our new layer are designed using a novel information theoretic criterion that uses information flow graphs. We also present a novel way to invert Generative Adversarial Networks with attention. Our method uses the attention layer of the discriminator to create an innovative loss function. This allows us to visualize the newly introduced attention heads and show that they indeed capture interesting aspects of two-dimensional geometry of real images.
[attention, node, multiple, work, natural, graph, attend, attends, step, time] [head, score, map, key, saliency, table] [model, query, adversarial, create, example] [figure, flow, esa, pattern, introduced, version, method, strided, designed, inverse] [image, sagan, ylg, generative, inversion, fid, inception, locality, real, representation, bird, discriminator, perform, row, loss, visualize, generated, biggan, latent] [layer, training, arxiv, performance, number, gradient, vector, standard, compared, descent, design, better, network, fixed, preprint, deep, best, problem, bias, matrix] [sparse, local, full, dense, dimensional, solve, second, conference, sparsification, international, novel]
@InProceedings{Daras_2020_CVPR,
  author = {Daras, Giannis and Odena, Augustus and Zhang, Han and Dimakis, Alexandros G.},
  title = {Your Local GAN: Designing Two Dimensional Local Attention Mechanisms for Generative Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion
Kentaro Wada, Edgar Sucar, Stephen James, Daniel Lenton, Andrew J. Davison


Robots and other smart devices need efficient object-based scene representations from their on-board vision systems to reason about contact, physics and occlusion. Recognized precise object models will play an important role alongside non-parametric reconstructions of unrecognized structures. We present a system which can estimate the accurate poses of multiple known objects in contact and occlusion from real-time, embodied multi-view vision. Our approach makes 3D object pose proposals from single RGB-D views, accumulates pose estimates and non-parametric occupancy information from multiple views as the camera moves, and performs joint optimization to estimate consistent, non-intersecting poses for multiple objects in contact. We verify the accuracy and robustness of our approach experimentally on 2 object datasets: YCB-Video, and our own challenging Cluttered YCB-Video. We demonstrate a real-time robotics application where a robot arm precisely and orderly disassembles complicated piles of objects, using only on-board RGB-D vision.
[prediction, recognition, dataset, multiple, reasoning, work] [object, surrounding, map, feature, refinement, detection, stage, table, tracking, box] [model, iterative, trained] [figure, ieee, pattern, fusion, extraction, prior, performs, sensor] [target, loss, free, alignment, unknown, masked, representation, image, mapping, perform] [space, network, training, better, deep, baseline, metric] [pose, occupancy, volumetric, vision, grid, voxel, conference, estimation, scene, computer, system, cluttered, rgb, camera, voxelization, collision, cad, point, morefusion, occupied, differentiable, symmetric, depth, full, hypothesized, initial, local, international, robotics, impenetrable, reconstruction, densefusion, ground, truth, estimate, distance, icp, icc, accurate, robot, geometry]
@InProceedings{Wada_2020_CVPR,
  author = {Wada, Kentaro and Sucar, Edgar and James, Stephen and Lenton, Daniel and Davison, Andrew J.},
  title = {MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
HCNAF: Hyper-Conditioned Neural Autoregressive Flow and its Application for Probabilistic Occupancy Map Forecasting
Geunseob Oh, Jean-Sebastien Valois


We introduce Hyper-Conditioned Neural Autoregressive Flow (HCNAF); a powerful universal distribution approximator designed to model arbitrarily complex conditional probability density functions. HCNAF consists of a neural-net based conditional autoregressive flow (AF) and a hyper-network that can take large conditions in non-autoregressive fashion and outputs the network parameters of the AF. Like other flow models, HCNAF performs exact likelihood inference. We conduct a number of density estimation tasks on toy experiments and MNIST to demonstrate the effectiveness and attributes of HCNAF, including its generalization capability over unseen conditions and expressivity. Finally, we show that HCNAF scales up to complex high-dimensional prediction problems of the magnitude of self-driving and that HCNAF yields a state-of-the-art performance in a public self-driving dataset.
[hidden, time, forecasting, prediction, dataset, three, driving, perception, actor, history] [autonomous, map, table, lidar, car, including, dkl, module, represents] [model, trained, condition, adversarial, toy, generalization] [flow, autoregressive, figure, pmodel, ieee, designed, likelihood, presented, pattern, based, exact, capability] [hcnaf, conditional, pom, naf, arbitrarily, generative, unseen, dzd, encoder, consists, target, gan, variable, vae] [neural, density, probability, network, arxiv, preprint, distribution, layer, data, probabilistic, large, equation, processing, experiment, computation, expressivity, note, log, learning, performance, normalizing] [estimation, transformation, conference, complex, computer, vision, occupancy, scene, compute]
@InProceedings{Oh_2020_CVPR,
  author = {Oh, Geunseob and Valois, Jean-Sebastien},
  title = {HCNAF: Hyper-Conditioned Neural Autoregressive Flow and its Application for Probabilistic Occupancy Map Forecasting},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Detail-recovery Image Deraining via Context Aggregation Networks
Sen Deng, Mingqiang Wei, Jun Wang, Yidan Feng, Luming Liang, Haoran Xie, Fu Lee Wang, Meng Wang


This paper looks at an intriguing question: are single images whose details are lost during deraining reversible to their artifact-free status? We propose an end-to-end detail-recovery image deraining network (termed DRD-Net) to solve the problem. Unlike existing image deraining approaches that attempt to meet the conflicting goals of simultaneously deraining and preserving details in a unified framework, we propose to view rain removal and detail recovery as two separate tasks, so that each part can specialize rather than trade off between two conflicting goals. Specifically, we introduce two parallel sub-networks with a comprehensive loss function which synergize to derain and recover the lost details caused by deraining. For complete rain removal, we present a rain residual network with the squeeze-and-excitation (SE) operation to remove rain streaks from the rainy images. For detail recovery, we construct a specialized detail repair network consisting of well-designed blocks, named structure detail context aggregation block (SDCAB), to encourage the lost details to return for eliminating image degradations. Moreover, the detail recovery branch of our proposed detail repair framework is detachable and can be incorporated into existing deraining methods to boost their performance. DRD-Net has been validated on several well-known benchmark datasets in terms of deraining robustness and detail accuracy. Comparisons show clear visual and numerical improvements of our method over the state of the art.
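The squeeze-and-excitation (SE) operation used in the rain-residual branch is a standard component; a compact PyTorch version is shown below as an illustration. The reduction ratio and its placement inside the residual block are assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global-average-pool the channels ('squeeze'),
    pass through a small bottleneck MLP, and rescale the feature map
    channel-wise with the resulting sigmoid gates ('excitation')."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                  # squeeze to (B, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                            # channel-wise reweighting

# Toy usage inside a rain-residual style block.
feat = torch.randn(2, 64, 32, 32)
out = SEBlock(64)(feat)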
[context, dataset, work, time, three] [feature, table, aggregation, background, propose, branch, contextual, detection, ablation] [input, datasets, tested, google, university, comprehensive] [rain, deraining, detail, residual, rainy, repair, block, ieee, removal, ddn, pattern, lost, based, psnr, remove, conv, derained, ssim, existing, method, sdcab, rescan, proposed, figure, spatial, removing, dilated, output, result, prenet, parallel, recover, net, convolution, gmm] [image, loss, synthetic, train, separate] [network, function, layer, operation, deep, performance, training, learning, large, number] [single, computer, conference, vision, structure, international, recovery, sparse, direct, full]
@InProceedings{Deng_2020_CVPR,
  author = {Deng, Sen and Wei, Mingqiang and Wang, Jun and Feng, Yidan and Liang, Luming and Xie, Haoran and Wang, Fu Lee and Wang, Meng},
  title = {Detail-recovery Image Deraining via Context Aggregation Networks},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
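The rain-residual branch builds on the standard squeeze-and-excitation (SE) operation. Below is a minimal PyTorch sketch of an SE-augmented residual block, with illustrative channel counts; the detail-repair branch (SDCAB) and the joint loss are omitted.

import torch
import torch.nn as nn

class SEResidualBlock(nn.Module):
    """Residual block with squeeze-and-excitation channel reweighting."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Squeeze: global average pooling; excitation: two-layer bottleneck gate.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.body(x)
        return x + y * self.se(y)   # rescale channels, then residual connection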
MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model
Han Fu, Rui Wu, Chenghao Liu, Jianling Sun


Driven by increasing concern about diet and health, food computing has attracted enormous attention from both industry and the research community. One of the most popular research topics in this domain is Food Retrieval, due to its profound influence on health-oriented applications. In this paper, we focus on the task of cross-modal retrieval between food images and cooking recipes. We present Modality-Consistent Embedding Network (MCEN), which learns modality-invariant representations by projecting images and texts into the same embedding space. To capture the latent alignments between modalities, we incorporate stochastic latent variables to explicitly exploit the interactions between textual and visual features. Importantly, our method learns the cross-modal alignments during training but computes embeddings of different modalities independently at inference time for the sake of efficiency. Extensive experimental results clearly demonstrate that the proposed MCEN outperforms all existing approaches on the benchmark Recipe1M dataset while requiring less computational cost.
[recipe, retrieval, attention, embedding, mcen, cooking, embeddings, work, textual, acme, recognition, language, incorporate, visual, rnn, multiple, modeling, dataset, hierarchical, natural] [feature, semantic, final, focus, propose, correlation, effectiveness, side, table] [model, multimedia, major, query] [prior, ieee, pattern, proposed, method, figure, analysis, based] [image, food, latent, encoder, representation, learns, loss, corresponding, align, independent, generation, gap, dish, generative, variable] [learning, training, neural, inference, deep, task, network, set, machine, test, arxiv, preprint, computational, posterior, stochastic] [conference, computer, international, vision, acm, additional, joint, matching, capture]
@InProceedings{Fu_2020_CVPR,
  author = {Fu, Han and Wu, Rui and Liu, Chenghao and Sun, Jianling},
  title = {MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Hypergraph Attention Networks for Multimodal Learning
Eun-Sol Kim, Woo Young Kang, Kyoung-Woon On, Yu-Jung Heo, Byoung-Tak Zhang


One of the fundamental problems that arise in multimodal learning tasks is the disparity of information levels between different modalities. To resolve this problem, we propose Hypergraph Attention Networks (HANs), which define a common semantic space among the modalities with symbolic graphs and extract a joint representation of the modalities based on a co-attention map constructed in the semantic space. HANs follow this process: construct the common semantic space from symbolic graphs of each modality, match the semantics between sub-structures of the symbolic graphs, construct co-attention maps between the graphs in the semantic space, and integrate the multimodal inputs using the co-attention maps to obtain the final joint representation. From a qualitative analysis on two Visual Question Answering datasets, we find that 1) the alignment of the information levels between the modalities is important, and 2) symbolic graphs are a very powerful way to represent the information of the low-level signals for this alignment. Moreover, HANs dramatically improve the state-of-the-art accuracy on the GQA dataset from 54.6% to 61.88%, using only the symbolic information.
[question, symbolic, graph, visual, multimodal, attention, gqa, vqa, semantics, dataset, node, ban, hyperedges, answering, bilinear, answer, word, powerful, constructed, message, hyperedge, hypergraph, represent, integrate, dependency, passing, hypergraphs, mfb, natural, construct, language, predict, three] [semantic, feature, map, object, pooling, table, final] [model] [based, method, ieee, pattern, suggested, proposed, figure] [image, representation, common, learn] [learning, neural, number, accuracy, similarity, arxiv, preprint, set, constructing, problem, consider, simple, validation, space, network, machine] [scene, vision, conference, computer, matching, defined, define, represented, compare, structure, international]
@InProceedings{Kim_2020_CVPR,
  author = {Kim, Eun-Sol and Kang, Woo Young and On, Kyoung-Woon and Heo, Yu-Jung and Zhang, Byoung-Tak},
  title = {Hypergraph Attention Networks for Multimodal Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Moving in the Right Direction: A Regularization for Deep Metric Learning
Deen Dayal Mohan, Nishant Sankaran, Dennis Fedorishin, Srirangaraj Setlur, Venu Govindaraju


Deep metric learning leverages carefully designed sampling strategies and loss functions that aid in optimizing the generation of a discriminable embedding space. While effective sampling of pairs is critical for shaping the metric space during training, the relative interactions between pairs, and consequently the forces exerted on these pairs that direct their displacement in the embedding space, can significantly impact the formation of well-separated clusters. In this work, we identify a shortcoming of existing loss formulations, which fail to consider more optimal directions of pair displacements as another criterion for optimization. We propose a novel direction regularization to explicitly account for the layout of sampled pairs and attempt to introduce orthogonality in the representations. The proposed regularization is easily integrated into existing loss functions, providing considerable performance improvements. We experimentally validate our hypothesis on the Cars-196, CUB-200 and InShop datasets and outperform existing methods to yield state-of-the-art results on these datasets.
[embedding, embeddings, pair, current, dataset, retrieval, order, recognition, previous, explicitly] [positive, anchor, table, feature] [original, clothes, datasets, identify] [proposed, ieee, pattern, based, method, existing, separated] [loss, specific, corresponding, image, factor] [metric, negative, learning, sample, triplet, regularization, performance, deep, proxy, class, space, compared, standard, sampling, gradient, respect, batch, similarity, optimization, informative, weighting, regularized, kfa, set, note, optimal, close, mining, log, kfn, cosine, closer, training, better, network, dimension, criterion] [direction, term, computer, conference, distance, formulation, vision, relative, single, international, force, displacement]
@InProceedings{Mohan_2020_CVPR,
  author = {Mohan, Deen Dayal and Sankaran, Nishant and Fedorishin, Dennis and Setlur, Srirangaraj and Govindaraju, Venu},
  title = {Moving in the Right Direction: A Regularization for Deep Metric Learning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
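The abstract does not give the exact form of the direction regularization, so the sketch below is only one plausible reading: a triplet loss augmented with a penalty that pushes the anchor-to-positive and anchor-to-negative displacement directions toward orthogonality. Both the penalty and the weight lam are our assumptions, not the authors' formulation.

import torch
import torch.nn.functional as F

def triplet_with_direction_reg(anchor, positive, negative,
                               margin=0.2, lam=0.1):
    """Triplet loss plus an illustrative orthogonality penalty on pair directions."""
    d_ap = F.normalize(positive - anchor, dim=-1)   # anchor -> positive direction
    d_an = F.normalize(negative - anchor, dim=-1)   # anchor -> negative direction
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    # Encourage the two displacement directions to be (close to) orthogonal.
    ortho = (d_ap * d_an).sum(-1).pow(2).mean()
    return triplet + lam * ortho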
Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets
Daniel Haase, Manuel Amthor


We introduce blueprint separable convolutions (BSConv) as highly efficient building blocks for CNNs. They are motivated by quantitative analyses of kernel properties from trained models, which show the dominance of correlations along the depth axis. Based on our findings, we formulate a theoretical foundation from which we derive efficient implementations using only standard layers. Moreover, our approach provides a thorough theoretical derivation, interpretation, and justification for the application of depthwise separable convolutions (DSCs) in general, which have become the basis of many modern network architectures. Ultimately, we reveal that DSC-based architectures such as MobileNets implicitly rely on cross-kernel correlations, while our BSConv formulation is based on intra-kernel correlations and thus allows for a more efficient separation of regular convolutions. Extensive experiments on large-scale and fine-grained classification datasets show that BSConvs clearly and consistently improve MobileNets and other DSC-based architectures without introducing any further complexity. For fine-grained datasets, we achieve an improvement of up to 13.7 percentage points. In addition, if used as a drop-in replacement in standard architectures such as ResNets, BSConv variants also outperform their vanilla counterparts by up to 9.5 percentage points on ImageNet.
[regular] [table, building, cnn] [model, trained, percentage, highly, datasets] [convolution, figure, separable, convolutional, kernel, cnns, ieee, pattern, based, residual, tensor] [loss, image] [bsconv, neural, filter, blueprint, efficient, depthwise, training, arxiv, preprint, baseline, regularization, size, subspace, network, parameter, deep, standard, pointwise, orthonormal, linear, mobilenets, computational, set, learning, architecture, imagenet, count, resnets, accuracy, andrew, classification, vanilla, weight, dscs, applied, large, implementation, cifar, stanford, outperform, performance, mnasnet, matrix, equation, improved] [computer, conference, depth, vision, approach, basis, international]
@InProceedings{Haase_2020_CVPR,
  author = {Haase, Daniel and Amthor, Manuel},
  title = {Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
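Assuming the unconstrained variant reduces to standard layers as described, i.e. a 1x1 pointwise convolution followed by a KxK depthwise convolution (the reverse ordering of a depthwise separable convolution), a minimal PyTorch sketch with placeholder sizes is:

import torch.nn as nn

class BSConvU(nn.Module):
    """Blueprint separable convolution, unconstrained variant:
    1x1 pointwise conv (mixes channels) followed by a KxK depthwise conv
    (one shared spatial 'blueprint' per output channel)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.dw = nn.Conv2d(out_ch, out_ch, kernel_size,
                            stride=stride, padding=kernel_size // 2,
                            groups=out_ch, bias=False)

    def forward(self, x):
        return self.dw(self.pw(x))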
Seeing without Looking: Contextual Rescoring of Object Detections for AP Maximization
Lourenco V. Pato, Renato Negrinho, Pedro M. Q. Aguiar


The majority of current object detectors lack context: class predictions are made independently from other detections. We propose to incorporate context in object detection by post-processing the output of an arbitrary detector to rescore the confidences of its detections. Rescoring is done by conditioning on contextual information from the entire set of detections: their confidences, predicted classes, and positions. We show that AP can be improved by simply reassigning the detection confidence values such that true positives that survive longer (i.e., those with the correct class and large IoU) are scored higher than false positives or detections with small IoU. In this setting, we use a bidirectional RNN with attention for contextual rescoring and introduce a training target that uses the IoU with ground truth to maximize AP for the given set of detections. The fact that our approach does not require access to visual features makes it computationally inexpensive and agnostic to the detection architecture. In spite of this simplicity, our model consistently improves AP over strong pre-trained baselines (Cascade R-CNN and Faster R-CNN with several backbones), particularly by reducing the confidence of duplicate detections (a learned form of non-maximum suppression) and removing out-of-context objects by conditioning on the confidences, classes, positions, and sizes of the co-occurrent detections. Code is available at https://github.com/LourencoVazPato/seeing-without-looking/
[context, correct, rnn, current, baseball, sequence, visual, recurrent, banana] [confidence, object, detection, iou, rescoring, table, contextual, false, cascade, predicted, faster, feature, bounding, rescored, duplicate, coco, detector, matched, localization, highest, box, score, background, rescore, improves, location, region, threshold] [model, true, trained, strong, original] [figure, proposed, remove, low] [target, image, loss] [class, higher, set, vector, training, better, large, baseline, binary, maximization, learned, dissimilar, precision, algorithm, learning, function, size, neural, strategy, hot, number, average] [ground, truth, matching, approach, error, localized, computed]
@InProceedings{Pato_2020_CVPR,
  author = {Pato, Lourenco V. and Negrinho, Renato and Aguiar, Pedro M. Q.},
  title = {Seeing without Looking: Contextual Rescoring of Object Detections for AP Maximization},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
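A simplified sketch of the rescoring model: each detection is encoded as (confidence, class one-hot, normalized box), a bidirectional RNN provides context from the whole detection set, and a head outputs new confidences. The attention mechanism and the AP-maximizing training target are omitted, and all sizes are placeholders.

import torch
import torch.nn as nn

class ContextualRescorer(nn.Module):
    """Rescore detection confidences by conditioning on all other detections."""
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        in_dim = 1 + num_classes + 4            # confidence + class one-hot + box
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, scores, classes_onehot, boxes):
        # scores: (B, N, 1), classes_onehot: (B, N, C), boxes: (B, N, 4) normalized
        feats = torch.cat([scores, classes_onehot, boxes], dim=-1)
        ctx, _ = self.rnn(feats)                 # context from the whole detection set
        return torch.sigmoid(self.head(ctx))     # new confidences in (0, 1)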
End-to-End Adversarial-Attention Network for Multi-Modal Clustering
Runwu Zhou, Yi-Dong Shen


Multi-modal clustering aims to cluster data into different groups by exploring complementary information from multiple modalities or views. Little work learns deep fused representations while simultaneously discovering the cluster structure with a discriminative loss. In this paper, we present an End-to-end Adversarial-attention network for Multi-modal Clustering (EAMC), where adversarial learning and an attention mechanism are leveraged to align the latent feature distributions and to quantify the importance of modalities, respectively. To benefit from joint training, we introduce a divergence-based clustering objective that not only encourages the separation and compactness of the clusters but also yields a clear cluster structure by embedding the simplex geometry of the output space into the loss. The proposed network consists of three modules: modality-specific feature learning, modality fusion, and cluster assignment. It can be trained from scratch with batch-mode optimization and avoids an autoencoder pretraining stage. Comprehensive experiments conducted on five real-world datasets show the superiority and effectiveness of the proposed clustering method.
[modality, multiple, attention, three, multimodal, dataset, artificial, graph, order, embedding, connected, stream] [feature, table, assignment, module, denotes, consensus, correlation, guide, fully] [adversarial, model, mnist, acc, datasets] [kernel, proposed, ieee, fusion, method, based, fused, pattern, spectral, clear, analysis, gaussian, dsc, output, figure, traditional] [cluster, latent, loss, eamc, learn, image, alignment, nwc, consists, representation, discriminator, latt, dsim, dcca] [clustering, deep, learning, data, network, matrix, subspace, layer, neural, machine, weight, compared, optimization, performance, divergence, softmax, architecture, metric, nmi, processing, compactness, space, dmsc] [conference, structure, international, joint, computer, multiview, vision]
@InProceedings{Zhou_2020_CVPR,
  author = {Zhou, Runwu and Shen, Yi-Dong},
  title = {End-to-End Adversarial-Attention Network for Multi-Modal Clustering},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
Fast Sparse ConvNets
Erich Elsen, Marat Dukhan, Trevor Gale, Karen Simonyan


Historically, the pursuit of efficient inference has been one of the driving forces behind the research into new deep learning architectures and building blocks. Recent examples include the squeeze-and-excitation module, depthwise separable convolutions in Xception, and the inverted bottleneck in MobileNet v2. Notably, in all of these cases, the resulting building blocks enabled not only higher efficiency, but also higher accuracy, and found wide adoption in the field. In this work, we further expand the arsenal of efficient building blocks for neural network architectures; but instead of combining standard primitives (such as convolution), we advocate for the replacement of these dense primitives with their sparse counterparts. While the idea of using sparsity to decrease the parameter count is not new, the conventional wisdom is that this reduction in theoretical FLOPs does not translate into real-world efficiency gains. We aim to correct this misconception by introducing a family of efficient sparse kernels for several hardware platforms, which we plan to open source for the benefit of the community. Equipped with our efficient implementation of sparse primitives, we show that sparse versions of MobileNet v1 and MobileNet v2 architectures substantially outperform strong dense baselines on the efficiency-accuracy curve. On Snapdragon 835 our sparse networks outperform their dense equivalents by 1.3-2.4x, equivalent to approximately one entire generation of improvement. We hope that our findings will facilitate wider adoption of sparsity as a tool for creating efficient and accurate deep learning architectures.
[time, work, taco, three] [table, web, final] [model, input, decrease] [block, figure, convolutional, pattern, channel, spatial, convolution, ieee, fast, output, kernel, june, residual] [train, image, factor, generation] [neural, sparsity, learning, size, efficient, inference, architecture, layer, deep, mobilenet, accuracy, parameter, efficientnet, weight, matrix, performance, number, width, cache, design, depthwise, search, pruning, bottleneck, machine, standard, xnnpack, network, efficiency, mobile, imagenet, small, eff, outperform, snapdragon, spmm, total, increasing, find, data] [sparse, dense, conference, vision, computer, international, unstructured, arm, full]
@InProceedings{Elsen_2020_CVPR,
  author = {Elsen, Erich and Dukhan, Marat and Gale, Trevor and Simonyan, Karen},
  title = {Fast Sparse ConvNets},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
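The arithmetic behind the approach: a 1x1 convolution is a matrix multiplication over the spatial positions, so pruned weights turn it into a sparse-dense matrix product (SpMM). The SciPy sketch below only illustrates this reformulation with placeholder sizes; the paper's contribution is fast hand-written kernels for it, which are not reproduced here.

import numpy as np
from scipy.sparse import random as sparse_random

def sparse_pointwise_conv(x, w_sparse):
    """1x1 convolution with a sparse weight matrix, expressed as SpMM.
    x: dense activations of shape (C_in, H, W); w_sparse: sparse (C_out, C_in)."""
    c_in, h, w = x.shape
    y = w_sparse @ x.reshape(c_in, h * w)      # (C_out, H*W) sparse-dense product
    return y.reshape(-1, h, w)

# Example with ~90% of the weights pruned away (illustrative sizes).
x = np.random.randn(64, 56, 56).astype(np.float32)
w = sparse_random(128, 64, density=0.1, format="csr", dtype=np.float32)
y = sparse_pointwise_conv(x, w)
print(y.shape)   # (128, 56, 56)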
Few Sample Knowledge Distillation for Efficient Network Compression
Tianhong Li, Jianguo Li, Zhuang Liu, Changshui Zhang


Deep neural network compression techniques such as pruning and weight tensor decomposition usually require fine-tuning to recover the prediction accuracy when the compression ratio is high. However, conventional fine-tuning suffers from the requirement of a large training set and a time-consuming training procedure. This paper proposes a novel solution for knowledge distillation from a few label-free samples to realize both data efficiency and training/processing efficiency. We treat the original network as the "teacher-net" and the compressed network as the "student-net". A 1x1 convolution layer is added at the end of each layer block of the student-net, and we fit the block-level outputs of the student-net to those of the teacher-net by estimating the parameters of the added layers. We prove that the added layer can be merged without introducing extra parameters or computation cost during inference. Experiments on multiple datasets and network architectures verify the method's effectiveness on student-nets obtained by various network pruning and weight decomposition methods. Our method can recover the student-net's accuracy to the same level as conventional fine-tuning methods in minutes, while using only 1% of the full training data, without labels.
[previous, three] [table, feature, apply, add] [model, original, input] [method, block, figure, output, recover, convolution, based, compression, tensor, compressed, comparison, convolutional, decomposing] [loss, align, alignment] [fskd, network, pruning, fitnet, accuracy, training, knowledge, deep, number, data, distillation, layer, pruned, neural, performance, learning, algorithm, merged, sgd, large, set, teacher, randomly, decoupled, weight, student, unlabeled, better, slimming, imagenet, procedure, speedup, standard, rno, decoupling, selected, arxiv, preprint, ratio, computation] [full, decomposition, require, matching, solution, cost]
@InProceedings{Li_2020_CVPR,
  author = {Li, Tianhong and Li, Jianguo and Liu, Zhuang and Zhang, Changshui},
  title = {Few Sample Knowledge Distillation for Efficient Network Compression},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
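Because the added correction layers are plain 1x1 convolutions fitted to match block-level outputs, they can be estimated in closed form from a handful of unlabeled samples. The NumPy sketch below shows a per-block least-squares fit; the merging of the fitted layer into the preceding convolution is not shown, and all shapes are illustrative.

import numpy as np

def fit_pointwise_correction(student_out, teacher_out):
    """Fit a 1x1 conv (a C_t x C_s matrix) mapping student block outputs
    to teacher block outputs via least squares on a few samples.
    student_out: (N, C_s, H, W), teacher_out: (N, C_t, H, W)."""
    n, cs, h, w = student_out.shape
    ct = teacher_out.shape[1]
    s = student_out.transpose(0, 2, 3, 1).reshape(-1, cs)   # (N*H*W, C_s)
    t = teacher_out.transpose(0, 2, 3, 1).reshape(-1, ct)   # (N*H*W, C_t)
    w_corr, *_ = np.linalg.lstsq(s, t, rcond=None)          # (C_s, C_t)
    return w_corr.T                                         # use as the 1x1 conv weight

# Illustrative call with random stand-ins for real block outputs.
student = np.random.randn(8, 32, 14, 14)
teacher = np.random.randn(8, 32, 14, 14)
w = fit_pointwise_correction(student, teacher)
print(w.shape)   # (32, 32)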
Predicting Sharp and Accurate Occlusion Boundaries in Monocular Depth Estimation Using Displacement Fields
Michael Ramamonjisoa, Yuming Du, Vincent Lepetit


Current methods for depth map prediction from monocular images tend to predict smooth, poorly localized contours for the occlusion boundaries in the input image. This is unfortunate, as occlusion boundaries are important cues for recognizing objects and, as we show, may provide a way to discover new objects from scene reconstruction. To improve predicted depth maps, recent methods rely on various forms of filtering or predict an additive residual depth map to refine a first estimate. We instead learn to predict, given a depth map predicted by some reconstruction method, a 2D displacement field able to re-sample pixels around the occlusion boundaries into sharper reconstructions. Our method can be applied to the output of any depth estimation method, in an end-to-end trainable fashion. For evaluation, we manually annotated the occlusion boundaries in all the images in the test split of the popular NYUv2-Depth dataset. We show that our approach improves the localization of occlusion boundaries for all state-of-the-art monocular depth estimation methods that we could evaluate, without degrading the depth accuracy for the rest of the images.
[evaluation, dataset, prediction, predict, predicting, work, previous] [occlusion, predicted, map, boundary, table, object, improves, refinement, edge, localization, refined, refine, fully, annotation, guided] [improve, input, trained, datasets, toy, help] [method, residual, guidance, proposed, bilateral, field, figure, sharp, convolutional, filtering, sharper, pixel, color, based, rlin, fast] [image, loss, learn] [network, deep, accuracy, learning, evaluate, training, problem, popular, better, performance, discussed, neural, optimal, best] [depth, displacement, estimation, monocular, mde, ground, reconstruction, truth, rgb, error, single, approach, smooth, accurate, initial, occluding, huber, compare, eigen, rel]
@InProceedings{Ramamonjisoa_2020_CVPR,
  author = {Ramamonjisoa, Michael and Du, Yuming and Lepetit, Vincent},
  title = {Predicting Sharp and Accurate Occlusion Boundaries in Monocular Depth Estimation Using Displacement Fields},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
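The refinement step boils down to resampling an initial depth map along a predicted 2D displacement field. A minimal PyTorch sketch using grid_sample is given below; the network that predicts the displacements is treated as given, and the warping direction is our assumption.

import torch
import torch.nn.functional as F

def resample_depth(depth, displacement):
    """Resample a depth map along a predicted 2D displacement field.
    depth: (B, 1, H, W); displacement: (B, 2, H, W) in pixels (dx, dy)."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    base = torch.stack([xs, ys], dim=0).unsqueeze(0).to(depth)      # (1, 2, H, W)
    coords = base + displacement                                    # where to sample from
    # Normalize to [-1, 1] as required by grid_sample, then reorder to (B, H, W, 2).
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack([coords_x, coords_y], dim=-1)
    return F.grid_sample(depth, grid, mode="bilinear", align_corners=True)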
Shape correspondence using anisotropic Chebyshev spectral CNNs
Qinsong Li, Shengjun Liu, Ling Hu, Xinru Liu


Establishing correspondence between shapes is a very important and active research topic in many domains. Owing to the power of deep learning on geometric data, many attractive results have been achieved with convolutional neural networks (CNNs). In this paper, we propose a novel architecture for shape correspondence, termed Anisotropic Chebyshev Spectral CNNs (ACSCNNs), based on a new extension of the manifold convolution operator. The extended convolution operators aggregate the local features of signals using a set of oriented kernels around each point, which allows the intrinsic signal information to be captured much more comprehensively. Rather than using fixed oriented kernels in the spatial domain as in previous CNNs, in our framework the kernels are learned by spectral filtering, based on the eigen-decompositions of multiple Anisotropic Laplace-Beltrami Operators. To reduce the computational complexity, we employ an explicit expansion of the Chebyshev polynomial basis to represent the spectral filters, whose expansion coefficients are trainable. Through benchmark experiments on shape correspondence, our architecture is demonstrated to be efficient and to provide better-than-state-of-the-art results on several datasets, even when using constant functions as inputs.
[graph, multiple, work, order, represent] [cnn, oriented] [diffusion, input] [anisotropic, spectral, convolution, based, ieee, operator, convolutional, filtering, figure, signal, kernel, method, pattern, spatial, fourier, analysis, proposed, cnns, deformable, called, transform, reference] [manifold, domain, representation] [learning, deep, neural, performance, set, accuracy, architecture, processing, computational, filter, matrix, formula, shot, product, vector] [shape, correspondence, computer, chebyshev, geometric, conference, vision, local, point, laplacian, defined, rotation, albo, michael, acscnn, geodesic, intrinsic, functional, eigenfunctions, partial, emanuele, basis, geometry, tangent, heat, allows, polynomial, euclidean, provided, direction, splinecnn, acm, ron, pierre]
@InProceedings{Li_2020_CVPR,
  author = {Li, Qinsong and Liu, Shengjun and Hu, Ling and Liu, Xinru},
  title = {Shape correspondence using anisotropic Chebyshev spectral CNNs},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
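Chebyshev spectral filtering avoids explicit eigendecomposition by evaluating the filter through the three-term recurrence on a rescaled Laplacian. The sketch below shows the standard recurrence for a single operator with illustrative arguments; the paper applies it to a set of anisotropic Laplace-Beltrami operators with trainable coefficients.

import numpy as np
from scipy.sparse import identity

def chebyshev_filter(L, x, theta, lmax=2.0):
    """Apply a spectral filter sum_k theta[k] * T_k(L_rescaled) @ x.
    L: sparse (N, N) Laplacian; x: (N, F) signal; theta: (K,) coefficients."""
    n = L.shape[0]
    L_hat = (2.0 / lmax) * L - identity(n, format="csr")   # rescale spectrum to [-1, 1]
    t_prev, t_curr = x, L_hat @ x                           # T_0(L)x, T_1(L)x
    out = theta[0] * t_prev + (theta[1] * t_curr if len(theta) > 1 else 0.0)
    for k in range(2, len(theta)):
        t_next = 2.0 * (L_hat @ t_curr) - t_prev            # Chebyshev recurrence
        out = out + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out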
RetinaTrack: Online Single Stage Joint Detection and Tracking
Zhichao Lu, Vivek Rathod, Ronny Votel, Jonathan Huang


Traditionally, multi-object tracking and object detection are performed by separate systems, with most prior work focusing exclusively on one of these tasks over the other. Tracking systems clearly benefit from having access to accurate detections; however, there is also ample evidence in the literature that detectors can benefit from tracking, which, for example, can help to smooth predictions over time. In this paper we focus on the tracking-by-detection paradigm for autonomous driving, where both tasks are mission critical. We propose a conceptually simple and efficient joint model of detection and tracking, called RetinaTrack, which modifies the popular single stage RetinaNet approach such that it is amenable to instance-level embedding training. We show, via evaluations on the Waymo Open Dataset, that we outperform a recent state-of-the-art tracking algorithm while requiring significantly less computation. We believe that our simple yet effective approach can serve as a strong baseline for future work in this area.
[embedding, time, video, multiple, dataset, driving, state, frame, order] [detection, tracking, retinatrack, retinanet, object, track, anchor, feature, waymo, iou, level, tracktor, faster, coco, map, mota, box, fpn, tracker, autonomous, art, stage, focus, instance, mot, location] [model, strong, subnetworks, trained] [ieee, pattern, based, convolutional, figure, convolution, running] [train, loss, reid, image, separate] [learning, training, simple, arxiv, preprint, baseline, deep, architecture, vanilla, number, performance, network, data, share, triplet, inference, open, base, better, classification, batch] [conference, computer, vision, single, international, joint, approach, well, finally, compare]
@InProceedings{Lu_2020_CVPR,
  author = {Lu, Zhichao and Rathod, Vivek and Votel, Ronny and Huang, Jonathan},
  title = {RetinaTrack: Online Single Stage Joint Detection and Tracking},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
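A heavily simplified sketch of the architectural change: the post-FPN head is split so that per-anchor features exist before the classification, box and embedding predictions, which makes instance-level embeddings per anchor possible. Channel counts, anchor count, and the exact split point are placeholders, not the paper's configuration.

import torch
import torch.nn as nn

class TrackHead(nn.Module):
    """Per-level head producing class scores, boxes and an embedding per anchor."""
    def __init__(self, channels=256, num_anchors=6, num_classes=80, emb_dim=64):
        super().__init__()
        # One small conv branch per anchor, so each anchor gets its own features.
        self.per_anchor = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
             for _ in range(num_anchors)]
        )
        self.cls = nn.Conv2d(channels, num_classes, 3, padding=1)
        self.box = nn.Conv2d(channels, 4, 3, padding=1)
        self.emb = nn.Conv2d(channels, emb_dim, 3, padding=1)

    def forward(self, fpn_feat):
        outputs = []
        for branch in self.per_anchor:
            f = branch(fpn_feat)                             # per-anchor feature map
            outputs.append((self.cls(f), self.box(f), self.emb(f)))
        return outputs                                       # one tuple per anchor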
Multimodal Categorization of Crisis Events in Social Media
Mahdi Abavisani, Liwei Wu, Shengli Hu, Joel Tetreault, Alejandro Jaimes


Recent developments in image classification and natural language processing, coupled with the rapid growth in social media usage, have enabled fundamental advances in detecting breaking events around the world in real time. Emergency response is one such area that stands to gain from these advances. By processing billions of texts and images a minute, events can be detected automatically, enabling emergency response workers to better assess rapidly evolving situations and deploy resources accordingly. To date, most event detection techniques in this area have focused on image-only or text-only approaches, limiting detection performance and impacting the quality of information delivered to crisis response teams. In this paper, we present a new multimodal fusion method that leverages both images and texts as input. In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities on a sample-by-sample basis. In addition, we employ a multimodal graph-based approach to stochastically transition between embeddings of different multimodal pairs during training, to better regularize the learning process as well as to deal with limited training data by constructing new matched pairs from different samples. We show that our method outperforms unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
[multimodal, text, social, crisis, embeddings, bert, attention, damage, humanitarian, language, embedding, emergency, three, visual, disaster, bilinear, natural, sse, modality, hurricane, sri, lanka, work] [feature, table, categorization, module, score, pooling, detection, framework, annotated, propose] [model, detecting, inconsistent, misleading] [fusion, event, method, ieee, proposed, figure, based] [image, train, introduce] [training, setting, data, task, classification, test, densenet, accuracy, arxiv, preprint, knowledge, deep, learning, neural, weighted, processing, macro, compact, better, transition, number, set, andrew, performance, stochastic] [conference, international, computer, approach, vision, limited]
@InProceedings{Abavisani_2020_CVPR,
  author = {Abavisani, Mahdi and Wu, Liwei and Hu, Shengli and Tetreault, Joel and Jaimes, Alejandro},
  title = {Multimodal Categorization of Crisis Events in Social Media},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
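The abstract describes a cross-attention module that filters uninformative components of a weak modality conditioned on the other modality. The sketch below is one plausible, simplified reading of such a gate; the actual module and its feature dimensions are not specified in the abstract and are assumptions here.

import torch
import torch.nn as nn

class CrossAttentionGate(nn.Module):
    """Gate image features with text features and vice versa before fusion."""
    def __init__(self, dim=512):
        super().__init__()
        self.img_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.txt_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, img_feat, txt_feat):
        # Each modality is filtered by a gate computed from the other modality.
        img_filtered = img_feat * self.img_gate(txt_feat)
        txt_filtered = txt_feat * self.txt_gate(img_feat)
        return torch.cat([img_filtered, txt_filtered], dim=-1)   # fused representation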
SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings
Wenyu Han, Siyuan Xiang, Chenhui Liu, Ruoyu Wang, Chen Feng


Spatial reasoning is an important component of human intelligence. We can imagine the shapes of 3D objects and reason about their spatial relations by merely looking at their three-view line drawings in 2D, with different levels of competence. Can deep networks be trained to perform spatial reasoning tasks? How can we measure their "spatial intelligence"? To answer these questions, we present the SPARE3D dataset. Based on cognitive science and psychometrics, SPARE3D contains three types of 2D-3D reasoning tasks on view consistency, camera pose, and shape generation, with increasing difficulty. We then design a method to automatically generate a large number of challenging questions with ground truth answers for each task. They are used to provide supervision for training our baseline models using state-of-the-art architectures like ResNet. Our experiments show that although convolutional networks have achieved superhuman performance in many visual learning tasks, their spatial reasoning performance in SPARE3D is almost equal to random guesses. We hope SPARE3D can stimulate new problem formulations and network designs for spatial reasoning to empower intelligent robots to operate effectively in the 3D world via 2D sensors.
[reasoning, dataset, visual, three, agent, correct, bagnet, reason, answer, question, previous, natural, language] [object, cnn, focus, benchmark, visualization] [testing, trained, datasets, model, input] [spatial, figure, ieee, pattern, designed, based] [drawing, generate, ability, image, generation, corresponding, consistency, generated] [performance, baseline, learning, top, deep, task, design, select, network, accuracy, test, data, classification, large, number, neural, random, note, candidate, binary, average] [isometric, pose, view, computer, conference, shape, intelligent, human, vision, point, camera, front, cloud, david, international, ground, geometry, engineering, truth, rotation, scene, solve, untrained, thomas]
@InProceedings{Han_2020_CVPR,
  author = {Han, Wenyu and Xiang, Siyuan and Liu, Chenhui and Wang, Ruoyu and Feng, Chen},
  title = {SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
SwapText: Image Based Texts Transfer in Scenes
Qiangpeng Yang, Jun Huang, Wei Lin


Swapping text in scene images while preserving the original fonts, colors, sizes and background textures is a challenging task due to the complex interplay between different factors. In this work, we present SwapText, a three-stage framework to transfer text across scene images. First, a novel text swapping network is proposed to replace the text labels only in the foreground image. Second, a background completion network is learned to reconstruct the background image. Finally, the generated foreground image and background image are used to generate the word image by the fusion network. Using the proposed framework, we can manipulate the text of input images even under severe geometric distortion. Qualitative and quantitative results are presented on several scene text datasets, including regular and irregular text datasets. We conduct extensive experiments to demonstrate the usefulness of our method for applications such as image-based text translation and text image synthesis.
[text, recognition, illustrated, word, three, natural, curved] [background, framework, detection, feature, table, map, adopt, global, foreground] [model, original, input, adversarial, robust, datasets] [figure, conv, based, proposed, fusion, method, output, ieee, pattern, transform, convolutional, quantitative, dilated, ssim, psnr, presented] [image, style, content, swapping, generate, generated, realistic, real, transfer, generative, synthesis, synthetic, loss, translation, geometrical, arbitrary, keeping, generation, gans, swaptext, gan, semantically, person] [network, neural, training, test, evaluate, replace, data, set, increased, average, accuracy, machine, better] [scene, completion, computer, shape, conference, vision, transformation, international, geometric, perspective, novel, reconstruct]
@InProceedings{Yang_2020_CVPR,
  author = {Yang, Qiangpeng and Huang, Jun and Lin, Wei},
  title = {SwapText: Image Based Texts Transfer in Scenes},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold
Mohamed Yousef, Tom E. Bishop


Text recognition is a major computer vision task with a big set of associated challenges. One of those traditional challenges is the coupled nature of text recognition and segmentation. This problem has been progressively solved over the past decades, going from segmentation based recognition to segmentation free approaches, which proved more accurate and much cheaper to annotate data for. We take a step from segmentation-free single line recognition towards segmentation-free multi-line / full page recognition. We propose a novel and simple neural network module, termed OrigamiNet, that can augment any CTC-trained, fully convolutional single line text recognizer, to convert it into a multi-line version by providing the model with enough spatial capacity to be able to properly collapse a 2D input signal into 1D without losing information. Such modified networks can be trained using exactly their same simple original procedure, and using only unsegmented image and text pairs. We carry out a set of interpretability experiments that show that our trained models learn an accurate implicit line segmentation. We achieve state-of-the-art character error rate on both IAM & ICDAR 2017 HTR benchmarks for handwriting recognition, surpassing all other methods in the literature. On IAM we even surpass single line methods that use accurate localization information during training. Our code is available online at https://github.com/IntuitionMachines/OrigamiNet .
[text, recognition, iam, character, cer, paragraph, work, handwriting, recognizer, handwritten, ctc, transcription, origaminet, length, convert, visual, sequence, previous, long, dataset, iapr] [segmentation, table, localization, final, fully, main, propose, segmented, annotated, cnn, segment] [input, model, vgg, attribution, original] [convolutional, analysis, proposed, method, spatial, comparison, output, figure] [image, document, idea, learn, htr, loss, specific] [training, set, network, neural, simple, learning, data, performance, deep, layer, capacity, number, arxiv, preprint, problem, achieve, note, test, normalization] [full, single, international, conference, vertical, implicit]
@InProceedings{Yousef_2020_CVPR,
  author = {Yousef, Mohamed and Bishop, Tom E.},
  title = {OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}
FroDO: From Detections to 3D Objects
Martin Runz, Kejie Li, Meng Tang, Lingni Ma, Chen Kong, Tanner Schmidt, Ian Reid, Lourdes Agapito, Julian Straub, Steven Lovegrove, Richard Newcombe


Object-oriented maps are important for scene understanding since they jointly capture geometry and semantics, allow individual instantiation and meaningful reasoning about objects. We introduce FroDO, a method for accurate 3D reconstruction of object instances from RGB video that infers their location, pose and shape in a coarse to fine manner. Key to FroDO is to embed object shapes in a novel learnt shape space that allows seamless switching between sparse point cloud and dense DeepSDF decoding. Given an input sequence of localized RGB frames, FroDO first aggregates 2D detections to instantiate a 3D bounding box per object. A shape code is regressed using an encoder network before optimizing shape and pose further under the learnt shape priors using sparse or dense shape representations. The optimization uses multi-view geometric, photometric and silhouette losses. We evaluate on real-world datasets, including Pix3D, Redwood-OS, and ScanNet, for single-view, multi-view, and multi-object reconstruction.
[embedding, decoder, multiple, recognition, dataset] [object, bounding, table, instance, box, predicted, detection, challenging, segmentation] [input] [ieee, pattern, figure, method, based, fusion] [code, representation, latent, loss, encoder, learnt, synthetic, image] [optimization, energy, set, network, learning, arxiv, preprint, evaluate, deep, data, space] [shape, reconstruction, conference, computer, sparse, dense, frodo, vision, single, rgb, point, joint, pose, photometric, distance, pmo, deepsdf, approach, slam, international, surface, monocular, chamfer, silhouette, ground, localized, infers, scene, novel, cloud, view, sdf, colmap, pointcloud, truth, scannet, signed]
@InProceedings{Runz_2020_CVPR,
  author = {Runz, Martin and Li, Kejie and Tang, Meng and Ma, Lingni and Kong, Chen and Schmidt, Tanner and Reid, Ian and Agapito, Lourdes and Straub, Julian and Lovegrove, Steven and Newcombe, Richard},
  title = {FroDO: From Detections to 3D Objects},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}