Meta Pseudo Labels
Hieu Pham, Zihang Dai, Qizhe Xie, Quoc V. Le
We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art.
Taming Transformers for High-Resolution Image Synthesis
Patrick Esser, Robin Rombach, Bjorn Ommer
Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks.
Real-Time High-Resolution Background Matting
Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian L. Curless, Steven M. Seitz, Ira Kemelmacher-Shlizerman
We introduce a real-time, high-resolution background replacement technique which operates at 30fps in 4K resolution, and 60fps for HD on a modern GPU.
RepVGG: Making VGG-Style ConvNets Great Again
Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, Jian Sun
We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology.
VirTex: Learning Visual Representations From Textual Annotations
Karan Desai, Justin Johnson
The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet.
Learning Continuous Image Representation With Local Implicit Image Function
Yinbo Chen, Sifei Liu, Xiaolong Wang
How to represent an image? While the visual world is presented in a continuous manner, machines store and see the images in a discrete way with 2D arrays of pixels.
Bottleneck Transformers for Visual Recognition
Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathan Shlens, Pieter Abbeel, Ashish Vaswani
We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation.
Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation
Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, Barret Zoph
Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision.
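The copy-paste idea named in the title is simple enough to sketch. Below is a minimal, hypothetical version that composites one source instance into a target image with a binary mask; the function name and mask format are illustrative assumptions, not the paper's code.

import numpy as np

def copy_paste(src_img, src_mask, dst_img, dst_masks):
    # src_img, dst_img: (H, W, 3) uint8 images; src_mask: (H, W) bool mask
    # of one source instance; dst_masks: list of (H, W) bool masks of
    # instances already present in dst_img.
    alpha = src_mask[..., None]                  # broadcast over channels
    out = np.where(alpha, src_img, dst_img)      # pasted object pixels win
    # Existing instances lose any pixels now covered by the pasted object.
    updated = [m & ~src_mask for m in dst_masks]
    return out, updated + [src_mask]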
NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections
Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, Daniel Duckworth
We present a learning-based method for synthesizing novel views of complex scenes using only unstructured collections of in-the-wild photographs.
NeX: Real-Time View Synthesis With Neural Basis Expansion
We present NeX, a new approach to novel view synthesis based on enhancements of multiplane image (MPI) that can reproduce next-level view-dependent effects--in real time.
Omnimatte: Associating Objects and Their Effects in Video
Erika Lu, Forrester Cole, Tali Dekel, Andrew Zisserman, William T. Freeman, Michael Rubinstein
Computer vision has become increasingly better at segmenting objects in images and videos; however, scene effects related to the objects -- shadows, reflections, generated smoke, etc. -- are typically overlooked.
Closed-Form Factorization of Latent Semantics in GANs
Yujun Shen, Bolei Zhou
A rich set of interpretable dimensions has been shown to emerge in the latent space of the Generative Adversarial Networks (GANs) trained for synthesizing images.
Jiayan Qiu, Yiding Yang, Xinchao Wang, Dacheng Tao
What scene elements, if any, are indispensable for recognizing a scene? We strive to answer this question through the lens of an end-to-end learning scheme.
Back to the Feature: Learning Robust Camera Localization From Pixels To Pose
Paul-Edouard Sarlin, Ajaykumar Unagar, Mans Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, Torsten Sattler
Camera pose estimation in known scenes is a 3D geometry task recently tackled by multiple learning algorithms.
Holistic 3D Scene Understanding From a Single Image With Implicit Representation
Cheng Zhang, Zhaopeng Cui, Yinda Zhang, Bing Zeng, Marc Pollefeys, Shuaicheng Liu
We present a new pipeline for holistic 3D scene understanding from a single image, which predicts object shape, object pose, and scene layout.
As the computing power of modern hardware increases rapidly, pre-trained deep learning models (e.g., BERT, GPT-3) learned on large-scale datasets have shown their effectiveness over conventional methods.
We present a novel large-scale dataset and accompanying machine learning models aimed at providing a detailed understanding of the interplay between visual content, its emotional effect, and explanations for the latter in language.
DatasetGAN: Efficient Labeled Data Factory With Minimal Human Effort
Yuxuan Zhang, Huan Ling, Jun Gao, Kangxue Yin, Jean-Francois Lafleche, Adela Barriuso, Antonio Torralba, Sanja Fidler
We introduce DatasetGAN: an automatic procedure to generate massive datasets of high-quality semantically segmented images requiring minimal human effort.
CutPaste: Self-Supervised Learning for Anomaly Detection and Localization
Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, Tomas Pfister
We aim to construct a high-performance model for defect detection that detects unknown anomalous patterns in an image without anomalous data.
Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes
Zhengqi Li, Simon Niklaus, Noah Snavely, Oliver Wang
We present a method to perform novel view and time synthesis of dynamic scenes, requiring only a monocular video with known camera poses as input.
Image Generators With Conditionally-Independent Pixel Synthesis
Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, Denis Korzhenkov
Existing image generator networks rely heavily on spatial convolutions and, optionally, self-attention blocks in order to gradually synthesize images in a coarse-to-fine manner.
Semantic Segmentation With Generative Models: Semi-Supervised Learning and Strong Out-of-Domain Generalization
Daiqing Li, Junlin Yang, Karsten Kreis, Antonio Torralba, Sanja Fidler
Training deep networks with limited labeled data while achieving a strong generalization ability is key in the quest to reduce human annotation efforts.
Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions, in contrast to parameter-dependent scaling and content-independent interactions of convolutions.
MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments From a Single Moving Camera
Felix Wimbauer, Nan Yang, Lukas von Stumberg, Niclas Zeller, Daniel Cremers
In this paper, we propose MonoRec, a semi-supervised monocular dense reconstruction architecture that predicts depth maps from a single moving camera in dynamic environments.
Information-Theoretic Segmentation by Inpainting Error Maximization
Pedro Savarese, Sunnie S. Y. Kim, Michael Maire, Greg Shakhnarovich, David McAllester
We study image segmentation from an information-theoretic perspective, proposing a novel adversarial method that performs unsupervised segmentation by partitioning images into maximally independent sets.
IBRNet: Learning Multi-View Image-Based Rendering
Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P. Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, Thomas Funkhouser
We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views.
On Robustness and Transferability of Convolutional Neural Networks
Josip Djolonga, Jessica Yung, Michael Tschannen, Rob Romijnders, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Matthias Minderer, Alexander D'Amour, Dan Moldovan, Sylvain Gelly, Neil Houlsby, Xiaohua Zhai, Mario Lucic
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts.
Enriching ImageNet With Human Similarity Judgments and Psychological Embeddings
Brett D. Roads, Bradley C. Love
Advances in supervised learning approaches to object recognition flourished in part because of the availability of high-quality datasets and associated benchmarks.
Daniel Lichy, Jiaye Wu, Soumyadip Sengupta, David W. Jacobs
In this paper, we present a technique for estimating the geometry and reflectance of objects using only a camera, flashlight, and optionally a tripod.
Li Siyao, Shiyu Zhao, Weijiang Yu, Wenxiu Sun, Dimitris Metaxas, Chen Change Loy, Ziwei Liu
In the animation industry, cartoon videos are usually produced at a low frame rate, since hand-drawing such frames is costly and time-consuming.
Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu
The canonical approach to video-and-language learning (e.g., video question answering) dictates that a neural model learn from offline-extracted dense video features from vision models and text features from language models.
We aim to infer 3D shape and pose of objects from a single image and propose a learning-based approach that can train from unstructured image collections, using only segmentation outputs from off-the-shelf recognition systems as supervisory signal.
Navigating the GAN Parameter Space for Semantic Image Editing
Anton Cherepkov, Andrey Voynov, Artem Babenko
Generative Adversarial Networks (GANs) are currently an indispensable tool for visual editing, being a standard component of image-to-image translation and image restoration pipelines.
The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth
Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel Brostow, Michael Firman
Self-supervised monocular depth estimation networks are trained to predict scene depth using nearby frames as a supervision signal during training.
Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion
Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang
We present the Modular interactive VOS (MiVOS) framework, which decouples interaction-to-mask and mask propagation, allowing for higher generalizability and better performance.
Self-Supervised Geometric Perception
We present self-supervised geometric perception (SGP), the first general framework to learn a feature descriptor for correspondence matching without any ground-truth geometric model labels (e.g., camera poses, rigid transformations).
Michael Dorkenwald, Timo Milbich, Andreas Blattmann, Robin Rombach, Konstantinos G. Derpanis, Bjorn Ommer
Video understanding calls for a model to learn the characteristic interplay between static scene content and its dynamics: Given an image, the model must be able to predict a future progression of the portrayed scene and, conversely, a video should be explained in terms of its static image content and all the remaining characteristics not present in the initial frame.
Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos
Yasamin Jafarian, Hyun Soo Park
A key challenge of learning the geometry of dressed humans lies in the limited availability of ground truth data (e.g., 3D scanned models), which results in the performance degradation of 3D human reconstruction when applied to real-world imagery.
Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
Aditya Prakash, Kashyap Chitta, Andreas Geiger
How should representations from complementary sensors be integrated for autonomous driving? Geometry-based sensor fusion has shown great promise for perception tasks such as object detection and motion forecasting.
Line Segment Detection Using Transformers Without Edges
Yifan Xu, Weijian Xu, David Cheung, Zhuowen Tu
In this paper, we present a joint end-to-end line segment detection algorithm using Transformers that is free of post-processing and of heuristics-guided intermediate processing (edge/junction/region detection).
Spatiotemporal Contrastive Video Representation Learning
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos.
Self-attention techniques, and specifically Transformers, are dominating the field of text processing and are becoming increasingly popular in computer vision classification tasks.
Probabilistic Embeddings for Cross-Modal Retrieval
Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio de Rezende, Yannis Kalantidis, Diane Larlus
Cross-modal retrieval methods build a common representation space for samples from multiple modalities, typically from the vision and the language domains.
Gaurav Parmar, Dacheng Li, Kwonjoon Lee, Zhuowen Tu
We present a new generative autoencoder model with dual contradistinctive losses that performs simultaneous inference (reconstruction) and synthesis (sampling).
D-NeRF: Neural Radiance Fields for Dynamic Scenes
Albert Pumarola, Enric Corona, Gerard Pons-Moll, Francesc Moreno-Noguer
Neural rendering techniques combining machine learning with geometric reasoning have arisen as one of the most promising approaches for synthesizing novel views of a scene from a sparse set of images.
The moments (a.k.a., mean and standard deviation) of latent features are often removed as noise when training image recognition models, to increase stability and reduce training time.
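For readers unfamiliar with the phrasing, "removing the moments" amounts to instance-normalizing the feature map. A minimal sketch, assuming per-sample, per-channel statistics (illustrative, not the paper's code):

import torch

def remove_moments(h, eps=1e-5):
    # h: (N, C, H, W) latent features. Strip the per-sample, per-channel
    # mean/std and return them alongside the normalized features.
    mu = h.mean(dim=(2, 3), keepdim=True)
    sigma = h.std(dim=(2, 3), keepdim=True)
    return (h - mu) / (sigma + eps), (mu, sigma)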
This paper revisits feature pyramid networks (FPN) for one-stage detectors and points out that the success of FPN is due to its divide-and-conquer solution to the optimization problem in object detection rather than multi-scale feature fusion.
Mandy Lu, Qingyu Zhao, Jiequan Zhang, Kilian M. Pohl, Li Fei-Fei, Juan Carlos Niebles, Ehsan Adeli
Batch Normalization (BN) and its variants have delivered tremendous success in combating the covariate shift induced by the training step of deep learning methods.
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Atul Ingle, Trevor Seets, Mauro Buttafava, Shantanu Gupta, Alberto Tosi, Mohit Gupta, Andreas Velten
Digital camera pixels measure image intensities by converting incident light energy into an analog electrical current, and then digitizing it into a fixed-width binary representation.
Plan2Scene: Converting Floorplans to 3D Scenes
We address the task of converting a floorplan and a set of associated photos of a residence into a textured 3D mesh model, a task which we call Plan2Scene.
Task Programming: Learning Data Efficient Behavior Representations
Jennifer J. Sun, Ann Kennedy, Eric Zhan, David J. Anderson, Yisong Yue, Pietro Perona
Specialized domain knowledge is often necessary to accurately annotate training sets for in-depth analysis, but can be burdensome and time-consuming to acquire from domain experts.
Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
Lei Ke, Yu-Wing Tai, Chi-Keung Tang
Segmenting highly-overlapping objects is challenging, because typically no distinction is made between real object contours and occlusion boundaries.
Rotation Coordinate Descent for Fast Globally Optimal Rotation Averaging
Alvaro Parra, Shin-Fang Chng, Tat-Jun Chin, Anders Eriksson, Ian Reid
Under mild conditions on the noise level of the measurements, rotation averaging satisfies strong duality, which enables global solutions to be obtained via semidefinite programming (SDP) relaxation.
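For context, chordal rotation averaging is commonly posed as the following nonconvex program over absolute rotations R_i, given noisy relative measurements (sign and composition conventions vary across papers):

    \min_{R_1,\dots,R_n \in \mathrm{SO}(3)} \sum_{(i,j) \in \mathcal{E}} \big\| \tilde{R}_{ij} - R_j R_i^{\top} \big\|_F^2

Strong duality means the SDP relaxation of this program is tight under the stated noise conditions, so solving the relaxation recovers the globally optimal rotations.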
StyleSpace Analysis: Disentangled Controls for StyleGAN Image Generation
Zongze Wu, Dani Lischinski, Eli Shechtman
We explore and analyze the latent style space of StyleGAN2, a state-of-the-art architecture for image generation, using models pretrained on several different datasets.
Student-Teacher Learning From Clean Inputs to Noisy Inputs
Guanzhe Hong, Zhiyuan Mao, Xiaojun Lin, Stanley H. Chan
Feature-based student-teacher learning, a training method that encourages the student's hidden features to mimic those of the teacher network, is empirically successful in transferring the knowledge from a pre-trained teacher network to the student network.
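The mimicry objective is typically just a distance between matched hidden features. A minimal sketch, assuming paired feature maps from matched layers and a frozen teacher (illustrative, not the paper's code):

import torch.nn.functional as F

def feature_mimic_loss(student_feats, teacher_feats):
    # Lists of (N, C, H, W) tensors from matched layers; detach the
    # teacher so no gradient flows into the pre-trained network.
    return sum(F.mse_loss(s, t.detach())
               for s, t in zip(student_feats, teacher_feats))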
Scaled-YOLOv4: Scaling Cross Stage Partial Network
Chien-Yao Wang, Alexey Bochkovskiy, Hong-Yuan Mark Liao
We show that the YOLOv4 object detection neural network, based on the CSP approach, scales both up and down and is applicable to small and large networks while maintaining optimal speed and accuracy.
NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis
Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, Jonathan T. Barron
We present a method that takes as input a set of images of a scene illuminated by unconstrained known lighting, and produces as output a 3D representation that can be rendered from novel viewpoints under arbitrary lighting conditions.
Soft-IntroVAE: Analyzing and Improving the Introspective Variational Autoencoder
Tal Daniel, Aviv Tamar
The recently introduced introspective variational autoencoder (IntroVAE) exhibits outstanding image generation, and allows for amortized inference using an image encoder.
Learning To Recover 3D Scene Shape From a Single Image
Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, Chunhua Shen
Despite significant progress in monocular depth estimation in the wild, recent state-of-the-art methods cannot be used to recover accurate 3D scene shape due to an unknown depth shift induced by shift-invariant reconstruction losses used in mixed-data depth prediction training, and possible unknown camera focal length.
Cross-Modal Contrastive Learning for Text-to-Image Generation
Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang
The output of text-to-image synthesis systems should be coherent, clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions.
MoViNets: Mobile Video Networks for Efficient Video Recognition
Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, Boqing Gong
We present Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can operate on streaming video for online inference.
Pulsar: Efficient Sphere-Based Neural Rendering
We propose Pulsar, an efficient sphere-based differentiable rendering module that is orders of magnitude faster than competing techniques, modular, and easy to use due to its tight integration with PyTorch.
Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking
Ning Wang, Wengang Zhou, Jie Wang, Houqiang Li
In video object tracking, there exist rich temporal contexts among successive frames, which have been largely overlooked in existing trackers.
PoseAug: A Differentiable Pose Augmentation Framework for 3D Human Pose Estimation
Kehong Gong, Jianfeng Zhang, Jiashi Feng
Existing 3D human pose estimators suffer poor generalization performance to new datasets, largely due to the limited diversity of 2D-3D pose pairs in the training data.
Large-Scale Localization Datasets in Crowded Indoor Spaces
Donghwan Lee, Soohyun Ryu, Suyong Yeon, Yonghan Lee, Deokhwa Kim, Cheolho Han, Yohann Cabon, Philippe Weinzaepfel, Nicolas Guerin, Gabriela Csurka, Martin Humenberger
Estimating the precise location of a camera using visual localization enables interesting applications such as augmented reality or robot navigation.
Unsupervised Learning of 3D Object Categories From Videos in the Wild
Philipp Henzler, Jeremy Reizenstein, Patrick Labatut, Roman Shapovalov, Tobias Ritschel, Andrea Vedaldi, David Novotny
Recently, numerous works have attempted to learn reconstructors of textured 3D models of visual categories, given a training set of annotated static images of objects.
In this paper, we introduce the new task of reconstructing 3D human pose from a single image in which we can see the person and the person's image through a mirror.
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, Ziwei Liu
While accurate lip synchronization has been achieved for arbitrary-subject audio-driven talking face generation, the problem of how to efficiently drive the head pose remains.
Recent advances have shown that symmetry, a structural prior that most objects exhibit, can support a variety of single-view 3D understanding tasks.
This paper deals with the scarcity of data for training optical flow networks, highlighting the limitations of existing sources such as labeled synthetic datasets or unlabeled real videos.
Three Ways To Improve Semantic Segmentation With Self-Supervised Depth Estimation
Lukas Hoyer, Dengxin Dai, Yuhua Chen, Adrian Koring, Suman Saha, Luc Van Gool
Training deep networks for semantic segmentation requires large amounts of labeled training data, which presents a major challenge in practice, as labeling segmentation masks is a highly labor-intensive process.
Recent VQA models may tend to rely on language bias as a shortcut and thus fail to sufficiently learn the multi-modal knowledge from both vision and language.
Current video retrieval efforts all base their evaluation on an instance-based assumption, that only a single caption is relevant to a query video and vice versa.
In this work we analyze strategies for convolutional neural network scaling; that is, the process of scaling a base convolutional network to endow it with greater computational complexity and, consequently, representational power.
Understanding Failures of Deep Networks via Robust Feature Extraction
Sahil Singla, Besmira Nushi, Shital Shah, Ece Kamar, Eric Horvitz
Traditional evaluation metrics for learned models that report aggregate scores over a test set are insufficient for surfacing important and informative patterns of failure over features and instances.
High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network
Jie Liang, Hui Zeng, Lei Zhang
Existing image-to-image translation (I2IT) methods are either constrained to low-resolution images or suffer from long inference times due to the heavy computational burden of convolutions on high-resolution feature maps.
Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, Felix Heide
Recent implicit neural rendering methods have demonstrated that it is possible to learn accurate view synthesis for complex scenes by predicting their volumetric density and color supervised solely by a set of RGB images.
Representation Learning via Global Temporal Alignment and Cycle-Consistency
Isma Hadji, Konstantinos G. Derpanis, Allan D. Jepson
We introduce a weakly supervised method for representation learning based on aligning temporal sequences (e.g., videos) of the same process (e.g., human action).
A Sliced Wasserstein Loss for Neural Texture Synthesis
Eric Heitz, Kenneth Vanhoey, Thomas Chambon, Laurent Belcour
We address the problem of computing a textural loss based on the statistics extracted from the feature activations of a convolutional neural network optimized for object recognition.
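The sliced Wasserstein distance of the title compares feature distributions through random 1D projections: project both activation sets onto shared random directions, sort, and take an L2 difference. A minimal sketch under the assumption of equally sized activation sets (names are illustrative):

import torch

def sliced_wasserstein(feat_a, feat_b, n_proj=32):
    # feat_a, feat_b: (N, C) flattened feature activations with equal N.
    proj = torch.randn(feat_a.shape[1], n_proj, device=feat_a.device)
    proj = proj / proj.norm(dim=0, keepdim=True)   # unit-norm directions
    a = torch.sort(feat_a @ proj, dim=0).values    # sorted 1D projections
    b = torch.sort(feat_b @ proj, dim=0).values
    return ((a - b) ** 2).mean()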
We show that pre-trained Generative Adversarial Networks (GANs), e.g., StyleGAN, can be used as a latent bank to improve the restoration quality of large-factor image super-resolution (SR).
Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place Recognition
Stephen Hausler, Sourav Garg, Ming Xu, Michael Milford, Tobias Fischer
Visual Place Recognition is a challenging task for robotics and autonomous systems, which must deal with the twin problems of appearance and viewpoint change in an ever-changing world.
Seeing Out of the Box: End-to-End Pre-Training for Vision-Language Representation Learning
Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu
We study the joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs.
MobileDets: Searching for Object Detection Architectures for Mobile Accelerators
Yunyang Xiong, Hanxiao Liu, Suyog Gupta, Berkin Akin, Gabriel Bender, Yongzhe Wang, Pieter-Jan Kindermans, Mingxing Tan, Vikas Singh, Bo Chen
Inverted bottleneck layers, which are built upon depthwise convolutions, have been the predominant building blocks in state-of-the-art object detection models on mobile devices.
Olivia Wiles, Sebastien Ehrhardt, Andrew Zisserman
We propose a new approach to determine correspondences between image pairs in the wild under large changes in illumination, viewpoint, context, and material.
Learning Monocular 3D Reconstruction of Articulated Categories From Motion
Filippos Kokkinos, Iasonas Kokkinos
Monocular 3D reconstruction of articulated object categories is challenging due to the lack of training data and the inherent ill-posedness of the problem.
Learned Initializations for Optimizing Coordinate-Based Neural Representations
Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, Pratul P. Srinivasan, Jonathan T. Barron, Ren Ng
Coordinate-based neural representations have shown significant promise as an alternative to discrete, array-based representations for complex low-dimensional signals.
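A coordinate-based representation is just a small network mapping coordinates to signal values, with one set of weights per signal; the paper meta-learns the initialization of such networks. A minimal sketch (architecture details are illustrative assumptions):

import torch.nn as nn

class CoordMLP(nn.Module):
    # Maps a 2D coordinate (x, y) to an RGB value; fitting the weights to
    # one image makes the network a continuous representation of that image.
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, xy):
        return self.net(xy)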
Rethinking and Improving the Robustness of Image Style Transfer
Pei Wang, Yijun Li, Nuno Vasconcelos
Extensive research in neural style transfer methods has shown that the correlation between features extracted by a pre-trained VGG network has remarkable ability to capture the visual style of an image.
Robust and Accurate Object Detection via Adversarial Learning
Xiangning Chen, Cihang Xie, Mingxing Tan, Li Zhang, Cho-Jui Hsieh, Boqing Gong
Data augmentation has become a de facto component for training high-performance deep image classifiers, but its potential is under-explored for object detection.
Estimating 3D scene flow from a sequence of monocular images has been gaining increased attention due to the simple, economical capture setup.
PPR10K: A Large-Scale Portrait Photo Retouching Dataset With Human-Region Mask and Group-Level Consistency
Jie Liang, Hui Zeng, Miaomiao Cui, Xuansong Xie, Lei Zhang
Different from general photo retouching tasks, portrait photo retouching (PPR), which aims to enhance the visual quality of a collection of flat-looking portrait photos, has its special and practical requirements such as human-region priority (HRP) and group-level consistency (GLC).
We present a novel attention mechanism: Causal Attention (CATT), to remove the ever-elusive confounding effect in existing attention-based vision-language models.
This paper presents a detailed study of improving vision features and develops an improved object detection model for vision language (VL) tasks.
Multimodal Contrastive Training for Visual Representation Learning
Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, Yilin Wang, Michael Maire, Ajinkya Kale, Baldo Faieta
We develop an approach to learning visual representations that embraces multimodal data, driven by a combination of intra- and inter-modal similarity preservation objectives.
Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, Andrea Tagliasacchi
With the advent of Neural Radiance Fields (NeRF), neural networks can now render novel views of a 3D scene with quality that fools the human eye.
TextOCR: Towards Large-Scale End-to-End Reasoning for Arbitrary-Shaped Scene Text
Amanpreet Singh, Guan Pang, Mandy Toh, Jing Huang, Wojciech Galuba, Tal Hassner
A crucial component of the scene-text-based reasoning required for the TextVQA and TextCaps datasets is detecting and recognizing the text present in images using an optical character recognition (OCR) system.
We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set.
STaR: Self-Supervised Tracking and Reconstruction of Rigid Objects in Motion With Neural Rendering
Wentao Yuan, Zhaoyang Lv, Tanner Schmidt, Steven Lovegrove
We present STaR, a novel method that performs Self-supervised Tracking and Reconstruction of dynamic scenes with rigid motion from multi-view RGB videos without any manual annotation.
Zan Gojcic, Or Litany, Andreas Wieser, Leonidas J. Guibas, Tolga Birdal
We propose a data-driven scene flow estimation algorithm exploiting the observation that many 3D scenes can be explained by a collection of agents moving as rigid bodies.
Reference-based Super-Resolution (Ref-SR) has recently emerged as a promising paradigm to enhance a low-resolution (LR) input image by introducing an additional high-resolution (HR) reference image.
SwiftNet: Real-Time Video Object Segmentation
Haochen Wang, Xiaolong Jiang, Haibing Ren, Yao Hu, Song Bai
In this work we present SwiftNet for real-time semi-supervised video object segmentation (one-shot VOS), which reports 77.8% J&F and 70 FPS on the DAVIS 2017 validation set, leading all present solutions in overall accuracy and speed performance.
Goutam Bhat, Martin Danelljan, Luc Van Gool, Radu Timofte
While single-image super-resolution (SISR) has attracted substantial interest in recent years, the proposed approaches are limited to learning image priors in order to add high frequency details.
DNN-based frame interpolation--that generates the intermediate frames given two consecutive frames--typically relies on heavy model architectures with a huge number of features, preventing them from being deployed on systems with limited resources, e.g., mobile devices.
Amit Raj, Michael Zollhofer, Tomas Simon, Jason Saragih, Shunsuke Saito, James Hays, Stephen Lombardi
Acquisition and rendering of photo-realistic human heads is a highly challenging research problem of particular importance for virtual telepresence.
Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction
Bohan Wu, Suraj Nair, Roberto Martin-Martin, Li Fei-Fei, Chelsea Finn
A video prediction model that generalizes to diverse scenes would enable intelligent agents such as robots to perform a variety of tasks via planning with the model.
Differentiable Patch Selection for Image Recognition
Jean-Baptiste Cordonnier, Aravindh Mahendran, Alexey Dosovitskiy, Dirk Weissenborn, Jakob Uszkoreit, Thomas Unterthiner
Neural Networks require large amounts of memory and compute to process high resolution images, even when only a small part of the image is actually informative for the task at hand.
PLOP: Learning Without Forgetting for Continual Semantic Segmentation
Arthur Douillard, Yifu Chen, Arnaud Dapogny, Matthieu Cord
Deep learning approaches are nowadays ubiquitously used to tackle computer vision tasks such as semantic segmentation, requiring large datasets and substantial computational power.
Recent work has demonstrated that volumetric scene representations combined with differentiable volume rendering can enable photo-realistic rendering for challenging scenes that mesh reconstruction fails on.
Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, Rogerio Feris
Conversational interfaces for the detail-oriented retail fashion domain are more natural, expressive, and user friendly than classical keyword-based search interfaces.
SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements
Qianli Ma, Shunsuke Saito, Jinlong Yang, Siyu Tang, Michael J. Black
Learning to model and reconstruct humans in clothing is challenging due to articulation, non-rigid deformation, and varying clothing types and topologies.
Baptiste Angles, Yuhe Jin, Simon Kornblith, Andrea Tagliasacchi, Kwang Moo Yi
We propose a deep network that can be trained to tackle image reconstruction and classification problems that involve detection of multiple object instances, without any supervision regarding their whereabouts.
Learning Compositional Radiance Fields of Dynamic Human Heads
Ziyan Wang, Timur Bagautdinov, Stephen Lombardi, Tomas Simon, Jason Saragih, Jessica Hodgins, Michael Zollhofer
Photorealistic rendering of dynamic humans is an important ability for telepresence systems, virtual shopping, synthetic data generation, and more.
Shugao Ma, Tomas Simon, Jason Saragih, Dawei Wang, Yuecheng Li, Fernando De la Torre, Yaser Sheikh
Telecommunication with photorealistic avatars in virtual or augmented reality is a promising path for achieving authentic face-to-face communication in 3D over remote physical distances.
Benchmarking Representation Learning for Natural World Image Collections
Grant Van Horn, Elijah Cole, Sara Beery, Kimberly Wilber, Serge Belongie, Oisin Mac Aodha
Recent progress in self-supervised learning has resulted in models that are capable of extracting rich representations from image collections without requiring any explicit label supervision.
The Spatially-Correlative Loss for Various Image Translation Tasks
Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai
We propose a novel spatially-correlative loss that is simple, efficient, and yet effective for preserving scene structure consistency while supporting large appearance changes during unpaired image-to-image (I2I) translation.
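The loss compares self-similarity patterns of features rather than the features themselves, so appearance is factored out and only spatial structure is penalized. A simplified global-similarity sketch (the paper's loss operates on local patches; names are illustrative):

import torch.nn.functional as F

def self_similarity(feat):
    # feat: (N, C, H, W) -> (N, HW, HW) map of pairwise cosine
    # similarities between spatial locations.
    f = F.normalize(feat.flatten(2), dim=1)
    return f.transpose(1, 2) @ f

def structure_loss(feat_src, feat_out):
    # Structure is preserved when input and translated output share the
    # same self-similarity pattern, whatever their appearance.
    return F.l1_loss(self_similarity(feat_src), self_similarity(feat_out))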
Lucy Chai, Jun-Yan Zhu, Eli Shechtman, Phillip Isola, Richard Zhang
Recent generative models can synthesize "views" of artificial images that mimic real-world variations, such as changes in color or pose, simply by learning from unlabeled image collections.
VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization
Seunghwan Choi, Sunghyun Park, Minsoo Lee, Jaegul Choo
The task of image-based virtual try-on aims to transfer a target clothing item onto the corresponding region of a person, which is commonly tackled by fitting the item to the desired body part and fusing the warped item with the person.
SPSG: Self-Supervised Photometric Scene Generation From RGB-D Scans
We present SPSG, a novel approach to generate high-quality, colored 3D models of scenes from RGB-D scan observations by learning to infer unobserved scene geometry and color in a self-supervised fashion.
Corentin Kervadec, Theo Jaunet, Grigory Antipov, Moez Baccouche, Romain Vuillemot, Christian Wolf
Since its inception, Visual Question Answering (VQA) has been notorious as a task where models are prone to exploiting biases in datasets to find shortcuts instead of performing high-level reasoning.
Learning Decision Trees Recurrently Through Communication
Stephan Alaniz, Diego Marcos, Bernt Schiele, Zeynep Akata
Integrating interpretability without sacrificing the prediction accuracy of decision-making algorithms has the potential to greatly improve their value to the user.
Teachers Do More Than Teach: Compressing Image-to-Image Models
Qing Jin, Jian Ren, Oliver J. Woodford, Jiazhuo Wang, Geng Yuan, Yanzhi Wang, Sergey Tulyakov
Generative Adversarial Networks (GANs) have achieved huge success in generating high-fidelity images; however, they suffer from low efficiency due to tremendous computational cost and bulky memory usage.
Image-to-Image Translation via Hierarchical Style Disentanglement
Xinyang Li, Shengchuan Zhang, Jie Hu, Liujuan Cao, Xiaopeng Hong, Xudong Mao, Feiyue Huang, Yongjian Wu, Rongrong Ji
Recently, image-to-image translation has made significant progress in achieving both multi-label (i.e., translation conditioned on different labels) and multi-style (i.e., generation with diverse styles) tasks.
3D CNNs With Adaptive Temporal Feature Resolutions
Mohsen Fayyaz, Emad Bahrami, Ali Diba, Mehdi Noroozi, Ehsan Adeli, Luc Van Gool, Jurgen Gall
While state-of-the-art 3D Convolutional Neural Networks (CNN) achieve very good results on action recognition datasets, they are computationally very expensive and require many GFLOPs.
Recent works have shown exciting results in unsupervised image de-rendering--learning to decompose 3D shape, appearance, and lighting from single-image collections without explicit supervision.
Pixel-Wise Anomaly Detection in Complex Driving Scenes
Giancarlo Di Biase, Hermann Blum, Roland Siegwart, Cesar Cadena
The inability of state-of-the-art semantic segmentation methods to detect anomaly instances hinders them from being deployed in safety-critical and complex applications, such as autonomous driving.
Learning the Superpixel in a Non-Iterative and Lifelong Manner
Lei Zhu, Qi She, Bin Zhang, Yanye Lu, Zhilin Lu, Duo Li, Jie Hu
Superpixels are generated by automatically clustering pixels in an image into hundreds of compact partitions, and are widely used to perceive object contours for their excellent contour adherence.
Few-Shot Segmentation Without Meta-Learning: A Good Transductive Inference Is All You Need?
Malik Boudiaf, Hoel Kervadec, Ziko Imtiaz Masud, Pablo Piantanida, Ismail Ben Ayed, Jose Dolz
We show that the way inference is performed in few-shot segmentation tasks has a substantial effect on performance--an aspect often overlooked in the literature in favor of the meta-learning paradigm.
MultiBodySync: Multi-Body Segmentation and Motion Estimation via 3D Scan Synchronization
Jiahui Huang, He Wang, Tolga Birdal, Minhyuk Sung, Federica Arrigoni, Shi-Min Hu, Leonidas J. Guibas
We present MultiBodySync, a novel, end-to-end trainable multi-body motion segmentation and rigid registration framework for multiple input 3D point clouds.
Towards Semantic Segmentation of Urban-Scale 3D Point Clouds: A Dataset, Benchmarks and Challenges
Qingyong Hu, Bo Yang, Sheikh Khalid, Wen Xiao, Niki Trigoni, Andrew Markham
An essential prerequisite for unleashing the potential of supervised deep learning algorithms in the area of 3D scene understanding is the availability of large-scale and richly annotated datasets.
Conventional stereo suffers from a fundamental trade-off between imaging volume and signal-to-noise ratio (SNR) -- due to the conflicting impact of aperture size on both these variables.
CausalVAE: Disentangled Representation Learning via Neural Structural Causal Models
Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, Jun Wang
Learning disentanglement aims at finding a low-dimensional representation which consists of multiple explanatory and generative factors of the observational data.
CoCoNets: Continuous Contrastive 3D Scene Representations
Shamit Lal, Mihir Prabhudesai, Ishita Mediratta, Adam W. Harley, Katerina Fragkiadaki
This paper explores self-supervised learning of amodal 3D feature representations from RGB and RGB-D posed images and videos, agnostic to object and scene semantic content, and evaluates the resulting scene representations in the downstream tasks of visual correspondence, object tracking, and object detection.
While recent studies on semi-supervised learning have shown remarkable progress in leveraging both labeled and unlabeled data, most of them presume a basic setting in which the model is randomly initialized.
PhySG: Inverse Rendering With Spherical Gaussians for Physics-Based Material Editing and Relighting
Kai Zhang, Fujun Luan, Qianqian Wang, Kavita Bala, Noah Snavely
We present an end-to-end inverse rendering pipeline that includes a fully differentiable renderer, and can reconstruct geometry, materials, and illumination from scratch from a set of images.
Mohammed Suhail, Abhay Mittal, Behjat Siddiquie, Chris Broaddus, Jayan Eledath, Gerard Medioni, Leonid Sigal
Traditional scene graph generation methods are trained using cross-entropy losses that treat objects and relationships as independent entities.
Beyond Static Features for Temporally Consistent 3D Human Pose and Shape From a Video
Hongsuk Choi, Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee
Despite the recent success of single image-based 3D human pose and shape estimation methods, recovering temporally consistent and smooth 3D human motion from a video is still challenging.
Inspired by the fact that human eyes continue to develop tracking ability in early and middle childhood, we propose to use tracking as a proxy task for a computer vision system to learn the visual representations.
We present a technique for estimating the relative 3D rotation of an RGB image pair in an extreme setting, where the images have little or no overlap.
This paper addresses the problem of learning to estimate the depth of detected objects given some measurement of camera motion (e.g., from robot kinematics or vehicle odometry).
Coordinate Attention for Efficient Mobile Network Design
Qibin Hou, Daquan Zhou, Jiashi Feng
Recent studies on mobile network design have demonstrated the remarkable effectiveness of channel attention (e.g., the Squeeze-and-Excitation attention) for lifting model performance, but they generally neglect the positional information, which is important for generating spatially selective attention maps.
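The channel attention being contrasted here, Squeeze-and-Excitation, pools each channel to a single number before reweighting, which is exactly where positional information is lost. A minimal sketch of that baseline (illustrative, not the paper's code):

import torch.nn as nn

class SqueezeExcite(nn.Module):
    # Global-average "squeeze" (discards all spatial layout), then a
    # bottleneck MLP "excitation" producing per-channel weights.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (N, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))    # (N, C) channel weights
        return x * w[:, :, None, None]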
Machine learning models are known to be vulnerable to adversarial attacks, namely perturbations of the data that lead to wrong predictions despite being imperceptible.
Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation
Hugo Germain, Vincent Lepetit, Guillaume Bourmaud
Absolute camera pose estimation is usually addressed by sequentially solving two distinct subproblems: first a feature matching problem that seeks to establish putative 2D-3D correspondences, and then a Perspective-n-Point problem that minimizes the reprojection error with respect to the camera pose.
Appearance-based detectors achieve remarkable performance on common scenes, benefiting from high-capacity models and massive annotated data, but tend to fail for scenarios that lack training data.
Refine Myself by Teaching Myself: Feature Refinement via Self-Knowledge Distillation
Mingi Ji, Seungjae Shin, Seunghyun Hwang, Gibeom Park, Il-Chul Moon
Knowledge distillation is a method of transferring the knowledge from a pretrained complex teacher model to a student model, so a smaller network can replace a large teacher network at the deployment stage.
Constructing and animating humans is an important component for building virtual worlds in a wide variety of applications such as virtual reality or robotics testing in simulation.
The Lottery Tickets Hypothesis for Supervised and Self-Supervised Pre-Training in Computer Vision Models
Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Michael Carbin, Zhangyang Wang
The computer vision world has been re-gaining enthusiasm for various pre-trained models, including both classical ImageNet supervised pre-training and recently emerged self-supervised pre-training such as SimCLR and MoCo.
Continual Adaptation of Visual Representations via Domain Randomization and Meta-Learning
Riccardo Volpi, Diane Larlus, Gregory Rogez
Most standard learning approaches lead to fragile models which are prone to drift when sequentially trained on samples of a different nature -- the well-known "catastrophic forgetting" issue.
Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training.
Roses Are Red, Violets Are Blue... but Should VQA Expect Them To?
Corentin Kervadec, Grigory Antipov, Moez Baccouche, Christian Wolf
Models for Visual Question Answering (VQA) are notorious for their tendency to rely on dataset biases, as the large and unbalanced diversity of questions and concepts involved tends to prevent models from learning to "reason", leading them to perform "educated guesses" instead.
ForgeryNet: A Versatile Benchmark for Comprehensive Forgery Analysis
Yinan He, Bei Gan, Siyu Chen, Yichun Zhou, Guojun Yin, Luchuan Song, Lu Sheng, Jing Shao, Ziwei Liu
The rapid progress of photorealistic synthesis techniques has reached a critical point where the boundary between real and manipulated images starts to blur.
This paper is concerned with ranking many pre-trained deep neural networks (DNNs), called checkpoints, for transfer learning to a downstream task.
Monocular Real-Time Full Body Capture With Inter-Part Correlations
Yuxiao Zhou, Marc Habermann, Ikhsanul Habibie, Ayush Tewari, Christian Theobalt, Feng Xu
We present the first method for real-time full body capture that estimates shape and motion of body and hands together with a dynamic 3D face model from a single color image.
The classical matching pipeline used for visual localization typically involves three steps: (i) local feature detection and description, (ii) feature matching, and (iii) outlier rejection.
Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection
Xiang Li, Wenhai Wang, Xiaolin Hu, Jun Li, Jinhui Tang, Jian Yang
Localization Quality Estimation (LQE) is crucial and popular in the recent advancement of dense object detectors since it can provide accurate ranking scores that benefit the Non-Maximum Suppression processing and improve detection performance.
DeepVideoMVS: Multi-View Stereo on Video With Recurrent Spatio-Temporal Fusion
Arda Duzceker, Silvano Galliani, Christoph Vogel, Pablo Speciale, Mihai Dusmanu, Marc Pollefeys
We propose an online multi-view depth prediction approach on posed video streams, where the scene geometry information computed in the previous time steps is propagated to the current time step in an efficient and geometrically plausible way.
UnsupervisedR&R: Unsupervised Point Cloud Registration via Differentiable Rendering
Mohamed El Banani, Luya Gao, Justin Johnson
Aligning partial views of a scene into a single whole is essential to understanding one's environment and is a key component of numerous robotics tasks such as SLAM and SfM.
Binary TTC: A Temporal Geofence for Autonomous Navigation
Abhishek Badki, Orazio Gallo, Jan Kautz, Pradeep Sen
Time-to-contact (TTC), the time for an object to collide with the observer's plane, is a powerful tool for path planning: it is potentially more informative than the depth, velocity, and acceleration of objects in the scene---even for humans.
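As a reminder of the underlying quantity: for a point at depth Z approaching the observer's plane,

    \mathrm{TTC} = \frac{Z}{-\,dZ/dt},

and the "binary" formulation of the title reduces estimation to per-pixel decisions of the form TTC < \tau versus TTC >= \tau for a chosen time threshold \tau (our reading of the premise; the paper's exact definitions may differ).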
NPAS: A Compiler-Aware Framework of Unified Network Pruning and Architecture Search for Beyond Real-Time Mobile Acceleration
Zhengang Li, Geng Yuan, Wei Niu, Pu Zhao, Yanyu Li, Yuxuan Cai, Xuan Shen, Zheng Zhan, Zhenglun Kong, Qing Jin, Zhiyu Chen, Sijia Liu, Kaiyuan Yang, Bin Ren, Yanzhi Wang, Xue Lin
With the increasing demand to efficiently deploy DNNs on mobile edge devices, it becomes much more important to reduce unnecessary computation and increase the execution speed.
Generating Diverse Structure for Image Inpainting With Hierarchical VQ-VAE
Jialun Peng, Dong Liu, Songcen Xu, Houqiang Li
Given an incomplete image without additional constraints, image inpainting naturally allows for multiple solutions as long as they appear plausible.
Temporal Query Networks for Fine-Grained Video Understanding
Chuhan Zhang, Ankush Gupta, Andrew Zisserman
Our objective in this work is fine-grained classification of actions in untrimmed videos, where the actions may be temporally extended or may span only a few frames of the video.
Points As Queries: Weakly Semi-Supervised Object Detection by Points
Liangyu Chen, Tong Yang, Xiangyu Zhang, Wei Zhang, Jian Sun
We propose a novel point-annotated setting for the weakly semi-supervised object detection task, in which the dataset comprises a small set of fully annotated images and a large set of images weakly annotated with points.
Self-supervised visual representation learning has seen huge progress recently, but no large-scale evaluation has compared the many models now available.
Learning To Relate Depth and Semantics for Unsupervised Domain Adaptation
Suman Saha, Anton Obukhov, Danda Pani Paudel, Menelaos Kanakis, Yuhua Chen, Stamatios Georgoulis, Luc Van Gool
We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting.
Luca Weihs, Matt Deitke, Aniruddha Kembhavi, Roozbeh Mottaghi
There has been significant recent progress in the field of Embodied AI, with researchers developing models and algorithms enabling embodied agents to navigate and interact within completely unseen environments.
SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification
Zijian Hu, Zhengyu Yang, Xuefeng Hu, Ram Nevatia
A common classification task situation is where one has a large amount of data available for training, but only a small portion is annotated with class labels.
SMURF: Self-Teaching Multi-Frame Unsupervised RAFT With Full-Image Warping
Austin Stone, Daniel Maurer, Alper Ayvaci, Anelia Angelova, Rico Jonschkowski
We present SMURF, a method for unsupervised learning of optical flow that improves state of the art on all benchmarks by 36% to 40% and even outperforms several supervised approaches such as PWC-Net and FlowNet2.
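The core unsupervised signal in this family of methods is photometric: warp the second frame back with the predicted flow and compare against the first frame. A minimal sketch of that objective (SMURF adds self-supervision, occlusion handling, and full-image warping on top; names are illustrative):

import torch
import torch.nn.functional as F

def warp(img, flow):
    # Backward-warp img (N, C, H, W) by flow (N, 2, H, W) given in pixels.
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(img.device)   # (2, H, W)
    tgt = base.unsqueeze(0) + flow                         # sampling locations
    gx = 2 * tgt[:, 0] / (w - 1) - 1                       # normalize to [-1, 1]
    gy = 2 * tgt[:, 1] / (h - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)                   # (N, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def photometric_loss(img1, img2, flow12):
    # Frame 1 should match frame 2 warped back by the forward flow.
    return (img1 - warp(img2, flow12)).abs().mean()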
WebFace260M: A Benchmark Unveiling the Power of Million-Scale Deep Face Recognition
Zheng Zhu, Guan Huang, Jiankang Deng, Yun Ye, Junjie Huang, Xinze Chen, Jiagang Zhu, Tian Yang, Jiwen Lu, Dalong Du, Jie Zhou
In this paper, we contribute a new million-scale face benchmark containing noisy 4M identities/260M faces (WebFace260M) and cleaned 2M identities/42M faces (WebFace42M) training data, as well as an elaborately designed time-constrained evaluation protocol.
Recent neural view synthesis methods have achieved impressive quality and realism, surpassing classical pipelines which rely on multi-view reconstruction.
Few-Shot Human Motion Transfer by Personalized Geometry and Texture Modeling
Zhichao Huang, Xintong Han, Jia Xu, Tong Zhang
We present a new method for few-shot human motion transfer that achieves realistic human image generation with only a small number of appearance inputs.
HOTR: End-to-End Human-Object Interaction Detection With Transformers
Bumsoo Kim, Junhyun Lee, Jaewoo Kang, Eun-Sol Kim, Hyunwoo J. Kim
Human-Object Interaction (HOI) detection is a task of identifying "a set of interactions" in an image, which involves the i) localization of the subject (i.e., humans) and target (i.e., objects) of interaction, and ii) the classification of the interaction labels.
A Realistic Evaluation of Semi-Supervised Learning for Fine-Grained Classification
Jong-Chyi Su, Zezhou Cheng, Subhransu Maji
We evaluate the effectiveness of semi-supervised learning (SSL) on a realistic benchmark where data exhibits considerable class imbalance and contains images from novel classes.
Mesoscopic Photogrammetry With an Unstabilized Phone Camera
Kevin C. Zhou, Colin Cooke, Jaehee Park, Ruobing Qian, Roarke Horstmeyer, Joseph A. Izatt, Sina Farsiu
We present a feature-free photogrammetric technique that enables quantitative 3D mesoscopic (mm-scale height variation) imaging with tens-of-micron accuracy from sequences of images acquired by a smartphone at close range (several cm) under freehand motion without additional hardware.
RfD-Net: Point Scene Understanding by Semantic Instance Reconstruction
Yinyu Nie, Ji Hou, Xiaoguang Han, Matthias Niessner
Semantic scene understanding from point clouds is particularly challenging as the points reflect only a sparse set of the underlying 3D geometry.
Yifan Wang, Andrew Liu, Richard Tucker, Jiajun Wu, Brian L. Curless, Steven M. Seitz, Noah Snavely
We present a framework for automatically reconfiguring images of street scenes by populating, depopulating, or repopulating them with objects such as pedestrians or vehicles.
Towards High Fidelity Face Relighting With Realistic Shadows
Andrew Hou, Ze Zhang, Michel Sarkis, Ning Bi, Yiying Tong, Xiaoming Liu
Existing face relighting methods often struggle with two problems: maintaining the local facial details of the subject and accurately removing and synthesizing shadows in the relit image, especially hard shadows.
Complete & Label: A Domain Adaptation Approach to Semantic Segmentation of LiDAR Point Clouds
Li Yi, Boqing Gong, Thomas Funkhouser
We study an unsupervised domain adaptation problem for the semantic labeling of 3D point clouds, with a particular focus on domain discrepancies induced by different LiDAR sensors.
IIRC: Incremental Implicitly-Refined Classification
We introduce the 'Incremental Implicitly-Refined Classification (IIRC)' setup, an extension to the class incremental learning setup where the incoming batches of classes have two granularity levels.
Towards Real-World Blind Face Restoration With Generative Facial Prior
Xintao Wang, Yu Li, Honglun Zhang, Ying Shan
Blind face restoration usually relies on facial priors, such as facial geometry prior or reference prior, to restore realistic and faithful details.
Instance Localization for Self-Supervised Detection Pretraining
Ceyuan Yang, Zhirong Wu, Bolei Zhou, Stephen Lin
Prior research on self-supervised learning has led to considerable progress on image classification, but often with degraded transfer performance on object detection.
Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting
Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Yongxin Yang, Timothy M. Hospedales, Tao Xiang, Yi-Zhe Song
Self-supervised learning has gained prominence due to its efficacy at learning powerful representations from unlabelled data that achieve excellent performance on many challenging downstream tasks.
DexYCB: A Benchmark for Capturing Hand Grasping of Objects
Yu-Wei Chao, Wei Yang, Yu Xiang, Pavlo Molchanov, Ankur Handa, Jonathan Tremblay, Yashraj S. Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, Jan Kautz, Dieter Fox
We introduce DexYCB, a new dataset for capturing hand grasping of objects.
Wide-Baseline Multi-Camera Calibration Using Person Re-Identification
Yan Xu, Yu-Jhe Li, Xinshuo Weng, Kris Kitani
We address the problem of estimating the 3D pose of a network of cameras for large-environment wide-baseline scenarios, e.g., cameras for construction sites, sports stadiums, and public spaces.
One of the factors that have hindered progress in the areas of sign language recognition, translation, and production is the absence of large annotated datasets.
AGORA: Avatars in Geography Optimized for Regression Analysis
Priyanka Patel, Chun-Hao P. Huang, Joachim Tesch, David T. Hoffmann, Shashank Tripathi, Michael J. Black
While the accuracy of 3D human pose estimation from images has steadily improved on benchmark datasets, the best methods still fail in many real-world scenarios.
ViP-DeepLab: Learning Visual Perception With Depth-Aware Video Panoptic Segmentation
Siyuan Qiao, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
In this paper, we present ViP-DeepLab, a unified model attempting to tackle the long-standing and challenging inverse projection problem in vision, which we model as restoring the point clouds from perspective image sequences while providing each point with instance-level semantic interpretations.
FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding
Bo Sun, Banghuai Li, Shengcai Cai, Ye Yuan, Chi Zhang
There is emerging interest in recognizing previously unseen objects given very few training examples, a task known as few-shot object detection (FSOD).
Unsupervised Human Pose Estimation Through Transforming Shape Templates
Luca Schmidtke, Athanasios Vlontzos, Simon Ellershaw, Anna Lukens, Tomoki Arichi, Bernhard Kainz
Human pose estimation is a major computer vision problem with applications ranging from augmented reality and video capture to surveillance and movement tracking.
Generative adversarial networks (GANs), e.g., StyleGAN2, play a vital role in various image generation and synthesis tasks, yet their notoriously high computational cost hinders their efficient deployment on edge devices.
StereoPIFu: Depth Aware Clothed Human Digitization via Stereo Vision
Yang Hong, Juyong Zhang, Boyi Jiang, Yudong Guo, Ligang Liu, Hujun Bao
In this paper, we propose StereoPIFu, which integrates the geometric constraints of stereo vision with implicit function representation of PIFu, to recover the 3D shape of the clothed human from a pair of low-cost rectified images.
Self-Supervised Motion Learning From Static Images
Ziyuan Huang, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Rong Jin, Marcelo H. Ang
Motions are reflected in videos as the movement of pixels, and actions are essentially patterns of inconsistent motions between the foreground and the background.
KOALAnet: Blind Super-Resolution Using Kernel-Oriented Adaptive Local Adjustment
Soo Ye Kim, Hyeonjun Sim, Munchurl Kim
Blind super-resolution (SR) methods aim to generate a high-quality, high-resolution image from a low-resolution image containing unknown degradations.
Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation
Saquib Sarfraz, Naila Murray, Vivek Sharma, Ali Diba, Luc Van Gool, Rainer Stiefelhagen
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos and is an important requirement for many video understanding tasks.
HyperSeg: Patch-Wise Hypernetwork for Real-Time Semantic Segmentation
Yuval Nirkin, Lior Wolf, Tal Hassner
We present a novel, real-time, semantic segmentation network in which the encoder both encodes and generates the parameters (weights) of the decoder.
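The core idea, an encoder output generating decoder weights, is a hypernetwork. A minimal sketch of that mechanism, with illustrative sizes and a 1x1 decoding convolution standing in for HyperSeg's actual decoder:

```python
# Minimal hypernetwork sketch: a per-image context vector from the encoder
# generates the weights of a small decoding convolution. All names and sizes
# here are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class HyperDecoder(nn.Module):
    def __init__(self, ctx_dim=64, in_ch=32, out_ch=8):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        # The "hypernetwork": maps the context vector to the flattened
        # weights and bias of a 1x1 decoding convolution.
        self.weight_gen = nn.Linear(ctx_dim, out_ch * in_ch + out_ch)

    def forward(self, feat, ctx):
        # feat: (B, in_ch, H, W) decoder input; ctx: (B, ctx_dim) from encoder
        theta = self.weight_gen(ctx)
        w = theta[:, : self.out_ch * self.in_ch].view(-1, self.out_ch, self.in_ch)
        b = theta[:, self.out_ch * self.in_ch :]
        # Apply the generated per-sample 1x1 conv as a per-pixel matmul.
        return torch.einsum("boi,bihw->bohw", w, feat) + b[:, :, None, None]

feat, ctx = torch.randn(2, 32, 16, 16), torch.randn(2, 64)
print(HyperDecoder()(feat, ctx).shape)  # torch.Size([2, 8, 16, 16])
```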
Categorical Depth Distribution Network for Monocular 3D Object Detection
Cody Reading, Ali Harakeh, Julia Chae, Steven L. Waslander
Monocular 3D object detection is a key problem for autonomous vehicles, as it provides a solution with simple configuration compared to typical multi-sensor systems.
Monocular Reconstruction of Neural Face Reflectance Fields
Mallikarjun B R, Ayush Tewari, Tae-Hyun Oh, Tim Weyrich, Bernd Bickel, Hans-Peter Seidel, Hanspeter Pfister, Wojciech Matusik, Mohamed Elgharib, Christian Theobalt
The reflectance field of a face describes the reflectance properties responsible for complex lighting effects including diffuse, specular, inter-reflection and self shadowing.
Synthesizing variations of a specific reference image with semantically valid content is an important task in terms of personalized generation as well as for data augmentation.
We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers.
M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-Training
Minheng Ni, Haoyang Huang, Lin Su, Edward Cui, Taroon Bharti, Lijuan Wang, Dongdong Zhang, Nan Duan
We present M3P, a Multitask Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training into a unified framework via multitask pre-training.
(AF)2-S3Net: Attentive Feature Fusion With Adaptive Feature Selection for Sparse Semantic Segmentation Network
Ran Cheng, Ryan Razani, Ehsan Taghavi, Enxu Li, Bingbing Liu
Autonomous robotic systems and self-driving cars rely on accurate perception of their surroundings, as the safety of passengers and pedestrians is the top priority.
Roof-GAN: Learning To Generate Roof Geometry and Relations for Residential Houses
Yiming Qian, Hao Zhang, Yasutaka Furukawa
This paper presents Roof-GAN, a novel generative adversarial network that generates structured geometry of residential roof structures as a set of roof primitives and their relationships.
ReMix: Towards Image-to-Image Translation With Limited Data
Jie Cao, Luanxuan Hou, Ming-Hsuan Yang, Ran He, Zhenan Sun
Image-to-image (I2I) translation methods based on generative adversarial networks (GANs) typically suffer from overfitting when limited training data is available.
VaB-AL: Incorporating Class Imbalance and Difficulty With Variational Bayes for Active Learning
Jongwon Choi, Kwang Moo Yi, Jihoon Kim, Jinho Choo, Byoungjip Kim, Jinyeop Chang, Youngjune Gwon, Hyung Jin Chang
Active Learning for discriminative models has largely been studied with the focus on individual samples, with less emphasis on how classes are distributed or which classes are hard to deal with.
Adversarial Robustness Under Long-Tailed Distribution
Tong Wu, Ziwei Liu, Qingqiu Huang, Yu Wang, Dahua Lin
Adversarial robustness has attracted extensive studies recently by revealing the vulnerability and intrinsic characteristics of deep networks.
Camouflaged Object Segmentation With Distraction Mining
Haiyang Mei, Ge-Peng Ji, Ziqi Wei, Xin Yang, Xiaopeng Wei, Deng-Ping Fan
Camouflaged object segmentation (COS) aims to identify objects that are "perfectly" assimilated into their surroundings, and has a wide range of valuable applications.
Learning Complete 3D Morphable Face Models From Images and Videos
Mallikarjun B R, Ayush Tewari, Hans-Peter Seidel, Mohamed Elgharib, Christian Theobalt
Most 3D face reconstruction methods rely on 3D morphable models, which disentangle the space of facial deformations into identity and expression geometry, and skin reflectance.
Topological Planning With Transformers for Vision-and-Language Navigation
Kevin Chen, Junshen K. Chen, Jo Chuang, Marynel Vazquez, Silvio Savarese
Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments.
CanonPose: Self-Supervised Monocular 3D Human Pose Estimation in the Wild
Bastian Wandt, Marco Rudolph, Petrissa Zell, Helge Rhodin, Bodo Rosenhahn
Human pose estimation from single images is a challenging problem in computer vision that requires large amounts of labeled training data to be solved accurately.
Alpha-Refine: Boosting Tracking Performance by Precise Bounding Box Estimation
Bin Yan, Xinyu Zhang, Dong Wang, Huchuan Lu, Xiaoyun Yang
Visual object tracking aims to precisely estimate the bounding box for the given target, which is a challenging problem due to factors such as deformation and occlusion.
DCNAS: Densely Connected Neural Architecture Search for Semantic Image Segmentation
Xiong Zhang, Hongmin Xu, Hong Mo, Jianchao Tan, Cheng Yang, Lei Wang, Wenqi Ren
Existing NAS methods for dense image prediction tasks usually compromise on a restricted search space or search on a proxy task to meet achievable computational demands.
Video super-resolution (VSR) approaches tend to have more components than their image counterparts as they need to exploit the additional temporal dimension.
The objective of this work is to segment high-resolution images without overloading GPU memory usage or losing the fine details in the output segmentation map.
AdCo: Adversarial Contrast for Efficient Learning of Unsupervised Representations From Self-Trained Negative Adversaries
Qianjiang Hu, Xiao Wang, Wei Hu, Guo-Jun Qi
Contrastive learning relies on constructing a collection of negative examples that are sufficiently hard to discriminate against positive queries when their representations are self-trained.
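A toy sketch of the AdCo-style idea of self-trained negatives: the negative bank is itself a parameter, updated by gradient ascent on the same InfoNCE loss the encoder descends. The encoder, sizes, and learning rates below are illustrative stand-ins:

```python
# Learnable adversarial negatives for contrastive learning (toy sketch).
import torch
import torch.nn.functional as F

dim, n_neg, tau = 128, 1024, 0.1
encoder = torch.nn.Linear(256, dim)  # stand-in encoder
negatives = torch.nn.Parameter(F.normalize(torch.randn(n_neg, dim), dim=1))
opt_enc = torch.optim.SGD(encoder.parameters(), lr=0.1)
opt_neg = torch.optim.SGD([negatives], lr=1.0)

x_q, x_k = torch.randn(32, 256), torch.randn(32, 256)   # two augmented views
q = F.normalize(encoder(x_q), dim=1)
k = F.normalize(encoder(x_k), dim=1).detach()            # positive key
logits = torch.cat([(q * k).sum(1, keepdim=True),        # positive logit
                    q @ F.normalize(negatives, dim=1).t()], dim=1) / tau
loss = F.cross_entropy(logits, torch.zeros(len(q), dtype=torch.long))

opt_enc.zero_grad(); opt_neg.zero_grad()
loss.backward()
opt_enc.step()            # encoder minimizes the contrastive loss
negatives.grad.neg_()     # flip the gradient: negatives maximize it
opt_neg.step()
```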
Rakshit Kothari, Shalini De Mello, Umar Iqbal, Wonmin Byeon, Seonwook Park, Jan Kautz
A major challenge for physically unconstrained gaze estimation is acquiring training data with 3D gaze annotations for in-the-wild and outdoor scenarios.
ProSelfLC: Progressive Self Label Correction for Training Robust Deep Neural Networks
Xinshao Wang, Yang Hua, Elyor Kodirov, David A. Clifton, Neil M. Robertson
To train robust deep neural networks (DNNs), we systematically study several target modification approaches, which include output regularisation, self and non-self label correction (LC).
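Self label correction of this kind can be sketched as a convex combination of the annotated label and the model's own prediction, with trust in the prediction growing over training. The schedule and epsilon form below are illustrative, not ProSelfLC's exact rule:

```python
# Progressive self label correction (illustrative sketch).
import torch
import torch.nn.functional as F

def corrected_target(logits, labels, step, total_steps, num_classes):
    eps = min(1.0, step / total_steps) * 0.5    # trust in self-prediction grows
    one_hot = F.one_hot(labels, num_classes).float()
    p = F.softmax(logits.detach(), dim=1)       # model's current belief
    return (1 - eps) * one_hot + eps * p

logits = torch.randn(4, 10)
labels = torch.tensor([1, 3, 5, 7])
target = corrected_target(logits, labels, step=500, total_steps=1000, num_classes=10)
loss = -(target * F.log_softmax(logits, dim=1)).sum(1).mean()  # soft-target CE
```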
Learning to Track Instances without Video Annotations
Yang Fu, Sifei Liu, Umar Iqbal, Shalini De Mello, Humphrey Shi, Jan Kautz
Tracking segmentation masks of multiple instances has been intensively studied, but still faces two fundamental challenges: 1) the requirement of large-scale, frame-wise annotation, and 2) the complexity of two-stage approaches.
Unsupervised Feature Learning by Cross-Level Instance-Group Discrimination
Xudong Wang, Ziwei Liu, Stella X. Yu
Unsupervised feature learning has made great strides with contrastive learning based on instance discrimination and invariant mapping, as benchmarked on curated class-balanced datasets.
Pose-Guided Human Animation From a Single Image in the Wild
Jae Shin Yoon, Lingjie Liu, Vladislav Golyanik, Kripasindhu Sarkar, Hyun Soo Park, Christian Theobalt
We present a new pose transfer method for synthesizing a human animation from a single image of a person controlled by a sequence of body poses.
Probabilistic Tracklet Scoring and Inpainting for Multiple Object Tracking
Fatemeh Saleh, Sadegh Aliakbarian, Hamid Rezatofighi, Mathieu Salzmann, Stephen Gould
Despite the recent advances in multiple object tracking (MOT), achieved by joint detection and tracking, dealing with long occlusions remains a challenge.
Linsen Song, Wayne Wu, Chaoyou Fu, Chen Qian, Chen Change Loy, Ran He
We present a new application direction named Pareidolia Face Reenactment, which is defined as animating a static illusory face to move in tandem with a human face in the video.
Counterfactual Zero-Shot and Open-Set Visual Recognition
Zhongqi Yue, Tan Wang, Qianru Sun, Xian-Sheng Hua, Hanwang Zhang
We present a novel counterfactual framework for both Zero-Shot Learning (ZSL) and Open-Set Recognition (OSR), whose common challenge is generalizing to the unseen-classes by only training on the seen-classes.
Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, Feng Xu
Despite previous success in generating audio-driven talking heads, most previous studies focus on the correlation between speech content and the mouth shape.
We propose a generative model of unordered point sets, such as point clouds, in the forms of an energy-based model, where the energy function is parameterized by an input-permutation-invariant bottom-up neural network.
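Permutation invariance of the energy can be obtained with a shared per-point network followed by a symmetric pooling, in the style of PointNet. A minimal sketch with made-up sizes (not the paper's exact parameterization):

```python
# An input-permutation-invariant energy function for point sets.
import torch
import torch.nn as nn

class PointSetEnergy(nn.Module):
    def __init__(self, point_dim=3, hidden=128):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(point_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden))
        self.head = nn.Linear(hidden, 1)

    def forward(self, pts):              # pts: (B, N, 3)
        h = self.point_mlp(pts)          # shared per-point features
        h = h.max(dim=1).values          # symmetric pooling over points
        return self.head(h).squeeze(-1)  # one scalar energy per set

model = PointSetEnergy()
pts = torch.randn(2, 256, 3)
perm = torch.randperm(256)
# Reordering the points leaves the energy exactly unchanged.
assert torch.allclose(model(pts), model(pts[:, perm]))
```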
SimPoE: Simulated Character Control for 3D Human Pose Estimation
Ye Yuan, Shih-En Wei, Tomas Simon, Kris Kitani, Jason Saragih
Accurate estimation of 3D human motion from monocular video requires modeling both kinematics (body motion without physical forces) and dynamics (motion with physical forces).
Knowledge transfer from large teacher models to smaller student models has recently been studied for metric learning, focusing on fine-grained classification.
Pengguang Chen, Shu Liu, Hengshuang Zhao, Jiaya Jia
Knowledge distillation transfers knowledge from the teacher network to the student one, with the goal of greatly improving the performance of the student network.
Revamping Cross-Modal Recipe Retrieval With Hierarchical Transformers and Self-Supervised Learning
Amaia Salvador, Erhan Gundogdu, Loris Bazzani, Michael Donoser
Cross-modal recipe retrieval has recently gained substantial attention due to the importance of food in people's lives, as well as the availability of vast amounts of digital cooking recipes and food images to train machine learning models.
AutoFlow: Learning a Better Training Set for Optical Flow
Deqing Sun, Daniel Vlasic, Charles Herrmann, Varun Jampani, Michael Krainin, Huiwen Chang, Ramin Zabih, William T. Freeman, Ce Liu
Synthetic datasets play a critical role in pre-training CNN models for optical flow, but they are painstaking to generate and hard to adapt to new applications.
MP3: A Unified Model To Map, Perceive, Predict and Plan
Sergio Casas, Abbas Sadat, Raquel Urtasun
High-definition maps (HD maps) are a key component of most modern self-driving systems due to their valuable semantic and geometric information.
Pedro Savarese, David McAllester, Sudarshan Babu, Michael Maire
From a simplified analysis of adaptive methods, we derive AvaGrad, a new optimizer which outperforms SGD on vision tasks when its adaptability is properly tuned.
Improving Unsupervised Image Clustering With Robust Learning
Sungwon Park, Sungwon Han, Sundong Kim, Danu Kim, Sungkyu Park, Seunghoon Hong, Meeyoung Cha
Unsupervised image clustering methods often introduce alternative objectives to indirectly train the model and are subject to faulty predictions and overconfident results.
Curriculum Graph Co-Teaching for Multi-Target Domain Adaptation
Subhankar Roy, Evgeny Krivosheev, Zhun Zhong, Nicu Sebe, Elisa Ricci
In this paper we address multi-target domain adaptation (MTDA), where given one labeled source dataset and multiple unlabeled target datasets that differ in data distributions, the task is to learn a robust predictor for all the target domains.
To address the problem of long-tail distribution for the large vocabulary object detection task, existing methods usually divide all categories into several groups and treat each group with different strategies.
Interpolation-Based Semi-Supervised Learning for Object Detection
Jisoo Jeong, Vikas Verma, Minsung Hyun, Juho Kannala, Nojun Kwak
Although the data labeling cost for object detection tasks is substantially higher than for classification tasks, semi-supervised learning methods for object detection have not been studied much.
Learning Invariant Representations and Risks for Semi-Supervised Domain Adaptation
Bo Li, Yezhen Wang, Shanghang Zhang, Dongsheng Li, Kurt Keutzer, Trevor Darrell, Han Zhao
The success of supervised learning crucially hinges on the assumption that training data matches test data, which rarely holds in practice due to potential distribution shift.
We present a controllable camera simulator based on deep neural networks to synthesize raw image data under different camera settings, including exposure time, ISO, and aperture.
Understanding the nutritional content of food from visual data is a challenging computer vision problem, with the potential to have a positive and widespread impact on public health.
Jean-Baptiste Truong, Pratyush Maini, Robert J. Walls, Nicolas Papernot
Current model extraction attacks assume that the adversary has access to a surrogate dataset with characteristics similar to the proprietary data used to train the victim model.
The Multi-Temporal Urban Development SpaceNet Dataset
Adam Van Etten, Daniel Hogan, Jesus Martinez Manso, Jacob Shermeyer, Nicholas Weir, Ryan Lewis
Satellite imagery analytics have numerous human development and disaster response applications, particularly when time series methods are involved.
Recently, numerous algorithms have been developed to tackle the problem of vision-language navigation (VLN), i.e., requiring an agent to navigate 3D environments by following linguistic instructions.
Fair Attribute Classification Through Latent Space De-Biasing
Vikram V. Ramaswamy, Sunnie S. Y. Kim, Olga Russakovsky
Fairness in visual recognition is becoming a prominent and critical topic of discussion as recognition systems are deployed at scale in the real world.
DANNet: A One-Stage Domain Adaptation Network for Unsupervised Nighttime Semantic Segmentation
Xinyi Wu, Zhenyao Wu, Hao Guo, Lili Ju, Song Wang
Semantic segmentation of nighttime images plays an equally important role as that of daytime images in autonomous driving, but the former is much more challenging due to poor illumination and the difficulty of human annotation.
Prototypical Pseudo Label Denoising and Target Structure Learning for Domain Adaptive Semantic Segmentation
Pan Zhang, Bo Zhang, Ting Zhang, Dong Chen, Yong Wang, Fang Wen
Self-training is a competitive approach in domain adaptive segmentation, which trains the network with the pseudo labels on the target domain.
When Age-Invariant Face Recognition Meets Face Age Synthesis: A Multi-Task Learning Framework
Zhizhong Huang, Junping Zhang, Hongming Shan
To minimize the effects of age variation in face recognition, previous work either extracts identity-related discriminative features by minimizing the correlation between identity- and age-related features, called age-invariant face recognition (AIFR), or removes age variation by transforming the faces of different age groups into the same age group, called face age synthesis (FAS). However, the former lacks visual results for model interpretation, while the latter suffers from artifacts that compromise downstream recognition.
Coarse-Fine Networks for Temporal Activity Detection in Videos
Kumara Kahatapitiya, Michael S. Ryoo
In this paper, we introduce 'Coarse-Fine Networks', a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
Fabio Tosi, Yiyi Liao, Carolin Schmitt, Andreas Geiger
Although stereo matching accuracy has greatly improved with deep learning in the last few years, recovering sharp boundaries and high-resolution outputs efficiently remains challenging.
Hierarchical and Partially Observable Goal-Driven Policy Learning With Goals Relational Graph
Xin Ye, Yezhou Yang
We present a novel two-layer hierarchical reinforcement learning approach equipped with a Goals Relational Graph (GRG) for tackling the partially observable goal-driven task, such as goal-driven visual navigation.
Differentiable SLAM-Net: Learning Particle SLAM for Visual Navigation
Peter Karkus, Shaojun Cai, David Hsu
Simultaneous localization and mapping (SLAM) remains challenging for a number of downstream applications, such as visual robot navigation, because of rapid turns, featureless walls, and poor camera quality.
Predictions of certifiably robust classifiers remain constant in a neighborhood of a point, making them resilient to test-time attacks with a guarantee.
Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder
Crop-based training strategies decouple training resolution from GPU memory consumption, allowing the use of large-capacity panoptic segmentation networks on multi-megapixel images.
Jin Chen, Xijun Wang, Zichao Guo, Xiangyu Zhang, Jian Sun
We propose a new convolution called Dynamic Region-Aware Convolution (DRConv), which can automatically assign multiple filters to corresponding spatial regions where features have similar representation.
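A simplified stand-in for this operator: a small guide branch softly assigns each spatial location to one of m candidate filters (DRConv itself uses a hard assignment with a gradient trick; the soft mixture below is an assumption made for brevity):

```python
# Region-aware convolution sketch: per-pixel soft assignment over m filters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAwareConv1x1(nn.Module):
    def __init__(self, in_ch, out_ch, m=4):
        super().__init__()
        self.guide = nn.Conv2d(in_ch, m, kernel_size=3, padding=1)
        self.filters = nn.ModuleList(nn.Conv2d(in_ch, out_ch, 1) for _ in range(m))

    def forward(self, x):
        assign = F.softmax(self.guide(x), dim=1)              # (B, m, H, W)
        outs = torch.stack([f(x) for f in self.filters], 1)   # (B, m, C', H, W)
        return (assign.unsqueeze(2) * outs).sum(dim=1)        # region-weighted mix

y = RegionAwareConv1x1(16, 32)(torch.randn(2, 16, 8, 8))
print(y.shape)  # torch.Size([2, 32, 8, 8])
```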
Predicting future trajectories of traffic agents in highly interactive environments is an essential and challenging problem for the safe operation of autonomous driving systems.
SuperMix: Supervising the Mixing Data Augmentation
Ali Dabouei, Sobhan Soleymani, Fariborz Taherkhani, Nasser M. Nasrabadi
This paper presents a supervised mixing augmentation method termed SuperMix, which exploits the salient regions within input images to construct mixed training samples.
Back to Event Basics: Self-Supervised Learning of Image Reconstruction for Event Cameras via Photometric Constancy
Federico Paredes-Valles, Guido C. H. E. de Croon
Event cameras are novel vision sensors that sample, in an asynchronous fashion, brightness increments with low latency and high temporal resolution.
PU-GCN: Point Cloud Upsampling Using Graph Convolutional Networks
Guocheng Qian, Abdulellah Abualshour, Guohao Li, Ali Thabet, Bernard Ghanem
The effectiveness of learning-based point cloud upsampling pipelines heavily relies on the upsampling modules and feature extractors used therein.
Andrei Zanfir, Eduard Gabriel Bazavan, Mihai Zanfir, William T. Freeman, Rahul Sukthankar, Cristian Sminchisescu
We present a deep neural network methodology to reconstruct the 3D pose and shape of people, including hand gestures and facial expression, given an input RGB image.
On Learning the Geodesic Path for Incremental Learning
Christian Simon, Piotr Koniusz, Mehrtash Harandi
Neural networks notoriously suffer from the problem of catastrophic forgetting, the phenomenon of forgetting the past knowledge when acquiring new knowledge.
Semi-Supervised Action Recognition With Temporal Contrastive Learning
Ankit Singh, Omprakash Chakraborty, Ashutosh Varshney, Rameswar Panda, Rogerio Feris, Kate Saenko, Abir Das
Learning to recognize actions from only a handful of labeled videos is a challenging problem due to the scarcity of tediously collected activity labels.
Multiple Instance Active Learning for Object Detection
Tianning Yuan, Fang Wan, Mengying Fu, Jianzhuang Liu, Songcen Xu, Xiangyang Ji, Qixiang Ye
Despite the substantial progress of active learning for image recognition, an instance-level active learning method tailored to object detection is still lacking.
Network quantization allows inference to be conducted using low-precision arithmetic for improved inference efficiency of deep neural networks on edge devices.
Source-Free Domain Adaptation for Semantic Segmentation
Yuang Liu, Wei Zhang, Jun Wang
Unsupervised Domain Adaptation (UDA) can tackle the challenge that convolutional neural network (CNN)-based approaches for semantic segmentation heavily rely on the pixel-level annotated data, which is labor-intensive.
Estimating 3D bounding boxes from monocular images is an essential component in autonomous driving, while accurate 3D object detection from this kind of data is very challenging.
This work focuses on object goal visual navigation, aiming at finding the location of an object from a given class, where in each step the agent is provided with an egocentric RGB image of the scene.
Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation
Xinge Zhu, Hui Zhou, Tai Wang, Fangzhou Hong, Yuexin Ma, Wei Li, Hongsheng Li, Dahua Lin
State-of-the-art methods for large-scale driving-scene LiDAR segmentation often project the point clouds to 2D space and then process them via 2D convolution.
Convolutional Dynamic Alignment Networks for Interpretable Classifications
Moritz Bohle, Mario Fritz, Bernt Schiele
We introduce a new family of neural network models called Convolutional Dynamic Alignment Networks (CoDA-Nets), which are performant classifiers with a high degree of inherent interpretability.
Recent work on audio-visual navigation assumes a constantly-sounding target and restricts the role of audio to signaling the target's position.
Keep Your Eyes on the Lane: Real-Time Attention-Guided Lane Detection
Lucas Tabelini, Rodrigo Berriel, Thiago M. Paixao, Claudine Badue, Alberto F. De Souza, Thiago Oliveira-Santos
Modern lane detection methods have achieved remarkable performances in complex real-world scenarios, but many have issues maintaining real-time efficiency, which is important for autonomous vehicles.
PCLs: Geometry-Aware Neural Reconstruction of 3D Pose With Perspective Crop Layers
Frank Yu, Mathieu Salzmann, Pascal Fua, Helge Rhodin
Local processing is an essential feature of CNNs and other neural network architectures -- it is one of the reasons why they work so well on images where relevant information is, to a large extent, local.
Deep Implicit Templates for 3D Shape Representation
Zerong Zheng, Tao Yu, Qionghai Dai, Yebin Liu
Deep implicit functions (DIFs), as a kind of 3D shape representation, are becoming more and more popular in the 3D vision community due to their compactness and strong representation power.
We propose a causal framework to explain the catastrophic forgetting in Class-Incremental Learning (CIL) and then derive a novel distillation method that is orthogonal to the existing anti-forgetting techniques, such as data replay and feature/label distillation.
Divergence Optimization for Noisy Universal Domain Adaptation
Qing Yu, Atsushi Hashimoto, Yoshitaka Ushiku
Universal domain adaptation (UniDA) has been proposed to transfer knowledge learned from a label-rich source domain to a label-scarce target domain without any constraints on the label sets.
Learning the Best Pooling Strategy for Visual Semantic Embedding
Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, Changhu Wang
Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions.
Context-Aware Layout to Image Generation With Enhanced Object Appearance
Sen He, Wentong Liao, Michael Ying Yang, Yongxin Yang, Yi-Zhe Song, Bodo Rosenhahn, Tao Xiang
A layout to image (L2I) generation model aims to generate a complicated image containing multiple objects (things) against natural background (stuff), conditioned on a given layout.
Neural Response Interpretation Through the Lens of Critical Pathways
Ashkan Khakzar, Soroosh Baselizadeh, Saurabh Khanduja, Christian Rupprecht, Seong Tae Kim, Nassir Navab
Is critical input information encoded in specific sparse pathways within the neural network? In this work, we discuss the problem of identifying these critical pathways and subsequently leverage them for interpreting the network's response to an input.
Learning Semantic-Aware Dynamics for Video Prediction
Xinzhu Bei, Yanchao Yang, Stefano Soatto
We propose an architecture and training scheme to predict video frames by explicitly modeling dis-occlusions and capturing the evolution of semantically consistent regions in the video.
We show that the influence of a subset of the training samples can be removed -- or "forgotten" -- from the weights of a network trained on large-scale image classification tasks, and we provide strong computable bounds on the amount of remaining information after forgetting.
Deep Gaussian Scale Mixture Prior for Spectral Compressive Imaging
Tao Huang, Weisheng Dong, Xin Yuan, Jinjian Wu, Guangming Shi
In coded aperture snapshot spectral imaging (CASSI) system, the real-world hyperspectral image (HSI) can be reconstructed from the captured compressive image in a snapshot.
To promote the developments of object detection, tracking and counting algorithms in drone-captured videos, we construct a benchmark with a new drone-captured large-scale dataset, named DroneCrowd, formed by 112 video clips with 33,600 HD frames in various scenarios.
Recent development of Under-Display Camera (UDC) systems provides a true bezel-less and notch-free viewing experience on smartphones (and TVs, laptops, tablets), while allowing images to be captured from the selfie camera embedded underneath.
MetaSAug: Meta Semantic Augmentation for Long-Tailed Visual Recognition
Shuang Li, Kaixiong Gong, Chi Harold Liu, Yulin Wang, Feng Qiao, Xinjing Cheng
Real-world training data usually exhibits long-tailed distribution, where several majority classes have a significantly larger number of samples than the remaining minority classes.
Recent studies propose membership inference (MI) attacks on deep models, where the goal is to infer if a sample has been used in the training process.
Automatic Vertebra Localization and Identification in CT by Spine Rectification and Anatomically-Constrained Optimization
Fakai Wang, Kang Zheng, Le Lu, Jing Xiao, Min Wu, Shun Miao
Accurate vertebra localization and identification are required in many clinical applications of spine disorder diagnosis and surgery planning.
Most differentiable neural architecture search methods construct a super-net for search and derive a target-net as its sub-graph for evaluation.
Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, Shih-Fu Chang
Despite the remarkable accuracy of deep neural networks in object detection, they are costly to train and scale due to supervision requirements.
Instant-Teaching: An End-to-End Semi-Supervised Object Detection Framework
Qiang Zhou, Chaohui Yu, Zhibin Wang, Qi Qian, Hao Li
Supervised learning based object detection frameworks demand plenty of laborious manual annotations, which may not be practical in real applications.
We present a plug-in replacement for batch normalization (BN) called exponential moving average normalization (EMAN), which improves the performance of existing student-teacher based self- and semi-supervised learning techniques.
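The distinguishing ingredient is that the teacher's BatchNorm statistics are updated by the same exponential moving average as its weights, rather than recomputed from batches. A minimal sketch of that update (the momentum value and toy models are illustrative):

```python
# EMAN-style teacher update: EMA over parameters *and* BN buffers.
import torch

@torch.no_grad()
def eman_update(student, teacher, momentum=0.999):
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)
    for bs, bt in zip(student.buffers(), teacher.buffers()):
        if bt.dtype.is_floating_point:   # running_mean / running_var
            bt.mul_(momentum).add_(bs, alpha=1 - momentum)
        else:                            # e.g. num_batches_tracked
            bt.copy_(bs)

student = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8))
teacher = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8))
teacher.load_state_dict(student.state_dict())
teacher.eval()                           # teacher normalizes with EMA statistics
eman_update(student, teacher)
```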
Temporal Context Aggregation Network for Temporal Action Proposal Refinement
Zhiwu Qing, Haisheng Su, Weihao Gan, Dongliang Wang, Wei Wu, Xiang Wang, Yu Qiao, Junjie Yan, Changxin Gao, Nong Sang
Temporal action proposal generation aims to estimate temporal intervals of actions in untrimmed videos, which is a challenging yet important task in the video understanding field.
Look Before You Speak: Visually Contextualized Utterances
Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
While most conversational AI systems focus on textual dialogue only, conditioning utterances on visual context (when it's available) can lead to more realistic conversations.
Deep Optimized Priors for 3D Shape Modeling and Reconstruction
Mingyue Yang, Yuxin Wen, Weikai Chen, Yongwei Chen, Kui Jia
Many learning-based approaches have difficulty scaling to unseen data, as the generality of their learned priors is limited to the scale and variations of the training samples.
Meta Batch-Instance Normalization for Generalizable Person Re-Identification
Seokeon Choi, Taekyung Kim, Minki Jeong, Hyoungseob Park, Changick Kim
Although supervised person re-identification (Re-ID) methods have shown impressive performance, they suffer from a poor generalization capability on unseen domains.
Despite substantial progress in applying neural networks (NN) to a wide variety of areas, they still largely suffer from a lack of transparency and interpretability.
StyleMix: Separating Content and Style for Enhanced Data Augmentation
Minui Hong, Jinwoo Choi, Gunhee Kim
In spite of the great success of deep neural networks for many challenging classification tasks, the learned networks are vulnerable to overfitting and adversarial attacks.
General Multi-Label Image Classification With Transformers
Jack Lanchantin, Tianlu Wang, Vicente Ordonez, Yanjun Qi
Multi-label image classification is the task of predicting a set of labels corresponding to objects, attributes or other entities present in an image.
UAV-Human: A Large Benchmark for Human Behavior Understanding With Unmanned Aerial Vehicles
Tianjiao Li, Jun Liu, Wei Zhang, Yun Ni, Wenqian Wang, Zhiheng Li
Human behavior understanding with unmanned aerial vehicles (UAVs) is of great significance for a wide range of applications, which simultaneously brings an urgent demand for large, challenging, and comprehensive benchmarks for the development and evaluation of UAV-based models.
Neural Prototype Trees for Interpretable Fine-Grained Image Recognition
Meike Nauta, Ron van Bree, Christin Seifert
Prototype-based methods use interpretable representations to address the black-box nature of deep learning models, in contrast to post-hoc explanation methods that only approximate such models.
DeFlow: Learning Complex Image Degradations From Unpaired Data With Conditional Flows
Valentin Wolf, Andreas Lugmayr, Martin Danelljan, Luc Van Gool, Radu Timofte
The difficulty of obtaining paired data remains a major bottleneck for learning image restoration and enhancement models for real-world applications.
In this paper, we present a decomposition model for stereo matching to solve the problem of excessive growth in computational cost (time and memory cost) as the resolution increases.
Transitional Adaptation of Pretrained Models for Visual Storytelling
Youngjae Yu, Jiwan Chung, Heeseung Yun, Jongseok Kim, Gunhee Kim
Previous models for vision-to-language generation tasks usually pretrain a visual encoder and a language generator in the respective domains and jointly finetune them with the target task.
Learning non-rigid registration in an end-to-end manner is challenging due to the inherent high degrees of freedom and the lack of labeled training data.
Domain Adaptation With Auxiliary Target Domain-Oriented Classifier
Jian Liang, Dapeng Hu, Jiashi Feng
Domain adaptation (DA) aims to transfer knowledge from a label-rich but heterogeneous domain to a label-scarce domain, which alleviates labeling effort and has attracted considerable attention.
Multi-Person Implicit Reconstruction From a Single Image
Armin Mustafa, Akin Caliskan, Lourdes Agapito, Adrian Hilton
We present a new end-to-end learning framework to obtain detailed and spatially coherent reconstructions of multiple people from a single image.
Offboard 3D Object Detection From Point Cloud Sequences
Charles R. Qi, Yin Zhou, Mahyar Najibi, Pei Sun, Khoa Vo, Boyang Deng, Dragomir Anguelov
While current 3D object recognition research mostly focuses on the real-time, onboard scenario, there are many offboard use cases of perception that are largely under-explored, such as using machines to automatically generate high-quality 3D labels.
Backdoor Attacks Against Deep Learning Systems in the Physical World
Emily Wenger, Josephine Passananti, Arjun Nitin Bhagoji, Yuanshun Yao, Haitao Zheng, Ben Y. Zhao
Backdoor attacks embed hidden malicious behaviors into deep learning models, which only activate and cause misclassifications on model inputs containing a specific "trigger." Existing works on backdoor attacks and defenses, however, mostly focus on digital attacks that apply digitally generated patterns as triggers.
Neural Splines: Fitting 3D Surfaces With Infinitely-Wide Neural Networks
Francis Williams, Matthew Trager, Joan Bruna, Denis Zorin
We present Neural Splines, a technique for 3D surface reconstruction that is based on random feature kernels arising from infinitely-wide shallow ReLU networks.
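The random-feature view of such kernels is easy to sketch: freeze a random ReLU layer, use its activations as features, and solve only the linear output layer in closed form. The toy sphere-fitting data and all sizes below are illustrative, not the paper's pipeline:

```python
# Random ReLU features + closed-form ridge solve for an implicit function.
import numpy as np

rng = np.random.default_rng(0)
n_feat, lam = 2048, 1e-6

# Toy supervision: points on the unit sphere get value 0; slightly offset
# points along the radius get +/- eps (a common implicit-fitting trick).
pts = rng.normal(size=(500, 3)); pts /= np.linalg.norm(pts, axis=1, keepdims=True)
eps = 0.05
X = np.concatenate([pts, (1 + eps) * pts, (1 - eps) * pts])
y = np.concatenate([np.zeros(500), np.full(500, eps), np.full(500, -eps)])

W = rng.normal(size=(3, n_feat))        # frozen random first layer
b = rng.normal(size=n_feat)
Phi = np.maximum(X @ W + b, 0.0) / np.sqrt(n_feat)

A = Phi.T @ Phi + lam * np.eye(n_feat)  # ridge regression, output layer only
w = np.linalg.solve(A, Phi.T @ y)

f = lambda q: np.maximum(q @ W + b, 0.0) / np.sqrt(n_feat) @ w
# Values near the trained radii should be close to +eps (outside) / -eps (inside).
print(f(np.array([[0.0, 0.0, 1.05]])), f(np.array([[0.0, 0.0, 0.95]])))
```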
Differentiable Multi-Granularity Human Representation Learning for Instance-Aware Human Semantic Parsing
Tianfei Zhou, Wenguan Wang, Si Liu, Yi Yang, Luc Van Gool
To address the challenging task of instance-aware human part parsing, a new bottom-up regime is proposed to learn category-level human semantic segmentation as well as multi-person pose estimation in a joint and end-to-end manner.
Yuhao Zhu, Qi Li, Jian Wang, Cheng-Zhong Xu, Zhenan Sun
Face swapping has both positive applications such as entertainment, human-computer interaction, etc., and negative applications such as DeepFake threats to politics, economics, etc.
RobustNet: Improving Domain Generalization in Urban-Scene Segmentation via Instance Selective Whitening
Sungha Choi, Sanghun Jung, Huiwon Yun, Joanne T. Kim, Seungryong Kim, Jaegul Choo
Enhancing the generalization capability of deep neural networks to unseen domains is crucial for safety-critical applications in the real world such as autonomous driving.
Group Whitening: Balancing Learning Efficiency and Representational Capacity
Lei Huang, Yi Zhou, Li Liu, Fan Zhu, Ling Shao
Batch normalization (BN) is an important technique commonly incorporated into deep learning models to perform standardization within mini-batches.
Conditional generative adversarial networks (cGANs) aim to synthesize diverse images given the input conditions and latent codes, but unfortunately they usually suffer from the issue of mode collapse.
StyleMeUp: Towards Style-Agnostic Sketch-Based Image Retrieval
Aneeshan Sain, Ayan Kumar Bhunia, Yongxin Yang, Tao Xiang, Yi-Zhe Song
Sketch-based image retrieval (SBIR) is a cross-modal matching problem which is typically solved by learning a joint embedding space where the semantic content shared between photo and sketch modalities is preserved.
SOE-Net: A Self-Attention and Orientation Encoding Network for Point Cloud Based Place Recognition
Yan Xia, Yusheng Xu, Shuang Li, Rui Wang, Juan Du, Daniel Cremers, Uwe Stilla
We tackle the problem of place recognition from point cloud data and introduce a self-attention and orientation encoding network (SOE-Net) that fully explores the relationship between points and incorporates long-range context into point-wise local descriptors.
Anti-Aliasing Semantic Reconstruction for Few-Shot Semantic Segmentation
Binghao Liu, Yao Ding, Jianbin Jiao, Xiangyang Ji, Qixiang Ye
Encouraging progress in few-shot semantic segmentation has been made by leveraging features learned upon base classes with sufficient training data to represent novel classes with few-shot examples.
Searching for Fast Model Families on Datacenter Accelerators
Sheng Li, Mingxing Tan, Ruoming Pang, Andrew Li, Liqun Cheng, Quoc V. Le, Norman P. Jouppi
Neural Architecture Search (NAS), together with model scaling, has shown remarkable progress in designing high accuracy and fast convolutional architecture families.
Visual salient object detection (SOD) aims at finding the salient object(s) that attract human attention, while camouflaged object detection (COD), on the contrary, intends to discover the camouflaged object(s) hidden in their surroundings.
Diffusion Probabilistic Models for 3D Point Cloud Generation
Shitong Luo, Wei Hu
We present a probabilistic model for point cloud generation, which is fundamental for various 3D vision tasks such as shape completion, upsampling, synthesis and data augmentation.
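A generic denoising-diffusion training step for point clouds, sketched under the usual DDPM parameterization: noise the clean cloud in closed form, then regress the added noise. The placeholder denoiser and schedule values are assumptions, not the paper's architecture:

```python
# One diffusion training step on point clouds (epsilon-matching objective).
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# Trivial per-point denoiser conditioned on a normalized timestep feature.
denoiser = nn.Sequential(nn.Linear(3 + 1, 128), nn.ReLU(), nn.Linear(128, 3))

x0 = torch.randn(16, 2048, 3)                    # a batch of point clouds
t = torch.randint(0, T, (16,))
a = alphas_bar[t].view(-1, 1, 1)
noise = torch.randn_like(x0)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise     # closed-form forward process

t_feat = (t.float() / T).view(-1, 1, 1).expand(-1, x0.shape[1], 1)
pred = denoiser(torch.cat([x_t, t_feat], dim=-1))  # per-point noise prediction
loss = ((pred - noise) ** 2).mean()
loss.backward()
```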
MetaSCI: Scalable and Adaptive Reconstruction for Video Compressive Sensing
Zhengjue Wang, Hao Zhang, Ziheng Cheng, Bo Chen, Xin Yuan
To capture high-speed videos using a two-dimensional detector, video snapshot compressive imaging (SCI) is a promising system, where the video frames are coded by different masks and then compressed to a snapshot measurement.
Prototype-Supervised Adversarial Network for Targeted Attack of Deep Hashing
Xunguang Wang, Zheng Zhang, Baoyuan Wu, Fumin Shen, Guangming Lu
Due to its powerful capability of representation learning and high-efficiency computation, deep hashing has made significant progress in large-scale image retrieval.
Understanding the Robustness of Skeleton-Based Action Recognition Under Adversarial Attack
He Wang, Feixiang He, Zhexi Peng, Tianjia Shao, Yong-Liang Yang, Kun Zhou, David Hogg
Action recognition has been heavily employed in many applications such as autonomous vehicles, surveillance, etc., where its robustness is a primary concern.
Robust Instance Segmentation Through Reasoning About Multi-Object Occlusion
Xiaoding Yuan, Adam Kortylewski, Yihong Sun, Alan Yuille
Analyzing complex scenes with Deep Neural Networks is a challenging task, particularly when images contain multiple objects that partially occlude each other.
Approaches based on deep neural networks have achieved striking performance when testing data and training data share similar distribution, but can significantly fail otherwise.
Due to the intensive cost of labor and expertise in annotating 3D medical images at a voxel level, most benchmark datasets are equipped with the annotations of only one type of organs and/or tumors, resulting in the so-called partially labeling issue.
Improving Sign Language Translation With Monolingual Data by Sign Back-Translation
Hao Zhou, Wengang Zhou, Weizhen Qi, Junfu Pu, Houqiang Li
Despite existing pioneering works on sign language translation (SLT), there is a non-trivial obstacle, i.e., the limited quantity of parallel sign-text data.
Spatially-Varying Outdoor Lighting Estimation From Intrinsics
Yongjie Zhu, Yinda Zhang, Si Li, Boxin Shi
We present SOLID-Net, a neural network for spatially-varying outdoor lighting estimation from a single outdoor image for any 2D pixel location.
Reciprocal Landmark Detection and Tracking With Extremely Few Annotations
Jianzhe Lin, Ghazal Sahebzamani, Christina Luong, Fatemeh Taheri Dezaki, Mohammad Jafari, Purang Abolmaesumi, Teresa Tsang
Localization of anatomical landmarks to perform two-dimensional measurements in echocardiography is part of routine clinical workflow in cardiac disease diagnosis.
Model quantization is a promising approach to compress deep neural networks and accelerate inference, making it possible to deploy them on mobile and edge devices.
HumanGPS: Geodesic PreServing Feature for Dense Human Correspondences
Feitong Tan, Danhang Tang, Mingsong Dou, Kaiwen Guo, Rohit Pandey, Cem Keskin, Ruofei Du, Deqing Sun, Sofien Bouaziz, Sean Fanello, Ping Tan, Yinda Zhang
In this paper, we address the problem of building pixel-wise dense correspondences between human images under arbitrary camera viewpoints and body poses.
Depth-Conditioned Dynamic Message Propagation for Monocular 3D Object Detection
Li Wang, Liang Du, Xiaoqing Ye, Yanwei Fu, Guodong Guo, Xiangyang Xue, Jianfeng Feng, Li Zhang
The objective of this paper is to learn context- and depth-aware feature representation to solve the problem of monocular 3D object detection.
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark to advance video understanding from describing to explaining the temporal actions.
Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram, Alan Bovik
No-reference (NR) perceptual video quality assessment (VQA) is a complex, unsolved, and important problem for social and streaming media applications.
Are Labels Always Necessary for Classifier Accuracy Evaluation?
Weijian Deng, Liang Zheng
To calculate the model accuracy on a computer vision task, e.g., object recognition, we usually require a test set composed of test samples and their ground truth labels.
One Thing One Click: A Self-Training Approach for Weakly Supervised 3D Semantic Segmentation
Zhengzhe Liu, Xiaojuan Qi, Chi-Wing Fu
Point cloud semantic segmentation often requires large-scale annotated training data, but clearly, point-wise labels are too tedious to prepare.
Dive Into Ambiguity: Latent Distribution Mining and Pairwise Uncertainty Estimation for Facial Expression Recognition
Jiahui She, Yibo Hu, Hailin Shi, Jun Wang, Qiu Shen, Tao Mei
Due to the subjective annotation and the inherent inter-class similarity of facial expressions, one of the key challenges in Facial Expression Recognition (FER) is the annotation ambiguity.
Learning Better Visual Dialog Agents With Pretrained Visual-Linguistic Representation
Tao Tu, Qing Ping, Govindarajan Thattai, Gokhan Tur, Prem Natarajan
GuessWhat?! is a visual dialog guessing game which incorporates a Questioner agent that generates a sequence of questions, while an Oracle agent answers the respective questions about a target object in an image.
Unsupervised Discovery of the Long-Tail in Instance Segmentation Using Hierarchical Self-Supervision
Zhenzhen Weng, Mehmet Giray Ogut, Shai Limonchik, Serena Yeung
Instance segmentation is an active topic in computer vision that is usually solved by using supervised learning approaches over very large datasets composed of object-level masks.
TPCN: Temporal Point Cloud Networks for Motion Forecasting
Maosheng Ye, Tongyi Cao, Qifeng Chen
We propose the Temporal Point Cloud Networks (TPCN), a novel and flexible framework with joint spatial and temporal learning for trajectory prediction.
This paper proposes a framework for the interactive video object segmentation (VOS) in the wild where users can choose some frames for annotations iteratively.
Few-Shot Incremental Learning With Continually Evolved Classifiers
Chi Zhang, Nan Song, Guosheng Lin, Yun Zheng, Pan Pan, Yinghui Xu
Few-shot class-incremental learning (FSCIL) aims to design machine learning algorithms that can continually learn new concepts from a few data points, without forgetting knowledge of old classes.
How to efficiently represent camera pose is an essential problem in 3D computer vision, especially in tasks like camera pose regression and novel view synthesis.
Classifiers that are linear in their parameters, and trained by optimizing a convex loss function, have predictable behavior with respect to changes in the training data, initial conditions, and optimization.
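That predictability is easiest to see in a model with a quadratic loss, where the optimum is available in closed form and the exact effect of removing a training sample is directly computable. A toy illustration with ridge regression (the link to forgetting is our gloss, not a quote of the paper):

```python
# Closed-form ridge regression: the effect of data changes is exact.
import numpy as np

rng = np.random.default_rng(1)
X, y, lam = rng.normal(size=(100, 5)), rng.normal(size=100), 1e-2

def ridge(X, y):
    # w* = (X^T X + lam I)^{-1} X^T y, the unique convex-loss optimum.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_full = ridge(X, y)
w_wo0 = ridge(X[1:], y[1:])            # exact model after removing sample 0
print(np.linalg.norm(w_full - w_wo0))  # the sample's exact effect on the weights
```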
More Photos Are All You Need: Semi-Supervised Learning for Fine-Grained Sketch Based Image Retrieval
Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Aneeshan Sain, Yongxin Yang, Tao Xiang, Yi-Zhe Song
A fundamental challenge faced by existing Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) models is the data scarcity -- model performances are largely bottlenecked by the lack of sketch-photo pairs.
InverseForm: A Loss Function for Structured Boundary-Aware Segmentation
Shubhankar Borse, Ying Wang, Yizhe Zhang, Fatih Porikli
We present a novel boundary-aware loss term for semantic segmentation using an inverse-transformation network, which efficiently learns the degree of parametric transformations between estimated and target boundaries.
Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Recognition
Chun-Fu Richard Chen, Rameswar Panda, Kandan Ramakrishnan, Rogerio Feris, John Cohn, Aude Oliva, Quanfu Fan
In recent years, a number of approaches based on 2D or 3D convolutional neural networks (CNN) have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets.
The primary goal of knowledge distillation (KD) is to encapsulate the information of a model learned from a teacher network into a student network, with the latter being more compact than the former.
Generalizable Person Re-Identification With Relevance-Aware Mixture of Experts
Yongxing Dai, Xiaotong Li, Jun Liu, Zekun Tong, Ling-Yu Duan
Domain generalizable (DG) person re-identification (ReID) is a challenging problem because we cannot access any unseen target domain data during training.
Deformed Implicit Field: Modeling 3D Shapes With Learned Dense Correspondence
Yu Deng, Jiaolong Yang, Xin Tong
We propose a novel Deformed Implicit Field (DIF) representation for modeling 3D shapes of a category and generating dense correspondences among shapes.
AlphaMatch: Improving Consistency for Semi-Supervised Learning With Alpha-Divergence
Chengyue Gong, Dilin Wang, Qiang Liu
Semi-supervised learning (SSL) is a key approach toward more data-efficient machine learning by jointly leveraging both labeled and unlabeled data.
Reinforced Attention for Few-Shot Learning and Beyond
Jie Hong, Pengfei Fang, Weihao Li, Tong Zhang, Christian Simon, Mehrtash Harandi, Lars Petersson
Few-shot learning aims to correctly recognize query samples from unseen classes given a limited number of support samples, often by relying on global embeddings of images.
A Multiplexed Network for End-to-End, Multilingual OCR
Jing Huang, Guan Pang, Rama Kovvuri, Mandy Toh, Kevin J Liang, Praveen Krishnan, Xi Yin, Tal Hassner
Recent advances in OCR have shown that an end-to-end (E2E) training pipeline that includes both detection and recognition leads to the best results.
Semi-Supervised 3D Hand-Object Poses Estimation With Interactions in Time
Shaowei Liu, Hanwen Jiang, Jiarui Xu, Sifei Liu, Xiaolong Wang
Estimating 3D hand and object pose from a single image is an extremely challenging problem: hands and objects are often self-occluded during interactions, and the 3D annotations are scarce as even humans cannot directly label the ground-truths from a single image perfectly.
Unsupervised Part Segmentation Through Disentangling Appearance and Shape
Shilong Liu, Lei Zhang, Xiao Yang, Hang Su, Jun Zhu
We study the problem of unsupervised discovery and segmentation of object parts, which, as an intermediate local representation, are capable of finding intrinsic object structure and providing more explainable recognition results.
QAIR: Practical Query-Efficient Black-Box Attacks for Image Retrieval
Xiaodan Li, Jinfeng Li, Yuefeng Chen, Shaokai Ye, Yuan He, Shuhui Wang, Hang Su, Hui Xue
We study the query-based attack against image retrieval to evaluate its robustness against adversarial examples under the black-box setting, where the adversary only has query access to the top-k ranked unlabeled images from the database.
Removing the Background by Adding the Background: Towards Background Robust Self-Supervised Video Representation Learning
Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J. Ma, Hao Cheng, Pai Peng, Feiyue Huang, Rongrong Ji, Xing Sun
Self-supervised learning has shown great potential in improving the video representation ability of deep neural networks by getting supervision from the data itself.
Unsupervised Multi-Source Domain Adaptation Without Access to Source Data
Sk Miraj Ahmed, Dripta S. Raychaudhuri, Sujoy Paul, Samet Oymak, Amit K. Roy-Chowdhury
Unsupervised Domain Adaptation (UDA) aims to learn a predictor model for an unlabeled dataset by transferring knowledge from a labeled source data, which has been trained on similar tasks.
Hamad Ahmed, Ronnie B. Wilbur, Hari M. Bharadwaj, Jeffrey Mark Siskind
New results suggest strong limits to the feasibility of object classification from human brain activity evoked by image stimuli, as measured through EEG.
Polka Lines: Learning Structured Illumination and Reconstruction for Active Stereo
Seung-Hwan Baek, Felix Heide
Active stereo cameras that recover depth from structured light captures have become a cornerstone sensor modality for 3D scene reconstruction and understanding tasks across application domains.
Architectural Adversarial Robustness: The Case for Deep Pursuit
George Cazenavette, Calvin Murdock, Simon Lucey
Despite their unmatched performance, deep neural networks remain susceptible to targeted attacks by nearly imperceptible levels of adversarial noise.
Semantic-Aware Knowledge Distillation for Few-Shot Class-Incremental Learning
Ali Cheraghian, Shafin Rahman, Pengfei Fang, Soumava Kumar Roy, Lars Petersson, Mehrtash Harandi
Few-shot class incremental learning (FSCIL) portrays the problem of learning new concepts gradually, where only a few examples per concept are available to the learner.
Lips Don't Lie: A Generalisable and Robust Approach To Face Forgery Detection
Alexandros Haliassos, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Although current deep learning-based face forgery detectors achieve impressive performance in constrained scenarios, they are vulnerable to samples created by unseen manipulation methods.
A source model trained on source data and a target model learned through unsupervised domain adaptation (UDA) usually encode different knowledge.
Recently, unsupervised domain adaptation for the semantic segmentation task has become more and more popular due to the high cost of pixel-level annotation on real-world images.
UniT: Unified Knowledge Transfer for Any-Shot Object Detection and Segmentation
Siddhesh Khandelwal, Raghav Goyal, Leonid Sigal
Methods for object detection and segmentation rely on large scale instance-level annotations for training, which are difficult and time-consuming to collect.
Beyond Max-Margin: Class Margin Equilibrium for Few-Shot Object Detection
Bohao Li, Boyu Yang, Chang Liu, Feng Liu, Rongrong Ji, Qixiang Ye
Few-shot object detection has made encouraging progress by reconstructing novel class objects using the feature representation learned upon a set of base classes.
D2IM-Net: Learning Detail Disentangled Implicit Fields From Single Images
Manyi Li, Hao Zhang
We present the first single-view 3D reconstruction network aimed at recovering geometric details from an input image which encompass both topological shape structures and surface features.
PixMatch: Unsupervised Domain Adaptation via Pixelwise Consistency Training
Luke Melas-Kyriazi, Arjun K. Manrai
Unsupervised domain adaptation is a promising technique for semantic segmentation and other computer vision tasks for which large-scale data annotation is costly and time-consuming.
Physical adversarial examples for camera-based computer vision have so far been achieved through visible artifacts -- a sticker on a Stop sign, colorful borders around eyeglasses or a 3D printed object with a colorful texture.
Open Domain Generalization with Domain-Augmented Meta-Learning
Yang Shu, Zhangjie Cao, Chenyu Wang, Jianmin Wang, Mingsheng Long
Leveraging datasets available to learn a model with high generalization ability to unseen domains is important for computer vision, especially when the unseen domain's annotated data are unavailable.
Current deep learning architectures suffer from catastrophic forgetting, a failure to retain knowledge of previously learned classes when incrementally trained on new classes.
Learnable Companding Quantization for Accurate Low-Bit Neural Networks
Kohei Yamamoto
Quantizing deep neural networks is an effective method for reducing memory consumption and improving inference speed, and is thus useful for implementation in resource-constrained devices.
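Companding quantization compresses values with a non-linear map, quantizes uniformly, then expands back, so levels are spent where values concentrate. The sketch below uses fixed mu-law companding as a stand-in for the paper's learnable companding functions:

```python
# Mu-law companding fake-quantizer (illustrative stand-in for learnable companding).
import torch

def mu_law_fake_quant(x, bits=4, mu=255.0):
    # Compress: y = sign(x) * log(1 + mu|x|) / log(1 + mu), for |x| <= 1.
    y = torch.sign(x) * torch.log1p(mu * x.abs()) / torch.log1p(torch.tensor(mu))
    n = 2 ** (bits - 1) - 1                     # symmetric signed levels
    y_q = torch.round(y.clamp(-1, 1) * n) / n   # uniform quantization in y-space
    # Expand: x = sign(y) * ((1 + mu)^|y| - 1) / mu.
    return torch.sign(y_q) * ((1 + mu) ** y_q.abs() - 1) / mu

w = torch.randn(1000) * 0.1
w_q = mu_law_fake_quant(w)
print((w - w_q).abs().mean())  # small error near zero, by design
```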
Interactive Self-Training With Mean Teachers for Semi-Supervised Object Detection
Qize Yang, Xihan Wei, Biao Wang, Xian-Sheng Hua, Lei Zhang
The goal of semi-supervised object detection is to learn a detection model using only a few labeled data and large amounts of unlabeled data, thereby reducing the cost of data labeling.
Closing the Loop: Joint Rain Generation and Removal via Disentangled Image Translation
Yuntong Ye, Yi Chang, Hanyu Zhou, Luxin Yan
Existing deep learning-based image deraining methods have achieved promising performance for synthetic rainy images, but typically rely on pairs of sharp images and their simulated rainy counterparts.
Unsupervised Domain Adaptive (UDA) person re-identification (ReID) aims at adapting the model trained on a labeled source-domain dataset to a target-domain dataset without any further annotations.
Mohammadreza Armandpour, Ali Sadeghian, Chunyuan Li, Mingyuan Zhou
Despite the success of Generative Adversarial Networks (GANs), their training suffers from several well-known problems, including mode collapse and difficulties learning a disconnected set of manifolds.
ReAgent: Point Cloud Registration Using Imitation and Reinforcement Learning
Dominik Bauer, Timothy Patten, Markus Vincze
Point cloud registration is a common step in many 3D computer vision tasks such as object pose estimation, where a 3D model is aligned to an observation.
FBI-Denoiser: Fast Blind Image Denoiser for Poisson-Gaussian Noise
Jaeseok Byun, Sungmin Cha, Taesup Moon
We consider the challenging blind denoising problem for Poisson-Gaussian noise, in which no additional information about clean images or noise level parameters is available.
Human-Like Controllable Image Captioning With Verb-Specific Semantic Roles
Long Chen, Zhihong Jiang, Jun Xiao, Wei Liu
Controllable Image Captioning (CIC) -- generating image descriptions following designated control signals -- has received unprecedented attention over the last few years.
Unbiased Mean Teacher for Cross-Domain Object Detection
Jinhong Deng, Wen Li, Yuhua Chen, Lixin Duan
Cross-domain object detection is challenging, because object detection models are often vulnerable to data variance, especially to the considerable domain shift between two distinctive domains.
Adaptive Methods for Real-World Domain Generalization
Abhimanyu Dubey, Vignesh Ramanathan, Alex Pentland, Dhruv Mahajan
Invariant approaches have been remarkably successful in tackling the problem of domain generalization, where the objective is to perform inference on data distributions different from those used in training.
The task of motion transfer between a source dancer and a target person is a special case of the pose transfer problem, in which the target person changes their pose in accordance with the motions of the dancer.
Interpreting Super-Resolution Networks With Local Attribution Maps
Jinjin Gu, Chao Dong
Image super-resolution (SR) techniques have been developing rapidly, benefiting from the invention of deep networks and their successive breakthroughs.
Shihao Jiang, Yao Lu, Hongdong Li, Richard Hartley
State-of-the-art neural network models for optical flow estimation require a dense correlation volume at high resolutions for representing per-pixel displacement.
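The dense correlation volume in question dots every feature vector of one frame with every feature vector of the other, which is why its memory grows as O((HW)^2). A minimal sketch with illustrative feature sizes:

```python
# Dense 4D correlation volume between two feature maps.
import torch

f1 = torch.randn(1, 64, 46, 62)  # (B, C, H, W) features of frame 1
f2 = torch.randn(1, 64, 46, 62)  # features of frame 2
corr = torch.einsum("bchw,bcij->bhwij", f1, f2) / 64 ** 0.5
print(corr.shape)                # torch.Size([1, 46, 62, 46, 62])
```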
Cross-Domain Adaptive Clustering for Semi-Supervised Domain Adaptation
Jichang Li, Guanbin Li, Yemin Shi, Yizhou Yu
In semi-supervised domain adaptation, a few labeled samples per class in the target domain guide features of the remaining target samples to aggregate around them.
The success of deep learning has led to intense growth and interest in computer vision, along with concerns about its potential impact on society.
A Fourier-Based Framework for Domain Generalization
Qinwei Xu, Ruipeng Zhang, Ya Zhang, Yanfeng Wang, Qi Tian
Modern deep neural networks suffer from performance degradation when evaluated on testing data under different distributions from training data. [Expand]
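The core intuition behind Fourier-based domain generalization is that the amplitude spectrum tends to carry style/domain cues while the phase carries semantics. A minimal amplitude-interpolation sketch of that idea (a generic illustration, not the paper's exact augmentation):

```python
import numpy as np

def amplitude_mix(img_a, img_b, lam=0.5):
    """Interpolate the Fourier amplitude of img_a towards img_b while
    keeping img_a's phase. Works on HxW or HxWxC arrays."""
    fa = np.fft.fft2(img_a, axes=(0, 1))
    fb = np.fft.fft2(img_b, axes=(0, 1))
    amp = (1 - lam) * np.abs(fa) + lam * np.abs(fb)  # mixed amplitude
    mixed = amp * np.exp(1j * np.angle(fa))          # original phase
    return np.real(np.fft.ifft2(mixed, axes=(0, 1)))
```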
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots. [Expand]
Deeply Shape-Guided Cascade for Instance Segmentation
Hao Ding, Siyuan Qiao, Alan Yuille, Wei Shen
The key to a successful cascade architecture for precise instance segmentation is to fully leverage the relationship between bounding box detection and mask segmentation across multiple stages. [Expand]
Distilling Object Detectors via Decoupled Features
Jianyuan Guo, Kai Han, Yunhe Wang, Han Wu, Xinghao Chen, Chunjing Xu, Chang Xu
Knowledge distillation is a widely used paradigm for transferring information from a complicated teacher network to a compact student network while maintaining strong performance. [Expand]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task, while exploiting both frame-level and video-level information. [Expand]
Invertible Denoising Network: A Light Solution for Real Noise Removal
Yang Liu, Zhenyue Qin, Saeed Anwar, Pan Ji, Dongwoo Kim, Sabrina Caldwell, Tom Gedeon
Invertible networks have various benefits for image denoising since they are lightweight, information-lossless, and memory-saving during back-propagation. [Expand]
Coarse-To-Fine Domain Adaptive Semantic Segmentation With Photometric Alignment and Category-Center Regularization
Haoyu Ma, Xiangru Lin, Zifeng Wu, Yizhou Yu
Unsupervised domain adaptation (UDA) in semantic segmentation is a fundamental yet promising task relieving the need for laborious annotation work. [Expand]
Despite advances in feature representation, leveraging geometric relations is crucial for establishing reliable visual correspondences under large variations of images. [Expand]
Weakly supervised object localization (WSOL) remains an open problem due to the deficiency of finding object extent information using a classification network. [Expand]
Learning Dynamic Network Using a Reuse Gate Function in Semi-Supervised Video Object Segmentation
Hyojin Park, Jayeon Yoo, Seohyeong Jeong, Ganesh Venkatesh, Nojun Kwak
Current state-of-the-art approaches for Semi-supervised Video Object Segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame. [Expand]
HoHoNet: 360 Indoor Holistic Understanding With Latent Horizontal Features
Cheng Sun, Min Sun, Hwann-Tzong Chen
We present HoHoNet, a versatile and efficient framework for holistic understanding of an indoor 360-degree panorama using a Latent Horizontal Feature (LHFeat). [Expand]
Consensus Maximisation Using Influences of Monotone Boolean Functions
Ruwan Tennakoon, David Suter, Erchuan Zhang, Tat-Jun Chin, Alireza Bab-Hadiashar
Consensus maximisation (MaxCon), widely used for robust fitting in computer vision, aims to find the largest subset of data that fits the model within some tolerance level. [Expand]
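For context, the de-facto baseline for consensus maximisation is random hypothesis sampling with inlier counting. A minimal RANSAC-style sketch for 2D line fitting (a generic baseline, not the paper's Boolean-influence solver):

```python
import numpy as np

def max_consensus_line(points, tol=0.1, iters=1000, seed=0):
    """Find the 2D line agreeing with the largest subset of `points`
    within `tol`, via random 2-point hypotheses and inlier counting."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        d = points[j] - points[i]
        n = np.array([-d[1], d[0]], dtype=float)       # line normal
        norm = np.linalg.norm(n)
        if norm < 1e-12:
            continue                                   # degenerate sample
        n = n / norm
        inliers = np.abs((points - points[i]) @ n) <= tol  # point-line dist
        if inliers.sum() > best.sum():
            best = inliers
    return best
```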
Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules
Aisha Urooj, Hilde Kuehne, Kevin Duarte, Chuang Gan, Niels Lobo, Mubarak Shah
The problem of grounding VQA tasks has seen increased attention in the research community recently, with most attempts usually focusing on solving this task by using pretrained object detectors. [Expand]
Efficient Feature Transformations for Discriminative and Generative Continual Learning
Vinay Kumar Verma, Kevin J Liang, Nikhil Mehta, Piyush Rai, Lawrence Carin
As neural networks are increasingly being applied to real-world applications, mechanisms to address distributional shift and sequential task learning without forgetting are critical. [Expand]
Recent work has made early progress on automatically learning the association between voice and face, bringing a new wave of studies to the computer vision community. [Expand]
Yuang Zhang, Huanyu He, Jianguo Li, Yuxi Li, John See, Weiyao Lin
Pedestrian detection in a crowd is a challenging task due to a high number of mutually-occluding human instances, which brings ambiguity and optimization difficulties to the current IoU-based ground truth assignment procedure in classical object detection methods. [Expand]
Camera Pose Matters: Improving Depth Prediction by Mitigating Pose Distribution Bias
Yunhan Zhao, Shu Kong, Charless Fowlkes
Monocular depth predictors are typically trained on large-scale training sets which are naturally biased w.r.t. the distribution of camera poses. [Expand]
Riggable 3D Face Reconstruction via In-Network Optimization
Ziqian Bai, Zhaopeng Cui, Xiaoming Liu, Ping Tan
This paper presents a method for riggable 3D face reconstruction from monocular images, which jointly estimates a personalized face rig and per-image parameters including expressions, poses, and illuminations. [Expand]
Limitations of Post-Hoc Feature Alignment for Robustness
Collin Burns, Jacob Steinhardt
Feature alignment is an approach to improving robustness to distribution shift that matches the distribution of feature activations between the training distribution and test distribution. [Expand]
Semi-Supervised Domain Adaptation Based on Dual-Level Domain Mixing for Semantic Segmentation
Shuaijun Chen, Xu Jia, Jianzhong He, Yongjie Shi, Jianzhuang Liu
Data-driven approaches, in spite of great success in many tasks, generalize poorly when applied to unseen image domains, and require expensive annotation, especially for dense pixel prediction tasks such as semantic segmentation. [Expand]
Cloud2Curve: Generation and Vectorization of Parametric Sketches
Ayan Das, Yongxin Yang, Timothy M. Hospedales, Tao Xiang, Yi-Zhe Song
Analysis of human sketches in deep learning has advanced immensely through the use of waypoint-sequences rather than raster-graphic representations. [Expand]
Adversarial Laser Beam: Effective Physical-World Attack to DNNs in a Blink
Ranjie Duan, Xiaofeng Mao, A. K. Qin, Yuefeng Chen, Shaokai Ye, Yuan He, Yun Yang
Though it is well known that the performance of deep neural networks (DNNs) degrades under certain light conditions, there has been no study of the threat posed by light beams emitted from a physical source acting as an adversarial attacker on DNNs in a real-world scenario. [Expand]
WOAD: Weakly Supervised Online Action Detection in Untrimmed Videos
Mingfei Gao, Yingbo Zhou, Ran Xu, Richard Socher, Caiming Xiong
Online action detection in untrimmed videos aims to identify an action as it happens, which makes it very important for real-time applications. [Expand]
Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-Localization in Large Scenes From Body-Mounted Sensors
Vladimir Guzov, Aymen Mir, Torsten Sattler, Gerard Pons-Moll
We introduce the Human POSEitioning System (HPS), a method to recover the full 3D pose of a human registered with a 3D scan of the surrounding environment using wearable sensors. [Expand]
Heterogeneous Grid Convolution for Adaptive, Efficient, and Controllable Computation
Ryuhei Hamaguchi, Yasutaka Furukawa, Masaki Onishi, Ken Sakurada
This paper proposes a novel heterogeneous grid convolution that builds a graph-based image representation by exploiting heterogeneity in the image content, enabling adaptive, efficient, and controllable computations in a convolutional architecture. [Expand]
ChallenCap: Monocular 3D Capture of Challenging Human Performances Using Multi-Modal References
Yannan He, Anqi Pang, Xin Chen, Han Liang, Minye Wu, Yuexin Ma, Lan Xu
Capturing challenging human motions is critical for numerous applications, but it suffers from complex motion patterns and severe self-occlusion under the monocular setting. [Expand]
Interpretable Social Anchors for Human Trajectory Forecasting in Crowds
Parth Kothari, Brian Sifringer, Alexandre Alahi
Human trajectory forecasting in crowds, at its core, is a sequence prediction problem with specific challenges of capturing inter-sequence dependencies (social interactions) and consequently predicting socially-compliant multimodal distributions. [Expand]
Weakly supervised segmentation methods using bounding box annotations focus on obtaining a pixel-level mask from each box containing an object. [Expand]
Progressive Domain Expansion Network for Single Domain Generalization
Lei Li, Ke Gao, Juan Cao, Ziyao Huang, Yepeng Weng, Xiaoyue Mi, Zhengze Yu, Xiaoya Li, Boyang Xia
Single domain generalization is a challenging case of model generalization, where the models are trained on a single domain and tested on other unseen domains. [Expand]
PointFlow: Flowing Semantics Through Points for Aerial Image Segmentation
Xiangtai Li, Hao He, Xia Li, Duo Li, Guangliang Cheng, Jianping Shi, Lubin Weng, Yunhai Tong, Zhouchen Lin
Aerial Image Segmentation is a particular semantic segmentation problem and has several challenging characteristics that general semantic segmentation does not have. [Expand]
MUST-GAN: Multi-Level Statistics Transfer for Self-Driven Person Image Generation
Tianxiang Ma, Bo Peng, Wei Wang, Jing Dong
Pose-guided person image generation usually involves using paired source-target images to supervise the training, which significantly increases the data preparation effort and limits the application of the models. [Expand]
Focus on Local: Detecting Lane Marker From Bottom Up via Key Point
Zhan Qu, Huan Jin, Yang Zhou, Zhen Yang, Wei Zhang
Mainstream lane marker detection methods are implemented by predicting the overall structure and deriving parametric curves through post-processing. [Expand]
Neural network pruning is an essential approach for reducing the computational complexity of deep models so that they can be well deployed on resource-limited devices. [Expand]
HLA-Face: Joint High-Low Adaptation for Low Light Face Detection
Wenjing Wang, Wenhan Yang, Jiaying Liu
Face detection in low light scenarios is challenging but vital to many practical applications, e.g., surveillance video, autonomous driving at night. [Expand]
Scene text retrieval aims to localize and search all text instances from an image gallery that are the same as or similar to a given query text. [Expand]
Tracking by natural language specification is an emerging research topic that aims at locating the target object in the video sequence based on its language description. [Expand]
Troubleshooting Blind Image Quality Models in the Wild
Zhihua Wang, Haotao Wang, Tianlong Chen, Zhangyang Wang, Kede Ma
Recently, the group maximum differentiation competition (gMAD) has been used to improve blind image quality assessment (BIQA) models, with the help of full-reference metrics. [Expand]
Differential Neural Architecture Search (NAS) requires all layer choices to be held in memory simultaneously; this limits the size of both search space and final architecture. [Expand]
Multi-Label Activity Recognition Using Activity-Specific Features and Activity Correlations
Yanyi Zhang, Xinyu Li, Ivan Marsic
Multi-label activity recognition is designed for recognizing multiple activities that are performed simultaneously or sequentially in each video. [Expand]
Wangbo Zhao, Jing Zhang, Long Li, Nick Barnes, Nian Liu, Junwei Han
Significant performance improvement has been achieved for fully-supervised video salient object detection with pixel-wise labeled training datasets, which are time-consuming and expensive to obtain. [Expand]
Simpler Certified Radius Maximization by Propagating Covariances
Xingjian Zhen, Rudrasis Chakraborty, Vikas Singh
One strategy for adversarially training a robust model is to maximize its certified radius -- the neighborhood around a given training sample for which the model's prediction remains unchanged. [Expand]
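The certified radius referenced above comes from randomized smoothing: if the smoothed classifier's top class has probability at least p_a and the runner-up at most p_b under Gaussian input noise, the prediction is provably constant within an l2 ball. A sketch of that standard bound (Cohen et al., 2019), on which certified-radius training builds:

```python
from scipy.stats import norm

def certified_radius(p_a, p_b, sigma):
    """l2 radius certified by randomized smoothing:
    R = sigma / 2 * (Phi^{-1}(p_a) - Phi^{-1}(p_b)), where p_a
    lower-bounds the top-class probability and p_b upper-bounds the
    runner-up under Gaussian input noise of standard deviation sigma."""
    return 0.5 * sigma * (norm.ppf(p_a) - norm.ppf(p_b))

# e.g. certified_radius(0.9, 0.1, sigma=0.25) ~= 0.32
```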
What's in the Image? Explorable Decoding of Compressed Images
Yuval Bahat, Tomer Michaeli
The ever-growing amounts of visual contents captured on a daily basis necessitate the use of lossy compression methods in order to save storage space and transmission bandwidth. [Expand]
The focal loss has demonstrated its effectiveness in many real-world applications such as object detection and image classification, but its theoretical understanding has been limited so far. [Expand]
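For reference, the focal loss down-weights well-classified examples via a modulating factor (1 - p_t)^gamma. A minimal binary-classification sketch of the standard formulation (Lin et al., 2017):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t);
    `targets` are floats in {0, 1}."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)  # probability of true class
    return ((1 - p_t) ** gamma * ce).mean()
```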
Wide-Baseline Relative Camera Pose Estimation With Directional Learning
Kefan Chen, Noah Snavely, Ameesh Makadia
Modern deep learning techniques that regress the relative camera pose between two images have difficulty dealing with challenging scenarios, such as large camera motions resulting in occlusions and significant changes in perspective that leave little overlap between images. [Expand]
Square Root Bundle Adjustment for Large-Scale Reconstruction
Nikolaus Demmel, Christiane Sommer, Daniel Cremers, Vladyslav Usenko
We propose a new formulation for the bundle adjustment problem which relies on nullspace marginalization of landmark variables by QR decomposition. [Expand]
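The key operation is eliminating landmark variables without forming the squared (Schur-complement) system: multiply the linearized problem by a basis of the landmark Jacobian's left nullspace, obtained via QR. A minimal dense sketch, assuming the landmark block has full column rank:

```python
import numpy as np

def nullspace_marginalize(J_l, J_x, r):
    """Eliminate the landmark block from [J_l J_x] [dl; dx] = r:
    with J_l = Q R and Q2 the columns of Q spanning J_l's left
    nullspace, the reduced system is (Q2^T J_x) dx = Q2^T r."""
    m, k = J_l.shape
    Q, _ = np.linalg.qr(J_l, mode="complete")  # Q: m x m orthogonal
    Q2 = Q[:, k:]                              # basis of left nullspace
    return Q2.T @ J_x, Q2.T @ r
```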
Unsupervised Pre-Training for Person Re-Identification
Dengpan Fu, Dongdong Chen, Jianmin Bao, Hao Yang, Lu Yuan, Lei Zhang, Houqiang Li, Dong Chen
In this paper, we present a large scale unlabeled person re-identification (Re-ID) dataset "LUPerson" and make the first attempt of performing unsupervised pre-training for improving the generalization ability of the learned person Re-ID feature representation. [Expand]
Privacy-Preserving Collaborative Learning With Automatic Transformation Search
Wei Gao, Shangwei Guo, Tianwei Zhang, Han Qiu, Yonggang Wen, Yang Liu
Collaborative learning has gained great popularity due to its benefit of data privacy protection: participants can jointly train a Deep Learning model without sharing their training sets. [Expand]
Cluster, Split, Fuse, and Update: Meta-Learning for Open Compound Domain Adaptive Semantic Segmentation
Rui Gong, Yuhua Chen, Danda Pani Paudel, Yawei Li, Ajad Chhatkuli, Wen Li, Dengxin Dai, Luc Van Gool
Open compound domain adaptation (OCDA) is a domain adaptation setting, where target domain is modeled as a compound of multiple unknown homogeneous domains, which brings the advantage of improved generalization to unseen domains. [Expand]
Detecting Human-Object Interaction via Fabricated Compositional Learning
Zhi Hou, Baosheng Yu, Yu Qiao, Xiaojiang Peng, Dacheng Tao
Human-Object Interaction (HOI) detection, inferring the relationships between human and objects from images/videos, is a fundamental task for high-level scene understanding. [Expand]
DI-Fusion: Online Implicit 3D Reconstruction With Deep Priors
Jiahui Huang, Shi-Sheng Huang, Haoxuan Song, Shi-Min Hu
Previous online 3D dense reconstruction methods struggle to achieve the balance between memory storage and surface quality, largely due to the usage of stagnant underlying geometry representation, such as TSDF (truncated signed distance functions) or surfels, without any knowledge of the scene priors. [Expand]
EffiScene: Efficient Per-Pixel Rigidity Inference for Unsupervised Joint Learning of Optical Flow, Depth, Camera Pose and Motion Segmentation
Yang Jiao, Trac D. Tran, Guangming Shi
This paper addresses the challenging unsupervised scene flow estimation problem by jointly learning four low-level vision sub-tasks: optical flow F, stereo-depth D, camera pose P and motion segmentation S. [Expand]
Improving Accuracy of Binary Neural Networks Using Unbalanced Activation Distribution
Hyungjun Kim, Jihoon Park, Changhun Lee, Jae-Joon Kim
Binarization of neural network models is considered as one of the promising methods to deploy deep neural network models on resource-constrained environments such as mobile devices. [Expand]
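Binarized networks typically quantize with sign() in the forward pass and rely on a straight-through estimator (STE) in the backward pass, since sign() has zero gradient almost everywhere. A minimal sketch of that standard building block, whose activation distribution the paper proposes to unbalance:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """sign() forward with a straight-through estimator backward:
    gradients pass through unchanged where |x| <= 1, else are zeroed."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()

binarize = BinarizeSTE.apply  # usage: y = binarize(weights_or_activations)
```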
Single-View Robot Pose and Joint Angle Estimation via Render & Compare
Yann Labbe, Justin Carpentier, Mathieu Aubry, Josef Sivic
We introduce RoboPose, a method to estimate the joint angles and the 6D camera-to-robot pose of a known articulated robot from a single RGB image. [Expand]
Anti-Adversarially Manipulated Attributions for Weakly and Semi-Supervised Semantic Segmentation
Jungbeom Lee, Eunji Kim, Sungroh Yoon
Weakly supervised semantic segmentation produces a pixel-level localization from class labels; but a classifier trained on such labels is likely to restrict its focus to a small discriminative region of the target object. [Expand]
Data augmentation is an effective regularization strategy for alleviating overfitting, an inherent drawback of deep neural networks. [Expand]
Inception Convolution With Efficient Dilation Search
Jie Liu, Chuming Li, Feng Liang, Chen Lin, Ming Sun, Junjie Yan, Wanli Ouyang, Dong Xu
As a variant of standard convolution, a dilated convolution can control effective receptive fields and handle large scale variance of objects without introducing additional computational costs. [Expand]
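As background, a 3x3 kernel with dilation d covers a (2d + 1) x (2d + 1) receptive field at the parameter cost of a plain 3x3 convolution, which is what makes dilation a cheap knob for receptive-field search. A plain PyTorch illustration of the primitive (the paper searches dilation patterns on top of it):

```python
import torch.nn as nn

# Same parameter count, growing receptive field.
conv_d1 = nn.Conv2d(64, 64, 3, padding=1, dilation=1)  # 3x3 field
conv_d2 = nn.Conv2d(64, 64, 3, padding=2, dilation=2)  # 5x5 field
conv_d4 = nn.Conv2d(64, 64, 3, padding=4, dilation=4)  # 9x9 field
```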
We propose a novel framework for creating large-scale photorealistic datasets of indoor scenes, with ground truth geometry, material, lighting and semantics. [Expand]
POSEFusion: Pose-Guided Selective Fusion for Single-View Human Volumetric Capture
Zhe Li, Tao Yu, Zerong Zheng, Kaiwen Guo, Yebin Liu
We propose POse-guided SElective Fusion (POSEFusion), a single-view human volumetric capture method that leverages tracking-based methods and tracking-free inference to achieve high-fidelity and dynamic 3D reconstruction. [Expand]
Bridging the Visual Gap: Wide-Range Image Blending
Chia-Ni Lu, Ya-Chu Chang, Wei-Chen Chiu
In this paper we propose a new problem scenario in image processing, wide-range image blending, which aims to smoothly merge two different input photos into a panorama by generating novel image content for the intermediate region between them. [Expand]
We present a novel mirror segmentation method that leverages depth estimates from ToF-based cameras as an additional cue to disambiguate challenging cases where the contrast or relation in RGB colors between the mirror reflection and the surrounding scene is subtle. [Expand]
Cheol-Hui Min, Jinseok Bae, Junho Lee, Young Min Kim
We present GATSBI, a generative model that can transform a sequence of raw observations into a structured latent representation that fully captures the spatio-temporal context of the agent's actions. [Expand]
StablePose: Learning 6D Object Poses From Geometrically Stable Patches
Yifei Shi, Junwen Huang, Xin Xu, Yifan Zhang, Kai Xu
We introduce the concept of geometric stability to the problem of 6D object pose estimation and propose to learn pose inference based on geometrically stable patches extracted from observed 3D point clouds. [Expand]
Searching for a more compact network width recently serves as an effective way of channel pruning for the deployment of convolutional neural networks (CNNs) under hardware constraints. [Expand]
Prioritized Architecture Sampling With Monte-Carlo Tree Search
Xiu Su, Tao Huang, Yanxi Li, Shan You, Fei Wang, Chen Qian, Changshui Zhang, Chang Xu
One-shot neural architecture search (NAS) methods significantly reduce the search cost by considering the whole search space as one network, which only needs to be trained once. [Expand]
EnD: Entangling and Disentangling Deep Representations for Bias Correction
Enzo Tartaglione, Carlo Alberto Barbano, Marco Grangetto
Artificial neural networks achieve state-of-the-art performance in an ever-growing number of tasks, and nowadays they are used to solve an incredibly large variety of problems. [Expand]
Combinatorial Learning of Graph Edit Distance via Dynamic Embedding
Runzhong Wang, Tianqi Zhang, Tianshu Yu, Junchi Yan, Xiaokang Yang
Graph Edit Distance (GED) is a popular similarity measurement for pairwise graphs and it also refers to the recovery of the edit path from the source graph to the target graph. [Expand]
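Exact graph edit distance is NP-hard in general, which is what motivates learned approximations like the one above. For small graphs it can still be computed exactly, e.g. with networkx:

```python
import networkx as nx

# Exact GED between two small graphs; the underlying search is
# exponential, hence the need for learning-based approximations.
g1 = nx.cycle_graph(4)   # 4 nodes, 4 edges
g2 = nx.path_graph(4)    # 4 nodes, 3 edges
print(nx.graph_edit_distance(g1, g2))  # minimal edits (delete one edge)
```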
Data-Uncertainty Guided Multi-Phase Learning for Semi-Supervised Object Detection
Zhenyu Wang, Yali Li, Ye Guo, Lu Fang, Shengjin Wang
In this paper, we delve into semi-supervised object detection where unlabeled images are leveraged to break through the upper bound of fully-supervised object detection models. [Expand]
Hijack-GAN: Unintended-Use of Pretrained, Black-Box GANs
Hui-Po Wang, Ning Yu, Mario Fritz
While Generative Adversarial Networks (GANs) show increasing performance and the level of realism is becoming indistinguishable from natural images, this also comes with high demands on data and computation. [Expand]
Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing
Yu Wu, Yi Yang
We investigate the weakly-supervised audio-visual video parsing task, which aims to parse a video into temporal event segments and predict the audible or visible event categories. [Expand]
MotionRNN: A Flexible Model for Video Prediction With Spacetime-Varying Motions
Haixu Wu, Zhiyu Yao, Jianmin Wang, Mingsheng Long
This paper tackles video prediction from a new dimension of predicting spacetime-varying motions that are incessantly changing across both space and time. [Expand]
Intra-Inter Camera Similarity for Unsupervised Person Re-Identification
Shiyu Xuan, Shiliang Zhang
Most unsupervised person re-identification (Re-ID) works produce pseudo-labels by measuring feature similarity without considering the distribution discrepancy among cameras, leading to degraded accuracy in label computation across cameras. [Expand]
DSC-PoseNet: Learning 6DoF Object Pose Estimation via Dual-Scale Consistency
Zongxin Yang, Xin Yu, Yi Yang
Compared to 2D object bounding-box labeling, it is very difficult for humans to annotate 3D object poses, especially when depth images of scenes are unavailable. [Expand]
Joint Noise-Tolerant Learning and Meta Camera Shift Adaptation for Unsupervised Person Re-Identification
Fengxiang Yang, Zhun Zhong, Zhiming Luo, Yuanzheng Cai, Yaojin Lin, Shaozi Li, Nicu Sebe
This paper considers the problem of unsupervised person re-identification (re-ID), which aims to learn discriminative models with unlabeled data. [Expand]
We present a new domain adaptive self-training pipeline, named ST3D, for unsupervised domain adaptation on 3D object detection from point clouds. [Expand]
Motivated by the intuition that one can transform two aligned point clouds into each other more easily and meaningfully than a misaligned pair, we propose CorrNet3D, the first unsupervised, end-to-end deep learning-based framework, to drive the learning of dense correspondence between 3D shapes by means of deformation-like reconstruction, overcoming the need for annotated data. [Expand]
Abstract Spatial-Temporal Reasoning via Probabilistic Abduction and Execution
Chi Zhang, Baoxiong Jia, Song-Chun Zhu, Yixin Zhu
Spatial-temporal reasoning is a challenging task in Artificial Intelligence (AI) due to its demanding but unique nature: a theoretic requirement on representing and reasoning based on spatial-temporal knowledge in mind, and an applied requirement on a high-level cognitive system capable of navigating and acting in space and time. [Expand]
Chi Zhang, Baoxiong Jia, Mark Edmonds, Song-Chun Zhu, Yixin Zhu
Causal induction, i.e., identifying unobservable mechanisms that lead to the observable relations among variables, has played a pivotal role in modern scientific discovery, especially in scenarios with only sparse and limited data. [Expand]
Existing state-of-the-art disparity estimation works mostly leverage the 4D concatenation volume and construct a very deep 3D convolution neural network (CNN) for disparity regression, which is inefficient due to the high memory consumption and slow inference speed. [Expand]
Cross-MPI: Cross-Scale Stereo for Image Super-Resolution Using Multiplane Images
Yuemei Zhou, Gaochang Wu, Ying Fu, Kun Li, Yebin Liu
Various combinations of cameras enrich computational photography, among which reference-based superresolution (RefSR) plays a critical role in multiscale imaging systems. [Expand]
Panoptic-PolarNet: Proposal-Free LiDAR Point Cloud Panoptic Segmentation
Zixiang Zhou, Yang Zhang, Hassan Foroosh
Panoptic segmentation presents a new challenge in exploiting the merits of both detection and segmentation, with the aim of unifying instance segmentation and semantic segmentation in a single framework. [Expand]
One of the main challenges for arbitrary-shaped text detection is to design a good text instance representation that allows networks to learn diverse text geometry variances. [Expand]
Denoise and Contrast for Category Agnostic Shape Completion
Antonio Alliegro, Diego Valsesia, Giulia Fracastoro, Enrico Magli, Tatiana Tommasi
In this paper, we present a deep learning model that exploits the power of self-supervision to perform 3D point cloud completion, estimating the missing part and a context region around it. [Expand]
Muhammad Waseem Ashraf, Waqas Sultani, Mubarak Shah
As airborne vehicles are becoming more autonomous and ubiquitous, it has become vital to develop the capability to detect the objects in their surroundings. [Expand]
Siamese Natural Language Tracker: Tracking by Natural Language Descriptions With Siamese Trackers
Qi Feng, Vitaly Ablavsky, Qinxun Bai, Stan Sclaroff
We propose a novel Siamese Natural Language Tracker (SNLT), which brings the advancements in visual tracking to the tracking by natural language (NL) specification task. [Expand]
OTA: Optimal Transport Assignment for Object Detection
Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, Jian Sun
Recent advances in label assignment in object detection mainly seek to independently define positive/negative training samples for each ground-truth (gt) object. [Expand]
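OTA casts label assignment as an optimal transport problem between ground-truth "suppliers" and anchor "demanders". Entropy-regularized OT of this kind is typically solved with Sinkhorn iterations; a generic solver sketch, with the cost matrix and marginals left as assumptions for illustration:

```python
import torch

def sinkhorn(cost, supply, demand, eps=0.1, iters=50):
    """Entropy-regularized optimal transport via Sinkhorn iterations.
    `cost` is (n_suppliers, n_demanders); `supply`/`demand` are the
    marginals. The returned plan soft-assigns demanders (anchors) to
    suppliers (ground truths); cost design is the paper's, not shown."""
    K = torch.exp(-cost / eps)
    u = torch.ones_like(supply)
    for _ in range(iters):
        v = demand / (K.t() @ u)
        u = supply / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan
```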
Bidirectional Projection Network for Cross Dimension Scene Understanding
Wenbo Hu, Hengshuang Zhao, Li Jiang, Jiaya Jia, Tien-Tsin Wong
2D image representations are in regular grids and can be processed efficiently, whereas 3D point clouds are unordered and scattered in 3D space. [Expand]
Few-Shot Open-Set Recognition by Transformation Consistency
Minki Jeong, Seokeon Choi, Changick Kim
In this paper, we attack a few-shot open-set recognition (FSOSR) problem, which is a combination of few-shot learning (FSL) and open-set recognition (OSR). [Expand]
Scalability vs. Utility: Do We Have To Sacrifice One for the Other in Data Importance Quantification?
Ruoxi Jia, Fan Wu, Xuehui Sun, Jiacen Xu, David Dao, Bhavya Kailkhura, Ce Zhang, Bo Li, Dawn Song
Quantifying the importance of each training point to a learning task is a fundamental problem in machine learning, and the estimated importance scores have been leveraged to guide a range of data workflows such as data summarization and domain adaptation. [Expand]
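A common instantiation of such importance scores is the data Shapley value, usually estimated by averaging marginal contributions over random permutations. A minimal Monte Carlo sketch, where utility(indices) is a hypothetical helper scoring a subset of training points (e.g., by validation accuracy):

```python
import numpy as np

def mc_shapley(utility, n, permutations=200, seed=0):
    """Monte Carlo data Shapley: average each training point's marginal
    contribution over random permutations of the n points."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n)
    for _ in range(permutations):
        perm = rng.permutation(n)
        prev = utility([])
        for k, i in enumerate(perm):
            cur = utility(list(perm[: k + 1]))
            phi[i] += cur - prev       # marginal contribution of point i
            prev = cur
    return phi / permutations
```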
Relation-aware Instance Refinement for Weakly Supervised Visual Grounding
Yongfei Liu, Bo Wan, Lin Ma, Xuming He
Visual grounding, which aims to build a correspondence between visual objects and their language entities, plays a key role in cross-modal scene understanding. [Expand]
Instance Level Affinity-Based Transfer for Unsupervised Domain Adaptation
Astuti Sharma, Tarun Kalluri, Manmohan Chandraker
Domain adaptation deals with training models using large scale labeled data from a specific source domain and then adapting the knowledge to certain target domains that have few or no labels. [Expand]
Iterative Shrinking for Referring Expression Grounding Using Deep Reinforcement Learning
Mingjie Sun, Jimin Xiao, Eng Gee Lim
In this paper, we are tackling the proposal-free referring expression grounding task, aiming at localizing the target object according to a query sentence, without relying on off-the-shelf object proposals. [Expand]
For the single image rain removal (SIRR) task, the performance of deep learning (DL)-based methods is mainly affected by the designed deraining models and training datasets. [Expand]
In this paper, we present a novel unpaired point cloud completion network, named Cycle4Completion, to infer the complete geometries from a partial 3D object. [Expand]
Bilateral Grid Learning for Stereo Matching Networks
Bin Xu, Yuhua Xu, Xiaoli Yang, Wei Jia, Yulan Guo
Real-time performance of stereo matching networks is important for many applications, such as automatic driving, robot navigation and augmented reality (AR). [Expand]
Alexey Bokhovkin, Vladislav Ishimtsev, Emil Bogomolov, Denis Zorin, Alexey Artemov, Evgeny Burnaev, Angela Dai
Recent advances in 3D semantic scene understanding have shown impressive progress in 3D instance segmentation, enabling object-level reasoning about 3D scenes; however, a finer-grained understanding is required to enable interactions with objects and their functional understanding. [Expand]
Fine-Grained Angular Contrastive Learning With Coarse Labels
Guy Bukchin, Eli Schwartz, Kate Saenko, Ori Shahar, Rogerio Feris, Raja Giryes, Leonid Karlinsky
Few-shot learning methods offer pre-training techniques optimized for easier later adaptation of the model to new classes (unseen during training) using one or a few examples. [Expand]
Semantic Scene Completion aims at reconstructing a complete 3D scene with precise voxel-wise semantics from a single-view depth or RGBD image. [Expand]
Globally Optimal Relative Pose Estimation With Gravity Prior
Yaqing Ding, Daniel Barath, Jian Yang, Hui Kong, Zuzana Kukelova
Smartphones, tablets and camera systems used, e.g., in cars and UAVs, are typically equipped with IMUs (inertial measurement units) that can measure the gravity vector accurately. [Expand]
Restore From Restored: Video Restoration With Pseudo Clean Video
Seunghwan Lee, Donghyeon Cho, Jiwon Kim, Tae Hyun Kim
In this study, we propose a self-supervised video denoising method called "restore-from-restored." This method fine-tunes a pre-trained network by using a pseudo clean video during the test phase. [Expand]
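The test-time adaptation idea can be sketched as: restore once, keep the output as a pseudo clean target, synthesize new noisy inputs from it, and fine-tune. A hypothetical minimal loop assuming additive Gaussian noise (the paper's full method also exploits recurring patches across frames):

```python
import torch
import torch.nn.functional as F

def restore_from_restored(model, noisy, optimizer, noise_std=0.1, steps=5):
    """Fine-tune at test time on synthetic pairs built from the
    network's own restoration of the input."""
    with torch.no_grad():
        pseudo_clean = model(noisy)                    # initial restoration
    for _ in range(steps):
        renoised = pseudo_clean + noise_std * torch.randn_like(pseudo_clean)
        optimizer.zero_grad()
        F.l1_loss(model(renoised), pseudo_clean).backward()
        optimizer.step()
    with torch.no_grad():
        return model(noisy)                            # adapted output
```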
Existing studies in weakly-supervised semantic segmentation (WSSS) using image-level weak supervision have several limitations: sparse object coverage, inaccurate object boundaries, and co-occurring pixels from non-target objects. [Expand]
Anchor-Constrained Viterbi for Set-Supervised Action Segmentation
Jun Li, Sinisa Todorovic
This paper is about action segmentation under weak supervision in training, where the ground truth provides only a set of actions present, but neither their temporal ordering nor when they occur in a training video. [Expand]
Continuous Face Aging via Self-Estimated Residual Age Embedding
Zeqi Li, Ruowei Jiang, Parham Aarabi
Face synthesis, and face aging in particular, has been one of the major topics that witnessed a substantial improvement in image fidelity by using generative adversarial networks (GANs). [Expand]
Context Modeling in 3D Human Pose Estimation: A Unified Perspective
Xiaoxuan Ma, Jiajun Su, Chunyu Wang, Hai Ci, Yizhou Wang
Estimating 3D human pose from a single image suffers from severe ambiguity since multiple 3D joint configurations may have the same 2D projection. [Expand]
Feature Decomposition and Reconstruction Learning for Effective Facial Expression Recognition
Delian Ruan, Yan Yan, Shenqi Lai, Zhenhua Chai, Chunhua Shen, Hanzi Wang
In this paper, we propose a novel Feature Decomposition and Reconstruction Learning (FDRL) method for effective facial expression recognition. [Expand]
Existing color-guided depth super-resolution (DSR) approaches require paired RGB-D data as training examples where the RGB image is used as structural guidance to recover the degraded depth map due to their geometrical similarity. [Expand]
Image Inpainting With External-Internal Learning and Monochromic Bottleneck
Tengfei Wang, Hao Ouyang, Qifeng Chen
Although recent inpainting approaches have demonstrated significant improvement with deep neural networks, they still suffer from artifacts such as blunt structures and abrupt colors when filling in the missing regions. [Expand]
Multiple Object Tracking With Correlation Learning
Qiang Wang, Yun Zheng, Pan Pan, Yinghui Xu
Recent works have shown that convolutional networks have substantially improved the performance of multiple object tracking by simultaneously learning detection and appearance features. [Expand]
In this paper, we convert traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning, which generates natural language under the prompts of video-content-relevant sentences, not limited to the video itself. [Expand]
Monocular 3D Multi-Person Pose Estimation by Integrating Top-Down and Bottom-Up Networks
Yu Cheng, Bo Wang, Bo Yang, Robby T. Tan
In monocular video 3D multi-person pose estimation, inter-person occlusion and close interactions can cause human detection to be erroneous and human-joints grouping to be unreliable. [Expand]
One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking
Minghao Chen, Jianlong Fu, Haibin Ling
Despite remarkable progress achieved, most neural architecture search (NAS) methods focus on searching for one single accurate and robust architecture. [Expand]
Quankai Gao, Fudong Wang, Nan Xue, Jin-Gang Yu, Gui-Song Xia
Recently, deep learning based methods have demonstrated promising results on the graph matching problem, by relying on the descriptive capability of deep features extracted on graph nodes. [Expand]
Depth maps obtained by commercial depth sensors are usually low-resolution, making them difficult to use in various computer vision tasks. [Expand]
Affordance Transfer Learning for Human-Object Interaction Detection
Zhi Hou, Baosheng Yu, Yu Qiao, Xiaojiang Peng, Dacheng Tao
Reasoning about human-object interactions (HOI) is essential for deeper scene understanding, while object affordances (or functionalities) are of great importance for humans to discover unseen HOIs with novel objects. [Expand]
DARCNN: Domain Adaptive Region-Based Convolutional Neural Network for Unsupervised Instance Segmentation in Biomedical Images
Joy Hsu, Wah Chiu, Serena Yeung
In the biomedical domain, there is an abundance of dense, complex data where objects of interest may be challenging to detect or constrained by limits of human knowledge. [Expand]
LaPred: Lane-Aware Prediction of Multi-Modal Future Trajectories of Dynamic Agents
ByeoungDo Kim, Seong Hyeon Park, Seokhwan Lee, Elbek Khoshimjonov, Dongsuk Kum, Junsoo Kim, Jeong Soo Kim, Jun Won Choi
In this paper, we address the problem of predicting the future motion of a dynamic agent (called a target agent) given its current and past states as well as the information on its environment. [Expand]
SIPSA-Net: Shift-Invariant Pan Sharpening With Moving Object Alignment for Satellite Imagery
Jaehyup Lee, Soomin Seo, Munchurl Kim
Pan-sharpening is a process of merging a high-resolution (HR) panchromatic (PAN) image and its corresponding low-resolution (LR) multi-spectral (MS) image to create an HR-MS and pan-sharpened image. [Expand]
Recently, neural architecture search (NAS) has been exploited to design feature pyramid networks (FPNs) and achieved promising results for visual object detection. [Expand]
Causal Hidden Markov Model for Time Series Disease Forecasting
Jing Li, Botong Wu, Xinwei Sun, Yizhou Wang
We propose a causal hidden Markov model to achieve robust prediction of irreversible disease at an early stage, which is safety-critical and vital for medical treatment in early stages. [Expand]
Generalizing Face Forgery Detection With High-Frequency Features
Yuchen Luo, Yong Zhang, Junchi Yan, Wei Liu
Current face forgery detection methods achieve high accuracy under the within-database scenario where training and testing forgeries are synthesized by the same algorithm. [Expand]
Self-Supervised Pillar Motion Learning for Autonomous Driving
Chenxu Luo, Xiaodong Yang, Alan Yuille
Autonomous driving can benefit from motion behavior comprehension when interacting with diverse traffic participants in highly dynamic environments. [Expand]
Hyeonseob Nam, HyunJae Lee, Jongchan Park, Wonjun Yoon, Donggeun Yoo
Convolutional Neural Networks (CNNs) often fail to maintain their performance when they confront new test domains, which is known as the problem of domain shift. [Expand]
Bridge To Answer: Structure-Aware Graph Interaction Network for Video Question Answering
Jungin Park, Jiyoung Lee, Kwanghoon Sohn
This paper presents a novel method, termed Bridge to Answer, to infer correct answers for questions about a given video by leveraging adequate graph interactions of heterogeneous crossmodal graphs. [Expand]
Self-Supervised Collision Handling via Generative 3D Garment Models for Virtual Try-On
Igor Santesteban, Nils Thuerey, Miguel A. Otaduy, Dan Casas
We propose a new generative model for 3D garment deformations that enables us to learn, for the first time, a data-driven method for virtual try-on that effectively addresses garment-body collisions. [Expand]
Learning To Segment Actions From Visual and Language Instructions via Differentiable Weak Sequence Alignment
Yuhan Shen, Lu Wang, Ehsan Elhamifar
We address the problem of unsupervised localization of key-steps and feature learning in instructional videos using both visual and language instructions. [Expand]
SGCN: Sparse Graph Convolution Network for Pedestrian Trajectory Prediction
Liushuai Shi, Le Wang, Chengjiang Long, Sanping Zhou, Mo Zhou, Zhenxing Niu, Gang Hua
Pedestrian trajectory prediction is a key technology in autopilot, which remains to be very challenging due to complex interactions between pedestrians. [Expand]
Improving the Efficiency and Robustness of Deepfakes Detection Through Precise Geometric Features
Zekun Sun, Yujie Han, Zeyu Hua, Na Ruan, Weijia Jia
Deepfakes are a class of malicious techniques that transplant a target face onto the original one in videos, resulting in serious problems such as infringement of copyright, confusion of information, or even public panic. [Expand]
Coming Down to Earth: Satellite-to-Street View Synthesis for Geo-Localization
Aysim Toker, Qunjie Zhou, Maxim Maximov, Laura Leal-Taixe
The goal of cross-view image based geo-localization is to determine the location of a given street view image by matching it against a collection of geo-tagged satellite images. [Expand]
CRFace: Confidence Ranker for Model-Agnostic Face Detection Refinement
Noranart Vesdapunt, Baoyuan Wang
Face detection is a fundamental problem for many downstream face applications, and there is rising demand for face detectors that are faster, more accurate, and able to handle higher resolutions. [Expand]
Learning Fine-Grained Segmentation of 3D Shapes Without Part Labels
Xiaogang Wang, Xun Sun, Xinyu Cao, Kai Xu, Bin Zhou
Existing learning-based approaches to 3D shape segmentation usually formulate it as a semantic labeling problem, assuming that all parts of training shapes are annotated with a given set of labels. [Expand]
PWCLO-Net: Deep LiDAR Odometry in 3D Point Clouds Using Hierarchical Embedding Mask Optimization
Guangming Wang, Xinrui Wu, Zhe Liu, Hesheng Wang
This paper proposes PWCLO-Net, a novel 3D point cloud learning model for deep LiDAR odometry that uses hierarchical embedding mask optimization. [Expand]
Weakly-Supervised Instance Segmentation via Class-Agnostic Learning With Salient Images
Xinggang Wang, Jiapei Feng, Bin Hu, Qi Ding, Longjin Ran, Xiaoxin Chen, Wenyu Liu
Humans have a strong class-agnostic object segmentation ability and can outline boundaries of unknown objects precisely, which motivates us to propose a box-supervised class-agnostic object segmentation (BoxCaseg) based solution for weakly-supervised instance segmentation. [Expand]
Most existing CNN-based super-resolution (SR) methods are developed based on an assumption that the degradation is fixed and known (e.g., bicubic downsampling). [Expand]
Forecasting parapapillary atrophy (PPA), a symptom related to most irreversible eye diseases, provides an alarm for implementing an intervention to slow down disease progression at an early stage. [Expand]
A Dual Iterative Refinement Method for Non-Rigid Shape Matching
Rui Xiang, Rongjie Lai, Hongkai Zhao
In this work, a robust and efficient dual iterative refinement (DIR) method is proposed for dense correspondence between two nearly isometric shapes. [Expand]
Deep Denoising of Flash and No-Flash Pairs for Photography in Low-Light Environments
Zhihao Xia, Michael Gharbi, Federico Perazzi, Kalyan Sunkavalli, Ayan Chakrabarti
We introduce a neural network-based method to denoise pairs of images taken in quick succession in low-light environments, with and without a flash. [Expand]
DG-Font: Deformable Generative Networks for Unsupervised Font Generation
Yangchen Xie, Xinyuan Chen, Li Sun, Yue Lu
Font generation is a challenging problem, especially for writing systems that consist of a large number of characters, and it has attracted much attention in recent years. [Expand]
Graph Stacked Hourglass Networks for 3D Human Pose Estimation
Tianhan Xu, Wataru Takano
In this paper, we propose a novel graph convolutional network architecture, Graph Stacked Hourglass Networks, for 2D-to-3D human pose estimation tasks. [Expand]
Few-shot learning (FSL), which aims to recognise new classes by adapting the learned knowledge with extremely limited few-shot (support) examples, remains an important open problem in computer vision. [Expand]
Linear Semantics in Generative Adversarial Networks
Jianjin Xu, Changxi Zheng
Generative Adversarial Networks (GANs) are able to generate high-quality images, but it remains difficult to explicitly specify the semantics of synthesized images. [Expand]
KSM: Fast Multiple Task Adaption via Kernel-Wise Soft Mask Learning
Li Yang, Zhezhi He, Junshan Zhang, Deliang Fan
Deep Neural Networks (DNN) could forget the knowledge about earlier tasks when learning new tasks, and this is known as catastrophic forgetting. [Expand]
NetAdaptV2: Efficient Neural Architecture Search With Fast Super-Network Training and Architecture Optimization
Tien-Ju Yang, Yi-Lun Liao, Vivienne Sze
Neural architecture search (NAS) typically consists of three main steps: training a super-network, training and evaluating sampled deep neural networks (DNNs), and training the discovered DNN. [Expand]
ID-Unet: Iterative Soft and Hard Deformation for View Synthesis
Mingyu Yin, Li Sun, Qingli Li
View synthesis is usually done by an autoencoder, in which the encoder maps a source view image into a latent content code, and the decoder transforms it into a target view image according to the condition. [Expand]
A practical long-term tracker typically contains three key properties, i.e., an efficient model design, an effective global re-detection strategy and a robust distractor awareness mechanism. [Expand]
Domain-Robust VQA With Diverse Datasets and Methods but No Target Labels
Mingda Zhang, Tristan Maidment, Ahmad Diab, Adriana Kovashka, Rebecca Hwa
The observation that computer vision methods overfit to dataset specifics has inspired diverse attempts to make object recognition models robust to domain shifts. [Expand]
Event-Based Synthetic Aperture Imaging With a Hybrid Network
Xiang Zhang, Wei Liao, Lei Yu, Wen Yang, Gui-Song Xia
Synthetic aperture imaging (SAI) is able to achieve the see-through effect by blurring out the off-focus foreground occlusions and reconstructing the in-focus occluded targets from multi-view images. [Expand]
Cross-view image geo-localization aims to determine the locations of street-view query images by matching with GPS-tagged reference images from aerial view. [Expand]
Leveraging the Availability of Two Cameras for Illuminant Estimation
Abdelrahman Abdelhamed, Abhijith Punnappurath, Michael S. Brown
Most modern smartphones are now equipped with two rear-facing cameras -- a main camera for standard imaging and an additional camera to provide wide-angle or telephoto zoom capabilities. [Expand]
Understanding and Simplifying Perceptual Distances
Dan Amir, Yair Weiss
Perceptual metrics based on features of deep Convolutional Neural Networks (CNNs) have shown remarkable success when used as loss functions in a range of computer vision problems and significantly outperform classical losses such as L1 or L2 in pixel space. [Expand]
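Such perceptual metrics typically average distances between unit-normalized deep features across layers and spatial positions, as in LPIPS. A schematic sketch, with the learned per-channel weights omitted; feats_x/feats_y are assumed to be lists of activations from the same CNN:

```python
import torch

def deep_perceptual_distance(feats_x, feats_y):
    """LPIPS-style distance: L2 between unit-normalized activations,
    averaged over spatial positions and layers. Inputs are lists of
    (B, C, H, W) feature maps."""
    total = 0.0
    for fx, fy in zip(feats_x, feats_y):
        fx = fx / (fx.norm(dim=1, keepdim=True) + 1e-10)  # unit channels
        fy = fy / (fy.norm(dim=1, keepdim=True) + 1e-10)
        total = total + ((fx - fy) ** 2).sum(dim=1).mean(dim=(1, 2))
    return total / len(feats_x)
```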
Learning Deep Latent Variable Models by Short-Run MCMC Inference With Optimal Transport Correction
Dongsheng An, Jianwen Xie, Ping Li
Learning latent variable models with deep top-down architectures typically requires inferring the latent variables for each training example based on the posterior distribution of these latent variables. [Expand]
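Short-run MCMC inference here usually means a few steps of Langevin dynamics on the latent posterior, z <- z + (s^2 / 2) * grad log p(z | x) + s * eps with eps ~ N(0, I). A generic sketch of that inference step (the paper's optimal transport correction is not shown):

```python
import torch

def short_run_langevin(log_prob, z0, steps=20, s=0.1):
    """Run a few Langevin steps on the latents; `log_prob` maps z to
    the unnormalized posterior log-density log p(z | x)."""
    z = z0.clone().requires_grad_(True)
    for _ in range(steps):
        grad = torch.autograd.grad(log_prob(z).sum(), z)[0]
        with torch.no_grad():
            z = z + 0.5 * s**2 * grad + s * torch.randn_like(z)
        z.requires_grad_(True)
    return z.detach()
```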
Unsupervised Multi-Source Domain Adaptation for Person Re-Identification
Zechen Bai, Zhigang Wang, Jian Wang, Di Hu, Errui Ding
Unsupervised domain adaptation (UDA) methods for person re-identification (re-ID) aim at transferring re-ID knowledge from labeled source data to unlabeled target data. [Expand]
Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers
Apratim Bhattacharyya, Daniel Olmeda Reino, Mario Fritz, Bernt Schiele
Accurate prediction of pedestrian and bicyclist paths is integral to the development of reliable autonomous vehicles in dense urban environments. [Expand]
Learning to model and predict how humans interact with objects while performing an action is challenging, and most of the existing video prediction models are ineffective in modeling complicated human-object interactions. [Expand]
Understanding Object Dynamics for Interactive Image-to-Video Synthesis
Andreas Blattmann, Timo Milbich, Michael Dorkenwald, Bjorn Ommer
What would be the effect of locally poking a static scene? We present an approach that learns naturally-looking global articulations caused by a local manipulation at a pixel level. [Expand]
Hardness Sampling for Self-Training Based Transductive Zero-Shot Learning
Liu Bo, Qiulei Dong, Zhanyi Hu
Transductive zero-shot learning (T-ZSL) which could alleviate the domain shift problem in existing ZSL works, has received much attention recently. [Expand]
GAIA: A Transfer Learning System of Object Detection That Fits Your Needs
Xingyuan Bu, Junran Peng, Junjie Yan, Tieniu Tan, Zhaoxiang Zhang
Transfer learning with pre-training on large-scale datasets has played an increasingly significant role in computer vision and natural language processing recently. [Expand]
Debiased Subjective Assessment of Real-World Image Enhancement
Peibei Cao, Zhangyang Wang, Kede Ma
In real-world image enhancement, it is often challenging (if not impossible) to acquire ground-truth data, preventing the adoption of distance metrics for objective quality assessment. [Expand]
Adaptive Convolutions for Structure-Aware Style Transfer
Prashanth Chandran, Gaspard Zoss, Paulo Gotardo, Markus Gross, Derek Bradley
Style transfer between images is an artistic application of CNNs, where the 'style' of one image is transferred onto another image while preserving the latter's content. [Expand]
Towards Robust Classification Model by Counterfactual and Invariant Data Generation
Chun-Hao Chang, George Alexandru Adam, Anna Goldenberg
Despite the success of machine learning applications in science, industry, and society in general, many approaches are known to be non-robust, often relying on spurious correlations to make predictions. [Expand]
Despite the great success of Siamese-based trackers, their performance under complicated scenarios is still not satisfying, especially when there are distractors. [Expand]
Adaptive Image Transformer for One-Shot Object Detection
Ding-Jie Chen, He-Yen Hsieh, Tyng-Luh Liu
One-shot object detection tackles a challenging task that aims at identifying within a target image all object instances of the same class, implied by a query image patch. [Expand]
Class-Aware Robust Adversarial Training for Object Detection
Pin-Chun Chen, Bo-Han Kung, Jun-Cheng Chen
Object detection is an important computer vision task with plenty of real-world applications; therefore, how to enhance its robustness against adversarial attacks has emerged as a crucial issue. [Expand]
Delving Deep Into Many-to-Many Attention for Few-Shot Video Object Segmentation
Haoxin Chen, Hanjie Wu, Nanxuan Zhao, Sucheng Ren, Shengfeng He
This paper tackles the task of Few-Shot Video Object Segmentation (FSVOS), i.e., segmenting objects in the query videos with certain class specified in a few labeled support images. [Expand]
Learning a Non-Blind Deblurring Network for Night Blurry Images
Liang Chen, Jiawei Zhang, Jinshan Pan, Songnan Lin, Faming Fang, Jimmy S. Ren
Deblurring night blurry images is difficult, because the commonly used blur model based on the linear convolution operation does not hold in this situation due to the influence of saturated pixels. [Expand]
Recent self-supervised contrastive learning provides an effective approach for unsupervised person re-identification (ReID) by learning invariance from different views (transformed versions) of an input. [Expand]
Data-free learning for student networks is a new paradigm for addressing users' privacy concerns about the use of original training data. [Expand]
Neural Feature Search for RGB-Infrared Person Re-Identification
Yehansen Chen, Lin Wan, Zhihang Li, Qianyan Jing, Zongyuan Sun
RGB-Infrared person re-identification (RGB-IR ReID) is a challenging cross-modality retrieval problem, which aims at matching the person-of-interest over visible and infrared camera views. [Expand]
Perceptual Indistinguishability-Net (PI-Net): Facial Image Obfuscation With Manipulable Semantics
Jia-Wei Chen, Li-Ju Chen, Chia-Mu Yu, Chun-Shien Lu
With the growing use of camera devices, the industry has many image datasets that provide more opportunities for collaboration between the machine learning community and industry. [Expand]
Pareto Self-Supervised Training for Few-Shot Learning
Zhengyu Chen, Jixie Ge, Heshen Zhan, Siteng Huang, Donglin Wang
While few-shot learning (FSL) aims for rapid generalization to new concepts with little supervision, self-supervised learning (SSL) constructs supervisory signals directly computed from unlabeled data. [Expand]
Humans can infer the 3D geometry of a scene from a sketch instead of a realistic image, which indicates that spatial structure plays a fundamental role in understanding the depth of scenes. [Expand]
Scene Text Telescope: Text-Focused Scene Image Super-Resolution
Jingye Chen, Bin Li, Xiangyang Xue
Image super-resolution, which is often regarded as a preprocessing procedure of scene text recognition, aims to recover the realistic features from a low-resolution text image. [Expand]
Towards Bridging Event Captioner and Sentence Localizer for Weakly Supervised Dense Event Captioning
Shaoxiang Chen, Yu-Gang Jiang
Dense Event Captioning (DEC) aims to jointly localize and describe multiple events of interest in untrimmed videos, which is an advancement of the conventional video captioning task (generating a single sentence description for a trimmed video). [Expand]
Feature-Level Collaboration: Joint Unsupervised Learning of Optical Flow, Stereo Depth and Camera Motion
Cheng Chi, Qingjie Wang, Tianyu Hao, Peng Guo, Xin Yang
Precise estimation of optical flow, stereo depth and camera motion are important for the real-world 3D scene understanding and visual perception. [Expand]
Bayesian Nested Neural Networks for Uncertainty Calibration and Adaptive Compression
Yufei Cui, Ziquan Liu, Qiao Li, Antoni B. Chan, Chun Jason Xue
Nested networks or slimmable networks are neural networks whose architectures can be adjusted instantly during testing time, e.g., based on computational constraints. [Expand]
Towards Accurate 3D Human Motion Prediction From Incomplete Observations
Qiongjie Cui, Huaijiang Sun
Predicting accurate and realistic future human poses from historically observed sequences is a fundamental task in the intersection of computer vision, graphics, and artificial intelligence. [Expand]
Progressive Contour Regression for Arbitrary-Shape Scene Text Detection
Pengwen Dai, Sanyi Zhang, Hua Zhang, Xiaochun Cao
State-of-the-art scene text detection methods usually model the text instance with local pixels or components from the bottom-up perspective and, therefore, are sensitive to noises and dependent on the complicated heuristic post-processing especially for arbitrary-shape texts. [Expand]
Deep face recognition has achieved remarkable improvements due to the introduction of margin-based softmax loss, in which the prototype stored in the last linear layer represents the center of each class. [Expand]
In this work, we introduce the new scene understanding task of Part-aware Panoptic Segmentation (PPS), which aims to understand a scene at multiple levels of abstraction, and unifies the tasks of scene parsing and part parsing. [Expand]
Learning Spatially-Variant MAP Models for Non-Blind Image Deblurring
Jiangxin Dong, Stefan Roth, Bernt Schiele
The classical maximum a-posteriori (MAP) framework for non-blind image deblurring requires defining suitable data and regularization terms, whose interplay yields the desired clear image through optimization. [Expand]
EventZoom: Learning To Denoise and Super Resolve Neuromorphic Events
Peiqi Duan, Zihao W. Wang, Xinyu Zhou, Yi Ma, Boxin Shi
We address the problem of jointly denoising and super resolving neuromorphic events, a novel visual signal that represents thresholded temporal gradients in a space-time window. [Expand]
TransNAS-Bench-101: Improving Transferability and Generalizability of Cross-Task Neural Architecture Search
Yawen Duan, Xin Chen, Hang Xu, Zewei Chen, Xiaodan Liang, Tong Zhang, Zhenguo Li
Recent breakthroughs of Neural Architecture Search (NAS) extend the field's research scope towards a broader range of vision tasks and more diversified search spaces. [Expand]
We present a novel group collaborative learning framework (GCNet) capable of detecting co-salient objects in real time (16ms), by simultaneously mining consensus representations at group level based on the two necessary criteria: 1) intra-group compactness to better formulate the consistency among co-salient objects by capturing their inherent shared attributes using our novel group affinity module; 2) inter-group separability to effectively suppress the influence of noisy objects on the output by introducing our new group collaborating module conditioning the inconsistent consensus. [Expand]
Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos
Hehe Fan, Yi Yang, Mohan Kankanhalli
Point cloud videos exhibit irregularities and lack of order along the spatial dimension where points emerge inconsistently across different frames. [Expand]
Face recognition models trained under the assumption of identical training and test distributions often suffer from poor generalization when faced with unknown variations, such as a novel ethnicity or unpredictable individual make-ups during test time. [Expand]
Most existing video text detection methods track texts with appearance features, which are easily influenced by the change of perspective and illumination. [Expand]
Anticipating Human Actions by Correlating Past With the Future With Jaccard Similarity Measures
Basura Fernando, Samitha Herath
We propose a framework for early action recognition and anticipation by correlating past features with the future using three novel similarity measures called Jaccard vector similarity, Jaccard cross-correlation and Jaccard Frobenius inner product over covariances. [Expand]
Double Low-Rank Representation With Projection Distance Penalty for Clustering
Zhiqiang Fu, Yao Zhao, Dongxia Chang, Xingxing Zhang, Yiming Wang
This paper presents a novel, simple yet robust self-representation method, i.e., Double Low-Rank Representation with Projection Distance penalty (DLRRPD) for clustering. [Expand]
Auto-Exposure Fusion for Single-Image Shadow Removal
Lan Fu, Changqing Zhou, Qing Guo, Felix Juefei-Xu, Hongkai Yu, Wei Feng, Yang Liu, Song Wang
Shadow removal is still a challenging task due to its inherent background-dependent and spatial-variant properties, leading to unknown and diverse shadow patterns. [Expand]
Partial Feature Selection and Alignment for Multi-Source Domain Adaptation
Yangye Fu, Ming Zhang, Xing Xu, Zuo Cao, Chao Ma, Yanli Ji, Kai Zuo, Huimin Lu
Multi-Source Domain Adaptation (MSDA), which aims to transfer the knowledge learned from multiple source domains to an unlabeled target domain, has drawn increasing attention in the research community. [Expand]
STMTrack: Template-Free Visual Tracking With Space-Time Memory Networks
Zhihong Fu, Qingjie Liu, Zehua Fu, Yunhong Wang
Boosting the performance of offline-trained Siamese trackers is getting harder nowadays, since the fixed information in the template cropped from the first frame has been almost thoroughly mined, yet such trackers remain poorly equipped to resist target appearance changes. [Expand]
Maolin Gao, Zorah Lahner, Johan Thunberg, Daniel Cremers, Florian Bernard
Finding correspondences between shapes is a fundamental problem in computer vision and graphics, which is relevant for many applications, including 3D reconstruction, object tracking, and style transfer. [Expand]
The Remote Embodied Referring Expression (REVERIE) is a recently introduced task that requires an agent to navigate to and localise a referred remote object according to a high-level language instruction. [Expand]
Privacy Preserving Localization and Mapping From Uncalibrated Cameras
Marcel Geppert, Viktor Larsson, Pablo Speciale, Johannes L. Schonberger, Marc Pollefeys
Recent works on localization and mapping from privacy preserving line features have made significant progress towards addressing the privacy concerns arising from cloud-based solutions in mixed reality and robotics. [Expand]
Polygonal Building Extraction by Frame Field Learning
Nicolas Girard, Dmitriy Smirnov, Justin Solomon, Yuliya Tarabalka
While state-of-the-art image segmentation models typically output segmentations in raster format, applications in geographic information systems often require vector polygons. [Expand]
MaxUp: Lightweight Adversarial Training With Data Augmentation Improves Neural Network Training
Chengyue Gong, Tongzheng Ren, Mao Ye, Qiang Liu
We propose MaxUp, an embarrassingly simple, highly effective technique for improving the generalization performance of machine learning models, especially deep neural networks. [Expand]
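The mechanism MaxUp describes is simple enough to sketch: draw m augmented copies of each example and minimize the maximum loss over the copies, which acts as a lightweight form of adversarial training. A minimal PyTorch sketch under that reading follows; `augment` is a stand-in for whatever augmentation policy is paired with it (the paper pairs MaxUp with policies such as Gaussian perturbation or CutMix).

```python
# Minimal MaxUp sketch: minimize the worst-case loss over m augmented copies.
import torch
import torch.nn.functional as F

def maxup_loss(model, x, y, augment, m: int = 4):
    losses = []
    for _ in range(m):
        logits = model(augment(x))
        losses.append(F.cross_entropy(logits, y, reduction="none"))  # per-sample
    # worst-case loss over the m copies, then average over the batch
    return torch.stack(losses, dim=0).max(dim=0).values.mean()

# usage inside a training step (additive noise as a placeholder augmentation):
# loss = maxup_loss(model, images, labels, lambda t: t + 0.1 * torch.randn_like(t))
# loss.backward()
```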
Hidden features in neural networks usually fail to learn informative representations for 3D segmentation, as supervision is given only on the output prediction; this can be addressed by applying omni-scale supervision to intermediate layers. [Expand]
Inverse Simulation: Reconstructing Dynamic Geometry of Clothed Humans via Optimal Control
Jingfan Guo, Jie Li, Rahul Narain, Hyun Soo Park
This paper studies the problem of inverse cloth simulation---estimating the shape and time-varying poses of the underlying body that generate physically plausible cloth motion matching the point cloud measurements of clothed humans. [Expand]
Beyond Bounding-Box: Convex-Hull Feature Adaptation for Oriented and Densely Packed Object Detection
Zonghao Guo, Chang Liu, Xiaosong Zhang, Jianbin Jiao, Xiangyang Ji, Qixiang Ye
Detecting oriented and densely packed objects remains challenging due to spatial feature aliasing caused by the intersection of receptive fields between objects. [Expand]
Compositing an image usually suffers from the inharmony problem, which is mainly caused by the incompatibility of a foreground and background drawn from two different images with distinct surfaces and lighting; these correspond to material-dependent and light-dependent characteristics, namely the reflectance and illumination intrinsic images, respectively. [Expand]
Long-Tailed Multi-Label Visual Recognition by Collaborative Training on Uniform and Re-Balanced Samplings
Hao Guo, Song Wang
Long-tailed data distribution is common in many multi-label visual recognition tasks and the direct use of these data for training usually leads to relatively low performance on tail classes. [Expand]
Multispectral photometric stereo (MPS) aims at recovering the surface normal of a scene from a single-shot multispectral image, which is known as an ill-posed problem. [Expand]
Strengthen Learning Tolerance for Weakly Supervised Object Localization
Guangyu Guo, Junwei Han, Fang Wan, Dingwen Zhang
Weakly supervised object localization (WSOL) aims at learning to localize objects of interest by only using the image-level labels as the supervision. [Expand]
Contrastive Embedding for Generalized Zero-Shot Learning
Zongyan Han, Zhenyong Fu, Shuo Chen, Jian Yang
Generalized zero-shot learning (GZSL) aims to recognize objects from both seen and unseen classes, when only the labeled examples from seen classes are provided. [Expand]
Crossing Cuts Polygonal Puzzles: Models and Solvers
Peleg Harel, Ohad Ben-Shahar
Jigsaw puzzle solving, the problem of constructing a coherent whole from a set of non-overlapping unordered fragments, is fundamental to numerous applications, and yet most of the literature has focused thus far on less realistic puzzles whose pieces are identical squares. [Expand]
NormalFusion: Real-Time Acquisition of Surface Normals for High-Resolution RGB-D Scanning
Hyunho Ha, Joo Ho Lee, Andreas Meuleman, Min H. Kim
Multiview shape-from-shading (SfS) has achieved high-detail geometry, but its computation is expensive because it must solve multiview registration together with an ill-posed inverse rendering problem. [Expand]
Guided Interactive Video Object Segmentation Using Reliability-Based Attention Maps
Yuk Heo, Yeong Jun Koh, Chang-Su Kim
We propose a novel guided interactive segmentation (GIS) algorithm for video objects to improve the segmentation accuracy and reduce the interaction time. [Expand]
DyCo3D: Robust Instance Segmentation of 3D Point Clouds Through Dynamic Convolution
Tong He, Chunhua Shen, Anton van den Hengel
Previous top-performing approaches for point cloud instance segmentation involve a bottom-up strategy, which often includes inefficient operations or complex pipelines, such as grouping over-segmented components, introducing additional steps for refining, or designing complicated loss functions. [Expand]
MOST: A Multi-Oriented Scene Text Detector With Localization Refinement
Minghang He, Minghui Liao, Zhibo Yang, Humen Zhong, Jun Tang, Wenqing Cheng, Cong Yao, Yongpan Wang, Xiang Bai
Over the past few years, the field of scene text detection has progressed so rapidly that modern text detectors are able to hunt text in various challenging scenarios. [Expand]
Disentangling Label Distribution for Long-Tailed Visual Recognition
Youngkyu Hong, Seungju Han, Kwanghee Choi, Seokjun Seo, Beomsu Kim, Buru Chang
The current evaluation protocol of long-tailed visual recognition trains the classification model on the long-tailed source label distribution and evaluates its performance on the uniform target label distribution. [Expand]
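This is not the paper's estimator, but the source/target label-prior mismatch it describes has a standard post-hoc baseline worth keeping in mind: shift the logits by the log-ratio of the target and source priors. A minimal sketch, assuming both priors are known:

```python
# Post-hoc logit adjustment for a train/test label-prior mismatch: a Bayes-rule
# correction, not the method proposed in this paper. Priors are assumed known.
import numpy as np

def adjust_logits(logits, source_prior, target_prior, eps=1e-12):
    """logits: (N, C); priors: (C,) arrays summing to 1."""
    correction = np.log(np.asarray(target_prior) + eps) - np.log(np.asarray(source_prior) + eps)
    return logits + correction  # broadcast over the batch dimension

# e.g., a uniform target over C classes: target_prior = np.full(C, 1.0 / C)
```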
Panoptic segmentation is a challenging task aiming to simultaneously segment objects (things) at instance level and background contents (stuff) at semantic level. [Expand]
Yuchen Hong, Qian Zheng, Lingran Zhao, Xudong Jiang, Alex C. Kot, Boxin Shi
This paper studies the problem of panoramic image reflection removal, aiming at relieving the content ambiguity between reflection and transmission scenes. [Expand]
Brain Image Synthesis With Unsupervised Multivariate Canonical CSCℓ4Net
Yawen Huang, Feng Zheng, Danyang Wang, Weilin Huang, Matthew R. Scott, Ling Shao
Recent advances in neuroscience have highlighted the effectiveness of multi-modal medical data for investigating certain pathologies and understanding human cognition. [Expand]
This paper addresses the video rescaling task, which arises from the needs of adapting the video spatial resolution to suit individual viewing devices. [Expand]
Learning the Non-Differentiable Optimization for Blind Super-Resolution
Zheng Hui, Jie Li, Xiumei Wang, Xinbo Gao
Previous convolutional neural network (CNN) based blind super-resolution (SR) methods usually adopt an iterative optimization way to approximate the ground-truth (GT) step-by-step. [Expand]
Semi-supervised learning is a useful tool for image segmentation, mainly due to its ability to extract knowledge from unlabeled data to assist learning from labeled data. [Expand]
Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation
Tianrui Hui, Shaofei Huang, Si Liu, Zihan Ding, Guanbin Li, Wenguan Wang, Jizhong Han, Fei Wang
Language-queried video actor segmentation aims to predict the pixel-level mask of the actor which performs the actions described by a natural language query in the target frames. [Expand]
Dense Relation Distillation With Context-Aware Aggregation for Few-Shot Object Detection
Hanzhe Hu, Shuai Bai, Aoxue Li, Jinshi Cui, Liwei Wang
Conventional deep learning based methods for object detection require a large number of bounding box annotations for training, and such high-quality annotated data are expensive to obtain. [Expand]
The extraction of auto-correlation in images has shown great potential in deep learning networks, such as the self-attention mechanism in the channel domain and the self-similarity mechanism in the spatial domain. [Expand]
Hezhen Hu, Weilun Wang, Wengang Zhou, Weichao Zhao, Houqiang Li
Hand gesture-to-gesture translation is a significant and interesting problem, which plays a key role in many applications, such as sign language production. [Expand]
Pradeep Kumar Jayaraman, Aditya Sanghi, Joseph G. Lambourne, Karl D.D. Willis, Thomas Davies, Hooman Shayani, Nigel Morris
We introduce UV-Net, a novel neural network architecture and representation designed to operate directly on Boundary representation (B-rep) data from 3D CAD models. [Expand]
In this paper, we propose a novel task for saliency-guided image translation, with the goal of image-to-image translation conditioned on the user specified saliency map. [Expand]
IoU Attack: Towards Temporally Coherent Black-Box Adversarial Attack for Visual Object Tracking
Shuai Jia, Yibing Song, Chao Ma, Xiaokang Yang
Adversarial attack arises due to the vulnerability of deep neural networks to perceive input samples injected with imperceptible perturbations. [Expand]
Turning Frequency to Resolution: Video Super-Resolution via Event Cameras
Yongcheng Jing, Yiding Yang, Xinchao Wang, Mingli Song, Dacheng Tao
State-of-the-art video super-resolution (VSR) methods focus on exploiting inter- and intra-frame correlations to estimate high-resolution (HR) video frames from low-resolution (LR) ones. [Expand]
Learning Calibrated Medical Image Segmentation via Multi-Rater Agreement Modeling
Wei Ji, Shuang Yu, Junde Wu, Kai Ma, Cheng Bian, Qi Bi, Jingjing Li, Hanruo Liu, Li Cheng, Yefeng Zheng
In medical image analysis, it is typical to collect multiple annotations, each from a different clinical expert or rater, in the expectation that possible diagnostic errors could be mitigated. [Expand]
Wei Ji, Jingjing Li, Shuang Yu, Miao Zhang, Yongri Piao, Shunyu Yao, Qi Bi, Kai Ma, Yefeng Zheng, Huchuan Lu, Li Cheng
Complex backgrounds and similar appearances between objects and their surroundings are generally recognized as challenging scenarios in Salient Object Detection (SOD). [Expand]
Practical Single-Image Super-Resolution Using Look-Up Table
Younghyun Jo, Seon Joo Kim
A number of super-resolution (SR) algorithms from interpolation to deep neural networks (DNN) have emerged to restore or create missing details of the input low-resolution image. [Expand]
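The look-up-table idea this entry builds on can be sketched at query time: a tiny-receptive-field SR model is precomputed into a table indexed by quantized input pixels, so inference reduces to memory lookups. The toy sketch below uses a random table and illustrative sizes (2x2 patches, 4-bit inputs, x2 scale) purely to show the indexing; the actual pipeline (including how the table is filled from a trained network, and refinements such as rotation ensembling) is in the full text.

```python
# Toy query-side sketch of super-resolution by table lookup; the LUT contents
# are random here only to make the snippet self-contained.
import numpy as np

SCALE, BITS = 2, 4                      # x2 upscaling, 4-bit quantized inputs
LEVELS = 2 ** BITS
# LUT maps a quantized 2x2 input patch -> a (SCALE*SCALE)-pixel output patch
lut = np.random.rand(LEVELS, LEVELS, LEVELS, LEVELS, SCALE * SCALE).astype(np.float32)

def sr_lut_upscale(img: np.ndarray) -> np.ndarray:
    h, w = img.shape
    q = (img.astype(np.float32) / 256.0 * LEVELS).astype(int).clip(0, LEVELS - 1)
    padded = np.pad(q, ((0, 1), (0, 1)), mode="edge")
    out = np.zeros((h * SCALE, w * SCALE), dtype=np.float32)
    for y in range(h):
        for x in range(w):
            a, b = padded[y, x], padded[y, x + 1]
            c, d = padded[y + 1, x], padded[y + 1, x + 1]
            patch = lut[a, b, c, d].reshape(SCALE, SCALE)
            out[y * SCALE:(y + 1) * SCALE, x * SCALE:(x + 1) * SCALE] = patch
    return out

print(sr_lut_upscale(np.random.randint(0, 256, (8, 8))).shape)  # (16, 16)
```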
Tackling the Ill-Posedness of Super-Resolution Through Adaptive Target Generation
Younghyun Jo, Seoung Wug Oh, Peter Vajda, Seon Joo Kim
By the one-to-many nature of the super-resolution (SR) problem, a single low-resolution (LR) image can be mapped to many high-resolution (HR) images. [Expand]
Relative Order Analysis and Optimization for Unsupervised Deep Metric Learning
Shichao Kan, Yigang Cen, Yang Li, Vladimir Mladenovic, Zhihai He
In unsupervised learning of image features without labels, especially on datasets with fine-grained object classes, it is often very difficult to tell if a given image belongs to one specific object class or another, even for human eyes. [Expand]
Differentiable Diffusion for Dense Depth Estimation From Multi-View Images
Numair Khan, Min H. Kim, James Tompkin
We present a method to estimate dense depth by optimizing a sparse set of points such that their diffusion into a depth map minimizes a multi-view reprojection error from RGB supervision. [Expand]
Quality-Agnostic Image Recognition via Invertible Decoder
Insoo Kim, Seungju Han, Ji-won Baek, Seong-Jin Park, Jae-Joon Han, Jinwoo Shin
Despite the remarkable performance of deep models on image recognition tasks, they are known to be susceptible to common corruptions such as blur, noise, and low-resolution. [Expand]
QPP: Real-Time Quantization Parameter Prediction for Deep Neural Networks
Vladimir Kryzhanovskiy, Gleb Balitskiy, Nikolay Kozyrskiy, Aleksandr Zuruev
Modern deep neural networks (DNNs) cannot be effectively used in mobile and embedded devices due to strict requirements for computational complexity, memory, and power consumption. [Expand]
T-vMF Similarity for Regularizing Intra-Class Feature Distribution
Takumi Kobayashi
Deep convolutional neural networks (CNNs) leverage large-scale training dataset to produce remarkable performance on various image classification tasks. [Expand]
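The t-vMF similarity this entry introduces warps cosine similarity with a heavy-tailed function controlled by a concentration parameter kappa, so that only well-aligned features score highly. A PyTorch sketch under our reading of the formulation follows; note that kappa = 0 recovers plain cosine similarity, and the exact form should be verified against the paper.

```python
# Sketch of a t-vMF-style similarity for compacting intra-class features;
# treat the exact formula as a reading of the paper, to be verified.
import torch
import torch.nn.functional as F

def t_vmf_similarity(z: torch.Tensor, w: torch.Tensor, kappa: float = 16.0):
    """z: (N, D) features; w: (C, D) class weights; returns (N, C)."""
    cos = F.normalize(z, dim=1) @ F.normalize(w, dim=1).T
    # in [-1, 1], peaked at cos = 1; kappa = 0 gives back plain cosine
    return (1.0 + cos) / (1.0 + kappa * (1.0 - cos)) - 1.0

sims = t_vmf_similarity(torch.randn(8, 64), torch.randn(10, 64))
```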
Under-display camera (UDC) technology is essential for full-screen displays in smartphones; the full-screen design is achieved by eliminating holes drilled in the display. [Expand]
Leander Lacroix, Benjamin Charlier, Alain Trouve, Barbara Gris
A natural way to model the evolution of an object (growth of a leaf for instance) is to estimate a plausible deforming path between two observations. [Expand]
For several vision and robotics applications, 3D geometry of man-made environments such as indoor scenes can be represented with a small number of dominant planes. [Expand]
CoSMo: Content-Style Modulation for Image Retrieval With Text Feedback
Seungmin Lee, Dongwan Kim, Bohyung Han
We tackle the task of image retrieval with text feedback, where a reference image and modifier text are combined to identify the desired target image. [Expand]
DRANet: Disentangling Representation and Adaptation Networks for Unsupervised Cross-Domain Adaptation
Seunghun Lee, Sunghyun Cho, Sunghoon Im
In this paper, we present DRANet, a network architecture that disentangles image representations and transfers the visual attributes in a latent space for unsupervised cross-domain adaptation. [Expand]
Network Quantization With Element-Wise Gradient Scaling
Junghyup Lee, Dohyung Kim, Bumsub Ham
Network quantization aims at reducing bit-widths of weights and/or activations, particularly important for implementing deep neural networks with limited hardware resources. [Expand]
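Element-wise gradient scaling replaces the straight-through estimator (STE) in the backward pass of the quantizer. Below is a hedged PyTorch sketch under our reading of the rule: each gradient element is scaled by a factor built from the sign of the incoming gradient and the discretization error x - q(x), with delta a fixed hyperparameter here (the paper adapts it); verify the exact rule against the full text.

```python
# Sketch of element-wise gradient scaling around a toy uniform quantizer.
import torch

class EWGSQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, delta: float = 0.1):
        q = torch.round(x)                 # toy rounding quantizer
        ctx.save_for_backward(x - q)       # discretization error
        ctx.delta = delta
        return q

    @staticmethod
    def backward(ctx, grad_out):
        (err,) = ctx.saved_tensors
        # STE would return grad_out unchanged; here each element is rescaled
        scale = 1.0 + ctx.delta * torch.sign(grad_out) * err
        return grad_out * scale, None

x = torch.randn(4, requires_grad=True)
EWGSQuantize.apply(x).sum().backward()
print(x.grad)
```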
Relevance-CAM: Your Model Already Knows Where To Look
Jeong Ryong Lee, Sewon Kim, Inyong Park, Taejoon Eo, Dosik Hwang
As neural networks develop and their fields of application multiply, the ability to explain deep learning models becomes increasingly important. [Expand]
4D Hyperspectral Photoacoustic Data Restoration With Reliability Analysis
Weihang Liao, Art Subpa-asa, Yinqiang Zheng, Imari Sato
Hyperspectral photoacoustic (HSPA) spectroscopy is an emerging bi-modal imaging technology that is able to show the wavelength-dependent absorption distribution of the interior of a 3D volume. [Expand]
COMPLETER: Incomplete Multi-View Clustering via Contrastive Prediction
Yijie Lin, Yuanbiao Gou, Zitao Liu, Boyun Li, Jiancheng Lv, Xi Peng
In this paper, we study two challenging problems in incomplete multi-view clustering analysis, namely, i) how to learn an informative and consistent representation among different views without the help of labels and ii) how to recover the missing views from data. [Expand]
Multi-View Multi-Person 3D Pose Estimation With Plane Sweep Stereo
Jiahao Lin, Gim Hee Lee
Existing approaches for multi-view multi-person 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views and solve for the 3D pose estimation for each person. [Expand]
Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval
Yang Liu, Qingchao Chen, Samuel Albanie
In this paper, we study the task of visual-text retrieval in the highly practical setting in which labelled visual data with paired text descriptions are available in one domain (the "source"), but only unlabelled visual data (without text descriptions) are available in the domain of interest (the "target"). [Expand]
This paper addresses the problem of temporal sentence grounding (TSG), which aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. [Expand]
The perceptual loss has been widely used as an effective loss term in image synthesis tasks, including image super-resolution [16] and style transfer [14]. [Expand]
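For reference, the perceptual loss in question compares deep features of a fixed pretrained network between output and target instead of raw pixels. A minimal torchvision-based sketch, with VGG16 cut at an arbitrary layer and inputs assumed already ImageNet-normalized:

```python
# Standard perceptual loss: MSE between frozen VGG16 features of two images.
import torch
import torchvision.models as models

class PerceptualLoss(torch.nn.Module):
    def __init__(self, layer: int = 16):  # cut point is an illustrative choice
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
        self.features = vgg[:layer].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, pred, target):
        # inputs assumed normalized with ImageNet mean/std
        return torch.nn.functional.mse_loss(self.features(pred), self.features(target))

loss = PerceptualLoss()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```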
We introduce a new dataset for emotional artificial intelligence research: iMiGUE, an identity-free video dataset for micro-gesture understanding and emotion analysis. [Expand]
Mask-Embedded Discriminator With Region-Based Semantic Regularization for Semi-Supervised Class-Conditional Image Synthesis
Yi Liu, Xiaoyang Huo, Tianyi Chen, Xiangping Zeng, Si Wu, Zhiwen Yu, Hau-San Wong
Semi-supervised generative learning (SSGL) makes use of unlabeled data to achieve a trade-off between the data collection/annotation effort and generation performance, when adequate labeled data are not available. [Expand]
Neighborhood Normalization for Robust Geometric Feature Learning
Xingtong Liu, Benjamin D. Killeen, Ayushi Sinha, Masaru Ishii, Gregory D. Hager, Russell H. Taylor, Mathias Unberath
Extracting geometric features from 3D models is a common first step in applications such as 3D registration, tracking, and scene flow estimation. [Expand]
PluckerNet: Learn To Register 3D Line Reconstructions
Liu Liu, Hongdong Li, Haodong Yao, Ruyi Zha
Aligning two partially-overlapped 3D line reconstructions in Euclidean space is challenging, as we need to simultaneously solve line correspondences and relative pose between reconstructions. [Expand]
In this work, we propose a Cross-view Contrastive Learning framework for unsupervised 3D skeleton-based action representation (CrosSCLR), by leveraging multi-view complementary supervision signal. [Expand]
Combined Depth Space Based Architecture Search for Person Re-Identification
Hanjun Li, Gaojie Wu, Wei-Shi Zheng
Most works on person re-identification (ReID) take advantage of large backbone networks such as ResNet, which are designed for image classification instead of ReID, for feature extraction. [Expand]
Occluded person re-identification (Re-ID) is a challenging task as persons are frequently occluded by various obstacles or other persons, especially in the crowd scenario. [Expand]
Domain Consensus Clustering for Universal Domain Adaptation
Guangrui Li, Guoliang Kang, Yi Zhu, Yunchao Wei, Yi Yang
In this paper, we investigate Universal Domain Adaptation (UniDA) problem, which aims to transfer the knowledge from source to target under unaligned label space. [Expand]
Dynamic Class Queue for Large Scale Face Recognition in the Wild
Bi Li, Teng Xi, Gang Zhang, Haocheng Feng, Junyu Han, Jingtuo Liu, Errui Ding, Wenyu Liu
Learning discriminative representation using large-scale face datasets in the wild is crucial for real-world applications, yet it remains challenging. [Expand]
Shuang Li, JinMing Zhang, Wenxuan Ma, Chi Harold Liu, Wei Li
Domain adaptation (DA) enables knowledge transfer from a labeled source domain to an unlabeled target domain by reducing the cross-domain distribution discrepancy. [Expand]
FaceInpainter: High Fidelity Face Adaptation to Heterogeneous Domains
Jia Li, Zhaoyang Li, Jie Cao, Xingguang Song, Ran He
In this work, we propose a novel two-stage framework named FaceInpainter to implement controllable Identity-Guided Face Inpainting (IGFI) under heterogeneous domains. [Expand]
Lighting, Reflectance and Geometry Estimation From 360° Panoramic Stereo
Junxuan Li, Hongdong Li, Yasuyuki Matsushita
We propose a method for estimating high-definition spatially-varying lighting, reflectance, and geometry of a scene from 360° stereo images. [Expand]
Generalizing to the Open World: Deep Visual Odometry With Online Adaptation
Shunkai Li, Xin Wu, Yingdian Cao, Hongbin Zha
Although learning-based visual odometry (VO) has shown impressive results in recent years, pretrained networks may easily collapse in unseen environments. [Expand]
Probabilistic Model Distillation for Semantic Correspondence
Xin Li, Deng-Ping Fan, Fan Yang, Ao Luo, Hong Cheng, Zicheng Liu
Semantic correspondence is a fundamental problem in computer vision, which aims at establishing dense correspondences across images depicting different instances under the same category. [Expand]
Self-Supervised Video Hashing via Bidirectional Transformers
Shuyan Li, Xiu Li, Jiwen Lu, Jie Zhou
Most existing unsupervised video hashing methods are built on unidirectional models with less reliable training objectives, which underuse the correlations among frames and the similarity structure between videos. [Expand]
An emerging line of research has found that spherical spaces better match the underlying geometry of facial images, as evidenced by the state-of-the-art facial recognition methods which benefit empirically from spherical representations. [Expand]
Transferable Semantic Augmentation for Domain Adaptation
Shuang Li, Mixue Xie, Kaixiong Gong, Chi Harold Liu, Yulin Wang, Wei Li
Domain adaptation has been widely explored by transferring the knowledge from a label-rich source domain to a related but unlabeled target domain. [Expand]
Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks
Xiaoxiao Long, Lingjie Liu, Wei Li, Christian Theobalt, Wenping Wang
We present a novel method for multi-view depth estimation from a single video, which is a critical task in various applications, such as perception, reconstruction and robot navigation. [Expand]
As a vital problem in classification-oriented transfer, unsupervised domain adaptation (UDA) has attracted widespread attention in recent years. [Expand]
Action Unit Memory Network for Weakly Supervised Temporal Action Localization
Wang Luo, Tianzhu Zhang, Wenfei Yang, Jingen Liu, Tao Mei, Feng Wu, Yongdong Zhang
Weakly supervised temporal action localization aims to detect and localize actions in untrimmed videos with only video-level labels during training. [Expand]
Intelligent Carpet: Inferring 3D Human Pose From Tactile Signals
Yiyue Luo, Yunzhu Li, Michael Foshey, Wan Shou, Pratyusha Sharma, Tomas Palacios, Antonio Torralba, Wojciech Matusik
Daily human activities, e.g., locomotion, exercises, and resting, are heavily guided by the tactile interactions between the human and the ground. [Expand]
In applications such as optical see-through and projector augmented reality, producing images amounts to solving non-negative image generation, where one can only add light to an existing image. [Expand]
Large-Capacity Image Steganography Based on Invertible Neural Networks
Shao-Ping Lu, Rong Wang, Tao Zhong, Paul L. Rosin
Many attempts have been made to hide information in images, where the main challenge is how to increase the payload capacity without the container image being detected as containing a message. [Expand]
CGA-Net: Category Guided Aggregation for Point Cloud Semantic Segmentation
Tao Lu, Limin Wang, Gangshan Wu
Previous point cloud semantic segmentation networks use the same process to aggregate features from neighbors of the same category and different categories. [Expand]
Reference-based image super-resolution (RefSR) has shown promising success in recovering high-frequency details by utilizing an external reference image (Ref). [Expand]
Progressive Modality Reinforcement for Human Multimodal Emotion Recognition From Unaligned Multimodal Sequences
Fengmao Lv, Xiang Chen, Yanyong Huang, Lixin Duan, Guosheng Lin
Human multimodal emotion recognition involves time-series data of different modalities, such as natural language, visual motions, and acoustic behaviors. [Expand]
Recognition and reconstruction of residential floor plan drawings are important and challenging in design, decoration, and architectural remodeling fields. [Expand]
In recent years, denoising methods based on deep learning have achieved unparalleled performance at the cost of large computational complexity. [Expand]
Gradient Forward-Propagation for Large-Scale Temporal Video Modelling
Mateusz Malinowski, Dimitrios Vytiniotis, Grzegorz Swirszcz, Viorica Patraucean, Joao Carreira
How can neural networks be trained on large-volume temporal data efficiently? To compute the gradients required to update parameters, backpropagation blocks computations until the forward and backward passes are completed. [Expand]
CapsuleRRT: Relationships-Aware Regression Tracking via Capsules
Ding Ma, Xiangqian Wu
Regression tracking has gained more and more attention thanks to its easy-to-implement characteristics, while existing regression trackers rarely consider the relationships between the object parts and the complete object. [Expand]
Seeing Behind Objects for 3D Multi-Object Tracking in RGB-D Sequences
Norman Muller, Yu-Shiang Wong, Niloy J. Mitra, Angela Dai, Matthias Niessner
Multi-object tracking from RGB-D video sequences is a challenging problem due to the combination of changing viewpoints, motion, and occlusions over time. [Expand]
Pedestrian and Ego-Vehicle Trajectory Prediction From Monocular Camera
Lukas Neumann, Andrea Vedaldi
Predicting future pedestrian trajectory is a crucial component of autonomous driving systems, as recognizing critical situations based only on current pedestrian position may come too late for any meaningful corrective action (e.g. [Expand]
Protecting Intellectual Property of Generative Adversarial Networks From Ambiguity Attacks
Ding Sheng Ong, Chee Seng Chan, Kam Woh Ng, Lixin Fan, Qiang Yang
Ever since Machine Learning as a Service emerged as a viable business that utilizes deep learning models to generate lucrative revenue, Intellectual Property Rights (IPR) have become a major concern, because these deep learning models can easily be replicated, shared, and re-distributed by any unauthorized third party. [Expand]
Bilinear Parameterization for Non-Separable Singular Value Penalties
Marcus Valtonen Ornhag, Jose Pedro Iglesias, Carl Olsson
Low rank inducing penalties have been proven to successfully uncover fundamental structures considered in computer vision and machine learning; however, such methods generally lead to non-convex optimization problems. [Expand]
In recent years, Face Image Quality Assessment (FIQA) has become an indispensable part of the face recognition system to guarantee the stability and reliability of recognition performance in an unconstrained scenario. [Expand]
Fast Sinkhorn Filters: Using Matrix Scaling for Non-Rigid Shape Correspondence With Functional Maps
Gautam Pai, Jing Ren, Simone Melzi, Peter Wonka, Maks Ovsjanikov
In this paper, we provide a theoretical foundation for pointwise map recovery from functional maps and highlight its relation to a range of shape correspondence methods based on spectral alignment. [Expand]
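The matrix-scaling step at the heart of Sinkhorn filtering is compact: alternately normalize the rows and columns of a similarity-derived kernel so that it approaches a doubly-stochastic soft correspondence matrix. A minimal NumPy sketch, with the temperature and iteration count as illustrative choices:

```python
# Sinkhorn matrix scaling: turn a similarity matrix into a (near) doubly-
# stochastic soft correspondence by alternating row/column normalization.
import numpy as np

def sinkhorn(sim: np.ndarray, tau: float = 0.05, iters: int = 50) -> np.ndarray:
    """sim: (N, M) similarity matrix; returns a soft correspondence matrix."""
    K = np.exp(sim / tau)
    for _ in range(iters):
        K /= K.sum(axis=1, keepdims=True)  # row normalization
        K /= K.sum(axis=0, keepdims=True)  # column normalization
    return K

P = sinkhorn(np.random.rand(5, 5))
print(P.sum(axis=0), P.sum(axis=1))  # both close to 1 after scaling
```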
Synthesize-It-Classifier: Learning a Generative Classifier Through Recurrent Self-Analysis
Arghya Pal, Raphael C.-W. Phan, KokSheik Wong
In this work, we show the generative capability of an image classifier network by synthesizing high-resolution, photo-realistic, and diverse images at scale. [Expand]
Generalization on Unseen Domains via Inference-Time Label-Preserving Target Projections
Prashant Pandey, Mrigank Raman, Sumanth Varambally, Prathosh AP
Generalizing machine learning models trained on a set of source domains to unseen target domains with different statistics is a challenging problem. [Expand]
Unsupervised Hyperbolic Representation Learning via Message Passing Auto-Encoders
Jiwoong Park, Junho Cho, Hyung Jin Chang, Jin Young Choi
Most of the existing literature on hyperbolic embedding concentrates on supervised learning, whereas the use of unsupervised hyperbolic embedding is less well explored. [Expand]
Deep Multi-Task Learning for Joint Localization, Perception, and Prediction
John Phillips, Julieta Martinez, Ioan Andrei Barsan, Sergio Casas, Abbas Sadat, Raquel Urtasun
Over the last few years, we have witnessed tremendous progress on many subtasks of autonomous driving including perception, motion forecasting, and motion planning. [Expand]
BABEL: Bodies, Action and Behavior With English Labels
Abhinanda R. Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, Michael J. Black
Understanding the semantics of human movement -- the what, how and why of the movement -- is an important problem that requires datasets of human actions with semantic labels. [Expand]
Effective Snapshot Compressive-Spectral Imaging via Deep Denoising and Total Variation Priors
Haiquan Qiu, Yao Wang, Deyu Meng
Snapshot compressive imaging (SCI) is a new type of compressive imaging system that compresses multiple frames of images into a single snapshot measurement, which enjoys low cost, low bandwidth, and high-speed sensing rate. [Expand]
Existing rain-removal algorithms often tackle either rain streak removal or raindrop removal, and thus may fail to handle real-world rainy scenes. [Expand]
DyGLIP: A Dynamic Graph Model With Link Prediction for Accurate Multi-Camera Multiple Object Tracking
Kha Gia Quach, Pha Nguyen, Huu Le, Thanh-Dat Truong, Chi Nhan Duong, Minh-Triet Tran, Khoa Luu
Multi-Camera Multiple Object Tracking (MC-MOT) is a significant computer vision problem due to its emerging applicability in several real-world applications. [Expand]
Exploiting & Refining Depth Distributions With Triangulation Light Curtains
Yaadhav Raaj, Siddharth Ancha, Robert Tamburo, David Held, Srinivasa G. Narasimhan
Active sensing through the use of Adaptive Depth Sensors is a nascent field, with potential in areas such as Advanced driver-assistance systems (ADAS). [Expand]
Dongsheng Ruan, Daiyin Wang, Yuan Zheng, Nenggan Zheng, Min Zheng
Recently, a large number of channel attention blocks have been proposed to boost the representational power of deep convolutional neural networks (CNNs). [Expand]
A general approach to the hyperspectral image (HSI) denoising problem is to impose weights on different HSI pixels to suppress the negative influence of noisy elements. [Expand]
Multi-Perspective LSTM for Joint Visual Representation Learning
Alireza Sepas-Moghaddam, Fernando Pereira, Paulo Lobato Correia, Ali Etemad
We present a novel LSTM cell architecture capable of learning both intra- and inter-perspective relationships available in visual sequences captured from multiple perspectives. [Expand]
clDice - A Novel Topology-Preserving Loss Function for Tubular Structure Segmentation
Suprosanna Shit, Johannes C. Paetzold, Anjany Sekuboyina, Ivan Ezhov, Alexander Unger, Andrey Zhylka, Josien P. W. Pluim, Ulrich Bauer, Bjoern H. Menze
Accurate segmentation of tubular, network-like structures, such as vessels, neurons, or roads, is relevant to many fields of research. [Expand]
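The clDice construction is compact enough to sketch from its published formulation: soft skeletons of the prediction and ground truth are computed by iterated min/max pooling, and a "topology precision" and "topology sensitivity" are combined into an F1-style score. A PyTorch sketch, with pooling depth and smoothing constants as illustrative choices:

```python
# clDice-style topology-preserving loss via soft skeletonization.
import torch
import torch.nn.functional as F

def soft_erode(x):  return -F.max_pool2d(-x, 3, 1, 1)
def soft_dilate(x): return F.max_pool2d(x, 3, 1, 1)

def soft_skeleton(x, iters: int = 5):
    skel = F.relu(x - soft_dilate(soft_erode(x)))       # x minus its "opening"
    for _ in range(iters):
        x = soft_erode(x)
        delta = F.relu(x - soft_dilate(soft_erode(x)))
        skel = skel + F.relu(delta - skel * delta)
    return skel

def cl_dice_loss(pred, target, eps: float = 1e-6):
    sp, st = soft_skeleton(pred), soft_skeleton(target)
    tprec = (sp * target).sum() / (sp.sum() + eps)   # pred skeleton inside target
    tsens = (st * pred).sum() / (st.sum() + eps)     # target skeleton inside pred
    return 1.0 - 2.0 * tprec * tsens / (tprec + tsens + eps)

loss = cl_dice_loss(torch.rand(1, 1, 32, 32), (torch.rand(1, 1, 32, 32) > 0.5).float())
```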
Learning Spatial-Semantic Relationship for Facial Attribute Recognition With Limited Labeled Data
Ying Shu, Yan Yan, Si Chen, Jing-Hao Xue, Chunhua Shen, Hanzi Wang
Recent advances in deep learning have demonstrated excellent results for Facial Attribute Recognition (FAR), typically trained with large-scale labeled data. [Expand]
In this paper, we address the problem of referring expression comprehension in videos, which is challenging due to complex expression and scene dynamics. [Expand]
Mesh Saliency: An Independent Perceptual Measure or a Derivative of Image Saliency?
Ran Song, Wei Zhang, Yitian Zhao, Yonghuai Liu, Paul L. Rosin
While mesh saliency aims to predict regional importance of 3D surfaces in agreement with human visual perception and is well researched in computer vision and graphics, latest work with eye-tracking experiments shows that state-of-the-art mesh saliency methods remain poor at predicting human fixations. [Expand]
Jie Song, Haofei Zhang, Xinchao Wang, Mengqi Xue, Ying Chen, Li Sun, Dacheng Tao, Mingli Song
Knowledge distillation pursues a diminutive yet well-behaved student network by harnessing the knowledge learned by a cumbersome teacher model. [Expand]
Dynamic Probabilistic Graph Convolution for Facial Action Unit Intensity Estimation
Tengfei Song, Zijun Cui, Yuru Wang, Wenming Zheng, Qiang Ji
Deep learning methods have been widely applied to automatic facial action unit (AU) intensity estimation and achieved state-of-the-art performance. [Expand]
Deep RGB-D Saliency Detection With Depth-Sensitive Attention and Automatic Multi-Modal Fusion
Peng Sun, Wenhu Zhang, Huanyu Wang, Songyuan Li, Xi Li
RGB-D salient object detection (SOD) is usually formulated as a problem of classification or regression over two modalities, i.e., RGB and depth. [Expand]
Deep Video Matting via Spatio-Temporal Alignment and Aggregation
Yanan Sun, Guanzhi Wang, Qiao Gu, Chi-Keung Tang, Yu-Wing Tai
Despite the significant progress made by deep learning in natural image matting, there has so far been no representative work on deep learning for video matting, due to the inherent technical challenges of reasoning over the temporal domain and the lack of large-scale video matting datasets. [Expand]
Natural image matting separates the foreground from background in fractional occupancy which can be caused by highly transparent objects, complex foreground (e.g., net or tree), and/or objects containing very fine details (e.g., hairs). [Expand]
Tuning IR-Cut Filter for Illumination-Aware Spectral Reconstruction From RGB
Bo Sun, Junchi Yan, Xiao Zhou, Yinqiang Zheng
Reconstructing spectral signals from multi-channel observations, in particular trichromatic RGB images, has recently emerged as a promising alternative to traditional scanning-based spectral imaging. [Expand]
Uncertainty Reduction for Model Adaptation in Semantic Segmentation
Prabhu Teja S, Francois Fleuret
Traditional methods for Unsupervised Domain Adaptation (UDA) targeting semantic segmentation exploit information common to the source and target domains, using both labeled source data and unlabeled target data. [Expand]
Self-Supervised Wasserstein Pseudo-Labeling for Semi-Supervised Image Classification
Fariborz Taherkhani, Ali Dabouei, Sobhan Soleymani, Jeremy Dawson, Nasser M. Nasrabadi
The goal is to use the Wasserstein metric to provide pseudo-labels for unlabeled images so as to train a Convolutional Neural Network (CNN) in a Semi-Supervised Learning (SSL) manner for the classification task. [Expand]
The Information Bottleneck (IB) provides an information-theoretic principle for representation learning: retain all information relevant for predicting the label while minimizing redundancy. [Expand]
Probabilistic Selective Encryption of Convolutional Neural Networks for Hierarchical Services
Jinyu Tian, Jiantao Zhou, Jia Duan
Model protection is vital when deploying Convolutional Neural Networks (CNNs) for commercial services, due to the massive costs of training them. [Expand]
Automatic Correction of Internal Units in Generative Neural Networks
Ali Tousi, Haedong Jeong, Jiyeon Han, Hwanil Choi, Jaesik Choi
Generative Adversarial Networks (GANs) have shown satisfactory performance in synthetic image generation by devising complex network structure and adversarial training scheme. [Expand]
Instance segmentation, the task of identifying and separating each individual object of interest in the image, is one of the actively studied research topics in computer vision. [Expand]
Stepan Tulyakov, Daniel Gehrig, Stamatios Georgoulis, Julius Erbach, Mathias Gehrig, Yuanyou Li, Davide Scaramuzza
State-of-the-art frame interpolation methods generate intermediate frames by inferring object motions in the image from consecutive key-frames. [Expand]
Uncertainty-Aware Camera Pose Estimation From Points and Lines
Alexander Vakhitov, Luis Ferraz, Antonio Agudo, Francesc Moreno-Noguer
Perspective-n-Point-and-Line (PnPL) algorithms aim at fast, accurate, and robust camera localization with respect to a 3D model from 2D-3D feature correspondences, being a major part of modern robotic and AR/VR systems. [Expand]
A Self-Boosting Framework for Automated Radiographic Report Generation
Zhanyu Wang, Luping Zhou, Lei Wang, Xiu Li
Automated radiographic report generation is a challenging task, since it requires generating paragraphs that describe fine-grained visual differences between cases, especially between diseased and healthy ones. [Expand]
Contrastive Learning Based Hybrid Networks for Long-Tailed Image Classification
Peng Wang, Kai Han, Xiu-Shen Wei, Lei Zhang, Lei Wang
Learning discriminative image representations plays a vital role in long-tailed image classification because it can ease the classifier learning in imbalanced cases. [Expand]
Domain adaptation methods face performance degradation in object detection, as the complexity of the task places greater demands on the transferability of the model. [Expand]
EvDistill: Asynchronous Events To End-Task Learning via Bidirectional Reconstruction-Guided Cross-Modal Knowledge Distillation
Lin Wang, Yujeong Chae, Sung-Hoon Yoon, Tae-Kyun Kim, Kuk-Jin Yoon
Event cameras sense per-pixel intensity changes and produce asynchronous event streams with high dynamic range and less motion blur, showing advantages over the conventional cameras. [Expand]
FESTA: Flow Estimation via Spatial-Temporal Attention for Scene Point Clouds
Haiyan Wang, Jiahao Pang, Muhammad A. Lodhi, Yingli Tian, Dong Tian
Scene flow depicts the dynamics of a 3D scene, which is critical for various applications such as autonomous driving, robot navigation, AR/VR, etc. [Expand]
From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach
Guotao Wang, Chenglizhao Chen, Deng-Ping Fan, Aimin Hao, Hong Qin
Thanks to the rapid advances in the deep learning techniques and the wide availability of large-scale training sets, the performances of video saliency detection models have been improving steadily and significantly. [Expand]
Improving OCR-Based Image Captioning by Incorporating Geometrical Relationship
Jing Wang, Jinhui Tang, Mingkun Yang, Xiang Bai, Jiebo Luo
OCR-based image captioning aims to automatically describe images based on all the visual entities (both visual objects and scene text) in images. [Expand]
Glancing at the Patch: Anomaly Localization With Global and Local Feature Comparison
Shenzhi Wang, Liwei Wu, Lei Cui, Yujun Shen
Anomaly localization, with the purpose to segment the anomalous regions within images, is challenging due to the large variety of anomaly types. [Expand]
LED2-Net: Monocular 360deg Layout Estimation via Differentiable Depth Rendering
Fu-En Wang, Yu-Hsuan Yeh, Min Sun, Wei-Chen Chiu, Yi-Hsuan Tsai
Although significant progress has been made in room layout estimation, most methods aim to reduce the loss in the 2D pixel coordinate rather than exploiting the room structure in the 3D space. [Expand]
Multi-Decoding Deraining Network and Quasi-Sparsity Based Training
Yinglong Wang, Chao Ma, Bing Zeng
Existing deep deraining models are mainly learned via directly minimizing the statistical differences between rainy images and rain-free ground truths. [Expand]
PAUL: Procrustean Autoencoder for Unsupervised Lifting
Chaoyang Wang, Simon Lucey
Recent success in casting Non-rigid Structure from Motion (NRSfM) as an unsupervised deep learning problem has raised fundamental questions about what novelty deep learning can offer to the NRSfM prior. [Expand]
Representative Forgery Mining for Fake Face Detection
Chengrui Wang, Weihong Deng
Although vanilla Convolutional Neural Network (CNN) based detectors can achieve satisfactory performance on fake face detection, we observe that they tend to seek forgeries in a limited region of the face, which reveals that the detectors lack an understanding of forgery. [Expand]
RSG: A Simple but Effective Module for Learning Imbalanced Datasets
Jianfeng Wang, Thomas Lukasiewicz, Xiaolin Hu, Jianfei Cai, Zhenghua Xu
Imbalanced datasets widely exist in practice and are a great challenge for training deep neural models with a good generalization on infrequent classes. [Expand]
For unsupervised domain adaptation (UDA), to alleviate the effect of domain shift, many approaches align the source and target domains in the feature space by adversarial learning or by explicitly aligning their statistics. [Expand]
Autoregressive Stylized Motion Synthesis With Generative Flow
Yu-Hui Wen, Zhipeng Yang, Hongbo Fu, Lin Gao, Yanan Sun, Yong-Jin Liu
Motion style transfer is an important problem in many computer graphics and computer vision applications, including human animation, games, and robotics. [Expand]
Learning Progressive Point Embeddings for 3D Point Cloud Generation
Cheng Wen, Baosheng Yu, Dacheng Tao
Generative models for 3D point clouds are extremely important for scene/object reconstruction applications in autonomous driving and robotics. [Expand]
Embedded Discriminative Attention Mechanism for Weakly Supervised Semantic Segmentation
Tong Wu, Junshi Huang, Guangyu Gao, Xiaoming Wei, Xiaolin Wei, Xuan Luo, Chi Harold Liu
Weakly Supervised Semantic Segmentation (WSSS) with image-level annotation uses class activation maps from the classifier as pseudo-labels for semantic segmentation. [Expand]
Improving the Transferability of Adversarial Samples With Adversarial Transformations
Weibin Wu, Yuxin Su, Michael R. Lyu, Irwin King
Although deep neural networks (DNNs) have achieved tremendous performance in diverse vision challenges, they are surprisingly susceptible to adversarial examples, which are born of intentionally perturbing benign samples in a human-imperceptible fashion. [Expand]
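The paper's contribution is to learn the transformations; as a simpler illustration of why transforming inputs helps transferability, below is the fixed random resize-and-pad "input diversity" baseline from earlier transfer-attack work, wrapped around a single FGSM step. The model, image sizes, and epsilon are placeholders.

```python
# Input-diversity baseline for transferable attacks: perturb gradients are
# computed on a randomly resized-and-padded copy of the input.
import torch
import torch.nn.functional as F

def random_resize_pad(x, low: int = 28, high: int = 32):
    s = int(torch.randint(low, high + 1, (1,)))
    xr = F.interpolate(x, size=(s, s), mode="nearest")
    pad = high - s
    left, top = int(torch.randint(0, pad + 1, (1,))), int(torch.randint(0, pad + 1, (1,)))
    return F.pad(xr, (left, pad - left, top, pad - top))  # back to high x high

def transfer_fgsm(model, x, y, eps: float = 8 / 255):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(random_resize_pad(x)), y)  # attack transformed input
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x_adv = transfer_fgsm(model, torch.rand(2, 3, 32, 32), torch.tensor([1, 7]))
```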
Progressive Unsupervised Learning for Visual Object Tracking
Qiangqiang Wu, Jia Wan, Antoni B. Chan
In this paper, we propose a progressive unsupervised learning (PUL) framework, which entirely removes the need for annotated training videos in visual tracking. [Expand]
Dynamic Weighted Learning for Unsupervised Domain Adaptation
Ni Xiao, Lei Zhang
Unsupervised domain adaptation (UDA) aims to improve the classification performance on an unlabeled target domain by leveraging information from a fully labeled source domain. [Expand]
Space-Time Distillation for Video Super-Resolution
Zeyu Xiao, Xueyang Fu, Jie Huang, Zhen Cheng, Zhiwei Xiong
Compact video super-resolution (VSR) networks can be easily deployed on resource-limited devices, e.g., smart-phones and wearable devices, but have considerable performance gaps compared with complicated VSR networks that require a large amount of computing resources. [Expand]
Scale-Aware Graph Neural Network for Few-Shot Semantic Segmentation
Guo-Sen Xie, Jie Liu, Huan Xiong, Ling Shao
Few-shot semantic segmentation (FSS) aims to segment unseen class objects given very few densely-annotated support images from the same class. [Expand]
Recently, with the emergence of retrieval requirements for specific individuals within the same superclass, e.g., birds, persons, and cars, the fine-grained recognition task has attracted a significant amount of attention from academia and industry. [Expand]
1-bit detectors show great promise for resource-constrained embedded devices but often suffer from a significant performance gap compared with their real-valued counterparts. [Expand]
SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning Over Traffic Events
Li Xu, He Huang, Jun Liu
Traffic event cognition and reasoning in videos is an important task that has a wide range of applications in intelligent transportation, assisted driving, and autonomous vehicles. [Expand]
Text-based image captioning (TextCap), which aims to read and reason about images containing text, is crucial for a machine to understand detailed and complex scene environments, considering that text is omnipresent in daily life. [Expand]
Referring image segmentation aims to segment the referent, i.e., the corresponding object or stuff referred to by a natural language expression, in an image. [Expand]
Beyond Short Clips: End-to-End Video-Level Learning With Collaborative Memories
Xitong Yang, Haoqi Fan, Lorenzo Torresani, Larry S. Davis, Heng Wang
The standard way of training video models entails sampling at each iteration a single clip from a video and optimizing the clip prediction with respect to the video-level label. [Expand]
Defending Multimodal Fusion Models Against Single-Source Adversaries
Karren Yang, Wan-Yi Lin, Manash Barman, Filipe Condessa, Zico Kolter
Beyond achieving high performance across many vision tasks, multimodal models are expected to be robust to single-source faults due to the availability of redundant information between modalities. [Expand]
DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping
Yanchao Yang, Brian Lai, Stefano Soatto
We describe an unsupervised method to detect and segment portions of images of live scenes that, at some point in time, are seen moving as a coherent whole, which we refer to as objects. [Expand]
Exploiting Semantic Embedding and Visual Feature for Facial Action Unit Detection
Huiyuan Yang, Lijun Yin, Yi Zhou, Jiuxiang Gu
Recent study on detecting facial action units (AU) has utilized auxiliary information (i.e., facial landmarks, relationship among AUs and expressions, web facial images, etc.), in order to improve the AU detection performance. [Expand]
A deep facial attribute editing model strives to meet two requirements: (1) attribute correctness -- the target attribute should correctly appear on the edited face image; (2) irrelevance preservation -- any irrelevant information (e.g., identity) should not be changed after editing. [Expand]
LayoutTransformer: Scene Layout Generation With Conceptual and Spatial Diversity
Cheng-Fu Yang, Wan-Cyuan Fan, Fu-En Yang, Yu-Chiang Frank Wang
When translating text inputs into layouts or images, existing works typically require explicit descriptions of each object in a scene, including their spatial information or the associated relationships. [Expand]
Mol2Image: Improved Conditional Flow Models for Molecule to Image Synthesis
Karren Yang, Samuel Goldman, Wengong Jin, Alex X. Lu, Regina Barzilay, Tommi Jaakkola, Caroline Uhler
In this paper, we aim to synthesize cell microscopy images under different molecular interventions, motivated by practical applications to drug development. [Expand]
In real-world applications, it is common that only a portion of data is aligned across views due to spatial, temporal, or spatiotemporal asynchronism, thus leading to the so-called Partially View-aligned Problem (PVP). [Expand]
SelfSAGCN: Self-Supervised Semantic Alignment for Graph Convolution Network
Xu Yang, Cheng Deng, Zhiyuan Dang, Kun Wei, Junchi Yan
Graph convolution networks (GCNs) are a powerful deep learning approach and have been successfully applied to representation learning on graphs in a variety of real-world applications. [Expand]
Weakly supervised temporal action detection aims to localize temporal boundaries of actions and identify their categories simultaneously with only video-level category labels during training. [Expand]
Yichao Yan, Jinpeng Li, Jie Qin, Song Bai, Shengcai Liao, Li Liu, Fan Zhu, Ling Shao
Person search aims to simultaneously localize and identify a query person from realistic, uncropped images, which can be regarded as the unified task of pedestrian detection and person re-identification (re-id). [Expand]
Self-Aligned Video Deraining With Transmission-Depth Consistency
Wending Yan, Robby T. Tan, Wenhan Yang, Dengxin Dai
In this paper, we address the problems of rain streaks and rain accumulation removal in video, by developing a self-aligned network with transmission-depth consistency. [Expand]
Learning feature embedding directly from images without any human supervision is a very challenging and essential task in the field of computer vision and machine learning. [Expand]
We propose Joint-DetNAS, a unified NAS framework for object detection, which integrates 3 key components: Neural Architecture Search, pruning, and Knowledge Distillation. [Expand]
Though machine learning algorithms are able to achieve pattern recognition from the correlation between data and labels, the presence of spurious features in the data decreases the robustness of these learned relationships with respect to varied testing environments. [Expand]
Linguistic Structures As Weak Supervision for Visual Scene Graph Generation
Keren Ye, Adriana Kovashka
Prior work in scene graph generation requires categorical supervision at the level of triplets---subjects and objects, and predicates that relate them, either with or without bounding box information. [Expand]
Towards Efficient Tensor Decomposition-Based DNN Model Compression With Optimization Framework
Miao Yin, Yang Sui, Siyu Liao, Bo Yuan
Advanced tensor decomposition, such as Tensor train (TT) and Tensor ring (TR), has been widely studied for deep neural network (DNN) model compression, especially for recurrent neural networks (RNNs). [Expand]
Given an untrimmed video and a query sentence, cross-modal video moment retrieval aims to rank pre-segmented video moment candidates so as to retrieve the moment that best matches the query sentence. [Expand]
The goal of out-of-distribution (OOD) detection is to handle the situations where the test samples are drawn from a different distribution than the training data. [Expand]
Hyper-LifelongGAN: Scalable Lifelong Learning for Image Conditioned Generation
Mengyao Zhai, Lei Chen, Greg Mori
Deep neural networks are susceptible to catastrophic forgetting: when encountering a new task, they can only remember the new task and fail to preserve their ability to accomplish previously learned tasks. [Expand]
Semantic segmentation models gain robustness against poor lighting conditions by virtue of complementary information from visible (RGB) and thermal images. [Expand]
Person re-identification (Re-ID) is to retrieve a particular person captured by different cameras, which is of great significance for security surveillance and pedestrian behavior analysis. [Expand]
Zhongwen Zhang, Dmitrii Marin, Maria Drangova, Yuri Boykov
We are interested in unsupervised reconstruction of complex near-capillary vasculature with thousands of bifurcations where supervision and learning are infeasible. [Expand]
Multi-view crowd counting has been previously proposed to utilize multiple cameras to extend the field-of-view of a single camera, capturing more people in the scene and improving counting performance for occluded people or those at low resolution. [Expand]
Convolutional network compression methods require training data to achieve acceptable results, but training data is routinely unavailable due to privacy and transmission limitations. [Expand]
Cross-View Gait Recognition With Deep Universal Linear Embeddings
Shaoxiong Zhang, Yunhong Wang, Annan Li
Gait is considered an attractive biometric identifier for its non-invasive and non-cooperative features compared with other biometric identifiers such as fingerprint and iris. [Expand]
Explicit Knowledge Incorporation for Visual Reasoning
Yifeng Zhang, Ming Jiang, Qi Zhao
Existing explainable and explicit visual reasoning methods only perform reasoning based on visual evidence but do not take into account knowledge beyond what is in the visual scene. [Expand]
Flow-Guided One-Shot Talking Face Generation With a High-Resolution Audio-Visual Dataset
Zhimeng Zhang, Lincheng Li, Yu Ding, Changjie Fan
One-shot talking face generation should synthesize high-visual-quality facial videos with reasonable animation of expression and head pose, using only arbitrary driving audio and a single arbitrary face image as the source. [Expand]
When in a new situation or geographical location, human drivers have an extraordinary ability to watch others and learn maneuvers that they themselves may have never performed. [Expand]
Learning To Restore Hazy Video: A New Real-World Dataset and a New Method
Xinyi Zhang, Hang Dong, Jinshan Pan, Chao Zhu, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Fei Wang
Most of the existing deep learning-based dehazing methods are trained and evaluated on the image dehazing datasets, where the dehazed images are generated by only exploiting the information from the corresponding hazy ones. [Expand]
Learning a Self-Expressive Network for Subspace Clustering
Shangzhi Zhang, Chong You, Rene Vidal, Chun-Guang Li
State-of-the-art subspace clustering methods are based on the self-expressive model, which represents each data point as a linear combination of other data points. [Expand]
MR Image Super-Resolution With Squeeze and Excitation Reasoning Attention Network
Yulun Zhang, Kai Li, Kunpeng Li, Yun Fu
High-quality high-resolution (HR) magnetic resonance (MR) images afford more detailed information for reliable diagnosis and quantitative image analyses. [Expand]
Person Re-Identification Using Heterogeneous Local Graph Attention Networks
Zhong Zhang, Haijia Zhang, Shuang Liu
Recently, some methods have focused on learning local relation among parts of pedestrian images for person re-identification (Re-ID), as it offers powerful representation capabilities. [Expand]
Self-Guided and Cross-Guided Learning for Few-Shot Segmentation
Bingfeng Zhang, Jimin Xiao, Terry Qin
Few-shot segmentation has been attracting a lot of attention due to its effectiveness in segmenting unseen object classes with only a few annotated samples. [Expand]
Sparse Multi-Path Corrections in Fringe Projection Profilometry
Yu Zhang, Daniel Lau, David Wipf
Three-dimensional scanning by means of structured light illumination is an active imaging technique in which a series of striped patterns is projected and captured, and the observed warping of the stripes is used to reconstruct the target object's surface by triangulating each camera pixel to a unique projector coordinate corresponding to a particular feature in the projected patterns. [Expand]
Despite the great success of GANs in image translation with different conditioning inputs, such as semantic segmentation and edge maps, generating high-fidelity images with reference styles from exemplars remains a grand challenge in conditional image-to-image translation. [Expand]
Deep Lucas-Kanade Homography for Multimodal Image Alignment
Yiming Zhao, Xinming Huang, Ziming Zhang
Estimating homography to align image pairs captured by different sensors or image pairs with large appearance changes is an important and general challenge for many computer vision applications. [Expand]
In this paper, we explore the compression of deep neural networks by quantizing the weights and activations into multi-bit binary networks (MBNs). [Expand]
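One standard way to form such a multi-bit binary representation, useful as a mental model for this entry, is greedy residual binarization: each extra bit binarizes what the previous bits failed to capture. A minimal PyTorch sketch follows (the paper's training scheme is more involved; this only illustrates the representation):

```python
# Greedy residual binarization: W is approximated as sum_i alpha_i * B_i with
# B_i in {-1, +1}; alpha = mean(|residual|) is the least-squares scale for a
# sign basis. Approximation error shrinks as bits grow.
import torch

def multi_bit_binarize(w: torch.Tensor, bits: int = 2):
    residual, alphas, bases = w.clone(), [], []
    for _ in range(bits):
        b = torch.sign(residual)
        a = residual.abs().mean()
        alphas.append(a); bases.append(b)
        residual = residual - a * b
    approx = sum(a * b for a, b in zip(alphas, bases))
    return approx, alphas, bases

w = torch.randn(64)
approx, _, _ = multi_bit_binarize(w, bits=3)
print(torch.norm(w - approx) / torch.norm(w))  # relative error
```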
PhD Learning: Learning With Pompeiu-Hausdorff Distances for Video-Based Vehicle Re-Identification
Jianan Zhao, Fengliang Qi, Guangyu Ren, Lin Xu
Vehicle re-identification (re-ID) is of great significance to urban operation, management, security and has gained more attention in recent years. [Expand]
We study a very challenging task, human image completion, which tries to recover missing human body parts with a reasonable human shape from a corrupted region. [Expand]
Self-Generated Defocus Blur Detection via Dual Adversarial Discriminators
Wenda Zhao, Cai Shang, Huchuan Lu
Although existing fully-supervised defocus blur detection (DBD) models significantly improve performance, training such deep models requires abundant pixel-level manual annotation, which is highly time-consuming and error-prone. [Expand]
In this paper, we propose a deep compositional metric learning (DCML) framework for effective and generalizable similarity measurement between images. [Expand]
Deep Convolutional Dictionary Learning for Image Denoising
Hongyi Zheng, Hongwei Yong, Lei Zhang
Inspired by the great success of deep neural networks (DNNs), many unfolding methods have been proposed to integrate traditional image modeling techniques, such as dictionary learning (DicL) and sparse coding, into DNNs for image restoration. [Expand]
Improving Multiple Object Tracking With Single Object Tracking
Linyu Zheng, Ming Tang, Yingying Chen, Guibo Zhu, Jinqiao Wang, Hanqing Lu
Despite considerable similarities between multiple object tracking (MOT) and single object tracking (SOT) tasks, modern MOT methods have not benefited from the development of SOT ones to achieve satisfactory performance. [Expand]
Patchwise Generative ConvNet: Training Energy-Based Models From a Single Natural Image for Internal Learning
Zilong Zheng, Jianwen Xie, Ping Li
Exploiting internal statistics of a single natural image has long been recognized as a significant research paradigm where the goal is to learn the distribution of patches within the image without relying on external training data. [Expand]
This paper presents a detection-aware pre-training (DAP) approach, which leverages only weakly-labeled classification-style datasets (e.g., ImageNet) for pre-training, but is specifically tailored to benefit object detection tasks. [Expand]
Neighborhood Contrastive Learning for Novel Class Discovery
Zhun Zhong, Enrico Fini, Subhankar Roy, Zhiming Luo, Elisa Ricci, Nicu Sebe
In this paper, we address Novel Class Discovery (NCD), the task of unveiling new classes in a set of unlabeled samples given a labeled dataset with known classes. [Expand]
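The neighborhood-contrastive idea can be sketched as an InfoNCE-style loss in which, besides the augmented view, each sample's nearest neighbors in a feature bank are treated as additional positives. A PyTorch sketch under that reading, with k, the temperature, and the bank all placeholders (the NCD-specific heads and bank updates follow the full paper):

```python
# Contrastive loss whose positive set is the augmented view plus k nearest
# neighbors retrieved from a feature bank; embeddings assumed L2-normalized.
import torch
import torch.nn.functional as F

def neighborhood_contrastive_loss(z, z_aug, bank, k: int = 5, tau: float = 0.1):
    """z, z_aug: (N, D) embeddings of two views; bank: (M, D)."""
    logits = z @ torch.cat([z_aug, bank]).T / tau          # (N, N + M) candidates
    pos = torch.zeros_like(logits)
    pos[torch.arange(len(z)), torch.arange(len(z))] = 1.0  # the augmented view
    nn_idx = (z @ bank.T).topk(k, dim=1).indices + len(z)  # k bank neighbors
    pos.scatter_(1, nn_idx, 1.0)
    log_prob = logits.log_softmax(dim=1)
    return -(pos * log_prob).sum(dim=1).div(pos.sum(dim=1)).mean()

z = F.normalize(torch.randn(8, 16), dim=1)
loss = neighborhood_contrastive_loss(z, F.normalize(torch.randn(8, 16), dim=1),
                                     F.normalize(torch.randn(32, 16), dim=1))
```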
Embracing Uncertainty: Decoupling and De-Bias for Robust Temporal Grounding
Hao Zhou, Chongyang Zhang, Yan Luo, Yanjun Chen, Chuanping Hu
Temporal grounding aims to localize temporal boundaries within untrimmed videos by language queries, but it faces the challenge of two types of inevitable human uncertainties: query uncertainty and label uncertainty. [Expand]
Human De-Occlusion: Invisible Perception and Recovery for Humans
Qiang Zhou, Shiyin Wang, Yitong Wang, Zilong Huang, Xinggang Wang
In this paper, we tackle the problem of human de-occlusion which reasons about occluded segmentation masks and invisible appearance content of humans. [Expand]
Man Zhou, Jie Xiao, Yifan Chang, Xueyang Fu, Aiping Liu, Jinshan Pan, Zheng-Jun Zha
While deep convolutional neural networks (CNNs) have achieved great success on image de-raining task, most existing methods can only learn fixed mapping rules between paired rainy/clean images on a single dataset. [Expand]
Long-term actions involve many important visual concepts, e.g., objects, motions, and sub-actions, and there are various relations among these concepts, which we call basic relations. [Expand]
Prototype Augmentation and Self-Supervision for Incremental Learning
Fei Zhu, Xu-Yao Zhang, Chuang Wang, Fei Yin, Cheng-Lin Liu
Despite the impressive performance in many individual tasks, deep neural networks suffer from catastrophic forgetting when learning new tasks incrementally. [Expand]