Novel Datasets@CVPR2021

Novel datasets @CVPR2021 Masanari Kimura

TL;DR
• Introduce the novel datasets proposed at CVPR2021.
• Why do we need new datasets?
[Figure: diagram relating development of the datasets and development of the methods; labels: new dataset, SOTA method, additional constraints and assumptions]

New datasets @CVPR2021
1. PPR10K: A Large-Scale Portrait Photo Retouching Dataset With Human-Region Mask and Group-Level Consistency
2. Towards Semantic Segmentation of Urban-Scale 3D Point Clouds: A Dataset, Benchmarks and Challenges
3. Rethinking Text Segmentation: A Novel Dataset and a Text-Specific Refinement Approach
4. SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data
5. Intentonomy: A Dataset and Study Towards Human Intent Understanding
6. Towards Fast and Accurate Real-World Depth Super-Resolution: Benchmark Dataset and Baseline
7. Zillow Indoor Dataset: Annotated Floor Plans With 360° Panoramas and 3D Room Layouts
8. Learning To Restore Hazy Video: A New Real-World Dataset and a New Method
9. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
10. iMiGUE: An Identity-Free Video Dataset for Micro-Gesture Understanding and Emotion Analysis
11. Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild With Pose Annotations
12. 3DCaricShop: A Dataset and a Baseline Method for Single-View 3D Caricature Face Reconstruction
13. VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild
14. Flow-Guided One-Shot Talking Face Generation With a High-Resolution Audio-Visual Dataset
15. How2Sign: A Large-Scale Multimodal Dataset for Continuous American Sign Language
16. Sewer-ML: A Multi-Label Sewer Defect Classification Dataset and Benchmark
17. The Multi-Temporal Urban Development SpaceNet Dataset
18. GeoSim: Realistic Video Simulation via Geometry-Aware Composition for Self-Driving

19. Dictionary-Guided Scene Text Recognition
20. Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-Localization in Large Scenes From Body-Mounted Sensors
21. Transformation Driven Visual Reasoning
22. Natural Adversarial Examples
23. TextOCR: Towards Large-Scale End-to-End Reasoning for Arbitrary-Shaped Scene Text
24. Enriching ImageNet With Human Similarity Judgments and Psychological Embeddings
25. Semantic Image Matting
26. DoDNet: Learning To Segment Multi-Organ and Tumors From Multiple Partially Labeled Datasets
27. Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers
28. Learning Goals From Failure
29. Learning To Count Everything
30. Variational Relational Point Completion Network
31. TrafficSim: Learning To Simulate Realistic Multi-Agent Behaviors
32. OpenRooms: An Open Framework for Photorealistic Indoor Scene Datasets
33. ArtEmis: Affective Language for Visual Art
34. DexYCB: A Benchmark for Capturing Hand Grasping of Objects
35. SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning Over Traffic Events

36. Cross-View Cross-Scene Multi-View Crowd Counting
37. Depth-Aware Mirror Segmentation
38. AGORA: Avatars in Geography Optimized for Regression Analysis
39. Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark
40. Mirror3D: Depth Refinement for Mirror Surfaces
41. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
42. Learning Multi-Scale Photo Exposure Correction
43. Unsupervised Pre-Training for Person Re-Identification
44. Home Action Genome: Cooperative Compositional Action Understanding
45. AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
46. Nutrition5k: Towards Automatic Nutritional Understanding of Generic Food
47. Person30K: A Dual-Meta Generalization Network for Person Re-Identification

Trajectory Prediction / Person ReID

• This dataset is a large collection of videos containing intentional and unintentional actions.
• Videos in this dataset are annotated with the moment at which the action becomes unintentional.
• Tasks: trajectory prediction

• This dataset consists of pedestrian and bicyclist trajectories.
• http://www.europvi.mpi-inf.mpg.de/
• Tasks: trajectory prediction

• This dataset is generated by a multi-agent behavior model for realistic traffic simulation.
• Tasks: trajectory prediction

• This dataset features:
  • a very large scale, with 1.38 million images of 30K identities;
  • a large capture system of 6,497 cameras deployed at 89 different sites;
  • abundant sample diversity, including varied backgrounds and diverse person poses.
• Tasks: Person ReID

• This dataset consists of 4M person images of over 200K identities extracted from 46K YouTube videos, which is 30× larger than the largest existing Re-ID dataset MSMT.
• The collected videos cover a wide range of capturing environments (e.g., using fixed or moving cameras, under dynamic scenes, or having different resolutions), yielding a great data diversity, which is essential for learning generic representations.
• Tasks: Person ReID

3D / Point Clouds

• This dataset is a large indoor dataset with 71,474 panoramas from 1,524 real unfurnished homes.
• It provides annotations of:
  • 3D room layouts;
  • 2D and 3D floor plans;
  • panorama location in the floor plan;
  • locations of windows and doors.
• https://github.com/zillow/zind
• Tasks: layout estimation, multi-view registration

• This dataset is the first large-scale 3D caricature dataset. It contains 2,000 high-quality, diversified 3D caricatures manually crafted by professional artists.
• https://qiuyuda.github.io/3DCaricShop/
• Tasks: 3D caricature reconstruction from a 2D caricature

• This dataset is a 3D mirror plane dataset based on three RGBD datasets (Matterport3D, NYUv2 and ScanNet), containing 7,011 mirror instance masks and 3D planes.
• Motivation: mirror surfaces are a significant source of errors.
• https://3dlg-hcvc.github.io/mirror3d/
• Tasks:

• This dataset is a large and rich urban-scale dataset including two accurately labelled regions covering 4.4 km² and an extra unlabelled region covering 3.2 km².
• In the dataset, each 3D point is labeled as one of 13 semantic classes.
• https://github.com/QingyongHu/SensatUrban
• https://www.youtube.com/watch?v=IG0tTdqB3L8
• Tasks: (semi-)supervised 3D point cloud segmentation

• This dataset is a synthetic video dataset with frame-by-frame mesh annotations, which extends SAIL-VOS.
• http://sailvos.web.illinois.edu/
• Tasks: 3D reconstruction from video data

• This dataset is a large-scale dataset that can greatly promote the study of depth map super-resolution and other depth-related real-world tasks.
• Tasks: depth map super-resolution

• This dataset is known as SpaceNet 7, which is both a dataset and a NeurIPS 2020 competition.
• This dataset consists of 101 labelled sequences of satellite imagery collected by Planet Labs’ Dove constellation between 2017 and 2020.
• https://registry.opendata.aws/spacenet/
• Tasks: object tracking, segmentation, change detection

• This dataset contains object-centric short videos with pose annotations for nine categories and includes 4 million annotated images in 14,819 annotated videos.
• https://github.com/google-research-datasets/Objectron
• Tasks: 3D object detection, 3D object tracking

• This dataset is created by the Human POSEitioning System (HPS), a method that uses wearable sensors to recover the full 3D pose of a human registered with a 3D scan of the surrounding environment.
• http://virtualhumans.mpi-inf.mpg.de/hps/
• Tasks: scene modeling

• This dataset contains over 100K HDR images with ground-truth depths, normals, spatially-varying BRDF and light sources, along with per-pixel spatially-varying lighting and visibility masks for every light source.
• https://ucsd-openrooms.github.io/
• Tasks: inverse rendering, depth estimation, etc.

• This dataset contains over 100,000 high-quality scans, obtained by rendering partial 3D shapes from 26 uniformly distributed camera poses for each 3D CAD model.
• https://paul007pl.github.io/projects/VRCNet
• Tasks: shape completion

• This dataset consists of 582K RGB-D frames over 1,000 sequences of 10 subjects grasping 20 different objects from 8 views.
• https://dex-ycb.github.io/
• Tasks: keypoint detection, pose estimation, etc.

• Benchmark datasets for 3D human pose estimation from images are limited by clothing complexity, environmental conditions, number of subjects, and occlusion. The authors constructed AGORA, a synthetic dataset with high-accuracy ground truth. Using 4,240 commercially available human scans, they fit the SMPL-X body model to the 3D scans to create a reference pose and body.
• https://agora.is.tue.mpg.de/
• Tasks: pose estimation

• This dataset is an RGB-D mirror segmentation dataset of 3,049 exemplars.
• https://mhaiyang.github.io/CVPR2021_PDNet/index
• Tasks: RGB-D mirror segmentation

Text Recognition

• This dataset consists of 4,024 text images, including scene text and design text with various artistic effects.
• This dataset has six types of annotations for each image:
  • word- and character-wise quadrilateral bounding polygons;
  • word- and character-wise pixel-level masks;
  • word- and character-wise transcriptions.
• Tasks: text segmentation

• This dataset is a challenging scene text dataset for Vietnamese, where some characters are equivocal in visual form due to accent symbols.
• Tasks: text detection, text recognition

• This dataset is an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset.
• https://textvqa.org/textocr
• Tasks: text detection, text recognition

Captioning / VQA

• This dataset takes the form of video QA, based on 10,080 collected in-the-wild videos and 62,535 annotated QA pairs, for benchmarking the cognitive capability of causal inference and event understanding models in complex traffic scenarios.
• https://github.com/SUTDCV/SUTD-TrafficQA
• Tasks: VQA

• This is a novel large-scale dataset, with accompanying machine learning models, aimed at providing a detailed understanding of the interplay between visual content, its emotional effect, and explanations for the latter in language.
• Tasks: VQA

• This dataset provides human-generated captions that distinguish similar pairs of garment images, together with side information consisting of real-world product descriptions and derived visual attribute labels for these images.
• Tasks: relative captioning

• This dataset is based on CLEVR, since it is better to first study transformation driven visual reasoning (TVR) in a simple setting and then move to more complex real scenarios, just as people first studied VQA on CLEVR and then generalized to more complicated settings like GQA.
• https://hongxin2019.github.io/TVR
• Tasks: visual question answering (VQA)

• This dataset is a new benchmark for compositional spatio-temporal reasoning.
• This dataset contains 192M unbalanced question-answer pairs for 9.6K videos.
• Tasks: VideoQA

• Since the conventional Conceptual Captions 3M (CC3M) was designed for the captioning task, it only collects data that are valid for captioning. Therefore, the authors propose Conceptual 12M (CC12M), a larger dataset with relaxed constraints.
• Tasks: VQA, image captioning

Gestures / Emotion Recognition

• This dataset focuses on nonverbal body gestures without using any identity information.
• The proposed dataset offers an approach in which identity-free micro-gestures (MGs) are explored for hidden emotion understanding while the privacy of the individuals is preserved.
• Tasks: micro-gesture recognition

• This dataset is a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth.
• Tasks: synthesizing sign language videos

• This dataset is a multi-view action dataset with multiple modalities and viewpoints, supplemented with hierarchical activity and atomic action labels together with dense scene composition labels.
• http://www.homeactiongenome.org/
• Tasks: motion recognition

Classification / Regression

• This dataset comprises 14K images covering a wide range of everyday scenes.
• These images are manually annotated with 28 intent categories derived from a social psychology taxonomy.
• https://github.com/kmnp/intentonomy
• Tasks: object/context localization, classification

• This dataset consists of 5k diverse, real-world food dishes with corresponding video streams, depth images, component weights, and high-accuracy nutritional content annotation.
• https://github.com/google-research-datasets/Nutrition5k
• Tasks: regression

• This dataset is a large, novel, and publicly available multi-label classification dataset for image-based sewer defect classification.
• This dataset consists of 1.3 million images annotated by professional sewer inspectors from three different utility companies across nine years.
• http://vap.aau.dk/sewer-ml
• Tasks: classification

Segmentation / Image Matting

• This dataset is generated by a geometry-aware image composition process, which synthesizes novel urban driving scenarios by augmenting existing images with dynamic objects extracted from other scenes and rendered at novel poses.
• https://tmux.top/publication/geosim/
• Tasks: segmentation, (data augmentation)

• This dataset extends the scene parsing task from images to videos. Its features are:
  • well-trimmed long-temporal clips;
  • dense annotation;
  • high resolution.
• https://www.vspwdataset.com/
• Tasks: semantic segmentation

• This dataset is a large-scale partially labeled dataset.
• This dataset is composed of seven partially labeled sub-datasets, involving seven organ and tumor segmentation tasks.
• https://git.io/DoDNet
• Tasks: semantic segmentation

• This dataset is a large-scale Semantic Image Matting Dataset with careful consideration of data balancing across different semantic classes.
• https://github.com/nowsyn/
• Tasks: semantic image matting

ImageNet Modification

• Two challenging datasets that reliably cause machine learning model performance to substantially degrade.
• IMAGENET-A is like the ImageNet test set, but it is far more challenging for existing models.
• IMAGENET-O is the first out-of-distribution detection dataset created for ImageNet models.
• https://github.com/hendrycks/natural-adv-examples
• Tasks: classification

• This dataset is composed of a large set of human similarity judgments that supplements the existing ILSVRC validation set.
• https://osf.io/cn2s3/
• Tasks: information retrieval

Video Dehazing

• This dataset can be used for the supervised learning of video dehazing algorithms.
• This dataset was collected by a well-designed Consecutive Frames Acquisition System (CFAS).
• Tasks: video dehazing

Face Recognition

• This dataset is a large in-the-wild high-resolution audio-visual dataset.
• The dataset is collected from YouTube and consists of about 16 hours of 720p or 1080p videos.
• Tasks: talking face generation

Counting

• This dataset consists of 147 object categories containing over 6,000 images that are suitable for the few-shot counting task.
• https://github.com/cvlab-stonybrook/LearningToCountEverything
• Tasks: object counting

• This dataset is a large synthetic multi-camera crowd counting dataset with many scenes and camera views to capture many possible variations, which avoids the difficulty of collecting and annotating such a large real dataset.
• Tasks: crowd counting

Photo Retouching / Exposure Correction

• This dataset contains 1,681 groups and 11,161 high-quality raw portrait photos in total.
• It satisfies the following requirements:
  • the photos should be in raw format with high quality;
  • the dataset should be large-scale and cover a wide range of real cases.
• https://github.com/csjliang/PPR10K
• Tasks: portrait photo retouching, semantic segmentation

• This dataset consists of over 24,000 images exhibiting the broadest range of exposure values to date, each with a corresponding properly exposed image.
• https://github.com/mahmoudnafifi/Exposure_Correction
• Tasks: photo exposure correction

Conclusion: The importance of new datasets
For Engineers
• New datasets are useful for practicing a variety of tasks.
• You can tackle your problem quickly by knowing the tasks that are similar to yours.
For Researchers
• New datasets are useful for designing task-driven research.

References
• All photos are taken from the corresponding original papers.
