Slide 1

Novel datasets @CVPR2021
Masanari Kimura (mkimura@ridge-i.com)

Slide 2

TL;DR
• Introduce the novel datasets proposed at CVPR2021.
• Why do we need new datasets? Dataset development and method development drive each other: a new dataset enables a new SOTA method, which in turn brings additional constraints and assumptions that motivate the next dataset.

Slide 3

New datasets @CVPR2021
1. PPR10K: A Large-Scale Portrait Photo Retouching Dataset With Human-Region Mask and Group-Level Consistency
2. Towards Semantic Segmentation of Urban-Scale 3D Point Clouds: A Dataset, Benchmarks and Challenges
3. Rethinking Text Segmentation: A Novel Dataset and a Text-Specific Refinement Approach
4. SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data
5. Intentonomy: A Dataset and Study Towards Human Intent Understanding
6. Towards Fast and Accurate Real-World Depth Super-Resolution: Benchmark Dataset and Baseline
7. Zillow Indoor Dataset: Annotated Floor Plans With 360deg Panoramas and 3D Room Layouts
8. Learning To Restore Hazy Video: A New Real-World Dataset and a New Method
9. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
10. iMiGUE: An Identity-Free Video Dataset for Micro-Gesture Understanding and Emotion Analysis
11. Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild With Pose Annotations
12. 3DCaricShop: A Dataset and a Baseline Method for Single-View 3D Caricature Face Reconstruction
13. VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild
14. Flow-Guided One-Shot Talking Face Generation With a High-Resolution Audio-Visual Dataset
15. How2Sign: A Large-Scale Multimodal Dataset for Continuous American Sign Language
16. Sewer-ML: A Multi-Label Sewer Defect Classification Dataset and Benchmark
17. The Multi-Temporal Urban Development SpaceNet Dataset
18. GeoSim: Realistic Video Simulation via Geometry-Aware Composition for Self-Driving

Slide 4

New datasets @CVPR2021 (cont.)
19. Dictionary-Guided Scene Text Recognition
20. Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-Localization in Large Scenes From Body-Mounted Sensors
21. Transformation Driven Visual Reasoning
22. Natural Adversarial Examples
23. TextOCR: Towards Large-Scale End-to-End Reasoning for Arbitrary-Shaped Scene Text
24. Enriching ImageNet With Human Similarity Judgments and Psychological Embeddings
25. Semantic Image Matting
26. DoDNet: Learning To Segment Multi-Organ and Tumors From Multiple Partially Labeled Datasets
27. Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers
28. Learning Goals From Failure
29. Learning To Count Everything
30. Variational Relational Point Completion Network
31. TrafficSim: Learning To Simulate Realistic Multi-Agent Behaviors
32. OpenRooms: An Open Framework for Photorealistic Indoor Scene Datasets
33. ArtEmis: Affective Language for Visual Art
34. DexYCB: A Benchmark for Capturing Hand Grasping of Objects
35. SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning Over Traffic Events

Slide 5

New datasets @CVPR2021 (cont.)
36. Cross-View Cross-Scene Multi-View Crowd Counting
37. Depth-Aware Mirror Segmentation
38. AGORA: Avatars in Geography Optimized for Regression Analysis
39. Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark
40. Mirror3D: Depth Refinement for Mirror Surfaces
41. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
42. Learning Multi-Scale Photo Exposure Correction
43. Unsupervised Pre-Training for Person Re-Identification
44. Home Action Genome: Cooperative Compositional Action Understanding
45. AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
46. Nutrition5k: Towards Automatic Nutritional Understanding of Generic Food
47. Person30K: A Dual-Meta Generalization Network for Person Re-Identification

Slide 6

Slide 7

Trajectory Prediction / Person ReID

Slide 8

• This dataset is a large collection of videos containing intentional and unintentional actions.
• Videos in this dataset are annotated with the moment at which the action becomes unintentional.
• Tasks: trajectory prediction

Slide 9

• This dataset consists of pedestrian and bicyclist trajectories.
• http://www.europvi.mpi-inf.mpg.de/
• Tasks: trajectory prediction (a metric sketch follows below)
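
Since several of the datasets in this group target trajectory prediction, here is a minimal sketch of the two metrics most commonly reported for that task, average displacement error (ADE) and final displacement error (FDE). This is generic NumPy code, not tied to the Euro-PVI file format; the array shapes are assumptions for illustration only.

```python
import numpy as np

def ade_fde(pred, gt):
    """Average / final displacement error for trajectory prediction.

    pred, gt: arrays of shape (num_agents, num_timesteps, 2) holding
    predicted and ground-truth (x, y) positions. The shapes are assumed
    for illustration and do not reflect any particular dataset format.
    """
    # Per-timestep Euclidean distance between prediction and ground truth.
    dist = np.linalg.norm(pred - gt, axis=-1)   # (num_agents, num_timesteps)
    ade = dist.mean()                           # average over agents and timesteps
    fde = dist[:, -1].mean()                    # error at the final timestep only
    return ade, fde

# Toy usage: 3 agents, 12 future timesteps.
rng = np.random.default_rng(0)
gt = rng.normal(size=(3, 12, 2))
pred = gt + 0.1 * rng.normal(size=gt.shape)
print(ade_fde(pred, gt))
```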

Slide 10

• This dataset is generated by a multi-agent behavior model for realistic traffic simulation.
• Tasks: trajectory prediction

Slide 11

• This dataset is:
  • very large scale, containing 1.38 million images of 30K identities;
  • captured by a large system of 6,497 cameras deployed at 89 different sites;
  • rich in sample diversity, with varied backgrounds and diverse person poses.
• Tasks: person ReID

Slide 12

• This dataset consists of 4M person images of over 200K identities extracted from 46K YouTube videos, which is 30× larger than the largest existing ReID dataset, MSMT.
• The collected videos cover a wide range of capturing environments (e.g., fixed or moving cameras, dynamic scenes, different resolutions), yielding great data diversity, which is essential for learning generic representations.
• Tasks: person ReID (see the evaluation sketch below)
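
For the person ReID task that both of the preceding datasets target, evaluation usually amounts to ranking gallery embeddings by distance to each query and computing rank-1 accuracy and mAP. Below is a minimal, dataset-agnostic sketch; the absence of camera-ID filtering is a simplifying assumption, not how any specific benchmark protocol defines the metric.

```python
import numpy as np

def reid_metrics(query_feats, gallery_feats, query_ids, gallery_ids):
    """Rank-1 accuracy and mean average precision for ReID.

    query_feats: (num_query, dim), gallery_feats: (num_gallery, dim).
    Simplified sketch: no same-camera filtering is applied.
    """
    # Euclidean distance matrix between every query and every gallery image.
    dists = np.linalg.norm(query_feats[:, None] - gallery_feats[None], axis=-1)
    rank1, aps = 0.0, []
    for i in range(len(query_feats)):
        order = np.argsort(dists[i])                          # gallery sorted by distance
        matches = (gallery_ids[order] == query_ids[i]).astype(float)
        rank1 += matches[0]                                   # correct identity at rank 1?
        hits = np.cumsum(matches)
        precision = hits / (np.arange(len(matches)) + 1)      # precision@k for every k
        aps.append((precision * matches).sum() / max(matches.sum(), 1))
    return rank1 / len(query_feats), float(np.mean(aps))
```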

Slide 13

3D / Point Clouds

Slide 14

• This dataset is a large indoor dataset with 71,474 panoramas from 1,524 real unfurnished homes.
• It provides annotations of:
  • 3D room layouts;
  • 2D and 3D floor plans;
  • panorama locations in the floor plan;
  • locations of windows and doors.
• https://github.com/zillow/zind
• Tasks: layout estimation, multi-view registration

Slide 15

• This dataset is the first large-scale 3D caricature dataset, containing 2,000 high-quality, diversified 3D caricatures manually crafted by professional artists.
• https://qiuyuda.github.io/3DCaricShop/
• Tasks: 3D caricature reconstruction from a 2D caricature

Slide 16

• This dataset is a 3D mirror plane dataset based on three RGBD datasets (Matterport3D, NYUv2 and ScanNet), containing 7,011 mirror instance masks and 3D planes.
• Motivation: mirror surfaces are a significant source of errors.
• https://3dlg-hcvc.github.io/mirror3d/
• Tasks: depth refinement for mirror surfaces

Slide 17

• This dataset is a large and rich urban-scale dataset including two accurately labelled regions covering 4.4 km² and an extra unlabelled region covering 3.2 km².
• In the dataset, each 3D point is labelled as one of 13 semantic classes.
• https://github.com/QingyongHu/SensatUrban
• https://www.youtube.com/watch?v=IG0tTdqB3L8
• Tasks: (semi-)supervised 3D point cloud segmentation (see the mIoU sketch below)
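
For per-point semantic segmentation benchmarks like this one, the headline metric is mean intersection-over-union (mIoU) over the semantic classes (13 in this dataset). Below is a minimal NumPy sketch, assuming predictions and labels are flat integer arrays with one class index per 3D point; the array layout is an assumption, not the dataset's actual evaluation script.

```python
import numpy as np

def mean_iou(pred, label, num_classes=13):
    """Mean IoU for per-point semantic labels.

    pred, label: 1-D integer arrays with one class index per 3D point.
    num_classes=13 matches the class count stated for this dataset.
    """
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (label == c))
        union = np.sum((pred == c) | (label == c))
        if union > 0:                       # skip classes absent from both arrays
            ious.append(inter / union)
    return float(np.mean(ious))
```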

Slide 18

• This dataset is a synthetic video dataset with frame-by-frame mesh annotations which extends SAIL-VOS.
• http://sailvos.web.illinois.edu/
• Tasks: 3D reconstruction from video data

Slide 19

• This dataset is a large-scale benchmark that can greatly promote the study of depth map super-resolution and, more broadly, other depth-related real-world tasks.
• Tasks: depth map super-resolution

Slide 20

• This dataset is known as SpaceNet 7.
• It is both a dataset and a NeurIPS 2020 competition.
• It consists of 101 labelled sequences of satellite imagery collected by Planet Labs’ Dove constellation between 2017 and 2020.
• https://registry.opendata.aws/spacenet/
• Tasks: object tracking, segmentation, change detection

Slide 21

• This dataset contains object-centric short videos with pose annotations for nine categories and includes 4 million annotated images in 14,819 annotated videos.
• https://github.com/google-research-datasets/Objectron
• Tasks: 3D object detection, 3D object tracking

Slide 22

• This dataset is created with the Human POSEitioning System (HPS), a method to recover the full 3D pose of a human, registered with a 3D scan of the surrounding environment, using wearable sensors.
• http://virtualhumans.mpi-inf.mpg.de/hps/
• Tasks: scene modeling

Slide 23

• This dataset contains over 100K HDR images with ground-truth depths, normals, spatially-varying BRDF and light sources, along with per-pixel spatially-varying lighting and visibility masks for every light source.
• https://ucsd-openrooms.github.io/
• Tasks: inverse rendering, depth estimation, etc.

Slide 24

• This dataset contains over 100,000 high-quality scans; partial 3D shapes are rendered from 26 uniformly distributed camera poses for each 3D CAD model.
• https://paul007pl.github.io/projects/VRCNet
• Tasks: shape completion

Slide 25

• This dataset consists of 582K RGB-D frames over 1,000 sequences of 10 subjects grasping 20 different objects from 8 views.
• https://dex-ycb.github.io/
• Tasks: keypoint detection, pose estimation, etc.

Slide 26

• Benchmark datasets for 3D human pose estimation from images are limited by clothing complexity, environmental conditions, number of subjects, and occlusion.
• The authors constructed AGORA, a synthetic dataset with highly accurate ground truth: using 4,240 commercially available human scans, they fit the SMPL-X body model to the 3D scans to create a reference pose and body.
• https://agora.is.tue.mpg.de/
• Tasks: pose estimation

Slide 27

• This dataset is an RGB-D mirror segmentation dataset of 3,049 exemplars.
• https://mhaiyang.github.io/CVPR2021_PDNet/index
• Tasks: RGB-D mirror segmentation

Slide 28

Text Recognition

Slide 29

• This dataset consists of 4,024 text images, including scene text and design text with various artistic effects.
• This dataset has six types of annotations for each image:
  • word- and character-wise quadrilateral bounding polygons;
  • word- and character-wise pixel-level masks;
  • word- and character-wise transcriptions.
• Tasks: text segmentation (see the mask-rasterization sketch below)
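
To make the annotation types above concrete, the sketch below rasterizes a word-level quadrilateral polygon into a binary pixel mask with Pillow. The polygon coordinates and image size are made up for illustration and do not follow this dataset's actual file format.

```python
import numpy as np
from PIL import Image, ImageDraw

def quad_to_mask(quad, height, width):
    """Rasterize a quadrilateral [(x1, y1), ..., (x4, y4)] into a 0/1 mask."""
    mask = Image.new("L", (width, height), 0)               # blank single-channel image
    ImageDraw.Draw(mask).polygon(quad, outline=1, fill=1)   # fill the polygon interior
    return np.array(mask, dtype=np.uint8)

# Hypothetical word-level quadrilateral on a 100x200 image.
quad = [(20, 30), (180, 35), (175, 70), (15, 65)]
mask = quad_to_mask(quad, height=100, width=200)
print(mask.sum(), "foreground pixels")
```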

Slide 30

• This dataset is a challenging scene text dataset for Vietnamese, where some characters are visually ambiguous due to accent symbols.
• Tasks: text detection, text recognition

Slide 31

• This dataset is an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset.
• https://textvqa.org/textocr
• Tasks: text detection, text recognition

Slide 32

Captioning / VQA

Slide 33

• This dataset takes the form of video QA, based on 10,080 collected in-the-wild videos and 62,535 annotated QA pairs, for benchmarking the cognitive capability of causal inference and event understanding models in complex traffic scenarios.
• https://github.com/SUTDCV/SUTD-TrafficQA
• Tasks: VQA

Slide 34

• This dataset is a novel large-scale dataset, with accompanying machine learning models, aimed at providing a detailed understanding of the interplay between visual content, its emotional effect, and explanations for the latter in language.
• Tasks: VQA

Slide 35

• This dataset provides human-generated captions that distinguish similar pairs of garment images, together with side information consisting of real-world product descriptions and derived visual attribute labels for these images.
• Tasks: relative captioning

Slide 36

• This dataset is based on CLEVR, since it is better to first study transformation-driven visual reasoning (TVR) in a simple setting and then move to more complex real scenarios, just as VQA was first studied on CLEVR and then generalized to more complicated settings like GQA.
• https://hongxin2019.github.io/TVR
• Tasks: visual question answering (VQA)

Slide 37

• This dataset is a new benchmark for compositional spatio-temporal reasoning.
• This dataset contains 192M unbalanced question-answer pairs for 9.6K videos.
• Tasks: VideoQA

Slide 38

• Since the conventional Conceptual Captions 3M (CC3M) was designed for the image captioning task, it only collects data that are suitable for captioning. The authors therefore propose Conceptual 12M (CC12M), a larger dataset with relaxed collection constraints.
• Tasks: VQA, image captioning (see the loading sketch below)
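
Web-scale image-text datasets in this family are typically distributed as a list of image URLs paired with captions rather than as the images themselves. The sketch below shows one way such a pair list could be streamed for pre-training; the tab-separated `url<TAB>caption` layout and the file name are assumptions, so check the official release for the exact format.

```python
import csv

def iter_image_text_pairs(tsv_path):
    """Yield (image_url, caption) pairs from a tab-separated pair list.

    Assumes one `url<TAB>caption` record per line, which is a common
    distribution format for web-scale image-text datasets; the official
    release notes define the authoritative layout.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                yield row[0], row[1]

# Hypothetical usage: count pairs without downloading any images.
# n = sum(1 for _ in iter_image_text_pairs("cc12m.tsv"))
```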

Slide 39

Gestures / Emotion Recognition

Slide 40

• This dataset focuses on nonverbal body gestures without using any identity information.
• The proposed dataset offers an approach where identity-free micro-gestures (MGs) are explored for hidden emotion understanding while the privacy of individuals is preserved.
• Tasks: micro-gesture recognition

Slide 41

• This dataset is a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth.
• Tasks: synthesizing sign language videos

Slide 42

• This dataset is a multi-view action dataset with multiple modalities and viewpoints, supplemented with hierarchical activity and atomic action labels together with dense scene composition labels.
• http://www.homeactiongenome.org/
• Tasks: action recognition

Slide 43

Classification / Regression

Slide 44

• This dataset comprises 14K images covering a wide range of everyday scenes.
• These images are manually annotated with 28 intent categories derived from a social psychology taxonomy.
• https://github.com/kmnp/intentonomy
• Tasks: object/context localization, classification

Slide 45

• This dataset consists of 5k diverse, real-world food dishes with corresponding video streams, depth images, component weights, and high-accuracy nutritional content annotations.
• https://github.com/google-research-datasets/Nutrition5k
• Tasks: regression

Slide 46

• This dataset is a large, novel, and publicly available multi-label classification dataset for image-based sewer defect classification.
• This dataset consists of 1.3 million images annotated by professional sewer inspectors from three different utility companies across nine years.
• http://vap.aau.dk/sewer-ml
• Tasks: multi-label classification (see the training sketch below)
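
Because this is a multi-label problem (one image can show several defect types at once), the standard loss is binary cross-entropy over independent sigmoid outputs rather than softmax cross-entropy. Below is a minimal PyTorch sketch of such a training step; the backbone choice and the number of defect classes are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_DEFECT_CLASSES = 17          # placeholder; use the dataset's actual label count

# Any image backbone works; ResNet-50 is just a common default here.
model = models.resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_DEFECT_CLASSES)

criterion = nn.BCEWithLogitsLoss()             # one sigmoid per defect class
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, targets):
    """images: (B, 3, H, W); targets: (B, NUM_DEFECT_CLASSES) multi-hot floats."""
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, targets)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors just to show the expected shapes.
loss = train_step(torch.randn(2, 3, 224, 224),
                  torch.randint(0, 2, (2, NUM_DEFECT_CLASSES)).float())
```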

Slide 47

Segmentation / Image Matting

Slide 48

• This dataset is generated from a geometry-aware image composition process which synthesizes novel urban driving scenarios by augmenting existing images with dynamic objects extracted from other scenes and rendered at novel poses.
• https://tmux.top/publication/geosim/
• Tasks: segmentation (data augmentation)

Slide 49

• This dataset extends the scene parsing task from images to videos, featuring:
  • well-trimmed long-temporal clips;
  • dense annotation;
  • high resolution.
• https://www.vspwdataset.com/
• Tasks: semantic segmentation

Slide 50

• This dataset is a large-scale, partially labeled dataset.
• This dataset is composed of seven partially labeled sub-datasets, involving seven organ and tumor segmentation tasks.
• https://git.io/DoDNet
• Tasks: semantic segmentation

Slide 51

• This dataset is a large-scale Semantic Image Matting Dataset with careful consideration of data balancing across different semantic classes.
• https://github.com/nowsyn/
• Tasks: semantic image matting

Slide 52

ImageNet Modification

Slide 53

• Two challenging datasets that reliably cause machine learning model performance to substantially degrade.
• IMAGENET-A is like the ImageNet test set, but it is far more challenging for existing models.
• IMAGENET-O is the first out-of-distribution detection dataset created for ImageNet models.
• https://github.com/hendrycks/natural-adv-examples
• Tasks: classification (see the evaluation sketch below)
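
Evaluating a pretrained classifier on these sets is largely standard image-folder inference; the catch is that IMAGENET-A covers only a 200-class subset of ImageNet, so the 1000-class predictions must be mapped onto that subset before scoring. The sketch below shows only the generic inference loop and leaves that mapping out, since the exact class list lives in the dataset repository; the directory path is a placeholder.

```python
import torch
from torchvision import datasets, models, transforms

# Standard ImageNet preprocessing for a pretrained ResNet-50.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "imagenet-a/" is a placeholder path to an ImageFolder-style copy of the data.
dataset = datasets.ImageFolder("imagenet-a/", transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

all_preds, all_labels = [], []
with torch.no_grad():
    for images, labels in loader:
        all_preds.append(model(images).argmax(dim=1))   # 1000-class predictions
        all_labels.append(labels)                       # folder indices (200 classes)

# To compute accuracy, map the 1000 ImageNet class indices onto the 200-class
# subset used by this benchmark (see the dataset repository) before comparing.
```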

Slide 54

• This dataset is composed of a large set of human similarity judgments that supplements the existing ILSVRC validation set.
• https://osf.io/cn2s3/
• Tasks: information retrieval

Slide 55

Video Dehazing

Slide 56

• This dataset can be used for supervised learning of video dehazing algorithms.
• This dataset was collected by a well-designed Consecutive Frames Acquisition System (CFAS).
• Tasks: video dehazing

Slide 57

Face Recognition

Slide 58

• This dataset is a large in-the-wild, high-resolution audio-visual dataset.
• The dataset is collected from YouTube and consists of about 16 hours of 720P or 1080P videos.
• Tasks: talking face generation

Slide 59

Counting

Slide 60

• This dataset consists of 147 object categories containing over 6,000 images that are suitable for the few-shot counting task.
• https://github.com/cvlab-stonybrook/LearningToCountEverything
• Tasks: object counting (see the metric sketch below)
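
Counting benchmarks like this one are usually scored by comparing the predicted count, commonly obtained by summing a predicted density map, against the annotated count, typically with mean absolute error (MAE) and RMSE. A minimal NumPy sketch follows; the density-map representation is the common convention for this task rather than anything specific to this dataset's file format.

```python
import numpy as np

def counting_errors(pred_density_maps, gt_counts):
    """MAE and RMSE between predicted and ground-truth object counts.

    pred_density_maps: list of 2-D arrays; summing a map gives the predicted count.
    gt_counts: list/array of annotated object counts per image.
    """
    pred_counts = np.array([m.sum() for m in pred_density_maps])
    gt_counts = np.asarray(gt_counts, dtype=float)
    errors = pred_counts - gt_counts
    mae = np.abs(errors).mean()
    rmse = np.sqrt((errors ** 2).mean())
    return mae, rmse

# Toy usage: two images with roughly 10 and 25 objects.
maps = [np.full((64, 64), 10 / 4096.0), np.full((64, 64), 25 / 4096.0)]
print(counting_errors(maps, [10, 25]))
```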

Slide 61

• This dataset is a large synthetic multi-camera crowd counting dataset with many scenes and camera views to capture many possible variations, which avoids the difficulty of collecting and annotating such a large real dataset.
• Tasks: crowd counting

Slide 62

Photo Retouching / Exposure Correction

Slide 63

• This dataset contains 1,681 groups and 11,161 high-quality raw portrait photos in total.
• It satisfies the following requirements:
  • the photos should be in raw format and of high quality;
  • the dataset should be large-scale and cover a wide range of real cases.
• https://github.com/csjliang/PPR10K
• Tasks: portrait photo retouching, semantic segmentation

Slide 64

• This dataset consists of over 24,000 images exhibiting the broadest range of exposure values to date, each with a corresponding properly exposed image.
• https://github.com/mahmoudnafifi/Exposure_Correction
• Tasks: photo exposure correction

Slide 65

Conclusion: The importance of new datasets
For Engineers
• It is useful to practice a variety of tasks.
• You can tackle your problem quickly by knowing which tasks are similar to yours.
For Researchers
• It is useful for designing task-driven research.

Slide 66

References
• All photos are referenced from the corresponding original papers.