
Novel Datasets@CVPR2021



Masanari Kimura

July 16, 2021


  1. Novel datasets @CVPR2021 Masanari Kimura (mkimura@ridge-i.com)

  2. TL;DR
    • Introduce the novel datasets proposed at CVPR2021.
    • Why do we need new datasets?
    [Diagram labels: development of the datasets; development of the methods; new dataset; SOTA method; additional constraints and assumptions]
  3. New datasets @CVPR2021
    1. PPR10K: A Large-Scale Portrait Photo Retouching Dataset With Human-Region Mask and Group-Level Consistency
    2. Towards Semantic Segmentation of Urban-Scale 3D Point Clouds: A Dataset, Benchmarks and Challenges
    3. Rethinking Text Segmentation: A Novel Dataset and a Text-Specific Refinement Approach
    4. SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data
    5. Intentonomy: A Dataset and Study Towards Human Intent Understanding
    6. Towards Fast and Accurate Real-World Depth Super-Resolution: Benchmark Dataset and Baseline
    7. Zillow Indoor Dataset: Annotated Floor Plans With 360° Panoramas and 3D Room Layouts
    8. Learning To Restore Hazy Video: A New Real-World Dataset and a New Method
    9. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
    10. iMiGUE: An Identity-Free Video Dataset for Micro-Gesture Understanding and Emotion Analysis
    11. Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild With Pose Annotations
    12. 3DCaricShop: A Dataset and a Baseline Method for Single-View 3D Caricature Face Reconstruction
    13. VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild
    14. Flow-Guided One-Shot Talking Face Generation With a High-Resolution Audio-Visual Dataset
    15. How2Sign: A Large-Scale Multimodal Dataset for Continuous American Sign Language
    16. Sewer-ML: A Multi-Label Sewer Defect Classification Dataset and Benchmark
    17. The Multi-Temporal Urban Development SpaceNet Dataset
    18. GeoSim: Realistic Video Simulation via Geometry-Aware Composition for Self-Driving
  4. New datasets @CVPR2021
    19. Dictionary-Guided Scene Text Recognition
    20. Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-Localization in Large Scenes From Body-Mounted Sensors
    21. Transformation Driven Visual Reasoning
    22. Natural Adversarial Examples
    23. TextOCR: Towards Large-Scale End-to-End Reasoning for Arbitrary-Shaped Scene Text
    24. Enriching ImageNet With Human Similarity Judgments and Psychological Embeddings
    25. Semantic Image Matting
    26. DoDNet: Learning To Segment Multi-Organ and Tumors From Multiple Partially Labeled Datasets
    27. Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers
    28. Learning Goals From Failure
    29. Learning To Count Everything
    30. Variational Relational Point Completion Network
    31. TrafficSim: Learning To Simulate Realistic Multi-Agent Behaviors
    32. OpenRooms: An Open Framework for Photorealistic Indoor Scene Datasets
    33. ArtEmis: Affective Language for Visual Art
    34. DexYCB: A Benchmark for Capturing Hand Grasping of Objects
    35. SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning Over Traffic Events
  5. New datasets @CVPR2021
    36. Cross-View Cross-Scene Multi-View Crowd Counting
    37. Depth-Aware Mirror Segmentation
    38. AGORA: Avatars in Geography Optimized for Regression Analysis
    39. Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark
    40. Mirror3D: Depth Refinement for Mirror Surfaces
    41. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
    42. Learning Multi-Scale Photo Exposure Correction
    43. Unsupervised Pre-Training for Person Re-Identification
    44. Home Action Genome: Cooperative Compositional Action Understanding
    45. AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
    46. Nutrition5k: Towards Automatic Nutritional Understanding of Generic Food
    47. Person30K: A Dual-Meta Generalization Network for Person Re-Identification
  7. Trajectory Prediction / Person ReID

  8. • This dataset is a large collection of videos containing intentional and unintentional action.
    • Videos in this dataset are annotated with the moment at which the action becomes unintentional.
    • Tasks: trajectory prediction
  9. • This dataset consists of pedestrian and bicyclist trajectories.
    • http://www.europvi.mpi-inf.mpg.de/
    • Tasks: trajectory prediction
  10. • This dataset is generated by a multi-agent behavior model for realistic traffic simulation.
    • Tasks: trajectory prediction
  11. • This dataset features:
      • very large scale, with 1.38 million images of 30K identities;
      • a large capture system of 6,497 cameras deployed at 89 different sites;
      • abundant sample diversity, including varied backgrounds and diverse person poses.
    • Tasks: Person ReID
  12. • This dataset consists of 4M person images of over 200K identities extracted from 46K YouTube videos, which is 30× larger than the largest existing Re-ID dataset, MSMT.
    • The collected videos cover a wide range of capturing environments (e.g., fixed or moving cameras, dynamic scenes, different resolutions), yielding the data diversity that is essential for learning generic representations.
    • Tasks: Person ReID
  13. 3D / Point Clouds

  14. • This dataset is a large indoor dataset with 71,474 panoramas from 1,524 real unfurnished homes.
    • It provides annotations of:
      • 3D room layouts;
      • 2D and 3D floor plans;
      • panorama location in the floor plan;
      • locations of windows and doors.
    • https://github.com/zillow/zind
    • Tasks: layout estimation, multi-view registration
  15. • This dataset is the first large-scale 3D caricature dataset; it contains 2,000 high-quality, diverse 3D caricatures manually crafted by professional artists.
    • https://qiuyuda.github.io/3DCaricShop/
    • Tasks: 3D caricature reconstruction from a 2D caricature
  16. • This dataset is a 3D mirror plane dataset based on three RGBD datasets (Matterport3D, NYUv2 and ScanNet), containing 7,011 mirror instance masks and 3D planes.
    • Motivation: mirror surfaces are a significant source of errors.
    • https://3dlg-hcvc.github.io/mirror3d/
    • Tasks:
  17. • This dataset is a large and rich urban-scale dataset including two accurately labelled regions covering 4.4 km² and an extra unlabelled region covering 3.2 km².
    • In the dataset, each 3D point is labeled as one of 13 semantic classes.
    • https://github.com/QingyongHu/SensatUrban
    • https://www.youtube.com/watch?v=IG0tTdqB3L8
    • Tasks: (semi-)supervised 3D point cloud segmentation
  18. • This dataset is a synthetic video dataset with frame-by-frame mesh annotations, which extends SAIL-VOS.
    • http://sailvos.web.illinois.edu/
    • Tasks: 3D reconstruction from video data
  19. • This large-scale dataset can promote the study of depth map super-resolution and of other depth-related real-world tasks.
    • Tasks: depth map super-resolution
  20. • This dataset is known as SpaceNet 7, which is both a dataset and a NeurIPS 2020 competition.
    • This dataset consists of 101 labelled sequences of satellite imagery collected by Planet Labs’ Dove constellation between 2017 and 2020.
    • https://registry.opendata.aws/spacenet/
    • Tasks: object tracking, segmentation, change detection
  21. • This dataset contains object-centric short videos with pose annotations for nine categories and includes 4 million annotated images in 14,819 annotated videos.
    • https://github.com/google-research-datasets/Objectron
    • Tasks: 3D object detection, 3D object tracking
  22. • This dataset is created with the Human POSEitioning System (HPS), a method that uses wearable sensors to recover the full 3D pose of a human registered with a 3D scan of the surrounding environment.
    • http://virtualhumans.mpi-inf.mpg.de/hps/
    • Tasks: scene modeling
  23. • This dataset contains over 100K HDR images with ground-truth depths, normals, spatially-varying BRDF and light sources, along with per-pixel spatially-varying lighting and visibility masks for every light source.
    • https://ucsd-openrooms.github.io/
    • Tasks: inverse rendering, depth estimation, etc.
  24. • This dataset contains over 100,000 high-quality scans, with partial 3D shapes rendered from 26 uniformly distributed camera poses for each 3D CAD model.
    • https://paul007pl.github.io/projects/VRCNet
    • Tasks: shape completion
  25. • This dataset consists of 582K RGB-D frames over 1,000 sequences of 10 subjects grasping 20 different objects from 8 views.
    • https://dex-ycb.github.io/
    • Tasks: key point detection, pose estimation, etc.
  26. • Benchmark datasets for 3D human pose estimation from images are limited by clothing complexity, environmental conditions, number of subjects, and occlusion. The authors constructed AGORA, a synthetic dataset with high-accuracy ground truth. Using 4,240 commercially available human scans, they fit the SMPL-X body model to the 3D scans to create a reference pose and body.
    • https://agora.is.tue.mpg.de/
    • Tasks: pose estimation
  27. • This dataset is an RGB-D mirror segmentation dataset of 3,049 exemplars.
    • https://mhaiyang.github.io/CVPR2021_PDNet/index
    • Tasks: RGB-D mirror segmentation
  28. Text Recognition

  29. • This dataset consists of 4,024 text images, including scene text and design text with various artistic effects.
    • This dataset has six types of annotations for each image:
      • word- and character-wise quadrilateral bounding polygons;
      • word- and character-wise pixel-level masks;
      • word- and character-wise transcriptions.
    • Tasks: text segmentation
  30. • This dataset is a challenging scene text dataset for Vietnamese, where some characters are visually ambiguous because of accent symbols.
    • Tasks: text detection, text recognition
  31. • This dataset is an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset.
    • https://textvqa.org/textocr
    • Tasks: text detection, text recognition
  32. Captioning / VQA

  33. • This dataset takes the form of video QA based on 10,080 collected in-the-wild videos and 62,535 annotated QA pairs, for benchmarking the cognitive capability of causal inference and event understanding models in complex traffic scenarios.
    • https://github.com/SUTDCV/SUTD-TrafficQA
    • Tasks: VQA
  34. • This dataset is a novel large-scale dataset, with accompanying machine learning models, aimed at providing a detailed understanding of the interplay between visual content, its emotional effect, and explanations for the latter in language.
    • Tasks: VQA
  35. • This dataset provides human-generated captions that distinguish similar pairs of garment images, together with side information consisting of real-world product descriptions and derived visual attribute labels for these images.
    • Tasks: relative captioning
  36. • This dataset is based on CLEVR, because it is better to first study TVR in a simple setting and then move to more complex real scenarios, just as VQA was first studied on CLEVR and then generalized to more complicated settings like GQA.
    • https://hongxin2019.github.io/TVR
    • Tasks: visual question answering (VQA)
  37. • This dataset is a new benchmark for compositional spatio-temporal reasoning.
    • This dataset contains 192M unbalanced question-answer pairs for 9.6K videos.
    • Tasks: VideoQA
  38. • Because the conventional Conceptual Captions 3M (CC3M) is designed for the captioning task, it only collects data that are valid for captioning. The authors therefore propose Conceptual 12M (CC12M), a larger dataset with relaxed constraints.
    • Tasks: VQA, image captioning
  39. Gestures / Emotion Recognition

  40. • This dataset focuses on nonverbal body gestures without using any identity information.
    • The proposed dataset offers an approach in which identity-free micro-gestures (MGs) are explored for hidden emotion understanding while the privacy of the individuals is preserved.
    • Tasks: micro-gesture recognition
  41. • This dataset is a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth.
    • Tasks: synthesizing sign language videos
  42. • This dataset is a multi-view action dataset with multiple modalities and viewpoints, supplemented with hierarchical activity and atomic action labels together with dense scene composition labels.
    • http://www.homeactiongenome.org/
    • Tasks: motion recognition
  43. Classification / Regression

  44. • This dataset comprises 14K images covering a wide range of everyday scenes.
    • These images are manually annotated with 28 intent categories derived from a social psychology taxonomy.
    • https://github.com/kmnp/intentonomy
    • Tasks: object/context localization, classification
  45. • This dataset consists of 5k diverse, real-world food dishes with corresponding video streams, depth images, component weights, and high-accuracy nutritional content annotation.
    • https://github.com/google-research-datasets/Nutrition5k
    • Tasks: regression
  46. • This dataset is a large, novel and publicly available multi-label classification dataset for image-based sewer defect classification.
    • This dataset consists of 1.3 million images annotated by professional sewer inspectors from three different utility companies across nine years.
    • http://vap.aau.dk/sewer-ml
    • Tasks: classification
  47. Segmentation / Image Matting

  48. • This dataset is generated from a geometry-aware image composition process, which synthesizes novel urban driving scenarios by augmenting existing images with dynamic objects extracted from other scenes and rendered at novel poses.
    • https://tmux.top/publication/geosim/
    • Tasks: segmentation, (data augmentation)
  49. • This dataset extends the scene parsing task from images to videos. It features:
      • well-trimmed long-temporal clips;
      • dense annotation;
      • high resolution.
    • https://www.vspwdataset.com/
    • Tasks: semantic segmentation
  50. • This dataset is a large-scale partially labeled dataset.
    • This dataset is composed of seven partially labeled sub-datasets, involving seven organ and tumor segmentation tasks.
    • https://git.io/DoDNet
    • Tasks: semantic segmentation
  51. • This dataset is a large-scale Semantic Image Matting Dataset with careful consideration of data balancing across different semantic classes.
    • https://github.com/nowsyn/
    • Tasks: semantic image matting
  52. ImageNet Modification

  53. • Two challenging datasets that reliably cause machine learning model performance to degrade substantially.
    • IMAGENET-A is like the ImageNet test set, but it is far more challenging for existing models.
    • IMAGENET-O is the first out-of-distribution detection dataset created for ImageNet models.
    • https://github.com/hendrycks/natural-adv-examples
    • Tasks: classification
  54. • This dataset is composed of a large set of human similarity judgments that supplements the existing ILSVRC validation set.
    • https://osf.io/cn2s3/
    • Tasks: information retrieval
  55. Video Dehazing

  56. • This dataset can be used for the supervised learning of video dehazing algorithms.
    • This dataset was collected by a well-designed Consecutive Frames Acquisition System (CFAS).
    • Tasks: video dehazing
  57. Face Recognition

  58. • This dataset is a large in-the-wild high-resolution audio-visual dataset.
    • The dataset is collected from YouTube and consists of about 16 hours of 720P or 1080P videos.
    • Tasks: talking face generation
  59. Counting

  60. • This dataset consists of 147 object categories containing over 6000 images that are suitable for the few-shot counting task.
    • https://github.com/cvlab-stonybrook/LearningToCountEverything
    • Tasks: object counting
  61. • This dataset is a large synthetic multi-camera crowd counting dataset with many scenes and camera views to capture many possible variations, which avoids the difficulty of collecting and annotating such a large real dataset.
    • Tasks: crowd counting
  62. Photo Retouching / Exposure Correction

  63. • This dataset contains 1,681 groups and 11,161 high-quality raw portrait photos in total.
    • It satisfies the following requirements:
      • the photos should be in raw format and of high quality;
      • the dataset should be large-scale and cover a wide range of real cases.
    • https://github.com/csjliang/PPR10K
    • Tasks: portrait photo retouching, semantic segmentation
  64. • This dataset consists of over 24,000 images exhibiting the broadest range of exposure values to date, each with a corresponding properly exposed image.
    • https://github.com/mahmoudnafifi/Exposure_Correction
    • Tasks: photo exposure correction
  65. Conclusion: The importance of new datasets
    For Engineers
    • New datasets are useful for practicing a variety of tasks.
    • You can tackle a problem quickly by knowing the tasks that are similar to yours.
    For Researchers
    • New datasets are useful for designing task-driven research.
  66. References
    • All photos are taken from the corresponding original papers.