Software 2.0 Needs Data 2.0: A New Way of Storing and Managing Data for Efficient Deep Learning (Davit Buniatyan, Activeloop)

90% of the data we generate every day is unstructured. However, current storage solutions (databases, data lakes, and data warehouses, or the "Data 1.0 minions") are unfit for storing unstructured data. As a result, data scientists today work with unstructured data the way developers worked in the pre-database era. This slows down ML cycles, bottlenecks access speed and data transfer, and forces data scientists to wrangle data instead of training models.

Creating Software 2.0 requires a new way of working with unstructured data, which we explore in this session. We present Data 2.0, a framework that brings all types of data together under one umbrella and represents them in a unified tensor form that is native to deep neural networks. Its streaming approach serves data for training and deploying machine learning models, in both compute-bottlenecked and data-bottlenecked operations, as if the data were local to the machine. It also supports version control and collaboration on petabyte-scale datasets, exposed as single NumPy-like arrays in the cloud or locally. Lastly, we use Ray to improve our workflows.
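To make the idea of streaming a dataset "as if the data were local" more concrete, here is a minimal sketch of an array-like object that is stored in fixed-size chunks and fetches only the chunks a slice touches, with a small in-memory cache in front. It is an illustration under stated assumptions, not Activeloop Hub's implementation or API: `RemoteChunkStore` and `StreamedArray` are hypothetical names, and the "cloud" storage is simulated with an in-memory dictionary.

```python
from collections import OrderedDict


class RemoteChunkStore:
    """Hypothetical stand-in for cloud object storage: one object per chunk."""

    def __init__(self, values, chunk_size):
        self.chunk_size = chunk_size
        self.length = len(values)
        self._objects = {
            start // chunk_size: list(values[start:start + chunk_size])
            for start in range(0, len(values), chunk_size)
        }
        self.fetches = 0  # counts simulated network round trips

    def fetch(self, chunk_id):
        self.fetches += 1
        return self._objects[chunk_id]


class StreamedArray:
    """Hypothetical array-like view that downloads only the chunks it needs."""

    def __init__(self, store, cache_chunks=2):
        self.store = store
        self.cache_chunks = cache_chunks
        self._cache = OrderedDict()  # chunk_id -> chunk, kept in LRU order

    def __len__(self):
        return self.store.length

    def _chunk(self, chunk_id):
        if chunk_id in self._cache:
            self._cache.move_to_end(chunk_id)  # mark as most recently used
            return self._cache[chunk_id]
        chunk = self.store.fetch(chunk_id)
        self._cache[chunk_id] = chunk
        if len(self._cache) > self.cache_chunks:
            self._cache.popitem(last=False)  # evict the least recently used
        return chunk

    def __getitem__(self, key):
        if isinstance(key, int):
            if key < 0:
                key += len(self)
            if not 0 <= key < len(self):
                raise IndexError("index out of range")
            size = self.store.chunk_size
            return self._chunk(key // size)[key % size]
        start, stop, step = key.indices(len(self))
        return [self[i] for i in range(start, stop, step)]


store = RemoteChunkStore(range(1000), chunk_size=100)
data = StreamedArray(store)

window = data[250:255]  # touches only chunk 2, so one fetch
again = data[260:270]   # same chunk, served from the cache
print(window, store.fetches)
```

The caller writes ordinary slicing code and never decides which chunk to load; the single cache layer here is only a placeholder for whatever caching a real system would use.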



July 21, 2021


  1. Software 2.0 needs Data 2.0: A new way of storing and managing data for efficient deep learning. Presented by Davit Buniatyan. Web: activeloop.ai, Twitter: @activeloopai, GitHub: activeloopai/Hub, Community: slack.activeloop.ai
  2. Recreating a mouse brain is pretty hard [Davit Buniatyan & Nico Kemnitz, Seung Lab]
  3. Same problem across all the applications of ML
    ➔ Legal: 80M patent documents to train an embedding model
    ➔ Precision Agriculture: 1.5 PB of aerial imagery to provide insights to farmers
  4. Delivering valuable insights from unstructured data is hard. @BigDataBorat

  5. The vast majority of popular platforms and tools are focused on 10% of the data generated. Current solutions are not a good fit for managing unstructured data. All data generated today: 90% unstructured, 10% structured.
  6. Software 2.0 needs Data 2.0: There is no industry standard for storing unstructured data
  7. Data 2.0: A new standard for storing and streaming datasets
    ➔ Unstructured datasets stored and streamed as unified arrays on the cloud.
    ➔ Managed from any machine at scale.
    ➔ Accessible and seamlessly streamable to deep learning as if the data were local.
  8. How do we get Data 2.0? Unstructured data → tensors
  9. Data 2.0 cuts ML cycle time and cost in half
    ➔ Conventional ML cycle: Collect, Annotate, Unstructured Dataset Management aka Data Wrangling (~50%), Train, Deploy. Developers and scientists are stuck building their own ingestion infrastructure bit by bit.
    ➔ With Data 2.0: Collect, Annotate, Train, Deploy.
    ➔ Time-to-insight advantage: allows many more cycles in the same time.
  10. Large datasets accessed in under 2 minutes instead of days: 65 seconds, versus 41+ hr with in-house solutions
    ➔ Before (41+ hr): download TBs of data (24 hours); unzip all the files (4 hours); understand the package API (8 hours); finally load the data (1 hour); access a particular slice of the data (4 hours on average).
    ➔ After (1 min 5 s): write 2 lines of code with our “hub” package (> pip install hub); stream any slice of the data as if it were on your PC.
    ➔ Read: Extending Activeloop Hub capabilities to handle Waymo Open Dataset
  11. These problems are salient across all verticals, but Data 2.0 solves them
    ➔ Issue: Data locality. Details: Slow and error-prone dataset sharing from one GPU box. Our solution: Streamable. Impact: Mistakes, efficiency loss.
    ➔ Issue and details: Error-prone code dependency; local folder structure dependency. Our solution: Serverless. Impact: Mistakes, time-to-value.
    ➔ Issue: Pre-processing pipelines. Details: Managing or version-controlling multiple scalable preprocessing pipelines. Our solution: Version-controlled transformation. Impact: Cost, time-to-value.
  12. These problems are salient across all verticals, but Data 2.0 solves them
    ➔ Issue: Synchronization. Details: Multiple users can’t edit or version-control the data. Our solution: Same dataset view across the team. Impact: Efficiency loss, time-to-value.
    ➔ Issue: Reading a small slice of data. Details: Confusing as to which chunk to load; inefficient to load the whole file. Our solution: Treat data as if it's a local array. Impact: Efficiency loss.
    ➔ Issue: RAM management. Details: NumPy arrays overrunning the local RAM/disk limit. Our solution: Multi-layer cache. Impact: Efficiency loss.
    ➔ Issue: Visualization. Details: Hard to visualize raw or preprocessed data. Our solution: Schema-based dataset structure for rendering. Impact: Mistakes, efficiency loss.
  13. Instantly visualize at any step of the data pipeline

  14. Less coding required from the user (before/after comparison):

        import hub
        ds = hub.load("activeloop/mnist")
  15. Faster in remote setting and 2x cheaper vs TensorFlow Datasets + Ignite
    ➔ Activeloop achieves data transfer from S3 to the GPU comparable to TensorFlow Datasets (TFDS) paired with the local in-memory database Ignite. In a remote setting, it is faster and doesn’t require a second compute instance.
    ➔ Chart: TFDS + Ignite vs. Activeloop, in local and remote settings, reading from S3.
    ➔ Sources: https://blog.tensorflow.org/2019/02/tensorflow-on-apache-ignite.html, https://docs.activeloop.ai/benchmarks.html
  16. Impact amplified via distributed workflows powered by Ray
    ➔ Anyscale Ray lets you run the same local code on a large cluster as if the cluster were local.
    ➔ Activeloop Hub lets you work with a dataset as if it were local.
    ➔ Together: one-line distributed training with very high (~85-90%) GPU utilization, with the ability to linearly scale transform jobs.
  17. A cloud-native performant workflow that won’t break your bank
    ➔ Before: 500 CPU hours, low GPU usage, cost $29 per run.
    ➔ After: 100 CPU hours, high (~85%) GPU usage, cost $6 per run.
    ➔ Note: A 3.3TB dataset was used in this benchmark.
  18. A seamless flow of unstructured data to ML models, across industries
    ➔ Agriculture: 1.5 PB of aerial imagery to provide insights to farmers. 3x faster inference, 50% cost savings, 30% less storage required.
    ➔ Legal: 80M patent documents to train an embedding model. 9x faster training, 75% cost savings, 80% less storage required.
    ➔ Economic Modelling: 50 data scientists working at the same time. 69% less time spent on data preprocessing.
  19. Activeloop’s open-source Hub for Data 2.0 has been growing fast. “Not dealing with datasets is fantastic for computer vision researchers.” (Computer Vision Researcher, Berkeley)
  20. A recap for Data 2.0, enabled by Activeloop Hub
    ➔ It takes a lot of time to connect unstructured data to ML models.
    ➔ There are no widely adopted tools for storing & processing unstructured data.
    ➔ Data 2.0 structures all your unstructured datasets, in a simple and unified way.
    Switch to Data 2.0! > pip install hub
  21. Structure the unstructured with Data 2.0. Join the Data 2.0 movement: slack.activeloop.ai, enterprise@activeloop.ai, Web: activeloop.ai, Twitter: @activeloopai, GitHub: activeloopai/Hub
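
The before/after figures quoted on slides 10 and 17 can be turned into ratios with a few lines of arithmetic. The short sketch below uses only numbers that appear in the transcript (the five "before" step durations and "1 min 5 s" on slide 10; CPU hours and cost per run on slide 17). The variable names are ours, and the ratios are derived here rather than stated on the slides.

```python
# Slide 10: the five "before" step durations, in hours, and "1 min 5 s" after.
before_hours = 8 + 4 + 24 + 1 + 4
after_seconds = 60 + 5

# How many times faster the first slice of data becomes available.
access_speedup = before_hours * 3600 / after_seconds

# Slide 17: CPU hours and cost (USD) per run, before and after.
cpu_hours_before, cpu_hours_after = 500, 100
cost_before, cost_after = 29, 6

cpu_reduction = cpu_hours_before / cpu_hours_after
cost_reduction = cost_before / cost_after
cost_saving = 1 - cost_after / cost_before

print(f"Time to first slice: {before_hours} h -> {after_seconds} s "
      f"(~{access_speedup:.0f}x faster)")
print(f"CPU hours per run: {cpu_reduction:.0f}x fewer")
print(f"Cost per run: {cost_reduction:.1f}x cheaper ({cost_saving:.0%} saved)")
```

The step durations sum to the 41 hours quoted on slide 10, and the slide 17 figures work out to 5 times fewer CPU hours and roughly 4.8 times lower cost per run on the 3.3TB benchmark dataset.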