Software 2.0 Needs Data 2.0: A New Way of Storing and Managing Data for Efficient Deep Learning (Davit Buniatyan, Activeloop)

Presented by Davit Buniatyan
Software 2.0 needs Data 2.0: A new way of storing and managing data for efficient deep learning
Web: activeloop.ai | Twitter: @activeloopai | GitHub: activeloopai/Hub | Community: slack.activeloop.ai

Recreating a mouse brain is pretty hard [Davit Buniatyan & Nico Kemnitz, Seung Lab]

Legal: 80M patent documents to train embedding model
Precision Agriculture: 1.5 PB aerial imagery to provide insights to farmers
Same problem across all the applications of ML

Delivering valuable insights from unstructured data is hard. @BigDataBorat

The vast majority of popular platforms and tools are focused on 10% of the data generated.
Current solutions are not a good fit for managing unstructured data.
All data generated today: unstructured 90%, structured 10%

Software 2.0 needs Data 2.0
There is no industry standard for storing unstructured data

Data 2.0: A new standard for storing and streaming datasets
➔ Unstructured datasets stored and streamed as unified arrays on the cloud.
➔ Managed from any machine at scale.
➔ Accessible and seamlessly streamable to deep learning as if the data were local.
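
As a rough illustration of the "streamed as unified arrays" and "as if the data was local" ideas, the sketch below stores a one-dimensional array as fixed-size chunks in a key-value store and fetches only the chunks that a slice touches. It is not Hub's implementation: `ChunkedArray`, `chunk_size`, and the in-memory dict standing in for cloud object storage are all assumptions made for illustration.

```python
from typing import Dict, List


class ChunkedArray:
    """A 1-D array split into fixed-size chunks held in a key-value store.

    Hypothetical illustration only; the dict stands in for cloud object storage.
    """

    def __init__(self, values: List[int], chunk_size: int) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size
        self.length = len(values)
        self.fetched: List[int] = []  # ids of chunks requested so far
        self._store: Dict[int, List[int]] = {
            i // chunk_size: values[i:i + chunk_size]
            for i in range(0, len(values), chunk_size)
        }

    def _fetch(self, chunk_id: int) -> List[int]:
        # In a real system this would be one network read of one object.
        self.fetched.append(chunk_id)
        return self._store[chunk_id]

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, key: slice) -> List[int]:
        if not isinstance(key, slice):
            raise TypeError("this sketch supports slices only")
        start, stop, step = key.indices(self.length)
        if step != 1:
            raise ValueError("only contiguous slices are supported")
        if start >= stop:
            return []
        first = start // self.chunk_size
        last = (stop - 1) // self.chunk_size
        out: List[int] = []
        for chunk_id in range(first, last + 1):
            chunk = self._fetch(chunk_id)
            base = chunk_id * self.chunk_size
            lo = max(start, base) - base
            hi = min(stop, base + len(chunk)) - base
            out.extend(chunk[lo:hi])
        return out


arr = ChunkedArray(list(range(100)), chunk_size=10)
window = arr[25:42]  # touches chunks 2, 3 and 4 only
```

The caller writes an ordinary slice, while `arr.fetched` records that only three of the ten chunks were read. A real store would add multi-dimensional indexing, compression, and caching on top of this.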

How do we get Data 2.0? Unstructured data → tensors

Data 2.0 cuts ML cycle time and cost in half
Conventional ML cycle: Collect → Annotate → Unstructured Dataset Management, aka Data Wrangling (~50%) → Train → Deploy
With Data 2.0: Collect → Annotate → Train → Deploy
Time-to-insight advantage: allows many more cycles in the same time
Developers and scientists are stuck building their own ingestion infrastructure bit by bit

Large datasets accessed in under 2 minutes instead of days
Before (with in-house solutions), 41+ hr in total. Steps: understand the package API; unzip all the files; download TBs of data; finally load the data; access a particular slice of the data (on average). Step durations shown on the slide: 8 hours, 4 hours, 24 hours, 1 hour, 4 hours.
After (> pip install hub), 65 seconds in total. Steps: write 2 lines of code with our “hub” package; stream any slice of the data as if it were on your PC. Step durations shown on the slide: 1 min, 5 s.
Read: Extending Activeloop Hub capabilities to handle Waymo Open Dataset

These problems are salient across all verticals, but Data 2.0 solves them
Issue: Data locality. Details: slow and error-prone dataset sharing from one GPU box. Our solution: streamable. Impact: mistakes, efficiency loss.
Issue: Local folder structure dependency. Details: error-prone code dependency. Our solution: serverless. Impact: mistakes, time-to-value.
Issue: Pre-processing pipelines. Details: managing or version-controlling multiple scalable preprocessing pipelines. Our solution: version-controlled transformation. Impact: cost, time-to-value.

These problems are salient across all verticals, but Data 2.0 solves them (continued)
Issue: Synchronization. Details: multiple users can’t edit or version-control the data. Our solution: same dataset view across the team. Impact: efficiency loss, time-to-value.
Issue: Reading a small slice of data. Details: confusing as to which chunk to load; inefficient to load the whole file. Our solution: treat data as if it's a local array. Impact: efficiency loss.
Issue: RAM management. Details: NumPy arrays overrunning the local RAM/disk limit. Our solution: multi-layer cache. Impact: efficiency loss.
Issue: Visualization. Details: hard to visualize raw or preprocessed data. Our solution: schema-based dataset structure for rendering. Impact: mistakes, efficiency loss.
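
The "multi-layer cache" entry can be pictured with a small sketch: a fast, small tier in front of a slower, larger tier, in front of remote storage. This is an assumption about the general technique, not a description of Hub's internals; `LRUTier`, `MultiLayerCache`, the tier names, and the capacities are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Dict, Hashable, List, Optional


class LRUTier:
    """One cache layer with fixed capacity and least-recently-used eviction."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, bytes]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


class MultiLayerCache:
    """Look in each tier from fastest to slowest, then fall back to remote."""

    def __init__(self, tiers: List[LRUTier],
                 fetch_remote: Callable[[Hashable], bytes]) -> None:
        self.tiers = tiers
        self.fetch_remote = fetch_remote
        self.hits: Dict[str, int] = {tier.name: 0 for tier in tiers}
        self.remote_reads = 0

    def get(self, key: Hashable) -> bytes:
        for depth, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                self.hits[tier.name] += 1
                for faster in self.tiers[:depth]:
                    faster.put(key, value)  # promote into faster tiers
                return value
        value = self.fetch_remote(key)
        self.remote_reads += 1
        for tier in self.tiers:
            tier.put(key, value)
        return value


# A dict stands in for remote object storage holding ten chunks.
remote = {i: f"chunk-{i}".encode() for i in range(10)}
cache = MultiLayerCache(
    [LRUTier("memory", capacity=2), LRUTier("disk", capacity=4)],
    remote.__getitem__,
)
for chunk_id in [0, 1, 2, 0, 2]:
    cache.get(chunk_id)
```

Because each tier has a fixed capacity, the working set never outgrows local RAM or disk, and repeated reads of hot chunks avoid going back to remote storage.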

Instantly visualize at any step of the data pipeline

Less coding required from the user (before vs. after)
import hub
ds = hub.load("activeloop/mnist")

Faster in a remote setting and 2x cheaper vs TensorFlow DS + Ignite
Chart: TFDS + Ignite vs Activeloop, compared for Local, Remote, and From S3.
Activeloop's data transfer from S3 to the GPU is comparable to that of TensorFlow Datasets combined with the local in-memory database Ignite. In a remote setting, it is faster and doesn’t require a second compute instance.
Sources: https://blog.tensorflow.org/2019/02/tensorflow-on-apache-ignite.html https://docs.activeloop.ai/benchmarks.html

Impact amplified via distributed workflows powered by Ray
Anyscale Ray + Activeloop Hub
Ray lets you run the same local code on a large cluster as if the cluster were local.
Activeloop Hub lets you work with a dataset as if it were local.
One-line distributed training with very high (~85-90%) GPU utilization, with the ability to scale transform jobs linearly.
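
The "same local code, more workers" idea can be sketched with the standard library alone. The example below uses neither Ray's nor Hub's API: `concurrent.futures.ThreadPoolExecutor` is a stand-in for a cluster, and `normalize` and `run_transform` are hypothetical names. What it shows is only the structure: the per-sample transform is written once and does not change when the execution backend does.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def normalize(sample: Sequence[int]) -> List[float]:
    """Toy per-sample transform: scale 8-bit pixel values into [0, 1]."""
    return [value / 255 for value in sample]


def run_transform(
    fn: Callable[[Sequence[int]], List[float]],
    samples: Sequence[Sequence[int]],
    workers: Optional[int] = None,
) -> List[List[float]]:
    """Apply fn to every sample, serially or on a pool of workers."""
    if workers is None:
        return [fn(sample) for sample in samples]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(fn, samples))


samples = [[0, 51, 255], [102, 204, 153], [255, 255, 0]]
local_result = run_transform(normalize, samples)
parallel_result = run_transform(normalize, samples, workers=4)
```

Threads do not speed up CPU-bound Python code, so this stand-in demonstrates the programming model rather than the performance; a cluster scheduler would place the same function calls on separate machines.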

A cloud-native performant workflow that won’t break your bank
Before: 500 CPU hours, low GPU usage, cost $29 per run
After: 100 CPU hours, high (~85%) GPU usage, cost $6 per run
Note: A 3.3TB dataset was used in this benchmark.

A seamless flow of unstructured data to ML models, across industries
Agriculture (1.5 PB aerial imagery to provide insights to farmers): 3x faster inference, 50% cost savings, 30% less storage required
Legal (80M patent documents to train embedding model): 9x faster training, 75% cost savings, 80% less storage required
Economic Modelling: 50 data scientists working at the same time, 69% less time spent on data preprocessing

Activeloop’s open-source Hub for Data 2.0 has been growing fast
“Not dealing with datasets is fantastic for computer vision researchers.” Computer Vision Researcher, Berkeley

A recap for Data 2.0, enabled by Activeloop Hub
➔ It takes a lot of time to connect unstructured data to ML models.
➔ There are no widely adopted tools for storing & processing unstructured data.
➔ Data 2.0 structures all your unstructured datasets, in a simple and unified way.
> pip install hub
Switch to Data 2.0!

Structure the unstructured with Data 2.0
Join the Data 2.0 movement: slack.activeloop.ai
Web: activeloop.ai | Twitter: @activeloopai | GitHub: activeloopai/Hub
