The combination of big data and deep learning has fundamentally changed how we approach data science; through machine learning models and application-specific code, computers can "understand" unstructured data (think images and audio). To leverage these capabilities, most applications require a continuous pipeline of one or more trained models interlinked with data processing functions. The Towhee open-source project aims to give users a tool for creating these end-to-end pipelines through a simple API. It does so by providing generic data processing scripts, pre-trained ML models, training tools, and a Pythonic toolkit that stitches all the parts into a pipeline on a local machine. Building these pipelines locally was hard enough; scaling them is harder still and has remained a key challenge for us from day one.
This talk will discuss how integrating a Ray engine into Towhee has helped our users confidently scale their unstructured data processing pipelines and other ML applications. With Ray as a backend, our users can easily schedule compute-heavy Towhee operations as Ray Actors in Ray Core. First, we will discuss some of the challenges in building scalable machine learning pipelines. Second, we will elaborate on the gap between pipeline development and deployment, and explain why Ray is the natural choice for us and our users. Finally, we will give a live demo of a real-world data processing application scaled using Ray.