
Slide 2

ML Applications at LINE (as of September 2020)
› Classification: User Demographics, CTR Prediction, Ads Category, Manga Rating Prediction, Sticker Tag Prediction, etc.
› Recommendation: Sticker, Manga, Theme, Music, Official Account, Point, News, etc.
› Representation Learning: User Vector, News Text Vector, Sticker Image Vector, etc.

Slide 3

Scale (as of September 2020)
› Global Users: 860M+
› Sticker Package Items: 10M+
› Scheduled Jobs: 100+

Slide 4

Session Overview
› Part 1: ghee - Distributed Data Processor
› Part 2: ghee-models - Model Management
› Part 3: cumin - Data Management

Slide 6

Agenda
Part 1: ghee
› Motivation
› Introducing ghee
› ghee implementation
Part 2: ghee-models
› Introducing ghee-models
› Graph convolutional networks
Part 3: cumin
› Motivation
› Introducing cumin
› Example

Slide 7

Motivation: Current Workflow
› General-purpose Hadoop cluster: HDFS, Spark Executors, 1600+ CPU nodes
› Machine Learning Kubernetes cluster: Ceph/NFS, GPU pods, 20+ GPU nodes, 40+ CPU nodes
1. Dataset preparation on the Hadoop cluster
2. Copy dataset to the ML cluster's storage
3. Model development on GPU pods
4. Copy prediction output back to HDFS

Slide 12

Motivation
Main issue: data skew and duplication
(Dataset: Hadoop cluster HDFS → GPU pods on the Kubernetes cluster)

Slide 13

Motivation: Other Issues
› Resource management: manually acquire and release
› GPU utilization: CPU/IO overhead
(Dataset: Hadoop cluster HDFS → CephFS → GPU pod on the Kubernetes cluster)

Slide 14

Motivation: Wishlist
› Fast and efficient data transfer: stream instead of copy
› Ease of use: hide infrastructure details
› Distributed training support: uniformly send training data

Slide 15

Motivation: Existing Solutions
› Spark: strengths are ease of use and error handling; weakness is data transfer speed across executors
› Dask: strengths are data transfer speed and flexibility (DAG); weakness is the learning curve

Slide 17

Introducing ghee: Overview
› A Python library for running ML applications by specifying
  › Input and output specification
  › Execution environment (CPU/memory/Docker image)
  › User-defined functions (preprocess, train/predict, postprocess)
› Encapsulates data transfer and Kubernetes pod management
› Allows ML engineers to focus on model development
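The preprocess/train/postprocess structure that ghee wires together can be illustrated with plain callables. This is a hypothetical sketch: the function signatures and the `run_local` driver below are assumptions for illustration, not ghee's actual API.

```python
# Hypothetical sketch of the preprocess -> train -> postprocess structure
# ghee-style tasks wire together. All names and signatures are assumed.

def preprocess(record):
    """CPU-pod stage: turn a raw input record into a feature row."""
    user_id, clicks = record
    return {"user_id": user_id, "click_count": len(clicks)}

def train(rows):
    """GPU-pod stage (stand-in): fit a trivial 'model' (a mean)."""
    counts = [r["click_count"] for r in rows]
    return {"mean_clicks": sum(counts) / len(counts)}

def postprocess(model):
    """CPU-pod stage: format the result for output storage."""
    return f"mean_clicks={model['mean_clicks']:.2f}"

def run_local(records):
    """Local driver simulating the streamed pipeline on one machine."""
    rows = [preprocess(r) for r in records]  # streamed, not copied
    model = train(rows)
    return postprocess(model)

print(run_local([("u1", ["a", "b"]), ("u2", ["c"])]))  # mean of 2 and 1
```

In ghee the same three functions would run on separate pods, with the library handling the data movement between them.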

Slide 23

Introducing ghee: ghee.task.KubeMPITask
1. Create Kubernetes Jobs
  › Docker image
  › Resources (CPU, GPU, memory)
  › Run user-defined functions
2. Stream input data
  › Supports HDFS, S3, Kafka
  › Random shuffle, round-robin distribution
3. Process data
  › Preprocess > Train
  › Preprocess > Predict > Postprocess
4. Send data to the next stage, and finally to output storage
(Execution on the Kubernetes cluster: Input Storage → Preprocess on CPU pods → Train/Predict on GPU pods → Postprocess on CPU pods → Output Storage)
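The round-robin distribution used when streaming input data can be sketched in a few lines; `distribute` below is a hypothetical helper for illustration, not part of ghee:

```python
import itertools

def distribute(records, n_workers):
    """Deal records round-robin across n_workers buckets, as a ghee-style
    input streamer might do. Hypothetical helper for illustration."""
    buckets = [[] for _ in range(n_workers)]
    for bucket, record in zip(itertools.cycle(buckets), records):
        bucket.append(record)
    return buckets

buckets = distribute(range(10), 3)
print([len(b) for b in buckets])  # [4, 3, 3]
```

Each worker receives a near-equal share regardless of record order, which is what makes the downstream training load uniform.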

Slide 24

Introducing ghee
Example: Model Training on a Local Machine with PyTorch

Slide 25

Example: Model Training (ghee sample code)
Task Client → Kubernetes cluster: preprocess on a CPU pod, train on a GPU pod, reading input from HDFS

Slide 26

Example: Model Inference (ghee sample code)
Task Client → Kubernetes cluster: preprocess on a CPU pod, predict on a GPU pod, postprocess on a CPU pod, reading from HDFS and writing to S3

Slide 28

Implementation: Data Transfer
› ZeroMQ
  › Fast and stable
  › asyncio with the aiozmq library
› Transfer Manager
  › Manages push/pull socket lifecycles
› MPI
  › State synchronization
  › Distributed training (e.g. Horovod)
(CPU pod processes push data; GPU pod processes, launched with mpirun, pull it)
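The push/pull pattern described here can be approximated with a stdlib asyncio sketch. The real implementation uses ZeroMQ sockets via aiozmq; the bounded queue below is only a stand-in for a PUSH/PULL socket pair, and the names are assumptions:

```python
import asyncio

async def producer(queue, items):
    # CPU-side "push": stream records into the transfer channel
    for item in items:
        await queue.put(item)
    await queue.put(None)  # sentinel: no more data

async def worker(name, queue, results):
    # GPU-side "pull": each puller takes the next available record,
    # which load-balances across pullers (the ZeroMQ PUSH/PULL semantics)
    while True:
        item = await queue.get()
        if item is None:
            await queue.put(None)  # let the other workers see the sentinel
            return
        results.append((name, item))

async def main():
    queue = asyncio.Queue(maxsize=4)  # bounded: stream, don't buffer everything
    results = []
    await asyncio.gather(
        producer(queue, range(8)),
        worker("gpu-0", queue, results),
        worker("gpu-1", queue, results),
    )
    return results

results = asyncio.run(main())
print(sorted(item for _, item in results))  # every record delivered exactly once
```

The bounded queue is the key design point: it streams data through instead of materializing a full copy on the GPU side.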

Slide 29

Implementation: Remote Procedure Call
› Ship code with cloudpickle and execute it on Kubernetes pods
› Developed another proprietary library (named swimmy) for
  › Pod lifecycle management
  › Health checks
› Run MPI on swimmy for distributed preprocessing and training
› Example: 3 pods in one k8s Job, each running a swimmy agent and mpirun, with code shipped to each pod via cloudpickle
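The ship-code-and-execute idea can be approximated with the stdlib pickler. Note that plain pickle serializes a module-level function by reference (its name), while cloudpickle serializes it by value, so the pod does not need the driver's code installed; `remote_exec` below is a hypothetical stand-in for the pod-side agent, not swimmy's API:

```python
import pickle

def normalize(values):
    """Driver-defined function to be shipped to a remote pod."""
    total = sum(values)
    return [v / total for v in values]

def remote_exec(payload):
    """Hypothetical pod-side agent: deserialize the function and run it.
    In ghee this would happen inside a Kubernetes pod via swimmy."""
    fn, args = pickle.loads(payload)
    return fn(*args)

# Driver side: serialize the function together with its arguments
payload = pickle.dumps((normalize, ([1, 3],)))
print(remote_exec(payload))  # [0.25, 0.75]
```

cloudpickle's by-value serialization is what lets ghee ship closures and lambdas defined interactively, e.g. in a notebook.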

Slide 30

Implementation: Simple Comparison to Dask
› Setup: a recommendation task
  › Data: MovieLens-20M
  › Model: Factorization Machine
  › Preprocess on 16 CPUs, train on 1 GPU
  › Simulate preprocessing cost with a parameter n
› Lines of code
  › ghee: 115 lines
  › Dask: 236 lines
(Chart: time in seconds for ghee vs. Dask at increasing n; smaller is better)

Slide 33

ghee-models: Goal
› Provide a collection of ML models based on ghee
› Provide a standard way to manage the ML lifecycle using MLflow
(Stack: ML applications on top of ghee-models, which builds on ghee and MLflow)

Slide 34

ghee-models: Benefits
› Reusable
› Reproducible
› Extensible

Slide 35

ghee-models: ghee Program Execution
ghee.task.KubeMPITask on the Kubernetes cluster: Input Storage → Preprocess on CPU pods → Train/Predict on GPU pods → Postprocess on CPU pods → Output Storage

Slide 36

ghee-models: ghee-models Program Execution
A Driver loads a Config (Config loader), a Data checker on a CPU pod validates Input Storage, and ghee.task.KubeMPITask then runs Preprocess on CPU pods → Train/Predict on GPU pods → Postprocess on CPU pods to Output Storage, with runs tracked in MLflow

Slide 37

ghee-models: Example Code (abstract model)

Slide 38

ghee-models: Example Code (image classification)
(Code callouts: GPU and CPU stages; called by the train batch loop)

Slide 39

ghee-models: Example Code (image classification)
(Code callouts: CPU and GPU stages; reads the trained model from MLflow; batch)

Slide 40

ghee-models: Example Code (image classification): Data Format

Slide 41

ghee-models: Example Code (image classification): Config

Slide 42

ghee-models: Models
natural language processing
› BERT
recommender
› Graph Attention Networks (GAT)
› GraphSAGE
› Multi-Interest Graph Convolutional Networks
› RankNet
› Neural Collaborative Filtering (NCF)
classification
› DNN for multi-class
› DNN for multi-label
image processing
› EfficientNet B0-B7

Slide 44

Graph Convolutional Networks: Goal
Provide user/item dense vectors using graph convolutional networks, which can be used in a wide range of machine learning tasks.
(Example graph: a user (male, 23) clicks item1, buys item2, dislikes item3; items have tags tag1, tag2, tag3)

Slide 45

Use Case: user/item dense vectors
Dense vectors such as [0.3, -0.1, 0.9] feed downstream components: a Recommender, segments prediction (e.g. Age: 32, Income: 500M, Interests: anime, movie), and a lookalike engine that finds nearby vectors such as [0.26, -0.12, 0.87] and [0.25, 0.01, 0.7]

Slide 46

Use Case: scale (as of September 2020)
› Services: 15+
› Graph nodes: 1.5B+
› Users: 860M+

Slide 47

Use Case: model architecture
› Graph convolutional layers (GCN0, GCN1) followed by deep layers produce user vectors and item vectors
› A loss is applied to each; BPR, Focal, and ArcFace losses are also provided
› Incremental learning
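For reference, the standard graph convolutional layer (Kipf and Welling) that such GCN blocks typically build on propagates node features as follows; this is the textbook formulation, assumed rather than confirmed by the slides:

```latex
H^{(l+1)} = \sigma\!\left( \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}\, H^{(l)} W^{(l)} \right)
```

where \(\tilde{A} = A + I\) is the adjacency matrix with self-loops, \(\tilde{D}\) its degree matrix, \(H^{(l)}\) the node features at layer \(l\), \(W^{(l)}\) a learned weight matrix, and \(\sigma\) a nonlinearity. Stacking two such layers (GCN0, GCN1) lets each user or item vector aggregate information from its two-hop neighborhood.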

Slide 48

Use Case: implementation with ghee
› preprocess function runs on CPU; train function runs on GPU

Slide 49

Use Case: deployment
› 27 models for 15+ services: News, Sticker, AD, Game, OA, Music, Manga, Delima, Shopping, Timeline, Theme, ...
› Incremental learning

Slide 50

Use Case: user segments prediction
An MLP on top of the dense vectors (News, Sticker, ...) predicts user segments.
Evaluation:
model      | parameters | f1 score
sparse DNN | 463M       | 0.6239
dense DNN  | 1M         | 0.6275

Slide 53

Motivation: Data Configuration
› Each recommendation task requires different source tables and columns
› Each ML engineer has a different way of defining datasets

Slide 54

Motivation: Routine Processes
› Routine data sanity checks, such as Hive partition checks, data counts, and error handling, are required when generating datasets
› These similar routines are implemented differently for each recommendation task

Slide 55

Motivation: Parameter Management
› ML engineers fine-tune not only model hyperparameters but also the filter conditions used when generating training datasets
› Model training parameters are managed by MLflow in experiments, but dataset generation is fragmented and not reproducible

Slide 57

Cumin: Table and Path Settings
› Provides a standard format to define datasets
› Handles routine processes such as Hive partition and data count checks
› Filter conditions are written explicitly
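A cumin-style dataset definition might look like the fragment below. This is a guess for illustration only: every key name here is an assumption, since cumin's actual schema is not shown in the transcript.

```yaml
# Hypothetical cumin-style dataset definition (all field names assumed)
dataset:
  name: sticker_recommend_train
  source:
    table: db.user_click_log        # Hive table to read
    partition: "dt=2020-09-01"      # partition to check before reading
  columns: [user_id, item_id, clicked_at]
  filters:
    - "clicked_at >= '2020-09-01'"  # explicit, reviewable filter condition
  checks:
    min_row_count: 1000000          # fail fast if the data looks wrong
```

Putting the filter conditions and checks in one declarative file is what makes them visible and comparable across experiments.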

Slide 58

Cumin: How to Use
› Performs a data sanity check according to the data definition when the Spark session starts
› The rest of the code is a typical Spark program

Slide 59

Cumin: Other Functions
› Auto-adjusts output file size
  › Takes the HDFS block size into account
  › Increases the number of files to increase parallelism in ghee
› Supports YAML format
  › Receives data definitions from upstream components

Slide 60

Implementation
Hooks around the Spark lifecycle: Spark start → validate data → data processing (read cumin) → post process → Spark stop
› Adds hooks to standard Spark functions
› Checks filter conditions and data size when Spark starts
› Estimates the file size of a DataFrame and repartitions it to fit the HDFS block size when writing output
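The repartitioning step reduces to simple arithmetic: estimate the output size and divide by the HDFS block size. A sketch, where the 128 MB block size and the helper name are assumptions:

```python
import math

HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # assumed default HDFS block size (128 MB)

def target_partitions(estimated_bytes, block_size=HDFS_BLOCK_SIZE):
    """Number of output files so each is about one HDFS block.
    Hypothetical helper mirroring what cumin is described as doing
    before repartitioning a DataFrame on write."""
    return max(1, math.ceil(estimated_bytes / block_size))

# e.g. an estimated 1 GiB DataFrame -> 8 files of ~128 MB each
print(target_partitions(1024 * 1024 * 1024))  # 8
```

Sizing files to the block size avoids both tiny-file overhead on HDFS and oversized files that limit read parallelism in ghee.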

Slide 62

Example: cumin with ghee-models
› Manage experiments, from dataset generation to model training, in a Jupyter Notebook
› Write filter conditions and partitions explicitly

Slide 63

Example: cumin with ghee-models
› Use the dataset generated by cumin
› Hyperparameters are recorded in MLflow
Reproducible experiments from dataset generation to model training

Slide 64

Session Summary

Slide 65

Session Summary
› ghee
  › Performs distributed processing and training with large data
  › Transfers data between CPU and GPU nodes efficiently
  › Encapsulates Kubernetes pod management
› ghee-models
  › Improves model reusability
  › Utilizes MLflow to manage the ML lifecycle
› cumin
  › Provides dataset definition and validation
  › Makes dataset generation reproducible
(VOS: Amazon S3 compatible internal data storage)

Slide 66

Future Work
› ghee-models
  › Add models
  › Add loss functions
› cumin
  › Support more data sources