trip duration
• New data comes in each time someone orders a taxi
  ◦ NOT IN BATCHES
• Continuous predictions
• Use trained model to make 1 prediction (repeated for each new order)
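The "one prediction per order, not in batches" idea can be sketched as a generator that scores each ride as it arrives. This is only an illustration: `predict_stream`, `toy_model` and the `distance_km` field are hypothetical names, and the toy model (3 minutes per km) stands in for the trained model.

```python
from typing import Callable, Dict, Iterable, Iterator

Ride = Dict[str, float]


def predict_stream(rides: Iterable[Ride],
                   model: Callable[[Ride], float]) -> Iterator[float]:
    """Yield one prediction per incoming ride, without waiting for a batch."""
    for ride in rides:
        yield model(ride)


def toy_model(ride: Ride) -> float:
    """Hypothetical stand-in for the trained model: 3 minutes per km."""
    return 3.0 * ride["distance_km"]


incoming = [{"distance_km": 2.0}, {"distance_km": 5.0}]
predictions = list(predict_stream(incoming, toy_model))
```

Because `predict_stream` is lazy, each prediction is produced as soon as its ride is read from the source.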
silos
• Power faster decisions
• Have reactive properties
• Reduce point to point connections
• Support both batch and stream paradigms
Centralized Event Log
First access point to data
• No intermediate storage layer
• No intermediate processing
• Faster feedback
• Performance over time may trigger other events
Diagram labels: Kafka Streams application, TensorFlow MODEL, Kafka TOPICS
• Reduce manual actions
  ◦ Avoid code duplication
• The problem is supervised
  ◦ and we get the actual durations continuously
• Events come from Kafka Topics
submodules for each Kafka Streams application
• Plugin and virtual env are used to create Python modules for the ML part
• The infrastructure is specified in separate projects

.
├── pom.xml
│
├── edml-scoring
│   └── src
│
├── edml-serving
│   └── src
│
└── edml-trainer
    ├── requirements.txt
    └── setup.py

.
└── tf-aiplatform-edml

.
└── tf-apps-edml
Open data Taxi Zones
• Geography type
• Manipulation via BigQuery GIS
• Simple geography functions

SELECT
  ST_DISTANCE(
    ST_CENTROID(pickup_zone_geom),
    ST_CENTROID(dropoff_zone_geom)
  ) AS distance
FROM <table>;
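For readers without BigQuery at hand, the centroid-to-centroid distance can be approximated outside SQL. The sketch below uses the haversine formula on a spherical Earth, which is a swapped-in technique and not necessarily what BigQuery computes internally; the centroid coordinates and the `haversine_m` name are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical centroids of a pick-up zone and a drop-off zone.
pickup_centroid = (40.75, -73.99)
dropoff_centroid = (40.64, -73.78)
approx_distance_m = haversine_m(*pickup_centroid, *dropoff_centroid)
```

The result is a straight-line approximation, so it underestimates the distance actually driven; it is only meant as a cheap feature available before the trip starts.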
• One hot encoded features
  ◦ pick-up day of week
  ◦ pick-up hour of day
  ◦ pick-up day of year ✖ pick-up hour of day
  ◦ pick-up zone
  ◦ drop-off zone
  ◦ pick-up zone ✖ drop-off zone
Inputs: Pick-up Location, Pick-up Datetime, Drop-off Location, Passenger Count, Trip distance approximation
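A minimal sketch of what one hot encoding and a feature cross (✖) mean for these categorical inputs. It maps each pair of categories to an exact index; TensorFlow's crossed columns hash pairs into a fixed number of buckets instead, so this shows the idea and is not the deck's implementation. The helper names are hypothetical.

```python
from typing import List


def one_hot(index: int, size: int) -> List[int]:
    """Vector of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def cross_index(i: int, j: int, size_j: int) -> int:
    """Position of the pair (i, j) in the flattened cross of two categoricals."""
    return i * size_j + j


HOURS_PER_DAY = 24
DAYS_PER_YEAR = 366

hour_of_day = 17   # 0-based hour
day_of_year = 42   # 0-based day

hour_vec = one_hot(hour_of_day, HOURS_PER_DAY)
# Cross "day of year ✖ hour of day": one slot per (day, hour) combination.
crossed_vec = one_hot(cross_index(day_of_year, hour_of_day, HOURS_PER_DAY),
                      DAYS_PER_YEAR * HOURS_PER_DAY)
```

The crossed vector has 366 × 24 slots, which shows why crosses produce very sparse features that suit a linear model.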
• Embedded Features
  ◦ pick-up day of year
  ◦ pick-up hour of day
  ◦ pick-up zone
  ◦ drop-off zone
  ◦ passenger count
  ◦ approximated distance
Inputs: Pick-up Location, Pick-up Datetime, Drop-off Location, Passenger Count, Trip distance approximation
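An embedding replaces a category index with a short dense vector that is learned during training. The sketch below only illustrates the lookup and the concatenation into one dense input; the table sizes, dimensions, random initialisation and the choice to share one table between pick-up and drop-off zones are all assumptions, not details from the deck.

```python
import random
from typing import List


def make_embedding_table(num_categories: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Randomly initialised table; in a real model these values are trained."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_categories)]


NUM_ZONES = 300  # hypothetical upper bound on zone ids
zone_table = make_embedding_table(NUM_ZONES, dim=4, seed=1)
hour_table = make_embedding_table(24, dim=2, seed=2)


def dense_features(pickup_zone: int, dropoff_zone: int, hour: int,
                   passenger_count: int, distance_km: float) -> List[float]:
    """Concatenate embedding lookups with the numeric inputs."""
    return (zone_table[pickup_zone]
            + zone_table[dropoff_zone]
            + hour_table[hour]
            + [float(passenger_count), float(distance_km)])


example = dense_features(pickup_zone=12, dropoff_zone=87, hour=17,
                         passenger_count=2, distance_km=3.4)
```

Here the dense input has 4 + 4 + 2 + 2 = 12 values, regardless of how many zones exist.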
distinct values
• Two strategies combined
  ◦ one hot encoding → sparse features → linear model
  ◦ embedding → dense features → deep neural network
• TensorFlow Estimator API
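How the two strategies combine can be sketched as a forward pass in plain Python: the linear ("wide") part sums the weights of the active sparse features, the deep part runs the dense features through a small network, and the two outputs are added. All weights below are made-up numbers for illustration; in the deck this is handled by the TensorFlow Estimator API, not by hand-written code.

```python
from typing import List, Sequence


def wide_output(active_indices: Sequence[int], weights: Sequence[float]) -> float:
    """Linear model over one hot features: sum the weights of the active slots."""
    return sum(weights[i] for i in active_indices)


def deep_output(dense: Sequence[float], hidden_w: List[List[float]],
                hidden_b: Sequence[float], out_w: Sequence[float],
                out_b: float) -> float:
    """One hidden ReLU layer followed by a linear output unit."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, dense)) + b)
              for row, b in zip(hidden_w, hidden_b)]
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b


def wide_and_deep(active_indices, weights, dense, hidden_w, hidden_b, out_w, out_b) -> float:
    """Final prediction is the sum of the wide and deep outputs."""
    return (wide_output(active_indices, weights)
            + deep_output(dense, hidden_w, hidden_b, out_w, out_b))


# Made-up parameters, for illustration only.
prediction = wide_and_deep(
    active_indices=[0, 2], weights=[0.5, 1.0, 2.0],
    dense=[1.0, 2.0],
    hidden_w=[[1.0, 0.0], [0.0, -1.0]], hidden_b=[0.0, 0.0],
    out_w=[3.0, 4.0], out_b=1.0,
)
```

With these numbers the wide part contributes 2.5 and the deep part 4.0.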
data points
• AI Platform
  ◦ notebooks for exploring, building and testing the solution locally
  ◦ remote training and prediction
  ◦ hyperparameter tuning
  ◦ model deployment
• Code must be organised and packaged properly

$ tree edml-trainer/
.
├── setup.py
└── trainer
    ├── __init__.py
    ├── model.py
    ├── task.py
    └── util.py
are needed to rebuild the model at prediction time
• Graph serialization is not enough and will result in:
  ◦ Not found: Resource … variable was uninitialized
• Proposal:
  ◦ The model metadata (e.g. inputs, GCS path) can be sent in a topic

$ tree my_model/
.
├── saved_model.pb
└── variables
    ├── variables.data-00000-of-00002
    ├── variables.data-00001-of-00002
    └── variables.index
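The proposal is to publish a small description of the model rather than the model itself. A minimal sketch of such a message, serialised as JSON, is shown here; the `ModelDescription` class and its field names are hypothetical, and the deck does not specify the actual schema or serialisation format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ModelDescription:
    """Hypothetical metadata record published to the model topic."""
    version: str
    gcs_path: str        # where the SavedModel directory lives
    inputs: List[str]    # names of the input tensors expected at prediction time
    output: str          # name of the output tensor


def to_message(desc: ModelDescription) -> bytes:
    """Serialise the description as UTF-8 JSON, ready to be a record value."""
    return json.dumps(asdict(desc)).encode("utf-8")


def from_message(payload: bytes) -> ModelDescription:
    """Rebuild the description on the consumer side."""
    return ModelDescription(**json.loads(payload.decode("utf-8")))


description = ModelDescription(
    version="<model.version>",
    gcs_path="gs://.../<model.version>",
    inputs=["pickup_zone", "dropoff_zone"],  # illustrative names only
    output="predictions",                    # illustrative name only
)
round_trip = from_message(to_message(description))
```

The message stays a few hundred bytes however large the SavedModel is, which keeps the topic light and leaves the heavy download to each consumer.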
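import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val topic: String = "<model.topic>"
val version: String = "<model.version>"
val model: String = "gs://.../<model.version>"

val producer = new KafkaProducer[ModelKey, TFSavedModel](...
val key = ModelKey("<app.name>")
val value = // …

// KafkaProducer.send takes a ProducerRecord, not separate topic/key/value arguments
producer.send(new ProducerRecord(topic, key, value))
producer.flush()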
streams
  ◦ Input records to predict
  ◦ Model updates
• The model description is broadcast to every instance of the same app
  ◦ each instance separately loads the model graph from GCS
• The deserialized model graph lives in memory
• An input record is skipped if no model is present
Diagram labels: APP, CI, DEPLOY STAGE, MODEL TOPIC, NEW RECORDS, PREDICTIONS
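The serving behaviour described above can be sketched as a small state holder with two handlers, one per stream. This is a plain-Python illustration of the logic only, not Kafka Streams code; `ScoringState`, `fake_loader` and the record fields are hypothetical.

```python
from typing import Callable, Dict, Optional

Record = Dict[str, float]
Model = Callable[[Record], float]


class ScoringState:
    """Keeps the latest model in memory and scores records one at a time."""

    def __init__(self, load_model: Callable[[str], Model]) -> None:
        self._load_model = load_model
        self.model: Optional[Model] = None
        self.version: Optional[str] = None

    def on_model_update(self, version: str, path: str) -> None:
        """Called for each record of the model stream: load and swap the model."""
        self.model = self._load_model(path)
        self.version = version

    def on_record(self, record: Record) -> Optional[float]:
        """Called for each input record: skip it while no model is loaded."""
        if self.model is None:
            return None
        return self.model(record)


def fake_loader(path: str) -> Model:
    """Hypothetical loader; a real one would read the graph from storage."""
    return lambda record: 3.0 * record["distance_km"]


state = ScoringState(fake_loader)
skipped = state.on_record({"distance_km": 2.0})      # no model yet
state.on_model_update("<model.version>", "<model.path>")
scored = state.on_record({"distance_km": 2.0})
```

Every application instance would hold its own copy of this state, which matches the point that each instance loads the graph separately.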
The TF graph is the interface between data scientist and data engineer
Standardisation of the model serialisation and event production
"Success of Model training" is an event
Model size can be an issue
Transition to TF 2.0 & Java compatibility
Preprocessing and data preparation are not covered
• Photo by Miryam León on Unsplash
• Photo by Negative Space from Pexels
• Photo by Gerrie van der Walt on Unsplash
• Photo by Todd DeSantis on Unsplash
• Photo by Rock'n Roll Monkey on Unsplash
• Photo by Denys Nevozhai on Unsplash
• Photo by Denys Nevozhai on Unsplash