
Pycon ZA
October 11, 2019

From data science to scalable NLU and vision cloud service by Bernardt Duvenhage

The talk will show how (and why) we've built our own natural language understanding and machine vision cloud service. The service is used mostly for intelligent dialog agents, and the production instances see 5M+ queries a month.

The core of the service is built with Python, NumPy, PyTorch, TensorFlow, OpenCV, TorchVision, scikit-learn and SQLAlchemy. We are also building a framework within which machine comprehension models can be developed in isolation, each with its own unit tests. The service and deployment related aspects (like dataset management, multi-tenancy and even the database interaction) are handled in a service layer that is well isolated from model development.

The cloud service is implemented with OpenAPI/Swagger & Connexion (Flask) to simplify development and maintenance. The Connexion Flask app is deployed using Gunicorn, and we typically use NGINX as a reverse proxy and load balancer. The model DB is a shared PostgreSQL or Google Cloud SQL DB. Everything is containerised and deployed on Kubernetes with Rancher.
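As a sketch of the front of that stack, an NGINX config that proxies and load-balances across Gunicorn instances might look like the following. This is illustrative only: the upstream addresses, port and location are assumptions, not the production values.

```nginx
# Illustrative: round-robin load balancing across Gunicorn instances.
upstream nlu_app {
    server app-0:8000;
    server app-1:8000;
}

server {
    listen 80;

    # Forward all requests to the Gunicorn-served Connexion app.
    location / {
        proxy_pass http://nlu_app;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```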


Transcript

  1. FROM DATA SCIENCE TO SCALABLE NLU & VISION CLOUD SERVICE
     BERNARDT DUVENHAGE, FEERSUM ENGINE, PRAEKELT CONSULTING
  2. Scope: Overview of the NLU and vision API. The data science & model building pipeline. The NLU & vision Python module. The multi-tenant REST service/resource layer. Swagger spec to Flask app, monitoring & deployment.
  3. The NLU & Vision API: Developed mainly for building task-oriented chatbots: navigation intents, entity extraction, natural language FAQs, emotion detection.
  4. The NLU & Vision API: Developed mainly for building task-oriented chatbots: navigation intents, entity extraction, natural language FAQs, emotion detection. Image classification, visual entity extraction, assessment/regression.
  5. The NLU & Vision API: Flexible on which algorithms we use. Local language support. Custom pre-trained vision models.
  6. The NLU & Vision API: Flexible on which algorithms we use. Local language support. Custom pre-trained vision models. Own costing model.
  7. The Data Science Pipeline: Develop & test models in isolation: Notebooks (linear please). Model unit tests.
  8. The Data Science Pipeline: Develop & test models in isolation: Notebooks (linear please). Model unit tests. In the same repo and Python environment as the service.
  9. The Python Module: To be used by chatbot and training software. Model management and a consistent API for loading and using the models.
  10. The Python Module: To be used by chatbot and training software. Model management and a consistent API for loading and using the models. Models stored in files or a SQLAlchemy DB.
  11. The Python Module: To be used by chatbot and training software. Model management and a consistent API for loading and using the models. Models stored in files or a SQLAlchemy DB. Module-level notebooks & unit tests.
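The "consistent API for loading and using the models" with interchangeable file or DB storage could be sketched as below. This is a hypothetical toy, not the actual module's API: the class and method names, and the JSON-file storage format, are all assumptions for illustration.

```python
import json
import os
import tempfile


class FileModelStore:
    """Toy model store: persists model params as JSON files keyed by name.

    Hypothetical sketch of the 'models stored in files or SQLAlchemy DB'
    idea; a DB-backed store would expose the same save/load/list interface.
    """

    def __init__(self, root_dir):
        self.root_dir = root_dir
        os.makedirs(root_dir, exist_ok=True)

    def save(self, name, params):
        # One JSON file per model keeps models easy to inspect and back up.
        path = os.path.join(self.root_dir, name + ".json")
        with open(path, "w") as f:
            json.dump(params, f)

    def load(self, name):
        path = os.path.join(self.root_dir, name + ".json")
        with open(path) as f:
            return json.load(f)

    def list_models(self):
        return sorted(
            fn[:-len(".json")]
            for fn in os.listdir(self.root_dir)
            if fn.endswith(".json")
        )


# Usage: persist a classifier's hyper params and reload them by name.
store = FileModelStore(tempfile.mkdtemp())
store.save("example_clsfr", {"algorithm": "nearest_neighbour_l2", "threshold": 0.5})
params = store.load("example_clsfr")
```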
  12. The Python Module:
      nlpe.create_feers_language_model('feers_elmo_eng')
      training_list, testing_list = nlpe_data.load_quora_data(…)
      nlpe.train_text_clsfr("example_clsfr", training_list, testing_list, clsfr_algorithm=…, …)
      accuracy, f1, cm = nlpe.test_text_clsfr("example_clsfr", testing_list, …)
      score_labels, _ = nlpe.retrieve_text_class("example_clsfr", input_text, …)
  13. The Python Module:
      vise.create_feers_vision_model('feers_resnet152')
      training_list, testing_list = vise_data.load_cat_dog_data(…)
      vise.train_image_clsfr("example_clsfr", training_list, testing_list, clsfr_algorithm=…, …)
      accuracy, f1, cm = vise.test_image_clsfr("example_clsfr", testing_list, …)
      score_labels, _ = vise.retrieve_image_class("example_clsfr", input_image, …)
  14. The Python Module: Model workflow & life cycle management was ok. Difficulties: Performance scalability of inference.
  15. The Python Module: Model workflow & life cycle management was ok. Difficulties: Performance scalability of inference. Ownership of training & testing data and model hyper params.
  16. The Service Wrapper Layer:
      text_clsfr_wrapper.text_clsfr_create(name, auth_token, desc, …)
      text_clsfr_wrapper.text_clsfr_add_training_samples(name, auth_token, json_training_data={…})
      text_clsfr_wrapper.text_clsfr_train(name, auth_token, json_training_data={…})
      _, response_json = text_clsfr_wrapper.text_clsfr_retrieve(name, auth_token, text=text)
  17. The Service Wrapper Layer: Benefits: Multi-tenancy via API key auth & model namespaces. Training & testing data and model hyper params via CRUD.
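The "multi-tenancy via API key auth & model namespaces" idea can be sketched as a lookup that resolves an API key to a per-tenant namespace and prefixes every model name with it. This is a hypothetical illustration, not the service's actual auth scheme; the class, method names and key-hashing choice are all assumptions.

```python
import hashlib


class TenantRegistry:
    """Toy multi-tenant lookup: maps hashed API keys to model namespaces.

    Hypothetical sketch only: the real service's auth and namespacing
    mechanics are not shown in the talk beyond the idea itself.
    """

    def __init__(self):
        # sha256(api_key) -> tenant namespace; store hashes, not raw keys.
        self._tenants = {}

    def register(self, api_key, namespace):
        key_hash = hashlib.sha256(api_key.encode()).hexdigest()
        self._tenants[key_hash] = namespace

    def resolve_model_name(self, api_key, model_name):
        """Qualify a model name with the caller's namespace, or raise."""
        key_hash = hashlib.sha256(api_key.encode()).hexdigest()
        try:
            namespace = self._tenants[key_hash]
        except KeyError:
            raise PermissionError("Unknown API key.")
        # Two tenants can now both own an 'example_clsfr' without clashing.
        return f"{namespace}/{model_name}"


registry = TenantRegistry()
registry.register("secret-token-a", "tenant_a")
qualified = registry.resolve_model_name("secret-token-a", "example_clsfr")
```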
  18. From Swagger Spec to Flask App: OpenAPI / Swagger spec.
      app = connexion.App(__name__, specification_dir=…, debug=…)
  19. From Swagger Spec to Flask App: OpenAPI / Swagger spec.
      app = connexion.App(__name__, specification_dir=…, debug=…)
      Connect controllers to service wrapper!
  20. From Swagger Spec to Flask App: Benefits: Don't have to write Flask code. Spec-driven development. API implementation and tests can live in the flask_server folder. Python API wrapper using codegen.
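A spec-first endpoint of the kind Connexion consumes might look like the fragment below. The path, operationId and schema here are illustrative assumptions, not the actual swagger.yaml; the key point is that each operationId names the Python controller function Connexion routes to.

```yaml
# Illustrative Swagger 2.0 fragment; the real spec's paths and schemas differ.
swagger: "2.0"
info:
  title: Example NLU service
  version: "1.0"
paths:
  /text_clsfrs/{name}/retrieve:
    post:
      # Connexion dispatches this operation to the named controller function.
      operationId: flask_server.controllers.text_clsfr_retrieve
      parameters:
        - name: name
          in: path
          required: true
          type: string
      responses:
        200:
          description: Scored class labels for the input text.
```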
  21. From Swagger Spec to Flask App: OpenAPI / Swagger spec.
      app = connexion.App(__name__, specification_dir=…, debug=…)
      Connect controllers to service wrapper!
      connexion_app.add_api(specification='swagger.yaml', arguments={…}, options={…})
  22. Monitoring: Prometheus + Grafana.
      promths_request_latency_gauge = Gauge('feersum_nlu_request_latency_seconds', 'FeersumNLU - Request Latency', ['endpoint'])
      promths_request_latency_gauge.labels(endpoint=f.__name__).set(call_duration)
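The gauge update above suggests wrapping endpoints in a timing decorator so every call reports its latency. The sketch below stands in the Prometheus client with a plain dict so it runs anywhere; a real deployment would update a prometheus_client.Gauge as on the slide, and the endpoint function here is a made-up placeholder.

```python
import functools
import time

# Stand-in for a Prometheus gauge: endpoint name -> last call latency (s).
request_latency = {}


def track_latency(f):
    """Record each call's wall-clock duration under the endpoint's name."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return f(*args, **kwargs)
        finally:
            # With prometheus_client this would be:
            # gauge.labels(endpoint=f.__name__).set(call_duration)
            request_latency[f.__name__] = time.perf_counter() - start
    return wrapper


@track_latency
def text_clsfr_retrieve(text):
    # Placeholder for real model inference.
    return [("greeting", 0.9)]


result = text_clsfr_retrieve("hello")
```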
  23. Alerting: Service /health endpoint & Pingdom. Slack webhook integration for Grafana alerts. Resource alerts from the hosting infrastructure.
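A /health endpoint for such uptime checks can be as small as a handler that reports per-dependency status. This is a hypothetical sketch, not the service's actual code: the DB check is stubbed out, and a real check would query the model DB (e.g. a SELECT 1).

```python
import json


def check_db():
    # Stub: a real probe would run a trivial query against the model DB.
    return True


def health():
    """Return (HTTP status code, JSON body) for a health probe.

    503 tells a load balancer or Pingdom to treat the instance as down.
    """
    db_ok = check_db()
    status = 200 if db_ok else 503
    body = json.dumps({"status": "ok" if db_ok else "degraded", "db": db_ok})
    return status, body


status_code, body = health()
```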
  24. Deployment: Flask app on Gunicorn. Docker containers. Rancher 2.0 on top of Kubernetes on GCP. Cloud SQL Postgres DB.
  25. Deployment: Flask app on Gunicorn. Docker containers. Rancher 2.0 on top of Kubernetes on GCP. Cloud SQL Postgres DB. NGINX load balancer.
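The containerised Gunicorn deployment might be sketched as the Dockerfile below. The base image, module path and worker count are assumptions for illustration, not the production values.

```dockerfile
# Illustrative only: versions, module path and worker count are assumptions.
FROM python:3.7-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Gunicorn serves the Connexion/Flask app; NGINX load-balances in front,
# and Kubernetes/Rancher manages the replicas.
CMD ["gunicorn", "flask_server.app:app", "--workers", "4", "--bind", "0.0.0.0:8000"]
```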
  26. Deployment:
      10 requests/s; 1.5M MAU
      20 requests/s; 3.0M MAU
      30 requests/s; 4.5M MAU
      …
      100 requests/s; 45M MAU
      (Diagram: deployment instances _0, _1, _2 sharing the DB.)