From data science to scalable NLU and vision cloud service by Bernardt Duvenhage

PyCon ZA
October 11, 2019

The talk will show how (and why) we’ve built our own natural language understanding and machine vision cloud service. The service is used mostly for intelligent dialog agents and the production instances see 5M+ queries a month.

The core of the service is built with Python, NumPy, PyTorch, TensorFlow, OpenCV, TorchVision, scikit-learn and SQLAlchemy. We are also building a framework within which machine comprehension models can be developed in isolation, each with its own unit tests. The service- and deployment-related aspects (like dataset management, multi-tenancy and even the database interaction) are handled in a service layer that is well isolated from model development.

The cloud service is implemented with OpenAPI/Swagger & Connexion (Flask) to simplify development and maintenance. The Connexion Flask app is deployed using Gunicorn, and we typically use NGINX as a reverse proxy and load balancer. The model DB is a shared PostgreSQL or Google Cloud SQL DB. Everything is containerised and deployed on Kubernetes with Rancher.
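
As a rough sketch of that stack (the file layout, spec path and port below are illustrative assumptions, not the actual FeersumNLU code), a spec-driven Connexion app can be created and handed to Gunicorn roughly like this:

    # app.py -- minimal sketch of a spec-driven Connexion/Flask service.
    # The spec directory, file name and port are illustrative assumptions.
    import connexion

    # Build the Flask app from the OpenAPI/Swagger spec; Connexion wires each
    # operationId in the spec to a Python controller function.
    cnx_app = connexion.App(__name__, specification_dir='swagger/')
    cnx_app.add_api('swagger.yaml')

    # Expose the underlying Flask app for WSGI servers, e.g.:
    #   gunicorn --workers 4 --bind 0.0.0.0:8080 app:application
    application = cnx_app.app

    if __name__ == '__main__':
        cnx_app.run(port=8080)

NGINX then sits in front of Gunicorn as the reverse proxy and load balancer described above, with each container running the same app against the shared model DB.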

Transcript

  1. FROM DATA SCIENCE TO SCALABLE NLU & VISION CLOUD SERVICE
      BERNARDT DUVENHAGE, FEERSUM ENGINE, PRAEKELT CONSULTING
  2. Scope: Overview of the NLU and vision API. The data science & model building pipeline. NLU & Vision Python module. The multi-tenant REST service/resource layer. Swagger spec to Flask app, monitoring & deployment.
  3. The NLU & Vision API

  4. The NLU & Vision API: Developed mainly for building task-oriented chatbots:
  5. The NLU & Vision API: Developed mainly for building task-oriented chatbots: Navigation intents, entity extraction, natural language FAQs, emotion detection.
  6. The NLU & Vision API: Developed mainly for building task-oriented chatbots: Navigation intents, entity extraction, natural language FAQs, emotion detection. Image classification, visual entity extraction, assessment/regression.
  7. The NLU & Vision API

  8. The NLU & Vision API: Flexible on which algorithms we use.
  9. The NLU & Vision API: Flexible on which algorithms we use. Local language support.
  10. The NLU & Vision API: Flexible on which algorithms we use. Local language support. Custom pre-trained vision models.
  11. The NLU & Vision API: Flexible on which algorithms we use. Local language support. Custom pre-trained vision models. Own costing model.
  12. The Data Science Pipeline

  13. The Data Science Pipeline: Develop & test models in isolation:

  14. The Data Science Pipeline: Develop & test models in isolation: Notebooks (linear please).
  15. The Data Science Pipeline: Develop & test models in isolation: Notebooks (linear please). Model unit tests.
  16. The Data Science Pipeline: Develop & test models in isolation: Notebooks (linear please). Model unit tests. In the same repo and Python environment as the service.
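
To make the "model unit tests" point concrete, here is a minimal sketch of what such a test might look like, written against the module API shown on the later slides (the dataset loader call, the use of default settings instead of explicit hyper-parameters, and the accuracy threshold are illustrative assumptions):

    # test_text_clsfr.py -- sketch of a model-level unit test (pytest style).
    # nlpe / nlpe_data mirror the module API shown on later slides; the
    # accuracy threshold and the reliance on default settings are assumptions.
    import nlpe
    import nlpe_data


    def test_example_text_clsfr_accuracy():
        # Train a small classifier and check it clears a minimum accuracy bar,
        # so regressions in the model code are caught by CI.
        nlpe.create_feers_language_model('feers_elmo_eng')
        training_list, testing_list = nlpe_data.load_quora_data()
        nlpe.train_text_clsfr("example_clsfr", training_list, testing_list)
        accuracy, f1, cm = nlpe.test_text_clsfr("example_clsfr", testing_list)
        assert accuracy > 0.8
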
  17. (image-only slide)
  18. (image-only slide)
  19. The Data Science Pipeline

  20. The Data Science Pipeline

  21. The Data Science Pipeline

  22. The Data Science Pipeline

  23. Detour into Transfer Learning - NLU

  24. Detour into Transfer Learning - NLU: char n-grams

  25. Detour into Transfer Learning - NLU: char n-grams, word n-grams

  26. Detour into Transfer Learning - NLU: char n-grams, word n-grams, POS, meaning
  27. Detour into Transfer Learning - NLU: char n-grams, word n-grams, POS, meaning, Domain Model
  28. Detour into Transfer Learning - NLU: char n-grams, word n-grams, POS, meaning, Domain Model
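
To make the lower rungs of that feature hierarchy concrete, here is a generic scikit-learn sketch of character and word n-gram features feeding a linear classifier (the toy data and every parameter choice are illustrative, not the production FeersumNLU models):

    # Sketch: char + word n-gram features for short-text intent classification.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline, make_union

    texts = ["what is my balance", "talk to an agent", "please close my account"]
    labels = ["balance", "agent", "close_account"]

    # Character n-grams are robust to typos and morphology; word n-grams add
    # slightly longer-range context. The higher rungs (POS, meaning, domain
    # models) would come from pre-trained language models instead.
    features = make_union(
        TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4)),  # char n-grams
        TfidfVectorizer(analyzer='word', ngram_range=(1, 2)),     # word n-grams
    )

    clf = make_pipeline(features, LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["get me an agent please"]))
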
  29. Detour into Transfer Learning - Vision

  30. Detour into Transfer Learning - Vision

  31. Detour into Transfer Learning - Vision

  32. Detour into Transfer Learning - Vision

  33. Detour into Transfer Learning - Vision: Retrain
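
A generic TorchVision sketch of that retraining step; the 'feers_resnet152' model seen later in the talk suggests a ResNet-152 backbone, but the class count, freezing strategy and optimiser below are illustrative assumptions:

    # Sketch: transfer learning by retraining only the classifier head of a
    # pre-trained ResNet-152. Class count and optimiser settings are illustrative.
    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet152(pretrained=True)

    # Freeze the pre-trained backbone ...
    for param in model.parameters():
        param.requires_grad = False

    # ... and replace the final fully connected layer for the new task.
    num_classes = 2  # e.g. a cat/dog classifier as in the later example
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    # Only the new head's parameters are optimised during retraining.
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()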

  34. The Data Science Pipeline

  35. The Data Science Pipeline

  36. The Python Module

  37. The Python Module: To be used by chatbot and training software.
  38. The Python Module: To be used by chatbot and training software. Model management and a consistent API for loading and using the models.
  39. The Python Module: To be used by chatbot and training software. Model management and a consistent API for loading and using the models. Models stored in files or a SQLAlchemy DB.
  40. The Python Module: To be used by chatbot and training software. Model management and a consistent API for loading and using the models. Models stored in files or a SQLAlchemy DB. Module-level notebooks & unit tests.
  41. The Python Module:
      nlpe.create_feers_language_model('feers_elmo_eng')
      training_list, testing_list = nlpe_data.load_quora_data(…)
      nlpe.train_text_clsfr("example_clsfr", training_list, testing_list, clsfr_algorithm=…, …)
      accuracy, f1, cm = nlpe.test_text_clsfr("example_clsfr", testing_list, …)
      score_labels, _ = nlpe.retrieve_text_class("example_clsfr", input_text, …)
  42. The Python Module:
      vise.create_feers_vision_model('feers_resnet152')
      training_list, testing_list = vise_data.load_cat_dog_data(…)
      vise.train_image_clsfr("example_clsfr", training_list, testing_list, clsfr_algorithm=…, …)
      accuracy, f1, cm = vise.test_image_clsfr("example_clsfr", testing_list, …)
      score_labels, _ = vise.retrieve_image_class("example_clsfr", input_image, …)
  43. The Python Module

  44. The Python Module: Model workflow & life-cycle management was OK.
  45. The Python Module: Model workflow & life-cycle management was OK. Difficulties:
  46. The Python Module: Model workflow & life-cycle management was OK. Difficulties: Performance scalability of inference.
  47. The Python Module: Model workflow & life-cycle management was OK. Difficulties: Performance scalability of inference. Ownership of training & testing data and model hyperparameters.
  48. Architecture Idea: User of module

  49. Architecture Idea: User of module. Add a REST API …

  50. User of module

  51. User of module

  52. The Service Wrapper Layer:
      text_clsfr_wrapper.text_clsfr_create(name, auth_token, desc, …)
      text_clsfr_wrapper.text_clsfr_add_training_samples(name, auth_token, json_training_data={…})
      text_clsfr_wrapper.text_clsfr_train(name, auth_token, json_training_data={…})
      _, response_json = text_clsfr_wrapper.text_clsfr_retrieve(name, auth_token, text=text)
  53. The Service Wrapper Layer. Benefits: multi-tenancy via API key auth & model namespaces; training & testing data and model hyperparameters managed via CRUD.
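
As a rough sketch of how a wrapper call can enforce that multi-tenancy (the API-key lookup, the namespacing scheme and the return shape here are hypothetical, not the actual FeersumNLU implementation):

    # Sketch: multi-tenancy inside a service wrapper call. The API key resolves
    # to a tenant and model names are namespaced per tenant before the model
    # store is touched. All helper names here are hypothetical.
    API_KEYS = {"example-api-key": "tenant_a"}  # illustrative in-memory key store


    def text_clsfr_retrieve(name, auth_token, text, model_store):
        tenant = API_KEYS.get(auth_token)
        if tenant is None:
            return 401, {"error": "invalid auth token"}

        # A per-tenant namespace keeps one tenant's "faq_clsfr" separate from
        # another tenant's model of the same name.
        namespaced_name = f"{tenant}/{name}"
        model = model_store[namespaced_name]
        return 200, {"scores": model.retrieve(text)}
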
  54. (image-only slide)
  55. Lasso (Corpus/Dataset Manager)

  56. Lasso (Corpus/Dataset Manager)

  57. From Swagger Spec to Flask App: OpenAPI / Swagger spec.

  58. From Swagger Spec to Flask App: OpenAPI / Swagger spec.
      app = connexion.App(__name__, specification_dir=…, debug=…)
  59. From Swagger Spec to Flask App: OpenAPI / Swagger spec.
      app = connexion.App(__name__, specification_dir=…, debug=…)
      Connect controllers to service wrapper!
  60. From Swagger Spec to Flask App. Benefits: don't have to write Flask code; spec-driven development; API implementation and tests can live in the flask_server folder; a Python API wrapper generated using codegen.
  61. From Swagger Spec to Flask App: OpenAPI / Swagger spec.
      app = connexion.App(__name__, specification_dir=…, debug=…)
      Connect controllers to service wrapper!
      connexion_app.add_api(specification='swagger.yaml', arguments={…}, options={…})
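
For context, a Connexion controller is just a plain function that the spec's operationId points at; a sketch of one that stays thin and delegates to the service wrapper (the module path, auth header and parameter names are illustrative assumptions):

    # flask_server/controllers/text_clsfr_controller.py -- sketch only.
    # Connexion routes a request here when the spec's operationId names this
    # function; the controller delegates straight to the wrapper layer.
    from flask import request

    import text_clsfr_wrapper  # the service wrapper layer shown earlier


    def text_clsfr_retrieve(name, body):
        auth_token = request.headers.get('X-Auth-Token', '')
        code, response_json = text_clsfr_wrapper.text_clsfr_retrieve(
            name, auth_token, text=body.get('text', ''))
        return response_json, code
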
  62. (image-only slide)
  63. (image-only slide)
  64. (image-only slide)
  65. Monitoring

  66. Monitoring: Prometheus + Grafana.

  67. Monitoring: Prometheus + Grafana.
      promths_request_latency_gauge = Gauge('feersum_nlu_request_latency_seconds', 'FeersumNLU - Request Latency', ['endpoint'])
  68. Monitoring: Prometheus + Grafana.
      promths_request_latency_gauge = Gauge('feersum_nlu_request_latency_seconds', 'FeersumNLU - Request Latency', ['endpoint'])
      promths_request_latency_gauge.labels(endpoint=f.__name__).set(call_duration)
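
The f.__name__ in that snippet suggests the gauge is set from a timing decorator around each endpoint; a minimal sketch of such a decorator with prometheus_client (only the gauge definition and the labels/set call come from the slide, the decorator itself is an illustration):

    # Sketch: per-endpoint request latency recorded via a timing decorator.
    # The gauge definition mirrors the slide; the decorator is illustrative.
    import functools
    import time

    from prometheus_client import Gauge

    promths_request_latency_gauge = Gauge(
        'feersum_nlu_request_latency_seconds',
        'FeersumNLU - Request Latency',
        ['endpoint'])


    def track_latency(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return f(*args, **kwargs)
            finally:
                call_duration = time.time() - start
                promths_request_latency_gauge.labels(
                    endpoint=f.__name__).set(call_duration)
        return wrapper

Endpoint controllers can then be wrapped with @track_latency, and Prometheus scrapes the gauge for the Grafana dashboards.
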
  69. Monitoring

  70. Alerting: Service /health endpoint & Pingdom. Slack webhook integration for Grafana alerts. Resource alerts from the hosting infrastructure.
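
The /health endpoint used for that uptime monitoring can stay very small; a hypothetical sketch (the database check and the response shape are assumptions):

    # Sketch: minimal /health controller polled by uptime monitoring (e.g. Pingdom).
    # The session handling and the response shape are illustrative assumptions.
    from sqlalchemy import text


    def health(db_session):
        try:
            # Confirm the shared model DB is reachable before reporting healthy.
            db_session.execute(text("SELECT 1"))
        except Exception:
            return {"status": "unhealthy"}, 503
        return {"status": "healthy"}, 200
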
  71. Deployment

  72. Deployment: Flask app on Gunicorn.

  73. Deployment: Flask app on Gunicorn. Docker containers.

  74. Deployment: Flask app on Gunicorn. Docker containers. Rancher 2.0 on top of Kubernetes on GCP.
  75. Deployment: Flask app on Gunicorn. Docker containers. Rancher 2.0 on top of Kubernetes on GCP. CloudSQL Postgres DB.
  76. Deployment: Flask app on Gunicorn. Docker containers. Rancher 2.0 on top of Kubernetes on GCP. CloudSQL Postgres DB. NGINX load balancer.
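
Since a Gunicorn configuration file is itself Python, here is a sketch of typical settings for this kind of containerised deployment (the worker count, port and timeout are illustrative, not the production values):

    # gunicorn.conf.py -- sketch of a Gunicorn config for the Connexion/Flask app.
    # All values are illustrative; tune them to the actual workload.
    bind = "0.0.0.0:8080"   # NGINX proxies to this port inside the container
    workers = 4             # a few sync workers per container/pod
    timeout = 120           # allow slower model-inference requests to finish
    accesslog = "-"         # access log to stdout for container log collection

It would be passed explicitly with something like: gunicorn -c gunicorn.conf.py app:application.
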
  77. Deployment (diagram): DB; service instance _0.

  78. Deployment (diagram): DB; service instance _0; 10 request/s (1.5M MAU).

  79. Deployment (diagram): DB; service instances _0, _1; 10 request/s (1.5M MAU), 20 request/s (3.0M MAU).
  80. Deployment (diagram): DB; service instances _0, _1, _2; 10 request/s (1.5M MAU), 20 request/s (3.0M MAU), 30 request/s (4.5M MAU).
  81. Deployment (diagram): DB; service instances _0, _1, _2; 10 request/s (1.5M MAU), 20 request/s (3.0M MAU), 30 request/s (4.5M MAU), … 100 request/s (45M MAU).
  82. QUESTIONS?