Better Data with Machine Learning and Serverless

Turning raw data files, such as audio or video, into valuable insights has traditionally been a manual and tedious process, with mixed results because so much depends on human judgment.

Thanks to advances in machine learning systems, coupled with the rapidly deployable nature of serverless technology as a middleware layer, we can now build sophisticated data insight platforms that remove much of the time this work has typically required.

With this in mind, we’ll look at:
- How to build end-to-end data insight and predictor systems on top of serverless and machine learning systems.
- Best practices for using serverless technology to ferry information between raw data files and machine learning systems through an eventing system.
- Considerations and practical examples for handling the security implications of sensitive information.

Jonathan LeBlanc

October 23, 2018

Transcript

  1. Better Data with Machine Learning and Serverless

    Jonathan LeBlanc (Director of Developer Advocacy @ Box) Twitter: @jcleblanc Email: jleblanc@box.com
  2. Agenda for Today

    - Building Blocks: How are these systems built?
    - Best Practices: How do we architect the solution?
    - Security Considerations: How do we ensure data security?
  3. Part 1: Building Blocks

  4. What Machine Learning Isn’t

  5. Components of the System

    - Serverless Framework: Provides the compute and data management from the stored data location to the machine learning engine.
    - Machine Learning System: Provides the data enhancement capabilities that improve the underlying source data’s metadata (information about information).
  6. Why Serverless?

    - On Demand: Machine learning ties are only required when files need processing, which may be infrequent.
    - No hosting: You don’t have to run or manage any servers, containers, or VMs of your own.
    - Pricing based on use: Execution resources are only run (and charged for) based on your use, typically resulting in very low server costs.
    - Different stack options: Multiple serverless systems exist to fit stack needs, including numerous open source options.
  7. Components of the System

    - Webhook / Event Pump System: Handles notifications to the middleware layer when a new file should be processed.
    - Middleware Layer: Handles communication between the data source and machine learning systems.
    - Metadata Layer: The storage facility for machine learning data responses.
    - Token Downscoping System: Allows you to pass tightly scoped read / write tokens through multiple uncontrolled system layers.
  8. How a Data / ML System Works

    - Cloud Data: Data store & initial metadata
    - Serverless Framework: Callback handler and code execution
    - Machine Learning: Data processor and enhancer
    Diagram labels: Webhook, Metadata, Execute, Callback
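The flow on this slide (webhook in, serverless execution, machine learning call, metadata written back) might be sketched roughly as follows. Every name here (`processFileEvent`, `analyze`, `saveMetadata`, the event fields) is hypothetical and not taken from the talk; the ML client and metadata store are injected stand-ins so the sketch runs without any real service.

```javascript
// Hypothetical middleware core: receives a webhook event, asks an ML
// service to analyze the file, then writes the result back as metadata.
// `mlClient` and `metadataStore` are injected stand-ins, not real SDKs.
const processFileEvent = async (event, { mlClient, metadataStore }) => {
  const { fileId, readToken, writeToken } = event;

  // 1. The machine learning system enhances the raw file.
  const insights = await mlClient.analyze(fileId, readToken);

  // 2. The metadata layer stores the ML response alongside the file.
  await metadataStore.saveMetadata(fileId, insights, writeToken);

  return { fileId, insights };
};

// Example wiring with in-memory fakes.
const fakeMl = {
  analyze: async (fileId) => ({ keywords: ['demo'], source: fileId })
};
const saved = new Map();
const fakeStore = {
  saveMetadata: async (fileId, data) => { saved.set(fileId, data); }
};
```

Injecting the two clients keeps the core independent of any one ML vendor or storage API, which also makes it straightforward to test.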
  9. Common Serverless Frameworks

    - AWS Lambda: https://aws.amazon.com/lambda/
    - Azure Functions: https://azure.microsoft.com/en-us/services/functions/
    - Google Cloud Functions: https://cloud.google.com/functions/
    - IronFunctions: https://github.com/iron-io/functions
    - OpenWhisk: https://openwhisk.apache.org/
    - Fission: https://fission.io/
    Considerations: (1) your stack, (2) pricing / free use, (3) supported languages, (4) regional support.
  10. Machine Learning Frameworks

    - Audio / Video / Image: [video] MS Video Indexer, [audio] Voicebase, [face] Hive AI, [image] Clarifai, [image] Google Vision, [mixed] IBM Watson, [moderation] MS Content Moderator, [face] Kairos, [audio] AT&T Speech, [image] Amazon Rekognition
    - Text Extraction: [id] Acuant, [invoice] Rossum.AI, [contract] eBrevia, [lease] Leverton, [resume] Textkernel, [prediction] AmazonML, [analysis] Aylien, [classification] MonkeyLearn, [natural language] ApiAI, [sentiment] AlchemyText
    - Open Source: TensorFlow, Keras, Scikit-learn, MS Cognitive Toolkit, Theano, Caffe, Torch, Accord.NET
  11. Part 2: Best Practices

  12. Program Logic and Serverless Separation

    - Serverless function agnostic: The core logic of the function should be separate from the serverless requirements. Thin handlers / routers may be written on top of the core logic to maintain separation.
    - Service deployments: To allow for deployment amongst numerous serverless technologies, systems like serverless.com may be utilized.
    - Testability: The separation of concerns allows you to test the function separately from the container.
    - Handler: Separate handler from core program logic for testability.
  13. AWS Lambda Handler

        // API Gateway Handler
        exports.handler = (event, context, callback) => {
          // Check for valid event
          if (isValidEvent(event)) {
            // processEvent is not shown on the slide; passing callback lets it signal completion
            processEvent(event, callback);
          } else {
            callback(null, { statusCode: 200, body: 'Event received but invalid' });
          }
        };
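One way to keep the handler thin, in the spirit of the separation described on the previous slide, is to put the decision logic in a plain function with no Lambda-specific arguments and let a small adapter translate to the callback signature. This is only a sketch; `handleIndexerEvent` and `toLambdaHandler` are hypothetical names that do not appear in the talk.

```javascript
// Core logic: a plain function that knows nothing about Lambda.
const handleIndexerEvent = (event) => {
  const hasPayload = Boolean(event && (event.body || event.queryStringParameters));
  return hasPayload
    ? { statusCode: 200, body: 'Event processed' }
    : { statusCode: 200, body: 'Event received but invalid' };
};

// Thin adapter: the only code that touches the Lambda callback signature.
const toLambdaHandler = (core) => (event, context, callback) => {
  let response;
  try {
    response = core(event);
  } catch (err) {
    return callback(err);
  }
  return callback(null, response);
};

// In a Lambda deployment this would be assigned to exports.handler.
const handler = toLambdaHandler(handleIndexerEvent);
```

Because `handleIndexerEvent` takes and returns plain objects, it can be unit tested without a container, and a different adapter could be written for another serverless platform.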
  14. Dealing with Cold Starts

    - What is it: The latency experienced when a function is triggered, which only occurs when there isn’t a warm / idle container. A container is automatically dropped after a period of inactivity.
    - Options: You can either keep the container warm through memory increases and calls, or deal with the cold start.
    - Fewer libraries: The more libraries that are used, the longer it will take to start the container.
    - Smaller functions: Writing smaller functions decreases start time.
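Two of the options above can be sketched in a few lines: answering a scheduled keep-warm call as cheaply as possible, and doing expensive setup once in module scope so warm invocations reuse it. The `warmup` event flag and the `heavyClient` object are assumptions made for illustration, not something defined in the talk or by any platform.

```javascript
// Module scope survives between invocations while the container is warm,
// so expensive setup is done once and reused.
let heavyClient = null;
let initCount = 0;

const getHeavyClient = () => {
  if (heavyClient === null) {
    initCount += 1; // stands in for loading a large library or opening a connection
    heavyClient = { analyze: (input) => `analyzed:${input}` };
  }
  return heavyClient;
};

const handler = (event, context, callback) => {
  // Hypothetical keep-warm ping from a scheduler: exit early and cheaply.
  if (event && event.warmup === true) {
    return callback(null, { statusCode: 200, body: 'warm' });
  }
  const result = getHeavyClient().analyze(event.body);
  return callback(null, { statusCode: 200, body: result });
};
```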
  15. Exit Callback Hygiene

    - Error logging: With many serverless environments, proper callback use will provide full data logging.
    - Reliability: Failing to exit properly can result in your function executing until a timeout is hit. Timeouts may also cause subsequent invocations to require a cold start, which results in additional latency.
    - Cost: If a timeout occurs, you will be charged for the entire timeout time.
  16. Processing AWS Lambda Exit Callbacks

        // Success Callback
        callback(null, { statusCode: 200, body: 'Event processed' });

        // Error Callback
        callback({ statusCode: 400, body: 'Event error' });
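A minimal sketch of one way to make sure every code path exits through the callback exactly once, so that a thrown error or rejected promise does not leave the function running until the timeout. `withExitCallback` is a hypothetical helper; returning the 400 as a response body, rather than as the first (error) argument shown on the slide, is a choice made for this sketch.

```javascript
// Wraps an async worker so the Lambda-style callback is always invoked
// exactly once, whether the worker resolves or rejects.
const withExitCallback = (worker) => (event, context, callback) => {
  let finished = false;
  const finish = (err, response) => {
    if (finished) return; // guard against double callbacks
    finished = true;
    callback(err, response);
  };

  return Promise.resolve()
    .then(() => worker(event))
    .then((body) => finish(null, { statusCode: 200, body }))
    .catch((err) => {
      console.error('Event error:', err.message); // log before exiting
      finish(null, { statusCode: 400, body: 'Event error' });
    });
};
```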
  17. Writing Stateless Single Purpose Functions

    - Error isolation: Debugging and error handling are easier with function / concern isolation.
    - Scaling: With monolithic functions, you have to optimize the entire function for all of its elements, rather than only the specific functionality receiving the most calls / traffic.
    - Planning and testing: It’s easier to plan and write test plans for functions with singular concerns.
  18. Valid Event Function

        /**
         * Check for a valid event.
         * @param {object} indexerEvent - indexer event
         * @return {boolean} - true if valid event
         */
        const isValidEvent = (indexerEvent) => {
          // Coerce to a boolean so the return value matches the JSDoc
          return Boolean(indexerEvent.body || indexerEvent.queryStringParameters);
        };
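Because a single-purpose function like this holds no state, it can be checked directly with Node's built-in `assert` module and a small table of cases. The version below coerces the result to a boolean, which is an assumption about intent based on the `@return {boolean}` annotation; the test cases themselves are illustrative.

```javascript
const { strictEqual } = require('assert');

const isValidEvent = (indexerEvent) =>
  Boolean(indexerEvent.body || indexerEvent.queryStringParameters);

// Table of inputs and the result expected for each.
const cases = [
  { name: 'body only', event: { body: '{"id":1}' }, expected: true },
  { name: 'query only', event: { queryStringParameters: { id: '1' } }, expected: true },
  { name: 'empty event', event: {}, expected: false }
];

// Throws an AssertionError naming the failing case, if any.
for (const { name, event, expected } of cases) {
  strictEqual(isValidEvent(event), expected, name);
}
```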
  19. Part 3: Security Considerations

  20. Security Considerations

    - Serverless use consideration: Are serverless systems a viable / approved mechanism within your organization?
    - Token exposure: Many API auth systems are token based, with broadly scoped tokens, leading to the potential of token leakage.
    - Credential exposure: With the use of numerous APIs, each with auth credentials, we have the potential of credential leakage.
    - Sensitive information exposure: Data is being passed through multiple systems and we have to be aware of how the information is used / stored.
  21. Middleware System

    - Serverless Solution: All compute functionality is offloaded to the serverless framework.
    - On-prem Solution: All compute functionality (and connection to the ML system) is run off of existing internal servers.
  22. Protecting Credentials

    - Use Secure Storage: Use a secure system to store API credentials or tokens, such as the AWS Systems Manager Parameter Store.
    - Least Privilege Principle: Functions requiring access to credentials should follow the least privilege principle, meaning they have access to only as much data as they absolutely need.
    - Separate Environment Credentials: Credentials used in a more open developer environment should not be the same used in a production deployment.
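As a rough sketch of the secure storage and separate-environment points, credentials can be looked up at runtime under an environment-specific path instead of being bundled with the function. The loader below takes an injected `fetchSecret` function; in a real deployment that function might wrap a secure store such as AWS Systems Manager Parameter Store, but here it is an in-memory stand-in, and the path layout and key names are hypothetical.

```javascript
// Builds a credential loader around an injected `fetchSecret` function.
// `fetchSecret` is a stand-in for a call to a secure store.
const createCredentialLoader = (fetchSecret, stage) => {
  const cache = new Map();
  return async (name) => {
    // Separate credentials per environment, e.g. /dev/ml-api-key vs /prod/ml-api-key.
    const path = `/${stage}/${name}`;
    if (!cache.has(path)) {
      cache.set(path, await fetchSecret(path));
    }
    return cache.get(path);
  };
};

// In-memory stand-in for the secure store (hypothetical values).
const fakeSecrets = { '/dev/ml-api-key': 'dev-key', '/prod/ml-api-key': 'prod-key' };
const fetchFromFake = async (path) => {
  if (!(path in fakeSecrets)) throw new Error(`No secret at ${path}`);
  return fakeSecrets[path];
};
```

Caching in module scope means a warm container fetches each secret only once; least privilege is enforced outside this code, by limiting which paths the function's role may read.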
  23. Token Downscoping

    - Access Token: Fully scoped access token
    - Downscoped Token: Tightly scoped child token
    - Channel Transmission: Transmit through uncontrolled channels
  24. Token Downscoping Components

    - Tightly scoped for single file: A token should only be scoped for the item needed for processing, such as a file.
    - Short lived: Downscoped tokens should only live for their natural useful time (e.g. 1 hour).
    - Revocable: Downscoped tokens may be revoked before natural expiration through the API.
    - Split read / write functions: To further scope token exposure, separate read / write tokens can be issued.
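The talk does not show how a downscoped token is requested. One common shape for this is the OAuth 2.0 token exchange grant (RFC 8693), sketched below under the assumption that the data provider supports it. The function only builds the request body and makes no network call; the resource URL and the scope names (`file_read`, `file_write`) are placeholders, and the real values would come from the provider's documentation.

```javascript
// Builds the form body for an OAuth 2.0 token exchange (RFC 8693) that
// trades a broad access token for one limited to a single resource.
// Scope names and the resource URL are supplied by the caller.
const buildDownscopeRequest = ({ accessToken, scopes, resourceUrl }) => {
  const params = new URLSearchParams();
  params.set('grant_type', 'urn:ietf:params:oauth:grant-type:token-exchange');
  params.set('subject_token', accessToken);
  params.set('subject_token_type', 'urn:ietf:params:oauth:token-type:access_token');
  params.set('scope', scopes.join(' '));
  params.set('resource', resourceUrl);
  return params;
};

// Separate read and write tokens for the same file (hypothetical scope names).
const fileUrl = 'https://api.example.com/files/12345';
const readRequest = buildDownscopeRequest({
  accessToken: 'PARENT_TOKEN', scopes: ['file_read'], resourceUrl: fileUrl
});
const writeRequest = buildDownscopeRequest({
  accessToken: 'PARENT_TOKEN', scopes: ['file_write'], resourceUrl: fileUrl
});
```

Issuing the read token to the ML system and keeping the write token for the metadata update limits what a leaked token can do.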
  25. Sensitive Information Exposure

    - Data in the files: What information is being transmitted through the channels in the files, and is it sensitive information?
    - Are channels secure: Are all connections between your systems, the serverless framework, and the machine learning system secure?
    - How the ML system handles data: Does the machine learning system store any data long-term, and how secure is that storage?
    - Logging sensitive information: Are you logging sensitive information during general program flow unintentionally?
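For the logging question in particular, one simple safeguard is to mask known-sensitive fields before anything is written to the log. The sketch below assumes a hand-maintained list of key names; the list and the `redact` / `safeLog` helpers are hypothetical.

```javascript
// Keys treated as sensitive in this sketch (hypothetical list).
const SENSITIVE_KEYS = new Set(['token', 'accessToken', 'authorization', 'pan', 'password']);

// Returns a deep copy of `value` with sensitive fields masked, so the
// copy can be logged without leaking secrets.
const redact = (value) => {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) =>
        (SENSITIVE_KEYS.has(key) ? [key, '[REDACTED]'] : [key, redact(inner)])
      )
    );
  }
  return value;
};

// Logs a payload only after it has passed through redact().
const safeLog = (label, payload) => console.log(label, JSON.stringify(redact(payload)));
```

A key-name list only catches fields it knows about, so it complements, rather than replaces, reviewing what each function logs.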
  26. Tokenisation Specification

    - Data Request: Sensitive information request
    - Cloud Data API: Data hosting service API
    - Secure Data Vault: Secure vault hosting data files
    Flow steps: 1. PAN; 2. PAN; 3. Token / Status; 4. Token / Status
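The exchange in the diagram, where a PAN goes in and a token plus status comes back, can be illustrated with an in-memory stand-in for the vault. This is a sketch only: `createVault`, its method names and the status strings are hypothetical, and a real vault would be a separate, access-controlled service rather than a `Map`.

```javascript
const { randomBytes } = require('crypto');

// In-memory stand-in for a secure data vault: swaps a PAN for a random
// token and can resolve the token back to the PAN.
const createVault = () => {
  const byToken = new Map();
  return {
    tokenize(pan) {
      // The token is random, so it reveals nothing about the PAN.
      const token = randomBytes(16).toString('hex');
      byToken.set(token, pan);
      return { token, status: 'ok' };
    },
    detokenize(token) {
      return byToken.has(token)
        ? { pan: byToken.get(token), status: 'ok' }
        : { pan: null, status: 'not_found' };
    }
  };
};
```

Downstream systems, including the serverless layer and the ML service, then only ever handle the token, never the PAN itself.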
  27. Wrapup Topics

    - Building Blocks: How are these systems built?
    - Best Practices: How do we architect the solution?
    - Security Considerations: How do we ensure data security?
  28. Better Data with Machine Learning and Serverless

    Slides: http://bit.ly/ato-bdml
    Jonathan LeBlanc (Director of Developer Advocacy @ Box) Twitter: @jcleblanc Email: jleblanc@box.com