Slide 1

Slide 1 text

Better Data with Machine Learning and Serverless Jonathan LeBlanc (Director of Developer Advocacy @ Box) Twitter: @jcleblanc Email: [email protected]

Slide 2

Slide 2 text

Agenda for Today
Building Blocks: How are these systems built?
Best Practices: How do we architect the solution?
Security Considerations: How do we ensure data security?

Slide 3

Slide 3 text

Part 1: Building Blocks

Slide 4

Slide 4 text

What Machine Learning Isn’t

Slide 5

Slide 5 text

Components of the System
Serverless Framework: Provides the compute and data management from the stored data location to the machine learning engine.
Machine Learning System: Provides the data enhancement capabilities that improve the underlying source data’s metadata (information about information).

Slide 6

Slide 6 text

Why Serverless?
On Demand: Machine learning tie-ins are only required when files need processing, which may be infrequent.
No hosting: You don’t have to run or manage any servers, containers, or VMs of your own.
Pricing based on use: Execution resources only run (and are only charged for) when you use them, typically resulting in very low server costs.
Different stack options: Multiple serverless systems exist to fit stack needs, including numerous open source options.

Slide 7

Slide 7 text

Components of the System
Webhook / Event Pump System: Handles notifications to the middleware layer when a new file should be processed.
Middleware Layer: Handles communication between the data source and machine learning systems.
Metadata Layer: The storage facility for machine learning data responses.
Token Downscoping System: Allows you to pass tightly scoped read / write tokens through multiple uncontrolled system layers.

Slide 8

Slide 8 text

How a Data / ML System Works
Cloud Data: Data store & initial metadata
Serverless Framework: Callback handler and code execution
Machine Learning: Data processor and enhancer
[Diagram arrow labels: Webhook, Metadata, Execute, Callback]
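The flow on this slide can be sketched as one function that the serverless framework runs when the webhook fires. This is a minimal sketch under stated assumptions, not the speaker's implementation: `handleWebhook`, `fetchFile`, `analyze`, `saveMetadata`, and the `fileId` field are hypothetical names, and the three collaborators are injected so that no particular storage or ML provider is assumed.

```javascript
// Hypothetical sketch of the webhook -> execute -> callback -> metadata flow.
// fetchFile, analyze and saveMetadata are injected stand-ins for the cloud
// data store, the ML system and the metadata layer; none is a real API.
const handleWebhook = async (webhookEvent, { fetchFile, analyze, saveMetadata }) => {
  // 1. Webhook: the data store tells us which file needs processing.
  const fileId = webhookEvent.fileId;

  // 2. Execute: pull the file and hand it to the machine learning system.
  const file = await fetchFile(fileId);
  const mlResult = await analyze(file);

  // 3. Callback / metadata: write the ML response back next to the file.
  await saveMetadata(fileId, mlResult);
  return { fileId, metadata: mlResult };
};

// Usage with in-memory fakes in place of real services.
const store = {};
const demo = handleWebhook(
  { fileId: 'file-1' },
  {
    fetchFile: async (id) => ({ id, content: 'a photo of a cat' }),
    analyze: async (file) => ({ keywords: file.content.split(' ').slice(-1) }),
    saveMetadata: async (id, data) => { store[id] = data; },
  }
);
```

Injecting the collaborators keeps the flow testable without network access and makes it easy to swap the ML provider later.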

Slide 9

Slide 9 text

Common Serverless Frameworks
AWS Lambda: https://aws.amazon.com/lambda/
Azure Functions: https://azure.microsoft.com/en-us/services/functions/
Google Cloud Functions: https://cloud.google.com/functions/
IronFunctions: https://github.com/iron-io/functions
OpenWhisk: https://openwhisk.apache.org/
Fission: https://fission.io/

Considerations
1. Your stack
2. Pricing / free use
3. Supported languages
4. Regional support

Slide 10

Slide 10 text

Machine Learning Frameworks

Audio / Video / Image
• [video] MS Video Indexer
• [audio] Voicebase
• [face] Hive AI
• [image] Clarifai
• [image] Google Vision
• [mixed] IBM Watson
• [moderation] MS Content Moderator
• [face] Kairos
• [audio] AT&T Speech
• [image] Amazon Rekognition

Text Extraction
• [id] Acuant
• [invoice] Rossum.AI
• [contract] eBrevia
• [lease] Leverton
• [resume] Textkernel
• [prediction] AmazonML
• [analysis] Aylien
• [classification] MonkeyLearn
• [natural language] ApiAI
• [sentiment] AlchemyText

Open Source
• TensorFlow
• Keras
• Scikit-learn
• MS Cognitive Toolkit
• Theano
• Caffe
• Torch
• Accord.NET

Slide 11

Slide 11 text

Part 2: Best Practices

Slide 12

Slide 12 text

Program Logic and Serverless Separation
Serverless function agnostic: The core logic of the function should be separate from the serverless requirements. Thin handlers / routers may be written on top of the core logic to maintain separation.
Service deployments: To allow for deployment across numerous serverless technologies, systems like serverless.com may be utilized.
Testability: The separation of concerns allows you to test the function separately from the container.
Handler: Separate handler from core program logic for testability.
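One way to picture the separation is a plain core function with thin adapters layered on top. The sketch below is illustrative only: `processFileEvent`, its payload shape, and both adapter names are hypothetical, and the second adapter assumes an Express-style `(req, res)` signature with an already parsed `req.body`.

```javascript
// Core logic: a plain function with no knowledge of any serverless platform.
// processFileEvent and its payload shape are hypothetical.
const processFileEvent = (payload) => {
  if (!payload || !payload.fileId) {
    return { ok: false, message: 'Missing fileId' };
  }
  return { ok: true, message: `Queued file ${payload.fileId}` };
};

// Thin AWS Lambda / API Gateway adapter: translate the platform event into a
// plain payload, call the core logic, translate the result back.
const lambdaHandler = (event, context, callback) => {
  let payload = null;
  try {
    payload = JSON.parse(event.body || 'null');
  } catch (err) {
    // Malformed JSON is treated the same as a missing payload.
  }
  const result = processFileEvent(payload);
  callback(null, { statusCode: result.ok ? 200 : 400, body: result.message });
};

// Thin Express-style adapter (req / res) reusing the same core logic.
const httpHandler = (req, res) => {
  const result = processFileEvent(req.body);
  res.status(result.ok ? 200 : 400).send(result.message);
};
```

Because `processFileEvent` takes and returns plain objects, it can be unit tested without emulating any container or gateway.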

Slide 13

Slide 13 text

AWS Lambda Handler

// API Gateway Handler
exports.handler = (event, context, callback) => {
  // Check for valid event (isValidEvent and processEvent are defined elsewhere)
  if (isValidEvent(event)) {
    // processEvent is expected to invoke callback when it finishes
    processEvent(event, callback);
  } else {
    callback(null, { statusCode: 200, body: 'Event received but invalid' });
  }
};

Slide 14

Slide 14 text

Dealing with Cold Starts
What is it: The latency experienced when a function is triggered and there isn’t a warm / idle container available to run it. A container is automatically dropped after a period of inactivity.
Options: You can either keep the container warm through memory increases and calls, or deal with the cold start.
Fewer libraries: The more libraries that are used, the longer it will take to start the container.
Smaller functions: Writing smaller functions decreases start time.
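A common way to apply these points is to exit early on a scheduled keep-warm ping and to load rarely used libraries lazily. The sketch below assumes a hypothetical convention in which the ping event carries `{ warmup: true }`; the built-in `crypto` module stands in for a heavier dependency.

```javascript
// Module scope runs once per container (the cold start) and is reused by
// every warm invocation, so keep the work done here small.
let coldStart = true;

// Hypothetical convention: a scheduled keep-warm ping carries { warmup: true }.
const handler = (event, context, callback) => {
  const wasCold = coldStart;
  coldStart = false;

  if (event && event.warmup) {
    // Exit immediately so the ping costs as little execution time as possible.
    return callback(null, { statusCode: 200, body: 'warm' });
  }

  // Load rarely used libraries lazily so they do not slow every cold start.
  const { createHash } = require('crypto');
  const digest = createHash('sha256').update(String(event.body || '')).digest('hex');

  callback(null, { statusCode: 200, body: JSON.stringify({ wasCold, digest }) });
};

exports.handler = handler;
```

The `wasCold` flag is only there to make cold starts visible in responses or logs while tuning.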

Slide 15

Slide 15 text

Exit Callback Hygiene
Error logging: With many serverless environments, proper callback use will provide full data logging.
Reliability: Failing to exit properly can result in your function executing until a timeout is hit. Timeouts may also cause subsequent invocations to require a cold start, which results in additional latency.
Cost: If a timeout occurs, you will be charged for the entire timeout time.
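One way to make a missed exit harder is to wrap the function logic so the callback always fires, whether the logic returns, throws, or rejects. This is a sketch, not a prescribed pattern: `withExit` is a hypothetical helper, and the 500 status for failures is an illustrative choice.

```javascript
// Hypothetical wrapper: guarantees the Lambda-style callback is invoked exactly
// once, whether the wrapped logic returns, throws or rejects.
const withExit = (logic) => (event, context, callback) => {
  let done = false;
  const exit = (err, response) => {
    if (done) return; // ignore accidental double exits
    done = true;
    callback(err, response);
  };

  Promise.resolve()
    .then(() => logic(event))
    .then((body) => exit(null, { statusCode: 200, body }))
    .catch((err) => {
      console.error('Event error:', err.message); // log before exiting
      exit(null, { statusCode: 500, body: 'Event error' });
    });
};

// Usage: the wrapped logic never touches the callback directly.
const handler = withExit((event) => {
  if (!event.body) throw new Error('empty body');
  return 'Event processed';
});
```

With this shape, a thrown error ends the invocation promptly instead of leaving the function running until the timeout.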

Slide 16

Slide 16 text

Processing AWS Lambda Exit Callbacks

// Success Callback
callback(null, { statusCode: 200, body: 'Event processed' });

// Error Callback
callback({ statusCode: 400, body: 'Event error' });

Slide 17

Slide 17 text

Writing Stateless Single Purpose Functions
Error isolation: Debugging and error handling are easier with function / concern isolation.
Scaling: With monolith functions, you have to optimize the entire function for all of its elements, rather than the specific functionality receiving the most calls / traffic.
Planning and testing: It’s easier to plan and write test plans for functions with singular concerns.

Slide 18

Slide 18 text

Valid Event Function

/**
 * Check for a valid event.
 * @param {object} indexerEvent - indexer event
 * @return {boolean} - true if valid event
 */
const isValidEvent = (indexerEvent) => {
  // Coerce to a boolean so the return value matches the documented type
  return Boolean(indexerEvent.body || indexerEvent.queryStringParameters);
};

Slide 19

Slide 19 text

Part 3: Security Considerations

Slide 20

Slide 20 text

Security Considerations
Serverless use consideration: Are serverless systems a viable / approved mechanism within your organization?
Token exposure: Many API auth systems are token based, with broadly scoped tokens, leading to the potential of token leakage.
Credential exposure: With the use of numerous APIs, each with auth credentials, we have the potential of credential leakage.
Sensitive information exposure: Data is being passed through multiple systems and we have to be aware of how the information is used / stored.

Slide 21

Slide 21 text

Middleware System
Serverless Solution: All compute functionality is offloaded to the serverless framework.
On-prem Solution: All compute functionality (and connection to the ML system) is run off of existing internal servers.

Slide 22

Slide 22 text

Protecting Credentials
Use Secure Storage: Use a secure system to store API credentials or tokens, such as the AWS Systems Manager Parameter Store.
Least Privilege Principle: Functions requiring access to credentials should follow the least privilege principle, meaning they have access to only as much data as they absolutely need.
Separate Environment Credentials: Credentials used in a more open developer environment should not be the same as those used in a production deployment.
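A small sketch of how these three points might combine: each function asks an injected secret store for the single credential it needs, under a name that includes the deployment stage. Everything here is hypothetical, including the `/ml-pipeline/<stage>/<key>` naming scheme and the stage list; in AWS the injected `getSecret` could wrap the Systems Manager Parameter Store, but no SDK call is shown or assumed.

```javascript
// Hypothetical naming scheme: one parameter path per deployment stage, so
// development and production never share a credential.
const parameterName = (stage, key) => `/ml-pipeline/${stage}/${key}`;

// The secret store is injected as any async function from name to value.
const loadCredential = async (getSecret, stage, key) => {
  const allowedStages = ['dev', 'staging', 'prod'];
  if (!allowedStages.includes(stage)) {
    throw new Error(`Unknown stage: ${stage}`);
  }
  // Least privilege: request only the one credential this function needs.
  const value = await getSecret(parameterName(stage, key));
  if (!value) {
    throw new Error(`Missing credential: ${key}`); // never log the value itself
  }
  return value;
};

// Usage with an in-memory fake store in place of a real secret service.
const fakeStore = { '/ml-pipeline/dev/ml-api-key': 'dev-key-123' };
const devKey = loadCredential(async (name) => fakeStore[name], 'dev', 'ml-api-key');
```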

Slide 23

Slide 23 text

Token Downscoping
Access Token: Fully scoped access token
Downscoped Token: Tightly scoped child token
Channel Transmission: Transmit through uncontrolled channels

Slide 24

Slide 24 text

Token Downscoping Components
Tightly scoped for single file: A token should only be scoped for the item needed for processing, such as a file.
Short lived: Downscoped tokens should only live for their natural useful time (e.g. 1 hour).
Revocable: Downscoped tokens may be revoked before natural expiration through the API.
Split read / write functions: To further scope token exposure, separate read / write tokens can be issued.
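As an illustration, a downscoping request can be modeled on the OAuth 2.0 token exchange grant (RFC 8693), which is an assumption here since the slides do not name a protocol. The sketch only builds the request body and sends nothing; the scope names, the file URL, and `buildDownscopeRequest` are hypothetical, and lifetime and revocation are left to the issuing server.

```javascript
// Hypothetical scope names; use the ones your API provider documents.
const SCOPES = { read: 'item_read', write: 'item_write' };

// Builds the form body for an RFC 8693 token exchange that swaps a fully
// scoped token for a child token limited to one file and one kind of access.
const buildDownscopeRequest = (accessToken, fileUrl, mode) => {
  const scope = SCOPES[mode];
  if (typeof scope !== 'string') {
    throw new Error(`Unknown mode: ${mode}`);
  }
  return new URLSearchParams({
    grant_type: 'urn:ietf:params:oauth:grant-type:token-exchange',
    subject_token: accessToken,
    subject_token_type: 'urn:ietf:params:oauth:token-type:access_token',
    scope,             // read OR write, never both
    resource: fileUrl, // a single file, not the whole account
  });
};

// Separate read and write requests for the same (hypothetical) file.
const file = 'https://api.example.com/files/123';
const readRequest = buildDownscopeRequest('PARENT_TOKEN', file, 'read');
const writeRequest = buildDownscopeRequest('PARENT_TOKEN', file, 'write');
```

Issuing the read and write tokens separately means the ML system can be handed the read token only, while the write token stays with the function that stores the results.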

Slide 25

Slide 25 text

Sensitive Information Exposure
Data in the files: What information is being transmitted through the channels in the files, and is it sensitive information?
Are channels secure: Are all connections between your systems, the serverless framework, and the machine learning system secure?
How the ML system handles data: Does the machine learning system store any data long-term, and how secure is that storage?
Logging sensitive information: Are you logging sensitive information during general program flow unintentionally?
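For the logging question in particular, one option is to pass every event through a redactor before it reaches the log. This is a minimal sketch: the list of sensitive field names and the `redact` / `logEvent` helpers are hypothetical, and it assumes events are plain JSON-like data without circular references.

```javascript
// Hypothetical list of field names that must never reach the logs.
const SENSITIVE_KEYS = new Set(['token', 'access_token', 'password', 'ssn', 'pan']);

// Returns a deep copy of value with sensitive fields masked.
const redact = (value) => {
  if (Array.isArray(value)) {
    return value.map(redact);
  }
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) =>
        SENSITIVE_KEYS.has(key.toLowerCase()) ? [key, '[REDACTED]'] : [key, redact(inner)]
      )
    );
  }
  return value;
};

// Log through the redactor instead of logging raw events.
const logEvent = (event) => console.log(JSON.stringify(redact(event)));

const safe = redact({ fileId: '123', auth: { token: 'abc' }, items: [{ pan: '4111' }] });
```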

Slide 26

Slide 26 text

Tokenisation Specification
Data Request: Sensitive information request
Cloud Data API: Data hosting service API
Secure Data Vault: Secure vault hosting data files

Flow
1. PAN (Data Request to Cloud Data API)
2. PAN (Cloud Data API to Secure Data Vault)
3. Token / Status (Secure Data Vault to Cloud Data API)
4. Token / Status (Cloud Data API to Data Request)
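The four-step exchange can be sketched with an in-memory stand-in for the vault, assuming PAN here means a payment card's primary account number. `TokenVault`, `handleDataRequest`, the `tok_` prefix, and the digit-length check are all hypothetical; a real vault would be a separate, hardened service.

```javascript
const { randomBytes } = require('crypto');

// In-memory stand-in for the secure data vault; illustration only.
class TokenVault {
  constructor() {
    this.byToken = new Map();
  }

  // Steps 2 and 3: the vault receives a PAN and answers with a token / status.
  tokenize(pan) {
    // Hypothetical format check: digits only, 12 to 19 characters.
    if (!/^\d{12,19}$/.test(pan)) {
      return { status: 'rejected', token: null };
    }
    const token = `tok_${randomBytes(16).toString('hex')}`;
    this.byToken.set(token, pan);
    return { status: 'ok', token };
  }

  // Only the vault can map a token back to the PAN.
  detokenize(token) {
    return this.byToken.get(token) || null;
  }
}

// Steps 1 and 4: the data API forwards the PAN and returns only the token.
const handleDataRequest = (vault, pan) => vault.tokenize(pan);

const vault = new TokenVault();
const reply = handleDataRequest(vault, '4111111111111111');
```

The random token carries no information about the PAN, so systems downstream of the data API can store and pass it without holding the sensitive value.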

Slide 27

Slide 27 text

Wrapup Topics
Building Blocks: How are these systems built?
Best Practices: How do we architect the solution?
Security Considerations: How do we ensure data security?

Slide 28

Slide 28 text

Better Data with Machine Learning and Serverless Slides: http://bit.ly/ato-bdml Jonathan LeBlanc (Director of Developer Advocacy @ Box) Twitter: @jcleblanc Email: [email protected]