Build A Serverless Data Pipeline

Short version of the serverless data pipeline talk for ServerlessCPH in Copenhagen

Lorna Mitchell

May 16, 2018

Transcript

  1. Build A Serverless Data Pipeline Lorna Mitchell, IBM https://lornajane.net/resources

  2. Stack Overflow Dashboard @lornajane

  3. Pipeline To Shift Data
    Bringing data from Stack Overflow into the dashboard my advocate team uses
  4. Why Go Serverless?
    • Costs nothing when idle
    • Small application, simple architecture
    • Bursty usage since it runs from a cron
    • No real-time requirement
    • Easily within free tier
  5. An Aside About Databases

  6. Document Databases
    Store collections of schemaless documents, in JSON

  7. Apache CouchDB
    • Modern, robust, scalable document database
    • HTTP API
    • JSON data format
    • Best replication on the planet (probably)
  8. OfflineFirst Applications
    This app is OfflineFirst:
    • Client-side JS
    • Client-side copy of DB using PouchDB
    • Background sync to server-side CouchDB
  9. Build the Data Pipeline

  10. Serverless Functions
    • independent
    • single purpose
    • testable
    • scalable
  11. Start with Security
    Need an API key or user creds for the bx wsk tool.
    Web actions: we know how to secure HTTP connections, so do it!
    • Auth standards, e.g. JWT
    • Security in transmission: use HTTPS
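
As an illustration of the "auth standards" bullet, here is a minimal sketch of checking an HS256-signed JWT inside a web action, using only Node's built-in crypto module. The `jwtSecret` parameter, the `sub` claim, and the response bodies are assumptions for the example, not taken from the talk. The sketch skips expiry (`exp`) and header checks, so a maintained JWT library is likely the better choice in practice.

```javascript
const crypto = require('crypto');

// Encode a JSON-serialisable value as unpadded base64url, as JWT requires.
function b64url(value) {
  return Buffer.from(JSON.stringify(value)).toString('base64url');
}

// Sign a payload as an HS256 JWT. `secret` is a hypothetical shared secret.
function signToken(payload, secret) {
  const head = b64url({ alg: 'HS256', typ: 'JWT' });
  const body = b64url(payload);
  const sig = crypto.createHmac('sha256', secret)
    .update(head + '.' + body).digest('base64url');
  return head + '.' + body + '.' + sig;
}

// Return the payload if the signature is valid, otherwise null.
function verifyToken(token, secret) {
  const parts = String(token).split('.');
  if (parts.length !== 3) { return null; }
  const expected = crypto.createHmac('sha256', secret)
    .update(parts[0] + '.' + parts[1]).digest();
  const given = Buffer.from(parts[2], 'base64url');
  // Compare in constant time; lengths must match before timingSafeEqual.
  if (given.length !== expected.length ||
      !crypto.timingSafeEqual(given, expected)) {
    return null;
  }
  return JSON.parse(Buffer.from(parts[1], 'base64url').toString('utf8'));
}

// Web action entry point: the request headers arrive in params.__ow_headers.
// `jwtSecret` is an assumed default parameter bound at deploy time.
function main(params) {
  const header = (params.__ow_headers || {}).authorization || '';
  const claims = verifyToken(header.replace(/^Bearer /, ''), params.jwtSecret);
  if (!claims) {
    return { statusCode: 401, body: { error: 'unauthorized' } };
  }
  return { statusCode: 200, body: { user: claims.sub } };
}
```

The signature check only proves who issued the token; HTTPS is still needed so the token cannot be read in transit.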
  12. Logging Considerations
    • Standard, configurable logging setup
    • Use a trace_id to link requests between services
    • Aggregate logs to a central place, ensure search functionality
    • Collect metrics (invocations, execution time, error rates)
    • Display metrics on a dashboard
    • Have appropriate, configurable alerting
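
One way the trace_id idea could look in an action is sketched below: reuse the id if the caller supplied one, create one otherwise, log structured JSON lines, and hand the id on to the next action. The field names (`ts`, `level`, `msg`, `duration_ms`) and the `example` action name are assumptions for illustration only.

```javascript
const crypto = require('crypto');

// Reuse the caller's trace_id if one was passed in, otherwise start a new trace.
function traceIdFor(params) {
  return params.trace_id || crypto.randomUUID();
}

// Build one structured log line; JSON is easy to aggregate and search.
function logLine(level, traceId, message, extra) {
  return JSON.stringify(Object.assign(
    { ts: new Date().toISOString(), level: level, trace_id: traceId, msg: message },
    extra || {}
  ));
}

function main(params) {
  const traceId = traceIdFor(params);
  const started = Date.now();
  console.log(logLine('info', traceId, 'action started', { action: 'example' }));
  // ... the real work of the action would go here ...
  console.log(logLine('info', traceId, 'action finished',
    { duration_ms: Date.now() - started }));
  // Pass the trace_id along so the next action logs under the same id.
  return { trace_id: traceId };
}
```

Because every line carries the same `trace_id`, a search in the central log store for that value should return the whole journey of one request across actions.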
  13. Pipeline Actions
    Sequence socron
    • collector makes an API call, passes on data
    • invoker fires many actions: one for each item
    Sequence qhandler
    • storer inserts or updates the record
    • notifier sends a webhook to Slack or a bot
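
In OpenWhisk the platform does the chaining (a sequence is created with `wsk action create <name> --sequence a,b`), so no code is needed for it. The sketch below only mimics the data flow locally, which can be handy in tests: each action's result becomes the next action's input. The `sequence` helper and the stand-in action bodies are hypothetical.

```javascript
// Run actions in order, feeding each result into the next action.
function sequence(actions) {
  return function (params) {
    return actions.reduce(function (chain, action) {
      return chain.then(action);
    }, Promise.resolve(params));
  };
}

// Stand-in actions (hypothetical bodies; the real ones call external services).
function storer(message) {
  return Promise.resolve({
    id: String(message.question.question_id),
    status: 'new'
  });
}

function notifier(data) {
  return { payload: data.status === 'new' ? 'Notified' : 'Complete' };
}

// Local equivalent of the qhandler sequence: storer, then notifier.
const qhandler = sequence([storer, notifier]);
```

Calling `qhandler({ question: { question_id: 42 } })` returns a promise for the notifier's result, the same shape the platform would report for the whole sequence.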
  14. Pipeline Actions

  15. Collector

        var request = require('request');

        function main(message) {
          return new Promise(function(resolve, reject) {
            var tagged = message.tags.join(';');
            var r = { method: 'get', url: https://api.stackexchange.com …
            request(r, function(err, response, body) {
              if (err) { return reject(err); }
              // reject rather than throw, so the failure reaches the promise
              if (response.statusCode != 200) { return reject(new Error('status …
              resolve({ items: body.items });
            });
          });
        }

        module.exports = main;
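
The collector's error handling can be exercised without the network by injecting the HTTP call. In this sketch `httpGet` is a hypothetical stand-in with the same callback shape as the slide, and `message.apiUrl` and the `tagged` query parameter are assumptions, since the real URL is cut off on the slide.

```javascript
// `httpGet` is a hypothetical stand-in for the HTTP client: it takes a URL and
// calls back with (err, response, body).
function collect(message, httpGet) {
  return new Promise(function (resolve, reject) {
    const tagged = message.tags.join(';');
    // `message.apiUrl` is an assumed parameter, not the URL used in the talk.
    const url = message.apiUrl + '?tagged=' + encodeURIComponent(tagged);
    httpGet(url, function (err, response, body) {
      if (err) { return reject(err); }
      if (response.statusCode !== 200) {
        // Reject rather than throw: an exception thrown inside this callback
        // would not reject the promise.
        return reject(new Error('unexpected status ' + response.statusCode));
      }
      resolve({ items: body.items });
    });
  });
}
```

With a stub such as `function (url, cb) { cb(null, { statusCode: 500 }, {}); }` the promise rejects, which is what lets the platform record the activation as failed.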
  16. Invoker

        function main(args) {
          return new Promise(function(resolve, reject) {
            var openwhisk = require('openwhisk');
            var ow = openwhisk();
            var actions = args.items.map(function (item) {
              return ow.actions.invoke(
                {actionName: "stackoverflow/qhandler", params: {questio …
            });
            return Promise.all(actions).then(function (results) {
              return resolve({payload: "All OK: " + results.length + " …
            });
          });
        }
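
The fan-out pattern on this slide can be sketched independently of the platform client. Here `invoke` is a hypothetical stand-in that takes `{ actionName, params }` and returns a promise; the `question` parameter name and the wording of the final payload are assumptions, because both are cut off on the slide.

```javascript
// Fire one invocation per item and wait for all of them.
// `invoke` is a hypothetical stand-in for the platform's invoke call.
function fanOut(items, invoke) {
  const calls = items.map(function (item) {
    return invoke({
      actionName: 'stackoverflow/qhandler',
      params: { question: item } // assumed parameter name
    });
  });
  // Promise.all rejects as soon as any single invocation fails.
  return Promise.all(calls).then(function (results) {
    return { payload: 'All OK: ' + results.length + ' items' };
  });
}
```

Since `Promise.all` already returns a promise, the sketch returns it directly instead of wrapping it in `new Promise`; a rejection from any invocation then propagates on its own.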
  17. Storer

        function main(message) {
          var cloudant = require('cloudant')({url: message.cloudantURL, …
          var db = cloudant.db.use(message.dbname);
          var id = message.question.question_id.toString();
          return getDoc(db, id).then(function(data) {
            if (data === null) { // so insert
              message.question.tags = message.question.tags.sort(); // …
              var obj = { _id: id, type: 'question', owner: null, statu …
              return db.insert(obj).then(function(data) {
                return obj; // pass on the new object to the next actio …
              });
            } else { ... }
          });
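
The slide relies on a `getDoc` helper that is not shown. One plausible shape for it, and for the insert-or-update logic around it, is sketched below against an in-memory stand-in for the database so that it runs anywhere. The `fakeDb` object, the `'updated'` status value, and the document layout are assumptions; only the `'new'` status and the null-means-insert convention come from the slides.

```javascript
// In-memory stand-in for the database, just enough to exercise the logic.
function fakeDb() {
  const docs = new Map();
  return {
    get: function (id) {
      if (!docs.has(id)) {
        const err = new Error('missing');
        err.statusCode = 404;
        return Promise.reject(err);
      }
      return Promise.resolve(docs.get(id));
    },
    insert: function (doc) {
      docs.set(doc._id, doc);
      return Promise.resolve({ ok: true, id: doc._id });
    }
  };
}

// Resolve to the document, or to null when it does not exist yet.
function getDoc(db, id) {
  return db.get(id).catch(function (err) {
    if (err.statusCode === 404) { return null; }
    throw err; // anything other than "not found" is a real failure
  });
}

// Insert a new question, or update the stored one.
function store(db, question) {
  const id = question.question_id.toString();
  return getDoc(db, id).then(function (existing) {
    const status = existing === null ? 'new' : 'updated';
    // Start from the stored document so its existing fields are kept.
    const base = existing || { _id: id, type: 'question' };
    const doc = Object.assign({}, base, { question: question, status: status });
    return db.insert(doc).then(function () { return doc; });
  });
}
```

Against a real CouchDB or Cloudant database, an update must carry the stored document's `_rev`; copying the existing document before changing it, as above, keeps that field in place.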
  18. Notifier

        function main(data) {
          return new Promise(function(resolve, reject) {
            var request = require('request');
            if(data.status == 'new') {
              var event = { type: "new-question", data: data };
              request({
                url: hardcoded_hubot_url, method: "POST", headers: {"Co …
              }, function (err, response, body) {
                if(err) { reject ({payload: "Failed"});
                } else { resolve( {payload: "Notified"} ); }
              });
            } else { resolve( {payload: "Complete"} ); }
          });
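
The notifier's decision logic can be separated from the HTTP call so that it is testable on its own. In this sketch `post` is a hypothetical function that sends the event to the webhook and returns a promise; the payload strings and the event shape follow the slide.

```javascript
// Notify only for new questions; everything else passes straight through.
// `post` is a hypothetical stand-in for the webhook POST and returns a promise.
function notify(data, post) {
  if (data.status !== 'new') {
    return Promise.resolve({ payload: 'Complete' });
  }
  const event = { type: 'new-question', data: data };
  return post(event).then(
    function () { return { payload: 'Notified' }; },
    function () { return Promise.reject({ payload: 'Failed' }); }
  );
}
```

Keeping the webhook URL out of the function, rather than hardcoding it, also makes it easier to point the action at a test endpoint.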
  19. ... and breathe!

  20. Deployment
    • IBM Cloud Deployments or TravisCI
    • Deploy on commit (optionally just what has changed)
    • Recreate triggers and rules if appropriate
    • Use environment variables for secrets
    • Install the bx command to use at deploy time
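
For the "environment variables for secrets" bullet, a deploy script might check its settings up front so that a missing secret fails the build rather than producing a half-configured action. This is a minimal sketch; the `loadConfig` helper and the variable names in the usage note are hypothetical.

```javascript
// Read required settings from the environment and fail fast if any is missing.
function loadConfig(env, required) {
  const missing = required.filter(function (name) { return !env[name]; });
  if (missing.length > 0) {
    throw new Error('Missing environment variables: ' + missing.join(', '));
  }
  const config = {};
  required.forEach(function (name) { config[name] = env[name]; });
  return config;
}
```

A deploy script could call `loadConfig(process.env, ['CLOUDANT_URL', 'WEBHOOK_URL'])` and pass the values to the actions as parameters, so the secrets never appear in the repository.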
  21. Serverless And Data

  22. Resources
    • Cloud Functions: https://console.bluemix.net/openwhisk/
    • Code: https://github.com/ibm-watson-data-lab/soingest
    • My blog: https://lornajane.net/
    • OpenWhisk: https://openwhisk.org/
    • CouchDB: https://couchdb.apache.org/
    • Offline First: https://offlinefirst.org/