Build A Serverless Data Pipeline

Short version of the serverless data pipeline talk for ServerlessCPH in Copenhagen

Lorna Mitchell

May 16, 2018

Transcript

  1. Build A Serverless
    Data Pipeline
    Lorna Mitchell, IBM
    https://lornajane.net/resources

  2. Stack Overflow Dashboard
    @lornajane

  3. Pipeline To Shift Data
    Bringing data from Stack Overflow into the dashboard my advocate team uses

  4. Why Go Serverless?
    • Costs nothing when idle
    • Small application, simple architecture
    • Bursty usage since it runs from a cron job
    • No real-time requirement
    • Easily within free tier

  5. An Aside About Databases

  6. Document Databases
    Store collections of schemaless documents, in JSON
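A minimal sketch of what "schemaless documents, in JSON" can look like in practice. The field names and values here are invented for illustration and are not taken from the talk's database:

```javascript
// Two documents in the same collection; they need not share any fields.
// All names and values are hypothetical.
var docs = [
  { _id: '101', type: 'question', title: 'Example question', tags: ['couchdb'] },
  { _id: 'user-7', type: 'user', name: 'example', teams: ['advocates'] }
];

// With no schema to rely on, code usually selects documents by a field it expects.
var questions = docs.filter(function (doc) {
  return doc.type === 'question';
});

// Each document is plain JSON when stored or sent over HTTP.
var asJson = JSON.stringify(docs[0]);
```

A `type` field like this is a common convention for telling document kinds apart, not something the database enforces.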

  7. Apache CouchDB
    • Modern, robust, scalable document database
    • HTTP API
    • JSON data format
    • Best replication on the planet (probably)

  8. OfflineFirst Applications
    This app is OfflineFirst:
    • Client side JS
    • Client side copy of DB using PouchDB
    • Background sync to serverside CouchDB

  9. Build the Data Pipeline

  10. Serverless Functions
    • independent
    • single purpose
    • testable
    • scalable

  11. Start with Security
    The bx wsk tool needs an API key or user credentials
    Web actions: we know how to secure HTTP connections, so do it!
    • Auth standards e.g. JWT
    • Security in transmission: use HTTPS
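One way the JWT point could look inside a web action, as a rough sketch rather than the talk's implementation. It checks an HS256 signature with Node's built-in `crypto` module. The parameter name `jwtSecret` and the `sub` claim are assumptions, and a real deployment would more likely use a maintained JWT library that also checks the `alg` header and expiry:

```javascript
var crypto = require('crypto');

// Encode a Buffer or string as base64url (the encoding JWTs use).
function base64url(input) {
  return Buffer.from(input).toString('base64')
    .replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

// Sign "header.payload" with HMAC-SHA256 and return the base64url signature.
function sign(data, secret) {
  return base64url(crypto.createHmac('sha256', secret).update(data).digest());
}

// Build a token; included only so the sketch can be exercised end to end.
function makeToken(payload, secret) {
  var head = base64url(JSON.stringify({ alg: 'HS256', typ: 'JWT' }));
  var body = base64url(JSON.stringify(payload));
  return head + '.' + body + '.' + sign(head + '.' + body, secret);
}

// Return the decoded payload if the signature matches, otherwise null.
function verifyToken(token, secret) {
  var parts = String(token).split('.');
  if (parts.length !== 3) { return null; }
  var expected = Buffer.from(sign(parts[0] + '.' + parts[1], secret));
  var actual = Buffer.from(parts[2]);
  if (expected.length !== actual.length ||
      !crypto.timingSafeEqual(expected, actual)) { return null; }
  var json = Buffer.from(parts[1].replace(/-/g, '+').replace(/_/g, '/'), 'base64');
  return JSON.parse(json.toString('utf8'));
}

// Hypothetical web action: refuse requests that lack a valid bearer token.
// jwtSecret is an assumed parameter name, bound to the action at deploy time.
function main(params) {
  var headers = params.__ow_headers || {};
  var auth = headers.authorization || '';
  var claims = verifyToken(auth.replace(/^Bearer /, ''), params.jwtSecret);
  if (!claims) { return { statusCode: 401, body: { error: 'unauthorized' } }; }
  return { statusCode: 200, body: { user: claims.sub } };
}
```

The comparison uses `crypto.timingSafeEqual` so that checking a signature does not leak how many characters matched.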

  12. Logging Considerations
    • Standard, configurable logging setup
    • Use a trace_id to link requests between services
    • Aggregate logs to a central place, ensure search functionality
    • Collect metrics (invocations, execution time, error rates)
    • Display metrics on a dashboard
    • Have appropriate, configurable alerting
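The trace_id idea above might be sketched like this. The log field names, the action name `collector` and the helper functions are illustrative assumptions, not the talk's code:

```javascript
var crypto = require('crypto');

// Build one structured log line; trace_id ties it to the originating request.
function logLine(level, traceId, action, message, extra) {
  var entry = Object.assign({
    ts: new Date().toISOString(),
    level: level,
    trace_id: traceId,
    action: action,
    msg: message
  }, extra || {});
  return JSON.stringify(entry);
}

// Reuse the caller's trace_id if present, otherwise start a new trace.
function ensureTraceId(params) {
  return params.trace_id || crypto.randomBytes(8).toString('hex');
}

// Hypothetical action: logs with the trace_id and passes it downstream.
function main(params) {
  var traceId = ensureTraceId(params);
  var started = Date.now();
  console.log(logLine('info', traceId, 'collector', 'started'));
  var result = { items: params.items || [], trace_id: traceId };
  console.log(logLine('info', traceId, 'collector', 'finished',
    { duration_ms: Date.now() - started, item_count: result.items.length }));
  return result;
}
```

Because each line is a single JSON object, a log aggregator can index `trace_id` and `duration_ms` directly, which covers both the search and the metrics bullets.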

  13. Pipeline Actions
    Sequence socron
    • collector makes an API call, passes on data
    • invoker fires many actions: one for each item
    Sequence qhandler
    • storer inserts or updates the record
    • notifier sends a webhook to Slack or a bot
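To make the sequence idea concrete, here is a small local simulation. On the platform a sequence is configured rather than coded, so `sequence` is a hypothetical helper that only mimics the behaviour, and the two action bodies are placeholders:

```javascript
// Run actions in order, feeding each result to the next, as a sequence does.
function sequence(actions) {
  return function (params) {
    return actions.reduce(function (chain, action) {
      return chain.then(action);
    }, Promise.resolve(params));
  };
}

// Stand-ins for the real actions; the bodies are placeholders.
function storer(params) {
  return { question_id: params.question.question_id, status: 'new' };
}
function notifier(data) {
  return { payload: data.status === 'new' ? 'Notified' : 'Complete' };
}

var qhandler = sequence([storer, notifier]);
qhandler({ question: { question_id: 42 } }).then(function (result) {
  console.log(result); // { payload: 'Notified' }
});
```

The point is that each action sees only the previous action's output, which is what keeps them independent and individually testable.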

  14. Pipeline Actions

  15. Collector
    var request = require('request');
    function main(message) {
      return new Promise(function(resolve, reject) {
        // Tags are joined with ';' for the query string
        var tagged = message.tags.join(';');
        var r = { method: 'get', url: 'https://api.stackexchange.com …
        request(r, function(err, response, body) {
          if (err) { return reject(err); }
          // reject rather than throw: a throw inside this callback would not reject the promise
          if (response.statusCode != 200) { return reject(new Error('status …
          resolve({ items: body.items });
        });
      });
    }
    module.exports = main;

  16. Invoker
    function main(args) {
      return new Promise(function(resolve, reject) {
        var openwhisk = require('openwhisk');
        var ow = openwhisk();
        var actions = args.items.map(function (item) {
          return ow.actions.invoke(
            {actionName: "stackoverflow/qhandler", params: {questio …
        });
        return Promise.all(actions).then(function (results) {
          return resolve({payload: "All OK: " + results.length + " …
        });
      });
    }
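The fan-out pattern the invoker uses can be tried locally with a sketch like the one below. The `invoke` function is injected so the code runs without the `openwhisk` package; in the real action it would wrap `ow.actions.invoke`. The `question` parameter name is inferred from the storer, and the payload wording is a placeholder because the slide's string is cut off:

```javascript
// Fan out: call invoke once per item and wait for every call to resolve.
// invoke is injected here; this is an assumption made to keep the sketch runnable.
function fanOut(items, invoke) {
  var calls = items.map(function (item) {
    return invoke({ actionName: 'stackoverflow/qhandler', params: { question: item } });
  });
  // Returning the chain lets a failed invocation reject the whole result.
  return Promise.all(calls).then(function (results) {
    return { payload: 'All OK: ' + results.length + ' invoked' };
  });
}

// A fake invoke that resolves the way a non-blocking invocation would.
var fakeInvoke = function (options) {
  return Promise.resolve({ activationId: 'fake-' + options.params.question.question_id });
};

fanOut([{ question_id: 1 }, { question_id: 2 }], fakeInvoke).then(function (result) {
  console.log(result.payload); // prints "All OK: 2 invoked"
});
```

Returning the `Promise.all` chain directly, instead of wrapping it in `new Promise`, means an error from any single invocation surfaces as a rejection.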

  17. Storer
    function main(message) {
      var cloudant = require('cloudant')({url: message.cloudantURL, …
      var db = cloudant.db.use(message.dbname);
      var id = message.question.question_id.toString();
      return getDoc(db, id).then(function(data) {
        if (data === null) { // so insert
          message.question.tags = message.question.tags.sort(); // …
          var obj = { _id: id, type: 'question', owner: null, statu …
          return db.insert(obj).then(function(data) {
            return obj; // pass on the new object to the next actio …
          });
        } else { ... }
      });
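The insert-or-update flow in the storer can be sketched against an in-memory stand-in for the database. The slide does not show `getDoc`, so the version here is a guess at its behaviour (resolving to `null` on a 404, which assumes the database client sets `statusCode` on its errors). The `'updated'` status and the `fakeDb` helper are also hypothetical:

```javascript
// Resolve to the document, or null if the database reports 404 for it.
function getDoc(db, id) {
  return db.get(id).catch(function (err) {
    if (err.statusCode === 404) { return null; }
    throw err;
  });
}

// Insert a new question, or update the stored one while keeping its _rev.
function upsertQuestion(db, question) {
  var id = question.question_id.toString();
  return getDoc(db, id).then(function (existing) {
    var doc = existing === null
      ? { _id: id, type: 'question', status: 'new', question: question }
      : Object.assign({}, existing, { status: 'updated', question: question });
    return db.insert(doc).then(function () { return doc; });
  });
}

// In-memory stand-in for a database handle, only for running the sketch.
function fakeDb() {
  var docs = {};
  return {
    get: function (id) {
      if (docs[id]) { return Promise.resolve(docs[id]); }
      var err = new Error('missing');
      err.statusCode = 404;
      return Promise.reject(err);
    },
    insert: function (doc) {
      docs[doc._id] = Object.assign({}, doc, { _rev: '1-fake' });
      return Promise.resolve({ ok: true, id: doc._id });
    }
  };
}
```

Carrying the existing `_rev` into the update matters with CouchDB-style databases, which reject writes to an existing document that do not name its current revision.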

  18. Notifier
    function main(data) {
      return new Promise(function(resolve, reject) {
        var request = require('request');
        if(data.status == 'new') {
          var event = { type: "new-question", data: data };
          request({
            url: hardcoded_hubot_url, method: "POST", headers: {"Co …
          }, function (err, response, body) {
            if(err) { reject ({payload: "Failed"});
            } else { resolve( {payload: "Notified"} ); }
          });
        } else { resolve( {payload: "Complete"} ); }
      });

  19. ... and breathe!

  20. Deployment
    • IBM Cloud Deployments or TravisCI
    • Deploy on commit (optionally just what has changed)
    • Recreate triggers and rules if appropriate
    • Use environment variables for secrets
    • Install bx command to use at deploy time
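For the point about environment variables, a deploy script might guard against missing secrets along these lines. The variable names are made up for the example, and a real script would pass `process.env` instead of the literal object:

```javascript
// Read required settings from the environment and fail fast if any is missing.
function loadConfig(env, names) {
  var missing = names.filter(function (name) { return !env[name]; });
  if (missing.length > 0) {
    throw new Error('Missing environment variables: ' + missing.join(', '));
  }
  var config = {};
  names.forEach(function (name) { config[name] = env[name]; });
  return config;
}

// Hypothetical variable names and placeholder values.
var config = loadConfig(
  { CLOUDANT_URL: 'https://example.invalid', SLACK_WEBHOOK_URL: 'https://example.invalid/hook' },
  ['CLOUDANT_URL', 'SLACK_WEBHOOK_URL']
);
```

Failing before any deployment step runs avoids a half-deployed pipeline whose actions are missing their credentials.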

  21. Serverless And Data

  22. Resources
    • Cloud Functions: https://console.bluemix.net/openwhisk/
    • Code: https://github.com/ibm-watson-data-lab/soingest
    • My blog: https://lornajane.net/
    • OpenWhisk: https://openwhisk.org/
    • CouchDB: https://couchdb.apache.org/
    • Offline First: https://offlinefirst.org/