Fullstack Data Science by Arian Pasquali

From Data Processing Pipeline using Luigi to Serverless Backend Architecture using AWS Lambda Functions.

DevOpsPorto

August 08, 2019

Transcript

  1. Full stack data science
    Building a data processing pipeline with Luigi and
    backend deployment using Amazon’s Lambda API
    Arian Pasquali
    August 2019

  2. Full stack data science
    Building a data processing pipeline with Luigi and
    backend deployment using Amazon’s Lambda API
    Arian Pasquali
    August 2019
    Engineering

  3. Case study: meuParlamento.pt
    The Portuguese Parliament in your pocket
    Non-profit and open source project
    Arian Pasquali - Nuno Moniz - Tomás Amaro
    [email protected]

  4. Motivation
    69.27%
    Abstention in the last elections in Portugal

  5. Our goals
    • Increase engagement in politics
    • Encourage healthy debate
    • Explore new ways to interact with the government's open data

  6. Our solution: meuParlamento.pt
    • Vote on real legislative proposals from your smartphone
    • See which political parties vote like you
    • Share results with your friends

  7. In a nutshell
    How it works with three simple gestures
    • Against: swipe left if you are against
    • In favor: swipe right if you approve
    • Abstention: swipe up; users can skip if they prefer not to vote

  8. Our solution

  9. Data source
    http://parlamento.pt

  10. Overview
    • Sources: parlamento.pt and arquivo.pt
    • Data processing pipeline: web scraping, refinement, text summarization, etc.
    • Storage: about 3,000 proposals
    • HTTP API endpoint: provides 10 random proposals to vote on

  11. Data collection and refinement
    Challenges building the data processing pipeline

  12. Data processing pipeline
    From plain Python scripts to a task management engine
    • Initially, the prototype involved different Python scripts using Jupyter Notebooks, Java and even R.
    • To structure the data processing we chose Luigi (https://github.com/spotify/luigi)
      • Task dependencies
      • Dashboard
      • Email notifications, etc.
    • As simple as it can be
      • Just define each task's dependencies and output; Luigi takes care of the rest
    • Challenges
      • Handling exceptions

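The "define dependencies and output, the engine does the rest" idea on this slide can be illustrated without Luigi itself. The sketch below is a simplified, standard-library stand-in that borrows Luigi's method names (`requires`, `output`, `run`). The `Task` base class, the `build` function, the two task classes and the file names are all hypothetical, and the real library does far more (scheduling, dashboard, notifications).

```python
import os
import tempfile


class Task:
    """Minimal stand-in for a pipeline task (not Luigi's real base class)."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        # Tasks that must be complete before this one runs.
        return []

    def output(self):
        # Path of the file this task produces.
        raise NotImplementedError

    def complete(self):
        # A task counts as done once its output file exists.
        return os.path.exists(self.output())

    def run(self):
        raise NotImplementedError


class DownloadProposal(Task):
    def output(self):
        return os.path.join(self.workdir, "proposal.txt")

    def run(self):
        # Placeholder for the real download step.
        with open(self.output(), "w", encoding="utf-8") as f:
            f.write("Texto da proposta de lei")


class SummarizeProposal(Task):
    def requires(self):
        return [DownloadProposal(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "summary.txt")

    def run(self):
        with open(self.requires()[0].output(), encoding="utf-8") as f:
            text = f.read()
        # Placeholder "summary": keep the first three words.
        with open(self.output(), "w", encoding="utf-8") as f:
            f.write(" ".join(text.split()[:3]))


def build(task, executed=None):
    """Run dependencies first, skipping any task whose output already exists."""
    executed = [] if executed is None else executed
    if task.complete():
        return executed
    for dependency in task.requires():
        build(dependency, executed)
    task.run()
    executed.append(type(task).__name__)
    return executed


workdir = tempfile.mkdtemp()
first_run = build(SummarizeProposal(workdir))
second_run = build(SummarizeProposal(workdir))  # nothing left to do
print(first_run, second_run)
```

Because completion is decided by whether the output file exists, rerunning the pipeline after a failure only repeats the steps that never finished.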

  13. Data pipeline
    Stages: web scraping, download PDF, PDF to text, text summarization, PDF parser, compute readability score, persistence
    Languages: Python, R, Python (for now)

  14. Data processing pipeline
    Important to keep in mind
    • Keep each step as simple as possible
      • Atomic tasks
      • Easier to test
      • If one task fails, you can restart from the last one that completed
    • Keep track of changes
      • Don’t overwrite data; keep every version
      • Save every change you make to the data
      • This also makes recovery possible at any point in the pipeline
    • Fail gracefully
      • Handle exceptions properly so that one error doesn’t break the entire workflow. This can be tricky, especially with wrappers and loops.
      • It pays off when things get complicated

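The "fail gracefully" point is easiest to get wrong inside loops. A minimal sketch, assuming a hypothetical per-document step `parse_document` and made-up document ids: each failure is recorded and the loop moves on, so one bad input does not abort the whole batch.

```python
def parse_document(doc_id, raw_text):
    """Hypothetical processing step that rejects empty documents."""
    if not raw_text.strip():
        raise ValueError(f"document {doc_id} is empty")
    return {"id": doc_id, "n_words": len(raw_text.split())}


def process_batch(documents):
    """Process every document, collecting failures instead of stopping."""
    results, failures = [], []
    for doc_id, raw_text in documents.items():
        try:
            results.append(parse_document(doc_id, raw_text))
        except ValueError as error:
            # Record what failed and why, then continue with the next item.
            failures.append((doc_id, str(error)))
    return results, failures


documents = {"p1": "Proposta de lei sobre educação", "p2": "   ", "p3": "Orçamento"}
results, failures = process_batch(documents)
print(len(results), "processed;", len(failures), "failed")
```

Keeping the list of failed ids makes it possible to retry only those documents later instead of rerunning everything.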

  15. Backend API
    From Flask to Serverless

  16. Backend requirements
    • Provide a simple endpoint
      1. Serve a list of random proposals to vote on
    • It should be
      • Easy to deploy
      • Easy to monitor
      • Cheap to scale
      • $$$ comes from our pockets
    meuParlamento API: HTTP API returning JSON

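The single endpoint described above boils down to sampling from the stored proposals. A minimal sketch, assuming proposals are held as a list of dictionaries; the function name, the field names and the optional `rng` parameter are illustrative and not taken from the project.

```python
import json
import random


def random_proposals(proposals, k=10, rng=random):
    """Return up to k distinct proposals chosen at random."""
    k = min(k, len(proposals))
    return rng.sample(proposals, k)


# Made-up data standing in for the roughly 3,000 stored proposals.
proposals = [{"id": i, "title": f"Proposal {i}"} for i in range(3000)]
batch = random_proposals(proposals, rng=random.Random(42))
payload = json.dumps({"proposals": batch})  # JSON body the API would return
print(len(batch))
```

Passing a seeded `random.Random` instance keeps the selection reproducible in tests, while the default uses the module-level generator.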

  17. Backend API using Flask
    Flask-based web API
    • Flask as Python web framework
    • MongoDB as storage (cloud)
    • Pros
      • Data updates directly from the pipeline
    • Cons
      • Bottleneck: Heroku’s free tier
      • Slow processing units, no load balancing, etc.
      • Expensive to scale if necessary
      • Too much trouble for such a simple endpoint

  18. Serverless Backend API
    Since the backend is almost completely stateless, why not serverless / Lambda functions?
    This was a good opportunity to try out serverless deployments.
    The cloud provider manages the allocation of machine resources.
    https://www.fullstackpython.com/serverless.html

    Pros
    • Minimal infrastructure management
    • Load balancing
    • Seems cheaper to scale
    • Easier deployment too

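For readers new to the model, a Lambda function is just a handler that receives an event and returns a response. The sketch below uses the response shape expected by API Gateway's Lambda proxy integration (`statusCode`, `headers`, `body`). The handler name, the sample data and the `count` query parameter are assumptions for illustration, and it is written as a plain function rather than with any framework.

```python
import json
import random

# Hypothetical in-memory data; a real deployment would load it from storage.
PROPOSALS = [{"id": i, "title": f"Proposal {i}"} for i in range(50)]


def handler(event, context):
    """Return a random batch of proposals as a JSON HTTP response."""
    # API Gateway sends None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    try:
        count = int(params.get("count", 10))
    except ValueError:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "count must be an integer"}),
        }
    count = max(1, min(count, len(PROPOSALS)))
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"proposals": random.sample(PROPOSALS, count)}),
    }


# Local invocation with a hand-built event; no AWS account is needed.
response = handler({"queryStringParameters": {"count": "5"}}, None)
print(response["statusCode"])
```

Since the handler keeps no state between calls, the provider can run as many copies in parallel as traffic requires.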

  19. Serverless Backend API
    There are a few Python packages for developing serverless applications:
    • Zappa
    • python-lambda
    • Chalice
    We chose Chalice (https://github.com/aws/chalice/)

    Pros
    • Very similar to Flask
    • Painless migration from previous Flask code
    • Provided by Amazon (trust)
    • Pretty decent documentation
    • Simple command line interface

  20. Serverless Backend API
    From Flask to serverless with AWS Chalice
    AWS Lambda
    • Pros
      • Cheaper to scale if necessary
      • The free tier is quite generous
      • Single-command deployment
      • Supports different stages (e.g. dev, test, prod, …)
      • Supports API versioning
      • Good monitoring tools
      • Advanced log analytics
      • Alerts (e.g. latency alerts)
    • Cons
      • Package size limit (50 MB)
      • Vendor lock-in: Chalice supports AWS only
      • The URL may change in some scenarios, so make sure you put a proxy URL in front of it

  21. Open question: remove MongoDB?
    AWS Lambda
    • It scales so well that the bottleneck is now the pool of available connections at the database (MongoDB)
      • Up to 200 connections on the MongoDB free tier
    • Solution
      • Query requirements are actually really simple
      • Remove MongoDB and use a simpler in-memory data structure?
    • Idea
      • Pipeline: save the proposals file to an Amazon S3 bucket
      • Backend: load the proposals file from the S3 bucket

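The idea of dropping MongoDB could look roughly like this: load the proposals file once per Lambda container and serve every later request from memory. In this sketch the download is injected as a function so the example runs anywhere; with boto3 the fetch would typically be `s3.get_object(Bucket=..., Key=...)["Body"].read()`. The class name, the fake fetch function and the sample data are hypothetical.

```python
import json


class ProposalCache:
    """Load the proposals file on first use and keep it in memory afterwards."""

    def __init__(self, fetch_bytes):
        self._fetch_bytes = fetch_bytes  # callable returning the file's raw bytes
        self._proposals = None

    def get(self):
        if self._proposals is None:
            # Runs once per process, e.g. on a Lambda cold start.
            self._proposals = json.loads(self._fetch_bytes().decode("utf-8"))
        return self._proposals


downloads = []  # one entry is appended per simulated download


def fake_s3_fetch():
    """Stand-in for an S3 download that records each call."""
    downloads.append(1)
    return json.dumps([{"id": 1, "title": "Proposal 1"}]).encode("utf-8")


cache = ProposalCache(fake_s3_fetch)
first = cache.get()
second = cache.get()  # served from memory, no second download
print(len(downloads))
```

The trade-off is freshness: a running container keeps serving the file it loaded until it is recycled, unless some refresh rule is added.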

  22. Next steps
    • Properly support API versioning
    • API proxy
    • Improve test coverage, documentation and website
    • The pipeline repository is not open yet, but it will be soon.
    http://github.com/meuparlamento

  23. Next steps (features)
    • Support session debates
    • Support proposals in real time with notifications
    • Improve the user interface
    • Support other public institutions (e.g. European Parliament, city councils, etc.)
    http://github.com/meuparlamento

  24. http://meuParlamento.pt
    by Arian Pasquali - Nuno Moniz - Tomás Amaro
    [email protected]
    http://github.com/meuparlamento
    Non-profit - Open source

    View full-size slide