Detailed version: From Data Processing Pipeline using Luigi to Serverless Backend Architecture using AWS Lambda Functions

Extended version of the talk presented at DevOps Porto #30: Lightning Talks with Python Porto

Full Title:
Fullstack Data Science. From Data Processing Pipeline using Luigi to Serverless Backend Architecture using AWS Lambda Functions

During this talk I will discuss how the project was built. You will learn how we structured the data processing pipeline with Luigi and how we built a backend API for the mobile app on a serverless architecture with AWS Lambda.

As a side project and non-profit endeavour, the project was created to bring the Parliament to each user's smartphone. The idea is to allow any citizen to act as a Member of Parliament and vote on the many proposals that have been debated over the years, while focusing on privacy and anonymity: no voting data is recorded.

This is an open source project that combines web scraping, text summarization and data mining to deliver content to users, and I hope to give the audience an overview of how it works.



August 08, 2019


  1. Full stack data science Building a data processing pipeline with

    Luigi and backend deployment using Amazon’s Lambda API Arian Pasquali August 2019
  2. Full stack data science Building a data processing pipeline with

    Luigi and backend deployment using Amazon’s Lambda API Arian Pasquali August 2019 Engineering
  3. the Portuguese Parliament in your pocket non-profit and open

    source project Arian Pasquali - Nuno Moniz - Tomás Amaro Case study
  4. 69.27% Abstention in the last elections in Portugal 4 Motivation

  5. Our goals
    • Increase engagement in politics
    • Encourage healthy debate
    • Explore new ways to interact with Government's open data
  6. Our solution
    • Vote on real legislative proposals from your smartphone
    • See which political parties vote like you
    • Share results with your friends
  7. In a nutshell: how it works with three simple gestures
    • Against: swipe left if you are against
    • In favor: swipe right if you approve
    • Abstention: swipe up
    • Users can skip if they prefer not to vote
  8. 8 Our solution

  9. Data source

  10. Overview
    • Data processing pipeline: web scraping, refinement, text summarization, etc.
    • Storage: about 3,000 proposals
    • HTTP API endpoint: provide 10 random proposals to vote on
  11. Data collection and refinement Challenges building the data processing pipeline

  12. Data processing pipeline: from plain Python scripts to a task management engine
    • Initially, the prototype involved different Python scripts using Jupyter Notebooks, Java and even R.
    • To structure the data processing we chose Luigi
      • Task dependencies
      • Dashboard
      • Email notifications, etc.
    • As simple as it can be
      • Just define each task's dependencies and output. It takes care of the rest
    • Challenges
      • Handling exceptions
  13. Web scraping Download PDF PDF to Text Text Summarization

    PDF Parser Compute Readability Score Persistence Python R Python (for now) Data pipeline
  14. 14 Data processing pipeline Starting Luigi server

  15. 15 Data processing pipeline
 Luigi as task management engine

  16. 16 Data processing pipeline Luigi Task example Fetch proposal from Output - saves proposal data in json file
  17. 17 Data processing pipeline Running our Luigi Task

  18. 18 Data processing pipeline Fetch proposal sample output file ./results/raw/proposal_39526.json

  19. 19

  20. 20 Data processing pipeline Download PDF Task Input - requires

    previous task Output - saves pdf locally
  21. 21 Data processing pipeline PDF parser Task Input - requires

    previous pdf download task 
 Output - saves pdf in plain text
  22. Data processing pipeline: important to keep in mind
    • Keep each step as simple as possible
      • Atomic tasks
      • Easier to test
      • If one task fails you can start again from the last one
    • Keep track of changes
      • Don't update data, keep it.
      • Save every change you make to the data
      • It also facilitates recovery at any point in the pipeline
    • Fail gracefully
      • Handle exceptions properly so they don't break the entire workflow. It can be tricky sometimes, especially with wrappers and loops.
    • It pays off when things get complicated
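To illustrate the "keep data" and "fail gracefully" advice, here is a small standard-library sketch. It is not the project's code: `summarize` and `run_step` are hypothetical names, and the one-sentence "summary" is only a placeholder for a real step. Each result goes to a new file, inputs are left untouched, and a failing item is logged and recorded instead of aborting the loop.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("pipeline")


def summarize(text):
    """Hypothetical processing step: keep only the first sentence."""
    if not text.strip():
        raise ValueError("empty proposal text")
    return text.split(".")[0].strip() + "."


def run_step(items, step, out_dir):
    """Apply `step` to every (item_id, text) pair.

    Results are written to new files under `out_dir`; the inputs are never
    modified. Failures are logged and returned so that one bad item does
    not break the whole batch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = {}
    for item_id, text in items:
        try:
            result = step(text)
        except Exception as exc:
            logger.warning("item %s failed: %s", item_id, exc)
            failed[item_id] = str(exc)
            continue
        payload = {"id": item_id, "summary": result}
        (out_dir / f"{item_id}.json").write_text(json.dumps(payload))
    return failed
```

The returned dictionary of failures can be saved alongside the results, which makes it straightforward to retry only the items that broke.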
  23. Backend API From Flask to Serverless

  24. 24 Backend requirements • Provide a simple endpoint 1. Serve

    list of random proposals to vote 
 • It should be • Easy to deploy • Easy to monitor • Cheap to scale • $$$ comes from our pockets meuParlamento API HTTP API JSON
  25. Backend API using Flask: Flask-based web API
    • Flask as Python web framework
    • MongoDB as storage (Cloud)
    • Pros
      • Data updates directly from pipeline
    • Cons
      • Bottleneck: Heroku's free tier
        • Slow processing units, no load balancing, etc.
      • Expensive to scale if necessary
      • Too much trouble for such a simple endpoint
  26. Serverless Backend API
    Since we are almost completely stateless, why not serverless / Lambda functions? This was a good opportunity to try out serverless deployments. The cloud provider manages the allocation of machine resources.
    Pros
    • Minimal infrastructure management
    • Load balancing
    • Seems cheaper to scale
    • Easier deployment too
  27. Serverless Backend API
    There are a few Python packages to develop serverless applications:
    • Zappa
    • python-lambda
    • Chalice
    We chose Chalice
    Pros
    • Very similar to Flask
    • Painless migration from previous Flask code
    • Provided by Amazon (trust)
    • Pretty decent documentation
    • Simple command line interface
  28. AWS Lambda Serverless Backend API: from Flask to serverless with AWS Chalice
    • Pros
      • Cheaper to scale if necessary
      • Free tier is quite generous
      • Single command deployment
      • Supports different stages (e.g. dev, test, prod, …)
      • Supports API versioning
      • Good monitoring tools
        • Advanced log analytics
        • Alerts (e.g. latency alerts, etc.)
    • Cons
      • Package size limitation (50 MB)
      • Vendor lock-in. Chalice supports AWS only
      • URL may change in some scenarios. Make sure you use a proxy URL in front of it
  29. 29 Backend API using serverless framework

  30. 30 Easier DevOps with AWS Lambda Deploying using a single

  31. 31 Example requesting random proposals GET

  32. AWS Lambda: open question. Remove MongoDB?
    • It scales so well that the bottleneck is now the pool of available connections at the database (MongoDB).
      • Up to 200 for free tier MongoDB
    • Solution
      • Query requirements are actually really simple
      • Remove MongoDB and use a simpler in-memory data structure?
    • Idea
      • Pipeline: save proposals file to an Amazon S3 bucket
      • Backend: load proposals file from the S3 bucket
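The idea of replacing the database with an in-memory structure can be sketched as follows. This is an illustration under assumptions, not the deployed backend: `ProposalStore` is a hypothetical name, the proposals file is assumed to be a JSON array, and fetching that file from S3 is left out, so the sketch reads from a local path.

```python
import json
import random


class ProposalStore:
    """Holds all proposals in memory and serves random batches."""

    def __init__(self, proposals):
        self._proposals = list(proposals)

    @classmethod
    def from_json_file(cls, path):
        # Assumes the pipeline exported the proposals as one JSON array.
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))

    def random_batch(self, k=10, rng=random):
        """Return up to k distinct proposals chosen uniformly at random."""
        k = min(k, len(self._proposals))
        return rng.sample(self._proposals, k)
```

With about 3,000 proposals the whole list fits comfortably in memory. If the store is created once at module level, a warm Lambda execution environment can reuse it across invocations, so no database connection is needed to serve a request.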
  33. Challenges Exploratory data analysis and probability after midnight

  34. Exploratory data analysis: imbalanced distributions by political party
    • We want to provide a fair random method that selects proposals evenly across political parties.
    • It is not as easy as it sounds:
      • The number of proposals per political party is very imbalanced;
      • We need to take into account majority and opposition in the Parliament.
    • Proposals aggregated by political party
  35. Exploratory data analysis: imbalanced distributions by political party
    Proposals aggregated by political party
    • What is the likelihood of randomly picking a proposal from PCP?
      • About 700 out of 3000
    • What is the likelihood of randomly picking a proposal from PSD?
      • About 300 out of 3000
    • Plain random selection is not fair because the authorship distribution is not even
  36. "Fair randomness"
    • Fixing random selection by using inverse probabilities as weights.
    • Python standard library to the rescue
    • Simply random.choices * :)
    * random.choices(population, weights=None, *, k=1)
      Return a k sized list of elements chosen from the population with replacement. If the population is empty, raises IndexError. If a weights sequence is specified, selections are made according to the relative
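A minimal sketch of the inverse-probability weighting, assuming each proposal is a dict with a `party` key (an assumed data shape; the function names are also illustrative). Each proposal is weighted by one over the number of proposals from its party, so every party ends up with the same total weight. With plain uniform selection PCP would be picked about 700/3000 ≈ 23% of the time and PSD about 300/3000 = 10%; with these weights every party is equally likely.

```python
import random
from collections import Counter


def inverse_frequency_weights(parties):
    """Weight each item by 1 / (number of items from the same party)."""
    counts = Counter(parties)
    return [1.0 / counts[party] for party in parties]


def pick_proposals(proposals, k=10, rng=random):
    """Pick k proposals so that every party is equally likely per draw.

    random.choices samples with replacement, so the same proposal can
    appear more than once in a batch.
    """
    parties = [p["party"] for p in proposals]
    weights = inverse_frequency_weights(parties)
    return rng.choices(proposals, weights=weights, k=k)
```

This only balances authorship across parties. It does not address the majority and opposition concern from the earlier slide, and a caller that needs distinct proposals has to deduplicate the batch.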
  37. Next steps
    • Properly support API versioning
    • API proxy
    • Improve test coverage, documentation and website
    • Pipeline repository is still not open. But it will be soon.
  38. 38 • Support session debates • Support proposals in real-time

    with notifications • Improve User Interface • Support different public institutions 
 (e.g. European Parliament, city council, etc) Next steps (features)
  39. by Arian Pasquali - Nuno Moniz - Tomás Amaro Non-profit - Open source