Fullstack Data Science by Arian Pasquali

Full stack data science: building a data processing pipeline with Luigi and backend deployment using Amazon’s Lambda API
Arian Pasquali, August 2019


Case study: meuParlamento.pt
The Portuguese Parliament in your pocket. A non-profit and open source project.
Arian Pasquali - Nuno Moniz - Tomás Amaro
[email protected]

Motivation
69.27% abstention in the last elections in Portugal

Our goals
• Increase engagement in politics
• Encourage healthy debate
• Explore new ways to interact with the Government's open data

Our solution: meuParlamento.pt
• Vote on real legislative proposals from your smartphone
• See which political parties vote like you
• Share results with your friends

In a nutshell: how it works with three simple gestures
• Against: swipe left if you are against
• In favor: swipe right if you approve
• Abstention: swipe up. Users can skip if they prefer not to vote

Data source http://parlamento.pt

Overview
• Sources: arquivo.pt, parlamento.pt
• Data processing pipeline: web scraping, refinement, text summarization, etc.
• Storage: about 3,000 proposals
• HTTP API endpoint: provides 10 random proposals to vote on

Data collection and refinement: challenges building the data processing pipeline

Data processing pipeline: from plain Python scripts to a task management engine
• Initially, the prototype involved different Python scripts using Jupyter Notebooks, Java and even R.
• To structure the data processing we chose Luigi (https://github.com/spotify/luigi)
  • Task dependencies
  • Dashboard
  • Email notifications, etc.
• As simple as it can be
  • Just define each task's dependencies and output. It takes care of the rest
• Challenges
  • Handling exceptions
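The "define dependencies and output, the engine does the rest" idea can be sketched without any library. This is a toy illustration in plain Python, not Luigi's implementation: the class names, the file names and the `build` helper are all hypothetical. Luigi's own `luigi.Task` exposes the same three concepts as `requires()`, `output()` and `run()`.

```python
import tempfile
from pathlib import Path


class Task:
    """Toy task: declares its dependencies and a single output file."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done once its output file exists.
        return self.output().exists()


class DownloadText(Task):
    def output(self):
        return self.workdir / "proposal.txt"

    def run(self):
        # Stand-in for the real scraping and PDF-to-text steps.
        self.output().write_text("Example proposal text. It has two sentences.")


class Summarize(Task):
    def requires(self):
        return [DownloadText(self.workdir)]

    def output(self):
        return self.workdir / "summary.txt"

    def run(self):
        text = DownloadText(self.workdir).output().read_text()
        # Naive "summary": keep only the first sentence.
        self.output().write_text(text.split(".")[0] + ".")


def build(task, executed):
    """Run dependencies first, then the task; skip anything already complete."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency, executed)
    task.run()
    executed.append(type(task).__name__)


workdir = Path(tempfile.mkdtemp())
first_run, second_run = [], []
build(Summarize(workdir), first_run)   # runs both tasks, dependency first
build(Summarize(workdir), second_run)  # nothing to do: outputs already exist
print(first_run, second_run)
```

Because completion is decided by the existence of the output, a rerun after a failure resumes from the first task whose output is missing instead of starting over.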

Data pipeline
• Steps: web scraping, download PDF, PDF to text, text summarization, PDF parser, compute readability score, persistence
• Languages: Python, R, Python (for now)

Data processing pipeline: important to keep in mind
• Keep each step as simple as possible
  • Atomic tasks
  • Easier to test
  • If one task fails, you can start again from the last one
• Keep track of changes
  • Don’t update data, keep it.
  • Save every change you make to the data
  • It also facilitates recovery at any point in the pipeline
• Fail gracefully
  • Handle exceptions properly so they don’t break the entire workflow. It can be tricky sometimes, especially with wrappers and loops.
• It pays off when things get complicated
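One common way to "fail gracefully" inside a loop is to catch the error per item, record it, and carry on. The sketch below assumes that pattern; `process_all` and `word_count` are hypothetical names for illustration, not code from the project.

```python
import logging

logger = logging.getLogger("pipeline")


def process_all(items, step):
    """Apply `step` to every item; record failures instead of aborting the loop."""
    results, failures = {}, {}
    for item in items:
        try:
            results[item] = step(item)
        except Exception as error:  # deliberately broad: one bad item must not stop the rest
            logger.warning("step failed for %r: %s", item, error)
            failures[item] = repr(error)
    return results, failures


def word_count(text):
    # Hypothetical step that rejects empty documents.
    if not text:
        raise ValueError("empty document")
    return len(text.split())


results, failures = process_all(["a b c", "", "d e"], word_count)
print(results)
print(sorted(failures))
```

Returning the failures alongside the results keeps the problem visible, so the failed items can be retried or inspected later rather than silently dropped.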

Backend API: from Flask to serverless

Backend requirements
• Provide a simple endpoint
  1. Serve a list of random proposals to vote on
• It should be
  • Easy to deploy
  • Easy to monitor
  • Cheap to scale
  • $$$ comes from our pockets
meuParlamento API: HTTP API, JSON

Backend API using Flask: Flask-based web API
• Flask as the Python web framework
• MongoDB as storage (cloud)
• Pros
  • Data updates directly from the pipeline
• Cons
  • Bottleneck: Heroku’s free tier
  • Slow processing units, no load balancing, etc.
  • Expensive to scale if necessary
  • Too much trouble for such a simple endpoint

Serverless Backend API
Since we are almost completely stateless, why not serverless / Lambda functions? This was a good opportunity to try out serverless deployments. The cloud provider manages the allocation of machine resources.
Pros
• Minimal infrastructure management
• Load balancing
• Seems cheaper to scale
• Easier deployment too
https://www.fullstackpython.com/serverless.html
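To make "stateless" concrete, here is a minimal sketch of a bare AWS Lambda handler with no framework around it. It assumes the function sits behind API Gateway with proxy integration, which passes query parameters in `queryStringParameters` and expects a `statusCode` / `headers` / `body` response. The `PROPOSALS` sample data and the `count` parameter are hypothetical.

```python
import json
import random

# Hypothetical sample data; the real service reads proposals from its storage.
PROPOSALS = [{"id": i, "title": f"Proposal {i}"} for i in range(1, 31)]


def handler(event, context):
    """Stateless handler: each request is answered from its own input alone."""
    # API Gateway sends null (None) when the request has no query string.
    params = event.get("queryStringParameters") or {}
    try:
        count = int(params.get("count", 10))
    except ValueError:
        return {
            "statusCode": 400,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"error": "count must be an integer"}),
        }
    # Clamp the requested size to what is available.
    count = max(1, min(count, len(PROPOSALS)))
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(random.sample(PROPOSALS, count)),
    }


response = handler({"queryStringParameters": None}, None)
print(response["statusCode"], len(json.loads(response["body"])))
```

Since nothing is remembered between calls, the provider can run as many copies of the function as traffic requires.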

Serverless Backend API
There are a few Python packages for developing serverless applications:
• Zappa
• python-lambda
• Chalice
We chose Chalice (https://github.com/aws/chalice/)
Pros
• Very similar to Flask
• Painless migration from the previous Flask code
• Provided by Amazon (trust)
• Pretty decent documentation
• Simple command line interface

AWS Lambda Serverless Backend API: from Flask to serverless with AWS Chalice
• Pros
  • Cheaper to scale if necessary
  • Free tier is quite generous
  • Single command deployment
  • Supports different stages (e.g. dev, test, prod, …)
  • Supports API versioning
  • Good monitoring tools
    • Advanced log analytics
    • Alerts (e.g. latency alerts, etc.)
• Cons
  • Package size limitation (50 MB)
  • Vendor lock-in. Chalice supports AWS only
  • URL may change in some scenarios. Make sure you use a proxy URL in front of it

AWS Lambda: open question. Remove MongoDB?
• It scales so well that the bottleneck is now the pool of available connections at the database (MongoDB).
  • Up to 200 for free tier MongoDB
• Solution
  • Query requirements are actually really simple
  • Remove MongoDB and use a simpler in-memory data structure?
• Idea
  • Pipeline: save the proposals file to an Amazon S3 bucket
  • Backend: load the proposals file from the S3 bucket
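The idea above could look roughly like the following sketch: the backend keeps the proposals in memory and loads the file once through an injected fetch function. Everything here is an assumption for illustration (the `ProposalStore` class, the JSON file layout, the fake fetch function); the S3 download itself is left out so the example runs anywhere.

```python
import json
import random


class ProposalStore:
    """Keeps proposals in memory; loads them once through an injected fetch function."""

    def __init__(self, fetch_bytes):
        # `fetch_bytes` is any zero-argument callable returning the file content as bytes.
        self._fetch_bytes = fetch_bytes
        self._proposals = None

    def _load(self):
        # Fetch and parse only on first use; later calls reuse the cached list.
        if self._proposals is None:
            self._proposals = json.loads(self._fetch_bytes().decode("utf-8"))
        return self._proposals

    def random_batch(self, size=10):
        proposals = self._load()
        return random.sample(proposals, min(size, len(proposals)))


fetch_calls = []


def fake_fetch():
    # Stand-in for the S3 download, counting how often it is called.
    fetch_calls.append(1)
    return json.dumps([{"id": i} for i in range(25)]).encode("utf-8")


store = ProposalStore(fake_fetch)
batch_one = store.random_batch()
batch_two = store.random_batch()
print(len(batch_one), len(batch_two), len(fetch_calls))
```

With boto3, the fetch function could wrap `get_object(Bucket=..., Key=...)["Body"].read()` on an S3 client, with bucket and key names of your choosing. A store created at module level survives across warm invocations of the same Lambda container, so the file is not downloaded on every request and no database connection pool is involved.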

Next steps (http://github.com/meuparlamento)
• Properly support API versioning
• API proxy
• Improve test coverage, documentation and website
• The pipeline repository is still not open, but it will be soon.

Next steps (features) (http://github.com/meuparlamento)
• Support session debates
• Support proposals in real time with notifications
• Improve the user interface
• Support different public institutions (e.g. European Parliament, city council, etc.)

http://meuParlamento.pt
by Arian Pasquali - Nuno Moniz - Tomás Amaro
[email protected]
http://github.com/meuparlamento
Non-profit - Open source

DevOpsPorto
