
Serverless Data Warehousing & Data Analysis on AWS

Alex Casalboni
February 16, 2018

Data science teams need to reduce storage and maintenance costs, and at the same time provide analytics tools for data analysts and scientists.
How can we make data collection and data analysis exciting, performant, and cost-effective in the Cloud?
Alex will connect the dots around the data processing building blocks provided by AWS, without managing any servers!



  1. clda.co/jeffconf-hamburg Serverless Data Warehousing & Data Analysis on AWS 2/16/2018

  2. About Me

    twitter://@alex_casalboni; Computer Science Background; Master in Sound & Music Engineering; Sr. Software Engineer & Web Developer
  3. Agenda

    Why do you need a DWH?; Warehouses Vs. Lakes; Serverless Architecture; Q & A
  4. Why do you need a DWH? #bigdata

  5. Data Warehousing goals

    Historical data repository; Reporting & DDDM; Data Analysis & ML; Data integration
  6. How “Big” is your Data?

  7. How “Correct” is your Data?

  8. Data-Driven Decision Making

  9. Warehouses Vs. Lakes #buzzwordschallenge

  10. Warehouses Vs. Lakes

    Warehouses: Only structured Data; Rigid & Expensive; Business-Analyst-friendly
    Lakes: Literally any kind of Data; Agile & Cheap; Data-Scientists-friendly
  11. Hybrid approaches

    DWH Data Lake Amazon Redshift Amazon Athena Redshift Spectrum Amazon S3 +
  12. Separation of compute and storage

    Independent scaling; Storage stays cheap and highly available; Compute scales out only if/when needed; Data sources can be reused
  13. clda.co/jeffconf-hamburg

  14. Serverless Data Ingestion & Data Analytics Architecture #JeffFTW

  15. Architecture black box

    1. Submit event/data; 2. Submit query/analysis; 3. Fetch analysis results
  16. Architecture goals

    No hourly/monthly costs; No servers to manage; No scale limitations or resize; Possibly anonymous producers; Storage as cheap as possible; Data validation / manipulation; Intuitive data exploration & reporting; Real-time metrics & alerts
  17. Architecture steps

    1. Get Credentials; 2. HTTP POST; 3. Put Records; 4. Filter / Manipulate; 5. Compress & Encrypt; 6. Query; 7. SPICE Import; 8. Analyse; 9. Sliding SQL; 10. Process aggregates; 11. Update Realtime Metrics
  18. Gotchas

    Kinesis Data Analytics & Streams are not 100% serverless; API Gateway isn’t cheap (directly using PutRecords might help); Don’t forget Athena Partitions to reduce cost and latency; AWS Glue is your friend for ETL and schema discovery
  19. Deploy it with AWS SAM! github.com/alexcasalboni/serverless-data-pipeline-sam

  20. clda.co/jeffconf-hamburg

  21. Thank you :) Q & A clda.co/jeffconf-hamburg 2/16/2018
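
The Athena partitions gotcha on slide 18 can be illustrated with a small sketch. Athena bills by the amount of data scanned, so laying objects out under Hive-style `key=value` prefixes and filtering on those partition columns keeps a query from reading the whole bucket. The deck does not show its actual layout: the `events` prefix and table name, the `year`/`month`/`day` columns (assumed to be declared as strings), and both helper functions below are hypothetical.

```python
from datetime import datetime, timezone


def partitioned_key(event_time: datetime, file_name: str, prefix: str = "events") -> str:
    """Build a Hive-style partitioned S3 object key (hypothetical layout)."""
    t = event_time.astimezone(timezone.utc)
    return f"{prefix}/year={t:%Y}/month={t:%m}/day={t:%d}/{file_name}"


def daily_count_query(table: str, day: datetime) -> str:
    """Build a query that filters on the partition columns for a single day.

    Assumes the table declares year, month and day as string partition columns.
    """
    t = day.astimezone(timezone.utc)
    return (
        f"SELECT COUNT(*) AS events FROM {table} "
        f"WHERE year = '{t:%Y}' AND month = '{t:%m}' AND day = '{t:%d}'"
    )


if __name__ == "__main__":
    when = datetime(2018, 2, 16, 12, 0, tzinfo=timezone.utc)
    print(partitioned_key(when, "batch-0001.gz"))
    print(daily_count_query("events", when))
```

Only the key and the SQL text are built here; registering the partitions (for example through AWS Glue, as the same slide suggests) and submitting the query are left out.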
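
Step 4 of the architecture (Filter / Manipulate) can be sketched as a Lambda-style handler. The deck does not name the service behind steps 4 and 5, so this assumes a Kinesis Data Firehose data-transformation function, where each incoming record carries a `recordId` and base64 `data`, and each returned record reports `Ok`, `Dropped`, or `ProcessingFailed`. The `event_type` field and the lowercasing rule are hypothetical and stand in for whatever validation the real pipeline performs.

```python
import base64
import json


def handler(event, context=None):
    """Filter and normalise records in a Firehose-style transformation (sketch)."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            # Not valid base64/UTF-8/JSON: hand the record back as failed, unchanged.
            output.append({"recordId": record_id, "result": "ProcessingFailed", "data": record["data"]})
            continue

        # Hypothetical validation rule: every event must be an object with an event_type.
        if not isinstance(payload, dict) or "event_type" not in payload:
            output.append({"recordId": record_id, "result": "Dropped", "data": record["data"]})
            continue

        # Hypothetical manipulation: normalise the event type to lowercase.
        payload["event_type"] = str(payload["event_type"]).lower()
        line = json.dumps(payload) + "\n"  # newline-delimited JSON, convenient for querying later
        encoded = base64.b64encode(line.encode("utf-8")).decode("ascii")
        output.append({"recordId": record_id, "result": "Ok", "data": encoded})

    return {"records": output}
```

Under the same assumption, step 5 (Compress & Encrypt) would be delivery-stream configuration rather than code, so it is not shown.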
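
Slide 18 suggests calling PutRecords directly instead of going through API Gateway. A minimal sketch of the client-side batching this implies, assuming the target is a Kinesis data stream (whose PutRecords call accepts at most 500 records per request). The random partition keys and the helper names are assumptions for illustration, and no AWS call is made here.

```python
import json
import uuid
from typing import Dict, Iterator, List

MAX_RECORDS_PER_CALL = 500  # PutRecords limit on records per request


def to_entries(events: List[dict]) -> List[Dict[str, object]]:
    """Turn plain dict events into PutRecords-shaped entries."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Assumption: a random key spreads records evenly across shards.
            "PartitionKey": str(uuid.uuid4()),
        }
        for event in events
    ]


def batches(entries: List[Dict[str, object]], size: int = MAX_RECORDS_PER_CALL) -> Iterator[List[Dict[str, object]]]:
    """Yield consecutive slices of at most `size` entries."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


if __name__ == "__main__":
    entries = to_entries([{"n": i} for i in range(1201)])
    for batch in batches(entries):
        # With boto3 this is where something like
        # kinesis.put_records(StreamName="my-stream", Records=batch) would go
        # ("my-stream" is a placeholder name).
        print(len(batch))
```

A real producer would also need to inspect the response for partially failed records and retry them, and to respect the per-request payload size limit, neither of which this sketch handles.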