Serverless Data Warehousing & Data Analysis on AWS

Alex Casalboni
February 16, 2018

Data science teams need to reduce storage and maintenance costs, and at the same time provide analytics tools for data analysts and scientists.
How can we make data collection and data analysis exciting, performant, and cost-effective in the Cloud?
Alex will connect the dots around the data processing building blocks provided by AWS, without managing any servers!



  1. Serverless Data Warehousing & Data Analysis on AWS 2/16/2018

  2. About Me
    - twitter://@alex_casalboni
    - Computer Science Background
    - Master in Sound & Music Engineering
    - Sr. Software Engineer & Web Developer
  3. Agenda
    - Why do you need a DWH?
    - Warehouses Vs. Lakes
    - Serverless Architecture
    - Q & A
  4. Why do you need a DWH? #bigdata

  5. Data Warehousing goals
    - Historical data repository
    - Reporting & DDDM (data-driven decision making)
    - Data Analysis & ML
    - Data integration
  6. How “Big” is your Data?

  7. How “Correct” is your Data?

  8. Data-Driven Decision Making

  9. Warehouses Vs. Lakes #buzzwordschallenge

  10. Warehouses Vs. Lakes
    - Warehouses: only structured data; rigid & expensive; business-analyst-friendly
    - Lakes: literally any kind of data; agile & cheap; data-scientist-friendly
  11. Hybrid approaches
    - DWH: Amazon Redshift
    - Data Lake: Amazon Athena
    - Hybrid: Redshift Spectrum + Amazon S3
  12. Separation of compute and storage
    - Independent scaling
    - Storage stays cheap and highly available
    - Compute scales out only if/when needed
    - Data sources can be reused

  14. Serverless Data Ingestion & Data Analytics Architecture #JeffFTW

  15. Architecture black box
    1. Submit event/data
    2. Submit query/analysis
    3. Fetch analysis results
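
The three black-box operations on slide 15 could be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: it assumes the events go into a Kinesis stream and the queries run on Amazon Athena (both services are named later in the deck), and it takes already-constructed boto3 clients as arguments. The function names, the `user_id` partition key, and the polling defaults are hypothetical.

```python
import json
import time


def submit_event(kinesis, stream_name, event):
    """Step 1: push one JSON event into a Kinesis stream."""
    return kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Hypothetical choice: shard by user id, fall back to a fixed key.
        PartitionKey=str(event.get("user_id", "anonymous")),
    )


def submit_query(athena, sql, database, output_location):
    """Step 2: start an Athena query and return its execution id."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


def fetch_results(athena, query_id, poll_seconds=1.0, max_polls=60):
    """Step 3: wait for the query to finish, then return its first page of rows."""
    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"query {query_id} did not finish in time")
    # Only the first page is read here; for SELECT queries the first row
    # holds the column names.
    result = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

With real clients this would be called as `submit_event(boto3.client("kinesis"), ...)` and so on; passing the clients in keeps the functions easy to exercise with stand-ins.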
  16. Architecture goals
    - No hourly/monthly costs
    - No servers to manage
    - No scale limitations or resize
    - Possibly anonymous producers
    - Storage as cheap as possible
    - Data validation / manipulation
    - Intuitive data exploration & reporting
    - Real-time metrics & alerts
  17. Architecture diagram, numbered steps:
    1. Get Credentials
    2. HTTP POST
    3. Put Records
    4. Filter / Manipulate
    5. Compress & Encrypt
    6. Query
    7. SPICE Import
    8. Analyse
    9. Sliding SQL
    10. Process aggregates
    11. Update Realtime Metrics
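
Step 4 (Filter / Manipulate) could look something like the sketch below, assuming the filtering runs as a Kinesis Data Firehose data-transformation Lambda; the slides do not say which service performs this step. The `recordId` / `result` / `data` response shape is the contract Firehose expects from such a function, while the `event_type` rule is a made-up validation example.

```python
import base64
import json


def _reply(record, result, data=None):
    """Build one output entry; untouched records echo their original data."""
    return {
        "recordId": record["recordId"],
        "result": result,
        "data": record["data"] if data is None else data,
    }


def handler(event, context):
    """Validate and normalise each incoming record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            # Not valid base64/UTF-8/JSON: flag it so it lands in the error output.
            output.append(_reply(record, "ProcessingFailed"))
            continue
        # Hypothetical rule: events without an event_type are filtered out.
        if not isinstance(payload, dict) or "event_type" not in payload:
            output.append(_reply(record, "Dropped"))
            continue
        payload["event_type"] = str(payload["event_type"]).lower()
        # Newline-delimited JSON keeps the delivered objects easy to query.
        line = (json.dumps(payload) + "\n").encode("utf-8")
        output.append(_reply(record, "Ok", base64.b64encode(line).decode("ascii")))
    return {"records": output}
```

Every incoming `recordId` must appear exactly once in the response, which is why dropped and failed records are still returned rather than skipped.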
  18. Gotchas
    - Kinesis Data Analytics & Streams are not 100% serverless
    - API Gateway isn’t cheap (directly using PutRecords might help)
    - Don’t forget Athena partitions to reduce cost and latency
    - AWS Glue is your friend for ETL and schema discovery
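
To illustrate the Athena partitions gotcha on slide 18: one common layout is Hive-style `year=/month=/day=` prefixes in S3, registered with `ALTER TABLE ... ADD PARTITION`, so that queries filtering on those columns scan only the matching prefixes. The sketch below only builds the strings; the table name, bucket, and daily granularity are assumptions, and the identifiers are interpolated directly, so they are expected to come from trusted configuration.

```python
from datetime import date


def partition_prefix(base_prefix, day):
    """S3 key prefix for one day's data, Hive-style."""
    return f"{base_prefix.rstrip('/')}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def add_partition_sql(table, bucket, base_prefix, day):
    """DDL that registers one day's prefix as a partition of the table."""
    location = f"s3://{bucket}/{partition_prefix(base_prefix, day)}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{day:%Y}', month='{day:%m}', day='{day:%d}') "
        f"LOCATION '{location}'"
    )


def daily_count_sql(table, day):
    """Query restricted to a single partition via its partition columns."""
    return (
        f"SELECT COUNT(*) AS events FROM {table} "
        f"WHERE year='{day:%Y}' AND month='{day:%m}' AND day='{day:%d}'"
    )
```

For example, `add_partition_sql("events", "my-bucket", "events/", date(2018, 2, 16))` yields a statement pointing at `s3://my-bucket/events/year=2018/month=02/day=16/`. The partition columns are treated as strings here, which has to match how the table was declared.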
  19. Deploy it with AWS SAM!


  21. Thank you :) Q & A 2/16/2018