
Cloud hopping with a stack of data

We’ve collected tons of events and aggregated data from Wunderlist’s 20 million users. After our acquisition by Microsoft, we started migrating our data and backend infrastructure from AWS to Azure. I am going to talk about our hiccups and challenges, and what we learned on the journey, from a completely Unix perspective.

It was presented at the Big Data Universe conference. More information: http://bdu.hu

Bence Faludi

May 19, 2016

Transcript

  1. Bence Faludi
     → Data & Applied Scientist at Microsoft
     → Lives in Berlin, Germany
     → Building data pipelines, data infrastructure
     → Open source addict
     → Working on Wunderlist, a multi-platform to-do application
  2. Topics
     → Scale and Complexity
     → Refactoring a Flying Airplane
     → Using the Zollstock
     → Bust a Move
     → Tighten Up
  3. → Organize and share your to-do, work, grocery, movies and household lists.
     → Set due dates and reminders and assign to-dos.
     → Share your lists and work collaboratively on projects with your colleagues, friends and family.
     → Available for free on every platform.
  4. Microservices
     → Disposable software components
     → Many tiny databases
     → Many tiny services
     → Monitor everything
     → Multiple programming languages: Clojure, Scala, Go, etc.
  5. Mostly PostgreSQL as storage
     → Hosted on AWS
     → ~33 databases
     → ~120 concurrent connections/database
     → Usually 2-3 tables per database
     → The tasks table contains 1 billion records.
  6. Data infrastructure's size
     → Collect every event from tracking: 125M/day
     → Parse & load compressed logs' content: 375GB/day
     → Mirror every production database: 35GB incremental/day
     → Load external sources (e.g. app store, payments).
     → Calculate KPIs, aggregates, business logic: 200+ queries
  7. [Architecture diagram: AWS + DWH reporting data flow as of 2015-12. Components: clients (phone, tablet, etc.), microservice applications, Rsyslog and Noxy for logging, tracking via SNS/SQS with a dumper, Postamt for email, S3 staging, EMR (Hadoop), hot and cold storage (Redshift), production database(s), external sources, Chart.io.]
  8. night-shift as ETL
     → Use cron for scheduling
     → Use make for dependencies, partial results, retries
     → Glue everything together with a handful of bash scripts
     → Inject variables and logic into SQL with Ruby's ERB (see the sketch below)
     → Flask application to monitor the data pipeline
     → Testable on a local machine
     → Open source
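night-shift's own templating is Ruby's ERB, but the idea is easy to show in any language. Here is a minimal sketch of the same pattern in Python; the schema, table, and day parameter are made-up examples, not night-shift's real templates:

```python
from string import Template

# night-shift injects variables into SQL with Ruby's ERB; this sketches
# the same idea in Python. Schema, table, and day are hypothetical.
QUERY = Template("""
SELECT count(*) AS events
FROM $schema.events
WHERE day = '$day';
""")

def render(day):
    # Render the query for one scheduled pipeline run.
    return QUERY.substitute(schema="analytics", day=day)

print(render("2016-01-15"))
```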
  9. Goals
     → Remove the extra steps from the data pipeline
     → Remove unnecessary complications like Hadoop
     → Add Azure support to the components
     → Refactor and make the code reusable
  10. [Same architecture diagram as slide 7: AWS + DWH reporting data flow, 2015-12.]
  11. [Architecture diagram: AWS + DWH reporting data flow as of 2016-01. Jr. Beaver replaces EMR (Hadoop); other components: clients (phone, tablet, etc.), Rsyslog, Noxy, tracking via SNS/SQS with a dumper, Postamt for email, S3, hot and cold storage (Redshift), production database(s), external sources, Chart.io.]
  12. EMR to Jr. Beaver
     → Detects the format of every log line
     → Log cruncher that standardizes microservices' logs
     → Classifies event names based on the API URL
     → Filters the analytically interesting rows
     → Map/reduce functionality
     → From Hadoop+Scala to make+PyPy
  13. Jr. Beaver
     → Configurable with YAML files (see the sketch below)
     → Written in PyPy instead of Go
     → Uses night-shift's make for parallelism
     → "Big RAM kills Big Data"
     → No Hadoop+Scala headache anymore
     → Provides monitoring
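As a rough illustration of the YAML-driven classification slides 12-13 describe, here is a hypothetical sketch in Python; the rule format, URL patterns, and event names are assumptions, not Jr. Beaver's actual configuration:

```python
import re
import yaml

# Hypothetical Jr. Beaver-style rules: map API URL patterns to event
# names. The rule format, patterns, and event names are made up.
RULES_YAML = """
- pattern: '^/api/v1/tasks/\\d+$'
  event: task.updated
- pattern: '^/api/v1/lists$'
  event: list.created
"""

RULES = [(re.compile(rule["pattern"]), rule["event"])
         for rule in yaml.safe_load(RULES_YAML)]

def classify(url):
    # Return the event name of the first matching rule; None means the
    # row is not analytically interesting and gets filtered out.
    for pattern, event in RULES:
        if pattern.match(url):
            return event
    return None

print(classify("/api/v1/tasks/42"))  # task.updated
```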
  14. vCPU count (bar chart): EMR, 600+ vCPUs across 20 computers; Jr. Beaver, 8 vCPUs in 1 computer.
  15. vCPU × working hours comparison (bar chart): EMR, 600 vCPU-hours; Jr. Beaver, 64 vCPU-hours.
  16. [Same architecture diagram as slide 11: AWS + DWH reporting data flow with Jr. Beaver, 2016-01.]
  17. [Architecture diagram: AWS + DWH reporting data flow as of 2016-02. Hamustro replaces the SNS/SQS tracking path and writes to S3 directly; other components: clients (phone, tablet, etc.), Rsyslog, Noxy, Jr. Beaver, email, hot and cold storage (Redshift), production database(s), external sources, Chart.io.]
  18. Homebrew tracking to Hamustro
     → Tracks the activities of our registered users
     → Receives events from client devices
     → Saves events to cloud targets
     → Tracks sessions and the order of events
     → Rewritten from Node.js to Go
     → Uses S3 directly instead of SNS/SQS
  19. Hamustro
     → Supports Amazon SNS/SQS and Azure Queue Storage
     → Supports Amazon S3 and Azure Blob Storage
     → Tracks up to 6M events/min on a single 4-vCPU server
     → Uses Protobuf/JSON for event sending (see the client sketch below)
     → Written in Go
     → Open source
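To make "receives events from client devices" concrete, here is a hypothetical client-side sketch in Python; the endpoint path and payload fields are assumptions, not Hamustro's actual wire format (which also speaks Protobuf):

```python
import json
import time
import urllib.request

# Hypothetical sketch of a client posting a JSON event batch to a
# Hamustro-style collector; endpoint and payload fields are assumptions.
payload = {
    "device_id": "tablet-1234",
    "session_id": "a1b2c3",
    "events": [
        {"name": "task.completed", "at": int(time.time()), "nr": 1},
    ],
}

request = urllib.request.Request(
    "http://localhost:8080/api/v1/track",  # assumed collector endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(response.status)
```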
  20. S3 vs. SNS on a single 4-vCPU computer (bar chart): Hamustro's S3 dialect handles ~6M events/min; its SNS dialect, ~60k events/min.
  21. [Side-by-side tracking stacks. AWS: Hamustro on Ubuntu 14.04, Amazon SNS/SQS, Amazon S3, Amazon Redshift, Chartio. Azure: Hamustro on Ubuntu 14.04, Azure Blob Storage, Azure SQL Data Warehouse, Chartio and Power BI (under evaluation).]
  22. Clear the trash
     1. Add Azure Blob Storage and Azure SQL Data Warehouse support to your services (sketched below).
     2. Build the missing tools for production.
     3. Remove unnecessary data and backups to make your stack smaller.
     4. Reconnect the new parts one by one.
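Step 1 in miniature: copy one object from Amazon S3 into Azure Blob Storage. This is a sketch using the current boto3 and azure-storage-blob clients; the bucket, container, key, and credentials are placeholders:

```python
import boto3
from azure.storage.blob import BlobServiceClient

# Sketch: mirror one object from Amazon S3 into Azure Blob Storage.
# Bucket, container, key, and credentials are placeholders.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="wunderlist-logs", Key="2016/04/01/events.gz")

blobs = BlobServiceClient.from_connection_string("<azure connection string>")
blob = blobs.get_blob_client(container="logs", blob="2016/04/01/events.gz")
blob.upload_blob(obj["Body"].read(), overwrite=True)
```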
  23. UNIX tools for production
     → azrcmd: CLI to download and upload files to Azure Blob Storage; provides s3cmd-like functionality.
     → cheetah: CLI for MSSQL that works on OS X and Linux and also supports Azure SQL Data Warehouse; similar to psql and superior to sql-cli and Microsoft's sqlcmd (see the sketch below).
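cheetah is the interactive CLI; a script can reach Azure SQL Data Warehouse the same way over ODBC, for example with pyodbc. The server, database, and credentials below are placeholders:

```python
import pyodbc

# Sketch: connect to Azure SQL Data Warehouse over ODBC and run a
# query, roughly what a cheetah-like client does. Placeholders only.
connection = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=example.database.windows.net;"
    "DATABASE=dwh;UID=loader;PWD=<secret>"
)
for row in connection.execute("SELECT TOP 5 name FROM sys.tables;"):
    print(row.name)
```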
  24. [Same architecture diagram as slide 17: AWS + DWH reporting data flow, 2016-02.]
  25. [Architecture diagram: Azure + DWH reporting data flow as of 2016-04. Components: clients (phone, tablet, etc.), Noxy, Weasel, Jr. Beaver, Hamustro for tracking, logging and data staged in Azure Blob Storage (ABS), SQL Data Warehouse (SQLDWH), production database(s), external sources, Chart.io.]
  26. Improve SQLDW usage
     → Different loading strategies than in Redshift
     → Scale up while the data pipeline is running
     → Set up the right resource groups for every user
     → Define distributions and use partitions (see the DDL sketch below)
     → Use full-featured SQL
     → Find the perfect balance between concurrency and speed
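For "define distributions and use partitions", a hash-distributed, partitioned table in SQL Data Warehouse looks roughly like the DDL below; the tasks schema and boundary dates are assumptions for illustration, not the real table definition:

```python
# Illustrative Azure SQL Data Warehouse DDL; run it through any client
# (cheetah, pyodbc). The tasks schema and boundaries are assumptions.
DDL = """
CREATE TABLE dbo.tasks (
    task_id    BIGINT NOT NULL,
    list_id    BIGINT NOT NULL,
    created_at DATE   NOT NULL
)
WITH (
    DISTRIBUTION = HASH(list_id),   -- co-locate each list's tasks
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (created_at RANGE RIGHT FOR VALUES
               ('2015-01-01', '2016-01-01'))
);
"""
print(DDL)
```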