Scalable Big Data Architecture


Calvin Canh Tran

July 31, 2020


  1. Big Data Architecture “Scalable” Solution by @calvincanhtran

  2. About me

    - 1st Data Engineer @ Grab Finance Group.
    - 2+ years of experience as a Data Scientist, 3+ years as a Data Engineer.
    - MTech in Knowledge Engineering (National University of Singapore).
    - Blogger at
  3. Agenda

    • Hadoop Big Data Architecture.
    • Cloud Computing Solution.
    • “Scalable” Architecture with AWS.
    • Use Case #1: Machine Learning Toolbox.
    • Use Case #2: Process Transaction Data.
  4. Hadoop Big Data Architecture

  5. Hadoop Big Data Architecture

  6. Hadoop Big Data Architecture - SELF-HOSTED

    • Physical server rack (Dell PowerEdge R730, 5-node rack / 4 TB, ~80K USD).
    • Engineering effort to set up and maintain the physical rack as well as the platform.
    • License fees for Cloudera, MapR, Hortonworks (10,000 USD per node per annum).
    • Scaling depends on hardware.
    • Data security (can get certified for data governance).
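The cost figures on this slide can be combined into a rough total. The sketch below is only arithmetic on the two numbers quoted above (about 80K USD for the 5-node rack and 10,000 USD per node per year for licenses). It assumes every node in the rack is licensed and leaves out engineering, power, and hosting costs, which the slide does not quantify. The function name and parameters are illustrative.

```python
RACK_COST_USD = 80_000             # ~80K USD for the 5-node rack (from the slide)
NODES = 5                          # nodes in the rack (from the slide)
LICENSE_PER_NODE_YEAR_USD = 10_000  # license fee per node per year (from the slide)


def self_hosted_cost(years, rack_cost=RACK_COST_USD, nodes=NODES,
                     license_per_node_year=LICENSE_PER_NODE_YEAR_USD):
    """Hardware plus license cost after `years`, assuming all nodes are licensed."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return rack_cost + nodes * license_per_node_year * years


if __name__ == "__main__":
    for y in (1, 2, 3):
        print(f"year {y}: {self_hosted_cost(y):,} USD")
```

Under these assumptions the license fees (50,000 USD per year) overtake the hardware price within the second year, which is one reason the recurring cost matters as much as the rack itself.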
  7. Cloud Computing Solution - advantages

    • Scalable (pay as you go).
    • Reduces the cost of infrastructure and the engineering effort needed to maintain the system.
    • Suitable for tech companies whose products are deployed on the cloud.
    • Work from home (during the COVID season).
    • Requires infrastructure effort to get certified for data governance.
  8. Cloud Computing Solution - AWS

  9. Cloud Computing Solution - troubles

    ❏ Dependence on the cloud provider's services.
    ❏ Could be expensive at large scale (e.g. Spark on EMR, highly available EC2 instances, EFS file systems…).
    ❏ Hard to customise the components.
    ❏ Engineers' future careers.
    ❏ Engineering effort to run hybrid clouds.
  10. “Scalable” solution

    • Containerize with Docker, deploy to Kubernetes, and utilize cloud services.
    • Scale on different layers.
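As one illustration of scaling at the pod layer, the sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The slides do not say which autoscaling mechanism is used, so treating the HPA as the pod-layer scaler is an assumption. The function name and the min/max bounds are illustrative, and the real controller also applies a tolerance and stabilization window that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count suggested by the HPA scaling rule, clamped to bounds.

    The min/max defaults are illustrative, not values from the talk.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target: scale out.
    print(desired_replicas(4, 90, 60))
    # 4 pods averaging 30% CPU against a 60% target: scale in.
    print(desired_replicas(4, 30, 60))
```

Node-level and storage-level scaling would be handled by other components (for example a cluster autoscaler or managed cloud services), which is presumably what scaling "on different layers" refers to.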
  11. “Scalable” solution - Some Definitions

  12. Deep dive into the “Scalable” solution

  13. Deep dive into the “Scalable” solution

  14. “Scalable” solution - Deployment

  15. “Scalable” solution - Isolation

  16. Use case #1: Machine Learning Toolbox

    • Data is too big to aggregate on a local machine.
    • Security concerns.
  17. Use case #2: Process transaction data

    • Multiple BI and Analytics teams: Product Analytics, Marketing Analytics, Finance Analytics…
    • Different use cases and different data marts / data warehouses.
    • Weekly and monthly reports.
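The weekly and monthly reports mentioned above amount to rolling transactions up into calendar buckets. The sketch below shows that roll-up with the Python standard library only. The record layout (a date plus an integer amount), the sample rows, and the function names are hypothetical, and a production pipeline in the architecture described here would more likely run as a distributed job than as an in-memory loop.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(transactions):
    """Sum amounts per (year, month)."""
    totals = defaultdict(int)
    for day, amount in transactions:
        totals[(day.year, day.month)] += amount
    return dict(totals)


def weekly_totals(transactions):
    """Sum amounts per ISO (year, week); ISO weeks run Monday to Sunday."""
    totals = defaultdict(int)
    for day, amount in transactions:
        iso = day.isocalendar()
        totals[(iso[0], iso[1])] += amount
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical sample rows: (transaction date, amount in cents).
    sample = [
        (date(2020, 7, 27), 10_000),
        (date(2020, 7, 31), 25_000),
        (date(2020, 8, 1), 4_000),
        (date(2020, 8, 10), 6_000),
    ]
    print(monthly_totals(sample))
    print(weekly_totals(sample))
```

Note that a week can straddle two months (here 31 July and 1 August fall in the same ISO week), so weekly and monthly reports need separate aggregations rather than one being derived from the other.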
  18. Use case #2: Process transaction data

  19. Q&A

    Thank you for attending!
    Email: Facebook: Twitter: @holacalvinhere