Slide 1

Slide 1 text

Big Data Architecture “Scalable” Solution by @calvincanhtran

Slide 2

Slide 2 text

About me - 1st Data Engineer @ Grab Finance Group. - 2+ years of experience as a Data Scientist, 3+ years as a Data Engineer. - MTech in Knowledge Engineering (National University of Singapore). - Blogger at http://www.dataguystory.com

Slide 3

Slide 3 text

Agenda ● Hadoop Big Data Architecture. ● Cloud Computing Solution. ● “Scalable” Architecture with AWS. ● Use Case #1: Machine Learning Toolbox. ● Use Case #2: Process Transaction Data.

Slide 4

Slide 4 text

Hadoop Big Data Architecture

Slide 5

Slide 5 text

Hadoop Big Data Architecture

Slide 6

Slide 6 text

Hadoop Big Data Architecture Physical server rack (Dell PowerEdge R730, 5-node rack / 4 TB, ~80K USD). Engineering effort to set up and maintain the physical rack as well as the platform. License fees for Cloudera, MapR, or Hortonworks (10,000 USD per node per annum). Scaling depends on hardware. Data security (can get certified for data governance). SELF-HOSTED
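As a rough worked example of the figures on this slide, the sketch below adds up hardware and license costs for the self-hosted rack. The rack price, node count, and license fee come from the slide; the function name and the three-year horizon are illustrative assumptions, and engineering effort is left out because the slide gives no figure for it.

```python
# Figures from the slide; everything else here is an illustrative assumption.
RACK_COST_USD = 80_000              # 5-node Dell PowerEdge R730 rack, ~4 TB
NODES = 5
LICENSE_USD_PER_NODE_YEAR = 10_000  # Cloudera / MapR / Hortonworks license


def self_hosted_cost(years: int) -> int:
    """Hardware plus license cost in USD, excluding engineering effort."""
    return RACK_COST_USD + NODES * LICENSE_USD_PER_NODE_YEAR * years


first_year = self_hosted_cost(1)   # 80,000 + 5 * 10,000 = 130,000
three_years = self_hosted_cost(3)  # 80,000 + 5 * 10,000 * 3 = 230,000
print(first_year, three_years)
```

The one-off hardware cost is paid up front regardless of load, which is the contrast with the pay-as-you-go model on the next slide.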

Slide 7

Slide 7 text

Cloud Computing Solution - advantages Scalable (pay as you go). Reduces the cost of infrastructure and the engineering effort to maintain the system. Suitable for tech companies whose products are deployed on the cloud. Work from home (COVID season). Requires infrastructure effort to get certified for data governance.

Slide 8

Slide 8 text

Cloud Computing Solution - AWS

Slide 9

Slide 9 text

Cloud Computing Solution - troubles ❏ Dependence on cloud provider services. ❏ Can be expensive at large scale (e.g. Spark on EMR, highly available EC2 instances, EFS file system…). ❏ Hard to customise the components. ❏ Engineers' future careers. ❏ Engineering effort to run hybrid clouds.

Slide 10

Slide 10 text

“Scalable” solution ● Containerize with Docker, deploy to Kubernetes and utilize cloud services. ● Scale on different layers.
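One way to picture scaling at the container layer is the replica rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The slides do not say which autoscaler is used, so treat this as a simplified sketch under that assumption; the function name and the min/max bounds are hypothetical, and the real autoscaler also applies a tolerance and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Simplified replica calculation based on the HPA ratio rule."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Scale the current replica count by how far the metric is from target.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds (assumed values, not from the slides).
    return max(min_replicas, min(max_replicas, raw))


# 4 pods at 90% average CPU with a 60% target -> scale out to 6 pods.
print(desired_replicas(4, 90.0, 60.0))
```

Node-level scaling (adding machines to the cluster) is a separate layer and is not modeled in this sketch.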

Slide 11

Slide 11 text

“Scalable” solution - Some Definitions

Slide 12

Slide 12 text

Deep dive into the “Scalable” solution

Slide 13

Slide 13 text

Deep dive into the “Scalable” solution

Slide 14

Slide 14 text

“Scalable” solution - Deployment

Slide 15

Slide 15 text

“Scalable” solution - Isolation

Slide 16

Slide 16 text

Use case #1: Machine Learning Toolbox ● Data is too big to aggregate on a local machine. ● Security concerns.

Slide 17

Slide 17 text

Use case #2: Process transaction data Multiple BI and analytics teams: Product Analytics, Marketing Analytics, Finance Analytics… Different use cases and different data marts / data warehouses. Weekly and monthly reports.
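The weekly and monthly reports mentioned here boil down to grouping the same transaction rows by different time keys. The sketch below shows that idea with the Python standard library only; the sample rows, field layout, and function names are hypothetical and not taken from the talk, and a production pipeline would run this kind of aggregation in the warehouse or a distributed engine rather than in memory.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (transaction date, amount).
TRANSACTIONS = [
    (date(2020, 3, 30), 120.0),
    (date(2020, 4, 1), 80.0),
    (date(2020, 4, 6), 50.0),
]


def monthly_totals(rows):
    """Sum amounts per (year, month)."""
    totals = defaultdict(float)
    for day, amount in rows:
        totals[(day.year, day.month)] += amount
    return dict(totals)


def weekly_totals(rows):
    """Sum amounts per (ISO year, ISO week number)."""
    totals = defaultdict(float)
    for day, amount in rows:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        totals[(iso[0], iso[1])] += amount
    return dict(totals)


print(monthly_totals(TRANSACTIONS))
print(weekly_totals(TRANSACTIONS))
```

Each analytics team can then read from the aggregate that matches its reporting cadence instead of rescanning the raw transactions.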

Slide 18

Slide 18 text

Use case #2: Process transaction data

Slide 19

Slide 19 text

Q&A Thank you for attending! Email: [email protected] Facebook: fb.com/dataguystory Twitter: @holacalvinhere