
Architecture of an Effective Data Platform

vrajat
February 16, 2019


The requirements of a data platform depend on data volume and use cases: an effective data platform looks different for a startup, a mid-sized company and a large enterprise. This talk introduces a framework for understanding the requirements of a data platform, then explores successful architectures for startups, mid-sized companies and large enterprises.


Transcript

  1. Summary, or Spoiler Alert! An effective data platform has:
     • A solid data engineering foundation, before
       ◦ A solid analytics foundation, which is before
         ▪ A solid data science foundation, which is before
     • Supporting AI
  2. About the Presenter
     • Early engineer at Vertica, a columnar data warehouse
     • Early engineer at Qubole, a Big Data platform as a service
     • Now consulting & starting up in the data admin space
     • Get in touch: vrajat on Twitter, Github, Linkedin and Medium
  3. Agenda
     • Determine what "effective" means
     • One size does not fit all. Discuss the evolution stages:
       ◦ Part-time Data
       ◦ Data Engineering
       ◦ Analytics
       ◦ AI/ML
     • Effective architectures for every stage
  4. Evolution of Data Teams
     • Part Time
       ◦ Engineers moonlight as data engineers
       ◦ Collect easily available data and provide insights based on counting
     • Data Engineering
       ◦ Full-time engineers maintain pipelines to collect and store data
       ◦ Responsible for the reliability of data storage and the quality of data
       ◦ Support one or two data applications for data analysts
     • Analytics
       ◦ Prepare data for multiple reports and teams
       ◦ Support basic data science projects
       ◦ Support data-driven product modifications
     • AI/ML
       ◦ Support AI & deep learning
       ◦ Support Big Data ingestion and applications
  5. Qubole as an example
     • Part Time ( - 2015)
       ◦ Combination of solution architects and interested engineers
       ◦ Used business-logic data on a MySQL replica
       ◦ Basic information on the popularity of engines
       ◦ Largest customers based on usage or commands submitted
     • Data Engineering (2016 - 2018)
       ◦ Built a generic collector for Hadoop cluster metrics (Qubole Blog: Building QDS: AIR Infrastructure)
       ◦ Built a reliable ETL pipeline (Qubole Blog: Under the Hood: Building AIR at Qubole)
       ◦ Improved data quality with Airflow check operators (Fifth Elephant: Improve Data Quality using Airflow and check operator); see the sketch after this slide
       ◦ State of DataOps Report
     • Analytics (2018 - )
       ◦ SparkLens
       ◦ Data-driven customer features
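The check-operator approach referenced in this slide can be sketched in a few lines. This is a minimal illustration, assuming an Airflow 1.10-era deployment (contemporary with the talk); the `events` table, the `warehouse` connection id and the expected daily volume are hypothetical.

```python
# Minimal sketch: data quality checks wired into a pipeline with
# Airflow's check operators (Airflow 1.10-era imports assumed).
from datetime import datetime

from airflow import DAG
from airflow.operators.check_operator import CheckOperator, ValueCheckOperator

with DAG(dag_id="data_quality_checks",
         start_date=datetime(2019, 1, 1),
         schedule_interval="@daily") as dag:

    # Fails the task if the first cell of the first result row is falsy,
    # e.g. zero rows were loaded for the day.
    row_count_check = CheckOperator(
        task_id="check_rows_loaded",
        sql="SELECT COUNT(*) FROM events WHERE ds = '{{ ds }}'",
        conn_id="warehouse",
    )

    # Fails if the measured value strays more than 10% from the expected value.
    volume_check = ValueCheckOperator(
        task_id="check_event_volume",
        sql="SELECT COUNT(*) FROM events WHERE ds = '{{ ds }}'",
        pass_value=1_000_000,
        tolerance=0.1,
        conn_id="warehouse",
    )

    row_count_check >> volume_check
```

Putting the checks inside the DAG means a quality failure blocks downstream tasks instead of silently publishing bad data.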
  6. Data Admin Hierarchy of Needs
     • Generation 1: Reliability & Quality
       ◦ Adopt a workflow tool (Airflow). Monitor pipelines. Manage the reliability and correctness of data.
     • Generation 2: Scale up
       ◦ Data discovery, TCO, performance. Add more data sources and data applications.
     • Generation 3: Compliance
       ◦ Obfuscation. ACLs. Auditing. GDPR. Chargebacks. (See the sketch after this slide.)
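As one concrete example of the Generation 3 compliance tier, PII columns can be obfuscated before data lands in shared storage. A minimal sketch, assuming rows arrive as dicts; the column names and salt handling are hypothetical.

```python
# Minimal sketch: replace PII columns with salted hashes before loading.
import hashlib
import os

PII_COLUMNS = {"email", "phone"}          # hypothetical PII column names
SALT = os.environ.get("PII_SALT", "change-me")  # keep the real salt in a secret store


def obfuscate(row: dict) -> dict:
    """Return a copy of the row with PII columns replaced by salted hashes."""
    return {
        key: hashlib.sha256((SALT + str(value)).encode()).hexdigest()
        if key in PII_COLUMNS else value
        for key, value in row.items()
    }


print(obfuscate({"user_id": 7, "email": "a@example.com"}))
```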
  7. Data Engineering
     • Time to add data sources and data pipelines
     • Monitor SLAs and quality
     • Time to add SLA and quality checks (see the sketch after this slide)
     • Ability of data generators to self-serve
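Airflow can express the SLA monitoring called out above directly on tasks. A minimal sketch, assuming an Airflow 1.10-era scheduler; the DAG, task and notification hook are hypothetical.

```python
# Minimal sketch: attach an SLA to a task so misses are reported automatically.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Replace with a real pager or Slack integration.
    print(f"SLA missed for: {task_list}")


with DAG(dag_id="ingest_orders",
         start_date=datetime(2019, 1, 1),
         schedule_interval="@hourly",
         sla_miss_callback=notify_sla_miss) as dag:

    # The load must finish within 30 minutes of the scheduled time,
    # or the scheduler records an SLA miss and fires the callback.
    load = BashOperator(
        task_id="load_orders",
        bash_command="python load_orders.py --ds {{ ds }}",
        sla=timedelta(minutes=30),
    )
```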
  8. Analytics
     • Time to data discovery
     • Data scientists and analysts should be technology-unaware
     • Ability to self-serve data preparation and cleaning
     • Availability of recipes for common data engineering patterns (see the sketch after this list):
       ◦ Incremental Computation Framework
       ◦ Backfill Framework
       ◦ Global Metrics Framework
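One of the recipes above, incremental computation, reduces to a per-partition query keyed on the run date, so each run touches only its own slice of data. A minimal sketch; the `daily_metrics` and `events` tables and their columns are hypothetical.

```python
# Minimal sketch: recompute only the partition for the run date
# instead of the full table.
INCREMENTAL_METRICS_SQL = """
INSERT OVERWRITE TABLE daily_metrics PARTITION (ds = '{ds}')
SELECT user_id,
       COUNT(*)        AS events,
       SUM(duration_s) AS total_duration_s
FROM   events
WHERE  ds = '{ds}'
GROUP  BY user_id
"""


def build_incremental_query(ds: str) -> str:
    """Render the per-partition query for one run date, e.g. '2019-02-16'."""
    return INCREMENTAL_METRICS_SQL.format(ds=ds)


print(build_incremental_query("2019-02-16"))
```

A backfill framework falls out of the same shape: re-run the per-partition query for each missed date, for example with Airflow's `airflow backfill <dag_id> -s <start_date> -e <end_date>` CLI.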
  9. [Architecture diagram] Production Database, Logs and 3rd Party Sources flow through a Messaging Platform and ETL Frameworks into Storage (Big Data, Data Warehouse), which feeds Analytics, Data Science and AI.