What Is Data Engineering

A series of talks on data engineering

Yuri Ostapchuk

September 13, 2021

Transcript

  1. PLAN

    pillars of data engineering; bird's eye view on ecosystem and typical architectures; building blocks of big data and what's in common; real-world pipeline example
  2. WHO WE ARE AND WHERE WE ARE

    we are data engineers building data pipelines using building blocks
    vocab: data pipeline
  3. data flows into the system from various data sources and

    has to be prepared for serving; what is the trigger? incoming data (push/pull)
  4. TYPES OF WORKLOAD

    ingestion and storage of big datasets; batch processing of data at rest; real-time streaming processing of data in motion; interactive exploration of big data; predictive analytics and machine learning
    vocab: job; ETL; ingestion, streaming, batch; data-at-rest, data-at-motion; data wrangling, data mining
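The difference between batch processing of data at rest and streaming processing of data in motion can be sketched in a few lines. This is only an illustration: the `EVENTS` list and both function names are hypothetical and do not come from the talk.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (user, amount), a made-up event shape

# Hypothetical input; in a real pipeline this would come from a data source.
EVENTS = [("alice", 3), ("bob", 5), ("alice", 2)]


def batch_totals(events: Iterable[Event]) -> Dict[str, int]:
    """Batch: read the whole dataset at rest, emit one final result."""
    totals: Counter = Counter()
    for user, amount in events:
        totals[user] += amount
    return dict(totals)


def streaming_totals(events: Iterable[Event]) -> Iterator[Event]:
    """Streaming: update state per event and emit the new running total."""
    totals: Counter = Counter()
    for user, amount in events:
        totals[user] += amount
        yield user, totals[user]


if __name__ == "__main__":
    print(batch_totals(EVENTS))
    for update in streaming_totals(EVENTS):
        print(update)
```

The batch function answers only after it has seen everything, while the streaming generator produces an updated answer after each event, which is the trade-off between the two workload types.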
  5. COMMON BIG-DATA-SYSTEM ARCHITECTURE (building block: framework / tool)

    nodes (workers/slaves/brokers/nodes/…): memory, cpu, storage; + client; + coordinator; + master coordinator; scheduler; metastorage
  6. DISTRIBUTED ALGORITHMS & COORDINATION

    leader election; consensus; non-blocking data structures; replication; atomic broadcast; consistent hashing; ..
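Of the items listed on this slide, consistent hashing is the easiest to show in a few lines. The sketch below is a minimal illustration, not the implementation used by any particular framework: the `HashRing` class, the choice of SHA-256, and the virtual-node count are all assumptions made for the example.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(key: str) -> int:
    """Map a string to a point on the ring (SHA-256 is an arbitrary choice)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (point, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns several points so load spreads more evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First point clockwise from the key's hash; wrap around at the end.
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.lookup("user:42"))
```

The property that matters is that removing a node only reassigns the keys that node owned. Every other key keeps its owner, which is why this technique shows up in partitioned stores.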
  7. COMMON BIG DATA TOOLS & FRAMEWORKS

    how do they work? how do we get our heads around this? in fact, this is all about microservices
  8. WHAT'S IN COMMON

    storage, data at-rest: hdfs (AP, distributed file system); hive, presto, tez, impala (sql engine); bigtable/hbase (CP, random-access, OLTP); dynamo, cassandra (AP); elastic (FTS, AP); …
    processing, data at-motion: mapreduce (classic programming model); spark (in-memory processing); flink, storm, beam; kafka (streaming, messaging); …
    orchestration, management, workflow: zookeeper (coordination); airflow, oozie (workflow); yarn (cluster management); …
  9. GUIDELINE TO GET INTO NEW BIG DATA FRAMEWORK

    at a glance: storage / computation? where is it in terms of CAP? olap / oltp, read or write optimized? stream or batch? data model; computation model; delivery semantics; what kind of interface does it provide? (client / configuration)
  10. GUIDELINE TO GET INTO NEW BIG DATA FRAMEWORK

    deep: what services does it run? where does it run them? what do they do? how do they communicate?
  11. WHAT'S NEXT?

    getting hands dirty: word-count problem; literature: Designing Data-Intensive Applications; courses; ask me; let's meet next week :) https://www.coursera.org/learn/big-data-analysis
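As a starting point for the word-count problem mentioned above, here is a minimal single-process sketch in the shape of the classic map / shuffle / reduce model. It uses plain Python rather than any real framework, and the function names are made up for the illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["to be or not to be", "To be is to do"]))
```

In a distributed engine the map and reduce steps would run on different nodes and the shuffle would move data over the network. Here all three run in one process so that the data flow is easy to follow.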