
What do we talk about when we talk about data?

A tour of Big Data problems and technical challenges.

Ahmad Alhour

January 23, 2020

Transcript

  1. What do we talk about when we talk about data?
    A tour of the Big Data field and all that jazz...
    23.01.2020

  3. Some Business Problems...
    1. (Real-time) Data Processing (e.g.: IoT; Gaming)
    2. Data Integration (e.g.: SalesForce; CRM; PMS)
    3. Data Management (e.g.: TripAdvisor User Reviews; Metadata)
    4. Exploration and Governance (e.g.: Lyft’s internal mapping data)
    5. Analytics & Machine Learning (e.g.: Amazon Recommender Systems)

  4. Common Technical Challenges
    ➔ What are some of the common technical challenges?
    Ingestion, buffering, processing,
    storage, exploration, serving and
    governance

  5. Ingestion
    Data lives somewhere else, and we need to collect it.
    Typical scenario for data integration, acquisition and collection.
    [Diagram: external data sources feeding a buffer]

  6. Storage
    I have some data and I want to store it somewhere that lets me
    process it in multiple ways in the future.
    Typical scenario for data warehousing.
    [Diagram: a distributed file system spanning racks 1-3, feeding a
    database (Google Bigtable), batch processing (Apache Spark) and BI
    analytics (Tableau)]

  7. Processing
    I want to transform my data in a
    way that allows me to achieve X
    (cleaning up, model training).
    Typical scenario for data
    processing engines.
    “Making sense of stream processing”,
    Martin Kleppmann

  8. Serving Results
    I’ve ingested, stored and processed my data, and now I want to
    serve it to the appropriate consumer.
    Typical scenario for ML model training, data materialization
    (search … etc) and governance.
    [Diagram: distributed file system behind a governance layer and a
    serving API]

  9. Data Architecture Blueprint
    Ingest → Store → Process → Serve, orchestrated end to end.
    [Diagram: pipeline stages exposed via API endpoints]

  10. Data Constraints
    ➔ How big is Big Data?
    ➔ What data characteristics
    dictate our approach to solving
    problems?

  11. How big is Big Data?
    Volume: GB? TB? PB? More?!
    Velocity: what is the rate of data generation?
    Variety: type and nature of data, e.g.: text, numeric, relational,
    unstructured, formatted … etc

  12. Characteristics of Data
    Bounds: Bounded vs. Unbounded
    Time: Time-agnostic vs. Time-sensitive
    Order: Order-dependent vs. Order-independent
    Structure: Structured vs. Unstructured

  13. Bounded vs.
    Unbounded
    Logs for Dec. 19-29
    vs.
    Live app logs & statistics

  14. Order-dependent vs.
    Order-independent
    Change data replication
    vs.
    Element-wise filtering
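The distinction is easy to see in code. A minimal sketch, assuming toy key/value change events rather than a real CDC feed:

```python
def apply_changes(state, ops):
    """Change data replication: replay key/value updates. Last write
    wins, so the final state depends on the order of the operations."""
    for key, value in ops:
        state[key] = value
    return state

def keep_errors(records):
    """Element-wise filter: each record is judged on its own, so any
    ordering of the input produces the same set of results."""
    return {r for r in records if r.startswith("ERROR")}

# Replaying the same changes in a different order gives a different state...
ops = [("status", "active"), ("status", "deleted")]
# ...while filtering is insensitive to input order.
logs = ["ERROR disk full", "INFO boot ok"]
```

Order-dependent workloads therefore need ordering guarantees from the transport (per-key partitioning, sequence numbers); order-independent ones can be parallelized freely.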

  15. Structured vs.
    Unstructured
    Stars-ranked ratings
    vs.
    User generated free text

  16. Time-agnostic vs.
    Time-sensitive
    Filter traffic events for a
    single source (don’t care about time)
    vs.
    Trend analysis (care about time)

  17. Time Domain[1]
    Processing Time
    When did the system become
    aware of the data?
    Event Time
    When did the data happen?
    [1] Tyler Akidau, “Streaming 101: The world beyond batch”
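The two timestamps rarely agree, because events can arrive late. A minimal sketch in plain Python (hypothetical event records, no framework assumed) that assigns events to one-minute windows by event time:

```python
from datetime import datetime, timedelta

# Each event carries both timestamps: when it happened (event time)
# and when the system first saw it (processing time).
events = [
    {"value": 10,
     "event_time": datetime(2020, 1, 23, 12, 0, 5),
     "processing_time": datetime(2020, 1, 23, 12, 0, 6)},
    {"value": 20,
     "event_time": datetime(2020, 1, 23, 12, 0, 55),
     "processing_time": datetime(2020, 1, 23, 12, 2, 30)},  # arrived late
]

def window_start(ts, size=timedelta(minutes=1)):
    """Align a timestamp to the start of its fixed-size window."""
    return datetime.min + (ts - datetime.min) // size * size

# Grouping by event time places the late event in the window where it
# actually happened; grouping by processing time would put it two
# windows later.
by_event_time = {}
for e in events:
    by_event_time.setdefault(window_start(e["event_time"]), []).append(e["value"])
```

Stream processors close this gap with watermarks: a heuristic for how long to wait for late event-time data before finalizing a window.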

  18. Characteristics of Data
    Bounds: Bounded vs. Unbounded
    Time: Time-agnostic vs. Time-sensitive
    Order: Order-dependent vs. Order-independent
    Structure: Structured vs. Unstructured

  19. Data Processing - Batch & Streaming
    Batch Processing
    1. High latency
    2. Bounded data
    3. Time-agnostic
    4. Strong consistency
    5. Correctness = completeness
    Stream Processing
    1. Low latency
    2. Unbounded data
    3. Time-sensitive
    4. Weak consistency
    5. Correctness = approx./time
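The contrast between the two columns can be sketched without any framework: batch sees the whole bounded input and produces one exact answer, while streaming consumes a possibly unbounded iterator and keeps emitting a running result:

```python
from collections import Counter

def batch_word_count(lines):
    """Batch: the complete, bounded input is available up front;
    one pass, one exact, final result."""
    return Counter(word for line in lines for word in line.split())

def streaming_word_count(line_stream):
    """Streaming: consume a (potentially unbounded) iterator and emit
    a running snapshot after every element; there is no final answer,
    only the most recent one."""
    counts = Counter()
    for line in line_stream:
        counts.update(line.split())
        yield dict(counts)
```

The streaming version has the same shape as the Spark example at the end of this deck, minus the engine.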

  20. Where can I start with Data Engineering?

  21. The Bible
    If I could read only one book, which one would it be?

  22. Give processing engines a test drive!
    Get Apache Spark running on your machine and give the examples a spin!
    https://spark.apache.org/examples.html

  23. Hack on a side-project!
    Do you know what the trending restaurants in your favorite city are?
    Trending Restaurants Project
    1. Choose a favorite city
    2. Scrape GMaps reviews of restaurants there
    3. Process them w/ Apache Spark
    4. Calculate trends for each block in the city ;)
    5. Visualize!
    6. $
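Step 4 is the interesting bit: a grouped, week-over-week aggregation. A toy sketch of just that step in plain Python (the review tuples, field names and the "more reviews than last week" trend rule are all illustrative assumptions):

```python
from collections import defaultdict

# Hypothetical scraped reviews: (restaurant, city_block, week, rating)
reviews = [
    ("Pasta Bar", "block-1", 1, 4),
    ("Pasta Bar", "block-1", 2, 5),
    ("Pasta Bar", "block-1", 2, 5),
    ("Taco Spot", "block-2", 1, 3),
]

def weekly_counts(rows):
    """Count reviews per (block, restaurant, week) as a crude popularity signal."""
    counts = defaultdict(int)
    for restaurant, block, week, _rating in rows:
        counts[(block, restaurant, week)] += 1
    return counts

def trending(counts):
    """Flag a restaurant in a block as trending in a week when its
    review count grew compared to the previous week."""
    return [
        (block, restaurant, week)
        for (block, restaurant, week), n in counts.items()
        if n > counts.get((block, restaurant, week - 1), 0) > 0
    ]
```

In the real project the same group-and-aggregate would run as a Spark job over the full scraped dataset.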

  24. Out in the Wild
    ● Uber’s Big Data Platform - https://eng.uber.com/uber-big-data-platform
    ● Netflix’s Data Platform - https://www.youtube.com/watch?v=CSDIThSwA7s
    ● How Netflix Handles Data Streams - https://www.youtube.com/watch?v=WuRazsX-MBY
    ● Delivering High Quality Analytics at Netflix - https://www.youtube.com/watch?v=nMyuCdqzpZc
    ● Twitter's Petabyte-Scale Data Architecture on GCP - https://www.youtube.com/watch?v=rBNFwdVDlyo
    ● Scaling Apache Spark at Facebook (Databricks, ‘19) - https://www.youtube.com/watch?v=5O03Q4UetnA
    Articles
    ● Data & AI Trends of 2019 - https://mattturck.com/data2019
    ● History of Hadoop - https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704
    Books
    ● Designing Data Intensive Applications - https://dataintensive.net
    ● Hadoop: The Definitive Guide - https://www.amazon.com/dp/1491901632
    ● Hadoop Application Architectures - https://www.amazon.com/dp/1491901632
    MOARRR LINKS

  25. Thank You!

  26. Spark Code Example
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

    # Create DataFrame representing the stream of input lines
    # from a connection to localhost:9999
    lines = spark \
        .readStream \
        .format("socket") \
        .option("host", "localhost") \
        .option("port", 9999) \
        .load()

    # Split the lines into words
    words = lines.select(
        explode(
            split(lines.value, " ")
        ).alias("word")
    )

    # Generate a running word count
    wordCounts = words.groupBy("word").count()

    # Print the running counts to the console as they update
    query = wordCounts \
        .writeStream \
        .outputMode("complete") \
        .format("console") \
        .start()
    query.awaitTermination()

  27. Netflix’s Data Platform, circa 2017-18

  28. Uber’s Generic Data Platform Model, 2018-Present
