What do we talk about when we talk about data?

A tour of the Big Data problems and technical challenges.


Ahmad Alhour

January 23, 2020


  1. What do we talk about when we talk about data?

    A tour of the Big Data field and all that jazz... 23.01.2020
  3. Some Business Problems...

    1. (Real-time) Data Processing (e.g.: IoT; Gaming)
    2. Data Integration (e.g.: Salesforce; CRM; PMS)
    3. Data Management (e.g.: TripAdvisor User Reviews; Metadata)
    4. Exploration and Governance (e.g.: Lyft’s internal mapping data)
    5. Analytics & Machine Learning (e.g.: Amazon Recommender Systems)
  4. Common Technical Challenges

    ➔ What are some of the common technical challenges?
    Ingestion, buffering, processing, storage, exploration, serving and governance.
  5. Ingestion

    Data lives somewhere else, and we need to collect it. Typical scenario for data integration, acquisition and collection.
    [Diagram label: Buffer]
  6. Storage

    I have some data and I want to store it somewhere I can process it later in multiple ways. Typical scenario for data warehousing.
    [Diagram labels: Distributed File System (RACK 1, RACK 2, RACK 3); Database (Google Bigtable); BI Analytics (Tableau); Batch Process (Apache Spark)]
  7. Processing

    I want to transform my data in a way that allows me to achieve X (cleaning up, model training). Typical scenario for data processing engines.
    [Figure source: “Making sense of stream processing”, Martin Kleppmann]
  8. Serving Results

    I’ve ingested, stored and processed my data, and now I want to serve it to the appropriate consumer. Typical scenario for ML model training, data materialization (search, etc.) and governance.
    [Diagram labels: Distributed File System; Governance; {api}]
  9. Data Architecture Blueprint

    [Diagram labels: Ingest; Store; Process; Serve; Orchestrate; {api}]
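The stages named on the blueprint slide can be sketched as plain functions chained together. This is only an illustrative toy, not the architecture from the talk: the function names, the in-memory list used as "storage", and the sample events are all assumptions made for the example.

```python
from collections import Counter


def ingest(source):
    """Collect raw records from wherever the data lives (here: an in-memory list)."""
    for record in source:
        yield record


def store(records):
    """Persist records so they can be processed later (here: just a Python list)."""
    return list(records)


def process(stored):
    """Transform stored data into something useful: count events per user."""
    return Counter(record["user"] for record in stored)


def serve(result, user):
    """Answer a consumer's query from the processed result."""
    return result.get(user, 0)


def run_pipeline(source):
    """Orchestrate the stages in order: ingest, then store, then process."""
    return process(store(ingest(source)))


# Hypothetical sample events, invented for this sketch
raw_events = [
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "view"},
    {"user": "alice", "action": "view"},
]

counts = run_pipeline(raw_events)
print(serve(counts, "alice"))  # 2
```

In a real platform each function would be a separate system (a collector, a distributed file system, a processing engine, an API), and the orchestration step would be a scheduler rather than a function call.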

  10. Data Constraints

    ➔ How big is Big Data?
    ➔ What data characteristics dictate our approach to solving problems?
  11. How big is Big Data?

    Volume: GB? TB? PB? More?!
    Velocity: What is the rate of data generation?
    Variety: Type and nature of data, e.g.: text, numeric, relational, unstructured, formatted, etc.
  12. Characteristics of Data

    Bounds: Bounded vs. Unbounded
    Time: Time-agnostic vs. Time-sensitive
    Order: Order-dependent vs. Order-independent
    Structure: Structured vs. Unstructured
  13. Bounded vs. Unbounded

    Logs for Dec. 19-29 vs. live app logs & statistics
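A small sketch of the difference, in plain Python rather than any particular engine: a bounded dataset can be read to the end and summarized once, while an unbounded stream never ends, so the best we can do is emit an updated answer per element. The `live_latencies` generator is a hypothetical stand-in for live app statistics.

```python
import itertools
import random


def bounded_average(values):
    """Bounded input: the full dataset is available, so one final answer exists."""
    values = list(values)
    return sum(values) / len(values)


def running_average(stream):
    """Unbounded input: never 'done', so yield an updated average per element."""
    total = 0.0
    count = 0
    for value in stream:
        total += value
        count += 1
        yield total / count


def live_latencies(seed=0):
    """Hypothetical endless source standing in for live app statistics."""
    rng = random.Random(seed)
    while True:
        yield rng.uniform(10, 100)


# Bounded: an archived, finite set of measurements
archived = [12.0, 15.0, 9.0, 24.0]
print(bounded_average(archived))  # 15.0

# Unbounded: we can only ever observe a prefix of the stream
first_five = list(itertools.islice(running_average(live_latencies()), 5))
print(len(first_five))  # 5
```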
  14. Order-dependent vs. Order-independent Change data replication vs. Element-wise filtering
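To make the contrast concrete, here is a minimal sketch with invented data: replaying a change log gives a different final state if the order changes, whereas an element-wise filter keeps the same elements no matter how the input is shuffled. The operation names (`upsert`, `delete`) and record shapes are assumptions for illustration.

```python
import random


def apply_changes(changes):
    """Order-dependent: replay a change log to rebuild the current state."""
    state = {}
    for op, key, value in changes:
        if op == "upsert":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


def keep_errors(events):
    """Order-independent: each element is judged on its own."""
    return [e for e in events if e["level"] == "error"]


# Hypothetical change log for a single key
change_log = [
    ("upsert", "user:1", "Berlin"),
    ("upsert", "user:1", "Munich"),
    ("delete", "user:1", None),
    ("upsert", "user:1", "Hamburg"),
]
print(apply_changes(change_log))                  # {'user:1': 'Hamburg'}
print(apply_changes(list(reversed(change_log))))  # {'user:1': 'Berlin'}

# Filtering returns the same elements regardless of arrival order
events = [{"id": i, "level": "error" if i % 3 == 0 else "info"} for i in range(9)]
shuffled = events[:]
random.Random(42).shuffle(shuffled)
same = sorted(e["id"] for e in keep_errors(events)) == sorted(
    e["id"] for e in keep_errors(shuffled)
)
print(same)  # True
```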

  15. Structured vs. Unstructured Stars-ranked ratings vs. User generated free text

  16. Time-agnostic vs. Time-sensitive

    Filter traffic events for a single source (don’t care about time) vs. Trends analysis (care about time)
  17. Time Domain [1]

    Processing Time: When did the system become aware of the data?
    Event Time: When did the data happen?

    [1] Tyler Akidau, “Streaming 101: The world beyond batch”
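The two time domains give different answers as soon as any event arrives late. The sketch below counts events per 60-second window twice, once by event time and once by processing time. The field names, window size, and timestamps (seconds since an arbitrary start) are assumptions made for this example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed fixed window size for this sketch


def window_start(timestamp):
    """Map a timestamp to the start of the fixed window that contains it."""
    return timestamp - (timestamp % WINDOW_SECONDS)


def count_per_window(events, time_field):
    """Count events per window, using the chosen notion of time."""
    counts = defaultdict(int)
    for event in events:
        counts[window_start(event[time_field])] += 1
    return dict(sorted(counts.items()))


# Hypothetical events; the third one happened at t=58 but arrived at t=125
events = [
    {"event_time": 5, "processing_time": 7},
    {"event_time": 50, "processing_time": 52},
    {"event_time": 58, "processing_time": 125},
    {"event_time": 70, "processing_time": 71},
]

print(count_per_window(events, "event_time"))       # {0: 3, 60: 1}
print(count_per_window(events, "processing_time"))  # {0: 2, 60: 1, 120: 1}
```

Grouping by event time reflects when things actually happened, but it requires deciding how long to wait for late data; grouping by processing time is simpler but misattributes the delayed event.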
  18. Characteristics of Data

    Bounds: Bounded vs. Unbounded
    Time: Time-agnostic vs. Time-sensitive
    Order: Order-dependent vs. Order-independent
    Structure: Structured vs. Unstructured
  19. Data Processing - Batch & Streaming

    Batch Processing
    1. High Latency
    2. Bounded data
    3. Time-agnostic
    4. Strong Consistency
    5. Correctness = Completeness

    Stream Processing
    1. Low Latency
    2. Unbounded data
    3. Time-sensitive
    4. Weak Consistency
    5. Correctness = Approx./Time
  20. Where can I start with Data Engineering?

  21. The Bible

    If I could read only one book, what would it be?
  22. Give processing engines a test drive!

    Get Apache Spark running on your machine and give the examples a spin!
    https://spark.apache.org/examples.html
  23. Hack on a side-project!

    Do you know what the trending restaurants in your favorite city are?

    Trending Restaurants Project
    1. Choose favorite city
    2. Scrape GMaps reviews of restaurants there
    3. Process them w/ Apache Spark
    4. Calculate trends for each block in the city ;)
    5. Visualize!
    6. $
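As a rough idea of what step 4 could look like before reaching for Spark, the sketch below defines a block's "trend" as the change in review count from the previous week to the current one. That definition, the record fields (`block`, `week`, `stars`), and the sample reviews are all assumptions for illustration; the slide does not specify how trends are calculated.

```python
from collections import Counter


def weekly_counts(reviews):
    """Count reviews per (block, week) pair."""
    return Counter((r["block"], r["week"]) for r in reviews)


def trend_per_block(reviews, current_week):
    """Assumed trend metric: reviews this week minus reviews last week, per block."""
    counts = weekly_counts(reviews)
    blocks = {r["block"] for r in reviews}
    return {
        block: counts[(block, current_week)] - counts[(block, current_week - 1)]
        for block in sorted(blocks)
    }


# Hypothetical scraped reviews, already mapped to city blocks
reviews = [
    {"block": "A1", "week": 1, "stars": 4},
    {"block": "A1", "week": 2, "stars": 5},
    {"block": "A1", "week": 2, "stars": 5},
    {"block": "B7", "week": 1, "stars": 3},
    {"block": "B7", "week": 1, "stars": 4},
    {"block": "B7", "week": 2, "stars": 2},
]

print(trend_per_block(reviews, current_week=2))  # {'A1': 1, 'B7': -1}
```

The same group-and-count logic maps naturally onto a `groupBy` followed by `count` in a processing engine once the data no longer fits on one machine.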
  24. MOARRR LINKS

    Out in the Wild
    • Uber’s Big Data Platform - https://eng.uber.com/uber-big-data-platform
    • Netflix’s Data Platform - https://www.youtube.com/watch?v=CSDIThSwA7s
    • How Netflix Handles Data Streams - https://www.youtube.com/watch?v=WuRazsX-MBY
    • Delivering High Quality Analytics at Netflix - https://www.youtube.com/watch?v=nMyuCdqzpZc
    • Twitter's Petabyte-Scale Data Architecture on GCP - https://www.youtube.com/watch?v=rBNFwdVDlyo
    • Scaling Apache Spark at Facebook (Databricks, ‘19) - https://www.youtube.com/watch?v=5O03Q4UetnA

    Articles
    • Data & AI Trends of 2019 - https://mattturck.com/data2019
    • History of Hadoop - https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704

    Books
    • Designing Data Intensive Applications - https://dataintensive.net
    • Hadoop: The Definitive Guide - https://www.amazon.com/dp/1491901632
    • Hadoop Application Architectures - https://www.amazon.com/dp/1491901632
  25. Thank You!

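  26. Spark Code Example

        # Imports and session setup are not on the slide; they are added here
        # (as an assumption) so the snippet can run on its own
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import explode, split

        spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

        # Create DataFrame representing the stream of input lines
        # from connection to localhost:9999
        lines = spark \
            .readStream \
            .format("socket") \
            .option("host", "localhost") \
            .option("port", 9999) \
            .load()

        # Split the lines into words
        words = lines.select(
            explode(
                split(lines.value, " ")
            ).alias("word")
        )

        # Generate running word count
        wordCounts = words.groupBy("word").count()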
  27. Netflix’s Data Platform, circa 2017-18

  28. Uber’s Generic Data Platform Model, 2018-Present