
What do we talk about when we talk about data?

A tour of Big Data problems and technical challenges.

Ahmad Alhour

January 23, 2020

Transcript

  1. What do we talk about when we talk about data?
    A tour of the Big Data field and all that jazz...
    23.01.2020

  3. Some Business Problems...
    1. (Real-time) Data Processing (e.g.: IoT; Gaming)
    2. Data Integration (e.g.: SalesForce; CRM; PMS)
    3. Data Management (e.g.: TripAdvisor User Reviews; Metadata)
    4. Exploration and Governance (e.g.: Lyft’s internal mapping data)
    5. Analytics & Machine Learning (e.g.: Amazon Recommender Systems)

  4. Common Technical Challenges
    ➔ What are some of the common technical challenges?
    Ingestion, buffering, processing,
    storage, exploration, serving and
    governance

  5. Ingestion
    Data lives somewhere else, and we need to collect it.
    Typical scenario for data integration, acquisition and collection.
    [Diagram: external data sources feeding a buffer]

  6. Storage
    I have some data and I want to store it somewhere that lets me
    process it in multiple ways in the future.
    Typical scenario for data warehousing.
    [Diagram: a distributed file system spanning racks 1-3, feeding a
    database (Google Bigtable), batch processing (Apache Spark) and BI
    analytics (Tableau)]

  7. Processing
    I want to transform my data in a
    way that allows me to achieve X
    (cleaning up, model training).
    Typical scenario for data
    processing engines.
    “Making sense of stream processing”,
    Martin Kleppmann

  8. Serving Results
    I’ve ingested, stored and processed my data, and now I want to
    serve it to the appropriate consumer.
    Typical scenario for ML model training, data materialization
    (search … etc) and governance.
    [Diagram: distributed file system behind a governance layer and a
    serving API]

  9. Data Architecture Blueprint
    Ingest → Store → Process → Serve, orchestrated end to end.
    [Diagram: pipeline stages exposed via API endpoints]

  10. Data Constraints
    ➔ How big is Big Data?
    ➔ What data characteristics
    dictate our approach to solving
    problems?

  11. How big is Big Data?
    Volume: GB? TB? PB? More?!
    Velocity: what is the rate of data generation?
    Variety: type and nature of data, e.g.: text, numeric, relational,
    unstructured, formatted … etc

  12. Characteristics of Data
    Bounds: Bounded vs. Unbounded
    Time: Time-agnostic vs. Time-sensitive
    Order: Order-dependent vs. Order-independent
    Structure: Structured vs. Unstructured

  13. Bounded vs.
    Unbounded
    Logs for Dec. 19-29
    vs.
    Live app logs & statistics

  14. Order-dependent vs.
    Order-independent
    Change data replication
    vs.
    Element-wise filtering
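The distinction is easy to see in code. A minimal sketch, assuming toy key/value change events rather than a real CDC feed:

```python
def apply_changes(state, ops):
    """Change data replication: replay key/value updates. Last write
    wins, so the final state depends on the order of the operations."""
    for key, value in ops:
        state[key] = value
    return state

def keep_errors(records):
    """Element-wise filter: each record is judged on its own, so any
    ordering of the input produces the same set of results."""
    return {r for r in records if r.startswith("ERROR")}

# Replaying the same changes in a different order gives a different state...
ops = [("status", "active"), ("status", "deleted")]
# ...while filtering is insensitive to input order.
logs = ["ERROR disk full", "INFO boot ok"]
```

Order-dependent workloads therefore need ordering guarantees from the transport (per-key partitioning, sequence numbers); order-independent ones can be parallelized freely.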

  15. Structured vs.
    Unstructured
    Stars-ranked ratings
    vs.
    User generated free text

  16. Time-agnostic vs.
    Time-sensitive
    Filter traffic events for a
    single source (don’t care about time)
    vs.
    Trend analysis (care about time)

  17. Time Domain[1]
    Processing Time
    When did the system become
    aware of the data?
    Event Time
    When did the data happen?
    [1] Tyler Akidau, “Streaming 101: The world beyond batch”
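The two timestamps rarely agree, because events can arrive late. A minimal sketch in plain Python (hypothetical event records, no framework assumed) that assigns events to one-minute windows by event time:

```python
from datetime import datetime, timedelta

# Each event carries both timestamps: when it happened (event time)
# and when the system first saw it (processing time).
events = [
    {"value": 10,
     "event_time": datetime(2020, 1, 23, 12, 0, 5),
     "processing_time": datetime(2020, 1, 23, 12, 0, 6)},
    {"value": 20,
     "event_time": datetime(2020, 1, 23, 12, 0, 55),
     "processing_time": datetime(2020, 1, 23, 12, 2, 30)},  # arrived late
]

def window_start(ts, size=timedelta(minutes=1)):
    """Align a timestamp to the start of its fixed-size window."""
    return datetime.min + (ts - datetime.min) // size * size

# Grouping by event time places the late event in the window where it
# actually happened; grouping by processing time would put it two
# windows later.
by_event_time = {}
for e in events:
    by_event_time.setdefault(window_start(e["event_time"]), []).append(e["value"])
```

Stream processors close this gap with watermarks: a heuristic for how long to wait for late event-time data before finalizing a window.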

  18. Characteristics of Data
    Bounds: Bounded vs. Unbounded
    Time: Time-agnostic vs. Time-sensitive
    Order: Order-dependent vs. Order-independent
    Structure: Structured vs. Unstructured

  19. Data Processing - Batch & Streaming
    Batch Processing
    1. High latency
    2. Bounded data
    3. Time-agnostic
    4. Strong consistency
    5. Correctness = completeness
    Stream Processing
    1. Low latency
    2. Unbounded data
    3. Time-sensitive
    4. Weak consistency
    5. Correctness = approx./time
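The contrast between the two columns can be sketched without any framework: batch sees the whole bounded input and produces one exact answer, while streaming consumes a possibly unbounded iterator and keeps emitting a running result:

```python
from collections import Counter

def batch_word_count(lines):
    """Batch: the complete, bounded input is available up front;
    one pass, one exact, final result."""
    return Counter(word for line in lines for word in line.split())

def streaming_word_count(line_stream):
    """Streaming: consume a (potentially unbounded) iterator and emit
    a running snapshot after every element; there is no final answer,
    only the most recent one."""
    counts = Counter()
    for line in line_stream:
        counts.update(line.split())
        yield dict(counts)
```

The streaming version has the same shape as the Spark example at the end of this deck, minus the engine.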

  20. Where can I start with Data Engineering?

  21. The Bible
    If I could read only one book, which one would it be?

  22. Give processing engines a test drive!
    Get Apache Spark running on your machine and give the examples a spin!
    https://spark.apache.org/examples.html

  23. Hack on a side-project!
    Do you know what the trending restaurants in your favorite city are?
    Trending Restaurants Project
    1. Choose a favorite city
    2. Scrape GMaps reviews of restaurants there
    3. Process them w/ Apache Spark
    4. Calculate trends for each block in the city ;)
    5. Visualize!
    6. $
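Step 4 is the interesting bit: a grouped, week-over-week aggregation. A toy sketch of just that step in plain Python (the review tuples, field names and the "more reviews than last week" trend rule are all illustrative assumptions):

```python
from collections import defaultdict

# Hypothetical scraped reviews: (restaurant, city_block, week, rating)
reviews = [
    ("Pasta Bar", "block-1", 1, 4),
    ("Pasta Bar", "block-1", 2, 5),
    ("Pasta Bar", "block-1", 2, 5),
    ("Taco Spot", "block-2", 1, 3),
]

def weekly_counts(rows):
    """Count reviews per (block, restaurant, week) as a crude popularity signal."""
    counts = defaultdict(int)
    for restaurant, block, week, _rating in rows:
        counts[(block, restaurant, week)] += 1
    return counts

def trending(counts):
    """Flag a restaurant in a block as trending in a week when its
    review count grew compared to the previous week."""
    return [
        (block, restaurant, week)
        for (block, restaurant, week), n in counts.items()
        if n > counts.get((block, restaurant, week - 1), 0) > 0
    ]
```

In the real project the same group-and-aggregate would run as a Spark job over the full scraped dataset.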

  24. Out in the Wild
    ● Uber’s Big Data Platform - https://eng.uber.com/uber-big-data-platform
    ● Netflix’s Data Platform - https://www.youtube.com/watch?v=CSDIThSwA7s
    ● How Netflix Handles Data Streams - https://www.youtube.com/watch?v=WuRazsX-MBY
    ● Delivering High Quality Analytics at Netflix - https://www.youtube.com/watch?v=nMyuCdqzpZc
    ● Twitter's Petabyte-Scale Data Architecture on GCP - https://www.youtube.com/watch?v=rBNFwdVDlyo
    ● Scaling Apache Spark at Facebook (Databricks, ‘19) - https://www.youtube.com/watch?v=5O03Q4UetnA
    Articles
    ● Data & AI Trends of 2019 - https://mattturck.com/data2019
    ● History of Hadoop - https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704
    Books
    ● Designing Data Intensive Applications - https://dataintensive.net
    ● Hadoop: The Definitive Guide - https://www.amazon.com/dp/1491901632
    ● Hadoop Application Architectures - https://www.amazon.com/dp/1491901632
    MOARRR LINKS

  25. Thank You!

  26. Spark Code Example
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

    # Create DataFrame representing the stream of input lines
    # from a connection to localhost:9999
    lines = spark \
        .readStream \
        .format("socket") \
        .option("host", "localhost") \
        .option("port", 9999) \
        .load()

    # Split the lines into words
    words = lines.select(
        explode(
            split(lines.value, " ")
        ).alias("word")
    )

    # Generate a running word count
    wordCounts = words.groupBy("word").count()

    # Print the running counts to the console as they update
    query = wordCounts \
        .writeStream \
        .outputMode("complete") \
        .format("console") \
        .start()
    query.awaitTermination()

  27. Netflix’s Data Platform, circa 2017-18

  28. Uber’s Generic Data Platform Model, 2018-Present
