Common Technical Challenges
➔ What are some of the common technical challenges?
Ingestion, buffering, processing, storage, exploration, serving and governance
Storage
I have some data and I want to store it somewhere I can process it later in multiple ways. Typical scenario for data warehousing.
[Diagram labels: Distributed File System (Rack 1, Rack 2, Rack 3); Database: Google Bigtable; BI Analytics: Tableau; Batch Process: Apache Spark]
Processing
I want to transform my data in a way that allows me to achieve X (e.g. cleaning up, model training). Typical scenario for data processing engines.
Source: “Making sense of stream processing”, Martin Kleppmann
Serving Results
I’ve ingested, stored and processed my data, and now I want to serve it to the appropriate consumer. Typical scenario for ML model training, data materialization (search … etc) and governance.
[Diagram labels: Distributed File System; Governance; {api}]
How big is Big Data?
Volume: GB? TB? PB? More?!
Velocity: What is the rate of data generation?
Variety: Type and nature of data, e.g. text, numeric, relational, unstructured, formatted … etc
Characteristics of Data
Bounds: Bounded vs. Unbounded
Time: Time-agnostic vs. Time-sensitive
Order: Order-dependent vs. Order-independent
Structure: Structured vs. Unstructured
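To make two of these axes concrete, here is a minimal Python sketch contrasting a bounded list with an unbounded generator, and an order-independent aggregate (a sum) with an order-dependent one (the latest value). The function names and the sample numbers are made up for illustration and are not from the talk.

```python
from itertools import count, islice
from typing import Iterable, Iterator, List


def bounded_source() -> List[int]:
    """A bounded dataset: finite and fully available before processing starts."""
    return [3, 1, 4, 1, 5, 2]  # made-up sample values


def unbounded_source() -> Iterator[int]:
    """An unbounded dataset: values keep arriving and there is no 'end'."""
    return count(start=1)  # 1, 2, 3, ... forever


def running_sum(values: Iterable[int]) -> Iterator[int]:
    """Incremental aggregate that works on bounded and unbounded input alike."""
    total = 0
    for v in values:
        total += v
        yield total


# Bounded: we can wait for the input to finish and compute one final answer.
batch_total = sum(bounded_source())

# Unbounded: there is no final answer, only results "so far".
first_five = list(islice(running_sum(unbounded_source()), 5))

# Order-independent: a sum is the same for any arrival order.
order_independent = sum(sorted(bounded_source())) == batch_total

# Order-dependent: the "latest value" changes when the arrival order changes.
latest_as_given = bounded_source()[-1]
latest_sorted = sorted(bounded_source())[-1]
```

Calling `sum()` directly on `unbounded_source()` would never return, which is why unbounded data is usually processed incrementally or in windows.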
Time Domain [1]
Processing Time: When did the system become aware of the data?
Event Time: When did the event that produced the data actually happen?
[1] Tyler Akidau, “Streaming 101: The world beyond batch”
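The difference between the two time domains shows up as soon as data arrives late. The sketch below groups the same three events into 60-second tumbling windows, once by event time and once by processing time. The `Event` type, the window size and the sample events are assumptions for illustration and are not taken from the cited article.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    event_time: int       # seconds since start: when it happened
    processing_time: int  # seconds since start: when the system saw it
    value: int


def tumbling_window_sums(events: List[Event], key: str, size: int = 60) -> Dict[int, int]:
    """Sum values per fixed-size window, keyed by the window's start second.

    `key` selects which timestamp to window on: "event_time" or "processing_time".
    """
    sums: Dict[int, int] = defaultdict(int)
    for e in events:
        window_start = (getattr(e, key) // size) * size
        sums[window_start] += e.value
    return dict(sums)


# Made-up sample: the second event happened at t=50s but was only seen at t=130s
# (for example, the device was offline for a while).
events = [
    Event(event_time=5, processing_time=7, value=1),
    Event(event_time=50, processing_time=130, value=1),
    Event(event_time=70, processing_time=72, value=1),
]

by_event_time = tumbling_window_sums(events, "event_time")
by_processing_time = tumbling_window_sums(events, "processing_time")
```

Windowing by event time puts the delayed event back into the first minute, where it happened. Windowing by processing time counts it in the third minute, where the system happened to see it.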
Hack on a side-project!
Trending Restaurants Project
1. Choose favorite city
2. Scrape GMaps reviews of restaurants there
3. Process them w/ Apache Spark
4. Calculate trends for each block in the city ;)
5. Visualize!
6. $
Do you know what the trending restaurants in your favorite city are?
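As a starting point for steps 3 and 4, here is a small plain-Python stand-in for what would later become a Spark job. It assumes one possible definition of "trend", the week-over-week change in review count per restaurant, and reports the fastest-growing restaurant in each block. The `Review` fields, the block identifiers and the sample reviews are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Tuple


class Review(NamedTuple):
    block: str       # hypothetical city-block identifier
    restaurant: str
    week: int        # week number in which the review was posted


def trending_per_block(reviews: List[Review], this_week: int) -> Dict[str, Tuple[str, int]]:
    """For each block, return the restaurant with the largest week-over-week
    increase in review count, together with that increase."""
    current = Counter((r.block, r.restaurant) for r in reviews if r.week == this_week)
    previous = Counter((r.block, r.restaurant) for r in reviews if r.week == this_week - 1)

    best: Dict[str, Tuple[str, int]] = {}
    # Sorted iteration makes ties resolve deterministically (first name wins).
    for block, restaurant in sorted(set(current) | set(previous)):
        growth = current[(block, restaurant)] - previous[(block, restaurant)]
        if block not in best or growth > best[block][1]:
            best[block] = (restaurant, growth)
    return best


# Made-up sample reviews for two blocks over two weeks.
reviews = [
    Review("block-A", "Noodle Bar", 1),
    Review("block-A", "Noodle Bar", 2),
    Review("block-A", "Noodle Bar", 2),
    Review("block-A", "Taco Stand", 1),
    Review("block-A", "Taco Stand", 1),
    Review("block-A", "Taco Stand", 2),
    Review("block-B", "Pizza Place", 2),
]

trending = trending_per_block(reviews, this_week=2)
```

The same shape (group by block and restaurant, count per period, compare periods) should carry over to a distributed engine. Other trend definitions, such as change in average rating, would slot into the same structure.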
Out in the Wild
● Uber’s Big Data Platform - https://eng.uber.com/uber-big-data-platform
● Netflix’s Data Platform - https://www.youtube.com/watch?v=CSDIThSwA7s
● How Netflix Handles Data Streams - https://www.youtube.com/watch?v=WuRazsX-MBY
● Delivering High Quality Analytics at Netflix - https://www.youtube.com/watch?v=nMyuCdqzpZc
● Twitter's Petabyte-Scale Data Architecture on GCP - https://www.youtube.com/watch?v=rBNFwdVDlyo
● Scaling Apache Spark at Facebook (Databricks, ‘19) - https://www.youtube.com/watch?v=5O03Q4UetnA
Articles
● Data & AI Trends of 2019 - https://mattturck.com/data2019
● History of Hadoop - https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704
Books
● Designing Data Intensive Applications - https://dataintensive.net
● Hadoop: The Definitive Guide - https://www.amazon.com/dp/1491901632
● Hadoop Application Architectures - https://www.amazon.com/dp/1491901632
MOARRR LINKS