Slide 7
Data Lake (since 2010)
Timeline: late 1980s to 2020
- Structured, Semi-Structured & Unstructured Data
- Schema on Read
- Hive Table Format
- Storage and Compute decoupling
- Open Data formats like CSV, Avro, Parquet, ORC
- Lower cost
- Supports ML use cases
- No metadata layer, no ACID support
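
A minimal sketch of what "Schema on Read" means in practice, assuming a small CSV file has landed in the lake unvalidated. The sample data, the `SCHEMA` mapping, and the `read_with_schema` helper are hypothetical names chosen for illustration; real engines such as Spark or Presto do this at much larger scale.

```python
import csv
import io

# Raw records land in the lake as plain text; nothing validates them on write.
RAW_CSV = (
    "user_id,amount,ts\n"
    "1,9.50,2020-01-03\n"
    "2,n/a,2020-01-04\n"
    "3,12.00,2020-01-04\n"
)

# Hypothetical schema chosen by the reader; it is not stored with the data.
SCHEMA = {"user_id": int, "amount": float, "ts": str}


def read_with_schema(raw_text, schema):
    """Apply column types while reading and collect rows that do not fit."""
    good, bad = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            good.append({col: cast(row[col]) for col, cast in schema.items()})
        except (KeyError, ValueError):
            # The bad row stays in the lake untouched; only this read rejects it.
            bad.append(row)
    return good, bad


rows, rejected = read_with_schema(RAW_CSV, SCHEMA)
```

Because the schema lives with the reader, a different consumer could read the same file with a different schema. The flip side, in line with the last bullet, is that malformed rows are only discovered at read time.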
A data lake is often used in conjunction with a data warehouse: raw data is stored in
the lake, then cleansed and aggregated in the warehouse.
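
The lake-to-warehouse flow can be sketched roughly as follows. This is an illustration under stated assumptions, not a real pipeline: the `raw_events` records and the `cleanse` and `aggregate_by_day` helpers are hypothetical, and in-memory lists stand in for lake files and warehouse tables.

```python
from collections import defaultdict

# Hypothetical raw events as they might sit in the lake: strings, duplicates, gaps.
raw_events = [
    {"order_id": "a1", "day": "2020-01-03", "amount": "9.50"},
    {"order_id": "a1", "day": "2020-01-03", "amount": "9.50"},  # duplicate
    {"order_id": "a2", "day": "2020-01-03", "amount": ""},      # missing amount
    {"order_id": "a3", "day": "2020-01-04", "amount": "12.00"},
]


def cleanse(events):
    """Drop rows with no amount and repeated order ids; cast amount to float."""
    seen = set()
    for event in events:
        if not event["amount"] or event["order_id"] in seen:
            continue
        seen.add(event["order_id"])
        yield {
            "order_id": event["order_id"],
            "day": event["day"],
            "amount": float(event["amount"]),
        }


def aggregate_by_day(events):
    """Sum amounts per day, the kind of table a warehouse would serve."""
    totals = defaultdict(float)
    for event in events:
        totals[event["day"]] += event["amount"]
    return dict(totals)


daily_totals = aggregate_by_day(cleanse(raw_events))
```

The raw list is left unchanged, mirroring how the lake keeps the original data while the warehouse holds the curated result.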
Started with Hadoop, using MapReduce for compute and HDFS for storage
Evolved toward cloud object storage (S3, ADLS, GCS) paired with separate query engines (Spark, Presto)
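
One way to picture how a query engine works against object storage is the Hive-style directory layout, where partition values are encoded in the path as `key=value` segments. The sketch below assumes a hypothetical `sales` table and made-up object keys; `parse_partition` and `prune` are illustrative helpers, not part of any engine's API.

```python
# Hypothetical object keys for a "sales" table laid out Hive-style.
object_keys = [
    "sales/year=2019/month=12/part-0000.parquet",
    "sales/year=2020/month=01/part-0000.parquet",
    "sales/year=2020/month=02/part-0000.parquet",
]


def parse_partition(path):
    """Extract key=value directory segments from a Hive-style path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def prune(keys, **wanted):
    """Keep only the files whose partition values match the filter."""
    return [
        key
        for key in keys
        if all(parse_partition(key).get(col) == val for col, val in wanted.items())
    ]


selected = prune(object_keys, year="2020")
```

Pruning on path names alone lets the compute side skip files without opening them, which is part of what makes decoupled storage and compute workable.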