Part 2 Modern Data Warehouse with Azure

Nilesh Gule
September 01, 2020

Slide deck of the presentation for the SQL Pass SG user group on 1 September 2020. The session covered Azure Databricks features such as notebooks, autoscaling and cluster types, along with Azure Key Vault. It also looked at the performance benefits of columnar data formats such as Parquet or ORC compared to flat files such as CSV.

The recording of the session is available on YouTube
https://youtu.be/0CWNsqNlbao?WT.mc_id=DP-MVP-5003170

Transcript

  1. Nilesh Gule @nileshgule | www.HandsOnArchitect.com Modern Data Warehouse Using Azure

  2. $whoami { "name" : "Nilesh Gule", "website" : "https://www.HandsOnArchitect.com", "github" : "https://github.com/NileshGule", "twitter" : "@nileshgule", "linkedin" : "https://www.linkedin.com/in/nileshgule", "email" : "nileshgule@gmail.com", "likes" : "Technical Evangelism, Cricket", "co-organizer" : "Azure Singapore UG" }
  3. None
  4. Credits: James Serra

  5. None
  6. Part 1 - Summary – ADLS & ADF
    ADLS - Main features
    • Petabyte scale storage
    • Hierarchical namespace
    • Hadoop compatible access with ABFS driver
    ADF - Main features
    • Cloud ETL service
    • Scale-out serverless data integration & data transformation
    • Code-free UI
    • Monitoring & Management
  7. None
  8. Azure Databricks
    • Collaborative Spark based Analytical service
    • Optimized Spark environment for data pipelines and ML
    • ACID transactions on Spark
    • Streaming & Batch unification
    • Schema enforcement
    • Time Travel
    • Upserts & Deletes
    https://docs.databricks.com/delta/delta-intro.html
  9. Benefits of using Apache Spark
    • Speed
        • Up to 100x faster compared to MapReduce
    • Ease of use
        • Easy to use APIs
        • Multi language support
        • 100+ operators
    • Unified engine
        • Higher level libraries & support for SQL queries, streaming data, machine learning and graph processing
    • Runs everywhere
        • Hadoop, standalone, Mesos, Kubernetes, cloud
    https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  10. Apache Spark Components
    • Dataset, DataFrame, RDD
        • Distributed collection of data
    • SparkSession
        • Entry point into Spark API
        • SparkContext, SQLContext, StreamingContext unified into one
    • Executors
        • Handle distributed processing
    • Transformations & Actions
        • Transformations – lazy operations that return immutable data structures
        • Actions – apply operations and return a value or write data to external storage
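
The split between lazy transformations and eager actions can be illustrated without a cluster. What follows is a minimal plain-Python analogy built on generators, not the Spark API: `lazy_map`, `lazy_filter` and `collect` are hypothetical stand-ins named after the slide's terms, and the input list is made up for the example.

```python
from typing import Callable, Iterable, Iterator, List

calls: List[int] = []  # records every element the map function actually touches


def lazy_map(func: Callable[[int], int], data: Iterable[int]) -> Iterator[int]:
    """Stand-in for a transformation: describes work, computes nothing yet."""
    for item in data:
        calls.append(item)
        yield func(item)


def lazy_filter(pred: Callable[[int], bool], data: Iterable[int]) -> Iterator[int]:
    """Another transformation stand-in; evaluation is also deferred."""
    return (item for item in data if pred(item))


def collect(data: Iterable[int]) -> List[int]:
    """Stand-in for an action: forces evaluation and returns a value."""
    return list(data)


source = [1, 2, 3, 4, 5]  # illustrative data only

# Chaining transformations only builds the pipeline.
plan = lazy_filter(lambda x: x > 20, lazy_map(lambda x: x * 10, source))
work_before_action = len(calls)

# The action pulls every element through the pipeline.
result = collect(plan)
work_after_action = len(calls)
```

In this sketch no element is processed until `collect` runs, which mirrors the way Spark defers transformations until an action requests a result.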
  11. Spark Common Transformations
    • map
    • flatMap
    • filter
    • distinct
    • sample(withReplacement, ..)
    • union
    • intersection
    • subtract
    • cartesian
    • reduceByKey
    • groupByKey
    • sortByKey
    • join
    • repartition
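
To make the key-based transformations in this list concrete, here is a small plain-Python sketch of what `groupByKey`, `reduceByKey` and `sortByKey` compute on a collection of key/value pairs. It models the semantics only and is not the Spark API: the helper names `group_by_key`, `reduce_by_key` and `sort_by_key` are hypothetical, and the sample pairs are invented for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def group_by_key(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Collect every value that shares a key into one list."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_by_key(func: Callable[[int, int], int],
                  pairs: Iterable[Pair]) -> Dict[str, int]:
    """Fold the values of each key into a single value as they arrive."""
    reduced: Dict[str, int] = {}
    for key, value in pairs:
        reduced[key] = func(reduced[key], value) if key in reduced else value
    return reduced


def sort_by_key(pairs: Iterable[Pair]) -> List[Pair]:
    """Order the pairs by their key."""
    return sorted(pairs, key=lambda pair: pair[0])


# Illustrative (genre, rating) pairs; not taken from the talk.
ratings = [("drama", 4), ("comedy", 3), ("drama", 5), ("comedy", 2)]

grouped = group_by_key(ratings)
totals = reduce_by_key(lambda a, b: a + b, ratings)
ordered = sort_by_key(totals.items())
```

Note that `reduce_by_key` never holds more than one value per key, while `group_by_key` keeps them all. In Spark the same difference means `reduceByKey` can combine values on each partition before the shuffle, which is why it is usually preferred for aggregations.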
  12. Spark Common Actions
    • collect
    • count
    • countByValue
    • take(num)
    • top(num)
    • reduce(func)
    • fold(zero)(func)
    • saveAsTextFile(path)
    • saveAsSequenceFile(path)
    • countByKey()
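
Several of these actions have close equivalents in the Python standard library, which can help when reasoning about what each one returns. The sketch below is an analogy on a local list rather than Spark code, and the numbers are made up for the example.

```python
import heapq
from collections import Counter
from functools import reduce

values = [3, 1, 4, 1, 5, 9, 2, 6]  # illustrative data only

# count: number of elements
count = len(values)

# countByValue: how often each distinct value occurs
count_by_value = Counter(values)

# take(3): the first three elements in their existing order
first_three = values[:3]

# top(3): the three largest elements, largest first
top_three = heapq.nlargest(3, values)

# reduce(func): combine all elements pairwise; needs at least one element
total = reduce(lambda a, b: a + b, values)

# fold(zero)(func): like reduce, but starts from a zero value,
# so it also works on an empty collection
total_from_zero = reduce(lambda a, b: a + b, values, 0)
empty_fold = reduce(lambda a, b: a + b, [], 0)
```

Unlike the transformations on the previous slide, each of these returns a concrete value to the caller, which is what makes them actions.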
  13. Azure Databricks - clusters

  14. Azure Databricks - Performance
    [Four bar charts comparing CSV, Parquet, ORC and Delta files]
    • Non partitioned data aggregation
    • Partitioned data load performance
    • Partitioned data aggregation
    • Non partitioned load performance
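
One reason columnar formats tend to help with aggregations can be shown with a toy model. The sketch below is plain Python, not Parquet or ORC, and it measures nothing: it only counts how many values must be read to average one column in a row layout versus a column layout. The sample data and the function names are invented for illustration.

```python
import csv
import io
from typing import Dict, List, Tuple

# Made-up sample in row-oriented (CSV) form.
CSV_TEXT = "movie_id,title,rating\n1,Alpha,4.0\n2,Beta,3.0\n3,Gamma,5.0\n"


def average_from_rows(text: str) -> Tuple[float, int]:
    """Row layout: every field of every row is parsed to reach one column."""
    fields_parsed = 0
    ratings: List[float] = []
    for row in csv.DictReader(io.StringIO(text)):
        fields_parsed += len(row)
        ratings.append(float(row["rating"]))
    return sum(ratings) / len(ratings), fields_parsed


def to_columns(text: str) -> Dict[str, List[str]]:
    """Pivot the rows once into a column-oriented layout."""
    reader = csv.DictReader(io.StringIO(text))
    columns: Dict[str, List[str]] = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def average_from_columns(columns: Dict[str, List[str]]) -> Tuple[float, int]:
    """Column layout: only the one column being aggregated is read."""
    ratings = [float(value) for value in columns["rating"]]
    return sum(ratings) / len(ratings), len(ratings)


row_average, row_values_read = average_from_rows(CSV_TEXT)
col_average, col_values_read = average_from_columns(to_columns(CSV_TEXT))
```

Both layouts give the same average, but the row layout reads all nine fields while the column layout reads three. Real columnar formats add per-column compression and statistics on top of this, so the toy model should be read as intuition rather than as an explanation of the chart values.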
  15. Azure Key Vault

  16. Azure Key Vault
    • Safeguards cryptographic keys and secrets used by cloud apps
    • Stores
        • Encryption keys used for BYOK solution
        • Storage account access keys
        • Secrets / connection strings
        • Certificates
  17. Summary
    Azure Databricks - Main features
    • Collaborative Spark based Analytical service
    • Different cluster types (automated / interactive / pool)
    • Autoscale based on workloads
    • Fine grained access controls
  18. Azure Databricks • Apache Spark • Databricks documentation • MovieLens datasets • Secrets Scope

  19. References – MS Learn https://docs.microsoft.com/en-us/learn/paths/data-engineer-azure-databricks/

  20. Thank you very much
    Code with Passion and Strive for Excellence
    https://www.slideshare.net/nileshgule/presentations
    https://speakerdeck.com/nileshgule/
  21. Nilesh Gule
    ARCHITECT | MICROSOFT MVP
    “Code with Passion and Strive for Excellence”
    nileshgule • @nileshgule • Nilesh Gule • NileshGule • www.handsonarchitect.com
  22. Q&A