Part 2 Modern Data Warehouse with Azure

Nilesh Gule
September 01, 2020

Slide deck of the presentation for the SQL Pass SG user group on 1 September 2020. The session covered Azure Databricks and Azure Key Vault, including Azure Databricks features such as notebooks, autoscaling, and cluster types. It also looked at the performance benefits of columnar data formats such as Parquet or ORC compared to flat files such as CSV.

The recording of the session is available on YouTube
https://youtu.be/0CWNsqNlbao?WT.mc_id=DP-MVP-5003170


Transcript

  1. $whoami
    {
      "name" : "Nilesh Gule",
      "website" : "https://www.HandsOnArchitect.com",
      "github" : "https://github.com/NileshGule",
      "twitter" : "@nileshgule",
      "linkedin" : "https://www.linkedin.com/in/nileshgule",
      "email" : "[email protected]",
      "likes" : "Technical Evangelism, Cricket",
      "co-organizer" : "Azure Singapore UG"
    }
  2. Part 1 - Summary – ADLS & ADF
    ADLS - Main features
    • Petabyte scale storage
    • Hierarchical namespace
    • Hadoop compatible access with ABFS driver
    ADF - Main features
    • Cloud ETL service
    • Scale-out serverless data integration & data transformation
    • Code-free UI
    • Monitoring & Management
  3. Azure Databricks
    • Collaborative Spark based Analytical service
    • Optimized Spark environment for data pipelines and ML
    • ACID transactions on Spark
    • Streaming & Batch unification
    • Schema enforcement
    • Time Travel
    • Upserts & Deletes
    https://docs.databricks.com/delta/delta-intro.html
  4. Benefits of using Apache Spark
    • Speed
        • Up to 100x faster compared to MapReduce
    • Ease of use
        • Easy to use APIs
        • Multi language support
        • 100+ operators
    • Unified engine
        • Higher level libraries & support for SQL queries, streaming data, machine learning and graph processing
    • Runs everywhere
        • Hadoop, standalone, Mesos, Kubernetes, cloud
    https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  5. Apache Spark Components
    • Dataset, DataFrame, RDD
        • Distributed collection of data
    • SparkSession
        • Entry point into Spark API
        • SparkContext, SQLContext, StreamingContext unified into one
    • Executors
        • Handle distributed processing
    • Transformations & Actions
        • Transformations – lazy operations that return immutable data structures
        • Actions – apply operations and return a value or write data to external storage
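The split between lazy transformations and eager actions on the slide above can be sketched in plain Python. This is a toy stand-in and not the Spark API: the `LazyPipeline` class, the `double` function and the `calls` list are hypothetical names invented for the illustration. Each transformation returns a new pipeline and leaves the original untouched, and nothing is computed until an action such as `collect` is called.

```python
class LazyPipeline:
    """Toy stand-in for a Spark dataset (hypothetical, not Spark code)."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)  # recorded transformations, not yet run

    # Transformations: only record the step and return a NEW pipeline.
    def map(self, fn):
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    # Actions: run every recorded step and return a concrete value.
    def collect(self):
        stream = iter(self._source)
        for kind, fn in self._steps:
            # Inside a method these names resolve to the built-in map/filter.
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return list(stream)

    def count(self):
        return len(self.collect())


calls = []  # records every element that double() actually processes


def double(x):
    calls.append(x)
    return x * 2


base = LazyPipeline([1, 2, 3, 4])
doubled = base.map(double).filter(lambda x: x > 4)  # nothing runs yet

calls_before_action = len(calls)  # still 0: transformations are lazy
result = doubled.collect()        # the action triggers the computation

print(calls_before_action, result, base.collect())
```

Real Spark adds partitioning, distribution across executors and query optimisation on top of this idea; the sketch only shows the deferred-execution contract.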
  6. Spark Common Transformations
    • map
    • flatMap
    • filter
    • distinct
    • sample(withReplacement, ..)
    • union
    • intersection
    • subtract
    • cartesian
    • reduceByKey
    • groupByKey
    • sortByKey
    • join
    • repartition
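To make the key-based transformations in this list more concrete, here is a minimal plain-Python sketch of what `groupByKey`, `reduceByKey` and `sortByKey` compute on a collection of (key, value) pairs. The helper functions and the sample `pairs` data are hypothetical and run on a single machine; they mirror the semantics only, not Spark's distributed execution.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect every value seen for a key into a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, fn):
    """Fold the values of each key into one value, pair by pair."""
    totals = {}
    for key, value in pairs:
        totals[key] = fn(totals[key], value) if key in totals else value
    return totals


def sort_by_key(pairs):
    """Order the pairs by key; sorted() is stable, so ties keep their order."""
    return sorted(pairs, key=lambda kv: kv[0])


# Hypothetical sample data: (key, value) pairs.
pairs = [("b", 1), ("a", 3), ("c", 2), ("a", 4), ("b", 5)]

grouped = group_by_key(pairs)
reduced = reduce_by_key(pairs, lambda x, y: x + y)
ordered = sort_by_key(pairs)

print(grouped)
print(reduced)
print(ordered)
```

The reducing variant keeps only one running value per key, whereas the grouping variant holds every value, which is why a reduce-style operation is usually the cheaper choice when only an aggregate is needed.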
  7. Spark Common Actions
    • collect
    • count
    • countByValue
    • take(num)
    • top(num)
    • reduce(func)
    • fold(zero)(func)
    • saveAsTextFile(path)
    • saveAsSequenceFile(path)
    • countByKey()
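Several of these actions have close single-machine counterparts in the Python standard library, which can help when reasoning about what each one returns. The sketch below is an analogy under that assumption, not Spark code, and the `data` list is made up for the example.

```python
from collections import Counter
from functools import reduce
from heapq import nlargest
from itertools import islice
from operator import add

data = [5, 3, 8, 3, 1]  # hypothetical sample values

count = len(data)                     # like count
count_by_value = dict(Counter(data))  # like countByValue
first_two = list(islice(data, 2))     # like take(2): first elements as stored
top_two = nlargest(2, data)           # like top(2): largest elements first
total = reduce(add, data)             # like reduce(func)
folded = reduce(add, data, 0)         # like fold(zero)(func)

# With a zero value, folding an empty collection is well defined,
# whereas reducing an empty collection has nothing to return.
empty_fold = reduce(add, [], 0)

print(count, count_by_value, first_two, top_two, total, folded, empty_fold)
```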
  8. Azure Databricks - Performance
    [Figure: four bar charts comparing CSV, Parquet, ORC and Delta files]
    • Non-partitioned data: aggregation
    • Partitioned data: load performance
    • Partitioned data: aggregation
    • Non-partitioned data: load performance
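One reason columnar formats such as Parquet or ORC tend to do well on aggregations is that a query touching a single column does not have to parse the others. The following sketch illustrates that idea with the standard library only. It does not read real Parquet or ORC files and does not reproduce the measurements in the charts; the `rows` data, column names and helper functions are all hypothetical.

```python
import csv
import io

# Hypothetical sample records.
rows = [
    {"trip_id": 1, "city": "Singapore", "fare": 12.5},
    {"trip_id": 2, "city": "Singapore", "fare": 8.0},
    {"trip_id": 3, "city": "Pune", "fare": 20.0},
    {"trip_id": 4, "city": "Pune", "fare": 9.5},
]

# Row-oriented layout: every record is stored as one line of text (CSV).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["trip_id", "city", "fare"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Column-oriented layout: all values of one column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_fare_from_csv(text):
    """Sum one column from CSV; every field of every row gets parsed."""
    total, fields_parsed = 0.0, 0
    for record in csv.DictReader(io.StringIO(text)):
        fields_parsed += len(record)
        total += float(record["fare"])  # text must be converted to a number
    return total, fields_parsed


def sum_fare_from_columns(cols):
    """Sum one column from the columnar layout; only that column is read."""
    fare = cols["fare"]
    return sum(fare), len(fare)


csv_total, csv_fields = sum_fare_from_csv(csv_text)
col_total, col_fields = sum_fare_from_columns(columns)

print(csv_total, csv_fields)
print(col_total, col_fields)
```

Both layouts give the same total, but the row-oriented path parses three times as many fields for this three-column table. Real columnar formats add typed binary encoding, compression and per-column statistics on top of this layout.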
  9. Azure Key Vault
    • Safeguards cryptographic keys and secrets used by cloud apps
    • Stores
        • Encryption keys used for BYOK solution
        • Storage account access keys
        • Secrets / connection strings
        • Certificates
  10. Summary
    Azure Databricks - Main features
    • Collaborative Spark based Analytical service
    • Different cluster types (automated / interactive / pool)
    • Autoscale based on workloads
    • Fine grained access controls
  11. Thank you very much
    Code with Passion and Strive for Excellence
    https://www.slideshare.net/nileshgule/presentations
    https://speakerdeck.com/nileshgule/
  12. Nilesh Gule
    ARCHITECT | MICROSOFT MVP
    "Code with Passion and Strive for Excellence"
    nileshgule · @nileshgule · Nilesh Gule · NileshGule · www.handsonarchitect.com
  13. Q&A