Big Data workloads using Spark on HDInsight

Slide 1

Slide 1 text

Big Data workloads using Apache Spark on HDInsight
Nilesh Gule
@nileshgule | www.HandsOnArchitect.com

Slide 2

Slide 2 text

$whoami
{
  "name": "Nilesh Gule",
  "website": "https://www.HandsOnArchitect.com",
  "github": "https://github.com/NileshGule",
  "twitter": "@nileshgule",
  "linkedin": "https://www.linkedin.com/in/nileshgule",
  "email": "[email protected]",
  "likes": "Technical Evangelism, Cricket"
}

Slide 3

Slide 3 text

Agenda
• Evolution of Apache Spark
• Demo: local mode
• Demo: cluster mode
• Spark Ecosystem
• HDInsight Spark cluster
• Q&A

Slide 4

Slide 4 text

Assumptions
• Basic knowledge of Hadoop
  • HDFS, MapReduce, YARN, Resource Manager
• Cloud Computing
• Big Data Processing
• OO / functional programming
• CLI

Slide 5

Slide 5 text

Unified analytics engine for large-scale data processing

Slide 6

Slide 6 text

Life before Spark
• Different tools for performing different tasks related to data processing
• Multiple languages (Java, scripting, HQL)
• Non-interactive
• Lengthy batch processing
• Difficult to debug

Slide 7

Slide 7 text

Life with Spark

Slide 8

Slide 8 text

Evolution of Apache Spark
• 2009: Mesos (cluster management)
• 2010: Spark open sourced (iterative compute)
• 2014: Spark 1.0 (Spark SQL); Spark 1.2 (Data Sources API)
• 2015: Spark 1.3 (structured data, DataFrame API)
• 2016: Spark 1.6 (Dataset API, a superset of DataFrames)
• 2016: Spark 2.0 (streaming, Structured Streaming)
• 2018: Spark 2.3 (Kubernetes support, Data Sources 2.0 API)

Slide 9

Slide 9 text

Benefits of using Apache Spark
• Speed
  • Up to 100x faster compared to MapReduce
• Ease of use
  • Easy-to-use APIs
  • Multi-language support
  • 100+ operators
• Unified engine
  • Higher-level libraries and support for SQL queries, streaming data, machine learning and graph processing
• Runs everywhere
  • Hadoop, standalone, Mesos, Kubernetes, cloud
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

Slide 10

Slide 10 text

Demo
• MovieLens Dataset
  • Movies
  • Ratings
  • Tags
• Spark Datasets & Spark SQL API
• Local mode
• Spark submit
• Cluster mode
• CSV to ORC conversion
• Execution modes
• YARN resource manager UI
• Spark UI
• Spark History
• Spark logs
https://grouplens.org/datasets/movielens/

Slide 11

Slide 11 text

DEMO

Slide 12

Slide 12 text

Apache Spark components
• Dataset
  • Distributed collection of Rows
• SparkSession
  • Entry point into the Spark API
  • SparkContext, SQLContext and StreamingContext unified into one
• Executors
  • Handle distributed processing
• Transformations & Actions
  • Transformations: lazy operations that return immutable data structures
  • Actions: apply operations and return a value or write data to external storage
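
The split between lazy transformations and actions can be illustrated without a cluster. The sketch below is a plain-Python analogy, not Spark code: the class `LazyDataset` and the helper `double` are hypothetical names invented for this example. It only mimics the idea that a transformation records work and returns a new object, while an action runs the recorded steps.

```python
from typing import Callable, Iterable, List, Tuple


class LazyDataset:
    """Hypothetical stand-in for a Spark dataset (NOT the Spark API)."""

    def __init__(self, source: Iterable, steps: Tuple = ()):
        self._source = tuple(source)  # immutable snapshot of the input
        self._steps = steps           # recorded transformations, not yet run

    # Transformations: record the step and return a NEW object (lazy).
    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Actions: run the recorded steps and return a value.
    def collect(self) -> List:
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def count(self) -> int:
        return len(self.collect())


calls = []  # records every element the map function actually touches


def double(x):
    calls.append(x)
    return x * 2


ratings = LazyDataset([1, 2, 3, 4, 5])
doubled = ratings.map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # still empty: nothing has executed yet
result = doubled.collect()         # the action triggers the computation
```

Here `calls_before_action` stays empty because building `doubled` only records the two steps; `double` runs for the first time inside `collect()`. The original `ratings` object is left unchanged by the chained calls.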

Slide 13

Slide 13 text

Common Transformations
• map
• flatMap
• filter
• distinct
• sample(withReplacement, ..)
• union
• intersection
• subtract
• cartesian
• reduceByKey
• groupByKey
• sortByKey
• join
• repartition
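
To make the key-based transformations above more concrete, here is a minimal plain-Python sketch of what `flatMap`, `groupByKey` and `reduceByKey` compute on a small in-memory list. The helper names (`flat_map`, `group_by_key`, `reduce_by_key`) and the sample tag strings are assumptions made for illustration; this is not the Spark API and says nothing about how Spark distributes the work.

```python
from collections import defaultdict


def flat_map(items, fn):
    """Apply fn to each item and flatten the resulting sequences."""
    return [out for item in items for out in fn(item)]


def group_by_key(pairs):
    """Collect every value seen for a key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_by_key(pairs, func):
    """Combine values per key as they arrive, without building lists."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


# Hypothetical sample data, loosely shaped like free-text movie tags.
tag_lines = ["funny dark", "dark classic", "funny"]

words = flat_map(tag_lines, str.split)       # one element per word
pairs = [(word, 1) for word in words]        # map each word to (word, 1)
grouped = group_by_key(pairs)                # word -> list of ones
counts = reduce_by_key(pairs, lambda a, b: a + b)  # word -> total
```

Both helpers arrive at the same per-key information, but `reduce_by_key` keeps a single running value per key instead of a full list.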

Slide 14

Slide 14 text

Common Actions
• collect
• count
• countByValue
• take(num)
• top(num)
• reduce(func)
• fold(zero)(func)
• saveAsTextFile(path)
• saveAsSequenceFile(path)
• countByKey()
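
Several of the actions above have close standard-library counterparts in Python, which can help when reasoning about what each one returns. The snippet below is only an analogy on a local list, with hypothetical sample ratings; it does not call Spark, and the comments name the action each line is meant to resemble.

```python
from collections import Counter
from functools import reduce
from heapq import nlargest
from itertools import islice

# Hypothetical sample ratings held in local memory.
ratings = [4.0, 3.5, 5.0, 3.5, 2.0]

collected = list(ratings)                    # like collect: all elements
total_count = len(ratings)                   # like count
by_value = Counter(ratings)                  # like countByValue
first_two = list(islice(ratings, 2))         # like take(2): first elements
top_two = nlargest(2, ratings)               # like top(2): largest elements
summed = reduce(lambda a, b: a + b, ratings)         # like reduce(func)
folded = reduce(lambda a, b: a + b, ratings, 0.0)    # like fold(zero)(func)
```

Note the difference between `first_two`, which follows the order of the data, and `top_two`, which follows the ordering of the values.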

Slide 15

Slide 15 text

DEMO

Slide 16

Slide 16 text

HDInsight components
• YARN UI
  • Resource management UI
• Ambari views
  • Access to Ambari dashboard & installed services
• Spark History UI
  • Historical Spark job details
• Zeppelin & Jupyter notebooks
  • Online notebooks to work with cluster resources

Slide 17

Slide 17 text

References
• https://spark.apache.org
• Databricks spark
• RDD programming guide
• Spark SQL, DataFrames & Datasets
• Data sources API V2
• Sparkhub databricks
• MovieLens dataset

Slide 18

Slide 18 text

Thank you very much
Code with Passion and Strive for Excellence
https://github.com/NileshGule/learning-spark
https://github.com/NileshGule/learning-spark/blob/master/Azure%20Meetup%2017th%20May%202019.md
https://www.slideshare.net/nileshgule/presentations
https://speakerdeck.com/nileshgule/

Slide 19

Slide 19 text

Design Q&A