Big Data workloads using Spark on HDInsight

Big Data workloads using Apache Spark on HDInsight
Nilesh Gule
@nileshgule | www.HandsOnArchitect.com

$whoami
{
  "name" : "Nilesh Gule",
  "website" : "https://www.HandsOnArchitect.com",
  "github" : "https://github.com/NileshGule",
  "twitter" : "@nileshgule",
  "linkedin" : "https://www.linkedin.com/in/nileshgule",
  "email" : "[email protected]",
  "likes" : "Technical Evangelism, Cricket"
}

Agenda
• Evolution of Apache Spark
• Demo – local mode
• Demo – cluster mode
• Spark Ecosystem
• HDInsight Spark cluster
• Q&A

Assumptions
• Basic knowledge of Hadoop
  • HDFS, MapReduce, YARN, Resource Manager
• Cloud Computing
• Big Data Processing
• OO / functional programming
• CLI

Unified analytics engine for large-scale data processing

Life before Spark
• Different tools for performing different tasks related to data processing
• Multiple languages (Java, scripting, HQL)
• Non-interactive
• Lengthy batch processing
• Difficult to debug

Life with Spark

Evolution of Apache Spark
• 2009: Mesos – cluster management
• 2010: Spark open sourced – iterative compute
• 2014: Spark 1.0 – Spark SQL; Spark 1.2 – Data Sources API
• 2015: Spark 1.3 – structured data, DataFrame API
• 2016: Spark 1.6 – Dataset API, a superset of DataFrames
• 2016: Spark 2.0 – Structured Streaming
• 2018: Spark 2.3 – Kubernetes support, Data Sources 2.0 API

Benefits of using Apache Spark
• Speed
  • Up to 100x faster compared to MapReduce
• Ease of use
  • Easy-to-use APIs
  • Multi-language support
  • 100+ operators
• Unified engine
  • Higher-level libraries & support for SQL queries, streaming data, machine learning and graph processing
• Runs everywhere
  • Hadoop, standalone, Mesos, Kubernetes, cloud
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

Demo
• MovieLens Dataset
  • Movies
  • Ratings
  • Tags
• Spark Datasets & Spark SQL API
• Local mode
• Spark submit
• Cluster mode
• CSV to ORC conversion
• Execution modes
• YARN resource manager UI
• Spark UI
• Spark History
• Spark logs
https://grouplens.org/datasets/movielens/

Apache Spark components
• Dataset
  • Distributed collection of Rows
• SparkSession
  • Entry point into the Spark API
  • SparkContext, SQLContext, StreamingContext unified into one
• Executors
  • Handle distributed processing
• Transformations & Actions
  • Transformations – lazy operations that return immutable data structures
  • Actions – apply operations and return a value or write data to external storage
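
The split between lazy transformations and eager actions can be sketched without a cluster. The block below is a plain-Python toy, not Spark's API: `LazyDataset`, `from_list` and the `log` list are hypothetical names introduced only for illustration. It assumes a single in-memory list as the data source and shows that `filter` and `map` only describe work, while `collect` triggers it.

```python
from typing import Callable, Iterator, List


class LazyDataset:
    """Toy stand-in for a Spark Dataset: transformations are deferred until an action runs."""

    def __init__(self, make_iter: Callable[[], Iterator], log: List[str]):
        self._make_iter = make_iter  # recipe for producing the data, not the data itself
        self._log = log              # records when work actually happens

    @classmethod
    def from_list(cls, items, log: List[str]) -> "LazyDataset":
        data = list(items)
        return cls(lambda: iter(data), log)

    def map(self, func: Callable) -> "LazyDataset":
        # Transformation: returns a new dataset and leaves this one untouched.
        def make_iter() -> Iterator:
            for item in self._make_iter():
                self._log.append(f"map {item}")
                yield func(item)
        return LazyDataset(make_iter, self._log)

    def filter(self, predicate: Callable) -> "LazyDataset":
        # Transformation: also deferred, nothing is evaluated here.
        def make_iter() -> Iterator:
            for item in self._make_iter():
                self._log.append(f"filter {item}")
                if predicate(item):
                    yield item
        return LazyDataset(make_iter, self._log)

    def collect(self) -> list:
        # Action: runs the whole chain and returns the values.
        return list(self._make_iter())

    def count(self) -> int:
        # Action: runs the whole chain and returns only the number of items.
        return sum(1 for _ in self._make_iter())


log: List[str] = []
numbers = LazyDataset.from_list([1, 2, 3, 4], log)
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

work_before_action = len(log)     # still 0: only a plan has been built
result = evens_squared.collect()  # the action triggers the filter and map steps
print(work_before_action, result, len(log))
```

In real Spark the deferred plan is also what lets the engine optimise and distribute the work across executors; this toy only models the deferral.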

Common Transformations
• map
• flatMap
• filter
• distinct
• sample(withReplacement, ..)
• union
• intersection
• subtract
• cartesian
• reduceByKey
• groupByKey
• sortByKey
• join
• repartition
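
The key-based transformations in this list are easiest to tell apart on a small example. The sketch below models the semantics of reduceByKey, groupByKey and sortByKey in plain Python on an in-memory list of (key, value) pairs. The helper functions and the sample (movieId, rating) pairs are hypothetical and are not taken from the demo. Unlike Spark, it runs on one machine and keeps values in input order within each group, which Spark does not guarantee.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Tuple

Pair = Tuple[Hashable, float]


def reduce_by_key(pairs: List[Pair], func: Callable[[float, float], float]) -> List[Pair]:
    """Merge all values that share a key into one value using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


def group_by_key(pairs: List[Pair]) -> List[Tuple[Hashable, List[float]]]:
    """Collect all values that share a key into a list, without combining them."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())


def sort_by_key(pairs: list, ascending: bool = True) -> list:
    """Order the pairs by key only; values are not compared."""
    return sorted(pairs, key=lambda pair: pair[0], reverse=not ascending)


# Hypothetical (movieId, rating) pairs.
ratings = [(2, 4.0), (1, 3.0), (2, 5.0), (1, 4.0), (3, 2.0)]

rating_sums = sort_by_key(reduce_by_key(ratings, lambda a, b: a + b))
rating_groups = sort_by_key(group_by_key(ratings))
print(rating_sums)
print(rating_groups)
```

When only an aggregate per key is needed, reduceByKey is usually the better fit in Spark, because values can be combined on each partition before they are shuffled, whereas groupByKey moves every value across the network.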

Common Actions
• collect
• count
• countByValue
• take(num)
• top(num)
• reduce(func)
• fold(zero)(func)
• saveAsTextFile(path)
• saveAsSequenceFile(path)
• countByKey()
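
As a rough guide to what several of these actions return, the sketch below computes the equivalent results with the Python standard library on a small in-memory list. It is an analogy under stated assumptions, not Spark code: the sample ratings are hypothetical, and each variable name only mirrors the action it imitates.

```python
from collections import Counter
from functools import reduce
from heapq import nlargest
from itertools import islice

# Hypothetical rating values standing in for a distributed collection.
ratings = [4.0, 3.0, 5.0, 4.0, 2.0]

count_result = len(ratings)                       # count: number of elements
count_by_value = dict(Counter(ratings))           # countByValue: occurrences per distinct value
take_two = list(islice(ratings, 2))               # take(2): the first two elements
top_two = nlargest(2, ratings)                    # top(2): the two largest, in descending order
reduce_sum = reduce(lambda a, b: a + b, ratings)  # reduce(func): combine all elements pairwise

# fold(zero)(func): like reduce, but starts from a zero value,
# so it also works on an empty collection.
fold_sum = reduce(lambda a, b: a + b, ratings, 0.0)
fold_empty = reduce(lambda a, b: a + b, [], 0.0)

print(count_result, count_by_value, take_two, top_two, reduce_sum, fold_sum, fold_empty)
```

In Spark the zero value passed to fold is applied once per partition and again when partition results are merged, so it should be neutral for the operation (0 for addition, 1 for multiplication).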

HDInsight components
• YARN UI
  • Resource management UI
• Ambari views
  • Access to Ambari dashboard & installed services
• Spark History UI
  • Historical Spark job details
• Zeppelin & Jupyter notebooks
  • Online notebooks to work with cluster resources

References
• https://spark.apache.org
• Databricks spark
• RDD programming guide
• Spark SQL, DataFrames & Datasets
• Data sources API V2
• Sparkhub databricks
• MovieLens dataset

Thank you very much
Code with Passion and Strive for Excellence
https://github.com/NileshGule/learning-spark
https://github.com/NileshGule/learning-spark/blob/master/Azure%20Meetup%2017th%20May%202019.md
https://www.slideshare.net/nileshgule/presentations
https://speakerdeck.com/nileshgule/

Design Q&A
