Big Data workloads using Spark on HDInsight

Slide deck used during the Azure UG meetup in Singapore on 17th May 2019. It demonstrates using Spark to run big data workloads on an HDInsight cluster, covering Spark SQL and the Dataset API along with Hive support.

Nilesh Gule

May 17, 2019

Transcript

  1. $whoami
    {
      "name" : "Nilesh Gule",
      "website" : "https://www.HandsOnArchitect.com",
      "github" : "https://github.com/NileshGule",
      "twitter" : "@nileshgule",
      "linkedin" : "https://www.linkedin.com/in/nileshgule",
      "email" : "[email protected]",
      "likes" : "Technical Evangelism, Cricket"
    }
  2. Agenda
    • Evolution of Apache Spark
    • Demo – local mode
    • Demo – cluster mode
    • Spark Ecosystem
    • HDInsight Spark cluster
    • Q&A
  3. Assumptions
    • Basic knowledge of Hadoop
      • HDFS, MapReduce, YARN, Resource Manager
    • Cloud Computing
    • Big Data Processing
    • OO / functional programming
    • CLI
  4. Life before Spark
    • Different tools for performing different tasks related to data processing
    • Multiple languages (Java, scripting, HQL)
    • Non-interactive
    • Lengthy batch processing
    • Difficult to debug
  5. Evolution of Apache Spark
    • 2009: Mesos cluster management (Mesos)
    • 2010: Iterative compute (Spark open sourced)
    • 2014: Spark SQL; Data Sources API in Spark 1.2 (Spark 1.0)
    • 2015: Structured data, DataFrame API (Spark 1.3)
    • 2016: Dataset API, a superset of DataFrames (Spark 1.6)
    • 2016: Streaming, Structured Streaming (Spark 2.0)
    • 2018: Kubernetes support, Data Sources 2.0 API (Spark 2.3)
  6. Benefits of using Apache Spark
    • Speed
      • Up to 100x faster compared to MapReduce
    • Ease of use
      • Easy-to-use APIs
      • Multi-language support
      • 100+ operators
    • Unified engine
      • Higher-level libraries & support for SQL queries, streaming data, machine learning and graph processing
    • Runs everywhere
      • Hadoop, standalone, Mesos, Kubernetes, cloud
    https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  7. Demo
    • MovieLens Dataset
      • Movies
      • Ratings
      • Tags
    • Spark Datasets & Spark SQL API
    • Local mode
    • Spark submit
    • Cluster mode
    • CSV to ORC conversion
    • Execution modes
    • YARN resource manager UI
    • Spark UI
    • Spark History
    • Spark logs
    https://grouplens.org/datasets/movielens/
  8. Apache Spark components
    • Dataset
      • Distributed collection of Rows
    • SparkSession
      • Entry point into the Spark API
      • SparkContext, SQLContext, StreamingContext unified into one
    • Executors
      • Handle distributed processing
    • Transformations & Actions
      • Transformations: lazy operations that return immutable data structures
      • Actions: apply operations and return a value or write data to external storage
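
The difference between lazy transformations and actions on the slide above can be sketched without a cluster. The snippet below is a plain-Python analogy, not the Spark API: `LazyPipeline`, `double` and the sample numbers are hypothetical names and data made up for illustration. Transformations only record a recipe; the recorded steps run when an action such as `collect` is called.

```python
from typing import Callable, Iterable, List


class LazyPipeline:
    """Toy stand-in for a distributed collection; this is NOT the Spark API."""

    def __init__(self, source: Callable[[], Iterable]):
        # Store a recipe for producing the data, not the data itself.
        self._source = source

    # Transformations: return a new pipeline and do no work yet.
    def map(self, func: Callable) -> "LazyPipeline":
        return LazyPipeline(lambda: (func(x) for x in self._source()))

    def filter(self, predicate: Callable) -> "LazyPipeline":
        return LazyPipeline(lambda: (x for x in self._source() if predicate(x)))

    # Actions: run the recorded steps and return a value.
    def collect(self) -> List:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyPipeline(lambda: [1, 2, 3, 4, 5])
doubled_big = numbers.map(double).filter(lambda x: x > 4)

calls_before_action = len(calls)  # nothing has executed yet
result = doubled_big.collect()    # the action triggers the work
```

Each transformation returns a new pipeline and leaves the original untouched, which mirrors the "immutable data structures" point on the slide.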
  9. Common Transformations
    • map
    • flatMap
    • filter
    • distinct
    • sample(withReplacement, ..)
    • union
    • intersection
    • subtract
    • cartesian
    • reduceByKey
    • groupByKey
    • sortByKey
    • join
    • repartition
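
To make the key-based transformations in this list concrete, here is a small plain-Python sketch of what `groupByKey` and `reduceByKey` compute on a list of key/value pairs. It only models the results on a single machine, not Spark's distributed execution; the helper names `group_by_key` and `reduce_by_key` and the sample ratings are assumptions for illustration.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_by_key(pairs, func):
    """Combine the values of each key pairwise with func."""
    reduced = {}
    for key, value in pairs:
        reduced[key] = func(reduced[key], value) if key in reduced else value
    return reduced


# Hypothetical (movie, rating) pairs, loosely modelled on the MovieLens demo.
ratings = [("movie_1", 4), ("movie_2", 3), ("movie_1", 5), ("movie_2", 2)]

grouped = group_by_key(ratings)
totals = reduce_by_key(ratings, lambda a, b: a + b)
```

Note that `reduce_by_key` never builds the full list of values for a key; it keeps one running result per key instead.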
  10. Common Actions
    • collect
    • count
    • countByValue
    • take(num)
    • top(num)
    • reduce(func)
    • fold(zero)(func)
    • saveAsTextFile(path)
    • saveAsSequenceFile(path)
    • countByKey()
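
As a rough guide to what some of these actions return, the sketch below reproduces their results with the Python standard library on an ordinary list. It is an analogy on a single in-memory collection rather than Spark code, and the sample ratings are made up for illustration.

```python
from collections import Counter
from functools import reduce
import heapq

# Hypothetical ratings held in one local list (no partitions involved).
ratings = [4, 3, 5, 3, 4, 4]

count_by_value = dict(Counter(ratings))                    # like countByValue
first_two = ratings[:2]                                    # like take(2)
top_two = heapq.nlargest(2, ratings)                       # like top(2)
total = reduce(lambda a, b: a + b, ratings)                # like reduce(func)
total_from_zero = reduce(lambda a, b: a + b, ratings, 0)   # like fold(0)(func)
```

With a neutral zero value such as `0` for addition, the fold and the reduce agree; the zero value mainly matters when the collection is empty.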
  11. HDInsight components
    • YARN UI
      • Resource management UI
    • Ambari views
      • Access to Ambari dashboard & installed services
    • Spark History UI
      • Historical Spark job details
    • Zeppelin & Jupyter notebooks
      • Online notebooks to work with cluster resources
  12. References
    • https://spark.apache.org
    • Databricks Spark
    • RDD programming guide
    • Spark SQL, DataFrames & Datasets
    • Data Sources API V2
    • SparkHub Databricks
    • MovieLens dataset
  13. Thank you very much
    Code with Passion and Strive for Excellence
    https://github.com/NileshGule/learning-spark
    https://github.com/NileshGule/learning-spark/blob/master/Azure%20Meetup%2017th%20May%202019.md
    https://www.slideshare.net/nileshgule/presentations
    https://speakerdeck.com/nileshgule/