Big Data workloads using Spark on HDInsight

Slide deck used during the Azure UG meetup in Singapore on 17th May 2019. It demonstrates using Spark to run big data workloads on an HDInsight cluster, covering Spark SQL and the Dataset API along with Hive support.

Nilesh Gule

May 17, 2019

Transcript

  1. Big Data workloads using Apache Spark on HDInsight
    Nilesh Gule | @nileshgule | www.HandsOnArchitect.com
  2. $whoami
    {
      "name": "Nilesh Gule",
      "website": "https://www.HandsOnArchitect.com",
      "github": "https://github.com/NileshGule",
      "twitter": "@nileshgule",
      "linkedin": "https://www.linkedin.com/in/nileshgule",
      "email": "nileshgule@gmail.com",
      "likes": "Technical Evangelism, Cricket"
    }
  3. Agenda
    01 Evolution of Apache Spark
    02 Demo – local mode
    03 Demo – cluster mode
    04 Spark Ecosystem
    05 HDInsight Spark cluster
    06 Q&A
  4. Assumptions
    • Basic knowledge of Hadoop
      • HDFS, MapReduce, YARN, Resource Manager
    • Cloud Computing
    • Big Data Processing
    • OO / functional programming
    • CLI
  5. Unified analytics engine for large-scale data processing

  6. Life before Spark
    • Different tools for performing different tasks related to data processing
    • Multiple languages (Java, scripting, HQL)
    • Non-interactive
    • Lengthy batch processing
    • Difficult to debug
  7. Life with Spark

  8. Evolution of Apache Spark
    • 2009: Mesos cluster management (Mesos)
    • 2010: Iterative compute (Spark Open Sourced)
    • 2014: Spark SQL (Spark 1.0); Data Sources API (Spark 1.2)
    • 2015: Structured Data, DataFrame API (Spark 1.3)
    • 2016: Dataset API, superset of DataFrames (Spark 1.6)
    • 2016: Streaming, Structured Streaming (Spark 2.0)
    • 2018: Kubernetes support, Data Sources 2.0 API (Spark 2.3)
  9. Benefits of using Apache Spark
    • Speed
      • Up to 100x faster compared to MapReduce
    • Ease of use
      • Easy to use APIs
      • Multi language support
      • 100+ operators
    • Unified engine
      • Higher level libraries & support for SQL queries, streaming data, machine learning and graph processing
    • Runs everywhere
      • Hadoop, standalone, Mesos, Kubernetes, cloud
    https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  10. Demo
    • MovieLens Dataset
      • Movies
      • Ratings
      • Tags
    • Spark Datasets & Spark SQL API
    • Local mode
    • Spark submit
    • Cluster mode
    • CSV to ORC conversion
    • Execution modes
    • YARN resource manager UI
    • Spark UI
    • Spark History
    • Spark logs
    https://grouplens.org/datasets/movielens/
  11. DEMO

  12. Apache Spark components
    • Dataset
      • Distributed collection of Rows
    • SparkSession
      • Entry point into the Spark API
      • SparkContext, SQLContext, StreamingContext unified into one
    • Executors
      • Handle distributed processing
    • Transformations & Actions
      • Transformations – lazy operations that return immutable data structures
      • Actions – apply operations and return a value or write data to external storage
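
The split between lazy transformations and eager actions described on the slide above can be modelled in a few lines of plain Python. This is only a sketch of the idea and does not use Spark: `LazyPipeline` and its `log` argument are hypothetical names made up for illustration, and Python generators stand in for Spark's deferred execution plan.

```python
from typing import Callable, Iterator, List


class LazyPipeline:
    """Toy stand-in for a distributed collection (hypothetical, not a Spark class)."""

    def __init__(self, make_iter: Callable[[], Iterator[int]], log: List[str]) -> None:
        self._make_iter = make_iter  # factory, so the data can be re-read on every action
        self._log = log              # records when an element is actually processed

    # Transformations: build and return a NEW pipeline; no element is touched yet.
    def map(self, fn: Callable[[int], int]) -> "LazyPipeline":
        def make() -> Iterator[int]:
            for x in self._make_iter():
                self._log.append(f"map {x}")
                yield fn(x)
        return LazyPipeline(make, self._log)

    def filter(self, pred: Callable[[int], bool]) -> "LazyPipeline":
        def make() -> Iterator[int]:
            for x in self._make_iter():
                self._log.append(f"filter {x}")
                if pred(x):
                    yield x
        return LazyPipeline(make, self._log)

    # Actions: run the whole chain and hand a concrete value back to the caller.
    def collect(self) -> List[int]:
        return list(self._make_iter())

    def count(self) -> int:
        return sum(1 for _ in self._make_iter())


log: List[str] = []
numbers = LazyPipeline(lambda: iter([1, 2, 3, 4]), log)

# Chaining transformations only describes the work.
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
work_before_action = len(log)

# The action triggers the actual processing.
result = evens_squared.collect()
```

Each transformation leaves `numbers` untouched and returns a new pipeline, which mirrors the "immutable data structures" wording on the slide; nothing is written to `log` until `collect()` runs.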
  13. Common Transformations
    • map
    • flatMap
    • filter
    • distinct
    • sample(withReplacement, ..)
    • union
    • intersection
    • subtract
    • cartesian
    • reduceByKey
    • groupByKey
    • sortByKey
    • join
    • repartition
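
To make a few of the transformations listed above concrete, here is a rough single-machine sketch of what `flatMap`, `reduceByKey`, `groupByKey`, `sortByKey` and `join` compute. The helper functions are hypothetical plain-Python stand-ins, not Spark APIs, and the sample data is made up; in Spark the same operations run lazily and across partitions.

```python
from collections import defaultdict
from itertools import chain


def flat_map(fn, items):
    # Like flatMap: each input element may produce zero or more output elements.
    return list(chain.from_iterable(fn(x) for x in items))


def reduce_by_key(fn, pairs):
    # Like reduceByKey: merge all values that share a key with a binary function.
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return list(acc.items())


def group_by_key(pairs):
    # Like groupByKey: gather every value for a key into one list.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())


def sort_by_key(pairs):
    # Like sortByKey: order the pairs by their key.
    return sorted(pairs, key=lambda kv: kv[0])


def join(left, right):
    # Like join: inner join of two key-value collections on the key.
    right_groups = defaultdict(list)
    for key, value in right:
        right_groups[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in right_groups.get(k, [])]


# Word count built from the pieces above (illustrative input).
lines = ["spark sql", "spark streaming"]
words = flat_map(str.split, lines)
pairs = [(word, 1) for word in words]
counts = sort_by_key(reduce_by_key(lambda a, b: a + b, pairs))
grouped = sort_by_key(group_by_key(pairs))

# Joining two keyed collections (made-up ids and values).
titles = [(1, "movie-a"), (2, "movie-b")]
ratings = [(1, 4.0), (1, 5.0), (3, 2.0)]
joined = join(titles, ratings)
```

`reduce_by_key` combines values as it goes, while `group_by_key` first materialises every value per key; that difference is why the Spark RDD programming guide recommends `reduceByKey` over `groupByKey` for aggregations.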
  14. Common Actions
    • collect
    • count
    • countByValue
    • take(num)
    • top(num)
    • reduce(func)
    • fold(zero)(func)
    • saveAsTextFile(path)
    • saveAsSequenceFile(path)
    • countByKey()
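
Some of the actions listed above can be approximated in plain Python to show what they return. The sketch below is not Spark code: the helper names are hypothetical, and a list of lists stands in for a collection split into two partitions. It also illustrates why the zero value passed to `fold` should be neutral for the operation: Spark applies it once per partition and once more when merging the partition results.

```python
import heapq
from collections import Counter
from functools import reduce
from itertools import chain, islice
from operator import add

# Made-up data, split into two "partitions".
partitions = [[3, 1], [4, 1, 5]]


def elements(parts):
    # Flatten the partitions into a single stream of elements.
    return chain.from_iterable(parts)


def take(parts, num):
    # Like take(num): the first num elements, in partition order.
    return list(islice(elements(parts), num))


def top(parts, num):
    # Like top(num): the num largest elements, in descending order.
    return heapq.nlargest(num, elements(parts))


def count_by_value(parts):
    # Like countByValue: how often each distinct value occurs.
    return dict(Counter(elements(parts)))


def fold(parts, zero, op):
    # Like fold(zero)(func): fold each partition starting from zero,
    # then fold the per-partition results, again starting from zero.
    per_partition = [reduce(op, part, zero) for part in parts]
    return reduce(op, per_partition, zero)


first_two = take(partitions, 2)
largest_two = top(partitions, 2)
frequencies = count_by_value(partitions)
total = fold(partitions, 0, add)   # neutral zero: plain sum
skewed = fold(partitions, 1, add)  # non-neutral zero is added three times
```

With two partitions, a zero value of 1 is added three times (once per partition plus once for the merge), so `skewed` ends up 3 larger than `total`.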
  15. DEMO

  16. HDInsight components
    • YARN UI
      • Resource management UI
    • Ambari views
      • Access to Ambari dashboard & installed services
    • Spark History UI
      • Historical Spark job details
    • Zeppelin & Jupyter notebooks
      • Online notebooks to work with cluster resources
  17. References
    • https://spark.apache.org
    • Databricks spark
    • RDD programming guide
    • Spark SQL, DataFrames & Datasets
    • Data sources API V2
    • Sparkhub databricks
    • MovieLens dataset
  18. Thank you very much
    Code with Passion and Strive for Excellence
    https://github.com/NileshGule/learning-spark
    https://github.com/NileshGule/learning-spark/blob/master/Azure%20Meetup%2017th%20May%202019.md
    https://www.slideshare.net/nileshgule/presentations
    https://speakerdeck.com/nileshgule/
  19. Design Q&A