Big data for .Net Devs with Apache Spark

Slide deck from the presentation given at the .Net Dev Summit 2020 on 31 October 2020. The talk gave an overview of Apache Spark and .Net for Apache Spark. The demo used an Azure Synapse workspace, which has native notebook integration, to analyse the popular MovieLens datasets.

Nilesh Gule

October 31, 2020

Transcript

  1. $whoami { "name" : "Nilesh Gule", "website" : "https://www.HandsOnArchitect.com", "github" : "https://github.com/NileshGule", "twitter" : "@nileshgule", "linkedin" : "https://www.linkedin.com/in/nileshgule", "likes" : "Technical Evangelism, Cricket", "co-organizer" : "Azure Singapore UG" }
  2. Benefits of using Apache Spark • Speed • Up to 100x faster compared to MapReduce • Ease of use • Easy-to-use APIs • Multi-language support • 100+ operators • Unified engine • Higher-level libraries & support for SQL queries, streaming data, machine learning and graph processing • Runs everywhere • Hadoop, standalone, Mesos, Kubernetes, cloud https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  3. Apache Spark Components • Dataset, DataFrame, RDD • Distributed collection of data • SparkSession • Entry point into the Spark API • SparkContext, SQLContext, StreamingContext unified into one • Executors • Handle distributed processing • Transformations & Actions • Transformations: lazy operations that return immutable data structures • Actions: apply operations and return a value or write data to external storage
  4. Spark Common Transformations • map • flatMap • filter • distinct • sample(withReplacement, ..) • union • intersection • subtract • cartesian • reduceByKey • groupByKey • sortByKey • join • repartition
  5. Spark Common Actions • collect • count • countByValue • take(num) • top(num) • reduce(func) • fold(zero)(func) • saveAsTextFile(path) • saveAsSequenceFile(path) • countByKey()
  6. What is .Net for Apache Spark • .Net bindings for Spark written on the Spark interop layer • Provides high-performance bindings for C# and F# • Compliant with .Net Standard https://devblogs.microsoft.com/dotnet/introducing-net-for-apache-spark/#performance
  7. Demo • MovieLens Dataset • CSV files in Azure Data Lake Storage • Spark pools using Azure Synapse Analytics
  8. Summary • Apache Spark is great for Big Data Analytics • .Net for Apache Spark provides .Net language bindings to Spark • Azure Synapse Analytics has native support for C#
  9. • Apache Spark • .Net for Apache Spark • MovieLens datasets • Azure Synapse Analytics
  10. Thank you very much Code with Passion and Strive for Excellence https://www.slideshare.net/nileshgule/presentations https://speakerdeck.com/nileshgule/
  11. Nilesh Gule ARCHITECT | MICROSOFT MVP “Code with Passion and Strive for Excellence” nileshgule @nileshgule Nilesh Gule NileshGule www.handsonarchitect.com
  12. Q&A
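
The distinction on slide 3 between lazy transformations and actions can be illustrated without a Spark cluster. What follows is a plain-Python analogy only; it is not Spark or .Net for Apache Spark code and not the code shown in the demo. `LazyCollection`, `reduce_by_key`, `demo`, and the sample rating lines are hypothetical names and data invented for this sketch, loosely shaped like MovieLens (movie id, rating) pairs.

```python
from typing import Callable, Iterable


class LazyCollection:
    """Toy stand-in for a distributed collection (hypothetical, not a Spark class).

    Transformations return a new LazyCollection and compute nothing.
    Actions walk the chain of transformations and produce a result.
    """

    def __init__(self, source: Callable[[], Iterable]):
        # `source` is a zero-argument function, so each action can re-read the data.
        self._source = source

    # --- transformations: only describe the work ---
    def map(self, func):
        return LazyCollection(lambda: (func(x) for x in self._source()))

    def filter(self, predicate):
        return LazyCollection(lambda: (x for x in self._source() if predicate(x)))

    def reduce_by_key(self, func):
        def run():
            merged = {}
            for key, value in self._source():
                merged[key] = func(merged[key], value) if key in merged else value
            return iter(merged.items())
        return LazyCollection(run)

    # --- actions: trigger the work ---
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())


def demo():
    parsed_lines = []  # records every line that actually gets parsed

    def parse(line):
        parsed_lines.append(line)
        movie_id, rating = line.split(",")
        return movie_id, float(rating)

    # Made-up "movieId,rating" lines, not real MovieLens data.
    lines = ["1,4.0", "2,3.5", "1,5.0", "3,2.0"]

    ratings = LazyCollection(lambda: lines).map(parse)
    good = ratings.filter(lambda pair: pair[1] >= 3.0)
    totals = good.reduce_by_key(lambda a, b: a + b)

    # Three transformations are chained, yet nothing has been parsed so far.
    parsed_before_action = len(parsed_lines)

    # Actions force evaluation of the whole chain.
    result = sorted(totals.collect())
    return parsed_before_action, result, good.count()


if __name__ == "__main__":
    print(demo())
```

In this sketch every action re-runs the chain from the source, which is roughly why Spark lets you cache an intermediate result that several actions reuse.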