An introduction to Apache Spark

An introduction to Apache Spark: what it is, how it
works, why you might use it, and some examples of
its use.

Mike Frampton

August 04, 2013
Transcript

  1. Apache Spark
     • What is it?
     • How does it work?
     • Benefits
     • Tuning
     • Examples
     www.semtech-solutions.co.nz [email protected]
  2. Spark – What is it?
     • Open source
     • An alternative to MapReduce for certain applications
     • A low-latency cluster computing system
     • For very large data sets
     • May be 100 times faster than MapReduce for
       – Iterative algorithms
       – Interactive data mining
     • Used with Hadoop / HDFS
     • Released under the BSD License
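The speed-up for iterative algorithms comes from caching: the working data set is loaded into cluster memory once and re-read on every pass, where MapReduce would re-read it from disk each iteration. A minimal sketch of this idea, assuming the Spark 0.7-era API used later in this deck (file path, object name and the averaging loop are illustrative, not from the slides):

```scala
/*** IterativeSketch.scala — hypothetical example, not from the deck ***/
import spark.SparkContext
import SparkContext._

object IterativeSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "Iterative Sketch")
    // cache() keeps the parsed numbers in cluster memory,
    // so each loop iteration below reads from RAM, not disk
    val values = sc.textFile("/tmp/values.txt")
                   .map(line => line.toDouble)
                   .cache()
    var estimate = 0.0
    for (i <- 1 to 10) {
      // every pass re-scans the same cached RDD
      estimate = values.map(v => v - estimate).sum() / values.count()
    }
    println("Estimate: " + estimate)
  }
}
```

In MapReduce each of those ten passes would be a separate job reading its input from HDFS, which is where the claimed order-of-magnitude difference comes from.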
  3. Spark – How does it work?
     • Uses in-memory cluster computing
     • Memory access is faster than disk access
     • Has APIs written in
       – Scala
       – Java
       – Python
     • Can be accessed from the Scala and Python shells
     • Currently an Apache incubator project
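The interactive shells are what make the "interactive data mining" use case practical: the Scala shell provides a ready-made SparkContext named `sc`, so queries can be issued one at a time against cached data. A hypothetical session sketch (the file path and filter terms are illustrative):

```scala
// Sketch of a Spark Scala shell session — sc is provided by the shell
scala> val logs = sc.textFile("/var/log/syslog")
scala> val errors = logs.filter(line => line.contains("error")).cache()
scala> errors.count()                              // first count reads from disk
scala> errors.filter(_.contains("disk")).count()   // later queries hit the cache
```

Each follow-up query runs against the in-memory `errors` RDD, so exploratory questions come back at interactive speed.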
  4. Spark – Benefits
     • Scales to very large clusters
     • Uses in-memory processing for increased speed
     • High-level APIs – Java, Scala, Python
     • Low-latency shell access
  5. Spark – Tuning
     • Bottlenecks can occur in the cluster via CPU, memory or network bandwidth
     • Tune the data serialization method, i.e.
       – Java ObjectOutputStream vs Kryo
     • Memory tuning
       – Use primitive types
       – Set JVM flags
       – Store objects in serialized form, i.e.
         • RDD persistence
         • MEMORY_ONLY_SER
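The two tuning knobs named above can be sketched in code. This is a hedged example assuming the Spark 0.7-era API (the `spark.serializer` property value and the `spark.storage.StorageLevel` package differ in later releases; the file path and object name are illustrative):

```scala
/*** TuningSketch.scala — hypothetical example, not from the deck ***/
import spark.SparkContext
import SparkContext._
import spark.storage.StorageLevel

object TuningSketch {
  def main(args: Array[String]) {
    // Swap Java ObjectOutputStream serialization for Kryo
    // before the SparkContext is created
    System.setProperty("spark.serializer", "spark.KryoSerializer")
    val sc = new SparkContext("local", "Tuning Sketch")
    val data = sc.textFile("/var/log/syslog")
    // MEMORY_ONLY_SER stores the RDD as serialized bytes:
    // more CPU per access, but a much smaller memory footprint
    data.persist(StorageLevel.MEMORY_ONLY_SER)
    println(data.count())
  }
}
```

Serialized persistence trades CPU for memory, which helps when the cached data would otherwise not fit in the cluster's RAM.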
  6. Spark – Examples
     Example from spark-project.org: a Spark job in Scala,
     showing a simple text count from a system log.

     /*** SimpleJob.scala ***/
     import spark.SparkContext
     import SparkContext._

     object SimpleJob {
       def main(args: Array[String]) {
         val logFile = "/var/log/syslog" // Should be some file on your system
         val sc = new SparkContext("local", "Simple Job", "$YOUR_SPARK_HOME",
           List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))
         val logData = sc.textFile(logFile, 2).cache()
         val numAs = logData.filter(line => line.contains("a")).count()
         val numBs = logData.filter(line => line.contains("b")).count()
         println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
       }
     }
  7. Contact Us
     • Feel free to contact us at
       – www.semtech-solutions.co.nz
       – [email protected]
     • We offer IT project consultancy
     • We are happy to hear about your problems
     • You pay only for the hours you need to solve them