
An introduction to Apache Spark


An introduction to Apache Spark: what it is, how it works, why use it, and some examples of use.

Mike Frampton

August 04, 2013


Transcript

  1. Apache Spark
     • What is it?
     • How does it work?
     • Benefits
     • Tuning
     • Examples
     www.semtech-solutions.co.nz [email protected]
  2. Spark – What is it?
     • Open source
     • An alternative to MapReduce for certain applications
     • A low-latency cluster computing system
     • For very large data sets
     • May be up to 100 times faster than MapReduce for
       – Iterative algorithms
       – Interactive data mining
     • Used with Hadoop / HDFS
     • Released under a BSD license
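The speed-up for iterative algorithms comes from keeping the working data set cached in cluster memory across passes, instead of re-reading it from disk on every iteration as a chain of MapReduce jobs would. A minimal sketch in the Spark 0.x API used later in this deck (the file path and the per-iteration step are placeholders, not from the original slides):

```scala
import spark.SparkContext
import SparkContext._

object IterativeSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "Iterative Sketch")

    // Load once and cache in memory; each pass below reuses the
    // cached RDD rather than going back to disk, which is where the
    // gain over repeated MapReduce jobs comes from.
    val lines = sc.textFile("/path/to/data.txt").cache()

    var threshold = 100.0
    for (i <- 1 to 10) {
      // Placeholder iteration step: count lines shorter than the threshold.
      val count = lines.filter(_.length < threshold).count()
      threshold = threshold * 0.9
    }
    sc.stop()
  }
}
```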
  3. Spark – How does it work?
     • Uses in-memory cluster computing
     • Memory access is much faster than disk access
     • Has APIs written in
       – Scala
       – Java
       – Python
     • Can be used interactively from the Scala and Python shells
     • Currently an Apache incubator project
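The interactive shells mentioned above let you explore data without writing a full job. A hypothetical session in the Scala shell (spark-shell) might look like the following, where `sc` is the SparkContext the shell creates for you; the file path is just an example:

```scala
// Inside spark-shell, where the context `sc` is predefined:
scala> val lines = sc.textFile("/var/log/syslog")
scala> lines.count()                              // total number of lines
scala> lines.filter(_.contains("error")).count()  // lines mentioning "error"
```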
  4. Spark – Benefits
     • Scales to very large clusters
     • Uses in-memory processing for increased speed
     • High-level APIs – Java, Scala, Python
     • Low-latency shell access
  5. Spark – Tuning
     • Bottlenecks can occur in the cluster via CPU, memory or network bandwidth
     • Tune the data serialization method, e.g.
       – Java ObjectOutputStream vs Kryo
     • Memory tuning
       – Use primitive types
       – Set JVM flags
       – Store objects in serialized form, e.g.
         • RDD persistence with MEMORY_ONLY_SER
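A sketch of how the two tuning knobs above are set in the Spark of this era; the property and class names follow the 0.x documentation and should be treated as assumptions, and the file path is a placeholder:

```scala
import spark.SparkContext
import spark.storage.StorageLevel

// Select Kryo instead of Java serialization
// (system properties must be set before the context is created).
System.setProperty("spark.serializer", "spark.KryoSerializer")

val sc = new SparkContext("local", "Tuning Sketch")
val data = sc.textFile("/path/to/data.txt")

// Persist the RDD in memory in serialized form (MEMORY_ONLY_SER):
// slower to access than deserialized objects, but a much smaller footprint.
data.persist(StorageLevel.MEMORY_ONLY_SER)
```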
  6. Spark – Examples
     Example from spark-project.org: a Spark job in Scala, showing a simple text count from a system log.

     /*** SimpleJob.scala ***/
     import spark.SparkContext
     import SparkContext._

     object SimpleJob {
       def main(args: Array[String]) {
         val logFile = "/var/log/syslog" // Should be some file on your system
         val sc = new SparkContext("local", "Simple Job", "$YOUR_SPARK_HOME",
           List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))
         val logData = sc.textFile(logFile, 2).cache()
         val numAs = logData.filter(line => line.contains("a")).count()
         val numBs = logData.filter(line => line.contains("b")).count()
         println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
       }
     }
  7. Contact Us
     • Feel free to contact us at
       – www.semtech-solutions.co.nz
       – [email protected]
     • We offer IT project consultancy
     • We are happy to hear about your problems
     • You pay only for the hours you need to solve them