An introduction to Apache Spark

An introduction to Apache Spark: what it is, how it
works, why you might use it, and some examples of
its use.

Mike Frampton

August 04, 2013
Transcript

  1. Apache Spark
     • What is it?
     • How does it work?
     • Benefits
     • Tuning
     • Examples
     www.semtech-solutions.co.nz [email protected]
  2. Spark – What is it?
     • Open source
     • An alternative to MapReduce for certain applications
     • A low-latency cluster computing system
     • For very large data sets
     • May be 100 times faster than MapReduce for
       – Iterative algorithms
       – Interactive data mining
     • Used with Hadoop / HDFS
     • Released under the BSD License
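The speed-up for iterative algorithms comes from caching: the working data set is loaded into cluster memory once and re-read on every pass, where MapReduce would re-read it from disk each iteration. A minimal sketch of this idea, assuming the Spark 0.7-era API used later in this deck (file path, object name and the averaging loop are illustrative, not from the slides):

```scala
/*** IterativeSketch.scala — hypothetical example, not from the deck ***/
import spark.SparkContext
import SparkContext._

object IterativeSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "Iterative Sketch")
    // cache() keeps the parsed numbers in cluster memory,
    // so each loop iteration below reads from RAM, not disk
    val values = sc.textFile("/tmp/values.txt")
                   .map(line => line.toDouble)
                   .cache()
    var estimate = 0.0
    for (i <- 1 to 10) {
      // every pass re-scans the same cached RDD
      estimate = values.map(v => v - estimate).sum() / values.count()
    }
    println("Estimate: " + estimate)
  }
}
```

In MapReduce each of those ten passes would be a separate job reading its input from HDFS, which is where the claimed order-of-magnitude difference comes from.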
  3. Spark – How does it work?
     • Uses in-memory cluster computing
     • Memory access is faster than disk access
     • Has APIs written in
       – Scala
       – Java
       – Python
     • Can be accessed from the Scala and Python shells
     • Currently an Apache incubator project
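The interactive shells are what make the "interactive data mining" use case practical: the Scala shell provides a ready-made SparkContext named `sc`, so queries can be issued one at a time against cached data. A hypothetical session sketch (the file path and filter terms are illustrative):

```scala
// Sketch of a Spark Scala shell session — sc is provided by the shell
scala> val logs = sc.textFile("/var/log/syslog")
scala> val errors = logs.filter(line => line.contains("error")).cache()
scala> errors.count()                              // first count reads from disk
scala> errors.filter(_.contains("disk")).count()   // later queries hit the cache
```

Each follow-up query runs against the in-memory `errors` RDD, so exploratory questions come back at interactive speed.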
  4. Spark – Benefits
     • Scales to very large clusters
     • Uses in-memory processing for increased speed
     • High-level APIs – Java, Scala, Python
     • Low-latency shell access
  5. Spark – Tuning
     • Bottlenecks can occur in the cluster via CPU, memory or network bandwidth
     • Tune the data serialization method, i.e.
       – Java ObjectOutputStream vs Kryo
     • Memory tuning
       – Use primitive types
       – Set JVM flags
       – Store objects in serialized form, i.e.
         • RDD persistence
         • MEMORY_ONLY_SER
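The two tuning knobs named above can be sketched in code. This is a hedged example assuming the Spark 0.7-era API (the `spark.serializer` property value and the `spark.storage.StorageLevel` package differ in later releases; the file path and object name are illustrative):

```scala
/*** TuningSketch.scala — hypothetical example, not from the deck ***/
import spark.SparkContext
import SparkContext._
import spark.storage.StorageLevel

object TuningSketch {
  def main(args: Array[String]) {
    // Swap Java ObjectOutputStream serialization for Kryo
    // before the SparkContext is created
    System.setProperty("spark.serializer", "spark.KryoSerializer")
    val sc = new SparkContext("local", "Tuning Sketch")
    val data = sc.textFile("/var/log/syslog")
    // MEMORY_ONLY_SER stores the RDD as serialized bytes:
    // more CPU per access, but a much smaller memory footprint
    data.persist(StorageLevel.MEMORY_ONLY_SER)
    println(data.count())
  }
}
```

Serialized persistence trades CPU for memory, which helps when the cached data would otherwise not fit in the cluster's RAM.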
  6. Spark – Examples
     Example from spark-project.org: a Spark job in Scala,
     showing a simple text count from a system log.

     /*** SimpleJob.scala ***/
     import spark.SparkContext
     import SparkContext._

     object SimpleJob {
       def main(args: Array[String]) {
         val logFile = "/var/log/syslog" // Should be some file on your system
         val sc = new SparkContext("local", "Simple Job", "$YOUR_SPARK_HOME",
           List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))
         val logData = sc.textFile(logFile, 2).cache()
         val numAs = logData.filter(line => line.contains("a")).count()
         val numBs = logData.filter(line => line.contains("b")).count()
         println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
       }
     }
  7. Contact Us
     • Feel free to contact us at
       – www.semtech-solutions.co.nz
       – [email protected]
     • We offer IT project consultancy
     • We are happy to hear about your problems
     • You pay only for the hours you need to solve them