Alternative to Map Reduce for certain applications • A low latency cluster computing system • For very large data sets • May be 100 times faster than Map Reduce for – Iterative algorithms – Interactive data mining • Used with Hadoop / HDFS • Released under BSD License www.semtech-solutions.co.nz [email protected]
memory cluster computing • Memory access faster than disk access • Has API's written in – Scala – Java – Python • Can be accessed from Scala and Python shells • Currently an Apache incubator project www.semtech-solutions.co.nz [email protected]
via – CPU, memory or network bandwidth • Tune data serialization method i.e. – Java ObjectOutputStream vs Kryo • Memory Tuning – Use primitive types – Set JVM Flags – Store objects in serialized form i.e. • RDD Persistence • MEMORY_ONLY_SER www.semtech-solutions.co.nz [email protected]
Showing a simple text count from a system log. /*** SimpleJob.scala ***/ import spark.SparkContext import SparkContext._ object SimpleJob { def main(args: Array[String]) { val logFile = "/var/log/syslog" // Should be some file on your system val sc = new SparkContext("local", "Simple Job", "$YOUR_SPARK_HOME", List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar")) val logData = sc.textFile(logFile, 2).cache() val numAs = logData.filter(line => line.contains("a")).count() val numBs = logData.filter(line => line.contains("b")).count() println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) } } www.semtech-solutions.co.nz [email protected]
www.semtech-solutions.co.nz – [email protected] • We offer IT project consultancy • We are happy to hear about your problems • You can just pay for those hours that you need • To solve your problems