Big Data and Functional Programming

A talk from Nick Shiftan at Hack+Startup on Feb 28, 2013.

hackplusstartup.com

First Round Capital


Transcript

  1. #hackplusstartup WHO AM I

     I'm Nick
     • CTO/co-founder @ Curalate
     • Previously @ Parkio, Microsoft, Harvard
     • Interested in distributed systems.
     • Scala enthusiast.

     I work at Curalate
     • We're the marketing platform for the visual web.
     • Traditional social media relies upon keywords to "hear" consumers.
     • When people "speak" with pictures (on Pinterest, Instagram, Tumblr, etc.), very few words are used.
     • Interesting problems around the intersection of computer vision and web-scale data.
     • Philly-based, 13 months old.
  2. #hackplusstartup WHAT DO THEY HAVE IN COMMON

     Both are hot.
     • Big data is everywhere you look.
     • Many (if not most) up-and-coming programming languages offer a functional paradigm.

     [Charts: Google Trends for "Big Data" (1/08-1/13); GitHub projects vs. StackOverflow tags]
     1. http://www.google.com/trends/explore
     2. http://redmonk.com/sogrady/2012/09/12/language-rankings-9-12/
  3. #hackplusstartup QUICK DETOUR - WHY FP?

     The Classic Reason

     "The ways in which one can divide up the original problem depend directly on the ways in which one can glue solutions together. Therefore, to increase one's ability to modularise a problem conceptually, one must provide new kinds of glue in the programming language."
     John Hughes - Why Functional Programming Matters (1984)

     The Modern Reason(s)
     • Concurrency is hard.
     • Multi-core machines are ubiquitous.
     • Renewed interest in distributed computing and web-scale problems.
     • FP languages are no longer slow.
     • Social factors.
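     A minimal Scala sketch (not from the deck; names are illustrative) of the "glue" Hughes describes: higher-order functions and composition let small, independent pieces be recombined into larger programs.

         object GlueExample {
           def main(args: Array[String]): Unit = {
             val tokenize: String => List[String]        = _.split("\\s+").toList
             val normalize: List[String] => List[String] = _.map(_.toLowerCase)
             // Function composition is the glue: a pipeline built from parts.
             val pipeline = tokenize andThen normalize
             println(pipeline("Functional Programming Matters"))
             // List(functional, programming, matters)
           }
         }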
  4. #hackplusstartup JOINED AT BIRTH

     Map is functional
     • In FP, map is a higher-order function that takes two parameters: a function f, and a list of values.
     • It applies f to each item in the list, and returns a new list composed of the results.

     Reduce is functional
     • Reduce is a special case of the fold function.
     • In FP, fold is a function that takes three parameters: a function f, a starting value, and a list of values.
     • It combines all values in the list into a single value, using f.

         import scala.collection.mutable.ListBuffer

         // Apply f to every item, collecting the results into a new list.
         def map[A, B](f: A => B, items: List[A]): List[B] = {
           val newItems = new ListBuffer[B]
           items.foreach(item => newItems += f(item))
           newItems.toList
         }

         // Thread an accumulator through the list, combining with f.
         def fold[A, S](f: (S, A) => S, s: S, items: List[A]): S = {
           var retval = s
           items.foreach(item => retval = f(retval, item))
           retval
         }

         // Reduce is fold seeded with the list's first element.
         def reduce[A](f: (A, A) => A, items: List[A]): A =
           fold(f, items.head, items.tail)
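     A quick usage sketch (not on the slide) exercising the three definitions above; it assumes they are in scope, e.g. pasted inside the same object:

         object MapFoldReduceDemo {
           // (paste the map/fold/reduce definitions from above here)
           def main(args: Array[String]): Unit = {
             val nums = List(1, 2, 3, 4)
             println(map[Int, Int](_ * 2, nums))      // List(2, 4, 6, 8)
             println(fold[Int, Int](_ + _, 0, nums))  // 10
             println(reduce[Int](_ + _, nums))        // 10
             // The standard library's built-ins give the same results:
             println(nums.map(_ * 2))
             println(nums.foldLeft(0)(_ + _))
             println(nums.reduce(_ + _))
           }
         }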
  5. #hackplusstartup HELLO WORLD!

     MapReduce is functional

         import scala.io.Source

         object WordCount {
           def main(args: Array[String]): Unit = {
             val files = args.map(Source.fromFile(_))
             val results = files.
               flatMap(_.mkString.split("\\s")).
               groupBy(word => word).
               map(group => (group._1, group._2.size))
           }
         }

     • FP yields really elegant solutions to problems that lend themselves to partitioning.
     • Hadoop is built upon fundamentally FP concepts.
     • So, what happened?
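     To see the pipeline actually produce output, here is a small self-contained variant (not on the slide; the sample text is invented) that counts words in an in-memory string:

         object WordCountDemo {
           def main(args: Array[String]): Unit = {
             val text = "to be or not to be"
             val counts = text.split("\\s").
               groupBy(word => word).
               map(group => (group._1, group._2.size))
             counts.foreach(println)  // (be,2), (or,1), (not,1), (to,2), in some order
           }
         }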
  6. #hackplusstartup WHAT HAPPENED?

     How did we end up with this?

     HadoopWordCount.java:

         import java.io.IOException;
         import java.util.StringTokenizer;

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.fs.Path;
         import org.apache.hadoop.io.IntWritable;
         import org.apache.hadoop.io.LongWritable;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mapreduce.Job;
         import org.apache.hadoop.mapreduce.Mapper;
         import org.apache.hadoop.mapreduce.Reducer;
         import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
         import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
         import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
         import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

         public class HadoopWordCount {

           public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
             private final static IntWritable one = new IntWritable(1);
             private Text word = new Text();

             public void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
               String line = value.toString();
               StringTokenizer tokenizer = new StringTokenizer(line);
               while (tokenizer.hasMoreTokens()) {
                 word.set(tokenizer.nextToken());
                 context.write(word, one);
               }
             }
           }

           public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
             public void reduce(Text key, Iterable<IntWritable> values, Context context)
                 throws IOException, InterruptedException {
               int sum = 0;
               for (IntWritable val : values) {
                 sum += val.get();
               }
               context.write(key, new IntWritable(sum));
             }
           }

           public static void main(String[] args) throws Exception {
             Configuration conf = new Configuration();
             Job job = new Job(conf, "wordcount");
             job.setOutputKeyClass(Text.class);
             job.setOutputValueClass(IntWritable.class);
             job.setMapperClass(Map.class);
             job.setReducerClass(Reduce.class);
             job.setInputFormatClass(TextInputFormat.class);
             job.setOutputFormatClass(TextOutputFormat.class);
             FileInputFormat.addInputPath(job, new Path(args[0]));
             FileOutputFormat.setOutputPath(job, new Path(args[1]));
             job.waitForCompletion(true);
           }
         }

     WordCount.scala:

         import scala.io.Source

         object WordCount {
           def main(args: Array[String]): Unit = {
             val files = args.map(Source.fromFile(_))
             val results = files.
               flatMap(_.mkString.split("\\s")).
               groupBy(word => word).
               map(group => (group._1, group._2.size))
           }
         }
  7. #hackplusstartup TO BE FAIR...

     Hadoop does a lot more than .map.reduce:
     • Distributed computation.
     • Data locality.
     • Fault tolerance & high availability.
     • Support for non-FP languages.

     Can we get all this alongside our FP idioms and libraries?
  8. #hackplusstartup BRIDGING THE GAP

     Two schools of thought:
     A. Build Hadoop (or Hadoop-like capabilities) into the language.
     B. Wrap Hadoop in FP-friendly libraries.
  9. #hackplusstartup WORTH A SHOT?

     Could Scala make Hadoop a first-class feature?
     • This is not as crazy as it sounds.
     • Parallel collections are built into Scala.
     • Google built FlumeJava (distributed collections).
     • Scala committers were, at one point, working on a distributed collections library. This has since been abandoned [1].
     • Current focus is on distributed computing via actors.

         theList.map(doSomething)       // sequential, standard collections
         theList.par.map(doSomething)   // parallel, built into Scala
         theList.dist.map(doSomething)  // distributed: the hypothetical next step

     1. https://github.com/scala-incubator/distributed-collections
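     A runnable sketch (not from the deck) of the first two lines above; doSomething is a stand-in. Note that .par ships with Scala through 2.12 (in 2.13 it moved to the separate scala-parallel-collections module):

         object ParDemo {
           def main(args: Array[String]): Unit = {
             val theList = (1 to 10).toList
             def doSomething(n: Int): Int = n * n
             println(theList.map(doSomething))      // runs on one thread
             println(theList.par.map(doSomething))  // e.g. ParVector(1, 4, 9, ...),
                                                    // same elements, computed across cores
           }
         }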
  10. #hackplusstartup THE OTHER OPTION

      Is it possible to make Hadoop FP-friendly?
      • Functional programming aside, there is a ton of activity around building novel wrappers around Hadoop: Cascalog, Cascading, Apache Crunch.
      • Some of these libraries have FP-friendly equivalents:
        • Crunch -> Scrunch
        • Cascading -> Scalding
  11. #hackplusstartup ANAGRAMS IN SCALA & SCALDING

      In plain Scala:

          object ScalaExample {
            def main(words: List[String]): Unit = {
              words.
                groupBy(word => new String(word.toCharArray.sorted)).
                map(_._2).
                map(anagrams => anagrams.mkString(" ")).
                foreach(println)
            }
          }

      In Scalding:

          import com.twitter.scalding._
          import org.apache.commons.lang.StringUtils

          class ScaldingExample(args: Args) extends Job(args) {
            val input = TextLine("data/nytimes_1899-2012")
            val output = TextLine("data/anagrams")

            // Mappers
            def tokenizeWords(s: String): Array[String] = StringUtils.split(s, " \n\t")
            def makeAnagramHash(s: String): String = new String(s.toCharArray.sorted)

            // Reducers
            def combineAnagrams(gb: GroupBuilder): GroupBuilder =
              gb.sortBy('word).mkString('word -> 'words, ", ")

            input.
              read.
              flatMap('line -> 'word)(tokenizeWords).
              map('word -> 'hash)(makeAnagramHash).
              groupBy('hash)(combineAnagrams).
              project('words).
              write(output)
          }
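      As a quick check of the plain-Scala version, a self-contained sketch (sample words invented):

          object AnagramDemo {
            def main(args: Array[String]): Unit = {
              val words = List("listen", "silent", "enlist", "stone", "tones")
              words.
                groupBy(word => new String(word.toCharArray.sorted)).
                map(_._2).
                map(anagrams => anagrams.mkString(" ")).
                foreach(println)
              // Prints two lines: "listen silent enlist" and "stone tones" (order may vary)
            }
          }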