Scala @ Bizo

boia01
June 12, 2012

Scala usage at Bizo / Presentation made at ScalaDays 2011 @ Stanford University

Transcript

  1. Who is this talk for?

    You are considering or in the process of adopting Scala in your organization
    You want to hear about other people's experience adopting Scala
    You want to learn about companies using Scala to build cool software solutions
  2. Who am I?

    Alex Boisvert (twitter: @boia01)
    Software engineer/architect working at Bizo
    Back-end kinda guy, interested in system scalability
      • High-volume transaction-processing
      • Data retrieval and big data analytics
  3. About Bizo

    Online ad targeting & analytics platform
      • Reach 80 million business professionals online
      • Help better understand composition of web audience
      • Site personalization, action tracking, custom audiences, funnel analysis, …
      • And more!
    Offer many services through web APIs.
    Classify web visitors into “business demographics”
      • 150+ industries (agriculture, construction, health care, government, ...)
      • 100+ functional areas (finance, engineering, legal, sales, ...)
      • Company size (small, medium, large, F500, …)
      • Seniority (non-management, mid-management, executive, …)
      • … location, education, gender, and more.
  4. Engineering Team

    Team of 8 “dev-ops” engineers
    Use mix of Scala, Ruby, Java, Javascript, …
    All infrastructure running on Amazon Web Services such as EC2, Elastic Map-Reduce (Hadoop), etc.
    Started using Scala late in 2009.
    Scala now used almost everywhere (analytics backend, web APIs, scripts, etc.) except,
      • Large legacy Java components (will take some time)
      • Web codebase using Google Web Toolkit (GWT)
      • Smallish / prototyping-size web apps (Ruby + Rails)
  5. TL;DR

    Scala as scalable language       A+   (awesome!)
    Scala → Java Interoperability    A    (just as advertised)
    Java → Scala Interoperability    B-   (unintended consequence)
    Binary Compatibility             C    (you will have to deal with it)
    IDE Support                      C+   (trending towards B)
    Standard Library                 A-   (not 100% bug-free)
  6. Success Story – Scala + Big Data

    Handle between 2-3 billion web requests per month
    Data growing at 400% pace year-over-year
    Use Hadoop + Hive to aggregate web traffic
    Built “Sugarcube” – a NoSQL analytics database (OLAP)
      • 100% Scala code + some Java libraries – 20K LoC
      • Distributed, cloud-friendly (AWS), scale-out architecture
      • Multi-dimensional indexing of billions of rows with 12+ dimensions
      • Response time: < 100 ms (for typical queries)
      • 6 man-months from paper to production (2 prototype iterations)
      • Server cluster: 4 x m1.large EC2 instances
      • Indexing cluster: 2 x m2.2xlarge EC2 instances (2-3 hours/day)
  7. Why Scala?

    Productivity
      • Succinct code clearly expresses algorithms
      • “Systems Programming” without the hassle
    Concurrency
      • Excellent JVM primitives
      • Easy to build best-fitting abstractions
      • Immutable data structures
    Performance
      • 15X faster than equivalent Ruby prototype
  8. Scala → Java Interoperability

    Deploy webapps on Tomcat/Jetty
      • Scala is “just another jar”
      • No special handling compared to 100% Java apps
    Mix in many Java libraries
      • Jersey (RESTful web services / JAX-RS)
      • Apache Commons-*, HTTPClient, Log4J/SLF4J, etc.
      • Thrift, Spring, … and lots more.
    Still using Ant + Ivy to build most projects!
  9. Ivy dependencies

    <ivy-module version="1.1">
      <info organisation="bizo.com" module="api-web" />
      <dependencies defaultconf="default->default;sources->sources()">
        <!-- global (compile, test, runtime) dependencies -->
        <dependency org="bizo.com" name="account-management-remote-api" rev="2-20110506191607" />
        <dependency org="bizo.com" name="api-account-management" rev="1.2-dev-20110518205306" />
        <dependency org="simplistic" name="simplistic" rev="1.0.10b" />
        <dependency org="bizo.com" name="api-web-usage-sdb" rev="1.1-dev-20110518205924" />
        <dependency org="bizo.com" name="asperatus" rev="1.1-dev-20100923180448" />
        <dependency org="bizo.com" name="s3-app-logger" rev="1.1-dev-20110519175618" />
        <dependency org="bizo.com" name="datalog-tools" rev="1.1-dev-20101103222807" />
        <dependency org="bizo.com" name="datalog-manager" rev="3.1-dev-20110512225804" />
        <dependency org="bizo.com" name="bizographer-client" rev="1.1-dev-20110303190156" />
        <dependency org="bizo.com" name="thrift-bizographer" rev="1.1-dev-20110125190603" />
        <dependency org="bizo.com" name="thrift-dimension-model" rev="1.1-dev-20100804051620" />
        <dependency org="apache" name="commons-fileupload" rev="1.2.1" />
        <dependency org="apache" name="commons-io" rev="1.4" />
        <dependency org="twitter.com" name="scala-json_2.8.0" rev="1.1.2" />
        <dependency org="opencsv" name="opencsv" rev="1.8" />
        <dependency org="sun" name="jersey" rev="1.5" />
        <dependency org="sun" name="jersey-contrib" rev="1.5">
          <artifact name="oauth-server" />
          <artifact name="oauth-signature" />
        </dependency>
        <dependency org="scala" name="scala" rev="2.8.1.final" conf="default->default;sources->sources;buildtime->compiler"/>
        <dependency org="ehcache" name="ehcache" rev="1.6.0-beta5" />

        <!-- build time -->
        <dependency org="sun" name="servlet-api" rev="2.5" conf="buildtime->default" />
        <dependency org="findbugs" name="findbugs" rev="1.3.9" conf="buildtime->default" />
        <dependency org="cobertura" name="cobertura" rev="1.9.3" conf="cobertura->default" />
        <dependency org="svntask" name="svntask" rev="1.0.7" conf="buildtime->default" />

        <!-- test time only dependencies -->
        <dependency org="bizo.com" name="jtty" rev="1.1" conf="test->default" />
        <dependency org="junit" name="junit" rev="4.7" conf="test->default;sources->sources" />
        <dependency org="scalatest" name="scalatest" rev="1.0" conf="test->default" />
        <dependency org="mockito" name="mockito" rev="1.8.2" conf="test->default,sources"/>
        <dependency org="bizo.com" name="fakesdb" rev="2.2" conf="test->servlet" />
        <dependency org="oauth" name="oauth-signpost" rev="1.2" conf="test->default">
          <artifact name="signpost-core" type="jar" />
        </dependency>
      </dependencies>
    </ivy-module>
  10. Example – Jersey Annotations

    package com.bizo.api.web.controllers

    import javax.ws.rs.{GET, Path, PathParam, QueryParam, Produces}
    import javax.ws.rs.core.Response
    import com.bizo.api.web.model.{Result, TaxonomyResult}
    import com.bizo.util.BizographicNamingFactory.allByDimension

    @Path("/v1/taxonomy.{format}")
    class Taxonomy {
      @GET
      @Produces(Array("application/json", "application/xml", "text/csv"))
      def doGet(@QueryParam("callback") callback: String) = {
        new Result(Taxonomy.current, callback)
      }
    }

    object Taxonomy {
      val current = new TaxonomyResult("20100809", allByDimension)
    }
  11. Java 1nt@r0p: The Ugly

    scala.Option<String> x = scala.None$.MODULE$;

    Map<String, String> m = new HashMap();
    m.$plus(new Tuple2<String, String>("foo", "bar"));

    // don't try this at home, kids
    m.map(new Function1<Tuple2<String, String>, Tuple2<String, String>>() {
        // ...
      },
      HashMap$.MODULE$.<Tuple2<String, String>, Tuple2<String, String>>canBuildFrom());
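
One way to keep Java callers away from code like the snippet above is a thin Scala facade that only exposes `java.util` types. The following is a minimal sketch, assuming Scala 2.13 or later (for the `scala.jdk` converters); `Settings` and `JavaSettings` are hypothetical names used for illustration, not classes from the talk.

```scala
import java.util.{Map => JMap, Optional}
import scala.jdk.CollectionConverters._
import scala.jdk.OptionConverters._

// Hypothetical Scala service with an idiomatic Scala API.
class Settings(values: Map[String, String]) {
  def get(key: String): Option[String] = values.get(key)
  def all: Map[String, String] = values
}

// Hypothetical facade for Java callers: only java.util types cross the boundary,
// so Java code never touches MODULE$, Tuple2, Function1 or CanBuildFrom.
class JavaSettings(underlying: Settings) {
  // Option[String] becomes java.util.Optional[String]
  def get(key: String): Optional[String] = underlying.get(key).toJava

  // immutable Scala Map becomes a read-only java.util.Map view
  def all: JMap[String, String] = underlying.all.asJava
}
```

From Java this reads as plain `new JavaSettings(settings).get("foo")`, at the cost of maintaining one extra class per exposed API.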
  12. Oh Noes! Testing in-VM!!

    [Diagram: Web, Data-Access Layer and OLAP Database components, labeled Java + GWT, Java and Scala; callout “@#$@!%”, a.k.a. “leaky abstraction”]
  13. trait IndexingContext {
        val database: String
        val cube: String
        val dimensions: IndexedSeq[(String, String)]
        val measures: IndexedSeq[(String, String)]
        val aggregates: IndexedSeq[String]
        val hierarchicalLevels: IndexedSeq[Level]
        …
      }
  14. import scala.collection.mutable.ArrayBuffer

      /** Java-friendly builder */
      class IndexingContextBuilder {
        // underscore-prefixed so they do not clash with the IndexingContext members below
        private val _dimensions = new ArrayBuffer[(String, String)]
        private val _measures = new ArrayBuffer[(String, String)]

        def addDimension(name: String, dataType: String): Unit = { _dimensions += ((name, dataType)) }
        def addMeasure(name: String, dataType: String): Unit = { _measures += ((name, dataType)) }
        ...
        def toContext(): IndexingContext = new IndexingContext {
          override val dimensions = _dimensions
          override val measures = _measures
          ...
        }
      }
  15. /**
       * A simpler Sugarcube class.
       *
       * Exists for the sole purpose of making Java testing easier
       * since dealing with abstract Scala classes with traits
       * in Java is hell.
       */
      class SimpleSugarcube(val cubes: Map[String, PartitionedCube]) extends Sugarcube {
        import scala.collection.JavaConversions._

        def this(cubes: java.util.Map[String, PartitionedCube]) = this(cubes.toMap)

        protected def cube(database: String, name: String) = cubes(name)
      }
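
The slide relies on the implicit conversions in `scala.collection.JavaConversions`, which was deprecated in Scala 2.12 and removed in 2.13. A minimal sketch of the same Java-friendly constructor with explicit conversions, assuming Scala 2.13 or later; `Cube` and `CubeRegistry` are hypothetical stand-ins for `PartitionedCube` and `SimpleSugarcube`.

```scala
import scala.jdk.CollectionConverters._

// Hypothetical stand-in for the slide's PartitionedCube.
final case class Cube(name: String)

class CubeRegistry(val cubes: Map[String, Cube]) {
  // Java-friendly constructor: asScala wraps the java.util.Map,
  // toMap then copies it into an immutable Scala Map.
  def this(javaCubes: java.util.Map[String, Cube]) = this(javaCubes.asScala.toMap)

  def cube(name: String): Cube = cubes(name)
}
```

Because `toMap` copies the entries, later changes to the Java map are not visible through the registry.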
  16. trait MapReduce[Input, Output] {
        val executor: ExecutorService = { /* default (n processors + 1) thread pool */ }

        def map(input: Input): Output
        def reduce(o1: Output, o2: Output): Output

        final def submit(inputs: Traversable[Input]): Future[Output] = { /* profit !!! */ }
      }

      Full code @ https://github.com/aboisvert/scala-samples/tree/master/mapreduce
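
The slide leaves the body of `submit` out. One possible shape for it is sketched below; this is not the author's implementation (see the linked repository for that). It assumes `java.util.concurrent.Future`, takes an `Iterable` instead of the older `Traversable`, and declares `executor` as an abstract `def`; all three are choices made for this sketch.

```scala
import java.util.concurrent.{Callable, ExecutorService, Executors, Future}

trait MapReduce[Input, Output] { self =>
  // Assumption of this sketch: abstract def, so overriding it with a val
  // in an anonymous subclass has no initialization-order pitfalls.
  def executor: ExecutorService

  def map(input: Input): Output
  def reduce(o1: Output, o2: Output): Output

  final def submit(inputs: Iterable[Input]): Future[Output] = {
    // Fan out: one task per input.
    val mapped: List[Future[Output]] = inputs.toList.map { in =>
      executor.submit(new Callable[Output] { def call(): Output = self.map(in) })
    }
    // Fan in: one more task waits for the mapped results and folds them.
    // With empty input the fold fails, and Future.get reports the failure.
    executor.submit(new Callable[Output] {
      def call(): Output =
        mapped.map(_.get()).reduce[Output]((a, b) => self.reduce(a, b))
    })
  }
}
```

The reducing task is queued after all of its map tasks, so with a FIFO pool such as `Executors.newFixedThreadPool` it only starts once those tasks have been picked up. Map functions that themselves block on the same pool could still starve it.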
  17. def expandCuboids() {
        var level = 1
        var expansion = true

        // stop when no cuboid expanded within a layer or when
        // we've expanded everything (worst case)
        while (expansion && level <= dimensions.size) {
          val cuboids = for {
            set <- combinations(dimensions)
            dims <- combinations(set, level)
          } yield dims

          // Assign to the outer `expansion`; a new `val expansion` here would shadow it
          // and the loop would never stop early. get() assumes that submit returns a
          // java.util.concurrent.Future.
          expansion = (new MapReduce[Dimensions, Boolean] {
            override val executor = indexingExecutor
            def map(cuboid: Dimensions) = expandCuboid(cuboid)
            def reduce(expanded1: Boolean, expanded2: Boolean) = expanded1 || expanded2
          } submit cuboids).get()

          level += 1
        }
      }
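
The `combinations` helper used on this slide is not shown. As a rough illustration of the level-by-level enumeration the loop performs, the sketch below lists all dimension subsets of a given size with the standard library's `combinations`. `Cuboids`, `cuboidsAtLevel` and `cuboidLattice` are hypothetical names, and representing a cuboid as a `Set[String]` is an assumption.

```scala
object Cuboids {
  // All cuboids (dimension subsets) with exactly `level` dimensions.
  def cuboidsAtLevel(dimensions: Seq[String], level: Int): List[Set[String]] =
    dimensions.distinct.combinations(level).map(_.toSet).toList

  // The whole lattice, one list per level, smallest cuboids first.
  def cuboidLattice(dimensions: Seq[String]): List[List[Set[String]]] =
    (1 to dimensions.distinct.size).toList.map(level => cuboidsAtLevel(dimensions, level))
}
```

For n dimensions, level k holds n-choose-k cuboids, which is why the slide's loop stops as soon as a whole level yields no expansion.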
  18. trait PartitionedCube {
        …
        def query(
          aggregates: Set[Aggregate],
          conditions: Map[Dimension, Set[Value]],
          groupBy: Set[Dimension]
        ): QueryResult = {
          new MapReduce[Partition, QueryResult] {
            override val executor = PartitionedCube.this.executor
            def map(p: Partition) = p.query(aggregates, conditions) groupBy (groupBy)
            def reduce(r1: QueryResult, r2: QueryResult) = (r1 merge r2)
          } submit (partitions filter conditions)
        }
      }
  19. trait ParallelForeach {
        val executor: ExecutorService

        def foreach[T](xs: Traversable[T])(f: T => Unit) = {
          new MapReduce[T, Unit] {
            override val executor = ParallelForeach.this.executor
            override def map(t: T) = f(t)
            override def reduce(u1: Unit, u2: Unit) = ()
          } submit xs
        }
      }
  20. val parallel = new ParallelForeach {
        val executor = …
      }

      parallel.foreach(files) { f =>
        // download, unzip, etc ...
      }
  21. import simplistic._

      val account = new SimpleDBAccount(key, secret)

      // list all domains
      account.domains.toList

      // list all items in mydomain
      account.domain("mydomain").items.toList

      // create item with single attribute "bar" and value "baz"
      account.domain("mydomain") item ("foo") += ("bar" -> "baz")

      // query and print results
      account.select("select * from mydomain") foreach { e => println(e.name) }

      Full code @ https://github.com/aboisvert/simplistic
  22. // type-safe attributes
      import simplistic.Attributes._
      import simplistic.Conversions._

      object User {
        val user = attribute("user") // default to String type
        val startDate = optionalAttribute("startDate", ISO8601Date)
        val visits = attribute("visits", PositiveInt)
        val tags = multiValued("tags")
      }

      import User._

      users.unique += (user("jack"), startDate(d1), visits(100))
      users.unique += (user("jon"), startDate(d2), visits(20))
      users.unique += (user("alice"), startDate(d3), visits(15))

      users.find(user("jack")) += tags("male", "frequent", "premium")
  23. // type-safe queries
      import simplistic.Query._

      val visitors = for (
        item <- users (visits > 1 and visits < 50 sort visits desc)
      ) yield user(item)

      // without for-comprehension
      val visitors = {
        users select (visits > 1 and visits < 50 sort visits desc)
      }

      visitors foreach println
  24. // SimpleDB “conditional put”
      import simplistic.Query._

      class Task {
        …
        def updateIfUnassigned(): Boolean = try {
          setIf(assigned doesNotExist)(updateAttributes)
          true
        } catch {
          case ex: ConditionalCheckFailed => false
        }
      }
  25. // Table definitions
      //
      // e.g. for data files stored on Hadoop File System (HDFS)
      object Persons extends Table[(String, Int)]("persons") {
        def name = column[String]("name")
        def age = column[Int]("age")
        def * = name ~ age
      }

      object Follows extends Table[(String, String)]("follows") {
        def follower = column[String]("follower")
        def followee = column[String]("followee")
        def * = follower ~ followee
      }
  26. /* Predator detection */
      for {
        p1 <- Persons
        p2 <- Persons where { p2 => p2.age < 16 && p1.age - p2.age > 10 }
        _ <- Follows where { f => (f.follower is p1.name) && (f.followee is p2.name) }
        _ <- Query.orderBy { (p1.age - p2.age) desc }
      } yield p1.name ~ p2.name

      Full code @ https://github.com/aboisvert/revolute
  27. Standard Library

    Odds are you're going to run into some bugs :-(
      • 2.8.1: NoSuchElementException in HashSet (issue with hash-code collisions)
      • 2.9.0: View.groupBy() broken (StackOverflowError)
    Scala releases are few and far between …
    How will you deal with the situation?
      • Avoid the feature
      • Build your own
      • Fix it yourself
      • Buy support from Typesafe
    We decided to maintain a patched version of the standard library.
  28. Final Scorecard

    Scala as scalable language       A+   (awesome!)
    Scala → Java Interoperability    A    (just as advertised)
    Java → Scala Interoperability    B-   (unintended consequence)
    Binary Compatibility             C    (you will have to deal with it)
    IDE Support                      C+   (trending towards B)
    Standard Library                 A-   (not 100% bug-free)