
How we built InsightEdge.io


Oleksii Diagiliev

August 16, 2016

Transcript

  1. About me
    • Chief Software Engineer at EPAM
    • Leading design and architecture efforts at http://insightedge.io
    • Blogging at http://dyagilev.org
  2. Modern applications: a blend of workloads
    • Transactional: essential to operate the business
    • Analytical: turning data into value (insights, diagnosis, decision making)
  3. Fare rates automatically increase when taxi demand is higher than the supply of drivers around you. This ensures reliability and availability for those who agree to pay more.
  4. XAP In-Memory Data Grid
    • Scale-out in-memory storage: scale by sharding or replication
    • Low latency and high throughput: in-memory storage, fast indexing, co-located ops, batch ops
    • Rich API and query language: SQL-like queries against POJOs and schema-free documents; Geospatial API
    • ACID transactions: full ACID compliance against your data set through full transaction semantics
    • High availability and resiliency: fault tolerance through replication, cross-data-center replication, auto-healing
    • Data tiering: RAM and SSD
  5. The InsightEdge stack (diagram): Spark (Spark SQL, Spark Streaming, Machine Learning) sits on the InsightEdge Core, which is backed by the In-Memory Data Grid with an in-memory (RAM) store and a flash/SSD/off-heap store; high availability, security and management span the stack.
  6. InsightEdge architecture: collocating Spark and the XAP data grid
    • node 1: Spark master + grid master
    • node 2: Spark worker + grid worker
    • node 3: Spark worker + grid worker
  7. InsightEdge RDD: resilient distributed dataset
    • List of parent RDDs: empty
    • Array of partitions the dataset is divided into: an XAP distributed query fetches the partitions and their hosts
    • Compute function that runs a computation on a partition: an iterator over a portion of the data
    • Optional preferred locations (hosts where a partition's data will be loaded): hosts from the distributed query
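The four components above can be modeled in plain Scala. This is an illustrative sketch only: the real RDD extends `org.apache.spark.rdd.RDD`, and the names `GridPartition` and `GridRddShape` are hypothetical stand-ins, not InsightEdge types.

```scala
// Plain-Scala model of the four things a custom RDD must define.
// In the real implementation these are overrides of Spark's RDD class;
// here they are abstract members of an illustrative trait.
case class GridPartition(index: Int, hosts: Seq[String])

trait GridRddShape[T] {
  // List of parent RDDs: empty for a source RDD
  def parentRdds: Seq[GridRddShape[_]] = Nil
  // Partitions the dataset is divided into (from the XAP distributed query)
  def partitions: Seq[GridPartition]
  // Compute function: iterate over one partition's portion of the data
  def compute(p: GridPartition): Iterator[T]
  // Preferred locations: the hosts the distributed query reported
  def preferredLocations(p: GridPartition): Seq[String] = p.hosts
}
```

A toy instance can back `partitions` and `compute` with an in-memory map to see how Spark would drive the three members together.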
  8. InsightEdge RDD: one-to-one partitioning
    • Spark partition #1 on node 1 reads XAP primary #1 over a direct connection; likewise Spark partitions #2 and #3 read XAP primaries #2 and #3 on nodes 2 and 3
    • Simple, but not enough parallelism for Spark
  9. InsightEdge RDD: with bucketing
    • Each XAP partition spreads its objects across buckets 0..1023
    • 1 Spark partition = M grid buckets; 1 XAP partition = N Spark partitions
    • Each Spark partition reads its bucket range with a range query by index
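The bucket-to-partition split can be sketched in plain Scala. The 1024 bucket count comes from the slide; the `Bucketing` helper and its names are illustrative, not the actual InsightEdge code.

```scala
// Sketch of the bucketing scheme: the 1024 routing buckets of one XAP
// partition are split into N contiguous ranges, one per Spark partition.
// Each range (start inclusive, end exclusive) becomes a range query by
// the bucket index against that XAP partition.
object Bucketing {
  val TotalBuckets = 1024

  /** Bucket ranges for one XAP partition split into `sparkPartitions`
    * Spark partitions; sizes differ by at most one bucket. */
  def bucketRanges(sparkPartitions: Int): Seq[(Int, Int)] = {
    val base = TotalBuckets / sparkPartitions
    val rem  = TotalBuckets % sparkPartitions
    (0 until sparkPartitions).map { i =>
      val start = i * base + math.min(i, rem)
      val end   = start + base + (if (i < rem) 1 else 0)
      (start, end)
    }
  }
}
```

For example, `Bucketing.bucketRanges(4)` yields `(0,256), (256,512), (512,768), (768,1024)`: four Spark partitions per XAP partition, each scanning 256 buckets.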
  10. InsightEdge DataFrames: predicate pushdown & column pruning
    • Example: SELECT SUM(amount) FROM order WHERE city = 'NY' AND year > 2012; the aggregation runs in Spark, while filtering and column pruning run in the data grid
    • Implemented via the Spark SQL DataSource API:
      - pushes predicates down to the data grid
      - leverages grid indexes
      - transparent to the user
      - enables support for other languages (Python/R)
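The pushdown step can be sketched as a function that turns Spark's filters and required columns into a grid-side query. The `Filter` ADT below is a simplified stand-in for `org.apache.spark.sql.sources.Filter`, and `toGridQuery` is a hypothetical helper, not the InsightEdge implementation.

```scala
// Simplified stand-ins for Spark's DataSource filter classes.
sealed trait Filter
case class EqualTo(attr: String, value: Any)     extends Filter
case class GreaterThan(attr: String, value: Any) extends Filter

object Pushdown {
  /** Build the query the data grid would execute: only the required
    * columns (column pruning) and only the pushed-down predicates. */
  def toGridQuery(table: String, columns: Seq[String], filters: Seq[Filter]): String = {
    val where = filters.map {
      case EqualTo(a, v: String) => s"$a = '$v'"   // quote string literals
      case EqualTo(a, v)         => s"$a = $v"
      case GreaterThan(a, v)     => s"$a > $v"
    }.mkString(" AND ")
    s"SELECT ${columns.mkString(", ")} FROM $table" +
      (if (where.nonEmpty) s" WHERE $where" else "")
  }
}
```

For the slide's example, only `amount` and the two predicates reach the grid; `SUM` stays in Spark.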
  11. Extending Spark with a Geospatial API
    • Shapes and queries: intersects, within, contains; backed by geo indexes
    • The tricky part: shapes are packed in a 3rd-party Java library
      - to register a UDT you have to add a Spark annotation to the shape classes (but you cannot)
      - DataFrames don't support a mix of Scala and Java types

```scala
val searchRadius = 3 // km
val userLocation = point(-77.024470, 39.032506)
val searchArea = circle(userLocation, kmToDeg(searchRadius))

val schools = sqlContext.read.grid.loadClass[School]
val nearestSchools = schools.filter(schools("location") geoWithin searchArea)
```

    Demo: https://github.com/InsightEdge/insightedge-geo-demo/
  12. Developing InsightEdge in Scala: the unpleasant part
    • Interoperability with Java: the native XAP model is declared as Java POJOs with annotations on getters; how do we declare them in Scala?
    • Unpleasant things (should be addressed in the next releases): mutable classes, no-args constructors, null instead of Option[T]

```scala
object annotation {
  import com.gigaspaces.annotation.pojo
  type SpaceId = pojo.SpaceId @beanGetter
  type SpaceRouting = pojo.SpaceRouting @beanGetter
  …
}

case class Data(
  @BeanProperty @SpaceId(autoGenerate = true) var id: String,
  @BeanProperty @SpaceRouting var routing: Long
) {
  def this() = this(null, -1)
}
```
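The pattern can be tried without the grid by dropping the XAP-specific annotations. This sketch keeps only the parts standard Scala provides: `@BeanProperty` generates the Java-style getters/setters a POJO-oriented library expects, and the extra no-args constructor satisfies reflective instantiation. The `Data` class here is an illustrative stand-alone version, not the annotated grid entry above.

```scala
import scala.beans.BeanProperty

// A grid-entry-style case class without the XAP annotations:
// mutable vars (so the library can populate the object), bean
// getters/setters, a no-args constructor, and null/-1 defaults
// instead of Option[T].
case class Data(
  @BeanProperty var id: String,
  @BeanProperty var routing: Long
) {
  // No-args constructor for reflective instantiation; note the
  // defaults must match the field types (null: String, -1: Long).
  def this() = this(null, -1L)
}
```

This is exactly what makes the pattern unpleasant: the class behaves like a Java bean, not like an idiomatic immutable Scala case class.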
  13. Developing InsightEdge in Scala: the good parts
    • Easy to extend the Spark API with implicit conversions:

```scala
import org.insightedge.spark.implicits.all._

val stream = …
stream.saveToGrid()

def gridRdd[R: ClassTag](): InsightEdgeRdd = { … }
val rdd = sc.gridRdd[Product]()
```

    • Code in a functional style is concise and readable
    • Developers are really productive with Scala
    • Negative experience with SBT so far; we use Maven
    • ClassTags keep the API clean (they work around the JVM's type erasure)
    • Mixin class compositions, e.g. for testing:

```scala
class InsightEdgeRDDSpec extends FlatSpec
  with IEConfig with InsightEdge with Spark
```
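The implicit-conversion trick itself is plain Scala and can be shown without Spark. This is a minimal sketch: `GridOps` and its pretend `saveToGrid()` are hypothetical, standing in for how InsightEdge's implicits add methods like `saveToGrid()` and `gridRdd()` to existing Spark types.

```scala
// An implicit class adds methods to a type you don't own: once the
// implicit is in scope, any Seq[T] appears to have saveToGrid().
// InsightEdge uses the same mechanism to extend SparkContext and
// DStream without modifying Spark.
object implicits {
  implicit class GridOps[T](val items: Seq[T]) extends AnyVal {
    /** Pretend save: returns how many items would be written to the grid. */
    def saveToGrid(): Int = items.size
  }
}
```

Usage mirrors the slide: `import implicits._` and then `Seq(1, 2, 3).saveToGrid()` compiles as if `Seq` always had the method. Extending `AnyVal` makes it a value class, so the wrapper usually costs no allocation.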
  14. How we do testing
    • Unit and integration tests with ScalaTest
    • Unit tests start Spark and XAP in embedded mode; they cover all of our API
    • Integration tests use Docker for virtualization; they cover:
      - bash scripts
      - Zeppelin notebooks
      - clustering and networking
    • Long-running tests are tagged and launched only on the 'master' and 'release' branches
    • We use Spotify's plugin to build images with Maven
    • We got rid of xebialabs/overcast in favor of spotify/docker-client, or just run containers with:

```scala
import sys.process._
"docker run image".!
```
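The `sys.process` one-liner from the last bullet can be wrapped so tests can run and inspect commands. In this sketch `RunContainer` is a hypothetical helper, and `echo` stands in for `docker run <image>` so the example works without Docker installed.

```scala
import sys.process._

// Thin wrapper over scala.sys.process: `!` runs a command and returns
// its exit code, `!!` runs it and captures stdout as a String.
object RunContainer {
  /** Run a command, returning its exit code (0 on success). */
  def run(cmd: String): Int = cmd.!

  /** Run a command and return its trimmed stdout. */
  def runCapture(cmd: String): String = cmd.!!.trim
}
```

In the real setup `run("docker run image")` would start the container; the same two calls cover both fire-and-forget and output-checking styles in integration tests.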