
How we built InsightEdge.io

Oleksii Diagiliev

August 16, 2016


  1. About me
     • Chief Software Engineer at EPAM
     • Leading design and architecture efforts at http://insightedge.io
     • Blogging at http://dyagilev.org
  2. Modern applications: a blend of workloads
     • Transactional: essential to operate the business
     • Analytical: turning data into value through insights, diagnosis and decision making
  3. Example: fare rates automatically increase when taxi demand is higher
     than the number of drivers around you. Ensure reliability and availability
     for those who agree to pay more.
  4. XAP In-Memory Data Grid
     • Scale-out in-memory storage: scale by sharding or replication
     • Low latency and high throughput: in-memory storage, fast indexing, co-located operations, batch operations
     • Rich API and query language: SQL-like queries against POJOs and schema-free documents, Geospatial API
     • ACID transactions: full ACID compliance through full transaction semantics
     • High availability and resiliency: fault tolerance through replication, cross-data-center replication, auto-healing
     • Data tiering: RAM and SSD
  5. InsightEdge stack (diagram): Spark, Spark SQL, Spark Streaming and
     Machine Learning on top of the InsightEdge Core, backed by the In-Memory
     Data Grid (in-memory RAM store plus Flash/SSD/off-heap store), with high
     availability, security and management.
  6. InsightEdge architecture: collocating Spark and the XAP data grid
     • node 1: Spark master + Grid master
     • node 2: Spark worker + Grid worker
     • node 3: Spark worker + Grid worker
  7. InsightEdge RDD: resilient distributed dataset
     • List of parent RDDs: empty (it is a source RDD)
     • An array of partitions the dataset is divided into: obtained via an XAP
       Distributed Query that returns the partitions and their hosts
     • A compute function that runs the computation on a partition: an iterator
       over that partition's portion of the data
     • Optional preferred locations, i.e. hosts for a partition where the data
       will be loaded: the hosts returned by the Distributed Query
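The four ingredients above can be sketched in a few lines. This is a minimal, self-contained stand-in (class and method names are illustrative, not the real Spark RDD or InsightEdge API), showing how partitions and preferred locations derive from a list of grid hosts:

```scala
// Hypothetical sketch of the four pieces a source RDD must provide.
case class GridPartition(index: Int, host: String)

class GridRddSketch(partitionHosts: Seq[String]) {
  // 1. List of parent RDDs: empty, this is a source RDD
  val parents: Seq[GridRddSketch] = Seq.empty

  // 2. Partitions: one per grid partition, with the host that owns it
  //    (in the real system this comes from an XAP Distributed Query)
  def getPartitions: Array[GridPartition] =
    partitionHosts.zipWithIndex.map { case (h, i) => GridPartition(i, h) }.toArray

  // 3. Compute function: an iterator over that partition's slice of the data
  def compute(p: GridPartition): Iterator[String] =
    Iterator(s"data-from-${p.host}")

  // 4. Preferred locations: schedule the task on the host holding the data
  def getPreferredLocations(p: GridPartition): Seq[String] = Seq(p.host)
}
```

The key design point is (4): by reporting the grid host as the preferred location, Spark schedules the task on the same node as the data, which is what makes the collocated deployment pay off.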
  8. InsightEdge RDD: one-to-one partitioning
     • node 1: Spark executor, Spark Partition #1 → XAP Primary #1 (direct connection)
     • node 2: Spark executor, Spark Partition #2 → XAP Primary #2
     • node 3: Spark executor, Spark Partition #3 → XAP Primary #3
     Simple, but not enough parallelism for Spark.
  9. InsightEdge RDD: with bucketing
     • Each XAP partition's data is spread across routing buckets 0..1023
     • 1 Spark partition = M grid buckets; 1 XAP partition = N Spark partitions
     • Each Spark partition reads its buckets from the XAP primary with a
       range query by index
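The bucket-to-partition mapping boils down to splitting a bucket range into N contiguous sub-ranges. A small sketch of that arithmetic (the helper name and the 1024-bucket count are assumptions based on the slide, not the real InsightEdge code):

```scala
// Hypothetical helper: split one XAP partition's routing buckets into
// N contiguous ranges, one per Spark partition. Each Spark partition then
// issues a range query by index over its bucket range.
def bucketRanges(totalBuckets: Int, sparkPartitions: Int): Seq[Range] = {
  val base = totalBuckets / sparkPartitions // minimum buckets per partition
  val rem  = totalBuckets % sparkPartitions // leftover buckets spread over the first `rem` partitions
  var start = 0
  (0 until sparkPartitions).map { i =>
    val size  = base + (if (i < rem) 1 else 0)
    val range = start until (start + size)
    start += size
    range
  }
}
```

For example, splitting 1024 buckets over 4 Spark partitions yields the ranges 0..255, 256..511, 512..767 and 768..1023, so each Spark partition scans an equal, disjoint slice of the grid partition.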
  10. InsightEdge DataFrames: predicate pushdown and column pruning

      SELECT SUM(amount) FROM order WHERE city = 'NY' AND year > 2012

      The aggregation runs in Spark; filtering and column pruning run in the
      Data Grid. Spark SQL integration, by implementing the DataSource API:
      • Pushing down predicates to the Data Grid
      • Leveraging indexes
      • Transparent to the user
      • Enabling support for other languages (Python/R)
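The pushdown step amounts to translating the filters and required columns Spark hands the data source into a grid-side query. A simplified, self-contained sketch of that translation (the filter ADT and helper are hypothetical stand-ins, not Spark's actual `sources.Filter` hierarchy or the InsightEdge implementation):

```scala
// Simplified model of the filters a DataSource implementation receives.
sealed trait Filter
case class EqualTo(attr: String, value: Any) extends Filter
case class GreaterThan(attr: String, value: Any) extends Filter

// Hypothetical translation of pruned columns + pushed-down filters
// into the SQL-like query executed inside the data grid.
def buildGridQuery(table: String, columns: Seq[String], filters: Seq[Filter]): String = {
  val cols = if (columns.isEmpty) "*" else columns.mkString(", ")
  val where = filters.map {
    case EqualTo(a, v: String) => s"$a = '$v'" // quote string literals
    case EqualTo(a, v)         => s"$a = $v"
    case GreaterThan(a, v)     => s"$a > $v"
  }.mkString(" AND ")
  if (where.isEmpty) s"SELECT $cols FROM $table"
  else s"SELECT $cols FROM $table WHERE $where"
}
```

For the slide's example query, only the `amount` column and the rows matching both predicates ever leave the grid; Spark then sums the already-filtered values.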
  11. Extending Spark with a Geospatial API
      The tricky part: the shapes are packed in a 3rd-party Java library
      • To register a UDT you have to add a Spark annotation to the shape
        classes (but you cannot, since the classes are not yours)
      • DataFrames don't support a mix of Scala and Java types
      Shapes, geo indexes, and queries (intersects, within, contains):

        val searchRadius = 3 // km
        val userLocation = point(-77.024470, 39.032506)
        val searchArea = circle(userLocation, kmToDeg(searchRadius))
        val schools = sqlContext.read.grid.loadClass[School]
        val nearestSchools = schools.filter(schools("location") geoWithin searchArea)

      Demo: https://github.com/InsightEdge/insightedge-geo-demo/
  12. Developing InsightEdge in Scala: the unpleasant part
      Interoperability with Java: the native XAP model is declared as Java
      POJOs with annotations on getters. How do we declare them in Scala?
      Unpleasant things (should be addressed in the next releases):
      • Mutable class
      • No-args constructor
      • Null instead of Option[T]

        object annotation {
          import com.gigaspaces.annotation.pojo
          type SpaceId = pojo.SpaceId @beanGetter
          type SpaceRouting = pojo.SpaceRouting @beanGetter
          …
        }

        case class Data(
          @BeanProperty @SpaceId(autoGenerate = true) var id: String,
          @BeanProperty @SpaceRouting var routing: Long
        ) {
          def this() = this(null, -1)
        }
  13. Developing InsightEdge in Scala: the good parts
      • Easy to extend the Spark API with implicit conversions:

          import org.insightedge.spark.implicits.all._
          val stream = …
          stream.saveToGrid()

          def gridRdd[R: ClassTag](): InsightEdgeRdd = { … }
          val rdd = sc.gridRdd[Product]()

      • Code in functional style is concise and readable
      • Developers are really productive with Scala
      • Negative experience with SBT so far; we use Maven
      • ClassTags make the API clean (solving the JVM's type erasure problem)
      • Mixin class compositions, e.g. for testing:

          class InsightEdgeRDDSpec extends FlatSpec
            with IEConfig with InsightEdge with Spark
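The "extend an API you don't own" trick above is the standard Scala implicit-class pattern. A minimal, self-contained sketch of it (the classes here are illustrative stand-ins, not the real InsightEdge or Spark types):

```scala
// Stand-in for a third-party class we cannot modify (think: Spark's RDD).
class FakeRdd[T](val data: Seq[T])

object GridImplicits {
  // Importing GridImplicits._ bolts saveToGrid() onto FakeRdd without
  // touching its source; the compiler inserts the wrapper automatically.
  implicit class GridOps[T](rdd: FakeRdd[T]) {
    def saveToGrid(): Int = rdd.data.size // pretend: number of items written
  }
}
```

The extension method is only visible where the implicit is imported, which is why a single `import org.insightedge.spark.implicits.all._` can light up the whole grid API on existing Spark types.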
  14. How we do testing
      • Unit and integration tests with ScalaTest
      • Unit tests start Spark and XAP in embedded mode; they cover all of our API
      • Integration tests use Docker for virtualization; they cover
        - bash scripts
        - Zeppelin notebooks
        - clustering and networking
      • We tag long-running tests and launch them only on the 'master' and
        'release' branches
      • We use Spotify's plugin to build images with Maven
      • We got rid of xebialabs/overcast, using spotify/docker-client, or just
        run containers with:

          import sys.process._
          "docker run image" !
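The `"docker run image" !` one-liner works because scala.sys.process turns strings and sequences into shell commands. A small example of the same mechanism without Docker (using `echo` as a stand-in command so it runs anywhere):

```scala
import sys.process._

// `!!` runs the command and captures its stdout as a String;
// `!` runs it and returns the exit code.
val output   = Seq("echo", "hello-from-the-test").!!
val exitCode = Seq("true").!
```

Using `Seq(...)` instead of a single string avoids shell word-splitting surprises when arguments contain spaces; for fire-and-forget container launches the string form on the slide is the shortest path.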