Slide 1

Slide 1 text

How we built InsightEdge.io
Oleksiy Dyagilev

Slide 2

Slide 2 text

About me
• Chief Software Engineer at EPAM
• Leading design and architecture efforts at http://insightedge.io
• Blogging at http://dyagilev.org

Slide 3

Slide 3 text

Modern applications: a blend of workloads
• Transactional: essential to operate the business
• Analytical: turning data into value (insights, diagnosis, decision making)

Slide 4

Slide 4 text

Consider Uber Surge Pricing

Slide 5

Slide 5 text

• Fare rates increase automatically when demand for rides exceeds the number of drivers around you
• This ensures reliability and availability for those who agree to pay more

Slide 6

Slide 6 text

What are the challenges?
Data -> Insight -> Action at scale

Slide 7

Slide 7 text

INTRODUCING INSIGHTEDGE
FAST DATA TO ACTION

Slide 8

Slide 8 text

InsightEdge synergy
• XAP: fast scale-out in-memory data grid
• Spark: large-scale data processing framework

Slide 9

Slide 9 text

XAP In-Memory Data Grid
• Scale-out in-memory storage: scale by sharding or replication
• Low latency and high throughput: in-memory storage, fast indexing, co-located ops, batch ops
• Rich API and query language: SQL-like queries against POJOs and schema-free documents; geospatial API
• ACID transactions: full ACID compliance against your data set through full transaction semantics
• High availability and resiliency: fault tolerance through replication, cross-data center replication, auto-healing
• Data tiering: RAM and SSD

Slide 10

Slide 10 text

[Diagram: InsightEdge stack]
• Spark: Spark SQL, Spark Streaming, Machine Learning
• InsightEdge Core
• In-Memory Data Grid: In-Memory Store (RAM); Flash, SSD, Off-Heap Store
• High availability, Security & Management

Slide 11

Slide 11 text

InsightEdge Ecosystem
[Diagram: InsightEdge Core; Flash, SSD, Off-Heap]

Slide 12

Slide 12 text

BUILDING INSIGHTEDGE

Slide 13

Slide 13 text

InsightEdge architecture: collocating Spark and XAP data grid
• node 1: Spark master, Grid master
• node 2: Spark worker, Grid worker
• node 3: Spark worker, Grid worker

Slide 14

Slide 14 text

InsightEdge RDD: resilient distributed dataset
• List of parent RDDs
  – Empty
• An array of partitions that the dataset is divided into
  – XAP Distributed Query to get partitions and their hosts
• A compute function to do a computation on partitions
  – Iterator over a portion of the data
• Optional preferred locations, i.e. hosts for a partition where the data will be loaded
  – Hosts from the Distributed Query
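The four elements of the RDD contract listed on this slide can be modelled in plain Scala. This is only a sketch under stated assumptions: it does not depend on Spark or XAP, and every name in it (`GridRddSketch`, `GridRddModel`, `GridPartition`) is hypothetical rather than taken from the InsightEdge code base. The distributed query is simulated by two in-memory maps.

```scala
// Plain-Scala model of the RDD contract; no Spark or XAP dependency.
// All names here are hypothetical and exist only for illustration.
object GridRddSketch {
  // One partition of the dataset plus the host that stores it.
  final case class GridPartition(index: Int, host: String)

  // `data` and `hosts` stand in for the result of a distributed query.
  final class GridRddModel[T](data: Map[Int, Seq[T]], hosts: Map[Int, String]) {
    // A source dataset reads from storage, so it has no parents.
    val parents: Seq[GridRddModel[_]] = Seq.empty

    // Partitions and their hosts come from the (simulated) distributed query.
    def partitions: Seq[GridPartition] =
      hosts.toSeq.sortBy(_._1).map { case (i, h) => GridPartition(i, h) }

    // Compute is an iterator over this partition's portion of the data.
    def compute(p: GridPartition): Iterator[T] =
      data.getOrElse(p.index, Seq.empty).iterator

    // Preferred location: the host that already holds the data.
    def preferredLocations(p: GridPartition): Seq[String] = Seq(p.host)
  }
}
```

A scheduler that honours `preferredLocations` would run each `compute` call on the node that already stores the partition, which is what collocating Spark and the grid is meant to enable.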

Slide 15

Slide 15 text

InsightEdge RDD: one-to-one partition
• node 1: Spark executor, Spark Partition #1 -> XAP Primary #1 (direct connection)
• node 2: Spark executor, Spark Partition #2 -> XAP Primary #2
• node 3: Spark executor, Spark Partition #3 -> XAP Primary #3
Simple, but not enough parallelism for Spark

Slide 16

Slide 16 text

InsightEdge RDD: with bucketing
[Diagram: node 1, a Spark executor with Spark Partition #1 and Spark Partition #2 reading from XAP Primary #1, whose data is split into buckets 0 .. 1023]
• 1 Spark partition = M grid buckets
• 1 XAP partition = N Spark partitions
• Range query by index
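One way to read the bucketing scheme on this slide is that the 1024 bucket ids of a grid partition are cut into N contiguous ranges, one per Spark partition, and each range becomes a range query on an indexed bucket field. The sketch below illustrates that reading only. The names `BucketingSketch`, `BucketRange`, `splitBuckets`, and the `bucketId` field in the query string are assumptions, not the InsightEdge implementation.

```scala
// Hypothetical sketch of splitting grid buckets among Spark partitions.
object BucketingSketch {
  val BucketCount = 1024 // bucket ids 0 .. 1023, as on the slide

  // Inclusive range of bucket ids read by one Spark partition.
  final case class BucketRange(from: Int, to: Int)

  // Split the bucket ids of one grid partition into n contiguous ranges.
  def splitBuckets(n: Int, buckets: Int = BucketCount): Seq[BucketRange] = {
    require(n > 0 && n <= buckets, "need 1 <= n <= buckets")
    (0 until n).map { i =>
      val from = i * buckets / n
      val to = (i + 1) * buckets / n - 1
      BucketRange(from, to)
    }
  }

  // Range predicate a Spark partition could send to the grid.
  // The field name `bucketId` is an assumption made for this sketch.
  def rangeQuery(r: BucketRange): String =
    s"bucketId >= ${r.from} AND bucketId <= ${r.to}"
}
```

With `splitBuckets(4)` each Spark partition covers M = 256 buckets, and the ranges are contiguous and non-overlapping, so together they read the grid partition exactly once.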

Slide 17

Slide 17 text

InsightEdge DataFrames: predicates pushdown & columns pruning
SELECT SUM(amount) FROM order WHERE city = 'NY' AND year > 2012
• Aggregation in Spark
• Filtering and columns pruning in Data Grid
Implementing DataSource API
• Pushing down predicates to Data Grid
• Leveraging indexes
• Transparent to user
• Enabling support for other languages - Python/R
[Diagram: Spark SQL architecture]
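The split of work for the query on this slide can be illustrated without Spark: the "grid side" filters rows and prunes columns, and the "Spark side" only aggregates what comes back. This is a minimal sketch, assuming an in-memory `Seq` as the grid. `PushdownSketch`, `Order`, `scan`, and the two filter classes are hypothetical names, loosely shaped like the required-columns-plus-filters arguments that Spark's `PrunedFilteredScan` interface passes to a data source.

```scala
// Hypothetical, Spark-free model of predicate pushdown and column pruning.
object PushdownSketch {
  final case class Order(id: Int, city: String, year: Int, amount: Double)

  // Minimal filter model covering the two predicates in the slide's query.
  sealed trait Filter
  final case class EqualTo(column: String, value: Any) extends Filter
  final case class GreaterThan(column: String, value: Int) extends Filter

  private def columns(o: Order): Map[String, Any] =
    Map("id" -> o.id, "city" -> o.city, "year" -> o.year, "amount" -> o.amount)

  private def matches(row: Map[String, Any], f: Filter): Boolean = f match {
    case EqualTo(c, v) => row.get(c).contains(v)
    case GreaterThan(c, v) =>
      row.get(c) match {
        case Some(i: Int) => i > v
        case _            => false
      }
  }

  // "Grid side": apply the filters first, then keep only the requested columns.
  def scan(data: Seq[Order], requiredColumns: Seq[String], filters: Seq[Filter]): Seq[Map[String, Any]] =
    data
      .map(columns)
      .filter(row => filters.forall(f => matches(row, f)))
      .map(row => row.filter { case (k, _) => requiredColumns.contains(k) })

  // "Spark side": aggregate the rows the grid returned.
  def sumAmount(rows: Seq[Map[String, Any]]): Double =
    rows.flatMap(_.get("amount")).collect { case d: Double => d }.sum
}
```

Calling `scan(orders, Seq("amount"), Seq(EqualTo("city", "NY"), GreaterThan("year", 2012)))` returns only the `amount` column of the matching rows, so less data crosses from the grid to the aggregation step.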

Slide 18

Slide 18 text

Extending Spark with GeoSpatial API
• Geo Indexes
• Shapes
• Queries: intersects, within, contains

    val searchRadius = 3 // km
    val userLocation = point(-77.024470, 39.032506)
    val searchArea = circle(userLocation, kmToDeg(searchRadius))
    val schools = sqlContext.read.grid.loadClass[School]
    val nearestSchools = schools.filter(schools("location") geoWithin searchArea)

https://github.com/InsightEdge/insightedge-geo-demo/

The tricky part: shapes are packed in a 3rd party Java library
- To register a UDT you have to add a Spark annotation to the shape classes (but you cannot)
- DataFrames don't support a mix of Scala and Java types

Slide 19

Slide 19 text

Developing InsightEdge in Scala: the unpleasant part
Interoperability with Java:
• Native XAP model is declared as Java POJOs with annotations on getters
• How do we declare them in Scala?

    import scala.annotation.meta.beanGetter
    import scala.beans.BeanProperty

    object annotation {
      import com.gigaspaces.annotation.pojo
      // Aliases that place the XAP annotations on the generated bean getters
      type SpaceId = pojo.SpaceId @beanGetter
      type SpaceRouting = pojo.SpaceRouting @beanGetter
      …
    }

    case class Data(
      @BeanProperty @SpaceId(autoGenerate = true) var id: String,
      @BeanProperty @SpaceRouting var routing: Long
    ) {
      // No-args constructor; arguments follow the (id, routing) parameter order
      def this() = this(null, -1)
    }

Unpleasant things (should be addressed in the next releases):
• Mutable class
• No-args constructor
• Null instead of Option[T]

Slide 20

Slide 20 text

Developing InsightEdge in Scala: the good parts
• Easy to extend Spark API with implicit conversions

    import org.insightedge.spark.implicits.all._
    val stream = …
    stream.saveToGrid()

• Code in functional style is concise and readable
• Developers are really productive with Scala
• Negative experience with SBT so far; we use Maven
• ClassTags make the API clean (solves JVM's type erasure problem)

    def gridRdd[R: ClassTag](): InsightEdgeRdd = {…}
    val rdd = sc.gridRdd[Product]()

• Mixin class compositions, e.g. for testing

    class InsightEdgeRDDSpec extends FlatSpec with IEConfig with InsightEdge with Spark
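The two language features named on this slide, implicit conversions and ClassTags, can be shown together in a self-contained sketch. It assumes a stand-in `FakeContext` in place of Spark's `SparkContext` and returns a plain `Seq` instead of an RDD. `ImplicitsSketch`, `FakeContext`, `GridContextOps`, and `Customer` are hypothetical names; only the `gridRdd[R: ClassTag]()` shape follows the slide.

```scala
import scala.reflect.ClassTag

// Hypothetical sketch: extending an existing class with an implicit class,
// and using a ClassTag to select elements by their runtime type.
object ImplicitsSketch {
  // Stand-in for SparkContext; `store` plays the role of the data grid.
  final class FakeContext(val store: Seq[Any])

  // The implicit class adds gridRdd to FakeContext without modifying it.
  implicit class GridContextOps(sc: FakeContext) {
    // The ClassTag context bound carries R's runtime class past type erasure,
    // which is what makes the `case r: R` type test below possible.
    def gridRdd[R: ClassTag](): Seq[R] =
      sc.store.collect { case r: R => r }
  }

  final case class Product(name: String)
  final case class Customer(name: String)
}
```

After `import ImplicitsSketch._`, a call such as `sc.gridRdd[Product]()` reads as if `gridRdd` were defined on the context itself, and the caller never passes a `Class` object explicitly.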

Slide 21

Slide 21 text

How we do testing
• Unit and integration tests with ScalaTest
• Unit tests start Spark and XAP in embedded mode and cover all of our API
• Integration tests use Docker for virtualization
  – cover bash scripts
  – Zeppelin notebooks
  – clustering and networking
  – long-running tests are tagged and launched only on 'master' and 'release' branches
  – we use Spotify's plugin to build images with Maven
  – got rid of xebialabs/overcast; using spotify/docker-client OR just running containers with

    import sys.process._
    "docker run image" !

Slide 22

Slide 22 text

THANK YOU
JOIN COMMUNITY
• Github: http://github.com/InsightEdge
• Slack chat: http://insightedge-slack.herokuapp.com
• Website: http://insightedge.io