
How we built InsightEdge.io

Oleksii Diagiliev

August 16, 2016


  1. About me
     • Chief Software Engineer at EPAM
     • Leading design and architecture efforts at http://insightedge.io
     • Blogging at http://dyagilev.org
  2. Modern applications: a blend of workloads
     • Transactional: essential to operate the business
     • Analytical: turning data into value through insights, diagnosis and decision making
  3. Example: fare rates automatically increase when taxi demand is higher
     than the number of drivers around you. Ensure reliability and availability
     for those who agree to pay more.
  4. XAP In-Memory Data Grid
     • Scale-out in-memory storage: scale by sharding or replication
     • Low latency and high throughput: in-memory storage, fast indexing, co-located operations, batch operations
     • Rich API and query language: SQL-like queries against POJOs and schema-free documents, Geospatial API
     • ACID transactions: full ACID compliance through full transaction semantics
     • High availability and resiliency: fault tolerance through replication, cross-data-center replication, auto-healing
     • Data tiering: RAM and SSD
  5. InsightEdge stack (diagram): Spark, Spark SQL, Spark Streaming and
     Machine Learning on top of the InsightEdge Core, backed by the In-Memory
     Data Grid (in-memory RAM store plus Flash/SSD/off-heap store), with high
     availability, security and management.
  6. InsightEdge architecture: collocating Spark and the XAP data grid
     • node 1: Spark master + Grid master
     • node 2: Spark worker + Grid worker
     • node 3: Spark worker + Grid worker
  7. InsightEdge RDD: resilient distributed dataset
     • List of parent RDDs: empty (it is a source RDD)
     • An array of partitions the dataset is divided into: obtained via an XAP
       Distributed Query that returns the partitions and their hosts
     • A compute function that runs the computation on a partition: an iterator
       over that partition's portion of the data
     • Optional preferred locations, i.e. hosts for a partition where the data
       will be loaded: the hosts returned by the Distributed Query
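The four ingredients above can be sketched in a few lines. This is a minimal, self-contained stand-in (class and method names are illustrative, not the real Spark RDD or InsightEdge API), showing how partitions and preferred locations derive from a list of grid hosts:

```scala
// Hypothetical sketch of the four pieces a source RDD must provide.
case class GridPartition(index: Int, host: String)

class GridRddSketch(partitionHosts: Seq[String]) {
  // 1. List of parent RDDs: empty, this is a source RDD
  val parents: Seq[GridRddSketch] = Seq.empty

  // 2. Partitions: one per grid partition, with the host that owns it
  //    (in the real system this comes from an XAP Distributed Query)
  def getPartitions: Array[GridPartition] =
    partitionHosts.zipWithIndex.map { case (h, i) => GridPartition(i, h) }.toArray

  // 3. Compute function: an iterator over that partition's slice of the data
  def compute(p: GridPartition): Iterator[String] =
    Iterator(s"data-from-${p.host}")

  // 4. Preferred locations: schedule the task on the host holding the data
  def getPreferredLocations(p: GridPartition): Seq[String] = Seq(p.host)
}
```

The key design point is (4): by reporting the grid host as the preferred location, Spark schedules the task on the same node as the data, which is what makes the collocated deployment pay off.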
  8. InsightEdge RDD: one-to-one partitioning
     • node 1: Spark executor, Spark Partition #1 → XAP Primary #1 (direct connection)
     • node 2: Spark executor, Spark Partition #2 → XAP Primary #2
     • node 3: Spark executor, Spark Partition #3 → XAP Primary #3
     Simple, but not enough parallelism for Spark.
  9. InsightEdge RDD: with bucketing
     • Each XAP partition's data is spread across routing buckets 0..1023
     • 1 Spark partition = M grid buckets; 1 XAP partition = N Spark partitions
     • Each Spark partition reads its buckets from the XAP primary with a
       range query by index
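The bucket-to-partition mapping boils down to splitting a bucket range into N contiguous sub-ranges. A small sketch of that arithmetic (the helper name and the 1024-bucket count are assumptions based on the slide, not the real InsightEdge code):

```scala
// Hypothetical helper: split one XAP partition's routing buckets into
// N contiguous ranges, one per Spark partition. Each Spark partition then
// issues a range query by index over its bucket range.
def bucketRanges(totalBuckets: Int, sparkPartitions: Int): Seq[Range] = {
  val base = totalBuckets / sparkPartitions // minimum buckets per partition
  val rem  = totalBuckets % sparkPartitions // leftover buckets spread over the first `rem` partitions
  var start = 0
  (0 until sparkPartitions).map { i =>
    val size  = base + (if (i < rem) 1 else 0)
    val range = start until (start + size)
    start += size
    range
  }
}
```

For example, splitting 1024 buckets over 4 Spark partitions yields the ranges 0..255, 256..511, 512..767 and 768..1023, so each Spark partition scans an equal, disjoint slice of the grid partition.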
  10. InsightEdge DataFrames: predicate pushdown and column pruning

      SELECT SUM(amount) FROM order WHERE city = 'NY' AND year > 2012

      The aggregation runs in Spark; filtering and column pruning run in the
      Data Grid. Spark SQL integration, by implementing the DataSource API:
      • Pushing down predicates to the Data Grid
      • Leveraging indexes
      • Transparent to the user
      • Enabling support for other languages (Python/R)
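The pushdown step amounts to translating the filters and required columns Spark hands the data source into a grid-side query. A simplified, self-contained sketch of that translation (the filter ADT and helper are hypothetical stand-ins, not Spark's actual `sources.Filter` hierarchy or the InsightEdge implementation):

```scala
// Simplified model of the filters a DataSource implementation receives.
sealed trait Filter
case class EqualTo(attr: String, value: Any) extends Filter
case class GreaterThan(attr: String, value: Any) extends Filter

// Hypothetical translation of pruned columns + pushed-down filters
// into the SQL-like query executed inside the data grid.
def buildGridQuery(table: String, columns: Seq[String], filters: Seq[Filter]): String = {
  val cols = if (columns.isEmpty) "*" else columns.mkString(", ")
  val where = filters.map {
    case EqualTo(a, v: String) => s"$a = '$v'" // quote string literals
    case EqualTo(a, v)         => s"$a = $v"
    case GreaterThan(a, v)     => s"$a > $v"
  }.mkString(" AND ")
  if (where.isEmpty) s"SELECT $cols FROM $table"
  else s"SELECT $cols FROM $table WHERE $where"
}
```

For the slide's example query, only the `amount` column and the rows matching both predicates ever leave the grid; Spark then sums the already-filtered values.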
  11. Extending Spark with a Geospatial API
      The tricky part: the shapes are packed in a 3rd-party Java library
      • To register a UDT you have to add a Spark annotation to the shape
        classes (but you cannot, since the classes are not yours)
      • DataFrames don't support a mix of Scala and Java types
      Shapes, geo indexes, and queries (intersects, within, contains):

        val searchRadius = 3 // km
        val userLocation = point(-77.024470, 39.032506)
        val searchArea = circle(userLocation, kmToDeg(searchRadius))
        val schools = sqlContext.read.grid.loadClass[School]
        val nearestSchools = schools.filter(schools("location") geoWithin searchArea)

      Demo: https://github.com/InsightEdge/insightedge-geo-demo/
  12. Developing InsightEdge in Scala: the unpleasant part
      Interoperability with Java: the native XAP model is declared as Java
      POJOs with annotations on getters. How do we declare them in Scala?
      Unpleasant things (should be addressed in the next releases):
      • Mutable class
      • No-args constructor
      • Null instead of Option[T]

        object annotation {
          import com.gigaspaces.annotation.pojo
          type SpaceId = pojo.SpaceId @beanGetter
          type SpaceRouting = pojo.SpaceRouting @beanGetter
          …
        }

        case class Data(
          @BeanProperty @SpaceId(autoGenerate = true) var id: String,
          @BeanProperty @SpaceRouting var routing: Long
        ) {
          def this() = this(null, -1)
        }
  13. Developing InsightEdge in Scala: the good parts
      • Easy to extend the Spark API with implicit conversions:

          import org.insightedge.spark.implicits.all._
          val stream = …
          stream.saveToGrid()

          def gridRdd[R: ClassTag](): InsightEdgeRdd = { … }
          val rdd = sc.gridRdd[Product]()

      • Code in functional style is concise and readable
      • Developers are really productive with Scala
      • Negative experience with SBT so far; we use Maven
      • ClassTags make the API clean (solving the JVM's type erasure problem)
      • Mixin class compositions, e.g. for testing:

          class InsightEdgeRDDSpec extends FlatSpec
            with IEConfig with InsightEdge with Spark
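The "extend an API you don't own" trick above is the standard Scala implicit-class pattern. A minimal, self-contained sketch of it (the classes here are illustrative stand-ins, not the real InsightEdge or Spark types):

```scala
// Stand-in for a third-party class we cannot modify (think: Spark's RDD).
class FakeRdd[T](val data: Seq[T])

object GridImplicits {
  // Importing GridImplicits._ bolts saveToGrid() onto FakeRdd without
  // touching its source; the compiler inserts the wrapper automatically.
  implicit class GridOps[T](rdd: FakeRdd[T]) {
    def saveToGrid(): Int = rdd.data.size // pretend: number of items written
  }
}
```

The extension method is only visible where the implicit is imported, which is why a single `import org.insightedge.spark.implicits.all._` can light up the whole grid API on existing Spark types.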
  14. How we do testing
      • Unit and integration tests with ScalaTest
      • Unit tests start Spark and XAP in embedded mode; they cover all of our API
      • Integration tests use Docker for virtualization; they cover
        - bash scripts
        - Zeppelin notebooks
        - clustering and networking
      • We tag long-running tests and launch them only on the 'master' and
        'release' branches
      • We use Spotify's plugin to build images with Maven
      • We got rid of xebialabs/overcast, using spotify/docker-client, or just
        run containers with:

          import sys.process._
          "docker run image" !
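The `"docker run image" !` one-liner works because scala.sys.process turns strings and sequences into shell commands. A small example of the same mechanism without Docker (using `echo` as a stand-in command so it runs anywhere):

```scala
import sys.process._

// `!!` runs the command and captures its stdout as a String;
// `!` runs it and returns the exit code.
val output   = Seq("echo", "hello-from-the-test").!!
val exitCode = Seq("true").!
```

Using `Seq(...)` instead of a single string avoids shell word-splitting surprises when arguments contain spaces; for fire-and-forget container launches the string form on the slide is the shortest path.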