How we built InsightEdge.io

Oleksii Diagiliev

August 16, 2016

Transcript

  1. 1
    How we built
    InsightEdge.io
    Oleksiy Dyagilev

  2. 2
    About me
    • Chief Software Engineer at EPAM
    • Leading design and architecture efforts at http://insightedge.io
    • Blogging at http://dyagilev.org

  3. 3
    Modern applications: a blend of workloads
    • Transactional: essential to operate the business
    • Analytical: turning data into value (insights, diagnosis, decision making)

  4. 4
    Consider Uber
    Surge Pricing

  5. 5
    Fare rates automatically increase when demand for rides is higher
    than the number of drivers around you.
    This ensures reliability and availability for those who agree to pay more.

  6. 6
    What are the challenges?
    Data -> Insight -> Action
    at scale

  7. 7
    INTRODUCING
    INSIGHTEDGE
    FAST DATA TO ACTION

  8. 8
    InsightEdge synergy
    • XAP: fast scale-out in-memory data grid
    • Spark: large-scale data processing framework

  9. 9
    XAP In-Memory Data Grid
    • Scale-out in-memory storage: scale by sharding or replication
    • Low latency and high throughput: in-memory storage, fast indexing, co-located ops, batch ops
    • Rich API and query language: SQL-like against POJOs and schema-free documents; Geospatial API
    • ACID transactions: maintain full ACID compliance against your data set through full transaction semantics
    • High availability and resiliency: fault tolerance through replication, cross-data center replication, auto-healing
    • Data tiering: RAM and SSD

  10. 10
    InsightEdge Core
    • Spark, Spark SQL, Spark Streaming, Machine Learning
    • In-Memory Data Grid: In-Memory Store (RAM); Flash, SSD, Off-Heap Store
    • High availability
    • Security & Management

  11. 11
    InsightEdge Core
    InsightEdge Ecosystem
    Flash, SSD, Off-Heap

  12. 12
    BUILDING
    INSIGHTEDGE

  13. 13
    InsightEdge architecture: collocating Spark and XAP data grid
    • node 1: Spark master, Grid master
    • node 2: Spark worker, Grid worker
    • node 3: Spark worker, Grid worker

  14. 14
    InsightEdge RDD: resilient distributed dataset
    • List of parent RDDs: empty
    • An array of partitions that the dataset is divided into: XAP Distributed Query to get partitions and their hosts
    • A compute function to do a computation on partitions: iterator over a portion of the data
    • Optional preferred locations, i.e. hosts for a partition where the data will be loaded: hosts from the Distributed Query
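The four parts of an RDD named on this slide can be modelled without Spark. The following is only a simplified sketch: `GridPartition`, `GridModel` and `GridRddModel` are hypothetical names, the grid is replaced by an in-memory map, and the real InsightEdge RDD extends Spark's own `RDD` class instead.

```scala
// A simplified, Spark-free model of the four parts of an RDD.
// All names are hypothetical; this is not the InsightEdge implementation.

// One partition of the dataset, plus the grid host that holds its data.
final case class GridPartition(index: Int, host: String)

// Stand-in for the data grid: partition index -> (host, records stored there).
final class GridModel(val data: Map[Int, (String, Seq[String])])

final class GridRddModel(grid: GridModel) {
  // 1. Parent RDDs: empty, because the data comes straight from the grid.
  val dependencies: Seq[GridRddModel] = Seq.empty

  // 2. Partitions: here read from the map; the slide obtains them
  //    (and their hosts) from an XAP distributed query.
  def partitions: Seq[GridPartition] =
    grid.data.toSeq.sortBy(_._1).map { case (i, (host, _)) => GridPartition(i, host) }

  // 3. Compute: an iterator over the portion of data in one partition.
  def compute(p: GridPartition): Iterator[String] =
    grid.data(p.index)._2.iterator

  // 4. Preferred locations: the host where the partition's data already lives.
  def preferredLocations(p: GridPartition): Seq[String] = Seq(p.host)
}
```

A scheduler that honours `preferredLocations` can run each task on the node that already holds the data, which is the point of collocating Spark workers with grid workers.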

  15. 15
    InsightEdge RDD: one-to-one partition
    • node 1: Spark executor; Spark Partition #1, direct connection to XAP Primary #1
    • node 2: Spark executor; Spark Partition #2, XAP Primary #2
    • node 3: Spark executor; Spark Partition #3, XAP Primary #3
    Simple, but not enough parallelism for Spark

  16. 16
    InsightEdge RDD: with bucketing
    node 1: Spark Executor, XAP Primary #1
    • Grid buckets 0 .. 1023, grouped into Spark Partition #1, Spark Partition #2, ...
    • Range query by index
    • 1 Spark partition = M grid buckets
    • 1 XAP partition = N Spark partitions
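The bucketing idea can be sketched as plain arithmetic. This is a minimal sketch under assumptions: 1024 buckets (the slide shows ids 0 to 1023), contiguous near-equal ranges, and an indexed column called `bucketId`, a hypothetical name. It is not the InsightEdge code.

```scala
// Hypothetical sketch: split one grid partition's buckets across N Spark partitions.
// Assumes every stored object carries an indexed bucket id in [0, 1024).

// Half-open range of bucket ids: [from, until).
final case class BucketRange(from: Int, until: Int)

object Bucketing {
  val BucketCount = 1024

  // Divide the bucket space into n contiguous ranges of near-equal size.
  def split(n: Int): Seq[BucketRange] = {
    require(n > 0 && n <= BucketCount, "n must be in 1..1024")
    (0 until n).map { i =>
      BucketRange(i * BucketCount / n, (i + 1) * BucketCount / n)
    }
  }

  // Each Spark partition reads its buckets with a range query on the indexed id.
  def rangeQuery(r: BucketRange, column: String = "bucketId"): String =
    s"$column >= ${r.from} AND $column < ${r.until}"
}
```

With `Bucketing.split(4)` one XAP partition feeds four Spark partitions of 256 buckets each, so parallelism no longer depends on the number of grid primaries.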

  17. 17
    InsightEdge DataFrames: predicates pushdown & columns pruning
    SELECT SUM(amount)
    FROM order
    WHERE city = 'NY' AND year > 2012
    • Filtering and columns pruning in Data Grid
    • Aggregation in Spark
    Spark SQL architecture: implementing DataSource API
    • Pushing down predicates to Data Grid
    • Leveraging indexes
    • Transparent to user
    • Enabling support for other languages - Python/R
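To illustrate how the work is split for the query on this slide, here is a Spark-free sketch. The `Filter` types and the `Pushdown` object are hypothetical and only modelled on the kind of information Spark's DataSource API hands to a data source (required columns and filters); the real integration is not shown in the slides.

```scala
// Hypothetical, Spark-free model of predicate pushdown and column pruning.

sealed trait Filter
final case class EqualTo(column: String, value: Any) extends Filter
final case class GreaterThan(column: String, value: Any) extends Filter

object Pushdown {
  // Render a value as a SQL literal; strings are quoted and escaped.
  private def literal(v: Any): String = v match {
    case s: String => "'" + s.replace("'", "''") + "'"
    case other     => other.toString
  }

  // Translate the filters into a WHERE clause the grid can evaluate using its indexes.
  def whereClause(filters: Seq[Filter]): String =
    filters.map {
      case EqualTo(c, v)     => s"$c = ${literal(v)}"
      case GreaterThan(c, v) => s"$c > ${literal(v)}"
    }.mkString(" AND ")

  // Ask the grid only for the columns the query actually needs.
  def gridQuery(table: String, requiredColumns: Seq[String], filters: Seq[Filter]): String = {
    val where = if (filters.isEmpty) "" else " WHERE " + whereClause(filters)
    s"SELECT ${requiredColumns.mkString(", ")} FROM $table$where"
  }
}
```

For the slide's query the grid would receive only the `amount` column with both predicates applied, and Spark would then compute `SUM(amount)` over the rows that come back.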

  18. 18
    Extending Spark with GeoSpatial API
    The tricky part:
    Shapes are packaged in a third-party Java library
    - To register a UDT you have to add a Spark annotation to the shape classes (but you cannot)
    - DataFrames don’t support a mix of Scala and Java types
    val searchRadius = 3 // km
    val userLocation = point(-77.024470, 39.032506)
    val searchArea = circle(userLocation, kmToDeg(searchRadius))
    val schools = sqlContext.read.grid.loadClass[School]
    val nearestSchools = schools.filter(schools("location") geoWithin searchArea)
    Shapes:
    Queries: intersects, within, contains
    https://github.com/InsightEdge/insightedge-geo-demo/
    Geo Indexes
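The slide's snippet calls `kmToDeg` without showing it. The sketch below is an assumption about what such a helper could do, not the library's implementation: it converts a distance to degrees of arc using a mean Earth radius of 6371 km, and tests containment with a flat, planar distance in degrees, which is only a rough approximation away from the equator. `GeoSketch`, `Point` and `Circle` are hypothetical names.

```scala
// Hypothetical sketch of the helpers used in the geospatial query.
object GeoSketch {
  // Assumed mean Earth radius in kilometres.
  val EarthRadiusKm = 6371.0

  // Distance along the surface expressed as degrees of arc (about 111.2 km per degree).
  def kmToDeg(km: Double): Double = math.toDegrees(km / EarthRadiusKm)

  // x is longitude, y is latitude, both in degrees.
  final case class Point(x: Double, y: Double)

  final case class Circle(center: Point, radiusDeg: Double) {
    // Planar approximation: treats degrees of longitude and latitude as equal lengths.
    def contains(p: Point): Boolean =
      math.hypot(p.x - center.x, p.y - center.y) <= radiusDeg
  }
}
```

A production geo index would use proper spherical geometry; the sketch only shows why the radius has to be converted before it can be compared with coordinates given in degrees.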

  19. 19
    Developing InsightEdge in Scala: the unpleasant part
    Interoperability with Java:
    • Native XAP model is declared as Java POJOs with annotations on getters
    • How do we declare them in Scala?
    Unpleasant things (should be addressed in the next releases):
    • Mutable class
    • No-args constructor
    • Null instead of Option[T]
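    import scala.annotation.meta.beanGetter
    import scala.beans.BeanProperty

    object annotation {
      import com.gigaspaces.annotation.pojo
      // Redirect the XAP annotations to the generated Java-style getters
      type SpaceId = pojo.SpaceId @beanGetter
      type SpaceRouting = pojo.SpaceRouting @beanGetter

    }
    import annotation._

    case class Data(
      @BeanProperty
      @SpaceId(autoGenerate = true)
      var id: String,
      @BeanProperty
      @SpaceRouting
      var routing: Long
    ) {
      // No-args constructor; arguments follow the (id: String, routing: Long) order
      def this() = this(null, -1)
    }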
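One way to soften the "null instead of Option[T]" point without changing the bean-style class is a small null-safe view added from the outside. This is only a sketch: `Customer`, `CustomerSyntax` and `emailOpt` are hypothetical names, and no XAP annotations are used so that the example compiles with the standard library alone.

```scala
import scala.beans.BeanProperty

// Hypothetical grid model: mutable, no-args constructor, nullable fields.
class Customer(@BeanProperty var id: String, @BeanProperty var email: String) {
  def this() = this(null, null)
}

object CustomerSyntax {
  // Null-safe accessors added via an implicit class, leaving the bean untouched.
  implicit class CustomerOps(val c: Customer) extends AnyVal {
    // Option(null) is None, so callers never see the raw null.
    def emailOpt: Option[String] = Option(c.getEmail)
  }
}
```

Application code can then `import CustomerSyntax._` and work with `customer.emailOpt`, while the grid still sees a plain POJO with getters and setters.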

  20. 20
    Developing InsightEdge in Scala: the good parts
    • Easy to extend Spark API with implicit conversions
        import org.insightedge.spark.implicits.all._
        val stream = …
        stream.saveToGrid()
    • ClassTags make the API clean (solves JVM’s type erasure problem)
        def gridRdd[R: ClassTag](): InsightEdgeRdd = {…}
        val rdd = sc.gridRdd[Product]()
    • Mixin class compositions, e.g. for testing
        class InsightEdgeRDDSpec extends FlatSpec with IEConfig with InsightEdge with Spark
    • Code in functional style is concise and readable
    • Developers are really productive with Scala
    • Negative experience with SBT so far, we use Maven
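The two techniques named on this slide, implicit conversions and ClassTags, can be shown together in a Spark-free sketch. `FakeContext`, `GridImplicits`, `gridSeq`, `Item` and `Order` are hypothetical stand-ins, not the InsightEdge API: the context simply holds a sequence of objects instead of talking to a grid.

```scala
import scala.reflect.ClassTag

// Hypothetical stand-in for SparkContext plus the grid.
final class FakeContext(val gridObjects: Seq[Any])

object GridImplicits {
  // Adds gridSeq to FakeContext without modifying the class itself.
  implicit class GridContextOps(val ctx: FakeContext) extends AnyVal {
    // The ClassTag carries R's runtime class past type erasure, so callers
    // write ctx.gridSeq[Item]() instead of passing classOf[Item] explicitly.
    def gridSeq[R: ClassTag](): Seq[R] =
      ctx.gridObjects.collect { case r: R => r }
  }
}

final case class Item(id: Int, name: String)
final case class Order(id: Int)
```

After `import GridImplicits._`, the call `ctx.gridSeq[Item]()` reads as if it were part of the context class. Without the `ClassTag` context bound, the `case r: R` pattern could not be checked at runtime.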

  21. 21
    How we do testing
    • Unit and integration tests with ScalaTest
    • Unit tests start Spark and XAP in embedded mode and cover all our API
    • Integration tests use Docker for virtualization
    – cover bash scripts
    – Zeppelin notebooks
    – clustering and networking
    – tag long-running tests and launch them only on ‘master’ and ‘release’ branches
    – we use Spotify’s plugin to build images with Maven
    – got rid of xebialabs/overcast, using spotify/docker-client OR just running containers with
        import sys.process._
        "docker run image" !
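Running a container through `scala.sys.process` can be wrapped in a small helper. This is a sketch, not the project's test code: `DockerRunner` is a hypothetical name, the `--rm` flag is an added assumption so that finished containers are cleaned up, and calling `run` needs a local Docker installation.

```scala
import scala.sys.process._

// Hypothetical helper around the "docker run" one-liner.
object DockerRunner {
  // Build the command as a Seq so arguments with spaces are passed safely.
  def runCommand(image: String, args: Seq[String] = Nil): Seq[String] =
    Seq("docker", "run", "--rm", image) ++ args

  // Runs the container, blocks until it exits and returns the exit code.
  def run(image: String, args: Seq[String] = Nil): Int =
    runCommand(image, args).!
}
```

Passing the command as a `Seq[String]` avoids the whitespace splitting that happens when a single string is turned into a process.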

  22. 22
    THANK YOU
    JOIN COMMUNITY
    GitHub: http://github.com/InsightEdge
    Slack chat: http://insightedge-slack.herokuapp.com
    Website: http://insightedge.io
