  ◦ Support for Clojure and Java
• Apache Spark
  ◦ Large Scale Data Processing and Interactive Analysis
  ◦ Streaming Support
  ◦ Built in Scala, good support for Scala/Python
  ◦ Clojure DSLs: Flambo, Sparkling
• Apache Flink
  ◦ Batch and Stream Data Processing
  ◦ Streaming Dataflow Engine
  ◦ No production support for Clojure (as of now)
    ▪ Word Count samples do exist*
* https://github.com/mjsax/flink-external/tree/master/flink-clojure
for Interactive and Streaming
  ◦ Batch Jobs are Scheduled
  ◦ Any change requires the Spark app to be resubmitted
• Long Running Apps
  ◦ Deploy Once for Multiple Jobs
  ◦ Fixed resource allocation (not using dynamic allocation yet)
• Clojure is enforced
  ◦ Glue Scala/Java Plug-ins
of our codebase is in Clojure
• Architecture
  ◦ Masterless
  ◦ Fault Tolerant
  ◦ Cloud Scale
  ◦ Distributed Computation
• Implementation
  ◦ Programs as immutable data structures
    ▪ Very close to our way of defining dataflow models
  ◦ Decouples behavioral set from specific execution
• Unified API for both batch and stream processing
  ◦ As good as it sounds
can start standalone ZooKeeper via Curator
• Peer
  ◦ Only entity in Onyx
  ◦ All Peers are considered equal
  ◦ There is no master Peer (masterless design)
    ▪ No single coordinating process
    ▪ No entity to orchestrate the cluster
  ◦ Works on at most one job at a time
  ◦ Virtual Peer
    ▪ Single Peer process running on a single physical machine
    ▪ Works on at most one task at a time
• Job Scheduling Strategies
  ◦ Greedy
  ◦ Balanced (round robin)
  ◦ Percentage
• Task Scheduling
  ◦ Balanced
  ◦ Percentage
  ◦ Colocation
    ▪ Assigns tasks to peers on a single machine: low latency, minimal network traffic
• Tags can be used to assign behavior to peers
  ◦ Tag a peer as a database peer so that all database tasks go to that peer
  ◦ Assign CPU-intensive tasks to a set of peers running on machines with more CPU
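The scheduling and tagging options above are expressed as plain data. The following is a minimal sketch, assuming the key names used in Onyx 0.9-era documentation (check them against the version in use); the `:database` tag and the `:write-to-db` task name are hypothetical.

```clojure
;; Peer configuration fragment: the job scheduler is chosen cluster-wide.
(def peer-config-fragment
  {:onyx.peer/job-scheduler :onyx.job-scheduler/balanced
   ;; Tags advertise what this peer can do (hypothetical :database tag).
   :onyx.peer/tags [:database]})

;; Job fragment: the task scheduler is chosen per job.
(def job-fragment
  {:task-scheduler :onyx.task-scheduler/colocated})

;; Catalog entry fragment: only peers tagged :database may run this task.
(def db-task-fragment
  {:onyx/name :write-to-db            ; hypothetical task name
   :onyx/required-tags [:database]})
```

A task with required tags is only assigned to peers whose tag list contains all of them, which is how "database peers" or "high CPU peers" can be set apart.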
• Messaging layer is pluggable
• Aeron is the default messaging implementation
  ◦ High throughput and low latency
  ◦ Subscription (Connection Multiplexing)
    ▪ Aeron subscribers perform deserialization
    ▪ May become CPU bound
    ▪ Multiple subscribers per node
  ◦ Connection Short Circuiting
    ▪ Co-located virtual Peers bypass Aeron
    ▪ Direct communication without any network or serialization overhead
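These messaging choices are set in the peer configuration. The fragment below is a sketch only: the subscriber-count and short-circuit key names are recalled from Onyx 0.9-era documentation and should be treated as assumptions to verify, and all values are illustrative.

```clojure
(def messaging-config-fragment
  {:onyx.messaging/impl :aeron
   :onyx.messaging/bind-addr "localhost"
   :onyx.messaging/peer-port 40200
   ;; Assumed key: more subscribers spread deserialization over more threads.
   :onyx.messaging.aeron/subscriber-count 2
   ;; Assumed key: lets co-located virtual peers bypass Aeron entirely.
   :onyx.messaging/allow-short-circuit? true})
```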
to each peer replica
• Log entries are functions (pure, deterministic, idempotent) with args that are used to update the replica
• Each peer sees each event in the cluster in exactly the same order
• Since each peer has its own independent pointer into the log, peers never block one another
• The replica contains the structural information of the cluster known to the Peer
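The idea of a replica built by replaying pure log entries can be sketched in a few lines of plain Clojure. This is an illustration of the concept, not Onyx's implementation; the entry shapes (`:peer-joined`, `:peer-left`) and the `:peers` key are hypothetical.

```clojure
;; Each log entry names a pure function plus its arguments.
(def log
  [{:fn :peer-joined :args {:id :p1}}
   {:fn :peer-joined :args {:id :p2}}
   {:fn :peer-left   :args {:id :p1}}])

(defmulti apply-entry (fn [_replica entry] (:fn entry)))

(defmethod apply-entry :peer-joined [replica {:keys [args]}]
  (update replica :peers (fnil conj #{}) (:id args)))

(defmethod apply-entry :peer-left [replica {:keys [args]}]
  (update replica :peers (fnil disj #{}) (:id args)))

(defn replay
  "Builds a replica by applying entries in log order, starting from `replica`."
  [replica entries]
  (reduce apply-entry replica entries))

;; Two peers read at their own pace, yet converge on the same replica.
(def peer-a (replay {} log))
(def peer-b (replay (replay {} (take 1 log)) (drop 1 log)))
```

Because every entry is deterministic and applied in the same order, `peer-a` and `peer-b` end up equal without ever coordinating with each other.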
to the fixed origin address and then starts reading the events forward
• The gc process moves the fixed origin address marker to the last entry, which becomes the new start
  ◦ Step 1: Creates its own replica by pretending to be a peer and reading every entry up to the last one
  ◦ Step 2: Takes that replica and stores it at the origin address, where peers can now use it
  ◦ Step 3: Instructs the origin to atomically point to the new start
• Peers that are left behind during this process may crash and eventually restart from the new origin address
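The three gc steps can be illustrated with an in-memory log and an atom standing in for the origin. This is a conceptual sketch under stated assumptions: the `[op id]` entry shape, `gc!`, and `join-cluster` are hypothetical names, and the sketch skips old entries rather than deleting them.

```clojure
(defn apply-entry
  "Pure update: an entry either adds or removes a peer id."
  [replica [op id]]
  (case op
    :join  (update replica :peers (fnil conj #{}) id)
    :leave (update replica :peers (fnil disj #{}) id)))

(defn replay [replica entries]
  (reduce apply-entry replica entries))

(def log    [[:join :p1] [:join :p2] [:leave :p1] [:join :p3]])
;; The origin holds a snapshot replica and the log index to resume from.
(def origin (atom {:replica {} :start 0}))

(defn gc!
  "Replays the log like a peer, then atomically publishes snapshot and new start."
  [origin log]
  (let [{:keys [replica start]} @origin
        snapshot (replay replica (subvec log start))]          ; step 1
    (reset! origin {:replica snapshot :start (count log)})))   ; steps 2 and 3

(defn join-cluster
  "A new peer starts from the origin snapshot and reads forward."
  [origin log]
  (let [{:keys [replica start]} @origin]
    (replay replica (subvec log start))))
```

A peer that joins after `gc!` reaches the same replica as one that replayed the whole log, which is why the compacted prefix is no longer needed.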
segments
• Segments
  ◦ Data (maps) that Onyx allows to emit between functions
• Workflow
  ◦ Articulates the paths that flow through the cluster at runtime (DAG)
• Catalog
  ◦ Describes all inputs, outputs, functions in a Workflow
• Flow Conditions
  ◦ Applied on a segment-by-segment basis
  ◦ Dataflow direction is determined by defined predicate functions
Disclaimer: Images used are only for demonstration purposes and are not endorsed by Onyx. The images are not being used for any commercial purpose.
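Workflow, catalog, and flow conditions are all plain Clojure data. The sketch below assumes the core.async plugin for input and output and a hypothetical `example.tasks` namespace holding the task function and predicate; batch sizes and the word-length rule are illustrative.

```clojure
;; Workflow: edges of the DAG, as [from to] pairs of task names.
(def workflow
  [[:in :mixed-case]
   [:mixed-case :out]])

;; Catalog: one entry per task named in the workflow.
(def catalog
  [{:onyx/name :in
    :onyx/plugin :onyx.plugin.core-async/input
    :onyx/type :input
    :onyx/medium :core.async
    :onyx/batch-size 10
    :onyx/max-peers 1}
   {:onyx/name :mixed-case
    :onyx/fn :example.tasks/mixed-case        ; hypothetical function
    :onyx/type :function
    :onyx/batch-size 10}
   {:onyx/name :out
    :onyx/plugin :onyx.plugin.core-async/output
    :onyx/type :output
    :onyx/medium :core.async
    :onyx/batch-size 10
    :onyx/max-peers 1}])

;; Predicate: receives the event map, the old and new segment,
;; and all segments produced from the old one.
(defn short-word? [event old-segment new-segment all-new]
  (< (count (:word new-segment)) 10))

;; Flow condition: only segments passing the predicate go on to :out.
(def flow-conditions
  [{:flow/from :mixed-case
    :flow/to [:out]
    :flow/predicate :example.tasks/short-word?}]) ; hypothetical namespace
```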
task
  ◦ Can inject hooks at critical points during a task
  ◦ Carries context map
• Sentinel
  ◦ Signals to end stream, switch between streaming/batch mode (:done)
• Task
  ◦ Smallest unit of work
  ◦ Associated with only one Job
• Job
  ◦ Collection of Workflow, Catalog, Flow Conditions, Lifecycles and Execution Parameters
  ◦ Every task is associated with exactly one job
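A lifecycle, the sentinel, and a job can be sketched as follows. The `example.tasks` namespace, `log-task-start`, and `make-job` are hypothetical names used only for illustration, and the `:done` sentinel is shown as the slides describe it.

```clojure
;; A lifecycle hook takes the event (context) map and the lifecycle entry;
;; the map it returns is merged into the context.
(defn log-task-start [event lifecycle]
  (println "Starting task" (:lifecycle/task lifecycle))
  {})

;; Calls map: hook name -> function.
(def calls
  {:lifecycle/before-task-start log-task-start})

;; Lifecycle entry: attaches the calls map (referenced by keyword) to a task.
(def lifecycles
  [{:lifecycle/task :mixed-case
    :lifecycle/calls :example.tasks/calls}])   ; hypothetical namespace

;; Input for a batch run: the :done sentinel marks the end of the stream.
(def input-segments
  [{:word "hello"} {:word "onyx"} :done])

(defn make-job
  "Collects the pieces of a job into the map that is submitted to Onyx."
  [workflow catalog flow-conditions lifecycles]
  {:workflow workflow
   :catalog catalog
   :flow-conditions flow-conditions
   :lifecycles lifecycles
   :task-scheduler :onyx.task-scheduler/balanced})
```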
  (map #(%2 (str %1)) s)
  (apply str)))

(defn mixed-case [segment]
  {:word (mixed-case-impl (:word segment))})

• Take segments as parameters
• Emit one or more segments as output
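To illustrate "one or more segments as output", here is a small sketch with two hypothetical task functions: one returns a single map, the other returns a vector of maps so that one input segment fans out into several.

```clojure
(require '[clojure.string :as str])

(defn upper-case-word
  "Takes one segment and emits exactly one segment."
  [segment]
  (update segment :word str/upper-case))

(defn split-sentence
  "Takes one segment and emits one segment per word."
  [segment]
  (mapv (fn [w] {:word w})
        (str/split (:sentence segment) #"\s+")))
```

Both are ordinary pure functions, so they can be tested at the REPL without starting a cluster.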
to provide strong, multi-tenant isolation of peers
  ◦ :zookeeper/server? : Starts a local, in-memory ZooKeeper (test only)
  ◦ :zookeeper.server/port : Port to use for the local in-memory ZooKeeper
  ◦ :zookeeper/address : Addresses of ZooKeeper servers to use for coordination
• Peer Configuration
  ◦ :onyx/tenancy-id : Provides strong, multi-tenant isolation of peers
  ◦ :zookeeper/address : Addresses of ZooKeeper servers to use for coordination
  ◦ :onyx.peer/job-scheduler : Coordinates which jobs peers are allowed to volunteer to execute
  ◦ :onyx.messaging/impl : Messaging protocol to use for peer-to-peer communication
  ◦ :onyx.messaging/bind-addr : IP address to bind the peer to for messaging
  ◦ :onyx.messaging/peer-port : Port that peers should use to communicate
• Submit Job (onyx.api/submit-job config job)
  ◦ Environment, peer configuration with workflow, catalog, lifecycles and flow conditions
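Put together, a local development setup might look like the sketch below. The tenancy id, addresses, ports, and peer count are illustrative assumptions; the API calls sit inside a `comment` form because they need Onyx on the classpath and a complete `job` map, neither of which is defined here.

```clojure
(def tenancy-id "my-tenancy")          ; hypothetical value

(def env-config
  {:onyx/tenancy-id tenancy-id
   :zookeeper/server? true             ; in-memory ZooKeeper, tests only
   :zookeeper.server/port 2188
   :zookeeper/address "127.0.0.1:2188"})

(def peer-config
  {:onyx/tenancy-id tenancy-id
   :zookeeper/address "127.0.0.1:2188"
   :onyx.peer/job-scheduler :onyx.job-scheduler/balanced
   :onyx.messaging/impl :aeron
   :onyx.messaging/bind-addr "localhost"
   :onyx.messaging/peer-port 40200})

(comment
  ;; Not evaluated on load; run form by form at a REPL with Onyx available.
  (require '[onyx.api])
  (def env        (onyx.api/start-env env-config))
  (def peer-group (onyx.api/start-peer-group peer-config))
  (def v-peers    (onyx.api/start-peers 3 peer-group))
  (onyx.api/submit-job peer-config job)   ; `job` is assumed to be defined elsewhere
  ;; Teardown, in reverse order of startup.
  (doseq [p v-peers] (onyx.api/shutdown-peer p))
  (onyx.api/shutdown-peer-group peer-group)
  (onyx.api/shutdown-env env))
```

Sharing the same `:onyx/tenancy-id` between the environment and peer configuration is what places them in the same isolated cluster.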