Edgelixir: Distributed Graph Processing in Elixir

Distributed graph processing systems are an important part of the modern data toolkit. Graph structures are a natural way to model many problems, and distributed systems let us scale with our problem size. This talk walks through how some of Elixir's features (including OTP, macros, and functional programming itself) can come together to simplify writing a distributed graph system. The result is Edgelixir, an early, work-in-progress, Pregel-based graph framework.

Nathan Lapierre

September 01, 2016

Transcript

  1. Goals for this talk
    • Learning more about Elixir, OTP, and the Erlang VM
    • I don’t assume any special background knowledge
    • Using Elixir for distributed data processing (“Big Data”)
    • Working with graphs in Elixir
  2. Talk Outline
    • Why Graphs?
    • Distributed graph processing
      • Pregel model and existing frameworks
    • Edgelixir
      • Why Elixir makes a lot of sense!
      • How to use Edgelixir
      • Dive into Edgelixir’s code
    • Summary & Questions/Feedback
  3. So we’re on the same page…
    • Graph: Presentation preparedness vs. weeks until conference (axes: % Prepared, Weeks Remaining)
    • Graph: New conference friends
    • Terms: Vertex (node), Edge (arc), weight, (un)directed
  4. Why Graphs?
    • We can model a lot of problems as graphs
      • Social networks
      • Web graph
      • Science
      • Constraint satisfaction
      • A lot more…
    • We have a lot of well understood machinery to apply to graphs
      • Topology, distance, coloring, centrality, clustering…
    • This makes solving problems with graphs very intuitive
      • … and satisfying!
  5. Distributed Graph Processing
    • Our graphs can easily be too big to process on one computer
      • Too large (RAM/Disk-bound)
      • Too slow (CPU-bound)
    • Need distributed processing to scale computing power with our problem size
      • Multicore
      • Multi-computer
    • Graphs aren’t as easy as other workloads to distribute
      • Less obvious how to divide the work while keeping communication overhead low
  6. Pregel
    • Malewicz et al. (2010)
    • Graph distribution by message passing
    • Goals:
      • Many algorithms implemented in one system
      • “Think like a vertex” programming
      • Fault tolerant
      • Scalable
    • Figure from Giraph Getting Started Docs
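
To make "think like a vertex" concrete, here is a minimal single-process sketch of a Pregel-style superstep loop, using maximum-value propagation as the algorithm. The module name `PregelSketch`, the shape of `compute/3`, and the map-based graph are all assumptions made for illustration; they are not Edgelixir's API, and there is no distribution or fault tolerance here.

```elixir
defmodule PregelSketch do
  # Hypothetical single-process Pregel loop; not Edgelixir's API.
  # graph:  %{vertex => [out_neighbour]}
  # values: %{vertex => number}

  # "Think like a vertex": a vertex sees only its own value and its inbox.
  # Returns {new_value, send?}. In superstep 0 every vertex sends its value.
  def compute(value, messages, superstep) do
    new_value = Enum.max([value | messages])
    {new_value, superstep == 0 or new_value > value}
  end

  def run(graph, values), do: loop(graph, values, %{}, 0)

  defp loop(graph, values, inbox, superstep) do
    {values, outbox} =
      Enum.reduce(graph, {values, %{}}, fn {v, neighbours}, {vals, out} ->
        {new_value, send?} =
          compute(Map.fetch!(values, v), Map.get(inbox, v, []), superstep)

        vals = Map.put(vals, v, new_value)

        out =
          if send? do
            # Queue the new value for every out-neighbour's next inbox.
            Enum.reduce(neighbours, out, fn n, acc ->
              Map.update(acc, n, [new_value], &[new_value | &1])
            end)
          else
            out
          end

        {vals, out}
      end)

    # No messages in flight means every vertex has gone quiet, so we stop.
    if map_size(outbox) == 0, do: values, else: loop(graph, values, outbox, superstep + 1)
  end
end
```

Calling `PregelSketch.run(graph, values)` repeats supersteps until no vertex sends a message, at which point every vertex in a connected graph holds the largest initial value.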
  7. Open Source Pregel-based Frameworks
    Most popular:
    • Apache Giraph
      • JVM-based, Java
      • Built on Hadoop MapReduce
    • GraphX
      • JVM-based, Scala
      • Built on Apache Spark
    • Others…
    Both are:
    • Mature projects used in production at scale
    • A little bit terrifying
      • Thousands of lines of code
      • Complicated to deploy and configure
  8. Pregel in Elixir
    • The Erlang VM and the Elixir language/standard library have everything we need to implement Pregel out of the box
    • Erlang OTP has decades of distributed computing knowledge baked in
    • Let’s build a concise Pregel-like Elixir package using the most appropriate Elixir features
  9. The Toolbox
    • :digraph
      • ETS-backed mutable graph storage in Erlang standard library
    • :global
      • Global name registration across cluster
      • Caveat: scalability issues, OTP team working on it
    • GenServer
      • Message passing and state
    • Behaviours and Protocols
      • Graph input/output extensibility
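
As a quick illustration of the first tool, the snippet below builds a tiny graph with `:digraph` from the Erlang standard library and queries it. The vertex names are made up for the example; the function calls are the standard `:digraph` ones.

```elixir
# :digraph ships with Erlang/OTP; graphs are mutable and backed by ETS tables.
graph = :digraph.new()

# Vertices must exist before edges between them can be added.
for v <- [:a, :b, :c], do: :digraph.add_vertex(graph, v)
:digraph.add_edge(graph, :a, :b)
:digraph.add_edge(graph, :a, :c)

# Query the structure: out-neighbours of :a and the total vertex count.
neighbours = Enum.sort(:digraph.out_neighbours(graph, :a))
vertex_count = :digraph.no_vertices(graph)

# The backing ETS tables live as long as the owning process, so free them explicitly.
:digraph.delete(graph)
```

Because the graph is mutated in place, the calls return identifiers or status values rather than a new graph, which is unusual next to Elixir's immutable data structures.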
  10. Edgelixir
    • Inputs: A graph, compute/3
    • Output: A graph
    • Diagram labels: Graph persistence (:digraph/ETS); Edgelixir OTP app; Node N; Superstep Supervisor (GenServer); Vertex compute 1; Vertex compute V; …
  11. Reading the Graph
    • Graph data is loaded and parsed
      • Edgelixir.GraphSource, Edgelixir.GraphFormat
    • Vertices/edges are partitioned
      • Edgelixir.GraphPartition
    • If the vertex/edge belongs on the current node, store in node’s :digraph
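
One simple way to picture the partitioning step is hash partitioning: each vertex is assigned to a node by hashing its id, and a node keeps only the edges whose source vertex it owns. The sketch below assumes that scheme; `PartitionSketch`, `owner/2`, and `load_local/3` are hypothetical names, and Edgelixir.GraphPartition may well work differently.

```elixir
defmodule PartitionSketch do
  # Hypothetical stand-in for a partition function; Edgelixir.GraphPartition may differ.
  # Maps a vertex to one of the given nodes by hashing its id.
  def owner(vertex, nodes) do
    # :erlang.phash2/2 returns an integer in 0..length(nodes) - 1.
    Enum.at(nodes, :erlang.phash2(vertex, length(nodes)))
  end

  # Builds a :digraph holding only the edges whose source vertex is owned
  # by `this_node`. Assumption: the destination vertex is stored too, since
  # :digraph requires both endpoints to exist before adding an edge.
  def load_local(edges, nodes, this_node) do
    graph = :digraph.new()

    for {from, to} <- edges, owner(from, nodes) == this_node do
      :digraph.add_vertex(graph, from)
      :digraph.add_vertex(graph, to)
      :digraph.add_edge(graph, from, to)
    end

    graph
  end
end
```

In a running cluster, `nodes` could be something like `[node() | Node.list()]` sorted into a stable order, so that every node computes the same owner for a given vertex.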
  12. Superstep Supervisor: Distributed Sync Barrier
    • Nothing exactly like this built into OTP
      • (:global locks might work)
    • Can build a simple one with GenServer
      • Assumption: no new nodes are coming online after starting
      • GenServer state holds # of nodes that have hit the barrier so far
      • Compares to # of nodes in the cluster
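
A barrier along these lines could be sketched with a GenServer roughly as follows. This is an illustrative guess, not Edgelixir's code: `BarrierSketch` and `wait/1` are hypothetical names, and instead of a bare counter the state keeps the list of blocked callers, whose length plays the role of the count.

```elixir
defmodule BarrierSketch do
  use GenServer

  # Hypothetical barrier; names are illustrative, not Edgelixir's API.
  # `expected` is the fixed number of participants (no nodes join later).
  def start_link(expected), do: GenServer.start_link(__MODULE__, expected)

  # Blocks the caller until `expected` participants have arrived.
  def wait(barrier), do: GenServer.call(barrier, :arrive, :infinity)

  @impl true
  def init(expected), do: {:ok, %{expected: expected, waiting: []}}

  @impl true
  def handle_call(:arrive, from, %{expected: expected, waiting: waiting} = state) do
    waiting = [from | waiting]

    if length(waiting) == expected do
      # Last arrival: release everyone and reset for the next superstep.
      Enum.each(waiting, &GenServer.reply(&1, :ok))
      {:noreply, %{state | waiting: []}}
    else
      # Not everyone is here yet: leave the caller blocked.
      {:noreply, %{state | waiting: waiting}}
    end
  end
end
```

In a cluster, `expected` might be `length(Node.list()) + 1`, and the server could be started with `name: {:global, some_name}` so that every node can reach it. Both are assumptions about deployment rather than details from the slides.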
  13. Superstep Supervisor: Message passing
    • Messages come from compute/3
      • Edgelixir.Superstep.message_neighbours/2
      • Checks node partition during send
      • Sends only to necessary node
    • Messages are stored in GenServer state
    • Messages are passed to compute/3 during each superstep
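
The routing idea (check the partition while sending, and send only to the node that owns each neighbour) can be sketched as two pure functions. Everything here is hypothetical: `RoutingSketch`, `route/3`, and `deliver/2` are illustrative names, the hash-based `owner/2` is an assumed partition scheme, and Edgelixir.Superstep.message_neighbours/2 may be implemented differently.

```elixir
defmodule RoutingSketch do
  # Hypothetical routing helpers; not Edgelixir's actual implementation.

  # Assumed partition scheme: hash the vertex id onto the node list.
  def owner(vertex, nodes), do: Enum.at(nodes, :erlang.phash2(vertex, length(nodes)))

  # Groups one vertex's outgoing message by the node that owns each neighbour.
  # Returns %{node => [{neighbour, message}]}, so each node receives only
  # the messages addressed to vertices it stores.
  def route(neighbours, message, nodes) do
    neighbours
    |> Enum.map(&{&1, message})
    |> Enum.group_by(fn {neighbour, _message} -> owner(neighbour, nodes) end)
  end

  # Merges delivered messages into an inbox map (%{vertex => [message]}),
  # the kind of structure a GenServer could keep in its state until the
  # next superstep hands each vertex its messages.
  def deliver(inbox, routed_for_this_node) do
    Enum.reduce(routed_for_this_node, inbox, fn {vertex, msg}, acc ->
      Map.update(acc, vertex, [msg], &[msg | &1])
    end)
  end
end
```

In a distributed setting, each group returned by `route/3` could be sent to its node with something like `GenServer.cast({server_name, node}, {:deliver, messages})`, where `server_name` and the `:deliver` message shape are placeholders.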