pg2 and You: Getting Distributed with Elixir

pg2 and You: Getting Distributed with Elixir Eric Entin @antipax

Distributed Applications are awesome!

Distributed Elixir is really awesome!

Distributed Erlang/OTP makes Distributed Elixir really awesome!

But…

Distributed Applications Are Hard (really hard!)

And…

Erlang/OTP Isn’t Magic, but it does provide some really nice
tools that make distributed applications a little easier!

One of these tools is called…

pg2: distributed named process groups

pg2 allows us to create, join, and query groups of
processes across a cluster.

What does pg2 let us do?
• We can access a group of processes by a common name. For example, there can be a set of processes (which can be located on different nodes) that are all members of the group :foobar.
• We can send a message to one, some, or all group members.
• If a member process terminates, it is automatically removed from the group.

Let’s take a look at a real world example of
pg2 in use…

…in a chat app… (what else?)

…using Phoenix Channels! “bidirectional communication for soft-realtime functionality”

Our Phoenix Chat App
[Diagram: Phone, Tablet, Browser, Phoenix Server]

Well, that was easy.

But what about when we outgrow one server? (most likely
for redundancy, given Phoenix’s performance characteristics)

Our Distributed Phoenix Chat App
[Diagram: Phone, Tablet, Browser, Phoenix Server, Phoenix Server]

But how is this implemented in Phoenix?

Phoenix PubSub (simplified)
[Diagram: Client 1, Socket 1, Channel 1, Client 2, Socket 2, Channel 2, PubSub Server]
The PubSub Server is a process running on each node.

Our Distributed Phoenix Chat App
[Diagram: Phone, Tablet, Browser, Phoenix Server, Phoenix Server]
How do we find the PubSub server on all of our nodes?

By having our Phoenix PubSub servers join a pg2 group, we can fan our messages out across the cluster.
(Disclaimer: in practice, PubSub is implemented via adapters, so we can use pg2, Redis, or any other solution we can think of!)

Let’s take a look at a simplified code example.

pg2 example
    defmodule Phoenix.PubSub.PG2Server do
      use GenServer
      def init(server_name) do
        :ok = :pg2.create({:phx, server_name})
        :ok = :pg2.join({:phx, server_name}, self())
        {:ok, server_name}
      end
      …
    end

pg2 example, cont.
    defmodule Phoenix.PubSub.PG2Server do
      …
      def broadcast(server_name, topic, message) do
        Enum.each :pg2.get_members({:phx, server_name}), fn pid ->
          send pid, {:broadcast, topic, message}
        end
      end
      def handle_info({:broadcast, topic, message}, server_name) do
        Local.broadcast(topic, message)
        {:noreply, server_name}
      end
    end

In practice, Phoenix PubSub is more complicated, thanks to extensive
optimization. But this is essentially how it works.

But how does pg2 work?

What can we learn about its characteristics from its implementation?

In order to answer these questions and learn exactly how
pg2 works, I translated the code to Elixir.

Introducing: RePG2
A highly documented translation of the original Erlang pg2
implementation to Elixir for educational purposes.

The true specification of behavior is the code itself. By
reading the code of our favorite software, we can gain a deeper understanding.

Not everyone knows Erlang. Despite the high quality of the
implementation, pg2's Erlang code is not necessarily easy to read, even if you know Erlang.

For fun! Sometimes you’re just bored.

In order to accomplish these goals, I set out some
guiding principles for the translation.

RePG2 Translation Principles
• RePG2 code should be idiomatic, easy-to-read, fully (over?) documented Elixir.
• RePG2 should be identical to pg2 in terms of functionality and performance characteristics, even if it has been refactored to increase clarity.
• Code which exists purely for backwards compatibility may be eliminated in the interest of clarity.

Tests were also written using ExUnit for full RePG2 code
coverage. This includes a distributed suite which interacts with multiple nodes.

RePG2 vs. pg2
• RePG2 does not have the same backwards compatibility as pg2, and has only been tested on Erlang/OTP 18.3 and Elixir 1.2.4.
• pg2 is started under the kernel_safe_sup, a special OTP kernel supervisor for important services that are considered safe to restart. RePG2 is implemented as a normal OTP application.
• pg2 will start itself if it is not yet started. RePG2 expects to be added to :applications in mix.exs and will not start itself.
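
A minimal sketch of what that mix.exs entry might look like. The application name :repg2, the project name :my_app, and the GitHub dependency form are assumptions made for illustration (only the repository URL appears in this deck), so check the RePG2 README for the real values.

```elixir
# mix.exs (hypothetical project; :my_app and :repg2 are assumed names)
defmodule MyApp.Mixfile do
  use Mix.Project

  def project do
    [app: :my_app, version: "0.1.0", deps: deps()]
  end

  def application do
    # RePG2 does not start itself, so it is listed here explicitly.
    [applications: [:logger, :repg2]]
  end

  defp deps do
    # Assumed dependency form, pointing at the repository linked in this deck.
    [{:repg2, github: "antipax/repg2"}]
  end
end
```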

How much work was the translation?

Actually, not much, because pg2 is tiny!

In fact, pg2 is only 333 lines of code. With
comments, 390!

pg2 is simple thanks to some other useful tools OTP
provides. RePG2 uses all of these tools as well.

pg2’s OTP Toolbox
• GenServer
• ETS
• :global
• Node and Process Monitoring

Let’s take a look at each of these individually. All
of these apply equally to RePG2.

Additionally, by looking at how pg2 uses these tools, we
will gain a high-level understanding of pg2’s implementation.

GenServer
• Each node (which is using pg2) has a pg2 server process running.
• This server process serves as the central point of interaction for pg2 between (and within) each node.

ETS
• ETS is an in-memory concurrent storage solution for Elixir terms.
• In pg2, ETS is used to store process groups and their memberships.
• Reads can happen from any process, but to avoid race conditions, writes are serialized through the node’s pg2 server.
• ETS is extremely useful in general, and this pattern in particular is very common.
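
The read-anywhere, write-through-one-process pattern above can be sketched roughly as follows. This is not pg2's code: the module name GroupTableSketch, the table name, and the {group, pid} row layout are all made up for illustration.

```elixir
defmodule GroupTableSketch do
  use GenServer

  # Hypothetical table name, chosen for this sketch only.
  @table :group_table_sketch

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  # Write path: every insert goes through the single owning process,
  # so concurrent joins are applied one at a time.
  def join(group, pid) when is_pid(pid) do
    GenServer.call(__MODULE__, {:join, group, pid})
  end

  # Read path: runs in the caller's process and never touches the server.
  def members(group) do
    for {^group, pid} <- :ets.lookup(@table, group), do: pid
  end

  @impl true
  def init(:ok) do
    # :protected means any process may read, but only the owner may write.
    table = :ets.new(@table, [:bag, :named_table, :protected, read_concurrency: true])
    {:ok, table}
  end

  @impl true
  def handle_call({:join, group, pid}, _from, table) do
    true = :ets.insert(table, {group, pid})
    {:reply, :ok, table}
  end
end

{:ok, _server} = GroupTableSketch.start_link()
:ok = GroupTableSketch.join(:foobar, self())
GroupTableSketch.members(:foobar)
```

Because the table is :protected, a write attempted from any other process would fail, which is what forces the serialization.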

:global
• The :global module provides a function called trans/2. This function acquires a lock across the entire cluster, using any Elixir term as a key, and then runs a provided function. After the function completes, the lock is released.
• By combining trans/2 with GenServer’s multi_call/3, which allows us to call all processes registered with a given name within the cluster, pg2 ensures that only one process across the entire cluster can modify any given group at a time.
• This pattern can be very useful in our own code!
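
A rough sketch of the lock-then-multi_call pattern in our own code. The module GlobalLockSketch and its message shapes are invented for this example and are not taken from pg2. Note that :global.trans/2 takes its lock id as a {resource, requester} tuple.

```elixir
defmodule GlobalLockSketch do
  use GenServer

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, MapSet.new(), name: __MODULE__)
  end

  # Takes a cluster-wide lock keyed on the group, then asks the server
  # registered under this module's name on every known node to apply the change.
  def add_everywhere(group, member) do
    lock_id = {{__MODULE__, group}, self()}

    :global.trans(lock_id, fn ->
      GenServer.multi_call([node() | Node.list()], __MODULE__, {:add, group, member})
    end)
  end

  def local_entries, do: GenServer.call(__MODULE__, :entries)

  @impl true
  def init(entries), do: {:ok, entries}

  @impl true
  def handle_call({:add, group, member}, _from, entries) do
    {:reply, :ok, MapSet.put(entries, {group, member})}
  end

  def handle_call(:entries, _from, entries), do: {:reply, entries, entries}
end

{:ok, _server} = GlobalLockSketch.start_link()
# Returns {replies, bad_nodes}: one {node, reply} pair per node that answered.
{replies, bad_nodes} = GlobalLockSketch.add_everywhere(:foobar, :some_member)
```

On a single, unclustered node this degenerates to one local call, but the shape is the same once more nodes are connected.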

Node and Process Monitoring
• The :net_kernel function monitor_nodes/1 allows the calling process to register for notifications about nodes connecting and disconnecting from the cluster.
• When the pg2 server receives a notification that a new node has connected, it merges the groups and memberships between itself and the new member’s pg2 server.
• Additionally, pg2 registers a monitor for each process which joins a group. If this monitor reports that the process is down (which could be because the process died, or because its node disconnected), the process’s membership is removed from the local data.
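
The monitoring side can be sketched like this. MonitorSketch is a made-up module that tracks a single flat member list rather than named groups, and it leaves the node-up exchange as a stub.

```elixir
defmodule MonitorSketch do
  use GenServer

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  def join(pid) when is_pid(pid), do: GenServer.call(__MODULE__, {:join, pid})
  def members, do: GenServer.call(__MODULE__, :members)

  @impl true
  def init(:ok) do
    # Subscribe to {:nodeup, node} and {:nodedown, node} messages.
    :net_kernel.monitor_nodes(true)
    {:ok, %{}}
  end

  @impl true
  def handle_call({:join, pid}, _from, members) do
    # One monitor per join; the monitor reference is the key.
    ref = Process.monitor(pid)
    {:reply, :ok, Map.put(members, ref, pid)}
  end

  def handle_call(:members, _from, members) do
    {:reply, Map.values(members), members}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, members) do
    # Fires when the process exits or when its node disconnects.
    {:noreply, Map.delete(members, ref)}
  end

  def handle_info({:nodeup, _node}, members) do
    # A real implementation would exchange and merge data with the new node here.
    {:noreply, members}
  end

  def handle_info({:nodedown, _node}, members), do: {:noreply, members}
end

{:ok, _server} = MonitorSketch.start_link()
member = spawn(fn -> receive do: (:stop -> :ok) end)
:ok = MonitorSketch.join(member)
```

Once the member process exits, the :DOWN message removes it without any explicit leave call.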

That’s the pg2 implementation!

What are some key insights we can take away from
these pg2 implementation details?

pg2 uses global locks
• While reading group memberships is very fast, modifying them is a globally locked operation requiring multiple network round-trips.
• We may run into problems with lock overhead if our groups contain a large number of memberships.

pg2 is a distributed database
• In terms of the CAP theorem, pg2 is AP: available and partition-tolerant.
• Cluster partitions will only see groups and memberships from nodes that are reachable.
• However, pg2 is eventually consistent in that it automatically heals from any partitions.
• Process groups are uniquely easy to distribute thanks to monitors, and the fact that conflicts can be easily resolved via merging.
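
To see why merging resolves conflicts so easily, here is a simplified sketch that treats each group as a set and takes the union of both sides. MergeSketch is a hypothetical module. pg2 itself also lets a process join a group more than once, which this sketch ignores. Atoms stand in for pids to keep the example readable.

```elixir
defmodule MergeSketch do
  # Each side is a map of group name => MapSet of members.
  # A group known to only one side is kept as-is; a group known to both
  # sides ends up with the union of the two member sets.
  def merge(local, remote) do
    Map.merge(local, remote, fn _group, a, b -> MapSet.union(a, b) end)
  end
end

# Two sides of a healed partition, each with its own view.
side_a = %{foobar: MapSet.new([:p1, :p2])}
side_b = %{foobar: MapSet.new([:p2, :p3]), other: MapSet.new([:p4])}

merged = MergeSketch.merge(side_a, side_b)
```

Set union gives the same result in either direction, so both sides converge on the same view regardless of who initiates the merge.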

Overall, pg2 is an amazingly powerful tool. Just be aware
of the caveats!

Check out RePG2 for more examples, including how to build
a distributed test suite!

Thanks! Questions?
https://github.com/antipax/repg2
https://www.twitter.com/antipax
