Sharding with the Django ORM

Sharding with the Django ORM Mike Clarke

Overview • what is sharding • sharding at Disqus •
example application • next steps

What is sharding?

splitting rows across multiple database instances

db polls_poll polls_choices

db1 db2 db3 db4 polls_choices_1 polls_choices_2 polls_choices_3 polls_choices_4 polls_choices_5 polls_choices_6
polls_choices_7 polls_choices_8 polls_choices_9 polls_choices_10 polls_choices_11 polls_choices_12 polls_choices_13 polls_choices_14 polls_choices_15 polls_choices_16 polls_choices_17 polls_choices_18 polls_choices_19 polls_choices_20

Why shard your db?

Advantages • smaller databases (disk space) • faster queries •
cheaper hardware • easier to manage • horizontal scaling

Disadvantages • code complexity • SPOF becomes many-POF • casscading
failures • cross-shard JOINs not an option • materialized views can help • DDL changes more difﬁcult

When to shard?

When to shard? • as late as possible... • vertical
scaling (dedicated hardware) • ...but not too late • migrating to shards is harder with time • system metrics are early indicators • disk IO utilization critical metric

Database Hardware • 384 GB RAM • Dual procs, octocore
CPUs • 36 HDDs • 32 drive RAID10 for data partition • Many read slaves

Buying time • Throttling writes via RabbitMQ / celeryd •
Shufﬂing load from master to read slaves • Role-based read slaves to optimize reads • Diminishing returns of read slave strategy • Logical replication delays via slony • Very pricey

Connecting to shards • dedicated pgBouncer machines • HAProxy accepts
connections • local pgBouncers for • keepalived for failover / clustering • Required due to max connections & NIC limitations on individual machines

sidenote: DB tools • pgbadger w/ rsyslog • http://dalibo.github.com/pgbadger/ •
pgbarman for backup • http://www.pgbarman.org/ • slony, bucardo for replication • bucardo offers multi-master

sidenote: monitoring • Diamond & enhanced postgres collector • http://github.com/disqus/Diamond
• collects per-table stats • os and database metrics all in graphite • WIP visualization tool (open sourced soon) • alert with check_graphite.py

Use Cases

Use Cases • high throughput (insert / update) tables •
votes • posts • tweets • large, non-critical data • import metadata • HTTP headers

Extending the Django ORM

ORM Challenges • conﬁguring database connections • sequences for auto-incremented
ids • routing queries • rewriting table names • generating DDL

sample sqlshard app https://github.com/disqus/sharding-example

database conﬁguration

# settings.py DATABASE_CONFIG = { 'root': { 'ENGINE': 'django.db.backends.postgresql_psycopg2', },
'servers': { 'default': { 'NAME': 'polls', }, 'sharded': { 'NAME': 'sharded_polls', 'HOSTS': { 0: { ‘HOST’: ‘bacon-is-delicious.disqus.com’, ‘PORT’: 5432 }, 1: { ‘HOST’: ‘i-love-bacon.disqus.com’, ‘PORT’: 5432 }, } } } } DATABASES = dict(DatabaseConﬁgurator( DATABASE_CONFIG['servers'], DATABASE_CONFIG['root'] )) DATABASE_ROUTERS = ['sqlshards.db.shards.routers.ShardedRouter']

>>> from django.db import connections >>> connections.databases.keys() ['default', 'sharded.shard0', 'sharded.shard1']

sharding sequences

PL/pgSQL function a la Instagram http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram

CREATE OR REPLACE FUNCTION next_sharded_id(varchar, int, OUT result bigint) AS
$$ DECLARE sequence_name ALIAS FOR $1; shard_id ALIAS FOR $2; seq_id bigint; now_millis bigint; BEGIN SELECT nextval(sequence_name::regclass) % 1024 INTO seq_id; SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis; result := (now_millis - 1351746000000) << 23; result := result | (shard_id << 10); result := result | (seq_id); END; $$ LANGUAGE PLPGSQL;

CREATE SEQUENCE polls_choice_0_id_seq; CREATE TABLE "polls_choice_0" ( "id" bigint DEFAULT
next_sharded_id('polls_choice_0_id_seq', 0) NOT NULL PRIMARY KEY, "poll_id" integer CHECK ("poll_id" >= 0) NOT NULL, "choice_text" varchar(200) NOT NULL, "votes" integer NOT NULL );

deﬁning models

# models.py from sqlshards.db.shards.ﬁelds import ShardedAutoField from sqlshards.db.shards.models import PartitionModel
class PollPartitionBase(PartitionModel): id = ShardedAutoField(primary_key=True, auto_created=True) poll_id = models.PositiveIntegerField(db_index=True) class Meta: abstract = True class Shards: key = 'poll_id' num_shards = 2 cluster = 'sharded' class Choice(PollPartitionBase): choice_text = models.CharField(max_length=200) votes = models.IntegerField()

routing queries

# sqlshard/db/shards/routers.py class ShardedRouter(object): def db_for_read(self, model, **hints): shard_info =
getattr(model, '_shards', None) if shard_info: if not shard_info.is_child: raise ValueError(...) return shard_info.get_database() return None def db_for_write(self, model, **hints): hints['is_write'] = True return self.db_for_read(model, **hints)

running queries

>>> from polls.models import Poll, Choice >>> Choice.objects.create(poll_id=9000, choice_text=‘Absolutely’, votes=100)
>>> Choice.objects.ﬁlter(poll_id=9000) [<Choice_Partition1: Absolutely>] >>> Choice.objects.all() # Raises exception, cannot run

generating DDL

$ manage.py sqlpartition polls.choice CREATE OR REPLACE FUNCTION next_sharded_id(varchar, int,
OUT result bigint) ... CREATE SEQUENCE polls_choice_0_id_seq; ... CREATE TABLE "polls_choice_0" ( "id" bigint DEFAULT next_sharded_id('polls_choice_0_id_seq', 0) NOT NULL PRIMARY KEY, "poll_id" integer CHECK ("poll_id" >= 0) NOT NULL, "choice_text" varchar(200) NOT NULL, "votes" integer NOT NULL ); ... ALTER TABLE "polls_choice_0" ADD CONSTRAINT "polls_choice_0_poll_id_check_modulo" CHECK (("poll_id") % 2 = 0); ALTER TABLE "polls_choice_1" ADD CONSTRAINT "polls_choice_1_poll_id_check_modulo" CHECK (("poll_id") % 2 = 1);

alternatives

• shard by schema (Instagram approach) • public => shard0,
shard1, ... • shard by database name • route via Django connections • mongodb? • glhf

next steps

• open source as standalone app • needs better tests
• actual documentation • integrate into nydus • parallel query execution • https://github.com/disqus/nydus • beyond python • no tool exists currently that meets our needs • postgres query routing proxy

PSA http://postgresweekly.com

PSA #2 we’re hirring! disqus.com/jobs

Thanks! @mikeclarke for questions shout out to @jasonyan and @zeeg
for their hard work

Sharding with the Django ORM

Sharding with the Django ORM

More Decks by Mike Clarke

Other Decks in Programming

Featured

Transcript