Building for 100x scale by Simon Kelly

Pycon ZA
October 07, 2016

CommCare is an open source platform built in Python (Django) designed for mobile data collection, longitudinal client tracking, decision support, and behavior change communication. CommCare provides an online application-building platform through which users build mobile applications for use by frontline workers.

The mobile application is used by client-facing frontline workers as a client management, data collection and educational tool. Data entered in the mobile application is submitted to the CommCare servers.

Currently CommCare supports 14K active mobile users submitting over 1 million forms a month. With new national projects launching soon, it will need to support 100K users and up to 10 million monthly forms by the end of 2016, and 1.4M users within the next few years. The current architecture would not scale to that level because of database limitations and the increasing cost of ownership, so we have embarked on an internal project to redesign critical pieces of the platform to support this scale-up.

This talk will describe the old and new architecture and delve into some of the details of the new architecture and the decisions we’ve made along the way, such as changing our primary database, sharding the database, and stream processing.

Transcript

  1. • Who is Dimagi
     • A story of scale
     • Rethinking the system
     • Implementation
     • Learnings
  2. The story of scale

    • Senegal: National Informed Push for Supply Chain
    • Guatemala: Scaling maternal health, malaria & nutrition app to 9,000 users
    • Ghana: Supply Chain, Community Health Worker Expansion
    • Burkina Faso: Clinical tools used in 25% of all national clinics
    • India: Scaling app to 100,000 Community Health Workers
    • Myanmar: Scaling to 12,000 midwives
    • Tanzania: Nationally scaling supply chain project
    • Mozambique: National Community Health Worker app rollout
  3. System Model

    Mobile Users, Cases, Data Elements, Transactions

        { "type": "person", "name": "Mary", "gender": "F", "dob": "1985-04-12" }
        { "next_visit": "2016-11-04" }

    • 100 - 20 000 per user
    • Sharing
  4. Expected System Load (2017)

    • Mobile Users: 150 thousand (600 %)
    • Cases: 4 million / month (600 %)
    • Data Elements: 4 billion / month (2500 %)
  5. Data growth: 5 Year timeline

    • 10 billion cases
    • 1 trillion data points
    • 5 petabytes of data
  6. System architecture

    [Diagram: nginx; Django (×3); Celery (×3); Redis (cache); Elasticsearch; Stream processing; PostgreSQL; CouchDB; CouchDB Cluster]
  7. System principles

    ◦ Good technology fit
    ◦ Open Source
    ◦ Cost
    ◦ Control
    ◦ Lock in
    ◦ Mature
    ◦ Well supported
    ◦ Reasonable upgrade paths
    ◦ Good tooling
    ◦ Horizontally scalable
  8. Principles applied

    • Cost
    • Control
    • Lock in
    • Design

    [Diagram labels: Horizontal scalability; Open Source; Maturity; Technology Fit. Components: Redis (cache); Elasticsearch; Django; Celery; nginx; Stream processing; PostgreSQL; CouchDB Cluster]
  9. Rethinking our data

    • High Volume Primary Data: Cases, Forms
    • Low Volume Primary Data: Users, Groups, Apps etc.
    • Analytics Data
    • Binary Data: Attachments, Multimedia ✓
  10. Evaluating options

    • Technology fit
    • Horizontal scalability
    • Open Source
    • Project maturity
    • Transactional properties / consistency model
    • Speed / transaction throughput
    • Secondary index support
    • Ease of implementation
    • Maintenance burden
  11. Identifying solutions

    • Short list
      ◦ PostgreSQL
      ◦ MongoDB
      ◦ CouchDB
    • Prototype
    • Benchmark
      ◦ Tsung
      ◦ Variety of workloads

    [Diagram: Flask; PostgreSQL; CouchDB; mongoDB]
  12. Evaluation results

    • PostgreSQL
      ◦ Benchmarks
      ◦ Flexibility of SQL
      ◦ Mature product
      ◦ Already in our toolset
    • Gaining insights
      ◦ Optimize for reads
      ◦ Scaling factors
      ◦ Scale limitations
  13. Foundations: Tests

    • Test suite
      ◦ Good coverage of code and use cases
      ◦ Run on both backends

        @run_with_all_backends
        def test_parent_and_child_cases(self): ….

        run_with_all_backends = functools.partial(
            run_with_multiple_configs,
            run_configs=[
                RunConfig(settings={'USE_SQL_BACKEND': True}, post_run=self.tearDown()),
                RunConfig(settings={'USE_SQL_BACKEND': False}, pre_run=self.setUp()),
            ]
        )
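The slide shows only how the decorator is assembled, not how `run_with_multiple_configs` works. The following is a minimal sketch of one way such a helper could behave, assuming it runs the wrapped test once per configuration and restores settings afterwards. `SETTINGS` is a plain dict standing in for Django settings, and the bodies of `RunConfig` and `run_with_multiple_configs` here are illustrative guesses, not CommCare's implementation.

```python
import functools
from collections import namedtuple

# Hypothetical stand-ins for the names on the slide.
RunConfig = namedtuple("RunConfig", ["settings", "pre_run", "post_run"])
SETTINGS = {"USE_SQL_BACKEND": False}  # stand-in for django.conf.settings


def run_with_multiple_configs(fn, run_configs):
    """Call the wrapped test once per config, restoring settings after each run."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        for config in run_configs:
            saved = dict(SETTINGS)
            SETTINGS.update(config.settings)
            try:
                if config.pre_run:
                    config.pre_run(*args)
                fn(*args, **kwargs)
                if config.post_run:
                    config.post_run(*args)
            finally:
                SETTINGS.clear()
                SETTINGS.update(saved)
    return wrapper


# partial() fixes run_configs, so the result can be used as a plain decorator.
run_with_all_backends = functools.partial(
    run_with_multiple_configs,
    run_configs=[
        RunConfig({"USE_SQL_BACKEND": True}, pre_run=None, post_run=None),
        RunConfig({"USE_SQL_BACKEND": False}, pre_run=None, post_run=None),
    ],
)

seen_backends = []


@run_with_all_backends
def check_backend():
    seen_backends.append("sql" if SETTINGS["USE_SQL_BACKEND"] else "couch")


check_backend()
```

In this sketch the hooks are passed as callables (or `None`) and invoked per run, so each backend gets its own set-up and tear-down.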
  14. Foundations: Branching

    • Code branching mechanism
      ◦ Override in tests
      ◦ In production

        if should_use_sql_backend(project):
            # SQL specific
        else:
            # CouchDB specific

        def should_use_sql_backend(project):
            local_override = get_local_sql_backend_override(project)
            if local_override is not None:
                return local_override
            if settings.UNIT_TESTING:
                return _should_use_sql_backend_in_tests(project)
            return USE_SQL_BACKEND_FLAG.enabled(project)
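The order of precedence in `should_use_sql_backend` (local override, then test setting, then per-project flag) can be tried out in isolation. This is a small sketch under stated assumptions: the dict, set and module-level constants below are hypothetical replacements for the override lookup, Django settings and feature flag named on the slide.

```python
# Hypothetical stand-ins for the helpers referenced on the slide.
LOCAL_OVERRIDES = {}                       # project -> True/False
FLAG_ENABLED_PROJECTS = {"pilot-project"}  # projects with the SQL flag enabled
UNIT_TESTING = False
TESTS_USE_SQL = True


def should_use_sql_backend(project):
    # 1. An explicit local override always wins.
    override = LOCAL_OVERRIDES.get(project)
    if override is not None:
        return override
    # 2. Under test, the test configuration decides.
    if UNIT_TESTING:
        return TESTS_USE_SQL
    # 3. Otherwise fall back to the per-project feature flag.
    return project in FLAG_ENABLED_PROJECTS


flagged = should_use_sql_backend("pilot-project")
unflagged = should_use_sql_backend("legacy-project")

# Forcing a flagged project back to CouchDB via the override.
LOCAL_OVERRIDES["pilot-project"] = False
overridden = should_use_sql_backend("pilot-project")
```

Checking `is not None` rather than truthiness matters here: an override of `False` has to be honoured, not treated as "no override".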
  15. Data Model

    • Nested JSON objects → Multiple SQL tables

        {
            "doc_type": "case",
            "relationships": [{"case_id": "a"}, {"case_id": "b"}],
            "transactions": [{"id": "1"}, {"id": "2"}],
            ….
        }

    [Diagram: Case 1 to 0..* CaseRelationship; Case 1 to 0..* CaseTransaction]
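To make the document-to-tables mapping concrete, here is a rough sketch that splits one nested case document into a parent row and two sets of child rows. It uses SQLite from the standard library purely as a stand-in for PostgreSQL; the table and column names, and the `case_id` field on the document, are assumptions for illustration.

```python
import sqlite3

case_doc = {
    "doc_type": "case",
    "case_id": "c1",  # assumed: the slide's document does not show its own id
    "relationships": [{"case_id": "a"}, {"case_id": "b"}],
    "transactions": [{"id": "1"}, {"id": "2"}],
}


def save_case_doc(conn, doc):
    """Write one nested document as a parent row plus 0..* child rows."""
    conn.execute("INSERT INTO case_table (case_id) VALUES (?)", (doc["case_id"],))
    conn.executemany(
        "INSERT INTO case_relationship (case_id, referenced_id) VALUES (?, ?)",
        [(doc["case_id"], rel["case_id"]) for rel in doc["relationships"]],
    )
    conn.executemany(
        "INSERT INTO case_transaction (case_id, transaction_id) VALUES (?, ?)",
        [(doc["case_id"], txn["id"]) for txn in doc["transactions"]],
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE case_table (case_id TEXT PRIMARY KEY);
    CREATE TABLE case_relationship (
        case_id TEXT REFERENCES case_table(case_id), referenced_id TEXT);
    CREATE TABLE case_transaction (
        case_id TEXT REFERENCES case_table(case_id), transaction_id TEXT);
""")
save_case_doc(conn, case_doc)

relationship_count = conn.execute(
    "SELECT count(*) FROM case_relationship WHERE case_id = ?", ("c1",)
).fetchone()[0]
transaction_count = conn.execute(
    "SELECT count(*) FROM case_transaction WHERE case_id = ?", ("c1",)
).fetchone()[0]
```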
  16. Data Access Interface

        CaseAccessors(project).get_case(case_id)

        class CaseAccessors(object):
            def __init__(self, project=None):
                self.project = project

            @property
            def _db_accessor(self):
                if should_use_sql_backend(self.project):
                    return CaseAccessorSQL
                else:
                    return CaseAccessorCouch

            def get_case(self, case_id):
                return self._db_accessor.get_case(case_id)
  17. Data Access Implementation

        class AbstractCaseAccessor(six.with_metaclass(ABCMeta)):
            @abstractmethod
            def get_case(case_id):
                raise NotImplementedError

        class CaseAccessorSQL(AbstractCaseAccessor):
            @staticmethod
            def get_case(case_id):
                return CaseSQL.objects.get(case_id=case_id)

        class CaseAccessorCouch(AbstractCaseAccessor):
            @staticmethod
            def get_case(case_id):
                return CaseCouch.get(case_id)
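The interface and its two implementations can be exercised end to end with in-memory stores. The sketch below is Python 3 only (it uses `abc.ABC` instead of `six`), and the dict-backed stores and the `SQL_PROJECTS` set are hypothetical substitutes for the real models and for `should_use_sql_backend`.

```python
from abc import ABC, abstractmethod

# Hypothetical in-memory stores standing in for the SQL and CouchDB models.
SQL_STORE = {"c1": {"case_id": "c1", "backend": "sql"}}
COUCH_STORE = {"c1": {"case_id": "c1", "backend": "couch"}}
SQL_PROJECTS = {"new-project"}  # stand-in for should_use_sql_backend()


class AbstractCaseAccessor(ABC):
    @staticmethod
    @abstractmethod
    def get_case(case_id):
        raise NotImplementedError


class CaseAccessorSQL(AbstractCaseAccessor):
    @staticmethod
    def get_case(case_id):
        return SQL_STORE[case_id]


class CaseAccessorCouch(AbstractCaseAccessor):
    @staticmethod
    def get_case(case_id):
        return COUCH_STORE[case_id]


class CaseAccessors:
    """Single entry point; callers never pick a backend themselves."""

    def __init__(self, project=None):
        self.project = project

    @property
    def _db_accessor(self):
        if self.project in SQL_PROJECTS:
            return CaseAccessorSQL
        return CaseAccessorCouch

    def get_case(self, case_id):
        return self._db_accessor.get_case(case_id)


new_case = CaseAccessors("new-project").get_case("c1")
old_case = CaseAccessors("old-project").get_case("c1")
```

Keeping the backend choice inside `_db_accessor` means call sites stay identical while projects are migrated one at a time.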
  18. Sharding

    “PL/Proxy is a PostgreSQL procedural language handler that allows you to do remote procedure calls between PostgreSQL databases, with optional sharding.”

    [Diagram: PL/Proxy DB; DB 1; DB 2; DB 3]

    2^N logical shards mapped to Y databases
  19. PL/Proxy: RUN ON hash()

        SELECT * FROM get_case_by_id(case_id)

    [Diagram: the PL/Proxy DB holds the stub function get_case_by_id(case_id); DB 1, DB 2 and DB 3 hold the function implementation get_case_by_id(case_id); the shard is selected by hash(case_id) & (2^N - 1)]
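The routing rule on this slide, a hash masked with 2^N - 1 to pick a logical shard, followed by a lookup from logical shard to physical database, can be sketched in a few lines. The hash function (`zlib.crc32`), the value of N, the database names and the contiguous-range layout are all assumptions chosen for illustration; PL/Proxy's own hashing is not reproduced here.

```python
import zlib

N = 4
NUM_LOGICAL_SHARDS = 2 ** N        # a power of two, so masking works
DATABASES = ["db1", "db2", "db3"]  # the Y physical databases


def build_shard_map(num_shards, databases):
    """Assign contiguous ranges of logical shards to physical databases."""
    per_db = -(-num_shards // len(databases))  # ceiling division
    return {shard: databases[shard // per_db] for shard in range(num_shards)}


SHARD_MAP = build_shard_map(NUM_LOGICAL_SHARDS, DATABASES)


def logical_shard(case_id):
    # crc32 is only a stand-in for the hash used in a real cluster.
    h = zlib.crc32(case_id.encode("utf-8"))
    return h & (NUM_LOGICAL_SHARDS - 1)


def database_for(case_id):
    return SHARD_MAP[logical_shard(case_id)]


target = database_for("case-123")
```

Because a case is routed by its logical shard rather than directly by database, adding a database later only requires moving some logical shards and updating the map; the hash and mask stay the same.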
  20. SQL Functions

    Proxy Function

        CREATE FUNCTION get_case_by_id(p_case_id TEXT)
        RETURNS SETOF case_table AS $$
            CLUSTER 'commcare';
            RUN ON hash_string(p_case_id);
        $$ LANGUAGE plproxy;

    Actual Function

        CREATE FUNCTION get_case_by_id(p_case_id TEXT)
        RETURNS SETOF case_table AS $$
        BEGIN
            RETURN QUERY
            SELECT * FROM case_table where case_id = p_case_id;
        END;
        $$ LANGUAGE plpgsql;
  21. Un-sharded Environment, Sharded Environment (Final State)

    [Diagram. Un-sharded: Django; unsharded DB (SQL functions). Sharded: Django; unsharded DB; proxy (PL/Proxy functions); p1 to pN (SQL functions)]
  22. Running queries from Python

    • Fetching Django objects

        CaseSQL.objects.get(case_id=case_id)

        CaseSQL.objects.raw('SELECT * from get_case_by_id(%s)', [case_id])[0]

    • Queries that don’t return objects

        CaseRelationship.objects.filter(case_id=case_id).values_list('referenced_id')

        with get_cursor(CaseSQL) as cursor:
            cursor.execute('SELECT referenced_id FROM get_parent_case_ids(%s)', [case_id])
            results = fetchall_as_namedtuple(cursor)
            return [result.referenced_id for result in results]
  23. Writing data

        CREATE FUNCTION save_case(
            case case_table
        ) AS $$
        BEGIN
            INSERT INTO case_table (case_id, type, properties)
            VALUES ( case.case_id, case.type, case.properties );
        END
        $$

        SELECT save_case(ROW('123','farmer','{"name": "Jo"}')::case_table);

        cursor.execute('SELECT save_case(%s)', [case])

        psycopg2.extensions.register_adapter(CaseSQL, case_adapter)

    https://github.com/dimagi/commcare-hq/blob/4375b4a1e4107616abe686550fb13ed73542d054/corehq/form_processor/utils/sql.py
  24. App Layer (Django)

    • Raw queries
      ◦ Disable Django ORM queries

        class DisabledDbMixin(object):
            def save(self, *args, **kwargs):
                raise AccessRestricted('Direct object save disabled.')

        class RestrictedManager(models.Manager):
            def get_queryset(self):
                raise AccessRestricted('Only "raw" queries allowed')
  25. App Layer (Django)

    • DB Router
      ◦ Queries
      ◦ Migrations

        $ ./manage.py migrate --database=proxy

    • Tools for managing PL/Proxy cluster

        $ ./manage.py migrate_multi
        $ ./manage.py configure_pl_proxy_cluster
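A Django database router is a plain class with `db_for_read`, `db_for_write` and `allow_migrate` methods, so the routing idea can be sketched without importing Django. The app label, the database aliases other than `proxy`, and the migration policy below are assumptions for illustration; the talk does not show the actual router.

```python
from types import SimpleNamespace


class ShardRouter:
    """Route sharded apps through the PL/Proxy database, everything else to 'default'."""

    SHARDED_APPS = {"form_processor"}  # hypothetical app label
    PROXY_DB = "proxy"
    DEFAULT_DB = "default"

    def _db_for(self, model):
        if model._meta.app_label in self.SHARDED_APPS:
            return self.PROXY_DB
        return self.DEFAULT_DB

    def db_for_read(self, model, **hints):
        return self._db_for(model)

    def db_for_write(self, model, **hints):
        return self._db_for(model)

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Assumed policy: sharded apps migrate on every database except
        # 'default'; all other apps migrate only on 'default'.
        if app_label in self.SHARDED_APPS:
            return db != self.DEFAULT_DB
        return db == self.DEFAULT_DB


# Stand-ins for Django model classes, enough to exercise the router without Django.
case_model = SimpleNamespace(_meta=SimpleNamespace(app_label="form_processor"))
user_model = SimpleNamespace(_meta=SimpleNamespace(app_label="users"))
router = ShardRouter()
```

In a real project a class like this would be listed in the `DATABASE_ROUTERS` setting, and `migrate --database=<alias>` would consult `allow_migrate` for each app.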
  26. New architecture

    [Diagram: nginx; Django (×3); Celery (×3); Redis (cache); Elasticsearch; Stream processing; Kafka; CouchDB Cluster; PostgreSQL (proxy, p1 to pN); RiakCS]
  27. Gotchas

    • Transactions
      ◦ Django’s connection is with the proxy
      ◦ ‘proxy’ issues autocommit transactions to the shard DBs
      ◦ Even if Django rolls back, the effects on the shard DBs persist
    • Returning results from multiple databases
      ◦ count_cases_in_domain (RUN ON ALL)
      ◦ One result from each shard DB is returned
      ◦ SELECT sum(c) AS count FROM count_cases_in_domain('x') as t(c);
    • Limiting / sorting results
      ◦ ‘limit’ and ‘sort’ operations happen on the shard DBs
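The last two gotchas both mean that per-shard results have to be combined again by the caller. A minimal sketch of that combining step, assuming each shard has already returned its own count and its own sorted, limited rows (the sample data below is made up):

```python
import heapq
from itertools import islice

# One count per shard, as a RUN ON ALL function would return them.
per_shard_counts = [12, 7, 30]
total_count = sum(per_shard_counts)


def merge_sorted_limited(shard_results, limit):
    """Merge already-sorted per-shard rows and re-apply the limit globally."""
    return list(islice(heapq.merge(*shard_results), limit))


# Each shard applied ORDER BY and LIMIT locally; rows are (date, case_id).
shard_results = [
    [("2016-01-02", "a"), ("2016-03-01", "d")],
    [("2016-01-01", "b"), ("2016-02-10", "c")],
]
top_three = merge_sorted_limited(shard_results, 3)
```

Each shard needs to return at least `limit` rows for the merged result to be correct, since in the worst case all of the top rows live on a single shard.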
  28. ➔ PL/Proxy
      ➔ Django Multi-Database support
      ➔ Horizontal scaling with PL/Proxy
      ➔ RiakCS
      ➔ Kafka
      ➔ www.dimagi.com
      ➔ www.commcarehq.org
      ➔ github.com/dimagi/commcare-hq