Billions of Emails Synced with Python

Nylas
August 13, 2017

The open source Nylas Sync Engine provides a RESTful API on top of a powerful email sync platform, making it easy to build messaging into apps. It’s built with Python and gevent and has synced billions of messages over the lifetime of its deployment. In this talk, we’ll show you how it’s built and which technical challenges we’ve solved along the way.

Transcript

  1. What we’re going to talk about today
    • What does Nylas do & why did we build a sync engine?
    • Technical Architecture & Stack
    • Technical Challenges
    • What’s next?
  3. Why? Works pretty OK when...
    • You know what the email you’re parsing looks like
    • You’re only working with a single email provider
    => Highly constrained!
  4. Why? Authentication complexity
    • No standard for email address => provider settings
    • OAuth2 or password authentication
      ◦ Error messages are not standardized
  5. Why? Protocol complexity
    • IMAP is a TCP protocol & has many server implementations
      ◦ extensions, Gmail labels, server-dependent errors
    • Exchange ActiveSync (WBXML)
    • Exchange Web Services (SOAP)
      ◦ Is there even a library for that in $LANG? Is it any good?
  6. Why? Parsing complexity
    • Many specs
    • Messages are encoded on clients & sometimes clients violate the specs
    • MIME, base64, 7bit & 8bit, quoted-printable, plaintext or HTML, folded & encoded-words headers, attachments...
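To make the list above concrete, here is a minimal sketch that runs one hand-written message through Python's standard `email` package. The sample message, the `summarize` helper, and the addresses are made up for illustration; this is not the Sync Engine's own parsing code, and real-world mail is much messier than this well-formed example.

```python
from email import message_from_bytes, policy

# Hypothetical sample: encoded-word headers, a folded Subject line,
# a quoted-printable text part and a base64-encoded attachment.
RAW = (
    b"From: =?utf-8?b?SsO8cmdlbg==?= <jurgen@example.com>\r\n"
    b"Subject: =?utf-8?q?Caf=C3=A9_menu?=\r\n"
    b" for Friday\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Cr=C3=A8me br=C3=BBl=C3=A9e is back.\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b'Content-Disposition: attachment; filename="notes.txt"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"aGVsbG8=\r\n"
    b"--BOUNDARY--\r\n"
)


def summarize(raw):
    """Decode headers, the plain-text body and attachments of one message."""
    # policy.default enables the modern API, which unfolds headers and
    # decodes encoded-words when a header is read.
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "from_name": msg["From"].addresses[0].display_name,
        "subject": str(msg["Subject"]),
        # get_content() undoes quoted-printable and the charset encoding.
        "text": body.get_content().strip() if body is not None else None,
        # For application/* parts get_content() returns the decoded bytes.
        "attachments": {
            part.get_filename(): part.get_content()
            for part in msg.iter_attachments()
        },
    }


summary = summarize(RAW)
```

Each key in `summary` corresponds to one of the decoding steps named on the slide; a message from a client that violates the specs may need fallbacks at every one of them.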
  7. Why? Sending complexity
    • For IMAP servers at least, you have to use a different protocol to send email (SMTP)
      ◦ Sometimes integrated w/IMAP, sometimes not (no way to find out but try it & see)
    • Exchange servers rewrite input & can mangle
  8. Why? Integrating gets harder when...
    • You have to parse & filter many, non-specific emails
    • You need compatibility with many different email providers
    ugh!
  9. The Nylas Sync Engine & API: A Modern REST API for Email, Contacts, & Calendar
  10. What we’re going to talk about today
    • What does Nylas do & why did we build a sync engine?
    • Technical Architecture & Stack
    • Technical Challenges
    • What’s next?
  11. Tech Stack
    • ~80,000 lines of Python 2.7
    • Flask, gevent, SQLAlchemy, pytest
    • HAProxy -> nginx -> gunicorn (w/gevent-pywsgi)
    • MySQL (mostly primary-replica clusters we manage on EC2)
    • ProxySQL
    • Ansible
    • Redis
  12. Tech Stack
    “Let's say every company gets about three innovation tokens. You can spend these however you want, but the supply is fixed for a long while.”
    @mcfunley, http://mcfunley.com/choose-boring-technology
  13. Architecture
    Two possible strategies:
    • Store minimal data & proxy requests to upstream providers
    • Mirror contents of mailboxes & serve most requests directly
  14. Architecture
    Two possible strategies:
    • Store minimal data & proxy requests to upstream providers
    • Mirror contents of mailboxes & serve most requests directly
    Reliability & Speed!
  15. Architecture: A semi-monolithic application
    [Diagram labels: clients, haproxy, API fleet, Sync fleet, ProxySQL (multiple instances), Redis (two instances), Global DB, Sharded DB (three instances)]
  16. What we’re going to talk about today
    • What does Nylas do & why did we build a sync engine?
    • Technical Architecture & Stack
    • Technical Challenges
    • What’s next?
  17. API Philosophy
    Our clients should build one integration, not many. That means we must build a unified API that is consistent across email providers.
  18. Database Sharding
    • People have a lot of email. One of the first scaling challenges we had to solve was data storage.
    • Our primary data store is sharded using MySQL autoincrements on primary keys.
    • https://www.nylas.com/blog/growing-up-with-mysql/
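One common way to shard on autoincrement primary keys is to reserve the high bits of each 64-bit key for a shard ID and start every shard's autoincrement counter at a different offset. The sketch below assumes that scheme; the 16/48 bit split, the unsigned 64-bit key, and all function names are illustrative assumptions, not details confirmed by the slides.

```python
# Assumed layout of an unsigned 64-bit primary key (illustrative only):
#   [ 16 bits shard ID | 48 bits per-shard autoincrement counter ]
SHARD_BITS = 16
COUNTER_BITS = 48
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def first_autoincrement(shard_id):
    """Value a shard's AUTO_INCREMENT counter would start from."""
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    return (shard_id << COUNTER_BITS) + 1


def shard_for_id(primary_key):
    """Recover the shard ID from any primary key generated on that shard."""
    return primary_key >> COUNTER_BITS


def local_counter(primary_key):
    """Position of the row within its shard's counter range."""
    return primary_key & COUNTER_MASK


# A row with this key can be routed without consulting a lookup table.
example_key = first_autoincrement(3) + 41
example_shard = shard_for_id(example_key)
```

The appeal of a layout like this is that any service holding an object ID can pick the right database connection with a bit shift, at the cost of fixing the maximum shard count up front.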
  19. MySQL Transaction Log
    • We record changes to mailboxes in a table as we sync them
    • Translates document store => changesets for easier sync
    • Powers webhooks, streaming API, internal services
  20. Architecture: MySQL transaction log
    [Diagram labels: Sync fleet ("Writes new entries"), Webhooks fleet ("Polls & consumes entries"), ProxySQL (multiple instances), Sharded DB (two instances, each with a Transaction table)]
  21. MySQL Transaction Log
    Why MySQL??
    • It’s technically not the right tool for the job
    • … but it was one less thing to set up, maintain, learn
    • Can write entries in same SQL transaction & guarantee atomicity
    • We knew it wouldn’t lose our data
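The atomicity point can be sketched in a few lines: the data change and its log entry are written in one SQL transaction, and a consumer polls the log table using the last ID it has seen. This sketch uses the standard library's `sqlite3` as a stand-in for MySQL so that it runs anywhere; the table layout, column names, and function names are assumptions made for illustration, not the Sync Engine's schema.

```python
import sqlite3


def connect():
    """In-memory stand-in database with a data table and a log table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE message (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            subject TEXT NOT NULL
        );
        CREATE TABLE transaction_log (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            object_type TEXT NOT NULL,
            object_id INTEGER NOT NULL,
            command TEXT NOT NULL
                CHECK (command IN ('insert', 'update', 'delete'))
        );
        """
    )
    return conn


def record_change(conn, subject, command="insert"):
    """Write a message row and its log entry in one transaction."""
    # "with conn" commits if the block succeeds and rolls back if it
    # raises, so a message never exists without its log entry.
    with conn:
        cur = conn.execute(
            "INSERT INTO message (subject) VALUES (?)", (subject,)
        )
        conn.execute(
            "INSERT INTO transaction_log (object_type, object_id, command) "
            "VALUES ('message', ?, ?)",
            (cur.lastrowid, command),
        )
    return cur.lastrowid


def poll(conn, after_id, limit=100):
    """Return log entries newer than after_id plus the new cursor."""
    rows = conn.execute(
        "SELECT id, object_type, object_id, command FROM transaction_log "
        "WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()
    new_cursor = rows[-1][0] if rows else after_id
    return rows, new_cursor
```

A consumer such as a webhook sender would call `poll` in a loop and persist the returned cursor. Every consumer repeating that query is the polling cost that motivates looking at a dedicated log system.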
  22. MySQL Transaction Log
    The Future
    • With MySQL, all clients must poll to get updates
    • Excessive locking, DB load … expensive
    • The right tool in 2017 is probably Kafka
      ◦ Starting early experimentation now!
  23. Architecture: Sync Fleet
    Avoid the GIL: Use multiple processes on multicore machines!
    [Diagram: three EC2 instances, each running processes sync-1, sync-2, sync-3, sync-4]
  24. Sync Processes
    • Gevent to sync multiple accounts on a single process
      ◦ ~100 accounts per process
      ◦ Minimizes overhead from open sockets to IMAP providers
  25. Architecture: Sync Process
    [Diagram labels: Sync Service, Gmail Sync, Exchange Sync, All Mail, Trash, Inbox, Folder 1, Calendar, Contacts]
  26. Sync Load Balancing
    • Mailboxes are heterogeneous (different providers, different protocols, different sizes & rate of new mail receipt…)
    • Can’t easily predict how “expensive” it will be to sync
    • Measure time spent active in greenlets for each account & run manual load balances
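Once active time has been measured per account, rebalancing becomes a bin-packing problem. The slide describes the rebalancing as manual; the sketch below shows one simple heuristic a script could apply (largest account first, always onto the currently least-loaded process). The `balance` function and the sample numbers are hypothetical.

```python
import heapq


def balance(active_seconds, n_processes):
    """Greedily assign accounts to processes by measured active time.

    active_seconds maps account ID -> seconds spent active in greenlets.
    Returns a dict mapping process index -> list of account IDs.
    """
    # Min-heap of (current load, process index): the least-loaded
    # process is always at the front.
    heap = [(0.0, proc) for proc in range(n_processes)]
    heapq.heapify(heap)
    assignment = {proc: [] for proc in range(n_processes)}

    # Placing the most expensive accounts first keeps the final loads
    # closer together than placing them in arbitrary order.
    by_cost = sorted(active_seconds.items(), key=lambda kv: kv[1], reverse=True)
    for account, cost in by_cost:
        load, proc = heapq.heappop(heap)
        assignment[proc].append(account)
        heapq.heappush(heap, (load + cost, proc))
    return assignment


measured = {"a": 9, "b": 7, "c": 6, "d": 5, "e": 4, "f": 2}
plan = balance(measured, n_processes=2)
```

A heuristic like this only balances what was measured. Because sync cost is hard to predict, any plan would need to be recomputed as mailboxes change.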
  27. Greenlet Instrumentation
    • 3+ greenlets per account
    • ~100 accounts per process
    • 16 processes per machine
    • Dozens of machines
  28. Greenlet Instrumentation
    • Greenlet scheduling is cooperative
      ◦ Watch out for non-cooperative behaviour!
    • Excessive parallelism can cause delays in greenlet execution
    • Separate thread on each sync process that serves stack samples for generating flame graphs for profiling
    • https://github.com/nylas/nylas-perftools
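The sampling idea can be sketched with the standard library alone: a background thread periodically records the call stack of every other thread and counts identical stacks, producing the "folded" text format that flame graph tools consume. This is an illustrative stand-in, not the nylas-perftools API; the `StackSampler` class and its methods are made up here, and it omits the part that serves samples over the network. In a gevent process a sampler like this should see whichever greenlet happens to be running when the sample is taken.

```python
import collections
import sys
import threading
import time


class StackSampler:
    """Periodically sample the stacks of all other threads."""

    def __init__(self, interval=0.005):
        self.interval = interval
        self.counts = collections.Counter()
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()

    def _run(self):
        own_id = threading.get_ident()
        # Event.wait returns False on timeout, so this loops once per
        # interval until stop() sets the event.
        while not self._stop_event.wait(self.interval):
            for thread_id, frame in sys._current_frames().items():
                if thread_id == own_id:
                    continue  # do not profile the sampler itself
                names = []
                while frame is not None:
                    names.append(frame.f_code.co_name)
                    frame = frame.f_back
                # Outermost caller first, separated by semicolons.
                self.counts[";".join(reversed(names))] += 1

    def collapsed(self):
        """One 'stack count' line per distinct stack."""
        return "\n".join(
            "{} {}".format(stack, n) for stack, n in sorted(self.counts.items())
        )


def busy(deadline):
    """CPU-bound loop standing in for non-cooperative code."""
    while time.time() < deadline:
        sum(range(1000))


sampler = StackSampler(interval=0.002)
sampler.start()
busy(time.time() + 0.3)
sampler.stop()
```

Stacks that dominate `sampler.collapsed()` point at code that holds the process for long stretches, which in a cooperative scheduler is exactly what delays other greenlets.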
  29. What we’re going to talk about today
    • What does the Nylas Sync Engine do & why did we build it?
    • Technical Architecture & Stack
    • Technical Challenges
    • What’s next?
  30. What’s next?
    • Full mypy coverage
    • Python 3
    • Kafka event backbone
    • Better load balancing, enhanced webhooks, contacts & calendar features, observability, infosec...
  31. Thank you!
    • Nylas team, past & present
    • Mailgun team, Menno Smits, & other authors of libraries we depend on
    • Python core contribs
    https://nylas.com
    https://github.com/nylas/sync-engine