Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Iterating a data platform

Sponsored · Your Podcast. Everywhere. Effortlessly. Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.

Iterating a data platform

Slides from a short talk I gave at the Bristol CouchDB talk on 13th March 2013.

Avatar for Dan Fairs

Dan Fairs

March 13, 2013

More Decks by Dan Fairs

Other Decks in Technology

Transcript

  1. BROADCAST INTELLIGENCE ITERATING A DATA PLATFORM Friday, 15 March 13

    We’re a broadcast analytics company that is powered by convos in social media. MEASURE AND UNDERSTAND TV AUDIENCES With the advent of social media convos have moved from the living room to the web. We spotted an opportunity to harness these conversations to provide insights to allow Broadcasters, prog makers, advertisers, brand - understand the audience better. Development of the platform started in late 2011, and I’m going to talk about how its architecture developed and changed. I hope to provide you with some useful pointers for managing change in a data platform while maintaining your sanity and your clients’ unconditional adoration.
  2. WHAT WE DO • Ingest TV listings and figure out

    search terms • Search the firehose for matching Tweets • Show them in a pretty UI • Profit! Friday, 15 March 13 We live in the hip and groovy world of audience measurement - the social equivalent of 9m people watched EastEnders
  3. • Let’s start with a demo, so we can see

    where we’re heading. DEMO Friday, 15 March 13
  4. SERVERS > Friday, 15 March 13 Aside: Real servers >

    cloud servers for our application Relatively constant load. EC2 *very* expensive. We also do broadcast monitoring, hard to hang an aerial on the cloud.
  5. Friday, 15 March 13 Hardware problems meant it took almost

    THREE MONTHS to commission one particular server
  6. ARCHITECTURE V1 Simple!* * includes absolutely free, no-cost, glossed-over details!

    Friday, 15 March 13 Logotastic! HTML5/JavaScript application REST API powered by Django based on a PostgreSQL database There’s also a conventional XML data feed containing TV listings data - but batch processing XML data feeds is boooring We suck Tweets in from DataSift (who get them from Twitter, obvs), enrich them with extra metadata, and store them - at this point, in PostgreSQL.
  7. KEY POINT: LOOSE COUPLING Friday, 15 March 13 ZeroMQ brings

    sanity to network programming No, it’s not really a message queue. Well, it is, but that’s only an implementation detail. Provides a few socket types: push/pull, publish/subscribe, request/reply, router/dealer Various patterns possible, see the Guide Lets you easily write small daemons that talk to each other over 0MQ - discourages shared state, encourages message-based architecture = easier to scale This probably sounds familiar to Erlang programmers ;) Many language bindings = use the right language/tool for the job, plumb your C/Python/Erlang bits together using 0MQ Can even use inproc to get rid of traditional shared state/mutex architectures with internal message passing = much more reliable, faster. We use 0MQ in our tweet processing pipeline - tweets come in from DS, we fan out to a number of workers, enrich and persist Factor your application in components talking over 0MQ, and you need maintain only your message formats, no fragile, monolithic architecture
  8. DATA GROWTH • Data grows • Often >5m Tweets per

    week • ~ 15GB Friday, 15 March 13 When everything’s in one DB, and you’re decomposing JSON into fields in a table, this gets pretty painful
  9. ELEPHANT IN THE ROOM Friday, 15 March 13 Tweets are

    JSON documents PostgreSQL is rubbish for working with JSON documents. It’s a relational database, after all. Large tables are painful All we could do was offer some simple tweet counts and filters Any extra filter required ALTER TABLE (which is actually OK in PostgreSQL) to add a field for it, and modifications to the processing pipeline to extract the field from the tweet JSON and store it
  10. IF ONLY... • If only we had a data store

    that: • Stored JSON natively • Talked HTTP for easy integration • Handled slowly changing schemas gracefully • Was very fast to query • Just Add Servers Friday, 15 March 13 Anyone? Anyone?
  11. COUCHDB • Stores JSON natively: \o/ • Talks HTTP for

    easy integration: \o/ • Handles slowly changing schemas gracefully: \o/ • Very fast to query: \o/ • Just Add Servers.... Uh-oh. Friday, 15 March 13
  12. BIGCOUCH \o/ Friday, 15 March 13 There are far better

    qualified people than I to talk about BigCouch! But it seemed to tick all the boxes
  13. ARCHITECTURE V2 Less simple... but works! Friday, 15 March 13

    Still storing most metadata in PostgreSQL Tweets also stored in CouchDB Note we continued storing Tweets in PG for the time being, in a simplified form - basically just transmission id and json. Yes I know about hstore. No we’re not using it (yet?).
  14. EVOLVING APPLICATIONS WITH COUCHDB • Things were great for a

    while • I even priced up using Cloudant • I asked their support folks Difficult Questions • Then we had some new requirements • Apply multiple filters at once • Smattering of ad-hoc analysis Friday, 15 March 13
  15. UH-OH • View build time • Cloudant does this better,

    but... • Combinatorial view explosion Friday, 15 March 13 Basically need a view in CouchDB for each kind of query you’re going to do This rapidly grows extremely painful as you add more filters
  16. IF ONLY... • If only we had a data store

    that: • Stored JSON natively • Talked HTTP for easy integration • Handled slowly changing schemas gracefully • Was very fast to query • And supported flexible, ad-hoc-ish querying • Just Add Servers Friday, 15 March 13
  17. ARCHITECTURE V3 - TRANSITIONAL Less simple... but works! Friday, 15

    March 13 Still storing most metadata in PostgreSQL So, CouchDB continued to be pretty awesome even when we were migrating away ES has rivers - incoming flows of data. Trivial to hook up to CouchDB’s _changes feed. Allowed us to test ES without ripping up the architecture
  18. • HTTP rocks for external system integration • If your

    system can talk HTTP, it makes life easier for everyone • ... but you CouchDB folks knew that already :) KEY POINT: HTTP FOR EXTERNAL SERVICES Friday, 15 March 13
  19. ARCHITECTURE V4 Removed CouchDB from the stack Friday, 15 March

    13 Still storing most metadata in PostgreSQL Tweets also stored in Elasticsearch Same architecture
  20. TODAY • That’s basically the architecture we run today •

    Yes, tweets still get written to PostgreSQL • Dump tweets older than 1 month to S3 • Not without problems (OHAI AUTOVACUUM) Friday, 15 March 13 Autovaccuum - among other things - reclaims space. Bit like compaction. Poor case which we hit with large inserts and deletes -> partitioning
  21. THE PRODIGAL DATABASE? • That’s not the end of the

    CouchDB story though • PostgreSQL isn’t the right choice for either: • The short-term storage before tweets go into Elasticsearch • Longer-term storage for data mining Friday, 15 March 13 Currently, tweets come in live and aren’t persisted until they’ve been enriched and stored in PostgreSQL Not ideal - if something in the 0MQ pipeline breaks, we lose tweets. CouchDB could be a good fit here, as a short term store. Rendundant streamer with own standalone store. Can then replicate in on-demand. Some challenges around Adaptive Search. Longer term storage, maybe less likely - something like Cassandra or HDFS probably a better fit - would be nice to leverage all the tooling in the Hadoop world, even if it looks a nightmare to maintain...
  22. TAKEAWAYS • Manage your technical debt • Sometimes it’s OK

    to borrow to get your product done quicker • Make sure you pay it back! Friday, 15 March 13
  23. TAKEAWAYS • Build loosely-coupled systems • It’s easier to upgrade

    individual pieces and components when you have well-defined interfaces between them Friday, 15 March 13 For us, this basically boils down to the JSON structure of the Tweet.
  24. TAKEAWAYS • Automated tests are crucial • Who cares whether

    they’re unit, integration, functional, Byzantine, or minty • They should be there, they should have good coverage, and they should be automated Friday, 15 March 13 Tough to make three-and-a-bit architecture changes without comprehensive tests to make sure you haven’t broken anything. Especially when the backend tech team is just you.
  25. ELEPHANT IN THE ROOM = Friday, 15 March 13 All

    this time we’d kept on storing those tweets in PostgreSQL Turns out that while PostgreSQL is rubbish for querying json objects, it’s actually pretty good at storing lots of text TOAST - best thing since sliced bread - allows large values to be stored efficiently, compressed transparently Kept a lid on disk usage while we figured stuff out Tens to hundreds of millions of rows in a table (just don’t query it by anything except a very selective index) Handy! PostgreSQL rocks. POSTGRESQL ROCKS Sure, it has its limitations. But it’s an incredible piece of software. Not all problems are non-relational, folks!