Iterating a data platform

BROADCAST INTELLIGENCE ITERATING A DATA PLATFORM Friday, 15 March 13
We’re a broadcast analytics company that is powered by convos in social media. MEASURE AND UNDERSTAND TV AUDIENCES With the advent of social media convos have moved from the living room to the web. We spotted an opportunity to harness these conversations to provide insights to allow Broadcasters, prog makers, advertisers, brand - understand the audience better. Development of the platform started in late 2011, and I’m going to talk about how its architecture developed and changed. I hope to provide you with some useful pointers for managing change in a data platform while maintaining your sanity and your clients’ unconditional adoration.

WHAT WE DO We map Tweets to TV Friday, 15
March 13 Huh?

WHAT WE DO • Ingest TV listings and figure out
search terms • Search the firehose for matching Tweets • Show them in a pretty UI • Profit! Friday, 15 March 13 We live in the hip and groovy world of audience measurement - the social equivalent of 9m people watched EastEnders

EVENT BASED SHOWS Friday, 15 March 13

• Let’s start with a demo, so we can see
where we’re heading. DEMO Friday, 15 March 13

SERVERS > Friday, 15 March 13 Aside: Real servers >
cloud servers for our application Relatively constant load. EC2 *very* expensive. We also do broadcast monitoring, hard to hang an aerial on the cloud.

SERVERS Friday, 15 March 13 UNTIL...

Friday, 15 March 13 Hardware problems meant it took almost
THREE MONTHS to commission one particular server

ARCHITECTURE V1 Simple!* * includes absolutely free, no-cost, glossed-over details!
Friday, 15 March 13 Logotastic! HTML5/JavaScript application REST API powered by Django based on a PostgreSQL database There’s also a conventional XML data feed containing TV listings data - but batch processing XML data feeds is boooring We suck Tweets in from DataSift (who get them from Twitter, obvs), enrich them with extra metadata, and store them - at this point, in PostgreSQL.

KEY POINT: LOOSE COUPLING Friday, 15 March 13 ZeroMQ brings
sanity to network programming No, it’s not really a message queue. Well, it is, but that’s only an implementation detail. Provides a few socket types: push/pull, publish/subscribe, request/reply, router/dealer Various patterns possible, see the Guide Lets you easily write small daemons that talk to each other over 0MQ - discourages shared state, encourages message-based architecture = easier to scale This probably sounds familiar to Erlang programmers ;) Many language bindings = use the right language/tool for the job, plumb your C/Python/Erlang bits together using 0MQ Can even use inproc to get rid of traditional shared state/mutex architectures with internal message passing = much more reliable, faster. We use 0MQ in our tweet processing pipeline - tweets come in from DS, we fan out to a number of workers, enrich and persist Factor your application in components talking over 0MQ, and you need maintain only your message formats, no fragile, monolithic architecture

DATA GROWTH • Data grows • Often >5m Tweets per
week • ~ 15GB Friday, 15 March 13 When everything’s in one DB, and you’re decomposing JSON into ﬁelds in a table, this gets pretty painful

ELEPHANT IN THE ROOM Friday, 15 March 13 Tweets are
JSON documents PostgreSQL is rubbish for working with JSON documents. It’s a relational database, after all. Large tables are painful All we could do was offer some simple tweet counts and filters Any extra filter required ALTER TABLE (which is actually OK in PostgreSQL) to add a field for it, and modifications to the processing pipeline to extract the field from the tweet JSON and store it

IF ONLY... • If only we had a data store
that: • Stored JSON natively • Talked HTTP for easy integration • Handled slowly changing schemas gracefully • Was very fast to query • Just Add Servers Friday, 15 March 13 Anyone? Anyone?

Friday, 15 March 13 Anyone? Anyone?

COUCHDB • Stores JSON natively: \o/ • Talks HTTP for
easy integration: \o/ • Handles slowly changing schemas gracefully: \o/ • Very fast to query: \o/ • Just Add Servers.... Uh-oh. Friday, 15 March 13

BIGCOUCH \o/ Friday, 15 March 13 There are far better
qualiﬁed people than I to talk about BigCouch! But it seemed to tick all the boxes

ARCHITECTURE V2 Less simple... but works! Friday, 15 March 13
Still storing most metadata in PostgreSQL Tweets also stored in CouchDB Note we continued storing Tweets in PG for the time being, in a simpliﬁed form - basically just transmission id and json. Yes I know about hstore. No we’re not using it (yet?).

EVOLVING APPLICATIONS WITH COUCHDB • Things were great for a
while • I even priced up using Cloudant • I asked their support folks Difficult Questions • Then we had some new requirements • Apply multiple filters at once • Smattering of ad-hoc analysis Friday, 15 March 13

UH-OH • View build time • Cloudant does this better,
but... • Combinatorial view explosion Friday, 15 March 13 Basically need a view in CouchDB for each kind of query you’re going to do This rapidly grows extremely painful as you add more ﬁlters

IF ONLY... • If only we had a data store
that: • Stored JSON natively • Talked HTTP for easy integration • Handled slowly changing schemas gracefully • Was very fast to query • And supported flexible, ad-hoc-ish querying • Just Add Servers Friday, 15 March 13

ELASTICSEARCH \o/ Friday, 15 March 13

ARCHITECTURE V3 - TRANSITIONAL Less simple... but works! Friday, 15
March 13 Still storing most metadata in PostgreSQL So, CouchDB continued to be pretty awesome even when we were migrating away ES has rivers - incoming ﬂows of data. Trivial to hook up to CouchDB’s _changes feed. Allowed us to test ES without ripping up the architecture

• HTTP rocks for external system integration • If your
system can talk HTTP, it makes life easier for everyone • ... but you CouchDB folks knew that already :) KEY POINT: HTTP FOR EXTERNAL SERVICES Friday, 15 March 13

ARCHITECTURE V4 Removed CouchDB from the stack Friday, 15 March
13 Still storing most metadata in PostgreSQL Tweets also stored in Elasticsearch Same architecture

TODAY • That’s basically the architecture we run today •
Yes, tweets still get written to PostgreSQL • Dump tweets older than 1 month to S3 • Not without problems (OHAI AUTOVACUUM) Friday, 15 March 13 Autovaccuum - among other things - reclaims space. Bit like compaction. Poor case which we hit with large inserts and deletes -> partitioning

THE PRODIGAL DATABASE? • That’s not the end of the
CouchDB story though • PostgreSQL isn’t the right choice for either: • The short-term storage before tweets go into Elasticsearch • Longer-term storage for data mining Friday, 15 March 13 Currently, tweets come in live and aren’t persisted until they’ve been enriched and stored in PostgreSQL Not ideal - if something in the 0MQ pipeline breaks, we lose tweets. CouchDB could be a good ﬁt here, as a short term store. Rendundant streamer with own standalone store. Can then replicate in on-demand. Some challenges around Adaptive Search. Longer term storage, maybe less likely - something like Cassandra or HDFS probably a better ﬁt - would be nice to leverage all the tooling in the Hadoop world, even if it looks a nightmare to maintain...

TAKEAWAYS • Manage your technical debt • Sometimes it’s OK
to borrow to get your product done quicker • Make sure you pay it back! Friday, 15 March 13

TAKEAWAYS • Build loosely-coupled systems • It’s easier to upgrade
individual pieces and components when you have well-defined interfaces between them Friday, 15 March 13 For us, this basically boils down to the JSON structure of the Tweet.

TAKEAWAYS • Automated tests are crucial • Who cares whether
they’re unit, integration, functional, Byzantine, or minty • They should be there, they should have good coverage, and they should be automated Friday, 15 March 13 Tough to make three-and-a-bit architecture changes without comprehensive tests to make sure you haven’t broken anything. Especially when the backend tech team is just you.

ELEPHANT IN THE ROOM = Friday, 15 March 13 All
this time we’d kept on storing those tweets in PostgreSQL Turns out that while PostgreSQL is rubbish for querying json objects, it’s actually pretty good at storing lots of text TOAST - best thing since sliced bread - allows large values to be stored efficiently, compressed transparently Kept a lid on disk usage while we ﬁgured stuff out Tens to hundreds of millions of rows in a table (just don’t query it by anything except a very selective index) Handy! PostgreSQL rocks. POSTGRESQL ROCKS Sure, it has its limitations. But it’s an incredible piece of software. Not all problems are non-relational, folks!

? Friday, 15 March 13

BROADCAST INTELLIGENCE @secondsync @danfairs Friday, 15 March 13

Iterating a data platform

Iterating a data platform

Dan Fairs

More Decks by Dan Fairs

Other Decks in Technology

Featured

Transcript

BROADCAST INTELLIGENCE ITERATING A DATA PLATFORM Friday, 15 March 13

WHAT WE DO We map Tweets to TV Friday, 15

WHAT WE DO • Ingest TV listings and figure out

EVENT BASED SHOWS Friday, 15 March 13

EVENT BASED SHOWS Friday, 15 March 13

• Let’s start with a demo, so we can see

SERVERS > Friday, 15 March 13 Aside: Real servers >

SERVERS Friday, 15 March 13 UNTIL...

Friday, 15 March 13 Hardware problems meant it took almost

ARCHITECTURE V1 Simple!* * includes absolutely free, no-cost, glossed-over details!

KEY POINT: LOOSE COUPLING Friday, 15 March 13 ZeroMQ brings

DATA GROWTH • Data grows • Often >5m Tweets per

ELEPHANT IN THE ROOM Friday, 15 March 13 Tweets are

IF ONLY... • If only we had a data store

Friday, 15 March 13 Anyone? Anyone?

COUCHDB • Stores JSON natively: \o/ • Talks HTTP for

BIGCOUCH \o/ Friday, 15 March 13 There are far better

ARCHITECTURE V2 Less simple... but works! Friday, 15 March 13

EVOLVING APPLICATIONS WITH COUCHDB • Things were great for a

UH-OH • View build time • Cloudant does this better,

IF ONLY... • If only we had a data store

ELASTICSEARCH \o/ Friday, 15 March 13

ARCHITECTURE V3 - TRANSITIONAL Less simple... but works! Friday, 15

• HTTP rocks for external system integration • If your

ARCHITECTURE V4 Removed CouchDB from the stack Friday, 15 March

TODAY • That’s basically the architecture we run today •

THE PRODIGAL DATABASE? • That’s not the end of the

TAKEAWAYS • Manage your technical debt • Sometimes it’s OK

TAKEAWAYS • Build loosely-coupled systems • It’s easier to upgrade

TAKEAWAYS • Automated tests are crucial • Who cares whether

ELEPHANT IN THE ROOM = Friday, 15 March 13 All

? Friday, 15 March 13

BROADCAST INTELLIGENCE @secondsync @danfairs Friday, 15 March 13