Slide 18
Slide 18 text
Apache Crunch
• Abstraction layer on top of MapReduce
• More developer-friendly than Pig/Hive (in many cases)
• Modeled after Google’s FlumeJava
• Flexible Data Types:
PCollection, PTable
• Simple but Powerful Operations:
parallelDo(), groupByKey(), combineValues(),
flatten(), count(), join(), sort(), top()
• Robust join strategies: reduce-side, map-side, sharded joins, bloom-filter joins, cogroups, etc.
• Three runner pipelines: MRPipeline (Hadoop), SparkPipeline,
MemPipeline
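To make the types, operations, and runners above a little more concrete, here is a minimal word-count-style sketch of a Crunch pipeline. This is not the pipeline from this talk; the class name and the input/output paths are illustrative.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCountSketch {
  public static void main(String[] args) {
    // MemPipeline runs the whole pipeline in local memory (handy for tests);
    // MRPipeline or SparkPipeline would run the same code on a cluster.
    Pipeline pipeline = MemPipeline.getInstance();

    PCollection<String> lines = pipeline.readTextFile("input.txt");

    // parallelDo() applies a DoFn to every element and can emit zero or more outputs.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // count() does the groupByKey/combineValues work under the covers,
    // producing a PTable of word -> occurrence count.
    PTable<String, Long> counts = words.count();

    pipeline.writeTextFile(counts, "counts-out");
    pipeline.done();
  }
}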
So for these reasons, we decided to give Apache Crunch a try.
Crunch is an abstraction layer for defining data pipelines that, under the covers, compile down to a series of MapReduce jobs. Distinct from other abstraction layers like, say, Pig and Hive, it’s geared less towards data
scientists and more towards developers. It’s more like Cascading/Scalding, if you are familiar with those. The Crunch API itself is modeled after Google’s FlumeJava project, which is what they use for this at Google.
Its goal is to be simple and flexible. We found that, with some tweaks, Crunch actually enabled us to meet all of our goals. So let me just jump right into what we did…
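As a hedged aside on the join-strategies bullet from this slide: in code, those variants are chosen through Crunch’s JoinStrategy interface. The sketch below uses the plain reduce-side DefaultJoinStrategy; the table names and key type are hypothetical, not from this talk.

import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.lib.join.DefaultJoinStrategy;
import org.apache.crunch.lib.join.JoinStrategy;
import org.apache.crunch.lib.join.JoinType;

public class JoinSketch {
  // Inner-joins two PTables on a shared string key (e.g. a user id).
  public static PTable<String, Pair<Long, String>> innerJoin(
      PTable<String, Long> countsByUser, PTable<String, String> namesByUser) {
    // DefaultJoinStrategy is the classic reduce-side join; MapsideJoinStrategy,
    // ShardedJoinStrategy, or BloomFilterJoinStrategy can be dropped in instead
    // when one input is small, keys are skewed, or one side is sparse.
    JoinStrategy<String, Long, String> strategy =
        new DefaultJoinStrategy<String, Long, String>();
    return strategy.join(countsByUser, namesByUser, JoinType.INNER_JOIN);
  }
}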