Using Kafka and Crunch for Realtime Processing

Using Kafka and Crunch for Realtime Processing Michael Rose, FullContact

most powerful fully-connected contact management platform for professionals and enterprises
who need to master their contacts and be awesome with people. Michael Rose [email protected] @Xorlev on Twitter Senior Software Engineer, Data Science

Data Team Scott Brave Brandon Vargo Michael Rose Before anything
I’d like to give a shout out to the rest of my team who couldn’t attend today. Scott, our CTO and leader and Brandon Vargo the other member of our team. The work I’m talking about today is deﬁnitely a team effort, not solely my own. Ken Michie, another long time FullContact employee also played an important role in helping us with this project.

A little bit about what we do (not marketing, I
promise) Maybe not intentionally.

Turn a partial contact into a full contact via API
[email protected] Bart Lorang CEO / Co-Founder FullContact, Inc. EMAIL work [email protected] TWITTER personal @lorangb PHONE work +1 303-736-9406 Gender: Male Age: 32 Location: Boulder, Colorado WEB “Enrichment” We don’t provide phone We provide an API around taking a originating identiﬁer of some sort, e.g. an email, phone, or twitter handle and return a full social proﬁle. Name, photo, demographic information, social handles, social metadata.

as a Platform for Professionals for Business Single Version of
Truth Complete, Accurate Contact Info. Synced Everywhere. FullContact Cloud Address Book Security Transcription Backups Versioning Cleansing & Correction Enrichment De-duplication & Merge Validation & Verification Sync Tagging Search Sharing Storage We’ve also spun this into a consumer facing business. We sync all of your contacts from the sources you care about. Then we ﬁx them, update them, deduplicate them, and push them out to all the places you use your contacts. I’m going to be talking about our core data technologies that underpin some of our most important functions: keeping contacts up to date and deduplicating contacts.

Enrichment Person API Sherlock WebObservations Facebook API Twitter API Foursquare
API Identibase ... Note: Not a literal cloud [email protected] { [...] "contactInfo":{ "familyName":"Lorang", "fullName":"Bart Lorang", "givenName":"Bart" } [...] } Merged Profiles I’ll focus on our Enrichment system that as it’s the primary provider of information into our data pipeline. Here’s the way it works. A user (internal or external) calls into the Person API. The Person API checks for already merged profiles and if it has a recent one it serves it up. Otherwise it tosses a message onto a queue for a service called Sherlock to work on. Sherlock scours the web for clues about identity, merges a profile from the clues, and dumps it in the database. Meanwhile, each response from each data provider (APIs, scrapers) we have is dumped into a database I’ll call WebObservations. Previously, this was powered by 18 Cassandra nodes. You can imagine the recursive nature of this system could lead to some funky results, especially if someone put someone else’s usernames in one of their profiles. This is where Identibase comes in. I’ve bolded a component labelled as Identibase because Identibase is the lynchpin of this entire system. It’s our secret sauce, and I’m about to reveal (some) of the ingredients.

Identibase • Identity resolution engine • Finds co-occurrences of contact
ﬁelds • email & phone, phone & name • .:. email -> name • Full graph database on top of HBase • Also tracks source/privacy metadata for every piece of information. • Powers Person API & AddressBook deduplication Essentially a giant adjacency list in HBase

Identibase  identity resolution [email protected] [email protected] scott.brave [email protected] @sbrave 1 2
3 4 5 1 2 3 4 5 ABC DEF 1 2 3 4 5 5 1,4 2 3 ABC DEF Identibase provides identity resolution. What do we mean by that? Lets consider two contacts, one with 3 ﬁelds and one with two. If you model contacts instead as a directed graph of co-occurrences and merge the resultant graphs you end up with a joined graph. You can imagine this is fairly useful for ﬁnding new information on contacts as well as helping to deduplicate contacts, even if they share no common information!

Identibase Here’s a non-toy example. This is a graphical representation
from a program called Identiviz. It built a graphical representation of my proﬁle and everyone else I’m linked to in some way (e.g. test accounts, shared phones, etc.) It can get as large as you might imagine, and I’m not even the most connected person around.

Enrichment Sherlock WebObservations Facebook API Twitter API Foursquare API Identibase
Query ... HBase I’ve cut out Person API as it’s not really important to this pipeline. Here you can see that Sherlock writes those observations into a database, then Hadoop magic happens, and we produce an HBase table used by the Identibase system.

Old Batch ~monthly-ish But that isn’t the only source. We
read in data from WebObservations as well as a ton of other static datasets living on HDFS, ran separate importers for each, run algorithms on top of that data, build HFiles, and *then* bulkload into new HBase cluster. This ran well for ~3 years, but there were a number of issues.

Old Batch • Raw Hadoop MapReduce • Fairly complex multi-stage
graph algorithm built on top of chained MapReduce jobs • Hard to reason about, especially for a newcomer • Run monthly, bulk load to HBase • Pulled data directly from source systems • C* cluster with web observations was never happy about this • Job wasn’t frequent enough to have “analytics” second DC • Pulled lots of other static data

We wanted to change this system. Make it realtime. Modernize
it.

Motivations • Lower the time between observing and integrating new
data • Open up new use cases not possible on a batch schedule (e.g. subscriptions) • Still maintain batch pipeline for big changes • Redo underlying storage architecture to support future extensions • Avoid impacting production datastores • Simplify pipeline code, reuse as much code as possible

Requirements Requirement: ~Real-time, within a minute or so • Need
a means of sharing data between systems • Process to ingest this information Requirement: Keep batch pipeline too (lambda architecture) • Still need MapReduce • Reuse existing code where possible Requirement: Uniﬁed data model / no live DB reads • Backfill old data into a unified model • Archive new data for batch process

Two technologies made this possible

Apache Crunch • Abstraction layer on top of MapReduce •
More developer friendly than Pig/Hive (in many cases) • Modeled after Google’s FlumeJava • Flexible Data Types:   PCollection<T>, PTable<K,V> • Simple but Powerful Operations:  parallelDo(), groupByKey(), combineValues(), flatten(), count(), join(), sort(), top() • Robust join strategies: reduce-side, map-side, sharded joins, bloom- filter joins, cogroups, etc. • Three runner pipelines: MRPipeline (Hadoop), SparkPipeline, MemPipeline So for these reasons, we decided to give Apache Crunch a try. Crunch is an abstraction layer for deﬁning data pipelines that under the covers compile to a series of map-reduce jobs Distinct from other abstraction layers like say Pig and Hive it’s geared less towards data scientists and more towards developers. More like Cascading/Scalding if you are familiar with those. The Crunch API itself is modeled after Googles FlumeJava project, which is what they use for this at Google. Their goal is to be simple & ﬂexible. We found that with some tweaks, Crunch actually enabled us to meet all of our goals. So let me just jump right into what we did…

Apache Crunch Can you imagine doing a join in MapReduce
after it’s this easy? Crunch makes it fairly easy to do otherwise nasty operations like joins, group bys, and cogroups.

Apache Kafka publish-subscribe messaging rethought as a distributed commit log
• Efficient - OS sympathy • Trivial horizontal scalability • Boring in production! • It’s a log, not a queue: Rewind, Replay, Multiple consumers. • Systems don’t have to know about each other • The log is a unified abstraction for real-time processing Advantages source: LinkedIn

The Databus

The Databus • Our Kafka-based data pipeline • The Library
(“Frizzle”) has producer/consumer helpers, common message envelope w/ metadata, and serializable POJOs for each unique topic/ system. • Schemas become the interface between systems. • Systems at FullContact publish to Kafka what they would write to a DB e.g.: • Web observations (crawl data) • Contact versions • User activity • Databus Archiver consumes all topics, archives to S3 as Hadoop SequenceFiles (long offset, byte[] serialized). Based on Pinterest’s Secor.

The Databus • Another important component was data backfill •
We wanted to pretend we always had Kafka and the databus • Backfilled all prod data and all backups of prod data into the same SequenceFile format, put on S3. Which, btw, was over 100TB of Cassandra backups and HBase backups. We were able to prune backups after that which made Ops very happy.

Once available, data quickly gets used everywhere This is a
visualization of our Kafka server’s topics and consumers. As you can see, there’s some fan-out on two of the topics.

Architecture HBase& Person'API' Iden+base' contacts' contacts'&' feedback' enriched'contacts' &'deduplica+on' Data'Bus'
Ka(a& Crunch' In+ Memory& Crunch' Hadoop& Crunch' Query& FullContact'Apps' Sherlock'Search' S3' We broke up the work into two major pieces. I worked mainly on becoming our SME on Kafka and building this databus infrastructure, hooking up systems, building schema, then doing the backﬁll of much of our observed web data to S3. My co-worker Brandon took the lead on converting the main pipeline from raw MapReduce to Apache Crunch and starting the realtime project. Sherlock and our apps contribute contacts and feedback to the databus, we archive that on S3 as well as ingest it via our realtime pipeline into HBase. We then have the same code running on batch and realtime ingest.

New Batch • Apache Crunch • Compiles down to MapReduce
jobs • 17 jobs, many run in parallel • Totally tested end-to-end with in-memory runner • Code is easy to follow • Tons of code reused from old implementation • Reads archived Kafka data (+backfill) from S3 • Same output. Totally different pipeline.

Kafka (but still batch) Sherlock Facebook API Twitter API Foursquare
API Identibase Query ... S3 Secor HBase Crunch MapReduce This is what the pipeline looks like now. Sherlock writes web observations to Kafka (the DB too, but not primarily), Secor archives the data onto S3, and then the Crunch MapReduce code reads the SequenceFiles off S3 in order to generate the graph that’s written into HBase.

Lambda Architecture • Batch base, Realtime does deltas, batch eventually
replaces • Can take liberties with realtime process, batch fixes • Run batch when needed/wanted

Realtime Platform • “Out of the box lambda” • Spark
/ Spark Streaming • Summingbird • Separate implementation • Storm • Raw Kafka consumer

Realtime Platform • Spark Streaming • Issues, unknown to us,
no code reuse • Summingbird • Very elegant, complex. • Sparse documentation • Analytics focused • Storm • Pure realtime • Would have to completely rewrite our logic • Something else (raw kafka consumer) Spark - relative unknown to us, had issues with running on cluster Summingbird - sparse documentation, felt analytics focused. Would be awesome if you worked at Twitter with the internal integrations Storm - Would require totally separate implementation It looked like we might need to do an implementation on top of Spark or try and share code as much as possible with our batch pipeline.

Crunch has an in memory mode But wait— And we
have Crunch code already…

Crunch (Ab)use • Crunch has an in-memory runner for development
• Re-use our existing code in realtime • One codebase, 2 great usecases (what a deal!) • Kafka consumer reads data, bundles into related batches, feeds into runner

How it works (Realtime) • Batching layer groups work, tracks
outstanding work • We’re indexing people, we batch around a search • Batch of work is all fed into the exact same code* as the Hadoop job, only in-memory runner • Expensive calculations cached, otherwise only stores raw inputs • Instead of outputting to SequenceFiles, generates HBase mutations * +/- a few code blocks only relevant to one or the other

How it works (Batch) • Snapshot latest offsets from archiver
process • Start batch process (~2-3 days), reads SequenceFiles from S3 • Launch brand new HBase cluster once done • Bulk load to HBase • Start new set realtime ingestors at snapshotted offsets (now()-2 days) • Allow catchup • Point query servers at new HBase cluster • Spin down old HBase cluster once everything looks good

Architecture HBase& Person'API' Iden+base' contacts' contacts'&' feedback' enriched'contacts' &'deduplica+on' Data'Bus'
Ka(a& Crunch' In+ Memory& Crunch' Hadoop& Crunch' Query& FullContact'Apps' Sherlock'Search' S3' What’s interesting is that this pipeline is really just implementing a graph-optimized materialized view over a log of contact observations.

Issues • Crunch in memory mode not designed with production
workloads in mind • Recreating Hadoop Configuration on each pipeline invocation — expensive! • Excessive locking around counters, can’t rely on elision • Serialization verification • We forked Crunch, very few changes necessary • Disabling serialization verification

Production • ~2 months in prod • Powers critical infrastructure
• Flexibility paving the way for new capabilities • Batch, realtime ingest, realtime queries • Query: 1000+qps, ~20ms@95th • Ingest: 150-500 batches/s depending on mix of data • Writing an average of 5,000 edges/s into HBase, with spikes >250,000/s (large cached profiles)

Whew, that was fast. Questions? • More questions? • Email
me: [email protected] • Tweet at me: @Xorlev • Bother me in person! • Find this deck on speakerdeck.com/xorlev

Links • https://engineering.linkedin.com/ distributed-systems/log-what-every- software-engineer-should-know-about- real-time-datas-unifying • http://kafka.apache.org/ • http://crunch.apache.org/

Using Kafka and Crunch for Realtime Processing

Using Kafka and Crunch for Realtime Processing

More Decks by Michael Rose

Other Decks in Programming

Featured

Transcript