
Using Kafka and Crunch for Realtime Processing

Talk given @ GlueCon 2015


Michael Rose

May 20, 2015


  1. Using Kafka and Crunch for Realtime Processing Michael Rose, FullContact

  2. most powerful fully-connected contact management platform for professionals and enterprises who need to master their contacts and be awesome with people.

    Michael Rose, Senior Software Engineer, Data Science. michael@fullcontact.com, @Xorlev on Twitter.
  3. Data Team: Scott Brave, Brandon Vargo, Michael Rose

    Before anything, I’d like to give a shout-out to the rest of my team, who couldn’t attend today: Scott, our CTO and leader, and Brandon Vargo, the other member of our team. The work I’m talking about today is definitely a team effort, not solely my own. Ken Michie, another longtime FullContact employee, also played an important role in helping us with this project.
  4. A little bit about what we do (not marketing, I promise)

    Maybe not intentionally.
  5. Turn a partial contact into a full contact via API

    “Enrichment”: bart@fullcontact.com becomes Bart Lorang, CEO / Co-Founder, FullContact, Inc. EMAIL (work): bart@fullcontact.com. TWITTER (personal): @lorangb. PHONE (work): +1 303-736-9406. Gender: Male. Age: 32. Location: Boulder, Colorado. WEB.

    We don’t provide phone. We provide an API that takes an originating identifier of some sort, e.g. an email, phone number, or Twitter handle, and returns a full social profile: name, photo, demographic information, social handles, social metadata.
  6. as a Platform: for Professionals, for Business

    Single Version of Truth: Complete, Accurate Contact Info. Synced Everywhere. FullContact Cloud Address Book: Security, Transcription, Backups, Versioning, Cleansing & Correction, Enrichment, De-duplication & Merge, Validation & Verification, Sync, Tagging, Search, Sharing, Storage.

    We’ve also spun this into a consumer-facing business. We sync all of your contacts from the sources you care about. Then we fix them, update them, deduplicate them, and push them out to all the places you use your contacts. I’m going to be talking about our core data technologies that underpin some of our most important functions: keeping contacts up to date and deduplicating contacts.
  7. Enrichment

    [Diagram: Person API, Sherlock, WebObservations, Facebook API, Twitter API, Foursquare API, ..., Identibase, Merged Profiles. Note: not a literal cloud. Example request: GET ?email=bart@fullcontact.com returns { [...] "contactInfo":{ "familyName":"Lorang", "fullName":"Bart Lorang", "givenName":"Bart" } [...] }]

    I’ll focus on our Enrichment system, as it’s the primary provider of information into our data pipeline. Here’s the way it works. A user (internal or external) calls into the Person API. The Person API checks for already merged profiles, and if it has a recent one it serves it up. Otherwise it tosses a message onto a queue for a service called Sherlock to work on. Sherlock scours the web for clues about identity, merges a profile from the clues, and dumps it in the database. Meanwhile, each response from each data provider (APIs, scrapers) we have is dumped into a database I’ll call WebObservations. Previously, this was powered by 18 Cassandra nodes. You can imagine the recursive nature of this system could lead to some funky results, especially if someone put someone else’s usernames in one of their profiles. This is where Identibase comes in. I’ve bolded the component labelled Identibase because it is the lynchpin of this entire system. It’s our secret sauce, and I’m about to reveal some of the ingredients.
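
The lookup flow described in these notes (serve a recent merged profile, otherwise queue the identifier for Sherlock) can be sketched roughly as follows. This is a minimal illustration, not FullContact's code: the `lookup` function, the `MAX_AGE_SECONDS` freshness window, the dictionary layout, and returning `None` on a miss are all assumptions made for the example.

```python
import time
from collections import deque

MAX_AGE_SECONDS = 30 * 24 * 3600  # hypothetical freshness window


def lookup(email, merged_profiles, work_queue, now=None):
    """Serve a recent merged profile, otherwise queue the identifier for a crawl."""
    now = time.time() if now is None else now
    cached = merged_profiles.get(email)
    if cached is not None and now - cached["merged_at"] <= MAX_AGE_SECONDS:
        return cached["profile"]
    work_queue.append(email)  # picked up later by the search service
    return None


# Hypothetical store of already merged profiles, keyed by identifier.
merged_profiles = {
    "bart@fullcontact.com": {
        "merged_at": 1_000_000.0,
        "profile": {"contactInfo": {"fullName": "Bart Lorang"}},
    }
}
queue = deque()
hit = lookup("bart@fullcontact.com", merged_profiles, queue, now=1_000_100.0)
miss = lookup("someone@example.com", merged_profiles, queue, now=1_000_100.0)
```

Here `hit` is the cached profile, while `miss` is `None` and leaves one entry on the queue for the crawler to work on.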
  8. Identibase

    • Identity resolution engine • Finds co-occurrences of contact fields • email & phone, phone & name • ∴ email -> name • Full graph database on top of HBase • Also tracks source/privacy metadata for every piece of information • Powers Person API & AddressBook deduplication

    Essentially a giant adjacency list in HBase.
  9. Identibase: identity resolution

    [Diagram: contact fields scott@fc.com, brave@sb.com, scott.brave and brave@sb.com, @sbrave drawn as numbered nodes 1-5 in two graphs (ABC, DEF), then merged into one graph in which nodes 1 and 4 coincide.]

    Identibase provides identity resolution. What do we mean by that? Let’s consider two contacts, one with three fields and one with two. If you model contacts instead as a directed graph of co-occurrences and merge the resultant graphs, you end up with a joined graph. You can imagine this is fairly useful for finding new information on contacts as well as helping to deduplicate contacts, even if they share no common information!
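
The merge described here can be sketched as an adjacency list plus a graph traversal. This is a toy illustration under stated assumptions, not Identibase itself: the grouping of the slide's example fields into two contacts is a guess, every co-occurrence is stored in both directions for simplicity, and `build_graph` and `resolve` are hypothetical names.

```python
from collections import defaultdict, deque
from itertools import permutations


def build_graph(contacts):
    """Adjacency list: each field points at every field it co-occurred with."""
    graph = defaultdict(set)
    for fields in contacts:
        for a, b in permutations(fields, 2):
            graph[a].add(b)
    return graph


def resolve(graph, start):
    """Return every field reachable from `start` (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Two contacts that share a single field (assumed grouping of the slide's example).
contacts = [
    ["scott@fc.com", "brave@sb.com", "scott.brave"],
    ["brave@sb.com", "@sbrave"],
]
graph = build_graph(contacts)
identity = resolve(graph, "scott@fc.com")
```

In this sketch `scott@fc.com` and `@sbrave` never appear in the same contact, yet the traversal links them through the shared `brave@sb.com` node.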
  10. Identibase

    Here’s a non-toy example. This is a graphical representation from a program called Identiviz. It built a graphical representation of my profile and everyone else I’m linked to in some way (e.g. test accounts, shared phones, etc.). It can get as large as you might imagine, and I’m not even the most connected person around.
  11. Enrichment

    [Diagram: Sherlock, WebObservations, Facebook API, Twitter API, Foursquare API, ..., HBase, Identibase Query.]

    I’ve cut out Person API as it’s not really important to this pipeline. Here you can see that Sherlock writes those observations into a database, then Hadoop magic happens, and we produce an HBase table used by the Identibase system.
  12. Old Batch (~monthly-ish)

    But that isn’t the only source. We read in data from WebObservations as well as a ton of other static datasets living on HDFS, ran separate importers for each, ran algorithms on top of that data, built HFiles, and *then* bulk loaded into a new HBase cluster. This ran well for ~3 years, but there were a number of issues.
  13. Old Batch

    • Raw Hadoop MapReduce • Fairly complex multi-stage graph algorithm built on top of chained MapReduce jobs • Hard to reason about, especially for a newcomer • Run monthly, bulk load to HBase • Pulled data directly from source systems • Cassandra (C*) cluster with web observations was never happy about this • Job wasn’t frequent enough to justify a second “analytics” DC • Pulled lots of other static data
  14. We wanted to change this system. Make it realtime. Modernize

  15. Motivations

    • Lower the time between observing and integrating new data • Open up new use cases not possible on a batch schedule (e.g. subscriptions) • Still maintain batch pipeline for big changes • Redo underlying storage architecture to support future extensions • Avoid impacting production datastores • Simplify pipeline code, reuse as much code as possible
  16. Requirements

    Requirement: ~Real-time, within a minute or so • Need a means of sharing data between systems • Process to ingest this information

    Requirement: Keep batch pipeline too (lambda architecture) • Still need MapReduce • Reuse existing code where possible

    Requirement: Unified data model / no live DB reads • Backfill old data into a unified model • Archive new data for batch process
  17. Two technologies made this possible

  18. Apache Crunch

    • Abstraction layer on top of MapReduce • More developer friendly than Pig/Hive (in many cases) • Modeled after Google’s FlumeJava • Flexible data types: PCollection<T>, PTable<K,V> • Simple but powerful operations: parallelDo(), groupByKey(), combineValues(), flatten(), count(), join(), sort(), top() • Robust join strategies: reduce-side, map-side, sharded joins, bloom-filter joins, cogroups, etc. • Three runner pipelines: MRPipeline (Hadoop), SparkPipeline, MemPipeline

    So for these reasons, we decided to give Apache Crunch a try. Crunch is an abstraction layer for defining data pipelines that, under the covers, compile to a series of MapReduce jobs. Distinct from other abstraction layers like Pig and Hive, it’s geared less towards data scientists and more towards developers; it’s more like Cascading/Scalding, if you are familiar with those. The Crunch API itself is modeled after Google’s FlumeJava project, which is what they use for this at Google. Their goal is to be simple and flexible. We found that with some tweaks, Crunch actually enabled us to meet all of our goals. So let me just jump right into what we did…
  19. Apache Crunch

    Can you imagine doing a join in MapReduce after it’s this easy? Crunch makes it fairly easy to do otherwise nasty operations like joins, group bys, and cogroups.
  20. Apache Kafka: publish-subscribe messaging rethought as a distributed commit log

    • Efficient - OS sympathy • Trivial horizontal scalability • Boring in production! • It’s a log, not a queue: Rewind, Replay, Multiple consumers. • Systems don’t have to know about each other • The log is a unified abstraction for real-time processing (Advantages source: LinkedIn)
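
The "log, not a queue" point can be illustrated with a small in-memory sketch. It is a toy model only, not the Kafka client API: the `Log` and `Consumer` classes and their methods are invented here to show that reading never removes records and that each consumer owns its offset, which is what makes rewind, replay, and multiple independent consumers possible.

```python
class Log:
    """Append-only log; records are never removed by reads."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=100):
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own position in the shared log."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self):
        records = self.log.read(self.offset)
        self.offset += len(records)
        return records

    def seek(self, offset):
        self.offset = offset  # rewind (or skip ahead) without touching the log


log = Log()
for event in ["a", "b", "c"]:
    log.append(event)

indexer = Consumer(log)
archiver = Consumer(log)
first = indexer.poll()        # the indexer sees all three records
also_first = archiver.poll()  # so does the archiver, independently
indexer.seek(1)
replayed = indexer.poll()     # replay from offset 1 onward
```

Because the producer only appends, it never needs to know which consumers exist or how far along each one is.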
  21. The Databus

  22. The Databus

    • Our Kafka-based data pipeline • The library (“Frizzle”) has producer/consumer helpers, a common message envelope w/ metadata, and serializable POJOs for each unique topic/system. • Schemas become the interface between systems. • Systems at FullContact publish to Kafka what they would write to a DB, e.g.: • Web observations (crawl data) • Contact versions • User activity • Databus Archiver consumes all topics, archives to S3 as Hadoop SequenceFiles (long offset, byte[] serialized). Based on Pinterest’s Secor.
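
A common envelope and the (offset, bytes) archive record might look something like the sketch below. Nothing here reflects the actual Frizzle library: the `Envelope` fields, the topic and source names, and the use of JSON as the wire format are all assumptions chosen to keep the example self-contained.

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class Envelope:
    topic: str
    source: str        # hypothetical metadata: producing system
    timestamp_ms: int  # hypothetical metadata: publish time
    payload: dict

    def to_bytes(self) -> bytes:
        # JSON is an assumption; the real serialization format is not stated.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @staticmethod
    def from_bytes(data: bytes) -> "Envelope":
        return Envelope(**json.loads(data.decode("utf-8")))


def archive(messages):
    """Pair each serialized message with its offset, like an (offset, bytes) record."""
    return [(offset, msg.to_bytes()) for offset, msg in enumerate(messages)]


observation = Envelope(
    topic="web-observations",
    source="sherlock",
    timestamp_ms=int(time.time() * 1000),
    payload={"email": "bart@fullcontact.com", "fullName": "Bart Lorang"},
)
records = archive([observation])
offset, raw = records[0]
restored = Envelope.from_bytes(raw)
```

Keeping the offset next to the opaque bytes means a batch job can read the archive without knowing each topic's schema until it actually deserializes a record.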
  23. The Databus

    • Another important component was data backfill • We wanted to pretend we always had Kafka and the databus • Backfilled all prod data and all backups of prod data into the same SequenceFile format, put on S3

    That, by the way, was over 100TB of Cassandra and HBase backups. We were able to prune backups after that, which made Ops very happy.
  24. Once available, data quickly gets used everywhere

    This is a visualization of our Kafka server’s topics and consumers. As you can see, there’s some fan-out on two of the topics.
  25. Architecture

    [Diagram labels: HBase, Person API, Identibase, contacts, contacts & feedback, enriched contacts & deduplication, Data Bus, Kafka, Crunch In-Memory, Crunch Hadoop, Crunch Query, FullContact Apps, Sherlock Search, S3.]

    We broke up the work into two major pieces. I worked mainly on becoming our SME on Kafka and building this databus infrastructure, hooking up systems, building schemas, then doing the backfill of much of our observed web data to S3. My co-worker Brandon took the lead on converting the main pipeline from raw MapReduce to Apache Crunch and starting the realtime project. Sherlock and our apps contribute contacts and feedback to the databus; we archive that on S3 as well as ingest it via our realtime pipeline into HBase. We then have the same code running on batch and realtime ingest.
  26. New Batch

    • Apache Crunch • Compiles down to MapReduce jobs • 17 jobs, many run in parallel • Totally tested end-to-end with in-memory runner • Code is easy to follow • Tons of code reused from old implementation • Reads archived Kafka data (+backfill) from S3 • Same output. Totally different pipeline.
  27. Kafka (but still batch)

    [Diagram: Sherlock, Facebook API, Twitter API, Foursquare API, ..., Identibase Query, S3, Secor, HBase, Crunch MapReduce.]

    This is what the pipeline looks like now. Sherlock writes web observations to Kafka (the DB too, but not primarily), Secor archives the data onto S3, and then the Crunch MapReduce code reads the SequenceFiles off S3 in order to generate the graph that’s written into HBase.
  28. Lambda Architecture

    • Batch base, realtime does deltas, batch eventually replaces • Can take liberties with realtime process, batch fixes • Run batch when needed/wanted
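
The "batch base, realtime deltas, batch eventually replaces" idea can be sketched with plain dictionaries. This is a simplified model under assumptions: the view is a mapping from a field to its set of linked fields, deltas are merged by set union, and `serving_view` is a hypothetical name rather than anything from the talk.

```python
def serving_view(batch_view, realtime_deltas):
    """Union realtime delta edges into a copy of the batch base."""
    view = {key: set(edges) for key, edges in batch_view.items()}
    for key, edges in realtime_deltas.items():
        view.setdefault(key, set()).update(edges)
    return view


# Batch run 1 produced the base; realtime ingest has since added one edge.
batch_v1 = {"bart@fullcontact.com": {"@lorangb"}}
deltas = {"bart@fullcontact.com": {"+1 303-736-9406"}}
current = serving_view(batch_v1, deltas)

# The next batch run recomputes from the full archive, so it already
# contains what the deltas added and replaces base and deltas together.
batch_v2 = {"bart@fullcontact.com": {"@lorangb", "+1 303-736-9406"}}
after_batch = serving_view(batch_v2, {})
```

Since every batch run starts again from the archived inputs, any shortcut the realtime path took is corrected the next time the batch output replaces the serving view.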
  29. Realtime Platform

    • “Out of the box lambda” • Spark / Spark Streaming • Summingbird • Separate implementation • Storm • Raw Kafka consumer
  30. Realtime Platform

    • Spark Streaming • Issues, unknown to us, no code reuse • Summingbird • Very elegant, complex • Sparse documentation • Analytics focused • Storm • Pure realtime • Would have to completely rewrite our logic • Something else (raw Kafka consumer)

    Spark: a relative unknown to us, and we had issues running it on our cluster. Summingbird: sparse documentation, felt analytics focused; it would be awesome if you worked at Twitter with the internal integrations. Storm: would require a totally separate implementation. It looked like we might need to do an implementation on top of Spark or try to share code as much as possible with our batch pipeline.
  31. But wait: Crunch has an in-memory mode

    And we have Crunch code already…
  32. Crunch (Ab)use

    • Crunch has an in-memory runner for development • Re-use our existing code in realtime • One codebase, two great use cases (what a deal!) • Kafka consumer reads data, bundles it into related batches, feeds them into the runner
  33. How it works (Realtime)

    • Batching layer groups work, tracks outstanding work • We’re indexing people, so we batch around a search • Batch of work is all fed into the exact same code* as the Hadoop job, only with the in-memory runner • Expensive calculations cached, otherwise only stores raw inputs • Instead of outputting to SequenceFiles, generates HBase mutations

    * +/- a few code blocks only relevant to one or the other
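
The realtime path above (group messages around a search, run the shared pipeline logic on each group, emit mutations instead of files) can be sketched as follows. This is an illustration in plain Python rather than Crunch, and the message shape, the `search_id` key, and the mutation dictionaries are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(observations):
    """Shared pipeline logic: co-occurrence edges for one batch of observations."""
    edges = set()
    for obs in observations:
        for a, b in combinations(sorted(obs["fields"]), 2):
            edges.add((a, b))
    return edges


def batch_by_search(messages):
    """Batching layer: group related messages around the search that produced them."""
    groups = defaultdict(list)
    for msg in messages:
        groups[msg["search_id"]].append(msg)
    return groups


def realtime_ingest(messages):
    """Run the shared logic per batch and emit put-style mutations."""
    mutations = []
    for search_id, batch in batch_by_search(messages).items():
        for a, b in sorted(build_edges(batch)):
            mutations.append({"row": a, "column": b, "search_id": search_id})
    return mutations


messages = [
    {"search_id": "s1", "fields": ["bart@fullcontact.com", "@lorangb"]},
    {"search_id": "s1", "fields": ["bart@fullcontact.com", "+1 303-736-9406"]},
    {"search_id": "s2", "fields": ["scott@fc.com", "scott.brave"]},
]
mutations = realtime_ingest(messages)
```

The same `build_edges` function could be called on a full archive by a batch job, which is the code-sharing property the slide describes.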
  34. How it works (Batch)

    • Snapshot latest offsets from archiver process • Start batch process (~2-3 days), reads SequenceFiles from S3 • Launch brand new HBase cluster once done • Bulk load to HBase • Start a new set of realtime ingestors at the snapshotted offsets (now()-2 days) • Allow catch-up • Point query servers at new HBase cluster • Spin down old HBase cluster once everything looks good
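
The cutover steps can be sketched with an in-memory stand-in for the partitioned log. This is a simplified model, not the production tooling: `snapshot_offsets`, `batch_build`, and `catch_up` are hypothetical names, and sets of event names stand in for the HBase tables.

```python
def snapshot_offsets(log_by_partition):
    """Record the next offset to read for each partition at snapshot time."""
    return {p: len(records) for p, records in log_by_partition.items()}


def batch_build(log_by_partition, offsets):
    """Batch job: process everything archived before the snapshot."""
    return {r for p, records in log_by_partition.items() for r in records[:offsets[p]]}


def catch_up(view, log_by_partition, offsets):
    """Realtime ingestors: replay from the snapshotted offsets onward."""
    for p, records in log_by_partition.items():
        view.update(records[offsets[p]:])
    return view


log = {0: ["e1", "e2"], 1: ["e3"]}
offsets = snapshot_offsets(log)

# New events keep arriving while the multi-day batch job runs.
log[0].append("e4")
log[1].append("e5")

new_cluster = batch_build(log, offsets)               # bulk-loaded base
caught_up = catch_up(set(new_cluster), log, offsets)  # after ingestor catch-up
```

Starting the new ingestors at the snapshotted offsets, rather than at the current end of the log, is what keeps events that arrived during the batch run from being lost.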
  35. Architecture

    [Diagram labels: HBase, Person API, Identibase, contacts, contacts & feedback, enriched contacts & deduplication, Data Bus, Kafka, Crunch In-Memory, Crunch Hadoop, Crunch Query, FullContact Apps, Sherlock Search, S3.]

    What’s interesting is that this pipeline is really just implementing a graph-optimized materialized view over a log of contact observations.
  36. Issues

    • Crunch in-memory mode not designed with production workloads in mind • Recreating Hadoop Configuration on each pipeline invocation (expensive!) • Excessive locking around counters, can’t rely on elision • Serialization verification • We forked Crunch, very few changes necessary • Disabling serialization verification
  37. Production

    • ~2 months in prod • Powers critical infrastructure • Flexibility paving the way for new capabilities • Batch, realtime ingest, realtime queries • Query: 1000+ qps, ~20 ms at the 95th percentile • Ingest: 150-500 batches/s depending on mix of data • Writing an average of 5,000 edges/s into HBase, with spikes >250,000/s (large cached profiles)
  38. Whew, that was fast. Questions?

    • More questions? • Email me: michael@fullcontact.com • Tweet at me: @Xorlev • Bother me in person! • Find this deck on speakerdeck.com/xorlev
  39. Links • https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying • http://kafka.apache.org/ • http://crunch.apache.org/