
HBase at Meetup


This presentation shows how we used HBase to build site-wide activity feeds on the Meetup website.


Gary Helmling

February 24, 2010


Transcript

  1. The Problem circa Jan 2009
    • Groups doing great things, but how do you find it all?
    • Wait til the next event
    • Click around (a lot)
    • Wanted to show what's happening in groups
    • Discussions, photos, new members, RSVPs, etc.
    • But requires 10 different queries!
  2. The Solution
    • Show activity from all your groups in one place
    • real-time updates
    • better discovery of what's going on
    • find new ways to participate and get to know your groups
  3. Challenges
    • Normalized schema
    • Each type of activity requires querying a separate table
      – already wasn't scaling at the group level
    • Query efficiency
    • Activity occurs at group level
    • Members can be in hundreds of groups
    • For member home page we need activity from all groups ordered by most recent
      – N subqueries by group ID merged back by descending timestamp
  4. Options
    • De-normalize MySQL
    • Stuff different activity types into a common table (with different fields for different types of activity)
    • Duplicate entity data (or we're still doing N queries)
    • Start to lose a lot of the benefits of RDBMS
    • Query efficiency still a problem
    • Single system scaling limit
    • Something new
    • the Cloud
      – Google App Engine
      – Amazon SimpleDB
    • Hadoop/HBase
    • CouchDB
    • MongoDB
    • Voldemort
    • Cassandra
  5. Why HBase?
    • We own infrastructure, no usage limits
    • Data model
    • Semi-structured data in HBase (easily handles multiple types in same table)
    • Time-series ordered
    • Scaling is built in (just add more servers)
    • But extra indexing is DIY
    • Very active developer community
    • Established, mature project (in relative terms!)
    • Matches our own toolset (Java/Linux based)
  6. What is HBase?
    • Clone of Google's Bigtable
    • Distributed (automatic partitioning)
    • Column-oriented
    • Semi-structured (columns can be added just by inserting)
    • Built-in versioning
    • Not an RDBMS
    • No joins
    • No SQL
    • Data usually not normalized
    • Transactions & built-in secondary indexes available (as contrib) but immature
    • Need to think differently about how you structure data
    • Denormalize your data where necessary
    • Structure data & row keys around common access
  7. What is HBase? Data Storage
    • Table
    • Regions, defined by row [start key, end key)
      – Store, 1 per family
    • 1+ Store Files (HFile format on HDFS)
    • (table, rowkey, family, column, timestamp) = value
    • Everything is byte[]
    • Rows are ordered sequentially by key
    • Special tables: -ROOT-, .META.
    • Tell clients where to find user data
  8. What is HBase? Data Access
    • Random access (Gets)
    • by rowkey only
    • Sequential reads (Scans)
    • starting row key
    • where you stop is as important as where you start
      – ending row key (optional)
      – server-side filter (optional)
    • Writes (Puts)
    • No insert vs. update distinction
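The three access patterns above can be illustrated without a cluster. The following is a minimal sketch that uses a java.util.TreeMap as an in-memory stand-in for a table sorted by row key. It is not the HBase client API: the class and method names (SortedTableSketch, put, get, scan) are hypothetical, and keys are Strings for readability, whereas HBase compares raw byte[].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for a table whose rows are kept sorted by row key.
// NOT the HBase client API; all names here are hypothetical.
final class SortedTableSketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    // Put: no insert vs. update distinction, the latest write wins.
    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Get: random access by row key only.
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    // Scan: sequential read over [startRow, stopRow).
    // A null stopRow means "read to the end of the table".
    List<String> scan(String startRow, String stopRow) {
        NavigableMap<String, String> range = (stopRow == null)
                ? rows.tailMap(startRow, true)
                : rows.subMap(startRow, true, stopRow, false);
        return new ArrayList<>(range.values());
    }
}
```

Without a stop row, the scan keeps reading into rows that belong to other prefixes, which is one reading of "where you stop is as important as where you start".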
  9. How It Works: Storing activity data in HBase
    • FeedItem: stores activity data for all types
    • keyed by group and descending timestamp
      – ch<chapterID>-ts<Long.MAX_VALUE - timestamp>-<type>-<entityID>
    • each row only contains data for that type

    Row Key: ch1261585-ts9223...
      info:    item_type = chapter_greeting, target_greeting = 8104438
      content: greeting = “Hi, Gary”
    Row Key: ch1261585-ts9223...
      info:    item_type = new_discussion, target_forum = 847743, target_thread = 7369603
      content: title = “Improvements”, body = “When a discussion is created...”

    • MemberFeedIndex: index of FeedItem rows from all of a member's groups
    • one row per member (keyed by member ID)
    • columns store refs to FeedItem row keys for that member's groups
    • TTL of 2 months expires old index values

    Row Key: 4679998
      item:    ch176399-ts9223370788400750807-mem-10044424 = new_member
               ch1261585-ts9223370787431124807-ptag-8525047 = photo_tag
               ...
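The FeedItem key layout can be sketched as a small helper. This is an illustration only, not Meetup's code: the class and method names are hypothetical, and the timestamp is assumed to be epoch milliseconds, which is consistent with the sample keys on the slide.

```java
// Sketch of the FeedItem row key layout from the slide:
//   ch<chapterID>-ts<Long.MAX_VALUE - timestamp>-<type>-<entityID>
// Names are hypothetical; timestamps are assumed to be epoch milliseconds.
final class FeedItemKeys {
    private FeedItemKeys() {}

    static String rowKey(long chapterId, long timestampMillis, String type, long entityId) {
        // Subtracting from Long.MAX_VALUE inverts the order, so within one
        // chapter prefix a newer item sorts before an older one. For present-day
        // millisecond timestamps the result always has 19 digits, so string
        // order matches numeric order.
        long inverted = Long.MAX_VALUE - timestampMillis;
        return "ch" + chapterId + "-ts" + inverted + "-" + type + "-" + entityId;
    }

    // Recover the original timestamp from a key produced by rowKey().
    static long timestampOf(String rowKey) {
        int start = rowKey.indexOf("-ts") + 3;
        int end = rowKey.indexOf('-', start);
        return Long.MAX_VALUE - Long.parseLong(rowKey.substring(start, end));
    }
}
```

For example, rowKey(1261585, 1249423651000L, "ptag", 8525047) yields ch1261585-ts9223370787431124807-ptag-8525047, the same form as the photo_tag column shown in the MemberFeedIndex example.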
  10. How It Works: MemberFeedIndex
    • Steps in displaying member home page feed
    • look up member record in MemberFeedIndex by ID
    • grab the X most recent columns & values
      – use a time range for paging (older pages start with an earlier start time)
    • get each row from FeedItem (using column key as row key)
      – N gets, where N is number of items to display
    • populate some basic info about members and aggregate the results
      – still query MySQL for core entity info (member, group, event)
  11. How it Works: Secondary index tables
    • Still need to find rows by column values
    • tried “tableindexed” contrib (0.19 release), high CPU usage & contention on scans
    • decided to update to 0.20 release for other performance improvements
    • built secondary indexing into app layer
    • Separate table per indexed column
    • FeedItem info:actor_member indexed by FeedItem-by_actor_member
    • Index table rows keyed by column value and descending timestamp
      – <column value>-<Long.MAX_VALUE - timestamp>-<orig row key>
    • Zero pad numeric values (or big-endian representation) for correct byte ordering
  12. How it Works: Secondary index tables
    ex. FeedItem-by_actor_member (indexes FeedItem)

    FeedItem-by_actor_member
    Row Key: 0002851766-9223370783553935005-rowkey
      info:    actor_member = 2851766, item_type = new_rsvp, pub_date =
      __idx__: row = ch1143475-ts9223370783553935005-rsvp-54704795
    Row Key: 0004679998-9223370783650851832-rowkey
      info:    actor_member = 4679998, item_type = new_discussion, pub_date =
      __idx__: row = ch1261585-ts9223370783650851832-disc-7369603

    FeedItem
    Row Key: ch1143475-ts9223370783553935005-rsvp-54704795
      info:    actor_member = 2851766, item_type = new_rsvp, pub_date =
      content: comment = “See you there”
    Row Key: ch1261585-ts9223370783650851832-disc-7369603
      info:    actor_member = 4679998, item_type = new_discussion, pub_date =
      content: title = “Next month”, body = “...”
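The index-table keying can be sketched the same way. This is a rough illustration under stated assumptions: the class and method names are hypothetical, the 10-digit pad width is inferred from the sample keys, and the trailing "rowkey" in the slide's examples is taken to stand for the original FeedItem row key.

```java
import java.util.Locale;

// Sketch of the index-table row key from the slides:
//   <column value>-<Long.MAX_VALUE - timestamp>-<orig row key>
// Names and the 10-digit pad width are assumptions, not Meetup's code.
final class IndexKeys {
    private IndexKeys() {}

    static String indexKey(int columnValue, long timestampMillis, String origRowKey) {
        // Zero padding makes lexicographic order agree with numeric order
        // for non-negative values: unpadded, "10" would sort before "9".
        String padded = String.format(Locale.ROOT, "%010d", columnValue);
        long inverted = Long.MAX_VALUE - timestampMillis;
        return padded + "-" + inverted + "-" + origRowKey;
    }
}
```

With this layout, all activity by one member shares a prefix and is ordered newest first, so a scan starting at the padded member ID returns that member's most recent items.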
  13. Interacting with HBase: Meetup.Beeno
    Java Beans mapped to HBase tables

        package com.meetup.feeds.db;
        ...
        @HEntity(name="FeedItem")
        public class FeedItem implements Externalizable {
            ...
            @HRowKey
            public String getId() { return this.id; }
            public void setId(String id) { this.id = id; }

            @HProperty(family="info", name="actor_member",
                indexes = { @HIndex(date_col="info:pub_date", date_invert=true, extra_cols={"info:item_type"}) } )
            public Integer getMemberId() { return this.memberId; }
            public void setMemberId(Integer id) { this.memberId = id; }
  14. Interacting with HBase: Services
    Base service class provides round-tripping based on annotations

        public class EntityService<T> {
            public T get(String rowKey) throws HBaseException {…}
            public void save(T entity) throws HBaseException {…}
            public void saveAll(List<T> entities) throws HBaseException {…}
            public void delete(String rowKey) throws HBaseException {…}
            public Query<T> query() throws MappingException {…}
        }

    easily extended for specific needs
    Almost all HBase interaction through service instances.
  15. Interacting with HBase: Queries
    Simple Query API uses mappings and secondary index tables

    Find all items related to a discussion

        FeedItemService service = new FeedItemService(DiscussionItem.class);
        Query query = service.query()
            .using( Criteria.eq("threadId", threadId) );
        List items = query.execute();

    Find all greetings from a given member

        FeedItemService service = new FeedItemService(GreetingItem.class);
        Query query = service.query()
            .using( Criteria.eq("memberId", memberId) )
            .where( Criteria.eq("type", FeedItem.ItemType.CHAPTER_GREETING) );
        List items = query.execute();
  16. Interacting with HBase: Member Feed Retrieval
    Get latest activity from all a member's groups using MemberFeedIndex

        // retrieve the member's index record
        HTable mfiTable = HUtil.getTable("MemberFeedIndex");
        Get get = new Get( Bytes.toBytes(String.valueOf(memberId)) );
        get.addFamily( Bytes.toBytes("item") );
        Result r = mfiTable.get(get);

        FeedItemService service = new FeedItemService();
        Set<IndexKey> sortedKeys = sortKeys(r);
        List<FeedItem> items = new ArrayList<FeedItem>();
        // for each index col get the entity record
        for (IndexKey key : sortedKeys) {
            FeedItem item = service.get(key.getKey());
            if (item != null)
                items.add(item);
        }
        // populate member and chapter info
        …
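The slides do not show how sortKeys(r) orders the index columns. Because each column name starts with the chapter ID, plain lexical order would group items by chapter rather than by recency, so some re-sorting on the embedded timestamp seems necessary. The following is one possible sketch, not the actual helper: it works on plain Strings instead of the IndexKey and Result types, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the sortKeys(...) step, which the slides do not show.
// Input: MemberFeedIndex column names, i.e. FeedItem row keys of the form
//   ch<chapterID>-ts<Long.MAX_VALUE - timestamp>-<type>-<entityID>
final class FeedIndexSort {
    private FeedIndexSort() {}

    // Extract the inverted timestamp that follows "-ts".
    static long invertedTimestamp(String feedItemKey) {
        int start = feedItemKey.indexOf("-ts") + 3;
        int end = feedItemKey.indexOf('-', start);
        return Long.parseLong(feedItemKey.substring(start, end));
    }

    // Return up to `limit` keys, most recent first.
    // A smaller inverted timestamp means a more recent item.
    static List<String> mostRecent(Collection<String> feedItemKeys, int limit) {
        List<String> sorted = new ArrayList<>(feedItemKeys);
        sorted.sort(Comparator.comparingLong(FeedIndexSort::invertedTimestamp));
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }
}
```

Each key returned by mostRecent would then be fetched from FeedItem with one Get, as in the loop on the slide.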
  17. HBase @ Meetup: Issues along the way
    • Performance testing
    • Product targeting 3 of our highest traffic pages, simulating load is hard
    • Started with load scripts
    • Moved to testing with live traffic
      – Use AJAX calls to simulate requests
      – Selective enable for X% of traffic
    • Launched data collection/write traffic first
      – Allowed tweaking configuration before impacting user experience
  18. HBase @ Meetup: Issues along the way
    • High CPU / Concurrency issues
    • Updated to 0.20 release for performance gains across the board
    • Replaced “tableindexed” usage with application level secondary indexing
    • “Hot regions” - profile page hits small table every page load
    • Force split table to distribute across multiple servers
    • “Newest” region still handling high load
      – changed index keying to <value % 100>-<value>-<timestamp> for even distribution
    • I/O Heavy load / MemberFeedIndex table growing
    • Lowered MemberFeedIndex time-to-live to 2 months
    • Enabled LZO compression
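The re-keying used to spread out the "newest" region can be sketched as a salted key. This is a minimal illustration, not Meetup's code: the names, the two-digit bucket width, and the 10-digit value padding are assumptions, and values are assumed to be non-negative IDs.

```java
import java.util.Locale;

// Sketch of the salted index key described on the slide:
//   <value % 100>-<value>-<timestamp>
// Names and pad widths are assumptions; values are assumed non-negative.
final class SaltedKeys {
    private SaltedKeys() {}

    static String saltedKey(long value, long invertedTimestamp) {
        // The bucket prefix spreads consecutive values across 100 key ranges,
        // so newly created (highest) values no longer land in a single region.
        long bucket = Math.floorMod(value, 100L);
        return String.format(Locale.ROOT, "%02d-%010d-%d", bucket, value, invertedTimestamp);
    }
}
```

For example, saltedKey(4679998L, 9223370783650851832L) yields 98-0004679998-9223370783650851832. The trade-off is that a lookup for one value still reads a single prefix, but a scan across a range of values has to visit every bucket.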
  19. HBase @ Meetup: Current Status
    • Live traffic growing
    • Cluster handling ~2.5k – 3k requests/sec
    • 50+% still write traffic
    • ~17% of page views hit HBase (for reads)
    • Expanding to 30% of page views in coming months
    • Meetup.Beeno now open-source on GitHub:
    • http://github.com/ghelmling/meetup.beeno
    • Next up
    • Continue tweaking
    • Site analytics