HBase at Meetup

HBase @ Meetup Gary Helmling – Lead SW Engineer

The Problem circa Jan 2009
• Groups doing great things, but how do you find it all?
  • Wait til the next event
  • Click around (a lot)
• Wanted to show what's happening in groups
  • Discussions, photos, new members, RSVPs, etc.
  • But requires 10 different queries!

The Solution
• Show activity from all your groups in one place
  • real-time updates
  • better discovery of what's going on
  • find new ways to participate and get to know your groups

Challenges
• Normalized schema
  • Each type of activity requires querying a separate table
    – already wasn't scaling at the group level
• Query efficiency
  • Activity occurs at group level
  • Members can be in hundreds of groups
  • For member home page we need activity from all groups ordered by most recent
    – N subqueries by group ID merged back by descending timestamp
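
The last bullet amounts to a k-way merge of N per-group result lists. The following is a minimal sketch of that merge, assuming each per-group query already returns its rows newest first. The names `FeedMerge`, `Activity` and `mergeNewestFirst` are hypothetical and only illustrate the cost of the approach; they are not taken from Meetup's code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class FeedMerge {
    /** Hypothetical activity record; only the timestamp matters for the merge. */
    static final class Activity {
        final long groupId;
        final long timestamp;

        Activity(long groupId, long timestamp) {
            this.groupId = groupId;
            this.timestamp = timestamp;
        }
    }

    /** Cursor over one group's results, which must already be sorted newest first. */
    private static final class Cursor {
        final Iterator<Activity> rest;
        Activity head;

        Cursor(Iterator<Activity> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    /** Merges N per-group lists into one list ordered by descending timestamp. */
    static List<Activity> mergeNewestFirst(List<List<Activity>> perGroup, int limit) {
        // Max-heap on the head of each cursor: the newest remaining item is polled first.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                (a, b) -> Long.compare(b.head.timestamp, a.head.timestamp));
        for (List<Activity> group : perGroup) {
            if (!group.isEmpty()) {
                heap.add(new Cursor(group.iterator()));
            }
        }
        List<Activity> merged = new ArrayList<>();
        while (!heap.isEmpty() && merged.size() < limit) {
            Cursor c = heap.poll();
            merged.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();
                heap.add(c); // re-insert with its next-newest item
            }
        }
        return merged;
    }
}
```

Even with an efficient merge, the page still pays for one subquery per group, which is the scaling problem the slide points at.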

Options
• De-normalize MySQL
  • Stuff different activity types into a common table (with different fields for different types of activity)
  • Duplicate entity data (or we're still doing N queries)
  • Start to lose a lot of the benefits of RDBMS
  • Query efficiency still a problem
  • Single system scaling limit
• Something new
  • the Cloud
    – Google App Engine
    – Amazon SimpleDB
  • Hadoop/HBase
  • CouchDB
  • MongoDB
  • Voldemort
  • Cassandra

Why HBase?
• We own infrastructure, no usage limits
• Data model
  • Semi-structured data in HBase (easily handles multiple types in same table)
  • Time-series ordered
• Scaling is built in (just add more servers)
  • But extra indexing is DIY
• Very active developer community
• Established, mature project (in relative terms!)
• Matches our own toolset (Java/Linux based)

What is HBase?
• Clone of Google's Bigtable
• Distributed (automatic partitioning)
• Column-oriented
• Semi-structured (columns can be added just by inserting)
• Built-in versioning
• Not an RDBMS
  • No joins
  • No SQL
  • Data usually not normalized
  • Transactions & built-in secondary indexes available (as contrib) but immature
• Need to think differently about how you structure data
  • Denormalize your data where necessary
  • Structure data & row keys around common access

What is HBase? Data Storage
• Table
  • Regions, defined by row [start key, end key)
    – Store, 1 per family
      • 1+ Store Files (HFile format on HDFS)
• (table, rowkey, family, column, timestamp) = value
• Everything is byte[]
• Rows are ordered sequentially by key
• Special tables: -ROOT-, .META.
  • Tell clients where to find user data

HBase Architecture
Courtesy of Lars George, from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

What is HBase? Data Access
• Random access (Gets)
  • by rowkey only
• Sequential reads (Scans)
  • starting row key
  • where you stop is as important as where you start
    – ending row key (optional)
    – server-side filter (optional)
• Writes (Puts)
  • No insert vs. update distinction
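
To illustrate why the stop key matters, here is a small sketch that models a table as a sorted in-memory map and scans the half-open range [start, stop). It deliberately does not use the HBase client API: `SortedKeyScan`, `scan` and `stopKeyForPrefix` are hypothetical names, and Java string ordering stands in for HBase's byte ordering, which matches only for ASCII keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

final class SortedKeyScan {
    private SortedKeyScan() {}

    /** Returns up to limit row keys in [startKey, stopKey), in sorted key order. */
    static List<String> scan(NavigableMap<String, String> table,
                             String startKey, String stopKey, int limit) {
        List<String> rows = new ArrayList<>();
        // Start key is inclusive, stop key is exclusive, like a region's [start key, end key).
        for (String key : table.subMap(startKey, true, stopKey, false).keySet()) {
            if (rows.size() >= limit) {
                break;
            }
            rows.add(key);
        }
        return rows;
    }

    /**
     * Smallest key that sorts after every key starting with the given prefix.
     * Assumes a non-empty ASCII prefix whose last character can be incremented.
     */
    static String stopKeyForPrefix(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }
}
```

With a start key of `ch1261585-` and a stop key of `ch1261585.`, the scan returns only that group's rows. Without a stop key (or a filter) it would keep reading into whatever rows sort next.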

How It Works Storing activity data in HBase
• FeedItem: stores activity data for all types
  • keyed by group and descending timestamp
    – ch<chapterID>-ts<Long.MAX_VALUE–timestamp>-<type>-<entityID>
  • each row only contains data for that type

  Row Key: ch1261585-ts9223...
    info:    item_type = chapter_greeting, target_greeting = 8104438
    content: greeting = “Hi, Gary”
  Row Key: ch1261585-ts9223...
    info:    item_type = new_discussion, target_forum = 847743, target_thread = 7369603
    content: title = “Improvements”, body = “When a discussion is created...”

• MemberFeedIndex: index of FeedItem rows from all of a member's groups
  • one row per member (keyed by member ID)
  • columns store refs to FeedItem row keys for that member's groups
  • TTL of 2 months expires old index values

  Row Key: 4679998
    item: ch176399-ts9223370788400750807-mem-10044424 = new_member
          ch1261585-ts9223370787431124807-ptag-8525047 = photo_tag
          ...
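
The FeedItem key layout can be sketched as a small helper. This is a minimal illustration, not Meetup's code: `FeedItemKeys`, `invert` and `rowKey` are hypothetical names, and the timestamp is assumed to be in milliseconds since the epoch, which is consistent with the example keys shown on the slides.

```java
import java.util.Locale;

final class FeedItemKeys {
    private FeedItemKeys() {}

    /** Inverts a timestamp so that newer values produce smaller numbers. */
    static long invert(long timestampMillis) {
        return Long.MAX_VALUE - timestampMillis;
    }

    /** Builds ch<chapterID>-ts<Long.MAX_VALUE - timestamp>-<type>-<entityID>. */
    static String rowKey(long chapterId, long timestampMillis, String type, long entityId) {
        // %019d pads the inverted timestamp to 19 digits (the width of Long.MAX_VALUE),
        // so comparing keys as strings agrees with comparing the numbers.
        return String.format(Locale.ROOT, "ch%d-ts%019d-%s-%d",
                chapterId, invert(timestampMillis), type, entityId);
    }
}
```

Because rows are stored in ascending key order, subtracting the timestamp from `Long.MAX_VALUE` makes a scan that starts at `ch<chapterID>-` return that group's newest items first.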

How It Works MemberFeedIndex
• Steps in displaying member home page feed
  • look up member record in MemberFeedIndex by ID
  • grab the X most recent columns & values
    – use a time range for paging (older pages start with an earlier start time)
  • get each row from FeedItem, using the column key as the row key
    – N gets, where N is number of items to display
  • populate some basic info about members and aggregate the results
    – still query MySQL for core entity info (member, group, event)

How It Works Secondary index tables
• Still need to find rows by column values
  • tried “tableindexed” contrib (0.19 release), high CPU usage & contention on scans
  • decided to update to 0.20 release for other performance improvements
  • built secondary indexing into app layer
• Separate table per indexed column
  • FeedItem info:actor_member indexed by FeedItem-by_actor_member
• Index table rows keyed by column value and descending timestamp
  – <column value>-<Long.MAX_VALUE–timestamp>-<orig row key>
• Zero pad numeric values (or big-endian representation) for correct byte ordering

How It Works Secondary index tables
ex. FeedItem-by_actor_member, which indexes FeedItem

FeedItem-by_actor_member
  Row Key: 0002851766-9223370783553935005-rowkey
    info:    actor_member = 2851766, item_type = new_rsvp, pub_date =
    __idx__: row = ch1143475-ts9223370783553935005-rsvp-54704795
  Row Key: 0004679998-9223370783650851832-rowkey
    info:    actor_member = 4679998, item_type = new_discussion, pub_date =
    __idx__: row = ch1261585-ts9223370783650851832-disc-7369603

FeedItem
  Row Key: ch1143475-ts9223370783553935005-rsvp-54704795
    info:    actor_member = 2851766, item_type = new_rsvp, pub_date =
    content: comment = “See you there”
  Row Key: ch1261585-ts9223370783650851832-disc-7369603
    info:    actor_member = 4679998, item_type = new_discussion, pub_date =
    content: title = “Next month”, body = “...”
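
The index row key format, including the zero padding, might be built like this. It is a sketch under stated assumptions: `IndexKeys` and `indexKey` are hypothetical names, the 10-digit padding width is taken from the example rows above, column values are assumed to be non-negative, and timestamps are assumed to be in milliseconds.

```java
import java.util.Locale;

final class IndexKeys {
    private IndexKeys() {}

    /** Builds <column value>-<Long.MAX_VALUE - timestamp>-<orig row key>. */
    static String indexKey(long columnValue, long timestampMillis, String origRowKey) {
        // Zero padding makes lexicographic order match numeric order:
        // unpadded, "10" sorts before "9"; padded, "0000000009" sorts before "0000000010".
        return String.format(Locale.ROOT, "%010d-%019d-%s",
                columnValue, Long.MAX_VALUE - timestampMillis, origRowKey);
    }
}
```

A scan whose start key is the padded column value followed by `-` then walks that value's index rows newest first, and each row points back to the original FeedItem row.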

Interacting with HBase Meetup.Beeno

    package com.meetup.feeds.db;
    ...
    @HEntity(name="FeedItem")
    public class FeedItem implements Externalizable {
        ...
        // the bean's ID is stored as the HBase row key
        @HRowKey
        public String getId() { return this.id; }
        public void setId(String id) { this.id = id; }

        // stored as info:actor_member, with a secondary index on the value
        @HProperty(family="info", name="actor_member",
                   indexes = { @HIndex(date_col="info:pub_date", date_invert=true,
                                       extra_cols={"info:item_type"}) } )
        public Integer getMemberId() { return this.memberId; }
        public void setMemberId(Integer id) { this.memberId = id; }
        ...
    }

Java Beans mapped to HBase tables

Interacting with HBase Services
Base service class provides round-tripping based on annotations

    public class EntityService<T> {
        public T get(String rowKey) throws HBaseException {…}
        public void save(T entity) throws HBaseException {…}
        public void saveAll(List<T> entities) throws HBaseException {…}
        public void delete(String rowKey) throws HBaseException {…}
        public Query<T> query() throws MappingException {…}
    }

easily extended for specific needs
Almost all HBase interaction through service instances.

Interacting with HBase Queries
Find all items related to a discussion

    FeedItemService service = new FeedItemService(DiscussionItem.class);
    Query query = service.query()
        .using( Criteria.eq("threadId", threadId) );
    List items = query.execute();

Find all greetings from a given member

    FeedItemService service = new FeedItemService(GreetingItem.class);
    Query query = service.query()
        .using( Criteria.eq("memberId", memberId) )
        .where( Criteria.eq("type", FeedItem.ItemType.CHAPTER_GREETING) );
    List items = query.execute();

Simple Query API uses mappings and secondary index tables

Interacting with HBase Member Feed Retrieval

    // retrieve the member's index record
    HTable mfiTable = HUtil.getTable("MemberFeedIndex");
    Get get = new Get( Bytes.toBytes(String.valueOf(memberId)) );
    get.addFamily( Bytes.toBytes("item") );
    Result r = mfiTable.get(get);

    FeedItemService service = new FeedItemService();
    Set<IndexKey> sortedKeys = sortKeys(r);
    List<FeedItem> items = new ArrayList<FeedItem>();
    // for each index col get the entity record
    for (IndexKey key : sortedKeys) {
        FeedItem item = service.get(key.getKey());
        if (item != null)
            items.add(item);
    }
    // populate member and chapter info
    …

Get latest activity from all a member's groups using MemberFeedIndex

HBase @ Meetup Issues along the way
• Performance testing
  • Product targeting 3 of our highest traffic pages, simulating load is hard
  • Started with load scripts
  • Moved to testing with live traffic
    – Use AJAX calls to simulate requests
    – Selective enable for X% of traffic
  • Launched data collection/write traffic first
    – Allowed tweaking configuration before impacting user experience

HBase @ Meetup Issues along the way
• High CPU / Concurrency issues
  • Updated to 0.20 release for performance gains across the board
  • Replaced “tableindexed” usage with application level secondary indexing
• “Hot regions” - profile page hits small table every page load
  • Force split table to distribute across multiple servers
  • “Newest” region still handling high load
    – changed index keying to <value % 100>-<value>-<timestamp> for even distribution
• I/O Heavy load / MemberFeedIndex table growing
  • Lowered MemberFeedIndex time-to-live to 2 months
  • Enabled LZO compression
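
The re-keyed index format can be sketched as follows. Prefixing the key with `value % 100` spreads consecutive values over 100 key ranges instead of sending them all to the last region. `SaltedKeys` and `saltedKey` are hypothetical names, the padding widths are assumptions carried over from the earlier key examples, and the slide does not say whether the timestamp part stays inverted, so it is passed through as-is.

```java
import java.util.Locale;

final class SaltedKeys {
    /** Number of key prefixes ("buckets"); 100 matches the slide's value % 100. */
    static final long BUCKETS = 100L;

    private SaltedKeys() {}

    /** Builds <value % 100>-<value>-<timestamp>, with assumed fixed-width padding. */
    static String saltedKey(long value, long timestampPart) {
        // floorMod keeps the bucket in 0..99 even if a negative value slips in.
        long bucket = Math.floorMod(value, BUCKETS);
        return String.format(Locale.ROOT, "%02d-%010d-%019d", bucket, value, timestampPart);
    }
}
```

A reader that knows the value can recompute the same prefix, so lookups by value still resolve to a single contiguous key range. What is given up is a single ordered scan across all values.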

HBase @ Meetup Current Status
• Live traffic growing
  • Cluster handling ~2.5k – 3k requests/sec
  • 50+% still write traffic
  • ~17% of page views hit HBase (for reads)
  • Expanding to 30% of page views in coming months
• Meetup.Beeno now open-source on GitHub:
  • http://github.com/ghelmling/meetup.beeno
• Next up
  • Continue tweaking
  • Site analytics
