
MongoDB & EC2: A Love Story? - Eytan Daniyalzade, Chartbeat

MongoDB New York User Group
October 18th, 2011

Transcript

  1. Contents
     • Chartbeat
     • Architecture
     • MongoDB & EC2 Challenges
     • Happy Ending: MongoDB ♥ EC2
     • Takeaways
  2. Chartbeat: real-time analytics service
     • 18-person startup in New York
     • part of Betaworks
     • peaking at just under 5M concurrent users daily
       ◦ up from 1M in July 2010
  3. What chartbeat Provides
     • real-time view of site performance
       ◦ top pages
       ◦ new/returning visitors
       ◦ traffic flow
         ▪ where are people coming from
         ▪ where are people going to
     • historic replay for the last 30 days
  4. Architecture, Browser
     Part 1 (in <head>):
       <script type="text/javascript">var _sf_startpt=(new Date()).getTime()</script>
       ...
     Part 2 (before </body>):
       ...
       function loadChartbeat() { // insert script tag }
       window.onload = loadChartbeat;
     (highly simplified)
     Ping is standard beacon logic, i.e. loading a 1x1 image.
  5. Architecture, Backend
     • custom libevent-based C backend
       ◦ real-time collection and aggregation
     • real-time system is in-memory only
     • background queue jobs snapshot every x minutes
       ◦ Gearman
     • historical data
       ◦ mostly in MongoDB
  6. Why Chartbeat uses MongoDB
     • Pure JSON all along
       ◦ Live API
       ◦ Historical data
       ◦ No mapping back and forth
     • Fast Inserts (fire and forget) - see the sketch below
     • Flexible Schema
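     A minimal pymongo sketch of a fire-and-forget (unacknowledged) insert; the
     database/collection names and the modern driver API are illustrative, not
     Chartbeat's actual code:

         from pymongo import MongoClient
         from pymongo.write_concern import WriteConcern

         client = MongoClient("mongodb://localhost:27017")

         # w=0 tells the driver not to wait for a server acknowledgement,
         # which keeps the insert path cheap ("fire and forget").
         pings = client.chartbeat.get_collection(
             "pings", write_concern=WriteConcern(w=0))
         pings.insert_one({"host": "example.com", "path": "/", "ts": 1318900000})

         # The flexible schema lets a later ping carry extra fields
         # without any migration.
         pings.insert_one({"host": "example.com", "path": "/", "ts": 1318900060,
                           "referrer": "twitter.com"})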
  7. Chartbeat & MongoDB & EC2 (1)
     • 3 Clusters
       ◦ 1 for each product
       ◦ 1 as a caching layer
       ◦ 2 - 4 instances/cluster
     • m2.2xlarge
       ◦ 34.2 GB memory
       ◦ Ubuntu 10.04
       ◦ RAID0 x 4 - 1 TB volumes
     • Dedicated Snapshot Server
       ◦ Shared among clusters
       ◦ Serves as an arbiter as well
  8. MongoDB & EC2 Challenges
     • Instances disappear
       ◦ MongoDB can have long recovery operations
       ◦ MongoDB is (was) not ACID compliant; an unclean shutdown could corrupt your data
     • Poor IO performance on EBS
       ◦ MongoDB has a global read/write lock
     • Variable IO performance on EBS
       ◦ Could cause replication issues
  9. Instances Disappearing - Master/Slave
     • Down-time :(
     • Slave promotion = headache
       ◦ New instance
       ◦ Copy oplog
       ◦ Code change
       ◦ Long/manual/error-prone
  10. Instances Disappearing - Replica Sets
      • No down-time :) yay!
      • Automatic failover on writes
      • Eventual failover on reads
      • No code change
  11. Instances Disappearing - Replica Sets (caveats)
      • pymongo driver reads/writes from primary
        ◦ pymongo 2.1 will fix this
      • chartbeat pymongo driver (see the sketch below)
        ◦ based on MasterSlaveConnection
        ◦ writes to primary
        ◦ distributes reads among secondaries
        ◦ automatic failover
        ◦ eventual read re-distribution
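      A rough sketch of the same behaviour using the read-preference API in
      current pymongo (host names and the replica-set name are made up; this is
      not Chartbeat's MasterSlaveConnection-based driver):

          from pymongo import MongoClient, ReadPreference

          client = MongoClient(
              "mongodb://db1.example.com,db2.example.com,db3.example.com",
              replicaSet="rs0",
          )
          db = client.chartbeat

          # Writes always go to the current primary; the driver fails over
          # automatically when a new primary is elected.
          db.events.insert_one({"key": "page-123", "ts": 1318900000})

          # Reads are spread across healthy secondaries and fall back to the
          # primary if none are available.
          secondary_reads = db.get_collection(
              "events", read_preference=ReadPreference.SECONDARY_PREFERRED)
          doc = secondary_reads.find_one({"key": "page-123"})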
  12. Instances Disappearing - Fact of Life
      • Accept this fact of life
      • Always snapshot
        ◦ Dedicated snapshot server
        ◦ Hidden, i.e. no reads
      • Automate everything (see the sketch below)
        ◦ puppet
          ▪ New instance from scratch within a minute
        ◦ python-boto
          ▪ Script all EC2 interaction
          ▪ new_instance.py
          ▪ mount_volumes_from_snap.py -o iid -n iid
          ▪ snapshot_mongo.py
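      The scripts named above are Chartbeat-internal; a minimal sketch of what a
      snapshot script in that style might do with pymongo and boto (volume IDs,
      host and region are assumptions):

          import boto.ec2
          from pymongo import MongoClient

          VOLUME_IDS = ["vol-11111111", "vol-22222222",
                        "vol-33333333", "vol-44444444"]

          mongo = MongoClient("snapshot-host.example.com", 27017)
          ec2 = boto.ec2.connect_to_region("us-east-1")

          # Flush data files to disk and block writes while the RAID0
          # members are snapshotted, so the snapshots are consistent.
          mongo.admin.command("fsync", lock=True)
          try:
              for vol_id in VOLUME_IDS:
                  ec2.create_snapshot(vol_id,
                                      description="mongo RAID0 member snapshot")
          finally:
              # fsyncUnlock needs a recent MongoDB; older servers unlocked
              # via db.fsyncUnlock() in the shell.
              mongo.admin.command("fsyncUnlock")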
  13. Instances Disappearing - Caveats
      • New volumes - slow!!!
        ◦ EBS loads blocks lazily
      • Warm up EBS & file cache before use
        ◦ Options
          ▪ Slowly direct the reads (app by app)
          ▪ Run cache warm-up scripts (see the sketch below)
        ◦ Not automated currently
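      A minimal warm-up sketch: read every block of the MongoDB data files so
      EBS pulls them from the snapshot and the OS file cache is populated before
      the node takes traffic (the dbpath is an assumption):

          import os

          DBPATH = "/data/db"
          CHUNK = 1024 * 1024  # read 1 MB at a time

          def warm_file(path):
              with open(path, "rb") as f:
                  while f.read(CHUNK):
                      pass

          for root, _dirs, files in os.walk(DBPATH):
              for name in files:
                  warm_file(os.path.join(root, name))
                  print("warmed", os.path.join(root, name))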
  14. Poor IO Performance on EBS
      • XFS & RAIDing helps, but
      • Disk IO varies over time
      • MongoDB holds a global lock on writes
      • Query of death
        ◦ Grinding halt if not careful
  15. Case Study: Historical Data
      • For historical data, we store time series (see the sketch below):
        { key: <key>,
          ts: <timestamp>,
          values: {metric1: int1, metric2: int2},
          meta: {} }
      • High insert rate vs fast historical read
        ◦ Optimize reads or writes?
      • Fast inserts: ~1 MB/sec (through append only)
        ◦ No disk seek
      • Historical reads: painfully slow
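      A sketch of inserting one such sample with pymongo (database, collection
      and metric names are illustrative):

          import time
          from pymongo import MongoClient

          client = MongoClient("mongodb://localhost:27017")
          samples = client.history.samples

          # One document per (key, timestamp) sample, per the schema above.
          samples.insert_one({
              "key": "page-123",
              "ts": int(time.time()),
              "values": {"concurrents": 4200, "new_visitors": 310},
              "meta": {},
          })

          # Reading a day of history touches thousands of scattered documents,
          # which is why historical reads were painfully slow.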
  16. Faster Reads Through Cache DB
      • Avoid reading from disk
      • Favor reads over writes
      • Aim for disk & memory locality (see the sketch below):
        { day_tskey: <key>,
          values: {metric1: list(int), metric2: list(int)} }
      • Data for historical reads resides together
      • .append() to a list could cause disk fragmentation
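      A sketch of the cache-DB write path (field names follow the slide; the
      collection name is made up):

          from pymongo import MongoClient

          client = MongoClient("mongodb://localhost:27017")
          cache = client.history.cache

          # One document per (day, key) holds arrays of samples, so a day's
          # history is read back with a single fetch.
          cache.update_one(
              {"day_tskey": "20111018_page-123"},
              {"$push": {"values.concurrents": 4200,
                         "values.new_visitors": 310}},
              upsert=True,
          )

          # Each $push can grow the document past its allocated space and force
          # MongoDB to move it on disk - the fragmentation the next slide
          # addresses with preallocation.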
  17. Avoid Fragmentation w/ Preallocation
      • Fragmentation causes:
        ◦ Inefficient disk usage
        ◦ Slower writes (due to block allocation)
      • Preallocate daily arrays instead (see the sketch below)
        ◦ Pros:
          ▪ No fragmentation
          ▪ Writes cause no change in data size
        ◦ Cons:
          ▪ Wasteful (we don't know keys ahead of time)
          ▪ Requires heavy disk IO, ~7 MB/sec (~60 Mbit/sec on EBS)
      • Conclusion: spread preallocation over 1 hour
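      A preallocation sketch: zero-fill a full day's arrays for each key,
      spreading the heavy insert IO over roughly an hour, then update samples in
      place so the document size never changes (bucket size, key list and
      collection name are assumptions):

          import time
          from pymongo import MongoClient

          client = MongoClient("mongodb://localhost:27017")
          cache = client.history.cache

          BUCKETS_PER_DAY = 24 * 60 // 5        # one slot per 5-minute snapshot
          keys = ["page-123", "page-456"]       # in practice, yesterday's keys
          pause = 3600.0 / max(len(keys), 1)    # spread the work over ~1 hour

          for key in keys:
              cache.insert_one({
                  "day_tskey": "20111019_" + key,
                  "values": {
                      "concurrents": [0] * BUCKETS_PER_DAY,
                      "new_visitors": [0] * BUCKETS_PER_DAY,
                  },
              })
              time.sleep(pause)

          # Writing bucket 42 later rewrites bytes in place rather than
          # growing the document.
          cache.update_one(
              {"day_tskey": "20111019_page-123"},
              {"$set": {"values.concurrents.42": 4200}},
          )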
  18. EC2 Unpredictability - Challenges
      • Resource contention in a virtualized environment
      • EBS and network IO performance varies drastically
      • RAID0 over 4 disks = 4 x the risk
  19. Heavy Monitoring (1)
      • Track individual disk performance over time (see the sketch below)
      • Create a new instance if a disk is not getting better
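      A sketch of per-disk monitoring from /proc/diskstats, reporting how busy
      each RAID0 member was during a sampling interval (device names are
      assumptions):

          import time

          DEVICES = {"xvdf", "xvdg", "xvdh", "xvdi"}  # the four EBS RAID0 members

          def io_ms():
              busy = {}
              with open("/proc/diskstats") as f:
                  for line in f:
                      fields = line.split()
                      if fields[2] in DEVICES:
                          busy[fields[2]] = int(fields[12])  # ms spent doing IO
              return busy

          before = io_ms()
          time.sleep(10)
          after = io_ms()

          for dev in sorted(DEVICES):
              util = (after[dev] - before[dev]) / 10000.0 * 100  # % of interval
              print("%s: %.1f%% busy" % (dev, util))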
  20. Heavy Monitoring (2)
      • Monitor replication lag (see the sketch below)
      • Remove a secondary from the read mix if its lag gets too high
        ◦ Incorrect data
        ◦ Strain on primary
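      A sketch of lag monitoring via replSetGetStatus (host name, replica-set
      name and threshold are assumptions):

          from pymongo import MongoClient

          MAX_LAG_SECONDS = 30

          client = MongoClient("mongodb://db1.example.com", replicaSet="rs0")
          status = client.admin.command("replSetGetStatus")

          primary_optime = None
          for member in status["members"]:
              if member["stateStr"] == "PRIMARY":
                  primary_optime = member["optimeDate"]

          for member in status["members"]:
              if member["stateStr"] != "SECONDARY":
                  continue
              lag = (primary_optime - member["optimeDate"]).total_seconds()
              if lag > MAX_LAG_SECONDS:
                  print("%s lagging by %ds - remove from read mix"
                        % (member["name"], lag))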
  21. Heavy Monitoring (3)
      • Track slow queries / opcounts / page faults / IO volume (see the sketch below)
        ◦ Tweak indexes accordingly
        ◦ Limit requested data size if you can
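      A sketch pulling opcounters and page faults from serverStatus, plus recent
      slow operations from the profiler (assumes profiling level 1 is already
      enabled on the database; names are illustrative):

          from pymongo import MongoClient

          client = MongoClient("mongodb://localhost:27017")

          status = client.admin.command("serverStatus")
          print("opcounters:", status["opcounters"])
          print("page faults:", status["extra_info"].get("page_faults"))

          # Operations slower than 100 ms recorded by the profiler.
          slow = client.history["system.profile"].find(
              {"millis": {"$gt": 100}}).sort("ts", -1).limit(5)
          for op in slow:
              print(op["op"], op.get("ns"), op["millis"], "ms")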
  22. Open Issues
      • More granular page-fault / memory usage information
        ◦ Difficult due to mmap
      • Multi-datacenter usage
      • Burn-in scripts
      • Sharding
        ◦ Tipping point will be insert volume
        ◦ Or inefficient read memory usage
      • Better understand replication failures
  23. Take-aways (1)
      • Automate everything
        ◦ Instance creation, snapshotting, mount/unmount
      • Strive for high locality & low fragmentation
      • Repeatedly revise schema/index
      • Heavily monitor
        ◦ Server: IO/mem/disk
        ◦ MongoDB: Opcounts/Index hits/Slow queries
        ◦ Cluster: Replication lag
        ◦ Application: CRUD times