

mongodb
April 20, 2012

The Making of a Humongous World Wide Monitor Network
Erik Torlén, Product Manager, Apica AB

At MongoDB Stockholm 2012, Erik Torlén of Apica AB presented "The Making of a Humongous World Wide Monitor Network".

How we built a new-generation worldwide performance monitoring network with redundancy and scalability in mind, from beginning to release. In this talk we will go over: How we use MongoDB in different datacenters with replicas, and how important it is for the app to work // How we used extensive load testing (and A/B load testing) to verify the performance and functionality of our app & DB // Using MongoHQ as the "operations" team for our MongoDB instances: how it works and why we chose it // Our requirements when we started development of this new monitor network, based on our earlier experience with the old network & SQL Server // The importance of monitoring your installation and keeping track of app & performance // The need to test the different failure situations and the complete setup (app + DB).


Transcript

  1. Agenda
     - Introduction
     - Who are Apica
     - Our services
     - Monitoring service
     - New platform
     - How it works
     - Why MongoDB
     - Using MongoDB
     - Development
     - Operation
     - Loadtest
     - Monitor
  2. About Apica
     - Founded in 2005, Stockholm, Sweden
     - Backed by KTH, Industrifonden, Almi Invest
     - Pioneers in Web Performance Monitoring and Load Testing
     - Headquarters in Stockholm and offices in US & UK
     - 270+ worldwide customers
  3. Apica WebPerformance?
     - SaaS for web performance monitoring
     - Real web browser monitoring
     - Selenium script support
     - URL, Ping, Port etc.
     - Analyzing data
     - Screenshots
     - Rendering waterfall graph
     - Reporting, Alerting, SLA, Trends etc.
  4. Infrastructure
     - 81 locations worldwide
     - ~325 servers
     - Central system in Stockholm operating with five datacenters
  5. FullPageRender (FPR) Monitoring Stats
     - ~4700 FPR navigations/hour (~1.3 jobs/sec)
     - ~1300 configured FPR checks
     - Went from 1500 to 3000 navigations/hour in just 2 months
     [Chart: Runs/h and Total Checks, 2011-01-12 to 2012-03-07, y-axis 0 to 5000]
     FPR = Navigation on a website using Firefox
  6. How it works
     - Infrastructure
     - Workers
     - Controller Servers
     - Database (MongoDB)
     - Storing results
     - Response times
     - Screenshots
     - Req/Resp Content & Hdrs
     - REST API
     - CRUD operations
     - Integration with partners
     - Public access by Q3
     [Diagram labels: Apica WebPerformance, REST API, ApicaNet]
  7. Why we chose MongoDB
     - Great support from 10gen and the community
     - Redundancy with replica sets
     - Secure data with >1 datacenter
     - No SPOF
     - Sharding to handle our growth in the future
     - Easy to use & implement
  8. Our MongoDB
     - Infrastructure
     - Running in three datacenters
     - Primary, Secondary & Arbiter, one per datacenter
     - Replication over fibre between Primary & Secondary
     - MongoDB 2.0.4
     - pymongo 2.1.1
     [Diagram: DC1 Primary, DC2 Secondary, DC3 Arbiter]
  9. Operation
     - MongoHQ
     - Operating the DB, Apica operates the hardware
     - Online dashboard with DB Mgmt & Live stats
     - Setup, configuration, operation, testing and support 24/7
     - Outsourcing operation of MongoDB
     - Faster production release
     - Lack of experience
     - Building up knowledge & experience internally over time
  10. Loadtesting database: Finding your weak spots and knowing your numbers
      - MongoDB
      - Shutting down primary (graceful + ungraceful)
      - Stepping down primary (rs.stepDown())
      - Jobs/s: writes vs. reads
      - Pymongo
      - Handling failover cases
      - Reconnection
  11. Outcome of Loadtests against MongoDB
      - MongoDB
      - Misconfiguration caused members to become read-only
      - Latency caused primary to become unavailable
      - Put the arbiter in DC3 to avoid read-only on primary
      - Pymongo & App
      - App took 50 sec to start working when primary went down
      - Upgrading pymongo from 2.0.1 to 2.1.1 pushed it down to 10 sec
      - Reads to secondary node (2.1)
      - Better performance of the app
      - More use of the secondary node
      - Reads were ~2x faster than writes
  12. Monitoring: Awareness of your application
      - Integrated MongoDB monitoring into Apica WebPerformance
      - Knowing your application leads to better decisions
      - Following trends in response time
      - Finding weak spots
      - Alerts when something fails
      - Using MMS from 10gen
  13. Tips & Suggestions
      - Start testing your application ASAP (and continue with it)
      - Don’t be afraid to make changes from your original design
      - Work actively with monitoring
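
The load-test slides mention handling failover cases and reconnection in pymongo but do not show the application code. The following is a minimal sketch of one common approach, not the speaker's implementation: wrap each database operation in a retry loop that catches pymongo's `AutoReconnect` error while a new primary is elected. The helper name `with_failover_retry`, the `FlakyInsert` stand-in, and the attempt and delay values are all hypothetical. A fallback exception class is defined so the sketch also runs where pymongo is not installed.

```python
import time

try:
    # Real pymongo exception, raised when the connection to the primary is lost.
    from pymongo.errors import AutoReconnect
except ImportError:
    # Stand-in (assumption) so the sketch runs without pymongo installed.
    class AutoReconnect(Exception):
        pass


def with_failover_retry(operation, attempts=5, delay=0.5, sleep=time.sleep):
    """Call operation(); on AutoReconnect, wait and retry.

    The wait grows linearly (delay, 2*delay, ...). After the last
    failed attempt the exception is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except AutoReconnect:
            if attempt == attempts:
                raise
            sleep(delay * attempt)


class FlakyInsert:
    """Hypothetical stand-in for a write that fails while no primary exists."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise AutoReconnect("no primary available")
        return "ok"


if __name__ == "__main__":
    op = FlakyInsert(failures=2)
    # Short delay here only to keep the demo quick.
    print(with_failover_retry(op, delay=0.01), "after", op.calls, "calls")
```

In a real application, `operation` would be a small function or lambda around a pymongo call such as an insert. Retrying a write blindly is only safe when the write is idempotent, so that is worth checking as part of the failover tests the slides recommend.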