Monitoring Production Systems At Wix

Red Alert Or False Alarm: Monitoring Production Systems
Aviran Mordo
Head Of Back-End Engineering @ Wix
@aviranm
http://www.linkedin.com/in/aviran
http://www.aviransplace.com

About Wix

Wix in Numbers
• Over 39,000,000 users
  – Adding over 1,000,000 new users each month
• Static storage is over 150TB of data
  – Adding over 1TB of files every day
• 3 data centers + 2 clouds (Google AE, Amazon)
  – Around 300 servers
• Over 100,000,000 server API calls per day
• Over 450 people work at Wix
  – ~150 people in R&D

End user monitoring

Pingdom
Cons
• No early warning
  – Only when the site is down
• Don’t know what the problem is
• Does not monitor the API
Pros
• 24/7 uptime monitoring
• Different geo locations

Keynote
Cons
• Manually record flows
• Does not monitor internal servers
Pros
• Transaction monitoring from a real user perspective
• Supports Flash
• Different geo locations

Monitor Hardware and OS
Cons
• Monitors at the OS level, not the application level*
• Does not know when there is a problem with the application (the majority of problems)
• Hard to manage in a large-scale environment
Pros
• Monitors machine health
• Built-in integration with Graphite
• Custom checks

Look inside the application

Server Logs
Cons
• Too much information
• Hard to read, not friendly to developers
• Pinpointing the problem takes a long time
• Server clusters need log aggregation
• Helpful mostly for retrospective investigation
• Most developers don’t know how to configure logs properly
Pros
• Verbose and flexible

Self Reporting Framework
• Automatic method-level performance reporting
• Custom metering
• Exception classifications
• 4 severity levels (Recoverable, Warning, Error, Fatal)
• Business Exceptions
• System Exceptions
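A minimal sketch of what method-level self-reporting with exception classification could look like. This is not Wix's framework: the names Reporter, BusinessError, SystemFailure and load_site are hypothetical, and the default severities are assumptions chosen for illustration. Only the four severity levels and the business/system split come from the slide.

```python
import time
from collections import defaultdict
from enum import Enum
from functools import wraps


class Severity(Enum):
    # The four levels named on the slide.
    RECOVERABLE = 1
    WARNING = 2
    ERROR = 3
    FATAL = 4


class BusinessError(Exception):
    """Hypothetical base class for expected, domain-level failures."""
    severity = Severity.WARNING  # assumed default


class SystemFailure(Exception):
    """Hypothetical base class for infrastructure-level failures."""
    severity = Severity.ERROR  # assumed default


class Reporter:
    """Collects per-method call counts, total time, and classified errors."""

    def __init__(self):
        self.call_count = defaultdict(int)
        self.total_seconds = defaultdict(float)
        self.errors = defaultdict(int)  # key: (method, kind, severity name)

    def classify(self, exc):
        # Anything that is not a BusinessError is treated as a system error.
        kind = "business" if isinstance(exc, BusinessError) else "system"
        severity = getattr(exc, "severity", Severity.ERROR)
        return kind, severity

    def reported(self, func):
        name = func.__name__

        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                kind, severity = self.classify(exc)
                self.errors[(name, kind, severity.name)] += 1
                raise  # reporting must not swallow the failure
            finally:
                self.call_count[name] += 1
                self.total_seconds[name] += time.perf_counter() - start

        return wrapper


reporter = Reporter()


@reporter.reported
def load_site(site_id):
    # Hypothetical service method used only to exercise the decorator.
    if site_id < 0:
        raise BusinessError("unknown site")
    return {"id": site_id}
```

Wrapping methods with a decorator keeps the reporting out of the business code, which is one way to get the "automatic" behavior the slide describes.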

App-Info

App-Info Monitoring
• Expose via API as JSON
• Collect metrics via Nagios / Graphite
• Nagios alerts based on app-info metrics
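As an illustration of alerting on app-info metrics, the sketch below evaluates one metric from a JSON document and maps it to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The JSON shape, the metric name error_rate, and the thresholds are assumptions; the slides do not show the actual app-info payload.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_app_info(payload, metric, warn, crit):
    """Return (exit_code, message) for one metric in an app-info JSON payload.

    `payload` is the JSON text a server would expose; its structure here
    (a flat object of metric name to number) is an assumption.
    """
    try:
        value = float(json.loads(payload)[metric])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing metric, or a non-numeric value.
        return UNKNOWN, f"UNKNOWN - cannot read {metric}"
    if value >= crit:
        return CRITICAL, f"CRITICAL - {metric}={value}"
    if value >= warn:
        return WARNING, f"WARNING - {metric}={value}"
    return OK, f"OK - {metric}={value}"
```

A real plugin would fetch the payload over HTTP, print the message, and exit with the returned code; those steps are left out to keep the sketch self-contained.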

App-Info Monitoring
Cons
• Coarse-grained information for an overview
• Too much information
Pros
• Detailed and easy view of a server
• Almost no need to look at logs

Log collections
• Client & server logs are collected with Flume and Syslog-ng
• Storm + Esper analyze log events and feed Graphite
• Stored in Hadoop + HBase for in-depth analysis
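To make the "analyze log events and feed a metric" step concrete, here is a small stand-in written in plain Python. It is not Storm or Esper code; it only shows the kind of sliding-window count a stream processor might compute before emitting a value. The class name WindowedCounter is hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import deque


class WindowedCounter:
    """Counts events seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        # Assumes timestamps are non-decreasing.
        self.timestamps.append(timestamp)
        self._evict(timestamp)

    def count(self, now):
        self._evict(now)
        return len(self.timestamps)

    def _evict(self, now):
        # Drop events that are a full window or more older than `now`.
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
```

The value returned by count() is what would be sent on to the metrics store, for example once per reporting interval.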

Graphite
• All systems feed Graphite with metrics (Nagios, App-Info, Storm)
• Nagios queries Graphite and triggers alerts
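A rough sketch of both directions. Feeding uses Graphite's plaintext protocol, where each metric is one line of the form "path value timestamp". Querying reads the JSON that Graphite's render API returns, where each series carries a list of [value, timestamp] datapoints and missing values are null. The metric path wix.editor.app01.errors is a made-up example of a dotted hierarchy, not a name from the talk, and the network calls themselves are omitted.

```python
import json


def graphite_line(path, value, timestamp):
    """Format one metric for Graphite's plaintext protocol."""
    return f"{path} {value} {int(timestamp)}\n"


def latest_value(render_json, target):
    """Return the most recent non-null value for `target`, or None.

    `render_json` is the text of a render API response in JSON format:
    a list of {"target": ..., "datapoints": [[value, timestamp], ...]}.
    """
    for series in json.loads(render_json):
        if series["target"] == target:
            for value, _timestamp in reversed(series["datapoints"]):
                if value is not None:
                    return value
    return None
```

An alerting check in the spirit of the slide would compare latest_value(...) against a threshold and raise an alert when it is exceeded.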

Graphite
Cons
• Not a dashboard (you can build a dashboard on top of it)
• The data schema (hierarchy) must be designed in advance
• Difficult to get scaling right
Pros
• Numerous formulas available
• Share graphs
• Easy to create new graphs

New Relic
Pros
• Easy to use – developer friendly
• Service-level overview (both cluster and single server)
• Customizable dashboards
• JVM profiler on production
• Code instrumentation
• Real User Monitoring
• Hardware monitoring

New Relic
Cons
• No distributed transaction trace for a specific server
• No exception classification
• A lot of false alarms due to misbehaving bots
• False alarms for low-throughput services
• Hard to pinpoint a problematic server / operation
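The low-throughput false alarm is easy to reproduce: one failure out of two requests is a 50% error rate. One common mitigation, sketched here as an assumption and not as something the talk or New Relic prescribes, is to ignore the error rate until a minimum number of requests has been seen. The function name and default thresholds are hypothetical.

```python
def should_alert(errors, requests, max_error_rate=0.05, min_requests=100):
    """Alert on error rate only when there is enough traffic to trust it."""
    if requests < min_requests:
        # Too few samples: a single failure would dominate the ratio.
        return False
    return errors / requests > max_error_rate
```

With these defaults, should_alert(1, 2) stays quiet while should_alert(10, 100) fires.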

What’s Next

Aviran Mordo
@aviranm
http://www.linkedin.com/in/aviran
http://www.aviransplace.com
http://www.slideshare.net/aviranwix/monitoring-production
