Always on with Opsgenie

All about the Opsgenie product

Serhat Can

March 05, 2019

Transcript

  1. About me

    Software Engineer, Technical Evangelist at Opsgenie, Atlassian. Podcast: Turuncu Pasaport. AWS Community Hero. @srhtcn. Community: (1) Devopsdays global core team, (2) Serverless Turkey meetup, (3) Devops Turkey meetup, (4) Devopsdays İstanbul.
  2. We believe behind every great human achievement, there is a

    team. Our mission is to unleash the potential of every team. OUR MISSION
  3. We believe behind every long-lasting incident, there is a lack

    of preparation. Our mission is to unleash the potential of every on-call team. OUR MISSION
  4. THE IMPACT OF DOWNTIME AND PERFORMANCE DEGRADATION Direct revenue loss

    Unhappy users Loss of credibility Loss of opportunity Stress
  5. Alert: An alarm or warning for an event that may

    affect operations or a service.

  6. Increasing demands: Maintain high availability, performance, and security within more

    complex systems. Dev - Ops: better alignment of development and operations. Put developers on-call.
  7. Increasing demands: Maintain high availability, performance, and security within more

    complex systems. Dev - Ops: better alignment of development and operations. Management - Dev: better alignment of management and development. Put developers on-call.
  8. Multiple Alerting Channels Most monitoring tools send notifications via email,

    however, email is not sufficient when alerts are time sensitive and rapid response is necessary. Opsgenie uses multiple communications channels, including email, SMS, mobile push, and voice calls, to ensure recipients are notified in a timely manner.
  9. Alert Enrichment Short text messages often cannot convey sufficient information

    to empower users to make effective decisions. Opsgenie alerts are not limited to a few characters! Add optional fields to your alerts and attach charts, logs, runbooks, and more to further enrich them, provide context, and enable recipients to determine the right course of action.
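
As a rough illustration of enrichment, the sketch below builds a JSON alert body that carries extra context alongside the short message. The field names (`message`, `description`, `priority`, `tags`, `details`) are assumed to resemble Opsgenie's alert fields and should be checked against the official API documentation; the host name and runbook URL are hypothetical.

```python
import json


def build_enriched_alert(message, host, runbook_url, log_excerpt):
    """Build a JSON-serializable alert body with extra context attached."""
    return {
        "message": message,  # short summary shown in notifications
        "description": f"Recent log output from {host}:\n{log_excerpt}",
        "priority": "P2",  # assumed P1 (highest) to P5 (lowest) scale
        "tags": ["enriched", host],
        "details": {  # free-form key/value context for the responder
            "host": host,
            "runbook": runbook_url,
        },
    }


# Hypothetical values, for illustration only.
body = json.dumps(
    build_enriched_alert(
        "Disk usage above 90%",
        "db-01",
        "https://wiki.example.com/runbooks/disk",
        "/dev/sda1 93% used",
    )
)
```

The resulting `body` string is what a monitoring script would send as the request payload; the transport itself is left out of the sketch on purpose.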
  10. Custom Alert Actions Respond to alerts by initiating appropriate actions

    directly from the Opsgenie Application. In addition to the default alert actions such as "Add Note" and "Close", you can respond to alerts by executing investigative and corrective actions. For example, you can ping or restart a server or create a service ticket with a click of a button.
  11. Automated Actions Create action policies that automatically run diagnostic or

    remediation actions in response to incoming alerts. Through integration with AWS Systems Manager or other 3rd-party automation platforms, Opsgenie will trigger your response playbooks when an alert meets your predefined criteria. The system can take corrective action without involving your on-call engineers, reducing alert fatigue and reducing MTTR.
  12. Alert Lifecycle Tracking Opsgenie provides detailed tracking for each alert.

    The alert activity log presents all activity related to the alert: when the alert was created, who was notified, when the notifications were sent, and whether the recipients have seen the alert or taken any action. Tracking is performed seamlessly without requiring specific user action, whenever possible.
  13. Alert & Notification Policies To combat alert fatigue, get notified

    differently depending on the source of the alert, priority, or time of day. Opsgenie provides the flexibility to suppress, delay, or expedite alerts based on their content and timing.
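
The suppress, delay, or expedite decision can be pictured as a small rule function. This is a minimal sketch, not Opsgenie's policy engine: the `Alert` type, the `staging-monitor` source name, the night-time window, and the rule order are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Alert:
    source: str
    priority: str  # assumed scale: "P1" (highest) to "P5" (lowest)


def decide(alert: Alert, now: time) -> str:
    """Return 'expedite', 'suppress', 'delay', or 'notify' for an alert."""
    # Assumed quiet hours: 22:00 to 07:00.
    night = now >= time(22, 0) or now < time(7, 0)
    if alert.priority == "P1":
        return "expedite"  # critical alerts always go out immediately
    if alert.source == "staging-monitor":
        return "suppress"  # hypothetical noisy source
    if night and alert.priority in ("P4", "P5"):
        return "delay"  # low priority can wait until morning
    return "notify"
```

For example, `decide(Alert("prod", "P5"), time(23, 30))` falls into the quiet-hours rule and is delayed, while the same alert at noon is delivered normally.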
  14. Heartbeats How do you know that your monitoring systems are

    working and creating alerts? Opsgenie Heartbeats ensures alerting works end-to-end, by checking that monitoring tools are active and connected, and that custom tasks are completed on schedule. When an absence of signal is detected within a specified timeframe, Opsgenie instantly alerts you of the problem.
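
Detecting an absence of signal comes down to comparing each sender's last ping with an allowed interval. The sketch below shows that check in isolation; the function name and the in-memory dictionary of last-seen timestamps are hypothetical and say nothing about how Opsgenie stores heartbeats.

```python
from datetime import datetime, timedelta


def expired_heartbeats(last_ping, interval, now):
    """Return the names whose last ping is older than the allowed interval."""
    return sorted(
        name for name, seen_at in last_ping.items() if now - seen_at > interval
    )


# Hypothetical heartbeat names and timestamps.
now = datetime(2019, 3, 5, 12, 0)
last_ping = {
    "nightly-backup": now - timedelta(minutes=20),
    "nagios": now - timedelta(minutes=2),
}
overdue = expired_heartbeats(last_ping, timedelta(minutes=10), now)
```

Each name in `overdue` would then be turned into an alert, which is how a silent monitoring tool or a missed scheduled task gets noticed.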
  15. On-Call Schedule Management Easily create on-call schedules with daily, weekly

    and custom rotations. Leverage multiple scheduling rules to use different rotations at different times. You can define sophisticated scheduling scenarios such as after-hours coverage, weekdays and weekends, and geographically distributed teams coverage.
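
A daily or weekly rotation can be computed from a start date and an ordered list of participants. The following is a minimal sketch under those assumptions; the function name and participant names are hypothetical, and real schedules add time zones, restrictions, and overrides on top.

```python
from datetime import date


def on_call_for(day, participants, rotation_start, rotation_days=7):
    """Return who is on call on `day` for a fixed-length rotation."""
    if day < rotation_start:
        raise ValueError("day precedes rotation start")
    # Number of completed shifts since the rotation began.
    shift_index = (day - rotation_start).days // rotation_days
    return participants[shift_index % len(participants)]


team = ["ayse", "mehmet", "zeynep"]  # hypothetical participants
start = date(2019, 3, 4)  # a Monday
this_week = on_call_for(date(2019, 3, 5), team, start)
next_week = on_call_for(date(2019, 3, 11), team, start)
```

Passing `rotation_days=1` turns the same function into a daily rotation, which is one way to model "different rotations at different times" as separate rules.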
  16. Routing Rules and Escalations Opsgenie’s flexible routing rules enable the

    right teams to be notified based on the source, priority, and timing of the issue. Escalations ensure that the alert gets the necessary attention when an alert is not acknowledged within a certain amount of time. For example, if the person on-call does not respond to a high-priority alert within 5 minutes, you can automatically notify another person or team.
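
The 5-minute example can be expressed as an ordered list of escalation steps. This sketch assumes a simple model in which each step fires once its delay has elapsed and the alert is still unacknowledged; the step targets and the 15-minute third step are made up for illustration.

```python
from datetime import timedelta

# Hypothetical escalation policy: (delay since alert creation, target).
ESCALATION_STEPS = [
    (timedelta(minutes=0), "on-call engineer"),
    (timedelta(minutes=5), "secondary on-call"),
    (timedelta(minutes=15), "team lead"),
]


def who_to_notify(elapsed, acknowledged):
    """Return the targets that should have been notified by `elapsed`.

    An acknowledged alert stops the escalation, so nobody is returned.
    """
    if acknowledged:
        return []
    return [target for delay, target in ESCALATION_STEPS if elapsed >= delay]
```

With this policy, `who_to_notify(timedelta(minutes=6), acknowledged=False)` includes the secondary on-call, while the same call with `acknowledged=True` returns an empty list.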
  17. On-Call Overrides When one user has scheduling issues or conflicts,

    others can easily take shifts and transfer responsibility, without administrative involvement.
  18. On-Call Reminder Notifications Opsgenie ensures your team is kept aware

    of their duties. Opsgenie automatically notifies users when their shifts begin and end.
  19. ChatOps Create and take actions on Opsgenie alerts and schedules

    from inside your ChatOps tool, including acknowledging & closing alerts, seeing who is on-call, and defining schedule overrides. Opsgenie has bi-directional integrations with Slack, MS Teams, Campfire, MatterMost, Jabber, Flowdock, Kore, and Moxtra.
  20. Web Conference Bridge Opsgenie makes it easy for you to

    communicate with key individuals using your preferred web conferencing provider (WebEx, GoToMeeting, Skype, Jitsi). Conference bridge details are attached to the incident and shared automatically with your team.
  21. Incident Command Center (ICC) The ICC provides a central place

    to command, control, and coordinate incident response. Through integrated communication and incident resolution tools, it enables you to stop switching between different tools and platforms during incident response. You can view the status and progress of each responder team and track all updates and actions, from a centralized dashboard.
  22. Stakeholder Communications Notify stakeholders from across your organization about incidents

    according to organizational specifications. Stakeholders can stay informed about incident resolution progress and service health by automatic notifications, visiting a status page, or subscribing to status page updates.
  23. Operational Efficiency Analytics Instantly understand the volume of alerts your

    company has handled over a specified period of time, and the corresponding mean-time-to-acknowledge and mean-time-to-resolve. You can easily visualize how these metrics are trending over time and, with a mouse click, drill down into areas of concern to understand which alerts required more time and attention.
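
Mean-time-to-acknowledge and mean-time-to-resolve are plain averages over alert timestamps. The sketch below computes both from a list of alert records; the record keys (`created`, `acked`, `resolved`) and the sample data are hypothetical.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(alerts, start_key, end_key):
    """Average the minutes between two timestamps, skipping open alerts."""
    durations = [
        (a[end_key] - a[start_key]).total_seconds() / 60
        for a in alerts
        if a.get(end_key) is not None
    ]
    return mean(durations) if durations else None


# Hypothetical alert history.
alerts = [
    {"created": datetime(2019, 3, 5, 12, 0),
     "acked": datetime(2019, 3, 5, 12, 2),
     "resolved": datetime(2019, 3, 5, 12, 30)},
    {"created": datetime(2019, 3, 5, 13, 0),
     "acked": datetime(2019, 3, 5, 13, 6),
     "resolved": datetime(2019, 3, 5, 13, 20)},
]
mtta = mean_minutes(alerts, "created", "acked")      # minutes to acknowledge
mttr = mean_minutes(alerts, "created", "resolved")   # minutes to resolve
```

Alerts that have not yet been acknowledged or resolved are left out of the average rather than counted as zero, which would otherwise make the metric look better than it is.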
  24. Monthly Overview Analytics Use Opsgenie’s standard dashboard to analyze the

    monthly alert distribution and response trends. You can easily compare them with the previous month, and drill into any areas of interest.
  25. Downloadable & Schedulable Reports Easily share data and communicate findings

    by exporting reports in various formats including PDF. You can even instruct Opsgenie to email the reports to peers on a regular schedule.
  26. User & Team Productivity Analytics Evaluate your team’s and team

    members’ productivity, incident response patterns, and efficiency. Understand which members are responding quickly and establish best practices for everyone.
  27. On-Call Analytics Understand how on-call workloads are distributed throughout each

    team. Ensure that teams are balanced and working at peak efficiency.
  28. Conference Attendance & Efficiency Analytics Conference participation is often the

    key to fast incident resolution. During and after an ICC conference, you can analyze team participation in detail. Understand the attendance and efficiency analysis for each Incident Command Center session.
  29. Service & Infrastructure Health Reporting Quickly get a top-level view

    of all services and identify any problems or weaknesses, system and process flaws, and potential improvements.
  30. Post-Incident Analysis Reporting As each major incident is resolved, use

    the post-incident analysis report to understand the actions taken and their timing. Identify how fast people acknowledged the issues, when status changes were communicated, and how teams participated in the resolution. Easily compare different incident responses, to identify opportunities for improvement.
  31. Team-Based Service Management Opsgenie enables you to map alerts to

    the business services they impact and have a clear understanding of which teams need to respond and who needs to be kept up to date on the progress towards resolution. Disparate teams are notified simultaneously and presented with the tools they need to collaborate during resolution.
  32. Planning and Scenarios Design your incident response and set up

    different workflows for incidents of differing priority using Opsgenie’s incident templates. For each type of incident, predefine the needed response teams, the stakeholders, and the best collaboration channels to resolve problems quickly and communicate them effectively.
  33. Alert Clustering Automatically group related alerts originating from across various

    systems into a single incident based on the conditions that you specify. Reduce complexity and noise to let your responders focus on the right context and resolve problems quickly.
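
Grouping by user-specified conditions can be sketched as bucketing alerts on a set of shared fields. This is only an illustration of the idea, not Opsgenie's clustering logic; the `service` and `region` fields and the sample alerts are assumptions.

```python
from collections import defaultdict


def cluster_alerts(alerts, key_fields=("service", "region")):
    """Group alert ids that share the same values for `key_fields`."""
    clusters = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(field) for field in key_fields)
        clusters[key].append(alert["id"])
    return dict(clusters)


# Hypothetical alerts from different monitoring systems.
incoming = [
    {"id": "a1", "source": "nagios", "service": "checkout", "region": "eu"},
    {"id": "a2", "source": "datadog", "service": "checkout", "region": "eu"},
    {"id": "a3", "source": "nagios", "service": "search", "region": "us"},
]
incidents = cluster_alerts(incoming)
```

Here the two `checkout` alerts land in one bucket even though they came from different tools, so responders see a single incident instead of two notifications.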
  34. Service Status Pages Communicating accurate information during an incident is

    key to a smooth resolution. Service status pages help make this happen. Stakeholders and responders are able to view information about the status of an incident at any time. Additionally, they can view the status page for any service and report a problem that they have encountered with that service. Problems are logged with detailed notes, and an alert is created and sent to the on-call team member.
  35. Call Routing Never miss a customer support call again. Route

    phone calls to the right person using Opsgenie on-call schedules. If no one is available, Opsgenie will take a message, generate an alert, and notify the right person via their preferred notification channel. Call details are attached to the notification, and recipients can listen to the message.
  36. Auto Attendant Set up an auto attendant to help respond faster

    to a customer’s needs. Calls can be routed to different people or teams based on the caller's input (press 1 for network team, press 2 for ...).
  37. Local Phone Numbers Choose local phone numbers in over 35

    different countries to provide convenience to your customer base.
  38. Call Escalations Opsgenie can try multiple users until someone answers

    the phone. You can specify the order of users or let Opsgenie pick someone randomly. Opsgenie connects the caller only when a live person answers the phone, which it confirms by asking the user to press a key.
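
The "try users until someone answers" behaviour can be sketched as a loop over an ordered or shuffled list. The `answers` callback below stands in for the key-press confirmation and is purely hypothetical, as are the function and parameter names.

```python
import random


def connect_call(users, answers, randomize=False, rng=None):
    """Try users one by one; return the first who confirms, else None.

    `answers(user)` should return True only when a live person picked up
    and confirmed (for example by pressing a key).
    """
    order = list(users)  # copy so the caller's list is not reordered
    if randomize:
        (rng or random).shuffle(order)
    for user in order:
        if answers(user):
            return user
    return None
```

Returning `None` when nobody confirms leaves room for a fallback, such as taking a message and creating an alert.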
  39. Complete Tracking and Analytics Call metrics are tracked from beginning

    to end. All activity including when the call is received and how it is routed as well as who answered and how long it lasted can be included in your metrics. Calls can also be recorded for training and quality assurance.
  40. Marid Marid is an integration server (code) for Opsgenie designed

    to resolve challenges when integrating internal and external solutions.
    • Marid can enrich data (e.g. provide the physical location of a server by looking up its Host ID in a database).
    • Marid can be used to execute actions that help further investigate and remediate issues (e.g. ping or restart a server).
    • Marid can act as an application-level proxy to ensure communication between Opsgenie and other systems when a direct connection is not available (firewall issues).
    Coming soon: Opsgenie Edge Connector (OEC).
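
The enrichment case (looking up a Host ID in a database) can be illustrated independently of Marid itself. The sketch below uses an in-memory SQLite table; the `hosts` table, its columns, and the alert fields are hypothetical and do not reflect Marid's actual scripting interface.

```python
import sqlite3


def enrich_with_location(alert, conn):
    """Return a copy of `alert` with the host's physical location added."""
    row = conn.execute(
        "SELECT location FROM hosts WHERE host_id = ?", (alert["host_id"],)
    ).fetchone()
    enriched = dict(alert)  # leave the original alert untouched
    enriched["location"] = row[0] if row else "unknown"
    return enriched


# Hypothetical inventory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hosts (host_id TEXT PRIMARY KEY, location TEXT)")
conn.execute("INSERT INTO hosts VALUES ('db-01', 'Istanbul DC, rack 12')")

enriched = enrich_with_location({"message": "CPU high", "host_id": "db-01"}, conn)
```

Running the lookup inside the customer's network is the point of the design: the inventory database never has to be exposed to the outside.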
  41. Single Sign-On Opsgenie supports several Single Sign-On

    providers, so you can control authentication of your Opsgenie accounts through your own identity provider. Authentication via Single Sign-On is available on both Opsgenie web and mobile applications.
  42. OPSGENIE SECURITY opsgenie.com/security Here at Opsgenie we take security very

    seriously. Below is a summary of our key security practices. If you have any questions, contact us at [email protected], or participate in Opsgenie’s Community Forums
  43. Edge Encryption Opsgenie’s Edge Encryption encrypts your user data so

    that Opsgenie never receives the raw version of the payload directly. The encryption application is hosted on your own environment and acts as a bridge between Opsgenie and 3rd party tools.
  44. TO LEARN MORE • opsgenie.com/blog • opsgenie.com/resource-library • docs.opsgenie.com •

    opsgenie.com/webinars • engineering.opsgenie.com • twitter.com/@opsgenie