Meeseeks (Service-Graph)

Agenda
• Large application ecosystem and challenges
• Handling large-scale failures (BCP)
• Meeseeks
• Applications of Meeseeks
• What’s next
• Q & A

Application
Evolution of tech ecosystem
• Begins as a small set of services or a monolith
  ◦ Either one or a few services
  ◦ Simple tech stack, e.g. LAMP
  ◦ Complete control/visibility/monitoring

Large Application
Evolution of tech ecosystem
• Begins as a small set of services or a monolith
• Multiple flows/services start to appear
  ◦ High-level business flows
  ◦ Each develops independently of the others
  ◦ Tech choices differ

Large Application Ecosystem
Evolution of tech ecosystem
• Begins as a small set of services or a monolith
• Multiple flows start to appear
• Each flow is itself a large system
  ◦ Evolves independently, as microservices
  ◦ Multiple cross-system dependencies
  ◦ Largely unaware of other systems
Result: an application ecosystem, a large mesh of services

Large Application Ecosystem
In the Flipkart context:
• Thousands of microservices
  ◦ Homegrown/open source/enterprise
  ◦ Stateless/stateful
  ◦ Variety of tech stack choices
• Continuous morphing
  ◦ Multiple daily deployments
  ◦ Addition/removal of services and features
  ◦ Dependency changes

Application Ecosystem : Order System

Application Ecosystem : Pricing System

Application Ecosystem: Flipkart

Large Application Ecosystem & Challenges
Flipkart's tech ecosystem is a large mesh of services.
Challenges:
• Observability
  ◦ Service interactions and their evolution
  ◦ Service and data abstractions
  ◦ Chains of microservices
  ◦ Runtime behavior and failure analysis/impact

Large Application Ecosystem & Challenges
Challenges:
• Handling failures
  ◦ Scale: small, large, or widespread disasters
  ◦ Impact: financial, legal/compliance, reputation, etc.
  ◦ Business continuity even in the face of large disasters: the BCP charter
    ▪ BCP flows: the business flows that must be sustained through a wide variety of disasters
    ▪ Work with tech teams to ensure BCP flows are adhered to
    ▪ Provide tooling to ensure BCP

Handling Failures (Disasters) - BCP
Typical deployment of services in a cloud
• Flipkart Cloud Platform (FCP): private cloud hosting all of the tech systems
• Majority of capacity allocated to running services
• Reserved capacity for operational maintenance
• Typically this is one data center (DC), i.e. one availability zone

Handling Failures (Disasters) - BCP
Typical deployment of services in a cloud
• FCP hosts Flipkart’s tech systems
• If a disaster happens:
  ◦ A BCP flow is defined
  ◦ Extra capacity is reserved
  ◦ Recovery in an alternate availability zone is needed

Handling Disasters - BCP
FCP has two availability zones (DCs)
• Symmetric; each hosts active business flows
• Cross-DC interactions exist as well
In case of disaster (e.g. zone failure)
• BCP flows from both zones need to be sustained/restored

Handling Disasters - BCP
• Disaster happens in availability zone 1
• To ensure business continuity
  ◦ BCP flow in zone 1 needs to be restored

Handling Disasters - BCP
To ensure business continuity, we need to know:
• The BCP flows
• The order in which to bring up services
• The order in which to shut down services
• Scale up/down requirements
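
One way to derive bring-up and shutdown orders from a dependency graph is a topological sort. The sketch below is illustrative only: the service names and the `depends_on` map are hypothetical, and the deck does not say how Meeseeks itself computes these orders. It uses Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: service -> set of services it depends on.
depends_on = {
    "checkout": {"order", "pricing"},
    "order": {"inventory", "orders-db"},
    "pricing": {"pricing-db"},
    "inventory": {"inventory-db"},
}

def boot_order(deps):
    """Return services with dependencies first, so each one starts
    only after everything it needs is already up.
    Raises graphlib.CycleError if the dependencies form a cycle."""
    return list(TopologicalSorter(deps).static_order())

def shutdown_order(deps):
    """Return the reverse of the boot order: dependents stop before
    the services they rely on."""
    return list(reversed(boot_order(deps)))

if __name__ == "__main__":
    print("boot:    ", boot_order(depends_on))
    print("shutdown:", shutdown_order(depends_on))
```

A cyclic dependency has no valid linear order, which is why the sorter raises an error in that case; such cycles would need to be broken or handled separately.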

Enter “Meeseeks”
• Graph of services across Flipkart
• Understands dependencies between services
• Enables org-wide scenarios like BCP
• Rich querying on service topologies
• Intelligent cluster recognition

How we did it
• iftop-based agent
• Data tapping on all Motherships
• Data processing

Contd.
• Enrichment of the data
• Builds a real-time service topology of the Flipkart network
• Intelligent data-store cluster recognition
• Stored in a graph DB
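
The enrichment step can be pictured as mapping raw network flows (IP to IP) onto service-level edges. The following is a minimal sketch under stated assumptions: the `ip_to_service` inventory, the flow tuple layout, and the service names are all hypothetical, and the actual Meeseeks pipeline and graph DB schema are not described in the deck.

```python
from collections import Counter

# Hypothetical inventory used for enrichment: IP address -> service name.
ip_to_service = {
    "10.0.0.1": "checkout",
    "10.0.0.2": "checkout",
    "10.0.1.1": "order",
    "10.0.2.1": "orders-db",
}

# Hypothetical flow records as an agent might report them:
# (source IP, destination IP, destination port).
flows = [
    ("10.0.0.1", "10.0.1.1", 8080),
    ("10.0.0.2", "10.0.1.1", 8080),
    ("10.0.1.1", "10.0.2.1", 3306),
    ("10.0.0.1", "10.0.0.2", 9000),  # traffic between two checkout instances
]

def build_service_edges(flows, ip_to_service):
    """Collapse IP-level flows into service-level edges with a flow count."""
    edges = Counter()
    for src_ip, dst_ip, dst_port in flows:
        src = ip_to_service.get(src_ip, "unknown")
        dst = ip_to_service.get(dst_ip, "unknown")
        if src == dst:
            continue  # ignore traffic between instances of the same service
        edges[(src, dst, dst_port)] += 1
    return edges

if __name__ == "__main__":
    for (src, dst, port), count in build_service_edges(flows, ip_to_service).items():
        print(f"{src} -> {dst}:{port} ({count} flows)")
```

Each resulting `(source, destination, port)` key corresponds to one edge that could then be written to a graph store.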

Observability : Order System

Observability : Pricing System

Meeseeks Overlays
• “Overlays” sprinkle additional information on the base service-graph layer
• Overlay definition: a set of possible annotations over the graph nodes and edges
• Meeseeks provides APIs to
  ◦ Create/register a custom overlay
  ◦ Annotate the base Meeseeks graph with custom overlay data
  ◦ Query based on the annotations
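
To make the three overlay operations concrete, here is a small in-memory sketch. It is not the Meeseeks API: the `OverlayStore` class, its method names, and the annotation keys are hypothetical and only illustrate the register / annotate / query idea.

```python
class OverlayStore:
    """Hypothetical in-memory overlay store. A target is either a node
    (a service name) or an edge (a (source, destination) tuple)."""

    def __init__(self):
        self._schemas = {}      # overlay name -> allowed annotation keys
        self._annotations = {}  # (overlay name, target) -> {key: value}

    def register_overlay(self, name, allowed_keys):
        """Define an overlay as a set of permitted annotation keys."""
        self._schemas[name] = set(allowed_keys)

    def annotate(self, overlay, target, **values):
        """Attach annotations to a node or edge.
        Raises KeyError for an unregistered overlay and ValueError
        for keys that the overlay definition does not allow."""
        unknown = set(values) - self._schemas[overlay]
        if unknown:
            raise ValueError(f"keys not in overlay {overlay!r}: {sorted(unknown)}")
        self._annotations.setdefault((overlay, target), {}).update(values)

    def query(self, overlay, **criteria):
        """Return targets whose annotations match every criterion."""
        return [
            target
            for (name, target), values in self._annotations.items()
            if name == overlay
            and all(values.get(k) == v for k, v in criteria.items())
        ]

# Usage with a hypothetical "bcp" overlay.
store = OverlayStore()
store.register_overlay("bcp", ["criticality", "rto_minutes", "edge_type"])
store.annotate("bcp", "checkout", criticality=1, rto_minutes=15)
store.annotate("bcp", "recommendations", criticality=3)
store.annotate("bcp", ("checkout", "order"), edge_type="essential")
critical_services = store.query("bcp", criticality=1)
essential_edges = store.query("bcp", edge_type="essential")
```

Keeping the overlay definition separate from the base graph means new domains can add their own annotations without changing the underlying service topology.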

Meeseeks Overlays: Examples
• Data DR overlay
  ◦ Trigger an event to the backup/DR infrastructure
  ◦ Validate schedules configured for recovery point objective, etc.
• BCP overlay
  ◦ Tag services with a “criticality level” and a recovery time objective to meet in case of a disaster
  ◦ Tag edges as “essential”, “optional”, etc.
  ◦ Detect anomalies
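
The deck does not define what counts as an anomaly. One plausible rule, assumed here purely for illustration, is that a service should not have an “essential” dependency on a service tagged as less critical than itself. The sketch below applies that rule to hypothetical tags, with the further assumption that a lower number means higher criticality.

```python
# Hypothetical BCP tags: service -> criticality level (1 = most critical).
criticality = {
    "checkout": 1,
    "order": 1,
    "recommendations": 3,
    "reco-db": 3,
}

# Hypothetical tagged edges: (caller, callee, edge tag).
edges = [
    ("checkout", "order", "essential"),
    ("checkout", "recommendations", "essential"),
    ("recommendations", "reco-db", "essential"),
    ("order", "recommendations", "optional"),
]

def find_anomalies(criticality, edges):
    """Return essential edges where the callee is less critical than the
    caller (assumed rule; a larger number means less critical)."""
    anomalies = []
    for src, dst, tag in edges:
        if tag != "essential":
            continue  # optional dependencies may point at less critical services
        if criticality[dst] > criticality[src]:
            anomalies.append((src, dst))
    return anomalies

if __name__ == "__main__":
    for src, dst in find_anomalies(criticality, edges):
        print(f"anomaly: {src} (level {criticality[src]}) essentially "
              f"depends on {dst} (level {criticality[dst]})")
```

Under this rule, a flagged edge suggests either that the callee's criticality should be raised or that the dependency should be made optional.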

Tag annotations on Services

Tagged Services

Anomaly Detection

Anomaly Detection Contd...

Services Boot Order

What next:
• Data capturing improvements:
  ◦ nf_conntrack-based agent
    ▪ Netfilter connection tracking information
  ◦ ERSPAN-encapsulated SYN packets duplicated to a processor host via a GRE tunnel
    ▪ Support available in network devices
• Additional overlays, such as data backup and system health
• Enable constructs like service orchestration
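
As a rough illustration of what an nf_conntrack-based agent would consume, the sketch below parses one connection-tracking entry into a flow tuple. The sample line is made up for illustration and follows the general `key=value` layout of Linux conntrack entries; the function name and the choice to keep only the first (original-direction) tuple are assumptions, not details from the deck.

```python
# Illustrative conntrack-style entry (not captured from a real host).
SAMPLE = (
    "ipv4     2 tcp      6 431999 ESTABLISHED "
    "src=10.0.0.1 dst=10.0.1.1 sport=51514 dport=8080 "
    "src=10.0.1.1 dst=10.0.0.1 sport=8080 dport=51514 "
    "[ASSURED] mark=0 zone=0 use=2"
)

def parse_conntrack_line(line):
    """Extract (proto, src, sport, dst, dport) for the original direction.
    Returns None if the line is not a TCP/UDP entry with a full tuple."""
    tokens = line.split()
    proto = next((t for t in tokens if t in ("tcp", "udp")), None)
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        # Each entry lists the original tuple first, then the reply tuple;
        # keeping only the first occurrence of a key keeps the original one.
        if sep and key not in fields:
            fields[key] = value
    if proto is None or not {"src", "dst", "sport", "dport"} <= fields.keys():
        return None
    return (proto, fields["src"], int(fields["sport"]),
            fields["dst"], int(fields["dport"]))

if __name__ == "__main__":
    print(parse_conntrack_line(SAMPLE))
```

Because connection tracking already records who initiated each connection, a parser like this yields directed edges without needing to infer direction from port numbers.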

Key Takeaways:
• Challenges in a large application ecosystem
• Use network monitoring tools to form real-time network topologies
• Enrich this data to build service topologies
• Use data overlay constructs to manage/evolve domains like BCP
• Use the overlay data to solve various domain-specific challenges

Thank You For Being Here!

Gaurav
