Meeseeks (Service-Graph) RootConf

Gaurav

May 21, 2019

Transcript

  1. Agenda
     • Large application ecosystem and challenges
     • Handling large-scale failures (BCP)
     • Meeseeks
     • Applications of Meeseeks
     • What’s next
     • Q & A
  2. Large Application Ecosystem
     Evolution of a tech ecosystem:
     • Begins as a small set of services, or a monolith.
     • Multiple flows start to appear.
     • Each flow is itself a large system.
       ◦ Evolves independently into microservices.
       ◦ Has multiple cross-system dependencies.
       ◦ Is largely unaware of other systems.
     Result: an application ecosystem, a large mesh of services.
  3. Large Application Ecosystem
     In the Flipkart context:
     • Thousands of microservices
       ◦ Homegrown / open source / enterprise
       ◦ Stateless / stateful
       ◦ Variety of tech stack choices
     • Continuous morphing
       ◦ Multiple daily deployments
       ◦ Addition/removal of services and features
       ◦ Dependency changes
  4. Order System: Service-Level View
     No. of services: ~400
     No. of service-level dependencies: ~1200
  5. Application Ecosystem: Pricing System
     No. of applications: ~350
     No. of services: ~1000
     No. of application-level dependencies: ~900
     No. of service-level dependencies: ~3400
  6. Application Ecosystem: Flipkart
     No. of applications: ~4200
     No. of services: ~12000
     No. of application-level dependencies: ~15000
     No. of service-level dependencies: ~30000
  7. Large Application Ecosystem & Challenges
     The Flipkart tech ecosystem is a large mesh of services. Challenges:
     • Observability
       ◦ Service interactions and their evolution.
       ◦ Service and data abstractions.
       ◦ Chains of microservices.
       ◦ Runtime behavior and failure analysis/impact.
     • Handling failures
       ◦ Scale: from small failures to widespread disasters.
       ◦ Impact: financial, legal/compliance, reputation, etc.
       ◦ Business continuity even in the face of large disasters, the BCP charter:
         ▪ BCP flows: business flows that must be sustained through a wide variety of disasters.
         ▪ Work with tech teams to ensure BCP flows are adhered to.
         ▪ Provide tooling to ensure BCP.
  8. Handling Failures (Disasters) - BCP
     Typical deployment of services in a cloud:
     • Flipkart Cloud Platform (FCP): a private cloud hosting all of the tech systems.
     • The majority of capacity is allocated to running services.
     • Some capacity is reserved for operational maintenance.
     • Typically this is one Data Center (DC), i.e. an Availability Zone.
  9. Handling Failures (Disasters) - BCP
     Typical deployment of services in a cloud:
     • FCP hosts Flipkart’s tech systems.
     • If a disaster happens:
       ◦ BCP flows are defined.
       ◦ Extra capacity is reserved.
       ◦ Recovery in an alternate Availability Zone is needed.
  10. Handling Disasters - BCP
      FCP has two Availability Zones (DCs):
      • Symmetric; both host active business flows.
      • There are cross-DC interactions as well.
      In case of a disaster (e.g. a zone failure):
      • BCP flows from both zones need to be sustained/restored.
  11. Handling Disasters - BCP
      • A disaster takes out Availability Zone 1.
      • To ensure business continuity:
        ◦ The BCP flows in zone 1 need to be restored.
  12. Handling Disasters - BCP
      To ensure business continuity, we need to know:
      • The BCP flows.
      • The order in which to bring up services.
      • The order in which to shut down services.
      • Scale up/down requirements.
      (A sketch of deriving a boot order from the dependency graph follows this slide.)
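A minimal sketch, not from the deck, of how a boot order (and its reverse, a shutdown order) could be derived once the service dependency graph is known, using a plain topological sort; the service names and dependency map below are illustrative.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical BCP-flow dependency graph: service -> services it depends on.
# A service can only be brought up once all of its dependencies are up.
deps = {
    "checkout": {"cart", "pricing"},
    "cart": {"session-store"},
    "pricing": {"catalog", "session-store"},
    "catalog": set(),
    "session-store": set(),
}

boot_order = list(TopologicalSorter(deps).static_order())
print("boot order:    ", boot_order)                    # dependencies come up first
print("shutdown order:", list(reversed(boot_order)))    # dependents go down first
```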
  13. Enter “Meeseeks”
      • A graph of services across Flipkart.
      • Understands dependencies between services.
      • Enables org-wide scenarios like BCP.
      • Rich querying on service topologies.
      • Intelligent cluster recognition.
  14. Our Needs
      • We wanted to build this in a bottom-up manner:
        ◦ By analysing network traffic and stitching it with higher-level service details.
        ◦ Intelligent data-store cluster recognition.
      • Once this core layer is built, enable domain-specific data enrichment on top of it.
  15. Alternatives, and Factors That Influenced Our Choices
      • Why not borrow something off the shelf?
        ◦ No suitable open-source product was available.
        ◦ Flipkart has its own private cloud, with its own constructs, so it was better to build from scratch.
        ◦ Flexibility to enable multiple business use cases on top of the base layer.
        ◦ We did study products like Netflix Vizceral and Microsoft’s Service Map.
      • Factors that influenced our choices:
        ◦ We did not want application intrusion.
        ◦ Run the network data collection agent on Motherships instead of on Virtual Machines.
        ◦ Have control of the environments in which our agent runs.
        ◦ Availability of information from multiple infra components to stitch together and present.
  16. How we did it
      • iftop-based agent.
      • Data tapping on all Motherships.
      • Data processing.
      (A sketch of such an agent follows this slide.)
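Not the actual Meeseeks agent (it isn’t open source); a minimal sketch of an iftop-based tap, assuming iftop’s text mode (-t, a single run of a few seconds via -s) and its "=>" / "<=" markers for the two endpoints of a flow. The interface name is illustrative, and the exact text layout varies across iftop versions.

```python
import re
import subprocess
from collections import Counter

def collect_flows(interface="eth0", seconds=10):
    """Run iftop once in text mode and return observed (sender, receiver) pairs."""
    out = subprocess.run(
        ["iftop", "-t", "-n", "-P", "-i", interface, "-s", str(seconds)],
        capture_output=True, text=True, check=True,
    ).stdout

    flows = Counter()
    sender = None
    for line in out.splitlines():
        m = re.search(r"(\S+)\s+=>", line)        # sending endpoint of a flow
        if m:
            sender = m.group(1)
            continue
        m = re.search(r"(\S+)\s+<=", line)        # receiving endpoint, paired with the sender above
        if m and sender:
            flows[(sender, m.group(1))] += 1
            sender = None
    return flows

if __name__ == "__main__":
    for (src, dst), n in collect_flows().items():
        print(f"{src} -> {dst}")
```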
  17. Contd.
      • Enrichment of the data.
      • Builds a real-time service topology of the Flipkart network.
      • Intelligent data-store cluster recognition.
      • Stored in a graph DB.
      (A sketch of loading such a topology into a graph DB follows this slide.)
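The deck does not name the graph DB, so purely as an illustration, here is a sketch of loading an enriched edge list into Neo4j; the connection details, the :Service/:CALLS schema, and the edge list itself are assumptions.

```python
from neo4j import GraphDatabase

# Hypothetical output of the enrichment step:
# (source service, destination service, observed connection count).
edges = [
    ("checkout", "pricing", 1832),
    ("pricing", "catalog", 955),
    ("checkout", "cart", 2410),
]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    for src, dst, count in edges:
        # MERGE keeps nodes and edges idempotent across repeated agent runs.
        session.run(
            """
            MERGE (a:Service {name: $src})
            MERGE (b:Service {name: $dst})
            MERGE (a)-[r:CALLS]->(b)
            SET r.connections = $count
            """,
            src=src, dst=dst, count=count,
        )
driver.close()
```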
  18. Data Store Clustering
      • Identify the services running on VMs using a port scan.
      • Cluster data stores on the basis of the services running on the VMs and the network topology.
      • We cluster Hadoop, MySQL, Elasticsearch, Redis, Aerospike, MongoDB and many more.
      (A sketch of the port-scan classification step follows this slide.)
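A minimal sketch of the port-scan classification idea, assuming the data stores listen on their default ports; the port map and the host address are illustrative, and the real clustering also uses the network topology, not just open ports.

```python
import socket

# Well-known default ports -> data-store type. Real deployments may use
# non-standard ports, so this is only a first-pass heuristic.
DATASTORE_PORTS = {
    3306: "MySQL",
    6379: "Redis",
    9200: "Elasticsearch",
    27017: "MongoDB",
    3000: "Aerospike",
    8020: "Hadoop (HDFS NameNode)",
}

def classify_vm(host, timeout=0.5):
    """Return the data-store types whose default ports are reachable on `host`."""
    found = []
    for port, kind in DATASTORE_PORTS.items():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:   # 0 means the port is open
                found.append(kind)
    return found

# VMs that run the same data store and also talk to each other in the
# network topology can then be grouped into one cluster.
print(classify_vm("10.0.0.12"))
```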
  19. Meeseeks Overlays
      • “Overlays” sprinkle additional information onto the base service-graph layer.
      • Overlay definition: a set of possible annotations over the graph nodes and edges.
      • Meeseeks provides APIs to:
        ◦ Create / register a custom overlay.
        ◦ Annotate the base Meeseeks graph with custom overlay data.
        ◦ Query based on the annotations.
  20. Meeseeks Overlays: Examples
      • Data DR overlay
        ◦ Trigger an event to the backup/DR infrastructure.
        ◦ Validate the schedules configured for the recovery point objective, etc.
      • BCP overlay
        ◦ Tag services with a “criticality level” and a recovery time objective to meet in case of a disaster.
        ◦ Tag edges as “essential”, “optional”, etc.
        ◦ Detect anomalies.
      (A hypothetical sketch of such an overlay follows this slide.)
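The Meeseeks overlay API is not public, so the following is a hypothetical sketch of what a BCP overlay could look like: annotations on nodes (criticality, recovery time objective) and on edges (essential/optional), plus a simple query over them. All class, field, and service names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Overlay:
    """Hypothetical overlay: named annotations over graph nodes and edges."""
    name: str
    node_annotations: dict = field(default_factory=dict)   # service -> {attr: value}
    edge_annotations: dict = field(default_factory=dict)   # (src, dst) -> {attr: value}

    def annotate_node(self, service, **attrs):
        self.node_annotations.setdefault(service, {}).update(attrs)

    def annotate_edge(self, src, dst, **attrs):
        self.edge_annotations.setdefault((src, dst), {}).update(attrs)

    def query_nodes(self, **criteria):
        """Return services whose annotations match all given criteria."""
        return [s for s, a in self.node_annotations.items()
                if all(a.get(k) == v for k, v in criteria.items())]

# BCP overlay: criticality + recovery time objective on services,
# essential/optional tags on call edges.
bcp = Overlay("BCP")
bcp.annotate_node("checkout", criticality="P0", rto_minutes=30)
bcp.annotate_node("recommendations", criticality="P2", rto_minutes=720)
bcp.annotate_edge("checkout", "pricing", kind="essential")
bcp.annotate_edge("checkout", "recommendations", kind="optional")

print(bcp.query_nodes(criticality="P0"))   # -> ['checkout']
```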
  21. Meeseeks Journey So Far
      • Visibility: automated dependency discovery, and boot order in case of disaster.
      • Cleaning the clutter of services:
        ◦ Architects use it to reduce the number of microservices and to remove unwanted dependencies between services, cleaning up their microservice ecosystem.
      • Verification: surface run-time interactions that violate the desired design.
  22. Contd.
      • Adherence: checking whether new DBs are backed up, which contributes to our BCP charter.
      • Clustering of data stores.
  23. What next
      • Data capturing improvements:
        ◦ nf_conntrack-based agent
          ▪ Netfilter connection tracking information (see the sketch after this slide).
        ◦ ERSPAN-encapsulated SYN packets duplicated to a processor host via a GRE tunnel
          ▪ Support is available in network devices.
      • Additional overlays, like data backup and system health.
      • Enable constructs like service orchestration.
      • Make it open source.
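Not existing Meeseeks code; a minimal sketch of what an nf_conntrack-based tap could read, assuming the kernel exposes /proc/net/nf_conntrack (the nf_conntrack module must be loaded, and reading the file usually needs root).

```python
import re
from collections import Counter

CONNTRACK = "/proc/net/nf_conntrack"

def tcp_flows():
    """Return (src, dst, dport) -> number of tracked TCP connections."""
    flows = Counter()
    with open(CONNTRACK) as f:
        for line in f:
            if " tcp " not in line:
                continue
            # The first src=/dst=/sport=/dport= group describes the original
            # direction of the connection (client -> server).
            m = re.search(r"src=(\S+) dst=(\S+) sport=\d+ dport=(\d+)", line)
            if m:
                src, dst, dport = m.groups()
                flows[(src, dst, int(dport))] += 1
    return flows

if __name__ == "__main__":
    for (src, dst, dport), n in sorted(tcp_flows().items()):
        print(f"{src} -> {dst}:{dport}  ({n} tracked connections)")
```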
  24. Key Takeaways
      • Challenges in a large application ecosystem.
      • Network monitoring tools can form real-time network topologies.
      • Enrich this data to build service topologies.
      • Use data overlay constructs to manage/evolve domains like BCP.
      • Use that overlay data to solve various domain-specific challenges.