Meta's microservice analysis: practice talk for ATC

Practice talk from Thursday 06/29.

The paper and final version of the slides can be found here: https://www.usenix.org/conference/atc23/presentation/huye

Darby Huye

July 13, 2023

Transcript

  1. Lifting the veil on Meta’s microservices: Analysis of topology and request workflows
    Darby Huye, Yuri Shkuro, & Raja Sambasivan
  2. Foundational trends towards microservices
    [Diagram: Monolith]
    Organizational trends:
    • desire for teams to work independently, want quick development, globalization of companies
    Hardware trends:
    • death of Moore’s law leads to need for parallelization
  3. Foundational trends towards microservices
    [Diagram: Monolith, Microservices]
    Basic idea: apps composed of tiny pieces communicating over the network
    Organizational trends:
    • desire for teams to work independently, want quick development, globalization of companies
    Hardware trends:
    • death of Moore’s law leads to need for parallelization
  4. Microservices: an abstraction
    • Concept of service is right granularity for deployment, scaling, observability
    • Independently deployable units
    • Small, represent a single business capability
    • Strictly hierarchical architectures
    • Relatively stable topologies
    [Diagram: Example: Simple Social Network. Node labels: Front end, Authentication, Friends, Feed, Ads, Ads, Posts, Friends. Legend: Dependency, Stateless Service, Stateful Service]
  5. Microservices: an abstraction
    • Concept of service is right granularity for deployment, scaling, observability
    • Independently deployable units
    • Small, represent a single business capability
    • Strictly hierarchical architectures
    • Relatively stable topologies
    [Diagram: Example: Simple Social Network. Node labels: Front end, Authentication, Friends, Feed, Feed, Feed, Ads, Ads, Posts, Friends, Scheduler, Observability Framework. Legend: Dependency, Stateless Service, Stateful Service]
  6. Microservices: request workflow
    Request: Load my feed
    [Diagram: Microservice Topology (Dependency Diagram). Node labels: Front end, Authentication, Friends, Feed, Ads, Ads, Friends, Posts. Legend: Dependency, Stateless Service, Stateful Service]
  7. Current state of microservice research
    Research understanding microservices [SoCC’21, JSEP’22, JSys’22]
    • Focuses on topology and request workflows
    Microservice testbeds [ASPLOS’19, TSE’18]
    • Small in scale and complexity
    Tools evaluated on testbeds [OSDI’20, SINAN’21, ASPLOS’21]
  8. Analysis of Meta’s microservices (abstraction vs. finding)
    Topology:
    • Abstraction: service is sufficient dimension. Finding: service is not one size fits all.
    • Abstraction: topology is static. Finding: long-term growth with daily churn.
    • Abstraction: services are simple. Finding: long tail of complex services.
    Workflows:
    • Abstractions: traces rep. of workflows; workflows execute consistently; depth predicts # calls.
    • Findings: observability loss impacts deep traces; variation in # calls, even locally; variation in conc., decreased by children set; wide & shallow.
    Marks shown on the slide: X X X ✓ X X X
  9. Methodology: Topology
    Service History (22 months)
    • Service deployment and lifetimes
    Service Complexity (1 day)
    • Endpoints exposed by deployed services, replication factors, and dependencies
    Analysis granularity: service id, a unique name assigned to each service (e.g. authentication)
  10. Service is not sufficient granularity
    Inference platform: includes tenant info in service id to utilize infrastructure support
    Service granularity is not sufficient for all management tasks: multi-tenancy and data placement must be considered
  11. Daily churn of deployed services
    • 89% of new services deployed were also deprecated
    • 40% of regular services lived the entire time range
    High creation and deprecation rates for service ids, especially for the ill-fitting services
  12. Long-term growth in total deployed instances
    • Total number of deployed service instances nearly doubled
    • Growth is due to new (regular) service ids, not an increase in replication factors for existing services
  13. Analysis of Meta’s microservices (abstraction vs. finding)
    Topology:
    • Abstraction: service is sufficient dimension. Finding: service is not one size fits all.
    • Abstraction: topology is static. Finding: long-term growth with daily churn.
    • Abstraction: services are simple. Finding: long tail of complex services.
    Workflows:
    • Abstractions: traces rep. of workflows; workflows execute consistently; depth predicts # calls.
    • Findings: observability loss impacts deep traces; variation in # calls, even locally; variation in conc., decreased by children set; wide & shallow.
    Marks shown on the slide: X X X ✓ X X X
  14. Methodology: Workflows
    • Distributed tracing: graphs capturing the work done on behalf of a request
    • Canopy [SOSP’17]: Meta’s distributed tracing framework
    • Traces can be sampled anywhere in the topology
    [Diagram: Example Canopy Trace. Services and endpoints: Front end (Load Feed), Authentication (Verify User), Feed (Load Posts). Legend: Service, Execution Unit, Block, Point, Edge]
  15. Methodology: Workflows
    Used traces collected on a single day from three trace profiles:
    • Ads Manager: 3.2M traces, Random Sampling (0.01%)
    • Fetch Notifications: 87,000 traces, Adaptive Sampling (1 trace/second)
    • RaaS (Ranking of items): 3.3M traces, Adaptive Sampling (25 traces/second)
  16. Description of analyzed workflow properties
    Node names: service id | endpoint name
    Parent’s characteristics:
    • Children set: A, B
    • Number of calls: 6
    • Max concurrency rate: 0.5 (3/6)
    [Diagram: Root … Parent, whose calls over time are Child A, Child B, Child A, Child A, Child A, Child B. Axis labels: Time, Concurrency]
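The three parent characteristics on slide 16 can be sketched in a few lines of Python. This is only an illustration, not the authors' analysis code: the `Call` record, its timestamps, and the reading of "max concurrency rate" as peak overlapping calls divided by total calls (inferred from the slide's "0.5 (3/6)") are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical record of one child call made by a parent."""
    child: str    # node name of the callee, e.g. "service id | endpoint name"
    start: float  # assumed start timestamp
    end: float    # assumed end timestamp


def children_set(calls):
    """Distinct children the parent called in this invocation."""
    return frozenset(c.child for c in calls)


def max_concurrency(calls):
    """Peak number of child calls in flight at the same time (sweep line)."""
    events = []
    for c in calls:
        events.append((c.start, 1))
        events.append((c.end, -1))
    # Process ends before starts at equal timestamps, so back-to-back
    # calls are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


def max_concurrency_rate(calls):
    """Assumed definition: peak concurrent calls / total number of calls."""
    return max_concurrency(calls) / len(calls) if calls else 0.0


# Made-up timings chosen to match the slide: six calls (A, B, A, A, A, B),
# of which at most three overlap.
calls = [
    Call("A", 0, 4), Call("B", 1, 5), Call("A", 2, 6),
    Call("A", 7, 8), Call("A", 9, 10), Call("B", 11, 12),
]
print(sorted(children_set(calls)), len(calls), max_concurrency_rate(calls))
```

With these made-up timings the sketch yields the children set {A, B}, six calls, and a rate of 3/6.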
  17. Predicting number of children
    Identified three categories of nodes: Leaf, Single Relay, Variable Relay
    The majority of service|endpoints are leaves or single relays:
    • Ads Manager: 54%
    • Fetch Notifications: 66%
    • RaaS: 72%
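A rough sketch of how the three node categories on slide 17 could be assigned. The slide does not define the categories, so the definitions here are assumptions: a leaf never calls children, a single relay always calls exactly one, and anything else is a variable relay. The node names and counts are invented for illustration.

```python
def classify(child_counts):
    """Categorize a node from its number of child calls per invocation.

    Assumed definitions (not stated on the slide):
      leaf           - zero children in every invocation
      single relay   - exactly one child in every invocation
      variable relay - anything else
    """
    if not child_counts:
        raise ValueError("need at least one observed invocation")
    if all(n == 0 for n in child_counts):
        return "leaf"
    if all(n == 1 for n in child_counts):
        return "single relay"
    return "variable relay"


# Hypothetical observations: node name -> child-call count per invocation.
observations = {
    "feed | load_posts": [3, 5, 2],
    "auth | verify_user": [0, 0],
    "frontend | route": [1, 1, 1],
    "posts | get": [0],
}

categories = {name: classify(counts) for name, counts in observations.items()}

# Share of nodes whose number of children is trivially predictable.
predictable = sum(cat != "variable relay" for cat in categories.values())
share = predictable / len(categories)
print(categories, share)
```

The `share` value is the toy analogue of the per-profile percentages on the slide.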
  18. Predicting number of children
    [Plots: Ads, Fetch, RaaS, with callouts 1 and 2]
    Variation in number of calls is often attributed to 1) different children sets or 2) database accesses
  19. Predicting concurrency rates
    [Plots: Ads, Fetch, RaaS]
    High variance in concurrency across different invocations of a parent ingress ID’s children.
  20. Predicting concurrency rates
    Children set helps explain concurrency rate
    • Children are either 100% concurrent or 0% concurrent
    • Standard deviation in concurrency decreases by 38-44% when grouping by children set
    [Diagram: Example service|endpoint with children sets 1 and 2. Labels: Always 100% concurrent, Always 0% concurrent]
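To make the grouping step on slide 20 concrete, here is a small sketch that compares the spread of concurrency rates for one parent before and after grouping its invocations by children set. How the authors aggregated per-group deviations is not stated on the slide; the size-weighted mean of per-group standard deviations used here is an assumption, and the data is made up.

```python
from collections import defaultdict
from statistics import pstdev


def overall_std(invocations):
    """Population std. dev. of concurrency rates across all invocations."""
    return pstdev([rate for _, rate in invocations])


def grouped_std(invocations):
    """Size-weighted mean of per-children-set std. devs (assumed aggregation)."""
    groups = defaultdict(list)
    for cset, rate in invocations:
        groups[cset].append(rate)
    total = len(invocations)
    return sum(len(rates) / total * pstdev(rates) for rates in groups.values())


# Hypothetical invocations of one parent: (children set, concurrency rate).
AB = frozenset({"A", "B"})
A = frozenset({"A"})
invocations = [(AB, 1.0), (AB, 1.0), (AB, 0.0), (A, 0.0), (A, 0.0)]

overall = overall_std(invocations)
grouped = grouped_std(invocations)
reduction = 1 - grouped / overall  # fractional decrease in std. dev.
print(f"overall={overall:.3f} grouped={grouped:.3f} reduction={reduction:.1%}")
```

If the children set explains most of the concurrency behavior, the grouped value falls well below the overall one. The numbers in this toy example carry no meaning beyond showing the computation.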
  21. Implications
    Tooling that uses workflows for performance prediction, diagnosis, capacity planning [Tprof, Sifter, VAIF, CRISP]:
    • Need to assume significant diversity in workflows originating from a root endpoint
    Tooling that uses topology for resource management [Sage, FIRM, Sinan]:
    • Should be adaptable to dynamic topology
    • Need to test when results are based on stale data
    Testbeds should be extended to provide support for:
    • Heterogeneity of services, churn & growth of deployed instances
    • Variable concurrency, number of children, and children sets even within requests from a single root endpoint
  22. Summary
    Topology:
    • Service is not one size fits all
    • Long-term growth with daily churn
    • Long tail of complex services
    Workflows:
    • Observability loss impacts deep traces
    • Variation in # calls, even locally
    • Variation in conc., decreased by children set
    • Wide & shallow
    Microservice abstraction should be extended to support different types of architectures.
    Data available @ github.com/facebookresearch/distributed_traces
  23. References
    • [SOSP’17]: Kaldor et al. Canopy
    • [ATC’22]: CRISP
    • [SoCC’21]: Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis. SoCC
    • [JSEP’22]: Characterizing and synthesizing the workflow structure of microservices in ByteDance Cloud
    • [JSys’22]: Huye & Seshagiri et al. [SoK] Identifying Mismatches Between Microservice Testbeds and Industrial Perceptions of Microservices
    • [ASPLOS’19]: Gan et al. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems
    • [TSE’18]: Zhou et al. Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study
    • [ASPLOS’21]: Sage
    • [Tprof’21]: Tprof. SoCC
  24. References
    • [ICICCS’22]: Improving task scheduling in microservice environments by considering intra-job dependencies
    • [OSDI’20]: FIRM: An intelligent fine-grained resource management framework for SLO-oriented microservices
    • [IEEE’13]: Visualizing request-flow comparison to aid performance diagnosis in distributed systems
    • [VAIF’21]: Automating instrumentation choices for performance problems in distributed applications with VAIF. SoCC
    • [SINAN’21]: Sinan: ML-based and QoS-aware resource management for cloud microservices. ASPLOS
    • [NSDI’11]: Spectroscope
    • [Anand’20]: Aggregate-driven trace visualizations for performance debugging. ArXiv