Meta's microservice analysis: practice talk for ATC

Lifting the veil on Meta’s microservices
Analysis of topology and request workflows
Darby Huye, Yuri Shkuro, & Raja Sambasivan

Microservices: what are they?
Is this a microservice?

Foundational trends towards microservices
[Figure: Monolith]
Organizational trends:
• Teams want to work independently and develop quickly; companies are globalizing
Hardware trends:
• The death of Moore’s law leads to a need for parallelization

Foundational trends towards microservices
[Figure: Monolith and Microservices]
Basic idea: apps composed of tiny pieces communicating over the network

Microservices: an abstraction
• Concept of service is the right granularity for deployment, scaling, observability
• Independently deployable units
• Small, represent a single business capability
• Strictly hierarchical architectures
• Relatively stable topologies
[Figure: Example: Simple Social Network. Services: Front end, Authentication, Friends, Feed, Ads, Posts. Legend: Dependency, Stateless Service, Stateful Service]

Microservices: an abstraction
[Figure: Example: Simple Social Network. Services: Front end, Authentication, Friends, Feed, Ads, Posts, Scheduler, Observability Framework. Legend: Dependency, Stateless Service, Stateful Service]

Microservices: request workflow
Request: Load my feed
[Figure: Microservice Topology (Dependency Diagram). Services: Front end, Authentication, Friends, Feed, Ads, Posts. Legend: Dependency, Stateless Service, Stateful Service]

Current state of microservice research
Research understanding microservices [SoCC ’21, JSEP ’22, JSys’22]
• Focuses on topology and request workflows
Microservice testbeds [ASPLOS’19, TSE’18]
• Small in scale and complexity
Tools evaluated on testbeds [OSDI’20, SINAN’21, ASPLOS’21]

Analysis of Meta’s microservices
Topology (abstraction → finding)
• X Service is sufficient dimension → Service is not one size fits all
• X Topology is static → Long-term growth with daily churn
• X Services are simple → Long tail of complex services
Workflows (abstraction → finding)
• X Traces rep. of workflows → Observability loss impacts deep traces
• X Depth predicts # calls → Variation in # calls, even locally
• X Workflows execute consistently → Variation in conc., decreased by children set
• ✓ Wide & shallow → Wide & shallow

Methodology: Topology
Service History (22 months)
• Service deployment and lifetimes
Service Complexity (1 day)
• Endpoints exposed by deployed services, replication factors, and dependencies
Analysis granularity: service id, a unique name assigned to each service (e.g. authentication)

Service is not sufficient granularity
Inference platform: includes tenant info in service id to utilize infrastructure support
Service granularity is not sufficient for all management tasks: multi-tenancy and data placement must be considered

Daily churn of deployed services
• 89% of new services deployed were also deprecated
• 40% of regular services lived the entire time range
High creation and deprecation rates for service ids, especially for the ill-fitting services
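The two churn percentages above can be illustrated with a small sketch. It assumes each service id comes with a first-seen and last-seen date inside the measurement window, that "new" means first seen after the window starts, and that "deprecated" means last seen before the window ends. The function name `churn_stats` and these definitions are hypothetical and are not taken from the talk.

```python
from datetime import date


def churn_stats(lifetimes, window_start, window_end):
    """Summarize churn for a dict of service_id -> (first_seen, last_seen).

    Assumed definitions (not from the slides):
      new         = first seen strictly after the window start
      deprecated  = last seen strictly before the window end
      whole-range = present at both the window start and the window end
    """
    new = [s for s, (first, _) in lifetimes.items() if first > window_start]
    new_and_deprecated = [s for s in new if lifetimes[s][1] < window_end]
    whole_range = [
        s for s, (first, last) in lifetimes.items()
        if first <= window_start and last >= window_end
    ]
    return {
        "new_also_deprecated": len(new_and_deprecated) / len(new) if new else 0.0,
        "lived_entire_window": len(whole_range) / len(lifetimes) if lifetimes else 0.0,
    }
```

Calling it with a handful of made-up lifetimes returns the fraction of new service ids that were also deprecated and the fraction that lived through the whole window, which are the two quantities the slide reports.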

Long-term growth in total deployed instances
• Total number of deployed service instances nearly doubled
• Growth is due to new (regular) service ids, not an increase in replication factors for existing services

Methodology: Workflows
• Distributed tracing: graphs capturing the work done on behalf of a request
• Canopy [SOSP’17]: Meta’s distributed tracing framework
• Traces can be sampled anywhere in the topology
[Figure: Example Canopy Trace. Front end (Load Feed), Authentication (Verify User), Feed (Load Posts). Legend: Service, Execution Unit, Block, Point, Edge]

Methodology: Workflows
Used traces collected on a single day from three trace profiles:
• Ads Manager: 3.2M traces, Random Sampling (0.01%)
• Fetch Notifications: 87,000 traces, Adaptive Sampling (1 trace/second)
• RaaS (Ranking of items): 3.3M traces, Adaptive Sampling (25 traces/second)

Description of analyzed workflow properties
Node names: service id | endpoint name
Parent’s characteristics:
• Children set: A B
• Number of calls: 6
• Max concurrency rate: 0.5 (3/6)
[Figure: call tree (Root … Parent, Child A, Child B) and a concurrency-over-time plot of the Parent’s calls to Child A and Child B]
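The three parent properties on this slide can be computed from a list of child calls with start and end times. The sketch below is one possible way to do it, not the authors' code: the `Call` type and function names are hypothetical, and it assumes the max concurrency rate is the peak number of simultaneously running child calls divided by the total number of calls, which matches the slide's 3/6 = 0.5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    child: str    # node name, e.g. "service id | endpoint name"
    start: float  # start time of the child call
    end: float    # end time of the child call


def children_set(calls):
    """Distinct children a parent invoked, ignoring how many times."""
    return frozenset(c.child for c in calls)


def max_concurrency(calls):
    """Peak number of child calls running at the same time (sweep line)."""
    events = []
    for c in calls:
        events.append((c.start, 1))
        events.append((c.end, -1))
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back calls are not counted as concurrent.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def max_concurrency_rate(calls):
    """Peak concurrency divided by the number of calls (assumed definition)."""
    if not calls:
        return 0.0
    return max_concurrency(calls) / len(calls)
```

With six calls to children A and B, of which at most three overlap, `children_set` returns {A, B}, the number of calls is 6, and `max_concurrency_rate` returns 0.5.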

Predicting number of children
Identified three categories of nodes:
• Leaf
• Single Relay
• Variable Relay
The majority of service|endpoints are leaves or single relays:
• Ads Manager: 54%
• Fetch Notifications: 66%
• RaaS: 72%
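A minimal sketch of how nodes might be sorted into the three categories follows. The slide does not define the categories, so the rules here are assumptions: a leaf never calls a child, a single relay makes exactly one call in every invocation, and anything else is a variable relay. The function names are hypothetical.

```python
def classify_node(call_counts):
    """Classify one service|endpoint from its per-invocation child call counts.

    Assumed rules (not stated on the slide):
      leaf           = zero child calls in every invocation
      single relay   = exactly one child call in every invocation
      variable relay = everything else
    """
    if not call_counts:
        raise ValueError("need at least one observed invocation")
    if all(n == 0 for n in call_counts):
        return "leaf"
    if all(n == 1 for n in call_counts):
        return "single relay"
    return "variable relay"


def share_leaf_or_single_relay(observations):
    """Fraction of nodes classified as leaf or single relay.

    observations: dict of node name -> list of child call counts.
    """
    categories = [classify_node(counts) for counts in observations.values()]
    predictable = sum(1 for c in categories if c != "variable relay")
    return predictable / len(categories)
```

Under these assumed rules, the share returned by `share_leaf_or_single_relay` corresponds to the per-profile percentages listed on the slide.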

Predicting number of children
[Figure: panels for Ads, Fetch, and RaaS, with annotations 1 and 2]
Variation in number of calls is often attributed to 1) different children sets or 2) database accesses

Concurrency of variable relays
[Figure: concurrency over time]
Max Concurrency Rate: 2/3 (0.67)

Predicting concurrency rates
[Figure: panels for Ads, Fetch, and RaaS]
High variance in concurrency across different invocations of a parent ingress ID’s children.

Predicting concurrency rates
Children are either 100% concurrent or 0% concurrent
[Figure: example service|endpoint with two children sets (1 and 2): one always 100% concurrent, the other always 0% concurrent]
Children set helps explain concurrency rate
• Standard deviation in concurrency decreases by 38-44% when grouping by children set
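The effect of grouping by children set can be illustrated by comparing the spread of concurrency rates across all invocations of a parent with the spread inside each children-set group. The sketch below is an assumption about how such a comparison could be made: it uses the population standard deviation and a size-weighted average of the per-group values, which may differ from the aggregation behind the 38-44% figure. The function name is hypothetical.

```python
from collections import defaultdict
from statistics import pstdev


def stdev_before_and_after_grouping(samples):
    """Compare overall vs. within-group spread of concurrency rates.

    samples: list of (children_set, concurrency_rate) pairs for one parent,
    where children_set is hashable (e.g. a frozenset of child names).
    Returns (overall_stdev, weighted_within_group_stdev).
    """
    rates = [rate for _, rate in samples]
    overall = pstdev(rates)

    groups = defaultdict(list)
    for cs, rate in samples:
        groups[cs].append(rate)

    # Weight each group's standard deviation by its number of invocations.
    within = sum(len(g) * pstdev(g) for g in groups.values()) / len(rates)
    return overall, within
```

For the example on the slide, where one children set is always 100% concurrent and the other always 0% concurrent, the overall standard deviation is non-zero while the within-group value drops to zero.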

Outline
• Introduction
• Overview of Findings
• Topology
• Workflows
• Implications

Implications
Tooling that uses workflows for performance prediction, diagnosis, capacity planning [Tprof, Sifter, VAIF, CRISP]:
• Need to assume significant diversity in workflows originating from a root endpoint
Testbeds should be extended to provide support for:
• Heterogeneity of services, churn & growth of deployed instances
• Variable concurrency, number of children, and children sets even within requests from a single root endpoint
Tooling that uses topology for resource management [Sage, FIRM, Sinan]:
• Should be adaptable to dynamic topology
• Need to test when results are based on stale data

Summary
Topology
• Service is not one size fits all
• Long-term growth with daily churn
• Long tail of complex services
Workflows
• Observability loss impacts deep traces
• Variation in # calls, even locally
• Variation in conc., decreased by children set
• Wide & shallow
Microservice abstraction should be extended to support different types of architectures.
Data available @ github.com/facebookresearch/distributed_traces

References
• [SOSP’17]: Kaldor et al. Canopy
• [ATC ’22]: CRISP
• [SoCC ’21]: Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis. SoCC
• [JSEP ’22]: Characterizing and synthesizing the workflow structure of microservices in ByteDance Cloud
• [JSys’22]: Huye & Shesagiri et al. [SoK] Identifying Mismatches Between Microservice Testbeds and Industrial Perceptions of Microservices
• [ASPLOS’19]: Gan et al. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems
• [TSE’18]: Zhou et al. Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study
• [ASPLOS’21]: Sage
• [Tprof’21]: Tprof. SoCC

References
• [ICICCS’22]: Improving task scheduling in microservice environments by considering intra-job dependencies
• [OSDI’20]: FIRM: An intelligent fine-grained resource management framework for SLO-oriented microservices
• [IEEE’13]: Visualizing request-flow comparison to aid performance diagnosis in distributed systems
• [VAIF’21]: Automating instrumentation choices for performance problems in distributed applications with VAIF. SoCC
• [SINAN’21]: Sinan: ML-based and QoS-aware resource management for cloud microservices. ASPLOS
• [NSDI’11]: Spectroscope
• [Anand’20]: Aggregate-driven trace visualizations for performance debugging. ArXiv
