Architecture
• Many functions in a single process
• Hard to scale, wasting resources
• Siloed teams

Today’s Microservices Application Architecture
• Each element of functionality defined as “microservices”
• REST APIs
• Cross-functional teams organized around capabilities
• Scalable, efficient and fully dynamic
CONTAINERS
Diagram: Virtual Machines and Containers each stack User Code, Libraries, Virtual Processor, Operating System and Physical Processor; layers are marked as either a private copy or shared.

                    Virtual Machines    Containers
Start time          30-45 seconds       < 50 ms
Stop time           5-10 seconds        < 50 ms
Workload density    1x                  10 - 100x
IS THE NEW SERVER
New form factor for developing apps and running them

MAINFRAME
• Unit of interaction: partition (LPAR)
• Definitive apps: data / transaction processing
• OS: UNIX, IBM OS/360

PHYSICAL (x86)
• Unit of interaction: server
• Definitive apps: ERP, CRM, productivity, mail & web server
• OS: Linux, Windows

VIRTUAL
• Unit of interaction: virtual machine
• Definitive apps: ERP, CRM, productivity, mail & web server
• OS: Linux, Windows + hypervisor

UNIFIED HYPERSCALE
• Unit of interaction: full datacenter
• Definitive apps: big data, Internet of Things, mobile apps
• OS: datacenter operating system
• High performance and efficient resource isolation
• Easy scalability and multi-tenancy
• Fault-tolerant and highly available
• Highly efficient with highest utilization
• Complete workload portability

Diagram: Mesos hosting Docker, Big Data Analytics (Hadoop, Spark, etc.), Cloud Foundry and Stateful Service (All). Deploys on-premises, in the cloud, or both.
Typical Datacenter
• Siloed, over-provisioned servers, low utilization
• Industry average: 12-15% utilization

Mesos Datacenter
• Automated schedulers, workload multiplexing onto the same machines
• Mesos multiplexing: 30-40% utilization, up to 96% at some customers

Figure callout: 4X
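The effect of multiplexing on utilization can be illustrated with a small calculation. This is only a sketch: the workload names, peak sizes, average loads and shared pool size below are hypothetical numbers chosen for illustration, not measurements from the slide.

```python
# Hypothetical workloads: (name, peak CPUs needed, average CPUs actually used).
WORKLOADS = [
    ("web", 40, 8),
    ("batch", 60, 6),
    ("analytics", 50, 7),
]


def siloed_utilization(workloads):
    """Each workload owns dedicated servers sized for its own peak."""
    provisioned = sum(peak for _, peak, _ in workloads)
    used = sum(avg for _, _, avg in workloads)
    return used / provisioned


def multiplexed_utilization(workloads, shared_capacity):
    """All workloads share one pool of machines.

    shared_capacity is an assumption: it can be smaller than the sum of
    the peaks only if the peaks rarely coincide.
    """
    used = sum(avg for _, _, avg in workloads)
    return used / shared_capacity


siloed = siloed_utilization(WORKLOADS)
shared = multiplexed_utilization(WORKLOADS, shared_capacity=60)
print(f"siloed: {siloed:.0%}, multiplexed: {shared:.0%}")
```

The same average demand is spread over far fewer machines in the shared pool, which is where the utilization gain comes from.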
OF MESOS
Production-proven Web Scale Cluster Managers

Cluster manager      License                         Timeframe
Tupperware/Bistro    Proprietary                     ~2007
Borg/Omega           Proprietary                     ~2001
Apache Mesos         Open Source (Apache License)    2010+

• Built at UC Berkeley AMPLab by Ben Hindman et al.
• Built in collaboration with Google to overcome some Borg challenges
• Top-level project at the Apache Software Foundation
DATACENTER KERNEL
Designed to be flexible
• Aggregate all resources in the datacenter for modern apps
• Intentionally simple to enable massive scalability
• Handles different types of tasks: long-running, batch & real-time
• Two-level scheduler architecture enables multiple kinds of scheduling logic (a key challenge at Google)
• Extensible to work with new technologies

Gaining massive adoption
Chart: Mesos daily downloads, July 2014 - November 2015
• The Mesos agent is a process running on each node in the cluster
• Mesos agents have two primary functions:
  ◦ Manage and offer local resources on the Mesos agent node
  ◦ Launch and manage the executors using containers to run a task

Diagram: an Agent hosting two Executors, one running three Tasks and the other running two Tasks.
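The agent's two functions can be pictured with a toy model. This is a minimal sketch, not the Mesos agent implementation: the Agent class, its method names and the resource figures are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Toy model of an agent node: tracks free resources and running tasks."""
    cpus: float
    mem_mb: int
    tasks: dict = field(default_factory=dict)  # task_id -> (cpus, mem_mb)

    def available(self):
        """Resources not claimed by running tasks; this is what gets offered."""
        used_cpus = sum(c for c, _ in self.tasks.values())
        used_mem = sum(m for _, m in self.tasks.values())
        return self.cpus - used_cpus, self.mem_mb - used_mem

    def launch(self, task_id, cpus, mem_mb):
        """Start a task if it fits in the free resources; return True on success."""
        free_cpus, free_mem = self.available()
        if cpus > free_cpus or mem_mb > free_mem:
            return False
        self.tasks[task_id] = (cpus, mem_mb)
        return True

    def finish(self, task_id):
        """Release the resources held by a completed task."""
        self.tasks.pop(task_id, None)


agent = Agent(cpus=4.0, mem_mb=8192)
agent.launch("task-1", cpus=1.0, mem_mb=2048)
agent.launch("task-2", cpus=2.0, mem_mb=4096)
print(agent.available())
```

Whatever is left after running tasks are accounted for is what the node can offer next.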
• A framework scheduler is the component that decides which Mesos resource offers to accept or reject to complete the work of that specific framework
• The scheduler makes these decisions by:
  ◦ Examining the offer’s
    ▪ Resources
    ▪ Attributes
  ◦ Matching the scheduler’s resource needs and placement constraints to the offer

Diagram: the leader Master sends OFFER 1 through OFFER N to the schedulers of Framework A (Marathon) and Framework B (Cassandra).
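The accept-or-reject decision can be sketched as a simple matching function. This is an illustrative model only and does not use the real Mesos scheduler API: Offer, TaskRequest, matches and handle_offers are hypothetical names, and the offers and attributes are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    offer_id: str
    cpus: float
    mem_mb: int
    attributes: dict = field(default_factory=dict)


@dataclass
class TaskRequest:
    name: str
    cpus: float
    mem_mb: int
    constraints: dict = field(default_factory=dict)  # required attribute values


def matches(offer, task):
    """True if the offer has enough resources and satisfies every constraint."""
    if offer.cpus < task.cpus or offer.mem_mb < task.mem_mb:
        return False
    return all(offer.attributes.get(k) == v for k, v in task.constraints.items())


def handle_offers(offers, pending):
    """Accept each offer for the first pending task it fits; decline the rest."""
    accepted, declined = {}, []
    pending = list(pending)  # work on a copy so the caller's list is untouched
    for offer in offers:
        task = next((t for t in pending if matches(offer, t)), None)
        if task is None:
            declined.append(offer.offer_id)
        else:
            accepted[offer.offer_id] = task.name
            pending.remove(task)
    return accepted, declined


offers = [
    Offer("offer-1", cpus=2.0, mem_mb=4096, attributes={"rack": "r1"}),
    Offer("offer-2", cpus=8.0, mem_mb=16384, attributes={"rack": "r2", "disk": "ssd"}),
]
pending = [TaskRequest("cassandra-node", cpus=4.0, mem_mb=8192, constraints={"disk": "ssd"})]
accepted, declined = handle_offers(offers, pending)
print(accepted, declined)
```

Each framework can apply its own matching rule to the same stream of offers, which is the point of keeping the placement decision in the framework scheduler.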
The executor does the work on behalf of the framework on the agent nodes.
• An executor runs within a container
• An executor can run multiple tasks

Diagram: on an Agent Node, the Mesos Agent Process manages Container 1, where a Hadoop Executor runs Task 1 and Task 2, and Container 2, where a Mesos Executor runs python -m SimpleHTTPServer.
MESOS @ TWITTER
• In production for ~5 years
• Largest known Mesos production clusters: O(10^5) containers, O(10^4) hosts
• Most stateless services run on Mesos / Aurora
• CAPEX and OPEX savings in millions
Figure: Microservices Interactions Example (Hailo Taxi Platform)

Design & Deploy
• Developer access to production-like environments
• Service discovery between large number of services
• Complex deployment and rollback of services
• Ensuring API contract not broken between versions of various services

Monitoring & Operations
• Monitoring, tracing and root cause analysis to ensure end-to-end performance across large number of services
• Utilization of multiple, independent distributed systems

Service Quality & Continuity
• Fault tolerance and healing (in an always-on environment)

Security
• Secrets (key) management across large number of services
• Incident detection and remediation
Mesosphere Datacenter Operating System (DCOS) is a new kind of operating system that spans all of the machines in your datacenter or cloud. It provides a highly elastic and highly scalable way of deploying applications, services and big data infrastructure on shared resources.

Diagram: Mesosphere DCOS sits on top of Existing Infrastructure and runs Microservices & Containers alongside Database, Analytics & Other Services.
TRADITIONAL APPROACH TO BUILDING MODERN APPS
Timeline stages: Concept, Dev capacity request, wait time, HW available, Fwk Setup, Config, Delivery Setup, Start work, Dev, QA, Staging, Prod, New code successfully running
• 40~50% of active developer time on activity not related to code improvement

APPROACH WITH MESOSPHERE DCOS
Timeline stages: Concept, Dev capacity request, Capacity available, Fwk Setup, Start work, Dev, Sys Test, Phased Rollout, Prod, New code successfully running
• DCOS enables CI/CD, without being prescriptive on code management or lifecycle automation tools
EVENTS
• Ubiquitous data streams from connected devices (Sensors, Devices, Clients)

FEEDS: Kafka
• Ingest millions of events per second

ANALYTICS: Spark
• Real-time and batch process data

STORAGE: Cassandra
• Distributed & highly scalable database

REACTIVE APP: Akka
• Scalable, resilient, data-driven applications
HYBRID INFRASTRUCTURE
Diagram: Datacenter, Move to Cloud, Burst to Cloud

Hybrid cloud scenarios
• Identical user experience: same user experience as customers continue to move workloads from private datacenters to Cloud
• Burst to Cloud: autoscaling for burst scenarios to Cloud; dynamically scale cloud server capacity
• Data-aware scheduling: schedule workloads to a private datacenter or Cloud based on data gravity and type of application (e.g. financial records vs. sensor data)
• Failover and fault tolerance: automatically move workloads to Cloud in the case of private datacenter failure
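The hybrid scenarios above can be read as a placement policy. The sketch below is one possible interpretation, not DCOS behavior: the place function, the workload fields (data_location, cpus, burstable) and the rule ordering are all assumptions made for illustration.

```python
def place(workload, private_healthy, private_free_cpus):
    """Return "private", "cloud" or "wait" for a workload described by a dict.

    Hypothetical rules, checked in order:
    1. Failover: if the private datacenter is down, run in the cloud.
    2. Data gravity: run next to the data when the data lives in the cloud.
    3. Prefer the private datacenter while it has free capacity.
    4. Burst: overflow to the cloud only if the workload allows it.
    """
    if not private_healthy:
        return "cloud"
    if workload["data_location"] == "cloud":
        return "cloud"
    if workload["cpus"] <= private_free_cpus:
        return "private"
    return "cloud" if workload["burstable"] else "wait"


# Example workloads with made-up values.
ledger = {"name": "financial-records", "data_location": "private", "cpus": 4, "burstable": False}
sensors = {"name": "sensor-ingest", "data_location": "cloud", "cpus": 2, "burstable": True}
web = {"name": "web-frontend", "data_location": "private", "cpus": 16, "burstable": True}

for w in (ledger, sensors, web):
    print(w["name"], "->", place(w, private_healthy=True, private_free_cpus=8))
```

Keeping the rules in one function makes the priority between failover, data gravity and bursting explicit; a real deployment would likely express these as scheduler constraints instead.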
DISTRIBUTED SYSTEMS
• Future software will manage itself, using Mesos and the DCOS API
• Most distributed systems are difficult to manage, but they don’t need to be

Examples shown:
• Kafka: messaging backbone
• Spark: data processing engine
• Cassandra: distributed database
• HDFS: distributed file system