Slide 1

Slide 1 text

#GoogleSandbox
Google's Production Environment
Florian Rathgeber, Site Reliability Engineer, Google Cloud

Slide 2

Slide 2 text

Florian
Site Reliability Engineer, Google Cloud
SRE for 2+ years
● On the Cloud Console SRE team
● Spend most of my time on SLOs
Previous life
● Computational Scientist @ Imperial College
● Data Engineer @ ECMWF
Co-founded PyData London

Slide 3

Slide 3 text

Google's Globe Spanning Network
[Map legend: submarine cable investments; current fiber network]
https://cloud.google.com/about/locations/

Slide 4

Slide 4 text

Google Data Centers
● B4
● Edge Network
● GSLB
● Jupiter
● GFE
[Map legend: current regions and number of zones; future regions and number of zones]
https://cloud.google.com/about/locations/

Slide 5

Slide 5 text

Data Center Setup
● Campus
● Data center
● Cluster
● Row
● Rack
● Machine

Slide 6

Slide 6 text

Cluster Management
[Diagram labels: Config Files, Tools, BorgMaster, Scheduler, Persistent store, Cluster, Borglet (x3)]
BNS addresses: /bns////
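
The scheduler's role on this slide, placing tasks onto machines that each run a Borglet, can be illustrated with a toy first-fit placement. This is only a sketch: the Machine and Task types, the resource fields, and the first-fit policy are assumptions made for illustration and do not describe how Borg actually schedules work.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    # Hypothetical machine record; the field names are illustrative only.
    name: str
    free_cpu: float
    free_ram_gb: float
    tasks: list = field(default_factory=list)


@dataclass
class Task:
    # Hypothetical task record with its resource request.
    name: str
    cpu: float
    ram_gb: float


def schedule(tasks, machines):
    """Place each task on the first machine with enough free CPU and RAM.

    Returns a dict mapping task name to machine name, or to None when the
    task does not fit on any machine.
    """
    placement = {}
    for task in tasks:
        placement[task.name] = None
        for machine in machines:
            if machine.free_cpu >= task.cpu and machine.free_ram_gb >= task.ram_gb:
                # Reserve the resources so later tasks see the reduced capacity.
                machine.free_cpu -= task.cpu
                machine.free_ram_gb -= task.ram_gb
                machine.tasks.append(task.name)
                placement[task.name] = machine.name
                break
    return placement
```

First-fit is chosen here only because it is the shortest policy to read; a production scheduler would also weigh priorities, failure domains, and preemption.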

Slide 7

Slide 7 text

Lock Service
Chubby: consistent data, e.g.
- BNS paths -> IP addresses
- master election
[Diagram labels: Chubby (x3), Paxos (x2), Cluster (x3)]
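
Master election on top of a lock service can be sketched as follows. This is a toy, single-process stand-in and not Chubby's API: the LockService class, its method names, the lock name, and the lease length are all hypothetical. A real lock service replicates this state across machines with a consensus protocol such as Paxos.

```python
import time


class LockService:
    """Toy in-memory lock service with time-limited leases (hypothetical API)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}  # lock name -> (holder, lease expiry time)

    def try_acquire(self, name, holder, lease_seconds):
        """Grant or renew the lock if it is free, expired, or already ours."""
        now = self._clock()
        current = self._locks.get(name)
        if current is None or current[1] <= now or current[0] == holder:
            self._locks[name] = (holder, now + lease_seconds)
            return True
        return False

    def holder(self, name):
        """Return the current holder, or None if nobody holds a live lease."""
        current = self._locks.get(name)
        if current is None or current[1] <= self._clock():
            return None
        return current[0]


def elect_master(service, candidates, lock_name="service/master", lease_seconds=10):
    """Every candidate tries for the same lock; the holder is the master."""
    for candidate in candidates:
        service.try_acquire(lock_name, candidate, lease_seconds)
    return service.holder(lock_name)
```

The lease matters: if the master stops renewing it, the lock expires and another replica can take over without manual intervention.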

Slide 8

Slide 8 text

Storage
[Diagram labels: D (HDD, SSD), Colossus, Bigtable, Spanner, ..., Cluster]

Slide 9

Slide 9 text

Monitoring
[Diagram labels: Server, Prober, Scraping Borgmon, Cluster Borgmon, Global Borgmon, Time Series Database, Alert Manager; flows: Data, Alerts]
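
The scrape, store, and alert flow on this slide can be sketched in a few lines. This is a minimal illustration and not Borgmon: the TimeSeriesDB class, the scrape and evaluate_alert functions, and the metric name used in the test data are all hypothetical, and targets are plain Python callables instead of servers polled over the network.

```python
from collections import defaultdict


class TimeSeriesDB:
    """Stores (timestamp, value) samples per (target, metric) pair."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, target, metric, timestamp, value):
        self._series[(target, metric)].append((timestamp, value))

    def latest(self, target, metric):
        """Return the most recent value, or None if there are no samples."""
        samples = self._series.get((target, metric))
        return samples[-1][1] if samples else None

    def targets(self, metric):
        """Return the sorted names of all targets that report this metric."""
        return sorted(t for (t, m) in self._series if m == metric)


def scrape(db, targets, timestamp):
    """Poll every target's metrics callable and record the samples."""
    for name, read_metrics in targets.items():
        for metric, value in read_metrics().items():
            db.append(name, metric, timestamp, value)


def evaluate_alert(db, metric, threshold):
    """Return the targets whose latest sample exceeds the threshold."""
    return [t for t in db.targets(metric) if db.latest(t, metric) > threshold]
```

In the layered setup on the slide, a rule like this would be evaluated at the cluster or global level and its result handed to the alert manager.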

Slide 10

Slide 10 text

Communication
[Diagram labels: Service, Client (x2), Stubby Server, Stubby Stub (x2), Server; languages: C++, Java, Ruby; messages: protobuf request, protobuf response]
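
The stub pattern on this slide (a client stub serializes a request, a server decodes it, runs a handler, and returns a serialized response) can be sketched as below. This is not Stubby or gRPC: JSON stands in for Protocol Buffers, the transport is an in-process function call, and the GreeterServer, ClientStub, and Greet names are hypothetical.

```python
import json


class GreeterServer:
    """Server side: decodes a request, calls the handler, encodes a response."""

    def __init__(self):
        self._handlers = {"Greet": self._greet}

    def _greet(self, request):
        return {"message": "Hello, " + request["name"]}

    def handle(self, wire_bytes):
        envelope = json.loads(wire_bytes.decode("utf-8"))
        handler = self._handlers.get(envelope["method"])
        if handler is None:
            reply = {"ok": False, "error": "unknown method"}
        else:
            reply = {"ok": True, "response": handler(envelope["request"])}
        return json.dumps(reply).encode("utf-8")


class ClientStub:
    """Client side: hides serialization and transport behind a method call."""

    def __init__(self, transport):
        self._transport = transport  # callable taking bytes, returning bytes

    def call(self, method, request):
        wire = json.dumps({"method": method, "request": request}).encode("utf-8")
        reply = json.loads(self._transport(wire).decode("utf-8"))
        if not reply["ok"]:
            raise RuntimeError(reply["error"])
        return reply["response"]
```

Because only bytes cross the boundary between stub and server, the two sides could be written in different languages, which is the point of the C++, Java, and Ruby labels on the slide.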

Slide 11

Slide 11 text

Code Repository
[Diagram labels: Author, Changelist, change, Reviewer: "Looks Good To Me", Owner: Approval, Presubmit Checks: OK!, submit...done!, Piper Code Repository]
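
The three gates on this slide (a reviewer's "Looks Good To Me", an owner's approval, and passing presubmit checks) can be expressed as a single submit check. This is a sketch and not Piper's implementation: the Changelist type, the can_submit function, and the rule that the LGTM must come from someone other than the author are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Changelist:
    # Hypothetical changelist record; field names are illustrative only.
    author: str
    files: list
    lgtm_by: set = field(default_factory=set)
    approved_by: set = field(default_factory=set)


def can_submit(cl, owners, presubmit_checks):
    """Return (ok, reasons); all three gates must pass before submit."""
    reasons = []
    # Gate 1: at least one LGTM from someone other than the author (assumed rule).
    if not (cl.lgtm_by - {cl.author}):
        reasons.append("needs LGTM from a reviewer other than the author")
    # Gate 2: at least one approval from an owner of the affected code.
    if not (cl.approved_by & owners):
        reasons.append("needs approval from an owner")
    # Gate 3: every presubmit check must return True for this changelist.
    failed = [check.__name__ for check in presubmit_checks if not check(cl)]
    if failed:
        reasons.append("presubmit checks failed: " + ", ".join(failed))
    return (not reasons, reasons)
```

Returning the list of reasons, and not only a boolean, lets the author see every blocking gate at once.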

Slide 12

Slide 12 text

Continuous Build and Deployment
[Diagram labels: Piper Code Repository, Blaze, Binaries, Tests, Continuous Testing Framework (PASS, FAIL, PASS, ...), Rapid, MPM, Sisyphus, Production]
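
The core idea of the pipeline on this slide, that packages are built and rolled out only after the tests pass, can be sketched as a single gate. This is a simplification and not how Blaze, Rapid, or Sisyphus work: the run_pipeline function and its tests, build, and deploy parameters are hypothetical callables supplied by the caller.

```python
def run_pipeline(tests, build, deploy):
    """Run every test, then build and deploy only if all of them pass.

    tests is a list of zero-argument callables returning True on success,
    build returns a package, and deploy takes that package.
    """
    results = {test.__name__: ("PASS" if test() else "FAIL") for test in tests}
    if any(status == "FAIL" for status in results.values()):
        # A single failure blocks the release; nothing is built or deployed.
        return {"results": results, "deployed": False}
    package = build()
    deploy(package)
    return {"results": results, "deployed": True}
```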

Slide 13

Slide 13 text

Tying it all together...
● Develop the software: Piper, Blaze
● Build the MPMs: Rapid
● Run it in a cluster: Borg, which uses Chubby
● Route requests/responses: GFE, GSLB, ProtoBuf, Stubby
● Store and read messages: Colossus, Bigtable, Spanner
● Monitor and fire alerts: Borgmon
● Roll out new versions: Sisyphus

Slide 14

Slide 14 text

List of related open-source projects
● Cluster management: Kubernetes (kubernetes.io)
● Lock service: ZooKeeper (zookeeper.apache.org), etcd (coreos.com/etcd)
● Storage: HDFS (hadoop.apache.org), Cassandra (cassandra.apache.org)
● Monitoring: Prometheus (prometheus.io)
● RPC: gRPC (grpc.io)
● Data serialization: Protocol Buffers (developers.google.com/protocol-buffers)
● Google style guides (github.com/google/styleguide)
● The Go programming language (golang.org)
● Code repository: Git (git-scm.com)
● Code review: Rietveld (github.com/rietveld-codereview/rietveld)
● Building: Bazel (bazel.io)

Slide 15

Slide 15 text

Cover images used with permission. These books can be found on shop.oreilly.com.
The full text of the Google SRE books is available at www.google.com/sre