Slide 1

Slide 1 text

10 Real problems and solutions for your Build & Deploy process

Slide 2

Slide 2 text

About AppsFlyer
● Mobile analytics & attribution
● 11 offices worldwide
● Vast majority of mobile platforms supported
● More than 8B events per day
● Over 2K integrated partners
● Cloud based (AWS, GCP)

Slide 3

Slide 3 text

Build & Deploy tools [diagram; labels: commits, deploys, building projects, building Docker images, communication, image repository, machines running code, configuration / state, versions storage, deployment system, registry cache, SVC]

Slide 4

Slide 4 text

Categories
● General issues
● Scale optimizations
● Features & Tools
● Visibility

Slide 5

Slide 5 text

General Issues

Slide 6

Slide 6 text

Case #1: Deployment failures
1) Health check timeouts
2) Machine list mismatch
3) Wrong Java version
4) Incorrect startup parameters
5) Wrong timing for the LB health check
6) Port in use

Slide 7

Slide 7 text

The Solution #1
1) Tune the start time and validate that the timeout is not too short
2) Update the machine list dynamically, add mismatch alerts and auto-correction
3) Keep the Java version used by Jenkins identical to the instance/container version
4) Set reasonable defaults and guide new team members
5) Sync the LB health check with the deployment state
6) Dynamic port allocation / validation (a port check is sketched below)
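
A minimal sketch of the port validation in item 6, assuming the check runs on the target machine before startup; the port number is a hypothetical example:

```python
import socket

def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    # Try to bind; if something already listens on the port, bind raises OSError.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

if __name__ == "__main__":
    port = 8080  # hypothetical service port
    if not port_is_free(port):
        raise SystemExit(f"port {port} is already in use, aborting deploy")
    print(f"port {port} is free, continuing deploy")
```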

Slide 8

Slide 8 text

Case #2: Docker issues
1) Hostname characters limitation
2) JVM escape
3) Image corrupted in the registry
4) Conntrack table limits

Slide 9

Slide 9 text

The Solution #2
1) Make sure hostnames do not exceed 64 characters
2) Upgrade to a newer Docker version
3) Let the client automatically fail over to another registry; otherwise push a fake commit to recreate the image
4) Increase the conntrack table size and file descriptor limits (a usage check is sketched below)
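
For item 4, a sketch of a pre-emptive check, assuming a Linux host with the nf_conntrack module loaded (the limit itself is raised via the net.netfilter.nf_conntrack_max sysctl; the 80% threshold is an arbitrary choice):

```python
# Warn when the kernel conntrack table approaches its limit.
COUNT_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"
MAX_PATH = "/proc/sys/net/netfilter/nf_conntrack_max"

def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

count, limit = read_int(COUNT_PATH), read_int(MAX_PATH)
usage = count / limit
print(f"conntrack: {count}/{limit} ({usage:.0%})")
if usage > 0.8:
    print("WARNING: conntrack table above 80%, consider raising nf_conntrack_max")
```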

Slide 10

Slide 10 text

Features & Tools

Slide 11

Slide 11 text

Case #3: (NOT) Losing traffic while deploying
1) Losing traffic when deploying to a server behind a load balancer
2) Losing traffic when stopping a container

Slide 12

Slide 12 text

The Solution #3
1) Connection draining: remove the server from the load balancer and wait x seconds before deploying. The drain time is set in Consul per service; the default is 30 seconds.
2) Graceful shutdown: allow several seconds before killing the container, and capture SIGTERM to flush in-process messages to an external DB or queue (see the sketch below).
Ex: POST /containers/e90e34656806/stop?t=5 HTTP/1.1
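
A minimal sketch of the graceful-shutdown pattern in item 2: trap SIGTERM (what the stop call above sends before SIGKILL) and flush in-flight work before exiting. flush_to_queue is a hypothetical stand-in for the external DB or queue:

```python
import signal
import sys
import time

shutting_down = False

def flush_to_queue():
    # Hypothetical: push in-process messages to an external DB or queue.
    print("flushing in-flight messages...")

def handle_sigterm(signum, frame):
    # docker stop sends SIGTERM, then SIGKILL after the t= timeout expires.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    time.sleep(0.1)  # stand-in for the service's main loop

flush_to_queue()
sys.exit(0)
```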

Slide 13

Slide 13 text

Case #4: Deploy from branch
Building & deploying from a non-default branch

Slide 14

Slide 14 text

The Solution #4
We added an option to build & deploy from a branch in our deployment system.
The branch flow includes several steps (steps 1-4 are sketched below):
1) Save the current configuration (Jenkins)
2) Update the new branch & revision
3) Initiate the Jenkins build
4) Revert the configuration state
5) Detect when the image is available
6) Enable deployment
** Alternatively, you can create a new Jenkins configuration for each branch and clean it up later
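
A hedged sketch of steps 1-4 using the python-jenkins client; the Jenkins URL, job name, branch, and the naive XML string swap are all hypothetical, and a real flow should wait for the build to start before reverting:

```python
import jenkins  # pip install python-jenkins

server = jenkins.Jenkins("http://jenkins.example.com",
                         username="user", password="api-token")
job = "my-service"      # hypothetical job name
branch = "feature/foo"  # hypothetical branch

# 1) Save the current configuration so it can be restored later
original_config = server.get_job_config(job)

# 2) Point the job at the new branch (naive text swap; editing the SCM
#    section of the XML properly is safer)
server.reconfig_job(job, original_config.replace("*/master", f"*/{branch}"))

# 3) Initiate the Jenkins build
server.build_job(job)

# 4) Revert the configuration (after the build has picked up the branch)
server.reconfig_job(job, original_config)
```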

Slide 15

Slide 15 text

The Solution #4
A few notes:
● Maintain a separate KV for default and branch builds
● Provide an option to build from either “scratch” or a “base image”
● Regularly back up Jenkins configurations
● Send build failures to Slack

Slide 16

Slide 16 text

Case #5: Free developers from your burden with self-serve

Slide 17

Slide 17 text

The Solution #5
Build a self-serve UI that lets developers add or edit (a minimal endpoint is sketched below):
● Services
● Modes
● Autoscale
● Healer
● Spot instances
● Alerts
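
A minimal sketch of one such endpoint, assuming Flask for the API and python-consul for storage; the route and the services/<name>/autoscale KV layout are hypothetical:

```python
import json

import consul                               # pip install python-consul
from flask import Flask, request, jsonify   # pip install flask

app = Flask(__name__)
c = consul.Consul()

@app.route("/services/<name>/autoscale", methods=["PUT"])
def set_autoscale(name):
    # e.g. {"min": 2, "max": 10}; validation omitted for brevity
    settings = request.get_json()
    c.kv.put(f"services/{name}/autoscale", json.dumps(settings))
    return jsonify(ok=True)

if __name__ == "__main__":
    app.run(port=5000)
```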

Slide 18

Slide 18 text

Scale

Slide 19

Slide 19 text

Case #6: Slow build time
The time from pushing code to deployment readiness grows over time

Slide 20

Slide 20 text

The Solution #6
● Create base images according to service type
● Increase the number of Jenkins slaves
● Migrate slaves to new-generation CPU instances
● Split compilation and image build to tune the workload
● Redesign the registry to improve image push time
● Add proxies (jars, npm, etc.)

Slide 21

Slide 21 text

Case #7: Distributing the Docker registry
A single Docker registry in active-passive mode becomes a bottleneck when building and deploying simultaneously across several dozen services and instances

Slide 22

Slide 22 text

The Solution #7
A distributed, sharded registry with a replication factor, high availability, rack awareness and automatic recovery.
Example: a scenario of 3 registries and an RF of 2 (a pairing sketch follows):
● Each service/mode is served by 2 registries
● Pairs are distributed evenly between modes
● Metadata is saved in the Consul KV
● All images are uploaded to S3, so reseeding a registry is easy
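
A sketch of the pairing idea; the registry hostnames and Consul KV layout are purely illustrative. A deterministic hash keeps each service pinned to the same RF-sized registry subset while spreading pairs evenly:

```python
import hashlib

REGISTRIES = ["registry-1:5000", "registry-2:5000", "registry-3:5000"]
RF = 2  # replication factor

def registries_for(service: str) -> list:
    # Deterministic starting shard per service, then RF consecutive registries
    start = int(hashlib.md5(service.encode()).hexdigest(), 16) % len(REGISTRIES)
    return [REGISTRIES[(start + i) % len(REGISTRIES)] for i in range(RF)]

for svc in ["attribution", "events-api", "reports"]:
    pair = registries_for(svc)
    print(svc, "->", pair)
    # metadata would then be saved to the Consul KV, e.g.:
    # consul_client.kv.put(f"registry/shards/{svc}", json.dumps(pair))
```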

Slide 23

Slide 23 text

The Solution #7
Relevant links:
● Project blog: http://relmos.blogspot.co.il/2016/09/scaling-private-docker-registry-at_49.html
● Registry deploy: https://docs.docker.com/registry/deploying/

Slide 24

Slide 24 text

Case #8: Cleaning old versions
Prevent old versions from piling up

Slide 25

Slide 25 text

The Solution #8
This is where Docker shines. We clean:
● Stopped (exited) containers
● Old images that are no longer in use
● The registry
The container and image cleanup is sketched below.
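
A sketch of the container and image cleanup with the Docker SDK for Python (pip install docker); registry garbage collection is a separate, registry-side step not shown here:

```python
import docker

client = docker.from_env()

# Remove stopped (exited) containers
result = client.containers.prune()
print("container space reclaimed:", result.get("SpaceReclaimed", 0))

# Remove dangling images that no container references
result = client.images.prune(filters={"dangling": True})
print("image space reclaimed:", result.get("SpaceReclaimed", 0))
```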

Slide 26

Slide 26 text

Visibility

Slide 27

Slide 27 text

Case #9: Detecting version inconsistency
Different versions of the same service deployed in production (unintentionally)

Slide 28

Slide 28 text

The Solution #9
● A graphical, near-real-time view of versions per service
● An easy way to add an alert on inconsistent versions (a check is sketched below)
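
A sketch of the inconsistency check: group running versions by service and alert when more than one distinct version shows up. get_running_versions is a hypothetical stand-in for whatever feeds the view (Consul, a deploy DB, etc.):

```python
from collections import defaultdict

def get_running_versions():
    # Hypothetical feed of (service, host, version) tuples
    return [
        ("attribution", "host-1", "1.4.2"),
        ("attribution", "host-2", "1.4.1"),
        ("events-api", "host-3", "2.0.0"),
    ]

by_service = defaultdict(set)
for service, _host, version in get_running_versions():
    by_service[service].add(version)

for service, versions in sorted(by_service.items()):
    if len(versions) > 1:
        print(f"ALERT: {service} runs inconsistent versions: {sorted(versions)}")
```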

Slide 29

Slide 29 text

Case #10: Tracking it all
Lack of visibility into deployments and failed builds

Slide 30

Slide 30 text

The Solution #10
● Add a Slack integration that includes all relevant information (version, servers, instances, user); see the sketch below
● Send deployment events to Graphite and combine them with relevant dashboards
● Send build & deployment logs to a central log system
● An event system that graphically presents important events (deployments, heals, autoscale, etc.)
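
A minimal sketch of the Slack integration via an incoming webhook; the webhook URL and message fields are hypothetical:

```python
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify_deploy(service, version, servers, user):
    text = (f"Deployed *{service}* version `{version}` "
            f"to {len(servers)} servers (by {user})")
    requests.post(WEBHOOK_URL, json={"text": text}, timeout=5)

notify_deploy("attribution", "1.4.2", ["host-1", "host-2"], "alice")
```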
