10 Real problems & solutions in your build and deploy process

AppsFlyer
September 21, 2016


Ariel Moskovich lecture at Reversim 2016


Transcript

  1. 10 Real problems and solutions for your Build & Deploy process
  2. About AppsFlyer
     • Mobile analytics & attribution
     • 11 offices worldwide
     • Vast majority of mobile platforms supported
     • More than 8B events per day
     • Over 2K integrated partners
     • Cloud based (AWS, GCP)
  3. Build & Deploy tools (diagram of the pipeline components): commits,
     deploys, building projects, building Docker images, communication,
     image repository, machines running code, configuration / state,
     versions storage, deployment system, registry cache, SVC
  4. Categories
     • General issues
     • Scale optimizations
     • Features & Tools
     • Visibility
  5. General Issues

  6. Case #1: Deployment failures
     1) Health check timeouts
     2) Machine list mismatch
     3) Wrong Java version
     4) Incorrect startup parameters
     5) Wrong timing for the LB health check
     6) Port in use
  7. The Solution #1
     1) Tune the start time; validate the timeout is not too short
     2) Update the machine list dynamically, add mismatch alerts and auto-correction
     3) Match the Jenkins version to the instance/container version
     4) Set reasonable defaults, guide new team members
     5) Sync the LB health check with the deployment state
     6) Dynamic port allocation / validation
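Point 1 above, tuning the health-check timeout, can be sketched as a simple polling loop (a minimal Python illustration; the function and parameter names are hypothetical, not part of the actual deployment system):

```python
import time

def wait_until_healthy(check, timeout_s=60.0, interval_s=2.0):
    """Poll a health check until it passes or the timeout expires.

    `check` is any zero-argument callable returning True when the
    service is up. Tune `timeout_s` to the service's real start time
    so deploys don't fail on a timeout that is too short.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False
```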
  8. Case #2: Docker issues
     1) Hostname character limit
     2) JVM escape
     3) Image corrupted in the registry
     4) Conntrack table limits
  9. The Solution #2
     1) Make sure you are not exceeding 64 characters
     2) Upgrade to a newer Docker version
     3) Let the client automatically fail over to another registry; otherwise push a fake commit to recreate the image
     4) Increase the conntrack table size & file descriptor limits
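The 64-character limit in point 1 comes from Linux's HOST_NAME_MAX, which caps hostnames at 64 characters. A hedged sketch of a pre-flight check (the helper name is hypothetical):

```python
# Linux caps hostnames at 64 characters (HOST_NAME_MAX); Docker derives
# container hostnames from names/IDs, so validate before creating one.
HOST_NAME_MAX = 64

def valid_container_hostname(name: str) -> bool:
    """Return True if `name` fits within the Linux hostname limit."""
    return 0 < len(name) <= HOST_NAME_MAX
```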
  10. Features & Tools

  11. Case #3: (NOT) Losing traffic while deploying
      1) Losing traffic when deploying to a server behind a load balancer
      2) Losing traffic when stopping a container
  12. The Solution #3
      1) Connection draining: remove the server from the load balancer and wait x seconds before deploying. The drain time value is set in Consul per service; the default is 30 seconds.
      2) Graceful shutdown: allow several seconds before killing the container, and capture SIGTERM to flush in-process messages to an external DB or queue.
         Ex: POST /containers/e90e34656806/stop?t=5 HTTP/1.1
  13. Case #4: Deploy from branch. Building & deploying from a non-default branch
  14. The Solution #4
      We added an option to build & deploy from a branch in our deployment system. The branch flow includes several steps:
      1) Save the current configuration (Jenkins)
      2) Update the new branch & revision
      3) Initiate the Jenkins build
      4) Revert the configuration state
      5) Detect when the image is available
      6) Enable deployment
      ** Alternatively, you can create a new Jenkins configuration for each branch and clean up later
  15. The Solution #4, a few notes:
      • Maintain a separate KV for default and branch
      • Provide an option to build from either "scratch" or a "base image"
      • Regularly back up Jenkins configurations
      • Send build failures to Slack
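The six-step branch flow above can be sketched as an ordered sequence of calls (a Python sketch; `jenkins` is a hypothetical client object, not the real deployment system's API):

```python
def deploy_from_branch(jenkins, branch, revision):
    """Run the branch-deploy flow as an ordered sequence of steps.

    `jenkins` is a hypothetical client exposing the operations below;
    the real system drives Jenkins and its KV store instead.
    """
    saved = jenkins.save_configuration()    # 1) save the current configuration
    jenkins.set_branch(branch, revision)    # 2) update the new branch & revision
    build = jenkins.trigger_build()         # 3) initiate the Jenkins build
    jenkins.restore_configuration(saved)    # 4) revert the configuration state
    jenkins.wait_for_image(build)           # 5) detect when the image is available
    jenkins.enable_deployment(build)        # 6) enable deployment
    return build
```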
  16. Case #5: Free developers from your burden with self-serve

  17. The Solution #5
      Build a self-serve UI that lets developers add or edit:
      • Services
      • Modes
      • Autoscale
      • Healer
      • Spot instances
      • Alerts
  18. Scale

  19. Case #6: Slow build time
      The time between pushing code and deployment readiness grows over time
  20. The Solution #6
      • Create base images per service type
      • Increase the number of Jenkins slaves
      • Migrate slaves to new-generation CPU instances
      • Split compilation and image build to tune the workload
      • Redesign the registry to improve image push time
      • Add proxies (jars, npm, etc.)
  21. Case #7: Distributing the Docker registry
      A single Docker registry in active-passive mode becomes a bottleneck when building and deploying simultaneously to several dozen services and instances
  22. The Solution #7
      A distributed, sharded registry with a replication factor, high availability, rack awareness and automatic recovery.
      Example scenario with 3 registries and an RF of 2:
      • Each service/mode is served by 2 registries
      • Pairs are distributed evenly between modes
      • Metadata is saved in the Consul KV
      • All images are uploaded to S3, so reseeding a registry is easy
  23. The Solution #7, relevant links:
      • Project blog: http://relmos.blogspot.co.il/2016/09/scaling-private-docker-registry-at_49.html
      • Registry deploy: https://docs.docker.com/registry/deploying/
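The pair assignment described above (3 registries, RF of 2) can be sketched as follows (a Python illustration of the assignment logic only; in the real system the resulting mapping lives in the Consul KV):

```python
from itertools import combinations

def assign_registry_pairs(services, registries, rf=2):
    """Assign each service a replica set of `rf` registries, cycling
    through all possible combinations so load spreads evenly.

    With 3 registries and rf=2 the pairs are (r1,r2), (r1,r3), (r2,r3),
    handed out round-robin across services.
    """
    pairs = list(combinations(registries, rf))
    return {svc: pairs[i % len(pairs)] for i, svc in enumerate(services)}
```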
  24. Case #8: Cleaning old versions
      Prevent old versions from piling up
  25. The Solution #8
      This is where Docker shines. We clean:
      • Containers in a stopped/exited state
      • Old images which are no longer in use
      • The registry
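The container-cleanup selection can be sketched as a filter over `docker ps -a`-style records (a minimal Python sketch with an assumed record schema; image and registry cleanup are not shown):

```python
def containers_to_clean(containers):
    """Pick stopped containers for removal.

    `containers` is a list of dicts shaped like `docker ps -a` output
    (an assumed minimal schema: name + status). Containers whose
    status is "exited" are candidates for `docker rm`.
    """
    return [c["name"] for c in containers if c["status"] == "exited"]
```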
  26. Visibility

  27. Case #9: Detecting version inconsistency
      Different versions of the same service deployed in production (unintentionally)
  28. The Solution #9
      • A graphical, near real-time view of versions per service
      • An easy way to add an alert on inconsistent versions
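The inconsistency check behind that alert can be sketched as follows (a Python illustration; the input shape is an assumption):

```python
def inconsistent_services(deployed):
    """Find services running more than one version in production.

    `deployed` maps service name -> {host: version}. Returns the
    services whose hosts disagree, i.e. candidates for an alert.
    """
    return sorted(
        svc for svc, hosts in deployed.items()
        if len(set(hosts.values())) > 1
    )
```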
  29. Case #10: Tracking it all
      Lack of visibility into deployments and failed builds
  30. The Solution #10
      • Add a Slack integration that includes all relevant information (version, servers, instances, user)
      • Send deployment events to Graphite and combine them with relevant dashboards
      • Send build & deployment logs to a central log system
      • An event system which graphically presents important events (deployments, heals, autoscale, etc.)
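The Graphite integration can be sketched with Graphite's plaintext protocol, one line per deployment event (a Python sketch; the metric naming scheme is illustrative, not the one actually in use):

```python
import time

def graphite_deploy_event(service, version, ts=None):
    """Format a deployment event as a Graphite plaintext-protocol line:
    `<metric.path> <value> <timestamp>\\n`, sent over TCP to Carbon.

    Dots in the version are replaced so they don't create extra
    metric-path levels; the `deploys.` prefix is an assumption.
    """
    ts = int(ts if ts is not None else time.time())
    return "deploys.%s.%s 1 %d\n" % (service, version.replace(".", "_"), ts)
```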
  33. ariel@appsflyer.com