AppsFlyer
September 21, 2016

10 Real problems & solutions in your build and deploy process

Ariel Moskovich lecture at Reversim 2016

Transcript

  1. About AppsFlyer

    • Mobile analytics & attribution
    • 11 offices worldwide
    • Vast majority of mobile platforms supported
    • More than 8B events per day
    • Over 2K integrated partners
    • Cloud based (AWS, GCP)
  2. Build & Deploy tools

    [Diagram; labels: Commits, Deploys, Building projects, Building Docker images, Communication, Image repository, Machines running code, Configuration / state, Versions storage, Deployment system, Registry cache, SVC]
  3. Case #1: Deployment failures

    1) Health check timeouts
    2) Machine list mismatch
    3) Wrong Java version
    4) Incorrect startup parameters
    5) Wrong timing for LB health check
    6) Port in use
  4. The Solution #1

    1) Tune start time; validate that the timeout is not too short
    2) Update the machine list dynamically; add mismatch alerts and auto-correction
    3) Keep the Java version on Jenkins identical to the one on the instance/container
    4) Set reasonable defaults; guide new team members
    5) Sync the LB health check with the deployment state
    6) Dynamic port allocation / validation
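
Items 1 and 6 above can be sketched as two small pre-deployment helpers. This is a minimal illustration, not the tooling described in the talk: the function names `port_is_free` and `wait_until_healthy`, and the idea of passing the health probe in as a callable, are assumptions made for the example.

```python
import socket
import time
from typing import Callable


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if (host, port) can be bound, i.e. no other process holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def wait_until_healthy(check: Callable[[], bool],
                       timeout_s: float,
                       interval_s: float = 0.5) -> bool:
    """Poll a health probe until it passes or the timeout expires.

    `check` is a hypothetical callable that returns True once the service is up;
    `timeout_s` should be tuned to the service's real start time.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A deployment script could refuse to start a service when `port_is_free` returns False, and mark the deployment as failed only after `wait_until_healthy` gives up.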
  5. Case #2: Docker issues

    1) Hostname character limit
    2) JVM escape
    3) Image is corrupted in the registry
    4) Conntrack table limits
  6. The Solution #2

    1) Make sure the hostname does not exceed 64 chars
    2) Upgrade to a newer Docker version
    3) Let the client fail over automatically to another registry; otherwise push a fake commit to recreate the image
    4) Increase the conntrack table size & file descriptor limits
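
The hostname limit in item 1 is easy to enforce before a container is ever started. A minimal sketch, assuming a hypothetical `service-mode-id` naming scheme (the talk does not say how hostnames are built):

```python
MAX_HOSTNAME_LEN = 64  # limit quoted on the slide


def container_hostname(service: str, mode: str, container_id: str) -> str:
    """Build a hostname from hypothetical name parts and reject it if too long."""
    hostname = f"{service}-{mode}-{container_id}"
    if len(hostname) > MAX_HOSTNAME_LEN:
        raise ValueError(
            f"hostname is {len(hostname)} chars, limit is {MAX_HOSTNAME_LEN}: {hostname}"
        )
    return hostname
```

Failing at build or deploy time gives a clear error instead of a container that refuses to start.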
  7. Case #3: (NOT) Losing traffic while deploying

    1) Losing traffic when deploying to a server behind a load balancer
    2) Losing traffic when stopping a container
  8. The Solution #3

    1) Connection draining: remove the server from the load balancer and wait for x seconds before the deployment. The drain time is set in Consul per service; the default is 30 seconds
    2) Graceful shutdown: allow several seconds before killing the container, and capture SIGTERM to flush in-process messages to an external DB or queue
       Ex: POST /containers/e90e34656806/stop?t=5 HTTP/1.1
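
On the application side, the graceful shutdown in item 2 means handling SIGTERM within the grace period given to the stop request. The sketch below shows one way to do this in Python; the in-memory queue, the `sink` callable and all function names are assumptions for illustration, not the services described in the talk.

```python
import queue
import signal
import threading
from typing import Callable

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame) -> None:
    """Signal handler: only set a flag; the real work happens in the main loop."""
    shutdown_requested.set()


def install_sigterm_handler() -> None:
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def flush_pending(pending: "queue.Queue", sink: Callable[[object], None]) -> int:
    """Hand every buffered message to `sink` (e.g. an external DB or queue writer)."""
    flushed = 0
    while True:
        try:
            message = pending.get_nowait()
        except queue.Empty:
            return flushed
        sink(message)
        flushed += 1


def run_worker(pending: "queue.Queue", sink: Callable[[object], None],
               poll_s: float = 0.1) -> int:
    """Run until SIGTERM arrives, then flush what is still buffered."""
    while not shutdown_requested.is_set():
        shutdown_requested.wait(poll_s)  # placeholder for the real request handling
    return flush_pending(pending, sink)
```

The flush has to finish within the stop timeout (`t=5` in the example request above), because the container is killed once that time is up.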
  9. The Solution #4

    We added an option to build & deploy from a branch in our deployment system. The branch flow includes several steps:
    1) Save the current configuration (Jenkins)
    2) Update the new branch & revision
    3) Initiate Jenkins build
    4) Revert configuration state
    5) Detect when image is available
    6) Enable deployment
    ** Alternatively you can create a new Jenkins configuration for each branch and clean up later
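
Steps 1 to 4 follow a save, modify, build, restore pattern, which maps naturally onto a context manager. The sketch below models the job configuration as a plain dict; it does not call Jenkins, and the keys `branch` and `revision` are hypothetical.

```python
from contextlib import contextmanager
from copy import deepcopy
from typing import Iterator


@contextmanager
def temporary_branch(job_config: dict, branch: str, revision: str) -> Iterator[dict]:
    """Point a job configuration at a branch for one build, then restore it."""
    saved = deepcopy(job_config)        # step 1: save the current configuration
    job_config["branch"] = branch       # step 2: update branch & revision
    job_config["revision"] = revision
    try:
        yield job_config                # step 3: caller triggers the build here
    finally:
        job_config.clear()              # step 4: revert, even if the build failed
        job_config.update(saved)
```

Putting the revert in `finally` keeps a failed build from leaving the default configuration pointed at the branch.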
  10. The Solution #4

    A few notes:
    • Maintain separate KV for default and branch
    • Provide an option to build either from “scratch” or from a “base image”
    • Regularly back up Jenkins configurations
    • Send build failures to Slack
  11. The Solution #5

    Build a self-service UI for adding or editing:
    • Services
    • Modes
    • Autoscale
    • Healer
    • Spot instances
    • Alerts
  12. Case #6: Slow build time

    The time between pushing code and deployment readiness grows over time
  13. The Solution #6

    • Create base images, according to service type
    • Increase the number of Jenkins slaves
    • Migrate slaves to new-generation CPU instances
    • Split compilation and image build to tune the workload
    • Redesign the registry to improve image push time
    • Add proxies (jars, npm, etc.)
  14. Case #7: Distributing Docker registry

    A single Docker registry in active-passive mode becomes a bottleneck when building and deploying several dozen services and instances simultaneously
  15. The Solution #7

    Distributed, sharded registry with a replication factor (RF), high availability, rack awareness and automatic recovery
    Example: scenario of 3 registries and an RF of 2:
    • Each service/mode is served by 2 registries
    • Pairs are distributed evenly between modes
    • Metadata is saved in Consul KV
    • All images are uploaded to S3, so reseeding a registry is easy
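
To make the 3-registry, RF-2 example concrete, here is one simple way to produce evenly distributed pairs. The round-robin placement and the name `assign_registries` are assumptions; the talk does not describe the actual placement algorithm, and rack awareness is left out.

```python
from typing import Dict, Iterable, List


def assign_registries(service_modes: Iterable[str],
                      registries: List[str],
                      rf: int = 2) -> Dict[str, List[str]]:
    """Give each service/mode `rf` distinct registries, rotating the starting one."""
    n = len(registries)
    if not 1 <= rf <= n:
        raise ValueError("replication factor must be between 1 and the registry count")
    return {
        key: [registries[(i + k) % n] for k in range(rf)]
        for i, key in enumerate(sorted(service_modes))
    }
```

With three registries this yields the pairs (r1, r2), (r2, r3) and (r3, r1) in turn. Because adding a service/mode shifts the rotation, the resulting mapping would need to be stored rather than recomputed, which fits the slide's note about keeping metadata in Consul KV.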
  16. The Solution #8

    This is where Docker shines. We clean:
    • Stopped containers (exited status)
    • Old images that are no longer in use
    • Registry
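
The "old images that are not in use" rule can be expressed as a small filter. This is a sketch under assumptions: the record shape (`id`, `created`), the 14-day default and the function name are invented for illustration, and the actual removal call is left to whatever Docker client is in use.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Set


def images_to_remove(images: Iterable[dict],
                     in_use_ids: Set[str],
                     now: datetime,
                     max_age_days: int = 14) -> List[str]:
    """Return ids of images older than the cutoff that no container references."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        image["id"]
        for image in images
        if image["id"] not in in_use_ids and image["created"] < cutoff
    ]
```

Checking both conditions keeps the cleanup from deleting an old image that a long-running container still depends on.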
  17. Case #9: Detecting version inconsistency

    Different versions of the same service deployed in production (unintentionally)
  18. The Solution #9

    • A graphical, near real-time view of versions per service
    • An easy way to add an alert on inconsistent versions
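
The check behind such an alert is a group-by on service name. A minimal sketch, assuming deployments are available as hypothetical `(service, host, version)` tuples:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_inconsistent_services(
    deployments: Iterable[Tuple[str, str, str]]
) -> Dict[str, List[str]]:
    """Map each service running more than one version to its sorted version list."""
    versions_by_service = defaultdict(set)
    for service, _host, version in deployments:
        versions_by_service[service].add(version)
    return {
        service: sorted(versions)
        for service, versions in versions_by_service.items()
        if len(versions) > 1
    }
```

An empty result means every service runs a single version. Intentional mixed-version rollouts (for example canaries) would need an allow-list on top of this.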
  19. The Solution #10

    • Add a Slack integration that includes all relevant information (version, servers, instances, user)
    • Send deployment events to Graphite and combine them with relevant dashboards
    • Send build & deployment logs to a central log system
    • Event system that graphically presents important events (deployments, heals, autoscale, etc.)
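
Sending a deployment event to Graphite can be as small as one line over its plaintext protocol, which takes `<metric path> <value> <timestamp>` terminated by a newline (port 2003 by default). The metric path layout and function names below are assumptions for illustration.

```python
import socket
import time
from typing import Optional


def format_event(path: str, value: int = 1, timestamp: Optional[float] = None) -> str:
    """Build one Graphite plaintext line: '<path> <value> <unix timestamp>\\n'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_event(line: str, host: str, port: int = 2003, timeout_s: float = 2.0) -> None:
    """Send a formatted line to a Graphite (carbon) plaintext listener."""
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.sendall(line.encode("ascii"))
```

A deployment system could call `send_event(format_event("deploys.my-service.prod"), host)` after each rollout, where the host and metric path are placeholders, so the event can be overlaid on the service's dashboards.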