Slide 1

Slide 1 text

From Git Pull’s to K8S
Capillary’s Journey to Zero Touch Deployments
Anshul Sao & Piyush Goel
SaaS @ Scale, July 2020

Slide 2

Slide 2 text

About Capillary
● Leading SaaS Platform for Omni-Channel Customer Engagement, Commerce, Analytics for Retail
● Product Portfolio
  - Loyalty+
  - SmartStore+
  - Anywhere Commerce+
  - Insights+
  - Engage+
● 450 Million consumers on the platform
● 35K stores powered
● 650 employees worldwide
● 400+ brands
● 30+ countries
● 14 offices
● Locations: Singapore, China, India, Indonesia, South Africa, Malaysia, UAE, KSA, Thailand

Slide 3

Slide 3 text

About Capillary - Tech Stack & Scale
● Fully Multitenant Architecture
● 100+ Microservices
● 5 Global Deployments
● 50 Master data shards
● 100-node Spark clusters
● 1000+ servers in AWS regions
● 20 TB ETL runs daily
● 450 million users and counting
● Billions of Transactions processed Annually
● Polyglot Data sources - MySQL, Mongo, Dynamo, HDFS, Parquet

Slide 4

Slide 4 text

The “Before Christ” Era

Slide 5

Slide 5 text

The “Before Christ” Era
● Product ran on 4 servers (3 app, 1 db)
● PHP Monolith + 2 Java Apps
● Code on prod vs Code on local box
● Manual syncs and SVN pulls
● Manual Verification and QA
● Life was simple!

Slide 6

Slide 6 text

Initial Growth

Slide 7

Slide 7 text

Initial Growth
● 10-15 servers (8-10 app, 4 dbs)
  ○ PHP Monolith + 4 Java Services
● Service Oriented Architecture.
  ○ Static Discovery
● Git with SVN flavour.
● Git Pulls + RSyncs still rule!
  ○ Tag based deployments
  ○ Should have automated!

Slide 8

Slide 8 text

Enter the VCs - Series A

Slide 9

Slide 9 text

Series A - 2012

Slide 10

Slide 10 text

Series A - 2012
● Launched second cluster - ap-southeast-1
● ~40 servers (30 app servers, 10 DBs)
  ○ PHP Monolith + 10 Java services.
  ○ Discovery via hard-coded ELBs
  ○ No Rolling deployments
● Automation Round - 1
  ○ Python Fabric to the rescue!
  ○ Tag instances for discovery
    ■ PHP - Fabric scripts pull Git tags on app servers and reload.
    ■ Java - Scripts pull JARs from the custom Maven repos and restart the JVM.
● Rollbacks were painful... Argh!!

Slide 11

Slide 11 text

Growth Continues - 2013 Q1

Slide 12

Slide 12 text

Growth Continues - 2013 Q1
● Entered the Middle East, Africa & Australia in Q1!
● Launched another cluster in eu-west-1
● 100+ servers (~80 apps, 20 dbs)
● Service Discovery
  ○ ZooKeeper - Exhibitor
  ○ Apache Curator!
  ○ Deployment Order and Dependencies made easy!
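
The slide says service discovery made deployment order and dependencies easy. One way to picture that is a topological sort over a dependency map. The sketch below is only an illustration: the service names and the dependency map are hypothetical, and it uses Python's standard `graphlib` rather than ZooKeeper or Curator.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical services; each one maps to the services it depends on.
DEPENDENCIES = {
    "php-monolith": {"loyalty-service", "campaign-service"},
    "campaign-service": {"user-service"},
    "loyalty-service": {"user-service"},
    "user-service": set(),
}


def deployment_order(dependencies):
    """Return services ordered so that dependencies come before dependents."""
    return list(TopologicalSorter(dependencies).static_order())


order = deployment_order(DEPENDENCIES)
```

With this map, `user-service` is deployed first and `php-monolith` last; the relative order of the two middle services is not fixed.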

Slide 13

Slide 13 text

Growth Continues - 2013 Q1
● Post Release Monitoring became a problem.
● Observability wasn’t cool, yet!
  ○ Logs - 400GB per day
    ■ Elasticsearch: too heavy to operate for a 2-member devops team.
    ■ Log Streaming (Apache Flume) + Alerting Framework (Rule Engine) + MongoDB (Storage)
    ■ Hive jobs for log processing and metric aggregation
    ■ Splunk did exist!
  ○ Metrics
    ■ Custom implementation of a Time-Series store on MySQL
    ■ Google Charts for visualisation.
    ■ Graphite did exist!
● Re-invented the wheel, unnecessarily!

Slide 14

Slide 14 text

Entered US - 2013 Q2

Slide 15

Slide 15 text

Entered US - 2013 Q2
● Yet another cluster - us-west-2
● What are we looking at?
  ○ 150 servers (~120 apps, 30 DBs)
  ○ PHP Monolith + 15 Java Services.
    ■ Too many tags!
    ■ Different Versions.
  ○ Databases
    ■ 100 schemas - 1600 tables (25 schemas, 400+ tables, 4 regions)
    ■ Inconsistent schemas.
  ○ Too many releases to babysit
    ■ Devops team going crazy!

Slide 16

Slide 16 text

2013 Q3 : Automation Round - 2
● Deployments Automation
  ○ Took Inspiration from Yahoo! Days - YPM / Igor for the win!
    ■ Move to self-contained bundles - Debian packages.
    ■ Automated Release Distributions via Jenkins - Testing, Staging, Production
    ■ Templatize Server States.
    ■ Easy to deploy & rollback (up to 3 versions).
    ■ Pre-install and Post-install steps allow seamless deployments & restarts.
  ○ Deployment times reduced by 75%.
  ○ Did someone say Containers?
    ■ Meh.. Too early for us!
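
The "rollback (up to 3 versions)" point can be sketched as a bounded release history. This is a minimal illustration, not the actual packaging tooling: the `ReleaseHistory` class and the version strings are hypothetical, and the limit of 3 is the only detail taken from the slide.

```python
from collections import deque


class ReleaseHistory:
    """Track the current version plus a bounded list of earlier ones (hypothetical helper)."""

    def __init__(self, keep=3):
        self.current = None
        # A deque with maxlen silently discards the oldest entry when full.
        self.previous = deque(maxlen=keep)

    def deploy(self, version):
        """Make `version` current and remember the one it replaces."""
        if self.current is not None:
            self.previous.append(self.current)
        self.current = version

    def rollback(self):
        """Return to the most recent retained version."""
        if not self.previous:
            raise RuntimeError("no earlier version retained")
        self.current = self.previous.pop()
        return self.current
```

After five deploys only the three versions before the current one are still reachable, so a fourth consecutive rollback fails.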

Slide 17

Slide 17 text

2013 Q3 : Automation Round - 2
● Databases need Deployments - duh!!
  ○ Need version control for DDLs.
  ○ Enter - DBDeploy
  ○ Customized Wrapper on top.
    ■ Reduced inconsistencies significantly.
    ■ Devops & DBAs were happy!
● Only devs to blame now.
● Monitoring & Logs
  ○ Home grown tools still holding strong!
● No more problems - yay!!
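
Version control for DDLs usually means numbered change scripts plus a table on the database that records which ones have run. The sketch below illustrates that idea only. It is not DBDeploy or the custom wrapper: the delta scripts, the `changelog` table layout, and the `apply_pending` helper are assumptions, and it uses in-process SQLite instead of MySQL so that it runs standalone.

```python
import sqlite3

# Hypothetical numbered delta scripts, standing in for versioned DDL files.
DELTAS = {
    1: "CREATE TABLE customers (id INTEGER PRIMARY KEY)",
    2: "ALTER TABLE customers ADD COLUMN email TEXT",
}


def apply_pending(conn, deltas):
    """Apply deltas not yet recorded in the changelog table; return their numbers."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS changelog (change_number INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT change_number FROM changelog")}
    newly_applied = []
    for number in sorted(deltas):
        if number in applied:
            continue  # already run on this database instance
        conn.execute(deltas[number])
        conn.execute("INSERT INTO changelog (change_number) VALUES (?)", (number,))
        newly_applied.append(number)
    conn.commit()
    return newly_applied
```

Running `apply_pending` twice against the same connection applies both deltas the first time and nothing the second time, which is what keeps clusters from drifting apart.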

Slide 18

Slide 18 text

2014 Q2 - New Products & Verticals!
● Hypermarkets - Data explosion
● Key tables go beyond 500M records each.
● Core Entities are transactional by nature - MySQL is the king!
● Sharded the DB and the Services Layers
● Home grown implementation.
● Vitess wasn’t widely popular, yet!
● Multiple copies of the schema in the same cluster
● DBDeploy is still going strong - Maintains state on the db instance.
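
The slide does not say how the home-grown sharding picked a shard. A common approach, shown here purely as an assumption, is to hash a tenant identifier and take it modulo the shard count. The `shard_for` function and the choice of tenant id as the shard key are hypothetical.

```python
import hashlib


def shard_for(tenant_id, shard_count):
    """Map a tenant id to a shard index in [0, shard_count) using a stable hash."""
    # hashlib is used instead of hash() so the result is the same in every process.
    digest = hashlib.sha256(str(tenant_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Modulo hashing is simple but moves most tenants when the shard count changes; a lookup table from tenant to shard avoids that at the cost of one more piece of state to manage.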

Slide 19

Slide 19 text

Lessons Learned So Far!
● Automate Deployment Workflows Early regardless of company stage
● Homegrown tools can lead to Confirmation Bias!
● Deployment troubles grow exponentially as you add more clusters and microservices.
● Schema Management should be a part of Deployment Workflows!

Slide 20

Slide 20 text

Growth Continues - 2015

Slide 21

Slide 21 text

Growth Continues - 2015
● 4 clusters
● 500+ servers
● 325 apps + 75 DBs
● 100 devs + 30 QAs
● PHP Monolith + 25 Java Services in each cluster.
● Package Based Deployments
● Home Grown Tools for post-deployment monitoring & alerts.
● 15 master data shards across clusters.
● DbDeploy for Schema management.
● PHP servers were not scaling: unpredictable loads, underutilization of infra resources, and ticket-based scale up and down through DevOps.

Slide 22

Slide 22 text

Gitflow, CI and Rundeck

Slide 23

Slide 23 text

No Late night releases!
● Release Management was a pain ‘again’.
● Branching!! Too many branches to manage and too much code to merge.
● Number of microservices exploded.
● Which commits should be merged? Cherry picking and manual merges.
● Stay back to release, keep it safe!
● Gitflow branching model adopted.
● Jenkins builds and pushes artifacts to Debian repos. Promotion of packages for full QA control.
● Rundeck and rolling releases! Takes care of taking a server offline, releasing, and moving on to the next.
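
The rolling release loop described on this slide (take a server offline, release, move on) can be sketched as below. This is an illustration only, not Rundeck's API: the four callbacks are hypothetical hooks the caller would supply, and the health check step is an added assumption.

```python
def rolling_release(servers, take_offline, release, health_check, put_online):
    """Release to one server at a time; stop at the first unhealthy server.

    Returns (servers released successfully, failed server or None).
    """
    done = []
    for server in servers:
        take_offline(server)   # e.g. remove from the load balancer
        release(server)        # e.g. install the new package
        if not health_check(server):
            # Leave the failed server out of rotation and do not touch the rest.
            return done, server
        put_online(server)
        done.append(server)
    return done, None
```

Stopping at the first failure keeps the remaining servers on the old version, so capacity is reduced by at most one server while the problem is investigated.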

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Growth Continues - 2016

Slide 26

Slide 26 text

Docker and Kubernetes

Slide 27

Slide 27 text

No prior notice for End of Season Sale
● Managing multiple clusters, serving different geographies
● Server estimation, upscaling and downscaling were manual tasks, with Devs and Devops fulfilling requests through tickets.
● A lot of wastage, as the cluster size stayed constant even during non-peak hours.
● Increased downtimes in case of performance bugs.
● Debian to Docker
● Gitflow along with Docker images (the same image is still not promoted from environment to environment. Limitation!!)
● ECR for repository.
● Migrated all config files to env variables.
● Created Capillary custom CI.
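
Moving config files to environment variables means the application reads its settings from the process environment at startup. A minimal sketch, assuming hypothetical variable names (`DB_HOST`, `DB_PORT`, `DEBUG`) that do not come from the talk:

```python
import os


def load_config(environ=None):
    """Build a config dict from environment variables (names are hypothetical)."""
    env = os.environ if environ is None else environ
    return {
        "db_host": env["DB_HOST"],                          # required: KeyError if missing
        "db_port": int(env.get("DB_PORT", "3306")),         # optional with a default
        "debug": env.get("DEBUG", "false").lower() == "true",
    }
```

Passing a plain dict as `environ` makes the function easy to test, and the same container image can then run in any environment with only the variables changed.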

Slide 28

Slide 28 text

Deployer (Capillary CI)

Slide 29

Slide 29 text

Jenkins is good. We are too opinionated.
● How to manage Kubernetes deployments?
● Should developers have kubectl access?
● Every deployment meant different env vars. Makefiles are so old school and unmanageable.
● Build selection and updating in YAML is a pain.
● Created a Helm based build & deployment system
● Build selection is UI driven, with SSO and access control
● HPA configs can be easily defined with CPU thresholds in UI.
● Deployment specific environment variables management in UI
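
For context on what a CPU threshold in an HPA config does: the Kubernetes documentation gives the core calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below implements only that formula plus min/max clamping. It leaves out the tolerance band and stabilization behaviour of the real autoscaler, and the function name and parameters are illustrative.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas, max_replicas):
    """Core HPA scaling formula, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))
```

For example, 4 replicas running at 90% CPU against a 60% target scale to 6 replicas, as long as 6 is within the configured bounds.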

Slide 30

Slide 30 text

YAML Fatigue is Real!

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Replacing DBDeploy

Slide 39

Slide 39 text

So Long!
● Mutations are tracked and version controlled, but status in each cluster can vary.
● A developer can write a bad query or a bad undo query, wreaking havoc.
● Versions can grow really fast, and it can be overwhelming to comprehend the final state.
● How to avoid alters on big tables?
● Track final version rather than mutations
● Limit permissible operations, no drops.
● Schema diff to find the transformation to get to the final state. More predictable!
● Types of data
  ○ schema
  ○ database_view
  ○ seed_data

Slide 40

Slide 40 text

State vs Transition Management
● Manage Transitions
  ○ Used in current db migrate
  ○ Each change is versioned
  ○ CREATE TABLE tbl1( col1 int PRIMARY KEY );
  ○ ALTER TABLE tbl1 ADD COLUMN col2 int;
● Manage States
  ○ To be used in Capillary Cloud
  ○ States are versioned
  ○ CREATE TABLE tbl1( col1 int PRIMARY KEY );
  ○ CREATE TABLE tbl1( col1 int PRIMARY KEY, col2 int );
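
In the state-based approach on this slide, the tool rather than the developer derives the ALTER statement by comparing the current table with the desired one. A minimal sketch of such a diff, assuming tables are represented as hypothetical column-name-to-type dicts and that drops are refused, as the earlier slide suggests:

```python
def diff_table(table, current, desired):
    """Return ALTER statements that move `current` columns to `desired` columns.

    `current` and `desired` map column names to type strings (a simplification).
    """
    dropped = set(current) - set(desired)
    if dropped:
        # Mirrors the "no drops" rule: refuse instead of generating DROP COLUMN.
        raise ValueError(f"drops are not permitted: {sorted(dropped)}")
    statements = []
    for column, column_type in desired.items():
        if column not in current:
            statements.append(f"ALTER TABLE {table} ADD COLUMN {column} {column_type};")
        elif current[column] != column_type:
            statements.append(f"ALTER TABLE {table} MODIFY COLUMN {column} {column_type};")
    return statements
```

Calling `diff_table("tbl1", {"col1": "int"}, {"col1": "int", "col2": "int"})` yields the same `ALTER TABLE tbl1 ADD COLUMN col2 int;` statement that the transition-based approach has the developer write by hand.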

Slide 41

Slide 41 text

Capillary Cloud - Idea

Slide 42

Slide 42 text

The Uber Orchestrator
● Single reusable definition of the stack, which can be easily launched and managed
  ○ Opinionated JSON Files to define everything in stack.
● Applications Service discovery is redundant
  ○ Kubernetes Namespaces
● Manage Application access to DB
  ○ TF providers to fulfill application requests with access control.
● How to avoid messy nginx configurations and Domain SSL management
  ○ Application Definitions to contain ingress rules.
  ○ TF Providers to manage domains.
● How to reference dynamic cloud objects without hardcoding or maintaining region specific configs
  ○ DSL to refer any entity in stack.
● Monitoring, APM and Alerts
  ○ Reusable hardened TF Modules to achieve high observability.
  ○ Prometheus!
● What about Schema Management
  ○ A new kind of Schema sync which compares end states and does schema diff to get mutations
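
The slide does not show the DSL itself. To make the idea of referring to any entity in the stack concrete, the sketch below resolves placeholders inside a JSON-like application definition. The `${a.b.c}` syntax, the stack layout, and the function names are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical reference syntax: ${dotted.path.into.the.stack}
REFERENCE = re.compile(r"\$\{([A-Za-z0-9_.-]+)\}")


def lookup(stack, path):
    """Follow a dotted path such as 'mysql.orders.host' through nested dicts."""
    node = stack
    for key in path.split("."):
        node = node[key]
    return node


def resolve(value, stack):
    """Recursively replace every reference in a definition with its stack value."""
    if isinstance(value, dict):
        return {k: resolve(v, stack) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, stack) for v in value]
    if isinstance(value, str):
        return REFERENCE.sub(lambda m: str(lookup(stack, m.group(1))), value)
    return value
```

With a definition such as `{"env": {"DB_URL": "mysql://${mysql.orders.host}:${mysql.orders.port}/orders"}}`, the application file never hardcodes a host name, so the same definition works in every region.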

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

A Typical deployment

Slide 45

Slide 45 text

Automated Rollbacks
Pipeline for automated validation of builds

Slide 46

Slide 46 text

Capillary 2020

Slide 47

Slide 47 text

Learnings
● Late adoption of containers because of complex infra and priority constraints.
  ○ Make room for deployment debts along with business growth and priorities
● We tried Deis for deployments, which had limited community and support, and wasted months of effort.
  ○ Always choose external dependencies with good community support.
● Trying to make everything generic consumes a lot of time with limited immediate benefits
  ○ It’s ok to be opinionated, to move fast and to align with internal company development processes.
● Manual interventions / steps will cause problems in the long run
  ○ Automate everything

Slide 48

Slide 48 text

Q & A

Slide 49

Slide 49 text

Co-ordinates
● Anshul Sao ([email protected])
● Piyush Goel ([email protected], @pigol1)