Pragmatic Micro-services for Organisational Scalability

Pragmatic Micro-Services at FashionTrade
Docker Meetup, January 2017

About Friso van Vollenhoven
- Mostly worked in software dev and related roles.
- Former CTO at a (big) data analytics and machine learning company.
- Now CTO at FashionTrade.
- I am the proud owner of a three character Twitter handle: @fzk.
- I have 18 endorsements for Awesomeness on LinkedIn.

About FashionTrade
- A B2B platform for fashion wholesale.
- Fashion brands and retailers can connect and do business.
- E-commerce for (fashion) businesses.
- Tagline: “We simplify wholesale so you can Connect, Trade & Grow.”

The life of a product
- Product information enters the brand integration API
  - Validation
- Product information is merged with existing data that applies
  - Price lists, stock levels, existing images, etc.
- Product enters search engine
  - Only if complete (i.e. it has a known price, availability, etc.)
- Product information is used for orders, confirmations, etc.
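
The merge and completeness steps above could be sketched roughly as follows. This is a minimal illustration, not FashionTrade's implementation: the field names, the "None means unknown" convention and both function names are assumptions made for the example.

```python
from typing import Any, Dict

# Hypothetical field names; the slide only mentions price and availability.
REQUIRED_FOR_SEARCH = ("price", "availability")


def merge_product(incoming: Dict[str, Any], existing: Dict[str, Any]) -> Dict[str, Any]:
    """Merge newly received product data over what is already known."""
    merged = dict(existing)
    # Only overwrite with values that are actually present in the update.
    merged.update({k: v for k, v in incoming.items() if v is not None})
    return merged


def is_complete(product: Dict[str, Any]) -> bool:
    """A product is searchable only when every required field has a value."""
    return all(product.get(field) is not None for field in REQUIRED_FOR_SEARCH)


existing = {"price": 49.95, "images": ["front.jpg"]}
incoming = {"sku": "TS-001", "name": "T-shirt", "availability": None}

product = merge_product(incoming, existing)
print(is_complete(product))  # availability is still unknown at this point

product = merge_product({"availability": 120}, product)
print(is_complete(product))  # a stock level has arrived
```

Keeping the completeness check separate from the merge means a product can sit in the store indefinitely and only becomes searchable once the last missing piece arrives.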

The life of a product

Why services
- Separates business concerns
- Allows people to work on many things concurrently
- Works well with log based data architecture
- Ideally, makes for organisational scalability at the cost of added complexity in delivery

Example

Brand integration API service
- Service written in Java
- External facing
- Requires authentication
- Has a health check (required)
- Has an admin endpoint for internal querying

Dockerfile

    FROM eu.gcr.io/ft-main/jre:8
    EXPOSE 8080 8081
    LABEL \
      meta.attributes.id="pim-integration" \
      meta.attributes.type="service" \
      meta.attributes.team="Developers" \
      meta.description="The FashionTrade PIM integration service." \
      meta.checks.health.endpoint="/healthcheck" \
      meta.checks.health.port="8081" \
      meta.ports.http.service="80" \
      meta.ports.http.container="8080" \
      meta.ports.admin.service="8081" \
      meta.ports.admin.container="8081" \
      meta.routing.gateway.mapping.path-segment="pim" \
      meta.routing.gateway.mapping.dns-prefix="api"
    ENV \
      NAMESPACE="ft-prod" \
      KAFKA_BOOTSTRAP_SERVERS="kafka:9092"
    ENTRYPOINT ["java", "-jar", "pim-integration-service.jar", "server", "config.yml"]
    COPY config.yml config.yml
    COPY target/pim-integration-service-all.jar pim-integration-service.jar

Dockerfile, deployment information

    LABEL \
      meta.attributes.id="pim-integration" \
      meta.attributes.type="service" \
      meta.attributes.team="Developers" \
      meta.description="The FashionTrade PIM integration service." \
      meta.checks.health.endpoint="/healthcheck" \
      meta.checks.health.port="8081" \
      meta.ports.http.service="80" \
      meta.ports.http.container="8080" \
      meta.ports.admin.service="8081" \
      meta.ports.admin.container="8081" \
      meta.routing.gateway.mapping.path-segment="pim" \
      meta.routing.gateway.mapping.dns-prefix="api"

Jenkins seed job

    // Snippet from seed_job.groovy
    // ...
    // ============================================================================
    // Services
    // ============================================================================
    /** The master list of all services that should be built and deployed. */
    def services = [
        'api-docs-service'        : Builder.Bash,
        'app-service'             : Builder.Npm,
        'brand-service'           : Builder.Maven,
        'canary-service'          : Builder.Maven,
        'connection-service'      : Builder.Maven,
        'gatekit'                 : Builder.Maven,
        'image-service'           : Builder.Maven,
        'login-service'           : Builder.Npm,
        'order-service'           : Builder.Maven,
        'pim-integration-service' : Builder.Maven,
        'product-search-service'  : Builder.Maven,
        'product-service'         : Builder.Maven,
        'retailer-service'        : Builder.Maven,
        'user-service'            : Builder.Maven
    ]
    // ...

What happens?
- Build
- Deploy
- Update routing state
- Monitor
- (Scale)

Build
- Jenkins creates a build pipeline from a seed job (new repos are added manually)
- Pipelines for dev branches terminate at build
- Pipeline for master includes a deploy step
  - Deploy currently goes to a dev environment
  - Manually push to production
  - Will automate when we have better integration testing in place
- As a startup you don’t always have time for everything you want to do ...
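
The branch rule above (dev branches stop at build, master also deploys to the dev environment) can be written down in a few lines. This is only a sketch of the rule: the stage names and the `pipeline_stages` function are made up for illustration, and the real pipelines are generated by the Groovy seed job rather than by Python.

```python
from typing import Dict, List

# A subset of the services map from the seed job; builder names as on the slide.
SERVICES: Dict[str, str] = {
    "api-docs-service": "Bash",
    "app-service": "Npm",
    "brand-service": "Maven",
}


def pipeline_stages(service: str, branch: str) -> List[str]:
    """Return the (hypothetical) stage names for one service and branch."""
    builder = SERVICES[service]
    stages = ["checkout", f"build ({builder})"]
    # Only master gets a deploy step, and it targets the dev environment.
    if branch == "master":
        stages.append("deploy to dev")
    return stages


print(pipeline_stages("brand-service", "feature/new-endpoint"))
print(pipeline_stages("brand-service", "master"))
```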

Deploy
- Metatron
  - Internal tool
  - Generates a Kubernetes manifest (YAML config file) from Docker labels
- Deploy step in Jenkins
  - Runs metatron
  - Applies manifest against target environment
- Environment specific configuration managed through Kubernetes secrets
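
Metatron is an internal tool and its code is not shown in the deck, so the following is only a guess at the general idea: read the `meta.*` labels from an image and turn them into a Kubernetes Service manifest. The function name, the `app` selector convention and the choice to copy routing and health labels into annotations are all assumptions. The sketch prints JSON, which `kubectl apply -f` accepts as well as YAML, to avoid a YAML dependency.

```python
import json
from typing import Dict


def service_manifest(labels: Dict[str, str], namespace: str) -> dict:
    """Build a Kubernetes Service manifest from meta.* Docker labels."""
    name = labels["meta.attributes.id"]
    ports = []
    # Every meta.ports.<name>.service label is paired with a .container label.
    for key, value in sorted(labels.items()):
        parts = key.split(".")
        if parts[:2] == ["meta", "ports"] and parts[-1] == "service":
            port_name = parts[2]
            ports.append({
                "name": port_name,
                "port": int(value),
                "targetPort": int(labels[f"meta.ports.{port_name}.container"]),
            })
    # Assumption: routing and health labels travel along as annotations,
    # so that other tooling (such as a router) can read them from the cluster.
    annotations = {
        k: v for k, v in labels.items()
        if k.startswith(("meta.routing.", "meta.checks."))
    }
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace, "annotations": annotations},
        # Assumption: pods carry an "app" label equal to the service id.
        "spec": {"selector": {"app": name}, "ports": ports},
    }


labels = {
    "meta.attributes.id": "pim-integration",
    "meta.checks.health.endpoint": "/healthcheck",
    "meta.checks.health.port": "8081",
    "meta.ports.http.service": "80",
    "meta.ports.http.container": "8080",
    "meta.ports.admin.service": "8081",
    "meta.ports.admin.container": "8081",
    "meta.routing.gateway.mapping.path-segment": "pim",
}

print(json.dumps(service_manifest(labels, "ft-prod"), indent=2))
```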

Routing
- Service routing is essentially a reverse proxy that knows about all services
- Routing state depends on currently deployed and healthy services
  - Should not treat it as configuration
  - Hard to manage centrally and correctly
- Custom built service router: gatekit

Gatekit
- Custom service router for Kubernetes
- Routing state is runtime state based on deployed services; not static configuration
- Docker labels become service metadata in Kubernetes services
- Gatekit polls Kubernetes cluster for services, metadata and health status
- Current logic: at least one pod healthy == service healthy
- Gatekit provides later opportunity to solve A/B testing and canary deployments at the platform level
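
The health rule on this slide is simple enough to sketch. The code below is not gatekit; it only illustrates how a routing table could be rebuilt from polled state, with `ServiceState` and `build_routes` as invented names and the polling itself left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ServiceState:
    name: str
    path_segment: str       # from meta.routing.gateway.mapping.path-segment
    pod_health: List[bool]  # one entry per pod, True when its health check passes


def build_routes(services: List[ServiceState]) -> Dict[str, str]:
    """Map URL path prefixes to service names, skipping unhealthy services."""
    routes = {}
    for svc in services:
        # Rule from the slide: at least one healthy pod == service healthy.
        if any(svc.pod_health):
            routes["/" + svc.path_segment] = svc.name
    return routes


polled = [
    ServiceState("pim-integration", "pim", [True, False]),
    ServiceState("order-service", "orders", [False, False]),
    ServiceState("image-service", "images", []),  # deployed, but no pods yet
]

print(build_routes(polled))
```

Because the table is recomputed from what the cluster reports on every poll, a service that loses its last healthy pod simply drops out of the routes without anyone editing configuration.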

The bigger picture

Deployment abstraction through Dockerfile
- Dockerfile / image + labels are the lingua franca of our deployments
  - I.e. development delivers containers, which go to production automatically
- Is that DevOps?
  - Not everybody always knows the entire stack
- But would that scale?
  - There is a tradeoff between complete understanding of all the moving parts and the speed of onboarding before productivity.

A word on: Monitoring
- Most systems are pull based
  - Need to install and configure agent to read the necessary data
  - Pull based is complex in dynamic (service) environments
- Currently experimenting with Datadog (https://www.datadoghq.com/)
  - Services push to agent
  - Agent sends data upstream
  - Still learning

Datadog setup (artist impression)
- Datadog agent deployed using DaemonSets
- Services push metrics to the agent (found on service host name)
- Agent takes care of bringing data upstream
- SaaS solution for dashboard / alerts / etc.
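
To make the push model concrete, here is a small sketch of a service sending a counter to a local agent over UDP. It assumes the agent accepts StatsD datagrams with Datadog-style tags on port 8125 (the usual DogStatsD default); the metric name, the tag and the `localhost` default are placeholders, and a real service would more likely use a client library.

```python
import socket
from typing import Dict, Optional


def format_metric(name: str, value: float, metric_type: str = "c",
                  tags: Optional[Dict[str, str]] = None) -> str:
    """Format one metric as a StatsD datagram with Datadog-style tags."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


def push_metric(line: str, host: str = "localhost", port: int = 8125) -> None:
    """Fire-and-forget UDP send; the service never waits on the agent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# Hypothetical metric name and tag, for illustration only.
line = format_metric("pim.products.received", 1, "c", {"service": "pim-integration"})
print(line)
```

Calling `push_metric(line, host=...)` with the node's host name would send the datagram to the agent running on the same node. Since the send is UDP, a missing agent costs the service nothing, which is part of what makes push attractive when services come and go.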

A word on: Logging
- Logging is mostly push based (JVM logging uses Appenders)
- Currently using Stackdriver on GCP
  - Limited functionality
  - Poor out-of-the-box experience
- Moving to ELK
  - Using a hosted ELK SaaS provider (http://logz.io/)

A word on: Dependencies

A word on: Dependencies
Sync vs. Async
- It is mentioned that sync dependencies are just expensive method calls
  - Probably because it’s called RPC
- It’s not about sync vs. async
- It’s about schemas
- And being evolution friendly with your schemas

Schema evolution
- When adding a new field to an entity, it must be optional
- Removing fields from the schema can’t be done
  - But a producer can stop populating optional fields
- Readers / consumers / clients must have sensible handling of empty optionals
  - Usually default values
  - Sometimes different behaviour
- Whether the entity comes in over RPC or a queue is a different concern
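
On the reader side, these rules come down to tolerating missing optionals and fields the reader has never heard of. A minimal sketch with JSON messages follows; the `Product` fields and their defaults are invented for the example.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    sku: str                      # present since the first schema version
    name: str
    currency: str = "EUR"         # added later, so optional with a default
    season: Optional[str] = None  # added later; producers may stop sending it


def parse_product(raw: str) -> Product:
    """Parse a message, tolerating missing optionals and unknown fields."""
    data = json.loads(raw)
    known = {"sku", "name", "currency", "season"}
    # Drop fields this reader does not know yet, and explicit nulls,
    # so the dataclass defaults apply in both cases.
    fields = {k: v for k, v in data.items() if k in known and v is not None}
    return Product(**fields)


old_message = '{"sku": "TS-001", "name": "T-shirt"}'
new_message = '{"sku": "TS-002", "name": "Jeans", "season": "SS17", "colour": "blue"}'

print(parse_product(old_message))
print(parse_product(new_message))
```

The same parser works whether the message arrived over RPC or from a queue, which is the point of the last bullet: transport and schema handling are separate concerns.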

Side: diagramming made simple

    $ python render.py
    $ dot -Tpdf -Gratio='fill' -Gsize='11.7,8.3!' \
    > -Gmargin='0' /tmp/dependencies.gv -O
    $ open /tmp/dependencies.gv.pdf
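
The deck does not show `render.py` itself. A script in that spirit might look like the sketch below, which writes a Graphviz file to the path used by the `dot` command. The service names come from the seed job, but the edges between them are invented for illustration and say nothing about the real dependency graph.

```python
from typing import Dict, List


def to_dot(dependencies: Dict[str, List[str]]) -> str:
    """Render a service -> dependencies mapping as a Graphviz digraph."""
    lines = ["digraph dependencies {", "  rankdir=LR;"]
    for service in sorted(dependencies):
        lines.append(f'  "{service}";')
        for target in sorted(dependencies[service]):
            lines.append(f'  "{service}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical edges, for illustration only.
example = {
    "order-service": ["product-service", "user-service"],
    "product-search-service": ["product-service"],
}

if __name__ == "__main__":
    with open("/tmp/dependencies.gv", "w") as handle:
        handle.write(to_dot(example))
```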

Random Experiences and Learnings

GKE Ingress Magic

    $ kubectl get ingress gatekit -o yaml
    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      annotations:
        ingress.kubernetes.io/backends: '{"k8s-be-30535--cc30c14d35b2a243":"HEALTHY"}'
        <... lines snipped ...>
        ingress.kubernetes.io/url-map: k8s-um-default-gatekit--cc30c14d35b2a243
      creationTimestamp: 2016-12-19T16:03:07Z
      generation: 1
      name: gatekit
      namespace: default
      resourceVersion: "23182657"
      selfLink: /apis/extensions/v1beta1/namespaces/default/ingresses/gatekit
      uid: a1bdfb46-c604-11e6-a6ee-42010af00031
    spec:
      backend:
        serviceName: gatekit-public
        servicePort: 80
      tls:
      - secretName: gatekit-tls-certs
    status:
      loadBalancer:
        ingress:
        - ip: 130.211.27.124

GKE Ingress Magic
- Creating a Kubernetes ingress on GKE actually creates a Google Cloud Load Balancer
- Also, the IP stays static for as long as you don’t change the service name that it’s tied to (Google’s load balancing is not DNS based, but BGP based as it should be)
- If you change the value of the Kubernetes secret that holds the TLS cert, it automatically reconfigures the load balancer (not bad, right?)
- Of course you can still configure your own ingress controllers (e.g. if you need pod stickiness)

Autoscaling sometimes doesn’t
- Two levels of autoscaling
  - Pod autoscaling within Kubernetes (HPA)
  - Node autoscaling by GKE (scaling settings defined for instance group)
- With CPU usage bursting, sometimes a cluster node hangs / becomes unavailable
- This also stops the cluster autoscaler, so you don’t get new nodes
- Unfortunate because of the CPU burst in the first place
- Still looking into this
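
For context on the first level, the Kubernetes documentation describes the HPA's core calculation as the current replica count scaled by the ratio of observed to target metric, rounded up. The sketch below shows only that ratio plus clamping to the configured bounds; the real controller also applies a tolerance band and other safeguards, and the numbers used here are made up.

```python
import math


def desired_replicas(current_replicas: int, current_cpu: float, target_cpu: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Scale the replica count in proportion to how far CPU is from target."""
    wanted = math.ceil(current_replicas * (current_cpu / target_cpu))
    return max(min_replicas, min(max_replicas, wanted))


# Four pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, current_cpu=90, target_cpu=60, min_replicas=2, max_replicas=10))
```

A CPU burst pushes this number up quickly, and the extra pods only help if the second level can supply nodes to schedule them on.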

Schema discipline
- Using JSON for everything (including Kafka messages)
- Schemas defined in code
  - Code attracts logic; schemas shouldn’t have logic
- JSON is more troublesome than anticipated
  - Really easy to publish evolution incompatible messages on a queue
- Conscious decision to lower learning curve while bootstrapping development
- Will move to binary message format with formal schema definitions as soon as possible
- gRPC looks promising for synchronous dependencies
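
One way to catch incompatible messages before they reach a queue is to compare schema versions mechanically, for instance in the build. The sketch below uses a toy schema representation (field name mapped to a `required` flag) as a stand-in for a formal schema language. The first and last checks follow the evolution rules stated earlier in the talk; the check for an optional field turning required is an added assumption.

```python
from typing import Dict, List

# Toy schema: field name -> {"required": bool}.
Schema = Dict[str, Dict[str, bool]]


def evolution_errors(old: Schema, new: Schema) -> List[str]:
    """List the ways in which `new` breaks readers written against `old`."""
    errors = []
    for field in old:
        if field not in new:
            errors.append(f"field removed: {field}")
        elif new[field]["required"] and not old[field]["required"]:
            # Assumption: tightening an optional field is also treated as breaking.
            errors.append(f"optional field became required: {field}")
    for field in new:
        if field not in old and new[field]["required"]:
            errors.append(f"new field must be optional: {field}")
    return errors


v1 = {"sku": {"required": True}, "name": {"required": True}}
v2_ok = {**v1, "season": {"required": False}}
v2_bad = {"sku": {"required": True}, "season": {"required": True}}

print(evolution_errors(v1, v2_ok))
print(evolution_errors(v1, v2_bad))
```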

Some Observations

Warning: shameless “we’re hiring” slide coming up...

Vacancies
- Back end engineer (JVM, Python)
  - Core Platform
  - Customer Success Solutions
- Front end engineer (JavaScript, React / Redux)
- Infrastructure / deployment engineer
  - Responsible for infra, Kafka + ES clusters and the build + deployment pipeline

Questions?

THANKS! www.fashiontrade.com
