Real World Microservices

Slide 1

Slide 1 text

Real World Microservices
Christoph Leiter
Vienna Microservices Meetup, April 6th 2017

Slide 2

Slide 2 text

Agenda
1. Introduction
2. Microservices
3. Creating the infrastructure
4. Implementation of microservices
5. Lessons learned

Slide 3

Slide 3 text

Introduction

Slide 4

Slide 4 text

starjack
A platform for buying ski tickets online

Slide 5

Slide 5 text

starjack II
- Customers sign up and order their personal keycard
- Tickets of various lift operators can be booked and are available within seconds
- No more standing in line for a ticket

Slide 6

Slide 6 text

starjack III
- Created with microservices from the ground up
- Everything is a REST interface
- Frontend uses ES6 and React/Redux
- 100% Open Source components
- Development infrastructure
  - Self-hosted GitLab (git, issues, Docker registry)
  - Jenkins: builds are executed on DigitalOcean and pushed to the Docker registry

Slide 7

Slide 7 text

Microservices

Slide 8

Slide 8 text

Why Microservices
- Architectural choice with advantages and drawbacks
  + Scalability, reliability, isolation
  − Operational complexity
- starjack communicates with many different 3rd party systems
  - Uses multiple protocols like REST and SOAP
- Used to isolate services from each other
  - If one service has problems or is down, not everything is affected
  - Should one service get compromised, an attacker does not get access to all data
- Reduces complexity of individual services
  - It’s easier to manage 12 services with 2k LOC than 1 service with 24k LOC
  - When you change one system you can only break so much

Slide 9

Slide 9 text

Our Microservices I
- starjack-auth: Holds user data and handles logins
- starjack-dta: Gets tickets from lift operators using the DTA interface from SkiData AG
- starjack-axess: Gets tickets from lift operators using the Axess interface from Axess AG
- starjack-liftoperator: Manages our lift operators and works as a facade for dta and axess
- starjack-order: Verifies and processes customer orders. Creates invoices and sends emails
- starjack-payment: Handles payments with the Mpay24 PSP

Slide 10

Slide 10 text

Our Microservices II
- starjack-keycards: Orders new keycards from a 3rd party supplier and updates their status once produced
- starjack-weather: Retrieves current weather for lift operators, currently using OpenWeatherMap
- starjack-maps: Used to get map locations for lift operators and for travel duration estimation. Uses Google Maps
- starjack-faq: Used to manage FAQ entries
- starjack-mail: Sends mails using the Mailgun mail service
As the system grows we will have more services instead of growing one monolithic system without bounds

Slide 11

Slide 11 text

Our Microservices III

Slide 12

Slide 12 text

starjack Deployment
- starjack uses AWS as its deployment platform
  - Offers tons of services, everything is fully automatable
  - Very high reliability possible if your architecture supports it
  - Great security properties because you get your own software-defined network
- Everything is deployed in three availability zones
- We use many AWS services: EC2, S3, CloudFront, RDS, ElastiCache, SQS, Route 53, CloudWatch, ...

Slide 13

Slide 13 text

Deployment II
Logical view on AWS

Slide 14

Slide 14 text

Deployment III
Deployment view on AWS

Slide 15

Slide 15 text

Creating the Infrastructure

Slide 16

Slide 16 text

The Basics
- So, what do we need to get started with a microservices architecture?
- Very good, fully automated infrastructure management is key, see Martin Fowler’s “You need to be this tall to use microservices”
- Fowler says you need to
  - be able to rapidly provision servers
  - have very good monitoring and logging infrastructure
  - have deployment automated
- Bonus points if you can programmatically recreate your infrastructure from scratch

Slide 17

Slide 17 text

Infrastructure as Code I
- Terraform allows you to specify your whole infrastructure as simple HCL files
- Supports AWS, Google Cloud, Mailgun and dozens of other services
- When you run Terraform it compares the current state with the desired state and applies the needed changes
- Creates a dependency graph between your resources and modifies them in the right order
- No more clicking around in the AWS web console
- Every change is documented and versioned; manual changes will be reverted on the next run
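The compare-and-apply idea can be sketched in a few lines of Python. This is only a toy model of the concept, not how Terraform is implemented: the resource names, attributes and the `plan` function are all made up for illustration, and the dependency ordering uses the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical desired state: resource name -> (attributes, dependencies).
desired = {
    "vpc": ({"cidr": "10.0.0.0/16"}, []),
    "subnet": ({"cidr": "10.0.1.0/24"}, ["vpc"]),
    "instance": ({"type": "t2.micro"}, ["subnet"]),
}

# Hypothetical current state: resource name -> attributes.
current = {
    "vpc": {"cidr": "10.0.0.0/16"},
    "subnet": {"cidr": "10.0.9.0/24"},  # drifted, e.g. changed by hand
}


def plan(desired, current):
    """Return (action, resource) steps, dependencies first."""
    graph = {name: set(deps) for name, (_, deps) in desired.items()}
    steps = []
    for name in TopologicalSorter(graph).static_order():
        attrs, _ = desired[name]
        if name not in current:
            steps.append(("create", name))
        elif current[name] != attrs:
            steps.append(("update", name))
    # Resources that exist but are no longer described get removed.
    for name in current:
        if name not in desired:
            steps.append(("delete", name))
    return steps


for action, name in plan(desired, current):
    print(action, name)
```

In this toy run the hand-edited subnet is scheduled for an update before the instance that depends on it is created, which mirrors the "manual changes will be reverted" behaviour described on the slide.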

Slide 18

Slide 18 text

Infrastructure as Code II

Slide 19

Slide 19 text

Infrastructure as Code III

Slide 20

Slide 20 text

Infrastructure as Code IV
- We use Terraform for the whole basic infrastructure
  - VPC, firewall rules
  - EC2 instances, ELB, S3, CloudFront, Route 53, SQS
  - RDS cluster, Redis cluster
  - Mailgun
- Very easy to use and works really well

Slide 21

Slide 21 text

Server Provisioning
- Now we have our bare EC2 instances running and we need to install some software on them
- Terraform is not a provisioner, so we need another tool to automate that
- We chose Ansible to provision our servers
  - Works over SSH and doesn’t have requirements for the clients besides Python
  - We tag our instances by role with Terraform. The inventory is generated automatically with ec2.py
  - Configures EC2 instances, creates databases and users, defines the DigitalOcean Jenkins slave, ...
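To make the tag-based inventory concrete, here is a rough Python sketch of what a dynamic inventory script does: it groups hosts by their role tag and emits the JSON structure Ansible expects from inventory scripts. The instance list, the addresses and the `build_inventory` helper are invented for illustration; the `tag_role_...` group naming is modelled on ec2.py's convention and should be treated as an assumption.

```python
import json

# Hypothetical instances as they might be returned by the EC2 API.
instances = [
    {"address": "10.0.1.10", "tags": {"role": "nomad-server"}},
    {"address": "10.0.1.20", "tags": {"role": "nomad-client"}},
    {"address": "10.0.2.20", "tags": {"role": "nomad-client"}},
    {"address": "10.0.3.30", "tags": {}},  # untagged, ends up in no group
]


def build_inventory(instances):
    """Group instance addresses by their 'role' tag."""
    inventory = {"_meta": {"hostvars": {}}}
    for instance in instances:
        role = instance["tags"].get("role")
        if role is None:
            continue
        # Assumed naming scheme: tag_<key>_<value>, dashes replaced.
        group = "tag_role_" + role.replace("-", "_")
        inventory.setdefault(group, {"hosts": []})["hosts"].append(
            instance["address"]
        )
    return inventory


print(json.dumps(build_inventory(instances), indent=2))
```

A playbook can then target a group such as `tag_role_nomad_client` without anyone maintaining a host list by hand.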

Slide 22

Slide 22 text

Service Scheduling
- We use a cluster as our microservice deployment platform
- A scheduler is needed to make decisions on where in your cluster your services should run
- We chose Nomad
  - Easy to get started with in comparison to other schedulers
  - Just one binary to install
  - Relatively new and not as mature as other solutions
  - Requires three servers, which should be deployed to different availability zones

Slide 23

Slide 23 text

Service Discovery
- Now that our services are deployed somewhere we need to find them
- That’s the job of a service discovery service
- We chose Consul because it plays well together with Nomad
- Whenever a new service is deployed with Nomad it will register its endpoints in Consul
- A tool called consul-template updates the nginx configuration file as soon as changes happen and reloads nginx
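The render-and-reload loop can be illustrated with a small Python sketch. It does not use consul-template's actual template language; the catalog snapshot, the service names' endpoints and both helper functions are hypothetical and only show the idea of regenerating nginx upstream blocks from registered endpoints.

```python
def render_upstreams(services):
    """Render one nginx upstream block per registered service."""
    lines = []
    for name in sorted(services):
        lines.append(f"upstream {name} {{")
        for host, port in sorted(services[name]):
            lines.append(f"    server {host}:{port};")
        lines.append("}")
    return "\n".join(lines) + "\n"


def needs_reload(previous_config, new_config):
    """nginx only has to be reloaded when the rendered file changed."""
    return previous_config != new_config


# Hypothetical catalog snapshot: service name -> list of (host, port).
catalog = {
    "starjack-faq": [("10.0.1.20", 23456)],
    "starjack-auth": [("10.0.1.20", 21001), ("10.0.2.20", 22417)],
}

config = render_upstreams(catalog)
print(config)

# A new instance registers itself, so the rendered file changes.
catalog["starjack-faq"].append(("10.0.2.20", 24001))
print(needs_reload(config, render_upstreams(catalog)))
```

Because the ports are assigned dynamically by the scheduler, generating this file from the catalog is the only practical way to keep the reverse proxy in sync.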

Slide 24

Slide 24 text

Deployment I
- Nomad needs a job description file to know what to schedule where
- Nginx needs to be configured so it knows which endpoints should be routed to which services
- We deploy our services using our small custom YAML DSL which is processed by Ansible

Slide 25

Slide 25 text

Deployment II

Slide 26

Slide 26 text

Deployment III
Deployment steps
1. Modify the services DSL file
2. Trigger the deployment with Ansible
3. Ansible creates the Nomad job specification files and triggers scheduling
4. Nomad does a rolling update of the services
5. Nomad worker nodes pull the new Docker images and start them
6. As new versions are rolled out, Consul and Nginx get updated
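Step 3 can be sketched as a small translation function. The talk does not show the DSL, so the entry below (keys `image`, `instances`, `memory_mb`), the registry host name and the `to_nomad_job` helper are all invented. The output field names follow Nomad's JSON job format as far as I know it, so check them against the Nomad documentation before relying on them.

```python
import json

# Hypothetical DSL entry, as it might look after parsing the YAML file.
services = {
    "starjack-faq": {
        "image": "registry.example.com/starjack-faq:1.4.2",
        "instances": 2,
        "memory_mb": 512,
    },
}


def to_nomad_job(name, spec):
    """Translate one DSL entry into a Nomad-style job description."""
    return {
        "Job": {
            "ID": name,
            "Name": name,
            "Type": "service",
            "Datacenters": spec.get("datacenters", ["dc1"]),
            # One instance at a time, i.e. a rolling update.
            "Update": {"MaxParallel": 1},
            "TaskGroups": [
                {
                    "Name": name,
                    "Count": spec["instances"],
                    "Tasks": [
                        {
                            "Name": name,
                            "Driver": "docker",
                            "Config": {"image": spec["image"]},
                            "Resources": {"MemoryMB": spec["memory_mb"]},
                        }
                    ],
                }
            ],
        }
    }


for name, spec in services.items():
    print(json.dumps(to_nomad_job(name, spec), indent=2))
```

Keeping the DSL this small means a deployment is usually a one-line change (a new image tag) that is reviewed and versioned like any other code.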

Slide 27

Slide 27 text

Infrastructure Components
In summary we have these infrastructure components

Slide 28

Slide 28 text

Implementation of Microservices

Slide 29

Slide 29 text

Microservices Details
- Microservices are written in Kotlin and are based on Spring Boot and Hibernate
- Each service has its own git repository
- One common library which services can use
  - Be careful not to introduce unwanted dependencies between services!
  - Treat it as an API and don’t break it
  - Only used for cross-cutting concerns
- Communication between microservices uses REST for synchronous and a queue for asynchronous communication
- Not covered in detail, your implementation will be different anyway :)

Slide 30

Slide 30 text

Authentication
- User requests need to be authenticated
- We don’t want to query the authentication service for every request
  - Would create a lot of load and a potential bottleneck
- Instead we use JSON Web Tokens (JWT)
  - The authentication service creates a cryptographically signed token for the user using its private key
  - Services have the public key and can check whether the token is legitimate
  - Since there’s no invalidation of a token we use a low TTL and a refresh mechanism
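The token flow can be sketched with the Python standard library. One deliberate substitution: the talk describes asymmetric signing (private key on the auth service, public key on the others), while this sketch uses HMAC-SHA256 with a shared secret (HS256) purely to stay dependency-free. With HS256 every verifying service would hold the signing secret, which is exactly what the asymmetric setup avoids. The function names and claim values are illustrative.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64(data):
    """URL-safe base64 without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input, key):
    return _b64(hmac.new(key, signing_input.encode(), hashlib.sha256).digest())


def issue_token(user_id, key, ttl_seconds, now=None):
    """Create a signed token that expires after ttl_seconds."""
    now = int(time.time()) if now is None else now
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = _b64(json.dumps({"sub": user_id, "exp": now + ttl_seconds}).encode())
    return f"{header}.{claims}.{_sign(header + '.' + claims, key)}"


def verify_token(token, key, now=None):
    """Return the claims if signature and expiry are valid, else None."""
    now = int(time.time()) if now is None else now
    try:
        header, claims, signature = token.split(".")
    except ValueError:
        return None
    expected = _sign(header + "." + claims, key)
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    payload = json.loads(_unb64(claims))
    if payload["exp"] <= now:
        return None  # expired: the client has to refresh its token
    return payload


secret = b"demo-secret"  # illustrative only
token = issue_token("user-42", secret, ttl_seconds=300)
print(verify_token(token, secret))
```

No call to the authentication service is needed to verify a request, which is the point of the design; the short TTL bounds how long a token stays usable when it cannot be revoked.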

Slide 31

Slide 31 text

JWT

Slide 32

Slide 32 text

Interservice Communication
- If an immediate response is needed, communication is simply done via REST HTTP requests
  - Passes the Authentication header if it’s required to identify the user
- For asynchronous communication we use a queue
  - Decouples services
  - Messages don’t get lost if the other service is down
  - Automatic retries
  - Increases reliability of your system
- Coordination between multiple instances of the same type is done with Redis
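Why a queue helps with retries can be shown with a tiny in-memory model. This is not the queue product used by starjack and not a client for any real service: the `drain` function, the attempt limit and the message names are assumptions chosen to show that a failed message goes back on the queue instead of being lost.

```python
from collections import deque


def drain(queue, handler, max_attempts=3):
    """Deliver queued messages, re-queueing the ones whose handler fails.

    Returns the messages that still failed after max_attempts.
    """
    dead_letters = []
    while queue:
        message, attempts = queue.popleft()
        try:
            handler(message)
        except Exception:
            if attempts + 1 < max_attempts:
                queue.append((message, attempts + 1))
            else:
                dead_letters.append(message)
    return dead_letters


# Hypothetical consumer that is briefly unavailable for one message.
delivered = []
failed_once = set()


def flaky_handler(message):
    if message == "order-1" and message not in failed_once:
        failed_once.add(message)
        raise ConnectionError("consumer temporarily down")
    delivered.append(message)


queue = deque([("order-1", 0), ("order-2", 0)])
dead = drain(queue, flaky_handler)
print(delivered, dead)
```

With a direct REST call the sender would have to handle that outage itself; with a queue the message simply waits until the consumer is back.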

Slide 33

Slide 33 text

Logging I
- When you have dozens of services running you need a good centralized logging solution
- AWS offers CloudWatch Logs
- Docker can natively log to CloudWatch
- Every log message should only be one event (one line)
- We use awslogs as a “remote grep” tool for CloudWatch

Slide 34

Slide 34 text

Logging II
- Allows searching in multiple log groups
- Time-based restrictions with -s and -e

Slide 35

Slide 35 text

Logging III
- Logs are useless if nobody looks at them. You need a quick overview and notifications for errors
- We have an AWS Lambda function which subscribes to our application logs
  - Filters logs for events we are interested in
    - Errors and warnings
    - Whitelisted info events
  - Uses the Slack API to push log messages to Slack
    - Different channels based on severity and type
    - Adds Slack notifications for errors
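The filtering part of such a function might look roughly like the sketch below. CloudWatch Logs delivers subscription data to Lambda as base64-encoded, gzip-compressed JSON under `awslogs.data`; everything else here is assumed: the log line layout (`<timestamp> <LEVEL> <text>`), the channel names, the whitelist and the helper functions. Posting to Slack is left out.

```python
import base64
import gzip
import json

# Hypothetical routing table and whitelist.
CHANNELS = {"ERROR": "#log-errors", "WARN": "#log-warnings", "INFO": "#log-events"}
INFO_WHITELIST = ("order completed",)


def decode_events(event):
    """Unpack the log messages from a CloudWatch Logs subscription event."""
    raw = gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
    return [entry["message"] for entry in json.loads(raw)["logEvents"]]


def route(message):
    """Return the Slack channel for a log line, or None to drop it."""
    parts = message.split(" ", 2)
    if len(parts) < 3:
        return None
    level, text = parts[1], parts[2]
    if level in ("ERROR", "WARN"):
        return CHANNELS[level]
    if level == "INFO" and any(phrase in text for phrase in INFO_WHITELIST):
        return CHANNELS["INFO"]
    return None


def handler(event, context=None):
    """Lambda entry point: returns (channel, message) pairs to be posted."""
    routed = [(route(message), message) for message in decode_events(event)]
    return [(channel, message) for channel, message in routed if channel]
```

Splitting decoding, routing and posting keeps the routing rules testable without AWS or Slack access.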

Slide 36

Slide 36 text

Logging IV

Slide 37

Slide 37 text

Monitoring
- We need external monitoring to notify us in case a system is down
- We expose simple status endpoints from our services and let StatusCake monitor those
- If a service is unreachable, StatusCake uses Pushover to send us a push notification

Slide 38

Slide 38 text

Lessons Learned

Slide 39

Slide 39 text

Log Everything
- When things fail you need to be able to debug the problem
- It helps to log as much as you can
  - Application logs
  - Web server access/error logs
  - Queue messages
  - Linux system logs
  - Frontend logs
- For application logs think about how you will search for messages later, i.e. include relevant data
- For system logs use blacklists for messages you are not interested in, get notified for everything you don’t expect

Slide 40

Slide 40 text

Distributed Systems
- A distributed system might not behave exactly as you’d think, see “Fallacies of Distributed Computing”
- Even when something has a failure likelihood of << 1%, if it runs thousands of times it will eventually go wrong
- Expect that things will go wrong and make your system as robust as possible
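The "eventually go wrong" claim is easy to quantify. Assuming independent runs with the same failure probability p (a simplification, and the numbers below are made up), the chance of seeing at least one failure in n runs is 1 - (1 - p)^n:

```python
def p_at_least_one_failure(p, runs):
    """Probability that at least one of `runs` independent runs fails."""
    return 1 - (1 - p) ** runs


# A 0.1% failure rate looks harmless for a single call ...
print(p_at_least_one_failure(0.001, 1))
# ... but over 1000 calls a failure is more likely than not (about 0.63).
print(p_at_least_one_failure(0.001, 1000))
```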

Slide 41

Slide 41 text

DB Connections
- Usually DBs have a connection limit set
- Once the limit is reached you won’t get a new connection
- Think about how many connections you will have, it might be more than you think
  connections = services × instances × poolsize
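A quick worked example of the formula, with made-up numbers (11 services, 3 instances each, a pool of 10 connections per instance, and a hypothetical database limit of 200):

```python
def total_connections(services, instances, pool_size):
    """connections = services x instances x poolsize"""
    return services * instances * pool_size


# Hypothetical figures, chosen only to show how fast the product grows.
needed = total_connections(services=11, instances=3, pool_size=10)
limit = 200
print(needed, "connections needed, limit is", limit)
print("over the limit" if needed > limit else "fits")
```

A pool size that looks modest per instance multiplies to 330 connections here, well past the assumed limit.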

Slide 42

Slide 42 text

System Resources
- Really hard to know your service memory requirements beforehand
- Services will probably consume more resources than you assume as they also need to have their runtime environment in memory
- No memory sharing/deduplication if you use Docker
- Be careful if you do hard memory limit enforcement

Slide 43

Slide 43 text

Summary
- The biggest challenge when doing microservices isn’t programming microservices but the infrastructure
- Everything has to be automated. You need:
  - Automatic infrastructure setup: Terraform, CloudFormation, Heat, ...
  - A provisioner: Ansible, Puppet, Chef, ...
  - Automatic builds: Jenkins, GitLab, Travis CI, ...
  - A scheduler: Nomad, Kubernetes, Docker Swarm, ...
  - A discovery service: Consul, etcd, ZooKeeper, ...
  - An easy way to deploy services
  - Centralized logging: CloudWatch, ELK, Graylog, ...
  - Monitoring: StatusCake, Nagios, Sensu, ...

Slide 44

Slide 44 text

Summary II
- You only need to invest in infrastructure knowledge once
- Makes development and evolution of your system easier
- Dramatically reduces complexity of individual services
- You get rewarded with a distributed, highly reliable system

Slide 45

Slide 45 text

Kotlin Meetup
If you’re interested in Kotlin please join our meetup! Next meetup is on April 18th.

Slide 46

Slide 46 text

End
Thanks for your attention
Questions? Contact me at [email protected]

Slide 47

Slide 47 text

References
- starjack: https://starjack.at/
- MicroservicePrerequisites: https://martinfowler.com/bliki/MicroservicePrerequisites.html
- Terraform: https://www.terraform.io/
- Ansible: https://www.ansible.com/
- Nomad: https://www.nomadproject.io/
- Consul: https://www.consul.io/
- consul-template: https://github.com/hashicorp/consul-template
- Fallacies of distributed computing: https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing