TomTom NavCloud on AWS - Speaker Deck

Slide 1

Slide 1 text

NAVCLOUD ON AWS. AWS Amsterdam Meetup! April 29th, 2014

Slide 2

Slide 2 text

NAVCLOUD • A cloud-based storage service that allows users to seamlessly synchronize trip information between devices as well as share and receive navigation information with other people (e.g., friends or companies). • NavCloud aims to be scalable and reactive while ensuring privacy and security.

Slide 3

Slide 3 text

The Team Full stack developers • Server • Mobile / SDKs • Systems / AWS

Slide 4

Slide 4 text

Architecture [Diagram: Clients connect over HTTP(S) to a set of API nodes, which share a Riak cluster.]

Slide 5

Slide 5 text

Architecture Horizontal scaling • Stateless* API nodes. • No direct interconnection between API nodes. • Riak scales horizontally very well.

Slide 6

Slide 6 text

But Why AWS? • Embracing DevOps. • Embracing (horizontal) scalability. • A whole ecosystem of different services helping to solve (almost) any task.

Slide 7

Slide 7 text

Our AWS approach • We allocate resources with CloudFormation stacks. • We build stacks inside a VPC. • We use S3 to store files, backups and logs. • We manage our DNS records with Route53.

Slide 8

Slide 8 text

So What About AWS?

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Syncing Data • Problem: the client wants to receive updates in the background. • Solutions: polling or streaming [each illustrated with a Client/Server diagram].

Slide 11

Slide 11 text

Streaming [Diagram: Clients hold HTTP 1.1 chunked connections to the HTTP API nodes, which share a Riak cluster.]
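
As a rough illustration of the HTTP 1.1 chunked transfer encoding named on this slide, the sketch below frames update messages as chunks. The function names and the JSON payload are assumptions made for illustration; the deck does not describe NavCloud's actual wire format.

```python
import json


def encode_chunk(payload: bytes) -> bytes:
    """Frame one payload as an HTTP/1.1 chunk: hex size, CRLF, data, CRLF."""
    if not payload:
        # An empty chunk would be read as the end-of-stream marker.
        raise ValueError("payload must not be empty")
    return f"{len(payload):x}".encode("ascii") + b"\r\n" + payload + b"\r\n"


# The zero-length chunk tells the client that the stream has ended.
LAST_CHUNK = b"0\r\n\r\n"


def encode_update(update: dict) -> bytes:
    """Serialize a (hypothetical) update as one newline-terminated JSON chunk."""
    return encode_chunk(json.dumps(update).encode("utf-8") + b"\n")
```

Because each chunk carries its own length, the server can keep one response open and push updates whenever they occur, without knowing the total body size in advance.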

Slide 12

Slide 12 text

Streaming on AWS

Slide 13

Slide 13 text

Streaming on AWS • All requests go via ELB. • Using a TCP/SSL listener instead of an HTTP(S) listener. Proxy Protocol support. • ELB closes idle connections after a timeout (60 s). • Sending ‘heartbeat’ messages to keep the connection alive. • HTTP(S) listener: issue with TCP RST messages.
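
The heartbeat idea can be sketched as a generator that emits a filler message whenever no real update arrives within an interval shorter than the ELB timeout. This is a minimal sketch under assumptions: the 30-second interval, the newline heartbeat payload and the queue-based event source are illustrative choices, not details from the deck.

```python
import queue

ELB_IDLE_TIMEOUT_S = 60      # idle timeout quoted on the slide
HEARTBEAT_INTERVAL_S = 30    # assumption: any value safely below the timeout
HEARTBEAT = b"\n"            # assumption: a payload that clients ignore


def stream_with_heartbeat(events, interval=HEARTBEAT_INTERVAL_S):
    """Yield queued updates; yield a heartbeat whenever the stream goes idle.

    `events` is a queue.Queue of bytes; a `None` item ends the stream.
    """
    while True:
        try:
            item = events.get(timeout=interval)
        except queue.Empty:
            yield HEARTBEAT  # nothing to send: keep the connection alive
            continue
        if item is None:
            return
        yield item
```

Each yielded item would be written to the open response, so the load balancer sees traffic at least once per interval.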

Slide 14

Slide 14 text

Your Own Load Balancer? • HAProxy, Nginx, Apache. • Full and real-time access to logs. • Configurability. • But: HA setup? Multi-AZ? • But: Security setup. • Trade-off: configurability vs. simplicity.

Slide 15

Slide 15 text

Streaming on AWS Improved

Slide 16

Slide 16 text

Streaming on AWS Improved • Amazon’s good practice. • But: each API node should be directly accessible. • More improvements: distributed events (across API nodes) instead of polling storage for updates. • Message queue (RabbitMQ cluster) with the fanout pattern. • AWS alternative: the SNS + SQS fanout pattern (worth a web search).
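
The fanout pattern can be pictured with a small in-memory model: every API node binds its own queue, and each published event is copied to all of them. This sketch only stands in for a RabbitMQ fanout exchange (or an SNS topic with one SQS queue per node); the class, node names and event shape are hypothetical.

```python
import queue


class FanoutExchange:
    """In-memory stand-in for a fanout exchange (or an SNS topic)."""

    def __init__(self):
        self._queues = {}

    def bind(self, node_id):
        """Create and return a dedicated queue for one API node."""
        q = queue.Queue()
        self._queues[node_id] = q
        return q

    def publish(self, event):
        """Deliver a copy of the event to every bound queue."""
        for q in self._queues.values():
            q.put(event)


exchange = FanoutExchange()
node_a = exchange.bind("api-node-a")  # hypothetical node names
node_b = exchange.bind("api-node-b")
exchange.publish({"user": "u1", "change": "route-updated"})
```

With this shape, a node that holds a client's streaming connection learns about a change written through any other node without polling storage.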

Slide 17

Slide 17 text

Streaming on AWS Improved

Slide 18

Slide 18 text

ELB ‘features’ :) • Performance tests. Pre-warming. • It is really easy to hit the limit beyond ~10K concurrent connections. • Ask Amazon support to pre-warm the ELB, or just run the tests for some time without measuring. • Log access. Improved lately (export to S3) …

Slide 19

Slide 19 text

Monitoring • We are investigating StackDriver (stackdriver.com). • Third-party monitoring tool with rich and customizable UI. • Custom application metrics. • Supports monitoring of a lot of standard services out of the box: Riak, Message Queue services, App containers.

Slide 20

Slide 20 text

Monitoring

Slide 21

Slide 21 text

Provisioning CloudFormation! • JSON script that describes the whole stack. • Automatic resource lifecycle management. • VPC, Security, Route53 records, S3, EC2 -> everything is managed inside CF scripts. • Currently we are stuck with a monolithic CF script -> 3000 LOC. Not very manageable.
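
To give a feel for what such a JSON script looks like, here is a tiny illustrative template built as a Python dictionary and printed as JSON. It is a sketch, not a fragment of the NavCloud template: the resource names, the `ApiEndpoint` parameter and the `example.com` domain are all placeholders.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Illustrative fragment, not the NavCloud template",
    "Parameters": {
        "ApiEndpoint": {"Type": "String"},  # hypothetical parameter
    },
    "Resources": {
        # A bucket for files, backups and logs.
        "BackupBucket": {"Type": "AWS::S3::Bucket"},
        # A DNS record pointing at the stack's entry point.
        "ApiDnsRecord": {
            "Type": "AWS::Route53::RecordSet",
            "Properties": {
                "HostedZoneName": "example.com.",
                "Name": "api.example.com.",
                "Type": "CNAME",
                "TTL": "60",
                "ResourceRecords": [{"Ref": "ApiEndpoint"}],
            },
        },
    },
    "Outputs": {
        "BackupBucketName": {"Value": {"Ref": "BackupBucket"}},
    },
}

print(json.dumps(template, indent=2))
```

Even two resources take around twenty lines, which hints at how a template covering VPC, security, DNS, storage and compute can grow to thousands.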

Slide 22

Slide 22 text

Deployment • We use the Python boto library to talk to AWS services, including calling our own scripts during CloudFormation stack setup. • Python scripts + shell scripts (AWS SDK CLI). • Capistrano for distributed tasks.

Slide 23

Slide 23 text

Capistrano Capistrano - a remote server automation and deployment tool written in Ruby. • Agent-less: Needs ssh and POSIX-compatible shell. That’s it. • Routing out of the box (connecting via ssh router).

Slide 24

Slide 24 text

Capistrano with CF stacks • Problem: dynamic nature of AWS resources. IP addresses can’t be hardcoded. • Solution: auto-discovery of CF resources (e.g. stacks, hosts) is a part of the Capistrano job.

Slide 25

Slide 25 text

Capistrano with CF stacks

Slide 26

Slide 26 text

Capistrano with CF stacks • lsfleet is a simple shell script that queries the CloudFormation API and returns the IP addresses of instances within the supplied Auto-Scaling Group. • Could be done even more easily with the Ruby AWS SDK.
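
The lsfleet script itself is not shown in the deck. As a loose analogue, the sketch below resolves the private IP addresses of an Auto Scaling group in Python. Note the assumptions: it uses boto3 rather than the boto library named in the deck, it queries the Auto Scaling and EC2 APIs directly rather than CloudFormation, and both function names are made up for this example.

```python
def private_ips(describe_instances_response):
    """Collect private IPs from an EC2 DescribeInstances-style response."""
    return [
        inst["PrivateIpAddress"]
        for res in describe_instances_response.get("Reservations", [])
        for inst in res.get("Instances", [])
        if "PrivateIpAddress" in inst
    ]


def asg_private_ips(asg_name, region):
    """Look up the instances of one Auto Scaling group and return their IPs."""
    import boto3  # imported lazily so the parsing helper works without it

    autoscaling = boto3.client("autoscaling", region_name=region)
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    ids = [i["InstanceId"] for g in groups for i in g["Instances"]]
    if not ids:
        return []
    ec2 = boto3.client("ec2", region_name=region)
    return private_ips(ec2.describe_instances(InstanceIds=ids))
```

A deployment job could call `asg_private_ips` at start-up and feed the result into its host list, so no address is ever hardcoded.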

Slide 27

Slide 27 text

Capistrano Use Cases • Distributing the application across the whole App stack (deploying to different ‘dev’ CF stacks). • Gathering log files. • Getting some OS-related stats from all nodes. • Interactively invoking commands on all nodes of an ASG.

Slide 28

Slide 28 text

Capistrano: why bother? Before! • Huge (480+304 LOC) shell scripts for app deployment. • Doing manual ssh routing, etc. After! • Capfile ~70 LOC & helper shell scripts (50+153 LOC). • Easier to maintain. • Capistrano params: easier to configure.

Slide 29

Slide 29 text

‘Switching’ The Stacks • Allows fully automating dev environment updates. Can be a Continuous Integration job! • Decreases the downtime. • Procedure: 1. Provision the new CF stack using a Python boto script. 2. Download & apply the latest backup from S3 using a shell script & the s3cmd tool. 3. Switch the Route53 DNS record using the AWS API.
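
Step 3 of the procedure might look roughly like the following. This is a sketch under assumptions: it uses boto3 rather than the boto library the team mentions, it assumes the record is a CNAME, and the function names, comment text and 60-second TTL are illustrative.

```python
def dns_switch_batch(record_name, new_target, ttl=60):
    """Build a Route 53 ChangeBatch that repoints a CNAME at the new stack."""
    return {
        "Comment": "Switch to newly provisioned stack",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record or overwrite it
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": new_target}],
                },
            }
        ],
    }


def switch_dns(hosted_zone_id, record_name, new_target):
    """Apply the change batch through the Route 53 API."""
    import boto3  # imported lazily so the batch builder works without it

    route53 = boto3.client("route53")
    return route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=dns_switch_batch(record_name, new_target),
    )
```

A short TTL keeps the window small in which clients still resolve the old stack after the switch.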

Slide 30

Slide 30 text

Questions? Dmitry Ivanov (@idajantis) • Vincenzo Vitale (@sicilianamente) • Nami Nasserazad (@nami4552)