devops at npm: scaling the registry

Slide 1

Slide 1 text

scaling the registry

Slide 2

Slide 2 text

C J Silverio, devops at npmjs.com, @ceejbot

Slide 3

Slide 3 text

What we did, lessons learned, generalizations

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Jacques Marneweck, Benjamin Coe, Laurie Voss

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

January 2013: 20K packages, 0.5 million downloads/day

Slide 8

Slide 8 text

January 2014: 60K packages, 8 million downloads/day

Slide 9

Slide 9 text

November 2014: > 100K packages, 28 million downloads/day at peak

Slide 10

Slide 10 text

side project, 100% CouchDB, donated hosting, IrisCouch

Slide 11

Slide 11 text

October 2013

Slide 12

Slide 12 text

General lesson #1: put a cache on it
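
A minimal sketch of the idea, not the registry's actual setup: a read-through cache with a time-to-live sits in front of a slow backend, so repeated requests for the same package never reach the database. The names `TTLCache` and `fetch_from_registry` are hypothetical, and the 60-second TTL is an arbitrary assumption.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Read-through cache: serve stored responses until they expire."""

    def __init__(self, fetch: Callable[[str], bytes], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # called only on a miss or an expired entry
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, bytes]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value


backend_calls = []


def fetch_from_registry(path: str) -> bytes:
    """Hypothetical stand-in for a request to the origin database."""
    backend_calls.append(path)
    return f"metadata for {path}".encode()


cache = TTLCache(fetch_from_registry, ttl_seconds=60.0)
cache.get("/express")   # miss: goes to the backend
cache.get("/express")   # hit: served from memory
```

In this sketch the second request never touches the backend. A CDN or caching proxy applies the same idea at a much larger scale.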

Slide 13

Slide 13 text

Re-architecture 1: move tarballs out of poor CouchDB

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

February 2014: company founded

Slide 16

Slide 16 text

hosted on Joyent/SmartOS, hand-built CouchDB + SpiderMonkey, bash scripts to deploy

Slide 17

Slide 17 text

Twitter tells us when we're down

Slide 18

Slide 18 text

Re-architecture 2: many CouchDBs

Slide 19

Slide 19 text

General lesson #2: understand your db deeply

Slide 20

Slide 20 text

Monitoring & alerts

Slide 21

Slide 21 text

General lesson #3: add monitoring after every outage

Slide 22

Slide 22 text

1: reactive. Monitor deeply, fix things quickly

Slide 23

Slide 23 text

2: proactive. Self-healing monitoring (also things don't break)
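
One way to picture self-healing monitoring, as a sketch under assumptions rather than npm's actual tooling: a check that restarts a failing service itself and alerts a human either way. `heal_once`, the service name, and the restart limit are all hypothetical.

```python
from typing import Callable, List


def heal_once(name: str,
              is_healthy: Callable[[], bool],
              restart: Callable[[], None],
              alert: Callable[[str], None],
              max_restarts: int = 2) -> bool:
    """Run one health check; restart on failure, alert on the outcome."""
    if is_healthy():
        return True
    for attempt in range(1, max_restarts + 1):
        restart()
        if is_healthy():
            alert(f"{name}: recovered after {attempt} restart(s)")
            return True
    alert(f"{name}: still unhealthy after {max_restarts} restart(s)")
    return False


# Simulated service that stays down until it has been restarted once.
state = {"up": False, "restarts": 0}
alerts: List[str] = []


def fake_restart() -> None:
    state["restarts"] += 1
    state["up"] = True


recovered = heal_once("couchdb-replica", lambda: state["up"],
                      fake_restart, alerts.append)
```

Alerting even on a successful recovery keeps the reactive half of the picture: the restart buys time, and the alert records that something needs a real fix.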

Slide 24

Slide 24 text

June 2014: superficially similar.

Slide 25

Slide 25 text

AWS / Ubuntu, 70/30 west/east split, 52 running instances (variable)

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

50/50 AWS region split, haproxy to load balance, no AWS-specific magic

Slide 28

Slide 28 text

Fastly: geoloc + cache; haproxy / CouchDB; nginx + a filesystem

Slide 29

Slide 29 text

behind the scenes: ansible / nagios, InfluxDB + Grafana

Slide 30

Slide 30 text

General lesson #4: metrics for everything

Slide 31

Slide 31 text

memory & cpu use, request latency, event counts
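
A rough sketch of what recording two of these looks like inside a service: request latency and event counts. The `Metrics` class and the metric names are hypothetical, and a real setup would ship the numbers to a time-series store instead of holding them in memory. Memory and CPU use are left out here, since those usually come from the host rather than the application.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import DefaultDict, Iterator, List


class Metrics:
    """In-memory counters and latency samples."""

    def __init__(self) -> None:
        self.counts: DefaultDict[str, int] = defaultdict(int)
        self.timings_ms: DefaultDict[str, List[float]] = defaultdict(list)

    def increment(self, name: str, by: int = 1) -> None:
        self.counts[name] += by

    @contextmanager
    def timer(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the wrapped code raises.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.timings_ms[name].append(elapsed_ms)


metrics = Metrics()


def handle_request(path: str) -> str:
    """Hypothetical request handler instrumented with metrics."""
    with metrics.timer("request.latency"):
        metrics.increment("request.count")
        if path.endswith(".tgz"):
            metrics.increment("tarball.download")
        return "ok"


handle_request("/express")
handle_request("/express/-/express-4.0.0.tgz")
```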

Slide 32

Slide 32 text

metrics == visibility

Slide 33

Slide 33 text

metrics drive monitoring
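
As an illustration of metrics driving monitoring, here is a sketch in which alert rules are just thresholds over metric values. The metric names, limits, and the `Threshold` and `evaluate` helpers are assumptions made for this example, not the rules npm runs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    message: str


def evaluate(samples: Dict[str, float], rules: List[Threshold]) -> List[str]:
    """Return an alert message for every rule whose metric exceeds its limit."""
    alerts = []
    for rule in rules:
        value = samples.get(rule.metric)
        if value is not None and value > rule.limit:
            alerts.append(f"{rule.message} ({rule.metric}={value})")
    return alerts


rules = [
    Threshold("request.latency_ms.p95", 500.0, "registry is slow"),
    Threshold("http.5xx_per_min", 10.0, "error rate is high"),
]
samples = {"request.latency_ms.p95": 820.0, "http.5xx_per_min": 2.0}
fired = evaluate(samples, rules)
```

With these sample values only the latency rule fires. Adding monitoring after an outage then amounts to adding one more rule over a metric that is already being collected.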

Slide 34

Slide 34 text

General lesson #5: automate

Slide 35

Slide 35 text

no special snowflakes: every instance can be replaced

Slide 36

Slide 36 text

General lesson #6: the goal is to be BORING

Slide 37

Slide 37 text

if operations are boring, you can do the dev

Slide 38

Slide 38 text

Goal: to be the most boring part of your node experience

Slide 39

Slide 39 text

npm client <3: npm install -g npm@latest

Slide 40

Slide 40 text

npm loves you