Slide 1

Slide 1 text

Building infrastructure for our global service Sorah Fukumori

Slide 2

Slide 2 text

$ whoami Sorah Fukumori (そらは) https://sorah.jp/ | GitHub @sorah | Twitter @sora_h Site Reliability Engineer at Cookpad Global Cookpad TechConf 2017 NOC Rubyist, Ruby committer Interests: Site Reliability, Networking, Distributed systems

Slide 4

Slide 4 text

Wi-Fi • Are you enjoying the internet?

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Wi-Fi • Are you enjoying the internet? • It’s provided as best-effort, but let us know via Twitter #CookpadTechConf if you’re having a problem

Slide 8

Slide 8 text

Agenda • Global? • About SRE team • Infra • Architecture • Traffic • Relationship with Developers • Plans 2017

Slide 9

Slide 9 text

global?

Slide 10

Slide 10 text

global

Slide 11

Slide 11 text

global • https://cookpad.com/uk /es /ar /id /vn /sa … • Web / Android app & iOS app • 58 countries, in 15 languages 2/3

Slide 12

Slide 12 text

global • A different code base from JP, built from scratch 3/3

Slide 13

Slide 13 text

Terminology • Global: • Cookpad (global) https://cookpad.com/uk /es … • JP: • クックパッド (JP) https://cookpad.com/

Slide 14

Slide 14 text

SRE team

Slide 15

Slide 15 text

SRE team • 9 SRE members in JP • 2 members in JP are assigned to the global project 1/2

Slide 16

Slide 16 text

SRE team • Also, we have 1 SRE member in US • Recently joined! 2/2

Slide 17

Slide 17 text

global infra

Slide 18

Slide 18 text

global infra • Nothing special to mention. Currently it’s just infrastructure for a plain Rails app • Business as usual… for now. 1/3

Slide 19

Slide 19 text

global infra • AWS us-east-1 • Amazon Aurora for MySQL • ElastiCache (Redis & memcached) • Ruby on Rails 4.2 on Ruby 2.3 • nginx + unicorn 2/3

Slide 20

Slide 20 text

global infra • App built from scratch, infra lives in a new region • Building better infra than the existing one, based on our past experience with AWS EC2 and VPC in Japan • e.g. JP: CentOS → Ubuntu, US: Ubuntu only • e.g. JP: weird subnetting, US: private/public subnets 3/3

Slide 21

Slide 21 text

architecture • It’s basically simple: • ELB • EC2 (nginx) • EC2 (Rails, unicorn) • RDS (Aurora for MySQL), ElastiCache (Redis, memcached) 1/6

Slide 22

Slide 22 text

architecture • cookpad.com is shared between the global service and the JP service • But the apps run in different regions…? 2/7

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

architecture • Route53 Latency Based Routing (ap-northeast-1 or us-east-1) • DNS returns the IP of the region closest to the resolver • If the requested service lives in the other region, reverse-proxy to that region • Also, terminating TCP/TLS as close to the user as possible is better for latency. (But serving from only 2 regions is not enough…) 4/7

Slide 25

Slide 25 text

location ~ ^/(ae|ar|bh|bo|br|cl|co|cr|cu|de|dj|do|dz|ec|eeuu|eg|es|fr|gt|hn|hu|id|in|iq|ir|it|jo|km|kr|kw|lb|ly|ma|mr|mx|ni|om|pa|pe|ph|pri|ps|py|qa|sa|sd|so|sv|sy|th|tn|tw|uk|us|uy|ve|vn|ye)(/|$) {
    proxy_pass http://cookpad_use1;
}

location / {
    proxy_pass https://cookpad_apne1;
}
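
The routing rule in this config can be sketched outside nginx to check which paths go to which region. This is only an illustration in Python, not part of the actual setup: the function name `pick_upstream` is hypothetical, and the prefix list and upstream names are copied from the slide.

```python
import re

# Country/locale prefixes copied from the location block on the slide.
GLOBAL_PREFIXES = (
    "ae|ar|bh|bo|br|cl|co|cr|cu|de|dj|do|dz|ec|eeuu|eg|es|fr|gt|hn|"
    "hu|id|in|iq|ir|it|jo|km|kr|kw|lb|ly|ma|mr|mx|ni|om|pa|pe|ph|pri|"
    "ps|py|qa|sa|sd|so|sv|sy|th|tn|tw|uk|us|uy|ve|vn|ye"
)

# Same shape as the nginx regex: a known prefix followed by "/" or end of path.
GLOBAL_PATH = re.compile(rf"^/({GLOBAL_PREFIXES})(/|$)")


def pick_upstream(path: str) -> str:
    """Return the upstream name that would serve this request path."""
    if GLOBAL_PATH.search(path):
        return "cookpad_use1"   # global app (us-east-1)
    return "cookpad_apne1"      # everything else goes to the JP app (ap-northeast-1)


if __name__ == "__main__":
    for p in ["/uk/recipes", "/uk", "/ukraine", "/recipe/123", "/"]:
        print(p, "->", pick_upstream(p))
```

The trailing `(/|$)` matters: it keeps a JP path that merely starts with the same letters as a country prefix (the made-up `/ukraine` here) from being sent to the global app.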

Slide 26

Slide 26 text

architecture • Rails app servers are capable of autoscaling • Using consul + consul-template to apply the latest instance list to configurations • Recent AWS Auto Scaling Groups (ASG) allow suspending actions via the API, so global relies on ASG (JP uses its own implementation) 6/7
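
To illustrate what "applying the latest instance list to configurations" amounts to, here is a minimal Python sketch of the rendering step. It is not consul-template itself and not the configuration used in the talk; the function name `render_upstream`, the upstream name `rails_app`, the addresses, and port 8080 are all assumptions made for the example.

```python
from typing import Iterable


def render_upstream(name: str, addresses: Iterable[str], port: int = 8080) -> str:
    """Render an nginx upstream block from the current instance addresses."""
    # Deduplicate and sort so the output is stable when membership is unchanged,
    # which avoids reloading nginx for no reason.
    servers = sorted(set(addresses))
    if not servers:
        # An empty upstream block is invalid in nginx, so fail loudly instead.
        raise ValueError("refusing to render an upstream with no servers")
    lines = [f"upstream {name} {{"]
    lines += [f"    server {addr}:{port};" for addr in servers]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical addresses, as a service-discovery lookup might return them.
    print(render_upstream("rails_app", ["10.0.1.12", "10.0.1.11", "10.0.1.12"]))
```

In a real setup the address list would come from service discovery and nginx would be reloaded only when the rendered file changes.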

Slide 27

Slide 27 text

architecture • Monitoring: Zabbix (lives in ap-northeast-1) • Connectivity to ap-northeast-1 is provided by VyOS + an IPsec tunnel • It is not perfectly redundant… but that’s acceptable because critical traffic is not allowed inside the tunnel 7/7

Slide 28

Slide 28 text

web development • Global uses GitHub.com • and CI runs on CircleCI.com • (JP uses GitHub Enterprise) • Deploy: Capistrano-based • A deploy server in us-east-1 runs Capistrano (because of latency, poor office internet, etc.)

Slide 29

Slide 29 text

Peak Traffic

Slide 30

Slide 30 text

peak traffic • JP’s peak is around Valentine’s Day • Q. Then when does the peak come for global? 1/2

Slide 31

Slide 31 text

peak traffic • It varies! • Global has several moments in a year when we expect a large increase in traffic: • Ramadan & Eid al-Adha (esp. MENA, Indonesia) • Christmas • and more 2/2

Slide 32

Slide 32 text

Ramadan • Ramadan is the ninth month of the Islamic calendar • Muslims fast from dawn until sunset during Ramadan, refraining from food • They enjoy cooking after sunset • This is the biggest occasion in MENA/Indonesia, and we expect higher traffic than usual https://en.wikipedia.org/wiki/Ramadan

Slide 33

Slide 33 text

Ramadan Preparation • We survived Ramadan 2015, but the service grew a lot between Ramadan 2015 and Ramadan 2016 • So we had to take extra care for the expected traffic in 2016. We didn’t think our infra and application could survive Ramadan without any preparation. 1/2

Slide 34

Slide 34 text

Ramadan Preparation • So here is what we did: • DB migration: RDS MySQL (standard EBS) → Amazon Aurora for MySQL • Capacity: Expanding the target of autoscaling • CDN: Switching to Fastly • App: Making a lot of performance improvements 2/2

Slide 35

Slide 35 text

Ramadan 2016 traffic (graph, annotated: Ramadan begins)

Slide 36

Slide 36 text

Ramadan 2016 traffic (graph, annotated: Ramadan ends)

Slide 37

Slide 37 text

Ramadan 2016 • No critical issues, but • Far more logs than usual: disks filled up early, so we had to review log retention or implement S3 archival • Fixing slow queries became a higher priority: their impact was much larger than usual
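
The retention review mentioned above comes down to a simple rule: keep recent logs on local disk and move older ones elsewhere. The Python sketch below only shows that decision and is not the team's actual tooling; the function name `plan_log_cleanup`, the 7-day window, and the file names are assumptions. The upload to S3 itself is left out.

```python
from datetime import date, timedelta
from typing import Dict, List


def plan_log_cleanup(log_dates: Dict[str, date], today: date,
                     keep_days: int = 7) -> Dict[str, List[str]]:
    """Split log files into those kept on disk and those to archive."""
    cutoff = today - timedelta(days=keep_days)
    plan: Dict[str, List[str]] = {"keep": [], "archive": []}
    for name in sorted(log_dates):
        # Files strictly older than the cutoff leave the local disk.
        bucket = "archive" if log_dates[name] < cutoff else "keep"
        plan[bucket].append(name)
    return plan


if __name__ == "__main__":
    # Hypothetical rotated log files and their dates.
    logs = {
        "app.log-20160610": date(2016, 6, 10),
        "app.log-20160613": date(2016, 6, 13),
        "app.log-20160619": date(2016, 6, 19),
    }
    print(plan_log_cleanup(logs, today=date(2016, 6, 20)))
```

When log volume jumps during a peak, shortening `keep_days` is the quickest lever, provided the archived files are safely uploaded before they are deleted locally.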

Slide 38

Slide 38 text

Relationships with Developers

Slide 39

Slide 39 text

Relationship with Developers • De…

Slide 40

Slide 40 text

Relationship with Developers • De… Dev… DevOps!

Slide 41

Slide 41 text

DevOps • A team of people with different cultures, languages, and skills • Building good relationships, e.g. by attending the developers’ camp • Spending a few days together with people is a good way 1/8

Slide 42

Slide 42 text

DevOps • Dashboards • Grafana (with Zabbix + CloudWatch) to share server status • Kibana: Importing SQL slow logs 2/8
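
As a rough idea of what importing slow logs into a dashboard involves, the following Python sketch turns MySQL slow query log text into structured records that could then be indexed. It is not the pipeline used in the talk. The function name `parse_slow_log` and the sample entry (including the `recipes` table) are made up; only the `# Query_time: … Lock_time: … Rows_sent: … Rows_examined: …` header layout follows the standard MySQL slow log format.

```python
import re
from typing import Dict, List

# Header line that MySQL writes before each slow statement.
HEADER = re.compile(
    r"^# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)\s+"
    r"Rows_sent: (?P<rows_sent>\d+)\s+Rows_examined: (?P<rows_examined>\d+)"
)


def parse_slow_log(text: str) -> List[Dict[str, object]]:
    """Return one dict per slow query entry found in the log text."""
    docs: List[Dict[str, object]] = []
    current = None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            current = {
                "query_time": float(m["query_time"]),
                "lock_time": float(m["lock_time"]),
                "rows_sent": int(m["rows_sent"]),
                "rows_examined": int(m["rows_examined"]),
                "sql": "",
            }
            docs.append(current)
        elif current is not None and not line.startswith(("#", "SET timestamp=")):
            # Remaining lines are the statement; join multi-line SQL with spaces.
            current["sql"] = (current["sql"] + " " + line.strip()).strip()
    return docs


# Made-up sample entry, for illustration only.
SAMPLE = """\
# Time: 2016-06-06T18:30:00.000000Z
# User@Host: app[app] @  [10.0.1.11]  Id: 42
# Query_time: 2.500000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 100000
SET timestamp=1465237800;
SELECT * FROM recipes
WHERE title LIKE '%soup%';
"""

if __name__ == "__main__":
    print(parse_slow_log(SAMPLE))
```

Having `query_time` and `rows_examined` as numeric fields is what makes it possible to sort and chart the slowest queries on a shared dashboard.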

Slide 43

Slide 43 text

DevOps

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

DevOps • Requests come in as GitHub issues • Most requests are simple operational requests… • We have to reduce simple “applications” or operations by: • delegating permissions to dev • automation • Reduce SRE blockers to enable asynchronous work, because developers live all over the world 6/8

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

Plans 2017 • There are many points to improve • Performance • Architecture • Developers’ Productivity • JP has a lot of useful things; it’s time to import those into global • Be good with developers (DevOps…!) 1/2

Slide 50

Slide 50 text

Plans 2017 • Better deploy • Docker, ECS (hako) • Dynamic staging servers • Delegation to dev • HTTP latency • CDN? • and more! 2/2

Slide 51

Slide 51 text

Conclusion • Building infrastructure that receives traffic from around the world is fun • Working in a team of people from around the world is also fun Thanks!