Building infrastructure for our global service

About Cookpad Global's Infra: https://cookpad.com/uk

Sorah Fukumori

January 21, 2017

Transcript

  1. Building infrastructure for our global service Sorah Fukumori <sorah@cookpad.com>

  2. $ whoami Sorah Fukumori https://sorah.jp/ | GitHub @sorah | Twitter @sora_h • Site Reliability Engineer at Cookpad Global • Cookpad TechConf 2017 NOC • Rubyist, Ruby committer • Interests: Site Reliability, Networking, Distributed systems
  4. Wi-Fi • Are you enjoying the internet?

  5. None
  6. None
  7. Wi-Fi • Are you enjoying the internet? • It’s provided on a best-effort basis, but let us know via Twitter #CookpadTechConf if you’re having a problem
  8. Agenda • Global? • About SRE team • Infra • Architecture • Traffic • Relationship with Developers • Plans 2017
  9. global?

  10. global

  11. global • https://cookpad.com/uk /es /ar /id /vn /sa … • Web / Android app & iOS app • 58 countries, in 15 languages
  12. global • Different code base from JP, built from scratch
  13. Terminology • Global: • Cookpad (global) https://cookpad.com/uk /es … • JP: • クックパッド (JP) https://cookpad.com/
  14. SRE team

  15. SRE team • 9 SRE members in JP • 2 members in JP are assigned to the global project
  16. SRE team • Also, we have 1 SRE member in the US • Recently joined!
  17. global infra

  18. global infra • Nothing special to mention. Currently it is just infrastructure for a plain Rails app • Doing things as usual… for now.
  19. global infra • AWS us-east-1 • Amazon Aurora for MySQL • ElastiCache (Redis & memcached) • Ruby on Rails 4.2 on Ruby 2.3 • nginx + unicorn
  20. global infra • App built from scratch, infra lives in a new region • Building better infra than the existing one, based on our past experience with AWS EC2 and VPC in Japan • e.g. JP: CentOS → Ubuntu, US: Ubuntu only • e.g. JP: weird subnetting, US: private/public subnets
  21. architecture • It’s basically simple: • ELB • EC2 (nginx) • EC2 (Rails, unicorn) • RDS (Aurora for MySQL), ElastiCache (Redis, memd)
  22. architecture • cookpad.com is shared between the global service and the JP service • But the app is running in multiple regions…?
  23. " #

  24. architecture • Route53 Latency Based Routing (ap-northeast-1 or us-east-1) • DNS returns the IP of the region closer to the resolver • If a requested service lives in the other region, reverse-proxy to that region • Also, terminating TCP/TLS as close to the user as possible is better for latency. (But serving from only 2 regions is not enough…)
  25. location ~ ^/(ae|ar|bh|bo|br|cl|co|cr|cu|de|dj|do|dz|ec|eeuu|eg|es|fr|gt|hn|hu|id|in|iq|ir|it|jo|km|kr|kw|lb|ly|ma|mr|mx|ni|om|pa|pe|ph|pri|ps|py|qa|sa|sd|so|sv|sy|th|tn|tw|uk|us|uy|ve|vn|ye)(/|$) { proxy_pass http://cookpad_use1; }
      location / { proxy_pass https://cookpad_apne1; }
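The path-based split in the nginx rules above can be mirrored in a few lines of Python, which may help when checking which region a given URL would be sent to. This is only a sketch: `GLOBAL_PREFIXES` is a shortened subset of the real prefix list, and `upstream_for` is a hypothetical helper name, not something from the talk.

```python
import re

# Subset of the country/region prefixes from the nginx location regex above.
# The real list is longer; this shortened tuple is an assumption for brevity.
GLOBAL_PREFIXES = ("ae", "ar", "es", "id", "sa", "uk", "us", "vn")

# Mirrors `^/(ae|ar|...)(/|$)`: a prefix followed by a slash or the end of the path.
GLOBAL_PATH = re.compile(r"^/(?:%s)(?:/|$)" % "|".join(GLOBAL_PREFIXES))


def upstream_for(path: str) -> str:
    """Return the upstream name that would serve `path`."""
    if GLOBAL_PATH.match(path):
        return "cookpad_use1"   # global service (us-east-1)
    return "cookpad_apne1"      # everything else: JP service (ap-northeast-1)
```

The `(/|$)` part of the pattern matters: without it a JP path such as `/user` would be mistaken for the `/us` prefix.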
  26. architecture • Rails app servers are capable of autoscaling • Using consul + consul-template to apply the latest instance list to configurations • Recent AWS Auto Scaling Groups (ASG) allow suspending actions via the API, so the global service relies on ASG (JP uses its own implementation)
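To illustrate what "applying the latest instance list to configurations" amounts to, here is a minimal Python sketch of the rendering step that consul-template would normally perform from a template. The function name `render_upstream`, the upstream name, and port 8080 are assumptions for illustration; the slides do not show the actual template.

```python
from typing import Iterable


def render_upstream(name: str, addresses: Iterable[str], port: int = 8080) -> str:
    """Render an nginx `upstream` block for the given app server addresses."""
    # Deduplicate and sort so the output is stable; an unchanged instance list
    # then produces an identical file and no needless nginx reload.
    servers = sorted(set(addresses))
    if not servers:
        # An empty upstream block is invalid in nginx, so fail loudly instead.
        raise ValueError("refusing to render an upstream with no servers")
    lines = [f"upstream {name} {{"]
    lines += [f"    server {addr}:{port};" for addr in servers]
    lines.append("}")
    return "\n".join(lines) + "\n"
```

In a real setup the address list would come from the service catalog rather than a Python list, and the rendered file would be followed by an nginx reload.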
  27. architecture • Monitoring: Zabbix (lives in ap-northeast-1) • ap-northeast-1 connectivity is provided using VyOS + IPsec tunnel • It is not perfectly redundant… but that is enough, because critical traffic is not allowed inside the tunnel
  28. web development • Global uses GitHub.com • and CI is running on CircleCI.com • (JP uses GitHub Enterprise) • Deploy: Capistrano-based • A deploy server in us-east-1 runs Capistrano (latency, poor office internet, … etc)
  29. Peak Traffic

  30. peak traffic • JP’s peak is around Valentine’s Day • Q. Then, when does the peak come for the global service?
  31. peak traffic • Various! • The global service has several moments in a year when a large increase in traffic is expected: • Ramadan & Eid al-Adha (esp. MENA, Indonesia) • Christmas • and more
  32. Ramadan • Ramadan is the ninth month of the Islamic calendar • Muslims refrain from consuming food during Ramadan while fasting from dawn until sunset • They enjoy cooking after sunset • This is the biggest occasion in MENA/Indonesia, where higher traffic than usual is expected https://en.wikipedia.org/wiki/Ramadan
  33. Ramadan Preparation • We survived Ramadan 2015, but we grew a lot between Ramadan 2015 and Ramadan 2016 • So we had to take extra care for the traffic expected in 2016. We didn’t think our infra and application could survive Ramadan without any preparation.
  34. Ramadan Preparation • So here is what we did: • DB migration: RDS MySQL (standard EBS) → Amazon Aurora for MySQL • Capacity: Expanding the target of autoscaling • CDN: Switching to Fastly • App: Making a lot of performance improvements
  35. Ramadan 2016 traffic Ramadan begins

  36. Ramadan 2016 traffic Ramadan ends

  37. Ramadan 2016 • No critical issues, but • Far more logs than usual: disks filled up early, and we had to review log retention or implement S3 archival • Fixing slow queries needed higher priority: their impact became much larger than usual
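A log retention review like the one mentioned above usually starts with deciding which files are old enough to leave the local disk. The following Python sketch shows only that selection step. The function name `logs_to_archive` and the 7-day default are assumptions, not values from the talk, and the upload to S3 is deliberately left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def logs_to_archive(mtimes: Dict[str, datetime], now: datetime,
                    keep_days: int = 7) -> List[str]:
    """Return names of log files last modified more than `keep_days` ago, oldest first."""
    cutoff = now - timedelta(days=keep_days)
    # Pair each old file with its mtime so sorting puts the oldest first.
    old = [(mtime, name) for name, mtime in mtimes.items() if mtime < cutoff]
    return [name for _, name in sorted(old)]
```

Archiving oldest-first frees disk space from the files least likely to be needed for an ongoing investigation.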
  38. Relationships with Developers

  39. Relationship with Developers • De…

  40. Relationship with Developers • De… Dev… DevOps!

  41. DevOps • A team of people with different cultures, languages, and skills • Building good relationships, e.g. by attending the developers’ camp • Spending a few days with people is a good way
  42. DevOps • Dashboards • Grafana (with Zabbix + CloudWatch) to share server status • Kibana: Importing SQL slow logs
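Before slow logs can be charted in Kibana, each entry has to become a structured record. The slides do not say how the import is done, so the Python sketch below is only one possible preprocessing step: it parses the statistics line of a standard MySQL slow query log entry. The helper name `parse_stats` is hypothetical.

```python
import re
from typing import Dict, Optional

# Matches the statistics line of a MySQL slow query log entry, e.g.
# "# Query_time: 2.500000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 100000"
STATS = re.compile(
    r"^# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)"
    r"\s+Rows_sent: (?P<rows_sent>\d+)\s+Rows_examined: (?P<rows_examined>\d+)"
)


def parse_stats(line: str) -> Optional[Dict[str, float]]:
    """Parse a slow log statistics line into numeric fields, or return None."""
    m = STATS.match(line)
    if m is None:
        return None  # not a statistics line (e.g. the SQL statement itself)
    return {
        "query_time": float(m["query_time"]),
        "lock_time": float(m["lock_time"]),
        "rows_sent": int(m["rows_sent"]),
        "rows_examined": int(m["rows_examined"]),
    }
```

Numeric fields such as `query_time` and `rows_examined` are what make it possible to sort and aggregate the slowest queries on a dashboard.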
  43. DevOps

  44. None
  45. None
  46. DevOps • Requests come in via GitHub issues • Most requests are simple operation requests… • We have to reduce simple “applications” or operations, by: • delegating permissions to dev • automation • Reduce SRE blockers to enable asynchronous work, because developers live all over the world
  47. None
  48. None
  49. Plans 2017 • There are a lot of points to improve • Performance • Architecture • Developers’ Productivity • JP has a lot of useful things; it is time to import those into global • Be good with developers (DevOps…!)
  50. Plans 2017 • Better deploy • Docker, ECS (hako) • Dynamic staging servers • Delegation to dev • HTTP latency • CDN? • and more!
  51. Conclusion • Building infrastructure that receives traffic from around the world is fun • Working in a team of people from around the world is also fun • Thanks!