
Building infrastructure for our global service

About Cookpad Global's Infra: https://cookpad.com/uk

Sorah Fukumori

January 21, 2017

Transcript

  1. $ whoami • Sorah Fukumori • https://sorah.jp/ | GitHub @sorah |

    Twitter @sora_h • Site Reliability Engineer at Cookpad Global • Cookpad TechConf 2017 NOC • Rubyist, Ruby committer • Interests: Site Reliability, Networking, Distributed systems
  3. Wi-Fi • Are you enjoying the internet? • It’s provided

    as best-effort, but let us know via Twitter #CookpadTechConf if you’re having a problem
  4. Agenda • Global? • About SRE team • Infra •

    Architecture • Traffic • Relationship with Developers • Plans 2017
  5. global • https://cookpad.com/uk /es /ar /id /vn /sa … •

    Web / Android app & iOS app • 58 countries, in 15 languages 2/3
  6. SRE team • 9 SRE members in JP • 2

    members in JP are assigned to the global project 1/2
  7. SRE team • Also, we have 1 SRE member in

    US • Recently joined! 2/2
  8. global infra • Nothing special to mention. Currently it is just

    infrastructure for a plain Rails app • Business as usual… for now. 1/3
  9. global infra • AWS us-east-1 • Amazon Aurora for MySQL

    • ElastiCache (Redis & memcached) • Ruby on Rails 4.2 on Ruby 2.3 • nginx + unicorn 2/3
  10. global infra • App built from scratch, infra lives

    in a new region • Building better infra than the existing one, based on our past experience with AWS EC2 and VPC in Japan • e.g. JP: CentOS → Ubuntu, US: Ubuntu only • e.g. JP: weird subnetting, US: private/public subnets 3/3
  11. architecture • It’s basically simple: • ELB • EC2 (nginx)

    • EC2 (Rails, unicorn) • RDS (Aurora for MySQL), ElastiCache (Redis, memcached) 1/6
  12. architecture • cookpad.com is shared between the global service and the JP

    service • But the apps run in multiple regions…? 2/7
  14. architecture • Route53 Latency Based Routing (ap-northeast-1 or us-east-1) •

    DNS returns the IP of the region closer to the resolver • If the requested service lives in the other region, reverse-proxy to that region • Also, terminating TCP/TLS as close to the user as possible is better for latency. (But serving from only 2 regions is not enough…) 4/7
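
The routing idea on this slide can be sketched roughly as follows. This is only an illustration, not Cookpad's actual configuration: the path prefixes are taken from the URLs listed earlier in the deck, while the rule that every other path belongs to the JP service, the `Decision` type, and all function names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Mapping

# Assumption: each service runs in exactly one home region.
SERVICE_HOME_REGION = {"jp": "ap-northeast-1", "global": "us-east-1"}

# Assumption: global pages live under country prefixes such as /uk;
# every other path is treated as the JP service.
GLOBAL_PREFIXES = ("/uk", "/es", "/ar", "/id", "/vn", "/sa")


@dataclass(frozen=True)
class Decision:
    action: str  # "serve_local" or "proxy"
    region: str  # region that finally serves the request


def closest_region(latency_ms: Mapping[str, float]) -> str:
    """Mimic latency-based DNS: pick the region with the lowest latency."""
    return min(latency_ms, key=lambda region: latency_ms[region])


def service_for_path(path: str) -> str:
    """Map a request path to the service that owns it (hypothetical rule)."""
    for prefix in GLOBAL_PREFIXES:
        if path == prefix or path.startswith(prefix + "/"):
            return "global"
    return "jp"


def route(path: str, entry_region: str) -> Decision:
    """Serve locally if the service lives here, else reverse-proxy."""
    home = SERVICE_HOME_REGION[service_for_path(path)]
    if home == entry_region:
        return Decision("serve_local", home)
    return Decision("proxy", home)
```

For example, a user near Tokyo asking for `/uk/...` would enter through ap-northeast-1 (TCP/TLS terminates there) and be reverse-proxied to us-east-1.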
  15. architecture • Rails app servers are capable of autoscaling •

    Using consul + consul-template to apply the latest instance list to configurations • Recent AWS Auto Scaling groups (ASG) allow suspending actions via the API, so global relies on ASG (JP uses its own implementation) 6/7
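
What "applying the latest instance list to configurations" amounts to can be sketched in plain Python. This is a stand-in for the idea, not consul-template syntax and not the team's real template: the instance record shape, the upstream name, and port 8080 are all assumptions.

```python
from typing import Iterable, Mapping


def render_upstream(name: str, instances: Iterable[Mapping], port: int = 8080) -> str:
    """Render an nginx upstream block from the currently healthy instances.

    Each instance is a hypothetical record like
    {"address": "10.0.1.4", "healthy": True}.
    """
    healthy = sorted(i["address"] for i in instances if i.get("healthy"))
    if not healthy:
        # Refuse to render an empty upstream; the caller keeps the old config.
        raise ValueError("no healthy instances")
    lines = [f"upstream {name} {{"]
    lines += [f"    server {address}:{port};" for address in healthy]
    lines.append("}")
    return "\n".join(lines) + "\n"
```

A real setup would re-render this whenever the service catalog changes and then reload nginx; sorting the addresses keeps the output stable so unchanged membership does not trigger a reload.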
  16. architecture • Monitoring: Zabbix (lives in ap-northeast-1) • Connectivity to ap-northeast-1

    is provided by a VyOS + IPsec tunnel • It lacks full redundancy… but that is acceptable because critical traffic is not allowed through the tunnel 7/7
  17. web development • Global uses GitHub.com • and CI is

    running on CircleCI.com • (JP uses GitHub Enterprise) • Deploy: Capistrano-based • A deploy server in us-east-1 runs Capistrano (because of latency, poor office internet, etc.)
  18. peak traffic • JP peaks around Valentine’s Day • Q.

    Then when does the peak come for global? 1/2
  19. peak traffic • It varies! • Global has several moments

    in the year when we expect a large increase in traffic: • Ramadan & Eid al-Adha (esp. MENA, Indonesia) • Christmas • and more 2/2
  20. Ramadan • Ramadan is the ninth month of the Islamic

    calendar • During Ramadan, Muslims fast from dawn until sunset • They enjoy cooking after sunset • This is the biggest occasion in MENA/Indonesia, and we expect higher traffic than usual https://en.wikipedia.org/wiki/Ramadan
  21. Ramadan Preparation • We survived Ramadan 2015, but the service grew

    a lot between Ramadan 2015 and Ramadan 2016 • So we had to take extra care for the traffic expected in 2016. We did not think our infra and application could survive Ramadan without preparation. 1/2
  22. Ramadan Preparation • So here is what we did: •

    DB migration: RDS MySQL (standard EBS) → Amazon Aurora for MySQL • Capacity: expanding the target of autoscaling • CDN: switching to Fastly • App: making a lot of performance improvements 2/2
  23. Ramadan 2016 • No critical issues, but • Far more logs

    than usual: disks filled up early, and we had to review log retention or implement S3 archival • Fixing slow queries became a higher priority: their impact was far larger than usual
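
A log-retention review like the one mentioned here usually comes down to splitting files by age. The sketch below is a minimal illustration under assumed inputs (file names paired with modification times, and a hypothetical `keep_days` setting); it is not the policy the team adopted, and the upload to S3 itself is left out.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def split_by_retention(
    files: Iterable[Tuple[str, datetime]],
    now: datetime,
    keep_days: int,
) -> Tuple[List[str], List[str]]:
    """Split (name, mtime) pairs into files to keep locally and files to archive.

    Files modified within the last keep_days days stay on disk; older ones
    become candidates for archival (for example to S3) and local deletion.
    """
    cutoff = now - timedelta(days=keep_days)
    keep: List[str] = []
    archive: List[str] = []
    for name, mtime in files:
        (keep if mtime >= cutoff else archive).append(name)
    return keep, archive
```

Lowering `keep_days` during a traffic peak frees disk sooner, at the cost of having to fetch older logs from the archive when debugging.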
  24. DevOps • A team of people with different cultures, languages, and

    skills • Building good relationships, e.g. by attending developers’ camps • Spending a few days with people is a good way to do so 1/8
  25. DevOps • Dashboards • Grafana (with Zabbix + CloudWatch) to

    share server status • Kibana: Importing SQL slow logs 2/8
  26. DevOps • Requests come in as GitHub issues • Most requests

    are simple operational requests… • We have to reduce simple “applications” or operations by: • delegating permissions to devs • automation • Reduce SRE blockers to enable asynchronous work, because developers live all over the world 6/8
  27. Plans 2017 • There are a lot of points to improve

    • Performance • Architecture • Developers’ Productivity • JP has a lot of useful things; time to import those into global • Work well with developers (DevOps…!) 1/2
  28. Plans 2017 • Better deploy • Docker, ECS (hako) •

    Dynamic staging servers • Delegation to dev • HTTP latency • CDN? • and more! 2/2
  29. Conclusion • Building infrastructure that receives traffic from around the

    world is fun • Working in a team with people from around the world is also fun • Thanks!