Building infrastructure for our global service

About Cookpad Global's Infra: https://cookpad.com/uk

Sorah Fukumori

January 21, 2017

Transcript

  1. Building infrastructure
    for our global service
    Sorah Fukumori

  2. $ whoami
    Sorah Fukumori

    https://sorah.jp/ | GitHub @sorah | Twitter @sora_h
    Site Reliability Engineer at Cookpad Global
    Cookpad TechConf 2017 NOC
    Rubyist, Ruby committer
    Interests: Site Reliability, Networking, Distributed systems

  5. Wi-Fi
    • Are you enjoying the internet?
    • It’s provided on a best-effort basis, but let us know on Twitter
    with #CookpadTechConf if you have a problem

  6. Agenda
    • Global?
    • About SRE team
    • Infra
    • Architecture
    • Traffic
    • Relationship with Developers
    • Plans 2017

  7. global
    • https://cookpad.com/uk /es /ar /id /vn /sa …
    • Web / Android app & iOS app
    • 58 countries, in 15 languages

  8. global
    • A different code base from JP, built from scratch

  9. Terminology
    • Global: Cookpad (global), https://cookpad.com/uk /es …
    • JP: クックパッド (JP), https://cookpad.com/

  10. SRE team
    • 9 SRE members in JP
    • 2 members in JP are assigned to the global project

  11. SRE team
    • Also, we have 1 SRE member in the US
    • Recently joined!

  12. global infra

  13. global infra
    • Nothing special to mention. Currently it is just
    infrastructure for a plain Rails app
    • Business as usual… for now.

  14. global infra
    • AWS us-east-1
    • Amazon Aurora for MySQL
    • ElastiCache (Redis & memcached)
    • Ruby on Rails 4.2 on Ruby 2.3
    • nginx + unicorn

  15. global infra
    • App built from scratch, infra lives in a new region
    • Building better infra than the existing one, based on our
    past experience with AWS EC2 and VPC in Japan
    • e.g. JP: CentOS → Ubuntu, US: Ubuntu only
    • e.g. JP: weird subnetting, US: private/public subnets

  16. architecture
    • It’s basically simple:
    • ELB
    • EC2 (nginx)
    • EC2 (Rails, unicorn)
    • RDS (Aurora for MySQL), ElastiCache (Redis, memcached)

  17. architecture
    • cookpad.com is shared between the global service and the JP
    service
    • But the apps are running in multiple regions…?

  18. architecture
    • Route53 Latency Based Routing
    (ap-northeast-1 or us-east-1)
    • DNS returns the IP of the region closer to the resolver
    • If the requested service lives in the other region, nginx
    reverse-proxies the request to that region
    • Also, terminating TCP/TLS as close to the user as possible
    is better for latency.
    (But serving from only 2 regions is not enough…)

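  19. # Paths starting with a global-service country/region prefix
    # are proxied to the cookpad_use1 upstream.
    location ~ ^/(ae|ar|bh|bo|br|cl|co|cr|cu|de|dj|do|dz|ec|eeuu|eg|es|fr|gt|hn|hu|id|in|iq|ir|it|jo|km|kr|kw|lb|ly|ma|mr|mx|ni|om|pa|pe|ph|pri|ps|py|qa|sa|sd|so|sv|sy|th|tn|tw|uk|us|uy|ve|vn|ye)(/|$) {
        proxy_pass http://cookpad_use1;
    }
    # Everything else is proxied to the cookpad_apne1 upstream.
    location / {
        proxy_pass https://cookpad_apne1;
    }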
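
The path-based routing in the nginx configuration above can be mirrored in a few lines of Python, which is handy for checking which region a given URL would be proxied to. This is only a sketch: the function name `pick_upstream` is hypothetical, while the prefix list and upstream names are copied from the slide.

```python
import re

# Country/region prefixes copied from the nginx location block on the slide.
GLOBAL_PREFIXES = (
    "ae|ar|bh|bo|br|cl|co|cr|cu|de|dj|do|dz|ec|eeuu|eg|es|fr|gt|hn|"
    "hu|id|in|iq|ir|it|jo|km|kr|kw|lb|ly|ma|mr|mx|ni|om|pa|pe|ph|pri|"
    "ps|py|qa|sa|sd|so|sv|sy|th|tn|tw|uk|us|uy|ve|vn|ye"
)

# Same shape as the nginx regex: the prefix must be followed by "/" or end of path.
GLOBAL_PATH = re.compile(rf"^/({GLOBAL_PREFIXES})(/|$)")


def pick_upstream(path: str) -> str:
    """Return the upstream name the slide's nginx config would select (hypothetical helper)."""
    if GLOBAL_PATH.match(path):
        return "cookpad_use1"
    return "cookpad_apne1"
```

Because the prefix must be followed by `/` or the end of the path, a path such as `/ukulele` does not match the `uk` prefix and falls through to the default location.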

  20. architecture
    • Rails app servers can autoscale
    • Using consul + consul-template to apply the latest
    instance list to configuration files
    • Recent AWS Auto Scaling groups (ASG) allow
    suspending actions via the API, so global relies on ASG
    (JP uses its own implementation)
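
The idea of regenerating configuration from the current instance list can be sketched in Python. This is not consul-template itself (which uses its own template syntax) and not the team's actual setup: the function `render_upstream`, the upstream name `rails_app`, the addresses, and port 8080 are all assumptions made for illustration.

```python
from __future__ import annotations


def render_upstream(name: str, addresses: list[str], port: int = 8080) -> str:
    """Render an nginx upstream block for the given instance addresses (hypothetical helper)."""
    # Sort and de-duplicate so the same set of instances always renders to the
    # same text; a caller can then skip the reload when nothing has changed.
    servers = [f"    server {addr}:{port};" for addr in sorted(set(addresses))]
    return "\n".join([f"upstream {name} {{", *servers, "}"]) + "\n"


if __name__ == "__main__":
    # Example instance list, as a service-discovery query might return it.
    print(render_upstream("rails_app", ["10.0.1.11", "10.0.1.10"]))
```

A real deployment would also need to handle an empty instance list, since nginx rejects an upstream block with no servers.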

  21. architecture
    • Monitoring: Zabbix (lives in ap-northeast-1)
    • Connectivity to ap-northeast-1 is provided by VyOS + an
    IPsec tunnel
    • It is not fully redundant… but that is acceptable because
    we don’t allow critical traffic through the tunnel

  22. web development
    • Global uses GitHub.com
    • and CI is running on CircleCI.com
    • (JP uses GitHub Enterprise)
    • Deploy: Capistrano-based
    • A deploy server in us-east-1 runs Capistrano
    (because of latency, poor office internet, etc.)

  23. Peak Traffic

  24. peak traffic
    • JP peaks around Valentine’s Day
    • Q. Then when does the peak come for global?

  25. peak traffic
    • It varies!
    • Global has several moments in a year when we
    expect a large increase in traffic:
    • Ramadan & Eid al-Adha (esp. MENA, Indonesia)
    • Christmas
    • and more

  26. Ramadan
    • Ramadan is the ninth month of the Islamic calendar
    • Muslims refrain from eating during Ramadan,
    fasting from dawn until sunset
    • They enjoy cooking after sunset
    • This is the biggest occasion in MENA/Indonesia, and we
    expect higher traffic than usual
    https://en.wikipedia.org/wiki/Ramadan

  27. Ramadan Preparation
    • We survived Ramadan 2015, but we grew a lot between
    Ramadan 2015 and Ramadan 2016
    • So we had to take extra care for the traffic expected in
    2016. We didn’t think our infra and application could
    survive Ramadan without any preparation.

  28. Ramadan Preparation
    • So here is what we did:
    • DB migration: RDS MySQL (standard EBS)
    → Amazon Aurora for MySQL
    • Capacity: expanding the target of autoscaling
    • CDN: switching to Fastly
    • App: making a lot of performance improvements

  29. Ramadan 2016 traffic
    [Traffic graph, annotated where Ramadan begins]

  30. Ramadan 2016 traffic
    [Traffic graph, annotated where Ramadan ends]

  31. Ramadan 2016
    • No critical issues, but:
    • Far more logs than usual: disks filled up early, and
    we had to review log retention or implement S3
    archival
    • Fixing slow queries became a higher priority: their
    impact was much larger than usual
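
A retention review like the one mentioned above usually comes down to one rule: anything older than N days leaves the local disk. The sketch below shows only that selection step, under assumptions. The function name `select_for_archival`, the 7-day default, and the file names are hypothetical, and the actual upload to S3 is left out.

```python
from datetime import datetime, timedelta


def select_for_archival(log_mtimes, now, retention_days=7):
    """Return names of logs last modified before the retention cutoff, oldest first.

    log_mtimes maps a file name to its last-modified datetime (hypothetical input shape).
    """
    cutoff = now - timedelta(days=retention_days)
    # Keep (mtime, name) pairs so sorting puts the oldest files first.
    expired = [(mtime, name) for name, mtime in log_mtimes.items() if mtime < cutoff]
    return [name for _, name in sorted(expired)]
```

Files modified exactly at the cutoff are kept, so the retention window is inclusive.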

  32. Relationships with Developers

  33. Relationship with Developers
    • De…

  34. Relationship with Developers
    • De… Dev… DevOps!

  35. DevOps
    • A team of people with different cultures, languages,
    and skills
    • Building good relationships, for example by attending the
    developers’ camp
    • Spending a few days with people is a good way to do that

  36. DevOps
    • Dashboards
    • Grafana (with Zabbix + CloudWatch) to share server
    status
    • Kibana: Importing SQL slow logs

  37. DevOps
    • Requests come in as GitHub issues
    • Most of them are simple operational requests…
    • We have to reduce simple “applications” and operations by:
    • delegating permissions to dev
    • automation
    • Reduce SRE blockers to enable asynchronous work,
    because developers live all over the world

  38. Plans 2017
    • There are a lot of points to improve
    • Performance
    • Architecture
    • Developers’ Productivity
    • JP has a lot of useful things, time to bring those into global
    • Get along well with developers (DevOps…!)

  39. Plans 2017
    • Better deploy
    • Docker, ECS (hako)
    • Dynamic staging servers
    • Delegation to dev
    • HTTP latency
    • CDN?
    • and more!

  40. Conclusion
    • Building infrastructure that receives traffic from
    around the world is fun
    • Being on a team with people from around the world
    is also fun
    Thanks!
