How to Dev&Ops Internal PaaS

taichi nakashima
June 29, 2015

Transcript

  1. HOW TO DEV&OPS
    INTERNAL PAAS

  2. TAICHI NAKASHIMA
    @deeeet @tcnksm

  3. INTERNAL PAAS?
    = PaaS for Rakuten engineers

  4. ONLY FOR TEST?
    = No. It receives production requests

  5. WHY PAAS?
    = Fast app experimentation and iteration in a PROD-grade environment

  6. WHY PAAS?
    = You don’t need to prepare servers by yourself

  7. WHY PAAS?
    = You don’t need to provision servers by yourself

  8. WHY PAAS?
    = You don’t need to prepare DBs by yourself

  9. WHY PAAS?
    = You can scale your app with *one command*

  10. WHY PAAS?
    = You can focus on development, not deployment

  11. WHY INTERNAL PAAS?
    = Easy to connect with other internal services

  12. WHY INTERNAL PAAS?
    = Instant support when something happens

  13. WHY INTERNAL PAAS?
    (From an organizational point of view)
    = You can reduce tooling duplicated across different teams

  14. HOW LARGE?
    How many requests? Servers? Languages?

  15. 16000 req/sec.
    All application requests

  16. 2500 instances
    1400 (PROD) + 700 (STG) + 400 (DEV)

  17. 4300 VMs
    2800 (PROD) + 1200 (STG) + 300 (DEV)

  18. +300 VMs/mon.
    Growth forecasting
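
The forecast on slides 17 and 18 can be read as simple arithmetic. A minimal sketch, assuming the growth of about 300 VMs per month stays constant (the slides do not say how the forecast is actually computed, and `projected_vms` is a hypothetical helper):

```python
# Figures taken from the slides: 4300 VMs today, about +300 VMs per month.
CURRENT_VMS = 4300
GROWTH_PER_MONTH = 300


def projected_vms(months: int) -> int:
    """Linear projection; assumes the monthly growth stays constant."""
    if months < 0:
        raise ValueError("months must be non-negative")
    return CURRENT_VMS + GROWTH_PER_MONTH * months


if __name__ == "__main__":
    for months in (0, 6, 12):
        print(f"after {months:2d} months: {projected_vms(months)} VMs")
```

Under that assumption the fleet would reach 7900 VMs after one year.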

  19. 4 supported languages
    Ruby, Node.js, Java, PHP

  20. 3 DB services
    Redis, MongoDB, Clustrix

  21. 100 Redis clusters
    230 Instances

  22. 40 components
    Components (Roles) to run PaaS

  23. 320 chef recipes
    `ls cookbooks/*/recipes | wc -l`

  24. 8 Engineers
    Dev & Ops, From 7 Countries

  25. HOW TO DEV&OPS
    INTERNAL PAAS

  28. Router, API, Health Check, Messaging, DBs, Apps

  29. DEV FLOW
    RELEASE FLOW

  31. Create Ticket on JIRA
    Write code
    Write Chef cookbook
    Test on LAB
    Create PR (Git-Flow)
    Review

  32. DEV FLOW
    RELEASE FLOW

  33. Assign release manager
    Collect all JIRA tickets
    Write internal blog
    Canary release
    Release

  34. 1 release per week
    DEV (2 days), STG (2 days), PROD (3 days)

  35. HOW TO RELEASE?
    = Chef + Capistrano

  36. RELEASE 1 SERVER

  37. Service-out
    Run Chef solo
    Run Serverspec
    Service-in

  38. Stop Load-Balancing
    Disable Health Check
    Stop monit
    Service-out
    Run Chef solo
    Run Serverspec
    Service-in
    Start monit
    Enable Health Check
    Start Load-Balancing

  39. Service-out (/etc/service-out)
    Run Chef solo
    Run Serverspec
    Service-in (/etc/service-in)

  40. Every server has the same start/stop scripts
    = the workflow is the same
    = automation is easy
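
The single-server flow from slides 37 to 40 can be sketched as an ordered list of steps. This is only an illustration: the step names come from the slides, but `release_one_server`, the callables, and the rule of stopping at the first failure (so that a server with a failed Chef run or Serverspec check stays out of service) are assumptions, not the team's actual tooling.

```python
from typing import Callable, Dict, List

# Step order as listed on slide 37.
STEP_ORDER = ["service-out", "chef-solo", "serverspec", "service-in"]


def release_one_server(actions: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run the four steps in order and stop at the first one that fails.

    Returns the names of the steps that succeeded. Because service-in is
    last, a failed Chef run or Serverspec check leaves the server out of
    service (an assumed policy, not stated in the slides).
    """
    done: List[str] = []
    for name in STEP_ORDER:
        if not actions[name]():
            break
        done.append(name)
    return done


if __name__ == "__main__":
    def ok() -> bool:
        return True

    actions = {name: ok for name in STEP_ORDER}
    print(release_one_server(actions))

    # Simulate a failing Serverspec run: service-in is never reached.
    actions["serverspec"] = lambda: False
    print(release_one_server(actions))
```

Because every server exposes the same service-out and service-in entry points, the same function could be reused for any role; only the callables would differ.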

  41. RELEASE X SERVERS

  42. cap service-out: Service-out X servers
    cap setup-role: Run Chef solo X servers, Run Serverspec X servers
    cap service-in: Service-in X servers

  43. Operation
    Role A, Role B, Role C
    VMLIST:
    170.20.20.21.RoleA
    170.20.20.22.RoleA
    170.20.20.23.RoleA
    170.20.20.24.RoleA
    170.20.20.25.RoleA
    170.20.20.26.RoleA
    170.20.20.27.RoleA
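
The VMLIST entries appear to combine an IPv4 address and a role name. A minimal sketch of how such a list could be parsed and grouped by role, assuming the `<ip>.<role>` format inferred from the slide (`parse_entry` and `group_by_role` are hypothetical helpers, not part of the team's tooling):

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_entry(entry: str) -> Tuple[str, str]:
    """Split '170.20.20.21.RoleA' into ('170.20.20.21', 'RoleA')."""
    ip, sep, role = entry.rpartition(".")
    # An IPv4 address has exactly three dots; anything else is rejected.
    if not sep or ip.count(".") != 3 or not role:
        raise ValueError(f"unexpected VMLIST entry: {entry!r}")
    return ip, role


def group_by_role(entries: List[str]) -> Dict[str, List[str]]:
    """Group the IP addresses of a VMLIST by role, keeping list order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        ip, role = parse_entry(entry)
        groups[role].append(ip)
    return dict(groups)


if __name__ == "__main__":
    vmlist = [
        "170.20.20.21.RoleA",
        "170.20.20.22.RoleA",
        "170.20.20.31.RoleB",
    ]
    print(group_by_role(vmlist))
```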

  44. cap service-out
    Parallel execution
    Operation
    Role A, Role B, Role C
    VMLIST:
    170.20.20.21.RoleA
    170.20.20.22.RoleA
    170.20.20.23.RoleA
    170.20.20.24.RoleA
    170.20.20.25.RoleA
    170.20.20.26.RoleA
    170.20.20.27.RoleA

  45. cap setup-role
    Parallel execution
    Operation
    Role A, Role B, Role C
    VMLIST:
    170.20.20.21.RoleA
    170.20.20.22.RoleA
    170.20.20.23.RoleA
    170.20.20.24.RoleA
    170.20.20.25.RoleA
    170.20.20.26.RoleA
    170.20.20.27.RoleA

  46. cap service-in
    Parallel execution
    Operation
    Role A, Role B, Role C
    VMLIST:
    170.20.20.21.RoleA
    170.20.20.22.RoleA
    170.20.20.23.RoleA
    170.20.20.24.RoleA
    170.20.20.25.RoleA
    170.20.20.26.RoleA
    170.20.20.27.RoleA

  47. cap service-out
    Parallel execution
    Operation
    Role A, Role B, Role C
    VMLIST:
    170.20.20.31.RoleB
    170.20.20.32.RoleB
    170.20.20.33.RoleB
    170.20.20.34.RoleB
    170.20.20.35.RoleB
    170.20.20.36.RoleB
    170.20.20.37.RoleB

  48. cap service-out
    Start from Canary
    Operation
    Role A, Role B, Role C
    VMLIST:
    170.20.20.21.RoleA
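
Slides 44 to 48 suggest a release that starts with one canary server and then runs the remaining servers of a role in parallel. The sketch below illustrates that idea with Python threads instead of Capistrano; `rolling_release`, the `release` callable, and the rule of aborting when the canary fails are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def rolling_release(hosts: List[str], release: Callable[[str], bool]) -> List[str]:
    """Release the first host alone as a canary, then the rest in parallel.

    Returns the hosts that were released successfully, in list order.
    If the canary fails, no other host is touched.
    """
    if not hosts:
        return []
    canary, rest = hosts[0], hosts[1:]
    if not release(canary):
        return []
    released = [canary]
    # pool.map keeps the input order, so results line up with `rest`.
    with ThreadPoolExecutor(max_workers=max(1, len(rest))) as pool:
        results = list(pool.map(release, rest))
    released += [host for host, ok in zip(rest, results) if ok]
    return released


if __name__ == "__main__":
    role_a = [f"170.20.20.{n}.RoleA" for n in range(21, 28)]
    print(rolling_release(role_a, lambda host: True))
```

In practice `release` would wrap the per-server steps (service-out, Chef solo, Serverspec, service-in), and the batch size would be limited so that a role is never taken out of service entirely.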

  49. HOW TO DEV&OPS
    INTERNAL PAAS

  50. LOGGING
    MONITORING
    ALERT HANDLING
    SUPPORT
    IAAS

  52. 700GB/day logs
    All logs produced in PaaS

  53. LOGGING IN PAAS?
    = Application logs + Component logs

  54. APPLICATION LOG?
    = PaaS should give users a way to debug

  55. Instant logs: real time
    Midterm logs: 1-2 weeks
    Longterm logs: up to 6 months
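
The three retention tiers can be expressed as a lookup on the age of a log line. A minimal sketch: the 2-week and 6-month limits follow the slide, while the 5-minute window for "real time", the 183-day approximation of 6 months, and the `log_tier` helper itself are assumptions made for the example.

```python
from datetime import timedelta

INSTANT_WINDOW = timedelta(minutes=5)   # assumption: what counts as "real time"
MIDTERM_WINDOW = timedelta(weeks=2)     # slide: 1-2 weeks
LONGTERM_WINDOW = timedelta(days=183)   # slide: up to 6 months (approximated)


def log_tier(age: timedelta) -> str:
    """Return the tier that would still hold a log line of the given age."""
    if age < timedelta(0):
        raise ValueError("age must not be negative")
    if age <= INSTANT_WINDOW:
        return "instant"
    if age <= MIDTERM_WINDOW:
        return "midterm"
    if age <= LONGTERM_WINDOW:
        return "longterm"
    return "expired"


if __name__ == "__main__":
    for age in (timedelta(seconds=5), timedelta(days=3), timedelta(days=90)):
        print(age, "->", log_tier(age))
```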

  56. Router, API, Health Check, Messaging, DBs, Apps
    Instant log

  57. Apps, Log Server, Object Storage
    Instant log, Midterm log, Longterm log

  58. Apps, Log Server, Hadoop (BigData team)
    Instant log, Midterm log, Analytics

  59. Apps, Log Server, Splunk
    Instant log, Midterm log, Dashboard

  60. COMPONENT LOG?
    = Logs that we use to debug the PaaS itself

  61. Log Server, Object Storage

  62. Log Server, Object Storage
    We can debug CF here

  63. Log Server, Object Storage
    GlusterFS, LeoFS

  64. Log Server, Object Storage
    GlusterFS

  65. LOGGING
    METRICS
    ALERT HANDLING
    SUPPORT
    IAAS

  66. OpenTSDB,
    Pandora FMS

  67. LOGGING
    METRICS
    ALERT HANDLING
    SUPPORT
    IAAS

  68. 1-week shifts, on call 24 hours
    Primary & Sub admin

  70. 2500 ✉/day
    MAX. Need to fix…

  71. LOGGING
    METRICS
    ALERT HANDLING
    SUPPORT
    IAAS

  72. JIRA, HipChat
    Instant support is one of the *good* points of an internal PaaS

  73. LOGGING
    METRICS
    ALERT HANDLING
    SUPPORT
    IAAS

  74. IAAS
    Operating PaaS also means operating IaaS

  75. vSphere

  76. HOW TO BOOT SERVERS?
    = An internal tool like Terraform

  77. Role A, vSphere, Operation
    rvc create -c rvc.yml 170.20.20.21.RoleA
    rvc.yml:
      RoleA:
        cpu: 2
        mem: 8192
    VMLIST:
      170.20.20.21.RoleA
      170.20.20.22.RoleA
      170.20.20.23.RoleA

  80. Role A, vSphere, Operation
    rvc create -c rvc.yml 170.20.20.22.RoleA
    rvc.yml:
      RoleA:
        cpu: 2
        mem: 8192
    VMLIST:
      170.20.20.21.RoleA
      170.20.20.22.RoleA
      170.20.20.23.RoleA

  81. Role A, vSphere, Operation
    rvc create -c rvc.yml 170.20.20.23.RoleA
    rvc.yml:
      RoleA:
        cpu: 2
        mem: 8192
    VMLIST:
      170.20.20.21.RoleA
      170.20.20.22.RoleA
      170.20.20.23.RoleA
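
The boot sequence on these slides runs one `rvc create` per VMLIST entry, using the per-role spec from `rvc.yml`. A minimal sketch that only builds the command lines, assuming the `<ip>.<role>` entry format and the command shape shown on the slides; `boot_commands` and `ROLE_SPECS` are hypothetical names, and nothing here calls the internal tool itself.

```python
from typing import Dict, List

# Mirrors the rvc.yml fragment shown on the slides.
ROLE_SPECS: Dict[str, Dict[str, int]] = {"RoleA": {"cpu": 2, "mem": 8192}}


def boot_commands(
    vmlist: List[str],
    specs: Dict[str, Dict[str, int]] = ROLE_SPECS,
    config_path: str = "rvc.yml",
) -> List[str]:
    """Build one 'rvc create' command line per VMLIST entry.

    Raises KeyError when an entry names a role with no spec, so a typo
    in the VMLIST is caught before any VM would be created.
    """
    commands = []
    for entry in vmlist:
        role = entry.rsplit(".", 1)[-1]
        if role not in specs:
            raise KeyError(f"no spec for role {role!r} in {config_path}")
        commands.append(f"rvc create -c {config_path} {entry}")
    return commands


if __name__ == "__main__":
    vmlist = ["170.20.20.21.RoleA", "170.20.20.22.RoleA", "170.20.20.23.RoleA"]
    for command in boot_commands(vmlist):
        print(command)
```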

  82. cap setup-role
    Role A, Operation, vSphere
    VMLIST:
      170.20.20.21.RoleA
      170.20.20.22.RoleA
      170.20.20.23.RoleA

  84. Easy to boot & setup servers
    = If there are *physical resources*

  85. FUTURE?
    = We are moving to *version 2*

  86. BE GOPHER
    Cloud Foundry is moving from Ruby to Golang

  87. NO FORK
    Everything goes to upstream

  88. BE OPEN
    Building tools as OSS

  90. NO MORE TOO MUCH ✉
    Planning to use PagerDuty + Riemann

  91. Log Server, Object Storage
    GlusterFS, LeoFS

  92. Object Storage
    LeoFS, Kafka

  93. MORE FLEXIBLE LOG STACK
    Planning to use Apache Kafka

  94. NEW METRICS STACK
    Planning to use InfluxDB + Grafana

  95. CONTAINER
    Planning to support Docker

  96. MORE HA
    Planning to run a Chaos Monkey

  97. NEW IAAS
    Migrating to OpenStack

  98. NEW IAAS
    Planning to move to a hybrid cloud

  99. WE HAVE MANY CHALLENGES

  100. WE ARE HIRING
    http://corp.rakuten.co.jp/careers/experienced/

  101. @deeeet
