Slide 1

Slide 1 text

Building infrastructure on AWS with Ruby @Bristol Cloud Native Meetup

Slide 2

Slide 2 text

Who • Takayuki Watanabe • takanabe • @takanabe_w • Cookpad Inc. • Site Reliability Engineer

Slide 3

Slide 3 text

What is Cookpad?

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

https://cookpad.com Cookpad is the largest recipe sharing service in the world

Slide 6

Slide 6 text

https://cookpad.com Originally a recipe site for Japan

Slide 7

Slide 7 text

About the global service

Slide 8

Slide 8 text

About the global service • We are currently expanding into new regions

Slide 9

Slide 9 text

About the global service • 21 languages • 67 countries • https://cookpad.com/us • https://cookpad.com/id • …

Slide 10

Slide 10 text

About the global service • Of course …

Slide 11

Slide 11 text

About the global service • We have a site for the United Kingdom! https://cookpad.com/uk

Slide 12

Slide 12 text

Walkthrough of this talk • My experiences and opinions about organization growth • How we manage our infrastructure with Ruby DSLs

Slide 13

Slide 13 text

Who builds infrastructure? Organization Story

Slide 14

Slide 14 text

Increasing number of engineers (chart: number of engineers by year) • Few engineers are in Japan • The number of users is small • Around … engineers • … of engineers are outside of Japan • The number of teams and services increases • Users increase • Around … engineers • Most engineers are in Japan • I joined Cookpad

Slide 15

Slide 15 text

Increasing number of engineers (chart: number of engineers by year, split into three stages) • Stage 1: few engineers are in Japan; the number of users is small • Stage 2: around … engineers; most engineers are in Japan • Stage 3: around … engineers; … of engineers are outside of Japan; the number of teams and services increases; users increase • I joined Cookpad

Slide 16

Slide 16 text

Selection of infrastructures and development approaches

Slide 17

Slide 17 text

Selection of infrastructures and approaches • Which platform do we use for the production environment? • PaaS, IaaS, on-premises, etc. • Which development approach (software architecture)? • Monolithic architecture, SOA, microservices, etc.

Slide 18

Slide 18 text

Platform & approaches for stage 1 • Managed PaaS like Heroku • Enables developers to focus on developing the service • We don’t need infra engineers • Monolithic Architecture • Communication is quite easy and comfortable

Slide 19

Slide 19 text

Platform & approaches for stage 2 • Cloud Service Providers like AWS, GCP, Azure • Use Virtual Machines and managed services for the production environment • Enables us to build more flexible infrastructure • Users gradually increase, so capacity scalability becomes important • Monolithic Architecture • Communication is still easy and comfortable

Slide 20

Slide 20 text

Platform & approaches for stage 3 • Cloud Service Providers like AWS, GCP, Azure • Use Containers and managed services for the production environment • Stage 2 is a warm-up period for moving to stage 3 • Enables us to build more flexible infrastructure • Capacity scalability is important • Microservice Architecture • Communication costs are high due to the number of engineers • Separation of authority and responsibilities is necessary to scale out an organization

Slide 21

Slide 21 text

Infrastructure for Cookpad

Slide 22

Slide 22 text

AWS • Most servers for our services run on AWS

Slide 23

Slide 23 text

Reasons for using AWS • We have used AWS since its early days • AWS resources can be controlled via APIs • Pioneering cloud provider in this respect • We use management tools written in Ruby for AWS • Declare AWS resources in a Ruby DSL

Slide 24

Slide 24 text

Why we use Ruby

Slide 25

Slide 25 text

ref: https://speakerdeck.com/a_matsuda/the-recipe-for-the-worlds-largest-rails-monolith

Slide 26

Slide 26 text

Most of our products are implemented in Ruby

Slide 27

Slide 27 text

Site Reliability Engineers also work primarily in Ruby

Slide 28

Slide 28 text

Tools for our infrastructures

Slide 29

Slide 29 text

Tools for our infrastructures • AWS resource management • Other server resource management • Database management tools • CDN management tools • Server configuration management (provisioning) • Deployment tools

Slide 30

Slide 30 text

AWS resource management • We codify AWS resources in Ruby DSLs • Change history can be investigated from the VCS (git, svn) • Idempotent • The current state of AWS resources should always be in sync with our code • Manual configuration changes are not allowed, to avoid chaos • Manual changes that are not reflected in the code are forcibly erased • Learning costs are low • Non-SRE engineers can also create PRs • Most of the tools have: • a dry-run feature to confirm changes before applying them • an export feature to write the current AWS state out as Ruby DSL

Slide 31

Slide 31 text

AWS resource management • We rely on codenize.tools (https://codenize.tools/) • Mainly maintained by one of our SREs • Terraform was not ready when we started to use AWS • Stable enough • Easy to use

Slide 32

Slide 32 text

AWS resource management • Without change tracking, the following resources tend to incur high operational costs: • Route53 • Route Tables for Virtual Private Cloud • Identity and Access Management • Security Groups • Elastic IP Addresses • Elastic Load Balancer • S3 Bucket Policy • CloudWatch Logs & Alarms

Slide 33

Slide 33 text

Route53 • The DNS service of AWS • Use Roadworker (https://github.com/codenize-tools/roadworker) to define the state of Route53 in a Ruby DSL

Slide 34

Slide 34 text

Route53 • (e.g.) Declaration of the "example.com" A record in Route53

    hosted_zone "example.com." do
      rrset "example.com.", "A" do
        ttl 300
        resource_records(
          "127.0.0.1",
          "127.0.0.2"
        )
      end
    end
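
To make the dry-run and idempotency ideas concrete, here is a minimal sketch of how a declaration like the one above could be evaluated into a desired state and compared with the current state. Everything in it is an assumption for illustration: `ZoneFile`, `RRSetContext`, `plan`, and the sample `current` hash are hypothetical and are not Roadworker's implementation or API.

```ruby
# Hypothetical mini-evaluator and planner. None of these class or method
# names come from Roadworker; they only mimic the DSL shown on the slide.

# Collects the attributes declared inside one `rrset` block.
class RRSetContext
  attr_reader :attrs

  def initialize
    @attrs = {}
  end

  def ttl(value)
    @attrs[:ttl] = value
  end

  def resource_records(*values)
    @attrs[:resource_records] = values
  end
end

# Evaluates the DSL text and stores every record set keyed by zone, name, type.
class ZoneFile
  attr_reader :records

  def initialize
    @records = {}
  end

  def hosted_zone(zone, &block)
    @zone = zone
    instance_eval(&block)
  end

  def rrset(name, type, &block)
    context = RRSetContext.new
    context.instance_eval(&block)
    @records[[@zone, name, type]] = context.attrs
  end
end

# Dry run: list the changes needed to make `current` equal to `desired`.
# Records that exist remotely but are not declared are scheduled for deletion.
def plan(current, desired)
  changes = []
  desired.each do |key, attrs|
    if !current.key?(key)
      changes << [:create, key]
    elsif current[key] != attrs
      changes << [:update, key]
    end
  end
  (current.keys - desired.keys).each { |key| changes << [:delete, key] }
  changes
end

dsl = <<~'DSL'
  hosted_zone "example.com." do
    rrset "example.com.", "A" do
      ttl 300
      resource_records(
        "127.0.0.1",
        "127.0.0.2"
      )
    end
  end
DSL

zone_file = ZoneFile.new
zone_file.instance_eval(dsl)
desired = zone_file.records

# Made-up current state: one drifted record and one record created by hand.
current = {
  ["example.com.", "example.com.", "A"] => { ttl: 60, resource_records: ["127.0.0.1"] },
  ["example.com.", "manual.example.com.", "A"] => { ttl: 60, resource_records: ["127.0.0.9"] }
}

dry_run = plan(current, desired)     # one update and one delete
second_run = plan(desired, desired)  # nothing left to do once in sync
```

Planning against an already synced state yields no changes, which is the idempotency property; the hand-made record shows up as a delete, which is how undeclared manual changes get erased.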

Slide 35

Slide 35 text

VPC Route Tables • Set of rules to determine where network traffic is directed • Use Mappru (https://github.com/codenize-tools/mappru) to define the state of VPC Route Tables in a Ruby DSL

Slide 36

Slide 36 text

VPC Route Tables • (e.g.) Declaration of Route Tables for vpc-12345678

    vpc "vpc-12345678" do
      route_table "foo-rt" do
        subnets "subnet-12345678"
        route destination_cidr_block: "0.0.0.0/0", gateway_id: "igw-12345678"
        route destination_cidr_block: "192.168.100.101/32", network_interface_id: "eni-12345678"
      end

      route_table "bar-rt" do
        subnets "subnet-87654321"
        route destination_cidr_block: "192.168.100.102/32", network_interface_id: "eni-87654321"
      end

      # Undefined Route Table will be ignored
    end
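
As a rough illustration of what "rules to determine where network traffic is directed" means, the sketch below looks up the target for a destination address by picking the most specific matching route (longest prefix). The `next_hop` helper is hypothetical; the routes are the ones declared for "foo-rt" above. It uses only Ruby's standard `ipaddr` library.

```ruby
require "ipaddr"

# Routes of "foo-rt" from the declaration above: CIDR block => target.
FOO_RT = {
  "0.0.0.0/0" => "igw-12345678",
  "192.168.100.101/32" => "eni-12345678"
}.freeze

# Hypothetical helper: returns the target of the most specific route that
# contains the destination, or nil when no route matches.
def next_hop(routes, destination)
  ip = IPAddr.new(destination)
  candidates = routes.select { |cidr, _target| IPAddr.new(cidr).include?(ip) }
  best = candidates.max_by { |cidr, _target| IPAddr.new(cidr).prefix }
  best && best.last
end

to_eni = next_hop(FOO_RT, "192.168.100.101")  # matched by the /32 route
to_igw = next_hop(FOO_RT, "8.8.8.8")          # falls back to the default route
```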

Slide 37

Slide 37 text

Security Groups • A Security Group is a virtual firewall that controls the traffic for one or more instances • Use Piculet (https://github.com/codenize-tools/piculet) to define the state of Security Groups in a Ruby DSL

Slide 38

Slide 38 text

Security Groups • (e.g.) Declaration of Security Groups for vpc-XXXXXXXX

    ec2 "vpc-XXXXXXXX" do
      security_group "default" do
        description "default VPC security group"

        tags(
          "key1" => "value1",
          "key2" => "value2"
        )

        ingress do
          permission :tcp, 22..22 do
            ip_ranges("0.0.0.0/0")
          end
          permission :tcp, 80..80 do
            ip_ranges("0.0.0.0/0")
          end
          permission :udp, 60000..61000 do
            ip_ranges("0.0.0.0/0")
          end
          # ESP (IP Protocol number: 50)
          permission :"50" do
            ip_ranges("0.0.0.0/0")
          end
          permission :any do
            groups(
              "any_other_group",
              "default"
            )
          end
        end

        egress do
          permission :any do
            ip_ranges("0.0.0.0/0")
          end
        end
      end

      security_group "any_other_group" do
        description "any_other_group"

        tags(
          "key1" => "value1",
          "key2" => "value2"
        )

        egress do
          permission :any do
            ip_ranges("0.0.0.0/0")
          end
        end
      end
    end

Slide 39

Slide 39 text

Elastic IP Addresses • An Elastic IP address is a static IPv4 address designed for dynamic cloud computing • Use Eipmap (https://github.com/codenize-tools/eipmap) to define the state of Elastic IP Addresses in a Ruby DSL

Slide 40

Slide 40 text

Elastic IP Addresses • (e.g.) Declaration of Elastic IP Addresses

    domain "standard" do
      ip "54.256.256.1"
      ip "54.256.256.2", :instance_id=>"i-12345678"
    end

    domain "vpc" do
      ip "54.256.256.11", :network_interface_id=>"eni-12345678", :private_ip_address=>"10.0.1.1"
      ip "54.256.256.12", :network_interface_id=>"eni-12345678", :private_ip_address=>"10.0.1.2"
      ip "54.256.256.13"
    end

Slide 41

Slide 41 text

Identity and Access Management • AWS Identity and Access Management is a web service that helps us securely control access to AWS resources • Use Miam (https://github.com/codenize-tools/miam) to define the state of IAM in a Ruby DSL

Slide 42

Slide 42 text

Identity and Access Management

    user "takayuki-watanabe", :path=>"/infra/" do
      login_profile :password_reset_required=>false

      groups(
        "Admin"
      )
    end

    group "Admin", :path => "/admin/" do
      policy "Admin" do
        {"Statement"=>[{"Effect"=>"Allow", "Action"=>"*", "Resource"=>"*"}]}
      end
    end

Slide 43

Slide 43 text

Elastic Load Balancing • Elastic Load Balancing distributes incoming application traffic across multiple targets, such as Amazon EC2 instances, containers, and IP addresses • Use Kelbim (https://github.com/codenize-tools/kelbim) to define the state of Elastic Load Balancing in a Ruby DSL

Slide 44

Slide 44 text

Elastic Load Balancing • (e.g.) Declaration of Elastic Load Balancing for vpc-XXXXXXXX

    ec2 "vpc-XXXXXXXXX" do
      load_balancer "my-load-balancer", :internal => true do
        instances(
          "nyar",
          "yog"
        ) # or `any_instances`

        listeners do
          listener [:tcp, 80] => [:tcp, 80]
          listener [:https, 443] => [:http, 80] do
            app_cookie_stickiness "CookieName"=>"20"
            ssl_negotiation ["Protocol-TLSv1", "Protocol-SSLv3", "AES256-SHA", ...]
            server_certificate "my-cert"
          end
        end

        health_check do
          target "TCP:80"
          timeout 5
          interval 30
          healthy_threshold 10
          unhealthy_threshold 2
        end

        attributes do
          access_log :enabled => true, :s3_bucket_name => "any_bucket", :s3_bucket_prefix => nil, :emit_interval => 60
          cross_zone_load_balancing :enabled => true
          connection_draining :enabled => false, :timeout => 300
        end

        subnets(
          "subnet-XXXXXXXX"
        )

        security_groups(
          "default"
        )
      end
    end

Slide 45

Slide 45 text

S3 Bucket Policy • Bucket policy and user policy are two of the access policy options available for you to grant permission to your Amazon S3 resources • Use Bukelatta (https://github.com/codenize-tools/bukelatta) to define the state of S3 Bucket Policies in a Ruby DSL

Slide 46

Slide 46 text

S3 Bucket Policy • (e.g.) Declaration of the S3 Bucket Policy for foo-bucket

    bucket "foo-bucket" do
      {
        "Version"=>"2012-10-17",
        "Id"=>"AWSConsole-AccessLogs-Policy-XXX",
        "Statement"=> [
          {
            "Sid"=>"AWSConsoleStmt-XXX",
            "Effect"=>"Allow",
            "Principal"=>{"AWS"=>"arn:aws:iam::XXX:root"},
            "Action"=>"s3:PutObject",
            "Resource"=> "arn:aws:s3:::foo-bucket/AWSLogs/XXX/*"
          }
        ]
      }
    end

Slide 47

Slide 47 text

CloudWatch Logs & Alarm • Amazon CloudWatch offers cloud monitoring services for customers of AWS resources • Use Meteorlog (https://github.com/codenize-tools/meteorlog) to define the state of CloudWatch Logs in a Ruby DSL • Use Radiosonde (https://github.com/codenize-tools/radiosonde) to define the state of CloudWatch Alarms in a Ruby DSL

Slide 48

Slide 48 text

CloudWatch Logs • (e.g.) Declaration of CloudWatch Logs streams

    log_group "/var/log/messages" do
      log_stream "my-stream"

      metric_filter "MyAppAccessCount" do
        metric :name=>"EventCount", :namespace=>"YourNamespace", :value=>"1"
      end

      metric_filter "MyAppAccessCount2" do
        filter_pattern '[ip, user, username, timestamp, request, status_code, bytes > 1000]'
        metric :name=>"EventCount2", :namespace=>"YourNamespace2", :value=>"2"
      end
    end

    log_group "/var/log/maillog" do
      log_stream "my-stream2"

      metric_filter "MyAppAccessCount" do
        filter_pattern '[..., status_code, bytes]'
        metric :name=>"EventCount3", :namespace=>"YourNamespace", :value=>"1"
      end

      metric_filter "MyAppAccessCount2" do
        filter_pattern '[ip, user, username, timestamp, request = *html*, status_code = 4*, bytes]'
        metric :name=>"EventCount4", :namespace=>"YourNamespace2", :value=>"2"
      end
    end

Slide 49

Slide 49 text

CloudWatch Alarms • (e.g.) Declaration of CloudWatch Alarms

    alarm "alarm1" do
      namespace "AWS/EC2"
      metric_name "CPUUtilization"
      dimensions "InstanceId"=>"i-XXXXXXXX"
      period 300
      statistic :average
      threshold ">=", 50.0
      evaluation_periods 1
      actions_enabled true
      alarm_actions []
      ok_actions []
      insufficient_data_actions ["arn:aws:sns:us-east-1:123456789012:my_topic"]
    end

    alarm "alarm2" do
      ...
    end
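
To show how the `threshold`, `evaluation_periods`, and the three action lists relate, here is a simplified sketch of alarm evaluation. It is an assumption-laden illustration, not how CloudWatch or Radiosonde work internally: `alarm_state` is a hypothetical function, and the rule "all of the last N datapoints breach the threshold" is a simplification.

```ruby
# Hypothetical evaluator that illustrates what the declared fields mean.
ALLOWED_OPERATORS = %w[>= > <= <].freeze

# Returns :alarm, :ok, or :insufficient_data for a list of per-period averages.
def alarm_state(datapoints, operator:, threshold:, evaluation_periods:)
  unless ALLOWED_OPERATORS.include?(operator)
    raise ArgumentError, "unsupported operator: #{operator}"
  end

  recent = datapoints.last(evaluation_periods)
  return :insufficient_data if recent.size < evaluation_periods

  breached = recent.all? { |value| value.public_send(operator, threshold) }
  breached ? :alarm : :ok
end

# Settings taken from "alarm1": threshold ">=", 50.0 and evaluation_periods 1.
settings = { operator: ">=", threshold: 50.0, evaluation_periods: 1 }

high = alarm_state([40.0, 55.0], **settings)   # last period is at or above 50.0
calm = alarm_state([55.0, 40.0], **settings)   # last period is below 50.0
empty = alarm_state([], **settings)            # no datapoints yet
```

In "alarm1" only `insufficient_data_actions` has a target, so under this reading only the third case would send a notification to the SNS topic.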

Slide 50

Slide 50 text

Database management

Slide 51

Slide 51 text

MySQL privileges • The privileges granted to a MySQL account determine which operations the account can perform • Use Gratan (https://github.com/codenize-tools/gratan) to define the state of MySQL access privileges in a Ruby DSL

Slide 52

Slide 52 text

MySQL privileges • (e.g.) Declaration of MySQL privileges

    user "bob", "%" do
      on "*.*" do
        grant "USAGE"
      end

      on "test.*", expired: '2014/10/08', identified: "PASSWORD '*ABCDEF'" do
        grant "SELECT"
        grant "INSERT"
      end

      on /^foo\.prefix_/ do
        grant "SELECT"
        grant "INSERT"
      end
    end

    user "bob", ["localhost", "192.168.%"], expired: '2014/10/10' do
      on "*.*", with: 'GRANT OPTION' do
        grant "ALL PRIVILEGES"
      end
    end
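
The `on /^foo\.prefix_/` form takes a regular expression instead of a literal target. A minimal sketch of how such a target could be expanded into concrete objects follows; it assumes the pattern is matched against "schema.table" names, and `expand_target` and the `known` list are hypothetical, not Gratan's API.

```ruby
# Hypothetical helper: expands one `on` target into the objects it covers.
# A Regexp selects matching "schema.table" names; a String is kept as-is.
def expand_target(target, known_objects)
  case target
  when Regexp then known_objects.grep(target)
  else [target]
  end
end

# Made-up list of existing tables, for illustration only.
known = ["foo.prefix_users", "foo.prefix_orders", "foo.other", "test.prefix_users"]

matched = expand_target(/^foo\.prefix_/, known)
literal = expand_target("test.*", known)
```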

Slide 53

Slide 53 text

Online schema migration • pt-online-schema-change (https://www.percona.com/doc/percona-toolkit/3.0/pt-online-schema-change.html) performs online, non-blocking schema changes to a table • Use Departure (https://github.com/departurerb/departure), which needs no DSL other than Rails' migration DSL (under trial) • Departure uses the pt-online-schema-change command-line tool of Percona Toolkit, which runs MySQL ALTER TABLE statements without downtime

Slide 54

Slide 54 text

CDN management

Slide 55

Slide 55 text

Fastly • We use Fastly as our CDN • Use codily (https://github.com/sorah/codily) to define the state of Fastly in a Ruby DSL

Slide 56

Slide 56 text

Fastly • (e.g.) Declaration of Fastly configurations

    service "foo" do
      response_object "method not allowed" do
        status "405"
        response "Method Not Allowed"
        content "405"
        content_type "text/plain"
        request_condition "request method is not GET, HEAD or FASTLYPURGE" do
          priority 10
          statement '!(req.request == "GET" || req.request == "HEAD" || req.request == "FASTLYPURGE")'
        end
      end
    end

    # is equivalent to:

    service "foo" do
      condition "request method is not GET, HEAD or FASTLYPURGE" do
        priority 10
        statement '!(req.request == "GET" || req.request == "HEAD" || req.request == "FASTLYPURGE")'
        type "REQUEST"
      end

      response_object "method not allowed" do
        status "405"
        response "Method Not Allowed"
        content "405"
        content_type "text/plain"
        request_condition "request method is not GET, HEAD or FASTLYPURGE"
      end
    end

Slide 57

Slide 57 text

Server Configuration Management

Slide 58

Slide 58 text

Server configurations • Around a thousand EC2 instances are running on AWS • We previously used Puppet for configuration management • We wanted a lightweight tool like Ansible, but also wanted a Ruby DSL

Slide 59

Slide 59 text

https://github.com/itamae-kitchen/itamae

Slide 60

Slide 60 text

Itamae • Configuration management tool inspired by Chef • An itamae (板前) is a cook in a Japanese kitchen • Chef-like Ruby DSL (but not compatible with Chef) • Simpler and lighter weight than Chef • Only recipes • Apply recipes to a local machine • Apply recipes to a remote machine over ssh • Idempotent

Slide 61

Slide 61 text

Itamae • (e.g.) A sample recipe for nginx

    package 'nginx' do
      action :install
    end

    service 'nginx' do
      action [:enable, :start]
    end

    template "/path/to/dest" do
      action :create
      source "template.erb"
      variables(message: "World")
    end

    # template.erb
    Hello, <%= @message %>
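
The `template` resource above passes `message` to the ERB file as `@message`. The following sketch shows that mechanism with Ruby's standard `erb` library, plus an idempotent file write in the spirit of "Idempotent" from the previous slide. `TemplateScope` and `converge_file` are hypothetical names for illustration; they are not Itamae's internals.

```ruby
require "erb"
require "tmpdir"

# Hypothetical stand-in for the rendering step: exposes each variable as an
# instance variable so the template can refer to it as @name.
class TemplateScope
  def initialize(variables)
    variables.each { |name, value| instance_variable_set("@#{name}", value) }
  end

  def render(source)
    ERB.new(source).result(binding)
  end
end

# Hypothetical idempotent write: touches the file only when content differs.
def converge_file(path, content)
  return :unchanged if File.exist?(path) && File.read(path) == content

  File.write(path, content)
  :updated
end

rendered = TemplateScope.new(message: "World").render("Hello, <%= @message %>")

# Applying the same content twice changes the file only the first time.
results = Dir.mktmpdir do |dir|
  path = File.join(dir, "dest")
  [converge_file(path, rendered), converge_file(path, rendered)]
end
```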

Slide 62

Slide 62 text

Deployment tools

Slide 63

Slide 63 text

Capistrano • Deploy Rails applications via Capistrano 3 • Use Capistrano::BundleRsync (https://github.com/sonots/capistrano-bundle_rsync) • A chat bot can invoke the deploy jobs via a deploy server

Slide 64

Slide 64 text

Problems for these tools • The tools explained in the previous slides work quite well at stage 2 and for personal use, even though only a limited number of SREs can apply them to production environments • But if only SREs have the privileges to use these tools, development may not scale at stage 3

Slide 65

Slide 65 text

Problems for these tools (examples) • Developers cannot update environment variables • SREs deploy them via Itamae • Developers cannot install software by themselves • SREs install it via Itamae • Developers cannot use new AWS resources quickly • SREs deploy them via Codenize tools • SREs and developers cannot work productively • Frequent ops requests to SREs become a bottleneck for development • Some authority and responsibility should be handed over to developers

Slide 66

Slide 66 text

What about containers?

Slide 67

Slide 67 text

Amazon ECS

Slide 68

Slide 68 text

Docker containers on ECS • Use Amazon ECS • ECS allows us to easily run and manage Docker-enabled applications across a cluster of EC2 instances • Use hako (https://github.com/eagletmt/hako) to deploy Docker containers onto ECS clusters • Some applications will use this container environment

Slide 69

Slide 69 text

Deployment flow with Hako • Deploy containers via hako to ECS clusters and inject necessary data • Docker images are stored in ECR • Credentials are stored in Vault • Container app definitions are managed in YAML

Slide 70

Slide 70 text

Deployment flow with Hako • (e.g.) A sample hako app definition file

    scheduler:
      type: ecs
      region: ap-northeast-1
      cluster: eagletmt
      desired_count: 2
      task_role_arn: arn:aws:iam::012345678901:role/Hello
      deployment_configuration:
        maximum_percent: 200
        minimum_healthy_percent: 50
    app:
      image: ryotarai/hello-sinatra
      memory: 128
      cpu: 256
      links:
        - redis:redis
      env:
        $providers:
          - type: file
            path: hello.env
        PORT: '3000'
        MESSAGE: '#{username}-san'
    additional_containers:
      front:
        image_tag: hako-nginx
        memory: 32
        cpu: 32
      redis:
        image_tag: redis:3.0
        cpu: 64
        memory: 512
    scripts:
      - <<: !include front.yml
        backend_port: 3000
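
In the definition above, `MESSAGE` contains the placeholder `#{username}`, and `$providers` names a file (`hello.env`) that supplies values. A rough sketch of that kind of interpolation is below. It assumes the file holds `key=value` lines and that a missing key should be an error; `parse_env_file` and `expand_env` are hypothetical helpers, not hako's code, and the sample value `hako` is made up.

```ruby
# Hypothetical parser for a key=value file such as hello.env.
def parse_env_file(text)
  text.each_line
      .map(&:strip)
      .reject { |line| line.empty? || line.start_with?("#") }
      .to_h { |line| line.split("=", 2) }
end

# Replaces every #{name} placeholder with the provided value.
# Hash#fetch raises KeyError when a placeholder has no provided value.
def expand_env(env, provided)
  env.transform_values do |value|
    value.gsub(/\#\{(\w+)\}/) { provided.fetch(Regexp.last_match(1)) }
  end
end

provided = parse_env_file("# sample content\nusername=hako\n")
env = expand_env({ "PORT" => "3000", "MESSAGE" => '#{username}-san' }, provided)
```

Failing loudly on a missing key is a design choice of this sketch: a container that starts with a literal `#{username}` in its environment is harder to diagnose than a deploy that stops early.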

Slide 71

Slide 71 text

Deployment from Slack • Invoke deploy jobs defined on Rundeck via a chat bot • Use ruboty (https://github.com/r7kamura/ruboty) for ChatOps on Slack

Slide 72

Slide 72 text

Introduction of container environments • Developers can update environment variables • The hako app YAML holds the environment variables for each application • Developers can install necessary software by themselves • Docker images include all software for each application • Developers can use new AWS resources quickly • Many AWS resources are ready for use after deploying containers to our ECS clusters • SREs and developers become more productive • Authority and responsibility are given to developers

Slide 73

Slide 73 text

Conclusions

Slide 74

Slide 74 text

Humble opinions • An organization can become big suddenly • Traditional development styles may suddenly stop working and need to change • There are technologies that support us and give us more scalable environments • (e.g.) Virtual Machine → Containers • (e.g.) Monolithic architecture → Microservice architecture • But engineers cannot change their traditional workflows suddenly • Investigation and research at stage 2 is really important for development scalability at stage 3 • At this moment, for container orchestration, Kubernetes is a better choice than ECS • Many players in the container space are joining Kubernetes and developing its ecosystem (standing on the shoulders of giants)

Slide 75

Slide 75 text

Recap • Selecting infrastructure platforms and approaches that fit the organization's stage of growth is important • Writing infrastructure in a Ruby DSL is pretty easy and works well • When an organization becomes big, traditional workflows might not work

Slide 76

Slide 76 text

Thank You