Slide 1

History of Falcon, the way to the production release. Junichi Kato (@j5ik2o). The history of failures and successes on the way to releasing "Falcon", ChatWork's Scala-based product.

Slide 2

Self Introduction ● Approximately 6 years of Scala experience. ● A backend software engineer developing the "business chat" service ChatWork. ● My job is to architect and develop ChatWork's backends. Self-introduction: I work as an engineer at ChatWork.

Slide 3

Agenda ● The history of the "Falcon" project, which was released at the end of 2016 ○ The "ChatWork" chat service ○ History of the Falcon project ■ Phase-1: Live Migration Project ■ Rebooting Falcon ■ POC (Proof of Concept) ■ Phase-2: Production Development ■ DevOps ■ Finally Released ○ Conclusion

Slide 4

About ChatWork ● ChatWork is a business chat service that replaces email and personal chat tools. ● Number of client companies ○ 124,000 companies (as of the end of January 2017) ● Countries / regions ○ 205 ● Best of Business Chat ● Support for iOS, Android, and Web ● ISO 27001 (ISMS) and ISO 27018 certified ● Functions ○ Group messaging ○ Task management ○ File sharing ○ Video conferencing

Slide 5

Scale of User Generated Data (the scale of data generated by ChatWork users) ● Rapid increase of messages! ● Counts at the 5th anniversary vs. the 6th anniversary ○ Chat rooms: 2.4 million → 4.2 million ○ Messages: 1 billion → 1.8 billion ○ Tasks: 37 million → 60 million ○ Files: 64 million → 133 million

Slide 6

Background of Developing ChatWork ● In 2010, ChatWork was developed as an internal product, built on a PHP framework. ● Development driven by business opportunities led to technical debt. ● The system could not keep up with the increasing data and load.

Slide 7

Way to Re-implementation ● Problems caused by the technical debt: delayed delivery times, system outages due to SPoFs, increasing workloads, and so on. ● The technical debt was partially paid down afterwards, but those were only stopgap countermeasures. ● Eventually we decided to re-implement the system, because it had become too difficult to extend any further. ● Of course, that was not easy.

Slide 8

We Chose Scala ● Scala won in our training camp. ● The reasons: ○ High maintainability and performance ○ Success stories of moving from dynamic typing to static typing ○ The AWS SDK for Java is the most complete ○ A good fit between Scala and real-time processing for chat ○ Even PHP engineers were able to start coding in Scala quickly

Slide 9

I Joined ChatWork ● In July 2014, I joined ChatWork for the migration to Scala. ● Approximately 6 years of Scala experience. ○ A REST API server with Play2 for a VOD service ○ A chat server with Finagle and Akka ● After that, we started the server-side Scala project at ChatWork.

Slide 10

Phase1: Live Migration Project

Slide 11

Phase1: Strategies for Migrating the Architecture ● Minimize the impact on the stable legacy system. ○ Modify existing code as little as possible. ○ Migrate to the new system without downtime for maintenance. ○ Don't migrate existing data. ● The function scope includes rooms, messages, tasks, files, and contacts.

Slide 12

Phase1: Our Project Team Structure ● Since 07/2014 ● Team structure (19 members in total) ○ Falcon Team (new server side in Scala) ■ 8 members (I belonged to this team.) ○ Phoenix Team (legacy service side in PHP) ■ 5 members ○ iOS Team (new-version iOS application) ■ 6 members ● Note: These are the final headcounts; the team grew as we hired Scala engineers.

Slide 13

Phase1: Function Scope ● Chat room (a collection of messages) ○ Creating a chat room and updating its metadata ○ Posting, updating, and deleting messages ○ Adding and removing members, and modifying their roles ○ Uploading and deleting files ○ Adding, updating, and deleting tasks ● Contact (indicates a connection between users) ○ Applying for, rejecting, and approving contacts ● FalconID (a 64-bit ID generated by distributed id-workers; see the sketch below) ○ Generating 64-bit IDs with distributed id-workers ○ Mapping old IDs to the new ones
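As an illustration of the FalconID item above, here is a minimal sketch of a Snowflake-style 64-bit id-worker in Scala. The bit layout (41-bit timestamp, 10-bit worker id, 12-bit sequence), the epoch, and the class name are assumptions; the slides do not describe FalconID's actual layout.

```scala
// Minimal sketch of a Snowflake-style 64-bit id-worker (bit layout is an assumption).
class IdWorker(workerId: Long, epoch: Long = 1400000000000L) {
  require(workerId >= 0 && workerId < 1024, "workerId must fit in 10 bits")

  private var lastTimestamp = -1L
  private var sequence      = 0L

  def nextId(): Long = synchronized {
    var now = System.currentTimeMillis()
    if (now == lastTimestamp) {
      sequence = (sequence + 1) & 0xFFF // 12-bit sequence within one millisecond
      if (sequence == 0) {              // sequence exhausted: spin until the next millisecond
        while (now <= lastTimestamp) now = System.currentTimeMillis()
      }
    } else {
      sequence = 0L
    }
    lastTimestamp = now
    // 41-bit timestamp | 10-bit worker id | 12-bit sequence
    ((now - epoch) << 22) | (workerId << 12) | sequence
  }
}
```

With a distinct workerId per node, ids generated on different workers cannot collide; mapping legacy ids onto the new FalconIDs is then a separate lookup, as noted above.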

Slide 14

Phase1: Architecture Overview ● The former "master data" remained persisted in the RDS of the legacy system; access to the master data goes through the Phoenix API. ● Falcon receives IOEvents that occur in the legacy system. With an IOEvent as the trigger, it builds the events to be delivered over the stream and the model cache in DynamoDB. ● The new client uses Falcon's external API and stream API. ● internal-api performs ID generation and ID mapping.

Slide 15

Phase1: Context Map of DDD ● The downstream customer depends on the upstream supplier. ● At planning time, the downstream behaves as the customer of the upstream; at run time, the upstream behaves as the interface supplier. ● In practice, the communication among our teams was very complicated; combined with the technical issues, it was a difficult problem. Diagram: Falcon (as Customer, Supplier), iOS Team (as Customer), ChatWork Web (as Supplier), Phoenix (as Customer/Supplier)

Slide 16

Phase1: Various Problems That Occurred ● Specification and implementation side ○ Missing specifications kept surfacing one after another. ○ Too much DynamoDB I/O cost due to overuse of secondary indexes. ○ The Phoenix API server was more overloaded than expected. ○ High ID-mapping cost. ○ Limits of the managed services' performance. ● Project side ○ The project definition was ambiguous. ○ The review of each sprint was not sufficient. ○ Integration testing between subsystems was delayed. ○ Exhaustion due to the long-term development. ● Scala itself had no major problems. The true problems were project management, function scope, and performance.

Slide 17

Phase1: Making a Tough Decision ● Reschedulings of the project occurred repeatedly around 2015, and eventually resulted in its suspension in January 2016... ● We reviewed why we had failed. ● There were many problems, but good results were also obtained. ○ The size and complexity of our challenge was concretely reconfirmed. ○ A strong team was organized to solve complex issues. ○ Our practice of Akka and DDD was deepened. In particular, we wanted to apply Akka's capabilities to our applications more effectively.

Slide 18

Rebooting the Project ● We welcomed a new leader, and the project management and strategy were totally revised. ● New project strategy ○ Build a robust architecture for an infrastructure-grade system. ○ Clarify the business and technology issues to be solved. ○ A POC is a MUST. ○ Clarify the final non-functional requirements. ■ Decrease infrastructure cost by 30% ■ 15 billion messages / month ■ 500k writes/s, 5,000k reads/s (100 times the legacy system) ○ Data migration with downtime was accepted instead of live migration, to cope with the rapidly increasing data volume.

Slide 19

POC: Objective of the Proof of Concept ● POC bootcamp (January 2016) ○ Each member prototyped and reviewed their own "My Best Falcon Application". ● Properties the system should satisfy ○ Scalability (high throughput, low latency) ○ Resiliency (no SPoF, backoff recovery) ○ Twice the number of concurrent connections and R/W throughput ○ Low cost ○ Functionality (based on DDD) ● Requirements ○ AWS ○ CQRS + Event Sourcing ○ Reactive Systems

Slide 20

POC: Verification for Risk Hedging ● Since 02/2016 ● The target scope is the messaging function, including chat rooms and members. ● CQRS+ES was adopted as the architecture, because read requests far outnumber write requests, which is characteristic of chat. ○ akka-http, akka-actor, akka-stream, akka-persistence(-query) ○ Our components are write-api, read-api, and read-model-updater. ○ The layered architecture of our applications is the Hexagonal Architecture. ● Infrastructure and middleware ○ AWS EC2, ELB ○ The deployment tool is Lightbend ConductR. ○ The write DB is Cassandra and the read DB is Aurora. ■ These DBs were selected as a temporary option because they are easy to handle with Akka; other options were chosen for production.
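As a flavor of the stack listed above, here is a minimal sketch of a stateless Read API endpoint built with akka-http; the route path, port, and response shape are assumptions, not the actual Falcon API.

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

object ReadApiSketch extends App {
  implicit val system       = ActorSystem("read-api")
  implicit val materializer = ActorMaterializer()
  implicit val ec           = system.dispatcher

  // A single stateless endpoint that would look up flattened read models in the read DB.
  val route =
    pathPrefix("rooms" / LongNumber / "messages") { roomId =>
      get {
        // Hypothetical response body; a real implementation would query the read DB here.
        complete(s"""{"roomId":$roomId,"messages":[]}""")
      }
    }

  Http().bindAndHandle(route, "0.0.0.0", 8080)
}
```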

Slide 21

POC: Architecture Overview ● The Write API uses ClusterSharding and PersistentActor for its aggregates. ● An aggregate generates domain events from the commands it receives and appends them to the write DB. ● The ReadModelUpdater consumes the domain events and constructs read models asynchronously. ● The Read API is non-clustered and stateless, and returns flattened read models. ● The application has multiple layers (Interface, UseCase, Domain, etc.) following the Hexagonal Architecture, and each layer is composed with the stream DSL of akka-stream.
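A minimal sketch of the write-side aggregate described above, using akka-persistence. The command, event, and state are hypothetical and the ClusterSharding wiring is omitted; it only illustrates the command-validate-persist flow.

```scala
import akka.actor.Props
import akka.persistence.PersistentActor

// Hypothetical command and domain event; the actual Falcon protocol is not shown in the slides.
final case class PostMessage(roomId: Long, body: String)
final case class MessagePosted(roomId: Long, messageId: Long, body: String)

class ChatRoomAggregate(roomId: Long) extends PersistentActor {
  override def persistenceId: String = s"chat-room-$roomId"

  private var lastMessageId = 0L

  // Recover in-memory state by replaying previously persisted domain events.
  override def receiveRecover: Receive = {
    case e: MessagePosted => lastMessageId = e.messageId
  }

  // Validate the command, persist the resulting domain event, then reply to the caller.
  override def receiveCommand: Receive = {
    case PostMessage(id, body) if id == roomId =>
      val event = MessagePosted(roomId, lastMessageId + 1, body)
      persist(event) { persisted =>
        lastMessageId = persisted.messageId
        sender() ! persisted
      }
  }
}

object ChatRoomAggregate {
  def props(roomId: Long): Props = Props(new ChatRoomAggregate(roomId))
}
```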

Slide 22

POC: Result of the POC (1/2) ● Instance types ○ c3.2xlarge (vCPU = 8, Mem = 15 GB) ○ Cassandra (m3.xlarge x 3) ○ Aurora (db.r3.2xlarge, 1 writer, 2 replicas) ● Throughput (from write to read) ○ Random requests ○ About 5,000 concurrent users ○ Almost linear, and scaling out is possible. ○ Zero KOs (failed requests). ● Posting messages ○ 3 nodes, 2,000 concurrent users, 2,000 rps (120k rpm); response time at most 30 ms at the 90th percentile!

Slide 23

POC: Result of the POC (2/2) ● Our adoption of akka-cluster had too many operational problems to reach production service level within a short period. ○ How to solve the split-brain problem across 2 AZs? It's impossible. ○ For our requirements, stateful actors were overkill and carried a high operational cost. ■ Stateful actors are not effective because old data is rarely retrieved. ■ ClusterSharding is mandatory for stateful actors. ○ Other methods with lower operational costs could also satisfy our requirements. ● Cassandra ○ An estimated 24 hours to re-create a failed node. ○ The data distribution using DHT and virtual nodes is not intuitive and is difficult to understand. ● Aurora ○ Write performance cannot scale well with a single master. Sharding can solve this, but requires expensive development and operation.

Slide 24

Phase2: Production Development

Slide 25

Phase2: Re-Architecture from the POC ● akka-cluster was not adopted, in order to reduce operational cost; the APIs use stateless actors. ● For the write DB, Kafka replaced Cassandra as the write storage (see the sketch below). ○ A straightforward append-only domain-event store with great produce/consume performance. ● For the read DB, HBase replaced Aurora as the read storage. ○ Auto-sharding based on the row key at the storage level; the master/slave configuration is intuitive and easy to understand. ○ The underlying HDFS is fault tolerant and easy to manage. ● We focused only on the messaging system. ○ It is the core function that many other features (e.g. tasks, files) depend on. ○ It carries the highest business risk. ○ It offers the largest business opportunity.
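A minimal sketch of the append-only write storage idea: a domain event is serialized and appended to a Kafka topic, keyed by chat-room id so that one aggregate's events stay ordered within a partition. The topic name, event type, and serialization format are assumptions.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}

object WriteStorageSketch extends App {
  // Hypothetical domain event; the real Falcon event model is not shown in the slides.
  final case class MessagePosted(roomId: Long, messageId: Long, body: String)

  // Placeholder wire format; Falcon's actual serialization is not described in the slides.
  def serialize(e: MessagePosted): Array[Byte] = s"${e.messageId}:${e.body}".getBytes("UTF-8")

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("acks", "all") // wait for all in-sync replicas before acknowledging the append

  val producer =
    new KafkaProducer[String, Array[Byte]](props, new StringSerializer, new ByteArraySerializer)

  val event = MessagePosted(roomId = 1L, messageId = 42L, body = "hello")
  // Key by chat-room id so all events of one aggregate land in the same partition, in order.
  val record = new ProducerRecord[String, Array[Byte]](
    "message-domain-events", event.roomId.toString, serialize(event))
  producer.send(record).get() // block until the append is acknowledged
  producer.close()
}
```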

Slide 26

Phase2: Our Project Structure ● Since 07/2016 ● Team structure (11 members in total) ○ Falcon Team ■ 4 members (I belong to this team.) ○ Data Migration Team ■ 1 member ○ Sparrow Team (legacy service side in PHP) ■ 3 members ○ Infrastructure Team ■ 3 members ● Note: The above are the starting members from the early stages.

Slide 27

Phase2: Architecture Overview ● Concept ○ A backend service that provides the messaging function to the legacy system. ○ The storage selection changed, but CQRS+ES was kept. ● Components ○ The ReadModelUpdater uses Kafka Streams (see the sketch below). ○ Sparrow is a mediator system bridging Falcon and the legacy system. ○ The SparrowForwarder propagates the domain events destined for the legacy system to Sparrow.
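A minimal sketch of the ReadModelUpdater on Kafka Streams (the 1.x StreamsBuilder API is assumed). The topic name, key/value types, and the read-model write are hypothetical; in Falcon the update would be written to the read DB (HBase).

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.ForeachAction

object ReadModelUpdaterSketch extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "read-model-updater")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass.getName)

  val builder = new StreamsBuilder()
  // Domain events are assumed to be keyed by chat-room id and serialized as bytes.
  builder.stream[String, Array[Byte]]("message-domain-events")
    .foreach(new ForeachAction[String, Array[Byte]] {
      override def apply(roomId: String, eventBytes: Array[Byte]): Unit = {
        // Deserialize the domain event and upsert the matching read-model row;
        // in Falcon this write would go to HBase.
        println(s"updating read model for room $roomId (${eventBytes.length} bytes)")
      }
    })

  new KafkaStreams(builder.build(), props).start()
}
```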

Slide 28

Phase2: Context Map of DDD ● A simpler DDD context map than in Phase1. ● The inter-team communication structure became simpler as well. Diagram: Web, iOS, Android (as existing Customers), Falcon (as Supplier), ChatWork (contains Sparrow) (as Customer/Supplier)

Slide 29

Phase2: Results of the Stress Test ● System configuration ○ c3.xlarge (vCPU = 4, Mem = 7.5 GB) x 7 ■ Write API x 2, Read API x 4 ■ ReadModelUpdater x 2, SparrowForwarder x 2 ● Post Message API ○ 3,000 concurrent users, mean throughput 2.6K req/s (latency 104 ms at the 95th percentile) ■ vs. a maximum of 70 req/s on the existing system (37 times the throughput) ● Get Message API ○ 1,340 concurrent users, mean throughput 1.2K req/s (latency 62.9 ms at the 95th percentile) ■ vs. a maximum of 1.3K req/s on the existing system

Slide 30

Phase2: Data Migration (1/2) ● The data migration project aimed to migrate the message data from Aurora to HBase. Minimizing service downtime was the most important mission. ● With that in mind, the migration strategy was decided as follows. ○ Basic migration ■ All data except the 4 days before the final maintenance. ○ Incremental migration ■ For INSERTs, the difference is identified by the ID increase since the previous migration. ■ For UPDATEs, the difference is identified from the binlog since the previous migration. ○ Verification after migration ■ Check that the column data in HBase matches the column data in Aurora.

Slide 31

Phase2: Data Migration (2/2) ● Data migration engine: ○ Spark ● Performance ○ Execution of the basic migration ■ 3.5 hours (1.6 billion messages, 60 million chat rooms) ○ Verification of the basic migration ■ 7.5 hours ○ Execution of the incremental migration ■ 1 hour ○ Verification of the incremental migration ■ 1 hour
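A minimal sketch of the basic-migration path with Spark: read the legacy messages from Aurora over JDBC and write them to HBase, partitioned by primary key. The table names, column names, JDBC URL, and row-key format are assumptions; the real schema is not shown in the slides.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object BasicMigrationSketch extends App {
  val spark = SparkSession.builder().appName("falcon-basic-migration").getOrCreate()

  // Read the legacy messages table from Aurora over JDBC, split into parallel partitions by id.
  val messages = spark.read.format("jdbc")
    .option("url", "jdbc:mysql://aurora-endpoint:3306/chatwork")
    .option("dbtable", "messages")
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "1800000000")
    .option("numPartitions", "256")
    .load()

  // Write each partition to HBase, opening one connection per partition.
  messages.rdd.foreachPartition { rows =>
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("message"))
    rows.foreach { row =>
      val roomId    = row.getAs[Long]("room_id")
      val messageId = row.getAs[Long]("id")
      val body      = row.getAs[String]("body")
      // Row key: zero-padded room id + message id, mirroring HBase's row-key based sharding.
      val put = new Put(Bytes.toBytes(f"$roomId%019d-$messageId%019d"))
      put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("body"), Bytes.toBytes(body))
      table.put(put)
    }
    table.close()
    conn.close()
  }

  spark.stop()
}
```

Verification after migration could follow the same pattern in reverse: re-read each HBase row and compare its columns against the Aurora source.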

Slide 32

DevOps: Improving Development Efficiency ● Existing issues ○ It isn't easy for developers to flexibly build infrastructure for application development, because collaboration with the infrastructure staff is required. That collaboration has to be made more efficient, and designs such as deployment, provisioning, and scaling need to be flexible. ● Countermeasures ○ coreos/kube-aws was adopted. ■ kube-aws is a tool and a set of installation artifacts for Kubernetes on AWS, developed by CoreOS. ● Creates, updates, and destroys Kubernetes clusters on AWS. ● Highly available and scalable Kubernetes clusters backed by multi-AZ deployment and Node Pools. ● Powered by various AWS services, including CloudFormation, KMS, Auto Scaling, Spot Fleet, EC2, ELB, S3, etc. ○ concourse/concourse was adopted. ■ Concourse is a pipeline-based CI system written in Go, developed by Pivotal. It treats build pipelines and artifacts as first-class citizens. ■ In the ThoughtWorks Technology Radar of November 2015, Concourse CI is listed in the 'Assess' category of tools.

Slide 33

DevOps: Falcon Infrastructure with kube-aws ● kubelet is the primary "node agent" that runs on each node. ● kube-proxy runs on each node. ● The API server validates and configures data for the API objects, which include pods, services, replication controllers, and others. ● A Pod is a group of one or more containers, the shared storage for those containers, and options on how to run them. ● Falcon applications are deployed as Pods via helm (the package manager for Kubernetes).

Slide 34

DevOps: Concourse CI (1/2) ● Core concepts ○ The end goal of Concourse is to provide an expressive system with as few distinct moving parts as possible. ● Resources ○ A resource is any entity that can be checked for new versions, pulled down at a specific version, and/or pushed up to idempotently create new versions. ● Jobs ○ At a high level, a job describes some actions to perform when dependent resources change (or when manually triggered). Diagram: Git Resource, Build Job, Deploy Job

Slide 35

DevOps: Concourse CI (2/2) ● Tasks ○ A task is the execution of a script in an isolated environment with the dependent resources available to it. Diagram: Build Task, Notification Task

Slide 36

Finally Released ● The final release started at midnight on December 29th, 2016, and finished 7 hours later. It succeeded! ● We are grateful for the cheering messages from the Scala community. Thank you very much! ● Performance after the release ○ As expected, Falcon achieves high throughput, low latency, and resiliency. ○ Improvements toward the final goal will continue.

Slide 37

Conclusion ● Falcon was released after many twists and turns. ● Success factors ○ Clarification of the project strategy ■ The technical means of achieving the project's goal were clarified. ○ Risk hedging through the POC ■ Verifying the potential of CQRS+ES with Akka ○ Re-architecture from the POC ■ A review that took operational costs into consideration ■ Limiting the function scope ○ Data migration that accepts downtime ○ Improving development efficiency with k8s and Concourse CI ● As a result, we succeeded in adopting an excellent architecture (CQRS+ES, Akka, Kafka, HBase) based on that verification.

Slide 38

Thank you for listening!