Slide 1

12:40 - 13:15
13 Lessons Learned From Building OA Plus Infrastructure
Poomrat Boonyawong, Site Reliability Engineer Lead, LINE Thailand

Slide 2

13 Lessons Learned From Building OA Plus Infrastructure

Slide 3

Agenda
• OA Plus Introduction
• Tech Stack
• 13 Lessons Learned

Slide 4

OA PLUS: the (extensible) platform

Slide 5

LINE MyShop

Slide 6

Thailand to Global

Slide 7

Fun Facts
● Largest account: 40M+ followers
● 8 Plugins
● Bot + Chat OK!
● (Almost) No Low Usage Time
● Thailand to Global
● 38K OAs, with a scalability target of 10X

Slide 8

OA Plus Tech Stack

Slide 9

30,000 Feet View

Slide 10

13 LESSONS LEARNED

Slide 11

1. Design for Scale From Day 1
Or you might regret a corner-cutting decision
● We started with a target of (just) 30K OAs, but the platform is still going strong without a major re-architecture
● Being stateless and horizontally scalable down to the datastore, with no single point of contention, is the key
● The micro-service pattern helps with the trade-off between redundancy and complexity

Slide 12

2. Monitor Everything
While respecting user privacy and the law
● From user behavior down to system-level interaction
● System telemetry: CPU, memory, network, storage, processes
● Component-level data: HTTP request/response, error rate, database interaction, cache behavior
● Infrastructure telemetry: DB shards, queries, cache, queues, etc.
● Product-level telemetry: how users use/navigate the product, what happens on the client side
● And make sure they are all available when needed (see the sketch below)
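As an illustration of the component-level telemetry above, a minimal sketch assuming a Python service instrumented with the prometheus_client library; the metric names, labels, port, and handler are hypothetical, not OA Plus's actual schema.

    # Illustrative only: exposes HTTP request counts and latency for Prometheus to scrape.
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("http_requests_total", "HTTP requests", ["method", "path", "status"])
    LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["method", "path"])

    def handle_request(method: str, path: str) -> int:
        start = time.time()
        status = 200  # placeholder for real handler logic
        REQUESTS.labels(method, path, str(status)).inc()
        LATENCY.labels(method, path).observe(time.time() - start)
        return status

    if __name__ == "__main__":
        start_http_server(8000)  # serves /metrics for the Prometheus scraper
        while True:
            handle_request("GET", "/api/health")
            time.sleep(5)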

Slide 13

No content

Slide 14

Also Capture Client Errors
● Users always use the product in very creative ways…
● For a sufficiently popular product, there will always be a new client configuration (e.g. Chrome extension interference); see the sketch below
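A hedged sketch of the server side of client-error capture, assuming a Flask-based collection endpoint; the /client-errors route, payload fields, and logger name are hypothetical, not OA Plus's actual API.

    # Hypothetical Flask endpoint that receives error reports posted by the web client.
    import logging
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    log = logging.getLogger("client_errors")  # hypothetical logger name

    @app.route("/client-errors", methods=["POST"])
    def collect_client_error():
        report = request.get_json(force=True, silent=True) or {}
        # Log enough context to reproduce: message, stack, page URL, user agent.
        log.error(
            "client error: %s | url=%s | ua=%s",
            report.get("message"),
            report.get("url"),
            request.headers.get("User-Agent"),
        )
        return jsonify({"received": True}), 202

On the client, a window.onerror handler (or a framework error boundary) would POST its message and stack trace to such an endpoint.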

Slide 15

No content

Slide 16

3. Use the RIGHT Datastore for the Job
● For a large-scale platform, an RDBMS can be a limiting factor
● MySQL, MongoDB, HBase, Redis, Druid, Elasticsearch and more
● There is no one-size-fits-all DB. Choose wisely

Slide 17

4. Scaling Prometheus is Worth It
All metrics at the longest possible retention
● Prometheus is inherently not horizontally scalable. As the application grows, the measurements grow at an even faster rate
● We use thanos.io with S3-compatible object storage for scaling out
● Query caching also helps the infrastructure scale much further. Have a look at Trickster on github.com (see the sketch below)
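A minimal sketch of reading long-retention metrics back, assuming a Thanos Query (or plain Prometheus) endpoint, which serves the Prometheus-compatible /api/v1/query_range API; the URL and PromQL expression are placeholders.

    # Queries a Prometheus-compatible HTTP API (Thanos Query also serves /api/v1/query_range).
    import time
    import requests

    THANOS_QUERY_URL = "http://thanos-query.example.internal:9090"  # placeholder endpoint

    def query_range(promql: str, hours: int = 24, step: str = "5m"):
        end = time.time()
        resp = requests.get(
            f"{THANOS_QUERY_URL}/api/v1/query_range",
            params={"query": promql, "start": end - hours * 3600, "end": end, "step": step},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["data"]["result"]

    if __name__ == "__main__":
        for series in query_range("sum(rate(http_requests_total[5m]))"):
            print(series["metric"], len(series["values"]), "samples")

A cache such as Trickster would sit in front of this endpoint so repeated dashboard queries do not hit the object store again.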

Slide 18

No content

Slide 19

5. Logging is a Challenge but Worth Solving
The most detailed telemetry class, at a significant cost
● You need reliable, detailed logs, kept for a sufficiently long interval, in a central location
● Logs are expensive to ingest and query. We use ELK + Hadoop, with a large Kafka cluster as the transport mechanism/event source
● Having a standard logging pattern with consistent information levels helps when you have 100s of services (see the sketch below)
● Pay attention to each node's log-shipper performance. Minimize on-node processing and parallelize the shipping process
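A minimal sketch of a standard logging pattern, using only the Python standard library to emit one JSON object per line so shippers and ELK can parse every service the same way; the field names and service name are illustrative.

    # Illustrative structured (JSON) logging: a consistent schema across services
    # avoids per-service parsing rules in the central pipeline.
    import json
    import logging
    import time

    class JsonFormatter(logging.Formatter):
        def format(self, record: logging.LogRecord) -> str:
            return json.dumps({
                "ts": time.strftime("%Y-%m-%dT%H:%M:%S%z", time.localtime(record.created)),
                "level": record.levelname,
                "service": "oa-plus-example",  # hypothetical service name
                "logger": record.name,
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler()  # a node-level shipper tails stdout from here
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])

    logging.getLogger("orders").info("order created id=%s", "12345")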

Slide 20

No content

Slide 21

6. Service Mesh is Hard
● We have leveraged Istio from the start
● 2 main benefits:
  ● A smart proxy fronting every pod
  ● Decoupling RBAC from the application
● Treat an Istio upgrade as a major change. But it can be really helpful

Slide 22

Our Current Mesh

Slide 23

7. (Core)DNS is the Heart of Many Things
● Istio can kill your DNS (if not tuned properly)
● You need DNS monitoring as a first-class citizen. DNS failures can be very hard to trace and can cause very severe outages (see the probe sketch below)
● We use CoreDNS with autopath and a node-local cache, plus custom code to run multiple local cache instances for HA
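A minimal sketch of DNS monitoring as a first-class citizen: a probe that times resolutions and exports latency and failure counts as Prometheus metrics. The probed hostnames, metric names, and port are illustrative.

    # Illustrative DNS probe: times lookups so DNS regressions show up before an outage.
    import socket
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    DNS_LATENCY = Histogram("dns_lookup_duration_seconds", "DNS lookup latency", ["host"])
    DNS_FAILURES = Counter("dns_lookup_failures_total", "DNS lookup failures", ["host"])

    PROBE_HOSTS = ["kubernetes.default.svc.cluster.local", "example.com"]  # placeholders

    def probe(host: str) -> None:
        start = time.time()
        try:
            socket.getaddrinfo(host, None)
        except socket.gaierror:
            DNS_FAILURES.labels(host).inc()
        finally:
            DNS_LATENCY.labels(host).observe(time.time() - start)

    if __name__ == "__main__":
        start_http_server(8001)  # serves /metrics for Prometheus
        while True:
            for h in PROBE_HOSTS:
                probe(h)
            time.sleep(10)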

Slide 24

No content

Slide 25

8. Kafka vs. RabbitMQ
● They are both message brokers but have lots of differences. We use both.
● Kafka is the heart of LINE messaging and has proven to scale very far. It is complex to operate and comes with some parallel-processing cost. (LINE has recently open-sourced https://github.com/line/decaton, which might be useful.)
● RabbitMQ is more mature and easier to get started with, but beware of its scaling limitations. A major benefit over Kafka is controllable retry (see the sketch below).
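A minimal sketch of what "controllable retry" means in practice with the pika RabbitMQ client: each message can be acked, requeued, or rejected individually. The queue name, connection details, and business logic are placeholders.

    # Illustrative RabbitMQ consumer: per-message ack / requeue / reject gives
    # fine-grained retry control that Kafka's offset model does not provide directly.
    import pika

    class TransientError(Exception):
        """Hypothetical marker for retryable failures."""

    def process(body: bytes) -> None:
        print("processing", body)  # placeholder for real business logic

    def on_message(ch, method, properties, body):
        try:
            process(body)
            ch.basic_ack(delivery_tag=method.delivery_tag)  # success: remove from queue
        except TransientError:
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)   # retry later
        except Exception:
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)  # to a DLX, if configured

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))  # placeholder host
    channel = connection.channel()
    channel.queue_declare(queue="orders", durable=True)  # hypothetical queue
    channel.basic_qos(prefetch_count=10)
    channel.basic_consume(queue="orders", on_message_callback=on_message)
    channel.start_consuming()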

Slide 26

Some Kafka Tips
● Prepare for poison messages. Skipping an offset in Kafka is not straightforward without a proper tool/process (see the sketch below)
● Kafka consumer offsets (stored in ZooKeeper on older setups) have an expiration. Infrequent consumers beware!
● Messages published to Kafka might still be lost. Check the behavior of your publisher library.
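A minimal sketch of skipping a poison message with the confluent-kafka Python client by committing past it explicitly; the broker address, group id, topic, and handler are placeholders.

    # Illustrative poison-message handling: on a permanently failing record,
    # commit offset + 1 so the consumer group moves past it instead of looping forever.
    from confluent_kafka import Consumer, TopicPartition

    consumer = Consumer({
        "bootstrap.servers": "kafka.example.internal:9092",  # placeholder
        "group.id": "oa-plus-example",                        # placeholder
        "enable.auto.commit": False,
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["events"])  # hypothetical topic

    def handle(value: bytes) -> None:
        print("processing", value)  # placeholder for real processing

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        try:
            handle(msg.value())
            consumer.commit(message=msg)  # normal path
        except Exception:
            # Skip the poison message by committing one offset past it.
            consumer.commit(
                offsets=[TopicPartition(msg.topic(), msg.partition(), msg.offset() + 1)],
                asynchronous=False,
            )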

Slide 27

9. Tracing is Mandatory in a Distributed System
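A hedged sketch of what basic tracing instrumentation can look like, using the OpenTelemetry Python SDK with a console exporter; in production the spans would go to a collector/backend such as Zipkin or Jaeger, and the tracer, span, and attribute names here are made up.

    # Illustrative tracing: nested spans show where time is spent across calls.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    tracer = trace.get_tracer("oa-plus-example")  # hypothetical instrumentation name

    def handle_request(order_id: str) -> None:
        with tracer.start_as_current_span("handle_request") as span:
            span.set_attribute("order.id", order_id)
            with tracer.start_as_current_span("db.query"):
                pass  # placeholder for a datastore call
            with tracer.start_as_current_span("cache.lookup"):
                pass  # placeholder for a cache call

    if __name__ == "__main__":
        handle_request("12345")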

Slide 28

10. Auto-Scaling Might Not Be Feasible (Expectation vs. Reality)

Slide 29

11. Prepare for Integration
● As the feature list grows, you will want the ability to add more people/teams
● Micro-services with a proper API gateway at the back
● Micro-frontends with a proper strategy at the front

Slide 30

12. Leverage CDN

Slide 31

13. Micro-Frontend

Slide 32

No content