13 Lessons Learned from Building OA Plus Infrastructure

LINE Developers Thailand

September 13, 2020


  1. 12:40 - 13:15: 13 Lessons Learned From Building OA Plus Infrastructure

    Poomrat Boonyawong, Site Reliability Engineer Lead, LINE Thailand
  3. Agenda

    • OA Plus Introduction
    • Tech Stack
    • 13 Lessons Learned
  4. The (extensible) platform OA PLUS

  5. LINE MyShop

  6. Thailand to Global

  7. Fun Facts

    • 38K OAs, with a scalability target of 10X
    • Largest account: 40M+ followers
    • 8 plugins
    • Bot + Chat: OK!
    • (Almost) no low-usage time
    • Thailand to Global
  8. OA Plus Tech Stack

  9. 30,000 Feet View


  11. 1. Design for Scale From Day 1 (or you might regret corner-cutting decisions)

    • We started with a target of (just) 30K OAs, and are still going strong without a major re-architecture
    • Being stateless and horizontally scalable, down to the datastore, with no single point of contention is the key
    • The micro-service pattern helps, with a trade-off in redundancy and complexity
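One way to keep the datastore free of a single point of contention is to route each account to a shard by a stable hash of its ID. The following is a minimal sketch of that idea, not OA Plus's actual routing: the shard names, `shard_for`, and `route` are hypothetical.

```python
import hashlib

# Hypothetical shard names; a real deployment would load these from config.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(account_id: str, shard_count: int) -> int:
    """Map an account ID to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because hash() is
    randomized per process, which would break routing across instances.
    """
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def route(account_id: str) -> str:
    """Return the shard name that owns the given account."""
    return SHARDS[shard_for(account_id, len(SHARDS))]
```

Because any stateless instance computes the same answer, no central lookup service is needed. Note that plain modulo routing remaps most keys when the shard count changes, so resharding needs its own plan.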
  12. 2. Monitor Everything (while respecting user privacy and the law)

    • From user behavior down to system-level interaction
    • System telemetry: CPU, memory, network, storage, process
    • Component-level data: HTTP requests/responses, error rate, database interaction, cache behavior
    • Infrastructure telemetry: DB shards, queries, cache, queue, etc.
    • Product-level telemetry: how users use and navigate the product, what happens on the client side
    • And make sure all of it is available when needed
  14. Also Capture Client Errors

    • Users always use the product in very creative ways…
    • For a sufficiently popular product, there will always be new client configurations (e.g. Chrome extension interference)
  16. 3. Use the RIGHT Datastore for the Job

    • For a large-scale platform, an RDBMS can be the limiting factor
    • MySQL, MongoDB, HBase, Redis, Druid, Elasticsearch and more
    • There is no one-size-fits-all DB. Choose wisely
  17. 4. Scaling Prometheus is Worth It (all metrics at the longest possible retention)

    • Prometheus is inherently not horizontally scalable. As the application grows, the volume of measurements grows even faster
    • We use Thanos (thanos.io) with S3-compatible object storage for scaling out
    • Query caching also helps scale the infrastructure much further. Have a look at Trickster on github.com
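The query-caching point can be illustrated with a small in-process cache for range queries. This is only a sketch of the general idea and does not describe how Trickster itself is implemented: `RangeQueryCache` and the `fetch` callback are hypothetical, and the time-to-live and step-alignment policy are assumptions.

```python
import time


class RangeQueryCache:
    """Cache range-query results keyed by (expr, start, end, step)."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch          # hypothetical callable that hits the backend
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # key -> (stored_at, result)

    def query(self, expr, start, end, step):
        # Snap the range to step boundaries so that dashboard refreshes a
        # few seconds apart produce the same cache key.
        start -= start % step
        end -= end % step
        key = (expr, start, end, step)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        result = self._fetch(expr, start, end, step)
        self._entries[key] = (now, result)
        return result
```

With many users watching the same dashboards, most refreshes then resolve from the cache instead of reaching the metrics backend.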
  19. 5. Logging is a Challenge but Worth Solving (the most detailed class of telemetry, at significant cost)

    • You need reliable, detailed logs, retained for a sufficiently long period, at a central location
    • Logs are expensive to ingest and query. We use ELK + Hadoop with a large Kafka cluster as the transport mechanism/event source
    • Having a standard logging pattern with a consistent level of information helps when you have 100s of services
    • Pay attention to each node's log shipper performance. Minimize on-node processing and parallelize the shipping process
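A standard logging pattern usually means every service emits the same fields in the same machine-readable shape. Below is a minimal sketch using Python's standard `logging` module. The field names (`ts`, `service`, `trace_id`) and the helper `build_logger` are assumptions for illustration, not the format used by OA Plus.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": self.service,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is optional; callers pass it via extra={"trace_id": ...}
            "trace_id": getattr(record, "trace_id", None),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, ensure_ascii=False)


def build_logger(service, stream=None):
    """Create a logger that writes one JSON line per event (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as `build_logger("chat-api").info("user joined", extra={"trace_id": "abc"})` then produces a single JSON line that a shipper can forward without any on-node parsing.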
  21. 6. Service Mesh is Hard (but can be really helpful)

    • We have leveraged Istio from the start
    • 2 main benefits:
      • Smart proxy fronting every pod
      • Decoupling RBAC from the application
    • Treat an Istio upgrade as a major change
  22. Our Current Mesh

  23. 7. (CORE) DNS is the Heart of Many Things

    • Istio can kill your DNS (if not tuned properly)
    • You need DNS monitoring as a first-class citizen. DNS failures can be very hard to trace and can cause very severe outages
    • We use CoreDNS with autopath and a node-local cache, with custom code to run multiple local cache instances for HA
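Treating DNS monitoring as a first-class citizen can start with something as small as a periodic resolution probe that records success and latency. The sketch below uses only the standard library. `probe_dns` and its result fields are hypothetical, and the injectable `resolver` parameter exists only so the probe can be exercised without a network.

```python
import socket
import time


def probe_dns(hostname, resolver=socket.getaddrinfo, clock=time.perf_counter):
    """Resolve a hostname once and report success, latency, and any error."""
    started = clock()
    try:
        answers = resolver(hostname, None)
        ok = len(answers) > 0
        error = None
    except socket.gaierror as exc:
        # Resolution failures are reported rather than raised, so a
        # scheduler can keep probing and export the failure as a metric.
        ok = False
        error = str(exc)
    latency_ms = (clock() - started) * 1000.0
    return {"hostname": hostname, "ok": ok, "latency_ms": latency_ms, "error": error}
```

Running such a probe from every node against both internal and external names, and alerting on failure rate and latency, helps surface DNS trouble before it appears as unexplained application errors.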
  25. 8. Kafka vs. RabbitMQ

    • Both are message brokers, but they differ in many ways. We use both
    • Kafka is the heart of LINE messaging and has proven to scale very far. It is complex to operate and comes with some parallel processing cost. (LINE recently open-sourced https://github.com/line/decaton, which might be useful)
    • RabbitMQ is more mature and easy to start with. But beware of its scaling limitations. Its major benefit over Kafka is controllable retry
  26. Some Kafka Tips

    • Prepare for poison messages. Skipping an offset in Kafka is not straightforward without the proper tools/process
    • Kafka offsets (stored in ZooKeeper) expire. Infrequent consumers beware!
    • Messages published to Kafka might still be lost. Check the behavior of your publisher library
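One common way to prepare for poison messages is to bound the number of processing attempts and park anything that still fails in a dead-letter store, so the consumer can keep advancing its offset. The sketch below models that policy with plain Python iterables rather than a real Kafka client. `consume`, `max_attempts`, and the in-memory dead-letter list are all hypothetical.

```python
def consume(messages, handler, max_attempts=3):
    """Process (offset, payload) pairs, parking poison messages.

    Returns the last offset that would be committed and the list of
    dead-lettered messages as (offset, payload, error) tuples.
    """
    dead_letters = []
    last_committed = None
    for offset, payload in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(payload)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this message but keep it for inspection,
                    # instead of blocking the partition forever.
                    dead_letters.append((offset, payload, repr(exc)))
        # Commit after success or after dead-lettering, so a restart
        # does not replay the poison message.
        last_committed = offset
    return last_committed, dead_letters
```

In a real consumer the dead-letter store would typically be another topic or a table, and the retry count and any delay between attempts would be tuned per workload.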
  27. 9. Tracing is Mandatory in a Distributed System

  28. 10. Auto-Scaling Might Not Be Feasible (Expectation vs. Reality)

  29. 11. Prepare for Integration

    • As the feature list grows, you will want the ability to add more people/teams
    • Micro-services with a proper API gateway at the back
    • Micro-frontends with a proper strategy at the front
  30. 12. Leverage CDN

  31. 13. Micro-Frontend

