
13 Lessons Learned from Building OA Plus Infrastructure

LINE Developers Thailand

September 13, 2020


Transcript

  1. 12:40 - 13:15
    Poomrat Boonyawong
    13 Lessons Learned From Building OA Plus Infrastructure
    Site Reliability Engineer Lead, LINE Thailand


  2. 13 Lessons Learned From Building OA Plus Infrastructure


  3. Agenda
    • OA Plus Introduction
    • Tech Stack
    • 13 Lessons Learned


  4. The (extensible) platform
    OA PLUS


  5. Thailand to Global


  6. ● Largest account: 40M+ followers
    ● 8 plugins
    ● Bot + Chat OK!
    ● (Almost) no low-usage time
    ● Thailand to Global
    ● 38K OAs, with a scalability target of 10X
    Fun Facts


  7. OA Plus Tech Stack


  8. 30,000 Feet View


  9. 13 LESSONS LEARNED


  10. ● We started with a target of (just) 30K OAs, but the system is still going strong without a major re-architecture
    ● Being stateless and horizontally scalable, all the way down to the datastore with no single point of contention, is the key
    ● The microservice pattern helps with the trade-off between redundancy and complexity
    Or you might regret a corner-cutting decision
    1. Design for Scale From Day 1


  11. ● From user behavior down to system-level interactions
    ● System telemetry - CPU, memory, network, storage, processes
    ● Component-level data - HTTP req/resp, error rate, database interaction, cache behavior (see the sketch below)
    ● Infrastructure telemetry - DB shards, queries, cache, queues, etc.
    ● Product-level telemetry - how users use/navigate the product, what happens on the client side
    ● And make sure it is all available when needed
    While respecting user privacy and the law
    2. Monitor Everything

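As a concrete illustration of component-level telemetry, here is a minimal sketch assuming a Python service instrumented with the prometheus_client library; the metric names, labels, port, and handler are illustrative, not OA Plus's actual instrumentation.

```python
# Minimal component-level telemetry sketch with prometheus_client.
import time
from prometheus_client import Counter, Histogram, start_http_server

HTTP_REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["method", "path", "status"],
)
DB_QUERY_SECONDS = Histogram(
    "db_query_duration_seconds",
    "Database query latency",
    ["query_name"],
)

def handle_request(method: str, path: str) -> None:
    # Time a (hypothetical) database call and record the request outcome.
    with DB_QUERY_SECONDS.labels(query_name="load_profile").time():
        time.sleep(0.01)  # stand-in for the real query
    HTTP_REQUESTS.labels(method=method, path=path, status="200").inc()

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    while True:
        handle_request("GET", "/profile")
        time.sleep(1)
```

Prometheus then scrapes the /metrics endpoint on each instance; the same counters and histograms feed both dashboards and alerting.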

  12. ● Users always use the product in very creative ways…
    ● For a sufficiently popular product, there will always be new client configurations (e.g. Chrome extension interference); see the collector sketch below
    Also Capture Client Errors

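One way to capture those client errors is a small collector endpoint that the browser reports into. This is a minimal sketch assuming Flask; the route name and payload fields are illustrative, and the browser side would POST its window.onerror / unhandledrejection data to it.

```python
# Minimal client-error collector sketch (Flask).
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/client-errors", methods=["POST"])
def collect_client_error():
    report = request.get_json(force=True, silent=True) or {}
    # Keep only fields we expect; real payloads vary wildly by browser/extension.
    event = {
        "message": report.get("message"),
        "url": report.get("url"),
        "stack": report.get("stack"),
        "user_agent": request.headers.get("User-Agent"),
    }
    app.logger.info("client_error %s", event)  # forward to the log pipeline in practice
    return jsonify({"ok": True}), 202

if __name__ == "__main__":
    app.run(port=8080)
```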

  13. ● For a large-scale platform, an RDBMS can be the limiting factor
    ● MySQL, MongoDB, HBase, Redis, Druid, Elasticsearch, and more
    ● There's no one-size-fits-all DB. Choose wisely
    3. Use the RIGHT Datastore for the Job


  14. ● Prometheus is inherently not horizontally scalable. As the application grows, the measurements grow at an even faster rate
    ● We use thanos.io with S3-compatible object storage for scaling out (see the query sketch below)
    ● Query caching also helps the infrastructure scale much further. Have a look at trickster on github.com
    4. Scaling Prometheus is Worth It
    All metrics at the longest possible retention

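Once Thanos fronts the Prometheus instances, long-retention queries go through the Thanos Querier, which speaks the standard Prometheus HTTP API. A minimal sketch, with an assumed querier URL and an illustrative PromQL expression:

```python
# Query long-retention metrics through a Thanos Querier (Prometheus HTTP API).
import time
import requests

THANOS_QUERY_URL = "http://thanos-query.example.internal:9090"  # hypothetical endpoint

def query_range(promql: str, days: int = 90, step: str = "1h") -> list:
    end = time.time()
    start = end - days * 24 * 3600
    resp = requests.get(
        f"{THANOS_QUERY_URL}/api/v1/query_range",
        params={"query": promql, "start": start, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    # 90 days would exceed a single Prometheus's local retention;
    # Thanos serves older blocks from object storage transparently.
    series = query_range("sum(rate(http_requests_total[5m]))")
    print(len(series), "series returned")
```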

  15. ● You need reliable, detailed logs, kept for a sufficiently long period, in a central location
    ● Logs are expensive to ingest and query. We use ELK + Hadoop with a large Kafka cluster as the transport mechanism/event source
    ● Having a standard logging pattern with consistent information and levels helps when you have 100s of services (see the sketch below)
    ● Pay attention to each node's log-shipper performance. Minimize on-node processing and parallelize the shipping process.
    The most detailed telemetry class, at a significant cost
    5. Logging is a Challenge but Worth Solving

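A minimal sketch of such a standardized logging pattern, assuming Python's stdlib logging with a JSON formatter; the field names and service name are illustrative. In a setup like the one above, the application only writes these lines to stdout and a node-local shipper forwards them to Kafka.

```python
# Standardized JSON log lines with consistent fields, parseable by ELK.
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    SERVICE = "oa-plus-example"  # hypothetical service name

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": round(time.time(), 3),
            "level": record.levelname,
            "service": self.SERVICE,
            "logger": record.name,
            "message": record.getMessage(),
            # Per-event context travels in extra={"ctx": {...}}.
            "ctx": getattr(record, "ctx", {}),
        })

handler = logging.StreamHandler(sys.stdout)  # the node-local shipper tails stdout
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order accepted", extra={"ctx": {"order_id": "o-123", "oa_id": "oa-42"}})
```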

  16. ● We have leveraged Istio from the start
    ● 2 main benefits:
    ● A smart proxy fronting every pod
    ● Decoupling RBAC from the application
    ● Treat an Istio upgrade as a major change.
    But it can be really helpful
    6. Service Mesh is Hard


  17. Our Current Mesh


  18. ● Istio can kill your DNS (if not tuned properly)
    ● You need DNS monitoring as a first-class citizen. DNS failures can be very hard to trace and can cause very severe outages (see the probe sketch below)
    ● We use CoreDNS with autopath and a node-local cache, plus custom code to run multiple local cache instances for HA
    7. (Core)DNS is the Heart of Many Things

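A minimal sketch of a DNS probe, assuming only the standard library; the probed names and latency threshold are illustrative. A real probe would export these results as metrics and alert on them rather than print.

```python
# Simple DNS latency/failure probe using stdlib resolution.
import socket
import time

PROBE_NAMES = ["kubernetes.default.svc.cluster.local", "example.com"]  # hypothetical targets

def probe(name):
    """Return resolution latency in seconds, or None if resolution failed."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(name, None)
    except socket.gaierror:
        return None
    return time.monotonic() - start

if __name__ == "__main__":
    while True:
        for name in PROBE_NAMES:
            latency = probe(name)
            if latency is None:
                print(f"DNS FAIL {name}")
            elif latency > 0.1:
                print(f"DNS SLOW {name} {latency:.3f}s")
        time.sleep(10)
```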

  19. ● They are both message brokers but have lots of differences. We use both.
    ● Kafka is the heart of LINE messaging and has proven to scale very far. It is complex to operate and comes with some parallel-processing cost. (LINE recently open-sourced https://github.com/line/decaton, which might be useful)
    ● RabbitMQ is more mature and easier to start with, but beware of its scaling limitations. Its major benefit over Kafka is controllable retry.
    8. Kafka vs. RabbitMQ


  20. ● Prepare for poison messages. Skipping an offset in Kafka is not straightforward without proper tooling/process (see the sketch below)
    ● Kafka offsets (stored in ZooKeeper) have an expiration. Infrequent consumers, beware!
    ● A message published to Kafka might still be lost. Check the behavior of your publisher library.
    Some Kafka Tips

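A minimal sketch of two of these tips, assuming the kafka-python client; the broker address, topic names, group id, and handle() logic are illustrative. It waits for broker acknowledgement on publish instead of fire-and-forget, and parks poison messages on a dead-letter topic before committing past them.

```python
# Poison-message skipping and acknowledged publishing with kafka-python.
import json
from kafka import KafkaConsumer, KafkaProducer
from kafka.errors import KafkaError

BROKERS = "kafka.example.internal:9092"  # hypothetical broker address
producer = KafkaProducer(bootstrap_servers=BROKERS, acks="all", retries=5)

def handle(event):
    """Stand-in for the real business logic; may raise on malformed events."""
    if "oa_id" not in event:
        raise ValueError("malformed event")

def publish(topic, event):
    # Block until the broker acknowledges the write, so lost publishes surface as errors.
    future = producer.send(topic, json.dumps(event).encode())
    try:
        future.get(timeout=10)
    except KafkaError as exc:
        print("publish failed:", exc)  # retry / alert in a real service

def consume(topic):
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=BROKERS,
        group_id="oa-plus-example",
        enable_auto_commit=False,
    )
    for msg in consumer:
        try:
            handle(json.loads(msg.value))
        except Exception:
            # Poison message: park it on a dead-letter topic instead of blocking the partition.
            publish(topic + ".dlq",
                    {"offset": msg.offset, "raw": msg.value.decode(errors="replace")})
        consumer.commit()  # committing either way moves past the poison message
```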

  21. 9. Tracing is Mandatory in a Distributed System

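The slide carries only the headline, so as an illustration: a minimal tracing sketch assuming OpenTelemetry's Python SDK with a console exporter; span and attribute names are illustrative, and a real deployment would export to the tracing backend and propagate context across services.

```python
# Minimal distributed-tracing instrumentation with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("oa-plus-example")  # hypothetical service name

def send_broadcast(oa_id: str) -> None:
    # Parent span for the whole request; child spans mark each downstream hop,
    # so one trace shows where time goes across services.
    with tracer.start_as_current_span("send_broadcast") as span:
        span.set_attribute("oa.id", oa_id)
        with tracer.start_as_current_span("load_audience"):
            pass  # stand-in for a database call
        with tracer.start_as_current_span("enqueue_messages"):
            pass  # stand-in for a Kafka publish

if __name__ == "__main__":
    send_broadcast("oa-42")
```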

  22. 10. Auto-Scaling Might Not Be Feasible
    (Charts: expectation vs. reality)


  23. ● As the feature list grows, you will want the ability to add more people/teams
    ● Microservices with a proper API gateway at the back
    ● Micro-frontends with a proper strategy at the front
    11. Prepare for Integration


  24. 12. Leverage CDN


  25. 13. Micro-Frontend
