13 Lessons Learned from Building OA Plus Infrastructure

LINE Developers Thailand

September 13, 2020

Transcript

  1. 13 Lessons Learned From Building OA Plus Infrastructure

    12:40 - 13:15
    Poomrat Boonyawong, Site Reliability Engineer Lead, LINE Thailand
  2. Fun Facts

    • Largest account: 40M+ followers
    • 8 plugins
    • Bot + Chat OK!
    • (Almost) no low-usage time
    • Thailand to global
    • 38K OAs, with a scalability target of 10X
  3. Lesson 1: Design for Scale From Day 1

    Or you might regret corner-cutting decisions
    • We started with a target of (just) 30K OAs, but the system is still going strong without a major re-architecture
    • The key is a stateless, horizontally scalable design, all the way down to the datastore, with no single point of contention
    • The micro-service pattern helps, at the cost of a trade-off between redundancy and complexity
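
One way to read "horizontally scalable down to the datastore" is that any stateless service instance can work out, on its own, which data shard owns a given OA. The sketch below illustrates that idea only; the shard names and the `shard_for` helper are hypothetical and are not taken from the talk.

```python
import hashlib

# Hypothetical shard names, for illustration only.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(oa_id: str, shards=SHARDS) -> str:
    """Map an OA id to a shard using a stable hash.

    Every instance computes the same answer without shared state,
    so there is no central lookup service to contend on.
    """
    digest = hashlib.sha256(oa_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

Plain modulo hashing moves most keys when the number of shards changes, so a real deployment would more likely use consistent hashing or a pre-split key space.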
  4. Lesson 2: Monitor Everything

    While respecting user privacy and the law
    • From user behavior down to system-level interaction
    • System telemetry: CPU, memory, network, storage, process
    • Component-level data: HTTP request/response, error rate, database interaction, cache behavior
    • Infrastructure telemetry: DB shards, queries, cache, queue, etc.
    • Product-level telemetry: how users use and navigate the product, what happens on the client side
    • And make sure all of it is available when needed
  5. Also Capture Client Error

    • Users always use the product in very creative ways…
    • For a sufficiently popular product, there will always be new client configurations (e.g. Chrome extension interference)
  6. Lesson 3: Use the RIGHT Datastore for the Job

    • For a large-scale platform, an RDBMS can be the limiting factor
    • MySQL, MongoDB, HBase, Redis, Druid, Elasticsearch and more
    • There is no one-size-fits-all DB. Choose wisely
  7. Lesson 4: Scaling Prometheus is Worth it

    All metrics at the longest possible retention
    • Prometheus is inherently not horizontally scalable. As the application grows, the volume of metrics grows even faster
    • We use thanos.io with S3-compatible object storage for scaling out
    • Query caching also helps scale the infrastructure much further. Have a look at Trickster on github.com
  8. Lesson 5: Logging is a Challenge but Worth Solving

    The most detailed class of telemetry, at significant cost
    • You need reliable, detailed logs, kept for a sufficiently long period, at a central location
    • Logs are expensive to ingest and query. We use ELK + Hadoop, with a large Kafka cluster as the transport mechanism/event source
    • A standard logging pattern with a consistent information level helps when you have 100s of services
    • Pay attention to each node's log shipper performance. Minimize on-node processing and parallelize the shipping process
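
As an illustration of a "standard logging pattern with consistent information level", the sketch below emits one JSON object per log line with the same fields for every service. The field names (`ts`, `level`, `service`, `logger`, `message`) and the `build_logger` helper are assumptions made for this example, not the format used by the OA Plus team.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record as a single JSON line with a fixed set of fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service  # hypothetical service name stamped on each line

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": self.service,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

Keeping the line format identical across services means the central pipeline can index the same fields everywhere, and the shipper on each node does not need per-service parsing rules.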
  9. Lesson 6: Service Mesh is Hard

    But can be really helpful
    • We have leveraged Istio from the start
    • 2 main benefits:
        • Smart proxy fronting every pod
        • Decoupling RBAC from the application
    • Treat an Istio upgrade as a major change
  10. Lesson 7: (CORE) DNS is the Heart of Many Things

    • Istio can kill your DNS (if not tuned properly)
    • You need DNS monitoring as a first-class citizen. DNS failures can be very hard to trace and can cause very severe outages
    • We use CoreDNS with autopath and a node-local cache, plus custom code to run multiple local cache instances for HA
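
A minimal sketch of what treating DNS monitoring as a first-class citizen could mean in practice: a probe that resolves a name and records success and latency, so the result can be exported as a metric. The `probe_dns` function and the `DnsProbeResult` type are hypothetical names for this example; the talk does not describe how the team's monitoring is implemented.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DnsProbeResult:
    hostname: str
    ok: bool
    latency_ms: float
    error: Optional[str] = None


def probe_dns(hostname: str, resolver: Callable = socket.getaddrinfo) -> DnsProbeResult:
    """Resolve `hostname` once and report whether it worked and how long it took.

    `resolver` is injectable so the probe can be tested without a network.
    """
    start = time.perf_counter()
    try:
        resolver(hostname, None)
        ok, error = True, None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        ok, error = False, str(exc)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return DnsProbeResult(hostname, ok, latency_ms, error)
```

Running such a probe on a schedule from every node, against both internal and external names, makes a slow or failing resolver visible before applications start timing out.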
  11. Lesson 8: Kafka vs. RabbitMQ

    • Both are message brokers, but they differ in many ways. We use both
    • Kafka is the heart of LINE messaging and has proven to scale very far. It is complex to operate and comes with some parallel processing cost. (LINE recently open-sourced https://github.com/line/decaton, which might be useful)
    • RabbitMQ is more mature and easy to start with, but beware of its scaling limitations. Its major benefit over Kafka is controllable retry
  12. Some Kafka Tips

    • Prepare for poison messages. Skipping an offset in Kafka is not straightforward without a proper tool/process
    • Kafka offsets (stored in Zookeeper) expire. Infrequent consumers beware!
    • A message published to Kafka might still be lost. Check the behavior of your publisher library
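
One common way to prepare for poison messages is to retry a bounded number of times and then park the message instead of blocking the partition. The sketch below shows that control flow on an in-memory list of `(offset, payload)` pairs; it does not use a Kafka client, and `consume`, `ConsumeResult` and `max_attempts` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ConsumeResult:
    processed: List[int] = field(default_factory=list)      # offsets handled successfully
    dead_lettered: List[int] = field(default_factory=list)  # offsets parked after repeated failure


def consume(
    messages: List[Tuple[int, bytes]],
    handler: Callable[[bytes], None],
    max_attempts: int = 3,
) -> ConsumeResult:
    """Process messages in order; park a message once it has failed max_attempts times."""
    result = ConsumeResult()
    for offset, payload in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(payload)
                result.processed.append(offset)
                break
            except Exception:
                if attempt == max_attempts:
                    # Give up on this message and move on, so one bad
                    # payload cannot stall everything behind it.
                    result.dead_lettered.append(offset)
    return result
```

In a real consumer the parked message would typically be published to a separate dead-letter topic, and the offset committed only after that publish succeeds.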
  13. Lesson 11: Prepare for Integration

    • As the feature list grows, you will want the ability to add more people/teams
    • Micro-services with a proper API gateway at the back
    • Micro-frontends with a proper strategy at the front