13 Lessons Learned from Building OA Plus Infrastructure

LINE Developers Thailand

September 13, 2020

Transcript

  1. 12:40 - 13:15
    Poomrat Boonyawong
    Site Reliability Engineer Lead, LINE Thailand
    13 Lessons Learned From Building OA Plus Infrastructure

  2. 13 Lessons Learned From Building OA Plus Infrastructure

  3. Agenda
    • OA Plus Introduction
    • Tech Stack
    • 13 Lessons Learned

  4. OA PLUS: The (extensible) platform

  5. LINE MyShop

  6. Thailand to Global

  7. Fun Facts
    ● Largest account: 40M+ followers
    ● 8 Plugins
    ● Bot + Chat OK!
    ● (Almost) No Low Usage Time
    ● Thailand to Global
    ● 38K OAs with a scalability target of 10X

  8. OA Plus Tech Stack

  9. 30,000 Feet View

  10. 13 LESSONS LEARNED

  11. 1. Design for Scale From Day 1
    Or you might regret a corner-cutting decision
    ● We started with a target of (just) 30K OAs, but the platform is still going
    strong without a major re-architecture
    ● Being stateless and horizontally scalable all the way down to the datastore,
    with no single point of contention, is the key
    ● The micro-service pattern helps with the trade-off between redundancy and
    complexity

  12. 2. Monitor Everything
    While respecting user privacy and the law
    ● From user behavior down to system-level interaction
    ● System telemetry - CPU, memory, network, storage, processes
    ● Component-level data - HTTP request/response, error rate, database
    interaction, cache behavior
    ● Infrastructure telemetry - DB shards, queries, cache, queues, etc.
    ● Product-level telemetry - how users use/navigate the product and what
    happens on the client side
    ● And make sure all of it is available when needed
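The deck does not show instrumentation code; as a minimal illustration of the component-level telemetry above (request counts, error rate, latency), here is a sketch using the Python prometheus_client library. The metric names, labels, and simulated handler are assumptions, not OA Plus internals.

    # Sketch: exposing component-level metrics (requests, errors, latency)
    # via prometheus_client. Metric/label names are illustrative only.
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    HTTP_REQUESTS = Counter(
        "http_requests_total", "HTTP requests processed", ["method", "path", "status"]
    )
    HTTP_LATENCY = Histogram(
        "http_request_duration_seconds", "HTTP request latency", ["method", "path"]
    )

    def handle_request(method: str, path: str) -> None:
        start = time.perf_counter()
        status = "200" if random.random() > 0.05 else "500"  # simulated outcome
        HTTP_LATENCY.labels(method, path).observe(time.perf_counter() - start)
        HTTP_REQUESTS.labels(method, path, status).inc()

    if __name__ == "__main__":
        start_http_server(8000)  # Prometheus scrapes /metrics on this port
        while True:
            handle_request("GET", "/api/v1/followers")
            time.sleep(0.1)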

  14. Also Capture Client Errors
    ● Users always use the product in very creative ways…
    ● For a sufficiently popular product, there will always be a new client
    configuration (e.g. Chrome extension interference)
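The slide does not show how OA Plus collects these reports; one common approach is a small collector endpoint that browsers POST to from a window.onerror handler. A hedged Flask sketch, with made-up endpoint and field names:

    # Sketch of a client-error collector endpoint (names are hypothetical).
    import logging

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("client-errors")

    @app.route("/client-errors", methods=["POST"])
    def collect_client_error():
        report = request.get_json(silent=True) or {}
        # Log only what is needed for debugging; keep payloads free of personal data
        log.info(
            "client_error message=%r url=%r user_agent=%r",
            report.get("message"),
            report.get("url"),
            request.headers.get("User-Agent"),
        )
        return jsonify({"received": True}), 202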

  16. 3. Use the RIGHT Datastore for the Job
    ● For a large-scale platform, an RDBMS can be a limiting factor
    ● MySQL, MongoDB, HBase, Redis, Druid, Elasticsearch, and more
    ● There is no one-size-fits-all DB. Choose wisely

  17. 4. Scaling Prometheus Is Worth It
    All metrics at the longest possible retention
    ● Prometheus is inherently not horizontally scalable, and as the application
    grows, the measurements grow even faster
    ● We use thanos.io with S3-compatible object storage to scale out
    ● Query caching also helps the infrastructure scale much further. Have a look
    at Trickster on github.com
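Thanos Query exposes the standard Prometheus HTTP API, so long-retention data can be queried the same way as a single Prometheus instance. An illustrative Python query; the service URL and metric name are assumptions, not OA Plus's real setup.

    # Sketch: querying a year of data through Thanos Query via the Prometheus HTTP API.
    import datetime as dt

    import requests

    THANOS_QUERY = "http://thanos-query.monitoring:9090"  # hypothetical address

    end = dt.datetime.now(dt.timezone.utc)
    start = end - dt.timedelta(days=365)

    resp = requests.get(
        f"{THANOS_QUERY}/api/v1/query_range",
        params={
            "query": "sum(rate(http_requests_total[5m]))",
            "start": start.timestamp(),
            "end": end.timestamp(),
            "step": "1h",
        },
        timeout=30,
    )
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        print(series["metric"], len(series["values"]), "samples")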

  19. 5. Logging Is a Challenge, but Worth Solving
    The most detailed telemetry class, at significant cost
    ● You need reliable, detailed logs with sufficiently long retention in a
    central location
    ● Logs are expensive to ingest and query. We use ELK + Hadoop, with a large
    Kafka cluster as the transport mechanism/event source
    ● Having a standard logging pattern with consistent information levels helps
    when you have 100s of services
    ● Pay attention to each node's log-shipper performance. Minimize on-node
    processing and parallelize the shipping process
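The deck does not show the logging pattern itself; as one way to keep the field set consistent across hundreds of services, here is a stdlib-only sketch of a shared JSON formatter. The field names (service, trace_id, and so on) are illustrative, not the ones OA Plus uses.

    # Sketch: one shared JSON log format so every service ships the same
    # structure into Kafka/ELK. Field names are examples only.
    import json
    import logging
    from datetime import datetime, timezone

    class JsonFormatter(logging.Formatter):
        def __init__(self, service: str):
            super().__init__()
            self.service = service

        def format(self, record: logging.LogRecord) -> str:
            return json.dumps({
                "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
                "service": self.service,
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "trace_id": getattr(record, "trace_id", None),  # passed via extra=
            })

    handler = logging.StreamHandler()                    # stdout, tailed by the
    handler.setFormatter(JsonFormatter("oa-plus-api"))   # node's log shipper
    logging.basicConfig(level=logging.INFO, handlers=[handler])

    logging.getLogger(__name__).info("follower sync finished", extra={"trace_id": "abc123"})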

  21. 6. Service Mesh Is Hard
    But it can be really helpful
    ● We have leveraged Istio from the start
    ● 2 main benefits:
      ● A smart proxy fronting every pod
      ● Decoupling RBAC from the application
    ● Treat an Istio upgrade as a major change

  22. Our Current Mesh

  23. 7. (Core)DNS Is the Heart of Many Things
    ● Istio can kill your DNS (if not tuned properly)
    ● You need DNS monitoring as a first-class citizen. DNS failures can be very
    hard to trace and can cause very severe outages
    ● We use CoreDNS with autopath and a node-local cache, with custom code to run
    multiple local cache instances for HA
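As a concrete (if simplified) version of treating DNS monitoring as a first-class citizen, a probe like the following can watch resolution latency and failures for a few critical names. The hostnames and interval are examples only, not OA Plus's real configuration.

    # Sketch: a DNS health probe for a few critical names (examples only).
    import socket
    import time

    CRITICAL_NAMES = ["kubernetes.default.svc.cluster.local", "api.line.me"]

    def probe(name: str) -> None:
        start = time.perf_counter()
        try:
            socket.getaddrinfo(name, 443)
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"dns_ok name={name} latency_ms={elapsed_ms:.1f}")
        except socket.gaierror as exc:
            print(f"dns_fail name={name} error={exc}")

    if __name__ == "__main__":
        while True:
            for name in CRITICAL_NAMES:
                probe(name)
            time.sleep(10)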

  25. 8. Kafka vs. RabbitMQ
    ● They are both message brokers but have lots of differences. We use both.
    ● Kafka is the heart of LINE messaging and has proven to scale very far. It is
    complex to operate and comes with some parallel-processing cost. (LINE recently
    open-sourced https://github.com/line/decaton, which might be useful)
    ● RabbitMQ is more mature and easier to start with, but beware of its scaling
    limitations. Its major benefit over Kafka is controllable retries
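The "controllable retries" point refers to RabbitMQ features such as dead-lettering and per-queue TTL. A hedged sketch of a delayed-retry loop using the pika client; the queue names and 30-second delay are assumptions, not the deck's configuration.

    # Sketch: delayed retries in RabbitMQ via a TTL'd retry queue that
    # dead-letters back into the work queue. Names/TTL are examples.
    import json

    import pika

    def process(task: dict) -> None:
        print("processing", task)  # stand-in for real business logic

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    ch = conn.channel()

    # Work queue: rejected messages are dead-lettered into the retry queue
    ch.queue_declare(
        queue="work",
        durable=True,
        arguments={"x-dead-letter-exchange": "", "x-dead-letter-routing-key": "work.retry"},
    )
    # Retry queue: messages wait 30s, then dead-letter back into "work"
    ch.queue_declare(
        queue="work.retry",
        durable=True,
        arguments={
            "x-message-ttl": 30000,
            "x-dead-letter-exchange": "",
            "x-dead-letter-routing-key": "work",
        },
    )

    def handle(channel, method, properties, body):
        try:
            process(json.loads(body))
            channel.basic_ack(delivery_tag=method.delivery_tag)
        except Exception:
            # reject without requeue -> goes to work.retry, comes back after the TTL
            channel.basic_reject(delivery_tag=method.delivery_tag, requeue=False)

    ch.basic_consume(queue="work", on_message_callback=handle)
    ch.start_consuming()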

  26. Some Kafka Tips
    ● Prepare for poison messages. Skipping an offset in Kafka is not
    straightforward without proper tooling/process
    ● Kafka offsets (stored in ZooKeeper) have an expiration. Infrequent
    consumers, beware!
    ● Messages published to Kafka might still be lost. Check the behavior of your
    publisher library
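Two of these tips can be made concrete with the kafka-python client: waiting for the broker's acknowledgement so a fire-and-forget publish does not silently drop messages, and moving past a poison message by committing its offset. The topic, group, broker address, and helper names are illustrative.

    # Sketch with kafka-python: (1) a producer that waits for acks, (2) a consumer
    # that commits past a poison message instead of getting stuck on it.
    from kafka import KafkaConsumer, KafkaProducer

    BOOTSTRAP = "kafka:9092"  # hypothetical broker address

    # -- Publishing: acks="all" plus checking the future surfaces lost messages
    producer = KafkaProducer(bootstrap_servers=BOOTSTRAP, acks="all", retries=5)
    future = producer.send("oa-events", b'{"type": "follow"}')
    future.get(timeout=10)  # raises a KafkaError if the broker never acknowledged

    # -- Consuming: handle poison messages explicitly, then commit to move on
    def handle(payload: bytes) -> None:
        print("handled", payload)  # stand-in for real processing

    def park_for_inspection(record) -> None:
        print("parked poison message at offset", record.offset)

    consumer = KafkaConsumer(
        "oa-events",
        bootstrap_servers=BOOTSTRAP,
        group_id="oa-plus-worker",
        enable_auto_commit=False,
    )
    for record in consumer:
        try:
            handle(record.value)
        except Exception:
            park_for_inspection(record)  # e.g. write it to a dead-letter topic
        consumer.commit()  # committing past the bad offset keeps the group moving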

  27. 9. Tracing Is Mandatory in a Distributed System
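The deck does not name the tracing stack OA Plus uses, so the following is only a generic OpenTelemetry sketch of how a request handler and its downstream call end up in one trace; the span and attribute names are made up.

    # Generic OpenTelemetry sketch (the deck does not name a tracing stack).
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("oa-plus.example")  # instrumentation name is illustrative

    def fetch_followers(oa_id: str) -> list:
        with tracer.start_as_current_span("db.fetch_followers") as span:
            span.set_attribute("oa.id", oa_id)
            return []  # stand-in for a real query

    def handle_request(oa_id: str) -> None:
        # The child span nests under this one, giving one end-to-end trace per request
        with tracer.start_as_current_span("http.get_followers"):
            fetch_followers(oa_id)

    handle_request("demo-oa")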

  28. 10. Auto-Scaling Might Not Be Feasible
    Expectation
    Reality

  29. 11. Prepare for Integration
    ● As the feature list grows, you will want the ability to add more
    people/teams
    ● Micro-services with a proper API gateway at the back
    ● Micro-frontends with a proper strategy at the front

  30. 12. Leverage CDN

  31. 13. Micro-Frontend
