
13 Lessons Learned from Building OA Plus Infrastructure

LINE Developers Thailand

September 13, 2020


Transcript

  1. 12:40 - 13:15
    Poomrat Boonyawong
    13 Lessons Learned From Building OA Plus Infrastructure
    Site Reliability Engineer Lead, LINE Thailand


  2. 13 Lessons Learned From Building OA Plus Infrastructure


  3. Agenda
    • OA Plus Introduction
    • Tech Stack
    • 13 Lessons Learned


  4. The (extensible) platform
    OA PLUS


  5. Thailand to Global


  6. ● Largest account: 40M+ followers
    ● 8 plugins
    ● Bot + Chat OK!
    ● (Almost) no low-usage time
    ● Thailand to Global
    ● 38K OAs, with a scalability target of 10X
    Fun Facts


  7. OA Plus Tech Stack


  8. 30,000 Feet View


  9. 13 LESSONS LEARNED


  10. ● We started with a target of (just) 30K OAs, but the system is still going strong without a major re-architecture
    ● Being stateless and horizontally scalable, all the way down to the datastore with no single point of contention, is the key
    ● The microservice pattern helps with the trade-off between redundancy and complexity
    Or you might regret a corner-cutting decision
    1. Design for Scale From Day 1


  11. ● From user behavior down to system-level interactions
    ● System telemetry - CPU, memory, network, storage, processes
    ● Component-level data - HTTP req/resp, error rate, database interaction, cache behavior (see the sketch below)
    ● Infrastructure telemetry - DB shards, queries, cache, queues, etc.
    ● Product-level telemetry - how users use/navigate the product, what happens on the client side
    ● And make sure it is all available when needed
    While respecting user privacy and the law
    2. Monitor Everything

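As a concrete illustration of component-level telemetry, here is a minimal sketch assuming a Python service instrumented with the prometheus_client library; the metric names, labels, port, and handler are illustrative, not OA Plus's actual instrumentation.

```python
# Minimal component-level telemetry sketch with prometheus_client.
import time
from prometheus_client import Counter, Histogram, start_http_server

HTTP_REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["method", "path", "status"],
)
DB_QUERY_SECONDS = Histogram(
    "db_query_duration_seconds",
    "Database query latency",
    ["query_name"],
)

def handle_request(method: str, path: str) -> None:
    # Time a (hypothetical) database call and record the request outcome.
    with DB_QUERY_SECONDS.labels(query_name="load_profile").time():
        time.sleep(0.01)  # stand-in for the real query
    HTTP_REQUESTS.labels(method=method, path=path, status="200").inc()

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    while True:
        handle_request("GET", "/profile")
        time.sleep(1)
```

Prometheus then scrapes the /metrics endpoint on each instance; the same counters and histograms feed both dashboards and alerting.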

  12. ● Users always use the product in very creative ways…
    ● For a sufficiently popular product, there will always be new client configurations (e.g. Chrome extension interference); see the collector sketch below
    Also Capture Client Errors

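One way to capture those client errors is a small collector endpoint that the browser reports into. This is a minimal sketch assuming Flask; the route name and payload fields are illustrative, and the browser side would POST its window.onerror / unhandledrejection data to it.

```python
# Minimal client-error collector sketch (Flask).
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/client-errors", methods=["POST"])
def collect_client_error():
    report = request.get_json(force=True, silent=True) or {}
    # Keep only fields we expect; real payloads vary wildly by browser/extension.
    event = {
        "message": report.get("message"),
        "url": report.get("url"),
        "stack": report.get("stack"),
        "user_agent": request.headers.get("User-Agent"),
    }
    app.logger.info("client_error %s", event)  # forward to the log pipeline in practice
    return jsonify({"ok": True}), 202

if __name__ == "__main__":
    app.run(port=8080)
```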

  13. ● For a large-scale platform, an RDBMS can be the limiting factor
    ● MySQL, MongoDB, HBase, Redis, Druid, Elasticsearch, and more
    ● There's no one-size-fits-all DB. Choose wisely
    3. Use the RIGHT Datastore for the Job


  14. ● Prometheus is inherently not horizontally scalable. As the application grows, the measurements grow at an even faster rate
    ● We use thanos.io with S3-compatible object storage for scaling out (see the query sketch below)
    ● Query caching also helps the infrastructure scale much further. Have a look at trickster on github.com
    4. Scaling Prometheus is Worth It
    All metrics at the longest possible retention

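Once Thanos fronts the Prometheus instances, long-retention queries go through the Thanos Querier, which speaks the standard Prometheus HTTP API. A minimal sketch, with an assumed querier URL and an illustrative PromQL expression:

```python
# Query long-retention metrics through a Thanos Querier (Prometheus HTTP API).
import time
import requests

THANOS_QUERY_URL = "http://thanos-query.example.internal:9090"  # hypothetical endpoint

def query_range(promql: str, days: int = 90, step: str = "1h") -> list:
    end = time.time()
    start = end - days * 24 * 3600
    resp = requests.get(
        f"{THANOS_QUERY_URL}/api/v1/query_range",
        params={"query": promql, "start": start, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    # 90 days would exceed a single Prometheus's local retention;
    # Thanos serves older blocks from object storage transparently.
    series = query_range("sum(rate(http_requests_total[5m]))")
    print(len(series), "series returned")
```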

  15. ● You need reliable, detailed logs, kept for a sufficiently long period, in a central location
    ● Logs are expensive to ingest and query. We use ELK + Hadoop with a large Kafka cluster as the transport mechanism/event source
    ● Having a standard logging pattern with consistent information and levels helps when you have 100s of services (see the sketch below)
    ● Pay attention to each node's log-shipper performance. Minimize on-node processing and parallelize the shipping process.
    The most detailed telemetry class, at a significant cost
    5. Logging is a Challenge but Worth Solving

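A minimal sketch of such a standardized logging pattern, assuming Python's stdlib logging with a JSON formatter; the field names and service name are illustrative. In a setup like the one above, the application only writes these lines to stdout and a node-local shipper forwards them to Kafka.

```python
# Standardized JSON log lines with consistent fields, parseable by ELK.
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    SERVICE = "oa-plus-example"  # hypothetical service name

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": round(time.time(), 3),
            "level": record.levelname,
            "service": self.SERVICE,
            "logger": record.name,
            "message": record.getMessage(),
            # Per-event context travels in extra={"ctx": {...}}.
            "ctx": getattr(record, "ctx", {}),
        })

handler = logging.StreamHandler(sys.stdout)  # the node-local shipper tails stdout
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order accepted", extra={"ctx": {"order_id": "o-123", "oa_id": "oa-42"}})
```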

  16. ● We have leveraged Istio from the start
    ● 2 main benefits:
    ● A smart proxy fronting every pod
    ● Decoupling RBAC from the application
    ● Treat an Istio upgrade as a major change.
    But it can be really helpful
    6. Service Mesh is Hard


  17. Our Current Mesh


  18. ● Istio can kill your DNS (if not tuned properly)
    ● You need DNS monitoring as a first-class citizen. DNS failures can be very hard to trace and can cause very severe outages (see the probe sketch below)
    ● We use CoreDNS with autopath and a node-local cache, plus custom code to run multiple local cache instances for HA
    7. (Core)DNS is the Heart of Many Things

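A minimal sketch of a DNS probe, assuming only the standard library; the probed names and latency threshold are illustrative. A real probe would export these results as metrics and alert on them rather than print.

```python
# Simple DNS latency/failure probe using stdlib resolution.
import socket
import time

PROBE_NAMES = ["kubernetes.default.svc.cluster.local", "example.com"]  # hypothetical targets

def probe(name):
    """Return resolution latency in seconds, or None if resolution failed."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(name, None)
    except socket.gaierror:
        return None
    return time.monotonic() - start

if __name__ == "__main__":
    while True:
        for name in PROBE_NAMES:
            latency = probe(name)
            if latency is None:
                print(f"DNS FAIL {name}")
            elif latency > 0.1:
                print(f"DNS SLOW {name} {latency:.3f}s")
        time.sleep(10)
```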

  19. ● They are both message brokers but have lots of differences. We use both.
    ● Kafka is the heart of LINE messaging and has proven to scale very far. It is complex to operate and comes with some parallel-processing cost. (LINE recently open-sourced https://github.com/line/decaton, which might be useful)
    ● RabbitMQ is more mature and easier to start with, but beware of its scaling limitations. Its major benefit over Kafka is controllable retry.
    8. Kafka vs. RabbitMQ


  20. ● Prepare for poison messages. Skipping an offset in Kafka is not straightforward without proper tooling/process (see the sketch below)
    ● Kafka offsets (stored in ZooKeeper) have an expiration. Infrequent consumers, beware!
    ● A message published to Kafka might still be lost. Check the behavior of your publisher library.
    Some Kafka Tips

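A minimal sketch of two of these tips, assuming the kafka-python client; the broker address, topic names, group id, and handle() logic are illustrative. It waits for broker acknowledgement on publish instead of fire-and-forget, and parks poison messages on a dead-letter topic before committing past them.

```python
# Poison-message skipping and acknowledged publishing with kafka-python.
import json
from kafka import KafkaConsumer, KafkaProducer
from kafka.errors import KafkaError

BROKERS = "kafka.example.internal:9092"  # hypothetical broker address
producer = KafkaProducer(bootstrap_servers=BROKERS, acks="all", retries=5)

def handle(event):
    """Stand-in for the real business logic; may raise on malformed events."""
    if "oa_id" not in event:
        raise ValueError("malformed event")

def publish(topic, event):
    # Block until the broker acknowledges the write, so lost publishes surface as errors.
    future = producer.send(topic, json.dumps(event).encode())
    try:
        future.get(timeout=10)
    except KafkaError as exc:
        print("publish failed:", exc)  # retry / alert in a real service

def consume(topic):
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=BROKERS,
        group_id="oa-plus-example",
        enable_auto_commit=False,
    )
    for msg in consumer:
        try:
            handle(json.loads(msg.value))
        except Exception:
            # Poison message: park it on a dead-letter topic instead of blocking the partition.
            publish(topic + ".dlq",
                    {"offset": msg.offset, "raw": msg.value.decode(errors="replace")})
        consumer.commit()  # committing either way moves past the poison message
```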

  21. 9. Tracing is Mandatory in a Distributed System

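The slide carries only the headline, so as an illustration: a minimal tracing sketch assuming OpenTelemetry's Python SDK with a console exporter; span and attribute names are illustrative, and a real deployment would export to the tracing backend and propagate context across services.

```python
# Minimal distributed-tracing instrumentation with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("oa-plus-example")  # hypothetical service name

def send_broadcast(oa_id: str) -> None:
    # Parent span for the whole request; child spans mark each downstream hop,
    # so one trace shows where time goes across services.
    with tracer.start_as_current_span("send_broadcast") as span:
        span.set_attribute("oa.id", oa_id)
        with tracer.start_as_current_span("load_audience"):
            pass  # stand-in for a database call
        with tracer.start_as_current_span("enqueue_messages"):
            pass  # stand-in for a Kafka publish

if __name__ == "__main__":
    send_broadcast("oa-42")
```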

  22. 10. Auto-Scaling Might Not Be Feasible
    (Charts: expectation vs. reality)


  23. ● As the feature list grows, you will want the ability to add more people/teams
    ● Microservices with a proper API gateway at the back
    ● Micro-frontends with a proper strategy at the front
    11. Prepare for Integration


  24. 12. Leverage CDN


  25. 13. Micro-Frontend
