Initiatives of ABYSS, a Vertical Search Platform

Yasushi Takeda (Yahoo! JAPAN / Platform Development Division, Search Services Group, Search Group / Engineer)

https://tech-verse.me/ja/sessions/181
https://tech-verse.me/en/sessions/181
https://tech-verse.me/ko/sessions/181

Tech-Verse2022

November 17, 2022

Transcript

  1. Agenda

    - Self introduction
    - Search technology in Yahoo! JAPAN
    - What is ABYSS?
    - Use case 1: Timeline of Yahoo! JAPAN App
    - Use case 2: Yahoo! JAPAN Knowledge Search
    - Conclusion
  2. Self introduction

    What do I do? I work as an ABYSS engineer. What do I like? I like fast things, especially cars. One of my favorite hobbies is Mini 4WD, so my internal Slack icon is a Mini 4WD that I built myself.
  3. “abyss”

    Abyss comes from Greek: a- "without" + byssos, "depth, bottom." You may know the related adjective abysmal, which means "appallingly bad" — or "way down in the depths," as it were. (from vocabulary.com)
  4. History of ABYSS

    • 2002: In-house search application released.
    • 2010: ABYSS 1.0 released, using the in-house search application.
    • 2014: ABYSS 2.0 released, using Apache Solr.
  5. Before the birth of ABYSS

    • The in-house search application is packaged and published.
    • Service personnel install and maintain the application on real servers.
    • Knowledge is collected separately within each service.
    • However, there is no guarantee that this knowledge works well in another service, because the application execution environments differ from one another.
  6. ABYSS is born as a common search platform.

    • Knowledge is collected in one place.
    • This knowledge works well in other services because the application execution environments are the same.
    • Platform personnel focus on reducing operating costs and incorporating modern technologies.
    • Service personnel focus on improving their service.
    ABYSS provides a foundation for carrying successful cases over from one service to another. Roles are clarified.
    (Diagram labels: Service, Platform)
  7. What is Solr?

    • An open source full-text search application.
    • Written in Java.
    • Solr's functionality can be extended by implementing its interfaces.
  8. Reduce operating costs by virtualization technology.

    • Use virtual machines (VMs) provided by the in-house cloud vendor.
    • Use container technology by building our own Solr container image.
    (Diagram labels: VM layer, Container layer)
  9. Reduce operating costs by virtualization technology.

    • Save fed data on the VM layer, outside the container layer.
    • ABYSS operators can recreate containers without worrying about data loss.
    ※ We call putting data into ABYSS "feeding".
    (Diagram labels: VM layer, Container layer, Data)
  10. Reduce operating costs by redundancy.

    • A VM can lose its data due to a physical-layer failure.
    • In ABYSS, each shard has at least 3 replicas for redundancy.
    • Thanks to Solr's replication, a shard does not lose data even if one VM loses its data.
    (Diagram labels: Physical layer, VM layer, Container layer, Data)
  11. Reduce operating costs by removing single points of failure (SPOF).

    • It is not safe if several VMs in one shard run on the same physical layer.
    • ABYSS has a checker component that checks whether shards have a single point of failure (SPOF) on the physical layer.
    • If a SPOF is found, the ABYSS operator removes it by swapping VMs.
    (Diagram labels: checker, check SPOF, Physical layer, three VMs each holding Data)
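
The check described on this slide can be sketched roughly as follows. This is a minimal illustration, not the ABYSS checker itself: the function name `find_spofs`, the shard, VM, and host names, and the rule "flag any physical host that carries two or more replicas of the same shard" are all assumptions made for the example.

```python
from collections import defaultdict


def find_spofs(replica_hosts):
    """Return {shard: {host: [vm, ...]}} for every physical host that
    carries two or more replica VMs of the same shard (hypothetical rule)."""
    spofs = {}
    for shard, vm_to_host in replica_hosts.items():
        # Group this shard's replica VMs by the physical host they run on.
        by_host = defaultdict(list)
        for vm, host in vm_to_host.items():
            by_host[host].append(vm)
        # A host shared by several replicas is a candidate SPOF.
        shared = {h: sorted(vms) for h, vms in by_host.items() if len(vms) > 1}
        if shared:
            spofs[shard] = shared
    return spofs


# Hypothetical placement: shard -> {VM name: physical host name}.
placement = {
    "shard1": {"vm-a": "hv-01", "vm-b": "hv-02", "vm-c": "hv-03"},
    "shard2": {"vm-d": "hv-01", "vm-e": "hv-01", "vm-f": "hv-04"},
}

print(find_spofs(placement))
```

In this made-up placement only `shard2` is reported, because two of its replicas sit on the same host; an operator would then swap one of those VMs for a VM on another host.
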
  12. Diagram of ABYSS

    (Diagram. Components inside ABYSS: internal API, Web UI, ZooKeeper, routing server. Service personnel set configurations and see logs. Users search via service servers.)
  13. Diagram of ABYSS

    (Diagram. Components inside ABYSS: internal API, Web UI, ZooKeeper, routing server. Actors: User, Service personnel. Highlighted flow: service personnel set configurations.)
  14. Diagram of ABYSS

    (Diagram. Components inside ABYSS: internal API, Web UI, ZooKeeper, routing server, repairer, checker. External: Cloud vendor API. Actors: User, Service personnel. Arrow labels: restart VM, check SPOF, check failure and recovery information.)
  15. Summary

    ABYSS operators can rest assured even if a server fails.
    • Virtualization
    • Remove SPOF
    • Auto healing
  16. Common feature request

    If we could realize the search described in this paper, we would deliver a better search experience to users.
  17. Please wait until the function is implemented in the application, a version that includes it is released, and our operation check is completed.
  18. Development relationship

    Around ABYSS, the platform team and the science team work together to realize service requests.
    (Diagram labels: Service, Science, Platform)
  19. Development relationship

    The science team implements Solr extensions for service requests. The platform team incorporates these extensions into ABYSS, so that the service team can use requested functions before Solr natively supports them.
    (Diagram labels: Service, Science, Platform)
  20. In case of Approximate Nearest Neighbor search (ANN) Plugin

    What is ANN? ANN is a technology to "guess" the document vectors nearest to a query vector. ANN drastically reduces search latency at the cost of only a little accuracy.
    (Figure: Voronoi diagram)
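
To make the latency/accuracy trade-off concrete, here is a small sketch of one common ANN idea: partition the vectors into Voronoi cells around a few centroids and search only the cells closest to the query. It is an assumption that this resembles the plugin at all; the slide shows a Voronoi diagram but does not name the algorithm. The centroids, document vectors, and the `n_probe` parameter are made up for the example.

```python
import math


def build_cells(docs, centroids):
    """Assign each document to the Voronoi cell of its nearest centroid."""
    cells = {i: [] for i in range(len(centroids))}
    for doc_id, vec in docs.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(vec, centroids[i]))
        cells[nearest].append(doc_id)
    return cells


def exact_search(docs, query, k):
    """Brute force: compare the query against every document vector."""
    return sorted(docs, key=lambda d: math.dist(docs[d], query))[:k]


def ann_search(docs, centroids, cells, query, k, n_probe=1):
    """Approximate: only look inside the n_probe cells nearest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda i: math.dist(query, centroids[i]))[:n_probe]
    candidates = [d for i in probe for d in cells[i]]
    return sorted(candidates, key=lambda d: math.dist(docs[d], query))[:k]


# Hypothetical 2-D data; real document vectors have far more dimensions.
docs = {"d1": (1.0, 1.0), "d2": (2.0, 3.0), "d3": (4.0, 5.0),
        "d4": (6.0, 5.0), "d5": (9.0, 9.0)}
centroids = [(0.0, 0.0), (10.0, 10.0)]
cells = build_cells(docs, centroids)
query = (5.2, 5.0)

print(exact_search(docs, query, k=2))
print(ann_search(docs, centroids, cells, query, k=2, n_probe=1))
print(ann_search(docs, centroids, cells, query, k=2, n_probe=2))
```

With `n_probe=1` the search inspects only two of the five vectors and misses `d3`, which lies just across the cell boundary; raising `n_probe` recovers it at the cost of more distance computations.
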
  21. In case of Approximate Nearest Neighbor search (ANN) Plugin

    Background: In the fields of image search and natural language processing (NLP), vector search is attracting attention.
    Timeline (2020 / 2021 / 2022):
    • May 2022: ANN is natively supported in Solr 9.0.0.
  22. In case of Approximate Nearest Neighbor search (ANN) Plugin

    Timeline (2020 / 2021 / 2022):
    • 2020: The science team starts developing the ANN plugin.
    • November 2020: A prototype of the ANN plugin is delivered to ABYSS.
    • June 2021: The first service to use the ANN plugin consults ABYSS.
    • October 2021: The service switches to using the ANN plugin.
  23. Popular in-house extensions

    • Japanese morphological analysis tokenizers
    • WebMA Tokenizer
    • Asagi Tokenizer
    • Two Phase Ranker (TPR) Plugin
    • multi-phase ranking
    • dense vector search
    • Approximate Nearest Neighbor search (ANN) Plugin
    • Dedupe Plugin
  24. Functional Requirement 2

    We want to recommend articles that have just been created. Batch processing every 5 or 10 minutes does not meet this requirement.
  25. System Requirement 1

    A huge amount of requests comes in.
    • Normal daytime: thousands of requests / sec
    • Daily peak time: thousands of requests / sec * 2
    • Push notification triggered: thousands of requests / sec * n
  26. System Requirement 3

    We want to recommend articles with high accuracy by using machine learning. However, there is a trade-off between accuracy and response speed.
  27. Run machine learning outside of ABYSS. A simple vector search runs in ABYSS.

    (Diagram labels: ABYSS, Article, User)
    • Generate and update the user vector.
    • Generate the article vector.
    • Feed articles with their article vectors.
    • Search articles with the user vector.
    More details: https://www.slideshare.net/techblogyahoo/yjtc-yjtc21-a1-241223218
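
As a rough illustration of the flow on this slide, the sketch below scores pre-computed article vectors against a user vector and returns the best matches. Using the dot product as the similarity measure is an assumption; the slide only says "a simple vector search". The vectors and article IDs are invented, and in the described setup they would be produced by machine learning outside of ABYSS.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def search_articles(user_vector, article_vectors, top_k=2):
    """Rank fed articles by similarity to the user vector (hypothetical scoring)."""
    scored = [(dot(user_vector, vec), article_id)
              for article_id, vec in article_vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [article_id for _, article_id in scored[:top_k]]


# "Feeding": articles are stored together with their article vectors.
article_vectors = {
    "a1": (1.0, 0.0),
    "a2": (0.0, 1.0),
    "a3": (0.7, 0.7),
}

# The user vector is generated and updated outside the search platform.
user_vector = (0.8, 0.2)

print(search_articles(user_vector, article_vectors))
```

Keeping the model work outside the search platform means the query-time cost is only this kind of arithmetic, which is consistent with the deck's later remark that these clusters can run on middle spec VMs.
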
  28. We provide many clusters with many replicas

    There are too many requests for one cluster to handle. To distribute them, we use many clusters holding the same data. We build the clusters with middle spec VMs because only a simple vector search runs on them.
    VM spec tiers: High-end spec VM, High spec VM, Middle spec VM (← We provide!), Low spec VM
  29. Functional Requirement

    Questions that have good answers are viewed many times. We want to lead users to such questions because they visit Yahoo! JAPAN Knowledge Search to solve their problems.
  30. Functional Requirement

    We want to score how good each question is. We use machine learning for that. It's tough to score them using natural language processing.
  31. System Requirement 1

    Huge amounts of questions and answers. Questions and answers have been collected for over 18 years, since 2004.
    • 250,000,000+ questions
    • 600,000,000+ answers
    • 40,000+ questions are added every day.
  32. System Requirement 2

    Response performance
    • Throughput: thousands of requests / sec (normal daytime)
    • Latency: tens of milliseconds (average), hundreds of milliseconds (99th percentile)
  33. Use Two Phase Ranker (TPR) plugin.

    • Generate a search result in 2 phases.
    • 1st Phase: Score questions using simple morphological analysis. Pass the top N (determined by a tuning parameter) in each shard to the 2nd Phase.
    • 2nd Phase: Score N * shards questions using machine learning. Rerank questions by these scores and return the top 10.
    This is faster than using machine learning to score all questions that hit the search query.
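
The two phases described above can be sketched as follows. This is only an illustration under stated assumptions, not the TPR plugin's implementation: `cheap_score` (a term-overlap count) stands in for the 1st-phase scoring, `ml_score` (a stored `quality` number) stands in for the machine-learning model, and the shards and questions are invented.

```python
def two_phase_search(shards, query_terms, cheap_score, ml_score, top_n, final_k=10):
    """1st phase: cheap scoring per shard, keep top_n hits per shard.
    2nd phase: expensive scoring on at most top_n * len(shards) candidates."""
    candidates = []
    for shard in shards:
        # Keep only documents that hit the query at all.
        hits = [doc for doc in shard if cheap_score(doc, query_terms) > 0]
        hits.sort(key=lambda d: cheap_score(d, query_terms), reverse=True)
        candidates.extend(hits[:top_n])
    # Rerank the reduced candidate set with the expensive scorer.
    candidates.sort(key=ml_score, reverse=True)
    return candidates[:final_k]


def cheap_score(doc, query_terms):
    """Hypothetical 1st-phase score: number of tokens matching the query."""
    return sum(1 for token in doc["text"].split() if token in query_terms)


ml_calls = []  # records which documents reached the expensive scorer


def ml_score(doc):
    """Stand-in for a machine-learning model; just reads a stored number."""
    ml_calls.append(doc["id"])
    return doc["quality"]


shards = [
    [{"id": "q1", "text": "solr search slow", "quality": 0.2},
     {"id": "q2", "text": "solr search tuning search", "quality": 0.9},
     {"id": "q3", "text": "cooking rice", "quality": 0.99}],
    [{"id": "q4", "text": "search engine basics", "quality": 0.5},
     {"id": "q5", "text": "solr solr search", "quality": 0.1}],
]

result = two_phase_search(shards, {"solr", "search"}, cheap_score, ml_score, top_n=1)
print([doc["id"] for doc in result], "ML-scored:", len(ml_calls))
```

Four questions hit the query here, but with `top_n=1` only two of them are passed to the expensive scorer, which is where the speed-up over scoring every hit comes from.
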
  34. We provide large-scale clusters built with high-end spec VMs

    We use high-end spec VMs that have a large amount of memory and high-performance vCPUs. Because Solr is a full-text search application, keeping all text in memory greatly contributes to its performance. Because we use machine learning, we need high-performance vCPUs.
    VM spec tiers: High-end spec VM (← We provide!), High spec VM, Middle spec VM, Low spec VM
  35. We introduced 2 use cases: the timeline of Yahoo! JAPAN App and Yahoo! JAPAN Knowledge Search.